#16 August 14, 2018

Descartes Labs, with Tim Kelton

Hosts: Craig Box, Adam Glick

Tim Kelton is co-founder and cloud architect for Descartes Labs. Prior to starting Descartes Labs, he was an R&D engineer for 15 years at Los Alamos National Laboratory, working on problem areas such as deep learning, space systems, nuclear non-proliferation, and counterterrorism. Tim talks to Craig and Adam about the use of Kubernetes and Istio in geopolitics, machine learning, and food supply.

Do you have something cool to share? Some questions? Let us know:

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: Adam, have you checked out our website lately?

ADAM GLICK: I have. I noticed that something's been added. You've been hard at work, it looks like.

CRAIG BOX: Well, someone's been hard at work. I admit that it's not me listening to every episode and writing down all the lovely words. But our friends at 3Play Media are transcribing each episode, and we're now posting the transcripts. They'll be up a couple of days after each post. Give them time to go through and listen to all the complicated technical terms we use-- purple monkey dishwasher.

ADAM GLICK: [LAUGHING]

CRAIG BOX: But it's fantastic that we have another way for people to consume some of this great content. We were pleased with the response we got when we posted the interview, with Josh Berkus and Tim Pepper, about the release of 1.11 on the Kubernetes blog, so we thought, why not do this for every episode?

ADAM GLICK: It's great to see us get it out there and making this available to however people want to consume the podcasts. Shall we get to the news?

[MUSIC PLAYING]

CRAIG BOX: Last week, at the Prometheus Conference, PromCon, the CNCF announced that the open source monitoring software, popular for its close integration with Kubernetes, has graduated. Prometheus is the second CNCF project, after Kubernetes, to graduate, meaning there is thriving adoption, a documented, structured governance process, and a strong commitment to community sustainability and inclusivity.

ADAM GLICK: OpenMetrics has been added to the CNCF at the sandbox level. OpenMetrics is a format for describing application metrics, which grew out of the formats used inside Google and the Prometheus Project. Work is underway to use this format in OpenCensus, which is a set of uniform stats and tracing libraries that has the ability to be consumed by multiple vendors' monitoring software across a broad set of languages. The project was initiated by Google and has contributions and reviews from companies including AppOptics, Cortex, Datadog, InfluxData, Sysdig, and Uber.

CRAIG BOX: The Kubic Project, the Kubernetes distribution and platform from openSUSE, announced some changes in direction this week, the most prominent of which is the adoption of kubeadm for installation. This work will help the openSUSE containers-as-a-service platform eventually rebase on top of Kubic as an open source upstream.

ADAM GLICK: This week, Javier Salmeron posted an article about understanding role-based access control. He talked about the history of access controls in Kubernetes and how RBAC came in with 1.6 as a better way to manage permissions. Salmeron explains the basics of RBAC, which gives you the ability to specify what operations, say GET or DELETE, can be performed on what resources, like pods or services, and by whom, such as users, a group, or a service account. He concludes that everyone should have a service account per deployment and a minimum set of privileges to work in production, and that RBAC policies are essential for production cluster management.
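The verbs-on-resources-by-subjects model Salmeron describes maps directly onto Kubernetes manifests. A minimal sketch of his per-deployment recommendation, with all names illustrative rather than taken from the article:

```yaml
# A Role granting only "get" and "list" on pods, bound to a
# per-deployment service account. Namespace-scoped, least privilege.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]          # "" is the core API group (pods, services, ...)
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: ServiceAccount
  name: my-app             # the deployment's dedicated service account
  namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Applying both objects gives the `my-app` service account read-only access to pods in the `production` namespace, and nothing else.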

CRAIG BOX: The Kubernetes API Machinery SIG announced Kubebuilder 1.0 this week. Kubebuilder is an SDK for rapidly building and publishing Kubernetes-style APIs in Go. Based on the techniques used by the core Kubernetes APIs, Kubebuilder helps you build your own API, which when used to manage third-party software, is sometimes called an operator.

ADAM GLICK: Operators come up a lot these days and are often described in ways that can be confusing to those of us that aren't using them widely. Do you want to give a quick "operators for normal people" definition?

CRAIG BOX: Sure. Think about a Kubernetes deployment, for example. There are data objects, which are the YAML files you upload, and controller code, which runs on the master. The deployment controller knows how to make the changes to get to your desired state safely.

You might want to update a deployment from version 1 to version 2, and the controller knows it needs to do a rolling update to the containers in that deployment to get there. When we apply that pattern to a third-party piece of software, say MySQL, we need to teach the API to do MySQL-related things, like maybe back up the database or add a read replica. You can use a Custom Resource Definition, or CRD, as a way to store your own state of what you'd like your system to be, and then write an API server which knows how to act on the resources it needs to create or modify based on that state.

Kubebuilder provides a way to manage these CRDs, helps implement the reconciliation loops that watch those objects inside your controller code, and provides test and build frameworks. You can think of it as conceptually similar to tools like Ruby on Rails or Spring Boot. Kubebuilder is used by several projects, including Knative, the Application CRD, the Kubernetes Cluster Registry, and the Spark Operator.
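The CRD half of the pattern Craig describes can be as small as this, assuming a hypothetical MySQLBackup resource and using the apiextensions v1beta1 API that was current at the time of this episode:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # Name must be <plural>.<group>.
  name: mysqlbackups.example.com
spec:
  group: example.com
  version: v1alpha1
  scope: Namespaced
  names:
    kind: MySQLBackup
    singular: mysqlbackup
    plural: mysqlbackups
```

Once this is registered, the API server will happily store MySQLBackup objects; the controller code, which Kubebuilder scaffolds for you, is what actually watches those objects and performs the backups.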

ADAM GLICK: Speaking of operators, when you deploy a lot of applications, you end up with a lot of controllers to control. In a blog post, Jimmy Zelinskie from CoreOS, now with Red Hat, talks about using operators to manage operators, and what they learned from the experience of trying to automate, as much as possible, in the CoreOS Tectonic platform.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Tim Kelton is co-founder of Descartes Labs, based in Santa Fe, New Mexico. Welcome, Tim.

TIM KELTON: Thanks. Thanks for having me.

CRAIG BOX: Tell us about the platform that you've built at Descartes.

TIM KELTON: Yeah, so Descartes is focused on seeing changes happening in the Earth by remotely sensing it, so that's fusing aerial and satellite imagery and all types of geospatial data. And we do that to try and build machine learning models to quantify how those changes might impact our customers, their supply chains, maybe the commodities or inputs they use in their businesses. So we've built a platform to be able to build large geographic models, over large windows of time, and then apply those models to that data.

CRAIG BOX: What might be a use case for one of those models?

TIM KELTON: We've done a lot in terms of agriculture, looking at farms and the vegetative health of commodity crops. We've also done things with building starts, with construction, and supply chains, and things like that.

CRAIG BOX: I remember reading an article that referred to your work in the context of the Arab Spring.

TIM KELTON: Yeah, so we have a really cool project we've done with DARPA, and that was around food shortage with Arab Spring and being able to quantify and see early indicators of famine and food shortage. And so that's been really exciting to be able to help in things like that.

ADAM GLICK: How have you been able to use that data in order to help other organizations that might be looking at how do they avoid civil unrest based upon food shortages?

TIM KELTON: Yeah, so the goal of that project is to be able to have really early indicators of when a food shortage is starting to happen in places like Northern Africa. So we'd be able to take those models that we originally built, for the United States, to see how much food was being grown in the US.

And then being able to apply that on different crops, besides just corn or soybeans and things like that in the US, and apply those in different geographies and try and see if there are early indicators that we can catch before they would have been caught normally. Especially in countries in Africa, you might not have a US Department of Agriculture that's going out and manually surveying farmers in every single field and crop and things like that. So it's a lot more useful there.

ADAM GLICK: What is the impact of the model that you built? I imagine that there'll be multiple people that are interested-- NGOs, organizations, companies, maybe even financial markets. What happens?

TIM KELTON: Yeah, we have some customers in financial. One of our biggest customers is an agricultural company called Cargill. And so we do modeling for them on how much food is being produced in various regions around the world. And that's a takeoff of our first models, that we were building as just a small company, looking at the entire US and quantifying, over the whole US and all the fields all over the US, how much food was being produced.

And what we're doing is looking at things like the non-visible light coming through the plants and how that relates to the photosynthesis in the plants. And then we'll take that model, and we'll test it back, historically, in time. We'll want to say things like, how did this model do in the drought of 2009, or the flooding of last year, or maybe when there was a big hailstorm, and things like that? So we're better able to detect how big weather events might be inputs into the production of food. And then we can do things like that in places like Africa, as well.

CRAIG BOX: How and when did you come to Kubernetes?

TIM KELTON: We were very early users. The company was founded in late 2014, and that was right when Kubernetes was starting to become an open source project. And then the first managed Kubernetes clusters were coming live in early 2015, so we were very active in starting to build our really early prototype applications on top of Kubernetes.

CRAIG BOX: What was the process of evaluating technologies that led you to it?

TIM KELTON: Well, in past jobs and different opportunities, I've used tools like Mesos, and I've used a lot of virtualization stacks-- OpenStack and some of those. So I had some experience with that. And actually, if you really go back, I've used things like Solaris zones and things like that, so--

CRAIG BOX: Beowulf clusters?

TIM KELTON: Yeah, I've supported and had a few sysadmin jobs doing Beowulf clusters, as well, so there's kind of a long history. But also, way before Descartes, I remember reading the Borg papers and things like that when I was at Los Alamos, and I was always quite interested in that. And then when it was open sourced, it was like, wow. This is actually all those concepts in the open.

CRAIG BOX: And after starting that implementation and running software on Kubernetes, when was it that Descartes knew they'd made the right choice?

TIM KELTON: [CHUCKLING] Well, we had an early, I would call it a prototype project. And at one point, I think I was actually mountain biking out in Sedona, and somebody accidentally deleted a lot of the VMs there. And all of our microservices, on that Kubernetes cluster, basically restarted and were all running again, despite all of the machines initially being deleted.

And that was pretty compelling from the DR perspective. But then, also, just from the microservices perspective, as well, that gives you a lot of agility to be able to break down the software development cycle into smaller units of work. And that's very powerful for velocity and bringing new teams on, and for the agility that it gives. It brings challenges, as well, from a management perspective.

ADAM GLICK: What kind of microservices do you run on Kubernetes? And what things do you decide not to put in Kubernetes?

TIM KELTON: We're running more and more of our workloads on Kubernetes. I would say all of our core APIs are, as are our processing pipelines, where we get the raw pixels and raw satellite imagery. A few of those are still just preemptible node pools, and we really leverage preemptible machines for asynchronous tasks.

So with that, we break out units of work really small and split them apart on lots and lots of machines. But we're even starting to move some of those more batch workloads into Kubernetes. Early Kubernetes maybe didn't have Cron-- I was asking for Cron in early 2015, because I had used Chronos on Mesos, so I was always hoping that we would get it. I was actually a really big fan of Jobs, run-to-completion Jobs, and Cron and stuff like that.

So we're starting to do things like that now. But then the rest of our core APIs are all microservices and all on Kubernetes. So I think we have, on any given day-- plus or minus a few, because some are in alpha-- 40 or 50 APIs that sit there. And those are all managed on Kubernetes. Some of those will easily scale to 10,000 or 15,000 pods behind those APIs, so they're quite sizable.
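The run-to-completion Jobs and Cron scheduling Tim was hoping for did land in Kubernetes as the CronJob resource (batch/v1beta1 as of this episode). A sketch of a hypothetical nightly imagery-processing job, with illustrative names:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-ingest
spec:
  # Standard cron syntax: run at 02:00 every day.
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: ingest
            image: example/ingest:latest   # illustrative image name
          # Run to completion; restart the pod only on failure.
          restartPolicy: OnFailure
```

Each firing of the schedule creates a regular Job, which in turn runs pods until they succeed.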

CRAIG BOX: And you offer this as a SaaS platform to your own customers?

TIM KELTON: Yeah, correct. We have two models right now. If you have expertise in machine learning and remotely sensing the Earth, you can use our Python client and just start building your own models. We do multi-tenancy inside of GKE, and we use things like namespaces to isolate those workloads. So if you have that expertise, you can build the models yourself, and otherwise we'll build and execute them for you.
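Namespace-per-tenant isolation of the kind Tim describes is typically paired with a ResourceQuota, so one tenant's jobs can't starve another's. A sketch, with illustrative names and limits rather than Descartes Labs' actual configuration:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
---
# Cap what workloads in this tenant's namespace can request in total.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    pods: "100"
    requests.cpu: "50"
    requests.memory: 200Gi
```

The scheduler then rejects any pod in `tenant-a` that would push the namespace past those aggregate limits.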

CRAIG BOX: How has it been to provide that multi-tenancy within purely a single Kubernetes cluster?

TIM KELTON: So far, namespace isolation has been a really powerful concept for us. And then there's the RBAC work that's gone on in the last six months. It has been very useful for isolating which services can interact with each other and breaking apart those types of workloads.

CRAIG BOX: Do you have a need for stronger isolation?

TIM KELTON: To some extent, that's where we start using things like Istio: to be able to do service-to-service communication, to control which services are allowed to talk to each other, and to gain visibility into those core services. Maybe I should back up.

On our SaaS platform, the way it's architected, as an end-user, you would only be able to interact through our Python client and talk to our APIs. You're not talking at all to the Kubernetes APIs or the Kubernetes control plane. It's only our core internal APIs that are talking and scaling up and scaling down jobs and things like that. So that increases the need to have better visibility and understanding into those core APIs and how they interact with the Kubernetes control plane.

ADAM GLICK: You were one of the earliest case studies for Istio.

TIM KELTON: Mm-hmm.

ADAM GLICK: What made you decide to use Istio, and how are you using it within your organization?

TIM KELTON: With Istio, one of the biggest things we were trying to accomplish, for some of our APIs that serve back imagery, was being able to have better control over the service: how it does things like retry requests, maybe how it returns 500 responses, being able to have alternatives to just round-robin routing. That gives you a lot of different options there that you can experiment with.
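The retry and routing controls Tim mentions live in an Istio VirtualService. A sketch for a hypothetical imagery API, using the v1alpha3 networking API from the Istio 1.0 era: three retry attempts, and a 90/10 weighted split as an alternative to plain round-robin:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: imagery-api
spec:
  hosts:
  - imagery-api
  http:
  - route:
    # Weighted routing instead of plain round-robin across versions.
    - destination:
        host: imagery-api
        subset: v1
      weight: 90
    - destination:
        host: imagery-api
        subset: v2
      weight: 10
    # Retry failed requests up to 3 times, 2s per attempt.
    retries:
      attempts: 3
      perTryTimeout: 2s
```

The `v1` and `v2` subsets would be defined in a companion DestinationRule keyed on pod labels.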

And then the other side of it is just the pure visibility. You know, how is that service interacting with maybe two or three other services? Or one of the big things is we have a service. It's used by this one use case and this one API the way we're expecting. Maybe then a new application comes online and starts using it a little different way, and it's maybe querying it a little harder, or you're seeing worse performance, and you're not quite understanding.

You can see the backend logs, and you see things starting to slow down, but you're not quite sure why some requests are different. So just gaining visibility into that, that was the number one thing we were trying to get from Istio when we first started playing with it. That was maybe in the 0.2 time frame.

CRAIG BOX: Do I have you on the record saying that you were running Istio in production at 0.2?

TIM KELTON: We just had one core service that we were running very, very early. It was 0.2 or 0.3, but then we've been putting more and more services behind it, especially after the 0.7 to 0.8 change. That was a pretty big change and took some time for us to work through, especially since things like routing changed a little bit there.

CRAIG BOX: Now that Istio has gone 1.0, are there components of it that you weren't using that you're looking to adopt?

TIM KELTON: Yeah, we're now pushing on that. We definitely want to use more of the things like mTLS and certificates. We want to do more and more on the rate limiting. We want to use almost every aspect of Istio, and push more and more of our own services into it as well. And that just gives us so much better visibility.

ADAM GLICK: With Istio, how did you decide to use that as your service mesh versus some of the other service meshes that are available out there?

TIM KELTON: I think Istio is maybe the first service mesh I really heard about, other than how Google-- we have a number of ex-Google engineers, as well-- and how Google kind of does their own internal APIs. So that was my first visibility into what this concept was there. But I feel like it's actually changed a lot in the last six months or so, and now there's all types of service meshes. So I don't know that we actually evaluated 10 different meshes and said, this one's the best.

But I also liked-- the more I started reading about Envoy, the more I was pretty impressed with some of the basic core functionality of Envoy. And actually, I think I talked to Matt Klein at SRECon last year quite a bit, and he answered all my questions. And I came back and I read through more and more of the docs, and then we started really hammering on Envoy and seeing the performance was actually quite good. And that gave us a lot more confidence.

CRAIG BOX: One of the many hats that you wear is head of SRE at Descartes. Tell us a little bit about your implementation of SRE, and perhaps some of the ways it's similar to or differs from the published practices we put in the book.

TIM KELTON: [CHUCKLING] I guess when you read Google's book, you kind of come away like, those are amazing. And then you're like, well, I have four people. [LAUGHING] How--

CRAIG BOX: That's why we had to publish a second book, telling you how you can do it at different scales.

TIM KELTON: I picked up my copy of that. I've not opened it yet, but I'm looking forward to diving in and reading that, as well. So in some ways, we're doing a lot of the same concepts. We try and do blameless post-mortems and try and make things learning experiences that we can build a better and better product from. So that's, I think, a core concept.

But then other things, like service level indicators, service level objectives, how do I meet those? And then, for us, as we're selling a SaaS platform, then you would roll that into maybe an SLA. So we've used Istio. And we've been using some of the Stackdriver tools, as well, to basically see the service-to-service metrics and the historical traffic that comes in.

And that can help you establish a baseline for your service level indicators, and you can look back and say, well, was this service level indicator-- it's great that the PM set it there, but is that actually realistic over the last month? And you can do things like, based on those objectives, you can make an error budget.

You can try and roll in, when do we have time to-- we have some budget to burn. Maybe this is a good time to throw out a canary and see how that canary deploys. Other types of, I guess, SRE concepts.
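The objectives-to-budget arithmetic Tim describes is simple enough to sketch; the numbers here are illustrative, not Descartes Labs' actual targets:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability over the window for a given SLO.

    The budget is simply the fraction of time the SLO lets you fail:
    a 99.9% target over 30 days leaves 0.1% of 43,200 minutes.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% availability SLO over 30 days leaves about 43.2 minutes
# of budget to spend on experiments like canary rollouts.
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

If the last month burned less than that, there is budget left to "throw out a canary," as Tim puts it.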

CRAIG BOX: Was it hard getting buy-in from the engineers who would write the code that you wanted to do things in this way?

TIM KELTON: We have a fairly tight engineering team, I would say, overall. And so our SREs and our platform engineers actually all work together pretty closely. We all sit in the same office, so I think some of our platform engineers were as excited about Istio and some of these things as our SREs were. It's come a long way since-- I think I had heard of some of these things before I met you in Toronto, but that was about two years ago now.

CRAIG BOX: Yeah, it was August 2016.

TIM KELTON: Yeah. So it's changed a bit.

CRAIG BOX: A little bit, yeah.

ADAM GLICK: Hopefully for the better.

TIM KELTON: But it's actually really at an exciting place, right? Being able to see all of my services, how all of my services interact, being able to do things like a topology of my services. If I'm a brand-new SRE coming into the job, I see, oh, here's this new service I'm supposed to start supporting. And being able to map all the different ways the services interact, that's incredibly powerful.

CRAIG BOX: It really helps raise the abstraction level that people think about.

TIM KELTON: Yes. Yeah. And then you can always dive down and get the lower level, what's going on on the Kubernetes cluster level. So that's also useful, as well.

CRAIG BOX: If anyone is interested in seeing some of Descartes' work in action, where should they go?

TIM KELTON: They can go to our homepage, descarteslabs.com, and there's a really interesting demo. We call it GeoVisual Search. So we've taken a really large composite image of the US or the world. And what we've done is break that out into tiny little, 128-pixel by 128-pixel grids. And then you can click anywhere on the map and it's going to do a similarity search and show you the 1,000 most similar results to anywhere you just picked.

CRAIG BOX: Wow.

TIM KELTON: And so you can do things like soccer fields, or I've had people come up to me after talks about golf courses and some hole that they really like to play. Or I'll look for things like, where are all the cool places just like Moab, and things like that. But that's a pretty cool link to put in the show notes.

CRAIG BOX: Thanks very much.

ADAM GLICK: Tim, it was great having you on the show.

TIM KELTON: I really enjoyed it. Thanks, guys.

CRAIG BOX: You can find Tim on Twitter @timbuktuu-- T-I-M-B-U-K-T-U-U. That's a rather unique spelling.

TIM KELTON: Well, the single U was already taken, so.

CRAIG BOX: And you can find the link to Tim on Twitter and the notes from today's show at kubernetespodcast.com.

[MUSIC PLAYING]

CRAIG BOX: Thanks as always for listening to our show. If you've enjoyed it, please continue to help us spread the word. Tell a friend. If you have any feedback, tell us on Twitter @KubernetesPod, or by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out those new transcripts we mentioned at our website, kubernetespodcast.com. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]