#87 January 21, 2020

Multitenancy at Cruise, with Karl Isenberg

Hosts: Craig Box, Adam Glick

Self-driving cars need self-driving backend infrastructure. Karl Isenberg is the tech lead & manager of the platform team at Cruise, a self-driving car company backed by GM and Honda. He joins hosts Craig and Adam to discuss two years of running multitenant Kubernetes.


ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: So I've seen that security has been in the news this week.

CRAIG BOX: Yes. I hope, if you're running Windows out there, that you're aware of two major developments in the last week. First is the end of life of Windows 7, and along with it, Windows Server 2008. That will probably drive a lot of people to Kubernetes, as they are taking their old workloads and bringing them forward and thinking, hm, we should put them in those fancy new containers that everybody's talking about.

ADAM GLICK: Fancy new containers, indeed.

CRAIG BOX: The other and probably bigger piece of news is that Windows 10 was afflicted by a cryptography bug, which basically renders all certificates useless. If you can generate a particular set of patterns for your elliptic curve, which I will claim to know nothing about, you can basically just sign anything however you like.

I saw a great meme on the internet: a child who had signed their school report "Mommy" in big pencil letters. And yeah, that's basically it. You can just sign anything and pretend to be anybody.

ADAM GLICK: [LAUGHS] Well, you know at some point, someone will weaponize it and make a script out of it.

CRAIG BOX: It's already out there. There are rules for Windows Defender and other things that will detect people who are trying to sign certificates in this way, but this is one of those drop-everything and patch-everything things. Unless of course, you're running Windows 7, in which case, it's just throw it away and buy a Mac, I guess.

ADAM GLICK: [LAUGHS] Well, in lighter things from this week, I was watching some of the performances from "The Voice," which is that TV show where basically people don't see who's singing, they just hear it. And then the judges have to decide if they want that person to be on their team for the rest of the season.

CRAIG BOX: Mm-hmm.

ADAM GLICK: And normally, there are just really, really wonderful vocal performances. But there was one that was truly unique that I stumbled across this week, by Stephanie Stuber. There's a link in the show notes. And my hat's off to her, because she did an incredible, very non-traditional performance. And God bless it, I love when people do things that are just not what you expect and take what they're doing to a different level. So that was my little joyous find on YouTube this week.

CRAIG BOX: I checked that video out beforehand when you shared it with me. Without giving too much about the performance away, there's not normally a lot for the band to do on those shows. It's really about the singer.

But this particular song gave the band a lot to do. And I'd say, the people in those bands are having to play multiple different genres every night for many, many weeks. And they do a really good job, so my hat's off to the band as well.

ADAM GLICK: Excellent. Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: The Kubernetes Product Security Committee has introduced a bug bounty program, announced in a blog post by Maya Kaczorowski and Tim Allclair from Google and funded by the CNCF. The bounty program covers Kubernetes, as well as its build and release processes, protecting the core code, as well as avoiding supply chain attacks.

Payouts of up to $10,000 will be made for critical vulnerabilities, which must be responsibly disclosed through the online forum. Save your efforts on attacking the mailing list and Slack channel though, as those are out of scope, as are container escapes, attacks on the Linux kernel, and non-code dependencies, such as etcd.

CRAIG BOX: Continuing the security week theme, the Center for Internet Security, CIS, recently published their benchmark analysis and recommendations for Kubernetes 1.15. Building on that publication, Google Cloud has worked with CIS to come up with a GKE benchmark, which helps outline which part of the recommendations Google Cloud automatically does for you and which you are responsible for yourself. If you are already running GKE or you're evaluating Anthos, check out the CIS guide to learn what you do and don't have to think about.

ADAM GLICK: Kyma has announced a tool for integration testing applications on Kubernetes, which they call Octopus. The tool came out of a desire to find a way to test the integration of all the different projects Kyma was using together and outgrowing what they could do with Helm. The project runs suites of tests defined using custom resources and provides integration with Prow to provide dashboards of results. Other benefits include selective testing and automatic retries of failed tests.

CRAIG BOX: Elastic has announced the GA release of ECK-- the Elastic Cloud on Kubernetes. As you might expect, ECK lets you manage Elasticsearch and related software with seamless upgrades, simple scaling, and security by default, using an operator. Regular ECK functionality is available on Elastic's free-forever Basic tier, with advanced features available as part of their paid Enterprise plan.

ADAM GLICK: Red Hat has announced the upcoming release of OpenShift 4.3. New in this version is FIPS compliance at Level 1 for people working with government data. For everyone else, etcd encryption is added for more secure data at rest.

This release also comes with enhancements to Red Hat's OpenShift Container Storage 4 product, incorporating NooBaa-- that's two O's and two A's-- to help provide abstracted storage access for multiple cloud vendors.

CRAIG BOX: Also from Red Hat this week, Fedora CoreOS is now generally available. Built on Fedora 31, Fedora CoreOS combines the provisioning tools and automatic update model of CoreOS Container Linux with the packaging technology, OCI support, and SELinux security of Fedora's Atomic Host. There are no in-place upgrades from either Atomic Host or Container Linux. Both require you to come up with new config files and redeploy.

ADAM GLICK: Last week, our guest, Lin Sun, talked about "istiod", a unification of several control plane components into one single binary. Christian Posta of solo.io has written a little more about the change in a blog post this week. He used the case of istiod to talk about the joy of monoliths. Microservices are not one-size-fits-all. And if you have a single team working on a set of services, you should consider if you're optimizing for the right thing by using them.

CRAIG BOX: In other Istio news, Banzai Cloud has launched version 1.1 of their Backyards service mesh. Backyards adds several CLI and UI features to Istio, and in 1.1, adds a friendly user experience around multi-cluster meshes, tapping traffic, and reducing resource usage via the sidecar resource, amongst other dashboard features. Backyards is a commercial product that can be run on top of any Kubernetes environment.

ADAM GLICK: Darren Shepherd, our guest on episode 57, loves minimizing things. Having already tackled Kubernetes by shrinking k8s to k3s-- five fewer-- he has turned his attention to Docker. K3c is a new container-focused complement to k3s for the things Docker used to do before it got laden with Swarm and other server-side stuff. You can use it to build, run, push, and pull images; it wraps the container runtime interface and Moby's BuildKit. Darren says he's looking forward to unifying everything back into a monolithic project called k3.

CRAIG BOX: 2020 marks 10 years since the publication of the first book on, and titled, "Continuous Delivery." Arun Ramakani takes this to its logical conclusion by introducing the continuous GitOps model, where you are constantly deploying software to your environment by way of configuration changes in source control. He explains what this means to him and how you might introduce it in the first of a series of articles.

ADAM GLICK: Flant, a Russian and Estonian DevOps company, has announced version 1.0 of their GitOps CLI tool, Werf, which, according to Wikipedia, is Middle Dutch for wharf or shipyard. Werf can build containers, deploy to clusters, remove old images, and has automatic integration with GitLab CI.

CRAIG BOX: Google Cloud has released a new training masterclass entitled, "Architecting Hybrid Cloud Infrastructure with Anthos." This course will help developers, ops people, and architects with Kubernetes experience learn to build applications on the Google Cloud Anthos Solution. The training has three classes covering hybrid cloud infrastructure, service mesh, and multi-cluster.

Google also shared a case study on how Phoenix Labs, creators of the hit game "Dauntless," launch and run their servers on GKE. Kubernetes was the key technology enabling them to launch on three platforms simultaneously and scale to five continents.

ADAM GLICK: Catalogic, with a C, has announced a disaster recovery tool called KubeDR, with a K. With no pronunciation guide, it would be a shame for us not to call this one "cube doctor". Cube Doctor was built because existing tools that back up objects don't always capture everything needed to reproduce a cluster, and taking snapshots of etcd still means you need to back up certificates. The tool handles making these backups, sending them to cloud storage, and cleaning up old snapshots. The project is open source and currently available on GitHub.

CRAIG BOX: Finally, Chinese company, Inspur, has ported Kubernetes to the MIPS architecture, a RISC system found in embedded devices, like routers, and the university teaching boards that I learned to program in assembly language on. On the Kubernetes blog, the Inspur team talk about the different components they needed to compile for the project to work, how they ported the code, and how they cross-compile using QEMU. Having passed the conformance tests manually, the team are now looking to upstream their work and bring official MIPS support to the Kubernetes project.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Karl Isenberg leads the PaaS team at Cruise, a self-driving car company backed by GM and Honda. Karl and his team manage and operate a platform based on Kubernetes, and he has worked on container platforms for more than five years, including Kubernetes, DC/OS, and Cloud Foundry. Welcome to the show, Karl.

KARL ISENBERG: Yeah. Thanks for having me.

CRAIG BOX: How did you get started in the platform space?

KARL ISENBERG: I sort of worked my way down the stack from applications-- web apps, to frameworks, to platforms for e-commerce-- and then into infrastructure and platforms, and then containers, and making it all fit together, sort of empowering my fellow engineers.

CRAIG BOX: Was that the desire of understanding each layer down as you were working on the layers above?

KARL ISENBERG: I don't know if it was ever a conscious plan up front, but over time, things looked more interesting further down.

CRAIG BOX: Do you ever miss the front-end pieces?

KARL ISENBERG: Occasionally, I write some JavaScript. And then I remind myself why I don't do that for a living.

ADAM GLICK: [LAUGHS] What is Cruise, and why was it founded?

KARL ISENBERG: Cruise is building the world's most advanced self-driving cars. We are testing those in San Francisco, and we have 160 licensed cars in California. And we're driving those with the intent of launching a ride-hailing service in San Francisco first. And we chose San Francisco because it has challenging streets and is a lot harder than some of the suburbs that other people are testing in. So that means that we get to challenge ourselves and solve a lot of hard problems right up front.

CRAIG BOX: I guess if you can do Lombard Street, you can do anything?

KARL ISENBERG: Pretty much.

CRAIG BOX: In the early days of Cruise, I understand that teams would build individual infrastructure that suited their needs. You lead a Platform as a Service team that's trying to consolidate that and effectively give a platform for everyone to build on top of. What was the transition in the beginning from one to the other?

KARL ISENBERG: I was hired two years ago to help lead the platform team that was just starting. And before that, there was a bunch of SREs and core infrastructure people working on pretty much everything. And Cruise has been scaling rapidly, hyper-growth as you call it, doubling every year pretty much for the last four or five years.

And so, the choices that were made when it was small were good at the time. And then over time, we needed to make new choices. So we had a Rancher cluster that ran most of our things originally, and then the SREs were investing in Kubernetes. And when we made the platform team, the idea was to wrangle a bunch of different snowflake clusters and make it into a platform that could help advance everybody towards production readiness and increased development velocity.

CRAIG BOX: So is the platform team purely an engineering function, or does it cross over into SRE as well?

KARL ISENBERG: We work closely with the SREs and sometimes have SREs paired on our team or working with us. And then we have some SREs that have come to join the platform team.

CRAIG BOX: Was Kubernetes always the foregone conclusion? You obviously mentioned that there was some Rancher in use beforehand, but was there ever any other platform technology considered?

KARL ISENBERG: When I got there, that decision had pretty much been made. But there was, I think, some Mesos going on for a while with some data infrastructure stuff, and some experiments on other platforms. And we investigated EKS, which was really new at the time and not quite ready to use. So we decided to build our own Kubernetes platform off the bat. And then later, we got access to GKE and GCP, and we've been building around that for a long time now.

ADAM GLICK: How would you describe the architecture that you've built on top of Kubernetes?

KARL ISENBERG: So I generally say that the Cruise PaaS is a constellation of components, and operators, and integrations with SaaS systems around GKE and Kubernetes. The idea is that there's a lot of pieces you need to add to Kubernetes and to integrate with Kubernetes to get the most value out of it. And just setting up Kubernetes by itself isn't quite enough.

CRAIG BOX: Do you think that Kubernetes is the right level of abstraction, or do you think the project should include more of those things that you've had to build yourself?

KARL ISENBERG: Abstraction is a funny thing. I don't think there is a right layer of abstraction for anything. There's more abstraction or less abstraction, and they usually come in layers. So I think the right choice for your use case varies, depending on what's available and what can make you move quickly without getting in your way.

So we've tried to provide several layers of abstraction that allow our internal engineers and customers from the PaaS perspective to make their use case fit. So in some cases, those are VMs. In some cases, those are containers. In some cases, those are functions. And some need to run on bare metal.

So we try to provide the whole range of that functionality, and then let people choose. I would say that the majority of those run on containers, but we do have some large systems that run on VMs.

ADAM GLICK: Where do you run Kubernetes?

KARL ISENBERG: So like I mentioned, we have GKE providing Kubernetes. We also have Kubernetes on-premises running on bare metal, or metal as a service. We don't necessarily have a cross-cloud Kubernetes installation. We tend to stick to having a Kubernetes in each region or locale, because Kubernetes scales better when everything's in the same availability zone or region. Also, there's a little bit of concern about fault domains and not wanting to have one single control plane that could fall over and cause chaos.

ADAM GLICK: Do you run Kubernetes on the vehicles?

KARL ISENBERG: We don't run Kubernetes on the vehicles for now, and we're not necessarily investigating it immediately. I think Kubernetes was designed for a different use case. And there's a lot of people investigating use cases that weren't intended for Kubernetes, and more power to them. I think there's been some investment in making it work in IoT spaces.

But I think in the car, because of the safety concerns, and the concerns about real-time operation, and making sure that we respond as fast as possible, we can't really wait for a pod to fall over and get replaced.

We do need them to be highly available in the car. And we do have a very redundant system in the car, but we use a real-time operating system and not necessarily a cloud-based system that would self-heal over a long amount of time.

CRAIG BOX: Can you give us some examples of the kind of workloads that run on your PaaS?

KARL ISENBERG: We have a lot of workloads on Kubernetes. And that's actually one of the reasons why Kubernetes is valuable to us, as opposed to some of the alternatives, because Kubernetes supports many of them-- not necessarily all our use cases, like I mentioned about layers of abstraction.

Some examples are the standard web applications and microservices that I think it was designed for at its core, and then also jobs, and cron jobs, and batch jobs. We also have some large systems with agents that are distributed over thousands of pods, that consume from Pub/Sub queues and autoscale on demand, based on the load.

So it's not all just request query-based systems. It's a lot of different batch systems and processing that is required for either the machine learning or validating our assumptions in code and also CI/CD. And also, there's a bunch of use cases that are kind of miscellaneous. Having a multitenant cluster allows us to have all of those together and maximize utilization by filling in the blanks where one might be memory heavy, another might be CPU heavy, another might require GPUs, another might be disk heavy.
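To make that bin-packing concrete: the Kubernetes scheduler can co-locate complementary workloads because each container declares its dominant resource in its spec. Here is a rough sketch in Python, with hypothetical profiles rather than Cruise's actual manifests:

```python
# Hypothetical resource profiles -- not Cruise's manifests. The scheduler
# can pack these onto shared nodes because each declares what it needs.
memory_heavy = {"requests": {"memory": "16Gi", "cpu": "1"}}
cpu_heavy = {"requests": {"cpu": "8", "memory": "2Gi"}}
gpu_job = {
    "requests": {"cpu": "4", "memory": "8Gi"},
    "limits": {"nvidia.com/gpu": "1"},  # GPUs are requested via limits
}

def container(name, image, resources):
    """Build a container spec dict with an explicit resource profile."""
    return {"name": name, "image": image, "resources": resources}

pod_spec = {
    "containers": [
        container("worker", "registry.example.com/worker:latest", gpu_job),
    ],
}
print(pod_spec)
```

A memory-heavy pod and a CPU-heavy pod with profiles like these can land on the same node, filling in each other's blanks.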

ADAM GLICK: Which development teams internally build on top of Kubernetes, versus choose to use some of the other solutions that you mentioned? And how do they make the decision as to when they want to run where?

KARL ISENBERG: I think the way we have it set up now, we generally recommend that everybody default to Kubernetes, unless it turns out that your use case won't support that, because that's the platform that we have with the most investment towards production readiness. So that will get you the furthest in the shortest amount of time, unless that doesn't work for your use case.

There's some other investments in the future we're trying to get towards around Functions as a Service and having a more abstract application layer so that people have to do even less and get further. But the more abstract you get in that layer, the less use cases it applies to.

CRAIG BOX: So what is the interactivity between an engineer and your platform today? What are they submitting to cause something to be run?

KARL ISENBERG: When we onboard people into engineering at Cruise, they go through a class that the SREs run that teaches them how to use kubectl and deploy to the cloud manually as a user account, to give them a little bit of exposure to our PaaS. And then the next level is integrating with CI and CD. We have a couple of different CI solutions. And then for continuous deployment, we have Spinnaker managing a lot of our deployments, so that we can have complex deployment workflows.

But if all you require is a simple solution, you can also deploy from CI, or manually if it's a dev cluster or something. But we don't usually have humans deploying to the production cluster. We'd want to automate that.

CRAIG BOX: You're running a large set of workloads across many thousands of servers. How did you decide where to put the cloud project and cluster boundaries?

KARL ISENBERG: GCP has an interesting approach to project and multitenancy itself that's different from some of the other public clouds. For example, the VPC boundary and the project boundary are separate, compared to the way they are in AWS. So that allows for a little bit of flexibility where you can isolate security at the project boundaries and manage permissions at that level, but then share VPCs across projects.

So we use shared VPCs. And the networking team manages those network layers. And then each team gets a set of GCP projects in each of the environments, like dev, staging and prod. And those different environments have different networks set up with them.

And then the PaaS team runs our PaaS clusters, or GKE clusters, inside of our own GCP projects. So that allows us to provide namespaces to each of our tenants, while they still have access to their own project to run things like Cloud SQL or other SaaS products. That way, the PaaS team doesn't need to have access to their databases. And they can set up and manage their own role bindings.

CRAIG BOX: You have, obviously, a choice of allowing each team to have a cluster and have different namespaces in that cluster for their dev, test, production workloads, or having those be completely separate in terms of separate clusters. Does your design lead naturally to either of those?

KARL ISENBERG: I think there's a big difference between using Kubernetes and operating Kubernetes. A lot of people have experience using it, and fewer people have experience operating it. And it takes a lot of time and energy to become an expert in Kubernetes operations. So we tend to bias towards the PaaS team operating the Kubernetes clusters.

CRAIG BOX: Right.

KARL ISENBERG: And that means we want to put them in our projects so that we have control over them, in terms of access control. And we don't necessarily want tenants managing their own clusters, partially because we have the expertise and can provide value there.

There are some scenarios where we have groups of teams that want their own cluster for other isolation reasons. So we have some smaller multitenant clusters that don't have everybody on them. We generally have a set of clusters that everybody defaults to. And if that doesn't fit their use cases or their requirements for isolation, then we might make other clusters available to them.

ADAM GLICK: You mentioned that you run in multiple locations, including in the cloud and on premises. How do you manage security across those different locations?

KARL ISENBERG: Complicated-ly?

[LAUGHTER]

We use Vault for our secrets management. We believe that it is stronger at basically securing a wide variety of secrets. And we can use it for not just Kubernetes. We can use it for GCE instances. We can use it for EC2 instances. We can use it on premises to load secrets in.

So it allows us a cross-cloud secrets management solution. So that's what we tend to use for secrets, but that's not the whole story around security. Obviously, there's a lot of authentication, and authorization, and identity management that has to go into that, and then, permission management on top of all that. Even once you have all that set up, managing the permissions is a big job. So I think all of those have different solutions.
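As a sketch of what cross-cloud secrets management looks like from a client's point of view: the same Vault read works whether the caller is a pod, a GCE or EC2 instance, or an on-premises machine. Here is a minimal example with the hvac Python client-- the address, token, and path are placeholders, not Cruise's setup:

```python
# Minimal sketch of reading a secret from Vault with the hvac client.
# The URL, token, and path are hypothetical placeholders.
import hvac

client = hvac.Client(url="https://vault.example.com")

# Authenticate however your environment does (Kubernetes service account,
# GCE/EC2 instance identity, etc.); a raw token is used here for brevity.
client.token = "s.example-token"

# KV v2 read; the payload sits under data/data in the response.
secret = client.secrets.kv.v2.read_secret_version(path="teams/team-a/db")
print(secret["data"]["data"])
```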

CRAIG BOX: You've open-sourced a tool called RBACsync, which allows you to have a consistent set of identity across all of your clusters?

KARL ISENBERG: RBACsync is sort of a glue piece that allows Kubernetes to do role binding on groups. There is some built-in capability for this in Kubernetes, but when we started on GKE, GKE didn't support binding to Google Groups.

So we wrote this tool to allow a two-stage management of permissions. Basically, the platform team managed the bindings that gave people access to their tenant namespace. And then we could delegate out the group management to, like, an engineering manager of the team or a tech lead, who would then make sure that their group was up to date.

So when they onboard a person, they just add them to the group, instead of having to create more bindings. And then RBACsync keeps the Kubernetes permissions up to date. It's not a complete end-to-end solution, but it allows tying their single sign-on identity provider-- through G Suite, and Okta, and Duo Security for two-factor-- into Kubernetes, and extending that with group management.
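The end result of that two-stage model is an ordinary RoleBinding whose subject is a group rather than a list of users. Here is a rough sketch of such a binding, created with the official Kubernetes Python client-- the names are hypothetical, and this is not RBACsync's actual implementation:

```python
# Rough sketch: bind an externally managed group to the built-in "admin"
# ClusterRole within one tenant namespace. Names are hypothetical; this
# is not RBACsync's actual code.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "team-a-admins", "namespace": "team-a"},
    "subjects": [{
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "Group",
        "name": "team-a@example.com",  # curated by the team's manager/lead
    }],
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": "admin",  # namespace-scoped admin via a ClusterRole
    },
}

rbac.create_namespaced_role_binding(namespace="team-a", body=binding)
```

Adding a person to the group then grants them access everywhere that binding exists, without touching Kubernetes at all.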

CRAIG BOX: That's one thing that a lot of people pick up on when they start looking at some of these projects, Kubernetes, especially, is that there are a lot of different objects. And quite often, the reason there are a lot of different API objects is that they need to be managed by different teams. You have bindings on one side and consumption on the other, and they may be set up by administrators rather than the end users.

Are we, as the Kubernetes community, doing a good job at onboarding people who maybe don't have that scale and explaining to them why all that complexity is there?

KARL ISENBERG: I think that, in order to have a multitenant platform-- a platform team providing service to other tenant teams, on top of a cloud that is itself multitenant and offering us service as a tenant-- there's a lot of layers of complexity there. Kubernetes doesn't do a great job of isolating the tenant view from the platform operator view, and that is an interesting distinction on what should be visible.

A similar boundary is which objects are namespaced and which ones aren't. So for example, a custom resource definition is a global object.

CRAIG BOX: Yes.

KARL ISENBERG: So only platform operators can install those if your tenants are at the namespace level.

CRAIG BOX: Do you think Kubernetes should have a tenant object in the way that it has a namespace object?

KARL ISENBERG: It could. I think tenant kind of means something different to different people, and so I don't love it as an object name. But I do wish there were some more orthogonal ways to slice and dice.

So when I think about the domains you might want to slice on, I usually think about architectural domains, like separating components or microservices from each other, or project domains, which sit within the architectural domain. Or you have organizational domains, like organization, department, team, individual. And then the other one is your environmental domains: dev, stage, prod, test.

And so, you have clusters, and you have namespaces. So you can't do all three of those domains. There's just not another object to do it with.

CRAIG BOX: This feels like a job for labels, to some degree. Because obviously, some of those things will overlap, depending on people's business structure.

KARL ISENBERG: I think labels are horizontal. What's needed-- I was using the term orthogonal-- is a way to intersect. When you take clusters and namespaces and add a third dimension that overlaps, I don't think anybody's really decided what to call that. But I don't know that it's tenant.

Our tenants are at the namespace boundaries, but those namespaces are cross-cluster. So it's not a tenant in one cluster; it's like a tenant across clusters.

CRAIG BOX: So this almost feels like your PaaS should have a construct called tenant or something like it, rather than Kubernetes itself.

KARL ISENBERG: We've contemplated writing an operator or something that would manage that with another abstraction layer. And in fact, we've written some tooling that helps us deploy changes across multiple clusters.

We've recently open-sourced a tool called Isopod, which is fundamentally a CLI that gives us the ability to deploy to a bunch of different clusters with similar configurations, and then just templates those out with a nearly-Turing-complete language, rather than just using Helm templates and YAML.

So that way, we're not slicing all of our add-ons and releasing 10 different versions. Every time we want to roll something out, we can just roll from master and deploy it to all the clusters with minor permutations.

Another thing that we wrote internally was Juno, a tool for GUI management and self-service creation of projects. And we call them projects, not tenants, because we actually want people to move towards having them at the project boundary-- not necessarily the component boundary, but a set of very tightly-integrated components in a project space.

And that kind of overlaps with the GCP project namespace, which is a little complicated, but that's not a tenant boundary either. So we have these project isolations, and that gives you a set of GCP projects, a Vault workspace-- like a hierarchical path-- and a Kubernetes namespace across environments. So it's sort of a bigger idea.

ADAM GLICK: How do you get network traffic into your services and across clusters, when you have multiple locations?

KARL ISENBERG: We have a hybrid network across our clouds, and on-premises data centers, and offices. This is required for sending data between them behind the corporate firewall. And we try not to lean on the corporate firewall as the be-all and end-all of security; it's just another layer that gives us a little more assurance that things aren't publicly leakable by intermediaries.

And so to do that, we have some private fiber laid to get to internet exchanges, to support high-bandwidth use cases from our garages into the cloud. And then from the platform layer, we have several different ingress and egress integrations that allow getting outside of our private GKE nodes.

So the private GKE nodes don't have direct ingress from the internet. They can egress to the internet through a NAT gateway. But for ingress, you'd have to go through a load balancer. And usually, that's a Layer 4, or a Layer 7 load balancer, or both. And then those load balancers allow for cross-region or public internet traffic, depending on which one.
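For the Layer 4 path Karl describes, a GKE-style sketch is a Service of type LoadBalancer fronting pods on the private nodes. The names here are hypothetical, and the internal load balancer annotation shown is GKE's documented one (it varies between GKE versions), not necessarily Cruise's configuration:

```python
# Sketch of Layer 4 ingress in front of private nodes: a LoadBalancer
# Service, kept internal to the hybrid network. Names are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "web",
        "namespace": "team-a",
        # Keeps the load balancer off the public internet; the exact
        # annotation differs across GKE versions.
        "annotations": {"networking.gke.io/load-balancer-type": "Internal"},
    },
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "web"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}

core.create_namespaced_service(namespace="team-a", body=service)
```

Dropping the annotation gives the same Service a public, internet-facing load balancer instead.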

ADAM GLICK: What do you use to make sure that you're keeping that network healthy? What do you use for observability to view what's going on in your clusters and between clusters with your traffic?

KARL ISENBERG: I think this is a not-completely-solved problem space, especially in hybrid cloud. There's really nothing that will give you full visibility over everything that's going on in your custom hybrid cloud, because everybody has built their own, and they're all different.

Inside of a cluster, the obvious answer would be a service mesh, which allows you to add metrics onto the egress and ingress of each pod individually, or each application, or each namespace, or each cluster. And we've been piloting Istio to help in that space.

And I think you mentioned observability and viewing the hierarchy. And I think those are important value adds from Istio. Another one is quality of service. So first, you have to have the metrics to figure out where the problems are, and then you have to have isolation boundaries to make sure that those don't affect other tenants to reduce noisy neighbor problems.

We're still working on getting that in a single cluster. I think the multi-cluster service mesh problem is still being developed in the ecosystem and is not super mature, and we'll probably be investing in that really soon. But I don't think there's a pull-it-off-the-shelf-and-be-done-with-it solution, especially if you get outside of Kubernetes land. If you have to propagate that to VMs, or across the network to the on-prem machines, then you're adding significant latency over interconnects, or VPNs, or regions, and it becomes more complicated.

CRAIG BOX: You mentioned Isopod, a domain-specific language that you've open-sourced for defining Kubernetes objects, in the context of helping you roll out to multiple clusters. What other problems does Isopod solve for Cruise?

KARL ISENBERG: Really, the fundamental problem it helped us solve was replacing a bunch of Bash that we wrote. Even if you're using YAML and Helm, or some analogous feature like, I don't know, a Terraform Kubernetes provider, there's still the abstraction layer around that: deploying to a bunch of different clusters, the release management of your tools, the building and pushing, and doing zero-downtime deployments.

The other aspect of it is that not everything goes through Kubernetes. And so Isopod has providers in it that connect to GCP and connect to Vault, and can do the end-to-end to make sure that our add-ons, which is what we primarily use it for, are deployed with secrets in Vault, and with firewalls, or security groups, or network connections, or service accounts from GCP set up end-to-end, without having to have CRDs for all of those to go through Kubernetes.

It also allows us to use less YAML. We write it in Skylark, which is the front end of that, and the back end is Go, wrapping the Kubernetes Go client, which is the most mature Go client.

CRAIG BOX: Skylark is a language that's Python-derived, I guess you'd describe it as?

KARL ISENBERG: Right. It's a subset of Python. So it's not quite Turing-complete, but you can still do for loops, and if statements, and functions, and a lot more abstraction than you can get with YAML. Plus, it gets us out of the problem where you can't read what's going on.

I think a lot of Helm charts will translate from the Kubernetes standards back to a new API effectively, with a bunch of fields that may or may not have the same names. And then, those are custom to your application. Whereas in our case, we're using an imperative language to fill out these objects, and then sending them straight to Protobuf to send them to Kubernetes.
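Since Skylark (now called Starlark) is essentially a subset of Python, the flavor of this approach can be sketched in plain Python: build objects imperatively with functions and loops, then render per-cluster permutations from one definition. This is only an illustration of the idea, not Isopod's actual API:

```python
# Illustration of the idea only -- not Isopod's real API. One imperative
# definition is rendered per cluster, instead of N near-identical YAMLs.

CLUSTERS = [
    {"name": "dev-us-west1", "env": "dev", "replicas": 1},
    {"name": "prod-us-west1", "env": "prod", "replicas": 3},
]

def metrics_agent(cluster):
    """Fill out one add-on Deployment object for a given cluster."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "metrics-agent", "namespace": "platform"},
        "spec": {
            "replicas": cluster["replicas"],
            "selector": {"matchLabels": {"app": "metrics-agent"}},
            "template": {
                "metadata": {"labels": {"app": "metrics-agent",
                                        "env": cluster["env"]}},
                "spec": {"containers": [{
                    "name": "agent",
                    "image": "registry.example.com/metrics-agent:1.2.3",
                }]},
            },
        },
    }

for c in CLUSTERS:
    manifest = metrics_agent(c)
    # A real tool would now apply `manifest` to the cluster c["name"].
    print(c["name"], "->", manifest["spec"]["replicas"], "replica(s)")
```

The loop replaces the "10 different versions" problem: one definition, minor permutations per cluster.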

ADAM GLICK: Do you think the infrastructure challenges that you're solving are the same, or different than the challenges being solved by other industries?

KARL ISENBERG: I think the Kubernetes problems, or the adjacent problems surrounding Kubernetes, are mostly common across the industry. That's why we open-sourced five or six tools. There are these large projects, but integrating those projects together is the last mile. And getting that working is a little complicated and specific to which integrations you chose.

Just getting Vault, and Okta, and GKE together is not the same set of choices that everybody makes. And so, doing last-mile integrations is a big thing that has to be done by someone. Either you pull that off the shelf, because we made it for you or someone else did, or you build it yourself.

And in general, I think multitenancy is also a popular thing to talk about in the ecosystem, because a lot of people are doing it without really feeling confident about it. I tend to describe multitenancy really as running and operating multiple applications on the same hardware or the same environment.

So it's ambiguous whether the tenant is the applications, or the operators operating the tenants. But the fact is, everybody's already doing that with Kubernetes. They already have multiple applications on their Kubernetes. I don't know anybody who runs one application on a Kubernetes cluster. That seems like a waste of investment in infrastructure.

CRAIG BOX: You've built a custom platform for your needs at Cruise. Would that be your recommendation to a team starting today looking at a similar problem? Or do you think that the parts that you've built and the rest of the industry are working on collaboratively will eventually lead to this being something a bit more out of the box for this kind of use case?

KARL ISENBERG: I mean, as of today, I don't feel like there's a one-size-fits-all solution. It took a long time for Kubernetes to pan out as the winner in the space. And I still think it wasn't necessarily the best at everything, compared to some of the competition.

But that means it still has places to improve. It still has integrations to improve. And not everything can be baked into Kubernetes. It's getting to a point where CRDs are a better way to add functionality to a Kubernetes cluster, rather than trying to get some of the original primitives changed, because that would just be a much longer process.

So I don't think there's a one-size-fits-all. I think there's a lot of vendors doing things that are useful for certain use cases, but are not going to be one-size-fits-all, like I mentioned. So I think everybody is going to have to build some integrations around Kubernetes.

I don't think you can really get away with using Kubernetes and not build anything on top of it, or not integrate components, because it's not secure by default. And there's a lot of integrations with users, and access control, and networking, and DNS enhancements, and isolation configuration, and even just deployment strategy. Or some low-hanging fruit like multi-region, or cross-cloud, or multi-cluster ingress-- these things just aren't solved problems.

CRAIG BOX: Do you look back on some of the other platforms that you've worked on in the past and look at features that Kubernetes still doesn't have yet?

KARL ISENBERG: Sure. I think one of them that's easy to call out is scale. I know from experience that both Mesos and Cloud Foundry can scale to more nodes than Kubernetes can with general synthetic workloads. And that's partly because of their simplicity.

Just running tasks, or just running applications with no sidecars, and not running daemon sets, it's easier to make that hit a higher scale. And a lot of those aren't necessarily worried about being as consistent as Kubernetes is with its back-end state. So if you have a distributed state or an eventually consistent state, you can scale much better.

So it's hard to call scale a feature. It's kind of a non-functional requirement. But I think that limitation of Kubernetes has created an industry where you need to manage multiple clusters. I mean, I think you need multiple clusters of any of those other competitors too, but it's more of a problem in Kubernetes land.

ADAM GLICK: Recently, you've been publishing a series of blogs about the learnings and experiences that you've had working on the infrastructure. We've certainly enjoyed reading them. What's been your reasoning behind going on "Medium" and sharing so much of what you've been learning and doing?

KARL ISENBERG: Partly, it's that I like talking about it. [LAUGHS] And my company lets me. I think it's good for Cruise, because it gets our name out there and attracts people who like hard problems. And Cruise is fundamentally-- our founder likes to call it the challenge of our generation.

And those of us on the back end have a different challenge, in Kubernetes, and platform ecosystems, and developer velocity. I'm not solving the self-driving aspect myself, but I get to contribute on the back end, working on this hard problem.

So sharing with the community with our open-source tools, and involving ourselves at KubeCon, and investing in talks both does some brand awareness, but also helps us attract talent and is good for the ecosystem. So it makes us feel good. And I think every engineer wants to feel good about what they're doing.

CRAIG BOX: Well, these are some very interesting problems that you're solving, and it's been great hearing about them from you. So thank you very much, Karl, for joining us today.

KARL ISENBERG: Thank you for having me.

CRAIG BOX: You can find Karl on Twitter, @KarlKFI, with two Ks, and learn more about Cruise's platform at medium.com/cruise.

[MUSIC PLAYING]

ADAM GLICK: Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com where you will find transcripts, show notes, "The Voice" videos, and encouragement to patch your Windows machines. Until next time, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]