#114 July 28, 2020

Scheduling, with David Oppenheimer

Hosts: Craig Box, Adam Glick

We finally scheduled some time to talk to David Oppenheimer. David, a software engineer at Google, has been working on scheduling there since 2007, including on both Borg and Omega. That experience naturally led to him working on the Kubernetes scheduler, as well as starting SIG Scheduling.

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.


CRAIG BOX: Last week, we talked about the fact that it had been National Ice Cream Day, and you were making an ice cream pie. Follow-up time-- how was it?

ADAM GLICK: It was delicious. And in the United States, there were all these articles about where you could go to get the best ice cream in whichever city you were. And I chose to go and get some. My wife was very kind. She's like, did you make the crust yourself? And I was like, oh, yes. I went directly to the store, went to the baking aisle, and I made the decision to pull that crust right off of the shelf.

But the rest of it we did put together. It was a multi-layer ice cream pie, and it was fantastic. We dropped off a bit of it for the neighbors so they could enjoy as well. And especially as it's gotten a little warm here in the States over the past week, it's been a wonderful thing to kind of dip into.

CRAIG BOX: I don't know that I was quite clear enough last week in my expression of disdain for the flavor combination. That's an American thing, I know. But peanut butter, and caramel, and ice cream all together? It must be an acquired taste.

ADAM GLICK: Well, I do hear there is a New Zealand version that involves Vegemite and mutton, so perhaps for next year.


CRAIG BOX: You say it's warming up there in the US?

ADAM GLICK: It is. There's also a video we put into the show notes this week of just this very adorable, very large black bear that has found a kiddie pool in someone's backyard and is just plopping himself down and enjoying cooling off in a pool that is literally smaller than he is.

It just is a really cute picture. And the owner was not scared at all. She lives near an area with lots of wildlife, and really enjoyed it, and was happy to share it. So it's one of those just kind of fun things to look at in the hot days of summer and just kind of have a little fun.

CRAIG BOX: I remember a trip to a friend's lake house on a lake near Toronto in Canada. And that area is obviously bear country. And we were just going through the room out the back. And I remember her saying, oh, we haven't seen the bear lately, but keep your eyes open. I'm like, OK. I wish we could stay on the inside, but what are you going to do?

ADAM GLICK: Shall we get to the news?

CRAIG BOX: Let's get to the news.


CRAIG BOX: Fancy a service mesh without sidecars? Google Cloud has added support to gRPC for configuring it with the xDS APIs that are used to configure Envoy. This means you can now control proxyless gRPC services with the Traffic Director product. This can improve latency by removing sidecar hops through the network and increase performance by avoiding the memory and CPU footprints of the sidecars.

ADAM GLICK: The future of monitoring and tracing data acquisition is open. So says Ramon Guiu, the VP of product management for New Relic. He announced this week that New Relic's client-side agents, integrations, and SDKs are all going open source. He also committed to publishing roadmaps for their open source projects. Guiu further committed to standardizing their work on the OpenTelemetry project, to which New Relic continues to be a contributor.

CRAIG BOX: Lyft (with a Y) are famous in the US for cars with pink mustaches and known in the rest of the world for not being Uber. In tech, they're well-known for releasing the Envoy proxy, and they've opened the curtains on their internal tooling once again. This week, they released Clutch, an open source web UI and API platform designed to simplify, accelerate, and de-risk common debugging, maintenance, and operational tasks.

Clutch consists of a React front end and a Golang back end to which you can plug in all your cloud and cloud native systems such that authorized users can trigger workflows. Over the last year, Lyft has used Clutch to replace use of the AWS console, and they're working towards the goal of it being a fully context-aware developer portal.

ADAM GLICK: Conftest, a utility to help people write tests against structured configuration data like their Kubernetes manifests, has become part of the Open Policy Agent, or OPA, project. The project started out as a way for Gareth Rushgrove to learn OPA. And after he shared it at KubeCon in May 2019, it caught on, attracting hundreds of pull requests from 30 different contributors.

The new tool works with Gatekeeper, a tool for using OPA to secure Kubernetes clusters, as the same policies can be used by Conftest in the development process and Gatekeeper at runtime. Looking forward, Gareth points out that there is now a discussion about which parts of Conftest should be integrated into OPA directly as Conftest stays focused on the developer experience and making OPA as easy for developers as possible.

CRAIG BOX: Emissary is a new open source project from GitHub. It's a bridge between a service proxy like Envoy or HAProxy and the SPIFFE identity framework runtime. You can use it to apply authentication and authorization policies for ingress and egress in places where you can't use a library.

ADAM GLICK: Microsoft has released an update to their VS Code Docker extension, which allows you to directly deploy a Docker image you are working on to Azure Container Instances. It also lets you set up groups of containers to deploy, connect to the container shell, and view logs from the container. The new version is available by updating the Docker extension from Microsoft in VS Code.

CRAIG BOX: In an article published in the latest ACM Queue magazine, monitoring PM Beth Cooper and UX researcher Charisma Chan lift the lid on debugging incidents in Google's distributed systems. The article covers the outcomes of research on how Google engineers debug production issues, including the types of tools, high-level strategies, and low-level tasks they use in varying combinations to debug effectively.

It summarizes the common engineering journeys for production investigations and shares examples of how experts debug complex distributed systems. For example, a software engineer is more likely to consult logs, whereas an SRE is more likely to rely on service health metrics to isolate an issue. Finally, the article extends the Google specifics of the research to provide some practical strategies that you can apply in any organization.

ADAM GLICK: The HashiCorp Consul Service for Azure is now generally available. HCS on Azure lets you provision HashiCorp-managed Consul clusters directly through the Azure portal. Use cases include service discovery, automated network configuration, and service mesh. This product is separate to their recently announced HashiCorp Cloud Platform, which offers Consul on AWS in beta.

CRAIG BOX: What's better than one Gloo API gateway? Many Gloo API gateways all gloo'd together. The team at Solo.io have added federation support to the enterprise version of Gloo for people running in multiple clusters.

It's implemented with custom resources running in the cluster of your choice, a common multi-cluster pattern. If you're sad you can't use this in the open source version, you can always attend the upcoming Enterprise e-webinar to learn more.

ADAM GLICK: And now for the security section. Amazon has worked with the Center for Internet Security to develop a security benchmark for the Elastic Kubernetes Service. This joins the base Kubernetes and GKE benchmarks, as well as many other CIS benchmarks for other Amazon services.

CRAIG BOX: Aqua Security has made two updates to the platform. The first is the release of Aqua Wave, a SaaS-only version of their basic security suite focused on development and runtime environments. The second is for Aqua Enterprise, their product that can run on-premises or as a SaaS.

The enterprise version has added risk-based insights for vulnerability prioritization, VM security controls, including file integrity monitoring, otherwise known as antivirus, as well as system integrity monitoring for Linux and registry monitoring for Windows. Role-based access controls across deployments, teams, and apps round out the release, along with visualizations for risks in Kubernetes clusters. Both products are generally available.

ADAM GLICK: Daniel Berman of Snyk (with a Y) has announced their new Vulnerability Prioritization Tools, which use factors like CVSS, age of vulnerability, existence of exploit code, and accessibility to provide a prioritization score. Part of this includes the beta launch of their new Reachable Vulnerabilities functionality, which is a code tracing tool that analyzes if vulnerabilities are actually accessible as part of your code execution path.

He also announced a Security Policies beta that lets teams set their own prioritizations for vulnerability types and scores to meet their needs. Reachable Vulnerabilities is part of Snyk's standard paid tier while Security Policies is available in their Pro tier.

CRAIG BOX: Two new security vendors emerged from stealth this week. First, Carbonetes (with a C), the self-proclaimed first container application security testing as a service solution. It was created to bring together a number of tools into one platform, including scanning for software dependencies and vulnerabilities, licenses, configuration, secrets, and malware.

Second, Prevasio, the self-proclaimed first dynamic threat and vulnerability analysis system for Docker containers. Their engine combines the Trivy scanner from Aqua with their own machine learning models as well as the ability to perform automatic penetration tests.

ADAM GLICK: Finally, Adam Gluck of Uber-- no relation-- has posted about their path to Domain-Oriented Microservices Architecture, or DOMA for short. DOMA applies domain-driven design to a microservices architecture. In practice, this means focusing around domains of microservices that are all related.

Collections of domains are collected into a layer which defines what dependencies the services in that layer can access. Single access points for layers are called gateways. Finally, domains are agnostic to each other, but define an extension architecture if they need to be expanded.

Other Adam then goes on to describe Uber's architecture and the benefits of this design. He also helpfully defines when a company should consider adopting a DOMA approach and when they shouldn't.

CRAIG BOX: And that's the news.


ADAM GLICK: David Oppenheimer is a software engineer at Google Cloud. He's worked on Kubernetes since 2014, and its predecessors, Omega and Borg, before that. And he is one of the authors of the Borg whitepaper. His work on Kubernetes has focused primarily on scheduling and multi-tenancy, and he was the co-founder of SIG Scheduling and the Kubernetes Multi-Tenancy Working Group. Welcome to the show, David.

DAVID OPPENHEIMER: Thanks. It's great to be here. I'm a big fan of the podcast.

CRAIG BOX: Thank you. That's very kind. You've been at Google since February 2007. What was it like back then?

DAVID OPPENHEIMER: At the beginning of 2007, I think Google still felt like a small company, even though it had more than 10,000 employees or something. I remember in our Noogler orientation, there was this one-hour session explaining how web search worked. And you really came away from it feeling like you understood the whole crawl, indexing, and serving pipeline at a pretty decent level of detail.

I think today, you could barely scratch the surface on how web search works in one hour. So that was one of the things that I remember kind of being impressed with, thinking that I kind of felt like I understood that much of such a critical part of the company from just a one-hour seminar.

And also, I guess another thing that comes to mind is that someone like me who was working in infrastructure, it was basically possible to know what all of the infrastructure systems used across the company were-- not all of their internal details, but at least to know what they were called and what they did. Today, there's just too many systems for that to be possible.

CRAIG BOX: Do you think that's true of Kubernetes as well? Five years in, is it possible for one person to come along and understand the whole thing?

DAVID OPPENHEIMER: I don't think so, not without a long time, a lot of study, and getting your hands on lots of parts of the code base. I think that definitely in the early days, it was possible to feel like you understood the whole thing. But no, I would say that the system today is not something that you can from a practical sense understand all of, at least certainly not without working on it for a really long time.

Luckily, though, you don't have to, because the system is really modular and componentized. So when you want to make a change or understand how a particular part of the functionality works, you usually only have to understand a particular component or a set of components. So thankfully, you don't really have to understand the whole system.

CRAIG BOX: I think they call that Conway's Law, that anything designed by an organization will end up representing the organization that it came from.

DAVID OPPENHEIMER: Yes. Yes, I've heard that one for sure.

ADAM GLICK: Have you been working on scheduling the entire time?

DAVID OPPENHEIMER: I was hired to work on Borgmaster, which is the central control plane for Borg which people have probably heard of. But in case they haven't, it's the system that's responsible for managing all the jobs that run on all the machines in Google's data centers. And I was specifically hired to replace the main engineer who was working on scheduling at the time.

Funny story, I later found out he had left the team to start App Engine with some of his friends. And it just always blows my mind to think that Google was already starting to dip its toe into cloud when I was starting at Google more than 13 years ago. And, of course, listeners probably know App Engine is still alive and well today.

CRAIG BOX: Google's famous, though, for hiring engineers and then assigning them projects. Did you have a background of work beforehand that led you to be hired explicitly for this role?

DAVID OPPENHEIMER: I had done a PhD that had some connection to kind of resource discovery and resource allocation. So yeah, there was some background that I had related to scheduling. But I later found out they were also considering putting me in one of the storage teams. So I don't know how much that factored in.

CRAIG BOX: You could have had a very different career.

DAVID OPPENHEIMER: Yeah, definitely. But I'm really glad that I ended up working on Borg, because it was really a great system to work on and led to working on other cluster management systems later, like you mentioned. So I'm glad how that all turned out.

CRAIG BOX: Tell us about the Borgmaster.

DAVID OPPENHEIMER: I worked on Borgmaster for my first few years at Google. And like you said, my main focus was on scheduling. But the team was small, especially at the beginning. I think there were like six or seven of us when I joined. So I also worked on other parts of the system.

And we were talking before about understanding all of Kubernetes, or is it possible? And Borgmaster, in contrast to Kubernetes, was a very monolithic system. So you kind of had to understand the whole system usually in order to make any kinds of significant changes to it. So I was focusing primarily on scheduling.

But unfortunately, pretty much anything you wanted to do with it, you had to touch the whole system and understand the whole system. And that's definitely a lesson that was learned when the folks who worked on Borg designed Kubernetes.

One of the great things about Kubernetes is the separation between all of these different concerns like API serving, and storage, and authentication, and authorization, and admission control, and scheduling, and controllers, and managing nodes, and so on. And in Borg, these are all tightly integrated. In fact, most of that functionality all ran in the same thread.

And so it was kind of a very monolithic system, and not just from the code perspective. Borg also had a single abstraction for running containers. It was called a job. So this job abstraction had a zillion configuration options and a single state machine associated with it. So both the internal structure and the external API were very monolithic.

And so sometimes people ask about, oh, well, why didn't we open source Borg? Why did we build a new system with Kubernetes? And it's just really unimaginable that you could ever have a large community of people working on a system that was as monolithic as Borg in parallel. So that's definitely part of that.

ADAM GLICK: Was there anything particularly interesting about working on Borg scheduling that attracted you to it?

DAVID OPPENHEIMER: Well, scale is always the first thing people assume is interesting. But actually, I'd say a bigger challenge was needing to support mixed workloads. We had to run a mixture of performance-sensitive user-facing production workloads alongside less critical batch workloads not just in the same cluster, but also on the same machines in order to maximize utilization.

And even the storage servers were running in the same clusters. So there was always a question of how to prevent interference between different workloads and how to manage overcommitting resources at the node level and the cluster level to maximize efficiency while not causing a lot of unpredictability.

And by unpredictability, I mean like getting latency spikes on your servers or getting a lot of preemption. So I would say that as much as scale is an issue, these issues around running mixed workloads were actually more interesting and more critical to the company.

CRAIG BOX: That's one of the first problems that Dawn Chen remembers working on. We spoke to her in an earlier episode, and also Brian Grant, both of them longtime Googlers. But you've actually been there a little longer than both of them. Do you remember when they first joined the team?

DAVID OPPENHEIMER: Dawn started a few months after me. She was hired to work on Borglet, which is the Borg equivalent of Kubelet. I think there were like three people on the Borglet team at the time, and Dawn was the fourth.

Borglet was an amazing piece of software, and I'm pretty sure it was the first large-scale production use of cgroups. Anyway, I guess, in some sense, I've been working with Dawn for more than 13 years, though it's kind of funny, because we've always worked on different parts of the systems that we've worked on.

CRAIG BOX: And Brian Grant?

DAVID OPPENHEIMER: Brian joined Google a few months after me. But initially, he wasn't working on cluster management. So I didn't actually know him until he joined the Borgmaster team, which was sometime in early 2009.

And Brian was given a mandate to make Borg less monolithic, which would enable us to make changes more quickly and with less risk, the same benefits you get from breaking up monoliths today. He was also given a mandate to improve performance, because we were starting to hit scaling limits in many parts of the system.

CRAIG BOX: So is that where Omega started?

DAVID OPPENHEIMER: Brian drove a really huge amount of progress in componentizing Borg and improving its scalability. And I would say the fact that Borg is still alive and well today is a testament to his work and, of course, the work of many people after him.

But eventually, he and some others came to the conclusion that addressing the next decade of challenges couldn't be done with just incremental changes to Borg. There was a feeling that we could make a bunch of improvements at the same time if we built a new system. It would be much more modular than Borg. It could have specialized handling of different workload types in different components.

And it could integrate the machine management workflows like draining machines, and repairs, and machine software updates. So there were a bunch of things that it just didn't seem were possible to integrate into Borg and improve in Borg incrementally, and so that's kind of how Omega started.

CRAIG BOX: Recently we spoke with Wojciech Tyczynski, who was with the Omega team in Warsaw. And we talked about the fact that you can think of Borg as the traditional monolith and Omega is the new let's rewrite it as a microservices application. Do you think that analogy holds true for you as well?

DAVID OPPENHEIMER: Yeah, definitely. I think that that was one of the big differences in the design between Omega and Borg. One of the lessons learned from Borg was that the monolith caused a lot of problems, and Omega was trying to apply that lesson. And so I'm not sure I'd go as far as to call it a microservice architecture, but it was definitely a lot closer to a microservice architecture than a monolith. And so yeah, I would definitely agree with that characterization.

CRAIG BOX: So when did you start working on Omega?

DAVID OPPENHEIMER: I worked on Omega from its beginning in 2010, mostly focusing on scheduling and resource management. And then later, I led a project that was kind of fun where we ported the top half of Borgmaster to run on the bottom half of Omega.

And the goal of creating this Frankensystem, in some sense, was to allow users to run unmodified Google production jobs with all of the existing client tooling for Borg, and all of the Borg configs, and all the Borg user experience on top of the bottom half of Omega so we could get some production exposure and production experience with Omega without telling users they had to rewrite all of their Borg configs, and change their client tools, and so on. And we actually did run this in production for a while.

CRAIG BOX: You mentioned before the Borglet and obviously the Kubelet. I just think it's a fantastic piece of synchronicity that the Omega agent was called the Omlet.

DAVID OPPENHEIMER: Yes. This was not my idea, but in retrospect, quite obvious and quite clever, yes. By the way, I think other folks that you've interviewed who worked on Omega have mentioned this.

But a number of the concepts that we developed for Omega can be seen in some form in Kubernetes, like pod disruption budgets, and taints and tolerations, and using optimistic concurrency control to deal with parallel mutations to shared cluster state.

CRAIG BOX: That's definitely my favorite feature of Kubernetes.

DAVID OPPENHEIMER: I can't tell if you're being ironic there. But it is actually one of the things that makes it possible to have this architecture where you have lots of controllers trying to manage lots of different resources and have kind of independent development of clients and the ecosystem tools. So I think that that actually was one of the key concepts in Omega that made it into Kubernetes.
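The optimistic concurrency control mentioned here can be illustrated with a small Python sketch. It is a toy in-memory model, not Kubernetes or Omega code: the `Store` class, the `Conflict` exception, and `update_with_retry` are hypothetical names made up for this example. The idea it shows is that every write names the version it was based on, a stale write is rejected rather than silently overwriting newer state, and the losing writer re-reads and retries. Kubernetes exposes the same idea through the `resourceVersion` field on API objects.

```python
class Conflict(Exception):
    """Raised when a write is based on a stale version of the object."""


class Store:
    """In-memory stand-in for shared cluster state (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def get(self, key):
        # Missing keys read as version 0 with no value.
        return self._objects.get(key, (0, None))

    def update(self, key, expected_version, value):
        current_version, _ = self._objects.get(key, (0, None))
        if current_version != expected_version:
            raise Conflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._objects[key] = (current_version + 1, value)
        return current_version + 1


def update_with_retry(store, key, mutate, attempts=5):
    """Read, modify, write; on conflict, re-read fresh state and try again."""
    for _ in range(attempts):
        version, value = store.get(key)
        try:
            return store.update(key, version, mutate(value))
        except Conflict:
            continue
    raise Conflict(f"{key}: gave up after {attempts} attempts")


store = Store()
store.update("replicas", 0, 3)  # first write, version becomes 1

# Two controllers read the same version of the object.
version_a, value_a = store.get("replicas")
version_b, value_b = store.get("replicas")

# Controller A writes first and succeeds.
store.update("replicas", version_a, value_a + 1)

# Controller B's write is based on a stale version, so it is rejected.
try:
    store.update("replicas", version_b, value_b + 2)
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True

# B retries against fresh state and its change lands on top of A's.
final_version = update_with_retry(store, "replicas", lambda v: v + 2)
```

No locks are held between the read and the write, which is what lets many independent controllers and client tools work against the same state without coordinating with each other.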

Anyway, in early 2014, I started hearing about the Kubernetes Project from folks like Brian, and Dawn, and Tim Hockin, and it sounded really exciting. So I joined the Kubernetes team in late 2014, and I think Kubernetes was on version 0.4. So I wasn't there from day one, but it was definitely very early days in the project.

ADAM GLICK: What was it like to transition from working on an internal Google project to an open source project?

DAVID OPPENHEIMER: It was a real culture shock to work on Kubernetes. Having people from other companies reviewing your code and design docs, and of course reviewing the code and design docs of people from other companies, it was definitely a big change, especially since Google had traditionally been somewhat secretive.

But the folks in the community who we were working with were amazing, so it was a really positive experience. Another change for me was having people outside of Google use what I was working on after I had been working for many years on internal infrastructure.

And in fact, just being able to talk about what I was working on publicly. I remember going to a bunch of local Bay Area tech companies and some meet-ups to evangelize Kubernetes. And this was a lot of fun, being able to talk about this really cool system that I was working on publicly.

In retrospect, this may be surprising, but our hit rate for actually convincing companies to adopt Kubernetes was quite low back then. They were always very appreciative of hearing about it, and the engineers loved the ideas. We almost always got a polite, don't call us, we'll call you in response. But it was still a lot of fun.

CRAIG BOX: It sounded a bit too much like the future for them at the time?

DAVID OPPENHEIMER: Yeah, I think so. And I mean, there's a lot of practical challenges to moving a company's cluster management infrastructure from whatever they were using at the time, which was usually some kind of homegrown system, entirely homegrown or something that was built on top of Mesos or another system to something brand new.

So there's a lot of challenge there to overcome those barriers. So our feelings were not hurt, but it was definitely an interesting experience. But I'd say I guess the one change that I had mixed feelings about in moving into the open source world was moving from Google's internal developer tools to open source tools.

But the open source tools are definitely better today than they were six years ago. So I think this is less of an issue today, but it definitely took some getting used to.

CRAIG BOX: You've been working on scheduling in Borg and then in Omega. You started working on Kubernetes, so is it fair to assume you started scheduling in that as well?

DAVID OPPENHEIMER: Yeah. When I started working on Kubernetes, I was primarily working on scheduling, though I also worked with a lot of the internal Google teams who were working on some of the other areas, like scalability, and auto scaling, and logging, and monitoring.

And, of course, scheduling is closely related to node-level resource management and quality of service. So I also worked a lot with folks like Dawn and the SIG Node folks who were leading that part of Kubernetes. And then at some point along the way, I started SIG Scheduling with Timothy St. Clair, who was at Red Hat at the time. And that really increased the community interest in the scheduler, and we started getting more external contributions to that part of the system.

CRAIG BOX: And looking back at how the scalability of Kubernetes has changed in my chat with Wojciech, we looked at a system that was designed to support 100 nodes at 1.0, and now supports 5,000 nodes with implementations up to 15,000 nodes in the wild. So there's clearly been a huge engineering change in what the program can do over the five, six years it's now been in existence. What has changed in the scheduling system since those early days?

DAVID OPPENHEIMER: Well, certain aspects of the Kubernetes scheduler were similar to today, like the idea of scheduling a single pod at a time, and picking the node for a pod by first applying a set of predicate functions that tell you which nodes the pod is allowed to run on. And then applying a set of what we call priority functions to pick the best node from the ones that pass through the predicate filtering stage.

So that's pretty much still the same. And this Kubernetes scheduler worked well if you had a very homogeneous cluster in terms of the nodes being identical, and if you had a single workload running on the cluster like a web server with multiple replicas.
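The filter-then-score flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the real scheduler: the `Node` and `Pod` classes, the two predicates, and the single priority function are hypothetical stand-ins invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: float  # cores still unallocated (illustrative unit)
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    name: str
    cpu_request: float
    node_selector: dict = field(default_factory=dict)


# Predicates answer yes or no: may this pod run on this node at all?
def fits_resources(pod, node):
    return node.free_cpu >= pod.cpu_request


def matches_selector(pod, node):
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


# Priority functions return a score; a higher score means a better fit.
def most_free_cpu(pod, node):
    return node.free_cpu - pod.cpu_request


PREDICATES = [fits_resources, matches_selector]
PRIORITIES = [most_free_cpu]


def schedule_one(pod, nodes):
    """Pick a node for a single pod, or return None if no node is feasible."""
    feasible = [n for n in nodes if all(p(pod, n) for p in PREDICATES)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: sum(f(pod, n) for f in PRIORITIES))


nodes = [
    Node("node-a", free_cpu=2.0, labels={"zone": "z1"}),
    Node("node-b", free_cpu=6.0, labels={"zone": "z2"}),
    Node("node-c", free_cpu=4.0, labels={"zone": "z1"}),
]
pod = Pod("web-1", cpu_request=1.0, node_selector={"zone": "z1"})

chosen = schedule_one(pod, nodes)                    # node-c
unschedulable = schedule_one(Pod("big", 10.0), nodes)  # None
```

Here `node-b` has the most free CPU but is filtered out by the selector, so the scoring stage only ever compares `node-a` and `node-c`. Keeping the two stages separate means new constraints and new preferences can be added as independent functions.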

But we knew from our experience at Google that people were going to want to create clusters with heterogeneous nodes like nodes with different CPU types, and also that people were going to want to run heterogeneous workloads like mixing latency-sensitive servers with opportunistic batch workloads. And we also knew that people were going to want to control the layout of pods from a single workload relative to one another.

For example, packing pods together in the same cloud availability zone to save on network charges or spreading pods across failure domains to improve the workload's fault tolerance. So many of the scheduling features we added were geared towards accommodating these different types of heterogeneity.

For example, Kubernetes had the concept of a label selector from the earliest days, even before I joined the project. The idea there is that you put labels on nodes and label selectors on pods, and that lets you constrain the set of nodes a pod can schedule onto.

And so one of the first things we did was generalizing that concept with a feature called node affinity and anti-affinity, which basically lets you use much more expressive language to select nodes based on labels. For example, you could say, put this pod in one of these cloud zones instead of having to specify which cloud zone.


DAVID OPPENHEIMER: And one of the other things we did there was to add kind of a preference form of it, which would allow you to say, try to put the pod on a node that matches these label criteria, but don't block the pod from scheduling if you can't meet the criteria. So there is kind of like this concept of a hard and soft version of the node affinity and anti-affinity constraint.
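The difference between the hard and soft forms can be shown with a short Python sketch. It assumes a simplified model in which a term is just a label key plus a set of allowed values; `matches`, `pick_node`, the weights, and the node and zone names are all hypothetical and chosen only for illustration. Hard terms remove nodes from consideration, while soft terms only affect the ranking of the nodes that remain.

```python
def matches(labels, terms):
    """True if every (key, allowed_values) term is satisfied by the labels."""
    return all(labels.get(key) in allowed for key, allowed in terms)


def pick_node(nodes, required=(), preferred=()):
    """Choose a node name, or None if the hard terms cannot be met.

    nodes:     dict of node name -> label dict
    required:  hard terms as (key, allowed_values); failing one excludes the node
    preferred: soft terms as (weight, key, allowed_values); they only rank nodes
    """
    feasible = {name: lbl for name, lbl in nodes.items() if matches(lbl, required)}
    if not feasible:
        return None  # hard constraint unmet, so the pod stays pending

    def score(name):
        labels = feasible[name]
        return sum(w for w, key, allowed in preferred if labels.get(key) in allowed)

    # Sorting the names first makes ties deterministic in this sketch.
    return max(sorted(feasible), key=score)


nodes = {
    "n1": {"zone": "zone-a", "cpu": "older"},
    "n2": {"zone": "zone-b", "cpu": "newer"},
    "n3": {"zone": "zone-c", "cpu": "newer"},
}
required = [("zone", {"zone-a", "zone-b"})]  # "one of these zones"
preferred = [(10, "cpu", {"newer"})]        # "ideally the newer CPU type"

best = pick_node(nodes, required, preferred)

# With n2 unavailable, the soft preference cannot be met but the pod still lands.
without_n2 = {name: lbl for name, lbl in nodes.items() if name != "n2"}
fallback = pick_node(without_n2, required, preferred)

# A hard term that no node satisfies blocks scheduling entirely.
blocked = pick_node(nodes, required=[("zone", {"zone-z"})])
```

In this toy run `best` is `n2`, `fallback` is `n1`, and `blocked` is `None`, which mirrors the "try, but don't block" behavior of the preference form.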

ADAM GLICK: When would someone want to use the soft constraint? The hard constraint makes sense to me if you need a GPU, for instance. You either have something there or you don't. You don't want to schedule it where it can't do it. What would be a good use case for soft affinity?

DAVID OPPENHEIMER: An example of that would be, say you have several different processor types in the cluster. And for some reason, your load runs best with one of those processor types. But you don't want to prevent the job from running just because all of the machines with that processor type are occupied at the time.

So then you might want to use one of these preference versions of the node affinity so that if all of, let's call them the good nodes that have the processor type that you prefer are occupied, you'll still get the job running on the cluster, and not prevent it from running. But if they are available, then it'll get scheduled onto those nodes that have the processor type that you want.

CRAIG BOX: You talk now about node affinity. There's also a concept of pod affinity.

DAVID OPPENHEIMER: Node affinity lets you say, put these pods on these nodes based on the labels that are on the nodes themselves. Pod affinity and anti-affinity lets you direct pods towards or away from nodes based on the labels of the pods that are currently running on the node as opposed to labels that are on the node itself.

So this lets you say things like, co-locate these pods in the same cloud zone because they communicate a lot, or, spread these pods across racks in a data center so the overall workload can tolerate rack failure. Or you can say, these two workloads are known to interfere with each other, so you shouldn't ever run them on the same node.

Or an even more extreme version of that is, like, this pod or this workload is so security-sensitive or performance-sensitive that I don't want it to ever share a node with any other pods. So this pod affinity and anti-affinity concept, it lets the scheduler look at what else is running on the node and make a decision about the scheduling based on what's already running on the node.
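A minimal Python sketch of the anti-affinity case follows. It assumes a simplified model where each node carries a label naming its topology domain and each running pod is recorded with the node it sits on; the function name `anti_affinity_filter` and the sample labels are hypothetical. The same function covers both "never on the same node" and "never on the same rack" by changing which node label defines the domain.

```python
def anti_affinity_filter(nodes, running, avoid, topology_key):
    """Return node names whose domain holds no running pod that matches `avoid`.

    nodes:        dict of node name -> node labels (must include topology_key)
    running:      list of (node name, pod labels) for pods already placed
    avoid:        label dict; a running pod matches if it has all of these labels
    topology_key: node label that names the domain, e.g. "hostname" or "rack"
    """
    blocked_domains = {
        nodes[node_name][topology_key]
        for node_name, pod_labels in running
        if all(pod_labels.get(k) == v for k, v in avoid.items())
    }
    return sorted(
        name for name, labels in nodes.items()
        if labels[topology_key] not in blocked_domains
    )


nodes = {
    "n1": {"hostname": "n1", "rack": "r1"},
    "n2": {"hostname": "n2", "rack": "r1"},
    "n3": {"hostname": "n3", "rack": "r2"},
}
running = [("n1", {"app": "db"})]  # one database replica already on n1

# Keep a second replica off the same node.
per_node = anti_affinity_filter(nodes, running, {"app": "db"}, "hostname")

# Keep a second replica off the same rack, to tolerate a rack failure.
per_rack = anti_affinity_filter(nodes, running, {"app": "db"}, "rack")
```

Here `per_node` is `["n2", "n3"]` and `per_rack` is `["n3"]`. Affinity is the mirror image: keep only the nodes whose domain does contain a matching pod.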

CRAIG BOX: How can the scheduler make decisions based on the relative priority between workloads?

DAVID OPPENHEIMER: There is this more recent feature we built to do exactly that, and it's yet another form of heterogeneity. This one is heterogeneity between, like you said, the importance of different workloads. So the feature there is what we call priority and preemption.

And by the way, when I say that we built all these features, I'm really talking about the Kubernetes community as a whole. The scheduling work from the beginning was always a community effort with folks from all around the world participating.

But anyway, back to priority and preemption: this was one of the most critical features of the Borg scheduler. The place where it really shines is when you have a mixture of batch workloads and serving workloads, and you have fixed-size clusters like an on-prem physical data center or a cloud-based cluster where you're limiting the maximum cluster size to avoid runaway costs.

And the idea is that you assign priorities to each of your workloads so that if the cluster is full and a new workload is submitted, the new workload can evict a running workload if that running workload's considered less important. This lets you do things like submit a bunch of batch jobs to run opportunistically in the cluster, and you don't need to worry about those batch jobs blocking your critical workloads.

Or you can turn on horizontal pod autoscaling for your server deployment. And you can be sure that if you get a load spike and the horizontal pod autoscaler scales that up, the added server replicas, because you give them high priority, will be able to evict less important workloads in order to scale up to accommodate that load.
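A minimal sketch of the preemption decision, assuming a single node and CPU as the only resource. The function name `pods_to_preempt` and the tuple layout are hypothetical, and real schedulers weigh more factors when choosing victims.

```python
def pods_to_preempt(node_capacity, running, new_pod):
    """Return the names of pods to evict so new_pod fits, or None if it cannot fit.

    running is a list of (name, priority, cpu); new_pod is (name, priority, cpu).
    """
    _, new_priority, new_cpu = new_pod
    free = node_capacity - sum(cpu for _, _, cpu in running)
    victims = []
    # Walk the running pods from least to most important.
    for name, priority, cpu in sorted(running, key=lambda p: p[1]):
        if free >= new_cpu:
            break  # enough room already, stop evicting
        if priority < new_priority:  # only strictly less important pods are victims
            victims.append(name)
            free += cpu
    return victims if free >= new_cpu else None


# A full 4-CPU node running one batch pod and one serving pod.
running = [("batch-1", 0, 2), ("web-1", 100, 2)]

scale_up = pods_to_preempt(4, running, ("web-2", 100, 2))    # ["batch-1"]
more_batch = pods_to_preempt(4, running, ("batch-2", 0, 2))  # None
```

The new serving replica displaces the batch pod, while a new batch pod cannot displace anything and simply waits.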

ADAM GLICK: That was a Borg feature. And you mentioned a feature from Omega before which made it into Kubernetes, the pod disruption budget. Can you explain what that is?

DAVID OPPENHEIMER: Pod disruption budget is not so much about heterogeneity like those other features that we're talking about. It's about, I would say, enabling safe node maintenance. So imagine that you want to upgrade the operating system in your data center. So one approach is you go node by node evicting the pods from each node, upgrading the operating system, and then putting the node back in service, and then moving on to the next node.

But it's a lot faster if you can upgrade multiple nodes in parallel. But the worry if you upgrade multiple nodes in parallel is that some of your workloads might require a certain minimum number of running replicas at a given time, like a quorum in a distributed storage system or a minimum number of web server replicas that you want to make sure are always running in order to handle sudden load spikes.

And so if you're not smart about how you do the draining of those nodes, the eviction of the workloads on the nodes in order to prepare them for maintenance, you can end up taking down too many replicas of some single workload. And so pod disruption budget lets you specify this minimum number of replicas that have to be up for each workload.

That way, a tool like kubectl drain can drain multiple nodes in parallel safely. It makes sure to never take down a node if it would bring one of these protected workloads below the minimum number of replicas that you've stated.
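The budget check itself is simple arithmetic. Here is a minimal Python sketch, assuming each workload states a minimum number of available replicas; the `drain` function and its data shapes are hypothetical and do not reflect how kubectl is implemented.

```python
def drain(node_pods, healthy, min_available):
    """Try to evict every pod on one node while respecting disruption budgets.

    node_pods: workload names that have a pod on this node.
    healthy: workload -> current healthy replica count (updated in place).
    min_available: workload -> minimum replicas that must stay up.
    Returns (evicted, blocked) lists of workload names.
    """
    evicted, blocked = [], []
    for workload in node_pods:
        # Workloads without a budget are treated as having a minimum of zero.
        if healthy[workload] - 1 >= min_available.get(workload, 0):
            healthy[workload] -= 1
            evicted.append(workload)
        else:
            blocked.append(workload)  # evicting would violate the budget
    return evicted, blocked


healthy = {"db": 3, "web": 5}
budgets = {"db": 3, "web": 2}  # db needs all three replicas for its quorum

evicted, blocked = drain(["db", "web"], healthy, budgets)
# evicted == ["web"], blocked == ["db"], healthy["web"] == 4
```

A drain tool that sees a blocked eviction would wait and retry later, once replacement replicas are healthy elsewhere.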

And speaking of features from Omega, draining works hand-in-hand with a feature called taints and tolerations, which is also something that came from Omega. But I don't want this interview to turn into an hour-long scheduling tutorial, so I'll leave that one for another day.

CRAIG BOX: I do think many of our listeners will listen to the extra special episode we do later on on that topic. Back when people didn't understand the whole problem space of Kubernetes, a lot of people were just considering it a scheduler or a scheduler, depending on where they're from. And there's a lot of conversation about things like one-level versus two-level scheduling. Do those things matter?

DAVID OPPENHEIMER: A lot of the early discussion around scheduling in Kubernetes related to this question of one-level schedulers versus two-level schedulers. So Borg, Omega, and Kubernetes all have one-level schedulers and Mesos has a two-level scheduler.

And, of course, when Kubernetes was introduced, Mesos was very popular, so people naturally made the comparisons. So to kind of explain really briefly what this is all about, in one-level scheduling, you have a scheduler that knows the state of the system in terms of the physical capacity of each node and what's running on each node.

And when the scheduler is asked to schedule a task or whatever the atomic unit of scheduling is-- different systems call them different things-- obviously in Kubernetes, it's a pod, but it could be a container, or a task or whatever-- the scheduler looks at the-- we'll call it a task. It looks at the task requirements.

And if it can find a node that's suitable, it sends some kind of message or writes some state to assign the task to that node. So that's kind of the very familiar way for scheduling to work, and that's the single-level scheduler. Now, one thing that's important to mention is that you can have multiple schedulers operating in parallel and still be a one-level scheduling architecture.

For example, Kubernetes is a one-level scheduling architecture, but it allows multiple schedulers to operate in parallel where each of the schedulers is responsible for a separate set of pods based on a field in the pod spec that's called schedulerName. My point there is that you can have application-specific schedulers even with a single-level scheduling architecture.
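To make the division of labor concrete, here is a small Python sketch of how pending pods could be partitioned by that field. The `pods_for` helper and the `my-batch-scheduler` name are assumptions for illustration; `default-scheduler` is the name Kubernetes uses when the field is left unset.

```python
DEFAULT_SCHEDULER = "default-scheduler"


def pods_for(scheduler_name, pending_pods):
    """Names of the pending pods that a given scheduler is responsible for."""
    return [
        pod["name"]
        for pod in pending_pods
        # Pods that do not set schedulerName belong to the default scheduler.
        if pod.get("schedulerName", DEFAULT_SCHEDULER) == scheduler_name
    ]


pending = [
    {"name": "web"},
    {"name": "spark-driver", "schedulerName": "my-batch-scheduler"},
]

default_queue = pods_for(DEFAULT_SCHEDULER, pending)    # ["web"]
batch_queue = pods_for("my-batch-scheduler", pending)   # ["spark-driver"]
```

Every scheduler still reads the same cluster state, so this remains a one-level design even with several schedulers running side by side.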

And also, you can build application-specific workload abstractions and application-specific lifecycle management logic in a one-level scheduling system. This is exactly what Kubernetes operators do. For example, Spark on Kubernetes has a Spark application CRD and a corresponding controller for managing the application.

So the reason I'm mentioning all of this is because sometimes people associate two-level scheduling with the idea of either application-specific workload APIs, or application-specific workload lifecycle management, or application-specific scheduling. But actually, you can do all of those things in a single-level scheduler. And like I said in the case of Kubernetes, you do it with CRDs, and controllers, and custom schedulers if you want.

ADAM GLICK: So what is two-level scheduling?

DAVID OPPENHEIMER: There's a lot of definitions, I'd say, of two-level scheduling in general in computer science. But when people talk about two-level scheduling in this context, they usually just mean how Mesos does scheduling, really.

And the idea there is that there is this bottom layer resource allocator that knows the state of the cluster in terms of the capacity of the nodes and what's running on them. And then the resource allocator offers the resources that are available in the cluster to higher-level schedulers which make a decision about whether to schedule a task against one of these offers.

So an upper-level scheduler sits around and waits for an offer that it likes and then responds by saying, bind such and such a task to the node that was offered. And you can imagine this approach could be useful if you have multiple schedulers that are in some sense adversarial, because you can rely on that bottom layer resource allocator to enforce fairness and to hide what the other schedulers are doing.
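The offer-and-accept loop can be sketched as follows. This is a toy Python model under stated assumptions, not the Mesos API: the `ResourceAllocator` and `UpperScheduler` classes, their methods, and the CPU-only resource model are all hypothetical.

```python
class ResourceAllocator:
    """Bottom layer: tracks free capacity and hands out offers."""

    def __init__(self, free_cpu):
        self.free_cpu = dict(free_cpu)  # node name -> free CPU

    def offers(self):
        # Offer whatever is currently free on each node.
        return [(node, cpu) for node, cpu in self.free_cpu.items() if cpu > 0]

    def accept(self, node, cpu):
        if self.free_cpu.get(node, 0) < cpu:
            raise ValueError("offer no longer valid")
        self.free_cpu[node] -= cpu


class UpperScheduler:
    """Upper layer: knows only its own tasks and the offers it receives."""

    def __init__(self, pending):
        self.pending = list(pending)  # (task name, cpu needed), in order
        self.placed = {}              # task name -> node

    def consider(self, allocator):
        for node, cpu in allocator.offers():
            # Take at most one task per offer, and only if the offer is big enough.
            if self.pending and self.pending[0][1] <= cpu:
                task, need = self.pending.pop(0)
                allocator.accept(node, need)
                self.placed[task] = node


allocator = ResourceAllocator({"node-a": 4, "node-b": 2})
batch = UpperScheduler([("t1", 3), ("t2", 2)])
batch.consider(allocator)
# batch.placed == {"t1": "node-a", "t2": "node-b"}
```

Note that `UpperScheduler` never sees what other schedulers have placed; it only sees free capacity in the offers.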

But this two-level scheduling approach has some drawbacks because this mechanism where you have these offers coming from the resource allocator makes it so that each scheduler only knows about its own tasks. And this makes it hard to implement some of the features that we talked about earlier, like preemption between workloads and anti-affinity to isolate tasks that aren't supposed to share nodes with one another, because the schedulers don't have a global view of the system.

But having said all of that, Mesos has added features-- like one of them is called optimistic offers, and there's others-- that mitigate some of the downsides of two-level scheduling. And I think that's an important point. At the end of the day, you can add enough features to either of these two models really to make them equivalent. So I think that the debate about one-level scheduler versus two-level scheduler is kind of more philosophical than practical.

CRAIG BOX: In your single-level scheduler example, you said you can apply different schedulers to different pods or workloads inside your cluster. I've seen all sorts of examples of alternative schedulers from someone writing an example in Bash, which simply binds things just to prove that you can, all the way up to a thing called Firmament, which a guy wrote a PhD thesis on, and has now done some work to get that applied to Kubernetes as well.

Do you think those other implementations, perhaps more Firmament than the Bash scheduler, will ever be relevant for commercial use in Kubernetes?

DAVID OPPENHEIMER: I guess my opinion, after working on cluster scheduling for a while, is that the algorithmically interesting aspects of cluster scheduling are actually not that important in practice, at least for typical cloud environments and on-prem environments with commodity hardware. I don't know. Maybe this sounds heretical, but in my experience, relatively simple well-known approaches are almost always good enough.

And what really matters is things that aren't exactly scheduling algorithms per se, like making the workload abstractions clean, and layered, and expressive, providing enough scheduling knobs so that people don't have to resort to manually assigning tasks to nodes-- that was always something we were fighting against in Borg-- and offering well-designed escape hatches and extensibility interfaces so people can customize the system without having to fork the code base.

So I think these things are actually a lot more important than the scheduling algorithm per se, and I think Kubernetes has done really well in all of these respects. But for people who do want to experiment with scheduling, Kubernetes makes it really easy to swap in alternative schedulers. And, in fact, Firmament, the one you mentioned, is available as an alternative to the default Kubernetes scheduler.

So listeners can do a Google search for "Poseidon Firmament scheduler" if they want to play around with it. And I've been listening to the podcast long enough to know that you'll probably put it in the show notes.


ADAM GLICK: You're a wise man. If simple is usually good enough, do you see any interesting challenges in scheduling today?

DAVID OPPENHEIMER: I guess two things come to mind. The first one is how to concisely explain why a pod got scheduled onto a particular node rather than some other node. Now, the scheduler isn't some kind of complex AI system. But in the limit, the scheduling decision can take into account all of the state of the cluster.

So it can be really hard to explain the scheduling decision in an understandable way to a human. And sometimes, a human does want to understand why a scheduling decision was made. Sometimes it's for debugging. Sometimes it's because they're trying to optimize their workload placement through setting scheduler knobs or whatever.

A second challenge is I would say how to combine scheduling preferences to come up with a final decision on where a pod should be scheduled. The way Kubernetes does this is that it applies a set of priority functions, each of which returns a normalized value where a high value means this would be an awesome node for this pod, and a low value means I could take it or leave it.

And then it calculates a weighted sum, where the weights represent the relative importance of the different criteria. And then it chooses the node with the highest weighted sum. But as far as I can tell, there's no really principled way to pick these weights. For example, how do you weigh the importance of picking a node that already has the container images that the pod needs in its cache versus respecting a preference for a particular cloud zone versus maximizing utilization?

Or something even simpler, like, how do you even just balance the desire to pack pods together to achieve high utilization with a desire to spread them out to minimize the possibility of noisy neighbors? I would love to see some principled way of making these decisions. But today, it just seems pretty arbitrary the way we combine these different factors.
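The weighted-sum step, and the way the weights alone can flip the outcome, can be shown with a short Python sketch. The criterion names, scores, and weights below are invented for illustration and are not the scheduler's actual values.

```python
def best_node(scores, weights):
    """Return the node with the highest weighted sum of normalized scores.

    scores: node name -> {criterion: score in 0..100}.
    weights: criterion -> relative importance.
    """
    def total(node):
        return sum(weights[criterion] * score
                   for criterion, score in scores[node].items())

    return max(scores, key=total)


scores = {
    "node-a": {"image_locality": 100, "zone_preference": 0},
    "node-b": {"image_locality": 20, "zone_preference": 100},
}

# Equal weights: node-a totals 100, node-b totals 120.
equal = best_node(scores, {"image_locality": 1, "zone_preference": 1})

# Doubling one weight: node-a totals 200, node-b totals 140.
cache_heavy = best_node(scores, {"image_locality": 2, "zone_preference": 1})
```

With equal weights the result is `node-b`, and with image locality weighted double it becomes `node-a`. Nothing in the inputs says which weighting is right, which is the problem David is pointing at.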

ADAM GLICK: How about the challenges in cluster management more generally?

DAVID OPPENHEIMER: So I think the biggest challenge in cluster management today is probably configuration. I'm probably not breaking any news here. Anybody who's worked with Kubernetes knows that there's a bazillion configuration tools available. And if there was even a kind of good enough configuration system, I think that everyone would have standardized on it by now.

The fact that there are so many config tools just in the Kubernetes ecosystem alone would seem to suggest that this is really a huge unsolved problem, and it's one people don't even always know exactly how to work on. And even within Google, there have been many iterations of configuration systems for Borg, and I don't think anyone has ever really been happy with any of them.

So this isn't what people necessarily think of when they think of cluster management. They tend to think of scheduling, and algorithms, and efficiency, and all those kinds of things. But at the end of the day, I think that one of the biggest unsolved challenges, if not the biggest, is configuration, really.

CRAIG BOX: Do you have a favorite of all the tools that are available at the moment? Have you contributed to the design of any of them?

DAVID OPPENHEIMER: I am definitely not an expert on config. I generally just edit YAML files by hand and use kubectl to apply them. But there are some really great open source config tools developed by some of my colleagues on the Kubernetes team at Google.

One of the tools is called kpt, and another is called Kustomize. Some of the listeners might be familiar with one or both of these. And when you're trying to manage Kubernetes at scale, I think tools like these are really useful.

ADAM GLICK: You've spent 13-plus years solving some of the trickiest scheduling problems out there and coming up with incredible systems that really help amazingly complex systems schedule things out appropriately. Management of a calendar is one of those things that I've always seen as just like no matter who you are, you have a calendar, and it is just chock full of things.

That seems to me to be the biggest scheduling challenge that is facing technology and humanity as a whole right now. When can we have a scheduler that works for calendars?

DAVID OPPENHEIMER: Scheduling is a huge area. And when we talk about it in the context of Kubernetes, we have this very specific meaning. But you're bringing up a really good larger point, which is scheduling applies to your daily life. It applies to businesses trying to optimize their business processes and airlines scheduling flights and crews.

And even in computer science, there are schedulers at all different levels of the system: in the operating system, in the hardware, at the cluster level, and then like you said, human scheduling. And I don't know, but I'm sure there are a lot of people working on that calendar scheduling problem. And I share your frustration with what's available today, and I do look forward to improvements there.

CRAIG BOX: Well, there are many other things we could talk to you about. You've been doing a whole bunch of work on multi-tenancy. But we'll have to schedule some time to talk to you about that in the future.

DAVID OPPENHEIMER: I would look forward to it.

CRAIG BOX: Thank you very much for joining us today.

DAVID OPPENHEIMER: Thanks. I enjoyed it.

CRAIG BOX: You can find David Oppenheimer on Twitter at @davidopp.


Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes as well as links to subscribe. Until next time, take care.

CRAIG BOX: See you next week.