#43 March 6, 2019

Borg, Omega, Kubernetes and Beyond, with Brian Grant

Hosts: Craig Box, Adam Glick

Brian Grant joined the Borg team in 2009, and went on to co-found both Omega and Kubernetes. He is co-Technical Lead of Google Kubernetes Engine, co-Chair of Kubernetes SIG Architecture, a Kubernetes API approver, a Kubernetes Steering Committee member, and a CNCF Technical Oversight Committee member, where he’s sponsored 11 CNCF projects. Your hosts talk to him about all those things.


ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.


ADAM GLICK: How was last Tuesday?

CRAIG BOX: I didn't have a last Tuesday. It's a shame, really, the politics of the International Date Line. I got on the plane on Monday evening, and I arrived here in New Zealand on Wednesday morning. So it just kind of went away, really. I hope you enjoyed last Tuesday.

ADAM GLICK: I enjoyed it enough for both of us.

CRAIG BOX: That's good. In return, though, I do get two Saturdays in about a month's time.

ADAM GLICK: That sounds like a fair trade. I think I'd trade Tuesdays for Saturdays.

CRAIG BOX: I'll have to be flying on that day, though, so it's a--

ADAM GLICK: Speaking of which, I saw you shot across a picture from where you're at. You want to explain the beautiful view that you're getting to look at?

CRAIG BOX: Yes, so as we record today, I am in Mount Maunganui. It is a beach town in New Zealand and is a lovely peninsula that goes off to a dormant volcano. And you can walk up the -- Mauao is the name of the volcano, and you can walk to the top of it. And it takes about an hour to get up there, and we walked up at sunset, and some lovely colors, lovely view on either side of the peninsula.

There's a bay on one side with some yachts and the occasional cruise ship parked there and a surf beach on the other side, where we have both surfing and surf lifesaving, which is basically people throwing themselves into the ocean to try and be thrown back and then people jumping into the ocean to catch those who didn't do such a good job of the first piece. So how's life back on dry land?

ADAM GLICK: It is surprisingly dry, especially here in Seattle, where we've had a little less rainfall and a lot less snow. It's been fantastic. Gotten a chance to do a little bit of Netflix binge-watching-- always a quality weekend activity.

CRAIG BOX: What was your show of choice this time?

ADAM GLICK: This time it was catching "Russian Doll," which, if you haven't seen it, without providing any spoilers, it's basically like a jaded New York version of "Groundhog Day." So if that sounds like your kind of jam, then I'd recommend checking it out. It was certainly an interesting story with some nice twists to it.

CRAIG BOX: Did you watch the movie "Edge of Tomorrow"?


CRAIG BOX: I hear they're making a sequel to that.

ADAM GLICK: Are they? Well, we'll watch Tom Cruise again die over and over but always come back.

CRAIG BOX: Well, it's hard to tell. They might try and make it a prequel. There are a couple of different ways they could go with it, but it was an underrated, excellent movie.

ADAM GLICK: Underrated, excellent movie. Is that an oxymoron?

CRAIG BOX: Well, underrated by everybody but me. I rated it highly.

ADAM GLICK: Shall we get to the news?


CRAIG BOX: Rancher Labs has announced 'keys', spelled K3s, and as they haven't provided a pronunciation guide, I get to decide how it's said out loud. First announced five months ago as "Kubernetes without the features I don't care about," the project has been toned down and relaunched as a top-level Rancher open source project.

It's a single 40-megabyte binary, a kind of maxikube, if you'll allow me, which includes everything you need to run Kubernetes, from iptables to containerd, an optional network, and Ingress controllers. K3s is pitched towards Edge and IoT use cases, and nodes can run on machines with as little as 75 megs of RAM and 200 megs of disk space with x86-64, ARMv7, and ARM64 variants.

ADAM GLICK: VMware announced their third Kubernetes offering-- Essential PKS. This offering, which was previously known as the Heptio Kubernetes subscription, provides enterprise support for whatever upstream Kubernetes parts you choose to run.

VMware's portfolio now consists of Essential PKS, Cloud PKS, which is their SaaS platform initially launched as VMware Kubernetes Engine, and VMware Enterprise PKS, their on-prem enterprise version.

CRAIG BOX: I'm sure those last two are pretty essential also.

Banzai Cloud has announced the alpha release of an operator for Istio. The Istio operator allows deployment and management of Istio directly with the Kubernetes API and addresses problems with the current Helm-based template deployment method.

The operator model lets you specify outcomes and lets the system make sure the right things happen, including later changes to configuration like enabling mutual TLS across a mesh. The operator also includes the ability to configure a multicluster mesh and manage automatic sidecar injection.

ADAM GLICK: This week's Kubernetes vulnerability is named CVE-2019-1002100 and is a denial of service that can be performed against the kube-apiserver. An authorized user with write permissions to the API can cause the API server to consume excessive resources while handling a write request.

The issue is of medium severity and can be resolved by upgrading the kube-apiserver to a new release. The latest versions that have the fix are version 1.11.8, 1.12.6, and 1.13.4. All earlier versions of the kube-apiserver are vulnerable. You can mitigate this vulnerability by removing patch permissions from untrusted users.

CRAIG BOX: Congratulations to containerd, the fifth project to graduate from the school of CNCF, joining Prometheus, CoreDNS, Envoy, and of course, Kubernetes. containerd was born at Docker in 2014 and became part of the CNCF in March 2017. Since then, it has become an industry standard for container runtimes, focusing on simplicity, robustness, and portability.

It is widely adopted by modern versions of the Docker engine and Kubernetes environments, such as GKE, as the layer between the Docker or Kubernetes endpoint and the OCI runC executor.

ADAM GLICK: Scytale, spelled with a C, a startup founded by ex-Googler PMs to tackle service identity, has announced they have taken $5 million in venture funding and launched a commercial product, Scytale Enterprise. Scytale are the primary authors of the CNCF projects SPIFFE and SPIRE. Their products address service-to-service authentication and authorization in a similar way that products based on OAuth or OpenID address user-to-service authentication.

The enterprise product is billed as an industry-first service identity management for the cloud-native enterprise and claims to enable frictionless service authentication across clouds, container, and on-premises platforms.

CRAIG BOX: Red Hat has launched OperatorHub.io, a directory of Kubernetes operators in collaboration with Google Cloud, Azure, and AWS. OperatorHub enables developers and administrators to find and install curated, operator-backed services with a base level of documentation, active maintainership by communities or vendors, basic testing, and packaging for optimized lifecycle management on Kubernetes.

For developers, it provides a common registry where they can publish their operators with descriptions and details of version, image, and code repository, and make updates to published operators as new versions are released. Users get the ability to discover and download operators from a central location, with code that has been scanned for known vulnerabilities and prescriptive examples of the custom resources that they will need to configure.

ADAM GLICK: RightScale's Annual State of the Cloud Report came out last week. There's lots of interesting information in there, but for those of us who love Kubernetes, the big news was that Kubernetes use from respondents almost doubled, increasing from 27% to 48% in the past year alone. Additionally, the number of respondents saying that they plan to use Docker, and those that planned to use Kubernetes, was equal for the first time, perhaps showing that container systems are starting to scale enough to spot the chocolate and peanut butter-like magic of containers and orchestration. No word yet on if Istio will eventually provide a caramel center to this confectionery yumminess.

CRAIG BOX: And that's the news.


CRAIG BOX: Brian Grant is a principal engineer with Google Cloud, and a co-founder of the Kubernetes project. This is going to take a while. He is a co-technical lead of Google Kubernetes Engine, co-chair of the Kubernetes SIG Architecture, a Kubernetes API approver, Kubernetes Steering Committee member, and CNCF Technical Oversight Committee member where he sponsored 11 CNCF projects.

Brian's experience while technical lead of Google's internal container platform Borg motivated some of the key APIs in Kubernetes, such as pods, labels, and watch. He started to work on the API design for what would become Kubernetes around October 2013, before the coding was started, and has since then, at some point, led numerous areas of the project, including API machinery, kubectl, workload APIs, scheduling, release, documentation, contributor experience, and many others. Brian, welcome to the show.

BRIAN GRANT: Hi Craig, hi Adam.

CRAIG BOX: Did I pronounce that right? kube-c-t-l?


CRAIG BOX: How would you have it said?

BRIAN GRANT: I say "kube control."

CRAIG BOX: Brilliant.

ADAM GLICK: That's quite a history you have with it, and I know that previously you've done some work with the Borg and Omega projects here at Google. For those that aren't familiar with those, can you give a little bit of a description of what Borg and Omega are, how they're different, and how those led into Kubernetes?

BRIAN GRANT: Sure. Borg is Google's internal container management platform. That project was started back in 2003. It was the successor to previous application management projects that were in use at Google: WorkQueue, which was used for batch jobs, and the Babysitter, which was used for continuously running services managed on servers.

At that time, if you think back, that was sort of in the early stages of the multicore era, and Google was famous for using kind of cheaper, off-the-shelf hardware. So the machines in Google were not ginormous boxes with zillions of processors. The early systems were fairly simple. I think WorkQueue scheduled workloads to machines just based on memory alone, and all other aspects of the machines were more or less equivalent and it didn't matter.

Once the multicore era started, people realized we need something more powerful that could binpack, and that was really the motivation for Borg. And it actually was designed to slide right into the WorkQueue hole, and be used to schedule MapReduces, and it even runs today on the same port that the WorkQueue ran on. And it also subsumes some of the roles of Babysitter, so it actually created this unified platform where you could run both services and batch workloads, and other workloads, all kinds of workloads-- eventually, almost everything at Google now runs on Borg.

And around the time I started at Google, there was a big push called Everything On Borg to make that big transition, to get all the workloads onto Borg. Now, Omega is a project I founded shortly after I joined the Borg team, in fact. I joined the Borg team back at the beginning of 2009, so I've already been working in this area 10 years, and I worked on high-performance computing before that, so I feel like I've been doing this for a very long time.

But you know, I observed how people were using Borg, and some of the issues it had with extensibility and scalability, and addressing some use cases better. And that motivated the Omega project, which was really a project trying to figure out how we could improve some of the underlying primitives and internal infrastructure in Borg.

And it was a greenfield type of project, started from scratch. If you squint, you can see many of the same attributes of Omega in Kubernetes. In fact, sometimes I say Kubernetes is more of an open source Omega than an open source Borg. A lot of the same folks who worked on Omega are working on it, and a lot of the ideas that went into Omega made their way into Kubernetes.

Some of them were also folded back into Borg. It did run in production for a time, but over time it was unified back into Borg. So ideas like multiple schedulers actually even predated some of the Omega work on that.
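
[Editor's note: Brian contrasts WorkQueue, which he recalls placing work by memory alone, with Borg's ability to binpack. The sketch below is only an illustration of packing on more than one resource using a generic first-fit-decreasing heuristic. It is not Borg's scheduler, and the `Machine` class, task names, and resource numbers are all hypothetical.]

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    cpu: float          # free CPU cores (hypothetical units)
    mem: float          # free memory in GB (hypothetical units)
    tasks: list = field(default_factory=list)

    def fits(self, cpu, mem):
        # A task fits only if there is room in every dimension.
        return cpu <= self.cpu and mem <= self.mem

    def place(self, name, cpu, mem):
        self.cpu -= cpu
        self.mem -= mem
        self.tasks.append(name)


def first_fit(tasks, machines):
    """Place each (name, cpu, mem) task on the first machine with room.

    Tasks are tried largest-CPU first (first-fit decreasing).
    Returns the names of tasks that could not be placed.
    """
    unplaced = []
    for name, cpu, mem in sorted(tasks, key=lambda t: (t[1], t[2]), reverse=True):
        for machine in machines:
            if machine.fits(cpu, mem):
                machine.place(name, cpu, mem)
                break
        else:
            unplaced.append(name)
    return unplaced


machines = [Machine(cpu=4, mem=8), Machine(cpu=4, mem=8)]
tasks = [("web", 2, 2), ("batch", 3, 1), ("cache", 1, 6), ("db", 2, 4)]
unplaced = first_fit(tasks, machines)
```

With these made-up numbers, the CPU-heavy "batch" task and the memory-heavy "cache" task end up sharing one machine, which a memory-only placement rule would have no reason to arrange.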

CRAIG BOX: What were the problems with Borg at the time that Omega was meant to solve?

BRIAN GRANT: One of the big issues was that Borgmaster, in particular-- the control plane of Borg-- was not designed to have an extensible concept space. It had a very limited, fixed number of concepts. It had machines and jobs-- tasks weren't even really a first-class concept, just arrays of tasks, which were jobs.

It had allocs, which were arrays of resource allocations that jobs could be scheduled into, and a primitive called packages, which are actually just commands and metadata about packages that would be installed. Packages are Borg's--

CRAIG BOX: Container image?

BRIAN GRANT: Yeah, it's like, part of a container image. Container images are somewhat monolithic, except for things like volumes. The package mechanism inside of Borg was designed to be more composable.

And we talked about MPM publicly in a few places, which is the package manager we use internally.

CRAIG BOX: What does the M stand for?

BRIAN GRANT: Midas. Midas Package Manager.

CRAIG BOX: Not to be confused with NPM.

BRIAN GRANT: Not to be confused with NPM, correct. It's MPM. But you can merge together a number of packages into a single file system image in your container, and actually, I think that would be a powerful primitive in Kubernetes as well. You could just mount multiple container images into the same file system, and that would be really convenient for distributing configuration for an application, or site assets, or separately managing the runtime from the application. So it's actually issue #831 in Kubernetes.

ADAM GLICK: We'll get some people working on that straight away.

BRIAN GRANT: Yeah, so similar to Kubernetes, Omega had a more general object model. So it was easy to introduce new concepts, and it had a separate store that was actually Paxos-based, built on the same library that our Chubby key-value store uses internally.

ADAM GLICK: When you say Chubby, just a quick side note, what is that for people who aren't aware of what Chubby is?

BRIAN GRANT: So you can Google it and read the paper from 2006. It is a key-value store that inspired ZooKeeper and Consul and etcd. So Borgmaster stored its state in a checkpoint that was written by the individual Borgmaster binaries. It didn't have a separate storage component, and that made certain things, like disaster recovery, much more challenging.

A separate store can be managed separately from the control plane instances themselves. So that was one of the changes. Eventually, a Paxos-based implementation of the store was folded back into the Borgmaster, which benefited from reusing the very solid production code that Chubby used, for example. And that actually also enabled it to stop using Chubby for some purposes, like leader election, because it then could use its own store for that.

CRAIG BOX: Borg was able to learn from Omega, and Kubernetes was able to learn from both Borg and Omega. There are obviously areas where you said, hey, we shouldn't have done that. One of the areas that's been mentioned is ports, and the way Borg manages ports versus Kubernetes. Can you talk a little bit about that, and then maybe some of the other areas where you've said, hey, Borg did it wrong, and we have a chance to do better?

BRIAN GRANT: Yeah, in some cases Borg didn't necessarily do things wrong, but did things that had a lot of consequences. So like the dynamic port issue, Borg dynamically allocates ports for applications and uses host ports. It doesn't use the Kubernetes IP per instance model.

The advantage of that is you don't need as many IP addresses. The disadvantage-- and there are a bunch of disadvantages. One is, inside of Google, most software was rewritten from scratch. So we have billions and billions of lines of code in our monorepo, all written from scratch.

If you can do that, then you can depart drastically from what people do in the outside world. If you actually need to be able to run things from the outside world, it gets harder, right? So most applications in the outside world assume statically-configured ports. If you have a distributed application, they generally assume that all instances of that application are on the same port.

For example, they might only share IP addresses or DNS names, and not even share what port they're on. DNS, in general-- the way people use it-- they mostly just look up the IPv4-- IPv6, now-- addresses, and don't look up the ports. Because they assume the client knows the ports, because they use well-known ports.

Most proxies only deal with targets that are all on the same port. So this is a pervasive problem in pretty much all software that runs on servers in any kind of distributed fashion. So we made the decision that we were actually going to try to make these applications, and the normal models for how applications are configured and discovered and load balanced and things like that, compatible with Kubernetes. And that's what really drove that difference.

It also simplifies some human tasks. Like, humans don't need to go figure out what port something was running on. Or even just some debugging tools, it's easier to chase them down with an IP address than having to also know the port.
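
[Editor's note: the difference between dynamically allocated host ports and an IP per instance can be sketched in a few lines. This is an illustrative model only, not how Borg or Kubernetes allocates anything. The function names, IP addresses, starting port, and the well-known port 8080 are all hypothetical.]

```python
from collections import defaultdict
from itertools import count

WELL_KNOWN_PORT = 8080  # hypothetical static port the application expects


def host_port_endpoints(placements, first_port=31000):
    """Host-port model: instances share the machine's IP, so every
    instance on a machine needs its own dynamically assigned port."""
    next_port = defaultdict(lambda: count(first_port))
    return [(host_ip, next(next_port[host_ip])) for host_ip in placements]


def ip_per_instance_endpoints(instance_ips):
    """IP-per-instance model: each instance has its own IP address,
    so all of them can listen on the same well-known port."""
    return [(ip, WELL_KNOWN_PORT) for ip in instance_ips]


# Three instances, two of which land on the same machine.
dynamic = host_port_endpoints(["10.0.0.1", "10.0.0.1", "10.0.0.2"])

# The same three instances, each with its own address.
static = ip_per_instance_endpoints(["10.1.0.5", "10.1.0.6", "10.1.1.7"])
```

In the first model a client has to discover a port as well as an address for every instance. In the second, an address is enough, which matches what DNS lookups and most proxies already assume.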

CRAIG BOX: It's easy to do quality of service on a network if you say everything on this port is important or not.

BRIAN GRANT: Yes, and that is easier with the way firewalls work. There are many things that are easier. Of course, that creates the challenge of, now, how do you scale your network, and do you have enough IPv4 addresses? And things like that. So there are trade-offs in that decision, but I think, overall, the trade was very successful for Kubernetes.

Another decision that was different-- different from Omega, and something we had actually considered doing for Omega-- is that in Omega, all the controller components directly accessed the store. So they all had to be very highly trusted, and the releases had to be tightly synchronized.

So in Kubernetes, we wanted to allow a much wider array of clients, including less-trusted ones and less-coupled ones. So we put an API in front of the state, and I think that was a very good decision. Like I said, it's something we had planned to do for Omega, but didn't end up doing. Some of the ideas like pods and labels and watch were actually derived from observations in Borg.

For example, most-- I was actually chatting with one of the SRE tech leads back in 2011 or 2012 about how application teams were using Borg. I mentioned that Borg had this alloc primitive that allowed job tasks to be scheduled into portions of machines.

Turns out that almost all application teams didn't actually use that functionality. What they did is they pinned specified sets of tasks into each alloc instance. And that's what motivated the pod concept.

At the time we called it scheduling unit, which was an atomic set of tasks that would be scheduled on machines. We prototyped that in Borg, and we made that the fundamental scheduling primitive in Omega as well.

Labels were motivated by an analysis of how people used job names in Borg. Borg did not have labels, so we found users concatenating a number of different attributes in the job names. So job names got long, like up to 180 characters, which was the limit.

So labels were actually proposed back in 2013 as a way to allow people to express those attributes in a more first-class way, both in Borg and in Cloud, and ultimately Kubernetes. A lot of the core ideas for Kubernetes were developed that year, in 2013, which is when we started working on the fundamental APIs like pods and labels and replication controllers.
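
[Editor's note: to make the contrast concrete, here is a minimal sketch of attributes packed into a job name versus the same attributes expressed as key/value labels and queried with an equality-based selector. The job names, label keys, and helper functions are hypothetical and are not taken from Borg or from the Kubernetes codebase.]

```python
def matches(labels, selector):
    """Equality-based selection: every key in the selector must be
    present in the labels with exactly the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(jobs, selector):
    """Return the sorted names of all jobs whose labels match the selector."""
    return sorted(name for name, labels in jobs.items() if matches(labels, selector))


# Attributes packed into a single name: any query has to parse the string
# and depends on everyone agreeing on field order and separators.
concatenated_name = "frontend-prod-canary"

# The same kind of attributes as labels: each one can be queried on its own.
jobs = {
    "job-a": {"app": "frontend", "env": "prod", "track": "canary"},
    "job-b": {"app": "frontend", "env": "prod", "track": "stable"},
    "job-c": {"app": "backend", "env": "prod", "track": "stable"},
}

prod_frontends = select(jobs, {"app": "frontend", "env": "prod"})
stable = select(jobs, {"track": "stable"})
```

Here `prod_frontends` picks out job-a and job-b, while `stable` cuts across applications and picks out job-b and job-c, a grouping that a single concatenated name cannot express without extra parsing conventions.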

CRAIG BOX: It feels like Kubernetes was able to become what it is because it had a combination of people who were users of Borg coming to it, and then people who came from the Cloud side, but then especially it had the oversight of the people who were the Borg and Omega team at the time. Would you agree?

BRIAN GRANT: Yeah, absolutely. Several people who had worked on Omega, and on Omelet, the node agent for Omega, were involved in the early discussions before it was decided to even do an open source project. In fact-- Borglet was the node agent for Borg-- we had several previous Borglet tech leads-- Tim Hockin, Dawn Chen, Eric Tune-- working on the node primitives for Kubernetes before we decided whether to do an open source project or hosted project.

That was even before we decided whether to use Docker or not, since we had just open sourced our container software-- Let Me Contain That For You-- around that same time. We were investigating what the semantics and functionality and capabilities should be at a pretty detailed level.

ADAM GLICK: You've also been with the CNCF since its inception, and indeed on the TOC for that. What have you seen as part of the evolution of the CNCF, both as Kubernetes has grown, and as a number of other projects have come into the CNCF and have grown?

BRIAN GRANT: So we created the CNCF-- we being Google-- not just to host the Kubernetes project, but also to foster an ecosystem of Cloud Native projects. Because we knew Kubernetes, despite being huge-- although it was not huge at the beginning. We only had four APIs. We knew it was going to expand, but we knew it wasn't going to cover everything.

Borg has a huge ecosystem internally. There's a very rich ecosystem of logging agents and monitoring systems and deployment systems, all kinds of automation and orchestration mechanisms, batch schedulers, cron services, and so on. So we wanted a home for those other things.

We also wanted it to be a home for Kubernetes very early on. The CNCF took a mission to expand the foundation beyond just the seed project, which was Kubernetes. So at the beginning, actually, there was-- I was on both sides. I was both one of the leaders of the Kubernetes project, and on the technical oversight committee, so I saw both sides. But until we got that second project, there was really a reluctance to-- there was a lot of concern, I guess, about the foundation accidentally becoming the Kubernetes foundation.

So we, on the TOC side, had engaged a number of projects to present to us and discuss the projects while we were figuring out, well, what kinds of projects are we looking for? What are their criteria? How should this work? That was very useful.

So before we accepted the second project, we just had many projects present to us, so we could discuss those issues. Prometheus was a pretty obvious choice for a second project. It was not created by Borg, it was created by ex-Googlers, but it was inspired by Borg's monitoring system, and something very compatible and needed in the Kubernetes space even though it was still early in the Kubernetes days.

So that was kind of a no-brainer, and people immediately saw the value of that and we added that, and that really unlocked the ability for the CNCF to absorb more projects after that.

CRAIG BOX: There's a big roster of CNCF projects, now.

BRIAN GRANT: More than 30, yeah.

CRAIG BOX: Is there any area that you think, we really need one of these and we don't have one?

BRIAN GRANT: Yeah, so that's something the CNCF and the TOC specifically is looking closely at. So one of the things I wanted to do, once we started figuring out what sorts of projects we wanted, is the TOC put together a spreadsheet, which Alexis Richardson, the chair, started, of what areas we needed projects in.

So one area that I knew we needed a project in from my Borg experience was identity. So the SPIFFE project was one we specifically engaged with about that, and that's in the CNCF sandbox now. So we filled those holes that we had identified initially.

There are proxies and service meshes, and SPIFFE for identity, monitoring systems, Open Policy Agent for policy. And so a lot of those areas are covered, so we need to actually go back now and look at, what are the next set of areas where there are still holes.

And the CNCF doesn't necessarily have to cover all areas of software. So that's still something we need to figure out, is what areas do we want to focus on? We have had data processing projects proposed in the past, for example.

And you know, Apache has a very rich, deep roster of such projects, so we haven't taken on anything in that category. I wouldn't rule it out, but we decided it wasn't a priority, because there wasn't as much need there. The storage, I think, is an area that people are very interested in. Storage solutions.

And there are also a number of those in the Apache ecosystem and elsewhere. So we do have one project in the space, Rook-- oh, actually, two, we also have Vitess, which I sponsored, so I should remember that. The horizontally-scalable MySQL storage middleware, I think was the category we decided that fell into, to help bridge the pre-cloud native to Cloud Native.

I would say storage is definitely a primary area that we're likely to get more deeply into. I definitely think there are other holes also. Moving up the application stack, I think, is an area that some people are interested in, although it's a super fragmented space, so we've been somewhat cautious about moving in that direction so far.

ADAM GLICK: How do you think about things that don't fit into the CNCF? For instance, the CNCF is a subset of the Linux Foundation, and we look at like, the Apache Foundation. Apache has all sorts of projects that kind of--

CRAIG BOX: It even has a web server. That doesn't really fit with the rest of the project.

BRIAN GRANT: I looked at the Apache website, and they have more categories of projects than we have projects.



ADAM GLICK: So how do you decide what things are the right things to bring into the CNCF as a project, versus what are things that might better live in another open source location?

BRIAN GRANT: So first of all, it has to meet the criteria for what it means to be Cloud Native, or to facilitate other software to actually operate in a Cloud Native fashion. So one of the things I did last year was update the definition for what it means to be Cloud Native for the foundation.

The original definition was very specific. It was written down in the charter-- the original charter of the Foundation was very specific to Kubernetes. It was about containers and dynamically-scheduled microservices. Super specific to Kubernetes.

So I decided to generalize it a bit and talk more about, well, what are we really trying to achieve with Cloud Native? So it's a challenging process to make the definition concise but also sort of cover the space, but really, it's all about facilitating management of applications and infrastructure in a dynamic environment in a way that doesn't incur a lot of human toil.

So if humans are in the loop, that puts a limit on the scale and velocity that you can operate at. If you can get humans out of the loop and replace them with robots, effectively, then that really empowers you to move much more quickly. Just using Kubernetes as an example, we have users of Kubernetes who say, well, I went from a successful deployment per quarter to 10 times a day.

It's that kind of transformative impact that we're really trying to achieve with Kubernetes and the Cloud Native projects generally. We're not looking for a 10% cost reduction, generally. We're really looking for that transformative impact.

Or if you say, well, I could manage three servers with a staff of one person-- with Cloud Native solutions, maybe you can manage 30,000, right? So inside of Google, that's where we talk about sub-linear growth in terms of adding more applications, more servers, more everything at a higher rate than you have to add people. And ideally, it would be more logarithmic, and that is really possible with sufficient amounts of automation.

And with automation, I always say that there are two sides. There's control, which is kind of obvious, but there's also observability. You actually need to be able to monitor and measure what is actually happening, so you can automatically react to that. Or even potentially, in some cases, escalate to humans as well, with a page or something like that.

But you really need that closed loop. I've worked a lot on, actually, adaptive control systems in the past. Adaptive frequency voltage scaling, adaptive compilation, PID controllers, even. So I'm a big believer in adaptive techniques, and it's, I think, really a big part of what Cloud Native is about.
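
[Editor's note: the two sides Brian describes, observing the actual state and then acting on it, can be sketched as a simple closed loop that converges on a desired replica count. This is a toy illustration, not a PID controller and not code from any Kubernetes controller. The function names and the `max_change` rate limit are hypothetical.]

```python
def reconcile(desired, observed, max_change=1):
    """Compare desired and observed state and return one bounded action."""
    if observed < desired:
        return ("start", min(desired - observed, max_change))
    if observed > desired:
        return ("stop", min(observed - desired, max_change))
    return ("noop", 0)


def run_loop(desired, observed, max_steps=10):
    """Observe, compare, act, and repeat until the state has converged."""
    history = []
    for _ in range(max_steps):
        action, amount = reconcile(desired, observed)
        history.append((observed, action, amount))   # observability side
        if action == "noop":
            break
        # Control side: in this toy model the action always succeeds.
        observed += amount if action == "start" else -amount
    return observed, history


final, history = run_loop(desired=3, observed=1)
```

Each pass through the loop records what it saw before deciding what to do, so the same loop could also escalate to a human if the history showed that it was failing to converge.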

CRAIG BOX: One area where the humans have been scaling is the contributors to the community, and you mentioned about the Apache project. I think that it's fair that there were probably around about 20 contributors to Kubernetes in the early days, and probably about that many special interest groups, or SIGs, today. What's your involvement been in the process of stewarding and creating new SIGs in the Kubernetes project?

BRIAN GRANT: I think we started creating SIGs shortly after 1.0, around that time frame, so mid-2015. In some of the earlier areas of the project, we had informal working groups across Google and Red Hat-- for example, the folks working on kubelet, which evolved into SIG Node.

There was an effort to get broader engagement from the community on issues like scalability. At Kubernetes 1.0, we officially supported a small number of nodes in the cluster, which was very controversial, but actually, our goal was to make it useful to some set of people first and then push on scalability later, because we were pretty confident that we could-- like, we've been there, done that, we know how to do that. We know exactly what sort of optimizations we need to do to the scheduler, for example, to get it to scale.

So as those initial efforts were being formalized into SIGs, I was involved in several of them. I mean, over the history of the project, as you mentioned, I've been involved in API machinery and kubectl and scheduling and auto-scaling and many areas-- quality of service at the node level, pod API, and so on. But I really tried to get more emphasis on areas that had been somewhat under-invested.

So I worked with bootstrapping SIG Docs, and I both reviewed the original kubernetes.io site at the 1.0 launch, and also multiple revamps of that site, and helped get that SIG off the ground. And SIG Contributor Experience, back when it was a working group, when it was officially started, before we really had a clear idea of what we wanted working groups to be. And that was really an effort to reduce friction for contributors.

A lot of the initial focus was on automation around GitHub, because we had a lot of challenges just scaling project management. So some efforts, actually, that came out of that, were things like DevStats, which CNCF developed for us, and then expanded to all other projects. But, yeah, a lot of the auto-labeling machinery we have now in Prow, originally in a tool called mungegithub, was born out of that. For SIG CLI, my main involvement there was dumping the history back to kubectl and pre-kubectl-- like, #1325 is the PR that created kubectl, go back and take a look at that-- and sort of talking about the original vision for what kubectl was intended to do and what some of the challenges are.

CRAIG BOX: If only a canonically supplied pronunciation was in that PR.

BRIAN GRANT: What would you have to talk about, then?

CRAIG BOX: What is your role as co-chair of SIG Architecture?

BRIAN GRANT: What I view my role is-- so SIG Architecture was created-- the idea for it was proposed at the Kubernetes leadership summit, I want to say, back in 2017. There were a number of processes, again, that were kind of informal on the project. We have API review and approval for APIs all over the project, and Kubernetes is an API-centric system. Almost all functionality is somehow surfaced through some API, and that's really critical to the design and the architecture.

CRAIG BOX: So your group is responsible for making sure those APIs are correct or extensible?

BRIAN GRANT: So that's part of it. I guess where I was going with that is SIG Architecture was about formalizing some of the informal processes and responsibilities in the project, and formalizing it so that we could expand both the set of people involved and the impact across the project as the rest of the project expanded. And that's actually still a struggle and still a goal. Right now, we're making an effort to-- and Jordan Liggitt, who's another one of the API approvers and doing a preponderance of the API reviews of late, is initiating an effort to onboard more API reviewers across the entire project.

So SIGs that are making changes to their own APIs have the expertise and experience that is required to know what to look for for backward compatibility gotchas, for example, or stylistic consistency across the different APIs in the project. And I think that is a really good approach that we should repeat in other areas, which is set guidelines for the project, best practices, help inform people about why the system is designed the way it is so they can self-align with that consistent model. And we can be consultants in that, answer questions and help write down some of the answers to those things that should've been written down a long time ago.

But we're really there to help people understand how to best achieve the things that they're trying to do in the context of the code base. I haven't had time to work much directly on the code for quite a while, but I'm frequently surprised by how much has stayed the same, so I view that as a success. Like if I squint, I can still understand how things work, even without looking closely at the code.

CRAIG BOX: And it still looks like Omega to you?

BRIAN GRANT: It does. I mean, the core building blocks are still the same. The highest level description that we've come up with for how the system works is there is a schematized key-value store with asynchronous controllers, and the entire system works that way. So kubelet is a controller, the scheduler is a controller.

Obviously, things like the ReplicationController controller. The ReplicaSet controller now -- that was confusing. All the business logic, or almost all of it, is in these asynchronous controllers. And that's turned out to be a really powerful model for how to design the control plane, is to have a declarative database, effectively, that contains both the desired state and the source of truth for observed state, and to have all the components interact through modifications of that state.

And people are using that model across-- the last time we did a count, more than six months ago, more than 500 projects were actually using the Kubernetes resource model for control in all kinds of things, and people talk a lot about operators for managing applications. But people are also using it for workflows and log management, and functions platforms, and all kinds of things.
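
[Editor's illustration: the model Brian describes, a declarative store holding both desired and observed state, with controllers that interact only by modifying that state, can be sketched in a few lines. This is a toy under stated assumptions, not Kubernetes code: `Store`, `replica_controller`, and `run_until_converged` are hypothetical names, the store is an in-memory dict rather than a schematized key-value store, and a simple synchronous loop stands in for genuinely asynchronous controllers.]

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy declarative store (hypothetical): desired and observed state per key."""
    spec: dict = field(default_factory=dict)    # desired replica count per name
    status: dict = field(default_factory=dict)  # observed replica count per name


def replica_controller(store: Store) -> bool:
    """One reconcile pass: move each observed count one step toward desired.

    The controller only reads and writes the store. Returns True if it
    changed anything during this pass.
    """
    changed = False
    for name in set(store.spec) | set(store.status):
        desired = store.spec.get(name, 0)  # a removed spec means "scale to zero"
        observed = store.status.get(name, 0)
        if observed < desired:
            store.status[name] = observed + 1  # pretend to start one replica
            changed = True
        elif observed > desired:
            store.status[name] = observed - 1  # pretend to stop one replica
            changed = True
    return changed


def run_until_converged(store: Store, controllers, max_passes: int = 100) -> int:
    """Run every controller repeatedly until a full pass changes nothing.

    Returns the number of passes used, including the final no-op pass.
    """
    for passes in range(1, max_passes + 1):
        results = [controller(store) for controller in controllers]
        if not any(results):
            return passes
    raise RuntimeError("controllers did not converge")


if __name__ == "__main__":
    store = Store()
    store.spec["web"] = 3  # a user declares the desired state
    run_until_converged(store, [replica_controller])
    print(store.status)
```

[The point of the sketch is that nobody calls the controller with an instruction like "start three replicas". A client writes desired state, and the controller keeps comparing it with observed state until the two agree, so the same loop handles scale-up, scale-down, and recovery after a missed change.]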

ADAM GLICK: Cool. This has been great to have you on, Brian. Thank you so much for the time today.

BRIAN GRANT: Oh, thank you. It was great.

ADAM GLICK: You can find Brian on Twitter at @bgrant0607.

CRAIG BOX: Is that your PIN number?

BRIAN GRANT: It is my start month and year at Google.

CRAIG BOX: What is PR #607?

BRIAN GRANT: I do not actually know that. Yeah, a lot of issues and PRs that have less than five-digit numbers, I do remember, but that was not one that I reviewed. I did actually use to subscribe to the entire repo and look at every issue and PR, so literally in the first two years of the project, about 200,000 GitHub notifications hit my inbox.

CRAIG BOX: "Write the JSON content type for API responses", merged by lavalamp in 2014.

ADAM GLICK: Do you know that one off hand?

CRAIG BOX: I just looked it up now.


ADAM GLICK: Thanks for listening. As always, if you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes. Until next time, take care.

ADAM GLICK: Catch you next week.