#85 January 7, 2020
Five years ago, Clayton Coleman took a bet on a new open source project that Google was about to announce. He became the first external contributor to Kubernetes, and the architect of Red Hat’s reinvention of OpenShift from PaaS to “enterprise Kubernetes”. Hosts Adam Glick and Craig Box return for 2020 with the story of OpenShift, and their picks for Game of the Holidays.
Do you have something cool to share? Some questions? Let us know:
ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.
CRAIG BOX: And I'm Craig Box.
Welcome back. It is a new decade. Welcome to 2020.
ADAM GLICK: Indeed. How were the holidays for you?
CRAIG BOX: Pretty good. We had some friends visiting from New Zealand, and that meant we got to be London tourists all over again. Every time someone comes by it's like, well, what do you want to see? And it turns out to be museums. And in this case, we went down to Harrods. I think that Harrods probably sells more items in its gift store than it does in the entire rest of the department store, to be honest.
ADAM GLICK: I hear you did a little bit of gaming.
CRAIG BOX: Yes. You'll never guess what my gaming recommendation of 2020 was.
ADAM GLICK: "My Little Pony-- Friends Forever?"
CRAIG BOX: Close. Starts with an M. It's "Minesweeper." I found a link to a very interesting game. It's called "Kaboom!," a variant of "Minesweeper." And one thing you might remember, if your experience with playing "Minesweeper" was in Windows 3.x, as I'm sure it was for many people listening, it's possible to get all the way through the game, and then you can't guarantee that you can win. You can end up in situations where the game offers you a 50-50 chance, effectively.
And there have been variants of "Minesweeper" made throughout the years that have fixed that a little bit. One that I've joined in the last few days is Simon Tatham's "Mines," which you can find a link to in the notes. But "Kaboom!" is a variant, which is basically the game being fair but cruel, as it's described. And what that means is that if you make a guess, you will be wrong. They don't generate the board at the beginning. They generate it on the fly. So if you deduce things, you can work your way through the game and figure out everything. But if you just say, I'm going to randomly click somewhere, you'll die.
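The "fair but cruel" rule Craig describes can be sketched in a few lines: a click is safe only if every mine layout consistent with the revealed clues leaves that cell empty; otherwise the on-the-fly board generator is free to put a mine under it. This is an illustrative toy in Python, not the actual "Kaboom!" implementation, and all names here are made up.

```python
from itertools import combinations

def consistent_boards(cells, mine_count, clues):
    """All placements of mine_count mines over cells that satisfy every
    clue, where a clue is (neighbor_cells, mines_among_them)."""
    boards = []
    for mines in combinations(cells, mine_count):
        mines = set(mines)
        if all(len(mines & set(nbrs)) == n for nbrs, n in clues):
            boards.append(mines)
    return boards

def click_survives(cell, cells, mine_count, clues):
    """Fair but cruel: you survive only when NO consistent board has a
    mine under the clicked cell, i.e. the move was logically deducible."""
    return all(cell not in b for b in consistent_boards(cells, mine_count, clues))
```

With cells a, b, and c, one mine, and a clue saying exactly one mine sits among a and b, clicking c is provably safe, while clicking a is a guess, so the adversarial generator kills you.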
ADAM GLICK: Interestingly, it's a punishing version of "Minesweeper" to teach you to use logic and not to guess.
CRAIG BOX: Absolutely. And that sort of made me think, I've never really played the game seriously. It is a time waster, obviously, but I enjoyed that. And I can thoroughly recommend it if you like a little logic puzzle to welcome in the new decade.
ADAM GLICK: That's kind of neat.
CRAIG BOX: How were your holidays?
ADAM GLICK: I got to spend time with the family, which was wonderful. They got to meet the grandkid and play around. And while I was there, I got a chance to watch a very interesting video, speaking of classic gaming. If you're taking it back to "Minesweeper," I'm going to take it back to "Snake." Do you remember "Snake" from the Nokia phones that everyone had?
CRAIG BOX: Yes, of course.
ADAM GLICK: Or even going back to early BASIC computers.
CRAIG BOX: NIBBLES.BAS!
ADAM GLICK: Someone took "Snake," and they wrote AI to basically play "Snake." Can you make an AI that plays "Snake," and can it be perfect? How long can it get the snake to be? Can you basically fill the screen?
And it is a fascinating video to watch someone go and try and solve this problem and what they go through. And all I can think is, the amount of time it took them to make the video, not only to come up with the AI algorithms, but to actually shoot this video of it playing the "Snake" game is just unbelievable. And it's one of those things that you just have to be impressed at how much work someone put into something and how deep they went on it.
And if you just want to go down a rabbit hole, look up AI playing "Snake" on YouTube. And it is 15 minutes of pure geeky joy.
CRAIG BOX: With that, let's get to the news.
CRAIG BOX: Google Cloud dropped a holiday gift onto the industry in the form of BeyondProd, a paper describing their approach to cloud-native security. BeyondCorp told the story of how Google staff were able to securely access workloads not by being on a corporate network, but by identity and credentials given to them and their machines. BeyondProd is the expression of those ideas to production workloads, including mutually authenticated service endpoints, transport security, edge termination with global load balancing and denial of service protection, runtime sandboxing, and end-to-end code provenance.
For that last point, Google goes on to describe binary authorization for Borg, their implementation of code review, provenance, and enforcement for the internal platform. The posts and associated papers also described a method of implementing similar patterns in Anthos, GKE, or open source. Maya Kaczorowski, one of the authors, was our guest way back in episode number 8.
ADAM GLICK: VMware completed its $2.7 billion acquisition of Pivotal just in time for New Year's. Astute observers may have noticed that VMware and their parent company, Dell, already own 2/3 of Pivotal, but this allows VMware to pick up the rest of the public stock as well as buy Dell out of their shares via stock swap. Pivotal and VMware's cloud-native applications team, which includes Heptio, will now become the Modern Application Platform business unit at VMware, headed by Ray O'Farrell, who was previously VMware's CTO.
CRAIG BOX: Cloud-native database company PingCAP has released Chaos Mesh, a platform for chaos engineering on Kubernetes. The project so far consists of a chaos operator, which you deploy into your cluster, and a dashboard, which initially only supports their TiDB database but has plans to add more. The repo shows an example video where a chaos experiment found a bug in the TiKV store, which they have since fixed, which can give you an idea of the kind of thing you can do with it. If you want to learn more about chaos engineering in general, download episode 82 with Ana Medina.
ADAM GLICK: Google Cloud has announced that global access for internal TCP and UDP load balancing is now in beta. This means that internal IPs can now be accessed from any region around the world within a VPC.
CRAIG BOX: Hot on the heels of dual stack IPv4 and v6 in Kubernetes 1.17, Calico has released version 3.11 for Workgroups with support for it. It's worth noting that you had to manually install TCP/IP into Windows 3.11 for Workgroups, and IPv6 was definitely not supported.
ADAM GLICK: Crunchy Data has released version 4.2 of their PostgreSQL operator. This release adopts Zalando's Patroni framework for high availability clusters and brings it in line with Zalando's operator. Given there are now two operators using the same technology, we are not sure how you choose between them.
CRAIG BOX: A gift from krew maintainer and episode 66 guest Ahmet Alp Balkan in the form of kubectl-tree. This new plugin allows you to visualize Kubernetes object ownership, showing, for example, which deployments and replica sets your pods belong to. This is especially useful for people using Knative, where you can have multiple configurations or service versions attached to a Knative service object.
If you prefer an interactive tool, you can also check out kubelive, a Node.js app which lets you pull up the objects in your cluster and then move through them and delete them, or copy the name with the keyboard.
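The ownership that kubectl-tree visualizes is just a parent pointer: each Kubernetes object's metadata can carry ownerReferences naming the controller that manages it. A toy traversal gives the flavor; the object names and the flattened dictionary shape below are illustrative, not the plugin's actual code.

```python
def ownership_chain(objects, name):
    """Walk ownerReferences upward from `name` to the root owner.
    `objects` maps an object name to its metadata dict, loosely
    mimicking the Kubernetes ownerReferences field."""
    chain = [name]
    while True:
        refs = objects[chain[-1]].get("ownerReferences", [])
        if not refs:
            return chain
        chain.append(refs[0]["name"])  # follow the controlling owner

# A pod owned by a replica set, owned by a deployment:
cluster = {
    "web-7d4b": {"ownerReferences": [{"name": "web-rs"}]},
    "web-rs": {"ownerReferences": [{"name": "web"}]},
    "web": {},
}
```

Walking upward from the pod recovers exactly the deployment-to-replica-set-to-pod lineage the plugin draws as a tree.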
ADAM GLICK: Josh Van Leeuwen from Jetstack has posted a write-up on how to use OpenID Connect to authenticate to AWS Kubernetes clusters using an external service. Because EKS only allows AWS's own IAM accounts for identity, Josh's write-up is a nice way to help users who want to use external accounts and open source authentication technology.
CRAIG BOX: Twitter sometimes appears to be evenly split between people saying Kubernetes is completely unnecessary and people who say, I very proudly used Kubernetes to do something completely unnecessary. Two examples of the latter this week-- Austin Vance used Kubernetes and Prometheus to monitor his barbecued meat smoker, and Patrick Easters created an operator to control his Christmas tree lights. We salute you both.
ADAM GLICK: TechTarget has published an article for what they think will be the biggest things to move the cloud-native community in 2020. Of little surprise to our listeners, Envoy and Istio are the two they focus on, as they are widely seen to be the next big growth areas for the Kubernetes space.
CRAIG BOX: Victor Adossi runs a Kubernetes cluster created with kubeadm, and that cluster suffered an outage at the end of 2019. In this case, the cause of the outage was the unexpected expiration of a TLS certificate. Victor's blog post dives into how the issue was first noticed not through monitoring, but through deployment failures with a crypto verification error. It's a good read to talk about the issues you can face while running your own clusters.
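One cheap mitigation for this class of outage is alerting on certificate lifetime before anything starts failing handshakes. A sketch in Python: the notAfter string format below matches what the standard library's ssl module reports for a peer certificate, but the function names and the 30-day threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Days remaining on a certificate, given a notAfter string in the
    'Mar 15 12:00:00 2030 GMT' format used by Python's ssl module."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - now).days

def should_alert(not_after, now, threshold_days=30):
    """Fire well before clients start seeing crypto verification errors."""
    return days_until_expiry(not_after, now) < threshold_days
```

Run against the kubelet and API server certificates on a schedule, a check like this turns a surprise outage into a routine renewal ticket.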
ADAM GLICK: Wander Hillen of WeTransfer has shared his thoughts on the horizontal pod autoscaler, and it's not an article full of praise. Wander talks about his background in control systems, and points out a number of challenges and shortcomings of the HPA. These challenges mostly center around the fairly coarse way that the HPA chooses to scale and the edge cases that can cause this to fail, such as seeing a momentary drop in load, killing a number of pods, and then having traffic spike and not scaling up quickly because of cool-down periods. Wander also points out why some of the replacement technologies for the HPA don't quite solve the challenges he's identified, and he lists what he thinks the attributes of a good solution to horizontal pod autoscaling would be.
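For context, the core rule the HPA applies is a single proportional step, roughly desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a tolerance band to suppress tiny changes. A simplified sketch: the 10% tolerance matches the upstream default, but everything else here is an illustration, not the controller's actual code.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.10):
    """Simplified HPA step: scale proportionally to observed/target,
    skipping changes when the ratio is inside the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid churn
    return math.ceil(current_replicas * ratio)
```

The coarseness Wander criticizes falls straight out of this: observe half the load for one evaluation tick and the proportional step halves your pods, even if the dip lasted only seconds.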
CRAIG BOX: It's easy 400 words time, as tech journalists write up their best of 2019 and present their predictions for 2020. I shouldn't complain, as I may have written the "Best Episodes of 2019" up on the GCP blog. Common themes include vendors packaging applications natively for Kubernetes, better support for edge workloads, and a continued focus on security.
In other news, water continues to be wet, and 2020 will be the year of both Kubernetes and Linux on the desktop. You can find a large list of these write ups in the show notes.
ADAM GLICK: Christopher Tozzi at Container Journal has posted about four ways he thinks Kubernetes could be improved. He calls out the install process, YAML, multi-cluster support, and the difficulty with doing small scale deployments. If you're using Kubernetes, these probably aren't surprises to you, though his thoughts on small scale deployments are a good note for people as they think about when they want to use Kubernetes as their platform.
CRAIG BOX: Finally, some sad news. In a candid blog post, Kontena's CEO, Miska Kaipiainen, all with K's, explains that they're ceasing operations immediately. Kontena were our guests in episode 31. The code for their open source projects is still on GitHub, but they are shutting down their build pipelines and see the end in the near future if they cannot get support in the form of funding. Customers using Kontena's account service are advised to migrate off it as soon as possible.
ADAM GLICK: And that's the news.
ADAM GLICK: Clayton Coleman is an architect with Red Hat and focuses on Kubernetes and containerized application infrastructure. He was the first non-Googler to contribute to Kubernetes, and has been driving Red Hat's technical Kubernetes and container strategy since the project was released. He's an emeritus member of the Kubernetes Steering Committee. Welcome to the show, Clayton.
CLAYTON COLEMAN: Hi. It's great to be here today.
CRAIG BOX: So post-acquisition, you now work for IBM. But I understand this is not the first time.
CLAYTON COLEMAN: That's right. I worked for IBM for about 10 years prior to moving to Red Hat, and it was mostly on the user interface side. So I worked on user experience design. And I came over to Red Hat because they're like, hey, we've got this new online service. It's this thing called PaaS. I'm like, that's the worst acronym I've ever heard. What does that mean? Well, it's like Heroku. And I was like, what's Heroku?
And so I actually had to go use it, and I was like, oh, this is what all the cool kids are doing these days. So I came over to do user interface, and then got sucked into OpenShift through that angle.
ADAM GLICK: How was the transition going from essentially front end to back end?
CLAYTON COLEMAN: It was really fascinating going from the front end to the back end. I worked on UIs all the time. I knew what I wanted, and the back-end guys just weren't giving me what I needed. And so I said, well, I know what the users want. I can do this better than they can.
So it's a little bit of user experience hubris. We all like to think that we understand the user better than anybody else.
ADAM GLICK: That's the true engineer. I'm going to go build it better myself.
CLAYTON COLEMAN: That's right. And as I got more involved, it was, wow, there's a lot of things in the developer experience which are about communicating clearly to developers. So I like to think that that approach, coming from the user experience side, was certainly something that I brought to Kubernetes, contributing to the project, which was like, how do we make this approachable and easy for the novice user?
Because I don't like to harsh on Googlers, but there were definitely some very, very smart people working on Kubernetes in the early days. And I think one of the things I could bring and Red Hat could bring was making it more accessible to the first-time developer, to someone who'd never seen a container or even knew what containers were.
CRAIG BOX: Let's step back, then, a little before Kubernetes. People today are familiar with Red Hat's OpenShift platform as being enterprise Kubernetes, but it actually had a life before that. And you've worked on it since version 1. What can you tell us about the history of the OpenShift product?
CLAYTON COLEMAN: As I said before, I came to Red Hat and I started working on OpenShift online, which was hosted software as a service, a little bit like Heroku, trying to make it easy to run Ruby, Java, PHP applications. And we kind of came at it from an interesting angle at Red Hat, which was we knew a lot about Linux sysadmins. So Red Hat's worked for a really long time with the operators and the people who run these systems. And we said, well, we want to make it easy for the dev teams and the operation teams to work together. So that was kind of our unique spin on PaaS.
And so OpenShift was always kind of, well, what if you took Red Hat Enterprise Linux and made it really accessible for end developers building Java and PHP and other enterprise applications? And as that evolved, it was interesting. We used containers from the very early days, so Red Hat's been involved with containers a long time, just like Google has. Some of the folks who work at Red Hat, like Dan Walsh, were fundamental in getting SELinux into a form that could be used to protect containerized applications.
And we really had something we thought was interesting, but there is always something kind of missing. And for us, that was actually when Docker came on the scene. So I remember someone sent me this demo video, and they were like, hey, you need to check this out. And then I saw it, and then I went and downloaded it a couple weeks later. And I was like, this is awesome. It takes something that was super painful, that I didn't even think was possible, and made it super easy on Linux. And that was really-- that change actually like opened us up to the idea that maybe we can start thinking about making it easier for developers in a different way.
CRAIG BOX: So that was, then, the launch of OpenShift version 2. Was that Docker based?
CLAYTON COLEMAN: We never actually launched a Docker-based version. We had talked about it. We said, well, this Docker thing is awesome. What are we going to do with it? And there was that whole year where everywhere you turned, it was Docker this and Docker that. And we kind of looked at OpenShift. It had evolved from a system built fairly simply. We said, there's a lot of opportunity for improving how we orchestrate all this stuff.
And a new phrase started to show up in the popular consciousness, which was Docker orchestration. And that phrase really was interesting because we were like, well, we can do better. And so there were a number of early people in the space. CoreOS had the Fleet project, and there were a number of smaller startups doing really interesting things with orchestration of things that were almost containers, but nobody had really nailed that Docker use case.
And so we had looked around. We talked to a number of people in the ecosystem. We did a stint with OpenStack where, even though there was a ton of interest, we never really felt like we fit into that ecosystem. And then we took a step back, and we were about to go to Mesosphere, actually. So we had been talking with the folks at Mesosphere. It was really exciting. They'd done a lot of stuff; they were running at scale. They had a lot of really interesting use cases. We knew that there were a lot of enterprises using them for batch workloads.
And we got this weird connection out of the blue, actually, with some folks at Google. And they said, hey, we're working on this thing. We think we might open source it, but we're not sure. And we're like, OK, well, what is it? They're like, well, we call it Seven. And we were like, that's kind of a weird name. You guys need to get better at this whole naming thing.
And we saw a demo, actually, really early. And Brendan Burns showed the sevenlet, which was the predecessor of the kubelet, and they showed a little bit of the UI they were thinking about. And it was like, this seems really interesting. But you guys-- I don't know if you're actually going to ship any of this.
And so we were about to actually say, you know what? We've got to make a decision. This whole Docker orchestration thing, we need to decide. And so we were about to actually say, OK, well, let's just build around Mesos. So the next version of OpenShift would be built around Mesos, and we'll work with Marathon and help bring some developer flexibility, help make containers in Mesos work really great.
And at the 11th hour before the first DockerCon, about a week before, there'd been this on again, off again dance with Google, where they were like, well, we don't know whether we're going to open source it. We might open source it. OK, now-- we don't know. The lawyers won't let us.
And so we were about to make that decision, and then we got this call which was, hey, we're going to open source it. Are you guys in? And you have a day to decide. And so we talked it over. My boss and a bunch of the PMs on the team were like, we were talking about Mesos, which was this fairly mature, pretty widely used, relatively capable system versus this incredibly out-of-left-field bet on this thing that we'd seen demoed once and that we're just about to see the source code for.
But everybody's really excited. And it was interesting because it was such a hard choice, which was go with something that's completely new but was going to be designed by the community that could learn from the lessons of the past, but also have a chance to really capture that excitement. And I don't regret any of it because it was the right choice. And so we said yes, and we announced it at DockerCon, and it just took off from there.
ADAM GLICK: How did people feel about that at the time? I mean, it was a big bet to take. You were essentially betting on the less established player in the space. What made you take that bet?
CLAYTON COLEMAN: It was interesting. It was almost the nonexistent player because at that time it was a repo with a bunch of the internal iteration. I think the name Kubernetes was decided and almost nothing else was. And so we knew that we were using etcd, CoreOS's really cool, simple, easy-to-operate key value store, which at the time was also kind of crazy. It was like, oh, you're not using a database to store all your state? What are you doing?
It was interesting because I remember at that first DockerCon, I met a ton of folks from Google, folks from CoreOS who were super excited about this. There was a lot of buzz, and it really came down to-- I think there was so much excitement about the idea that we were all going to come together in a community, take the experience of Google doing similar systems, folks who had this experience with Google, folks who have experience in Linux and experience and all these other systems, folks who had experience in Docker and containers, trying to come together and build something that was-- it's still kind of amazing to me how quickly that seemed like the only obvious decision that any of us could have made, was to work on Kubernetes.
CRAIG BOX: How is it different when you're working for Red Hat, who obviously are a participant in open source in a way that is not necessarily the same as some of the other companies who built these things? How would that decision have influenced things?
CLAYTON COLEMAN: It's interesting. Red Hat is very focused on having an open source development model that helps us build products that support enterprise use cases. And so a lot of times, our history with Linux-- and Red Hat Enterprise Linux was very much about getting an early understanding that there was a real opportunity here, working within that community to make sure that community is successful. It wouldn't just be a we throw some stuff together and throw it over the wall, but we built this sustainable ecosystem.
Being involved in Kubernetes was almost like it was one of those right place, right time-- everybody wanted to make something better. So it was very easy because we all had sustainability on the mind. So from the very early days, we wanted to make an open, inclusive community. We wanted to make it really easy for everybody to get a say in the conversation. So there's a ton of early contributors who-- folks from Mesosphere, folks from Docker who came into Kubernetes and tried to bring those pieces together.
So for Red Hat, we've done that before in ecosystems. This was really a chance for us to put that to the test. I hadn't worked on open source for very long prior to coming to Red Hat. Certainly, at IBM and before, I was familiar with open source. I used it, but it wasn't in my DNA the way it is for someone who's been working on the kernel for 20 years. And it really was that everybody was trying to build something. And the obvious thing to do was, let's build a community where we trust each other, we trust each other's judgment, and we have good processes in place that make sure that technical decisions that make sense for the end users happen.
It was really amazing the way that we were all able to look at what we actually wanted to build from a technical perspective, look at the users, the folks who jumped on the Kubernetes bandwagon super early on, and were just willing to bet-- it's kind of crazy, actually, a little scary. They were just willing to bet, oh, we'll run all our production workloads on Kubernetes. You know it hasn't hit V1 yet, right? Yeah, we'll be fine. That was a very interesting transition. People were willing to put aside their vendor interests in favor of trying to build something that would work for everyone.
CRAIG BOX: Kubernetes came out of the gate with a lot of ideas from Google's internal systems, Borg and Omega. Both of those systems have one customer, which is Google, and they have Google developers and SREs as their users, who will be very different from enterprises. Even though a lot of the technical stuff was there from the beginning, there will have been huge gaps that needed to be filled in order to support the use case that you have with OpenShift and the customers that you have. What was missing? And what did you build?
CLAYTON COLEMAN: It was very interesting coming into Kubernetes. I learned a ton. I like to joke with Brian Grant, who was one of the early Google architects on Kubernetes, who helped really drive the project and held a lot of that idea in his head. I would joke that anytime Brian responded with a comment on a PR, he was basically letting millions of dollars of Google research out into the open. And Brian was like, yeah, we got a waiver for that.
And there was a lot of stuff I learned, but one of the things that I felt we could really bring is having been in this platform as a service space, where people were talking about 12-factor apps, and they were like, the goal is to be the simplest possible application that can possibly run. Microservices were taking off. And coming in from that, it was at the forefront of our minds that we wanted to make both Kubernetes and the things that we would build on top of it accessible. They had to make sense to someone who may never have run a containerized operating system or who had never heard the word microservices.
And so we tried to do small things scattered throughout the code to make it easier to use. One of them-- there was a funny discussion we had early on in Kubernetes where we were talking about adding health checks to the Kubernetes pod spec. And a health check, basically, is something that happens from the kubelet, and it makes sure that your process is still running. And if anything doesn't respond to a health check, Kubernetes says, well, I'll go ahead and restart your service, or I'll take you out of the load balancer and put you back in.
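The restart decision behind that health check is, at its core, just counting consecutive probe failures; the kubelet's default liveness failureThreshold is three. A toy version in Python, with function names that are mine rather than the kubelet's:

```python
def liveness_verdict(probe_results, failure_threshold=3):
    """Given a sequence of probe outcomes (True = healthy), decide
    whether a kubelet-style checker would restart the container:
    restart once `failure_threshold` consecutive probes fail."""
    streak = 0
    for ok in probe_results:
        streak = 0 if ok else streak + 1  # any success resets the count
        if streak >= failure_threshold:
            return "restart"
    return "keep-running"
```

A couple of isolated failures are forgiven; only a sustained run of them triggers the restart or the removal from the load balancer that Clayton describes.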
And that had been a concept that was pretty familiar to some people, but I remember being like, well, you know, guys-- they were talking about making the field required. And I was like, well, yeah, but you know that most software doesn't have health checks. And I don't remember who it was, but somebody was responding to that who was a Googler, and they were like, what do you mean it doesn't have a health check? And I was like--
CRAIG BOX: What do you mean not everything is compiled from source from one giant repository?
CLAYTON COLEMAN: I was like, the open source projects, they don't have health checks. And so some of that was bringing small things as we went through the core of Kubernetes. And then in OpenShift, we made a huge bet. We were like, OK, we're going to totally rip out the underpinnings of what we've built. We wanted to have this really flexible system because that was one of the limitations that people complained about with platform as a service. It just wasn't flexible. 12-factor apps are great, but you can't have stateful 12-factor apps. You can run a stripped down version of Java, but you can't run an Enterprise Java application.
So we went through that exercise of, we're going to strip out the guts, and we want to have the same kind of platform-type concepts on top, like somebody's got to build container images. Well, you build them, someone has to roll those out. We had replication controllers in Kubernetes at the time, but we didn't have deployments.
And so we talked a lot in the early days where we said, we want Kubernetes to appeal to everybody. And we know that in the future, we want to build things on top of Kubernetes. And so from the OpenShift side, OK, we'll take a really hard line. We'll add new constructs to Kubernetes that are supposed to work well with Kubernetes. But we didn't have any of the extension mechanisms yet, so we had one choice, which is you pile it all in. That was the going motto: if you can't compile it into a single binary, good luck.
And so we added things like deployment configs, which were based on a lot of the really early discussions we had about, OK, how do you move from one version to another? So we took a stab at it. We added some concepts that tied into developer flows, like building images and pushing them, and having a place where-- making it really easy to push images to any Kubernetes cluster. We integrated what at the time was the Docker distribution registry into the cluster, and we made an API resource to represent a repository, so things that made it easier to get familiar with Kubernetes.
And it was funny because most of the complaints were, well, this isn't high level enough, followed immediately by, well, but I want to do this one really specific thing down at the low level. So for a long time-- I'd say the first couple of years of OpenShift, the third version, the one based on Kubernetes-- it was always, well, you know what? You're in the middle; you're not at the top, and you're not at the bottom. Why can't you meet me exactly where I am? And I'm like, we just haven't gotten there yet. And I think that's where we kind of are today, as these sorts of things have started to come into the community over the last several years.
ADAM GLICK: That's one of the questions as you see projects and, indeed, sometimes companies evolve: there are two methods. Do you start with something that's incredibly complex, and then figure out how to make it simpler and address more people with it? Or do you start with something that's so easy everyone can use it, but it only fits a small section of the use cases, and then expand it out over time?
Generally, when I think of startups, they start with do one thing really well and then kind of grow from there. Kubernetes seems to have been the opposite of do a whole bunch of stuff, and then worry about simplifying it after it's all grown.
CLAYTON COLEMAN: This is a really interesting tension, I think, in Kubernetes, which is there were a lot of folks coming into the project who had some good gut feels about what the right level of abstraction was. And I certainly can't say that we got it all right. Tim Hockin did a talk about how we have the basis of a service mesh already in Kubernetes. Maybe we should take that a little bit further.
And I think we've always kind of had that idea. We had all these ideas. We wanted to build something huge. We also knew that we couldn't build everything and that we weren't going to get it all right. So there is an interesting tension at the early days of Kubernetes of trying to make just enough that you could do almost anything you wanted.
But obviously, with great power comes great responsibility. You can create some uncontainable messes for yourself. And as we've seen, a lot of those core Kubernetes primitives have evolved slowly. Pods haven't changed terribly. Services haven't. But we've added new community concepts like Istio, Knative, the idea of operators.
And these are kind of naturally emergent characteristics of how Kubernetes got built, that idea in Kubernetes that you can have these API objects, and then you've got some really dumb and almost simplistic processes like, is this true? No, fix it. Is this true? If not, fix it. And it's literally the simplest thing that could possibly work when you have to keep two systems in sync.
And surprisingly, we've made a huge amount of progress with that in the Kubernetes community. Most people who integrate with Kubernetes build a controller of some sort, that core idea of just a loop that says, is this true? Fix it. Is this true?
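That "is this true? Fix it" loop is worth seeing in miniature. A level-triggered reconcile compares desired state against observed state and issues whatever operations close the gap. This is a deliberately tiny sketch, not any real client library API:

```python
def reconcile(desired, actual, create, delete):
    """One pass of the controller pattern: make `actual` look like
    `desired` by creating what's missing and deleting what's extra."""
    for name in desired - actual:
        create(name)   # missing from the cluster: make it exist
    for name in actual - desired:
        delete(name)   # present but unwanted: remove it

# Model a cluster as a bare set of pod names.
desired = {"pod-a", "pod-b", "pod-c"}
cluster = {"pod-a", "pod-x"}
reconcile(desired, cluster, cluster.add, cluster.discard)
```

Running the pass again is a no-op, and that idempotence is the whole trick: the loop never cares how the state drifted, only what the gap is now.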
And operationally, that came with huge advantages. If something went wrong in the early days of the kubelet, all you had to do was go delete your pod, and the replication controller or replica set that replaced it would create a new one. And so I think that as we've gone on, we weren't too simple, but we weren't too complex. Certainly, Kubernetes' challenge today is there's a lot of complexity that we have to help people manage.
CRAIG BOX: Before those extension mechanisms existed, OpenShift was effectively a single binary that compiled in the Kubernetes of the time and added on its own pieces. How did you decide which bits of work to do in what is now upstream Kubernetes and which bits do in OpenShift?
CLAYTON COLEMAN: This was a really interesting one. So in the very first days leading up to Kubernetes 1.0, we made that decision in Kubernetes that Kubernetes was not a platform as a service. And Brian Grant actually has a line in one of the docs for Kubernetes that says, Kubernetes is not platform as a service, and I helped get that written in. We were kind of having this discussion about we can't do too much in Kubernetes, or it'll just collapse under its own weight. How do we draw the line?
So we said, we want Kubernetes to be a great platform for you to deploy applications on top of. It was kind of focused for a single user or small set of users. It has some really core concepts that were supposed to work well.
And in OpenShift, we said, we'll help build and prototype and try out some of the ideas we're discussing, some of which were specific to building applications. Builds never really were something we ever talked about adding to Kubernetes. Every couple of months, someone opens an issue in Kubernetes and is like, hey, can you guys add builds to Kubernetes? And we have to go in and be like, see issue 2.
CRAIG BOX: And then go look at Tekton.
CLAYTON COLEMAN: And then go look at Tekton.
ADAM GLICK: I was about to say, but then you see lots of CI/CD pipeline tools coming about out there. Tekton is out there now as well. Build is certainly becoming, if not baked into the core of Kubernetes, certainly something that sits around the periphery of it and fairly close to it in terms of how people use it.
CLAYTON COLEMAN: Absolutely, things like Draft and things like Skaffold. These projects were iterations, and I think this is the greatest part about the open source Kubernetes ecosystem: everybody tries something in a little bit of a different way. And in OpenShift, we tried some things some ways. And they gave us context, actually, to go back and help inform what would become features in Kubernetes.
So for instance, we used our experience with deployment configs, which were a very early form of deployments in OpenShift, to actually help review and add comments. And some of the OpenShift engineers at Red Hat who had worked on deployment configs actually helped write the deployment controller.
Ingress-- Ingress is just now maybe almost possibly going to get to GA. And in OpenShift, we were like, no, no, no. You have to have that. It's table stakes for a platform as a service. You have to be able to do Ingress. And so we did a very early version of Ingress called Routes, and then Ingress came along in Kubernetes. We provided feedback, and some of the core code that makes up the OpenShift Ingress controller ended up making up some of the core code of the initial NGINX and HAProxy Ingress controller implementations.
It's funny because now it's come full circle. The Contour project had a resource that they call an IngressRoute, so I like to think that there's a natural progression there-- just like the Germans, we'll keep tacking nouns onto the end of our Kubernetes objects. And Kubernetes has a well-known German bias with all the K's. So I just want you to know that.
But those features-- I like to think OpenShift helped. We had to do something to support the users who wanted to do enterprise platform-as-a-service-style use cases. We did a lot of work with multi-tenancy. So RBAC started in OpenShift-- role-based access control, which everybody hates to turn on because then they can't do everything they want on their cluster.
ADAM GLICK: It's kind of like every developer, that anytime something goes wrong, what do you do? You open up the firewall, you change all the permissions.
CRAIG BOX: Get rid of SELinux.
CLAYTON COLEMAN: Turn off SELinux. Don't let Dan Walsh hear this. And RBAC was taken on by a CoreOS engineer, actually, and we went through and did a second revision of the API to really take feedback in. And then OpenShift actually updated our version so that, as extension mechanisms in Kubernetes evolved-- some of which we helped build so that we could actually make that transition-- we made our API objects be thin shims on top of the Kubernetes ones. So it's been this very intertwined evolution. And I actually like it when we can get real concrete experience and then turn that around into something that goes into Kubernetes.
These days, you don't need to add things to Kubernetes. We have custom resources.
ADAM GLICK: There are interfaces, essentially, for everything to plug in.
CLAYTON COLEMAN: That's right, the plug-in interfaces. And for us, it was because we had gotten involved so early with Kubernetes, we needed to kind of get away from that giant monolithic approach. And so we did the same thing that we've been telling customers to do. You need to break your monoliths up into microservices. And so there's a number of OpenShift folks who'd been working for a really long time who were overjoyed when we broke the monolith into its component pieces, and that's continued.
And the moment we do that, then we run into new problems like, hey, wouldn't it be really nice if there was something that just made all this stuff just work? And so after the acquisition of CoreOS, we really bought into that operator model so that we could hide some of those details because it's no good to give someone microservices if they don't keep working. You have to make that just something you don't even have to think about.
CRAIG BOX: That sounds like a good time to transition to the fact that you are here today with us wearing a CoreOS t-shirt.
CLAYTON COLEMAN: That's right.
CRAIG BOX: Obviously, in the podcast, you have to just imagine what that might look like or check our Twitter feed for a picture. But what can you tell us about that acquisition?
CLAYTON COLEMAN: I've worked with the folks at CoreOS practically from the first days of Kubernetes-- even before that, from the first days of Docker. And the CoreOS mindset, in a lot of ways, was a really interesting evolution of the same spirit that Red Hat and Red Hat Enterprise Linux were built around, which was the idea that open source matters, that we have to make it easy for people to own and control their own software. A huge part of the initial open source movement was that everybody has a right to see the software that's running their lives.
And the CoreOS guys took a unique spin on it: how do we secure all the software in the world? It has to be easy to update and it has to be reliable. There's just going to be too much software out there for all of us to deal with. So CoreOS has always been a key part of the container ecosystem.
We had some early discussions with them, because with their Tectonic Kubernetes distribution, they were exploring some really awesome ideas. They had some deep Prometheus integration. And Prometheus, as we all know, is an amazing solution for ad hoc monitoring. They really spearheaded some of the initial work around making Kubernetes easily monitorable and observable from Prometheus. And so we had a lot of discussions with them.
The acquisition of CoreOS was a huge sea change within Red Hat as well. Looking at the two companies, we did a lot of the same things. We had a lot of similar projects. There's always the tension when you acquire someone that you're going to be like, well, let's just take all that software and throw it out. Or let's throw out all our software and use what they've built.
I want to thank everyone at CoreOS for being part of this for us. We actually looked at it and said, what can we do better? And one of the things that really stood out in our eyes was that CoreOS had this great idea, with Container Linux, of an immutable operating system.
And we're like, hey, we're an operating system company. We know how to do operating systems. And we're like, what if we could take all of the benefits of Container Linux and combine it with all the benefits of Red Hat Enterprise Linux? We can do insane lifecycling, we can do 10 years of support for a particular version of the kernel, we can backport crazy fixes. How do we take some of the unique characteristics of having a kernel engineering team and tie that to the great things that CoreOS pulled together?
We looked at running Kubernetes, and we're like, well, most people are kind of putting a Kubernetes cluster together and then they walk away. How do you keep a Kubernetes cluster running year after year after year after year? How do you update it? How do you make that easy?
So CoreOS had really popularized the idea of taking Kubernetes controllers and custom resources, and they came up with a magical word, the best branding ever, which is they were like, well, the whole goal of this is to make things easy. So let's call them operators because we all know that operators try to make things as easy as possible for their end users.
But those operators, that idea of baking in the domain knowledge of the software being run, was kind of a light bulb for me, because for years in Kubernetes, we kind of made this somebody else's problem. Hey, you can build all the YAML you want. You can sit on giant piles and piles of YAML, and then you just give it to the cluster, and the right thing will happen.
Well, then, the next question is what happens when somebody refactors how that entire component works? Well, obviously, you bring the whole development team in, you have them manage all of the lifecycle. And that works great when it's your microservices. Kubernetes is a microservice system. We make it easy to run microservices. Who runs the microservices so that everybody else doesn't have to? Operators were a great pattern for that.
So we can combine those two ideas and we tried to really take what was great that CoreOS had built and bring that to Kubernetes and to go that step further. And our job is to make it so that operations teams don't think about running Kubernetes or the extensions on top of Kubernetes because Kubernetes is not just this little kernel anymore. It's your Ingress controller. It's your container runtime. It's the logging plugin that you have. It's your monitoring system, it's your monitoring agent. It's all of the things that you need to make a modern Kubernetes system. How do I make that easy?
ADAM GLICK: Technology from CoreOS drove many of the features that went into the latest release of OpenShift 4. What changed and why?
CLAYTON COLEMAN: It really was about wanting to make a Kubernetes distribution that felt like Kubernetes, because the big advantage of Kubernetes is you say, this is what I want you to do, and it goes and happens. And so that was the first step, which was, instead of having config files-- Kubernetes has, like, 800 config flags now. And we're trying to improve that, and you don't have to set all of them.
But a lot of the feedback that we would get from users and customers was like, that's a lot of flags. Which one do I set? And then they would set it, and then they'd forget about it. And then someone upstream would deprecate it, and then that would break their cluster. And they were very unhappy with us.
And so we said, well, why don't we focus on the actual use case that people have? You don't need to set all 75 kubelet parameters. You need to set the important ones, like which feature gates are enabled or not. And we said, OK, well, let's come up with the idea of declarative configuration. A number of other projects have done this in many other scopes, but we wanted the cluster's configuration to just be another custom resource. So you say, hey, I want you to have this feature gate, and we would roll that out to all the machines and to all the API servers and all of the controller managers, automatically. So if you made a mistake, you just change it back to the previous one.
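The idea described here, one cluster-level resource whose changes get rolled out to every component, can be sketched like this. The names (`desired_config`, the component list) are illustrative, not actual OpenShift APIs; a real implementation is a set of Go controllers acting on custom resources.

```python
# A sketch of declarative configuration: one desired config, a controller
# that pushes it to every component. Names here are hypothetical.

desired_config = {"featureGates": {"SomeGate": True}}

# Stand-ins for the components that each hold their own config.
components = {
    "kubelet/node-1": {},
    "kube-apiserver": {},
    "controller-manager": {},
}

def sync_config(desired, components):
    """Push the single desired config to every component; report what changed."""
    changed = []
    for name, current in components.items():
        if current != desired:
            components[name] = dict(desired)  # roll the change out
            changed.append(name)
    return changed

changed = sync_config(desired_config, components)
```

Because the sync is idempotent, undoing a mistake is just editing the one resource back: the same loop then rolls the old value out everywhere.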
So making the cluster the target of declarative config was one. Another one was tying the machine to the cluster. Kubernetes goes on and on about what an awesome containerized system it is, but there's this big secret heart to Kubernetes, which is it's just running on Linux-- and Windows now. And Linux is the thing that actually runs your containers.
Well, you've got to keep Linux updated. So CoreOS had pioneered this with Container Linux, which was it's an immutable Linux operating system. It just gets updated. It happens automatically. And we said, well, how about we go a step further? How about we make a version of Linux that is tied to the exact version of the cluster?
So we think about, hey, I need to go debug something. You can debug pretty far down Kubernetes. You start at the API servers and maybe you go off to a controller, then you go to a node. The kubelet-- is the kubelet working? I don't know. Let's go see. Is it talking to the right container runtime engine? Well, CRI-O's broken.
Well, why is CRI-O broken? CRI-O's calling runc. Well, what's runc calling? runc's calling the kernel. Well, what's the kernel calling? The kernel's calling a storage driver that has a bug in it. OK, let's go fix that storage driver.
How do you make sure that when you fix that, everybody gets it? And for us, it was that idea of saying, well--
ADAM GLICK: I want to know what debugging tool you have that will take you through all of those different things.
CLAYTON COLEMAN: They're called very, very smart engineers that are much smarter than me. And that process of, well, it's just software. It's got to work really well. Why am I thinking about what version of the container runtime I'm running? That's the Kubernetes system's job.
So we said, let's tie a version of the OS, all of the things you need to run containers, and the version of the kubelet, and lock that to a specific version of OpenShift. So if you install OpenShift, no matter where you are, no matter who you are, you will get the exact same version of the container runtime. And it just happens that we roll out software updates with new versions of OpenShift.
And the third one was, how do we make it easy to add more features to Kubernetes? And there's a lot of exciting things going on in this space. I don't want to ever imply that the only interesting things are the things that we pick in OpenShift because what happens is, as a community, as an ecosystem, we build things. Everybody gets a chance to show off and see what works best.
We had worked with CoreOS on their Operator Lifecycle Manager component. And an operator is basically just a controller and a custom resource, and an RBAC rule that lets you actually do those things. And so that component had a UI and had a great experience for managing and maintaining the extensions to Kubernetes. You can put anything in an operator. You can stick a Helm chart into an operator. You can have a three-line loop of bash.
And don't try this, please. No one do this. But you could have kubectl inside of bash, and you could deploy a YAML file. And you could just put that in a loop, and that's an operator. And so we said, well, let's make it easy to install operators. And we worked in the community-- Operator Lifecycle Manager is open source on GitHub. And we worked with OperatorHub and with a number of people in the ecosystem to say, hey, you've got this really great thing that runs Elasticsearch or CockroachDB or Kafka. Wouldn't it be great if anybody could go install that?
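The "kubectl in a bash loop" operator warned against here boils down to re-applying the same manifest forever. A sketch of that pattern, simulated against an in-memory dict so it runs without a cluster (a real version would shell out to `kubectl apply -f`, and the manifest here is made up):

```python
# The simplest possible "operator": re-assert the desired state in a loop.
# fake_cluster stands in for the API server's object store.

manifest = {"kind": "Deployment", "name": "kafka", "replicas": 3}

fake_cluster = {}

def apply(cluster, obj):
    """Idempotent apply: store the object keyed by kind and name."""
    cluster[(obj["kind"], obj["name"])] = obj

# The operator "loop": three passes here; a real one would run forever
# with a sleep between passes.
for _ in range(3):
    apply(fake_cluster, manifest)
```

Even this crude loop has the operator property: if someone deletes the Deployment, the next pass puts it back. What real operators add on top is the domain knowledge -- upgrades, backups, failure recovery -- that a blind re-apply cannot express.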
And so that's really how we got down that path of operators. It's not just about making the basics of the operating system and what's in Kubernetes work. It's about, well, how do you give someone a version of Kafka that works really well and works natively on Kubernetes, where the operations team doesn't have to worry about upgrading it? So that combination of the three-- declarative config, machines tied to the cluster, and Operator Lifecycle Manager-- is really what I think of as the heart of OpenShift for us, just trying to make the world a simpler place for operations teams.
CRAIG BOX: Red Hat was acquired by IBM, citing in large part the experience that Red Hat brings from the Kubernetes space. As the OpenShift team, what was it like to hear that?
CLAYTON COLEMAN: I always like to joke that it was like we're the world's biggest startup and we got a $34 billion exit.
CRAIG BOX: [CHUCKLING] That's not bad.
CLAYTON COLEMAN: And it was fascinating. We're really proud of our work in OpenShift. We're really proud of the work we do in Kubernetes and in Linux and the container runtime space.
And for us, one of the biggest opportunities was IBM has a pretty rich history of open source. IBM was one of the first big companies to really buy in and support Linux. I don't know if anybody remembers--
ADAM GLICK: That was back in the '90s.
CLAYTON COLEMAN: Back in the '90s. Remember those commercials with the little kid with the blond hair creepily staring into the camera about-- that was Linux.
ADAM GLICK: They had Power Linux as their distro back then, if I recall.
CLAYTON COLEMAN: That's right, and Linux runs on the mainframe. Linux runs anywhere. And IBM has had a strong commitment to open source. They saw the way that containers were changing how you deploy software. And so with Red Hat helping to build open source communities and make open source innovation possible, working with folks in the community to find the best software and bring it up, IBM really benefits from that as a partner, bringing its commitment to open source to help the world's largest companies succeed. Because at the end of the day, we're all just trying to run applications. We can talk about platforms all day long. But at the end of the day, somebody somewhere has got to actually go write some code, push it out, and then get on with their lives.
And I think some of what we've tried to do in OpenShift and in Kubernetes and in the open source ecosystem reflects that everybody wants to do this. By working together, we can all make that process just a little bit easier.
ADAM GLICK: Now that Kubernetes is a brand most people know and OpenShift is marketed as Enterprise Kubernetes, is OpenShift still a PaaS?
CLAYTON COLEMAN: This is one of the funniest things: there was a big backlash against PaaS, which was, oh, somebody is going to prevent me from running software the way I want to. And so we always had this really careful line in OpenShift. When we talk to people, they're like, yeah, my developers don't know anything about containers. They don't know anything about service meshes or functions as a service. They want to write some Java code, they want to deploy it, and they want that to be easy. And the operations teams want to sleep at night.
And so we always had this real tension. You talk to some people in the Kubernetes community, and they're like, if you're not running 300 million containers and doing billions of requests a second, don't even bother talking to them. And on the other extreme, you have all the people who may not even know they're touching Kubernetes in their development flows, in their daily lives. They just want to write software to solve their problem.
It's really weird. I feel like platform as a service is a term that is dead. Instead, we just have the platform which is Kubernetes in that ecosystem.
ADAM GLICK: Do you think a CaaS is a different thing?
CLAYTON COLEMAN: Certainly, there's a lot of people out there that would like to tell you that a CaaS is a different thing. I honestly think-- I mean, I'm horribly biased in this, but I think Kubernetes won, and we're living in a Kubernetes world. And you're going to build things on top of Kubernetes. And I bet you if you asked 30 different people, 30 different people would have a different idea about how they want to develop, how they ended up on Kubernetes, what they want to use, how they want to integrate it with the cloud that they're running on or the software that they prefer.
I think that flexibility is actually the best part about Kubernetes, is there is a huge amount of room for anyone to go build these kinds of systems and build the kinds of experience they want, and we're just starting down this path.
CRAIG BOX: Where would you like the OpenShift team to take the Kubernetes world next?
CLAYTON COLEMAN: I think the OpenShift team should follow the Kubernetes world. I think our focus is always going to be on trying to make operators' lives easier. Most of the time, this is software. It should be helping you, not causing you to tear your hair out at 3 o'clock in the morning because somebody's replication controller is broken--
ADAM GLICK: That'll be the marketing tagline right there.
CLAYTON COLEMAN: Don't tear your hair out at 3:00 AM. That's OpenShift. And as much as possible, working within these communities, which is the moment someone has a great idea, we want to take that great idea and make sure that great idea makes everybody's lives easier.
And so KubeCon is a great opportunity for that. I think one of my favorite things is to come to KubeCon, to see everybody in the halls that I haven't seen in six months or a year, and all of us to instantly speak up as like, what is driving you up the wall about Kubernetes? Then we go fix that. That'll keep continuing.
CRAIG BOX: What was the bug you were going to mention?
CLAYTON COLEMAN: A lot of times, I like to think that if you don't actually have to deal with the problem, you don't ever really empathize with your users. And so one of the things with OpenShift 4 is we started running more of OpenShift on top of OpenShift. Well, that was kind of an exciting experience, because we run the control plane as a pod.
Well, if you're running the control plane as a pod, then you need to make sure the pods run really well, and so we started using disruption budgets. Pod disruption budgets are a kind of obscure feature in Kubernetes. You can say, hey, there are supposed to be three of me; never let me get below two. And so this is how you can control, if you're doing a rolling update or an administrator comes along and tries to drain a node, that you don't take down all the pods at once.
Well, we found this awesome bug, which apparently nobody ever really thought about: if you have a disruption budget whose minimum is set to the full three, you can never upgrade your cluster, because you'll be like, hey, we need to go drain this node so we can replace the software. And we noticed that we were getting all these reports of people who were like, yeah, my cluster has been trying to upgrade for like three days now, and nothing's happening. And we actually dug into it, and it was one of the operators-- and names will be withheld to protect the guilty-- one of the operators in the cluster had said, oh, well, we can never be disrupted. And so all of these clusters would get to the point and be like, I can't do anything. I can't drain.
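The bug reduces to simple arithmetic. A PodDisruptionBudget permits a voluntary eviction only when the number of healthy pods exceeds `minAvailable`, so a budget whose minimum equals the replica count leaves zero allowed disruptions and the drain waits forever. A minimal model of that check:

```python
# How many pods a drain may voluntarily evict under a PodDisruptionBudget
# with minAvailable semantics (a simplified model of the real calculation).

def allowed_disruptions(healthy_pods, min_available):
    return max(0, healthy_pods - min_available)

# Sane budget: 3 replicas, keep at least 2 -> one pod may be evicted at a time.
assert allowed_disruptions(3, 2) == 1

# The guilty operator's budget: 3 replicas, keep all 3 -> the drain can
# never evict anything, so the upgrade hangs indefinitely.
assert allowed_disruptions(3, 3) == 0
```

The same zero also appears if pods are already unhealthy: with only 2 healthy pods and `minAvailable: 3`, nothing can be evicted either, which is why an unsatisfiable budget stalls every drain on the cluster.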
So it's this usability experience: if you've never actually sat there and had someone yelling at you because their cluster wasn't upgrading, you don't really empathize with the problem. So what did we do? We said, well, this is obviously something most people should know about. So we went and worked with the Prometheus community and SIG Instrumentation, and we added an alert for anybody using Prometheus and the Kubernetes mixins for Prometheus. You'll get an alert now that says, hey, you've got a pod disruption budget that will never be satisfied.
I don't know about you, but you should probably go yell at the person who created that app. And those small things really give you that-- if you're in there every day, it's this combination of, you look at something and you're like, oh, that's terrible. Well, at least I can fix it. That's something that keeps me coming in every day. I get that sick satisfaction from finding something horrible and being like, well, I can't fix everything, but I can fix this.
CRAIG BOX: Clayton, thank you so much for joining us today.
CLAYTON COLEMAN: It was a pleasure.
CRAIG BOX: You can find Clayton Coleman on Twitter, @smarterclayton.
ADAM GLICK: Thank you for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod, or reach us by email at KubernetesPodcast@google.com.
CRAIG BOX: You can also check out our website at KubernetesPodcast.com, where you'll find transcripts, show notes, and "Minesweeper" implementations. Until next time, take care.
ADAM GLICK: Catch you next week.