#219 February 20, 2024

API Machinery, Chaos and Dishwashers, with Lucas Käldström

Hosts: Abdel Sghiouar, Kaslin Fields

Lucas Käldström is a CNCF Ambassador, Kubernetes contributor and expert. He co-led SIG Cluster Lifecycle, ported Kubernetes to ARM, and shepherded kubeadm from inception to GA. Today Lucas runs three meetup groups in Finland, studies at Aalto University, and, when time allows, contributes to cloud native software as a contractor.

We chatted about Kubernetes API machinery, Chaos, Entropy, and Dishwashers.

Do you have something cool to share? Some questions? Let us know:

News of the week

Weaveworks shuts down its operations

Weaveworks CEO Alexis Richardson's post on LinkedIn

kubetrain.io

Bytedance KubeAdmiral on GitHub

Bytedance KubeAdmiral Announcement on InfoQ

Strimzi joins the CNCF Incubator

Microsoft's new Cost Management tools for Azure

Lucas Käldström

Kubernetes as a dishwasher

Understanding Kubernetes Through Real-World Phenomena and Analogies - Lucas Käldström

Lucas' research thesis

Paper - Large-scale cluster management at Google with Borg

Kubernetes API Machinery

Dr. Stefan Schimanski LinkedIn

KCP - Kubernetes-Like Control Plane

Kubernetes API Conventions

SIG Architecture

Ingress2gateway - Ingress to Gateway Migrator

Promise Theory: Principles and Applications (Mark Burgess, Jan Bergstra)

In Search of Certainty: The Science of Our Information Infrastructure (Mark Burgess)

Sweden Finns

Links from the post-interview chat

Keynote: Reperforming a Nobel Prize Discovery on Kubernetes - Ricardo Rocha & Lukas Heinrich

Why Service Is the Worst API in Kubernetes, & What We’re Doing About It - Tim Hockin

Gateway API TCP Routes

Community-Powered Kubernetes LTS: Ensuring Stability and Compatibility While Driving Innovation - Jeremy Rickard

https://github.com/yannh/kubeconform

ABDEL SGHIOUAR: Hi, and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar.

KASLIN FIELDS: And I'm Kaslin Fields.

[MUSIC PLAYING]

ABDEL SGHIOUAR: In this episode, we chat with Lucas Kaldstrom. Lucas is a senior software engineer at Upbound, a Kubernetes contributor and expert, and a meetup organizer. We talked about API machinery, entropy, Kubernetes, and dishwashers.

KASLIN FIELDS: But first, let's get to the news.

[MUSIC PLAYING]

After almost 10 years in business, Weaveworks is shutting down its operations. CEO Alexis Richardson shared the news in a LinkedIn post. Weaveworks was a UK-based startup that developed multiple popular container tools for GitOps. One of these tools is Flux, which has been a CNCF project since 2019 and is now a graduated project.

In the post, Alexis assured the community that they are working with other companies to ensure Flux continues to be maintained.

ABDEL SGHIOUAR: If you want to make your KubeCon Paris travel more sustainable, KubeTrain is what you are looking for. This year, some community members, with the support of sponsors, got together to book trains to take participants from various European cities to Paris. Check the website, KubeTrain.io, to see if your departure is included and for details on how to secure your tickets.

KASLIN FIELDS: ByteDance, the company behind popular global platforms like TikTok, has open sourced KubeAdmiral, its cluster federation system for Kubernetes. KubeAdmiral is designed to manage multiple clusters with efficiency and effectiveness. KubeAdmiral scales to run more than 10 million pods across dozens of federated Kubernetes clusters.

ABDEL SGHIOUAR: The CNCF technical oversight committee, TOC, has voted to accept Strimzi as a CNCF incubating project. Strimzi is focused on deploying and running Apache Kafka clusters on Kubernetes. Apache Kafka is a leading platform for building event-based microservices architectures and real-time data pipelines, and it is horizontally scalable and fault tolerant by design.

Running Apache Kafka on Kubernetes can be complicated, but Strimzi reduces the complexity by using the operator pattern. This includes the initial installation as well as all day two operations for upgrades and security.

KASLIN FIELDS: Microsoft has released some new cost management tools for Azure. In addition to some updates to pricing for individual services, they've also released some new billing tag capabilities, including the ability to tag billing profiles and invoice sections and use them for cost reporting and analysis by enabling tag inheritance at the billing profile level. And that's the news.

[MUSIC PLAYING]

ABDEL SGHIOUAR: Today, I'm talking to Lucas. Lucas is a CNCF ambassador, Kubernetes contributor and expert, and co-led SIG Cluster Lifecycle. Lucas ported Kubernetes to ARM and shepherded kubeadm from inception to GA. Today, you are running three meetup groups in Finland, which is impressive. You study at Aalto University, which you are going to help us actually pronounce, and, when time allows, you contribute to cloud-native software as a contractor. Welcome to the show, Lucas.

LUCAS KALDSTROM: Thank you very much. Glad to be here.

ABDEL SGHIOUAR: Thank you. Awesome. So how do you-- is it Aalto?

LUCAS KALDSTROM: Aalto.

ABDEL SGHIOUAR: Aalto? OK.

LUCAS KALDSTROM: Aalto, yeah.

ABDEL SGHIOUAR: Yeah, I--

LUCAS KALDSTROM: Yeah, yeah, yeah, yeah, you're doing well. [LAUGHS]

ABDEL SGHIOUAR: Good to-- well, I've been in Sweden for six years, so I had to learn how to pronounce these strange sounds.

LUCAS KALDSTROM: [LAUGHS]

ABDEL SGHIOUAR: All right. Cool, cool. So let's get going. I like to throw curveballs at my guests on the show, just get going strong. So I saw one of your interviews, and you said that Kubernetes is kind of like a dishwasher for servers. Can you help us understand what do you mean by that?

LUCAS KALDSTROM: Absolutely. Absolutely. [LAUGHS] It's quite funny. I wrote my bachelor thesis a couple of years back, wanting to understand why Kubernetes works the way it does. Even for me, having programmed and been part of development for many years, some of the ways we are doing things when we are coding, we take for granted, right? They were handed to us, to some degree. It's like, OK, we do everything in controller loops, and everything is in the API, it's versioned, and all of this kind of stuff.

And I started thinking, like, but really, why is this, then, better than what we're doing before, if better at all. And when I researched this, I found a lot of control theory and lots of other kind of physical-- kind of from the physical world phenomena.

And then one of these analogies-- I tried in the thesis to have a lot of analogies to make it easier to understand, because many of Kubernetes concepts are very abstract, in one sense, or sometimes even quite mathematical. So I wanted to have some easy-to-understand analogy, right?

And I started the phrasing as the universe-- so the second law of thermodynamics states that the universe always goes towards a more entropic or chaotic state, right? So we start from the Big Bang, and it's very ordered. Now look at how chaotic the world is these days, or the whole universe, as such.

And this second law of thermodynamics-- that chaoticness either stays constant or increases-- applies to everything, including information, right? This we can see in everyday life.

And I use the example of, if I go and make lunch for myself, then before, all of my dishes are clean, so everything is great. Everything is highly ordered, just as I want it in my kitchen. Then I make lunch, and I eat it, and afterwards, it's a mess. The kitchen is a mess. And I need the dishwasher to kind of-- or, well, of course, you can do that-- [LAUGHS] manually as well, but now we have technology that helps us with the cleaning-up part.

And Kubernetes is kind of like that. So the dishwasher is a technology that helps us to minimize the chaos in our kitchen, same way Kubernetes helps us minimize the chaos of our servers.

ABDEL SGHIOUAR: OK. That's a lot to unpack, starting from the fact that preparing for this interview, I was actually reading a part of your paper that you have-- the dissertation that you have done for your master's, I guess, right?

LUCAS KALDSTROM: Bachelor, even.

ABDEL SGHIOUAR: Bachelor, yeah. And then I was like, wow, the first couple of chapters talk about entropy and the second law of thermodynamics. And I was like, is this a paper about Kubernetes or is this a paper about physics? But it's actually quite cool that you are able to draw these parallels in a way between how Kubernetes is-- I guess how Kubernetes helps make order in software, similar to how a dishwasher helps make order in the kitchen, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: But before we move on, I want to talk about that particular part because I found it interesting. And just to be very precise for people listening to this episode, we're not talking about chaos as in its use in the term chaos engineering, right?

LUCAS KALDSTROM: No.

ABDEL SGHIOUAR: Because the whole point of chaos engineering is to make chaos in an orderly system. But what we are talking about here is the other way around. It's like, how does Kubernetes help with chaos?

LUCAS KALDSTROM: It helps in multiple ways, but the first thing we need to take into account is that in order for us to measure the amount of chaos, we need to know what is order. We need to define order. Because only if we know what is order, we can know the delta between the order where we want to be and the chaos which we probably are in now.

So this can be thought of as two points abstractly on a map, for example. And the distance between them is the chaos. And Kubernetes does this very clearly. So in every API object, we have the specification, which is the desired state we want to be in.

So we say-- we tell Kubernetes that I want to have a web application that is running this version of this container and three replicas, for example. And so we tell what the end state should be. But at the exact time instant when we submit this request, it's not fulfilled, because we always need some time to fulfill the request, and anything can happen in such a big system, as server systems usually are. Then we need to record what the user really wanted, not just what is happening right now.

And Kubernetes does this very clearly. And then we have these controllers, or, as we sometimes also call them, operators, that minimize this delta between the desired state and the actual state. So it drives us continuously towards-- regardless of what state we are in now, we always go towards the state which we want to be in because that's clearly defined by the user.
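To make the desired-versus-actual idea concrete, here is a minimal sketch of a reconciliation loop in plain Go. The DesiredState and ObservedState types and the replica logic are invented for illustration; this is not Kubernetes' actual controller code, just the shape of the idea Lucas describes.

```go
package main

import (
	"fmt"
	"time"
)

// DesiredState is what the user asked for (the "spec").
type DesiredState struct {
	Image    string
	Replicas int
}

// ObservedState is what actually exists right now (the "status").
type ObservedState struct {
	RunningReplicas int
}

// reconcile takes one small step toward the desired state and reports
// whether anything still had to change.
func reconcile(spec DesiredState, status *ObservedState) bool {
	switch {
	case status.RunningReplicas < spec.Replicas:
		status.RunningReplicas++ // pretend we started one more copy of spec.Image
		return true
	case status.RunningReplicas > spec.Replicas:
		status.RunningReplicas-- // pretend we stopped one copy
		return true
	default:
		return false // no delta left: desired == observed
	}
}

func main() {
	spec := DesiredState{Image: "web:1.0", Replicas: 3}
	status := ObservedState{}

	// The loop never does the whole job in one shot; it keeps nudging the
	// observed state toward the spec, which is why failures along the way
	// are tolerable: the next iteration simply sees a bigger delta.
	for reconcile(spec, &status) {
		fmt.Printf("observed %d/%d replicas\n", status.RunningReplicas, spec.Replicas)
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Println("converged")
}
```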

ABDEL SGHIOUAR: And that's essentially what we usually refer to as the reconciliation loop, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: And I find it super interesting that-- well, we're going to talk about this a little bit more in detail, but this whole control loop theory or control loop-- not theory, technically, like the control loop implementation of Kubernetes is not novel, in a way. It's not like something-- it's existed in other fields, not even in the tech field.

LUCAS KALDSTROM: Definitely, definitely.

ABDEL SGHIOUAR: So we're going to talk about this, but before I move on-- so there is one thing that was interesting in your paper, which was mentioned in a paper that was published by Google about-- was it about Borg? I think it was about Borg, right?

LUCAS KALDSTROM: Yeah

ABDEL SGHIOUAR: Yeah, so it was probably one of the first publications that described Borg, which is what gave birth to Kubernetes.

LUCAS KALDSTROM: Yes.

ABDEL SGHIOUAR: And when I'm explaining this to people, I always say that's about as much similarity as you can describe between Kubernetes and Borg, because they are vastly different in terms of how they're implemented. It's just that the idea came from Borg, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: And in this paper, there is this mention that says failures are the norm in a large-scale system, or large-scale systems, with an S at the end. So why do you think it was phrased-- like, what's your take about this particular mention?

LUCAS KALDSTROM: It is really interesting. Once I started thinking and researching these topics, I've had to take multiple new courses. So I'm at university now, just starting my master's after the bachelor-- courses such as thermodynamics, probability and statistics, and control theory. Actually, I wish I had taken control theory before writing the thesis because-- [LAUGHS] it went the other way. So I did the thesis, saw that, oh, wow, there's a lot of control theory I would need to know here. But only after that had a chance to take the course.

And when doing all of these, probability is one of them that is really hard to wrap your head around. I can't say-- that goes for any of these topics. Like, they're so deep, all of them, that I don't think-- probably no one can comprehend them fully.

But with probability, like-- we always talk about SLAs or SLOs and all that kind of stuff and how many nines you have there. But even a really quite good rate, like, I don't know, 1 in 10,000 of something you're doing failing, sounds good. But if you do it 10 times a day for a year or two or three years, then quite soon, the probability of none of these things failing will be very low, actually.

So, like, when you start taking this 0.9999 to the power of 10,000-- that's the chance that, for all of the cases where you have executed this command or something like that, you have dodged the risk of there being one of these failures.

So in a large-scale system like Google's, the exponent-- even though we have really reliable things, software or hardware or systems in general, just when you raise the power high enough, to tens of thousands or more, you will start seeing these failures.
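A quick back-of-the-envelope calculation illustrates the point: even a 1-in-10,000 failure rate becomes a near-certainty once the operation is repeated enough times. This is plain arithmetic in Go, with the rate and counts chosen only as an example.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	p := 1.0 / 10000.0 // probability that a single operation fails
	for _, n := range []float64{100, 1000, 10000, 100000} {
		// Probability that at least one of n independent operations fails.
		atLeastOne := 1 - math.Pow(1-p, n)
		fmt.Printf("n = %6.0f  P(at least one failure) = %.3f\n", n, atLeastOne)
	}
}
```

Already at 10,000 repetitions the chance of hitting at least one failure is about 63%, and at 100,000 it is effectively certain.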

And it's also kind of documented that even though we have these control loops that fix the failures-- you know, like, for example, one of the earliest things about Kubernetes was that, OK, you have 10 servers or a hundred servers. You run the web applications on them. And then even though you-- maybe a third of your servers go down, Kubernetes will kind of heal them automatically and reschedule. And that's all good.

But sometimes, even though these control loops really fast try to fix all of the problems, maybe the problems arise with such speed that you will never get to zero. Usually, in some of those systems, errors just arise as fast as you have time to fix them. And that can easily be the case.

It's quite interesting to think about this because one of the-- when people say that Kubernetes is complex, a lot of these basic terminologies or these, I don't know, philosophies are from other fields and quite well known. Not too complex at the heart, but the implementation becomes complex, of course.

ABDEL SGHIOUAR: Yes.

LUCAS KALDSTROM: So then if you don't understand the philosophy, it's hard to know what is the system doing now. But many small-scale users have never seen this-- including myself before Kubernetes, have not seen this kind of rate of failures because it seems to-- you've never hit that 1 in 100,000, or something like that yet, so you think it will never happen, right?

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: And then you think that a system like Kubernetes that fixes and takes these failures into account is like, oh, I don't need that. My systems will never fail, right? [LAUGHS]

ABDEL SGHIOUAR: Yes.

LUCAS KALDSTROM: You just have not run the thing long enough.

ABDEL SGHIOUAR: I'm super happy to have this conversation because I feel like you are probably one of the few people that think the same way I think about Kubernetes, in the sense that I'm coming from a world where Kubernetes did not exist. And so the only options back in those days were things like Ansible and Chef and Puppet and the way you would orchestrate servers, or even before that, just bash, just pure bash, right? So--

LUCAS KALDSTROM: We've all been there.

ABDEL SGHIOUAR: It's--

[LAUGHTER]

Yeah, exactly. So it's very hard, actually, for a lot of folks to think about what Kubernetes brings to the table in terms of making certain things easier. But then the way I always think about it is, in our industry, the more abstractions you build to make things easier, the more problems you introduce, right? That's just an inherent effect, I would say.

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: So before I move on, I want to talk about something. Like, we talked about control theory a lot. What's control theory? Now that you took the course.

LUCAS KALDSTROM: Yes, yes. [LAUGHS] I did. So the kind of easiest way to explain control theory from a practical perspective is that you have the cruise control in your car. And you want to-- you have a road, and say it's-- well, here in Finland, it's kilometers per hour.

ABDEL SGHIOUAR: Yes.

LUCAS KALDSTROM: I think in Sweden as well, but--

ABDEL SGHIOUAR: That's good because most of our audience are actually American, so it's kind of--

LUCAS KALDSTROM: [LAUGHS] So say that in normal cases, maybe 80 kilometers an hour. And of course, you can imperatively press the gas pedal to do something, and now I want more speed or I want this speed. But as the environment changes around you, you know that-- in Finland, there's almost no hills, but if we take that aside, if we're in Norway--

ABDEL SGHIOUAR: Yeah. [LAUGHS]

LUCAS KALDSTROM: --then as the-- in order to keep the 80 kilometers an hour, you can't press the gas pedal the same amount all the time if you're going up a hill or going down the hill. And if you're going down the hill, most likely, you need to brake.

And this is kind of-- the cruise control is then a controller, they also call it that, that sits between you and the engine. So it kind of looks at what is the current speed-- well, say, maybe 75, and what is the desired speed? OK, 80.

OK, I have a delta here. And depending on what that delta is, it will take different actions, like mathematically take different actions and press the gas pedal in a different, then, amount. And if the delta is large, say you're going 40 kilometers an hour and you want to do 80 still, you will press the gas pedal much more than if you're-- the delta is 5 or 1 or 0.1, right? Because if you press it too much, then you will go over, and then you need to brake or stop, stop pressing the gas pedal at all.

So it's the kind of science of how much should we press the gas pedal in order to optimally get to the speed that we want, but also be able to have stability in the process. Because, you know, of course, if we press it way too much or too little, we won't have any stability and, I don't know, our car will go into the ditch or something.
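A toy proportional controller captures the cruise-control analogy: the further the measured speed is from the setpoint, the harder the pedal is pressed. The gain and the crude car model below are made-up numbers for illustration, not how a real cruise control (or a Kubernetes controller) is tuned.

```go
package main

import "fmt"

func main() {
	const (
		setpoint = 80.0 // desired speed in km/h
		gain     = 0.5  // proportional gain: how aggressively we react to the delta
	)
	speed := 40.0 // current measured speed

	for step := 0; step < 10; step++ {
		delta := setpoint - speed // how far we still are from the desired state
		throttle := gain * delta  // big delta: press more; negative delta: brake
		speed += throttle * 0.5   // crude stand-in for how the car responds to the pedal
		fmt.Printf("step %d: throttle %.1f, speed %.1f km/h\n", step, throttle, speed)
	}
}
```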

And what I found really interesting about the control theory course is that control theory is what's driving the world in all of these analog, physical systems. Airplanes would not fly if they weren't controlled this way, and many other things.

And the key in control theory is to measure stability, because it's better to have-- say, in an airplane, you want to have stability over performance. Actually, this analogy I'm taking from Mark Burgess' book, "In Search of Certainty," where he says if you're riding in a passenger plane from Helsinki to Paris for KubeCon, then I guess the customer definitely wants to have the stability.

I don't need to get there in one hour. I can wait two or three. Much more interested in getting there. But in a fighter jet or something like that, you want to have the controllability instead or some kind of fast response and ability to do a lot of quick turns.

So there's different design decisions, but the fact that in analog control theory, in the normal control theory, you can calculate that-- will this controller be stable or not, just by examining some mathematical properties. I think it's really interesting because we don't have that for our software controllers.

So say that you apply two storage controllers to your Kubernetes cluster, or two ingress controllers or something like that-- well, two ingress controllers is a good example, because if you forgot to set which controller this ingress belongs to, and both of them assume that it's them that should be acting on it, then they start fighting. And one of them goes like, oh, I'm putting this state, and I'm doing these other changes in the cluster, and all of that.

We don't have this way of making sure that-- is this combination of controllers stable or not. So that is kind of my key takeaway from control theory, like what I learned that we would need. [LAUGHS]

ABDEL SGHIOUAR: That's typically what we would refer to as a race condition, right? So you would have two things trying to act on the same object at the same time, and they would just overwrite each other's changes.

And I think in the way you explained it, I find something very interesting. It's the fact that, like a cruise controller inside a car, the gas pedal is not the only input that the controller has toward the engine. It has other inputs that it can use.

And this is kind of fairly similar to how Kubernetes itself actually works, because if you're trying to deploy a pod or a bunch of pods, the controller has multiple things it uses at the same time, right? Like, you have the scheduler, which tells it where to run what, and then you have the kubelet itself, and then the kubelet itself has its own logic of, are there enough resources in this place? And then you have the thing that assigns the IP address and the thing that generates the name.

So it's like multiple sorts of API calls, if you want, under the hood that you don't see because it's all abstracted away from-- by a simple operation, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: So drawing the parallels between control theory and Kubernetes, I think, is a super interesting thing. So I want to move on to something else. So in your paper, there was something said-- operators bring a novel programming model. Why is that?

LUCAS KALDSTROM: So to talk first about what operators are, for those that might not have heard of them-- essentially, the term was coined by CoreOS, or Brandon Philips, in 2016, I think, or '17, something like that, back in the day. Kubernetes-- the API server started off as just any normal API server. You can list and edit the resources that Kubernetes knows about.

But then the founders of Kubernetes quite-- and some other community people quite early realized that this control mechanism is much more widely applicable than just for containers and just for Kubernetes' own needs. And the fact that you can build these abstraction layers on top and make this kind of app store on top of Kubernetes with all kinds of other things of automation really was enabled by the fact that Kubernetes introduced custom resource definitions.

And custom resource definitions let you register your own types or data schemas, your own REST resources, if you want, into the core Kubernetes API server, sharing the same database that core Kubernetes does, but then still you're in control of what data is there and how it is managed and all of that.

So upon realizing that, we now can control much more advanced things than just pods and deployments and replica sets and that kind of stuff. That is one type of abstraction. But for some things, like databases being the best example, it's not enough. When doing a rolling upgrade, we can't just kill all of the-- or step by step, just kill all of the etcd, for example, workers or Postgres or whatever database you have, and hope for the best or just hope that they come back up with consistent data in store.

So the etcd operator was the first one, which CoreOS then made when they coined the term, being this kind of controller of that application, taking the application semantics into account. So taking into account that we need to do backups, we need to be really careful about upgrades, and all of this kind of stuff, but while still allowing the end user of this automation to be declarative, so working in exactly the same way as Kubernetes does itself.

So with the etcd operator, we can control it by doing a normal kubectl apply of the declarative end state of the schema. So first, your end state is maybe, I want to have etcd 3.1 or something, and then after that, when you want to upgrade to 3.2, you just change that desired state. And the update won't immediately just kill all of the etcds and put 3.2 there. It will take it step by step, making sure that all of the data is consistent during this upgrade, and maybe roll back if that doesn't happen.

So then this kind of generalized paradigm of controlling some infrastructure-- we don't know-- you know, it's not specific to etcd. It's not specific to databases. It's not specific to anything, really, but some type of infrastructure somewhere that we want to keep in sync. And we keep them in sync with some kind of controller loop, just as Kubernetes does. Kubernetes has no internal APIs. All of them are open to integrate with. So we can just make all of these similar things for our own thing.

So cert-manager is a nice one, kind of automating the provisioning of Let's Encrypt certificates from the Let's Encrypt server, right? And this brings this kind of novel programming model in the sense that before, if I'm writing a bash script, I don't so often-- because it's not part of the programming model-- check what the desired state is, or even separate the desired state and the actual state.

Sometimes, we just, in a bash script, push the-- only do the actions. Sometimes we don't need to do the action because we're already at the end state. But sometimes, we need to take in lots of other things into account before we even can start taking that specific action. So it kind of really brings a different mindset to this.

And also, the fact that operators continuously check the state-- say you're running some etcd operator or just any deployment in Kubernetes. That deployment controller will actually check, say, every minute or at some other user-specified interval, that the deployment actually is still there and is still correctly configured and all of that.

Previously, at best, maybe once a day we checked that things are consistent with Ansible, for example, but even that was a stretch. The usual example I'm thinking of being certificates, OK? So we have cert-manager now renewing, checking everything maybe every five minutes or every 10 minutes or every hour or something. Previously, maybe we did certificate rotation at best once a year, when it had just expired or the day before.

ABDEL SGHIOUAR: Or about to be expired.

LUCAS KALDSTROM: Right. And by then, we had forgotten how we did it, right?

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: [LAUGHS] The last time. So we can think about this as being in constant motion. We always make sure the desired state is the actual state because, as we said, otherwise, we will have too many failures. Otherwise, the system won't be stable in total, because the failures will take over and will drive it into an irrecoverable state. So that's kind of why.
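As a rough illustration of the application-aware behavior Lucas describes for something like the etcd operator, here is a sketch in plain Go that upgrades one member at a time and only while things look healthy. The Cluster type, member names, and health check are hypothetical stand-ins; a real operator would watch a custom resource and talk to the actual database members.

```go
package main

import "fmt"

// Cluster is a made-up custom-resource-like object: the desired version is the
// "spec", the per-member versions are the "status".
type Cluster struct {
	DesiredVersion string
	Members        map[string]string // member name -> version it is currently running
}

// healthy is a stand-in for "did this member come back up with consistent data?".
func healthy(member string) bool { return true }

// reconcileUpgrade upgrades at most one member per call, and only while the
// members look healthy: the opposite of "kill everything and hope for the best".
func reconcileUpgrade(c *Cluster) bool {
	for name, version := range c.Members {
		if version == c.DesiredVersion {
			continue
		}
		if !healthy(name) {
			fmt.Printf("%s unhealthy, pausing upgrade (a real operator might back off or roll back)\n", name)
			return false
		}
		fmt.Printf("upgrading %s: %s -> %s\n", name, version, c.DesiredVersion)
		c.Members[name] = c.DesiredVersion
		return true // one step at a time; the next reconcile handles the rest
	}
	return false // everything already matches the desired state
}

func main() {
	c := &Cluster{
		DesiredVersion: "3.2",
		Members:        map[string]string{"etcd-0": "3.1", "etcd-1": "3.1", "etcd-2": "3.1"},
	}
	for reconcileUpgrade(c) {
	}
	fmt.Println("all members at", c.DesiredVersion)
}
```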

ABDEL SGHIOUAR: I also believe that this operator model in Kubernetes is a super powerful thing, especially that you can extend it, right? You can even build your own controllers and your own operators, and you can just build things that make sense for your organization slash application, right? It's quite interesting.

I want to talk about something that's very interesting. I don't know how many people have heard of this, but there is this term, API machinery, in the Kubernetes world. I am not sure if it's actually new to Kubernetes or if it existed before. What is API machinery? What do people mean when they say API machinery?

LUCAS KALDSTROM: It's, in my opinion, maybe one of the key, if not the key, inventions or innovations of Kubernetes. We've talked about how a lot of the API machinery kind of builds upon this thought of control theory and all of these things, but I'm really interested in the kind of-- if we would split Kubernetes in two, the lower part-- you had a previous discussion with Phil about containerd and Docker, right?

ABDEL SGHIOUAR: Mm-hmm.

LUCAS KALDSTROM: There, we managed to split Docker-- well, in three, technically, but let's say in two, with containerd and Docker itself. So I'm thinking a little bit similarly about Kubernetes, the lower half being the API machinery, all of these controller patterns, all of the best-practice ways of organizing the system, and then getting these generic control planes-- like, again, control theory exists everywhere else as well. And this is the software framework for control theory, right?

But then on top of that is the actual container stuff. But we don't need the container stuff to do all of the other things. Like, we can-- Stefan Schimanski, one of the leading persons in the development of custom resource definitions, for example, has also been very active in SIG API Machinery in Kubernetes, and was also my advisor for the bachelor thesis, so I had a really good advisor.

He also made this KCP project. It doesn't stand for anything, but yeah, KCP, which is essentially Kubernetes without any of the core resources. It's just an API server, Kubernetes API server, but it doesn't have any pods.

You can use kubectl as much as you want. You can apply. You can do kubectl and get config maps, secrets, namespaces, all of that kind of stuff. But all of the container-specific things-- they are removed, right? So there's no deployments, no pods, no nothing, such. And this is a really interesting split, and a split which I'm hopefully going to spend some more time working on in the coming years.

And this bottom part, the API machinery, taking the APIs, for example, it's really about making sure that the client and the server speak a common language. And then this issue of communication, we can say, is a normal one from everyday life. Like, if two people don't speak the same languages, maybe they can communicate some way, with body language and stuff, but it's very hard to communicate with spoken language, then.

And the same way for programs. Without a common way to understand what we're talking about when dealing with different programs, it's very hard to actually get anything done. So API machinery-- one of the innovations was to actually be very specific and explicit about what it is that we're describing.

So if we specify a pod, for example, we actually have to write kind pod, kind equals pod, into that specification. So we tell Kubernetes, or any reader of this JSON object, that this is-- now I want to describe a pod.

Technically, I could fuck up the schema as much as I wanted, and it will just be an invalid pod, but at least the reader knows that I meant to write a pod, maybe, although I wrote something completely different, right?

And this explicitness is actually really hard to find in many other systems. If you go to most other cloud APIs, for example, and then you shoot a JSON object, some kind of payload, to some endpoint, you usually don't actually specify what it is that you meant. So it's kind of easy to mix up these different payloads, these different JSONs that you throw to different places. And sometimes, both you and the server you're trying to communicate with will be very confused about what you actually meant.

In some other, previous systems, we've tried to do-- use intrinsics, right? So we kind of, oh, well, if this field-- if a field with this name is here, then I know it describes a VM. If this other field is here, then I know we meant a VM disk, right?

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: And these other things. But that also goes haywire at some point when you have a lot of similarly looking JSON objects. So that is one of the things-- being really explicit about what we mean when we say something. And this allows us to do generic decoders.

So we can, for example, kubectl apply. We can just apply the whole folder of a lot of different JSONs and YAMLs and whatnot because we know when the server reads them, it knows that we don't have to rely on the file having a specific name, which is also another kind of commonly used intrinsic that, OK, if this file is called deployment.yaml, we know it's a deployment. Now we can call it anything, can come from anywhere, because we actually specify it in that same schema.

So that is one. Then the API version-- even though it would be a deployment, how do we know which-- what kind of fields are allowed and what kind of schema of that object we should use for decoding to interpret and understand this?

Like, as many people know, business requirements are never static, right? They always evolve. And that means that sometimes we need to deprecate. We need to remove old things. We need to-- stuff that we thought was a Boolean, true/false thing, like enabled true/false, becomes easily over time like, yeah, it should be enabled on weekends, but not on Saturdays between 5:00 to 10:00, you know? This kind of-- when we designed the system first time, we just had an enable true/false. It evolved into a massive subobject of when should this thing be enabled, right?

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: And for that, we need a kind of schema evolution of that object. So also encoding the version very explicitly-- we have the API version field equals V1, V2, whatever. And if we want to tell customers that this is a kind of unstable API, we add maybe alpha, V1 alpha 1, V1 alpha 2, and keep evolving them like this. Having those two there is really important.

Then, of course, you can never fully anticipate your own growth, so it's kind of funny that many people ask why the API version field actually has some other thing in it that is not really a version-- it's a group. It's a DNS thing, right?

ABDEL SGHIOUAR: Yes.

LUCAS KALDSTROM: Where does that come from? And that actually was not anticipated before we added these custom resource definitions and learned more about the system-- that we would actually need to specify not only what kind this is, but also what group of APIs, as a holistic set of resources, it belongs to. So if we would design it today, maybe we would have what we use in Kubernetes internally, which is group, version, kind. That fully qualifies one resource, one payload, if you like.

But yeah, so we added it into the API version. But it could have been made into its own field as well. But these describe just the type itself. Then to describe the object, that's also consistent for all of them-- so metadata, and then we have name. We have namespace. We have all kinds of labels and annotations. These are common for each and every object.

So we can, again, make generic clients, generic interpreters that make sense, organize these resources based on that. So in this case, like, name and namespace are the primary key, if you want, of the resource. And by having the primary key have a well-defined place, it's much easier to do everything else with this-- these resources because in other systems, if the primary key sometimes is called ID, sometimes primary key, and sometimes name, and sometimes whatever else, it's very hard to make something generic that actually can organize the things just-- the files just by name.
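A small sketch of why the explicit apiVersion, kind, and metadata fields enable generic tooling: without knowing anything about the concrete type, a client can still tell what a payload claims to be and how to identify it. The struct below mirrors the shape of Kubernetes' TypeMeta and ObjectMeta but is written from scratch here; it is not the real API machinery code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// typeAndIdentity is the part every Kubernetes-style object shares:
// what it claims to be, and how to identify it.
type typeAndIdentity struct {
	APIVersion string `json:"apiVersion"` // group/version, e.g. "apps/v1"
	Kind       string `json:"kind"`       // what the sender meant this payload to be
	Metadata   struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"metadata"`
}

func main() {
	payloads := []string{
		`{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"name":"web","namespace":"prod"}}`,
		`{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"settings","namespace":"prod"}}`,
	}
	for _, p := range payloads {
		var obj typeAndIdentity
		if err := json.Unmarshal([]byte(p), &obj); err != nil {
			fmt.Println("not a decodable object:", err)
			continue
		}
		// A generic client can route, store, or display the object from this
		// information alone, regardless of file name or endpoint.
		fmt.Printf("%s %s: %s/%s\n", obj.APIVersion, obj.Kind, obj.Metadata.Namespace, obj.Metadata.Name)
	}
}
```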

ABDEL SGHIOUAR: Yeah. I actually have a followup question. I have two followup questions. We'll get to the other one. I think we didn't really agree on that question yet, but I will-- I'm super curious to hear your thoughts. But before we get there-- so describing all the stuff you've been describing, about how the APIs are managing groups and stuff like that, yet when you are a user of Kubernetes, it usually looks and feels very consistent across all the objects or across, at least, most of the objects, right?

Maybe ingress is the outlier there because of all the annotations that have been added over time, but-- so considering that the Kubernetes project itself is split into SIGs, special interest groups, how is it possible that the user experience, the UX, is still consistent if everything is just built by completely separate sets of people?

LUCAS KALDSTROM: Yes. And of course, having been a core developer on the project, it's a lot of work that goes into giving that feeling.

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: [LAUGHS] And also lots of communications in Slack. But in a couple of different ways, that's possible. Well, firstly, Kubernetes has very strong best practices or API design conventions. So there is an API conventions document that is, if you print it to A4s, maybe 20 pages, I think, of text, so fairly sizable compared to many other projects. And they describe, for example, that you should probably never use Booleans. That is one of the things that they say because realistically, you will probably have to replace it with a string or some enum or just a whole different subfield in the future.

ABDEL SGHIOUAR: Sure.

LUCAS KALDSTROM: And then another thing is, like, don't use floating point numbers, because one of the funny things that some people know, some people don't, is that many floating point numbers or decimal numbers can't be represented exactly by computers because they have finite memory. So then you, as a user, wrote-- I don't know, it doesn't happen for such simple things, but, like, 0.16 or something. Maybe if you're unlucky, that would actually, when coming to the server, become 0.17, or something like that.

ABDEL SGHIOUAR: Oh, yeah.

LUCAS KALDSTROM: And then your intent was lost, right? So you cannot round-trip it. Many things like this are in the API design conventions. The other thing is, there is a specific special interest group called SIG Architecture that tries to govern and enforce these as well as possible. And there's also a set of API reviewers, at least for the core Kubernetes project, that try to make sure the APIs are quite consistent.
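For the floating-point convention specifically, the concern is easy to reproduce with plain Go arithmetic; this is not a Kubernetes code path, just the underlying float64 behavior that motivates the rule.

```go
package main

import "fmt"

func main() {
	a, b := 0.1, 0.2           // the user "meant" 0.1 + 0.2 = 0.3
	sum := a + b               // what float64 arithmetic actually produces at runtime
	fmt.Println(sum == 0.3)    // false: the stored value is not exactly 0.3
	fmt.Printf("%.17g\n", sum) // 0.30000000000000004
}
```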

ABDEL SGHIOUAR: Cool. Got it. So then the other question, and I'm going to throw you another curveball here-- I think we have-- it's kind of started around KubeCon Amsterdam last year, and now it's becoming a thing. Long-term support, LTS, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: We all know that Kubernetes famously has officially one year of support, right? So by the time 1.30 is out this year, 1.27 will not be supported anymore, right, officially. It's like whatever the current version is, minus 2, right? So at 1.30, 1.29 and 1.28 will be supported, and 1.27 will be technically end of life.

Then at KubeCon, one of these big cloud providers went ahead and said, like, OK, we're going to do LTS, Long Term Support. So we're going to charge you more, but we're going to support-- yeah, but to be precise, we're going to do LTS in our managed Kubernetes offering, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: So they announced it. Now the price is out, which is kind of insane, in my opinion. Then there is another cloud provider, which also came out with the same communication. And they are-- it's interesting that they both actually charge the same amount of money.

So basically, it's the control plane. You pay for the control plane. If you are running your managed Kubernetes cluster on a currently supported version of Kubernetes, it's $0.10 per cluster per hour, right? If you run an out-of-support version, it's $0.60, so it's six times the price for the control plane. So what's your take about that? What's your opinion?

LUCAS KALDSTROM: So one of the key things about the API machinery that-- what it lets us do, in theory-- and again, one of my favorite quotes is that, in theory, there is no difference between theory and practice, but in practice, there is a difference.

And in theory, Kubernetes-- this API machinery-- handles all of the decoding, encoding, and also conversion between different versions. So say that you had-- of course, there are many other changes between Kubernetes versions than just the APIs, but APIs are maybe the biggest one to the users, because you can't use some old fields or some new fields in the old APIs, so you need to do some hacks around that. And then when you upgrade, you need to actually take those into account and so on and so forth.

But Kubernetes actually supports this kind of round-trip model, or convert-to-anything model. So even though we released maybe a new version-- well, actually ingress and the gateway API are quite a good example, and you mentioned ingress as well. There is this ingress-to-gateway migrator, if you want, that maps the semantic things about ingress to the gateway equivalent. And kind of in a similar way, you can submit to Kubernetes an old version-- what Kubernetes thinks is old, at least, because it has a newer one now [LAUGHS]-- of that API object.

And internally, Kubernetes will actually convert it and will store it in a version that is some storage version. So say that we have-- the horizontal pod autoscaler is a good one because that actually has two versions, so we can actually talk concretely about that.

So say that you have a user that uses a horizontal pod autoscaler version 1. And then you submit it to Kubernetes. And then it will decide-- that's part of the setting, or API server configuration-- whether it converts that now to the V2 internally, because you can convert between any schema versions of the same object. So say that it converts to V2: it converts your old request into the new version of the schema and stores the new thing into the database of Kubernetes, etcd.

But then if you ask, hey, I just updated a horizontal pod autoscaler V1. Can I see it? Can I look at it? Then it will actually-- although it stored the V2, it will downgrade it on the fly and give you a V1 representation of that, while the controller that operates on the resource and actually autoscales things maybe uses the V2 version against the Kubernetes API server. But they both have the same storage.

So all versions of an API schema that are supported are convertible and so-called round-trippable between each other, and there's this star model so that there's one hub version-- that's the one usually in storage-- and all the other versions must convert to this one without loss of information.

So that's the basic setup of Kubernetes. And it allows us to do upgrades in a nice way. But one interesting thing is that you can only upgrade the storage version when you're sure that you will never downgrade again to a version that doesn't know about this newer version. So if you now store V2 of the HPA, you can't downgrade anymore to a version where V2 of the horizontal pod autoscaler was not a thing.
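Here is a sketch of that hub-and-spoke conversion model in plain Go: two served versions of a made-up autoscaler-like type, each converting only to and from a single hub (storage) version. The types and fields are simplified stand-ins, not the real HPA schemas, and real Kubernetes additionally preserves fields an older version cannot express so the round trip stays lossless; this sketch skips that detail.

```go
package main

import "fmt"

// hub is the storage version: a superset of what every served version can express.
type hub struct {
	Target      string
	MinReplicas int
	MaxReplicas int
}

// v1 and v2 are two served API versions of the "same" object.
// v1 predates the MinReplicas field.
type v1 struct {
	Target      string
	MaxReplicas int
}

type v2 struct {
	Target      string
	MinReplicas int
	MaxReplicas int
}

// Each served version only needs conversions to and from the hub,
// not to every other version (the "star" shape).
func (in v1) toHub() hub {
	// Filling a default here stands in for how real conversions handle
	// fields the old version never had.
	return hub{Target: in.Target, MinReplicas: 1, MaxReplicas: in.MaxReplicas}
}

func v1FromHub(h hub) v1 {
	return v1{Target: h.Target, MaxReplicas: h.MaxReplicas}
}

func (in v2) toHub() hub {
	return hub{Target: in.Target, MinReplicas: in.MinReplicas, MaxReplicas: in.MaxReplicas}
}

func v2FromHub(h hub) v2 {
	return v2{Target: h.Target, MinReplicas: h.MinReplicas, MaxReplicas: h.MaxReplicas}
}

func main() {
	// An old client submits the v1 shape; the server converts and stores the hub version...
	submitted := v1{Target: "web", MaxReplicas: 5}
	stored := submitted.toHub()

	// ...and can answer both old and new clients from that same storage.
	fmt.Printf("v1 view: %+v\n", v1FromHub(stored))
	fmt.Printf("v2 view: %+v\n", v2FromHub(stored))
}
```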

So this is the kind of trickiness about upgrades, because in order to be able to roll back, you cannot move the kind of things you have in storage, but then you need to support the old versions forever. And still, more and more fields are coming all the time, even if the version is not officially upgraded. Kubernetes 1.0 had a pod object of V1, but it had maybe a third of the fields that the V1 pod in Kubernetes 1.30 has, right?

So we can't even go back in practice to that version and expect the pod to behave the same. So upgrades are very hard because of that more-- this kind of organizational reason. Like, how long should we support old APIs? How long should we support behaviors of the past? When can we deprecate fields? When can we do all of these things? And organizational burden of the community maintaining those.

So going back to the cloud providers and all of this, it all boils down to, What is the matrix that we want to support as a community? and, How do we test all of those combinations? Because then also the kubelets, the node agents, can be out of sync by-- I think it's two minor versions?

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: Well, compared to the API server, and taking that thing-- kind of diff into account and if you do skip upgrades from, say, 1.28 to 1.30, how do you make sure that any migrations that you might have along the way actually are preserved? All of these kind of things are making the problem very hard.

But one interesting thing-- when I was talking to Tim Hockin, actually, at the last KubeCon in Chicago, he had a lightning talk about the Service API, and talked about how bloated it has become, that there are so many things you can do with the Service API, it represents so many different things. He probably had 15 examples, at least, in that talk.

And then we were talking about how, well, in practice we have the possibility to do a V2 Service API, because if you have a V1 client, you can still request it-- even though we would store it as a V2, or provide a newer client with a V2 API view of it, we can still support the old V1.

But that becomes more of an organizational problem. That will still break someone, right? Eventually, we will break someone, right? And although we have this perfect conversion mechanism and all of this is technically practical-- [LAUGHS] in the community, it will be such a big change that we can't do it.

So it is a little bit the same thing with the LTS stuff-- how fast can we move, and how many resources do we have in the open source community that actually sustains all of this work, does the testing of all the matrices-- there are so many combinations of everything-- and governs that people don't violate the principles of when, for example, you can move to a new version in your cluster and when you cannot, to preserve downgrade possibilities and all.

So it's very interesting, but I'm interested in seeing how that kind of model of charging six times more--

ABDEL SGHIOUAR: Yes.

LUCAS KALDSTROM: [LAUGHS] --for an old version will be. If that's the case, I would hope that most of that money-- personally, I would hope that most of that kind of excess money or delta money would actually go into sustaining Kubernetes contributions and maybe fund people that could then expand the support window, maybe from two to three releases, or have some kind of support for older versions or LTS versions.

ABDEL SGHIOUAR: Yeah. That was exactly my thoughts about-- all this extra money should somehow make its way back to the community in the CNCF, maybe not directly in the form of direct money, but maybe just hire more people--

LUCAS KALDSTROM: No, no.

ABDEL SGHIOUAR: --to actually work on the open source project, right?

LUCAS KALDSTROM: Definitely.

ABDEL SGHIOUAR: But you mentioned something very interesting. I think, in my opinion, the move from ingress to the gateway API would be the first time in Kubernetes' lifetime that we are trying to completely replace an API with a new one, right?

Once we get to a stage where ingress will be end of life, whenever that happens, it would be the first time that the conversation you've been having with Tim Hockin about Service API version 2 would really be brought up, right? Because then everybody will have to move to a new thing, regardless of whether they want to or not, right?

So I think that that would be probably the first exercise that would give us a sort of-- like, an idea about whether this is doable in the future, I guess.

LUCAS KALDSTROM: Yeah, yep. It definitely incurs a lot of toil on both the project and its users, but sometimes-- [LAUGHS] it just needs to be made. Ingress is one of those. I like the concept of Hyrum's law. Maybe I'm butchering the name, but if I recall correctly, a Google engineer came up with a law that any kind of user-facing consequence or behavior of an API will be used and depended on, even though it's not part of the API description.

So, like, if there's a side effect that-- like, kind of like the XKCD, that if there's a side effect that the text editor uses 100% of the CPU when typing an M character or something, then someone will have used the CPU heater to warm--

ABDEL SGHIOUAR: [LAUGHS]

LUCAS KALDSTROM: --something else and will be broken by the fact that now that we fix that inefficiency, I can't start my home heating with-- by pressing M anymore, you know? [LAUGHS]

ABDEL SGHIOUAR: Yeah. Yeah, that's actually a pretty funny example. I think the case of the ingress API specifically is interesting because there are companies that built their entire business on this API, right? So as we are moving toward this new territory of the gateway API being the next-gen API for north/south traffic, or maybe even east/west traffic, because even the service mesh is actually leveraging that through the GAMMA initiative, it's going to be interesting.

Well, look, I don't want to waste more of your time. This was already a lot of intelligent conversation. And I think that a lot of people that listen to this episode will have a lot of googling to do, I guess, right?

LUCAS KALDSTROM: Yeah.

ABDEL SGHIOUAR: But any last words you want to tell the audience? Go read your paper. Obviously, that's one of them.

LUCAS KALDSTROM: Yes, yes. So one of the interesting things with the connection between entropy that we discussed and its consequences to Kubernetes-- one day, just-- it hit me when I was writing the thesis.

And I was thinking that it really is like, no matter how hard we try to make our systems good and keep configuration in sync here and there and stuff, no matter how hard we try, it always becomes more messy, right? And upon realizing that and realizing that Kubernetes is the kind of force against that, I saw how it was related to control theory and also the physical part.

And I mentioned this to Stefan Schimanski, who was my bachelor thesis advisor. And he mentioned, well, have you tried reading or looking at what Mark Burgess has written about the subject? I had not. I had never heard the name before at that time.

But he has, actually since the '90s already, developed a lot of theory, including promise theory, and written many books about it in a kind of scientific way, describing what the linkage is between physics and these concepts, which we actually know quite well because they've been studied for hundreds of years.

Computer science is a really new thing at the end of the day. [LAUGHS] And our computer systems. And I highly recommend the book "In Search of Certainty," which he has written as well. It's about this subject. Only after that did I see that there are other people actually researching this.

And when I took the thermodynamics course, one of the statements was that, well, life is a highly orderly piece of-- we can't have-- we are very consistent in the way life is created and is, but it's kind of interesting because then no matter what we do as a species, we still just create chaos around us.

And one of the papers-- I think it was from one of the famous physicists back in the '40s, the 1940s-- was that maybe life's purpose is to actually create chaos, right? Because there is no process where we would go in and change something in the Kubernetes API, or go and change some configuration on our servers, or whatever, where we wouldn't make stuff a little bit more chaotic.

ABDEL SGHIOUAR: Yes.

LUCAS KALDSTROM: Right? It's very hard.

ABDEL SGHIOUAR: Yeah.

LUCAS KALDSTROM: Maybe there exists some small pieces. But there is a linkage between information theory, like Shannon created in the '40s, and the physics. And they have been shown to be actually-- really, like, exactly consistent. They are the same kind of thing.

ABDEL SGHIOUAR: Awesome.

LUCAS KALDSTROM: So that is what we can leave you with. [LAUGHS]

ABDEL SGHIOUAR: Yeah. That would be a good thing to leave people with. Just go read these papers, I think. Yeah, there's a lot of interesting ways of looking at our field and looking at real life and seeing close similarities. And maybe the tech industry and the IT field is just an extension of life as it is or as it's supposed to be-- [LAUGHS] to start with.

LUCAS KALDSTROM: Yeah. I think we have a lot to learn from other fields.

ABDEL SGHIOUAR: Exactly. All right. Well, Lucas, that was really awesome discussion. Thank you very much for your time.

LUCAS KALDSTROM: Yeah, thank you very much as well.

ABDEL SGHIOUAR: And I just realized now that at the beginning of the call, I did not actually mention your full name. So it's-- let me give this a stab, considering that your family name has two letters with two dots.

LUCAS KALDSTROM: Yes. [LAUGHS]

ABDEL SGHIOUAR: That would be "Kahld-stroom"?

LUCAS KALDSTROM: Yes, if you omit it, but "Keld-struhm."

ABDEL SGHIOUAR: Ah, "Cheld-strum." Yeah, oh, yeah.

LUCAS KALDSTROM: "Keld-struhm," yeah.

ABDEL SGHIOUAR: I remember now. You are actually-- you speak Swedish, right?

LUCAS KALDSTROM: Yeah, yeah.

ABDEL SGHIOUAR: Oh, so you are from the 6% of Finnish people--

LUCAS KALDSTROM: Yes, yes.

ABDEL SGHIOUAR: --who are-- oh, OK. Got it now. OK.

LUCAS KALDSTROM: Yes.

ABDEL SGHIOUAR: So for those who don't know-- in this case, in Finland, actually, Swedish is an official language, and you can actually request paperwork done in Swedish. You could even-- signs in the streets are written in both Swedish and Finnish. And there is, like, what, 6% to 10% of the population who are like that. And usually, these are the people who were born at the border between Sweden and Finland, which is a very small border. So it's one of these fascinating things about how the world works.

All right. Thank you very much, Lucas.

LUCAS KALDSTROM: Thank you very much. Hope people will go and read my paper. [LAUGHS]

ABDEL SGHIOUAR: Yeah, please do. Thank you very much. Thanks for your time. And yeah, we'll talk to you next time. Have a good one.

LUCAS KALDSTROM: Yes.

[MUSIC PLAYING]

KASLIN FIELDS: Thank you so much, Abdel, for that interview. Lucas is always so great to talk to. [LAUGHS]

ABDEL SGHIOUAR: It was really interesting. I've known Lucas for a while. We've been talking. We've been meeting. I go to, like, meetups by the Nordics CNCF team quite often. And last year, I was in Helsinki for one of the meetups that Lucas was hosting. And it was last year that Lucas shared the paper with me. And I started reading it. And I was like, hmm, this is interesting. We need to get this person on the podcast.

KASLIN FIELDS: Oh, yeah. I mean, Lucas is always great. The first time I met Lucas was at KubeCon EU 2019 in Barcelona. And we were both lost-- [LAUGHS] looking for the way into the convention center, I think, because it was a little bit confusing. And we ended up walking together and finding our way into the convention center. And then later, he was on the keynote stage. And I was like, oh. [LAUGHS]

ABDEL SGHIOUAR: Sure.

KASLIN FIELDS: Right.

ABDEL SGHIOUAR: So-- so my a-ha moment at KubeCon 2019 in Barcelona was the week before I was in Geneva at CERN with Ricardo Rocha--

KASLIN FIELDS: Oh, yeah. Mm-hmm.

ABDEL SGHIOUAR: And then the week after, he was at KubeCon on stage at the keynote.

KASLIN FIELDS: Yeah.

ABDEL SGHIOUAR: And I was like, OK, I had a cheese fondue with this person last week.

KASLIN FIELDS: The era of CERN at KubeCon was great. [LAUGHS]

ABDEL SGHIOUAR: Yeah. For the people who don't know, I think we need to find the link. It was replicating the Higgs boson experiment on Kubernetes.

KASLIN FIELDS: Oh, so cool.

ABDEL SGHIOUAR: Amazing. Yeah, it was really cool.

KASLIN FIELDS: I think there were a couple of KubeCons where they presented. But Ricardo was also one of the cochairs of KubeCon, so he was at multiple of them. I'm sure he talked about CERN stuff at a couple of them. So that's-- but the one where they talked about replicating the experiment, and how Kubernetes could have made it so much faster to find it, was so cool.

ABDEL SGHIOUAR: Yeah. I don't want to spoil our upcoming episodes, but we found another interesting use case of Kubernetes coming out of Switzerland, which we are going to have on the show hopefully.

KASLIN FIELDS: I look forward to hearing more about that.

ABDEL SGHIOUAR: It's going to be interesting.

KASLIN FIELDS: Also, this talk about CERN really pairs nicely with this episode with Lucas with all the physics talk. [LAUGHS]

ABDEL SGHIOUAR: Yes. So for those of you-- I think if you've gotten to this stage, you have listened to the episode-- Lucas has a paper, a research paper.

KASLIN FIELDS: Yep, yeah. We'll have that in the show notes.

ABDEL SGHIOUAR: There are a lot of similarities between physics and Kubernetes, which is super interesting.

KASLIN FIELDS: There's a lot of value in analogies. I think when you're first trying to understand, especially a technical concept where things are often very abstract, having analogies is a really useful way to get it to stick in your brain. This is something that I've given talks on, actually-- how basically the way learning works in your brain is all about making connections.

The more connections you can make between new information and old information, the better you'll be able to retain that new information. So when you're learning something for the first time, especially a very abstract concept, these analogies can be really helpful in getting it.

So using physics as an analogy, when there's also a lot of physics in Kubernetes, really, when you think about it, kind of blurs that line, which I think is really interesting. [LAUGHS]

ABDEL SGHIOUAR: Yes. It's quite cool.

KASLIN FIELDS: So one thing that I was really excited about with the physics connection was where you all were talking about the perspective of failures being normal in a sufficiently large system.

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: That's one of those areas where it's a physics issue with Kubernetes, essentially, but also one that you can represent using physics as an analogy.

And it made me think about this one example that I honestly think about a lot, where I was watching some show or something about physics. And some experts were talking about the concept of intelligent life in the universe, and how, in a sufficiently large universe, it's next to impossible that we are the only intelligent life, because anything that can happen, even if it's rare, will happen often enough that it should be, in a sense, frequent, but still rare. [LAUGHS]

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: I don't know how to use words to explain that right. But-- [LAUGHS] so they were talking about this concept, and it's something that I come back to frequently. And that's kind of what you were talking about here as well, was in a sufficiently large system, failure, even if it's rare, is going to be common, essentially.

ABDEL SGHIOUAR: Yeah. Yeah. So Lucas made reference to this paper that was published by Google about how failures are normal in large-scale systems, and that was in the context of talking about chaos management and how Kubernetes is good at managing chaos. And I think my favorite-- it's something I didn't know. Actually, I learned it from Lucas. It's this thing called promise theory. Kubernetes is based on the idea of promise theory, because when you are deploying stuff to Kubernetes, Kubernetes promises you that the stuff will be running.

KASLIN FIELDS: I see.

ABDEL SGHIOUAR: And that promise is the basis of basically all the, how do we call it, the reconciliation loop, right? The fact that when something fails, it will just be respun again. I'd never heard of promise theory until I had this discussion. And essentially, that's what you want out of an orchestration system: deploy this thing for me and run it for as long as you can, and if it fails, just restart it, right?
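
For anyone who wants to see what that looks like, here's a minimal, illustrative sketch of a reconciliation loop in Go. It is not the real Kubernetes controller code; the cluster struct and its fields are hypothetical stand-ins for state a real controller would read from the API server.

```go
// A minimal, illustrative reconciliation loop -- not the real Kubernetes
// controller code. The cluster struct and its fields are hypothetical
// stand-ins for state a real controller would read from the API server.
package main

import (
	"fmt"
	"time"
)

type cluster struct {
	desiredReplicas int // the "promise": what the user declared they want
	runningPods     int // observed state, which drifts as things fail
}

// reconcile compares observed state with the promised state and acts to close the gap.
func reconcile(c *cluster) {
	switch {
	case c.runningPods < c.desiredReplicas:
		c.runningPods++ // "respin" a failed pod
		fmt.Println("started a replacement pod")
	case c.runningPods > c.desiredReplicas:
		c.runningPods--
		fmt.Println("removed an extra pod")
	default:
		fmt.Println("promise kept, nothing to do")
	}
}

func main() {
	c := &cluster{desiredReplicas: 3, runningPods: 1}
	for i := 0; i < 5; i++ { // real controllers watch for changes and requeue instead of polling
		reconcile(c)
		time.Sleep(10 * time.Millisecond)
	}
}
```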

KASLIN FIELDS: Yep. I also had not heard about it. And we'll definitely have links to the books that Lucas mentioned in the show notes. I already went and grabbed those links because I want to read those myself.

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: But y'all didn't talk much about promise theory itself, so I'm glad that you gave that little overview there.

ABDEL SGHIOUAR: I think we probably mentioned it very briefly, but it was pretty interesting because the research paper from Lucas talks-- like, there's a section that talks about promise theory.

KASLIN FIELDS: OK.

ABDEL SGHIOUAR: I didn't read the entire research paper. It's 90 pages. It's quite long. And it starts talking quite a lot about the laws of thermodynamics and entropy and stuff like that. And I'm like, hmm, I haven't been in university for 15 years, so--

KASLIN FIELDS: Yeah.

ABDEL SGHIOUAR: I feel a little bit rusty.

[LAUGHTER]

KASLIN FIELDS: Can relate. Another thing about entropy that you all mentioned at the end of the interview was a concept that I feel like I've mentioned in a couple of episodes recently, or at least in conversations recently: this concept of entropy and chaos and how everything is always getting more complicated. You all were saying it's core to tech, it's core to the whole universe. But one thing I've been talking with folks about regarding entropy in tech is that we always create tools, and then over time, the tools take on more and more responsibility, more and more features, and they get complicated.

And then someone new comes along and wants to use it. And they're like, oh, this is so complicated. I'll just make something new that's simple and just does what I want it to do, and it'll be so much better. And then that thing over time gets more complicated, and then we just repeat this whole cycle again. So it's very core to the way technology as an industry evolves.

ABDEL SGHIOUAR: You're completely on point. And as you were talking, I just remembered that keynote we talked about before, the keynote that Tim Hockin gave about--

KASLIN FIELDS: Exactly, yeah.

ABDEL SGHIOUAR: --like, in the next 10 years of Kubernetes, how should we, as a community, make sure we don't overcomplicate it, right? What you're describing-- we can't just keep adding stuff. And what's the cost of just keeping on adding things, right?

KASLIN FIELDS: Yeah. And at that same KubeCon, Tim gave the lightning talk that Lucas also mentioned, about the service API specifically within Kubernetes being one of those cases where it's gotten really complicated. It does way too many things. And so at some point, it may need to be replaced. And you all had a fantastic conversation about that. I really enjoyed that.

ABDEL SGHIOUAR: Yeah. We need to find that lightning talk. I saw the lightning talk by Tim. It's amazing.

KASLIN FIELDS: Yeah. It's really good. It's very concise because it's five minutes of Tim being, like-- and I didn't realize this as a user of Kubernetes-- I'm like, service is the thing that you use to get to your application running in Kubernetes. But there's so much that it does. And he kind of lays out all of the different jobs that the service API object does for the Kubernetes cluster. And it's like, oh, yeah, hmm, it does do a lot, which makes it really weird and unwieldy within the Kubernetes ecosystem.
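
To make that concrete, here is a rough sketch in Go, using the well-known core/v1 types, of a single Service object, with comments calling out some of the distinct jobs it has accumulated. The comments are our own summary of that general point, not a recap of Tim's slides, and the object itself is just an invented example.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// exampleService shows how many distinct jobs one core/v1 Service carries.
func exampleService() corev1.Service {
	return corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo"},
		Spec: corev1.ServiceSpec{
			// Job: service discovery -- a stable name and selector over a set of pods.
			Selector: map[string]string{"app": "demo"},
			// Job: a virtual IP with port mapping inside the cluster (ClusterIP).
			Ports: []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(8080)}},
			// Job: node-level exposure (a NodePort is implied) plus
			// provisioning an external cloud load balancer.
			Type: corev1.ServiceTypeLoadBalancer,
			// Job: rudimentary traffic policy, like client-IP session affinity.
			SessionAffinity: corev1.ServiceAffinityClientIP,
			// ...and there are more modes not shown here: headless services,
			// ExternalName DNS aliases, externalTrafficPolicy, and so on.
		},
	}
}

func main() { _ = exampleService() }
```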

ABDEL SGHIOUAR: Yeah. And I think one interesting observation there is when we talk to people-- so when I was doing consulting, it was very hard to try to explain to people why there are two different APIs that try to achieve the same thing. Back in the day, it was ingress-- so ingress and service. Why do you need to use service to get to--

KASLIN FIELDS: That one is really confusing.

ABDEL SGHIOUAR: Yes. And now there is the gateway one, right? So now there are three APIs. But what's cool there is I saw this article that was published by the maintainers of the gateway API. They are basically adding route types to the gateway API that can give you the same type of load balancer as you would get with the service API, right?

KASLIN FIELDS: Mm-hmm.

ABDEL SGHIOUAR: I don't know if the plan long term is to deprecate it, but there is potentially going to be a point where you would need only one thing, the gateway API, to do basically north/south and east/west traffic inside Kubernetes.
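
As a hedged sketch of what that could look like (our illustration, not something from the article Abdel mentions): exposing a TCP backend through the Gateway API means pairing a Gateway listener with a TCPRoute, which is roughly the role a Service of type LoadBalancer plays today. The names here are invented, TCPRoute still lives in gateway.networking.k8s.io/v1alpha2, and exact Go field names can shift between gateway-api releases, so treat this as illustrative.

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
	gatewayv1a2 "sigs.k8s.io/gateway-api/apis/v1alpha2"
)

func main() {
	backendPort := gatewayv1.PortNumber(5432)

	// A shared Gateway with a plain TCP listener; "example-lb-class" is a
	// hypothetical GatewayClass provided by some load-balancer implementation.
	gw := gatewayv1.Gateway{
		ObjectMeta: metav1.ObjectMeta{Name: "shared-gateway"},
		Spec: gatewayv1.GatewaySpec{
			GatewayClassName: "example-lb-class",
			Listeners: []gatewayv1.Listener{{
				Name:     "db",
				Port:     5432,
				Protocol: gatewayv1.TCPProtocolType,
			}},
		},
	}

	// A TCPRoute attaching to that Gateway and forwarding to a backend Service
	// named "db" -- roughly what a Service of type LoadBalancer does today.
	route := gatewayv1a2.TCPRoute{
		ObjectMeta: metav1.ObjectMeta{Name: "db-route"},
		Spec: gatewayv1a2.TCPRouteSpec{
			CommonRouteSpec: gatewayv1a2.CommonRouteSpec{
				ParentRefs: []gatewayv1a2.ParentReference{{Name: "shared-gateway"}},
			},
			Rules: []gatewayv1a2.TCPRouteRule{{
				BackendRefs: []gatewayv1a2.BackendRef{{
					BackendObjectReference: gatewayv1.BackendObjectReference{
						Name: "db",
						Port: &backendPort,
					},
				}},
			}},
		},
	}

	_, _ = gw, route // in real code these would be applied with a client
}
```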

KASLIN FIELDS: And that is kind of the core takeaway, by the way, from Tim Hockin's lightning talk-- that the gateway API actually takes on a lot of the responsibilities that service also handles. And so the adoption of the gateway API, like you all were saying, is going to be the first example where we totally deprecate one API, maybe, someday. I don't know about that. But--

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: Yeah. I don't know what the plan is there, honestly. But there will be a totally new API that's essentially the second updated version of this original API. And it does so much. And it does relate to services.

So I hadn't thought about that, really, when you all were like, oh, maybe services will be the next thing in Kubernetes that goes this route of having a totally new implementation. I hadn't thought about that possibility. And I haven't heard anything about it within the project, so I don't know what the plans are there. But it does seem likely that something will happen there.

ABDEL SGHIOUAR: Yeah. I think-- I mean, in the context of simplification, as Tim says, I think it's important to try to consolidate on one single way of doing things so that people don't get confused. And as I always say, the easiest way to make things simple is to remove code, so the more code you're able to remove, the better it is for everybody, basically.

KASLIN FIELDS: Talking about moving on to simpler implementations and leaving older, more complex implementations behind so that you don't have everything to deal with at the same time, you all talked about long-term support. [LAUGHS]

ABDEL SGHIOUAR: Yes. I wanted to get Lucas' perspective on that.

KASLIN FIELDS: It's such a point of kind of-- I don't know if "contention" is quite the right word within the community, but it's pretty close. It's definitely something that a lot of people are talking about and a lot of people have strong feelings about.

ABDEL SGHIOUAR: Yes. So I will refrain from sharing my opinion, because I do, of course, have opinions about this. But the only thing I would say is I would highly encourage people to go look at how much AWS and Microsoft charge for running nonsupported-- and by nonsupported, I mean out-of-support-- upstream Kubernetes.

KASLIN FIELDS: Yeah.

ABDEL SGHIOUAR: Right?

KASLIN FIELDS: But it is a trade off that in some cases is worth making for some companies. It's going to be really expensive, but I know that there are some cases where folks will be like, the ease is worth the money. But we'll see what kind of trouble that gets us all in. [LAUGHS]

ABDEL SGHIOUAR: Yes, sure. It's a very valid point. And maybe some companies will basically want to throw money at the problem to make it go away. There is, however, still-- I mean, LTS in Kubernetes is only two years, so you're only delaying it by another extra year, which, to me, feels-- I don't know. To me, it feels like, what's the point of spending that money just to delay the upgrade? That's my opinion. That's how I think about it.

KASLIN FIELDS: There's a lot of use cases here to think about.

ABDEL SGHIOUAR: Sure. I bet there are.

KASLIN FIELDS: Yeah.

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: I'm glad that you all talked about it. Hearing more perspectives is always useful for understanding. It's really a big deal within the community and among users, so it's good to hear more about what people are thinking about it.

ABDEL SGHIOUAR: I think it would be interesting to have probably an episode about this. I'd like to have somebody who has insider knowledge-- not insider knowledge, really, but more-- I mean, there is a SIG, right? There is a special interest group--

KASLIN FIELDS: There is.

ABDEL SGHIOUAR: --for LTS.

KASLIN FIELDS: Jeremy Rickard from Microsoft did a fantastic keynote at KubeCon North America 2023 about the history of long-term support in the Kubernetes project and what's going on with it now. So maybe we'll talk to him later, but-- also, if you're just interested in this and want to get a head start, go check out that keynote. It's very, very good.

ABDEL SGHIOUAR: Cool, cool.

KASLIN FIELDS: It's short, too. I think it's one of the shorter ones. But he goes over the history of all of that.

ABDEL SGHIOUAR: Yeah. We'll try to have somebody on to talk about this. I think it's important as a topic to cover.

KASLIN FIELDS: Yeah. So I could talk more about the history of that, but we'll save that for another time. One other thing that I want to mention from your interview is API machinery. So when you get involved with open source Kubernetes, you start hearing API machinery a lot. And just by itself, honestly, I feel like that doesn't mean much. [LAUGHS] API machinery. What does that mean? And I really like that Lucas explained that and went into that. Of course, that's the area where he's been focused.

You figure out very quickly in open source that API machinery is very important. And I think, honestly-- I'd never put these pieces together. But I think, honestly, API machinery is the place to answer a question that I've had for a long time about understanding Kubernetes YAML and APIs.

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: So you can go to the docs, and you can learn about how Kubernetes YAML is supposed to be structured. You can take courses for the CKAD, the Kubernetes application developer certification, and you'll learn about the structure of YAML. But a lot of it feels, at least to me, as a user, very implementation detail-y.

It kind of reminds me of history class. One complaint a lot of students have about history class is it's a lot of remembering dates, right? There might be interesting things behind those dates, but you still have to just remember the dates, which feels like a detail and feels silly to remember.

I kind of feel the same way about YAML, where you have to remember the API version. And for most APIs-- like, there are big swaths of APIs where that's the same, but there are some where it's different, and it can kind of get confusing. So understanding that more deeply is something I've always wanted to do. And what I learned from this is that the API machinery docs are where I should look. [LAUGHS]
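
If you want to see where those apiVersion values actually come from on a live cluster, the discovery API is the place to look; it is what kubectl api-resources reads. Here's a small sketch, assuming client-go and a kubeconfig in the default location:

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Ask the API server which resources it serves, grouped by apiVersion --
	// the same information kubectl api-resources prints.
	lists, err := dc.ServerPreferredResources()
	if err != nil {
		log.Fatal(err)
	}
	for _, list := range lists {
		for _, r := range list.APIResources {
			fmt.Printf("apiVersion: %-35s kind: %s\n", list.GroupVersion, r.Kind)
		}
	}
}
```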

ABDEL SGHIOUAR: Yes. It's interesting you're mentioning this. So I actually came across a tool. I was just scouring the internet, reading some stuff this week, and I came across a tool-- we will leave a link in the show notes-- called kubeconform. It's a command line tool on GitHub which you can actually use to validate Kubernetes manifests. So you can run it against a bunch of YAML files, and it will tell you if those YAML files are correct or not.

KASLIN FIELDS: It still feels often to me like guesswork, honestly, when I put together a YAML file from scratch.

ABDEL SGHIOUAR: Oh, yes, heh.

KASLIN FIELDS: Like, I know I need these sections. I know these sections have these parts. There's probably a bunch of parts I don't know about. I don't know how to find out about all of the pieces that I don't know about. I mean, I can go to the API docs, but sometimes it's kind of unclear how it relates to actually writing the YAML.

Yeah, that's something I've always wanted to understand more deeply. And that is API machinery, is what I learned in this.

ABDEL SGHIOUAR: That's API machinery for you. Correct, yes. And I think that one of the conversations we had with Lucas was about the fact that Kubernetes is maintained by a bunch of people across multiple SIGs, and there is a SIG API Machinery, which is, in my opinion, quite uncommon.

Because typically, if you are a developer or maintainer of a backend, you also maintain the API that goes with that backend. But there is this separation between the API machinery part and the controllers part, if you want-- because everything is a controller, technically, in Kubernetes-- and it still works fluently and works beautifully, right?

And I think one other thing I want to mention is, if you go read the API machinery documentation and understand why it was set up the way it was set up, then when you follow the KEPs, the Kubernetes Enhancement Proposals, and you see people explaining why the APIs don't support whatever people want them to support, you start understanding why there is always pushback about just adding stuff to APIs: it just makes things more complicated.
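
As one concrete illustration of that pushback (our example, not something from the interview): the Kubernetes API conventions require that once an API version has shipped, new fields must be optional and must default to the old behavior, so that existing clients and objects already stored in etcd keep working. The WidgetSpec type below is purely hypothetical.

```go
package v1

// WidgetSpec is a hypothetical API type, used only to illustrate the
// compatibility rules from the Kubernetes API conventions.
type WidgetSpec struct {
	// Replicas shipped in v1, so it can never be removed or renamed within v1.
	Replicas int32 `json:"replicas"`

	// Paused was added later, so it must be optional: a pointer with omitempty,
	// where nil means "keep the old behavior" (not paused). Old clients that
	// never set it, and old objects already stored in etcd, keep working.
	Paused *bool `json:"paused,omitempty"`
}
```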

KASLIN FIELDS: Yeah. I thought it was really interesting the way that he framed it-- I feel like the job of API machinery is to help with the communication between humans and computers.

ABDEL SGHIOUAR: Yes. Pretty much, yes.

KASLIN FIELDS: It's like making it clearer to the computer this is what the human was going for.

ABDEL SGHIOUAR: Yes, yes.

KASLIN FIELDS: I know that they didn't do the best job, but have some grace, computer. Here's what they were going for.

ABDEL SGHIOUAR: Bear with us.

KASLIN FIELDS: Try to make it work.

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: [LAUGHS]

ABDEL SGHIOUAR: That's a whole interesting part of Kubernetes itself that is, I think, worth-- if people are curious-- worth spending some time to understand.

KASLIN FIELDS: Yeah. Very interesting. And I'm excited to read some more API machinery documentation. I think I'm going to go to GitHub after this, honestly, and look up the API machinery SIG and just see what's in their GitHub repo. That's where I'm going to start.

ABDEL SGHIOUAR: We'll make sure to leave a link.

KASLIN FIELDS: So that's kind of all of the main points that I wanted to hit on the interview, except maybe one last thing we should mention is dishwashers.

ABDEL SGHIOUAR: Oh, yes.

[LAUGHTER]

Kubernetes is the dishwasher of servers. Is that what you're talking about?

KASLIN FIELDS: I did not know where he was going to go with that when he said it. I was like, Does he mean servers as in computers, or does he mean servers as in, like, waitstaff at a restaurant?

ABDEL SGHIOUAR: Oh, yeah.

KASLIN FIELDS: Which makes sense to me, too. Honestly, when I first thought about it, that made more sense to me than the other way, because it's like, servers use a dishwasher to accelerate their work, right? It makes their jobs and their lives a little bit easier because they can just put the dishes into the dishwasher, and it'll do the washing for them. Which is one way that we could look at things, but not where he went with this.

ABDEL SGHIOUAR: No. I mean, the conversation was along the lines of making order out of chaos, essentially. That's the analogy there, right?

KASLIN FIELDS: Yeah. It was about the cleaning part--

ABDEL SGHIOUAR: The cleaning.

KASLIN FIELDS: --of dishwashers.

ABDEL SGHIOUAR: Yeah, yeah.

KASLIN FIELDS: Yeah.

ABDEL SGHIOUAR: It was-- I love how Lucas brings up a lot of simple analogies to explain these complicated terms. It's quite interesting.

KASLIN FIELDS: Yeah. And the whole perspective of Kubernetes' role in managing chaos is one that I have not heard often. It's fascinating, especially the way that he presented it. He made it quite approachable, even though it was a new concept for me.

ABDEL SGHIOUAR: Yeah, yeah. I wanted to carry on with the conversation. I think the interview was quite long. I don't remember how long we've been talking. I think we-- I think--

KASLIN FIELDS: So is this part.

ABDEL SGHIOUAR: --an hour. Yeah, this is-- I mean, this ep is going to be ridiculous. It's going to be an hour and a half, so. [LAUGHS]

KASLIN FIELDS: Sorry, folks. Hope you're enjoying your run.

ABDEL SGHIOUAR: Exactly.

[LAUGHTER]

Or your dog walk or whatever you do to listen to this.

KASLIN FIELDS: Yeah. So we'll leave it at that for now, but thank you so much, Abdel. Thank you, Lucas. I really enjoyed learning from that. And I think we all have more to learn.

ABDEL SGHIOUAR: Thank you.

KASLIN FIELDS: [LAUGHS]

[MUSIC PLAYING]

That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes Pod or reach us by email <KubernetesPodcast@google.com>. You can also check out the website at KubernetesPodcast.com, where you'll find transcripts, show notes, and links to subscribe.

Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.

[MUSIC PLAYING]