#218 February 9, 2024

Kubernetes stale reads, with Madhav Jivrajani

Hosts: Abdel Sghiouar, Kaslin Fields

Madhav Jivrajani is an engineer at VMware, a tech lead in SIG Contributor Experience and a GitHub Admin for the Kubernetes project. He also contributes to the storage layer of Kubernetes, focusing on reliability and scalability.

In this episode we talked with Madhav about a recent post on social media about a very interesting stale reads issue in Kubernetes, and what the community is doing about it.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

Mofi Rahman co-hosts this episode with Kaslin

Kubernetes Podcast episode 211

News of the week

Google announced a new partnership with Hugging Face

RedHat self-managed offering of Ansible Automation Platform on Microsoft Azure

The schedule for KubeCon CloudNativeCon EU 2024 is out

CNCF Ambassador applications are open

The CNCF Hackathon at KubeCon CloudNativeCon EU 2024 CFP is open now

The annual Cloud Native Computing Foundation report for 2023

CNCF’s certification expiration period will change to 24 months starting April 1st, 2024.

Sysdig 2024 Cloud Native Security and Usage Report

Madhav Jivrajani

Priyanka Saggu Interview

Stale reads Twitter/X thread by Madhav

“Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees” - GitHub Issue tracking the stale reads CAP Theorem issue

CMU Wasm Research Center

“A CAP tradeoff in the wild” blog by Lindsey Kuper

“Reasoning about modern datacenter infrastructures using partial histories” research paper

The Kubernetes Storage Layer: Peeling the Onion Minus the Tears - Madhav Jivrajani, VMware

KEP-3157: allow informers for getting a stream of data instead of chunking.

KEP 2340: Consistent Reads from Cache

Journey Through Time: Understanding Etcd Revisions and Resource Versions in Kubernetes - Priyanka Saggu, KubeCon NA 2023

Kubernetes API Resource Versions documentation

KASLIN FIELDS: Hello, and welcome to the "Kubernetes Podcast from Google." I'm your host, Kaslin Fields.

MOFI RAHMAN: And I'm Mofi Rahman.

[MUSIC PLAYING]

KASLIN FIELDS: Abdel is out and about this week, so I'm excited to have Mofi Rahman back hosting with me. Granted, I haven't hosted with him before, but I'm excited to be hosting with you for the first time. You have hosted an episode before. I just wasn't there. I was the one not there that time.

MOFI RAHMAN: Yeah, I was a guest host at episode 211, the episode where we talked about etcd.

KASLIN FIELDS: And how has that been going for you? I think that may have inspired something for you.

MOFI RAHMAN: That was pretty great. Since I learned a lot more about etcd, I met some of the maintainers at KubeCon North America 2023. And since then, I have been helping out with the etcd project a little bit.

KASLIN FIELDS: Exciting. You've been doing a lot of stuff in open-source Kubernetes and that's very relevant to this week's interview. In this week's interview, I talk with Madhav Jivrajani. Madhav is an engineer at VMware, a tech lead in the special interest group for contributor experience, and a GitHub admin for the Kubernetes project. He recently posted on social media about a very interesting stale reads issue in Kubernetes and we're going to dive into it. But first, let's get to the news.

[MUSICAL STING]

MOFI RAHMAN: Google announced a new partnership with Hugging Face, the AI and machine learning community. Hugging Face describes their goal as building an open platform, making it easy for data scientists, machine learning engineers, and developers to access the latest models from the community and use them within the platform of their choice. The strategic partnership will enable new ways for Google Cloud customers to easily train and deploy Hugging Face models within Google Kubernetes Engine and Vertex AI.

KASLIN FIELDS: In December 2023, Red Hat introduced the self-managed offering of Red Hat Ansible Automation Platform on Microsoft Azure. This new offering allows customers the flexibility to deploy Ansible Automation Platform on Azure, and customers can now choose their managed application or self-managed application via the Azure Marketplace.

MOFI RAHMAN: The schedule for KubeCon CloudNativeCon EU 2024 is out. You can check out the talks and start building your agenda now.

KASLIN FIELDS: CNCF Ambassador applications are open. The Ambassador Program is a way for the Cloud Native Computing Foundation to get feedback from and support communities through those communities' leaders. They describe it as, ambassadors are an extension of CNCF, furthering the mission of making cloud native ubiquitous, through community leadership and mentorship. If you're a leader in a cloud native community and would like to help make sure your communities' voices are heard, read through the requirements and consider applying to the program.

MOFI RAHMAN: The CNCF is hosting their first-ever hackathon at KubeCon CloudNativeCon EU 2024, and the CFP is open now. The theme is the United Nations' 17 Sustainable Development Goals. Submissions should aim to help solve problems for a more sustainable world. A link is in the show notes with more information on how to submit your team. You must be registered for KubeCon CloudNativeCon EU 2024 to participate in the hackathon.

KASLIN FIELDS: The annual Cloud Native Computing Foundation report for 2023 is out now. The report reflects on a year of the CNCF, with graphs and data outlining its community growth and its projects' progress.

MOFI RAHMAN: Starting April 1, 2024, all CNCF certifications with an expiration period of 36 months will change to 24 months. If you hold a certification before this date, the expiration date of your certification will not change.

KASLIN FIELDS: Sysdig released their 2024 Cloud Native Security and Usage Report. Some of their most interesting findings include: shift-left is still a major goal but not a reality; 91% of runtime scans fail, but critical and high vulnerabilities in use have been reduced by 50%; and only 2% of permissions are being used, which means identity management is the most overlooked cloud attack risk.

MOFI RAHMAN: And that's the news.

[MUSICAL STING]

KASLIN FIELDS: All right, today we are speaking with Madhav Jivrajani and I'm very excited to be talking with you today. Madhav is an engineer at VMware, a tech lead in SIG contributor experience in open-source Kubernetes, and a GitHub admin for the Kubernetes project. He also contributes to the storage layer of Kubernetes, focusing on reliability and scalability. Welcome to the show, Madhav.

MADHAV JIVRAJANI: Yeah, thank you so much for having me. I'm very excited about what we're about to talk about today.

KASLIN FIELDS: Woo! So before we dive into today's main topic, we've recently also interviewed Priyanka, by the way, because Priyanka was the 1.29 release lead. But Priyanka is also your co-tech lead in SIG ContribEx and I'm a co-chair. So we all work together in SIG ContribEx. So could you first tell me a little bit about the work that you do in open source? Honestly, I don't know-- I feel like there's a lot of it that I don't know.

MADHAV JIVRAJANI: Yeah, I mean, a lot of it is what you mentioned already. I'm one of the tech leads of SIG Contributor Experience, along with Priyanka. What that essentially entails is working on and looking after the technical aspects of SIG Contributor Experience-- that involves GitHub automation, Slack automation, anything technical that comes up relating or pertaining to ContribEx. I'm also one of the GitHub admins of the project. So that involves anything related to any of the Kubernetes GitHub orgs. It usually ends up being something security related, which isn't always fun, but that's what it is.

KASLIN FIELDS: I feel like those are-- in some ways they're separate hats that you both just happen to wear, but also I feel like most folks who are tech leads of SIG ContribEx are probably going to be deeply involved in GitHub administration.

MADHAV JIVRAJANI: Yeah, there's definitely a strong overlap, and I think becoming a GitHub admin of the project is a natural stepping stone from being a tech lead of ContribEx because you're already doing a lot of the same stuff.

KASLIN FIELDS: Makes sense.

MADHAV JIVRAJANI: But yeah, you have keys to the kingdom, right, as a GitHub admin.

KASLIN FIELDS: Awesome. And today, we are going to talk about a topic that I think seems related to the GitHub administration, but it's also in the realm of the storage layer of Kubernetes, which we'll talk about more here in a second. But the main topic here today is inspired by a Twitter thread that you shared recently, where you talked about an interesting issue that I hadn't-- Abdel and I hadn't really heard about before. And we were talking about it one day and we were like, we would love to talk to you about that. [LAUGHS]

So what happened was someone on Twitter shared a blog post by Lindsey Kuper, where she talks about an example of the CAP theorem, which also happens to be an active issue in Kubernetes. So there's an active GitHub thread on the issue-- a GitHub thread in an issue that is the issue-- and there's also even a research paper about it that Lindsey Kuper linked to. There's a whole bunch of resources in that thread, which we'll make sure are in the show notes. So can you explain, in the simplest terms that you can, what this stale reads CAP theorem issue is in Kubernetes?

MADHAV JIVRAJANI: Yeah, for sure. So Professor Kuper is a huge inspiration for me. I've learned a lot from her over the past few years, and she's also one of the reasons I have an understanding of some of these things to begin with. So I was very excited to see that she's written a blog post about Kubernetes. So when I read it I'm like, OK, this is something familiar. And that blog post gained traction on Twitter.

But the interesting part was, since that issue was first filed, there have been quite a few developments. So to talk about the issue itself at hand: as you might be aware, Kubernetes uses etcd to persist the state of the cluster. And the state in etcd, ideally, is considered to be what is called the source of truth. We also have a layer of caching at the API server that helps to serve these read requests, in order to reduce latencies for these requests.

Now, this is like a typical, famous pattern to achieve scalability. You add a caching layer in front of the database and you get increased performance. Now, the downside of caching, however, is that you have stale data in your cache. The most up-to-date, recent data is always going to be in etcd. And when you have one instance of the API server running, things are still pretty good. Your cache still might be stale, but that's OK. Kubernetes tolerates that, to some extent.

The main issue, and this is when things get interesting, is when you have two instances of the API server, or maybe three if you're running a highly available Kubernetes cluster. The issue that turns up in very specific scenarios is, one of these API servers might have a cache that is arbitrarily stale. That is, the data in that cache might be arbitrarily old compared to another API server. So what might end up happening is your controller might just talk to the API server with the stale cache, get a view of the world which was something of the past, and then undo a lot of the work that another controller might have done in the time frame since that cache was last updated.

So this is essentially why it's called the stale reads issue. Your controller is reading stale data and potentially causing destructive actions without even knowing about it. And the issue that is linked, the original issue that brought this whole discussion up, talks about something called pod safety guarantees. It illustrates a specific scenario wherein you can have two pods running in your cluster with the same name, and any controller or any client of Kubernetes that assumes that pods are going to have unique names-- which is something Kubernetes guarantees-- is going to break. So that affects StatefulSet controllers or PVCs, for example, right? And that's a pretty severe thing to encounter in the wild.
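To make the read semantics Madhav is describing concrete, here is a minimal client-go sketch in Go (an illustration, not code from the episode; it assumes a reachable cluster, a kubeconfig in the default location, and the "default" namespace). A list with resourceVersion set to "0" is allowed to be served from the API server's watch cache, so it can be stale; a list with resourceVersion unset must be at least as fresh as the start of the request, which has traditionally meant a quorum read from etcd.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (error handling kept minimal for brevity).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// resourceVersion="0": the API server may serve this list from its watch cache,
	// so the result is cheap but can be arbitrarily stale.
	fromCache, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	// resourceVersion unset: the result must be at least as fresh as the moment the
	// request started, which has traditionally meant a quorum read from etcd.
	consistent, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Printf("cached list: %d pods, consistent list: %d pods\n",
		len(fromCache.Items), len(consistent.Items))
}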

KASLIN FIELDS: A question that occurred to me while you were talking about this-- this is an issue that's inherent in the way that Kubernetes works and has worked for a very long time. So do you know how long this issue has been around for?

MADHAV JIVRAJANI: The issue has been around since 2018.

KASLIN FIELDS: OK, Yeah. So it's one of those fairly long-lived issues.

MADHAV JIVRAJANI: Yes. Yes, it is one of those very long-lived issues.

KASLIN FIELDS: Yeah.

MADHAV JIVRAJANI: But it's not all bad. The issue was originally reported in 2018. There has been work that has happened since then, that reduces the blast radius of something like this happening quite significantly, as well. Yeah.

KASLIN FIELDS: So do we have any idea of how likely folks are to actually hit this issue? I'm not sure if we have data on how often this issue occurs in cluster, but do we have some rough idea of how likely folks are to encounter this?

MADHAV JIVRAJANI: So I would cautiously say that it would be rare.

KASLIN FIELDS: Mhm

MADHAV JIVRAJANI: But yes, it's often the rare ones that inflict the most pain.

KASLIN FIELDS: Yeah, that's true.

MADHAV JIVRAJANI: Yeah. But it is comforting to know, like I mentioned, there has been work that's gone in to minimize the blast radius of these things. For example, prior to Kubernetes 1.18, this sort of issue could happen more frequently. But since Kubernetes 1.18, there exists only one particular scenario where this issue can happen, which is when a controller restarts, or a component restarts.

KASLIN FIELDS: OK. That's what I got from reading things, so good to hear that confirmed.

MADHAV JIVRAJANI: Yeah, so that's the only sort of one scenario where this still might happen.

KASLIN FIELDS: OK.

MADHAV JIVRAJANI: But what's interesting, though, is that figuring out what would need to happen in order for this stale reads issue to occur in the first place is the challenging part.

KASLIN FIELDS: Mhm

MADHAV JIVRAJANI: Because if you can figure out what the sequence of events is that leads to this issue, you can put some form of guardrails in place, but figuring that itself out is really, really difficult. This is also why it made me immensely happy when academia started to take a liking to systems like Kubernetes.

KASLIN FIELDS: Academia-- I love that. I hadn't heard that before.

MADHAV JIVRAJANI: Yeah, even this specific issue, right?

KASLIN FIELDS: Yeah.

MADHAV JIVRAJANI: The tools that I linked in the thread, which is basically the research paper that you mentioned-- they have come out of universities like UIUC, and they automatically test your controllers, without you having to think about any of these things, and surface bugs like this, which are quite rare but still catastrophic. I mean, they were catastrophic. They still might be, but Hyrum's law would really have to kick in for that to happen.

KASLIN FIELDS: Speaking of which, I'm excited to see more academic research and academic work at KubeCon. Because they're creating a new like paper symposium kind of thing--

MADHAV JIVRAJANI: Mhm.

KASLIN FIELDS: --at KubeCon, so I expect we'll see more of that in those spaces as well.

MADHAV JIVRAJANI: Yes. I was very excited to see that because I really enjoy reading research papers, or just research in general, because they think about things in a way that my brain just can't. So it's always nice to see. And it's also nice to see universities have taken a liking to cloud native, which was really, really interesting.

KASLIN FIELDS: They want to teach their students things that are relevant and skills that the industry needs, which is wonderful.

MADHAV JIVRAJANI: Absolutely.

KASLIN FIELDS: It's great seeing academic folks at events and things.

MADHAV JIVRAJANI: Yeah. And interestingly, CMU just started a Wasm research center. It's amazing. I'm very excited.

KASLIN FIELDS: Interesting. A Wasm research center?

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: I should talk to them about that.

MADHAV JIVRAJANI: Yeah. I'm very excited to see where things are headed. I think the next few years are going to be very exciting here.

KASLIN FIELDS: Huh. Fascinating. And to clarify, the university that worked on the research paper that Professor Kuper cites-- and which, I guess, you did also mention in your thread, I think that was one of the tweets-- the university that worked on that was the University of Illinois Urbana-Champaign.

MADHAV JIVRAJANI: Yes, that's the one.

KASLIN FIELDS: Along with VMware.

MADHAV JIVRAJANI: Yeah. Yeah. This came out of an industry collaboration, yep.

KASLIN FIELDS: Awesome. So it's a fairly rare issue, where essentially etcd is the source of truth and your nodes have caches on them. Say one of your nodes goes down unexpectedly-- or could this happen in an expected scenario?-- when it's reading from a list. Basically, the node has to be reading from a list and it goes down-- or doing a list operation, I guess, is the right way to say that.

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: From etcd or from the cache.

MADHAV JIVRAJANI: Yeah, so there is a cache on the node also, but there is a cache on the API server as well. So the staleness occurs at the cache of the API server, which is what gets looked at and read from. The reader doesn't know that it's reading data of the past, but it is. So--

KASLIN FIELDS: Gotcha.

MADHAV JIVRAJANI: For example, if I have 10-- 5 pods created, and the caches of both my API servers have that, but I go ahead and say create 5 more, and the other cache hasn't gotten that update-- if a controller goes and sees the second cache, which is the stale one, and sees that, OK, I still need 5, it might not do anything, which is OK, but you still aren't getting what you asked for. But the worst case scenario is it's done something already, and it still looks at the stale cache, and it says that there are five now, so it downsizes from 10 to 5 again, and then it looks at the proper cache and it sees 10, so it upscales from 5 to 10 again, and then you have this sort of clashing going on over there.
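As a purely hypothetical sketch of the clash Madhav describes (none of these names come from real Kubernetes controller code), here is roughly what a naive reconcile loop does when the desired count it reads is stale:

package main

import "fmt"

// reconcile is a hypothetical illustration, not real controller code: the desired
// count comes from a stale cache, so the controller undoes a scale-up that another
// client already made.
func reconcile(desiredFromCache, actuallyRunning int, scaleTo func(int)) {
	if actuallyRunning != desiredFromCache {
		// The stale cache still says 5 even though the user asked for 10, so this
		// "corrects" the cluster back down to 5; once the fresh value is finally
		// observed, it scales back up to 10, and the two views fight each other.
		scaleTo(desiredFromCache)
	}
}

func main() {
	scaleTo := func(n int) { fmt.Println("scaling to", n, "replicas") }
	reconcile(5 /* stale desired count */, 10 /* pods actually running */, scaleTo)
}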

KASLIN FIELDS: So is there a cache on each node, and then a cache on the control plane nodes?

MADHAV JIVRAJANI: Yes, yes.

KASLIN FIELDS: OK.

MADHAV JIVRAJANI: The caching is like an onion, right? It's just layer after layer after layer. So on the node, that's the cache which almost every Kubernetes controller maintains, which is called the shared informer. If you write a controller yourself, that's what you would call it. That's the type name. It's called the SharedInformer.

On the API server side of things, the caching layer is somewhat more sophisticated. It's called the cacher-- very aptly named. The whole goal of the cacher is to try and mimic etcd and its operations, but in a lightweight manner. That's the whole goal of the cacher. At the end of the day, it's still a cache, so the whole staleness thing is a universal caching problem. It's not specific to something like Kubernetes, for example.
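For readers who want to see the controller-side cache Madhav mentions, here is a minimal client-go sketch of a shared informer (assuming a reachable cluster and a kubeconfig in the default location): the informer lists once, then keeps its local cache up to date via a watch, and controllers read from that cache rather than from the API server.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A shared informer factory gives every controller in this process one shared,
	// locally cached, watch-driven view of the cluster.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("observed pod:", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Controllers typically wait for the initial list to land in the local cache
	// before reconciling; after that, reads come from the cache, not the API server.
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	pods, err := factory.Core().V1().Pods().Lister().List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("pods in local cache:", len(pods))
}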

KASLIN FIELDS: So this is an issue where the node goes down.

MADHAV JIVRAJANI: Mhm.

KASLIN FIELDS: The cache on that node obviously isn't getting updated because it went down, and that affects the cache on the API server?

MADHAV JIVRAJANI: When the node comes back up, the key thing that happens is the Kubelet also wakes up again.

KASLIN FIELDS: Uh-huh.

MADHAV JIVRAJANI: And the Kubelet also, in itself, is a controller. So whenever a controller wakes up, it needs a list of all the pods that it has to run. I mean, when a Kubelet wakes up, it needs a list of all the pods that are supposed to be running.

KASLIN FIELDS: And it's pulling it from the cache--

MADHAV JIVRAJANI: Exactly.

KASLIN FIELDS: --which might not have been updated while the node.

MADHAV JIVRAJANI: Exactly.

KASLIN FIELDS: There we go.

MADHAV JIVRAJANI: Yeah, yeah. I mean, the Kubelet could be-- the main issue that crept up here is you pass in a certain parameter whenever these sort of initial lists happen.

KASLIN FIELDS: Mhm.

MADHAV JIVRAJANI: And that's mainly for scalability. And we've had to do some sort of weirdness to ensure backward compatibility but it works. And I think a great deal of work has happened in making sure it's backward compatible right now. But yeah, that's what happens.

KASLIN FIELDS: All right. And you also did a talk about this whole area-- this storage layer of Kubernetes-- at KubeCon North America 2023, called "The Kubernetes Storage Layer-- Peeling the Onion, Minus the Tears," where you walk through how you understood this whole process that Kubernetes goes through to manage its caching, it sounds like. My first question about this talk, which has a really wonderful, visual explanation of this workflow, of etcd, and the control plane, and the nodes, and how workloads actually get deployed onto the nodes using all of the caches, and things like that.

Beautiful visual explanation in this talk. But my first question about it is, the Kubernetes storage layer, what exactly does that mean? Because I'm someone who forays often into the world of persistent volumes and storage outside of Kubernetes itself, which is a separate kind of storage than what we're talking about here, right?

MADHAV JIVRAJANI: The terminology is slightly confusing, in fact-- I've had a few of my friends who've been confused by this same thing. But storage here means the part of the system that stores the state of the cluster. So maybe not the storage that's user facing, but the storage that's meant for the cluster to operate to begin with.

So when I say the storage layer, I mainly mean two pieces of the codebase, if you were to explore the codebase. One is how the API server, and by extension Kubernetes as a whole, interacts with etcd-- how it calls into etcd, how it stores the data, what's the format, what's the pattern of the key that it stores into etcd, all of those things. And the second is the actual caching layer that I was talking about, which we call the cacher.

So the cacher decides what to cache, what not to cache, when to evict items from the cache, when to say that this is probably not a request for me to serve, I'm going to send it to etcd directly. So it knows its limitations, as well. So that's the cacher. So when I say the storage layer, I mainly mean these two components: how the API server talks to etcd, and how the API server manages this caching of resources, which Kubernetes then uses to efficiently serve read requests, and even watch requests, and things like that.

KASLIN FIELDS: Makes sense. I kind of got that from the talk. I'm glad to hear you explain it that way. Of course, when I first saw it, I was like, storage? And went off into the data on Kubernetes realm. But I think this is going to be the year of etcd for me. Like, with etcd becoming a SIG, and it's obviously so central to the way that Kubernetes works. I've always wanted to understand it more deeply and I'm excited to be doing so this year.

MADHAV JIVRAJANI: Yep. Yep, for sure. Yeah.

KASLIN FIELDS: So the storage layer is referring to etcd and these caches, and we have this issue where sometimes we get stale reads. In rare scenarios we get these stale reads, and it's a CAP theorem issue, which we haven't really explained deeply. Definitely check out the blog post. It explains that part really well.

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: We can talk about that more, along with, How do we think that the Kubernetes community is going to address these things? You mentioned a couple of GitHub issues working on related-- not exactly fixes, but things to help address this issue. So how are we working on this?

MADHAV JIVRAJANI: The first thing that the community did was back in Kubernetes 1.18, which was in 2020.

KASLIN FIELDS: Mhm.

MADHAV JIVRAJANI: So that was the part where it ensured that the blast radius of this was limited to only certain scenarios. So that was the first thing, but the problem still isn't fully mitigated, right? I mentioned two KEPs-- Kubernetes Enhancement Proposals-- that aim to solve this issue, that are ongoing right now. So the first KEP is called Consistent Reads from Cache, which is self-explanatory in the name itself. And the second is called WatchList.

Now, both these KEPs, without going into too much detail-- while they aim to achieve different things, they end up solving the stale reads issue as a consequence of their implementation. So what they essentially end up doing is, let's say we're back in a highly available setup and the API server receives a request that should be served from the cache. Now, a request that should be served from the cache is a request that can get potentially stale data, because the cache in itself can be stale.

So what these two KEPs do is modify the API server to go back to etcd and ask etcd, hey, how recent does my data need to be? And etcd responds with a number indicating recency. So if you start learning more about these things, you'll find that this number is actually what you will come to know as the resource version. So etcd responds with a resource version. Now the API server knows how recent the data needs to be, so it waits for the cache to catch up to this level of recency. And once it has, then it serves the read from the cache itself.

So this way, even though the cache might be stale, and you might get stale data, Kubernetes ensures that no matter which API server the request goes to, you still don't end up going back in time, and all those scenarios don't really happen anymore. So both of these KEPs are aimed at different goals, but this is what they essentially end up doing, and that solves the problem to begin with.
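Here is a conceptual Go sketch of the idea behind the Consistent Reads from Cache KEP (everything in it is a simplified, hypothetical stand-in for illustration; it is not kube-apiserver code): learn the latest etcd revision cheaply, wait for the watch cache to catch up to it, then serve the read from the cache.

package main

import "fmt"

type object struct {
	name            string
	resourceVersion int64
}

// watchCache is a toy stand-in for the API server's cacher.
type watchCache struct {
	objects    []object
	observedRV int64 // the latest etcd revision this cache has seen via its watch
}

func (c *watchCache) waitUntilObservedAtLeast(rv int64) {
	// In the real system this blocks until the cache's watch on etcd has delivered
	// all events up to revision rv; here it is just a placeholder.
	if c.observedRV < rv {
		// ... block on the watch until observedRV >= rv ...
	}
}

func (c *watchCache) list() []object { return c.objects }

// serveListFromCache shows the shape of the consistent-read flow.
func serveListFromCache(c *watchCache, latestEtcdRevision func() int64) []object {
	required := latestEtcdRevision() // "how fresh does my data need to be?"
	c.waitUntilObservedAtLeast(required)
	return c.list() // now at least as fresh as the moment the request arrived
}

func main() {
	c := &watchCache{observedRV: 42}
	result := serveListFromCache(c, func() int64 { return 42 })
	fmt.Println("served", len(result), "objects from the cache")
}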

And I think WatchList is in beta right now. Consistent Reads from Cache is in alpha. So if you're up for trying either of these out and providing feedback to the community, I'm sure they'd be super grateful for that. And as a plus, there are also tangible benchmarks that the implementers of these KEPs have done, to show that you actually get quite a good amount of performance boost by enabling these features on your cluster.

KASLIN FIELDS: Yeah, you have some graphs in your talk, where you show exactly how the performance boost is with these.

MADHAV JIVRAJANI: Yep.

KASLIN FIELDS: And you also explain this concept of the resource version, and how that addresses this issue, in the talk. So I definitely recommend that folks check that out. Any other parts of the talk that you want to call out? Because the parts that I really liked were, you did a really nice, visual explanation of essentially how Kubernetes deploys workloads onto the nodes, and then you also talked about these two KEPs that you just mentioned. Anything else that you want to call out about the talk? For folks who are interested in learning more about this, it's a good resource.

MADHAV JIVRAJANI: Yeah. So specifically, if you're interested in the stale reads side of things, there's a part in the talk where I talk about how a list request is served by Kubernetes. So that particular part talks about the stale reads issue. But what was really interesting to me was that understanding this one issue really gives you a nice overview and a mental model of what the entire storage layer looks like and how it functions.

KASLIN FIELDS: It does, right? It's not a talk about that issue, but it overlaps so much.

MADHAV JIVRAJANI: It ends up becoming one, yeah.

KASLIN FIELDS: Yeah.

MADHAV JIVRAJANI: Yeah. It culminates in this issue, because the issue touches upon almost all facets of the storage layer, and just Kubernetes as a system in general. So I found that part to be really interesting.

KASLIN FIELDS: Cool. So to rephrase in my own terms, the KEPs that we have in progress to fix this-- and they're already in beta, which I hadn't looked into exactly where they are, but that's good to know. But the way it works, essentially, is that we have these caches on the nodes. We have the cache on the control plane. We have etcd. And this is having the nodes go to the control plane to check what the most recent version of something is, just to get a number to compare to what it already has, and then it knows if it's out of date. Is that right?

MADHAV JIVRAJANI: Yeah. So anything can send a request to the API server, so it doesn't really have to be the node itself-- for example, kubectl or a controller that's running in your cluster. But for any request that the API server receives, the API server then does this additional sort of check, by going back to etcd and asking, how fresh does my data need to be? Or rather, more specifically, how fresh does the data in my cache need to be for me to successfully serve this request to whoever has requested it? So that's the sort of extra check the API server performs.

KASLIN FIELDS: Nice. So it has an element of going all the way back to etcd, which is the source of truth, to figure out exactly what's going on and help prevent this issue from happening.

MADHAV JIVRAJANI: Yes.

KASLIN FIELDS: Excellent. And one other thing in the blog post that I wanted to mention, that you mentioned here, is the concept of safety and liveness, as I think it was described in the blog post, which is the CAP theorem part of all of this.

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: Essentially, the whole concept where the node might give you or the API server might-- I don't know how to explain this properly. The API server on the node might give you stale data.

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: That is a violation of safety, whereas if it had just not responded that would have been a violation of availability. And in these types of situations, those are pretty much the two things that can happen, is what the CAP theorem says, is either you don't respond, because you know might have stale data, or you do respond and the data might be stale, which is a violation of safety whereas not responding would have been a violation of aliveness. So we, in Kubernetes, go the violation of safety route with this issue currently, where it might return stale data and these CAPs are trying to address that.

MADHAV JIVRAJANI: Yeah. So if you look at the CAP theorem and then put it in the context of Kubernetes, I think it's really interesting because you have two perspectives of CAP here. One perspective is etcd itself because etcd is a distributed key value store that Kubernetes uses to persist this data.

KASLIN FIELDS: And is now a SIG in Kubernetes, as we have discussed.

MADHAV JIVRAJANI: Yes, it is in Kubernetes. But it's a strongly consistent data store. What that means is, any read that you do, or at least any read that Kubernetes does, is linearizable. So that's just a way of saying that any data that etcd returns to you is going to be the most recent that it knows of.

So in the context of the CAP theorem, etcd is what is called CP. It can't be highly available without sacrificing these properties, because it needs to do that round of consensus to give you the latest data. But the moment you put Kubernetes in the picture, and you have this caching at the API server, the scenario sort of changes a little bit. Because now, if you have a partition across the two API servers, Kubernetes is still going to be alive and still going to be serving reads to you because there is no communication between the API servers themselves.

So it's this dichotomy that happens, if you look at CAP from the perspective of etcd alone, and then from etcd and Kubernetes with the caching enabled. But yeah, if you have etcd and Kubernetes together in the picture, and you look at CAP, Kubernetes will still serve requests from the cache, even if both-- or however many of-- our API servers are partitioned. But consistency might be violated, which is what this issue tells us. That's the limitation of the CAP theorem itself.

KASLIN FIELDS: Theoretical computer science for you this evening, morning, or afternoon, whenever you're listening to this.

MADHAV JIVRAJANI: Yeah, yeah. You know, it's really interesting, because the person who came up with the CAP conjecture-- I'm going to call it a conjecture because it wasn't a theorem when it was proposed-- is at Google right now. And he is one of the-- he played a big role in the success of Kubernetes itself.

KASLIN FIELDS: Oh, right. Professor Kuper mentioned in the blog post that it was Eric Brewer, who--

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: --I talked to at a Google event, actually. And I did not realize how deep of an engineer he is. And I was talking about some technical issue that I wanted to try and solve. And he was like, you know, that's very interesting, and he gave some suggestions. It was fantastic.

MADHAV JIVRAJANI: That's so cool though.

KASLIN FIELDS: Yeah.

MADHAV JIVRAJANI: I would love to talk to him someday. I don't know. Maybe someday.

KASLIN FIELDS: Yeah.

MADHAV JIVRAJANI: But yeah, it was really interesting because-- so the blog post in itself was, of course, very insightful in terms of what the issue does and how to look at it. But I previously hadn't really thought about etcd plus Kubernetes from a CAP perspective, and what that would look like. Because now you have a layer of caching involved, and that always complicates things when you reason about the trade-offs that CAP provides you with.

KASLIN FIELDS: Yeah, Professor Kuper mentions that it's a slight deviation from the original CAP theorem, because it's indirect communication rather than direct communication.

MADHAV JIVRAJANI: Yeah.

KASLIN FIELDS: It still falls into the same realm.

MADHAV JIVRAJANI: Yeah, for sure. So the API servers don't talk to each other. Rather-- so if you watch the talk that I had given, any request that creates a resource or updates a resource is always going to go to etcd, and then that update is back-propagated into the cache. So that's how the cache remains coherent-- quote unquote, "coherent"-- because anything that goes into etcd is then put back into the cache via a watch mechanism that Kubernetes uses to serve everything. So that's what I think Professor Kuper meant when she said it's an indirect communication between these layers, not a direct one as such.
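From a client's point of view, the list-then-watch pattern he's referring to looks roughly like this minimal client-go sketch (assuming a reachable cluster, a default kubeconfig, and the "default" namespace): list once, remember the resource version of that snapshot, then watch from it so every later write that lands in etcd is streamed back as an event.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// 1. List once to get a snapshot plus the resource version it corresponds to.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	rv := pods.ResourceVersion

	// 2. Watch from that resource version: every write that reaches etcd after this
	//    point comes back as an event, which is how caches stay coherent with the
	//    source of truth without the API servers talking to each other.
	w, err := client.CoreV1().Pods("default").Watch(context.TODO(),
		metav1.ListOptions{ResourceVersion: rv})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		fmt.Println("event:", event.Type)
	}
}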

KASLIN FIELDS: Yeah, that makes sense. And this Watchlist operation thing is key to this whole scenario, and you have--

MADHAV JIVRAJANI: Yes.

KASLIN FIELDS: --an explanation of that deeper in the presentation as well.

MADHAV JIVRAJANI: Yep. Yeah, for sure.

KASLIN FIELDS: Excellent. So we have this stale reads issue, where it's possible, when a node goes down, that it might read from a stale cache. We've got some features to address this coming into Kubernetes, in beta right now, where there's a check between the API server and etcd to make sure that everything is up to date before it responds. And I had a whole lot of fun learning about etcd and the CAP theorem today. So thank you so much for joining me, Madhav.

MADHAV JIVRAJANI: Yeah, for sure. I had a lot of fun thinking about it as well. So, yeah, thank you for inviting me. I had a lot of fun with that.

KASLIN FIELDS: And for everybody out there, a SIG ContribEx thing for you, an open-source Kubernetes thing-- there's been a banner on kubernetes.io for several months now about the legacy Linux package repositories being deprecated. So if you're running Kubernetes yourself, the place it has traditionally pulled the packages to run Kubernetes from is going away. And it was frozen in September of 2023.

So if you're beyond 1.24, then you must have already switched over to the community-owned repository that's replacing it. But if you are running a cluster yourself that is earlier than 1.24, then you may be using the deprecated, frozen repositories. And those repositories are going away. Just like in Kubernetes versions, a feature will be deprecated, which means that it's going to go away in a future version. And now, in January-- January of 2024-- these old repositories are going to be removed.

Depending on when you're listening to this, this hopefully, maybe, has passed you by and you didn't even know about it. But if you're running a cluster before 1.24 yourself, you need to go and make sure that it's using the updated community-owned repository-- well, for versions 1.24 and beyond, that it's using the community-owned repository. Or preferably, honestly, that you're creating mirrors of those packages, or building your own packages, and using them from a local repository.

So there's some documentation, some blog posts, about how to do this and exactly what you need to do. We'll make sure that those are linked in the show notes. And I just wanted to shout that out. If you are an end user running your own clusters, make sure that you know about this change that's happening in January 2024.

MADHAV JIVRAJANI: Yep.

KASLIN FIELDS: Anything you want to add, Madhav?

MADHAV JIVRAJANI: So if you don't want to have a caching layer, you can disable it. There's a flag on the API server to do so. You can also disable the cache-- so a little bit more context. There is a cacher that exists per resource type in Kubernetes. So there's a cacher for services. There's a cacher for pods. There's a cacher for nodes. All resource types, essentially.

If you want to disable caching for a specific resource type, you can do that as well. So you aren't obligated to use the cacher at all. If you aren't running massively large Kubernetes clusters, you may not even need the caching mechanisms-- at least the API server caching mechanism, that is, the cacher. So there are flags on the API server that you can set that will disable the caching for you, and you won't really need to worry about all of these issues popping up.

KASLIN FIELDS: Interesting. So the trade-off here is, Kubernetes does this caching because it's meant, of course, to run very large scalable systems. But if you're running a relatively small cluster, you don't necessarily need the performance that the caching gives you. You could avoid this availability, consistency issue that can occur by not having the cache at all, and just going straight to etcd?

MADHAV JIVRAJANI: Yeah. So you get highly consistent reads, but at the same time you lose the performance benefits that the cacher gives you.

KASLIN FIELDS: Gotcha. Interesting.

MADHAV JIVRAJANI: So it's-- everything's a trade-off. There's no free lunch. So depending on what your use case is, you have the option of disabling the cacher. And it's not an unpopular option, either. I know quite a few folks who do it. So, yeah.

KASLIN FIELDS: I did not know that this was an option. I should talk to my friends who run small Kubernetes clusters to run home automation and stuff, and see if they're using the caching, and also make sure that they're on the most recent package repositories, or have their own local ones, preferably.

MADHAV JIVRAJANI: Yes. Yes.

KASLIN FIELDS: Awesome. Thank you everyone who joined us today and made it this far, and thank you, Madhav, for joining us.

MADHAV JIVRAJANI: Yeah, thanks so much for having me.

KASLIN FIELDS: Had a great time learning.

MADHAV JIVRAJANI: Yes. I had a great time talking. Thank you so much.

[MUSICAL STING]

MOFI RAHMAN: Thanks, Kaslin, for the interview. In the Kubernetes world, although it's a system for running distributed applications, for the most part we sort of get to take a back seat, and we don't have to think about it as a distributed system-- we just handle YAML. And I can put on my Chief YAML Officer hat.

KASLIN FIELDS: Which you actually, physically have.

MOFI RAHMAN: Yeah. Audiences can't really see it, but I'm wearing a hat that says Chief YAML Officer. And for the most part, we can just like talk to Kubernetes via YAML and all of the things just work. And this is one of the examples of, What if it didn't work?

KASLIN FIELDS: Yeah. Because most people, you can just ignore that it's a distributed system, right? It's just you use Kubernetes, and it does the thing, and it's hiding from you all of the distribution that's happening underneath. But this is one case where it's like, well, there are still issues with that distribution.

MOFI RAHMAN: Yeah, like when Madhav was describing the ways to reproduce this issue-- and there is a GitHub issue that talks more about what the possible hypothetical scenario is-- there are like four or five different things that have to simultaneously fail in a specific order for this exact situation to arise. Which, again, is one way of assuring that most people will never face this. But at the same time, when it does happen, it's going to require so much more debugging to figure out why it happened. Because reproducing it is not a trivial task either.

KASLIN FIELDS: Yeah. And getting the right error messages, if there were an error message that you could throw for this. In my life as a test engineer and a QA engineer, a lot of my bugs were like, hey, this error message isn't going to tell users what they need to know.

MOFI RAHMAN: Yeah. And when you were talking about the CAP theorem, it basically threw me back to my university days, when we were learning this in our courses. And I actually did not associate this name-- on the Wikipedia page for the CAP theorem, it actually also says Brewer's theorem.

KASLIN FIELDS: Brewer's theorem, it's called?

MOFI RAHMAN: It mentions Eric Brewer by name. And I saw his name in our internal communication and documentation. Eric Brewer happens to work for Google and I never associated those two things together.

KASLIN FIELDS: Nope.

MOFI RAHMAN: Yeah, because Eric, right now in Google, he's a VP of something. And in my mind, most VPs are dealing with business decisions.

KASLIN FIELDS: Yep.

MOFI RAHMAN: So I never actually associated Eric Brewer, the computer scientist, as Eric Brewer, the VP I see in different chats and such.

KASLIN FIELDS: Yeah. I think I mentioned in the episode, I went to an event at the Kirkland office of Google, and there were some high level leaders there. It was kind of a schmoozy thing, where people got to hang out before the holidays, and just meet each other. And Eric Brewer was there and I was talking about some issue in his vicinity.

And he was like, oh, that's a really interesting topic, actually. And he gave me some suggestions, which I wrote down somewhere and need to find. But I just thought that he was a VP, and then he had these really deep insights on the technology. And now I'm very retroactively embarrassed to have talked to Eric Brewer about something that I was going to turn into an intern project.

MOFI RAHMAN: Yeah, and also, this is going to test the listeners' knowledge of 2010s memes. The easiest way I understand the CAP theorem is that classic meme of, you can have sleep, a social life, or good grades, and you can only choose two of the three.

KASLIN FIELDS: That's exactly what I thought when I heard it, yeah.

MOFI RAHMAN: And the CAP theorem basically says you can only have two of those three, which are consistency, availability, and partition tolerance. And basically, for any distributed system that you build out, you have to cognizantly choose the two of them you want to target, and for the third one you basically either have some guardrails so that failures don't happen, or have some tolerance, when they do happen somehow, to recover from that failure. Which, again-- university me saw that meme, saw the CAP theorem, and I'm like, OK, that's how I'm going to understand it, and if I ever have to explain it in life, that's how I'll explain it to people.

KASLIN FIELDS: Maybe I didn't have that meme in my arsenal yet, but when I saw CAP theorem, when I saw Madhav's posts about this, I was like, that is definitely something we covered in school. Yep. Those are some words that I once maybe knew.

MOFI RAHMAN: I mean, another interesting thing is, a lot of these theoretical computer science things, most engineers working in most industries will probably never encounter. Because, again, these systems are built by a very small group of people. Another classic XKCD thing that pops up is that almost all infrastructure is built on top of this one tiny thing, one pin.

KASLIN FIELDS: Mhm

MOFI RAHMAN: That someone, somewhere understands these kind of theoretical computer science things enough so that they can keep building, maintaining, and finding issues in our everyday things that we use, and thank god for that.

KASLIN FIELDS: Yeah. Have you seen the one of Kubernetes, with etcd as that little pin holding up the whole thing?

MOFI RAHMAN: Yeah.

KASLIN FIELDS: Yeah.

MOFI RAHMAN: And again, in the last episode I was in, we actually talked about the same exact thing there too, because that episode was about etcd.

KASLIN FIELDS: I actually have a light-up, acrylic standee of that diagram of Kubernetes with etcd as the thing holding everything up. Ashley Willis, also known as Ashley McNamara on social media, made that for me. It was very nice of her.

MOFI RAHMAN: So I think, as an action item for the listeners: if you are someone who is running a very large-scale Kubernetes cluster, these problems, like inconsistency in a distributed system, do not happen often, but they can. So the goal as a community, probably, is to look out for these kinds of issues that can happen in a distributed system like Kubernetes. And when they do happen, document them well and bring the issues up to the Kubernetes project itself, so that as a community we can find this out, document it, and try to find solutions wherever possible.

KASLIN FIELDS: Yeah. And I hope that you all enjoyed this deep dive into how Kubernetes works as a distributed system, a good reminder that Kubernetes is a platform that has its own considerations for how it works. And there are a bunch of other resources that I mentioned in the chat with Madhav, that we'll have linked in the show notes.

So if you want to go check out his talk, where he dives deeper into the considerations around-- how did he word it? Kubernetes state, Kubernetes storage. Still very confusing terminology for me. But if you want to dive into that more, I'll have those links in the show notes, as well as a talk by Priyanka that he mentions in that talk, that is also on a similar topic with etcd.

[MUSIC STING]

MOFI RAHMAN: That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes Pod, or reach us by email at <kubernetespodcast@google.com>. You can also check out the website at kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player, so we can help more people find and enjoy the show. Thanks for listening, and we will see you next time.

[MUSIC PLAYING]