#249 March 25, 2025
Ahmet Alp Balkan and Ronak Nathani are software engineers on LinkedIn's Compute Infrastructure team, running the Kubernetes platform for LinkedIn. They joined us to talk about how they run Kubernetes at scale and what they learned along the way.
ABDEL SGHIOUAR: Hi, and welcome to the "Kubernetes Podcast from Google." I'm your host, Abdel Sghiouar.
MOFI RAHMAN: And I'm Mofi Rahman.
[MUSIC PLAYING]
ABDEL SGHIOUAR: In this episode, we talked to Ahmet Alp Balkan and Ronak Nathani. Ahmet and Ronak are software engineers at LinkedIn, part of the Compute Infrastructure team running the Kubernetes platform for LinkedIn. They joined us today to talk about how they run Kubernetes at scale and what they learned along the way.
MOFI RAHMAN: But first, let's get to the news.
[MUSIC PLAYING]
CubeFS has moved to the CNCF Graduated maturity level. CubeFS is a distributed storage system supporting access protocols like POSIX, HDFS, and S3. The project was created in 2017 and was accepted into the CNCF in 2019. The project is used for AI/ML workloads, but also for container platforms and databases, where separation of compute and data storage is required.
ABDEL SGHIOUAR: Canonical announced 12 years of Kubernetes long-term support. Release and upgrade frequency is a recurring topic of discussion within the Kubernetes community. While upstream Kubernetes offers 14 months of support and major cloud providers extend that to about two years, this new announcement from Canonical extends the company's strategy of long-term support for Linux to Kubernetes.
The company will be releasing LTS versions of Kubernetes every two years, starting with version 1.32, and interim releases every four months. With the Ubuntu Pro subscription, LTS versions of Kubernetes will continue to receive CVE patches for at least 12 years.
MOFI RAHMAN: The conference season is starting, and events are rolling out. Here is a rundown of what to expect in March and up to KubeCon London-- KCD Beijing on March 15, KCD Rio de Janeiro on March 22, KCD Guadalajara on March 29.
ABDEL SGHIOUAR: And that's the news.
[MUSIC PLAYING]
All right, today, we are talking to Ahmet and Ronak. Ahmet and Ronak are software engineers at LinkedIn. They work for the Compute Infrastructure team running the Kubernetes platform for LinkedIn. And they join us today to talk about how they run Kubernetes at scale and what they learned along the way.
Welcome to the show, Ahmet and Ronak.
AHMET ALP BALKAN: Hey. Thanks for having us.
RONAK NATHANI: Yeah, thanks for having us.
ABDEL SGHIOUAR: So we had a very, very interesting discussion at KubeCon North America. And you folks told me you're running an insane scale of Kubernetes on bare metal, which is-- I still have to comprehend. So let's start with the basics. Is everything at LinkedIn doing Kubernetes? Is everything running in Kubernetes, or do you run something else?
RONAK NATHANI: Right now, it's not just Kubernetes. And I can provide some context to this as well. Back in the day, I would say around 10, 11 years back, when Docker 1.0 wasn't around, LinkedIn still needed containerization because we wanted to make sure we could bin-pack applications and stack them on a single machine. So we wrote our own container runtime. We also wrote our own scheduler.
ABDEL SGHIOUAR: Of course.
RONAK NATHANI: Friends don't let friends write schedulers, by the way. Just saying. And that stack has served us pretty well, actually. It's been running within all of our bare metal data centers, and it's been scaling as the site has grown. But over the last few years, we realized that it's aging a little bit too.
So the marginal cost of adding every new feature is increasing. Or rather, it's increasing more than linearly. And with Kubernetes and other open-source ecosystems becoming just way more mature, it just made sense for us to transition onto that path. So we've been on this journey for a while now, moving majority of our workloads to Kubernetes.
And this includes stateless, stateful, as well as batch workloads. Not everything is on Kubernetes yet, but it is soon going to be. And if you ask any of our managers, they will say, oh, they wish it was yesterday.
ABDEL SGHIOUAR: All right, so then I think this is like-- I have to ask this question, what about databases? Do you folks do databases on Kubernetes? What do you think about that?
RONAK NATHANI: So we are insane enough to run Kubernetes on bare metal, and we are also insane enough to run databases on Kubernetes. It's kind of a running joke, where people say, well, Kubernetes cannot support stateful systems. But to be honest, Kubernetes is quite flexible and adaptable. If you understand it deeply, and when you control the full stack-- all the way from your bare metal machines, to the configuration on top, to what kind of disk you attach, to the scheduler, where you can write your own scheduler plugins, to the API server, where you can have very strict policies on what features one can and cannot use-- you can go as far as running stateful systems on Kubernetes too.
And I'll also say that we have several stateful systems which use local disk. So we don't use network-attached storage everywhere because of performance issues. We run these applications on Kubernetes right now. And again, these are in a transition phase. Migration is ongoing. And the part of controlling the stack that lets us do this is owning the full maintenance lifecycle.
And I'm sure Ahmet can speak more to that, where, when we run maintenance across our data centers, it coordinates the requirements with the stateful systems. So we can go into that rabbit hole, if you would like. But we don't just evict the pod and say, here, you have a one-hour grace period. Hey, database, you got to shut down.
But what we do is we respect whether an application or a stateful system is OK being shut down at that moment in time or not. And this is beyond just using PDBs.
ABDEL SGHIOUAR: Got it.
AHMET ALP BALKAN: And honestly, I might add that we have written our own generic stateful workload operator. This was our talk at last KubeCon North America. So we introduced this system that you can bring a stateful system to. And the stateful system needs to implement a particular protocol to take a workload out of rotation or add a new instance.
And that protocol is largely Kubernetes agnostic, which lets us run any number of different databases without writing a separate Kubernetes operator for each.
ABDEL SGHIOUAR: Oh, interesting. So does that make the workload itself Kubernetes agnostic if it has to implement the protocol? Like, the database will have to listen to something before it gets evicted. How does that work? I'm curious about this.
RONAK NATHANI: So the database doesn't need to be scheduler aware in this case. So let's say, for example-- I'll just take a simple example. I'm running etcd on Kubernetes in this case. So this protocol is what we call an application control manager, or ACM. So think of it as an endpoint which our controller talks to.
So this endpoint is aware of Kubernetes, but your database is not. For any operation that requires your pod to be moved-- whether that's because of an update you're making to the pod, like a version change or a CPU resource change, or because of maintenance-- what this generic controller does is contact your ACM to say, hey, I'm trying to do this sort of operation.
It's an update operation, or a scale-out, or a scale-in. One thing which is also interesting-- and just a side note-- is, for many stateful systems, there's a need for instance swaps as well, meaning you don't want the entire system to go through a rebalance storm just because a machine went down. So what you sometimes want to do is you say, hey, Shard A, just stay Shard A.
I'm going to give you another machine, replicate this data. And don't shuffle everything. So there are these kinds of operations that this protocol supports. So during this update, we talk to that ACM for the specific database and say, hey, this is the kind of operation I'm doing. This ACM is aware of the health of the database and checks, do I have enough replicas for each partition that might be impacted if you take down this pod, for instance?
And that provides a yea or nay, whether the system can proceed with that specific operation or not.
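To make the shape of that protocol concrete, here is a minimal sketch in Go of what an ACM-style approval endpoint could look like. This is our illustration under stated assumptions-- the endpoint path, field names, and the replica-count check are hypothetical, not LinkedIn's actual interface.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// OperationRequest is what a platform controller might send before
// disrupting a stateful instance. All field names are hypothetical.
type OperationRequest struct {
	Operation string `json:"operation"` // e.g. "update", "scale-in", "maintenance"
	Shard     string `json:"shard"`
	Instance  string `json:"instance"`
}

// OperationDecision is the yea-or-nay answer, plus a reason.
type OperationDecision struct {
	Approved bool   `json:"approved"`
	Reason   string `json:"reason"`
}

// healthyReplicas would normally query the database's own health API;
// it is stubbed here to keep the sketch self-contained.
func healthyReplicas(shard string) int { return 3 }

const minReplicasForDisruption = 3

func approve(w http.ResponseWriter, r *http.Request) {
	var req OperationRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	decision := OperationDecision{Approved: true, Reason: "enough healthy replicas"}
	// Deny the disruption if taking this instance down would leave the
	// shard below its safety threshold.
	if healthyReplicas(req.Shard) < minReplicasForDisruption {
		decision = OperationDecision{Approved: false, Reason: "shard under-replicated"}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(decision)
}

func main() {
	http.HandleFunc("/v1/approve", approve)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```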
ABDEL SGHIOUAR: Got it.
RONAK NATHANI: And we go into way more detail on a blog post, as well, that we published for the stateful system. And we're more than happy to share that with you if you want to add it to the show notes.
ABDEL SGHIOUAR: Yeah, we'll make sure that both the talk and the blog is added to the show notes. You folks reminded me of something that we have at Google, so I think I have just one more question about this, and then we can move on.
RONAK NATHANI: Sure.
ABDEL SGHIOUAR: Does this mean that this application controller manager can also be configured by specific teams with policies? Like, can I say what sort of disruption my application can handle or can support?
RONAK NATHANI: Yep. So every single database or a stateful system that you run on this platform brings its own ACM. And we provide the protocol. So as long as you abide by their interface and the protocol, you write your own ACM. In many cases, several teams share this ACM, too, because they follow a similar disruption model. And behind the scenes, they can control how to approve a disruption or an update.
ABDEL SGHIOUAR: Got it. Got it. Interesting. OK. So what about the things that are not Kubernetes specific? So you talked about etcd. Do you run etcd on Kubernetes, which then require Kubernetes to-- how does that work? How do you handle cyclical dependencies?
AHMET ALP BALKAN: Today, our Kubernetes runs on our, quote, unquote, "legacy" orchestration stack. We run Kubernetes and all its control plane components-- API server, etcd, controller manager, scheduler-- directly as systemd services. Now, that said, we don't want to maintain these two stacks, one for running the Kubernetes control plane and one for running workloads on Kubernetes.
So that's why we actually want to run Kubernetes itself on Kubernetes. I think some people call it Kubeception. Yeah, we want to go through this journey. And we're actually midway through our development. We don't have anything in production, operating this model, but we know for a fact that there are cloud providers out there running the Kubernetes control plane as pods.
And so, as a result, we know that this is possible. And we know that there's a lot of cost savings, actually. When you're running etcd on a machine, just because there's a single etcd instance on that very gigantic bare metal machine, it's actually pretty wasteful. So, yeah, we were planning to run Kubernetes inside Kubernetes for that reason as well.
So today, our etcd directly runs on host disk. And I think we'll continue to maintain that. And network-attached storage latency is unfortunately not tenable for etcd writes for us, so we'll probably keep it that way as well. Now, one thing that we're actively participating in is an open-source project going on in the community right now around etcd-operator.
And we want to participate in that, mostly because we have been solving the exact same problem ourselves. We wrote an operator specifically for etcd just so that we can handle these disruptions without losing data, et cetera. And we don't want to create too many cyclic dependencies between etcd and Kubernetes either. So we want to keep this stack pretty lean.
And that's where the etcd-operator itself is pretty promising. And in terms of stuff like networking, I would say that any company that is running on bare metal at this scale pretty much has its own networking stack, one that has nothing to do with anything out there in the cloud-native ecosystem right now.
So for a lot of our workloads, we don't use kube-dns, right? We don't use anything like Flannel for the vast majority of our workloads. So a lot of our network stack is pretty much flat, data-center-routable networking, where any pod is routable to any other pod with an IP address directly.
ABDEL SGHIOUAR: Got it. So what are your thoughts, then, on-- I think you probably have seen it. There is a movement-- I don't know if it's a movement. But we announced last year that we can do 65,000 nodes. And part of this announcement was moving towards Spanner.
I know that Spanner is not something that exists on prem. But what are your thoughts on that, moving away from etcd altogether?
RONAK NATHANI: Our thoughts on this are we would love to do that at some point. In fact, this is a question that was pretty popular, I would say, or a topic that was very popular at KubeCon North America in fall last year. Most of the teams we spoke with who run Kubernetes at any decent scale basically came back and said the same thing, where now our bottleneck is etcd.
Small, couple hundred node clusters, they just don't cut it for something that we are running, where you have machines on the order of six digits. So we want to be able to run or replace etcd, where we can push the boundaries of how big a cluster can get. Now, some of the things that Ahmet mentioned, because we have this control over how we manage the entire control plane, plus the database, today, what we do is we shard events into a separate etcd cluster, just so that we can scale that part.
But because we don't use several components within Kubernetes-- and that's by choice-- for example, kube-dns or CoreDNS or Kubernetes Services-- that's a design choice where we don't want to use some of these objects as part of our application ecosystem, because several of these things-- service discovery, network policy, et cetera-- run at a global scale for us. So because we don't use these things, it takes away that load from the API server and etcd.
But it also means, because we run large clusters, we still have a large number of certain objects, like nodes and pods. So in general, I would say if there are open-source alternatives to etcd which allow us to scale the cluster way beyond what we can do today, that would be of a lot of interest, not just at LinkedIn, but also to some of the other folks we have spoken to. And the Spanner part, I wish it was available.
[LAUGHTER]
ABDEL SGHIOUAR: I wish as well. So when we talked at KubeCon last year, one of the questions that lingered in the back of my head is bare metal. So you run at bare metal. So I worked on data centers way before Kubernetes existed. So I racked servers. I actually dealt with physical hardware.
So say I am a technician at LinkedIn. I come in, I rack a piece of hardware, a server, right? I connect power and networking. What happens after that? Where does Kubernetes come into the mix?
AHMET ALP BALKAN: I would say that the answer a year ago at LinkedIn versus the answer now is pretty substantially different. The last year has been pretty transformational in that regard, in that I would say we built our own infrastructure-as-a-service machine management layer from the ground up. And as part of that project, we still have data center technicians that are still racking machines.
All of that is still there. Now, one thing that we've done is more programmatic management of our data center inventory. So any time you rack a machine, it automatically gets added to our infrastructure as a spare machine in our data center. It essentially becomes spare capacity in a spare pool.
And so from that point on, we have built various APIs that are very similar to Kubernetes APIs to manage the list of machines and the pools that we have in our data center. So these APIs are kind of like Kubernetes Resource Model, declarative APIs almost. And we also built an orchestration layer on top in Kubernetes to configure these pools.
So basically, if I want to add that new capacity to my pool in one of my Kubernetes clusters, I just go to an object called Kubernetes pool-- and as you can imagine, a custom resource that we created. And once I declare that intent, that intent is communicated to our infrastructure as a service layer. And then that machine gets provisioned and added to our Kubernetes cluster.
Now, when I remove this machine, it similarly goes through some sort of state machine where the machine gets wiped clean and gets returned back to the spare capacity so that next time someone else requests it, it can be added to a new pool. So that's basically pretty similar to how cloud providers work, right? The main difference here is that they are not VMs; they're just bare metal machines. And we have a way to reimage them and clean them up every time we use them.
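As a rough illustration of the kind of declarative pool object described here, the following kubebuilder-style Go types sketch what such a custom resource could look like. The type and field names (MachinePool, NodeProfile, and so on) are our assumptions for the example, not LinkedIn's actual CRD.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachinePoolSpec declares the desired shape of a pool of bare metal
// machines. Everything here is illustrative.
type MachinePoolSpec struct {
	// NodeProfile abstracts away concrete hardware SKUs (e.g. "high-memory").
	NodeProfile string `json:"nodeProfile"`
	// Replicas is how many machines this pool should hold. Scaling it up
	// pulls machines from the spare pool; scaling it down wipes machines
	// and returns them to spare capacity.
	Replicas int32 `json:"replicas"`
	// Cluster names the Kubernetes cluster the machines join once provisioned.
	Cluster string `json:"cluster"`
}

// MachinePoolStatus reports what the IaaS layer has actually reconciled.
type MachinePoolStatus struct {
	ReadyMachines int32 `json:"readyMachines"`
}

// MachinePool is the declarative intent an operator edits; a controller
// reconciles it against the bare metal inventory.
type MachinePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachinePoolSpec   `json:"spec,omitempty"`
	Status MachinePoolStatus `json:"status,omitempty"`
}
```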
ABDEL SGHIOUAR: Interesting. So you mentioned that there are controllers and there are CRDs. So Kubernetes is still the orchestration layer even for the actual physical hardware. So I think the question pretty much is-- I mean, it's a question that comes often. What is the right number of clusters? Is it one or is it thousands?
RONAK NATHANI: 42.
[LAUGHTER]
ABDEL SGHIOUAR: 42, that's a good one.
RONAK NATHANI: I would say, depending upon the use case, the answer would vary for that specific team. In our case, there are different environments. So specifically for our staging environments, we create test clusters. They are not very big in size. For the many teams developing platforms on top, where they're running CRDs and webhooks-- kind, minikube, or any of these solutions are helpful, but they only go so far.
In many cases, you want to be able to run these tests in an environment where you will actually be running these systems. So we create these test clusters, which are isolated from the rest of the clusters running production workloads, and they are much smaller in size. They look very similar in terms of their, let's say, control plane shape and size, with policies enabled, et cetera. Those aspects are similar.
But the number of data plane nodes is not too big. However, when we get to our production clusters, in those cases, we want to run large clusters where we ensure the blast radius is not crazy. So again, 65,000 nodes for a cluster sounds really fascinating, but we wouldn't necessarily run 65,000 nodes in one cluster, even if we had Spanner, for instance.
But we want to be able to push Kubernetes beyond 5,000 nodes. We run our clusters pretty close to that size in production. And considering our scale, we already know that we need to run many of these clusters across multiple regions. And part of the reason why we don't want to run too small of clusters is because of fragmentation.
Capacity gets fragmented too much in certain cases. We have workloads which are really large. And in those cases, when you have this capacity fragmented, it's too much wastage in terms of compute.
ABDEL SGHIOUAR: Of course. Yeah. So what about hardware upgrades? So say I want to upgrade a certain shape of a physical server, a certain shape of a bare metal. I don't know, go from 128 gigabytes of memory to 256. On the same cluster, spanned across multiple regions, how would you handle that? I mean, a cluster is still a single blast radius technically, right?
AHMET ALP BALKAN: Yeah. I would say that, for us, hardware refresh is a pretty regular fact of life that we have to go through every couple of years. And I would say that the way we designed our infrastructure as a service layer is accommodating the fact that we'll have to go through these hardware refreshes. So the way we handle this whole notion of, hey, the bare metal machines may go away someday is actually through the pool concept.
For example, imagine you're in a stateless machine pool. That machine pool actually doesn't directly declare the SKUs of machines that it's going to run on. It actually declares something like a node profile. This is a concept that we introduced as an abstraction between our bare metal machines and what workloads expect.
So for example, if your workload expects high memory, well, you should probably have a node profile called high memory. So as part of the hardware refresh cycle, what we do is we add and remove SKUs from these node profiles. And another capability that we have is that we can make two different machine pools look like a single pool inside a Kubernetes cluster.
Because essentially, we use a label called pool. And we control what machines that we add to the pool. So for us, going from one hardware to another generation of hardware just looks like scaling down one of the pools and scaling up the other pool. By doing that, we can basically decommission the set of machines that we have in the data center.
RONAK NATHANI: And one thing which I would just add is that our machines within a pool are also spread across what we call maintenance zones. Maintenance zones, again, is a concept that we define within our data centers and which is encoded throughout our infrastructure stack. So any pool, if you get machines in that, we spread them as much as possible across all these maintenance zones.
And this also shows up as a label on machines. So we use topology spread constraints on all pods. So any application gets, let's say, at most, 5% of its replicas in one maintenance zone. So any time you go through the scale-up, scale-down exercise, it only impacts, at maximum, 5%.
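A minimal sketch of how those two ideas could appear in a pod spec-- pinning to a pool via a node label and spreading evenly across maintenance zones-- is shown below. The label keys and the app selector are hypothetical, and the spread constraint only approximates the 5% figure: with roughly 20 zones, an even spread keeps each zone near 5% of the replicas.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical label keys; the real ones are internal to LinkedIn.
	const poolLabel = "example.com/pool"
	const zoneLabel = "example.com/maintenance-zone"

	spec := corev1.PodSpec{
		// Pin the pod to machines belonging to a given pool / node profile.
		NodeSelector: map[string]string{poolLabel: "high-memory"},
		// Spread replicas evenly across maintenance zones so that taking
		// one zone down for maintenance disrupts only a small slice of
		// the application.
		TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
			MaxSkew:           1,
			TopologyKey:       zoneLabel,
			WhenUnsatisfiable: corev1.DoNotSchedule,
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "feed"},
			},
		}},
	}
	fmt.Printf("%+v\n", spec)
}
```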
ABDEL SGHIOUAR: Yeah, got it.
AHMET ALP BALKAN: And kubelet upgrades are part of that, too, by the way. Anytime we upgrade the cluster, we actually upgrade them 5% by 5% so that we don't take more than a certain amount of capacity in the cluster.
ABDEL SGHIOUAR: So, Ahmet, you mentioned something very interesting, which I want to follow up. So you obviously have hardware refresh. So you will always have new generation of hardware coming, and all generations will be deprecated or whatever. But I assume, like any big company of your size, you're not going to even have the same platform of CPU.
You're not going to have the same architecture. So let's say I'm a developer at LinkedIn. How do I ensure that I have predictable performance regardless of where my workload runs?
RONAK NATHANI: It's actually one of the most interesting challenges that we deal with, because a lot of applications fall in our serving stack, meaning they're in the critical path of linkedin.com. An example is your feed. An application that is powering the LinkedIn feed, for instance, is extremely latency sensitive and, behind the scenes, is actually calling out to multiple services, evaluating multiple posts, and seeing which one to rank and give you back on the feed so that you're most engaged.
So in this case, what happens is many of these teams are very sensitive to, or aware of, what kind of hardware their applications run on. So from time to time, what we do is-- I'll cover it from a couple of perspectives. Let's say we're introducing a new SKU. Many teams would actually go through the exercise of running performance tests to see how this SKU performs as opposed to the old one that they had.
And some applications care about the SKU very much, in this case, the specific CPU generation. They care about that a lot. In other cases, not so much. So what we do right now is we have different node profiles in this case. And this is something we want to evolve, as well, because of fragmentation problems.
But what we do right now is we'll have a multitenant pool which has different generation SKUs. But we know this pool doesn't go below a certain threshold, meaning these are just the last two generations, and we won't go beyond that. Anything that is older than these two generations goes into a separate pool, which is for internal applications which are not too performance sensitive.
But there are certain cases where we would provide a pool of machines which has the latest SKU. And applications who actually want this would basically say, I want to opt in to asking for this specific SKU. And we ensure they are routed to that specific pool. And then, of course, you have quotas in place to make sure that not everyone comes asking for, just give me the latest one.
But one challenge that this comes with is you end up with fragmentation, where I have a pool of machines which are pretty beefy with the latest generation of CPU. The application asking for it doesn't take up all of it, but I still want to make sure I can guarantee that capacity. One of the ideas we have is to introduce scheduler plugins so that we can adjust node weights depending upon what the application is asking for, while still having these new-generation SKUs in the general pool. This is not something we have done yet, but something we definitely want to explore.
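As a sketch of the scheduler-plugin idea floated here, the scoring logic might look something like the function below. In a real implementation it would sit behind the scheduler framework's Score extension point; the hardware-generation inputs and the weights are entirely hypothetical.

```go
package main

import "fmt"

// scoreNode sketches the scoring such a plugin might apply: if the pod
// did not explicitly opt in to the newest SKU, prefer nodes with older
// hardware generations so the newest machines stay free for workloads
// that asked for them. The constants and generation encoding are made up.
func scoreNode(podWantsLatestSKU bool, nodeGeneration, latestGeneration int) int64 {
	const maxScore = 100
	if podWantsLatestSKU {
		// Opted-in pods score newer hardware higher.
		return int64(maxScore - (latestGeneration-nodeGeneration)*25)
	}
	// Everyone else scores older hardware higher, leaving headroom on
	// the latest generation.
	return int64(maxScore - (nodeGeneration-1)*25)
}

func main() {
	fmt.Println(scoreNode(false, 1, 4)) // SKU-indifferent pod, old node: high score (100)
	fmt.Println(scoreNode(false, 4, 4)) // SKU-indifferent pod, newest node: low score (25)
	fmt.Println(scoreNode(true, 4, 4))  // opted-in pod, newest node: high score (100)
}
```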
ABDEL SGHIOUAR: Got it. Or maybe another idea-- and I'm just going to throw an idea here.
AHMET ALP BALKAN: Please do.
ABDEL SGHIOUAR: And then you do whatever you want-- is create something like flexible capacity, capacity that is spare, that no one is using, that can be used by a less sensitive-- and when I say sensitive, it could be performance-sensitive or time-sensitive workloads. If you have a batch that you can wait for it to run wherever you want-- this is essentially what we have inside of Google, so I'm just telling you how we do it.
[LAUGHTER]
RONAK NATHANI: You're absolutely right. I think we could totally do that. What we see in our experience is the most compute-intensive workloads, they actually care about the SKU a lot. And those are the ones where we need to plan for enough capacity and make sure they get it. But yes, for the lower-priority ones, we could totally do what you suggested.
ABDEL SGHIOUAR: Got it. And so we talked a bunch about how you do stuff. And it sounds to me like the answer to any question where Kubernetes is involved is a custom controller. And I know that Ahmet is a big fan of custom controllers, and I know I'm asking you the immense task of summarizing your whole blog. What's your thinking process about this-- custom controllers, building your own, using something that the community already provides?
AHMET ALP BALKAN: I think it's the reality that a lot of shops out there right now using Kubernetes are developing their own controllers. And that's fine. That's why the whole notion of custom resources and controllers exists to begin with. Now, in our experience, we noticed that the controllers that we put in our production hot path have been risky enough that we have to develop them extremely carefully.
Even the smallest things that you can think of in a controller actually have a lot of importance. When the controller actually runs, when it starts managing thousands of objects, it suddenly becomes a hotspot. It becomes really important how that controller is implemented. That's why I've been trying to share some of the stuff that we learned about controller development pitfalls on my blog.
Now, I would say that if there is a task that can be achieved without writing a controller, we will obviously prefer to do that. There are, of course, some tasks where we opted explicitly into the Kubernetes Resource Model. We said, we're going to create a CRD for this.
And that's why we're writing controllers for those now. In the open source, by the way, if you're just writing controllers to glue a few resources together, there's been a project-- I think AWS open-sourced it-- called kro.dev. That's been, actually, a pretty fun project. I've been trying to find use cases for it internally as well. So far, we don't have any.
But I will say that, as we spoke earlier, some of our internal workload types are pretty custom. Our stateful workload type-- as we call them creatively, LI StatefulSet-- that's a huge controller. That's probably one of our biggest operators. Similarly, we have cluster management controllers, pool management controllers.
We have to have them. That's why we have them. But any time a random team out there in the company shows up saying, hey, I have a controller that I would like to deploy to all our clusters, please, usually, our response is not very positive. I mean, the thing with controllers is that it looks deceptively simple to develop one.
And you can also start believing that your controller works great. However, in a real production environment, you're going to start dealing with memory issues, throughput issues, the strain that you put on the API server, accidental infinite loops that your controller gets caught up in, and things like that. So that's why we're trying to scrutinize the development of new CRDs and controllers pretty actively in our cluster ecosystem.
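Two of the guardrails mentioned here-- bounding the load a controller can put on the API server, and keeping a requeue loop from spinning hot-- can be sketched with client-go as below. This is a minimal illustration under assumptions, not LinkedIn's setup, and the QPS and backoff numbers are only examples.

```go
package main

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Bound how hard this controller can hit the API server. The numbers
	// here are illustrative, not a recommendation.
	cfg.QPS = 20
	cfg.Burst = 40
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = clientset // the reconcile loop would use this client

	// A rate-limited workqueue backs off exponentially on repeated
	// failures, so an accidental infinite requeue loop degrades into a
	// slow retry instead of a tight loop hammering the API server.
	queue := workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
	)
	defer queue.ShutDown()
}
```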
ABDEL SGHIOUAR: It's funny you mentioned the controller pitfalls blog because that's the one I was actually reading. And it feels to me like one of the problems with controllers is that they can do a lot of things, which makes them so fragile-- if you have a controller that can randomly add or remove labels from nodes-- this is an example from your blog-- that sounds like-- like, remove labels?
A lot of things in Kubernetes relies on labels, right? So if you just have an extra piece of software that can just randomly add labels or remove them or whatever, that's a recipe for disaster in a way.
RONAK NATHANI: Yeah. I think you're referring to the tale of the Node Feature Discovery incident--
ABDEL SGHIOUAR: Yes, that's the one.
AHMET ALP BALKAN: --we have on our blog. Yeah, so that was a fun incident. Again, this is one of those things that-- we were running on an old version of this component that was written a really long time ago, and we decided to upgrade. It turns out the entire component was rewritten. And the component occasionally started to remove the labels and stuff like that.
That prompted us to actually dig into the source code of the controller and figure out in how many more places this thing could actually start removing the labels. And we found a few more places. Now, we reported all those paths where the component can fail. But at this point, I think we're probably not going to use that component. As you said, it's too risky.
If something actually has a chance of bringing your production down, maybe that shouldn't be a controller. Or maybe that shouldn't be an off-the-shelf controller that you're bringing in from the open source-- unless that component is very well defined and the whole world relies on it exactly the way you rely on it. Yeah, I think we can talk a little bit about how we evaluate the cloud-native ecosystem separately.
ABDEL SGHIOUAR: I was about to ask that question, actually. I think that that's one of the topics that Ronak mentioned that you would be willing to discuss. You used the component from the open source. It screwed up your environment.
[LAUGHTER]
So you learned from it, I guess, right? So what now? How do you go about evaluating? Number of stars? Do you like the maintainers? How does that work?
RONAK NATHANI: So that's a good question. And before I say anything, I'll just say that all of us-- I mean both Ahmet and I plus the team at LinkedIn-- we really appreciate all the work that all developers do in the open source. Kubernetes is open source, again, and we rely on a whole bunch of other components which are open source too. So we appreciate all the work that goes in, and we want to make sure all the developers are recognized as well.
But I will say that the number of stars doesn't represent how something will run in your production environment. And none of these components are necessarily bad. They are just not the right fit for how we want to use them. I'll take an example. Ahmet mentioned NFD. I'll share another example, Argo CD, for instance.
I love Argo CD as a system. We used it extremely heavily, or we still use it extremely heavily, but it's more of a behind-the-scenes system. It's basically a GitOps engine for us which no user actually sees. It was running perfectly fine until our clusters and our workloads hit a specific threshold. And we had a week-long outage.
And that outage was basically things taking forever to reconcile. When we dug into that code base, we were like, well, we basically have maybe five custom resources that we need this thing to reconcile. But it is actually looking at the entire cluster. It's looking at every change made to a ConfigMap. It's looking at all the node objects and the events.
I don't need it to look at all of that. And when I go and try to disable these settings, in some cases, you end up with a bug where there are layers and layers of caches in different places, some of which have a good TTL on them. In some cases, you have a memory issue. So again, not to pick on Argo CD necessarily-- again, great product. Folks who built it, I've spoken with them personally. Really smart engineers.
And this solves a really critical problem. Again, what we found out is any time we had a specific gap in our capabilities, what we typically do is we would go out and see, is there a solution in the open-source world? Because we don't want to reinvent the wheel. So then we'll go out and see, can we leverage something off the shelf and, in some cases, even contribute back?
We have a very curated experience for an average LinkedIn developer, and we can go into what that looks like as well. In those cases, many components that we pick off the shelf aren't necessarily exposed to the users. So we have a very opinionated platform that we build on top of Kubernetes. So as a user, you wouldn't even know, many times, that your app is running on Kubernetes, and that's by design.
So when we go and use some of these components off the shelf, we start putting them as part of our stack. Once we start scaling our clusters, once we start scaling our workloads, if we hit a threshold where that component doesn't scale anymore, or we find out that, operationally, running this component is really challenging, and as we dig more into that code base, we find out the quality of the code doesn't match our style or our standards internally, where we want to put something in production and be on call for it in the middle of the night, then we go and, essentially, replace it.
And this is an exercise that we have done time and time again, where we start with an open-source component-- because it solves the need of the hour, meaning I don't have to wait three months to solve this gap in our capability. But as soon as we do that, we start evaluating it to make sure this component is actually going to be stable and remain part of our stack for the long term. And if we find out that any of the things I mentioned aren't true, then we look at the cost of what it means for us to write that component from scratch.
And in many cases, if there is a capability we really need as part of our Compute platform and if that capability is really important, then investing in it, where we build it from scratch, seems like the right thing to do. And at this point, we have done that for a few things. We'll see how many more we do.
AHMET ALP BALKAN: I will say, in terms of the evaluation path, anything that we're bringing new into our ecosystem, we're trying to have the teams bringing those things do a pretty large stress test as much as they can. Now, again, stress tests can only go so far. For example, in a bare-metal environment, you really don't always have 5,000 machines sitting around doing nothing, so you can't easily create a very large cluster.
Now, that said, it's still possible to exercise a lot of the controllers and components and see exactly how they break. And especially if you read the source code, or at least have a pretty good understanding of how a controller works, you can figure out where it's going to break. And I would say running Kubernetes at scale is mainly about figuring out where the scaling challenges of each component are.
ABDEL SGHIOUAR: Got it. Got it. So now I want to shift away a little bit from your team and talk about your developers. So how do you folks do platform? Because that seems to be the term of 2024 and 2025 and beyond. How do you platform your platform, I guess?
RONAK NATHANI: How do we platform our platform? Great question. If you ask us, lots of improvements to be made, but we do our best. What I would say is--
ABDEL SGHIOUAR: Good enough. We can stop here. I'm just kidding.
[LAUGHTER]
RONAK NATHANI: Well, I was going to say, it depends on who you ask. If you ask me, our platform is awesome. If you ask some of our users, they're like, well-- they might have a different opinion. But jokes aside, I'll start by saying Kubernetes is really flexible, and it is very adaptable, but I don't think all of these features need to be exposed to all the end users. I think, in our case, our team is pretty opinionated about what we expose to the end users and what are the features that they can use and, in fact, how we use Kubernetes.
And part of that is because we want to curate an experience for our engineers. Now, Kubernetes is a beast. Anyone who says otherwise is either lying or hasn't used it enough, I would say. It's very easy to get started, but we don't want our engineers to worry about what podSpec.dnsPolicy really means, or what podSpec.hostNetwork really means.
Or how do I go about setting my liveness probes in specific detail, for instance? So generally, what we provide is, as a LinkedIn engineer, you specify your compute resources, which is CPU memory, in some cases storage. You would specify if you care about it, the kind of node profile or SKU you want to run on. You would specify your application identifiers.
So we have certain nouns within LinkedIn which uniquely identify your application. You don't have to worry about the registry URI. You just specify the identifier. We take care of where the registry comes from. We ensure it's mapped to the right region. And then you go to-- so you check this into your repo, which is next to your code.
There is a specific directory structure that you follow to specify different variants of your application that you want to run. So you might say, this is my staging, that's my production, for instance. You can do that. You can also enable autoscaling in some cases where you don't want to worry about replica size of your app. You want a system to take care of the replica size based on the site traffic.
Once you do that, you check this in, you go to a UI which has all your application environments listed. So you say, I want to deploy in these three regions for production, these two regions for staging. And then you set up that workflow. And on an average day, you typically go through either clicking that or you have a workflow, which is preset for you.
And as long as your tests pass-- so what you would do is you would deploy something in staging. You would deploy a canary in production. You would run a test to make sure your canary passes the tests that you set up initially. And we have some tests that we provide out of the box, too-- for example, checking that your CPU and memory usage doesn't blow up with the new version.
And if all of those tests pass, most applications have an auto-advance where they would go to the rest of that region, then on to the next region, and so on and so forth. So this is what the typical rollout experience looks like for a user. Now, I will say some of the opinionated pieces here include that a user doesn't just write a deployment file.
A user writes what we call an LI Deployment or an LI StatefulSet, which only has maybe six fields you need to specify, just to get going. But you have the flexibility to override that podspec if you really want to and you know what you're doing. Kubernetes is abstracted, but it is not hidden. So we have our own kubectl plugins, which users can use to look at their pods and exec into them, while making sure they don't have to worry about which cluster their application is running in.
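To illustrate the kind of small, opinionated surface being described, here is a hypothetical sketch in Go of a workload spec with only a handful of fields. The field names are our invention for the example and do not reflect LinkedIn's actual schema.

```go
package main

import "fmt"

// WorkloadSpec sketches the kind of small, opinionated surface an
// internal platform might expose instead of a full podspec. The field
// names are hypothetical.
type WorkloadSpec struct {
	Application string // internal application identifier; the platform maps it to a registry URI
	Variant     string // e.g. "staging" or "production"
	CPU         string // requested CPU, e.g. "8"
	Memory      string // requested memory, e.g. "32Gi"
	NodeProfile string // optional: "high-memory", latest SKU, etc.
	Replicas    int    // ignored when autoscaling is enabled
}

func main() {
	spec := WorkloadSpec{
		Application: "feed-ranker",
		Variant:     "production",
		CPU:         "8",
		Memory:      "32Gi",
		NodeProfile: "high-memory",
		Replicas:    200,
	}
	// Behind the scenes, the platform would expand something like this
	// into a full Deployment or LI StatefulSet podspec, which users can
	// still override if they know what they are doing.
	fmt.Printf("%+v\n", spec)
}
```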
ABDEL SGHIOUAR: Got it.
RONAK NATHANI: And again, I can go into different details based on what you're interested in.
ABDEL SGHIOUAR: Are these kubectl plugins using Ahmet's plugin manager for kubectl?
AHMET ALP BALKAN: Damn. I'm keeping all that mess out of this company.
ABDEL SGHIOUAR: All right. [LAUGHTER]
RONAK NATHANI: I don't know. I love the plugin manager, so--
ABDEL SGHIOUAR: He's too modest. Thank you.
RONAK NATHANI: I will say-- Ahmet might not say it, but all of our platform teams use all of Ahmet's plugins really heavily. Our end users, not so much, because we want to try and abstract some things away from them. If they need to worry about a cluster, a namespace, then we did something wrong. We don't want them to worry about that.
AHMET ALP BALKAN: Yeah. I would say that maybe one thing that I would highlight is that I don't think we want users to be entirely unaware of the fact that they're running on Kubernetes. Because eventually, they're going to find out. And them finding out the hard way is probably not preferable.
We want them to understand kubectl logs, kubectl exec exactly when they need to. Or if they need port forward, OK, let's have you port forward because that's probably going to help you troubleshoot something. But aside from that, we really want to hide Kubernetes in the well-paved path, when everything is happy and dandy. Other than that, yeah, they'll probably see Kubernetes.
ABDEL SGHIOUAR: So you have, basically, a golden path. But then you can deviate from the golden path if you really know what you are doing, in a way, right?
RONAK NATHANI: Yeah. And this is where we found users doing several interesting things.
ABDEL SGHIOUAR: Of course.
RONAK NATHANI: For example, setting your replica count to zero is pretty darn easy. [LAUGHTER] And sometimes, scaling down your application-- say you go from 1,000 to 10 just because you fat-fingered something? Well, that is really easy too. And unfortunately, we have seen some of those incidents in production.
So what we ended up doing was adding a bunch of these guardrails, where you can do a lot of things when you want to go off the golden path, but then there are guardrails to make sure your application is protected and you don't take the site down.
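The core of such a guardrail can be sketched in a few lines of Go; in practice it would sit in a validating admission webhook on the workload object. This is a hedged illustration-- the 50% threshold, the function name, and the break-glass wording are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// validateScaleDown sketches the guardrail described above: reject a
// single update that drops replicas by more than a set fraction, so a
// fat-fingered edit can't take an app from 1,000 replicas to 10 in one
// step. The 50% threshold is illustrative.
func validateScaleDown(oldReplicas, newReplicas int32) error {
	if oldReplicas == 0 || newReplicas >= oldReplicas {
		return nil // scale-ups and no-ops are always allowed
	}
	if float64(newReplicas) < 0.5*float64(oldReplicas) {
		return errors.New("scale-down larger than 50% in one step; apply it gradually or use a break-glass override")
	}
	return nil
}

func main() {
	fmt.Println(validateScaleDown(1000, 800)) // allowed: <nil>
	fmt.Println(validateScaleDown(1000, 10))  // rejected
	fmt.Println(validateScaleDown(1000, 0))   // rejected
}
```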
ABDEL SGHIOUAR: Can I ask a quick follow-up question? As a developer, what about my application dependencies if I need a database or if I need anything else? Monitoring, logging, all the extra stuff that an application needs, how does that work? Is that part of the automation that your platform provides, or is that something that I still have to do myself?
RONAK NATHANI: We have several database teams at LinkedIn. We run a set of-- we provide a set of databases as a service. So there is a team providing relational storage. There is a team providing key value store. There's a team providing a cache as a service. There's a team providing object store, so on and so forth.
And many of these, LinkedIn wrote itself. LinkedIn wrote Kafka. So we have a team running Kafka. We have a team running one of our key value stores, called Espresso. We run Venice, a derived data store which is used for machine-learning features. That is also open source. So we have a data infra org which has a bunch of these teams providing these stateful services.
If I want to deploy a service, then depending upon what I need for my application, I would essentially go and get a database provisioned. And all of this is also automated through a UI. So you would go to a UI and say, I want this much storage for this org. It rolls up into your org quota, for example. So that is handled by a separate team.
It's not part of Compute. But then you get coordinates for your database, which you set up as part of your configs. And you say, hey, application, go talk to this database for example.
ABDEL SGHIOUAR: Got it. Got it. So obviously, talking to you guys is awesome. But this is part of the feedback we got last year, which is that people wanted us to have more end users on the show-- more people actually using Kubernetes, and not just vendors and community, which is typically what we do. And I guess I will have to assume that part of the reason why people want end users is because they want to hear about incidents. Because those are the fun parts, right?
[LAUGHTER]
Can we talk about incidents, if any-- if you can, if you are allowed to say anything-- just high level? You don't have to go into details.
AHMET ALP BALKAN: Yeah, I think one example that I can give is the NFD example that I mentioned earlier. Here is an open-source component we brought into our cluster. It worked fine all these years, and we decided to upgrade one day. Everything seemed good until we started hitting the largest cluster that we had.
And at that largest cluster, the controller's informer was failing to sync in time. And the timeout error was not handled properly. So the controller thought, there is no data, so I'm just going to clear all the labels on all the nodes. It didn't distinguish between the error case and the empty-data case. This happened when I was at Google, by the way.
There was a famous Google Cloud incident where, yeah, I think all of the load balancer, GCLB configs were cleared out because of an empty file. So I would say this sort of stuff happens pretty regularly. I think Ronak's Argo CD example was also pretty relevant to the scale of the cluster itself and how many objects are in the cluster and how much churn there is in the cluster. I would say that, for the most part, a lot of the incidents that we're hitting are a result of churn.
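For readers who want to see the pitfall in code, here is a minimal client-go sketch of the distinction being described: treat a cache that failed to sync as an error and bail out, rather than reconciling against an empty cache and clearing labels everywhere. This is a generic illustration, not the actual NFD code.

```go
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	factory.Start(ctx.Done())

	// The crucial distinction: a sync timeout is an error, not "the
	// cluster has no data". Bail out here instead of proceeding to
	// reconcile against an empty cache and stripping labels everywhere.
	if !cache.WaitForCacheSync(ctx.Done(), nodeInformer.HasSynced) {
		log.Fatal("informer cache failed to sync; refusing to reconcile with possibly empty data")
	}

	// Only now is it safe to treat the lister's contents as the truth.
	nodes, err := factory.Core().V1().Nodes().Lister().List(labels.Everything())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("reconciling labels for %d nodes", len(nodes))
}
```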
ABDEL SGHIOUAR: All right, cool. We're going to put a link to your blog post, too, for people to go read more about that. Obviously, talking about details of incidents is sensitive for a lot of companies, especially since, from the conversation, it sounds to me like at LinkedIn, you've essentially reinvented an internal cloud platform, really. That's how I'm understanding how you function.
RONAK NATHANI: More or less, yeah.
ABDEL SGHIOUAR: So that's actually pretty cool. Well, guys, thank you very much for being on the show. Before we go, Ahmet, you have a blog you want to plug?
AHMET ALP BALKAN: Absolutely, yeah. So we're blogging actively on the LinkedIn Engineering Blog. We're going to be talking about our Kubernetes platform in more detail there. Ronak and I do have a talk coming up at KubeCon London. I'm hoping that this episode airs by that time.
ABDEL SGHIOUAR: It will be.
AHMET ALP BALKAN: Yeah. So we'll be in London, hopefully, talking about Kubernetes platform. Also, I personally blog about various Kubernetes controller misadventures as well. I think my next article is going to be about all the ways Kubernetes can evict your pods. Because we learned this the hard way as well.
ABDEL SGHIOUAR: I cannot wait to read that.
AHMET ALP BALKAN: Ronak, is there anything else you want to plug?
RONAK NATHANI: I will say we are actively hiring. So if you want to chat more about opportunities at LinkedIn or just want to join the discussion and talk about running Kubernetes at scale, everyone can find us on both Twitter, or X, and LinkedIn. And happy to chat more.
ABDEL SGHIOUAR: Yeah, find the LinkedIn folks on LinkedIn. Ronak, you have a podcast, "Software Misadventures," if I remember correctly?
RONAK NATHANI: "Software Misadventures," yes. That's the right one.
ABDEL SGHIOUAR: All right. We'll make sure to have a link for it in the show notes so people can go take a listen. Thank you very much, folks, for your time. I appreciate it, and have a good day.
RONAK NATHANI: Yeah. Thanks so much for having us.
AHMET ALP BALKAN: Bye.
[MUSIC PLAYING]
ABDEL SGHIOUAR: That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at KubernetesPod. Or reach us by email at <kubernetespodcast@google.com>. You can also check our website at kubernetespodcast.com, where you will find transcripts and show notes and links to subscribe.
Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
[MUSIC PLAYING]