#241 November 13, 2024
In this episode, we speak with Maciej Rozacki, Product Manager on GKE for AI Training, and Wojciech Tyczyński, Software Engineer on the GKE team at Google. We explore what it means for GKE to support 65k nodes, and the open source contributions that made this possible.
Do you have something cool to share? Some questions? Let us know:
KASLIN FIELDS: Hello, and welcome to the Kubernetes Podcast from Google. I'm your host, Kaslin Fields.
ABDEL SGHIOUAR: And I'm Abdel Sghiouar.
[MUSIC PLAYING]
KASLIN FIELDS: Today, we're talking with Maciej Rozacki and Wojciech Tyczynski about some exciting new updates in the world of scaling Kubernetes. But first, let's get to the news.
ABDEL SGHIOUAR: We are publishing live from KubeCon CloudNativeCon North America 2024. It's going to be a week filled with learning, networking, and cool technology in the cloud-native space. Stay tuned to the community and to our social media for updates from the event.
KASLIN FIELDS: Speaking of social media, the Kubernetes Podcast from Google would like to invite you to follow us on our new account on Bluesky. There will be a link in the show notes.
ABDEL SGHIOUAR: OpenTelemetry is expanding into the CI/CD space. With the release of version 1.27 of the OpenTelemetry semantic convention, which is the common spec for defining objects, operations, and data in OpenTelemetry, CI/CD attributes have been added. So things like pipelines, tasks, run IDs, et cetera, can be observed, and their execution status can be reported and monitored.
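As a rough illustration of what those conventions look like in practice, here is a minimal sketch in Go of a CI runner annotating a span with CI/CD attributes. The attribute keys shown (cicd.pipeline.name, cicd.pipeline.task.name, cicd.pipeline.run.id) reflect our reading of the 1.27 semantic conventions, so check the spec for the exact current keys; the pipeline and task names are made up.

```go
package ci

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// runPipelineTask starts a span for one CI task and tags it with CI/CD
// semantic-convention attributes so its execution can be observed and reported.
func runPipelineTask(ctx context.Context) {
	tracer := otel.Tracer("ci-runner")
	ctx, span := tracer.Start(ctx, "build-and-test")
	defer span.End()

	span.SetAttributes(
		attribute.String("cicd.pipeline.name", "deploy-service"), // hypothetical pipeline
		attribute.String("cicd.pipeline.task.name", "unit-tests"),
		attribute.String("cicd.pipeline.run.id", "run-4711"),
	)

	_ = ctx // the task itself would run here, propagating ctx
}
```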
KASLIN FIELDS: Gitpod announced they are moving away from Kubernetes. In a blog post, the company behind the cloud-based development environment cited the many technical challenges they faced running their workloads on top of Kubernetes. Check out the detailed article from the links in the show notes.
ABDEL SGHIOUAR: OpenCost is now a CNCF-incubated project. OpenCost is a vendor-neutral tool that provides visibility into Kubernetes costs across major cloud providers and on premise.
KASLIN FIELDS: And that's the news. Today, I'm speaking with Maciej, the product manager on GKE for AI training, and Wojciech, engineer on GKE. Would you each introduce yourselves, maybe Maciej first?
MACIEJ ROZACKI: Hi. My name is Maciej Rozacki. I'm a product manager in the Google Kubernetes Engine team. I'm responsible for shaping our roadmap of capabilities for supporting AI training and machine-learning training use cases on Kubernetes and GKE.
WOJCIECH TYCZYNSKI: I'm Wojciech Tyczynski. I'm one of the engineering leads in GKE. I'm also heavily involved in open-source Kubernetes. For example, I'm a TL of SIG Scalability.
KASLIN FIELDS: I am so excited to be speaking to you both today. We have a very interesting topic, and I'm very interested in it from open-source perspectives, as well, which we will get into. But we are going to be talking today about GKE announcing support for 65,000 node clusters, which is enormous.
I don't remember what the previous recommended limit from the open-source project was, but GKE has offered an industry-leading 15,000 nodes for some amount of time, so this is a massive increase from 15,000 to 65,000 stated supported nodes. So can you tell us a little bit more about what it means to say that GKE supports a 65,000-node cluster?
MACIEJ ROZACKI: What we've seen over the course of the last few years is that with the era of-- or, say, new generation of AI technology development, there is a clear demand from customers to start running at a much larger scale than before. We've seen in previous years that customers were interested in operating clusters of the scale of a couple of thousands of nodes for microservices workloads.
For more high-performance computing use cases, customers were going above 10,000 nodes. You can see, for example, a case study that we did together with PGS. I think they were even a guest on this podcast. But recently, we've seen that these needs for scale and sizes of the computing power in clusters have grown even further.
Today, the scale limits of Kubernetes are good enough, we think, for training and serving models of the sizes of 1 trillion parameters. And in some time, hard to say when, but we will see models 10 times bigger, maybe even larger.
And to meet the needs of customers, to be able to both train and serve these models, we need to innovate both in the sizes of clusters and in the capabilities of hardware that they run with. So what this means is that to operate at 65,000 scale, we've narrowed down the use case that we want to support with these clusters to building AI platforms.
And if we make some assumptions about what the customers will do with the cluster, combined with lots of innovations both in open source and in-house at Google, we were able to offer customers the ability to operate with 65,000 VM nodes' worth of computing power in just a single cluster.
KASLIN FIELDS: And I was also just thinking about the PGS episode. Fantastic call out if you haven't checked that out, very interesting use case there, a lot of stuff going on here with supercomputing. And regardless of what you may think of where AI is going, it's definitely an exciting time to be in infrastructure. So Wojciech, from an engineer's perspective, is there anything you would like to add?
WOJCIECH TYCZYNSKI: I would just add to what Maciej just mentioned that we were primarily focused on training and inference or AI/ML, the gen AI use cases. But we are also thinking about mixing those workloads, and this is something that we are already supporting with this announcement.
Many of the customers, or many of the users, consider splitting those two into separate clusters, but I think it's important to give them a possibility to actually mix those. And this is something that we are actually testing as part of evaluation and as part of ensuring that clusters of such scale actually work, too.
MACIEJ ROZACKI: I would maybe add to what Wojciech said that it's a very interesting domain, these machine-learning platforms and backends for artificial intelligence, because on one hand, what's quite unique, and also stems from the supercomputing patterns that were less adopted in cloud, is that the training workloads and the whole process of building the model involve quite a lot of tightly coupled workloads.
So you have jobs that are very sensitive to the physical characteristics of the data center in which they run. The proximity of machines-- how far each individual host running your pods and containers is from another container-- matters and affects cost efficiency, speed, and even the scalability of your workload.
And the same applies also to inference, so the largest models are very difficult to serve just from one host. Customers typically shard them into multiple workloads that run on a few VMs, and this creates a need for co-locating massive amounts of computing power in one physical location.
And then, as Wojciech mentioned, when customers want to both train and validate their models, if we think about where AI is, especially the leaders of this space, everybody is working on finding the right model, the best model. Folks are competing on the quality of models, how they are responding, various rankings online.
So what the users that we see need is the ability to very rapidly repurpose their hardware, which is very scarce at the moment. We have to say that there is a crunch of chips, along with the availability of electrical power to power the data centers. So customers are dealing with scarce resources, and we want to give them a tool that will easily allow them to readapt the infrastructure and resources that they have to various use cases.
So you may be training your workloads using 60% or 80% of the total computing power that you have available to you and use the remainder for some research workloads or running inference to validate your model, to get feedback from your users, from customers. And at the same time, if you see, for example, a significant success with one of your models or a surge of traffic associated with that model, then you can very quickly stop other workloads and easily move virtual machines to serve a different purpose.
And Kubernetes is just great for that. Unlike other systems that were built primarily with supercomputing in mind, Kubernetes was built with supercomputing, these research workloads, and microservices all in mind. And here we are enabling customers to run both in one environment and dynamically adapt to the need and the use case that they serve, within minutes even, as the needs of their business change.
WOJCIECH TYCZYNSKI: Yeah, I think that within minutes is super important here, because, especially during times of shortages and stockouts of capacity, it's often impossible to get new capacity within this time frame. And minutes-- or even seconds or tenths of seconds-- is what the customers often expect. And being able to have that capacity without provisioning the accelerators or provisioning the VMs themselves, which usually takes minutes, is an important factor in why those users choose to repurpose existing capacity instead of provisioning or reprovisioning some of the machines they were previously using.
KASLIN FIELDS: I think it's amazing that the pressures of the infrastructure world that Kubernetes exists in have changed so dramatically from the time that it was created to now, like you were saying, the hardware pressures, the energy pressures. The scale of these workloads is just so immense, and the core concept of distributed computing is just essential to how we actually technically make these workloads possible.
I really like that you called out scheduling there and how the workloads can be interconnected. You have to have them be across hardware just because you need so much hardware to do them, but they still need to be very tightly coupled and working together. Which one goes first? What pieces interact with each other? You have to consider all of these things. So there's a lot of work going on, I know, in open-source Kubernetes to enable these kinds of things.
And in order to enable a 65,000-node cluster that is focused on these types of workloads, supercomputing, AI, all of these types of topics, you must have done some really cool engineering work to solve some of the problems there. So could you call out some of the cool engineering and technical challenges that the team had to solve in order to make a 65,000-node AI-oriented Kubernetes cluster, GKE cluster, possible?
WOJCIECH TYCZYNSKI: Yes. We, indeed, had to solve a bunch of interesting problems, and in fact, we had been preparing for that for years. Even though, if we look back three or four years ago, we weren't thinking about supporting that scale, there were a bunch of investments that just take years and were preparing us for where we are now.
And I think probably one of the most interesting and most challenging things that we did is actually replacing etcd with our own specific storage. We call it Spanner-based storage because underneath, it's using Spanner, which is Google's technology that we use internally as the database solution for many Google products, not just in Cloud. We are actually in the middle of replacing the storage for all the existing GKE clusters with a Spanner-based multi-tenant solution.
The main goal here wasn't really the scale and increasing the scale. The main reasons for that were making our control plane stateless and making it more flexible. All the operations will be faster, and so on. But scalability was one of our design principles, so it just unblocked us here without any additional work that had to be done specifically for this effort.
MACIEJ ROZACKI: I can also extend on that a little bit. The investments in the control plane and performance are one aspect. A second one is probably the investments in the data plane and the ability to handle the network traffic. Another element is the various APIs around Kubernetes. As you may remember, we started the work around high-performance computing and batch workloads as very deliberate work with the CNCF, I think, three years ago. That's when the Batch Working Group was established.
This year, we also have the Serving Working Group joining the, let's say, portfolio of CNCF's working groups that look at Kubernetes APIs and see where they need to evolve to support this new era of workloads. There are lots of very cool examples of capabilities that enable us to tap into the possibilities that such scale offers.
As an interesting example, dynamic resource allocation: that is a whole domain of how you model this very advanced and sophisticated hardware and how you operate it. The scheduling domain is a very interesting one, because these AI workloads change a couple of paradigms in scheduling. In the past, a typical microservice came with a couple of assumptions that we were designing these systems around.
For example, a typical, let's say, replica of a microservice is rather small. It's definitely smaller than a single host on which it runs. So we've invested quite a lot in Kubernetes into capabilities like oversubscribing physical machines with many microservices that run on them and share them. We have pod bursting and various node-level and kubelet-level capabilities to manage that and combine cost efficiency and elasticity with the great service that these applications offer to users.
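To make that concrete, here is a minimal sketch, in Go with the Kubernetes API types, of the oversubscription pattern Maciej describes: a container whose resource requests are lower than its limits lands in the Burstable QoS class, so many such microservices can be packed onto one node and burst into spare capacity. The image name and resource values are only illustrative.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// burstableContainer requests less than it is allowed to use, which lets the
// scheduler pack many replicas per node while still permitting bursts.
func burstableContainer() corev1.Container {
	return corev1.Container{
		Name:  "web",
		Image: "example.com/web:1.0", // hypothetical image
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("100m"),
				corev1.ResourceMemory: resource.MustParse("128Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("500m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
		},
	}
}
```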
Now, in the AI space, we have jobs that take thousands of VMs, if not tens of thousands of VMs. And even for individual replicas of a model server, it's not uncommon that they run on more than one VM. So now you have deployments where every replica is actually a multi-host workload, which needs an API extension and, at the same time, has very interesting scalability and cost-efficiency dynamics.
So we have the advent of APIs like LeaderWorkerSet to enable these more complicated deployments, or JobSet as a job-level equivalent, where the job is heterogeneous and accounts for these network structures and topologies. So lots of cool stuff happening in Kubernetes, and lots of cool stuff in the domain of allowing us to effectively use this platform.
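For a sense of what a multi-host replica looks like with LeaderWorkerSet, here is a rough sketch using unstructured Kubernetes objects in Go. The field names and API version are our best understanding of the LeaderWorkerSet project and may differ between releases; the object name and image are hypothetical. The point is that a single "replica" is a group of pods, a leader plus workers, treated as one unit.

```go
package example

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// leaderWorkerSetSketch describes four serving replicas, each spanning
// eight hosts (one leader plus seven workers), scheduled and scaled as a group.
func leaderWorkerSetSketch() *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "leaderworkerset.x-k8s.io/v1",
		"kind":       "LeaderWorkerSet",
		"metadata":   map[string]interface{}{"name": "model-server"},
		"spec": map[string]interface{}{
			"replicas": int64(4),
			"leaderWorkerTemplate": map[string]interface{}{
				"size": int64(8), // pods per replica: 1 leader + 7 workers
				"workerTemplate": map[string]interface{}{
					"spec": map[string]interface{}{
						"containers": []interface{}{
							map[string]interface{}{
								"name":  "shard",
								"image": "example.com/model-server:latest", // hypothetical image
							},
						},
					},
				},
			},
		},
	}}
}
```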
Maybe I'll mention just one more, which is Kueue, the job-scheduling add-on that we've added. We also started this, I think, three years ago when we started the Batch Working Group, maybe a little bit later, as part of the work with the community there. We believe it is the best cloud-native job-scheduling extension in the Kubernetes ecosystem.
That then allows you to mix various workloads within these large AI platforms and juggle your resource allocation between jobs, and also Deployments and StatefulSets. It is capable of integrating with these serving types of workloads, too, so that you can balance capacity sharing between jobs and your serving workloads.
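As a minimal sketch of how a training Job can be handed to Kueue (assuming the kueue.x-k8s.io/queue-name label that the project documents, plus a made-up queue name and image): the Job is created suspended, and Kueue unsuspends it only once the whole gang of workers can be admitted against the queue's quota.

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// queuedTrainingJob builds a Job that Kueue will admit as one unit.
func queuedTrainingJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "train-llm",
			Labels: map[string]string{"kueue.x-k8s.io/queue-name": "ml-team-queue"},
		},
		Spec: batchv1.JobSpec{
			Suspend:     ptr.To(true),        // Kueue flips this when capacity is available
			Parallelism: ptr.To[int32](1024), // all workers start together
			Completions: ptr.To[int32](1024),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "trainer",
						Image: "example.com/trainer:latest", // hypothetical image
					}},
				},
			},
		},
	}
}
```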
WOJCIECH TYCZYNSKI: So let me just add one more thing, because I started with GKE, but we also did a lot of cool stuff in open source Kubernetes itself. One of the interesting features or enhancements that is not super directly visible to users but helps a lot with scalability is consistent list from cache, which allows us to serve list requests directly from the API server cache without contacting etcd, or, in our case, the Spanner-based solution. That helps a lot with reducing the load on the storage and helps a lot with scalability.
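To illustrate why that matters, here is a small Go sketch with client-go. A list with an empty resourceVersion asks for an up-to-date view, which historically forced the API server to read from the backing store; with consistent list from cache, as we understand it, the server can confirm its watch cache is fresh and answer from memory instead. Client code doesn't change; the improvement is server-side.

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPods shows the two flavors of list request the API server handles.
func listPods(ctx context.Context, cs kubernetes.Interface) error {
	// Consistent read: previously a round-trip to storage, now servable from
	// the API server's watch cache once the cache is verified to be fresh.
	if _, err := cs.CoreV1().Pods("default").List(ctx, metav1.ListOptions{ResourceVersion: ""}); err != nil {
		return err
	}

	// Explicitly cache-friendly read: any reasonably recent state is acceptable.
	_, err := cs.CoreV1().Pods("default").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
	return err
}
```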
But that's just one example. We did a bunch of improvements across not just core Kubernetes, but also other projects. We improved Konnectivity, which is very tightly coupled with Kubernetes but technically a separate thing, thanks to which you can actually dynamically add or remove API servers or control-plane replicas in your cluster without the need to restart all the others. So that's just one example.
Another thing: there are a lot of scalability-related improvements going directly into Cilium, for example, which is one of the options for the data plane and networking solutions that we also use in GKE. While a bunch of improvements were done internally in Google, there are also improvements going into upstream Cilium, done by our engineers.
Yes, there are a lot of things that we are giving back to the community. I would even say more: whenever we actually need to change, adjust, or enhance Kubernetes itself, we always do that in open source. We are not patching an internal fork of Kubernetes. All the improvements that we need in core Kubernetes we are doing upstream, so that everyone can actually benefit from them.
KASLIN FIELDS: Something that I have seen much more clearly since joining Google, and that I'm constantly surprised by, is how much the GKE engineers-- when they're working on a feature for GKE, they are doing stuff in open source, and stuff is just appearing in open source. And it may not be clear that that is what is backing these GKE features, but that's kind of the point of it. It's available in open source, and anyone can use these improvements, too.
And I think another theme that I want to call out here is there were a lot of new features that went into open source to enable 65,000 nodes on GKE that are very kind of individual improvements. It's hard to understand the whole, I think, sometimes, with these individual features that lead up to some big thing that you can now do.
The same thing is true in the world of stateful applications on GKE or on Kubernetes, open source in general. I've given some talks about how it's very difficult to understand the space of stateful applications on Kubernetes because a lot of the features that enable stateful applications are just features. They're not called out as stateful features. They are just features within Kubernetes.
And I think the same thing is kind of true here. There's a whole bunch of features in networking. There's a whole bunch of features in scheduling, and workload types, and all kind of throughout the project that are all enabling this together. But those ties may not be immediately clear if you were just looking at the new features in Kubernetes.
WOJCIECH TYCZYNSKI: Yeah. I would just add to this that there were a bunch of improvements that we were making in open source, and, in fact, many of them we didn't justify with increased cluster size. I mean, we were justifying them with increased scale, but not in the dimension of cluster size-- rather throughput of the system, or some other things-- because those enhancements or improvements don't just help with the size of the cluster. They can help many other users, even if they have much smaller clusters, along other dimensions of scalability. But they also make the system itself more reliable. They reduce cliffs.
They help with the stability of the system under high load, and so on. It's not that we only complicate the system for higher scale; we also make the lives of many other users that just use smaller clusters better.
MACIEJ ROZACKI: I can actually share a funny anecdote on this. We were discussing with Wojciech and a couple of our teammates how to wrap this launch of the support of 65,000-node clusters in the proper formal launch process. Like every large software provider, Google has a strict-- especially in Google Cloud, we have a strict launch process that we follow to make sure we correctly support enterprise customers and do all of the regression validation, all that stuff.
And from how this capability and feature looks, it's definitely a massive launch for us. But then when we tried to pinpoint it to a very specific technical milestone-- what did we change in the code base, or what is in production that constitutes this launch?-- we couldn't. It's a funny thing. It was just lots of micro-changes that indeed were oriented toward solving a variety of problems.
So we really believe that while this launch pushes the edges of the technology, it makes the lives of all Kubernetes users easier and better, with things like much more rapid control-plane scaling, which means that you can have ephemeral clusters that work much better. And you have fewer worries, like with control-plane warm-up effects on managed clouds or the performance characteristics for your speed of pod scheduling.
This is a dimension of performance that is not directly tied to scale. You may want to have very rapidly churning workloads on a very small cluster, too, and lots of improvements went into kube-scheduler, into the API server, and into the other components running in the control plane to make sure that we can actually support fantastic performance characteristics irrespective of the scale.
And then all of a sudden you look at the hundreds or thousands of changes everywhere-- and we ran tests, put it all together, cleaned up some of the changes when actually testing the very large case, and made the changes associated with doing a targeted launch of that scale. But it's not that there is one specific thing that was changed that really made this possible. It's four years of work by our engineers and the community to enable this.
WOJCIECH TYCZYNSKI: I would just add to that, or slightly clarify, that yes, indeed, there are tons of smaller changes across pretty much all of the product-- and not just the product, but also a ton of our dependencies. But there were also a bunch of large launches that we did that were actually going through that process and that we heavily depend on.
They were just happening across the past couple of years. But without those, we wouldn't be where we are now with scale either. So it's a combination of those big transformational projects plus a lot of glue code and smaller improvements to make all of those work together.
KASLIN FIELDS: I think all of this is a really strong indication of how robust Kubernetes is in its core concepts. Of course, Kubernetes has made it to being a decade old, and it's not slowing down. There's still so much going on with the project, and the core concepts that underlie it, enabling distributed systems, are still so relevant, arguably even more relevant, in the world today.
And so we see this continuing movement toward kind of the same goals that Kubernetes has always had, where we keep making all of the components that go into the distributed system that Kubernetes designed better, and that's enabling us to reach these higher scales as well. And it also means for the community that there's still lots of work going on, and it's important to celebrate that work and for people to know about how awesome it is, which we will be doing at KubeCon this week.
So this episode is coming out right as KubeCon is beginning, and we have lots of exciting things planned for KubeCon. Of course, it's a huge event for the community that builds Kubernetes, but it's also a huge event for the end users who use Kubernetes. And it's one of those wonderful moments where both of those things get to come together, and we get to see the interactions between those communities. So I'm excited for KubeCon. I know that we've got lots of stuff planned for Google. Is there anything you all would like to call out about KubeCon?
MACIEJ ROZACKI: There is lots of very exciting stuff happening at KubeCon. I'm personally very excited about the presentations that will happen on the AI day and co-located event. Our engineers, together with our customers and partners from the community, will be presenting a couple of very interesting things.
Also, the lineup of talks is very cool. The one that I'm really keen to hear is a presentation by an engineer from our team and engineers from Apple talking about how they use Kubernetes and Kueue to build a very sophisticated multi-tenant environment for researchers, where researchers can share resources and the capacity of preallocated hardware between them. But there is lots of other interesting stuff, too. I don't know, Wojciech, if you want to add anything on the main event.
KASLIN FIELDS: I also want to call out that KubeCon added a poster session, so I think we're probably seeing a lot more researchers at KubeCon these days. So I bet the audience of that session will be very interesting, and I'd love to talk to them. Go ahead, Wojciech.
WOJCIECH TYCZYNSKI: Thank you. So, yes, there is definitely a lot of interesting stuff. I think there are a bunch of talks by actual Kubernetes contributors, the special-interest groups, working groups, and so on, where you can talk to people who actually create this stuff and influence how they do it. I highly recommend those.
But I think the biggest value for me personally was actually talking to different people during the corridor discussions. And if you are specifically interested in the 65,000-node clusters, there will be a lot of people from Google. You can find them at the Google booth or in the corridors. And there were so many people involved in this work that if you just ask any of them, they will either be able to tell you something about it or easily redirect you to someone you can speak to.
KASLIN FIELDS: And like we said, there are so many changes that went into open source, so if you go to any of the open-source sessions, you might hear about some features that went into this.
MACIEJ ROZACKI: Yes, especially the SIG Scheduling presentations. I think there is going to be a maintainer-track Batch Working Group session from this domain of AI platforms. It's definitely going to be very interesting. The Serving Working Group will probably also have a presentation, which I'm less involved in.
But I would expect that team will also have very interesting content during their session. And as Wojciech mentioned, going to those sessions or to our booth and then just spending time with all of the presenters and attendees is the best way to make the most out of the KubeCon event.
KASLIN FIELDS: For all of you end users out there listening, I would like to issue a challenge. If you are attending KubeCon or if you check out the recordings later, I challenge you to check out at least one maintainer track session. See what these contributors are doing and how they talk about their work, and see if it's interesting to you and how it might relate.
Think creatively, because like we said, a lot of the work that goes into open source is changes that maybe look small, or look like they might not be related to what you're doing, but you might find out that they actually relate to the entire system of Kubernetes, and it all kind of bubbles up into things that you use. So I challenge you to check out a maintainer track session from this KubeCon and see what you learn from it. I would be very interested to know.
MACIEJ ROZACKI: I would add that maintainer tracks are really cool. They are very different. It's very interesting that they are a little bit neglected. They don't get the same sizes of audiences as many other sessions, while at the same time, those sessions are led by some of the foremost thought leaders in the community, who shape the direction of Kubernetes within various domains.
The format of these sessions, given that they are a little bit low-key and lower profile, means there's not much marketing in there. The sessions are about what we see in the industry, what patterns we see, how users are changing. Where do we see the IT industry in a year, in five years from now? And how do we need to evolve Kubernetes to be able to respond to these challenges? Or a specific aspect of Kubernetes, how do we need to evolve it?
So these are very exciting and very interesting sessions, and also a chance to build some direct relationships with the folks that present. These are very frequently introverted engineers for whom these are stressful presentations, but at the same time, these are just fantastic folks and sessions. Make sure you meet them and get their contacts, because these folks are really the ones that shape the direction. And those sessions give you an insight into where Kubernetes is going to be evolving in a particular space.
WOJCIECH TYCZYNSKI: And even more than the sessions themselves, just grabbing those people right after the presentation, throwing a topic at them, and having a brainstorm or whatever is something that I always really enjoyed. I got a lot out of that on both sides, actually, both as a maintainer and as a person trying to challenge some other maintainers.
Yes, I highly recommend those. And, actually, this is one of the best opportunities to influence the direction in which the project is going and what we, as a project, will be working on in the upcoming months or quarters.
KASLIN FIELDS: At a maintainer-track session, you know that you're talking directly to the engineers who are influencing those areas of the Kubernetes project, so if you have things you want to discuss, bring them up. This is what they do, and they would love to talk about it generally. And like we said, they do tend to be smaller sessions, so I highly encourage you to ask questions during these sessions. These engineers are trying to make these decisions and make this work happen, so your input could influence the direction of Kubernetes.
Actually, one last thing on maintainer track sessions that I wanted to mention-- also, if you happen to go to multiple maintainer-track sessions, I personally find it fascinating how you hear similar themes of influences and pressures on the project that are influencing the project in different ways in different areas. Fascinating. So if you can do more than one, you might see that, too.
MACIEJ ROZACKI: And I would also add that Kubernetes is actually in a very interesting place at the moment. Kubernetes is built with that premise of separation of concerns, that there are various components that are each responsible for doing one specific task well.
And at the same time, we see more and more requirements for capabilities that cut across those layers. If you think about scheduling, the concept of job scheduling, or Kueue, introduces the idea that you run a full workload or not at all, and then you have the Kubernetes scheduler that wants to decide where to run which pod. But you need to combine the placement of those pods with information about the network and where the VMs are.
You add an autoscaler to the mix, and all of a sudden, you end up with a very interesting situation where you have various components of Kubernetes optimizing certain behaviors, and at the same time, the user needs to have that coordinated, cutting across these components. And how to do that well, so that we move Kubernetes forward without breaking it, is a very interesting challenge that Kubernetes is facing, and you can hear that through all of these conversations across these working groups and SIGs.
And that's why you also see maintainer tracks that are use-case oriented. If you think about it, you have the SIG Scheduling session, which is very much tied to a component, as an example. But then you have these horizontal SIGs or working groups. Scalability is a horizontal that crosses various components, or Batch, which is a use-case-oriented working group.
And then people from Node, from Autoscaler, from Scheduler, and others meet in one place and try to figure out how to make all of these various distributed components work together well without breaking the nature of a distributed system that is extensible and pluggable. Very interesting challenges for Kubernetes. And I'm sure there are going to be lots of very interesting discussions in the corridors about how to figure it all out.
WOJCIECH TYCZYNSKI: Yeah. I would just emphasize the use-case-driven thing that you mentioned. Even in scalability, the goal is not to push the boundaries as far as possible. The goal is to meet the user requirements. We don't want to optimize for the sake of optimization; we just want to solve real user problems. Understanding those is critical to making good decisions.
And if you have a use case that Kubernetes is currently not addressing, please come talk to us, because we probably didn't hear about it, or maybe we heard about it and just haven't yet figured out how to do it. Or maybe we didn't prioritize it because we didn't think it was important enough. Ensuring that we, as project maintainers, understand the priorities and understand the use cases is the most important thing that you can help us with as a user.
KASLIN FIELDS: Please interact with the community and get involved. We would love to hear from you, especially you end users out there. We hope to see many of you at KubeCon. So let's wrap this up with, where can folks learn more about 65,000 nodes on GKE?
MACIEJ ROZACKI: We will be posting information on our cloud blog. You can also find updates in our documentation. And through the coming days and weeks, we will be releasing a variety of materials, so stay tuned. With demos, we will show how these clusters work and how you can use them.
And we will be sharing more insight into some of the technical capabilities that enable this technology so that we can give you a deeper dive. For example, there is the innovation that we're especially proud of, which Wojciech mentioned, where we will be using Spanner as the cluster state storage: what it means, really, for how we operate and manage Kubernetes control planes, and the variety of capabilities that this opens up, not only for scale, but also adaptability, flexibility, and various other characteristics.
If you actually want to run at this scale, it is a scale large enough that it takes a power plant to power such a cluster. So definitely reach out directly to us or to your account team so that we can work together on plugging in the necessary power supplies and hardware and helping you build such large infrastructure for your workloads.
KASLIN FIELDS: It's an exciting time in the infrastructure world, and we hope you all have a wonderful time at KubeCon. Maciej and Wojciech, I will see you on the show floor. Thank you so much for being on today.
WOJCIECH TYCZYNSKI: Thank you.
MACIEJ ROZACKI: Thank you. See you there.
ABDEL SGHIOUAR: Well, Kaslin, that's some very exciting news.
KASLIN FIELDS: Indeed. It's pretty cool to break this exciting announcement for everyone.
ABDEL SGHIOUAR: Yes, pretty good to premiere the news on the show. I guess we should probably open up by saying if you are listening to this and you got to this portion of the podcast, we are probably in the middle of the keynote at KubeCon.
KASLIN FIELDS: Yeah. If you're listening right after it was released, then yes.
ABDEL SGHIOUAR: Yes. If you are listening right after, yes. But yeah, the conversation was pretty cool. So GKE was already ahead of the market by supporting 15,000 nodes. But now we are going all the way up to 65,000, which is a huge leap.
KASLIN FIELDS: Yeah. It's a huge leap in terms of the availability of super high-scale clusters in a managed provider. But also, I really wanted to talk about the open-source side of it. A lot of the features that go into GKE are things that the engineers contribute back to the community, and there's a whole bunch of contributions back to the community that have happened with this. So I was really glad that I got to talk with Maciej and Wojciech about some of the work that they've done that is now available in open source, as well, so you can build bigger clusters wherever you may be.
ABDEL SGHIOUAR: Yeah. I mean, it's important to stress the fact that none of this would be possible without all the years of contribution and improvements into Kubernetes open source, and those are improvements that are going to benefit everybody else. But before we go there, there's actually one comment that Maciej mentioned at the beginning, which I found hilarious and interesting.
Maciej was saying, basically, we are looking at potentially large language models having trillions of parameters, and that's where we want to be in terms of making Kubernetes good for that. And I'm like, are we already-- I don't think we're already at trillions of parameters. I think we're still at billions for now.
KASLIN FIELDS: The whole idea of this really large cluster-- so background-- always starting with the background-- one of the first questions people always started to ask with containers and also, of course, with Kubernetes is, how big can it scale? So like scale for scale's sake is exciting.
But it's very interesting to me that a lot of this work is driven by the whole AI movement that's happening right now. These workloads are unique in a lot of ways, and we're seeing more features and more technology coming out to specifically address the needs of these workloads. And so this huge update to scalability of Kubernetes itself being driven by that is very interesting to me.
ABDEL SGHIOUAR: Yeah. No, definitely. I mean, I guess that even before AI, there was-- I mean, we had a couple of episodes where we talked to some people running large-scale clusters. And I think we didn't cover everything, but there were probably people doing HPC on top of Kubernetes. I know that CERN, for example, was doing quite a lot of that. So there were already people running at very high scale. It's a massive leap, driven probably by AI and helped by all of these micro-adjustments or micro-improvements, as Maciej mentioned in the show.
KASLIN FIELDS: Yeah. And I thought it was really interesting, when they went over the capabilities at the absolute tippity-top scale of these new GKE clusters, that they are very geared toward these AI-specific workloads.
ABDEL SGHIOUAR: They are scoped, yeah.
KASLIN FIELDS: There's a lot of-- yeah. A lot of the improvements that went into this are useful for all kinds of workloads, and the max cluster size for all sorts of workloads is going to be improved by all of this work. But at the very top levels, the work has been very focused and very scoped, like you said, around the needs of these AI workloads.
ABDEL SGHIOUAR: I think there's-- one interesting thing also related to this was that-- Maciej and Wojciech both mentioned one of the obvious benefits, which I never thought about before, is allowing people to do both training and serving on the same cluster instead of doing them on separate clusters. I never thought about it as being an actual problem that people care about, but I guess having one not impact the other negatively is something that people ran into in the past, so they had to separate them.
KASLIN FIELDS: Yeah. I do tend to lump the AI workloads together because of the way that AI, as a whole, is influencing the project. But one of the first things that I did when I started diving into this space was try to understand the differences between those workloads, and they are very different in terms of how you need to run them, and how they work, and what they need to do.
ABDEL SGHIOUAR: Yes.
KASLIN FIELDS: Inference and serving each have such unique characteristics. I hadn't thought either about them being on separate clusters or what that really means for how you would implement them in practice. But yeah, that is an interesting point that I also learned from Wojciech and Maciej.
ABDEL SGHIOUAR: Well, I think the first thing that jumps to mind is that training is usually a batch-type workload, so it's a lot of pods that have to spin up and then finish very quickly, while serving is more like you're running a workload for an extended period of time.
So the assumptions I have in my head are around, how do you make sure that one doesn't influence the other in terms of resource availability? And then also, they were talking about training jobs are sensitive to latency, sensitive to the host configuration, and all this stuff. So now, it makes a lot of sense after listening to the episode.
KASLIN FIELDS: Yeah. And I said inference and serving, but yeah, training--
ABDEL SGHIOUAR: Training and serving.
KASLIN FIELDS: Inference or serving, because inference and serving are used kind of interchangeably, though I personally have opinions about that.
ABDEL SGHIOUAR: Yes. Yes, me too. And so then the technical challenges: a lot of very small improvements, as was mentioned, mostly stuff that has been done in open source or upstream, even before AI was a thing that people cared about. Just stuff that Kubernetes was not very good at doing and that had to be fixed-- "fixed" in quotes, because it really depends on who defines what is good and what is bad.
The consistent list from cache, I had to go dig into that one, like serving list requests from the API server instead of serving them from the storage. That's something I never thought about. But in my day-to-day using Kubernetes, I do tend to use the list function very, very often.
KASLIN FIELDS: Yeah. I haven't looked into that feature much yet, so hearing about it from them was my first time hearing about it.
ABDEL SGHIOUAR: Yeah. This was also my first time. And then, yeah, all the work that the Batch Working Group has been doing for a while, because that's existed for a while as well, and then a bunch of other things that were mentioned. I found it interesting when Maciej was talking about how, when they had to pinpoint what was the thing that allowed them to get to this scalability, they couldn't, right?
KASLIN FIELDS: Yeah, you can't. I remember doing a video several years ago about this concept. And I don't remember the exact terminology that we used, but it's like envelopes of different limits that impact how large of a scale you can actually go to on Kubernetes.
The terminology that we used didn't make any sense until you kind of explained it more. But the way that scaling limits work in Kubernetes-- the way that I think about it, at least-- is you have all of these different limits, and your most strict one is pretty much the one that sets everything else in a lot of ways--
ABDEL SGHIOUAR: Yes.
KASLIN FIELDS: --though they interact with each other in weird ways so that there's no finite limit on exactly how many pods you can put on a node or exactly how many nodes that you can put in a cluster. It's all about how you set up the tools underneath so that it affects those limits.
ABDEL SGHIOUAR: Yes.
KASLIN FIELDS: And so depending on how you do that, you can achieve really huge scale.
ABDEL SGHIOUAR: Yeah. I think I know what you're talking about. If I remember correctly, it was a talk from 2019, which was the scalability envelope or the scalability limits. I don't remember what it was called, but there was a talk at one of the KubeCons talking about these scalability dimensions and the fact that scalability is a multi-dimensional problem. So it's not a single one. It's multiple things that you have to-- we'll find a link and add it to the show notes.
KASLIN FIELDS: Yeah. It is surprising to me how often that comes up in conversations with folks, and keeping that in mind is really important when you're talking to folks about their environments. If you're trying to learn about the scalability of different Kubernetes environments, you're going to hear different answers for what is keeping people from reaching higher scales in different situations. So it's important to keep in mind that all of those are valid.
ABDEL SGHIOUAR: Yes. And I think just one thing that also comes to mind while we're talking about this topic is something that I don't think a lot of people realize-- and I've seen this being reported as an issue very often-- is the API server is actually technically a choke point in Kubernetes, since it's a single point--
KASLIN FIELDS: It certainly can be.
ABDEL SGHIOUAR: Or it could be, right? Because it's what you talk to when you use kubectl, or whatever your CI/CD pipeline talks to whenever you are deploying or updating things. But also, all the components inside Kubernetes talk to the API server.
KASLIN FIELDS: Yep.
ABDEL SGHIOUAR: And so this comes up very often when people install too many operators in the cluster, and they all have to query the API server. And then it ends up being a bottleneck, right? So yeah, it's quite interesting. I mean, it's quite difficult to wrap your head around, but when you spend some time thinking about it, it all makes sense.
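For listeners who run into this, the usual mitigation is to stop polling and let controllers watch once through a shared informer, answering reads from a local cache. A minimal client-go sketch, where the namespace and resync interval are arbitrary:

```go
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// watchPodsLocally keeps one watch open against the API server and serves
// repeated reads from the informer's in-memory cache instead of new LIST calls.
func watchPodsLocally(cs kubernetes.Interface, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	podLister := factory.Core().V1().Pods().Lister()

	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served from the local cache; no API-server round-trip per call.
	pods, _ := podLister.Pods("default").List(labels.Everything())
	_ = pods
}
```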
KASLIN FIELDS: Yeah. That is one that-- it's not the most common one that I hear. The most common limit that I hear, of course, is IP exhaustion.
ABDEL SGHIOUAR: Yes, of course. Yeah. That's common.
KASLIN FIELDS: Yeah. IPv4. Don't we all want to move to IPv6? Isn't it the year of IPv6? It's not.
ABDEL SGHIOUAR: It's been the year of IPv6 since 2012.
KASLIN FIELDS: It's never going to be the year, it feels like. But really, the IPv4 exhaustion is a real issue that hits a lot of folks who run Kubernetes clusters, because you need to have IPs for all of the workloads that are running on the nodes. And then how do those workloads interact with other workloads? Do they have their own externally accessible IPs and load balancers?
It's a networking problem in the end, distributed computing, because you're just trying to hook a bunch of computers together. So naturally, the networking gets tricky. But yeah, the API server is kind of a hidden one. I feel like that when it comes up, you're like, really? But yeah, it handles all of the requests from within the cluster, as well as from outside of the cluster. So naturally, it can get overwhelmed.
ABDEL SGHIOUAR: Yeah. And then just to wrap up, I think that by the time people listen to this, either you're listening to it on the day it drops or later. And if it's later, there was a conversation about--
KASLIN FIELDS: Those are the options.
ABDEL SGHIOUAR: Yeah, of course. Actually, that's kind of like-- sometimes, I tend to state the obvious. But I think what I wanted to say is, if you do listen later, you will notice that in the show we talked about all the maintainer tracks, all the talks, basically, at KubeCon, which will be available on YouTube later.
KASLIN FIELDS: Yes. Absolutely check out those recordings if you can. Yeah, I challenged everyone in the interview to check those out. Good luck.
ABDEL SGHIOUAR: Yeah. And go listen to the maintainer tracks. I go very often when I'm at KubeCon. I find them to be some of the most interesting talks-- not that the other ones are not interesting, but I find them very interesting because they go very deep into the weeds of how things work.
KASLIN FIELDS: Yeah. It's at the core of the thing that the conference is about, the open-source projects that are at the core of the Cloud Native Computing Foundation, so they are foundational to the event.
ABDEL SGHIOUAR: That pun was not intended. All right. Cool. Well, thank you, Kaslin. That was pretty cool, and we'll report back from KubeCon.
KASLIN FIELDS: Yeah. We hope you all enjoyed learning about Kubernetes at scale, and we'll see you in our KubeCon episode.
ABDEL SGHIOUAR: Yes. All right. Cheers.
KASLIN FIELDS: That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes Pod or reach us by email at <kubernetespodcast@google.com>.
You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
[MUSIC PLAYING]