#255 July 9, 2025

HPC Workload Scheduling, with Ricardo Rocha

Hosts: Abdel Sghiouar, Mofi Rahman

Ricardo Rocha leads the Platform Infrastructure team at CERN with a strong focus on cloud native deployments and machine learning. He has led the internal effort to transition services and workloads to use cloud native technologies, as well as dissemination and training for several years. Ricardo got CERN to join the CNCF and is a member of the Technical Oversight Committee (TOC), currently chairs the End User Technical Advisory Board (TAB), as well as leading the Research User Group (RUG).

Do you have something cool to share? Some questions? Let us know:

News of the week

ABDEL SGHIOUAR: Hi, and welcome to the "Kubernetes Podcast" from Google. I'm your host, Abdel Sghiouar.

MOFI RAHMAN: And I'm Mofi Rahman.

[MUSIC PLAYING]

Ricardo leads the platform infrastructure team at CERN, with a strong focus on Cloud Native deployments and machine learning. He has led the internal effort to transition services and workloads to use Cloud Native technologies, as well as dissemination and training for several years.

Ricardo got CERN to join the CNCF and is a member of the Technical Oversight Committee, currently chairs the End User Technical Advisory Board, as well as leading the Research User Group. But first, let's get to the news.

[NEWS JINGLE]

ABDEL SGHIOUAR: Kubernetes introduced NFD, or Node Feature Discovery. NFD is an open source project that automatically detects and reports hardware and system features on the cluster nodes, helping users schedule workloads on nodes that meet specific requirements. This feature bridges the gap between the workload container image and the node OS, making it possible for applications to leverage drivers for GPU and network devices, libraries and software, and kernel features like VFIO.
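
For a concrete picture of how this is typically consumed, here is a minimal sketch of a pod that targets an NFD-discovered feature through a nodeSelector; the specific label shown depends on what NFD actually detects on your nodes, and the pod and image names are just placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: needs-avx512                  # hypothetical name
    spec:
      # NFD publishes node labels under the feature.node.kubernetes.io/ prefix,
      # so the scheduler only places this pod on nodes advertising the feature.
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
      containers:
      - name: app
        image: registry.example.com/app:latest    # placeholder image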

MOFI RAHMAN: Google announced the Gemini CLI, a command-line based AI agent, to interact with Gemini from your terminal. The tool can be used to query GitHub issues, codebases and pull requests, scaffold new apps, generate media, and more. And the cherry on top is that it's all open source and available on GitHub.

ABDEL SGHIOUAR: The CNCF announced that the Vietnamese version of the Cloud Native glossary is live. The effort to localize the glossary continues, and the addition of Vietnamese brings the number of supported languages to 15.

MOFI RAHMAN: The CNCF announced a new executive director. Jonathan Bryce joined as the new executive director, replacing Priyanka Sharma, who served in the role for the past five years. Jonathan Bryce brings 15 years of experience in the open source space, including Rackspace, OpenStack, and the OpenInfra Foundation. And that's the news.

[NEWS JINGLE]

Welcome to the show, Ricardo.

RICARDO ROCHA: Yeah, it's a pleasure. Thank you for the invitation.

MOFI RAHMAN: So to kick us off, instead of talking about tech things, I wanted to start the show by talking about one of your hobbies. Doing a bit of internet stalking, I looked into your profile, and it looks like you are into flying airplanes. So my question is, is there anything from learning how to fly a plane that you can bring into the world of Cloud Native?

RICARDO ROCHA: That's a pretty good question. So yeah, indeed, my main hobby and passion in life, apart from computing, is anything related to aviation-- so flying motor planes and gliders. I never thought about it in this way, but now that you ask, actually, there might be quite a lot of similarities between the two.

If you think about Kubernetes clusters, they behave pretty well as long as you prepare things in advance. With flying, it's a bit the same. If you're flying motor planes, you probably want to check the weather in the morning, prepare your flight path, your plan, where you're about to pass, the airspace, all these things. And things work better in Kubernetes as well if you do these steps in advance.

On the other hand, maybe my real passion, in addition to motor planes, is sailplanes and gliders. And there, it's more the other side of Kubernetes, I would say, which is the exciting part where you push a bit the boundaries and you look for turbulence constantly. And you end up, more often, getting into trouble than with a standard flight, I would say.

MOFI RAHMAN: So speaking of pushing boundaries, you work at CERN, which is scientific research. How did you come to the world of Kubernetes? In my mind, when I think of scientific research, it seems very rigorous, a bit more traditional, not necessarily the cutting edge of Cloud Native. So how did that connection happen?

RICARDO ROCHA: Actually, at CERN, we always had very large requirements for computing resources. Even before big data was called big data, we already had to deal with terabytes and petabytes of data. So we are constantly looking for the new technologies that will allow us to do more with a fixed budget. Because we don't sell anything, our budgets don't change when we produce more data. So we always have to find better ways to cope with the increasing requirements from the experiments, the physics experiments. And this is basically how we end up looking at everything.

And 10 years ago, looking at everything definitely meant starting to look into cloud native, making computing resource usage more efficient, automating more, and making ourselves more efficient. And I think the main drive has been that, if you look at the pre-cloud era, we ended up writing a lot of the tools ourselves because there was no open source or community offering this kind of tooling. And this is how I joined CERN, actually, as a software developer, building distributed computing tools.

But then, with the advent of the cloud and eventually Cloud Native, this massive community with very large organizations with similar needs started working together, which is kind of magical if you think about it. And we decided, OK, this is how we should be focusing for the near future and for the future, and just join the community instead of staying in our corner.

MOFI RAHMAN: So as a follow-up question to that, you mentioned having a fixed budget and not selling anything commercially from CERN. Other than that, how else would you say scientific research computing is fundamentally different from the products people are building in Cloud Native that are meant to be used by end users?

RICARDO ROCHA: Yeah, that's a very good question. So the original design of Kubernetes was really for the typical IT service, where you have an endpoint and you have requests, and you might have to scale the resources according to the number of requests coming. But it was mostly service oriented. Scientific computing is different in terms of how the workloads are managed. They're usually wrapped in jobs and then you have to be able to scale significantly in the number of jobs and the resources these jobs consume.

So you need concepts for advanced scheduling and scalability, a lot of the concepts that we now all manage with the recent changes in the computing ecosystem-- things like queues, quotas, priorities, preemption, all these things that have been in scientific computing infrastructures and supercomputers for many decades and that were missing in the original Kubernetes. Even if you think of the original job concept in Kubernetes, which has been there for a long while, it was very much designed around the notion of MapReduce workloads, which is not the traditional batch computing that we need for scientific computing.

The other part is also that, because of this demand on resources and efficiency, there's a big push for constant optimization. So things like optimizing node usage, pinning CPUs, or NUMA awareness, all these very low-level things, they were not a priority for the traditional service. You mostly wanted to scale up and down but not necessarily squeeze out the last small percentage of the resources.

MOFI RAHMAN: So at this point, you already mentioned the word, probably with a different spelling. I wanted to ask you about Kueue. It is a project that is part of the Kubernetes SIGs. How did you get involved, and what made you interested in learning more and using it?

RICARDO ROCHA: This has been something we looked at from the very first day. We started using Kubernetes, like everyone else, for our internal services on the campus and for a ton of things, but very much service oriented. But at the same time, from day one, we started thinking, OK, can we also use this stack to do better for our scientific computing workloads? And there were projects appearing even in the early days that were focusing on this. And this was mostly about having a batch scheduler. And a batch scheduler means, as we said, queues, quotas, priorities, preemption, gang scheduling, all these things.

There were projects in the beginning like Volcano, Kube-Batch even before that. For federated or multi-cluster deployments, there was something called Kubefed v1 and v2. And all of them did the job, but they were not on the core Kubernetes, which meant some of the integrations were not necessarily perfect. And also, you need to buy into different types of resources. These projects are still very popular and largely used, but I think there was a reason to come up with a core common component in the scheduler that even those projects can rely on.

And Kueue came from this idea, from multiple people. We were one of the advocates in groups like the Research User Group, and then other larger organizations with more development capacity bought in. And we started collaborating to make sure that, as an end user, we can provide the requirements, and also pushing the community to build momentum around this project. So this is how we started talking to other universities around this topic, and then Google and other organizations in the CNCF.

It's really out of our needs to simplify and make overall usage internally better. We saw the value of Kubernetes, so there was a lot of motivation to do the same for scientific computing.

MOFI RAHMAN: So at this point-- I think the latest version of Kueue is 0.11, if I'm not mistaken. Could be 0.12 by now. But these are still, I would say, early days for something like this. And Kubernetes itself is about 11 years old. So it feels like, in many ways-- as you mentioned, Kubernetes was initially for stateless web application type things. Is it a lack of scientific and research type voices in the early days of the Kubernetes world? Why do you think it took so long for the Kubernetes project to have a strong opinion about how jobs should have these features that research needs?

RICARDO ROCHA: I think that's it. The use case and motivation was not there. The use case existed, but traditionally, scientific computing workloads are done on very specific infrastructures, especially for High Performance Computing, HPC workloads, where we rely on very large supercomputers, or you have your own on-premises data centers with things like low-latency connectivity, InfiniBand, very specific scheduling requirements. And tools existed to do this. There were tools like HTCondor, or Slurm, which is very popular.

So the motivation was not there from the people managing those centers to transition to something new. The motivation appeared when people started realizing that by using or looking at something like Kubernetes as a kind of commodity these days, where everything integrates with it and all infrastructures expose an API to manage the resources via it, we could go beyond what we can do today with traditional HPC schedulers. And this is where the topic became more popular.

Now, still, there were not so many large organizations that would justify implementing this. This changed a lot. As big data became the norm, any kind of company or a startup will now talk about petabytes or even exabytes of data. And once you get to that, then you start looking at the things that are traditionally scientific computing workloads.

And then the last bit has been GenAI. This has really been the big transition. Once GenAI appeared and people started thinking that building on the existing stack we've worked on for the last 10 years is probably the best way to manage the AI workloads instead of building something completely new, then the investment really came. And you saw this growth in these projects and in this area at all levels. This has been really the transition. So I think it took long because the use cases were also being built at scale at the same time.

MOFI RAHMAN: So currently at CERN, is the workload mostly running on-prem hardware or is it a mixture of on-prem and cloud or mostly cloud?

RICARDO ROCHA: So CERN is mostly on-premises. And the reason for that is really cost management. If you have very large workloads, like we do, it is cost effective to build on-premises data centers. And we do that. But also, we have a history of managing data centers, so we know how to do it. And we also have a history of managing remote data centers from the sites that collaborate with us and things like the grid that we've built in the last 20 years.

The usage of external resources, in particular public clouds, is mostly for bursting capacity for peak workloads and for scarce or specialized resources. And this is especially important for things like GPUs, which are extremely expensive. You don't want to overprovision those.

And then, with the pace of updates and new cards and new heterogeneous kinds of hardware appearing, it's very hard to follow that when you have an on-premises data center and you're not offering it as a service. So we tend to use external resources for things like benchmarking, POCs, and even for scaling out our workloads.

MOFI RAHMAN: So you are using Kueue currently at CERN. Do you get a lot of benefit? Because Kueue works really well when you have this elastic workload, and you can do preemption and also set priorities. But for on-premises workloads, when you have a data center where you own all the hardware anyway, what benefits are you getting out of Kueue, with the preemption and the fair sharing?

RICARDO ROCHA: Yeah, that's an excellent question. The model is very different. In the public cloud, you want to minimize the resource usage for your workloads, so that you pay the absolute minimum. On-premises, because you already bought the hardware, you want to maximize overall usage.

So the principles are different. You don't look so much at cluster scaling or autoscaling in general, but you do have requirements to optimize overall efficiency and usage. You need to have tenants, and each tenant has nominal quotas. But you need to be able to borrow from different queues, from different tenants, in case those tenants are not filling up their nominal quotas. Because again, what you want is to maximize overall usage, not specific tenant usage.

And Kueue provides concepts like cohorts for borrowing, and things like fair sharing. Fair sharing is one of the key features that motivates us to use Kueue. And all these ideas of having priorities so that you can backfill currently available resources with other workloads, and then preempt them and replace them with higher-priority workloads when those come in.

All of this is extremely important for us, and these are really the key features of Kueue that are not necessarily that important in the case of public clouds. For example, fair sharing is important there, but it's actually a key feature if you're running an on-premises data center. Then there are other features that we use internally, things like gang scheduling or array jobs. These are also things that require a scheduler like what Kueue is offering, and that you cannot do with standard Kubernetes either.
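
As a rough illustration of the concepts Ricardo describes, here is a minimal sketch of a Kueue setup in which a tenant's ClusterQueue belongs to a cohort so it can lend and borrow unused quota, with preemption reclaiming lent capacity; all names and quota values are hypothetical, and it assumes a ResourceFlavor named default-flavor already exists:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: tenant-a                       # hypothetical tenant
    spec:
      cohort: shared-pool                  # queues in the same cohort can borrow from each other
      namespaceSelector: {}                # admit workloads from any namespace
      preemption:
        reclaimWithinCohort: Any           # take back lent capacity when tenant-a needs it
        withinClusterQueue: LowerPriority  # higher-priority workloads can preempt lower ones
      resourceGroups:
      - coveredResources: ["cpu", "memory"]
        flavors:
        - name: default-flavor
          resources:
          - name: cpu
            nominalQuota: 1000             # tenant-a's nominal share
          - name: memory
            nominalQuota: 4Ti
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: physics-jobs                   # namespaced queue users submit to
      namespace: tenant-a
    spec:
      clusterQueue: tenant-a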

MOFI RAHMAN: You did mention a little bit about existing schedulers, tools like YuniKorn, Volcano, Kubefed. They all had some ideas of how to do non-web-app type applications on Kubernetes. So can you speak a little bit more about Kueue being more Kubernetes native? How did that help you make the decision to choose Kueue versus something that is, in itself, a new scheduler on top of Kubernetes?

RICARDO ROCHA: That's a good point. And I think I can answer that with a bit of history. When we started looking into Kubernetes, I used to work on the development of what we call the grid computing infrastructure, where we had built a lot of software, a long time ago, that we would like to replace with something more sustainable. And so one of the first things I started looking at was, can Kubernetes be a replacement for this notion of grid computing sites and managing jobs?

So I started looking at jobs in Kubernetes and how I could submit a lot of them and monitor them and all these things. And Kubefed, at the time, was around-- Kubefed v1. And Kubefed v1 had one big advantage, which is that it was the only one using normal resources from Kubernetes. So I could just take an existing workload, configure Kubefed with multiple sites on the back, or multiple clusters on the back, and everything would work.

Now, the job concept was not good enough, because there were missing abstractions in the job at the time. So people came up with Kubefed v2, but Kubefed v2 actually went too far. It created its own custom resources. So then suddenly, none of the tools in the ecosystem were compatible. You would actually have to change existing Helm charts and all of this to make use of Kubefed v2. So it wasn't a big success, I think, in large part because of this.

And I think the same goes for the rest of the tools. The motivation for Kueue is that it really is designed inside the Kubernetes project. So every design decision is reviewed by people that are contributing directly to Kubernetes and reviewed by the other special interest groups. If you're designing a policy for managing the scale-out of jobs and autoscaling, the people working on autoscaling and cluster scaling will have to approve that and ensure that whatever the decision is, it fits well into the cluster scaling policies and the ways of working.

And the same for the management of low-level devices and optimizations on the nodes-- you will have SIG Node looking into it. So all of this makes Kueue a very good solution for the ideal integration with the rest of the Kubernetes core. That doesn't mean that it replaces the other projects.

It means that, probably, it takes a lot of what the other projects had to implement themselves into a common core. And this is where I see the value of Kueue, is that it really gives us all the functionality we need for the batch computing workload, HPC workloads, integrated into the rest of Kubernetes.

MOFI RAHMAN: Yeah, I guess it comes back to the point of going fast versus going far, right? All the other projects, initially when nothing existed, they paved the path, showed the use case, and in many ways showed people that Kubernetes is a place you can run HPC and batch-type workloads. And then, in some ways, it actually motivated the maintainers of Kubernetes to see, OK, we need to make Kubernetes natively better. In some ways, there was a huge benefit from having those projects proving the POC of, yes, it works.

And these projects still exist. I actually spoke to a few of the maintainers of Volcano and YuniKorn at one of the past KubeCons just to hear how those projects are doing. I feel like Cloud Native is so big, there is room for pretty much any type of use case there.

RICARDO ROCHA: But this is pretty much the point, is that even if we put a lot of the common things in the core, there are always things that will not make it into the core, because there are not enough use cases to support integrating them. But it doesn't mean that those use cases are not valid. It just means that they have to live outside the core of the ecosystem, which justifies the continuation of these projects.

The CNCF does a pretty good job with the idea of sandbox, incubation, graduation, and supporting the projects along this maturity path. And we learn as we go-- there's a lot of experimentation, and then there's some consolidation happening. And this is what we are seeing in this area.

MOFI RAHMAN: Yeah, I think, again, more and more functionality in Kubernetes is being added out of tree instead of in tree. Things like the Gateway API are not actually part of kubernetes/kubernetes. It's somewhere else. So I think the whole picture you have painted so far about scientific computing, Kueue, batch, and all this stuff is fantastic.

I love to see the work. And I think last year, you also accepted an award, an end user award. So congratulations, if I have not already said it at one of the KubeCons where we ran into each other. I think the last thing I was going to ask is a bit future looking. So you are putting on a speculation hat. You get to speculate as much as you want at this point. Where do you see the future of batch workloads, and potentially Kueue, going in the next, let's say, five years? In 2030, we record the next episode of the "Kubernetes Podcast" with you. What kind of things will we be talking about?

RICARDO ROCHA: OK, I will take the risk, but I will start slowly, and then I'll go for the more out-there ideas. But I think if we look at Kueue, I think the big developments will be on things like MultiKueue and better support for this multi-cluster, even multi-domain, multi-region, multi-cloud management.

And especially if the trend for this high demand of high-end GPUs continues, it will be essential that we get this optimized so that we optimize costs across multiple deployments. But also, this is something that I talk about from time to time, which is, with this kind of high-end GPUs, the cloud is no longer what it used to be. It doesn't feel on demand anymore. It feels more like on-premises, because you're making these very long reservations to get any kind of discount you can, which means you're basically precommitting resources for a year, two years, three years, which is not that far from buying stuff to put on premises and then having them there and having to manage them efficiently.

So I think Kueue and MultiKueue have a really good opportunity here, exploring the notion of provisioning requests and future reservations. All these things are really essential to manage this kind of very high demand type of resources. That's on, I would say, the easier or less speculative side.

I think the other one is-- and this is a very big point-- we see, especially in the AI world, a trend to build very dense compute resources. We had a trend with the clouds of having commodity hardware and a lot of nodes.

And we see this going back a bit to having fewer nodes and much more density, with very low-latency interconnects and very tightly coupled resources. And this kind of feels like going back a bit to the mainframe era, where you have these beasts in your data centers, and you end up giving users timeshares instead of full nodes or even GPUs.

I think this is a challenge in all respects. The first one is allocating resources to users because again, it will be very similar to what used to be done with timeshares and mainframes.

The other one is, for anyone that is not a hyperscaler, the data centers are not designed to accommodate this kind of thing, the density of power, the needs for cooling. We suffer from this today. There are servers you can buy with very high-density GPUs that we cannot fit in a rack-- a full rack that we own cannot hold a single one of these servers, because the power draw is too much.

So I think a lot of it in the next couple of years will be learning again how to use Kubernetes to manage different kinds of resources. Again, it's the story repeating itself and probably learning how to partition very dense resources in a way that people can share them. Yeah. So I think it's very interesting but extremely challenging.

MOFI RAHMAN: Yeah, absolutely. I think the work that is happening in the DRA space, giving this specialized hardware an API almost similar to something like storage, bringing it down to that level of understanding and usability, is going to be very key in making this happen. In any case, I'm super excited about whatever the future brings in terms of compute.

The last question, I guess-- maybe not the last question, because I've been enjoying talking to you so much. Have you had any conversations, with folks or yourself, about trying to run-- you mentioned something like Slurm that people used to use in data centers. But there has been some work happening over the last few years to make Slurm work on Kubernetes. Thoughts?

RICARDO ROCHA: I have a lot of thoughts. Actually, this is one of the most popular topics of discussion in the research and user group. And for those listening, the Technical Oversight Committee in the CNCF just had some restructuring. We called it TAG Reboot. And there's this notion of initiatives, where anyone can come forward with an initiative. There will be one about Cloud Native HPC, which is focusing on exactly that, doing a kind of survey of the options available in the ecosystem for this.

If you ask me, I think it will be very hard to transition to Kubernetes-managed HPC supercomputers, the traditional ones that I know, the very large ones in the TOP500, because there's a lot of history and integrations with tools like Slurm. So I think the best option that we have, and that is being followed by several people in several projects, is to still use Kubernetes for managing the workloads but be able to submit to Slurm endpoints behind it. This is the bridge between Kubernetes and traditional HPC scientific computing.

I think there are several motivations for that. The two that I think are the main ones-- one is that you don't have to convince the sysadmins of these supercomputers to move to anything else. And the second one is that, for any kind of modern workload, machine learning and AI, the frameworks and tools that exist all integrate with Kubernetes. And they know how to manage their distributed trainings and similar workloads with a Kubernetes backend. They do not integrate with things like Slurm very easily. So there is a lot of motivation to just rely on Kubernetes as the common API for all of this.

If you're interested, there are projects like Interlink, which just became a sandbox project, and Supernetes. There's the Slurm bridge-- or Slinky-- from SchedMD. There are plenty of things popping up.

MOFI RAHMAN: Yeah. So in that world, are you saying that you would submit your jobs through the Slurm CLI and that would pop up in Kubernetes or the other way around?

RICARDO ROCHA: So I think both are possible. If you want to support Slurm users, but your infrastructure is based on Kubernetes, that's the option. I think the more interesting one for me is the opposite: to submit and manage your workloads as Kubernetes workloads but still make use of the infrastructures that exist and expose Slurm APIs.

The reason for that is there are very large supercomputers in Europe and the US with a lot of GPUs, and I would love to have an easier way to get access to them. Right now, the easier way, at least from my point of view, is just to expose and integrate with Kubernetes APIs.

MOFI RAHMAN: Lovely. Yeah. I think we also have a project similar to this called XPK, where you can use very Slurm-like commands through a CLI tool that can create your Kubernetes cluster and run the jobs as a single CLI command. We'll link all of that, and the links you mentioned-- I'll get them from you-- for our listeners so that they can also take a look.

RICARDO ROCHA: Anyone listening, if you're interested, watch the space under the Technical Oversight Committee. There will be initiatives coming up in this area. So just join and participate.

MOFI RAHMAN: I'm super excited to learn more. Hopefully, I'll run into you in person in one of the KubeCons that happens. But before we finish this off, any final thoughts, anything people should know about?

RICARDO ROCHA: I will finish as I often finish. For everyone involved in this community-- it doesn't matter if you're a maintainer, or a supporter, or an end user making sure these projects are successful and helping everyone, or just giving feedback-- I think it's quite important that everyone realizes that we've built a community that goes way beyond individual organizations. You mentioned the awards I got.

The reason that we are so involved is because all this community, all this software that we are supporting in different projects is completely changing the way we do scientific computing, and for the better. We can do a lot more now than we could 10 years ago, thanks to the efforts of the whole community.

Of course, we thank very much the big organizations that keep the lights on and the projects going and the releases coming, but also everyone-- every other organization that helps keep the groups together, keep podcasts going, of course, and all the rest that is required to keep the community healthy. I always stress this. You're making a huge difference for science and scientific research.

MOFI RAHMAN: Yeah. And also to anybody listening that happens to be not in a position to directly contribute to Kubernetes, but if you are using Kubernetes in a meaningful way, being an end user, giving the project feedback about how you're using it, finding interesting ways to-- a lot of the things that exist in Kubernetes now that didn't exist in the early days happened because we found end users that came up and said, this use case does not cover this thing we're trying to do. Can we do something?

From that came KEPs, new ideas, new PRs. And we probably even found new maintainers and contributors to projects because they were using something, something was not working the way they wanted, and they added it. There's no better time to start-- the best time to start was yesterday. The next best time to start is today.

RICARDO ROCHA: Absolutely.

MOFI RAHMAN: That's how I'll end it. Ricardo, thank you so much for spending the time. And hopefully you still have some sunlight left in your day, so enjoy the rest of the day. Thank you so much.

RICARDO ROCHA: Yeah, thank you. Thank you.

[MUSIC PLAYING]

ABDEL SGHIOUAR: Thank you, Mofi, for recording that episode. I know that you've been trying to get your hands on Ricardo for a while, and it has been challenging.

MOFI RAHMAN: Yeah. I mean, again, we have a very short sliver of time during the day. He is on Europe time. I'm on US time, New York time. But I really wanted to chat with Ricardo. I have a lot of interest in the Kueue project, and so I wanted to chat with him about batch. And yeah, I'm glad we finally managed to make it work and got the interview.

ABDEL SGHIOUAR: Yeah. No, it was a good interview because you touched on a lot of things. But before we get into it, I'd like to start-- he's into flying. I didn't know that. You managed to find that information.

MOFI RAHMAN: Yeah. I mean, it did not take that much digging. I was trying to get his bio for the episode, and it was on his personal website. That was the first thing. So I was like, yeah, we should talk about it. Oftentimes, in cloud native or in technology, we talk to folks and get too deep into their technical background. I thought it would be fun to start the conversation off with something that is not related to tech. But again, I think I managed to bring in the question that tied it together anyway.

ABDEL SGHIOUAR: Yeah, yeah. He said basically that planning a flight is basically like planning a Kubernetes installation. The more you do upfront, the easier it is basically, right?

MOFI RAHMAN: Again, I think in any type of thing you do, getting a good plan in place is really powerful. The old adage of measure twice, cut once that is used in carpentry-- it's the same idea, right? The more you plan upfront, the more things you get done. And the more you know what is to come, the fewer surprises you have. It's not that you're going to get it perfect every time, but it's almost like you get to make new mistakes, not the same mistakes again.

ABDEL SGHIOUAR: Yeah. And so I don't know if this was a mistake or not, but one of the first things discussed was Kubefed, which seems to be a federation multi-cluster tool that-- looking into it, version 1 doesn't exist anymore. It's actually archived under the Kubernetes projects, but I don't know if-- I think version 2 is now called KubeAdmiral or something.

MOFI RAHMAN: Yeah. I think the conversation kind of went there, as we're talking about Ricardo's experience and the world of batch workload in Kubernetes. Again, Kubernetes is about 11 years old this year. It was 10 last year. And Ricardo has been in this space looking at it since Kubernetes was, let's say, like two or three years old, right?

So at that time, there were-- I mean, there are many ways to solve this problem of I need to run this ephemeral job in this platform. How do I go about doing this? And many people, rightly so, were thinking about, OK, I can run stateless web application pretty well. How do I do jobs?

So there are a number of different solutions, actually. And some of them still exist. Some of them have changed. Kubefed is now called KubeAdmiral as of 2021. And it still exists to some capacity. So in Kubernetes, when you want more resources, you could either have bigger pods, more pods, or bigger nodes, more nodes. Those are kind of the four dimensions of scaling in Kubernetes.

But there is an upper limit. Open-source Kubernetes has an upper limit of 5,000 nodes. But there are jobs that you need to run for the training workloads for large language models that go way beyond 5,000 VMs. You need way more machines to do that.

At that point, you have a few options. One is the architecture of Kubefed, or this multi-cluster is that you have a controller cluster that knows about a bunch of other clusters and sends the job that way. Or it could do something where instead of like owning or controlling the cluster, you just control some mechanism to send jobs or send resources, use resources in other clusters.

So there are many ways to handle this. Recently, we announced the multi-cluster orchestrator. That is a similar concept. You have a centralized hub, and that can send things to other things. But at the same time, in GKE, we have been working on making clusters bigger. I was part of a demo-- Maciek and I did it at KubeCon-- showing off 65,000 nodes on a single cluster.

So 65,000 nodes on a single cluster is equivalent to 13 5,000-node clusters. So if you had 10 5,000-node clusters before, now you potentially can do that in a single cluster. But eventually, we will find a world where 65,000 node is not enough. Then you will need like 10 65,000-node clusters or one 650,000 node cluster.

So we're going to have to keep one-upping each other if the scale of training jobs stays on the same trajectory. So far, making models bigger has always made them better. So as long as that trend continues for large language models, we're going to continue to see jobs and training get bigger and bigger.

And there have been other approaches too. There have been some rewrites of the Kubernetes scheduler itself, with Volcano or YuniKorn, which are batch systems that run on top of Kubernetes but replace the Kubernetes scheduler itself. There has been work to run Slurm on Kubernetes. There has been work to run Ray on Kubernetes-- we gave a talk at KubeCon last year about KubeRay.

So it's not like, oh, everybody just sat together in a room and decided on one solution. People kept trying different things. And we learned from each approach what is good and what is bad about it. And Kueue kind of came from a lot of this conversation.

And talking to Ricardo was almost like getting a history lesson across the whole spectrum, because he has been through all of it from the beginning and has seen the evolution. So it was, for me, a really good experience to talk to him about that journey, to see why this is the way it is and what sets Kueue apart from the other solutions. It's not that Kueue is inherently better than anything else. It's just that Kueue had the time to take in the learnings from all the other solutions that came before.

ABDEL SGHIOUAR: So can you tell the audience one of the key things that Kueue does differently? Because one of the things you mentioned during the interview is native objects, which is unique.

MOFI RAHMAN: So pretty much every other scheduler on Kubernetes that does job-related things would basically create a new concept of a job. You basically need to have something that understands what a job is-- it's basically a new CRD. That means you are replacing what Kubernetes gives out of the box, which is great. You can have a lot of control. You can put in as many hooks and flags as you want.

But now, you are out of tree from what Kubernetes gives you. So as Kubernetes gets better, as the job objects and the pod objects get better, you either have to implement that yourself or you have to live without it. Kueue, on the other hand-- in Kubernetes 1.25, I believe, there was a flag introduced on the Job object called suspend. If the suspend flag is set to true, Kubernetes knows that Job's pods should not be created and scheduled. Basically, that's all it is.

So what the Kueue controller does is look at your resources, look at how much capacity you have, and either set the suspend flag to true or false. That's all Kueue does, fundamentally. So that means that while Kueue keeps a Job's suspend flag set to true, that Job's pods don't run.

And again, it's a very simple idea, but what it means is that, since we're not actually reinventing any of the pod semantics in Kubernetes, as the Job and the Pod get better with newer Kubernetes releases, Kueue takes advantage of all of that, because Kueue is not rewriting anything. It's just one flag, literally one flag on the Kubernetes Job object that it touches.
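
As a rough sketch of what that looks like in practice-- the queue name, job name, and image here are hypothetical-- a workload submitted to Kueue is just a regular batch/v1 Job created in the suspended state, pointed at a LocalQueue through a label, and Kueue flips the suspend flag once there is quota to admit it:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: sample-training-job                      # hypothetical job
      labels:
        kueue.x-k8s.io/queue-name: physics-jobs      # LocalQueue to admit through
    spec:
      suspend: true              # created suspended; Kueue sets this to false on admission
      parallelism: 4
      completions: 4
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: registry.example.com/trainer:latest   # placeholder image
            resources:
              requests:
                cpu: "8"
                memory: 16Gi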

So the learning was, if you go out of tree of the Kubernetes semantics, you have to do way more work to catch up with the Kubernetes Job and Pod, which keep getting better over time. If you're on the outside of that line, you either have to do more work, or, if the project falls out of maintenance, all of a sudden people are losing out on the good features that got introduced-- performance improvements, quality of life, more metrics, logs, everything. You have to do it all manually yourself again.

So initially, it's easy to move fast with it, because you are doing your own thing. You don't have to wait for Kubernetes to implement things. But later down the line, you are slowed down, because you have to implement all the things that are in the main tree yourself again.

ABDEL SGHIOUAR: Yeah. Yeah. And you see this philosophy of doing things replicated across multiple other tools. MCO, the multi-cluster orchestrator, arguably does the same thing. It doesn't really touch your pods. It doesn't schedule. It doesn't do anything. It just provides you with a recommendation for where to place the workload, right?

MOFI RAHMAN: Yeah.

ABDEL SGHIOUAR: I understand that philosophy of let's let Kubernetes do what it's better at doing and only provide extra stuff to handle edge use cases in a way.

MOFI RAHMAN: Yeah. So I think that has been the biggest, I think, learning and changes in the mindset of folks building on Kubernetes over the few years, is that instead of fighting the system, you're basically using the system to get what you want to get done. Use as much of native Kubernetes as you can and then extend it, rather than trying to have a parallel execution on the side.

ABDEL SGHIOUAR: Yeah. Cool. Well, thank you, Mofi, for that interview.

MOFI RAHMAN: Yeah, I really enjoyed it. I hope people also enjoyed it. And Ricardo also shared a few links that we will put in the show notes. And I think there are some fun learnings there. That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes Pod or reach us by email at kubernetespodcast@google.com.

You can also check out the website at kubernetespodcast.com, where you will find transcripts and show notes and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.

[MUSIC PLAYING]