#253 May 28, 2025

Multi-Cluster Orchestrator, with Nick Eberts and Jon Li

Hosts: Abdel Sghiouar, Kaslin Fields

Nick Eberts is a Product Manager at Google working on Fleets and Multi-Cluster and Jon Li is a Software Engineer at Google working on AI Inference on Kubernetes. We discussed the newly announced Multi-Cluster Orchestrator (MCO) and the challenges of running multiple clusters.

Do you have something cool to share? Some questions? Let us know:

News of the week

KASLIN FIELDS: Hello, and welcome to the "Kubernetes Podcast" from Google. I'm your host, Kaslin Fields.

MOFI RAHMAN: And I am Mofi Rahman.

[MUSIC PLAYING]

KASLIN FIELDS: At Google Cloud Next, Abdel sat down with Nick Eberts, a product manager at Google, and John Li, a software engineer at Google, to talk about Multi-Cluster Orchestrator, or MCO, a new open-source tool announced at KubeCon EU 2025. MCO addresses the need for managing workloads across multiple Kubernetes clusters, especially with expensive, accelerated hardware like GPUs, which require efficient scaling. But first, let's get to the news.

[MUSICAL CHIMES]

MOFI RAHMAN: etcd has released version 3.6.0. This is the first minor version release in about four years, since June 2021. The new release includes some exciting updates, like downgrade support and some significant performance improvements. Check out the blog on etcd.io for more details.

KASLIN FIELDS: Kubernetes 1.33 is now available in the Rapid channel on GKE. This update comes just two weeks after the open-source 1.33 release.

MOFI RAHMAN: Kyverno 1.14.0 was released, marking a significant milestone in the journey to making policy management in Kubernetes more modular, streamlined, and powerful. This release introduces two new policy types, ValidatingPolicy and ImageValidatingPolicy. Kyverno 1.14.0 begins a new chapter for Kyverno with the introduction of specialized policy types that separate concerns, reduce the confusion of validation checks being written in various patterns, and provide a more focused approach to functionality.

KASLIN FIELDS: And that's the news.

[MUSICAL CHIMES]

ABDEL SGHIOUAR: Hello, everyone, and welcome to a new episode of the "Kubernetes Podcast." I'm your host, Abdel, and we are here live from Google Cloud Next 2025. I am here with John and Nick. Hi, guys.

NICK EBERTS: Howdy.

ABDEL SGHIOUAR: We're going to be talking about Multi-Cluster Orchestrator, which is something we actually announced at KubeCon London last week. I don't know when this episode is going to be out, but last week, whatever that last week is. But before we get there, let's start with some introductions. Why don't we start with you, Nick? Who are you? What do you do?

NICK EBERTS: My name is Nick Eberts. I am a product manager at Google, working on GKE and all of the Multi-Cluster tooling that we build, mainly fleets and, of course, Multi-Cluster Orchestrator.

ABDEL SGHIOUAR: Awesome. And John?

JOHN LI: Yeah, I'm John. I'm a software engineer, also working on GKE, closely with Nick. In the past year or so, I've been focusing on inference and a lot of the GenAI workloads.

ABDEL SGHIOUAR: Like everybody basically these days, awesome. So let's start with something basic, Multi-Cluster Orchestrator. We announced that last week. There is a blog out. There is a GitHub repo somewhere. Why don't you explain to us what is MCO?

NICK EBERTS: Yeah, sure. Actually, before we explain what MCO is, I'm going to have John describe the problem. So John and I, we got together about a year ago to try and solve this particular problem. I'm going to let you go through it.

ABDEL SGHIOUAR: Yeah, let's start with that.

JOHN LI: In the past two or three years, we've seen the cloud evolve. When we first built Kubernetes, it was built with the assumption that the cloud is infinite and the cloud is uniform. You have an infinite amount of capacity, or CPUs, and they're all about the same everywhere. That's the assumption we made back 10 years ago, when Kubernetes was started. Then accelerators started to enter the picture, right?

It's not quite infinite. We've got stockouts. Sometimes, our customers can't get capacity. It's also not uniform. There are different generations of GPUs and TPUs, right, and that calls for a different solution. When customers have a stockout in one region, it's natural to expand out to different regions close by. And for a lot of the inference workloads, your latency requirement is not that strict, so going a little bit further is actually OK. So that kind of calls for a multi-region, multi-cluster solution.

ABDEL SGHIOUAR: Got it. Got it. And before we start talking about MCO, Nick, I guess that this problem of managing multiple clusters is not even new to the inference world. It has existed for a while. And it has to do with things like I need to run my app closer to where my customers are. That would be one of them. So it's not only just like stockouts and running out of availability of hardware.

But I want you, Nick, to try to answer this question. Where are we right now? Where do you think we are on the question of one large cluster or multiple small clusters because that kind of blends into this conversation, right?

NICK EBERTS: Yeah, at Google or even upstream in Kubernetes, we're making pretty massive clusters these days. We have 65k nodes. I know you did a podcast about all that.

ABDEL SGHIOUAR: Yeah.

NICK EBERTS: But it's not-- just because you can, I think, doesn't mean you necessarily should.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: Even though your cluster can be massive, there's still the blast radius of the control plane on any particular cluster.

ABDEL SGHIOUAR: Sure.

NICK EBERTS: The idea here is you don't want too many small clusters. And I don't think you necessarily want one massive cluster, but you certainly want some kind of middle ground in between, in which maybe you have a cluster that represents a shape of applications that bin pack nicely together per region. And so the goal of the products that I build, and what I'm trying to push upstream, is this ability to think about how you want to bin pack applications together onto the same shapes of clusters-- clusters are fungible-- and not really have to think too much, actually, about a physical cluster.

Here's the set of configuration that represents these apps. And I want to make sure that it runs highly available. And I have customers in maybe these three regions and just make sure that it's there for them to answer lower latency requests.

ABDEL SGHIOUAR: Got it. Got it. And we're going to talk a little bit about details of how MCO works because I had to go do some reading to prepare for this episode. But then let's get into it. What is MCO?

NICK EBERTS: Yeah, I'm going to take a second here just to describe what we have without MCO--

ABDEL SGHIOUAR: Sure.

NICK EBERTS: --to maybe lead into what MCO provides. So if you were going to build a Multi-Cluster inferencing engine, you could just build that. Literally, you could deploy n number of clusters across n number of regions. You can host that inferencing app in those regions. And, by the way, inferencing is just the serving app.

ABDEL SGHIOUAR: I know that there are some nuances there.

NICK EBERTS: As it relates to networking and workload placement, I think that you could think of them somewhat the same. But the point is that you can have these clusters set up across multiple regions. Then you just need to make sure you have a load balancer that can reach them. And you want to make sure that load balancer can route traffic based on preferences so you can ensure that, if you want most of your requests to go to a certain region, there's a preference. You can send traffic there.

So you could do all this in Kubernetes. There's lots of ways to solve it today. But the thing that you can't really solve for, unless you introduce a PaaS service running on Knative or any of these services that can scale to 0 in a cluster, is taking those workloads, those inferencing engines running in all those clusters, and scaling them to 0.

ABDEL SGHIOUAR: Sure.

NICK EBERTS: Now this is one of the main differences when you're talking about accelerated hardware versus regular CPUs. Accelerated hardware is expensive, so you don't necessarily want to have to pay for GPUs in regions where they're not serving requests. Having a GPU sit in a region just in case is not really cost-effective, right? And so one of the problems that we're out to solve with Multi-Cluster Orchestrator is this idea of taking the HPA of those workloads in secondary regions from 0 to 1 when there is a need for them to scale out, and then taking them back to 0 when there's no longer a need to have that inferencing engine running in that extra region.

So Multi-Cluster Orchestrator's job is to allow an ML operator, or just a workload operator, to define a set of priorities. And those priorities are like which regions they prefer, which clusters they prefer, stuff like that. It then evaluates those against actual capacity and returns a result that's a recommendation for which particular cluster in your fleet this workload should land on.

Now I just want to be clear. It's just making a recommendation. So what Multi-Cluster Orchestrator is not is a CD tool. We have enough of those. I don't think we need one more. And so the first implementation that you'll see is with Argo CD because that's where most of our customers are right now. But we are working upstream to get it to work with Flux and also Config Sync down the line.

ABDEL SGHIOUAR: And so the work you're doing upstream is to standardize the way those recommendations are spit out by MCO, so they can be consumed by like a CD tool?

NICK EBERTS: Yeah, so MCO is open-source, and we believe-- it's still early days-- but the idea is that we're going to have almost like a cloud provider model. We're not a cloud-- I shouldn't say cloud provider, but a provider model. So if you're a Kubernetes provider, you could plug in maybe an API that MCO is going to call to search for capacity.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: And, also, we are going to make the metric that determines whether or not there actually is a capacity issue in any region that's live, open and accessible, so that you could bring your own sort of metric to evaluate the running workloads and decide whether or not they're stocked out, or any logic that you want to use to decide to add or remove a cluster. So those are the two. That's the surface area for integration with other providers. Yeah.

ABDEL SGHIOUAR: Got it. Got it. And, John, you were going to say something about inference specifically. I want to hear your thoughts.

JOHN LI: Great, yeah. For inference, like Nick said earlier, it's just like a web service. And a lot of it is stateless, not keeping track of any state. It's not typically connected to any relational database. In that sense, yes, it's very much like a web server.

There are also cases where it's not quite like a web server, where oftentimes the actual computation is done by an accelerator. And with transformer inference, it is autoregressive, and the workload is divided between prefill and decode. Prefill typically is very compute bound, and decode is memory bandwidth bound and autoregressive in the sense that, for one forward pass of the neural net, you generate one token. And then you keep generating until you hit the end-of-sentence token. So because of that nature, the latency here could be on the order of seconds, as opposed to what we're used to. In the microservice world, things are in the millisecond range, so that's the major difference.

ABDEL SGHIOUAR: Yeah, so to rephrase what you said, basically traffic for LLMs is not your typical web traffic in the sense that the request could be long. The size of the request could be big. And there is also the fact that a lot of these LLMs today are multi-modal. So the request is not always text. It could be audio. It could be video. It could be a picture. It could be whatever, both the request and response.

And I think that's partially what the gateway API inference extension is trying to address, and we're going to have an episode about that. But I want to go back to MCO. I was looking at the demo that you built, Nick. So there is MCO, and there is this thing called workload placement, which is a CRD object, right?

NICK EBERTS: Yep.

ABDEL SGHIOUAR: You deploy that. You say, I want this workload. These are my list of clusters in order of preference. And then it spits out in its status field a recommendation.

NICK EBERTS: Correct.

ABDEL SGHIOUAR: And then you plug a CD tool to take that recommendation and do something with it, right?

NICK EBERTS: Yeah.
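
To make that flow a bit more concrete, here is a minimal sketch of creating a placement and reading back its recommendation from the hub cluster, using the official Python kubernetes client. The transcript only tells us that the workload placement is a CRD whose status carries a recommendation; the API group, version, kind, plural, and field names below are assumptions for illustration, not MCO's actual schema.

```python
# Sketch only: the group/version/kind/plural and field names are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig pointing at the hub cluster
api = client.CustomObjectsApi()

placement = {
    "apiVersion": "mco.example.com/v1alpha1",  # hypothetical group/version
    "kind": "WorkloadPlacement",               # hypothetical kind
    "metadata": {"name": "inference-engine", "namespace": "default"},
    "spec": {
        # Clusters listed in order of preference, as described above.
        "clusters": [
            {"name": "gke-us-central1"},
            {"name": "gke-us-east4"},
            {"name": "gke-europe-west4"},
        ],
    },
}

api.create_namespaced_custom_object(
    group="mco.example.com",
    version="v1alpha1",
    namespace="default",
    plural="workloadplacements",
    body=placement,
)

# MCO evaluates the preferences against actual capacity and writes a
# recommendation into the status field; a CD tool (Argo CD in the first
# implementation) reads that recommendation and syncs the workload there.
obj = api.get_namespaced_custom_object(
    group="mco.example.com",
    version="v1alpha1",
    namespace="default",
    plural="workloadplacements",
    name="inference-engine",
)
print(obj.get("status", {}))
```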

ABDEL SGHIOUAR: But MCO itself builds on top of two things, the cluster inventory API and the cluster profile API.

NICK EBERTS: Correct.

ABDEL SGHIOUAR: So what are those?

NICK EBERTS: Yeah, so, actually, just to be super clear, cluster inventory isn't an API as much as it's just a word to describe a number of cluster profiles.

ABDEL SGHIOUAR: OK.

NICK EBERTS: So cluster profile is a CRD that we built upstream in SIG Multicluster. And it's essentially just a pointer to an actual cluster. But the thing that we noticed is that a lot of providers were building their own cluster lists, including us. We have fleets.

ABDEL SGHIOUAR: Yeah.

NICK EBERTS: Azure has fleets. OpenShift has their own version of things. And then a lot of multicluster tools were maintaining their own lists. If you use Argo CD, the secrets on your central Argo CD server are essentially a list, right? If you're looking at MultiKueue, that's another service that's a multicluster type of workload distributor. That thing has its own list. And Multi-Cluster Orchestrator could have had its own list.

But what we tried to do is normalize that list into open-source specs. Essentially, if you think about it, all of the clusters that are generated as cluster profiles in a namespace on a central hub cluster-- and that's actually a term that we've decided to accept upstream, the hub cluster-- those represent a sameness boundary to some degree. And you could consider them to be analogous to a fleet.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: And that's what a cluster inventory is. It's a number of cluster profiles.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: And, sorry, it's not just the cluster name. It's all the metadata that you want to decorate that cluster with.

JOHN LI: It's almost like the capability list, right?

NICK EBERTS: Yeah.

JOHN LI: For one cluster, it's got GPUs in it. Also, that cluster has got certain networking. It's a way for you to express what this cluster can do.

NICK EBERTS: Yeah.
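
As a rough sketch of what that inventory looks like on a hub cluster, here is how you might list the ClusterProfile objects and the properties they're decorated with, using the official Python kubernetes client. ClusterProfile is the SIG Multicluster CRD Nick mentions; the field layout below follows the alpha spec and may still change, and the namespace and property names (the accelerator capability, for example) are assumptions for illustration.

```python
# Sketch only: namespace and property names are assumed; the ClusterProfile
# alpha API (multicluster.x-k8s.io/v1alpha1) may still change.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig pointing at the hub cluster
api = client.CustomObjectsApi()

inventory = api.list_namespaced_custom_object(
    group="multicluster.x-k8s.io",
    version="v1alpha1",
    namespace="fleet-system",   # assumed inventory namespace
    plural="clusterprofiles",
)

# A cluster inventory is just a number of cluster profiles: each one points at
# a real cluster and carries the metadata -- the "capability list" John
# describes -- that the provider decorates it with.
for profile in inventory.get("items", []):
    name = profile["metadata"]["name"]
    props = {
        p["name"]: p.get("value")
        for p in profile.get("status", {}).get("properties", [])
    }
    print(name, props)  # e.g. {'accelerator': 'nvidia-l4', ...} -- illustrative
```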

ABDEL SGHIOUAR: The term cluster inventory is pretty self-explanatory in my opinion, but it's still interesting to talk about it. So I think my follow-up question then would be, who generates-- and by who, I don't mean the person. Which part of the system generates the cluster profiles? Is it something that you would generate as a user manually? Or would you write something to generate it-- say, something that would query your cluster and then generate the profile and update it?

NICK EBERTS: Yeah, I think the whole point of cluster profile is to remove the need for an end user to have to write a bunch of glue code to sync all of these disparate lists together. So before cluster profile, if you added a cluster, you'd have to add that cluster. You'd have to add the Argo CD secret. You would have to add n number of other entries to a list somewhere that represents that cluster.

So what I'm seeing, both with Microsoft certainly and with us at Google, is that we have a service that's generating the cluster profiles based on cluster creation. So, for example, if you're using GKE fleets, when you add a new cluster to the fleet, we are automatically going to create a cluster profile on the hub cluster that you've identified with a label, so everything gets added. Every time you make a change to some metadata on a label, we're going to reflect that into the profile.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: Yeah.

ABDEL SGHIOUAR: And that's what MCO uses as an input, to say these are my available clusters. For each cluster, these are the capabilities, as you said, John.

NICK EBERTS: Yeah, let's say, in terms of SQL, that's the select, right? And then there's a filter that you can apply to it, which is part of the spec of Multi-Cluster Orchestrator in that placement.

ABDEL SGHIOUAR: Yes, yes.

NICK EBERTS: So you could do regex, or you can hard code the list. And then, shortly, we're going to provide a way for you to use label selection to decide which clusters, because not every cluster in your fleet probably needs to be a target for MCO.

ABDEL SGHIOUAR: Yeah, sure. You don't have to add all the clusters under MCO essentially.

NICK EBERTS: Yeah.

ABDEL SGHIOUAR: But then a follow-up question would be-- the engineer in me is thinking-- one problem that could happen is, if your cluster profile is not up-to-date, would there be a situation in which MCO would make a recommendation that is outdated? Especially if you think about it as, you have multiple clusters, and then multiple people are using other tools to deploy to these clusters. So how fast you can reflect the status of a cluster to MCO matters in this case.

NICK EBERTS: Sure. I can only speak for how fast we could do it in GKE, and it's on the order of milliseconds. It's pretty quick.

ABDEL SGHIOUAR: OK.

NICK EBERTS: But that's up to-- I think that's an implementation detail of the provider of cluster profile. That's not even an MCO thing. That's just how quickly your sync is occurring between whatever the source of your cluster list is and the actual hub cluster with the profile.

ABDEL SGHIOUAR: Got it. Got it. OK, cool. So MCO is under SIG Multicluster, right?

NICK EBERTS: Not yet.

ABDEL SGHIOUAR: Not yet, OK.

NICK EBERTS: That's the intent.

ABDEL SGHIOUAR: That's what you're trying to do.

NICK EBERTS: Yeah.

ABDEL SGHIOUAR: All right. And when do you expect people will be able to play around with this?

NICK EBERTS: Yeah, so depending on the time of release, maybe today. But two weeks after Next, let's say, is our goal. And the first thing you're going to see is the images-- it's going to be available first for GKE clusters, obviously, because this is the team that's building it. The images for the binaries will be public and available in the GitHub repo. And then you'll have a Terraform sample that shows you how to build it all out using our implementation.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: And then shortly after that, we're actually going to release the code into that very same repo. And then over the next six months, I'm going to go through the process of working with SIG Multicluster to figure out how, where, and when it gets pushed in.

ABDEL SGHIOUAR: Got it.

NICK EBERTS: Yeah.

ABDEL SGHIOUAR: And another question to you, John, on the inference side, because the demo that I saw that you built, the one that is internal for now, also leverages a multi-cluster gateway for multi-cluster load balancing for inference. Is it based on the GKE Inference Gateway or just the regular gateway?

JOHN LI: So that's based on a gateway class that we built for GCP. That's a multi-cluster, multi-region, internal load balancer gateway class.

ABDEL SGHIOUAR: Got it.

JOHN LI: There's also some other work that we're doing upstream with the gateway spec with respect to inference pools. There's a separate inference gateway.

ABDEL SGHIOUAR: Yeah, so that was actually my real question. And my follow-up would be-- because I did some reading about the inference extension, and I actually did a talk at KubeCon last week, like a lightning talk, about it. Part of what this does is there is this thing called the endpoint picker, which kind of plugs the inference pools into the gateway itself to tell the gateway where to route traffic.

JOHN LI: Exactly.

ABDEL SGHIOUAR: So how do you see this plug into MCO? Is the inference extension going to be able to leverage MCO in a way?

JOHN LI: Yes, yes, so that's something where we're still actively figuring out the details. But I can give you the high level.

ABDEL SGHIOUAR: Yeah.

JOHN LI: So with MCO, you can think of the routing at that layer as region picking.

ABDEL SGHIOUAR: Yes.

JOHN LI: And then what EPP does is the endpoint picking.

ABDEL SGHIOUAR: Yes.

JOHN LI: So you can think of this in two layers, right? First, when the request goes in on the data path, you decide which region it will be assigned to. And then once you get to the regional level, what the EPP gives you is the capability, instead of doing round-robin load balancing, to use a custom metric, for instance, KV cache utilization, to fully balance all your accelerators.

ABDEL SGHIOUAR: Or queue size or something like that, yeah.

JOHN LI: So the latency that it works at is in the milliseconds. It needs to be fast because it needs to be able to route to an endpoint. That's the latency we're working with. And for region picking, what we want to do is send traffic to regions where there's capacity. And the latency there doesn't actually need to be that low, compared to the endpoint-picking part.

ABDEL SGHIOUAR: Of course.

JOHN LI: So think of this as a two-layer problem. First, pick the region-- and we have ways to direct and shape traffic to regions where there is capacity. And, second, once a whole bunch of requests lands there, there are mechanisms to balance the load among all the accelerators, and that's the endpoint picking.

ABDEL SGHIOUAR: I think my question was slightly broader than that, in the sense that, if you have a situation where you need to autoscale based on utilization, MCO would be able to be plugged in to make that workload placement autoscaling recommendation. Or am I--

NICK EBERTS: So MCO's job is to take it from 0 to 1 and 1 to 0.

ABDEL SGHIOUAR: OK.

NICK EBERTS: The HPA takes over once it gets to 1.

ABDEL SGHIOUAR: Got it. Got it.

NICK EBERTS: So you're going to configure the HPA on a metric that makes sense for that particular workload. In the case of inferencing, we've seen people use KV cache. We're recommending KV cache utilization, or maybe even queue depth of the LLM.

ABDEL SGHIOUAR: Or maybe metrics from the GPUs or something else. Yeah, cool. Awesome.
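
To illustrate the split Nick describes, here is a minimal sketch of an HPA that scales the inference engine inside a single cluster once MCO has taken that region from 0 to 1. The autoscaling/v2 API is standard Kubernetes; the deployment name, metric name, and target value below are assumptions for illustration and depend on the model server and metrics adapter you run.

```python
# Sketch only: workload name, metric name, and target value are assumed.
from kubernetes import client, config

config.load_kube_config()

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "inference-engine", "namespace": "default"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "inference-engine",  # assumed workload name
        },
        "minReplicas": 1,   # MCO handles 0 -> 1; the HPA takes over from 1
        "maxReplicas": 8,
        "metrics": [{
            "type": "Pods",
            "pods": {
                # Per-pod custom metric surfaced by a metrics adapter, for
                # example KV cache utilization or LLM queue depth.
                "metric": {"name": "kv_cache_utilization"},  # assumed name
                "target": {"type": "AverageValue", "averageValue": "800m"},
            },
        }],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default",
    body=hpa,
)
```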

JOHN LI: One thing we'll add, though, on the multi-cluster gateway side-- this is, I guess, getting more into the details of GCP-- is there's something called a preferred backend.

ABDEL SGHIOUAR: Yep.

JOHN LI: That is for the customers to say, I want to shift my traffic to a particular region-- for reasons like, if people have bought reservations in that region, they want to fill them up. So those are ways to steer traffic and shift traffic. And then once you've done that, MCO can autoscale that workload to that region, and HPA can scale it within the region-- the pods, and then the nodes.

ABDEL SGHIOUAR: Awesome, awesome. Thank you very much, folks. This was pretty cool. I'm looking forward to MCO coming out, and maybe one of you coming back on the podcast to tell us more about the new stuff.

NICK EBERTS: My goal is to have this conversation with you again in six to seven months and talk about its sort of birth into SIG Multicluster and use by other companies besides Google. That's 100%.

ABDEL SGHIOUAR: Awesome. The last time we had you on the show, that episode was very popular, so we're happy to have you back.

NICK EBERTS: All right.

ABDEL SGHIOUAR: And then maybe we can have John come back to talk about Sichuan peppers because I've heard you guys gossiping about Sichuan peppers.

JOHN LI: We could do a whole episode on cooking.

ABDEL SGHIOUAR: Well, probably not on the "Kubernetes Podcast" but--

NICK EBERTS: All right.

ABDEL SGHIOUAR: Awesome. Thank you very much, folks.

NICK EBERTS: All right, thanks.

JOHN LI: Thank you.

[MUSICAL CHIMES]

KASLIN FIELDS: That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player, so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.

[MUSIC PLAYING]