#258 August 20, 2025
Guests are Clayton Coleman and Rob Shaw. Clayton is a core contributor to Kubernetes, the container cluster manager, and founding architect for OpenShift, the open source platform as a service. Clayton helped launch the shift to cloud native applications and the platforms that enable them. At Google, his mission is to make Kubernetes and GKE the best place to run workloads, especially accelerated AI/ML workloads, and especially especially very large model inference at scale with the Inference Gateway and llm-d. Rob Shaw is an Engineering Director at Red Hat and is a contributor to the vLLM project.
Do you have something cool to share? Some questions? Let us know:
KASLIN FIELDS: Hello, and welcome to the "Kubernetes Podcast" from Google. I'm your host, Kaslin Fields.
MOFI RAHMAN: And I'm Mofi Rahman.
[MUSIC PLAYING]
This week, I had a chance to sit down and talk with Clayton Coleman and Rob Shaw. Clayton Coleman is a core contributor to Kubernetes, the container cluster manager, and founding architect for OpenShift, the open-source platform as a service. Clayton helped launch the shift to cloud-native applications and the platforms that enable them.
At Google, his mission is to make Kubernetes and GKE the best place to run workloads, especially accelerated AI/ML workloads, and especially, especially very large model inference at scale with the Inference Gateway and llm-d. Rob Shaw is a director of engineering at Red Hat and is a contributor to the vLLM project. In the interview, we talked about why LLMs are different than any other workload running on Kubernetes and why projects like llm-d exist. But first, let's get to the news.
[MUSIC PLAYING]
KASLIN FIELDS: Kubernetes 1.34 is expected to release here at the end of August. If you haven't seen the sneak peek blog yet, head over to kubernetes.io to check it out, and look forward to our interview with the release lead.
MOFI RAHMAN: KubeCrash is a community-led virtual event happening on September 23rd. Attendees can expect to learn about a variety of topics in the form of cloud-native, open-source crash courses for platform engineers. The event will also be raising money for Deaf Kids Code, a non-profit organization with a mission to provide equitable access to computer science education. Check out the schedule and register for the event. The link is available in the show notes.
KASLIN FIELDS: The CNCF published a blog post, listing the top 30 open source projects in 2025. Unsurprisingly, Kubernetes has the largest contributor base, followed by OpenTelemetry, which is quickly becoming the Kubernetes of o11y communities, as called out in the blog post. The blog lists out a number of other projects, like Backstage, which we featured on episode 136, as well as Argo, Crossplane, Kubeflow, and many others. This growth shows where the community is headed and where the future investments are. Make sure to check out the link in the show notes. And that's the news.
[MUSIC PLAYING]
MOFI RAHMAN: Welcome to the show, Clayton and Robert.
CLAYTON COLEMAN: Hello.
ROB SHAW: Hey, thanks for having me on.
MOFI RAHMAN: So, Clayton, this past KubeCon EU 2025, there was a keynote where we talked about Inference Gateway for Kubernetes and running inference workloads on Kubernetes. For the listeners, for people who have been using Kubernetes for a long time but not necessarily running LLM workloads, why would something like inference be any different from running, let's say, a web application?
CLAYTON COLEMAN: That's a great question. It's taken us-- it's taken me, and I'm sure others as well, a lot of time to wrestle with this. AI/ML workloads, before large language models, tended to be really interesting. They were highly custom. They were dependent on lots of local software stacks. To Kubernetes, they were really just another workload, and there were thousands of variations. Every organization had its own unique take on how to run an AI/ML platform in production.
What was really interesting with large language models is it shifted the problem space from being one of software development to one of resource usage and scale. And, interestingly, that was really when it stopped looking like a traditional microservice and started looking like a very specialized bit of software that just so coincidentally needs a ton of direct access to hardware, and started taking on characteristics that were different from regular web apps.
For a regular web app, random load balancing is pretty good. For horizontal auto-scaling in Kubernetes, you stick a CPU number on there. And when you cross that CPU number, we scale up. And when you fall below it, we scale down. It was pretty simple, but what was pretty obvious from a lot of our discussions with people running large language models on Kubernetes was a lot of the primitives didn't work for them because the problem had changed.
And we've experimented. We started Working Group Serving two years ago at KubeCon EU 2024 and started partnering with people. One of the things that was really obvious was load balancing. What's really different from large language models is they're a bit of a computer in and of themselves. The model is processing things, and so I like to think of the large language model as a little bit like its own host, its own CPU, that has to be shared.
And so as we looked at it that way, we said, there's nothing really that helps you share a large language model acting as a CPU for a bunch of different workloads, and that led to Inference Gateway and the idea that you're load balancing traffic differently than you would for a web app because the requests are different lengths, right? A really short prompt is much cheaper to calculate than a very long prompt. You have to do them iteratively.
And so Inference Gateway started as, hey, we can do better than random load balancing, and maybe we can divvy up access to these large models fairly, help operators and deployers get a good idea for how the workloads are going. And KubeCon EU was actually really transformational for us because it was that shift. From thinking like a load balancer, we had the realization that, as a generic load balancer, you could only do so much.
We were already starting to think, hey, we need to work more closely with the servers that were actually running the models. And those are-- they're called model servers, and there's a couple of them that are really important. And one of the most widely known and most popular is vLLM. And so after KubeCon EU, I got an out-of-the-blue call, even though we had chatted a few times before, from Rob Shaw, and he wanted to talk about what we could do together. Not just Inference Gateway or not just vLLM, but how can we bring those two together?
ROB SHAW: Yeah, and from my side in vLLM, we've been partnering with model providers over the course of the past two years since vLLM came out and have really been staying up-to-date with the evolutions in the model architectures themselves. And we had seen, very commonly, vLLM being deployed inside of Kubernetes and operational systems, and were really interested in how we can make this work more closely together. But I think what really started to become acute over the course of 2025, with the advent of DeepSeek in December of 2024, is this shift towards very large mixture of experts models.
And the problem of running this inside of Kubernetes became really acute because mixture of experts models, these huge DeepSeek-like architectures that, with recent models like Kimi, have a trillion parameters, are really designed to be deployed with techniques like disaggregated prefill, where we'll have a prefill instance and a decode instance that need to work together to serve an individual request, or things like wide expert parallelism, where multiple nodes work together to serve an individual inference request to really scale things out and get high performance in a distributed system.
And the demands of the models as well really started to make the operational challenge of deploying these frontier models more and more acute. And we obviously had seen Inference Gateway solving the problem of load balancing with intelligent scheduling, and we really wanted to drive the concerns that we had around how to serve these bigger models with these more sophisticated optimizations and make that really compose nicely with all the amazing work that the upstream community had done to deal with the load balancing workloads.
So that was a little bit of why things started to become acute, and why we really thought there was a good opportunity to bring together the vLLM community and the Gateway community to work on building a project that has tighter requirements between the two that help to drive the APIs that are needed in both systems.
MOFI RAHMAN: So you mentioned model servers, and vLLM happens to be one of these model servers, but the task of serving models is not something new with LLMs. People were serving different types of models before. Things like Seldon Core were used before, KFServing, now called KServe, exists and does a similar thing, and then you have TF Serving as well for TensorFlow models, things of that nature. In this new era of serving large language models, why vLLM? Or what is vLLM doing that these other solutions did not really answer before?
ROB SHAW: Yeah, sure. The fundamental problem with large language models is that they're autoregressive, which basically means that every token that gets generated requires another pass through the model, over and over again, to generate text. Whereas traditional models, like a BERT, or a YOLO, or a predictive ML model, basically do one forward pass to execute the request, and are done and return the response back to the user.
So effectively, from the perspective of the model, they're stateless, right? Traditional predictive apps. You would see a very common strategy called dynamic batching, where servers like Triton server would queue up a series of requests, dispatch them all off to the model to batch them together, do one forward pass, generate the responses, and return them to the users.
But as we look at LLM workloads, effectively, during the life of an inference request, there's a lot of state that needs to be managed because we want to reuse the intermediate states between those forward passes. These are called the KV caches, which are a really critical optimization to avoid having to recompute the prompt over and over again as we generate tokens.
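To make the state Rob describes concrete, here is a minimal, framework-free Python sketch of autoregressive generation with a per-request KV cache. The model_forward function is a placeholder rather than vLLM code; the point is that the prompt is processed once during prefill, and every decode step only processes the newest token while the cached state keeps growing.

```python
# Minimal sketch (not vLLM code): why autoregressive decoding needs per-request state.
# `model_forward` is a stand-in for one transformer forward pass; a real engine would
# run this on an accelerator and return logits plus updated attention state.

def model_forward(tokens, kv_cache):
    """Pretend forward pass: consumes only the *new* tokens, reuses cached state."""
    new_state = [f"kv({t})" for t in tokens]      # placeholder for per-token K/V tensors
    kv_cache.extend(new_state)                    # state grows with every token processed
    next_token = hash(tuple(kv_cache)) % 50000    # placeholder for sampling from logits
    return next_token

def generate(prompt_tokens, max_new_tokens):
    kv_cache = []                                 # per-request state the server must manage
    # Prefill: one pass over the whole prompt (compute-bound).
    next_tok = model_forward(prompt_tokens, kv_cache)
    output = [next_tok]
    # Decode: one pass per generated token, reusing the cache (memory-bound).
    for _ in range(max_new_tokens - 1):
        next_tok = model_forward([next_tok], kv_cache)
        output.append(next_tok)
    return output

print(generate([101, 7592, 2088], max_new_tokens=5))
```

Without the cache, every decode step would have to re-run the entire prompt; with it, the server trades memory for compute, which is exactly the state management vLLM has to handle.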
So this performance optimization of KV caching is really critical to getting good performance out of the LLMs, but it comes at the cost of needing to keep track of this intermediate state. So vLLM really emerged in the summer of 2023 with a fundamental algorithm called PagedAttention, which allowed us to manage this KV cache in a smarter way using what we call a block table, which is an homage to the concept of virtual memory in an operating system, where each request has a logical view of its KV cache, which maps to random access physical KV cache blocks.
And there's an attention algorithm called PagedAttention, which was really a fused gather attention operation that made this all work really well. vLLM really emerged in the summer of 2023 with a good implementation of continuous batching and this primitive of KV cache management through a block table with PagedAttention. That really kicked off the open-source LLM-serving ecosystem with the right fundamental abstractions.
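A toy illustration of the block-table idea (not vLLM's actual implementation): each request sees a contiguous logical KV cache, while the physical blocks backing it come from a shared pool and can sit anywhere in memory. The block size and class names below are made up for the sketch.

```python
# Toy sketch of the block-table idea behind PagedAttention: a request's logical KV
# cache is backed by arbitrary physical blocks, allocated and freed like pages.

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # pool of physical KV blocks
        self.blocks = {}                              # request_id -> [physical block ids]
        self.tokens = {}                              # request_id -> tokens stored so far

    def append(self, req, num_tokens):
        """Grow a request's logical cache; grab physical blocks only when needed."""
        self.tokens[req] = self.tokens.get(req, 0) + num_tokens
        needed = -(-self.tokens[req] // BLOCK_SIZE)   # ceil division: blocks required
        while len(self.blocks.setdefault(req, [])) < needed:
            self.blocks[req].append(self.free.pop())  # any free block will do
        return self.blocks[req]

    def release(self, req):
        """Request finished: its physical blocks return to the pool immediately."""
        self.free.extend(self.blocks.pop(req, []))
        self.tokens.pop(req, None)

bt = BlockTable(num_physical_blocks=8)
print(bt.append("req-A", 20))   # 20 tokens -> 2 blocks, e.g. [7, 6]
print(bt.append("req-B", 5))    # 5 tokens  -> 1 block
bt.release("req-A")             # blocks go back to the pool for other requests
```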
And that core of vLLM, a continuous batching engine and a KV cache management engine, is still fundamental to the vLLM system, but vLLM really became the leading inference engine in the ecosystem. And so vLLM has really grown up alongside the open source ecosystem. We've seen major changes from 2023 to 2025, where, in 2024, we saw a huge advent of multimodal models.
We saw lots of new techniques, like chunked prefill, or prefix caching, or structured generation, or speculative decoding, that emerged over the course of 2023 and 2024. And then in 2025, we saw a push towards large mixture of experts models. We've seen an explosion in the number of different hardware backends that vLLM supports, whether it's NVIDIA, AMD, or Google TPU, which was a big project that we did over the course of the past nine months in partnership with the Google team to add different accelerator backends.
So vLLM really started from that core problem of KV cache management, continuous batching, which are those fundamental abstractions that are needed to run a large language model performantly. And it's grown up and exploded in the number of features, model support, et cetera, as we've seen that open-source ecosystem really blossoming over the course of the past two and a half years.
CLAYTON COLEMAN: That's a great point, and I want to add something to what Rob said. To your earlier question, Mofi, or to the original question as well, what's different? I really want to emphasize how much the ML focus has shifted from, what's the platform that lets you bring many different types of models to production, which is some of the things that KServe, and Seldon, and TF Serving focused on, models with wildly different architectures, to a world where the model is more important.
We still will have multiple, different models. There might be different sizes. You'd be trying a couple of different model architectures from different model providers, as each model provider races to one-up the other with better performance, because you have many fewer models. So I like to think that the focus that we're trying to take now, and what vLLM is really built around, and where we're going, is a workload-centric view, which is that the model itself is a big, complex workload.
It's a distributed system, if you will, and it started out simple. All things do. We had replicas. And a lot of what Rob described is this: we're adding pieces that complement that internal cache, the internal processing, that you just wouldn't need if you had thousands of models. But because you have one big model that's very large, it's larger than the hardware it runs on, and it's spread across all these machines, it becomes a distributed system.
And that's really just a different approach. It's like distributed databases: everybody's run a really small Postgres database on their laptop, but a distributed database is a completely different beast. I think that's the real transition, or that's the other transition that's happened along with large language models, from a platform to deliver models to the model as a workload.
And these are big, massive, important workloads that will form the core of a whole host of satellite workloads that are consulting these models and using them to bring new capabilities, whether it's multimodality, or reasoning, or agentic workloads. The future is bright for calling models. What do we have to do to support it?
ROB SHAW: Absolutely. I think one other point I'll just add is that it's hard to overstate the first L of LLMs, which is large. We're looking at an amount of compute that's ginormous in terms of the raw processing power that's needed to get reasonable throughput and latency out of your cluster.
And a lot of the more complicated deployment patterns that we're talking about are fundamentally performance optimizations that are trying to reduce the overall amount of operational spend that's needed to support the models, and it's really because the compute needed to actually serve these things is big. That drives the need for more and more performance optimization at the inference server level, the cluster level, and the model level, to really continue bringing down the cost of these overall systems.
And I think the magic of what we're trying to do with llm-d is take all these performance optimizations that are complicated, hard, require a lot of engineering sweat to make work, and deal with an ML software stack that has lots and lots of pieces, that's hard to work with at times, and really bring it into the operational model of Kubernetes to try to make this easier for folks to run as they go into production with these architectures.
MOFI RAHMAN: That's a perfect tee-up to the next question I was going to ask, since you just mentioned llm-d. So you have the Inference Gateway, which is handling the routing, using knowledge of the model itself to route better. You have vLLM, which is the model server. Paint me a picture of where llm-d fits in, or where llm-d comes in to help.
ROB SHAW: So with llm-d, I just want to emphasize, it's really the merging of the two communities, right? vLLM and Inference Gateway jointly driving requirements-- and not just Gateway, but also other components in the Kubernetes ecosystem. We've increasingly been working with LeaderWorkerSet. But the idea is to drive common requirements across both key dependencies and key upstreams, and make these upstreams better and better at the well-lit paths that we're targeting.
And so the idea is to bring these two communities together and have Gateway drive requirements down to vLLM and have vLLM drive requirements up into Gateway, into LeaderWorkerSet, and into other dependencies that we'll rely on to build common, well-lit paths. And I think with these well-lit paths, what we're trying to really highlight is state-of-the-art ways to deploy common patterns, right?
So right now, with our 0.2 release that we came out with, we have three well-lit paths that we're targeting. The first is intelligent inference scheduling, which is an example of a deployment pattern that we think everybody should use in every situation. It takes a lot of the existing, really, amazing load balancing logic that comes out of Gateway, brings vLLM in, and provides a really common way that everyone should deploy every model with these techniques.
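As a rough sketch of what "intelligent inference scheduling" means in practice, the snippet below scores replicas using the kinds of signals a model server can expose, such as queue depth, KV cache utilization, and likely prefix-cache hits, instead of picking one at random. The field names and weights are illustrative assumptions, not the Inference Gateway's actual API.

```python
# Hedged sketch of scheduling by replica state rather than at random. Everything
# here (field names, weights, the prefix-hash scheme) is made up for illustration.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int             # requests waiting in the engine
    kv_cache_utilization: float  # 0.0 - 1.0
    cached_prefixes: set         # prompt-prefix hashes this replica has seen recently

def score(replica, prefix_hash):
    s = 0.0
    s -= 2.0 * replica.queue_depth                # long queues hurt latency the most
    s -= 10.0 * replica.kv_cache_utilization      # nearly-full cache risks preemption
    if prefix_hash in replica.cached_prefixes:    # reuse avoids recomputing the prompt
        s += 5.0
    return s

def pick(replicas, prefix_hash):
    return max(replicas, key=lambda r: score(r, prefix_hash))

replicas = [
    Replica("vllm-0", queue_depth=4, kv_cache_utilization=0.9, cached_prefixes={"abc"}),
    Replica("vllm-1", queue_depth=1, kv_cache_utilization=0.4, cached_prefixes=set()),
    Replica("vllm-2", queue_depth=2, kv_cache_utilization=0.5, cached_prefixes={"xyz"}),
]
print(pick(replicas, prefix_hash="xyz").name)  # random LB would pick this replica 1/3 of the time
```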
And what we're starting to build with these newer well-lit paths is architectures for running more and more sophisticated deployments. So an example of this is prefill/decode disaggregation. This is a technique that allows us to split the model server into two parts. We'll have one replica of vLLM that's a prefill server that we've configured and optimized to do prefill requests.
Typically, this means using more replicas with less parallelism, because the collective operations that are needed to process prefills, which are a compute-bound operation, are quite heavy. So in general, you want to use less parallelism for the prefill workers, and then the decode workers will process the decode requests. And in general, we want to maximize the amount of KV cache space.
Decode is a memory-bound operation, so in general, we want to have as high of a batch size as possible, and to do this, we use more parallelism. And so this is an example of a configuration and a technique that we've added a lot of features into vLLM for, adding things like NIXL as the KV cache transfer library, and supporting a protocol to tell vLLM that this request should be processed with disaggregated serving.
And then on the gateway side, we've had to implement extensions to do prefill/decode-related scheduling. And so we've developed a joint protocol, where Gateway and vLLM are able to talk to each other in the right way to build a whole system that allows prefill/decode disaggregation to be something that folks can use when they go to deploy and, of course, running on top of a Kubernetes cluster.
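Here is an illustrative, heavily simplified sketch of that prefill/decode flow. The class and method names are hypothetical stand-ins rather than vLLM or Gateway APIs, and the actual KV transfer in llm-d happens through a connector such as NIXL rather than a Python return value.

```python
# Simplified sketch of prefill/decode disaggregation: one replica does the prompt
# pass, another generates tokens from the transferred KV cache. All names are
# illustrative stand-ins, not real vLLM or Inference Gateway interfaces.

class PrefillReplica:
    """Tuned for the compute-bound prompt pass (less parallelism, per the discussion)."""
    def prefill(self, prompt_tokens):
        kv_cache = [("kv", t) for t in prompt_tokens]  # stand-in for real KV tensors
        return kv_cache                                # in practice, shipped over the network

class DecodeReplica:
    """Tuned for the memory-bound token-by-token pass (big batches, more KV room)."""
    def decode(self, kv_cache, max_new_tokens):
        out = []
        for i in range(max_new_tokens):
            out.append(len(kv_cache) + i)              # stand-in for sampling a token
            kv_cache.append(("kv", out[-1]))           # cache keeps growing during decode
        return out

def serve(prompt_tokens, prefill_pool, decode_pool):
    # The gateway's scheduler picks one replica from each pool (policy elided here).
    p, d = prefill_pool[0], decode_pool[0]
    kv = p.prefill(prompt_tokens)           # step 1: prompt processed once
    return d.decode(kv, max_new_tokens=4)   # step 2: cache handed off, tokens generated

print(serve([1, 2, 3], [PrefillReplica()], [DecodeReplica()]))
```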
So this is a great example of, if we were just working in vLLM, we would have to write a proxy layer that's doing complicated scheduling logic to decide which prefill replica to use, which decode replica to use, but we're able to just push the requirements up into Gateway and push the requirements down into vLLM. And we're able to use a real production load balancer and proxy directly with vLLM under the hood.
So I think this was a good example of, we're ironing out the issues associated with running a PD-disagg kind of scenario, and bringing together the two components to make them work together to run really performantly and, again, get this performance optimization, which is prefill/decode disaggregation.
And then the third path that we have for the recent release that we did is something called wide expert parallelism. This is an optimization that's targeted at these wide mixture of experts models. So DeepSeek, Kimi, Llama 4, these all have 128, 256 experts. They're huge models, hundreds of billions of parameters.
And the idea is we want to deploy these in a multi-node setup. So we've added a lot of features to vLLM to do prefill/decode disagg, of course, but then things like data parallel attention with expert parallel MLP layers, integrating a lot of the key kernels that Perplexity has put out, as well as ones DeepSeek has put out. DeepEP is one name; DeepGEMM is the name of the GEMM kernel.
We've implemented all this stuff inside of vLLM, and we've been working with the Gateway community to compose the existing load balancers that we have. And then we've been working with LeaderWorkerSet to deploy these multi-node replicas of vLLM. And we've encountered lots of issues doing this, and we've been ironing them out. And this is helping to really drive the requirements into LWS, helping to drive the requirements into Gateway, and really just providing a lit path.
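For intuition on why wide expert parallelism stresses the network, here is a toy routing sketch: each token picks its top-k experts, and because experts are sharded across ranks, most token-to-expert assignments have to cross node boundaries (the all-to-all traffic that kernels like DeepEP are built to optimize). The expert counts and the hash-based router are purely illustrative.

```python
# Toy sketch of the routing step behind wide expert parallelism. The "router" is a
# deterministic hash stand-in for the learned gating network; numbers are made up.

NUM_EXPERTS = 16      # the MoE models discussed above have 128 or 256
EXPERTS_PER_RANK = 4  # so 4 ranks (nodes/GPUs) in this toy setup
TOP_K = 2

def route(token_id):
    """Stand-in for the learned router: pick top-k expert ids for a token."""
    return [(token_id * 7 + i * 13) % NUM_EXPERTS for i in range(TOP_K)]

def dispatch(token_ids):
    per_rank = {r: [] for r in range(NUM_EXPERTS // EXPERTS_PER_RANK)}
    for tok in token_ids:
        for expert in route(tok):
            rank = expert // EXPERTS_PER_RANK      # which rank owns this expert
            per_rank[rank].append((tok, expert))   # this assignment crosses the network
    return per_rank

for rank, work in dispatch(range(8)).items():
    print(f"rank {rank}: {work}")
```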
We validated that this all works. We've dealt with a lot of the issues with these things and are helping to push the requirements into the upstreams to make that whole path work really smoothly. So I think these are three examples of well-lit paths that we have now, where not every one of these is going to be something that's used in every deployment, but we're trying to iron out how to run these more sophisticated deployment patterns and make that smoother.
And then we're working on adding more of these. Of course, there's things like KV cache offloading to CPU RAM and then, eventually, remote storage, which is another more sophisticated pattern for deploying in a cluster. This is something we're working on. And then we're also working on some things related to SLO-based scheduling and auto-scaling, which are other well-lit paths that we're going to be bringing in over time.
So that's the overall idea of the project: to define these key user stories and deployment patterns, make them work really well by bringing together upstream Kubernetes projects and vLLM, iron it out, and provide references for folks on how to run these things, and try to identify the top five or 10 different paths over time that we think are useful ways to deploy LLMs. I went on for a long time. Clayton, do you have anything to add?
CLAYTON COLEMAN: Well, no. And I think a lot of-- every time I hear it said back, I pick up something new, and it helps me think about things in a different way. So one of the things Rob and I agreed on really early in this is there's two hard problems, I think, in this space right now. And one of those is something Rob's really familiar with, which is everybody has all of these tricks and techniques, but they're all scattered.
So everybody, this ecosystem, large language models, generative AI, everything's going so fast that a lot of the ecosystem is people making individual tweaks and learnings, and they're able to get the optimizations they need, and it stops there. So they take the LLM, and they have a patch or two. It gets them up and going, and then they want to leave it there, because they're startups or they're rapidly moving AI natives.
And speed matters. And so some of that extra hard work, which is taking all this and bringing it back together, wasn't happening. And that's something that Rob, and the Red Hat team, and the IBM team, and the larger vLLM community are really interested in: pulling this back together. So those well-lit paths, in some sense, are classic OSS.
It's that all of us are more powerful than each of us, and making it easier for people to anchor on those paths makes contribution easier. If we can go out there and take a look at the 10 or 15 different ways people have done prefill/decode disaggregation, we can apply some judgment and say, it works in these scenarios or those scenarios, and we can bring that expertise back. We're not necessarily the ones driving it, but what we are doing is curating and pulling it together.
But a second part of this, I think, is another ecosystem thing that I learned very early in Kubernetes. At the end of the day, to a lot of people, this is just something that helps them get their job done. But once you deploy something-- if you're deploying Kubernetes, for the last 10 years, most of the people who were deploying Kubernetes in anger were building platforms. It's platform engineering teams. They're supporting lots of workloads.
Kubernetes is not perfect by any means. It was just better than writing your own. You could write a better one, and I encourage everyone to go out there and write better orchestration systems. The reality is it wasn't the core of their business. And so where Kubernetes was really successful was about, you don't have to be perfect. You have to be useful, usable, and help people focus on the problems they actually want to focus on. People didn't want to go write for loops that recovered services when the node crashed.
So coming into this effort, I think there's something we can really do beyond helping create those well-lit paths: bringing ecosystem optimizations together so that all of us get the benefit of them. And then that means there's a nice, tight release pipeline that everybody can depend on in vLLM, and Inference Gateway, and Kubernetes, where the stuff just works and it keeps working better over time.
But the other one is defining the APIs between these components. So Rob mentioned prefill and decode. It's been really difficult. There's lots of different approaches, but there's a very common refrain that I hear, which is that we're pushing massive amounts of data. A thousand-token prompt might generate, on the order of a gigabyte, anywhere from a hundred megabytes to 10 gigabytes of data that you have to push across the network.
Most people's microservices are not pushing across 10 gigabytes of data in a throughput-oriented, relatively latency-dependent setting. And we've got these new fast networks, but it's a challenging problem. And so some of what we can do is, by coming in and looking at the operational patterns, by looking at what people have done, you can apply a little bit of a thumb on the scales and say, this pattern works for this use case and this pattern works for this use case.
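A quick back-of-the-envelope check of those numbers, using a roughly Llama-3-70B-style configuration with grouped-query attention; the exact figures vary a lot by model and precision, so treat this as an order-of-magnitude sketch rather than a spec for any particular model.

```python
# Rough KV cache size estimate for a ~1000-token prompt. Configuration is
# illustrative (roughly a 70B dense model with GQA); real models differ.

num_layers    = 80
num_kv_heads  = 8      # grouped-query attention keeps this far below the query head count
head_dim      = 128
bytes_per_el  = 2      # fp16/bf16
prompt_tokens = 1000

# K and V are both cached, for every layer, for every token.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
total = bytes_per_token * prompt_tokens

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"{total / 1024**2:.0f} MiB for a {prompt_tokens}-token prompt")
# -> ~320 KiB per token, ~313 MiB for the prompt; models without GQA, with more
#    layers, or with longer prompts push this toward the multi-gigabyte range.
```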
But if all of us are going to do it, what are the one or two paths that we can focus on? And some of this is just opinion. We're coming in with an opinion and we're saying, we think this will operationally scale. Some of that comes from our own experiences. At Google, we have a lot of people who've been doing things like this internally, who provide feedback and guidance on patterns. Just like for Kubernetes, there was a lot of folks inside Google who had had experience running containers at scale.
So what we're just trying to do is get some really good opinions around APIs between these different components. And I don't think either Rob or I view this as something we have to win at everybody's expense. In fact, what we'd rather have happen is that everybody in the ecosystem converge, because there's other model servers, and there's other load balancing approaches, and there's other ways to orchestrate.
What can we do to bring those APIs? Show something working, clearly articulate why, and see if we can build a center of gravity. And that excites me. That's what I think I like to see, is we can lock away some of that complexity and make stuff easier, not just for folks today, but folks three, four years down the road.
MOFI RAHMAN: I want to drill down on one of the things you mentioned there, Clayton, which is that a lot of the teams that are startups, or are crunched for time, or are trying to move very fast, are maybe taking this open source vLLM, or llm-d, or whatever the guidance there is, and tweaking one or two things, one of the optimizations that works for them. What would you tell those teams? What is the reason they want to be at the table, bringing those optimizations that work for them, but making them more general so that it lifts the entire industry up?
CLAYTON COLEMAN: I think this is pure self-interest, which is that open source works best when everybody gets something. So usually, there's a lot of obstacles to contributing back. There's kind of two models. There's the deep and the broad model. The broad model is the lots-of-eyes one. There is no better distributed bug-catching mechanism than a whole bunch of programmers just hacking on stuff, and that is happening now.
There's lots of little things that break. And, honestly, the ML ecosystem is fundamentally dependent on taking very sophisticated algorithms, breaking them apart using highly performance-optimized libraries, gluing them together, and then not touching them. You don't want them to break. And, of course, that leads to lots of subtle breakages. So the incentive that I think we'd be looking for is: you can go down this well-lit path. You can easily fork it and make your contributions. You can take it and you can have those patches that work around things.
But instead of so much of what you're adding being pretty bespoke to your environment, you're working off of a path where not just the one piece, like the vLLM patch works, but some of the tunables for how prefill-decode work. Some of the future things, like EP, new models are going to come out. They're going to change the required mix of parameters, and the libraries, and new algorithms, and tuning.
The more that we can just concentrate attention, what that would mean is you have to carry fewer patches. And then when you've got something working, centralizing some of these flows makes reporting, debugging, and verifying those issues easier. And verification and benchmarking is a fundamental part of this. It's just hard to benchmark a small stream of fixes across a wide range of configs.
What we can do, what we can do with the vLLM community, and with Inference Gateway, and the larger ecosystem, is build some of the tooling that's going to make it easier to do performance regression testing and try out these scenarios. And even just that really boring work will make the job of those startups easier: they can pick this foundation, and instead of it being a bunch of pieces they assemble, it's a smaller set of pieces they have to assemble. Rob, I don't know if I did justice to the kind of overview. I think you have a much deeper connection to the feel or mindset.
ROB SHAW: Yeah, the other thing, I think, is also important is, in llm-d, we're not taking forks of things. We're using and driving these things into the upstream directly, and I think this is really important because the pace at which things are improving and changing in the ML ecosystem is absolutely breakneck.
I've always been laughing to myself. It's like, we spent all of 2024 optimizing the Llama 3 architecture, and it's just completely irrelevant. A lot of that work is just completely irrelevant for how to serve DeepSeek, right? And so if you forked the system in December of 2024, then six months later you want to run DeepSeek, and you didn't get all those changes, right?
You have to implement all that yourself. And now we've gone through this whole effort to add wide EP and prefill/decode disaggregation. And just this week, we have a new flavor of disaggregation that we've seen from ByteDance with their MegaScale-Infer, which does attention/feed-forward network disaggregation.
And I'm sure there's going to be a bunch of work to make that all happen in vLLM. So one of the things I think is important about staying with these upstreams and working directly is that you're going to benefit from all the progress that's happening. We're not yet even close, in my opinion, to stability, as in the architectures stopping having improvements.
We're not at that point yet. We're still going to continue to see evolution in the architectures. There's still a lot of interesting research that's being done by academia and labs, et cetera, that we're going to have to pull in to these systems. And the more that we can push these into the upstreams and make sure that they're working together, I think the more everyone can benefit from things.
So I think that's another kind of key piece is we're not at a-- vLLM is not static yet. It's going to continue to evolve rapidly to support these new techniques. And as folks fork and run things specifically, they run the risk of having to reimplement all those things themselves, or deal with constant rebases and et cetera. Yeah, that's, I think, another piece of context for the value of the way that we're going about this development process.
MOFI RAHMAN: Yeah, so for those of you who are listening right now and are interested in getting involved in some way, we're going to link the llm-d community in the show notes, along with links to vLLM and Inference Gateway, so if you have any use cases, there will be links for you to join. But the thing I want to ask both of you-- and oftentimes, at this breakneck speed that Robert mentioned of things moving and changing, guessing what's going to happen in the future is a difficult task-- but I am going to ask both of you to put your speculation hats on for a minute and try to imagine a world five years from now, 10 years from now.
Again, five years sounds like such a foreign idea in this world that moves so fast. But let's say five years from now, in your mind, either an ideal case or whatever you want to think about, what does serving AI model look like in five years in the world of Kubernetes? The work you're doing now, what do you want to get the world to in the next five years? And what would that look like in your ideal version?
ROB SHAW: Yeah, it's definitely a difficult question. I think that a lot of what we're doing in llm-d is very much a transformer-centric set of optimizations. We've been talking a lot about KV caching. We've been talking a lot about techniques like prefill/decode disaggregation, or KV cache offloading, or prefix-cache-aware routing. All of these things are taking the view of, how do we best route requests, and manage this KV cache, and exploit the fact that there is a KV cache in the cluster?
A lot of the optimizations are really-- come down to the fact that this KV cache state management is fundamental to the problem. And so I think that at-- if the transformer architecture continues to be something that is frontier with the model architectures, I think that this KV cache management will still be something that's fundamental and potentially push to even more and more extremes.
The other thing we haven't really talked about as much today is the multimodality of the models. I think that we will see more of this in the future, and I would be shocked if there's not disaggregation associated with splitting up those models into smaller services to deal with things. But those are, I think, two overall thoughts about how things could really transform fundamentally. I think that if the transformer architecture continues to really dominate, I think we will continue to see optimizations associated with managing that KV cache being something that gets pushed on further and further, since it is so fundamental to the problem.
To the extent that the architectures change, I think we will see something that looks quite different than llm-d and what vLLM looks like today. And we'll have to, of course, take those new techniques and bring them into the systems that we're leveraging today. So probably not the best answer, but that's at least a little bit like how I think about it. As a model server, we really take the inputs from the models themselves and try to do our best to serve them.
And so it's definitely something I really have my eye on, is how the architectures are changing. We've seen some things over the course of the past couple of years, with things like Mamba, as an example, or things like diffusion models, which are potential other architectures that could really change things if they do become more standard in terms of how to best serve these. So I think this is an area where we'll still see a lot of experimentation from model vendors, and we'll have to make sure that we follow up with them as new architectures come out, but, yeah, Clayton, go ahead.
CLAYTON COLEMAN: Rob, it's great to hear you say that too because I'm going to go even broader. And let's say that, as always, I'm conscious that I'm only human and that I might get some of this wrong, but my guess would be, five years from now, the best and most important models are going to be a mix of open and closed innovation, but I think they're going to tilt towards open.
A couple of years ago, people were a little skeptical that there was going to be any space for open-source models. All of those people were fundamentally wrong. I was saying the other day, I'm excited because my guess is that the state of the art in OSS for running models efficiently is probably pretty close to the state of the art at scale.
And I don't have any deep knowledge other than just reading the tea leaves, but never underestimate a whole bunch of people optimizing their hardware to get the best performance out of it. If there's anything that I've seen, if there's money on the line and you can make something cheaper, making inference cheaper by optimizing it is going to be a trend that is going to mostly happen in the open.
There will be closed elements to some of these models, and people will continue to come up with new algorithms, but I'm pretty optimistic that the open ecosystem is going to run models. And not only that, it is a technology that we're all going to have access to, because there are more people who are interested in contributing and writing papers, and who are starting their own companies, who come with techniques that are no longer state of the art. So I think it's going to be pretty-- I think it's going to be a very big, open ecosystem.
And I think the other-- to match that, I think we're going to see hardware change. The way that people build servers for microservices, I think we're going to start seeing that the needs of the very large models are going to create some differences in how we think about what machines look like and how they're interconnected, faster networks between machines, more parallelism. And that's going to need people writing software that optimizes all of that stuff. And it's going to need orchestration that distributes it across many machines.
So I'm pretty confident that Rob and I have kind of a long ramp of features and capabilities. The future is open, and it's going to be built on top of Kubernetes, and the evolution of both Kubernetes, and vLLM, and all of the other technologies in the ecosystem. And you will probably still recognize the world of today: five years from now, I think you'll see some of the same elements, and the ones that have changed are going to be the scale and how much value we get out of it. So I'm pretty excited.
ROB SHAW: The other thing Clayton and I were talking about over this weekend as well is this trend towards agentic applications, which, obviously, is a huge buzzword, but in general, it's the LLM system becoming compound, with many pieces, whether it's tools or other smaller models that are going to do subtasks, et cetera. And so I think we'll continue to see this trend of agentic applications emerging as users try to customize the model to their specific use case, combining these mega-centralized models with their own enterprise data, or custom data, or with tools and other capabilities.
I think we'll continue to see this trend of compound AI systems starting to emerge, and we'll need to evolve the llm-d and model server roadmaps to make sure that we can work in those application patterns as well. That's a somewhat orthogonal component to the models themselves: how the models fit into a broader AI application, where there's a really robust ecosystem emerging and experimenting along those lines. And so we'll be seeking to collaborate with those types of developers over time to make sure that our components fit into those architectures as best as possible.
MOFI RAHMAN: It's not often I get to quote one of the guests during the interview, so I'm going to take that opportunity now. Last year, Clayton, you had a quote in your slides where you said, inference is the new web app. And this year, I think there is a revised version that says, agents are the new web app. So we have that being echoed here as well.
CLAYTON COLEMAN: Absolutely, and the future is big, and complex, and awesome. So there's much more exciting stuff to come.
MOFI RAHMAN: I thank you both for taking the time to talk with me about vLLM and llm-d. Anything that you want our listeners to take from this conversation as a last thought?
CLAYTON COLEMAN: It is never too early or too late to learn about ML. Two years ago, I was a novice, and now I get to hang out with really smart people like Rob, who continue to amaze me by the depths of complexity in this ecosystem. Don't be daunted. Don't be intimidated. Give it a try. Learn and then come help participate. Contribute back. That's all we need.
ROB SHAW: Yeah, and I'll just say, it's got to be the most fun place to be working right now. The pace, the amount of innovation, the speed at which research moves from a paper into a real production system is so fast, so you really feel like you're on the bleeding edge working in this area. So, yeah, we're really excited to have this community. And we're trying to develop llm-d in public. Every meeting is open. We've got an open Slack. So please feel free to jump in. Tell us your requirements. Get involved. We'd love to see you.
MOFI RAHMAN: Thank you so much. It was wonderful. Thank you for taking the time.
CLAYTON COLEMAN: Thank you.
ROB SHAW: Thank you.
[MUSIC PLAYING]
KASLIN FIELDS: Thank you, Mofi, Clayton, and Rob for that interview. I have been very interested in llm-d and the Inference Gateway. It's something that a lot of the folks working in open source Kubernetes, especially, have been telling me about [LAUGHS] as a way to make the AI workloads that folks are running on Kubernetes clusters more efficient. I haven't learned as much as I would like to about it yet, so I'm very excited about this interview. What were your top takeaways, Mofi?
MOFI RAHMAN: I think the top couple of takeaways would be about llm-d-- again, when I first heard about it, it seemed like, oh great, another open source project that is going to do a bunch of things and try to be one more standard in the list of standards. But one of the things that is interesting is that, instead of trying to have llm-d become its own software stack, it uses existing things that are quite great at what they do and finds the gaps. llm-d uses Inference Gateway and vLLM quite heavily underneath, plus a bunch of other things. It builds on top of them and tries to build a well-lit path, as Rob and Clayton would call it, giving people a way to have production-grade inference applications running on Kubernetes.
Now, where this is slightly different from some of the other attempts in the past is that the folks working on llm-d are the same people who also have a lot of contributions to vLLM and Inference Gateway, which means that when llm-d finds a gap in the inference stack, they can go back into those projects that they rely on and make the changes upstream, building the entire stack up rather than saying, oh, we rely on this other project that doesn't do what we need, so now we have to build it ourselves or just wait for them to build something. So having the same people with almost a high-level view of the whole stack, and also access to the individual projects, makes it possible.
KASLIN FIELDS: So what I think llm-d provides here, really, is an open-source tool for describing some of those best practices and some of the tools that exist in the open-source space for folks to run inference workloads. So it's interesting to see these things all come together in one package.
MOFI RAHMAN: Yeah, and this is one of the questions I actually also had for Clayton and Rob. The reason I had that question is that every company, every team, every midsize or large startup that is now serving large language models, open models like Gemma, Llama, Qwen, DeepSeek, all of them will, over time, build up some sort of techniques to optimize for cost, or performance, or just accuracy. What benefit does such a team get by spending the time to bring those out in the open and telling others about them?
And the answer Clayton and Rob gave there matches what I also think about open source: in some ways, it's not necessarily something you're doing for free. It's very self-preserving in some ways, where sharing your ideas with others helps you actually refine those ideas better, but also gives you access to a huge mindshare of other people that are doing similar but different optimizations that you can learn from. And by building out in the open and building together, you can build much more than your individual teams potentially could.
The other thing is, one of the quotes that was being used a lot is that the pace of innovation is breakneck. In the last two years, the innovations that have been happening are going to improve the quality of life for any AI model in the future, even for non-language models, right? If you're running your pipelines for reinforcement learning, or fine-tuning pipelines, all of these things, because there's so much more investment and engineering time being spent, all of those things are seeing improvements, because we just have more people trying things, more people contributing to these things.
KASLIN FIELDS: It took me until this very moment to realize that llm-d is probably a play on systemd, isn't it?
MOFI RAHMAN: Maybe. I think-- another quote they also used is that LLMs are a computer in themselves. So it makes sense to make--
KASLIN FIELDS: Yeah.
MOFI RAHMAN: And if LLMs are a computer, systemd is like the engine, or the brain, of your computer's processes. llm-d wants to be that integral part of your LLM serving. Another question I had for them-- I have actually been chatting with Clayton about this for a long time-- Clayton had a quote a couple of years ago, that inference is the new web app.
Earlier this year, he updated that quote to say agents are the new web app. So that quote, I think, has taken on a life of its own. It's going to evolve, and update, and upgrade over time, but in some ways, we are looking at a different type of application and, at the same time, a different type of engineering that is needed to run this application. When Kubernetes first came out, a lot of work was done to optimize for web apps, and a lot of work was done to optimize for stateful applications.
This is the same thing at a different scale: we're doing our optimizations to make sure Kubernetes is a good fit for large language models, right? It seems like a lot, but at the same time, it's not a net new thing. We have been doing this over the last 10 years of Kubernetes. We have seen what the industry and the people using Kubernetes are using it for, and optimized the underlying engine to make sure that they have a good time.
KASLIN FIELDS: I always like to bring it back to Kubernetes as a platform for running distributed systems, which maybe some people think is a little reductive sometimes, but the point there is that you have all of this hardware and you need to do things with it, and that's more true than ever in the world of AI, where the hardware accelerators are really at the core of being able to do exciting things with the technology.
MOFI RAHMAN: Yeah, so I think the biggest-- the last part of the takeaway, I would say, is that, again, two years can be a really long time in the world of AI, but it's also a fairly short time. We are still in the phase where we have a bunch of people trying a bunch of different things, so a lot of new projects and standards are being created all over the world. But I feel like in the next few months to two years-- again, time is, at this point, no longer a real thing anymore. So in a few months to a few years--
KASLIN FIELDS: For so many reasons.
MOFI RAHMAN: Yes, in a few months to a few years, we should start seeing a lot of, not necessarily consolidation, but more people learning from each other, where projects learn from each other to build out functionality that is similar.
KASLIN FIELDS: It's very interesting to hear the variety of features that are in vLLM. When I first started working with LLMs, vLLM was something that got in my way because it was-- I was trying to run different models, and the way that I would interact with them would be different based on whether they were using vLLM, or the Hugging Face one, whatever it's called.
MOFI RAHMAN: That's TGI, yeah.
KASLIN FIELDS: TGI, yeah.
MOFI RAHMAN: Yeah.
KASLIN FIELDS: And so that was one of the first things that I really wanted to dive into because it was causing me a lot of trouble, but hearing vLLM talked about not just as the thing that was causing me trouble, but as this open-source project that's having this fundamental impact in the way that we run LLMs really helps me to see the open source ecosystem that's developing around LLMs.
MOFI RAHMAN: Yeah, and even there, about a year and change ago, vLLM had its own kind of API spec, but then standardized on the OpenAI spec, which most of the industry has standardized on. Now you can serve something in vLLM, serve something in TGI, serve something in NVIDIA's NeMo framework, and all of them can provide you an OpenAI-compatible API, which means your application does not have to know what the underlying serving engine is. That gives you the abstraction needed to consume a model served via vLLM the same way you would consume OpenAI's models as a service.
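As a small example of what that compatibility buys you, the same OpenAI client code can point at a vLLM server just by changing the base URL. The service address and model name below are placeholders for your own deployment.

```python
# Sketch of the OpenAI-compatible abstraction: swap the base_url and the same client
# code talks to vLLM, another serving engine, or a hosted API.

from openai import OpenAI

client = OpenAI(
    base_url="http://my-vllm-service:8000/v1",  # placeholder: a vLLM server in your cluster
    api_key="not-needed-for-local-vllm",        # vLLM accepts any key unless auth is configured
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # whatever model the server is actually serving
    messages=[{"role": "user", "content": "Summarize what llm-d does in one sentence."}],
)
print(response.choices[0].message.content)
```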
Gemini also provides an OpenAI-compatible API. So now you can separate the application layer from the underlying model layer, right? These are the kinds of innovations that are happening-- not necessarily the topic of the conversation at hand here, but so many different things are fitting together and coming together to give you these nice abstraction layers at every step, so you can build the things you want to build.
KASLIN FIELDS: As was said in the interview, the focus has shifted from software development to resource usage and scale. [CHUCKLES]
MOFI RAHMAN: Yeah, and I think it's not even that big of a shift, but if we just think about where the money is-- what is most expensive in your stack? Before, writing software meant the cost of engineering was very expensive, but now serving the application has become very expensive again, because GPUs and TPUs are costly resources.
So now you have to look at, where can you optimize? Where do you spend more time optimizing? I think the interview itself was really packed with lots of valuable information, so I would ask people to listen to it again, pay attention, and take notes, because Rob and Clayton gave so many different technical details of how things work, and I will definitely go back and listen to it multiple times to actually absorb everything.
And the call-out to everybody else, and it is a call-out in the interview as well, is that if you are someone who is serving LLM applications and you are either struggling with optimization or have found some ways to optimize your stack, bring it out in the open. Talk to the community. Talk about whether llm-d's well-lit paths work for you, and if they don't, why they don't.
And maybe we can find some interesting use cases that the team is not thinking about. So if you are someone interested in LLMs, or serving LLMs, or learning more about it, the llm-d community, which we will link in the show notes, will be a great place for you to find other people who are thinking about the same problems.
KASLIN FIELDS: Yeah, get involved with the open ecosystem that's trying to help folks understand how to do these things. I feel like this was a buy one, get many kind of an episode, and there's so many different topics that we talked about. Thank you very much, Mofi. [CHUCKLES]
MOFI RAHMAN: That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes Pod or reach us by email at <kubernetespodcast@google.com>. You can also check out the website at kubernetespodcast.com, where you will find transcripts, and show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
[MUSIC PLAYING]