#240 October 31, 2024
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he has led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes WG Serving. Yuan authored three technical books and is a regular conference speaker, technical advisor, and leader at various organizations.
Eduardo is an environmental engineer derailed into a software engineer. Eduardo has been working on making containerized environments the de facto solution for High Performance Computing (HPC) for over 8 years now. He began as a core contributor to the niche Singularity containers project, today known as Apptainer under the Linux Foundation. In 2019 Eduardo moved up the ladder to work on making Kubernetes better for performance-oriented applications. Nowadays Eduardo works at NVIDIA on the Core Cloud Native team, enabling specialized accelerators for Kubernetes workloads.
Do you have something cool to share? Some questions? Let us know:
ABDEL SGHIOUAR: Hi, and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar.
KASLIN FIELDS: And I'm Kaslin Fields.
[MUSIC PLAYING]
ABDEL SGHIOUAR: In this episode, we spoke to Yuan Tang and Eduardo Arango, organizers of working group serving within the Kubernetes project. We spoke about this newly-formed working group trying to solve serving for AI and ML workloads, the challenges they are trying to tackle, and what the future looks like.
KASLIN FIELDS: But first, let's get to the news.
[MUSIC PLAYING]
Docker launched their official Terraform provider. The provider can be used to manage Docker-hosted resources like repositories, teams, organization settings, and more.
ABDEL SGHIOUAR: Tetrate and Bloomberg started an open collaboration to bring AI gateway features to the Envoy project. This effort is focused on building gateways capable of handling AI traffic specifically. The first set of features will focus on usage limiting based on input and output tokens (unlike traditional rate limiting for HTTP apps), API uniformity, and upstream authorization to LLM providers. The community is looking for ideas of features to build. You will find a link in the show notes with details.
KASLIN FIELDS: The CNCF is hosting a laptop drive at KubeCon CloudNativeCon North America 2024 to benefit two non-profit organizations, Black Girls Code and Kids on Computers. If you're interested in donating a laptop, you'll need to make sure the device meets the requirements. And you'll need to fill out a form. A link to more information is available in this episode's show notes.
ABDEL SGHIOUAR: There are four remaining Kubernetes Community Days events going on around the globe. Denmark, Accra in Ghana, Indonesia, and Floripa in Brazil are all hosting KC Days before the end of 2024. The Accra event is happening virtually. If you are able and interested in attending, make sure you check out these events to support local Kubernetes communities. You can check out the list of upcoming events at community.cncf.io.
KASLIN FIELDS: And that's the news.
[MUSIC PLAYING]
ABDEL SGHIOUAR: Hello, everyone. Today we are talking to Yuan and Eduardo. Yuan is a principal software engineer at Red Hat working on OpenShift AI. Previously, he has led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes working group serving. Yuan authored three technical books and is a regular conference speaker, technical advisor, and a leader at various organizations.
Eduardo is an environmental engineer derailed into a software engineer. I'm going to have to ask you questions about that. Eduardo has been working on making containerized environments the de facto solution for high performance computing, HPC, for over eight years now. He began as a core contributor to the niche Singularity containers project, today known as Apptainer under the Linux Foundation.
In 2019, Eduardo moved up the ladder to work on making Kubernetes better for performance-oriented applications. Nowadays, Eduardo works at NVIDIA on the Core Cloud Native team, working on enabling specialized accelerators for Kubernetes workloads. Welcome to the show, Eduardo and Yuan.
EDUARDO ARANGO: Thank you. Glad to be here and super excited to be next to Yuan. He is a role model for everyone. Three books, wow.
YUAN TANG: Thank you, Abdel, for inviting us. It's a pleasure to be here.
ABDEL SGHIOUAR: Awesome. Thank you for being on the show. So we are here to talk about workgroup serving, which is actually a workgroup that I learned about at KubeCon Paris this year from Clayton. So I was chatting with Clayton, and he was there. There were new working groups that were created. One is called serving and one is called device management. And we decided we have to talk to the people who are behind this.
So let's start with the obvious question. What is workgroup serving?
YUAN TANG: So basically, I can share a little bit about the introduction, the creation of the working group serving. So basically, Clayton, Colton and I had a discussion at KubeCon Europe this year. And we had really good discussions around model serving. We talked about some of the challenges solved by KServe, pain points and limitations of the current Kubernetes APIs, especially for model serving use cases.
For example, the KServe community developed the modelcar feature to pull the model from an OCI image. This reduces startup time, and it also allows for advanced techniques like prefetching images and lazy loading, which makes the autoscaling of large models more efficient. We also discussed how other ecosystem projects in the model serving space like [? Kito ?] and Ray are trying to come up with workarounds for similar challenges. There was really a lot of interest in the community to propose better primitives and better foundational pieces for model serving workloads that can benefit the broader Kubernetes and cloud native ecosystem, especially for hardware accelerated workloads.
So after KubeCon Europe, we officially established this working group and I joined the working group as a co-chair representing Red Hat and other relevant communities that I'm involved in, like KServe and Kubeflow.
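For readers following along, here is a rough sketch of what the modelcar pattern Yuan describes can look like in a KServe InferenceService: the model weights are pulled from an OCI image via an oci:// storage URI instead of being downloaded at pod startup. The registry path and model format below are hypothetical placeholders, and the feature has to be enabled in the KServe storage initializer configuration, so treat this as an illustrative sketch rather than a copy-paste example.

```yaml
# Sketch only: a KServe InferenceService whose model is pulled from an OCI
# image ("modelcar"), avoiding a full weight download on every pod start.
# The registry path and model format are hypothetical placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                      # assumed; use whichever runtime you deploy
      storageUri: oci://registry.example.com/models/llm-example:latest
```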
EDUARDO ARANGO: Yeah, I think I can add to that. And everything started at Paris, as you both already mentioned, and goes back to the same name, Clayton. So during the Contributor Summit, that is a small room at KubeCon where all the contributors gather during the big conference, in the afternoon sessions everything was around AI inferencing. So we spent one or two hours speaking about it. And at some point, Clayton said like, hey, I had a couple of conversations with Yuan from KServe and from Kubeflow. We should have a working group.
There, we kind of had a conversation about what would be the main differences between the working group serving and working group Batch that already existed. So we needed to differentiate what will be the role of working group Batch versus the role of the working group serving. And also knowing that there is a CNCF cloud native AI working group, so we have a lot of parallel working groups in the CNCF itself and in the Kubernetes community.
And that difference was well defined. And we decided that two weeks after KubeCon, giving everyone time to rest, we would start all the process to create the working group. So that's a quick history on how it got created. So I think it all goes back to KubeCon Paris for sure.
ABDEL SGHIOUAR: Got it. Before we move to the next question, I do actually have a statement, and I think I'm talking to experts who can correct me. I find that a lot of times, when you are on stage doing a talk or you are creating content and you want to address this topic of serving AI models, you would use the terms serving and inference interchangeably. Right? And one kind of dumb way I've been explaining this to people, so that people can wrap their head around it, is that you can think of serving or inference as the piece of software that does exactly what a web server does for a website.
Like to host a website, you need a web server and the content that would be served quote-unquote, "by the web server." So I'm wondering, is that like a close kind of comparison or close way to explain it? What are your thoughts?
YUAN TANG: It's a tough question.
EDUARDO ARANGO: Maybe I can start defining the concepts and Yuan can take it from there. So to me, there are two main workloads in the entire AI ecosystem. One is training and the other one is inferencing. So training involves a lot of distributed jobs, meaning the MPI Operator that we have in Kubernetes is very important for that. And during training is where you take a lot of data, and you create what we call a model.
Then inferencing is using that model to detect the patterns that it was trained to detect and provide an output. So that's the whole inferencing concept. But then the serving word, I think, comes from, OK, now we have this model that was trained, and we need to provide a full infrastructure to have it always listening to what we now call prompts, and to be able to scale it out. So I don't know. Yuan, would you add something to that?
YUAN TANG: Yeah. I think before the term AI, or GenAI, became popular, we were talking about machine learning and statistics, and everything was basically predictive, right? At that time it was very easy to explain training versus prediction. Now, with serving and inference and the GenAI term coming in, it's really difficult to describe the differences. But in general, we talk about serving requests, no matter if it's a model request, or even a database request, or a regular prompt kind of request.
So I think there's no strong differentiation in my opinion.
ABDEL SGHIOUAR: Got it. Got it. All right. So then what is working group serving trying to achieve? Like, what is the mission? Why does it exist?
YUAN TANG: Generative AI has really introduced a lot of challenges and complexity in model serving or inference systems, and we really haven't seen many of those challenges in traditional ML systems. And to meet these new demands and address those new challenges, this working group is dedicated to enhancing serving workloads on Kubernetes with a special focus on hardware accelerated AI or ML inference.
And in my opinion, it's very important to address the need and to come up with optimized solutions to handle compute intensive inference tasks using those specialized accelerators. And we hope that all the improvements we made at this working group would also benefit other serving workloads like web services or stateful applications. And any new primitives coming out of this working group could also be reused and composed with other ecosystem projects like KServe and [? Kito ?] and Ray.
And the mission of this working group serving is really to advance the capabilities and efficiency of serving on Kubernetes to make sure that they are well-equipped to handle evolving requirements of generative AI and maybe future workloads for serving. And we are operating within the Kubernetes community and governed by the CNCF code of conduct. It provides us a neutral place to work on necessary initiatives.
And with the leadership from the four organizations, namely Red Hat, Google, NVIDIA, and ByteDance, and a lot of participating companies from the community, we would also like to invite others from the community to join us and share your use cases so that we can solve the serving challenges holistically.
EDUARDO ARANGO: Yeah, from my point of view, I will define it as goals. So there are three main goals for the working group, and the first one is enhancing Kubernetes workload controllers. And it's basically, as Yuan mentioned, there are many companies joining these meetings, and the idea is to provide recommendations and better patterns for improving Kubernetes workloads and controllers. Right now we are all building operators to handle specific workloads for our companies, and the recommendations coming out of the working group will enhance performance in popular inference serving frameworks.
So right now, the working group has a GitHub repo where we are kind of collecting blueprints from the entire community to say, like, hey, this is how we run this model now, this is how I run the model. So people can compare and improve their workloads at their companies. We are also investigating orchestration and scalability solutions. A lot of the working group serving meetings have been spent around what we should measure for autoscaling. Should we measure GPU consumption or should we measure prompt size?
So this all falls into the category of research, or investigating which are the key metrics that we should monitor as a community to then build better orchestration, scaling, and load balancing ideas and projects. Right now, speaking about load balancing, for those joining the meetings, you will notice that we are talking about a new project, the LLM gateways. And this is something related to load balancing. So we are all investigating how to enhance workloads overall for LLM serving. And this would be the second goal.
And the third goal will be to optimize resource sharing. And this ties the working group serving with the working group device management. We want to have good communication between the two working groups, mostly because of the exciting new Kubernetes feature that everyone is talking about, DRA. So we want to create a list of needs, of things that are not possible right now in Kubernetes via the regular device plugin, and hand that over to the device management working group and tell them, like, hey, we need new features. Can you please prioritize that, because they are coming from the working group serving? So that will be the third goal.
ABDEL SGHIOUAR: Got it. So I do have a follow-up question. And feel free, if you don't have an answer, it's fine. We can skip it. Can you folks give us some examples of concrete limitations in Kubernetes that working group serving is trying to solve? Like just one or two examples, if you can think of something.
EDUARDO ARANGO: I can start with one, which is what I just said about DRA. Right now, defining multi-GPU, multi-node workloads in Kubernetes is almost impossible. DRA is going to provide solutions for that. But it's not ready yet, right? Doing that with device plugins, the tools that we have today at hand would be the MPI Operator, the LeaderWorkerSet, and device plugins. Joining those three, you will get close to it.
But for the workloads that are coming, like the new models that are so big that they need to be run on multiple nodes, we need the features that are being promised by DRA.
ABDEL SGHIOUAR: OK. And just for the audience to know, when Eduardo was talking about DRA, that stands for Dynamic Resource Allocation. And it's a whole new set of features coming into Kubernetes, some of which are being implemented and some of which are not yet, so just for people to understand what they are. But Yuan, you had something to add there?
YUAN TANG: Yeah, so there are a lot of challenges in different work streams. I can talk about that later. But for example, for autoscaling, autoscaling on device utilization or memory is really not sufficient for production workloads. And it's very challenging to identify and configure HPA to autoscale on model serving metrics. And it's also hard to measure how latency, throughput, and workload sharing interact with autoscaling, so that the deployment can achieve a target latency, and to find model server configurations that achieve better optimized performance.
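As a rough illustration of the HPA challenge Yuan mentions, here is what scaling a model server on a per-pod serving metric, rather than GPU or memory utilization, can look like. The metric name is hypothetical and assumes a metrics adapter such as prometheus-adapter is exposing it through the custom metrics API; which metric and target value actually work well is exactly what the working group is investigating.

```yaml
# Sketch only: an HPA that scales a model server Deployment on a per-pod
# custom metric instead of GPU or memory utilization. The metric name
# "inference_queue_depth" is hypothetical and requires a custom metrics
# adapter to be serving it.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"      # target roughly 10 queued requests per replica
```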
ABDEL SGHIOUAR: Got it. Yeah, and I think that there is a whole other topic that is probably worth its own episode, which is observability for LLMs. How you handle observability for LLMs is completely different compared to a web server or a back-end application.
So I was checking out the GitHub repository of the working group. You have three work streams: autoscaling, multi-host or multi-node, and the third one is orchestration. Can we talk briefly about these work streams?
YUAN TANG: Yeah, there's actually another one for DRA, which serves as a bridge in between working group device management and working group serving.
ABDEL SGHIOUAR: Got it. OK. Can we talk briefly about each of those? Just what are these work streams? I know that we covered these topics, but going a little bit more into details.
YUAN TANG: Yeah, maybe I can introduce the orchestration work stream and multi-host and then Eduardo, you can cover the autoscaling and DRA?
EDUARDO ARANGO: Yeah, sure.
YUAN TANG: For the orchestration work stream, we focus on identifying challenges in implementing high level abstractions for orchestrating serving workloads. So we are working closely with ecosystem projects like KServe and Ray. For example, we hosted dedicated sessions to collect the challenges they have solved, pain points, and use cases from those ecosystem projects. There were also interesting proposals from the community after those discussions.
For example, there's a blueprint API that proposes a new Kubernetes workload API for deploying inference workloads. The idea is to offer standardized APIs to define blueprints or preset configurations to instantiate serving deployments. However, it has a certain level of overlap with the KServe ServingRuntime APIs. So we decided to switch our focus to a new project sponsored by this working group, the serving catalog. For that project, we'd like to provide working examples for popular model servers and explore recommended configurations and patterns for inference workloads.
We also sponsored another project that Eduardo mentioned earlier, the LLM instance gateway sub-project, to more efficiently serve distinct use cases on shared model servers running the same foundation model, for example with different system prompts, LoRA adapters, or other parameter-efficient fine-tuning methods. For example, it can schedule requests to pools of model servers to multiplex use cases safely onto a shared pool for higher efficiency.
And for the multi-host or multi-node work stream, we focus on extracting patterns and solving challenges for multi-host inference. So we had discussions around various implementations for multi-host inference and their cost effectiveness and capacity optimizations. We also had a deep dive into the architecture and use cases of LeaderWorkerSet. LeaderWorkerSet addresses some common deployment patterns of multi-host inference workloads.
For example, large models will be sharded and served across multiple devices on multiple nodes. There wasn't really that much demand when we first started that work stream. But as models get larger and larger, serving them on multiple nodes really becomes necessary. Even though LeaderWorkerSet provides a good API to describe multi-host workloads, there are still a lot of challenges to be solved together with the working group.
For example, we wanted a better way to express network topology preferences for multi-host inference. And multi-host is also poorly supported by orchestration tools, but they are actually very actively working on it. For example, KServe is working on multi-host support now. Yeah, that's it for orchestration and multi-host.
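For context, here is a rough sketch of the LeaderWorkerSet shape Yuan describes: each replica of the model is a group of one leader pod plus several worker pods scheduled across nodes, with each pod requesting GPUs through the device plugin. The image names, group size, and GPU counts are hypothetical.

```yaml
# Sketch only: a LeaderWorkerSet for multi-host inference. Each replica is a
# group of 1 leader pod plus 3 worker pods, and every pod asks the device
# plugin for GPUs. Images and counts are hypothetical placeholders.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: multi-host-llm
spec:
  replicas: 2                   # two independent model replicas
  leaderWorkerTemplate:
    size: 4                     # pods per group: 1 leader + 3 workers
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: example.com/llm-server:latest
          resources:
            limits:
              nvidia.com/gpu: 8
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: example.com/llm-worker:latest
          resources:
            limits:
              nvidia.com/gpu: 8
```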
ABDEL SGHIOUAR: All right. And I'm going to get you, Eduardo, to talk about the other two work streams. But is the problem of multi-host serving there because you think that we're going to get to a point where the models will be too big, in such a way that they cannot fit on a single node anymore?
YUAN TANG: Yeah.
ABDEL SGHIOUAR: Is that the background?
YUAN TANG: Yeah, that's basically the hard use case for that. And there are also advanced use cases like deploying a disaggregated model serving configuration, which is really hard. But it's really-- it should be relatively common for large model serving in the future.
ABDEL SGHIOUAR: Got it. That's interesting. Because on the other side, what I am noticing is that cloud providers are able to provide bigger and bigger VMs. Right now, on Google Cloud, you can get something like 3 terabytes of memory and some ridiculous 196 cores on a single VM. But it's interesting to me that we are seeing that in the future, those will not be enough for a single model. The model will be bigger than what a VM can actually handle.
EDUARDO ARANGO: Yeah, I think, and this is coming from my company, that what matters is not the size of the node, but the number of GPUs that you have on that node.
ABDEL SGHIOUAR: Yeah?
EDUARDO ARANGO: So for some models, to run them with low latency, you require, right now, let's say an average big model requires four GPUs. And you can fit that on a node. But we will get to a point where we will need multi-node, because one single model requires eight GPUs or more.
By the way, the reason I'm saying this is because we need low latency. You can run LLaMA on your laptop, right? The thing is, it will take three minutes for it to start responding back. On a production system, you don't want that. You want fast responses.
ABDEL SGHIOUAR: Interesting, OK. So what's the actual limitation for attaching multiple GPUs to a node? What is the technical challenge there? What is the technical limitation, if you want?
EDUARDO ARANGO: Physical limitations? Depending on the architecture of the node, you can get from four to eight GPUs. I know Blackwell itself is going to be a GPU node on its own, so it's not like what we were used to, racking up GPUs. It's more like the entire node now is a GPU. We are moving to a different type of architecture.
ABDEL SGHIOUAR: Got it.
EDUARDO ARANGO: So it is that, right?
ABDEL SGHIOUAR: Yeah. As I was asking the question, I realized I didn't ask it in the right way, because obviously, the physical limitation is the number of PCI Express ports you can have on a single server. That would be the obvious one. But then I was thinking around the lines of what you said, which is we're moving toward entire physical nodes that are GPUs, that have the GPUs built in. Is it common for people to do compute on a physical node and then GPU on a separate node, and have them talk to each other through some fast networking link? Is that a thing that is possible?
EDUARDO ARANGO: I have seen this in gaming, but for production system, no, you want to have everything together. I know that in gaming, if your laptop doesn't have a GPU, there is a way using--
ABDEL SGHIOUAR: External GPU.
EDUARDO ARANGO: Yeah, like a thunderbolt, you can have an external GPU. But for a production systems, you want to have everything as close together as possible.
ABDEL SGHIOUAR: As possible, OK. Cool. Got it. So can you talk a little bit about the other work streams, Eduardo, like the autoscaling and DRA?
EDUARDO ARANGO: Yeah. And I think talking about multiple GPUs is a good introduction to autoscaling. As you said, cloud providers are trying to provide better and better VMs as time goes by. And this means that Kubernetes will be creating new nodes on the fly. So the autoscaling work stream has been focusing on two key aspects. One is caching and the second is metrics, as I was saying before.
So LLM models can be quite big, right? In the working group serving meetings, we have heard people talking about a couple of gigabytes to hundreds of gigabytes. And this creates complexity in autoscaling. If the model is not cached in a way that it can be quickly used by the pod that Kubernetes is deploying, then you will start getting latency and the user experience gets degraded.
So you don't win anything from your cloud provider providing a new VM in two or three seconds if you cannot cache or move the model to that node so your pods can use it. So speaking about caching, strategies like having network-attached volumes provided via Kubernetes are among the ones we have been discussing at the working group.
And the second thing being metrics: should we listen to the hardware, GPU utilization or memory utilization? Or should we focus on what we are receiving, leaving the hardware as a black box? Like latency, tokens per second, size of the prompt, which is kind of an input, not an output, or a combination of both. This is being discussed right now for the new inference gateway project.
And it's about how we can do better load balancing. Should we listen to hardware? Should we listen to other metrics? This is in flight right now. And as Yuan said, we want people to join the conversation, come to the working group, and say, for me, in my company, tracking GPU utilization works.
Well, cool. That will help us as a working group. So we need input from the community on these types of topics. Because right now, we are arriving at the realization that, depending on the model and the use case of the model, you should track hardware, or you should track these soft layer metrics like latency and tokens per second. So that's the autoscaling work stream focus. And there are very interesting discussions there.
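A rough sketch of the caching idea Eduardo describes: keep the model weights on a shared, read-only volume that every serving pod mounts, so a newly scaled-up replica does not have to pull hundreds of gigabytes before it can serve. The storage class, sizes, and paths below are hypothetical.

```yaml
# Sketch only: model weights live on a shared ReadOnlyMany volume that all
# serving pods mount, so scaled-up replicas skip the download step.
# Storage class, sizes, and paths are hypothetical placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadOnlyMany"]
  storageClassName: shared-filestore    # assumed ROX-capable network storage
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: example.com/llm-server:latest
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true
  volumes:
  - name: models
    persistentVolumeClaim:
      claimName: model-cache
      readOnly: true
```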
And the last work stream is like a bridge work stream, because it bridges serving with the device management working group. That is what we call the DRA work stream. And basically, as we have been saying, it's gathering all the feature requests for DRA from the point of view of what we need to make serving better, and then championing that list at the working group device management meetings, right?
So we basically say, like, hey, we know that right now, for example, everyone in Kubernetes is talking about the 1.32 cut. So we ask, OK, what do we want from the serving side of the house that the folks at device management should prioritize before the 1.32 cut? So that's what the DRA work stream is. It's creating a prioritized list and championing it at the device management meetings.
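And for a sense of what those DRA feature requests are about, here is a heavily hedged sketch of requesting a GPU through DRA instead of the device plugin's opaque resource counter. The resource.k8s.io API was still in beta and evolving around the 1.32 timeframe, and the device class name below is hypothetical, so treat this as an illustration of the shape of the API rather than a definitive example.

```yaml
# Sketch only: a DRA ResourceClaimTemplate plus a pod that references it.
# The API version reflects the 1.32-era beta and may differ in your cluster;
# the device class name is a hypothetical one published by a DRA driver.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # assumed DeviceClass from the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: server
    image: example.com/llm-server:latest
    resources:
      claims:
      - name: gpu                          # consume the claim declared below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```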
ABDEL SGHIOUAR: Got it. And I remember that at KubeCon in Paris, I was talking with Clayton, and we specifically talked quite a bit about this problem you covered during the explanation of the autoscaling work stream, which is, how do you load balance based on the user inputs? Because for LLMs, your input is not a typical REST call or an SQL database query or something. It's basically text.
And one of the things that Clayton was mentioning was that it would be interesting if there is a way to load balance also based on the size, so you can send smaller inputs to smaller, more efficient models, and bigger prompts to bigger models to optimize for the time to get the response back.
So is this in the space of what you're also working on? Because you mentioned the LLM gateway, which is, I would say, almost a fork of, or really inspired by, the API gateway, but for LLMs, right? Can you talk a little bit more about that specifically? Because that's essentially the load balancing problem we're talking about here.
YUAN TANG: Yeah, maybe I can talk a little bit about ModelMesh first. Before we even introduced the instance gateway project from the working group serving, KServe actually had a project called ModelMesh. It's a very mature, general purpose model serving management and routing layer designed for high scale, high density, and frequently changing model use cases. It works very well with existing model servers, and it also acts as a distributed LRU cache for serving runtime workloads.
So even before large language models, we had a lot of use cases for traditional machine learning models. We need to handle the traffic and the routing between those different model servers, depending on their usage, density, and how frequently they change. So I just want to mention that it's not just a requirement for large models; traditional models have this use case as well.
But for the LLM instance gateway, the initial goal for the POC is to make sure it works well with LLM use cases, especially high density LoRA adapter use cases. But later on, it may be extended to support other traditional use cases as well.
ABDEL SGHIOUAR: Got it. Yeah, so I think that that's basically all the questions I had for you folks. Do you want to add anything? Where can people find you? Where can they find the WG serving? We'll make sure there is a link in the show notes for your GitHub repository. But how can people join if they're interested?
EDUARDO ARANGO: Sure, people can find me on LinkedIn as my full name, which is very long, Carlos Eduardo Arango Gutierrez, but I guess the link will be attached, and also on the Kubernetes Slack. But no, I don't have Twitter, don't support it. I don't want to be there. So yeah, that's where you can find me.
ABDEL SGHIOUAR: Awesome. Yeah?
YUAN TANG: So make sure you just join the mailing list and then you'll get invited to all the existing and future calendar invites, and make sure to join the Slack channel as well, because that's where most of the real-time communication happens.
ABDEL SGHIOUAR: Awesome, cool. Well, thank you for being on the show, folks.
YUAN TANG: Yeah, thank you for having us.
[MUSIC PLAYING]
ABDEL SGHIOUAR: So, Kaslin, how are you holding up for KubeCon? Are you excited?
KASLIN FIELDS: I am always excited for KubeCon. I walk into KubeCon and I'm like, wow, I'm home.
ABDEL SGHIOUAR: And then five days later, it's over. It's just like flies by.
KASLIN FIELDS: Yeah, and then I am destroyed and need to go take a nap.
ABDEL SGHIOUAR: Yes, yes. I call that post-conference depression. But yeah.
KASLIN FIELDS: It's a thing.
ABDEL SGHIOUAR: It's so engaging and you meet so many people, then it's sad when it's over, right?
KASLIN FIELDS: And of course, KubeCon is the CNCF's primary event and it has such a huge focus on open source. So I'm excited that today's topic was about working group serving.
ABDEL SGHIOUAR: Yeah. I don't even remember how this came to exist as a topic that we wanted to talk about. I don't remember where. I think it came from our discussion with Tim Hockin and Clayton.
KASLIN FIELDS: I think it did, yeah, when we were working on the 10-year anniversary episode.
ABDEL SGHIOUAR: Oh, yes. Yes.
KASLIN FIELDS: We were talking about yeah, with Tim Hockin and Clayton Coleman. In those episodes, we talked about the new working groups that were spinning up to support specifically AI-oriented workloads, implementing new functionality in Kubernetes to help Kubernetes users better manage the underlying hardware, since hardware is so important to those types of workloads.
ABDEL SGHIOUAR: Yeah. And so the two working groups are serving and device management. So serving is the one we covered today, which is inference essentially. And device management, we will cover it next year.
KASLIN FIELDS: So serving is really focused on the specific type of workload. Device management is all about that challenge like we were talking about of managing the hardware better through Kubernetes. But inference is specifically digging into what do inference workloads look like right now, and what could we do with Kubernetes to make it better?
ABDEL SGHIOUAR: Yeah. And so I remember from the conversation, one of the challenges that Eduardo specifically was talking about that they are trying to solve is multi-host serving. So if you have a gigantic model and you have a physical limitation in terms of how many individual GPUs you can attach to a single node, can you split that model across multiple nodes? So that's just one of them.
There was a lot of other conversations. But this is specifically something that stuck with me, because I never really thought about it, like a distributed machine learning model, essentially.
KASLIN FIELDS: Exactly. I love it when a look into something in the Kubernetes world comes back down to the roots of Kubernetes is a distributed system. There's a whole bunch of computers running workloads. And so how do we do that in the most efficient and most useful ways? So that is a really cool aspect that I also had not thought of. But if you're running an inference workload on Kubernetes, then of course, it's running on a distributed system. So you need to think about how to make the most efficient and best use of that distributed system.
ABDEL SGHIOUAR: Exactly. Yeah. And preparing to record this, we just realized that working groups, and this is something you will have to teach me, I guess.
KASLIN FIELDS: Yes.
ABDEL SGHIOUAR: So in the working groups, we don't talk about leads. We talk about organizers.
KASLIN FIELDS: Apparently, so as I have talked about a number of times on here, I am deeply involved with the Kubernetes community. I'm a lead of a special interest group myself. But working groups are a little bit different from special interest groups. So special interest groups are the core tool that we use to split up the work of maintaining the Kubernetes project and building the Kubernetes project.
So we have special interest groups for networking, and docs has its own special interest group, infrastructure, testing. We have really big areas that are covered by special interest groups. But working groups tend to spin up when there is a topical thing that the project needs to think about. So of course, serving workloads makes sense right now since we're seeing a big increase in the number of people wanting to run those types of workloads on Kubernetes. And device management is a topical thing for us to cover because we need to better handle those devices for those types of workloads.
And the project hasn't really done that much before. There were tools, of course, but this is another level. So we needed focus groups to think about what's going on with these things, and spun up these working groups.
And so working groups are a lot like a SIG in that they have a specific area that they're looking at, but they're like I said, topical. So they're about something that's going on currently. And the idea is for them to eventually roll into a SIG or become a SIG themselves. They aren't going to last forever. And as such, they don't actually own any code generally, in the code base of Kubernetes. Any code that they produce is going to be owned by a special interest group because those are going to continue existing regardless of what happens. So there's a maintenance plan in place in that sense.
So they operate very similarly to SIGs in some ways, but also differently. So this organizers thing, I had always thought of the leads of working groups as in SIGs, we call them tech leads and co-chairs. I figured that they used similar language in working groups, but maybe not. Maybe they use the word organizers.
ABDEL SGHIOUAR: Yeah. And also, the other thing to keep in mind from my understanding is that working groups also can potentially span across multiple SIGs.
KASLIN FIELDS: Yes, they generally do.
ABDEL SGHIOUAR: So yeah, they work with multiple SIGs too, because yeah, I don't think serving is a specific special interest group problem. I think it's something that multiple SIGs will be involved in "trying to solve," quote-unquote.
KASLIN FIELDS: And we'll include a link in the show notes to the GitHub repo for the working group. And if you check that out, it actually says which SIGs they're most closely aligned with and work most closely with. So all of their work theoretically will go into those SIGs, rather than being owned by the working group. And then someday, the working group will probably dissolve and those SIGs will own that code instead, unless something changes and we decide that we need that working group forever, and it becomes its own SIG.
ABDEL SGHIOUAR: Yeah. And while you were talking, I was looking at the GitHub repository, and I realized that they are a sponsor of a subproject called the LLM instance gateway, which we covered in the news. But this is something I'm super excited about for probably next year. We should have an episode about it.
KASLIN FIELDS: They're a sponsor of the subproject? That's also very interesting.
ABDEL SGHIOUAR: There is a subproject called LLM instance gateway. Yeah, it's listed as a sponsor. But I've been following the LLM instance gateway effort for a while because, yeah, it's basically what gateways are but for LLM specific workloads. So there are some interesting things going on there. And I think we should eventually at some point cover that.
KASLIN FIELDS: Yes, that sounds very interesting. So I mentioned the SIGs and the working groups, and subprojects in Kubernetes tend to be a part of a SIG. SIGs have very broad scopes and so they have subprojects to bring that scope down a bit into something a little bit more actionable. So it sounds to me like what happened here is the working group exists, of course, to address this topical issue. They identified a need. They worked with a SIG to create a subproject is what it sounds like.
ABDEL SGHIOUAR: Or they are sponsoring. That's what it says, like sponsored. So I don't know what sponsored means.
KASLIN FIELDS: But I would imagine sponsored means we figured out that it needed to happen. And so we're helping to spin up this subproject.
ABDEL SGHIOUAR: Could be.
KASLIN FIELDS: Because it's not like there's money involved. It's open source.
ABDEL SGHIOUAR: Yeah. Yeah. I mean, specifically, the LLM instance gateway is an in-development project, so there is pretty much nothing-- I mean, there is code, you can build it yourself, but there isn't really much you can use yet. So the more you look, the more you learn, I guess.
KASLIN FIELDS: Is it a separate project or is it part of Kubernetes? And this always a good question.
ABDEL SGHIOUAR: It's actually listed under Kubernetes SIGs.
KASLIN FIELDS: OK, yes.
ABDEL SGHIOUAR: That's where it's listed.
KASLIN FIELDS: All right. That makes sense.
ABDEL SGHIOUAR: So that's, yeah. Yeah, it was pretty cool to talk with Yuan and Eduardo. I learned a lot.
KASLIN FIELDS: Yes. And we hope that you all enjoyed listening to the episode and learning about what the community is doing to support serving workloads in the distributed system that is Kubernetes.
ABDEL SGHIOUAR: Awesome. Well, thank you very much, Kaslin.
KASLIN FIELDS: Thank you, Abdel.
[MUSIC PLAYING]
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes pod or reach us by email at <kubernetespodcast@google.com>. You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player, so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
[MUSIC PLAYING]