#245 January 15, 2025
John Belamaric is a senior staff software engineer at Google who has been involved in Kubernetes since 2016, and is currently a co-chair of both SIG Architecture and WG Device Management.
Do you have something cool to share? Some questions? Let us know:
KASLIN FIELDS: Hello, and welcome to the Kubernetes Podcast from Google. I'm your host, Kaslin Fields.
ABDEL SGHIOUAR: And I am Abdel Sghiouar.
[MUSIC PLAYING]
KASLIN FIELDS: And Happy New Year, everyone. Happy 2025. To kick off the new year, we're talking with John Belamaric. He's one of the co-leads of Working Group Device Management in Open Source Kubernetes. We'll talk about the awesome features the group is developing and what problems they're trying to solve. But first, let's get to the news.
[MUSIC PLAYING]
ABDEL SGHIOUAR: KubeCon Japan and India both have CFPs open now. KubeCon Japan will be the first KubeCon event held in Japan on June 16 and 17. The CFP closes for that event on February 2.
KubeCon India will be the second KubeCon event held in the country. The previous KubeCon in India was held in December 2024 in New Delhi. KubeCon India 2025 will be held in Hyderabad on August 6 and 7. The CFP for India closes on March 23.
KASLIN FIELDS: And that's the news.
I am excited today to be speaking with John Belamaric. He is a Senior Staff Software Engineer at Google who has been involved with Kubernetes since 2016 and is currently a co-chair of both SIG Architecture and Working Group Device Management, which we're going to be talking about today. So welcome to the show, John.
JOHN BELAMARIC: Thank you, Kaslin, for having me. I'm excited to be here.
KASLIN FIELDS: So you've been around with the community for quite some time. 2016 was actually the first year that I went to a KubeCon. Did you go to Seattle that year?
JOHN BELAMARIC: Yes, I did. Yes.
KASLIN FIELDS: Nice.
JOHN BELAMARIC: I did. The election night, KubeCon. Yeah.
KASLIN FIELDS: Oh, man, that was a thing.
JOHN BELAMARIC: Yes. Yes, that was my first KubeCon. And I was involved, at the time, with CoreDNS. I still am, but that was my sort of first. The company I worked for at the time was a DNS company, and so we got involved and brought CoreDNS to Kubernetes.
So I had a Lightning Talk at my first KubeCon, which I had three minutes, and I was a little nervous. So I was so fast that I finished it in less than one, which I'm sure was completely useless to everybody in the audience. But you know, that's how it goes.
KASLIN FIELDS: That almost never happens with Lightning Talks. They always run over. So, I don't know, props to you for that, I think?
JOHN BELAMARIC: I don't think it was a good thing.
[LAUGHTER]
KASLIN FIELDS: That is super cool. And I could go on and on talking about the 2016 KubeCon. What an adventure it was. But let's talk about what's going on these days with Kubernetes. So we had our 10-year anniversary of the project last year in 2024. And of course, AI, AI, AI, all the things. And so I hear that is related to what you're doing in Working Group Device Management. Could you tell me a little bit about the group and what a working group is too, so that we can bring folks up to speed if they haven't heard of it?
JOHN BELAMARIC: Absolutely. That's exactly right. So working group in the Kubernetes community-- the way the Kubernetes community is organized, we have Special Interest Groups, SIGs. Those actually own the code. So it's just a bunch of people who work together in Open Source.
But sometimes there's a problem that spans multiple SIGs. So our SIGs are things like Node, which focuses on Kubelet and related APIs, and API Machinery, which focuses on the actual API server and all of that, and Scheduling, which focuses on the scheduler. So it's kind of broken up by components that are part of Kubernetes.
But sometimes there's a feature that you want to implement that's going to span and touch many of these different components. And so we have another concept called working group. And a working group is sponsored by multiple SIGs. And it has a specific short-- relatively short, meaning usually a couple years-- lifespan that is sort of gated by, we finish this feature and we're done and it dissolves. And the changes made into the code by that feature, by that implementation, are still owned by the SIGs that sponsored it. So that's what a working group is. And--
KASLIN FIELDS: Perfect.
JOHN BELAMARIC: --with devices-- well, if we roll back, say, to KubeCon Chicago, which was in November of 2023-- and we could roll back even further than that, but that's kind of where it started to reach a fever pitch, where people were like, we need to do things with AI, which means we need to do things with GPUs and accelerators.
And the solution that some folks-- particularly my co-chairs of Working Group Device Management, which didn't exist at the time, Patrick Ohly of Intel and Kevin Klues of NVIDIA-- had put together for this was called Dynamic Resource Allocation. Patrick had been working on it for several years. It originally came out of a different use case, but with similar needs for more flexibly managing devices, or rather resources, on a node. And so there was a lot of excitement at KubeCon in Chicago around DRA.
Unfortunately, that iteration of DRA also caused some anxiety among some of the SIGs, in particular the Scheduling and Autoscaling SIGs, because the specific design and implementation was super flexible, which is great, except when it's not. What it meant was that the autoscaler couldn't look at a pod spec or a deployment spec and a node and easily identify whether a new node it wants to create would satisfy that pod spec, because of the level of flexibility built into that DRA functionality.
So this kind of put a bit of a halt on it, and we started discussions about how we might revise that. And that's around when I got involved. I got involved post that KubeCon, so January of the next year.
KASLIN FIELDS: I did not expect this to take a turn into dynamic resource allocation. Very interesting.
JOHN BELAMARIC: Exactly, right? So what we decided is that-- there were a couple of things. One is we needed to revisit DRA and how it is designed and structured, such that it meets the needs of the autoscaling community. And the scheduling community also had some concerns about certain aspects of it.
KASLIN FIELDS: So the DRA, to give folks a little bit of context around this, I guess, DRA, or Dynamic Resource Allocation, is a feature that was a primary feature within 1.33, the last release of Kubernetes that just came out in December of 2024. And something that we talked about in the release episode was that it was kind of a revision of DRA.
And I actually didn't know that going into that release episode. So this is really interesting to me to hear a little bit about the origins of DRA and it being very flexible and causing issues with the autoscaler and scheduler. And so now a new version of it is out in 1.33, right?
JOHN BELAMARIC: It's 1.32. 1.33 is the one we're starting.
KASLIN FIELDS: 1.32. Darn it. 1.33, shadow applications are closing the--
JOHN BELAMARIC: Yes. Sorry, I should have stopped you.
KASLIN FIELDS: The release team is spinning up. So 1.33 will be next. 1.32. [CHUCKLES]
JOHN BELAMARIC: Yes, we just released beta of DRA in 1.32, and that beta is the revised version that came out of our discussions post the Chicago KubeCon. So after the Chicago KubeCon, we all worked together, and it came out of this working group. So kind of going back to, we have the Chicago KubeCon, these concerns were raised. The Kubernetes community met.
And then we met at the next KubeCon. We discussed a lot of things offline, of course, or online in Slack and meetings and everything. And what we realized is that DRA had been designed out of the Batch Working Group, and that there were a lot of use cases, around things like inference, that meant we needed something a little broader than Batch.
We also realized, in our discussions at KubeCon EU, following the Chicago one, that there were other things than accelerators that needed to come together to solve our workload problem. So for example, sometimes your accelerator needs to talk to other accelerators over the network, and you want that accelerator and that network interface card to be on the same PCI bus.
Because if you look at say-- NVIDIA, for example, has technology that if they're on the same PCI bus, they can talk directly to each other and bypass the CPU, and it's a tenfold improvement in the I/O performance between them. So if we're just looking at accelerators, then we're not solving the whole problem.
And so we kind of took those few pieces of information and said, hey, we really need a new group that's going to understand all of these use cases around the different types of workloads that use these specialized devices, and try to come up with a plan that works for the autoscaling community, works for the scheduling community, and, of course, works for meeting the needs of those workloads. And that's how Working Group Device Management was born.
KASLIN FIELDS: It makes a lot of sense that your description of working groups was so clear, because my first introduction to working groups was the Long-Term Support Working Group and all of the shenanigans and drama in the community around that. And a primary feature of the Long-Term Support Working Group is that it is meant to be a limited-time thing. So that aspect of it was very much emphasized.
But Working Group Device Management has all of these connections between the different SIGs and the different areas of Kubernetes, and that's a more primary part of it than it is in Long-Term Support, because Long-Term Support is talking about the whole project and how we deliver it to folks. So it doesn't have as much of that cross-SIG piece of working groups. So you've got Batch. Was Batch a working group at that point? Yeah. It still is, right?
JOHN BELAMARIC: Is it a SIG now? I think it's a-- I'm not sure if it's a SIG or a working group now, but I think it might be a SIG now because it may own certain code.
KASLIN FIELDS: I think it might be.
JOHN BELAMARIC: But I don't recall. But Node, Scheduling, Autoscaling, Architecture, and Networking as well are all involved in the Working Group Device Management, especially Networking for the NICs, the Network Interface Cards. And in fact, we're able to solve some of the low-level, multi-network concerns for Kubernetes with DRA as well because we're allowed to attach different devices to a pod. And so devices, network interface cards, attached to a pod means you get access to another network. So it kind of solves some of the problems that Multus, for example, is used for today.
KASLIN FIELDS: And that's a big swath of the project. That's a good chunk of the SIGs.
JOHN BELAMARIC: Yeah. Well, it's about the abstraction, right? Our physical machines have this abstraction already. Our kernel has this abstraction already of devices that can be attached and put in these different types of namespaces. In some sense, we're just kind of leveraging the logical constructs that the kernel gives us of this abstract device. And going back further, what Unix gives us, that everything's a file.
But anyway, the mission, if you go look at the charter for the working group, is to enable simple and efficient configuration, sharing, and allocation of accelerators and other specialized devices. So what that means is, as a pod spec author, I can put in some selection criteria, and that goes out and finds the right type of device based on that selection criteria and allocates it so that other people can't use it and attaches it back to my pod.
At the same time, I might have specialized configuration I want to attach to that. It's not just the selection of the right type, but I want to configure it in a certain way, and there might be certain ways that my administrator allows it to be configured and not. So it kind of tries to get all of these different pieces into place.
KASLIN FIELDS: It's almost like when you try to implement new functionality with new hardware in a distributed system, you have to deal with the whole distributed system. [CHUCKLES]
JOHN BELAMARIC: Yeah, exactly. Exactly.
KASLIN FIELDS: All the components of it, basically, because the system is made up of hardware. And so when you change the hardware--
JOHN BELAMARIC: And that's the thing. So this is kind of like, to me-- and maybe I'm biased because I'm one of the co-chairs here. But to me, one of Kubernetes' fundamental challenges, moving from the sort of traditional web app-type of environment to AI environments, was that our first 10 years, or maybe the majority of the first 10 years, was spent thinking about fungibility of hardware, making hardware as invisible as possible.
And we had certain workloads that we used as our primary use cases, and they would just consume that hardware in any old way. And that works great for those type of applications. It doesn't work as well for these training and inference workloads, which have very specific hardware requirements and have very expensive hardware that's scarce.
And we want to get the most utilization and the most utility out of it that we can. So fundamentally, the goal of Working Group Device Management is to change Kubernetes' relationship with the hardware-- to change how Kubernetes understands the hardware and makes the hardware available to our users.
KASLIN FIELDS: That is a big ask.
JOHN BELAMARIC: It is. It really is a big ask. Right now, our current effort is DRA. And that actually solves a substantial part of that problem, but not all of it. So we may have new things that come in after DRA. But I can't look that far ahead yet.
KASLIN FIELDS: Yeah, I can see how DRA would be an important part of that, the dynamic allocation of these new types of hardware resources. You're going to have to solve a lot of problems with the way that Kubernetes looks at those resources in order to dynamically allocate them. So that makes a lot of sense as a base level. But I do think there is going to be more, like you're saying here.
JOHN BELAMARIC: Yes.
KASLIN FIELDS: We'll see what that is. I do also think it's funny sometimes, when we talk about the advent of AI workloads and how that's different from the web world that Kubernetes was originally built in, that normally in technology, I think you tend to go toward a world of more abstraction. [LAUGHS]
But it feels kind of like we've taken a step backward here in that respect, because in the web application world, we could abstract the hardware more, whereas these AI workloads are so specific in the kinds of hardware that they need, and how they use that hardware, that they often want to work at a very granular level with the actual hardware that you have. And so we're kind of going backwards there and showing all of that detail again to the users.
JOHN BELAMARIC: Yes. I think that's partly, though, a function of the newness of the hardware. So CPUs have been around a long time. Memory had been around a long time before Kubernetes came along. So making it more fungible, more commoditized was pretty easy to do. People had been working on that problem for many years.
We are not at that state yet as an industry with our accelerators. They're not fungible. They're not even very equivalent at times. So even if you can somehow represent them, the workload won't run the same on two different ones-- on this one versus that one. Their performance is different. So many characteristics are different. We don't even know how to measure it at times.
And so we don't know how to measure the utilization sometimes. We're just not at a state yet where the abstractions can be as useful as we'd like. Now, we are building abstractions, and they are useful. But the parts that we're abstracting are the scheduling, selection, and configuration pieces. It's like the orchestration layer we're abstracting, whereas with CPUs, we can abstract at a little bit lower level. I don't know if that makes complete sense, but it's kind of how I think about it.
KASLIN FIELDS: Yeah, I think that's a helpful way of looking at it. So the Working Group Device Management has a very big mission in terms of its meaning and impact to the project and a wide range of areas. There's a lot of work going on with the dynamic resource allocation project, which is in 1.32, but there's continuing work on that. What are some of the work streams? How does the working group kind of operate?
JOHN BELAMARIC: I mean, DRA is really our primary work stream, but you could think of that in-- we break that into many pieces, right?
KASLIN FIELDS: I'd imagine you break it down. It sounds pretty big.
JOHN BELAMARIC: Exactly. It's pretty big. So with DRA, there are a few aspects that you can think about. Like one is the API for how device vendors specify their devices. So traditionally, we had device plugin. And device plugin just says, here's a string and a count. And that's it for the node. It's an extended resource, we call it, where it says, I have nvidia.com/gpu, I have eight of them, and that's it.
DRA widens that API to allow a lot more detail. And on top of that, we allow very sophisticated models of how to represent devices. So the simplest thing we started with in 1.32, which is going to get much more sophisticated as we move on, is that instead of publishing, here's a string and a count, we publish eight structures-- one for each device-- and each has a bunch of attributes.
And those attributes are of varying types. So you can have model, name, vendor name, et cetera, et cetera. But you can also have things like capacity, how much memory this thing has. So that's how the users-- or rather, how the vendors publish information about that.
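(For readers following along, here is a rough sketch of one of those published structures in the 1.32 ResourceSlice API. The driver name gpu.example.com, the node name, and all attribute values are hypothetical placeholders, not any real vendor's driver.)

```yaml
# A minimal sketch of what a DRA driver might publish per node in Kubernetes 1.32.
# Everything below apiVersion/kind is hypothetical example data.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu.example.com
spec:
  driver: gpu.example.com        # hypothetical driver name
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:                       # one structure per device, not just a count
  - name: gpu-0
    basic:
      attributes:                # typed attributes: string, int, bool, version
        model:
          string: "EXAMPLE-100"
        driverVersion:
          version: "1.0.0"
      capacity:
        memory:
          value: 40Gi            # quantities, like how much memory the device has
```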
Then there's another aspect which is, how do users ask for those things? So that's our claim, our resource claim API. And it goes and allows the user to say, I need this particular model from this particular vendor, or it can be more flexible and say, hey, I need any model from that vendor as long as it has more than 8 gigs of memory. And so this is a way that we can allow some flexibility, or allow the user to underspecify, which leaves room for the platform, Kubernetes, to satisfy the request in different ways.
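(A sketch of that kind of underspecified request in the 1.32 ResourceClaim API. The CEL expression selects on the capacity the driver published; the class and driver names are the same hypothetical placeholders as above.)

```yaml
# A sketch of an underspecified claim: any device matching the class with
# more than 8Gi of memory satisfies it, leaving the platform room to choose.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: big-enough-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # hypothetical DeviceClass, sketched below
      selectors:
      - cel:
          expression: device.capacity["gpu.example.com"].memory.compareTo(quantity("8Gi")) > 0
```

Because the claim only states the constraint, the scheduler is free to pick whichever matching device is available, which is the underspecification being described here.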
So when you have room for that to happen, you have people with opinions. In particular, your cluster administrators have opinions there and want to control which choices get made first. So we have something called DeviceClass that helps with that, and there are future things coming to help with that as well.
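(And a sketch of the DeviceClass John mentions, which is how an administrator can prepackage a selection, and optionally configuration, for users to reference by name. Again, the driver name is a hypothetical placeholder.)

```yaml
# A sketch of an admin-owned DeviceClass that pre-selects all devices from
# one hypothetical driver. Users then just reference the class by name.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
  # Admin- or vendor-supplied configuration can also be attached here, e.g.:
  # config:
  # - opaque:
  #     driver: gpu.example.com
  #     parameters:
  #       sharing: time-slicing    # hypothetical driver-specific setting
```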
So resource claims are the way the user specifies what they need. DeviceClass is the prepackaged version of what a device might look like from an administrator's point of view, so that the administrator can attach configuration, for example. Resource slices, we call them, are where the driver-- the vendor-- publishes the information.
And then we have an allocator, part of the scheduler, that's going to go and satisfy or resolve those claims against the available set of devices out there. And then we have a driver that runs on the node that when the allocation is made and the pod lands on the node, it attaches the device to the pod.
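(Closing the loop, a sketch of the consuming side: the pod references the claim, the scheduler's allocator resolves it against the published slices, and the on-node driver attaches the allocated device when the pod lands. The image name is a placeholder.)

```yaml
# A sketch of a pod consuming the claim from the earlier example.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: big-enough-gpu        # the claim sketched above
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      claims:
      - name: gpu                            # container consumes the claimed device
```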
So those are four different areas, at least, and each one of them has its own set of KEPs in development. So for the resource claim side, we have the basic claim we started with, but in 1.33, we're likely going to have an alpha feature which allows you to say, you know what, you can satisfy this claim by giving me one of this type of device, or two of this other type of device, or four of this other type of device-- which allows you to solve some of the obtainability problems we have in Kubernetes with GPUs.
But that's a separate-- and that only touches the claim side. You don't have to touch the driver. You don't even have to touch how you publish those resources. It's just, the user can get flexibility in how they specify their claims. So that's one area. I'll pause there because I can go on and on, and you may have questions.
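(The alpha feature John describes here was still being designed at the time of this episode. Under the prioritized-list proposal, a claim lists acceptable alternatives in order of preference, very roughly along these lines. Treat this as an illustrative sketch of the idea, not the final API: the class names are hypothetical, and the field shapes were subject to change before release.)

```yaml
# Illustrative sketch only: an ordered list of acceptable alternatives,
# per the alpha prioritized-list design discussed in the episode.
# "big-gpu.example.com" and "small-gpu.example.com" are hypothetical classes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: flexible-accelerator
spec:
  devices:
    requests:
    - name: accel
      firstAvailable:              # try each subrequest in order; take the first that fits
      - name: one-big
        deviceClassName: big-gpu.example.com
        count: 1
      - name: two-small
        deviceClassName: small-gpu.example.com
        count: 2
```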
KASLIN FIELDS: I'm very curious how all of this is going to look in 5 to 10 years, because that level of flexibility is very interesting there. So I'm imagining, say you're a platform engineer who manages the infrastructure for a number of different development teams, some of those development teams are doing AI workloads, and some of them maybe are doing traditional web applications or other kinds of applications where maybe they don't need the level of detail that the AI applications would need. So dynamic resource allocation and especially the-- what is it called again? The device, the definition piece.
JOHN BELAMARIC: Class.
KASLIN FIELDS: Class. There we go. [LAUGHS] So the DeviceClass piece is very useful for those AI-type workloads. Would you use a dynamic resource allocation and DeviceClasses across the cluster also, for the workloads that are not AI workloads, that don't need that level of detail and control over the devices? Or is it mainly meant just for those workloads and you would do something different for the non-AI workloads?
JOHN BELAMARIC: That's a very good question.
KASLIN FIELDS: [LAUGHS]
JOHN BELAMARIC: The short answer is you would-- in my opinion, one should minimally specify what they need and let the system figure out how to optimize it. So if you don't need a device, you definitely shouldn't specify one. In theory, there are people working on drivers for DRA which model and represent CPU and CPU topology and memory topology, in which case, if you have needs that we currently solve with something like Topology Manager, you potentially can solve them with this with a little bit more flexibility.
Topology Manager or CPU Manager and those things are based on per-node settings, which is an artifact of the fact that they were built by SIG Node, not an artifact of the technology. And so you could actually-- if people build the right drivers and we can make it scale; that's the big issue, there could be scalability issues-- then you could, in theory, use this for types of workloads that don't need specialized devices but have, say, specialized NUMA or other memory constraints or CPU constraints, or need pinned CPUs, or things like that.
And you could dynamically configure those on a per-workload basis rather than a per-node basis. So today, it's per-node. And so then you're sort of cordoning off that node to say, only this type of workload should go on this node. And it creates kind of a chunky infrastructure-- a blocky infrastructure, as opposed to an infrastructure you can cut up into smaller and smaller pieces.
So that's the long answer. But the short answer is, regular workloads that don't need specialized devices shouldn't specify any of this stuff, and our existing systems will work perfectly well. The longer answer is, once you start getting some specialized needs, you should minimally specify what those needs are to give the platform the flexibility to satisfy it in the most optimal way.
KASLIN FIELDS: We've said a few times on this show, in our particularly AI-focused episodes, that it's a pretty good time to be an infrastructure engineer. One thing I'm getting from all of this is there's still a lot of room for infrastructure expertise here and a lot of need for it.
JOHN BELAMARIC: Absolutely. And we're building the APIs such that if you're in infrastructure-- like a platform engineer-- you understand, say, even just your contracts with your cloud provider: where you have reservations, versus where you have spot availability, versus whatever else. We're enabling APIs that, when you combine them with something like Karpenter or Google's custom compute classes, which are autoscaling technologies, dovetail with those.

And I'll hopefully have a talk at an upcoming KubeCon about where DRA dovetails with those technologies. But like I said earlier, the workload author can underspecify the request-- meaning, say, I can take this or this or this-- and then let the cluster administrator, through these other autoscaling tools, decide which of the this or this or this based upon their preferences.
So it kind of decouples a little bit the workload author's efforts from the platform engineer's efforts, such that they can work independently without having to talk to each other every day because people don't like to-- engineers don't like to talk to each other if they can avoid it.
KASLIN FIELDS: [LAUGHS] It can certainly be a challenge, especially in cases like this where the developers who are creating the application are just like, I want my code to do the thing.
JOHN BELAMARIC: Exactly.
KASLIN FIELDS: Platform engineer, infrastructure person, please just make the computer do that. And things get dropped in that handoff.
JOHN BELAMARIC: Exactly. And the infrastructure engineer is like, well, I have a budget, right? So I can't just give you anything you want. I have to make sure that there's enough for everybody.
KASLIN FIELDS: Yeah. That can be a real challenge. So dynamic resource allocation and DeviceClasses in the world of AI workloads can help with that. And one thing I was thinking, as you were talking about all of this, was that that sounds like a lot for infrastructure engineers to keep in mind. So splitting it up kind of between the cluster administrator and other roles makes sense, but still lends credence to the idea that infrastructure engineers have some job security here.
JOHN BELAMARIC: Absolutely.
KASLIN FIELDS: So speaking of infrastructure engineers and users, how can listeners out there who are maybe infrastructure engineers or maybe who are building AI workloads or whatever they may be doing, how can they support Working Group Device Management's work?
JOHN BELAMARIC: We do have some end users involved, obviously, both customers of those of us who are there-- we have all the major cloud providers and NVIDIA and other folks all involved in the working group, and we all talk to our own customers-- and we have some end users who come to the meetings. But more is always better with respect to that, because we're building APIs that are hard to change and are likely going to have to live for another 10 years.
And we're going to screw them up because we're human. But the more information we have up front, hopefully we'll make them better. So as an example, I talked earlier a little bit about, we have a way to underspecify claims. We also have flexibility in the way devices get published by the vendor. One of those ways we're working on for flexibility for the device vendors is what we call our partitionable device model.
So the mental model you can think of or the canonical model for this would be like NVIDIA has what they call Multi-Instance GPUs. You can take an individual GPU, and you can break it into smaller GPUs. And you can do that dynamically, and we need to pick which one. So we want to represent that.
We have a similar thing at Google with TPUs. You can have eight of them in a node, but you can't consume any arbitrary two. If you want to consume two, they come in specific pairs, or fours, or whatever. So there are these topologies, and we want to represent that.
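(The partitionable device model was still a draft proposal at the time of this episode. The rough idea, as discussed, is that a slice can advertise shared counters for a physical device, and each advertised partition consumes from them, so the scheduler knows which combinations are possible. The sketch below follows the draft shape and is purely illustrative; the driver name, sizes, and field shapes are assumptions and were subject to change.)

```yaml
# Purely illustrative sketch of the partitionable-device idea: the whole
# card's memory is a shared counter, and each advertised partition draws
# from it, so allocating one partition rules out overlapping ones.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-partitioned-gpu
spec:
  driver: gpu.example.com          # hypothetical driver
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  sharedCounters:
  - name: gpu-0-counters
    counters:
      memory:
        value: 40Gi                # the full physical card
  devices:
  - name: gpu-0-half-a             # one possible partition
    basic:
      consumesCounters:
      - counterSet: gpu-0-counters
        counters:
          memory:
            value: 20Gi            # this partition uses half the card
```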
Other vendors have similar use cases we would love to hear about. So Amazon has come and given us some of theirs, and Microsoft is there. But end users may have other constraints they want to put on that, other ways they want to make use of it. And we've gotten feedback from different end users on that. But that's just one small example of one of the things we're working on where we could use input from either vendors or end users.
And the way you could help is-- the first thing to do would be just reach out to us on Slack. We have Working Group Device Management on the Kubernetes Slack. Come to our meetings. We have a meeting 8:30 AM Pacific time every other Tuesday. So from the show notes, you can go on there, and you can find out when our meetings are, where they are.
We're super friendly and welcoming. Everybody is welcome to contribute. And our meetings are very open. Anybody can add an agenda item. We talk about it. If it looks like it's something people want to do, then you can create an issue in the Kubernetes repo to track it, an enhancement issue or whatever, depending on what it is. And we just start executing on it and tracking it in each meeting.
KASLIN FIELDS: Highly recommend popping into a meeting if you do want to give them work to do, because work that people know about and were told about in a meeting tends to get prioritized a little bit more easily than--
JOHN BELAMARIC: Definitely.
KASLIN FIELDS: --work that they just saw on an issue in GitHub. So if you can--
JOHN BELAMARIC: There's too many issues that go by. So yeah.
KASLIN FIELDS: Yep. If you can, pop into a Working Group Device Management meeting. If you can't pop into Working Group Device Management, maybe pop into one of the other SIGs that we mentioned. A lot of them are involved with this work. So if you had a question about one of these things, you might actually run into someone who knows something about it in one of the other meetings as well, depending on what times and meetings work for you.
So thank you so much, John, for being on today and teaching us about device management and what Kubernetes is doing to address it. I was just talking to some end users the other day, actually, who were saying, I'd really like it if GPU autoscaling worked a bit better on Kubernetes. So I know there's demand for your work, and I look forward to seeing where it goes.
JOHN BELAMARIC: That sounds great. And send those users to our working group.
KASLIN FIELDS: I'll see if I can do that.
JOHN BELAMARIC: Thank you. Thanks so much.
[MUSIC PLAYING]
ABDEL SGHIOUAR: Well, hi, Kaslin. Happy New Year.
KASLIN FIELDS: Thank you. It was super fun to get to talk to John to start off the year. He has done a lot of awesome work in the community. I've talked with him at KubeCons a bunch and seen a bunch of his Lightning Talks. We have him at the booth, at the Google booth, pretty often, I feel like. So it was really cool to get to hear more about Working Group Device Management, which I've seen a lot of stuff about around the community, but haven't really gotten to dive into until now.
ABDEL SGHIOUAR: Nice. I mean, we talked before about the fact that we were trying to get quotes from the new working groups, Serving and Device Management, but we couldn't actually get both of them last year. So I'm just happy that we managed to at least do this one.
KASLIN FIELDS: Yeah, I didn't mention that in the thing, but a hint at why both of those were created: AI.
ABDEL SGHIOUAR: Yes. The two most important acronyms, I guess, of 2024 and maybe 2025.
KASLIN FIELDS: Yeah.
ABDEL SGHIOUAR: So I didn't have the time to listen to the episode. So why don't you walk us through what was discussed? I might have questions. I will for sure have questions.
KASLIN FIELDS: Yeah. So to start things off, right off the bat, I loved John's description of what working groups are in the community. I talked in the interview about how the first working group that I was really involved with was Working Group Long-Term Support.
ABDEL SGHIOUAR: Oh yeah?
KASLIN FIELDS: And that's kind of a weird one because it's like-- it's very much about what the industry is doing and how do we support that in open source, rather than being inspired by specific-- well, I mean, it is kind of a specific technical need, but--
ABDEL SGHIOUAR: Of course, yeah.
KASLIN FIELDS: --a little bit of a different perspective on how it relates to the different areas within Kubernetes, the different special interest groups. Whereas this one is, of course, AI functionality. We're getting a lot more AI workloads on Kubernetes. And so the SIGs were trying to address what was going on with the new workloads that people are trying to run on Kubernetes. And they were encountering conflicts because certain things that need to be created, new functionality that Kubernetes needs, really crosses the SIG boundaries.
And so they had to create this new working group. And so that kind of speaks to what working groups are meant to do. They're meant to be groups of folks who are working on projects that are cross-SIG, that are usually not forever. Usually, the work that they do ends up being owned by a SIG, so they don't own the code. So Working Group Device Management is just a really good example of a working group, I think.
ABDEL SGHIOUAR: Yeah. So in my head, it sounds more like what they're doing is coordination across multiple SIGs to make sure that stuff lands in the right order, I guess.
KASLIN FIELDS: Kind of. The primary feature that they're working on right now-- it was very interesting to me that their work can be summed up with one feature, and it's dynamic resource allocation.
ABDEL SGHIOUAR: Of course, DRA.
[LAUGHTER]
KASLIN FIELDS: So John talked about some of the conflicts between Scheduling, Autoscaling, Node, all of these different SIGs within Kubernetes that need to work smoothly together in order for hardware resources to be dynamically allocated. You need stuff on the node. You need autoscaling to work right for it. You need the scheduler to understand what kind of hardware it has and how to schedule things on it.
And so they needed to implement new ways for users to be able to specify what kinds of hardware that they needed. So it kind of touches all of those. But he was saying that a lot of the code will end up being owned by the different SIGs, it sounds like. Different pieces fall into those different categories nicely.
ABDEL SGHIOUAR: Yeah. And I assume that, particularly with devices-- and I guess by devices we mean typically-- I mean, in the context of AI, we mean accelerators. But that could mean anything else-- it's kind of more complicated because Kubernetes is a workload orchestrator. But you typically also have an infrastructure orchestration layer, whether that's your cloud provider, VMware, OpenStack, whatever thing that gives you a VM or a node or whatever. And so having all of these things work together in a coordinated fashion is important.
KASLIN FIELDS: Mm-hm. And I talked about one thing I find really interesting about our current predicament or current situation in the industry with all of these hardware accelerators, just massive interest in using them and creating these new AI workloads and things. A weird thing is that we're kind of-- it feels to me kind of like we're going backwards because usually, hardware becomes more commoditized over time, and we're more about abstracting all of that underlying hardware.
But right now, it's like we've gone backwards a little bit there because everybody wants really close control of the hardware and understanding of the hardware. For certain types of AI workloads, you need to be really, really in control of what's happening with that hardware. So it's kind of backwards in the sense that we're getting more fine-grained control rather than more abstracted control. But one feature of the working group's work is that it's kind of flexible. You can have that level of detail, but there are also still ways that you can let the system decide things.
ABDEL SGHIOUAR: Abstract away. Yeah.
KASLIN FIELDS: Yeah. They're kind of trying to build in those abstractions now, and I think it's going to be interesting to see how that develops over the next several years.
ABDEL SGHIOUAR: Yeah. I mean, a little bit off topic, but it's still related to this. I was on Reddit over the holiday season, because when I don't have anything to do, I just go on Reddit because it's fun. And there was a conversation going on the Kubernetes subreddit about somebody saying that they have workloads running on Kubernetes, but they have to restart them every few days-- like, restart them themselves, not let Kubernetes restart them.
KASLIN FIELDS: Manually?
ABDEL SGHIOUAR: Yes, manually, pretty much. Yes. I was actually making the same exact face you are making right now.
KASLIN FIELDS: Uh-huh.
ABDEL SGHIOUAR: I will make sure to link the subreddit or the thread because everybody was like, so you are running pods as VMs? And they were like, yeah, yeah, yeah, because the workloads have to be restarted so they can start from a fresh state. And I'm like, all right.
KASLIN FIELDS: Oof.
ABDEL SGHIOUAR: Sounds like we're not moving forward as much as we wish we were.
KASLIN FIELDS: One thing I'm always telling folks, especially management, about this world is that you think we've been around for 10 years and people know what's going on. They don't.
ABDEL SGHIOUAR: They don't. Yeah. No.
KASLIN FIELDS: People are still making the transition from the VM world into the container world. Things are still brand new to a huge swath of the industry, and that's OK and great, honestly. It's great to still continue introducing it to people for the first time. Those conversations are fun.
ABDEL SGHIOUAR: I just liked the thread because the person was mentioning the fact that, oh, we adopted Kubernetes without really knowing what we were doing, and this is the result. I was like, OK, that's interesting. I will make sure to link the thread. I think it's fun.
KASLIN FIELDS: And folks are going to be dealing with the ramifications of that for years and years to come.
ABDEL SGHIOUAR: Oh, for sure. For sure. Yeah, for sure. Cool. Well, that sounds cool. I'm excited to listen to the episode later. So yeah, thank you very much for your time.
KASLIN FIELDS: Of course. Happy 2025. I'm excited to see a lot of the work that you're working on in 2025. Is there anything that you want to call out that's coming up in this year that you're excited about?
ABDEL SGHIOUAR: There is a lot of AI. But yeah. I mean, 2025 is looking super exciting. There will be a bunch of things going on. We are gearing up for KubeCon Europe, obviously. It's, what, 12 weeks from now? We are--
KASLIN FIELDS: So difficult.
ABDEL SGHIOUAR: So difficult. Yeah. There are quite a lot of KCDs happening this year, and there's a lot of content to be created. Yeah, there will be a lot of things going on. I'm excited. 2025 will be a good year.
KASLIN FIELDS: The biggest year of KubeCons yet--
ABDEL SGHIOUAR: Yeah. There are five this year.
KASLIN FIELDS: --with-- what have we got?
ABDEL SGHIOUAR: Yeah, we've got Europe, US as usual, we still have China, and then we have India and Japan, right?
KASLIN FIELDS: Yeah. So I think that's five. Five KubeCons.
ABDEL SGHIOUAR: Yeah. And then you have OSS Summits.
KASLIN FIELDS: Yeah, Open Source Summits. Kubernetes Community Days, like you were saying.
ABDEL SGHIOUAR: Yes. Then there will be, I think, Cloud Security Con again this year.
KASLIN FIELDS: Oh, right. Mm-hm.
ABDEL SGHIOUAR: There will be 30 KCDs. 3-0.
KASLIN FIELDS: Wow.
ABDEL SGHIOUAR: Around the world. Right?
KASLIN FIELDS: Mm-hm.
ABDEL SGHIOUAR: And then third party events and first party events. So.
KASLIN FIELDS: It's going to be a big year for the Kubernetes and cloud native communities. Infrastructure is not slowing down in the era of AI.
ABDEL SGHIOUAR: No, it's not. Actually, speaking of exciting things, I was scrolling on LinkedIn yesterday, and I saw this thread. I did not know, but apparently, the community ingress controller in the Kubernetes project-- and by that I mean the one maintained in the Kubernetes org itself-- is called ingress-nginx. Very confusing, because it's a separate project from NGINX's own ingress controller. It was just called ingress-nginx.
KASLIN FIELDS: Great.
ABDEL SGHIOUAR: But apparently, they are moving toward a new implementation called InGate. And I'm excited to explore that. We might actually have to have an episode about it on the podcast.
KASLIN FIELDS: All right. I have been hearing about some other exciting ingress things that have been happening this year with Gateway, so hopefully we'll talk more about that soon as well.
ABDEL SGHIOUAR: Yes, that's for sure. I know that we chatted about that. And as we discussed last year, I think, we will also have some end user episodes. We've been talking to some companies-- not CNCF-affiliated ones, but companies that are actually using Kubernetes internally-- and we're planning to do some of those. So.
KASLIN FIELDS: Yeah, definitely want to feature more of those stories of folks out there using Kubernetes and cloud native technologies to do really awesome things. Those are always my favorite. I used to love it when there was a use-case track at KubeCon and other container events. I feel like the events have kind of moved away from that. There's not a single track that is use cases anymore, and that makes me sad.
But whenever I'm track chairing or reviewing CFPs, I'm always excited if it's a use case. [LAUGHS]
ABDEL SGHIOUAR: Yeah, actually, speaking of that, if you are listening to this section of this episode, if you have a very interesting use case, email us. We would be curious.
KASLIN FIELDS: Yes, we'd love to feature you on the show.
ABDEL SGHIOUAR: Yeah. Just email us. What are you working on? How are you using Kubernetes? What challenges do you have? Or if you have any particular questions or anything interesting that you want us to explore, just feel free to email or DM us on social media. We'll be open to listen to what you have to say.
KASLIN FIELDS: And I think that's an excellent note to close on. Thank you everyone for listening to our first episode of 2025, and we hope to talk with you soon.
ABDEL SGHIOUAR: Thank you.
[MUSIC PLAYING]
KASLIN FIELDS: That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media, @KubernetesPod, or reach us by email at <kubernetespodcast@google.com>. You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe.
Please consider rating us in your podcast player so that we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
[MUSIC PLAYING]