#151 June 11, 2021

Multi-Instance GPUs, with Kevin Klues and Pradeep Venkatachalam

Hosts: Craig Box, Sarah D'Angelo

NVIDIA and Google have teamed up to bring the new Multi-Instance GPU feature, launched with the NVIDIA A100, to GKE. We speak to Kevin Klues from NVIDIA and Pradeep Venkatachalam from Google Cloud on how and why people use GPUs, optimising instance shapes for machine learning, and why less is often more.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box with my very special guest host Sarah D'Angelo.

[MUSIC PLAYING]

CRAIG BOX: We last had you on the show in July 2019. You were on with Patrick Flynn. We caught up with Patrick recently, and he'd packed it all in and moved to Canada. Have you had any similar life-changing experiences since we heard from you last?

SARAH D'ANGELO: Yes, I actually packed up, too, and moved out to Winthrop, Washington, a small mountain town on the east side of the North Cascades. And I also joined Google's core developer team, so I've been focusing on what makes Google engineers productive these days.

CRAIG BOX: What does make a Google engineer productive? I reckon it must be kale and a nice home office.

SARAH D'ANGELO: It's a lot of things. Some of them we know and some of them we don't, which is why I still have a job.

CRAIG BOX: Brilliant.

SARAH D'ANGELO: So, Craig, what have you been up to these days?

CRAIG BOX: Well, the eagle-eared among you will notice that we had a little hiatus. We were supposed to have a couple of weeks off, but it ended up being three, unfortunately. The first two weeks were because the UK weather had improved enough, and lockdown rules had changed enough, that I was able to take the family away for a little while, pop down to the farmstay that we stayed at last year.

For anyone who's keeping up with the notes, a lot of chickens, not so many cows this time, but a fun experience nonetheless. And then the experience of being around people after having been apart from them for so long means that we all came back with horrible colds. So that's why the two-week hiatus became three. But everything's back to normal now. And we apologize for the inconvenience. And we'll be back to our regular schedule.

SARAH D'ANGELO: Glad you're feeling better, Craig, and that your human immunity is still intact! [CHUCKLES]

CRAIG BOX: Anyway, now it's time for a quick recap of what happened while I was off the air. So let's get to the news.

[MUSIC PLAYING]

SARAH D'ANGELO: At Microsoft Build, the Azure team announced that Azure App Service, their managed service for building and deploying web apps, is now available to run on Azure Arc. Arc, as you may recall, is Microsoft's Kubernetes-anywhere play, allowing you to connect on-premises clusters to a cloud-hosted control plane. You can then use the App Service UI to deploy to your on-prem clusters as a custom location. The service is available in preview and includes Functions, Event Grid, Logic Apps, and API Management.

If you want Microsoft to run on-premises clusters for you, AKS is now generally available on Azure Stack HCI, their physical-server-based hybrid offering.

CRAIG BOX: Amazon's move into the on-premises space was through ECS Anywhere and EKS Anywhere, both announced at re:Invent in December. ECS Anywhere went GA this week, allowing you to use their control plane to run local instances of ECS on their remote hosting products or on your own infrastructure. EKS Anywhere is still marked as coming in 2021. AWS also announced App Runner, a service for building and running containerized apps.

SARAH D'ANGELO: Google Cloud has added container-native Cloud DNS to GKE, a native integration with their Cloud DNS service. Enabling this service removes the need to run in-cluster DNS pods, reducing operational burden and improving availability. It also makes it much easier to address GKE-hosted services from outside your clusters.

CRAIG BOX: Some rapid-fire release announcements. Istio 1.10 brings improvements to upgrades and reduces the number of API objects watched on the Kubernetes API server, for lower resource utilization.

SARAH D'ANGELO: Terraform 1.0 takes the last 0.15 version and blesses it as generally available. After six years and 100 million downloads, 1.x versions will now be supported for at least 18 months.

CRAIG BOX: Grafana 8.0 was released. And if you're allowed to look at it after its recent license change, you will find new alerting features, library panels to share between dashboards, and real-time streaming of data. There's also a refreshed look and feel and performance improvements. Tempo, Grafana's distributed tracing backend, has also reached version 1.0.

SARAH D'ANGELO: Argo Rollouts 1.0 is out, bringing a rollout dashboard, richer stats, and more options for canary deployment with Istio to their progressive delivery controller.

CRAIG BOX: KubeSphere 3.1 adds edge node management via the KubeEdge project, and improved metering and billing to make sure your edge nodes pay their way.

SARAH D'ANGELO: Cilium 1.10 adds support for encryption between pods using WireGuard, for operating as a standalone load balancer, and for operating as an egress IP gateway.

CRAIG BOX: Nobl9, guests in episode 147, have announced OpenSLO, a declarative SLO-as-code format that they teased on the show a few weeks ago. This was announced at SLOConf, and all the videos from that event are now available to view.

SARAH D'ANGELO: Envoy Proxy is now generally available on Windows. Lyft also released a chaos experimentation framework for Envoy, built into their infrastructure tooling platform, Clutch.

CRAIG BOX: And now the database section. If you want to run Oracle on Kubernetes, Google has introduced an operator to handle just that. It's perfectly named El Carro, which I'm reliably informed is Spanish for "the car." If you want to run MySQL on Kubernetes and don't like the operators that are already out there, check out Moco, an operator built by Kintone for MySQL 8, using, wait for it, GTID-based lossless semi-synchronous replication.

If you want a MySQL-like hosted service and don't want the care, PlanetScale, guests from episode 81, and authors of Vitess, have launched their managed service to GA. And if you just like reading, the FoundationDB team published a paper on their design for the upcoming ACM SIGMOD Conference.

SARAH D'ANGELO: Docker launched development environments for sharing your IDE and project branch configuration as easily as you share your compiled artifacts. They also added a verified publisher program for commercial content, and a v2 of Docker Compose, with a C.

CRAIG BOX: Google's open source team has introduced a tool to view dependency graphs of open source projects. While you may craft your turtle app by hand, and include a couple of lovingly chosen turtle libraries, you'll be shocked to see how many more dependency turtles are dragged in, each with their own security implications. If you want to be scared by what it takes to make Kubernetes, links to the 1.0 and current 1.22-alpha graphs are included in the show notes.

SARAH D'ANGELO: Now that you're thinking about software supply chains, check out the new Tekton Chains project, which securely captures metadata for CI/CD pipeline executions to allow fully verifiable builds. It's written up on the Google security blog by Dan Lorenc and Priya Wadhwa. Oh, and if you're running Kubernetes, runc, or the VS Code Kubernetes extension, check out the show notes for some recent CVEs.

CRAIG BOX: The challenging, or perhaps controversial, opinion of the week comes from Steve Smith, who bravely suggests that GitOps is a placebo. In a blog post and Twitter thread, he says that GitOps is just a rebadging of other ideas and has no new ideas of substance. Recent guest host Vic Iglesias added some color with a blog post demystifying what GitOps actually is. The GitOps Days conference is underway as the show goes to air, so perhaps next week we'll bring you the other side of the argument.

SARAH D'ANGELO: Styra, authors of OPA and guests of episode 101, raised $40 million in Series B funding to continue their work on policy and authorization for cloud native applications. They intend to double their headcount by the end of the year, with people working on both Open Policy Agent and their commercial product, DAS. The round was led by Battery Ventures.

CRAIG BOX: Finally, the cloud native community has launched a channel on whatever Twitch is, which I assume is like podcasts with pictures for the kids. At a more appropriate pace for the older generation, the CNCF has also published all the videos from the recent KubeCon EU.

SARAH D'ANGELO: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Kevin Klues is a principal software engineer on the cloud native team at NVIDIA and a maintainer of the CPU manager, device manager, and topology manager components inside the Kubelet. Pradeep Venkatachalam is a senior software engineer on the GKE team at Google Cloud. He works on the node team with specific focus on improving the accelerator ecosystem on GKE. Welcome to the show, Kevin.

KEVIN KLUES: Thank you very much. I'm glad to be here.

CRAIG BOX: And welcome, Pradeep.

PRADEEP VENKATACHALAM: Thanks, Craig. It's great to be here.

CRAIG BOX: I understand you both have a background in systems, and, Kevin, you studied under Eric Brewer at Berkeley?

KEVIN KLUES: I did, yeah. I started there back in 2008 and left in 2015, and headed to Google to continue working with them for a while and slowly made my way to NVIDIA where I am now.

CRAIG BOX: And you also made your way to Berlin during that time.

KEVIN KLUES: I did, yeah. I had studied in Berlin back in the early 2000s for a while and always thought I might come back here at some point. And my wife got into the University of Hamburg to study machine learning, actually, of all things. And so we moved here in 2017. And we've now left Hamburg. She's now finished the program and we moved to Berlin. And we're very happy here and we're expecting our second child actually in three days. So this interview is being done just in time.

CRAIG BOX: Well, I think you might find it hard to top that, Pradeep. But I hear you worked most recently at Facebook. Tell me about the scheduling challenges that they have there.

PRADEEP VENKATACHALAM: Yes, so I worked at Facebook for eight years. And we were mostly into the batch space. So we were building all kinds of, like, queues and schedulers and figuring out algorithms to make sure our nodes, or workers as we used to call them, were fully utilized. After that, I made my way to Google. And Kubernetes was a really natural fit, in terms of my skills and my interests. And I've been working on GKE for the last year with specific focus on GPUs and accelerators.

CRAIG BOX: We had an interview with Pramod Ramarao from NVIDIA in episode 92 of the show. That was over a year ago. And in fairness to me, Adam actually did that interview without me. So I'm not 100% sure I understand what GPUs are. I think that they're machines for turning $1,000 of energy into $0.10 worth of Dogecoin. Perhaps you could correct the gaps in my understanding.

KEVIN KLUES: Sure. So the term GPU stands for graphics processing unit. And traditionally they've been used for running algorithms that drive graphics-based workloads. A little over 10 years ago, though, people discovered that they could start running machine learning style algorithms or artificial intelligence algorithms on these devices because of the high amount of parallelism that they provide. Nowadays there are still use cases for running graphics, but the vast majority of chips that are sold today are used for exactly this, running machine learning and artificial intelligence style algorithms on them.

CRAIG BOX: Perhaps we can continue with a crash course in the concepts of machine learning. We hear things like training and inference. What are the differences between those two phases of the machine learning process and how they utilize GPUs?

PRADEEP VENKATACHALAM: Typical machine learning applications have two phases, the training and inference. So the training is the first step, where you feed the system a bunch of data for which you know the answers ahead of time. And the system will kind of train itself. And once it's trained, what we get is a machine learning model which represents the structure of the neural net, the weights that go into it, and all of that.

You take this model and you deploy it into your production system. And this is the phase we call serving, where you take a trained model and you send user requests to it. For example, if you build an image recognition system, you'd first start off in the training phase. You would feed the system millions of images, for which you know the answers to.

And then you generate a model. And then you deploy it in your production infrastructure. And then you can use it for serving. So, for example, if the user uploads a picture of a cat, you can run it through your serving infrastructure, and it'll tell you that it's a cat. So these are the two phases, training and serving.
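
To make the two phases concrete, here is a minimal sketch in Python (scikit-learn is used purely as an illustrative stand-in; any framework would do). The fit() call is the training phase Pradeep describes, and predict() is what the serving infrastructure does for each incoming request.

```python
# Minimal train-then-serve sketch (illustrative only; assumes scikit-learn).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Training phase: feed the system labelled data and fit a model.
digits = load_digits()
model = LogisticRegression(max_iter=2000)
model.fit(digits.data[:-10], digits.target[:-10])

# Serving (inference) phase: the deployed model answers new requests.
predictions = model.predict(digits.data[-10:])
print(predictions)
```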

CRAIG BOX: We use the word inference sometimes. Is that because we infer that the picture that was sent to us is, in fact, a cat?

PRADEEP VENKATACHALAM: Yes, exactly. So serving and inference are kind of used interchangeably.

CRAIG BOX: So GPUs are pretty hot as far as computing is concerned. Can they run "Doom"?

KEVIN KLUES: Can they run "Doom," the first-person computer game from the '90s?

CRAIG BOX: The ultimate test for a computing device is, can it run Doom.

KEVIN KLUES: I see.

CRAIG BOX: I figure GPUs, they're pretty powerful computers. They should be able to run Doom.

KEVIN KLUES: I could imagine that GPUs are able to drive the video for Doom, yes.

CRAIG BOX: I think pocket calculators these days can pretty much drive the video for Doom. [LAUGHING] Cryptocurrency aside, why are GPUs so expensive?

KEVIN KLUES: There's probably two reasons. One is that the demand for them nowadays is so high. Not only can you do cryptocurrency, Bitcoin mining, and things like that, but all of this proliferation of machine learning in all aspects of technology, the way that's been growing and accelerating, GPUs are in high demand to serve those workloads. And second, it actually is quite expensive to run GPUs.

So when they're sitting in your data center, they actually draw quite a bit of power. And as they're cranking away on these algorithms, it's actually expensive to run and not just use as an end-user.

CRAIG BOX: One of the magic things about CPUs is how we can use them for multitasking. Kubernetes takes that to extremes in terms of sharing the CPU and the memory on a machine with its declarative model. Now, GPUs are not traditionally shared. Is that because they come from a world of generating output that's sent just to the one monitor that's plugged into them?

KEVIN KLUES: I'd say the reason they're not traditionally shared is more of just the history behind where they come from. Traditionally, you would plug a GPU into a machine. You'd install a bunch of drivers on that machine. And then you'd have some users log on to that machine and start using them. You didn't have to worry about them sitting up in some cloud environment, containers running on top of them, lots of people wanting to share access to them. You could time slice and share users directly on the system where they were sitting, rather than having to virtualize that, or somehow get multiple people to access it where they might not necessarily trust each other.

CRAIG BOX: So how would that work in the single user case? You mentioned time slicing there. Do people come along and run jobs, and some sort of batch processing system allocates them?

KEVIN KLUES: At least with NVIDIA GPUs, there is a programming interface called CUDA, which most people use to run workloads on top of the GPU. And CUDA itself has a mechanism inside of it where you can submit jobs to CUDA and it will automatically multiplex workloads on top of your GPU, the same way running a program on top of the CPU would work. At least from the end user's perspective, that's what it feels like.

But it obviously doesn't give you the isolation guarantees and things you would expect, if you were running in a more multitenanted environment.
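
As a rough illustration of that multiplexing, here is a sketch (assuming PyTorch as the CUDA front end and one visible NVIDIA GPU, neither of which is specified in the conversation): two independent processes both submit work to the same device, and CUDA time-shares it between them, with none of the isolation guarantees Kevin mentions.

```python
# Sketch: two processes sharing one GPU via CUDA's built-in multiplexing.
# Assumes PyTorch and at least one NVIDIA GPU are available.
import torch
import torch.multiprocessing as mp

def worker(name):
    # Each process gets its own CUDA context on the same physical GPU.
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    for _ in range(50):
        a = a @ b  # CUDA interleaves this work with the other process
    torch.cuda.synchronize()
    print(f"{name} finished on {torch.cuda.get_device_name(0)}")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when mixing CUDA and multiprocessing
    procs = [mp.Process(target=worker, args=(f"worker-{i}",)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```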

CRAIG BOX: Now we are running in more multitenanted environments. What do I do in the case where I don't need a whole GPU? These GPUs are very good at running the same task in parallel. Shouldn't that mean that partitioning one is just as easy, if not easier, than partitioning a CPU?

KEVIN KLUES: Traditionally it's been difficult to do that purely in software. You can obviously build solutions to try and virtualize access to these GPUs. So instead of actually partitioning off some set of memory, some set of compute resources in the GPU, you can virtualize access to the entire GPU by time slicing workloads on top of it, even potentially at the kernel level. A technology that NVIDIA provides, called vGPU, gives you that exact abstraction. You run something, the kernel will time slice this for you, and the next user can come along and work on it. And it looks to you like you have full access to the GPU.

But when it comes to actually taking a chunk of the GPU and running a workload on it and having another chunk of the GPU run a different workload, there just hasn't been the hardware support that's necessary to do that until now.

CRAIG BOX: What are the cases that you have to deal with? For example, in a CPU, we are able to segment off various sets of registers, but then the Spectre and Meltdown things come along, and we find side channel attacks. Is there a situation where people would come along and be able to read each other's memory, for example, in that virtualized case, or in the multiple users on a single GPU in software?

KEVIN KLUES: In the multiple-users-on-the-same-GPU software case, that's exactly the issue: there hasn't traditionally been hardware support to separate that. So even if you might partition off memory from one user to another, they can still read that memory if they want to. The bigger problem, though, is that there's no hardware support for fault isolation. So even if I was able to have a different MMU for different users sitting on the GPU, so that they both get their own virtualized environment in terms of what memory they have access to, if one of those processes running on the GPU crashes, it brings down the whole GPU, because there's no fault isolation between them.

CRAIG BOX: Now, NVIDIA came up with a great solution to this, which was announced a year ago at the GPU Technology Conference, which they called multi-instance GPUs. What is a multi-instance GPU?

KEVIN KLUES: Well, a multi-instance GPU, as you mentioned, MIG for short, is a way of taking a full GPU and partitioning it into some smaller set of GPUs that give you these isolation guarantees at the hardware level that were lacking in previous GPU-sharing solutions, specifically around getting dedicated access to L2 caches, memory, compute resources, and making sure that any faults that occur within one of these instances doesn't interfere with any processes running on another instance.

CRAIG BOX: Using MIG, it says that you can divide a GPU into seven MIG instances and eight memory zones. So those numbers aren't the same. Seven's not a regular number in computing. Is that a bit like how Apple sometimes sells you a seven-core GPU, which means, we tried to make eight, but one core didn't work?

KEVIN KLUES: Yeah, it's funny you mentioned that. We try and avoid necessarily talking about this. But yes, that's exactly right. It basically has to do with the yields that you get in the silicon when you manufacture these chips. So we designed it for eight, but seven is what we can actually get reliably. And so seven's the number that we have to advertise it with, and in fact build it with and hard-code it into having.

CRAIG BOX: I'll give you a top tip. Next time, design it for nine.

KEVIN KLUES: I'll take that into account, thanks.

CRAIG BOX: You're not stuck with just having GPUs that are one seventh the size of your instance, though. I understand that you can attach multiple zones together to get GPUs of different shapes.

KEVIN KLUES: Yeah, that's exactly right. So you can think of one of these MIG-capable GPUs as having eight different memory slices and seven of these compute slices. But you can mix and match them together to create different, what we call MIG instances, by combining some number of memory slices with the compute slices. For example, you can take just one of these compute slices and combine it with one of these memory slices to give you the one-seventh sized GPU that we talked about before. Or you can take two of these and combine them with some other number of slices, if that's what you prefer to do.
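
A rough way to picture that geometry, as a sketch only: the profile names below match the ones NVIDIA publishes for the 40 GB A100, but the slice counts and the fitting check are illustrative and ignore the real placement rules.

```python
# Sketch of A100 (40 GB) MIG geometry: 7 compute slices, 8 memory slices.
# Profile name -> (compute slices, memory slices); values illustrative.
PROFILES = {
    "1g.5gb":  (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "7g.40gb": (7, 8),
}

COMPUTE_SLICES, MEMORY_SLICES = 7, 8

def fits(instances):
    """Check whether a mix of MIG instances fits on one GPU (ignoring placement rules)."""
    compute = sum(PROFILES[p][0] for p in instances)
    memory = sum(PROFILES[p][1] for p in instances)
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES

print(fits(["1g.5gb"] * 7))                    # True: seven one-seventh instances
print(fits(["3g.20gb", "2g.10gb", "1g.5gb"]))  # True: a mixed layout
print(fits(["7g.40gb", "1g.5gb"]))             # False: no compute slices left
```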

CRAIG BOX: I can understand adding more GPU cores effectively to a workload can make it run faster, probably, quite linearly. What does the amount of memory matter in this case?

PRADEEP VENKATACHALAM: We think of GPU capacity in terms of both compute and memory. There are different use cases, really, in terms of how you want to reason about capacity, right? So one common use case is you have a bunch of ML models that we talked about before. And you want to load them into the GPU memory for fast access. And then you get incoming inference requests.

You can run your algorithm through that. All the data is already loaded in the GPU memory. So it's really fast. And you can just run your algorithm and return the response back. In these kind of use cases, you're thinking about GPU capacity in terms of how much memory do I have, whereas, on the other extreme, you could be compute constrained as well. You have a fixed amount of data, and then just want to run large scale training on it. Like, you want to run through multiple steps, thousands of steps a second, those kinds of things.

So it really depends on which part of your ML application you're working on. You could be constrained by either compute or memory.

CRAIG BOX: What kind of performance improvements might I see by partitioning my GPU up in this way, and in which kind of use cases?

PRADEEP VENKATACHALAM: First of all, we need to take a MIG-capable GPU, like the latest A100 GPU. So the A100 GPU by itself is very powerful. It's multiple times faster than the previous generation GPUs, without any sort of partitioning. So now, when we take this full A100 GPU, which is too powerful for most workloads, and we partition it, we've seen from previous tests that even one-seventh of that GPU is really comparable to many of the previous generation GPUs for the most common workloads.

And then you can also use the A100 GPU without any partitioning, in which case you get seven times the performance of an individual partition. So it's great, because it demonstrates the kind of linear scaling capability that the A100 GPU and those MIG partitions have. Now you can just choose whether you want to use a full GPU, if your application is demanding enough, or you can slice it and run a smaller application on those partitions. But you can run seven of those. So you can make good use of that GPU.

KEVIN KLUES: Interestingly, we've actually seen cases where a full A100 GPU actually performs worse than seven of the MIG instances running in parallel, depending on what workload you're actually wanting to run.

CRAIG BOX: And now, of course, if one isn't enough, you can attach up to 16 A100 GPUs to a Google Cloud instance. That's a lot. What kind of use case might you need that much GPU power on a single machine for?

PRADEEP VENKATACHALAM: There are many use cases where people train a large number of models, but each model is not heavyweight in itself. You can easily have hundreds or thousands of models at many of these large customers. This is a perfect use case for that. With the isolation guarantees that we get, and 112 GPU partitions on a node (16 GPUs with 7 partitions each), you can potentially run 112 models on a single node. That is really great in terms of cost savings and efficiency.

CRAIG BOX: I understand the advantages you can get from bin packing things into larger machines. But if the GPU is the magic piece, why wouldn't I just run 112 small machines with whatever the smallest GPU is that I can get?

PRADEEP VENKATACHALAM: The smallest GPU itself might be too powerful for most use cases. As we get into these bigger machine types, if you're able to make full use of them, you'll find that, by using a single machine that can run various models, your overall price performance will be better than provisioning a thousand individual machines, each running one model with one GPU, because a lot of the CPU and memory cost can be amortized over all these applications running on a single node. You can make better use of one machine rather than having, like, 1,000 underutilized machines.

CRAIG BOX: When I'm running these inference jobs, am I likely to have a lot of machine CPU and memory available to run other tasks as well, or is it more likely that I need to just dedicate that memory to whatever GPU workloads are running on the machine?

KEVIN KLUES: I'll let Pradeep speak for the GKE case. But I know in the systems that I've seen, typically nodes that have GPUs attached to them tend to only run GPU workloads. And if you have any CPU workloads you want to run, you have a separate machine that has a lot of CPU and memory on it to run those types of workloads.

PRADEEP VENKATACHALAM: Yeah, and specifically, so exactly what Kevin said, right? And in order to help customers achieve that goal on GKE, when you create a GPU node, we actually add a taint, a Kubernetes taint, to it. So that prevents non-GPU workloads from landing on that node by default. And this is exactly so that we run only those GPU workloads on those nodes, because these nodes are way more expensive than regular CPU nodes. So you don't want anything else running on those nodes.
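
For reference, the taint-and-toleration arrangement Pradeep describes looks roughly like the following sketch, written in Python and dumped as YAML to keep the examples in one language. The taint key and value (nvidia.com/gpu=present) are assumptions based on GKE's documented behaviour, GKE normally injects the toleration for you when a pod requests a GPU, and the container image is hypothetical.

```python
# Sketch: a pod tolerating the taint GKE adds to GPU nodes.
# Assumes the taint is nvidia.com/gpu=present with effect NoSchedule;
# on GKE this toleration is normally added automatically for GPU pods.
import yaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-workload"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "us-docker.pkg.dev/example/ml/trainer:latest",  # hypothetical image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
        "tolerations": [{
            "key": "nvidia.com/gpu",
            "operator": "Equal",
            "value": "present",
            "effect": "NoSchedule",
        }],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```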

CRAIG BOX: How did you bring support for MIGs to Kubernetes and to GKE?

KEVIN KLUES: The first step for actually adding MIG support for Kubernetes was to add MIG support to containers directly. So NVIDIA maintains this container stack, which we call the NVIDIA Container Toolkit. We first added MIG support to that and then plumbed through what we'd done there into the NVIDIA Kubernetes device plugin to push those MIG devices into Kubernetes, so that it can advertise them and end users can consume them.

PRADEEP VENKATACHALAM: On GKE, we've taken what NVIDIA has done and offered a managed solution. Most of the code for this is open source, so if you want a closer look, you can go look at our device plugin and see how this is done. Basically, our device plugin supports MIG, based on all the primitives that NVIDIA had already built, and we offer a managed solution on top of that.

So you can just specify what kind of nodes, what kind of partitions you want. And the device plugin and the other components on the node will take care of creating those partitions and managing those GPU partitions.

KEVIN KLUES: Yeah, so, Pradeep, that actually brings up a good point. One of the hardest things with MIG is actually deciding when to change the MIG configuration available on some set of GPUs on a node. And it can be really hard when you're dealing with a fixed set of nodes in a static Kubernetes cluster to do this. And something like MIG integrated into GKE helps solve this in a very unique way.

CRAIG BOX: Is that a runtime change? Can I make changes to the other slices while workloads are running on one of them?

PRADEEP VENKATACHALAM: Not quite today, though that's our goal eventually. So what we can do right now is offer you the ability to easily migrate your workloads onto the partition sizes that you want. You can start off your workload running on a 5 GB partition size, and if over time you see it's running too hot and isn't sufficient, it should be very easy for you to spin up a 10 GB partition and move your workload over to it in a seamless manner on GKE.

CRAIG BOX: And moving the workload is effectively stopping it in one place and starting it and picking up where you left off?

PRADEEP VENKATACHALAM: Yes, exactly. The user would definitely need to make a change in their workload manifest, where they say, hey, a 5 GB partition no longer works for me, I need a 10 GB one. Once you update the manifest, our GKE system, along with all the autoscaling support that we have, will make sure that the necessary GPU partitions are present on your cluster, and the workload has moved over to the new partition.

CRAIG BOX: I was just going to say, this sounds like a perfect use case for vertical autoscaling.

PRADEEP VENKATACHALAM: Yes, it does, actually. Though, because of the way it is designed right now, those MIG partitions are not recognized as resources per se; they are more like labels on the node. So you can specify what partition size you want, whether you want a 5 GB or a 10 GB. It's not a scalar resource where you can say, hey, increase it or decrease it. You need to specify what partition you want.

And based on that, we create those nodes, and we run the workloads on those nodes. Kevin, do you know if there's any work ongoing to think about MIG and those partitions as scalar resources that we can size up and down?

KEVIN KLUES: What do you mean by size up and down?

CRAIG BOX: If it turns out that 5 gigabytes of memory is not enough, is there a way I can attach another 5 gigabyte slice to that particular GPU slice?

KEVIN KLUES: Not at the moment. There's nothing, at least that I know of, even in the next generation of GPU, that will have something like that. You still have to configure the MIG slice size that you want, and then it's going to stay that size until you tear it down and bring a different one back up. There's not going to be a way to migrate your workload into a larger size without first stopping it.

CRAIG BOX: And what about the manifest that I submit to Kubernetes, which would previously have said, run this on a single GPU? How do I ask the system to give me a slice of the size that I need?

KEVIN KLUES: We tried to be very careful as we were designing this to have the user experience look very, very similar. So in traditional GPU workloads, all you would have to do is, in your resource spec, ask for something of type nvidia.com/gpu and then specify how many of those GPUs you want. If you wanted access to a specific type of GPU, you could then add a node selector with a label telling you what type of GPU is on a given machine. And that's how you could land on a K80 or a V100, or now an A100.

And in the new world, with MIG, we do something very similar, where you can still ask for nvidia.com/gpu, except that the label for what type of device you want to land on will have some of the MIG configuration information embedded into it.
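
Putting that into a concrete manifest, here is a sketch, again built from Python. The resource name nvidia.com/gpu comes straight from the conversation, while the GKE node labels for the accelerator type and partition size are assumptions based on GKE's documentation, and the container image is hypothetical.

```python
# Sketch: requesting one MIG slice of an A100 on GKE.
# Node label names are assumed (they match GKE's documented labels at the time);
# the container image is hypothetical.
import yaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-on-mig"},
    "spec": {
        "containers": [{
            "name": "inference",
            "image": "us-docker.pkg.dev/example/ml/inference:latest",  # hypothetical
            # The resource type is still nvidia.com/gpu, exactly as for a full GPU.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
        # The node selector is what says "land me on a 1g.5gb MIG partition".
        "nodeSelector": {
            "cloud.google.com/gke-accelerator": "nvidia-tesla-a100",
            "cloud.google.com/gke-gpu-partition-size": "1g.5gb",
        },
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```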

PRADEEP VENKATACHALAM: So if you have a workload with a node selector that says, hey, this is the type of GPU partition I want, it's actually possible for us to infer the type of GPU partition just by looking at your node selector. And that is exactly what our autoscaler does.

So if you don't have any nodes or node pools already created in your cluster, and you try to deploy a workload that requests this GPU partition in the node selector, our autoscaler will just create those nodes for you, because it has all the information it needs to create those MIG partitions.

KEVIN KLUES: Yeah, and that's actually one of the most exciting things for me about this integrated MIG support in GKE is that they have the ability to autoscale out nodes to serve these types of MIG devices on the fly and then scale them back down once you don't need them anymore.

CRAIG BOX: Now, you two have been presenting your work recently, I understand, at the NVIDIA GPU Technology Conference this year and also at the recent KubeCon EU.

KEVIN KLUES: Yeah, that's right. So Pradeep and I did a joint talk on this work at the GTC, the GPU Technology Conference. And that video was just posted very recently. I also did a follow up talk at KubeCon EU this year, which provides a bit more details as well as a demo of MIG in action on a Kubernetes cluster.

PRADEEP VENKATACHALAM: There's also another talk that is jointly presented by Google Cloud and NVIDIA at the recent GTC conference. And it has a great demo of building a full recommender system on GKE, powered by the new MIG technology. That's a great reference as well.

CRAIG BOX: And you can find links to all these presentations in the show notes. Where do you see this going next? What things are perhaps in the pipeline with new GPU technologies? And how do you think that you'll go about expressing them through Kubernetes and GKE?

KEVIN KLUES: One of the big things that we're excited about coming up soon is the ability to dynamically change the MIG devices on the fly from within a user's pod. So we can give them access to a full GPU, and then once they have access to that GPU, they can then go ahead and repartition these and run multiple processes on them, as they see fit. There's also some ongoing work in the Kubernetes community to add the abstraction of what's called pod level resources.

And we're working to try and get the notion of devices integrated into this work, so that we can actually allocate devices to an entire pod, all of the containers in the single pod, rather than just to individual containers, so that we can share the full GPUs across all of those different containers. And then someone can partition them into different MIG devices, and each container can then get exclusive access to one of those, all managed within that one single pod.

PRADEEP VENKATACHALAM: For us, we are definitely looking forward to the next level of sharing solutions that we can build with NVIDIA. And some of the ideas floating around are, like, if we can help our customers figure out the right GPU or MIG partition for them, without their having to worry too much about, hey, what type of GPU should I run, what partition should I run, and instead, if customers could specify, this is my requirement, right?

Like this is how much GPU memory I require. Or this is how much compute I require. If they could specify, and then we can take care of figuring out for them, which is the best place to run that workload, whether it's on a MIG partition or whether it's a full GPU. We can make that decision for them and make it a lot easier for them to manage their infrastructure and also potentially improve efficiency overall.

KEVIN KLUES: So specify that in more abstract terms, and then you translate that into what size GPU do you need to satisfy this?

PRADEEP VENKATACHALAM: Exactly.

CRAIG BOX: All right, well a very interesting future for GPUs. And it just remains for me to thank you both for joining us today.

KEVIN KLUES: No, thank you. It's been great.

PRADEEP VENKATACHALAM: Thanks, Craig.

CRAIG BOX: You can find Kevin on Twitter at @klueska, and you can find Pradeep's work at cloud.google.com/GKE.

[MUSIC PLAYING]

CRAIG BOX: Thank you very much, Sarah, for helping out with the show today.

SARAH D'ANGELO: You're welcome, Craig. Thanks for having me. This is still my second ever podcast and I haven't been discovered yet. So hopefully this one will do the trick.

CRAIG BOX: Well, I'm not sure everyone listens right to the end of the show, but if you did, you know who to call. If you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at KubernetesPod or reach us by email at kubernetespodcast@google.com.

SARAH D'ANGELO: You can also check out the website at kubernetespodcast.com, where you can find transcripts and show notes, as well as links to subscribe.

CRAIG BOX: I will be back with another guest host next week. So until then, thanks for listening.

[MUSIC PLAYING]