#237 September 24, 2024
Guests are Avin Regmi and David Xia from Spotify. We spoke to Avin and David about their work building Spotify’s Machine Learning Platform, Hendrix. They also specifically talk about how they use Ray to enable inference and batch workloads. Ray was featured on episode 235 of our show, so make sure you check out that episode too. Boilerplate introduction, featuring your hosts.
Do you have something cool to share? Some questions? Let us know:
KASLIN FIELDS: Hello, and welcome to the Kubernetes Podcast from Google. I'm your host, Kaslin Fields.
MOFI RAHMAN: And I'm Mofi Rahman.
[MUSIC PLAYING]
KASLIN FIELDS: In today's episode, our AI correspondent Mofi Rahman talks with Avin Regmi and David Xia from Spotify about their work building Spotify's machine learning platform, Hendrix. They also specifically talk about how they use Ray to enable inference and batch workloads. Ray was featured on episode 235 of our show, so make sure you check out that episode too. But first, let's get to the news.
[MUSIC PLAYING]
MOFI RAHMAN: IBM has acquired the Kubernetes cost management and optimization startup, Kubecost. Kubecost has achieved considerable success, being used in production at companies such as Allianz, Audi, Rakuten, and GitLab. On their blog about the acquisition, the Kubecost team framed the acquisition as joining forces with Apptio and Turbonomic, two other acquisitions IBM made over the last few years which focused on cost and performance optimization. The team also emphasized that they anticipate no interruptions in the products and services they offer during the transition.
KASLIN FIELDS: The Cloud Native Community Japan, a subgroup of the CNCF, has announced that there will be a KubeCon Japan held in 2025. While Japan has featured KubeDay events, usually alongside Open Source Summit Japan in the past, this will be the first time a KubeCon event is held in Japan. The document shared with the announcement states that the event is expected to feature two main conference days, over 100 sessions, and over 1,000 attendees. The event dates and location have not yet been announced.
MOFI RAHMAN: The call for proposals for KubeCon EU 2025 is now open until November 24. While the CNCF used to open KubeCon EU's CFPs after KubeCon NA, the CFP opening has been moving earlier. With the introduction of KubeCon India in 2024 and KubeCon Japan in 2025, there will be lots of KubeCons to apply to next year, so keep an eye out for those CFPs.
KASLIN FIELDS: Artifact Hub has become a CNCF incubating project. The project is a web-based application that enables finding, installing, and publishing cloud native packages and configurations. Discovering useful cloud native artifacts like Helm charts can be difficult with general purpose search engines. Artifact Hub provides an intuitive and easy to use solution that allows users to discover and publish multiple kinds of cloud native artifacts from a single place.
MOFI RAHMAN: OpenMetrics is archived and has been merged into Prometheus. In July 2024, the Technical Oversight Committee of the Cloud Native Computing Foundation approved and signed off on archiving OpenMetrics and migrating it under Prometheus. As the author says in the CNCF blog post, "OpenMetrics is dead. Long live OpenMetrics as Prometheus format."
KASLIN FIELDS: kubecolor is an open source kubectl wrapper that can be used to add colorful highlighting to your kubectl output. Originally developed by Hidetatz Yaginuma, or hidetatz on GitHub, the project has recently been revitalized by Kalle, or GitHub user applejag, and Prune Sebastien Thomas, or GitHub user prune998.
The newly released 0.4.0 version introduces even more fun and useful functionality, like highlighting kubectl-outputted logs to make them easier to read. The release also features new paging functionality for long output, contributed by Lennart, GitHub user lennartack. If you use kubectl, you might consider giving kubecolor a try.
MOFI RAHMAN: And that's the news.
[MUSIC PLAYING]
David Xia is a senior engineer on Spotify's ML platform team. He has helped build and operate a centralized Ray platform that enables Spotify's ML practitioners to easily start prototyping their ideas and scaling their workloads. Before that, he worked on Spotify's core infrastructure for backend services, specifically on deployment tooling. Welcome to the show, David.
DAVID XIA: Thank you, Mofi.
MOFI RAHMAN: Avin is an engineering manager at Spotify, leading the ML training and compute team for the Hendrix ML platform. His areas of expertise include training and serving ML models at scale, ML infrastructure, and growing high-performing teams. Prior to joining Spotify, Avin led the ML platform team at Bell, focusing on distributed training and serving. Additionally, Avin is the founder of Panini AI, a cloud solution that serves ML models at low latency using adaptive distributed batching. Welcome to the show, Avin.
AVIN REGMI: Thank you. Thank you, Mofi, for having me here today.
MOFI RAHMAN: So to get started, we had your bio, and we talked about it. But just tell us a bit more about how did you get involved with Spotify and the ML platform itself, starting with you, David.
DAVID XIA: Yeah. First of all, very excited to be here. I have been working at Spotify for many years now. And I've switched teams-- most recently worked on the current ML platform team for several years, helping them build out the infrastructure.
I first got involved because of just my interest in the area. And it's always been-- well, recently, it's been a very hot area of research and development for a lot of companies. And I just wanted to see what all the buzz was about and to work and learn on exciting new technology and to help Spotify apply it.
MOFI RAHMAN: And the same question to you, Avin. How did you get involved in the ML platform for Spotify?
AVIN REGMI: For me, I joined Spotify about-- soon will be two years. But before that, I was with Bell AI Labs, where I also led their ML platform. It was a very interesting journey because I actually got started into applied ML around 2016. But at that time, the platform side or productionization was very, very different. Nobody really did anything much about it. So me trying to productionize a bunch of the models that I was working on led into this experience of starting my startup, Panini, and eventually from there to Bell Labs and eventually to Spotify.
MOFI RAHMAN: So you mentioned a couple of things there that I want to dissect a little bit more. And this is probably a personal interest of mine as well, because every time I meet someone who says they're working on an ML platform, I ask the same exact question, because I seem to get different answers from different people. So if you had to define in a few sentences what an ML platform actually is-- because every team I speak to has a slightly different definition of what an ML platform means for them.
DAVID XIA: For me, ML platform means that it's the tools. It's the SDK. It's essentially the infrastructure layer on which our users, most of them-- or all of them are internal, other Spotify employees like AI researchers, ML practitioners. Those users actually use it to do the actual application.
So we abstract away things like getting the compute resources, having to know or understand certain implementation detail. I'll talk more about it later, but it's all built on top of Kubernetes. So we don't want or expect our users to know how to use Kubernetes in order to get access to lots of hardware accelerators or CPUs or certain nitty-gritty details of, in this case, Ray. And so the platform basically wraps all of it up, makes it super easy for them to quickly get and to use our computational resources for training, serving, any other applications.
AVIN REGMI: Yeah. And just expanding on what David just said-- the interesting fact, I think-- Mofi, you said it-- it's very different from person to person-- is that I think one part of this also depends on the organization as well, and their adoption and ML journey.
Typically, if you ask the question to someone who is just starting off, or whose ML scope is relatively small, what an ML platform means to them will be very different from larger enterprises where they're supporting many customers and many users. So I think the need for the ML platform also changes from organization to organization depending on the business case and at what scale you're operating.
MOFI RAHMAN: And Spotify obviously has been on this journey with the product itself-- not just the ML platform-- the application itself uses a lot of machine learning and deep learning to understand user preferences and things like that. So Spotify has been doing this for over a decade now. How has that changed over time?
Obviously, both of you have probably not been there from the beginning of the journey, but you have probably seen more of the evolution over time. How has that changed? It seems like it has become more of a unified thing now. And how was the decision made internally to make it more of a unified platform for all of Spotify?
DAVID XIA: Yeah. I've been on the platform team for a couple years now. I don't have the institutional memory before that, but I can talk a little bit as an outside observer, seeing how it was used. So I think, before there was an ML platform team, each team that wanted to do ML-- and one of the earliest teams was the team that built Discover Weekly, one of the most beloved use cases of ML at Spotify.
So I think they basically rolled their own, from the model architecture all the way down to how do we get compute resources and how do we schedule this to run every week, and at the scale of every single user, to generate a customized playlist of recent songs we think you might enjoy.
And then use cases proliferated. More teams needed to do similar things for different types of applications. And then I think the ML platform team was maybe-- I'm going to guess five years ago, maybe longer, it started-- and basically tried to build common infrastructure for all these use cases, so that those teams didn't have to keep reinventing the wheel and could become more productive at just focusing on their applications.
And then for us, also recently, a big change in the platform team was the choice of technology and the frameworks to use. One of the initial production-ready stacks was based on Kubeflow and TensorFlow. And then a newer path that we're working on today, relatively newer, is also supporting other types of frameworks, not just TensorFlow, but anything on top of Ray and other Google products like Dynamic Workload Scheduler.
AVIN REGMI: And I think the way it has evolved in the last few years-- I would say it draws a very similar line to other companies as well. In around 2018 or so, the ML landscape was heavily driven by TensorFlow and TFX-related tools. And that was the default approach for going into production. And that's how we created our system around that time.
But as the technology and the way we do modeling evolved, we had to look into different ways of supporting things outside of TFX as well. And I would say since then, from around 2022 going forward, we've expanded into Ray and PyTorch, allowing more generative AI use cases, NLP, and so on, which we can definitely talk more about in a bit.
MOFI RAHMAN: Yeah, and our listeners can't really see this, but Avin right now is wearing a T-shirt that says "Hendrix ML Platform," which is the ML platform that both of you work on at Spotify. Tell me a little bit more about this Hendrix platform. How does it operate on top of the tools you mentioned, like Kubernetes and Ray, to bundle together all the things people need to run any kind of ML workload at Spotify?
AVIN REGMI: The most foundational layer at Spotify, I would say, is the data, compute, and orchestration-- that's the core piece that we have. And Hendrix resides on top of that. So one component of Hendrix is the compute infrastructure-- we use GKE, and we have SKF for that. But also, recently, we expanded onto Ray, which resides on top of our GKE. And that's what practitioners primarily use to actually train their models, do batch inferencing, and so on.
When it comes to serving, because we also allow users to serve, traditionally we were working with TF Serving and similar toolsets. Recently we've expanded into others like Triton and vLLM, supporting a wider set of use cases. Then there has to be the ability for users to actually orchestrate or schedule this in production. And for that, we use Flyte, which is very similar to Airflow, allowing users to orchestrate their various different ML workloads.
And for features, we have an in-house feature store, Jukebox. And there's the Hendrix SDK, which is a high-level SDK that wraps everything together and that users use to actually interface with it. So that's it on a very high level. David, do you want to expand anything on that?
DAVID XIA: Yeah, I think that was a really good summary. That's on a high level. And then I can speak a little bit more about how we actually deploy and run and operate the Ray-based stack. Everything is on Google Kubernetes Engine, which allows us-- we don't have to maintain the Kubernetes cluster ourselves.
GKE has a lot. We're just using the standard, but it takes care of a lot of things for us. We deploy Ray in its KubeRay form. So KubeRay is just the open source way to deploy Ray onto Kubernetes, and that works really well for us. We don't do the VM-based deployment of Ray. And then a lot of the other things-- we mostly use Google products. The logging is just Cloud Logging. For metrics, we do use an internal metrics stack. Sometimes we use the cloud monitoring stuff, but mostly it's just internal metrics.
MOFI RAHMAN: So you mentioned using Kubernetes, and in this case GKE, for using-- how was the decision made to build a platform? When and how did you decide that this platform should be built on top of something like Kubernetes?
DAVID XIA: It was very early, when we were playing around with Ray. And it was just me and one other engineer, [? Teshi, ?] who's still on our team. And I think it came down to where our team's bandwidth, needs, and expertise were, and also just how well Ray-- the KubeRay version-- worked.
After playing around a little bit with the VM-based deployment of Ray and the Kubernetes-based deployment of Ray, it was pretty simple that we were able to just get started a lot faster with Kubernetes. The autoscaling, the monitoring, the other value-added things like GKE image streaming, those just made it so that, because of our limited size of our team and that we had a lot of Kubernetes expertise, it just worked. We didn't want to have to build more things ourselves on top of just plain VMs. So that was pretty straightforward.
MOFI RAHMAN: Yeah. Again, I work in the GKE DevRel team, so I definitely have a lot of love for Kubernetes and GKE as the product. But there is a growing sentiment in the community that for many problems, Kubernetes might be an overkill or too complex of a solution to bring in. But it seems like for this kind of a use case where your problem is complex enough, which fits nicely in the model of Kubernetes-- that's one thing I tell people all the time, is that Kubernetes is a great solution, but it's not a perfect solution for everything.
You have that right match. Your problem set matches nicely with the features Kubernetes is providing, and that's when you get the value. Oftentimes, people are trying to force their problem in the world of Kubernetes or any other type of the solution that sounds cool and good, and they get burned because it is not really mapping with the problem they're trying to solve.
AVIN REGMI: Absolutely.
MOFI RAHMAN: The next question I was going to ask is, imagine now I am a new user at Spotify, a new team that is trying to use this ML platform. What does that onboarding journey look like from that point? I have an idea, a thing I want to build that is going to use some of that ML platform. What do I do to get my ML application running on Hendrix and using those resources?
DAVID XIA: So if you're a new team, or let's say you have a new team-- for us, the way we do it on Kubernetes in our Ray infrastructure is we have several different namespaces that we create for the various teams. So the first step would be to actually create a namespace. And there is a way to do that from Hendrix that allows users to do that. And that gives you access to a certain level of quota-- a default one that you can get started with.
We've also been working quite a bit on Workbench, which is essentially a managed instance of a dev environment that allows you to go ahead and create resources that you can interact with. So with the Hendrix SDK, you can actually specify, hey, for a Ray approach, I would need a Ray cluster with this many head nodes or this many workers, resources, GPUs.
And by typing that and entering it-- if it's the CLI route-- users are able to essentially go ahead and provision the Ray cluster to get started. From there, you can actually SSH into the head node of the Ray cluster and start your dev process, or you can start submitting files and so on.
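For readers who want to see what that "submitting files" path looks like in code, here is a minimal sketch using the open-source Ray Job Submission API-- not the Hendrix SDK itself, which wraps this kind of detail. The head-node address and script name are placeholders:

```python
from ray.job_submission import JobSubmissionClient

# The Ray head node serves the job API on port 8265 by default;
# this address is a placeholder for whatever the platform provisions.
client = JobSubmissionClient("http://my-ray-head:8265")

job_id = client.submit_job(
    entrypoint="python train.py",  # hypothetical user script
    runtime_env={
        "working_dir": ".",          # ship the local project files
        "pip": ["torch", "pandas"],  # extra dependencies for this job
    },
)
print(client.get_job_status(job_id))
```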
I think the key here is that there isn't one specific approach. I would say there are three different high-level approaches that we allow users to interact with Ray, and it depends on what journey you're in. If you're relatively early on and you're just focusing on the EDA, trying to explore data, trying different things, most likely starting up in a notebook and switching into a Ray cluster-- a smaller cluster-- is the fastest way for you to do so.
Once you start training your model and you have a certain level of maturity, perhaps now you want to actually deploy this model into production. At that point, you would want to orchestrate this in more of a clear, defined pipeline using Flyte rather than having a notebook or scheduling out of a notebook.
So at that point, a team would actually move away from a notebook experience into more so of fully built Docker images that would be actually deployed via Flyte. So depending on where you are in this journey, you would actually navigate or change the entry point of Ray either from a more of an ad hoc approach to more of a scheduled job via Flyte.
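As a rough illustration of that "clear, defined pipeline" stage, here is a minimal sketch using the open-source flytekit SDK, which is one common way Flyte workflows are authored in Python; the task names and logic are hypothetical, and a real Hendrix pipeline would add its own images and conventions:

```python
from flytekit import task, workflow

@task
def prepare_dataset(source: str) -> int:
    # Placeholder preprocessing step; returns a row count for the demo.
    return len(source.split(","))

@task
def train_model(num_rows: int) -> float:
    # Placeholder training step; in practice this might kick off a Ray job.
    return 0.9 if num_rows > 0 else 0.0

@workflow
def training_pipeline(source: str = "a,b,c") -> float:
    # Flyte builds a DAG from these calls and schedules each task as a pod.
    return train_model(num_rows=prepare_dataset(source=source))
```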
MOFI RAHMAN: That makes a lot of sense. I think oftentimes, people, when they think about a dev platform, they are looking at a-- the example a lot of people would think about is something like Backstage, which is very much like you-- there's a DSL that defines your application in a very strict format. But it seems like what gave Hendrix a lot of success is giving people the option of the flexibility of having a very ad hoc approach, as well as a very strict, defined using Flyte. I think Flyte uses YAML to define their pipelines that way.
So once people have deployed their application-- once I have onboarded, I tried with notebooks, and eventually I wrote my Flyte YAMLs to get that running on Kubernetes and Ray-- how many of the knobs are exposed to the user at that point? I know abstraction comes with a lot of things hidden at certain times. For the ML teams, how much of this underlying platform do they get to see when they deploy something?
DAVID XIA: The principle we're going for is progressive disclosure, actually. So for just people who are getting started, we don't want them or expect them to even know that there's Kubernetes or GKE or have to look at YAML. So like Avin referenced, we have a Hendrix SDK and also command line executable. You can just run Hendrix, Create Cluster. And we give them some knobs.
But if you don't specify any command line switches, like number of CPUs, we give you sane defaults. And then you can just connect to it, start running a notebook, start writing a very simple Ray function. And even to get started, we also give them notebook tutorials so they don't even have to look at upstream open source Ray docs. They can just plop them into a notebook editor, and they can just start-- click Run All and just actually look at it that way.
Then, of course, progressive disclosure. People will want to customize, so then they probably discover the CLI switches that we have, or you can add hardware accelerators. They find out they need more workers, they can add more workers that are more than the default. Maybe they even need a custom container image and not our default one.
Then we have docs showing them how to do that. And then some people even have to drop down to really dealing with the Ray cluster Kubernetes YAML itself, because there's something that we just didn't provide a knob for, because it's impossible to provide a knob for everything. And then we allow the command line tool to take a path to your Ray cluster YAML. So we allow them to really drop down to being exposed to Kubernetes, but we make it a progressive thing so that they only need to know as much as they need to get the job done.
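To make the "very simple Ray function" stage concrete, here is a minimal sketch using only open-source Ray APIs; the sane defaults, cluster provisioning, and connection details are exactly the pieces a platform CLI or SDK would normally handle for you:

```python
import ray

# With no arguments this starts a local Ray instance; pointing it at
# "ray://<head-node>:10001" would use a remote cluster instead.
ray.init()

@ray.remote
def score(x: int) -> int:
    # The decorator turns an ordinary Python function into a Ray task.
    return x * x

# Fan out eight tasks across whatever CPUs the cluster has, then gather.
print(ray.get([score.remote(i) for i in range(8)]))
```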
MOFI RAHMAN: Yeah, I think that's a really good point of progressively getting them more exposure to these things. Because if you expose all the Kubernetes knobs, you are back to Kubernetes again. So the thing you're trying to save them from, you're putting them back into that same world. That makes a lot of sense. So another thing you mentioned, like earlier version of this ML platform, if not Hendrix, was using Kubeflow, and then eventually you thought of moving to Ray.
And now a lot of the platform also probably uses PyTorch as well, that you mentioned as well. So that decision of choosing Ray and moving over from everything TensorFlow to a lot of things PyTorch now-- was that mostly a bottom-up approach where the ML engineers were asking for these features? Or is it more about you looking at the trend of the industry and seeing, this is where the industry is going, or it's more of a mixed approach? How do you decide that the platform should be supporting Ray as a primary orchestrator?
AVIN REGMI: I think it's both sides. So we definitely saw trends happening in the industry. There were certain industry trends where things were focusing more on transformer-based approaches, NLP, and we had use cases along that line as well. It is definitely possible, and we still have teams using transformer Hugging Face packages on SKF, but the experience is still not the best. So we definitely got that. We noticed that as well. But at the same time, as more NLP models and LLMs were coming in, fine-tuning those models on SKF was impossible.
So I would say there were definitely external elements-- which direction the industry was moving in, and we definitely saw things going more towards PyTorch. Ray was evolving quite a bit, and a lot of companies were adopting it. But at the same time, there were also business-driven use cases that allowed us to invest more on that side as well.
MOFI RAHMAN: Kubernetes as an open source project has evolved quite a bit in the last few years to support use cases like large language models and serving, fine-tuning, and training these massive models at scale. But when you're talking about starting your journey in 2022, those were the early days. So what kind of changes did you have to make to get Kubernetes to work well enough with the size of models that you had?
AVIN REGMI: There were various different optimizations that we had to do. And in fact, we'll be talking and expanding more on this at Ray Summit a little bit later next month. Just to give you an example of the size of the Ray cluster itself-- our traditional SKF cluster scaled up to a couple hundred nodes, and currently we're at a cluster which can scale up to over 4,000 nodes.
But when it comes to actually training the models themselves, we leverage GCP's highly optimized compute node pools-- so if we're using something like the H100 A2, A3 node pool instances, those ones have high interconnect bandwidth for GPU-to-GPU communication.
So that allows us to get better support for training these larger models. The other one is the compact placement strategy-- making sure that each of these VMs is physically co-located in the same location to reduce network latency also improves on that. The other knob that we turned would be NCCL Fast Socket, for all the NCCL-- NVIDIA's collective communication-- traffic that's happening.
GCP has this transport layer plugin on top of that that actually optimizes on top of traditional NCCL. It was very easy for us to enable, but in public forums, we saw that people were gaining roughly a 30% speed-up in terms of training these models. So those are some of the optimizations that we had to do on top of our GKE cluster to accommodate these larger model trainings.
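For a sense of what distributed training on such a cluster looks like from the practitioner's side, here is a hedged sketch using Ray Train's TorchTrainer; the model, data, and worker counts are placeholders, and the NCCL and placement optimizations Avin describes happen underneath this layer:

```python
import torch
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each Ray worker runs this; prepare_model wraps the model in
    # DistributedDataParallel and places it on that worker's GPU.
    device = ray.train.torch.get_device()
    model = ray.train.torch.prepare_model(torch.nn.Linear(128, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x = torch.randn(32, 128, device=device)  # toy batch
        y = torch.randn(32, 1, device=device)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across workers
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    # Illustrative numbers: eight GPU workers spread across the cluster.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```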
MOFI RAHMAN: David, you also mentioned earlier the Dynamic Workload Scheduler. That is also a feature we have for helping get resources that are difficult to get, like GPUs-- we all know how difficult GPUs have been to get. Tell me a little bit more about the way your workloads are scheduled so that you can wait to get those resources instead of having to get them right away.
DAVID XIA: Yeah. The current work with Dynamic Workload Scheduler, or I'll refer to it as DWS from now on, is pretty experimental. We're just about to test out some of the functionality with some early users, early teams that we work with. So it's a little bit early to say how they will decide to restructure their workload in terms of code or in response to this new functionality.
But I think one of the biggest pain points and motivations for this added functionality is that a lot of teams wanted to use the latest, most cutting-edge hardware accelerators. And there are frequently a lot of stock-outs, either from just not enough quota on our end that we've acquired.
Or actually, sometimes there's Google Compute Engine stock-outs in a region. Because currently our Kubernetes GKE node pools, a lot of them are on demand. Sometimes we have reservations. And for listeners who don't know, reservations are where you pay to reserve compute instances. And whether or not you're using them, you're still paying, but it guarantees the availability. So the on-demand ones, teams request, and they might not get them, and so they're blocked.
And so we're hoping that the work with DWS not only will make it more cost efficient in terms of getting scheduled-- so if you need eight H100s, you're not going to schedule your workload until you can atomically acquire all eight. You're not going to be hanging on to four and paying for them while you try to get the other four and sometimes fail to get them. So it's not only cost efficiency, but also, we're hoping that it gets us more availability and avoids those kinds of stock-outs so that our users aren't blocked.
MOFI RAHMAN: When you're talking about this ML platform-- a roughly 4,000-node Kubernetes cluster, a lot of different users in different namespaces using the same cluster-- you obviously have some challenges with resource sharing because they're on the same cluster. How does Hendrix handle multiple people asking for similar resources, and how are those resources shared as fairly as possible?
DAVID XIA: So yes, our platform is multi-tenant, lots of tenants, lots of teams. Everybody's always hungry for, give me CPUs, give me latest hardware accelerators. We use several, both features of Kubernetes and then processes, like human processes, to manage those requests in a-- yeah, I'm not sure I'd call it a fair way, but in a way that works currently for people, or is at least transparent and visible as much as possible for people that provide guardrails and avoid noisy neighbors and resource contention as much as possible.
So we use Kubernetes namespaces. Each team starts off with a namespace. They can actually create more than one namespace if they have multiple systems. We go with a namespace per system approach. And then we use Kubernetes resource quotas. We start off with a default amount. Users can request to-- they can actually go and edit it or request more. But that's subject to approval by our team to check that they're requesting a sane amount and it's not going to hog everything.
It's a combination of a human in the loop plus Kubernetes resource isolation. And then we deploy Kueue ourselves. When we first started playing around with Kueue, it was obvious that it is very powerful and would solve a lot of problems that we currently face.
But it was also quickly apparent that it's pretty complex. There's a lot of cool things you can do with it, like borrowing, lending, LocalQueue, ClusterQueue-- all these Kubernetes resources that even we were new to. And so we deployed it as a team, kind of in a centralized way. I don't think any of our users would have the bandwidth or the know-how to get started with that quickly.
And then we use a set of repos to centralize and encode the same defaults we want, and also to promote certain behavior that we want. For example, if you're using hardware accelerators, we want that to go through the Kueue-provisioned Kubernetes node pools. And if you're just using CPUs, it goes through the regular on-demand autoscaling of the cluster. Yeah, so it's a combination of both human and technology features.
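As a rough sketch of the "default quota per team namespace" idea, here is what creating one could look like with the official Kubernetes Python client; the namespace name and limits are illustrative, not Spotify's actual defaults:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-default-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "200",            # total CPU a team can request
            "requests.memory": "1Ti",         # total memory
            "requests.nvidia.com/gpu": "16",  # cap on GPU requests
        }
    ),
)

# Apply it to a hypothetical team namespace.
client.CoreV1Api().create_namespaced_resource_quota(
    namespace="ml-team-a", body=quota
)
```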
MOFI RAHMAN: Awesome.
AVIN REGMI: I would say that we also have multiple clusters as well. So we have, let's say, a smaller cluster that people get started with for experimentation and a larger cluster where people are deploying larger workloads. And for us, the more perfect vision for the future would be that, as a user, I don't have to worry about context switching between clusters. That entire thing is abstracted away for me.
More so, here is what I'm trying to do, and figure it out, whether it's going to be reserved instance, whether it's going to be on demand, whether it's going to be on a larger cluster or a small cluster. And we hope that Kueue will be able to help us simplify this process by being able to navigate this ambiguity.
MOFI RAHMAN: This question, I think, goes to both of you, but more specifically to Avin, because you mentioned that, over the last few years, you have been building Panini, working at Bell AI Labs, as well as Spotify. You have been very closely involved in building ML platforms and leading teams that build ML platforms. What are some of the more interesting and/or surprising things you have found out about building ML platforms?
AVIN REGMI: I would say there are two different aspects to this. If you were to ask me about, hey, building a backend team or something like that, the talent is a given. I would go out there, get backend domain expertise, and work on it. I think the challenging and interesting part of the ML domain is that you need expertise from the ML side, the infrastructure side, sometimes deep-level CUDA and GPU optimization as well, all coming together.
And the very nature of ML is such that it's changing so rapidly all the time. But at the same time, the nature of platform is that you're trying to have certain level of stability going forward. Because if your platform is also changing rapidly, it's breaking users' code, that's not a good ML platform.
So being able to navigate this, and on one hand, this ML domain that's constantly coming with new tools and technologies, but at the same time, how do you go forward in a way that's a little bit more stable, where you're not breaking user infrastructure, user code? I think that over there is quite challenging.
And it is a fine balance between, at what point do we want to move fast and try new things, and at what point do we want to slow down a bit and see, is this going to break users' changes or their experience? And at that point, we should probably slow down. I would say that over there is quite challenging.
MOFI RAHMAN: I think for me personally, that was a bit of learning, because I come from a strong infrastructure background. So when we first started talking about running a workload on GKE, my first response was, it's just workload. Put it on a container. It just runs on GKE. I don't understand what the big fuss is about.
But now that I have spent a bit of time and started talking to customers, as well as listening to folks from the community, I think what you mentioned about that dichotomy is right-- the underlying technology of ML moves so fast, while infrastructure is somewhat static by comparison. Take this whole large language model wave-- surprisingly, the transformer-based architecture was initially proposed back in 2017.
But in the last seven years, the scale of models that people are deploying has grown something like a thousand to a million X in size. A few months ago, someone was talking about a 2 billion-parameter model being a small model, which is such a bizarre way to describe a 2 billion-parameter model.
But again, I think our perspective of what a model size is and what counts as a big model changes quite rapidly. But infrastructure itself is not growing as fast. It is growing quickly, with new GPUs being released, new TPUs, and memory and everything growing.
But the ML platform side and the model sizes are increasing much faster than the infrastructure changes. So that is actually a really interesting finding. And yeah, I agree 100%. And then a similar question to David-- previously, even at Spotify, you worked on the team that handled deployment and productionizing, probably for backend and other systems. But now, working on the ML platform, how is that different, or how is it the same?
DAVID XIA: Yeah, a lot of things are the same in terms of what is good to do as an infrastructure team. One thing that is always timeless: have actionable error messages so that you help your users. Don't write error messages for yourself, with all the context in your head-- write them for someone who has no idea what this is. Actionable error messages are very important.
I think the Ray team-- Anyscale-- does a really good job. If you look at a lot of Ray error messages, they're just so informative. They even suggest: we notice that you're doing this, here's a suggested optimization; we notice this thing is running slow. And it just prints it in the logs, so you can immediately go do it.
That's one. Number two is, again, progressive disclosure. Make it really easy to get started for people who are-- have a quick start, sane defaults, and then let people drop down to lower levels of implementation detail to override things. Another one is-- it's not surprising, but I guess because things move so fast in ML versus more just backend services or something, things go from prototype to production-- being used in production very, very quickly.
So keeping track of your tech debt, I think, is really important, because it's going to be very soon that you're going to have to pay it off, especially when it moves quickly and so many people are using it as quickly, even faster than before. As soon as you write a piece of functionality, people are going to be wanting to use it. Another thing is, at least at Spotify, the types of backgrounds and the level of expertise for AI researchers and ML practitioners is very broad.
So you'll have people who are coming out of PhDs. They haven't worked in industry before. They're probably writing Python that's not the best. And they don't know how to SSH into something because they're just used to writing code locally. So at least for me, I've noticed that I have to write documentation or design tools for a much broader array of users.
Of course, we do have very sophisticated ML engineers too, who could instantly tomorrow work in our team and have no problem. But we do have also a lot of other people who don't come from a traditional engineering-based background. And so for me, I have to take that into account.
MOFI RAHMAN: Yeah, I think for me, another learning has happened over time. And I think for a lot of teams, this is a revelation, too. For the longest time, for many companies, ML probably was this research org that was doing interesting experiments. And they would run one-off experiments to find information out. For Spotify, I think it's a lot different, because Spotify has been putting ML at the forefront of its product for a really long time.
Whereas a lot of companies got thrust into the critical path in the last couple of years, and it's been a big challenge. So I think Spotify definitely had a good head start in that space, because productionized ML workloads were already at the forefront of your product. That probably helped teams know that, my application would be seen by other people, not just in my notebook. So that definitely was really good.
MOFI RAHMAN: The last thing I want to ask both of you is-- obviously, the Hendrix platform is going to grow. Other people are going to come in, new workers are going to come in. But if you had a wish list of things and features, and infinite time and resources to get them done, what are the things you would like to see coming into the platform that people have been asking about, and also things in the Kubernetes space that have been challenging and that you would like to see improved, specifically for ML platform building?
DAVID XIA: Yeah. One thing that we could definitely improve is-- so we put a lot of work into the interactive prototyping stage when you're writing stuff in a notebook. I think when people schedule something with Flyte and then something goes wrong, that debugging process is a lot slower and harder for them because of just how-- there's a separate team that manages a separate Kubernetes cluster that runs Flyte. And then the Flyte thing kicks off a workload on our team's Kubernetes clusters.
And so debugging that, you kind of are exposed to the organizational gaps in between. They're not the same team, so it's not a very integrated experience. And then just the fact that everything's running in containers, and when things go wrong, you can't go SSH in and poke around at the point in time when the error happened. So that's one part of the experience that could be improved, definitely.
Another part is, we have an SDK and it's pretty tied currently, or at least it very much supports or is opinionated on PyTorch. We want to make it more framework agnostic. We also want to make it more flexible in terms of the Ray version that you're using. So we wrote a bunch of code to abstract away Kubernetes, and it's very tied to, this version of Hendrix uses this version of Ray. But it doesn't need to be that way. It should just be-- you can use whatever version of Ray. You can specify it.
But because of how we wrote all of the other code on top of it, it's very Ray version-specific code. It's going to require a bit of rethinking about how to make it Ray version agnostic so you can just-- you want to use a newer Ray version? Go ahead. Everything should still just work. And another part in terms of user experience is everything should just be faster.
Currently, the images and software artifacts that we provide are a little bit bloated. There's a lot of things. A very frequent ask is, I don't need all this. How do I get the minimal thing? How do I get my image or workload to start up a lot faster? How do I not pip install and have to wait several minutes and then it's like this virtual environment is many, many gigs big?
AVIN REGMI: It's funny you asked this question. Because, right before this, I was just in another meeting talking about what changes we want to bring in the next six months and so on. And I very much agree with what David said about the Flyte experience.
I think, traditionally where we came from-- that KF, Kubeflow workflow, where orchestration was kind of taken care of by the Spotify Kubeflow-- moving into Flyte, I think there were opportunities for us to improve that experience a bit. And certainly, it's on our roadmap to make that better. The other thing is taking users from experimentation to production much faster, because a lot of projects actually don't end up in production. Very few actually make it past A/B testing and go all the way to production.
So what are the minimum steps that users can take-- can we reduce the number of steps needed from a notebook experience all the way to the point where things are fully scheduled in an orchestration system like Flyte? And in between, can we reduce those steps? If we can eliminate some of those steps, we can hopefully make that process much faster, without compromising model performance and so on.
And I would say that, finally, the last ideal state-- the direction that we're working on as a platform team from a Ray perspective-- is that, as we grow in terms of use cases, we may also grow in terms of the number of clusters.
But from an end user's perspective, how do we abstract away all of the necessary infrastructure and just have them focus on their model? So if I'm a machine learning engineer trying to train a model, I want to optimize for a certain business lift or business metric. I don't necessarily want to have to worry about how many worker nodes I need or how much memory and GPU I need.
Similar to, I would say, perhaps SageMaker JumpStart-- if we can actually have users focus on specifying their model-- hey, I'm trying to train a model that has this many billion parameters-- from that, can we infer the batch size? And based on that, can we infer the resources needed, where that's completely abstracted away? Certainly, this will help a certain level of users. But then again, there might be other users who want to fine-tune that or have control over it. And for them, that would be OK-- we can expose them to those settings.
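As a back-of-the-envelope illustration of the kind of inference Avin describes-- guessing resources from a parameter count alone-- here is a toy sketch using a common rule of thumb (roughly 16 bytes per parameter for mixed-precision training with Adam, before activations); the numbers are illustrative only:

```python
def estimate_training_memory_gb(num_params_billion: float) -> float:
    # bf16 weights (2 bytes) + bf16 gradients (2 bytes) + fp32 master
    # weights and two Adam moments (~12 bytes) ~= 16 bytes per parameter.
    bytes_per_param = 2 + 2 + 12
    return num_params_billion * 1e9 * bytes_per_param / 1e9

# A 7B-parameter model needs on the order of 112 GB before activations,
# so it won't fit on one 80 GB GPU without sharding or offloading.
print(f"{estimate_training_memory_gb(7):.0f} GB")
```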
MOFI RAHMAN: The final question I wanted to ask is, in this space of doing ML experimentation, there are a number of different ways to get started. As you mentioned, Ray has its own docs, Kubernetes has its own docs, people could get started with PyTorch. But when you're building an ML platform, there's that zero-to-one experience of, I have an idea, I want to just get started and understand what this thing is all about. How does Hendrix, or the Spotify ML platform, help users get started instead of having to learn all the moving parts individually?
DAVID XIA: One thing that we noticed users struggling with or just there being more sharp edges is setting up your development environment. It's not ML specific, but it gets a little trickier when it comes to a bunch of the ML frameworks and tools currently.
It's not so much of an issue now, but maybe a year or so ago-- it was actually even before that. Most of our employees-- this is very Spotify specific, but most of our employees use MacBooks, and that's a very different CPU architecture than where you're running your production workloads, which is more Linux based or at least Unix like.
And so when people had to set up their development environment locally, oftentimes they pip installed stuff, and they'd have issues-- the behavior was just different. Some gRPC packages didn't work, especially after Apple Silicon came out. People would run into strange error messages that required hours of googling to find some obscure compiler flag that you needed to enable so the dynamic linking between libraries worked. It was just awful. It would be a time sink, an unbelievable black hole of productivity going down the drain.
So we partnered with another team at Spotify that owned the cloud developer experience for data scientists and data engineers. Because they had an experience where you could just click something, and it would open up something in your browser, and you could just start coding.
The environment would be set up for you. You'd jump to a definition. You don't have to set up anything locally. So we wanted that same kind of "hello, world" really nice experience for ML engineers as well. And the name of the internal tool that we built for this cloud IDE is called Workbench.
And so we added ML capabilities, specifically the Ray capabilities, to Workbench so that you can say, I'm doing ML stuff, I want a different kind of Workbench. It just gives you all of that in your browser. It's VS Code based-- we use the open source VS Code server. And you can just get started right away without having to fuss around with pip installing all of this stuff or looking up some obscure error.
MOFI RAHMAN: Yeah, as someone who is not a day-to-day Python developer, having to deal with pip install error codes and what that red line actually means is a struggle I can relate to very closely. And with that, I think-- thank you so much to both of you for spending the last hour-ish talking all about building the ML platform at Spotify.
Hopefully in the future, in a couple of years, once all the wish list items you talked about have been implemented, we'll come back and talk about the new challenges you're facing and how these new solutions prompted even more use cases for ML at Spotify. So thank you for spending the time and sharing all your thoughts with us.
AVIN REGMI: Thank you. Thank you for having us. It was a great time.
DAVID XIA: Thanks, Mofi. It was a pleasure.
[MUSIC PLAYING]
KASLIN FIELDS: Welcome back, Mofi, and thank you very much for that interview. I'm really excited that we had our interview about Ray, talking about what Ray is as an open source project and how it works with Kubernetes and all of that. And now we have an episode talking with folks at Spotify who are using Ray and developing-- not just using Ray. They're developing a whole platform.
I liked the references to Backstage. It reminded me of a Backstage-esque platform where the idea is to abstract the underlying hardware, very platform engineering-y, and allow the end users to use the tools that they need. So really, the users of Ray are going to be these folks' users, but the platform engineering aspect of creating this platform that abstracts away that underlying hardware, I think, is a pretty common thing that a lot of companies are at least trying to do, if not doing it already.
MOFI RAHMAN: Yeah, no. Thanks for having me as a host once again. I think the key differences for me of something like Backstage versus an internal ML platform for your teams-- Backstage is trying to build a developer platform for a lot of folks that may or may not be in the room where a lot of the Backstage development decisions are made.
So it's the same challenge: Backstage has to be something for everyone, where the people they are trying to build the platform for may not be there voicing their opinions right away. So that's the key challenge-- as Backstage becomes bigger and more feature-rich, you would need another platform to manage Backstage, to abstract away some of the complexities of Backstage.
And it's the same thing-- like when Kubernetes first came out, the API space of Kubernetes was fairly limited. So there was like three or four different Kubernetes resources that you could deploy. As Kubernetes added more and more things, you needed abstraction because not everybody needed all the different features of Kubernetes. So this internal developer platform that Spotify is building, it is built with the input and the need from all the ML teams that exist within.
And as those teams grow, at some point, it is possible that the platform becomes so big that they need some other tools to simplify using the platform. And it's kind of like the never-ending ouroboros loop of, it becomes big, platform is successful, more people want to use it. Now it's too big. We need something more abstract and more streamlined to use the thing again. And the cycle continues again.
But even with that, I think the journey that Spotify has taken is actually very indicative, as well as illuminating, for other listeners that are on the same journey right now. A lot of the ML platform journey for teams probably started post-2020, in the age of LLMs.
But Spotify has been on this journey, as we discussed, for years now. So their maturity in being able to adopt new features, as well as knowing what not to do, is probably generally higher than a lot of the newer teams that are just getting in on this journey. So there is a lot to learn from their trial and error and figuring things out-- they have been using ML since 2012, 2013, when the Discover Weekly feed first came out. They finally decided to build a unified ML platform in around 2018, 2019.
So in the beginning, just like most other teams now, they were doing similar things-- experimentation, trial, build your own thing-- until they found out they were spending way too much time with every team building its own ML platform. Now you can save a lot of time by just having a dedicated, centralized team doing this. So that, I thought, was a very common practice and a common path a lot of teams are taking.
KASLIN FIELDS: Speaking of the beginning of Spotify's journey and complexity and how Kubernetes fits into all of this, one section I really enjoyed was when you asked David about how they made the decision about using Kubernetes as the base for this. And his answer, of course, was that it was easy and simple for them to decide on that because they were already using Kubernetes, and so they already had that expertise.
And so just doing Ray on top of Kubernetes was the easier path for them than doing it on top of VMs, which I thought was very poignant. A lot of the folks that I talked to in the Kubernetes space tell me similar things when I ask about complexity.
And in various areas of using Kubernetes, if you are familiar with that area or similar areas, then using it is not so complex, and it's not such a difficult hurdle to get over in the onboarding. So it was exciting to hear that Kubernetes made sense as the baseline for them for this.
MOFI RAHMAN: Yeah, we also discussed a little bit about that part of it-- the complexity of the problem fits in the world of Kubernetes very well, versus if you're trying to fit a problem that is not immediately well defined in the space of Kubernetes and you're trying to jam it into the context of Kubernetes, you're going to have a hard time. You're going to feel like you are bringing in way more complexity with Kubernetes than just doing it on a VM, or on something like Cloud Run, or just running a container on some PaaS solution.
So there is that hammer-and-nail problem sometimes in this space. But in this case, building an ML platform is something very complex-- a lot of multi-tenant users, different scale-up and scale-down requirements, requirements for different types of resources.
And managing all of that manually in the VM world is probably going to be more work than it's worth, and that is exactly what Kubernetes is really good at. So it is nice to see that, for problems that match well with the dynamics of Kubernetes, building an ML platform on top of Kubernetes makes a lot of sense for Spotify in this case, and for many other folks that are trying to solve similar problems.
KASLIN FIELDS: And in building up this journey, I liked that Avin really seemed to have a strong concept of what he wanted this ML platform to look like. His background with ML platforms is very impressive and exciting because it's like, you don't see a lot of folks who were focused in this kind of area, I think. I think that's a pretty niche area of focus at this point, but it's growing so quickly.
But he was so focused on the core tenets of what this platform needed to be. One thing he said that I wrote down and bolded was, navigating this fast moving domain of AI while maintaining stability is the challenge. And that's something that really resonates with me in the Kubernetes open source world as well. [CHUCKLES]
MOFI RAHMAN: Yeah, like "move fast and break things" has been a motto in this space. But again, Kubernetes this year turned 10. It's in the double digits. So for pretty much most of our lifetime, Kubernetes will be in the double digits. So it has reached its like double digit age and for the next 89 years is going to be on a double-digit age. Hopefully Kubernetes stays that long, we hope.
KASLIN FIELDS: [CHUCKLES] The years start coming and they don't stop coming.
MOFI RAHMAN: Yeah. But I think the point there is that Kubernetes is, as far as we're concerned, a fairly mature system, over 10 years old. A lot of people rely on it. So Kubernetes has to take a lot of care in moving fast, but still not breaking things, because Kubernetes has to continue supporting all of these different types of workloads that are coming to it. But also, we can't break anything that people rely on Kubernetes for.
So the thing you bolded in our notes here is moving fast but also keeping stability. In the ML space right now, they're going through that "we need to get the new version out as soon as possible" phase. And sometimes it seems like proper testing and proper integration testing get put by the wayside in the name of speed and a fast deployment of things.
But over time, I think the community will gather around and go for more of a stable system that people can grow on top of. In this space right now, speed is the name of the game. You want to get your new version of your software out as soon as possible, like new version of your model, new version of your data, new version of your serving engine training, what have you.
But I think, as you're talking about going from experimentation to production, that is the switch. Production is stable. And speed is important, but not the most important thing. Stability and correctness are probably more important than just the raw speed of getting things out.
KASLIN FIELDS: I liked that when you went into the things that they want to do next, a lot of it was really in that space of, how do we enable the speed that these engineers need? Because the AI space is moving super fast, and it needs to right now. So we need to enable that speed, but we also need to have stability features. I really liked the conversation about debugging and making sure that you have solid error messages-- so simple, yet so important. [CHUCKLES]
MOFI RAHMAN: Yeah. I think, for me, another big highlight is towards the end, when we were talking about the local dev experience-- just setting up your local development environment for using Kubernetes, notebooks, all the Python libraries that exist, and all the different versions of Python libraries that interact with each other. Also, a lot of the Python libraries depend on underlying GCC and C libraries, so you have those versions to care about as well.
Some of them are OS dependent. They mentioned that, at Spotify, they use mostly Macs. But most of those images get built out for Linux environments. If you're using Apple Silicon, the container image that you end up building ends up being an ARM image that may or may not just work in a Kubernetes environment. You can cross-build using Docker Buildx and BuildKit and whatnot. But these are not common knowledge. These are deep container-based skills that people in the infrastructure space, like you and I, have probably learned after much trial and error.
But if you're trusting your ML engineers in that space, now they are spending their precious, precious time in learning container skills, which they could be using in building out new models, new experimentation, notebooks, and what have you. So it's a matter of, how do you take that toil away from your engineers and allow them to do what they're best at, building the product, building the model, running the experimentation, instead of everybody having to learn all the different skills?
So the other thing-- David also mentioned about progressively giving them more information about how the thing works. And there, I think people fall on either side of that coin. Some people are strictly along the line of, ML platform or a platform should be abstracted and people just only have access to either a CLI, SDK, or some sort of DSL to talk to this platform.
But it seems like in Spotify's case, what is working for them right now is starting people off with an SDK and CLI. But if they want those knobs and access to the underlying hardware infrastructure, they have the option to drop down to the Kubernetes and Ray settings themselves, which is interesting and surprising, but also not that surprising at the same time. You can't really have 100% feature parity with every knob in both Ray and Kubernetes in the same platform without rebuilding all of them from scratch again.
KASLIN FIELDS: That's one of the biggest challenges that we're always talking about with the folks building GKE is, how much do you abstract away and how much do you let folks get to? [CHUCKLES] Because we got to serve both users. And so I really liked the way that David put it. He used specific words, which I definitely wrote down here somewhere, but kind of progressive--
MOFI RAHMAN: Progressive disclosure. That's the word I think--
KASLIN FIELDS: There you go. Thank you.
MOFI RAHMAN: David used the word.
KASLIN FIELDS: I like that term.
MOFI RAHMAN: But yeah, so all in all, I think this interview-- as you mentioned, we had a previous episode, 235, on Ray and KubeRay, kind of like the open source project itself. But now we're getting to see how Ray fits in a real-world ML platform. So I think that order, whenever this episode comes out, tells a very compelling story of understanding Ray and all the moving parts, but also taking a step back, zooming out a little bit, and looking at the broader picture and how Ray fits into a larger ML platform.
KASLIN FIELDS: I think that's awesome. And I hope that folks out there are able to relate to a lot of the scenarios that we talked about today. Thank you very much, Mofi.
MOFI RAHMAN: Thanks, Kaslin. That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @kubernetespod or reach us by email at <kubernetespodcast@google.com>.
You can also check out the website at kubernetespodcast.com, where you will find transcripts and show notes and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
[MUSIC PLAYING]