#93 March 3, 2020

Kubeflow 1.0, with Jeremy Lewi

Hosts: Craig Box, Adam Glick

Kubeflow, the Machine Learning toolkit for Kubernetes, has hit 1.0. Google software engineer Jeremy Lewi is a core contributor to Kubeflow and was a founder of the project. He joins the show to discuss what Kubeflow does, and what it means to have hit 1.0.

ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.


CRAIG BOX: We produce a podcast every week, but we're also both podcast listeners. What good shows have you heard lately?

ADAM GLICK: I really appreciate shows that let me learn about something completely different than what I do in my daily, normal life. And I stumbled across one called "Over the Road," not to be confused with "Over the Top," the amazing Sylvester Stallone movie from the late '80s. But this is actually on the same topic.

It's about trucking and about trucker culture and about what is changing in trucker culture as you're starting to look at things like autonomous vehicles and GPS and auto tracking. And it's a fascinating view from a guy who is a trucker as well as a podcaster, talking about the lifestyle and the people and what's going on. I found it really enjoyable and a great way to learn about a part of the world that I don't get a chance to see very often.

CRAIG BOX: I imagine the truckers have a lot of free time on their hands-- or a lot of time to at least listen to things while they're driving.

ADAM GLICK: I was about to say, lots of time to listen. I don't know that they have a ton of free time. They spend a lot of time driving, one of the things you'll learn.

CRAIG BOX: Yeah. But a podcast will be very popular with them.

ADAM GLICK: I imagine so. Maybe there's even some listening to us now. If any of you are truckers, please ping us on Twitter.

CRAIG BOX: Honk the horn as well. If you're driving down the road and listening to this, just honk your horn. And then maybe someone else will realize.

ADAM GLICK: Do you have any recommendations you'd like to share?

CRAIG BOX: Yeah. Last year, with the anniversary of the Apollo 11 moon landing, the BBC World Service put out a podcast called "13 Minutes to the Moon," which was talking about the 13-minute period where they had the famous 1202 program alarm, where they were trying to land the lander and the computer failed and so on, and all of the things that needed to be done in order to unpack why that happened and get the astronauts safely down on the moon. That story having been comprehensively covered last year, we now have a season two, which is coming out next week, talking about Apollo 13, which was covered quite well in the Tom Hanks movie of the 1990s.

ADAM GLICK: I was about to say, I've seen the two hour version of it.

CRAIG BOX: Yeah, well, you can listen to the podcast version in half-hour episodes, which will be coming out next week on the BBC World Service, who have fantastic production values. In terms of film and soundtrack and so on, they have a score by none other than Hans Zimmer, Academy Award winner Hans Zimmer, composer for "The Lion King."

ADAM GLICK: Shall we get to the news?

CRAIG BOX: Let's get to the news.


CRAIG BOX: As conferences around the world are canceled or moved to an online format, the CNCF has reiterated that, as of the time of recording, KubeCon EU is still taking place as planned. The organizing team are working with an epidemiologist to ensure they are implementing best practices. And attendees are asked to self-certify that they have not been in areas affected by COVID-19 and are not exhibiting any symptoms.

However, many speakers and attendees have publicly announced they will no longer attend or are restricted from doing so by corporate travel bans. Ticket refunds are available up to 14 days before the event. Meanwhile, the schedules for three CNCF Day Zero events have been published: ServiceMeshCon, the Serverless Practitioners Summit, and Cloud Native Security Day. Each event requires a separate registration. And tickets are available for all three.

ADAM GLICK: Kubeflow has reached version 1.0. The latest release of the machine learning and AI tools for Kubernetes has focused on stability and being ready for production workloads. You can find out more about the 1.0 release of Kubeflow in this week's interview with Jeremy Lewi.

CRAIG BOX: Kubernetes 1.18 has begun the countdown to release with its first beta. A list of the in-flight enhancements can be found in the show notes. And contributions to PR review are especially welcome. We need to make sure the Sidecar Containers feature finally makes it onto the release train.

ADAM GLICK: The Continuous Delivery Foundation, our subject of episode 44, has just announced its first incubation project, called Screwdriver. Screwdriver is a continuous delivery tool described as a self-contained pluggable service to help developers build, test, and continuously deliver software using the latest containerization technologies. Screwdriver came out of Yahoo! as a simpler interface for Jenkins. The project was open sourced in 2016 and rebuilt to focus on modern CI/CD pipelines.

CRAIG BOX: There's a new tool for installing applications on Kubernetes. Arkade, with a K, provides a clean CLI with strongly-typed flags to install Helm charts and apps on the cluster with one command. It derives from Alex Ellis's k3sup, or "ketchup", tool for installing single binary Kubernetes clusters, separating out the app functionality into its own project. Where possible, arkade supports apps running on ARM, which is great if you've read Alex's other post this week, walking through how to install Kubernetes on a Raspberry Pi in 15 minutes.

ADAM GLICK: VMware has announced Weathervane 2.0, an application level performance benchmark for Kubernetes. Weathervane is open source and manages the deployment, testing, and tear down of the tests. So you only need to provide the application containers and a config file. Weathervane is designed to help users compare cluster performance, evaluate configuration changes on application performance, and validate new clusters before applications are deployed. The announcement gives a couple of examples if you're curious to see what the output looks like for the kind of tests that it runs.

CRAIG BOX: Two Microsoft launches this week. First, Azure Kubernetes Service added spot VM node pools in preview. These VMs run on spare capacity in Azure data centers at a significant discount, but without a guarantee of availability. Next, container image vulnerability scanning on Azure has gone GA. When an image is pushed to a Container Registry, the Azure Security Center scans the image using technology from the security vendor Qualys, with a Q.

ADAM GLICK: Another way to scan your containers is through Jerry Gamblin's new container vulnerability scanning API, launched as what he refers to as early beta. The API allows you to pass in public containers, and it will let you know if there are any CVEs open against that container and its components. The project was an outgrowth of his work scanning public Docker Hub containers and identifying vulnerabilities in those commonly pulled containers. He does note that the API can take up to two minutes to return a result, so you shouldn't run it in a webhook.

CRAIG BOX: Speaking of webhooks, security consultants Ian Coldwater, guest of episode 65, and the meme-appropriate Brad Geesaman presented at the RSA 2020 Conference last week, talking about a new potential attack vector for advanced persistent threats in Kubernetes. By using a validating webhook on the API server, they were able to covertly capture and send out secrets as they were added or changed. Learn more in the video, presented with the requisite "Untitled Goose Game" memes that security consultants can't get enough of. Meanwhile, Jeff Geerling shows some cases where overzealous applications grant RBAC permissions to the default service account, which make it a cluster admin. That is exactly the sort of thing that would enable placing such an advanced persistent threat on your cluster, and it should be watched out for.

ADAM GLICK: Mirantis is continuing to make acquisitions. This week, they picked up Kontena, our guests on episode 31. The entire Kontena team will join Mirantis to help them bolster their developer tooling and perhaps provide a complement to the parts of Docker that they did not acquire with the recent purchase of Docker Enterprise. Terms of the deal were not disclosed.

CRAIG BOX: If you want to mount a Google Cloud Storage bucket to a Kubernetes pod, Ofek Lev, with a K, can help. Ofek, from Datadog, has released csi-gcs, a Kubernetes CSI driver for mounting GCS buckets, on GitHub. It joins a growing list of drivers for storage systems in the Kubernetes world.

ADAM GLICK: Cornelius Weig of TNG Technology Consulting has posted a blog about how to make add-ons to kubectl. We've talked about some kubectl plugins in the past on the show, and it's a powerful way to extend the Kubernetes CLI. The posting is a quick overview of how to make a plugin and why it might be preferable to trying to check your own code into kubectl itself. If you've been thinking about how you can extend your CLI in an easy way, check out this article.

CRAIG BOX: Finally, Jay Huang of NeuVector, with a U, shares how to understand the real-time characteristics of Linux containers. Underneath all your Kuberneteses and your Dockers are just processes running on a kernel and requesting a share of a CPU. To truly be able to run highly threaded, I/O-intensive applications requires you to think about how Linux schedules tasks. And Huang's post goes to the extent of explaining the red-black tree used to apportion work in the so-named Completely Fair Scheduler. If you thought being a system administrator in 2020 stopped at writing YAML, it's time to go back to school.

ADAM GLICK: And that's the news.


ADAM GLICK: Jeremy Lewi is a software engineer with Google Cloud, and a founder and core contributor to the Kubeflow Project. Welcome to the show, Jeremy.

JEREMY LEWI: Thank you very much. It's a pleasure to be here.

ADAM GLICK: Longtime listeners of the show may remember that we spoke to David Aronchick way back in episode 2. And he talked about the launch of Kubeflow. For newer listeners or those maybe not totally caught up, how would you describe Kubeflow?

JEREMY LEWI: Kubeflow is a Kubernetes-native platform for machine learning. What that really means is that Kubeflow is two things. One is it's a set of applications that you need to develop and deploy machine learning models. So we have applications for training models, serving models, hyperparameter tuning, et cetera. And then we also provide the scaffolding to make it easy to deploy those applications and wire them together into a cohesive platform on Kubernetes, and deploy that in the cloud, where the cloud could be your private on-prem cloud or could be a public cloud.

ADAM GLICK: You mentioned a couple of things there. Is Kubeflow a single tool or is it a collection of tools that helps you achieve those things?

JEREMY LEWI: It's a collection of tools. So we typically describe it as a loosely coupled set of microservices or applications. So very much following the typical patterns we see in the Kubernetes and cloud native world for architecting systems and platforms these days.

ADAM GLICK: That makes sense. Who is Kubeflow built for?

JEREMY LEWI: I think we have a variety of different personas that we're trying to serve. And that's really what we see in the enterprise where we typically see that there will be platform teams that are responsible for serving a variety of end users. And so those end users would be ML engineers or data scientists whose responsibility is to rapidly iterate models to solve important business problems. And then the challenge that the platform teams are facing is how do they give those data scientists applications they need like Jupyter, and then how do they manage them effectively at scale on behalf of multiple teams?

ADAM GLICK: What would you say is the difference between a developer as most people think about it and a data scientist?

JEREMY LEWI: I think the overall workflows in many ways are very similar. We're seeing a lot of talk about MLOps and how it relates to regular DevOps. I think a lot of what's different is that, for data scientists, the tools and the problems they're solving are a little bit different. They tend to be focused more on analyzing data and building models as opposed to building traditional business logic or microservices.

ADAM GLICK: And you mentioned something called Jupyter in your description. What is Jupyter?

JEREMY LEWI: Jupyter can be described a variety of ways. I would say it's like an IDE for Python that allows interactive analysis for Python. And it's become really popular in data science because it has rich tools for visualizing data, and then can allow you to easily and interactively manipulate the data and then generate rich plots. It's also used a lot for reporting and tutorials because you can intersperse markdown to have rich descriptions of what's going along with the code.

ADAM GLICK: Gotcha. So is it a little bit like what people think of as the YAML files that they use for their Kubernetes application definition? Would it be that way, but for a machine learning model?

JEREMY LEWI: The underlying notebook file is a little bit like a YAML file, and then it's actually a JSON file that contains the code. So there is a file format for the notebook that is a little bit similar to the YAML file, but Jupyter itself I would say is more like an IDE, in that it's a way of developing code. And then under the hood, you really have Python files or JSON files, like I said, that contain your code.
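
To make the point about the notebook file concrete, here is a minimal sketch in Python. It assumes the version 4 notebook layout (`cells`, `cell_type`, `source`); the notebook content and the `extract_code` helper are made up for illustration and are not part of Jupyter or Kubeflow.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Explore the data\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["values = [1, 2, 3]\n", "print(sum(values))\n"],
        },
    ],
}

# What gets stored on disk (the .ipynb file) is this structure serialized as JSON.
ipynb_text = json.dumps(notebook, indent=1)


def extract_code(ipynb: str) -> str:
    """Hypothetical helper: return the source of every code cell, in order."""
    cells = json.loads(ipynb)["cells"]
    return "".join("".join(c["source"]) for c in cells if c["cell_type"] == "code")


code = extract_code(ipynb_text)
print(code)
```

The markdown cell is skipped and only the Python source comes back, which is roughly the sense in which a notebook is "a JSON file that contains the code."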

ADAM GLICK: When would someone go and use Kubeflow?

JEREMY LEWI: Most of our customers are using Kubeflow because they have specific ML-related problems that they're trying to solve. So with what we're delivering with 1.0, some of the core problems people are trying to solve is, "I want to write a notebook in the cloud so that I can take advantage of the elasticity that my data center provides." So larger VMs, or more GPUs, or being able to run multiple notebooks in parallel. The other thing that people are using Kubeflow for is if they want to use distributed training to train larger models using cloud.

We also provide tools to actually make it easy to deploy the models. So we provide KFServing, which makes it easy to deploy their models and take advantage of GPUs, and then do automatic scaling based on load. It also makes it easy to do deployments and rollouts, and then there's some advanced features coming to KFServing such as explainability and payload logging. So those are some of the core problems that a lot of people trying to build and deploy machine learning products are trying to solve. And so that's what motivates them to pick Kubeflow.

ADAM GLICK: What's KFServing?

JEREMY LEWI: KFServing is a custom resource that's providing a higher level abstraction on top of Knative, which is our serverless platform for Kubernetes. And so with KFServing, we provide this high level custom resource where people can just specify their models. So a link to the object store file where you have your model, like a TensorFlow Serving model, which is just a protocol buffer. And then one of the things that we ship with KFServing is we'll automatically load that up into a model server like TensorFlow Serving, and deploy that. And then we also spin up the Knative resources that you need to deploy that. And so it just provides a high level abstraction along with some advanced MLOps and DevOps functionality for you.
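
As a rough sketch of what "just specify their models" can look like, the Python below builds a manifest for the custom resource. It assumes the `serving.kubeflow.org/v1alpha2` `InferenceService` layout that KFServing used around this release; the `inference_service` helper, the resource name, and the bucket path are hypothetical.

```python
import json


def inference_service(name: str, storage_uri: str) -> dict:
    """Hypothetical helper: build an InferenceService manifest (assumed v1alpha2 layout)."""
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "default": {
                "predictor": {
                    # The only model-specific input: where the saved model lives.
                    "tensorflow": {"storageUri": storage_uri},
                }
            }
        },
    }


# Placeholder name and object store path, not a real model.
manifest = inference_service("my-model", "gs://example-bucket/models/my-model")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, output like this could in principle be applied to a cluster that has KFServing installed. The model server and the Knative resources mentioned above are what the controller would create from it.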

ADAM GLICK: I've heard about a bunch of different ML frameworks. So TensorFlow is one that I hear people mention a lot. Sometimes I hear about PyTorch or MXNet. How do I think about Kubeflow in relation to those? Is it a competitor to those, or is it a complement to them?

JEREMY LEWI: It's a complement to them, so it's a platform. The idea is that you can use whatever framework that you want to use, and most of our applications on Kubeflow are trying to support multiple frameworks. So with our Jupyter notebooks, you can use whatever framework you want-- TensorFlow, PyTorch, Scikit-learn. For training, we have distributed training operators for PyTorch, TensorFlow, as well as XGBoost. For KFServing, there's native support for PyTorch, TensorFlow, Scikit-learn models, and XGBoost. So we're really trying to be framework agnostic because we see most enterprises and customers that we talk to are using all of the frameworks.

ADAM GLICK: So there are a lot of tools that come together as people build an actual application that uses machine learning or artificial intelligence. What might be an example of a common software stack that someone would use including Kubeflow so people can get a sense of where it sits in the stack and what are the other pieces that all come together to build that application?

JEREMY LEWI: Typically when we deploy Kubeflow, the base layer would be your Kubernetes cluster. On top of that, there would be some middleware, which right now would be things like Istio to create the service mesh, and then Knative on top of that to create that serverless platform. Then on top of that, we basically deploy all of the different applications. So we have the built-in Kubeflow applications, like some of the custom resources that we're deploying or creating as part of Kubeflow. We're also pulling in a lot of cloud native applications like Argo, which is a workflow engine. And then we're pulling in other applications to bind this together into a cohesive platform. So a lot of people are using Dex as a way of handling AuthN and getting people's identities integrated with their identity provider.

ADAM GLICK: Where do the other frameworks people are familiar with sit in that stack? Do they then sit on top of that? Would the TensorFlow models and pieces sit on top of that stack?

JEREMY LEWI: It's usually at the application layer that various applications have different ways of pulling in the different frameworks. So with Jupyter, for example, you're launching a Docker container that's running Jupyter, which is a Python binary, in that container. And then you pull in the various frameworks like TensorFlow, PyTorch, by installing Python libraries into that Docker container. And that's usually the case for most frameworks, so it's mostly about packaging those libraries into the Docker container that you end up running. So in KFServing when you end up deploying a model, you would end up including the libraries for that framework in the Docker container that you deploy.

ADAM GLICK: One of the great things that we wanted to have you on the show about was that Kubeflow has now gone 1.0. What's new in 1.0?

JEREMY LEWI: The focus of 1.0 was really about production readiness and stability. So we've taken a core set of applications that we've been working on for a long time, and we've graduated those to 1.0 as a way of saying that here's a set of applications that we consider production ready. And taken together, they deliver on our core critical user journey, which is we want to make it easy to build models. So that's why we provide Jupyter. That's a great way to experiment and develop your models. Then we provide tooling to take those models or those notebooks and convert them to deployable artifacts, which means containers in this case.

You can then take advantage of the training operators that we've provided for PyTorch and TensorFlow to do training easily, and then once you've trained those models you can deploy them using KFServing so that you can easily scale those models and roll them out. So 1.0 is mostly focused on getting those applications, which were already there, to production readiness and having the stability and reliability there so that customers feel comfortable using them in production.

ADAM GLICK: So it's kind of like a GA release, as people often think about it: 1.0, ready to put into production, ready to actually serve real-world applications rather than dev/test and experimentation. That's great, congratulations. You mentioned a number of different steps there in terms of building, and training, and deploying the models. How are those steps different?

JEREMY LEWI: They tend to be different in the type of workload that you're trying to run and the parameters there. So if we look at it-- when you run Jupyter, for example, that's basically a stateful web service, whereas when you do training that's more of a batch-distributed job. And then finally when you get to serving, you have something that's closer to a stateless web service that can scale pretty nicely horizontally. So they have very different patterns of computation at each stage.

ADAM GLICK: That fits very well with the Kubernetes distributed model and how it can map to those pieces. Not to use an overloaded term, but is it all the same model that you're using in these different parts of the code in terms of the ML model that you're building and running, or is there actually different parts within your Jupyter and your other work that come together for these different phases?

JEREMY LEWI: Typically there's different phases of the model development. So in the notebook, you might actually be taking data and pre-processing it in order to produce a training set, and then you might be producing code that will train the model given the data that you've provided. And then you take that code and that data, and you run that in training, and then that actually produces some model artifact, which could be a protocol buffer or maybe a pickled Python file that contains the description of the model. And then in the final step, that file gets loaded up into a model server, which is basically a web server that provides either a REST or gRPC endpoint that can take in requests that contain the input data, feed it through the model, and then take the output, which is the prediction, and return it in the body of the response.
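
The last two steps, a pickled model artifact loaded into a web server that returns predictions in the response body, can be sketched with the Python standard library alone. Everything here is a toy under stated assumptions: the "model" is a dictionary of linear weights, the `instances`/`predictions` JSON field names are a convention borrowed for illustration, and real model servers such as TensorFlow Serving do far more.

```python
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer

# Training step (stand-in): the output is a serialized model artifact.
artifact = pickle.dumps({"weights": [0.5, -1.0], "bias": 2.0})

# Serving step: the model server loads the artifact once at startup.
model = pickle.loads(artifact)


def predict(instance):
    """Feed one input through the toy linear model."""
    return sum(w * x for w, x in zip(model["weights"], instance)) + model["bias"]


def handle_body(body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    predictions = [predict(inst) for inst in request["instances"]]
    return json.dumps({"predictions": predictions}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """REST-style endpoint: POST input data, get predictions back in the body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_body(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the toy server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()


response = handle_body(b'{"instances": [[2.0, 1.0]]}')
print(response.decode("utf-8"))
```

Calling `serve()` would expose the same logic over HTTP on localhost. It is left uncalled here so the sketch runs to completion.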

ADAM GLICK: I've seen you write about how Kubeflow runs on Kubernetes and thus becomes very portable to move and run in different environments. Can you describe how someone might take advantage of that portability?

JEREMY LEWI: We see a lot of customers that want to run Kubeflow on-prem. And in particular, we have a lot of customers in the banking industry that for security and compliance reasons are really keen to run it in their own data centers. And so that's one of the reasons why they're choosing Kubeflow is because they already have Kubernetes running inside their data center, and we make it easy for them to get up and running with some of the applications the platform teams need to deliver in order to serve their data scientists.

The other thing that we see a lot of excitement about from Anthos customers is customers that want to do hybrid workflows. So they might want to deploy models on-prem, for example, perhaps because they want to have low latency and be close to where data is coming in from on-prem cameras and other real-time serving use cases. But then they would like to take advantage of the cloud for training so they can have the elasticity of cloud to easily add more machines to train large models and use GPUs.

ADAM GLICK: Is it safe to say that training tends to be more compute intensive, whereas the operation of the model is much more about where it's running, and it can run on embedded devices and in lower-compute scenarios?

JEREMY LEWI: Typically that's the case. We are seeing some cases where more people want to deploy models and serve models on GPUs. So if you're doing video or images, you have large data that you're trying to do inference on, you can potentially benefit from GPUs. But typically it is training that's more intensive, and that's because you're doing high throughput where you want to process lots of data in order to actually train the model.

ADAM GLICK: Kubeflow was originally released, I believe, in December of 2017. How has the community responded, and how have things changed from the time when it first started to now going 1.0?

JEREMY LEWI: You know, I think we've been really blown away and surprised by how much uptake it's gotten in the contributor community. We now have over 30 organizations participating in the development and use of Kubeflow. We get over 80 unique contributors every month or so, and the number of PRs that we're getting is, I think, 300 or 400 every month. So it's really blown us away how much active development and community involvement there is.

We've also been really surprised by how much uptake we're getting from users even before 1.0. And in particular, we were very surprised to see how much uptake we were getting across some enterprise companies, particularly in the financial industry, where a bunch of banks are leading the way in terms of adopting this on-prem. We were sort of expecting those enterprise customers would be more reticent to adopt anything that wasn't 1.0, but I think what we've seen is that particularly for customers running on-prem, there's not necessarily a lot of alternative options out there. So if people are running in public cloud, there's a lot of SaaS solutions that people can adopt. But if you're running on-prem, you're in the do-it-yourself world, and so I think a lot of people have been happy to join that community and benefit from shared knowledge and co-development.

ADAM GLICK: One of the great things about community and open source is that everything gets better as there's more people involved in it. And you see the projects that really take off and become greater, the ones that have multiple organizations contributing that really are a community led and driven effort. You mentioned there were a bunch of participating organizations. Are there other names that people might know of, companies that are contributing and being an active part of the Kubeflow community?

JEREMY LEWI: If you go to our community page, we list all of them out. But most of the public clouds are involved in the development. IBM has been leading some of the development of KFServing and some of the other projects. Bloomberg has also been contributing to KFServing a lot. They've also published a lot at KubeCon about their Kubernetes native platform for machine learning, and that's inspired a lot of their work in Kubeflow. Lyft has also published a lot about their platform for machine learning, and that's inspired a lot of what we've done. Other companies are Arrikto. Cisco's heavily involved, whole bunch.

ADAM GLICK: Now that you've hit 1.0 and you've got a stable GA release for folks, what is next for Kubeflow?

JEREMY LEWI: What's coming next in 2020 is there's two thrusts to our development. The first is we have another wave of applications that we want to graduate to 1.0. That includes pipelines, which is a tool for orchestrating complex ML workflows; metadata, which is a tool for keeping track of all your data sets and models and how they were produced, so you can understand the lineage of your data sets and models; and hyperparameter tuning, for training your models and actually fine-tuning the parameters to get the best quality results. So we want to graduate those applications to 1.0.

The other big focus is going to be on enterprise readiness. So we have a lot of enterprises that have stringent security and compliance requirements. So they want to have a very strong data exfiltration story. They have a lot of concern around "day N" operations, so things like upgrades and SLAs. So we're working on trying to satisfy those things as well.

ADAM GLICK: You mentioned a couple of the new applications that you're building into the set of Kubeflow tools like pipelines and metadata. What is it specifically about the machine learning or AI workflows that has dedicated projects for that? Obviously there are projects that exist now that do better data tagging and that do pipelines for data processing. What's unique and needs to be specially crafted for this particular area?

JEREMY LEWI: I think a big one is lineage. Lineage becomes really important, because one of the problems that's happening right now in machine learning is the so-called reproducibility crisis, which is a lot of people publish models, they tweak the parameters, and then they try to retrain that same model six months later, and they don't get the same results. And that's because the model depends on a whole lot of factors like the data, which might have changed. New data might have arrived, the statistics of that data might have changed. So that's a big problem. And so a lot of what's going into pipelines and metadata is really about trying to log all the things you need in order to reproduce your results and understand how the model was produced.
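
One way to picture "log all the things you need in order to reproduce your results" is a small lineage record saved alongside each trained model. The sketch below is not the Kubeflow metadata API; `fingerprint`, `lineage_record`, and the sample values are hypothetical, and it only shows how hashing the training data makes a silent change visible six months later.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Hash a JSON-serializable object so that any change in it is detectable."""
    canonical = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def lineage_record(dataset_rows, hyperparameters, code_version):
    """Collect the factors a trained model depends on (illustrative subset)."""
    return {
        "data_sha256": fingerprint(dataset_rows),
        "hyperparameters": hyperparameters,
        "code_version": code_version,
    }


# Original training run.
original = lineage_record([[1.0, 2.0], [3.0, 4.0]], {"learning_rate": 0.01}, "abc123")

# Retraining later: same code and parameters, but new data has arrived.
six_months_later = lineage_record(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], {"learning_rate": 0.01}, "abc123"
)

# Which recorded factors differ between the two runs?
changed = [key for key in original if original[key] != six_months_later[key]]
print(changed)
```

Comparing the two records points at the data, rather than the code or the parameters, as the reason the retrained model might not match the original.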

ADAM GLICK: If people want to get started using Kubeflow, possibly contributing to the project, or even just learning more how to get their hands dirty in using this new technology, what's the best place to start?

JEREMY LEWI: The best place to start is our website. So www.kubeflow.org. And you can go to the Getting Started guide there. That will walk you through deploying Kubeflow on either your public cloud or on-prem distribution of Kubernetes. And then you can go to the MNIST tutorial and walk through the end-to-end experience with Kubeflow, which will walk you through developing a model in the notebook, training that using our custom operators, and then finally deploying that on Kubernetes using KFServing. And in addition, we actually go through spinning up web UIs that allow you to interact with your model in a more user-friendly way like an actual product.

ADAM GLICK: Congratulations on the 1.0 release, and thank you very much for coming on the show today with us.

JEREMY LEWI: Thank you. It's been a pleasure.

ADAM GLICK: You can find Jeremy Lewi on Twitter at @jeremylewi, and you can find the Kubeflow project at kubeflow.org.


ADAM GLICK: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @kubernetespod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: But probably not see us in Amsterdam.

You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes. Until next time, take care.

ADAM GLICK: Catch you next week.