#2 May 8, 2018

Kubeflow, with David Aronchick

Hosts: Craig Box, Adam Glick

Craig and Adam bring you the news from KubeCon and an interview with Kubeflow product manager David Aronchick from Google.


ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: How you been, Craig?

CRAIG BOX: It's been a busy week, Adam. I've worked with our team to really bring the best of Google to KubeCon. Among other things, we had a full program at lunchtime talks with customers and leaders in our lounge. I got to run a panel with some of my heroes, Tim Hockin, Dawn Chen and Eric Brewer. And I also found time to announce two new products in front of 4,000 people.

ADAM GLICK: Amazing.

CRAIG BOX: Let's have a look at the news of the week.

[MUSIC PLAYING]

ADAM GLICK: KubeCon + CloudNativeCon took place in Copenhagen, Denmark, this week. This year, KubeCon EU was the largest KubeCon to date. As you mentioned, it was over 4,000 people; 4,300 people, in fact, were in attendance. That's three times the size of KubeCon EU in Berlin, and that was just last year. That's really a testament to how big this community is and, more importantly, how fast this community is growing.

The CNCF this week launched the Certified Kubernetes Application Developer Program. This certification joins the existing Certified Kubernetes Administrator Certification that the CNCF already offers. The exam is designed to help developers show proficiency with Kubernetes, and is available as an online test. So you don't need to go to a testing facility, but a good network connection is highly recommended.

A spokesperson said that the current pass rate for the test is between 40% and 60%, but the CNCF representative mentioned that if someone doesn't pass, they get an extra try at the exam at no additional charge.

The CNCF has launched a new partner program that will identify Kubernetes training partners. After their launch of the Kubernetes Certified Service Provider Program with 41 certified vendors, the CNCF decided to expand this program to include a special tier of vetted training providers who have experience in cloud native technology training.

Individuals or corporations who are looking for specialized training that maps directly to the Certified Kubernetes Administrator and Application Developer exams will now be able to choose from a list of partners who have passed the qualification process. The partner program has launched with six partners, and more details can be found on the CNCF website.

CRAIG BOX: Over the last couple of years, the CoreOS team have popularized the operator pattern, where you run sidecars next to an application which help operate that application by performing tasks that an administrator would normally have to do manually, such as adding shards to a database.

Red Hat recently acquired CoreOS, and this week Red Hat announced the new Operator Framework, consisting of three pieces: an SDK for building, testing, and packaging an operator; Operator Lifecycle Manager, for deploying an operator to a Kubernetes cluster and managing it; and the Operator Metering tool, which allows enterprises to do internal chargebacks or charge customers based on application usage.
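The classic illustration of the pattern is CoreOS's etcd operator. A minimal sketch of the kind of custom resource it watches might look like the following (the API group and fields follow the etcd operator's published examples; the name and values here are illustrative):

```yaml
# Illustrative custom resource acted on by CoreOS's etcd operator.
# A human declares the desired state; the operator, not an administrator,
# does the work of adding members, replacing failed ones, and upgrading.
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: example-etcd-cluster
spec:
  size: 3            # the operator reconciles the cluster to three members
  version: "3.2.13"  # the operator performs a rolling upgrade to this version
```

Changing `size` or `version` and re-applying the manifest is all an administrator does; the operator carries out the corresponding cluster surgery.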

We'll no doubt be hearing more from Red Hat at their summit this week, and we'll give you a rundown on that in the next episode. Speaking of operators, an operator for Kafka was released this week by Confluent, who run hosted Kafka services on AWS or Google Cloud.

ADAM GLICK: DigitalOcean announced this week that they are going to release a Kubernetes service on their platform. Currently, the service is in private beta, and those who are interested can sign up for access to their Kubernetes service on their website. Along with this announcement, DigitalOcean announced that they are upgrading their membership in the Cloud Native Computing Foundation to gold level.

CRAIG BOX: The big news out of KubeCon was gVisor, Google's open source sandbox that provides more secure isolation for containers. gVisor works by intercepting application system calls and acting as the guest kernel, all while running entirely in user space and adding minimal overhead. gVisor provides an OCI-compliant runtime, called runsc, which lets you run sandboxed containers.
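In practice, using runsc with Docker amounts to registering it as an additional runtime. A minimal sketch, following the gVisor getting-started documentation and assuming runsc is installed at /usr/local/bin/runsc (path illustrative), added to /etc/docker/daemon.json:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
```

After restarting the Docker daemon, a container can then be started inside the sandbox by selecting the runtime per container, e.g. `docker run --runtime=runsc hello-world`.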

Google expects significant cross-fertilization between this and other security projects to collectively move the field of container security forward faster. It's available to download today at GitHub, and we had a great chat with Yoshi and Nick from the gVisor team at KubeCon, and we'll go deeper in an upcoming episode.

Also announced by Google Cloud is Stackdriver Kubernetes Monitoring, which enhances existing capabilities with native Kubernetes support, and more importantly Prometheus integration. The new product unifies logs, metrics, events, and metadata to provide comprehensive observability across the entire hierarchy of Kubernetes objects, including operator-centric views, showing clusters and nodes, and developer-centric views, allowing you to drill down through workloads and pods to containers.

Stackdriver can now ingest Prometheus metrics from Kubernetes and your own applications without needing to make any changes to your environment. Stackdriver Kubernetes Monitoring supports multiple clusters running in any internet-connected environment and is now available in beta.

Also in the observability space, Datadog this week announced a new container map view, giving their customers an overview of their environment so they can group, filter, and explore their containers on the fly. It builds on Datadog's other container monitoring capabilities and integrates with their Autodiscovery and Live Container view products.

ADAM GLICK: Last month, Google Cloud announced a new Security Command Center, and is now highlighting this as a new way to monitor your containers running in Kubernetes Engine. With the Security Command Center, administrators can set up automated actions, such as sending alerts as well as isolating, pausing, stopping, restarting, and even killing the container in question. The Command Center also allows you to monitor status of your clusters, log events, view event history, and take snapshots of the file system of your containers.

At KubeCon, Google Cloud announced five partners in container security, whose tools will plug into the SCC. Those partners are Aqua Security, Capsule8, StackRox, Sysdig Secure, and Twistlock.

CRAIG BOX: Finally, Seattle startup, Upbound, announced $9 million of investment to build a multicloud Kubernetes platform to help users scale across multiple clouds. Upbound were previously best known for the development of Rook, a cloud native storage technology that was donated to the CNCF in January.

ADAM GLICK: And that's the news.

ADAM GLICK: Our guest today is senior product manager David Aronchick. David is the lead PM and co-founder of the Kubeflow project, which makes an entire machine learning stack, using any framework, easy to use, portable, and composable on Kubernetes. David has long been involved in the Kubernetes community, having joined the Google Kubernetes team in February 2015. Before that, he worked at both Microsoft and Amazon, and he was also a startup founder. David, thank you very much for joining us today.

DAVID ARONCHICK: It's my pleasure. Thank you so much for having me.

CRAIG BOX: How different do you think some of those early companies you founded would have turned out if you'd had Kubernetes to build on?

DAVID ARONCHICK: You know, as I think back to the times when we got those startups off the ground, you know, I think about how complicated and challenging things were, scaling things up, and being responsive to community demand. One of my companies, I remember, we had a very big social presence right when social was starting to take off. And we got some very popular one-off social engagement from big pop culture icons. And unfortunately, more often than not, that would bring the entire site down. It was in the days before, you know, infinitely scalable cloud, and more than just infinitely scalable cloud, but also highly responsive application deployments.

If I had had Kubernetes, I would have been able to scale up quickly, either manually or by using automatic scaling that's built in to respond to that demand and make sure that end users never saw those outages.

ADAM GLICK: Gotcha. These days you mentioned you're working on the Kubeflow project. Can you tell me a little about the Kubeflow project?

DAVID ARONCHICK: Absolutely. One of the biggest things with machine learning is that everyone sees how transformative it can be for your business, whether it's a business with a lot of data looking to make the most use of that data, or you're looking to transform an entire industry with brand new scenarios. The problem is that as you approach those new businesses or those new opportunities, it often involves a whole lot of just getting started and wiring together all the many microservices that are involved in rolling out your machine learning stack.

The idea behind Kubeflow is that we wanted to take care of a lot of that boring stuff. Because we leverage the same stuff, the same APIs and conformant Kubernetes clusters that the Kubernetes project has spent years putting together, it means that you don't have to worry about libraries, dependencies, service discovery, storage, all sorts of things that are extremely challenging just in getting an application up and running, let alone operating it at peak performance. The Kubeflow project is designed to be open, to let any framework come in and plug into the project, and then to make it very easy for people to roll out and begin addressing real problems rather than focusing on infrastructure.

CRAIG BOX: What about machine learning, specifically, makes it perfectly suited to Kubernetes?

DAVID ARONCHICK: What you find as you go through machine learning processes is that they generally require a number of key components. First, they tend to need to be highly composable. Almost every machine learning deployment in the world is going to be slightly different, depending on what that organization has decided is the right set of tooling for them. For example, as you do data processing, some people may use Spark, some people may use Hadoop, some people may have their own data processing.

In your machine learning frameworks, perhaps you want PyTorch, TensorFlow, NumPy, scikit-learn, on and on and on. And then when you get to serving, you might use TensorFlow Serving or Seldon Core. You might use Tornado or a variety of other homegrown serving tools. In all of those scenarios, what you're going to want to do is be able to pick and choose the components that make sense to you, but know as you roll them out that they're going to roll out cleanly, that the dependencies necessary to run those applications will all be there, and that they will be able to communicate with each other.

Kubernetes provides a wonderful framework for doing that, because Kubernetes provides all the service discovery. It allows you to run containerized solutions in a declarative way, but then also makes sure that those things run according to a complicated spec. So if you have a master node, a worker node, and a parameter server, for example, for TensorFlow, you can declare that through the natural extensibility provided by the Kubernetes project.
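As a sketch of what such a declaration looked like, here is an illustrative TFJob manifest in the style of the early tf-operator custom resource (field names and the API version changed between early releases, so treat this as the shape of the idea rather than a definitive spec):

```yaml
# Illustrative TFJob: one master, four workers, and two parameter servers,
# each role declared as a replica spec with an ordinary pod template.
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: example-training-job
spec:
  replicaSpecs:
    - tfReplicaType: MASTER
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:1.8.0  # image tag illustrative
          restartPolicy: OnFailure
    - tfReplicaType: WORKER
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:1.8.0
          restartPolicy: OnFailure
    - tfReplicaType: PS
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:1.8.0
          restartPolicy: OnFailure
```

The operator behind the CRD takes care of wiring the roles together (cluster spec, service discovery) so the data scientist only declares the topology.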

And then, once you roll those things out, once you've composed them, you'll want to make them highly scalable. One of the nightmares for a machine learning data scientist is that they kick off a machine learning job and go away for the night, because the thing takes all night. They come in the next morning and realize they had a syntax error. It would be a far better experience for them to be able to quickly ramp up to a hundred or a thousand pods, run all those experiments in half an hour instead of six hours, and then scale down very quickly, giving space to the other folks in their organization.

With Kubeflow, using Kubernetes' native scalability of up to 5,000 nodes, you can quickly ramp up and ramp down your workloads, giving you the best opportunity to leverage the time, the money, and the trade-offs that go between those two.
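For workloads that run as long-lived services rather than batch jobs, one Kubernetes-native way to express that kind of ramp-up and ramp-down is a HorizontalPodAutoscaler. This is a sketch only (the Deployment name and thresholds are illustrative, and whether an autoscaler fits a given training setup depends on how it is deployed):

```yaml
# Illustrative: let Kubernetes scale a hypothetical "training-workers"
# Deployment between 1 and 100 replicas based on CPU utilization,
# instead of an operator resizing it by hand.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: training-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: training-workers
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
```

When the experiments finish and CPU pressure drops, the autoscaler shrinks the Deployment back toward one replica, freeing capacity for others in the cluster.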

ADAM GLICK: David, what's the history of Kubeflow? Where did it come from?

DAVID ARONCHICK: The funny part about it is that all this stuff that ended up in Kubeflow was actually open source. Jeremy Lewi, one of the core engineers on the Cloud Machine Learning Team, was very interested in helping TensorFlow run great on Kubernetes. He open sourced all the work he was doing around the original TensorFlow CRD, and the two of us met and saw the value in not just using this and making it production-ready, but really making it extensible. And that was the start of Kubeflow.

CRAIG BOX: You've said that Kubeflow was designed from the start to be an extension of JupyterHub and TensorFlow on Kubernetes. But then it took on larger goals; specifically, framework independence. Tell me about that decision. Wouldn't it be best to support one platform well before moving on to others?

DAVID ARONCHICK: We do get that all the time. Certainly, it's got "flow" in the name. As the person to blame for the name, all attention should be paid to me, and I deserve all the heaps of scorn. I chose "flow" for the name specifically not because of TensorFlow, but because "flow" is understood as machine learning flow. It's a term very commonly used in the machine learning and data science space, where you're thinking about your data flowing through your entire stack.

Now, as you correctly pointed out, we did and do support TensorFlow as a first-class framework. It just so happens that it was the one that we knew first. Many of the core engineers on the Kubeflow project were previously working on making TensorFlow run great on Kubernetes, and so we adapted a lot of that existing tooling as we moved it over.

It was always our mission, however, to expand and support any machine learning framework in that distributed, composable way, because every data scientist we talked to did have that requirement of, I'd love to use this machine learning framework, I'd love to use Kubernetes, but I have requirement foo. I need to use PyTorch or Caffe, or scikit, or whatever it might be. So it really was always one of our core goals. It's just that we wanted to get the framework in place first and find the real experts out there to help us expand.

And the reality is, even if we didn't want to support one other framework in the world, we would still need to have this highly composable framework, because even within a single framework, you'll find data scientists saying, well, I'm not ready to go to TensorFlow 1.6. I'm still on TensorFlow 1.3. So can you just let me do that? Effectively, that's the exact same thing we're doing here. We want to give data scientists the ability to focus on the tooling that is familiar to them and use Kubernetes' natural, loosely coupled, microservice orientation to let them pick and choose the tooling that makes the most sense for them.

ADAM GLICK: So you mentioned a lot of different projects there, which is great, because I know there are a ton of open source projects that people are using to do machine learning workloads. Google has a lot of hosted machine learning projects as well. How does Kubeflow relate to those projects?

DAVID ARONCHICK: The reality is that hosted projects are terrific, and we strongly recommend using them when it makes sense for you. An easy way to think about it is if a hosted project provides you all the requirements that you need-- it gives you the right version of your ML framework, it's composable, it gives you the right flexibility to add particular configurations, and it runs your model in a way that you want-- we highly recommend using a hosted solution like Google's Cloud Machine Learning Engine over something like Kubeflow, where the reality is, no matter how easy we make it to set up and use, it will always require some degree of manual configuration and administration in order to get it right.

That said, as I was referencing earlier, more often than not, people do have specific requirements that no hosted provider can meet, due to their own legacy issues, machine learning frameworks that they want to use that may not be supported yet, or particular model tuning that they want to do that, again, may not be supported. That's perfectly fine.

The Kubeflow framework is designed to compose many components together. So out of, for example, a standard 12- or 13-step machine learning workflow, you might choose to have steps 3 and 4 be completely hosted, or steps 8 and 9. And our workflow completely supports bridging from the self-hosted Kubeflow components to those hosted machine learning frameworks.

CRAIG BOX: The project was first announced at KubeCon last December. A few months later, it's time now for KubeCon Europe. What has the reception been since you launched? And what have you changed based on feedback you've had from the community?

DAVID ARONCHICK: We could not have predicted how quickly the Kubeflow project would have grown and how many contributors we would already have to the project. Right now, we have over 70 folks who are contributing from all sorts of various companies, including Red Hat, Weaveworks, [INAUDIBLE], Microsoft, and on and on, all contributing to make the project run great. In many ways, we'd very much hoped this to be the case because just like with Kubernetes, while we knew Google infrastructure very well, we weren't going to know every single deployment out there. We wanted to make sure Kubeflow ran great anywhere you had a Kubernetes conformant cluster.

Getting to KubeCon has been entirely about locking down our 0.1 release, with all the core components in an easy-to-deploy framework. For that, we have TFJob, we have TensorFlow Serving, we have JupyterHub, we have Ambassador as an HTTP proxy for external traffic, and we have Seldon Core checked in, along with a number of other key components, to make running with a number of different frameworks and solutions very, very easy.

We also have a number of proposals out there that we hope to land after our 0.1 release, which, as I mentioned, is going to be announced at our KubeCon keynote on Friday. Going forward, more than anything, we very much want to get to 1.0 and to a very regular, stable release cadence, similar to the one the Kubernetes project reached, to make it very easy to roll out and to trust that the components you're rolling out are wired together in that highly scalable, stable solution.

CRAIG BOX: I saw an announcement recently from Cisco that they have integrated Kubeflow into their hosted Kubernetes appliances. Can you tell us a little bit about that collaboration?

DAVID ARONCHICK: Absolutely. One of the things that we really think that will unlock a lot of these ML scenarios is working with major partners who want to move their solutions on premises. More often than not, they are absolute experts when it comes to building hardware and building systems that solve the standard VM-based workloads around whatever those on-prem solutions might be. However, in the new world of using containers, using Kubernetes, there is an entirely new way of deploying and rolling out software. And we want to help them by giving them workloads that are designed to run in those new environments.

With Cisco, they have over 30 years of expertise rolling out hardware to virtually every enterprise in the world. And they were looking to partner with someone who could help bring first-class ML workloads to those on-premises solutions. In this particular case, what we've done is we've trained and tested using their local machines to make it very easy for you to go out and buy a Cisco on-premises hardware, put it into your data center, connect it to your petabytes, terabytes, yottabytes of data that you might have on premises, and use Kubernetes and Kubeflow to do your training.

And because it is Kubeflow, that same training will work whether or not you're doing it on-prem or you're moving to somewhere in the cloud, which may have more scale or custom accelerators, such as on GKE using TPUs.

ADAM GLICK: Earlier you mentioned that the flow naming thing can sometimes confuse people when people think about things like TensorFlow, for instance. What are some of the common misconceptions about the Kubeflow project?

DAVID ARONCHICK: I would say one of the biggest misconceptions is that it is in some way a new project. When you look at the Kubeflow components, I think the funniest part is that there's nothing new in them. What we've done is really take the best of machine learning today and help make it Kubernetes native. So what we do is go out and cut from the production-ready, battle-tested TensorFlow 1.7 and 1.8 releases. We help to package it. We describe it using Kubernetes-native tooling such as Custom Resource Definitions (CRDs) and other service deployments. And then we roll it out.

And the training that you're running against that is running using the same TensorFlow that you used in the open source. We are not writing anything new. We're really just helping things deploy and communicate with other services in the way that people do today but do it in a much more straightforward and programmatic way.

ADAM GLICK: Awesome.

CRAIG BOX: Kubeflow lets you assemble pipelines out of many different open-source projects. Are there any projects that you feel are missing from being able to describe and create a pipeline end-to-end in the way that a data scientist might want to today?

DAVID ARONCHICK: I think there's actually a really important point there, which is that, in many ways, a lot of data scientists' pipelines are going to already exist in some form or another. For example, I mentioned Hadoop and Spark earlier. Almost every large enterprise is going to be churning through enormous amounts of data and likely already has a data solution today for sharding, processing, and then storing it. We don't in any way require every component of the data flow pipeline to exist inside Kubeflow today. If you already have a Spark or Hadoop deployment that outputs exactly the size and shape of the data you'd like to train on, that's great. Keep using it. Kubeflow can connect to that very, very easily.

That said, in the near future, we'd really love to get to a much more automated way to roll out your entire pipeline and to have things flow through the system. So today, what you might do is, your data scientist might do an experimentation on a model. Then they would hand it to a software engineer to roll out and train that in a distributed way. Then they might hand it to an IT ops person to take that same output and package it and roll it out to their production servers.

I think our goal over time is to make that all part of a single experience, using an integrated workflow, and to build in all the necessary glue tooling to package up models, to move them from experimentation to training to serving, and to use all the Kubernetes-native tooling, things like Istio, things like Prometheus, things like your native logging, to make that a very elegant and straightforward process, where you can, at the very beginning of your pipeline, describe everything that should happen before it rolls out to production.

ADAM GLICK: If I wanted to learn more about this or even become a contributor, where would I look?

DAVID ARONCHICK: Well, the first place to go is always our open source repo. We do everything in the open, and we very much want to hear about all the sorts of issues, bugs, or great use cases you're seeing on our open source repo. That is at github.com/kubeflow/kubeflow, and you can see all the various projects and experiments and feedback there.

Additionally, we have all the standard places. We have a Kubeflow Slack. We have a Kubeflow email list. We have a Kubeflow Twitter account. And all the team comes together to respond to all of those at various times.

And then, of course, we have an in-person presence all around the world. We have meetups on a monthly basis, either at dedicated Kubeflow meetups or at other data science meetups where we're often part of the program. We'll be at Red Hat Summit. We'll be at DockerCon. We will be at [INAUDIBLE]. And we expect to have at least a monthly presence at major data science conferences through the end of the year.

CRAIG BOX: Kubeflow is like many great open source projects in that it brings together projects that are already generally available, but the glue itself is still being worked on. You've mentioned we're just coming up on a 0.1 release. Is now the time to be using Kubeflow, or is now the time to be looking at it, examining it, and thinking about how you'd use it in the future?

DAVID ARONCHICK: As much as I would love people to be using Kubeflow in production left, right, and center, I think it might be just a little bit early to take it on and bet your entire business on it. That said, despite my many warnings, there are actually quite a few people who are using it in production, and I do highly recommend tuning in to our Kubeflow discussions around KubeCon to hear from some of those folks.

That said, as I said, many machine learning pipelines are not production-grade in the same way that your web server or your mobile app is production-grade. And so if you can tolerate, you know, 20 minutes of tweaking or fixing as you roll things out, it's perfectly acceptable if that saves you literally hours, days, or weeks rolling things out in production.

And like I said, we are seeing folks use this right now. But my recommendation would be exactly as you said. Please come look at Kubeflow. Try it out with your existing data solutions, whatever they might be. And really give us feedback for what you would need to see before you were able to adopt something like this at scale.

CRAIG BOX: David, thank you very much for your time today.

DAVID ARONCHICK: Thank you so much for having me.

ADAM GLICK: You can find links to the Kubeflow project and more information about David Aronchick in the show notes at kubernetespodcast.com.

CRAIG BOX: Well, that's about all we have time for this week.

ADAM GLICK: And it's been quite a week.

CRAIG BOX: As people start digging more into the announcements and playing with the open source technologies that come out of KubeCon, we'll have a lot more to say. And we look forward to bringing you more interviews in the upcoming weeks.

ADAM GLICK: Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can find the show notes on our website at kubernetespodcast.com. Until next time, have a great week.

ADAM GLICK: See ya.

[MUSIC PLAYING]