Kubernetes Podcast from Google: Episode 22

#22 September 26, 2018

SIG-Node, with Dawn Chen

Hosts: Craig Box, Adam Glick

Dawn Chen, TL for SIG-Node and the Google Kubernetes Engine node team, joins Craig and Adam this week. She has worked on containers and container schedulers since 2007 - not a typo. We also bring you the news, in part from the echo chamber of Google Cloud Summit in Sydney.

Links from the interview

Dawn Chen on GitHub
The Borg paper
Process containers (later ‘cgroups’):
- The first submission of containers to the Linux kernel
- Early coverage of process containers
- Paul Menage’s 2007 paper “Adding Generic Process Containers to the Linux Kernel”
- Dawn’s first job: tracking processes. Each job had its own GID - she would use netlink connection tracking to map processes and threads to GIDs, and, using procfs, figure out CPU and memory usage.
- Dawn’s second job: adjusting CPU usage using nice
- Today we just use memcg
- Fake NUMA - cut a machine into big chunks and assign them to groups of processes.
Linux Plumbers Conference
- Tim Hockin’s presentation at the Linux Plumbers Conference in 2011, talking about the work Dawn’s team were doing
lmctfy - Let Me Contain That For You
- In case you don’t get the joke
- It’s like runc and containerd
SIG Node
- Node and lifecycle management
- Application management
- Container runtimes and kubelet
- Node problem detection
- Resource management
- GPU & TPU
- Security isolation
- gVisor and Sandbox Pods
- Logging and monitoring
Was SIG Node the first SIG?
- Tied with SIG API Machinery
How did we get to CRI?
Container RuntimeHandler, so some pods can run with one runtime and some with another

Transcript

CRAIG BOX: Hi, and welcome to the Kubernetes podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: It's another exciting episode of the Kubernetes "where in the world are we" today podcast. Where are you today, Adam?

ADAM GLICK: I have returned to Seattle, Washington, enjoying the last bits of summer here. It's so wonderful to be back home after a fantastic trip overseas.

CRAIG BOX: Mhm!

ADAM GLICK: How about yourself?

CRAIG BOX: I'm in Sydney. We have a Cloud Summit here today. And the good news about this APAC trip is the flights are relatively direct but they do take a long time.

It was a great week in Japan. I enjoyed the time I spent not only with you, Adam, but with our customers and partners, and people who came up and said hello, and even those who just came and said, can I have a sticker?

ADAM GLICK: Yeah, I was amazed how very popular the stickers were. We went through a ton of those. You might have to do a second run of them already.

CRAIG BOX: We will, a reprint. They're going to be these rare, first-edition stickers. So we do have them in our bags for the events we'll be at for the next few weeks. So, if you do want to get a rare, first-edition sticker, it will be worth something someday. Like I said, by the time you hear this you will have missed the Sydney summit. Unless, of course, you're in the room, in which case, hello, come and ask for a sticker.

ADAM GLICK: And where will you be, Craig?

CRAIG BOX: I will be with my colleague Mete at the Hong Kong summit, in two weeks' time. And then KubeCon Shanghai, I think, is my next booked engagement after that.

ADAM GLICK: Nice! So you can find me, in a couple weeks, at Next UK in London. So please stop by, if you're in London or the surrounding area. And then the week after at Gartner Symposium IT Expo in Orlando. And then you and I will be together at Shanghai KubeCon.

CRAIG BOX: Yes. So let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: NetApp acquired StackPointCloud to offer Kubernetes as a service across all three major cloud vendors, as well as on NetApp's own hyper-converged infrastructure. NetApp is also hoping to combine this new acquisition with their hybrid storage offerings and their cloud data services. No price for the acquisition was released.

CRAIG BOX: At the Cloud Next event in Tokyo, Google announced an open alpha of Sandbox pods on GKE, which, as you might guess, lets you run pods in a gVisor sandbox. Submissions of interest are being taken through the form you can find in the show notes. Looking around the other major clouds this week, Microsoft released Kubernetes support for Azure Stack based on Terraform, and Amazon released the ability to generate kubeconfig files.

ADAM GLICK: Jian Liu, a graduate student at Zhejiang University, spent the summer as a Google Summer of Code intern. His project was adding support for Kata Containers, a sandbox which runs containers in a hypervisor, to the containerd runtime. He tells his story in a CNCF blog post which you can find linked in the show notes this week.

CRAIG BOX: The Linkerd service mesh started life as a daemon set running on the JVM but went on a diet with a rewrite in Rust. What was once Linkerd TCP and then Conduit has been rebranded as Linkerd 2.0, and that product, now based on sidecars, has this week gone GA. Along with the announcement, Thomas Rampelberg from Buoyant posted a hands-on how-to which shows off the control plane web interface and command line, including integration with Grafana.

ADAM GLICK: Speaking of the CNCF, they voted this week to add Cortex to the CNCF sandbox. Cortex is a multitenant Prometheus back end, originally developed by Weaveworks in 2016, and is used today by Grafana Labs and EA.

CRAIG BOX: Red Hat have just announced the technology preview of Red Hat OpenShift Service Mesh, based on Istio. The technology preview program will provide existing OpenShift customers the ability to deploy Istio on their clusters. No sign-up is needed for this program, and GA is expected in 2019.

ADAM GLICK: Speaking of Istio, Trulia, operators of a real-estate website in the United States, have been busy decomposing their PHP monolith and have written a blog post about their adoption of Istio. Similar to other companies, they've moved to a microservices architecture and were able to use Istio to identify latency in their external dependencies, which caused them to operate outside of their SLOs.

CRAIG BOX: Speaking of Envoy, Heptio announced the release of Contour 0.6. Contour is an Ingress controller for Kubernetes which programs the Envoy proxy server on the edge of a network. Changes in 0.6 include the introduction of an IngressRoute object, added to the cluster as a custom resource, to work around shortcomings in the beta Ingress object in Kubernetes. Contour follows both Istio and the Google Cloud Ingress, which have both added similar custom resources. We'll dig deeper into the state of Ingress in a future episode.

ADAM GLICK: Finally, keep your eyes peeled this week for the release of Kubernetes 1.12, which has an estimated release date of Thursday, September 27. We'll be sure to cover that in next week's show.

CRAIG BOX: And that's the news. Dawn Chen is the lead for Kubernetes SIG Node and the tech lead of the GKE node team. You ever seen one of those job descriptions that asks for someone who has been working on Kubernetes for 10 years? Well, Dawn is one of a handful of people who could legitimately claim this. Welcome to the show, Dawn.

DAWN CHEN: Thank you, Craig and Adam.

ADAM GLICK: Good to have you here. How long ago did you start working on containers and orchestration?

DAWN CHEN: The story began the day I joined Google. Back in 2007, I did the interview. One of the interviewers asked me a question and said, oh, it looks like you have a lot of experience working on grid computing and utility computing. Actually, we are doing the same thing at Google, and we're doing it in production.

The reason she mentioned this is that, at that time, I worked at Veritas Software, in the research lab. We were working on something similar to Borg-- or Kubernetes, today-- but it was always stuck at the prototyping stage, always a research project.

Veritas Software provides software; they don't do the IT work and all those kinds of things. So I was really, really tied up in the research lab, working on something we could not really deliver to help real people. So I said, oh, really! I'm really interested.

So then she said, oh, if you are interested, and if Google hires you, please tell the people who talk to you. So that's how I ended up working on the Borg team, which is the internal version of Kubernetes, back in 2007.

So, when I joined the Borg team, I was supposed to work on the scheduler, just like what I worked on at Veritas Software. The first day I met the team, I met Paul Menage, who is one of the inventors of cgroups. And he told me what he was working on.

Then he also asked, are you interested in systems programming? I said, yes, I'm interested in systems programming, because I had never worked in that area, and also because of my work on the research project.

One of the reasons we couldn't roll out our prototype to production was that we didn't know how to manage those workloads on the node. Back then, there was no container technology. So he described those kinds of things to me, and it really inspired me.

So I said oh, I want to work on the node instead of the scheduling algorithm. This is how I started working on this.

CRAIG BOX: And so you've been working all the way through on node agents for 11 years, now.

DAWN CHEN: When I first started, Borg was a pretty small team. I think the entire team, Borg master and node agent combined, was only six or seven people. I was the second person to join the team to work on the node. But, at the same time, I also worked on the master. So we basically didn't really distinguish node and master back then. But I quickly switched to focusing on the node, because there was so much new stuff there, and that is where the innovation was.

CRAIG BOX: What was it like at the time, when cgroups were being invented and Paul and Rohit came up with the concept?

DAWN CHEN: There was no such thing called "cgroup" [LAUGH] back then. When I first joined, my first work at Google was actually called "process tracking." So basically, this was for Borg: users submitted their tasks, and each job had its own GID. My job was to use the Linux kernel's netlink connection, listen to those events, and figure out which GID each process and each thread belonged to.

Then we grouped them together and used the procfs file system to figure out their CPU usage and also memory usage. And I remember that, after my first job on process tracking, the second job was how to adjust CPU usage by using nice, which is a really, really old technique.
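[Editor's sketch] The accounting step Dawn describes, grouping processes by GID and summing their CPU and memory usage from procfs, can be illustrated roughly as below. This is a minimal sketch and not Borg's implementation: it polls `/proc` rather than listening on a netlink socket, and the function names (`parse_status`, `parse_cpu_ticks`, `usage_by_gid`, `read_live_samples`) are hypothetical. It only assumes the documented layout of `/proc/<pid>/status` and `/proc/<pid>/stat` on Linux.

```python
import os
from collections import defaultdict


def parse_status(status_text):
    """Extract the real GID and resident memory (kB) from /proc/<pid>/status text."""
    gid, rss_kb = None, 0
    for line in status_text.splitlines():
        if line.startswith("Gid:"):
            gid = int(line.split()[1])  # first value after the label is the real GID
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])  # kernel threads have no VmRSS line
    return gid, rss_kb


def parse_cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from /proc/<pid>/stat text."""
    # The command name sits in parentheses and may contain spaces,
    # so only the text after the last ')' is split on whitespace.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3); utime is field 14, stime is field 15.
    return int(fields[11]) + int(fields[12])


def usage_by_gid(samples):
    """Sum CPU ticks and resident memory per GID.

    samples: iterable of (status_text, stat_text) pairs, one per process.
    """
    totals = defaultdict(lambda: {"cpu_ticks": 0, "rss_kb": 0, "procs": 0})
    for status_text, stat_text in samples:
        gid, rss_kb = parse_status(status_text)
        if gid is None:
            continue
        entry = totals[gid]
        entry["cpu_ticks"] += parse_cpu_ticks(stat_text)
        entry["rss_kb"] += rss_kb
        entry["procs"] += 1
    return dict(totals)


def read_live_samples(proc_root="/proc"):
    """Yield (status_text, stat_text) for each process visible under /proc (Linux only)."""
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "status")) as f:
                status_text = f.read()
            with open(os.path.join(proc_root, name, "stat")) as f:
                stat_text = f.read()
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process exited or is not readable; skip it
        yield status_text, stat_text
```

On a Linux machine, `usage_by_gid(read_live_samples())` gives a per-GID snapshot. Polling like this can miss short-lived processes, which is one reason an event-based approach such as the netlink listener described above is attractive.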

But I do know that, back then, Paul Menage and Rohit Seth were working on memory management, and they proposed those kinds of things. So I was really inspired, and really lucky to work with those really intelligent, smart engineers on the first version of the memory management, which is totally different from today's memcg.

ADAM GLICK: When you talk about cgroups, where was the first place that that was used in production?

DAWN CHEN: I believe the first to use it was Google, in Borg. Even before the cgroup concept, even before memory cgroup management was accepted by the upstream kernel, Google's internal production had already started the roll-out. The first version of the memory cgroup we worked on, we called the fake NUMA cgroup. We basically cut the machine's memory into big chunks and assigned them to groups of processes.

So that was the first thing in Google production. I also believe that was the first use of kernel cgroups in the industry.
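[Editor's sketch] The fake NUMA idea, as described here, is coarse-grained: memory is carved into big fixed-size chunks, and each group of processes is handed whole chunks. The toy model below illustrates only that bookkeeping. It is not kernel code and not what Google ran; the function names, the chunk size and the group names are all made up for illustration.

```python
def carve_chunks(total_mb, chunk_mb):
    """Split machine memory into equally sized chunks; any remainder is left unused."""
    return list(range(total_mb // chunk_mb))


def assign_chunks(chunk_ids, requests_mb, chunk_mb):
    """Give each group enough whole chunks to cover its request, first come first served.

    Returns (assignment, free) where assignment maps group -> list of chunk IDs.
    """
    free = list(chunk_ids)
    assignment = {}
    for group, need_mb in requests_mb.items():
        count = -(-need_mb // chunk_mb)  # ceiling division: round up to whole chunks
        if count > len(free):
            raise ValueError(f"not enough free chunks for group {group!r}")
        assignment[group], free = free[:count], free[count:]
    return assignment, free
```

With 1024 MB cut into 128 MB chunks, a group asking for 300 MB receives three chunks (384 MB). That rounding up is the cost of working in big chunks, and finer-grained accounting avoids it.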

CRAIG BOX: This is technology that started allowing Google to run more work on fewer machines-- is the way that we talk about this. Do you think it took a while for the rest of the community to realize what it was that cgroups were and to start adopting this technology outside of Google?

DAWN CHEN: Yes. Even at Google, when we first used cgroups, we didn't really think about what they would end up as today. Right? We also learned over time, in production. We have the feedback. This is the really good part: you create certain things, new technology, and you try it, and it helps with your production issues. But then production actually helps you, by providing an abundance of feedback. So you iterate, and it gets better.

And cgroups are a perfect example, in this case. When I first joined Google, Borg had actually just been released to the first class data center. Not many people believed in Borg, and not many people believed in cgroups yet.

Back then, also, there were no cgroups, as I just mentioned. I was using process tracking to track those workloads. So we spent a lot of time convincing people, and more production services, to switch to using Borg to do the orchestration. And batch workloads, MapReduce workloads, all those kinds of things switched to using cgroups and Borg orchestration.

And then we ended up at the perfect time, when Google realized we could actually pack more workloads into a cluster, into a single node, and increase our utilization dramatically.

ADAM GLICK: What did the Borg team think of LXC, when it was released?

CRAIG BOX: Were you paying attention to what was happening outside of Google?

DAWN CHEN: We did have some conversations with each other. And we both attended the Linux Plumbers Conference and other conferences together. So we were really glad to see the cgroup concept being well accepted by other parties, other organizations, and also by the Linux community in general. Back then we talked about out-of-memory management and namespaces and all those kinds of ideas together.

But it didn't really go over well until Docker leveraged the container technology, made it fly, and it really took off in the cloud industry. That was the moment that changed everything.

CRAIG BOX: So, around that time, we actually released our own internal container runtime. It was called "Let Me Contain That For You," or "LMCTFY." That's a complicated name. Who is to blame for that?

DAWN CHEN: Actually Tim Hockin created that name.

CRAIG BOX: Ah, that says it all really.

DAWN CHEN: [LAUGH] Yeah! I-- I trust his taste.

[LAUGHTER]

But I always have trouble saying the name, so I always have to start with the long sentence, "Let Me Contain That For You."

CRAIG BOX: What was the impetus to release that as open-source?

DAWN CHEN: Back then, we tried to build some container orchestration for cloud users. Let Me Contain That For You was one of those times. Internally, we have the agent called the Borglet, which is kind of like the kubelet in Kubernetes.

In that agent, we have the container management. And the lowest level of the container management is actually "Let Me Contain That For You." It is kind of like today's runc, and also part of the containerd that Docker has. So we open-sourced that one. And, at the same time, Docker launched their first release of the Docker engine.

We even went to Docker headquarters and talked to their engineers, including the founder, and we wanted to collaborate. We even proposed integrating "Let Me Contain That For You" with the Docker engine. So there were a couple of times we tried to leverage container technology and help it be accepted by the cloud community.

ADAM GLICK: You lead the node team, on Kubernetes. What components does that cover?

DAWN CHEN: There are many components: everything that runs on the node to manage it and provide services. The services include node management and node life-cycle management, and also application life-cycle management, such as the container runtime pieces and the kubelet pieces. We also cover node problem detection. That's also part of node life-cycle management, and it helps to diagnose issues on the node and also issues with the really important, critical daemons.

On the node, we also provide resource management services. Say you have a job on the node, and your job is important and has some performance requirements. How do we guarantee that you get the compute resources you need, and that you can access those resources? Those are the kinds of resource-management things.

That includes device management-- for example, GPU, TPU, all those kinds of things. We also provide node-level security isolation, like the early versions with cgroups. And also, at GCP Next, there was the announcement of gVisor in closed alpha. And at KubeCon, we announced sandbox Pod support. All those kinds of things are node-level security and isolation.

And we also manage the logging and monitoring. Everything the customer needs to know: how their application is running, how the workload is running, and also how healthy the node is. You need stats, so we export those kinds of things. And also the logs.

CRAIG BOX: It's probably a shorter list to ask what SIG Node is not responsible for.

ADAM GLICK: Indeed. Speaking of SIG Node, there's lots of SIGs now that are part of Kubernetes. Was that the first of them?

DAWN CHEN: It is the first one. At least, it is one of the first ones we founded. When we first talked to the Kubernetes staff, the Kubernetes community, there was no such interest group. But there were so many people who wanted to talk about the areas I covered earlier, like how to do node management, how to join the cluster-- all those kinds of things.

So, a lot of ideas. That's why I reached out to form this interest group. Back then, there were also other people interested in API machinery, so they formed the API Machinery team. This is how we founded the first two.

CRAIG BOX: So it was basically just the grouping together of people around these areas of interest led to the formation of the SIGs as we understand them today.

DAWN CHEN: Yes. We also wanted to set up better communication with the community: what is the direction? And to set a direction, a roadmap, a vision for our product, in each of the critical components. That's the goal of what we are doing.

And also to group together the interested parties and people who are working on that development. And back then, at an even earlier stage, we started to think about how we manage releases, what the quality is, and, even though it is an open-source project, if there is an issue, how the customer or user can reach the group of people who could provide support. This is why we founded those.

CRAIG BOX: When Kubernetes was launched, it was very tightly coupled to the Docker engine, as it stood at the time. Then there was the rkt runtime from CoreOS, and I believe they did a lot of work to sort of bolt on the option for a second thing, which eventually led to the separation of the idea of the runtime, CRI interface. How did we get from there to where we are now?

DAWN CHEN: Yes, you are so right. I was the first engineer who integrated rkt with Kubernetes. We found back then that Docker was not today's Docker; it was also at an earlier stage. There were a lot of production issues, and a lot of support churn.

So the Kubernetes community, especially the SIG Node community, spent a lot of time qualifying each Docker release and giving recommendations to customers. But, no matter what we did, there were integration issues, because the two components were not that compatible. For example, Docker treats the container as the smallest scheduling unit, but Kubernetes does not: we have the Pod concept. There were also network incompatibilities between them, and storage-- all those kinds of things.

This is why the rkt project was founded. rkt was founded to serve the Kubernetes community, so that's why we worked on the integration. While doing that integration, many other demands came in. For example, the earlier phase of Kata was called the "Hyper container," and they wanted to integrate with Kubernetes.

Then there were also other people who wanted to integrate with Kubernetes-- for example, LXC and LXD. There were people who wanted Kubernetes to support VMs, based on VM images instead of standard container images. There were people who wanted Kubernetes to manage Solaris nodes. And there were people who wanted Kubernetes to manage Windows.

So there were many requests that came to SIG Node and the SIG Node TL. As the lead, I can say that people wanted to support different types of workloads. Also at that time, the OCI, the Open Container Initiative, was formed, and they started to propose the first draft of their OCI specification.

So I thought, what if we could define our own API? If we define the container-level API for Kubernetes, then we could make Kubernetes much easier to extend. That's why I initiated this Container Runtime Interface project.

SIG Node successfully integrated that. We rolled out the Container Runtime Interface-- I think it's more than two years ago. The first implementation was Docker. Since then, we have worked together with a group of engineers in the container community, including a lot of engineers from Docker, engineers from IBM, and also many other companies, on the next generation of containerd.

At the same time, there was a group of engineers working on CRI-O. Today, both containerd and CRI-O are productionized, and both are running in some production environments.
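[Editor's sketch] The point of the Container Runtime Interface is that the kubelet talks to one narrow API and any runtime implementing it can be plugged in. The sketch below shows that shape in miniature. The real CRI is a gRPC API with many more calls and richer messages; the three method names here only loosely mirror its `RunPodSandbox`, `CreateContainer` and `StartContainer` RPCs, and `FakeRuntime` and `start_pod` are hypothetical stand-ins.

```python
import itertools
from abc import ABC, abstractmethod


class ContainerRuntime(ABC):
    """A drastically simplified, hypothetical stand-in for a CRI runtime service."""

    @abstractmethod
    def run_pod_sandbox(self, pod_name: str) -> str:
        """Create the pod-level environment and return its ID."""

    @abstractmethod
    def create_container(self, sandbox_id: str, image: str) -> str:
        """Create a container inside a sandbox and return its ID."""

    @abstractmethod
    def start_container(self, container_id: str) -> None:
        """Start a previously created container."""


class FakeRuntime(ContainerRuntime):
    """In-memory runtime used only to show that callers depend on the interface."""

    def __init__(self, name):
        self.name = name
        self._ids = itertools.count(1)
        self.sandboxes = {}   # sandbox ID -> list of (container ID, image)
        self.running = set()  # IDs of started containers

    def run_pod_sandbox(self, pod_name):
        sandbox_id = f"{self.name}-sandbox-{next(self._ids)}"
        self.sandboxes[sandbox_id] = []
        return sandbox_id

    def create_container(self, sandbox_id, image):
        container_id = f"{self.name}-ctr-{next(self._ids)}"
        self.sandboxes[sandbox_id].append((container_id, image))
        return container_id

    def start_container(self, container_id):
        self.running.add(container_id)


def start_pod(runtime: ContainerRuntime, pod_name, images):
    """Kubelet-like caller: one sandbox per pod, then one container per image."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    container_ids = []
    for image in images:
        container_id = runtime.create_container(sandbox_id, image)
        runtime.start_container(container_id)
        container_ids.append(container_id)
    return sandbox_id, container_ids
```

Because `start_pod` only uses the abstract methods, swapping `FakeRuntime` for another implementation needs no change in the caller. The sandbox-then-containers order also reflects the Pod concept mentioned above: the pod, not the individual container, is the unit the caller starts from.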

CRAIG BOX: What work do you have to do, to be able to support some pods running with one runtime and some running with another? For example, with gVisor you don't want to sandbox all the pods; you need to be able to support multiple runtimes per node.

DAWN CHEN: Yeah, this is a really good question. [LAUGH] This kind of question comes to SIG Node a lot. Obviously, today we have a way to do that: we can use labels. Right? You can set a label, and each node has one container runtime. At an early stage, SIG Node already made a decision that we don't want to support multiple container runtimes per node. That has worked very well.

We also made a decision that each operating system image decides which container runtime it bundles. But, at the same time, this year we also proposed the sandbox API, which is called the "container runtime handler." The proposals were approved by SIG Node, proposed to SIG Architecture, and well accepted by the community.

So, stay tuned! Very soon, we are going to have that one. I think in 1.12 there will be an alpha release. And we are pushing forward to go through beta and GA next year, very soon.
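[Editor's sketch] The two approaches in this answer can be compared as manifests: steering a pod to nodes that carry a label, versus naming a runtime handler on the pod itself. The sketch below builds both as plain Python dictionaries. It uses the later, stable `node.k8s.io/v1` RuntimeClass shape (a top-level `handler` field and the pod's `runtimeClassName`), not the 1.12 alpha discussed in the episode. The label key `example.com/container-runtime`, the class name `gvisor` and the handler name `runsc` are assumptions for illustration, and the helper functions are hypothetical.

```python
import json


def runtime_class(name, handler):
    """Build a RuntimeClass object mapping a class name to a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def pod(name, image, runtime_class_name=None, node_selector=None):
    """Build a single-container Pod, optionally pinned by node label or runtime class."""
    spec = {"containers": [{"name": name, "image": image}]}
    if runtime_class_name is not None:
        spec["runtimeClassName"] = runtime_class_name
    if node_selector:
        spec["nodeSelector"] = dict(node_selector)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Label approach: one runtime per node, pods choose nodes by (hypothetical) label.
    labelled = pod("legacy", "nginx",
                   node_selector={"example.com/container-runtime": "gvisor"})
    # Handler approach: the pod names a runtime class; the node resolves the handler.
    sandboxed = pod("sandboxed", "nginx", runtime_class_name="gvisor")
    for obj in (runtime_class("gvisor", "runsc"), labelled, sandboxed):
        print(json.dumps(obj, indent=2))
```

The label approach keeps one runtime per node and moves the choice into scheduling, while the handler approach lets sandboxed and regular pods share a node, which is the case Craig asks about.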

ADAM GLICK: Thanks, Dawn, it was really great having you on the show.

DAWN CHEN: Thank you!

CRAIG BOX: Thank you, as always, for listening. If you've enjoyed the show, it really helps us if you spread the word and tell a friend or rate us on iTunes. If you have any feedback for us, especially the five-star kind, you can find us on Twitter at @kubernetespod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]
