Kubernetes Podcast from Google: Episode 38 - Kubernetes Failure Stories, with Henning Jacobs

#38 January 29, 2019

Kubernetes Failure Stories, with Henning Jacobs

Hosts: Craig Box, Adam Glick

You learn so much more from failure than success. Henning Jacobs, head of Developer Productivity at Zalando, joins Adam and Craig to share his own stories of failure, and talk about what he has learned by reading stories from others.


Links from the interview

Kubernetes Failure Stories blog post
- GitHub repo
- Hacker News post
Zalando
A Million Ways to Crash Your Cluster
- Original version of the talk from the Dusseldorf meetup
Tacoma Narrows Bridge collapse
Nordstrom talk at KubeCon NA 2017
Serverless Failure Stories
Startup scripts used to just kill the Docker daemon
90 days of EKS in production: configuration options you need to set
CPU throttling
Facebook oomd
John Wilkes: only make new mistakes
Henning Jacobs on Twitter

Transcript


CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: Let's start with Adam's Game of the Week.

ADAM GLICK: I was looking for something interesting this weekend. And just couldn't believe, I looked through the top lists, and every game seems to be, obviously a slight overstatement, but like a "Clash of Clans" clone, a "Bejeweled" clone, or a slot machine clone.

CRAIG BOX: This is an interesting conversation. Would you like to pay $10 to continue it? 9, 8, 7-- the pressure is on, my friend-- 6--

ADAM GLICK: Can I buy coins?

CRAIG BOX: --5--

ADAM GLICK: Is there a loot box for me?

CRAIG BOX: Perhaps-- we're going to give you a few coins to make you feel like you've achieved something.

ADAM GLICK: I was looking for something different. And I stumbled across a game I'd heard about before-- it came out a few years ago-- called "A Normal Lost Phone," which basically the game is narrative-driven. And it creates a fake kind of phone experience in the app.

And you basically have found someone's lost phone. And you're trying to discover who they are, what's going on in their life. And it was just, it was-- they do some clever mechanics as to how do they keep things hidden, how do you find things at certain points, how do you make sure you're reading the story in order. It was clever. I enjoyed it. It was fun to go through.

CRAIG BOX: Brilliant.

ADAM GLICK: How about you, Craig? What have you done in the last week? Have you gotten some sunshine?

CRAIG BOX: There was occasional sunshine here. There is also occasional snow, which is just the brilliance of the UK weather, overall. I did get a chance to get out last week and see New Zealand rock royalty Neil Finn and his son, Liam-- in fact, the whole Finn family-- performing in East London. Quite the opposite side of town from me under normal circumstances!

ADAM GLICK: Would people know those people from other bands by chance?

CRAIG BOX: Yeah, so Neil Finn's had a very storied career. Our American audience will probably remember a song called "Don't Dream It's Over," which was a number two by Crowded House in 1985. It unfortunately didn't make it to number one in the US. They were solely a number two one-hit wonder over there. But they had a fantastic worldwide career, especially in my native New Zealand, where the Finns are from.

And Fleetwood Mac fans will have opinions on the replacement of Lindsey Buckingham with Neil Finn in the band this year, but apparently everyone's kind of warmed to him. He's filling in that role very well.

But he found a break from Fleetwood Mac tour to continue touring an album that him and his son Liam, who is also a fantastic musician, put together. And we got to see that show last week. It was a very intimate show in a very old theater in Hackney. And for someone who can sell out stadiums, and performs in those kind of bands, it's always nice to have the little shows for the fans for these little obscure records that he puts out every now and then.

ADAM GLICK: Nice-- shall we get to the news?

[MUSIC PLAYING]

CRAIG BOX: CoreDNS has graduated as the fourth top-level project in the Cloud Native Computing Foundation, joining Kubernetes, Prometheus, and Envoy. CoreDNS is a fast, flexible, modern DNS server written in Go and the recommended DNS service for Kubernetes as of 1.13. CoreDNS was originally written by Miek Gieben, then an SRE on GKE here in London, and now has 16 active maintainers and over 100 contributors.

ADAM GLICK: Intel has released Nauta, an open-source distributed deep-learning platform for Kubernetes. Nauta provides a multi-user distributed computing environment for running deep-learning model-training experiments on Intel-based systems. Results can be viewed and monitored using the command line, web UI, or TensorBoard. Intel describes Nauta as a production-ready version of Google's Kubeflow, in which Intel is the third-largest code contributor. If you want to know more about Kubeflow, we discussed it with David Aronchick back in show number two.

CRAIG BOX: Ian Lewis, a developer advocate at Google Cloud, has published the fourth and final part in a series of blog posts on container runtimes. Over the past months, Ian has focused on low- and high-level container runtimes. And in the final post, he wraps up by looking at the container runtime interface, or CRI, and how Kubernetes interacts with CRI-compliant runtimes. If you're looking to understand how the pieces fit together with running containers in a Kubernetes cluster, this is a great series to read.

ADAM GLICK: Google Cloud has started a blog series on Istio to provide a practical guide where they cover all kinds of user perspectives, from developers and cluster operators to security administrators and SREs. Through real use cases, they'll shed light on the what and the how of using the Istio service mesh. The first post in the series is by Megan O'Keefe, and defines what Istio is and some of the challenges Istio can solve. If you want to know more about Istio, check out our interview with Dan Ciruli and Jasmine Jaksic in Episode 15.

CRAIG BOX: OVH, a French cloud company and VPS provider, launched their managed Kubernetes service to beta in November. They've posted about how that service operates. They use Kubernetes to manage the master components for the other Kubernetes clusters, which they call kube inception. Similar approaches have been discussed by Giant Swarm and SAP. And we have links to all of them in the show notes.

ADAM GLICK: The GKE plug-in for Jenkins is now generally available. This lets you set up a GKE cluster as a Jenkins publisher to easily deploy code to GKE clusters.

CRAIG BOX: Jussi Nummelin from Kontena has been playing with GitHub Actions, GitHub's new workflow tool, currently in private beta. Jussi has integrated Kontena's Mortar deployment tool to trigger deployments when pushes are made and helps explain how you can set up this workflow or one like it yourself.

ADAM GLICK: In another sign that the market is hot for those with Kubernetes skills, an article was published this week showing what the benefits are of working in the cloud-native world. The article says that there are almost 17,000 open jobs on LinkedIn in the US alone, and that salaries range from $75,000 to over $215,000 a year, with the average being over $144,000 a year, just another proof point that Kubernetes skills are in high demand and the need for good engineers is growing.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

Henning Jacobs is the Head of Developer Productivity at Zalando and responsible for providing a cloud-native application runtime to over 1,100 developers. Last week, Henning published a blog post and a GitHub repository with a list of Kubernetes failure stories, postmortems, and other outage documentation. And he's also contributed to this with his own stories from Zalando. Welcome to the show, Henning.

HENNING JACOBS: Yeah, welcome, nice to be here, grateful for the invitation.

ADAM GLICK: Happy to extend it-- just to give people a sense of who Zalando is, can you tell us, what is Zalando and what do they do?

HENNING JACOBS: Yeah, Zalando is Europe's biggest e-commerce fashion site. So we basically sell fashion, shoes, et cetera, online in 17 European countries.

CRAIG BOX: Is it an online-only operation? Or do you have a physical presence?

HENNING JACOBS: Yeah, we have some outlet stores, but only four to five in Germany. And so it's online.

CRAIG BOX: How long have you been at Zalando?

HENNING JACOBS: That's a hard question. So I started 2010 at Zalando and partly was tasked as a system operator at that time, and then moved between teams. So now, since a year and a bit, I'm responsible for the developer infrastructure, first was basically in infrastructure for a longer time and switched roles back and forth.

ADAM GLICK: And being in e-commerce, I assume that you have fairly spiky workloads depending on sales, time of year, et cetera.

HENNING JACOBS: Yeah, it depends a little bit. So we have different business models. One is more like a shopping club, which is really campaigns and spiky workloads or spiky loads. But the regular fashion store online has quite regular spikes over the week. So this is kind of pretty regular over weeks.

ADAM GLICK: A little more predictable, the load.

HENNING JACOBS: Yeah, it's predictable, that one part of the business. Not everyone is predictable, but--

CRAIG BOX: How did Kubernetes find its way into Zalando?

HENNING JACOBS: We actually, like, migrated to the cloud 2015. And then later on, we discovered that just running EC2 with Docker on it is not really scalable and future-proof. So in 2016, we looked into Kubernetes and then started slowly ramping this up. And since last year, we basically consider this generally available and now try to achieve 100% overall Kubernetes adoption.

ADAM GLICK: When you say 100%, you mean that 100% of your operations are running in Kubernetes and not in VMs anymore?

HENNING JACOBS: Yeah. Of course, this is not the case yet. But, like, until end of next year-- or this year. It's now 2019 already. And we tried to ramp this up. I mean, 100% is always kind of hard to achieve-- but for all applications which ran in the cloud on EC2 to be on Kubernetes. We still have some, I call them vintage data centers, which are not really to be considered much. There are only a few deployments every week. But these are a different story.

CRAIG BOX: How has Kubernetes improved the developer experience at Zalando?

HENNING JACOBS: Yeah, that's a very good question, because I think this is the main plus for Kubernetes, is that actually, we want to improve the developer experience overall. And Kubernetes gives us the tooling to make it extensible, so for example, integrate our Zalando IAM system as custom resource definitions, integrate Postgres with our Postgres operator. So we now have kind of a unified view on things. We have the Kubernetes API and CI/CD. And of course, it's much faster to deploy containers and start them up than to deploy VMs or start up EC2 instances.

CRAIG BOX: And has it all been smooth sailing?

HENNING JACOBS: Of course not. I wouldn't share this failure stories topic if it would be smooth sailing. So I think there are, like, technical, organizational-- I mean, there are a lot of challenges, like every kind of big project. But yeah, Kubernetes is in a lot of forms not really matured yet. And yeah, of course you have to deal with that, and also with the release velocity of Kubernetes. So yeah, we can, I mean, dive into the failures right away. But I guess this is no secret that you shouldn't take this lightly to go to Kubernetes and cloud native if you don't have so much experience with that yet.

CRAIG BOX: So your first experience with sharing failure stories was in a talk at-- was it DevOpsCon in Munich last year?

HENNING JACOBS: Yeah, that was actually the most recent talk. But actually, I started with a small Meetup talk in Dusseldorf, which is also linked on my blog post. It's basically similar. And then I evolved it over time and will also, like, include more failure stories, which come up. So I think this will be a never-ending talk series.

CRAIG BOX: What made you decide to share the stories of failures with your Kubernetes environment?

HENNING JACOBS: I think this is totally driven by my own and/or our motivation to learn from other people's experiences and failures. And I think one good way is giving back and actually showing people that this is not such a big deal to share these failures. And if we talk more about it, I think it also takes maybe away some fears of others to talk also about these topics.

CRAIG BOX: Was it hard to get sign off from your management to be so public in going out and talking about incidents of failure?

HENNING JACOBS: Yeah, this is the nice part about Zalando. [LAUGHS] I mean, now I'm responsible for the whole developer infrastructure, including Kubernetes. And I think I can make a lot of calls myself, and of course, making sure that we don't share sensitive and business topics.

But I think it's in our interest as a company to really engage with the community, learn from each other. Like, we have a lot of open-source projects already out there. And I think this is part of how we want to engage and be a good citizen in the open-source community. So there was no extra approval or something.

CRAIG BOX: Your PR people didn't want to get involved?

HENNING JACOBS: Yeah, I guess we wouldn't put it as a big press release out there, like, how Zalando failed. But maybe, when I look for failure stories, it's often how things are framed is, this is kind of how we migrated. This is the success stories. And then on the way, the challenges, I explained.

And I think I wanted to turn it a little bit around to make it more open to learn about the failures. But I think a lot of these failures and challenges are always hidden in the success story. So if you would do a press release and you put, probably, like, do it in the reverse, and how in the end, this was a success. And these were the challenges we are overcoming.

ADAM GLICK: I always think of stories where things have failed as being great examples that people can learn from. I joke about, the only bridge that people know about where I live is the Tacoma Narrows Bridge. And they only know about that because of its catastrophic failure that happened and is now in every engineering textbook. You've decided to aggregate failure stories.

HENNING JACOBS: Yeah.

ADAM GLICK: Why did you decide to aggregate not only what happens in your organization, but other failure stories as well and put them together?

HENNING JACOBS: What I forgot to mention, which is also in my blog post is, of course, I didn't come up with this great idea to now do a big failure series. Actually, others were before me. And others were adopting also Kubernetes earlier than we did at Zalando.

So I think this Nordstrom talk at KubeCon, which I listened to, was very impressive for me. And I said, OK, this is kind of-- it was the best talk at this KubeCon. I think it was 2017. I don't remember, actually.

But so this had an impression on me to say, why don't we have more of these talks where we really go through these stories, where these are both entertaining and engaging, and a lot to learn from. So this is-- I think, as you can already see from the titles, I just stole somehow this title and made it a million ways to crash a cluster.

ADAM GLICK: What makes it relevant to talk about the failures in the cases of Kubernetes versus things that have happened in the past?

HENNING JACOBS: Yeah, I guess if you look at traditional environments like in the data centers or similar, then it's always so diverse how this is set up and so dependent on how your vendor operates a data center, et cetera. So of course there is also a lot of failure stories and postmortems to learn from. But if I think about stuff going on in our data center, if there is a core router failing or similar, this would not be something where the community could learn from, because it highly depends on the network set up, and vendors, and service providers, et cetera dealing with this in the data center, for example.

Of course, the cloud, like, without Kubernetes, is also something to learn from. And I think a nice example is that, immediately after I launched these Kubernetes failure stories repo, there was already a serverless failure stories repository. So I think there are a lot of more topics where we can also do the similar topic.

But of course, in my example, Kubernetes is what we operate and offer. And Kubernetes has a great attribute, that it's now, like, so common and ubiquitous, that it's even greater to learn from each other. And so it's more important that we share our failure stories, because we are so many users.

ADAM GLICK: Yeah, I'm really glad that you do. Everyone should take a read through some of these, because there is some great lessons in it. You've been collecting these stories. And I'm wondering, since your work with Zalando goes back before your usage of Kubernetes, were you doing this before Kubernetes? Or is this something that started recently as you've been in the Kubernetes community?

HENNING JACOBS: No, I must confess, this started recently. So I would have liked to say that I am great at learning from failures, and we always did postmortems since 2010. But I think we are also growing up as a company. And also the whole topic of operations reliability, SRE, 24/7 on-call, dealing with incident response, postmortems, and so on, this is also evolving over time.

Of course, this was also already there before Kubernetes. But yes, I myself was only kind of, of course, involved with postmortems inside the company, but didn't really consider sharing this. And I think this is also kind of the cycle that we engage with this open-source community with the components we have on GitHub for Kubernetes, that this is actually a natural trend that you also think about what else to share and learn from. Because it's not only about code, it's also about how to deal with the code and components, how they work together, and how to actually set this up.

CRAIG BOX: This is an area where people with experience in running services have a lot more insight, especially on how the parts work together, than perhaps even the people who built the individual components. So given that we're here today with a world expert in the failure states of Kubernetes, let's talk about a few of the specific areas that are often cited in these postmortems as well as your own. Let's start off by talking about etcd.

HENNING JACOBS: Yeah, etcd-- I don't want to go too deep into etcd. I think one topic which came up in different places is-- and also one failure we had, for example, which I also shared in the talk is the connection from API server to etcd, and the different ways how to do it, where to operate etcd, so if it's on the master nodes, if it's separate outside of Kubernetes. That is, if we really run it on our own; of course, with managed Kubernetes like GKE, we wouldn't even see this component.

And I think this is not as straightforward as one might think. For example, in our case, this was really the connection from API server to etcd, and discovering that the client is not really resilient to, for example, small network failures. And one packet loss in one direction can actually lead to these hanging connections and never recovering.

And this is a little bit disappointing from our experience of course, because we are talking about cloud native. And then you discover, like, internal components not really living up to the principles of cloud native in being resilient to the small network failures. But I don't want to go into etcd itself, because I also don't have all the expert knowledge there. I guess, how components interact-- I mean, this is distributed systems. This is the fun part, running one etcd node.

CRAIG BOX: Are there other components in the Kubernetes environment where the resilience isn't there?

HENNING JACOBS: A lot of the failure stories I shared also linked them to some GitHub issues. And some are also already resolved. I think another topic which is now resolved was kind of kubelet connection to the Kubernetes API server. I'm not 100% sure if it's really solved.

But there was also kind of missing resilience against, for example, the change of IPs of the API server endpoint and reconnecting. So all these hanging connections or things which are not recovering automatically, these of course kind of lead to some frustration, and also potentially incidents. And I guess there are a ton of more of these cases which I don't know of and we will find only by looking into failure stories.

ADAM GLICK: What are some of the examples from cloud vendor integration?

HENNING JACOBS: Yeah, so this is a long story. There was in the "Hacker News" thread also a comment that, like, most of the failure stories are related to AWS. And my answer was, of course this is a really biased sample, because you have to operate this on your own on AWS or also on other cloud vendors. And this of course leads to maybe more potential problems to do things wrong.

But in this case, yes, for example, PersistentVolumes was always a big pain point. I think it's now better. But this was, for a long time and maybe still is in some corner cases, a huge pain point for us on AWS, so there are the PersistentVolumes not really being attached to the node, et cetera.

Of course, there are also a lot of other topics. And one big topic which is, I think, also not really solved on Google-- I cannot say for Azure-- is, for example, IAM integration, which is maybe not core Kubernetes, but it's something where, for example, in the AWS world, iptables is redirected to a proxy. And there is race conditions which lead to kind of problems-- so to work around these PersistentVolumes AWS, IAM integration.

Then there are of course the usual topics, like rate limiting from cloud vendors. So if the core components or control loops do too many API calls to the cloud vendor, then you're basically DDoSing yourself. And that is a huge pain, which you cannot recover yourself from.
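As a rough illustration of the rate-limiting point above, a control loop can avoid hammering a cloud API by backing off exponentially with random jitter whenever it is throttled. The sketch below is not taken from any Kubernetes component; `call`, `is_throttled`, and `sleep` are hypothetical callables supplied by the caller, and the default delays are arbitrary assumptions.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Each wait is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so retries spread out instead of arriving in synchronized bursts.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def call_with_backoff(call, is_throttled, sleep, **kwargs):
    """Retry `call` while `is_throttled(result)` is true, sleeping in between.

    Gives up after the configured number of attempts and returns the
    last result, throttled or not.
    """
    result = call()
    for delay in backoff_delays(**kwargs):
        if not is_throttled(result):
            return result
        sleep(delay)
        result = call()
    return result
```

In real use `sleep` would typically be `time.sleep`; passing it in keeps the retry logic easy to exercise without waiting.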

CRAIG BOX: What about Docker or the container runtime?

HENNING JACOBS: Luckily, Docker is now not another big topic for us anymore. But I think this was for a long time. And I guess a lot of people will agree. And there are some really horrible cases where Docker failed us.

So if the Docker daemon is hanging and you really don't know what to do, so Kubernetes cannot talk with the Docker daemon, you cannot really do anything, but the workloads are still running but you cannot really gracefully move them away or do anything, because also the Docker daemon restart was restarting the containers in the past. So this is one example. There are also like a lot of race conditions or pipe wait states of processes in the Docker daemon which also led to investigations on the Docker node or the node-- and not really able to do anything.

And I mean, no secret, this is mentioned in a lot of talks. Even Google has on GKE this infamous bash script which checks if the Docker daemon is still responsive, I think every 10 seconds, and then kills the Docker daemon. So yeah, even cloud providers need to find workarounds to make some things component.
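The kind of watchdog described here can be sketched in a few lines. This is not the script used on GKE, only a minimal illustration of the idea: run a health command with a timeout and trigger a restart when it fails or hangs. The function names, the timeout, and the `restart` callback are assumptions made for the example.

```python
import subprocess
import sys


def is_responsive(command, timeout_seconds):
    """Return True if `command` exits with status 0 within the timeout."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hang (timeout) or a missing binary both count as unresponsive.
        return False
    return completed.returncode == 0


def watchdog_tick(health_command, restart, timeout_seconds=10.0):
    """Run one health check; call `restart()` if the daemon did not answer."""
    healthy = is_responsive(health_command, timeout_seconds)
    if not healthy:
        restart()
    return healthy
```

On a real node the health command might be something like `["docker", "ps"]`, with `watchdog_tick` called in a loop every few seconds; both choices are assumptions here, not details from the interview.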

CRAIG BOX: Yeah, I think collectively we've come to the conclusion that having a daemon in that space isn't the right thing, which had led, obviously, to Docker decoupling some of this into containerd. And I'm glad to hear that those problems sound like they're predominantly on the way out.

HENNING JACOBS: Yeah, and I'm looking forward. I now hear the first people really running containerd in production. So we don't do that yet-- and looking forward to alternative, more lightweight, and maybe runtimes which cause less trouble. But all in all, Docker is, right now, no issue anymore for us. Though maybe we just survived the hell. And now it's kind of stable enough to not be on topic.

ADAM GLICK: What about DNS?

HENNING JACOBS: Yeah, this was actually the trigger for this blog post. So we had an incident in the beginning of this month regarding DNS, so cluster DNS. And this is not totally-- I mean most of these failure stories are not completely a fault of the Kubernetes core components or anything, but how they play together and how to size them for your workloads. And DNS is one of these things, if DNS is down in your cluster, then your workloads have a hard time to do anything.

And I think this is also the other nice blog post about first 90 days with EKS, also mentions how to tune KubeDNS and making sure that it, like, grows with your capacity or your workloads running on the cluster. And in this case, how it failed us was really something where we had a huge spike of DNS requests from a Node.js app. And this spike was very sudden. So even Prometheus metrics, which was scraping every 15 seconds, didn't really see this memory bump. So it was not kind of a low, slow-growing memory curve or anything.

And this basically led to CoreDNS, which we are running in production, kind of being killed, as it was out of memory. And because these requests are also constantly trying to do the DNS again, we had to bump this a lot to two gigabytes of memory just to kind of at least recover this first burst of DNS requests. And now we are really looking into this node local dnsmasq plus CoreDNS, and hopefully also share more with the community.

So we have an internal narrative how we set this up, how to configure caching. I mean, there are a lot of knobs you have to tune and know about. There is this "ndots:5" problem. There is whether you want to cache NXDOMAIN or not, whether you run CoreDNS caching, or dnsmasq, et cetera, et cetera. So a lot of options, and barely nobody really tells you how to evaluate these different options and what is the right setup for your case.
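To make the "ndots:5" problem mentioned above concrete: with a resolver `search` list and `ndots:5`, any name with fewer than five dots is first tried with every search domain appended, and only then as written. The sketch below models that ordering; the search domains are typical for a pod in the `default` namespace but are assumptions here, and the function name is hypothetical.

```python
def candidate_queries(name, search_domains, ndots=5):
    """List the fully qualified names a stub resolver would try, in order.

    A name ending in "." is already absolute and is tried as-is.
    Otherwise, names with fewer than `ndots` dots go through the search
    list first; names with at least `ndots` dots are tried as-is first.
    """
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in candidate_queries("api.example.com", search):
    print(query)
```

With these assumed search domains, one lookup of an external name turns into four queries, three of which can only ever produce NXDOMAIN. That is why caching negative answers, lowering `ndots`, or writing names with a trailing dot all change the load on cluster DNS.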

CRAIG BOX: A common failure case that people had described to me in the past was basically running out of CPU on a node and then finding that some of the system components were killed as opposed to user workloads. We have now requests, and limits, and so on which we can apply to these workloads. And if you happen to have your kube-proxy killed, for example, that might cause you a bit of a bad time. But what problems have you seen around the quotas and scheduling of workloads?

HENNING JACOBS: I think this is a whole different can of worms, this Kubernetes resources. Because I think, one big chunk is kind of how also your end users, or developers, or Kubernetes users know about these knobs. So this is one thing where you kind of have some education needs.

But also for the cluster operators, as you already mentioned, system components you want to have with higher priority with kind of guaranteed quality of service, et cetera. You would probably want to have allocatable reserved resources per node. So also in this case, you have to know about all these different knobs and know about what components actually you run per node, how to kind of give them the right resources.
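The reserved-resources idea can be written down as simple arithmetic: what the scheduler may hand out to pods is the node's capacity minus what is set aside for Kubernetes daemons, for the operating system, and for the eviction threshold. The sketch below works in MiB with made-up numbers; the function and parameter names are hypothetical, not kubelet settings.

```python
def node_allocatable(capacity, kube_reserved=0, system_reserved=0,
                     eviction_threshold=0):
    """Return the amount of a resource left for pods on one node.

    All arguments use the same unit (for example MiB of memory).
    The result is clamped at zero so over-reserving cannot go negative.
    """
    allocatable = capacity - kube_reserved - system_reserved - eviction_threshold
    return max(allocatable, 0)


# Assumed example: a 16 GiB node with 1 GiB for Kubernetes daemons,
# 512 MiB for the OS, and a 100 MiB hard eviction threshold.
print(node_allocatable(16384, kube_reserved=1024, system_reserved=512,
                       eviction_threshold=100))
```

The point of the subtraction is that system components get their share before any user workload is scheduled, instead of competing with pods once the node is already full.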

So this is all from cluster operator side. And there is no kind of recipe to do it end-to-end for whatever set up. So that's why we also actually shared, for example, our whole cluster configuration on GitHub so you can at least compare whatever you want to do in your company with what we configured.

So this is the cluster operator side. I think this is really about tuning the resources. Then there is this whole user side to resources. So you have to know what is the impact of limits.

And CPU throttling is a big topic, because it's not only throttling you if you hit the limit, but also already before, which is kind of a known kernel bug. But actually, when I say known, it's not known to the Kubernetes community. So everybody who has actually low-latency workloads will hit this one day or another. And then the first step is always removing limits from the pod or container resources.

But as we learned on one of the last KubeCons, one option is really disabling CFS quota on the kubelet level. And that's actually what we did in all clusters. And this works for us. I wouldn't say to do this for all clusters out there, because in our case, we have this kind of scope per cluster. And we don't really have multitenancy per cluster or different teams who are working on one cluster. They are still in one area and can kind of align.
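For readers who have not met CFS quota before, a CPU limit is enforced as a budget of CPU time per scheduling period, 100 ms by default. The sketch below shows only that nominal arithmetic, under the assumption of the default period; it does not model the kernel bug Henning mentions, where throttling happens before the budget is used up. The function names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, in microseconds


def cfs_quota_us(limit_millicores, period_us=CFS_PERIOD_US):
    """CPU time budget per period, in microseconds, for a given CPU limit."""
    return limit_millicores * period_us // 1000


def throttled_periods(cpu_demand_us_per_period, quota_us):
    """Count the periods in which demand exceeded the budget."""
    return sum(1 for demand in cpu_demand_us_per_period if demand > quota_us)


quota = cfs_quota_us(500)  # a limit of 500m
# Four threads each burning 20 ms in the same period need 80 ms of CPU time,
# which exceeds the budget even though the average load looks low.
print(quota, throttled_periods([20_000, 80_000, 50_000], quota))
```

A multi-threaded, latency-sensitive service can therefore be throttled in a burst while its average CPU usage stays well under the limit, which is one reason teams remove limits or, as described above, turn off quota enforcement at the kubelet (the `--cpu-cfs-quota` setting).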

So yeah, I mean, there are probably a lot of more, like, paths to go down now. We can talk about memory, and out of memory, and whether you want to have overcommit or not, and what guidance to give to your users or not. We decided not to overcommit for memory. So we set request equal to limit to prevent this OOM kill by the kernel. Like, also looking at this Facebook oomd blog post, you want to prevent this horrible scenario that even the cluster operator cannot do anything anymore on the node because it's kind of getting out of control.
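The reasoning behind not overcommitting memory can be checked with a small calculation. Pods are placed according to their requests, but they may grow up to their limits; if limits are higher than requests, the sum of limits on a node can exceed what the node actually has. The sketch below uses made-up pod sizes in MiB, and the function name is hypothetical.

```python
def memory_overcommit(pods, allocatable_mib):
    """Summarize memory commitment for pods placed on one node.

    `pods` is a list of (request_mib, limit_mib) pairs.
    "schedulable" means the requests fit; "overcommitted" means the pods
    could together use more memory than the node can provide.
    """
    total_requests = sum(request for request, _ in pods)
    total_limits = sum(limit for _, limit in pods)
    return {
        "schedulable": total_requests <= allocatable_mib,
        "overcommitted": total_limits > allocatable_mib,
    }


# Eight pods requesting 512 MiB but allowed to grow to 2 GiB each.
print(memory_overcommit([(512, 2048)] * 8, allocatable_mib=8192))
# The same node with request equal to limit: no overcommit is possible.
print(memory_overcommit([(1024, 1024)] * 8, allocatable_mib=8192))
```

When request equals limit for every pod, anything that fits by request also fits by limit, so the node cannot be pushed into a kernel out-of-memory situation by pods growing into headroom that was never there.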

So I think this is a mixture of knowledge for your users, and educating them, doing the right thing as a cluster operator, but also finding a strategy how this whole setup works in your organization and how much control the user has, and cluster operator, and how this all fits together, how much more multitenancy or not you want to allow. So this is basically something, probably, we can talk about three hours. And we will still find topics which are unclear or questions to answer.

ADAM GLICK: What do you think is the operational failure story that you personally learned the most from?

HENNING JACOBS: I think, like, always most interesting learnings are actually not so much about the deep technical topics. For example, taking the DNS example, it's actually not so interesting to find the right setup. Of course, it's technically interesting, but actually, the organizational learnings are often much more interesting.

So OK, how does incident response work? Was the escalation chain working? Are we really setting the right priorities in what we do? So overall, I would say, in the individual failures, the technical details, those are kind of just things you learn once from.

But the overarching topics are organizational priorities, incident response processes, right mindset, et cetera-- so that's why I would say, like, each individual incident, there was none which totally was outstanding. There were some of them which were in the beginnings, which I also showed on the talk, where we basically made mistakes in relying too much on certain behaviors. For example, our ingress controller relied on the API server to be available. And this is, of course, something you have to learn once and then hopefully not repeat the same mistake again.

CRAIG BOX: Along those lines, my colleague John Wilkes who gave a lot of talks about Omega, a project he was one of the tech leads on, and then leading into Kubernetes as it was open sourced. He said that a lot of the reason that he talked about this was to ensure that people only made new mistakes.

HENNING JACOBS: Yeah.

CRAIG BOX: He said, he doesn't mind people making mistakes, but they shouldn't be making the same ones over again. Do you feel that the Kubernetes community is at the point where they are only making new mistakes?

HENNING JACOBS: Not yet, I guess. I mean, like, when I collected these postmortems, I was actually shocked that I didn't find so many more. And so until now, I only got two PRs for the GitHub repo. So I think, considering how many companies I know run Kubernetes in production, I would like to hear a lot more. And of course, we also at Zalando have to do a better job in sharing with the community.

So I guess, without that, we cannot really say that we learn enough from our mistakes. And just taking this CFS quota example, there were, I think, three or four companies who gathered at this KubeCon Europe about-- around the CPU throttling topic. And yes, we found kind of the solution, and talked about it. But this has not really persisted right now.

Of course, this is also kind of partly my responsibility to take the next steps. But I think these smaller discussions are really valuable, but how do they really reach the broader audience and the broader community? And I think there are so many companies or people joining the community every day.

Yeah, how do you transfer this knowledge and make sure that this is kind of the important pieces? In the end, hopefully the documentation part in Kubernetes-- which I confess, I didn't contribute anything yet to kubernetes.io docs. But hopefully, there will be--

CRAIG BOX: It's still early.

HENNING JACOBS: Yes-- hopefully there will be production guidelines, or like how to run in production, and some more than we have right now in the docs.

ADAM GLICK: We can form SIG Production.

HENNING JACOBS: Yeah, I guess we don't lack SIGs, I must say! So I think there are enough special interest groups. And I tried to engage with some, but I also have limited time.

So of course, there are a lot which are interesting to our use case. So there is AWS. There is also Cluster Ops, and Cluster Lifecycle, et cetera. But yeah, it's hard to be engaged everywhere. And then let's think about all the other people who are just joining and going to the docs, and didn't even find the community repo with all those links.

ADAM GLICK: Thank you so much, Henning. This is fascinating. And I encourage everybody to take a look at the blog post and the recordings of the presentations that you have done. I think people learn a lot from that.

HENNING JACOBS: Thank you for having me.

ADAM GLICK: If you want to find Henning on Twitter, you can find him at @try_except_, with an e, and then a second underscore.

HENNING JACOBS: There is one background to it. My name was already taken. And look it up, my first name last name, then you will find it's not me, but it's-- I came late to Twitter.

CRAIG BOX: So late that @try_except with no underscore at the end was also taken?

HENNING JACOBS: I didn't try, but it has to have consistency, right?

CRAIG BOX: A balanced number of underscores.

HENNING JACOBS: Yeah.

[MUSIC PLAYING]

CRAIG BOX: Thank you, as always, for listening. If you've enjoyed the show, we really appreciate it if you help us spread the word and tell some friends. If you have any feedback for us, you can find us on Twitter @KubernetesPod or you can reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you can find the show notes as well as transcripts for every one of the episodes. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]
