#18 August 28, 2018
CRAIG BOX: Hello, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.
ADAM GLICK: And I'm Adam Glick.
CRAIG BOX: You're a bit hard to see there, Adam.
ADAM GLICK: Yes. We are normally used to our wonderful outdoors. But there have been some fires here, and it's a little hazy if you've seen any of the pictures.
CRAIG BOX: I have.
ADAM GLICK: So for any of the folks out here in Seattle, probably good to stay indoors for a little bit until the smoke blows through.
CRAIG BOX: It's a real shame actually. A lot of great wildlife going up in smoke there on the West Coast of the US. But the picture that I saw was basically a view of the Space Needle. And I trust that it's there in the background somewhere, but it's basically just brownness, which is a real shame.
ADAM GLICK: Yeah. They were saying on one of the bad days last week that the air quality was equivalent to smoking seven cigarettes if you're outside. So you know, you get all of the downside of smoking without any of the upside. It's a very strange twist. It's supposed to be getting better this next week though, so I'm looking forward to that.
CRAIG BOX: Good. I've been taking my poison inside this week. I've done quite a lot of the baking actually, so visiting a few friends and taking some baked goods. And last time I was in the US, I don't know if-- do you have a preference for a type of cookie?
ADAM GLICK: Oh, yeah. Oh, yeah.
CRAIG BOX: Tell me, what's your favorite cookie?
ADAM GLICK: White chocolate or if someone makes them with, like, the peanut butter chips, oh, magic.
CRAIG BOX: Ah, well OK. First of all, Americans, peanut butter is not a food that goes in sweet things. Please don't ruin it like this. It goes on toast. That's its place in the world. I'll hear nothing more about that.
ADAM GLICK: Are you going to give me some kind of like Vegemite cookie or something? I just-- aw.
CRAIG BOX: No, no. Well, Vegemite's Australian. We can do a whole episode on the differences between Vegemite and the two distinctly different Marmites, which now, we'll have to link to the whole shebang in the show notes. But I do like the chewy American cookie, kind of like the Subway cookie, but these cookies that are still very moist even after they've been cooked. So I buy a few to bring home when I'm in the US. And then I thought, well, I could probably just make some. So I found a nice recipe and baked some cookies. And they were pretty good.
ADAM GLICK: Awesome. We will link to Craig's cookie recipe in the show notes.
CRAIG BOX: We will indeed. And now for this week's news.
The 2018 Kubernetes steering committee elections are now open. The steering committee is a group elected to help oversee the Kubernetes Project. Three seats are up for election this year, and those selected will serve a two-year term. Nominations are due by the 14th of September, and voting will run from the 19th of September to the 3rd of October.
ADAM GLICK: Google introduced binary authorization for Google Kubernetes Engine. Binary authorization is a container security feature that provides a policy enforcement checkpoint to ensure only signed and authorized images are deployed into your environment. Binary authorization also integrates with Cloud Audit Logging to record failed pod creation attempts for later review. Last week, we spoke to Jon from Shopify, who helped to design this feature. And we encourage you to go back and listen to show 17 to learn more.
CRAIG BOX: Aqua Security this week introduced kube-hunter, an open source tool for penetration testing of Kubernetes clusters. kube-hunter lets you simulate an attacker's behavior. It looks for commonly insecure entry points into a cluster, and then provides a report of what it finds. You can run from a hosted version at Aqua Security's website or download a version in a container and run it yourself. Either way, only run it against clusters you own and be careful before invoking the active hunting mode, as that could actually execute code on your cluster and potentially change its state.
ADAM GLICK: Alexander Lukyanchenko, an engineer with Avito, posted an interesting blog this week talking about some of the challenges they have had running Kubernetes for production at scale and how they solved those issues. In particular, he covers issues including how to avoid slow scheduling, achieving zero downtime deployments, updating Helm without Tiller deployment downtime, avoiding network issues like high latency and dropped packets, CPU resource limits' impact on the overhead of the system, and how to improve local development performance. If you have any of these issues or if you're planning to run Kubernetes broadly in your organization, the article is worth a read.
CRAIG BOX: The Cilium project this week released version 1.2. Cilium is a network driver for Docker and Kubernetes which uses the Berkeley Packet Filter, or BPF, functionality in the Linux kernel to accelerate security and routing. New features in 1.2 include security policies based on DNS names and a mesh-like functionality for connecting and securing multiple Kubernetes clusters available in alpha. Cilium can also be used to accelerate Envoy for Istio, and you can find a video showing this in the show notes.
ADAM GLICK: James Lee finished up a three-part series on Kubernetes networking this past week. The series starts with the basics and then dives deeper into how Kubernetes networking works. If, like me, you're curious to learn more about networking in Kubernetes and how it relates to concepts you may be more familiar with from the VM world, this is definitely a series I would recommend.
CRAIG BOX: For our friends looking to run Kubeflow on AWS, Amazon has brought GPU support to their Elastic Kubernetes Service. They have launched machine images for their P2 and P3 instance families so that you can now add them to your cluster and use the attached GPUs.
ADAM GLICK: And that's the news.
Ken Massada is a technical support engineer working for Google Cloud. He focuses efforts on support of Kubernetes. Previous to his work at Google, he worked in various roles focusing on DevOps and systems engineering. Welcome to the show, Ken.
KEN MASSADA: Thank you. Thank you, Adam.
CRAIG BOX: Google famously comes from a world of self-service, and there's a meme amongst people who've been living under a rock for the last 15 years that we're not a company that offers support in the traditional enterprise sense. That's your team. What would you say to those people?
KEN MASSADA: First of all, I'll say it's not true. Google has a big investment into support right now. There are hundreds of us, skilled, all across the globe whose prime job is to respond to customers' concerns. We are present on Stack Overflow. We are present in forums. We are present in the Slack. We are present in IRC. We are even present on Twitter.
CRAIG BOX: You pop up on podcasts occasionally.
KEN MASSADA: Yes, yes, yes. We are present in podcasts. [LAUGHS]
CRAIG BOX: If I've got this Node.js application that's got this problem, can you help me out?
KEN MASSADA: Of course, of course. What version of Node.js did you write it in?
And so, you know, we come from various backgrounds. The folks in my team are extremely talented. Our adjacent teams are the SWE and SRE teams. We transfer between those teams regularly. The added value that we bring is that we also can talk to customers and communicate clearly, but essentially, our interview is very rigorous and we have to be very proficient. And also, we have to operate under high stress when the stakes are very high.
You know, I've been doing this for about two years now, and I can't tell you the number of times where I've gotten a call that was essentially the customer's bottom line being hit by an issue that they're seeing. And so, you know, we're working really hard at Google on support.
CRAIG BOX: What kind of background does someone need to have to be a support engineer at Google?
KEN MASSADA: Oh, we come from various backgrounds. Our interviews vary from-- understanding cloud first is key to this, having a strong background in networking, having a strong background in systems engineering. Linux is almost a prerequisite, and being good with people is key as well. Something that confuses engineers a lot when they interview with us is that last step of being able to talk to a customer and reduce that anxiety. It's super important.
CRAIG BOX: Ken, what was your journey to Google Cloud support?
KEN MASSADA: My journey began in cloud research. I worked for a small company that wanted to build a private cloud end to end. This started in about late 2010. And we were doing very interesting things such as wanting to integrate Cloud Foundry, at a time when it was very nascent, into Oracle. At the time, it was OpenStack for building your infrastructure, bosh CLI for standing up your Cloud Foundry stack. And as I've evolved through my roles, I've just kept a very strong eye on open source, because in that job, everything had to be part of the open source stack, from the firewalls where we used pfSense, if anyone knows what that is.
CRAIG BOX: It's like m0n0wall with BSD.
KEN MASSADA: Absolutely.
ADAM GLICK: And so, how did you get into Kubernetes? It sounds like you already had a background in systems as well as in the world of Linux. What brought you to Kubernetes?
KEN MASSADA: My journey to Kubernetes was trying to get fast. I worked for an Adcom company in Baltimore. We started optimizing our workloads. We were first running on virtual machines, and then we went on containers. And we ran into the problem of how do we orchestrate containers. And we built certain solutions around it. We even used the product at the time called Rundeck. You know, we had a lot of problems to solve, and Kubernetes started checking off our boxes one by one from the first iteration through.
CRAIG BOX: Moving from an environment where you're supporting one installation to now supporting thousands in Google Cloud customers, what are some of the most common problems that you see over and over again in people's installations?
KEN MASSADA: There are two sets of user journeys, I like to call them, right? There's the operator journey: people who are interested in building Kubernetes for themselves. They have a business reason to do so, and so they set up the whole stack, and they're interested in providing that as a service for their developers. And one of the biggest challenges that these operators run into, for me, is making etcd reliable. There is just no shortcut around it.
Once you start passing a certain number of nodes, there are different thresholds. There's 500 nodes. There's 1,000 nodes. There's 2,000 nodes. And when it comes down to our-- I like to call them pilots-- the users themselves who are interested in deploying the application on Kubernetes, the set of problems we see are all based on understanding your workload first.
There are a lot of prerequisites for running a workload on Kubernetes successfully. One of them is understanding the failure modes of your application. If your application, right now in production, fails on bare metal or in an instance, when you move that workload into a container and into Kubernetes, where you have orchestration around this model, when your application is failing in containers, there is a cascading effect that happens. And this is really, really hard to troubleshoot once you move into a more sophisticated system. And that's a second one.
And then the third one is not really a problem, but it's just a very honest question to customers, right? Are you using Kubernetes because it's cool and because Craig and Adam talk about it on their cool podcast? Or are you using it because you have a real business reason?
CRAIG BOX: It can be both.
KEN MASSADA: True. Very true. But essentially, is it really that useful to you, right? I was dealing with a customer a while ago that had a high-performance database. They went through the investment of putting everything on Kubernetes so their workflows were optimized for this, and they put a high-performance database inside Kubernetes. And there are companies out there that do it very well. There are benchmarks, and they're flying high. But this customer, I made them run a test on bare metal versus putting it in Kubernetes. And they got better results on bare metal. And that's where the conversation comes in.
Do you want to pay the operation price of containerizing your application and then run it in Kubernetes? Or do you just want to put your binary and run it efficiently, right? There's a lot of talk about the right tool for the right job, and that's a very important part in Kubernetes. And that applies to even workloads in Kubernetes, right? Do I need a StatefulSet for this workload I'm about to create? Or can I create a single pod? What are my failure modes like? Things like that.
ADAM GLICK: So when you look at some of these common issues that you see, especially around etcd, which is one that I've heard people talk about a lot as an area of challenge, what are ways that people can avoid some of those challenges?
KEN MASSADA: For etcd, there are a lot of very well-documented ways of getting around scale issues in etcd. Some of them have to do with the override flags that you can use for etcd and separate your event endpoints, for instance, so that you can decentralize events from the rest of the operations that need to be stored in etcd. There are serious IOPS limitations, and this only comes in after testing and after running these at scale.
If your company has the might of a big budget, you could test these thoroughly, but if it doesn't, you'll have to-- you know, it's all about trial and error. Be aggressive on backups, test restores, test leader changes. Break it as often as you can to understand your failure modes in the application.
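Ken does not name the flag he has in mind. One documented option that fits his description is the kube-apiserver flag `--etcd-servers-overrides`, which lets Event objects be stored in a separate etcd cluster. The sketch below only assembles the flag value; the host names are hypothetical, and whether this is the exact flag Ken means is an assumption.

```python
from urllib.parse import urlparse


def events_override_flag(event_servers):
    """Build a kube-apiserver flag that sends Event objects to their own etcd.

    The flag value has the form group/resource#server1;server2. Events live
    in the core API group, whose name is empty, so the value starts with
    /events#.
    """
    if not event_servers:
        raise ValueError("at least one etcd URL is required")
    for url in event_servers:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError(f"not a valid etcd URL: {url!r}")
    return "--etcd-servers-overrides=/events#" + ";".join(event_servers)


# Hypothetical endpoints for a dedicated events etcd cluster.
flag = events_override_flag([
    "https://etcd-events-0.example.internal:2379",
    "https://etcd-events-1.example.internal:2379",
])
print(flag)
```

Splitting events out this way keeps a high-churn object type from competing for IOPS with the rest of the cluster state, which is the decentralization Ken describes.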
CRAIG BOX: This is the second conversation we've had with people who are basically talking about how complicated it is to run etcd. We had a similar conversation with Tina and Fred from the SRE team a while back. Do you think that that's just because it's running a database and databases are hard? Or is there something implicit either about distributed databases or the way etcd implements one that makes it the most complicated piece of the puzzle?
KEN MASSADA: I think it's just databases are hard. You have this complex system that needs to store all these different states, and these different states need to be persistent. And especially-- and this is why you see that scale and not very early on, is there are a set of operations that need to happen, and your etcd needs to store those operations. And so you run into scale problems because it just can't handle those reads and those writes.
ADAM GLICK: That's some great information. If you could give one tip to the folks that are listening, something that would help them run their Kubernetes clusters better, or avoid a challenge, what would that one tip be that you'd give to people?
KEN MASSADA: Honestly, I would just scale back and take a lot of time on building the application from the ground up. And by that, I mean this, right? Understand your application from end to end. What is your regular latency when you run your application in a container by itself? What does your latency look like as soon as you put this in a service? You know, if you're measuring latency from end to end, which means from the time it enters your node port to the time the request is processed and sent back, what does that trace look like? In microservices, we have all these buzzwords, tracing and monitoring, and people think it's optional, but it's not, especially if you're serious about having this at scale.
This is actually a good time to tell one of my war battles. I was on a call recently where we had this customer that was running a campaign. This campaign included people who were entering characters in their phone and sending it to this server essentially. This was a Node.js app. The person who configured the Node.js app forgot about localization. The Node.js app ended up doing some core dumps. And the failure mode the customer was seeing was, our app just constantly fails. They had all the good things we recommend for building a cluster, like the autoscaler. They had node repairs.
CRAIG BOX: So their application could fail more expensively.
KEN MASSADA: Exactly. Exactly. Or it could fail in-- you know, if this failure mode was just that specific application having a temporary problem, then this would have fixed it, right? Spin up and add a node and your problem goes away. But in this case, it created a cascading effect. The application just kept filling up to the point where they ran out of nodes to autoscale to.
They decided to restart the existing nodes which were working fine. And those nodes also, because they were newer, they started to receive traffic from the campaign. The problem was localized at the very beginning because only maybe 10% of the cluster was receiving the campaign traffic, so they didn't see that much failure. But this expanded all out through the cluster.
And it took about maybe nine to 10 hours of troubleshooting nonstop, and this is also another plug for Google's support. I was on the call all of the 10 hours. I did not have to do this, but you know, we take this very seriously and personally. And I, of course, had my APAC folks, that is, Tokyo and Sydney, join the call as well, closing that parenthesis.
But after 10 hours of troubleshooting, we finally found out that the Node.js app was the one doing the core dumps. The core dumps were affecting the nodes. The nodes were not able to repair. And this happened because of the specific failure mode that they had in the application.
In the postmortem, we did a restore and analysis of the application. And even their testing did not pick up that specific case. It was just images in a campaign that were throwing off their Node.js application. And this was not picked up because they didn't know that the users would be using the campaign this way. They did not plan for that in the application, which is one of those failure modes you cannot guard against.
However, something that would have helped a lot in this situation is understanding and tracing how the application failed. Automatically, with tracing at the application level, we would have known that the application was failing in the stack. But having various characters on the call, there were a few network engineers who were calculating latencies on routers, sending packet tracing. And then having a few systems engineers on the call who were like, oh, what's the kernel? Here's the kernel dump. Let's read through it line by line. Can we send this to forensics? I mean, there were a lot of interesting things going on on this call.
And I think the biggest value and the biggest advice for customers that are trying to get serious about using Kubernetes is: take care of your application and take care of understanding its failure modes. When you move to container orchestration, you add another layer of complexity that you will have to learn and follow throughout the journey.
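To make the point about application-level tracing concrete, here is a minimal sketch in Python. It is not the customer's Node.js code and not a real tracing library; `Trace`, `decode_message`, and `handle_request` are hypothetical names used only to show how timing each stage of a request also records which stage failed.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects named timing spans for a single request."""

    def __init__(self):
        self.spans = []  # list of (name, seconds, error message or None)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        error = None
        try:
            yield
        except Exception as exc:
            # Record the failure, then let it propagate unchanged.
            error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            self.spans.append((name, time.perf_counter() - start, error))

    def failed_spans(self):
        return [name for name, _, error in self.spans if error is not None]


def decode_message(raw):
    # Hypothetical stage: fails on input the application did not plan for.
    return raw.decode("ascii")


def handle_request(raw, trace):
    with trace.span("decode"):
        text = decode_message(raw)
    with trace.span("process"):
        return text.upper()


trace = Trace()
try:
    # Non-ASCII bytes stand in for the unexpected campaign input.
    handle_request(b"h\xc3\xa9llo", trace)
except UnicodeDecodeError:
    pass
print(trace.failed_spans())
```

With a record like this, the failing stage is visible in the application's own trace, before anyone needs to read router latencies or kernel dumps.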
CRAIG BOX: It almost feels like this might be a dangerous question now. But what is the most unique way that you've seen Kubernetes being used?
KEN MASSADA: You know what? There are various ways. But to me, this one way is very interesting. And it's the pattern of creating a cluster or workload that terminates itself.
CRAIG BOX: OK.
KEN MASSADA: So I'll explain a little bit on that. There are various sides to this. I've seen modes where there will be a cluster that is created, and the cluster's job is to create other clusters and terminate them after a certain amount of time. So essentially, using the scheduler and then using some of the added functions inside Kubernetes to schedule workloads and turn that into a little bit of a CI/CD pipeline.
I tried to discourage the customer from doing this. But they came back three months later with an amazing setup. And every time, I stalk their cases because I really am very curious and interested in seeing how they are developing this and where they are taking it today.
ADAM GLICK: What's the concept that you find that people have the hardest time with?
KEN MASSADA: There are various levels to this, right? And this is like my personal journey into Kubernetes. I come from a configuration management background, and you get into this world of-- it's still declarative, but you have this YAML file that you're supposed to feed to the system and expect it to do magic things. The first line of the YAML file already is problematic. API version, right? If you are new to Kubernetes, you do not understand what those numbers mean at the top. API v1 extensions, v1. What the hell is this and why is it even there?
Second of all, looking at the whole file itself is a little daunting, because you essentially have spec. There are objects that have specs in them, right? That kind of pattern is not something that you encounter in any other system. So for beginners, that's like one of the biggest issues.
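For readers puzzling over that first line, the `apiVersion` value is an API group plus a version, and objects in the core group omit the group entirely. The helper below is a hypothetical illustration of that naming rule, assuming Ken is referring to values such as `v1` and `extensions/v1beta1`; it is not part of any Kubernetes client.

```python
def split_api_version(api_version):
    """Split an apiVersion string into (group, version).

    A value such as apps/v1 has group "apps" and version "v1". A bare v1
    belongs to the core group, which has an empty name.
    """
    group, sep, version = api_version.rpartition("/")
    if not version or (sep and not group):
        raise ValueError(f"malformed apiVersion: {api_version!r}")
    return group, version


print(split_api_version("v1"))
print(split_api_version("extensions/v1beta1"))
```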
Another pattern I see or things that are a little bit hard to learn is to understand liveness and readiness probes. Those, essentially, it's not the concept that is hard. It's just what they're supposed to do. And, you know, I see Stack Overflow and gists. And I see people copy liveness and readiness probes and then just making them exactly the same. They're not supposed to work like that.
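A rough sketch of why the two probes should not be copies of each other: a failed liveness probe gets the container restarted, while a failed readiness probe only removes the pod from Service endpoints. The `AppState` fields and both functions below are hypothetical stand-ins for whatever an application's health endpoints would actually check.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    cache_warmed: bool = False        # hypothetical startup work
    database_reachable: bool = False  # hypothetical external dependency


def liveness(state):
    """Report whether the process itself can still make progress.

    Dependencies are deliberately ignored: restarting the container
    will not fix an unreachable database.
    """
    return 200


def readiness(state):
    """Report whether this pod should receive traffic right now."""
    if state.cache_warmed and state.database_reachable:
        return 200
    return 503


state = AppState(cache_warmed=True, database_reachable=False)
print(liveness(state), readiness(state))
```

If both probes used the readiness logic, a database outage would restart every pod without fixing anything, which is the kind of cascading failure described earlier in the conversation.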
CRAIG BOX: In fairness, that's pretty much how we make our YAML files for everything.
KEN MASSADA: Exactly, right? You know, just take the YAML file and duplicate whatever section it is to make it fit the scenario. But anyways, the whole--
CRAIG BOX: And then just start adjusting the spacing until it passes.
KEN MASSADA: There is that. There is that as well. And you know what's also very interesting is the way we are solving the problem is by making it a little bit more complex. So we're either writing templates for YAML files, or we are feeding just other values. So essentially, there will be a generation of Kubernetes users who will not ever write a full YAML file, right?
CRAIG BOX: Well, Joe Beda thinks of YAML files like it's machine code, so maybe no one writes in machine code today except a very, very small set of people who need very, very high performance. Might that end up being true for Kubernetes?
KEN MASSADA: Yes, yes. And I would be a big proponent of that. It's just working in support, I always have the negative view of those users will be calling us, and we'll have to explain why.
CRAIG BOX: Do you think it's a problem with how complicated the product is? Or do you think we should just train people? Because obviously, Kubernetes has to be similar to Unix. It needs that complexity to be able to represent anything you might want to run on a computer.
KEN MASSADA: I think, across the board, it's about selling Kubernetes a different way. From the get go, Kubernetes tried to be a lot of things to a lot of people. The addition of StatefulSets and CronJobs is a prime example of how an idea of being just able to schedule containers extended to oh, look at every other thing we could do with this. And being serious about telling people, hey, listen, when you are about to embark on this journey, it's a daunting task. You need to scale back and learn the complexity of the product before you're completely up for the task. And I think that's where the nuance enters. I would make it even more complicated.
CRAIG BOX: We'll get right on that.
KEN MASSADA: To be honest. You know, there are more flags I'd like to see. I'd like to see the YAML files be even more declarative than they are right now, even to the bit-level, correct? So it's a middle ground, right? Is it the Kubernetes ecosystem's job, or can I say Kubernetes' core job to make that bridge? Or is it its partners or providers to make that bridge?
ADAM GLICK: Is there one flag in specific that you'd like to see, or one particular feature that's on the road map, or that you'd like to see on the road map that would make supporting Kubernetes much easier?
KEN MASSADA: You know what? It actually ties back to the last question that you asked me about the concepts that are very difficult to understand, the affinity and anti-affinity rules. Right now, there are a set of topology flags that you could use for them, and then you cannot mix them with labels. I'd like to see this thing go crazy. Like, do not schedule a pod on a workload that has this pattern of CPU utilization.
CRAIG BOX: On a Wednesday that has a full moon.
KEN MASSADA: Or while Mercury is in retrograde.
But it's true. It's true, you know. I'd make that feature a lot more complicated, which comes back to that feature of affinity and anti-affinity. The way the spec is reading and the examples are reading in the documentation, a very key thing to that feature is the topology key. The topology key comes after.
But when you're trying to tell yourself or say it in human understandable language to understand what's going to happen to the pod when it gets scheduled, you have to talk about the topology key first. It's just a little nuance of just saying, OK, do not schedule this pod or schedule this pod within topology key that matches labels. That's the way of thinking about it to get around having to draw Venn diagrams and other matters on understanding where your workload is going to end up.
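Ken's suggestion of reading the topology key first can be sketched as follows. The dictionary mirrors one entry of `spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution`; the `app=web` label and the `describe_anti_affinity` helper are hypothetical, and the sentence wording is one possible phrasing, not official documentation.

```python
def describe_anti_affinity(term):
    """Render one podAntiAffinity term as a sentence, topology key first."""
    labels = term["labelSelector"]["matchLabels"]
    label_text = ", ".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return (
        f"Within each distinct value of {term['topologyKey']}, "
        f"do not schedule this pod next to pods labelled {label_text}."
    )


# In the manifest the label selector is written first and topologyKey last,
# which is the ordering Ken finds hard to read.
term = {
    "labelSelector": {"matchLabels": {"app": "web"}},  # hypothetical label
    "topologyKey": "kubernetes.io/hostname",
}
print(describe_anti_affinity(term))
```

With `kubernetes.io/hostname` as the key, each node is its own topology domain, so the rule reads as "at most one such pod per node". Swapping in a zone label would widen the domain to a whole zone.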
CRAIG BOX: All right, and with that, I'd like to say thank you very much to Ken for joining us today.
KEN MASSADA: Thank you all for having me. It's been a pleasure. Please have me back again.
CRAIG BOX: You can tweet to Ken — in either English or French — @kmassada, or just break your Google application and he will come to your rescue and sit on a call with you for 10 hours.
KEN MASSADA: Have a good one. Please, please reach out. Please. We're here.
ADAM GLICK: Have a good one, Ken.
Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can email us at kubernetespodcast@google.com or find us on Twitter @KubernetesPod.
CRAIG BOX: You can also check out our web site at kubernetespodcast.com where you'll find all our news and all our recipes. If you choose to bake some cookies, please take a picture and post them on Twitter. Until next time, take care.
ADAM GLICK: Catch you next week.