#9 June 26, 2018

SRE, with Tina Zhang and Fred van den Driessche

Hosts: Craig Box, Adam Glick

Craig and Adam from the Kubernetes Podcast talk to Tina and Fred from Google Cloud Site Reliability Engineering (SRE) about managing GKE and what lessons you can take to your own clusters.

ADAM GLICK: Welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: What's going on, Craig?

CRAIG BOX: I'm reading a fantastic blog post from Paul Ingles, who's the head of engineering at uSwitch in London. He's writing about how Kubernetes has helped standardize their processes, and not so much about how you change the company to adopt Kubernetes, but how when you have a company that you want to change, you can bring Kubernetes in to help make that happen.

ADAM GLICK: Nice.

CRAIG BOX: So quite a lot of discussion about this online, so I'll put it in the show notes, and you'll be able to read along yourself.

ADAM GLICK: Cool. I'll take a look at that.

CRAIG BOX: How about you, Adam? What's new?

ADAM GLICK: I've been spending a lot of time lining up some of our upcoming guests for the show.

CRAIG BOX: Ooh!

ADAM GLICK: Really excited for some of the folks that we'll have joining us, both for some of the upcoming projects and upcoming milestones that are coming out, as well as a number of external folks that we're really excited to be getting onto the podcast coming up.

CRAIG BOX: Can't wait.

ADAM GLICK: Yeah. I should take the opportunity-- if anyone knows of interesting things that are going on in the community that you think would be good guests for the show, feel free to send it in to us at kubernetespodcast@google.com.

CRAIG BOX: We'd love to hear from you.

ADAM GLICK: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: In this week's G-themed news, GPUs have gone GA on GKE. GPUs-- graphics processing units, more commonly used these days as machine learning accelerators or as a way to waste power mining cryptocurrency-- are now production-ready and fully supported on Google Kubernetes Engine.

Using GPUs for CUDA workloads allows you to benefit from massive processing power without having to manage hardware, or even VMs. This has been very popular with customers, and GPU core hours are up 10x from the end of 2017.

NVIDIA GPUs available include the entry-level K80, the P100, and the new Tesla V100, which is available in beta. All these models are also available as preemptible GPUs, a functionality that became generally available along with this announcement, which allows you to run GPUs attached to a preemptible VM at a 70% discount.
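
If you want to try this on your own cluster, here is a minimal sketch using the official Kubernetes Python client. It assumes a GKE cluster that already has a GPU node pool with the NVIDIA drivers installed; the pod name, namespace, image, and command are illustrative, not part of the announcement.

```python
# Sketch: schedule a pod that requests one NVIDIA GPU.
# Assumes a cluster with a GPU node pool and drivers installed; names are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="OnFailure",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:10.0-runtime",  # any CUDA-capable image
                command=["nvidia-smi"],
                # The GPU is requested via the extended resource name.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

v1.create_namespaced_pod(namespace="default", body=pod)
```

On GKE you would typically also apply NVIDIA's driver-installer DaemonSet to the GPU node pool before scheduling workloads like this.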

ADAM GLICK: Rackspace this week announced their Rackspace Kubernetes-as-a-Service. Yes, that's RKaaS, for those that love branding, and one more for those that love acronyms that become multi-syllabic words.

CRAIG BOX: They're called initialisms. Just letting you know.

ADAM GLICK: Thank you. Rackspace has partnered with HPE for this offering, and it appears to use HPE's version of OpenStack, while also using HPE's Cloud Cruiser acquisition to do metering and billing, according to an article on EnterpriseTech. Rackspace will run and support Kubernetes in multiple locations, including your on-prem environment, in colo, and in Rackspace managed data centers.

CRAIG BOX: GitLab released version 11.0, which improves integration with Kubernetes and makes it easier to manage and monitor Kubernetes from within the GitLab service. New features include the ability to review a pod's logs directly from the deployment board, one-click deployment of JupyterHub, and a Helm chart for deploying GitLab in a cloud native fashion. This chart uses a container for each component of GitLab and removes a previous requirement for shared storage, which increases the resilience, scalability, and performance of GitLab on Kubernetes. GitLab also posted this week that they are moving their hosted product to Google Cloud Platform, using those very same charts to administer it, in a move that has been underway since November 2017.

ADAM GLICK: SUSE announced this week they're releasing their SUSE CaaS Platform 3, a Kubernetes-based container management platform. SUSE claims that their CaaS Platform 3 provides surrounding technologies that enrich Kubernetes and make the platform itself easy to operate. They tout new features including improved storage integration, a software load balancer, and support for tuning the MicroOS container operating system.

CRAIG BOX: Way back in episode three, we talked about the integration of CoreOS Container Linux after its acquisition by Red Hat. You may recall Red Hat has a fully open source upstream for all their paid platforms, known as Fedora. They have now launched a project inside Fedora for integrating Container Linux with Atomic Host. This project is called Fedora CoreOS, which will bring a smile to the face of desktop Linux users from 2003. Fedora CoreOS and the Red Hat product to be built on it will eventually become the successor to both Container Linux and Atomic Host.

ADAM GLICK: Lacework last week released a report on container cluster dashboards being exposed. They found over 21,000 exposed dashboards, including Docker Swarm and Kubernetes, with 300 of those open without any credentials. Bad admin. No soup for you.

The issue of exposed Kubernetes dashboards came to many people's attention when, in February 2018, Tesla had its Kubernetes dashboard discovered and its resources hijacked to run cryptocurrency mining. The article mentions that 95% of the exposed dashboards were running on AWS, and recommends security steps be taken to help people protect their Kubernetes clusters. Overall, the article suggests that organizations understand their inventory of applications in public clouds and perform continual audits and configuration scanning, with compliance checks for workloads and security zone policies.

CRAIG BOX: Google Cloud announced a partnership with game tools company Unity 3D last week. Unity is currently migrating its infrastructure to Google, and the two companies are together building a suite of tools and services for game developers who build connected games. The first tool will be an open source matchmaking project, coming later in the summer, to be based on Kubernetes. This will be the second Kubernetes platform from Google's gaming team, following Agones, a project for hosting dedicated game servers, which is a collaboration with Ubisoft.

ADAM GLICK: This week, the CNCF welcomes 19-- yes, that's 19-- new members. The announcement adds several names that you may be familiar with, including Rackspace, Samsung Research America, and CircleCI. Also of note, a significant number of the companies joining the CNCF are from Asia and the Pacific Rim, which highlights the growth of Kubernetes in Asia ahead of KubeCon in Shanghai this November.

CRAIG BOX: And that's the news. Our guests today are site reliability engineers on the managed compute team at Google Cloud, Tina Zhang and Fred van den Driessche. Welcome.

TINA ZHANG: Hi. Thanks, Craig.

FRED VAN DEN DRIESSCHE: Hello.

CRAIG BOX: Tell us about a day in the life of a site reliability engineer.

TINA ZHANG: I guess the day in the life varies a little bit, depending on whether you're on call or not. When we're not holding the pager, our job is very much like the software engineers that work on GKE. We help to launch product features, make design docs, review design docs, review code to enable the launching of new and exciting features on GKE, and make sure that they're architected in a way that is scalable and reliable and delightful for our users.

FRED VAN DEN DRIESSCHE: Yeah. That's the ideal. I'd say a lot of project work day-to-day also follows on from days when we have been holding the pager and we've responded to incidents, and that's kind of informed directions for new projects and that kind of thing.

TINA ZHANG: I'd say yes. So going back to the not-on-call project work, the main difference between what we're doing and what a software engineer working directly on features for Kubernetes or GKE does is that we also work on the monitoring and alerting for the platform, so that we get notified and have visibility into the user experience, and also the on-call experience, which helps us deal with any problematic issues in a better and faster way.

CRAIG BOX: So on call is obviously a theme that's come up a lot in this discussion so far. Tell us about a day in the life when you are on call.

FRED VAN DEN DRIESSCHE: Well, ideally, it's perfectly quiet, the pager never goes off, and we can just carry on as normal. But from time to time, things do go wrong, and yeah, we can get paged. That can be anything from a single cluster having an issue to an entire cloud region having an issue. And we have playbooks, dashboards, and all sorts of other monitoring systems that help us deal with that.

TINA ZHANG: Yeah, we can also get assigned bugs and tickets directly from cloud support. So this is when a customer contacts the friendly neighborhood cloud support engineer with a very particular or special issue that isn't captured by our alerting, and sometimes we're asked to get involved and debug that too. And often, we need to pull in help from our dev partners to help us find a root cause. And I think we also benefit a lot from the fact that lots of Googlers are core contributors to the Kubernetes project as well, so they can land patches and fixes in the open source project very quickly.

ADAM GLICK: What are you doing to monitor and define the health of the clusters that you're supporting while you're on call?

FRED VAN DEN DRIESSCHE: Our primary way of monitoring our clusters is simply making a request to the API server, asking for the health of the component statuses. So it's [INAUDIBLE] essentially. That tells us the health of the API server-- if it doesn't respond at all-- and gives us information about the component statuses: the scheduler, the controller manager, and the databases underneath. That's our core way of monitoring clusters.

TINA ZHANG: Yeah, and that's a metric that we store internally in an internal Google monitoring behemoth that's used by all SRE teams at Google. And that's directly plugged into the alerting infrastructure. And we have predefined rules on what certain unhealthy statuses look like.

So in addition to component statuses, we also track the memory usage and the CPU usage of the hosted masters, and we have predefined conditions for what we consider to be a bad state, which can either cause a page to wake up an SRE, or a ticket that has to be resolved by the end of the on-call shift.
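
For anyone who wants to approximate the component status check Fred describes from outside Google, here is a rough sketch using the Kubernetes Python client. It assumes a kubeconfig that can reach the cluster; note that the componentstatuses API has since been deprecated in newer Kubernetes releases, so treat this purely as an illustration.

```python
# Sketch: poll the component statuses the way an external health check might.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def unhealthy_components():
    """Return (name, message) for components whose Healthy condition is not True."""
    bad = []
    for cs in v1.list_component_status().items:
        for cond in cs.conditions or []:
            if cond.type == "Healthy" and cond.status != "True":
                bad.append((cs.metadata.name, cond.message))
    return bad

# Prints something like [('etcd-0', 'connection refused')] when a component is down.
print(unhealthy_components() or "all components healthy")
```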

CRAIG BOX: Now, you're based in London. Do you have a partner team in-- is it in Seattle?

TINA ZHANG: Yes.

CRAIG BOX: And so how often does a human actually get woken up?

FRED VAN DEN DRIESSCHE: Well, our shift starts about 6:00, so we can get unlucky in that case, depending on how late we want to sleep in, but it doesn't happen that frequently, I don't think.

TINA ZHANG: Yeah, well, I got woken up two days ago, so it's feeling a little bit more frequent for me. No, we do actually have great internal tools that track this. We generally tend to track the number of [INAUDIBLE] per day, not necessarily when they happen, because I guess the morning for us could be right in the middle of the day for one of our customers who needs help straight away with their clusters. But yes, typically at Google, SRE teams are split into two shards, and we have sister teams, so that every service that has an SRE team can have 24-hour on-call without being too disruptive to people's personal lives.

FRED VAN DEN DRIESSCHE: We track that pretty carefully, as well. If the pager load gets too high, then we can drive improvements in that area.

CRAIG BOX: Your alerts are driven by unhealthy Kubernetes masters. How often is that driven by a user's choice in how they run workloads on their own cluster versus, say, the components on the master causing the problem?

FRED VAN DEN DRIESSCHE: I think that's probably cause and effect, as in, frequently, for whatever reason, the workload has caused components to become unhealthy. One particular case I remember is where the rate of requests that an unhealthy workload was making effectively DoSed the cluster's API server. And at that point, the controller manager and scheduler started crash looping, because they couldn't get a word in edgeways, essentially. I think that's one frequent way. Also, the rate of requests or queries to the etcd databases can have a big effect, and that's generally driven by workloads as well.

TINA ZHANG: Yes. Although sometimes there are bugs as well in the etcd project. And yeah, that also affects us.

CRAIG BOX: I want to dig into that, because you both gave a talk at the recent KubeCon in Copenhagen-- tales from the playbook-- about the things you need to do to manage GKE, and etcd is a very common theme. I know from this talk, and from other talks that SRE have given, that as with all things, the stateful components are the hardest things to run. So tell us a little bit about the challenges of managing etcd, and perhaps some specific outages that you've had and how you've resolved them.

TINA ZHANG: Over the past year, one thing that was quite a large project and took up a bit of time was migrating the etcd running in our hosted masters from version 2 to version 3. I mentioned previously there being bugs in the etcd project-- sometimes they are fixed pretty quickly, but that doesn't mean we can always just patch the fix in straight away.

CRAIG BOX: Right.

TINA ZHANG: And we have to be quite careful about how we upgrade etcd, due to its importance in the cluster. So for one of the issues that we had with etcd, which we discussed in our KubeCon talk, there was actually a fix for it in open source, but until we could roll out that version, we had to kind of roll our own automated fix.

CRAIG BOX: And a number of members of your team are actually upstream contributors, so there's a lot of work that gets discovered in the course of running GKE that's actually put very quickly back into the actual upstream projects?

TINA ZHANG: Yeah, exactly. So Wojciech and Joe Betz as well-- I think they filed and fixed their fair share of bugs. We're very grateful to them for that too. And also, just regarding etcd more generally: one of the production best practices that I've read in our Google internal SRE training docs says, if you can avoid master election, do, because it does add extra complexity. I mean, obviously sometimes it's needed, but yeah, we definitely rely a lot on the devs when we're trying to debug issues with etcd.

ADAM GLICK: You bring up an interesting point. You're spending a lot of time maintaining the clusters. And indeed, lots of folks that work in operations know the treadmill of maintaining a service and keeping it healthy. You also talked about making contributions upstream, actually fixing some of the things that you see and helping make those pieces better. How do you balance the two-- making the investments that make the infrastructure and the project better, versus keeping things running and making sure that it's running well?

FRED VAN DEN DRIESSCHE: So the official SRE line is that split should be 50-50. And we do track that, as well. So yeah, ideally, we spend 50% of the time making the service better and more resilient, and a maximum of 50% handling issues that happen in the service.

ADAM GLICK: A 50-50 split sounds like a great split. How do you handle that when you're dealing with a particular escalation, when something is happening? It's good to set a goal, but how do you hold yourself to a goal like that, especially given that a lot of what happens in operations is that you have a battle plan, and then there is what happens in the battle?

FRED VAN DEN DRIESSCHE: So yeah, I mean, if there's an issue going on, then we'll be focusing solely on that issue. If that takes us out of the SLO-- the Service Level Objective-- that we've set for the service, then we theoretically have the ability to freeze rollouts-- new feature rollouts-- for that service as well, so the chance of another incident happening is reduced at that point.

TINA ZHANG: Yes. And I'd say, to add to that, during an escalation or an outage, the number one priority for SREs is always to mitigate the impact for our users and to debug the root cause later. So it's all about reducing the user impact and communicating the user impact.

And then after the outage is over, it's mandated that every outage is accompanied by a post-mortem. And in that, we very openly and transparently kind of discuss the root causes, the triggers. We actually have to write a section on what went well, what went poorly, where we got lucky.

And also accompanying every post-mortem is a list of action items. And depending on the severity of the outage, the action items have to have a minimum priority. These are internal priorities that we set for dealing with bugs and issues. And these are tracked on an SRE-wide basis to ensure that the action items are closed over time.

FRED VAN DEN DRIESSCHE: Yeah. I'd point out that in GKE's case, we wouldn't go into that mode for a problem affecting a single cluster or a single customer. That's for if we have a problem affecting an entire region's worth of clusters, or a particular feature in a large number of clusters.

CRAIG BOX: When you look at the release notes for GKE, they explicitly talk about the rollout. Normally we have a weekly rollout, and that happens over the course of a few days. Can you tell us a little bit about how that rollout works?

FRED VAN DEN DRIESSCHE: So we roll out a particular release of GKE over four days, and it hits particular cloud zones and regions in a particular order.

CRAIG BOX: Is that order always the same?

FRED VAN DEN DRIESSCHE: Yup. That's always the same.

TINA ZHANG: If you look closely at what's in day zero, day one, day two, the number of zones ramps up, and that's so that we can gain more and more confidence as the release goes out.

CRAIG BOX: Right.

FRED VAN DEN DRIESSCHE: We also won't roll out all of the zones within a single region on the same day, to reduce blast radius.

CRAIG BOX: Let's talk about a specific recent release, which was regional clusters. Now we are able to offer a cluster where the master is replicated across three zones within a single region, and thus has higher availability. Tell us a little bit about how that happened.

TINA ZHANG: Regional clusters went beta in December last year, and as you said, they've now gone GA. So from an SRE perspective, we now have to monitor three masters per cluster instead of just one, and we had to change some of our logic to determine what counts as an available cluster.

Previously, it was relatively straightforward: if the single master was down, SRE gets paged. For regional clusters it's slightly different, because the masters can tolerate more failure scenarios that don't necessarily warrant paging an SRE. However, there were added complexities, because with three masters per cluster we're, for the first time, exercising the leader election code in the etcd clusters, which we previously weren't using with just a zonal master.

CRAIG BOX: What changes between when you launch a product, like regional clusters in beta, and when it goes GA some months later?

TINA ZHANG: So in the process of going from a beta release to GA, we internally set some SLOs for the product that have to be continuously met before we are able to launch. We also have a launch coordination engineering team within Google-- you can read all about them in the Site Reliability Engineering book-- and they help us to identify potential reliability issues related to product launches, and they give us blocking bugs that we must fix before we're able to offer the product as GA.

FRED VAN DEN DRIESSCHE: And I guess we also see more varied real life use of the product, and that can help inform the content for playbook entries and that kind of thing when things do go wrong.

ADAM GLICK: How much of what you do is applicable to people who are running their own clusters?

FRED VAN DEN DRIESSCHE: I think some of it is. A lot of what we do is obviously looking at a cohort of clusters in aggregate. But when we do have to look at a specific cluster, that's obviously exactly the same. I think we're quite spoiled here, in that we have pretty good ways of collecting a lot of metrics about clusters, and fairly good observability for a specific cluster and the master VM health, that sort of thing.

TINA ZHANG: For anyone who's interested in learning more about regional clusters, come to Cloud Next in San Francisco in July, where I'll be giving a talk about it with my colleague Jeff Johnson.

ADAM GLICK: So what tools would you use if you were trying to look at the same signals that you're looking at internally? What projects are out there that people could use to look at those?

TINA ZHANG: From GCP, we have Stackdriver Monitoring, and of course Prometheus, which is part of the CNCF.

CRAIG BOX: What advice would you give to cluster administrators on particular signals that they should pay close attention to?

FRED VAN DEN DRIESSCHE: One thing that springs to mind for me-- because we see it fairly frequently at the moment-- is any throttling that might be happening on the disks that the etcd database is stored on. Because that can start off as a simple backup of queries to the database, which then backs up requests to the API server, and that can have a knock-on effect on the rest of the components, especially if the cluster is large and there are lots of queries going to it.
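
One hedged way to watch for the disk pressure Fred describes is etcd's own Prometheus metrics, such as the WAL fsync latency histogram. The endpoint, certificate paths, and the 10-millisecond threshold below are assumptions; the sketch simply computes the mean fsync time over the process lifetime from the histogram's sum and count.

```python
# Sketch: read etcd's /metrics endpoint and estimate mean WAL fsync latency.
# Endpoint, certs, and the 10ms threshold are illustrative assumptions.
import requests

METRIC = "etcd_disk_wal_fsync_duration_seconds"

def mean_wal_fsync_seconds(endpoint="https://127.0.0.1:2379",
                           cert=("/etc/etcd/client.crt", "/etc/etcd/client.key"),
                           ca="/etc/etcd/ca.crt"):
    text = requests.get(f"{endpoint}/metrics", cert=cert, verify=ca, timeout=5).text
    total = count = 0.0
    for line in text.splitlines():
        if line.startswith(f"{METRIC}_sum"):
            total = float(line.split()[-1])
        elif line.startswith(f"{METRIC}_count"):
            count = float(line.split()[-1])
    # Mean over the etcd process lifetime; a real alert would look at a recent window.
    return total / count if count else 0.0

if mean_wal_fsync_seconds() > 0.010:  # sustained >10ms often points at a throttled disk
    print("etcd WAL fsyncs are slow; check disk throughput and IOPS limits")
```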

TINA ZHANG: Also, as a cluster admin, it's very important to keep frequent backups, of etcd especially. It can be quite difficult to debug, but keeping backups, and practicing the backup-and-restore process frequently as well, is quite a core SRE tenet.
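
For self-managed clusters, the backup habit Tina describes can be as simple as a scheduled etcdctl snapshot. Here is a minimal sketch, assuming etcdctl v3 is installed on the master and kubeadm-style certificate paths (both assumptions); practicing the restore path regularly matters just as much.

```python
# Sketch: take a timestamped etcd snapshot by shelling out to etcdctl.
# The endpoint and certificate paths are kubeadm-style assumptions; adjust for your cluster.
import datetime
import os
import subprocess

def snapshot_etcd(endpoint="https://127.0.0.1:2379", out_dir="/var/backups/etcd"):
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    out = os.path.join(out_dir, f"etcd-snapshot-{stamp}.db")
    subprocess.run(
        ["etcdctl",
         "--endpoints", endpoint,
         "--cacert", "/etc/kubernetes/pki/etcd/ca.crt",
         "--cert", "/etc/kubernetes/pki/etcd/server.crt",
         "--key", "/etc/kubernetes/pki/etcd/server.key",
         "snapshot", "save", out],
        check=True,
        env=dict(os.environ, ETCDCTL_API="3"),
    )
    return out

# A cron job or Kubernetes CronJob could call snapshot_etcd() daily and
# periodically verify the result with `etcdctl snapshot status`.
```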

ADAM GLICK: We've talked a number of times about etcd, and it sounds like that's one of the critical things that you're looking at and that you want to make sure is running well. Are there particular signals coming from etcd that help you understand whether it's healthy and running well, or that tell you something might be heading in the wrong direction, so you can fix it before it actually disrupts a cluster's normal operation?

TINA ZHANG: The main thing that we monitor is the component statuses. So for etcd-- well, for us, we monitor the main etcd and also the events etcd, and we check for continuous health of those two components. We also have an alert for too many leader election changes in etcd, but that hasn't actually triggered yet.

ADAM GLICK: You mentioned monitoring the health of etcd. What particular metrics are you using to define the health of etcd?

TINA ZHANG: It's just the health endpoint provided by etcd. I don't know what's going on under the hood for etcd to actually determine whether it itself is healthy or not.

FRED VAN DEN DRIESSCHE: You know, we don't have any alerts based on any metrics more specific than "is it healthy."

CRAIG BOX: Right. So we trust etcd when it says it's healthy, and we worry when it says it isn't.

FRED VAN DEN DRIESSCHE: Yes. And we can look at other metrics if it is unhealthy.

TINA ZHANG: Yeah, because of the way component statuses connects to check whether etcd is healthy or not, sometimes if etcd is not responding at all, that's also interpreted by component statuses as unhealthy.
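
For the curious, the check being described ultimately boils down to etcd's /health endpoint, which you can also hit directly on a cluster you manage yourself. A small sketch follows; the endpoint and client certificate paths are assumptions.

```python
# Sketch: ask etcd directly whether it considers itself healthy.
# Endpoint and client certificate paths are illustrative assumptions.
import requests

def etcd_healthy(endpoint="https://127.0.0.1:2379",
                 cert=("/etc/etcd/client.crt", "/etc/etcd/client.key"),
                 ca="/etc/etcd/ca.crt"):
    resp = requests.get(f"{endpoint}/health", cert=cert, verify=ca, timeout=5)
    resp.raise_for_status()
    # etcd reports health as the JSON string "true" or "false";
    # a connection error (raised above) is also treated as unhealthy by callers.
    return resp.json().get("health") == "true"

print("etcd healthy" if etcd_healthy() else "etcd unhealthy")
```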

CRAIG BOX: As GKE grows, how are the projects that you work on in SRE changing with it?

TINA ZHANG: The ultimate goal of all SRE teams is to be able to automate ourselves out of a job. And one of the core SRE tenets is that the number of SREs for a service should not grow linearly with the number of developers or the number of features or the number of users.

CRAIG BOX: Right.

TINA ZHANG: So as adoption of GKE and Kubernetes continues to grow, we need to be able to automate some of the instructions that we have in our playbook to fix clusters, and generally make the service more scalable.

FRED VAN DEN DRIESSCHE: Yeah. And I think we want to be able to manage a large number of clusters easily. That means being able to inspect what they're doing, contact their owners if we need to, and fix a security vulnerability if one affects the entire fleet. At the moment that would involve quite a lot of toil that we don't want to have to go through, if it were to happen. So yeah, being able to manage a large fleet as simply as possible is a good goal.

TINA ZHANG: Yeah. And making the system more resilient and self-healing, much in the same vein as the philosophy of Kubernetes itself.

CRAIG BOX: Great. Tina, Fred, thank you so much for coming on the show. It's been fantastic having you here.

FRED VAN DEN DRIESSCHE: Thanks for having us.

TINA ZHANG: Yeah, thanks.

ADAM GLICK: Great to have you both.

CRAIG BOX: That's about all we have time for this week. If you want to learn more about site reliability engineering, we recommend you check out the SRE book, which you can read for free online at landing.google.com/sre.

ADAM GLICK: Thanks for listening. As always, if you enjoyed the show please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email, at kubernetespodcast@google.com.

CRAIG BOX: You can find the links from today's episode and more at our website at kubernetespodcast.com. But until next time, take care.

ADAM GLICK: See you next time, and have a great week.

[MUSIC PLAYING]