#152 June 18, 2021

SRE for Everyone Else, with Steve McGhee

Hosts: Craig Box, Dan Lorenc

Steve McGhee worked as an SRE at Google for almost 10 years, then took a job outside the company. He was tasked with recreating “Google Production” and SRE practice from first principles, but with three books, modern cloud providers, and the entire Kubernetes ecosystem to help. How did he do? Learn about that which you can and can’t replace.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box, with my very special guest host, Dan Lorenc.

[MUSIC PLAYING]

Dan, your name has come up more than once in our News section in the last couple of months. When we spoke in episode 39, you were working on Minikube. But I'm told that you can now help me feel more secure.

DAN LORENC: Yeah. Thanks for having me on here, Craig. My Minikube work actually led me down this long rabbit hole of supply chain security. When I first started publishing Minikube releases, from v0.1 on, it frankly terrified me that so many people were willing to take this binary I handed them and run it as root on their laptops. So I tried to do that as well and as securely as possible, and found out there were a whole bunch of problems we had to fix along the way. So that's what I've been working on lately.

CRAIG BOX: So you mean that "curl | bash" isn't best practice anymore?

DAN LORENC: I don't think it ever was. But we need to do better and make that easier for people to do right.
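Dan's baseline fix is worth sketching. A minimal example, assuming hypothetical file names: verify a downloaded binary against its published checksum before running it, rather than piping curl straight into bash. A checksum only helps if it's fetched over a separately trusted channel, which is why signing -- the problem Sigstore tackles -- is the stronger fix.

```python
# Sketch: refuse to run a downloaded release binary unless its SHA-256 digest
# matches the checksum published alongside it. File names are hypothetical.
import hashlib
import sys

def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(binary_path: str, checksum_path: str) -> bool:
    """Compare the binary's digest to the first field of the checksum file."""
    expected = open(checksum_path).read().split()[0]
    return sha256_of(binary_path) == expected

if __name__ == "__main__":
    binary, checksum = sys.argv[1], sys.argv[2]
    if not verify(binary, checksum):
        sys.exit("checksum mismatch: refusing to run this binary")
    print("checksum OK")
```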

CRAIG BOX: So what are some of the things that your new team have been working on?

DAN LORENC: We've been doing a whole bunch of work to try to secure open source supply chains and products like Tekton CD, trying to automate the capture of verifiable supply chain metadata so you can trace things back to where they were from. And somewhat exciting to me, and we've got some news happening this week, there's a separate project called Sigstore. We're trying to make it easy for developers to get free code signing certificates and integrate signing into their workflows without having to think about it.

CRAIG BOX: Signing has always been a bit of a magic ceremony-- you hear about the DNS root key ceremonies, where a bunch of wizards get together in a room. And it's very pleasing, catching up with you after not seeing you for a while, that you've really leaned into the pandemic haircut, and you're looking a little bit like Peter Jackson. So I can ask you what the robe and wizard hat ceremony is going to be like for you signing the new keys for the Sigstore initiative.

DAN LORENC: Yeah. It's pretty fun and exciting. And I definitely put on this costume and hair and beard just for the event. But we're kicking off the first root certificate signing event for Sigstore. Five people -- we'll stay on the robe and wizards theme -- are getting their own individual hardware tokens, and we're doing it all live on a stream so people can make sure that they're legitimate. We're using those tokens to sign everything. And then we are distributing them among the open source community.

So there are five different key holders that are going to be holding these, just like horcruxes. So we've got resiliency built in, and we'll be rotating through different open source communities and inviting people to become key holders going forward.

CRAIG BOX: Is there one key to rule them all?

DAN LORENC: Nope. And that's the point. We've got five of them in different companies, different communities, different organizations, different industries. And we'll need at least three out of those to do anything going forward.
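To make the quorum concrete, here is a toy sketch of threshold verification -- accept new root metadata only if at least three of the five known key holders produced a valid signature over it. It illustrates the 3-of-5 idea only; it is not Sigstore's actual root-of-trust implementation.

```python
# Toy 3-of-5 quorum check: count valid signatures from known key holders.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

THRESHOLD = 3  # at least 3 of the 5 key holders must sign

def valid_signature_count(metadata: bytes, signatures, trusted_keys) -> int:
    """Count how many trusted public keys produced a valid signature."""
    count = 0
    for key, sig in zip(trusted_keys, signatures):
        if sig is None:
            continue  # this key holder didn't sign
        try:
            key.verify(sig, metadata)
            count += 1
        except InvalidSignature:
            pass  # a bad signature doesn't count toward the quorum
    return count

# Simulate the ceremony: five key holders, three of whom sign the new root.
holders = [ed25519.Ed25519PrivateKey.generate() for _ in range(5)]
root = b"root metadata v1"
sigs = [key.sign(root) if i < 3 else None for i, key in enumerate(holders)]
pubs = [key.public_key() for key in holders]

assert valid_signature_count(root, sigs, pubs) >= THRESHOLD
print("quorum met: root metadata accepted")
```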

CRAIG BOX: Is there some criteria that need to be met for you to consider a shave and a haircut?

DAN LORENC: Supposedly if enough people watch this and start signing, then we could do the shave and a haircut at KubeCon LA.

CRAIG BOX: Lovely. We'll have to do a sponsorship event where you get to throw in a dollar or two and cut off a lock for good luck.

DAN LORENC: We'll come up with something fun.

CRAIG BOX: All right. Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: Google Cloud is introducing a new VM type, based on AMD's third-generation EPYC architecture. Tau VMs promise 56% higher absolute performance and 42% higher price-performance than the leading general-purpose VMs from AWS and Microsoft.

They are supported on GKE from day one, where day one will be sometime between now and Q3 if you fill in the application form provided.

If you want to bring your GKE costs down today, Google is now offering committed use discounts for GKE Autopilot. These give you a 20% discount off on-demand pricing for a one-year commitment, and a 45% discount for a three-year commitment.

If you're new to GKE, or want to learn more about the Autopilot mode, listen first to episode 139, then sign up for free Cloud OnBoard training, live on June 22, or available on demand afterwards.

DAN LORENC: Twice a year, StackRox publishes a State of Kubernetes Security report, and post-acquisition, the first Red Hat-branded report has been released. 94% of the 500 respondents stated that they have experienced a security incident in their Kubernetes and container environments during the last 12 months, with 55% needing to delay deploying Kubernetes applications into production due to security. Sounds like a problem we should fix.

CRAIG BOX: Two years after its last minor release, the etcd team has released 3.5, with improvements to logging, monitoring, and project security processes, as well as bug fixes and performance improvements. Of special note is a 50% reduction in memory usage under certain Kubernetes scenarios.

DAN LORENC: Google's open source security team, that's me, is at it again, introducing a new framework for mitigating threats across the software supply chain. Supply Chain Levels for Software Artifacts, or SLSA -- pronounced "salsa" -- is inspired by Google's internal Binary Authorization for Borg, which has been in use for over eight years and is mandatory for all of Google's production workloads.

The goal of SLSA is to improve the state of the industry and to defend against the most pressing integrity threats. In its current state, it's a set of incrementally adoptable security guidelines being established by industry consensus. In its final form, SLSA will support the automatic creation of auditable metadata that can be fed into policy engines to give SLSA certification to a particular package or build platform. That metadata can also be used with the software bill of materials (SBOM) efforts underway in various US government agencies in response to the recent executive order.
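A toy sketch of that policy-engine idea: provenance metadata captured at build time is checked against a required policy before an artifact is admitted. The field names and rules below are invented for illustration; they are not the actual SLSA specification.

```python
# Toy policy check over build provenance metadata (illustrative fields only).
REQUIRED = {
    "source_traceable": True,   # can the artifact be traced back to source?
    "hosted_build": True,       # built on a build service, not a laptop
    "provenance_signed": True,  # the metadata itself is signed
}

def meets_policy(provenance: dict) -> bool:
    """Admit the artifact only if every required property is attested."""
    return all(provenance.get(key) == value for key, value in REQUIRED.items())

artifact = {
    "source_traceable": True,
    "hosted_build": True,
    "provenance_signed": False,  # e.g., built before signing was turned on
}
print("admit" if meets_policy(artifact) else "reject")  # -> reject
```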

CRAIG BOX: Another week, another new database operator. This week, Spanish startup Tesera has announced Ensemble, an operator aiming to run any database at scale. Tesera says the different workflows required by all the other operators are a problem. So they're building a single framework-- one ring to rule them all-- to orchestrate clusters where extensions provide support for different databases.

In the 0.2 release, the software supports ZooKeeper, RabbitMQ, Dask, and Cassandra.

An operator for the Harbor container registry has also reached 1.0 this week.

DAN LORENC: Also teased on last week's show was GitOps Days, where organizer Weaveworks announced Weave GitOps Core, or "wego". Based on Flux, GitOps Core is an open source CD tool for Kubernetes, focusing on deployment and day two operations. It's marked as an early release, but likely to get continued investment. If you want to hear the story of Weaveworks and GitOps, check out the two-part extravaganza with founder Alexis Richardson in episodes 144 and 145.

CRAIG BOX: API management company WSO2-- the O2 is for oxygen-- has announced the launch of a new integration platform as a service called Choreo, with a CH. Choreo is a low-code cloud-native engineering tool for developers, allowing them to build event-driven workflows and deploy them to Kubernetes. Alongside the introduction, they announced the acquisition of Australian startup Platformer, whose technology will be used to help build out the Kubernetes integrations in Choreo.

DAN LORENC: The transparency report for the recent KubeCon EU is out with takeaways you might expect. Over 25,000 people registered, and only 63% attended. It was the first KubeCon event for 69% of attendees who dialed in from 168 countries across six continents. 43% of keynote speakers identified as something other than male, but only 17% in the breakout sessions.

If you're looking to attend the upcoming KubeCon North America, or any other Linux Foundation event in person, better go get your jabs. Proof of full vaccination will be required to attend these events in person, with no exceptions. If you can't meet that requirement, all events will have a virtual component.

CRAIG BOX: Finally, episode 66 guest and Google Serverless knight, Ahmet Alp Balkan, wonders if Knative missed a chance to be part of the default Kubernetes experience. He's long espoused the view that the serving component of Knative is just a Kubernetes service abstraction done better, and that almost everyone would benefit from using it. Get the scoop in his blog post.

DAN LORENC: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Steve McGhee spent nearly 10 years operating and scaling services at Google, including Search, Android, YouTube, and Google Cloud. He focused on monitoring and stable deployments, as well as managing teams of SREs across the globe. He then took what he likes to call his summer vacation, and worked outside Google for about two years before rejoining in 2019. Welcome to the show, Steve.

STEVE MCGHEE: Hey, Craig. Nice to be here.

CRAIG BOX: How big was SRE when you joined it? Was it fully formed as a concept, or did it still have a way to go?

STEVE MCGHEE: I would say it was in the dozens of people, maybe. It was certainly in the single digits of teams. I joined a team called Mobile SRE. And it was right before there were smartphones. So we did stuff with phones that would not be recognizable today.

CRAIG BOX: WAP, was that a thing?

STEVE MCGHEE: WAP was a thing, for sure. I don't know if you remember things like the LG Chocolate phone, things like that. There was a lot of, how does my BlackBerry do search? So there was a lot of fun stuff like that.

CRAIG BOX: I actually worked at Symbian for a while.

STEVE MCGHEE: Yeah. I did some of my research in grad school on the Symbian OS, which was gnarly.

CRAIG BOX: May it rest in peace.

STEVE MCGHEE: Yes. Indeed.

CRAIG BOX: What sort of things did you do in the SRE area back then?

STEVE MCGHEE: I focused mostly on monitoring. So we had a lot of these different services that were built by a lot of different teams. But they all kind of worked together with respect to the customer.

CRAIG BOX: As is the promise.

STEVE MCGHEE: Yeah. That's the idea, right. My first project was to kind of merge all of their monitoring systems into one thing, so we can get an idea of if both of them are down at the same time, that's bad, or something like that. It was pretty cool.

Another thing that we worked on was downloading ringtones and wallpapers. If you recall, that was a thing, too.

CRAIG BOX: The Crazy Frog.

STEVE MCGHEE: Yes. Exactly.

CRAIG BOX: There is very much a mentality in SRE today to monitor from what the user sees. Was that a thing back then as well, or did that develop over time?

STEVE MCGHEE: Not at all. We did have a thing that we called, Hello Is This Thing On, or HITTO. And we attempted to synthesize that. But there wasn't a way to program devices, especially mobile devices, at that time to give back real telemetry. I think there were actual machines, not even VMs, running in universities that we would make calls from and make them look like phones. That worked reasonably well.

CRAIG BOX: What was the distinction in terms of network capacity and quality back then? I know that there are networks, for example, in Google offices where you can pretend to be on 2G or 3G for testing your application. But I imagine it was a lot different in those days.

STEVE MCGHEE: It was sort of a double-edged sword. At that time, the availability of mobile services didn't have to be that high, because there were very low expectations of mobile networks. That was good, in that we could kind of get away with some hacky stuff. But at the same time, some of our developers would build on these really good networks, and they would build systems that expected good networks, and then things wouldn't degrade terribly well. But we learned that pretty quickly.

There was a point when, I recall, someone added a feature, not to the networks but just to their testing infrastructure, around introducing latency to random parts of the request. And then everything sort of fell apart and we're like, huh, OK, maybe we should not do that. That helped a whole lot.

CRAIG BOX: And does all that still come in useful today, when you think about people accessing Google from countries that don't have the infrastructure that America does, for example?

STEVE MCGHEE: From the Google product development side, a lot of that is sort of built into the way that these teams build out these services. Part of the testing procedure is to ensure that graceful degradation is built in. It's actually often part of the framework. So the developers hardly even have to think about it.

But it is certainly, like you mentioned, it's not just based on someone driving through a tunnel. But if you're in a country with poor internet access, you still want to be able to use the services. And we want them to work well. So it's certainly important.

CRAIG BOX: Part of Google's hiring philosophy in engineering is that there's a small number of stacks and tools that everyone uses. And that makes it easy to move between teams. In your SRE career, you touched many different Google services. Does that still largely hold true?

STEVE MCGHEE: I think so, at least within SRE and the production facing side of the house. It's actually gotten even better. The funny stories tend to revolve around things like releasing systems, like deployment systems. I recall at one point there being eight different release systems that all basically did the exact same thing. And in the past 10 years or so, we've whittled that down to one or two that do the superset of all their behavior, and they actually succeeded at deprecating the eight or so that existed at one point.

CRAIG BOX: Feels like the inverse of an xkcd comic.

STEVE MCGHEE: It is. It turns out, when you have some motivation and some coordination, you can actually get something done in the correct direction. It's not just like the raw internet doing all the infinite things.

CRAIG BOX: Is that part, then, of the SRE philosophy -- that effectively there is one very senior person to whom everyone rolls up, who has the authority to say to teams, no, you can't deploy?

STEVE MCGHEE: In a sense. It's not so much the no, you can't deploy part. It's more about a mantra or a rule or an intent within SRE to scale the team itself sublinearly to the number of services that it supports. And everyone seems to agree on that.

It does take a senior leader to write that down and say, this is one of our things we hold true, and then everyone kind of nods and goes, oh, yeah, totally. And then from that, you derive things like this, where we say like, well, if we're going to scale sublinearly, we can't have 19 different deployment systems. Because every time you change teams, you're going to waste a bunch of time learning the new thing. And so it's less about you must use this tool and more about, we all agree we want this outcome, therefore, using this one tool will help us get there.

CRAIG BOX: I'm going to read out a tweet that you posted recently. And I'd love it if you could tell me the circumstances behind it.

STEVE MCGHEE: Oh, boy.

CRAIG BOX: "Breaking Prod. More than once, I personally made it impossible to use Google search from a phone for a little bit. Like, for everyone on the planet."

STEVE MCGHEE: Yeah. Whoops. It's not as bad as it sounds. For a little bit is kind of the key part of that tweet. Or at least that's the CYA part of that.

CRAIG BOX: Like a microsecond?

STEVE MCGHEE: No, I mean like several minutes. From my recollection, it was like in the middle of the night, at least for the part of the world that I was in. At the same time, this was during the time where mobile search wasn't that big of a deal. It sounds crazier than it is.

But essentially, I was on one of the teams that ran the mobile interface to all of web search. Nowadays, it's all part of the main web search, because mobile's kind of a big deal these days. But at the time, it was less of a big deal. It was kind of more of an experimental thing. So it had its own dedicated set of servers. And my job was to make sure that they worked all the time.

And as is the SRE mantra, 100% is not the goal. There was a time when we were in the error budget and stuff was broken. And I was certainly a part of that. So occasionally, something goes wrong, and you flub a command and you go, oh geez, and undo, undo, undo, bring it back up. There are plenty of graphs with a big dip in it where I went, yep. I did that. Sorry about that.

CRAIG BOX: And thus are postmortems written.

STEVE MCGHEE: That's right.

CRAIG BOX: Let's talk about your summer vacation. Why did you leave Google?

STEVE MCGHEE: So I went from university to Google pretty much directly. I had one job in between, but it was at a university. So it kind of doesn't count, in my opinion. It was a fun job. But it wasn't a real job.

So basically, I just went straight from school to Google. I never really experienced the real world. At the same time -- you may recall personally -- I moved to the UK, where we hung out a bit in the London office. I enjoyed it immensely. But at the same time, I also missed California. And I probably had seasonal affective disorder and was trying to deal with it being cold and rainy all the time. I wanted to not be in London anymore. And at the same time, I also wanted to work somewhere else.

And so I found this company back in California. And it looked pretty decent. And they said that they would pay me to do computer stuff. And so I said, let's do it.

CRAIG BOX: The part of California that you wanted to work in was also a place where there was not a center of Google engineering at the time, which would have made it a bit harder to transfer, if you'd wanted to.

STEVE MCGHEE: That's right. I did look into if I could stay with Google, but move to this location. So this is a town called San Luis Obispo, California, or SLO. If you're into SREs, you'll find that funny. I wanted to move here. I went to school in Santa Barbara, which is nearby. And I knew I really liked the area, the way of life, and all this kind of stuff. So I was unable to get Google to pay me to work on computers from San Luis Obispo. But this other company did. So it was a deal.

CRAIG BOX: There's an acronym, which in fairness is mostly used by vendors: GIFEE, Google Infrastructure For Everybody Else. Going from Google to outside, how important was infrastructure? Can you just sprinkle Kubernetes on your data center and say, we do SRE now?

STEVE MCGHEE: That'd be nice. Yeah, just a light dusting would be just enough. Yeah, so my initial title was actually Infrastructure Architect, which I found astounding. I didn't realize I was allowed to build houses now. That was fun to be called an architect at first.

I took two months off between Google and this company. And I read up a lot on "what is this Kubernetes thing, and how do I actually do it?" And GIFEE came up. I remember there being a promising amount of links. And then once I dove into it, not a lot of actual content.

And so it looked good from the outside, and then once I dug into it, I was like, oh, man. This is not what I was hoping for. So this was two years ago. It was a start, and the intent was correct. And I can see how it's marketing gold. But under the covers, it quickly devolved into a series of shell scripts, which was unfortunate.

CRAIG BOX: If we hold Google up as the gold standard of running services -- and let's not talk about whether or not that's true -- then as well as the infrastructure pieces, some of which are now open source and available either in re-implementations or versions that Google has released, there is also all of that process piece. You wrote a blog post on rebuilding SRE from memory, and you talk about all of the documents and the things that you needed to build. There were three books that had been written on SRE at that point, but you had to build up a whole heap of extra stuff in order to start implementing this. What was missing?

STEVE MCGHEE: I had forgotten about that. It's a good point. If you look closely at that post, there's a bunch of things that should be links that never became links, unfortunately. So it's really like, more of a wish list at this point. A lot of these were cultural.

One that I recall was a document within Google that was basically around, how do you escalate problems between teams? I kind of yearned to be able to find that document instead of just getting in a tiff with a colleague -- to be able to say, hey, let's follow this procedure that we all know about. And I realized, well, no, they don't know about this procedure. Because that was at Google. And so I had to write it. And I had to write it from memory.

CRAIG BOX: Pistols at dawn.

STEVE MCGHEE: Oh, pistols at dawn, yeah. That's right. It was more about escalating to a common person in leadership and having that person as a third party decide between the two warring factions and come up with an agreeable understanding. And to me, that made perfect sense. But unless you actually write it down and have people agree on it ahead of time, it's hard to do it in the moment.

CRAIG BOX: We talked a little bit before about how having a single figurehead for the culture of SRE within Google made that possible, whether it be Ben Treynor Sloss, or Urs Hölzle above him. Is that true in outside organizations, especially if you have a company whose goal is making some widget, and they have some engineering on the side? Do you think that there is always a person that is suitable to use as that escalation point?

STEVE MCGHEE: No. And that was exactly the point, is that there wasn't such a person. At one point, I vividly remember thinking to myself, what would Ben do? Ben being Ben Treynor Sloss. Not what would he personally do, but what would a person in that position do if they were in my position. And that kind of changed my thinking about the whole process.

That's when I decided to write these things down, and think a little bit more about the cultural requirements -- less about the technology, more about the people. It turns out the people, and how they work together, are really important. That defines a lot of how the technology is used.

The comparison I like to make is that 15 years ago, there weren't a lot of CISOs out there in the security space. Now there are. And it turns out that's because there were a lot of these horizontal requirements around security that were really hard to just drive through from the bottom up. We had to have someone at the table to say, yes, we will, in fact, enforce TLS, or something like that.

I think reliability is in the same space that security was 15 years ago. We'll find a CRO or a CIRO -- someone's going to come up with an acronym. But I think we will find that position will be more common in the future.

CRAIG BOX: In terms of the technology landscape at the time, you now had access to Google only as a cloud customer, and then of course to Kubernetes and the entire cloud native landscape. What did the technical platform evaluation process look like?

STEVE MCGHEE: That was fun. I was a customer to all the clouds. We had to evaluate the open source offerings, the vendor offerings, the cloud service provider offerings. And I was kind of on my own within the company. The direction was pick something, show us why you picked it, convince everyone that what you picked was the right thing.

I spent weeks and weeks, probably months, comparing the different offerings in terms of PaaS versus IaaS, and writing down my findings and pros and cons. And I talked about Kubernetes and I talked about serverless offerings, comparing them against the needs of the company at the time.

At one point, I realized I just had to have a little bit more structured thought. So I kind of made the case for why I thought Kubernetes made sense for us at the time, as opposed to a serverless or VM-based offering. And then from there, I moved into which Kubernetes offering do we use? Do we use open source via something like KOPS, or do we use a hosted offering? And then from the hosted, which one do we choose?

And there was this really long, egregious process. And I don't wish it on anyone. But I realized people are doing it constantly today. So making that process smoother and easier is like a back burner goal of mine. I feel like a lot of cloud adoption is just like, you walk into a foundry and they have every possible part imaginable. And they say, you can build whichever car you'd like from any part.

Many companies, unfortunately, end up building the Homer Simpson-mobile, which doesn't work terribly well. Where in reality, what they really should do is say like, would you like a truck or a sedan? That would be a lot easier. I believe they call these solutions. And I think we're all working on that.

CRAIG BOX: In saying that the vendors are very much trying to move to building out higher level platforms these days, they don't want to be the proverbial lumberyard just selling wood, even though apparently, lumber is really expensive right now. Because there's a shortage of everything.

I can understand why that is. Those services are higher margin and they offer more soft lock in, perhaps, in terms of once you're happy with the system, you're unlikely to move off it. Can a vendor ever sell anything other than infrastructure when that's the thing that the customer is looking for? Is a set of processes so personal to a particular team to implement that they're going to build their own thing, and thus that they won't get the value out of the thing that the vendor's selling them if it's higher level?

STEVE MCGHEE: Yeah. I think so. I think they've been successful at doing that, actually. You can take a cynical look at it and say it's really all just computers under the hood. But in reality, we've seen that different vendors have been successful. I don't know if "lock-in" is really the right term-- it sounds like shackles. And it's not really that. It's more like a level of convenience that is hard to overcome. It's like-- I forget what the term is in science, but the energy required to move from one electron shell to the next. Activation energy.

Once you're in one position, the idea of moving all of your data and all of your compute from one to another, of course, comes with engineering effort. And I think one of Google's strategies was to try to add some lubrication to that process, I guess, to make some consistency so you didn't need to reengineer too much. This is how Kubernetes helps here.

I think one misunderstanding in the world of multi-cloud and portable workloads is, at least in my opinion, that the goal is not to have a workload that you're going to dynamically move from cloud to cloud. It's just that if you had to, you'd want to be able to move from cloud to cloud -- once a year, or once a decade, or something like that.

CRAIG BOX: It's an insurance policy, more than a thing people actually do.

STEVE MCGHEE: Yeah. Exactly. The other thing that is related to this is often, if you're a medium to large company, you're going to have acquisitions. And you're going to have what they call shadow IT. And even if you've picked a cloud and picked a vendor and everything's running along smoothly, one day you're going to be like, oh, wait a minute. There's this other stuff running on this other cloud. Shoot. We didn't know about it. Or we bought it. Or something like that.

So M&A is a real thing. And you want to be able to handle the interrupts from that, and not have to spend an entire year shuffling acquisitions from one system to another by completely reengineering their production stack. If you can, quote, just redeploy them to the new provider via a series of YAMLs, you're in a much better position.

So in my opinion, whenever we hear arguments on the internets that no one's really going to run multi-cloud in parallel -- sure, that's true. But there's a tremendous amount of value in being able to have a vanilla layer between all the clouds that you can then use to migrate between them in a slow or controlled fashion.

CRAIG BOX: When you were brought on board to this company to build out this platform, presumably they already had services running on a platform, and you've talked there about mergers and acquisitions and this sort of constant change. How much is it worth trying to have a unified system versus unified processes?

STEVE MCGHEE: I think it's really, really helpful, actually. At this company, we had a traditional platform built on-prem, with SANs and application servers and all sorts of things. We had all that stuff. And it was all running fine.

And then we had a series of acquisitions that had been made in the past three or four years. And many of them were on different clouds of varying levels of maturity. And my goal was essentially to make one platform to rule them all. And I knew that forcing them would never work. So instead of a stick approach, I was pure carrot.

So all I wanted was to build a carrotful platform. After all of that analysis I referred to before, I built out a platform based on GKE. And we had a CI/CD pipeline, and all the things you would expect. And we basically said, OK, the platform is ready. It has a couple of Hello Worlds on it. Like let's put something on it.

And we had a few teams self-select to try it out, so these were greenfield services. And a nice statistic I like to share: on the old platform, the on-prem one, once a new service was written and all the code was done, getting it to production -- to the first request being served -- took eight weeks of effort, in terms of filing tickets and opening up firewalls and ports and load balancers and all this stuff.

CRAIG BOX: Sounds familiar.

STEVE MCGHEE: Yeah. It was quite a while. Eight weeks. And the new platform we built turned that into eight minutes, which is slightly faster. So it sounds like kind of an arbitrary timing, like who cares about making new services. Like how often do you do that?

But it had a pretty significant impact on the traditional side of the house, because if you have this level of friction in front of you, you're basically never going to make a new service unless you really, really, really have to. And so the outcome of that is you tend to put all of your new code into old services. And you have a bunch of mini-monoliths all over the place. Even though you say you're running microservices, or SOA or something like that, if it's hard to make a service, then you're not going to make them very often.

So we had a lot of services that did-- they were Swiss army knives of services. And that's trouble for releasing, as well as scaling, and things like that. So the new model was much faster and allowed for almost ephemeral service creation, which was a lot more flexible.

CRAIG BOX: A lot of more traditional companies will have services that have not been touched in many, many years. They are running on maybe a physical machine somewhere. And then there's always a lift and shift versus modernized discussion. And then there's also the approach of just throwing a sidecar in front of it and connecting it to your service mesh without even touching it. How do you make a decision like that?

STEVE MCGHEE: This is actually why we chose to use Kubernetes over serverless. We kept serverless in our back pocket through serverless-on-Kubernetes options; we didn't adopt it during the time that I was there, but it was still part of the strategy. The idea was essentially that the greenfield systems can adopt the platform I was just discussing, and the brownfield systems -- the existing services that aren't really going to be rebuilt -- can also make it onto the platform, and gain at least a subset, if not a significant amount, of the capabilities of the Kubernetes platform, even if they're, quote, just VMs, with little to no code change.

I believe -- I'm still in touch with people from that company -- that it is working, with quotes around it. I'm not really sure how well it's working or how far along they are. But the dream is real. Moving traditional VM-based workloads alongside greenfield and cloud-native workloads, all in the same enterprise-y Kubernetes world, seems to be successful. So that's great.

CRAIG BOX: There's a famous quote from the author of Hadoop, Doug Cutting, about Google sending postcards from the future. He's talking about the papers that we published in the early days of things like MapReduce. On this show, Tim Hockin has talked about the idea of Kubernetes being a crystal ball, and telling people, here are some things that you are going to need in the future.

Sometimes you'll hear about a little town where it's against the law to have a sleeping donkey in the bathtub after 7:00 PM. And you think there must be a very specific reason that that law was put in place once upon a time. Many of the pieces of complexity in projects like Kubernetes feel like they are similar to that. And then a lot of people who are adopting them think, oh, I'm never going to run into those problems.

As someone who saw a lot of it in your time at Google, how do you go about explaining that to people, that the reason these things are so complicated are things that you might need, and you shouldn't be scared of them?

STEVE MCGHEE: That's a deep question. We could talk for an hour just on that.

CRAIG BOX: Not to mention talking about why the donkey is in the bathtub.

STEVE MCGHEE: We don't even have to get into that. That's another hour. I used to make a joke with my colleagues at this company I joined that, as I said, I went from university to Google with nothing in between. So I had only ever lived on what I called the spaceship.

When I left the spaceship and came back to Earth, and I was working in a real-world company, I would say silly things like, all we really need to solve this problem is the anti-gravity drive. Don't you guys have one of those sitting around? And they'd kind of look at me funny. And I'd say, oh, right, OK. Got it. We don't have one of those yet.

Yeah, Google is a spaceship. It has a lot of alien tech aboard. Like Prometheus bringing fire to the people, Google is trying to bring a bunch of this alien tech to the rest of the world, because it turns out just having it inside one company isn't the best use of it all. I get it. A lot of the complexity is hard to understand. But it's actually there for a reason.

So, one example of that: I'm working on a publication right now, and one of the sections we're working on is about dependency management. One of the goals of this is to be able to identify which services depend on each other, and to introduce concepts that internally we refer to as layering -- making sure you don't have circular dependencies, making sure you can enforce that A doesn't talk to B under condition Z, and so on.

This all sounds super abstract and unreasonable, until you work in a place where there are tens of thousands of services that have just grown organically. And they can cross continents. They can cross org lines, totally without any checks and balances if you let them. And they cause problems when that happens.

There's a reason for wanting to have these seemingly abstract controls in place, because they actually prevent a complete collapse of these complex systems. One way of thinking about this: if you know how to sail a boat, you can sail a boat anywhere in the world. But if someone just drops you into Sydney Harbour, or the San Francisco Bay, or the Galapagos, you're going to have a hard time unless you have someone with you -- or at least a chart or something -- who can tell you about that particular place. Just having the skill of sailing a boat is required, but it's not sufficient.

So being able to understand the huge dynamic complexity of the currents and the other boats and where all the ports are and the big ships coming down the line, and all this other stuff, none of that is taught in your sailing school. You don't know about which way the boats tend to come from on Thursday evenings until you live it. So having someone alongside you who has seen the channel before is super helpful in that they can kind of give you these heads up.

Why do I need to bother learning about this? This is basically just experience. Experience helps you predict the future and not suffer when it hits you right in the face.

CRAIG BOX: What do you say to someone who says effectively, oh, I'm only ever going to sail this tiny little boat in this tiny little water that I'm used to. I have no reason to believe I will ever grow to 10,000 services.

STEVE MCGHEE: The problem with the internet is that stuff happens quick. Computers can scale up quickly. The Slashdot effect is still real.

CRAIG BOX: Look it up, kids.

STEVE MCGHEE: Sorry. I just dated myself, as would be expected. Life comes at you fast. Sometimes you've got to be ready. It's better to be prepared for this kind of hypergrowth and handle it gracefully than to panic and freak out and suffer and offer errors to your users. No one wants that.

You can dream small. But you might end up being big. That's one way to think about it.

CRAIG BOX: You mentioned Prometheus before, which is what happened when some ex-Google engineers left and wanted a monitoring system. Mesos was what happened when a few ex-Google engineers left and wanted a Borg. Is the Google software stack special anymore? Are there large pieces of it that are inside Google with no open source equivalent?

STEVE MCGHEE: Yeah. There's a bunch of things that are still inside of Google that don't have an external equivalent yet.

CRAIG BOX: Have the right people not left the company?

STEVE MCGHEE: Really, the stuff that's missing is the interplay between them all. Having a consistent way to work between traces and your monitoring consoles, and then diving into the source code that tells you where that trace came from -- right now, in the open source world, that's like five different products. There are kind of deep links, a little bit, but they don't always work. And that's because you have five different companies, or organizations, or just open source cohorts working on them, without really a whole lot of expectation of complete interoperability.

Within Google, obviously, we have a giant monorepo. You can see all the code. When you want to make a change to the tracing visualization system, and you want to be able to connect to the source control system, you can just do it on both sides of the fence. And so you're able to have this interconnection between these tools that works really, really well.

And not only is it cool and helpful, but it's cool and helpful to all of your colleagues all at once. And so you can have like a tremendous lifting effect to a lot of Google engineers and therefore a lot of different products. Many boats can be raised at once, which is pretty great.

CRAIG BOX: If you were to advise a startup to build something that exists inside Google as a concept, but there's no cloud-native equivalent for, what would it be? Would that just be adding a sixth thing to the five different systems? Do you think people should focus more on the glue between them?

STEVE MCGHEE: The big ideas, I wouldn't say they're all out there, because I haven't really done a full accounting in my head. But a lot of the big ideas are already out there. They're not really working great. And they're certainly not working great with each other. So improving the glue between them would be a fabulous use of time for anybody.

And really it's just about using the tools, finding where they itch you, and then scratching that itch. This is how these startups happen. So I would just follow that path. I wouldn't say, like, try to find an insider at Google who can give you a hot tip on the new hot thing. Because generally speaking, those are pretty gigantic.

They've had hundreds, maybe thousands, of engineers working for many years on them. So attempting to replicate that in startup land is audacious. That's good, but at the same time, are you sure that's really worth your time when there's all this other -- maybe not low-hanging fruit, but available fruit?

CRAIG BOX: In terms of prevention of outages, something that is often cited is that it's not the deployment of software within Google that causes outages. But it's configuration pushes. What's the state of the art of managing configuration in the cloud-native ecosystem?

STEVE MCGHEE: The way I try to advise companies on this is that you're really thinking about risk, and every time you change production, you're altering your risk equation. Every change is technically a risk. The formula that I give customers is that risk is equal to blast radius, times time, times probability. So when you're reducing risk, you can reduce any one of those three things.

Generally, configuration refers to a config file, or arguments to a binary, or something like that. Applying it happens over the course of some amount of time, and you can spread that out. The probability is about, well, are you checking the input of your configuration file? Do you have something like a linter in place to make sure that your config file is of non-zero length, that it satisfies the syntax, and all these kinds of things?
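As a sketch of that probability lever, here's a toy pre-push linter. The JSON format, required keys, and checks are all invented for illustration:

```python
# Toy config linter: catch empty files, syntax errors, and missing keys
# before a config ever starts rolling out. Schema is hypothetical.
import json
import sys

REQUIRED_KEYS = {"service", "replicas", "timeout_ms"}

def lint_config(path: str) -> list[str]:
    """Return a list of problems; an empty list means the config may ship."""
    raw = open(path).read()
    if not raw.strip():
        return ["config file is empty"]
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"syntax error: {err}"]
    missing = REQUIRED_KEYS - cfg.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []

if __name__ == "__main__":
    problems = lint_config(sys.argv[1])
    if problems:
        sys.exit("refusing to push:\n" + "\n".join(problems))
    print("config passed lint; proceed to the canary")
```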

But the biggest thing is the blast radius. When you apply configuration to an existing running service, the worst thing you can do in terms of blast radius is apply it to all of it at once. That's a blast radius of 100%. Your listeners may have already heard of the alternative: this is canarying.

So if you can instead apply your configuration to 1% of servers at first, and then wait and see whether that 1% explodes or not, you're in a much better position -- even better if you can do it to 0.1% or 0.01% and gradually turn that up. You can detect whether you're introducing a bad change, so your overall risk exposure is much, much lower. It's literally orders of magnitude lower.

Being able to do so has a far greater positive impact on your customer base. If you imagine you have 100 customers and you're rolling out to everything at once and you introduce a bug with a bad configuration, and you have to roll it back, all 100 customers see that. They all experience that two minutes of downtime. But if you have 100 customers and each of them have their own VM or something like that, then you apply it to just one of the VMs, only one of your customers is going to notice that couple of minutes of downtime. And the rest are not even going to see anything.

So over the course of time, essentially just by exposing these changes to 1% of customers or traffic at a time, you're making your software look 100 times better, literally, to your customers en masse, without doing anything to your actual testing or software development procedures. You look like you just got 100 times better at writing software. So it's a very powerful lever, which is why we talk about canarying pretty much on day one when we talk about SRE and reliability.
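To put rough numbers on that, here's a back-of-the-envelope version of the risk formula from earlier; the two-minute outage and the 5% chance of a bad change are made up purely for illustration:

```python
# risk ~= blast_radius x duration x probability of a bad change.
def risk(blast_radius: float, minutes_down: float, p_bad: float) -> float:
    """Expected customer-minutes of downtime per rollout, per 100 customers."""
    return blast_radius * 100 * minutes_down * p_bad

big_bang = risk(blast_radius=1.00, minutes_down=2, p_bad=0.05)
canary   = risk(blast_radius=0.01, minutes_down=2, p_bad=0.05)

print(f"all-at-once rollout: {big_bang:.2f} customer-minutes at risk")  # 10.00
print(f"1% canary first:     {canary:.2f} customer-minutes at risk")    # 0.10
# Two orders of magnitude less initial exposure, exactly as described above.
```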

Getting there, of course, is tricky. I say this at a very high level because once you get into the weeds of actually rolling this out, it's hard. Because A, you need to be able to siphon traffic at percent levels. So how do you send 1% of traffic to one new config place?

And at the same time, can you even roll out a new config to part of a fleet without having some sort of semantic drift? So there's definitely some engineering to be done here, but it is all possible. It used to be that your split was stuck at 1 over n, where n was the number of pods in the service; Istio helped a lot with that percentage problem, and with traffic routing in general. I recommend starting with the high level equation and working your way into the actual implementation.
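Conceptually, the percentage problem is just weighted routing -- what a mesh like Istio lets you declare, sketched here as a toy router where the canary weight is a knob independent of pod count:

```python
# Toy weighted router: send ~1% of requests to the canary, the rest to stable.
import random

def route(canary_weight: float) -> str:
    """Route one request; canary_weight is a fraction between 0 and 1."""
    return "canary" if random.random() < canary_weight else "stable"

counts = {"stable": 0, "canary": 0}
for _ in range(100_000):
    counts[route(canary_weight=0.01)] += 1
print(counts)  # roughly {'stable': 99000, 'canary': 1000}
```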

CRAIG BOX: You've been back at Google for a couple of years now, working with customers and internal teams to help them understand reliability as a concept. What are customers getting right, and what are they getting wrong?

STEVE MCGHEE: I think the hardest thing that customers are dealing with is choice. Let me start with what customers are getting wrong -- unfortunately, just because those things stand out and they're easier to remember. One thing that customers get wrong is that they have an expectation of how many nines they want. They set that bar really, really early. And they also have expectations of what that implies for other services and pieces of infrastructure around them, and those expectations are not correct.

So if you have a service that you want to be globally available at four nines of availability, what I like to call the naive math, or the bad math, says that you now need all of your backend services, or all of your infrastructure, to have five nines of availability, or 11 nines -- something more than the four that the user-facing service expects.

And that's actually completely wrong. That's backwards. I just gave a talk at SLOconf, I think it was a couple of weeks ago, where I talk about this -- we can put the link in the show notes. And I show it as a form of a pyramid. The point of the story is that you can build more reliable things on top of less reliable things.
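Here is a worked sketch of why the naive math is backwards. Assume independent failures (the big caveat) and that only one replica needs to answer:

```python
# Availability of "at least one replica up", for N independent replicas.
def combined_availability(per_replica: float, replicas: int) -> float:
    return 1 - (1 - per_replica) ** replicas

for n in (1, 2, 3):
    print(f"{n} replica(s) at 99%: {combined_availability(0.99, n):.6f}")
# 1 -> 0.990000, 2 -> 0.999900, 3 -> 0.999999
# Three two-nines replicas compose to six nines. Independence is the catch,
# which is why real designs spread replicas across zones and regions.
```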

The way to remember this is in ancient Google yore, we had computers that were built out of parts from Fry's and they were extremely cheap and they would fail all the time. And they were not the expensive SGI machines or HP machines that would never fall over. They were quite the opposite. They were machines that we expected to fall over daily. We got around that with software.

So that's not just a Google trick -- anyone can do it. There is a talk by a guy named Yaniv Aknin at SREcon EMEA 2019 called "The SRE I Aspire to Be", and the part of it that was really inspirational, to me at least, was the discussion of trade-offs between reliability and other functions. And it's done entirely in software.

An example of this is RAID. We don't care about disks failing anymore, because software fixed that for us. We spend twice as much on disks, and we do some fancy software tricks. And now it turns out that if a disk fails in the middle of the night, we don't have to page anybody. We can wait till the morning, and it's fine.

So there are a lot of these trade-offs around reliability engineering, and they're all software based. They allow you to build more and more reliable services on top of things that are not as reliable. That's what I mean by building more reliable things on top of less reliable things. It's an important thing that customers don't get right, right away, and a little bit of education helps change their mindset a lot. Multiple customers have said that this one sentence was the most impactful thing I told them. So I'm hoping to get it out to more and more people.

CRAIG BOX: You talk about things like RAID there. There's sort of an assumption now that we don't lose data, because we have these systems that are durable. Do you put that in the category of things that are a given now, that we sort of assume are always there?

STEVE MCGHEE: That's kind of the beauty of cloud, is that it's not just pure infrastructure. When you go and you get a VM and you attach a disk to it, it's not a disk in a disk tray on that same computer. You're not actually renting that. It's all virtualized.

Maybe this is obvious. But the fact that it's virtualized at the CPU layer, at the networking layer, and at the disk layer is pretty tremendous. It gets you a lot of benefits that you don't have to think about, and prevents you from having to worry about a lot of failure modes. If you boil it down, that's the first value that you get from cloud: this virtualization layer.

And it doesn't even have to be public cloud. This is not new. You could put VMware and a SAN in your data center, and you'd get a lot of this already. What cloud brought everyone was a huge amount of scalability, and consistent APIs for doing this quickly. I don't know if that answers your question.

CRAIG BOX: And the credit card.

STEVE MCGHEE: Yeah, the credit card. That helps.

CRAIG BOX: In your talk at SLOconf, you mentioned a model of reliability where you effectively have a full mesh, in that you have multiple instances of every set of replicas in order to make sure you have a service that is, as you say, more reliable when you multiply it out -- so that not everything needs to have eleven nines worth of reliability. Does that imply the need for something like a service mesh? Or can you get that same style of reliability in the cloud native ecosystem in a different way?

STEVE MCGHEE: Service meshes, as they exist today, promise capabilities, and those capabilities are extremely convenient. But the ratio of complexity to capability is a bit high right now. Frankly, it's a bit hard to run some of these service meshes and keep them running and keep them upgraded, and make sure that the version of this doesn't mess with the version of that, and blah, blah, blah. That's getting better over time.

CRAIG BOX: Is that a function of the service mesh, or is that a function of the fact that it's simply an app running on Kubernetes, and that's true for everything that you run on Kubernetes?

STEVE MCGHEE: It's kind of everything. An example is you need to upgrade a cluster. You can't just leave clusters as they are forever. Because you're going to miss out on things. And you're going to get hacked.

CRAIG BOX: They're on the internet.

STEVE MCGHEE: They're on the internet; trouble will happen, for sure. What's so hard about upgrading a cluster? Well, you can upgrade the control plane or the data plane; you can update Istio, or the Istio control plane; you can update the CRDs or the operators. And you haven't even touched your actual apps yet. You've still got seven layers of things to upgrade in one cluster. Then multiply that by the N clusters that you're operating, and try to do it all in a way that doesn't take down production, and blah blah blah.

Just that stuff is really hard to manage today. But in my opinion, the capabilities that a service mesh gives you promise to be worth it. It's hard to say that it's actually worth it right now, because depending on your company, it's potentially really difficult. But those capabilities, if they were less expensive, would be definitely worth adopting for the highly available services that you're looking for.

So one thing that I also always advise customers on is to try to outline what we call criticality tiers for your services. If you have one service that handles all of the money for your company -- credit card acceptance, or billing, or whatever -- that's probably important. Let's keep that one on as much as possible.

But the one that handles if you've gone on vacation or not this month, I don't know if you really need to spend a lot of money on keeping that up. That one's OK to go down.

CRAIG BOX: So we can have nine fives on that one.

STEVE MCGHEE: Yeah. Totally acceptable. So outline these tiers ahead of time, and don't have an infinite set of them -- I advise three. Tier one, two, and three, essentially; you can choose whether one is the most critical or three is. I don't know. It's up to you.

And then just providing a model for each of those tiers is really helpful. And generally speaking, the most reliable one will be the most expensive, and possibly the hardest to operate. But it has to be worth it to you.

There is a paper by Brad Calder and Anna Berenberg about archetypes of cloud deployments. It starts with a computer in a zone, and it works its way up to multi-region, super crazy, full-mesh situations. So there is a range of architectures you can choose from. You have to find which capabilities you need, and what cost you're willing to pay.

And just map your criticality tiers to each of those architectures. Then you're going to be in a great spot to stop worrying about, how do we build this car? Instead, you're choosing a platform. You're like, this is the race car. And that other one, that's the pickup truck. And then you tell your teams, pick a car, and go for it.
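What writing that down might look like, with the SLO targets, architectures, and services all invented for illustration:

```python
# Hypothetical criticality tiers: each maps to a target and an architecture,
# so teams pick a "car" instead of designing one from parts.
TIERS = {
    "tier-1": {"slo": "99.99%", "arch": "multi-region, active-active"},
    "tier-2": {"slo": "99.9%",  "arch": "multi-zone, single region"},
    "tier-3": {"slo": "99%",    "arch": "single zone, best effort"},
}

ASSIGNMENTS = {"billing": "tier-1", "vacation-tracker": "tier-3"}

def plan_for(service: str) -> dict:
    """Look up the pre-agreed tier for a service."""
    return TIERS[ASSIGNMENTS[service]]

print(plan_for("billing"))           # the race car
print(plan_for("vacation-tracker"))  # the pickup truck
```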

CRAIG BOX: Do you think there will be any demand for the Homer Simpson car?

STEVE MCGHEE: I hope not, but I think people will always be building Homer Simpson cars. There's always the aftermarket parts crowd, adding spoilers to Tercels and things like that. So it's entirely possible. But if we have a marketplace that allows for very clear choices, I think that would just help everybody.

CRAIG BOX: All right. Well, thank you very much for joining us today, Steve.

STEVE MCGHEE: Thank you. That was really fun.

CRAIG BOX: You can find Steve on Twitter @stevemcghee.

[MUSIC PLAYING]

CRAIG BOX: Thank you, Dan, for helping out with the show today. Good luck with the ceremony.

DAN LORENC: Thanks for having me and you're welcome for that.

CRAIG BOX: If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @kubernetespod or reach us by email at kubernetespodcast@google.com.

DAN LORENC: You can also check out the website at kubernetespodcast.com, where you'll find transcripts and show notes, as well as links to subscribe.

CRAIG BOX: I'll be back next time. So until then, thanks for listening.

[MUSIC PLAYING]