#147 April 23, 2021

Service Level Objectives and Nobl9, with Brian Singer and Kit Merker

Hosts: Craig Box, Richard Belleville

Brian Singer co-founded Orbitera, which was acquired by Google in 2016. During that process he met Kit Merker, who was a PM on GKE and the GCP Marketplace, and the two are now working together on reliability engineering startup Nobl9. We talk about migrating Orbitera to GKE and Google's SRE platform, and how many 9s are too many.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box with my very special guest host, Richard Belleville.

[MUSIC PLAYING]

CRAIG BOX: Richard, you were our guest on episode 94 talking about gRPC. And in that show, we learned that every new release, the G stands for something different. What does it stand for today?

RICHARD BELLEVILLE: Well, in the 1.37 release, it's "Gilded", but as release manager for this version, I got to choose the next one. And so it's been decided that the next one is going to stand for "Guadalupe River Park Conservancy", which also is GRPC.

CRAIG BOX: It is.

RICHARD BELLEVILLE: If you're not familiar with Guadalupe River Park Conservancy, in the Bay Area, there's the Guadalupe River, which runs through pretty much all of San Jose, and I like to go running on a path there. The group that maintains it is called the Guadalupe River Park Conservancy, which is conveniently located only a couple of miles away from the gRPC team itself.

CRAIG BOX: Brilliant. I was going to suggest that you could perhaps go with GNU. That's a similar recursive acronym, and I'm sure that wouldn't confuse anyone at all.

RICHARD BELLEVILLE: At least we bottom out with this one.

CRAIG BOX: Have you considered the Great British Remote Procedure Call?

RICHARD BELLEVILLE: Actually, that sounds really good. Maybe we'll use that. A couple of them have just been suggestions from GitHub user issues. So if anybody has suggestions, you can post them and we might pull them in.

CRAIG BOX: GitHub starts with g. I'm sure that would confuse people, as well.

RICHARD BELLEVILLE: Oh, my goodness, yeah.

CRAIG BOX: They might prefer that you don't do that.

RICHARD BELLEVILLE: We'll have to check with GitHub first, I suppose.

CRAIG BOX: It's been Earth Day this week, and there have been a number of climate pledges by countries around the world. America has pledged to cut its emissions in half by the end of this decade and has suggested that we can do things like buy electric cars, which sounds cool, and not eat so much red meat, which does not sound cool.

RICHARD BELLEVILLE: I agree with that. I've been absolutely abusing the steaks that I've been eating over the course of this pandemic. Since I haven't had catered lunch, I've been mostly just cooking up ribeyes in the oven. I've learned about reverse sears with thermometers that you can keep in the oven. And that's kept me happy throughout the course of the pandemic. I'm probably not being a great steward of the earth, but it's pretty tasty.

CRAIG BOX: Have you tried sous vide?

RICHARD BELLEVILLE: I haven't. That's a little bit beyond my current skill level, I think.

CRAIG BOX: I know that you can cook a salmon in a dishwasher.

RICHARD BELLEVILLE: I'll let you try it first.

CRAIG BOX: Yeah, I don't think you can really get away with a steak. This is interesting. Obviously with guest hosts, I can have conversations that I couldn't have with Adam, who was vegan. But Adam had seen the pictures of my haircut and had made some comments that it looked gray. I'm like, it's not gray, Adam, that's sunlight filtering in from the background. I know that might surprise you, that there was sunlight in Britain, but no, that's not gray. But it's all gone now, anyway.

RICHARD BELLEVILLE: It's funny you should mention haircuts again. I think a lot of people have had funny experiences with their hair throughout the pandemic. Obviously, it'll grow long.

I decided, after about a month of my hair being much too long for my liking, that I needed to take matters into my own hands. So I looked at Amazon for a set of clippers, and they were all out. Absolutely all of them.

So I decided to put myself on a waiting list for it. And it took about a month and a half. But ever since then, I've been cutting my own hair. I decided, if there is any time ever that you should have a bad haircut, it's during a pandemic, because nobody's going to see you. So I decided to get my sea legs and learn how to cut my own hair-- sort of blind, because you can't see the back of your head.

There have been a couple of bad mistakes. At one point, the guard came off as I was doing the back of my head, and I had about an eyebrow's worth missing. But a baseball cap, and you're fine.

CRAIG BOX: The worst thing that happens with a haircut like that is that you have to just knock the whole thing off, number zero, all over. And some people probably have a head that suits that. I don't think I do. I had one go at having a home haircut, and I had my partner knock a couple of inches off below the ears, nothing I would let her do anymore, over the top. I'm quite protective of the coif.

RICHARD BELLEVILLE: It's amazing what you can learn on YouTube.

CRAIG BOX: It is. And on the topic of masks and hiding all sins, there is an orthodontic clinic not far from where I live. And I'm always seeing people queuing up outside and, of course, they're all wearing masks. I'm thinking, well, it would have been the perfect year to have your teeth fixed, as well.

RICHARD BELLEVILLE: Would have been. Unfortunately, that's one of the worst things about the pandemic. Dentists are closed because of the microspray, or whatever they call it.

CRAIG BOX: Yeah. In the UK, at least, they reopened last year, and so I have been to the dentist since the pandemic. And they're wearing face shields, not too dissimilar to what they were wearing in the past. But I still find that I'm lying, looking at the roof, and not really able to hold down the conversation anyway. So not much has changed there. Should we get to the news?

RICHARD BELLEVILLE: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: Licensing and forks lead the cloud native news this week, with Grafana Labs relicensing their software from the Apache License to the AGPL. The update, which applies to Grafana, Loki, and Tempo, would force anyone who links their software with Grafana to also distribute their software under the terms of AGPL. The license is banned by companies like Google for that very reason. And CNCF projects that use Grafana for dashboards are receiving guidance to stay with the last Apache License version of Grafana for now.

RICHARD BELLEVILLE: AWS has been cited by other vendors in the past as the reason for similar relicensing. That doesn't apply in the Grafana case, as Grafana Labs has a commercial partnership with AWS, providing the software under a proprietary license. It does, however, apply to Elasticsearch, which AWS has now forked under the name of OpenSearch. This fork solves the trademark issue around Open Distro for Elasticsearch and their Amazon Elasticsearch service, which will be renamed Amazon OpenSearch service.

CRAIG BOX: Automation software Pulumi has launched version 3.0, including a new Automation API for using Pulumi in your own apps, new native providers for Azure and GCP, and Pulumi Components, high-level abstractions of cloud infrastructure. Pulumi 3.0 is the foundation for their new cloud engineering platform, unifying app development, infrastructure management, and security through code. Learn more about Pulumi in episode 76.

RICHARD BELLEVILLE: The K8ssandra project has released version 1.1, now with support for backups to MinIO object storage. The project has also become the home for DataStax's Cass Operator, which the Cassandra Kubernetes SIG has decided to anoint as the path forward, with a goal to see it join the Apache Foundation. Features from Orange's CassKop, with a K in the middle, are currently being merged into Cass Operator.

CRAIG BOX: Docker has refocused on the desktop, and the hottest new desktops are using the new Apple M1 chip. Or should that be the coolest? Anyway, the two have come together with the GA of Docker for Mac support for Apple silicon. Support for multi-platform images allows you to build and run for both x86 and Arm architectures.

RICHARD BELLEVILLE: Disaster recovery vendor Zerto has released Zerto for Kubernetes. Z4K extends their continuous data protection suite for backup and disaster recovery to containerized applications, promising data protection as code. Their platform is priced at, quote, Ask a Sales Rep.

CRAIG BOX: A blog post from the Kubernetes multitenancy working group sets out three different tenancy models to make it easier to operationalize multitenant use cases. Namespaces as a service and clusters as a service may be familiar to you, but the post introduces a third option, a so-called control plane as a service model, where you get hard separation of control planes but a single pool of worker nodes shared between them. An implementation of the latter model is provided by Alibaba and described in a paper at an upcoming IEEE conference.

RICHARD BELLEVILLE: In what looks like a potential case of parallel evolution, Loft Labs has open sourced vcluster, a virtual cluster technology for Kubernetes. vcluster works by running a virtual cluster control plane using k3s, and running pods started on those virtual clusters on the host cluster using a sync engine. Loft says they have open sourced the vcluster solution as it has already had production use as part of their platform.

CRAIG BOX: A bug in the Go containers/storage library has caused a CVE in container runtimes, including CRI-O and Podman. A malicious container image can be crafted which will deadlock the runtime. The vulnerability was found and reported by Unit 42 at Palo Alto Networks.

RICHARD BELLEVILLE: We end this week with more updates on the 1.21 release on the Kubernetes Blog, including volume health monitoring, indexed jobs, and graceful node shutdown. There are also updates on networking projects, including validating network policy implementations in CNI plug-ins and the new Gateway API, formerly known as the Service APIs.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Brian Singer is the co-founder and Chief Product Officer of Nobl9, a company building a platform to optimize software reliability. His previous company, Orbitera, was acquired by Google, where he adapted the product to follow Google's best practices for production and reliability. Kit Merker is the Chief Operating Officer for Nobl9. Kit was an early product manager on GKE at Google Cloud and a VP of Business Development at JFrog. Welcome to the show, Brian.

BRIAN SINGER: Thanks for having us on. It's great to be here.

CRAIG BOX: And welcome, Kit.

KIT MERKER: Nice to see you, Craig. Nice to be chatting with you.

CRAIG BOX: How did you two guys meet?

KIT MERKER: It was a fateful day.

BRIAN SINGER: This is a good story. So when I was at Orbitera, our platform helped companies publish their software for consumption on public clouds. And Kit was fortunately one of the PMs working on GCP Marketplace from its infancy. So obviously, he was interested in what we were doing. I can't actually remember the first time that we met, Kit. It might have been when we were pitching you on our platform, something like that. Maybe you have a better memory than me.

KIT MERKER: It's funny because, back in the day, I was working on this container stuff with Google, Kubernetes, and GKE. And I wrote this internal PRD called Container Marketplace. And it was this idea that we needed to go to this next level. I started working really closely with the Marketplace team. And I realized, talking to everybody, that there was just no way we could get enough apps in the Marketplace without some help. And I think Brian was trying to sell me something the first time we met, as he said.

BRIAN SINGER: Probably.

KIT MERKER: But I met with Marcin and a few other people from Orbitera. Marcin is our CEO now at Nobl9, as well. But he was the CEO at Orbitera. We started to make a partnership work between the companies. And one thing led to another. And we ended up acquiring Orbitera into Google. And that became a really powerful part of growing Google's Marketplace business. We couldn't have done it alone.

BRIAN SINGER: What Kit doesn't mention is that, pretty much as soon as he convinced everyone that it would be a good idea to acquire Orbitera, he left for JFrog. And so we never had an opportunity to actually work together at Google. But working with him sort of in the early days of doing diligence on Orbitera and whatnot, Marcin and I both knew that Kit was somebody that we wanted to find a way to work with someday. So when we came up with the concept for Nobl9 and we knew that we were going to do another startup, Kit was the first phone call that we made. And we were fortunate that he felt the same way about us.

CRAIG BOX: Brilliant. Let's wind the clock back a little bit and have a look at how you both got to where you are today. Brian, how did tech start in your life?

BRIAN SINGER: Like a lot of folks, I was an avid gamer back in the days of playing a lot of "Doom" and "Quake 2." I started to get really interested in computers, running my own Quake servers, modding them. Things like that led me to computer science in college. And I think the rest is history.

But as far as getting into the software business, I actually started my career in hardware as a chip designer. I really enjoyed the lowest levels of computers. How did the bits actually get from point A to point B? That kind of thing. Eventually, I realized that it takes a really, really long time to build hardware. So software called out to me. And I ended up moving more into the software industry.

CRAIG BOX: And Kit, you and I first met in 2014. I refuse to believe that was six years ago.

KIT MERKER: Crazy. Right?

CRAIG BOX: Yeah.

KIT MERKER: Back then, nobody knew what Kubernetes was either. So it's kind of funny, too, when you think back about what's happened over the last five years.

CRAIG BOX: We had to spell it out to people.

KIT MERKER: We did.

CRAIG BOX: Using Greek letters.

KIT MERKER: We did. I started programming when I was like 11. And I was doing all kinds of weird stuff, like trying to build my own artificial intelligence and genetic algorithms and things like that. And I was kind of a nerdy kid back then.

CRAIG BOX: Were you trying to do this on the Commodore 64?

KIT MERKER: Yeah. Something like that. I don't want to age myself. So I won't say what generation or processor I was using. But I will say I did have an 8088 that my dad used that I liked to play around with a lot when I was really young. But I moved toward the dark side over the years. And it's been interesting, I think important. And I think it sets people apart when they can get a little bit of everything. Doing business stuff and tech stuff together.

And what I think we're seeing is that it's merging. Business folks are becoming more tech savvy. And we're definitely seeing engineering folks and technical folks that are understanding business and entrepreneurial spirit and everything. And it's amazing how what used to be really two different trajectories for people has now come together as one way of looking at the world.

CRAIG BOX: I don't want to put words in your mouth. But you said you went over to the dark side. And you had quite the career at Microsoft before coming to Google, if I recall.

KIT MERKER: Yeah. I was there for 10 years. I worked on a variety of products. I worked on three releases of Windows, which I don't think I've used a Windows computer now for probably the last five, six years either. I worked on Office 365, the billing system for Azure.

And then I ran what we would now probably call the DevOps team for Bing. Bing is a search engine, if you're not familiar-- basically, the engineering tools there. And that was where I first got exposed to working with Googlers, by the way. And I left the Bing team for the Google team, because there were a lot of ex-Google folks working on Bing. My experience with Microsoft was like building package software. You know what I mean? Building operating systems and things like this.

Then moving really toward running online services and SaaS services and infrastructure. Then switching over to actually working at Google and seeing how that works in just a completely different mindshift of Borg and other systems that define the way that Google does Cloud computing. And for me, it was very foreign. A lot of people look at the Google infrastructure, and they see it as a very specific solution to a very specific set of problems and challenges that not everybody has.

CRAIG BOX: True.

KIT MERKER: Everybody has a core challenge. Right? The reliability and keeping your developers productive challenge. But they don't have the scale challenges and the growth challenges that Google has. But they do have the sort of fundamental trade offs. You know what I mean? So it was interesting to see that difference, because at Microsoft, we were coping with those things in a different way. And I think it came from a different history of where the infrastructure came from.

CRAIG BOX: Brian, you mentioned before that you started off in hardware and moved to software. Did you work at some big companies, as well? Or were you always in the startup space?

BRIAN SINGER: After working at a hardware company, I actually went and got an MBA because I said to myself-- it's actually kind of what Kit was saying. The engineering side is interesting. But I really actually want to understand how the business side of this works. Like, who are we selling these products to? And why are they buying them?

CRAIG BOX: Right.

BRIAN SINGER: So I went and got an MBA and really actually got enamored with marketing and understanding how marketing actually drives a lot of business innovation and back into product innovation. So coming out of business school, I went and worked in product marketing at Novell back when it was a public company and had a lot of different lines of business. Worked with some fantastic folks and really learned the software business there.

And then I went on to work at BMC. And that was really my first exposure to Cloud. BMC had a platform they called the Cloud. It wasn't really a Cloud. It was like a bunch of virtual infrastructure that you could run yourself. And that was about the time that AWS was starting to get a lot of traction in the industry, a lot of buzz. Like every customer we would go talk to about using the BMC, quote-unquote, Cloud was starting to ask us, OK, well, that's great. But how are we going to build an application that runs on this that can also run on AWS? How are you going to be compatible with the AWS APIs?

And when you hear the same thing enough times from enough different customers, you start to realize that there might be actually something that's really happening in the industry, some innovation that is going to change how people operate.

CRAIG BOX: That's MBA level thinking right there.

BRIAN SINGER: Exactly. At the time, I didn't know exactly what it was. But I knew that there was something really exciting there. And one of the things that I saw was that BMC, the way that it sold its software, was pay us a couple thousand per endpoint per year and you can license our management software, for example.

That didn't really work well with the whole, hey, $0.10 an hour for a server model. Right? You're going into companies and saying OK, predict your usage for three years. And we'll do a deal based on that. And they're saying we're going to scale up and down on Cloud and everything.

So that was sort of the foundation of Orbitera, was saying, hey, can we solve this problem of how do companies package and sell their software on Cloud platforms. We know that their companies are going to move here, and there's going to be opportunities to do that. And what was really interesting for me was we built that platform on AWS. We built it using a lot of the principles that were, quote unquote, state of the art at the time. Hey, we'll have multi AZ type deployment. We'll use Amazon RDS at the time.

We built that platform. And then when we were acquired by Google, we came into Google-- the first thing that happens when you're acquired by Google as a company is somebody shows up and says, OK, here's the production practices. You need to follow these. And when you look at that set of production practices and what you could have built on a Cloud like AWS at the time, it's pretty much impossible to follow those production practices. Right? You're talking about things like canary deployments and being able to roll back quickly and being able to have SLOs.

And for us to be able to do that using sort of traditional VMs and the APIs that AWS had at the time, it would have taken like a Herculean engineering effort. For us at the same time obviously, being acquired by Google, you want to run on Google systems. And we said, well, you know what? If we migrate to Google Cloud and use Kubernetes, this is going to be a whole lot easier for us.

And that was when it sort of dawned on me the first time that a lot of these technologies-- Kubernetes, a lot of the automation, GitOps-- were not just things that were invented in a vacuum. They were invented to allow engineers to follow the first principles around production that SREs had been pitching for a lot of years. And so that was kind of our journey in terms of developing on GCP and migrating to Kubernetes and why we did that.

CRAIG BOX: That's fair because the founding of your company was in 2013 before Kubernetes existed and before a lot of the Google services that make a lot of the stuff easy today.

BRIAN SINGER: Yeah.

CRAIG BOX: And the world inside Google was obviously very different at the time. People could run on this giant Google machine as long as they were inside the trusted group, really. As long as they were behind the firewall. Kit's put his good word in for you. You've been acquired by Google. You know that they've got Kubernetes now running to the public. What did you have to do to your software to make it operate at Google's scale?

BRIAN SINGER: People are going to love to hear this. But the first thing we had to do is we had to rewrite it. A lot of it. Because it's just the reality. We'd started writing this platform in 2011. We had taken a rapid prototyping approach. It was built originally in PHP. That just wasn't going to scale to what we needed it to. We took all the user journeys and everything. And we started to refactor that monolith, which I think a lot of companies go through, into a bunch of services, built on Go using protobufs, running on Kubernetes and GCP.

I think in a lot of ways, we were pioneers. This was 2016. So you've got to think back five years before Kubernetes was pretty much everywhere. And it was interesting because I can still remember our engineering team sitting down with the Kubernetes team and basically saying, show us how we actually do this. There weren't enough examples out there at the time of how to do this. And we probably spent three or four days with the Kubernetes team figuring out how to basically translate what we had built in AWS into something that would run on GKE. And hopefully, they got some value out of that, as well. I'd like to think so.

KIT MERKER: At the time there was a big controversy between teams at Google Cloud Platform. And the question was, should we run our infrastructure on Borg or should we be running on Kubernetes? And the tension was interesting because to run it on Borg meant you got all the benefits of the internal infrastructure. But if you ran it on Kubernetes, you'd have the benefit of fitting on the Google Cloud Platform at a level that was empathetic to what customers were experiencing. We wanted to get more mileage on it.

I believe at the time I can think of only maybe a handful of teams that we were working with across Google to build services that at least partially were on top of GKE or on Kubernetes. Most of the teams were building on Borg. My understanding is that's changed a bit. And people have moved to actually running Kubernetes.

BRIAN SINGER: Yeah.

KIT MERKER: I mean, one of the common questions I got as a Kubernetes Product Manager from people in the market all the time-- and I'm sure you got this too, Craig-- well, does Google run its infrastructure on Kubernetes? And we had to explain that, no, not exactly, kind of. We have this other thing that we built. And Kubernetes is actually a derivative of it. So we kind of have this inception of Kubernetes running on Kubernetes in a way-- Kubernetes on Borg.

It's an interesting thing, especially when you're going through an acquisition. And so, now you're faced with an internal conversation. And Brian's team had to fit into that and make it work with the Kubernetes side. I think ultimately it was the right choice for the team.

BRIAN SINGER: It was a really interesting time at Google because, if you think about what engineers within Google are looking for, they're looking for the tool chains that they are used to. We actually had to build a lot of the toolchain ourselves in our team to be able to run on GKE. We kind of embraced that because we wanted to show that it was possible. More than anything else, it was a challenge for the team.

It also, interestingly enough, in terms of recruiting engineers to our team, was a big selling point. Because it was basically, hey, if you're going to work on other parts of GCP in the future, wouldn't you love to work on a service that's actually running on GCP now? So you can understand the platform to a level that's very hard to do if you're not actually building software that's running on it.

And so that was something that was really exciting for us. And I think, since then, the toolchain in terms of the support across the Borg and GKE has improved quite a bit. I'd like to think that we helped out with that a little bit, as well.

CRAIG BOX: I still don't think they would have let you run PHP, though, even if you had wanted to.

BRIAN SINGER: Absolutely not.

CRAIG BOX: I had a similar answer, Kit, to what you were saying before in that Google had a Borg-shaped problem. And they had Borg as the solution to it. And so, if you were to take Google workloads and rip Borg out and replace it with Kubernetes, first of all, you would have to buff Kubernetes up to fit Google's exact problem, which was very different to external problems and, in some ways, still is. And you would have spent all that time spinning the wheels to achieve nothing, to get back to the state you were already at. Whereas, for acquisitions, people coming in, it made sense to start using the new products.

The other thing that new companies coming in have to do is build their system up to a point where it can be run by the SRE teams at Google. The SRE culture, very famously, standardizes on a bunch of things. And they have a bunch of different monitoring systems so that fewer people can monitor and look after more services at once. Kit, I wonder if you might like to teach our audience SRE in 90 seconds or less.

KIT MERKER: I have a video actually if you go to Nobl9.com where I do just that. I talk about SLOs in 90 seconds or less. The Site Reliability Engineering concept is really all about applying software engineering principles to reliability engineering. Blogs have been written on SRE. But if you think about the combination of Borg plus SRE as the operating system or operating model of Google's Cloud internally, Google externalized Borg as Kubernetes, a fully working open source project with a rich community and all this other great stuff.

But SRE has really been externalized as a book. That's really what's been done there. And when you look at the differentiation between SRE and traditional ops, one of the key differentiators is service level objectives. And really what it comes down to is setting goals for unreliability. This is how I try to explain SLOs: you're actually defining a goal for how much unreliability you can get away with without customers noticing. And that's defined as your error budget, which is really just 100% minus however many nines your target is. By saying, look, I want to hit three nines, 99.9%, I'm really saying that 0.1% of unreliability, 0.1% of error, is acceptable, and I can safely ignore it.
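
To put numbers on the error budget idea, here is a minimal sketch in Python; the targets and time windows are illustrative, not anything specific to Nobl9:

```python
from datetime import timedelta

def error_budget(target: float, window: timedelta) -> timedelta:
    """Downtime allowed by an availability target over a window.

    target is the SLO as a fraction, e.g. 0.999 for "three nines";
    the error budget is simply 1 - target, spread across the window.
    """
    return timedelta(seconds=(1 - target) * window.total_seconds())

# Illustrative targets: three nines over 30 days, five nines over a year.
print(error_budget(0.999, timedelta(days=30)))     # 0:43:12 -- about 43 minutes
print(error_budget(0.99999, timedelta(days=365)))  # about 5 minutes 15 seconds
```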

And by setting these goals and defining them across time windows, alerting, other things with those goals, you're actually giving your team a huge amount of flexibility. And the SLO concept is this very powerful tool. And interestingly, I think the reason why Brian and I are both attracted to it is because it does cross this business technical gap.

Because fundamentally, the gross margin in your business, the value and the risk that you're managing that produces value, comes from that. And it gives the SRE teams, these reliability engineering teams, such a powerful tool for deciding if they should work on technical debt, whether they need to respond to an incident, or whether they need to apply automation, as opposed to waking up on every single error. Right? Which is really maybe the default position that businesses would like to take if we were superhuman DevOps engineers or whatever.

But in the Google approach, that standardization is such an important part. And actually, the two fit together, this sort of microservices, distributed systems approach to running Cloud computing that is resilient. It lets you abstract away physical location, abstract away physical servers and physical data centers, and gives you this resiliency and defines things in these abstract services.

Service level objectives become almost the contract around those services that other services can make important decisions about. Because, if you are basically saying my dependencies should expect 0% and I should expect 100% from my dependencies, you're doing it wrong. Right? You're going to make bad engineering decisions. And you're going to create a lot of risk.

But on the other hand, if you can say, look, the latency goal for this service is this particular amount of time and a certain percentile, I can make decisions about my retry logic, about my provisioning, about my contingency plans that I can't otherwise. So this is really, I think, the key thing. And there's lots of other stuff in SRE about toil and all this other stuff. But I mean, if you really want to get down to the fundamental idea, this is what it really comes down to.
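
As a rough illustration of the latency-at-a-percentile goal Kit mentions, here is a small Python sketch; the sample latencies and the 300 ms / 95th percentile objective are invented for the example:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of observed request latencies (in milliseconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical objective: 95% of requests complete within 300 ms.
latencies_ms = [120, 95, 310, 180, 240, 900, 150, 210, 130, 175]
p95 = percentile(latencies_ms, 95)
print(p95, "ms:", "within objective" if p95 <= 300 else "objective missed")
```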

CRAIG BOX: The book is online and available for free. But there's a fun story, which I don't know is in the book, talking about a service at Google, which had an SLO perhaps of 99.95% or something. So it was allowed to be down for half an hour a month, say. Is that right? 99? Three, two and a half nines? Did I get that right?

KIT MERKER: You're quizzing me on how many nines for this? I know five nines is five and a half minutes a year.

CRAIG BOX: Per year. Anyway, so there is a period the service was allowed to be down. But the service was so reliable, it was effectively up 100% of the time. And so, people came to rely on the service and expect it was up and never do anything like retrying. Because, for the last 12 months or whatever, it had been up 100% of the time. And so, that team then had a problem. Because they wanted to be able to do maintenance. But everyone was assuming that they had an infinite number of nines as their SLO.

And so, they had to start introducing outages themselves. They had to turn things off and say, hey, we're going to be down for a period because we're allowed to be. And other people aren't honoring the contract with us that says we're allowed to be down. They're not doing sensible things in the case that we are, such that in the future, if we have a need to be, we don't feel we can.

KIT MERKER: There's absolutely a thing called being too reliable. And also, reliability often comes from luck, too. I think people think they got a great, resilient system. But in reality, it just hasn't been tested. Or they haven't had quite as much traffic as maybe they would have hoped. It's true that, if people come to expect a certain level of reliability from you, they're going to engineer their systems a certain way. You may not even realize it.

This is part of software services: the interdependency has gotten so complex-- rightfully so. There's a lot of value that comes from being able to stitch together third party services and reuse open source projects in these very scalable ways. But if you're not managing it, if you're not understanding how that has impact on users, on customers, and frankly on costs, those things all combine together to create a risk environment that you may not be aware of. And that risk is exactly what you're describing there, where you have this service that is running really super well. It sounds great on paper. Right? Hey, our service never goes down. Except when it does.

And that's where chaos engineering-- which, by the way, is a very advanced technique, and I don't recommend anybody start with chaos engineering if they don't just have like synthetics and reliability goals. But chaos engineering is built on this principle that you want to add some failure. And you want to add a predictable amount of failure. You want to see what happens and see if it actually violates your SLO when you're applying that pressure to the service to see how you can handle those risks.
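
The experiment Kit describes can be reduced to a simple budget check: add a known, predictable amount of failure and see whether the SLO would still hold. This toy sketch assumes a baseline success rate for the service; all numbers are made up for illustration:

```python
def survives_injection(baseline: float, injected_failure_rate: float, slo: float) -> bool:
    """Would deliberately failing a known fraction of requests still meet the SLO?

    baseline: the success rate the service delivers on its own (an assumption here).
    injected_failure_rate: the extra, predictable failure added on purpose.
    """
    expected = baseline * (1 - injected_failure_rate)
    return expected >= slo

# A service assumed to run at ~99.97%, measured against a 99.9% SLO.
print(survives_injection(0.9997, 0.0005, 0.999))  # True: the budget absorbs the injected failure
print(survives_injection(0.9997, 0.0050, 0.999))  # False: this experiment would burn the budget
```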

BRIAN SINGER: I would just add that the model that Kubernetes allows for actually increases the need for SLOs. Because if you think about it, in the old world of a monolith, you can kind of just monitor the underlying infrastructure. Hey, if my VM is alive, I kind of know that my service is alive. Right? But when you move to Kubernetes, you have things like pods crashing. It happens. Or stuff gets shuffled around. You're going to retry that connection. That's fine. You're using something like Istio? It's great. You're going to retry it.

But that retry does introduce some latency for your customer. Being able to understand how you've configured your environment, what the ultimate impact is, and how much error budget you're burning is even more critical in a Kubernetes world. That's one of the reasons we were so excited about starting Nobl9: we saw that sea change of companies saying, hey, we're going to go operate in this model. It's much easier for our developers. We get a lot more development velocity. But maybe they're not understanding fully the implications of that in terms of what it's going to do to reliability.

Actually, it changes the challenges when it comes to reliability. And so that new model, the SRE model, is a much better fit because you're going to get the benefits that I talked about of first principles when it comes to SRE. We can do canary rollouts. We can do rollbacks. We can do retries. We can do all these things. But if you're not also measuring the reliability of these services, you're not actually understanding how those things are impacting your customers.
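
One common way to express "how much error budget you're burning" is a burn rate: the observed error rate divided by the budgeted error rate. The traffic numbers and the alert threshold below are assumptions for illustration, not Nobl9's implementation:

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 means errors arrive exactly at the budgeted pace;
    anything sustained above 1.0 exhausts the budget before the window ends.
    """
    budget = 1 - slo
    observed_error_rate = bad / total if total else 0.0
    return observed_error_rate / budget

# An illustrative hour of traffic against a 99.9% SLO: 50 failures in 10,000 requests.
rate = burn_rate(bad=50, total=10_000, slo=0.999)
print(rate)     # 5.0 -- burning budget five times faster than the budgeted pace
if rate > 2.0:  # an assumed fast-burn alerting threshold
    print("page: error budget burning too fast")
```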

KIT MERKER: There are three terms that confuse people on this SLO space. So it's SLI, SLO, SLA. And I get asked about this all the time. So let me break it down just very quickly. SLI stands for Service Level Indicator. I think of this like a service KPI. You might have lots of data. You have a few KPIs that tell you a lot. That's really what an SLI is. This is the indicator, the set of data, the proportion that you're going to be tracking. It's not a goal. It's the data source you're using that hopefully correlates to user happiness. It tells you something about whether consumers are actually happy with that service, whether those are humans or automation.

CRAIG BOX: Right.

KIT MERKER: The SLA is your Service Level Agreement. And generally, this is modeled on the point at which we sue each other. The worst case scenario. If you're kind of violating the SLA, there's usually some sort of penalty. And even if it's an internal SLA, people understand like, OK, I made a commitment that I'm going to exceed this SLA. And I think the interesting question is, you're going to exceed the SLA by how much? How far above the SLA are you going to perform? Because we know that delivering the SLA was the point where you said, I'm never going to go below that.

And that's really where the SLO comes in. So if you think about the Service Level Objective, as opposed to an agreement, it's really the point on the SLI, the point on the KPI that you're targeting that actually makes people happy, that actually is the goal of what you'd like to achieve. And what you want to do is basically slightly overachieve the SLO, not overachieve it dramatically.

And by doing that and by tying that to where basically customers don't even notice it, that's really setting a bar for excellence. And that might sound hard. But it's way easier than perfection, because 100% is not realistic in any way, except by accident. And so, creating an expectation that a team is going to make things as reliable as possible is really antithetical to this approach of building reliability. Because if you can make it unreliable to the point where no one notices, it still gives you some breathing room to operate the service efficiently.

The Service Level Indicator, Service Level Objective, Service Level Agreement-- that's how they fit together. And hopefully you know those acronyms better now.
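
A compact way to see how the three acronyms relate: the SLI is what you measure, and the SLO and SLA are thresholds laid over it, with the SLA sitting below the SLO. The numbers here are invented tiers for illustration:

```python
def sli(good: int, total: int) -> float:
    """Service Level Indicator: the proportion of events that were good."""
    return good / total if total else 1.0

measured = sli(good=997_500, total=1_000_000)  # an illustrative month of requests

slo = 0.999  # the objective: the point that keeps users happy
sla = 0.995  # the agreement: the point where contractual penalties kick in

if measured < sla:
    print("SLA breached: penalties apply")
elif measured < slo:
    print("SLO missed: slow down releases, spend effort on reliability")
else:
    print("within objective: error budget remains, keep shipping")
```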

CRAIG BOX: You spent three years at Google with your co-founder Marcin. And then you left to do something else. What was the decision to start? And how did you pick what you would do next?

BRIAN SINGER: For Marcin and myself, we both just enjoy the company building process and sort of starting things from scratch and finding a new market or a new set of challenges that customers have and going and trying to solve those. And for me, what I really saw from within Google was that, no matter where companies were running their infrastructure-- be it GCP or AWS or on prem-- they were adopting this new model of call it SRE or DevOps or whatever you want. There was just a lot of hunger for that.

But the tooling was very much in this sort of, I would call it, Cloud 1.0 world of AWS-- not having idempotent APIs, not being infrastructure as code. So we wanted to build something in that space. And SLOs just seemed like a natural extension of what we had been doing. We fell in love with the concept and said, we just really want to build a company around the idea that it'll be easier to adopt these concepts inside of large companies. There will be hunger for that. Companies will want to move to this type of model for how they run DevOps or SRE or whatever the case may be. We just felt like there was an opportunity to build something in that space.

That's really how it started. And then, once we left Google, the standard entrepreneurial practice of making a lot of customer calls and talking to folks and saying, is this something you're thinking about? Is this a problem that you have? We started to get kind of an understanding of the market a long time before we even started to think about what product we would build.

KIT MERKER: I joined Nobl9 in November 2019. And we were all excited to go hit the road and go hang out with customers and everything and to start building this. An SRE meetup started in Seattle, and Google hosted it. And we were all excited. And then of course, next thing we know--

CRAIG BOX: Everything stopped.

KIT MERKER: Everything stopped. And we were in lockdown. And it's an interesting thing because, when you build a company during hard times, I think it creates resiliency. Kind of like software. You know? It's almost like chaos engineering for entrepreneurs.

CRAIG BOX: Yeah. I was going to say, is this some SRE principle here that says, well, in the event that everything shuts down, we've got a page in the book that tells us what to do?

KIT MERKER: There was no planning. But what we did and what I think helped us is we really focused on the keys, the core principles of staying connected, taking care of our team. Frankly, being able to meet with customers over Zoom and have that be socially acceptable reduced our travel footprint entirely. And we were able to actually meet a lot more customers. And we were able to create more content and create a community conversation that I don't think would have happened quite as quickly otherwise.

I mean, I think we would have done it. We would have found a way, but that hardship quickly became an opportunity for us. And we ended up building a pretty robust beta program as a result that I don't know that we would have done quite as quickly, or with as many big names, otherwise.

CRAIG BOX: The company was for a while known as Meshmark. Why? And why did it change?

BRIAN SINGER: We had originally thought, oh, SLOs. So you're going to need a service mesh to do SLOs. And what we found in reality was that companies are interested in mesh, but none of their stuff is running on a mesh today. But they still want to do SLOs. That's why we kind of dropped the Meshmark name. We didn't want to tie SLOs to something that was probably going to be adopted over the next three or four years.

CRAIG BOX: Now, you're called Nobl9. Did I pronounce that right?

KIT MERKER: You got it right. Even with your accent.

CRAIG BOX: How would I have known?

KIT MERKER: If you know, you know. Well, the 9 is for the nines in reliability. It's not just about trying to find more nines. You know, nines for nine's sake. It's actually trying to find the noble balance, what we call the noble pursuit-- the noble nine. That means finding the right nines for the job. And also the noble gases, if you're familiar with those, are actually the most stable of any of the elements. So we code named our first release Helium, our second release Neon. And if you know the periodic table, you can kind of figure out the roadmap from there.

And then no bull is kind of like no bull crap, which is our approach to kind of how we do things. So there's layers of meaning, Craig. Layers.

CRAIG BOX: And of course, the cool kids all drop a vowel from their names. How did you pick which vowel to drop?

KIT MERKER: Whichever one was the cheapest from "Wheel of Fortune."

CRAIG BOX: Tell me a little bit about what your product is today.

BRIAN SINGER: We built a platform to make it easy for companies to create and manage SLOs and sort of run their infrastructure based on performance against objectives. And we did it with a few core concepts in mind. One is that SLOs should be treated as infrastructure as code and that you could build and run them in a GitOps-style workflow. So we created a sort of YAML-based language to describe them that's Kubernetes compliant. We built a Kubernetes-style API and a little CLI tool we call sloctl to manage that, and made it so that you could put it in a CI/CD workflow.

And then on the back end of that, we recognized that SREs have a whole host of different options for how they gather telemetry and then how they view that telemetry. So the platform is built with that in mind. You might want to take your SLIs from a variety of different platforms. Maybe some of that data is living in BigQuery. Some other of that data might be living in-- I think it's called Google Cloud Operations now-- or something like Datadog. You want to bring that all into one place. You want to normalize that data, which is what we do.

Obviously, you can see that in our dashboards. You can alert off of it, or you can move it into other dashboards. The response so far to what we're building has been tremendous. I think we're just going to continue to push on the features, making it easier to build up SLOs. And Kit's going to keep putting out great content in terms of helping the industry understand why everybody should be using SLOs.
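
To illustrate the "SLOs as infrastructure as code" idea Brian describes-- a definition kept in Git and validated by a CI/CD step before it is applied-- here is a hypothetical sketch. The field names and checks are invented for this example; they are not Nobl9's actual YAML schema, API, or sloctl behavior:

```python
# A hypothetical SLO definition expressed as data (illustrative field names only).
slo_spec = {
    "name": "checkout-availability",
    "service": "checkout",
    "objective": 0.999,
    "time_window_days": 28,
    "indicator": {"source": "prometheus", "good": "2xx responses", "total": "all responses"},
}

def validate(spec: dict) -> list[str]:
    """The kind of sanity check a CI/CD step might run before applying an SLO definition."""
    problems = []
    if not 0 < spec.get("objective", 0) < 1:
        problems.append("objective must be a fraction strictly between 0 and 1")
    if spec.get("time_window_days", 0) <= 0:
        problems.append("time window must be positive")
    if "indicator" not in spec:
        problems.append("an SLI source is required")
    return problems

errors = validate(slo_spec)
print("ok" if not errors else errors)
```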

CRAIG BOX: Is this something that people install in their clusters and do alerting locally? Or is this a SaaS platform that they send their metrics to?

BRIAN SINGER: It's built as a SaaS platform right now, although we do have customers that are asking us if they can run it locally. And it is something that we're planning on supporting in the future.

CRAIG BOX: There's a famous quote from Steve Jobs where he's talking to the founder of Dropbox. And he basically says that what they've built is a feature, not a product. Dropbox is still around. But SLO monitoring sounds like it's something that you might expect to be added on to whatever your storage system is. How is your system different to or complementary to metric systems?

BRIAN SINGER: If you think about what metric systems are good at, they're really good at gathering and storing a large amount of telemetry and making that easily searchable and queryable. And then when you go say, OK, I want to do SLOs, you're actually missing a whole host of things that you need in order to do that. You've got to normalize the data. You have to be able to label that data as good or bad data. And then alert off of it, obviously.

And what we've found is that the gap there between what is available today and what you actually need in order to really use SLOs is significant. What's been fun for me as a product guy is just the innovation surface for SLOs is enormous. There's so much that we're building already and that we have planned in terms of making this something that is a core part of every company's approach to operations. We're just at the beginning, at the tip of the iceberg in terms of what you're going to be able to do with SLOs and what they're going to mean internally to companies. And I would say stay tuned for some of the things that we've got planned in the future, as well.
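
A small sketch of the "normalize and label as good or bad" step Brian mentions, when metrics arrive in different shapes from different sources; the field names and data are invented for the example:

```python
# Two metric sources with different shapes, normalized into the same good/total
# form so a single SLI can be computed across them (illustrative data only).
bigquery_rows = [{"status": 200}, {"status": 200}, {"status": 503}]
datadog_points = [{"error": False}, {"error": False}, {"error": True}, {"error": False}]

def normalize(rows, points):
    good = sum(1 for r in rows if r["status"] < 500)
    good += sum(1 for p in points if not p["error"])
    total = len(rows) + len(points)
    return good, total

good, total = normalize(bigquery_rows, datadog_points)
print(good / total)  # 5/7 of events labeled good across both sources
```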

CRAIG BOX: You mentioned that this data is stored in Kubernetes CRDs. Are you seeing that the people who want to manage SLOs are all running on Kubernetes? Is there anything about this that is particularly Kubernetes native, given that you could be ingesting metrics from Prometheus, from monitoring VMs, or from any other kind of infrastructure?

BRIAN SINGER: I'd say given the prevalence of Kubernetes in the industry, obviously we see pretty much every company using it somewhere. But I think what's really interesting for us and for SLOs is that very few companies are Kubernetes only. There's a huge mix. Mainframe, VMware based infrastructure, stuff running as monoliths on VMs, and everything in between. And what most companies tell us is, for us to do SLOs, it can't just be something that's only in our Kubernetes environment. It has to be something that we can apply broadly across our infrastructure.

And if you think about how a lot of banks are built today, they have sort of client facing applications that maybe they built those on Cloud native platforms. But a lot of the data and a lot of the services that they rely on might still be running on mainframes. So how useful is an SLO going to be that's just on the Kubernetes infrastructure when all of this stuff in the critical path is running on legacy systems? That, I think, is part of the opportunity for us but also for the industry in terms of how they're going to think about SLOs moving forward.

CRAIG BOX: You're building out a product here. You mentioned that it's a SaaS product. Do you see that this is something that will standardize? And are you building any open source technology around this that you want to see people use aside from just your own platform?

KIT MERKER: We're definitely working with and collaborating with the open source industry. One of the things that we've seen is people want to know service level objectives for their infrastructure, whether that's Kubernetes itself or the workloads on it. It could be a Kafka cluster running in Kubernetes, monitored by Prometheus. And what we're planning to do is to build a samples library, which eventually we'll probably open source. But it will at least be public, and it would give you core samples of SLOs and error budgets and alerting that you could use and copypasta-- if you know what I mean-- from GitHub repos and innovate there.

We've also started collaborating with the Keptn project from Dynatrace, which also has SLOs as part of its CI/CD. And we're looking at how we can essentially align on the format for the SLO YAML. There are some discrepancies between the two formats today. So we're figuring out what makes the most sense there. And we're not trying to make any sort of industry standard. But definitely being able to use compatible YAML formats is a good thing.

That's where we've kind of started. But at this point now, we're trying to build really the core product use cases and everything. And we'll figure out what's open source later. Those are important decisions, and I think it's easy to make the mistake of open sourcing things too early or for the wrong reasons-- to do open source for marketing, for example. And we're just not about that. We're trying to make sure that we have a working product. And then, when it makes sense to move things into open source, when we see a demand and a need, we'll do it confidently. We'll do it correctly for the community and not as a stunt.

CRAIG BOX: Kit, you are one of the organizers of SLO or SLOconf. Tell me about that event.

KIT MERKER: The event kind of came out of nowhere, to be honest. And we're just channeling the energy around SLOs into something. It actually started as a joke on Twitter where I said SLOconf US is going to be in San Luis Obispo, California. And then I was trying to find a location for it in Europe. And we were just kind of joking around. And so, people started asking, well, wait a second. Is SLOconf really happening? And I said, well, maybe. If you can prove that there's enough people that are interested. And the Twitter thread kind of went wild.

And that quickly turned into a public Google Doc to plan it, which then quickly turned into a planning meeting with 30 plus people who showed up to help plan the SLOconf. And we already at that point had probably, I don't know, 30 or 40 people that had contributed to the document and had added talk ideas and everything else. So we quickly went and bought the domain. SLOcon was taken. So I added the F and turned it into SLOconf. That's how it was born.

And we scheduled it for May 17 through 20. The idea was to make it "attend while you work." So everything will be pre-recorded for the most part and asynchronous. We're doing it for both US- and EU-friendly time zones. So most of it will be morning time US, afternoon EU for anything that is scheduled. And there will be relatively little that you have to do at a certain time.

We already closed the call for proposals. We've gotten 40 plus submissions. I think actually closer to 50 submissions that were accepted. We also are going to continue to leave that open for women and for underrepresented minorities. So if anyone wants to submit a talk now that fits into that category, we'd be happy to review it and hopefully accept it into the talks. So it's not too late if you're in that category.

The other big piece of this is going to be hands-on labs. This is something people asked for. They wanted to have hands-on products. So we've got a few companies we're working with that are going to develop some actual hands-on lab experiences, classrooms, et cetera. And that'll be part of the event, as well. And there may be a few surprises cooking up. So we'll be announcing the speaker line up very soon. We'll be announcing more of the sponsors very soon. It's all kind of happening very rapidly, but shaping up to be great.

And I will say also I've been running the Beyond Seattle SRE Meetup, which started in Seattle and went beyond Seattle when the pandemic hit us. And now we have over 600 members. And we meet at least monthly. And we also have a Slack space. Co-organizers include two SREs from Google, an SRE from UiPath, an SRE from Getty Images, and a few other folks that are kind of independent engineering enthusiasts and consultants. And they're really running the community now. You know, I bootstrapped it. But they're really kind of the ones in charge.

And we've built quite a community. There's a lot of job posts, a lot of discussion about outage postmortems. A lot of different ideas being shared in Slack. And then our monthly coffee meetup where there's no agenda. Our monthly full agenda meetup, where we have one or two speakers and a networking session, et cetera. So we're going to build on that platform for the SLOconf. And we'll be using the same Slack space. We already have an SRE community. And I think it's really just going to all kind of fit together nicely. And it's going to be super cool.

And it's free, by the way. I'll also mention we're not charging anything. And we're going to have sponsors help cover the costs of the infrastructure, as well as the swag. We've got a few things lined up there, as well. I don't want to give away too many secrets. But if you go to SLOconf.com or follow @SLOconf, you can get all the information and keep up to date on what's happening with the event.

And I encourage everybody who is interested in how to make their reliability better and how SLOs work: come do it. It's a conference just for that. It's not observability or incident response or SRE practices. It's really just about SLOs, error budgets, SLIs, and SLAs-- you know, all the core ideas of measuring service level reliability.

CRAIG BOX: SLO is the airport code for San Luis Obispo in California. Could you see yourself hosting an event there in person when that's a thing we can do again?

KIT MERKER: Absolutely see us doing that because that's how crazy we are. And then the European equivalent will be in Oslo.

CRAIG BOX: Brilliant. Is that because it's O-SLO?

KIT MERKER: O-SLO.

CRAIG BOX: O-S-L-O.

BRIAN SINGER: Yeah. I'm really looking forward to SLOconf. Some of the tracks and sessions that are planned are going to be really exciting and interesting.

CRAIG BOX: Finally, Brian, who's your favorite member of Guns N' Roses? And why?

BRIAN SINGER: I would have to go with Axl Rose.

CRAIG BOX: Bold choice.

BRIAN SINGER: Just the voice and the vocals. You know, interestingly enough, I had tickets for the Guns N' Roses concert at Fenway Park last August. Obviously, that was canceled. I'm hopeful that they're going to reschedule for this August. I still have the tickets, so fingers crossed.

CRAIG BOX: And Kit?

KIT MERKER: I have to admit that I really am not a fan. But I did see Slash with you, Craig, at Wembley, if you remember, back in 2014. Just crazy to think it was so long ago. But I remember it being a very fun night of a lot of loud music and maybe a few beers.

CRAIG BOX: That was a good night out with Slash. I've seen them a couple of times since. And then the revived band, it's a thing. It's not quite the same as it used to be. There's no danger anymore in rock music. But what are you going to do? Thank you very much, both, for joining us.

BRIAN SINGER: Thanks for having us on today, Craig. It was a lot of fun.

KIT MERKER: Thanks, Craig.

CRAIG BOX: You can find Brian on Twitter at @brian_singer, with an "i". Very important. And you can find Kit on Twitter at @KitMerker. You can find Nobl9 at Nobl9.com.

[MUSIC PLAYING]

CRAIG BOX: Thank you, Richard, for helping out with the show today.

RICHARD BELLEVILLE: Thank you so much for having me. It's been a blast.

CRAIG BOX: If you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at KubernetesPod or reach us by email at kubernetespodcast@google.com.

RICHARD BELLEVILLE: You can also check out the website at kubernetespodcast.com, where you'll find transcripts and show notes, as well as links to subscribe.

CRAIG BOX: I'll be back with another guest host next week. So until then, thanks for listening.

[MUSIC PLAYING]