#120 September 8, 2020

Airbnb, with Melanie Cebula

Hosts: Craig Box, Adam Glick

Melanie Cebula is a staff engineer at Airbnb, where she has built a scalable modern architecture on top of cloud native technologies. She regularly shares her knowledge in presentations focusing on cloud efficiency and usability, and today shares the story of Airbnb’s Kubernetes migration with hosts Adam and Craig.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: In my head, I still like to think that 20 years ago was the '80s, but time has moved on. And we do live in the future. I hear you've been reliving the 2000s.

ADAM GLICK: I was reminded of an old show that was going on during the writers' strike way back in the 2000s-- Dr. Horrible's Sing-Along Blog. I may have talked about it before, but I went and rewatched it.

The first episode especially is just-- it is such well-written comedy in a 15-minute little package. And just, it brought joy to my face just to watch it again and to listen to some of the songs and the humor that's in there.

CRAIG BOX: Did you rewatch it on the original physical plastic disc, or did you find a modern streaming equivalent?

ADAM GLICK: No, this would be through a modern streaming service, as it turns out. I've been looking through the plastic disc library and looking to get rid of it as we were discussing, do we actually have anything that would play any of the stack of discs that we have in the house?

CRAIG BOX: Well, it'd have to go to a very old laptop if I wanted to play anything on a plastic disc.

ADAM GLICK: What have you been up to this past week?

CRAIG BOX: I had a little vacation last week and went down to the beautiful Blackdown Hills in Devon. Came across a river down there named the River Otter. And you might think, well, that's fantastic. It's named after the lovely cute little critters. It doesn't seem like it is. It seems it was full of beavers at one point, but it is not called the River Beaver.

What I will say about it is that they don't seem to be very imaginative with names. It starts at a ford of the river with Otterford and then passes through the Otter Valley. We went to the Otterton Mill, which is this little town on the River Otter, which is just a little bit up from Ottery St. Mary, which is named for the St. Mary's Church, of course. And then the river passes out to the Otter Estuary, where it finally meets the sea.

ADAM GLICK: It feels like a little bit of misleading branding there, doesn't it? You kind of expect otters, right? That there's going to be some cute little furry guys sitting there, holding hands, swimming along.

CRAIG BOX: They are delightful.

ADAM GLICK: They're the most adorable water creature ever.

CRAIG BOX: Yes, if you go to a zoo or a water park of some sort, I've seen them put food inside a little bowl for the otters. And they lie on their back, and they pull it apart. It's a fun, little game. And you'd think with all of the otter-themed names, the best advice I could give you is go to Otter Valley ice cream. It's fantastic. If you're ever in the Devon area, I can thoroughly recommend it.

ADAM GLICK: Do they provide it to you in a little ball and you get to tear it apart as you eat?

CRAIG BOX: No, no.

ADAM GLICK: The full otter experience.

CRAIG BOX: It's not actually on the river as well. So basically, you sit in a paddock that smells like cows, trying to avoid the wasps.

ADAM GLICK: I'm glad you were able to do that. Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

With Kubernetes 1.19 released, the requisite five deep dive blog posts on new features and improvements have been published. You can read about structured logs, EndpointSlices, storage capacity tracking for ephemeral volumes, API server warning messages, and the new one-year support window. The posts come from authors at Google, Intel, and VMware.

ADAM GLICK: TiKV has graduated within the CNCF. The key value store, created by PingCAP and open sourced in 2016, joined the CNCF in August 2018 and moved into incubation in May 2019. Graduation included defining the project's governance, passing the CII best practices, and adopting the CNCF's code of conduct. With this news, the CNCF now has 12 graduated projects.

CRAIG BOX: cert-manager from Jetstack has reached version 1.0. cert-manager is a certificate toolkit for Kubernetes, commonly managing x509 identity certificates for TLS. 1.0 is a symbolic release to show that cert-manager is production ready and has a commitment to support and backward compatibility.

Jetstack also announced an enterprise version of cert-manager that comes with support, signed builds for security, a configuration checking tool, and design blueprints. Learn all about cert-manager in Episode 75 with James Munnelly.
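For reference, here is a minimal sketch of the kind of Certificate resource cert-manager reconciles; the DNS name, secret name, and issuer below are placeholders rather than a real configuration.

```yaml
# Hypothetical example: ask cert-manager (v1 API) for a TLS certificate.
# The DNS name, secret name, and issuer are placeholders, not a real setup.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls
  namespace: default
spec:
  secretName: example-com-tls   # Secret where the signed cert and key will be stored
  dnsNames:
    - example.com
  issuerRef:
    name: letsencrypt-prod      # a ClusterIssuer you would have created separately
    kind: ClusterIssuer
```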

ADAM GLICK: VMware held the SpringOne Conference last week. And hidden behind all the Java were a couple of Cloud Native announcements. The Tanzu Build Service is now generally available. This service brings the buildpack experience from Cloud Foundry, adds a declarative image configuration model, and integrates with common CI/CD tools.

The State of Spring 2020 report was also released. It says that 65% of Spring users are running in containers and another 30% are planning to do so. Of that 95%, 44% say they are putting the apps in Kubernetes, while another 37% plan to do so in the next 12 months. You can exchange your email address for a copy of the report.

CRAIG BOX: AWS has announced the GA of Bottlerocket, their cut-down, container-focused Linux distribution. Bottlerocket was announced in March, just before the end of life of CoreOS Container Linux. It's GA on EKS and in preview for ECS. But if you want to run it locally, you'll have to build it from source.

ADAM GLICK: Looking for a UI to jazz up your Kubernetes usage? This week, Kalm, with a K, launched as an open source UI for making Kubernetes easier to use. It provides a UI for Let's Encrypt certificate renewal, application deployment, health checks, scaling, volume mounting, and more.

Kalm is designed to work with your cluster either on-prem or in Google, Microsoft, or Amazon. The team announced their launch on Reddit this past week. And David, Scott, and Tian are very interested in your feedback.

CRAIG BOX: The Kubernetes community loves nothing more than a high level of abstraction. And so a new project from Salesforce this week might be of interest. CRAFT, the Custom Resource Abstraction Fabrication Tool, claims it removes the language barrier to creating Kubernetes operators.

You declare a custom resource in JSON format, and CRAFT will generate all that pesky code for you. The provided example for WordPress saves you writing 571 lines of code, which they claim could take months. CRAFT is built on a project called Operatify, which, in turn, is built on Kubebuilder.

ADAM GLICK: HPE is pre-announcing the release of CSI extensions using sidecars, which they say will help with the ability to change persistent volumes while a container is running. Several extensions are planned, including a resizer, provisioner, attacher, and snapshotter. Performance management, data reduction, and data management are called out as target use cases, though no release date for these new features was provided.

CRAIG BOX: Virtual KubeCon EU 2020 session videos are now live on YouTube. If you want to rewatch a session, catch a session you missed, or you didn't have an event pass and have been waiting for all the videos to be posted, your wait is over. To make it easier for you all, we've put a link to a playlist of all the recorded sessions in this week's show notes.

ADAM GLICK: The CNCF will be providing another round of CommunityBridge mentorship opportunities after the successful graduation of the previous 21 mentees. These mentorships are paid internships. And the next round starts in October.

If you are a maintainer and want to submit your project for participation, project suggestions are due by September 9, and project selection will be finalized on September 21. If you are interested in applying as a mentee, applications will be open on the CommunityBridge site linked in the article after September 21.

CRAIG BOX: Back in February, we reported on a bug in the Linux CPU Scheduler which could cause problems with container limits in Kubernetes. A blog post this week from Eric Khun at Buffer proposed a nuclear solution-- just turn them off.

Hacker News was not impressed with the suggestion, pointing out that the bug had been fixed in Linux since 4.19. But the back and forth suggests that there might be something else going on. Either way, watch your metrics.
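For context on what "turn them off" means in practice (this is an illustrative sketch, not Buffer's actual configuration): you keep the CPU request so the scheduler still has something to go on and simply omit the CPU limit, or you disable CFS quota enforcement on the kubelet with its --cpu-cfs-quota flag.

```yaml
# Illustrative only: keep the CPU request (used for scheduling) but omit the
# CPU limit, so the container is never throttled by the CFS quota.
apiVersion: v1
kind: Pod
metadata:
  name: no-cpu-limit-demo
spec:
  containers:
    - name: app
      image: nginx:1.19        # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          memory: "256Mi"      # memory limit kept; CPU limit deliberately omitted
```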

ADAM GLICK: Finally, Tasdik Rahman from the engineering team at GoJek has written up their experience with upgrading Kubernetes versions and how to keep their apps running during upgrades. GoJek uses GKE, so the infrastructure upgrades are handled for them.

But their post is relevant to users on any Kubernetes service as they talk about the disruption to workloads as clusters are upgraded and how to ensure your apps are configured to handle changes in the cluster underneath them.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Melanie Cebula is a staff engineer at Airbnb, where she has built a scalable modern architecture on top of Cloud Native technologies. She regularly shares her knowledge in presentations focused on cloud efficiency and usability, including on the KubeCon Keynote Stage in 2018. Welcome to the show, Melanie.

MELANIE CEBULA: Thanks for having me.

CRAIG BOX: It might surprise you to know that you're our second guest who was a classically trained musician before becoming a Kubernetes expert. How does an upright bass player get into Cloud Native?

MELANIE CEBULA: I think music attracts people who like to achieve mastery of complex pursuits. And Kubernetes and just distributed systems in general are pretty challenging topics. And so I actually do think there's surprisingly some overlap there.

CRAIG BOX: Did you end up transitioning to doing tech at school?

MELANIE CEBULA: Yeah, so when I entered college, I actually switched from being a music major to a computer science major.

ADAM GLICK: How did you find that transition? That's quite a shift, as someone who also did a whole bunch of music work previously and then in kind of post-college time actually shifted over to the IT and computer science world.

MELANIE CEBULA: It was pretty challenging because my math and science background was really quite weak because I was so focused on music and performance. There was a lot of catching up to do. But I also found it really reinvigorating kind of finding this whole new field. And going from novice to expert in that field has been really satisfying.

CRAIG BOX: You're probably really good at counting to four.

MELANIE CEBULA: Yes. Yeah, and variations, like six-eight and three-four, all the kinds.

CRAIG BOX: And so how from there to Airbnb?

MELANIE CEBULA: From starting in tech to Airbnb, I studied computer science at UC Berkeley, had several internships under my belt, and then I actually interned at Airbnb in the summer of 2015. And my experience was just really great there. I love the product.

And it felt like there was so much room to really grow the infrastructure with the scale that the company was facing. And so I was really excited to come back and really help build out the infrastructure there.

And so I joined full-time in 2016. And I've kind of worked all over the infrastructure stack since then, and for the past few years really building out Kubernetes and how we use and think about cloud, which has been just really satisfying to do.

ADAM GLICK: Airbnb was founded in 2008. And you kind of joined in in 2015. What did the existing infrastructure look like at that point?

MELANIE CEBULA: So it was a really exciting time, because we had recently started introducing configuration management. So before that, there was no configuration management that I was aware of. We're talking handcrafted, artisanal, single-batch boxes, instances that were running in the cloud.

So we always, as far as I'm aware, were running in the cloud. But really, when we thought about configuring things in a sophisticated manner, that was still being developed. And so Chef was really the first time, I think, engineers at the company thought about that and started building that out.

And that was really helpful. So it was sort of early days. And I know that there were engineers who were looking at Kubernetes and curious about Kubernetes at that point. But we were really kind of like, let's get our ducks in a row and sort of start with something basic and go from there.

CRAIG BOX: As a high growth startup, you outgrew some of the facilities that were available from your Cloud provider at the time. There are a number of blog posts put out on the Airbnb Nerds Blog, perhaps aptly named, that talk about some of those technologies, one of which was a load balancing technology called Charon, with a C. What can you tell us about the growth of the Cloud technology and then the move to some more custom development?

MELANIE CEBULA: So it's actually pronounced Charon.

CRAIG BOX: Of course.

MELANIE CEBULA: We had this trend of naming sort of bespoke internal technologies after-- I think they are Greek gods. It's something to do with mythology, so they're all incredibly hard to pronounce. So there's also Hades and a few others.

And I think what you find when you operate on such a big growth curve is that there is out of the box vendor technology and open source technologies that originally work for you, just sort of as is. And that's really convenient and awesome.

And then through time, you start to notice weird things that just don't quite work as well because the amount of traffic that you're receiving kind of outpaces the expectations of that. So I believe what worked at the time for the engineer on it was a combination of an Nginx load balancer, which was Charon, and also using the vendor load balancer.

It was the combination of both of them that had the best reliability output metrics, which was surprising. But they just sort of went with it. And it actually worked quite well after that.

ADAM GLICK: In that world you were walking into, there was a lot of mutable infrastructure. It's probably a lot of VMs, possibly taking a look at what containers could do for you and moving towards the immutable side.

You mentioned that you were looking at configuration software. My guess is previously, did you have something like a directory full of bash scripts maybe that people would pull from? How did that transition happen? And what did you learn through that process?

MELANIE CEBULA: It's interesting because it kind of happened at the same time as the initial move to services. So we had been working on this monolith since the very beginning. And that's the code base that held all of the code for the main website. And as that built out, it became more and more unruly and very, very difficult to contribute to as a typical engineer.

So what we did was we built out some of the initial services that the monolith would call to delegate some of the logic. But just that simple step of, let's have some code in these other code bases, and let's have the monolith call these other services, exposed so many things that were missing in our infrastructure.

So really basic service discovery didn't really exist at the time. So that was something that had to be built out. A message bus for services to sort of pass data and mutate data was something that needed to be built out.

And the other thing that was built out, of course, was the configuration management: when you create and configure a new service or you modify the configuration of a service, how do you roll that out, especially changes that would affect all services, so base recipe changes and things like that?

So that was a really big step improvement. What we found, though, and I think it took maybe a few years for this to become more evident, was that with this VM-based infrastructure, we didn't have a lot of guardrails for changes being checked in.

And so there were a few interesting times where, in the course of an outage, an engineer would go and I think do the right thing and stop the bleeding and manually fix some instances as fast as they could. But of course, once the hoorah of the incident is over, the engineer may have forgotten to perhaps check in those changes to the configuration management system.

And so there were a few times where an auto-converge or someone making another change would accidentally revert things or undo things that were, oh, wait, no, we actually needed those. And so coordination wasn't enforced. There weren't guardrails for that.

CRAIG BOX: Do you have any particular favorite war stories from that period?

MELANIE CEBULA: Yes, there was more than one time where we basically had built deploy infrastructure separate from the way that we apply Chef changes, which was this sort of converge mechanism. And so there was more than one case where someone deployed at the same time as a converge.

And in those days, we didn't have anything preventing that from happening. So the deploy would be taking down instances and applying something. And then the converge would also be taking down instances and applying a change. And before you knew it, you didn't have enough instances left to handle traffic. So that was kind of a funny one.

There was another incident at around that time. I think it was my third week on the job, where I come into the office, and it turns out that our deploy tooling is down. And our deploy tooling is down because we broke deploys with a change to the deploy tooling. And we use that same tooling to deploy changes to the deploy tooling itself.

And I'm sure someone had the insight that maybe this might be a problem someday. I guess, we'll think about that then. And of course, that became my problem.

CRAIG BOX: Put it on the backlog.

MELANIE CEBULA: It's like one of those things you throw onto the backlog. We should really fix that, but it's not a pressing problem, right? And someone had prepared a bit of a script to help in that situation, a manual sort of rollback that didn't use the same tooling. But we found out that day that that script was broken and did, in fact, not work.

And so kind of on the fly, me and another engineer were just hacking away at the script and trying different things. And within a reasonable amount of time, we had a replacement script, and we applied it and everything ended up being fine. We were able to roll back.

But these sorts of lessons really informed us for years to come. Recently, we've been working with using Spinnaker for Kubernetes-based deployments. And when that team formed that started working on it, one of the things I brought up in the review was, what are you using to deploy changes to Spinnaker, and have we solved for this?

And yeah, there was a lot of intention around making sure there weren't bootstrapping problems like that. Because when you get into the world of configuration, you need to have some way of applying the configuration changes themselves. And so that kind of can be really interesting.

CRAIG BOX: You toggle the little switches on the front of the machine to enter the first configuration. Then it can configure itself after that.

MELANIE CEBULA: Yeah.

CRAIG BOX: I don't think they still make servers like that.

MELANIE CEBULA: Yeah, it's just interesting, because I feel like a lot of lessons were learned almost independently by companies at the same time. A lot of companies were making these same, or very similar, mistakes.

And it's interesting just watching over the last half decade a lot of these companies kind of grow up together and the technology grow up. And a lot of lessons have been learned. Because I just feel at the time that some of these configuration as code technologies were just fairly new, at least for wide adoption and use in Silicon Valley and at this kind of web scale.

CRAIG BOX: Now you've mentioned the monolithic application that was written in Rails, was that right?

MELANIE CEBULA: Mm-hmm.

CRAIG BOX: So what services outside that monolith would your Rails application be talking to at this time?

MELANIE CEBULA: It's proliferated. For backend services, a lot of JVM, Java-based services. We have a lot of standardization around those. And so a lot of the critical production services are in Java.

And then for the frontend, we have a pretty sophisticated Node monorepo as well. So it's a JavaScript-based set of services. And then in sort of the data side, a lot of the machine learning and batch jobs are written in Python.

CRAIG BOX: A lot of people who worked on that data side went on and founded Mesosphere. But on the production side, where you work, you have a technology, which, again, there's another blog post about, called Smartstack, which came out around the same time, around 2013.

Smartstack was an automated service discovery and registration framework for doing service-oriented architectures, which was run in production at Airbnb. What was Smartstack like to work with?

MELANIE CEBULA: Smartstack was really ahead of its time, in my opinion. There were, as far as I'm aware, no open source technologies that worked for us that could do this service-to-service communication.

And it took advantage of HAProxy as a service proxy, which was a very important building block. And that actually has been around for some time. And it was able to handle the scale of all these initial services talking to each other with no problem. And we were really able to use it for service-to-service discovery.

Where it started to have kind of cracks in its scalability was actually as we had more and more services. And the service call fanout-- when one service calls another service, which calls 10 other services, et cetera-- it didn't handle that as well, because when one of those services is replaced, all the hardware is replaced. Yeah, so sort of getting into the weeds.

The technology that it was using, especially older versions of HAProxy, what they do is they restart to handle changes in the IPs and hardware changing and stuff like that.

CRAIG BOX: Right.

MELANIE CEBULA: And what we were seeing were tons of HAProxy restarting and forking. And so the memory footprint of HAProxy would grow and grow. And we started seeing OOMs. And that's a pretty catastrophic failure for your service discovery.

And there were definitely ways to patch and improve this. But kind of the way that it was designed, it didn't really have in mind a very large service-oriented architecture with lots of hardware changing.

And the interesting part is Kubernetes kind of exacerbates that problem because we have different IPs per pod. And so it's not just when you're replacing a service's hardware, you're provisioning new backends. Now, it's when the service is deploying, you're actually rotating through all of those pods for our infrastructure.

And we started seeing this problem get worse. And we were like, OK, we really need to rethink this service discovery infrastructure. But it actually held fine for probably five years, which I think is really impressive for a piece of technology.

CRAIG BOX: You have a service proxy in the form of HAProxy. You have a distributed data store, which is Zookeeper in this case, and then a backend that ties all these things together. Would it be fair to call this a service mesh?

MELANIE CEBULA: [LAUGHS] I think it probably was, in some ways, an OG service mesh, which is why I think it was ahead of its time. Service meshes today try to provide so much more in the way of secure by default and some more sophisticated traffic mechanisms.

I've seen a lot of sophisticated traffic routing and rollout and kind of Canary-like features with service meshes, which this technology just didn't have. But when you think about the bare bones of what a service mesh offers, I mean, I think this kind of did offer that. So it's fun to think about that.

ADAM GLICK: You mentioned that you ran into certain scaling issues and the out-of-memory errors and the problems that come as you scale. Did you move to something more like a microservices architecture in order to try and make that work as Airbnb grew and grew? And if so, what were the differences there, and what did you have to do to make that transition?

MELANIE CEBULA: Yeah, one thing that was really interesting is, I think the engineers at the time were pretty concerned about a switch to microservices architecture. And so the philosophy at the time was services-oriented architecture, not microservices.

And that was kind of the saying internally: we don't want it to be the case that every engineer has their own service, when we have thousands and thousands of services. We want it to be the case that for every logical different piece of the product, there is a service for that. And we can reuse as much as possible.

And so that was sort of the philosophy that we held to. One thing that was interesting is that once you sort of start the services-oriented architecture train, it does take a lot of intentional architectural design to keep it under control.

So we did feel, as recently as this last year, that there were services that were unnecessary. There were services that were very similar to other services. And so we actually have a senior architect on the product side, who's really been driving service simplification.

So we have this sort of design process now, where we kind of co-locate services in a service block. And then within that block, it's a new logical monoservice, and then we kind of simplify. And so we have put some mechanisms in place to keep the thing kind of simple, as simple as possible, which I think maybe more companies might be looking into, too.

I will say, as a person whose first year on the job was probably spent on, can we scale contributions to the monolith, I tech-led a MergeQ project, which was basically-- it was kind of this neat idea of, like, everyone, get into the queue, and we'll merge your PR when we can. And so we kind of tried to make everything orderly.

I mean, that project was really difficult because at the end of the day, we already had hundreds of engineers trying to contribute as many PRs as they could per day. And so it was really hard to scale those contributions.

So I do think it made sense to move it to the services, and I think that was the right move. But there is a lot of complexity that comes with that. And we've always tried to-- especially I've always tried to tame that as much as we can.

CRAIG BOX: How many services does a user hit between loading airbnb.com and booking a property?

MELANIE CEBULA: I can't give an exact number, but I would say it's somewhere between 50 and 100. Because when you think about it, we have a lot of shared services that provide-- they're kind of like middleware, like rendering and authentication and a lot of these shared functionalities that all services would need.

And then there is sort of-- we think about it. You hit the front page. You use search, listings are rendered, reviews are rendered, the payment page, handling all that. So you actually end up having a few more services going through the flow than someone might originally think. There's a lot of functionality that's serviced there.

ADAM GLICK: Let's talk a little bit about the Kubernetes migration. What was the decision process and timeline in terms of moving into Kubernetes and adopting that at scale?

MELANIE CEBULA: I believe it was around 2016 or early 2017 when there was an initial push internally for Kubernetes. I was still working on deploy tooling at the time. I think the goal of that team really was, is this feasible? Can we prove that it is possible to run any of our workloads on Kubernetes? And can we build out sort of a prototypical infrastructure to demonstrate that?

And so it was really more of an R&D project. And then around 2018, I think, was the sort of leadership decision that there was enough there that it could be worked with. And I think we had a lot of hope seeing other companies like Lyft build out Kubernetes and associated technologies like Envoy.

And so I think there was a bit more of a, yeah, I think this could work at our scale. And that was when I joined the team. It was sort of the road to the production-ready cluster and the road to production-ready services.

And so we did a lot of work with pairing with service owners to migrate some of these workloads, especially test and staging environments that we could point a lot of traffic at. And we could show that we could get this configuration to work.

And we sort of had several goals. One was sort of feasibility and scalability. But one thing we really cared about was sort of developer productivity. Like, could we make this a better developer experience? And so there was this whole other aspect to the project that was partially related to Kubernetes, but also a bit separate.

One problem we had as we built our infrastructure is, you really had a different place to go to for every piece of configuration. So we actually drew out this diagram of, what does it take to create a service here? What does it take to make a change to the service here?

And there were something like 50 different steps in 50 different places to create a service. It was truly horrifying to see it all graphed out. So you would go to this Slack channel to get permissions. You would go to the AWS console to provision this.

You would talk to a Slack bot in some cases or a goalie. Or you create a ticket. You would create a change in the alerts repo to get alerts. You would create a change in the Chef repo to get your configuration.

CRAIG BOX: You've got to order new business cards.

MELANIE CEBULA: Yeah, it was just this very manual process. And it was confusing. Where do you make the change to get CI/CD, continuous integration and deployment? Where do you make the change to get dashboards?

ADAM GLICK: You had a lot of different sources of truth.

MELANIE CEBULA: Yes, and so we wanted a one-stop-shop. And Kubernetes was actually rolled into this. So we were like, one place to look, one place to configure, and one place to deploy. And that was the project, was like a one-stop-shop.

And so we were actually, I think, one of the first companies to really think about and use the operator pattern essentially, where we actually had custom resources defined. In this case, I think alerts was one of the big first ones. We created a custom resource definition in Kubernetes. And then we had custom controllers that deployed those changes.

So when you actually deployed your service, you were really deploying all these different things. And when it comes to all this configuration, for every project, they have what we call an _infra directory. And in _infra is where all these different YAML files are.

And I know not everyone loves YAML. But we really were just sort of moving all of these into one place so that people wouldn't have to go to all these different places. And in that sense, in 2018, that was the goal, was, can we move all of this stuff in one place and show this really clear benefit to service owners from migrating?
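As a purely hypothetical illustration of that operator pattern (the group, kind, and fields below are invented, not Airbnb's actual schema), an alert checked in alongside a service's other _infra YAML might look something like this:

```yaml
# Hypothetical only: an alert expressed as a Kubernetes custom resource,
# reconciled by a custom controller. Not Airbnb's real API or field names.
apiVersion: alerts.example.com/v1
kind: ServiceAlert
metadata:
  name: listing-service-high-error-rate
spec:
  service: listing-service
  expression: "error_rate > 0.05"   # invented query syntax
  for: 5m
  severity: page
  notify:
    - team-listings-oncall
```

The point of the pattern is that deploying the service also deploys this declaration, and the controller takes care of wiring up the actual alerting system.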

And there are a few other things they got from migrating. We talked about the immutability, the convergence and deploys no longer competing with each other. And we also got auto-scaling, so it's like, you can handle the summer traffic peak if you migrate to this technology.

And so we really bundled all of this goodness. And that is why in 2019, we did the push to migration. So in 2018, new services were created in Kubernetes and with all this extra goodness. And then in 2019, we were like, OK, it's the time to do a migration push.

And so that kind of brings us to today. We reached, I think, I want to say, like, north of 95% of our goal of migrated services, so pretty much a really great success. All the critical services have auto-scaling now.

And we're kind of now in this cutting edge territory of, what does it mean to run efficiently at scale? What can we do to run with better cloud performance? And I've kind of sort of shifted focus to that, which is, how do we do some of this sophisticated benchmarking and efficiency testing with Kubernetes and with this Cloud Native technology? So that's where we are today.

CRAIG BOX: Have you been able to migrate away from Smartstack and the pre-Kubernetes service discovery ecosystem?

MELANIE CEBULA: So that's a work-in-progress. One thing that was really interesting about our migration is, we migrated service by service. When we talk about migrating service discovery, service proxies, or service meshes, most of those migrations are edge by edge. So between every two services, you have an edge. And there are way more edges than there are services.

So for that particular team-- and I worked with that team, too, but there was a different team that really drove service discovery migrations. They've migrated most of the non-TCP edges, I want to say. And then there were some different services that had particular characteristics that were harder to migrate.

But we are mostly off of Legacy Smartstack. We're mostly on Envoy now. And that team is also working on kind of early exploration and adoption of a service mesh. So they've kind of started tinkering with Istio. Can we make Istio work at scale?

And looking at that project, it really reminds me of 2016, 2017, and early 2018 with Kubernetes, where it was like, hey, there's this piece of technology that might really offer a lot of benefit to us. But it takes a lot of investment to demonstrate whether it would work for us in our use case. So it's been exciting to watch the traffic journey as well because I do think they're really interdependent on each other.

ADAM GLICK: You mentioned that you're running a lot of this stuff in the cloud. Do you rely on managed services from vendors to do it, or are you kind of rolling your own, on top of VMs and primitives?

MELANIE CEBULA: We're rolling our own on top of VMs and primitives. We did look at vendor services. But I think we were just so early, in a way. I think this is the challenge of being an early adopter on that broad adoption curve: you end up kind of running a little bit ahead of that curve.

And hosted technologies, we did evaluate and try a few of them. And they just didn't quite work for us. And so, well, we did end up running our own clusters configured on VMs. And then we've been working through that. And it did take a lot of investment.

I think that's the biggest thing, is, if you are rolling your own, you will need a dedicated cluster team to work on it. That's just kind of the name of the game. And so, for us, there was a lot of investment in multi-cluster. One thing we found in 2018, early 2019, was that we were running out of headroom. We had a single cluster.

And at the time, at least without serious, serious etcd tweaking and investment, a 2,000-node cluster was about as good as you were going to get. So I believe we were rapidly heading towards 2,000 nodes, and we were like, oh, shoot, we should probably get this multi-cluster thing figured out.

And then we really went all in on the multi-cluster strategy. And so today, we have tens of clusters. I think we probably have 10 or 20 production clusters and then lots of little test clusters for the compute team to kind of test out new changes.

CRAIG BOX: As situations change around the world, I can imagine that you have a business that has a very elastic requirement for compute. How do you change your compute demands in response to change to user demand?

MELANIE CEBULA: This was really interesting. Since I've been working on cloud efficiency, one obvious business outcome of cloud efficiency is cost savings. And cost savings going into 2020 is important. And it became more important when traffic started dropping dramatically.

A lot of companies during this crisis, I'm sure their engineers can attest that the traffic patterns were not what they expected. There were some services where traffic would crater downwards, some services where traffic would go exponentially higher than they could have predicted.

And in a lot of cases, it's also not stable. Like, it goes up a bunch or down a bunch. And so what you need is not easily forecasted. So a lot of people do capacity planning. And yeah, capacity planning was kind of a wash this year. And the reason for that is because of these traffic patterns.

And so, one way I think Kubernetes benefited us was that we did have Horizontal Pod Autoscaling in place, or HPA. And so, that was one of the big benefits we pushed to service owners: hey, if you migrate to this, your service will scale up and scale down to handle traffic. And so a lot of our services were on HPA going into 2020.

And the other thing is, we had an engineer on the compute team roll out cluster autoscaling. Because we do manage our own clusters. Cluster autoscaling is our job. And they did a fantastic job on that. And our clusters now can scale up and down with traffic.

We obviously still have to work with our cloud vendor to reserve compute. Or if you're on prem, you still need to reserve compute. And so that's not something you're going to get away from. But we were able to handle traffic.

There was one interesting thing, though, that came out of this. And that was that we noticed that with HPA, you can set minReplicas and maxReplicas. And we had a lot of service owners who, I think, noticed some reliability issues. Or they just kind of wanted to have a minimum provisioned amount.

So there were some service owners that set minReplicas quite high, which was fine when traffic was really high. But when traffic dipped down low, those services didn't scale down, obviously, below minReplicas. And so what we saw were a lot of services that could have scaled down further, but weren't.

And so I actually kind of ran a little bit of a campaign internally on service owners, like let's tune our services. Like, let's get minReplicas to an appropriate number. And so we sort of adjusted those numbers for probably our top 20 or 30 most trafficked services-- or the services that received the most traffic.

And we also had kind of a basic rightsizing, impromptu capacity planning moment, where a lot of service owners were sort of manually adjusting. And so we didn't quite get a completely free experience. But I do think we were most of the way there. And just with a little bit of tuning and tweaking, we got a great autoscaling experience.
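For reference, the minReplicas and maxReplicas knobs she describes live on the HorizontalPodAutoscaler object itself; here is a minimal sketch with illustrative numbers (the service name and targets are placeholders).

```yaml
# Illustrative numbers only: the HPA scales the Deployment between
# minReplicas and maxReplicas based on average CPU utilization.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: listing-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: listing-service
  minReplicas: 3       # tuning this down is what lets a service scale in when traffic drops
  maxReplicas: 100
  targetCPUUtilizationPercentage: 70
```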

ADAM GLICK: We've talked a bunch about the infrastructure piece. And if I can shift the conversation a little bit to talk about the developer tooling side of things, what did the developer tools look like when you came in, and how has that changed as you've adopted Kubernetes' move towards microservices and evolved what's being built out here?

MELANIE CEBULA: Originally, we had a UI for deploys. And that was really just deploys, tests, and builds. So anything you can think of as modern day CI/CD was in this UI. And then there was another UI for sort of launching machines and converging those machines and otherwise interacting with them.

And as we move to Kubernetes, we really had to rethink a lot of that. The configuration piece was sort of a new challenge. The Chef recipes in the monorepo were also quite difficult for developers. I never want to overpromise things.

I think developers, when they're frustrated and they want to work on a product, they kind of hate any infrastructure that's in their way. So whether it's Chef recipes or Kubernetes YAML, it's going to be hard. But we did want to improve that experience as well.

And then, obviously, when we thought about those pieces of what does CI/CD look like for Kubernetes, what is the equivalent of launching or converging? So we started with the configuration. And starting there, the original idea, I think with that early research, was raw Kubernetes files.

I think they quickly found that raw Kubernetes files, when you have a lot of environments for your service, those files look really, really, really similar. So most of our services have a development environment, multiple different kinds of tests and staging environments, a Canary environment, and then a production environment.

So we're talking a minimum of five sets of very similar deployments, replica sets, et cetera. So templated YAML, I think, was the kind of most obvious first approach to that. And that is the direction we went in.
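As a generic illustration of that pattern (a Helm-style template is shown here purely as an example; it isn't Airbnb's tooling), one Deployment template gets rendered once per environment from a small values file:

```yaml
# Illustrative Helm-style template: one Deployment definition shared by
# dev/staging/canary/production, with per-environment values filled in.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.serviceName }}-{{ .Values.environment }}
spec:
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      app: {{ .Values.serviceName }}
      env: {{ .Values.environment }}
  template:
    metadata:
      labels:
        app: {{ .Values.serviceName }}
        env: {{ .Values.environment }}
    spec:
      containers:
        - name: {{ .Values.serviceName }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          resources:
            requests:
              cpu: {{ .Values.cpu }}
              memory: {{ .Values.memory }}
```

Each environment then only carries the handful of values that actually differ, which is what makes five near-identical copies manageable.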

Looking back on that, I actually do think that we could have abstracted even more away. And I think that's kind of where we're looking to go next. If we could just define your service, your compute requirements, and what services you're talking to, your clients, et cetera, then we can kind of hide even more.

One unexpected challenge with templated YAML is refactoring it. It's really hard. If there's multiple ways to template the same piece of YAML, how do you have a script go in and change it automatically?

And we really liked doing automated refactors because we do have hundreds of services. So if we want to make a change in all of them, it's really convenient that a human doesn't do that. If we can automate that, that's awesome. So that was another piece of it.

And one other insight is, anything we've ever had to refactor, is that something that we could also hide behind an abstraction layer? Like, if we need to change the same thing for everyone, let's also hide that underneath our abstraction layer.

So that's really been a work in progress, is, how do we expose the configuration and modify it? And so, yeah, we sort of started with templated YAML, but I'm not really sure if that's-- I think it might be going in a different direction now, kind of like new meta abstractions, as I like to call it, which is getting kind of nerdy.

But what do you expose? What are the defaults? How transparent are you? And what can be configured, and what can be overridden? And it's really hard to get that right, but if you have opinionated workloads, it's a little bit easier.

So we started with configuration. And then we kind of moved to the UI. And one thing that happened here was that deploys were very different from tests and builds. And what we wanted to do was unify CI and CD kind of into the same concept.

So we sort of pioneered running deploys as jobs. And that was a new concept for the company. And we kind of ran all of these, actually, in a Dockerized way. So the builds for these services were containerized as well. And so, yeah, we had the typical issues of getting around Docker-in-Docker and things like that.

And then, finally, there was, for the UI and for deployment pipelines, I think we had the deploy team kind of come in and try out Spinnaker. That's been the replacement UI for our internal UI for the tooling. So yeah, lots of tooling. Oh, and then there's also the CLI, the "K" tool, which I've talked about as well.

CRAIG BOX: As you're adopting Envoy, you have the opportunity to move code out of your own programs and the libraries that power them and into the sidecar. What experiences have you had with the sidecar model?

MELANIE CEBULA: Yeah, so the sidecar model is really interesting. One challenge we had early on, moving to services-oriented architecture, was, we wanted to provide a lot of shared functionality in libraries.

So in Java, we could use Dropwizard and other configuration mechanisms like that to provide rate limiting and retry backoff and sort of the things you do to prevent thundering herd and other problems between your services.

And what we found was that that's something we did in a language specific way. So you have to re-implement that same library, OK, in Java and in JavaScript and in Ruby and wherever you're using that logic.

CRAIG BOX: Visual Basic?

MELANIE CEBULA: Yeah, because you're basically building the same logic in different languages. And that's kind of unfortunate. We had this problem where this client would fall behind for some of the languages and not others. And so there was this shift to: if we can provide this in a language agnostic way, that is a massive, massive maintenance win.

And so if we can provide this in a way that is language agnostic, that would be really great. And it's interesting because I think sidecars kind of caught the Kubernetes community by surprise in a few ways. When I think about Kubernetes originally, it was multiple co-located containers.

There wasn't really a native breaking down of the container roles as "this is the main container, and these are sort of supporting sidecar containers." But because of this massive benefit of the language agnostic power of having these sidecars, this kind of was a use case that evolved for major users of Kubernetes.

So now we have a sidecar that provides logging, a sidecar that provides statsd metrics emitting, a sidecar that does distributed tracing, a sidecar that does service discovery and service mesh. And the list kind of goes on.

One thing that's hard, though, is that when you have this many sidecars, your compute footprint increases. So when you think about one unit of your service, the service owner will think of their server or their container. They won't necessarily realize that along with their container, there are maybe, like, 10 other containers.
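As a hedged sketch of that footprint (container names, images, and numbers are invented for illustration), one "unit" of a service ends up being a pod along these lines:

```yaml
# Hypothetical pod layout: the service owner thinks about "app",
# but the pod also carries several infrastructure-owned sidecars.
apiVersion: v1
kind: Pod
metadata:
  name: listing-service-abc123
spec:
  containers:
    - name: app                       # the service owner's container
      image: example/listing-service:1.2.3
      resources:
        requests: {cpu: "1", memory: "1Gi"}
    - name: log-forwarder             # logging sidecar
      image: example/log-forwarder:4.5
      resources:
        requests: {cpu: "100m", memory: "128Mi"}
    - name: statsd-exporter           # metrics sidecar
      image: example/statsd-exporter:2.1
      resources:
        requests: {cpu: "50m", memory: "64Mi"}
    - name: envoy                     # service discovery / mesh sidecar
      image: example/envoy-proxy:1.14
      resources:
        requests: {cpu: "250m", memory: "256Mi"}
```

Every sidecar's resource request adds to what the service owner thinks of as one instance of their service, which is why a fleet-wide mechanism for updating sidecars matters.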

And so we definitely have challenges with, if we need to make a change to the logging container to update it, if the logging team needs to update logging, how do they make that change for all the services using the logging container? And we experimented with generating service code.

And so we had sort of our own versioning on top of our configuration over Kubernetes. So as people pulled in versions, they got the latest logging container. One problem with that is, oh, well, what if the logging container actually has a real problem with it? And it's like, oh, sorry, the fix is in the next version. I guess, you just have to wait. When you have an outage, you can't really just wait for the next version.

And so one thing that became really important, I want to say mid-2019, was a sidecar injection mechanism, so a way for us, for the logging container owner or any infrastructure sidecar container owner, to modify their sidecars and give us a way to sort of roll it out slowly. So we also now have kind of this slow rollout system for container changes. And then we can do it quickly if it's really serious.

So that's kind of a direction. And we've talked to a few other major users of Kubernetes, and they've kind of built similar technology. So I kind of think that's the direction it's going, is sort of sidecar injection and things like that. So it's been kind of fascinating, this whole pattern kind of appearing almost out of nowhere. And now it's kind of following its own path. But it's provided a lot of value.

ADAM GLICK: One of your talks last year had the BuzzFeed-esque name of "10 Weird Ways to Blow Up Your Kubernetes." Which one of those 10 will shock me?

MELANIE CEBULA: One of the more interesting ones was related to the sidecars. Oh, oh my gosh, there's so many. I kind of love these stories because when I talk about some of our outages, it's just, they're very human. A lot of people can relate to them, like the mutability one.

And so when it comes to 10 ways to blow up your Kubernetes, I think it's just things people can relate to. So one was with sidecar ordering. We did have need for a custom patch for getting this logic for sidecar containers to kind of coordinate startup and shutdown with the main container.

And that was surprising to people: that to get this functionality, I had to build out a system that takes Kubernetes, applies some custom patches, builds it, and then kind of deploys that across the infrastructure.

And when you work with open source technology, you do sometimes need to do changes kind of ahead of the curve, and then try to get them in the mainstream. Sometimes, for your own use case, you do need to add some stuff in there. And that was surprising.

As far as blowing up Kubernetes, one of the easiest ways to do it is to create a DaemonSet that requests slightly too much CPU or memory. And then it tries to get scheduled onto every node in your cluster. And if you have a large cluster-- let's just say your cluster's 1,000 nodes-- you suddenly have a lot of scheduling errors, which maybe doesn't make etcd happy. And you start to have problems.

ADAM GLICK: Cascading failures there.

MELANIE CEBULA: Yes, there was some cascading failure problems. But one thing I will say is that Kubernetes failure modes are sometimes quite nice. When I look at Zookeeper outages, it's usually just a catastrophic failure, where things are not working well. And everyone knows pretty much immediately that things aren't going well.

With Kubernetes, when the clusters fail, it's usually just that things aren't going to scale up anymore, or things aren't going to scale down, or some new things aren't getting scheduled. But what's nice is it's like, OK, everything's fine right now. If we were to have a traffic spike, that would be bad.

But things have sort of gracefully degraded until we can kind of fix this whole cluster again. So that was one thing that was nice about the DaemonSet outage, was that we were able to get a hold of it before there was any production impact.
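For reference, the DaemonSet failure mode she describes comes down to a manifest like the sketch below (names and numbers invented): because a DaemonSet schedules one pod per node, a request just above each node's free capacity produces pending pods and scheduling errors across the whole cluster at once.

```yaml
# Illustrative only: a DaemonSet runs one pod per node, so an over-sized
# CPU request here means a failed or pending pod on every node at once.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
        - name: agent
          image: example/node-agent:0.1   # placeholder image
          resources:
            requests:
              cpu: "1"         # just a bit more than most nodes have free
              memory: "256Mi"
```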

CRAIG BOX: Finally, as someone who used to travel the world, as many of us did, giving conference talks, you will have had the chance to stay in a number of Airbnb properties. Do you have any favorites?

MELANIE CEBULA: Ooh, favorites. I gave a conference talk in, I want to say Amsterdam. And I had a beautiful view of the city and the town. And I really enjoyed staying in that city and just being able to really experience the local culture.

And one thing I will say is, as someone who has gone around the world speaking on these technical topics, I've gotten to meet so many enthusiastic contributors and users of Kubernetes across the world. It's just really powerful to be able to talk to people from Europe and from Asia and really all over and see how they approach this technology and how their challenges are different.

CRAIG BOX: Well, we hope that that is a world we'll be back in very soon. And that reminds me to say thank you very much for joining us today, Melanie.

MELANIE CEBULA: Thanks for having me.

CRAIG BOX: You can find Melanie on Twitter at @MelanieCebula or on the web at melaniecebula.com.

[MUSIC PLAYING]

CRAIG BOX: Thanks for listening. As always, if you enjoy our show, please help us spread the word and tell a friend. If you have any feedback for us, please tell us. You can find us on Twitter at @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes, as well as links to subscribe. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]