#50 April 23, 2019

Spotify, with David Xia

Hosts: Craig Box, Adam Glick

Spotify were early adopters of Docker, and wrote their own deployment tool to run it in production. David Xia from the Spotify platform team talks about Spotify's engineering culture, the challenges they faced, how Helios worked, and the migration from it to Kubernetes. Adam and Craig also give a round-up of the week's news, in the form of a question.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: Given the announcement on our show last week, the eagle-eared listener may have been able to infer a star sign for Glick 2.0, but nerds can surely do numerology better than that.

ADAM GLICK: Ah, yes, indeed. As a new dad, I've been going through all sorts of baby numerology. I've determined that the baby, cryptographically, is indeed ours.

The additive checksum has worked out to verify the payload. And the 8-bit checksum actually turns out to be the parent's initials. So we have verified both sender and content authenticity for the security folks out there. It's been great fun playing around with all these new things as a new dad.

CRAIG BOX: Has the baby been entered into the blockchain?

ADAM GLICK: No, but we will find an appropriate database to store them in. Speaking of arcane trivia, I hear that your trivia title may be in jeopardy?

CRAIG BOX: Well, it's been a long time. We've been away, so haven't participated in the pub quizzes, but I have been enjoying the TV show, "Jeopardy" this week. I read all over the internet that there is a trivia star, James Holzhauer, who is blitzing away with all the "Jeopardy" records.

You may remember, about 15 years ago, there was a guy called Ken Jennings who won 74 games on the trot and came away with many millions of dollars. But James Holzhauer, he's about 10 games in as we record this. And he's broken the one-day record on at least 5 of his 10 days. He is getting to numbers that no one's seen in the past, and so much quicker than anyone.

Not something I follow every day. But I do recommend, if you turn your TV on and just try and catch a couple of episodes, the guy is super good. And it'll be very interesting to see how far he goes. A good way to spend an otherwise relaxing Easter back in the UK.

ADAM GLICK: Doing anything else for Easter?

CRAIG BOX: Just enjoying the sunshine, really. It's been 25 of our British degrees almost every day here, which really helps to get over the jet lag. You've got an excuse to get up-- a bright sunny day, friends to see. Everyone wants to come and hang out in the gardens. Are you familiar with the work of Dale Chihuly?

ADAM GLICK: I am.

CRAIG BOX: Yeah. Well, for those who are maybe not based in the Seattle region, he's a glassblower and sculptor. And his work is on display at Kew Gardens, not far from where I live, until the 27th of October this year.

ADAM GLICK: Nice. And wish a happy Easter and a pleasant Passover to all of our listeners who are observing. Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: Google Cloud, this week, announced GKE Advanced, a new tier for their hosted Kubernetes offering. The Advanced tier will offer a 99.95% financially backed SLA calculated on the cost of the cluster nodes and not just the master. Features in the Advanced tier will include vertical auto-scaling capabilities, node-pool-aware auto-provisioning, binary authorization to ensure images haven't been altered, and the GKE Sandbox, a container runtime built on gVisor.

GKE Advanced will also provide serverless operations with Cloud Run on GKE, a hosted version of Knative, and GKE usage metering, which allows you to break down cluster resource usage and attribute it to customers and departments. The existing GKE tier will now be known as GKE Standard.

The new tier will come with a free trial and is scheduled to arrive before the end of June.

Google Cloud has also introduced managed HTTPS certificates for load balancers. Google-managed SSL certificates are provisioned, renewed, and managed for your domain name and can now be added to your GKE services using Ingress.
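
For the curious, here is a rough sketch of what driving this looks like programmatically. The ManagedCertificate resource is a GKE CRD, so it goes through the Kubernetes Python client's generic custom-objects API; the certificate name and domain below are hypothetical, and the group and version reflect the beta API as announced.

```python
from kubernetes import client, config

config.load_kube_config()

# Create the Google-managed certificate as a GKE custom resource.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.gke.io",
    version="v1beta1",
    namespace="default",
    plural="managedcertificates",
    body={
        "apiVersion": "networking.gke.io/v1beta1",
        "kind": "ManagedCertificate",
        "metadata": {"name": "example-cert"},      # hypothetical name
        "spec": {"domains": ["www.example.com"]},  # hypothetical domain
    },
)

# The certificate is then attached to a GKE Ingress via the annotation
#   networking.gke.io/managed-certificates: example-cert
# and Google provisions and renews it in the background.
```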

CRAIG BOX: CNCF service mesh project Linkerd has released version 2.3 this week. This release graduates mutual TLS from an experimental to a fully supported feature and turns authentication and encryption between meshed services on by default. Linkerd has a goal to make secure communication easier than insecure communication. But we do suggest avoiding their preferred curl-a-shell-script-and-pipe-it-to-bash method of installation.

ADAM GLICK: Microsoft Azure has released a preview of pod security policy for AKS. Pod security policy is a Kubernetes feature which gives users the ability to restrict which pods can be deployed into a cluster. Microsoft notes that this is a preview feature, and deploying it removes your cluster from support until the feature is GA.

CRAIG BOX: Google Cloud has released an open-source utility called Berglas, a secrets management tool that uses Cloud Storage for data storage and Cloud Key Management Service as an encryption provider. Berglas comes with a mutating webhook configuration for Kubernetes, which can be used to inject a secret into a pod using a sidecar.

It writes to a temporary volume, meaning the secret is never persisted to etcd. Berglas was written by Seth Vargo, who is one of the authors of HashiCorp Vault.
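
The webhook-and-temporary-volume pattern comes up again later in the show, so here is a rough sketch of its shape: the API server POSTs an AdmissionReview for every new pod, and the webhook answers with a base64-encoded JSON patch. All the paths, names, and images here are hypothetical stand-ins, not Berglas's actual implementation.

```python
import base64
import json

from flask import Flask, request, jsonify  # assumes Flask is available

app = Flask(__name__)

@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    patch = [
        # Add an in-memory volume so the decrypted secret only ever lives
        # in tmpfs, never in etcd or on the node's disk. (A real webhook
        # would also handle pods that declare no volumes at all.)
        {"op": "add", "path": "/spec/volumes/-",
         "value": {"name": "secrets", "emptyDir": {"medium": "Memory"}}},
        # Hypothetical sidecar that resolves secret references and writes
        # the plaintext into the shared volume for the app to read.
        {"op": "add", "path": "/spec/containers/-",
         "value": {"name": "secrets-sidecar",
                   "image": "example/secrets-sidecar:1.0",  # hypothetical
                   "volumeMounts": [{"name": "secrets",
                                     "mountPath": "/var/run/app-secrets"}]}},
    ]
    return jsonify({
        "apiVersion": "admission.k8s.io/v1beta1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })

if __name__ == "__main__":
    # The API server only calls webhooks over TLS; serve a real cert here.
    app.run(port=8443, ssl_context=("tls.crt", "tls.key"))
```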

ADAM GLICK: GoDaddy has released a secrets management tool called Kubernetes External Secrets, an operator for synchronizing secrets from a vendor secret management service and populating Kubernetes Secrets objects with them. The project supports AWS Secrets Manager and AWS Systems Manager, but they have made the project general enough to support other external secrets management systems and are excited to work with the Kubernetes community to add them.

We should point out that synchronized secrets are written to etcd with the standard base64 obfuscation. So we don't recommend that you use it in production until a webhook and temporary volume approach is implemented.

CRAIG BOX: Platform9 has announced Klusterkit, with two K's. This is a collection of three open-source tools meant to help with the deployment of air-gapped, on-premises Kubernetes clusters. The tools are etcdadm, to simplify administering etcd clusters; nodeadm, which helps install dependencies that kubeadm requires on a node; and cctl, which is a cluster lifecycle management tool. Klusterkit also allows recovering a completely failed cluster from an etcd snapshot.

ADAM GLICK: The CNCF and Alibaba announced last week that they are creating a free cloud native training course for Chinese developers. The course is fully aligned with the Kubernetes certifications for administrators and application developers. The first lesson is already posted, with the full course scheduled to arrive by the end of October.

CRAIG BOX: Tinder has swiped right on Kubernetes. In a Medium post this past week, quote, unquote, "dating app" Tinder talks about their two-year transition to Kubernetes as a platform, including how it helped them scale containers in seconds versus having to wait minutes for EC2 instances. Their new platform consists of 1,000 nodes running 48,000 containers in 15,000 pods.

ADAM GLICK: Saifuding Diliyaer posted this week about a new tool that cloud storage company, and surname-domain-squatter, Box.com is open sourcing to help track where network policies are causing dropped packets. The project, called kube-iptables-tailer, helps alert developers to these packet drops and can help them more quickly diagnose network issues that are causing application problems.

CRAIG BOX: Andrew Sy Kim of VMware, Mike Crute of AWS, and Walter Fender of Google posted this week about the Cloud Provider SIG. This SIG was formed 9 months ago and is focused on making sure Kubernetes stays cloud agnostic and helping to ensure compatibility. They detail their work to move cloud-specific code out of the Kubernetes tree and are looking for anyone else who would like to help with the effort.

ADAM GLICK: Bobby Salamat from Google put up a nice summary of pod priority and preemption, a newly GA'd feature in Kubernetes 1.14. He explains how this feature allows you to put multiple workloads on the same cluster and set priorities to ensure that the most critical things run first.

It will also remove resources from lower-priority workloads to provide them to higher-priority ones. This is particularly useful if you have an application that can scale rapidly and don't have time to spin up new nodes when you have traffic spikes. For the network geeks out there, this may sound very similar to QoS.
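
For anyone who wants to kick the tires, here is a minimal sketch with the Kubernetes Python client against the GA scheduling.k8s.io/v1 API. The class name, value, and description are made up; the rule is simply that higher values win when the scheduler has to preempt.

```python
from kubernetes import client, config

config.load_kube_config()

client.SchedulingV1Api().create_priority_class(
    client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="tier-one"),  # hypothetical name
        value=1_000_000,       # higher values preempt lower-priority pods
        global_default=False,  # only applies to pods that opt in
        description="Critical, user-facing workloads.",
    )
)

# Workloads opt in by setting, in their pod spec:
#   priorityClassName: tier-one
```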

CRAIG BOX: Finally, you can get great performance improvements by moving from JSON over HTTP to protocol buffers with gRPC. But does it impact your observability story? Gary Stafford, an enterprise architect from New York and a follower of @KubernetesPod on Twitter, has been looking at Istio in a series of blog posts with an example app communicating using JSON payloads.

In unrelated tests by Auth0, protobufs were found to be six times faster than JSON. So in Gary's latest article, he has moved the same service to protos and gRPC, and finds that, while you have to change the code, the Istio tooling allows the same observability for logs, metrics, and tracing with no changes required.

ADAM GLICK: And that's the news.

[THEME MUSIC]

CRAIG BOX: David Xia is an infrastructure engineer at Spotify who works on deployment tooling. His team is currently upgrading Spotify's infrastructure to use Kubernetes. Previously, David helped build Spotify's in-house Docker tools and platforms. He enjoys biking in subzero temperatures and dreams of a more livable and just world with 100% clean energy for everyone. Welcome to the show, David.

DAVID XIA: Thank you.

CRAIG BOX: Everyone has heard of Spotify at this point. But why don't you summarize Spotify for our audience?

DAVID XIA: Spotify is an online music streaming service. And our goal is to enable content creators and artists and also consumers of music to enjoy music from a big catalog, wherever they are, and however they want.

CRAIG BOX: And podcasts now, too? This show, there will be a number of people out there listening on Spotify?

DAVID XIA: Yes. Spotify's been investing a lot in the area of podcasts. I listen to all my favorite podcasts on Spotify, of course.

CRAIG BOX: The activity of things like Spotify, and Netflix as well, you can in large part summarize as "distributing media files to people". And then you say, well, those media files just go on a CDN. So it doesn't sound like a hard problem, when you say it like that. Tell me why that's wrong.

DAVID XIA: Scale is definitely one of the big challenges. If you have hundreds of millions of users, at any one time, there are tens of millions of people simultaneously using the service, how do you scale a music streaming service to all those people in a way that, when I-- it's on demand, right? So if I click on a track, it should play instantly and so fast that it feels like the file is on my phone.

So that is not an easy problem to solve and, especially, to do it in a way that's-- not only technically great and feels great for the user, but also makes all the stakeholders happy, like labels, content creators, consumers-- is a pretty daunting challenge.

CRAIG BOX: And there's a lot more to it than just the delivery of the music. There's obviously all the recommendation pieces and everything. What are the things you don't think about that comprise Spotify?

DAVID XIA: Yeah. So like you said, a lot of the machine-assisted curation, human curation. We hire full-time people that make playlists. All the different types of features that are thought about before they go into planning and prioritization, and then hiring for building out a feature, testing it, actually trying to figure out, is your feature something that people want?

Is it improving the service? And then we're also building a whole suite of tools that we hope will help content creators. I think it falls under the umbrella of our artist services. So letting artists figure out where-- giving them data on who's listening to their music.

Where should they plan their next concert? How do we give them more control over the way their identity looks like or their brand looks like on Spotify? How do we let them reach out to their fans-- tools that do all of that.

CRAIG BOX: Having worked a little bit with Spotify before, I understand the teams are set up in these things called squads. Can you tell us what a squad is?

DAVID XIA: It's a little confusing to people on the outside looking in. We have a different set of terminology. But essentially, a squad you can just think of as a team. We have, for most of our feature teams-- and I'm just talking about the research and development department here.

For most of our feature teams, it's pretty cross-functional. We try to keep it between 4 and no more than 8 or 10 people. When squads or teams get too big, things usually get inefficient in the way you communicate and the way you work. Most feature teams will have a back-end developer, designer, front-end, maybe a mobile engineer, a product owner.

But my team is different. I work on infrastructure. So we're all back end or SRE DevOps types of engineers. And we do still have a product manager or product owner. But we don't have a designer. We don't have a mobile engineer.

So teams are structured in that way. And we practice Agile. But there's a high degree of autonomy of how you decide to work. So some teams will do things on a three-week cadence. My team does our planning and retrospectives on a two-week cadence.

CRAIG BOX: And you've been seven years at Spotify now.

DAVID XIA: Yeah, almost.

CRAIG BOX: So these squads are your customers?

DAVID XIA: These squads are definitely-- yeah, our customers. And most of the time-- it's not everyone on the feature team. It's almost always, right now, the back-end developers. That's our primary customer that we're trying to target right now.

CRAIG BOX: What was the deployment scenario like seven years ago before the platform team started?

DAVID XIA: When I first started, it was-- it was all over the place. It was anything you could think of. Most of the time, I think, a lot of people were building Debian packages, uploading that. And then they would cluster SSH onto all their machines and run apt-get update, apt-get install.

CRAIG BOX: Was each team allocated a set of their own machines?

DAVID XIA: Each team had to request their own capacity. And this was another big pain point, actually. When I first started, there was not only a shortage in the number of machines you could request, but there was a long lag time.

So you would have to manually create a ticket, ask for how many machines, and then, eventually, someone gets around to it. But if you requested in a data center that was short on capacity, you could end up waiting for weeks. So what we saw was, people were essentially overprovisioning ahead of time because they knew that would take a while. And then they were hoarding their machines, which actually made the problem worse.

CRAIG BOX: Absolutely. What was the driver to start looking at containers and clustering?

DAVID XIA: So that was-- like we mentioned, one of the motivations was, there was no standard way to deploy your back-end service. We wanted to solve that. And containers also were very popular at the time, and for several reasons. They give you reproducible builds, immutable artifacts. And that was something that we were also attracted to.

CRAIG BOX: And was it up to the nascent platforms team at that point to look into something and try and sell all of the squads on that? Or is that something that you had a bit more autonomy to say, right, we're going to shut down machine access and demand people do things a certain way?

DAVID XIA: I wasn't on the team when it started. But there was an infrastructure team that was started up in New York. And it was called New York City Site Infrastructure, or NYCSI for short. And we had some very talented people on that team. And they created their own mission. They thought of, like, oh, these are all the different problems we see at Spotify. Which one do we want to focus on? And they wanted to focus on deployment.

And you asked how much autonomy teams have-- could we go and shut down their stuff and then force them to migrate? So Spotify actually works a little differently. Teams are highly autonomous. And you can't make people do stuff. It's more carrot than stick.

So usually, what happens is that the infrastructure team will build tools. And we want to make them really easy. We want to attract people to use them. We want to get feedback. And we don't have the authority to go and say, you must shut down this machine by this point, unless it's a security concern or something where we're going to lose support for it, like something's end of life. But most of the time, it's on their own time. And therefore, we are incentivized to make our tools easy and nice to use.

So that still holds true today. But the great thing about the culture of Spotify is that, almost always, back-end engineers, they want to use the new thing. It's a company where people are always trying to innovate, trying to test out what's going to make their lives easier. So people love switching to what we built for them or switching to Kubernetes. And the only limiting factor-- it's not whether they want to, it's when they have the time to do so.

CRAIG BOX: So what were the things that made Docker appealing at the time you were evaluating it?

DAVID XIA: Reproducible builds, immutable artifacts, the fact that you can guarantee that what you're running-- you're not going to have the problem of, well, it works on my machine. So these environmental differences largely go away. And you can be sure that, what you tested and had running in your test environment or local computer is going to behave and be the same thing in production.

CRAIG BOX: As will become apparent in a couple of minutes, this was before Kubernetes was announced.

DAVID XIA: Yeah.

CRAIG BOX: So there were not clear obvious choices in terms of what to do in running containers on multiple machines in a fleet. What was the solution that Spotify came up with?

DAVID XIA: You're right. There were not a lot of open-source, ready-to-use, off-the-shelf solutions if you wanted to use Docker at scale across a lot of computers. So Spotify built a Docker orchestration framework called Helios. And I wasn't there for that, but the talented people on the original team, they looked at things like, if we want-- let's say you have a fleet of 100 machines.

You want them running the same thing. How do we guarantee that the same image is going to be deployed, configured the same way, run the same way? And then how do we check later that it is actually running? And if something crashes, how will the framework try to restart it?

CRAIG BOX: How would you describe Helios in terms of someone who's familiar with Kubernetes concepts today? What are the things that it does? And what are the things that it doesn't do?

DAVID XIA: That's a good question. Helios is a very stripped-down version of Kubernetes to the point where you don't even have pods. Helios never had the concept of running more than one container as an atomic unit, at the same time, which was actually a great insight that the people who built Kubernetes had.

Helios also doesn't have a declarative-style API with declarative configuration. It's pretty imperative: you say, Helios, deploy this, or, Helios, create this job. And that's it. Helios is essentially wrapping-- there's a concept called a Helios job.

And all it is, is it wraps a bunch of configuration, like environment variables and volume mounts, with a Docker image. And then you create this job, which is just a bunch of metadata. And we used ZooKeeper at the time. So it just sits in this data store. And then you can say, deploy this image onto these hosts.

CRAIG BOX: You have to actually select which host you want to deploy on.

DAVID XIA: You do, yeah. So originally, there's no way to even have a collection of hosts. You just have to explicitly list out all the hosts that you want. And that's a big difference. With Kubernetes, the user doesn't have to care about hosts. They just say, I want this many replicas. With Helios, you are still very host-aware. You have to say which hosts you want to deploy to or undeploy from.
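
To make the contrast concrete, here is a minimal sketch with the Kubernetes Python client: no host appears anywhere, only a replica count, and the scheduler decides placement. The names and image are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()

client.AppsV1Api().create_namespaced_deployment(
    namespace="default",
    body=client.V1Deployment(
        metadata=client.V1ObjectMeta(name="example"),
        spec=client.V1DeploymentSpec(
            replicas=100,  # "I want this many" -- no host list anywhere
            selector=client.V1LabelSelector(match_labels={"app": "example"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "example"}),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(name="example", image="example:1.0"),
                ]),
            ),
        ),
    ),
)
```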

CRAIG BOX: Helios and Kubernetes, I believe, were announced at the same event, or open sourced at the same time. Obviously, the two teams working on this weren't aware of the work that each other were doing. Do you remember the feeling at Spotify when Kubernetes was announced?

DAVID XIA: So I wasn't on the team yet. But I remember hearing about it afterwards. I don't remember who was first. Was it Google, or was it-- was it Kubernetes, or was it--

CRAIG BOX: I'm not sure of the exact order on the day. But I remember there were a couple of different open-source projects. There was Centurion from New Relic as well that was announced at the same time. And obviously, it's got to have been a little bit of a surprise for you. So we are sorry that we kind of stole the thunder.

DAVID XIA: No, it's totally fine. I think it speaks to how it's a very common problem that all these people started working on it at the same time. There was this vacuum in the ecosystem of-- no tool solves this. Spotify is proud of building great software.

It's probably obvious that we didn't have the amount of human power to support building out Helios. So I don't know what the feeling was. But I, myself, thinking back, am still a little bit undecided on whether-- it's hard to do that cost-benefit analysis. You needed something at the time, so you went ahead and built the tool that solved your needs. Would it have been better to wait? Maybe.

CRAIG BOX: You can never be sure, I feel.

DAVID XIA: Yeah, you don't know.

CRAIG BOX: At the end of it, you have a tool that exactly solves your needs. And version 0.1 of Kubernetes quite probably would not have done it.

DAVID XIA: Yeah. So it's, oftentimes, very hard to calculate, oh, what would have been a more optimal path?

CRAIG BOX: You mentioned one of the features that was not in Helios at the beginning was authentication and authorization. That was something you bolted on afterwards. What was that experience like?

DAVID XIA: Helios was built without any authentication or authorization--

CRAIG BOX: For a single customer and a single user, basically for yourselves.

DAVID XIA: Everything was HTTP. There was no way to do HTTPS. You just had to stick Nginx in front of it and do your TLS termination there.

I was there when security said, oh, this is too open. If anyone's on the network, they can just undeploy, deploy-- they can do anything they want. We said, yeah, this is a problem. How are we going to solve this?

So we had to tack on-- or design a solution afterwards-- to close the security hole where we had left the system wide open. And my big takeaway from that is, when you're designing a distributed system, build security into it from day one. It will be a lot better for everyone involved-- users, operators, yeah.

CRAIG BOX: When, in the lifecycle of Kubernetes, did it become obvious that it was the thing that would ultimately get the support from the rest of the ecosystem and, thus, be worth adopting over the custom solution that you built?

DAVID XIA: I don't know if there is one specific point in time. But just over time, as Kubernetes got accepted into the CNCF, became this very popular open-source project, and you see all the cool features being built, like higher-level Deployments, Ingress, network policies, RBAC. And then we see managed solutions like Google Kubernetes Engine.

Amazon has its own. Microsoft has its own. So by then, it was very obvious that this is the way to go. This is a clear front runner. And we want to be part of this community and this ecosystem using this great tool.

CRAIG BOX: So what are some of the ways that you participated in the Kubernetes ecosystem or contributed to the code?

DAVID XIA: I found some typos in docs. Those are pretty low-hanging fruit.

CRAIG BOX: Very good place to start.

DAVID XIA: Got them accepted. So I got my contributor credentials there. I think reporting issues is actually a really important thing you can do as part of--

CRAIG BOX: Absolutely.

DAVID XIA: -just being a good way of contributing back. Writing up very detailed ways to reproduce something, explaining why you're trying to do this, why this matters to you-- just a really nice GitHub issue is oftentimes very, very helpful for the community.

And a lot of those have been fixed. So other than documentation and reporting issues, there are a few outstanding things that I would like to see improved, such as some of the client libraries for Kubernetes. They could be improved.

There's no inheritance between the different Kubernetes resources that have their own model classes. My use case is that I want to give it a bunch of YAML, and maybe the client library has a utility to tell me, oh, you have a Deployment. You have a Service. You have a horizontal pod autoscaler. But right now, you have to do a lot of that YAML reading and parsing yourself. So that's one area that I'm excited to see get better over time.
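
Here is a small sketch of the kind of utility David is wishing for, using PyYAML to read a multi-document manifest and branch on each object's kind, since the model classes share no common base type. The handler is a hypothetical placeholder.

```python
import yaml  # PyYAML

def handle(obj):
    # Hypothetical placeholder; a real tool would construct the right
    # typed model class or call the right API for each kind.
    print(f"found a {obj['kind']} named {obj['metadata']['name']}")

def load_manifest(path):
    with open(path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc:
                continue  # skip empty documents between '---' separators
            # No inheritance to lean on, so dispatch on the kind field.
            if doc["kind"] in ("Deployment", "Service",
                               "HorizontalPodAutoscaler"):
                handle(doc)
            else:
                raise ValueError(f"unhandled kind: {doc['kind']}")
```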

A lot of times, we work very closely with Google on Google Kubernetes Engine and how that experience works for us. One time, we wrote our own admission controller. We deployed it, and it started-- all of our clusters started crashing. And we didn't know why.

After, I think, a week-long open P1 ticket, we finally found out that there was a-- we didn't discover this. But our reported support case helped these Google engineers find that there was a bug in Go's implementation of HTTP/2, which caused these clusters to crash, because we had written our admission controller incorrectly according to the spec.

CRAIG BOX: That's some serious root-cause analysis there, if it turns out that it's in the library.

DAVID XIA: It is-- it was in a very low layer of Go itself. It was pretty amazing.

CRAIG BOX: What are some of the things that you've had to build to integrate your environment with GKE?

DAVID XIA: So some things that we've had to work on at Spotify are making the developer experience and the workflow compatible between the way we used to, or currently, still do things at Spotify and GKE. One example is service registration. We started with client-side routing, so it's all DNS record based-- a lot of SRV records, a lot of A records. But Kubernetes is server-side routing.

There's a service resource that has a bunch of endpoints. It figures out which endpoints you need. But we had to build something-- or our sister team over in Stockholm, they had to build a custom integration with our existing service discovery mechanism where they actually just listened to Kubernetes.

And when a pod comes up, and it matches a certain service, we register that pod's IP in our existing service discovery framework, which causes all sorts of problems. You shouldn't do that. This is not a good thing to do, for listeners out there.
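
For the record, the pattern being warned against looks roughly like this: watch pod events and mirror pod IPs into a legacy service-discovery system. The register and deregister calls are hypothetical stand-ins for Spotify's internal framework.

```python
from kubernetes import client, config, watch

def register(name, ip):        # hypothetical legacy service-discovery call
    print("register", name, ip)

def deregister(name):          # hypothetical
    print("deregister", name)

config.load_kube_config()
v1 = client.CoreV1Api()

# Stream pod events for a service and mirror pod IPs into the legacy system.
for event in watch.Watch().stream(v1.list_namespaced_pod,
                                  namespace="default",
                                  label_selector="app=example"):
    pod = event["object"]
    if event["type"] in ("ADDED", "MODIFIED") and pod.status.pod_ip:
        register(pod.metadata.name, pod.status.pod_ip)
    elif event["type"] == "DELETED":
        deregister(pod.metadata.name)
```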

CRAIG BOX: But it's legacy. And it's interoperability. It's required.

DAVID XIA: Definitely. So that's why we built it. Another thing is secrets. We had our own way of storing secrets. Kubernetes has its own way of storing secrets. We built an integration that will basically replicate from our secret store onto Kubernetes in a way that's relatively easy for people to do. So again, it's interoperability there.

CRAIG BOX: What's the experience like for a developer at Spotify, from writing code on their laptop, to seeing that change running in production?

DAVID XIA: For most back-end developers, they'll be writing most of the time in Java. And they'll write some Java. It's usually a Maven project, all packaged up. We have almost everything as CI/CD, so we use GitHub Enterprise, too.

So you push it to GitHub Enterprise on a branch. You make a pull request. Your automated tests run. Most of the time, people also don't start from scratch. We have a Java framework that's open source called Apollo. And it comes with a bunch of boilerplate and a few basic tests that are given for free to you. People add more tests. Once the tests pass, and people have reviewed your code and approved it, you can merge it in.

CRAIG BOX: What tooling are you using for that? Next, we'll move on to the deployment phase, what is the process of--

DAVID XIA: Yeah. So our deployments-- it's something that another team built at Spotify. This is not open source. It's a tool called Tingle, and it's all container-based, so we run all the tests in containers. And then there might be test reports that come out. At the end, you can define it. People are probably familiar with tools like Travis.

CRAIG BOX: Yes.

DAVID XIA: Travis and CircleCI-- and you can write configuration for what build steps you want, what kind of tests you want to run, maybe even custom commands that you want to run. So Tingle is very similar to that. And at the end, it can produce artifacts, like produce this Docker image, push it to a central registry. Then sometimes people will have automatic deploy to production.

If I merge to master, just automatically deploy to production. Sometimes, there will be a manual gate. We've built canary functionality into our deployment pipeline, so you can have one canary instance in production.

One of the big things we're missing is, we haven't solved for a nice testing environment, which has always been really hard to do, for some reason. I'm sure a lot of other people have struggled with that. How do you have a test environment that is isolated, but at the same time, gives you enough of a signal that your system is actually working?

CRAIG BOX: What environments do you offer people? Do you have different namespaces for different tests? Or do you have different clusters?

DAVID XIA: So for Kubernetes, we've encouraged people to have one namespace per logical system, not per team. Because systems could get transferred. And it's a lot easier if they were just in their own namespace to begin with. Short answer is, we don't really have a testing environment for people. We have the canary. And so people will look at their canary graphs.

And if it looks OK, then they usually will promote it. And all the other instances will get that newer version of their code. We have a very legacy testing environment, but it doesn't work very well. So I think this is probably one area that Spotify is going to try to improve going forward.

CRAIG BOX: And what about operating multi-cluster environments, running clusters in different locations?

DAVID XIA: We have, definitely, multiple clusters. We're going to need to scale out to even more in the future. We currently run Spotify in three GCP regions-- one in the US, one in Asia, one in Europe. And we, right now, since our usage of GKE is pretty low, we just have one production cluster in each of those regions. But in the future, we'll probably have three, four, maybe more clusters in each region.

And we're planning, for now, on keeping them all the same. They're going to be configured the same. Maybe we'll think about things like, this will be a cluster for tier-one services. This will be a cluster for tier-two services. But for now, we're just going to try to keep them all the same for simplicity.

CRAIG BOX: You gave a talk at Google Cloud Next recently on GKE Usage Metering, a topic we covered in an episode recently. What was your involvement with that feature?

DAVID XIA: I was mostly helping the team just kick the tires on the feature. We want to use GKE Usage Metering, which is a feature that helps you break down your GCP costs when you're on GKE. You'll get your costs broken down by namespace. Because right now, all of our teams, they just create their own projects. And you get a cost breakdown per project for free in your invoices. My team has one project that has all of our GKE clusters.

CRAIG BOX: Yes.

DAVID XIA: And as people migrate to GKE, they're going to be, essentially, moving their compute from their projects to this project that we own. And if you look at the invoice, it's going to look really strange. Because it looks like our team is just increasing. We're doing something, and our costs are just growing and growing. Meanwhile, relative to other people's projects, that are not. But underneath the surface, it's because they're migrating to GKE.

CRAIG BOX: And those are all-- the whole cost will be falling because people are now moving to a shared environment.

DAVID XIA: Yeah. GKE usage metering will basically help us gain more insight into the cost breakdown as we migrate. So I was mostly just kicking the tires on it and test driving the docs, seeing that the product actually worked for our use case. At the end of the day, it's probably going to be our finance and tech procurement team that will be the true end users.

But yeah, my talk was with Madhu and Yang from Google. And after using it, we gave them some input into, like, oh, this is actually really easy to use. This is definitely something you probably want to enable on all our clusters.
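
Usage metering works by exporting per-namespace resource usage to BigQuery, so the breakdown David describes ends up as an ordinary query. A rough sketch; the project, dataset, and table names are whatever was configured when enabling the feature, and the column names should be checked against the exported schema.

```python
from google.cloud import bigquery

# Hypothetical project/dataset names; the table and columns follow the
# feature's documented export schema at the time.
query = """
SELECT namespace, resource_name, SUM(usage.amount) AS amount
FROM `my-project.gke_usage.gke_cluster_resource_usage`
GROUP BY namespace, resource_name
ORDER BY amount DESC
"""

for row in bigquery.Client().query(query):
    print(row.namespace, row.resource_name, row.amount)
```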

CRAIG BOX: What things are you excited about in terms of new features and new announcements in the cloud namespace?

DAVID XIA: There is Kubernetes Cloud Connector, which I am personally-- and I think my team is also very excited about. Being able to manage-- that's a feature that will allow declarative Kubernetes-style APIs for other GCP resources.

CRAIG BOX: Right.

DAVID XIA: So right now, a lot of people at Spotify, when they want to use other GCP resources, like service accounts, IAM policies, PubSub, Cloud Datastore-- they're just clicking around in the UI. They're not doing it declaratively. They're not versioning their configuration anywhere.

And it's really hard to keep track of the state of your infrastructure. In specific cases-- for our firewall rules, our security team has actually written a tool that will go and monitor and make sure that the firewall rules we have are the ones that are supposed to be there.

But in the future, when Cloud Connector is there, we can just use that. And that will manage, hopefully, a lot of our infrastructure, including our clusters themselves. We're currently using Terraform to do that, but we're--

CRAIG BOX: So you've got two different environments where people have to know two different DSLs in order to be able to describe how to do that.

DAVID XIA: Exactly.
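
As a flavor of that declarative style from the Kubernetes side, here is a sketch that creates a Pub/Sub topic as a custom resource through the generic custom-objects API. The group and kind follow the product's *.cnrm.cloud.google.com CRD naming, but treat the exact version and plural as assumptions to verify against the installed CRDs.

```python
from kubernetes import client, config

config.load_kube_config()

# Declare the topic in Kubernetes; the controller reconciles it into an
# actual GCP Pub/Sub topic, instead of someone clicking around in the UI.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="pubsub.cnrm.cloud.google.com",
    version="v1beta1",                    # assumption; check the CRD
    namespace="default",
    plural="pubsubtopics",
    body={
        "apiVersion": "pubsub.cnrm.cloud.google.com/v1beta1",
        "kind": "PubSubTopic",
        "metadata": {"name": "example-topic"},  # becomes the topic name
    },
)
```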

CRAIG BOX: Alright. And let's ask-- subzero temperature cycling. Is that Fahrenheit or Celsius? Does it matter?

DAVID XIA: It does matter. There's a big difference. I personally prefer to use just Celsius in my everyday life, just metric system in my everyday life. But for this particular case-- there was one day in January this year-- it was Martin Luther King Jr. Day. And New York had temperatures of, I think with wind chill factored in, negative 5 Fahrenheit.

CRAIG BOX: So that's minus 20 Celsius, I'm guessing.

DAVID XIA: Yeah. That was a pretty cold day. But there was this one event that I really wanted to go to. Everyone else wanted to stay home. I biked over to St. John's on the Upper East Side for this MLK event. And I was super excited to go. And I convinced a bunch of my friends to also get themselves--

CRAIG BOX: Were the trains not running that day?

DAVID XIA: I think the trains were. You just had to make it into the station. And then, once you're in the station, it was probably above zero. But yeah, I biked over there to kind of scope it out, get a place in line, and people thought I was just insane for doing that.

CRAIG BOX: Well, good on you. Thank you so much for joining us today.

DAVID XIA: Thank you.

CRAIG BOX: You can find David on Twitter at @davidxia_, or on the web at davidxia.com.

[MUSIC PLAYING]

CRAIG BOX: Thanks, as always, for listening. If you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you can find show notes and transcripts of each episode. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]