#136 February 2, 2021

Backstage, with Lee Mills and Matt Clarke

Hosts: Craig Box, John Belamaric

Backstage is a platform for building developer portals, powered by a centralized service catalog. It was built at Spotify and both open sourced and donated to the CNCF in 2020. A Kubernetes plugin was recently added. We talk to maintainers Lee Mills and Matt Clarke from Spotify.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box with my very special guest host, John Belamaric.

[MUSIC PLAYING]

CRAIG BOX: Welcome back to the show, John.

JOHN BELAMARIC: Thanks for having me back, Craig. It's a pleasure to be here.

CRAIG BOX: You were last with us in June with Episode 106, talking about CoreDNS. It feels like yesterday. Where did the time go?

JOHN BELAMARIC: I don't know. It's all a blur these days. It's hard to believe.

CRAIG BOX: One of the things you were working on when we spoke last was the idea of production readiness reviews: when someone wants to launch a new feature or an enhancement in Kubernetes, having to go through an SRE-like process of figuring out how the feature would be supported. I hear those are now going to be required as of the next Kubernetes release.

JOHN BELAMARIC: That is correct. So we actually originally merged the policy change to make it required in December, but I guess a number of people were already off and didn't get to review it. So there were a few delays, but now we have merged that change. So that means that everything going into Kubernetes 1.21, which is right now in the enhancements process -- where we're going through and identifying new features, evaluating them, going through the design process as to what those features are going to do and how they're going to be implemented, and then finally deciding whether they're going to be in the release or not -- will have one extra layer of review.

It's been the SIG leads that have sort of made the final decision on that, which is still true. It has to go through the SIG leads, but we also have just a further level of review to make sure that people have thought about the monitoring. They've thought about the supportability. They've thought about the scalability. So they just have to answer a few questions, and we'll have some back and forth on that, typically.

And luckily, the SIG leads themselves still do most of the work on that, because they review the questions before our small team of production readiness reviewers sees it. So usually, there's not too much to do because it's already been solved. But we're looking forward to that helping ensure that features really are much friendlier to operators. And we can always turn them off until they're GA, and we can always identify when they're failing and hopefully prevent or at least be able to quickly mitigate any kind of problems for people's clusters in the future.

CRAIG BOX: Now, you're in the San Francisco Bay Area. And I understand that California has slowly started lifting coronavirus restrictions, maybe not so quickly in the Bay Area. But what's changed lately?

JOHN BELAMARIC: Well, so last week, we actually had winter! Not snow, but we had a whole bunch of rain, an atmospheric river that came through. And so just as they decided, they said, we're going to reopen outdoor dining on Monday, the weather was just abysmal. So nobody opened anyway. Or maybe some people did, but I don't think anybody went.

But finally, just this last weekend, things were opened up. The weather cleared up, so my wife and I went out to eat for the first time in months. And it was a real pleasure. It was a little chilly. It's winter, but you got heaters on the patio there. And we were able to have a cocktail and just relax. It's a good time.

CRAIG BOX: Was it everything you dreamed it would be?

JOHN BELAMARIC: Oh, yes, everything and more. I mean-- no, it's fine. It's not life changing, but it's a relief just to be able to get out a little bit and not be in the house all the time.

CRAIG BOX: Yes, nice to go back to some semblance of normality. I think in the UK, we're still a way away from it here.

JOHN BELAMARIC: Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: The Longhorn storage project, originally from Rancher, has released version 1.1. The update introduces experimental support for ReadWriteMany volumes, as well as support for backups via CSI snapshots. Longhorn now runs on ARM64, and thus Rancher paints this release as being ready for the edge.

JOHN BELAMARIC: Vitess 9 is generally available this week, with the cloud native database system focusing on further increasing compatibility with MySQL. New features include improvements for debugging replication and streamlined support for online schema changes.

CRAIG BOX: Sonobuoy has added reliability scanning. The cluster compliance tool can now check that your pods have configuration such as liveness and readiness probes and a minimum quality of service. The work was contributed by the Customer Reliability Engineering team at VMware Tanzu.

JOHN BELAMARIC: Kubernetes security startup Alcide has been acquired by cybersecurity company Rapid7. Alcide, who you might recognize from their write-ups of Kubernetes CVEs and their tool sKan (with a capital K), was founded in Tel Aviv in 2017. It had taken $12 million in investment and was purchased for a reported $50 million.

CRAIG BOX: In case you were worried there were no security startups left to acquire, please meet Armo. Armo is also an Israeli company, building a platform called Workload Fabric for runtime security. Armo comes out of stealth with $4.5 million in seed funding.

JOHN BELAMARIC: In 2018, OpenAI wrote a blog post talking about scaling Kubernetes to 2,500 nodes. And three years later, they return having scaled to three times that size. The research team behind the GPT-3 text generation model has tuned their cluster to avoid network encapsulation and tracked down out-of-memory errors in their monitoring stack. Their updated post explains what challenges remain unsolved at 7,500-node scale.

CRAIG BOX: The Linkerd project announced the formation of a steering committee. As opposed to an elected governance group, this is a user advisory board formed currently of four members out of a potential seven. Linkerd has been incubating in the CNCF for four years as of this week, and a governance plan and diversity of maintainers are requirements for eventual graduation.

JOHN BELAMARIC: Vamp, who do both CI and AI, have released a report on the state of cloud native release orchestration. Unsurprisingly to this audience, it is all Kubernetes, all the time, with 72% of respondents using Kubernetes in production and a further 16% planning to do so. 81% are running a microservices architecture, with 86% of participants using what they consider high-risk release strategies. The full report will cost you your email address.

CRAIG BOX: Dan Lorenc, guest of episode 39, has moved from Minikube to malware, now working in product security at Google and on the governing board of the Open Source Security Foundation. He's currently looking for malware in installers for open source projects, and by having them install into pods in a Kubernetes cluster and then running the Falco scanner against that cluster, was able to find file accesses in places they shouldn't be. Future plans include watching for network activity and publishing the data to a dashboard.

JOHN BELAMARIC: One of the costs of running your own Kubernetes infrastructure is upgrading it. And if you wait too long, you can end up several versions behind. WeTransfer got stuck on version 1.11 in late 2020 and upgraded to 1.18 in the course of a month. Jeff Wolski has written up WeTransfer's notes on their successful migration.

CRAIG BOX: When Kubernetes goes wrong, it often goes into a CrashLoopBackOff state. David Giffen from Release has published a guide to debugging your container, or someone else's, letting you inspect the container and find out what is causing it to fail to start. If your container has curl installed, he even provides a script for easy debugging.

JOHN BELAMARIC: Finally, with some sad news, we mark the passing of Jeff Brewer of Intuit. Jeff was an emeritus member of the CNCF TOC as the representative of their end user board, and a member of the team that won the CNCF's End User Award in 2019. Our condolences to Jeff's friends and family.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Lee Mills is an Engineering Manager at Spotify responsible for the teams building Backstage. Matt Clarke is a Senior Infrastructure Engineer in the deployment team at Spotify, working on the Kubernetes plugin for Backstage. Welcome to the show, Lee.

LEE MILLS: Hey. Thanks for having me.

CRAIG BOX: And welcome, Matt.

MATT CLARKE: Hi, nice to be here.

CRAIG BOX: Backstage was built to serve a particular need at Spotify. Can you tell me, first of all, about how Spotify's engineering teams work, and what the problem was that led to building Backstage?

LEE MILLS: Sure. So Spotify has a very unique structure, I think, when it comes to engineering. We have our own Spotify lingo -- tribes and squads and all that side of things. But effectively, when you boil it down, we've tried to create very autonomous teams with very clear ownership of what they own, allowing them that freedom to maneuver and develop the platforms as and how they need to.

Back in the earlier days of Spotify, what that meant -- and it's still, to be fair, somewhat of a problem that we see today -- is that with the sheer number of services that have grown out of Spotify as Spotify itself has grown, trying to identify who owns services, how they all relate to each other, and how they all connect to each other has proved challenging, and still can be challenging.

Backstage really grew out of that need. Backstage has been around quite a long time -- four, five plus years now. It started out as a slightly different service, purely focused on the ownership problem. How do we identify who owns what? And who do we know to reach out to? That's how Backstage really started. And it grew from there, as we started to solve some of that ownership problem and do it in a way that was visible for all to see what was going on.

CRAIG BOX: Help me set the scene here. A lot of people understand that there are a lot of microservices involved in any kind of application. But for someone who thinks that Spotify is just "click button, get music", what are some examples of services that run in order to power Spotify?

MATT CLARKE: There's lots of different types of services that power Spotify. For example, there can be services that manage your playlists. There can be services on the commercial side that manage the pricing for different countries. There can be services that come up with recommendations. There can be internal tools that, for example, playlist editors use. And then there's the obvious ones, such as the services that power your Accounts page, your Home page, your Library, that sort of thing. So everything you interact with in Spotify is made up of an aggregation of different microservices.

CRAIG BOX: And if I'm building a new service, am I going to need to discover the teams internally that I'll be interacting with through something like Backstage?

LEE MILLS: Yes. So you can jump into Backstage and find all of the different services and who owns them -- who you are going to be interacting with. That's part of our service catalog. And you can also, through Backstage, create those services. So we've tried to make it as simple as possible and take as much of that kind of discovery work off your hands as we can. We try and take that load for you: identifying the people, and then helping you to create skeleton services in standardized ways for Spotify.

CRAIG BOX: Backstage obviously is a service within Spotify. Which was the team that put this together?

LEE MILLS: From Backstage's inception, it's been one team -- up until relatively recently, when we open sourced it -- that has done all of this with Backstage. We tried to build Backstage in a way that its ownership is distributed across R&D. Backstage has this methodology, this thinking, of plugins. And every plugin is owned by the team that knows that service or that functionality the best. So the team that owns Backstage, we own the core -- the core framework and the things that allow the plugins to interact with each other and glue things together. But actually, 85% of it is owned by other teams. Like Matt, for example, with the Kubernetes plugin.
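
[A minimal sketch of the plugin model Lee describes: one core framework, with each plugin declared and owned by a domain team. This assumes the createPlugin and createRoutableExtension APIs from the current @backstage/core-plugin-api package, which postdates this episode; the plugin id and component names are hypothetical.]

```typescript
import {
  createPlugin,
  createRouteRef,
  createRoutableExtension,
} from '@backstage/core-plugin-api';

// Route where the plugin's page is mounted inside the Backstage app.
export const rootRouteRef = createRouteRef({ id: 'playlist-tools' });

// The plugin itself, owned by the team that knows the domain best;
// the core framework handles the wiring between plugins.
export const playlistToolsPlugin = createPlugin({
  id: 'playlist-tools',
  routes: { root: rootRouteRef },
});

// A lazily loaded page component the owning team exports,
// which the Backstage app mounts at the route above.
export const PlaylistToolsPage = playlistToolsPlugin.provide(
  createRoutableExtension({
    name: 'PlaylistToolsPage',
    component: () =>
      import('./components/PlaylistToolsPage').then(m => m.PlaylistToolsPage),
    mountPoint: rootRouteRef,
  }),
);
```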

CRAIG BOX: And what tech stacks did you adopt in order to make that plugin model possible?

LEE MILLS: Interestingly, it's slightly different internally at Spotify to the open source flavor of Backstage at the moment. Internally, it's a large scale TypeScript application with a Java backend. In the open source version, we went TypeScript with a Node.js backend, to try and give us a few more opportunities and fit in with familiar tech stacks in the open source community a bit more.

CRAIG BOX: Are you looking to adopt the external version internally over time?

LEE MILLS: It's actually something that we're going through right now. Our secret advantage, if you will, for adopting Backstage open source has been that we have a customer sat right next to us. So the original team that built Backstage in Spotify, we've actually grown and split into two teams: one owning the open source version, working purely on that, and the other owning the internal instance. And the team owning the internal instance, we're actually working on that right now. We're slowly adopting everything that we can from open source and building and contributing back from that.

CRAIG BOX: You have this catalog tool. You have something that's built for Spotify's needs. Talk me through the process of deciding to open source this.

LEE MILLS: The original idea behind the catalog and the internal versions of it, as mentioned at the top, was around solving that ownership problem. The things that grew from there were around solving the ownership, being able to see the services or features that you own, and then giving you that view of all of the tools, all of the facilities, to be able to manage and maintain and work with those services right there within, or linked to, that catalog.

As we started to talk about open source and we started to meet with other end users in the CNCF community, in the wider open source community, we started to see that this was a problem that many people were trying to solve. And it's a problem that we had solved for ourselves over the past few years. At the time, it seemed like the right thing to do was: hey, we've got this. Let's get it out there. Let's see what people think. And let's get other people's ideas and feedback on it, to try and make it better for everybody, ourselves included.

CRAIG BOX: You mentioned that you can not only catalog the services that you run, but you can create services. How did the model that you have with the different squads and teams using different software drive the model that you have for being able to deploy all the different variants of platforms that those teams might use?

LEE MILLS: The creation of services is tied heavily to this internal concept that we have called Golden Paths. And the Golden Paths are basically recommendations and best practices when it comes to working in various different domains. So for example, we have a Golden Path for web development.

CRAIG BOX: Right.

LEE MILLS: A Golden Path for full stack, and so on. With the model that Backstage has of plugins being owned by other teams, so too do we have that with our Golden Paths. The Web Guild owns that web Golden Path, as an example, and they can define their best practices. As for the tooling to then create the services, we give those groups the functionality to create their own templates and maintain those templates. The idea being that those best practices live on through the software. They're not rigid. People don't have to use them. But it's going to be the most supported and kind of most standardized way, if you will, in line with the rest of Spotify.

CRAIG BOX: What things did you see that surprised you when people outside Spotify started adopting Backstage?

LEE MILLS: To be honest, just the general reaction. This thought that we had -- that other people were having the same problems that we were -- kind of being validated as more and more people started to jump on. So I think that was the first thing that surprised us.

And then that quickly followed with just a wealth of contributions and different ideas. For example, we had a number of different people starting to contribute and share ideas: hey, can this help me? I'm using this cloud. Or can this help me develop locally and then share it to a cloud, et cetera? All of these various different ideas of usages.

And the speed of contributions from people was probably the third thing. It's now quite regular that we put an issue out there saying we need help with this -- thinking back to introducing dark mode to Backstage. We put an issue out, and within hours, somebody had contributed back and provided that to the project. We were able to adopt that into our internal instance super quickly, as well.

CRAIG BOX: There'll be a few things that that will be true for. If you, for example, tend towards "tabs" and you put an issue out saying, please implement a "spaces" mode, I'm sure that will come through very, very quickly from the community.

LEE MILLS: It will definitely start a discussion.

[LAUGHTER]

CRAIG BOX: Matt, your team has just released a new Kubernetes plugin for Backstage. Tell me a little bit about that.

MATT CLARKE: So the Kubernetes Backstage plugin is a monitoring tool designed for people who own microservices. It allows you to easily check the health of your services, no matter where they're deployed. And it's built into Backstage.

CRAIG BOX: Now Spotify have been on a journey from their own internal container orchestration system called Helios to Kubernetes over the last few years. We spoke with David Xia from Spotify in April of 2019. He spoke a little bit about the Spotify platform team and their journey. Matt, can you pick up with how Spotify have been adopting Kubernetes recently?

MATT CLARKE: Spotify's adoption of Kubernetes has been typical of a lot of organizations out there. You see organizations experiment with Kubernetes. Perhaps one team creates their own cluster. They run their workloads on that cluster. And then they decide, this is really great. We should create a platform for every team at Spotify to be able to use this. I find that, once you do that, it starts to change the developer experience of using Kubernetes. When you have one cluster, it's very obvious where all your services are running.

But then at Spotify, at our scale and running in multiple regions, you have to provide tooling around that. And this is where we started to notice that developers were going to need some more tools to help them debug issues, to help them see the current status of their services. And Helios, at that point, already had an integration with Backstage that let you see the different instances that were running in different regions and kind of aggregate that information together.

So we thought the next logical step would be to show the Kubernetes information there. And as we kind of aggregated that together and Backstage was open source, we realized actually what we've done here is really useful. It's a use case that we didn't originally think existed because we created our Kubernetes clusters. We were the cluster admins. We had total access to see everything. But now we're running multitenant clusters. There's multiple clusters. It's very confusing if you don't have developer tooling to even just know what cluster your service is actually running on.

In order to help developers, we created this internal Backstage plugin. And they really liked it. And we tried to take as many of the positives we got from there into the open source version. One thing we're really focusing on is that it's multi-cluster. It doesn't matter if your service runs in multiple clusters, or even multiple regions, or multiple clusters in multiple regions. Or, in the open source version, different cloud providers. You can be hybrid cloud. And the Backstage plugin will basically aggregate all that information together so that you can see the health of your service.

CRAIG BOX: Do I have to have deployed my service through Backstage to be able to get that kind of observability off it? Or can I connect up to something that's already running?

MATT CLARKE: No. You can connect up to something that's already running. So at the minute, it just uses labels to retrieve the Kubernetes resources. So as long as it's discoverable like that, then Backstage will be able to pull in the Kubernetes information.
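
[A rough sketch of the label-based discovery Matt describes, using the @kubernetes/client-node library to fetch the pods behind one service by label selector. The backstage.io/kubernetes-id label value and the pre-1.0 client's positional parameters and .body response shape are assumptions for illustration, not the plugin's actual internals.]

```typescript
import * as k8s from '@kubernetes/client-node';

// Fetch the pods belonging to one logical service by label selector --
// the same label-based discovery described above.
async function podsForEntity(labelSelector: string) {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault(); // reads the local kubeconfig

  const core = kc.makeApiClient(k8s.CoreV1Api);
  const res = await core.listPodForAllNamespaces(
    undefined,     // allowWatchBookmarks
    undefined,     // continue token
    undefined,     // fieldSelector
    labelSelector, // e.g. 'backstage.io/kubernetes-id=playlist-management'
  );
  return res.body.items;
}

// Usage sketch:
// const pods = await podsForEntity('backstage.io/kubernetes-id=playlist-management');
```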

CRAIG BOX: What tooling were you using to solve these kinds of problems beforehand? Was there anything that you were running sort of hodgepodge internally that you were able to replace with the new Kubernetes work in Backstage?

MATT CLARKE: Well, one thing we had internally was a Kubernetes plugin which identified what cluster a service was running on and then allowed you to set the current kubectl context so that you could then run commands against that cluster. One of the problems with that is that it's not aggregated. You can only have one active context at one time.

CRAIG BOX: Right.

MATT CLARKE: Unless you want to have a bunch of different kubeconfig files open and different terminal windows open. By being able to monitor Kubernetes resources through Backstage, it will show you all that information at once. So for example, if you own the Playlist Management service, instead of having to go to all the clusters, retrieve Kubernetes information from them, and then aggregate that together in your head, you just go to Backstage. You go to the Playlist Management service. You click the Kubernetes tab. And you can see all of the information about how it's deployed.
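
[A sketch of the aggregation Matt contrasts with context switching: walk every context in a kubeconfig and collect matching pods into a single view. Same @kubernetes/client-node assumptions as the previous sketch; the names are hypothetical.]

```typescript
import * as k8s from '@kubernetes/client-node';

interface PodSummary {
  cluster: string;
  namespace: string;
  name: string;
  phase?: string;
}

// Walk every context in the kubeconfig and collect matching pods into
// one view, rather than flipping a single active kubectl context by hand.
async function aggregateAcrossClusters(
  labelSelector: string,
): Promise<PodSummary[]> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();

  const summaries: PodSummary[] = [];
  for (const ctx of kc.getContexts()) {
    kc.setCurrentContext(ctx.name);
    const core = kc.makeApiClient(k8s.CoreV1Api);
    const res = await core.listPodForAllNamespaces(
      undefined, undefined, undefined, labelSelector,
    );
    for (const pod of res.body.items) {
      summaries.push({
        cluster: ctx.name,
        namespace: pod.metadata?.namespace ?? 'default',
        name: pod.metadata?.name ?? '',
        phase: pod.status?.phase,
      });
    }
  }
  return summaries;
}
```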

CRAIG BOX: Backstage was open sourced by Spotify in March last year, and in September, it was submitted to the CNCF Sandbox. What was the decision making process in terms of making that donation? And what did you gain versus what did you give up?

LEE MILLS: Spotify has open sourced a few things in the past. Backstage, I think, has been the first time where we've dedicated a full squad to it, 100% of the time -- let's see where we can take the project. And joining the CNCF, that part of the decision making was really about being part of the conversation and really trying to help other organizations with what we think is a really good solution to this problem. Or a work in progress solution to this problem.

So being part of the CNCF gives weight to the project. But really it was all about the opportunities it gives us: our reach with other companies looking to try and solve these problems, helping us to build those relationships and get the project out there, as well as helping us understand how we can better open source the project and how we can be better maintainers and be in the community.

CRAIG BOX: And the Kubernetes plugin has just been released recently?

MATT CLARKE: It has just been released recently. We had it available in a test mode. But now, we've just released all the documentation and started getting contributions in, which has been very exciting.

CRAIG BOX: Obviously, Spotify has proved that Backstage works at a very large scale. Is this a project that's useful at a small scale, as well, if I have two or three developers working on something? Arguably, maybe I shouldn't even be using Kubernetes. But should I be using Backstage?

LEE MILLS: There'll be benefits for smaller groups of developers, for sure. But I think really where it comes in is when you start to have multiple different services and you start dealing with some of those ownership issues. Backstage could be useful and will be useful in certain ways. As I say, probably the plugin marketplace being the main thing.

If you are two or three developers and you want to quickly get some test statistics, or even if you are running Kubernetes at that scale, then getting access to the plugin that Matt and his team have developed will be useful. But really, the use comes in when you're starting to deal with lots of different services and cross-squad ownership.

CRAIG BOX: And what have you done to make sure that it's not a breaking change? A lot of people deploy with a stack that suits two or three people. And they find that, when they've taken on some investment and they start getting popular, the quote unquote "fail whale" comes up. And they have to re-architect. You need to make sure that when you do introduce tooling like this, it doesn't break what you've already built.

LEE MILLS: For one, we've tried to make getting started with Backstage super simple. Again, Backstage kind of acts as a little bit of an abstraction layer almost. We want to kind of abstract some of those difficulties away for you. And the way that Backstage works, effectively we use YAML files to identify different services and hook up to them. We can hook up to services that weren't built in Backstage.

So in theory, at that smaller scale, you could hook up all of your existing services into Backstage to see them. And then gradually start to move them, if you want, into more integrated ways, or build out on Backstage. But really, it gives you that opportunity to try it and see if it's really what you need or really what you want.
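
[For a sense of what hooking up an existing service looks like, here is a sketch of the kind of descriptor Lee mentions. In Backstage this is normally a catalog-info.yaml file in the service's repo; it is rendered here as a TypeScript object so the structure reads easily, and the component name, owner, and annotation value are hypothetical.]

```typescript
// The shape of a minimal catalog descriptor. In Backstage this normally
// lives in a catalog-info.yaml file checked into the service's repo;
// field names follow the Backstage catalog format, values are made up.
const catalogInfo = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: {
    name: 'playlist-management',
    annotations: {
      // Lets the Kubernetes plugin find this service's resources.
      'backstage.io/kubernetes-id': 'playlist-management',
    },
  },
  spec: {
    type: 'service',
    owner: 'team-playlists',
    lifecycle: 'production',
  },
};

export default catalogInfo;
```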

CRAIG BOX: If I've made the decision that I do want to adopt Backstage, what's the process of getting it installed? Is there something that I put onto one or multiple clusters? Is there something that I consider taking as a SaaS service from somewhere?

LEE MILLS: So getting up and running locally is super simple: it's checking it out and just spinning it up with a couple of commands. We then also do an app version, if you will, which gets it set up and ready for production, ready for major use in your organization. And that's built with a bunch of different configs, so you can kind of choose your flavors, choose your plugins, how you want to get it set up.

CRAIG BOX: Is there a bootstrap problem? Should I be using Backstage to install Backstage?

[LAUGHTER]

LEE MILLS: Perhaps. Maybe we should look into that and get that up and running.

CRAIG BOX: One thing that we see as projects mature is that, not only do they get contribution from outside the company that created them, but they start seeing an ecosystem of startups build around them. Are you seeing that with Backstage?

LEE MILLS: We are. And it's incredibly exciting, actually. To me, it's really healthy. And we have seen that. We've seen a few different smaller startups built around Backstage now offering Backstage as a service or consulting and helping to deploy Backstage for other people. But that's certainly happening. And it's really exciting to see at this stage already.

CRAIG BOX: Is there anything that either users or startups around this space could contribute that the project would most benefit from?

MATT CLARKE: In terms of what's next for the Kubernetes plugin, I'm really excited to see people talk about how they're using Kubernetes with Backstage. I'm especially excited to see how people are using custom resource definitions to deploy their applications. So for example, you could have a system like Argo CD or Argo Rollouts, which deploys CRDs that then manage workloads. And how do we integrate that into this Kubernetes monitoring tool, Backstage?

CRAIG BOX: And do you think that there will be a large contribution of features that you hadn't thought of or that aren't necessarily useful to Spotify but you think will be more broadly applicable to the Kubernetes community?

MATT CLARKE: Absolutely. I think that's one of the reasons that we didn't open source the Kubernetes plugin we have internally -- we built a new open source Kubernetes plugin. So we created abstractions that we thought might be helpful for other organizations, such as being able to have customizable authentication, and being able to load cluster configuration in dynamically. These are all things that are kind of in progress right now that are pretty exciting.

CRAIG BOX: Finally, it's been a long time since any of us have been to a gig or a theater or anything. But do either of you have a Backstage story?

MATT CLARKE: Oh.

CRAIG BOX: Did you knock on Keith Richards' door once and he was swigging a cup of tea or something?

[LAUGHTER]

LEE MILLS: Years ago, I got a chance to go backstage at a Faithless gig and talk IPAs and cups of tea -- good cups of tea -- with the band, which I thought was incredibly British but really good fun.

CRAIG BOX: We had a guest who played in a band previously. And he shared with us that other bands on the tour, the more death metal they were, the more that they liked cups of tea and Taylor Swift.

[LAUGHTER]

CRAIG BOX: All right. Thank you both for joining us today.

LEE MILLS: Thanks for having me.

MATT CLARKE: Thank you.

CRAIG BOX: You can find Lee on Twitter at LAMills83. And you can find Matt on Twitter at MatthewClarke47. You can learn more about Backstage at Backstage.io.

[MUSIC PLAYING]

CRAIG BOX: Thank you, John, again for helping out with the show today.

JOHN BELAMARIC: Absolutely. No problem at all. It's a good time to just get a chance to chat.

CRAIG BOX: If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at @kubernetespod, or reach us by email at kubernetespodcast@google.com.

JOHN BELAMARIC: You can also check out the website at KubernetesPodcast.com, where you will find transcripts and show notes, as well as links to subscribe.

CRAIG BOX: I'll be back with another guest host next week. So until then, thanks for listening.

[MUSIC PLAYING]