Kubernetes Podcast from Google: Episode 52 - AutoTrader UK, with Russell Warman and Karl Stoney

#52 May 7, 2019

AutoTrader UK, with Russell Warman and Karl Stoney

Hosts: Craig Box, Adam Glick

AutoTrader UK were an early adopter of Istio. Adopting it to meet GDPR requirements for encrypted traffic, Head of Infrastructure and Operations Russell Warman and lead engineer Karl Stoney have gone on to use it to reduce resource usage, and thus cost, as well as uncover bugs in their applications. They talk to Craig about it, while Adam serves his country.

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: Here's a hypothetical situation for our listeners. Let's say that you were home on parental leave after the birth of your first child, wanted to spend as much time with her as possible, help out your wife. What would be the absolute best possible scheduling thing you think could happen in that situation?

ADAM GLICK: I'm going to take "jury duty" for 500 on that one.

CRAIG BOX: Congratulations! How was jury duty?

ADAM GLICK: Oh, it is one of the small civic asks that the United States asks of us as citizens. And so, although the timing was less than convenient, it really was a great experience. If anyone gets a chance to serve jury duty, I really encourage people to do it. It gives a view of the justice system that I think a lot of us don't have. And I myself really appreciated it.

CRAIG BOX: Are you allowed to say anything about the case?

ADAM GLICK: At this point, I can. The case is over. It was a misdemeanor case dealing with threats that someone had made.

CRAIG BOX: Oh. And are you allowed to say if the person in question is now safely out of harm's way?

ADAM GLICK: Justice was served.

CRAIG BOX: Brilliant. Well, congratulations and thank you for upholding the American way.

ADAM GLICK: Yes, well, I'm sure many, many places have that. At some point, would you get called in London, or would you get called back in New Zealand?

CRAIG BOX: Depends which country knows where I pay my tax, I guess. I looked this up for you at the time. I was very surprised because there is a process in New Zealand at least and presumably here in the UK, where you can send them a letter and say, actually, this is not a good week and have your service deferred. And if you were a New Zealand citizen, that would have been an option available to you.

ADAM GLICK: Well, I'm glad to be done with it and now getting ready for KubeCon, where we will both be together.

CRAIG BOX: Yes.

ADAM GLICK: How goes your preparation?

CRAIG BOX: Well, we've got a lot of different Google Cloud things happening at KubeCon. We have three spaces at the event. We have a booth where you can learn all about GKE and Anthos. We have an outdoor terrace where you can celebrate Kubernetes' fifth birthday. And we have an indoor lounge where we will have things like Live Code Review and Meet the Maintainers. And if you are very, very lucky, you might actually get to meet the co-hosts of the Kubernetes Podcast from Google there as well.

ADAM GLICK: Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: It's Vendor Conference Week with both Red Hat and Microsoft hosting large developer events this week. To make the most of it, they've announced a new project they're working on together. Kubernetes-Based Event-Driven Autoscaling, or KEDA, is a metrics server that can expose metrics to the horizontal pod autoscaler, allowing scaling based on events. KEDA can be used as a scale to zero mechanism for hosted versions of Azure Functions in the same manner that Knative events works for the open part of the ecosystem.
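
As a rough illustration of what scaling on events involves, the sketch below applies the ratio formula documented for the horizontal pod autoscaler, desired = ceil(current * currentMetric / targetMetric), to an event backlog. The scale-to-zero handling and the function and parameter names are assumptions made for illustration, not KEDA's actual code.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 0,
                     max_replicas: int = 10) -> int:
    """Replica count from the HPA ratio formula, with a scale-to-zero floor."""
    if current_metric <= 0:
        # No pending events: drop to the floor (assumed behaviour for this sketch).
        return min_replicas
    # Waking up from zero has no replicas to multiply, so treat it as one.
    base = max(current_replicas, 1)
    wanted = math.ceil(base * current_metric / target_metric)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 10 queued events per replica, two replicas facing 30 events each would scale to six, and an empty queue would scale to zero.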

CRAIG BOX: Other announcements from Microsoft Build include the general availability of virtual nodes on AKS, which use the Virtual Kubelet Project to abstract away Azure Container Instances; and Azure Dev Spaces, an integration for running and debugging containers in AKS. They also announced a preview of Azure policy support for AKS and the recent end of life of Kubernetes 1.9. We'll bring you coverage of Red Hat's news next week.

ADAM GLICK: Banzai Cloud offers a platform called Pipeline on top of vendor hosted services, but not all of them support all the features they need. Instead, they built their own service, called Pipeline Kubernetes Engine, or PKE, to work in those situations.

This week, they published the work they're doing to make PKE work on Microsoft Azure and why they felt the need to do this. If you're running Kubernetes using AKS or considering it, it's probably worth a read to understand how they've handled some of the load balancing and availability challenges they've encountered.

Banzai also launched a Helm chart repository as a service, including a free tier for hosted public charts. As part of the launch, they open sourced a library called Chartsec, which scans a Helm chart for potential security vulnerabilities. Private chart storage is also available as part of their commercial offering.

CRAIG BOX: If you liked the sound of debugging containers, but don't run on Azure, Microsoft announced extensions for Visual Studio Code last week that let you do remote development against Docker containers or other machines using SSH. You're running your application inside the container with an agent, and you're on VS Code on your local machine as normal.

ADAM GLICK: Docker also held their annual conference last week and made a raft of announcements. Docker Enterprise 3.0 added new capabilities for automated lifecycle management and enhanced security, as well as Docker Desktop for Enterprise. Realizing it's the thing everyone comes for, they also rebranded their Kubernetes support to Docker Kubernetes Service.

The new platform is launched in public beta. In their day two keynote, Docker announced the Docker Foundation, a philanthropic organization that will focus on enabling education opportunities by partnering with organizations like CodePath.org and Black Girls Code.

CRAIG BOX: Monzo, a digital first bank in the UK built on Kubernetes, has open sourced their incident response system. Monzo Response was built to reduce pressure and cognitive burden on engineers in an incident situation, which they describe as any occasion when a thing gets outside its normal parameters.

Response is built to integrate with their Slack workflow and assist with reporting and coordinating the efforts of other engineers. Response was first shown at a DevOps Exchange meetup in London last month and in a Chemical Brothers song from 1999.

ADAM GLICK: Velero, the project formerly known as Heptio Ark, has announced a beta for their upcoming 1.0 release. It includes all key features expected in the 1.0 release, plus a number of bug fixes and documentation updates. A couple more small releases are expected before the 1.0 proper release. Keep an eye open for this at the upcoming KubeCon EU.

CRAIG BOX: Longtime listener Povilas Versockas from Lithuania writes in with some work he has been doing on providing Grafana dashboards for Kubernetes. Not content with just dashboarding the Kubernetes control plane and node components, he wants to share them with the community.

There are some downsides with the regular Grafana dashboard sharing system, so Povilas has published these dashboards to GitHub as monitoring mixins, meaning they can easily be customized for your own Prometheus configuration. Check them out in the show notes.

ADAM GLICK: For those attending this year's EU KubeCon, please consider stopping by the Diversity Lunch and Hack. This year's Lunch and Hack will be held at the Fira Gran on Wednesday from 12:30 to 2:00 PM. There will be pair programming exercises and a variety of community topic tables and moderators available to help guide the discussion for people of all skill levels. If you're interested in attending, registration is now open.

CRAIG BOX: Red Hat last week announced version 3 of their container registry Quay. A quay, spelt Q-U-A-Y, is a stone or metal platform lying alongside or projecting into water for loading and unloading ships. Quay is also spelt Q-U-A-Y, but pronounced "kway". Go figure.

Quay 3 introduces support for multiple architectures and Windows container images, as well as being rebased onto Red Hat's base images. Quay.io has been acquired three times since its founding in 2013: by CoreOS, by Red Hat, and then finally by IBM. We're not sure which one of them came up with the funky pronunciation.

ADAM GLICK: Quay sounds a lot like a dock. What do you call someone who uses a dock?

CRAIG BOX: A quay-er?

ADAM GLICK: [LAUGHS] Rook, the cloud-native storage aggregator and subject of episode 36, has reached version 1.0. Aside from the new features, such as support for the Nautilus, or v14, series of Ceph, the Rook project has launched a new website with updated documentation, user guides, and cartoon medieval artwork. Rook has recently surpassed 5,000 stars on GitHub and has now been downloaded almost 40 million times.

CRAIG BOX: Diversifying, or reading the writing on the wall, the OpenStack Summit was renamed the Open Infrastructure Summit for its most recent iteration in Denver last week. Veteran tech journalist, Stephen J Vaughan-Nichols was there and reports the upcoming rollout of 5G networks requires a variety of network functions to be virtualized. And the services which do that are all likely to land on telco platforms powered by Kubernetes.

The event saw OpenStack Foundation project Airship announce version 1.0. Airship is a set of open source tools for automating cloud provisioning and management with Kubernetes and OpenStack, sponsored by AT&T and targeted at the network operator use case. AT&T report they have been using Airship in their production network since December.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Russell Warman is the head of infrastructure and operations at AutoTrader UK. Karl Stoney is a lead engineer on the infrastructure team. Welcome to the show.

RUSSELL WARMAN: Thanks, Craig. Good to be here.

KARL STONEY: Thanks very much.

CRAIG BOX: I have only owned one car in my life. It was a 1993 Vauxhall Cavalier that I bought on a holiday to the UK because it was cheaper than renting a car for that period. And I was possibly under 25 at the time. So the whole deal of having to get insurance was tricky enough, but I bought that car on AutoTrader. And I actually bought that car for 300 pounds and sold it three weeks later for 280 or something like that.

So I think I did quite well on that deal, all things considered. So I'm at least a little bit familiar with your platform. But Russell, why don't you start off by telling the listeners what AutoTrader UK is and does?

RUSSELL WARMAN: Of course. We've been around for about 40 years. And we started off printing magazines. And then in 1996, we launched our first website. And we've been online as a fully digital business since 2013. And we've got somewhere in the region of about 500,000 vehicles on the site listed at any one time. We're the 16th busiest website in the UK, and we've got probably somewhere in the region of 55 million cross platform visits to our platforms each month.

CRAIG BOX: What was the experience like as a transition from a print company to a digital company?

RUSSELL WARMAN: At the time, we had two separate sales teams, one that sold digital and one that sold print. And we basically allowed them to go after the same customers. And then we merged the selling experience probably back in 2009, something like that, where people then became digital and print reps. So, for a long time, we basically competed against ourselves.

CRAIG BOX: When did you stop publishing the book?

RUSSELL WARMAN: 2013.

CRAIG BOX: OK.

RUSSELL WARMAN: So, we're about five years, six years in this year. And we definitely see ourselves as a pure play technical digital business now.

CRAIG BOX: So you are older than the cloud, but not older than the internet. Where did your infrastructure start?

RUSSELL WARMAN: We started with some servers in a very small comms room in one of our magazine centers. And then we moved up to hosting in data centers, probably in around 2001. So we built our physical server infrastructure in 2001 in one data center. Quickly realized we need resilience, so we added a second data center a couple of years later. And then probably about 2005, we started moving towards virtualization and consolidating down from physical servers. And then fast forward a little bit, probably to about 2012, we started building our private cloud.

CRAIG BOX: OK, and Karl, when does your involvement begin?

KARL STONEY: So I've not been at AutoTrader nearly as long as Russell-- I think 1/10 of the time, actually. So I joined about two years ago now, at the start of probably-- I like to say their next technical evolution, which was going from the private cloud that we talked about before to the public cloud.

CRAIG BOX: Russell, did you bring Karl on board?

RUSSELL WARMAN: I'm partly responsible.

CRAIG BOX: What was the thing at the time that led you to growing the team?

RUSSELL WARMAN: We had a number of engineers that were really skilled in managing on-prem capabilities. And what we were starting to see was-- we developed our data platform in one public cloud, and we were starting to see some challenges. And we knew that we were starting to use more cloud services. Karl's got a great background in doing that. He's done it for a number of other companies. So it was a really great fit in terms of his skillset, where we were at that time, and it just worked out, great timing.

CRAIG BOX: Were you looking at that time to start a migration to cloud?

RUSSELL WARMAN: Not explicitly, no. I'd say that for the last six years, we've talked about being cloud-native. And as I said, we'd built out a private cloud infrastructure, and we started getting our application teams to migrate their apps across from our virtualization platform onto that. So we'd moved around-- I don't know-- 150 apps across or something like that. And we'd not explicitly said we were going to go public cloud. We just said cloud-native.

But what we found was that-- I think it started with our logging, monitoring platform. We tried to do it on-prem. We had a few challenges with it. So one of the first things-- well, we'd actually moved it into another cloud provider. And then when Karl joined, we were still having some performance issues with it. There was a lack of confidence in the quality of the data that was being recorded, and the performance was slow. So one of the first things Karl did was actually rebuild that up in GCP.

KARL STONEY: I was going to say that I think one of the problems that we had when that was moved to the other cloud provider that we shall not name, we did a lift and shift, rather than a lift and improve. And it's a common mistake that I think a lot of people actually make when they go into public cloud. You've got these two physical data centers with one millisecond latency between servers, and then you're suddenly moving that up to public cloud. And it's not quite that fast.

CRAIG BOX: No.

KARL STONEY: So, actually, lifting and shifting doesn't improve performance in a lot of cases. So as Russell mentioned, one of the first things we did is really look at the architecture of this particular-- of Elasticsearch, look at how we can make it more cloud friendly, and we ended up deploying that on GKE on GCP. And, yeah, it was a great success. It was the first real GCP success at the organization.

RUSSELL WARMAN: And I think just to add to that, I mean, you touched on before the lift and shift. This is just more than a technical project. There's some cultural stuff that we've had to tackle as well, in terms of how you manage and think about the infrastructure.

So we've been talking about infrastructure as code for a little while. And Karl's really brought a lot of expertise in being able to help bring that to life on what that really means to engineers that we've tried to talk and explain what the differences might be, as you start managing more public cloud infrastructure. But because Karl's had practical experience of doing that, he was able to articulate it in a way that really resonated and helped them understand what the differences would be. And I think that has been quite crucial.

CRAIG BOX: And Karl, you mentioned some of the obvious technical differences between private and public cloud. What are some of the cultural changes that you had to coach people through?

KARL STONEY: I think there's a certain level of emotional attachment. I think that's probably one of the biggest cultural things. As Russell told you about our journey before, we've built these physical data centers. It's very easy to get an emotional attachment to things that you physically built. So moving to public cloud, people were having to let go of this thing that they've nurtured for many, many years. And it was actually a really great implementation, and it worked really well for us, so people were naturally a little bit defensive.

CRAIG BOX: Yes.

KARL STONEY: So that was one of the big cultural changes, I would say. Russell, unless you can--

RUSSELL WARMAN: I think that's what I would say. You're right. I mean, we were under no pressure to move things to the cloud. We do have a really good history of good reliability, good performance with our on-prem applications. And we were absolutely accelerating things like deployments, and doing it in a way that was safe and not impacting our customers. So all the things that you would care about, we had a strong track record in those, didn't we?

KARL STONEY: Yeah, and I think, in fact, that's another cultural thing that we should draw out. So everything is in your control when it's in your data centers. It's like, if we have a customer-impacting issue, it's our engineers that are going to go and physically look at those servers. And they're going to do it in a time frame that we're very familiar with. You're moving to the public cloud, you're effectively offloading some of that responsibility to a cloud provider, which everybody sells as a great thing.

Because AutoTrader-- we're not in the business of building and managing data centers. We want to have a car marketplace. However, letting go of that control and putting that trust into another organization, you have to build that trust. So we had to build trust with Google before people started to feel comfortable about putting more stuff there.

CRAIG BOX: One of the things that you're famous for and why you spoke at the Google Cloud Next conference recently is your adoption of Istio. I understand that it wasn't so much a "here's a Kubernetes environment. Let's put Istio on top of it", but it was actually the other way around. So perhaps you could tell us a little bit about that journey.

KARL STONEY: The requirement for Istio actually came out of a customer requirement. So one of our customers was wanting us to protect their data end to end through all of our microservice architecture with effectively mutual TLS encryption all the way. On our private cloud, data was encrypted up until the edge, but then between some of the microservices, it was just HTTP, inside the same network segment anyway.

So this was a new requirement. It was something that we hadn't done in the past, and we were looking at implementing it on our private cloud. In fact, we tried for several months to implement it onto our private cloud. There was a team of about six to seven people focused on this, and we were not making great traction. So we started looking at service meshes effectively, because one of the things that a lot of service meshes out there turn around and say that they can do is transparent mutual TLS, and you don't have to worry about it.

AutoTrader has a culture of experimentation. We like to try things out. We like to fail fast and learn from our mistakes. And we'd spent maybe two months trying this implementation. We were starting to look at other options. We were like, let's see what's out there. So that was why we started looking at Istio. Istio because it has the backing of many big companies, and we'd already done a bunch of work with Google as well. And Google being one of those companies, felt like a good place to start.
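
To make the mutual TLS requirement concrete, here is a minimal sketch of what each service would otherwise have to configure by hand, using Python's standard `ssl` module. The function names and the optional certificate path parameters are assumptions for illustration; a mesh such as Istio moves this work into a sidecar proxy so application code does not change, and nothing here reflects AutoTrader's own implementation.

```python
import ssl
from typing import Optional


def server_context(cert: Optional[str] = None, key: Optional[str] = None,
                   ca: Optional[str] = None) -> ssl.SSLContext:
    """Server side of mutual TLS: present a certificate and demand one back."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)  # this service's identity
    if ca:
        ctx.load_verify_locations(cafile=ca)  # CA that signs client certificates
    # The "mutual" part: refuse clients that do not present a trusted certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def client_context(cert: Optional[str] = None, key: Optional[str] = None,
                   ca: Optional[str] = None) -> ssl.SSLContext:
    """Client side: verify the server and present our own certificate."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname checks by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert and key:
        ctx.load_cert_chain(certfile=cert, keyfile=key)
    if ca:
        ctx.load_verify_locations(cafile=ca)
    return ctx
```

Repeating this in every service, plus issuing and rotating the certificates, is the kind of effort that transparent mutual TLS in a mesh is meant to remove.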

CRAIG BOX: Right.

KARL STONEY: So we just decided to experiment and test. We wanted to just deploy Istio as quickly as we can, just to prove out the capability. Interestingly enough, at the time, because we were involved in Istio really, really early doors, the best way to do that was on Kubernetes. So we didn't have Kubernetes on premise. So we were like, OK, well, we want to test out Istio, and in order to really test out Istio, we want to do it on Kubernetes. So what's the quickest way that we can test out Kubernetes? And at the time, that was GKE.

CRAIG BOX: Right.

KARL STONEY: So we, in the space of a couple of days, spun up some GKE clusters. We deployed Istio on top of it. We tested our mutual TLS. And we stuck the applications that were relevant to this customer's particular requirement. We Dockerized them, stuck them onto this cluster, and suddenly, we delivered this capability. I mean, we went surfing. It was an experiment at the time, but we delivered a capability that we'd been working on for months in literally two days.

RUSSELL WARMAN: And at that point, we then had a conversation around where is the right place to run this. Because we still had that investment in data centers. Should we build our Kubernetes on premise, and then add Istio on top of it? And I think we concluded really quickly, actually, that wasn't the right thing to do. We didn't want the overhead of managing the upgrades and trying to sort out the dependencies between them. We thought the best place to run it was in GCP and take advantage of somebody else handling all those bits for us.

KARL STONEY: And that was a decision that was definitely made easier because if you remember earlier on, we mentioned the Elastic stack that we were successfully running on GCP and GKE. It's nearly a 20 terabyte Elasticsearch stack running on top of GKE. We built organizational confidence in GKE as a product through that piece of work. So the discussion about, well, should we use it to run some websites, it's quite an easy one.

CRAIG BOX: Yeah, why not? Let's do it.

RUSSELL WARMAN: And what you hopefully have picked up on is through this conversation so far, we've not talked about the costs of doing that. What we've talked about is the capabilities that we were looking for, and it's not been about shiny tech. It's about enablement. It's about trying to fix a problem and then doing the right thing.

Karl talked about taking away complexity from our environment. We don't build this stuff. So we don't necessarily have all the skills that we need. So taking that complexity out and getting somebody else to do that absolutely makes sense. It just makes our engineers focus on the things that are important to us.

CRAIG BOX: How did you make that new platform available to people once you'd proved it out?

KARL STONEY: There's obviously a lot of complexity in Istio and in Kubernetes. So in the talk that we gave at Next, I started to think about all of the different manifest files that you need to write in order to deliver an application on top of this platform. And you've got all of your Kubernetes services, your deployments. You've got your virtual services for Istio; your destination rules, your sidecar.

Then, if you think about your network policies, you've got those as well. And I think I counted in the end about 20 different YAML files that you need to write in order to deploy a service on top of the stack.

So during our experiment-- obviously, I've got experience with Kubernetes. I got very involved in Istio. But as an organization, something that we haven't touched on is we have about 200 developers. And trying to skill up 200 developers on effectively 20 new APIs, some of which were under extremely rapid development, Istio, in particular, it's never going to happen. We're never going to get the momentum or the traction that we wanted. So we decided to continue a practice that we'd done on the private cloud, which was to hide it to a certain extent behind what we call our delivery platform. It was an abstraction on top of all of this stuff.

So our API-- our contract with those developers-- was actually a really small subset of values. It was simple stuff, like how many replicas do you want? How much CPU and RAM do you think your application needs? What language is it? Because that actually drives out some capabilities that we've got as well. Give us some basic metadata about your application. What's it called? What does it do? Who talks to you, and what do you talk to?

And, actually, most applications-- I think it was about 20 lines of configuration. Behind the scenes, we then translate that into all of the manifests that are required in order to deploy the application onto the stack. So from a developer's perspective, not a lot changed, apart from the fact their application was effectively going to be a Docker container, rather than just the single JAR we had historically published.
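
The translation step described here can be sketched as a small function that expands a short developer-facing spec into several manifests. This is an illustration under stated assumptions, not AutoTrader's delivery platform: `AppSpec`, `render_manifests`, the default values, and the `name` namespace label used by the network policy are all hypothetical, and only five of the roughly 20 manifest types are shown.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AppSpec:
    """Hypothetical developer-facing contract: the only values a team fills in."""
    name: str
    image: str
    replicas: int = 2
    cpu: str = "300m"
    memory: str = "256Mi"
    port: int = 8080
    callers: List[str] = field(default_factory=list)  # services allowed to call this one


def render_manifests(app: AppSpec) -> List[Dict[str, Any]]:
    """Expand one small spec into the manifests the platform would apply."""
    labels = {"app": app.name}

    def meta() -> Dict[str, Any]:
        # Assumes one service per namespace, with the namespace named after the app.
        return {"name": app.name, "namespace": app.name, "labels": dict(labels)}

    namespace = {
        "apiVersion": "v1", "kind": "Namespace",
        "metadata": {"name": app.name,
                     "labels": {"name": app.name, "istio-injection": "enabled"}},
    }
    container = {
        "name": app.name, "image": app.image,
        "ports": [{"containerPort": app.port}],
        "resources": {"requests": {"cpu": app.cpu, "memory": app.memory}},
    }
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment", "metadata": meta(),
        "spec": {"replicas": app.replicas,
                 "selector": {"matchLabels": labels},
                 "template": {"metadata": {"labels": labels},
                              "spec": {"containers": [container]}}},
    }
    service = {
        "apiVersion": "v1", "kind": "Service", "metadata": meta(),
        "spec": {"selector": labels,
                 "ports": [{"name": "http", "port": 80, "targetPort": app.port}]},
    }
    virtual_service = {
        "apiVersion": "networking.istio.io/v1alpha3", "kind": "VirtualService",
        "metadata": meta(),
        "spec": {"hosts": [app.name],
                 "http": [{"route": [{"destination": {"host": app.name}}]}]},
    }
    # Only namespaces listed as callers may connect; an empty list denies all ingress.
    allowed = [{"namespaceSelector": {"matchLabels": {"name": c}}} for c in app.callers]
    network_policy = {
        "apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy", "metadata": meta(),
        "spec": {"podSelector": {"matchLabels": labels},
                 "policyTypes": ["Ingress"],
                 "ingress": [{"from": allowed}] if allowed else []},
    }
    return [namespace, deployment, service, virtual_service, network_policy]
```

A call such as `render_manifests(AppSpec(name="search", image="example/search:1", callers=["web"]))` returns five dictionaries ready to be serialized to YAML, which is the shape of the trade being described: developers own a handful of values, and the platform team owns the templates.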

CRAIG BOX: Do you think that the platform needs to be as complicated as it is? Do you think that just by describing 20 lines of configuration, there are simpler platforms that only require that as configuration and don't let you do all of this stuff? Where's the tradeoff?

KARL STONEY: Do you mean like, for example, why did we not choose to deploy into Heroku or something like that?

CRAIG BOX: Yeah, or Docker Swarm or something that was more traditionally easy.

KARL STONEY: Because we want a lot of the capabilities that the more complex platforms give us. We want the mutual TLS capabilities of Istio. We want to do retry policies, or back-off policies, or outlier detection. And these things that Istio can give us out of the box, but we don't necessarily want to expose that to our developers. So as an organization, we want that feature set. We just want to be picky about what we expose.
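
For a sense of what those capabilities look like when a platform sets them on a developer's behalf, the sketch below builds an Istio VirtualService with a retry policy and a DestinationRule with mutual TLS and outlier detection. The function name, the default values, and the host naming are assumptions, and exact field names (for example `consecutive5xxErrors`) vary between Istio releases, so check them against the version in use.

```python
from typing import Any, Dict, List


def resilience_manifests(service: str, attempts: int = 3,
                         per_try_timeout: str = "2s") -> List[Dict[str, Any]]:
    """Build a VirtualService with retries and a DestinationRule with mTLS
    and outlier detection for one service."""
    # Assumes one service per namespace, so the namespace shares the service name.
    host = f"{service}.{service}.svc.cluster.local"
    metadata = {"name": service, "namespace": service}
    virtual_service = {
        "apiVersion": "networking.istio.io/v1alpha3", "kind": "VirtualService",
        "metadata": dict(metadata),
        "spec": {"hosts": [host],
                 "http": [{"route": [{"destination": {"host": host}}],
                           # Retry failed calls a bounded number of times.
                           "retries": {"attempts": attempts,
                                       "perTryTimeout": per_try_timeout,
                                       "retryOn": "5xx,connect-failure"}}]},
    }
    destination_rule = {
        "apiVersion": "networking.istio.io/v1alpha3", "kind": "DestinationRule",
        "metadata": dict(metadata),
        "spec": {"host": host,
                 "trafficPolicy": {
                     # Sidecars use mesh-issued certificates for mutual TLS.
                     "tls": {"mode": "ISTIO_MUTUAL"},
                     # Temporarily eject instances that keep returning errors.
                     "outlierDetection": {"consecutive5xxErrors": 5,
                                          "interval": "30s",
                                          "baseEjectionTime": "30s",
                                          "maxEjectionPercent": 50}}},
    }
    return [virtual_service, destination_rule]
```

Because the platform generates these, developers never see the retry or ejection settings unless the platform team chooses to expose them.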

RUSSELL WARMAN: And I guess that's the role of the infrastructure team, is really to turn those things into platform capabilities and then try and simplify the way that our development team can take advantage of them.

CRAIG BOX: Do you think that that's work that you should have to do as a platform team, or do you think that a vendor or the platform itself, being Kubernetes, should provide an abstraction to do?

RUSSELL WARMAN: I think it's quite an interesting question. Where I've got to with it when I thought about it is our abstraction is quite an opinionated one. It's one that fits our organization, so in our abstraction, we've made certain decisions. For example, we have one application or one service per namespace. That's a decision we make. We have clusters per environment.

You know, quite a lot of people have multiple environments on the same cluster, and they separate by namespace, et cetera. So because of the flexibility of the underlying platforms, our abstraction isn't going to fit with other organizations.

Like, I was chatting with Shopify guys yesterday, who've got a very different infrastructure to us. And it's a little bit frustrating in some ways, because I'd love to be able to share a bunch of the tools that we've written with those guys. But they just won't fit their organization. I definitely think you've got that sort of 80-20 rule. You could probably come up with a model or an abstraction that fits 80% of people out there, who just want to get up and running quite quickly. You'll still always have that 20% who do some things sufficiently different. We're probably in that 20.

CRAIG BOX: But that's probably part of the beauty of the platform, though, isn't it, is having that flexibility.

RUSSELL WARMAN: I think if you're thinking about it, it's like, that's where other things like Cloud Run and stuff will come in. If people just want to be able to run a service really, really quickly with minimal configuration, that's the sort of angle that they will go down. We're doing some slightly more complex stuff.

CRAIG BOX: Yeah, versus platforms where you don't have that ability to break the glass and get out of it. I think you get the ability to say, hey, I can run my complicated thing next to my easy thing, and I'm managing it all on the same, getting all the benefits, condensing the workloads down and so forth.

Eric Brewer did say recently that he believes Kubernetes is a platform for platforms. I think that's a meme that's sort of going around the community. It feels like it suits a lot of people's needs, but ultimately, it may not be a thing you think of as a product. It's, again, what we're trying to do with Anthos, is level things up a little bit further.

One of the things that you've published, Karl, is a blog post on cost dashboarding and some of the work that you've done to make it possible for your customers inside AutoTrader to see how much it costs to run their workload. How did that come about?

KARL STONEY: At AutoTrader, typically, the way in which we build our data centers is a CapEx-based sort of model. Every few years, we spend x amount of money, and then developers are just deploying applications on top of that, without really giving consideration to what the running cost of those apps is. Because we've got some predefined amount of space. It doesn't really matter.

Moving the organization from this CapEx style model to an OpEx model, where we're fundamentally getting charged for consumption, aren't we? CPU, RAM, et cetera. One of the things that we were asked by our CTO, effectively, is to make sure that we demonstrate that we're staying on top of that growth in the public cloud, and it doesn't just spiral out of control because it could be very, very easy with a bit of misconfiguration to spin up far more resources than you need. And then if you do that, if you just lift and shift or you let that growth grow, you can end up with some big bills.

So what we wanted to do is really understand the cost of our infrastructure, and I don't just mean like the total cost of our Kubernetes cluster. We wanted to make informed decisions about whether or not an application is even worth running. If it's costing us x and our return on investment is not covering that, then we can get rid of the app.

So we actually started off really, really early doors because we had this relationship of an application to a namespace. It was very easy to see how much CPU and RAM resources were being used within a namespace. And then we just basically built some dashboards on top of that, but then multiplied that CPU and RAM by the GCP costs of those components.

And then very, very quickly, we had this capability where every time an application was deployed, we can just go to a dashboard and see overall cost of the cluster and overall cost of this application or group of applications. And we then set up alerts based on spikes in cost and spikes in CPU and RAM utilization. And doing that early on, rather than retrospectively trying to do it later, it just means we've been able to keep on top of it, and it's not been a chore.
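
The arithmetic behind such a dashboard can be sketched in a few lines: multiply each namespace's CPU and RAM by a unit price, then flag namespaces whose cost jumps. The prices below are placeholders rather than real GCP rates, and the function names and the 1.5x spike threshold are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Placeholder unit prices; substitute the rates from your own bill.
CPU_PRICE_PER_CORE_HOUR = 0.03
RAM_PRICE_PER_GIB_HOUR = 0.004
HOURS_PER_MONTH = 730


def monthly_cost(cpu_cores: float, ram_gib: float) -> float:
    """Approximate monthly cost of a steady CPU and RAM footprint."""
    hourly = cpu_cores * CPU_PRICE_PER_CORE_HOUR + ram_gib * RAM_PRICE_PER_GIB_HOUR
    return hourly * HOURS_PER_MONTH


def cost_by_namespace(usage: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map namespace -> monthly cost, given namespace -> (cpu_cores, ram_gib)."""
    return {ns: monthly_cost(cpu, ram) for ns, (cpu, ram) in usage.items()}


def cost_spikes(previous: Dict[str, float], current: Dict[str, float],
                threshold: float = 1.5) -> List[str]:
    """Namespaces whose cost grew by more than `threshold` times since last period."""
    spikes = []
    for ns, cost in current.items():
        baseline = previous.get(ns)
        # New namespaces have no baseline, so they are not flagged here.
        if baseline and cost > baseline * threshold:
            spikes.append(ns)
    return sorted(spikes)
```

With one application per namespace, the per-namespace figure is also the per-application figure, which is what makes the approach cheap to adopt early.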

RUSSELL WARMAN: There's two things probably to add on that. The first thing is that we never had that visibility within our data centers. Capacity planning has always been like guessing, rather than science. I think what we're starting to work towards now is a little bit more predictability around what our applications really, really need. And as Karl said, we've embedded very early in the process in understanding how much it costs to run applications. And so every time we migrate applications across, we understand more about them than we've ever done.

CRAIG BOX: And as you had those conversations with your management about moving to the cloud, how has having this data available helped?

RUSSELL WARMAN: It starts to give confidence that we're not just moving stuff and not thinking about the impacts of running it. And like Karl says, a lot of this is a shift from a CapEx to an OpEx model. So I mean, there's obviously OpEx associated with running data centers, but it shifts our model. We need to show that we're being responsible about the money that we spend on compute, et cetera.

KARL STONEY: I think that's a really interesting point. So one of the things that I mentioned in the talk at Google Next was, as we've been moving applications across from on-premise to public cloud, the average CPU and RAM utilization of these applications, we've lowered by about 70%. So on-premise, because we were kind of a little bit flying blind, we gave every instance of an application 2 CPUs and 2 gigs of RAM, and then we scaled it out horizontally.

CRAIG BOX: That's very generous of you.

KARL STONEY: We were quite kind. But if you think about it, you've got little Node-y web services running single page applications that now need a fraction of that. But we were lacking the visibility to give us the confidence to lower it. The actual platform that we were running on-premise didn't have the capability to do different sizes of CPU and RAM for different deployments. That's obviously something that just comes out of the box with Kubernetes.

So as we've been moving stuff across, the increased visibility that we get from Istio-- we've got that black box visibility of the health of our applications, the golden signals-- combined with Kubernetes metrics-- so what is my CPU and RAM utilization over periods of time, et cetera-- we're suddenly able to look at these applications and make really informed decisions about, OK, well, you only need like 0.3 of a CPU, so let's give you 0.3 of a CPU. See the impact that that has. And then suddenly, you've made a massive dent in your running cost for that application.
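The right-sizing decision described here can be sketched as picking a high percentile of observed CPU usage and adding some headroom. The percentile, the headroom factor, and the function names are assumptions made for illustration; the interview does not say which rule AutoTrader applies.

```python
import math


def recommend_cpu_request(samples: list[float], percentile: float = 0.95,
                          headroom: float = 1.2) -> float:
    """Suggest a CPU request: a high percentile of observed usage plus headroom."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return round(ordered[rank - 1] * headroom, 2)


def reduction(old_request: float, new_request: float) -> float:
    """Fraction by which the request shrinks (0.85 means an 85% cut)."""
    return 1 - new_request / old_request
```

For an application that peaks at a quarter of a core, this suggests roughly 0.3 of a CPU, against the 2 CPUs every instance received on-premise.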

CRAIG BOX: And we've spoken before, Karl, about how the effect of adding Istio to these dashboards is that you're able to see, if I make those changes, what does that actually look like in the output? What does it look like in terms of the requests per second I'm able to respond to, or anything like that?

KARL STONEY: Exactly, yeah, so it's exactly that. It's like we have the ability to make informed decisions, just because we can see it instantaneously.

CRAIG BOX: Are you going to update the published work to include what you've done with Istio?

KARL STONEY: Would you like me to?

CRAIG BOX: I'd love you to.

KARL STONEY: [LAUGHS] I will do that.

CRAIG BOX: You were one of the earliest adopters of Istio. And along with that, came the privilege of working together with our mutual teams at Google and AutoTrader to help guide the deployment and make it successful and also to help shape the product in those earlier days. What was that experience like?

RUSSELL WARMAN: I guess Karl's probably better placed to talk about it from a technical point of view. I think from my perspective, what I saw was a real level of engagement and partnership, actually, just in terms of wanting to develop a product, but also understand our needs and where to support us getting to where we wanted to go.

I think, sometimes, with a lot of organizations, you kind of have to go through quite a lot of hoops to get to an engineer. You end up going through-- you speak to a product person, you might speak to a pre-salesperson, but it's not engineer to engineer.

And, for me, I really love the fact that we got access to Google's engineers to be able to really input the features that we felt were missing, and then being able to fix some of that, that we got the absolute benefit from.

KARL STONEY: I would really like to think it's been an incredibly mutually beneficial engagement. We've given the Istio team some complicated use cases, some real world deployments. And we've done it at a relative scale now. And like Russell said, at the same time, we've been able to really input in shaping the products. I don't think our deployment would have been as successful as it has been without that early engagement with the product teams.

CRAIG BOX: There are some people who questioned the complexity of Istio. I know we've talked about that a bit before in terms of configuration. But do you think overall that it's worthwhile, and would you have considered a simpler product?

KARL STONEY: I do think overall, it's worthwhile, yes. It's got a lot of features. With a lot of features comes a lot of complexity. But one of the good things about Istio is you don't necessarily have to deploy all of it. So if you're not using certain components-- and the user interface for Istio has improved drastically in the last couple of releases-- you toggle a couple of feature switches in your Helm template and those things just don't get deployed.

So, taking AutoTrader as an example, we actually really only use the mutual TLS component, so the strong encryption and the uniform observability. So we don't make use of a lot of the traffic routes, and we don't do any policy check stuff. So that stuff's just not enabled for us. Yeah, so, I think Dan described it as an a la carte menu. That's probably the best way to do it. You just choose what you want.

CRAIG BOX: You've recently gone through a migration from the 1.0 series to the 1.1 series of Istio. What was that migration like?

KARL STONEY: In Istio 1.1, there was a great new feature that I've been working with the Istio guys on, talking to them about for quite some time now, which is the concept of isolation for a sidecar as well. So just for a bit of background for those people who aren't massively involved in it, if you think about our cluster, we've got 250 services running on this cluster.

Previously, the sidecars for each of those applications received the configuration to know how to route to all 250 services. So as your cluster grows, there's a lot of config being pushed around. Every time any state changes on your cluster, everything gets pushed everywhere. And eventually, you're going to hit breaking point.

So I've been waiting for this feature for a while, and what this feature enables you to do is say, OK, I'm service A, and I'm only interested in B, C, and D. I don't care about the other 246 services. That was released in 1.1, but it came with a new custom resource that you had to explicitly say Service A depends on B, C, D, et cetera. We'd already deployed 250 applications. As I said before, we've got a delivery platform which abstracts a whole bunch of the stuff. And one of the aspects of their values file effectively is, as I mentioned earlier, who talks to them and what they talk to.
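The custom resource being described is Istio's `Sidecar` resource. As a minimal sketch, the function below turns a list of dependencies into such a manifest. The function name, the service names, and the convention that each service lives in a namespace of the same name are assumptions for illustration; the field names reflect the `networking.istio.io/v1alpha3` API as understood here and should be checked against the Istio reference before use.

```python
def build_sidecar(service: str, depends_on: list[str]) -> dict:
    """Build an Istio Sidecar manifest limiting egress config to named dependencies.

    Assumes one application per namespace, with the namespace named
    after the service (a hypothetical convention for this sketch).
    """
    # Egress hosts use the "namespace/dnsName" form; "*" matches any
    # service in that namespace.
    hosts = [f"{dep}/*" for dep in sorted(depends_on)]
    hosts.append("istio-system/*")  # keep the Istio control plane reachable
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "Sidecar",
        "metadata": {"name": "default", "namespace": service},
        "spec": {"egress": [{"hosts": hosts}]},
    }
```

Calling `build_sidecar("service-a", ["service-b", "service-c", "service-d"])` yields a manifest under which service A's sidecar receives routing configuration for B, C, and D only, rather than for every service in the cluster.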

So we were in quite a fortunate position where we could take that list of stuff that they talked to behind the scenes, generate this new custom resource that was released with 1.1, and then deploy it with the applications. What we actually had to do in order to get that out there, though-- because all of our deployments are immutable, we have to do a new deployment if we want to add some new capabilities-- is we had to redeploy all 250 applications. So that took a bit of time. Oh, I say it took a bit of time. In the grand scheme of things, it took us two hours, which is pretty--

CRAIG BOX: You've become quite impatient since all this has been deployed.

KARL STONEY: Yeah, so it's funny because you said that and you go, god, I wish I could have done that faster, but really, deploying 250 applications into a production environment in the space of two hours in order to get a new Istio feature out the door is pretty reasonable.

RUSSELL WARMAN: On that particular day, though, we hit our release record. We did 450 releases in one day.

KARL STONEY: To production.

RUSSELL WARMAN: To production, without any impacts to our customers.

CRAIG BOX: Fantastic. What are the bits where you think there's still room to improve?

KARL STONEY: I still think what's happened with Istio is there's been this massive growth of features. And that's great. But the user experience for organizations which don't have the engineering capability that AutoTrader has-- for example, they just don't have the amount of engineers that we've got-- that's where they need to focus now. It's the need to make it more accessible to a wider audience effectively.

And there's a lot of work obviously happening in that space. You can do the fully managed version of Istio on GKE, for example. But it's still that user experience. It's still that. Oh, also, I'd probably say debugging and documentation as well. It just needs that little bit of extra.

CRAIG BOX: What other impact has this platform had on productivity at AutoTrader?

RUSSELL WARMAN: I guess something we talked about before, hitting 450 releases in one day, we've actually-- I mean, I can't remember what the stat was, but we've moved from something like 4,000 releases to 15,000 releases in a year because of, like--

KARL STONEY: Because of this platform, yeah.

RUSSELL WARMAN: Because of the platform. And next year, we're predicting to get to 30,000.

KARL STONEY: I think the interesting point about this, though, is number of releases alone isn't the best measure. I think it's combined with-- we also measure customer impacts of releases, so we've doubled the number of releases that we do within the last year. I think we've gone from 99.6 to 99.8 or something like that, in terms of successful releases. So we've doubled the number of releases, but also reduced the number of customer impacts.

Something else to really talk about that I don't feel like we've touched on too much is, our on-premise private cloud is tied to Java, so the deployable artifact for that platform is a single jar. We have applications on our even older infrastructure that we were never able to move to that platform because they don't fit into the model of a jar.

We also had the problem of developers working around problems, OK? So they're faced with a platform that only enables them to deploy a single jar, but they want to deploy a Node Express application. They were wrapping that in a Maven build process and deploying it as a jar.

So that's what developers do. They work their way around problems. The new platform, because the deployable artifact is now a container, it's actually opened up a whole load of doors for us. So we've had products that have gone out and are now serving customers, where we've been able to take advantage of software that we would have never been able to deploy before because it's not Java, which has massively increased organizational agility. We're building new products for customers faster.

RUSSELL WARMAN: I think also release time is reduced as well as a result of this. And that, again, just means developers aren't sitting around waiting for their pipeline to deploy, are they?

KARL STONEY: Yeah. I mean, we recently did a talk at our offices. We do these tech talks to share stuff. And we took a sample application from literally just being on the developer's local machine through to being deployed into production with everything in AutoTrader-- address, certificates, DNS, split across availability zones. We have mutual TLS. And that whole process, I think it took three minutes.

CRAIG BOX: Again, we're living in a different world. When you deployed your applications into the new platform, even just for your proof of concept, I understand that it uncovered bugs that the application developers were previously unable to find or fix.

KARL STONEY: Yeah, this has actually happened quite a few times. On-premise, we had a whole variety of monitoring tools. Basically, each team was doing what they needed to do in order to get visibility into their application. So some teams were using Elasticsearch. Others were using Stackdriver. Others were using-- we got SolarWinds, and ThousandEyes, and all these different tools.

Having a uniform view of application architecture across all of those tools, it just didn't happen. So as we were moving stuff on to this new platform and we got the uniform observability of Istio, we were side by side deploying some of our complex applications.

Like our main website, and sending synthetic load through it, and it was highlighting memory leaks in some of the smaller microservice applications that we just couldn't see on-premise. And what was happening was performance was going down over time. Like, response times were going up slightly. And over the space of two days, response times were almost doubling on one of these applications. And we just didn't know that was happening on-prem.

So we were able to fix these memory leaks and just the small bugs in applications before they were causing customer impact and because of the visibility that we were getting on the new platform, which was great, really.

RUSSELL WARMAN: And like you say, I mean, you've seen those a number of times, haven't you?

KARL STONEY: It's happened, yeah, five or six times.

RUSSELL WARMAN: So for some of the applications that we've not yet managed to successfully migrate across, we're still able to test them on the new platform and fix issues. So even before we move them up to the new platform, they'll be, you know, a ton better than what they are today.

CRAIG BOX: As you free up space in your existing data centers by migrating applications to the cloud, do you see yourselves using that space with Kubernetes in the private data centers?

RUSSELL WARMAN: Nope.

KARL STONEY: Nope. Gonna mine Bitcoins with it, aren't we? [CHUCKLES]

RUSSELL WARMAN: No, our plan is to aggressively migrate all the applications that we have within our data centers to the public cloud. All new applications basically get deployed straight onto Kubernetes, so we've not deployed an application back into our data centers. And we'll be ambitiously out of those in 18 months.

KARL STONEY: There might be something to touch on there as well, in terms of the growth of this platform. So we have loosely 350 services that make up our entire architecture. 185 of them are now deployed on public cloud, and we only really ramped that migration up between six and eight months ago. So you can see that's a really successful figure. I mean, that's like half of our applications have moved within less than a year.

RUSSELL WARMAN: Yeah, so.

CRAIG BOX: All right, Russell, Karl, thank you so much for joining us today.

RUSSELL WARMAN: You're very welcome. Thanks.

KARL STONEY: It's been a pleasure.

CRAIG BOX: You can find Russell on Twitter, @rjwarman, and Karl, @KarlStoney. And you can find AutoTrader at AutoTrader.co.uk.

[MUSIC PLAYING]

Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check us out at our website kubernetespodcast.com, where you can find show notes and transcripts of all the shows. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]
