#108 June 17, 2020
Two years ago, Sarah Wells from the Financial Times gave a KubeCon EU keynote about how the company moved from monolith to microservices, and how her Content and Metadata platform team moved to Kubernetes specifically. She joins hosts Adam and Craig to recap that migration, and what life has been like since. As Sarah has moved to a broader role in charge of all observability for The FT, she also invited Dimitar Terziev, the current platform lead for the CM team, to the conversation.
CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.
ADAM GLICK: And I'm Adam Glick.
CRAIG BOX: Uplifting story of the week-- a lady in the UK bought a box of duck eggs from the supermarket. And first of all, we should say it's the kind of posh supermarket that sells duck eggs. I know that not everyone has one of those available to them.
And then she took the duck eggs and put them in an incubator which, again, you must just have sitting around at home these days, and three little ducklings hatched.
ADAM GLICK: That is amazing. The pictures are absolutely adorable, as any little fuzzy creature is.
CRAIG BOX: Most ducklings are.
ADAM GLICK: Yeah. Amazing story because I didn't realize that eggs that you buy, in some cases, have been fertilized. And if you do, indeed, incubate them, apparently you can get the little animals out of them.
CRAIG BOX: They're apparently perfectly healthy to eat even if they are fertilized, but it's just one of those things you'd never really give much thought to.
ADAM GLICK: Yeah, totally.
CRAIG BOX: Probably don't try it with your chicken eggs.
ADAM GLICK: For the vegetarians out there, we discovered this one in our home a couple of weeks ago. And it's much the same thing. If you take green onions, and you just leave the bottom part-- you cut the rest of it off-- and you put it in a jar of water, the green onion actually grows back.
CRAIG BOX: Ooh.
ADAM GLICK: And so you can regrow the onions several times. And then it only takes a few days. So the next time, if you want to try a fun little experiment with your kids, or if you simply like cooking with onions, or if you like various food items that can be regrown despite you buying them in a store, all of these, same option.
CRAIG BOX: How long would it normally take you to grow a green onion?
ADAM GLICK: I don't know. I've never grown one.
CRAIG BOX: Did they have a start somewhere, or is this one of those chicken and egg things-- or duck and egg things even-- where you have to have a green onion in order to grow a green onion?
ADAM GLICK: I know there's a number of things. Like, you could take a carrot, and you can cut off the top and plant that, and you can eventually grow one. But that takes quite a while. But the green onions, literally, less than a week, and it will fully grow back. It's pretty impressive how fast they grow.
CRAIG BOX: Well, let's end on a joke then. What's orange and sounds like a parrot?
ADAM GLICK: What?
CRAIG BOX: A carrot.
ADAM GLICK: [IMITATES DRUM STING] Let's get to the news.
CRAIG BOX: Let's get to the news.
ADAM GLICK: LinkedIn is one of the largest users of Hadoop. And as they've adopted Kubernetes for AI workloads, they've suffered from a disconnect between the authentication models of the two systems. They bridged the gap with a tool called kube2hadoop, with the number 2, which has been open sourced this week.
kube2hadoop allows Kubernetes pods to get Kerberos tokens for accessing HDFS so that the jobs running on frameworks like Kubeflow can get access to Hadoop data lakes. The announcement blog post walks through the implementation and its security concerns, and the code is available on GitHub.
CRAIG BOX: MayaData, the originator of the OpenEBS storage system, has announced Kubera, a new solution for the efficient operation and management of Kubernetes as a data layer. Kubera includes logging, alerting, visualization, reporting, backups, maintenance, compliance checks, troubleshooting, and lifecycle automation.
Evan Powell, our guest from episode 56, likens OpenEBS to VMware's vSAN and Kubera to the vRealize operations software that envelops it. Kubera is software as a service, free for individual users, and with company subscriptions standing at $49 per user per month.
ADAM GLICK: The Linkerd project announced version 2.8, adding support for multiple Kubernetes clusters via service mirroring. Linkerd does not provide an ingress gateway for its service mesh, choosing to only focus on East-West traffic. That means you have to operate a completely separate product if you want to get traffic into your cluster. Ambassador would like to make the case that they should be that product and have announced a partnership with Buoyant to support the new multicluster feature.
CRAIG BOX: Ambassador is powered by Envoy, the proxy engine also used by Istio and Consul. HashiCorp recently announced beta versions of Consul 1.8, including Envoy-powered ingress gateways. Banzai Cloud, this week, has written an in-depth intro to Istio's ingress gateway, including how they have extended it to provide lightweight API gateway functionality in their Backyards distribution.
ADAM GLICK: Lightbend, the creators of Scala and Akka, has released version 2.0 of Cloudflow, a tool to enable users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes. Highlights of the release include a new operator-based installer, a new configuration system, multiple language support, a multi-image local runner, and improvements connecting to Kafka. Cloudflow 2.0 is also the name of a running shoe, but we believe it has none of the listed features.
CRAIG BOX: Google has updated its engineering internship program in response to the ongoing pandemic. This year's cohort of over 1,000 interns are working from their homes in 43 countries and contributing to open-source projects. Not surprisingly, some of these interns will be working on projects close to our hearts, including Kubernetes, Istio, and Envoy, as well as on some of the 2,600 other active open-source projects created at Google, as well as COVID-19 response efforts.
ADAM GLICK: The CNCF end user community has started a new kind of report called a Tech Radar. It's based on a similar program from ThoughtWorks, but they've removed the hold category and only cover one use case at a time. The first use case published is CI/CD, where a survey of the end user community recommends the adoption of Flux and Helm. GitLab, CircleCI, and Kustomize, with a K, are worth a trial, it states, while all other technologies are placed in the assess category.
CRAIG BOX: Meanwhile, the CNCF keeps adding special interest groups, as it marches towards its stated size of 6 or 7 SIGs. This week, they highlight the new SIG Observability, which will help move relevant projects through the TOC process and help the CNCF identify areas where no projects are currently engaged with them.
Chairs are Matt Young of insurance marketplace EverQuote and Richard Hartmann, now of Grafana Labs. You may remember Richard from his previous life as OpenMetrics founder and his interview in episode 37.
ADAM GLICK: Lukas Gentele from DevSpace Cloud has released loft, a closed-source multi-tenancy manager for Kubernetes that's built on top of his open-source toolkit kiosk. Loft lets engineers create separate namespaces to work in without impacting other users on the same cluster. It also comes with tools designed to manage costs by shutting down inactive namespaces.
CRAIG BOX: Jib 2.4 is out. Jib builds optimized Docker and OCI images for Java apps without a Docker daemon. The highlight of version 2.4 is a new extension mechanism to allow customization of the generated container build plan.
ADAM GLICK: Personally, I like the cut of their jib. Zerto announced Zerto for Kubernetes at their virtual conference last week. The new product will provide backup and disaster recovery functionality for Kubernetes applications, covering both the persistent data storage and configuration options. It will work with existing distributions as well as cloud services, and is said to provide continuous replication and non-disruptive recovery and testing.
Kudos to their marketing team for coming up with the phrase "data protection as code." And a tech preview of Zerto for Kubernetes is expected later this year with GA coming in 2021.
CRAIG BOX: Azure Kubernetes Service has released two new features in preview. Node image upgrade lets you upgrade to the new node images for Linux and Windows which they publish weekly. You can upgrade nodes across the entire cluster or per node pool.
Application Gateway Ingress Controller lets you connect your pods directly to the Azure Application Gateway similar to how container native load balancing works on GKE. This release also allows users to upgrade clusters from free to paid and brings Windows server container support to GA in China.
ADAM GLICK: Cloudera has announced the Cloudera Data Platform, or CDP, for private clouds. Joining the existing CDP for public clouds, the private version will run on top of Kubernetes provided by a partner. CDP for private clouds is currently in private tech preview for select customers and is expected to go GA later this summer.
CRAIG BOX: CloudBees, our guest on episode 44, has released a hardened version of their CI tools to help meet the US Department of Defense specifications. Their oft-renamed enterprise product, now called CloudBees CI, has been certified by the US Air Force's Platform One team against a shocking number of acronyms that we won't be spelling out here.
ADAM GLICK: Microsoft has announced the discovery of cryptojacking happening within clusters in Azure running Kubeflow. Kubeflow clusters are particularly attractive to hackers since ML workloads typically do their inference work on powerful machines and often have nodes that are connected to GPUs. Microsoft says the entry point was likely a dashboard that had permissions changed to be open to the internet. If you are concerned about your own Kubeflow applications, the article in the show notes provides commands you can run to detect if the Monero mining code has been deployed on your clusters.
CRAIG BOX: Finally, Gokul Chandra has published an extensive write-up on his experience with Google Cloud's Anthos and his perspectives on the platform. He's done a nice job of showing screenshots of the platform, giving you an idea of what Anthos is and how it works from a technical standpoint for anyone familiar with Kubernetes.
ADAM GLICK: And that's the news.
Sarah Wells is the technical director for operations and reliability at the "Financial Times," leading work to maintain and improve their operational resilience, improve developer experience for development teams, and reduce the amount of repeated work. Before this role, she was principal engineer for the content and metadata platform team at the "Financial Times," which is a role now inhabited by our second guest, Dimitar Terziev. Welcome to the show, Sarah and Dimitar.
SARAH WELLS: Hi, very pleased to be here.
DIMITAR TERZIEV: Hello. It's a pleasure to be here.
CRAIG BOX: Being based in the UK, I am, of course, familiar with the "Financial Times." I think anyone born after 1888 should at least have passing familiarity with it. But for our international audience, can you give us an overview of your organization?
SARAH WELLS: The "Financial Times" is one of the world's leading business news organizations. We've been publishing a paper since 1888. Over the more recent decade or so, we've been making a real transformation to digital first and a subscription-based business.
ADAM GLICK: As someone who grew up here in the States and sometimes has the good fortune to travel internationally, I've noticed that I often see the "Financial Times." And it always stands out to me when I get on a plane or am at a hotel because its pages are pink. Why are the pages of the "Financial Times" pink?
SARAH WELLS: People often say it's because they decided to save money by printing on unbleached paper. But it's pretty clear that it was more about standing out from the other papers that were printing. And nowadays obviously we dye the paper pink, so it's actually more expensive to print on pink paper.
CRAIG BOX: I can understand then why you'd want to move to being a digital company, given all the pink dye you must have to purchase. I understand that of the million subscribers you have now, over 650,000 of them are digital subscribers, and 70% of them are outside the UK.
SARAH WELLS: Yes, we have an international audience. If you're interested in business, wherever you are in the world, the "FT" is a place for you to come and find out about it.
It's very clear that if you're going to be successful in news, you need to have a source of revenue that isn't just advertising and isn't print because producing a printed newspaper costs much more money than showing someone something online. So we made that move to subscription, knowing that you have to have an income that isn't advertising.
CRAIG BOX: The "FT" was an early adopter of both internet publishing and the idea of monetizing that. It went metered in 2007. At that time, weren't we all about information wanting to be free?
SARAH WELLS: We were, and I think for newspapers, that's been a very difficult thing because people were very reluctant to pay for newspaper subscriptions. I think that's changed more generally now. But the "FT" always had something where we have very specialist information that is a differentiator for people and their businesses. And quite a lot of people who read the "FT," their company pays for it. So we have a lot of business-to-business subscriptions, where a company pays for 50 "FT" licenses and shares them out among the staff.
CRAIG BOX: That's how I have mine.
SARAH WELLS: OK, so we had that sense that this is something where we can have a paywall, and it's been very successful. And obviously we've gone through various iterations of how we do that, but it was always going to be easier for a company like us where we have niche content that is worth money to people to be able to use a paywall.
ADAM GLICK: I just think it's interesting. There's a number of English publications that seem to have figured out that model and been successful with it. The "Financial Times" is one. "The Economist" is another one that I think of doing that. Versus in the United States, it may be different, although one that people may be more familiar with, the "Wall Street Journal," has somewhat of that same model and probably a similar type of audience.
SARAH WELLS: Yeah, "Wall Street Journal" very definitely.
ADAM GLICK: You have a fairly large digital base, probably larger than the paper subscriber base. Is that fair to say?
SARAH WELLS: Yes, yes, I would say so. I certainly know that, as of a couple years ago, we make more of our money from subscriptions than we do from, say, advertising. So subscriptions, and digital subscriptions in particular, are important.
ADAM GLICK: What will happen if the print version doesn't go out, if most people are looking at the digital? Are we looking at a transition from paper to bits?
SARAH WELLS: The way that the "FT" does printing of newspapers has changed over the last decade. There's been a lot of cutting down, so we don't do as many different editions. We've been very carefully looking at our print side. So print is profitable for us, but that's not to say we don't think about whether we would stop doing print.
I think there's an interesting thing, which is that being a print newspaper gives you a gravitas that you may not have if you don't do it. So it's not just about cost. It's also about your reputation and what people think of you as an organization.
CRAIG BOX: I'd like to ask a little bit about the relationship between technology and journalism. So separate obviously from this conversation, I've noticed during the COVID-19 pandemic, a lot of great digital journalism. There's one guy I want to call out to that-- John Burn-Murdoch, who's been publishing trajectory graphs for coronavirus cases and deaths in various countries. Does doing digital journalism require a change in the technology stack? Or is it more that having flexibility and having a modern technology stack allows you to branch out and do different types of journalism that you wouldn't have been able to do before?
DIMITAR TERZIEV: I would say that in times like this, it's important to support our journalists to be able to experiment and deliver content in different forms and formats. So pretty much we have a mission within our product and technology team to support content innovation.
And this particular example here, for us how we support content innovation, is with something that we call live blogging. This is a medium for us to almost cover real time in-person stories and topics. And we have a live blog specifically around COVID-19.
And the technology powering live blogs, so far-- up until now actually-- was a bit outdated, and right now we're making an overhaul of how we process these live blogs. And probably, as of next month or so, we'll be processing the live blogs through the content and metadata platform. So this is actually a great example of the product and technology teams working with the journalists to deliver this high quality journalism in the best possible format for our subscribers.
SARAH WELLS: Yeah, I want to say a little thing about the team that's doing those graphs, which is we have a data journalism team, and they work really closely with a team of software developers. So we also make sure that we have people who are in the same place working closely together for things like that.
So it's software developers and data journalists. They all work on it. That particular article with all of those graphs is by far the most read article we've ever had at the "FT" because it's free to read, like a lot of coronavirus coverage, and it's just off the chart in terms of the number of page views that it's had for us.
ADAM GLICK: Do you see any interesting trends in terms of the technology that people are using to consume the "Financial Times" and what that means for how you decide what you're building? Say, when you started going digital back in 2002, I imagine that that has to be all web browsers versus today you're seeing things, I assume, like tablets and mobile devices, and how does that change how you think about what technology you're building, what the update cycle of information that you put out there, how you serve that audience?
SARAH WELLS: I've been working at the "FT" since early 2011. And at that point, we'd still just about got our editorial team thinking about web, thinking about digital first, not print first. And people would think about what it was going to look like on the website, in terms of what is it going to look like in someone's browser window?
And it took a bit of time for them then to say you can't assume people are reading this on their computer. They're reading it on their phone. They're reading it in lots of different ways. And you can't be trying to do things around how you lay out your article because you can't assume how someone's going to read it.
So I think that it always takes a bit of time to pick up the new way that people are reading stuff. It probably took us a while to think, actually, a lot of people are reading on mobile. And at certain times of the day, it's even more. The morning commute, people are not reading on laptops. Lunch time is quite a big thing for laptops.
And then, of course, it's changed completely with working from home and the coronavirus. We see different patterns about when people are reading the news and how they're reading it.
CRAIG BOX: Do you find different patterns for usage of the puzzles page?
SARAH WELLS: I don't know of patterns of usage. But what I would say-- and I know people who work for other newspapers-- is that people who do crosswords are the most dedicated group of people that any newspaper has. And if you change how puzzles work, you will get a lot of feedback.
I remember "The Guardian" changed their crossword puzzle the way that you could see those online, and there were a lot of comments. So yeah, crosswords, it's a very dedicated audience for any newspaper.
ADAM GLICK: It is amazing how dedicated. I've seen some of them that have moved some of their work online, and seeing the usage graph of what it looks like, and it is amazing. People know exactly what time that comes out, and the spike that they get when the new crossword comes out.
From an operations standpoint, to think about how do you handle the fact that 99% of the people that are going to consume the content you're going to put out are going to consume it in the first five minutes that you've made it available-- and what does that mean in terms of how do you scale for that, and how do you plan for it-- is a really interesting challenge.
SARAH WELLS: Yeah, I think when you're not doing interactive crosswords, you're probably OK. Because a lot of what we do on the "FT" is basically cacheable.
ADAM GLICK: CDN?
SARAH WELLS: Yeah, yeah. I mean, we have got quite a lot of personalization, so there is some element of not being able to cache it. But you benefit a lot from the fact that it's mostly reading content, and so you can cache it.
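To illustrate why mostly-read, cacheable content copes well with a spike like a crossword release, here is a minimal sketch of a time-to-live cache. It is not the FT's setup, which the conversation describes only as sitting behind a CDN; the class `TTLCache`, the function `render_article`, and the path `/crossword/today` are hypothetical names chosen for illustration, and personalized responses are deliberately left out.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache that serves a stored copy until it expires."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0  # how many times the origin had to render

    def get(self, key: str, render: Callable[[str], str]) -> str:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: serve the cached copy
        # Missing or expired: do the expensive work once and remember it.
        self.origin_hits += 1
        body = render(key)
        self._store[key] = (now + self.ttl_seconds, body)
        return body


def render_article(path: str) -> str:
    # Hypothetical stand-in for the expensive origin rendering.
    return f"<html>article at {path}</html>"


def simulate_spike(requests: int, ttl_seconds: float = 60.0) -> int:
    """Send many requests for one page and report origin renders."""
    cache = TTLCache(ttl_seconds)
    for _ in range(requests):
        cache.get("/crossword/today", render_article)
    return cache.origin_hits
```

Under these assumptions, a burst of requests for the same page within the time-to-live reaches the origin once. Personalized pages would need a cache key per reader or would bypass the cache, which is the limit Sarah describes.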
CRAIG BOX: When I complete the crossword, can I enter my initials into the high score table?
SARAH WELLS: We have a page which is literally, I think, people who have won the crossword. And it's very highly visited as well.
ADAM GLICK: The "Financial Times" has gone through a couple of migrations we want to chat about today, especially given the focus on the technology. You moved from monoliths to microservices, and then you moved from microservices, running those in Kubernetes as opposed to outside of Kubernetes.
Can we start with what did the stack look like before you were in microservices? What was your monolith stack?
SARAH WELLS: Basically, when I joined in 2011, it was mostly Java running on Tomcat with Apache sat in front of it. It wasn't a monolith, but they'd be fairly big systems-- 50, 60 packages in them, if you think about the modules within that. That was a general stack, but most things actually looked like that.
We moved to microservices-- pretty much everybody is doing something in microservices or serverless. But we didn't move to a single stack. We've got a lot of ability at the FT to choose your own technologies.
So while we do have teams using Kubernetes, we also have teams that are running Node apps on Heroku. We have loads of people writing Lambdas. And we have people running apps on EC2 instances. And we try mostly to be using something a bit more complicated than actually writing and deploying your application on to a server, but there's quite a breadth.
CRAIG BOX: Was there a particular kicking-off point that made you think, we need to make this change?
SARAH WELLS: There were a couple of things that happened. The first thing was the potential was there because the FT had invested in automation and provisioning. So when I first joined, it took literally 120 days for us to get a server bought, installed, set up, and everything configured so that we could put code onto it. By a couple of years after that, it had come down to 15 minutes. And it's just a massive change, and you can't do microservices until you're able to provision quickly.
So first of all, there was potential there, and there was automation, like infrastructure as code. The second thing was we started rebuilding several things at the same time. And all of those teams independently decided to do that using microservices. And that was all really based around being able to deliver small changes frequently.
And obviously, you can do that with a monolith, but there's something about microservices that makes that easier. So we had the content publishing platform, subscription platform, and our website all being rebuilt and all using microservices at the same time, which was about 2013.
ADAM GLICK: How do you decide which technologies you're going to enable people to use within the organization? You're a fairly large organization. You've got multiple development teams working on things.
How do you decide, hey, we're going to let people go and build things in Lambdas? Or you've got someone that wants to use Erlang and someone else is like, oh, I want to build this in Rust. How do you keep from just having a technology landscape that is absolutely everything? And then how do you manage that as people transition?
SARAH WELLS: Obviously, I'm now technical director for operations, so this is close to my heart. We have a lot of things. I mean, in pretty much every area, we have multiple ways of doing it. I'd say with programming languages, there's quite a common set that people use. You know, generally things are written in Node or Go or Java, and that's the main thing.
CRAIG BOX: That's pretty much all the languages there are.
SARAH WELLS: [CHUCKLES] Well, I mean, we've certainly had people write stuff in Scala in the past and stuff. That's reasonably good for it, as far as I'm concerned. We, generally speaking, have an idea of what we expect you to do even if you do move to a new platform.
So if you're building a service, you should have a health check in a particular format that we can call and wrap in our monitoring. You should be shipping your logs so that we can see it in the log aggregation service.
So it's a little bit if you're going to add something new, you're going to have to do some work to make that possible to comply with our guardrails. That doesn't stop people generally. [CHUCKLES] But it does make you think that you're not just going to be building new stuff and ignoring the important aspects of running a system, the operational stuff.
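The health-check guardrail could look something like the sketch below. The conversation does not describe the FT's real health-check format, so the JSON shape here (`service`, `ok`, `checks`) and the names `Check`, `run_checks`, and `health_response` are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Check:
    name: str                  # short identifier for the dependency
    probe: Callable[[], bool]  # returns True when the dependency is healthy


def run_checks(service: str, checks: List[Check]) -> Dict:
    """Run every probe and collect the results into one report."""
    results = []
    for check in checks:
        try:
            ok = bool(check.probe())
        except Exception:
            ok = False  # a probe that raises counts as unhealthy
        results.append({"name": check.name, "ok": ok})
    return {
        "service": service,
        "ok": all(r["ok"] for r in results),
        "checks": results,
    }


def health_response(service: str, checks: List[Check]) -> Tuple[int, str]:
    """Return an HTTP status code and JSON body for a health endpoint."""
    report = run_checks(service, checks)
    status = 200 if report["ok"] else 503
    return status, json.dumps(report, sort_keys=True)
```

Keeping the format identical across services is what would let a central monitor poll any new platform without custom work, which is the kind of guardrail described here.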
CRAIG BOX: And one of your biggest platforms inside the "FT" is the content and metadata platform, which you both have worked on in the past. Dimitar, can you give us an overview of what that content and metadata platform entails?
DIMITAR TERZIEV: The platform actually is mainly focusing on processing the digitally published content that we have here in the "Financial Times." And the other important bit here is actually something that we call a knowledge graph. It's just a representation of concepts that we use to annotate the content with.
So pretty much what we do is just making sense of the content once we publish it. This is what the platform does actually. And it's transforming the content into some forms just to fit the needs of the website, while also giving it to different consumers. They can be customers which are just pulling the content, or they can be visualizing the content on the website or in the web publication.
CRAIG BOX: Two years ago, Sarah gave a KubeCon talk which talked about the migrations that the "FT" had. And so we're going to revisit a lot of that talk and how things have gone in the two years since.
At the time, it was mentioned that there were 150 microservices that made up the content platform, running on 650 instances. Looking back, was 150 microservices the right number? Monoliths are fashionable again. Did the number of people that you have working on it suggest that that's the right number? Or do you think, swings and roundabouts, you may have done more or less if you were doing it again today?
SARAH WELLS: Really interesting talking to people about how many microservices you would expect to have as part of a system. And I've found that very often people have 20 or 30 services making up some kind of a system. So it feels as though we are a little high in the numbers of them, but then you look at someone like Monzo. Monzo have 1,500 microservices.
CRAIG BOX: And if you look at their graph, it kind of looks more like one of those splatter paint diagrams than anything architectural.
SARAH WELLS: Amazing, isn't it? But they will absolutely swear this works for them. Now, they have a much more standard approach to all of that stack. So if you deploy a service at Monzo, you do it in a particular way, and we don't.
What I think is interesting is while we were starting to build the content platform we went back and forth, and we combined some microservices and split others out. Arguably, if I look back on it, I think there were probably too many. But I think you hit the problems as soon as you have 20-odd microservices of operational complexity, the need for automation.
So after that, exactly how many microservices you have is really down to if you find they change at the same time, you might combine them. If you're always releasing them in the same order, you might combine them. But I don't actually think it makes a lot of difference. If you're getting to the point where you can release small changes independently through whatever you've chosen, then I think you're in a good position.
And it's interesting. In some of the talks about "we've gone back to the monolith," actually, what they were struggling with wasn't microservices. They were struggling because they hadn't found the right lines to separate the work that they were doing.
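The "if they change at the same time, you might combine them" heuristic can be checked against release history. The sketch below is one possible way to do that, not a method described in the conversation: the function `co_change_candidates`, its thresholds, and the service names in the usage note are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def co_change_candidates(
    releases: Iterable[Set[str]],
    min_ratio: float = 0.8,
    min_releases: int = 3,
) -> List[Tuple[str, str, float]]:
    """Find service pairs that are nearly always released together.

    Each item in `releases` is the set of services changed in one release.
    """
    service_count: Counter = Counter()
    pair_count: Counter = Counter()
    for changed in releases:
        service_count.update(changed)
        pair_count.update(combinations(sorted(changed), 2))

    candidates = []
    for (a, b), together in pair_count.items():
        # Fraction of releases touching either service that touched both.
        either = service_count[a] + service_count[b] - together
        ratio = together / either
        if together >= min_releases and ratio >= min_ratio:
            candidates.append((a, b, round(ratio, 2)))
    # Strongest coupling first, then alphabetical for stable output.
    return sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))
```

For example, with a history where hypothetical services `ingester` and `transformer` appear together in four releases and `api` mostly changes alone, only the `ingester` and `transformer` pair is reported. A high ratio is a prompt to look at the boundary between the two, not proof that they should be merged.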
ADAM GLICK: As you moved to microservices, did you focus more on defining interfaces and how things interacted and spend less time defining what people should build and how? Or did you find that both of those were necessary in order to really start to create a microservices set of applications at scale?
SARAH WELLS: I think at the FT, you'd find that the sets of microservices that are worked on by particular teams and make up a particular system tend to have a particular deployment style, languages, and technologies. So the website is built in Node and runs on Heroku, and the content platform is in Go and Java and running on Kubernetes.
So it's tended to be more the boundary between those systems which is where you have a big change in technology. And that's because you've got a bunch of people who are working on stuff, and they probably need to be able to move around all the services without having to learn a new language. But my current team writes stuff in Go and in Node because we're doing some stuff with Prometheus. All of the connectors we had there were in Go, so you don't want to try and decide to use Node if it's easy enough to just match the existing libraries.
ADAM GLICK: You adopted Docker fairly early on-- back in about 2015, if I recall-- and back then, container orchestration was a very different landscape. And you had actually gone and built your own orchestrator. I think this was right before Kubernetes 1.0 came out. Can you describe what your platform, or orchestration, how you managed your containers at that point?
SARAH WELLS: Yeah. I wouldn't recommend building your own cluster orchestrator, but there weren't really alternatives that provided what we needed. So we did look at things at the time.
CRAIG BOX: I think for the timing, we can forgive you.
SARAH WELLS: [CHUCKLES] Yes. So we basically used CoreOS because it's designed to manage deployment of containers, and we looked at existing current standard Linux tools and some cluster-aware stuff alongside that. So we were using systemd to define our services and how to run them and Fleet to do that across a cluster.
And we did, I guess, routing, for an American audience, using vulcand, storing the configuration for that in etcd. And we built our own deployer, and it was written as a microservice in Go.
And basically, we had some YAML files that said these are our services; here's how many instances we have; here is the fact that this one needs to be sequentially deployed. And you would basically update the version and check it into Git, and the deployer would see that that had changed and deploy that version of the container into our stack.
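The deployer idea described above (a declarative list of services, instance counts, and a sequential flag, with a version bump in Git triggering a deploy) can be sketched as follows. This is not the FT's deployer, which was written in Go and whose file format is not given. The field names `version`, `instances`, and `sequential`, and the functions `parse_services` and `plan_deploy`, are assumptions, and the input is taken as already-parsed dictionaries rather than YAML so that the example needs only the standard library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceSpec:
    version: str
    instances: int
    sequential: bool = False  # True: replace one instance at a time


def parse_services(raw: Dict[str, dict]) -> Dict[str, ServiceSpec]:
    """Turn parsed config (hypothetical field names) into typed specs."""
    return {
        name: ServiceSpec(
            version=str(fields["version"]),
            instances=int(fields["instances"]),
            sequential=bool(fields.get("sequential", False)),
        )
        for name, fields in raw.items()
    }


def plan_deploy(
    previous: Dict[str, ServiceSpec],
    current: Dict[str, ServiceSpec],
) -> List[List[str]]:
    """Return batches of steps; steps within a batch may run in parallel."""
    batches: List[List[str]] = []
    for name in sorted(current):
        spec = current[name]
        if previous.get(name) == spec:
            continue  # nothing changed in Git for this service
        steps = [f"{name}@{spec.version}#{i}"
                 for i in range(1, spec.instances + 1)]
        if spec.sequential:
            batches.extend([step] for step in steps)  # one at a time
        else:
            batches.append(steps)  # all instances together
    return batches
```

Comparing the previously deployed state with what is now checked in means an unchanged commit produces an empty plan. The sketch ignores services removed from the config, which a real deployer would also need to handle.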
ADAM GLICK: How did you make the decision to build your own platform to do this versus waiting for the industry to do it? At that time, there were some early orchestrators that were out there, some of them open source. So the industry was heading that direction, but you made the decision rather than wait, it was more important to go. How did you make the decision that that was the right trade-off?
SARAH WELLS: We had to do something because we were building this microservice architecture, and what we had in place as a platform at the "FT" very much expected you to deploy an application onto its own VM. So when you start thinking about having 600 VMs, even at the smallest possible size of them, you're going to be wasting a lot of money to do that. So we started thinking about, well, obviously we could start having multiple services running on a single VM, but then you still have to address them, you have to find them, and then containers seemed like the obvious solution to that.
I think we were aware that we were early adopters, and it definitely cost us in terms of the time it took to build things and to run it. Running your own thing is difficult because you can't ask anyone else. You can't look it up on Stack Overflow. You're basically reliant on someone having worked out what it means.
But I think that that option worked for us and made sense. It was worth taking that cost, given the situation we were in and the timing.
ADAM GLICK: At some point, you decided that you weren't in the business of building an orchestrator and you moved over to Kubernetes. And you've talked publicly about your use of what you've called innovation tokens as a way to pick the bets of what you're investing in when it comes to new technologies. Can you describe that system?
SARAH WELLS: This comes from Dan McKinley, who used to work at Etsy. He basically said you only have a certain amount of effort to be doing brand new stuff. You don't want to be wasting that on something that you're excited about, but that isn't important to your business aims. So you can't afford to take on loads of new challenges at the same time because you won't be successful.
So if you're going to build a cluster orchestration layer, what are you not going to do? Is it going to affect your ability to do other stuff? So you should probably try and only do a couple of those things or one of those things at a time so that you're not taking on the risk of many complicated things that could potentially fail.
So he wrote this great blog post saying use boring technology. People use Postgres because it works. Think carefully before you decide that you need something different.
ADAM GLICK: If tokens are how you pay for it, or kind of a stand-in as you figure out essentially what the opportunity costs are, how do you decide on which projects to fund, assuming that there is a fixed amount of those innovation tokens to spend?
SARAH WELLS: [CHUCKLES] I don't really think of the innovation tokens in relation to what we fund. It's more about, given that you've got funding for a particular business outcome, how do you approach solving that problem?
So I don't really want to talk too much about how the "FT" thinks about what they fund because it's like every company. It's very complicated, and you spend a lot of time talking about it. But once you have a team and you've decided that you want to tackle a particular type of functionality, how you decide that you're going to deliver that outcome is where I think the innovation token comes in.
CRAIG BOX: The content platform adopted Kubernetes in 2017. At that point, there wasn't exactly any mature technology in terms of deployment to AWS, which was your cloud provider at the time. Amazon didn't launch their EKS service until 2018. How did you factor the manual administration of the platform in the decision to adopt the technology?
SARAH WELLS: I think it all depends what you've been doing before. If you've been running your own platform and you move to Kubernetes, and it is so much better, then that's fantastic. If you were a team that was used to deploying to a PaaS and adopted Kubernetes, you were taking on a lot of new complexity.
But for us, it was a simpler version of stuff we were already doing. And I think we basically thought, well, the direction of travel is obviously managed Kubernetes, but we'll wait until we have something that we think is going to be ready for that. And how early an adopter do we want to be on everything? Can we just wait until something's been tried by other people for a little bit?
CRAIG BOX: Two things from your KubeCon talk that I'd like to follow up on-- the first, you said that one of the major costs of the migration was needing the ability to deploy to both the old and the new stack at the same time. When the migration finished, did you change? Were you able to now deploy only to one system? Did you think differently about the way you did deployment? Or did you just turn off the old system at that point?
SARAH WELLS: Pretty much. We just turned off the old system. The complexity was just if you're going to deploy to two different stacks, how do you know that they're both OK? Because the code that we were deploying had different paths to it depending on which stack it was deployed to. And at some point, you are focusing more on one stack than the other. But if one is your live stack, and the other is the one that's going to be live, you have to know that the code is working on both.
And while we were doing this work, we were delivering new features. The other stuff didn't stop. So you're basically doing this whole migration while also everything is changing. So we wrote loads of new services during this period, which also had to be able to be deployed on two stacks.
So I would have to say that if we could have stopped and had more people working on the migration, that would have been a good thing to do. But it's really hard to ever say, look, we're just not going to deliver features for three months while we move our stack.
CRAIG BOX: The other thing that I wanted to check in on is that you did this migration in 2017, and you said that you planned to break even in three years. Three years later, how did you do?
SARAH WELLS: [CHUCKLES] So this is a really interesting question because nothing stays the same. It's extremely hard to look at the end of three years and say, oh, we definitely saved this much money because so many other things change. We moved the platform from being developed by a team in London to a team in Sofia. So the notional costs of work changed during that. We changed the stuff we were building in terms of features.
So I think I'm pretty comfortable that we did save money over the alternatives we had available to us at the time. But whether I could actually produce a spreadsheet and prove that, probably not.
ADAM GLICK: Dimitar, you're responsible for the content platform now, since you've joined the "Financial Times." What has changed since Sarah's KubeCon talk back in 2018?
DIMITAR TERZIEV: I would say that we are still embracing the microservice architecture for sure. We have now probably 160 or 170 services. But we are still making use of the great work that Sarah's team has done actually, and this approach to continuous deployment actually allowed us to make almost 350 deployments the last month or so.
So for us, right now, it's very important to move quickly and get this instant feedback of the things that we're doing. And pretty much what we have right now is exactly the thing that we want to have in terms of continuous integration, continuous deployment, and delivering value and work closely with the other teams, helping them achieve their goals.
ADAM GLICK: So what are you spending your innovation tokens on now?
DIMITAR TERZIEV: [CHUCKLES] Since the platform is quite mature and stable, I would say that we are now mainly focusing on delivering product value and pursuing our goals in terms of delivering new ways for our customers to read content, improving what we have and thinking of powering new products.
CRAIG BOX: There are new, out-of-the-box things you can install on top of a Kubernetes platform that didn't exist back then. I'm thinking of service meshes or serverless platforms. Are any of them in use or on your radar?
DIMITAR TERZIEV: Currently not, actually, but we are constantly looking for things that we need to improve. To be completely honest, right now, for us the journey for moving from managing Kubernetes on our own to using managed Kubernetes solutions such as EKS was a bit of a long one. And one of the lessons that we've learned is that we need to find a way to make this more sustainable because currently, in the FT core, we have three big teams making use of Kubernetes, and it's important for us not to spend too much effort doing one and the same thing.
And as I mentioned earlier, the journey for moving to the managed Kubernetes solution was an interesting one. We've learned a lot, and now we've created something that we call Kubernetes guild within the FT core just to think of the best way moving forward and potentially spending the innovation tokens very wisely for the future.
CRAIG BOX: It sounds very City of London-- "The Worshipful Guild of Kubernetes".
DIMITAR TERZIEV: [CHUCKLES] Yeah.
SARAH WELLS: The thing about service meshes though that was interesting to me was that because we had to solve a lot of the same problems ourselves, you get to the point where you've missed the point where all the value would come. If you've already done back-off and retry and tracing and everything and built it into your services in some way, then it's actually harder to think about how you're going to take that out and add a service mesh in.
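For readers unfamiliar with what "back-off and retry built into your services" looks like, here is a minimal sketch of retry with capped exponential back-off and jitter. It is an illustration under stated assumptions, not the FT's code: the function names, default delays, and the injectable `sleep` and `jitter` parameters are all hypothetical choices made for this example.

```python
import random
import time


def backoff_delays(count, base=0.1, factor=2.0, cap=5.0):
    """Return `count` delays that grow exponentially and never exceed `cap`."""
    return [min(cap, base * factor ** i) for i in range(count)]


def call_with_retry(func, attempts=4, base=0.1, factor=2.0, cap=5.0,
                    sleep=time.sleep, jitter=random.random):
    """Call `func` until it succeeds or `attempts` calls have failed.

    Between failures, sleep for the exponential delay scaled by `jitter()`
    (a value in [0, 1) by default) so that many clients retrying at once
    do not all hit the downstream service at the same moment.
    """
    delays = backoff_delays(attempts - 1, base, factor, cap)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt] * jitter())
```

Logic like this, repeated in every service, is the kind of thing a service mesh can take over, which is why already having it in place reduces the incentive to adopt one.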
CRAIG BOX: Was the same not true of the system you built that orchestrated your containers before Kubernetes?
SARAH WELLS: Yes, but the problem with that is that there was a lot of pain in operating. And we had made the decision that this is not something we should be spending our innovation tokens on.
Cluster orchestration is not a differentiator for the FT. So we felt we should move onto Kubernetes once it was ready in the same way that we felt we should move on to EKS once we thought that was better than what we had. And we'd made that decision, and we were starting work on it, and then Fleet was declared end of life.
So actually, we had no choice. The platform we had built was built on something that was going to go away a year later. So I was quite comfortable that we'd already made this decision, but we then had a deadline to do it by.
CRAIG BOX: What does the split between developing and operating your service look like at the "FT"? Do your developers run their services in production?
DIMITAR TERZIEV: Yeah, I would say so. This is one of the great things, at least from my perspective, in "FT," that we as engineers are responsible for the whole development cycle. We develop microservices, the applications. We test them, deploy them, and actually maintain them in production.
And I just want to say that one of the teams here in "FT" customer products, they have a mantra, which is write software that you can confidently fix at 3:00 AM in the morning. Luckily, this doesn't happen very often, but it's important for us to understand the whole pipeline, what it means to ship software and help support it in production.
And the other great thing which gives us confidence to do so is that the "FT" actually has a blame-free culture, and whenever an incident happens, we work very well, and there is actually no pressure from anyone, and we are just focusing on giving the best of us and trying to fix the problem at hand. And this is again thanks to Sarah's team and guidance.
SARAH WELLS: [CHUCKLES] It's a whole "FT" technology culture, but it's absolutely essential. The two things of you build it; you run it, and we are focused on getting things working rather than trying to say who did that? It's critical.
ADAM GLICK: Sarah, when you started this project, you were the platform tech lead for the content platform. Since then, you've become the technical director for operations and reliability. Your platform and your purview goes much wider than Kubernetes now, I assume?
SARAH WELLS: Yes. I think, just to show you what it's like to build and operate your own cluster orchestrator, it's led to me moving into operations full-time because it was such a challenging thing. I learned so much. So now, my responsibility is around making sure that the FT is set up to effectively operate software and to build things that can be fixed at 3:00 in the morning.
CRAIG BOX: So it's not so much that you're on call for the software, but it's that you provide the things for the developers to be able to be on call themselves?
SARAH WELLS: We have got a first-line operations team who work in my group, and a lot of what they're doing is trying to work out which team needs to be called, because one of the things about a microservice architecture is trying to work out, when we can't publish content, which of the 12 services that are involved across three different teams is the cause of that problem. So my team is spending a lot of time on helping surface that information. So it's about the support for the teams that would have to be called.
But one thing we do that I think is quite important is for every service, we talk very carefully with people about whether it needs to be fixed. If it does break after hours, is it something that's essential that needs to be fixed now? Or can it wait until the next day? So we've got a very clear split between things that would involve someone being woken up and things where maybe we'll try and do something overnight, but we'll wait until the next working day to fix it.
CRAIG BOX: Can I get Prometheus metrics from your printing press?
SARAH WELLS: No, you can't, but I do think we do have some metrics from our print sites available in our [INAUDIBLE] through Prometheus, in our monitoring stack.
CRAIG BOX: One of the discussions I hear, kind of as an ongoing conversation, especially as you get developers that are running things in production, is operations something that as an independent discipline will go away and that development will own full stack operations as well as creation? And yet, you've moved from development into an operations role. What's your perspective on that kind of debate as to what does operations look like in the future, and is it just part of development, or is it a separate discipline that manages and does other things?
SARAH WELLS: When I think back to when I started as a developer, I wrote code. That's pretty much what I did, and I packaged it up, and I checked it in, and someone deployed it. And when I was working in the content platform, probably half your time was on things other than writing code. It was setting up new services or basically maintaining them.
CRAIG BOX: Checking the crossword service was still working?
SARAH WELLS: Yes, exactly. So I think that as a developer, you now have a whole range of skills, but you can't expect everyone to be experts in every skill. We talk a lot about T-shaped engineers at the "Financial Times," so the idea that you have depth in some areas but breadth across others. So I think there is a role for people who are experts in building and operating platforms, who are experts in knowing how to do effective monitoring. So that idea of SRE, that idea of specialists is not going to go away.
So I think in the same way that we have experts in security who provide guidance to everybody on how to be secure, we expect every engineer to build secure systems, but we also expect there to be people who can help you learn how to do that well. I think that role, where you are spending time teaching people how to build observable systems and operate systems effectively, replaces operations.
CRAIG BOX: How do you feel the day two operations experience of Kubernetes has developed as a platform?
DIMITAR TERZIEV: Pretty much the observability and monitoring of the system is a shared responsibility actually because we monitor the system throughout the day, and the ops team that Sarah manages is monitoring the system out of hours. But pretty much we tend to follow one and the same approach, as Sarah mentioned in the beginning, of how we deploy things, how we build things and engineer them so we can have a common understanding and common way of monitoring things here in "FT." And actually, we are also building a lot of internal tooling which helps us in this direction.
In terms of working with Kubernetes and what we want to do from here, what we have right now is actually, as I also mentioned earlier, a platform which is at a very stable state, but we are constantly trying to make improvements, whether they be in the continuous integration aspect or continuous delivery. Right now, we are thinking of adopting GitOps as an approach. And pretty much these are the main things that we're now focused on, to be completely honest.
ADAM GLICK: Do you have a single observability stack across both your Kubernetes and non-Kubernetes services?
SARAH WELLS: Yes. So I moved to this new role two years ago, and the first thing we did was to rebuild the way that we did monitoring and observability. As with everything else at the "Financial Times," we have a lot of different monitoring software. And at the time we were surfacing it within a dashboard built using Dashing, which is a thing that Shopify built which would give you a nice set of tiles. But it gives you no history, and the only place that everything was aggregated was in the browser.
So we wanted somewhere where we could aggregate it all together that wasn't the browser, so we introduced Prometheus. And we basically suck in metrics from every other monitoring system and surface it through Prometheus, and we have some services that sit on top of that. So that includes information from health checks from the services across the whole "FT" estate, network stuff through SolarWinds. We suck in stuff from Grafana alerts, CloudWatch alarms. Basically anything that we have that is monitoring something, we can export it into Prometheus and surface it.
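To make "export it into Prometheus" a little more concrete: Prometheus scrapes metrics over HTTP in a simple text format, so results from other monitoring systems can be re-exposed as labelled gauges. The sketch below renders such a payload by hand. It is an assumption-laden illustration, not the FT's exporters: the metric name `service_healthy`, the `source` and `service` labels, and the example service names are all hypothetical, and a real exporter would more likely use an official Prometheus client library than build strings.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_health_gauge(results, metric="service_healthy"):
    """Render (source, service, healthy) tuples as one Prometheus gauge.

    `source` names the upstream monitoring system the result came from;
    the gauge value is 1 for a passing check and 0 for a failing one.
    """
    lines = [
        f"# HELP {metric} 1 if the last check passed, 0 otherwise.",
        f"# TYPE {metric} gauge",
    ]
    for source, service, healthy in sorted(results):
        labels = (f'source="{escape_label(source)}",'
                  f'service="{escape_label(service)}"')
        lines.append(f"{metric}{{{labels}}} {1 if healthy else 0}")
    return "\n".join(lines) + "\n"
```

Keeping the originating system in a label is one way to get a single queryable stack while the checks themselves still come from many different places.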
So it is a single stack, but it comes from a lot of different places, and I think that's just being pragmatic really. So we still have Nagios boxes-- not very many, but that's one of the things that we monitor.
CRAIG BOX: If you move offices, just leave them behind and hope no-one notices.
SARAH WELLS: [CHUCKLES] Yeah. So we did move office last year. And one of the things was when we move office, we are not going to have a server room in the new office. And we managed that, but it took a long time.
CRAIG BOX: Well done. To wrap up, what advice would you give someone who is a little later on this journey than you are, who's maybe looking at taking their steps towards cloud and microservices today, with the hindsight that you have from the tooling plus the experience that you have from the evolution of the available off-the-shelf software since you did the migrations?
SARAH WELLS: One of the guardrails we have at the "FT" is buy rather than build unless it's something that is really critical for us as an organization. Whereas three or four years ago, four or five years ago, there weren't things that we could just install and run in lots of these areas, now there are. So I don't think you should be spending your time doing something complicated with platforms.
You need to have something that's managed for you. That would be my advice. Use a managed platform so you're spending your ingenuity somewhere other than basically running that platform.
DIMITAR TERZIEV: I would say that from our perspective, I guess it's important to understand all the options that you have and choose the right one for you and for the thing that you are building. Sometimes people tend to ride the hype train, and it's not necessarily important for you to use the latest technology. Just check what works best for you, what your development team will be most comfortable working with. This will definitely pay in the long run.
ADAM GLICK: Dimitar, Sarah, it's been wonderful having you on the show.
SARAH WELLS: Thank you for having us.
DIMITAR TERZIEV: Thank you very much. It was a pleasure.
ADAM GLICK: You can find Sarah on Twitter at @sarahjwells and Dimitar on Twitter at @dimityrterziev. You can also find and subscribe to the "Financial Times" at FT.com.
CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at @KubernetesPod or reach us by email at firstname.lastname@example.org.
ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes as well as links to subscribe. Until next time, take care.
CRAIG BOX: See you next week.