#145 April 8, 2021

Weaveworks (part 2), with Alexis Richardson

Hosts: Craig Box, Justin Garrison

We conclude our two-part conversation with Weaveworks co-founder Alexis Richardson, picking up when the company received Series A investment in December 2014. Since then, they built projects like Scope, Cortex and Flux as well as SaaS offerings based on them. We also look at Alexis’s role in the founding of the CNCF.

Please be sure to listen to the first part before this one!

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box, with my very special guest host Justin Garrison.

[THEME MUSIC]

CRAIG BOX: Do you have any favorite educational YouTubers?

JUSTIN GARRISON: I have way too many to list. I'd say some of the ones that I watch pretty frequently are, Film Riot has a lot of really cool stuff, which is about making movies and editing, which I used to love doing in a past life, and also now currently again. Mental Floss is great for those educational-- thinking about different things. And recently, I started finding a lot of animation channels. There's Animator Island TV, which is really fun, which is if you want to learn how to draw, and make cartoons, and just doodle. It's a really interesting format to be able to see how to put things together-- different tools, different software-- and just go with it and see the fundamentals of animation, which is really fun.

CRAIG BOX: I ask, of course, because you put together a talk for the last virtual KubeCon in the US, which was a little different to the average conference talk. It was basically a three-part play, or a tragedy perhaps, but acted out rather than just spoken to a camera. I thought it was very refreshing to see someone actually take the constraints of this new format and make them something more interesting than simply what you would have done if you were giving a webinar or a presentation in the room.

JUSTIN GARRISON: Yeah, I had a lot of fun with it. And it's been an idea I had for a long time, partially for myself, just because, even in college, I was doing writing for film and TV as a minor for a little while. I never pursued it. But it was just an interest of mine always, how movies get made, and television, and this idea that, in a virtual environment, it really doesn't matter if someone's on stage with slides, but we can educate people in ways that aren't always obvious. And I watch a lot of TV and movies, and I learn a lot of things, and I thought, hey, why can't we do that for conferences too?

And also, I don't get sort of the stage fright butterflies of not being able to speak properly in front of people, or type when I have hundreds of people watching me, because I can do all of that up front at my own time. And I think it's a really freeing format, just to be able to have these pre-recorded talks. And as much as some people don't like giving them or don't like attending them, I'm really enjoying playing with that format, as an ability to engage more people, and allow more people to do it.

CRAIG BOX: Strangely, I think it might be easier for people to do. Because when you're cutting between scenes, you can hide a lot of sins. You can basically do two or three different takes of something and take the best one, whereas a lot of people are sitting in front of their webcam just recording the thing beginning to end, and they have to take it warts and all.

JUSTIN GARRISON: Yeah, and you can add a lot of other things that are kind of unspoken in a traditional "here's a webcam in my face." We're acting out things and kind of leaning into that transition between different things. I had a friend come and do a drone shot with me for one of the scenes, which, for me, was like a throwback.

I wanted to kind of get "Shawshank Redemption." I wanted to get that scene of "Shawshank Redemption" where he's free. That was my inspiration for these things. Whether anyone connected those dots doesn't really matter. I was having so much fun with it, just to be able to be out there, and have a different view of what's going on, and to teach in a different way.

CRAIG BOX: I'll encourage anyone who hasn't seen Justin's talk to check the video out in the show notes. If you want to listen to the podcast interview we did, it was way back in September 2018, episode 20. One thing in your video was the idea that you have a film within a film, and you wanted to put this in a theater. Have you actually been able to go to a theater since that was made?

JUSTIN GARRISON: Nope. That was the last time. Filming that outside the doors was the last time I was at any theater, and I never went inside. So it has been a while.

CRAIG BOX: Are you interested, perhaps, when they do reopen, just to play it in there? Just hire it out for a day and put your own film up, just to say that you did?

JUSTIN GARRISON: For sure. There are some smaller theaters that rent out space and screens for private parties or for whatever you want to do. So it's definitely been something that I've thought about. So maybe KubeCon, if it is in person in Los Angeles, we could do something at a theater if they're open again.

CRAIG BOX: Until then, I have heard that there are some theaters in the UK that are being used as coworking spaces. People are so desperate to get out of their houses that we've all been locked in for so long that they will go and sit socially distanced in a theater in order to get some work done.

JUSTIN GARRISON: Lots of people work in front of their TVs. Why not just make the TV bigger?

CRAIG BOX: That's the plan. Should we get to the news?

JUSTIN GARRISON: Let's get to the news.

CRAIG BOX: Kubernetes 1.21 is out, quite possibly by the time you're hearing this podcast. You'll hear all about the release next week when we talk to release team lead, Nabarun Pal. As a result of the recent "up or out" policy for moving features to stable, CronJobs and pod disruption budgets, both introduced in Kubernetes 1.4 in 2016, are finally declared GA.

On the other hand, pod security policies are formally deprecated, as the community decided the existing implementation couldn't be taken to GA as is. A new version is under development. And the current feature will be removed in Kubernetes 1.25.

JUSTIN GARRISON: KubeVela, the Kubernetes implementation of the Open Application Model, has released 1.0 this week. KubeVela lets a platform team run a PaaS-like system for its end users, with APIs like application, environment, and template. KubeVela is built by Alibaba, who developed the OAM spec in association with Microsoft.

CRAIG BOX: The Argo Project released new versions of two of its DevOps components this week. Version 3.0 of the cloud native workflow engine Argo Workflows adds an improved UI, the ability to run high-availability controllers, and enhancements to artifact management. Argo CD 2.0 introduces a rich notifications framework, cross-cluster application management with application sets, and a new pod view in the UI.

JUSTIN GARRISON: The team at Isovalent, represented by their open-source chief and recent guest host Liz Rice, have launched networkpolicy.io, a community site dedicated to the pod firewall description language used in Kubernetes. The site contains a new tutorial and examples, as well as a link to Cilium's network policy editor. The announcement also points out that the term firewall was first used in computing by the 1983 film "WarGames."

CRAIG BOX: IBM's serverless Code Engine is generally available this week. The managed multi-tenant runtime is built on open source projects, including Kubernetes, Knative, Istio, Tekton, and Paketo Buildpacks. It's charged by memory, CPU, and HTTP requests used, with a free tier.

JUSTIN GARRISON: Also launching a Knative service this week is VMware, with a public beta of Cloud Native Runtimes for VMware Tanzu. The platform includes the serving and eventing components, and is available for people running Tanzu Advanced Edition. Meanwhile, VMware's security business unit has added container support to Carbon Black Cloud, and new capabilities including prioritized risk assessment with a security posture dashboard.

CRAIG BOX: Also generally available this week is Cisco's Intersight Kubernetes Service. Intersight is a common platform for visualization, optimization, and orchestration for apps and infrastructure, handling firmware updates and OS and hypervisor installations. It now adds Kubernetes cluster management, built on top of the Cisco Container Platform, launched in 2017.

JUSTIN GARRISON: Tetrate, a company founded by early members of the Istio team at Google, has launched Tetrate Service Bridge to general availability. The product uses Istio under the hood and layers on capabilities for centralized management, multi-tenancy, audit logging, service inventory, configuration safeguards, and more. It runs on any compliant Kubernetes cluster and also supports adding VM and bare metal workloads to your mesh.

CRAIG BOX: Some new updates have been added to Azure Arc-enabled Kubernetes, the feature where you connect an external cluster to Microsoft for management. You can now use Azure Monitor Container insights and Azure Defender for Kubernetes with connected clusters. And you can use Cluster Connect to securely access clusters that aren't publicly available. Microsoft also announced a preview of an Open Service Mesh add-on to Azure Kubernetes Service, having gone out on their own in the service mesh space.

JUSTIN GARRISON: The CNCF has published a project journey report for etcd, the eighth such report for one of its graduated projects. Since joining the CNCF, the project has seen a 33% growth in companies contributing and a 66% rise in individual contributors. Geographic diversity of contributions expanded from 12 countries in the first year as a CNCF project to 26 during the second year.

CRAIG BOX: Developer Ben Dixon has published a detailed guide to single sign-on for Kubernetes, using Keycloak and OpenLDAP. Across nine posts, he explains how to install the auth system and use it to provide access to Kubernetes and Nginx Ingress, a Git server, and a container registry.

JUSTIN GARRISON: Finally, this week, we pour one out for Apache Mesos, which is moving itself to the attic due to a lack of maintainers. The attic is a term for Apache Foundation projects that are no longer actively maintained. The move follows a similar archival of Mesos framework Apache Aurora in February 2020. The Mesos project could continue as a fork on GitHub if the community wants to continue development.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Last week, we spoke with Alexis Richardson, co-founder and CEO of Weaveworks. The conversation was fantastic, and so we thought we'd bring it to you in whole over a two-part episode. So it's my pleasure to welcome back to the show Alexis.

ALEXIS RICHARDSON: Thank you very much, Craig.

CRAIG BOX: You said before that the VC started knocking on your door around this time. In December 2014, you took a $5 million Series A round. What was your pitch to those investors as what Weaveworks would do?

ALEXIS RICHARDSON: We told them that we would build management solutions for container applications. And we thought that networking would be a way to gain insight into how those applications were behaving. Because we would monitor usage of the network, and we would manage the distributed application based on the network construction. And what we found out, Craig, was that it's not a very good way to manage, it turned out.

CRAIG BOX: Right.

ALEXIS RICHARDSON: There were a couple of other companies-- there was a company at the same time called Netsil that you may remember that thought you could do the same thing. And the answer is, managing and monitoring networking is useful for application management and monitoring, but there are other ways of doing it that are probably more attuned towards solving problems for applications and clusters. And so we were disadvantaged by focusing too much on networking and had to quickly move up the stack, which is where we started to invest in other tools like Scope, and then Cortex, which is a Prometheus backend.

CRAIG BOX: So I was going to ask about those two tools. Was Scope a tool that you built to do that network monitoring?

ALEXIS RICHARDSON: Scope was a tool that we built because we thought that containers are so complicated that people would want to see what's going on inside their application. And they would want to click on parts of it and go, let me monitor what's happening here, to see a bottleneck or something. And also, the original version of Scope was that we would have management as well as monitoring. So we had an interactive component in Scope, which meant that you could write into the application and read from it, as well as just doing monitoring.

But what we found was, monitoring a real-time application was less useful if you didn't have a monitoring system like Prometheus attached to it. And so we were early adopters of Prometheus in Kubernetes, which we brought in to connect up Scope and provide people with metrics associated with what was in Scope. Yeah, that was kind of the original motivation.

CRAIG BOX: And then you mentioned Cortex. And that's a storage backend for Prometheus. And I would have assumed that came out of your need to store metrics from either your Scope customers or users or the Weave cloud service that you were building around the same time.

ALEXIS RICHARDSON: We wanted to monetize developer teams who were using containers to build apps. And so we wrote this tool which provided them with some management and some monitoring around the distributed application. That had to have Prometheus as a service component, because otherwise we would have to run one Prometheus for every user, which is very expensive.

We wanted it to be cheap, like Datadog is cheap, at least at the entry tier. So we wrote Prometheus as a service, which is still going really well. And actually, that code has been adopted by many, many people, including Amazon most recently, with their Prometheus launch.

The Cortex backend lets you do a multi-tenanted SaaS of Prometheus. Because instead of running one whole Prometheus for you, per user, it runs what are called session collectors. And then they write back into a store.
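To illustrate the multi-tenancy idea in that last remark, here is a minimal sketch of one shared store serving several tenants instead of one monitoring instance per user. It is not Cortex code; the class `MultiTenantStore`, its methods, and the tenant and series names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class MultiTenantStore:
    """Toy shared metrics store: one process serves many tenants.

    Hypothetical illustration only; this is not how Cortex is implemented.
    """

    def __init__(self):
        # tenant_id -> series name -> list of (timestamp, value) samples
        self._series = defaultdict(lambda: defaultdict(list))

    def write(self, tenant_id, series, timestamp, value):
        """Append one sample to the given tenant's series."""
        self._series[tenant_id][series].append((timestamp, value))

    def query(self, tenant_id, series):
        """Return only the samples that belong to this tenant."""
        if tenant_id not in self._series:
            return []
        return list(self._series[tenant_id].get(series, []))


store = MultiTenantStore()
store.write("team-a", "http_requests_total", 1, 10.0)
store.write("team-b", "http_requests_total", 1, 99.0)
store.write("team-a", "http_requests_total", 2, 12.0)

print(store.query("team-a", "http_requests_total"))
```

The point of the sketch is the cost model: every write and query is keyed by tenant, so adding a user adds rows to a shared store rather than another full server to operate.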

And so it's also an architecture based on Kubernetes for managing metrics. So we have these two things. We have this tool for managing, we have a tool for monitoring.

But at that point in time, it worked with any container implementation. So you could use it with Docker apps. You could use it with Kubernetes apps. You could also use it with Marathon, or Mesosphere, and ECS.

And that's when we found that what we'd done is, we built a tool which was extremely useful for people who wanted to work in a heterogeneous environment and who understood how to get their clusters set up and wanted to see all this information. But it wasn't very useful for people who were struggling just to get to the end of day one with a cluster set up and an app deployed. So we realized that we needed a lot more. Because developers were getting stuck much earlier on in their journey than where we found our tool was really hitting its sweet spot. And it wouldn't be the first time somebody has built a tool that probably is a bit ahead of its time in this technology market. And that was basically a reflection of how confusing things were, I would say.

CRAIG BOX: In order to meet people earlier in the software delivery lifecycle, one of the tools that you maintain today is Flux. When Flux was launched in April 2016, it was described at the time as a service routing layer. That's not what I think of Flux as today. Has it evolved, or is that just a reuse of a name for a different tool?

ALEXIS RICHARDSON: It's exactly the way you said. So when we first came out with Flux, we were still thinking in terms of, the distributed nature of the application was fundamental to how you managed it. And so we wanted to add to the routing component with something that we can actually use to route requests to different services, but also would be associated with how those services were deployed and managed, a bit like, actually, tools like Seneca today.

What we found was, that was a great thing to do, because it gave you almost an application model for microservices. But it forced you to confront two facts. One, if you do anything involving routing and load balancing, you've really got to commit to that. And that includes commitment to raw performance, and probably not running stuff in Go, unfortunately.

And two, you want to really think about the deployment piece. Because actually, that's the one that a lot of people find hard, deploying these services, whereas there are already existing high-performance load balancers. We saw Flux One as a prototype for understanding service management, I suppose.

But that made us realize that we didn't need to worry about the load balancing piece. We should really focus on the deployment piece. So that became the last part of Weave Cloud.

You've got deployment, management, and monitoring, all in one circle. So you could go, I deploy a change, I can look at the results using my monitoring tool, and then I can use management to fix things that come up due to my alerts. And so that was a very, very nice operational model for a SaaS, but, again, still a bit ahead of its time.

Because 99 customers out of 100 that we talked to are saying, how do I get my Kubernetes clusters set up? And how do I back it up? And how do I make a copy? And can you just solve this problem for me?

I've set up Kubernetes already, and I'm stuck. I need help with that. So that was something that was certainly instructive.

CRAIG BOX: People who were around in 2016 may remember what I would call the V2 or V3 of Docker Swarm, where it was relaunched at DockerCon with a command you could run to start a control plane and then another command you could run to join a node to a cluster. And people would look at Kubernetes and say, ooh, there's a hell of a lot of work you've got to do to set it up. My take on how this happened is largely around the fact that Google engineers who were still most of the contributors to Kubernetes at this stage had GKE and could get a one-click cluster.

And it kind of left an umbrella, in terms of how hard it was to set up in other places, for other teams to work on. And one of the other teams who attacked that problem were a team at Weaveworks, who built what is now Kubeadm. How did that project come about, and how did you decide to invest in Kubeadm?

ALEXIS RICHARDSON: As I said, Weavenet was optimized for portability. So when we first started looking at Kubernetes in December of 2014, January 2015-- that was Ilya, who I think you've met.

CRAIG BOX: I have.

ALEXIS RICHARDSON: We could see that Weavenet, if we could make it work with Kubernetes, would make it relatively straightforward to run a Kubernetes cluster somewhere else. And so, Ilya wrote this thing called Kubernetes Anywhere, which was originally a POC for setting up Kubernetes on Azure. And Patrick Chanezon, who was at Azure at the time working on that stuff, was really unhappy about it. So people carried on using Kubernetes Anywhere, and we worked on it a little bit, just as a sort of demo.

Then what happened was, I guess, 2016. By then, we were running Kubernetes in anger for our SaaS at scale. We were doing deployment. We were doing management.

We were doing monitoring. We were doing all the things. We had some customers. We were starting to realize that we should just focus on Kubernetes completely.

The PMs at Google came to us and said, look, we'd really like to do some work with you making Kubernetes easier to use. In particular, we think your network is really easy for developers. So could we make the Kubernetes developer experience better by working with you on that?

So we spoke to David Aronchick and Tim Hockin. I don't want to repeat what was said at the time. Because it was all a great conversation.

And there were some other really cool Google engineers involved in the discussion. But we all realized that we were sort of lying to one another if we thought that you could solely solve the Kubernetes developer experience through networking changes. Or what is more, the networking assumptions in Kubernetes were so deeply baked into the design throughout the code base that, actually, refactoring Kubernetes to take them out was a challenge equivalent to rewriting it from scratch, which is something that everybody has entertained more than once over the years. Because it is a complicated thing.

But nobody wanted to do that. And Weaveworks was certainly not recommending this. So we said, look, let's not do that. Let's think about another problem.

And I can't remember where it came from. Maybe it was Aronchick, maybe it was me, maybe it was Ilya, maybe it was someone else. But somebody said, let's really look at the bootstrap.

Because then you might be able to use the network to make the bootstrap simpler. But we're not asking you to completely rewrite Kubernetes' network. So OK, let's look at the bootstrap.

We had a sanctioned effort with some really cool people at Google to really redo the Kubernetes bootstrap, which was in parallel with a number of other efforts, which thought they had been sanctioned, but actually weren't. Every time I talked to people, I would say, we've been asked by the Google people to help them with the bootstrap. And they would say, no, you haven't. We have been asked by the Google people to help with the bootstrap. And so that was a bit of a problem.

CRAIG BOX: Nothing worth doing at Google is only done by one team.

ALEXIS RICHARDSON: Of course. Anyway, so then we all ended up at DockerCon in Seattle, which is before Craig McLuckie left Google to do Heptio, but not long. And very memorably, that was when Docker announced Swarm 2, which was kind of completely merged with the API, and that the API user experience for Docker Swarm would be just two verbs, for starting or joining a cluster. And it would just be an extension to the Docker API.

At which point I remember Craig was not a happy chappy at an analyst event the night before. Because I think he'd been pushing for some similar stuff for ages inside Google and had felt resistance.

I don't know that for a fact. That's my speculation, by the way. And also, there were really long faces in there around the Google folks. It's like, how could Kubernetes compete with this?

And I remember saying to David, look, what the Swarm team have done is good here, but there's no reason why you can't have this kind of API experience with Kubernetes too, if we build it together. We just look at Kubernetes Anywhere. That makes massive simplifying assumptions and lets us do something like that.

There were things that we'd done in Kubernetes Anywhere that were really quite neat. And we separated provisioning from bootstrap, which is something that people hadn't done before, I think. And that made it easier just to focus on the bootstrap. And we would say, provision the machines however you like. Use Terraform or something else-- Ansible or whatever-- and then do the bootstrap using something else. And there were some other assumptions that simplified things as well, to do with how it worked, which was good.

So we started working on Kubeadm in that summer with the goal of having just the two verbs. And by then, of course, everybody else was full of energy. So very, very, very quickly, we were also working on it with Brandon Philips and Joe Beda, who obviously went on to do Heptio, in Joe's case, and then now VMware. And Brandon, CoreOS and Red Hat, and, I think now, Wild and Free.

CRAIG BOX: Writing a newsletter.

ALEXIS RICHARDSON: Writing a newsletter. Luke Marsden was with us at the time and was working on it. So we had a really strong team of folks.

The people at Google, as you know, who work on this stuff are absolutely first class. So we had the Google folks, Brandon, Joe, Ilya and Luke, a few other folks. It was just a great team.

And actually, I remember, in early discussions, people's opinions were different. But it didn't take long to iron them out. So I think Kubeadm came very rapidly out of that.

Once Kubeadm became useful-- meaning you could actually download, and run, and install a cluster using it-- by about September or October, that's when we saw a huge change in the number of Kubernetes clusters being started, pinging back home to say, I've been started. And then it just accelerated after that. So that, for me, shows the value of really focusing on the developer experience with a technology like Kubernetes.

CRAIG BOX: And in terms of the investment that you as a startup made in that, it was something for the benefit of the whole community. Did it raise the tide on your product enough to be worth that investment?

ALEXIS RICHARDSON: Well, we were at the time thinking about the fact that we have a lot of open source tools and a SaaS product, and we have some good customers on our SaaS product, some really big names, some really big users. But we do feel like we're missing the majority of the market.

Because so many people are on-prem. Or so many people are saying, I want to be multi-cloud. Or so many people are saying, can you do this on Google or Microsoft?

Can you make it multi-cluster? Or can you just help me manage my cluster instead of worrying about the app? I want to do the cluster first.

You get all the questions. And you're thinking, this thing is just a little bit too ahead of the curve, in terms of the adoption. So let's move forward.

So Kubeadm was an investment for us to basically get to grips with installing Kubernetes ourselves on-prem in a way that lined up with how we installed it in the cloud. That's when we began to get serious about thinking in terms of cluster management, as well as application management. So we had originally started with the vision of focus on applications. And then we realized that we weren't going to be successful unless we also focused on clusters.

CRAIG BOX: You mentioned there working with the Google team. A couple of years later, Amazon comes out with Elastic Kubernetes Service in 2017.

And your team have worked a few ways with Amazon. One of them was creating the command line tooling around Amazon's new service. How did that partnership come about?

ALEXIS RICHARDSON: I mean, we're sort of veering off into slightly more commercial territory here, so I want to be fair to all parties. One of the things that's very important for Weaveworks and GitOps especially is, you've got this ability to start a cluster, and an app, and the stack, and a whole lot of other stuff that you need, from source, from the files in Git. That's your source of truth. You can project that anywhere you like.

So we can provide open source stacks, and we can work with commercial vendors, and we can work with cloud providers. And so long as we can interact with those things, we can get the same stack running wherever you want it, which is really cool. You've got this reproducibility, which is a form of portability. I'm not a big believer in portability necessarily. But being able to reproduce your state is really helpful after a crash, and for all sorts of other reasons.

One of the things that's critical to our positioning as a company is that we are independent. So there are other independent vendors in the DevOps Kubernetes space-- HashiCorp, for example. Nobody says there's a particular bias towards one cloud provider.

We are very careful to make sure that we retain that independence. And a lot of our customers are using two clouds, or one cloud and an enterprise vendor. Nonetheless, we do have investment from both Google and Amazon. And we have a partnership with Microsoft. So it is what it is.

With Amazon, our relationship began when Amazon got in touch, saying, we're going to launch EKS, and we'd like you, Heptio, HashiCorp, and a few other companies to be part of the launch, if you like. We said, yeah, that'd be great. And we wanted to do a thing to show how we were working with Amazon.

And Ilya, who pioneered Kubernetes Anywhere and some other Weavenet things, was asked to help figure out what we can do to make EKS more fun at launch. Ilya's observation was, as you won't be surprised to hear, that, from a developer's perspective, this is putting quite a few obstacles in your way to use, and it does feel like a very early technology, in terms of, is this a date-driven release, for example?

And it was coming to market after GKE and a number of other options, which had had more time, shall we say, to iron out some of the user kinks. We decided to focus on ease of use for EKS for this launch. I remember sitting with Ilya, talking about how his proposal would work, which is to take a load of the scripts and bundle them up into a more useful user experience.

And the way that it was interacting with the cluster, I said, is it the case that you want to have an experience for an EKS user that feels like it is familiar to somebody who knows Kubectl? And he said, yes, that's exactly it. So I said, why don't we call this EKSctl to make that point. And then that will be what we recommend people use for EKS.

But at this stage, it really was just a proof of concept to support a launch for one of our cloud partners. Then it just took off. So then that led us to working with a lot of Amazon customers.

CRAIG BOX: The perennial debate in the community is about the pronunciation of Kube CTL, or Kubectl. And you could look at Kubectl and say, well the word "kube," we pronounce that as a whole, so we should pronounce the second half as a whole as well. With EKSctl, you might think, well, I'm pronouncing all three letters of EKS, so I should do the same with CTL instead?

ALEXIS RICHARDSON: No.

CRAIG BOX: Perfect answer.

ALEXIS RICHARDSON: Obviously not. That would be an abomination. I think the reason it's "EKS cuttle" is because everybody calls it EKS. So that's a given.

And then we're just adding a bit to the end, which is as simple as possible, "cuttle." A lot of people do say "kube cuttle." Some people say "kube control." Some very confused people say "kube CTL." What can I say?

CRAIG BOX: I've always had a soft spot for Tim Hockin's "kube-ectal."

ALEXIS RICHARDSON: Yes, indeed. One can go off the beaten track quite a long way. Anyway, so what we realized was that, again, as with Kubeadm, flashing the developer experience is so important.

So now we found ourselves in the position of having a really great SaaS tool for day two and a lot of open source tools that made it super easy to start clusters on day zero. And putting your product manager hat on, you might think, what about day one? How are we going to solve that problem? That's when we really started to put effort into GitOps, Flux, and Kubernetes management.

CRAIG BOX: Now would be a great time then to pick up on the concept of GitOps. It first appeared in a blog post from you in August 2017. Where did the idea come from to manage clusters and applications state using Git?

ALEXIS RICHARDSON: I think that probably came from Google. You have to think about it as multiple strands of influence converging. I'd say there were at least three. So one was definitely the bunch of folks who'd been at Google, and sometimes in SRE roles-- exceptionally bright people who had understood the patterns that Google developed for the famous infrastructure around Borg and so on, and also had come to some dos and don'ts conclusions about that.

Secondly, there was definitely an influence from a gentleman called Peter Bourgon, who was one of the original designers of Flux. Peter is quite well known in the Go community. Worked with us for a little while. And Peter has very much an interest in microservices and had some opinions about how deployment trends were going in that space that I'm sure were informed by some of the DevOps infrastructure as code.

And then another gentleman, Michael Bridgen, who worked with Peter on the original design of Flux, and some others as well. And I think that they thought very, very carefully about how we were deploying our infrastructure. And we'd always deployed it from Git. But they made a lot of other changes that made it just much more logical, coherent, and manageable, so that we actually had a really good, world-class Kubernetes management infrastructure, a lot ahead of a lot of other people.

What was key to it was this way of deploying the pieces repeatedly, and so on. So I've talked about it so much that I don't want to rehearse all the details now. Essentially, we found that, for reasons of low cost management, high frequency of change, safety of change, low cost of operations, flexibility, teamwork, disaster recovery, and many other reasons, having an entire model of everything, including your dashboards, clusters, and apps, described in declarative form, if possible, in Git, immutable containers and a bunch of tools that worked in a very specific way to do convergence, like in Kubernetes, was the right way to manage a multi-cluster, multi-zone Kubernetes infrastructure. Deployed clusters of apps to do updates.

So we got into the habit of doing everything, including production changes, testing, rollouts, using this approach. We deployed Cortex in this way-- so complex things, simple things. And everybody who was doing it was like, wow, this is so much better than everything I've done before.

We also had people who didn't want to learn Kubernetes, including our CTO, who were deploying changes to the SaaS without knowing Kubernetes, because of how it worked. And so we just thought there were so many good things about this that we just should talk about it all. We were chatting in our office with the whiteboard.

And I can't remember who first said GitOps. It may have been me. It may have been someone else.

But it just seemed to sort of stick. Because it brought together all of the different elements of our thinking around operations and developments. I ran the word by a few people I trusted, and they said, yeah, this is a good portmanteau word. Because you're sharing the sort of inheritance through the lineage of DevOps, but also emphasizing this developer tool Git and thinking about the future, which is that, there's going to be a lot of people using Git to do a lot of stuff.

The way I see it is, they used to do it just for development, and in the future, they'll also use it for operations. So I just thought, we're going to write some stuff down and talk about it. It took me a while to get the blog post written. Because I knew it wasn't going to be complete the first time. But it did definitely get some traction.

Of course, what then happened was the classic "oh, well, yes, we already do this," or, "we thought of it already," or this, that, and the other kind of approaches. But really, GitOps is not about Git so much as it's about reconciliation. And at GitOps Days, the last time, Kelsey did the whole Kelsey schtick, which is, I'm going to approach this as if I'm a newborn baby and looking at something completely for the first time and try and get to the bottom of it.

And he said, first of all, I thought it was Git. Then I thought it was the workflows. Then I thought this and that.

And gradually, I realized, no, you can do GitOps without those. You can do other kinds of GitOps. But it's the reconciliation loop that's important-- the automatic correction of your stack if it drifts from the right state. And we've been just drilling very deeply into that technical idea. And we've discovered that you can build management products in this way.

CRAIG BOX: It was convenient that you had a product that would help in this particular space. A lot of people would think of a delivery tool as being something that pushes code or containers to a cluster. With Flux, you have something that runs in the cluster and then watches the state of some external environment-- in your case, a Git repository. And then when the change is committed, it's able to pull and actuate those changes.

ALEXIS RICHARDSON: Correct.

CRAIG BOX: Is there anything more to GitOps than that? Or is that basically the 10,000-foot summary that people need to know?

ALEXIS RICHARDSON: I would say that's a good 10,000-foot summary. [LAUGHS] I think the key thing is alert on drift, where, if it drifts away from the correct state on day 100, the system will try and take remedial action. And then there are other elements, the level two, level three GitOps concepts like, how do I deal with multiple clusters, fleets, templates? How do I introduce policies?

So it's not 100% automated. There's a lot more to it. But that's the core, what you said.

And actually, that automation loop concept goes back to the very first steam engines. I think it's called a governor. It's the thing that you run.

It spins. As the engine spins faster, the thing opens up. And then it rises, because the propellers are spinning. And that means more air comes out of the engine.

So it then cools down again. So it self-regulates. That's the governor. That's the feedback loop that everything is based on in automation. And that's also the feedback loop that makes GitOps work.
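The reconciliation loop described here can be illustrated with a minimal sketch. This is not Flux's implementation: the `State` type, the `diff` and `reconcile` functions, and the resource names and versions are all hypothetical, chosen only to show the idea of comparing a declared state with an observed one and correcting any drift.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model: resource name -> declared version. Not a Flux data type.
State = Dict[str, str]


def diff(desired: State, actual: State) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return the drift as {name: (actual_value, desired_value)}."""
    drift = {}
    for name in desired.keys() | actual.keys():
        want, have = desired.get(name), actual.get(name)
        if want != have:
            drift[name] = (have, want)
    return drift


def reconcile(desired: State, actual: State) -> State:
    """One pass of the loop: move every drifted resource back to the desired value."""
    corrected = dict(actual)
    for name, (_have, want) in diff(desired, actual).items():
        if want is None:
            del corrected[name]      # present in the cluster but not declared: remove
        else:
            corrected[name] = want   # missing or changed: apply the declared value
    return corrected


# Desired state as it might be declared in a Git repository (made-up values).
desired = {"web": "v2", "api": "v5"}
# Observed state after someone changed things by hand (made-up values).
actual = {"web": "v1", "api": "v5", "debug-pod": "v0"}

actual = reconcile(desired, actual)
assert actual == desired
assert diff(desired, actual) == {}
```

In a real system this pass would run repeatedly against a live cluster, which is what makes it behave like the governor: each pass measures the deviation and pushes the system back toward the declared state.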

CRAIG BOX: Conveniently, obviously, from the same Greek word that Kubernetes is.

ALEXIS RICHARDSON: These things are not accidental, Craig.

CRAIG BOX: Now, Flux as a product was donated to the CNCF sandbox in 2019 and has recently been promoted to an incubation project. What was the decision process behind this CNCF donation?

ALEXIS RICHARDSON: I don't like the word "donation." Because that word is used by people who think that this is a transaction whereby you have a thing, and then you give it to somebody else, and then they look after it for you.

CRAIG BOX: Sure.

ALEXIS RICHARDSON: And that is the worst possible way to do open source. Joining the CNCF was something that we thought was essential for Flux because it's so tied to Kubernetes. It's actually growing in scope now so that it's able to handle a range of non-Kubernetes things. So it's a plug-in system.

It's a natural tool for CNCF, because it naturally helps you do deployment for Kubernetes applications. And there's also Flagger, which helps you do progressive delivery, which is basically funky deployment on canaries and ABs for networks of Kubernetes clusters and networks of containers.

CRAIG BOX: In November 2019, there was an announcement that Argo, a similar kind of GitOps project from Intuit, was going to join forces with Flux.

ALEXIS RICHARDSON: Yes.

CRAIG BOX: Looking back, I don't see a V2 of either product where those two converged. So what was the intention? And did that actually happen in that way?

ALEXIS RICHARDSON: So what happened was that we had a big summit to decide what to do before the announcement, formed a plan to pilot an integration which would allow us to rebase both projects on a common core for the GitOps engine. One of our team, Alfonso Acosta, spent about four months working on the GitOps engine with one of the Intuit team, Alex Matyushentsev. And we got as far in spring of last year as having this ready to integrate with Flux.

The problem was that it wasn't really giving us much in terms of improvement to the Flux codebase. It was just very tied to the way that Argo worked, which is a very tightly coupled monolithic system, where you've got a UI, and the reconciliation tool, and the user management system, and the security, all its own thing. And we thought that, by putting these things together, we could get "two plus two equals five" benefits, where we would have some of Flux's ability to do infrastructure management, plus the nice things that Argo had, in terms of applications, but also a layered system, so we can integrate it with more stuff.

But it really wasn't working that way. We kind of didn't get as far as rebasing Flux 2 on top of the GitOps engine. We wrote another thing alongside called the GitOps Toolkit, which is now the backend of Flux 2, which has got parity with Flux 1. And I think the Argo team have looked at it and gone, wow, actually, that would be quite a good thing, to rebase Argo on top of-- I think that we were still at a point where we were developing in parallel, which was a little frustrating, given that that wasn't the original objective. But at least we've both got lots of happy users. So what we've successfully not done is piss those people off, I think. [LAUGHS]

CRAIG BOX: Or fragmented the ecosystem too much.

ALEXIS RICHARDSON: Yeah. It's unfortunate. Because I still believe that there's a lot to be gained by having a single GitOps CNCF world.

And we'll just see how it goes and continue to press forward. Flux has got some amazing features. It's very secure. It's very lightweight, and people really like that.

We tend to see some of the enterprise operators as extensional. So it has a number of different user interfaces that people are building or have built. There's a really nice one that Bryan Boreham tweeted about last week. We're doing tracing for deployment-- so more of a modular approach, I suppose. But we'll see what happens.

CRAIG BOX: Weaveworks was a founding member of the CNCF in July 2015. They decided collectively to form a technical oversight committee for the organization. And that was elected and announced in February and March 2016 at KubeCon London.

You were the founding chair. Did you lose a bet? Did you draw the short straw?

ALEXIS RICHARDSON: [LAUGHS] All of the above. I was involved in how the CNCF got off the ground working with Craig-- the other Craig.

CRAIG BOX: McLuckie.

ALEXIS RICHARDSON: Craig 2, as we think of him, of course.

CRAIG BOX: Indeed.

ALEXIS RICHARDSON: That sort of led to getting elected to the TOC, which I can probably thank Craig for. After that, the CNCF had its first meeting. There were six of us and one person who was listening in, which is permitted.

And then we were trying to decide who should be chair. And then Bryan Cantrill said that I should be chair, which is not what I was expecting him to say or wanted to hear. All I could think of in response was that he should just [MUTED] off, which is on record. And I think I caused a number of people to spill their drinks.

But I don't know. It was probably the right thing for me to do. Because I've been so involved in thinking about what should happen.

So I had a really strong belief that, with CNCF, we had an opportunity to put right some of the things that the previous foundations hadn't quite got right, probably realized it a bit late and then had to fix. Also, there was another great foundation out there, Apache, which is still an outstanding place to do software, but growing very, very big. There was a chance to have something that was associated with this generation. And Apache had gotten it right twice with the web and then big data.

So I was very inspired by that. But I really strongly wanted to have a project-led foundation, where we would emphasize each individual project first, because we weren't sure which ones were going to succeed, and we weren't sure how exactly they would come together. So Prometheus didn't need to work with containers. It's a pretty damn good system for VMs as well, for instance.

CRAIG BOX: Both yourself and Liz Rice, who is the chair who replaced you from the UK-- you mentioned before the challenges working from the UK with American teams. Of course, I'm personally very familiar with them. But how has the CNCF helped maintain that work-life balance? How have you managed to grow this community and lead the TOC for those years while the team was predominantly in the US, and now largely worldwide?

ALEXIS RICHARDSON: This is a question that can have a short answer or a long answer. I'm going to try and give you a short answer. When I started CNCF with everybody else, I thought that one of its biggest challenges was marketing. Because it had money, and it had a business model, the Linux Foundation event model and sponsorship and all that. But it didn't really have any brand or reason to exist, other than, Kubernetes needed a home, and, for various reasons, the other homes weren't quite right for it.

Then we wanted to put some other projects-- some projects needed a home. It was a bit like "Battlestar Galactica." So I thought the best way to market CNCF would be to have, obviously, great projects that had terrific momentum due to the passion of the people working on them, and the communities around them, and a bunch of other reasons, like, they're good projects technically, and so on.

CRAIG BOX: Sounds fair.

ALEXIS RICHARDSON: We just did that, and pushed, and pushed. And suddenly, we had five, six, seven, eight, nine, 10 projects. And then people stopped saying, what's the point of this thing, and started thinking, oh, shit, I've got fear of missing out, at which point I didn't need to worry about where people were based or what they were doing anymore at all. Because they were all just "fear of missing out." And the rest is just momentum.

So then we had all these conferences, which grew and grew and grew. There was this period where people thought, if they weren't in the CNCF summit, they were terribly wrong, which is also the wrong assumption. But yes, that's how we managed it. We just built this momentum around the projects and let the rest take care of itself. And that's good for the first couple of years.

CRAIG BOX: What dictated how long you wanted to stay on the TOC?

ALEXIS RICHARDSON: Well, I was elected for three years, I think. I stayed on for one more year as a member. Because I wanted to be there for Liz, the future chair, in case they needed some continuity.

Because one of the negative sides is the way we'd approached the CNCF as being a part-time activity. It was free-time, part-time stuff, and done a bit "seat of the pants," because we just wanted to grow fast, which meant that some pieces of institutional knowledge were not necessarily as fully written down as perhaps later generations would hope. I wanted to be available to pass that on. And that was fine, so then I could go after another year.

CRAIG BOX: We've talked a lot about the early SaaS version of Weave's platform and the open source projects that underpin it today. You have a comprehensive Weave Kubernetes platform that you market now to your customers. Tell us a little bit about what Weaveworks makes as of today.

ALEXIS RICHARDSON: Weave Kubernetes platform is a product for people who want to manage large numbers of Kubernetes clusters, stacks, and applications. The key benefits are that it gives you a whole-stack approach so that, if you want to, you can describe what's in your cluster, what's in your application, what's in other pieces on top, and then manage all of those together, or in a modular way, as needed, which means that, for example, if you have a team of people in your company, and some of them are building data science applications, others are building mobile backend applications, others are building web apps, others are building risk management apps, others building dot dot dot, you can create different collections of Kubernetes add-ons for each one of those teams and manage them using GitOps, correctly, at scale. You can patch. You can maintain.

You can distribute. You can run the way you like. You get all the multi-cloud, all the hybrid cloud.

You can run it with Amazon's on-premise Kubernetes. It works with GKE. It works with EKS. So it's a true multi-cloud Kubernetes solution, but with this key benefit of, I'm using GitOps to manage my stacks, which means that I'm getting the same stacks I want, patched, secure, wherever I want them. That's really where it comes to life.

It includes its own GitOps-related extras. So some of the metrics and policy enhancements are things you would expect to see if you're doing GitOps, like, I'm automating my deployment. I want to put in gateways to protect people from accidentally deploying the wrong stuff at the wrong time. I'll do a security check using your favorite security product.

I can also do things like fleet management, which is a new thing in Kubernetes land, as you can imagine. And I'm going to be able to do a lot more with size, with sort of progressive delivery, et cetera, once we've got that a bit more integrated into the platform. But our vision is, developers want to have a really awesome tool which brings together all of the DevOps and operational pieces around a dev-centric medium. And they want to plug into great tools from companies like Google, Amazon, Microsoft, et al. But they also want a tool which understands them as a developer-- their life, what they need to do around the concept of operations.

So we see a ton of wonderful tools out there-- GitHub, GitLab being very prominent, Atlassian and Bitbucket being others, CloudBees, which have done a wonderful job on the CI side of things-- dev test, that kind of lifecycle stuff. But there's a big piece missing, in terms of really nailing down what it means to operate an application if you're a dev team. And of course, nowadays, operations is part of the role of a dev team.

CRAIG BOX: The clue is in the name.

ALEXIS RICHARDSON: Yeah.

CRAIG BOX: So over almost seven years now, you've been able to pivot as Kubernetes users' and developers' needs have changed. Do you think that the platform you've built today is what you had envisaged when you founded Weaveworks?

ALEXIS RICHARDSON: Yes. I think that we are starting to get to where we wanted to be when we started, as in, we wanted application developers to have a complete experience with a new generation of operational tools without having to worry about the operational details. What we've learned since then is that that's not enough. You also need an enterprise story-- things like governance, audit, compliance, all of which are part of our commercial products. And you also need to wait until the market starts to stabilize, which it didn't really do until three or four years ago. And once you've got a stable market, then you can start to see consistency in terms of bringing out these features to people.

Otherwise, you're in a world of, we do X for Kubernetes. So we've seen some terrific companies doing X for Kubernetes. CoreOS is probably the first independent company to market with an enterprise management solution for Kubernetes. Had they carried on as an independent company, they would be looking now at the challenges of applications and so on, as we are. For me, it's always been about enabling developers to take advantage of operational stuff, rather than getting bogged down in infrastructure.

CRAIG BOX: Well, you've done some great work in the space. And it's been an absolute pleasure talking to you today. Thank you very much for joining me, Alexis.

ALEXIS RICHARDSON: Thank you very much, Craig. It's been a pleasure chatting. I hope to see you in the real world soon.

CRAIG BOX: I look forward to that. You can find Alexis on Twitter, @monadic. And you can find Weaveworks on the web at weave.works.

[MUSIC PLAYING]

CRAIG BOX: Hey, Justin, thank you very much for helping me out with the show today.

JUSTIN GARRISON: Thanks, Craig, it was a lot of fun to help out.

CRAIG BOX: If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

JUSTIN GARRISON: You can also check out the website at kubernetespodcast.com, where you will find transcripts and show notes, as well as links to subscribe.

CRAIG BOX: I'll be back next week to talk Kubernetes 1.21. So until then, thanks for listening.

[THEME MUSIC]