#33 December 11, 2018

Envoy, with Matt Klein

Hosts: Craig Box, Adam Glick

The Envoy proxy, a universal data plane for Cloud Native, has just graduated as the third top-level project in the CNCF. Craig and Adam talk to its author, Matt Klein from Lyft, about modern load balancing for microservices and pragmatically avoiding “second system” syndrome.

Do you have something cool to share? Some questions? Let us know:

News of the week

ADAM GLICK: Hi, and welcome to the "Kubernetes Podcast from Google." I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: And it's KubeCon week.

CRAIG BOX: It is. And where are you this week?

ADAM GLICK: Unfortunately, I'm in Silicon Valley this week for another obligation. How about you, Craig?

CRAIG BOX: Oh, I'm in London as well. So we have many, many colleagues from Google Cloud who are representing at KubeCon. Many wonderful talks to see. And don't forget Ice Cube, who is there on behalf of Mesos but is still a very important piece of the furniture.

But unfortunately, you will not be able to meet with us. But please don't leave KubeCon without taking a "Kubernetes Podcast" sticker. They'll be at the Google booth. They'll be on the stickers table. You'll find them absolutely everywhere. If this is your first "Kubernetes Podcast" episode, welcome. If you found this through KubeCon, we have a great show lined up for you today, as every week.

ADAM GLICK: Speaking of which, as well as KubeCon this week, there's a number of anniversaries, correct, Craig?

CRAIG BOX: Yes, it is a big week in tech history. This week marks the 50th anniversary of the "mother of all demos" presentation by Doug Engelbart in 1968, which introduced things like "the mouse", as well as word processing and hypertext. Basically, almost all the things we think of as being part of a modern computer system can be traced back to this one demo 50 years ago.

ADAM GLICK: Wow, I can't imagine putting that all together. That's a 90-minute demo. That's just kind of mind-blowing.

CRAIG BOX: I have enough trouble putting a 90-minute demo together today, with all the technology that we have.

ADAM GLICK: True enough. Also, slightly more modern history. I believe it is the 25th anniversary of "Doom".

CRAIG BOX: Oh, yes.

ADAM GLICK: You remember "Doom", don't you?

CRAIG BOX: Love me some "Doom". Who doesn't?

ADAM GLICK: Oh, my god. It destroyed several years of my finals. I swear that id Software did that every year. They release it right before finals and just crush us. Although, one of the interesting pieces, my university had a network. And it was weird because it was back in the days of very, very slow connections. But everyone had a fiber optic link in their dorm room.

And at that point, people weren't actually doing network transmissions directly to the people you were playing with on the network. The game would do a broadcast for every player playing. And with four-person games, it only took about four games-- 16 people, on a campus of thousands-- to take down the entire campus network. And it basically--

CRAIG BOX: You would eat your entire megabit.

ADAM GLICK: It just ate it all up. And it was just hilarious, because the network all went down, and they had to figure out what to do about this thing, because everyone had a copy of it, and everyone was playing it. It tanked everything.

CRAIG BOX: You can probably judge by our different "Doom" stories that I'm a couple of years younger than Adam. We had a school system with a serial cable running from the teacher's machine out the back, where we had hidden a copy of "Doom" away. And we had 50-minute classes. It took about 40 minutes to copy "Doom" across that serial cable back out to a machine in the lab, and then we got 10 minutes of gameplay between those two machines before we'd have to end the class. And then the whole process would need to start over again when the teacher found where we put it.

ADAM GLICK: Proof that if you're motivated enough, you will always find a way.

CRAIG BOX: Yes.

ADAM GLICK: Shall we get the news?

[MUSIC PLAYING]

CRAIG BOX: We now have more information on the major Kubernetes vulnerability that was breaking as we spoke to you last week. A write-up from Gravitational spells out the issue with a lovely sequence diagram. Briefly, a client could ask the API server to upgrade a connection to a back end to a WebSocket, and would then be able to send commands over that connection whether or not the upgrade was successful or authorized. If your API server allows requests from unauthenticated users, as default configurations do, this could result in exploits against the back ends of the API server.

If it does not, the attack could still be performed against the kubelet: on nodes where you had permission to attach to a pod, the flaw could be used to pivot and attach to other pods on that node that you did not have permission to attach to. There are at least two proofs of concept available for the exploit, so script kiddies are likely not far behind. This comes alongside continuing reports of Kubernetes clusters being exploited to mine cryptocurrency.

It's important to remember that Kubernetes is just software, and software requires attention and patching. If you're on the 1.10 to 1.12 series, you need to patch: fixed releases are 1.10.11, 1.11.5, and 1.12.3. If you're on a version prior to 1.10, you need to upgrade. Security response to this issue was led by Jordan Liggitt of Google, and GKE clusters were patched prior to the disclosure.

ADAM GLICK: Microsoft held their Connect developer event last week and made two major announcements relating to Kubernetes. First was the public preview of virtual nodes for the Azure Kubernetes Service. This is an integration with Azure Container Instances, which hide any underlying VMs and bill you by the second. Integration between Kubernetes and Container Instances is through an open-source project called the Virtual Kubelet, which implements Kubernetes semantics on the front end and talks to a range of providers on the back end.
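
For listeners curious what that looks like in practice, here is a hedged sketch of a pod that opts in to running on a Virtual Kubelet node. Virtual Kubelet nodes taint themselves so that only pods which explicitly tolerate the taint land on them; the specific label and taint keys vary by provider, and the values below are assumptions for illustration.

```yaml
# Hypothetical pod targeting a Virtual Kubelet node. The nodeSelector
# label and toleration key vary by provider and are assumptions here.
apiVersion: v1
kind: Pod
metadata:
  name: hello-virtual-node
spec:
  containers:
  - name: hello
    image: nginx
  nodeSelector:
    type: virtual-kubelet              # assumed label registered by the virtual node
  tolerations:
  - key: virtual-kubelet.io/provider   # virtual nodes taint themselves so only
    operator: Exists                   # opted-in pods are scheduled onto them
```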

Microsoft also announced that the Virtual Kubelet has this week joined the CNCF as a sandbox project. Other Azure news includes Container Instances now supporting GPUs in public preview, and the upcoming end of life of the original Azure Container Service on January 31, 2020.

In open-source news, Microsoft announced the Cloud Native Application Bundle, or CNAB. Microsoft has worked with Docker to build this "container of containers" specification, which describes a format for packaging, installing, and managing distributed applications. Under their Deis Labs brand, Microsoft announced a reference implementation called Duffle, and support for VS Code.

CRAIG BOX: Docker also hosted DockerCon EU last week. When the company first announced Kubernetes support in April, they added support for running Docker Compose stacks using a custom resource, which introduces a Stack object, and a controller to provision them. This functionality has now been released as open source and is available on GitHub for anyone to run in their own clusters.
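
As a rough illustration of the model, here is a hypothetical Stack object; the compose-on-kubernetes controller watches resources like this and provisions the corresponding Kubernetes objects. The API group and field names below are recalled from memory and should be treated as assumptions.

```yaml
# Hypothetical Stack custom resource; the controller turns each service
# into Deployments and Services. Field names are assumptions.
apiVersion: compose.docker.com/v1beta2
kind: Stack
metadata:
  name: hello-stack
spec:
  services:
  - name: web
    image: nginx:alpine
    replicas: 2
    ports:
    - published: 8080   # port exposed by the stack
      target: 80        # container port
```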

Docker also announced Docker Desktop Enterprise, an enterprise version of Docker Desktop that includes a new Application Designer.

ADAM GLICK: HashiCorp Vault, a popular open-source key management tool, has this week released version 1.0. Vault can be used to protect secrets in Kubernetes systems, and allows Kubernetes applications to perform common security and cryptographic workflows.

CRAIG BOX: Upbound, a startup founded by the authors of the Rook storage system, this week introduced Crossplane. Crossplane applies the Kubernetes declarative API and operator model to management of objects on multiple clouds, including Kubernetes clusters, databases, storage buckets, data pipelines, and more.
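
The idea is that a cloud resource is requested the same way a pod is: declare the desired state and let a controller reconcile it. A hedged sketch of the style, with the kind, group, and field names being assumptions for illustration:

```yaml
# Hypothetical Crossplane resource claim for a managed MySQL database.
# The group, version, and field names are assumptions.
apiVersion: database.crossplane.io/v1alpha1
kind: MySQLInstance
metadata:
  name: app-database
spec:
  classRef:
    name: standard-mysql   # an admin-defined class that maps the claim
                           # to a specific cloud provider's database service
  engineVersion: "5.7"
```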

GitLab, who recently moved their primary hosting to GKE, have announced that they're committed to supporting their customers on the three major public clouds, and they are looking to adopt Crossplane to do so.

ADAM GLICK: Rook, the storage integration and control project for Kubernetes, has released version 0.9.0 this week. Improvements in this version include storage providers for Cassandra, EdgeFS, and NFS. The Ceph CRDs have now become stable at v1, and Ceph versioning is decoupled from the Rook version: the Ceph L (Luminous) and M (Mimic) releases can now be run in production, and N (Nautilus) in experimental mode.
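
To make the "stable at v1" point concrete, here is a minimal sketch of a CephCluster under the v1 CRD. The decoupled Ceph version shows up as an ordinary container image field; the exact image tag and field values below are assumptions.

```yaml
# Minimal CephCluster sketch using the v1 CRD; image tag and field
# values are assumptions for illustration.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v13.2.2   # a Mimic (M) release, chosen independently of the Rook version
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: true
    useAllDevices: false
```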

Ceph upgrades are now greatly simplified as well. The minimum supported version of Kubernetes has been bumped to 1.8-- which shouldn't be an issue for you anyway, because you're already upgrading due to the security issue we mentioned earlier.

CRAIG BOX: And finally, Canonical has released MicroK8s, a delivery method for Kubernetes clusters so micro, they don't even bother spelling out Kubernetes in full. MicroK8s is a snap package that installs on 42 flavors of Linux, including Orange Chocolate Chip, and is targeted at small Kubernetes deployments on desktop, server, or IoT devices.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

Matt Klein is a software engineer at Lyft and the original author of the Envoy proxy, a high performance C++ distributed proxy designed for services and applications, as well as a communications bus and universal data plane for service mesh architectures. Welcome to the show, Matt.

MATT KLEIN: Thanks for having me.

ADAM GLICK: You worked on large scale networking at AWS and Twitter, as well as at Lyft. What made you decide to do networking?

MATT KLEIN: That's a tough one. Well, I've been spending pretty much my entire career working on low level systems. I started working on mobile phones and moved to operating systems, embedded systems. I've done a bunch of work on virtualization.

I think as I've gotten exposure to different things throughout my career, there's something about systems programming that I just fundamentally enjoy. And I think in terms of networking, when I started working in AWS, I had never done any very low level networking before. And that was just what was needed at the time. I was launching their initial HPC instance types within EC2. And we were launching 10 gig networking. And I just needed to learn about it to make that product happen. So I think the rest, as they say, is history.

CRAIG BOX: You've built the Envoy proxy server at Lyft. I've heard you say that you based it in part on the work that you did at Twitter working on their front proxy. How similar were those systems?

MATT KLEIN: It's been long enough now since I worked at Twitter-- I've been at Lyft for over three and a half years-- that it's a little blurry in my mind what the old code base at Twitter looked like, since obviously it's not open source. But a lot of people come and look at the Envoy code base, and they say very nice words about how well it's designed, how nice the abstractions are, and how beautiful the code is.

And I'm usually very honest in saying that any time someone gets an opportunity to effectively write a V2 from empty source files-- because the old code was behind a guardrail-- it's an amazing opportunity to think about all of the things that had gone wrong previously. I can't say exactly what is similar and what is different, but there were a lot of learnings from that edge proxy in terms of what could be better. So I think it's really just an evolution of that code base, in some ways.

CRAIG BOX: There's a chapter in "The Mythical Man-Month" about the second-system effect. How did you avoid the problem of trying to solve all of the things that you knew were wrong with the first system when building the second one?

MATT KLEIN: I think that from an engineering perspective, I tend to be fairly pragmatic. So for me, pragmatism means let's do something, and let's actually ship it. So I think when I look back at being able to build Envoy at Lyft, sometimes I'm actually flabbergasted that I was able to do it. Because at the time that I joined Lyft, Lyft was probably about 80 developers. It was a pretty small, fast moving company.

And to come in and propose doing something like Envoy-- which was going to be written in C++, a language that Lyft was not using at the time-- I guess I would say that I was lucky to be given a lot of rope with which to hang myself. And as part of that, it was very important that we showed value relatively quickly.

So we started working on Envoy in approximately May of 2015. And I don't know the exact date, but I think we went into initial production in probably about September of 2015. So that was four months from initial empty files to production. A lot of that was being pragmatic: focusing on a very targeted initial use case as an edge proxy, only supporting HTTP/1 at the time, with no TLS termination. So really scoping that feature set, and then incrementally building it up over time.

And I think the way that I view it is that when you're being very customer driven-- whether that's for a commercial customer, an end-user product like Lyft, or an infrastructure product-- as long as you stay very customer focused, deliver value for the customer, and stay pragmatic, that's the way to avoid that problem.

CRAIG BOX: Were you hired explicitly to solve the problem which you solved with Envoy? Or was it just something you noticed when you joined Lyft?

MATT KLEIN: I think I was hired because Lyft was growing rapidly. I think there was a general understanding that Lyft had a bunch of the microservice scaling concerns that most companies of that growth rate typically have. So coming from Twitter, and before that working on EC2, I obviously had quite a bit of experience in this area.

So I wouldn't say that I was hired to build Envoy, because I don't necessarily think that there was a good understanding of what Envoy would become at that time. I think I was hired to help solve some of the problems that we had previously solved at Twitter. So as a byproduct of that, it's a pretty natural progression to say, OK, let's learn from what came before and see what we can build here.

ADAM GLICK: What would you say are the problems that Envoy solves? Someone's looking at it, why would they start to adopt it? What's the reason?

MATT KLEIN: From an industry perspective, as we've moved towards microservice architectures, and as we move towards more nimble architectures around things like containers and functions, as we move to a world in which people don't necessarily write all of their applications in something like Java anymore, it's now typical for companies to have heterogeneous architectures where they might have six, or seven, or eight different languages. We've entered a world where two things are happening.

The first thing is that networking becomes a problem. Again, we have these architectures which are scaling up, scaling down, have all these different failures. Figuring out how to do common concerns around service discovery, load balancing, rate limiting, observability, logging, stats, tracing. And there's just no end to these concerns that have to be solved to build reliable systems.

And at the same time, we're doing this now in six, seven, eight different languages. And historically, the way that people have solved this is typically by building libraries in particular languages. So that would be something like Finagle from Twitter or Hystrix from Netflix. Or there's other examples of those types of libraries. But the reality is that those library solutions don't scale very well to six, or seven, or eight different languages.

So Envoy was built with the idea that we can take a bunch of these complicated concerns, build that functionality once, put it in one place, and have it be usable by all these different applications. So when we originally did it at Lyft, we were obviously doing it for PHP and for Python. Then we brought Go online. And since then, obviously, we have Java and we have Node.js.

And you see people now using Envoy across many, many different languages. And that wouldn't be possible without this out-of-process architecture. So fundamentally, Envoy is a network proxy, similar to an NGINX or an HAProxy. It's a self-contained server. It encapsulates a bunch of common concerns, and it allows people to use that with any language.
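
To ground that, here is a minimal sketch of an Envoy static configuration from the v2 API era: one listener that proxies all inbound HTTP to a single upstream cluster. The upstream hostname is a placeholder; a real deployment would also wire in the stats, tracing, and other filters Matt describes.

```yaml
# Minimal Envoy (v2 API era) static config: one HTTP listener proxying
# every request to one upstream cluster. Hostnames are placeholders.
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http   # prefix for the stats Envoy emits
          route_config:
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: service_backend }
          http_filters:
          - name: envoy.router        # terminal filter that forwards requests upstream
  clusters:
  - name: service_backend
    connect_timeout: 0.25s
    type: STRICT_DNS                  # resolve the upstream via DNS
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address: { address: backend.internal, port_value: 80 }
```

Because the proxy sits out of process, the PHP, Python, or Go service behind it needs no language-specific library to get retries, stats, or load balancing.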

ADAM GLICK: So you said that you wrote it in C++ earlier.

MATT KLEIN: Yeah.

ADAM GLICK: I normally think of C++ as a high performance, relatively low level language as opposed to some of the rapid application development languages that tend to be a little more common these days. What made you decide to use C++? Was it just the performance pieces? Or were there other reasons to pick it?

MATT KLEIN: So at the time-- this is the beginning of 2015-- I think we were looking at, if we're going to build this component and it's going to run everywhere, it needs to be relatively high performance. And I would point out that the most important part of the performance is not actually throughput, it's actually latency, particularly at the tail.

So when you start looking at the P99 latency, or the P99.9 latency, how does the proxy affect that latency? And what you'll find is that, particularly when you are using the proxy to provide your observability-- to get your timing information and do a bunch of other stuff-- if the proxy itself is adding latency in certain cases, it can be very hard to trust those numbers. So it has to be very stable from a latency perspective.

So at the time, that probably precluded any garbage-collected language, so Go was out. Go has obviously improved its performance quite a bit over time, and I don't think we want to spend this podcast getting into a language-war type thing. There were other reasons that I wouldn't have picked Go at the time, but garbage collection was probably the big one.

Everyone likes to ask me these days, if you were starting again, would you use Rust? If I were starting today, at the end of 2018, would I think of Rust? Probably, yes. But at the beginning of 2015, there was no ecosystem, and they were still making compiler changes that would break code.

So looping back to your previous question, which was, how do you keep a project like this from going off the rails? For me, it was: let's ship and show value within three or four months. And given the mission-critical, robust libraries written in C and C++ that we could rely on, I was confident that we could ship something of high quality in a relatively short period of time. So that's why C++ was chosen.

CRAIG BOX: So you looked at the other tools on the market at the time, decided that nothing would do what you needed or make sense to extend, and thus decided to build something from scratch. Then the obvious question becomes: should we share this with the community? Should we make it open source? What was the decision-making process like at Lyft around making it available to the community?

MATT KLEIN: I think that from Lyft's perspective, Lyft is not an infrastructure company. Lyft was never going to attempt to monetize something like Envoy. There were a bunch of us that had put a lot of work into this and had done similar stuff at previous companies. And frankly, we were sitting there, thinking about it, and saying: Lyft is not going to make money from this endeavor, and we would prefer not to re-implement this thing yet again somewhere else, right?

There were probably other companies that could benefit from this, and it would be a great way of increasing Lyft's credibility from a deep engineering perspective. So from Lyft's perspective, there was very little to be lost. Now, of course, if you open source a project, there can be some tax there. And if the project doesn't become very popular, maybe it's a net negative. But in hindsight, Envoy has become more popular than I think any of us in our wildest imagination could really have thought.

And the amount of resources that are being poured into Envoy at this point-- Lyft relies on code that's written by the rest of the industry. We have everyone helping with security, and debugging, and stability, and performance. And it's great from an industry visibility perspective. It doesn't happen in every case, but I think this is a real open-source success story: doing something that benefits the larger community, with Lyft also getting a bunch of benefits back.

ADAM GLICK: The folks here at Google are obviously among the other contributors to the project. What did it feel like, and what were your thoughts, when you started to see other companies join the project and contribute to it?

MATT KLEIN: It's pretty interesting, actually, because when we were about to open source-- this is probably in September of 2016 or something like that-- we had gone around and talked to a bunch of Lyft's peer companies, the single-digit-billion unicorn type companies, whether that be, like, a Slack or a Pinterest. And I think our hope was, prior to open sourcing, to say: let's get a company like Lyft, similar problem space, similar size, and let's see if we can get them excited about Envoy.

And I remember going around-- this is probably in July or August of 2016-- and talking to a bunch of different companies. And everyone was super nice. They were like, wow, this is great. We would love to think about maybe replacing our HAProxy solution or something else with this. But we have a three person networking team, and how can we possibly do this? This is not possible. We'll think about it and talk to you next year.

So I was a little bummed, and thought, well, I don't know if anyone's going to use this thing, but we'll go ahead and open source it and just see what happens. And right before open sourcing, I think Google learned about Envoy-- actually, I think at a gRPC meetup or something like that. And I remember having a meeting with some Google people. These were the people that were doing Istio, prior to Istio's launch.

And I had a meeting. And then suddenly, there's one person asking for access to the repo, two people, 10 people, 20 people. And it was an interesting time. I felt very much like a startup being acquired, going through technical diligence. There were some pretty stressful times there. But after open sourcing, Google obviously became very excited. Apple actually contacted me on the order of weeks after open sourcing-- they were super into it. Microsoft contacted me not much later.

And I think what was really fascinating is that I had so greatly underestimated the industry desire for a modern solution that was, frankly, not NGINX. A lot of people have various issues with NGINX. So a community-driven solution with a modern code base-- I really underestimated the desire for that type of solution.

And what I also underestimated was that it would be the larger companies that would have the resources to invest first. So you saw the massive companies coming in and investing first. And then, obviously, over the past two years, we've seen all those peer companies come on, and almost all of them are now deploying Envoy. But it took them an extra six or 12 months to get resources and feel comfortable, whereas the bigger companies had a more immediate need, had more engineering resources, and could dive right in.

CRAIG BOX: It's not enough just to turn up with a dump truck full of code and open source it. Obviously, Envoy has very high-performance code, and it solves this problem very well. But a number of the other things were there from the beginning: you went out, you got a website built and a logo made for Envoy. It was really a testament to the work done to market Envoy when it was released as open source. Was that you personally, or a team at Lyft?

MATT KLEIN: I'm not going to claim that I did all that work. We had lots of help internally from having a graphics designer help with the website. And there's lots of people that chipped in. I definitely worked many, many, many hours on making that launch possible.

And I think that people often ask me either what I've learned over the last two and a half years, or how all of this has come about. And the last two and a half years have been, in my opinion, a once-in-a-lifetime thing. It's been a crazy, wild ride. And the interesting part of it is that I feel like technically, in terms of networking or writing code, I've learned almost nothing. In fact, I almost don't code anymore.

But when I think about how much I have learned about building a community, facilitating open source, growing this kind of thing, it's just been an incredible learning experience. So when you start talking about the website, or the documentation, or being on Twitter, or speaking at conferences-- the way that I counsel people now who are thinking about doing open source is that, of course, you can do open source of any type. You can do a small project; you can have a README.

But if people are serious about having a project and having it be a big bang, there's a certain level of base marketing and PR that's just required: getting out there, using Twitter, writing blog posts, having a nice logo, and all of those types of things. And that takes a ton of time. So I think I kind of understood what I was getting myself into, but I don't think I fully understood it until it started to happen.

CRAIG BOX: You've spoken about how your job changed from being a programmer to being a community lead. And yet you've also said you don't think of yourself as the benevolent dictator of the community, to use the open-source terminology. What is it that you actually do day-to-day now? Are you employed by Lyft to run the Envoy community, as opposed to writing the code? Are you getting enough contribution from everyone else to make it worthwhile for the company?

MATT KLEIN: That's a very interesting question, because for the few open-source projects that are lucky enough to become very popular, as they grow in maturity and the community demands become larger, it can become a little more complicated in terms of what a job at a company like Lyft looks like, right?

If I were working at a company like Google, it would be potentially more justifiable for me to spend 100% of my time doing community-only work, whereas Lyft doesn't profit directly from Envoy. Lyft obviously benefits from Envoy. But it's a harder sell frankly for me to be employed by Lyft and spend 100% of my time doing purely community stuff.

So I won't go into as much detail, but over the last year or year and a half, it's been a progression at Lyft to figure out an approximately 50-50 split that's comfortable for everyone, where I'm spending about 50% of my time doing community work, industry work, those types of things which have benefit for Lyft. And then I still spend about 50% of my time doing more Lyft-internal stuff.

ADAM GLICK: I've seen you post about how, once you built this project and put it out there, you got a fair amount of interest from the investment community-- VCs and others. Was it hard to turn that down? How did you make the decision that the right thing to do was to keep this open source and to keep focused on your work at Lyft?

MATT KLEIN: The beginning of 2017 was when it was clear that the project was really starting to blow up. And yeah, there was a lot of inbound interest from the VC community. Frankly, the VC community looks at trends, and I don't think a lot of them necessarily think far enough ahead to understand how those trends would actually be monetized. As much as people like to think that they should, I don't think that's always the case.

So for a portion of the VC community, when they see an open-source project that's becoming so popular so quickly, they're going to say, well, we'll just start a company and figure out how you're actually going to make money later. I tend to personally take a much more analytical approach, and I actually think that the analysis I wrote up about a year and a half ago has been completely proven out. Which is that at the layer Envoy sits at-- that base data plane layer-- attempting to build a company means dealing with the concerns around PaaS, and CaaS, and FaaS, figuring out how you get Envoy into all of these places, doing all the packaging, and a bunch of other stuff. It's not that it was not feasible. But it would have made it a much harder sell for a company like Amazon to announce AWS App Mesh, or for Microsoft to have come along and done their Service Fabric Mesh, or even for Google to have invested as much in Envoy as it has.

So I really believe that having no primary Envoy company has actually been one of the main reasons that it's grown so fast, is that we have been able to make these community-first decisions. And it's been super fantastic. And look, I'll be honest, I'm super privileged. I'm not hurting for cash. So it's not like I threw away an opportunity that is going to change my life.

I believe that the path that we have taken with Envoy optimizes more for what I care about, which is industry impact. And I believe that by staying neutral, and by not starting that singular, low-level company, we have created this thing that is now on its way to becoming ubiquitous. And what we're seeing now is tons of companies that are going to build on top. And those companies will end up making money. And that's fantastic.

And that creates this next layer, this OS, this platform, right? And it just makes the overall community stronger. So no, I don't have any regrets. I think it's been a fantastic ride.

CRAIG BOX: Envoy is a universal data plane for service mesh architectures, as it states on its website. You've written about the difference between the two pieces-- the control plane and the data plane-- and how you see Envoy acting as the data plane. What control plane, or what technology, do you use to control the Envoys that run in production at Lyft? And how has that changed since you launched Envoy?

MATT KLEIN: At Lyft, we have our homegrown custom control plane, obviously, because we built Envoy-- so it's been a very organic, pragmatic process. We've gone through almost every single iteration that you might imagine. When we first built Envoy, we were literally hand-writing all of the JSON configs, and we would deploy them with the code changes, because we were still making breaking config changes.

And then, obviously, that evolved-- you can't do that forever; there's lots of duplication-- so we started with a basic templating system. Eventually, we started decoupling the configs from the code deploys, because that doesn't scale. And eventually, we implemented all the different xDS APIs that Envoy provides. So at Lyft, we're still using all homegrown stuff.
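
For a sense of what "implementing the xDS APIs" means in practice, here is a hedged sketch of an Envoy bootstrap that fetches its listeners and clusters dynamically from a management server over ADS, rather than from static files. The control plane address and node names are placeholders.

```yaml
# Hypothetical Envoy bootstrap using the xDS APIs: everything except the
# address of the control plane itself is delivered dynamically.
node:
  id: sidecar-1            # identifies this Envoy to the control plane
  cluster: my-service
dynamic_resources:
  lds_config: { ads: {} }  # listeners pushed by the management server
  cds_config: { ads: {} }  # clusters pushed by the management server
  ads_config:
    api_type: GRPC
    grpc_services:
    - envoy_grpc: { cluster_name: xds_cluster }
static_resources:
  clusters:
  - name: xds_cluster      # the only static config: where the control plane lives
    connect_timeout: 1s
    type: STRICT_DNS
    http2_protocol_options: {}   # xDS is served over gRPC, which requires HTTP/2
    hosts:
    - socket_address: { address: control-plane.internal, port_value: 18000 }
```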

I think it's just one of those things where, when you develop a technology, people ask, will you open source your Lyft control plane? And the answer is probably not, because it's very Lyft specific-- it's tied to Lyft's specific workloads. And I think it's actually indicative of why it's been harder than some people thought to have a universal control plane: control planes tend to be pretty opinionated, based on whether you're using this PaaS provider, or this CaaS provider, or that bare-metal thing, right?

And sure, there are things that could be done to unify that. And I do think that as people move to more of the functional platforms and more of the serverless platforms in the future that things will start to get unified. But I think it's hard to build a control plane that is generic enough that it can make everyone happy. And that's why I think that this split is very important, and it allows the data plane to progress on its own, the control plane to progress on its own. And maybe in the future we converge again. But yeah, from the Lyft perspective, we've really gone through almost every possible iteration.

CRAIG BOX: When you look at internet protocols, we've got a common protocol and different implementations of it. Is it sensible to standardize the data plane, versus standardizing the API and saying we can have different implementations of the data plane, as long as they follow that?

MATT KLEIN: Yeah, so our original intention with the, quote, "universal data plane API"-- the xDS API-- is that it is protobuf specified and backwards compatible. So from a technical perspective, someone could come along and implement that API, and it would theoretically just work from the control plane's perspective. I say theoretically because, in practice, the data plane ends up being so complicated to get right that there are going to be parts of the API that are implicit in what Envoy does, but aren't very well specified.

So it's not that we don't want the API to be a proper, backwards-compatible API. Some people actually ask, are you going to take the API through the IETF? And I don't think any of us are opposed to that. It's just that that requires a bunch of resources and a bunch of work. And what is the benefit going to be?

So there's the theoretical side: let's build an API, and let's clearly decouple the data plane from the control plane. We want to do that anyway. That is what makes Envoy successful.

From a practical perspective, the way that I look at it is that if you're Company X, and you're trying to ultimately offer value in your control plane by making it opinionated and adding value-added services, the data plane becomes a commodity. And at a certain point, if you have enough people working on it, I don't personally see why anyone would bother writing a different data plane.

But it might happen. And you do see it in certain cases, even within Google now. Google is actually migrating gRPC to use the Envoy API, so they'll be able to use a central control plane which can control Envoy, and can also control the gRPC libraries to help them do load balancing. It's still very Envoy-centric, but it shows you that it is possible to use this API and have it be implemented by different things.

CRAIG BOX: When you look at Istio specifically, there's an API which has evolved with the community-- especially for traffic management-- that sits on top of the Envoy API. It basically translates the intentions that you describe in YAML files into xDS commands and so on, in order to program Envoy. Do you think that abstraction, being slightly higher level, is something that Lyft might look to adopt over time, as more microservices are added at Lyft?

MATT KLEIN: I think that over time, as Lyft completes a Kubernetes migration and moves more towards standard technologies, I think that does make sense. But for right now, Lyft is in the situation that many companies of Lyft's age are, where we're moving to Kubernetes, but we have a huge legacy VM based infrastructure.

CRAIG BOX: Absolutely.

MATT KLEIN: It's just not easy to fit that into something like Istio, right? So will Lyft move to a more open-source control plane, whether that be Istio or something else, in a multi-year time frame? I think the answer is probably yes. I think that will happen, because once you start moving onto the industry-supported architecture, whether it be for CI, or deploy, or something else, it becomes somewhat pointless to maintain your own bespoke infrastructure. But right now, given what we already have, I personally don't think that trying to make something like Istio work for Lyft today is a great use of resources.
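
For reference, the higher-level abstraction Craig describes looks something like the sketch below: an Istio VirtualService (v1alpha3 at the time of this episode) that the control plane translates into xDS configuration for the Envoy sidecars. The service name and subsets are hypothetical, and the subsets would be defined in a companion DestinationRule.

```yaml
# Hypothetical Istio VirtualService: declared intent that the control
# plane compiles into xDS commands to program Envoy.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews              # assumed in-mesh service name
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2       # canary: send 10% of traffic to v2
      weight: 10
```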

ADAM GLICK: The de facto default Ingress controller in Kubernetes is currently NGINX. With Envoy's graduation, do you think we'll see a shift towards making Envoy the default?

MATT KLEIN: I'm less of an expert on what is going on within Kubernetes. I know that this is something that has been discussed for quite some time. Obviously, there's several Ingress controllers that are starting to use Envoy. There's Contour from Heptio. There's Ambassador from the Datawire folks. There might be others.

So I think that there are people within the larger Kubernetes community who think that Envoy and its configuration model map better to what Kubernetes Ingress needs to do. My understanding, though, is that there are still some very big open questions from an API perspective. What does Ingress even mean? And how does that map to a specific type of technology?

So, sure, I would love it if Envoy became the default Ingress controller for Kubernetes out of the box. And I think that there are people who think there should be a better out-of-box experience for Kubernetes. But I don't have a lot of insight, from the Kubernetes community's perspective, into how likely that is to happen or what the time frame might be, other than hearing people complain about Ingress a lot.
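
As background for this exchange: the Ingress resource itself is proxy-agnostic. At the time, the controller that satisfied it was typically selected with an annotation, so the same object could be handled by NGINX or by an Envoy-based controller such as Contour. A minimal sketch, with hostname and service names as placeholders:

```yaml
# Basic Ingress (extensions/v1beta1, current at the time). The annotation
# selecting an Envoy-based controller is one convention among several.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web
  annotations:
    kubernetes.io/ingress.class: contour   # assumed: route via the Contour/Envoy controller
spec:
  rules:
  - host: www.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: web
          servicePort: 80
```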

CRAIG BOX: Do you think that the Ingress abstraction in Kubernetes is a useful abstraction?

MATT KLEIN: It's tough. And this brings us back to the control plane discussion from before, right? In a perfect world, you would like to have abstractions that everyone uses across all of their different infrastructures. But what you find, or what one finds, is that as you build infrastructure that gets closer to the user-- closer to how they physically deploy software and how they physically work with their configurations-- in order to become simpler, it has to become more opinionated.

That's why something like Heroku is very, very popular: it is very opinionated in terms of how people build those applications and that infrastructure, and it makes it relatively easy. So I think there's a push and pull here with something like the Ingress API in Kubernetes, where if you want to make it more opinionated and simpler to use, it may need to more closely map to the capabilities of something like NGINX, or HAProxy, or Envoy.

Trying to have a generic API that somehow works with all of them, when they have different underlying features, in my opinion starts to make the API a little harder to use. So I guess, to answer your question: having an API that allows people to do operations around network ingress is an important thing. But can we get that without picking a particular proxy to support? That's less clear.

ADAM GLICK: You said that you find it hard to tell people what the roadmap is for a community project. What do you think Lyft's contribution to Envoy is going to be in the future? And what would you like to see as the next steps?

MATT KLEIN: I will have a lot more to say about this early next year, but I'll tease it here. One of the areas that I am most excited about for Envoy moving forward, and that I think directly applies to Lyft, is that we have shown that the service mesh architecture, at a high level, can have very large benefits for polyglot architectures.

And one of the things to consider is that a polyglot architecture extends out to the mobile client. We have iOS, we have Android, and we have apps that run on both. And there is very real potential to have Envoy provide mesh-like or abstraction capabilities on the client. So I think next year, increasingly, we're going to see Envoy in IoT and mobile use cases, really for the same reasons.

So I think Lyft's major contribution moving forward-- in addition, of course, to general community leadership and the features that end up coming out-- is that I have some pretty bold plans around running Envoy on both mobile clients. That's something that we're going to be doing in the open, and I think we'll have more to share with people in the late Q1 time frame. But that's something that I'm pretty excited about.

CRAIG BOX: Brilliant. And with that, Matt, thank you very much for joining us today.

MATT KLEIN: Thank you so much.

CRAIG BOX: You can find Matt Klein on Twitter @mattklein123, on Medium at medium.com/@mattklein123, and you can find Envoy proxy at envoyproxy.io.

[MUSIC PLAYING]

ADAM GLICK: Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. Thank you to those of you who've left us a review on iTunes. It really helps others find the show. If you have any feedback for us, you can find us on Twitter @kubernetespod or reach us by email-- kubernetespodcast@google.com.

CRAIG BOX: If you are a regular listener or if this is your first time, you can check out our website at kubernetespodcast.com and find all of the links and our show notes, and also transcripts of each episode posted a couple of days after they've launched. Until next time. Thanks for listening and take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]