#221 March 19, 2024

Creating Envoy, with Matt Klein

Hosts: Abdel Sghiouar, Kaslin Fields

Matt Klein is the CTO of bitdrift, which is building a mobile observability platform. Matt is known for being the creator of Envoy, one of the most popular open source proxies in the cloud space.

Do you have something cool to share? Some questions? Let us know:

News of the week

Cloud Native Rejekts

CNCF 2024 Prospectus

KubeCon Paris Guide Abdel co-authored

KubeCon Paris Recommendations Map

Matt Klein:

Envoy Proxy

Twitter kicks Android app users out for five hours due to 2015 date bug



Matt Klein’s X post about 1 billion pulls for Envoy on DockerHub

Envoyproxy on DockerHub


Rust programming language


ABDEL SGHIOUAR: Hi, and welcome to a new episode of The "Kubernetes Podcast" by Google. I'm your host, Abdel Sghiouar.

KASLIN FIELDS: And I'm Kaslin Fields.


ABDEL SGHIOUAR: In this episode, we chat with Matt Klein. Matt is no stranger to the cloud native space because he created and maintains Envoy, which is one of the most popular proxies used in the service mesh space, but also in other use cases. We talked with Matt about Envoy, the story behind it, and the current status of the project. We also talked a bit about Envoy Mobile, which is super interesting, and about what Matt is up to now and his new endeavors.

KASLIN FIELDS: But first, let's get to the news.


For the news today, it's KubeCon week. So we're going to talk a little bit about KubeCon and what we're excited about coming up at the event. So, Abdel, do you have anything you want to start off with?

ABDEL SGHIOUAR: I just wanted to say, before we start, that I let you do the catchphrase because I know you like it.

KASLIN FIELDS: Yeah, I had to do it. I love doing it.


ABDEL SGHIOUAR: Yeah, so KubeCon, it's interesting. Hopefully, by the time you hear this episode, KubeCon will be starting tomorrow. Actually, my KubeCon week starts earlier because I'm speaking at Cloud Native Rejekts on the weekend.

KASLIN FIELDS: Right? Rejekts is exciting. I always hear such wonderful things about it. But it's always so hard to add another day onto KubeCon.

ABDEL SGHIOUAR: It's a super cool conference. Actually, they changed it this year, by the way. I just realized that, this year, it's Sunday and Monday.


ABDEL SGHIOUAR: It used to be Saturday, Sunday.

KASLIN FIELDS: That's what I thought. Well, I mean, I guess they moved KubeCon because they've been trying to do the single day of co-located events thing.

ABDEL SGHIOUAR: Yeah, that's true.

KASLIN FIELDS: Yeah, so now it's--

ABDEL SGHIOUAR: It's Sunday, Monday, yeah.

KASLIN FIELDS: Tuesday through Friday-ish.

ABDEL SGHIOUAR: Yeah, and actually, it's in a very cool arena near the Louvre Museum in Paris. The arena itself is-- I forgot the name, but it's like a gaming, LAN party-type place. So it's a kind of place where you play LAN games, right? I saw a lot of people excited about it on Twitter. So we'll check it out on Sunday.

KASLIN FIELDS: Yeah, that sounds like a very interesting venue for a tech conference.

ABDEL SGHIOUAR: It is. It is. Now I'm super excited. It's a really cool conference. I'm going to be talking about the-- well, by the time you hear this, I would have already done my talk. But I've done it about the gateway API. So that'll be cool.

KASLIN FIELDS: And if folks aren't familiar with the Cloud Native Rejekts conference, do you know how long it's been going on? It's been going on for several years. But the idea behind Cloud Native Rejekts is that KubeCon is such a huge conference. It gets so many submissions to its call for proposals-- its CFP for talks-- that it can only accept a very small percentage of them, even though the talks that get submitted are generally really high quality. So there are a lot that don't get accepted that are really fantastic talks.

And so a whole other conference spun up, this Cloud Native Rejekts conference that happens right before KubeCon, just to show off all of those really wonderful talks that didn't get accepted, since KubeCon is just so hard to get into sometimes.

ABDEL SGHIOUAR: Yeah, you described it beautifully. This is going to be my second time. I spoke there last year in Amsterdam. And it was really cool.

KASLIN FIELDS: And I hear wonderful things.

ABDEL SGHIOUAR: And I like the design.


ABDEL SGHIOUAR: The black and red on their website and on the badges are really cool.

KASLIN FIELDS: Oh, interesting. I wonder if they do the same colors every time. I'm not sure.

ABDEL SGHIOUAR: I think it's the same. We'll see. I'll show you the badge when we meet.

KASLIN FIELDS: But KubeCon is enormous. And Cloud Native Rejekts is much smaller but with a similar quality of talks. So people say that it's a really wonderful experience-- lots of folks who are really excited, good networking. So if you ever get a chance, definitely check out Cloud Native Rejekts-- too late by the point that you are listening to this episode.

But check it out next time. It happens for NA and EU. This year, though, there are four KubeCons, according to the CNCF prospectus, that are supposed to happen. There's EU, NA, China-- in Hong Kong, not in Shanghai. Whenever it's been in China in the past, it's always been in Shanghai. And then they're adding on India.

ABDEL SGHIOUAR: Yeah, so it's going to be interesting because it's the first time in India. And China, I think it's the second time after COVID. They stopped it for a while after COVID. And they came back last year. And then it's back again this year, so more KubeCons around the world.

KASLIN FIELDS: More KubeCons. I doubt that they'll do Rejekts for China and India, since India is the first one and China is the second one back. And it tends to be much smaller, by my understanding.

ABDEL SGHIOUAR: We'll see. Technically, Rejekts is not done by the CNCF. It's a completely different group that runs the conference, right?

KASLIN FIELDS: Right, but I don't think they've ever done it for Shanghai before.

ABDEL SGHIOUAR: I don't think so, no, probably not.

KASLIN FIELDS: Anyway, that's a lot about Rejekts. Let's talk a little bit more about KubeCon itself.

ABDEL SGHIOUAR: The main event.

KASLIN FIELDS: The main event.

ABDEL SGHIOUAR: Exactly. Yeah, no, it's going to be exciting. I mean, first of all, it's Paris. I really love Paris. The city is cool.

KASLIN FIELDS: Oh yeah, you helped with a blog post about things to do in the city, right?

ABDEL SGHIOUAR: Yes. We can talk about that later. So let's talk about KubeCon first.


ABDEL SGHIOUAR: So KubeCon, three days of chaos, I guess? Is that a fair way to describe it?

KASLIN FIELDS: That's how we do it.

ABDEL SGHIOUAR: That's how we do it. That's how we roll. Little sleep?

KASLIN FIELDS: Arguably four days, depending on if you do co-los.

ABDEL SGHIOUAR: And you are doing the contributor summit, right?

KASLIN FIELDS: Yeah. So for folks who don't know-- this is very confusing for so many people that I've talked to who go to KubeCon, especially newer folks. So the day before KubeCon begins-- so KubeCon begins when there are keynotes. There are mainline talks. There's a showcase to go to. All of that stuff means it's KubeCon. But the day before all of that stuff, there is day zero of KubeCon, which is when all the co-located events are.

So there's no keynotes. There's no main track sessions for KubeCon. But you can go to these individual themed sub-events that happen at the same venue as KubeCon usually, though sometimes they end up somewhere else. And then people get confused, which is unfortunate. Make sure you check where your co-lo is if you're planning to go to a specific one. And it used to be that you had to add on access to the specific co-located event that you wanted to go onto your KubeCon pass. But they changed that with Amsterdam, I think, was the first time that they did it?

ABDEL SGHIOUAR: I think so, yeah.

KASLIN FIELDS: They made it so that the co-located events are just part of the full badge. So there's a full KubeCon badge. And there's a less-full KubeCon badge. I don't remember what the terminology is. But the full pass to KubeCon also includes the co-located events. So you can just go to whichever one you want to go to.

ABDEL SGHIOUAR: Yeah, I think that the full one is called All Access, so the All Access ticket.


ABDEL SGHIOUAR: And then you have the KubeCon only tickets. So KubeCon is only for KubeCon. And All Access has the co-located events.

KASLIN FIELDS: Very descriptive.

ABDEL SGHIOUAR: And then the contributor summit is on?

KASLIN FIELDS: Tuesday-- yeah, well, this time. I think for North America, KubeCon itself was Tuesday to Thursday. So the co-located events were on Monday. I think that's going to happen again this year. And then, for EU last year and this year, it was Wednesday to Friday for KubeCon itself. And then co-located events are on Tuesday.


KASLIN FIELDS: And yeah, the contributor summit is not included with the All Access pass. Don't show up to the contributor summit if you--

ABDEL SGHIOUAR: You need to be a contributor.

KASLIN FIELDS: --just wanted to hang out. It's something just for contributors to Kubernetes-- maintainers. You have to have org membership in the Kubernetes org on GitHub in order to attend the contributor summit. But yeah, it's a chance for the contributors to chat and share knowledge about contributing to the project.

ABDEL SGHIOUAR: So, basically, what you are saying is that when you are going to be at the contributor summit, I'll be uploading this episode?

KASLIN FIELDS: Yeah, thanks, Abdel.

ABDEL SGHIOUAR: [LAUGHS] OK, good. I just wanted to make sure that we have a clear understanding. [LAUGHS] No, it's going to be cool. I'm going to be spending some time in the Paris office. There's actually a new office in Paris. I'm excited to check that one out. Then KubeCon starts on Wednesday, keynote?


ABDEL SGHIOUAR: We have an Ambassador Breakfast on Wednesday morning.

KASLIN FIELDS: Yeah, just for the ambassadors to meet up and be ready to go. In case you're not familiar with the program, the Cloud Native Computing Foundation-- we've mentioned it a couple times on this show-- runs an ambassador program where it kind of bestows this title upon leaders of communities across the cloud native ecosystem.

And the goal of this program is to make sure that those community leaders have a way to bring their feedback into the CNCF and to propose and help lead projects that will serve the communities they help to lead. Applications to become a CNCF Ambassador just closed a couple of weeks before KubeCon. So you missed that if you're interested in it. But definitely look out for it next time. It opens twice a year at this point, I think? Or is it once a year?

ABDEL SGHIOUAR: Yeah, I think they have a fall and a spring one.


ABDEL SGHIOUAR: There will be another one somewhere this year.

KASLIN FIELDS: So they usually make sure that the ambassadors have at least some sort of swag that they can wear that indicates that they're an ambassador at KubeCon. And the purpose for that is supposed to be that, if you see someone wearing ambassador swag at KubeCon, they probably know quite a bit about the event and about the CNCF. So please come up to us, ask questions. We love helping the community. So make use of your ambassadors if you see us. Say hi.

ABDEL SGHIOUAR: Yes, please. Yeah, so then, we go to have breakfast. And after breakfast, we go to see the keynotes. I have looked at the schedule. It's quite busy.

KASLIN FIELDS: Always, always.

ABDEL SGHIOUAR: Always. So there's a lot of things going on. And I'm not going to attempt to go through all of them. I think we'll probably spend the next couple of days building the schedule on the app. But I think there is one thing we wanted to talk about, which is cloud native hacks?


ABDEL SGHIOUAR: The hackathon?

KASLIN FIELDS: Yeah, very interesting that this is the first year that the CNCF is running a hackathon alongside KubeCon, which you had to sign up for in advance. So if you were hoping to just drop in and be part of the hackathon, that's not how this one is organized.


KASLIN FIELDS: But it sounds exciting. I think they'll probably announce the winners of it in the final keynote of KubeCon, which will be on Friday.

ABDEL SGHIOUAR: Yeah, you will be able to find out, I think on social media, for sure. They will publish it. But yeah, there are very good prizes. The first prize is $10,000; second, $5,000. And the third one is $2,500-- so a lot of money.

KASLIN FIELDS: And for a good cause, too. It's based on United Nations' social good targets? [LAUGHS]

ABDEL SGHIOUAR: They call them sustainable development goals.

KASLIN FIELDS: There we go.

ABDEL SGHIOUAR: So yeah, so it's sustainability driven. So it'll be interesting to see what comes out of the hackathon, what kind of ideas people come up with. And yeah, it will be interesting to see what's going on there.

KASLIN FIELDS: Another new thing at KubeCon is more of a focus on academia, more representation of academia, at least. They're having a papers area thing at this KubeCon.


KASLIN FIELDS: Yeah, so they're going to have academic papers like you would see at an academic conference. I don't know if it's a poster session, if there's going to be an area on the show floor with the posters where people talk about them, or if it's done as sessions. I need to go and check the schedule on that. But yeah, there's going to be academic research represented at KubeCon more prominently than it has been in the past as well.

ABDEL SGHIOUAR: Interesting. I didn't know that. That's good. Yeah, so it's going to be exciting. I think the last thing we wanted to mention, so we don't forget, is that there was a guide published on the CNCF blog about what to do in Paris during, before, and after. I guess, by the time you listen to this episode, you will care about after.

KASLIN FIELDS: Or maybe during.

ABDEL SGHIOUAR: Or maybe during.

KASLIN FIELDS: But we've kind of covered that.

ABDEL SGHIOUAR: So I shared a few of my favorite things to do in terms of sightseeing, eating, interesting neighborhoods to check out. So we're going to publish that in the show notes. But I think I'll probably-- I made a map of my favorite restaurants, which I shared only with our team. I think we should probably make that one public.

KASLIN FIELDS: Yeah, sounds great.

ABDEL SGHIOUAR: Why not, right? Paris has a lot of cool places to eat. One of the things I like about Paris is that there are a lot of cuisines from different parts of the world. It's an international city. So you can have all sorts of food from all over the place. And usually, it's pretty good.

KASLIN FIELDS: Awesome. And call out a couple of your favorite things to do, Abdel, for folks who might be listening and looking for something to do right now. [LAUGHS]

ABDEL SGHIOUAR: Yes. So go to the Louvre, obviously. There is a little trick in the article which we're going to share about how to get to the Louvre without queuing. Basically, generally speaking, anything you want to go visit, book your tickets online. That's the first thing.

Go to the Louvre. Go to Montmartre, the neighborhood of Montmartre. There is a cathedral there. But it's actually built on top of an elevated platform, so it's on a hill. So once you get to the cathedral, you will have a really nice view over the city of Paris.

There are a bunch of rooftop bars you should check out with really nice views over the city. Get yourself a Navigo pass, which is a metro pass-- a card that you can top up with money so you can use public transport. You can use all the ride-sharing apps that you know, and maybe some that you don't. One of them is Bolt, which is very popular in Europe. Walking is a good option. Taking a Vélib', the bike-sharing service of Paris, is also another very good way of navigating the city. I think that that's all. Yeah, that is--

KASLIN FIELDS: It's a pretty good list.

ABDEL SGHIOUAR: That's a pretty good list, exactly. And yeah, just enjoy Paris. It's a really cool city. And in the article, I think we also shared a bunch of French words to learn. So please go read them. And if you are approaching French people, make sure to say "hi" in French, which is bonjour.

KASLIN FIELDS: Important vocabulary.


KASLIN FIELDS: So, with that, we hope that you're all excited for KubeCon, or at least the news and announcements that will be coming out of the event. And let's talk about Envoy.

ABDEL SGHIOUAR: Well, hello, everyone. And welcome back to a new episode of the podcast. Today I am talking to Matt Klein. Matt is the CTO of bitdrift, which is a company building a mobile observability platform. But Matt is known for being the creator of Envoy. And that's actually what we're here to talk about today. And just for reference, you can't see the video, but he is actually wearing a t-shirt that says Envoy. And it's one of my favorite t-shirts. Envoy is one of the most popular open source proxies in the cloud space today. Welcome to the show, Matt.

MATT KLEIN: Thank you so much for having me. Great to be here.

ABDEL SGHIOUAR: Thank you. I didn't want to spoil your intro. But I'll let you introduce yourself in your own words. Who is Matt Klein?

MATT KLEIN: Who am I? That's a fantastic question. I've been working in the technology industry for over 20 years. So most of my background is in low-level systems. I started my career working on operating systems, embedded systems, virtualization, those types of things. And then, over 10 years ago, I went up the stack a bit, still in low-level systems but mostly focusing on application networking and how people build these large cloud native systems. I did that at Twitter. Then I was at Lyft. And now, obviously, I've started my own company. But yeah, so for the past 10-plus years, it's really been in the cloud native networking space.

ABDEL SGHIOUAR: Awesome, and mostly proxies.

MATT KLEIN: Yeah, yeah. I mean, I guess the first half of my career was mostly in operating systems and virtualization. And then I started writing network proxies when I was at Twitter. And that's been mostly what I've done, with various detours into other types of systems. I think the main undercurrent of my career has been that I very much like low-level systems problems. That's the thing that I've always been drawn to. So things that are highly concurrent or have very interesting performance needs, those are the things that I tend to gravitate towards. So the networking space has been fun, for sure.

ABDEL SGHIOUAR: Nice. It just popped to my head, a question I want to ask. But before we get there, one of the things you're known for is Envoy, obviously, which was created at Lyft. I got to know Envoy about five years ago when I started using service mesh, mostly Istio. But I want to go back to before Envoy days. What's the story behind it? Like, how did you folks get together and decide, we're just going to rewrite this whole thing from scratch?

MATT KLEIN: Yeah, well, the story really begins back at Twitter. So I started working at Twitter in, I think, 2012. And without going into a huge, long story, one of the things that I had worked on there was building a proprietary proxy that was actually used for what, at the time, Twitter called its, quote, "fire hose." So basically, that was the stream of all tweets. And as you might imagine, that's a very large amount of data.

So there were proprietary systems that were built so that entire barrage of data could be sent to the various people that wanted to consume it. And during that time, as part of writing this proxy, we were doing various other things. We were doing the World Cup in Brazil, which had very interesting concerns around latency, right? So we were putting points of presence down in Brazil and needed to run proxy software on a small number of racks.

So we decided to actually take this new software and adapt it and then eventually run it as Twitter's entire edge proxy. So I gained a lot of experience at that time actually writing this type of software and deploying it at scale and then, in a fairly infamous incident, ended up through a bug in the software actually logging out about 40 million Android users. And that was, I think, in-- I want to say it was in the winter. It was around New Year's of 2015.

And that was a very fun bug. It was actually a one-character date bug. So there's a Twitter thread on it somewhere. But the TLDR is that the one-character date bug tickled some other bug in Android. And anyway, it ended up logging out, like, 40 million users. And that was a big problem, you know? And that ended up leading to a lot of people becoming disgruntled and leaving Twitter.

And that's how I and a bunch of other people wound up at Lyft. So I started at Lyft in, I think, May of 2015. And when I joined Lyft, I didn't join Lyft to build Envoy, of course. You know, Lyft at that time was a relatively small company. I think there were probably 60 or 70 developers at that time. And Lyft was struggling with its microservice rollout, which, during the mid-2010s, was a very common thing. Like, lots of companies were trying to adopt microservices. And they were failing pretty badly at it.

And when I joined Lyft, Lyft had a monolithic PHP application. And they had started to dabble in microservices. And it just wasn't going that well. Like, they didn't have observability. They didn't really understand where things were failing. And as I like to say, they were in kind of the worst of all possible worlds in the sense that they had started the microservices rollout. But people didn't trust microservices. So they were still adding features to the monolithic code base. And it just wasn't going well.

So based on some of the work that I had done at Twitter, we had decided, well, we probably need to start investing in some of these technologies in terms of bringing observability to our edge, maybe trying to figure out how to actually tackle some of this service-to-service traffic to give, again, observability and resilience around things like retries and timeouts and all of that. So based on some of the work that I had done at Twitter, I proposed building Envoy.

And look, I mean, I'm not going to lie. Like, in hindsight, for Lyft to allow me to go build this thing at that time in the company is frankly insane, right? It's like, no company should have allowed me to do that. But they did, you know? And part of that was just the era that we were in, in terms of people being given leeway to do some of these things. Part of it was that I had already done some of this at Twitter. So there was some confidence that I could do it again.

But it was definitely a green light to do a project that probably had a high risk of failure. So anyway, so we started doing Envoy. And then, I'm sure you'll ask follow-up questions as to why we didn't use Nginx or something like that. And we can certainly go into that. But just the real quick story is that we built the software. And we started to roll it out on the edge first. So even though Envoy is pretty well known as a, quote, "service mesh proxy," it's actually very, very widely used as an edge proxy and as an API gateway.

And that was actually its first use at Lyft. So it was first used to try to give some visibility as to what was going on. And then quickly, from there, we started to dabble in various microservice use cases. As I also like to say, much of Envoy's original development was actually related to making MongoDB stable at Lyft. So there was a bunch of work around that. And then we eventually got into some of the service mesh use cases before it was called service mesh. And then, before long, it was deployed everywhere. So it became a pretty successful project within Lyft pretty quickly. And then, of course, there's the future history of open source.

ABDEL SGHIOUAR: I actually have a follow-up question. And before we get to, why not use something that exists? I think the first question that comes to mind is: writing a proxy from scratch, that's pretty low level. That's dealing with low-level packets. I mean, Envoy is written in C++. So I assume it uses a bunch of libraries that already process packets and headers and stuff like that. But what's your experience as somebody who is into that kind of low-level network handling?

MATT KLEIN: Well, look, I mean, it's not like all of the code was written from scratch like you said. So Envoy is written in C++. And at the time that it was originally written, it used libraries for I/O handling and async event handling. It used libraries for parsing various codecs. So Envoy has always used a lot of different libraries.

I think that what made Envoy interesting and successful was actually not the underlying network handling and all of those things. I mean, by the time Envoy was written, that was a pretty well-trodden category. I mean, there were fantastic existing projects at that time, whether it be Nginx or HAProxy. I mean, I have nothing bad to say about those projects. Like, they're extremely well-written and stable pieces of software.

So it's not like Envoy was really going to do better in terms of proxying requests or parsing HTTP. It's just not possible. I mean, that stuff has been done for a long time. Envoy brought a lot of other things to the category, but I would view those as more what we did on top. And we can always talk about that. But I think Envoy's contributions to the industry are less about the low-level stuff and more about the higher-level stuff.

Now, again, depending on your perspective, this is probably all very low level. But I guess from where I'm sitting, it's not like I spent a lot of time thinking about, how are we going to better process TLS, or how are we going to better parse packets? It's like, that part wasn't interesting.

ABDEL SGHIOUAR: Well, I get your point. But I still think that-- by the high-level stuff that Envoy has done, I assume you're talking about all the APIs, the gRPC API, the hot-reloading configuration, all that stuff, which we can talk about. But I watched the documentary, obviously, the one that features you as one of the people. And there was a lot of talk about the improvements over existing proxies in terms of performance. Like, Nginx famously processes requests in single-threaded workers. It's hard to do multi-threading and stuff like that. So that's also important.

MATT KLEIN: It is, yes. That's true. I mean, most of those other proxies at this point have fixed those things. So it's like, HAProxy is now multi-threaded. And I don't keep up to date with the development of the other proxies. But again, if I'm really honest, I don't think the fact that Envoy had a more modern siloed threading architecture-- like, that alone is not what made Envoy fantastically successful.


MATT KLEIN: It's the other stuff. So it's not that the work wasn't interesting. And it's not that it wasn't developed, I would say, in a more modern way. But if that was the only thing that Envoy had done, I think the chance of it being successful probably was not there.
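For listeners unfamiliar with the term, the "siloed" threading architecture Matt mentions can be sketched roughly like this. This is a toy Python model under our own assumptions, not Envoy's actual C++ implementation: the idea is simply that each worker thread exclusively owns its state (connections, counters), so the hot path needs no locks, and aggregation happens only when someone reads the stats.

```python
# A toy sketch of a "siloed" threading model: each worker thread owns
# its own counters, so the hot path takes no locks; totals are
# aggregated only off the hot path, when stats are read.
import threading

class Worker:
    def __init__(self):
        self.requests_handled = 0  # owned exclusively by this worker's thread

    def handle(self, n_requests: int):
        # No lock here: only this worker's thread ever touches this counter.
        for _ in range(n_requests):
            self.requests_handled += 1

workers = [Worker() for _ in range(4)]
threads = [threading.Thread(target=w.handle, args=(1000,)) for w in workers]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Aggregation happens off the hot path, after the workers are done.
total = sum(w.requests_handled for w in workers)
print(total)  # 4000
```

The design choice being illustrated: sharing nothing between workers avoids lock contention entirely, at the cost of doing a cheap merge whenever a global view is needed.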

ABDEL SGHIOUAR: Got it. So yeah, it's the new, if we can call them, abstractions that were introduced by Envoy that made it probably--

MATT KLEIN: Yeah, so to circle back on the question that you probably were going to ask anyway-- at Lyft, why didn't we use HAProxy? Why didn't we use Nginx?-- like I said, the fact that the project was allowed to happen was somewhat crazy. And most sane people probably would have said, well, you should just use Nginx. You should just use HAProxy. And I think that the reason why we didn't do that is multifaceted.

One of them, of course, is that, having built the software back at Twitter, I had a pretty good understanding of what something like Nginx was good at or HAProxy and what it was not good at. And one of the early focuses of Envoy is just the copious amounts of observability that it spits out. Like, Envoy was well known in the beginning for giving you lots and lots of different metrics, often to its detriment in the sense of your observability bills, just because it spits out so many metrics.

But I think Envoy became popular because it was built from the ground up, just to give you really, really rich observability, which is not something that, historically, Nginx and HAProxy had really focused on. Like, they had really built this category of super high-performance, highly concurrent proxies. And they both did an amazing job of that. But observability and cloud native-type observability was not really their focus because, at the time that those projects started, things were still mostly in data centers. Things were quite static. So it was a bit of a different focus.
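To make the "copious metrics" point concrete: Envoy's admin endpoint exposes a plaintext stats dump with one "name: value" line per counter or gauge, and there can be thousands of them. Here is a minimal, hypothetical Python sketch of consuming that format; the sample lines are illustrative, not captured from a real deployment.

```python
# Hypothetical sketch: parsing the plaintext "name: value" stats format
# that Envoy's admin /stats endpoint emits. Sample data is invented.

def parse_envoy_stats(text: str) -> dict:
    """Parse a plaintext stats dump into a {name: int} dict.

    Non-numeric lines (e.g. histogram summaries) are skipped for
    simplicity, since counters and gauges are plain integers.
    """
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(": ")
        value = value.strip()
        if name and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats

sample = """\
cluster.service_a.upstream_rq_total: 1042
cluster.service_a.upstream_rq_timeout: 3
http.ingress_http.downstream_cx_active: 17
"""

stats = parse_envoy_stats(sample)
print(stats["cluster.service_a.upstream_rq_total"])  # 1042
```

The per-cluster naming (`cluster.<name>.upstream_rq_total` and friends) is what drives the "lots and lots of metrics" effect Matt describes: every upstream cluster multiplies the stat count.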

But the other reason, ultimately, that we didn't use Nginx is that, having worked with the code-base in the past, I knew, one, that it was written in C. And I think most of your viewers will laugh at me saying that C++ is way better than C. It's like, wait--

ABDEL SGHIOUAR: I would agree with that.

MATT KLEIN: --way more productive. But from my perspective at the time-- and again, you'll probably ask this anyway, so I'll just say it. I mean, for the last two years, I've been exclusively writing Rust. So it's like, if I were to write Envoy today, would I write it in Rust? Absolutely, it's a complete no-brainer. But in 2015, Rust was not where it is today. So I stand by my decision at the time to write Envoy in C++. And at the time, C++ felt monumentally more productive than writing it in C.

And the other reason is that both Nginx and HAProxy are fairly notorious for not having a great open source community. So people have to carry patches. And they don't take contributions. So it's like, we basically knew that if we started with Nginx, we'd have to fork the code.


MATT KLEIN: They probably wouldn't take our patches. We'd have to be writing it in C. So the whole thing to me at the time just felt very unproductive. And I felt that I could get something working relatively quickly written in C++. So that's basically the why: we felt that we would have to make substantial changes to Nginx or HAProxy, we felt that they wouldn't take our patches, and we'd have to fork it anyway. And if we had to fork it, we might as well just start with C++, because it would be way more productive than having to fork it and have a partial C code-base or something along those lines. So that's the why.


MATT KLEIN: But then to come to, what are the features we focused on, versus the low-level stuff-- what do I think made Envoy what it is? There's no one feature. And I think that's what's so interesting about Envoy: there are a couple of different reasons that Envoy ended up becoming very popular. One of them is the focus on observability, for sure.

One of them is definitely the API. I mean, it's like, Envoy was built from the ground up to live in this data plane, control plane, split world, which, in the cloud native Kubernetes space of things coming and going and failing and all of those things, it's like, we just don't live in a world of static configurations anymore. So the API really changed the game, just in terms of how people can use the proxy.
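The data-plane / control-plane split Matt describes is, in real Envoy, the xDS family of gRPC/REST discovery APIs. As a rough, hypothetical illustration of the idea only-- the proxy holds no static config file; it applies versioned snapshots from a control plane and swaps them in without restarting-- here is a toy Python model:

```python
# Toy sketch of the data-plane / control-plane split. Real Envoy uses
# the xDS discovery APIs over gRPC/REST; this only illustrates the idea
# of versioned config snapshots applied dynamically, with no restart.

class ControlPlane:
    def __init__(self):
        self.version = 0
        self.routes = {}

    def update(self, routes: dict):
        # Every change bumps the version, like a new xDS snapshot.
        self.version += 1
        self.routes = dict(routes)

    def snapshot(self):
        return self.version, dict(self.routes)

class DataPlane:
    def __init__(self):
        self.version = -1
        self.routes = {}

    def sync(self, cp: ControlPlane):
        version, routes = cp.snapshot()
        if version != self.version:  # only apply newer snapshots
            self.routes, self.version = routes, version

    def route(self, path: str) -> str:
        return self.routes.get(path, "404")

cp = ControlPlane()
proxy = DataPlane()
cp.update({"/api": "service_a"})
proxy.sync(cp)
print(proxy.route("/api"))  # service_a
cp.update({"/api": "service_b"})  # re-route with no proxy restart
proxy.sync(cp)
print(proxy.route("/api"))  # service_b
```

In a Kubernetes-style environment where endpoints come and go constantly, this push-new-snapshot model is what replaces editing static config files and reloading the proxy.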

So I think it's the API. I think it's the observability. I think it's the extensibility. It's not that Nginx and HAProxy don't have that. But we did a good job of allowing people to build filters and different components. And you've seen that now, coming up on 10 years of Envoy, which is that the number of protocols, the number of things that it supports are just completely flabbergasting to me.
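The extensibility point-- "allowing people to build filters"-- can also be sketched in miniature. This is a hypothetical toy in Python with invented names; real Envoy filters are written in C++ (or via extension mechanisms such as Wasm or Lua), but the chain-of-pluggable-hooks shape is the same: each filter can inspect or mutate a request, or stop the chain.

```python
# Toy sketch of a filter chain in the spirit of Envoy's HTTP filters:
# each filter runs in order and can mutate the request or short-circuit.

def auth_filter(request: dict) -> bool:
    # Reject requests without a token; returning False stops the chain.
    return "token" in request.get("headers", {})

def header_filter(request: dict) -> bool:
    # Filters can also mutate the request in place.
    request.setdefault("headers", {})["x-proxied-by"] = "toy-proxy"
    return True

def run_filter_chain(request: dict, filters) -> bool:
    for f in filters:
        if not f(request):
            return False  # short-circuit, like stopping filter iteration
    return True

req = {"path": "/api", "headers": {"token": "abc"}}
ok = run_filter_chain(req, [auth_filter, header_filter])
print(ok, req["headers"]["x-proxied-by"])  # True toy-proxy
```

The "never say no, add an extension point" ethos Matt describes below maps directly onto this shape: new protocols and behaviors land as new filters rather than as changes to the core.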

But if I really point to the thing that I think made Envoy the most successful, frankly, it's that, in a space where many of the existing proxies at the time were still developed on Linux-kernel-style mailing lists, with patches mailed around and maintainers that wouldn't take patches-- like, people had to fork-- we were on GitHub from the beginning. We worked super hard at having a really open community.

And to me, that is the thing that, beyond anything else, made Envoy. And it's that we were able to bring in all of these different companies, whether it be Google or Apple or Microsoft. I mean, look, at this point, when I look around, it's hard for me to even say this because it's so incredible, is that the question is, who's not using Envoy? And of major companies around the world, there's really only a few. I mean, it's like, almost everyone is using Envoy.

And we've done that because I think we had a really welcoming community. I think people collaborated together. And I also think that many people have been able to derive value and build businesses on top of Envoy. Again, it's not one thing. It's the API. It's the focus on observability. It's just being built for the cloud native systems that we have now. But if I were to point at one thing, it's the fact that we really brought everyone in. And we tried to make everyone successful.

And in the beginning stages of the project, I really had this ethos, which is basically, we will never say no, meaning-- people would come in. And they might want to propose a patch or whatever else. And we would never say no. We would say either yes, or, let's make a new extension point so that you can fix your problem. And I think that's how the project grew. And that's why it is so extensible. And look, Envoy is a very powerful piece of software. It's not for everyone.

But I think it's become useful in so many different cases. And we've just enabled a lot of people to build success on top. And I think, far and away, that's the thing I'm most proud of and I think the project has done really well at.

ABDEL SGHIOUAR: Yeah, you touched on a lot of points there. One of the things I have to mention is, as you were talking, it sounded to me like you really came into the project at what I always call the split point between the old school of doing open source and the cloud native way of doing open source-- moving from somewhat closed communities to open communities, things done in a more modern way. You talked about HAProxy and Nginx being developed in the old-school Linux kernel way. That's pretty cool.

But I think I want to go back to one thing. And that leads me to my next question, which is Envoy outside of the service mesh, because I think that for a lot of people listening to this podcast, in their head, Envoy is the service mesh data plane. But you talked about the, well, performance part. You talked about the API, which, probably for those who don't know, is a gRPC-based API, right?


ABDEL SGHIOUAR: And the most important thing is hot reloading of configuration. For most configuration changes, you don't actually need to restart Envoy, and you don't drop connections. That's also very important. So tell us a little bit-- back in the days, or even now-- how is Envoy used outside the service mesh?

MATT KLEIN: Well, I mean, there's the beginning of the project. And there's now. And again, to be honest-- and again, it's really amazing for me to say something like this. But I stopped keeping track of who's using Envoy a long time ago, right? Because so many people are using it that there's no point in keeping track. But if I anecdotally look around the world at all of the companies that are using Envoy, yes, of course, it's very widely used in service mesh. Lots of people use Istio. There's other service mesh projects.

People have built their own proprietary service meshes based on Envoy. But if you actually look at most of the logos of the giant companies that are using Envoy, almost all of them are using it as an edge proxy or as an API gateway. So tons of development has gone into Envoy in terms of hardening it for the edge, in terms of adding, whether it be WAF capabilities or OAuth or JWT or all of the different technologies that you actually need for an edge proxy.

So I think the reason for that is that, if you really look at what is the difference between a, quote, "edge proxy" and a, quote, "service mesh proxy," 99% of what they do is exactly the same. I mean, it's like, they receive connections. They load balance them. They do transforms. They send them to some backends. I mean, yes, on the edge, maybe you'll do some other things, like TLS termination. Or you have maybe a different type of security concern that you might-- from an internal thing.

But even from that perspective, if you look at where the industry has gone, in terms of now really moving towards a zero trust networking world, there actually really isn't that much difference because, now, people are doing TLS. And if they're living in a zero-trust world, every hop in the network has to do all the same stuff. So the point that I'm getting at is that what Envoy brought to the table-- the API-driven configuration, the hot reloads, the awesome community, all of the extensions-- it's going to apply just as much to the edge as it is to service mesh use cases.

And the other thing that we've seen is that, at the time Envoy was made, for reasons that I've never honestly fully understood, people typically had split deployments. So it was really common, for example, that people would use Nginx as their edge proxy. But then they would use HAProxy as their internal proxy. And I don't quite know why that is because they mostly do the same stuff.

But I think the other thing that appealed about Envoy is that operations people don't want to run two different pieces of software. They don't want to learn how to monitor two different pieces of software. So the fact that Envoy had grown up in both of these worlds I think was very appealing for people because, now, they could have one system that would drive the configuration for all of the proxies. They could learn this one piece of software.

They would know how to debug it, all of those things. So I think we saw a lot of organizations, at least larger ones that were a bit more modern, they would progressively replace Nginx and HAProxy with Envoy in both deployment spaces because I think it ended up being simpler. But no, I mean, we've seen Envoy used in tons of API gateway cases. And in fact, like I mentioned, that's where it was deployed first, at Lyft, actually. So I mean, it was built as an edge proxy first.

ABDEL SGHIOUAR: Interesting. I know that you are very humble in terms of what Envoy has managed to achieve and being there first. But I think one of the biggest testaments, if I may say, is-- I mean, Google is switching also to Envoy in some of the cloud load balancers. And I wanted to mention this. In the beginning, you talked about the famous incident of 40 million devices or 30 million accounts being invalidated.

I know of a streaming company-- I'm not going to mention the name-- that uses Envoy as an edge proxy. And a few years ago, they had an incident, again, a bug. And they disconnected 70 million connected smart speakers. So you shouldn't feel bad about that. That happened in, like, 2023 or 2022.

MATT KLEIN: It happens.

ABDEL SGHIOUAR: Yeah, exactly. Stuff happens.

MATT KLEIN: Yeah, sure.

ABDEL SGHIOUAR: So continuing on this discussion of success, actually, the thing that triggered me to want to have you on the show was your tweet about 1 billion downloads on Docker Hub. And you put it in a way where you said, I don't think this is a big deal. I think it is. I think a container that has been pulled a billion times from Docker Hub is quite a big deal.

MATT KLEIN: Well, and what's even amazing about that-- and I didn't say this on Twitter-- is that that's a billion downloads from Docker Hub. I mean, almost all large users don't pull from Docker Hub because, news to your listeners, if you're pulling from Docker Hub in production, you are insane.


So you should mirror those containers. So it's like, almost all large users, they are mirroring those containers, or they're building from scratch, or whatever else. So to me, it's super cool that it's been downloaded a billion times from Docker Hub. And then if you think about how many times the container has actually been downloaded, I mean, I'm sure it's many multiples of that, which is quite cool.

ABDEL SGHIOUAR: Yeah, that's amazing. So I want to switch gears a little bit to talk about some efforts that have been happening around Envoy. And one of them is the Service Mesh Interface, SMI, which has been discontinued recently. What are your thoughts on that? Let's start with a brief introduction: what is SMI?

MATT KLEIN: I'm going to give a very large disclaimer, which is that I have generally stayed out of a lot of what I would call the service mesh wars. So I'm happy to give my thoughts on it. But I'm definitely-- I have not been in it.

ABDEL SGHIOUAR: OK. I want your thoughts.

MATT KLEIN: So yeah, I mean, in terms of what it is, is that it was an effort to basically make an abstract interface to allow configuring service meshes across vendors. It's very similar to, within Kubernetes, we have Ingress. And then there's Ingress V2 and the gateway API and all of these things. And I guess my take-- and I'm speaking for myself now, I'm not speaking on behalf of any organization-- is that I think that it is very tricky to do these things.

And the reason that it's tricky is that there's this constant tension of coming up with what I would call the LCD, the lowest common denominator, config that applies to all of the different technologies. And then what ends up happening in almost every case-- we see this with Ingress, we saw it with SMI-- is that people are not dumb. Like, they know what's being used.

So what ends up happening constantly is people would come in-- let's actually not talk about SMI. Let's just talk about Ingress. And people would come in. And they would say, well, I know I'm using Nginx. And I know that it has these, like, 77 features. And I want to use them. So it's like, how do I do that? Or I know that we're using Envoy. And I know that Envoy has these, like, 100 features. Let me use them. So what they ended up doing in almost all cases is they end up having these extensions.

So it's like, they've got the Ingress API. Then they've got the extensions. And then you're no better than where you were before because if everyone is using extensions, it's not multi-vendor. I mean, you've pretended that you have this common API when you don't. And I haven't tracked the service mesh interface stuff. But I'm guessing that it fell to really the same concerns, which is that they're trying to come up with this common thing.
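The extension escape hatch Matt describes is easy to see in practice with Kubernetes Ingress. Below is a hedged, illustrative example (hostnames and service names are made up): the Ingress resource itself is portable, but the actual behavior rides in a vendor-specific annotation that only the NGINX ingress controller understands, so moving to a different controller silently drops it:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    # Vendor extension: only ingress-nginx interprets this; any other
    # controller ignores it, so the config is no longer truly portable.
    nginx.ingress.kubernetes.io/proxy-body-size: "8m"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com        # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-svc      # placeholder service
                port:
                  number: 80
```

Multiply that one annotation by the dozens a real deployment accumulates, and you get exactly the "common API that isn't" problem being discussed.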

But the vendors all support all these different features. And then people want those features. So it's basically impossible to come up with an API that actually works for everyone. So I think it's a great goal. I just don't know how realistic that goal is just because of the tension of people wanting to expose features and people wanting to actually use those features.

ABDEL SGHIOUAR: That's an interesting observation, actually. The way I understood it when I was looking into SMI, it looked like it came from the same place most of these interfaces in Kubernetes came from, like the CRI, the CSI-- the container runtime interface, the container storage interface-- which is standardizing the interface, in a way. But I think the key difference is that Kubernetes has more control over its surface compared to trying to standardize multiple proxies across multiple vendors. And that's probably one of the reasons why SMI fell through.

MATT KLEIN: Even there, though, if you look at a lot of the other Kube stuff-- let's take a, quote, "simple" example of how people load secrets. If I'm on AWS and I want to store my secrets in Secrets Manager, yes, I'm using a CSI plugin. But then I have this custom YAML that's specific to AWS. So I'm just saying, this problem leaks into almost everything.

So I'm not saying that it's not a valid effort to go and attempt to have common APIs. I just don't know how realistic it is for some of the-- let's call it the low-level nuts and bolts portions of the ecosystem. So it's like, to me, Kubernetes is a building block that you use to build larger things that might be easier for people to consume. Envoy is a building block. It's a very complicated piece of software. And then you have things that come on top.

And they try to make it a bit more opinionated. And they try to make it more of a platform. And they might expose different APIs. And that's completely fine. But then you're giving people an opinionated interface to what they might want to use, as opposed to trying to claim that Kubernetes or Envoy or something like that is this common API that is going to satisfy all vendors. And even for Envoy, I will comment on Envoy's API, which we call xDS.

Now, at this point, it is-- I mean, I don't even know-- tens of thousands, maybe 100,000 lines of proto. It is a giant, giant API. And a couple of years ago, it was actually quite cool. We had the gRPC folks come along. And they started to use the xDS APIs in gRPC. And that's fantastic. That's great. I mean, that's something that we would like to do. But even there, if you look at the complexity of trying to have a common API, even between two implementations, gRPC and Envoy-- the amount of discussion that has to happen around, this works for Envoy but not for gRPC, all of these things.

And if you look at what they're trying to do with Kubernetes, something like Ingress or a service mesh interface and have that be applicable across tens or hundreds of vendors, I'm not saying it's an effort that's not worth trying. I'm just saying that you're always going to be dealing with these abstraction leakages and boundaries that people are breaking. And then you have to ask yourself, what have we done really if people are always using these extensions? Like, is it useful? And that part is not super clear to me, to be honest.

ABDEL SGHIOUAR: Yeah, I do know very well where you're coming from because I was involved in the Ingress API for a while and now the transition toward the Gateway API, which tried to solve the-- you deploy an Ingress. And then you have 100 annotations just because you want to do something specific to one vendor. And then the Gateway API is attempting to solve that. But it's progressing slowly because of what you are talking about because it's difficult to get everybody to agree on a certain way of doing things.

MATT KLEIN: Right. And then, when you look at customers who just want to get stuff done, I think at a certain point, they end up not caring. Like, they would rather just use the extensions and like--

ABDEL SGHIOUAR: Yeah, get stuff done, essentially, yeah.

MATT KLEIN: Yeah, effectively, yeah.

ABDEL SGHIOUAR: Just want to move on with your life. So, then, I want to move to the next thing, Envoy Mobile. I remember reading the blog post that was introducing this a few years ago. And I haven't looked at it for a while. So excuse me if I'm asking a stupid question. But I think I remember, back in the days, I could not wrap my head around it, like what it is.

MATT KLEIN: What it is is it takes the core of Envoy. And it basically runs it on the phone. So it's a networking library. And for those that don't know, Google itself has done something like this for quite some time. So there's the Chrome networking code. It's called Cronet. And that's code that runs in Chrome. And I think Google might use some of that internally. I'm not sure. And then they have run it as a library, like a cross-platform networking library, on both iOS and Android.

And again, the idea-- I don't think it's been anywhere near as successful as Envoy server-- was that, again, you have this very complicated code base. And you have all of these networking concerns that also have to run on mobile. Let's attempt to use the same code in all places. And the project definitely has been going in fits and starts. But I think Google is still very actively working on it, partly because no one is working on Cronet anymore.

And I think Google has so widely adopted Envoy that, for the reason that I've talked about, I think they would like to have one code base, basically, that runs everywhere. So I think Google is still working on it quite heavily. I don't know the current status of where it's deployed. And I think that there are a bunch of other companies that have historically used Cronet. But because Cronet is no longer maintained, I think Envoy Mobile is probably the likely migration path for those people.

ABDEL SGHIOUAR: Got it. Just for my own understanding, it's a library. It's not a separate process. So you have to import it into your code and use it.


ABDEL SGHIOUAR: So it's quite different than Envoy Server.

MATT KLEIN: Yep, mm-hmm.

ABDEL SGHIOUAR: Cool, got it. So I went scouting a little bit around your Twitter. And I saw that you are trashing on Go in favor of Rust.

MATT KLEIN: Oh, yes.

ABDEL SGHIOUAR: I mean, it's on Twitter. So what's the deal there? Why?

MATT KLEIN: I don't know that I want to get into a huge programming flame war.

ABDEL SGHIOUAR: Those are the best.

MATT KLEIN: Yeah, I'm just not a big fan of Go. I just don't really like using it, for a large variety of reasons. And I think that, again, most of my career was spent-- just because of the low-level systems nature of what I was doing, it's really all been assembly, C, C++. I mean, that was the great majority of my career. And I've switched basically 100% to Rust. So all of the technology that we're building for bitdrift is basically 100% Rust.

And I am a big Rust fanboy. The way that I view Rust is that they basically took C++ and fixed every single thing that is wrong with it. So for me, having come from C++-- and I never hated the language. I think it's fine for what it does. But it's just been a very natural progression to move to Rust. So I would much rather talk about why I love Rust than why I hate Go.

ABDEL SGHIOUAR: Yeah. I mean, that's a very good point. I think Rust is probably one of these programming languages that I want to have on the show at some point. So I'll probably call you to discuss. I've never written code in Rust, so I have no idea. I have to do my own research before I start discussing it with people.

MATT KLEIN: I think what is interesting about Rust-- and it's not surprising that I like it-- is that it was made by Mozilla. And it was made to replace C++. So the people that made Rust were trying to get C++ programmers to switch to it. So I think the languages actually are very similar. If you look at modern C++, like C++20, and you compare that to Rust, just from a linguistic perspective, they are pretty similar. So from a C++ programmer's perspective, it's very natural to move to Rust.

But the way that I talk about it is that all of the stuff that I used to do in my head-- figuring out, is this reference dangling? Is this thing doing this? Is that doing that?-- the compiler just does it for you. So to me, Rust is an absolutely incredible technical achievement. And if you look at it just from the perspective of what the compiler catches, but even furthermore, the way that they handle data races, Rust is basically bulletproof. I know there's all of these comments around, like, you can't write bugs in Rust or whatever. Of course you can write bugs in Rust.

But the amount of work that the compiler does for you, I feel so productive. And what's incredible to me about Rust is that, and this is going to sound crazy, it has replaced Python for me. I mean, it's like, what's so amazing about the language is that they've done such a good job with the VS Code integration and the language servers and the whole ecosystem of packages that I can write small programs. I can write sophisticated systems. And I can do it super productively.

And then, when you look at something like Go-- again, I don't want to spend a lot of time, actually, talking about why I don't like Go. It's more that, in large Go code bases, you still have data races. You have actually a lot of complicated issues around channels. And people don't understand how they work. So it's like, there end up being a lot of bugs, actually, in Go programs related to memory leaks, data races, all of these things. These bugs are impossible in Rust. So to me, it's just a much more productive way of getting stuff done.

ABDEL SGHIOUAR: Got it. So speaking of productivity-- I know that this is probably off topic-- do you use any of these code assistant AI tools?

MATT KLEIN: It's funny you said that. I got access to Copilot for the first time today. So I have not actually used it yet. But I keep hearing people talk about it. And I'm going to try.

ABDEL SGHIOUAR: I'll be looking for your tweets, then.

MATT KLEIN: Yeah, so you'll have to check back in with me. But up to today, no. I will occasionally ask ChatGPT how to write bash scripts because I am very bad at bash but, other than that, no.

ABDEL SGHIOUAR: OK. I think I, and everybody who listens to this episode, should go check your Twitter, because you share your opinions on Twitter. So my last question-- and I don't want to take more of your time-- let's talk about your current company, bitdrift.


ABDEL SGHIOUAR: What are you folks doing?

MATT KLEIN: So the very quick version is that we have believed for a while that, in the observability space, there's, I would say, misaligned incentives between the people that are using observability systems and most of the vendors in this space. And by misalignment, I mean most vendors are charging by volume. It's like they want you to take in as many logs and as many metrics and as many traces as possible. And they're not super incentivized to help you get value out of that.

And on the consumer side, it's constant. There's memes about this. It's that no one's really happy about their observability bills. It's like, most people feel they're spending too much money or they're not getting enough value. So what we are doing at bitdrift and what we've shipped publicly with our Capture product, which is a mobile-only product right now-- we are working on server-- is we're kind of flipping things around where we've built a fairly sophisticated control plane, similar to how Envoy works.

So there's a data plane, which is a library that runs on the phone. And there's a control plane. And we're able to actually keep telemetry on the phones until it's explicitly asked for. And that might be asked for via particular filters, or it might be asked for via workflow rules where we might be able to say, you hit this log. Then you get this 500 from the server and then dump all the data on the phone.


MATT KLEIN: And we keep this ring buffer on the phone that stores local telemetry. And then we can flush that to the server. So the idea is that we can greatly reduce costs because we're not ingesting data that people are likely never to look at. And we're helping people get access to their telemetry that they want for the problems that they're actually debugging.

So it is a very interesting space. We'll see what happens, of course. We're a tiny startup. But I am having fun. And I'm looking forward to seeing if we can change the game here a bit.

ABDEL SGHIOUAR: That sounds amazing. So yeah, we'll have a link to the website of the company, bitdrift.io, in the show notes.

MATT KLEIN: Yeah, great. Thank you.

ABDEL SGHIOUAR: Well, awesome. Thank you very much, Matt. Thanks for your time.

MATT KLEIN: Thanks for having me.

ABDEL SGHIOUAR: I had a great time discussing with you. And yeah, good luck with your next endeavors.

MATT KLEIN: Thank you very much.


ABDEL SGHIOUAR: That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at <kubernetespodcast@google.com>. You can also check out our website, kubernetespodcast.com, where you will find transcripts and show notes, as well as links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thank you for listening. And we'll see you next time.