Kubernetes Podcast from Google: Episode 156 - Opstrace, with Sebastien Pahl

#156 July 28, 2021

Opstrace, with Sebastien Pahl

Hosts: Craig Box, Jimmy Moore

Sebastien Pahl is a pioneer of container technology, building the predecessor to Docker as a co-founder of dotCloud. After working at some big tech companies, he’s back to the startup life as co-founder of Opstrace, a fully open source observability distribution, built on top of the tools you know and love.

Transcript

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box, with my very special guest host, Jimmy Moore.

[MUSIC PLAYING]

CRAIG BOX: We talked a couple of weeks ago about the Olympic opening ceremony. And Jimmy, as an event planner, you had opinions and thoughts and effectively said that it was going to be the highlight of the games for you. Was it everything you dreamed it would be?

JIMMY MOORE: Oh, such high hopes, such high hopes. But, you know, I knew it was going to be a different experience this time. And I think they did a great job demonstrating the story of our global struggle over the past year and kind of the anticipation of the games, which were, of course, canceled in 2020 and now the care in which we're approaching it this year.

From a show perspective, I'm usually a big fan of the technology pieces of the show, but, this year, actually, my favorite part was the icons. They did this really cool thing with the icons they made for each sport, acting them out with a physical kind of stick person. And that's, of course, a tribute to the fact that Japan introduced the icons for all 50 sports to the Olympics organization back when they hosted. I think it was in the '50s.

CRAIG BOX: Right.

JIMMY MOORE: Yeah, it was fantastic.

CRAIG BOX: I don't know if you saw. There was a South Korean TV station that was representing all of the countries with an icon. And they had to apologize for Italy for using pizza as their description for the Italian team.

JIMMY MOORE: [LAUGHTER] I didn't see that. But I would never be offended to be represented by pizza.

CRAIG BOX: Did you catch the Tongan flag bearer?

JIMMY MOORE: Oh, of course. I mean, why change the best part of the Olympic tradition? Everyone posts that as a meme for sure in my circles.

CRAIG BOX: I didn't see the opening ceremony, but I did hear that he was back. This guy managed to find a sport to compete in in both the Summer and the Winter Games basically just so he could oil himself up and walk with the flag in the opening ceremonies.

JIMMY MOORE: My friends all said, if it's not broke, don't fix it.

CRAIG BOX: Now, tell me. The idea of using thousands of drones to spell things out in the sky, do you think that's cheating?

JIMMY MOORE: No, not at all. I mean, it's amazing. We're a technology company, right? We like that kind of thing. I love just the idea of saying, look what we can do, right? I mean, to an extent.

I don't need to go to space on a rocket, like we talked about last week, but I do say, let's put 1,800 tiny little planes above a stadium and see what they can do. In fact, I kind of wish they did more. They made a really cool globe and made a neat shape, but I could have imagined a few more things. Might as well, spending the time and money and battery power on all those little drones.

CRAIG BOX: Do you think that the 4th of July will start becoming a drone display festival in parts of the US?

JIMMY MOORE: You know, I kind of hope so actually because global climate change and whatnot. I don't know if exploding fireworks in the sky is really going to be in vogue for much longer.

CRAIG BOX: Well, speaking of climate change, there's been some big storms approaching Japan. And it seems like they may be diverted around the Tokyo metro area, which is probably good for the canoe slalom teams. They really wouldn't have looked forward to that.

JIMMY MOORE: No, it would become white rapids, right? A white rapid competition.

CRAIG BOX: Perhaps. One place that the very inclement weather seems to have been noticed worldwide is in metro stations. A couple of weeks ago, there was video of a hurricane hitting New York and people trying to wade into the subway there.

And, just this week, there was a month's worth of rain dumped here on London in a one-day period. And a railway station, bringing it all back together, right near the London Olympic Park was very flooded in the video that you can find linked in the show notes. So maybe get your waders on if you want to catch the train.

JIMMY MOORE: Yeah, absolutely. You know, I will say, New York did it first. We had Hurricane Sandy five or six years ago where it completely flooded, and it kind of blew me away. I actually thought London might be more prepared for this. As I understand, it rains every day of the year there.

CRAIG BOX: It does, but the drains were all built 150 years ago. And we haven't really kept up with the technology required.

JIMMY MOORE: Perhaps with the next monarch?

CRAIG BOX: We'll keep that in mind.

JIMMY MOORE: Let's get to the news.

[MUSIC PLAYING]

JIMMY MOORE: The release candidate for Kubernetes 1.22 is out. As a reminder, there are now only three releases per year. And this, the second one, is due out on August the 4th. You may remember from our interview with the 1.21 release team lead, Nabarun Pal, that testing a release candidate is one of the most important things you can do. And he made a promise to send us swag. Maybe don't hold him to that though.

CRAIG BOX: The Cloud Foundry Foundation announced version 5 of their cf-for-k8s platform, leveraging new upgrade features in Istio and supporting the latest generation of Paketo Buildpacks. Foundation chair and episode 105 guest, Chip Childers, told "Container Journal" that the CFF is committed to making sure Cloud Foundry API entities have an analogous representation in the Kubernetes API and that cf-for-k8s should eventually allow users to manipulate either representation. This will help Cloud Foundry and other Kubernetes workloads better coexist and interact on the same cluster.

JIMMY MOORE: Last week's guest, Priya Wadhwa, mentioned the Connaisseur project, an admission controller to verify container signatures before allowing them to run on your cluster. This week, that project launched version 2.0, supporting multiple keys or multiple validators at the same time. The Notary project from Docker is fully supported with experimental support for Sigstore and Cosign now updated to the latest version.

CRAIG BOX: Also releasing version 2.0 this week is Chaos Mesh, originally built by PingCAP to test TiDB and discussed in episode 121. This release refactors the chaos controller, which, side note, would be a great name for a super villain, to allow more accurate description of your chaos. A new schedule concept allows you to regularly run experiments. And the new workflow API lets you run multiple schedules or chaos runs in serial or in parallel.

JIMMY MOORE: Enterprise Kubernetes vendor Spectro Cloud has announced a $20 million Series A funding round, taking their total funding to $27.5 million. Spectro Cloud's platform manages the full lifecycle of both clusters it creates based on the cluster API and third-party clusters that it can manage with an agent.

CRAIG BOX: Finally, with Kubernetes 1.22 almost out the door, it's time to think about joining the release team for 1.23. Opportunities to shadow key release roles are available, as the team is currently being put together. If you would like to get involved, listen to one of our regular release team lead interviews, and then fill in the form linked in the show notes.

JIMMY MOORE: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Sebastien Pahl is the co-founder and CEO of Opstrace. He worked at Red Hat, Mesosphere, and Cloudflare and was a co-founder of the company that became Docker. Welcome to the show, Sebastien.

SEBASTIEN PAHL: Thank you very much.

CRAIG BOX: You have a French first name, a German surname, and your accent is very hard to pin down. Where was home for you growing up?

SEBASTIEN PAHL: I was born in France. And then I spent most of my young life in Hamburg in Germany. And then I went back to France until I eventually came here to the US for dotCloud.

CRAIG BOX: You went through all your education in Europe?

SEBASTIEN PAHL: Yeah, I was in a French school in Germany and higher education in Europe. Even afterwards, I studied at a school called EPITECH where we were coding the entire time in France. I pedaled between the two countries quite a bit.

CRAIG BOX: EPITECH was where you first met Solomon Hykes?

SEBASTIEN PAHL: That's correct. We met, and we worked together at a company where I interned at first. And then we had a bunch of things in common.

CRAIG BOX: Solomon was the CEO of Docker for a while. And, when he left the company, he wrote a little story, which I'll quote a little bit here from. He says, "10 years ago, I quit my job, returned to live with my mother in Paris, and, together with my friends Kamel and Seb," yourself, "started a company called dotCloud." He says, "I was 24 and had no idea what I was doing. We needed a CEO. So that became my new role."

So Seb, my question to you, how old were you? And did you know what you were doing?

SEBASTIEN PAHL: Well, I was still finishing my studies. So my last year of EPITECH was actually working on dotCloud together with Solomon basically full time. And then, after this, when the study was over, we continued in France. France was a kind of a time of experimentation for us where we built a lot of technology and everything, but the actual founding of the company that became Docker was after we came to the United States with Y Combinator.

CRAIG BOX: If Solomon was the CEO, what was your role?

SEBASTIEN PAHL: Back then, during the French times and also during YC and everything, we were both just working on the technology, basically, working, building containers. I still remember when, all the way in the beginning, we were doing things with OpenVZ, and we were putting containers in Mercurial. All these ideas came much later, but the role was just hacking and building things. We were also doing some consulting on the side, but that wasn't that important.

CRAIG BOX: Yes, it's very different when you say you were building containers versus what someone would say if they're building containers today.

SEBASTIEN PAHL: Yeah, it's true. We were using existing container technology. That has to be said, like I said, OpenVZ back then. But then we were actually trying to make it easy, trying to democratize it, trying to not have people redo the same things over and over again. And really it wasn't so much about containers, but about-- as much as we liked things like Chef and Puppet, it was about breaking out of the cycle where computers had to rebuild themselves all the time. "Build them once, patch the software once" was one of our big ideas, but we were also trying to build something that would look like Kubernetes, obviously, very different, but that was the intent back then.

CRAIG BOX: I think I had always thought that Docker was based on LXC. Was that part of the tech stack at the time?

SEBASTIEN PAHL: That is something we moved to later. I started playing with LXC right before YC or something like this because it was very simple. OpenVZ needed you to patch the kernel heavily, and we wanted to run on AWS. To do that, I found a kernel from Ubuntu that-- you couldn't do your own kernel back then on AWS, but one kernel of Ubuntu had the right sets of things activated so that you could run LXC containers. So that's how we got to LXC. That was even long before Docker. We're talking 2010 here.

CRAIG BOX: Yeah, I mean, AWS wasn't even very old at that time.

SEBASTIEN PAHL: Yeah, it was fun. I also remember that, during YC, AWS finally said you can build your own kernels, and I spent a night and a half figuring it out without docs. You're like, I just want my kernel with everything I need. I don't want this Ubuntu kernel that they have. It wasn't bad. It was just not what we wanted. Fun times.

CRAIG BOX: The cloud enabled everyone to replicate the experience they had 10 years earlier of wanting to compile Linux on their own desktop machine.

SEBASTIEN PAHL: Pretty much.

CRAIG BOX: A few times now, you've mentioned the trip to the US to go through the Y Combinator startup incubator. Tell me about that experience. How did you apply to YC?

SEBASTIEN PAHL: We actually applied twice. We applied once, didn't get in. But then we applied again for summer 2010. We actually got accepted to the interviews.

As soon as we got accepted, we booked plane tickets, and we flew off and spent three weeks in California talking with other YC companies and everything until we did the interviews. It was very intimidating back then. Looking at it from now, it wasn't that crazy, but, yeah, it was very intimidating. We went through the interviews with Paul Graham and all the others, Jessica [Livingston], and then got in.

And it was fascinating because the difference between France and the US was that we got people to give us angel checks on the spot without much discussions. That was the big thing.

CRAIG BOX: Was there any kind of startup scene in France at the time?

SEBASTIEN PAHL: There were people doing startups, but I wouldn't call it a startup scene. I think it's gotten better now, but I don't want to comment too much. I don't do startups in France, but it's gotten better. I know a bunch of good French companies and founders. I don't know how good it is financing-wise.

CRAIG BOX: Were YC evaluating the idea or the people with the idea?

SEBASTIEN PAHL: I would say both honestly. It's funny because I don't even know how much they were evaluating us back then, but it doesn't matter. It worked out.

CRAIG BOX: How long was the actual YC process? And what was life like for you during those weeks?

SEBASTIEN PAHL: We arrived in California. And I stayed in a motel, and I coded. And then we went to the interview, did the interview. And then, after this, it was a bit more relaxing. We got to spend more time visiting other people, still working on our project, but we knew, OK, we spent time finding an apartment to actually move there during the summer.

CRAIG BOX: Good.

SEBASTIEN PAHL: That's it. And then YC was the typical YC process, three months, but it was hard because we were still trying to figure out what to do with our technologies. We went back and forth a lot.

The initial idea was dotCloud. It was like Heroku for everything, right? We didn't know how complicated things were really back then. So we were like, yeah, sure, everybody can run databases. Everybody can run this on us. Like we'll just let you do anything. That was quite formative, I would say.

CRAIG BOX: It's easy to look back and say, well, that wasn't possible with the technology of the time, but it is very much a thing that people are doing today through operators and so on, as we'll talk about later on. But do you think that the technology was just a little ahead of the time then? Or do you think that this lit the fuse for an idea that people wanted?

SEBASTIEN PAHL: It just lit the fuse for ideas that people wanted for sure. We always believed that. The real thing came when I wasn't even at the company anymore when they broke out the core, which became Docker, and forgot about the platform and the experiments-- not forgot because they had a lot of ideas of what to do with, but actually making it easy for everyone to never again have to configure a database on your laptop just to develop on it.

Everybody now does Docker-run database. And then you can just use it. Let's not talk about production. That's another-- we can, but production is a whole different discussion.

CRAIG BOX: dotCloud took a $10 million funding round in March of 2011, but you left the company in December. What can you tell me about the formative months of the company and then your decision to move on?

SEBASTIEN PAHL: It was a different time, you know? During those times, we were building that platform. But it was tough. We were building a PaaS. We didn't know really where this was going. As a usual thing, co-founders don't always get along. We had been working together for many years. And you sometimes don't really know, and you take a different path. That's actually how I got to Cloudflare. So I don't regret any of that.

CRAIG BOX: Where did you meet the Cloudflare team?

SEBASTIEN PAHL: That's the most Silicon Valley thing. They were actually in the same building, at least in the same building at first. So I met Matthew Prince because he knocked on our window. It's that simple.

CRAIG BOX: Did he leave his keys behind or something? He wasn't able to get in?

SEBASTIEN PAHL: No. When you worked in that Cloudflare building, you would always pass the Founders Den, which is where we were. It's the place where we had our office. And then you would just always pass in that alley in front of these windows when you went upstairs. I later learned that because I later actually worked upstairs. That was cool.

CRAIG BOX: So the transition between downstairs and upstairs, what did you work on at Cloudflare?

SEBASTIEN PAHL: At Cloudflare, I joined a company that was very exciting. I was, I think, around the 30th employee there. And I basically arrived, and people told me like, build whatever you think is useful. And so then I said, oh, you don't have metrics. Let's put metrics in. So that was the first thing that I did.

And then most of the things that I ended up working on were I did a lot of projects here and there, but mostly focused around how to manage deployment of software, the SRE-type side of things, right? I ended up together with somebody else leading the SRE team for a while. I focused more on the software side of things, tools to deploy, monitor Cloudflare, help, but, honestly, anything and everything that was interesting that needed to be done there.

I even worked on the TLS stuff. That was fascinating. Cloudflare was an amazing company that was the most fast-paced thing I've ever seen when I worked there.

CRAIG BOX: When did you first come across Kubernetes?

SEBASTIEN PAHL: During the time at Mesosphere. I mean, I read about it near the end of my time at Cloudflare, but, at Mesosphere, that's where I ran across Kubernetes the most. Later on, I even led the team that helped pivot Mesosphere to Kubernetes. That was fun.

CRAIG BOX: Can you tell us about that time there?

SEBASTIEN PAHL: It was another very interesting time. I went there because the idea was to build what they called a data center operating system. So that sounded cool. So I went there for that. I was doing containers full time again. Why not? That sounds great.

And so the time there was mostly-- I don't want to be demeaning, but we ended up building the HD DVD of container worlds. And that's fine. You can still have a lot of fun. And then the company eventually pivoted to Kubernetes, which we can call the Blu-ray.

CRAIG BOX: Well, you could use the Betamax analogy because the one thing people say is Betamax was actually a slightly better technology. So it's a bit of a flex when people say that.

SEBASTIEN PAHL: Yeah, but I didn't want to go that far back in time. I have known HD DVDs, but I've never known a Betamax. So that's why I don't use that analogy.

CRAIG BOX: And fair enough.

SEBASTIEN PAHL: Yeah. It was a fun time. Most of the team of the company that I'm in right now comes from there. So it definitely was worth it spending time on these other things.

CRAIG BOX: When you started working back on containers again, did you recognize code that you had written yourself in what was then Docker?

SEBASTIEN PAHL: I actually started using containers at Cloudflare, as soon as Docker opened and was there. None of the code in Docker was mine because Docker was rewritten in Go. And, back then, we wrote everything in Python. But the ideas were super cool to use, right? Finally, I was like we don't need to reinvent this stuff.

I used, at Cloudflare, Docker right away to start building Debian packages, very boring, as everybody built the Debian packages the same way. It wasn't quite ready to do much more than that back then. Thankfully, it is now. And, later on, I ended up, like I said, running Mesos there. That's where we used containers too.

CRAIG BOX: Did you have an affinity that led you to Mesosphere, aside from the transition to Kubernetes?

SEBASTIEN PAHL: No, it was the thing that was there that worked that you could then build higher-level things on top. I like low-level things, but I do prefer the higher-level things when it starts getting easy for other users to do. That's what drove me there. That's it. Kubernetes is also this thing where I see it as a tool kit that you use to create things that make things easy for others. It itself is OK, easy. Yeah.

CRAIG BOX: One of the ways that you can make things easy to run is using operators and the operator pattern on top of Kubernetes. You were working on the Operator SDK at Red Hat for a while.

SEBASTIEN PAHL: Yeah, that was very fun. I joined Red Hat because I knew the CoreOS team quite a bit, and they ended up going there. And I was very interested in seeing how a company gets acquired and how it happens like when it's integrated, especially that they got acquired for their ideas and integrating their ideas into Red Hat.

And so I ended up leading a bunch of teams, including the Operator Framework team, which one of the pieces was the Operator SDK. And yeah, that was one of the things where, again, how can we encode certain things so that you don't have to rewrite them again and again and again? And how can you abstract these ideas?

I think it could go much further than what it is today, but these things are a good base. Obviously, my work there was a bit more managing teams and leading teams than actually being involved in the tech itself. So that's a bit of the difference.

CRAIG BOX: Was that something that you missed that you wanted to get back to, building a product from scratch?

SEBASTIEN PAHL: Oh, absolutely. That's why we started a company.

CRAIG BOX: You left Red Hat to join Y Combinator a second time. How was it going back? Was the process largely the same? And was the experience similar with many years of experience under your belt at that point?

SEBASTIEN PAHL: I wanted to go back to YC, one, because my co-founder, Mat Appelman, hadn't gone through that. So that's one of the things that is quite nice to have on their belts, to have that network of people. YC is all about the network of people.

And it's different when you're doing a company outside of it, even if you've been through it before, especially since we went in, and we didn't have the idea of Opstrace that we have today. So we wanted to go into a place where we would have a framework to get to our idea, that would have a place that pushes us towards that. And YC is very good at that.

And, also, being surrounded by like-minded people during such a hard time, which is the birthing of a company, that's quite useful. That's something I wanted to relive again.

YC itself had changed quite a bit. It's tremendously big now. But it was cool. They built quite a lot of cool things to be able to scale to this higher size. We even had things that didn't exist back then like group office hours where you would together see what other companies were doing and how they were advancing. So, no, it was awesome. And then the demo day was the same thing as usual. That was also a fun part.

CRAIG BOX: Were Paul and Jessica still involved? Or had they largely moved on by that point?

SEBASTIEN PAHL: Oh, no. We met them because they came to talk, but they're not involved, from what I know, in the day-to-day activities. I don't want to speak for them of course.

CRAIG BOX: You said that the idea that you had wasn't what ended up being Opstrace. Can you tell me what that idea was?

SEBASTIEN PAHL: Yeah, we wanted to help people run thousands of Kubernetes clusters. It turns out there's not that many people that want to or need to run thousands of Kubernetes clusters. So that was the initial idea.

And then we ended up going through quite a rigid process of Q&A with companies where we would ask the same questions over and over and over again about how they manage their stack, what they use to monitor it, and things like this. And this is how we got to our current idea.

CRAIG BOX: Do you think it's a failing of Kubernetes' design that that's a thing you might think you need to do rather than simply having one cluster that you could do a thousand tenants within?

SEBASTIEN PAHL: No, I'm not so sure. One cluster tends to be complicated honestly. I wouldn't call it a failure in its design. I personally like to compartmentalize things. But it wasn't even that.

Like running thousands of Kubernetes clusters was not just because you wanted to split it up, but how can you basically rebuild the whole infrastructure from scratch easily, have staging environments that are completely independent? But, on the other hand, you could argue, yeah, it should be all doable in namespaces and everything. No, I don't think it's a flaw. I don't even think our idea was that good. Let's be honest here.

CRAIG BOX: Well, at least in the conversations that you had with the people you spoke to, you ended up with a different idea. What points were those people raising in terms of what they were doing in their monitoring at the time and what problems they saw? And how did that guide you to what you would eventually build with Opstrace?

SEBASTIEN PAHL: We saw that, for most people, first of all, monitoring was very bad. That's not something that we learned just there. That's something we remember from working before, for example, at Mesosphere with bigger, larger enterprises. We noticed that every company rebuilds their monitoring stack again and again and again when they choose to build.

Or they pay through the nose by sending it to vendors that charge for the amount of data that you sent there, which is fine when that's the only solution, but we came to the conclusion that, with the technologies that actually happened in the last, let's call it, five years, a bit more than that now, like Kubernetes, you now have programmable APIs for everything. And that's how we got to the idea that you should be able to set all this up in your infrastructure without their knowledge, right? So think of it as like a mega operator, but we thought let's focus it on one thing and one thing only, observability.

CRAIG BOX: There are a number of open-source observability tools, many of which are part of the Opstrace platform. And then there are the Datadog, as you like to talk about, the SignalFx, and so on, the third party, pay by the number of metrics things. How does Opstrace bridge those two worlds?

SEBASTIEN PAHL: Our idea was to say, SaaS is great. SaaS, actually, what Datadog, SignalFx, and others do, it's what a lot of people want. They can command that pricing because all you have to do, quote unquote, of course, is to "plug in some agents, start sending the data," and then you're done. You can be a consumer of it. You don't have to build the platform yourself.

When you move to the open-source world, which is the world that we prefer, the world of Prometheus, the world of other pieces like this, suddenly, you need experts. And everybody can become an expert. That's OK, just like the rest of the open-source world. But you do have to spend the time, invest the time to learn, set it up, and then maintain all of this.

And one single Prometheus is easy. But, once you start wanting to do these things at scale and not rebuild your stack again and again, not have completely dedicated teams that are not your business, that's where open source becomes hard. And that's what we wanted to solve. So we wanted to create a platform that somebody can start with when they're tiny and then grow with. And, if they scale, they can just scale it easily without having to know what to do in the intricate details.

CRAIG BOX: So, when I'm installing Opstrace, what am I getting?

SEBASTIEN PAHL: Today, you get a CLI. You download the CLI. And it sets up an entire observability stack in your cloud account. So take, for example, GCP. We support GCP and AWS because, when you build things for two clouds, it's easier to go to "more than two clouds" versus from one to X.

You give it, let's say, an empty GCP project. And then it sets up the entire platform there. So it will start by setting up the network. It will also set up the GKE. We use the managed Kubernetes offerings of the providers, didn't want to rewrite that. And then it sets up the GKE cluster.

And then, on top of that, then it starts a controller or operator, if you want, inside of that cluster that then deploys the entire stack there-- Cortex, Loki, Grafana for each tenant so there's more than one, the alert system, and the other pieces that we have in there. What it does then is we basically put one version number on top of all of this and allow people to upgrade from one version to another.

We test it. We make sure that it works under load. That's the high-level thing. It also, obviously, sets up TLS by default. It sets up authentication for everything. You need tokens to access the data. In open source, security is often left as an exercise for the user. That's what we wanted to avoid.

So that's what it does. So you end up with an entire stack that you can then just start sending data to it. And then we have other things that we have built on top of course.
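
As a rough picture of the flow described above, the sketch below models the install as an ordered plan with a single version number on top. It is only an illustration, not the Opstrace CLI: `InstallPlan`, `planInstall`, the step wording, the tenant names, and the version string are all hypothetical, and the position of the TLS and authentication step in the sequence is an assumption.

```typescript
// Hypothetical model of the install order described in the interview.
// Nothing here talks to a cloud API; it only builds a plan.
interface InstallPlan {
  version: string; // one version number covering the whole stack
  steps: string[]; // actions in the order they would run
}

function planInstall(version: string, tenants: string[]): InstallPlan {
  // Infrastructure first: network, then the provider's managed Kubernetes.
  const steps: string[] = [
    "create network",
    "create managed Kubernetes cluster",
    "start controller inside the cluster",
  ];
  // The controller then deploys the stack; tenants are kept separate.
  for (const tenant of tenants) {
    steps.push(`deploy observability stack for tenant ${tenant}`);
  }
  // Assumed position: the source only says TLS and tokens are on by default.
  steps.push("configure TLS and token authentication");
  return { version, steps };
}

// Placeholder inputs, for illustration only.
const plan: InstallPlan = planInstall("1.0.0", ["dev", "prod"]);
plan.steps.forEach((step, i) => console.log(`${i + 1}. ${step}`));
```

Keeping the version on the plan rather than on each component mirrors the idea of upgrading the whole stack from one tested release to the next.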

CRAIG BOX: Let's talk, first of all, about the deployment system there, because that seems a little bit unique, in the sense that most people will have an installer, which will install to an existing Kubernetes cluster, one that's perhaps running the workloads that people want to monitor. But what you give to your installer is effectively credentials to a cloud, which will go off and build its own thing from scratch in parallel.

SEBASTIEN PAHL: Yeah. We made that choice because this is a monitoring system. And one mistake that we saw that most people do and still want to do is put their monitoring stack on the same Kubernetes cluster as their workloads, but this is monitoring. It needs to stay up when your cluster is down.

Plus, one of the advantages of doing it this way is that you're not prone to the errors of your own infrastructure. It stays separate. It's not a black box, but it's this thing that's on the side that you can run in a different region and in a different cloud. That's why we decided to do this.

And the other reason we decided to do it this way is that, this way, we can actually test it end to end. When you install things in people's Kubernetes clusters, the variations are infinite. And that's guaranteeing uptime, which is what we want to do with this. We want to get to a point where we can guarantee five nines of uptime.

That is much easier, at least, when you manage the entire infrastructure. And, given that this is such a special product, not Opstrace, but the observability is so special-- it's so different from the rest of the stack-- that's the choice that we made-- control more of the stack to guarantee more of it.

CRAIG BOX: That also gives you the advantage that the observability stack is running not only in the same cloud and, selectably of course, in the same region as your application if you want it to, but it's also in the same project. So that means I can tie the VPCs together. The data never leaves my personal environment. And I don't have to pay. I don't have to pay egress to send something out to somewhere else.

It also gives you the opportunity to use vendor storage. I understand that your metrics, for example, they get stored in GCS or in S3 rather than to disks on the clusters.

SEBASTIEN PAHL: That's correct. Before we found the databases that we wanted to use-- we don't invent new databases. We use Cortex and Loki, which already exist. Before we chose those, we wanted to make sure that the data doesn't stay on disk and doesn't stay in RAM because that's what's expensive in clouds if you want to keep this kind of stuff long term.

We also wanted to make sure, like you said, that the data never leaves the network, not just from a pricing perspective, but also from a security guarantee and from a privacy aspect. People today consider cloud VPCs their network. That's quite useful to do that.

CRAIG BOX: For people who want to go one step further into the past, perhaps, and talk about actual on-premises hardware, do you have a plan to support non-cloud environments?

SEBASTIEN PAHL: We talk about it, but we're not in a rush about any of that because, honestly, you can't control that environment as much. And my belief is that the cloud providers are going on prem with their own stuff. So I'd much rather piggyback on that.

And, honestly, also, you can have a lot on prem, but why have your observability stack there? You might as well put it somewhere else. If you have that much money to run on prem, you also have enough to pay a little bit of egress to send it over to a cloud provider and make sure that, when everything is down, you can still observe. So no, not yet.

CRAIG BOX: All of the open-source software you've mentioned so far comes with its own installation methods and so on. What have you had to build in terms of glue to tie the installation of them together with the one version number that you mentioned and then in terms of interface to make it feel like it's one coherent product to the end user?

SEBASTIEN PAHL: We built an installer and an operator to make sure that we can manage the life cycle of these systems. We don't want to deploy tens of thousands of lines of YAML by hand. It literally was tens of thousands of lines of YAML. We used TypeScript. We're moving certain things to Go, but, at first, we did this for a purely practical reason: converting this YAML to JSON was faster.

We ended up writing quite a bit of code to automate. When I say automate, that's make sure that it always works. Make sure that you can always interrupt the installer whenever you want, and it just picks up where it is. Make sure that we follow all the best practices. Whether they are documented or whether we had to discover them ourselves, we encode those best practices in code, all of it.
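
EDITOR'S NOTE: One common way to get an installer that can be interrupted and then "picks up where it is" is to record each completed step and skip it on the next run. The sketch below illustrates that general pattern only; it is not the Opstrace installer, and the step names, the state file name, and the `run_resumable` function are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_resumable(steps: List[Step], state_file: Path) -> List[str]:
    """Run steps in order, skipping any that an earlier run already completed."""
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished in a previous run, so do not repeat it
        action()  # may raise; the step is then not recorded and is retried next run
        done.add(name)
        state_file.write_text(json.dumps(sorted(done)))  # persist after every step
        executed.append(name)
    return executed


# Demo: the second step is interrupted once, then the installer is re-run.
calls = {"create-cluster": 0}


def create_cluster() -> None:
    calls["create-cluster"] += 1
    if calls["create-cluster"] == 1:
        raise RuntimeError("simulated interruption")


steps: List[Step] = [
    ("create-network", lambda: None),  # hypothetical step names
    ("create-cluster", create_cluster),
    ("deploy-cortex", lambda: None),
]

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "install-state.json"
    try:
        run_resumable(steps, state)
    except RuntimeError:
        pass  # the first run stops partway through
    resumed = run_resumable(steps, state)
    print(resumed)  # ['create-cluster', 'deploy-cortex']
```

The second run repeats only the step that failed and the one after it. A real installer would also need each step to be safe to retry after a partial failure, which this sketch does not attempt.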

And then the next thing that we built is a testing infrastructure. That's the main hard work that we did. We want to continue to move fast by being able to constantly, constantly test the system for correctness, for scale, for upgrades, all of this. This is how we can find bugs in Cortex or Loki and either report them or fix them by moving versions.

And why we did this this way is also we want people to upgrade all the time. We want to get to a point where we have a monthly release, and people constantly upgrade because, in the open-source world, things move fast. So it's all code for that.

And then, on top of it, of course, we ended up building more. We now have an admin interface because you talked a lot about installation and management, but the next place where people have difficulties in the open-source world is, when you go to a SaaS provider, the SaaS provider will guide you. You will come to an empty interface. It will tell you, click here. Download this. Do this.

We do the same thing in our interface. We have a management interface where you go, and it tells you, click here to install it to your Kubernetes cluster. And we will go. We will install Prometheus and Fluentd, for example, into that Kubernetes cluster, get the logs connected into the Opstrace cluster with a click or with an API call instead of having to craft all this yourself as well.

We also make sure that you can easily go get your cloud provider metrics from different clouds, whether it's Azure or GCP. All of this, we want to bind together and guide people. Go here to find the Grafana for your tenant and so on. So high-level tooling is where we are right now.

Just like also, when you install it, you get a domain name by default. We give you an opstrace.io domain name by default because domain names are hard. And, same thing for login, we give you social login by default. All of these things are there. That's what we build on top, making things as easy as possible.

CRAIG BOX: You mentioned multitenancy there. And you've also said that you have a separate Grafana installation for every tenant. Which things can be used in the multitenant situation? Do you need to have multiple Prometheus and Cortex backends? Are you able to do multitenancy with that? And what are the use cases that you see for multitenancy?

SEBASTIEN PAHL: We wanted multitenancy from the get go because, just like multiple clouds, multitenancy is something that, if you don't build from the beginning, you're going to have a very hard time down the road.

CRAIG BOX: As mentioned before with Kubernetes.

SEBASTIEN PAHL: Exactly the same thing, which is why Kubernetes also has like authentication and all this from the beginning because, otherwise, it becomes insane. For us, it was, we chose Cortex and Loki because they were multitenant by default. They supported this lightweight approach of, you give it a different user ID in an HTTP header, and it doesn't isolate the data. It's stored the same way, but you can query it in an isolated way.

And then, on our side, we create different Kubernetes namespaces for each tenant, in which we run one Grafana each. We give different authentication tokens to each tenant and so on. That's the multitenancy aspect. So everything is actually multitenant. And, when you deploy, let's say, if you say I want to scrape AWS CloudWatch metrics or Google metrics into my project, we deploy Prometheus exporters in the tenants that you ask for.

Why are tenants important? Well, the basics are, let's say, you have Prod, Staging, and then we call it System, which is Opstrace observing itself as well. So these are examples. Or multiple teams. We wanted to make sure that, with one installation, you could serve more than just one thing because that's how you can save on quite a bit of cost as well. It's not duplicated. Could you run one Opstrace instance for just a single tenant and then another and another? Sure, but that's not needed.
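
EDITOR'S NOTE: The lightweight tenancy model described above, where the tenant is identified by an ID carried with each HTTP request, can be sketched as follows. Cortex and Loki read the tenant ID from the `X-Scope-OrgID` header. Everything else here is an assumption for illustration: the host name is made up, the `/prometheus` path prefix is the Cortex default and may differ in a given deployment, and the sketch only builds the request rather than sending it. Opstrace also places authentication in front of these APIs, which is not shown.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode

TENANT_HEADER = "X-Scope-OrgID"  # header Cortex and Loki read the tenant ID from


def build_tenant_query(base_url: str, tenant: str, promql: str) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for an instant query scoped to one tenant."""
    # "/prometheus" is assumed to be the configured Prometheus API prefix.
    url = f"{base_url}/prometheus/api/v1/query?{urlencode({'query': promql})}"
    headers = {TENANT_HEADER: tenant}
    return url, headers


# The same backend and the same query, scoped to two different tenants.
for tenant in ("prod", "staging"):
    url, headers = build_tenant_query("http://cortex.example.internal", tenant, "up")
    print(tenant, url, headers)
```

Only the header value changes between the two requests, which is what lets one installation serve Prod, Staging, and System side by side.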

CRAIG BOX: Your software is Apache 2 licensed, and it builds on a lot of software made by Grafana Labs, which was Apache 2 licensed and has recently been relicensed. They have expressed their concern with people taking their software and hosting it in a SaaS environment and making money that they're not seeing any of. So have those Grafana changes impacted you at all?

SEBASTIEN PAHL: No, not really because, while the AGPL itself is a more restrictive license, as long as you contribute back to it and you don't change the software, you can actually use it. If we were, for example, to want to use some LogQL code from Loki, we would make the code that links to it AGPL 3 licensed.

More importantly, we chose Apache 2 because, in the beginning, like you said, these projects were Apache 2. So we thought, how can we package all this in the most respectful way? We didn't want to go to a more restrictive license like AGPL 3 because they were Apache 2. And so we ended up doing it this way.

We'll keep it that way because it doesn't impact us to use Apache 2. We kind of like it. I'm glad that they chose a real open-source license and not something like the SSPL or whatever. That would have been, let's be honest, a problem for us.

CRAIG BOX: Is it a problem for your users though? Because a lot of companies are scared of any license with GPL in the name.

SEBASTIEN PAHL: I don't think it's a problem. I've had multiple conversations this way, and all these people use Linux. And Linux doesn't infect the rest of their system. It's the same thing when you just use AGPL software. People were scared because they thought that, if you just talk to it over the network, boom.

CRAIG BOX: That's considered linking to it.

SEBASTIEN PAHL: But that's not how it works. So that's fine.

CRAIG BOX: When I go to opstrace.com and read all about the software, at no point on the page does it tell me that I have to pay you any money. Do you have a way to take money from me?

SEBASTIEN PAHL: Yeah, we do. We're just going a bit slowly. Our goal is still to build a SaaS. So we're building Opstrace. You can use it. It's free. We don't intend to have enterprise features or anything.

But what we do work with our customers is to say, we're going to manage Opstrace for you in your account. And you don't have to do anything for it. We're on call. We will fix it, patch it, upgrade it, scale it, do all that.

So we will have quite a bit of software-- we already do-- to run these things in other people's accounts. The difference is it's still a SaaS, but, instead of having one big, centralized instance, you will have tens, hundreds, thousands of them that we will manage with no open-source software. We believe-- and it seems to be working-- that people will pay for this.

It's kind of like, today, they pay Red Hat for the brand and for the support. That's very inspirational, but only Red Hat succeeded at that. We wanted to take inspiration from that model by adding that next component, which is we run it for you. You don't have to worry about it at all, which is, in this case, one day, I hope that you can come to the Opstrace website, put in your AWS or GCP credentials, and it starts, and you have different plans to choose. This is how we make money on fully open-source software.

CRAIG BOX: Your company and product name is obviously Opstrace. Your logo is an octopus named Tracy. But I noticed that you support logs and metrics today, but not yet tracing. What's happening there?

SEBASTIEN PAHL: Basically, it's very simple. We started with the problems that people have today. As much as we love tracing-- we even had tracing in there before we opened it up-- it's just people are not ready for it. Or, at least, some people are, but the vast majority need help with logs and metrics. And that's what we're helping them with.

We know that it can feel a bit bolted on to add tracing later, but the tracing world has a long way to go before it's completely ready, right? You have to change a lot of software and minds inside of the customers' software themselves. So we decided to focus on the problems people have today while acknowledging that tracing is something that's going to come. We also have a couple of ideas how we could contribute to the tracing world beyond the systems that exist today, but it's just not the first priority.

CRAIG BOX: In that you're built on open-source tools for logging and metrics, do you think that one of the many distributed tracing options that are out there today will be the basis for your new system when it does come about?

SEBASTIEN PAHL: I hope so. I think it'll be one of them combined with new things. We already saw what Grafana did with Tempo, which is nice. But I don't think it's necessarily the only way of approaching this.

Tracing data is quite heavy. Sampling is a big issue. So we want to look at how we can help people having this in different ways. We don't intend to build a database either. We intend to either collaborate on an existing one or heavily modify an existing one.

CRAIG BOX: What other parts of the observability stack do you feel you will eventually need to address as you work down the long tail of customer needs?

SEBASTIEN PAHL: One of them is Loki today is not full-text search. And that's great. It's a distributed grep. You can do, I would say, 80% of the ops tasks that you want with it.

I would like to help people with some of their full-text search as well, so help manage something like OpenSearch-- and I call it OpenSearch, but, obviously, it's Elastic as well. OpenSearch is what we would use in that case because of licenses, like we talked before.

We would like to help people by redirecting a percentage of their logs to that, not everything, not just jam all their logs in there. But, when they know that logs with labels A, B, or C need full-text index, put them in there. So we'd like to help with that.
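
EDITOR'S NOTE: The idea of sending only selected log streams to a full-text index, chosen by label, might look something like the sketch below. This is a hypothetical illustration of label-based routing, not an Opstrace feature. The label selectors and destination names are made up, and it assumes every stream still goes to Loki, with matching streams also copied to the full-text index.

```python
from typing import List, Mapping, Set, Tuple

# Hypothetical label/value pairs whose logs need full-text indexing.
FULL_TEXT_SELECTORS: Set[Tuple[str, str]] = {
    ("app", "checkout"),
    ("team", "payments"),
}


def destinations(labels: Mapping[str, str]) -> List[str]:
    """Return where a log stream with these labels should be sent."""
    targets = ["loki"]  # assumption: every stream still goes to Loki
    if any((key, value) in FULL_TEXT_SELECTORS for key, value in labels.items()):
        targets.append("full-text-index")  # only selected streams are indexed
    return targets


print(destinations({"app": "checkout", "env": "prod"}))  # ['loki', 'full-text-index']
print(destinations({"app": "frontend", "env": "prod"}))  # ['loki']
```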

We'd like to help with also, how do you empower the team that is managing the Opstrace cluster or using the Opstrace cluster to provide this data to the rest of the company? Think about the system as-- it has multiple inputs, but it should have multiple outputs like to send it to, let's say, a Redshift instance or other ways that you can then query in a way that data scientists do? These are places that we want to help.

But then what we're focused on right now is not that at all. What we're focused on right now is making it easy for people to hook up their existing clusters. That's what we're focused on right now. Really make it so that-- I explained it a little bit-- with one click or one API call, you can connect things together. We can also help by having some of the software talk to the software that runs in the customers' Kubernetes cluster, to help with a lot of things like back pressure and other things like that.

CRAIG BOX: Do you think that open protocols, like OpenMetrics and OpenTelemetry, will eventually help the ability for you to be able to take any customer thing and connect up to your system?

SEBASTIEN PAHL: Absolutely. I hope that this continues. Everything that goes and touches the customer's code is good to have as a free and open standard, right? Today, the only good way is to use Prometheus or the Grafana Agent to start extracting data and send it.

But people have to change their code. People even adapted their code to, for example, the Datadog Agent, things like this. So having this open is something that is being pushed a lot by all vendors out there, like all observability vendors, because it kind of creates this fluid market where people can jump from one place to another easily and without having to change the code because code changes at companies take forever. So I'm hoping that, with things like OpenTracing and OpenMetrics and others, we'll be able to have more data sources inside of these companies that are more standardized than today.

CRAIG BOX: In the context of smart home devices, I've heard people say recently that companies will try and keep their protocols private when they think there's still a chance they can be the winner. And, eventually, they see that the ecosystem is going to settle down with three or four particular players. And, at that point, there's no value in that to the end user. And that's the point that they start introducing standards. Is a similar kind of thing happening here?

SEBASTIEN PAHL: I hope it's better than the home devices thing because, the home devices thing, I've only ever seen people close things up that used to be open. That's on the personal level that I've played with my own devices. I don't know if it's a good analogy.

I just think that changing code in companies, the bigger they are, is complicated. And they don't want to do it over and over again. And these vendors have an interest in having people use the same code base so that they can then go after other companies' customers. That's just my take on it.

CRAIG BOX: All right, well, thank you very much for joining us today, Sebastien.

SEBASTIEN PAHL: Thank you very much. It was great.

CRAIG BOX: You can find Sebastien on Twitter @sebp. And you can find Opstrace at opstrace.com.

[MUSIC PLAYING]

JIMMY MOORE: Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us or the Olympic Committee, you can find us on Twitter @KubernetesPod or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: If I were them, I would have called it Tokyo 2021. They did that with the Euros as well. I think it's a bit of a cop out so they didn't have to remake all the t-shirts.

JIMMY MOORE: Mm.

CRAIG BOX: You can also check out our website at kubernetespodcast.com where you will find transcripts and show notes, as well as links to subscribe. Until next week, take care.

JIMMY MOORE: Catch you next week.

[MUSIC PLAYING]
