#171 March 9, 2022
The fourth horseman of the apocalypse, or rather of observability, according to Frederic Branczyk, is continuous profiling. Frederic is founder and CEO of Polar Signals and creator of the Parca open source project. He and Craig talk all things Cloud Native observability.
CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm your host, Craig Box.
CRAIG BOX: The bad news just continues to pile up, especially if you happen to be Australian. Let us put the world aside for a short while and instead talk about the fine art of real estate. When we left the UK, a photographer came around and took pictures of our house. We tidied up as much as we could. But a few things we didn't think about we were told not to bother touching.
We were shocked at how clean and empty the house looked when the photos came back. For starters, the wide-angle lenses they use. You can tell when the cutlery drawer suddenly becomes 3 meters wide. But that wasn't even the most shocking thing. They photoshopped out our fridge magnets and all the things we keep on top of the shelf.
Here in the New Zealand summer, photographers at the high-end properties come at the golden hour and get those nice sunset pictures. Or do they? My brother pointed out a couple of listings where the skies were obviously photoshopped. Now, I think I see it everywhere. Sometimes they don't even bother trying to hide it and show the exact same photo with day and evening skies.
But sometimes, these auteurs do go all in on the practical effects. Houses here are rarely shown empty. Instead, they're full of props in the form of staging furniture. Never a TV, but always a single open cookbook on the kitchen bench top. Today's recipe was spinach ravioli with peas and pesto. But the kitchen would need gutting and replacing before anyone would actually make it there.
Let's get to the news.
CRAIG BOX: The Knative serverless platform has been accepted as an incubation project at the CNCF. We've chronicled the life of Knative on the show since its announcement in episode 14, through its 1.0 release in episode 166. Congratulations to everyone involved.
Google Cloud's Managed Service for Prometheus is now generally available. Using Google's Monarch metric storage backend with a PromQL API layer, the fully managed Prometheus service includes two years of metric retention at no extra cost. Prometheus metrics can also be queried in Google Cloud Monitoring alongside all your other GCP metrics. Also GA this week is Identity Service for GKE, which lets you authenticate to clusters with external identity providers that use OIDC.
The K8ssandra project has been on a path to replacing its Helm installation with an operator. That V2 has just become generally available. The new system has reached parity with the old, but includes many new features, including the ability to run a K8ssandra cluster over multiple Kubernetes clusters.
Chinese cloud provider DaoCloud.io has released Merbridge, a new platform to accelerate Istio using eBPF. Merbridge is a DaemonSet that runs in your cluster and replaces the iptables-based sidecar routing with eBPF programs that promise a shorter path through the network stack, especially when two pods are on the same machine. The project is at an early stage. But benchmark results show a small performance improvement. And the authors are looking for feedback.
Observability and APM company, New Relic, has launched a new Kubernetes experience. The upgrade promises to bring application and cluster performance data into New Relic's UI, so developers can correlate insight on their apps with the machines running them. The platform is in part powered by Pixie, which New Relic acquired last year and which is now a CNCF sandbox project.
Finally, a recent CVE in Linux allowed root users to gain root access. Now, on the surface you might say, meh. Two write-ups of the vulnerability talk about how it might be used. Palo Alto Networks' Unit 42 gives a regular security write-up, which basically boils down to "use AppArmor or SELinux and you're fine."
Jordy Zomer, writing on his blog at pwning.systems, has a simple explanation of how one might exploit this by causing a container to core dump, another reminder that whether Docker, rkt, or containerd, it's really all just sparkling Linux. And that's the news.
CRAIG BOX: Frederic Branczyk is the founder of Polar Signals, co-creator of the Parca Open Source Project, an observability enthusiast, and a coffee fanatic. Welcome to the show, Frederic.
FREDERIC BRANCZYK: Thanks for having me.
CRAIG BOX: It's early in the morning for you, so perhaps we can unpack those things in reverse order. How much coffee have we had so far this morning?
FREDERIC BRANCZYK: I've only had one coffee so far.
CRAIG BOX: Was there a very special process involved in making that coffee?
FREDERIC BRANCZYK: Yeah, I mean, I have very specific equipment. I don't know how much the audience is familiar, but I have a Niche Zero grinder. It's a low retention grinder where you put a single dose of coffee beans and there's low retention in the burrs. Then I have what's called a Decent Espresso Machine.
CRAIG BOX: I would have thought I had a good espresso machine. Is good better than decent?
FREDERIC BRANCZYK: This is the brand. And it's very on topic about all of the things that I'm involved in because it basically measures absolutely everything at every possible point of the espresso making. It measures the temperature, the pressure. It does all of these things so that you can produce repeatable good coffee.
CRAIG BOX: I'll look forward to taste testing it one day when I make it over to Berlin next. You talk there about the idea of low retention in coffee beans. Is there a parallel to the retention of metrics in an observability situation?
FREDERIC BRANCZYK: [LAUGHS] I guess it's kind of an inverse relationship. With coffee, you want the least amount of retention. Whereas with observability data, it tends to be that people want the most retention possible.
CRAIG BOX: Let's work our way backwards then to your enthusiasm for observability. How do you define observability?
FREDERIC BRANCZYK: For me, observability is anything that allows us to understand the operational aspect of our running applications better. That isn't limited to a single type of data. It's not limited to some certain methodology.
It's really anything that allows us to understand our running applications better because there's so much nuance in different types of data, and they show us different kinds of nuance of our running applications. And so it's a pretty vague description. But it's intentionally vague because there's just so much data out there that's useful.
CRAIG BOX: It means whatever you want it to mean.
FREDERIC BRANCZYK: Pretty much.
CRAIG BOX: You first started working in observability working for CoreOS in 2016. How did you come to work there?
FREDERIC BRANCZYK: I had been following CoreOS pretty much since its inception. Everything from the immutable operating system was-- that in itself was super interesting to me. But I was a CoreOS user and everything. And one day at my previous job back then, I was at a canteen somewhere here in Berlin. And I saw someone with the CoreOS employee t-shirt.
CRAIG BOX: Wow.
FREDERIC BRANCZYK: And I had recognized them from a meetup that I attended in San Francisco. So I was like, wow. I guess they're opening an office here or something. And so I did a little bit of research and ended up finding that days afterwards, a couple of positions were published and I immediately applied.
CRAIG BOX: Excellent, well that's a happy coincidence.
FREDERIC BRANCZYK: Yeah.
CRAIG BOX: What was the CoreOS team in Berlin working on?
FREDERIC BRANCZYK: You could generalize it as the open source office. We were intentionally kind of created to work on all open-source things that CoreOS was doing. So more specifically, we were working on the rkt container engine. We were working on Prometheus. Later, the focus shifted a little bit. But those were the two main areas of the CoreOS Berlin office.
CRAIG BOX: Now, Prometheus was founded out of SoundCloud, who were also based in Germany as I understand?
FREDERIC BRANCZYK: In Berlin even, yeah. We collaborated a lot with those folks.
CRAIG BOX: What was the process of taking that nascent project and bringing it to something which was useful to the wider Cloud Native community?
FREDERIC BRANCZYK: The SoundCloud folks, by the time that they had open sourced it, they had already used it for several years, I want to say, internally. So it was definitely already a super useful product. It's just like with any other project and product out there. At the moment you put it into varying people's hands, you have a lot of different kinds of expectations and requirements pop up.
It ended up working out because we were so deep in the Kubernetes world and SoundCloud was starting to explore Kubernetes as well, because previously, they had built their own orchestration system. It kind of was a super, very harmonious relationship because we kind of were able to exchange a lot of thoughts and just kind of work really well together to kind of make Kubernetes not the only first class citizen, but one of the best ones, I would say, in Prometheus.
CRAIG BOX: You remain a maintainer on Prometheus to this date. Which parts of the stack have you worked on?
FREDERIC BRANCZYK: I want to say I've probably left my fingerprints on pretty much everything at this point. But certainly, the most and what I remain the maintainer of are all of the Kubernetes integrations in Prometheus.
CRAIG BOX: You were also lead for the Kubernetes SIG Instrumentation. Was that largely around the Prometheus integration?
FREDERIC BRANCZYK: Over time I was doing everything Kubernetes in Prometheus. And I was kind of part of the group that founded the special interest group for instrumentation in Kubernetes. I wasn't initially a lead for it. But I think at a year or two into it, I was.
And I only recently, after almost four years of service, I handed that off to the next generation. Basically anything that concerned instrumentation observability was also my thing in Kubernetes. And so kind of everything in that intersection I have been working on for the past six, seven years, yeah.
CRAIG BOX: There have been a couple of different versions of the integrations, things like the cAdvisor and Heapster and so on, moving on to kube-state-metrics. Do you think there is a solid set of tools there that have been developed over time or do you think there is still room for new technology to come into that Kubernetes Prometheus integration space?
FREDERIC BRANCZYK: It's funny that you mention all of these because I was directly involved in all of them. I'm a long time maintainer. And I actually continue to maintain kube-state-metrics. I'm a maintainer for the Prometheus Operator.
For those who are not aware, it's kind of bridging the operational knowledge of Prometheus on Kubernetes. I do think there is still a lot more that can be done. But I think the foundation is really solid. I think everything that happens now is on a higher level in terms of metrics and monitoring.
CRAIG BOX: One of the things that you need to do with your metrics is, as we mentioned before, retain them somewhere. You also have been involved in the Thanos project. Tell us a little bit about how you got involved with that.
FREDERIC BRANCZYK: Fabian, who was one of the co-creators, was my colleague at CoreOS. And so kind of naturally, being the co-creator, he kind of introduced it at CoreOS times. And that's how I got introduced to it and started maintaining it as well and then over time also contributed quite a lot of things in Thanos as well.
CRAIG BOX: We've talked to a few different teams over the years, people who have backends for Prometheus metrics, whether it be open sourced things you install yourself or whether it be part of your provider's metric system. How do you decide which tool to use and how much data you should retain as a user of a Kubernetes system?
FREDERIC BRANCZYK: I think it does end up being a somewhat personal decision, especially the retention. Sometimes you even have to abide by some regulations. I'd like to think that we as humans tend to over collect. We tend to want retention that is forever and never ends. I, in practice, find that more than a month of high resolution metrics tends to be almost useless.
They start to increase the cost of your stack quite a lot while you're probably never going to have a look at it again. That's different for higher level business metrics potentially or maybe even SLOs or stuff like that. But the high resolution, low level metrics, let's say memory usage or something like that, I find tends to be almost useless when it's six months old or something like that.
CRAIG BOX: Yeah, I think you mentioned that humans like to retain things. I feel you accused me of hoarding just with that statement. I like to keep things. What can I say?
FREDERIC BRANCZYK: I don't think there's any shame in that. But I think sometimes it's good to understand why we're doing it and if there's no good reason, then to maybe sometimes also let go.
CRAIG BOX: We started a little bit by talking about observability. Some people might say that the pillars of a modern observability stack are metrics, logs, and tracing. We've talked there about metrics. What's your view on the state of observability for logs and tracing in a Kubernetes environment?
FREDERIC BRANCZYK: It certainly evolved over the years. I'm a big fan of what Grafana has done with the Loki project, where essentially you don't require your applications to necessarily have structured logging, simply because there's so much-- not necessarily legacy, but just applications that have their log formats that aren't necessarily structured that are still super useful.
Let's say NGINX logs or Apache logs, all of these things, they have their own structure. But they're not JSON or logfmt or something like that. The reality is that most engineers just want to grep over their logs in an efficient way. And they kind of took that to heart and really created a pretty incredible system, I think, for logging.
CRAIG BOX: Some people describe Loki as the Prometheus for logging. Is that a fair comparison?
FREDERIC BRANCZYK: It's maybe a stretch to say that. But I think there were definitely a lot of pieces that were highly inspired by Prometheus. It's not like Loki does any sort of scraping of logs or something like that. But things like the data model is super similar.
A lot of the service discovery and the way that labels are attached to log streams is super similar. The query language is quite similar in many ways. So, lots of inspiration, and that's why it also complements Prometheus set-ups so well.
CRAIG BOX: And then tracing. Is there a distinction there in the way people treat tracing versus metrics and logs, in that they might say, all right, metrics and logs let me look at something that has happened, whereas tracing lets me look at something that is currently happening?
FREDERIC BRANCZYK: I think in that sense, logs are often used in the same way to observe what is currently happening. I think metrics are weaker in that regard but still useful as a signal. That's why we build SLOs using metrics. Even if they are built off of logging or tracing data, we end up producing something that looks a lot like metrics.
CRAIG BOX: It's a sort of calculus. If you differentiate your logs, then you end up with metrics.
FREDERIC BRANCZYK: Yeah, pretty much.
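As a rough illustration of that point about logs ending up as something that looks like metrics, here is a minimal Python sketch. The log lines, the regular expression, and the function names `logs_to_counters` and `error_ratio` are all hypothetical and invented for this example; they are not taken from Loki, Prometheus, or any tool discussed in the episode.

```python
import re
from collections import Counter

# Hypothetical access-log lines in a common web-server style (illustrative data).
LOG_LINES = [
    '10.0.0.1 - - [09/Mar/2022:10:00:01 +0000] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [09/Mar/2022:10:00:02 +0000] "GET /cart HTTP/1.1" 500 0',
    '10.0.0.3 - - [09/Mar/2022:10:00:02 +0000] "POST /checkout HTTP/1.1" 200 128',
    '10.0.0.1 - - [09/Mar/2022:10:00:03 +0000] "GET /cart HTTP/1.1" 200 512',
]

# Capture the HTTP method and the three-digit status code from each line.
PATTERN = re.compile(r'"(?P<method>[A-Z]+) \S+ [^"]*" (?P<status>\d{3}) ')


def logs_to_counters(lines):
    """Collapse individual log events into counters keyed by (method, status)."""
    counters = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counters[(match.group("method"), match.group("status"))] += 1
    return counters


def error_ratio(counters):
    """Fraction of counted requests whose status code starts with 5."""
    total = sum(counters.values())
    errors = sum(n for (_, status), n in counters.items() if status.startswith("5"))
    return errors / total if total else 0.0
```

The individual events are gone once they are counted; what remains is a small set of labelled numbers, which is the shape an SLO is usually computed from.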
CRAIG BOX: You're of the opinion that those three pillars are not enough for observability. What led you to that and what is missing in your opinion?
FREDERIC BRANCZYK: I think it goes back to the very start of what we were talking about when we were defining observability. We were talking about anything that allows us to understand our operational systems better is observability. And so I was kind of looking at what I was doing on a daily basis and reflecting whether I'm able to understand the same depth through the observability tools that I had available.
This was in 2018. I had distributed tracing. I had logging. I had metrics. And yet I still found myself going and manually profiling my applications all the time. And so I was like, hold on. This is just another aspect of my application that I'm trying to understand better right now. And none of the existing observability tools that I have in place can provide me with this nuance that I'm looking for.
CRAIG BOX: Was that a case of logging into a machine and breaking glass on the pod and running a profiler on a particular running application?
FREDERIC BRANCZYK: Yeah, pretty much, maybe a port forward or something like that in a Kubernetes cluster. Definitely things that make me a little bit uncomfortable doing in production. But it was the only tool I had available at the time basically.
CRAIG BOX: It did not have all the repeatability of your coffee machine, for example.
FREDERIC BRANCZYK: Exactly.
CRAIG BOX: So by this point, CoreOS has been acquired by Red Hat. You've been working on observability all the way through. How were you applying this? Are you looking at your own software that you're building and saying that you need profiling or are you looking at what the users of your software are seeing?
FREDERIC BRANCZYK: We're really seeing both, in a way. The thing that got me started in this was, as we've already talked about, I was working on these super performance-sensitive pieces of software, Prometheus and Kubernetes, both of which users expect to use exactly zero resources and have, like, zero latency for all requests.
It's kind of a general expectation that people seem to have for infrastructure components. If you talk to database companies, they'll say the same thing. Being very conscious about the changes that you make and how they affect performance or improving performance or resource usage is super critical for these kinds of software.
At the end of 2018, I actually read this white paper that Google had published in, I believe, 2010-- it's been quite a while since this has been out there, where Google described how-- I think they called it Google-wide profiling, where absolutely everything is always profiled in all Google data centers. And simply through having this data, Google's able to save on resources by multiple percentage points every quarter.
This kind of consistency in improving performance was really interesting to me, specifically because of what I had said about working on various performance-sensitive pieces of software. But I found out that this was the initial trigger, I think, this paper, that while I have a very specific set of software that I work on where I'm already aware of this, you can reduce costs of your infrastructure super easily simply by having this data.
And it doesn't necessarily have to have a user impact, though we're seeing a lot of that as well. For example, in e-commerce companies, you want to have your latency as low as possible because that means that your conversion rate increases because there's human psychology that when e-commerce websites are more responsive, we're more willing to purchase something on those websites.
CRAIG BOX: That's why it's very important to pick the right shade of blue.
FREDERIC BRANCZYK: Exactly.
CRAIG BOX: You've identified profiling as a thing that you need to do to make your own applications more performant. When we talk to start up founders, we'll ask them whether the software or the company came first. Were you looking for an opportunity and you picked profiling as the thing to go for or did you come up with this idea for the software and think, all right, I'm going to take that and that's going to be what I work on now?
FREDERIC BRANCZYK: It was very much the latter. After reading this paper, I was kind of inspired and put together this barely compiling proof of concept and published it on GitHub. And I got super, super lucky and got to present this as part of a keynote at KubeCon in Barcelona as part of a larger keynote about what does the future hold for observability and continuous profiling to be, kind of, one of the next pillars of observability, was kind of one of the predictions that I was making in this.
I did work on this open source project in my free time. But in 2020, half a year into the pandemic, I felt like there was still nobody really filling this gap. I felt like I was in the position with this proof of concept. We were already using it within my team at Red Hat. It was already super useful, even though it's barely working. I felt like there was something bigger to be done here.
And at that point when I quit my job at Red Hat, there was no one else doing this out there, at least not in the way that we had envisioned it. Profiling itself had been a part of the developer toolbox ever since software engineering has existed. But this notion of continuous profiling, where you always profile absolutely everything, basically only hyperscalers had this kind of capability. And so I felt like there was a market gap to be filled there. And yeah, definitely the software came first and then the opportunity.
CRAIG BOX: They do say that the best way to predict the future is to invent it. So it's easy to say that that prediction came true?
FREDERIC BRANCZYK: Yeah.
CRAIG BOX: Red Hat has a lot of open source software that they've built and developed themselves. Was this something that you considered building in your day job? Did you approach anyone there about this or was the time right for you to start something of your own?
FREDERIC BRANCZYK: I was kind of architect for all things observability at this point at Red Hat. I think there was a general mix of being uninspired. I think the coronavirus pandemic didn't help. I just felt like if I'm going to do this, I need to focus fully on it. And the reality of my job at Red Hat was that I was involved in so many other things that it was impossible for me to truly focus on this. That's kind of when I made the decision, I need to make this my full time job.
CRAIG BOX: The company that you founded is called Polar Signals. I'm sure there's an iceberg joke in there somewhere?
FREDERIC BRANCZYK: 100%. Actually the very first logo that I drew up was an iceberg. And at this point, the logo is actually still intended to be an abstract iceberg.
CRAIG BOX: What percentage of the iceberg is below the water?
FREDERIC BRANCZYK: A huge part of it. And that's kind of the play on all of this, right? We're only seeing the tip of the iceberg. And Polar Signals is intended to help you discover the rest of it.
CRAIG BOX: You now have the beginning of a new company and you have the beginning of a new piece of software. How did you go about developing those two in parallel?
FREDERIC BRANCZYK: When I decided to quit my job, the first thing I did was actually talk to a whole lot of companies about their potential needs for something like this. I did have a piece of software that was useful to me. But I didn't know whether and how it was going to be useful in other people's hands. Keep in mind, this is a barely compiling proof of concept that worked exactly for my use case.
CRAIG BOX: People have had billion dollar valuations for less.
FREDERIC BRANCZYK: That may be true. But actually, one of the first VC conversations that I had, I was told that I was being very German about all of this because I was kind of declining offers for investment because I wanted to understand the space better and what we needed to build and what we were going to do with the money if we were to raise some money. Yeah, it's happened to me several times. But it seems like VCs also appreciate that.
CRAIG BOX: Were you talking to VCs in Berlin or in California?
FREDERIC BRANCZYK: All over the world.
CRAIG BOX: How did you go about establishing those relationships with VCs? If you'd been working inside a company for a while, did you have experience with colleagues or friends who had gone through this process who were able to help guide you?
FREDERIC BRANCZYK: I have a set of really awesome advisors who definitely helped introduce me to some people. But declining to take investment was definitely-- in hindsight, I can say this. I didn't know then, but it was a good strategic choice because it allowed us to show what we can do as a company before taking an investment. And I felt like that allowed us to go much stronger into our negotiations when we actually did want to go and raise money.
We had always taken introductions with VCs that had reached out. When you have established that initial point of contact, when you actually want to talk about funding you can kind of cut straight to the chase. I'm not saying that it's wasting time. But you can just skip that introduction and get right to it.
CRAIG BOX: The project that you're working on is now known as Parca. How much of the Parca project was built and ready for people to use before that funding discussion?
FREDERIC BRANCZYK: Just kind of looking at it chronologically, at the time when Polar Signals was founded, everything that was open source was still known as the Conprof open source project. Very creatively named, right? Continuous profiling, Conprof.
CRAIG BOX: There's no number in the middle of that.
FREDERIC BRANCZYK: Yeah.
CRAIG BOX: C-5-f.
FREDERIC BRANCZYK: We built our invite-only private beta at first because we didn't want people to have to do a whole lot of setup to try this kind of product. We wanted to make it as convenient as possible for them because really, all we were looking for was feedback for what the product that we needed to build was and what the technology that we needed to build was. That was kind of the little bit of closed source code that we were writing.
But it was primarily management things. Everything that was crucial to doing continuous profiling was open source. The point where we did take the investment, we did have all of the things that we wanted. We had understood what kind of product a lot of companies wanted from a space like this. We had figured out what the technology was that we needed to build for it.
And so it was kind of a mix at that point. When we took the investment, we had a bunch of proof of concepts that ended up evolving into what is now the Parca open source project. In a way, you can also think of it as the next evolution of the Conprof open source project.
CRAIG BOX: How do people consume Parca today? Is it something that they install and run the management infrastructure in their own cluster in an entirely open source fashion or is it something that connects to a SaaS back end that Polar Signals operates?
FREDERIC BRANCZYK: The overwhelming majority is everything open source. We kind of have two components that make up the Parca project. So we have the server that has the storage, APIs, UI to query all of this data. And then we have the agent that, in Kubernetes for example, you deploy as a DaemonSet. And then it starts profiling absolutely every container in your Kubernetes cluster automatically. And then it kind of sends all of that data to the central storage.
That's how the majority of people use it. Of course, we are working on that SaaS product for people so that they don't have to maintain and run this relatively complex system by themselves, kind of the same as most other observability companies out there.
CRAIG BOX: In episode 163 we talked to Thomas Dullien from Prodfiler. And it very much sounds like a case of parallel evolution, in that the two companies sort of had a very similar idea with similar inspiration at the same time. It sounds like Thomas's background in reverse engineering and so on led him more to work on the agent side. And perhaps your background in Prometheus and so on led you to work more on the server side. Is it fair to compare the two projects? Do you think that the approaches are similar or complementary?
FREDERIC BRANCZYK: It's funny that you say this because we had exactly the same thoughts when we first learned about them, when it became more concrete what the Prodfiler folks were up to. You're absolutely right. The very first conversations that I had with people I was saying, I'm not interested in the collection side of things. I think the harder problem and the more useful problem for us to solve is the server side, the querying and so on.
Because basically, my background was largely in the Go ecosystem at this point. And Go has a fantastic set of really high quality profilers that work with open standards. And so I felt like even if I was able to just solve continuous profiling for the Go ecosystem, I would be able to reach a really significant portion of developers to get the business started at least. That's why we focused so much more on the storage things.
But as we started giving this product to companies, we realized that in order for us to be successful, we had to make instrumentation and collection of this data easier. Otherwise, we were going to kind of fall into the same category as distributed tracing, where companies need to invest immense amounts of software engineering resources in order to get useful data out of their systems.
That's kind of what led us to investigate zero instrumentation methodologies. And we then, through that, ended up going with a very similar approach to the Prodfiler folks.
CRAIG BOX: There are OpenMetrics and OpenTelemetry logging formats and so on. Is there an open profiling format?
FREDERIC BRANCZYK: We've definitely talked to other profiling vendors out there and there's definitely an interest for that. At the same time, everybody kind of agrees that the space is still a little bit too young to create this one standard to rule them all. We feel that things are still evolving so much that it would hinder innovation at this point. We're confident that this kind of standard will exist one day.
CRAIG BOX: Does everyone at least agree on if the graph should point up or down?
FREDERIC BRANCZYK: I don't think so, no.
CRAIG BOX: We have the flame graph and the icicle graph, for which we'll have a picture in the show notes to demonstrate the difference. It's not just the color and the direction.
FREDERIC BRANCZYK: Actually I found out that people call the upside down version the icicle graph after choosing the company name. And so it was kind of a funny accident that icicles and Polar Signals kind of end up working really well as a brand together.
CRAIG BOX: It's very convenient, that.
FREDERIC BRANCZYK: Yeah.
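For readers without the show notes picture to hand, the flame and icicle distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stack samples and the names `frame_widths` and `render` are made up for the example, and the text rendering is far cruder than what Parca or any real profiler UI draws.

```python
from collections import Counter

# Hypothetical stack samples: each tuple runs from the root frame to the leaf.
SAMPLES = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "gc"),
]


def frame_widths(samples):
    """Count how many samples pass through each stack prefix."""
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += 1
    return widths


def render(samples, root_on_top=True):
    """Return one text row per stack depth.

    root_on_top=True gives the icicle layout (root first, growing downward);
    root_on_top=False gives the flame layout (root last, growing upward).
    """
    widths = frame_widths(samples)
    max_depth = max(len(prefix) for prefix in widths)
    rows = []
    for depth in range(1, max_depth + 1):
        # Sort the prefixes so the row order is deterministic.
        prefixes = sorted(p for p in widths if len(p) == depth)
        rows.append(" | ".join(f"{p[-1]} x{widths[p]}" for p in prefixes))
    return rows if root_on_top else rows[::-1]
```

Both layouts carry the same aggregated data; `print("\n".join(render(SAMPLES)))` lists the root row first, and passing `root_on_top=False` simply reverses the row order.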
CRAIG BOX: And Parca also has polar references I understand.
FREDERIC BRANCZYK: That's right. Program for Arctic Regional Climate Assessment, and this was essentially a set of polar expeditions that was performing what's called ice core profiling. And what we do there is-- I don't, but the scientists.
CRAIG BOX: The other Parca.
FREDERIC BRANCZYK: They study climate change through basically huge icicles that they drill out of the Arctic ice and kind of understand the atmospheric composition through that. It was kind of an homage to these expeditions in a way because part of what continuous profiling allows you to do is reduce resources of data centers essentially.
Hopefully we can have a little bit of a positive impact on climate here. The reality is that human psychology works a little bit differently. And when we make things more efficient, we actually end up consuming more of it. Maybe it's wishful thinking. But we have seen that hyperscalers have prevented the need to build entire data centers through continuous profiling. And so we're hopeful that bottom line is that we are able to help a little bit here.
CRAIG BOX: Now, you can't be a hip startup in 2022 without having some eBPF somewhere in your stack. So I'm going to assume that that's a huge part of how you're able to get this data out of the running processes.
FREDERIC BRANCZYK: Absolutely. As I said, zero instrumentation ended up being a really key thing for our strategy because it was how we were going to get people interested and started with continuous profiling extremely easily. It's one thing when all you need to do is drop a DaemonSet into your Kubernetes cluster. But it's a whole other thing if you have to go into your code, do code changes, deploy those changes, and so on.
CRAIG BOX: printf! printf "I'm here". printf icicle. printf flame.
FREDERIC BRANCZYK: Exactly. As I said, we had realized that in order not to get into that problem like distributed tracing did, we needed to invest in zero instrumentation capabilities. And the very first proof of concept that we did there was literally running Linux perf to do some of these things. But we realized that perf was doing so much more than we needed it to.
eBPF was starting to mature in many ways. Basically what it meant was that we were able to use eBPF to capture exactly the data that we need and want in exactly the format that we want, and only export this data out of kernel space once every 10 seconds, or it's configurable, but-- and this meant that we were able to reduce overhead very significantly. It's the best of all worlds, right? We're able to profile absolutely everything at low overhead.
And people don't have to do anything. It's kind of the holy trifecta of anything observability because everybody always wants all the gain without having to do anything for it. And that's kind of, I guess, just completely natural. But with most other observability signals, this is really hard to do. But it turns out because profiling itself is so close to how the operating system executes code, it actually ends up working incredibly well.
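As a rough illustration of the aggregate-then-export pattern described above (this sketch is not from the conversation and is not Parca's implementation): identical stacks are counted in place, and only the compact counts are handed off at each export interval. The real agent does this in kernel space with eBPF maps; the `StackAggregator` class and the sample stacks below are hypothetical stand-ins written in plain Python.

```python
from collections import Counter


class StackAggregator:
    """Counts identical call stacks and hands them off in batches.

    Hypothetical user-space stand-in for an in-kernel map keyed by stack.
    """

    def __init__(self):
        self._counts = Counter()

    def record(self, stack):
        # One sample observed: increment the counter for this exact stack.
        self._counts[tuple(stack)] += 1

    def flush(self):
        # Called once per export interval (for example every 10 seconds):
        # return the aggregated counts and start over with an empty map.
        batch = dict(self._counts)
        self._counts.clear()
        return batch


agg = StackAggregator()
samples = [
    ["main", "handler", "parse"],
    ["main", "handler", "parse"],
    ["main", "gc"],
]
for stack in samples:
    agg.record(stack)

batch = agg.flush()
```

Three samples collapse into two entries here; the saving grows with the sampling rate, since hot stacks repeat many times within one interval.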
CRAIG BOX: I think of profiling as something that's part of the development process, even though it's happening effectively in production. Is there a use for profiling in the operations phase? Can I set SLOs on my environment and say, I shouldn't be exceeding these numbers for some usage figure?
FREDERIC BRANCZYK: We're thinking about it more on a higher level. You continue to set your goals through service level objectives. But profiling data is a supporting type of data to understand why you're violating your SLOs, for example. I forget who coined this, but sometimes observability data can be categorized as debuggability versus alerting. And I think this falls more under the debuggability aspect.
CRAIG BOX: You can't think of a case where I'd be wanting to say, all right, the call stack is getting too deep. I should wake somebody up and have them fix it.
FREDERIC BRANCZYK: Unlikely. I think what we'll end up seeing more of is kind of reporting style things, where you'll see things like your infrastructure cost or your infrastructure usage and CPU grew by 10% and here's the biggest new offender or something like that. I think it'll be less critical of a reaction than alerting.
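To make the "biggest new offender" style of report concrete, here is a minimal sketch (not from the conversation, and not a Parca API): it compares per-function sample counts from two time windows and picks the function whose count grew the most. The function name `biggest_new_offender` and the sample numbers are hypothetical.

```python
def biggest_new_offender(before, after):
    """Return (function, increase) for the largest growth in samples."""
    functions = sorted(set(before) | set(after))
    # A function missing from one window counts as zero samples there.
    deltas = {fn: after.get(fn, 0) - before.get(fn, 0) for fn in functions}
    worst = max(functions, key=lambda fn: deltas[fn])
    return worst, deltas[worst]


# Hypothetical CPU sample counts per function for two reporting periods.
last_week = {"parse": 400, "encode": 300, "gc": 300}
this_week = {"parse": 410, "encode": 300, "gc": 290, "compress": 100}

growth = (sum(this_week.values()) - sum(last_week.values())) / sum(last_week.values())
offender, increase = biggest_new_offender(last_week, this_week)
```

With these made-up numbers, total samples grow by 10% and the newly appearing `compress` function accounts for most of it, which is the kind of summary a periodic report could surface without paging anyone.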
CRAIG BOX: One of your colleagues at Polar Signals works on the Pyrra project for monitoring SLOs. What can you tell me about that project?
FREDERIC BRANCZYK: Yeah, I really love this project because much like Parca, it was created out of a need, where the need wasn't properly fulfilled previously. So huge shout out to Matthias there for kind of seeing that opportunity. It's funny, we obviously use Pyrra in our production infrastructure.
And at this point, even though Pyrra kind of generates Prometheus alerts and you can look at all of this data through Prometheus or even through Grafana, I actually find myself when I get paged immediately heading to the Pyrra UI because it's so much more clear, so much crisper in terms of the information that I need and that I'm looking for, that it's really become my default entry point to debugging.
CRAIG BOX: You were working on building tools like Prometheus and Thanos. And you realized that you needed continuous profiling in order to debug and get the most performance out of those tools. As you have built Parca, what have you wished you had that will be the next thing that you build?
FREDERIC BRANCZYK: Wow, that's a great question. I don't think I've ever been asked this before. Let me think for a second. The two hardest things that we have encountered while building all of this are, one, while the eBPF ecosystem has evolved tremendously over the years, it's still really hard sometimes to really understand what the eBPF verifier wants you to do or why it's not allowing you to load a certain program.
We're seeing this more and more. And we're also investigating doing this. Most eBPF programs are written in C. And frankly, I'm really bad at C. A lot of people are starting to write these programs in Rust. And we're definitely super interested in this. But definitely the ecosystem there, while it's also rapidly evolving, it's still pretty young compared to all the other tooling out there.
CRAIG BOX: If you were a startup who somehow managed to get WebAssembly runtime running to compile eBPF programs, that's a billion dollar valuation without even writing any code.
FREDERIC BRANCZYK: Yeah, potentially, yeah. A lot of times, people actually compare the two technologies as like, eBPF is the WebAssembly or the Serverless of the kernel. And I think it's super true in the way that we end up using it.
CRAIG BOX: Would you like to see Parca become as ubiquitous as Prometheus, for example, in the observability stack?
FREDERIC BRANCZYK: 100%. We were super conscious and deliberate about creating Parca as a separate brand from Polar Signals. We may be the company that initially started this project. But we hope that it becomes much bigger than the company. That's why we're really focused on building an ecosystem around Parca.
So in everything we do, we were very intentional about making the API and the interactions with the API as concise as possible so that we can build lots of tooling around Parca as an ecosystem. We could have taken some shortcuts and built a bunch of functionality into our UI. But then the Parca UI would really be the only useful thing to interact with this data. These are some of the very intentional decisions that we're taking to make sure that we're building an ecosystem and not just a tool.
CRAIG BOX: How does someone get started in this ecosystem? What's the first thing that I should do to experiment with continuous profiling?
FREDERIC BRANCZYK: The very first thing is to head to the parca.dev website and just try it out. We've put a lot of effort into making getting started with Parca extremely easy. It should really just be: deploy the Parca server and deploy the Parca agent as a DaemonSet into your Kubernetes cluster. That's it.
CRAIG BOX: What do I then have to do, though, to get my symbols to be recognized?
FREDERIC BRANCZYK: If your programs are compiled with debug info included, then you have to do nothing. Everything happens automatically. If they are packages that come from a Debian package or something, Parca will actually recognize that and download debug info from what are called debuginfod servers. They're kind of publicly accessible servers to download debug information for publicly available packages.
So most of it is taken care of. What we're still working on is specific integrations for languages and runtimes that are not compiled to native code, things like Python, Ruby, Node.js, Java. These need a little bit of specific integration so that you truly have to do nothing. For, let's say, Java and Node.js, you actually need to pass a couple of flags to your processes right now so that they can be properly profiled. But we're working on putting all of this functionality directly into the agent, so that this kind of zero instrumentation, zero effort applies to truly every program out there.
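For readers curious how such a public debug info lookup is addressed, here is a small sketch (not from the conversation, and not Parca's code). It assumes the servers in question speak the debuginfod protocol from elfutils, where debug info is requested by the binary's build ID under the path `/buildid/<BUILDID>/debuginfo`. The server address and build ID below are hypothetical placeholders, and no network request is made.

```python
import re


def debuginfo_url(server, build_id):
    """Build the debuginfod request URL for a binary's build ID."""
    # Build IDs are written as lowercase hexadecimal strings.
    if not re.fullmatch(r"[0-9a-f]+", build_id):
        raise ValueError(f"not a hex build ID: {build_id!r}")
    return server.rstrip("/") + f"/buildid/{build_id}/debuginfo"


# Hypothetical server and build ID, for illustration only.
url = debuginfo_url("https://debuginfod.example.org/", "0123abcd")
```

Because the build ID is embedded in the binary itself, a profiler can resolve symbols for a stripped package without knowing which distribution or package version it came from.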
CRAIG BOX: Which of those features that you're talking about will you need to bless a 1.0 version of Parca?
FREDERIC BRANCZYK: Well, Parca and Parca Agent are separately versioned. So for the server, it's primarily the storage and the APIs that need to be stable. Right now they're still very rapidly evolving. But we're pretty confident in the direction that we have. At this point, it's mostly finishing some of the work we've started and stabilizing it.
For the Parca agent, we want to have at least a decent set of languages so that we can be sure that fundamental approaches that we're taking aren't going to be changed in the future. And right now, we're not completely sure about absolutely everything there. We might have to change some architectural things, maybe even the wire format slightly. And that would obviously not be OK with something that we would bless with a 1.0 title tag label.
CRAIG BOX: All right, well we'll be keeping an eye on the project. And thank you very much for joining us today, Frederic.
FREDERIC BRANCZYK: Thank you for having me.
CRAIG BOX: You can find Frederic on Twitter @fredbrancz or on the web at brancz.com. You can find Parca at parca.dev.
CRAIG BOX: That brings us to the end of the show. If you enjoyed it, please help us spread the word and tell a friend. If you have any feedback, please send it to us on Twitter, @kubernetespod or by email, to Kubernetespodcast@google.com.
You can also check out the website at kubernetespodcast.com where you will find transcripts and show notes as well as links to subscribe. Thanks for listening, and we'll see you next week.