#163 September 17, 2021
Prodfiler is a new tool that provides fleet-wide full-system continuous profiling. It is in some ways the second act of its co-creator Thomas Dullien, who is an internationally-renowned reverse engineer and vulnerability researcher under the name Halvar Flake. Thomas joins us to discuss his career, what you should profile in a distributed system, and why you can’t sell something with a negative cost.
CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box with my very special guest host Jimmy Moore.
CRAIG BOX: Jimmy, did you see the container news this week?
JIMMY MOORE: So much container news, Craig. Tell me which one you're referring to.
CRAIG BOX: Well, first of all, the Suez Canal, that which was blocked not two months ago for nearly a week, another tanker decided it would pull up and stop things again. Only lasted a few minutes, apparently, and then they took their lessons and went on their way. But I just think that they really wanted to be back in the news again.
JIMMY MOORE: You know it. You know, she got a taste of being infamous, Miss Suez. And she said, I need some of that again. She said, let's get one of these ships in here, blow some wind, and get some press back up in this story. But luckily, it only lasted a little bit.
CRAIG BOX: More container news, a five-bedroom home made out of 21 shipping containers went on sale for $5 million.
JIMMY MOORE: I just — I mean, I live in San Francisco, so it's really not a big deal for me. But it seems a little pricey for some metal boxes.
CRAIG BOX: I think it depends where you are. This is Williamsburg in Brooklyn. So you probably couldn't buy much more than a postage stamp there for that much.
JIMMY MOORE: That's true. And actually, looking at the pictures on the link, it's beautiful. It's really incredible, industrial, and kind of gorgeous, but it just feels weird to say, probably bad things were shipped in these boxes once.
CRAIG BOX: I don't think it's likely to auto scale.
JIMMY MOORE: Definitely not.
CRAIG BOX: Shall we get to the real container news?
JIMMY MOORE: Yes, let's get to the news.
JIMMY MOORE: Crossplane, an add-on to Kubernetes that lets you manage cloud infrastructure by creating Kubernetes resources, has moved to the incubation phase in the CNCF. Since joining the sandbox in June 2020, Crossplane has released 1.0, tripled its number of contributors, and now has contributors from over 100 companies. Learn all about Crossplane in Episode 141.
CRAIG BOX: Google Cloud has announced Backup for GKE, an integrated, first-party backup system to allow users to back up and manage their stateful Kubernetes workloads. Backup for GKE lets you create a backup plan and schedule periodic backups of both application data and GKE cluster state. The product is now available in preview. And you can learn more about it at the upcoming Google Cloud Next.
JIMMY MOORE: To find out what else you can learn about at the upcoming Google Cloud Next, check out the show notes for a link to the recently published session catalog. Sessions include deep dives into GKE and Anthos features, customer case studies, and perhaps an announcement about Prometheus.
CRAIG BOX: Alternatively, think beyond the cluster with a panel of experts talking about multi-cluster support on Kubernetes. On October the 6th, engineers from Google, Apple, and the CNCF, including no less than three previous podcast guests, will tell the multi-cluster story. Learn how to engineer your application to be a global service, minimizing latency for the people connecting to it. The event is free to attend. And a link to subscribe is in the show notes.
JIMMY MOORE: Some new GKE features this week: with Google's Private Service Connect, you can have private consumption of services from different VPC networks. You can now create internal load balancer services in GKE and publish access to other VPCs natively through Kubernetes resources. Applications hosted in GKE clusters can then become accessible to other orgs, companies, and users in a secure and private manner. The managed CSI driver for Google Cloud Filestore is now generally available, and multi-cluster ingress now supports SSL policies and HTTPS redirects.
CRAIG BOX: Researchers at Palo Alto Networks discovered a vulnerability in Microsoft's Azure Container Instances which allowed them to gain full control over other users' containers. Old versions of runc and Kubernetes in the multitenant clusters powering the service provided a path to code execution on the API server. With that, they were able to gain full control over all customer containers. The issue was patched by Microsoft. And there is no evidence of it having been exploited.
JIMMY MOORE: If that's not enough motivation to keep your environments up to date, two new CVEs in Kubernetes were published this week. First, a user may be able to create a container with sub-path volume mounts to access files and directories outside of the volume, including the host file system. Second, actors that control the responses of webhook requests are able to redirect those requests, and in some cases could see the responses and headers in the logs. The former issue is rated high, and the latter, medium.
CRAIG BOX: Last month, the US National Security Agency published guidance on hardening Kubernetes clusters. Security researchers NCC Group published guidance on how to consume their guidance. They find most of it to be good, but call out some areas where the advice is outdated or missing some key context. For example, the NSA suggests you use pod security policy without mentioning that it is deprecated and scheduled for removal.
JIMMY MOORE: OK, we get it. Security is hard. And even the experts don't agree with each other. That's why there is big money in it. Snyk announced a $530 million series F investment round that values the company at $8.5 billion. Learn about Snyk in episode 140.
CRAIG BOX: Earlier this year, Google open sourced a tool called SQL Commenter to easily correlate application and database telemetry. SQL Commenter enables ORM tools to augment SQL statements with information about the application code that caused their execution. The project has now merged with OpenTelemetry. The addition of this tracing standard and its libraries will now enable APM tools to easily integrate with databases.
JIMMY MOORE: The team at Kubermatic have been hard at work since we spoke with them in Episode 109. Their new 2.18 release, out this week, adds Edge-ready multi-user cluster monitoring, logging and alerting, cluster templates, Open Policy Agent enhancements, and spot instances. The KubeOne Cluster Lifecycle tool brings a brand-new add-ons API, managed support for encryption providers, and automated Docker to containerd migration.
CRAIG BOX: VMware's Tanzu Kubernetes Grid 1.4 is GA this week. Package management is now provided for add-ons using the Carvel Tools, meaning no more tarballs to install or upgrade. New networking features include support for IPv6-only clusters on vSphere, and Kubernetes has been upgraded to 1.21.
JIMMY MOORE: Finally, congratulations to Matt Klein and the Envoy team on five years of Envoy open source. Matt, guest on Episode 33, tells the story of how Envoy started as an internal tool at Lyft and the acceleration of growth after its adoption by the Istio and networking teams at Google Cloud. He also digs into the human story of how running an open-source project can be incompatible with one's day job, yet ultimately very rewarding.
CRAIG BOX: And that's the news.
CRAIG BOX: Thomas Dullien is the co-founder of Optimyze. Starting his career in DRM, he spent a few decades doing security, ran a reverse-engineering tooling company, and spent time at Google, both on malware-related topics, and later in Project Zero. Welcome to the show, Thomas.
THOMAS DULLIEN: Thanks for having me.
CRAIG BOX: Your Twitter bio starts, "I do math." You're from Germany. Shouldn't you say you do maths?
THOMAS DULLIEN: Remember that, in Germany, we speak German. So there is no plural in this. So you study "die mathematik". I'm not quite sure what the plural of mathematik would be, anyhow, because there really is just one mathematik.
CRAIG BOX: That's interesting. Why I ask, of course, is I've always assumed that math was an Americanism. And everywhere else in the world, it tends to be "maths."
THOMAS DULLIEN: At least in German, it is "the mathematic," which is one thing and — not even sure whether there is a sensible plural of it.
CRAIG BOX: Do you think there's a huge difference in the education that people get in a math program versus a computer science program and how it impacts their work later in life?
THOMAS DULLIEN: Definitely, yeah. I definitely think that what you learn in a math program is significantly different from what you learn in a computer science program. In my case, I actually started studying mathematics and economics. Computer science was something that I did as an aside. It's definitely a very different topic. And clearly, mathematics does not necessarily prepare you very well for writing code. Well, that said, I'm not sure whether computer science curriculum does either.
CRAIG BOX: I saw a meme recently which sort of shows the progression people make through maths through school, going through algebra, and calculus, and so on, and then basically tapering off and going to the job, swinging right back down to the bottom, where they're all just working in the spreadsheet.
THOMAS DULLIEN: That's not entirely false for most jobs. But then there’s always the exceptions. And I've certainly done a fair bit of mathematically-related stuff in my day job. Funnily enough, all the topics that I studiously avoided studying in university are the ones that are most important in the real world. I ended up, after my math studies, after I had managed to finish math studies with a minimum of calculus and statistics, you end up reading about machine learning. And that's mostly calculus and statistics, right?
There’s plenty of places where you'll need math in the real world. And the way I think about this is that, if you do it right, you walk out of a math education with a toolbox full of strangely-shaped tools. And you need to figure out what they're actually useful for in the real world.
CRAIG BOX: You also walked out of your education with the nickname Halvar Flake. Tell me the story of that name.
THOMAS DULLIEN: Oh my — so yeah, I started doing DRM work, which really just translates into copy protection removal, when I was very young. Clearly, at that point, it is not wise to publish too much under your real name. And you need to pick a nickname of sorts. And then, at the time, there was a cartoon on TV with a young Viking boy. And his father was big, and a bit stupid, and a bit rowdy. And the chief of the village was called Halvar. And I thought that Halvar is a good nickname to choose for pseudonymity.
And then later on, I actually started doing serious work and the legalities of reverse engineering work were still a little bit dubious. So I started publishing under that nickname. And it turns out that once you've done all your important early work under that nickname, you can never ever get rid of it again, because people won't know who you are under your real name. And then you're stuck with the nickname you chose in your teens.
CRAIG BOX: I think there will be a few people who are listening to this show who had no idea who you were until I introduced you by that nickname.
THOMAS DULLIEN: That is very, very likely. People still get confused about this. And co-workers of mine at Google actually created an email alias pointing to me for themselves so they could email email@example.com instead of having to remember my real name.
CRAIG BOX: When you were first removing DRM from software, what platform was this on?
THOMAS DULLIEN: This was x86, early Windows. Like, we're speaking about the '95, '96 time frame.
CRAIG BOX: How did that go from being a hobby to being a career?
THOMAS DULLIEN: By accident, really. So I stumbled into this entire copy protection removal thing by accident, because my older brother had been doing a lot of Atari ST programming. When I asked him as a nine-year-old, how do you remove a copy protection, he said, well, it's easy. You just invert a jump.
And I had no idea what a jump even is. And then in my teens — I think I was 14, 15 — I picked up a book called "Master Class Assembly Language" that explained to me what a jump is. And then I went into finding out what I can do with inverting a jump.
And it turns out that doing copy protection removal, you get very comfortable reading software without having access to the source code. And it turns out that that is extremely useful when you do security analysis of closed-source software. It turns out that closed-source software was very, very relevant everywhere in the late '90s, early 2000s.
All the web servers were still commercial. You had something like Netscape Enterprise or whatever. And source code for these wasn't available. So the ability to read the software without having the source code was extremely useful for security analysis.
And that morphed by accident into a bit of a career, because I didn't have an intention on doing that. It was more something I did to finance my studies. Well, then it careened out of control.
CRAIG BOX: I read some interesting blog posts, people talking about removing copy protection from software. One that stood out to me was Michael Steil talking about removing the copy protection from Geos for the Commodore 64. And one of the things that I think becomes easier as the systems get older, or perhaps the gap between what computing power we have today and what we have then, is you can emulate the entire system. And you can basically look at the entire state of a system running in memory, pause it when you like, and so on. Is there a big difference today in the experience of reverse engineering older software versus new software?
THOMAS DULLIEN: So I haven't done a lot of reverse engineering of older software recently. There are a lot of things that make reverse engineering much, much easier on the one side today. Like, you've got working decompilers that can create C-like code out of binaries. On Linux, you've got reversible debugging, so you can literally step backwards in time. You've got very, very strong emulators in QEMU that can then introspect the entire system. So there is a lot of awesome stuff available now.
On the flip side, the software you're looking at is orders of magnitude more complex. And the DRM, people have had 20 years of well-funded commercial evolution to become better at what they're doing as well. Clearly, if you take today's tooling against last decade's DRMs, you're going to have a better time than you had at the time. But all in all, it's always a cat and mouse game of sorts, right? Where it's not always clear who has got the upper hand in that duel.
CRAIG BOX: In 2007, you went to present at the Black Hat Conference in the USA. And you were declined a visa. That was widely reported in the security news at the time. Can you give us a little bit of background on how that event not only happened, but how it resolved itself, and if you've been able to travel back to the US in the future?
THOMAS DULLIEN: That entire episode was a bit of a silly story, where it turns out that, legally speaking, if you accept a speaker honorarium, it does matter whether the conference is a for-profit or a not-for-profit. And that can then have effects on your immigration status as you enter to speak at that conference. And clearly, I was unaware of that at the time. And then that caused a whole bunch of problems that later on just resolved themselves with me making multiple pilgrimages to the consulate and finally getting the right visa for doing what I was doing, but it was certainly a bit of a mess at the time.
CRAIG BOX: You mentioned founding a company. You were traveling and presenting your work on behalf of that company. That company was then acquired by Google. And all of a sudden, you were working in a big company rather than a small one. What changed for you when that happened?
THOMAS DULLIEN: Oh my, that's a long and complicated question. The interesting thing about running a small bootstrapped company is that you have very, very few interests to balance in the end. You get interest alignment between the founders, the employees, and the customers relatively easily. So what's right for the founder is often right for the team, and is often right for the customers.
As you enter a big organization, things get way more complicated, because there is now the interest of me as a leader of my team. There is the interest of the team. There is the interest of the overall organization. There are all the other competing interests of everybody else. And navigating that is much more complicated than navigating a small environment.
And interestingly, also now with running a venture-backed startup, it turns out that that's again a very different experience than running a company that doesn't have investors. So it turns out that all three experiences are super distinct. And none of them prepares you for the others, which is a bit counter-intuitive.
CRAIG BOX: No, I think that makes sense. I guess, though, that you now have the resources of a much larger company. And how is that able to accelerate your work?
THOMAS DULLIEN: Being at Google is clearly pretty great from the perspective of just having vast amounts of compute at hand, particularly for what we were doing at the time. We were scaling the system independently. And then you get dropped into the computing ocean at Google and just have a different dimension, really, of computing at hand to solve problems.
The downside was that we had to rewrite pretty much everything to work well with Google's infrastructure at the time. Cloud wasn't a thing. And the right thing to do was a rewrite, anyhow.
I used to joke that Google makes a lot of easy things hard in order to make the impossible possible. It is true that writing software for Google's infrastructure is often quite different than writing software outside of it. But then you get such leverage for free, where just scaling it to n-thousand machines is just normal.
I mean, nowadays, people know about Kubernetes. And people know about containers. But the first time you come in from the outside and you're exposed to Borg, that's clearly a mind-expanding moment.
CRAIG BOX: That's a good way to describe it. You were able then to open source some of the products that you previously used to sell. Were you changing your strategy at that point to be working more on internal research at the direction of someone else? Or were you largely a research group who were able to keep doing the work you were doing before?
THOMAS DULLIEN: Google acquired us for certain technologies that we had. And they didn't acquire us for the commercial value of the products we were selling beforehand. One of the three products was made freely downloadable, which is freely downloadable to this day and maintained by some heroes on 20% time. Another product was open sourced. And then other technologies were scaled internally for the security of Google's users.
CRAIG BOX: You worked for a few years in Project Zero as well. What was the most interesting bug or security flaw that you found during that time?
THOMAS DULLIEN: I have to admit that I don't think I found particularly interesting security flaws in my time at Project Zero compared to the interesting security flaws found by teammates. A lot of the security flaws I found at the time were of the sort where software had been forked a long time ago and forgotten about. So somebody took UnRAR, and took parts of the source code, and inserted it somewhere else. And then that had become a security vulnerability.
So I did a little bit of software genealogy and archaeology at the time, but I don't think that compares to, for example, Jann Horn's work on speculative execution at the time, because I sat in the same room with him while he was working on that. And I think the interesting thing about Project Zero is that, having been a pretty renowned security researcher before, I was just about average in that room. And that was a pretty great feeling.
CRAIG BOX: Yes, that is definitely a thing that people notice coming to work at Google, is just the people who are around you working on these things, like you say, the moment when you realized that there is Borg available to you and then getting to talk to, on the show, many of the people who have worked to build those systems up. You're now working in profiling, leaving the security space a little bit. Are reverse engineering and profiling effectively the same thing?
THOMAS DULLIEN: They're definitely related. The motivation for doing what we do now was that my co-founder and I were both a little bit burned out about security. We mentioned my Twitter bio, which is, I do math. And I was asked by Robert Morris Sr., for whom?
And this for whom is actually a very, very interesting question, because in security, you're always working for somebody against somebody. All of security is fundamentally about human conflict. At some point I just got tired of always working on human conflict. And I was asking myself, hey, what can I do where I'm not working for one side against another side, but where I am helping people do something better?
And then I realized, with the end of Moore's law, there is an opportunity to get agreement between my technical interest, which is low-level systems work, and my economic interests, because with profiling, you can help people compute more efficiently and save them money, and then lastly, my ecological interests, where through profiling, we can help people compute efficiently and save energy.
And then I realized that a lot of the skill set I had from security work ports over quite nicely to the efficiency space, because if you look at security work, you're often faced with a large, large mountain of legacy code. And then you dive into that code. And you find security problems. And then everybody's mad at you because it needs to be fixed.
When you look at a lot of performance work, similarly, you have mountains of legacy infrastructure. And then you dive into it. And you find problems. And then everybody's happy, because it's now faster and cheaper.
The toolset itself in terms of analyzing across abstraction layers, analyzing the whole system, analyzing top to bottom, being unafraid of going to the level of the assembly or the kernel, but being also aware of the high-level data structures and design decisions, all of that is actually very, very similar. It's just that the benefit is different.
CRAIG BOX: Bringing the conversation back a bit to Borg, one of the things that we found at Google when we released Kubernetes was that we assumed people would want what you call the ecological benefit. People would want to be able to run more workload on fewer machines and get a cost saving from that. But it turned out that the most impactful piece of Kubernetes to start with was simply the fact that people weren't able to automate their work in general.
There were a few people who had a good CI system set up, but the tooling at the time just didn't have the APIs to make that possible. And we had to meet people where they were to eventually get them on board with that to get to the point where they were then able to start thinking about driving efficiency and so on.
That all being said, you had a background in security. And there are a number of companies out there who are helping people, again, maybe slightly earlier on in the journey. And a number of those companies have blown up in terms of investment and valuation these days. Did you think at the time about maybe starting a security vendor?
THOMAS DULLIEN: I did think about it. And I more or less consciously decided against it. The reason why we consciously decided against it is twofold.
One is I actually wanted to get out of the human conflict business. The second one is that, in the hierarchy of needs of software sales, security is always a cost center, meaning if you're starting a B2B business, the best product you can have is the product that grows your customers' top line. So they buy your product. They make more money. That's why AdWords is such a fantastic product.
The second-best product is the product they buy and they save money. So it grows their bottom line at the same top line. And then the least-great product is the product that does anything else.
Security is always in the anything else bucket. You're always a cost center. As you go down that hierarchy of products, your organization becomes more sales driven. And that's why, if you go to something like the RSA conference, the RSA conference is a huge sales show and sales event.
And my co-founder and me, we looked at us honestly and said, we are not good at building sales organizations, so we should not be in a market where sales is critical to your success. That pretty much sealed the deal. And we said, OK, we may want to do something else than security.
CRAIG BOX: When you launched Optimyze, it launched as a consulting business, where your pitch to people was that you would profile for them and then take a percentage of the money that they saved as your fee or your cost. Was that something that you decided to do to bootstrap? Or was there always an intention that you'd move to a more product-based model?
THOMAS DULLIEN: Initially, we thought there is this great opportunity here, because it's such a clear win-win. The customer doesn't pay anything until you've actually saved them money. It seemed fantastic on paper as a consulting opportunity, because my experience is that, once you drop profiling infrastructure into an infrastructure that hasn't been analyzed before, it's a bit like switching on the light in the basement that has not been touched in 20 years. You find so much stuff to sweep out and so many easy wins.
And we thought that the right way to align incentives between clients and us at that point would be to do this value-based pricing. It turned out that it's a great conversation starter, but no business of sufficient size can actually sign off on a professional services contract where the actual cost isn't entirely clear upfront, even if that cost is always negative in total, which was a very humbling experience, because it goes to show that, in the end, as companies get larger, organizational logic overrules real logic.
So even if you can go to a company and say, hey, the product we're selling is always net negative cost — like, it can never cost you money. You'll always be better off financially than if you hadn't done it by a certain amount. Surprisingly, that's not a great sales pitch, because accounting can then not exactly say, hey, out of whose budget should no money come?
Yeah, it was very humbling. And it was a good business lesson to learn, that what you think makes total sense on paper does not make a lot of sense when viewed through the prism of real-world organizations. And then we realized that perhaps the right way to do this is just give the people the tool they need to do it themselves. Then we pivoted to product.
CRAIG BOX: Your product launched recently is called Prodfiler. I like the pun. It works well. You are in the Kubernetes space, to some degree. Did you consider calling it Podfiler?
THOMAS DULLIEN: We did, but we didn't want to have it seem exclusively Kubernetes-focused, because we pretty much want everybody to run it in all of Prod. And they don't need to necessarily run Kubernetes, sorry to say. I mean, we do work on Nomad and other platforms as well.
CRAIG BOX: Tell me a little bit about how profiling works in general, and how distributed profiling is different to profiling just a single machine.
THOMAS DULLIEN: Sampling profiling is a relatively simple mechanism where your CPU gets timer interrupts all the time anyhow. The easiest thing you can do is you can just keep a record and account of where precisely that timer interrupt fires. And then you can know where your program is spending time.
In addition to knowing where your program is spending time, you're probably interested in how did it get here. So for example, if it's spending a lot of time in malloc, you'd probably want to know, how did it get to malloc? Like, where is that allocation being called from?
So the logical complement, then, is that, at the time you get that timer interrupt, you unwind the stack and you record the stack trace. And then you save statistics about that.
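As an illustration of the mechanism Thomas describes here (timer tick, unwind the stack, count the trace), the following is a minimal Python sketch. It is not how Prodfiler works; the `SamplingProfiler` class and `unwind` function are hypothetical names invented for this example, and the timer uses the Unix-only `SIGPROF` signal from Python's standard `signal` module.

```python
import collections
import signal


def unwind(frame):
    """Follow f_back links from the interrupted frame up to the outermost caller."""
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    return tuple(reversed(stack))  # outermost caller first, like a printed stack trace


class SamplingProfiler:
    """Hypothetical helper: counts the stack traces seen at each timer tick."""

    def __init__(self, interval=0.005):
        self.interval = interval              # seconds of CPU time between samples
        self.samples = collections.Counter()  # stack trace -> number of hits

    def record(self, frame):
        # "Save statistics": one more hit for this exact call path.
        self.samples[unwind(frame)] += 1

    def _on_timer(self, signum, frame):
        # Python hands the handler the frame that was running when the signal arrived.
        self.record(frame)

    def start(self):
        # Unix only, and must be called from the main thread.
        signal.signal(signal.SIGPROF, self._on_timer)
        signal.setitimer(signal.ITIMER_PROF, self.interval, self.interval)

    def stop(self):
        signal.setitimer(signal.ITIMER_PROF, 0, 0)     # disarm the timer
        signal.signal(signal.SIGPROF, signal.SIG_IGN)  # ignore any late tick
```

Calling `start()`, running a CPU-bound workload, and then calling `stop()` leaves `samples` holding one counter per distinct call path; `samples.most_common(5)` would then list the hottest stacks, which is the "how did it get to malloc" question in miniature.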
Now historically, a lot of profiling was done on single developer machines in the game industry or in other industries where speed and latency are very important. But the reality is — I've talked about this before, and this is a very Google mentality thing that I took away — the right way to think about a data center is not to think about a data center as a group of computers, but to think about it as one huge computer that you want to treat as one huge computer.
And this is happening more and more. I mean, the entire modern cloud-native microservices world, we don't have one program running on one machine. We've got a huge service that is composed of many other services running on a large fleet of machines. And we try to treat that large fleet of machines like one computer.
I think that's a very powerful prism through which to look at computing today. And then if you accept that the data center is the computer and we're trying to build an operating system for that data center — and in some sense, Borg is an operating system scheduler for data center, and Kubernetes is trying to figure out whether it wants to be the scheduler, or just the entire operating system, or whatever — but once you view everything through the prism of the data center is a computer and we need to build an operating system for it, you can also ask yourself, what are the right debugging tools for that computer? And what are the right profiling tools for that computer? And then it turns out that good profiling tools for that computer didn't exist yet, because profilers were meant to be run on a single machine.
Good profiling infrastructure for the computer only exists in places that have a long history of having the computer. So Google has some profiling tools internally. Facebook does have some distributed profiling internally. But it just wasn't available to the greater world.
That's where the decision came from, let's build something that you can just deploy on the entire fleet and that can then measure across the entire fleet where things are going. Essentially bringing the experience that a game developer had on a single workstation in front of him to the developer that is working with an infrastructure of hundreds of services across thousands of machines.
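To make the "one huge computer" view concrete, here is a small Python sketch of what fleet-wide aggregation could look like once every machine reports its stack-trace counts. The host names, the sample numbers, and the functions `merge_fleet` and `inclusive_share` are all made up for illustration; they are not taken from Prodfiler or from Google's internal tooling.

```python
from collections import Counter


def merge_fleet(per_host):
    """Sum stack-trace counts from every host into one fleet-wide counter."""
    total = Counter()
    for host_samples in per_host.values():
        total.update(host_samples)
    return total


def inclusive_share(samples):
    """Fraction of all samples in which each function appears anywhere on the stack."""
    total = sum(samples.values())
    hits = Counter()
    for stack, count in samples.items():
        for fn in set(stack):  # count a function once per stack, even if it recurses
            hits[fn] += count
    return {fn: n / total for fn, n in hits.items()}


# Hypothetical per-host data: stack trace (outermost first) -> number of samples.
per_host = {
    "host-a": Counter({("main", "handle", "malloc"): 30, ("main", "handle", "parse"): 10}),
    "host-b": Counter({("main", "handle", "malloc"): 50, ("main", "gc"): 10}),
}

fleet = merge_fleet(per_host)
shares = inclusive_share(fleet)
```

With these invented numbers, `malloc` shows up in 80 of the 100 fleet-wide samples, something neither host's profile shows as clearly on its own.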
CRAIG BOX: There is a paper that Google published on their internal profiling system. It was co-authored by Eric Tune, who is one of the co-authors of the Borg paper and one of the early contributors to Kubernetes. Is this a moment like Dapper inside Google becoming Jaeger and distributed tracing outside? Is this sort of a re-implementation of an internal Google idea for the public?
THOMAS DULLIEN: Google published a paper about Google-wide profiling in 2010. And the results are very, very compelling in that paper. And there are multiple follow-on papers where Google writes about the insights they generated from their Google-wide profiling and how that affected things on a tactical level, where developers were empowered to make great changes; on a cultural level, where developers could argue for their promotion by showing the benefits they had brought to the code base; and lastly, on a strategic level, where Google made long-term investments in stuff like TPUs and hardware video codecs based on profiling data.
So the results from Google's experience doing fleet-wide profiling were so compelling, a lot of our motivation was, hey, can we bring what Google has internally as an external product to everybody? And there’s a bunch of technical hurdles, because Google achieves what they're doing through a bunch of trickery that a lot of non-Google companies just can't do. And then we had to solve a bunch of technical problems to bring the smooth experience of ‘it just works’ to people outside of Borg.
CRAIG BOX: It is over 10 years since that paper was published, so I'm somewhat surprised, I guess, to see that that problem really hadn't been solved between now and then. What were some of those technical complexities?
THOMAS DULLIEN: A lot of the technical complexities are actually related to, for one, frame pointer omission on x86. So it's actually surprisingly difficult to do reliable stack unwinding on x86 unless you are willing to recompile things from scratch or you've got certain amounts of hardware support. And in modern cloud infrastructures, the issue is that most people are running virtualized instances that don't provide the right hardware support to do the unwinding reliably. And nobody's in the situation that they recompile everything from scratch.
The first thing we have to solve is, find out how we can get reliable unwinding through native code working for everybody in a smooth way. That's where a lot of our low-level reverse engineering experience was very, very helpful. And then the second piece of the puzzle to actually make this work in practice was the arrival of eBPF in the Linux kernel and the wide deployment of kernel versions that support eBPF, because a lot of the heavy lifting that we do would in the past have required a kernel extension.
And trying to ask people, hey, can you run my kernel extension in your production, is a very, very big ask, whereas asking people, hey, can you run this eBPF program in production that will do this, that's a much, much lesser ask, because the eBPF code can't really take down the machine or cause other severe production problems.
It was the confluence of those two things that really enabled what we're doing here. And then, of course, there is a lot of grimy work of just implementing stack unwinding for all the different interpreters out there.
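The frame-pointer problem Thomas mentions can be illustrated with a toy model. The sketch below is not Prodfiler's unwinder; it only shows the classic approach that works when code is compiled with frame pointers, where each frame stores the caller's frame pointer and, next to it, the return address. The memory layout, addresses, and the function name `unwind_with_frame_pointers` are all made up for illustration.

```python
# Toy model of frame-pointer-based stack unwinding.
# "memory" maps an address to the 8-byte word stored there; all values are invented.

def unwind_with_frame_pointers(memory, frame_pointer, max_depth=64):
    """Walk the saved-frame-pointer chain and collect return addresses.

    Assumes the usual x86-64 layout when frame pointers are kept:
    [fp] holds the caller's saved frame pointer, [fp + 8] the return address.
    """
    return_addresses = []
    while frame_pointer != 0 and len(return_addresses) < max_depth:
        if frame_pointer not in memory or frame_pointer + 8 not in memory:
            break  # chain is broken, e.g. the register was reused for other data
        return_addresses.append(memory[frame_pointer + 8])
        frame_pointer = memory[frame_pointer]
    return return_addresses


# Three nested calls; a saved frame pointer of 0 marks the outermost frame.
stack = {
    0x7000: 0x7020, 0x7008: 0x401234,  # innermost frame
    0x7020: 0x7040, 0x7028: 0x401100,
    0x7040: 0x0,    0x7048: 0x401000,  # outermost frame
}
frames = unwind_with_frame_pointers(stack, 0x7000)
print([hex(address) for address in frames])
```

When a compiler omits frame pointers, the chain this loop follows does not exist, and an unwinder has to recover caller frames some other way, for example from unwind tables shipped with the binary. That is the harder case the conversation refers to.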
CRAIG BOX: Is there anything complex about the distributed nature? Or are you just sort of aggregating the things that run on individual machines?
THOMAS DULLIEN: There is just a matter of scale and thrift. As you start collecting data from many machines, you need to be very thrifty in how much of the data you send out. And you need to be thrifty in your CPU consumption along the way.
So pretty much, at every step of the way, you need to carefully account for how much data am I sending? How much CPU am I eating? And so forth. Because in the end, we all want observability, but we don't want an arbitrarily high observability tax.
And something like profiling needs to be reasonably cheap in order to be worth doing. Like, if I have to pay 20% of my fleet in order to do profiling, then I'm not going to do it. The challenge for the large distributed environment is really one of scale and very, very thrifty engineering on the data processing side.
CRAIG BOX: That could then go back to a value-based discussion though, because one of the things I remember people saying when service meshes were new is, yes, it's going to cost CPU to run the sidecars. But you're going to get so much more in terms of observability that it's worth paying that cost. In the case of profiling, you are obviously going to be able to find areas of inefficiency and effectively recover the cost of that agent fairly quickly, I would hope.
THOMAS DULLIEN: You would hope. But also, one big problem with profiling tools — and with any new tool — is one of adoption. It's much easier to get people to adopt something when the initial cost is very low.
It's much easier to convince somebody, install this, it'll eat 1% of CPU, and then they see great results. Versus, install this, it'll eat 15% of your CPU, and then them seeing results that make up for it.
CRAIG BOX: Right.
THOMAS DULLIEN: You really want that experience of just trying this out to be, like, there being an extremely low hurdle, ideally no friction to it. The thing is, once developers see the benefits of having a distributed profile — it's a bit like, I mentioned reversible debugging as being a game changer for a lot of reverse engineering, and having distributed whole-system profiling across everything is one of these moments where, after using it for a while, you can't really imagine going back to not having it. So you want to get people to try it out, just because that's a good way of getting them hooked.
CRAIG BOX: Do you think that a profiler is something that people should keep running on the production environment all the time? Or they should only enable it when they know that they have a problem?
THOMAS DULLIEN: My view is that you should probably run it on a significant fraction of your fleet all the time, because it's such a game changer for debugging issues as well. Like, the time travel aspect of it is so great. You see an individual node consuming too much CPU on Sunday morning at 4:00 AM. You can go back in time and ask, hey, what was that node doing when we saw that spike? And the time travel is just great.
So you can sample on a fraction of your fleet at all times. But then as that fraction gets smaller, the odds of you catching the difficulties when they arise are less. I guess my view is, I would probably run it on a significant fraction of the fleet all the time, possibly the entire fleet all the time. But then, pretty much by varying the percentage of your fleet that you're running it on, you can calibrate the total costs according to your taste: the total cost in CPU, RAM, and so forth.
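The calibration described here is simple arithmetic: the CPU spent on profiling scales with the share of the fleet being profiled and with the per-machine overhead. A minimal sketch follows, assuming a hypothetical fleet of 1,000 nodes with 32 cores each and the illustrative 1% overhead figure from earlier in the conversation; none of these are measured Prodfiler numbers, and `profiling_cost_in_cores` is a made-up helper.

```python
def profiling_cost_in_cores(nodes, cores_per_node, fleet_fraction, cpu_overhead):
    """Estimate how many cores' worth of CPU profiling consumes fleet-wide."""
    if not 0.0 <= fleet_fraction <= 1.0:
        raise ValueError("fleet_fraction must be between 0 and 1")
    if not 0.0 <= cpu_overhead <= 1.0:
        raise ValueError("cpu_overhead must be between 0 and 1")
    return nodes * cores_per_node * fleet_fraction * cpu_overhead


# Hypothetical fleet: 1,000 nodes with 32 cores each, 1% overhead per profiled node.
whole_fleet = profiling_cost_in_cores(1000, 32, fleet_fraction=1.0, cpu_overhead=0.01)
fifth_of_fleet = profiling_cost_in_cores(1000, 32, fleet_fraction=0.2, cpu_overhead=0.01)
print(f"whole fleet: {whole_fleet:.0f} cores, 20% of fleet: {fifth_of_fleet:.0f} cores")
```

Under these assumptions, cutting the profiled fraction to a fifth cuts the cost to a fifth, at the price of a lower chance of having data for any given node when an incident happens.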
CRAIG BOX: Do you think people should think about it in terms of how much they want to spend and then what percentage of the fleet they should enable, versus thinking of it solely as an I want to do 80%, or 95%?
THOMAS DULLIEN: Initially, to get your feet wet, you're going to enable it on some fraction of the fleet, and then as you see utility, you can expand it from there. And the way this works in practice is you would do something like, I don't know, 20% of the fleet. But then that means 20% of the fleet will be sampling at any point in time. And then every couple of minutes, you switch to a different 20%, because in the end, you want to get a full picture of what everything is doing across the fleet.
It also depends on the fleet's size a bit, right? We've got people running 200, 300-node Kubernetes clusters where they're sampling on every machine, and they're quite happy with it. But then you can also decide to just sample it on a fraction and extrapolate from the fraction. For the pure profiling use case of seeing where cycles are spent, a fraction is probably fine. For the use case of debugging production issues, usually having more data is better.
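The rotation Thomas describes, profiling a fixed share of the fleet and switching to a different share every couple of minutes, can be sketched as a generator that hands out groups of nodes until every node has had a turn. How Prodfiler actually selects nodes is not covered in the conversation; the shuffling, the `sampling_schedule` name, and the node names below are assumptions for illustration.

```python
import random


def sampling_schedule(nodes, fraction, seed=0):
    """Yield successive groups of nodes to profile, covering the fleet once per cycle."""
    if not nodes:
        raise ValueError("nodes must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    group_size = max(1, round(len(nodes) * fraction))
    rng = random.Random(seed)
    while True:
        shuffled = list(nodes)
        rng.shuffle(shuffled)  # new order each cycle so groupings vary
        for start in range(0, len(shuffled), group_size):
            yield shuffled[start:start + group_size]


fleet = [f"node-{i}" for i in range(10)]
schedule = sampling_schedule(fleet, fraction=0.2)
# With 10 nodes at 20%, five rotations make up one full pass over the fleet.
first_cycle = [next(schedule) for _ in range(5)]
for group in first_cycle:
    print(group)
```

The caller decides how long each group stays active, for example by requesting the next group on a timer set to a couple of minutes.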
CRAIG BOX: You launched recently, and one of the first things that you put out as a use case was a production mystery with the Kubelet eating CPU and IOPS on a machine rising to 10 times the normal level and eating effectively a core of CPU on an entire machine. Was that a customer that you did that for? Or how did that come about?
THOMAS DULLIEN: Yeah, so we had early beta deployments for Prodfiler. And when we rolled that out, we found a lot of performance murder mysteries of similar style where something weird is happening and then you have to dig in to see what's going on. This was actually with one of the early deployments where, due to some bizarre accident, the Kubernetes user was creating a large number of temporary files in the ephemeral layer of the Docker container.
And that started bogging down the machine. And we just saw huge amounts of CPU time spent in Kubelets in that infrastructure. And then when we dug in, we were very surprised about this particular failure mode. And then we pretty much re-implemented and reproduced the same issue separately for the blog post, because clearly, the customer did not want their production infrastructure to feature.
CRAIG BOX: Is that something where it's fair to say that the cAdvisor tooling that was showing that bug should be fixed? Or is that just to say, all right, well, if you're going to create containers in that fashion, it's always going to have to trigger this kind of behavior?
THOMAS DULLIEN: It's an interesting one, because it's a conscious design choice inside cAdvisor to some extent that is driven by disk usage accounting, and to some extent, disk quota accounting. In Linux, it's the responsibility of the file system. There is no general API that you can just call for these things. And now, as the implementer of something like cAdvisor, you've got the choice: if I want that data, do I need to be file-system-specific? Or do I make do with the APIs that are available, which is recursively enumerating the directory tree?
And that's a tricky one, right? In some sense, the cleanest solution would be to add an API to all file systems to provide that functionality, but that may not be an option. So I think perhaps just putting a warning sticker into people's minds, "don't create a lot of files in the ephemeral layer of a container, or if you do, then configure cAdvisor to not do the directory walks," is possibly the right choice.
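To see why a large number of temporary files is costly under this design, consider what measuring disk usage by recursive enumeration involves. The sketch below is not cAdvisor's code; it is a generic illustration using Python's standard library, with a hypothetical `directory_usage` helper. Every file under the root costs at least one stat call, so the work grows with the file count rather than with the bytes stored.

```python
import os
import tempfile


def directory_usage(root):
    """Return (total_bytes, file_count) by visiting every file under root."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total_bytes += os.lstat(path).st_size  # one stat call per file
            except OSError:
                continue  # file disappeared between listing and stat
            file_count += 1
    return total_bytes, file_count


# Simulate a container layer filled with many small temporary files.
with tempfile.TemporaryDirectory() as scratch:
    for i in range(1000):
        with open(os.path.join(scratch, f"tmp-{i}"), "wb") as handle:
            handle.write(b"x" * 10)
    usage = directory_usage(scratch)

print(usage)
```

Repeating a walk like this on a schedule over a directory holding a very large number of files is the kind of load that shows up as CPU and I/O in a profile.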
CRAIG BOX: With a tool like this, you now have to consider the business model. You talked before about the model of charging a fixed price to someone being better than charging a variable price. But then you also have the choice of how much you want to open source, and being built on top of eBPF, and presumably having agents that need to run on people's machines, whether this is something that can run entirely in their local environment, whether there is a cloud back-end service, and then if you make that available to people who want to run completely disconnected, and so on. How did you go through that thought process? And where did you eventually land?
THOMAS DULLIEN: That's a complicated thought process. We haven't quite 100% nailed down the pricing model, to be honest. We expect that the pricing model will be something like charging per core month that we're profiling on. So it will be most likely volume-based in some way, but we're still trying to figure out what is actually the best pricing model, because you want it to be cheap enough to be tried, but you also need to run a business on top of it, because we do have to do a fair bit of data processing on our side. So that's one side of it.
When it comes to the question of whether to run an infrastructure for the users or whether to allow people to run the infrastructure in an entirely disconnected place, that's another really difficult question, because, from the engineering side, it's quite non-trivial to package things in a way that they can be run in a disconnected manner.
There is a reason why Google doesn't ship Borg as shrink-wrapped software you can buy, or Bigtable as shrink-wrapped software you can just spin up. They are offering managed services for a reason, like providing the right SLAs, providing the right SLOs, and so forth. It's just much more feasible in a managed service than in providing, in quotation marks, "shrink-wrapped software."
So at the moment, we're simply not confident we can deliver the amount of availability and reliability, and ease of management that people need in a disconnected scenario. That's why we're currently running all of this as a managed service that sends data to our back end.
The last part of the question with open source, versus commercial OSS, versus shareware — the thought process there is, everybody would love to be as open source as possible, but you also have to run a business. And the reality is that, for example, the experience with the fight between Amazon and Elastic, open sourcing something early and then later on having to revert back on that promise because another business entity starts behaving in a particularly annoying manner, that's always a very bad spot to be in.
So we tried to minimize mistakes early on, where we're currently keeping everything that we can keep from that software closed source in the classical sense. And then we'll try to open things up over time as we understand how that would work, and to the extent that that would work. We do allow people to inspect the source if they have security concerns, but that's a different story. That's not really open source. That's just helping people get confidence that things are secure.
CRAIG BOX: I was going to ask about that. If someone wants to look at the agent that's going to be running and seeing the internals of everything running on their machine, are you expecting them to decompile it to assembly and understand it in that fashion?
THOMAS DULLIEN: No, we do provide the source code under the right licensing agreement for security review. It's just that we don't just make everything open source. That's the point.
CRAIG BOX: That was really a reverse engineering question.
THOMAS DULLIEN: [CHUCKLES] I mean, you can't expect people to do a very good job at this. You want to make it easy for people to get confidence.
CRAIG BOX: You mentioned before all of the different programming environments that you have to put in support for stack unwinding. Can you name them all off the top of your head?
THOMAS DULLIEN: I hope I won't forget anything. So, we support C and C++ code in production. We support Rust in production. We support Go in production. Those are the native languages, of course.
Then in terms of non-native, we support the JVM (everything based on OpenJDK, essentially). We support Python. We support Perl. We support Ruby. We support PHP. We have Node.js in the works, so that hopefully is arriving in a couple of months or weeks. And then the next things we need to tackle are .NET and Erlang. And I think once we have those, we should have a fairly broad coverage of everything. I hope I didn't forget anything.
CRAIG BOX: Yeah. Fortran aside, I can't think of anything else that really you'd need to implement.
THOMAS DULLIEN: Well luckily, Fortran will actually compile nicely to something that will be unwound automatically by the existing C/C++ unwinding. So we're good on that front.
CRAIG BOX: A lot of people are looking now to compile things down to WebAssembly. Does that intermediary help you in this case?
THOMAS DULLIEN: That's an interesting question. We haven't actually worked with any WebAssembly-based container runtimes yet. We would probably have to do a bit of engineering before things would work, because we currently rely on ELF files being found on disk. We haven't actually done any profiling on Wasm yet. It's an exciting avenue, though.
CRAIG BOX: What about Windows?
THOMAS DULLIEN: We would love to support Windows, but in the end, Windows currently does not have the same eBPF power that Linux has. Windows does have DTrace, nowadays. And it is imaginable that you could build something similar to what we're doing on top of the DTrace infrastructure in the Windows kernel.
We've also had an enthusiastic user get our profiler to at least produce some data using the eBPF subsystem inside the Windows Subsystem for Linux. But the reality is, right now, we do not support Windows and don't have any plans to, because very few people run a large-scale services infrastructure on Windows. People use Windows as a developer workstation. And then they run production on Linux. The people that are burning most CPU tend to do so on Linux.
CRAIG BOX: Now that you've launched, and presumably are getting broader feedback from a larger set of customers, what other things do you feel you need to implement?
THOMAS DULLIEN: Oh, there is plenty. We want to have much more detailed profiling for stuff like I/O contention. We want to have much more detailed capability of doing memory profiling in the future. We don't have much of that at the moment. And it's a difficult lift to do in a nice and smooth manner.
There is a lot of work to be done on automatically alerting on profiling data changes in production. There is a lot of interesting work to be done with CI/CD integration. There is ARM64 support, which we're very excited about. We're working on that at the moment quite actively, because we see so many people move to ARM64-based cloud offerings.
CRAIG BOX: All right, well, I look forward to seeing how things develop. And thank you very much for joining us today, Thomas.
THOMAS DULLIEN: Thank you for having me, and have a great day.
CRAIG BOX: You can find Thomas on Twitter @HalvarFlake. And you can find Prodfiler at prodfiler.com.
CRAIG BOX: Thank you so much for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @kubernetespod or reach us by email at firstname.lastname@example.org.
JIMMY MOORE: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes, as well as links to subscribe. Until next time, take care.
CRAIG BOX: See you next time.