Kubernetes Podcast from Google: Episode 150 - Pixie, with Zain Asgar and Ishan Mukherjee

#150 May 13, 2021

Pixie, with Zain Asgar and Ishan Mukherjee

Hosts: Craig Box, Alex Ellis

Pixie Labs built an observability platform for Kubernetes, which uses eBPF to get telemetry without user intervention. They were recently acquired by New Relic, who open sourced the Pixie software. Co-founders Zain Asgar and Ishan Mukherjee join Craig Box to tell the story and talk about what’s next. Guest host Alex Ellis tends his garden.

Transcript

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box, with my very special guest host, Alex Ellis.

[MUSIC PLAYING]

CRAIG BOX: It was a busy week last week at KubeCon, so I have to ask you Alex, did you manage to take some time off and perhaps do a little bit of gardening?

ALEX ELLIS: I did, actually. I've been spending far too much time in my garden shed and have these lofty ambitions of converting it to a space where I can go and retreat and do a bit of work and a bit of gardening.

CRAIG BOX: And of course, there's no gardening these days without some technology, so tell us a little bit about Grow Lab.

ALEX ELLIS: Grow Lab is something that started in 2017. I put a Raspberry Pi on the side of a plastic box and took a time lapse, and a couple of people got involved with it. And we had some fun. And then I didn't think about it until this year. And I thought, you know what? I want the camera to be over and above. And so I was brainstorming a few ideas of how I could actually make that happen, and part of that was using some copper piping, just like you'd have in your plumbing at home, baseboard of wood.

I got it all working and thought, I shouldn't keep this to myself. The benefits of community, of learning new skills, and of just taking time out to do things that aren't productive and aren't going to make you a success, just for leisure, are really underrated, and I wanted to share that with other people.

CRAIG BOX: It is. It's always fun-- I watch a bunch of YouTube videos of people who just do something interesting that they enjoy. And it's great when that shows up. And it's great that you get to share that with people. Bring a little bit of Functions-as-a-Service in there as well, I'm sure.

ALEX ELLIS: Yeah, so there are actually three experiments now. And just to prove that you can take things far too far, I've registered a domain name, and I've got a logo for this. We've got 15 different people registered, and to make them feel like they're scientists, we call them lab technicians. So you're a lab technician; we've got 15 of them. There are three experiments. The first is the one that's the contest.

So if you want to win a prize from OpenFaaS Limited or Pimoroni, what you do is you take a timelapse over two weeks, and there are some other things to do as well, and that gets you into the contest. The video can then be generated with FFmpeg, a standard Linux tool. The second thing you can do is a live preview, and that uses, again, Python, and a humidity, temperature, and air pressure sensor. And I thought, well, actually I've done this before.

I want to make it really easy for people to have a data logger. So I have one in my shed, one in the arbor-- which is like an outdoor seating area-- and one on the Grow Lab. And I plot all three of them, using an OpenFaaS function to receive the data, update an InfluxDB database, and view it with Grafana. And because I don't want a heavyweight Kubernetes cluster to run it all, I just used OpenFaaS's new faasd project. It just does everything for you.

CRAIG BOX: Well, you can learn all about Alex's projects in our interview in episode 116. A number of them loomed large at KubeCon last week. There was a mention of OpenFaaS in the RISC-V keynote. And K3s, a project you've been doing a lot with: you've actually made a training course for Kubernetes on Edge, which was announced at KubeCon last week.

ALEX ELLIS: Yes, and that's something that was put together for the Linux Foundation and CNCF. I think they just liked the course we did on serverless on Kubernetes so much that we wanted to work together again, and we had this great idea with K3s coming into the CNCF: could we help people understand Edge, LF Edge, the whole landscape, what SoCs are like, and what the difference is? But we keep finding people asking about K3s.

What is it? Why is it different? When should I use it? And so we cover a lot of that. And then one bit I'm really excited about is just the basics of Kubernetes, probably a condensed version of what you need for CKAD: how to use a pod, how to use an ingress, how to build a container image. And then finally, we get to tie it together, again, with a little bit on OpenFaaS and functions, Rancher's Fleet project, Flux, continuous deployment, and HA. And I think if you're anywhere near Kubernetes, you'll really enjoy this course.

CRAIG BOX: Maybe we need to tie these concepts together a little bit. We need some sort of certified gardener where we can go through and talk about the process of germinating seeds, and so on. There was so much I saw in the Grow Lab stuff I'm like, this is all very technical and I've never really given too much thought to it. But I'm sure that I would probably be better off if I learned it rather than just experimenting.

ALEX ELLIS: Yeah. I mean, I did a talk for Equinix Metal at their GIFEE Day. That's Google Infrastructure For Everyone Else. So obviously, you get access to Google infrastructure yourself, but the rest of us, Craig, we're kind of playing with our Raspberry Pis, simulating it. And I was asked, can you do something that shows how to get commodity, low-powered hardware into the hands of people so they can do something interesting?

And so I took on the challenge. And I even sent the talk to my mum afterwards, and she said, "Are you trying to be Monty Don from Gardeners' World?" Because at the end of the talk, I showed how my tomatoes had fallen over in the wind last year. I showed how my aubergines hadn't grown quite big enough, how I'd overwatered things, and everything I'd learnt. And yeah, there is so much to learn about nature, but the great thing is it doesn't really change much, unlike tech.

CRAIG BOX: Well, I will just point out that Google infrastructure is available to everyone else in the form of Google Cloud, but gardening information is probably not my forte, so do check out Grow Lab for that. And let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: Microsoft announced an open source project to bring eBPF to Windows 10 and Windows Server.

An eBPF runtime and a bytecode verifier are built into the Linux kernel, and many networking and observability utilities make use of this feature. The Windows work is based on two existing projects, a user-space runtime and an extended verifier. The project is still in early development, but it is being shared to start a collaboration with the eBPF community. In related news, Google Cloud's eBPF-powered Dataplane V2 for GKE is generally available this week.

ALEX ELLIS: Confluent for Kubernetes is generally available. Confluent is a commercial Apache Kafka service, and with this new release, you can run it inside your own environment. The operator-based install handles upgrades, scaling, resilience, and monitoring.

CRAIG BOX: VMware Tanzu SQL now supports MySQL, joining Postgres as a supported database system that you can install with their operator. The new product installs the Percona Server variant of MySQL. Also from VMware this week, a new Modern Apps Connectivity solution, which brings together the capabilities of Tanzu Service Mesh and NSX Advanced Load Balancer and runs on almost any Kubernetes.

ALEX ELLIS: Finally, every year the DORA team, now at Google Cloud, runs the State of DevOps survey and publishes a report based on the findings. The survey is for everyone, regardless of how far you are on your journey or the size of your team. DORA says that even reading the survey questions can spark ideas for improvement. So that's another good reason to check it out and play your part in this community survey.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Zain Asgar is the GM of Pixie and New Relic open source, and CEO and co-founder of Pixie Labs. He is also an adjunct professor of computer science at Stanford and was an entrepreneur in residence at Benchmark before co-founding Pixie.

Ishan Mukherjee is lead product go-to-market at New Relic and co-founder and chief product officer of Pixie Labs. Ishan led Siri's knowledge graph product team at Apple after the acquisition of Lattice Data, a Stanford-based startup where he led product and go-to-market. Welcome to the show, Zain.

ZAIN ASGAR: Hi, Craig. Thanks for having us here.

CRAIG BOX: And welcome, Ishan.

ISHAN MUKHERJEE: Thanks, Craig. Thank you for having us.

CRAIG BOX: We've got a lot to talk about today, but to set the scene, can you give us the 90-second elevator pitch about what Pixie is?

ISHAN MUKHERJEE: Pixie is a Kubernetes-native observability system which helps collect telemetry, so metrics, traces, events, and logs, for all workloads inside Kubernetes without the need for any code changes. The second thing is that Pixie lives entirely inside the Kubernetes cluster, so no data has to get stored or persisted outside of the cluster. And then the third is that it exposes this kind of interesting interface, where the API allows you to write scripts to do production troubleshooting, but also build a lot of interesting applications using telemetry.

CRAIG BOX: Pixie Labs emerged from stealth in episode 124 and was acquired by New Relic in episode 132. Now, not five months later, the team has followed through with the promise of open sourcing and submission to the CNCF sandbox. Your company was founded in August 2018-- which is around the time of episode 15, in case you were wondering-- but your stories will have started a lot earlier. Perhaps you could both start by sharing the first memory you have of when computing came into your lives.

ZAIN ASGAR: I had this 286 computer a very, very long time ago. I can't even remember what year it was. I'll actually share a couple of memories. The first one was that my dad had just bought a 40 megabyte hard disk. I was very excited because the original hard disk on his computer was 20 megs and I had 60 megabytes total. That was quite an accomplishment for computing back then.

CRAIG BOX: How were you possibly going to fill that?

ZAIN ASGAR: Right, I thought, I can install anything and download whatever I want, and life is good to go. Similarly, I also had this 286-- I don't know if you want to call it a laptop, because it was quite heavy to be a laptop. I like to call it a backtop; it was kind of like a suitcase computer with this very monochrome screen and, of course, it also had like 20 megabytes of disk space, because that was state of the art.

CRAIG BOX: Was that what your career was, basically, from the moment you very carefully picked that up?

ZAIN ASGAR: Pretty much. I was a child. It was actually quite heavy to lift as a kid. It weighed like 30 pounds. It's quite ridiculous.

CRAIG BOX: So he was either going to go into weightlifting or computer science. It was one or the other.

ZAIN ASGAR: Pretty much. Pretty much.

CRAIG BOX: Ishan, how about you?

ISHAN MUKHERJEE: For me, I grew up in really small mining towns in India. So we were far removed from what was happening in the US, in the Western world. So there was a lag in technology adoption, which doesn't exist anymore. But my brother, who's a surgeon now in New York, he is a hacker. I remember when he was in the eighth grade, he started to write Visual Basic programs to make a ping pong game.

And he used to take these kind of massive notebooks and write all the routines, like go to 10, go to 80, all of that stuff. So I remember as a small kid just reading through that and thinking, wow, this is possible. So that was my first example, but that quickly switched to going to the local government office to set up my AltaVista email.

That's how I got looped in. But very quickly, I became pretty obsessed with building bikes and robots and stuff. So for me, it was building with both hardware and software, and that's how I got into mechatronics and robotics.

CRAIG BOX: Now, you've obviously had a great career. You're a startup co-founder, you've been acquired recently. Do your parents still think of you as the kid whose brother is the doctor?

ISHAN MUKHERJEE: I still do think that they think that I am an IT person, which is less valuable to society than being a surgeon. But no, they're all super proud I think. At least back in India, they know what software is, but their view of software is an IT person. I don't think I'll ever be at my brother's level.

CRAIG BOX: Well, you did work at Amazon through the acquisition of Kiva Systems, who make some very interesting robotic technology. Tell me a little bit about your time there.

ISHAN MUKHERJEE: I moved here to the US for grad school at MIT. As part of that MIT ecosystem, there were a bunch of robotics startups, like iRobot and Kiva. It's a very healthy robotics ecosystem up there in Boston. So Kiva was essentially building robots which help fulfillment workers inside warehouses pick, pack, and ship your e-commerce orders. It was still a pretty nascent industry. Essentially, robots are the most unglamorous industry.

So what we used to do was obviously build the hardware, but also essentially this control system which organized products inside of the warehouses, so that after the order comes in, you can ship it out in like 30 minutes or an hour. And my role there was to build simulation systems to figure out, for a large warehouse-- whether that's Target or Office Depot-- how many robots you need.

So that's how, as an engineer, I got in, thinking about the customer's need and this pretty sophisticated technology. Fast forward, we got acquired by Amazon, so Kiva is now Amazon Robotics. I think the Kiva robots run most Amazon warehouses. So it was pretty interesting as an application of fundamental research.

CRAIG BOX: Is robotics basically just becoming machine learning with arms?

ISHAN MUKHERJEE: I think that's a really good point. And that's where Zain's and my shared passions come from. I was obsessing about this automation and thinking about the human-machine interface, and both the hardware and the software had a part to play.

But with Kiva, what ultimately became clear was that the right path to automation was mostly software driven. And at that time, there were two schools of thought. There was one school of thought where you build these sentient, singular robots that were essentially RoboCop-like, smart humans.

And the other one was to really reduce the reliance on the hardware and do all of the logic in the software, and that's where Kiva were pioneers. The robots themselves are actually quite naive. It's an army of tens of thousands of robots all talking to a central service, right? So now, 15 years since that started, ultimately a lot of it is software driven, and the robots themselves are, as you said, for arms or for actuation, right? Whether you're picking up stuff or moving stuff.

And it's an interesting interplay. Obviously, Zain does research on Edge stuff, where a lot of the computing can now move to where it makes more sense because of all of the innovation. But ultimately, it's a pretty elegant balance where you have this clear intelligent model, and I do think automation is increasingly, primarily software driven.

CRAIG BOX: And, Zain, most recently, you were working in the machine intelligence group at Google. I hear that's quite a good place to work.

ZAIN ASGAR: Yeah, it was a lot of fun. Some interesting stories from there. When I first started working there, one of the first projects I worked on was actually doing apparel recognition, something that I can say I have absolutely no skills on.

CRAIG BOX: As in, this is a shirt, this is a pair of trousers, et cetera?

ZAIN ASGAR: Well, I think I can figure that part out. But it was like, which one of the million handbags is this? And that was a lot more challenging for me because first I didn't realize that there are a million different types of handbags.

CRAIG BOX: Can you tell the difference between a blueberry muffin and a dog?

ZAIN ASGAR: That's a different challenge.

CRAIG BOX: It's an unsolved problem.

ZAIN ASGAR: Yeah, that was someone else's challenge. My unsolved problem was, could I tell the difference between these two Gucci bags? And I'm like, I'm not sure I can tell the difference, but maybe a computer can. Let me try.

CRAIG BOX: Was that for the benefit of detecting counterfeits?

ZAIN ASGAR: No, so actually a lot of these features became a core part of what is now Google Lens. I think it provides a few features, right? One of them is just allowing people to do shopping. Oh, I'm really interested in this handbag. What is it? And then also providing an experience around finding something similar or matching styles.

And that was the very first project I worked on. So it was quite interesting, because I went from being completely naive about fashion to browsing random fashion websites to find trending data. So it was quite an adventure. On a more serious note, the area that I really focused on was on-device machine intelligence, which is what Ishan alluded to earlier, where we're trying to get all these models to run on actual Edge devices, whether it's cell phones, or Raspberry Pis, or something similar.

I continue to work in that area, actually, at Stanford and still collaborate with people from Google in that space. Interestingly, this all actually ties back to Pixie. I was working on all this Edge ML stuff, started working on machine learning, and teamed up with Ishan. And part of our thinking was, if we can get access to rich data sources, could we actually process all of this at scale and make it useful for developers? That's where the tie-in actually happened and continues to happen.

CRAIG BOX: When we spoke to Ramiro from Okteto in episode 125, he basically said that his founding team were looking to build some software, but then they ended up having to build tools to help them build that software. And then they decided to commercialize those tools instead. So my question to you both is, what systems hurt you so much that you went looking to build a better developer experience?

ZAIN ASGAR: I built a fair amount of production software during my career. And I'm actually one of those people who really, really likes to make sure that the stuff I deploy continues to work, as I'm sure most software engineers are. But one of the things I feel pretty helpless about is when things break in production. I'm like, please, help me figure out what's going on. Send me the logs. And you never have the information you need to debug it.

And sometimes a debug loop can be really long, right? You've got to go add more logging, send it back out, hope the problem occurs again while you're watching, and be able to capture that data and debug. Part of what we wanted to fix with Pixie is to just shorten that entire feedback cycle.

As an engineer building production software, if I see something wrong, can I actually get visibility into that code, figure out where a performance issue is occurring, and figure out what the function call arguments are, without having to go through this entire redeploy cycle, which could be a week or two at some companies, or even longer? That was the main thing that we were going after: how do we actually get things installed without having to worry about it, and once things are running, be able to get all sorts of data to debug actual production issues?

ISHAN MUKHERJEE: Obviously from an experience standpoint, while we were thinking about it, my previous role was one of the lead product people at Siri inside Apple. And when you're leading a large consumer application, you're pretty much on call 24/7 because there's some part of the world where there's a wrong utterance. So as Zain was mentioning, anybody leading applications at scale, it is a daily occurrence.

But beyond that, both of us, and the team that we have, are essentially product builders. So we were looking to build a product which is just fundamentally better and hopefully helps define the next decade in that domain. And we experienced the problem ourselves.

We felt like we could deliver something that was truly magical in this space and for an end persona that we relate to really well, which is developers. We have this academic background and startup background. Ultimately, we tried to build a truly amazing product experience. This seemed to be the most exciting opportunity for us to go do that.

CRAIG BOX: One of the things that jumped out at me when I was reading the announcement was that you use eBPF to collect the data, the metrics and traces and so on, for the application, Kubernetes, and the operating system. Could you have built something like Pixie before eBPF became mainstream? Or is it really tied into being able to get that without having to add any instrumentation yourself?

ZAIN ASGAR: I think the question really is, could we have built something from a technical perspective and also could we have built something from a distribution perspective? From a technical perspective, could we have built what we did without eBPF? I think the answer is yes.

We could have built a special kernel module. We could have done some binary instrumentation or something, things that people have done in the past and be able to build that out. The problem is deploying those things can usually be very cumbersome and difficult, right? Getting people to install kernel modules on production environments is usually a no go.

CRAIG BOX: Yeah, it's like curl pipe bash but even worse.

ZAIN ASGAR: Yeah, exactly. And I think binary instrumentation of applications is another challenge, because you start off with someone's application and you're like, OK, well, please let us instrument this before you deploy it. And it can be quite a chore to actually make that happen. The other approach, of course, is language-specific agents, where you actually go and add in some specific annotation or import some libraries.

This works well with dynamic languages, but it basically monkey patches your code. And a lot of existing agents do stuff like this. Go and take a look at New Relic or Datadog or whatever; they do this by importing language-specific agents. What eBPF really allowed us to do was some of the things that kernel modules or binary instrumentation could do, but automated behind the scenes, since it's a core part of Linux.

ISHAN MUKHERJEE: To add an anecdote to that, we spent a lot of time just talking to developers even before we decided, hey, both of us are going to do this full time. The question that we were initially looking to answer was, can we collect all this data without doing any work? And eBPF became a means to that end.

With eBPF, I remember having a conversation and then getting conviction: OK, we feel like we can truly optimize this multivariate equation, and this makes sense. So just to add: eBPF initially was a means to the end of getting all this data, and we wanted to deliver a connected experience to developers.

CRAIG BOX: Kubernetes effectively requires you to build distributed systems in that you need to run multiple copies of things. I find that it's very easy to conceptualize how you can monitor a single static thing running on a kernel, and you can clearly do it easier and better with something like eBPF. How do you then think about the distribution of those applications? And how do you treat things that are running in multiple different machines as if they are one instance of a thing?

ZAIN ASGAR: That's a good question. For us, Kubernetes actually provides a very standardized substrate where applications are deployed, and also a very, very rich metadata stream. Part of what we do with Pixie is that we actually monitor the Kubernetes metadata stream, so we know what pods and services are running. And we can use that to automatically gather a bunch of information about the services.

The other thing we do is we take this really, really low-level information coming out of eBPF, which is usually tied to the process level on the machine, and then tie it to these high-level concepts from Kubernetes. And without a system like Kubernetes doing all this bookkeeping behind the scenes, it would be very difficult, if not impossible, to do what we're doing with Pixie. In addition, Kubernetes also gives Pixie the ability to manage our deployments on-cluster, without having a really hard time managing a remote deploy somewhere.
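To make the idea of tying process-level records to Kubernetes concepts concrete, here is a minimal sketch in plain Python. It assumes a hypothetical index from container ID to pod and service, built by watching the Kubernetes metadata stream; the record types and names (`RawEvent`, `PodMeta`, `enrich`) are made up for illustration and are not Pixie's actual schema or code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PodMeta:
    """Hypothetical pod metadata, as learned from watching the Kubernetes API."""
    namespace: str
    pod: str
    service: str


@dataclass
class RawEvent:
    """Hypothetical low-level record: a process identity plus request data."""
    container_id: str
    pid: int
    http_path: str
    latency_ms: float


@dataclass
class EnrichedEvent:
    """The same request, now labeled with Kubernetes concepts."""
    namespace: str
    pod: str
    service: str
    http_path: str
    latency_ms: float


def enrich(events: List[RawEvent], index: Dict[str, PodMeta]) -> List[EnrichedEvent]:
    """Attach pod and service names to raw events via the container-ID index."""
    out: List[EnrichedEvent] = []
    for ev in events:
        meta = index.get(ev.container_id)
        if meta is None:
            # Owner not (yet) known from the metadata stream; skip the event.
            continue
        out.append(EnrichedEvent(meta.namespace, meta.pod, meta.service,
                                 ev.http_path, ev.latency_ms))
    return out


if __name__ == "__main__":
    index = {
        "c1": PodMeta("shop", "cart-7f9c", "cart"),
        "c2": PodMeta("shop", "web-5d2a", "web"),
    }
    raw = [
        RawEvent("c1", 101, "/cart", 12.5),
        RawEvent("c2", 202, "/", 3.0),
        RawEvent("c9", 303, "/x", 1.0),  # unknown container: dropped
    ]
    for e in enrich(raw, index):
        print(e.service, e.pod, e.http_path, e.latency_ms)
```

Dropping events whose owner is unknown is only one possible choice; a real system might instead buffer them until the metadata catches up.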

CRAIG BOX: Is Pixie inherently a Kubernetes tool, or can you use it to monitor other things?

ZAIN ASGAR: The way we think about this is that Pixie is Kubernetes native, so it was built for Kubernetes, right? Probably unlike most apps, it was only designed initially to run and work on Kubernetes. We do want to add support for gathering telemetry from other Linux machines, as long as the customer has at least one Kubernetes cluster where we can funnel the data. For Pixie's data system and machine learning, we can rely on having a Kubernetes cluster to host our stuff.

CRAIG BOX: So let's talk a little bit about the architecture. And for anyone who's sitting in front of a screen right now, there's a link in the show notes to an architecture diagram. But for those who are just listening, perhaps out walking the dog, can you paint a picture in our minds about how the different pieces of Pixie come together?

We've talked about eBPF and getting events out of kernels. We've talked about getting the event stream out of Kubernetes. Where do all these things go? And then how do people visualize and interact with them?

ZAIN ASGAR: From an architectural standpoint, there are basically three-ish layers to Pixie. I'll start at the bottom-most layer, because that's where our eBPF stuff is located. We deploy these things called PEMs, which stands for Pixie Edge Module, as a DaemonSet on Kubernetes. So it gets installed on every single node, right? So it's important to understand that it's one per node and not one per application or anything else.

And what the PEM does is actually two things. The first is that it's responsible for collecting all the data on the machine. We use eBPF and other Linux APIs to capture data. So this captures both application-level data, like HTTP requests, database calls, et cetera, along with all the system metrics. In addition to that, it also hosts the Edge part of our data processing system.

Part of the challenge we have with eBPF is that it allows you to capture orders of magnitude more information. And it's actually pretty difficult for us to ship this stuff upstream anywhere, right? Because we're monitoring every single network request, there's no way we're going to ship all of that off the box.
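As a rough illustration of why node-local processing helps, here is a minimal sketch that collapses per-request records into one small summary per service and compresses it, so only the summary would need to leave the machine. The record shape, the chosen statistics (count, server errors, median and max latency), and the use of JSON plus zlib are all illustrative assumptions, not Pixie's actual design.

```python
import json
import statistics
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical per-request record: (service, http_status, latency_ms)
Request = Tuple[str, int, float]


def summarize(requests: Iterable[Request]) -> Dict[str, dict]:
    """Collapse raw per-request events into one small summary per service."""
    latencies: Dict[str, List[float]] = defaultdict(list)
    errors: Dict[str, int] = defaultdict(int)
    for service, status, latency_ms in requests:
        latencies[service].append(latency_ms)
        if status >= 500:  # count server-side errors only
            errors[service] += 1
    summary: Dict[str, dict] = {}
    for service, values in latencies.items():
        summary[service] = {
            "count": len(values),
            "errors": errors[service],
            "p50_ms": statistics.median(values),
            "max_ms": max(values),
        }
    return summary


def encode(summary: Dict[str, dict]) -> bytes:
    """Serialize and compress a summary before it would leave the node."""
    return zlib.compress(json.dumps(summary, sort_keys=True).encode("utf-8"))


if __name__ == "__main__":
    # 10,000 synthetic requests for one service; every 50th one fails.
    raw = [("cart", 200 if i % 50 else 503, float(i % 40)) for i in range(10_000)]
    raw_size = len(json.dumps(raw).encode("utf-8"))
    packed = encode(summarize(raw))
    print(f"raw: {raw_size} bytes, summarized+compressed: {len(packed)} bytes")
```

Summaries lose per-request detail, which is why the raw data would still need to be kept somewhere local if it is to be queried later.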

CRAIG BOX: How quickly will you fill up your 40 megabyte hard drive?

ZAIN ASGAR: Yeah, exactly. That would be pretty much instant these days, never mind bandwidth. So what we actually do is a fair amount of processing locally, to figure out what we want to send and what's interesting to keep, and also to compress the data so that we can actually store it. The second part of our system is a single Vizier, which basically is our semi-centralized data store that sits completely within your Kubernetes cluster.

What this allows you to do is write scripts in a Python dialect. It's effectively pandas behind the scenes: our entire data system is based on the pandas API and Apache Arrow. You can basically write what we call PxL scripts and execute them across the entire cluster without having to worry about where the data is located.

So you can interact with services and pods, and we'll figure out where the data is located, because the data could be on the PEM, or it could be on Vizier, which is our semi-central data store.

So that's the second layer, and in that layer we manage all the metadata and the longer-term storage. Pixie is only designed to store data for the last 24-ish hours, because it's mostly meant for debugging. The last layer is basically the UI layer. We do have a connection to a cloud-based system so that we can do things like RBAC and security in a more centralized manner. And we also host our UI and APIs over there, so you can interact with your clusters.
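To illustrate what it means for a script not to care where the data lives, here is a minimal sketch in plain Python rather than PxL. It assumes a hypothetical setup in which each node-level agent and the central store each hold a list of timestamped rows; `DataSource`, `run_query`, and the 24-hour `RETENTION_SECONDS` cutoff are illustrative names, not Pixie's API, and real query planning is far more involved than this fan-out and merge.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
# Assumed retention window, echoing the "last 24-ish hours" mentioned above.
RETENTION_SECONDS = 24 * 60 * 60


@dataclass
class DataSource:
    """Hypothetical row holder standing in for one node agent or the central store."""
    name: str
    rows: List[Row] = field(default_factory=list)

    def scan(self, predicate: Callable[[Row], bool], now: float) -> List[Row]:
        # Only rows inside the retention window are visible to queries.
        return [r for r in self.rows
                if now - float(r["ts"]) <= RETENTION_SECONDS and predicate(r)]


def run_query(sources: List[DataSource], predicate: Callable[[Row], bool],
              now: Optional[float] = None) -> List[Row]:
    """Send the same filter to every source and merge the results by timestamp.

    The caller never says which source holds which rows.
    """
    now = time.time() if now is None else now
    merged: List[Row] = []
    for src in sources:
        merged.extend(src.scan(predicate, now))
    return sorted(merged, key=lambda r: float(r["ts"]))


if __name__ == "__main__":
    now = 100_000.0
    sources = [
        DataSource("node-a", [{"ts": 99_990.0, "service": "cart"},
                              {"ts": 1.0, "service": "cart"}]),  # too old
        DataSource("node-b", [{"ts": 99_995.0, "service": "web"}]),
        DataSource("central", [{"ts": 99_980.0, "service": "cart"}]),
    ]
    for row in run_query(sources, lambda r: r["service"] == "cart", now=now):
        print(row)
```

The caller supplies only a filter over services; which node or store returns each row is an internal detail, which is the property being described.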

CRAIG BOX: Now, your diagram does have a little label on it that says no customer data is stored outside your network.

ZAIN ASGAR: That's right. Because we only need to store stuff inside of Vizier, which lives entirely within your Kubernetes cluster.

CRAIG BOX: And so when I'm connecting to the things that you have hosted as your cloud service, is it connecting back to my cluster and retrieving that as I need to see it?

ZAIN ASGAR: That's right. We actually have two modes for this. In one mode, you connect directly to your clusters, and you never have to proxy the data back through us.

We have another mode, which is a little bit easier if you have complex firewall rules that you don't want to deal with setting up perfectly, where we do a reverse proxy to pull the data and then send it through to your API. That way, you can easily access the data, but we don't actually store anything; it just proxies through our system.

CRAIG BOX: Does Pixie handle the process of alerting me that something's wrong, or is it solely a tool I use when something else has alerted me that something's wrong, so I can go and find out why?

ZAIN ASGAR: We don't do any alerting right now. There are a lot of other great tools on the market that do alerting. One of the things that we are working on and shipping is OpenTelemetry egress.

So we can actually egress our data out as OpenTelemetry. Once you get the data out, you can send Pixie data to other tools, which can then do the alerting, and then you can pop back into Pixie for more details.

CRAIG BOX: It's interesting that you mention that, because New Relic also recently announced their commitment to OpenTelemetry as a standardized API. Do you think that OpenTelemetry has passed its tipping point and is now becoming the industry standard?

ISHAN MUKHERJEE: From an OpenTelemetry standpoint, Craig, there are two things: one, what's happening in the broader community, and two, what's specific to vendors like New Relic and Datadog and Splunk. From a community standpoint, that is definitely the intent. The intent is to standardize on OpenTelemetry, starting with metrics, and then I think traces and logs are next. There is a journey to it. It'll take a fair bit of time to get there.

From a customer standpoint, it spans everything from legacy, to early cloud, to very modern deployments. All of them agree that that's the direction we need to move toward, with everybody at their own adoption points. From a vendor's perspective, New Relic has this really interesting strategy where they want to standardize on OpenTelemetry and open source all the instrumentation technologies that they have worked on, and Pixie is part of that strategy. They obviously pioneered the APM space with their APM agents.

They've actually open sourced the last 8 to 10 years of R&D in agents. And the Pixie acquisition and contribution to the open source is part of that strategy. So the conversation that they want to have with developers is all software that's running in your environment will increasingly be open source.

Once you're harvesting that data, you could build your own open source stack, whether that's Prometheus, Alertmanager, or Grafana, or you could choose a high-end managed SaaS experience. And New Relic makes the claim to be one of the better ones. I think OpenTelemetry is increasingly becoming the standard. It's just going to take a while to get there, and vendors are now increasingly standardizing on OpenTelemetry.

CRAIG BOX: Now that you've open sourced Pixie, would you say it's open source as in I can audit it or open source as in I can run the entire thing myself if I choose to?

ZAIN ASGAR: It can be both: you can run the entire backend yourself if you so wish. Keep in mind that the Pixie backend, Vizier and the PEMs, is already hosted by you, right? The question is, where does our cloud hosting run? And we've also open sourced a cloud version of Pixie, so you can actually monitor and manage multiple deployments seamlessly.

CRAIG BOX: All the way through we've talked about the machine learning experience that you both bring. And you mentioned before, some of the on-device work that's done to try and figure out what's important and what to save. In which other ways do you bring that machine learning background into how Pixie works?

ZAIN ASGAR: It's kind of interesting. People always ask, you were working on edge machine learning and now you're working on big servers. How is this even semi-related?

CRAIG BOX: It's all just computers, innit?

ZAIN ASGAR: Yeah. The first answer is, of course, they're all computers, and they all work the same way and are basically optimized for the same stuff. But the interesting thing is that when you're a monitoring application, any resources you use are resources that are not available for the main application. Part of your goal is to make everything as efficient and low-utilization as possible, which ultimately makes it look not very different from a very low-powered machine at the edge.

Some of the things we do with machine learning involve faceting the metrics by different parameters in the message, so that we can store exemplars and make it easy to slice and dice the data. Something we haven't done yet, but have prototyped, is semantic compression. Once we learn the schemas, we can compress the data so that it uses less memory and storage space. Since we're looking at all of this data at pretty high throughput, we need some way to manage it at scale.

ISHAN MUKHERJEE: One thing I would add is somewhat, I guess, special knowledge in some regards. As machine learning people, when we build a product, we inherently make a lot of decisions that might be counterintuitive, because we have this data-first view of the world. So as we were building Pixie, the things Zain talked about, or even the data model, and architecting the platform to see as much data as possible and then deciding to drop or keep data later in the process, all resulted in how Pixie is today.

So what you see is that it has a very novel UI. It doesn't have the standard nested dashboards. Everything fundamentally comes from this data-first view. And those engineering and product design decisions happen inherently, because over the last decade we've been building these data-first muscles. So yeah, from a product builder standpoint, a lot of things from our past get baked in.

CRAIG BOX: We mentioned before that you started building Pixie in August 2018. You launched a public beta last year, and at the time of your launch, you announced a $9 million series A investment. Eight weeks later, you were acquired. What can you tell us about the sequence of events?

ISHAN MUKHERJEE: We actually raised the capital that you talked about in 2018 when Zain and I started the company together. So we raised the series A from Benchmark, Google Ventures, and a few other angel investors who became advisors. And then in the summer of 2018, we started to design and architect the platform.

From a timeline perspective, we spent a lot of time-- so we were at KubeCon 2018 talking to as many customers as we could. We did not start hardcore building until much later. I would say in mid 2019 we started to have our first design partner contracts. And then around spring 2020-- so this is a couple of months before COVID-- we got deployed in massive production-scale systems, just organically, even though we were in alpha stealth mode. The early version of the platform was very well received, and we started to get deployed.

And around the COVID time frame, we decided, OK, now we should plan for a public launch. And we set October as our timeline. Between March and October, we really focused on building out community-driven adoption. We started to build out this community called Pixienauts. These are early end users, mostly senior principal engineers on the platform side, the perf side, or the architecture side, who started to really try out the product.

And then fast forward to October, we did do the public launch, but the basis for it was about two years of R&D, about nine months of concerted customer deployments and experience, and about six months of building out the community. So it was our coming-out party, but we had been doing a lot of the hard work building up to it.

So around the October time frame, because of the real production use cases and the community adoption that we saw, we did get a bunch of interesting conversations about raising our next round of capital or potential acquisitions. That's the context in which Bill and Lew Cirne from New Relic came to us, and those conversations led to us joining forces.

CRAIG BOX: And you mentioned, obviously this was all happening during the middle of the COVID pandemic. Is there a story about how Pixie made it possible to scale these applications that people were having to grow all of a sudden?

ISHAN MUKHERJEE: Zain and I started talking to customers in 2017 and early 2018, and a lot of the relationships that we built with engineering teams ended up converting into customers. Now they're some of our biggest champions. A huge cohort of those, Craig, were folks in the content streaming industry, some of the biggest household brands that you know, the top 1, 2, 3 streaming companies. Their businesses were scaling like crazy during COVID.

Everybody was affected. We were as well from all of the stuff that was going down. But from a work perspective, they were just seeing transaction volumes peaking every single week, right? So really working with them, understanding their performance use cases, and getting Pixie deployed and helping them troubleshoot outages, but also helping them support through peaks was something that we honestly did not plan for.

We had aspirations to get deployed in these large internet-scale production clusters at some point in our company's trajectory. But that happening while we were in private beta, during COVID, was really surprising. It ended up becoming a real proof point for the technology that we were building.

CRAIG BOX: New Relic recently announced that they've upgraded to become a platinum member of the CNCF. Is that as a result of this acquisition, or does it just speak to other cloud native projects inside the broader company?

ISHAN MUKHERJEE: As we talked about before, New Relic was always a very developer- and engineer-focused company. It was founded by Lew, who is famous for being a coding CEO. He still is today. A big part of their observability strategy is now about merging metrics, traces, logs, and events under this single concept of observability. So they have a unified data system where you can ingest all this data, with a very cohesive user experience on top.

And on the ingest side, they standardized on the OpenTelemetry spec. With all of their investment in open source, taking a governing seat in the CNCF just makes the most sense. It helps New Relic engage with their customer base, and also with project teams across the CNCF, in a pretty high-touch manner, where they can help contribute to the best possible outcome. So it was part of their overall open source strategy.

Obviously, Pixie was a huge part of it. Zain represents New Relic and takes the board seat. From a Pixie standpoint, the level of investment is quite exciting. They're really, really amazing engineers, and the entire engineering team is mostly focused on open source development. So from the community standpoint, it's an exciting investment to build out a very mature project, not just a side project, right? You have some of the top engineers at New Relic focusing on open source development.

CRAIG BOX: And Zain, what are you looking forward to as a new member of the governing board?

ZAIN ASGAR: Part of what we're always interested in is contributing to the open source community and being a larger player within the CNCF ecosystem. I think some of the things we're interested in are driving adoption of open standards, like OpenTelemetry, and hopefully helping move Pixie and all of that forward.

CRAIG BOX: You'll probably get a better parking space at the next KubeCon.

ZAIN ASGAR: Hopefully.

CRAIG BOX: Is Pixie Labs going to remain a separate brand, or is it going to become New Relic Cloud Native Monitoring at some point?

ZAIN ASGAR: The current plan is to keep it a separate brand. The Pixie brand itself is going to be contributed to the CNCF, and we're in the process of doing that right now. As part of the contribution, the Pixie project and brand will become a CNCF project, and New Relic will continue to host it at pixie.ai.

CRAIG BOX: How else do you see the direction of the product evolving now under New Relic?

ZAIN ASGAR: I think one of the main things we're looking at is keeping a clean separation, right? We want Pixie open source to be something like Chromium, and the Pixie closed source version to be a little bit more like Chrome, where we have some potentially value-added things that you can use. Like, oh, you can use the extension with New Relic and store 100 gigs of data for free. But we do want it to be mostly the open source experience.

ISHAN MUKHERJEE: One thing I wanted to add, which is quite bold from a New Relic perspective, is that we're not using licensing or any other commercial means to restrict the usage of the core Pixie project by other entities, whether they're cloud folks or other folks in the industry. As you probably saw at the launch, AWS is investing heavily, and a bunch of other folks already use Pixie.

As Zain mentioned, the Pixie core project, which runs inside of a Kubernetes cluster, is quite decoupled from New Relic in the cloud, right? It's very much like an APM Ruby agent talking to New Relic through an API; similarly, Pixie talks to New Relic. As part of this, we do expect other entities to offer managed versions of Pixie as well.

CRAIG BOX: I should just point out, I did the maths. Your 100 gigabytes free is 2,500 times the size of Zain's initial hard drive.

ZAIN ASGAR: Nice.

CRAIG BOX: Do you hope that the open source core of Pixie will become like Prometheus and become a thing that you sort of assume is installed on every Kubernetes cluster?

ZAIN ASGAR: That's what we hope to get out there: that Pixie becomes the core debugging experience for every Kubernetes cluster.

CRAIG BOX: What else does the future hold?

ZAIN ASGAR: For us, we are really excited about just making tools for developers to debug production software. And we're going to be extending Pixie in various ways to do that. And we've been looking a lot into actually just getting code level visibility and live debugging with some of our new features. So if you want to check out things like Pixie's Continuous Profiler where you can get performance profiles, we're going to be moving a lot more in that direction of just enabling core developer use cases.

CRAIG BOX: And at your heart, Zain, you are still a university professor. I understand you are publishing on the TensorFlow blog coming up soon.

ZAIN ASGAR: I'm an adjunct professor, so I work on research at Stanford, and I have some students that I do research with. Actually, as part of the Pixie open source project, we are publishing a blog on how we use TensorFlow within Pixie. Look out for that in the next couple of weeks.

CRAIG BOX: All right, well we very much do look forward to reading that and learning more about Pixie as the process of joining the CNCF continues. Thank you very much, both, for joining us today.

ISHAN MUKHERJEE: Thanks, Craig.

ZAIN ASGAR: Thanks a lot for having us.

CRAIG BOX: You can find Zain on Twitter @zainasgar and you can find Ishan on Twitter @ishanmkh. You can find Pixie at pixielabs.ai.

[MUSIC PLAYING]

CRAIG BOX: Alex, thank you very much for helping me out with the show today.

ALEX ELLIS: More than welcome. It's been a pleasure.

CRAIG BOX: If you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @kubernetespod or reach us by email at kubernetespodcast@google.com.

ALEX ELLIS: You can also check out the website at kubernetespodcast.com, where you'll find transcripts and show notes as well as other links to subscribe. You can find me at alexellis.io or @alexellisuk on Twitter.

CRAIG BOX: I'll be back with another guest host next time, so until then, thanks for listening.

[MUSIC PLAYING]
