#137 February 9, 2021

Datadog and the Container Report, with Michael Gerstenhaber

Hosts: Craig Box, Saad Ali

Michael Gerstenhaber is a Director of Product Management at Datadog, and the curator of their annual Container Report. He joins Craig to discuss why they release it, some recent trends, and how it helps people validate their assumptions about technology.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box with my very special guest host, Saad Ali.


CRAIG BOX: Welcome back to the show, Saad.

SAAD ALI: Thank you for having me. I'm really excited.

CRAIG BOX: You were here on episode 103. We talked a little bit about storage. We didn't talk a lot about the CNCF technical oversight committee, of which you are a member. I understand you have some new colleagues who just joined you this week?

SAAD ALI: That's right. We got a bunch of new TOC members.

CRAIG BOX: The TOC is elected by various different groups. The governing board elected three candidates this time-- Alena Prokharchyk from Apple, Cornelia Davis from Weaveworks, and Lei Zhang from Alibaba. And the end-user community elected Dave Zolotusky from Spotify and Ricardo Rocha from CERN, who's my favorite because he's been on the show.

SAAD ALI: That's right. Really excited to work with them.

CRAIG BOX: What's changed? What's new in the CNCF of late?

SAAD ALI: Really, it's been focusing on how to make projects more successful and to figure out how to get projects into the CNCF. So over the last year, we've worked on revising the process for adding projects to the CNCF sandbox. And it used to be that you had to go and talk to a TOC member, get them to sponsor you. That whole process has been streamlined now.

And there is an application process. And the TOC regularly reviews applications. And really, we've actually lowered the bar for joining the CNCF sandbox.

And the goal has been simply to make it easy to collaborate amongst multiple companies on a given project. And it becomes a way to kind of increase the innovation in the ecosystem. And we kind of removed the requirement for projects to have to move to incubation or move to graduation, which frees us up to adopt more projects in the sandbox as we go forward. Now, moving forward into the next year, what we want to do is see if we can improve the process for incubation and graduation as well.

CRAIG BOX: One of the things that you don't get to do with these people is obviously meet them in person. What do you miss the most about not going to KubeCons and conferences, or even out of the house in general, perhaps?

SAAD ALI: No, it's really, really just seeing people in the hallway, bumping into them, talking to them, and seeing the face behind the GitHub IDs. It really just helps you connect with people on a different level, and I really miss that. I'm looking forward to being able to do that again.

CRAIG BOX: What's the best meal you ever had on a conference trip?

SAAD ALI: You know what? It would have to be this Malaysian roti I had in London. And I can't remember the place.

It was a little hole-in-the-wall place. And it was delicious. And I think I went back two or three times.

CRAIG BOX: I know a lot of people don't think England is famous for its haute cuisine, but there is a lot of good food to be had around here, in better times at least.

SAAD ALI: Definitely. It seems like they've adopted the food from all over the world.

CRAIG BOX: If there's one thing that England is very good at, it's adopting things from all over the world. And on occasion, they're asked to give them back again. [COUGH] Elgin Marbles. Well. Shall we get to the news?

SAAD ALI: Let's get to the news.


SAAD ALI: Congratulations to the Open Policy Agent, which is now a graduated project in the CNCF. The announcement was made in a blog post written by Tim Hinrichs and Torin Sandall, guests from episode 101 of the podcast. Listen to that episode to learn all about the project.

CRAIG BOX: Docker Distribution, the registry that powers Docker Hub, has been donated to the CNCF sandbox. Distribution is a rewrite in Go of the original Python Docker registry. It has been adopted as the base of many other registries, such as those from GitHub and GitLab. And Docker's CTO Justin Cormack says that maintainers from those forks have been approached to become maintainers of the new project, now just called Distribution.

SAAD ALI: In more registry news, Red Hat announced version 3.4 of their Quay registry. This is the commercially supported version of the Quay open source project. Quay is written in Python and has now migrated to Python 3; the release also offers a new version of their operator and an updated Clair security scanner. Quay was proposed as a CNCF project last year, but has not yet completed that process.

CRAIG BOX: The Unit 42 group at Palo Alto Networks has discovered a piece of Kubernetes cryptojacking malware, which they have termed Hildegard. Breaking in via a kubelet which was configured to allow anonymous access, the malware spreads inside a Kubernetes cluster and mines Monero, which I believe is the only cryptocurrency Elon Musk hasn't tweeted about this week. The attackers call themselves TeamTNT, which I sincerely hope is different to the people who created the fourth episode of the original "Doom" back in the mid-'90s. If you want to know how an attack like this might work, a blog post from Eduardo Baitello coincidentally explains how to get full-node access by attacking the kubelet.

SAAD ALI: Jetstack, the creators of cert-manager, has announced a public beta of a service based on it. Jetstack Secure provides visibility of machine identities using an open-source agent on your cluster and a web UI, which is either SaaS or running in your cluster with the Enterprise edition. You can try one node for free, and the paid version will include multi-cluster, enterprise CA, and service mesh support, with some features arriving throughout the year.

CRAIG BOX: The Traefik proxy has released version 2.4, adding support for the PROXY protocol on TCP services and dynamic support for mutual TLS. This release also adds support for using Traefik as a gateway with the new experimental Service APIs in Kubernetes. If you'd like to use Traefik as an ingress gateway for Istio, a blog post from Tetrate this week talks about how to do just that.

SAAD ALI: API gateways are all the rage this week, with Kong announcing general availability of Kong Konnect, with two Ks, their commercial service management platform. Konnect runs as a cloud-hosted control plane with agents on your cluster and operates a central service catalog. Kong also announced $100 million of series D funding, taking them to a valuation of $1.4 billion and making the gorilla a unicorn.

CRAIG BOX: Early access tickets are now available to KubeCon EU, to be held virtually May 4th through 7th. Buy yourself or someone you love a ticket before Valentine's Day and pay only $10 versus the $75 your ticket will cost afterwards. Seven co-located events have been announced, including days for edge, security, service mesh, WebAssembly, and Rust.

SAAD ALI: You can build your containers with a Dockerfile, or you can just write code and use a buildpack to do the containerizing for you. Genevieve L'Esperance from Doximity explores both these options in a blog post this week and talks about why they have moved their developers to Paketo buildpacks, citing developer productivity, security, and performance benefits.

CRAIG BOX: Finally, lighting up the message boards this week, a post by Luka Skugor on why he doesn't feel Helm fits the Kubernetes ecosystem. Any other week this may not have made the cut, but Skugor cites a talk from Saad on Kubernetes design principles and why he doesn't think Helm fits them. And seeing as I have Saad here with me…

SAAD ALI: Overall, I think it's a well-written, philosophical argument that makes a lot of good points. Fundamentally, Kubernetes introduces a new problem: YAML hell. There are a lot of projects trying to solve it-- shout out to kpt ("kept"), from Google, another project in this space. Discussions like this one are good, and hopefully will result in pushing the space to something better for users.

CRAIG BOX: And that's the news.


CRAIG BOX: Michael Gerstenhaber is a director of product management at Datadog. He works on metrics in the container ecosystem and compiles the annual Datadog container report. Welcome to the show, Michael.

MICHAEL GERSTENHABER: Hi, great to meet you. Thank you.

CRAIG BOX: You started out at Cisco working on network management. What's the difference between management and monitoring?

MICHAEL GERSTENHABER: Network management was bi-directional. We were monitoring the routers and switches, the switching fabric, the firewalls, and everything. But we were also pushing changes.

It was a central place that abstracted the network elements-- whether you were using a Cisco device or a Juniper device or whatever, you could push one change, and it would provision those changes. With monitoring, it really is more about receiving telemetry, aggregating, and acting on data programmatically. The configuration management tools-- our partners in the ecosystem, like Chef or Puppet or Ansible-- make those write changes directly, rather than us.

CRAIG BOX: It still freaks me out when I hear people talk about IOS. It makes me tweak something in my brain and say, oh yeah, that's right. That's not what I think it means these days.

MICHAEL GERSTENHABER: [LAUGHING] Yeah, that's right. It's funny-- when I joined Cisco Systems right at the beginning of my career, it was 2006, and this was my first job. And iOS, the Apple version, didn't even exist yet. So when people started talking about it, it was very much Internetworking Operating System for me, the Cisco operating system for routers and switches.

CRAIG BOX: Now, you were an engineer there for some time. And then you left to become a product manager at a startup.

MICHAEL GERSTENHABER: Yeah. I left engineering. I was 29. I thought I was going to be an engineer my entire life. But a very good friend of mine was starting a company in the video game industry. He was fundraising. He was selling. And he just needed somebody to talk to the engineers, to talk to the customers, and make product decisions.

That's how I accidentally fell into product management. As we wound down the startup and I joined Datadog, I moved back from video games into where I was comfortable, in monitoring, what I had learned at Cisco Systems, but this time on the product side rather than on the engineering side.

CRAIG BOX: A lot of people like to look back on their startup and say, hey, we had a great idea, but the time was wrong, or the market fit wasn't there. Do you have any opinion on that particular technology or that particular time?

MICHAEL GERSTENHABER: I quite enjoyed the experience. I think there was some small degree to which we were a little bit early in video game streaming. We were also doing video game streaming in a very specific way.

The traditional tools that were advertising video game streaming-- the PlayStation came out with a version also when they bought Gaikai. They would render everything on a data center either in the cloud or on their own data centers and stream the video and capture input, right? But there's latency in both directions there.

And if you want to click the button and throw the punch or kill the zombie or whatever it is, right, that latency has to be certainly sub 100 milliseconds. But preferably, I think people think about, like, 50 millisecond latencies at this point, which is very hard to achieve when you're encoding and decoding video. So the direction that we took, as a differentiator here, for our path to market was to watch what the system was requesting from the hard drive into RAM, into volatile memory.

Because when you load into a level, that's literally what is happening, right? You load a bunch of textures and music into memory, and you play out of fast, volatile memory. And the idea here is that we were able to move the hard disk-- the 60 gigabyte download, the 128 gigabyte download-- out to the cloud. And you could still use your local hardware for processing. And you would still get local hardware speeds.

CRAIG BOX: You're kind of like a level 5, level 6 cache for the CPU?

MICHAEL GERSTENHABER: That's a good analogy. This also allowed me to go to publishers and just wrap their game without seeing the source code, because we were just watching data ranges being fetched and creating this giant statistical map based on what other people had played. So even in open-world games where there weren't necessarily levels that were discretized that way-- "Lord of the Rings Online" or whatever-- we could see this tree and then that tree, know that you were moving at 5 millimeters per second in that direction, and download the next tree.

CRAIG BOX: So you've now landed at Datadog. What was the first project you took on when you joined?

MICHAEL GERSTENHABER: That landing was very interesting. I was winding down the startup that I was working on, that I had poured quite a bit of myself into. I was thinking about what I wanted to do next. I used a number of monitoring tools at Happy Cloud. And that's how I came to know about Datadog and get excited about the company.

So when I joined, I was working on what's now called the Live Processes product, which allows you to aggregate or filter the proc file system across all of your hosts. If you have 40,000 hosts, you don't have to look at each proc file system one by one. You can say, show me all of the Postgres processes wherever they run, right?

It's interesting that we're in this particular conversation because, when I was having those interviews, in the very first months of my work at Datadog, I came upon a customer, who I'm sure knows who he is even to this day, who said, I would never use this. I would never use this product. And I asked why, and he said, because I use the containers.

And again, remember, forgive me here, because Kubernetes was less than a year old at the time. I said, what's a container? And he just spent the next three hours explaining everything in the world to me. It was a wonderful experience.

And I came back to the office and said, that's it. We have to put a pin in Live Processes. An inventory of containers is just a pivot table of a list of processes, aggregated by cgroup.

So we were able to use the same UI and ship a container-focused product. And in the meantime, I was able to come up to speed on what container users needed. Normally, that's not an inventory; this was just one feature. But then I started working on the metrics product and the container product more broadly after that.
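The "pivot table" idea Michael describes can be sketched in a few lines of Python. This is a hypothetical illustration, not Datadog's actual implementation: the field names and cgroup paths are invented, but the shape of the operation-- grouping a flat process list by the cgroup each process runs in-- is the point.

```python
from collections import defaultdict

def container_inventory(processes):
    """Group a flat process list into per-container summaries, keyed by cgroup."""
    inventory = defaultdict(lambda: {"process_count": 0, "commands": set()})
    for proc in processes:
        entry = inventory[proc["cgroup"]]
        entry["process_count"] += 1
        entry["commands"].add(proc["comm"])
    return dict(inventory)

# A tiny, made-up sample of the kind of data /proc exposes per process.
procs = [
    {"pid": 101, "comm": "postgres", "cgroup": "/kubepods/pod-a/c1"},
    {"pid": 102, "comm": "postgres", "cgroup": "/kubepods/pod-a/c1"},
    {"pid": 201, "comm": "nginx",    "cgroup": "/kubepods/pod-b/c2"},
]
inv = container_inventory(procs)
```

Two containers fall out of three processes here: the same aggregation over the same raw data, just pivoted on a different key.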

CRAIG BOX: It's obviously very convenient if you have something which just deals with processes-- and really, that's all containers are-- it makes it very quick to go to market.

MICHAEL GERSTENHABER: Exactly. We were able to change the direction of that product and ship something within a month or two. It was very, very, very fast. And we still did launch the Live Processes product afterward. It was just clear that there was more urgency in the container side of the world.

CRAIG BOX: When you're dealing with network hardware, like back at Cisco, if you have one device go offline, then that's a problem, and someone should be sent out to look at it. But when you start dealing with aggregation of processes or containers that make up a system, now you're dealing with failure as being expected. You're running on unreliable cloud hardware. You're architecting for software failure. And you have to start treating things in the aggregate.

MICHAEL GERSTENHABER: It's not even that you have to expect or architect for failure. You architect for scaling, right? You're intentionally scaling these things up and down in order to run efficiently, right?

I say this slightly arrogantly, but you can always throw money at an uptime problem. The idea behind engineering, though, is often to build an efficient system, not just a robust system. People scale down without failures, right? People are scaling down intentionally. It's an evolution in the same direction, but the scale is much different.

So when I came to Datadog-- and again, this was four years ago-- we were talking about people moving to the cloud, making digital transformations, in the way that we're talking about moving to containers these days. So they were just getting used to ephemeral architecture at all. Datadog introduced tagging and aggregation, which are concepts that I think are more taken for granted these days. But as we move into containers, the cardinality-- the set of signals that underlies the aggregation-- has exploded by 100, 1,000 times.

CRAIG BOX: How do you think the industry had to change? Previously, any event where you would remove a server would have a matching event in the monitoring system, where you would say, stop monitoring that. But now you have to observe the state of things and be able to handle those scale-down events, because they will happen.

MICHAEL GERSTENHABER: When you were rack and stacking your own hardware, you might have been monitoring those hosts. Now you're monitoring arbitrary objects. The obvious one is the service.

And that's where traditional APM comes in. But you want to know that your Kafka topics are healthy. Even if your Kafka hosts come and go, you want to make sure that your queue offsets are OK, that your Nginx connections per second is healthy, right?

So for any system, not just for your services in application performance monitoring, there are key aggregates-- your Kubernetes namespaces, or Deployments, or whatever-- that you want to monitor, which are much more persistent than the underlying signals that inform them, right? Pods come and go.

ReplicaSets even come and go as you rev a version on a Deployment. But you don't really tear down the entire Deployment very frequently.

CRAIG BOX: SRE talks about golden signals, which are things like the latency of a service, in that, you don't even look at the CPU usage necessarily. You look at outside things that relate to how the service is perceived by the people who are using it.

MICHAEL GERSTENHABER: Precisely. And we think of these things. The golden signals is a great model.

We also think about work metrics and resource metrics. Work metrics are those metrics that impact a customer. Again, with services they are the golden signals-- the number of requests, the latency of the service.

But every work signal has a resource signal. That resource signal is a work signal for something else. Your database might cause latency. There is probably somebody who wakes up in the middle of the night who needs to understand why database queries are slow, even if that only might impact the end user if it's causing the service above it to be slow. There are locks on that database and CPU that informs why those locks might be happening. It's turtles all the way down, sort of.

CRAIG BOX: And you mentioned before that you can always throw money at the scaling problem. You can have good results on your signals that are caused by a very inefficient system. How should I think about monitoring the ratio, perhaps, between how good my external-facing signals are and how much I'm paying to maintain them?

MICHAEL GERSTENHABER: Right. And I don't want to overstate that, right? There are hard engineering problems that are not just solved by scaling up too far, right? So I don't mean to overstate that too much.

But you do want to monitor your system so that, as you scale down, you maintain a high quality of service for your customers. And that exhibits itself in work metrics and the golden signals. You have some SLI, which is latency.

And you have an objective for that SLI. I want to keep latency under 200 milliseconds per query, end to end, no matter how many parts of my system it touches. And I want to scale down as long as that latency stays below my SLO. If it goes above my SLO, I need to scale back up because I'd rather my customers have a good experience than save a buck.
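The scale-down rule Michael describes can be written out as a small decision function. This is only a sketch of the logic, with hypothetical names, replica counts, and the 200 millisecond SLO from his example-- real autoscalers add hysteresis, cooldowns, and smoothing on top of it.

```python
SLO_LATENCY_MS = 200  # objective: keep end-to-end query latency under 200 ms

def scaling_decision(observed_p99_ms, replicas, min_replicas=2):
    """Scale down only while the latency SLI stays within the SLO;
    scale back up as soon as the objective is breached."""
    if observed_p99_ms > SLO_LATENCY_MS:
        return replicas + 1   # customers having a good experience beats saving a buck
    if replicas > min_replicas:
        return replicas - 1   # under the SLO: keep trimming for efficiency
    return replicas
```

The asymmetry is the design choice: the system keeps shrinking for efficiency until the work metric, not a resource metric, says to stop.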

CRAIG BOX: Since 2015, Datadog has been publishing a survey on container usage. Back in 2015, it was first of all called the Docker Report. It became the Container Orchestration Report and then the Container Report. That, in and of itself, I think is an interesting view of the history of how we think about the space.

MICHAEL GERSTENHABER: Yeah, absolutely. And especially with the recent changes between Docker and containers in Kubernetes land, I think it's starting to hit the engineer at the keyboard. But this entire ecosystem-- Kubernetes is, what, 6 and 1/2 years old now?

CRAIG BOX: I remember it had a fifth birthday, and then time kind of went crazy after that. And I don't know where we are.

MICHAEL GERSTENHABER: [LAUGHING] That's where I am, yeah. That's exactly how I was trying to remember. It was big in Barcelona at the fifth birthday. Anyway, it is a young technology, right? And these things are changing.

First, you had safe multi-tenancy with Docker. And then we realized that safe multi-tenancy was nice, but to get real value out of it, layering on an API-centric entry point and orchestration let you take advantage of that multi-tenancy. Now, instead of having to find a bunch of hosts to run my container on, I could use Kubernetes to just say, run five of them. And everything else would work out just great.

Now, we realize that the API layer doesn't have to exist in both places, and containerd or CRI-O might be lighter weight than Docker. And it'll continue to evolve. We're not going to stop here or anything.

CRAIG BOX: Let's turn our time machine back to those happier days and look at some of the figures from that first survey. So Docker adoption is up 5x between 2014 and 2015, from 1.8% of Datadog customers to 8.3%. I imagine the number is quite a lot higher today.

MICHAEL GERSTENHABER: Yes, although I think it's funny because you see this, like, very linear growth, right? That fact is one that we don't really include in the report as frequently anymore because it just looks the same. Sure, it's more than 50% of our customers, but it's just been growing that entire time. And it's less surprising these days.

CRAIG BOX: Yes, it says here, Docker's gone from 0% to 6% of hosts in one year-- to the moon!

MICHAEL GERSTENHABER: Exactly. But yeah, I mean, four years later it's more than half, right? And that's enterprise customers, and that's startups, and that's everybody.

CRAIG BOX: Docker hosts often run four containers at a time.

MICHAEL GERSTENHABER: So that's also one that got updated frequently in the early years, right? First, the median was four, then it was five, six, eight.

But then orchestration came along. And there's one fact, I think last year or the year before, where you see there's a huge divergence, right? Orchestrated containers suddenly running several factors above in density, which I think illustrates the reason people are moving to orchestration, right? It's so that natural scheduling and bin packing happens on their behalf. And you can run at a higher density and get more efficient use of resources.

CRAIG BOX: How is this data collected? You're obviously looking at the workloads of your customers. First of all, do you have to ask their permission in order to get this kind of data?

And were you looking, in the early days especially, only at a small subset of the people? Would you say this was representative of everyone using Docker at the time? Was the overlap between the Datadog user of the time and the Docker user of the time about right?

MICHAEL GERSTENHABER: So in order to collect the data, we anonymize data from some tens of thousands of customers across lots of verticals. We have customers who use Datadog for anything from web apps to plain websites, right? They're not in any one industry. So it's representative from that perspective.

I think we've made a lot of early investments in Docker and especially Kubernetes. So back then, it's possible that we were biased towards early adopters of the technology. It was 1%.

There was noise-- 1% or 5% of our customers has its own volatility. But in terms of what that represented about the industry, it might have been overstating it. These days, 50% of our customers are using it. And-- you've been going to KubeCons-- you've seen it grow from 500 people to 12,000 to whatever it was this year virtually, which was some enormous number.

CRAIG BOX: I do remember it was a very long walk between booths at Barcelona.

MICHAEL GERSTENHABER: Yes, yes it was. We all see the industry going there. I think it's representative in part because of my own experience where our customers are "Fortune" 500, "Fortune" 100 enterprises.

And there's the long tail of tiny startups, and there are unicorn startups. There's no obvious segmentation to me to say that this is not representative. And I think it's a good cross-section of the engineering population.

CRAIG BOX: So when you have a big data set like that, how do you decide which questions to ask of it?

MICHAEL GERSTENHABER: A lot of it is product driven. The original intention here was to help us make intelligent investments, right? Are people using containers? Is this the right place to invest?

What does this mean for metric cardinality? Do people need container ID tagging, or is it preferable to aggregate out that value and offer the Kubernetes Deployment or the container image as the atomic unit? Now, obviously, we do offer both. But that was an open question in my mind, as a product manager.

We thought it would be interesting for the community at large to have these same answers. If they were affecting my roadmap, then probably engineers at the keyboard were interested in my findings as well because everybody was trying to figure it out. You'd see a LinkedIn advertisement for a job or something, and it says ten years of Kubernetes experience. You sort of laugh, but everybody is figuring out best practices, right?

Everybody is figuring out, should I be doing bin packing? Does bin packing expand my failure domain in an unacceptable way? There are lots of considerations here.

And I think best practices are still being figured out. And if I'm looking for my own benefit, hopefully everybody else benefits as well. So you see a change in the kind of questions we're asking.

And these days, it's: are people doing that bin packing? Are people setting requests and limits on their containers efficiently? Or are they afraid of the OOM killer and throttling?

The questions change over time. We still do look at the old questions. But I think we've sort of settled on well-known answers for them, and we're just looking for drift from what we expect. And the reports change to that same extent.
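One of those questions-- are people setting requests efficiently?-- comes down to comparing what each container actually uses against what it requested. Here is a minimal, hypothetical version of that check, with invented container names and numbers; it uses the under-30%-of-request threshold that comes up later in the report as the cutoff.

```python
def overprovisioned(containers, threshold=0.30):
    """Flag containers whose CPU usage is below `threshold` of their request."""
    flagged = []
    for c in containers:
        if c["cpu_used"] / c["cpu_requested"] < threshold:
            flagged.append(c["name"])
    return flagged

# Made-up sample: CPU in cores, requested vs. actually used.
sample = [
    {"name": "web",   "cpu_requested": 1.0, "cpu_used": 0.2},  # 20% of request
    {"name": "kafka", "cpu_requested": 2.0, "cpu_used": 1.5},  # 75% of request
]
```

A flagged container is a bin-packing opportunity: lowering its request lets the scheduler fit more workloads per node.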

CRAIG BOX: Are there any concrete examples you can share of things that you perhaps weren't planning to do or product changes that you made as a result of doing this research?

MICHAEL GERSTENHABER: First of all, there's the investment in containers itself, right?

CRAIG BOX: Mm-hmm.

MICHAEL GERSTENHABER: We have quite a large team working on container integrations. Containers are a technology, right? When we make changes, yes, we have experts. We have people focused maniacally on the container experience.

But when you're writing an APM product, it's not like you're not writing a containers product. You are also writing a containers product there. So spreading that knowledge, having a central team, but also moving that knowledge into the rest of the company was a real investment.

We also made certain choices. We are running entirely on Kubernetes these days. Rob Boll and Laurent Bernaille have a great talk at KubeCon Barcelona about how we run our infrastructure and what we have found in that infrastructure.

But part of the reason to move there is for our own benefit in infrastructure land. For all the reasons that containers and Kubernetes ideas are great, but also so that we know where our customers are coming from, right? The great thing about being a monitoring company, a multi-tenant SaaS provider is that our customers are also, in many cases, large, multi-tenant systems, right?

CRAIG BOX: Mm-hmm.

MICHAEL GERSTENHABER: We use Datadog to monitor Datadog. And it's important that we have that empathy for where our customers are coming from. And even our customers that are running largely on-prem with traditional networks are often putting their feet in both ponds and moving from one to the other, either as a permanent step or as a migration. So through these reports, we also recognize the importance of running like our customers were running. And we did move our own infrastructure there as well.

CRAIG BOX: Let's have a look now at the most recent report, which was released in November. We start by saying Kubernetes runs in half of container environments. And nearly 90% of containers are orchestrated. So what's the other half?

MICHAEL GERSTENHABER: The other half is ECS, right? ECS and Kubernetes have completely dominated the orchestration market. ECS, obviously, is a first-party service from Amazon, so it's not an option for people who aren't running in Amazon's data centers.

Even in AWS data centers, Kubernetes is extremely popular. Their EKS product has seen enormous adoption since it launched. I think all of these tools are very good. The goal here for engineers is to be able to, again, have this one entry point for their apps, and to not worry about where the apps are running, now that there isn't a resource-level noisy neighbor problem for CPU and memory.

CRAIG BOX: Later on in the report, we go on to talk about the fact that the managed services from the clouds tend to dominate on their own platform. If you're running on GCP, for example, it says here you are roughly 90% likely to be running GKE. But if you're running on Amazon, there may be half a chance you'll be running EKS and half a chance you'll be running it yourself. Do you think that is a function of the fact that EKS was later to market? Or do you think there's a different reason for that?

MICHAEL GERSTENHABER: I think that's a function of the fact that GCP had a very strong offering for a managed Kubernetes service very, very early on. It's also a function of the fact that AWS has been running EC2 for so many years, and people adopted Kubernetes into their EC2 clusters.

In the report, these numbers are normalized against 100%. But EKS has had an enormously fast uptake. It's 50% of usage, but that's really because people already had established Kubernetes clusters on their AWS cloud.

Some are moving to EKS. Some new adoption is happening in EKS. But EKS has proved to be a very popular product.

And based on the strength of that observation and on GKE, I do personally think that the control plane for Kubernetes doesn't really have to be something that customers manage. We can think about it as the hypervisor. And as we move forward, that's more and more going to be a common pattern where people don't necessarily need to tweak everything in their control plane.

They just want a managed control plane, and they want a data plane that they can provision. Even that, actually, I might put a caveat on. But for now, anyway, they just want a data plane to run their apps, and to not really worry about becoming experts in managing Kubernetes.

CRAIG BOX: Something people may like to tweak is the resources that are allocated to their pods and their containers. Your survey says here that 49% of containers use less than 30% of the CPU that's requested, and 45% of containers use less than 30% of the requested memory.

This is comparable to what we saw in Borg studies years ago, which eventually led to the development of things like Autopilot and influenced auto-scaling in Kubernetes. Is this a user configuration problem? Is this something that people should be thinking about? Or is the technology letting them down?
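The over-provisioning statistic Craig cites can be measured with a simple ratio of observed usage to requested resources. Here's a minimal sketch of that calculation; the function name and sample data are invented for illustration and are not Datadog's actual methodology.

```python
# Hypothetical sketch: given (requested, used) CPU samples per container,
# compute what fraction of containers use less than 30% of their request.
# The sample numbers below are made up for illustration.

def fraction_underutilized(containers, threshold=0.30):
    """containers: list of (requested_cores, used_cores) tuples."""
    under = sum(1 for req, used in containers
                if req > 0 and used / req < threshold)
    return under / len(containers)

sample = [
    (1.0, 0.10),  # requested 1 core, uses 0.1 -> 10% utilization
    (2.0, 0.50),  # 25% utilization
    (0.5, 0.45),  # 90% utilization
    (4.0, 3.00),  # 75% utilization
]

print(fraction_underutilized(sample))  # 2 of 4 containers below 30% -> 0.5
```

In a real system, the "used" figures would come from a metrics pipeline sampling container CPU over time, not a static list.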

MICHAEL GERSTENHABER: Every technology problem is a product problem. I think there is a degree to which two migrations happen. One is a lift and shift into containers and Kubernetes, and the next is optimizing on Kubernetes.

That's sort of natural, based on my anecdotal interviews. People do think in the back of their heads about optimization. But in terms of putting that into practice, they are more worried about making sure that-- this is a wildly difficult migration.

People have stateful services and stateless services. And it's much easier to move stateless services. But once you've done that, you have a bunch of pods that represent web services jumping around and connecting to Kafka queues that are outside of the cluster, or something. I think people are still getting their heads around what this means.

And definitely the more sophisticated customers that I talk to are already doing bin packing. We at Datadog actually do quite a bit of bin packing and optimization here ourselves. But that's a secondary milestone for a lot of people. And what you see here is the immaturity of the ecosystem.

Now, I think you can certainly make the argument that tooling accelerates those milestones, right? Good tooling makes it so that you don't have to schedule these things in series. And one of the reasons I put this fact in the report is that we at Datadog are looking into how we can help our customers with this without them having to know what to query or what it means to bin pack. We can just push suggestions-- something we're heads down on now.
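The bin packing Michael mentions is, at its core, the classic bin-packing problem: fitting pod resource requests onto as few nodes as possible. The sketch below uses the textbook first-fit-decreasing heuristic as an illustration; the function, node sizes, and requests are hypothetical, and this is not how the Kubernetes scheduler actually works.

```python
# Illustrative first-fit-decreasing bin packing: place pod CPU requests
# (in cores) onto nodes of fixed capacity, opening a new node only when
# a pod fits on none of the existing ones.

def first_fit_decreasing(requests, node_capacity):
    """Return the number of nodes needed to pack the given requests."""
    nodes = []  # remaining free capacity on each node
    for req in sorted(requests, reverse=True):  # largest pods first
        for i, free in enumerate(nodes):
            if req <= free:
                nodes[i] -= req  # fits on an existing node
                break
        else:
            nodes.append(node_capacity - req)  # open a new node
    return len(nodes)

# Six pods totalling 10 cores pack onto three 4-core nodes
# instead of one node per pod.
print(first_fit_decreasing([3, 2, 2, 1, 1, 1], node_capacity=4))  # 3
```

Real cluster optimization is multidimensional (CPU, memory, network, affinity constraints), which is part of why Michael describes it as a secondary milestone that good tooling can accelerate.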

CRAIG BOX: I remember seeing a graph of the amount of memory, perhaps, that was allocated to containers. And it's got little spikes at 10, and 100, and 1,000. Because humans say, hmm, I don't know how much memory my container needs. Let's give it a round number.

MICHAEL GERSTENHABER: Yeah, that's exactly right. I think it's fun. I also see that people are much more afraid of memory.

CPU, on the one hand, is more volatile. But memory fails more catastrophically, right? When you breach a memory limit, the OOM killer comes for you. Whereas with CPU, you'll get throttled. And if the number of cycles you need comes back down, you just have more time to either scale up or handle the load.

Not that you want to hit your limits all the time-- I'm just saying it's slightly less catastrophic. So it's interesting to see these patterns, both the round numbers and the fear of one over the other.
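The asymmetry Michael describes--an OOM kill is fatal while CPU throttling is survivable--suggests leaving more headroom on memory limits than on CPU limits. Here's a toy sketch of such a sizing heuristic; the function and the multipliers are invented for illustration, not a recommendation from the report.

```python
# Toy heuristic: size limits above observed peak usage, with more
# headroom on memory (breaching it gets the container OOM-killed)
# than on CPU (breaching it only throttles). Multipliers are made up.

def suggest_limits(peak_cpu_cores, peak_memory_mib):
    return {
        "cpu": round(peak_cpu_cores * 1.2, 2),  # throttling is recoverable
        "memory": round(peak_memory_mib * 1.5),  # an OOM kill is not
    }

print(suggest_limits(peak_cpu_cores=0.5, peak_memory_mib=400))
# {'cpu': 0.6, 'memory': 600}
```

In practice the peaks would come from monitoring data over a representative window, and the headroom factors would be tuned per workload.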

There's also a hidden story here, right? There is a network, right? Computers exist in the cloud. And Kubernetes doesn't necessarily isolate for network usage. So as people are moving their stateful services into the cloud, as web servers and proxies all live in the same place, if two pods that are highly network-constrained happen to be scheduled to the same host, you can quite easily get noisy-neighbor problems. So even in these requests and limits that we were able to study, we know anecdotally that there are blind spots. And those might appear in future reports.

CRAIG BOX: Another fact from this report says that the most popular Kubernetes version is 17 months old. So 1.19 was the latest release at the time of this publication. And the version with the largest number of users was 1.15 at the time.

Now, that version is more than three versions old, so it's outside the generally accepted support window. Do you think that's a piece of data that maybe the Kubernetes community should use when considering how long to support versions?

MICHAEL GERSTENHABER: Hopefully, it's helpful. I don't want to be prescriptive, certainly. There are people who know a lot more than me about this topic. And I know that they're making mindful decisions.

I do think it shows the maturity of the ecosystem. And it also shows the velocity of the ecosystem, right? Kubernetes cuts releases very frequently. And that's great for innovation.

This is still a platform-level technology that touches every part of a customer's ecosystem and is going to represent some inertia. And I think that makes sense. As we get new versions, we'll see the average customer falling slightly further behind until it stabilizes.

Look how many people are still on a Linux kernel that can't run BPF products, right? There is some lag there. And that's because changing the underlying infrastructure, not just revving a single application, has a larger potential impact on the company. And people just have to plan for these things.

CRAIG BOX: Now, the last fact in the report talks about the most widely used images in containers. It says that we have Nginx, which makes sense as a stateless serving product. And then at numbers two and three, Redis and Postgres, where we start getting into data storage in containers. I guess I'm a little surprised by that. Do you see that people are starting to adopt stateful storage in Kubernetes and treat it like it's nothing now?

MICHAEL GERSTENHABER: "Treat it like it's nothing" is doing a lot of heavy lifting there. I think this is one of the things that people do build expertise on.

CRAIG BOX: There's no fear, perhaps. Is that a better way to think about it?

MICHAEL GERSTENHABER: There is a fear, is what I'm saying.


MICHAEL GERSTENHABER: The benefits outweigh the risks. There's a risk either way, right? There's a fear when I move all of my stateless services into containers: all of a sudden, I have to route through the ingress into the cluster. I have two systems. I have traditional infrastructure that I have to train my engineers on, and we provision that with Chef or Ansible.

And I provision my applications through Kubernetes deployments over here. There are a lot of organizational and structural reasons to move those stateful services over also. Also, I think a lot of workloads simply are stateful services, right? So when you look at Redis, Postgres, Elasticsearch, MySQL-- it's much more than the top two-- RabbitMQ and Mongo. Almost everything here is stateful.

The reality is that stateful services underpin all data storage and communication. Kafka and RabbitMQ and the message queues, they connect them. But nonetheless, they're stateful, if not quite as durable. You can't really reason about most applications without considering their stateful services. I think what you're seeing there isn't necessarily people not being afraid, or a lack of complexity, so much as the importance of doing it and how many stateful workloads there are out there.

CRAIG BOX: This is obviously looking at things that are running on Docker. Do you think that this is roughly representative of open-source software in general, perhaps running outside of Docker as well?

MICHAEL GERSTENHABER: Yeah, I suspect so. I don't really have a number on that. I do know that if you look at these container images, you wouldn't be surprised by any of them running in a traditional deployment, except for Calico, for instance, which is more specific to this technology. So that's really just to say that, as people move to containers, they are moving everything and not just their stateless services. But I also want to caution that I have not personally done this study outside of the container world, and what I'm normalizing against here is my experience in talking to people.

CRAIG BOX: Now, you've talked a bit about how you use this information internally to work on Datadog products. Do you have any idea of how other people outside Datadog have used the information in these reports?

MICHAEL GERSTENHABER: A lot of people are looking for confidence, at a first order, right? Even if you don't take any action, you want to know that you have made a good decision. People are working in isolation.

There's some collaboration in the industry, certainly. But when we say, everybody is using this, it makes people more comfortable to say, yeah, yeah, I did the right thing there. Everything's OK. And I actually do hear that quite a bit. Are people using Calico? We chose to use Calico.

I chose to use Terraform. Are other people? And for me, sitting where I do and talking to so many people-- like, yes, of course, everybody uses something like Terraform. And I'm surprised when people ask me that question. But nonetheless, it's something that people ask. So these kinds of facts are helpful for that.

And going back not just to the container images fact-- there's a recency bias in my picking that one-- but also to the move to Fargate, or the provisioning of limits on individual containers, there is a certain amount of people recognizing that, yes, they are making the right decisions.

Then the other thing is making new decisions. Should we invest in Kubernetes or ECS? Are people trusting Fargate? Is it OK for me to move to a managed control plane and just work with my container image, or is that going to bite me in some way? As we see a lot of these stats, people do take actions based on what they see in this report. Again, the purpose of the report isn't necessarily to prescribe actions, just to give people data for their own benefit.

CRAIG BOX: Finally, you're in New York City. How are the streets?

MICHAEL GERSTENHABER: Oh, it's not bad. Yeah, we just-- we had some snow the last couple of days, but--

CRAIG BOX: Some snow. That sounds like an understatement.

MICHAEL GERSTENHABER: Yeah, with everybody being at home, it's been a very quiet winter.

CRAIG BOX: Am I not right in thinking that this is now in the top 20 snowstorms of all time?

MICHAEL GERSTENHABER: Oh, I don't know about that. I remember our April Fool's storms and all of that. So maybe it's time of year, but I think, in February, we all expect to get snow. And since I'm locked inside in my tiny studio apartment anyway, right, it's not impacting me the same way it did before.

CRAIG BOX: Do you at least have a pretty view out the window?

MICHAEL GERSTENHABER: I do. It's beautiful out here, yeah. I love being in New York.

CRAIG BOX: Lovely. Well, thank you very much for joining us today, Michael.

MICHAEL GERSTENHABER: Thank you very much for having me. This was a pleasure, Craig.

CRAIG BOX: You can find Michael on Twitter at @mikezvi, although, be warned, he's only tweeted seven times, and you can find Datadog at datadoghq.com.


CRAIG BOX: Thank you very much, Saad, for helping out with the show today.

SAAD ALI: No problem. I had a lot of fun. Thanks, Craig.

CRAIG BOX: If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at @KubernetesPod. Or reach us by email at kubernetespodcast@google.com.

SAAD ALI: You can also check out the website at kubernetespodcast.com where you will find transcripts and show notes, as well as links to subscribe.

CRAIG BOX: I'll be back with another guest host next week. So until then, thanks for listening.