#113 July 21, 2020
Released on the same day as Kubernetes, cAdvisor is a container monitoring daemon that collects metrics and serves them to monitoring tools. It’s built into the Kubelet, and underpins many components in Kubernetes, such as eviction and autoscaling. David Ashpole of Google Cloud is TL of Kubernetes SIG Instrumentation, and the maintainer of cAdvisor; he joins Adam and Craig this week to explain where instrumentation fits in the stack, and what you should do as a Kubernetes maintainer vs. a cluster administrator.
ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.
CRAIG BOX: And I'm Craig Box.
CRAIG BOX: In the two or three minutes before the news on the show, we used to talk about things like what conference we were attending or where in the world that we are. The world's obviously shrunk quite substantially for most people. And I can tell you that there are a number of more people in the neighborhood around me who have books out on the street. Not quite little free library, but giving books away.
There are a number of fun, sort-of COVID-related things. A statue outside a house around the neighborhood from us now has a mask on the lion, which you can see a picture of in the show notes.
ADAM GLICK: That's fantastic. There's a bush in our neighborhood that people have decorated to look like a little lizard. And they've put a mask on that as well, which is always fun for taking the baby out for a walk and going past.
CRAIG BOX: Something I saw on the news in the last week: a school mascot, Archie the Mammoth from the University of Nebraska-Lincoln, is all in on mask wearing. He has been kitted out with a mask around his elephant mammoth mouth, but also, of course, at the end of his trunk, which is a good six feet away from him.
ADAM GLICK: Yeah, that's a great picture. Speaking of things that make people happy this week, National Ice Cream Day was this past Sunday, for at least the folks here in the United States. And I love ice cream--
CRAIG BOX: Who doesn't?
ADAM GLICK: --as much as anyone. And so to celebrate the occasion, I have bought some materials to put together an ice cream pie that I haven't quite had a chance to construct yet. But that will be my celebration for the holiday. I'll let you know how it goes. But I'm quite excited. I haven't made one of these in a number of years and they're always tasty.
CRAIG BOX: You'd better get to that quite quickly or else it'll surely melt.
ADAM GLICK: If only we had an ice box.
CRAIG BOX: They have freezers where you are? What goes into the manufacture of an ice cream pie?
ADAM GLICK: The way that I normally make them is I'll get a graham cracker crust. And then I will put down a thin layer of "car-mel", and then some crushed graham crackers on top of that. And then you soften up a little bit of ice cream. And you create a layer of ice cream on that. And then if you want to do a two layer one, you'll put a layer in between.
Normally I'll do a layer of graham cracker. But this time I'm thinking I might do peanut butter. I've gotten some peanut butter, so I should put a slab of peanut butter in the middle. And then another layer of ice cream. And then top it all off with "car-mel" and crushed peanuts. Put it in the freezer, let it freeze back up, and voila.
CRAIG BOX: I'm quite glad that it is National Ice Cream Day and not International Ice Cream Day, because there are many things I would take offense to in that. Not least the use of "car-mel" without the A. It's "car-ah-mel", come on! Add the syllable! [CHUCKLING]
ADAM GLICK: Shall we get to the news?
CRAIG BOX: Let's get to the news.
ADAM GLICK: A number of features have become generally available on the Google Kubernetes Engine ingress controller this week. The BackendConfig CRD, which is used to configure Google Cloud load balancer features, hits the v1 milestone. This lets you configure things like session affinity and connection draining timeouts, as well as integrations with Cloud CDN and Identity-Aware Proxy. All are now GA. You can also use a full range of custom health checks, available in beta. A new FrontendConfig CRD has been introduced, which lets you specify SSL policies, also available in beta.
CRAIG BOX: Red Hat OpenShift 4.5 is out. Headline features include automated installation with provisioning on vSphere, installing into a Google Cloud shared VPC with pre-existing infrastructure, and three node clusters on bare metal for small installations.
There are a bunch of fit and finish improvements in both OpenShift 4.5 and the Kubernetes 1.18 base it is now built on. Just like Fedora to Red Hat Enterprise Linux, there is an open source sibling of the OpenShift container platform, the OpenShift Kubernetes Distribution, or OKD. OKD 4 went generally available along with this release. It is built on the same images, and thus includes almost everything except support and access to pull commercial operators.
ADAM GLICK: Spring Cloud Data Flow is a microservices based streaming and batch data processing framework, originally released in 2015 as part of Pivotal's broader Spring framework. It's available for both Cloud Foundry and Kubernetes. And this week, VMware has announced a commercial support offering.
Customers with the VMware Spring runtime subscription can now get access to commercial support, certified images, and air gapped installation options.
CRAIG BOX: K8Spin.cloud ran a paid namespace-as-a-service service based on gVisor and GKE, but unfortunately wasn't able to find a winning business model. Their loss is everyone's gain, however: while they are closing down their SaaS service as of the 1st of September, their software has been open sourced and is now available for you to run in your own cluster. The K8Spin operator adds organizations, tenants, and spaces as Kubernetes concepts.
ADAM GLICK: Back in March, we brought you the news of research into autoscaling done by computer science student Jamie Thompson. Part of that work was a Custom Pod Autoscaler, a way to allow people to create and use custom scalers of their own. That system has gone 1.0 this week. Congratulations to Jamie, who has graduated and is now working at IBM.
CRAIG BOX: The Envoy project has released version 1.15. New features include filters for Postgres and RocketMQ traffic. Unfortunately, the WebAssembly runtime built by the Istio team, currently in an Envoy fork, did not land in this release and is now expected in 1.16.
ADAM GLICK: Fluent Bit, a sub project of the Fluentd project that provides connectors to external logging systems, has released version 1.5. This latest release adds output connectors for Amazon CloudWatch Logs, LogDNA, and New Relic. A Google Cloud logging plugin has also been heavily extended to support Kubernetes resources, operations, and labels. Security has also been improved via integration with the Google OSS-Fuzz service, which has already found five bugs that were fixed in this release.
CRAIG BOX: Rancher Labs has released a new version of the K3D project, which allows you to run K3S, or "Keys" as we like to call it, in Docker. It's a full rewrite with a new CLI, website, and logo. And it is so big they skipped right over version two. New features in K3D v3 include creating multi-server clusters, updating existing kubeconfigs, attaching new clusters to existing networks, handling nodes independently from clusters, and shell completion.
If you're upgrading, watch out for some features that were in earlier versions that aren't yet in the rewrite. In particular, the ability to enable container registries.
ADAM GLICK: Kubernetes can certainly scale. But if you want to know how to use best practices for achieving scale and high availability, Google Cloud has you covered. In a blog post this week, Kobe Magnezi covers important ways to plan for your cluster, including choosing zonal or regional clusters, and the right ways to configure autoscaling, set up monitoring and logging, and use core Kubernetes features like pod affinity and StatefulSets to make sure you're running the right pods and nodes in the right places and in the right numbers.
CRAIG BOX: Are you monitoring the right things in your AKS cluster? Microsoft wants to help you out with recommended alerts for your clusters. Recommendations include categories like CPU usage, disk usage, pod failures, and job completions. The feature is currently available in preview.
ADAM GLICK: Amazon has released Ingress support for App Mesh, their managed AWS-only service mesh. App Mesh now supports Ingress, or north-south traffic, along with their existing internal, or east-west, traffic support. The blog covers the architecture of this new feature, and dives into its usage with both their container and Kubernetes services.
CRAIG BOX: Platform9, our guest on episode 88, has released an update to their managed Kubernetes service. They now provide managed Calico networking with API access, a wizard for autonomous deployments on bare metal and virtual machines, and enhanced monitoring and observability. Their new functionality also includes support for Kubernetes 1.17.
ADAM GLICK: And now for the security section. A new denial of service exploit has been discovered in Kubernetes. The CVE points out that a container in a pod can write to /etc/hosts. But that space isn't used when calculating the amount of space available on a pod.
If a malicious container wrote a large amount of data to that file, it could have the pod run out of space and cause it to fail. This issue has been fixed in the latest dot releases of Kubernetes 1.16 through 1.18. If you need to run an older version, you can mitigate this threat by not allowing your pods to run as root, though this could break an application that is expecting this level of access.
CRAIG BOX: Another CVE this week reflects the fact that if you could intercept network traffic on a Kubernetes node, you could capture incoming requests to the Kubelet, and reply with an HTTP 300-series redirect to a machine you control, thus taking over nodes that should be outside your control. The same point releases contain the fix, and there is no other mitigation.
ADAM GLICK: The job of a Kubernetes security team is never done. The folks on the Nautilus team at Aqua have posted about a new threat that they have observed when an attacker builds a container image on the host as opposed to downloading it from a repository. This activity circumvents scanners that look for known bad images or repos.
The containers created appear to be unique to the pod and host, making it hard to have an easy scanner for these images. The team was able to detect this kind of activity through dynamic threat analysis, which is a fancy way of saying that they were looking at what a machine was doing and asking "was it acting uncharacteristically?", as opposed to just what was being installed on the machine.
The research note calls out that looking at the network traffic for the download of a shell script and blocking bad IP addresses could have also helped mitigate this issue. This is a good reminder that protecting your clusters well often requires a multifaceted approach.
CRAIG BOX: As many of you who run Kubernetes clusters every day can attest, security management for Kubernetes requires a very particular set of skills. The CNCF is working to help companies identify people with those skills, with a Certified Kubernetes Security Specialist, or CKS, certification. The new designation is additive to the Certified Kubernetes Administrator, or CKA. And you must be a CKA to sit the CKS exam.
The test will cover cluster and system hardening, microservices vulnerabilities, software supply chain, and monitoring, logging, and runtime security. The exam is expected to be generally available at the KubeCon North America virtual event, starting on November the 17th. The CNCF have also announced a new training course for the Helm package manager.
ADAM GLICK: More news from the CNCF this week, as they have announced that the virtual KubeCon event in August will now have a free pass option, which has access to the keynotes, the sponsor showcase, and the ability to network with project maintainers and leads. The full virtual pass is still $75 and provides access to the full conference experience. Those interested in registering for either pass can still sign up to attend with the link in the show notes.
CRAIG BOX: Finally, if you've ever set up Kubernetes the hard way, you know that creating and distributing TLS certificates can be challenging, and a source of many potential failures. Likewise, Istio uses certificates and roots of trust for identity and mutual TLS.
Christian Posta from Solo has posted a blog this week explaining how to set up Istio's root CA and best practices for certificate rotation. If you are running Istio and not using a managed service to do so, this is a helpful way to see how to keep your certificates fresh and raise the security bar for your clusters.
ADAM GLICK: And that's the news.
CRAIG BOX: David Ashpole is a software engineer at Google, working on Kubernetes, and is the maintainer of cAdvisor. He is one of the tech leads of SIG Instrumentation, and currently works on Anthos. Welcome to the show, David.
DAVID ASHPOLE: Hi, thanks for having me.
CRAIG BOX: You joined Google in 2016. Had you had any background working in containers or Kubernetes before that?
DAVID ASHPOLE: No, actually. I graduated from college in 2016. And my first experience with containers was after I joined the GKE node team. I had had some experience with distributed systems in college. In fact, my favorite class had me implement Raft in Go. So the GKE node team seemed like a good fit. And I was off to the races.
CRAIG BOX: What's the process like? Do you apply for a particular team at Google? Or do you just apply to be an engineer and they kind of look at your CV and say, you'd be a good fit here?
DAVID ASHPOLE: I just applied for a job at Google. And then during the matching process, I said I wanted to do distributed systems. And I got to talk to a number of different teams that did things, including the GKE team at the time.
ADAM GLICK: What is the matching process, for those who haven't been through that?
DAVID ASHPOLE: It's actually fairly simple. And this has probably changed since I did it. But I filled out a survey that said, here's the things I'm interested in and what I'd like to do. And then I had a number of calls with a couple of managers from teams where they walked me through what it would be like to be on their team, what the great parts are, what the less fun parts are. And then I got to make a choice as to which team I wanted to join.
ADAM GLICK: It's kind of a reverse interviewing process, that once you were in the company and then you got to select where you were going?
DAVID ASHPOLE: Yeah, a little bit. It certainly didn't feel like it. You're still new and getting used to things. So as a new grad talking to managers of real products and real solutions, it was still a little bit of an intimidating process. But it all worked out for the best.
CRAIG BOX: When you join a new team at Google, you are given a starter project to help get on board. What was your starter project?
DAVID ASHPOLE: When I first joined, I was working with Dawn Chen and Mike Taufen and then a couple others. And they tasked me with implementing eviction for inodes, because they had had a number of customer outages early on, where nodes in Kubernetes clusters had run out of inodes and just completely stopped working altogether.
What we wanted to do was implement monitoring to make sure that we knew how many inodes remained on the boot disk. And then also implement container inode monitoring so that once we ran out we could figure out who the culprit was and make sure that they were the ones that got the boot, and not the other pods.
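A minimal sketch of the "find the culprit" idea described here, not code discussed on the show. The `PodInodeUsage` type, the function name, and the 5% threshold are all hypothetical, and the Kubelet's real eviction ranking is more involved than picking the single largest consumer.

```python
from dataclasses import dataclass


@dataclass
class PodInodeUsage:
    name: str
    inodes_used: int


def pick_eviction_candidate(free_inodes, total_inodes, pods, min_free_fraction=0.05):
    """Return the pod to evict, or None if the node still has headroom.

    min_free_fraction is a hypothetical threshold, not a Kubernetes default.
    """
    if total_inodes <= 0:
        raise ValueError("total_inodes must be positive")
    if not pods or free_inodes / total_inodes >= min_free_fraction:
        return None
    # Evict the heaviest inode consumer so well-behaved pods keep running.
    return max(pods, key=lambda pod: pod.inodes_used)
```

Per-container inode monitoring is what makes the `inodes_used` figure available; without it, the node would know it was running out but not which pod to remove.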
CRAIG BOX: Inodes is the underlying unit of file system on Unix. So is this a problem that we had because we're running so many different containers on a single machine at the time?
DAVID ASHPOLE: I don't think it was a problem that we ran into frequently. But as the product has matured, we've found that there are lots of Linux resources that exist. And inodes is one of them. And everything that you can run out of can cause failure.
CRAIG BOX: Do you think it's the sort of thing that in a pod spec you should say, my container's going to require 10,000 inodes?
DAVID ASHPOLE: Yeah, probably not. We definitely have a number of resources where we don't want users to specify them. But we also don't want to be able to run out of them. And so we hope that Kubernetes just sort of does the right thing behind the scenes to make sure that your nodes stay healthy. But without requiring users to say exactly how many inodes, and file descriptors, and watches they want.
CRAIG BOX: You're now working on container instrumentation. Talk us through the broad problem space. What is the reason that we need metrics and instrumentation about containers?
DAVID ASHPOLE: I would describe it broadly as being sort of a bridge between what the Linux kernel provides us and all of the monitoring dashboards and storage that we have and that we want to be able to use to see what's going on in our containers. And the big gap that existed when Kubernetes was first started is that we had this pretty neat interface in cgroups that has a number of files that you can check to see how much CPU it's used or how much memory it's using. And so we basically just wanted to be able to take that information and sort of attach the metadata that comes with containers, things like the name or the image, so that it's useful to end users.
CRAIG BOX: Is that necessary because there's no such thing as a container?
DAVID ASHPOLE: Right, so cAdvisor is a container monitoring daemon. But technically it's actually a cgroup monitoring daemon. And you can actually get metrics about Systemd services as well, for example. It isn't necessarily limited to containers. But it does do some extra heavy lifting to attach things like container image or container labels even to the metrics to make them more useful.
ADAM GLICK: The cAdvisor is kind of like Kubernetes' fraternal twin. It launched on the same day as Kubernetes. What is it?
DAVID ASHPOLE: cAdvisor is a container monitoring daemon. And basically it watches all of the cgroups on your VM. And looks at all of the secret files that contain metrics that we care about, collects them and stores them in memory. And then serves them to end users so that they can ingest them into their monitoring pipeline and visualize them later down the path.
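To make "looking at files that contain metrics" concrete, here is a small sketch that reads two files from a cgroup directory. It assumes the cgroup v2 layout, where `cpu.stat` holds `key value` lines including `usage_usec` and `memory.current` holds a single number; the function names are illustrative and this is not how cAdvisor itself is written.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse 'key value' lines, the layout used by files such as cpu.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_cgroup_usage(cgroup_dir):
    """Collect CPU time and current memory for one cgroup directory."""
    root = Path(cgroup_dir)
    cpu = parse_flat_keyed((root / "cpu.stat").read_text())
    memory_bytes = int((root / "memory.current").read_text().strip())
    return {
        # usage_usec is cumulative CPU time in microseconds.
        "cpu_usage_seconds": cpu.get("usage_usec", 0) / 1_000_000,
        "memory_bytes": memory_bytes,
    }
```

On a cgroup v2 host this could be pointed at a directory under `/sys/fs/cgroup`. The numbers alone say nothing about which container they belong to, which is the metadata gap described in the conversation.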
CRAIG BOX: You talked about the metrics that you are able to infer by asking the kernel about the cgroups in the containers and then publishing them. What's the format in which those metrics are published?
DAVID ASHPOLE: When cAdvisor was first introduced, I would say it was very ambitious. Initially, we had a number of just JSON end points. And so we came up with our own metrics format. And that was quite useful because containers are themselves a nested resource. They're not a flat resource.
But I would say since then, we've moved much more towards the Prometheus end point as the primary way that we serve metrics, mostly because it's so widely adopted within the ecosystem. We also have some storage plugins for things like InfluxDB or Elasticsearch. But definitely the Prometheus end point is the primary way that people tend to consume cAdvisor metrics.
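The trade-off between a nested format and a flat one can be sketched as follows: nesting is expressed in the Prometheus text format by repeating the parent's identity as labels on every sample. The input structure and the `pod`, `container`, and `image` label names are assumptions for illustration; the two metric names are ones cAdvisor is known to expose.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def flatten_pod(pod):
    """Turn one nested pod record into flat Prometheus sample lines."""
    lines = []
    for container in pod["containers"]:
        labels = {
            "pod": pod["name"],
            "container": container["name"],
            "image": container["image"],
        }
        label_text = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        for metric, value in sorted(container["metrics"].items()):
            lines.append(f"{metric}{{{label_text}}} {value}")
    return lines
```

Every line stands alone, which is what makes the format easy for generic tools to ingest, at the cost of losing the explicit pod-contains-container structure.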
ADAM GLICK: Although you're currently the core maintainer of cAdvisor, you weren't the one that originally wrote it.
DAVID ASHPOLE: Yep, that's correct.
ADAM GLICK: Who started the project?
DAVID ASHPOLE: I was never fortunate enough to meet him, but Victor Marmol, who was on the team before me, started it along with Vish Kannan. I think at some point Tim Allclair joined the team and became involved in it as well. And then once I joined, the project was passed to me after my starter project.
CRAIG BOX: Is it sort of a rite of passage? You're there for a wee while and the cAdvisor is given to you to steward for a short period?
DAVID ASHPOLE: Well, I hope not. Because otherwise I'm still undergoing my rite of passage. [LAUGHS]
CRAIG BOX: The architecture of cAdvisor, as you described it in the beginning, was a daemon that ran on the host of a machine that ran containers. Is that still the case? Is that still how most people use it today?
DAVID ASHPOLE: Yeah, I would say the primary way that people use it today is actually that it is used as a library in the Kubelet. And the Kubelet publishes the cAdvisor Prometheus end point on the Kubelet's port.
CRAIG BOX: Doesn't that violate the Unix philosophy, to have it all in one monolith, if you will?
DAVID ASHPOLE: Yeah, I know back in the day they went back and forth between having it as a standalone daemon that communicated over some channel and compiling it in. And I think at the end of the day, they just decided that it was going to be more efficient and more reliable if they used it as a library rather than keeping it as a separate tool.
ADAM GLICK: Is it still also published as a container? And if so, why do it both ways?
DAVID ASHPOLE: It is published as a container. There are a couple of reasons for that. One is simply that it can be used outside of Kubernetes altogether. So if you're a Swarm, or Mesos, or a couple other things user--
CRAIG BOX: Then get with the migration.
DAVID ASHPOLE: Right, true. So especially early on in Kubernetes' project history, there were a number of users that were using other orchestrators. And the other reason is that cAdvisor is highly configurable. But inside of Kubernetes, we lock a lot of things down to make it match what the Kubelet, for example, expects to receive from it.
And so many users want to turn on some of cAdvisor's extra features, or use some of the storage drivers that aren't available when it's part of the Kubelet. And so if you need those extra metrics, or just want to be able to tune all the knobs yourself, it can be very helpful to run it as a daemon set and collect metrics that way.
CRAIG BOX: Give us a couple of examples of the kind of metrics that are exposed by cAdvisor.
DAVID ASHPOLE: There are definitely the basic ones that everyone's familiar with, things like CPU usage, memory usage, network usage, even container disk usage.
CRAIG BOX: Number of inodes?
DAVID ASHPOLE: Number of inodes. Funny story about that. When I first added the inodes, I was told by Dawn that we had some terrible experience internally writing a file tree walking algorithm to count inodes and disk space. So she said don't do that, find some other way to do it.
What we ended up doing, in order to calculate the number of inodes used by containers, was actually just use the find command. And use that as an estimate of the number of inodes that we were using.
CRAIG BOX: Some more great stories of early Kubernetes days with Dawn Chen, you can find in episode number 22.
The Kubernetes community has largely moved on from capital D Docker as the runtime for running containers. There is containerd, there is CRI-O from Red Hat, and there is a runtime interface now, the CRI container runtime interface, which controls all of those things. Does CRI specify how metrics should be exposed by these containers today?
DAVID ASHPOLE: Yeah, so CRI right now defines sort of the least common denominator that we expect from container runtimes. Specifically, these are the metrics that are needed for Kubernetes itself to function. Things like kubectl top won't work unless we have some common set of metrics that are exposed across all container runtimes. And that includes container runtimes that cAdvisor doesn't support, for example, Windows, or things like gVisor, and other hypervisor-based runtimes that don't use cgroups.
So we wanted to define sort of a standard for what container runtimes should expose to the Kubelet. And then the Kubelet can then also expose that to the rest of the cluster. The container runtime interface, we added metrics there because we wanted to have a common set of metrics across all of the container runtimes to support in-cluster uses like autoscaling or kubectl top.
And so we didn't like the fact that cAdvisor only supported some of the container runtimes, particularly those that use cgroups as an implementation. And so introducing metrics there enables us to have the same metrics everywhere.
What's sort of interesting about this right now is that we will use the metrics that come from the container runtime. But then supplement those with other metrics that are not part of that common standard, such as network metrics, or accelerator metrics, or something like that, before we expose it on the Kubelet's end points. So it's a little bit of a mess today.
ADAM GLICK: If your Kubernetes distro uses CRI and containerd on Linux, are you still likely to also be using cAdvisor?
DAVID ASHPOLE: Yes, you are. That's not where we want to be, I would say. We're somewhere in the middle. What we want is we want to have a minimal set of metrics that work for all the container runtimes and that's exposed consistently.
But at the same time, we have a number of end points that expose a much richer set of metrics that we aren't sure we can ask container runtimes to expose consistently. So right now we're sort of in-between where we will take the common set of metrics and expose them. But then also go ask cAdvisor if it has any extra stuff to give us and add that to it as well.
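The "take the common set, then ask cAdvisor for extras" behaviour can be pictured as a merge in which the runtime's values win for any metric both sources report. This is a hypothetical sketch with made-up metric keys, not the Kubelet's actual code.

```python
def merge_stats(runtime_stats, cadvisor_stats):
    """Combine per-container metrics, preferring the container runtime's values."""
    merged = {}
    for container, common in runtime_stats.items():
        extras = cadvisor_stats.get(container, {})
        # Later entries override earlier ones, so runtime metrics take precedence.
        merged[container] = {**extras, **common}
    return merged
```

Containers that only the runtime knows about still appear in the result with the common set alone, which matches the case of runtimes cAdvisor cannot inspect.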
CRAIG BOX: Let's talk about some other SIG Instrumentation projects and parts of the Kubernetes metric pipeline. What's a Heapster?
DAVID ASHPOLE: Heapster is and was, since it is now deprecated, it's a deployment that you can run in your Kubernetes cluster. It will scrape the summary APIs from all of your nodes. It will collect all of that information, aggregate it together. And then it can send to a number of different storage back ends.
CRAIG BOX: When you say the summary APIs, is that the API exposed by cAdvisor in the Kubelet?
DAVID ASHPOLE: The summary API is actually a separate API where the Kubelet takes the cAdvisor metrics and massages them. It is a JSON endpoint, and it's the original Kubelet monitoring endpoint of Kubernetes.
CRAIG BOX: You say JSON there as opposed to gRPC. Is that an accident of history, or was that a deliberate decision?
DAVID ASHPOLE: JSON as opposed to Prometheus.
CRAIG BOX: Right.
DAVID ASHPOLE: I think there were a couple reasons. One is we like the structure that JSON provides. We could say, here's a pod and here's some pod stuff. And then inside a pod, here's a container. So that aspect of it was nice.
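A short sketch of what that nesting looks like to a consumer. The sample document is heavily simplified, and its field names (`podRef`, `usageNanoCores`, `workingSetBytes`) are modelled on the Kubelet summary API from memory, so treat them as assumptions rather than a reference.

```python
import json

SAMPLE = """
{
  "pods": [
    {
      "podRef": {"name": "web-0", "namespace": "default"},
      "containers": [
        {
          "name": "nginx",
          "cpu": {"usageNanoCores": 250000000},
          "memory": {"workingSetBytes": 52428800}
        }
      ]
    }
  ]
}
"""


def container_usage(summary_json):
    """Walk pods, then containers, returning one flat row per container."""
    summary = json.loads(summary_json)
    rows = []
    for pod in summary.get("pods", []):
        ref = pod["podRef"]
        for container in pod.get("containers", []):
            rows.append((
                ref["namespace"],
                ref["name"],
                container["name"],
                container["cpu"]["usageNanoCores"] / 1e9,  # cores
                container["memory"]["workingSetBytes"],
            ))
    return rows
```

The structure is pleasant to read, but every consumer needs code like this that knows the schema, which is the cost of not using a common metrics format.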
But I think we've definitely found that not using a common metrics format makes it much harder to consume. So really only the Kubernetes associated monitoring aggregators end up using that end point, things like Heapster and the Metrics Server.
ADAM GLICK: Heapster has largely been replaced by the Metrics Server. Why was that change made?
DAVID ASHPOLE: I wasn't directly involved with this, but I definitely was an observer at the time. And it turned out that Heapster was fairly difficult to maintain. Primarily because it had so many integrations with so many storage back ends. And once Kubernetes became popular, everyone wanted to become a supported storage back end by Kubernetes. And the maintainers of that component decided that that wasn't what they were interested in doing. And wasn't something that they were able to adequately maintain.
But we still wanted to have a common component to support in-cluster uses. So we still wanted Kubernetes to work, as far as metrics go. That means kubectl top, that means horizontal autoscaling. But that means that they didn't want to support exporting to storage back end A or storage back end B. The Metrics Server is actually largely code copied from Heapster, but with a lot of the storage back end bits removed.
CRAIG BOX: You mentioned some of the consumers of these metrics: kubectl top, and the Horizontal Pod Autoscaler, which looks at the CPU usage of containers running in the pods and sees whether or not there need to be more of them. As the administrator of a Kubernetes cluster, it's reassuring to know that the cluster itself is consuming these metrics. Are they something I should be concerned about? Should I be connecting them to a dashboard and looking at these metrics myself?
DAVID ASHPOLE: Generally we don't advise that. We have historically, at least, broken down metrics into two categories. Resource metrics are things like CPU, memory, and disk, which are associated with resources that you specify in your pod spec. And we want these to be available for components that need to make changes to the pod spec, either users or autoscaling components.
But that's generally the limit of their use. One of the biggest drawbacks of these metrics is that the resource metrics API in Kubernetes, which is the API that serves these, isn't historical. So it just gives you point in time usage so that you can figure out what's going on, not what has been going on. For monitoring, we generally advise having a separate pipeline. The most common probably being using cAdvisor with Prometheus or something like that.
CRAIG BOX: Is there value in me setting my Prometheus system up to scrape all of my nodes' cAdvisor metrics?
DAVID ASHPOLE: I would definitely say that that's good to have. You shouldn't rely on, for example, just the metrics coming from kubectl top to monitor your Kubernetes clusters.
ADAM GLICK: For applications, there's also a tool called kube-state-metrics. How does that work?
DAVID ASHPOLE: That's a really cool component that a number of people have been working on for quite a while. It basically just watches a bunch of resources in your Kubernetes cluster and produces metrics based on the state of all of those objects. I've started using it recently.
And it's been incredibly useful whenever I want to, for example, make a graph about the health of a StatefulSet or whether or not my pods are running. Those sorts of things that I wouldn't normally get out of, say, a CPU metric or memory metric are very helpful for allowing me to tell what's going on in the system, not just from a resource usage point of view.
CRAIG BOX: We had a very interesting conversation with a team from AutoTrader who had built a dashboard that took metrics from kube-state-metrics, took network throughput metrics from Istio, merged them together, and were able to give an idea of how much it cost to run a particular workload based on all of the usage of the items it has in this namespace, multiplied by the storage and networking cost of those cloud components.
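As an illustration of "metrics based on the state of objects", the sketch below turns a list of pod records into one 0-or-1 sample per possible phase. The input shape is hypothetical; the `kube_pod_status_phase` name and its `namespace`, `pod`, and `phase` labels follow kube-state-metrics as best recalled.

```python
PHASES = ("Pending", "Running", "Succeeded", "Failed", "Unknown")


def pod_phase_metrics(pods):
    """Emit a sample for every (pod, phase) pair; exactly one per pod is 1."""
    lines = []
    for pod in pods:
        namespace, name = pod["namespace"], pod["name"]
        for phase in PHASES:
            value = 1 if pod["phase"] == phase else 0
            lines.append(
                f'kube_pod_status_phase{{namespace="{namespace}",'
                f'pod="{name}",phase="{phase}"}} {value}'
            )
    return lines
```

Summing the samples where `phase="Running"` then gives a running-pod count that can be graphed over time, which no CPU or memory metric would provide.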
DAVID ASHPOLE: Wow, that's pretty cool. That's a little bit more advanced than what I was doing.
CRAIG BOX: Still, well, you can go back and listen to episode 52 and learn all about it, hashtag #SponsoredRead. There are a whole bunch of cool projects happening in SIG Instrumentation. Let's ask about a few of those. So first of all, the metrics stability framework. What's not stable about metrics?
DAVID ASHPOLE: That was something that Han Kang worked on the last few quarters. And it really stemmed from a desire to have the same sort of stability guarantees we have for our regular APIs for metrics, which are in fact an API that we expose to users. So the idea is, we just want to have some notion of a metric either being stable or possibly being subject to change.
One of the motivating factors is that many storage back ends can't tolerate certain changes to metrics. For example, deleting a label, depending on your back end, could invalidate all of your historical data if you want to make that update. So it's actually quite important, especially for some of the core metrics of Kubernetes like API server request latency, things like that that people have built SLOs around, to have some notion of stability associated with it.
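Why deleting a label can orphan historical data is easier to see with a toy store in which a series is identified by its metric name plus its full label set. `TinyStore` is invented for this sketch, and real back ends differ in the details; `apiserver_request_duration_seconds` is used only as a familiar-looking name.

```python
def series_key(name, labels):
    """Identify a time series by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


class TinyStore:
    """A toy in-memory store keyed by series identity."""

    def __init__(self):
        self.series = {}

    def append(self, name, labels, value):
        self.series.setdefault(series_key(name, labels), []).append(value)

    def history(self, name, labels):
        return self.series.get(series_key(name, labels), [])
```

After a release drops the `resource` label, new samples land under a different key, so a query or SLO written against the old label set stops seeing fresh data and the new series starts with no history.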
ADAM GLICK: What about structured logging and log sanitization?
DAVID ASHPOLE: Structured logging was a KEP that was merged recently and is a work in progress. It's actually a fairly simple concept. Instead of doing klog.Infof, we've added a klog.InfoS. The idea is that instead of specifying where in a string to insert your log items, you specify key-value pairs that end up being printed in logs in a consistent format across users and companies that choose to use it. That way, any log consumers that want to parse that sort of information now have an expected structure they can look for if they want to extract metadata that way.
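The contrast between the two logging styles can be sketched in a few lines of Python. This is not klog itself: `info_f`, `info_s`, and `parse_pairs` are hypothetical stand-ins, and the quoted-message-plus-`key="value"` output format is an assumption chosen for illustration.

```python
import re

def info_f(template, *args):
    """Format-string style: values are spliced into free-form text."""
    return template % args

def info_s(msg, **kv):
    """Structured style: a fixed message followed by key="value" pairs."""
    pairs = " ".join(f'{k}="{v}"' for k, v in kv.items())
    return f'"{msg}" {pairs}' if pairs else f'"{msg}"'

def parse_pairs(line):
    """Recover key/value metadata from a structured line.

    Simplification: values containing double quotes are not handled.
    """
    return dict(re.findall(r'(\w+)="([^"]*)"', line))

unstructured = info_f("Pod %s status updated to %s", "kube-system/kubedns", "ready")
structured = info_s("Pod status updated", pod="kube-system/kubedns", status="ready")

# A log consumer can pull fields out of the structured line without
# knowing anything about the wording of the message.
fields = parse_pairs(structured)
```

With the format-string version, a consumer needs a separate pattern for every message to find the pod name. With the key-value version, one generic parser works for every log line.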
CRAIG BOX: Is this more for people who are writing and debugging Kubernetes itself? Or will this be useful to application owners who are deploying apps on top of Kubernetes?
DAVID ASHPOLE: This is useful for Kubernetes components themselves. So things like the API server or Kubelet or scheduler. If they want to add additional structure to their log messages, this is a mechanism to enable them to do so.
CRAIG BOX: One of the great things about debugging distributed systems is being able to use distributed tracing. And I understand that you're bringing that to some of the Kubernetes components as well.
DAVID ASHPOLE: Just to give a brief background on tracing, components that are instrumented with tracing produce spans. And a span is just, you can think of it as an amount of time spent doing something. And these spans can be aggregated together from different components to produce what's called a trace. And that forms a tree of spans. And this is useful primarily for debugging difficult latency-related problems in your components.
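As a rough illustration of how spans from different components combine into a tree, here is a small Python sketch. The `Span` fields, the component names, and the timings are all hypothetical; real tracing libraries carry more information and propagate the IDs between components for you.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    component: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

def render_trace(spans, trace_id):
    """Group one trace's spans by parent and print them as an indented tree."""
    children = defaultdict(list)
    for s in spans:
        if s.trace_id == trace_id:
            children[s.parent_id].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children[parent_id], key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{s.component}: {s.name} ({s.duration_ms} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines

# Spans as they might arrive from different components, in no particular order.
spans = [
    Span("t1", "d", "c", "kubelet", "pull image", 50, 90),
    Span("t1", "a", None, "apiserver", "create pod", 0, 120),
    Span("t2", "x", None, "apiserver", "list nodes", 5, 9),
    Span("t1", "c", "a", "kubelet", "start container", 45, 110),
    Span("t1", "b", "a", "scheduler", "schedule pod", 10, 40),
]

tree = render_trace(spans, "t1")
```

Reading the tree top to bottom shows where the time in a slow request went, across components, without grepping each component's logs separately.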
I'm particularly interested in tracing because currently in Kubernetes we don't have any context-aware telemetry, meaning telemetry where you can actually see how, for example, a log in one component relates to a log in a different component. Right now I have to go grep for the pod name or something like that. Context-aware instrumentation like tracing can enable me to combine that information in the same view afterwards.
ADAM GLICK: How does the tracing that's going on here relate to projects like Jaeger?
DAVID ASHPOLE: Jaeger is super cool. It is a backend for storing traces. The stuff that I've been working on is primarily how to instrument components in Kubernetes so that we can take what, for example, the API server is doing and store it in something like Jaeger.
ADAM GLICK: So you're feeding metrics into it the same way that you do with Prometheus.
DAVID ASHPOLE: Traces are distinct from metrics. You can actually correlate them with metrics if you're using something like OpenTelemetry.
CRAIG BOX: You also work on node resource management. I wanted to ask: there are kind of two different ways that you can have a workload killed or moved. You have the out of memory killer on the node. And then you also have eviction, which is done by the scheduler. What's the distinction? Why do we still need both?
DAVID ASHPOLE: There are two classes of problems, and each mechanism solves one of them. There's a class of problem where you have more stuff in your cluster than your cluster can run at once. And so the scheduler's job is to decide which pods are most important and make sure that those are the ones that are currently running.
And so when the scheduler does something like preemption, it's actually saying, oh, here's a pod that's less important. And here's a pod that's more important. And I'm going to remove one and replace it with the other.
The second class, which is the one the Kubelet deals with, is when you have a number of pods that have already been assigned to a node that, based on what they've asked for, their requests should fit on the node. But then it occasionally has to deal with scenarios in which the node still runs out of, for example, memory.
It has to take some action to remedy that, to prevent the entire node from experiencing issues. So the Kubelet performs out-of-memory eviction when it detects that there's very little memory left on the node, and removes whichever pods are exceeding their requests by the largest amount.
CRAIG BOX: There will be pods that are important to the running and debugging of the system, for example, if you're running cAdvisor as a DaemonSet, or the metrics server. What steps should we take to make sure that, in either of those events, your debuggability remains and your observability tools stay running?
DAVID ASHPOLE: Bobby Salamat, a couple releases back, introduced a new feature called pod priority. And while that mostly deals with the scheduler's placement of pods on nodes, it actually also has implications for eviction ranking. We had to sort of balance the desire to keep important pods like cAdvisor or the metrics server around with the idea that if you request some amount of resources, you should be able to use those without being kicked off the node.
And so what we came up with is essentially the idea that pods are allowed to use up to their requests. But extra resources beyond what was requested go first to pods that have higher pod priority. So in other words, if you set pod priority high for something, that does end up influencing the eviction rankings. And if there's any extra memory on the node available, that will go first to pods like cAdvisor or the metrics server that have priority set.
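The ranking described here can be sketched as a sort. This is a simplified model, not the Kubelet's code: it assumes memory is the only resource, that pods within their requests are considered last, that lower priority is evicted before higher priority among pods over their requests, and that ties are broken by how far usage exceeds the request. The `Pod` type and the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    priority: int     # higher means more important
    request_mib: int  # memory the pod asked for
    usage_mib: int    # memory the pod is actually using

def eviction_order(pods):
    """Return pods in the order they would be evicted under memory pressure."""
    def key(p):
        over = p.usage_mib - p.request_mib
        within_request = over <= 0
        # False sorts before True, so pods exceeding their requests come first;
        # then lower priority; then the largest overage.
        return (within_request, p.priority, -over)
    return sorted(pods, key=key)

pods = [
    Pod("db", priority=0, request_mib=1024, usage_mib=800),
    Pod("cadvisor", priority=1000, request_mib=128, usage_mib=200),
    Pod("web", priority=0, request_mib=256, usage_mib=300),
    Pod("batch-job", priority=0, request_mib=512, usage_mib=900),
]

order = [p.name for p in eviction_order(pods)]
```

In this example the high-priority cAdvisor pod is over its request but still outlasts both low-priority pods that are over theirs, while the database pod, which stays within its request, is considered last of all.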
ADAM GLICK: What comes next for cAdvisor?
DAVID ASHPOLE: I think there are two broad directions the project has been taking. The first is, we're always adding new and interesting metrics. There are a couple of new contributors from a variety of companies that have stepped up recently and contributed things like hardware performance counter metrics. Or one that I'm particularly excited about is a metric that is especially useful for benchmarking memory usage, which historically has been just a pain in the butt to get an accurate estimate on.
But we're always adding new metrics. And the other thing we've been doing, interestingly, has been removing a lot of old features that were useful when they were first introduced, but are no longer the best that Kubernetes has to offer. For example, when cAdvisor was first introduced, it had a very nice UI that everyone relied on in order to graph their metrics.
ADAM GLICK: David, it's been great having you on the show. Thanks for joining us.
DAVID ASHPOLE: Thanks for having me.
ADAM GLICK: You can find David on Twitter @k8s_dashpole.
ADAM GLICK: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email at email@example.com.
CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes as well as links to subscribe. Until next time, take care.
ADAM GLICK: Catch you next week.