Kubernetes Podcast from Google: Episode 40 - GKE Usage Metering, with Madhu Yennamani

#40 February 12, 2019

Hosts: Craig Box, Adam Glick

The new GKE Usage Metering feature lets you find out how much your tenants or applications cost to run. Your hosts talk to Madhu Yennamani, product manager at Google Cloud, about usage metering, and how new GKE features are implemented.

Transcript

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: Another one of those rare weeks where we're in the same room.

ADAM GLICK: It is indeed, which was a little bit of a challenge to get here. But I'm glad that I made it down to sunny California today.

CRAIG BOX: Yes. I see you had a snow day last week, which we put the picture in the show notes. But it turns out that it's all got a little bit worse since then.

ADAM GLICK: It has continued and continued, yes. It was amazing. I did all the shoveling at the house, which then just led to me waking up this morning and realizing that it all just needs to be shoveled again.

CRAIG BOX: That's the magic of snow. I lived in Canada for, what we call, two winters-- for two years, which, basically, there's snow on the ground from October through, basically, June. So I'm at least roughly familiar with that. The weather in London, actually, took a turn for the worse the day after we left, and there was a great video of a plane-- I must say it's a perfectly normal maneuver for a plane that pilots do train for.

But a 787 coming into land, and basically getting wheels down saying, no, it's a little bit too windy here. I'm going to take back off again and go around for another go. And that was the day after we left, so really quite glad we missed that.

ADAM GLICK: Were you on that plane?

CRAIG BOX: No, no. I was safely here watching it all on a little video on Twitter.

ADAM GLICK: I've only been on a plane that did that once. That was very strange.

CRAIG BOX: Yeah, I think I may have once as well. But I'm glad that we're all here and--

ADAM GLICK: All's well that ends well. The pilots out there are well-trained, and I'm very happy for that.

CRAIG BOX: Let's get to the news. This week opens with a major vulnerability in the Cloud Native stack, this time in the runc container runtime. A user with root inside a container could overwrite the runc binary on the host, thus escaping the container and gaining root access to the whole machine.

Fixes are out for most distributions. For GKE customers, the default Container-Optimized OS is not affected and no patch is needed. Customers running Ubuntu nodes need to upgrade the nodes to a new version, which is being published, and they have been contacted with instructions.

ADAM GLICK: InfoWorld has named Kubernetes one of their technologies of the year for the second year in a row. In their write-up, they mention that Kubernetes has broken free from a crowd of container orchestrators to become the standard platform everywhere and that spending time learning Kubernetes will pay dividends to people in operations.

CRAIG BOX: Google Cloud announced application-layer secrets encryption for GKE in beta. Kubernetes administrators know that secrets aren't really that secret. They're stored unencrypted in the etcd data store. Application-layer secrets encryption uses a feature in Kubernetes to do what is called envelope encryption.

Your secret is encrypted with a local data encryption key, and that key is encrypted with a master key, which is stored in Google Cloud KMS. With the GKE implementation, all you have to do is specify which key to use and grant access for your service account to use it, and your secrets will now be stored encrypted. This model also lets you regularly rotate keys and audit key access.
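The envelope pattern described here can be sketched in a few lines. This is a minimal illustration, not the GKE or Cloud KMS implementation: the function names are hypothetical, the master key is a local byte string standing in for a key that would really live in a key management service, and the cipher is a toy HMAC-based keystream used only to keep the example dependency-free.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with an HMAC-SHA256 keystream. Illustration only, NOT for production."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    return nonce + _keystream_xor(key, nonce, plaintext)


def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    return _keystream_xor(key, nonce, body)


def seal_secret(master_key: bytes, secret: bytes) -> dict:
    """Encrypt the secret with a fresh data key, then wrap the data key with the master key."""
    data_key = secrets.token_bytes(32)
    return {
        "wrapped_data_key": encrypt(master_key, data_key),
        "ciphertext": encrypt(data_key, secret),
    }


def open_secret(master_key: bytes, envelope: dict) -> bytes:
    data_key = decrypt(master_key, envelope["wrapped_data_key"])
    return decrypt(data_key, envelope["ciphertext"])


def rotate_master_key(old_key: bytes, new_key: bytes, envelope: dict) -> dict:
    """Rotation only re-wraps the small data key; the secret's ciphertext is left untouched."""
    data_key = decrypt(old_key, envelope["wrapped_data_key"])
    return {
        "wrapped_data_key": encrypt(new_key, data_key),
        "ciphertext": envelope["ciphertext"],
    }
```

The point of the two layers is visible in `rotate_master_key`: changing the master key touches only the wrapped data key, so stored secrets do not need to be re-encrypted.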

ADAM GLICK: Two more advances for building containers this week from Google. Last week, our guest was Dan Lorenc. And not long after our chat, he launched Build Artifact Caching in Google Cloud Build that can help you get containers faster. Based on the Kaniko project, this feature stores and indexes intermediate layers inside Google's Container Registry so they are available for use in subsequent builds.

CRAIG BOX: Next up, Jib, a tool for building Java applications into containers, has gone GA at version 1.0.0. Jib is a plugin for Maven or Gradle to build applications into containers. New features in 1.0 include containerizing WAR files, integration with Skaffold, and the refactoring of the Core into a library which can be used in other Java applications.

ADAM GLICK: Red Hat are putting the integrated back in integrated development environment with what they're claiming is the first Kubernetes-native IDE. While back in my day, the IDE was software that ran on your desktop, Red Hat CodeReady Workspaces runs inside a Kubernetes cluster. It's based on the open-source Eclipse Che platform. Red Hat CodeReady Workspaces also provides access to Factories, presumably to make the Java developer feel more at home.

CRAIG BOX: The integration of Heptio with VMware continues with a cleaning of house of their open-source projects. First, VMware held Heptio's ksonnet in their hands and decided it did not spark joy, so they have discontinued the project. Ksonnet is a tool for managing configurations, and VMware said that, despite their efforts, ksonnet had not resonated with its intended audience.

If you're a user of ksonnet and want to stay working in the same style, we recommend Kapitan with a K, from DeepMind. Otherwise, there is a vibrant community around Kustomize, also with a K, which is becoming part of kubectl in the upcoming 1.14 release.

The Ark project for backup and restore of cluster configurations has been renamed Velero, Spanish for sail-maker or sailing ship. All the projects have had their Heptio prefix stripped and are now to be known just as Project Velero, Gimbal, Sonobuoy, et cetera.

ADAM GLICK: The folks at Platform9 have followed the announcement of GKE On-Prem with their launch of their own managed Kubernetes product named-- wait for it-- Managed Kubernetes. Platform9's managed on-prem Kubernetes offering runs on top of VMware and is said to offer a 24x7x365 SLA. No word on how many 9s are available from this platform.

CRAIG BOX: ClearDATA has announced a Kubernetes solution for health care and life sciences organizations across multiple cloud platforms, including GCP and AWS. The platform allows health care organizations to use containers while ensuring compliance with standards like HIPAA and GDPR. Ask your doctor if Kubernetes is right for you.

ADAM GLICK: Ever wanted to attend KubeCon but not been able to afford it? The Cloud Native Computing Foundation's Diversity Scholarship Program provides support for those from traditionally underrepresented and marginalized groups in the technology and open-source communities. This includes, but isn't limited to, persons identifying as LGBTQ, women, persons of color, and persons with disabilities.

These scholarships are provided for those who may not otherwise have the opportunity to attend CNCF events for financial reasons. A number of scholarships will be provided to recipients and they will receive up to $1,500 to reimburse actual travel expenses. Scholarships are awarded based on a combination of need and impact.

The selection will be made by an assembled group of reviewers who will assess each applicant's request, and all application information will be kept confidential. If you want to apply for a scholarship or read about the experiences of Dennis Salamanca Farafonov and his diversity scholarship, you'll find it in the show notes.

CRAIG BOX: Finally, the Poseidon Project is working on integrating the Firmament scheduler into Kubernetes and has released version 0.7 along with a post on the Kubernetes blog. Firmament is scheduling software described in a 2016 paper by two University of Cambridge researchers, including Malte Schwarzkopf, who is one of the co-authors of the Omega paper from Google.

Firmament uses graph theory in a construct called a flow network to find the minimum cost optimization for a set of workloads and is best explained by a graphic on its website. Poseidon is an integration which allows using Firmament as a Kubernetes scheduler, which is being developed largely by the Huawei Platform as a Service team.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Madhu Yennamani is a product manager in Google Cloud who recently launched GKE Usage Metering. Welcome to the show, Madhu.

MADHU YENNAMANI: Thank you, Adam.

CRAIG BOX: Congratulations on the launch.

MADHU YENNAMANI: Thank you.

CRAIG BOX: Tell us exactly what GKE Usage Metering is.

MADHU YENNAMANI: GKE Usage Metering allows the users to understand the usage profiles of their GKE cluster. You can see the usage of the underlying resources, such as CPU, memory, PD, broken down by Kubernetes namespaces, labels, and then attribute them to meaningful entities, such as a department, a customer, an environment. What we have is a meaningful entity for your use case.

CRAIG BOX: What was the impetus to build this feature? What problem for customers does it solve?

MADHU YENNAMANI: So basically, Kubernetes is very effective in abstracting the underlying infrastructure details from the customers. And customers see the whole thing as a pool of resources. In a multi-tenant GKE cluster, where a number of teams are sharing the cluster, it can be hard to understand which tenant or team is using what portion of resources.

So, for example, through our research, what we found was that some of the customers were using manual rough estimates. Some of them were dedicating a whole team to do the estimates. And some went ahead and wrote scripts to get the estimates done. And we wanted to simplify it.

One of the common patterns that we saw time and again is, when a user starts using the GKE cluster, they see a huge cost savings upfront. So initially, everybody is happy, rough estimates are sort of OK.

CRAIG BOX: It sounds good. Let's all go home!

MADHU YENNAMANI: [LAUGHING] But as more and more teams start using the GKE cluster, it becomes important to understand how usage is trending across the board. Another problem that we heard from customers is that it's often important to quickly determine which tenant or team has introduced a bug that led to a sudden spike in usage.

CRAIG BOX: Right.

MADHU YENNAMANI: Or there could be a scenario where somebody has accidentally forgotten to turn off or wind down their clusters after a test. And in those cases, you don't want to realize after a month that things have been wasted or resources have been wasted.

So one of the asks that we heard from our customers is, how do I easily understand that such spikes are happening, instead of me having to manually watch for these spikes and have a manual tab on these things? Is it possible to have a dashboard where my developers can see this for themselves and be more mindful while using the resources? So that was the motivation. Those were like the primary problems that we were trying to solve.

ADAM GLICK: What goes into building a feature like this for GKE?

MADHU YENNAMANI: Great question. So at GKE, in our team specifically, we believe in iterating fast. So we collect information from a lot of our customers. If we find a common pain point across a number of customers, we quickly build an MVP or a prototype and give it to the customers to try out and give us feedback. And then we iterate faster on that.

For example, in this scenario, in the very first version we had provided segregation purely based on namespaces. And after getting feedback from the customers, we found that the namespace-based segregation is not flexible enough for fulfilling all the use cases. So in the next iteration, we built on label-based segregation. Similar thing continues on different facets. For example, we want to do the same thing in terms of the UX development also.

CRAIG BOX: What is the actual implementation for this? I see on the blog post you have a daemon set. You have an agent that runs on each node in order to collect these metrics. What is the process by which it's collecting the metrics, and then what is the process by which it's pushing them out to the place where you eventually see them?

MADHU YENNAMANI: So when the feature is enabled on the master or the control plane, an agent starts running, which is collecting the metrics by talking to the API server. The collected metrics are aggregated and exported on an hourly basis to a BigQuery dataset. This holds true for CPU, memory, PD and all those resources.

Now, network egress is a little bit tricky. We can't do justice to measuring network egress purely from the control plane. So we have to run network egress agents as daemon sets, so they are on all the nodes. They measure network traffic across the board. From there, they report back to the agent that's running on the control plane. And from there, it's exported to BigQuery.
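The collect, aggregate, and export-hourly flow can be pictured with a small sketch. The record shape `(unix_ts, namespace, resource, amount)` and the function name are assumptions made for illustration; they are not the agent's internal format or the exported BigQuery schema.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_rollup(samples):
    """Sum fine-grained samples into one total per (hour, namespace, resource).

    samples: iterable of (unix_ts, namespace, resource, amount) tuples (hypothetical shape).
    """
    totals = defaultdict(float)
    for ts, namespace, resource, amount in samples:
        # Truncate the timestamp to the start of its UTC hour to pick the bucket.
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        totals[(hour, namespace, resource)] += amount
    return dict(totals)


# Made-up samples around 10:00 UTC on the episode date.
BASE = int(datetime(2019, 2, 12, 10, 0, tzinfo=timezone.utc).timestamp())
SAMPLES = [
    (BASE + 60, "team-a", "cpu_core_seconds", 30.0),
    (BASE + 1800, "team-a", "cpu_core_seconds", 45.0),
    (BASE + 3605, "team-a", "cpu_core_seconds", 10.0),  # falls into the 11:00 bucket
    (BASE + 120, "team-b", "cpu_core_seconds", 5.0),
]

if __name__ == "__main__":
    for key, total in sorted(hourly_rollup(SAMPLES).items()):
        print(key, total)
```

Each resulting row corresponds to what would be written once per hour; anything finer than that is folded into the bucket before export.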

CRAIG BOX: Why BigQuery?

MADHU YENNAMANI: BigQuery is a very flexible mechanism to analyze. It integrates very well with a number of third-party tools. So our criteria of choosing BigQuery was to provide flexibility to customers and to have more mechanisms by which they can dissect.

CRAIG BOX: Before the feature was produced, what are some of the ways that users were making this happen for themselves?

MADHU YENNAMANI: It wasn't really easy for customers to get good allocation before this feature. Some of them were using manual estimates, rough ballpark estimates. Some of them tried to build scripts, but they did not quite meet the accuracy standards that were desired.

To illustrate with an example, if we think about the network egress cost, there could be a huge variance in the rates. The cost difference between network traffic going within a zone could be almost close to zero versus the cost of traffic that's going from one region to another region could be substantial. So when our customers try to build mechanisms to even measure the network egress usage, it got complex. It's not easy to solve.
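A small worked example makes the variance concrete. The per-GiB rates below are invented purely to show the spread between traffic paths and are not real prices; the tenant names and the `egress_cost` helper are likewise hypothetical.

```python
# Hypothetical per-GiB rates, chosen only to illustrate the spread; NOT real pricing.
RATE_PER_GIB = {
    "same_zone": 0.00,
    "cross_zone": 0.01,
    "cross_region": 0.08,
}


def egress_cost(flows):
    """Return egress cost per tenant.

    flows: iterable of (tenant, path_type, gib) tuples, where path_type is a key
    of RATE_PER_GIB. Byte counts alone are not enough; the path decides the rate.
    """
    costs = {}
    for tenant, path_type, gib in flows:
        costs[tenant] = costs.get(tenant, 0.0) + gib * RATE_PER_GIB[path_type]
    return costs


# Two tenants move the same volume but over different paths.
FLOWS = [
    ("team-a", "same_zone", 500.0),
    ("team-b", "cross_region", 500.0),
]

if __name__ == "__main__":
    print(egress_cost(FLOWS))
```

With identical traffic volumes, one tenant costs nothing under these assumed rates while the other carries the whole bill, which is why measuring bytes without classifying the source and destination gives a misleading picture.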

CRAIG BOX: Are you aggregating the cost as part of the agent that runs on the control plane, or do you just output data numbers to BigQuery and then it's up to the user's dashboard to turn that into a cost?

MADHU YENNAMANI: Both. The metering agent that's running in the control plane measures at a very fine-grained level, and there is aggregation going on at that stage. Then the hourly data is exported to the BigQuery tables, and the user can aggregate it however they want.

If they want to aggregate it at a department level, they can do that. Or if they want to see how much resources were consumed by a test, they can aggregate that at an environment level or at a test level. So it's quite flexible for the users.
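As a rough sketch of that flexibility, the same exported rows can be grouped by whichever label matters for the question at hand. The row layout, label keys, and numbers here are assumptions for illustration, not the actual export schema.

```python
from collections import defaultdict


def aggregate_by_label(rows, label_key):
    """Total CPU usage per value of one label; rows without the label are grouped together."""
    totals = defaultdict(float)
    for row in rows:
        group = row["labels"].get(label_key, "<unlabeled>")
        totals[group] += row["cpu_core_hours"]
    return dict(totals)


# Hypothetical hourly usage rows, each carrying the Kubernetes labels of the workload.
ROWS = [
    {"namespace": "shop", "labels": {"department": "retail", "env": "prod"}, "cpu_core_hours": 12.0},
    {"namespace": "shop", "labels": {"department": "retail", "env": "test"}, "cpu_core_hours": 3.0},
    {"namespace": "ml", "labels": {"department": "research", "env": "prod"}, "cpu_core_hours": 20.0},
    {"namespace": "tmp", "labels": {}, "cpu_core_hours": 1.5},
]

if __name__ == "__main__":
    print(aggregate_by_label(ROWS, "department"))
    print(aggregate_by_label(ROWS, "env"))
```

Keeping an explicit bucket for unlabeled rows is a deliberate choice in this sketch: it makes usage that cannot be attributed to anyone visible instead of silently dropping it.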

ADAM GLICK: How is usage metering different from people who would choose to use, say, Prometheus and then visualize that with Grafana?

MADHU YENNAMANI: Usage Metering offers a ready-to-use solution for understanding usage on a per-tenant basis. Like many tools, its functionality overlaps somewhat with other powerful tools, such as Prometheus and Grafana. Conceivably, one can use Prometheus and Grafana and augment them with changes to achieve similar functionality. However, it would involve significant effort and changes. Let me illustrate what I mean with a couple of examples.

ADAM GLICK: Yes, please.

MADHU YENNAMANI: Like I mentioned before, resources like network egress can be hugely different. The rates can be hugely different based on from which point to which point the traffic is going. Using tools like Prometheus and Grafana, it will be difficult to attribute as to whether a tenant is using the expensive traffic or the tenant is using mostly the traffic which has almost zero cost.

Another difference would be that, depending on sampling frequency or scraping frequency, monitoring systems might miss short-lived pods. The Usage Metering feature, on the other hand, has been designed to overcome such limitations. So I think I can go on and on about the things, but in a nutshell, the team at Google has put a lot of thought into what we need to do for these specific use cases. So it's a question of choosing a product that's ready to use versus something that you want to build on your own.
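The sampling gap can be shown with a toy simulation. It assumes a deliberately simplified polling model (a scraper that only notices a pod if it is running at the instant of a scrape) and made-up pod lifetimes; it does not model any particular monitoring system in detail.

```python
def pods_seen_by_scraper(pods, scrape_interval, horizon):
    """Return the names of pods that were running at one or more scrape instants.

    pods: list of (name, start_s, end_s), with the pod running during [start_s, end_s).
    The scraper samples at t = 0, scrape_interval, 2 * scrape_interval, ... up to horizon.
    """
    seen = set()
    for t in range(0, horizon + 1, scrape_interval):
        for name, start, end in pods:
            if start <= t < end:
                seen.add(name)
    return seen


# Hypothetical workloads: one long-running service and two short-lived jobs.
PODS = [
    ("web", 0, 600),
    ("batch-job", 65, 110),   # lives 45 s, entirely between the 60 s and 120 s scrapes
    ("cron-task", 130, 170),  # lives 40 s, entirely between the 120 s and 180 s scrapes
]

if __name__ == "__main__":
    print(sorted(pods_seen_by_scraper(PODS, scrape_interval=60, horizon=600)))
    print(sorted(pods_seen_by_scraper(PODS, scrape_interval=15, horizon=600)))
```

Under this model a 60-second interval never observes either short job, even though both consumed resources, while a 15-second interval catches them at the price of four times as many samples.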

CRAIG BOX: You've mentioned multi-tenancy a lot, and you've also talked about users identifying someone who started a workload and it's run away with them over the weekend, for example. Do you find that, in the customers that you were building this out with, that you were getting a lot of people from single-tenant environments getting value out of it as well?

MADHU YENNAMANI: Yes. In some cases, we have found that there's value out for single tenants also. There are a couple of facets to it. One is, in most of the monitoring systems, there's quota limits. So on average, your monitoring data is active for about 6 weeks or 12 weeks.

As the monitoring data is reached, it's not purely to understand the usage. When you think about usage data versus monitoring data, monitoring data's primary purpose is debugging and figuring out what's going on with your cluster in the short term, whereas usage data has a broader purpose. You want to understand historically how usage is trending for an application.

Is there a seasonality? How can I forecast for later and such aspects? So it's easier to use this feature and have the historical data stored for a longer time basis and use that analysis to gain insights. So we have seen some of the customers using this even for single-tenant clusters.

ADAM GLICK: Can people use this outside of GKE?

MADHU YENNAMANI: Not at the moment. Right now, it's only available for GKE customers.

ADAM GLICK: The data, you said, was sent out to BigQuery. What tools can people use to analyze that data once it's in BigQuery?

MADHU YENNAMANI: Users can use any external data analysis or visualization tool that can connect to BigQuery. We don't restrict. We don't recommend anything. But we do provide templates, plug-and-play templates, that users can use for Google Data Studio.

ADAM GLICK: Can people look at historical data or data collected only since this feature has been enabled?

MADHU YENNAMANI: Unfortunately, data is not retroactive. Users have to enable it, and only then the data is exported.

ADAM GLICK: Gotcha. And is that real-time data or is this data that queues up over a while, and then people can kind of look at it as a batch dataset?

MADHU YENNAMANI: Usage data is exported in one-hour intervals. One of the powerful aspects is that, because data is exported to BigQuery, it's easy to build a historical picture of usage. Many tools only keep data for a certain amount of time. But with Usage Metering, users have full control over how long it should be stored because it's kept in their own BigQuery dataset.

CRAIG BOX: How can people get started with Usage Metering in their clusters?

MADHU YENNAMANI: The setup is actually quite simple. You can enable this feature on existing clusters or new clusters. All you need to do is create a BigQuery dataset and provide it as a parameter in the CLI. Once the feature is enabled, it will start exporting data into the BigQuery dataset. And with the data, you can hook it up to any sort of visualization tool of your choice. And if you don't want to hook it up to a visualization tool, you can just run queries in BigQuery and get the insights you need.

If a customer has a specific use case, such as estimating cost allocation, then the one additional requirement is to enable BigQuery export of billing data. The data from billing export is joined to the usage export data to provide the cost breakdown.
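One plausible way to picture that join is a proportional split: each hour's billed amount for a resource is divided among namespaces according to their share of the metered usage. The tuple layouts, the hour keys, and the allocation rule are all assumptions for illustration; the real exports have their own schemas, and the provided templates may compute cost differently.

```python
from collections import defaultdict


def allocate_cost(usage_rows, billing_rows):
    """Split billed cost across namespaces in proportion to metered usage.

    usage_rows:   iterable of (hour, resource, namespace, amount)  (hypothetical shape)
    billing_rows: iterable of (hour, resource, cost)               (hypothetical shape)
    """
    total_usage = defaultdict(float)
    for hour, resource, _namespace, amount in usage_rows:
        total_usage[(hour, resource)] += amount

    billed = defaultdict(float)
    for hour, resource, cost in billing_rows:
        billed[(hour, resource)] += cost

    per_namespace = defaultdict(float)
    for hour, resource, namespace, amount in usage_rows:
        total = total_usage[(hour, resource)]
        if total > 0:
            share = amount / total
            per_namespace[namespace] += billed.get((hour, resource), 0.0) * share
    return dict(per_namespace)


USAGE = [
    ("2019-02-12T10", "cpu", "team-a", 3.0),
    ("2019-02-12T10", "cpu", "team-b", 1.0),
    ("2019-02-12T10", "memory", "team-a", 2.0),
    ("2019-02-12T10", "memory", "team-b", 2.0),
]
BILLING = [
    ("2019-02-12T10", "cpu", 0.40),
    ("2019-02-12T10", "memory", 0.10),
]

if __name__ == "__main__":
    print(allocate_cost(USAGE, BILLING))
```

Note that this rule spreads the entire bill over the tenants that used something, so idle capacity is charged to them as well; a different policy could report unallocated cost as its own line.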

ADAM GLICK: Fantastic. What is next for Usage Metering?

MADHU YENNAMANI: We are looking at a couple of aspects. Firstly, the beta response has been great, so we're working with our beta customers to make sure this is a great product for everyone. So we are planning to go GA pretty soon. Secondly, we have heard from a number of customers that there is often a big difference between what is requested and what is actually used-- sometimes to the tune of 10 to 20 times.

So we're trying to work on making it easier for customers to correct over- and under-provisioning. And at a broader level, we want to make Usage Metering even more flexible and comprehensive. I can't really say a lot right now, but I am quite excited about what's in the pipeline for later versions.
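The gap between requested and used resources can be expressed as a simple ratio per workload. This is only a sketch of the idea: the workload names, numbers, and the thresholds for flagging a workload are arbitrary assumptions, not anything the product defines.

```python
def provisioning_report(workloads, over_threshold=2.0, under_threshold=1.0):
    """Classify workloads by the ratio of requested to actually used CPU.

    workloads: iterable of (name, requested_cores, used_cores).
    A ratio at or above over_threshold is flagged as over-provisioned; a ratio
    below under_threshold means the workload uses more than it asked for.
    Both thresholds are illustrative defaults.
    """
    report = {}
    for name, requested, used in workloads:
        if used <= 0:
            report[name] = ("idle", None)
            continue
        ratio = requested / used
        if ratio >= over_threshold:
            status = "over-provisioned"
        elif ratio < under_threshold:
            status = "under-provisioned"
        else:
            status = "ok"
        report[name] = (status, ratio)
    return report


# Hypothetical workloads with average cores requested versus used.
WORKLOADS = [
    ("api", 4.0, 0.25),    # requests 16x what it uses
    ("worker", 1.0, 1.5),  # uses more than it requested
    ("db", 2.0, 1.6),
]

if __name__ == "__main__":
    for name, (status, ratio) in provisioning_report(WORKLOADS).items():
        print(name, status, ratio)
```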

CRAIG BOX: Well, we look forward to hearing more about it as it develops.

MADHU YENNAMANI: Great.

CRAIG BOX: Madhu, thank you so much for joining us today.

MADHU YENNAMANI: Thank you. Thank you for having me.

CRAIG BOX: You can find a link to the blog post describing GKE Usage Metering and to Madhu on the internet in our show notes.

Thank you, as always, for listening. If you've enjoyed the show, we really love it when you spread the word, tell a friend, or rate us on iTunes. If you have any feedback for us, you can find us on Twitter at @KubernetesPod or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com to check out our latest episodes as well as read through our show notes and read the transcripts. Until next time, take care.

CRAIG BOX: Hope you're not too snowed in, wherever you are. See you next week.

[MUSIC PLAYING]
