#205 July 27, 2023
“The State of Kubernetes Cost Optimization” is a recent report based on research into best practices for running Kubernetes clusters. If you’re running your workloads as efficiently as possible, your costs will be optimal too. The report reviews the data and offers recommendations on tools and techniques you can use to optimize your Kubernetes clusters. We talk with two of the report’s creators, Fernando Rubbo and Kent Hua, to learn more.
Do you have something cool to share? Some questions? Let us know:
KASLIN FIELDS: Hello and welcome to the "Kubernetes Podcast" from Google. I'm your host, Kaslin Fields.
ABDEL SGHIOUAR: And I'm Abdel Sghiouar.
KASLIN FIELDS: This week, we talk with Fernando Rubbo and Kent Hua about the state of Kubernetes cost optimization, a recent report based on research into best practices for running Kubernetes clusters. But first, let's get to the news.
ABDEL SGHIOUAR: Congratulations to Istio on its status as a graduated project in the CNCF. Istio was accepted into the CNCF on September 30, 2022. Now on July 12, 2023, the CNCF has reaffirmed Istio's maturity by confirming its graduation.
KASLIN FIELDS: On July 20, 2023, the Flux Project announced the general availability or GA release of Flux version 2. Flux is a CNCF graduated project which enables continuous delivery for Kubernetes. In Flux version 2.0.0, GitOps-related APIs have achieved V1 level maturity. The new version also adds horizontal scaling and sharding capabilities to Flux controllers. The Git bootstrap capabilities provided by the Flux CLI and by Flux Terraform provider are considered stable and production-ready in Flux version 2.0.0.
ABDEL SGHIOUAR: KubeVirt 1.0 is released. The open source tool is a Kubernetes operator that allows users to run virtual machines and manage them as Kubernetes resources. The project has been a CNCF sandbox project since September 2019.
KASLIN FIELDS: Pulumi announced the release of version 4.0 of their Kubernetes provider. This new version uses server side apply by default, which improves resource management and removes the dependency on kubectl. It also allows access to the output of the provider in the Python, Go, and Java SDKs. Pulumi is a popular infrastructure as code tool which allows developers and platform operators to manage infra and apps using their preferred programming language.
ABDEL SGHIOUAR: Have you ever wondered if you can run wasm apps in Kubernetes? Well, it's possible. And Wasm Labs from VMware published an article which lists all the tools needed and a step-by-step guide to deploy your first wasm app on Kubernetes. Head to the show notes for the link.
KASLIN FIELDS: The CNCF continues to see its foundation members grow. In the last three months alone, 30 companies, institutions, and non-profit organizations have joined the CNCF. A big majority of them are silver members.
The new members of the CNCF span a wide range of industries. One very interesting end user supporter is a company that makes pillows and beds for the Scandinavian market. Other members are a venture capital firm, a sports company, and the Australian Research Data Commons.
ABDEL SGHIOUAR: VMware announced a self-hosted option of their Kubernetes platform, Tanzu. Customers can use Tanzu mission control to deploy and manage Kubernetes clusters on-prem on top of vSphere. The new option allows customers with low latency or data locality requirements to still leverage Kubernetes for workload orchestration, but on top of their existing IT infrastructure.
KASLIN FIELDS: And that's the news.
All right. Hello, and welcome to the show. Fernando and Kent are our guests today, and Fernando, you are a cloud solutions architect at Google. And Kent, you're a cloud solutions manager at Google. So what does that mean you all do, if you'd like to introduce yourselves a little bit more?
FERNANDO RUBBO: Yeah, yeah. I can start first. Thanks for having us. Cloud solution architects at Google are part of a global solutions team, and we work with customers on real customer challenges. We put together solutions to help customers thrive. And my main focus on this team is enabling GCP customers to optimize their infrastructure and manage their costs better.
KENT HUA: Great. Thanks, Kaslin. Thanks, Fernando. So my name is Kent Hua, and I am a solution manager, as you mentioned. I focus primarily on taking the assets that are developed by our architecture team as well as our technical teams and making them available to both customers and our field-- essentially, the organizations and teams that you engage with when working with Google Cloud. The goal is to help them with, in this particular case, the adoption and optimization of Kubernetes, but also with modernizing application platforms more broadly-- taking services from wherever they are today to where they're going to be in the cloud.
KASLIN FIELDS: When I talk to customers and members of the community, the biggest thing that I'm always hearing people say is, how does everybody else do this? [LAUGHS] Everybody's always trying to learn from each other. So it sounds like in your jobs, you all are some of the best equipped folks to answer that because you're actually talking to lots of different people about how they do what they do.
And that's what we're going to be talking about today, right? You have created a white paper with all sorts of learnings from analyzing different systems and use cases about optimizing Kubernetes, right?
FERNANDO RUBBO: Yeah, yeah, sure. That's exactly what we're going to be talking about today.
KASLIN FIELDS: So we've talked a little bit about the global solutions team. The white paper that we're talking about today, it's called cost optimization is in the title of it. So what does cost optimization mean, and how does that relate to this white paper?
FERNANDO RUBBO: Yeah, the white paper is "State of Kubernetes Cost Optimization." We started this journey even before the pandemic-- not to write this white paper, but because we knew this was an important thing we should be focusing on. Customers were demanding, hey, I need help to optimize my costs on Google Cloud, and the product where that was most prominent was Kubernetes.
And although Kubernetes is an amazing platform, a very interesting platform that lets you do a lot of stuff that was not possible before, Kubernetes is complex. And many companies that come from data centers, that work with VMs, and have tried to deploy on Kubernetes face some cost challenges, because they never had to think about this beforehand. The developers didn't have to think about it because they had a VM; they packaged the application, put it inside the VM, and kicked it off with a startup script.
And right now, they need to think about YAMLs. They need to think about readiness probes, Kubernetes resources, and lots of things they hadn't thought about before. So it's a more complex environment that gives you many more options, but that also drives some misuse of it. We started to see that mainly when the pandemic arrived, which exposed it: lots of people were asking us, hey, help me cost-optimize my Kubernetes clusters because I'm spending much more than I would like.
And at that time, we realized we needed to do something at Google. We needed to stop and think, OK, what do we need to do? Because there were so many people asking for this. And we decided to break this into four pillars.
So two of those pillars we put into the product-- we baked them into the product. Basically, one of those is giving customers visibility. So we created what we call today GKE cost insights and recommendations, which helps them visualize and set the right things for their platform.
GKE also has many unique features that help with the cost optimization process-- for example, Autopilot, node auto-provisioning, and image streaming. These are just a few that we have in the platform. Along with these product features, we also put together what we call the GKE optimization solution, which is basically a lot of internal and external material that helps people self-serve and understand when to use each feature and how to use it correctly. What are the best practices?
This solution also trains the Google teams and partners at scale so that we can give the right support to customers. Like, how do you give tailored recommendations for their current environment? How can we pull the data, analyze it, and help them figure out, hey, you need to do this and this to solve your problem?
So this is basically how we are tackling cost optimization for GKE. But again, the research is not focused on GKE. It's cloud agnostic. That's something we should highlight here.
KASLIN FIELDS: So in working with Google's customers, you've learned all of these things about the ways that they're running their environments. You've turned that into a bunch of stuff within Google, but now you're working on generalizing those concepts for the white paper it sounds like, right?
FERNANDO RUBBO: Exactly.
KASLIN FIELDS: And Kent, would you like to tell us a little bit about how you see cost optimization as a concept relating to the community?
KENT HUA: Yeah. I think a lot of it is really about helping them understand, to what Fernando mentioned earlier, the visibility and observability of what's happening in their environment. And a lot of times, what we encounter in our engagement with customers is essentially, they are still onboarding, some of them. We have a mixture of those that are more mature and those that are less mature as they're coming on board.
The ones that are coming in from virtual machines now all of a sudden have all of these new constructs they have to work with. Organizations say, hey, we need to bring Kubernetes into the environment, or development teams develop applications and say, we need a platform like Kubernetes to run our applications. So teams are forced to build this platform, deploy it into their environment, and go, hey look, it's up and running-- but without fully comprehending, sometimes, all of these knobs and widgets that can be turned.
We got into this a little bit earlier-- GKE provides a lot of these facets and features because it's a managed environment. But even though it's managed, there's still a lot of shared responsibility. Customers are still responsible for how they use these resources and how they configure them. And interestingly enough, a lot of the conversations we get into are really about understanding the fundamentals.
What does Kubernetes provide for me? What do I actually need to understand, tweak, and turn? If you look at the paper that we wrote, it really comes down to this point: if we can't measure it-- which is what we define as the signals-- how are we going to tweak it or measure our progress? And cost optimization turns into not only how do I make my resources more efficient, but how do I understand what's happening in this environment so that I can make the most of the resources provided to me by the cloud provider?
KASLIN FIELDS: So there's a significant element here of just understanding what's going on, and it sounds like you all have had a lot of experience with a huge variety of different use cases, different environments. So how have you coalesced all of that into this white paper? What's in the document?
FERNANDO RUBBO: For the white paper, since we cannot share customer-specific data because of privacy concerns, of course, we decided to analyze an anonymized view of GKE clusters-- all the clusters that we have on our platform. And we decided to segment those into different buckets based on whether they are better optimized for cost or not, so that we can compare what the most optimized GKE clusters are doing to the least optimized.
So what features are they using? Are they using more HPA? Are they able to scale down? Are they using more VPA, or the cost allocation tooling?
So there are so many Kubernetes constructs that we decided to compare. And of course, the report is all cloud agnostic. We are not looking at GKE-specific features. We are looking at which Kubernetes constructs you use, whether you should be using them more or less, and how you compare against the ones that are doing it best.
And an important thing to highlight about this research is that it doesn't look only at cost. The main gist of the research is how you balance cost, performance, and reliability. You need to think about all of these together, and that's why we have one segment that we call at risk. Those are the clusters that have a high risk of reliability and performance issues, and the report covers what they should be doing to overcome the problems they are facing.
KASLIN FIELDS: So you're looking at all of the many different features that Kubernetes itself, as an open source project, provides to help folks run their clusters effectively, and analyzing how folks in different types of situations use them. How did you decide to create your segments? It sounds like understanding which bucket you'd fit into most cleanly might be one of the ways folks would use this resource. So how did you decide to create those buckets?
KENT HUA: We wanted to use the data to help us derive the segmentation. The buckets that Fernando mentioned-- how we differentiate between low, medium, and high-- really come from using a classification tree and quantile intervals to look at the weights of it all. And I think that's really the key here: we're letting the data drive out the behaviors. And as Fernando mentioned earlier, we actually created a separate segment called at risk.
And the reason for this is that in a lot of engagements we've done with customers, we've identified a pattern of the reliability aspect coming into play-- customers seeing reliability issues happening in their environment. Things are not as reliable as they want them to be, and they're trying to figure out what pattern is driving that. So we carved out this segment using some of the data, because this is actually a pretty important segment based on the engagements that we've had with customers.
We'll talk a little bit more about the different segments later, but it's really about segmenting those that are performing anywhere from low to elite, and also the at risk-- workloads that may essentially be running at reliability risk.
KASLIN FIELDS: So you've got all of this data about these different environments, and you know a little bit about how well each environment is working for each group. And you noticed trends within those and created the buckets from that, essentially?
KENT HUA: Yeah. I mean, as I mentioned earlier, the patterns are coming out of what the data shows. We have the signals-- I think we're going to get into the signals in a little bit-- and we're using the data around those signals, letting it drive the patterns we see in the end.
KASLIN FIELDS: Yeah. Like you mentioned, the signals are definitely a key part to say the least, that's probably an understatement, of the results of this white paper, so that's definitely something that I wanted to talk about today. So in the white paper, you call them golden signals. So what are these golden signals, and why are they golden?
FERNANDO RUBBO: Yeah. As the saying goes, you can only manage what you can see. And as we mentioned before, we needed to give more visibility. But there are so many signals. There are so many metrics we could show, and they can tell many different stories.
So what we did at the beginning of this process, like three years ago, working with customers, was ask: how can I show the minimal set of metrics, to executive folks as well as technical folks, that tells them how they are doing and where they should be working? Which teams are responsible for what?
And that's why we decided to put the golden signals together. Golden signals are not an uncommon thing. If you look at SRE, SRE defined four golden signals: if you can measure only four things, those are the signals you should be measuring. DORA, the State of DevOps report, defines the four key metrics. So again, the number four is a good number. And we went the same way, because we also found four signals that are very important to measure so that you can see where you are.
And those signals are the heart of the research. All the segmentations-- whether you are an elite performer, a low performer, or at risk-- are based on the golden signals. We put these golden signals together through engagements with customers, learning and polishing them over time.
The first is workload rightsizing. That's a question you can ask yourself: are your developers setting the right requests for their workloads? Are they requesting 10 CPUs for their application to run, but only using one?
So this is the first signal, and the most important from the research's point of view. Because, as we'll probably discuss more later, some developers are not setting resource requests at all, and that's very important. And workload rightsizing is the place with the biggest opportunity for optimization. So this is the most important metric, and we should be paying the most attention to it.
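To make the rightsizing question concrete, here is a minimal sketch of a Deployment that sets explicit requests and limits. The workload name, image, and values are illustrative assumptions, not taken from the report:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: server
        image: registry.example.com/server:1.0   # placeholder image
        resources:
          requests:
            cpu: "500m"       # what the scheduler reserves on a node
            memory: "256Mi"
          limits:
            memory: "256Mi"   # memory limit equal to request avoids surprise overage kills
```

Rightsizing then means comparing what the pods actually consume, for example via `kubectl top pod`, against these requested values, and adjusting them as the application evolves.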
The second one is demand-based downscaling. The question we ask with this signal is: can you scale down at night, when you don't have lots of requests? That's important, and it requires developers to collaborate with cluster admins, so that developers can scale down their applications and the cluster can scale down its nodes.
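Demand-based downscaling for a workload is commonly expressed with a Horizontal Pod Autoscaler, which shrinks the replica count when load drops so the cluster autoscaler can then remove nodes. A minimal illustrative sketch, with an assumed workload name and thresholds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app      # hypothetical Deployment to scale
  minReplicas: 2           # floor for quiet periods, e.g. overnight
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # computed relative to the pods' CPU requests
```

Note that the utilization target is calculated against the pods' requests, which is one way the signals depend on each other: without sensible requests, the HPA has nothing meaningful to scale on.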
The third one is bin packing. If you think about Kubernetes, Kubernetes is a bunch of nodes. So are you using the entire CPU and memory-- all the resources of each node? Sometimes you are not placing the right pods on the right nodes, and that's why you're not using them fully.
And the last one is discounts-- cloud discounts. This research is made mainly for Kubernetes clusters running in the cloud. So once you're running in the cloud, are you taking advantage of discounts? Most of the biggest cloud providers offer discounts in two big ways. They provide spot VMs, where you get higher discounts on a VM that can be taken from you at short notice, and they provide commitment-based discounts. At Google Cloud, we call these committed use discounts: I commit to using this kind of VM for three years, and in exchange I get something like a 55% discount.
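As a sketch of the spot side of those discounts: on GKE, a fault-tolerant workload can be steered onto Spot VMs with a node selector in its pod template. The label below is GKE's; other providers use their own node labels, and the surrounding workload is a hypothetical example:

```yaml
# Fragment of a pod template spec for a batch-style, interruption-tolerant workload
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # schedule only onto Spot VM nodes
  containers:
  - name: batch-worker                  # hypothetical container
    image: registry.example.com/worker:1.0
```

Because spot capacity can be reclaimed at short notice, this pattern fits workloads that checkpoint their progress or can simply be retried.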
So those four signals put together are going to tell you how you are doing, and you can also infer who the right people to work on each one are. Discounts are more for platform admins and budget owners, while workload rightsizing is more for developers and platform admins. So you can figure out where the problem is and which teams you should be working with.
KASLIN FIELDS: When I talk to folks who are trying to learn about best practices for running Kubernetes clusters, one of the things that I'm always saying to them is, set your resource requests and limits. And it sounds like that is a major component of this white paper, as well. You mentioned workload rightsizing-- understanding how much your workload is actually using, and setting your resource requests and limits accordingly. But realistically, a lot of the time folks honestly don't know how much their workloads are using.
It's really hard to get good benchmarks on a lot of applications, especially with one of your other key elements here of demand-based downscaling. So different times of day, your application is doing things all over the place. So I'm going to ask you all about your favorite things that you've learned from the research. But before we started talking, I mentioned my favorite thing that I was learning as I was reading through.
You were talking a little bit about what I would call an effect that folks might not realize they're getting by not setting their resource requests and limits. In the white paper, let me see if I can quote it here, "in our exploratory analysis, cluster owners shared that they either didn't know that they were running large amounts of best effort and memory under provisioned burstable pods, or they didn't understand the consequences of deploying them." You mention in the white paper that if you don't set your resource requests and limits, Kubernetes is going to treat your pods as best effort. If they need to be preempted because the node's getting too full, it'll get rid of them.
So could you talk a little bit more about some of these unintended consequences of not setting your resource requests and limits before I dive into your favorite parts of the white paper?
FERNANDO RUBBO: Yeah. If you think about it, basically everything in Kubernetes is based on resource requests. Scheduling is based on resource requests. Bin packing calculations are based on resource requests. Workload scaling is based on resource requests. Cluster scaling is based on resource requests. Cost allocation-- most cost allocation tools are based on resource requests. Basically, everything is based on resource requests.
When you don't set them, Kubernetes is essentially blind to what's going to happen. Kubernetes doesn't know how much you need, so it drops that pod onto a node, and the pod starts consuming that node's resources. And the problem, beyond that blindness, is that Kubernetes can also kill your pod if it's using more than what you have requested.
If you haven't requested anything, it can kill your pod. So basically, when your node is getting full-- fully utilized-- all pods that are using more than what they have requested are prone to be killed, without any eviction grace period. If a pod is running something, it can be killed right away.
So that's the main reliability point of not setting requests. Of course, if you run your cluster five or 10 times overprovisioned, that reduces the chances of that happening. But it can still happen, because one node of the cluster can, once in a while, get full, and the pods that haven't requested anything can be killed.
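The eviction behavior Fernando describes maps onto Kubernetes QoS classes, which are derived from the requests and limits you set. A sketch contrasting the two extremes (names and images are placeholders):

```yaml
# BestEffort: no requests or limits anywhere in the pod.
# Among the first candidates for eviction when a node runs out of resources.
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
---
# Guaranteed: every container has limits equal to requests.
# Evicted only as a last resort under node pressure.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0
    resources:
      requests:
        cpu: "250m"
        memory: "128Mi"
      limits:
        cpu: "250m"
        memory: "128Mi"
```

Pods that set requests lower than limits fall into the middle Burstable class, which is the memory under-provisioned case the white paper quote above calls out.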
KENT HUA: I think there's one element around this-- the requests you mentioned. The individual signals have dependencies on each other, and resource requests are a part of all of them. I think one of the interesting ones is bin packing. Because with bin packing, if you don't set requests, that usage doesn't show up, and the cluster looks overprovisioned.
And you start to look at your signals. If you happen to look at that bin packing signal, you think, hey, I'm only 50% bin packed. I don't really need all of these nodes. I'm going to start tearing down some nodes, because I'm going to save money.
And the consequence of that is, now all of a sudden, the workloads that did not set requests no longer have the extra headroom I had provided. When those nodes go down, our workloads no longer have the CPU and memory they need to actually do the work we intended.
So that impacts our reliability, but also potentially our performance, because I don't have as many pods running as I would desire. And that brings the whole construct back to Kubernetes-- pod disruption budgets and all of these factors that ensure our workloads run as reliably as possible. We can kind of get away with it on CPU, because CPU can be throttled.
Now, that throttling impacts the response time, but with memory, it is what it is. I'm not going to be able to share that memory. So to Fernando's point earlier, those workloads are going to get evicted, or OOM-killed, depending on the type of workload we're running. It's very interesting how these all correlate back to resource requests: I think I'm going to save money, but the consequence is potentially reliability. And at the end of the day, the response time of the application itself can negatively impact our end user experience.
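The pod disruption budgets Kent mentions put a floor under voluntary disruptions, such as node drains triggered by scale-down. An illustrative sketch, with assumed names:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
spec:
  minAvailable: 1        # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app: example-app   # hypothetical workload label
```

A PDB does not protect against the involuntary OOM kills discussed above; it only limits how aggressively the platform may drain pods during planned operations.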
KASLIN FIELDS: Which is one of the key things about cost optimization for me. It's about optimizing your clusters to run optimally at the optimal cost. It's not about reducing your costs to the point where you're reducing your efficiency. Kind of defeats the purpose.
And something I find so exciting about this, and why it's my favorite part of the white paper, is that over my career, I've spent a lot of time going through the Kubernetes documentation because that's a great way to learn about Kubernetes. But as you go through the documentation, you'll read about resource requests and limits. And then you read about all of these other bits of Kubernetes, and maybe there'll be a footnote, or maybe they're not even really mentioned there. But when you put it all together into this view of here's how this actually works in the real world, you see things like this, which is why I'm excited about this white paper.
So that was my favorite part of it, and I know it's probably the best part. But [LAUGHS] what are your favorite parts of the white paper, or of your research?
FERNANDO RUBBO: May I make a comment before we jump into the--
KASLIN FIELDS: Please, please.
FERNANDO RUBBO: Because I do think there is lots of room for improvement in the Kubernetes platform. I do think that in the long term, people will not have to worry about this anymore. In the long term, I believe people will be able to just drop in their application, and the platform will decide what's best for the application to run. But right now, it is what it is. So we need to help people make their platforms more reliable, more performant, and more cost efficient. And for that, they need to worry about these things.
They need to understand the consequences of not setting requests, for example. And they need to understand when they should be doing that or not, because there are cases where it's good not to set requests-- we talk about that in the paper. Until that time comes, we need to understand these things better in order to move forward.
KASLIN FIELDS: Kubernetes started out as this platform for building platforms, as we like to say, with all of these different little tools within it. But I think a lot of folks have predicted, as you're saying here, that over time Kubernetes should get easier and more streamlined at understanding your workloads and how to run them effectively, essentially, in a distributed system.
KENT HUA: Yeah. I think a lot of it is the instrumentation-- how are these metrics exposed, and how do organizations use these metrics to make decisions? As Fernando mentioned earlier, I can't react if I don't have anything to measure, or without setting goals and understanding which signals are available to me. Kubernetes emits a lot of signals. It's more a matter of how I narrow down which metrics make sense.
And if we go back to the golden signals, it's really about, these are the four that we think, at this point in time, are really helpful to help customers understand what's happening in their environment to measure progress, and having these signals gives them at least a starting point. There's probably going to be other signals for organizations that make a lot more sense, as well, but this is a good starting point for a lot of organizations.
KASLIN FIELDS: That's also definitely one of the top challenges I hear from people is there's just so much data coming at me. I don't know what to look at. [LAUGHS] I need better alerts, and clearer understanding, and messaging to understand which signals are actually important. So that's kind of what you've done here with this white paper. So back to the question, what are your favorite parts of the research that you've conducted?
KENT HUA: It's always funny, because you mentioned the area around requests-- that was the first one. Sometimes when we have these engagements with customers directly, we do share some information about their environment. And we say, hey, do you realize that a good percentage of your workloads are best effort, or potentially at reliability risk? And when we share that, it's a shocker for them.
So actually, there's a balance. We have some that go, wow, I have that many? Is this why my application behaves this way? We also have those that do it consciously-- to Fernando's point earlier, there are good reasons for doing some of this.
I think it's more about understanding, from both the platform team perspective and the developer perspective-- and I know people in organizations wear multiple hats and fill multiple roles. At the end of the day, whichever team is responsible for setting requests needs to understand their workload, how Kubernetes reacts to and uses the information provided to it, and what happens when it says, I don't have any more resources to give you. How does that fit into my application?
Because that was one of the big benefits of Kubernetes-- being able to overprovision, being able to maximize my machine by filling it with other workloads. Taking advantage of that means understanding which workloads I want to do this for. So it's about separating the different workloads, and talking with customers about which workloads make sense to overprovision, for instance, and which workloads are critical to my business, where I need to ensure they get the resources they need.
So I think my favorite thing is engaging with customers, understanding them, and seeing that come to life in the research we did. We talked before about how the research was about taking the data and letting the data drive what we see, and then watching that correlate. It's almost like, we talk to customers, and hey, this matches. Because at the end of the day, we talk to customers across the spectrum-- some of them, as I mentioned earlier, just starting to adopt, and some of them very mature.
And even with the mature ones, we always see, hey, there's a little nugget that was missed. Being able to have that conversation with them and help them-- that is an element of what we're trying to drive with the paper, as well. There's always a little bit for someone to improve. And at the end of the day, maybe there's another revision of this white paper, with more and more feedback, to understand what's happening in the community.
KASLIN FIELDS: So speaking of customers, the white paper is kind of this collection of all of this analysis that you've done and all of these different experiences that you've had. But what are some of your favorite things that you've learned from working directly with customers?
FERNANDO RUBBO: I don't think this is my favorite thing, but it's very prominent in my mind. When we started this journey more than 3 and 1/2 years ago, we used to see, in external third-party reports and in talking to customers, that executives were very interested in cost optimization. They were coming to the cloud and saying, hey, I'm spending more than I was planning to, so cost optimization is very, very important for me.
However, we didn't see that at the technical level. Think about developers. Developers want to ship their features; usually they don't care about cost optimization. As one person told me once, for developers and platform teams, it's like insurance-- nobody likes it, but you need to pay for it.
So that's how it was in the past, but that has changed. Comparing three and a half or more years ago to right now, we are getting more and more platform teams very engaged in this. They are creating platforms on top of Kubernetes to control things, to create guardrails, to provide recommendations. Because this is a continuous discipline.
So the resource request you set today for your application may not be enough tomorrow-- a new deployment may require more, and you need to rightsize again. This is continuous work that you should keep up, and platform teams are the ones most interested.
But we are also seeing some developers who want to follow this. I don't know-- probably that's a top-down thing that's coming. But we do see more movement on the technical side than we used to see before.
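The rightsizing Fernando describes is often implemented by recommending a request from a high percentile of observed usage plus some headroom. Here is a minimal sketch; the percentile, headroom, and sample values are illustrative assumptions, not numbers from the report:

```python
# Sketch: derive a CPU request recommendation from observed usage samples.
# The 95th percentile and 15% headroom are illustrative choices, not
# values prescribed by the report.

def recommend_request(usage_samples_mcpu, percentile=95, headroom=0.15):
    """Return a suggested CPU request (millicores) from usage history."""
    ordered = sorted(usage_samples_mcpu)
    # Index of the requested percentile (nearest-rank method).
    rank = max(0, round(percentile / 100 * len(ordered)) - 1)
    return round(ordered[rank] * (1 + headroom))

# Hypothetical 5-minute CPU usage samples for one container, in millicores:
samples = [120, 150, 140, 300, 180, 160, 155, 170, 210, 145]
print(recommend_request(samples))  # a request comfortably above typical usage
```

Rerunning this as new usage data arrives is exactly the "continuous discipline" being described: yesterday's recommendation goes stale as the workload changes.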
KASLIN FIELDS: I think that cost optimization, just like everything else in technology, has a naming problem, which is, like you said, the executives care a lot about cost and want to make sure that costs are as low as we can get them, whereas engineers care about optimization. They care about optimizing their workloads to run efficiently, or maybe they don't even care about that. They just want to get the feature out there. But at least that part will be a little bit more interesting to the technical folks.
So you've got this term of cost optimization bringing the two together, but I feel like folks still misunderstand it. But Kent, do you have a favorite?
KENT HUA: Yeah, I think it comes down to interpretation. Like, what is cost optimization? What does it really mean to my organization?
Because a lot of times, the best part for me is actually seeing the aftermath. If I engage with a customer, what actually happened three months later? I think the most positive experiences are the ones where they took the signals to heart, started using them as a way to measure progress, and realized that there is room. I can still save money in this particular case, but I can also reuse the resources.
As developers, we always say, yeah, give me as many resources as possible so I can be productive, and I'm not giving them back-- that's what I'm always hearing. And the "not giving it back" here is that now I can actually deploy more applications. I have this platform that's available to me. I understand what this platform is able to offer me, and I understand what my applications need.
With that combination, to Fernando's point, I can deploy more applications, hopefully have insights into those applications, and really see where I've saved on my previous resources. Sometimes I'm not saving, because I'm realizing that I actually need more resources for this application to be reliable. But three months later, you see their progress through these signals, and they're able to deploy more because they have more resources to work with-- even though, between the first month and the third month, it's actually the same amount of resources. They just have the insights to maximize. That's the key term here: maximize their resources.
And their applications, at the end of the day, are still reliable, and the end users are happy. The applications, the end users, and the organizations, with all of this new information, hopefully are able to focus on that next-generation experience. Everyone wants to innovate, and now, hopefully, through this process of understanding the signals, they are able to make more informed decisions across the different teams within an organization.
It's, I have this information. My developers understand which signals are important to them. My platform teams, my operators-- everyone is familiar with them and speaking the same language, so that the organization can progress further.
KASLIN FIELDS: I can just see the room full of engineers. How many of you all use resources in your day-to-day jobs? Everyone's hand goes up. How many of you give those resources back when you're done with them? All of the hands come down. [LAUGHS] We'll not talk about where my hand was in all of that. [LAUGHS]
So the result of this white paper, from what you're saying, sounds to me like this: if we can understand how things are going, if we can watch these signals, we can get a better sense of how to manage these things and run our clusters more optimally.
FERNANDO RUBBO: Exactly.
KASLIN FIELDS: So what's your top advice for listeners who are out there listening to us talk about all of this and are trying to optimize their own Kubernetes clusters for their teams or organizations?
FERNANDO RUBBO: At least my top advice for you listening to this podcast is, of course, read the white paper. Take a look at the key findings. If you don't want to read the whole thing-- it's a very long one-- we have an executive summary. At least read the executive summary.
If you think even the executive summary is too big for you, from now until the end of the year we're going to be publishing blog posts, roughly every two weeks, each covering one finding and what you need to do to deal with it. So we'll be releasing these small pills on how you should handle each of the key findings.
My second piece of advice is: measure. The golden signals are there. If your platform provides them out of the box, use them, look at them, and follow them. Make sure that you're improving over time.
And remember, this is a continuous discipline. So you should be tackling it regularly. You don't need to do one big push-- let's optimize everything and then stop. Do a little bit per day, and don't sink all your time into a single shot.
And the third is, of course, set requests. As I said before, I believe in the future Kubernetes will be in a different position, and maybe you won't need to think about this. But today, requests are important.
So you start by setting requests. Then you go for rightsizing. And then demand-based downscaling, bin packing, and cloud discounts. So that would be my three, at least.
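Bin packing, one of the steps Fernando lists, only works if requests are set, because the scheduler places pods onto nodes according to their requests, not their live usage. A toy first-fit-decreasing sketch, with made-up node and pod sizes, just to illustrate the mechanism:

```python
# Toy first-fit-decreasing bin packing: place pods onto as few nodes as
# possible based on their CPU *requests* (millicores). Real schedulers
# weigh many more dimensions; this only shows why unset requests break
# packing -- a pod with no request can't be placed by this logic at all.

NODE_MCPU = 2000  # hypothetical allocatable CPU per node

def pack(requests_mcpu):
    nodes = []  # each entry is the CPU already committed on that node
    for req in sorted(requests_mcpu, reverse=True):
        for i, used in enumerate(nodes):
            if used + req <= NODE_MCPU:
                nodes[i] += req  # fits on an existing node
                break
        else:
            nodes.append(req)  # no node fits; provision a new one
    return nodes

pods = [500, 1500, 700, 300, 1000]
print(len(pack(pods)))  # number of nodes needed for these requests
```

Oversized requests inflate the node count (and the bill) the same way oversized items fill bins, which is why rightsizing comes before bin packing in the list.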
KASLIN FIELDS: It's feeling to me like you're saying, hey, you out there, Marie Kondo your clusters. [LAUGHS] I feel like there's some similarities here.
KENT HUA: There's room. There's room to be organized, potentially, for a lot of these organizations.
KASLIN FIELDS: So Kent, what are your top pieces of advice for those out there taking this Marie Kondo journey?
KENT HUA: I think a lot of it is going to come down to organizations defining what the best practices are. As I mentioned earlier, teams need that common language to speak to one another-- that's what DevOps was supposed to do, bring everyone together culturally. This whole FinOps, or cost optimization, movement is again about how we all need to speak the same language. And the common goal is to optimize our environment without sacrificing reliability and performance in the process.
And while it certainly helps to have mandates come from above-- an executive saying, hey, we're going to do this-- it also helps to have teams and individuals in different roles coming together for that common goal and having the conversation: how do we optimize this? I think the other piece, as teams come together, is how do we provide guardrails? How do we prevent some of these problems from occurring?
We talked a lot about shift-left security-- moving it from the platform side to the development side and focusing on the security journey. Cost optimization should have a very similar focus: shifting left, but also shifting down-- looking at what developers can do, as well as what platform teams, platforms, and cloud providers can do to expose information so that organizations can take action.
These guardrails help us. For instance, requiring requests to be set on my workloads is one opportunity here. Now, as I mentioned earlier, some teams deliberately run best-effort workloads because they want to use the headroom that's available on my nodes. So it's about at least having a structure to guide this: what is the happy path for a workload to go from development all the way to production in a running cluster? Has it gone through all of the checks and balances before we deploy it?
And it's about being able to explain, if I do want to run workloads without requests, why? So it's not just, I'm doing it because I don't know what I need. It's, I know what I need, and I'm doing this because I have a certain SLA, I'm able to meet it with this particular setting, and I can take advantage of the organization's leftover resources.
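The "require requests" guardrail Kent mentions is typically enforced by an admission policy in the cluster. A minimal stand-in for that check over a pod spec; the dict shape mirrors the Kubernetes API, but this is an illustration, not a real admission webhook:

```python
# Sketch of a "requests must be set" guardrail -- the kind of check a
# validating admission policy would enforce at deploy time. Plain dicts
# stand in for the real Kubernetes objects here.

def missing_requests(pod_spec):
    """Return names of containers lacking a CPU or memory request."""
    offenders = []
    for container in pod_spec.get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests or "memory" not in requests:
            offenders.append(container["name"])
    return offenders

pod = {"containers": [
    {"name": "web", "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
    {"name": "sidecar", "resources": {}},  # best-effort: no requests set
]}
print(missing_requests(pod))  # → ['sidecar']
```

A real policy could allow an explicit exemption label for the deliberate best-effort case Kent describes, so the "why" is recorded rather than implicit.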
And I think the last piece-- I mentioned that journey of seeing them at the first month and then the third or sixth month-- is that it's a continuous journey. It is not a one-time thing. Fernando mentioned that earlier as well.
I'm not going to set it and forget it. I'm going to set it, then measure it and actually tune it to make sure my resources are right. Because from an infrastructure perspective, cloud providers' hardware is getting better over time. If the hardware is improving, my workloads should be able to take advantage of those improvements, and I can adjust my resources as time passes.
KASLIN FIELDS: And one thing that I'm always hearing from engineers trying to take on these innovative practices within their own organizations is, how do I convince X that this is a good idea? [LAUGHS] So it sounds like with these signals, we have the signals that we need to be able to say, here's what's going on, and here's what we need to do. And that helps folks to communicate within the organization.
I've heard it both ways from executives communicating down to engineers communicating all the way up the ladder. And as long as you have this data to back you up, it makes those conversations a lot easier to have.
FERNANDO RUBBO: Exactly. That's exactly how we started. We had trouble talking to executives and showing, hey, here's how we're doing. The signals helped us then, and they'll probably help you if you start tracking them.
KASLIN FIELDS: I hope so. I hope lots of you out there learn a little bit more about these signals, maybe read the white paper, and are able to have those conversations within your own organizations. Thank you so much, Fernando and Kent, for being here with me today. I've enjoyed learning about all of this stuff. I hope that you have, too, and how about we close it out together by telling everyone to set their resource requests and limits?
FERNANDO RUBBO: Good one. Good one, Kaslin. Good one. Thank you for having us.
KENT HUA: Thanks, everyone. Thanks for your time.
KASLIN FIELDS: Thank you so much.
FERNANDO RUBBO: Bye bye.
ABDEL SGHIOUAR: Well Kaslin, thank you very much for that interview. It was pretty interesting.
KASLIN FIELDS: Cost optimization is one of those things that I find very interesting. Because folks always want best practices, but there are different ways you can frame it.
ABDEL SGHIOUAR: Yeah, I think that's one of those things about Kubernetes generally as a platform. A lot of companies are offering it, but it's very hard for platform providers to give a prescriptive set of best practices and say, this is how you should run your stuff. Because every customer is different, and every workload is different.
And reports like this-- based on research, from which we draw a bunch of patterns and then make some high-level recommendations-- I think are very important for people to read. Because we're not going to come in and say, OK, turn on this option, turn off that option, change these two dials, and you're done, right?
KASLIN FIELDS: Yeah. Best practices are not just a checkbox.
ABDEL SGHIOUAR: Exactly.
KASLIN FIELDS: You need to understand where they're coming from in order to understand if they're actually a best practice for you.
ABDEL SGHIOUAR: Yeah. And actually, something that-- I don't remember if it was Fernando or Kent-- mentioned at the end: it's a continuous process. You're not going to turn a couple of options on and off and be done. You have to continuously change things, monitor, change, monitor, et cetera.
KASLIN FIELDS: Yep.
ABDEL SGHIOUAR: But before we go there-- and you mentioned this in the interview-- cost optimization, like anything in IT, is probably the wrong term for this. Because it doesn't really represent what all of this is about, in a way. A lot of times, when people think about cost optimization, they just think, oh, how can I pay less? But it's obviously about more than just that, right?
KASLIN FIELDS: Yeah, it's all about the value. It's about making sure that you're running all of your workloads on Kubernetes optimally. And if they are running optimally, then your costs will be optimal.
ABDEL SGHIOUAR: Exactly. And that's the optimization part of the cost optimization term.
KASLIN FIELDS: Yeah. So it's really about best practices. It kind of depends on your role and where Kubernetes sits in your role. If you're someone who is a platform engineer, and you work with Kubernetes, and your goal is to make sure that you are running applications and workloads for the teams that you support effectively, then you care about it more as a best practices kind of thing. Whereas if you are a decision maker, you probably care more about the cost side of things. So cost optimization is kind of trying to represent both of those, but I don't know that it does it very well.
ABDEL SGHIOUAR: Yeah, and this was also mentioned in the interview-- making developers more aware of what they're doing, like surfacing cost information to developers. Well, there is another term people use for it which I really don't like: the shift-left movement.
Actually, I've had a lot of interesting conversations at the conferences I go to, because I usually just walk up to people and ask them, what does shift left mean to you? And a lot of times, when you ask developers, they just say, oh, that means more work.
KASLIN FIELDS: More work.
ABDEL SGHIOUAR: I have to do more work, right? So it's a term that really people don't like, but--
KASLIN FIELDS: And they're not wrong.
ABDEL SGHIOUAR: They're not wrong, yeah. But I think it's more about visibility. How can you make developers aware of what they're doing?
KASLIN FIELDS: I like platform engineering's approach to that a lot better-- self-service. Rather than saying we put these responsibilities on people, we say we enable them. We empower folks to serve themselves with what they need to get done. I feel like that's the better approach.
ABDEL SGHIOUAR: Yeah, that's actually very interesting. The overall outcome of the report is these four main signals, as they call them, which are eerily similar to the SRE book or to the DORA research. So somehow four is like a magic number.
KASLIN FIELDS: Yeah. One of these days I need to actually read that book.
ABDEL SGHIOUAR: Which one?
KASLIN FIELDS: Feel like I learned so much about it from just talking to people-- the SRE book. I've never actually read the book itself, but yeah.
ABDEL SGHIOUAR: Well, there are three of them now, right? So they have actually released three.
KASLIN FIELDS: That's a lot of time to set aside.
ABDEL SGHIOUAR: Yeah, it is. And they are not small. But yeah, it's interesting that at the end of the day, it's all about visibility, optimization, setting your resource requests and limits, which I think we cannot talk about enough.
KASLIN FIELDS: Set your resource request limits.
ABDEL SGHIOUAR: Yeah, exactly. And then there were two other ones. I forgot what they were.
KASLIN FIELDS: Yeah. You know, the observability point kind of surprised me in it, honestly. That's not the first thing that comes to my mind when I'm thinking about how am I going to run my workloads optimally in my Kubernetes cluster. It probably should be honestly, because really, how am I going to know that I'm running them optimally should be the next question.
ABDEL SGHIOUAR: Yeah, yeah.
KASLIN FIELDS: But so much of the recommendations in this report are making sure that you know what's happening in your cluster. And then it has recommendations, of course, for depending on what you're seeing, here's how you should do it. But really, the first step is understanding your cluster, which should be obvious.
ABDEL SGHIOUAR: I think a big part of it is also the fact that a lot of the tooling you actually need for observability is usually a second step in your Kubernetes journey. So you start by creating your Kubernetes cluster.
KASLIN FIELDS: Right, exactly.
ABDEL SGHIOUAR: And then you have to deploy this extra set of tooling. Then you deploy your app, and then you have the benchmark, or the baseline, for your observability, right?
KASLIN FIELDS: Yeah. That's how I usually talk about it with folks is here's all this stuff about Kubernetes. It's great. It does all of these wonderful things, and at some point you're going to have to worry about all of these other things that are not within Kubernetes itself that are really important. So I feel like that's probably why the observability piece is kind of like, oh, right, people need to worry about that.
ABDEL SGHIOUAR: Yeah. Prometheus, and Grafana, and all of this stuff, right? I think we've started to enable the managed Prometheus collectors out of the box on GKE, at least. Starting with some version-- I don't remember which one-- when you create a Kubernetes cluster, the operators and all of that are already pre-deployed. So you don't have to worry too much about it, right?
KASLIN FIELDS: Yeah. I heard someone talking about that recently. Someone was asking, well, why did we do that? And one of the program managers that we work with gave a really, I think, wonderful answer to that, which was that just having those pieces already installed in the cluster makes it a lot easier if you want to use them.
ABDEL SGHIOUAR: Yes. The fact that they are installed doesn't mean you will be charged for them. When you start using them, that's when you start being charged for them. And it's all about this frictionless utilization path, or as we call it, critical user journey.
KASLIN FIELDS: Making it more frictionless to observe.
ABDEL SGHIOUAR: Yeah. And observe beyond kubectl top, which is one of those commands that can tell you what's going on right now. Because you need observability over time, not at one specific moment.
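Abdel's point about going beyond a single snapshot can be shown with numbers: one `kubectl top`-style reading during a spike looks very different from utilization tracked over a window. All values below are made up for illustration:

```python
# Why a single snapshot misleads: utilization has to be judged over time
# against the request, not at one instant. All numbers are made up.

REQUEST_MCPU = 1000
# CPU usage samples (millicores) over a window, with one brief spike:
samples = [80, 90, 100, 110, 950, 95, 85, 100, 90, 100]

snapshot = samples[4]                  # an unlucky point-in-time reading
average = sum(samples) / len(samples)  # the picture over the whole window

print(f"snapshot: {snapshot / REQUEST_MCPU:.0%} of request")
print(f"average:  {average / REQUEST_MCPU:.0%} of request")
```

The snapshot suggests the request is nearly exhausted, while the window shows heavy overprovisioning most of the time, which is the signal that actually drives a rightsizing decision.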
KASLIN FIELDS: Yeah, that's not useful for a realistic use case.
ABDEL SGHIOUAR: Yeah. Well, thank you very much. I learned quite a lot. So now I have to go back and read the report, actually.
KASLIN FIELDS: Yeah. There's a lot in there.
ABDEL SGHIOUAR: Yeah. We'll leave a link in the show notes for our audience.
KASLIN FIELDS: Yeah, definitely recommend at least reading the executive summary. It's really nice and goes over a lot of the findings very well, I think.
ABDEL SGHIOUAR: Yes. Thank you very much, Kaslin.
KASLIN FIELDS: Yeah. Thanks so much for joining us.
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email at <firstname.lastname@example.org>. You can also check out the website at kubernetespodcast.com where you'll find transcripts, and show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.