Kubernetes Podcast from Google: Episode 111 - Scalability, with Wojciech Tyczynski

#111 July 7, 2020

Scalability, with Wojciech Tyczynski

Hosts: Craig Box, Adam Glick

Before Kubernetes was launched, it could have at most 25 nodes in a cluster. At 1.0, the target was 100. Meanwhile, Borg, Omega and Mesos were all running away at 10,000. What did it take to get Kubernetes to this number, and above? SIG Scalability and GKE Tech Lead Wojciech Tyczynski tells us.


Chatter of the week

Follow-up:
- Chairs, from Episode 107
- Christmas trees, from Episode 104
- Kids music

News of the week

Links from the interview

Omega
- Episode 43, with Brian Grant
Defining scalability
Original SLOs
- API-responsiveness: 99% of all our API calls return in less than 1 second
- Pod startup time: 99% of pods (with pre-pulled images) start within 5 seconds
Target SLO doc - 25 nodes
Borg - ~10,000 nodes
Sep 2015, Kubernetes 1.0 - 100 nodes
- “Kubernetes Has A Ways To Go To Scale Like Google, Mesos” by Timothy Prickett Morgan
March 2016, Kubernetes 1.2 - 1,000 nodes
July 2016, Kubernetes 1.3 - 2,000 nodes
- Work by Clayton Coleman, guest of Episode 85
March 2017, Kubernetes 1.6 - 5,000 nodes
etcd v3 improvements for web scale
Scalability Envelope
Today’s scalability numbers
EndpointSlices
- Episode 104, with Bowei Du
JD.com’s 10,000 node clusters
Alibaba’s 10,000 node clusters
- Episode 95, with Xiang Li
Google’s 15,000 node GKE clusters
Twitter session at the upcoming Google Cloud Next by Reza Motamedi and Maciek Różacki
Poseidon and Firmament
Wojciech Tyczynski:
- GitHub
- LinkedIn

Transcript


ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

CRAIG BOX: A lot of my favorite weekly podcasts do a follow-up at the beginning where they talk about things they've talked about in previous episodes, and a couple of things over the course of the last week have led me to think, we should do a little bit of follow-up.

ADAM GLICK: What interesting details are you sitting on right now?

CRAIG BOX: Well, you set that one up neatly. [CHUCKLES] People may remember that we were talking about office furniture a couple of weeks ago. And I mentioned that I have a chair with one arm that had fallen off somehow in a fit of strength in the past. And whilst on a meeting with a colleague in the US last week, I was leaning on the second arm. And then all of a sudden, I was on the floor.

[LAUGHTER]

ADAM GLICK: So at least it's symmetrical now, right?

CRAIG BOX: It is. It took a little while. I was sort of trying to lean on the ghost arm for a couple of days. But I've adjusted, and I think it's probably a little bit better than it was before.

ADAM GLICK: Ah, it's a little Christmas in July gift for you, I guess?

CRAIG BOX: Well, you're just setting me up for everything here. [ADAM CHUCKLES] Second thing that I wanted to follow up on was a conversation we had about Christmas trees and how you find them out of season in inappropriate times and places. In the Southern Hemisphere, a lot of people celebrate a midwinter Christmas at the end of June. However, there's no excuse here in the UK to find a Christmas tree at the end of June, six months after Christmas, just dumped on the side of the road. I don't know why it's there, and it must've been sitting around someone's backyard for a few months. Why?

ADAM GLICK: Did it still have tinsel and ornaments on it? Or is it just the--

CRAIG BOX: No, it still had needles, but they were brown. [ADAM CHUCKLES] It lost any of its color a long, long time ago.

ADAM GLICK: These are clearly the mysteries of our time.

CRAIG BOX: We should follow up with you then, Adam. What's happening in the world of kids' music?

ADAM GLICK: Ah, yes. I love digging into kids' music-- I've mentioned on this show before. And there's a wonderful song linked in the show notes, "The Duck Song." And it's this clever little story which parallels a joke, as you reminded me, of a-- the slightly more kid-friendly version of it is this song.

And there is a trap music artist out there who basically is doing his version of Auto-Tune the News, if you remember those folks. And he created a trap version of that song. And it just is-- it's a really fun and surprising mix for it. There's a link in the show notes of it. But it's a pretty catchy beat that he's thrown behind it and shown that you can turn almost anything into good music.

CRAIG BOX: "Good," of course, being subjective. I think that might be one of those they were too busy thinking about whether or not they could do something to think whether or not they should.

ADAM GLICK: [CHUCKLES] Perhaps. Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: The CNCF has announced that KubeCon US has become the latest conference to go virtual. The event will take place on the original dates, November 17 to 20. And the registration fee is now $75. No word yet on what time zone the event will operate in. So East Coasters, you'll probably get to sleep in. If you're interested in being a speaker, the call for proposals has been extended until July the 12th.

The schedule for the Prometheus PromCon has also been released, which is happening July 14 to 16, with registration opening soon.

ADAM GLICK: Looking to move your Java or .NET application into a container in AWS? Amazon has released App2Container, a new command-line tool that takes Java and .NET applications and moves them into containers stored in AWS's Elastic Container Registry. A2C is similar to Migrate for Anthos, which we talked about in episode 48, though it currently only works with ASP.NET apps on Windows and Java apps running on Linux. The tool is available today and is free of charge.

CRAIG BOX: New features in Google Kubernetes Engine this week-- the NodeLocal DNSCache feature, as discussed in episode 106, is now generally available. There is also a new feature which allows you to specify custom settings for the kernel and the kubelet on your GKE nodes, available in Beta.

ADAM GLICK: This week's Azure Kubernetes Service release introduces Kubernetes 1.17 as GA and brings 1.18 into preview. It also adds containerd as a supported runtime, which will eventually replace Docker when the feature goes GA. Microsoft also added support for Azure Priority Placement Groups for AKS nodes.

CRAIG BOX: Diamanti has launched version 3.0 of their Spektra-- with a K-- Kubernetes platform. Focal points of this release are hybrid deployments, disaster recovery, and a single control plane across multiple clusters. The announcement references upcoming management of cloud-based clusters in Google Cloud, AWS, and Azure, though no date is provided for that functionality. Policy management for multi-tenant environments is also a part of the release.

ADAM GLICK: A new working group has been formed to focus on language and naming choices in the Kubernetes project. WG Naming aims to remove barriers to contribution and adoption by replacing harmful language with neutral terms, including language linked to racism, as well as replacing idioms and slang specific to the English language. The temporary group will build processes and lists of terms to avoid and establish a timeline for replacement of component names before their dissolution.

CRAIG BOX: The CNCF is introducing Cloud Native Community Groups. The concept grew out of a merger of the almost 200 existing Meetups and the Community Days events that the CNCF was scheduling before COVID-19 ruined everything. The service moved from Meetup.com to a platform run by Bevy. And there's an incentive: the CNCF is offering a one-time complimentary swag certificate to their store for people running a community group, as well as cost coverage for events hosted on the new platform.

ADAM GLICK: The CNCF Storage SIG has updated their storage landscape white paper. The update covers areas including the attributes to consider when choosing an overall storage solution, the different types of storage and caching, different database types and the benefits and challenges with each, as well as ways to interact with storage systems from Kubernetes. The update builds on the original storage white paper, which was released in December 2018 at KubeCon Seattle.

CRAIG BOX: Finally, Presslabs claims to be the first managed WordPress hosting platform running on Kubernetes in a blog post which talks about their new architecture. Their hosted platform, running on Google Cloud, runs code based on their Presslabs Stack and WordPress Operator, two open source projects that you can find on GitHub. The post lays out a large number of pros and a small number of cons for the migration to Kubernetes.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Wojciech Tyczynski is a Staff Software Engineer at Google Cloud. He is the Area Tech Lead for Scalability on GKE and Anthos, and the Tech Lead of the Kubernetes SIG Scalability, having been on the project since February 2015. Welcome to the show, Wojciech.

WOJCIECH TYCZYNSKI: Hello.

CRAIG BOX: You joined Google out of university. And then you were working on the Omega clustering project. We talked to Brian Grant a little bit about that. But what was that project like at the time you joined?

WOJCIECH TYCZYNSKI: I started working on understandability, which was basically the effort to show users what the cluster is actually doing and how it's behaving, and why it's behaving like that. But soon after that, I actually expanded the scope a little bit to work on the core infrastructure of Omega, mostly what we called the Omega persistent store, which was the central storage through which all the communication between the other components was going-- a little bit similar to what Kubernetes is doing.

CRAIG BOX: Is it fair to think of Borg as a monolithic system and Omega as a microservices architecture that was going to try and replace it?

WOJCIECH TYCZYNSKI: Yeah, absolutely. I think it's a very good summary of that. Like, Borg is-- it used to be mostly a single binary. It was split a little bit over time, but it's a very monolithic architecture. Omega was this microservice system. It's pretty similar to Kubernetes. Or I should rather say, Kubernetes is pretty similar to Omega, because many of the aspects or features or the high-level architecture of Kubernetes are actually based on learnings from Omega-- what worked and what didn't really work there.

CRAIG BOX: Absolutely. And many of the same people, yourself included.

WOJCIECH TYCZYNSKI: Yep.

CRAIG BOX: So then, as the Omega project started rolling its feature set into Borg and Google progressed more with that platform, a lot of the Omega team, including a big team in Poland, moved on to work on Kubernetes.

WOJCIECH TYCZYNSKI: Yes, exactly. I think it wasn't that big. I think we started Kubernetes with five or six people in Warsaw. Which wasn't super small, but it wasn't very big. I think half of those people roughly are still working on Kubernetes. The rest moved to other projects more so, but yes.

CRAIG BOX: So you joined that project in February 2015. Kubernetes 1.0 was released in July 2015. Given your work on scalability-- that's what we want to examine today. First of all, what does scalability mean to you in this case?

WOJCIECH TYCZYNSKI: When I joined, I actually started for the first month or two looking into random things and trying to understand a little bit better how Kubernetes works. And then probably around April-ish, I started looking into scalability. Actually, it was the moment when we-- by "we," I, here, more mean like the leadership of the project-- realized that we need to say something around scalability when we will be announcing 1.0 version. And we didn't really know at all where we are.

So the first thing we had to do was basically to define the basis on which we can say whether a given cluster scales to a particular size or not. And that was basically the time when we defined the first two SLOs. Back then, they were, I would say, a little bit of an anti-pattern for how you should be defining SLOs. Like, they weren't very customer-oriented or user-oriented. They weren't at all precisely defined. Pretty much every single person was interpreting them differently. But we at least had some starting point.

CRAIG BOX: So those SLOs, or Service Level Objectives, that were set at the beginning-- there was an API responsiveness SLO-- so 99% of API calls should return within one second. And then a pod startup latency SLO-- so basically, if an image is already on a node, then 99% of pods should start up within five seconds.

WOJCIECH TYCZYNSKI: Exactly.
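
The two SLOs just described are both percentile checks, and can be sketched as follows. This is only an illustration and not the measurement code used by the project: the function names and the latency samples are hypothetical, and the nearest-rank percentile definition is an assumption, since the interview notes that the original SLOs were not precisely defined.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_s, threshold_s, pct=99):
    """True if the pct-th percentile latency is within the threshold."""
    return percentile(latencies_s, pct) <= threshold_s


# Hypothetical measurements, in seconds.
api_latencies = [0.05] * 990 + [2.5] * 10   # 1% of calls are slow
pod_startups = [1.2] * 95 + [7.0] * 5       # 5% of pods are slow

api_ok = meets_slo(api_latencies, threshold_s=1.0)   # 99% of calls within 1 s
pods_ok = meets_slo(pod_startups, threshold_s=5.0)   # 99% of pods within 5 s
```

With these made-up samples the API SLO holds, because exactly 99% of calls are fast, while the pod startup SLO is violated, because more than 1% of pods take longer than five seconds.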

ADAM GLICK: Why were those the two metrics that were chosen?

WOJCIECH TYCZYNSKI: The second one is actually connected to what made people think, "wow", when they were first looking into Kubernetes. I remember those days where, for many of the people that actually looked for the first time at Kubernetes and they run kubectl run or whatever it was back then-- I think that it wasn't even kubectl back then, but kubecfg or something like that, and--

CRAIG BOX: That was easier to pronounce. No one ever got that one wrong.

WOJCIECH TYCZYNSKI: [CHUCKLES] Yeah. But basically, when they were running this command and seeing a pod being in the running state a couple of seconds later, that was the wow moment for many of the people. And we decided that it's important enough to make it-- no matter how big the cluster is or how loaded it is, this should be the case still. So that is the motivation for this one.

The API responsiveness is a little bit of building block for the pod startup because every single non-trivial operation requires multiple API calls. But it's also critical for user experience so that, when you send an API call to kube-apiserver, you don't wait seconds or 10 seconds or even more before you get the response. So it's also about user experience.

CRAIG BOX: Now, of course, we have things like the Cluster API. You can say "kubectl add machine", and then the SLO is, "well, I've got to phone Dell and have the machine shipped to you and installed in your data center". So that API call could take 10 weeks to return.

WOJCIECH TYCZYNSKI: Yes. So at least at that point, we were missing many important things. Like, we didn't really cover networking at all. We didn't cover storage at all. In pod startup time, we were implicitly assuming no volumes attached to the pod because they can make the time longer.

So yes, it's definitely not precise. It's definitely not covering the whole area or the whole feature set of Kubernetes. But it was some starting point given the time pressure and given the fact that I didn't really understand Kubernetes that well back then. I think it was a relatively good starting point.

CRAIG BOX: And that's fair enough, because it was barely a year old at this point. We talked about the SLO primarily in the context of a single number, which is how many nodes you can have in your cluster. That's a measurement that a lot of people like to use. You had published a document later on which said that, before the 1.0 launch, Kubernetes actually supported around 25 nodes in a cluster. You were working before this on Omega and Borg. And they published in the Borg paper that the average size of a Borg cluster or cell was in the order of 10,000 machines. That's a big difference.

WOJCIECH TYCZYNSKI: That is a big difference, but Kubernetes was our bet on how to get people to use the same framework or the same standard across the whole world. And the initial goal wasn't really to target the very large customers, but rather to make people fascinated about the concept itself.

So no one was really thinking that much about scalability and performance. And it's actually understandable. In order to reach a high-enough scale, you need to make many of the things more complicated than they should be, or that they could be, if you just want to have, like, for example, a 10-node cluster. So that was definitely a reasonable rationale. And that's basically why it looked as it looked.

CRAIG BOX: From the viewpoint of the changes that had to be made to reach the various scalability milestones along the way, Kubernetes 1.0 started out with published support for 100 nodes against this SLO we mentioned before for pod startup latency and API responsiveness. So there were changes that needed to be made to the software to support these things as we scaled each step along the way. What needed to be done to get from our 25-node internal number to the 100 nodes that 1.0 launched with?

WOJCIECH TYCZYNSKI: I think, at this point, those were mainly some very small and kind of obvious, usually setup-related stuff-- things like increasing number of file descriptors that a particular component can use, and things like that. So there wasn't anything super fancy back then. But the real journey started actually after that.

CRAIG BOX: You've been documenting this journey along the way in a number of blog posts to the Kubernetes blog talking about scalability. As we move through, we'll have a look back at those as well. You started with a post in September of 2015 talking about the measurements and the roadmap for Kubernetes, so talking about the SLOs that you've mentioned before. What were the things that you knew you needed to do to move from 100 nodes to the new goal that you set to get to 10x that, to 1,000 nodes, within the year?

WOJCIECH TYCZYNSKI: There were basically a number of things that had to happen. Many of them were much smaller and local. And each of them on its own wasn't super important, but together they made a huge difference. I think there are two things that are probably worth mentioning that were needed to reach 1,000-node cluster scale. The first is that we had to work around a limitation of Golang, the Go language itself.

The majority back then of the things that kube-apiserver was doing were conversions and deep copying of objects. And all of those were reflection-based. And we basically realized that it's way, way more expensive than it should be. So I started this sub-project to start auto-generating those, based on the object types. And that actually reduced the cost of those operations by more than an order of magnitude.

CRAIG BOX: Wow. Is that something that was, the way that you'd implemented it in the language? Or is that just something that's a primitive of the language, like you would have needed to fix Golang to make those things faster if you'd kept them as they were?

WOJCIECH TYCZYNSKI: No. No, it was just like, let's call it a small program or small script, relatively small, outside of it, that was taking our type definitions as input and producing the code that can then be compiled together with the Kubernetes itself to convert or deep copy the objects.
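
The approach described here, a small program that reads type definitions and emits specialised copy code, can be illustrated with a rough Python analogue. The Kubernetes generators emit Go code from Go types; everything below (the `Container` and `Pod` classes, `generate_copier`, and the set of supported field types) is a hypothetical stand-in chosen for illustration. The point is that type inspection happens once, at generation time, so the emitted function does no reflection when it runs.

```python
import typing
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Dict, List


@dataclass
class Container:          # hypothetical, much simplified type
    name: str
    image: str


@dataclass
class Pod:                # hypothetical, much simplified type
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    containers: List[Container] = field(default_factory=list)


def _copy_expr(tp, expr):
    """Return source code for an expression that deep-copies `expr` of type `tp`."""
    if tp in (str, int, float, bool):
        return expr                                  # immutable: safe to share
    if is_dataclass(tp):
        return f"copy_{tp.__name__}({expr})"         # call the generated copier
    origin, args = typing.get_origin(tp), typing.get_args(tp)
    if origin is list:
        return f"[{_copy_expr(args[0], 'i')} for i in {expr}]"
    if origin is dict:
        return f"{{k: {_copy_expr(args[1], 'v')} for k, v in {expr}.items()}}"
    raise TypeError(f"unsupported field type: {tp!r}")


def generate_copier(cls):
    """Emit the source of a copy_<Type> function specialised for `cls`."""
    args = ", ".join(
        f"{f.name}={_copy_expr(f.type, 'src.' + f.name)}" for f in fields(cls)
    )
    return f"def copy_{cls.__name__}(src):\n    return {cls.__name__}({args})\n"


# "Compile" the generated source together with the type definitions.
namespace = {"Container": Container, "Pod": Pod}
for cls in (Container, Pod):
    exec(generate_copier(cls), namespace)
copy_Pod = namespace["copy_Pod"]

pod = Pod("web-1", {"app": "web"}, [Container("nginx", "nginx:1.19")])
clone = copy_Pod(pod)
```

Printing `generate_copier(Pod)` shows the straight-line code that replaces the generic, reflection-driven walk over the object.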

CRAIG BOX: But we're not talking about unrolling loops, or having to write that part in assembly to get raw performance?

WOJCIECH TYCZYNSKI: No, no. Fortunately, not. I think an interesting thing to say about Golang here is that we have a pretty good collaboration with the Golang folks now. I remember when Go 1.5 was being released, and it actually had a performance regression. When we reached out to them, they said, well, we don't care about you that much. We will fix that in the next 1.6 release half a year later.

And when the same thing happened in 1.9 or 1.10-- I can't remember exactly-- the situation was, whoa, we didn't know that. We will work with you as much as possible to have that fixed as soon as possible. So it really showed that Kubernetes started to matter on the market, and started to matter even to Golang.

CRAIG BOX: Yes. Very important, obviously, as a marquee project for using the Go language. When you released 1.2 with those improvements, it was also said that you were able to improve 99th percentile tail latency by 80%. Was that as a result of these changes?

WOJCIECH TYCZYNSKI: That was one of those. The other thing was that we basically reduced the number of things that have to happen in kube-apiserver by introducing a cache in it. It mostly matters for watch operations. Watch is super crucial for Kubernetes-- a fundamental concept of it. We wouldn't be able to scale to probably even 500 nodes without watch support.

But what we did basically-- what we had previously is that every single watch was more or less redirected to etcd, which doesn't really understand or isn't aware of Kubernetes' data model. So for requests like kubelets asking for pods from their own node, the only thing that etcd can do is to send every single pod or change to every single pod and have it filtered out at the kube-apiserver level.

So it basically meant that a lot of work was wasted, because we were sending out those pods, deserializing them, checking whether they matched the filters or the selectors, and throwing them out. So what we did is basically we introduced this cache that is populated via a watch from etcd. And then it's responsible for dispatching the events to the correct watchers in a relatively efficient way. And that was one of the biggest things that allowed us to get to 1,000-node clusters.
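
The idea of a single cache fed by one watch and fanning events out to interested watchers can be sketched as below. This is a toy model under stated assumptions, not the kube-apiserver implementation: the `WatchCache` class, its method names, and the event dictionaries are all hypothetical, and indexing watchers by node name is just one simple way to make dispatch cheap.

```python
from collections import defaultdict


class WatchCache:
    """Receives each event from the backing store once and dispatches it
    only to the watchers that asked for pods on the matching node."""

    def __init__(self):
        self._by_node = defaultdict(list)   # node name -> watcher queues
        self.store_events_received = 0

    def watch_pods_on_node(self, node):
        """Register a watcher (for example, a kubelet) and return its queue."""
        queue = []
        self._by_node[node].append(queue)
        return queue

    def on_store_event(self, event):
        """Called once per change coming from the store's single watch."""
        self.store_events_received += 1
        # Dictionary lookup instead of filtering the event for every watcher.
        for queue in self._by_node.get(event["node"], ()):
            queue.append(event)


cache = WatchCache()
kubelet_a = cache.watch_pods_on_node("node-a")
kubelet_b = cache.watch_pods_on_node("node-b")

events = [
    {"type": "ADDED", "pod": "web-1", "node": "node-a"},
    {"type": "ADDED", "pod": "web-2", "node": "node-b"},
    {"type": "MODIFIED", "pod": "web-1", "node": "node-a"},
]
for event in events:
    cache.on_store_event(event)
```

Each change is read from the store once, regardless of the number of watchers, and each watcher receives only the events for its own node.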

CRAIG BOX: Progress marches on very quickly. And by the time of the 1.3 release in July of 2016-- so that's four months later-- we're now supporting 2,000 node clusters. So again, another step change, doubling the number of nodes supported in a cluster. What changes were made between those two releases?

WOJCIECH TYCZYNSKI: So probably one of the biggest changes there is we started supporting protobufs in the API, protocol buffers. But it was also a result of many things being in progress earlier and they finally were finished in 1.3. Earlier, there were also a bunch of smaller things that allowed us to reach that scale. But probably the biggest one, as I mentioned, was protocol buffer support.

CRAIG BOX: Now, at that point, you were moving the data around in protocol buffers. But you were still converting it back to JSON to save it to etcd. Is that correct?

WOJCIECH TYCZYNSKI: It was still the default, I think. I can't remember for sure. But we were able to store them in protobufs too. But I think the default was still storing them in JSON.

CRAIG BOX: Google obviously has used protobufs and what became gRPC internally for some time. So was it one of those things where you said, hey, we have this method that we know would solve the problem. Was it based on internal learnings that you said, here's an instant way we can get better performance?

WOJCIECH TYCZYNSKI: The protobuf support is something that took more than a release. It was started well before 1.2. And it actually was driven mostly by Clayton from Red Hat. So it wasn't even Google who initiated this effort.

CRAIG BOX: Now, also, at the time of 1.3, you released some software called Kubemark. Tell us about that.

WOJCIECH TYCZYNSKI: Kubemark is basically this framework that allows you to perform scalability tests without running a full, real cluster. And the biggest benefit of that is obviously the cost of it. So in order to start a 2,000-node cluster or 5,000-node cluster and run it for a couple hours or 10 hours or something, it generally costs a lot of money.

CRAIG BOX: It can get pretty expensive pretty quickly.

WOJCIECH TYCZYNSKI: Yeah. And we came up with this idea that, for many of our tests, we actually don't need the real nodes. What we are mostly focused on is the performance of the control plane. And if something is able to imitate the load that the nodes are putting on the control plane, it will be good enough to replace a majority of our tests. We still had to run the real cluster test just to confirm that they didn't diverge too much from each other. But we could do that much, much more rarely.

And what we did for Kubemark is basically we took the components that were running on the nodes, and we faked some of the parts. Like, for example, for kube-proxy, we faked the iptables. For kubelet, we faked Docker-- or probably now, it's the Container Runtime Interface-- with something that is not really doing any real work. It's not updating iptables. It's not running any containers. It's just claiming it does.

CRAIG BOX: Right. So it's mocking Kubernetes, basically.

WOJCIECH TYCZYNSKI: Yes, yes. It's basically mocking Kubernetes. And it allows us to save roughly an order of magnitude of resources thanks to that.
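
The "hollow node" idea can be sketched as follows: keep the code path that talks to the control plane, and swap the part that does real work for a fake that only claims success. This is a simplified illustration, not Kubemark's code; `FakeRuntime`, `HollowNode`, and the status-reporting callback are hypothetical names.

```python
class FakeRuntime:
    """Stands in for the container runtime: it records what it was asked
    to run and reports success, but never starts a container."""

    def __init__(self):
        self.claimed_running = set()

    def run_pod(self, pod_name):
        self.claimed_running.add(pod_name)
        return "Running"


class HollowNode:
    """Node agent with the real control-plane interaction but a fake runtime."""

    def __init__(self, name, runtime, report_status):
        self.name = name
        self.runtime = runtime
        self.report_status = report_status   # callback into the control plane

    def sync_pod(self, pod_name):
        phase = self.runtime.run_pod(pod_name)
        # The control plane sees the same status update a real node would send.
        self.report_status(self.name, pod_name, phase)


status_updates = []   # stands in for the API server receiving updates


def record_status(node, pod, phase):
    status_updates.append((node, pod, phase))


nodes = [HollowNode(f"hollow-{i}", FakeRuntime(), record_status) for i in range(3)]
for node in nodes:
    for j in range(2):
        node.sync_pod(f"pod-{node.name}-{j}")
```

From the control plane's point of view the load is the same six status updates that three real nodes would generate, while no machine actually runs a container.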

CRAIG BOX: Now, this does raise an interesting point. Because you mentioned before that the SLOs that you are targeting for scalability are based around the control plane things and then the ability to start up pods. No way does it ever say anything about, you can reach those pods on the network, or those pods are connected to volumes, like you mentioned before.

You could make a cluster which had many thousands of nodes but didn't meet any of those requirements. And then it wouldn't be very useful. How do you factor in the things that people actually need in terms of a proper Kubernetes experience as you start thinking about these scalability targets?

WOJCIECH TYCZYNSKI: I think we are generally trying to be user-oriented. So whenever we realize that the majority of users need feature x, we should be adding an SLO around that. And ideally, we should have an SLO about every single feature, but that's probably not achievable any time in the short term, especially with projects that are moving as fast as Kubernetes is moving.

So we are still not in the shape where we would like to be with respect to SLOs. We have a number of work-in-progress SLOs, especially around networking, that we defined together with the Networking Special Interest Group and that are focused on customer-oriented aspects of networking-- so for example, how fast your pod will be reachable via the Service virtual IP address, and things like that, or how fast your DNS will be propagated after your pod status changes.

And they are implemented. We already measure them. We don't yet block releases on them because there are a couple of smaller issues there to fix. But it's something that we are targeting, at least. And hopefully, we should have them blocking Kubernetes releases in the not-too-distant future.

CRAIG BOX: All right, then. So moving back through our release history, we are now at March 2017 with the release of 1.6. And that supports 5,000 node clusters. That is a two-and-a-half times jump from where we were in 1.3. What needed to change to enable that?

WOJCIECH TYCZYNSKI: There were, again, a bunch of changes. I think there were two big things that we had to do. The first one was actually redesigning etcd. That's basically where etcd v3 appeared. We realized that, a little bit above 2,000 nodes, the limitations of etcd v2, which we were using back then, were actually preventing us from going any farther.

The biggest bottleneck was actually watch. etcd v2 didn't natively support watch. The watch support was on the etcd client side. So how it looked back then was that the etcd client was sending a request, "give me the next thing that happened after a given resource version". We were getting the response, and we were sending it again with the next resource version, and so on.

And it was basically falling over at around a couple of hundred events. And we definitely needed more to get to 5,000 nodes. So etcd v3 was designed and implemented. And we migrated to etcd v3 to get to this point.
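
The cost of the request-per-event pattern described above can be sketched by counting round trips against a toy event log. The `EventLog` class and both functions are hypothetical and are not the etcd client API. As a simplification, the polling call here returns `None` once the log is exhausted, whereas a real long-polling call would wait for the next event. The streaming variant models a watch kept open on one connection, which is the general shape of a server-side watch.

```python
class EventLog:
    """Toy store: index i holds the event with resource version i + 1."""

    def __init__(self, events):
        self._events = list(events)
        self.requests = 0          # number of client round trips

    def next_after(self, resource_version):
        """Polling style: one request returns the single next event."""
        self.requests += 1
        if resource_version < len(self._events):
            return resource_version + 1, self._events[resource_version]
        return None

    def stream_after(self, resource_version):
        """Streaming style: one request, then events arrive on the same stream."""
        self.requests += 1
        for rv in range(resource_version, len(self._events)):
            yield rv + 1, self._events[rv]


def polling_watch(log, resource_version=0):
    """Ask repeatedly for 'the next thing after resource version N'."""
    seen = []
    while True:
        result = log.next_after(resource_version)
        if result is None:
            return seen
        resource_version, event = result
        seen.append(event)


def streaming_watch(log, resource_version=0):
    """Consume every event from a single open stream."""
    return [event for _, event in log.stream_after(resource_version)]


events = [f"pod-{i} updated" for i in range(200)]
polling_log, streaming_log = EventLog(events), EventLog(events)
polled = polling_watch(polling_log)
streamed = streaming_watch(streaming_log)
```

Both watchers see the same 200 events, but the polling client pays a full round trip for every one of them, while the streaming client issues a single request.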

And the second big thing that we also had to do internally-- as all the Kubernetes release-blocking tests, including the 5,000-node ones, are running on GCE-- was a bunch of improvements to the GCE cloud infrastructure, mostly around networking. And they also happened around that time.

CRAIG BOX: We talked before about storing data in protobuf. That came in with 1.6 as well?

WOJCIECH TYCZYNSKI: Yes, of course.

CRAIG BOX: Around this time also, we start adding a few other things to our scalability calculations. So we support 5,000 nodes. But we also have numbers for the total number of pods or containers or number of pods per node that are brought in. Alongside this is the concept of a scalability envelope, with a lovely sort of hypercube diagram that goes alongside it. How do you define scalability in more than one or two dimensions?

WOJCIECH TYCZYNSKI: It was basically slightly before we released 5,000-node support that we finally realized that the size of the cluster is actually not everything. We started seeing users that were facing significant scalability issues at the level of, for example, 100-node clusters because they were extensively using Services or networking concepts in general.

So we had to define the scalability not only as a function of size of the cluster, but also in other aspects. That includes, as you mentioned, things like number of pods in the cluster, or number of containers across all those pods, or number of services, or total number of endpoints across all services. So each of those dimensions-- and obviously many others. There are probably dozens of those. We don't even have time to mention all of those.

But each of them is affecting a different behavior of the system. So for example, the total number of endpoints across all services affects the performance of iptables or whatever mechanism is used to route the traffic. Things like the size of the biggest service affect how big the endpoint object is and how much data needs to be sent to kube-proxies along with every single update of that service, and so on, and so on. So each of them is affecting a different aspect of the system.

And you can actually have a bigger cluster than 5,000 nodes if your workload there is relatively simple. And if your workload is super heavy-- for example, you are using networking concepts very heavily, or have very large services-- for sure, you won't be able to reach 5,000 nodes, especially with the 1.6 release of Kubernetes.

CRAIG BOX: That being said, we're now up to 1.18, and the published number is still 5,000 nodes. Why no changes between those two releases?

WOJCIECH TYCZYNSKI: That is true. I think we mostly focused on those other dimensions in the meantime. We realized that there is no big pressure for any users to go higher than 5,000 nodes. But there are a lot of requirements around other dimensions-- for example, around the networking area, or how big your Service can be.

In particular, we are officially saying that we don't really support Services with more than 250 endpoints per Service. But there are definitely users that want to have more. So one of the things that we are working on in this area is the concept of EndpointSlices-- sharding a single Endpoints object into multiple EndpointSlice objects. That will be reaching Beta soon, I think in 1.19. But those kinds of things were what we were mostly focused on around scalability during those three years-ish.
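
Why sharding helps can be shown with some simple arithmetic. The sketch below is only an illustration of the sharding idea, not the EndpointSlice controller: the function names are hypothetical, and the slice size of 100 and the service size of 1,000 endpoints are values picked for the example.

```python
def shard_endpoints(endpoints, max_per_slice):
    """Split one large endpoint list into slices of bounded size."""
    return [
        endpoints[i:i + max_per_slice]
        for i in range(0, len(endpoints), max_per_slice)
    ]


def endpoints_resent(endpoints, changed, max_per_slice=None):
    """How many endpoints each watcher must receive again when `changed`
    is updated: the whole object, or only the slice that contains it."""
    if max_per_slice is None:
        return len(endpoints)               # single monolithic object
    for piece in shard_endpoints(endpoints, max_per_slice):
        if changed in piece:
            return len(piece)
    raise KeyError(changed)


# A hypothetical Service with 1,000 endpoints.
endpoints = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
slices = shard_endpoints(endpoints, max_per_slice=100)

whole_object = endpoints_resent(endpoints, endpoints[5])
one_slice = endpoints_resent(endpoints, endpoints[5], max_per_slice=100)
```

With these numbers, a change to one endpoint means re-sending 1,000 entries to every kube-proxy in the unsharded case and 100 in the sharded case, and that saving is multiplied by the number of nodes watching the Service.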

CRAIG BOX: We talked a little bit with Bowei from SIG Network in episode 104 about the concept of EndpointSlices. How does SIG Scalability interact with the other groups that have to make changes in order to support the overall scalability goals?

WOJCIECH TYCZYNSKI: Actually, in the case of EndpointSlices, we were the ones who initiated that. Like, SIG Network also was interested in that, but they didn't really have good motivation for doing it back then. And scalability became the thing that actually made them work on that.

And we are basically trying to cooperate at the design phase to ensure that whatever will be designed will be able to scale, at least in theory. And then we are reengaging again, once the implementation is ready, to ensure that it actually behaves, or performs as well as we want it to be. So those are basically those two main phases where we are involved at SIG Scalability.

CRAIG BOX: The ability to run larger clusters if you're willing to compromise on some of those other parameters-- there are two users who have talked about larger clusters that they operate. JD.com and Alibaba & Ant Financial have both talked about running 10,000-node clusters. What is it that they had to do in order to make that possible?

WOJCIECH TYCZYNSKI: To be honest, I didn't really talk with anyone from JD.com. But from discussions with Alibaba, those are the things that we are mostly doing in open source for supporting higher numbers in other dimensions. Those are things like improvements to etcd that are happening in higher versions of etcd, especially a couple of recent improvements in etcd 3.4. Those were things like adding indexes that are very targeted at their own use cases in kube-apiserver and things like that.

So they know their workloads much better. They understand the characteristics of the workloads. And they are able to customize their Kubernetes to better fit their own use cases. It doesn't mean it would help every single user. So that's why many of those improvements or optimizations are not possible to upstream to Kubernetes, but they help them.

CRAIG BOX: We talked with Xiang Li from Alibaba, who is one of the original authors of etcd, in episode 95. And you can listen to that show if you're interested in learning more about the work they did on scalability, up to 10,000 nodes.

But the race continues. And recently, Google announced they now have 15,000 node clusters running on behalf of one of their customers, Bayer Crop Science. What happened there? What did Google do in order to make that next leap?

WOJCIECH TYCZYNSKI: Again, there were a number of things. We actually have things around etcd-- many etcd improvements upstreamed in open source, and we updated to that new version. We did many improvements to the API machinery. Things like watch bookmarks, more efficient serialization of objects inside the control plane, and many, many other lower-level improvements in Kubernetes.

And actually, to be clear, we did every single one of those in the open source. So it is available for everyone. What is important to say is that it doesn't mean it will work on every other provider because we also did a bunch of improvements in the infrastructure itself. Things around logging, monitoring, the GCP networking stack, things that are backing the running Kubernetes cluster also had to be improved or even redesigned in some cases.

We redesigned the monitoring pipeline, for example, that we use for Kubernetes to make it scale to that level. So it's probably possible to do it on other providers. But it's not something that you can immediately take Kubernetes and just do.

CRAIG BOX: The scalability improvements that you've done working with Bayer are only in early access at the moment. When do you think they'll be available to people who just spin up a cluster from the console?

WOJCIECH TYCZYNSKI: We are planning to have it available for customers later this year. We don't have an exact date yet. We are still working with other partners on validating other use cases. And we will be working on improving that over time. And many of the optimizations and improvements won't be part of the initial launch. And they will be coming over upcoming quarters.

CRAIG BOX: Do you think that the improvements that are being done will then lead to the 5,000 number that's being published for open source communities to be increased?

WOJCIECH TYCZYNSKI: It's possible. We will see what the future will look like. We are not targeting to increase it anytime in the near future unless there will be a need for that or unless others will express the desire to help maintain that.

CRAIG BOX: A lot of customers are running Mesos at the tens-of-thousands-of-nodes scale. I believe that the two-stage scheduling introduced in Omega makes that a little bit more possible. But are we seeing customers who are now moving from Mesos to Kubernetes and GKE?

WOJCIECH TYCZYNSKI: Actually, one of the reasons for going to those 15,000-node clusters was actually to unblock some customers or some users to actually be able to use Kubernetes; some Mesos users that didn't want to replace their single Mesos cluster with tens of Kubernetes clusters.

So one thing that is probably worth mentioning here is I would like to advertise an upcoming talk at Google Next, where one of my colleagues, the PM for Scalability, Maciek Różacki, will be talking with one of Twitter's engineers about how we cooperated with them and how they helped us validate our GKE and Kubernetes improvements against the workloads that are simulating how Twitter workloads would look on Kubernetes.

CRAIG BOX: Speaking of the scheduling approach used in Omega and in Mesos, there were open source projects like Firmament and Poseidon that looked to implement that kind of thing, a more flow-based scheduling, which they claimed would help support larger numbers of nodes. But I haven't really heard a lot about anyone using secondary schedulers. I hear, obviously, the work that's being done in the main Kubernetes scheduler to support this. Do you think there's still a place for secondary schedulers or replacing the built-in Kubernetes scheduler to get to greater scale?

WOJCIECH TYCZYNSKI: I think we are getting to the point where Kubernetes' control plane will be able to support higher throughput with many of the optimizations that happened and that are coming. So maybe at some point, we should think about a secondary scheduler.

But I think my personal feeling about it is that we should think about more use-case-oriented schedulers, for example, things like a batch scheduler, that will be targeting a specific workload. And we'll be able to focus on specific scenarios rather than making a generic "more scalable scheduler". Because from my experience talking with Kubernetes users and GKE customers, the majority of them don't need anything higher than what we actually have. The people that do need them are mostly batch users or batch customers.

CRAIG BOX: And finally, it's been a tough few months for everybody around the world. But you have had a unique perspective on the current situation. I understand that your wife is an infectious diseases doctor?

WOJCIECH TYCZYNSKI: Yes, so thanks to her, I have very first-hand knowledge, especially from what is happening in Warsaw, where we live.

CRAIG BOX: All right. Well, do thank her from everyone in the Kubernetes community for the work that she's done. And thank you, Wojciech, for your time today.

WOJCIECH TYCZYNSKI: Thank you very much for hosting me.

CRAIG BOX: You can find links to Wojciech on GitHub and LinkedIn in the show notes.

[MUSIC PLAYING]

ADAM GLICK: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at @KubernetesPod or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts, show notes, Christmas trees, and trap music kids' songs. Until next time, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]
