#104 May 20, 2020
SIG Network is completely rethinking the way you define groupings of applications (Service) and get traffic sent to them (Ingress) by building the Service APIs, a new set of primitives which are better suited to how different groups of users interact with them. Bowei Du is a Tech Lead on GKE and a member of SIG Network who is leading the design and implementation of these new APIs, as well as working on getting Ingress to GA in Kubernetes 1.19.
CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.
ADAM GLICK: And I'm Adam Glick.
CRAIG BOX: This is not the time of year I would usually talk about Christmas trees, but they've bubbled up into my consciousness this past week. A week ago, I was out for a walk, and I saw a dead Christmas tree on the side of the road just sitting there. Thought, that's a bit strange. Took a photo of it because I thought it might be amusing and passed on. And out again last weekend, and there was another one. Why are these sitting on the side of the road?
ADAM GLICK: This feels like a joke setup.
CRAIG BOX: It isn't, unfortunately.
ADAM GLICK: Aww.
CRAIG BOX: I can tell you that, given that it's May, a Christmas tree that's been out for five months now should be very, very flammable. I had a tradition with some friends when I was at university. My friend Joe would go home at Christmas time, and then he wouldn't see his university friends again until February. So in February, he would throw a Christmas party for his university friends who were all now back together at the place they went to school. And they would get the old Christmas trees from around the place, douse them in petrol and set them on fire.
ADAM GLICK: It's kind of a mix between Christmas festivities and Burning Man.
CRAIG BOX: A little bit. Sort of a Christmas, Guy Fawkes, all in one. And they dry out really well. That's all I can say about them. I don't think that's why these were there. If so, they forgot to burn them in February.
ADAM GLICK: So have you solved the mystery? Do you know why the Christmas trees are showing up?
CRAIG BOX: I don't know. Two's a coincidence. If I see a third one next weekend, I'll definitely consider it a pattern.
ADAM GLICK: [CHUCKLES] I want to send a thank you out to Adrian Mester, who, after seeing the puzzle I'd been working on that we posted on Twitter, sent in a link to one being created by the creators of Cards Against Humanity. It's a Kickstarter. They've already hit their goal. But it's just kind of an interesting design that they've got, some really interesting art and ways of creating a puzzle with a promised surprise ending. For anyone who's curious to see that as well, we'll put a link in the show notes.
CRAIG BOX: Yes. And that guy seems to be quite good at working Kickstarter. He's asked for $44,000 and so far raised $1.6 million.
ADAM GLICK: Wow. He should ask for my next raise for me.
CRAIG BOX: You know what you're getting for Christmas.
ADAM GLICK: Shall we get to the news?
CRAIG BOX: Let's get to the news.
CRAIG BOX: Google Cloud has announced the rescheduling of the postponed Google Cloud Next event. There will now be a free nine-week digital event starting on July 14, which will have keynotes, over 200 breakout sessions, and training on Qwiklabs and Pluralsight. If you were already signed up for Next, then you don't need to change anything. If you haven't registered yet, the announcement with the signup link is in this week's show notes.
ADAM GLICK: Harbor has hit 2.0. This version is the first open source container registry that is OCI, or Open Container Initiative, compliant. This means you can store containers, Helm charts, Open Policy Agent policies, and more. The built-in security scanner is also changing from Clair to Aqua Trivy, though users who want to continue to use Clair will be able to do so. Other improvements include the ability to configure SSL connections, configurable webhooks with Slack integration, expiration dates on individual robot accounts, and a new dark mode UI.
CRAIG BOX: Following recent announcements by Google Cloud and AWS, Microsoft Azure has updated their AKS service offering and pricing. They are now promising an SLA of 99.95% for clusters spread across availability zones and 99.9% for single-homed clusters for a fee of $0.10 per hour per cluster. This is only available for new clusters and is available in seven regions.
The free offering remains available with a 99.5% SLO, meaning they will try to achieve that level of availability but provide no financial guarantee if they can't. Microsoft has stated that they intend to bring SLA capabilities to existing clusters later this year.
In other AKS news, you can now install Kubernetes 1.18 in preview.
ADAM GLICK: Red Hat has added another cloud vendor to their stable. Working with AWS, they've announced an upcoming Amazon Red Hat OpenShift. Similar to the Azure product, it will be supported and billed through AWS with built-in integrations to Amazon services. The team has stated that they are preparing for an early access program, and there's a forum to express your interest for when it launches in the coming months. GA is expected in the second half of this year.
CRAIG BOX: Linode has announced the general availability of Linode Kubernetes Engine or LKE. The service is a managed control plane automatically patching masters but not nodes and supports Kubernetes 1.15 through 1.17. It is available in three US regions and one in Australia. Planned upgrades include expansion into additional regions and support for bare metal and GPUs. LKE as a control plane is free, as you are charged for the underlying compute, networking, and storage resources that you use.
ADAM GLICK: This week, VMware announced their intent to buy Octarine, a container security software company, to fold into their Carbon Black security software. The acquisition will add Kubernetes and container support to the broader suite that Carbon Black currently provides to more traditional VM-based applications. Terms of the deal were not disclosed.
CRAIG BOX: Cybersecurity company Venafi has announced that they are acquiring Jetstack, a UK cloud native consulting company who are responsible for the popular cert-manager project. Details of the deal were not disclosed, but they did say that they plan to operate the organization separately while integrating their software and that the deal is expected to close in June. This is just before cert-manager is slated to go GA with 1.0, according to a project update also posted this week by Jetstack's James Munnelly, our guest from episode 75.
ADAM GLICK: Maesh, the self-proclaimed simpler service mesh built on the Traefik proxy, has released version 1.2. The headline feature is UDP support. And you can now use multiple traffic control middlewares per service. End-to-end encryption, which due to Maesh's design would only be between hosts and not pods, is mentioned as still being worked on.
CRAIG BOX: At this year's GrafanaCon, Grafana Labs announced that Grafana 7.0 is now Grafana-lly available. Features include new plugins from the major clouds and new libraries which reduce the effort to develop plugins, improvements to data processing pipelines, and faster ways to visualize data, now including traces alongside logs and metrics. The announcement says that open source Grafana has half a million users worldwide. So if you are one, go get yourself an upgrade.
CRAIG BOX: The CNCF has kicked off its survey for the first half of 2020. As you may have heard us cover in previous episodes, the surveys provide a snapshot of the community and help track project usage, challenges, and benefits. The survey offers a helpful view of the state of our community, and you can find a link to fill it out in the show notes. Don't forget to say you get your news from podcasts.
ADAM GLICK: Finally, if you're looking for an in-depth technical read this week, check out Lyft's Tony Allen examining the load balancing algorithms available in Envoy. Random, round robin, least request, weight proportional, and two-choice modes are contrasted, and there are others which did not make the cut. Allen says there are use cases where you may not see the problem the more complex algorithms solve. Performance is also impacted by features such as circuit breaking and retry policies such that there is no single best choice, but the blog does help you understand the trade-offs as you choose what works best for you.
CRAIG BOX: And that's the news.
ADAM GLICK: Bowei Du is a staff software engineer at Google Cloud and a tech lead on GKE networking. He's a contributor to SIG Network, working on Services, Ingress, and the next iteration of both. Welcome to the show, Bowei.
BOWEI DU: Thank you.
ADAM GLICK: You got your PhD from Berkeley working with Eric Brewer. You're clearly available, as you're on the podcast today. So are you consistent or partition tolerant?
BOWEI DU: As you know, you can only choose two of these. I try to be as consistent as possible. But probably, these days we are very much partitioned, at least our minds are. Taking care of kids and everything.
CRAIG BOX: What was the topic of your PhD?
BOWEI DU: I worked on the TIER project and helped out also with DTN networking. There's a bunch of words there. So the TIER project was looking at using information technology in developing regions. So a couple of things I did was my research group helped set up a long distance wireless network in various parts of the world, one of them in rural India.
DTN networking was a project basically to create a networking stack for places that have long delays. So one of these-- I think it might be still actively used-- is communication between space probes. Imagine you have a space probe that's talking on Mars, and it wants to talk to Earth. So you're going to have to do something where you send it, and it's going to take-- light's going to take minutes to propagate.
Potentially, the satellite around Mars is going to be behind Mars, so it has to wait and then send things when it comes around. We also use that technology to try to do communications in places without really good communication. So you imagine internet cafe is sometimes disconnected. Or at the start of the project, there were a couple of projects where they actually used Sneakernet to distribute data. So it was looking at that and seeing how to design systems in those areas.
It was pretty hands on. So I actually did visit some internet cafes in rural Cambodia and tried to see if our technology worked there. That was a pretty interesting deployment experience. I also went to the Philippines and saw UP Manila had a project where they were creating a medical system for rural health clinics. So it potentially could connect those using some of this technology.
So yeah, it was much different than Kubernetes. But you could say, in terms of systems and looking at software, disconnected error handling, self-repair, that all feeds into the same thing, but probably tackling different problems.
ADAM GLICK: It's interesting when you talk about delay tolerant networking. I assume this is something a little beyond setting longer timeouts on your callbacks.
BOWEI DU: Delay tolerant networking had interesting routing algorithms because they had this notion that you would-- it's almost like delivering a physical package instead of saying a protocol that had very fast-- or at least the reasonable expectation of sending something and getting a reply immediately. So you almost sent your communication package, it got written to disk or some kind of reliable storage, and then, at some later point it would be scheduled to send along. And one of the cool results that some folks on DTN came out with was that, as part of your routing algorithm, you may, given your storage space, have to move things somewhere else so you had capacity to send things through your DTN node and then move things back. But it was kind of cute stuff like that.
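The store-and-forward behavior Bowei describes can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual DTN stack: the `Bundle` and `DtnNode` classes, their methods, and the node names are hypothetical, and an in-memory queue stands in for the disk or other reliable storage he mentions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """A self-contained message, handled like a physical package."""
    dest: str
    payload: str


@dataclass
class DtnNode:
    """Hypothetical node: holds bundles until a link to the next hop is up."""
    name: str
    stored: deque = field(default_factory=deque)  # stands in for disk storage

    def receive(self, bundle: Bundle) -> None:
        # Persist first; no reply is expected right away.
        self.stored.append(bundle)

    def contact(self, next_hop: "DtnNode") -> int:
        """Called when a link comes up: forward everything we are holding."""
        sent = 0
        while self.stored:
            next_hop.receive(self.stored.popleft())
            sent += 1
        return sent


orbiter = DtnNode("mars-orbiter")
earth = DtnNode("earth-station")

orbiter.receive(Bundle("earth-station", "image-001"))
orbiter.receive(Bundle("earth-station", "image-002"))
# Orbiter is behind Mars: nothing moves, the bundles simply wait.
held = len(orbiter.stored)
# Orbiter comes around and the link is available.
forwarded = orbiter.contact(earth)
```

The sketch leaves out the routing result he mentions, where a node with limited storage moves bundles elsewhere to free capacity and brings them back later.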
ADAM GLICK: Certainly sounds like an interesting project. How did you get from space communications and delay tolerant networking to Google?
BOWEI DU: After I graduated, I was looking around and felt like, OK, I was kind of interested in a number of things. So I worked at a startup that did distributed databases because I was interested in databases. Didn't know that much about it. And after that, I worked at a startup that was doing web acceleration global proxies and CDN and web acceleration.
And so that got me-- there was some networking threads already, and then that really cemented the networking for me. And then I was looking around for the next interesting thing to do. I saw Kubernetes. Hey, that looks interesting. So I went to Google, got an offer for working on Kubernetes. I was like, hey, that sounds interesting. So it's always what looks interesting and stuff I didn't know about.
So when I joined the Kubernetes team, I didn't know anything about Kubernetes. I read the paper and I was like, oh, this looks like a reasonable way to do things. I wonder what the deal is.
CRAIG BOX: Always a good start. What was the state of Kubernetes networking at the time you joined the project?
BOWEI DU: Many things were already along. For example, most of what we know about Service was there. DNS was sort of there but not specified and not as scalable as it is today. And also, I think in terms of many of the things that we're looking at, such as how to do things multi-cluster, how to give more extensibility, there were rumblings, but it was pretty early. Also, for example, handling multiple networks: that wasn't something at the time, but now clearly many people are interested in it.
CRAIG BOX: A lot of the design work for the primitives of Kubernetes networking happened before you joined the project, so we won't hold you to account for them, but we'll ask you a couple of questions on them. The concept of Services. So first of all, the idea of having a selector and saying this refers to a set of workloads or backends. You also have other objects that refer to them in the sense of a Deployment or a StatefulSet or something.
So there's two different objects that you have to define a selector on. And a lot of people say that there is power in that, that you're able to define slightly different selectors and so on. Do you think that it makes sense to group these things in these two different ways?
BOWEI DU: There's good points in that people use it and we give people a choice in terms of how they define Service versus their Deployment. So I think people use it for blue green. But there are drawbacks. A lot of people conceptually, when you say the word "service"-- and I'm talking from a technical English definition standpoint-- they often are talking about the application as well as the network aspect of it.
And in that way, this mismatch is a little bit too flexible. So people want to define properties on their application, and they have to define it on Service. We have found that sometimes this mismatch leads to things such as you can set up situations where, really, you want to tie these things together so you can make an assumption.
For example, you can make an assumption that a given Service matches its Deployment. In some sense, this is one of the foundational axioms, I guess, of Kubernetes, so it's kind of hard to change now. But what we're looking at going forward, especially if we talk more about Service APIs, is a Kubernetes Service right now defines many things.
It defines grouping, which is how these network applications are grouped together, your pods. It defines how they are potentially exposed on LoadBalancers or NodePort or various ways to do things. And it defines properties about them. One of the things we're looking at going forward is perhaps we can decouple some of these. And that is one of the low-frequency notes that's going on in the Service APIs, is to look at decoupling these things, decoupling grouping from exposure in terms of, for example, you want a LoadBalancer for your network application, from describing the properties of them.
CRAIG BOX: One of the things that is decoupled is the concept of the Endpoints of a Service from the Service itself. People may not realize this by default, but there is actually a different object that defines the IP addresses. That underwent a change recently with the introduction of an EndpointSlice object. What are they, and why were they needed?
BOWEI DU: EndpointSlice meets a couple of goals. The most immediate one is simply scalability. So you have Endpoints, and the Endpoints object corresponds one-to-one with a Service. Now, that's fine if your Service is small, let's say 10 endpoints or so. But when you start getting to 1,000 endpoints, 2,000 endpoints, now, as clusters get bigger and bigger, it becomes a scalability issue, where these endpoints are not just resources, but they're resources persisted inside Kubernetes inside etcd.
And when, for example, you do a Deployment upgrade or any kind of change to your system, you may end up having to change the Endpoints that are part of your Service. And as part of that, every single change will have to be written back to this database. And as we know, very highly reliable persistent databases such as etcd, if you write too much to them, that becomes a problem.
Another problem comes when you go to distribute these objects, because Kubernetes has a watch model where you're sending updates to everyone who's interested. And in this case, Endpoints-- every node on your system is interested along with some other daemons. You're going to be sending lots and lots of copies of this entire Endpoints object. And that may be-- the whole object might be huge, but the change inside the object might be small.
So the point of the EndpointSlice was to chop it up so that you can send smaller bits as part of a whole. And then, along with the fact that we are revving the Endpoints API, we are able to put in things such as things to facilitate extra metadata about the Endpoints, redesign some of the ways that conditions on Endpoints were done, and add topology information.
So the immediate goal was scalability, but if you're in there, then why not fix a couple more things? And it turns out that lots of people are going to be able to consume a more general endpoint slice mechanism to implement their stuff. So we're seeing, for example, Istio project, I think Knative project are all using EndpointSlice internally.
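The scalability argument can be made concrete with a little arithmetic. The Python sketch below is not the real EndpointSlice controller: the function names, the IP addresses, and the positional slicing are assumptions made for illustration, as is the limit of 100 addresses per slice. It shows how one changed endpoint among 2,000 touches a single small slice instead of one huge object.

```python
def to_slices(addresses, max_per_slice=100):
    """Chop one long endpoint list into fixed-size slices."""
    return [addresses[i:i + max_per_slice]
            for i in range(0, len(addresses), max_per_slice)]


def changed_slices(old, new):
    """Indexes of slices whose contents differ and would need to be re-sent."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


# 2,000 hypothetical pod IPs behind one Service.
addresses = [f"10.0.{i // 250}.{i % 250}" for i in range(2000)]
before = to_slices(addresses)

# One pod is replaced during a rollout and gets a new IP.
updated = list(addresses)
updated[1234] = "10.9.9.9"
after = to_slices(updated)

dirty = changed_slices(before, after)
resent = sum(len(after[i]) for i in dirty)
print(f"{len(before)} slices; changed: {dirty}; "
      f"addresses re-sent: {resent} of {len(addresses)}")
```

With a single Endpoints object, the same one-pod change would mean writing all 2,000 addresses back to etcd and sending them to every watcher.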
CRAIG BOX: EndpointSlice is a new object. Why didn't you just change the Endpoints object?
BOWEI DU: So why were there two objects rather than revving Endpoints? I think KubeCon 2019, SIG Arch gave a presentation that said "there is no V2. We will try to do things side by side". And I think that's very much true. In fact, if you look at EndpointSlice and you look at Service APIs, it's not a V2. It's a sequel.
So what we're trying to do is that you don't want to ever be in a situation where you're breaking large numbers of your users. And to go from Endpoints and make it work like EndpointSlice would probably have broken a lot of people. So we're going to keep those two objects at least for a long, long time, until we see that the entire community has moved off of Endpoints and we can say, hey, EndpointSlice is the default.
But Endpoints, as I said before, works fine for small scale, and you can leave it at that. But really, EndpointSlice comes into the picture for additional features which we hope will push people to use the new API and the fact that it scales way better than Endpoints.
ADAM GLICK: How is Kubernetes aware of when those endpoints are there or not?
BOWEI DU: How Kubernetes figures out what endpoints belong to your services is one of these controllers that lives on your cluster somewhere. And it watches the service, and it does the selection, and it creates the Endpoints and EndpointSlices. And as I alluded to earlier, in terms of other systems using EndpointSlices, they, for example, will have a controller similar to the Endpoints controller but perhaps populate it and do the selection in their special way and then create EndpointSlices.
And what we expect is that EndpointSlices will be the interface API that's used to talk to other systems, such as kube-proxy, load balancers, DNS. So it's like you have a way to talk about endpoints, and then things understand that and are able to do things with them.
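The selection step Bowei attributes to the controller can be sketched as follows. This is a simplified Python illustration, not the actual Endpoints controller: the dictionary layout, the `matches` and `endpoints_for` helpers, and all names and IPs are hypothetical, and only equality-based label matching is shown.

```python
def matches(selector, labels):
    """Equality-based selection: every selector pair must appear in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def endpoints_for(service, pods):
    """Addresses of ready pods whose labels satisfy the Service's selector."""
    return [p["ip"] for p in pods
            if p["ready"] and matches(service["selector"], p["labels"])]


# Hypothetical cluster state; names and IPs are made up.
pods = [
    {"ip": "10.0.0.1", "ready": True,  "labels": {"app": "web", "track": "blue"}},
    {"ip": "10.0.0.2", "ready": True,  "labels": {"app": "web", "track": "green"}},
    {"ip": "10.0.0.3", "ready": False, "labels": {"app": "web", "track": "green"}},
    {"ip": "10.0.0.4", "ready": True,  "labels": {"app": "db"}},
]

web = {"name": "web", "selector": {"app": "web"}}
web_green = {"name": "web-green", "selector": {"app": "web", "track": "green"}}

all_web = endpoints_for(web, pods)            # blue and green pods that are ready
green_only = endpoints_for(web_green, pods)   # narrower selector, same pods
```

The two selectors over the same pods also show the flexibility discussed earlier in the conversation: a Service's selector does not have to line up with any one Deployment, which is what makes blue-green setups possible.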
ADAM GLICK: Typically with load balancers, you have things like health checks, heartbeats, ways that you know that those services are still available, that say a node or a service hasn't gone away. How does that work with Kubernetes?
BOWEI DU: Health checks are an interesting thing, because Kubernetes comes with a liveness check for their pods and a readiness check. So that's at the pod level. And a lot of systems use that health check as a proxy for whether or not the pod is healthy on the network. But what we have found is that this is only just one leg of the journey.
Imagine you have a load balancer. The load balancer might live outside of the cluster. There might be some network infrastructure that needs to be provisioned in order for this load balancer to send traffic to the pod. And your local health check-- that's the thing that comes with your pod-- is local to the node, and it says the pod is healthy, but that infrastructure hasn't been set up yet.
So in some sense, if you say that the pod is healthy from the networking standpoint, well, that's sort of true, but depending on where you are in your network. It's not the whole story. And that's where this feature called Pod Ready++ comes in, which seems to be a name that the '80s wants back. What Pod Ready++ does is it says, hey, in addition to the pod readiness on a local level, let's also give a hook so we can enable network infrastructure to tell us, hey, the network, which includes all that stuff all the way out to the load balancer, is also ready to send traffic.
So there's two layers of readiness. And that is one aspect of health checking in terms of, OK, the pod is ready to come up. Now, there's another piece, which is that many of the, for example, Ingress or LoadBalancer definitions, also configure health checks for the network infrastructure to send to the pods. And one of the interesting consequences of this is, where do you get that configuration for that health check from? Do you use the pod readiness, or do you use some other configuration?
And this goes back to the decoupling between the service and the deployment. This might be a subtle point, so stick with me. So you define health checks on the pod level, but you load balance to a Service. The set of pods that are involved with the service, they might each have a different kind of health check. Now, imagine that you take that health check and you need to program an external entity to health check the pods.
Many infrastructures, they aren't that flexible. You can't just say, for this pod A, it's this health check, and then for pod B it's a different one, and pod C is a different one. A lot of infrastructures don't support that kind of granularity for programming the health check. So it's hard to just simply take the pod health check and just stick it into the infrastructure.
That brings up the question, should the health check be an aspect of the Service? Well, is it an aspect of the service? Because it's really an aspect of your application. And that's one of the things, it's not a resolved problem. It's one of these things that have been floating around in terms of ideas to think about how do we address that issue.
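The two layers of readiness described in this answer can be modeled in a few lines. The Python sketch below assumes the rule that a pod counts as ready only when its containers are ready and every listed gate condition reports "True"; the `pod_is_ready` helper and the condition type name are hypothetical and chosen only for illustration.

```python
def pod_is_ready(containers_ready, readiness_gates, conditions):
    """Ready only if containers are ready AND every gate condition is "True"."""
    gates_ok = all(conditions.get(gate) == "True" for gate in readiness_gates)
    return containers_ready and gates_ok


# Hypothetical condition type that a load balancer controller would set.
gates = ["example.com/load-balancer-programmed"]

# Local readiness probe passes, but the load balancer is not programmed yet.
early = pod_is_ready(True, gates, {})
# The controller reports that the network path is in place.
later = pod_is_ready(True, gates, {"example.com/load-balancer-programmed": "True"})
# A pod with no gates behaves as before: local readiness alone decides.
plain = pod_is_ready(True, [], {})
```

The sketch covers only the readiness side. The open question raised above, where the health check configuration for external infrastructure should come from, is not something it attempts to answer.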
CRAIG BOX: Prior to version 1.1 of Kubernetes, there was the concept of Services with LoadBalancers, things that existed in Layer 4. There was, however, a desire to bring in connectivity to cloud load balancers especially, but just Layer 7 as a concept. And from there, we get the Ingress object. What can you tell us, first of all, about the design or the history of the Ingress?
BOWEI DU: A bulk of the initial stuff predates me, but the part that was completed was the L7 aspect of it. You can see hints in the API that there was a desire to do more, but it seemed like most people have converged on the L7. And by L7, really on the internet today, we're just talking about HTTP mostly. So it's a way to describe L7 HTTP load balancing in a very simple way.
And then, there are many, many controllers on the back end that configure, for example, different clouds, different off-the-shelf proxy products like Nginx, HAProxy, and so forth. And it gives people a way to describe their application that's composed of multiple Kubernetes Services, also talk about things in terms of HTTP rather than just L4. By L4, I mean connect to this address and this port, and then you're done. You can talk about, oh, I have a hostname, and it has a path, and it goes to this service.
The other piece that is interesting is, because it's in the Kubernetes API ecosystem, people can build things on top of it. So I know that cert-manager, for instance, is able to configure certificates based on the contents of your Ingress. It also manages the LetsEncrypt flow, I think, through the Ingress object to get the certificate. So it's both a description, and then, because it's part of Kubernetes API, it's also an API for other systems.
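The "hostname, path, goes to this service" description can be illustrated with a small routing function. This Python sketch is not how any particular Ingress controller works: the rule layout, the `route` helper, the hostnames and the Service names are hypothetical, and longest-prefix matching on path segments is an assumption, since matching behavior has varied between implementations.

```python
def route(rules, host, path):
    """Return the backend Service for a request, or None if nothing matches."""
    best = None
    for rule in rules:
        if rule["host"] != host:
            continue
        prefix = rule["path"]
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            # Prefer the longest matching path.
            if best is None or len(prefix) > len(best["path"]):
                best = rule
    return best["service"] if best else None


# One hostname, two paths, two Kubernetes Services behind them.
rules = [
    {"host": "shop.example.com", "path": "/", "service": "storefront"},
    {"host": "shop.example.com", "path": "/api", "service": "orders-api"},
]

api_backend = route(rules, "shop.example.com", "/api/orders")
web_backend = route(rules, "shop.example.com", "/cart")
no_backend = route(rules, "other.example.com", "/")
```

An L4 load balancer, by contrast, would only see an address and a port and could not tell these two requests apart.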
CRAIG BOX: Like many Kubernetes APIs, it's implemented as an API and then a set of controllers, as you mentioned. In the case where you're running on a cloud, you have access to a load balancer that you can program, but there are a lot of people who are running in bare metal environments or on providers that don't necessarily have that Layer 7 load balancer. When you define an Ingress in one of those environments, what happens?
BOWEI DU: Ingress actually has a huge number of implementations. The most popular one, as far as I know, is ingress-nginx. I think that one was checked in early into the project and that people have just glommed onto it. It's super, super popular. And as with any popular thing, it does a lot of what people want.
CRAIG BOX: Now that the cloud native community has in large part coalesced around using Envoy, why do you think Nginx is still so popular?
BOWEI DU: If you have something that works, then I think people will keep using it. Envoy is interesting because it is extensible probably in a pretty flexible way with support for Wasm and different custom filters. We'll see where the community goes. As we said before, like Endpoints and other pieces that people already have gotten working, that if it's not broken, then don't fix it.
ADAM GLICK: Ingress provides support for TLS for people that want to encrypt, but it only does so on port 443. Why is that?
BOWEI DU: I didn't mention this before, but one of the philosophies of Ingress is that it was sort of lowest common denominator. And one of the lowest common denominators is to not allow you to configure many things. So 443 TLS, that is the lowest common denominator that's supported on all of the major providers, and that is the way it is. Now, of course, if you go and survey all the actual implementations, there are many that have annotations that let you change the port. But from an API perspective, we cannot expect that to be portable.
ADAM GLICK: Does that mean if you want to do things on other ports, instead of using Ingress, you end up using something like NodePort as your kind?
BOWEI DU: No. I think most people have just said, oh, well, I don't really need portability for this aspect of it, so I'm OK with the annotations.
ADAM GLICK: You did a survey about Ingress back in 2018. What did you learn?
BOWEI DU: A couple of things. One of them is Ingress is very popular. The other one, I think the key takeaway is people have this tension between portability and the features that they want. And it's very hard to say to someone, you've got to be portable, but I know that you know that the box on your controller says that it has all these cool features but you can't get at them, at least in an efficient way.
So that's one of the things that we're really focused on in terms of Service APIs redesigning this, is to be able to bridge that gap. And one of the proposals was to offer different aspects of conformance-- conformance portability. There will be a core that is always portable. So that is where the original Ingress design lay. Let's only do things that are portable.
In between that, there will be an extended set of features. And extended is like a conformance profile, if you know about the stuff that's going on in SIG Conformance, where basically, if you support a given feature-- and these will be called out features-- then it will be supported portably. So there will be basically one way to do things - if it's supported.
And then, finally, there will be a direct effort in terms of baking into the API some extension mechanism just for that third category that will never be fully portable. And we don't expect vendors and implementers to actually converge on anything there. Now, one question is, OK, why these three categories, especially the one in the middle?
We had the first one, which is the core. The last one was not great, but we can design it into the API. Why is there this middle one in there? The reason is that if we look at the different implementations that are out there-- so clouds, service mesh, proxies, and then I'm sure there's others-- is that, generally speaking, there is a convergence towards a feature set.
You'll start getting more and more feature full. For example, clouds were the ones that had the more basic APIs, but they're rapidly catching up. So at some point, we will expect that you will be able to support not just the core but other stuff. But at the same time, if you don't put these extended things into the API, at least if you did a survey of the Ingress controllers, there was no mechanism by which people could converge.
I guess you could ping on Slack and try to get convergence, but there was no community-driven, Kubernetes-driven place to say, hey, if we're going to do this thing, and I know that-- let's say 50% of the providers support it-- let's do it in the same way so that when it moves into core, we don't have to tear up our old APIs, people don't have to be broken, and so forth. So that's where the middle extended piece came from.
CRAIG BOX: In February of 2019, we spoke with Tim Hockin about the Ingress concept. And, famous last words, he said that "it's going to go GA", and he didn't see it being past 1.16. We're coming up on the 1.19 release now, and Ingress is finally going to go GA. What happened?
BOWEI DU: Well, going GA takes a long time. I think we were very optimistic about it. The point of the GA was to fix the bugs with the API specification. So there are certain pieces of it that just were documented wrong or just didn't behave like how everyone implemented it. So the biggest one of all is the fact that I think the path field was a regex. Did anyone know that? It was actually a regex with a certain ISO standard that no one implemented. So we fixed aspects like that.
The second priority was to do small changes that were benign and clarify things-- for example, renaming fields. And the final one was to just add a little bit of flexibility in there for future expansion. So what we did is we took the fact that your backends can point only to Services and tweaked it so that in the future it would be possible to add other resources there.
And then, originally the scope probably was a little bit bigger, but the most important thing is to not break users and to give people a smooth transition. So we shrunk the scope as much as possible to that. And then, just the fact that evolving APIs in Kubernetes and making sure people are OK, it will just take time. So we should take the optimistic estimate and double it.
ADAM GLICK: Should we price that into the claim that this will be going GA in 1.19 then?
BOWEI DU: Oh, no. That is done. So I'm way more confident of that.
ADAM GLICK: [CHUCKLES] Excellent.
CRAIG BOX: The other thing that Tim said in his interview was that you and the team are working on what version 2 of Ingress will look like. He said "it might look wildly different. It might have a different name." You've mentioned the Service APIs a few times so far. That is the different name for what has evolved from Ingress. What are the new Kubernetes Service APIs?
BOWEI DU: The Service APIs are basically-- as I said before, it's not a V2, it's a sequel. So it's not like we're going to deprecate the old APIs. But it is to take the L7, look at it more broadly, and not just evolve L7 but L4, and come up with a resource schema that's more broken out, that's more orthogonal and able to handle many more cases than the existing Service plus Ingress can handle today.
So the key resources-- again, this is work in progress, but this is what we have so far that the group has come up with-- is there is a Gateway resource that defines exposure and termination. So we're talking which address you have, which port, which protocol, TLS. There's a routing class of resources. So there's one that's for HTTP, describing HTTP. There's one for, for example, SNI bypass. There's one for TCP.
And then, finally, the grouping, which is going to be Service still, but mostly Service will just be used for grouping rather than, say, talking about load balancing and so forth. So when you construct, let's say, an HTTP load balancer, you will create a Gateway that points to an HTTPRoute that then points to your Services.
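The Gateway-to-HTTPRoute-to-Service chain described here can be sketched as three small Python types. This is a toy model under stated assumptions: the field names are invented for illustration, the API was still a work in progress at the time of recording, and the plain string-prefix matching is a simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Service:
    """Grouping only: a name and the endpoints behind it."""
    name: str
    endpoints: List[str]


@dataclass
class HTTPRule:
    path_prefix: str
    service: Service


@dataclass
class HTTPRoute:
    """Routing: which paths go to which Service."""
    name: str
    rules: List[HTTPRule]


@dataclass
class Listener:
    port: int
    protocol: str
    tls: bool = False


@dataclass
class Gateway:
    """Exposure and termination: address, ports, protocols, TLS."""
    name: str
    address: str
    listeners: List[Listener]
    routes: List[HTTPRoute] = field(default_factory=list)


def resolve(gateway: Gateway, port: int, path: str) -> Optional[Service]:
    """Pick the Service whose rule has the longest matching path prefix."""
    if not any(listener.port == port for listener in gateway.listeners):
        return None  # nothing is exposed on this port
    best = None
    for route in gateway.routes:
        for rule in route.rules:
            if path.startswith(rule.path_prefix):
                if best is None or len(rule.path_prefix) > len(best.path_prefix):
                    best = rule
    return best.service if best else None
```

Each type answers one question (how is it exposed, where does traffic go, what is in the group), which is the orthogonality the new schema is aiming for.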
Another reason for breaking up the resources is that, in the Ingress survey and in other feedback, we found that clusters are no longer owned entirely by one dev team. You have different teams. They have different demands on permissions. For example, it's very common that devs are not allowed to just deploy internet-facing services themselves. So you want to give people control over that.
And the mechanism that Kubernetes gives you to control these things is RBAC. And when you talk about RBAC, you need to start talking about nouns. RBAC involves nouns and verbs. You can limit the kinds of verbs that are done on the kinds of nouns. If you have different roles, then some of the responsibilities of those roles must be split across different nouns. For example, you need to be able to prevent someone from creating a configuration that exposes something. That ideally would be a different resource than, say, defining what my application looks like.
So those are the two or three tensions that are pulling the new API in its direction in terms of how the schema is laid out. One is making it more orthogonal, so talking about exposing applications separately from the routing of the application and separately from the grouping. And the other is talking about permissions and giving people more nouns to work with to segment their permissions.
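The "nouns and verbs" point can be made concrete with a very small permission check. The Python below is a simplified model of the idea, not the Kubernetes RBAC implementation; the role names and resource names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Rule:
    verbs: FrozenSet[str]      # e.g. create, get, update
    resources: FrozenSet[str]  # the nouns, e.g. gateways, httproutes


@dataclass(frozen=True)
class Role:
    name: str
    rules: Tuple[Rule, ...]


def allowed(role: Role, verb: str, resource: str) -> bool:
    """A request is allowed if any rule lists both the verb and the noun."""
    return any(verb in r.verbs and resource in r.resources for r in role.rules)


# Developers describe routing and grouping; only operators may expose things.
developer = Role("developer", (
    Rule(frozenset({"create", "get", "update"}),
         frozenset({"httproutes", "services"})),
))
operator = Role("operator", (
    Rule(frozenset({"create", "get", "update", "delete"}),
         frozenset({"gateways"})),
))
```

Because exposure lives on its own noun, withholding `create` on `gateways` is enough to stop a developer from exposing an application, without touching their ability to define routes.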
ADAM GLICK: Last week, we talked about storage. And they have a concept of a StorageClass. Now there's a GatewayClass. And I'm wondering if you learned anything from what SIG Storage was doing?
BOWEI DU: Yeah. We actually talked a lot to SIG Storage folks to understand, did you hate StorageClass? Why did you put it on the cluster level? What are the design gotchas? And that's what we are using to define GatewayClass. Luckily, GatewayClass right now is pretty simplistic. It's just a way to say, hey, given this controller and the set of parameters for the controller, a cluster operator can say, this thing is available to you to use to create Gateways.
In my previous example, the operator can, say, create a GatewayClass called "exposed to the Internet", and then anyone who has the permission to create Gateways of that class can expose things to the internet directly. And then, using RBAC, you can prevent people from doing it as well.
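As a rough sketch of the relationship described here: a GatewayClass pairs a controller with parameters, a Gateway names a class, and each controller acts only on Gateways whose class belongs to it. The Python below assumes hypothetical field names and controller identifiers; it is not the actual resource schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class GatewayClass:
    """Published by the cluster operator (illustrative fields)."""
    name: str
    controller: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Gateway:
    name: str
    class_name: str


def gateways_for(controller: str,
                 classes: List[GatewayClass],
                 gateways: List[Gateway]) -> List[str]:
    """Names of the Gateways that the given controller should act on."""
    owned = {c.name for c in classes if c.controller == controller}
    return [g.name for g in gateways if g.class_name in owned]
```

The class is the operator's statement of what is on offer; who may create Gateways at all is still a question for RBAC.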
CRAIG BOX: When you have two different objects with a relationship between them-- we have a Gateway which defines the ingress point, if you will, to your cluster for HTTP traffic, and then you have a Route, which might say, this path goes to this particular Service. You could define Routes as properties of the Gateway, or for a particular Route, you could say which Gateway it relates to. You've got the many-to-many relationship. How do you decide which thing you add as the property to the other?
BOWEI DU: Like everything in Kubernetes and all the SIGs, you can imagine we talked about this back and forth. There were many proposals in either direction. We felt like Gateway to Route was the more natural one in that, especially in the RBAC permission model, having a Gateway to expose things is a higher level of privilege than defining the Routes.
Now, this is where, especially those listening on the podcast who are interested in this, we would really appreciate feedback from the users. I know at KubeCon we had a little meetup where everyone in SIG Network who is interested in these APIs got together and did some whiteboarding. And we had one user in that session. All the other people were controller authors. And we would really appreciate more user feedback, especially as we converge more and more on what the first draft is. So this is something where user feedback would be very much appreciated.
CRAIG BOX: Moving from Ingress to Gateway as a concept carries with it an implication that it's no longer just about bringing traffic in from outside to inside the cluster. Presumably, now we can use the Gateway analogy to model things inside the cluster, maybe for internal load balancing within a network or just internal to a single cluster. Is that a deliberate choice?
BOWEI DU: Yes, it is. Because I think the Kubernetes networking model doesn't actually define what inside or outside a cluster means. And we found confusion. For example, if you create an internal load balancer, like a private IP load balancer, is that inside your cluster or outside your cluster? Well, it depends on how your networking is set up. And Kubernetes doesn't impose a lot of restrictions on how that's set up.
So that's why we went with the Gateway name instead of Ingress, because for a lot of people, just at the level of what the English word means, Ingress suggests pulling traffic, potentially from the public internet, into the cluster. And I think it's more general than that. We're describing load balancers and how services talk to each other at a level that could be L4, could be L7, could be internal. So it's a way to, hopefully, make things more generic.
ADAM GLICK: You mentioned Layer 4 and Layer 7. But currently, Ingress only works at Layer 7 or the HTTP layer. How do people deal with other protocols and things that they want to deal with more at the Layer 4 packet level?
BOWEI DU: That's something that's being actively discussed right now. So it's not even hot off the presses. It's in the kitchen right now. So at the super 10,000-foot level, you would have a Gateway with a particular class whose controller allows you to create L4 load balancers, and then you would attach a TCPRoute or IPRoute to it. We have to decide on what the specific routing is.
And then, that would name Services. So that way, you can use that Gateway. As I said before, what is inside the Gateway? It's how it's exposed, what ports are there, what protocols are there, where the address is, and all the status around that. That's the same for L7 and L4. You can ignore some of the bits that don't exactly overlap. But generally speaking, that's what that represents.
So that works for L4. Then, on the Route side, instead of having some uber union type of all the different routing protocols, the way we're thinking about this problem is to basically have a class of resources all called ...Route-- for example, HTTPRoute, TCPRoute-- so that we can be very flexible in terms of describing the different kinds of routing according to the protocol. And then, of course, we have Services at the very base, which is the grouping mechanism.
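The "one resource per protocol" idea can be illustrated with two small route types, each carrying only the fields its protocol needs. This Python sketch is an assumption-laden toy: the fields, the matching rules, and the `backend_for` helper are hypothetical, and the L4 design was still under discussion at the time of recording.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class HTTPRoute:
    """L7 routing: match on host and path."""
    host: str
    path_prefix: str
    service: str


@dataclass(frozen=True)
class TCPRoute:
    """L4 routing: match on port only."""
    port: int
    service: str


Route = Union[HTTPRoute, TCPRoute]


def backend_for(routes: List[Route], protocol: str,
                port: int = 0, host: str = "", path: str = "") -> Optional[str]:
    """Return the Service name chosen by the first matching route."""
    for route in routes:
        if protocol == "HTTP" and isinstance(route, HTTPRoute):
            if route.host == host and path.startswith(route.path_prefix):
                return route.service
        if protocol == "TCP" and isinstance(route, TCPRoute):
            if route.port == port:
                return route.service
    return None
```

Adding another protocol then means adding another small type, rather than widening one schema that every controller has to understand.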
CRAIG BOX: In the process of testing and building out these APIs, you have had the ability to build them as custom resources because now custom resources exist and effectively give you all of the same primitives that you get with a built-in resource. When the APIs are finished and baked, do you see them continuing as an add-on that you install, or do you think that they will then become a core part of the API server?
BOWEI DU: Custom resources are, as far as I know, the way to do new APIs going forward. And I would imagine as custom APIs become more and more powerful, there's just less and less reason to put it in the core. So putting it in the core imposes quite a big tax on maintenance and evolution. Yeah, mostly maintenance. And it is trying to break up the monolith.
For now, it will be custom resources because that is the fastest way to evolve things without having to roundtrip through the core. If custom resources suffice in terms of what they are able to do, then we should just keep it outside.
CRAIG BOX: Is there a way to say Kubernetes 1.22 has these custom resources and that they come packaged with it so you can guarantee that everyone running a particular version has access to these things if they're not built in?
BOWEI DU: This has been raised as an issue which we have some resolution on in the project that eventually, yes, they will be versioned with Kubernetes. But they will stay custom resources. I guess someone has already done it on the storage side, in that there are some storage resources that are CRDs that are distributed as part of the OSS. I think this is just one of the things that needs to be figured out when it matures, which is to attach it to some Kubernetes version.
Luckily, with CRDs and custom resources, you have the ability to target how much support a custom resource has rather than, say, support for the entire Kubernetes version. So right now, we say that you need to have a GA custom resource, I think, in order to run it, but nothing more. Now, that's very flexible. You can imagine that, let's say, that's all you needed, and most of the fancy bits are going to be in the controller.
Now you can take this feature and put it in your 1.16 cluster, even though Kubernetes at that point may be up to, like, 1.20.
ADAM GLICK: Ingress and the new APIs exist in parallel. And you mentioned they do some of the same things. But will they both continue to be developed?
BOWEI DU: The general feeling is that Ingress will still stay and it will still be supported as GA. But probably there will be less pressure to make large changes to it because we can do them in Service APIs and we can do them in a clean way. So the challenge with Ingress is that it has an existing user base that is very big. And whenever you have a very big user base, making large changes to it at the concept level is extremely hard.
And also, Ingress, why does it have a large user base? It's because it does some things very well. It serves the self-provisioning, empowered developer with very simple applications, and it describes those very well. So I think that use case is actually well met by that resource. But if you get more complicated, if you need to start applying RBAC, if you need to start configuring features that are less generally portable, but portable among the implementations that support them, then you would be looking at Service APIs. So it will be supported, but I feel like there will be less pressure to majorly evolve it.
ADAM GLICK: It sounds like it's going into maintenance mode as soon as it hits GA to a certain extent. Obviously, there's a large base of usage that needs to be supported. Is there a long-term plan to eventually merge these two?
BOWEI DU: Not right now, given that the use cases are somewhat different. If Service APIs and Gateway resources pick up and become so predominant that there is less use of Ingress, then we could think about moving, but I don't see that right now.
ADAM GLICK: You mentioned earlier you're looking for feedback from not only the people who are helping build this but actually the people that are using this technology and really getting more user feedback. Where would you like them to send that feedback?
BOWEI DU: It might be a little early. We're looking at trying to produce a first draft, and then, once the controller authors get something out there, we would love to see feedback. And this is something that I think when we're designing these things as the controller author, it's very easy to get wrapped up into how you think people are going to use it.
But when people actually use it, they find out all sorts of things that you would have never thought of. So as soon as the first draft comes out, we will be promoting this heavily to try to get actual user feedback to see how people are using it.
CRAIG BOX: When do you expect that first draft? And given what you said before about timelines, do you have a rough idea of when you think this might become generally available?
BOWEI DU: GA, I cannot give you an estimate on that.
CRAIG BOX: But are we talking a few quarters, or are we talking 10 years?
BOWEI DU: It really depends on user feedback, I think. If the users come back to us and say, hey, go back to the drawing table, then, well, that's going to take a bit longer. If the users are generally comfortable with what we have, then we aim to have some sort of rough draft, I'm hoping in a month or two. And based on that, how quickly we can move ahead really depends on how well this meets the use case.
If it meets that use case well and it doesn't have too many drawbacks, then we would feel comfortable with moving forward. If we get lots of, like, oh, but this doesn't really match how I think about it and I can't do this and this, then we would have to go back and redesign.
CRAIG BOX: We encourage people to run Kubernetes clusters in a single failure domain. And so in order to build a replicated regional or global service, we will quite often tell people to build more than one cluster. All vendors now seem to have built multi-cluster Kubernetes support in as table stakes especially to their enterprise products.
In order to run a particular thing in multiple clusters, there are now a bunch of different people with different ways of running multi-cluster services, being able to define a Service or perhaps an Ingress that points to multiple Services in multiple places. How are we making sure that those things that are being led by vendors converge on one central thing in the main Kubernetes project?
BOWEI DU: We are actively working with SIG Multicluster. So SIG Multicluster has one main proposal on how you do Services across multiple clusters, to treat them in some way as a single thing. And we're actively right now talking to them in terms of, like, OK, what if you pointed a Route at that? What would that mean semantically? So that's something that's definitely present in our mind.
I think one of the challenges is that it is not clear right now-- the definition of the load balancer, for example the HTTP routing, seems like it needs to live in one place from a user API perspective. Now, the definition of your Service, that might actually be split across multiple clusters. There is a definition of, for example, centralized stuff like routing, and then the distributed stuff like your Services and what Endpoints comprise your Service.
And we're actively working on how to join those two together. So actually, that's another thing that this is very time boxed in terms of we are actively working on it immediately right now, like this very moment.
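To make the split between "centralized routing" and "distributed Services" easier to picture, here is a purely illustrative Python sketch. It is not the SIG Multicluster proposal, whose semantics were still open at the time of recording; the data shapes and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_endpoints(clusters: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Combine per-cluster endpoints into one list per Service name.

    clusters maps cluster name -> {service name -> [endpoint addresses]}.
    """
    merged = defaultdict(list)
    for cluster_name in sorted(clusters):  # sorted for a deterministic order
        for service, endpoints in clusters[cluster_name].items():
            merged[service].extend(endpoints)
    return dict(merged)


def endpoints_for(path: str,
                  routes: List[Tuple[str, str]],
                  merged: Dict[str, List[str]]) -> List[str]:
    """Apply one central routing table, then look up the merged endpoints.

    routes is an ordered list of (path prefix, service name) pairs.
    """
    for prefix, service in routes:
        if path.startswith(prefix):
            return merged.get(service, [])
    return []
```

The routing table exists once, while the endpoint lists are assembled from every cluster that runs the Service; joining those two views is the open problem being described.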
ADAM GLICK: Finally, you've been working in networking for a long time and in some really interesting locations. What are some of the memorable moments that you've had?
BOWEI DU: I have a couple of stories there. One of them-- I wasn't personally involved in it, but this is from my research group-- is that they set up a wireless network in rural India, at this hospital. They were connecting ophthalmologists to rural clinics. And then they get this thing. I was like, oh yeah, the network stopped working. It's not connecting anymore. What's going on? So they were like, oh, is it this? Is it that? They were trying to debug it.
And then they send them a picture. They're like, oh, can you take a picture of the antenna? So it turned out that the hospital had undergone expansion, and they basically built an elevator shaft in front of the antennas. So they expanded the building around the antenna, and it was no longer able to have line of sight to the rural clinic.
ADAM GLICK: Physical obstruction for their FM signal that basically removed line of sight?
BOWEI DU: The other thing is, as part of the research project, we put up a wireless tower just to try to see, OK, what is the cheapest wireless tower you can put up? I think it was like 50 feet or something. You can imagine how well this goes with grad students. But we put it up. And then, at some point they have to take it down. It's like, how do you take down a wireless tower that you put up? It's a pole. It turned out we got some guy who actually works on wireless towers. He said, OK, this is how you do it. Take one of the guy lines. Attach it to a pickup truck. Back up the pickup truck. Have a guy with a chainsaw basically cut it down like a tree, and then just slowly drive the truck and lay it down. I was like, oh, interesting.
CRAIG BOX: All right, Bowei, thank you very much for joining us today.
BOWEI DU: Yeah, thank you.
CRAIG BOX: You can find Bowei on Twitter @BoweiDu.
CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show and you haven't subscribed yet, please do. Also, if you can help us spread the word and tell a friend, we'd appreciate it. If you want to give us any feedback, you can find us on Twitter @KubernetesPod or reach us by email at firstname.lastname@example.org.
ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes. Until next time, take care.
CRAIG BOX: See you next week.