Kubernetes Podcast from Google: Episode 106 - CoreDNS, with John Belamaric

#106 June 2, 2020

CoreDNS, with John Belamaric

Hosts: Craig Box, Adam Glick

In a world where pods (and IP addresses) come and go, DNS is a critical component. John Belamaric is a Senior SWE at Google, a co-chair of Kubernetes SIG Architecture, a Core Maintainer of the CoreDNS project and author of the O’Reilly Media book Learning CoreDNS: Configuring DNS for Cloud Native Environments. He joins Craig and Adam to discuss CoreDNS, the evolution of DNS in Kubernetes, and how name resolution has been made more reliable in recent releases.

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

First off, I wanted to say something about the death of George Floyd last week, and the unrest here in the United States that has come as a result. We don't usually talk about what is happening outside the world of Kubernetes and Cloud Native on the show, but we do make an exception when something is connected to Craig or myself.

To that end, I grew up in the Twin Cities where George Floyd died. Much of my family and many friends still live in the area, and I've been to the places that you've probably seen on the news.

George's death, along with many of the other injustices that we've seen in the US just in the past week, have been evidence of a painful and wrongful divide in the United States. We all feel great anger at what continues to happen and great sadness for the violence that's destroying our community.

My personal hope is for both calm and justice to come swiftly. I believe these are interrelated and cannot happen without each other. And I hope that we can begin the process of healing, rebuilding, and fixing the inequality in our society.

[MUSIC PLAYING]

CRAIG BOX: Some good news from the US last week as well. I spent the weekend watching the webcast of the SpaceX Demo-2 mission launching two astronauts safely to the International Space Station. I assume you watched it as well.

ADAM GLICK: I did. It's always exciting.

CRAIG BOX: It was scrubbed initially due to the weather. I just noticed today that they say it's the sunniest spring on record in the UK. They had something like 600 hours of sun, where the average was around 450, and the previous high was in the 500s. And it makes you think, well, maybe we should be launching astronauts from England?

ADAM GLICK: We'll move the spaceport over there. Although, interestingly enough, I believe Richard Branson has a Virgin spaceport that is located here in the United States, but I don't know if there's one in the UK.

CRAIG BOX: No. There is a spaceport under construction for some low Earth orbit satellite launches or something. New Zealand's actually a bit further ahead of the UK in that regard. But we do have the occasional astronaut up to the ISS launched from Russia. And possibly now we're going to take a ride on the SpaceX rockets.

ADAM GLICK: If you had the money, would you do it?

CRAIG BOX: That's a tough question. It would have to be pretty much guaranteed that it was perfectly safe. I think I'd like it to be a little bit more comfortable than it looks on the TV.

ADAM GLICK: We were having that discussion with my family and some of my friends this week about, would you do it if you could? I decided I would. My wife decided that she's not so sure.

CRAIG BOX: Richard Branson was selling the opportunity to take a $200,000 flight to the edge of space and back. I don't know if you can call it going into space, possibly a few minutes of weightlessness. And given the global economy situation, I think Richard Branson has got a lot of things for sale at the moment. So maybe there will be a discount on that.

ADAM GLICK: Last week I mentioned that I'd been watching a YouTube video. And I sent a link in the show notes about people that are building puzzle boxes out of LEGO. And I went deep down that rabbit hole this week-- got myself a large collection of LEGO, and started building some. And so I've set out three of them on the kitchen table, of which my wife has stumbled past and started to try and open them. So that was a fun distraction from some of the other goings on.

CRAIG BOX: Worst case scenario, can you open them with a hammer?

ADAM GLICK: [LAUGHING] Yes. Yes. There is always the backdoor smash and grab method that does still work.

Shall we get to the news?

CRAIG BOX: It must be time for the news.

[MUSIC PLAYING]

ADAM GLICK: The CNCF has announced that Dan Kohn, our guest in episode 35, is stepping down from his role as executive director to focus on a public health project inside the Linux Foundation.

Priyanka Sharma is joining the foundation as the new General Manager. She was most recently the Director of Cloud Native Alliances at GitLab and was a founding team member of the OpenTracing standard. Congratulations to Priyanka, and we wish Dan the best in his new role.

CRAIG BOX: Security vendor Aqua has announced Starboard, a Kubernetes-native toolkit for finding risks in your Kubernetes environments. Custom Resources are created to expose vulnerability information, workload audits, CIS benchmark, and pen-testing results. Plugins are provided for kubectl and Octant. Plans for the future include a pluggable operator and admission webhook to take policy decisions from any Starboard-compatible CRD. The announcement comes from Aqua's Liz Rice, who was our guest on episode 19.

ADAM GLICK: Mirantis have made their first release of Docker Enterprise after acquiring the product in November 2019. Docker Enterprise 3.1 includes Kubernetes on Windows, GPU support, automatic installation of Istio for ingress, a new installer CLI, and Kubernetes 1.17. The word Swarm appears in the post once, which suggests that it's still a supported technology at this time.

CRAIG BOX: As with every version 3.1, we eagerly await the release of Docker Enterprise for Workgroups.

The tools half of Docker held a one-day virtual DockerCon last week, which they kicked off by announcing a collaboration with Microsoft Azure.

Docker Desktop will allow you to create Azure Container Instances and set them as a context to run containers in. There will also be integrations with Visual Studio Code, as well as continued work on the previously announced Compose specification. This is all available as a private preview with a beta expected by the end of the year.

ADAM GLICK: Cluster backup tool Velero, formerly Heptio Ark, has released version 1.4. The release includes support for volume snapshots using the Container Storage Interface, progress tracking for backups, and improvements around restoring objects to different versions of Kubernetes than the version they were backed up from.

CRAIG BOX: Agones, the game server hosting operator for Kubernetes, last week released version 1.6. This introduces player tracking, which allows you to keep track of which players are connected to your game servers, and what capacities they have. Learn about Agones in episode 26.

ADAM GLICK: Chef has announced support for packaging Windows applications to run in Google Kubernetes Engine. Their Habitat product will now allow customers to package Windows applications and their dependencies using their technology to avoid bloating containers to enable legacy Windows applications to run in a managed Kubernetes environment with GKE.

CRAIG BOX: Speaking of legacy, in the week that Java turned 25, Red Hat announced that Quarkus, their runtime for Java on Kubernetes, has been added to the Red Hat runtime subscription product. The Red Hat build sprinkles unspecified secret sauce on the open source project and brings integration with other Red Hat middleware, as well as the continued promise of "developer joy", possibly now with an SLA.

ADAM GLICK: AWS has released server-side encryption for Fargate users of ephemeral storage, which only exists for the lifetime of the pod. The AES-256 encryption keys are automatically stored and managed by AWS, and the encryption is available for new pods started in Fargate version 1.4 on or after May 28.

CRAIG BOX: From the "for some reason I assumed that already existed" department, PlanetScale has launched an open source Vitess operator for Kubernetes. Anthony Yeh worked first on the Vitess team at YouTube, and then the Kubernetes team at Google Cloud. So he was uniquely placed to write an operator for running Vitess on Kubernetes. The operator, a version of which is used to run PlanetScale's commercial database, has been available as closed source since last year. Learn more about Vitess in episode 81.

ADAM GLICK: HashiCorp has announced an alpha for a new Kubernetes provider that works with the Terraform deployment tool. This update is designed to allow you to control the packaging, deployment, and management of Kubernetes resources using the HashiCorp Configuration Language, or HCL. The tool will convert your YAML into HCL and can package up custom resource definitions and operators with your application.

There are some limitations, such as needing Kubernetes 1.17 or later, that are listed in the release. So check out the release notes before deploying. The alpha is not currently available in the Terraform provider registry, so if you want to try it out, you can download it from GitHub.

CRAIG BOX: Google announced last week that the Vulnerability Rewards Program has been extended to cover Google Kubernetes Engine. A new open source Kubernetes Capture-the-Flag project has been released on GitHub. And if you can exploit a hardened GKE cluster running the software, you can earn up to $10,000 US, as well as additional rewards from Google or the CNCF, depending on the exploit. The program covers exploitable vulnerabilities and all dependencies that can lead to a node compromise, such as privilege escalation bugs in the Linux kernel, underlying hardware, or other components of the cloud infrastructure.

ADAM GLICK: Google's Charles Baer and Xiang Shen have posted their second blog in a series focusing on monitoring and debugging applications running in Google Kubernetes Engine. The post calls out where to look for errors, how to set up notifications based on data and log files, and how to receive alerts in email or through SMS. If you want more information, you can also join the Google Cloud mailing list for monitoring and debugging tools.

CRAIG BOX: GCP also published a blog on how Migrate for Anthos helps modernize Java apps running on VMs. The post talks about different Java middleware solutions and how the migration tools look for the right resources and allocate the correct libraries and system resources. It then dives into the details of how a migration actually happens.

ADAM GLICK: The CNCF has released their latest project journey report, this time for Helm. The report notes that in the two years since Helm joined the CNCF on June 1, 2018, there has been a 41% increase in companies contributing to the project and a whopping 316% growth in contributors. The project now has over 30,000 GitHub stars, and receives more than 2 million downloads per month. Congratulations to the Helm project. If you want to know more about Helm, check out episode 102 with Matt Butcher.

CRAIG BOX: A more critical perspective on Helm 3 comes from a post by Sandor Guba of Banzai Cloud this week. A recurring theme is that Helm is a template language without a true understanding of how Kubernetes works underneath it. Some objects have fields which can't be changed, which Helm is unaware of. The rest of the post talks about real-world experience of the differences between Helm 2 and 3, and explains why Banzai have decided to take a different tack for application installation.

ADAM GLICK: The US National Institute of Standards and Technology has published deployment and security guidance for a proxy-based service mesh. The report, written by Ramaswamy Chandramouli of NIST and Zack Butcher of Tetrate, complements a previous report by Chandramouli last year, which talked about security strategies for microservices applications.

CRAIG BOX: Finally, Google Cloud developer advocate and kubectl plugin guy Ahmet Alp Balkan, whom you met in episode 66, is now a YouTube star. In 15 bite-sized videos 2 to 5 minutes each, he promises to show you how to use and develop a kubectl plugin. Two videos are out at the time of this recording, but you can subscribe to his new channel to see them as they are released.

ADAM GLICK: And that's the news.

CRAIG BOX: John Belamaric is a senior staff software engineer at Google focused on GKE and open source Kubernetes. He is a co-chair of Kubernetes SIG Architecture, a core maintainer of the CoreDNS project, and author of the O'Reilly book "Learning CoreDNS: Configuring DNS for Cloud Native Environments." Welcome to the show, John.

JOHN BELAMARIC: Thank you, Craig.

ADAM GLICK: Most everyone on the internet uses DNS, the Domain Name System, hundreds of times a day, but not everyone may understand what it is or how it works. Can you describe what DNS is and how it works?

JOHN BELAMARIC: DNS is basically the system used to translate the names we type in the browser or whatever into the addresses that the machines behind them or the services behind them have. So when you type google.com in your browser, a request is made to a server out there, the DNS server, the Domain Name System server, that will translate that back into an IP address. And that's what you use to make the actual connection.

ADAM GLICK: Is there one giant DNS server that sits out there that everyone's requests go to? And if so, who runs it?

JOHN BELAMARIC: If we look at the top-level domain dot-com there's actually a hidden domain at the end of that called root. There's a dot at the end that you don't actually have to put in. And then we have what we call the root servers. The root servers handle all of those top-level delegations. The domain name system is this globally distributed system of authority.

So essentially, if I run a name server, I can declare that that name server is authoritative for a certain domain. But since it's hierarchical-- so the trusted root servers are the ones that say, here's a set of servers that are authoritative for, say, these top-level domains. And then there may be a server for that top-level domain or many servers for that top-level domain that says, I'm authoritative for this other set of domains, say, google.com. And the trust goes down from the top of the hierarchy down lower and lower.

So in order to delegate the authority for google.com to a particular name server, I have to have the authority to modify the dot-com zone. And in order to do that, I have to control that server. And I have to be very trusted at that point. So you've got this sort of delegation of trust coming from the root all the way down to the different domains lower in the hierarchy.
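The chain of delegation John describes can be sketched in a few lines of Python. This is only an illustration: the function name `delegation_chain` is made up here, and it does no network lookups. It just lists the zones that authority passes through, starting from the root dot that is normally left off the end of a name.

```python
from typing import List


def delegation_chain(name: str) -> List[str]:
    """Return the zones from the root down to the fully qualified name."""
    # Normalise: drop the optional trailing dot and lower-case the labels.
    labels = [label for label in name.rstrip(".").lower().split(".") if label]
    chain = ["."]  # the root zone, the usually omitted dot at the end
    # Walk the labels right to left: com., then google.com., and so on.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("www.google.com"))
# ['.', 'com.', 'google.com.', 'www.google.com.']
```

Each entry is a point where one set of servers can hand authority to the next set further down the hierarchy.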

That's what we call authoritative DNS. That's who's responsible for those. But even that is often too much demand on those authoritative servers for a very popular domain. And so there are what are called caching DNS servers, which you might run locally within your environment.

ADAM GLICK: When you say something like a caching server, would that be something like I get my internet router, and my internet router connects to my ISP that runs a DNS server, or in my corporation, there might be an internal DNS server where they have cached that information and they've pulled their information from one of those root servers that's authoritative. And so you don't have to do a bunch of extra traffic out to those. And also, if companies need to control that DNS for some reason, they have a proxy that sits in between those two.

JOHN BELAMARIC: Yes, in essence. So you can think of there are sort of three most common ways of deploying-- or categories of DNS servers that may get deployed. There's authoritative ones that say, I own the information on this zone. Then there's what we call recursive servers. And then there's what you might call a caching layer.

A recursive server is probably the most common, I would guess, out there in the sense that that's what you would have sitting at your corporate edge that goes out, and it does that recursive lookup. So it says, oh, you want to look up food.com. Well, I don't know the name server for food.com. So I'm going to go to the dot-com name server. I do know the name server for that. And so then I'm going to ask it where the food.com name server is. And then I'm going to ask the food.com name server for the actual record.

So if you can imagine, you've got three or four segments of this domain name. And you could see how that could be a process where you kind of are repeatedly going through and navigating the hierarchy of those domains. That's a recursive name server.
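The walk through the hierarchy can be simulated without touching the network. In the sketch below, the `SERVERS` table, the server names, and the `192.0.2.10` address are all invented for illustration; a real recursive server speaks the DNS protocol to real name servers and handles many more cases.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the internet. Each server holds either
# a referral ("NS": ask that server next) or a final address ("A").
SERVERS: Dict[str, Dict[str, Tuple[str, str]]] = {
    "root-server": {"com.": ("NS", "com-server")},
    "com-server": {"food.com.": ("NS", "food-server")},
    "food-server": {"www.food.com.": ("A", "192.0.2.10")},
}


def resolve(name: str, max_hops: int = 10) -> Tuple[str, List[str]]:
    """Follow referrals from the root; return (address, servers asked)."""
    if not name.endswith("."):
        name += "."
    server = "root-server"
    asked: List[str] = []
    for _ in range(max_hops):
        asked.append(server)
        records = SERVERS[server]
        if name in records and records[name][0] == "A":
            return records[name][1], asked
        # Otherwise look for the most specific referral covering this name.
        referrals = [
            zone for zone, (kind, _) in records.items()
            if kind == "NS" and (name == zone or name.endswith("." + zone))
        ]
        if not referrals:
            raise LookupError(f"{server} has no data for {name}")
        server = records[max(referrals, key=len)][1]
    raise LookupError(f"gave up on {name} after {max_hops} hops")


address, asked = resolve("www.food.com")
print(address, asked)
```

The `asked` list shows the repeated questions: first the root, then the dot-com server, then the server for food.com.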

Often, they're also obviously caching. You can also put a caching layer, which doesn't do all that complicated stuff, but literally just talks to the next recursive server and holds things in a local cache.
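A caching layer of this kind is small enough to sketch. The class below is a simplified illustration, not how any particular DNS server implements it: `CachingResolver` and its parameters are invented names, and it uses one fixed lifetime for every entry, whereas real DNS records carry their own TTL.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Answer from a local cache; ask the upstream resolver only on a miss."""

    def __init__(
        self,
        upstream: Callable[[str], str],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)
        self.upstream_queries = 0

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no upstream traffic
        self.upstream_queries += 1
        address = self._upstream(name)
        self._cache[name] = (address, now + self._ttl)
        return address
```

The `upstream` callable could be anything that turns a name into an address, for example the next recursive server in line.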

ADAM GLICK: How did you get into DNS?

JOHN BELAMARIC: I actually came into DNS sort of sideways. I worked for a startup where we built a network management appliance that went out and collected data from network devices, and automated jobs against network devices, that sort of thing. And we were acquired by Infoblox. Infoblox made DNS and DHCP appliances. And so we became a part of their portfolio. But I worked for them for 11 years before joining Google. And so being at a company where 80%, 90% of the revenue comes from DNS, you pick some things up.

CRAIG BOX: When I run a service in Kubernetes, I want to be able to refer to it by a name which makes sense to me, rather than having to know its IP address, especially because there will be multiple pods, and they come and go and so on. How does Kubernetes use DNS?

JOHN BELAMARIC: The service infrastructure, the service resource, really, a big part of that is the naming. And it identifies the DNS names for the different pods you want to reach and maintains the load balancers and backends.

So I'm sure most of the listeners are familiar with Kubernetes services, but essentially, it allows you to create this resource that identifies the selector for the pods that will be part of the backend of that service. And it often will provide a VIP or a Virtual IP address as the primary address for that service. And then since that service has a name and it lives in a namespace, which lives in a cluster, there is what we call the DNS schema for Kubernetes. And this defines exactly how you will use DNS to look up the addresses of the services or the endpoints backing those services in some cases.
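As a rough illustration of that schema, a service name can be built from the service, its namespace, and the cluster domain. The helper names below are invented, `cluster.local` is assumed as the cluster domain, and only plain service names are handled, not the longer names used for individual endpoints.

```python
from typing import Tuple


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the DNS name for a service: <service>.<namespace>.svc.<cluster domain>."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def parse_service_name(fqdn: str, cluster_domain: str = "cluster.local") -> Tuple[str, str]:
    """Split a service DNS name back into (service, namespace)."""
    suffix = ".svc." + cluster_domain
    name = fqdn.rstrip(".")
    if not name.endswith(suffix):
        raise ValueError(f"{fqdn} is not a service name in {cluster_domain}")
    service, _, namespace = name[: -len(suffix)].partition(".")
    if not service or not namespace or "." in namespace:
        raise ValueError(f"{fqdn} is not of the form service.namespace.svc.{cluster_domain}")
    return service, namespace


print(service_fqdn("mystore", "default"))
# mystore.default.svc.cluster.local
```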

CRAIG BOX: The Kubernetes service object has a name, and it has the pods, the IP addresses that it needs to resolve to. Where does it push that information to make DNS available?

JOHN BELAMARIC: The service is stored in etcd, of course, like everything else. And the API server is sitting on top of etcd. So CoreDNS integrates directly with the API server, and essentially works like any other controller in Kubernetes.

So when you start up CoreDNS, it establishes a connection to the API server, and it establishes what we call a watch on specific resources: endpoints and services. A watch is a function of the API server that basically allows a client to say, I'm interested in resources of this type, and maybe with this label selector, or with certain criteria.

And then anytime there's a change, or an update, or an add, or a delete, or whatever it may be to something of that type, it will send that information directly to the client. So instead of the client having to poll repeatedly, it's able to just open a long-lived connection that gets fed the different updates as they happen.

So this is what CoreDNS does. It creates a watch with the API server that's listening for the specific types of resources that it's interested in and holds them in memory. So it's got a local cache of all of that information. When a DNS request comes in, therefore it just looks it up in its local cache and returns the response.
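The watch-then-serve-from-memory pattern can be sketched as follows. This is a toy model, not CoreDNS code: the `ServiceCache` class and the tuple layout of an event are invented, the event stream is a plain Python list rather than a connection to the API server, and `cluster.local` is assumed as the cluster domain. The event types `ADDED`, `MODIFIED`, and `DELETED` are the ones a Kubernetes watch delivers.

```python
from typing import Dict, Iterable, List, Tuple

# (event type, namespace, service name, IP addresses)
Event = Tuple[str, str, str, List[str]]


class ServiceCache:
    """Keep an in-memory view of services, updated only by watch events."""

    def __init__(self) -> None:
        self._records: Dict[str, List[str]] = {}

    def apply(self, event: Event) -> None:
        kind, namespace, name, ips = event
        key = f"{name}.{namespace}.svc.cluster.local."
        if kind in ("ADDED", "MODIFIED"):
            self._records[key] = list(ips)
        elif kind == "DELETED":
            self._records.pop(key, None)
        else:
            raise ValueError(f"unknown event type: {kind}")

    def consume(self, events: Iterable[Event]) -> None:
        for event in events:
            self.apply(event)

    def lookup(self, qname: str) -> List[str]:
        # Queries never go back to the API server; they read the local cache.
        if not qname.endswith("."):
            qname += "."
        return self._records.get(qname, [])


cache = ServiceCache()
cache.consume([
    ("ADDED", "default", "mystore", ["10.0.0.12"]),
    ("MODIFIED", "default", "mystore", ["10.0.0.13"]),
])
print(cache.lookup("mystore.default.svc.cluster.local"))
# ['10.0.0.13']
```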

ADAM GLICK: CoreDNS wasn't the first DNS server that was in Kubernetes, but it's the one that is most commonly used now. What was there before? And why did the change happen?

JOHN BELAMARIC: CoreDNS wasn't originally written to work with Kubernetes. And while that's probably our biggest use case, it's used for many, many other things. But it was originally written, just if you go back, back in time to 2016, Kubernetes was not necessarily going to win the container orchestration wars. We had many other ways of doing service discovery for container-based systems that even weren't necessarily orchestrated.

So maybe you had some Docker machines, and you'd listen for Docker events, and you'd go and register something in etcd. And then you had SkyDNS to serve up that data. That was the sort of world that this was originally created in, to replace SkyDNS, which was fine, but didn't have the extensibility and flexibility that the original author, Miek Gieben, desired to see out of a DNS server.

So he saw this web server called Caddy. And he actually took Caddy and said, I'm going to try and make a DNS server that follows the same design pattern and actually forked Caddy. The original very early version of CoreDNS was a fork of Caddy.

What it has is this sort of pipeline structure of request processing. And so some of the philosophy came out of that. Let's build that-- use it to replace SkyDNS. So that's sort of the early history of CoreDNS.

Now, this is right around when I got involved, which was like maybe a month after CoreDNS got created. So it was pretty young. And what I was doing was looking at the general service discovery within containerized environments.

And at that time, I worked for Infoblox, which was a company that does DNS. And we wanted to see what we might be able to do to contribute in this new and upcoming containerized world. And what we saw is a lot of service discovery options that were bespoke to their different environments.

So Kubernetes had a kube-dns. Docker had its own internal built-in DNS. Mesos and Marathon had a separate DNS. And we thought, this is kind of silly, because that means all of these different orchestration systems-- and then in other cases, people might use SkyDNS for, say, just a pure Docker environment.

So we thought, there's got to be a better way to do this. And as we were looking around for things, we came across CoreDNS. We said, this is actually really great. This architecture is really nice. This is a clean modern DNS server written in a modern language, which has other advantages I can talk about in a minute.

And in any case, so we decided to start looking at that and getting involved. And that's when we said, hey, I think Kubernetes is this really cool thing. Let's take that as our first integration. And that was kind of our path to getting involved, and my path to getting involved in Kubernetes, and the path for CoreDNS to get involved in Kubernetes. Luckily, we chose that as our first one, because it turned out to be the right one.

But what we saw that we thought could be done better was that, at the time, kube-dns ran a little mini etcd in the pod, and had a separate service that populated that etcd with SkyDNS-like records, because SkyDNS already ran on top of etcd. So essentially you had kube-apiserver. You had a process that took the data out of kube-apiserver and populated a locally running etcd in the format of SkyDNS. And then you had SkyDNS running to serve those records. And then I think you had dnsmasq, which is a caching DNS server, sitting on top of that. So you had all these little parts.

CRAIG BOX: That's a lot of turtles.

JOHN BELAMARIC: Exactly. It's a lot of turtles. We said, this is silly. Why don't we just make CoreDNS talk directly to the kube-apiserver and serve up the DNS directly? So one process instead of like four.

CRAIG BOX: Why didn't the SkyDNS team just do this?

JOHN BELAMARIC: Well, essentially, Miek was the SkyDNS team at the time. And SkyDNS was sort of dead for development, as far as I know, at that time. I can't answer that, I wasn't on that team.

But what I can say is that we saw a real benefit there. And we saw that because of the plugin architecture of CoreDNS, we could back Kubernetes, but we could also have a plugin that talks to Mesos. We could have a plugin that talks to Docker APIs. And therefore, we can kind of make one simpler solution that can manage in all of these environments.

Eventually, kube-dns did evolve and get better, right? It did do some of what I said, where it now today-- it still exists-- it talks directly to the API server. It doesn't keep etcd running locally. It just does things in memory. And it synthesizes the records automatically.

It still, though, runs multiple parts. It still runs a separate sidecar for metrics. It still runs dnsmasq to sit in the front end. And there are other issues with that.

And so it's still like three or four containers in the pod, as opposed to one, which means that it's got more moving parts to fail. And there's network connections between those. When you scale, you're scaling all those little parts instead of just a single container. It's just a little bit more difficult to manage. And we did put a lot of effort into improving the performance and scalability of CoreDNS so that it could be at least as good as kube-dns.

And then the big thing to me, or why I find CoreDNS a much better solution is the level of flexibility you get with it. So we have a lot of configuration options that you can twiddle if you need to. And people sometimes need to.

So one of the examples early on during the evolution of CoreDNS or the progress of CoreDNS into Kubernetes that I found compelling was that if you have certificates for a service that are, say, assigned to some external name, like mystore.com, then things trying to talk to that within the cluster, if they're using the local service name, the certificate validation is going to fail, because they're going to be looking it up as mystore.namespace.svc.cluster.local. And that doesn't match the certificate.

So there's actually ways within CoreDNS that you can tweak the way the lookup works so that the client locally can use the same name as somebody from outside the cluster might, and everything still works. So that's something you couldn't possibly do with kube-dns. There's many other things, but essentially, this level of flexibility of being a full bore DNS server that also is actually specifically tailored around cloud native use cases where requirements change frequently, and things like that.

ADAM GLICK: You talked a little bit there about the flexibility of CoreDNS and one of the things that's talked about a lot around CoreDNS is the plugin model. How does CoreDNS do plugins? And why is that so important to the flexibility and the design and use of CoreDNS?

JOHN BELAMARIC: CoreDNS is written in Go. And one of the great things about Go and one of the unfortunate things about Go is that it's statically compiled, which means that from a plugin type of model, it means that the plugins have to be compiled in.

This has pros and cons. There are ways in newer versions of Go to do plugins, but we haven't gone there. The plugin model in CoreDNS is a compile-time plugin model, so everything is statically compiled. That has the advantage that if you have a use case where you need only a few plugins, then you can reduce the size of the executable very easily.

So we do this. In fact, in Kubernetes there's something called NodeLocal DNS. NodeLocal DNS is just a special build of CoreDNS that's got a few extra lines of code and compiles in just a handful of the plugins. This makes it very small. Since we want to run it potentially on thousands of nodes, we don't want it to take up very much space. And so that was a way we could build it down very small.

That said, it's a compiled-in plugin model. But it's really about the software architecture and the pipelining model built into it.

If you remember the old Unix philosophy that a utility should do one thing and should do it well, that's essentially how our plugin model works. So a plugin should do one thing, and it should do it really well. And you can chain or pipe those plugins together to effect much more complicated things.

So we have, for example, a plugin that does rewrites. It rewrites the names into different names, and then passes it down the chain. This is how we achieve that thing I mentioned earlier of a use case where we can have the certificate work.

We have a rewrite that says, hey, this person is looking for mystore.com. I know that mystore.com is actually mystore.namespace.svc.cluster.local. I'm going to rewrite that name. And then it gets passed on to the Kubernetes plugin. The Kubernetes plugin knows how to look up that name and returns it. The rewrite plugin can then again say, oh, I need to rewrite the name back, because of the way DNS clients work, they will reject it if it doesn't match.

So it rewrites the name back to what the client originally asked for. And now you've got the address that was looked up through the Kubernetes plugin, but it used the name that had to be rewritten.
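The rewrite-then-look-up-then-rewrite-back flow can be modelled as a chain of small functions. What follows is a Python sketch of the idea only. CoreDNS itself is written in Go and configured through its own configuration file, and none of the names below (`rewrite_plugin`, `kubernetes_plugin`, `build_chain`) come from it. The address `10.0.0.12` is made up.

```python
from typing import Callable, Dict, List, Optional, Tuple

Answer = Tuple[str, str]  # (name in the answer, address)
Handler = Callable[[str], Optional[Answer]]
Plugin = Callable[[str, Handler], Optional[Answer]]


def rewrite_plugin(external: str, internal: str) -> Plugin:
    """Rewrite the query name on the way in and restore it on the way out."""
    def plugin(qname: str, next_handler: Handler) -> Optional[Answer]:
        if qname != external:
            return next_handler(qname)
        answer = next_handler(internal)
        if answer is None:
            return None
        # Clients reject answers whose name differs from what they asked for.
        return (external, answer[1])
    return plugin


def kubernetes_plugin(services: Dict[str, str]) -> Plugin:
    """Answer for known service names; pass everything else down the chain."""
    def plugin(qname: str, next_handler: Handler) -> Optional[Answer]:
        if qname in services:
            return (qname, services[qname])
        return next_handler(qname)
    return plugin


def _bind(plugin: Plugin, next_handler: Handler) -> Handler:
    return lambda qname: plugin(qname, next_handler)


def build_chain(plugins: List[Plugin]) -> Handler:
    """Pipe the plugins together; the end of the chain has no answer."""
    handler: Handler = lambda qname: None
    for plugin in reversed(plugins):
        handler = _bind(plugin, handler)
    return handler


chain = build_chain([
    rewrite_plugin("mystore.com.", "mystore.namespace.svc.cluster.local."),
    kubernetes_plugin({"mystore.namespace.svc.cluster.local.": "10.0.0.12"}),
])
print(chain("mystore.com."))
# ('mystore.com.', '10.0.0.12')
```

Each plugin does one job, and neither knows about the other; the chain order is what combines them.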

ADAM GLICK: There's lots of plugins that are available for CoreDNS. What are some of the more interesting ones that you've seen?

JOHN BELAMARIC: There are two sort of classes of plugins we have. We have ones that are built into our default build of CoreDNS. And those are in-tree plugins. And if you go to our coredns.io website, you'll see there's a plugin section and an external plugin section.

The built-in plugins do a lot of things that very many people want to do. And that's not really what I want to talk about here, because those aren't the most interesting. I think that if you look at the external plugins, this is where you see people doing interesting and unusual things.

I mean, a lot of the plugins I've seen are things like ad blockers. They'll basically block the DNS request lookups for ad services, or for content filtering for things you don't want your kids seeing, or whatever it may be.

But then there's other sorts of plugins that are around improving the way DNS works or being able to scale your DNS. One of the ones we have is a Redis-based L2 cache. So if you wanted to build out a very highly available, very performant DNS service, you can run a whole bunch of instances of CoreDNS, which each can have their local cache, but you can have a sort of L2 cache in a Redis server before you go out to what we call the recursive server, as we mentioned earlier, that does a potentially very long lookup across multiple services on the internet. So those are some interesting things.
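The layering of a per-instance cache, a shared second-level cache, and a recursive server can be sketched like this. It is a toy: a plain dictionary stands in for the Redis server, the `TieredResolver` class is an invented name, and entries never expire.

```python
from typing import Callable, Dict, List, Tuple


class TieredResolver:
    """Check the local cache, then the shared cache, then the recursive server."""

    def __init__(self, shared_l2: Dict[str, str], recursive: Callable[[str], str]) -> None:
        self._l1: Dict[str, str] = {}  # private to this instance
        self._l2 = shared_l2           # shared by all instances (Redis stand-in)
        self._recursive = recursive

    def lookup(self, name: str) -> Tuple[str, str]:
        """Return (address, where the answer came from)."""
        if name in self._l1:
            return self._l1[name], "l1"
        if name in self._l2:
            self._l1[name] = self._l2[name]
            return self._l1[name], "l2"
        address = self._recursive(name)  # the potentially slow path
        self._l2[name] = address
        self._l1[name] = address
        return address, "recursive"


recursive_calls: List[str] = []


def slow_recursive(name: str) -> str:
    recursive_calls.append(name)
    return "198.51.100.7"


shared: Dict[str, str] = {}
first = TieredResolver(shared, slow_recursive)
second = TieredResolver(shared, slow_recursive)
print(first.lookup("example.org."))   # answered by the recursive server
print(second.lookup("example.org."))  # answered from the shared cache
print(second.lookup("example.org."))  # answered from its own local cache
```

Only the first lookup reaches the recursive server; the second instance benefits from work the first one already did.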

One that came in recently that I think is fascinating. I have no idea if it's a brilliant idea or an absolutely horrifically terrible idea, because I don't know enough about blockchain. But I'll explain what I think is interesting about this. This is a way to supply DNS records based upon the Ethereum Name Service, and I've just said as much as I know about it.

But what I find so fascinating about this is if you go back to a little bit of our discussion on what DNS is, DNS is this system for delegating authority over these names. And there's a huge infrastructure built up over 30-plus years of infrastructure, and trust relationships, and how these servers are authoritative for this set of domains, and can delegate to this other set of domains. So that's an enormous sort of set of machinery, both human and computer, that works a certain way, and has been working that way.

This is basically throwing all of that out and saying let's replace all of that with this blockchain-based mechanism for delegating authority. But still from a user point of view, there's absolutely no difference. So from a user point of view, from a DNS query point of view, it looks exactly the same.

So I guess you asked earlier, like, what's so great about the plugin model? I mean, one of the things that's great about the plugin model is that it lets you break up the request processing and do interesting things.

But to me, sort of a bigger picture level, what's so interesting about it is the way that it allows crazy experimentation. So in BIND, for instance, which is sort of the big, open source, most commonly used, oldest, cruftiest DNS server out there, it's quite difficult to experiment. It's quite difficult to make changes. The code is notoriously difficult to modify and get right.

And because DNS is so critical, you don't want to break DNS. And DNS can be an attack vector as well. And so you don't want to mess up when you're modifying DNS. But by having these plugins that are sort of single purpose and serve one function, you can replace that more easily without affecting all the rest of the server and the way that it works.
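The idea of single-purpose plugins, each either answering a request or handing it to the next one, can be sketched in a few lines. This is a Python illustration only; real CoreDNS plugins are written in Go against its own interfaces, and `build_chain`, `block_ads`, `static_backend`, and the example names below are hypothetical.

```python
from typing import Callable, List, Optional

# A handler takes a query name and returns an answer, or None if nobody answered.
Handler = Callable[[str], Optional[str]]
# A plugin gets the query name plus the next handler in the chain.
Plugin = Callable[[str, Handler], Optional[str]]


def build_chain(plugins: List[Plugin]) -> Handler:
    """Link plugins so each one can answer or delegate to the next."""

    def end_of_chain(name: str) -> Optional[str]:
        return None

    def wrap(plugin: Plugin, next_handler: Handler) -> Handler:
        return lambda name: plugin(name, next_handler)

    handler: Handler = end_of_chain
    for plugin in reversed(plugins):
        handler = wrap(plugin, handler)
    return handler


def block_ads(name: str, next_handler: Handler) -> Optional[str]:
    """Filtering plugin: answer blocked names itself, pass the rest along."""
    if name.endswith(".ads.example."):
        return "0.0.0.0"
    return next_handler(name)


def static_backend(name: str, next_handler: Handler) -> Optional[str]:
    """Backend plugin: serve data from a small in-memory table."""
    records = {"www.example.org.": "192.0.2.1"}
    if name in records:
        return records[name]
    return next_handler(name)


chain = build_chain([block_ads, static_backend])
```

Replacing `static_backend` with a different data source leaves `block_ads` untouched, which is the isolation property described above.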

CRAIG BOX: Now, you mentioned BIND there, which is the name server that I ran back on my Red Hat 6 box back in the day. Does CoreDNS have as a goal to do all that other systems do? Does it aim to be able to replace things like BIND?

JOHN BELAMARIC: Not really. There's a lot of stuff BIND does that we would never necessarily want it to do, because they were built over many, many years organically-- may not even be used by that many people. It potentially could, in a sense, because if those things could be done in plugins, then you could do them without affecting the rest of the server.

But the big thing that CoreDNS does not do is recursive DNS. So we talked a little bit in the beginning about how DNS works. We have authoritative. I own the information about this domain. I can provide information about this domain. And then we have recursive that says, I can figure out based on the root name servers. I've got these root name servers loaded in me, and I can figure out-- give any name. And if it's a valid name out there on the internet, I will figure out who has authority over it. And I will ask them. That's a recursive server.

So we don't actually do recursive, because in part it's just really hard. It's a very hard thing to get done right. And we're an open source project. None of the backers have funded that sort of effort, because there are other ways to do that.

So what we do do to sort of be recursive if we want-- there's two ways-- one is, well, we can simply forward to another recursive server. That's not really being recursive, but it's sort of a pseudo fake recursive. Or we can build in a plugin. So there's a pretty modern recursive DNS server out there called Unbound. And we can actually build-- we have a plugin that builds in the lib Unbound to do recursive lookups.

However, Unbound is written in C, or C++, I'm not sure which, and we're written in Go. So it's not so easy to directly integrate. There's ways to do it, but you have to do a special build for that. Essentially, we don't build that in by default, because it would make our binaries less portable.

CRAIG BOX: In our Kubernetes cluster, we use CoreDNS to serve the names for our services that we registered. But as you just mentioned, the recursive DNS lookup for me going to look up google.com to connect to it from one of my pods, for example, that's not handled by CoreDNS. How does Kubernetes handle that? And how has it changed over time?

JOHN BELAMARIC: It is, and it isn't handled by CoreDNS. It is in the sense that the DNS server that's listed as the client DNS server for resolution in the pod is CoreDNS. So those requests will be sent to CoreDNS.

What we'll do instead of actually doing the recursive lookup-- we're obviously not authoritative on google.com.

CRAIG BOX: But we can make a pretty good guess.

JOHN BELAMARIC: Yes. But what we'll do is the CoreDNS configuration file itself is set up with what we call our forward plugin. The forward plugin, basically, you give it a list of upstream DNS servers. And so when we get a request that's outside of the zones that we handle, then we'll send it to one of those upstream servers, and let it take care of all the details.

CRAIG BOX: Are those upstream servers run inside Kubernetes? Or are we now talking about the servers provided by our hosting provider?

JOHN BELAMARIC: Typically, they're provided by your hosting provider. But you could run one. I don't know that I'd recommend it, but you could run one.

There's also what we call stub domains. And this essentially is allowing different places to forward different requests to you, depending on the domain that's being looked up.

So if I work for corporation.com, then I may want to say if I'm looking up hr.corporation.com, then I want to go to the internal corporation.com name server, not to the internet. And that would be called a stub domain.

So you can configure within CoreDNS. You can say, for these zones, for these names, or anything underneath them, then go to these local name servers or these other name servers. For everything else, go out to the internet, to the hosting upstream name server typically.
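The forwarding decision for stub domains amounts to a longest-suffix match on the queried name. Here is a rough Python sketch of that selection logic; it is not how CoreDNS is configured or implemented, and `pick_upstreams` and the addresses (taken from documentation ranges) are hypothetical.

```python
from typing import Dict, List


def pick_upstreams(
    qname: str,
    stub_domains: Dict[str, List[str]],
    default_upstreams: List[str],
) -> List[str]:
    """Choose upstream servers for qname by its most specific matching zone."""
    labels = qname.lower().rstrip(".").split(".")
    # Try the full name first, then progressively shorter suffixes, so that
    # the most specific stub domain wins. Matching is per label, not per
    # substring, so "notcorporation.com" does not match "corporation.com".
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in stub_domains:
            return stub_domains[zone]
    return default_upstreams


# Hypothetical setup: one internal zone, everything else goes upstream.
stubs = {"corporation.com": ["192.0.2.53"]}
default = ["198.51.100.53"]
```

A query for `hr.corporation.com` would go to the internal server, while `google.com` falls through to the default upstream.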

CRAIG BOX: The more pods that I run, the more DNS requests I might make, and so the more DNS servers or DNS forwarders that I need to run in my infrastructure. What has Kubernetes done recently to make that more stable?

JOHN BELAMARIC: So Kubernetes has sort of suffered from a DNS problem since its early inception. And there's a bunch of reasons for this. One is that this name structure that we talked about-- we want to be able to look up-- if we're in the same namespace as a service, we just want to be able to look it up by the short name of the service.

So when you want to look up "db" from your app server and they're all running in the same namespace, you can just use "db" as the name. This allows you to, say, move that whole set of pieces of your application from namespace to namespace without having to update the names in your configuration, and that sort of thing.

But what that means is that on the client side of those lookups, there's actually something we call the search path. So internally, when a client goes to look up a name, if the name is short enough-- it doesn't have enough dots-- then the system-- this is your underlying tool chain, your underlying libraries-- it'll take that search path, and it'll append it.

So when you look up "db", what actually gets sent to the DNS server is db.namespace.svc.cluster.local. It gets sent to the name server. And the name server then says, yes, I know what that is, or I don't.

Now, that search path has multiple entries in it. And in the case of Kubernetes, because we want to use "db", or I'm going to use db.namespace, or we want to do db.namespace.svc, or we want to do the whole thing, it's got a lot of entries. And then the pod has its own entries. And then the kubelet adds a bunch on the end, whatever is already defined in the host. So you can have like six or seven of these.

And we have the system configured so that it's sort of aggressive in how it uses the search path. That thing I said, that it depends on how short the name is whether you use the search path-- we set that really high, to say it's got to be a really long name before we give up on using the search path.

That's a lot of explanation. What does that actually mean? That means that every time you request, say, google.com without a dot on the end, what it actually results in is instead of one DNS lookup, it results in like five or six. So the first one is google.com.namespace.svc.cluster.local. And the name server says, I don't know what that is. That's garbage. It says, no such name.

Oh, there's no such name? OK. Then I'll try the next thing in the search path. The next thing in the search path is the same thing with one less. So it's google.com.svc.cluster.local. And here we also have a problem, because if there's a namespace "com", you can run into a problem.
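The search path behavior described above can be sketched as a function that lists the names a stub resolver would try, in order. This is a simplified Python model of the usual `search`/`ndots` rules, not the resolver's actual code; the namespace `myns`, the three-entry search list, and `ndots=5` (the value commonly seen in pod `resolv.conf` files) are assumptions for the example.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """List the fully qualified names tried for `name`, in order.

    A real resolver stops at the first name that resolves; this only
    shows the order of attempts.
    """
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search path is used.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{suffix}." for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search path.
        return [absolute] + expanded
    # Too few dots: walk the search path first, the name as given last.
    return expanded + [absolute]


# Hypothetical search list for a pod in namespace "myns".
search_path = ["myns.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

Under these assumptions `candidate_queries("google.com", search_path)` yields three cluster-suffixed names before `google.com.` itself, while `candidate_queries("google.com.", search_path)` yields a single query.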

CRAIG BOX: I was going to ask, if I named a service google.com in Kubernetes, could I break everything?

JOHN BELAMARIC: Yes.

CRAIG BOX: Am I allowed to create such a service or is the namespace--

JOHN BELAMARIC: Yes.

CRAIG BOX: Oh, well.

JOHN BELAMARIC: Yes.

CRAIG BOX: We should fix that.

JOHN BELAMARIC: [CHUCKLING] Sure. We've talked for years about a second DNS schema, but no matter what, if you're letting people create DNS entries and you have these search paths, you're going to create these possibilities of collision. So yes, you can cause some trouble. If you can create a dot-com namespace, then you can cause some problems.

But we're not going to restrict namespaces from every top-level domain. There's many, many top-level domains.

So in any case, the short of it is when you make a request like that, it makes a whole bunch of lookups. So what that means is that any one of those lookups, DNS is done via UDP. If you know what UDP is, essentially in networking, we have two commonly used classes, connection-oriented and connectionless. So you have UDP and TCP. TCP is this Transmission Control Protocol. That's what we use when we go to websites typically. And it retries. And it sets up a handshake. And it makes sure that both ends of the connection know that they're talking to each other.

UDP is more just like, I throw it out there on the network, and maybe it gets there. Maybe it doesn't. I don't know. I don't care.

This is how DNS works. It throws it out there on the network. Maybe it gets there. Maybe it doesn't. It'll wait up to 5 seconds. So if there's any network issues, you'll see these 5-second delays. UDP is potentially lost.

So now you're making five or six UDP-based requests. You're really increasing the chance that you're going to hit a problem and get a timeout. On top of that, there's really detailed technical issues around that when you create one of these internally, the kernel keeps track of connections. And UDP, since there's no actual connection on both sides, it doesn't know when it closes.

So it has to time out of this big table. You can fill up this table. There's some race conditions in older versions of the kernel. All kinds of things, really nitty-gritty details that have caused real problems in DNS in Kubernetes. When a whole bunch of pods start up, and they're asking for some external service, and they're all making bazillions of requests to the DNS server, you can get these timeouts that can cause all kinds of damage.
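The point about five or six UDP requests raising the odds of a timeout can be made concrete with a little arithmetic. The sketch below assumes each query is lost independently with the same probability, which is a simplification, and the 1% loss rate is an invented figure used only for illustration.

```python
def chance_of_any_loss(per_query_loss: float, queries: int) -> float:
    """Probability that at least one of `queries` independent queries is lost."""
    return 1.0 - (1.0 - per_query_loss) ** queries


# Hypothetical 1% loss per query: compare one lookup with a six-step search path.
single = chance_of_any_loss(0.01, 1)
six_step = chance_of_any_loss(0.01, 6)
```

Under these assumptions, six sequential lookups are nearly six times as likely as a single one to run into at least one lost packet, and each loss can mean waiting out the multi-second timeout mentioned above.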

In order to alleviate all of these problems, we, meaning the Kubernetes SIG Network-- this is not CoreDNS per se-- said, let's come up with a better solution for this, and built a service that runs locally on every single node. So this does a couple of things.

One, it's a special build of CoreDNS. It strips out, for example, the Kubernetes plugin, because it's not going to talk to the apiserver. But it puts in the cache plugin. It puts in a few other plugins that it needs. And it essentially sits on the node. All of the DNS traffic instead of going directly out to the CoreDNS that's running as a service in Kubernetes, it talks to this node local cache. And it talks over UDP, but it's on the same node.

Then the node local cache says, OK, I don't know anything myself, but I know where the cluster DNS is, the CoreDNS running in the cluster is. And I'm going to talk to that, but I'm not going to use UDP. I'm going to use TCP. So I'm actually going to create a longer lived, established connection that's reliable back to the main CoreDNS. And so I'm going to pass that request upstream. I'm going to get an answer. And then I'm going to cache it locally.

So basically, you've taken all of that back and forth, back and forth, back and forth UDP, and you've localized it on the node. The longer network traffic is done with a more reliable connection.

And then on top of that, node local DNS does some magic to turn off connection tracking for UDP in general, for these requests in general to improve the performance, and avoid a bunch of those kernel issues and the conntrack table filling up and this sort of thing. So it actually makes a huge difference in the reliability level, while sacrificing a small amount of memory on every node, something like 10 or 15 megs of memory on every node.
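To illustrate the difference between the two transports at the wire level: the same DNS message is sent as a single datagram over UDP, while over TCP it travels on a byte stream and is prefixed with a two-byte length field. The Python sketch below builds a minimal A-record query and frames it for TCP. It is a bare-bones illustration (no validation of label lengths, no response parsing), and the query ID and name are hypothetical.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record in class IN."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix the message with its length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message


udp_payload = build_query("db.myns.svc.cluster.local.")
tcp_payload = frame_for_tcp(udp_payload)
```

Because TCP is an established connection with acknowledgements and retransmission, a lost segment is resent by the transport rather than surfacing to the client as a multi-second resolver timeout.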

ADAM GLICK: You wrote a book on CoreDNS with Cricket Liu. It could be said that Cricket has "written the book" on DNS-- and wrote it again and again. Indeed, he's been doing it multiple times since 1992. What was it like to work on the CoreDNS book with him?

JOHN BELAMARIC: Well, as you can imagine, Cricket is absolutely tyrannical. Late nights, no food. It was terrible. [CHUCKLES]

No, Cricket is super easy to work with. He's a great guy. He was actually my supervisor. He was my manager at my previous job before I joined Google. He still did the book with me after I left. So he can't be so bad.

ADAM GLICK: Stand up guy.

JOHN BELAMARIC: Exactly. No, he obviously knows DNS inside out. Really easy to work with. And it was a good time.

CRAIG BOX: Is there a rule that he must write all books on DNS?

JOHN BELAMARIC: We thought so, but in fact, we found out recently that there is one book from O'Reilly-- and I can't remember which one it is-- that's on DNS that he has nothing to do with. And it was a shock.

It was like, oh, we went to give a talk at one of the KubeCons. And he went up and was saying, oh, I wrote all the books on DNS at O'Reilly. And then he found out he hadn't. So it was a terrible shock.

There's another kind of funny story there that when the book came out, they sent us the first sort of proof of the cover. It has this fish on it. And Cricket tweeted it out. And somebody replied, hey, that's a large mouth bass. You know what they eat? They eat crickets!

So we completely thought that O'Reilly was trolling us here. But it turns out it's not. It's some saltwater fish that lives on the bottom. So hey, I guess, we're OK. But I thought that was kind of funny.

ADAM GLICK: Do you have any say on what the animal is that's on the front of your book? Or is that decided by an algorithm that is protected like the Coke formula and the Colonel's 11 secret?

JOHN BELAMARIC: I suspect it's the latter. I guess if we absolutely hated it, maybe we could have protested. But it was presented to us as this is the--

ADAM GLICK: This is your spirit animal. Embrace it.

JOHN BELAMARIC: Exactly.

CRAIG BOX: What is the animal on the cover of your book?

JOHN BELAMARIC: It is a comber fish, which is a saltwater fish that eats small fish.

CRAIG BOX: So all DNS books are fish themed?

JOHN BELAMARIC: No, actually, Cricket's original one, "DNS and BIND" has a grasshopper on it. I don't know why it's not a cricket, but it's a grasshopper.

ADAM GLICK: Bonus points if you know the difference between the two.

DNS may be mature technology. But there seems to be a lot of work that's still going on in CoreDNS. What's coming up next?

JOHN BELAMARIC: A lot of the work we're doing right now, we're doing some things internally to the code to try to improve or simplify the lives of plugin authors. So I don't know that I described it before, but plugins kind of come in a few different categories. I tend to categorize them in sort of things that manipulate the request versus what we call backends. And backends provide data from different sources.

So a backend, there's a Kubernetes backend. It provides data from the Kubernetes API server. There's a file backend that provides data from traditional zone files. There's an external plugin that will provide data from a SQL database.

And as the code is structured right now, all of the authors of every one of those backends has to individually write what we call a zone transfer code, which is a little bit tricky. One of the things within DNS is when you have an authoritative DNS server, you can designate secondary authoritative DNS servers, which just means that they accept all of the zone data from the primary one. You don't change it there, but you're still authoritative. So it's a cache, kind of.

And so that mechanism is done via what we call a zone transfer. It's a special kind of DNS request. And right now in the code, we wrote a Kubernetes backend. Somebody had to write zone transfer specifically for Kubernetes. We wrote a file backend. The zone transfer had to be written separately.

So we now have code that will do that for any backend and provide a means for that. That's one thing, improving the code internally.

One of the things I find most interesting is the introduction of policy within the DNS layer. So we have a plugin that we've had around for a while that allows you to inject sort of logic into the process. So if you think about the pipeline, the pipeline is really a code-based place to introduce logic. You can change-- say, do the rewrites we talked about, or change other behavior that you want with the request.

Policy is a runtime thing that in theory is done not necessarily by a programmer type. So this allows you to affect how the requests happen. A great example, our policy plugin integrates with a couple of different policy servers. One is an open source one that Infoblox uses internally, and the other one is OPA.

I think you had Tim Hinrichs and Torin Sandall on here a few weeks ago. Torin and I were going to do a talk at KubeCon-- and maybe we will in Boston, but we aren't doing it. We were going to do it in Amsterdam-- where we've taken CoreDNS, and we've integrated it so that we can do multi-tenant service discovery in Kubernetes.

So in Kubernetes, if you've got role-based access control, well, role-based access control can prevent people from seeing the services that don't belong to them through the Kubernetes API server. It doesn't prevent it through DNS. DNS right now, you can look up other people's services all you want.

So what we can actually do is through the policy plugin and OPA, we can integrate with RBAC, or, say, with network policy, one or the other, and feed those policies into OPA. And then CoreDNS, when it's making a DNS request, can ask OPA, can the client that's asking this question, actually, are they allowed to get the answer?

So essentially, we know who's asking the question. We know the pod that's asking the question. We know what namespace it lives in. And we can say, set a policy that says, things can only look up things in their own namespace, or in one of these specific namespaces, or we can potentially integrate that, like I said, with RBAC. Although, it's pretty complicated.
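The namespace rule described here can be sketched as a small decision function. This is a plain Python stand-in for the kind of check a policy engine would make, not the CoreDNS policy plugin or the OPA API; `namespace_of`, `is_allowed`, the `cluster.local` domain, and the namespace names are hypothetical.

```python
from typing import FrozenSet, Optional


def namespace_of(qname: str, cluster_domain: str = "cluster.local") -> Optional[str]:
    """Extract the namespace from <service>.<namespace>.svc.<cluster_domain>."""
    labels = qname.lower().rstrip(".").split(".")
    suffix = ["svc"] + cluster_domain.split(".")
    # Need at least a service label and a namespace label before the suffix.
    if len(labels) < len(suffix) + 2 or labels[-len(suffix):] != suffix:
        return None
    return labels[-len(suffix) - 1]


def is_allowed(
    client_namespace: str,
    qname: str,
    extra_allowed: FrozenSet[str] = frozenset(),
) -> bool:
    """Allow lookups in the client's own namespace or explicitly listed ones."""
    target = namespace_of(qname)
    if target is None:
        # Not a cluster service name; this sketch leaves such queries alone.
        return True
    return target == client_namespace or target in extra_allowed
```

A pod in `team-a` could then resolve `db.team-a.svc.cluster.local.` but not `db.team-b.svc.cluster.local.` unless `team-b` is listed in `extra_allowed`.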

So that, to me, the policy integration, is one of the more interesting areas, and where I'd like to move the product. We have some facility for it now. But it's almost more experimental now than anything else.

CRAIG BOX: You're also one of the chairs of SIG Architecture. What is happening in upstream communities that our listeners should know about?

JOHN BELAMARIC: My role in SIG Architecture is I've spent a lot of effort on sort of trying to improve process, and improve the development and really up level the quality of the features we're delivering, and that sort of thing.

So one of the efforts there is around production readiness. So internally, say, at Google, or in other large cloud providers, or other large SaaS providers, there's a lot of process around-- maybe too much in some cases-- but around what it takes to get something to production.

But one of those things that's really important is sort of a review by our SRE community. And so what we've done is implement in SIG Architecture a process such that any new feature that's being introduced in Kubernetes has to have some approval by a set of what we call production readiness reviewers. So this is really intended to be pretty lightweight, but it just makes people think about a lot of the questions.

Developers love to make stuff. And they don't always think about how to make those things easy to support and easy to operate in production. And this sort of forces them to put thought into each of those places, and answer certain questions and provide playbooks for operators when those features come to production.

So we're doing some of that now in the 1.19 cycle. And we hope-- if we can handle the review burden, we hope that in 1.20, all the new features going into beta or GA will go through this process.

CRAIG BOX: Finally, there's a meme in the networking community that you can find on T-shirts and mugs that says, it's always DNS. As a DNS guy, how do you feel about that?

JOHN BELAMARIC: Oh, it just means we're important, right? I mean, it often is DNS. I think it's either DNS or a network, right? And the reason is that everything, everything, everything relies on those two things. I mean, if a machine goes down, it's just a machine. But if your DNS service goes down, if your network goes down, everything just stops.

So in that sense, when there's a major outage, there's not that many things you can point to and say, what is it? It's probably routing, or it's DNS. And that's what's going to cause "whole half of the country to go offline" type of events.

ADAM GLICK: John, it has been great having you on the show. Thanks for joining us.

JOHN BELAMARIC: Thank you. It's been a pleasure.

ADAM GLICK: You can find John Belamaric on Twitter, @JohnBelamaric, and you can find the CoreDNS project on the web at coredns.io.

[MUSIC PLAYING]

CRAIG BOX: Thank you for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you really liked it, tell two!

If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: Please take the opportunity to subscribe in your podcast app if you haven't already. You can also check out our website at kubernetespodcast.com, where you'll find transcripts and the show notes. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]
