#42 February 26, 2019

Policy and Config Management, with John Murray

Hosts: Craig Box, Adam Glick

Kubernetes has a number of mechanisms to enforce policy: some built-in, like quota and NetworkPolicy; some extensions or add-ons like OPA. John Murray, a product manager at Google Cloud, joins Craig and Adam to talk about policy and configuration, and introduce the new CSP Config Management tool launched to Beta along with the new Cloud Services Platform.

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: I saw your name in the news last week, Adam. Congratulations!

ADAM GLICK: Oh, thank you. Yeah, as it turns out, when we're not recording, we have day jobs. And mine is working on a project we'll mention in the news called CSP. And we hit beta with that this week-- really an amazing team, an amazing product-- super excited to have been a part of that launch.

CRAIG BOX: Did you find any time during that busy week to find any new indie games?

ADAM GLICK: [LAUGHS] Well, you know, sometimes you have some extra time. And I found this wonderful game called Cat Lady, which is actually a card game. And there's a digital version of it where your job is to play the world's best cat lady and to collect cats and feed them and provide toys to them--

CRAIG BOX: Oh.

ADAM GLICK: --which is a surprisingly interesting strategy game.

CRAIG BOX: I wouldn't know, off the top of my head, what the characteristics would be of, like, better or worse as far as cat ladies go.

ADAM GLICK: Well, I will challenge you to a game at it some time. And you will learn all the best tactics.

CRAIG BOX: I will look forward to that.

ADAM GLICK: Speaking of spare time, did you catch the Oscars?

CRAIG BOX: Yes. I even made a prediction. Back in episode 27, I talked about the movie "Bohemian Rhapsody," and that I thought that it was an Oscar-worthy performance from Rami Malek. And at the Oscars over the weekend, Rami Malek not only won the Oscar, but then fell off the stage in a fit of excitement.

ADAM GLICK: Wow. You have a prescient ability to predict things. Do you have any lottery numbers for us?

CRAIG BOX: If I did, I'm afraid they didn't work out so well in the Mega Millions last week.

ADAM GLICK: You have some travel coming up, I hear?

CRAIG BOX: Yes. Through the powers of editing, by the time you hear this, I will have landed in New Zealand where I will be for a few weeks on a combination friend's wedding/holiday/conference tour. And I will be at DevOps Talks in Auckland, which I believe will be the 27th of March. And then, next time you see me, it'll be back here in California for Google Cloud Next.

We've got a lot to talk about this week. Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: Google Cloud announced the beta of Cloud Services Platform, its open platform for application modernization in cloud or on-premises. CSP includes GKE On-Prem, a version of GKE that runs on top of VMware. And the beta release introduces CSP Configuration Management.

CSP was first announced in July 2018 and, during the Alpha phase, has been used by customers like HSBC, one of the world's largest banks, to bring Kubernetes and Istio into their data centers. Along with the beta, Google released a white paper by Eric Brewer and Jennifer Lin called Application Modernization and the Decoupling of Infrastructure, Services, and Teams, which is a much more interesting read than the name might suggest.

CRAIG BOX: Red Hat, this week, announced that a developer preview of OpenShift v4 is now available for public trial. OpenShift takes Red Hat's Kubernetes-based platform and introduces features from the acquisition of CoreOS. They claim the goal for v4 is to bring a no-ops model, which will, of course, be of great interest to all the operators who are going to be unnecessarily fired as a result.

ADAM GLICK: The Knative serverless platform for Kubernetes continues full-steam ahead this week with release 0.4. New features are based on user feedback, and the headline features for this release are around configuration and secret management. The release also adds support for serving gRPC and upgrading inbound connections to web sockets. Knative has also launched a shiny new website on the shiny new .dev domain, which became available to the public this week.

CRAIG BOX: Microsoft has a tool called Azure DevOps Projects, which is a template or sample deployment tool, and has, this week, added a commonly requested feature to be able to reuse an existing AKS cluster rather than creating a new one every time.

ADAM GLICK: The third post in Google's service mesh series has product manager Samrat Ray talking about securing an environment with Istio. Samrat spells out how you can adopt a zero-trust security approach through authentication and authorization, which can protect your environment from security threats like access using stolen credentials and replay attacks, thus helping keep your sensitive data safe. The post includes a sample in GitHub where you can follow along at home.

CRAIG BOX: Kubernetes consultant and "man with 100,000 Twitter followers" John Arundel, and DevOps engineer Justin Domingus, have just finished writing a new O'Reilly book called "Cloud Native DevOps with Kubernetes-- Building, Deploying, and Scaling Modern Applications in the Cloud." You can buy it at the regular places come March. But NGINX are giving it away for free in e-book format if you are willing to join their email list. As an aside, John really knows how to use Twitter. Even his book has over 5,000 followers already.

ADAM GLICK: A shout-out to all of our Redditor friends and all those who know that "the narwhal bacons at midnight". This week Reddit announced all new services deployed to production there will use Kubernetes by default. Woot!

CRAIG BOX: Last week we brought you the news that a runC vulnerability could allow breaking out of the container environment. The exploit code has been released. And Yuval Avrahami from Twistlock has written an in-depth blog post about the vulnerability and how it could be exploited.

ADAM GLICK: If you want to get deep into the depths of Kubernetes security and certificates-- and who doesn't-- Bjorn Wenzel has written a post talking about how to secure a self-installed Kubernetes setup with HashiCorp Vault. He talks about how to get identities to the Kubernetes nodes, so they're able to get credentials from Vault, and introduces his own Vault-CRD for keeping secrets stored in Vault up to date with Kubernetes secrets.

CRAIG BOX: Algolia, a French search technology company who you may know for their Hacker News search engine, recently migrated a crawler service from Heroku to Google Kubernetes Engine. Their lessons learned are that functionality exists in Google Cloud to do many of the things that they had run manually on Heroku. But it's important to migrate like for like, and change these parts out one at a time afterwards to get a conceptual understanding of how Kubernetes and cloud work together.

ADAM GLICK: The Enterprisers Project posted an interesting article this week by Kevin Casey on how to prepare for a Kubernetes interview. He has some good preparation advice as well as a list of questions you can use to test your knowledge and practice your responses. Since we continue to see demand for Kubernetes skills increasing, if you're listening to this podcast, it's likely your future career is looking pretty good.

CRAIG BOX: Finally, Kubernetes plumbers will know that there is no such thing as a container. There are namespaces and cgroups, but there's no high-level object called a container that is part of the Linux kernel. David Howells has proposed a set of patches which would introduce such a thing. However, this isn't his first attempt. Linux Weekly News has a write-up on the situation.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

John Murray is a product manager at Google Cloud working on CSP Config Management. Welcome to the show, John.

JOHN MURRAY: Thanks, Adam. Good to be here.

CRAIG BOX: Nice easy one to start. What is configuration? And what is policy?

JOHN MURRAY: Well, I like to think of it as, policies are the thing you're supposed to do, and config is how you end up having to do it. And so policy, oftentimes is, hey, this is our corporate policy; you need to do this. I'm not going to give it to you in code. I'm just going to tell you you need to go do it. And then I'm going to come back and check that you did it.

And those are things like legal compliance, complying with corporate policies, doing things like meeting certain compliances for things like HIPAA or PCI or GDPR-- all sorts of things which are more or less expressed in kind of soft business rules. And then it's up to the people who are operators or administrators to actually figure out how to get those into the systems and enforce them in a way that everybody's happy with.

CRAIG BOX: And you'd normally have a piece of software to describe those policies in some format and then apply that to the configurations that people deploy to their clusters?

JOHN MURRAY: Yeah, I think so. In Kubernetes, take for example, something like PCI compliance. All PCI compliance tells you is, hey, you've got to have some firewall rules. You've got to make sure that your clusters aren't open to the world. How you decide to express that in configuration can be done in any number of ways.

You can think about that in ingress. You can think about it-- network policies. You could think about it in an entirely different layer, like an Istio. So it's up to you, as the administrator, to figure out, what exactly is the configuration that I'm happy living with here, and then to make sure that that configuration is uniformly applied, that your system doesn't drift out from underneath it, and that people don't come in and clobber it with a change.

ADAM GLICK: Does that mean that policy is also a subset of configuration?

JOHN MURRAY: I think so. And it's not only just a subset of configuration. But it's a number of things that are applied on top of configuration. So you can have a policy that says, OK, we need to have access control rules set up for everything we have in our clusters. And that might be something like a role-based access control configuration. But you could also have a policy that says, I don't want my deployments coming from any other place than my private registry.

And that is really a rule that you're putting on top of a piece of configuration to say, there's a range of values that I expect to come in in this configuration. And anything outside of that is going to be outside of my policy. So it's both expressed in configuration and expressed on configuration.
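
As a rough illustration of the private-registry rule John describes, a policy "on" configuration can be read as a check over the values a manifest is allowed to carry. The sketch below is only that: the registry prefix `registry.example.com/` and the function name are assumptions made for illustration, not anything named in the episode.

```python
from typing import Dict, List

# Hypothetical private registry prefix, chosen only for this example.
ALLOWED_REGISTRY = "registry.example.com/"


def images_outside_registry(deployment: Dict, allowed_prefix: str = ALLOWED_REGISTRY) -> List[str]:
    """Return every container image in a Deployment manifest that does not
    come from the allowed registry. An empty list means the rule is met."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    # Check both regular containers and init containers.
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [
        c.get("image", "")
        for c in containers
        if not c.get("image", "").startswith(allowed_prefix)
    ]
```

The Deployment itself is ordinary configuration; the rule sits on top of it and only constrains the range of values the `image` field may take.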

CRAIG BOX: Some of these things are native to Kubernetes. There are network policies and pod security policies. I may be forgetting others. Why those things? Why do you think the Kubernetes project has settled on those things as being the things that are types that exist in the project itself versus an external add-on?

JOHN MURRAY: Well, I think the things that the community wants to put in as native stuff and not through points of extensibility are the things that are best enforced by the core controllers that sit within Kubernetes. So if these are fundamental things that are expressed in terms of things like quota, or access to the API server, or networking-- which are really the purview of core container orchestration functionality that's in Kubernetes-- then they need to be in there. They need to be part of that configuration. And then everything else can be pushed out to extensibility.

ADAM GLICK: What challenge does a policy engine solve?

JOHN MURRAY: Well, it's just that. You've got business rules. You can't get around them. And you need something that's going to ensure that everything that you're doing is in accordance with those rules. And that's particularly critical for people who operate in these kind of regimes. Because if you're a bank or you're subject to something like HIPAA rules in health care, drifting outside of those policies isn't just a matter of your boss is going to get angry at you. It's a matter of your customers are going to be angry at you. Regulators are going to be angry at you.

It can affect your business and can affect your bottom line if you, say, have a data breach or something. And so it's very important that these things are not just written down someplace on a piece of paper that people have forgotten about, but actually actively enforced at the choke points, like the API server.

CRAIG BOX: And it's not enough just to audit afterwards to check that you didn't break the rules?

JOHN MURRAY: Well, that's nice, right? It's nice if you found out you broke the rules afterwards. But what that means is that you were out of compliance for a while, right? You had drifted outside of those policies. It's much better if you could do, what we call, shifting left, which is to say, move the policy enforcement points up your pipeline so that, as you're making decisions about configurations before they hit your environment, before they affect live applications, you're doing those policy analyses.

And I think that's one of the motivating factors for people to start moving to config as code systems, policy as code systems, where they can do that shift left and avoid having to catch things after the fact.

ADAM GLICK: How are people solving this today?

JOHN MURRAY: There's not a ton of great tools out there, right? So as you scale up, the problem gets harder and harder to manage. You can do things like store your configurations in a Git repository so that they're under change control and that they're auditable. But then if you're picking those up, and you're using command line tools to apply them to a cluster, there are all sorts of other places where things can go wrong there.

And then once they're on the cluster, if you're not actively monitoring them, they could always drift away without you knowing it. So what people are looking for-- particularly as they get into having lots of clusters in different environments and lots of people having to kind of simultaneously work on them-- are ways to collaborate, ways to ensure that things are enforced, and then ways to keep them in line. The tools that are available there, really, are fairly thin these days. But there are a lot of people in the industry starting to work on it, both in open source and in products.

ADAM GLICK: Along those lines, we talked about configuration management being a superset of those pieces. When I think of a number of the configuration management tools that are out there, there are a number of them that set up environments-- it might be Terraform, or SaltStack, Ansible, Chef, Puppet-- those kind of things. This sounds like it's different than that.

JOHN MURRAY: Yeah. I think that last set of stuff really comes from an imperative way of doing things, versus a declarative way of doing things, right? And so a lot of the value in Kubernetes is, you have this declarative contract between you and the cluster where you just tell it what you want, and then all of the hardship and all of the difficulty, complexity, and logic sits in those controllers behind that. And the contract with you, the administrator, is you don't have to worry about it. Kubernetes is going to continue to try to reconcile that declaration.

This is really us trying to say, we want a declarative way of managing that environment. And we want to say, this is the way that we want to do things. This is the environment-wide, multi-cluster declaration. And then it's up to you to go and take that, make it happen without me having to think about the various things.

And if it fails, try it again. And then try it again. And then if it drifts, bring it back in line. So the idea of reconciliation, the idea of the declarative specification, gives you that kind of assurance that you don't necessarily get with the old model of tooling.
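
The reconcile-until-it-matches idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any Kubernetes controller is actually written: cluster state is a plain dictionary keyed by a hypothetical `(kind, namespace, name)` tuple, and "applying" a resource just means copying it in.

```python
import copy
from typing import Dict, List, Tuple

# Hypothetical resource key used only in this sketch: (kind, namespace, name).
Key = Tuple[str, str, str]


def plan_reconcile(desired: Dict[Key, dict], actual: Dict[Key, dict]) -> List[Tuple[str, Key]]:
    """Compare the declared state with the observed state and list the
    actions needed to bring managed resources back in line."""
    actions = []
    for key, spec in desired.items():
        if key not in actual:
            actions.append(("create", key))
        elif actual[key] != spec:
            actions.append(("update", key))  # drift detected
    return actions


def reconcile(desired: Dict[Key, dict], actual: Dict[Key, dict]) -> List[Tuple[str, Key]]:
    """Apply the plan to `actual` in place and return what was done.
    Resources that are not declared in `desired` are left untouched."""
    actions = plan_reconcile(desired, actual)
    for _, key in actions:
        actual[key] = copy.deepcopy(desired[key])
    return actions
```

Calling `reconcile` repeatedly is safe: once the observed state matches the declaration it returns an empty list, and any later drift shows up as a new `update` action on the next pass.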

CRAIG BOX: The CNCF hosts a project called the Open Policy Agent. Tell me a little bit about that project.

JOHN MURRAY: It's a great project. So the Open Policy Agent is, effectively, a programmable policy evaluation point. And it can be used in any number of different ways. It's something that you can compile into applications, if you want that sort of thing directly in your application. But it can also be plugged into various different systems, including things like Istio, where it works as a mixer plug-in and can do things like service-to-service authorization rules. And it also works with Kubernetes where it's, I think, most often used as an admission controller.

CRAIG BOX: So you would set it up to define your policies. And then the API-- so it will deny the deployment of pods, for example, if they don't meet the requirements that you set?

JOHN MURRAY: Yeah, that's right. And the programmable nature of it means that you can specify those rules any way you want. You could say, allow these Pods on alternate Tuesdays, only when it's not raining. And as long as you can find that information and bring it in, you can write that rule.

So it's very, very extensible. And one of the great things about it is it allows you to manage, generally, whatever policies you have. But it also allows you to not have to rewrite everything as an individual admission controller. You can take this one thing, deploy it, and then feed it rules, update those rules as it changes.
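
To make the "one admission controller, many rules" idea concrete, here is a minimal Python sketch of a handler that takes a Kubernetes `AdmissionReview` request, runs a list of rule functions over the submitted object, and returns an allow or deny response. OPA itself evaluates rules written in Rego; this Python stand-in only mimics the shape of the decision. The `require_owner_label` rule and the `owner` label are hypothetical examples.

```python
from typing import Callable, List

# A rule inspects one Kubernetes object and returns violation messages.
Rule = Callable[[dict], List[str]]


def require_owner_label(obj: dict) -> List[str]:
    """Hypothetical rule: every object must carry an 'owner' label."""
    labels = obj.get("metadata", {}).get("labels") or {}
    return [] if "owner" in labels else ["missing required label 'owner'"]


def review(admission_review: dict, rules: List[Rule]) -> dict:
    """Evaluate all rules against the object in an AdmissionReview request
    and build the AdmissionReview response the API server expects."""
    request = admission_review["request"]
    violations = [msg for rule in rules for msg in rule(request["object"])]
    response = {"uid": request["uid"], "allowed": not violations}
    if violations:
        # The message is shown to the user whose request was denied.
        response["status"] = {"message": "; ".join(violations)}
    return {
        "apiVersion": admission_review.get("apiVersion", "admission.k8s.io/v1"),
        "kind": "AdmissionReview",
        "response": response,
    }
```

Adding a policy then means appending another function to `rules` rather than deploying another admission controller.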

CRAIG BOX: Are those rules written in a domain-specific language? Or are you running them as code?

JOHN MURRAY: Right now, the Open Policy Agent uses a domain-specific language called Rego. It's pretty good in terms of specifying policies. It's not, say, Turing complete, per se. But it is a really good language for bringing in data, evaluating it, and making an allow or deny decision.

One of the things that we're working on at Google, alongside the Open Policy Agent folks, is to add a kind of Kubernetes interface to that so that, as you're running that as an admission controller, you can also specify rules as custom resources against the Kubernetes API-- so allow folks who don't want to necessarily use that domain-specific language to specify things in a more Kubernetes-friendly way of speaking.

CRAIG BOX: So Open Policy Agent is something you install on a per-cluster basis, at least if you're using it with the admission controller. Does it in any way address the problem of managing policy across multiple clusters?

JOHN MURRAY: Not per se. So it's a per-cluster admission controller. I think the way to manage it in a multi-cluster environment is, one, to deal with it at a different layer, like deal with it at Istio or something like that, where it gets thinking about things at the L7 layer, or to put in front of it something like CSP Config Management that allows you to ensure the same policies are being pushed to all the various different clusters, so that you are doing that multi-cluster coordination at the configuration layer rather than the actual admission controller itself.

ADAM GLICK: You mentioned CSP Config Management. What is CSP Config Management?

JOHN MURRAY: This is our attempt at Google to solve some of these problems-- the scalability problems around dealing with configuration in a multi-cluster, multi-environment world. And that's what CSP is really all about. And it's also an attempt to deal with some of these policy problems and ensure that you have that kind of declarative environment spec that you can enforce across all of your clusters.

So what it actually is is a series of custom controllers that run on your CSP clusters. And those can run on-premise. They can run in the cloud. And all of them point back, centrally, to a version control system, to a Git repository.

And what that allows you to do then is, say, express all of the configuration that you need to exist across those clusters, centrally, express it once so you don't have to repeat yourself. And express it in a way where you're saying, this is what I want my clusters to look like. Anything that you can configure on a Kubernetes API server, you can put into that repository.

And then, as clusters are added with these custom controllers on, they just point back to that repository. And the custom controllers bring down that multi-cluster declaration, all of the configurations that I want to see reflected in my environment, and then ensure that they are applied to the API server and that they don't drift away from that declaration.

ADAM GLICK: What makes that different than other means of managing this, like the OPA stuff that we talked about earlier?

JOHN MURRAY: Well, I think it really works very well in concert with something like OPA. And so you can think of OPA as an admission controller. But you can also think of it as something that you could compile into an upstream check-- so something that runs against your code repository and does static analysis rather than analyzing something as it's coming through.

CRAIG BOX: Right.

JOHN MURRAY: So you could envision a scenario in which you've added this to your clusters, but you're also running it in your repo. So I might cut a branch and say, hey, I want to deploy this pod. And then I send you a pull request. And before you even have to do a code review on that, we run all of your OPA rules against that branch, ensure that it's in line with all of your policies, and catch anything that's outside of that before anybody does a code review, before it gets committed to the Git repository, before it even gets reflected on the clusters.

It doesn't mean you can't also run the admission controller there. But now, you've got defense in depth. You're looking at it before it hits your environment in addition to as it hits your environment.

CRAIG BOX: How do you express this configuration to go out to the clusters in your CSP environment? And then how do you audit to check that everything worked the way you expected?

JOHN MURRAY: The expression in CSP Config Management is exactly the same as it is at the API server. So we've taken great care not to require people to write things in a domain-specific language. You can actually take the YAML directly from a cluster, check it into the Git repository, and that's all valid. And we will then, say, fan it out to 10 other clusters that are enrolled.

The way that we check it is, either in a local copy of the repository-- we have a command line tool that can do validation on that and basic linting and make sure that any references in there are correct versus other things that are also included. And then that same process, again, can be run as a check on top of pull requests and other things to ensure that they're valid. So you don't have to learn a new way of doing the configurations. You can take your old configurations and put them into this tool without any changes and then use that validation as you add new stuff.
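
One kind of reference check that could run against a local copy of the repository, or on a pull request, is confirming that every namespaced object points at a Namespace that is also declared in the repository. The Python sketch below illustrates that single check only. It is not the logic of the command line tool John mentions, and the function name is an assumption.

```python
from typing import List


def undefined_namespaces(manifests: List[dict]) -> List[str]:
    """Report objects whose metadata.namespace is not declared as a
    Namespace object in the same set of manifests."""
    declared = {
        m["metadata"]["name"] for m in manifests if m.get("kind") == "Namespace"
    }
    problems = []
    for m in manifests:
        meta = m.get("metadata", {})
        ns = meta.get("namespace")
        if ns is not None and ns not in declared:
            problems.append(
                f'{m.get("kind")}/{meta.get("name")}: namespace "{ns}" is not declared'
            )
    return problems
```

An empty result means every reference resolves; anything else can fail the pull request before the change reaches a cluster.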

CRAIG BOX: And I can only roll it out to a subset of clusters if I want to enforce different policies in different environments?

JOHN MURRAY: Yeah, absolutely. So we use label selectors, essentially, to do that. So you may have some things that you want everywhere all the time. You want a Prometheus operator on every cluster as, say, a DaemonSet. That's great. And you can check that in once, and we'll ensure that it's everywhere all the time.

But if you had, say, a resource quota rule, for example, and you wanted to have a different resource quota in your production environment versus your staging environment, you could address a particular set of policies or configurations to any cluster that has the label of production versus any cluster that has the label of staging on it.
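
The label-selector idea can be sketched as a simple equality match between a config's selector and a cluster's labels. The `clusterSelector` field name and the data layout below are assumptions made for this example, not the product's actual schema.

```python
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in the labels.
    An empty selector matches every cluster."""
    return all(labels.get(key) == value for key, value in selector.items())


def configs_for_cluster(cluster_labels: Dict[str, str], configs: List[dict]) -> List[str]:
    """Return the names of the configs that should be applied to a cluster.
    'clusterSelector' is a hypothetical field used only in this sketch."""
    return [
        c["name"]
        for c in configs
        if matches(c.get("clusterSelector", {}), cluster_labels)
    ]
```

A config with no selector lands everywhere, which corresponds to the "everywhere all the time" case, while a quota selected on `env: production` only reaches clusters carrying that label.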

ADAM GLICK: You mentioned that you just went into Beta with this. Congratulations on that! You must have customers that are using it. Can you give me an example of how someone is putting this into practice?

JOHN MURRAY: Yeah, thanks. We're really excited about the beta. As we've tested with customers, I think one interesting case that we see is where we have very large customers who run a number of clusters. And they have a central team of operators, oftentimes, called a platform team or something like that.

And it's their job to make sure that this environment is up and running and to give developers everything that they need to go in and start deploying their services. Typically, these are kind of like microservices environments where you've got lots of different development teams pushing out lots of small services. And the key thing that these customers want is to move fast, right? They want to be able to deploy a lot of stuff.

And so that team of operators, that central administration team, one, they need to make sure that all their policies are enforced. But they want to do it in a way that doesn't trip up all the developers who are, effectively, their internal customers.

How I've seen them use that is to say, OK, there's a set of resources that we feel the administrator owns, right? And that is, we are going to set you up with a namespace on all of our clusters. We're going to give you a role binding that allows you to deploy your application there. We're going to set a network policy there that ensures that things don't have egress to the namespace or ingress or whatever that rule is. And then we're going to give you a service account for your, say, CI/CD pipeline to bind to there.

And as long as all those things are there, then you can do whatever you want to do within the context of the role binding we give you. And so we've seen people come with those use cases and use config management in a way that their processes, when they get a new service coming on board, or a new development team that wants access, they send a pull request with all that stuff. They create a Namespace. They create a ResourceQuota with it. They create a RoleBinding with it.

And the minute that they check that in, all of the clusters that are enrolled pick it up. So that development team now has a home, so to speak, in that multi-cluster environment. And they can go about their business knowing that the guardrails that keep them in line with the policies have been set up centrally by that team.

And then, if there's an adjustment that needs to be made-- say somebody else needs to be added to the role binding so that they can access-- that's a simple update to a piece of code that's, by the way, auditable, transactional, revertible, and all that. And that's a fairly common pattern that we've seen with a few of the people we've tested with.
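
The onboarding pattern described here, where a single pull request carries everything a new team needs, might look like the following Python sketch, which generates the set of manifests as plain dictionaries. The object names (`ci-deployer`, `default-deny-ingress`, `compute`), the CPU limit, and the choice of the built-in `edit` ClusterRole are illustrative assumptions, not details from the episode.

```python
from typing import List


def onboard_team(namespace: str, group: str, cpu_limit: str = "10") -> List[dict]:
    """Build the administrator-owned resources for a new team's namespace."""
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {
            # Lets members of the team's group deploy within their namespace.
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            "metadata": {"name": f"{namespace}-editors", "namespace": namespace},
            "roleRef": {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "ClusterRole",
                "name": "edit",
            },
            "subjects": [
                {"apiGroup": "rbac.authorization.k8s.io", "kind": "Group", "name": group}
            ],
        },
        {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": "compute", "namespace": namespace},
            "spec": {"hard": {"limits.cpu": cpu_limit}},
        },
        {
            # Selects all pods and allows no ingress until a rule is added.
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": {"name": "default-deny-ingress", "namespace": namespace},
            "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
        },
        {
            # Identity for the team's CI/CD pipeline.
            "apiVersion": "v1",
            "kind": "ServiceAccount",
            "metadata": {"name": "ci-deployer", "namespace": namespace},
        },
    ]
```

Adding another person or group later is then a one-line change to the `subjects` list, reviewed and reverted like any other commit.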

CRAIG BOX: Do you keep track of the clusters that are enrolled to make sure that there's no bad actors disabling the policy on those clusters?

JOHN MURRAY: We keep track of the clusters that are enrolled. And in addition, the fact that you have installed our custom controllers means that we are also watching things on the cluster. So if we create a resource, it means that we feel that our system owns that resource and that nobody else should be able to come in and mutate it. The smart thing to do is make sure that people don't have the rights to do that.

ADAM GLICK: Yes.

JOHN MURRAY: But even if there's a mistake and somebody does come in and accidentally tries to clobber one of these things, we will immediately revert that back and try to reconcile it with the declaration in the cluster.

CRAIG BOX: So if I have an auditor come to me and say, hey, I need you to prove that this is the state across all of the environments, this would be a tool they'd be able to use to generate a report like that, to run those kind of queries on all your multiple clusters?

JOHN MURRAY: Yeah. You can look at it and see, hey, these are all of the objects that are being managed actively by this. And then you can fire up a command line and go try to kill something and show the auditor that, within a few seconds, it's back to the state that it should have been in.

CRAIG BOX: Great.

ADAM GLICK: Would you say this is more of something that is an alerting mechanism or a gating mechanism?

JOHN MURRAY: It's definitely designed to be a gating mechanism. And I think we want to move away from alerting, because that's kind of the old model, which is you run one of these tools, and then you watch the clusters. And you do a diff between, say, what's in your repository and what's supposed to be there. And then somewhere down the line, someone gets an alert or a warning in a sea of alerts and warnings that they're getting that something has drifted out of line.

But that whole model of doing things still relies on a human to take a look at the alert, go figure out what's going on, go down to the cluster, understand why it got pushed out of line, and then fix it. What we want to do is similar to what Kubernetes does, which is if something gets out of line, it's the job of the controller to reconcile it.

And so that's not about alerting, it's about active management of these resources. So the idea is not to let you get out of line and then tell you you're out of line. It's not to let you get out of line at all.

CRAIG BOX: Great. That's fantastic. John, thank you very much for joining us today.

JOHN MURRAY: Yeah. It's been my pleasure. Thanks a lot, guys.

CRAIG BOX: We look forward to CSP and the CSP Config Management software moving to general availability later this year. You can find John on Twitter at @jrmurray000. How did you pick how many zeros to add?

JOHN MURRAY: Well, you try one, and it fails. And then you try two, and it fails. And finally, you get to three, and it was successful. So I was a little late in grabbing my handle, I guess.

CRAIG BOX: Are you sure there's no jrmurray01? You didn't have to go to three digits?

JOHN MURRAY: I just like zeros, I guess. It's a nice round number.

[MUSIC PLAYING]

CRAIG BOX: That's all for another week. It just remains for me to say, thank you, as always, for listening. Whether you're a new listener, or if you've been with us since the beginning, please help us keep growing by telling a friend.

If you have any feedback for us, you can find us on Twitter at @kubernetespod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]