Kubernetes Podcast from Google: Episode 101 - Open Policy Agent, with Tim Hinrichs and Torin Sandall

#101 April 28, 2020

Open Policy Agent, with Tim Hinrichs and Torin Sandall

Hosts: Craig Box, Adam Glick

Tim Hinrichs and Torin Sandall are the creators of Open Policy Agent (OPA), a project which allows policy to be integrated with popular cloud native software (including Kubernetes and Envoy) or anything you write yourself. Adam and Craig discuss OPA with Tim and Torin after the news of the week.


ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

CRAIG BOX: As we roll into the second month of lockdown around the world, it's interesting to see what remains in short supply down at the supermarket. The toilet paper shelves are brimming with pretty much everything you would've expected beforehand. The one thing that seems like it's still hard to find is flour. I think the number of people who are baking things is up through the roof. Flour is really hard to get. Yeast is really hard to get as well. A lot of people are trying to bake homemade bread. I think they haven't realized the bread shelves are completely full, and you can just go down to the supermarket and buy it.

ADAM GLICK: Some people are expanding out, using their social networks to find some of this, I believe.

CRAIG BOX: Yes, Joe Beda tweeted recently that he'd requested yeast from his online grocery shop, and instead, he had been sent a jar of Marmite.

ADAM GLICK: [LAUGHS]

CRAIG BOX: The famous yeast spread, the British version. Those who've listened to the show before or who are familiar with me will know that Marmite is different in New Zealand than it is in Britain. They have subtly different variations. They're both still better than Vegemite, which just tastes like, really.

ADAM GLICK: A little subcontinental divide there, so to speak?

CRAIG BOX: This is the part of the show that my Mum listens to, so she'll be able to inform me. I'm pretty sure my Mum likes one, and my Dad likes the other.

ADAM GLICK: I recall some competition show, where it was people from down under competing with Americans on eating a sandwich. And they had them have to eat a peanut butter sandwich while the folks from the States had to eat a Marmite or Vegemite sandwich. I don't remember which one. (LAUGHING) But it was pretty clear that it was far easier to eat peanut butter than it was to eat the yeast sandwich.

CRAIG BOX: I like peanut butter on toast. (I think what I actually like is butter on toast, and what goes on top of the butter doesn't actually matter.) But I don't get the American thing of putting peanut butter in sweet food, chocolate, jam or jelly, whatever you call it. It's an acquired taste. It must be. Not for me.

ADAM GLICK: No Reese's over there?

CRAIG BOX: They sell them, but I don't have to buy them.

ADAM GLICK: Speaking of different kinds of pieces, I've been working on a jigsaw recently. I wanted something a little different in terms of puzzling activity to undertake, and so I've got this 1,400 piece puzzle spread out on the desk here. And it's quite an undertaking.

CRAIG BOX: Mm-hmm.

ADAM GLICK: But it's a nice one of those things to kind of calm your mind and zen out a little bit when you just need to kind of focus on something. And so that's my undertaking of the week. We'll see how it progresses. Right now, what I have is a lot of pieces on the table.

CRAIG BOX: I see from the link here, it says it's a four-dimensional puzzle from Game of Thrones, nonetheless. Is that kind of like the four-dimensional chess from Star Trek or Big Bang Theory? It really goes out the whole geek oeuvre.

ADAM GLICK: Well, I know what the three dimensions are. I can only assume that the time dimension is how much time it's going to take me to actually put it all together. So far at this point, all I've done is I've done the North above the wall, so the land of the free folk is completed, and that's about it at this point.

CRAIG BOX: Well, we'll get regular updates, but until then, let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: Google Cloud announced the general availability of Anthos on AWS, allowing the use of a single API to deploy and manage applications across multiple clouds and on premises. Features of Anthos GKE running on AWS include high availability clusters across multiple AZs, autoscaling, and using existing VPCs, security groups, and load balancers. You can now set policy on AWS workloads with Anthos Config Management and use Anthos service mesh to securely connect and manage services running across them. This release also extends Anthos management features to workloads running on VMs. Support for Azure is in preview, with GA expected later this year. Upcoming support for running on bare-metal was also announced.

CRAIG BOX: The European Conference on Computer Systems is on this week, albeit not in person in Crete. Google published a paper on Autopilot, a system that configures resources and limits on Borg tasks automatically. Humans are definitely not the best judges of the resources a task pod needs. Autopilot scales both horizontally and vertically and minimizes wasted CPU and memory using machine learning algorithms. Autopiloted jobs have a slack of just 23%, compared with 46% when manually managed.

Additionally, Autopilot reduces the jobs that run out of memory by a factor of 10. Kubernetes has a vertical pod autoscaling feature which takes lessons from Autopilot and was built by many of the same people.

A second paper was also published analyzing a new Borg trace from May 2019, an update to the widely cited data set from 2011. The trace enables researchers to explore how scheduling works in large scale production computer clusters. This inside look at Google's cluster computer has many findings, which will help influence decisions in projects like Kubernetes.

ADAM GLICK: Cloud Foundry continues its walk to becoming an app on top of Kubernetes with CF4K8s, a name clearly never designed to be spoken out loud. While the KubeCF project basically ports BOSH to Kubernetes and runs like it used to, CF4K8s reimagined the stack in a much more Kubernetes idiomatic way.

The Cloud Foundry Foundation also announced Paketo Buildpacks, an evolution of Cloud Foundry Buildpacks that are now compliant with the CNCF's similar but different Cloud Native Buildpacks specification. A buildpack is a tool to translate source code into container images as automatically as practical. Paketo Buildpacks promise less time building, more time developing.

CRAIG BOX: COVID-19 has necessitated changes to the Kubernetes 1.19 release cycle. This release is now 17 weeks long, instead of the usual 12, which acknowledges the probability of less human testing and slower collaboration during this time. The new release will be in August, just before the rescheduled KubeCon EU. And support for 1.16 will also be extended in order to accommodate this change.

Kubernetes 1.20 will be extended to a four month release cycle and be the last release for 2020.

ADAM GLICK: Aqua Security has announced Dynamic Threat Analysis, a new product which looks for dynamic threats that are hard to detect with static analysis. DTA is available standalone or is part of the Aqua Cloud Native Security Platform. Aqua also announced Cloud Security Posture Management, formerly CloudSploit, is now generally available for Google Cloud and Oracle Cloud.

CRAIG BOX: Red Hat Enterprise Linux 8.2 is out, and with it are some enhanced tools for container building. For layered security, Skopeo and Buildah are now distributed in containers themselves. And a new tool called Udica can be used to create container-centric SELinux security policies to reduce the risk that a process can break out of a container.

8.2 also adds the OpenJDK and .NET 3.0 to the Red Hat Universal Base Image.

Red Hat has also announced some product lifecycle increases, including support for OpenShift v3 extended from June this year to July 2021.

ADAM GLICK: VMware and Kinvolk have collaborated to bring Flatcar Container Linux as a supported operating system on vSphere. Flatcar is a drop-in replacement for CoreOS Container Linux, which reaches end of life at the end of May. We discussed Flatcar in episode 79.

CRAIG BOX: Alcide, with a C, has introduced a new open source tool called sKan, with a capital K. sKan is a security scanner for Kubernetes configuration files, designed to be used in CI systems to detect changes. The tool is designed to work with the Open Policy Agent and works like a linter to perform static analysis on config files. The output of the tool can be set to HTML, making the results easy to read and incorporate into dashboards.

ADAM GLICK: Two specialist Kubernetes clients joined the fold this week. kubeletctl from CyberArk is a client for connecting to the kubelet. You don't normally need to do this, as the kubelet watches the API server to learn what to do, but this may come in handy for debugging.

If you want to get your finance director managing Kubernetes, then you'll need XLS-Kubectl. Based on an April Fools' Day post on Reddit, Daniele Polencic has implemented live Kubernetes administration from within Google Sheets. No YAML was harmed in this process.

CRAIG BOX: Microsoft is working on a high performance HTTP reverse proxy written in C# for .NET Core. The project, spelled Y-A-R-P and pronounced "yarp", came about as its authors found many internal teams at Microsoft were already building a reverse proxy for their service, or were looking into doing so. YARP will ship as a library and project template for maximum customizability, and running it as a Kubernetes Ingress is already on the backlog.

ADAM GLICK: Who says containers should only run on the server? Misha Brukman has written up how he used Docker containers to install all the i386 libraries required to run old point-and-click adventure games on a 64-bit desktop. The steps are completely transferable to running server software, so you could consider this a guide on containerizing software in general, or you could just get help running Machinarium from 2009.

CRAIG BOX: Machine learning framework PyTorch is maintained by Facebook. And engineers from there and AWS have launched two experimental tools for it this week. TorchServe is a model serving library to deploy PyTorch models for inference. TorchElastic for Kubernetes is a controller for deploying distributed training jobs on Kubernetes, including on preemptible or spot instances.

ADAM GLICK: NetApp announced Project Astra, a software defined storage system that will plug into any Kubernetes distribution or management environment. Features include migrating applications and data from one Kubernetes cluster to another, backup and restore, and disaster recovery. Astra appears to be more of a direction than a project at this point, with interested users invited to sign up for updates as it develops.

NetApp recently shuttered its NetApp Kubernetes Service and are pivoting back to their bread and butter of storage and data management.

CRAIG BOX: Styra's Declarative Authorization Service, a management console for operationalizing Open Policy Agent, has released support for Kubernetes mutating webhooks, allowing it to modify objects, as well as just accepting or denying them. You will learn more about Open Policy Agent in the interview very soon.

ADAM GLICK: Finally, modern distributed databases rely on clocks to synchronize data for transactions, or to tell you when data is out of sync. Chaos testing time is hard when all pods on a node use the same clock. PingCAP have added a time chaos module to their Chaos Mesh framework, which can simulate different times in different containers, implemented using eBPF.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Tim Hinrichs and Torin Sandall are co-creators of the Open Policy Agent. Tim is the CTO and founder of Styra, and Torin is VP of open source at Styra. Welcome to the show, Tim.

TIM HINRICHS: Nice to meet you all. Thanks for having us.

ADAM GLICK: Welcome to the show, Torin.

TORIN SANDALL: Thanks for reaching out and looking forward to it.

CRAIG BOX: We last talked about policy in episode 42, where we had John Murray from Google as our guest, and we asked him the question what is configuration and what is policy? Let's open up today by asking you both that same question.

TIM HINRICHS: The way I think about policy is that we really do live in this world where there are all kinds of rules and regulations all over the place, right? Like in the world of software, any time you're trying to deploy an application, let's say to production, maybe you've got to have at least three replicas. Or maybe any time you're SSHing into a production server, it's got to be the case that you're on call. So there are all these rules and regulations that exist throughout the world. And software's no different.

So policy, to us, is just those conditions that we're supposed to follow when we manipulate software, when we release it, when we write it, when we use it. And so that's really policy.

And I think configuration is sort of similar in the sense that it's bits that we control in a piece of software, but those bits are really just parameterizing that software. They're not really guardrails. They're the things that tell the software exactly what to do, whereas with policy, those are more conditions or constraints on what's supposed to happen, things that should be allowed, things that should be denied.

ADAM GLICK: Would an easy way to kind of summarize that be one is the set of settings, and the other is the thing that enforces those settings?

TIM HINRICHS: Sure. Yeah. I think so. I mean, I think we often, like at least with OPA, we try to separate enforcement, like what you do with the policy, from the policy itself. So the policy might say that you have to have three replicas in order to run the software in production.

Now you could choose to enforce it. You could say this particular policy is something that we are not going to allow an application to be deployed, let's say on Kubernetes, unless it's got three plus replicas. But you could take that same policy and just monitor it and say, look, we're going to let people put whatever applications out with however many replicas they like.

But what we want to be able to do is identify those applications that are violating policy. Why? Maybe because we're just in a state of the world where we can't enforce it for whatever reason. But yet, we still want to be able to monitor and understand how close to compliance we actually are.
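
The separation Tim describes here, one policy and two ways of acting on it, can be sketched in a few lines. This is an illustration in plain Python, not Rego and not OPA's API. The function names (`replica_violations`, `enforce`, `audit`) and the shape of the deployment dictionary are assumptions made up for the example.

```python
from typing import Dict, List

# The rule from the conversation: production workloads need three or more replicas.
MIN_PROD_REPLICAS = 3


def replica_violations(deployment: Dict) -> List[str]:
    """The policy itself: return one message per rule the deployment breaks."""
    problems = []
    if deployment.get("environment") == "production":
        replicas = deployment.get("replicas", 0)
        if replicas < MIN_PROD_REPLICAS:
            name = deployment.get("name", "<unnamed>")
            problems.append(
                f"{name}: {replicas} replicas, "
                f"need at least {MIN_PROD_REPLICAS} in production"
            )
    return problems


def enforce(deployment: Dict) -> bool:
    """Enforcement: admit the deployment only if the policy reports nothing."""
    return not replica_violations(deployment)


def audit(deployments: List[Dict]) -> List[str]:
    """Monitoring: admit everything, but list what is out of compliance."""
    report = []
    for deployment in deployments:
        report.extend(replica_violations(deployment))
    return report
```

Calling `enforce` on a two-replica production deployment rejects it, while `audit` over a list of deployments lets them all through and returns the violation messages. Both paths call the same `replica_violations` function, which is the point: the rule is written once and the decision about what to do with it is made elsewhere.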

ADAM GLICK: You're the creators of OPA, or the Open Policy Agent. Can you explain what that is and how it came to be?

TORIN SANDALL: Open Policy Agent, or OPA, as we like to call it. Lots of people call it O-P-A. We call it OPA because that was sort of too good to give up.

CRAIG BOX: Do you make sure that you always drop plates on the floor and shout, "Opa!"?

TORIN SANDALL: [CHUCKLES] There are a lot of break glass metaphors when it comes to policy, so it's a pretty good one. But we call it OPA. And the reason why we created OPA in the first place was to help provide a building block that would unify policy enforcement across a wide range of use cases. Like the examples that Tim just gave about deciding how many replicas a workload's allowed to have, or how much memory and CPU a workload's allowed to have, or what containers a workload can run, and so on.

Those are fairly different problems at the high level or the service level, but under the hood, what it comes down to is what Tim said. It's expressing constraints or logic over configuration, desired state, API requests, and so on.

So what OPA gives you is basically this building block that allows you to enforce policy across a wide range of software.

CRAIG BOX: Was the creation of the project inspired by a problem that you had tried to solve in a previous life?

TIM HINRICHS: I think so. One of the things that we saw when we started-- maybe uplevel that for a moment-- which is just this idea that we have these rules and regulations everywhere. And typically what we see, and what we saw people do before we started OPA, was that a lot of these rules and regulations are things that just show up in wikis, right? Or emails or PDFs or Word docs. They're written down in a form that's good for people to consume, right? It's very easy to send out an email and say everybody needs to have three or more replicas.

CRAIG BOX: I think there's a word for that. It's folklore.

TIM HINRICHS: [LAUGHING] That's good. Right. So yes, exactly right. So this sort of problem that we saw was like, it's great. This is a very easy way to update policy, to write policy. Just put it in an email. But the downside, of course, is that now you've got human beings who are responsible for not only knowing what all those policies are, but also remembering what they are and making sure that they don't make mistakes. They don't fat finger it.

And so in that kind of world, humans make mistakes. And so the goal of OPA was to create a piece of software that would allow us to take those real world rules and regulations, take them out of those emails and wikis, codify them in software. So now the software can go ahead, because it knows what the policies are, it can enforce those policies, it can monitor them, it can remediate them. It can do whatever it needs to do.

Part of this just came out of sort of thinking about many of the goals of this Cloud and Cloud Native ecosystem, and the fact that one of the major goals is to be able to release software more quickly than ever before, right? So release software on the order of minutes instead of months.

And if you're going to do that, well, then you need to automate a whole lot of the software release cycle. And part of that is automating a lot of the security, compliance, and operational checks that we used to do manually. And so really, that's sort of where we saw OPA come into play. It's like, look, if we're going to accelerate software delivery, we need to be able to automate the process of enforcing and checking all of these rules and regulations.

ADAM GLICK: Was this something that was taking a lot of your time and you wanted to solve it? Or were you looking for just something that you could go and build? There's lots of problems to go and solve. Why did this one grab your attention?

TIM HINRICHS: I've been working on policy for a very long time. It's been 18 years or so I've been working on building policy systems and the like. And so this is just something that has spoken to me for a very long time. And so it turned out that the sort of genesis of all this came even before Styra. The founders were working at VMware at the time. And so we came into the Nicira acquisition.

And so one of the things we were doing is we were just talking to a number of current customers. And what they were saying was, look, we've got hundreds, maybe thousands, of different pieces of software that all need authorization and policy. And so what they had done is they had sort of cobbled together something internally to provide that unified policy solution for all those different pieces of software. But what they told us was, look, this is not something that they wanted to be in the business of. And so if we could go ahead and build something, they would really appreciate that and love it. So that was sort of how the foundation of OPA came to be.

CRAIG BOX: Back before we were all online and doing everything with microservices, say, people might have been running on Windows 2000 networks, for example. Was there a Closed Policy Agent? Was there an equivalent to this in that world?

TORIN SANDALL: There are some standards that emerged in the 2000s. The OASIS standards body, for example, has a thing called XACML. There's a vendor community around that in an effort to standardize on a certain architecture for managing policy from end to end. So it's not just kind of like the enforcement, but it's also the decision making, as well as the information gathering, and administration, and so on.

So that saw some adoption. And if you go back further, if you worked in networking or in telecom, policy is well established in those spaces. And that's where I had worked for a long time. And I think there are certain domains where it's much more well-established, but I think inside of cloud and DevOps and the cloud native ecosystem, it's definitely something that wasn't there from the start. It's something that we've sort of helped establish a little bit with Open Policy Agent.

ADAM GLICK: There's the question you can ask yourself, like, do you start a company? Do you start a project? Do you do them together? What was the ordering as you came about with what you're doing?

TIM HINRICHS: When we started the company, what we knew was we wanted to solve this problem of policy. We wanted to automate policies. Well, we wanted to solve the problem. And so at the end of the day, what we knew is that we needed to, as part of that overall solution, needed to create this open source project, OPA.

In order to solve that piece of the problem, which is that look, at the end of the day, OPA provides a text-based policy language. And that's just something that just absolutely had to be open source. And so what we knew is that by putting that in the open source-- we had very clear ideas about what OPA should be and what it could be, but there's a whole bunch of stuff. We knew the community would be incredibly helpful for expanding and shaping the project. And we always knew it had to be something that the community would own at the end of the day, right?

It's a policy language. We're going to take these rules in PDFs and emails and then codify them in software in a language. And so that just simply had to be open source, at the very least, because then the community can help and help guide the features of that language. It can guide the tooling that we would need to provide in order to make people who are writing policy have all the same kind of functionality that they have in traditional programming languages, like unit testing and profiling and benchmarking and so on and so forth.

And we also knew that the community would be instrumental in helping us understand, like, which sort of integrations, which use cases were most important, and when to tackle them, and in what order. So we knew that these go hand-in-hand. If you're going to solve the policy problem, you need an open source project.

CRAIG BOX: That being said, though, policy is something that you would normally associate with larger companies, with enterprises, with people who are willing to pay money, and people who come from that traditional world of being more concerned about the closeness of their rules set and perhaps happy to adopt proprietary software in this. Was it necessary because of this Cloud Native ecosystem and that world, that it was open source from the beginning?

TORIN SANDALL: I think that there are certain technical design decisions that we can get into that motivate some of the reasons for open sourcing OPA and doing a lot of this work in open source. I think that if you look at the Cloud Native community now, or even five years ago in 2015 when it was started, there was definitely an obvious trend towards open source, for not just DevOps, but in other spaces like networking and so on. I think Tim mentioned earlier that he also worked within OpenStack.

So the founders of Styra and the people had had experience already working with the open source communities, and it just seemed like a natural starting place for this kind of technology. Because at the end of the day, if you're a large organization and you have 50 or 100 or 1,000 different systems that you have to manage, and each one of those systems has a different way of expressing who can do what, that's just a nightmare to manage.

And so what you want is kind of a standard way of doing that, and the way that that happens today is through open source. And so one of the things that we sort of believed in from the beginning was that we would start with code, essentially, rather than start with a protocol or a spec or something like that. We would start with some code and provide something of value for those people that wanted a standard way of expressing policy and authorization.

ADAM GLICK: Your holiday present to the open source community in 2015 was OPA. Back then, the Kubernetes community was a little different than it is today. So you were making a bigger bet on the community and the decision. How did you come about choosing where you would actually build this policy agent?

TIM HINRICHS: One thing to be clear about is, as we said, when we envisioned OPA, we always thought it was an open source project. So we put it-- you know, every line of code that's ever been written has been done to GitHub. So our first commit was like-- it was the first line. It was not like we dumped a bunch of stuff over in December 2015.

In terms of where, like what community, we decided to create OPA in, I think it was about, what? Maybe a year after that first commit, or roughly a year, that we decided to donate it to the CNCF, right?

So I think the one thing to keep in mind here is that OPA is one of these projects that is intended to solve the policy problem across many different areas in many different domains, many different pieces of software. So we talk about you can use it-- and it is heavily used for Kubernetes. It's heavily used for microservice APIs in sort of a service mesh world. It's being used in production for databases and controlling who can SSH into servers. And I'm sure I'm forgetting a couple-- CI/CD pipelines.

And so we never think of OPA as belonging to any one of those communities or subcommunities. Rather it's something that binds and ties many of them together. It is, at the end of the day, trying to unify the solution to policy across all those different areas or domains.

Now the CNCF, like I said, we decided to donate OPA to the CNCF about a year after it was created. And the reason for that was pretty easy. I mean, it was like the CNCF was a really vibrant community. At least, that's how we saw it. And we knew that it was a bunch of folks getting together that were trying to understand and build a bunch of software designed for this Cloud Native space. And that's exactly where we saw there being the most need for this kind of unified policy system. In this Cloud Native space, you've just got a tremendous amount of dynamism. You've got all these microservices spinning up and down, servers even, maybe even Cloud accounts that are spinning up and down.

And so all that dynamism just makes it clear that the kinds of policies that you need to put in place are just more dynamic. They need to adjust to all that dynamism. And so that's one of the core features of OPA, that you can write those sort of contextually-aware policies.

But then what we also saw was that there was so much new software in this Cloud Native space, that anybody, any organization that's trying to embrace that Cloud Native mindset, they're going to have to embrace a whole new stack of technology as well, right?

Think about the public clouds and Kubernetes and the microservices and even the application. A lot of that is just brand new. And so it sort of made it quite clear that having a unified way of solving policy in that space would be incredibly valuable. So it was a great community. There was a space in which what we knew is that there was a lot of value in having that unified policy system. So for us, it was a very easy decision to go ahead and donate it.

CRAIG BOX: We've obviously got OPA as the name of the project. Styra, the company name there, is that the Greek word for insulating foam?

TIM HINRICHS: No. We had to double check that and make sure it was not a shortened version of Styrofoam.

CRAIG BOX: How did the name come about?

TIM HINRICHS: Like many good startups: first, its dot-com was available. And second, it does mean to govern. So those two things, and it's short. And it's hopefully memorable.

TORIN SANDALL: It means to govern in Swedish though.

CRAIG BOX: That counts.

TIM HINRICHS: For those of you who have kids, I can go to my kids, who were quite a bit younger at the time, and say like, Styra, like steer the car, right? They kind of got that.

ADAM GLICK: Are your kids policy gurus?

TIM HINRICHS: No, they're not.

ADAM GLICK: Budding crossing guards.

TIM HINRICHS: Yeah, we're working on training pretty heavily now. And I'm like, this is the test. When I've got the training materials to a point where I can just hand them over to the kids, and they can pick it up and start writing microservice or Kubernetes level policies, then I know the training is super solid.

CRAIG BOX: You define the policies that apply in OPA using a language of your own creation called Rego. What was the decision process at the time to create your own domain specific language versus embedding something like Python or configuring it only as code rather than in the language?

TORIN SANDALL: We actually had some funny stories about things that we tried before we created Rego.

CRAIG BOX: You've come to the right place.

TORIN SANDALL: Yeah. So Tim has a deep background in languages like Datalog and logic based systems like that. But before we went down that road, we tried other things, like, for example, using SQL to define policy. And one of the challenges that we ran into there was that when you try to use any language that's not well designed around deeply nested structured data, like YAML or JSON, it becomes really, really difficult to express policy. And so we have some fun stories where we tried to use SQL to express policies over certain cloud APIs. And it just ended up being a complete nightmare, and so we gave up on that.

We also looked at whether we could use traditional imperative languages, but there are some pretty hard challenges you need to overcome there around performance and ability to perform different kinds of analysis and optimization that sort of are showstoppers with that.

And so what we ended up doing was taking a lot of inspiration and the core language semantics from Datalog and other logic based systems and then adding on top of that the ability to express policy over JSON or just any kind of hierarchical structured data. So OPA came out of that.

TIM HINRICHS: I'll add to that just a little bit because I remember this pretty vividly. Because I think I was the one who ended up writing the code to do this, right? So the story about trying to use SQL, so the thing that we ended up trying to do is say, let's just use SQL to write policy over-- I think it was, like Torin said, public cloud APIs. Let's say you're just writing policy, and you're using-- just run one of the public cloud APIs to grab the list of all servers or whatever, right?

And then the idea was you take that data, and that represents a state of the world, and now you write policy over it. Are servers connected to the right networks? Or do they have enough memory? That kind of thing.

And so to use SQL, the challenge there is that the data that comes out of those public cloud APIs is this deeply nested JSON. It's got 10, 15 levels deep of nesting. And so SQL was not designed to deal with nested data. And so in order to use SQL, to write policy over that public cloud data, you've got to basically take the 15, 20 levels of nesting and flatten it out into relational tables, right? Just rows and columns of simple values.

And so we did this with one of the public cloud APIs, and we ended up with like 200 different tables. All right. And so the problem there is, of course, manifold, but not just the size. Like who cares? Computers are big. They're fast. But it's a people problem, right? If suddenly I gave you 200 tables and said, go ahead and write your policy over those 200 tables, and there were no docs, because those are all synthetically created, what would you do? You would be at a loss. So we tried it.

And then what we realized was that when we actually sat down and actually tried to write the policy, you end up having to write a whole tower of these joins to sort of reconstruct the embedded structure that you started with. And so you ended up writing 15, 20 joins in SQL in order to just get back to the point that you started with the deeply nested JSON.

Anyway, so I literally remember 200 tables coming out and just being beside myself because I knew I couldn't even write the policy myself. So.
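The flattening problem Tim describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `server` record, its field names, and the `flatten` helper are hypothetical and are not taken from any real cloud API or from the original prototype. It shows how each level of nesting becomes its own table, and how a join per level is then needed to get back to what the nested document already said.

```python
from collections import defaultdict

def flatten(record, table, tables, parent_id=None):
    """Split one nested record into flat rows, one table per nesting level."""
    row_id = len(tables[table]) + 1
    row = {"id": row_id, "parent_id": parent_id}
    tables[table].append(row)
    for key, value in record.items():
        if isinstance(value, dict):
            # A nested object becomes a row in a child table.
            flatten(value, f"{table}_{key}", tables, row_id)
        elif isinstance(value, list):
            # Each list element becomes its own row in a child table.
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                flatten(child, f"{table}_{key}", tables, row_id)
        else:
            row[key] = value
    return tables

# Hypothetical server record, loosely shaped like a cloud API response.
server = {
    "name": "web-1",
    "memory_gb": 4,
    "interfaces": [
        {"network": {"name": "prod", "cidr": "10.0.0.0/16"}},
        {"network": {"name": "mgmt", "cidr": "10.1.0.0/16"}},
    ],
}

tables = flatten(server, "server", defaultdict(list))

# The nested form answers "which networks is this server on?" by walking the tree.
nested_answer = [i["network"]["name"] for i in server["interfaces"]]

# The flat form needs a join per level to rebuild the same answer.
flat_answer = [
    net["name"]
    for iface in tables["server_interfaces"]
    for net in tables["server_interfaces_network"]
    if net["parent_id"] == iface["id"]
]
```

Even this three-level toy record produces three tables; a response with 15 levels of nesting multiplies that quickly, which is consistent with the 200-table result described above.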

TORIN SANDALL: We thought it was great, though. We were like, oh yeah, people can just write policy in SQL. They'll love that.

TIM HINRICHS: Oh yeah. No need to have a new language. It's just SQL. It didn't turn out so well.

CRAIG BOX: Now we have this language, which in fairness, is mostly "if" statements if what I read is correct. So we're looking at a request that comes in from an end user. Let's say someone has embedded the OPA library in their application. They've authenticated a user, something like OAuth with JWT. They know who the user is, and now they want to check it against a certain policy. What's the life of the request at that point?

TORIN SANDALL: The way that OPA works is that basically whenever your software needs to make some kind of decision, it can execute a query against OPA. It can ask the question, should this thing that's happening-- whether it's an API request or an admission control request in Kubernetes or some event from Kafka or whatever, right?

Whenever something occurs that needs a policy decision, your software can ask OPA for the answer, like what to do, basically. Should I allow this? Should I deny this? Should I rate limit this? What should I do?

And so the way that that works is different depending on the use case. You can take OPA. You can basically embed it as a library inside of Go today. So if you want to just basically extend your service or your tool or your application with a policy engine, and your software happens to be written in Go, then OPA provides a very nice, lightweight, easily embeddable, dependency-free way of doing that.

But you can also run it as a daemon. So you can stand up OPA as a daemon. It's typically run sort of as a sidecar or host-level daemon for performance and availability reasons. But when you run it as a daemon, regardless of whether it's a service or a sidecar, it exposes an API, an HTTP API, which you can then query and get that policy decision.

So regardless of how you're embedding it, whether you're doing it as a library or running it as a daemon or running it as even a CLI tool, like in a CI/CD pipeline to do checks at CI time or as a Git pre-merge hook, it's kind of all the same.

So you're just kind of querying OPA and asking what should the decision be for this data? And then you can pass in arbitrary, hierarchical structured data, JSON, and OPA basically takes that data, crunches it against the policies that have been loaded into the engine, and then spits out an answer, which it sends back to your software. So you're basically decoupling the policy decision making from the policy enforcement. And so that's sort of how it works at a high level.
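As a rough sketch of that query flow when OPA runs as a daemon, the Python below builds (without sending) a request to OPA's Data API, which accepts a POST to `/v1/data/<path>` with the input wrapped in an `input` field and answers with a `result` field. The sidecar address, the policy path `httpapi/authz/allow`, and the input attributes are assumptions made up for the example. The `enforce` function is a hypothetical caller-side helper that illustrates the point about decoupling: OPA returns a decision, and the calling software acts on it.

```python
import json
import urllib.request

def build_opa_query(base_url, policy_path, input_doc):
    """Build (but do not send) a POST to OPA's Data API: /v1/data/<path>."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/data/{policy_path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def enforce(decision_doc):
    """Enforcement stays in the caller: OPA only returns the decision."""
    # A missing "result" means the rule was undefined; treat that as deny.
    return "allow" if decision_doc.get("result") is True else "deny"

request = build_opa_query(
    "http://localhost:8181",   # assumed sidecar address
    "httpapi/authz/allow",     # hypothetical policy path
    {"user": "alice", "method": "GET", "path": ["finance", "salary", "alice"]},
)
```

Sending the request with `urllib.request.urlopen(request)` and passing the decoded JSON body to `enforce` would complete the round trip, assuming an OPA daemon is listening at that address with a matching policy loaded.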

ADAM GLICK: Where does the back end for OPA run?

TORIN SANDALL: OPA itself, like we said before, it's intended to be this sort of building block that you can embed into all kinds of different places. And one of the decisions that we made sort of early on was to design it to be as lightweight and easy to deploy as possible. And so some things fall out of that sort of requirement, which are that basically, OPA keeps all of the policies and data that it uses to make decisions in memory.

So if you're embedding it as a library, it's your responsibility to load those rules or that context into memory and feed them into OPA.

Or if you're running OPA as a daemon, then it's your responsibility to either push it in via the APIs, or OPA has certain built-in mechanisms, which we call the management APIs, to allow you to configure OPA to basically pull down policy and data from different places.

So you could put your policy and data into-- you could serve it up via a web server. You could put it into a Google Cloud Storage bucket or an AWS S3 bucket and then serve it out of there. But OPA will basically just be kind of constantly trying to pull down the policy and data that it needs to make decisions. And it'll always be trying to converge on the latest version.

So OPA itself doesn't really have a back end. It's kind of decoupled, and it allows you to kind of run it in this sort of distributed manner.
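The pull-and-converge behavior described here can be illustrated with a small polling sketch. Everything in it is hypothetical: `PolicyStore`, `poll_once`, and `make_fake_server` are stand-ins invented for the example, and they model the idea (keep policy in memory, fetch periodically, activate only when the revision changes) rather than OPA's actual management API implementation.

```python
class PolicyStore:
    """In-memory policy and data, as described above; nothing is persisted."""
    def __init__(self):
        self.revision = None
        self.bundle = None

def poll_once(store, fetch):
    """Fetch the latest bundle and activate it only if the revision changed.

    `fetch` is a hypothetical callable: it takes the currently known revision
    and returns (revision, bundle), with bundle None when nothing changed.
    """
    revision, bundle = fetch(store.revision)
    if bundle is None or revision == store.revision:
        return False
    store.revision, store.bundle = revision, bundle
    return True

def make_fake_server(versions):
    """Stand-in for a web server or storage bucket holding bundle versions."""
    def fetch(known_revision):
        revision, bundle = versions[-1]
        if revision == known_revision:
            return revision, None
        return revision, bundle
    return fetch

versions = [("rev-1", {"allow_admins": True})]
store = PolicyStore()
fetch = make_fake_server(versions)

first = poll_once(store, fetch)     # new revision is activated
second = poll_once(store, fetch)    # nothing changed, nothing to do
versions.append(("rev-2", {"allow_admins": False}))
third = poll_once(store, fetch)     # converges on the newer revision
```

In a real deployment the loop would run on a timer, and each OPA instance would converge independently, which is why no shared back end is needed.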

CRAIG BOX: That said, if I am an administrator, and I want to push a change to policy out, I like having a centralized place so I know that all requests have to go through this one checkpoint, and I can guarantee if I disallow user access, that's done immediately. How did we solve that problem in this distributed fashion?

TORIN SANDALL: One of the things that we like about OPA and that we have tons of users tell us that they love is the fact that it doesn't force you to use a particular database or anything like that, right? And so that's what makes it very easy, I think, to deploy. Obviously, you've got all these OPAs running around your infrastructure, inside your applications, and your CI/CD pipelines, you know, on the host level, controlling SSH access. And that begs the question, where do the policies and contexts that make the decisions come from?

And so typically, the way that we see this happening is that people will leverage something like OPA bundles, which are just gzipped tarballs that contain policy and data. And those can be pulled from anywhere. You can pull them from S3. You can build a custom control plane that serves those. And at the end of the day, probably a lot of that information, especially the policies, the rules themselves, are coming from Git.

So a lot of time, the policies are checked into Git, they're treated as code, they're tested, they're reviewed, there's a sign off process. And then as part of your CI, you have that kind of pushed out to, like I said, a Cloud Storage bucket that OPA can pull from. Or you build a control plane that can serve them up.

So there are all kinds of different ways. We really designed OPA to be as flexible as possible when it comes to management. There's all kinds of different ways that you might need to push or pull data, let's say. You might need to push it in synchronously or asynchronously. You might want to pull it in synchronously or asynchronously, depending on your use case. And so we try to accommodate those different patterns just because it is a fairly low level building block, and so we want it to be as applicable as possible.
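To make the "gzipped tarball of policy and data" idea concrete, here is a minimal sketch that packs a Rego file and a `data.json` document into an in-memory `.tar.gz`, roughly the step a CI job might run before uploading to a storage bucket. The `build_bundle` helper, the file path `example/policy.rego`, and the policy text are assumptions for illustration; the Rego string uses the syntax that was current around the time of this episode.

```python
import io
import json
import tarfile

def build_bundle(policies, data):
    """Pack Rego source files and a data.json document into a gzipped tarball."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        files = dict(policies)
        files["data.json"] = json.dumps(data)
        for name, text in sorted(files.items()):
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()

# Hypothetical policy text; only packaged here, not evaluated.
policy = 'package example\n\ndefault allow = false\n\nallow { input.user == "alice" }\n'

bundle_bytes = build_bundle({"example/policy.rego": policy}, {"teams": ["payments"]})

# Reading the archive back shows what an OPA instance would pull down.
with tarfile.open(fileobj=io.BytesIO(bundle_bytes), mode="r:gz") as archive:
    bundle_files = sorted(archive.getnames())
```

Because the bundle is just a file, the review, testing, and sign-off steps mentioned above can all happen in Git before the archive is ever published.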

ADAM GLICK: I understand how that works for code that you write, and you embed it within your code, and certainly pulling the policies out from some centralized repository, be that a storage bucket or somewhere else. What about for services where you don't directly have access to the code? So for instance, controlling access to network connectivity, access to databases, who can communicate with what other services, things like that.

TIM HINRICHS: I think one of the things that we've seen, especially in this Cloud Native space, is that more and more pieces of software, like Kubernetes, like service meshes, they're creating these authorization or policy webhooks. And Kubernetes is a great example there, where with the Admission Control component within Kubernetes, which is part of the API server, every time a new request comes into Kubernetes, you can just configure Kube to go ahead and send that request off to some external system to actually make those authorization policy decisions. So that just comes out of, I think, the sort of mentality in this Cloud Native space, which is that we're building software in a very modular manner. And so it makes sense to have these webhooks that allow users to be able to customize that software in interesting ways.

And so I think that we're seeing that more and more, that more and more pieces of software are just creating these external webhooks that allow us to plug OPA in. And so I think that that's what we see with certainly Kubernetes and Envoy, and the databases are starting to do that too. So those are the places that we certainly start with.

And then what sometimes happens, and we're starting to see this more and more now, is that the folks who are responsible for a piece of software have just started to realize how powerful OPA is and how popular it is. And so they started actually adding support for OPA themselves. And so I think that that's sort of how we're seeing the integration points for OPA expand now over time. You know, starting with webhooks, and then users put them in, and then those users figured out how useful and powerful it is to have that unified way of solving policy. And now more of the folks who own the software are starting to add integrations as well.

TORIN SANDALL: I think that one important thing to just kind of stress that isn't always obvious for new users when they come to OPA for the first time, the typical kind of adoption journey for a user coming to OPA is that they have some problem that they have to solve, right? Like it's their job. They need to solve some problem. They've got to stop workloads from being deployed with bad labels, or they have to put API authorization in place over microservice APIs that their teams are running, or something like that.

And so a lot of the time people come to OPA with that particular problem in mind. And they don't always realize that OPA itself is not coupled to Kubernetes or microservice APIs or databases or anything like that. It's completely decoupled in the sense that you can send it arbitrary JSON attributes, right?

And then your policies are what give meaning to those attributes and make decisions over them, and ultimately send the answer back to your software to be enforced. So OPA itself is completely decoupled, which means it becomes really easy to plug it into new software that needs authorization policy and so on.

CRAIG BOX: A place that a lot of people are going to come across OPA for the first time is when trying to configure policies for various Kubernetes objects. Two things I want to dig into. First of all, Kubernetes has some policy built in. Network policy, for example. It has some policies you can set on things like limits on containers and so on. These things that were implemented inside Kubernetes, do you think that there will be eventually a move to have them be all externalized in the sense that you mentioned before, that you should separate policy decisions from the code that runs them?

TORIN SANDALL: I think over time, sure, that would be great. I think that there are obviously certain things that are built into Kubernetes today, and they were built in for good reasons at the time. Probably they can be externalized over time. What we used to do when we were looking for use cases was we'd go on GitHub, and we'd go to a project, and we'd type the word policy into the issue tracker. And we'd see what came out.

CRAIG BOX: Everyone wants this. We should build it.

TORIN SANDALL: Yeah. Yeah. So you go into Kubernetes and you type the word policy and lo and behold, there are a lot of issues, right? Or there were a lot of issues a few years ago. And so that's what kind of motivated a lot of effort we put into supporting Admission Control use cases in Kubernetes.

There are obviously things like quota, for example, where OPA's not necessarily an amazing fit right now. And that's built into Kubernetes for good reasons. But I think a lot of the Admission Controllers that are in there maybe don't have to be in there. But I think that people recognize that at a certain point.

When we started doing Admission Control with OPA, there was no webhook in the API server. It didn't exist. And so we did some demos where we showed, oh, here's how you can compile OPA into the API server and have this extensible policy system. And people were like, that's really cool. But we don't want to necessarily settle on this exact implementation. And so that's sort of where some of the webhook stuff kind of came from, I think.

And so before the validating and mutating webhooks became standard, I think there was definitely this need to compile things into the API server. Like every single release, there'd be one or two or three or maybe a few more Admission Controllers that had to get compiled in. And that was just a pain.

So I think over time, people recognized that that wasn't the best way to do it. And it was especially a pain if you're like an admin and you just want to enforce some custom policy, right? Like, do you really want to recompile the API server in order to say that this app team has to put this label on their resources? Probably not. Right. And if you're a vendor, you know, that's doing a Kubernetes distribution, it's a similar story there as well, I think.

CRAIG BOX: A newer integration between OPA and Kubernetes is the Gatekeeper project. How does that work?

TORIN SANDALL: The way Gatekeeper works is that it plugs, effectively, OPA into Admission Control. So like we talked about, there are a whole bunch of different kinds of policies within Kubernetes: quota, network policy. Admission Control is where Gatekeeper certainly lives, the idea there being that on every single request, a new pod, a new Ingress, whatever it is, it goes into Kubernetes from the outside. It goes through the API server. It gets handed by Admission Control over to Gatekeeper. And then Gatekeeper makes a decision whether this should be allowed or not.

So Gatekeeper takes that same story we told just a bit ago around OPA, but it adds a number of pretty powerful features. First of which, it adds basically CRDs that allow you to manage and control the policies that OPA/Gatekeeper enforce through the normal means that we use to configure Kubernetes pods and Ingress, right? They're just CRDs that are policy CRDs, and you put them into Kubernetes.

It's also got a policy library, a growing policy library, that sort of intends to make it relatively straightforward for newcomers to OPA to go ahead and put common policies in place without having to write them from scratch. We just have learned over time that there are certain policies, like ensuring you don't have Ingress conflicts, or ensuring that all images come from a trusted registry. And so making sure that those are available out of the box certainly smooths the uptake of using OPA and Gatekeeper and enabling people to write custom Admission Control policies.

And then there's also some audit functionality, as well, which is sort of looking at the current state of the cluster and, like we talked about at the very beginning, sort of monitoring and identifying all those resources on the cluster that violate policy.
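The admission flow described above can be sketched as a function that receives a Kubernetes `AdmissionReview` and returns an allow or deny response. This is a Python stand-in for illustration only: it is not Gatekeeper code and not Rego, and the registry name `registry.example.com/` and the `pod_review` helper are hypothetical. It implements one of the common policies mentioned, requiring all images to come from a trusted registry.

```python
TRUSTED_REGISTRY = "registry.example.com/"   # hypothetical trusted registry

def review(admission_review):
    """Return an AdmissionReview response that denies untrusted images."""
    request = admission_review["request"]
    containers = request["object"]["spec"].get("containers", [])
    bad = [c["image"] for c in containers
           if not c["image"].startswith(TRUSTED_REGISTRY)]
    # The response must echo the request uid so the API server can match it.
    response = {"uid": request["uid"], "allowed": not bad}
    if bad:
        response["status"] = {"message": "untrusted images: " + ", ".join(bad)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }

def pod_review(uid, images):
    """Build a minimal AdmissionReview for a pod (hypothetical test helper)."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "request": {
            "uid": uid,
            "operation": "CREATE",
            "object": {
                "kind": "Pod",
                "spec": {"containers": [{"name": f"c{i}", "image": img}
                                        for i, img in enumerate(images)]},
            },
        },
    }

good = review(pod_review("1", ["registry.example.com/app:1.0"]))
bad = review(pod_review("2", ["registry.example.com/app:1.0", "docker.io/evil:latest"]))
```

The audit functionality mentioned above amounts to running the same check over objects that already exist in the cluster instead of over incoming requests.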

CRAIG BOX: Admission control applies to when things are being persisted, like a pod being created causes its resource to be stored, with the data in etcd. If I'm making a change to an object, does it also go through the same Admission Control process?

TORIN SANDALL: Create, update, delete. Yeah. Create, update, delete all go through Admission Control. Read stops a bit earlier, at authorization, and doesn't go through Admission Control.

TIM HINRICHS: Admission Control is like the fundamental way you enforce just basic semantic validation over your Kubernetes resources and all kinds of policies, obviously. So yeah, it's just this fundamental mechanism in Kubernetes.

TORIN SANDALL: The other thing to think about there is the right way to think about admission is that those same 100 or 200 lines of YAML that we as people are giving over to Kubernetes to describe our pod or our Ingress, that same 100 lines of YAML is sent to the Admission Controller. And so that Admission Controller has full visibility into all the different details of how you configure that pod or that Ingress.

Authorization is quite a bit different. It's the stage before admission where the only information you get is-- you don't get that 100 lines of JSON. You just know that Tim is trying to create a pod in this namespace, and so then there's a yes/no decision that has to be made there. But assuming that that authorization passes and it's a create, update, or delete, then it goes ahead and Admission Control takes over and says, OK, now here's the full extent of this resource. Now go ahead and make a decision. Is this safe to let onto the cluster or is it not?
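A toy model of the two stages may help separate them. The rules, user names, and the `owner` label requirement below are all hypothetical; the sketch only mirrors the ordering described in the conversation: authorization decides on who, verb, resource, and namespace, and admission runs afterwards, only for create, update, and delete, with the full object in hand.

```python
MUTATING_VERBS = {"create", "update", "delete"}

def authorize(user, verb, resource, namespace):
    """Authorization sees only who is doing what, where; not the object body."""
    # Hypothetical rule set for the example.
    allowed = {("tim", "create", "pods", "dev"), ("tim", "get", "pods", "dev")}
    return (user, verb, resource, namespace) in allowed

def admit(obj):
    """Admission sees the full object, so it can check any field."""
    return "owner" in obj.get("metadata", {}).get("labels", {})

def handle(user, verb, resource, namespace, obj=None):
    """Run authorization first, then admission for mutating requests only."""
    if not authorize(user, verb, resource, namespace):
        return "forbidden"
    if verb in MUTATING_VERBS and not admit(obj or {}):
        return "rejected by admission"
    return "ok"

read_result = handle("tim", "get", "pods", "dev")
unlabeled = handle("tim", "create", "pods", "dev", {"metadata": {"labels": {}}})
labeled = handle("tim", "create", "pods", "dev",
                 {"metadata": {"labels": {"owner": "payments"}}})
stranger = handle("eve", "create", "pods", "dev",
                  {"metadata": {"labels": {"owner": "payments"}}})
```

Note that the read request never reaches `admit`, and the unauthorized request is refused before its object body is ever inspected.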

ADAM GLICK: What companies are currently contributing to the OPA project?

TIM HINRICHS: Today we have sort of the three main companies contributing to OPA. There's obviously Styra, us, but Microsoft and Google are also active contributors to the project and to Gatekeeper. And that kind of came out of discussions that we were all having almost a year and a half ago now about these exact problems, about the need for better policy support within Kubernetes.

And so in late 2018, we kind of got together and decided it was time to have a joint project around Gatekeeper. And so we took that and launched it in early 2019.

CRAIG BOX: Have you considered asking if your cat has any skills that it might bring to the project?

TIM HINRICHS: Yeah, she might have done a bad deployment or something, and she's just like really angry right now. I'm not sure exactly why she's so upset. Or maybe it was some bad Rego or something that she saw. Not very good.

ADAM GLICK: You mentioned that Google and Microsoft are contributors in the project. Does that mean that integration with Active Directory is coming?

TIM HINRICHS: One thing that OPA does is that it focuses on policy. It focuses on the decisions that are being made. What it doesn't do is anything around authentication, like who are you. Like that's an input to OPA, and so if you think about Active Directory, often the way we think about it and the way we hear users thinking about it is that Active Directory does obviously authentication, but it also stores group membership and maybe claims or entitlements. You could think about it all add in.

And all of those, from OPA's perspective, are simply inputs to the policy itself to make a decision. So if you're in the engineering group, then you have certain permissions. And so well, how do you know who's in the engineering group? Well, that content is stored in AD, and typically what we see people do is, before you're making a policy decision, when that end user authenticates, it goes ahead and gets all the group membership stuff. It shoves it into a JWT token. And then that JWT token is what's handed over to OPA as part of its input. So in that sense, there already is an integration with AD. It's just that it's not explicit.
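The pattern of carrying group membership in a token and handing it to the policy as input can be sketched as follows. The token here is built locally and unsigned, the claim name `groups` is an assumption, and `decode_claims` deliberately skips signature verification, so this is an illustration of the data flow only and not something to use for real authentication. The `allow` function is a Python stand-in for a policy.

```python
import base64
import json

def b64url(data):
    """Base64url-encode bytes without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def decode_claims(token):
    """Decode the JWT payload WITHOUT verifying the signature (illustration only)."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

def allow(input_doc):
    """Stand-in for a policy: members of the engineering group may read."""
    groups = input_doc["claims"].get("groups", [])
    return input_doc["method"] == "GET" and "engineering" in groups

# Hypothetical token, as an identity provider might issue after login.
header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
body = b64url(json.dumps({"sub": "alice", "groups": ["engineering"]}).encode())
token = f"{header}.{body}."

input_doc = {"method": "GET", "claims": decode_claims(token)}
decision = allow(input_doc)
```

The policy never talks to the directory itself; it only sees whatever claims the authentication step chose to put in the token.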

ADAM GLICK: So what comes next for OPA?

TORIN SANDALL: Right now the community is growing very quickly. There are lots of folks that are kind of jumping on and getting involved. And so one of the things that we're hoping to do is sort of seed more sub-projects, more projects that integrate with other pieces of software going forward. That's something that we're actively looking at.

On the sort of core of the project, one of the neater things that I think that we've released and that we're supporting now is the ability to take OPA policies and compile them into WebAssembly. And that support for compiling the policies into WebAssembly is something that is new, but it's going to open up a lot of new opportunities around integration of OPA policies into different software systems. So things like content delivery networks, service proxies, like Envoy, even databases like Postgres, and so on.

All these different pieces of infrastructure are adding these extension points via WebAssembly now, and so we've decided to target that. And so we're kind of building out the core feature set around that area. So that's one of the newer things that we've been working on lately, and we'll continue working on that for the near future.

Yeah, we're kind of always looking for new integrations. That's what OPA is. It just loves new integrations because it's so flexible. So hopefully more and more OPA sub-projects in the near future.

CRAIG BOX: Now finally, all your example policies on the OPA website refer to hooli.com, which obviously pegs you as "Silicon Valley" watchers. Which one of you is Dinesh and which one of you is Gilfoyle?

TORIN SANDALL: [LAUGHING] So Tim has a confession to make.

TIM HINRICHS: I have not seen any of "Silicon Valley," but for good reason. I'm a little worried it's going to hit too close to home, right?

CRAIG BOX: Everybody says that.

TIM HINRICHS: Yes, right. I'm in the startup world. I can't imagine enjoying watching a show about the pitfalls of running a startup. It's sort of like kids. I don't enjoy movies of parents chasing after a couple of kids that are my kids' ages. It's just not fun because that's real life. And so I'm explicitly saving "Silicon Valley" until a point at which I don't feel so close to it.

CRAIG BOX: So Torin, without spoiling anything for Tim -- is he Dinesh or Gilfoyle?

TORIN SANDALL: He's more of a Richard Hendricks, I think. Yeah. Maybe a little bit more charismatic.

CRAIG BOX: Thank you both so much for joining us today. It's been fun.

TIM HINRICHS: Thanks for having us.

TORIN SANDALL: Thanks for having us.

CRAIG BOX: You can find Torin on Twitter @sometorin and Tim @tlhinrichs. You can find OPA at OpenPolicyAgent.org.

[MUSIC PLAYING]

ADAM GLICK: Thank you for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes. Until next time, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]
