Kubernetes Podcast from Google: Episode 103 - CSI: Storage, with Saad Ali

#103 May 12, 2020

CSI: Storage, with Saad Ali

Hosts: Craig Box, Adam Glick

More gripping than a crime scene in Las Vegas, the Container Storage Interface (CSI) lets vendors interface with Kubernetes. Saad Ali from Google led development of Kubernetes storage, including the CSI and volume subsystem. He joins hosts Adam and Craig for an in-depth look at how storage works in Kubernetes.

Chatter of the week

Adam’s puzzle
How they made The Mandalorian
- Unreal Engine: Project Spotlight
Fraggle Rock: Rock On!
Lockdown music videos:
- Crowded House: Something So Strong
- Mostar Diving Club: Quiet Hands

News of the week

IBM Cloud Satellite
Google Cloud Buildpacks
Anthos for app modernisation via CI/CD and transforming legacy Java applications
Azure Container Registry adds dedicated data endpoints
Amazon ECR: multi-architecture containers
Amazon CloudWatch adds Prometheus metrics
run:AI creates fractional GPU sharing for Kubernetes
The State of Cloud Native Development: CNCF survey (PDF)
VMware’s State of Kubernetes 2020 (PDF)
Gatekeeper Policy Manager from SIGHUP
- Episode 101, with Tim Hinrichs and Torin Sandall
DataStax Astra on GCP and Sam Ramji’s blog
- Episode 98 with Sam Ramji
Introducing PodTopologySpread by Aldo Culquicondor and Wei Huang
Pod Security Policies at Square by Jason Price
Introduction to OpenTelemetry by Ran Ribenzaft
- Episode 97, with Yuri Shkuro
Kubernetes and Istio on the F-16 jet: CNCF case study
GKE logging introduction by Charles Baer and Xiang Shen
Helm and Kustomize, better together
- Helm, with Matt Butcher
- Kustomize, with Phillip Wittrock

Links from the interview

Transcript

ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: For the person that wrote in and asked how am I doing on the puzzle, I am pleased to say that I've finished all of the puzzle except for one piece. As it turns out, I am missing or have lost one piece.

CRAIG BOX: Oh, for shame.

ADAM GLICK: I've got all three layers. And unfortunately, it's not even a piece that sits beneath one of the other layers so it'd just be hidden. It just sits right there in the corner. You're just like, aw. But yeah, I did finish it.

CRAIG BOX: Is it at least surrounded by cut-out other pieces such that you could 3D print something to fit in the space?

ADAM GLICK: You could. And since it's a piece that's actually just one solid color, I have thought about just taking a piece of colored construction paper and just sliding it underneath the puzzle so it kind of looks like it's just there, even if it's not.

CRAIG BOX: If it was a "South Park" puzzle, that would be perfect.

ADAM GLICK: [LAUGHS] And since that's been done, I've been able to dive into "The Mandalorian." I realize that I am late, but everyone spoke so well of it. Finally getting a chance to see it. I'm a couple episodes in. And I must say, for anyone that is late to it like me, great show. Just absolutely fantastic. I'm loving it.

CRAIG BOX: If you're interested, check out the behind the scenes of how "The Mandalorian" was filmed. It's not explicitly them saying, this is us filming the show. But there's some technology demos from the people who make the system that they used to film it. They're working in front of a screen that's effectively running the Unreal Engine. When they're moving the camera around, the field of view behind is changing to make it feel like you're actually on the set, but it's all done in a basement in Hollywood.

ADAM GLICK: That would be awesome to see.

CRAIG BOX: We'll link it in the show notes. But it fills me with hope that they could keep filming season two while still under lockdown. Because basically, you just need to act out in front of the screen. Everyone could be in their own rooms around the world doing this.

A few things that have caught my eye over the last couple of weeks, the new "Fraggle Rock" shorts on Apple TV are fantastic for anyone who, like myself, has a special place in their heart for the original TV show. Very true to form, especially given what they can do in the lockdown situation.

And a few videos of bands who are all performing in different places and emailing their files and videos around to give something which is actually quite good performances. There's a couple of my favorite bands. I'll link a couple of them in the show notes. But there have been a lot of them out there.

ADAM GLICK: That was a grade A segue to go from "The Mandalorian" to "Fraggle Rock." And I'm going to try my own going from this to the news of the week.

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: IBM had their Think Digital event last week, where they announced a preview of IBM Cloud Satellite, their hybrid cloud service. Satellite extends IBM Cloud into remote locations using Red Hat technology. It promises a single place to manage your deployments from on-prem to edge, and a service mesh based on Istio to tie them all together. This follows a common trend for Kubernetes providers of extending their offerings across environments. The platform is due to launch later this year.

CRAIG BOX: Google Cloud has released buildpacks to run on services that take containers as input, including Cloud Run, GKE, and Anthos. Using buildpacks, developers deploy their applications without having to describe how to transform their code into containers. These buildpacks are also used as the build system for Google App Engine and Cloud Functions and are 100% compatible with the CNCF buildpacks spec.

ADAM GLICK: Google Cloud has announced two new solutions for the Anthos platform. The first of these is a continuous integration and deployment solution created and released with partner GitLab. The second is for Java application modernization. Google claims it's easy to migrate existing Java applications onto Anthos to help organizations save money and increase agility. Both solutions are now available.

CRAIG BOX: Some recent updates were made to hosted container registries at Microsoft and Amazon. Azure has added dedicated endpoints for its shared container registry, which allows you to allow access to a single URL, as opposed to having to allow access to all of Azure storage. AWS has added support for multi-architecture container images, which lets you use the same name for the same image when built for different architectures.

ADAM GLICK: Amazon also added Prometheus support to CloudWatch Container Insights, allowing you to ingest metrics from EKS and Kubernetes on EC2. ECS and Fargate are said to be coming soon.

CRAIG BOX: Israeli machine learning company Run:AI has announced fractional GPU sharing for Kubernetes as a new feature of their platform. Kubernetes can only allocate whole physical GPUs to containers. So Run:AI's fractional system creates virtualized logical GPUs, which means several containers can use and access a single accelerator. This allows you to run more lightweight AI tasks on the same hardware with no changes to the containers themselves.

ADAM GLICK: The CNCF has published a report on the state of cloud-native development. According to their vendor SlashData, cloud-native developers are more likely to run their code in the cloud, preferring the three major cloud vendors. Most industries prefer to run in the cloud, with the exception of financial services and health care.

Additionally, it may surprise you to learn that 40% of containers are running outside of Kubernetes, with 21% on a different orchestrator. ECS, ACS, and Docker Swarm were the major players in that Other segment.

Serverless continues to be dominated by the three major cloud vendors, with AWS and Google Cloud leading Azure in awareness and usage. The report also calls out that developers who are using Kubernetes are more likely to be involved in the platform decision for what their organization uses, with 71% reporting that they were a part of the organization's platform buying decision.

CRAIG BOX: VMware also released an annual report on the state of Kubernetes for 2020. They suggest that the largest segment of Kubernetes use is on-premises. But if you sum up one cloud and more than one cloud, the data actually suggests that cloud is more popular. The top benefits of running Kubernetes were improved resource utilization, shorter software development cycles, and the ability to containerize monolithic applications.

Similarly to the CNCF report, VMware's audience of developers are often at the table when deciding on the Kubernetes platform and say that access to infrastructure is their biggest challenge. Managers cited integration as their top concern, and listed a lack of experience and expertise as their top challenge.

ADAM GLICK: SIGHUP has announced the Gatekeeper Policy Manager, or GPM, an open-source and web-based tool to see your Open Policy Agent, or OPA, policies deployed on your cluster and their current status. GPM, which definitely won't be confused with Microsoft's Group Policy Management, also lets you review and edit your policies in Rego, the language of the OPA. Additionally, if you want to learn more about the OPA, check out episode 101.

CRAIG BOX: On episode 98 with Sam Ramji from DataStax, we discussed Astra, the hosted Cassandra platform running on Kubernetes. As of today, Astra is now generally available in nine GCP regions worldwide with a 10-gigabyte free tier available. You can install it from the Google Cloud Marketplace.

ADAM GLICK: Google's Aldo Culquicondor and IBM's Wei Huang have posted about the PodTopologySpread scheduling plugin for Kubernetes that went beta in 1.18. The feature aims to help spread your pods evenly across your cluster to increase availability. The blog goes into details on specific use cases, such as doing a spread but limiting to a particular environment, like production or QA. They also cover the case of making it aware of zones as well as nodes.

CRAIG BOX: Jason Price from payment company Square has written the missing manual for pod security policies on Kubernetes. His team's perspective is to start with everything locked down and grant exceptions. For example, the log exporter will need to access the host logs, and he provides examples that do exactly that. Given that they are a financial services company, the examples are very auditor-friendly.

ADAM GLICK: Ran Ribenzaft from observability vendor Epsagon shared an intro to OpenTelemetry to help introduce people to the project and make the case for its usage. The blog talks about the origins of the project at Google, as well as why a standards-based approach with SDKs for many common languages plus metrics, tracing, collectors, and auto-instrumentation, all available in the open source ecosystem, is the right choice for organizations to adopt. If you're interested in learning more about OpenTelemetry and OpenTracing, check out episode 97, where we talked to Yuri Shkuro from the Jaeger project.

CRAIG BOX: The US Air Force's chief software officer, Nicolas Chaillan, gave a presentation at KubeCon North America about using Kubernetes and Istio on F-16 jets. And this has been summarized in the CNCF case study and video this week. Chaillan says that he estimates 100-plus years have been saved across the Department of Defense by the DevSecOps platform so far.

ADAM GLICK: Charles Baer and Xiang Shen from Google Cloud have posted a deep dive on logging with GKE. Logging is enabled by default, and the blog dives into what information is logged, where those logs are stored, and how you can search those logs, and even how to export those logs to more powerful data mining tools to give you better visibility and understanding of what is happening in your clusters. Additionally, they've provided a link to a mailing list so you can join to stay informed and learn more about GKE's logging tools.

CRAIG BOX: Finally, last week, Matt Butcher said that Helm and Operators can peacefully co-exist. This week, longtime listener Povilas Versockas has written about how Helm and Kustomize can complement each other. His approach is to expand Helm templates locally to make sure you're not blindly installing things into your cluster, and then use Kustomize to patch the YAML locally before deploying. You can then commit to Git, and use Flux or ArgoCD to use all the possible deployment tools in one giant sandwich.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Saad Ali is a staff software engineer at Google and a member of the CNCF Technical Oversight Committee. He led development of the Kubernetes storage and volume subsystem, and serves as co-chair of the Kubernetes Storage SIG. He's co-author and maintainer of the Container Storage Interface. Welcome to the show, Saad.

SAAD ALI: Hi, Adam. Hi, Craig. Thank you for having me. I'm a big fan, so excited to be here.

CRAIG BOX: Thank you. Aren't you a bit too young for storage? There's no gray in that beard.

SAAD ALI: [LAUGHTER] Yes, definitely. I think this is a good point of differentiation between the data plane and the control plane for storage. The data plane, you've had folks working on it for decades now, and I am definitely not a gray beard, and I do not consider myself a storage guy, honestly, because I know very little about the data plane.

The area that I've been focused on is the control plane, or automation around making storage available to your container workload. And so that's been my focus since I joined the Kubernetes team.

CRAIG BOX: What background did you have that led this to be an interesting challenge that you wanted to address?

SAAD ALI: It was a little bit random. I started off at Microsoft working on Hotmail, if you've ever heard of that.

CRAIG BOX: HTML.

SAAD ALI: That's right. I worked on ActiveSync and IMAP. I created the first IMAP implementation for Hotmail, which was very late, but it was a lot of fun doing. So I had experience building distributed services and working on distributed services. So that was my background.

And then when I joined Google, I joined a different team, actually. This was early 2014. And it was a fun team to be on. It was research and development in G Suite. And the mandate was go out and figure out what businesses need next, and go and invent and have fun. And that was great. It was a lot of fun.

That lasted about 11 months. And the cool thing about Google is you get hired for the company, not for a specific team, so you're free to move around. And my manager at the time was helping me look around Google to find a new team to work on.

And I spoke with a number of teams, I think upwards of 11 teams. It was like being a kid in a candy shop. You're looking at all of Google and trying to figure out what you want to work on. But my manager was friends with this guy named Tim Hockin.

CRAIG BOX: Never heard of him.

[LAUGHTER]

SAAD ALI: He'd been working on a project. I think it had been about six months on this Kubernetes thing. And I sat down with him and talked about it. And it just clicked for me, because I was like, oh, my god, if I had this back in my Hotmail days, it would have simplified so much of the work that I had to do. This really makes sense to me. I want to be involved with this.

And so even at that point, I wasn't 100% sure this was what I wanted to work on. But I really like Tim, and the project sounded really cool. So I rolled the dice and said let's do it. So this is about December of 2014, about six months prior to the 1.0 release. So I just was very lucky to get involved at the right time.

CRAIG BOX: And you can, of course, hear all about Tim's Kubernetes journey in episode 41.

ADAM GLICK: You were part of Kubernetes pretty early on. And back then, there was a lot of talk about what were the use cases of Kubernetes, how would people use it? It was a lot about stateless microservices.

And a lot has changed over the period of time from then until now, including in the storage space, of a lot of ways that you hear people talk about running stateful services. There's lots of things we'll chat about in terms of what's evolved in the interfaces space on that. But what caused all those changes? Where did it start from, and then what's driven that evolution?

SAAD ALI: When Kubernetes started, "pets versus cattle" was the catchphrase, right? You want to be able to treat your workloads basically like cattle, where you don't really care where they are running. And your container is going to be able to run, it will get terminated, it will get rescheduled to a different node, and it'll get started again. You don't have to babysit it. If it dies for whatever reason, Kubernetes will automatically bring it up.

But the inherent nature of containers is that they're ephemeral. So the file system goes away as soon as the container is terminated. And what we are talking about with Kubernetes is not just, hey, we're going to run your stateless workloads, but we're going to run all your workloads.

Being able to run just your stateless workloads is not very useful. You can imagine I've got a shopping cart, or something like a profile for my website. If I visit a website and I come back and that is gone, I'm going to be very upset. We need to be able to persist that data somewhere so even if the container running your shopping cart moves to a different machine, it has access to that data.

And so that's inevitable, I think, is you start with a stateless and then you evolve towards figuring out how stateful workloads were going to work on this.

CRAIG BOX: What was the state of storage in Docker at the time you started working on Kubernetes?

SAAD ALI: Docker was very much focused on containerizing a binary on a single machine. And so the volume and storage interface for Docker was very much focused on that, as well. It was, how do you basically take a given block device or given filer and expose it into a container?

And if you have just a single machine and you're manually scheduling your Docker containers, that works for you. But when you start talking about a cluster orchestration system like Kubernetes that dynamically schedules your workloads across a number of machines, that subsystem no longer works. How do you ensure that your persistent storage is available inside of your Docker container wherever that container is scheduled?

ADAM GLICK: What's the next step that you took? If you think about for all of us that use Docker containers, you just map it to a local directory. But when Kubernetes is deciding what pod that's going to be deployed in, what's on that node, what happens when it decides to put it on another node? What happens to my storage, and how does Kubernetes solve that?

SAAD ALI: Ultimately what Kubernetes does is exactly that. It says, OK, I need to make some sort of persistent storage available at a specific path inside the container. And as long as the application inside that container writes to that path, the data will be persisted outside the container. So the container can die, go away, come back, and the same path is always populated with some persistent external storage.

And what Kubernetes did was basically automate the mechanism around this so that regardless of where your workload is scheduled, it will take care of making the persistent storage available within your container. And so this involves multiple steps along the way, including attaching a disk to the appropriate node, being able to do a global mount and a mount specific to the pod, and then having that mount surfaced up into the container. That's basically where Kubernetes automation around storage started just to make sure that as these workloads moved from node to node to node, the persistent storage followed.
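As a rough illustration of the sequence described here (attach, global mount, pod-specific mount, then surfacing the mount into the container), the following sketch simulates the bookkeeping in plain Python. Nothing in it touches a real disk or the Kubernetes API; the `Node` class, the `make_volume_available` function, and the `/mnt/...` paths are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """In-memory stand-in for a cluster node (hypothetical, for illustration)."""
    name: str
    attached: set = field(default_factory=set)         # volumes attached to this node
    global_mounts: dict = field(default_factory=dict)  # volume -> node-wide mount path
    pod_mounts: dict = field(default_factory=dict)     # (pod, volume) -> pod-specific path


def make_volume_available(node, pod, volume, container_path):
    """Perform the steps in order and return a log of the ones that ran."""
    steps = []
    # Step 1: attach the disk to the node, once per node.
    if volume not in node.attached:
        node.attached.add(volume)
        steps.append(f"attach {volume} to {node.name}")
    # Step 2: mount it once at a node-wide ("global") path.
    if volume not in node.global_mounts:
        node.global_mounts[volume] = f"/mnt/global/{volume}"
        steps.append(f"global mount at {node.global_mounts[volume]}")
    # Step 3: create a mount specific to this pod.
    pod_path = f"/mnt/pods/{pod}/{volume}"
    node.pod_mounts[(pod, volume)] = pod_path
    steps.append(f"pod mount at {pod_path}")
    # Step 4: surface the pod mount inside the container.
    steps.append(f"expose {pod_path} in container at {container_path}")
    return steps


node_a = Node("node-a")
first = make_volume_available(node_a, "cart-0", "disk-1", "/data")
second = make_volume_available(node_a, "cart-1", "disk-1", "/data")
print(first)
print(second)
```

In this simulation the first pod on a node triggers all four steps, while a second pod using the same volume on that node only needs the pod-specific mount and the container exposure. If the pod moves to a different `Node`, the full sequence runs again there, which is the "storage follows the workload" behavior described above.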

And then the evolution of Kubernetes storage was off of that. And the next step was, well, this is nice, but wouldn't it be cool if my workloads could also request that storage be created on demand, instead of some cluster administrator having to go create storage for you ahead of time, which was the pattern forever before that.

CRAIG BOX: Given that Kubernetes was born in the cloud and that, with a cloud provider, you can just say to some magic API, please give me some storage and attach it to the machine that I'm on, we then start getting to a world where we want to map to previously existing things. We want to map to NFS systems. We want to map to concepts that we're more familiar with from a white-glove, gray-beard data plane kind of world. How did Kubernetes evolve from one use case to the other?

SAAD ALI: The big step was the volume plugin interface. What this allowed you to do was be able to plug in an arbitrary block or file storage system into Kubernetes. So if you had some sort of NFS filer, if you had some iSCSI LUNs, you could use those just as you could use a GCE persistent disk on cloud or an EBS disk on Amazon. You could reference them from your workload directly in your pod in line and say this is the exact volume that I want to use.

And that API slowly expanded to, one, become more portable, so that we abstract away storage such that an application developer doesn't have to worry about the underlying implementation details of a specific storage system, but also more powerful, so that it encompasses more and more use cases. I talked about dynamic provisioning a little bit, but it's expanded beyond that to things like volume snapshotting and volume resizing.

And so the Kubernetes storage API grew out of that. But the first big step was the volume plugin interface, which was initially completely in-tree, meaning all the code was compiled and built into the core of Kubernetes. And there were a number of drawbacks for that.

CRAIG BOX: There are a number of things in Kubernetes that have been deprecated by the passage of time. What was a flexVolume?

SAAD ALI: Flex was actually a first attempt at extending the Kubernetes storage interface. So what we realized really quickly was the way that we were doing volume plugins in-tree was unsustainable. We had probably 8 to 10 volume plugins. And what we were quickly realizing was that a lot of them were actually not being tested. For example, fibre channel was a volume plugin that exists in-tree, but how do you test fibre channel without having fibre channel hardware?

And so what ended up happening for a lot of these volume plugins was we were depending on users to report issues after a release and then patching them, versus shipping a release and knowing that a volume plugin was going to be working well. So testing was an issue.

Security was another issue. Because these volume plugins were compiled into core Kubernetes binaries like kubelet and kube-controller-manager, if there was a bug in one of these volume plugins, it would actually crash your entire binary that is responsible for deploying your pods or for scheduling your workloads, things like that. So that was a bad approach. So security was a big problem.

And then extensibility in general was just painful. Because if you were a third-party storage vendor and you wanted to integrate with Kubernetes, it meant you had to commit code to the core of Kubernetes, which can be a very daunting process. You have to align yourself with the releases that we had, which were quarterly.

If you miss a release, you had to wait for the next quarter and deal with all the, hey, a test that is completely unrelated to me broke. What do I do? And then you're maintaining core Kubernetes code just because you care about one little piece, which is your extension to the storage system.

So all of those problems existed, and what we decided was we want to have some easier mechanism for a third-party block or file storage system to plug into Kubernetes. And flexVolumes was an initial attempt at doing that. FlexVolumes was a naive attempt, because basically what we said is let's just have the driver vendors have an executable or a script that they write with simple operations on what to do for mount, what to do for attach, and deploy those on each of the node machines.

And so whenever Kubernetes has one of these flexVolumes requested, when it comes time to attach or mount, instead of handling that within the binaries, we would look at the local file system on the node machine and call out to the third-party binary.
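To make the call-out model concrete, here is a minimal sketch of what such a vendor-supplied executable might look like, written as a Python function that takes the command-line arguments and returns a JSON reply. The operation names and the `status`/`message` reply shape are modeled loosely on the FlexVolume convention and should be treated as assumptions; the `share` option key is invented for the example, and nothing is actually mounted.

```python
import json


def _reply(status, message=""):
    """Build the JSON reply the caller is assumed to parse."""
    return json.dumps({"status": status, "message": message})


def flex_driver(argv):
    """Handle one call-out; argv mimics the driver's command-line arguments."""
    if not argv:
        return _reply("Failure", "no operation given")
    op, args = argv[0], argv[1:]
    if op == "init":
        return _reply("Success")
    if op == "mount":
        if len(args) < 2:
            return _reply("Failure", "usage: mount <dir> <json options>")
        mount_dir, options = args[0], json.loads(args[1])
        # A real driver would mount the backing storage at mount_dir here.
        share = options.get("share", "unknown")  # "share" is a hypothetical option
        return _reply("Success", f"would mount {share} at {mount_dir}")
    if op == "unmount":
        if len(args) < 1:
            return _reply("Failure", "usage: unmount <dir>")
        return _reply("Success", f"would unmount {args[0]}")
    # Anything else (for example attach) is declared unsupported.
    return _reply("Not supported", f"operation {op!r} is not implemented")


print(flex_driver(["init"]))
print(flex_driver(["mount", "/mnt/example", '{"share": "filer:/exports/a"}']))
print(flex_driver(["attach"]))
```

Because the driver is just a file on the node's local file system, every node needs its own copy, which is the deployment burden discussed next.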

Now, the drawbacks of this approach were two things. One was deployment was very painful. The magic of Kubernetes is you have a Kubernetes interface, you write some YAML, and your workloads are automatically deployed. The resources that they need are automatically managed.

But then we're saying, well, yeah, except if you want to extend Kubernetes to use a specific storage system, please make sure that you have this magic file copied into the root directory of every node in your cluster. You have security issues with that. It's just painful to do.

The second problem with flexVolumes was that it also required that binary not just to be deployed on every single node, but also on the master node if your volume plugin had an attach operation. And a lot of users don't have access to the master. So you talk about GKE, for example. The entire master is hidden away from the end user.

And so Flex was an initial attempt. It just had a number of issues in terms of extensibility.

CRAIG BOX: So it sounds like we need some sort of storage interface for containers, perhaps a Container Storage Interface. That might be a good thing to introduce.

SAAD ALI: [LAUGHS] Precisely, and that is exactly how the Container Storage Interface was born, was a realization that we do want to have some way to build an extensibility layer for Kubernetes, but Flex wasn't fitting the bill. And so there were two aspects of it. One was the technical aspect of it, and the second was, where does all of it fit in?

So when we were starting CSI, there were a number of efforts underway to try and do essentially the same thing. Some were led by storage vendors. There were existing, I think, previous projects. OpenStack had things that people were trying to reuse with Kubernetes.

And we'd reviewed a lot of those to try and see if they would be a good fit. But I think the big differentiator with Kubernetes was dynamic volume provisioning, and nobody really was handling that the way that we wanted it to. And dynamic volume provisioning was basically the idea that just like you request CPU resources or memory resources when you schedule a pod, you should have a way to be able to request storage resources-- I want 60 gigs of ReadWriteMany storage-- and have that made available to me.

And Kubernetes had handled that through the StorageClass interface through the PVC objects. And so we wanted to make sure that whatever interface that we plugged in on the southbound side would enable those use cases. And so in order to handle that, what we realized is we wanted to probably start something from scratch.
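The request described here ("I want 60 gigs of ReadWriteMany storage") maps onto a PersistentVolumeClaim that names a StorageClass. As a small sketch, the following builds such a claim as a Python dictionary and prints it as JSON. The claim name `cart-data` and the StorageClass name `example-shared` are hypothetical; whether a given class can actually provision ReadWriteMany volumes depends on the storage system behind it.

```python
import json


def make_pvc(name, size, access_mode, storage_class):
    """Return a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,  # selects how the volume is provisioned
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names; the size and access mode follow the example in the interview.
pvc = make_pvc("cart-data", "60Gi", "ReadWriteMany", "example-shared")
print(json.dumps(pvc, indent=2))
```

The workload only states what it needs; the StorageClass is what connects the request to a particular provisioner on the cluster side.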

Now, the interesting thing was, at that point, Kubernetes was not the de facto container orchestration system. There were a number of competitors out there. Docker had Docker Swarm. Mesos was doing quite well. Everybody had a cluster orchestration system.

And what we noticed with storage vendors was that they did not want to pick a winner. They had the opportunity to go and build an in-tree volume plugin for Kubernetes, but a lot of them were sitting on the sidelines waiting to decide, am I going to have to build a Docker plugin or a Kubernetes plugin or something else? Let's wait to see where this thing goes.

And so instead of going at it alone and building an interface that was purely for Kubernetes, what we decided was, there is a need that all of the industry has here, the storage vendors as well as the cluster orchestrators. Let's see if we can get consensus and build something together that everybody would agree on.

And so what we did was we reached out to Docker and Mesos and Cloud Foundry and found that they were dealing with a lot of the same issues and were very eager to work with us on a standard. And so February of 2017 is where we all nodded heads and said, OK, let's start building this thing and come up with what it's going to look like.

CRAIG BOX: Is the CSI an API or a spec? How would you define what the actual CSI thing is?

SAAD ALI: CSI is purely a specification of an API.

CRAIG BOX: So it's both.

SAAD ALI: It is an interface. It does not actually dictate packaging. It doesn't dictate how drivers should be developed or deployed. It is strictly a protobuf that says, here are the synchronous methods that you can call against a storage system and what the expectations for each one of those methods is.

So you have very obviously a CreateVolume call, and we dictate the inputs for that CreateVolume call and the outputs. And so as a storage vendor, if you're following just the CSI spec, implementing a gRPC service that implements these methods is sufficient.
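To give a feel for "dictated inputs and outputs", the sketch below models a CreateVolume-style call with plain Python dataclasses and an in-memory store. It is not the real CSI protobuf and does not use gRPC; the class names, the two request fields, and the rule that repeating a request with the same name returns the same volume are simplifications assumed for illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateVolumeRequest:
    name: str            # caller-chosen name for the volume
    required_bytes: int  # minimum capacity being requested


@dataclass(frozen=True)
class Volume:
    volume_id: str       # identifier chosen by the storage system
    capacity_bytes: int


class InMemoryController:
    """Toy storage backend that only remembers what it has handed out."""

    def __init__(self):
        self._by_name = {}

    def create_volume(self, req):
        existing = self._by_name.get(req.name)
        if existing is not None:
            # Repeating a compatible request returns the same volume.
            if existing.capacity_bytes >= req.required_bytes:
                return existing
            raise ValueError(f"volume {req.name!r} already exists with a smaller capacity")
        vol = Volume(volume_id=str(uuid.uuid4()), capacity_bytes=req.required_bytes)
        self._by_name[req.name] = vol
        return vol


controller = InMemoryController()
request = CreateVolumeRequest(name="pvc-example", required_bytes=60 * 2**30)
volume = controller.create_volume(request)
print(volume.capacity_bytes)
```

A vendor's real driver would expose the equivalent method over gRPC and create the volume on its own storage system; the orchestrator only relies on the agreed request and response shapes.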

The reason we wanted to differentiate the specification from how the packaging works, how it's distributed, is the nature of how the CSI spec was formed. We were trying to collaborate across Kubernetes, Docker, Cloud Foundry, Mesos. And each one of those systems was fairly different.

And so to prescribe exactly what distribution would look like or what packaging would look like would mean that they wouldn't necessarily work across all of these systems. And so we said, let's just focus on the interface itself and leave it up to the cluster orchestrators to then dictate what it would look like to actually distribute or integrate it with that storage system.

CRAIG BOX: You mentioned a CreateVolume API call. Is there a distinction in CSI for what a volume actually is? Like, the storage interface is obviously built around the idea that there will be volumes. You can have SANs and NASs and attached disks and fibre channel and SD cards and so on. How can you summarize all of those things up into one primitive?

SAAD ALI: The CreateVolume call only makes one assumption, which is that the returned piece of storage has an isolated capacity. So for example, you can't just return a generic storage pool where somebody else might be writing to it and you are no longer guaranteed the 60 gigabytes or whatever you requested. Beyond that, we make no guarantees or requirements from the storage system. And that has been sufficient for us to get effectively what we want for Kubernetes volumes.

ADAM GLICK: We've talked about a lot of the traditional kinds of storage as people think about it, disks and things that are represented with disks. But storage can be a lot broader. What about things like databases?

SAAD ALI: I think eventually, something needs to write bits to disk. The rubber needs to meet the road somewhere. And where that happens is file and block. And the nice thing about file and block is that the operating system has standardized the data path protocols.

So for file, you have POSIX. And for block, you have block device interfaces that are standardized within the operating system such that your workload does not care about which specific implementation of block or file you're using. It's just going to work. And so given that the data path is standardized, what we could do is focus on standardization of the control path and make the storage that a particular workload needs available anywhere.

So now if you have a database or you have a message queue, or any other stateful workload, it's going to need a filer or a block device to be able to write to, and Kubernetes has a generic, abstracted-away way of being able to make that resource available to your database. And so while databases are definitely storage, the things that we focus on within the Kubernetes Storage Special Interest Group is more along block and file.

CRAIG BOX: If I have a particular type of block or file system that I want it to talk to, I can define my Kubernetes storage workload to say I need to speak to this external service. But then it will use the CSI to talk to some kind of driver, which implements that.

SAAD ALI: Correct.

CRAIG BOX: So I might need to talk to NFS. It'll be some sort of NFS driver. How do I distribute that particular driver to the nodes that need to access it?

SAAD ALI: For Kubernetes, the recommendation for distributing a CSI driver is exactly like you'd deploy a workload on Kubernetes. So on Kubernetes, drivers are containerized gRPC interfaces, and they are bundled with sidecar containers that we have implemented as the Kubernetes Storage Special Interest Group that tell the driver how to communicate with the Kubernetes API.

So as a storage vendor, you would write just the gRPC interface, containerize that, and then pair it with a sidecar, for example, that knows to look for persistent volume claim objects, and then initiate the CreateVolume call against the driver.

And similarly, we have a number of these sidecars that, for example, look for volume attachment objects from Kubernetes and effectively trigger an attach operation against the CSI driver. And so when you deploy a CSI driver, it's just a pod, basically. It's a Kubernetes deployment against Kubernetes that extends Kubernetes. It's a very nice distribution system.

So compared to Flex, you don't have to worry about deploying a binary on an existing node. Instead, you just deploy a Kubernetes workload, and now you've extended Kubernetes to speak to a new storage system.
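The division of labor between a sidecar and a driver can be sketched roughly as follows. This is a simplified illustration only: real drivers expose gRPC services and real sidecars watch the Kubernetes API, whereas `ToyDriver`, `provision_pending_claims`, and the dictionary shapes here are hypothetical stand-ins.

```python
class ToyDriver:
    """Stands in for a vendor's CSI driver (in reality a containerized gRPC server)."""

    def __init__(self) -> None:
        self.volumes = {}  # claim name -> size in GiB

    def create_volume(self, name: str, size_gib: int) -> dict:
        # Creating the same volume twice returns the existing one.
        self.volumes.setdefault(name, size_gib)
        return {"volume_id": f"vol-{name}", "capacity_gib": self.volumes[name]}


def provision_pending_claims(claims: list, driver: ToyDriver) -> dict:
    """One pass of a toy provisioner sidecar over a list of claim records."""
    newly_bound = {}
    for claim in claims:
        if claim.get("bound_volume"):
            continue  # already satisfied, nothing to do
        volume = driver.create_volume(claim["name"], claim["size_gib"])
        claim["bound_volume"] = volume["volume_id"]
        newly_bound[claim["name"]] = volume["volume_id"]
    return newly_bound
```

The vendor only writes the part played by `ToyDriver`; the loop that notices unbound claims is the part the sidecar supplies.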

ADAM GLICK: Is the CSI flexible to touch any kind of storage, or is it really targeted at certain types of storage?

SAAD ALI: The limitation is block and file. But within block and file, there is a lot of flexibility in what CSI will support. Our initial use case was persistent storage, whether that be on cloud or classic on-prem storage. That was our initial focus.

And so there are CSI drivers for all the major cloud vendors-- Google Cloud Persistent Disks, Amazon EBS-- as well as a number of on-prem solutions. NetApp has a great driver. Dell has a number of drivers. So that was our initial focus with CSI.

But since then, it's evolved to encompass other use cases. So one of the more interesting use cases has been ephemeral volumes. And so if you're familiar with the Kubernetes API, you've probably heard of in-tree ephemeral volumes, like emptyDir or Secret volumes or ConfigMap volumes.

And the idea with those volumes is that the lifecycle of the volume is actually tied to the pod. So when the pod is created, some scratch space is taken from the local node and some data is prepopulated in it, like the secrets that an application needs to access. The application can use it while it's running. And then when the pod is terminated, all of it goes away.

The benefit of this is that you could have multiple containers in a pod that have some shared scratch space, or you could have some prepopulated data from outside of the container, like Secrets or ConfigMaps, that are populated into these empty directories so that the application can consume them.

And so we wanted to take that model and allow anybody to be able to build a volume that would be able to do these ephemeral volumes. And so what we've seen is there's a Secret CSI volume that allows you to plug in arbitrary secret management systems into your Kubernetes system via CSI.

CRAIG BOX: It can't be that secret. You're telling everyone on our podcast about it.

SAAD ALI: [LAUGHS] It is definitely not. It's available on the kubernetes-sigs repo if you want to check it out. But yeah, the benefit is now CSI has expanded beyond just persistent storage. It's being used for ephemeral use cases, as well.

ADAM GLICK: When you talk about ephemeral use cases, I always think of that as people doing data processing, for instance. Say I just need a scratch disk to put data on to do my processing. But if it goes away, it doesn't really matter because I'm not storing that long term.

And then there's data that people want to store long term, or data that already exists that people want attached to their volumes. How does the CSI and the drivers for it treat those types of volumes differently, and how should people using the system think about how they set up those two different types of storage?

SAAD ALI: There's a little bit of a gray area in between those where you're using, for example, a local SSD and you're using that as a caching layer where you want it to be a little bit more persistent than just scratch space that you really don't care if it goes away or not, but you're doing replication at the application layer, so you're tolerant of failure more so than persistent disk. So we segment it as persistent, local, and then ephemeral.

ADAM GLICK: That'd be for something like Redis.

SAAD ALI: Exactly, where you're able to make the trade-off, saying I don't necessarily need the reliability of a storage system that keeps three synchronous copies. I can trade that off for performance and do the replication at my application layer. And so we have a number of use cases for that, and there is an existing in-tree volume plugin called Local PVs that handles local volumes.

But going back to ephemeral volumes, we should differentiate what ephemeral means in the context of Kubernetes. For Kubernetes, when I say ephemeral, it very much means that the lifecycle is tied to the pod. So if you have some sort of data that you only need for the lifecycle of that pod and it goes away, that is what the ephemeral use case is for.

And how it's handled on the CSI side is actually interesting. CSI doesn't have ephemeral as an inherent concept. It only has persistent storage in mind.

And what we did was hack the CSI implementation in Kubernetes to enable ephemeral use cases. Ephemeral is therefore still beta in CSI. And what we're thinking about moving forward is trying to figure out if we should extend the CSI API to make ephemeral inherent, or continue to do it the way that Kubernetes does, which is when the driver registers itself, it says I am an ephemeral driver. Please treat me in this unconventional, non-standard way.

But as far as application developers are concerned, the thing to be aware of is trying to figure out whether the data that you're consuming, the lifecycle of that is tied to the lifecycle of the pod or not. If it is tied to the lifecycle of the pod, you need an ephemeral volume. Traditionally, something like an emptyDir will be sufficient. And what you could do is have an init container that populates your emptyDir with some data that your container might need, and then the emptyDir will be available to all of your containers.
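The init-container pattern described here might look roughly like the following pod definition, written as a Python dictionary so its shape can be inspected. The pod name, container names, images, commands, and mount path are illustrative assumptions; the field names follow the Kubernetes Pod API.

```python
# A pod whose init container fills an emptyDir before the app container starts.
# Names, images, and paths are placeholders chosen for illustration.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "scratch-demo"},
    "spec": {
        # The emptyDir lives and dies with the pod.
        "volumes": [{"name": "scratch", "emptyDir": {}}],
        "initContainers": [
            {
                "name": "populate",
                "image": "busybox",
                "command": ["sh", "-c", "echo hello > /work/data.txt"],
                "volumeMounts": [{"name": "scratch", "mountPath": "/work"}],
            }
        ],
        "containers": [
            {
                "name": "app",
                "image": "busybox",
                "command": ["sh", "-c", "cat /work/data.txt && sleep 3600"],
                "volumeMounts": [{"name": "scratch", "mountPath": "/work"}],
            }
        ],
    },
}


def containers_mounting(pod: dict, volume_name: str) -> list:
    """List init and regular containers that mount the named volume, in order."""
    spec = pod["spec"]
    all_containers = spec.get("initContainers", []) + spec.get("containers", [])
    return [
        c["name"]
        for c in all_containers
        if any(m["name"] == volume_name for m in c.get("volumeMounts", []))
    ]
```

Serializing `pod` to YAML or JSON would give a manifest of the usual form; the helper only confirms that both containers share the same scratch volume.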

But if you need something more specialized, where you have a token that must be exposed by a service, that's where it makes sense, potentially, to have a CSI ephemeral driver that takes care of the prepopulation for you. In the Kubernetes Storage SIG, we're looking into potentially introducing data populators not just for ephemeral volumes, but also for persistent volumes, and making that a first-class concept.

The idea so far has been that when we provision a volume, the volume is empty. But like you pointed out, there are a number of cases where you might want some pre-existing data. And you might not even want to write it. You just want to read it because there's a static file that you want all your containers to consume.

And so today, what you have to do is either have an init container that populates that, or have a very specific CSI driver that prepopulates that data. But what that does is it couples your storage system with a thing that populates the data, and that is not necessarily always something that you want. You can imagine, if you have a backup appliance, for example, the backup appliance keeps a copy of your volume, but the backup appliance may be different from the underlying storage system.

And so rather than having the storage system create a volume and prepopulate it with data, which is possible today, we want to have the storage system create an empty volume, have a third-party backup system populate that volume, and then make it available for use by the application. And we want to be able to have that concept applied generically. We're calling it a data populator, and we're currently exploring it in SIG Storage.

CRAIG BOX: At the risk of asking you a data path question, when you run a pod on a node, you can define a certain request for memory and CPU usage. And the scheduler can use that as a hint to say, well, I can fit a certain number of these on the node because there is a finite amount of that resource available on the physical machine that runs it.

One of the other finite resources of a physical machine is the I/O throughput, whether that be attached disks and the number of spindles that they have, or whether it be network bandwidth to connect to some sort of external service. Does CSI have an opinion on any of that?

SAAD ALI: When Kubernetes initially started, storage was completely independent from pod scheduling. And so what that meant was you could have a volume scheduled to a node using a local disk, but there are no local disks available on that node, or scheduled to a zone where there is no more capacity. And so Kubernetes wasn't making intelligent decisions based on capacity or I/O.

Since then, what we've done is allow the Kubernetes scheduler to become more intelligent in the way that it schedules workloads by taking storage into account. And the way that it does that is it takes into account cluster topology. The idea here is that a given volume may not be equally accessible to all nodes within the cluster.

So you can imagine if you are running on a cloud environment, you may be segmented by zones, where a given disk is only available to a specific zone. Or if you're running in an on-prem environment, your storage volume may be only accessible to workloads in a specific rack.

And so what we wanted to do was allow a way for the storage system to be able to express these availability constraints up to Kubernetes and let the Kubernetes scheduler take that into account. And so we introduced volume topology within CSI which allows the storage system to generically say, here is how I see the cluster segmented.

The challenge here was really interesting, because what we didn't want to do was hard-code the concept of topology, the concept of zone or rack, into Kubernetes, because Kubernetes doesn't really care about those individual segments. And the fact is that every single cluster you have out there may have its own way to segment that cluster, and it would be impossible for us to encode every possible segmentation type into the Kubernetes API.

So instead what we did is come up with a generic way where the storage system can, when it comes up, say, here is what the topology looks like for me. Here are the keys and values to identify the nodes. And those keys and values are then applied as labels to the Kubernetes node objects and can be used as constraints on the workload.

And so as a workload, you can say, oh, I see there are labels available to me that are defining what zones are. So let me say I want zone foo for my workload as a constraint. And now that is going to be used by the Kubernetes scheduler not just to schedule the pod, but also to influence volume provisioning.

On the PVC side, we introduced late binding, which allows for a persistent volume claim to hold off on provisioning until the workload is scheduled. And so when the workload is scheduled and you have late binding enabled, we will effectively have the scheduler decide where the workload should be scheduled. So in effect, we've made the scheduler aware of storage, but we haven't gone all the way.
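The combination of topology labels and late binding can be sketched as a two-step decision: pick a node first, then provision the volume in that node's segment. This is a simplified model, not the scheduler's actual algorithm; the label key `topology.example.com/zone` and the function names are hypothetical.

```python
def feasible_nodes(nodes: dict, required_labels: dict) -> list:
    """Names of nodes whose labels satisfy every workload constraint."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in required_labels.items())
    ]


def schedule_then_provision(nodes: dict, required_labels: dict, topology_key: str):
    """Late binding in miniature: choose the node, then derive where the volume goes."""
    candidates = sorted(feasible_nodes(nodes, required_labels))
    if not candidates:
        return None  # nothing fits, so no volume is provisioned either
    # A real scheduler scores candidates; taking the first is a placeholder.
    node = candidates[0]
    segment = nodes[node].get(topology_key)
    return {"node": node, "volume_topology": {topology_key: segment}}
```

Because provisioning waits for the scheduling decision, the volume is created in a segment the chosen node can actually reach.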

So I talked about three types of constraints that a storage system can apply. One is availability, which nodes a volume is available to in the cluster. It may not be equally available. So we've tackled that use case.

The second is around capacity. So if I have a storage pool that I'm provisioning against and it only has 100 gigabytes available and storage pool two has 200 gigabytes available, but storage pool two only has 10 gigs free versus 5 on the first, today, Kubernetes does not take any of that into account when it's making its scheduling. So it could make a bad scheduling decision without taking capacity into account and corner itself.
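The capacity problem can be made concrete with the numbers from this example. The sketch below contrasts a choice that ignores free space with one that checks it; both functions and the pool records are hypothetical illustrations, not how Kubernetes behaves or a design the SIG has settled on.

```python
def pick_pool_naive(pools: dict, request_gib: int) -> str:
    """Ignore capacity entirely and take the first pool listed."""
    return next(iter(pools))


def pick_pool_capacity_aware(pools: dict, request_gib: int):
    """Only consider pools with enough free space; prefer the one with the most."""
    fitting = [name for name, p in pools.items() if p["free_gib"] >= request_gib]
    if not fitting:
        return None  # no pool can hold the request
    return max(fitting, key=lambda name: pools[name]["free_gib"])


# Pool sizes taken from the example: 100 and 200 gigabytes total, 5 and 10 free.
pools = {
    "pool-1": {"total_gib": 100, "free_gib": 5},
    "pool-2": {"total_gib": 200, "free_gib": 10},
}
```

For an 8 GiB request, the naive choice lands on `pool-1`, which cannot hold it, while the capacity-aware choice selects `pool-2`.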

And similarly, there is no consideration around IOPS. We don't have any plans around IOPS at the moment. It's very difficult to do, but it's also a challenge that we've solved by having separate volumes that are inherently IOPS-isolated. And so our recommendation is that if you are IOPS-constrained, you should be consuming volumes for your application that are IOPS-isolated and dedicated to your application.

This does become a challenge when you're talking about ephemeral storage, like you mentioned, where you have some scratch space that you're using from an emptyDir. The emptyDir is using the node's boot disk. And so if you have a lot of workloads that are all scheduled to the same node, all reading and writing from an emptyDir and they're all I/O-intensive, then you will have problems.

CRAIG BOX: Is there a way that I can prioritize which one should have more priority?

SAAD ALI: Not at the moment. The recommendation that we have is if that is a problem for you, what you should do is rely on local persistent disks. And in that way, you can get the I/O-isolation guarantees that you want. But before we tackle the I/O-isolation case in SIG Storage, we're first looking at having capacity as an input to the scheduler, to be able to make intelligent decisions around how much space is actually available in a given storage pool before we provision.

And just last week, we had a big discussion within SIG Storage on what that should look like. There are folks that are proposing having storage pools be a first-class concept. And there are a number of related use cases that we want to tackle, which are failure domain spreading, for example.

A storage system, when we talk about topology, is not just cluster topology, where a volume is available to a specific node. We already handle that use case. But a given storage system may have its own internal topology.

So for example, it could have an internal three or four disks that make up the storage pool. And if you schedule naively or provision naively, you could have all four volumes provisioned on the first disk. And so if that one disk dies, you lose all of the storage for your application.

And wouldn't it be nice if we could spread across all four disks? Kubernetes has the ability to do this across domains that are applicable to the nodes. So as long as a volume has availability topology, Kubernetes can schedule against that. But if it's invisible, meaning, for example, you have a volume that's available to every single node in the cluster equally, then Kubernetes says, well, if it's equally available, I'm just going to schedule it naively and say all nodes are equal.

Whereas when the volume is actually provisioned, the back end might have some internal topology constraints that it should be spreading out on. And so that's a use case that we're currently looking at and trying to figure out how we should incorporate that into CSI as well as the Kubernetes API.
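The difference between naive placement and spreading across a back end's internal disks can be sketched as follows. This is an illustration of the idea only, with hypothetical function and disk names; it is not a proposed CSI or Kubernetes mechanism.

```python
def place_naive(disks: list, volumes: list) -> dict:
    """Put every volume on the first disk, as a topology-unaware provisioner might."""
    return {volume: disks[0] for volume in volumes}


def place_spread(disks: list, volumes: list) -> dict:
    """Put each volume on the disk currently holding the fewest volumes."""
    load = {disk: 0 for disk in disks}
    placement = {}
    for volume in volumes:
        # On a tie, min() returns the earliest disk in the list.
        target = min(disks, key=lambda disk: load[disk])
        placement[volume] = target
        load[target] += 1
    return placement
```

With four disks and four volumes, the naive placement loses everything if the first disk dies, while the spread placement loses only one volume per failed disk.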

ADAM GLICK: So you've seen people do a lot of stuff with storage over the years. What is the most interesting use of the CSI that you've seen someone do?

SAAD ALI: I think using it for persistent memory is the most interesting use case. Patrick Ohly from Intel has been trying to create a CSI driver that works with their new persistent memory device. And it was just not a use case that we had thought of, and he has a number of ideas and suggestions on how to improve it. So that, I think, has been the most interesting use case.

ADAM GLICK: RAM drives are the next storage interface we should be looking at?

SAAD ALI: Exactly, and how do you handle that from a perspective of, is this going to be treated purely as persistent memory or something else? It raises a number of questions that I think need to be resolved within the application operating system storage/memory interfaces before I think we can even discuss what the interactions within Kubernetes look like. But at least for the sake of CSI, we've been treating it like persistent storage.

ADAM GLICK: You sit on the TOC as well. So you've got a pretty good view of what's going on in Kubernetes and where it's headed. You've seen the creation of a number of interfaces over the years-- the CSI, CNI, CRI, just to name a few of them. What do you think is the next interface that we're going to see?

SAAD ALI: I think device interfaces for GPUs is an area that's being explored actively. I heard somebody talk about a container device interface. I think Nvidia is behind that. I haven't looked into the details of what that actually looks like, but I realize the potential there and the difficulty, especially around how you can come up with a standard that will work for everyone, given the number of implementations that are out there.

ADAM GLICK: Yeah, I was wondering would we see an interface for GPU interfaces or other hardware-attached devices, sometimes dedicated computing pieces. Is that on the roadmap, or is that probably further down the path?

SAAD ALI: I haven't kept up with exactly what's going on there, but I think it is inevitable. The interesting thing is going to be, again, just trying to figure out a way that they can make it work for Nvidia and AMD given all the different flavors of internal proprietary APIs that they have.

It was a challenge that we ran into on the storage side, where every storage vendor wants to differentiate and offer things that nobody else has. So how do we come up with an interface that abstracts away storage from an application developer, but at the same time, enables the power of each individual implementation to shine through and differentiate?

And the nice balance that we struck in the Kubernetes and CSI interfaces for storage is to allow that by realizing that an application developer is different from a cluster administrator. An application developer is someone that we want to hide all the implementation details from. All they want to do is be able to consume storage without worrying about the implementation details. But the cluster administrator is the one that's going to want to be innately familiar with how a storage system is deployed, how it's configured, which subset of knobs they want to expose from that storage system up to the application developer.

So as an administrator of a cluster, you deploy a storage system, and then you create storage class objects. And in those storage class objects, you provide a set of opaque parameters that a storage system can arbitrarily define as knobs that can be configured. So it's completely up to the storage system to define what those are. Kubernetes API doesn't limit it at all and isn't really aware of what they mean. Instead, it's a contract between the cluster administrator and the storage system.

And so then the cluster administrator can decide, well, I want to be able to expose fast storage and slow storage to my end user. And fast storage for me means that I am going to set IOPS to x and I'm going to set IOPS for slow to y, and that corresponds to these specific parameters for this driver. And so they configure those two storage classes, fast and slow.

And as an application developer, I get on my cluster and I do a kubectl get storageclass. I see fast and slow. And I can just pick based off of those names. Do I have a workload that requires fast, or do I have a workload that requires slow? So as an application developer, I don't need to be innately familiar with exactly what storage is underneath. The cluster administrator has done that work for me.

And now the beauty of this approach, really, is the portability aspect of it. Because as an application developer, the objects that I create in Kubernetes, the pod object, the persistent volume claim object, ideally I want those to be portable such that if I deploy them against a given implementation, let's say, on the cloud against GKE, and I move it to, for example, on-prem or another cloud, as an application developer, I should not have to rewrite those application definitions.

And by abstracting away storage in this way, what you can do is have your persistent volume claim point to a storage class with a generic name like fast or slow. And as long as that storage class exists across these clusters, your workload is going to work, without you as an application developer having to change anything.
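The portability argument can be sketched with two clusters that define the same class names over different back ends. Objects are written as Python dictionaries using StorageClass and PersistentVolumeClaim field names; the provisioner names and parameter keys are made-up placeholders, since parameters are opaque and defined by each driver.

```python
def storage_class(name: str, provisioner: str, parameters: dict) -> dict:
    """Build a StorageClass object; parameters are opaque to Kubernetes."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "parameters": parameters,
    }


# Two clusters, each set up by its own administrator (hypothetical drivers).
cluster_a = {
    "fast": storage_class("fast", "csi.vendor-a.example", {"tier": "ssd"}),
    "slow": storage_class("slow", "csi.vendor-a.example", {"tier": "hdd"}),
}
cluster_b = {
    "fast": storage_class("fast", "csi.vendor-b.example", {"iops": "5000"}),
    "slow": storage_class("slow", "csi.vendor-b.example", {"iops": "500"}),
}

# The application developer's claim refers only to the class name.
claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {
        "storageClassName": "fast",
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}


def resolve(claim: dict, classes: dict) -> tuple:
    """Find which driver and parameters would serve this claim on a cluster."""
    sc = classes[claim["spec"]["storageClassName"]]
    return sc["provisioner"], sc["parameters"]
```

The same `claim` resolves to a different driver and different parameters on each cluster, without the claim itself changing.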

And so now from the Google side, what we're doing is we're extending beyond just GKE Classic on Google Cloud. We're going on to on-prem. We have a GKE on-prem. We're talking about GKE on AWS as well as GKE on Azure. As we spread across these environments, what we're planning on doing is having a standard set of storage classes across all of these environments.

And they'll be named in such a way where you could have your workload request, do I want standard storage? Do I want premium storage? Do I want ReadWriteOnce? Do I want ReadWriteMany? And those are the constraints that you, as an application developer, care about. And regardless of which GKE environment you deploy on, you're going to get the storage that you want because the cluster administrator-- in this case, GKE-- will take care of mapping that name to what the correct underlying storage is to fulfill the needs for that application.

CRAIG BOX: Finally, we can't let you go without discussing the fact that CSI instantly makes you think about the crime TV shows, which all have a very famous song by The Who as their theme tune. Does your CSI have a theme song?

SAAD ALI: That is a question that I never, ever thought about.

CRAIG BOX: Dun-dun.

SAAD ALI: I think it definitely should. Do you guys have any ideas?

ADAM GLICK: Every release has a logo. So every interface can have its theme song.

SAAD ALI: We have a logo, and I thought that was sufficient. I didn't realize we needed a theme song as well.

CRAIG BOX: Well, if we want to go for The Who, they have a song called "Substitute." And I think that that would fit perfectly in with the theme.

SAAD ALI: Can we play that as an outro?

CRAIG BOX: I don't know that our copyright people would let that happen, but we'll link to it.

SAAD ALI: Let's open up a GitHub issue, and we'll take it into consideration.

ADAM GLICK: Saad, this has been absolutely fascinating. Thank you so much for coming on the show today.

SAAD ALI: Thank you for having me. This was a lot of fun.

ADAM GLICK: You can find Saad Ali on Twitter @the_saad_ali.

[MUSIC PLAYING]

ADAM GLICK: Thanks for listening. If you aren't already, why not subscribe in your favorite podcast player so you can get every episode when it comes out? If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @kubernetespod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes. Until next time, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]
