Kubernetes Podcast from Google: Episode 98

#98 April 7, 2020

Cassandra, with Sam Ramji

Hosts: Craig Box, Adam Glick

Apache Cassandra, a scale-out datastore, is becoming more Kubernetes-native. Sam Ramji is Chief Strategy Officer at DataStax, a company that builds Cassandra-based products. He explains how DataStax has pivoted back towards supporting upstream Cassandra, and how they’re making it easier to manage on Kubernetes. As always, we also cover the news of the week, and we look at what is and is not a dinosaur.

Chatter of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

ADAM GLICK: I saw some wonderful news this week. There was an article in "Scientific American" that says that the brontosaurus is back.

CRAIG BOX: Hooray!

ADAM GLICK: They have determined that the brontosaurus indeed was different enough to actually be a separate dinosaur, so there is an argument made that the brontosaurus, much like Pluto, is back. My childhood has returned. All is well.

CRAIG BOX: What was the brontosaurus if not a dinosaur?

ADAM GLICK: It was a dinosaur, but they thought it was an apatosaurus. Apparently, the brontosaurus has been defined as something that has a longer neck and a wider head, something like that. There were some distinctions on it, but it just means that the bronto is back, baby!

CRAIG BOX: Well, they can put that in the category with the do-you-think-he-saurus.

ADAM GLICK: [CHUCKLES] We found some fun things to do over the weekend. We went on a bear hunt. If you haven't heard, there are people doing bear hunts all around the world, where people put stuffed bears in their windows. If you're walking around with small children, and you need something to entertain them, they can look, and try to do a bear hunt, and try and spot the bears in people's windows.

So we put one in ours, and we went around, and we found about 13 of them as we walked around on a walk. There are a number of them out there, floating around. For those of you who are slightly older kids, you might also enjoy geocaching, which we did a little of this weekend.

CRAIG BOX: We went for a wander around the neighborhood a couple of days ago, and there were a bunch of teddy bears in the window.

This is the part of the show that my Mum listens to. I should point out that she actually posted a picture on Facebook of the two teddy bears that she has in her window, and one of them is mine!

ADAM GLICK: As in, the one from when you were a child?

CRAIG BOX: It was later into my childhood. It's not one I have a particular affinity to. But I remember winning it in a contest on the radio, or something like that. I'm like, I remember that. That's my teddy bear. What are you doing, putting that in the window? But that's all right. We should put a teddy bear in the window here in the UK, but we have not yet gotten around to it.

A number of people have their kids drawing various rainbows, which are the sign of hope for the NHS and the sign of support. So thank you very much to everyone who is supporting our frontline carers and everyone out there who needs something to do to take their mind off the situation we still find ourselves in.

ADAM GLICK: Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: Google Cloud has released kpt (pronounced "kept"), an open-source tool for Kubernetes packaging. Kpt uses a standard format to bundle, publish, customize, update, and apply configuration manifests. It has a Git and YAML architecture, which means it works with existing tools, frameworks, and platforms and can be expressed in terms of the Kubernetes resource model.

Kpt was built by a team including many previous guests of the show. While they thoughtfully provided a pronunciation, they chose not to say exactly what kpt stands for. The FAQ suggests it's derived from apt, as in the Debian packaging tool. But Brian Grant, guest of episode 43, encouraged our listeners on Twitter to come up with their own interpretation. Our favorites so far are "Kubernetes PowerPoint Tool", where architecture diagrams come to life, or "Kubernetes Post-Teatime", where, in Britain, packages can only be installed in the early evening.

Please continue to tweet us your guesses @KubernetesPod. Wrong answers only, as they say.

ADAM GLICK: The Kubernetes Blog has started its usual deep dives into features of a new release. The first post focused on the topology manager, which is now in beta. This feature is designed for people running latency-critical applications. Without topology manager, the CPU and Device Manager would make resource allocation decisions independent of each other, which could cause degraded performance on latency-critical applications.

Next up was the Ingress API beta. It has been updated to support wildcard host names, a new path type field that specifies how paths should be matched, and a new IngressClass resource that can specify how Ingress should be implemented by controllers. It is expected the Ingress API will go GA in 1.19.

The third post covered the server-side apply functionality that is now in beta 2. This feature moves the kubectl apply functionality off the client machine and onto the API server. Lastly, use of the container storage interface on Windows entered alpha with the 1.18 release, giving Windows users a much easier way to standardize access to third-party storage offerings.

CRAIG BOX: If you listened to episode 89 about GitLab, you would have learned they have several editions, some commercial, some open-source. This week, a number of features have trickled down into the open-source, or Core, edition. These include management of multiple Kubernetes clusters and network policies in clusters managed by GitLab CI.

ADAM GLICK: Rancher Labs, our guest on episode 57, have released Rancher 2.4. Features include support for up to 2,000 clusters, upgrades that can happen with intermittent connectivity, a zero-downtime maintenance option, and integration of 100 CIS vulnerability checks to do ad hoc security testing of a cluster.

CRAIG BOX: When you are using cloud object storage, your endpoint is actually backed by hundreds or thousands of machines. When you replicate it locally, your storage servers become your pets.

The MinIO project provides an S3-compatible storage engine. To enable scaling across many machines, this week they released Sidekick, a single-purpose, high-performance load balancer designed explicitly for the storage use case. You give Sidekick your list of storage endpoints, and it will balance between them. Sidekick runs as a sidecar on workloads in a cloud-native environment.

ADAM GLICK: The Cortex Project has released version 1.0 of their horizontally scalable Prometheus storage implementation. Co-creator Tom Wilkie, now at Grafana, announced the release comes with documentation detailing the steps necessary to build a production-ready Cortex deployment, turnkey Grafana dashboards and ready-made Prometheus alerts, stability and backwards compatibility guarantees, and an easy-to-use single-process airplane mode for getting started. Cortex was originally created by Weaveworks and became a CNCF sandbox project in September 2018.

CRAIG BOX: Many security issues are found by well-meaning researchers, but sometimes they can be found by well-meaning robots, too. Fuzz testing uncovered a denial of service vulnerability in the Kubernetes API server which has been deemed CVE-2019-11254. The issue was disclosed last week but was patched in point releases made earlier this year.

ADAM GLICK: French hosting company Scaleway has launched Kubernetes Kapsule with a K, a new service to manage Kubernetes clusters on their infrastructure. Kapsule supports up to 500 nodes per cluster and offers a 99.95% SLA on their control plane.

CRAIG BOX: Nicolas Frankel from Hazelcast has written a three-part series demystifying custom controllers for Kubernetes and walking through building one. Given that Hazelcast is based on Java, his operator is, too, but using the GraalVM, he was able to build a static image which uses one third the resources of a regular JVM. If you're too enterprised for writing in Go, check it out.

ADAM GLICK: Simon Bernier St-Pierre has released kubie with a K, a command-line tool designed as an alternative to kubectx and kubens. It supports context switching, namespace switching, and prompt modification in a way that isolates shells from each other. This means that you can open multiple shells in different Kubernetes contexts without any issues. It also supports loading Kubernetes contexts from multiple files to keep your clusters and contexts separate. The tool is open-source and available on Simon's GitHub repo.

CRAIG BOX: Over the past five years, Kubernetes has moved from being a Google-run to a community-run project. Use of Google infrastructure sometimes meant that only Googlers had access to certain systems, one of which was the container repository that the Kubernetes images are served from. The Google team has been working to transition this to community ownership, and this week, the switch was flipped by Linus Arver, who posted graphs showing traffic fall off the old repository and onto the new. Congratulations, and thanks to Linus and the GCR team.

ADAM GLICK: Darkbit has released MKIT, an acronym for Managed Kubernetes Inspection Tool. MKIT leverages open-source tools to query and validate several common security-related configuration settings of managed Kubernetes cluster objects and the resources running inside those clusters. MKIT can quickly look for common cluster misconfiguration in GKE, EKS, and AKS. After running, the tool provides you a report of tests that passed and failed. The tool is open-source and available in Darkbit's GitHub repo.

CRAIG BOX: Rafael Fernandez Lopez has announced oneinfra, a Kubernetes-as-a-service service. Oneinfra is conceptually similar to hosted Kubernetes services, like GKE, or open-source projects like Gardener, and it is a control plane for creating Kubernetes control planes where the nodes are run elsewhere. The project is in alpha, looking for feedback, and also available on GitHub.

ADAM GLICK: Henning Jacobs, our guest on episode 38, has posted about how to save costs while running Kubernetes in the cloud. His post uses AWS as an example of places where people use the horizontal pod autoscaler to make sure their applications can scale up and down but often forget to clean up underutilized nodes. He points out tools like kube-janitor and kube-downscaler can help clear up resources, as well as kube-resource-report, which can help you see underutilized resources. He also recommends using lower-cost ephemeral instances like AWS spot or Google Cloud preemptible instances.

CRAIG BOX: PlanetScale, our guests on episode 81, have announced multicloud databases are in beta. Built on the open-source Vitess engine and MySQL, their clusters support four regions each for GCP, AWS, and Azure. Multicloud clusters launch in beta and promise GA within 90 days.

ADAM GLICK: Google Cloud is making many of their training materials free in April for those who are at home looking to build new skills. Kubernetes training is available through Google Cloud's hands-on Qwiklabs as well as coursework on Coursera and training from Pluralsight.

CRAIG BOX: Google's Project Zero has discovered a new security vulnerability in HAProxy. An attacker could send specially crafted HTTP/2 packets, which cause memory corruption, leading to a crash or remote arbitrary code execution. Red Hat has issued a critical security warning about this vulnerability, as it is a default component in many Red Hat products, including their OpenStack, OpenShift, and Enterprise Linux distribution. Their current recommendation is to turn off HTTP/2 support until a fix is provided.

ADAM GLICK: Finally, with many people around the world practicing some level of isolation, the CNCF has posted both an audio and text version of a well-being guide from the Well-Being Working Group. The post provides some good reminders about the current situation and how to best take care of yourself.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Sam Ramji is the chief strategy officer at DataStax. His past roles included VP of Compute and Kubernetes product teams at Google, leader of the open-source transition at Microsoft, and the CEO of the Cloud Foundry Foundation. Welcome to the show, Sam.

SAM RAMJI: Hey, Adam. It's great to be here.

ADAM GLICK: I know from your background that you studied AI and neuroscience. You've often said in some of the talks that I've seen you give that there was a winter of AI at the time that you graduated, and so that actually is what drove you towards software development professionally. It seems like that winter may have thawed over the past few years. Have you ever thought about going back?

SAM RAMJI: Yeah. You know, it's funny that you say that. I think that we kid around and realize that the third AI summer started with Google, because search is AI. "AI is dead-- long live search" was the rallying cry in the 2000s. So I feel like we're in the middle of a giant AI revolution.

And now we don't say AI. We just say auto, right?

ADAM GLICK: Mm-hmm.

SAM RAMJI: So autocomplete, autofill, autotuning, autonomous driving.

ADAM GLICK: That's a good point.

SAM RAMJI: Anything that says auto is really what we would talk about as being AI-- underlying machine learning, pattern recognition, data pipelines, enormous data ingest, and training sets. So we're surrounded by it. I feel like I'm in the thick of it.

ADAM GLICK: Very true.

This is your second stint as a chief strategy officer. Many folks have heard of a lot of C-level titles. That might be a new one for some people. Can you explain what a chief strategy officer does?

SAM RAMJI: There are a few different breeds, and I can only talk about what I know as well, which is to create a shared narrative that allows all of us to make decisions at speed and at scale that are both consistent with each other and that are coherent with the real world. Strategy is really about fitting yourself dynamically to the environment you see yourself in, sort of a Darwinian notion of adaptation. How do you take an organization of human beings, allow them to do distributed cognition at scale, and make that whole thing actually line up to some kind of serious impact in the market? That's the purpose of strategy.

ADAM GLICK: It's almost like a philosophy of business and technology.

SAM RAMJI: Yeah, I think that's totally fair. Some people would look at the chief strategy officer position as being very financially defined, doing analysis, market sizing, the ability to go and do particular strategic merger and acquisition activity. I think all of those are fair definitions, but those are all tactics. I think the ur-strategy, the top-level strategy, is being able to understand, what's the shape of the world currently? What's the shape of the world you want to get into, and how does the shape of your organization fit that journey?

ADAM GLICK: Previously, you also worked at Microsoft-- interestingly enough, working on the open source efforts a number of years ago. What was that like at the time, when Microsoft still had a large focus on proprietary pieces, and you were helping shift the culture?

SAM RAMJI: It was super intense. It was magical, actually. About the same time, "The West Wing" was very popular, if you remember that TV show.

ADAM GLICK: Oh, it's my wife's favorite show. I do.

SAM RAMJI: There was this sense of a divine purpose. For those of us who were driving open-source and Linux strategy and activity for the company, it was like you had to be the strongest microbe in the Petri dish to get through it. So we were all very closely aligned. We felt like every heartbeat was valuable. Nothing was wasted.

We'd have meetings in the hallways between the meetings. We were constantly alive with energy and trying to fight the good fight. We're going to make the world safe from Microsoft for open source, and we were going to bring Microsoft to open source to the benefit of everybody. We were fully on fire.

ADAM GLICK: How gratifying is it to see the changes that have been made? If you take a look, Microsoft is a very different company today, with a lot more investment in open source.

SAM RAMJI: Oh, it's deeply gratifying. It's funny to think of some of the things that we did which were seen as absolutely out of left field. Like, we made a huge contribution to the Linux kernel in 2009 under the GPL, all of these things people said Microsoft would never do. And now Linux has a huge workload for Azure.

One very funny thing that we had was-- we should really make SQL Server something that runs on Linux. People said, that's absolutely crazy. But guess what? A couple of years ago, Microsoft announced SQL Server's on Linux.

ADAM GLICK: Yeah, so you were definitely ahead of the curve on that one.

SAM RAMJI: Yeah. It was an amazing time, because when you have the resources of a company the size of Microsoft, it's really about your ability to have an insight, come up with a strategy, and then tell the story so that people get on board. It was slow at first, in 2006, when I took over open-source and Linux strategy for the company, but by late 2009, we were like a freight train. We had the Novell deal. We built the Linux interoperability lab. We became an Apache Foundation top-level sponsor. Just kept going from one strength to another.

The last meeting I had with Bill Gates, actually, was mid-2008. It was about a week before he retired, and it was a culmination of a lot of work that many of us had done to reframe how we thought about open source at the company to allow engineers to participate directly in open-source projects. Bill approved. He greenlighted the Apache license for development, the MIT license and a few others.

That was paving the way for the strength of the Microsoft engineering community to start to gradually get really comfortable with open source. And I think that speaks to where they are today in 2020. So yeah, it's a privilege to have been an early part of that journey.

ADAM GLICK: You're now at DataStax, working with a number of old friends and some of the folks that I know as well. You build products based upon the Apache Cassandra project. Where did Cassandra come from?

SAM RAMJI: Cassandra is a really neat project. It came from, originally, the Bigtable paper released by Google, which turned-- at the same time, roughly-- into Hadoop at Yahoo and into Cassandra at Facebook. Early 2008, Facebook is trying to solve a problem which has to do with global inboxing-- similar problems that led Google to build Spanner, running underneath Gmail.

It took a couple of years to come through incubation phase, and then by early 2010, it was a top-level Apache project. So Apache Cassandra has been going strong, running what's called a wide-columnar store, or more colloquially, a scale-out NoSQL database built in open source.

ADAM GLICK: You used some interesting terms there. Cassandra described as a wide-columnar data store, as you said. Can you explain for folks that might not understand the difference between what a wide-columnar data store is, versus, say, a traditional relational database like MySQL or Postgres, or a key-value store like TiKV?

SAM RAMJI: Simply put, the origin of databases that we think about, like MySQL, Oracle, PostgreSQL, relational databases, really started in the '70s, when we were looking at ways to do analytics. You're looking at, what can you query? Compute's expensive, so you structure the whole environment so that you can ask these questions, and that gives us modern relational databases, RDBMSes.

Now, it's not to say that there are no relationships in NoSQL. And it's a little bit of a funny term, but we don't get to name markets, right?

ADAM GLICK: Mm-hmm.

SAM RAMJI: A wide-columnar store says, we don't necessarily know what questions we want to ask about the data, but we do know that we need to be able to write and read extremely quickly, and we don't want to put arbitrary constraints on what the application can read and write. So the constraints, like BCNF, third normal form, all the things that we know about relational databases-- those constraints are trade-offs that we're imposing very, very early on in application development around the schema so that we can do analytics later.

Internet-scale applications said, look, we don't even know what that's going to be, but we do know that we need to scale fast now, and the applications need to work super well. That's the wide-columnar store.

Key-value is really important, because it lets the developer not think too much. One of the things that you find in software is, you always get it wrong, and then you have to make it incrementally less wrong. So it's good to have something that feels a little bit more like putty, something that's very malleable, and key-value is very malleable.

One of the interesting things about Cassandra is its ability to do what you want to do in an application without having to shard. A shardless NoSQL database is another way to think about how Cassandra works its magic.
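To make the contrast with an up-front relational schema concrete, here is a minimal toy sketch of the access pattern Sam describes: writes and reads by key, with no fixed set of columns agreed in advance. It is an in-memory illustration only. The names `table`, `write`, and `read` are hypothetical, and none of this is a Cassandra API or a description of how Cassandra stores data.

```python
# Toy in-memory model of writing and reading by key without a fixed schema.
# All names here are hypothetical; this is not a Cassandra API.
table = {}


def write(partition_key: str, **columns: object) -> None:
    """Upsert any set of columns under a key; nothing is checked up front."""
    table.setdefault(partition_key, {}).update(columns)


def read(partition_key: str, *names: str) -> dict:
    """Fetch the requested columns for one key, skipping ones never written."""
    row = table.get(partition_key, {})
    return {name: row[name] for name in names if name in row}


write("user:42", name="Adam", last_login="2020-04-07")
write("user:43", name="Craig", teddy_bears=2)  # a different set of columns
write("user:42", last_login="2020-04-08")      # a later write replaces one column
```

The application decides what to write per key, and the analytical questions can be worked out later, which is the trade-off described above.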

ADAM GLICK: Sharding is one of those things where people are looking for scale without having to restructure the database?

SAM RAMJI: Yeah, that's exactly right. The classical example is, you have a database. You have a bunch of users. You can put all of the users, sorted by last name, into the database. No problem. But at a certain point, you end up with government regulation or a particular level of scale, and now you need to segregate the users into different buckets.

OK. That's all well and good. You spin up another database, and you come up with some tool, some algorithm for how you're going to separate them. Could be by zip code. Could be by the first letter of your last name.

That's all well and good, and it sounds like it's just a DBA task, but it's actually an application development task. You now need to change your application code to call the correct database based on what key's been passed in. That's tricky. So for no additional benefit-- your application gains zero features-- you get to write new code, take the outrageous risks of testing, deploying, pushing it live to millions of users and hope everything continues to work.

It's what we'd call a featureless upgrade. All you're trying to do is solve for scale. As Patrick McFadin, who's our dev rel lead, says, friends don't let friends shard.
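The application-level cost of sharding that Sam describes can be sketched as follows. This is a hypothetical illustration, not code from any real system: two dictionaries stand in for two separate databases, and the shard names and the first-letter routing rule are assumptions chosen to match the example in the conversation.

```python
# Hypothetical illustration of application-level sharding.
# Each dict stands in for a separate database; the names are made up.
SHARDS = {
    "users_a_m": {},  # last names starting with A through M
    "users_n_z": {},  # last names starting with N through Z
}


def shard_for(last_name: str) -> str:
    """Pick a shard from the first letter of the last name."""
    first = last_name[:1].upper()
    return "users_a_m" if first <= "M" else "users_n_z"


def save_user(last_name: str, record: dict) -> None:
    # Every write path in the application now has to go through the router.
    SHARDS[shard_for(last_name)][last_name] = record


def load_user(last_name: str) -> dict:
    # Every read path has to go through it as well.
    return SHARDS[shard_for(last_name)][last_name]


save_user("Box", {"first": "Craig"})
save_user("Ramji", {"first": "Sam"})
```

Nothing in this code gives users a new feature. The routing function exists only to cope with scale, and changing the split later means changing and redeploying the application.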

ADAM GLICK: I think about relational databases. Traditionally, they scale vertically instead of horizontally. If your database is going to grow, you put them on bigger VMs, bigger boxes. That's a fairly brittle design. Does Cassandra do more of a horizontal scaling design?

SAM RAMJI: Yeah, that's exactly right. We used to sell people a Sun E10000 so that they could scale their Oracle and Java monolith vertically as high as you could go. It was, like, a million-dollar piece of hardware. But what the internet infrastructure companies realized is, that's a crazy way to build the economics of an internet business. We have to scale out. So this idea of horizontal scaling-- we often abbreviate it as scale-out-- being able to have really small, not very powerful individual boxes doing a lot of amazing work cheaply, in concert.

ADAM GLICK: When would someone choose Cassandra as their data store? There's a ton of databases out there that people look at for different use cases. When's the right time to think about Cassandra?

SAM RAMJI: I think when you know that you've got a problem that has arbitrary scale. That's where people reach for Cassandra. It's often, these days, called the database of last resort, because once you run into the limitations of particular databases, and your application has got the good fortune of getting huge in its usage volume, you end up turning to Cassandra.

I think a little bit of wisdom, amalgamated with those application design principles, is to say, hey, what do we think, if we were wildly successful, might be our scale? And if that starts to look large, if you start saying, well, that's certainly terabytes of data, and we want to be able to have gigabytes-per-second access into that fleet of applications running around the world, that's a good place to start and go, wow, we should probably start with Cassandra.

ADAM GLICK: Where would you say this fits in the CAP model when people talk about databases?

SAM RAMJI: Ah, the CAP model. Brewer's theorem.

ADAM GLICK: Eric will be so proud, yeah.

SAM RAMJI: This is something we'd call an AP database. It's eventually consistent. It's a little bit overstated what "eventual" means in eventually consistent. We typically mean milliseconds. When you think about these trade-offs, it's often used to say, oh, well, you can't support transactions. Therefore, this doesn't make any sense. But for the vast majority of high-performing, high-scale applications, a few milliseconds between nodes becoming consistent is well within operational tolerances, and you can't tell the difference between that and an ACID transaction, which is often what these systems are held in tension with.

There's some really interesting thoughts on this future of data, what a distributed system looks like in distributed databases, and whether acidity is even a valid principle in a widely distributed environment. Network two-phase commits-- those things don't seem like a great idea. And if you think about the system as being ACID, you can put ACID properties in the entire distributed architecture that you would never expect to have in one particular piece.

ADAM GLICK: For those that aren't familiar, what does ACID mean? People talk about ACID compliance.

SAM RAMJI: Basically, it means that you can trust on atomicity, consistency, isolation, and durability. That's the A-C-I-D. And that's opposed to what people describe NoSQL or NewSQL as, which, of course, very cleverly, is BASE. I'm not going to get into BASE, because it's not quite as sticky as ACID.

ADAM GLICK: They really took the chemistry metaphor a little far there.

SAM RAMJI: They did.

Atomicity really means-- a little bit like a bank, you want to be able to write a piece of data to one account. You want to write a piece of data to another account, typically a bank transfer, and you want it to all happen, or you want none of it to happen. It would be a terrible thing if the bank lost money or if you lost money. You want the supply of money to be constant, but you want that asset to be able to move consistently.

Consistency means that we leave the system in a consistent state. Isolated means that for each application writing to the system, they feel like they're the only application in the world. And then, durable just means that we're going to write it to disk so that if you end up shutting down the power, if some horrible fault happens, you can always trust the state of the system. That's the classical '70s model of databases, and that's the expansion of ACID.
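The bank-transfer example can be sketched with a small relational database. The following is a minimal illustration using Python's built-in `sqlite3` module with an in-memory database. The account names, balances, and the `transfer` helper are assumptions made up for this sketch.

```python
import sqlite3

# In-memory database; account names and balances are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src: str, dst: str, amount: int) -> bool:
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balance(name: str) -> int:
    return conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (name,)
    ).fetchone()[0]


ok = transfer("alice", "bob", 30)       # both updates are committed together
failed = transfer("alice", "bob", 500)  # the debit violates the CHECK, so the credit is rolled back too
```

In the second call the credit to `bob` has already been applied inside the transaction when the debit fails, and the rollback undoes it, so the total across both accounts stays the same.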

ADAM GLICK: You mentioned NewSQL in there. Will Cassandra ever become a NewSQL database, or is that really a separate direction?

SAM RAMJI: It's really hard to know what NewSQL is or what NoSQL is.

One thing that's come up in my life often is, I'm not very good at naming things, and I object to everybody else's names, which is a tough combination. At Apigee, I really didn't like-- the web remote procedure calls were called APIs. I was like, no, no. An API is a local thing. But the market named it APIs. The market named NoSQL NoSQL. That's life. Our market also named Serverless Serverless, and I really don't like defining something in the absence of something else. That's really weird, so let's just call it what it is.

Where I think Cassandra is going is super interesting, actually. We're in this moment of a Cassandra renaissance. 2020 is when Cassandra 4.0 is coming out. A tremendous amount of leadership from Apple, who is running over 100,000 nodes of Cassandra in their infrastructure. So this is one of the most extensively tested, distributed scale-out databases on the planet.

Where we start seeing Cassandra move is in loosening its hold on opinions about what the developer interface should be, what the operator interface should be, and what the storage interface should be. The inside of Cassandra starts to look more like a data fabric, and the other components look more like pluggable areas in the architecture. So there's a lot of interesting work that's been done by many companies in the last few years, and I think all of that is starting to come together in 2020.

ADAM GLICK: DataStax, the organization you work for now, started as supporting the Cassandra project.

SAM RAMJI: Yeah, that's right. It started as a company called Riptano, I think. There is a rhino emblem that some people who've been at the company a long time still use on their jackets, a mark of old-school pride.

ADAM GLICK: Before the Cleopatra eye symbol?

SAM RAMJI: I think the Cleopatra eye symbol was always with Cassandra, but the rhino was for Riptano, which was the Cassandra company. That first iteration really focused on technology, technical contributors to the Apache Cassandra project, and then to professional services and support.

ADAM GLICK: You've recently supplemented your commercial product with a renewed focus on supporting upstream Apache Cassandra. Why the doubling down back on open source in the community?

SAM RAMJI: I think the opportunity for any open-source company is conditioned on the growth and the excitement that the community feels. There's nothing more powerful than a fully engaged community that is pushing the technology forward, that are pushing each other forward, finding new places for this technology to go.

As you look at some of the great data companies that have been built in the last few years-- Mongo, Confluent, Databricks, Elastic, just to name a few examples-- you tend to see this really nice combination of custodianship-- safeguarding and taking care of and engaging with community-- and that turns into large-scale adoption. Some of that adoption can be turned into revenue through software products, through software services, and you want to make sure that everybody is growing faster than you might hope to grow.

There's a positive-sum game at root in the economics of open source. What we're doing at DataStax is reminding ourselves and everybody else of the positive-sum economics of open source, how that works within Apache Cassandra, and how do we unite these different lessons that everybody's learned in the last few years that apply to the Cassandra project and bring that all together for everybody's benefit for the coming decade?

ADAM GLICK: Like most databases, Cassandra is traditionally installed on VMs and bare metal. What had to change to make it work with Kubernetes?

SAM RAMJI: You point out specifically what the challenge is in bringing databases to the cloud-native era. Frequently, cloud-native is about statelessness, and you'll see a little thin pipe poke down to a particular data service. You can talk to the data service, but it's not really cloud-native, so that's kind of an opportunity.

It turns out, to make a really, really high-performing database, you have to get very, very simple. We had to throw away the ability to do a lot of complex schema, a lot of complex analytics, and get right down to the core of it. That's what Cassandra was 10 years ago.

What it then learned to do-- what she learned to do, because I tend to personify Cassandra in my mind-- is, she learned to make the very best use of all of the assets around her. That's bare metal. And if it's going to be virtualization, you're still using that as a deployment mechanism to take the fullest possible use of any of the hardware that you've been given. Once you've got that, then you want to be able to flex your muscles. You want to be able to do all sorts of very low-latency, high-performance interactions to support the application workload.

Maybe 18 months ago, we started seeing, hey, there's a lot of demand for a cloud-based Cassandra service. How would that work? Once you start putting Cassandra in the cloud, you realize, oh, we need a lot more elasticity. Some of the opinions that the Cassandra kernel, if you will, has about what it controls and how it gets deployed need to shift.

We've learned a lot about that, and we've isolated that into a management API sidecar, which we open-sourced, as well as a Kubernetes operator, which we learned the hard way by trying to make Cassandra scale in an economic fashion as a cloud service, something that we call Astra. So a lot of different things had to move around, and we've isolated those into the management API sidecar.

ADAM GLICK: I would love to follow up on a couple of those things. When I think about Kubernetes, certainly, when it was younger, before stateful sets or pet sets, as they were sometimes lovingly called, the mantra was always, stateless in Kubernetes, stateful outside of it. We encourage people to use hosted services. Don't run data stores on top of Kubernetes unless you are an expert with Kubernetes, an expert DBA.

Have we reached the point where the default advice should be changing a little, that people should be running their data stores in Kubernetes?

SAM RAMJI: I think we should be architecting for data stores in Kubernetes and thinking about it hard. I think making that production ready-- that's something that we're just starting to do as a community this year. We have a commercial product, DataStax Enterprise, coming out in about a week that will run on the open-source Kubernetes operators.

It's something that is already happening. We're seeing companies like Sky and Orange Telecom, Netflix, many others meeting this boundary of bare metal, where you have to provision to peak, versus this very broad, open environment of cloud-native, where Kubernetes can just go take advantage of whatever resources are around, use them to support a spike of work, and then let them go when that spike has passed.

The core secret of cloud-native, of course, is not just scale-out. We've always been able to scale out. It's being able to scale back in. Withstand the spike, and then relax back into the sustain.
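To put rough numbers on the difference between provisioning to peak and scaling back in, here is a minimal Python sketch. The load figures, the per-node capacity, and the three-node minimum are all hypothetical values chosen for illustration; they do not come from the interview or from any Cassandra benchmark, and the sketch ignores the time a real cluster needs to stream data when nodes join or leave.

```python
import math

def nodes_needed(load, capacity_per_node, min_nodes=3):
    """Smallest node count that can serve `load`, never below `min_nodes`."""
    return max(min_nodes, math.ceil(load / capacity_per_node))

def node_hours_peak(hourly_load, capacity_per_node, min_nodes=3):
    """Provision to peak: run enough nodes for the busiest hour, all day."""
    peak_nodes = nodes_needed(max(hourly_load), capacity_per_node, min_nodes)
    return peak_nodes * len(hourly_load)

def node_hours_elastic(hourly_load, capacity_per_node, min_nodes=3):
    """Elastic: scale out for the spike, then scale back in afterwards."""
    return sum(nodes_needed(load, capacity_per_node, min_nodes)
               for load in hourly_load)

# Hypothetical day: steady 2,000 requests/s with a three-hour spike to 12,000.
hourly_load = [2000] * 10 + [12000] * 3 + [2000] * 11

print(node_hours_peak(hourly_load, capacity_per_node=1000))     # 288
print(node_hours_elastic(hourly_load, capacity_per_node=1000))  # 99
```

Under these assumed numbers, the elastic schedule uses about a third of the node-hours, which is the economic argument behind "withstand the spike, and then relax back into the sustain."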

ADAM GLICK: The elasticity of it, as it were.

To that point, Kubernetes obviously understands where different parts of its system are, where the nodes are, and how to communicate with them. But because Cassandra was built before Kubernetes, it had to locate those nodes itself with its own cluster. How did it do that, and has that fundamentally changed as you've moved it to being able to run within Kubernetes, where the system has that information?

SAM RAMJI: Yeah, you make a really good point. This started around the same time that Google was starting to contribute its thinking on containers into Linux. This is 2007, 2008, when Google contributed cgroups to the Linux kernel, which then Solomon Hykes and his team turned into the beginnings of Docker.

That was going on in parallel while Cassandra, a project of about the same age, had to make really good use of the bare metal. So relaxing its opinions on discovering and managing hardware underneath it-- that's been isolated into this idea of a Cassandra operator for Kubernetes and the management API sidecar.

One of the things that's had to shift is that it needs to know how to deploy itself alongside Kubernetes and to scale with Kubernetes. How can that be addressable? How can that be reachable? What are the other Kubernetes-native or cloud-native, if you prefer, technologies that it needs to play well with?

It's not enough to just say, hey, I can run in Kube. You want to be able to be visible and manageable by Prometheus. You want to be provisioned into an Istio and Envoy service mesh so all these things are discoverable and natural.

ADAM GLICK: You mentioned a new Cassandra operator and that DataStax indeed has just announced a new Cassandra operator for Kubernetes. There have been other Cassandra operators out there by other organizations. Is this a net new project or something that built on those? Will it unite those other projects?

SAM RAMJI: The intent here is that it joins the pantheon of Kubernetes operators and represents a point of view that we've earned the hard way by delivering Astra and by delivering enterprise databases on-premises to our users. It's probably the eighth or ninth Kubernetes operator for Cassandra.

Our hope is that what we can do is combine the opinions that each of these operators represents, and 6 months, 12 months from now, a net new user could say, hey, I want to run Cassandra in a cloud-native way. They go to Apache Cassandra. They pull down the latest version from the repo. They say, OK, where's my Kubernetes operator? They pull down one, and there's an obvious solution.

What we've had to do is ask the question, how would you write one operator that could be used by literally everybody in the world? The other operators that have been built are each well built for their own purpose. Each operating environment makes particular choices about, what's the management software they're using? What are you using for security? What are you using for log analysis? Those operators have learned how to do a great job in the specifics of each of their locations.

What we've had to do is to say, how would we build one that learns from all specifics and can run in the general case? Not to say that our opinion is the best or the most informed, but it is the most generalization-oriented, because we have one operator that runs in our cloud, Astra, and runs on-premises with DataStax Enterprise.

That's something that we're contributing. Ideally, all of these different operators will come together, unify, have some kind of 80% case that's covered really well, and there's one clear set of code that everybody can use. And if people have other, more specific needs, then perhaps we can have an architecture for participation, where there's pluggability or specificity of those kinds of choices.

ADAM GLICK: You've mentioned Astra a couple of times. I know from the announcement that the operator that you're talking about sits behind and powers the Astra service that you run. Did the operator come out of the work that was done with Astra, or did you build the operator and are now using it as part of your system? The chicken-or-the-egg question, so to speak.

SAM RAMJI: The operator came out of the work we did with Astra, because there's an emergent process of learning about the world. When you wake up, and you see a problem, and you see a solution, it's almost guaranteed that at that moment, at least a half a dozen other people in companies have looked at the same problem the same way. So if you talk to a whole bunch of large-scale users of Cassandra-- as I mentioned before, you can include Target, Sky, Orange, Apple, Netflix, Instagram-- you have this emergent learning process where you're like, ah! This needs to change in particular ways so that it can scale out on Kube. And everybody starts saying, well, the operator pattern is really nice. That was something that CoreOS came up with that ended up in Red Hat, now part of IBM, with a chain of acquisitions. They all look a little bit similar.

And we're no different. We needed to be able to run Cassandra at scale. We needed to be able to run it elastically rather than provisioning to peak. That immediately demands-- what are you going to use for your scale-out scheduling environment? How are you going to discover resources? How are you going to do networking? That pulls us into Kubernetes, and then you have to create an operator so that you have enough knowledge about the system that it can be automatically scaled.

Going back to that AI term we talked about before-- auto. Automation is necessary. You have to put some knowledge into that system so that it can work properly.

ADAM GLICK: Cassandra, like most databases, is configured with command-line tools. You mentioned your new management sidecar, which can be controlled with an HTTP API. Why not build that directly into Cassandra?

SAM RAMJI: The intent would be for this to be able to be pulled into the Cassandra project if that's what the community wants. So it's important to look at Apache Cassandra as being a mission-critical, global-scale, fault-tolerant, highly available database that thousands of companies rely on to handle enormous scale operations. Never more than now, with the current COVID-19 crisis. It's putting a lot of strain on digital infrastructures. The last thing we want to do, even if we had the power to do this, would be to jam something into the core environment.

In fact, what needs to happen is, as we bring out Apache Cassandra 4.0, the community of contributors themselves need to come together around 4.0, take all the things that the community's learned in the last few years, establish a super stable release that's extremely trustworthy, and then we can look at, how does the standard adapt and pull in elements like the management API sidecar? How does the standard pull in the Kubernetes operator?

Or does it not? Does the core Apache Cassandra repo decide, hey, this is a place that we're going to keep purely Apache Cassandra code-- the kernel, if you will-- and we want other things to live somewhere else? That's something that nobody knows yet.

We think that the Cloud Native Computing Foundation has done a really nice job of creating a big field to play in, making sure that we work well, both from a technology and a philosophy standpoint, with the CNCF. As Cassandra becomes cloud-native, our intent is, Cassandra and Kubernetes should be like peanut butter and chocolate. That gets a lot of people excited. Even if you only like candy, you can still get excited about peanut butter and chocolate.

But if you're a really deep nerd, the idea that Cassandra could be the ideal scale-out and scale-in database to participate inside a Kubernetes fabric-- that's pretty exciting.

ADAM GLICK: Cassandra has been a top-level Apache Software Foundation project for a decade at this point, and the CNCF is coming up on its fifth birthday in December. You were a member of the CNCF's governing board when you were at Google. Obviously, you're close to the Apache Software Foundation. How would you contrast the two foundations?

SAM RAMJI: The CNCF is interesting. I'm pretty close to the structure, also, because when I ran Cloud Foundry, Cloud Foundry was also homed within the Linux Foundation. So this metastructure of the Linux Foundation is super interesting.

For those who are interested in the gory details, I'll just give you a little bit of the financial technology terms. There's a difference between a 501(c)(3), which is a nonprofit public benefit corporation, which is a classification we use for charities. That's a classification that the Apache Software Foundation won many years ago and fights hard to maintain.

And there are very particular rules along the lines of no favor or prejudice to any particular corporation or market outcome. It has to be for the benefit of members and for others. There's a very particular governing principle that constrains what that company, because every foundation is a corporation, can do.

That's been amazingly successful, and you've seen this surge and this renaissance of data projects, with Apache, that have been widely adopted and have coupled with the Apache contributor license agreements and the Apache license. That's been phenomenally successful for Spark, for Samza, for Hadoop, for a range of projects, including Apache Cassandra.

The CNCF, and the Linux Foundation, and all of the similarly structured foundations are 501(c)(6). That's what's called a trade association. That's got broader powers to be able to pool money from different companies, whether it's from vendors or from users, and then deploy those to pay people salaries and to do marketing campaigns and to say, here's what this thing is. That made it fairly easy to take a stand at Cloud Foundry and say, this is what Cloud Foundry is. Now we'll run advertisements. We'll let everybody know what this is, and then we can take a different stance with vendors and say, hey, we're going to certify you as conforming or nonconforming. But it's connected to a commercial ecosystem.

The CNCF is similar, as a 501(c)(6). It focuses on this balance of growing commercial relevance. But what they wanted to do very carefully-- and I think Craig McLuckie and Joe Beda did an amazing job of thinking through the technical operating committee. One of the things I think the CNCF did really, really nicely is, they say, how can we get something that's a little bit more Apache-like in the purity of how we manage the technology, while also getting the benefits of being able to bring in a lot of money, host really big events, pay salaries, do a lot of marketing around the idea of cloud-native?

The technical operating committee, the TOC, is siloed from the board of directors in the CNCF. That's really important, because the TOC solely determines what projects enter incubation, what projects aren't permitted in the foundation, what projects mature, and how all those projects are actually mentored. That's all the TOC, and the TOC sends a representative to the board of directors to say, here is what we think as the TOC.

The board of directors' sole authority is to decide what to spend money on, and the only way to be on the board of directors is to represent a company that is funding the foundation. So there's a good balance of power there and separation of responsibilities, and that, I think, has shown to be a very good structure, as you can see from the CNCF. It's-- I don't know-- 550 members now. Back when we had conferences in those quaint old days, [LAUGHS] the conference size was tens of thousands and growing rapidly.

ADAM GLICK: You were CEO of the Cloud Foundry Foundation. What's your perspective on that group replatforming what they're doing on Kubernetes?

SAM RAMJI: That was a shift that we needed to make in 2016, and in June of 2016 or so, I hosted a board meeting. One of the few powers you have as the CEO of a nonprofit foundation is, you can call board meetings. And the really powerful conversation we had at the end of the meeting was, look at what is happening with Kubernetes.

Within Cloud Foundry, we have our own bespoke container scheduling system called Diego. You can't really run it externally. We've already seen a lot of pressure from people saying, hey, I'm using Docker. Can Cloud Foundry run my Docker? Kubernetes is going to change everything, so sooner rather than later, we need to rebase the container management system that Cloud Foundry uses on Kubernetes. And if there are gaps in the capabilities that Kubernetes has versus what Cloud Foundry Diego can do, we need to contribute those.

We had that conversation again, in a very strong way, in September of 2016. And when you look at the people who were assembled, the companies that were represented-- SAP, IBM, HP, EMC, VMware-- the panoply of thinking was, yeah, we need to go and do this.

I'm very, very happy to see that it's happened. And it's going to be an amazing moment for cloud-native applications for people to be able to take the architecture that was started with Heroku, to say-- you want to be able to cf push and trust that the environment's going to be taken care of. A great developer-down experience, interoperability through buildpacks, and then running on any Kubernetes infrastructure will be just awesome.

ADAM GLICK: Given that you're a chief strategy officer, you're probably the closest thing that we've gotten a chance to talk to for a gray-haired wizard with a crystal ball for the future. When you think about Cassandra and DataStax moving forward, what do you think are the next steps? What's on the roadmap that people should be looking forward to?

SAM RAMJI: I think the first, most important thing are the new releases-- 4.0. We'll see the beta in Q2 of this year. We're all driving for GA release by the end of the year, in the second half of 2020. With that, I think, comes this renaissance of what's called the CEP, or the Community Enhancement Proposal, a structured way to bring new capabilities to Cassandra. That's where I expect we'll see a lot of cloud-native contributions, a lot of management and interoperability contributions.

But I would break down the future of Cassandra into three key interfaces. There's a northbound interface, which is how developers and applications talk to Cassandra. There's a lot of interest in talking to Cassandra not via CQL, Cassandra Query Language, but via JSON, via gRPC, via GraphQL. That's an area that we see a lot of change and growth in.

The eastbound interface, if you will, the interoperability interface-- as you pointed out earlier, there's a lot of command-line interaction, and typing commands into a shell is a pretty clumsy way to manage a very large-scale environment. Even automating some of those recipes with Puppet and Chef is pretty tricky, because that's really sort of rote automation rather than intelligent automation.

What was great about the Kubernetes interface, the brilliance of the Kubernetes API, is that it's a declarative API where you can instruct the infrastructure below, please give me these kinds of abilities. It doesn't tell them how to do it. It just says, I want it to look like this. An operator API that can give Cassandra an intended state and then let the system do intelligent things to make that state real-- that's super important. You'll see, with the management API sidecar, other capabilities that move the operator's experience and elevate it from the CLI into defining intent. That's important.
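The intended-state idea can be sketched in a few lines of Python. This is a toy illustration only: `ClusterState` and `reconcile` are hypothetical names invented here, not part of the Kubernetes API, the DataStax operator, or the management API sidecar. The caller states what the cluster should look like, and the loop works out the steps.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, heavily simplified description of a cluster."""
    nodes: int
    version: str

def reconcile(desired: ClusterState, observed: ClusterState) -> List[str]:
    """Simulate convergence: change one thing per pass until states match."""
    steps = []
    while observed != desired:
        if observed.version != desired.version:
            steps.append(f"upgrade to {desired.version}")
            observed = replace(observed, version=desired.version)
        elif observed.nodes < desired.nodes:
            steps.append("add node")
            observed = replace(observed, nodes=observed.nodes + 1)
        else:
            # Versions match and the counts differ, so there are too many nodes.
            steps.append("decommission node")
            observed = replace(observed, nodes=observed.nodes - 1)
    return steps

# The caller declares the intent; it never says how to get there.
desired = ClusterState(nodes=5, version="4.0")
observed = ClusterState(nodes=3, version="3.11")
print(reconcile(desired, observed))
```

Running the same function when the observed state already matches the intent returns an empty list, which is the property that makes a declarative interface safe to re-apply, unlike a rote script that repeats its commands regardless.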

Finally, the storage engine interface. There are many, many opinions about how to write bits to storage, and there should be. Networking is changing. Speed of access requirements on read and write are changing. There are a lot of changes in the world. And when I look at the great lessons that we were taught by Vint Cerf and Bob Kahn, with the 40-plus-year-old protocol in TCP/IP, permissionless innovation requires good layering. Currently, the storage engine in Cassandra is not well layered, but it will be.

Instagram forked Cassandra so that they could run Cassandra against the RocksDB storage engine. That's super cool. Now, Amazon Web Services looked at that environment and said, hey, what if we fork that fork, and we write a backend that talks directly to the DynamoDB infrastructure as a storage engine? That is super interesting. There's a renaissance available to us by standardizing and making the storage engine pluggable so that it's easy for many different providers to be able to have storage engines that connect to the Cassandra data fabric. It's very consistent with other lessons that we've seen over and over again in the software industry.

Certainly, at Cloud Foundry, we saw this with BOSH, which is like Borg, but it's one more. I don't know if you get that. But B-O-S-H-- if you increment the R and the G, you get BOSH, which is the Cloud Foundry infrastructure. BOSH could support many different clouds, so you could run your Cloud Foundry applications on any cloud, unmodified, because all the modifications happened in the BOSH layer.

The BOSH layer had this thing called the CPI, the Cloud Provider Interface, which is where you write the secret codes that actually talk Amazon to Amazon, Google to Google, Azure to Azure, Ali Cloud to Ali Cloud. But that was forkable, not pluggable. As they started to make progress in the architecture, that became more mature, became a more participatory architecture where you could just plug things in, and that was really good for Cloud Foundry. I think we see very similar things for Cassandra in our ability to have Cassandra be an amazing data fabric with great developer interfaces that can be added to over time with an intelligent, easy-to-automate operator interface and a pluggable storage engine.
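The difference between "forkable" and "pluggable" can be shown with a small Python sketch. Everything in it is hypothetical: `StorageEngine`, `register`, and `Database` are names invented for illustration and bear no relation to Cassandra's actual (Java) storage code or to the BOSH CPI. The point is only that a stable interface plus a registry lets a new backend be added without touching the core.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type

class StorageEngine(ABC):
    """Hypothetical contract every storage backend must satisfy."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

# Registry of available engines, keyed by name.
ENGINES: Dict[str, Type[StorageEngine]] = {}

def register(name: str):
    """Class decorator that plugs an engine into the registry."""
    def decorator(cls: Type[StorageEngine]) -> Type[StorageEngine]:
        ENGINES[name] = cls
        return cls
    return decorator

@register("in-memory")
class InMemoryEngine(StorageEngine):
    """Toy backend; a real one might wrap an on-disk or remote store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

class Database:
    """Core code: depends only on the interface, never on a concrete engine."""

    def __init__(self, engine_name: str) -> None:
        self._engine = ENGINES[engine_name]()

    def write(self, key: str, value: bytes) -> None:
        self._engine.put(key, value)

    def read(self, key: str) -> Optional[bytes]:
        return self._engine.get(key)

db = Database("in-memory")
db.write("user:1", b"sam")
print(db.read("user:1"))
```

In a forked design, adding a second backend would mean editing `Database` itself; here it means writing one new class with a `@register(...)` line, which is the layering the passage above argues for.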

ADAM GLICK: Thank you for joining us, Sam.

SAM RAMJI: It's my pleasure. It's a privilege, Adam, and it's great to hang out with you.

ADAM GLICK: You can find Sam on Twitter @sramji. You can find the Cassandra project at cassandra.apache.org, and you can find DataStax at datastax.com.

[MUSIC PLAYING]

CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word, and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]
