#234 August 20, 2024

LitmusChaos, with Karthik Satchitanand

Hosts: Abdel Sghiouar, Kaslin Fields

In this episode we spoke to Karthik Satchitanand. Karthik is a principal software engineer at Harness and co-founder and maintainer of LitmusChaos, a CNCF incubated project. We talked about chaos engineering, the Litmus project, and more.

Do you have something cool to share? Some questions? Let us know:

News of the week

Links from the post-interview chat

ABDEL SGHIOUAR: Hi, and welcome to the "Kubernetes Podcast" from Google. I'm your host, Abdel Sghiouar.

KASLIN FIELDS: And I'm Kaslin Fields.

[MUSIC PLAYING]

ABDEL SGHIOUAR: In this episode, we spoke to Karthik Satchitanand. Karthik is a principal software engineer at Harness and co-founder and maintainer of LitmusChaos, a CNCF incubated project. We talked about chaos engineering, the Litmus project, and more.

KASLIN FIELDS: But first, let's get to the news.

[MUSIC PLAYING]

KASLIN FIELDS: Kubernetes 1.31, codename Elli, is released. The detailed blog with enhancements, graduations, deprecations, and removals can be found in the show notes. And don't forget to listen to our episode with the release lead, Angelos Kolaitis.

ABDEL SGHIOUAR: The schedule for KubeCon and CloudNativeCon North America 2024 has been announced. As a reminder, the event will take place in Salt Lake City, Utah, from November 12 to 15.

KASLIN FIELDS: Score has been accepted by the CNCF as a Sandbox project. Score is an open-source, platform-agnostic tool that allows you to write an application configuration using the Score spec in YAML format and then convert it into a deployable manifest using one of the supported Score implementations. Currently, the implementations include Docker Compose, Kubernetes, Helm, Cloud Run, and Humanitec. And that's the news.

[MUSIC PLAYING]

ABDEL SGHIOUAR: Today we're talking to Karthik. Karthik is a principal software engineer at Harness. He is the co-founder and maintainer of LitmusChaos, a CNCF incubated project. Karthik has worked closely with the cloud-native ecosystem over the last six years, first with OpenEBS, a Kubernetes storage solution, and now with LitmusChaos. He also co-founded something called ChaosNative, which was acquired by Harness. Welcome to the show, Karthik.

KARTHIK SATCHITANAND: Thanks, Abdel. It's great to be here in the Google Kubernetes Podcast. Really looking forward to this conversation.

ABDEL SGHIOUAR: Awesome. Thanks for being with us. So I guess we'll have to start where this conversation has to start, because we're going to be talking about LitmusChaos. And "chaos" is in the name, so I assume that means chaos engineering.

KARTHIK SATCHITANAND: Yes. This is about chaos engineering. I'm sure all of you know about chaos engineering already. It's been around for more than a decade and a half, I should say. It's really become popular, and it's become a little mainstream over the last few years. And there are a lot of projects in the CNCF landscape that are around chaos engineering.

So yes, LitmusChaos was one of the first chaos-engineering projects that was accepted into CNCF. We were one of the earliest projects to get into Sandbox and then into incubating status. And the community around the project has really grown over time-- a lot of great feedback, a lot of releases that were led by the community. So I think it's really brought about a change in how people look at chaos engineering, especially for cloud-native environments. So yeah, I think it's been a great journey so far.

ABDEL SGHIOUAR: Awesome. So I think we'll have to, like-- because our audience is quite diverse in terms of their experiences, and we receive a lot of feedback from people saying that sometimes we go a little bit too deep, and sometimes we are too high-level. So I want to start a little bit high-level. What is chaos engineering, for those who don't know what it is?

KARTHIK SATCHITANAND: OK, so chaos engineering is-- the standard textbook definition is that it's the process of testing a distributed computing system to ensure that it can withstand unexpected failures or disruptions. There is a principlesofchaos.org website that was put together by the initial pioneers of chaos-- Netflix, Amazon, et cetera-- which gives you more detail about the principles of chaos engineering, how it should be carried out, and what a typical chaos-engineering setup, or the practice of chaos engineering, looks like.

They talk about being able to inject different kinds of failures that actually simulate real-world events. There's something called Murphy's law, which you might all be aware of: if there is something that can fail, it will fail at some point. That's the gist of it. So chaos engineering is mainly about understanding your distributed system better, how it withstands different kinds of failures-- because failures are bound to happen in production-- and then also trying to create some kind of automation around it, because you would want to test your system continuously.

So chaos engineering is not a one-off event. Chaos engineering is carried out as experiments, but it's not like you perform something called a chaos experiment one day and then revisit it after weeks or months. It's something that you would need to do constantly.

So there is a need to simulate these failures in a very predictable and controlled way. We are talking about chaos engineering, and we are talking about unexpected disruptions. But it's really interesting. The experimentation itself is actually a very controlled event.

So you scope the blast radius of what you want to cause, and you try and simulate failures, and then you go in armed with something called the steady-state hypothesis. So you have an expectation or a notion of how your application should behave. Under ideal circumstances, what is its steady state, and how much of a deviation from that steady state do you expect during the failure? That is the hypothesis.

So you go armed with that hypothesis, and you inject a particular failure. You see how the system is behaving. You see whether that conforms to your expectation, or you learn something new. And sometimes you learn something about the system that needs some kind of fixing. You discover something suboptimal inside of your system.

So you uncovered and understood a weakness, and typically, you would go back and fix it. It could be a process fix, it could be an actual fix that you're making to your software, or it could be something in the way you deploy it. It might be some kind of a deployment control. It could be any of these things.

So you make that fix, and then you repeat the experiments. And you try and take your experiments from a very sanitized, controlled, lower environment to higher environments. You always have different kinds of environments that build up towards your production. You have various dev environments, you have QA environments, your performance-test environments, and then your staging environments, and eventually, your actual production.

So chaos engineering starts out, typically, in some of the lower environments, just for you to understand how the experiment itself is carried out. Then over time, you increase the stakes. You do it in an environment that really mimics what happens in production, and then eventually, you do the experiment in production itself.

So this is a very quick introduction to what chaos engineering is. It's all about experimenting. It's all about injecting some kind of failure that simulates a real-world event, and then trying to understand whether the system behaves as expected or not. That's the essence of chaos engineering.

And when it was initially conceived-- we talked about Principles of Chaos Engineering that was put together by Netflix and co. When they built it initially, they really advocated its use in production because that is where the real value of the experiment is, because that is where you have the system experiencing the dynamic workloads. That is where the system has been soaked.

You have a lot of patches going into production. You can see a lot of changes. And then you have a lot of real-world load coming in. There are a lot of maintenance and upgrade actions going on in production. So it's really a very complex and dense system, so that is where you get the most value for money when you run the experiment.

But that's not where most organizations start today, because an experiment that goes wrong and inadvertently causes downtime can have a lot of negative consequences. So the chaos is actually exercised in lower environments until you're comfortable on both aspects-- doing the experiment in a controlled way, as well as how tolerant your application is to this kind of a failure. You mature on both sides, and then eventually, you take it to production. That's how people are doing it today.

ABDEL SGHIOUAR: Got it. So you covered quite a lot of things, and I want to unpack one thing at a time. So one exercise that I have participated in in the past, that I was thinking about while you were talking, is-- at Google, we call this DiRT-- Disaster Recovery Testing. And I remember that when I was doing DiRT exercises in the past, there were two types of DiRT exercises. There were real DiRT exercises and simulated DiRT exercises-- real, where you are actually taking stuff down, and simulated, which is more like you simulate or pretend something went down just to test the process of recovery.

And then the other thing was what you talked about, which is the controlled disaster. So essentially, you're not just randomly shutting things down in production. You start with lower environments, and then you graduate toward the production environments. I think my question to you is, it feels to me that chaos engineering versus disaster-recovery testing is like, chaos engineering is something you will do continuously. You will continuously run experiments in an automated way.

KARTHIK SATCHITANAND: Yes.

ABDEL SGHIOUAR: Instead of you go, OK, once a week, we're going to do our disaster exercise and see how it goes, or once a year, sometimes, right?

KARTHIK SATCHITANAND: Yeah, yeah. I think that you're right. The philosophy has sort of evolved over time. When chaos engineering was initially introduced, it centered around the concept of game days. Game days are these events where all the stakeholders come together. You have the SREs that are managing the application infrastructure. You have the developers. You have the support folks. You even have somebody representing the customers.

And then you all take a decision to inject a specific kind of a failure at a very small level and then see how things are behaving. You have your APMs primed. You're verifying if you have the right alerts, you're looking at receiving the right notifications. And if at all things go bad, you know exactly what to revert or what to change to get back to normal.

So people used to do chaos-engineering experiments only as part of game days. But then with the introduction of cloud native, where the release times have become 10x or maybe multi-x faster, everything is independently deployable. You have multiple moving parts, and there are a lot of dependencies. You have an entire dependency tree. For example, if you look at Kubernetes, you have a very dense orchestration infrastructure that's sitting on some kind of hosts.

And on top of that, you have the actual Kubernetes microservices. You have your container runtimes and things like that. Then you have your actual application dependencies. You might have message queues and databases, et cetera.

You have your own application's middleware, and then you have its own back end, front end, et cetera. So it's really a pyramid, and there are so many moving components there, individual deployments that are getting upgraded all the time. And the notion of CI/CD today, the CI/CD ecosystem allows you to deploy every day, or maybe even quicker. When so many changes happen so rapidly, and when there are as many dependencies as there are today, you really need to be testing continuously. So chaos engineering went from being this specialized game-day model to becoming a continuous event.

So there is a concept called continuous resilience. So you basically test every time you deploy, or you use chaos-engineering experiments as a way to greenlight your deployments-- promotions to production. These use cases are what we see predominantly in the chaos-engineering world today.

And disaster-recovery tests, by definition, are still one-off events. Probably you do them once a quarter or once in a few months, where you could be actually taking down systems, like you said, or you could be simulating the loss of certain systems. You basically cut off access to a specific zone so that everything in that zone is not accessible. Or you actually physically shut down, let's say, cloud instances instead of just cutting off a specific zone. Both are valid disaster-recovery scenarios.

It is another matter that you could use the chaos-engineering tooling of today to carry out the disaster-recovery tests whenever you choose to do that. But, I think, by and large, the disaster-recovery tests still continue to be one-off events, like you said, whereas chaos engineering has moved into the realm of continuous testing.

ABDEL SGHIOUAR: Mm-hmm. Yeah, that's what I was thinking about when you were explaining chaos engineering. And as you said, yeah, when I used to be at least part of these DiRT exercises, it was more a one-off big event that-- a lot of people are aware it's happening. So it was a lot of fun. I had a lot of fun doing these kind of things. So then can you-- in this context of automation and continuous chaos testing, where does LitmusChaos fit, and what does it do?

KARTHIK SATCHITANAND: Yeah, I can give you a little bit of history on how LitmusChaos came into being.

ABDEL SGHIOUAR: Yeah, sure.

KARTHIK SATCHITANAND: In fact, it was this need for continuous resilience and testing that led us to build Litmus. So this was sometime around 2017, '18 time, and we were trying to operate a SaaS platform that was based on Kubernetes, which was using a lot of stateful components. And one of the things that we wanted to do was test the resilience of our Kubernetes-based SaaS. Every time we release something into our control plane, every time we release something into our SaaS microservices, we wanted to go and test it.

And what we had, initially, was an assortment of scripts to do different things, and there was no one standardized way of being able to express: this is my failure intent or chaos intent, this is what I would like to validate when I go ahead and inject my fault, and this is how I would like to see the results of my experiments.

There was no one standardized way of doing that, because different groups of developers and different teams were testing their services in different ways, using different tooling, some of which was actually already called chaos tooling by that time. We already had some open-source tools and also some commercial tools that were available at that point of time.

And we wanted to do all this standardization of how you want to define your chaos intent, how you want to do the hypothesis validation, how you want to see results, how you want to attach it to pipelines. And how would people write newer experiments? There should be one standardized API for writing newer experiments. There should be some kind of homogeneity.

And we wanted to do all this standardization in a cloud-native way, because we were primarily dealing with Kubernetes-- so something that, let's say, a Kubernetes DevOps or developer person would understand, something that is storable in a Git repository, something that can be reconciled via an operator, something that conforms to a resource definition in Kubernetes. So we had all these requirements coming together, which is why we built Litmus.

So Litmus is basically an end-to-end platform today, but when it began, it was just doing failure injection via Kubernetes custom resources. So you would basically define your chaos intent in a custom resource, and there would be an operator that would read it, inject the failure, and give you the results in a standard form. The result was also another custom resource which you could read off.
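As a rough illustration of that pattern, here is a minimal sketch of such a chaos-intent custom resource, a ChaosEngine, in Litmus. The namespace, application labels, and service account below are placeholders for this example, and the exact field names should be checked against the Litmus documentation for the version in use, since the API has evolved across releases.

```yaml
# Illustrative ChaosEngine: "delete pods of the frontend deployment for 30 seconds."
# The namespace, app labels, and service account here are assumptions for this sketch.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: frontend-chaos
  namespace: shop
spec:
  engineState: active               # set to 'stop' to halt the run
  chaosServiceAccount: litmus-admin # RBAC identity the experiment runs under
  appinfo:                          # scopes the blast radius to one application
    appns: shop
    applabel: app=frontend
    appkind: deployment
  experiments:
    - name: pod-delete              # fault definition pulled from a ChaosHub
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```

The operator reconciles this resource, injects the fault, and records the outcome in a ChaosResult custom resource that can be read back with kubectl.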

That's how we began, but over time, as we moved this into the community, as we open-sourced it and we learned more about what people want in this space, over a period of time, it grew into an end-to-end chaos platform that actually implements everything that is talked about in this Principles of Chaos. Principles of Chaos asks you to be able to inject different failures that correspond to different real-world events.

So we built up a huge library of different faults, and then we basically added something called probes. That is a way for you to validate your hypothesis. Probes are entities or means of validating certain application behavior.

You could be doing some API calls, you could be doing some metric parsing, or you could be doing some custom commands. You could be doing some Kubernetes operations. All this standard stuff that you would want to do, which actually tells you or gives you insights about how your application or infrastructure is behaving. So we built that framework.
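As a sketch of how that hypothesis validation is expressed, a probe is attached to the experiment inside the ChaosEngine. In the example below, an HTTP probe keeps polling a health endpoint while the fault runs; the probe name, URL, and timing values are assumptions, and field names and units have changed between Litmus releases, so treat this as illustrative only.

```yaml
# Illustrative probe list -- in Litmus this nests under
# .spec.experiments[].spec.probe of a ChaosEngine like the one above.
probe:
  - name: frontend-availability-check
    type: httpProbe
    mode: Continuous                   # keep checking while the chaos runs
    httpProbe/inputs:
      url: http://frontend.shop.svc.cluster.local:8080/healthz
      method:
        get:
          criteria: ==                 # pass only if the response code matches
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 2s
      retry: 1
```

If the probe's criteria are not met during the run, the experiment verdict is marked as failed, which is how the steady-state hypothesis gets evaluated automatically.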

And then we also added the ability to schedule experiments. We added the ability to trigger experiments based on certain events. We added the ability to control the blast radius. How do you isolate your fault? How do you ensure that your failure is getting injected only in a specific namespace, only against a specific application, only for a specific period of time? And how would you define which user has the ability to do what kinds of faults, in what environments, for how long?

So all this governance and control, we sort of brought that in, and then also made it easy for people to scale this entire chaos operation from a single portal. So you have the ability to add different target environments into a control plane, a centralized control plane, and you have the ability to orchestrate chaos against each of those targets the way you want it. So this kind of multi-tenancy was also built in.

And then slowly, we went from doing single faults or specific failures to more complex scenarios, where you can string faults together in whatever patterns you want, because when real-world failures happen, oftentimes they're the result of multiple components behaving in an unexpected manner. Yes, sometimes there are single points of failure, but many times, it's a combination of several events that leads to a big outage.

So let's say you want to--

ABDEL SGHIOUAR: Like a cascading failure, basically.

KARTHIK SATCHITANAND: Exactly. So how do you simulate that kind of cascading event? So we brought in the concept of workflows, and then we also-- workflows where you can string together different faults at different times-- and then also the ability to reuse experiments across teams. So we made the experiment as, essentially, a resource. It's a YAML file, so it's a reusable entity.

So you create templates out of it, store it in a Git repository. Anyone who has a requirement to create that kind of a scenario can just pick it, tune it with their application instance details, and then run it, so all this ability to reuse these experiments.

So this entire framework was built over a period of two to three years. And that's how Litmus became really popular in the community. We have users that are using it even for non-Kubernetes use cases, as well. Though it was initially built in a very Kubernetes-centric way, we also gave enough flexibility within the platform to be able to use Kubernetes as a base, as an execution plane, while you're doing chaos against entities that are residing outside of Kubernetes, as well.

For example, you would like to take down something in AWS or GCP. You could still do that. For example, you have a managed service that you're doing chaos against, or you have some kind of a vanilla compute instance somewhere that you want to bring down. You can do these things via the cloud provider-specific API calls, but they're all getting executed from inside of a Kubernetes cluster, which has some permissions to do the chaos against your cloud provider. You may be using Workload Identity in the case of Google Cloud, or IRSA in the case of AWS, those kinds of things.

So we built that system the way you could reuse Kubernetes as a way to orchestrate the chaos, a way to define the chaos, et cetera. You could still do the chaos inside of Kubernetes or outside of Kubernetes, and that sort of caught on. And that's where we are right now in the community.

So Litmus was very useful for us. We built it as something that would aid us to test our Kubernetes SaaS platform, and over time, it acquired a life of its own. And today, we have organizations across domains that are using it, people who are majorly users of the cloud or Kubernetes, but they are across domains.

For example, there are telco organizations. There are food-delivery organizations, folks in the medtech, software vendors. There are different kinds of users, including other open-source projects that are leveraging it today. So that's a quick snapshot of how Litmus began, and what it does, and where it is today.

ABDEL SGHIOUAR: One of my follow-up questions was going to be something you described, which is, you could target basically any environment, although Litmus itself was built to run on top of Kubernetes. But the target environment against which you run an experiment could be whatever, right?

KARTHIK SATCHITANAND: That's true. Yes, yes.

ABDEL SGHIOUAR: And those integrations-- the specifics of how you shut down a VM on a particular cloud provider-- are these built by the open-source community? Or is this something that people have to build themselves? How does this work?

KARTHIK SATCHITANAND: There are some experiments that are already available. The community has built them, and we've pushed them onto a public ChaosHub. But we've also laid out exactly how someone could do it if they want to build their own experiments for something that is not already put up on the public ChaosHub.

So there is a bootstrapper that we provide, and some templates. What this bootstrapper does is, basically, ask you to construct a very simple YAML file that consists of some metadata about your experiment, the target that you're trying to do chaos against, et cetera. And then it uses this information to generate the code-- that is, the scaffolding of the experiment.

So in Litmus, the experiment has a very specific structure. You have something called a pre-chaos check that you perform, like a gating condition to say, I can actually go ahead and do chaos. And then you have the actual fault injection.

Then you have the post-chaos phase, and then you have all these different probes that we have today-- HTTP probes, Prometheus probes, and command probes, and things like that. So they are all brought in. So you have basically a scaffolding that allows you to go ahead and add your business logic for doing the fault that you want on any infrastructure.

And we have some documentation that helps you to package this entire code into a Kubernetes job, and then eventually into a custom resource. And then once you have that, you're ready to orchestrate it from the Litmus control plane. It becomes a first-class citizen on the Litmus platform. So there is a specific approach to how you could construct your experiments. There is some aid that is always provided in terms of this bootstrapping utility, and the documentation, et cetera.
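For context, the packaged result is typically a ChaosExperiment custom resource that points at the image built from that scaffold. Very roughly, it looks like the sketch below; the fault name, container image, and permissions are placeholders, and the schema should be verified against the Litmus docs for the version in use.

```yaml
# Illustrative ChaosExperiment wrapping a custom fault built from the scaffold.
# The fault name, container image, and permissions are placeholders.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-fault
  labels:
    name: my-custom-fault
spec:
  definition:
    scope: Namespaced
    permissions:                        # RBAC the experiment runner job needs
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "delete"]
    image: registry.example.com/my-custom-fault:latest
    command: ["/bin/bash"]
    args: ["-c", "./experiments -name my-custom-fault"]
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
```

Once this is applied to the cluster, a ChaosEngine can reference the fault by name, and it behaves like any other experiment on the platform.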

But there's also a huge community. There is a Litmus Slack channel on the Kubernetes workspace, which is vibrant. There are a lot of conversations happening there. There are a lot of folks who have actually built out their own experiments, created their own private chaos hubs that they are using. And you could definitely interact with them and see what you can reuse, see what you can contribute upstream, what methodologies they follow to create a certain experiment, and then go from there.

ABDEL SGHIOUAR: Got it. This makes me think of something. This is a little bit off-- maybe not off-topic, but off what we would-- the questions that I had in mind. Would you potentially be able to use Litmus for user-validation testing?

I'm thinking about a scenario where you could inject some erroneous messages to simulate an actual user and see-- so for example, let's say you have a queue-based system, and you have microservices that subscribe to queues, receive data, and act on them. And you want to see if a microservice will behave properly if the data is malformed. I guess that's something you could do. You could just inject erroneous data-- data with errors-- into a queue, and see how a subsequent microservice would behave, right?

KARTHIK SATCHITANAND: Yes, that's definitely something that you can build out. Though the core-- when you look at failures, there are different kinds of failures. Failures could mean different things to different personas. It could mean different things to different target applications. The platform has been designed in a flexible way, so you could write something like how you just described it, inject erroneous messages, and see how the service handles it. That's something that you can definitely do.

Things that are on the periphery of chaos, for example, load testing-- you would basically want to put an insane amount of load on your service and see whether you have the right rate limiting in place. What happens to all the genuine requests that were going out? Let's say you're putting a lot of spurious load on your service, and you're rendering it incapable of handling the actual genuine requests. Is that happening? Do you have the right controls against that? That is also a chaos experiment. What you describe is also a chaos experiment.

And then you have the more traditional forms of failures-- you take down a node. You take down a pod. You cut off the network. You inject latencies. These are some of the things that probably are what you would find in chaos tools, when they say we have chaos experiments. But a lot of these other things are valid chaos experiments, and we have some users writing some very innovative experiments that are still being orchestrated by Litmus.

So that's the idea. You should have the flexibility to use one platform to inject different kinds of failures that you want to and track all your results and track all the resilience aspects in one place. So that's the idea.

ABDEL SGHIOUAR: And that's actually what I wanted to come to, because when we think about failure, it's not necessarily always something is down, or it's up, or something is not able to handle load. It could also be all sorts of random, weird stuff that could happen-- malformed data, maybe on purpose or not on purpose. It could be an SQL injection. It could be cross-site scripting-- all security-related stuff.

KARTHIK SATCHITANAND: Absolutely.

ABDEL SGHIOUAR: It could be, maybe, failed authentication. You explicitly try to authenticate with the wrong credentials multiple times to see if the system will behave properly, if you have any-- if you have any logic to say, if the same user tried to authenticate a couple of times with the same username, password, and it fails, you should block them or stuff like that.

KARTHIK SATCHITANAND: Absolutely. These are all valid scenarios. We recently had someone try and write experiments to test their security. Check if you are able to do a certain operation, and if it goes through, that means that you are not secure. Are you able to create privileged containers?

Or let's say you believe you have the right settings on your S3 bucket. If you're still able to access it, then that's a problem. The experiment can also, basically, incorporate negative-logic tests like this. So if you're able to do a certain thing which you're not expected to be able to do, then that's a failure. You could use the platform for doing things like that, as well.

ABDEL SGHIOUAR: Yeah. So who are the main audiences of LitmusChaos? Who are the main personas?

KARTHIK SATCHITANAND: Yeah. So when it began, and when the traditional approach to chaos engineering held sway-- early 2019, around that time, '18, '19-- it was mainly the SREs. So they were the folks who were trying to ascertain the resilience of the services that they had deployed and were maintaining.

But as the awareness around continuous resilience grew, we had more folks whom you would associate with DevOps functions, people who are writing pipelines, managing pipelines, et cetera. So they were interested in adding chaos as part of pipelines, so we started getting requests to create some kind of integrations with GitHub Actions. We started providing some remote templates for GitLab, integrations with Spinnaker, and things like that, where people started adding it into their pipelines.
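As one hedged sketch of what "chaos in a pipeline" can look like without any vendor-specific plugin: apply a ChaosEngine after the deploy step and fail the job unless the resulting ChaosResult verdict is Pass. The workflow name, file path, resource names, and the verdict field path below are assumptions and should be checked against the Litmus version in use.

```yaml
# Hypothetical GitHub Actions job that gates a pipeline on a Litmus chaos run.
# Assumes the runner already has kubectl access to the target cluster.
name: resilience-gate
on: [push]
jobs:
  chaos-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Inject the fault
        run: kubectl apply -f chaos/pod-delete-engine.yaml
      - name: Wait for the run and check the verdict
        run: |
          sleep 120   # crude wait; a real pipeline would poll for completion
          verdict=$(kubectl get chaosresult frontend-chaos-pod-delete -n shop \
            -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos verdict: $verdict"
          test "$verdict" = "Pass"
```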

And then, as things grew more interesting, we partnered with another open-source project called Okteto, which helps you do some testing even before you create your image and push it to your registry. This is specifically for Kubernetes, where they give you namespaces so you can basically sync code between your workspace and your pod on the cluster. And people were using chaos experiments in that kind of an environment, too. So this reaches the actual core developers, even before they've shipped or committed anything.

So we had persona groups evolving over time. We went from the SRE and cluster-admin personas to the folks doing continuous delivery to the actual developers, the innermost loop. So we have chaos engineering being looked at by all of them.

But I should say, predominantly, the users are still of the former type. It is the SREs or somebody with that kind of an allied function. It could be somebody that is looking at signing off on the QA, or people doing performance testing.

And this has sort of caught on. People have started doing chaos experiments as part of their performance-testing routines. So they have standard benchmarks, the pure benchmarks that they do, with different workload parameters, with different IO profiles and things like that. And then they have these mixed benchmarks, where they're trying to benchmark the system under a specific, degraded condition, the degradation having been caused by a chaos experiment. So that is something that we are really seeing evolve. So these are the different personas that are looking at chaos today.

ABDEL SGHIOUAR: I see. Cool. So at this stage, LitmusChaos is an incubated project, and you are on your way to graduation, right?

KARTHIK SATCHITANAND: Right, right.

ABDEL SGHIOUAR: How is that going?

KARTHIK SATCHITANAND: Yeah, it is-- graduation is a long process. We are very excited as the Litmus project team. Our community is excited. They were made aware of the fact that we've applied for graduation, and they showed a lot of love on our graduation PR. And they have been asking us also this question as to how it is going.

So I think the CNCF incubation and graduation process takes its own time. There is a specific due-diligence process. There are certain criteria that they look for. One of the things that, as a project team, we've been prepping ourselves on is the security audit. There are a lot of security features that we've built into Litmus along the way, but we wanted to actually get an audit done and get a lot of feedback on where we can improve our security posture. That's something that we worked on, and we've submitted all our improvements to the auditing authority, which is very shortly going to do the retest.

We've also gone ahead and added more over time, and this was not specifically something that we decided to do once we got to graduation. This has been an evolving process from the time that we moved from sandbox to incubation and then onward. After incubation, we've added more committers to the project, people who are invested in the project, some because they're using it in their organizations. And they depend upon Litmus very strongly for testing the resilience of their solutions, and they've made it part of their release processes.

And there are some other committers who are probably not actually using it within their organizations, but just because of their love for chaos, they've been doing a lot of it over a period of time, so they've become maintainers. And then we've gone ahead and grown our community base-- people who have adopted the project, both at the individual level and as organizations, who've actually publicly come out and said, we've adopted Litmus. Many times, organizations might be using it, but they might not be very open in saying so. But the number of organizations that are publicly stating that they've used LitmusChaos has grown a lot, especially in the end-user community. So we've been working on that.

We've been working on adding more mentorship programs as part of the Litmus project. So there is the LFX program in CNCF. There is Google Summer of Code. A lot of folks who participate in these programs have contributed to Litmus, so there's always some mentorship program or other going on in which LitmusChaos is participating.

And we're also trying to work with other projects in the CNCF community, where we are trying to get them to use Litmus. And we had a very fruitful relationship with the Telco Working Group, or the CNF Working Group, who have been actively using Litmus as part of their testbeds. And there have been other CNCF projects who have been using Litmus to test their resilience.

We're also integrating with other projects where we see a natural fit. For example, Backstage is one of the integrations we've done. We are also trying to integrate with other tooling which is sort of allied with chaos, which is on the periphery, like we said-- load testing, for example, k6 integrations.

All of these things have been going on, and they have been improving our presence in the community and our relevance, thereby indirectly helping our graduation efforts. Right now, we've basically created the graduation PR. We have some folks who are interested in sponsoring or carrying out the due diligence.

And that's where it is right now, while we prep on our side to help with whatever the process is, whatever information is needed by the TOC for evaluating the project. We're trying to get that in place, and we're looking forward to engaging with them. So we are on that journey. Hopefully, we make more progress more quickly and get there, but I'm sure that we will get there at some point in the near future.

ABDEL SGHIOUAR: Nice. And my last question to you was going to be, you have an upcoming conference called LitmusChaosCon, right?

KARTHIK SATCHITANAND: Yeah.

ABDEL SGHIOUAR: Can you tell us a little bit about that and where people can find information if they want to check it out?

KARTHIK SATCHITANAND: Definitely. LitmusChaosCon is something that we really wanted to conduct. We were enthused by the reaction and enthusiasm we saw for a Chaos Day event. It was a co-located event that we did with one of the previous KubeCons, where a lot of Litmus users-- individual users, organizations-- came and spoke about how they were using it, what they wanted to see in Litmus going ahead.

We had a lot of good traction during the KubeCon project meetings, and some of our other talks were very well received. We have had chaos talks accepted at the main KubeCon event over the last few years, and we saw all this as a positive indicator of people's need for, let's say, getting full confidence around chaos itself, around Litmus itself.

So we decided to do LitmusChaosCon. It is on September 12. It is a full-day event. And you can find details about it on the Events page, community.cncf.io/events. That's where you will find details about LitmusChaosCon. And we have a very interesting lineup of speakers, folks from different-- people who are Litmus users, and there are some general chaos practitioners in there, as well.

And there are users from different kinds of end-user organizations-- speakers from different end-user organizations, I should say, people who run poker, online poker; people who do food delivery; people who are maintaining streaming services, video-streaming services; people who are software vendors. Different kinds of users of Litmus are coming and speaking about their unique challenges, what they wanted to do, how they used Litmus to achieve that.

I think it's going to be very interesting for the community to learn from the experiences of these various speakers. We've got some amazing speakers here, and the agenda and other details are all available on community.cncf.io. So yeah, we're really looking forward-- as the project team, we're really looking forward to hearing from all the speakers during the conference.

ABDEL SGHIOUAR: We will make sure to add a link to your upcoming event in our show notes. Karthik, thank you very much for your time. I learned quite a lot from you. I had no idea what Litmus was, and now I have a basic idea of what it does. Thank you very much.

KARTHIK SATCHITANAND: Yeah, thank you so much, Abdel, for giving us this opportunity to talk about LitmusChaos. I really enjoyed this podcast. Great questions, and looking forward to interacting more with the audience.

ABDEL SGHIOUAR: Thanks for your time, and have a good one.

KARTHIK SATCHITANAND: Thank you.

[MUSIC PLAYING]

KASLIN FIELDS: Thank you very much, Abdel, for that interview. I'm really excited about this one, because I've always been interested in chaos engineering, because something with "chaos" in the name has to be fun, right, [LAUGHS] and also because I started out my career as a quality-assurance engineer writing tests in Perl for a storage area network, or SAN, system. So-- [LAUGHS]

ABDEL SGHIOUAR: Wait, wait, wait, wait, wait, wait, wait. This is a scoop. You've written Perl code?

KASLIN FIELDS: Yes.

ABDEL SGHIOUAR: Wow. All right. I have to bow toward you. Perl is such an interesting language.

KASLIN FIELDS: And that's why now I'm just a YAML engineer.

ABDEL SGHIOUAR: OK.

[LAUGHTER]

Got it. All this drama from Perl, I guess.

KASLIN FIELDS: Bash and YAML. That's what I do. [LAUGHS]

ABDEL SGHIOUAR: All right, all right. I didn't know that. OK, OK.

KASLIN FIELDS: Yeah. So I was excited to hear about chaos engineering. And I really liked how you all discussed-- I mean, when he was talking about the basics of what chaos engineering was, I was like, hey, testing.

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: I know this world. [LAUGHS]

ABDEL SGHIOUAR: Yeah, it's kind of-- I mean, I think it's all the same words to say the same thing. I have to admit that I am not hearing that much chaos engineering recently, or up to when we decided to do this episode, I haven't been hearing it that much. I think maybe just we call it something else. But as you said, the principles are the same. You just want to make sure that your system works, right?

KASLIN FIELDS: In my early days as a quality-assurance engineer, I remember I was always coming up with ideas about what testing needed to be and different ways that you could do testing, like the concept of writing tests before you create the system versus writing them after you have the system and things like that. And so I feel like this took me back to those days and reminded me of how much fun it can be to think about ways that you can break a system.

ABDEL SGHIOUAR: Yeah, yeah. So the only experience I have, personally, with chaos engineering, generally speaking, is using Chaos Monkey, which is a very popular tool. So I used that in the past, at a much smaller scale and for VMs, which is the same concept. It's just like orchestrating your tests, if you want to call them that. I call it orchestrating chaos through a tool.

KASLIN FIELDS: [LAUGHS] I've never used Chaos Monkey, but I've definitely heard of it. You also talked in the interview, though, about DiRT, which is an acronym that I have seen at Google, though I have never been involved with it. And it sounded like that was quite related, as well.

ABDEL SGHIOUAR: Yeah. So that's also something I did in my time at Google when I was working in a data center. I can't say that much detail about it for obvious reasons, but I think the concept of DiRT is public. How we conduct it is not. So DiRT stands for Disaster Recovery Testing, which is, essentially, the same idea. You basically come up with a scenario. Sometimes it's a hypothetical scenario. Sometimes it's an actual scenario. And you simulate, or you actually take stuff down, and you see how other things behave.

KASLIN FIELDS: Which I think is a really fun approach to testing that is very chaotic.

ABDEL SGHIOUAR: Pretty much. The only thing I would say-- and this is, I think, public information-- is that DiRT, or, OK, a disaster-recovery-testing exercise, doesn't necessarily always have to be about IT systems. It can actually be about physical stuff.

KASLIN FIELDS: Yeah. And I feel like this is a good point to throw in the constant reminder of, test your recovery plans.

ABDEL SGHIOUAR: Yes. Please test your backups.

KASLIN FIELDS: Having a recovery plan is not the same thing as having a recovery plan that works.

ABDEL SGHIOUAR: Yes. Test that backup you took of the database last month to make sure it works.

KASLIN FIELDS: Yeah. Kind of a random aside on this-- [LAUGHS] since I worked for a storage company-- I worked for NetApp in the past. That is public information. [LAUGHS] And so I worked on testing their SAN systems, and I also worked in vault at one point. And so a lot of my work at that time was about bad things that can happen to your storage. And of course, we work in tech, and so "Silicon Valley," the TV show, comes up periodically.

ABDEL SGHIOUAR: Oh, yeah.

KASLIN FIELDS: And it's a very painful show to watch. [LAUGHS] It's fantastic but very painful. And I had to stop watching it, the point where they had a catastrophic data loss. [LAUGHS]

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: And I was like, no, this is what I do for work, and I can't handle this.

[LAUGHTER]

ABDEL SGHIOUAR: Reminding you too much of work?

KASLIN FIELDS: Uh-huh. Yep. So chaos engineering.

ABDEL SGHIOUAR: I watched "Silicon Valley." I don't think that it's that disconnected from reality. And--

KASLIN FIELDS: Unfortunately, yes. [LAUGHS]

ABDEL SGHIOUAR: The only thing we can-- well, one of the things we can say is just remember what happened a few weeks ago with airlines, and we just stop there.

KASLIN FIELDS: Mm. Whole thing. That's some chaos.

ABDEL SGHIOUAR: Without getting too much into details.

[LAUGHTER]

So I think testing was at the core of that fiasco.

KASLIN FIELDS: True. Good point. Wouldn't that be an interesting interview to do?

ABDEL SGHIOUAR: Yeah. So there was actually a YouTube show-- I'll try to find it to include in the show notes. There's actually a YouTube-- not show. It's like, I think, an interview, where-- and this is related to what I talked about earlier, when I talked about the disaster-recovery exercise being physical exercises.

Some physical exercises even sometimes involve trying to physically break into places-- physical intrusion-- as part of testing, where you're not testing the IT system. You're testing whether your security systems, as in physical security systems, are actually up to standard. And those are really fun exercises to do, actually.

KASLIN FIELDS: Yeah. And so for chaos engineering, I think this concept of disaster-recovery testing is one thing that can fit inside of the box of chaos engineering. It seems like chaos engineering is a very big umbrella.

ABDEL SGHIOUAR: Yes, yes.

KASLIN FIELDS: Introducing chaos into your systems through testing. And that's what tests are meant to do, so I feel like most forms of testing could, arguably, fit into the world of chaos, as long as it's intentionally doing something that will probably break instead of a test that's intentionally doing something that is supposed to be a good path. [LAUGHS] I guess any of the tests that are doing a bad thing could be chaos engineering.

ABDEL SGHIOUAR: Yeah, but I think it's interesting, the term. I mean, "chaos" is probably a scary word, but I think in the context of what we discussed with Karthik, it's probably-- we could say that it's controlled chaos, because you know what you're doing.

KASLIN FIELDS: I did like the use of that term, yeah.

ABDEL SGHIOUAR: Yeah. So you're trying to come up with a realistic scenario, but you execute it in a controlled environment, and you also collect metrics and logs. And you see how your system behaves in general.

KASLIN FIELDS: A realistic failure scenario--

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: --that you implement in a controlled way.

ABDEL SGHIOUAR: Exactly. I think the controlled way is a key here, because I mean, technically, anybody can just walk into a data center and start pulling cables out. That would be technically chaos. [LAUGHS] I don't know how many people would want to do that. So I think in this context, it really means you do it in a controlled environment. So you kind of-- and I like that Karthik talked about the fact that you-- the recommendation is to always do it in the lower environments, and then bring it up to the higher environments, as you feel comfortable with how you execute your tests.

KASLIN FIELDS: Excellent. So let's talk a little bit, then, about LitmusChaos. It's a CNCF project, open-source incubating project, and it's meant to help folks do chaos engineering, right?

ABDEL SGHIOUAR: Pretty much, yeah. It's an orchestrator for chaos scenarios, if you want to call them that. I think they call them recipes, if I'm not mistaken.

KASLIN FIELDS: Oh, yes.

ABDEL SGHIOUAR: And basically--

KASLIN FIELDS: Oh, experiments?

ABDEL SGHIOUAR: Sorry, experiments, yes.

KASLIN FIELDS: Experiments.

ABDEL SGHIOUAR: Yes. And so--

KASLIN FIELDS: There's a special word.

ABDEL SGHIOUAR: Yes. And the experiment is--

KASLIN FIELDS: Experiments. We're scientists.

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: [LAUGHS] Mad scientists.

ABDEL SGHIOUAR: So they have a framework for building experiments, and then you can build your own experiments, or you can use community experiments. But essentially, you just have an orchestrator which runs on top of Kubernetes, and then you give it an experiment, and the experiment would be doing a set of actions and monitoring-- collecting information about how your system behaves. And yeah, that's essentially, in a TL;DR, what LitmusChaos is, really.

KASLIN FIELDS: I do really like the term "experiments" for this. It makes sense, in a scientific concept of you're trying to develop a plan, your hypothesis, and test it, essentially. So it kind of makes sense. There's that connection there. But also, you're being a mad scientist, introducing chaos into the system. So I like that a lot.

ABDEL SGHIOUAR: Yeah. Actually, I remember when I was preparing for the episode, I did some research. I went to the website. They have a hub, which is sort of like a marketplace for these experiments, so already pre-made experiments that you can just reuse. And I was looking at them, and at some point, I was trying to figure out, are these actually realistic?

So one example I can give that I looked at in the hub was a cloud-provider experiment. We don't mention-- it doesn't matter. The name is not important. But essentially, what you do is you have a load balancer, and then you have a bunch of virtual machines attached to it. And what you do is you detach those virtual machines from the load balancer, right? So that's the experiment.

And I was looking, thinking about it, like, is this actually realistic? But then I was like, yes, it is, because-- and you tell me what do you think about this. Imagine you are doing this as part of infrastructure as code. You're running some Terraform code, and then your run breaks in the middle. So the load balancer is created, but your VMs are not attached, or the other way around. That could happen, and that's actually a realistic scenario.

Or somebody runs the wrong command and ends up replicating the same behavior. So I find it quite interesting that these scenarios maybe don't look realistic, but when you really think about them, you're like, hmm, yeah, this could actually happen.

KASLIN FIELDS: Yeah, certainly. I mean, VMs disconnecting sounds like something that's going to inevitably happen. You all talked about Murphy's law, as well, at the beginning of this.

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: That just seems like Murphy's law waiting to happen.

ABDEL SGHIOUAR: Exactly. Another thing that I was also thinking about-- another example, would be you accidentally added the wrong firewall rule, you know?

KASLIN FIELDS: Oh, yeah. Mm-hmm. That is going to happen for sure.

ABDEL SGHIOUAR: Yeah. Or you remove the wrong firewall rule. It could be either way, right?

KASLIN FIELDS: Misimplement a firewall rule.

ABDEL SGHIOUAR: Exactly.

KASLIN FIELDS: Accidentally type in it.

ABDEL SGHIOUAR: Yeah. Using the wrong tag, the wrong label, selecting the wrong virtual machines, or writing a firewall rule that maybe overlaps or disables another one because priority in firewall rules, that's how usually firewall rules in cloud work. They have priorities. So yeah, these are actually scenarios that could happen.

KASLIN FIELDS: That's interesting. So there's definitely this level of chaos testing that's pretty obvious, I feel like. At a cloud level, you have all of this infrastructure, basic things like turn it off and on again.

ABDEL SGHIOUAR: Yes.

KASLIN FIELDS: Misconfigure something. There's a whole world of generic chaos engineering tests, or experiments, I suppose, that could work on all sorts of systems and provide valuable testing. But Litmus, specifically, is a Kubernetes tool, right? Or does it work at the cloud level, as well?

ABDEL SGHIOUAR: Yeah, it runs on top of Kubernetes. The orchestrator itself is Kubernetes, yes.

KASLIN FIELDS: OK, yeah. So CRD, basically, I imagine.

ABDEL SGHIOUAR: Yeah, a bunch of CRDs and operators. Yeah, pretty much.

KASLIN FIELDS: Ah, several, yeah. That makes sense. So did you all go over any examples of Kubernetes use cases? Because I could imagine pod disconnects, that's something that we test ourselves all the time. When I set up my blog, actually, that was the first thing that I tested. I deleted the container, and then recreated it to see if all of the data was still there, testing my volumes for my Kubernetes workloads.

ABDEL SGHIOUAR: I don't remember, to be honest, if we discussed this, but I can see how those kind of scenarios could be valid. I mean, I don't know, random example from the top of my head-- if you don't have autorepair enabled, and you just take stuff down, just very simple example, right? Take down the kubelet. Just shut it down.

KASLIN FIELDS: Yeah.

ABDEL SGHIOUAR: Or block a port. Again, back to the firewall example-- write a firewall rule that blocks certain ports that Kubernetes needs for communication between the nodes.

KASLIN FIELDS: Ugh, the networking.

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: The networking tests could get ugly.

ABDEL SGHIOUAR: Yes, yes. Those would be, actually, very fun to execute, to be honest with you.

KASLIN FIELDS: Mm-hmm.

ABDEL SGHIOUAR: Yeah, that's what can happen. You know what it reminds me of? It reminds me of the episode we had with David, "Klustered."

KASLIN FIELDS: Mm-hmm.

ABDEL SGHIOUAR: Remember, we had David on one of our episodes. And that was, essentially, what "Klustered" was about, the show. It was like, hey, break this thing, and let me try to figure out how to fix it.

KASLIN FIELDS: That's very interesting. I wonder if, to set up a scenario like "Klustered--" so in "Klustered," of course, David Flanagan, or RawKode, he sets up a cluster with bad things happening on it, and you have to figure out what's going wrong and fix it. I wonder if you could use existing standard chaos-engineering experiments, and just run those on your cluster to get it into a bad state.

ABDEL SGHIOUAR: I guess you could. I'm looking at--

KASLIN FIELDS: Chaotic use of chaos engineering.

ABDEL SGHIOUAR: Yeah. I'm actually looking at the ChaosHub, and there are some experiments for Kubernetes on that. So--

KASLIN FIELDS: I would imagine they normally involve recovery.

ABDEL SGHIOUAR: So I'm looking at a list here of the already existing Kubernetes experiments. So you have container kill, disk fill. Disk fill is, essentially, you fill up ephemeral storage. You go on the node, and you just fill up the disk to see how the node will behave.

KASLIN FIELDS: I don't need automation to do that. I have plenty of it.

ABDEL SGHIOUAR: Yeah, that's actually-- I mean, dd a bunch of 1-gigabyte files, I guess, right?

KASLIN FIELDS: Yeah. One of my old prototyping projects that I did at one point, we were constantly running out of memory on the nodes.

ABDEL SGHIOUAR: There you go. There are actually some experiments for hogging up memory and CPU resources.

KASLIN FIELDS: Mm-hmm. Yep.

ABDEL SGHIOUAR: There's returning certain HTTP codes from the pod. So you receive an HTTP request, but instead of returning a 200, you return a different code-- draining a node, causing some IO stress. I'm just reading off the list. So yeah, there's quite a lot of basically what you're describing-- just kill a random pod and see what will happen, right?

KASLIN FIELDS: So for our next episode-- between now and our next episode, we need to set up a cluster with some chaos-engineering experiments run on it, leave it in a bad state, and then give them to each other to try to fix.

ABDEL SGHIOUAR: Yes, it would be nice to have-- it would be fun to have David and Karthik on the same show, actually. I don't know if they have ever been together, but we should probably-- I should probably DM. I don't know--

KASLIN FIELDS: Chaotic.

ABDEL SGHIOUAR: Yeah. I don't know if David has decided to bring the show back, because I think he stopped at some point.

KASLIN FIELDS: Oh, yeah. Because he's been busy.

ABDEL SGHIOUAR: Yes, yes.

KASLIN FIELDS: But definitely check out RawKode and his community. He has a Discord, makes all sorts of really good content.

ABDEL SGHIOUAR: Yeah.

KASLIN FIELDS: If you want to see more of people trying to fix broken clusters, it's a good place to go.

ABDEL SGHIOUAR: It's a lot of fun. It was a lot of fun. Yeah, no, it was a lot of-- it was really cool to discuss this. It's definitely not something that we get a chance to talk about very often, especially if you're a developer. But, I think, for people who are test engineers, like yourself, like you used to be, it's pretty fun. Yeah.

KASLIN FIELDS: So let's wrap it up, then. Thank you very much, Abdel, for the interview, and I'm excited that we got to learn about chaos engineering and explore testing.

ABDEL SGHIOUAR: Yeah, I hope this episode was not chaotic.

KASLIN FIELDS: [LAUGHS]

ABDEL SGHIOUAR: All right. Thank you.

KASLIN FIELDS: Thank you, and we'll see you next time.

[MUSIC PLAYING]

ABDEL SGHIOUAR: That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

You can also check out the website at kubernetespodcast.com, where you will find transcripts, and show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thank you for listening, and we'll see you next time.

[MUSIC PLAYING]