Kubernetes Podcast from Google: Episode 82 - Chaos Engineering, with Ana Margarita Medina

#82 December 3, 2019

Chaos Engineering, with Ana Margarita Medina

Hosts: Craig Box, Adam Glick

Chaos Engineering is the discipline of experimenting in identifying potential areas of failure before they express themselves in outages. Ana Margarita Medina is a Chaos Engineer and Developer Advocate at Gremlin, a chaos-as-a-service vendor that recently added Kubernetes support. She talks to Adam and Craig about the discipline, and her journey to it.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

AWS announcements:
Eirini 1.0 is here
Security considerations for GKE by Maya Kaczorowski
- Episode 8, with Maya Kaczorowski
Managing a multi-site Cassandra cluster on multiple Kubernetes with CassKop / MultiCassKop by Seb Allamand
Run Ansible Tower or AWX in Kubernetes or OpenShift with the Tower Operator by Jeff Geerling
Everything I know about Kubernetes I learned from a cluster of Raspberry Pis by Jeff Geerling
Prometheus OpenMetrics Integration
Develop a Kubernetes controller in Java by Min Kim and Tony Ado
Running Kubernetes locally on Linux with Microk8s by Ihor Dvoretskyi and Carmine Rimi
- Episode 21, with Ihor Dvoretskyi
- Episode 60, with Mark Shuttleworth
Linux Foundation Cyber Monday sale
Barron's says Kubernetes is the future of computing by Tae Kim

Links from the interview

Transcript


CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: We just had Black Friday and Cyber Monday in the US, two days that are of interest to the Kubernetes community, largely due to the amount of online shopping activity, which now is pushed through Kubernetes. I saw some great numbers from Shopify on how much they were pushing through their Kubernetes clusters. Did you buy anything?

ADAM GLICK: I did not. I spent the time with my family, though over the course of this general time, I have taken a look at some of the app sales and picked up some new games, always wanting to try something new and interesting. So looking forward to playing some of those on my flight back from spending time with the family. How have your holiday travels been?

CRAIG BOX: I have a slightly different Black Friday story, in that we were up at the Grand Canyon for the week last week. And there was a giant storm that went through northern Arizona. And basically, the power went out, and it was just dark for a day, cold and dark. No electricity, no generators, no heat. But it was fun. We all sort of huddled together under electric candlelight in the restaurant for a cold sandwich dinner.

ADAM GLICK: Nice. Were there any howling coyotes, wolves in the background? Was it like the start of one of these movies?

CRAIG BOX: I think it was probably a little bit too cold for the wildlife, even just the regular December weather. But overall, beautiful place. I'm very glad to have had a chance to check it out.

ADAM GLICK: Yeah. It's lovely. Anyone who gets a chance certainly should go. Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: Ahead of its annual re:Invent conference, AWS has announced managed node groups for its Elastic Kubernetes Service. Previously, EKS was a bring-your-own-node affair. This new feature adds APIs where Amazon will manage an auto-scaling group and associated EC2 instances running Amazon Linux on your behalf. The new groups can be created with the eksctl tool from Weaveworks.

Amazon also added support for AWS EventBridge to their Container Registry so you can trigger actions when an image is created or deleted. And it added Kubernetes Operators for SageMaker so you can train ML models with that service and deploy them to Kubernetes.

CRAIG BOX: Eirini, the project to use Kubernetes as the runtime for Cloud Foundry, has released version 1.0. While the primary goal was replacing the back end for existing Cloud Foundry users, a welcome side effect is that Kubernetes users can now gain the famous cf push experience. This project has been developed by the Cloud Foundry community with special mention to IBM, Pivotal, SAP, SUSE, and Google.

ADAM GLICK: Maya Kaczorowski, our guest on episode 8, has posted a blog on security considerations as you launch your Kubernetes environment in GKE. She discusses critical questions to ask yourself and then goes through how to structure your environment, set up permissions, and handle deployments. The article ends with recommendations for features that you should turn on from day one.

CRAIG BOX: Yo, dawg. I heard you liked operators operating your operators. The CassKop, with a K, operator can be used to install Apache Cassandra in a single cluster. Seb Allamand from French telco Orange posted about the release of MultiCassKop. This new meta operator makes it easy to deploy multiple Cassandra instances across different regions in Kubernetes clusters, but make sure they are all part of the same Cassandra ring. The post contains plenty of details as well as examples and architectural diagrams to help you understand the design and get you up and running.

ADAM GLICK: If you use Ansible Tower, the SAS back end for Red Hat's Configuration Management System, you can now deploy that with an operator too. Jeff Geerling from Red Hat has posted a blog where he goes through setting up AWX, the upstream for Ansible Tower with a new operator that he wrote. Geerling cautions that this is an early alpha, not for production use, but the operator is much more Kubernetes native than the existing playbook-based deployment. His roadmap includes backups and upgrades in real time.

CRAIG BOX: If you were at KubeCon recently, you may have noticed that a number of booths had a small stack of Raspberry Pi devices. They're all connected together as a very small Kubernetes cluster. Although it might not be production ready either, it sure does look cool. And the aforementioned Jeff Geerling has shared his experience and plans for how to build a four-Pi cluster. The blog links to the Pi Dramble project, which focuses on these ultra-small clusters, as well as links to setup scripts and instructions.

ADAM GLICK: New Relic has announced the addition of Prometheus OpenMetrics integration. The new feature means that those using New Relic for their data storage, visualization, and notifications can now pull in their Prometheus metrics and integrate them with the rest of the metrics they store.

CRAIG BOX: Min Kim and Tony Ado of Ant Financial have been hard at work on improving the Kubernetes Java libraries to the point where you can now write a controller in Java. The new framework learned a lot from the design of the Go Controller Runtime. And they, of course, throw shade at Go for not having generics. You can read all about it on a post to the Kubernetes blog this week.

ADAM GLICK: Also from the Kubernetes blog, part two in a series of posts about running Kubernetes locally on Linux. MicroK8s, with one K and one number eight, is one such option. And Ihor Dvoretskyi from the CNCF and Carmine Rimi explore it this week. You can learn more about MicroK8s in our interview with Mark Shuttleworth in episode 60.

CRAIG BOX: Are you interested in becoming certified for Kubernetes, and do you listen to the show the week it is released? Right now, the CNCF is offering their annual Cyber Monday sale with discounts on training and certification.

If you are interested in becoming a certified Kubernetes administrator or developer, you can buy the official Linux Foundation Training and the pass to take the exam for only $189 US. The Cyber Monday discounts are available through December the 10th.

ADAM GLICK: Finally, you know your project has made it when the financial press are telling people to pay attention. "Barron's," a sister publication to "The Wall Street Journal," founded almost 100 years ago, covers US financial information, market developments, and statistics.

Author Tae Kim says Kubernetes is the future of computing, quoting Google and other industry executives and Gartner Reports. There's nothing you won't know as a listener of this podcast. But it's something you could present to your CEO if you need to justify an investment in Kubernetes.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Ana Margarita Medina is a chaos engineer at Gremlin. She was previously a software engineer and site reliability engineer at Uber after eight years of freelance software engineering and working at various places from small startups to federal credit unions. Welcome to the show, Ana.

ANA MARGARITA MEDINA: Thanks for having me.

CRAIG BOX: Chaos engineering is a new profession. So for the benefit of our audience, what is a chaos engineer?

ANA MARGARITA MEDINA: A chaos engineer is someone that tries to build more resilient systems. They're ideally helping a company avoid downtime. So they're just thinking of new ways that they can look at past learnings that they've had in outages, incidents, applications just not working as expected, customer pain points, and just trying to find ways to make the systems and applications just be constantly up and running and resilient and providing a good user experience.

ADAM GLICK: Traditionally, some people might have thought of that as the job of operations. Is this something separate from operations?

ANA MARGARITA MEDINA: Not necessarily. We actually see the folks that work in operations are the ones that are doing chaos engineering. But we see that operations have always been like the folks that sometimes get thrown just being on call and keeping systems running.

But they don't get to strategize and give their input on, hey, this is how we make things better. And this is kind of what we're seeing with the DevOps movement, that we're also now putting our devs on call and actually tell them, hey, you didn't make your system really reliable so now you're going to feel the pain of it. If it goes down at 3:00 in the morning, you're the one that's going to get paged.

And with chaos engineering, we also kind of like help foster that culture in a way. We then now tell folks to run chaos engineer experiments in development to also make sure that they're building resilient systems for their internal customers. But the idea is that you want to run these experiments in development and then later grow the chaos maturity to start running those chaos engineer experiments in production.

CRAIG BOX: So if the idea of site reliability engineering is to bring back operations earlier into development, would you say that chaos engineering could be thought of as the same for testing?

ANA MARGARITA MEDINA: Definitely. We see that chaos engineering now gets added all the way from doing it in staging to development to production. And we also get to see that folks are starting to think of ways to automate their chaos engineer experiments. So we also have the conversation that you can roll in chaos engineer experiments in your CI/CD pipelines.

ADAM GLICK: Where did chaos engineering come from?

ANA MARGARITA MEDINA: Chaos engineering has a little bit of a history. We like talking about that we go back to Amazon, where Jesse Robbins used to do game days. And some of the ways that they actually built resiliency around Amazon's retail was that they would actually manually go and unplug a data center. And it was like, how is this going to react?

But then a few years later, we see this practice continue evolving, and the actual term of chaos engineering gets coined by Netflix. Netflix was known as the company that was doing something radical and moving everything they owned to the cloud and hoping everything was going to go really well. But then that injected a lot more abstract and complex things.

So they decided to build some software that would actually turn off their instances on the cloud and hope that their applications were still running and users were having a good experience. And this later got open sourced, and it was known as Chaos Monkey.

So with Chaos Monkey, we started seeing folks actually understanding chaos engineering. And then Netflix actually came up with a few more tooling around there called the Simian Army, and we got to hear about other ways that they had been doing chaos engineering internally at the company. But they weren't open sourcing all those tools.

CRAIG BOX: Before cloud, we ran reliable infrastructure. We would take a server, and we would put multiple power supplies, multiple disks in to make sure that that thing in itself was always reliable. Cloud moves us to a world where we have to architect reliable infrastructure from these unreliable instances where the vendor says they could go away at some point in time, whether deliberately or accidentally. Was chaos engineering necessary before cloud?

ANA MARGARITA MEDINA: Oh, definitely. I mean, we go back to the idea that kind of also goes into with capacity planning, where if you're not necessarily knowing what's going to happen in real time in your production, you actually might just run out of capacity in computing and servers. And you won't be able to actually handle the load.

And we also have the part where chaos engineering helps you grow muscle memory. So even when we talk to folks that are not running on the cloud whatsoever, we see that some folks basically are just doing a lot of manual work when there's an outage going on.

But they have never done that work until they actually go on that incident. Maybe this runbook has not been updated. Maybe the system actually doesn't have a runbook. So we actually see that a lot of this is also just doing fire drills and building muscle memory within your engineers.

ADAM GLICK: Does this have any relation to forms of automated test creation, like model-based testing or fuzzing in the security realm?

ANA MARGARITA MEDINA: We do see that there is a lot of things that chaos engineering in the testing space have done very similarly. We have seen that a lot of folks, they think about testing as something that just stays in development. And they don't really think about what tests do I actually run in production?

So it's actually been pretty awesome to see the movement of like, hey, we actually need to test in production, because there's not going to ever be an environment that is production unless it's production. So they do have a little bit of those similarities.

But with chaos engineering, we kind of think a little bit more than just application or vulnerabilities that it can have. We actually think of real-world scenarios that are happening, whether it's making sure that you have enough capacity to handle a surge that goes on on your disk resource layer or your CPU to maybe even thinking about how do my applications handle latency or packet loss when we have some of our users coming from countries that their internet connections are not as reliable.

ADAM GLICK: This sounds like something that happens a lot more on the operations side as opposed to unit tests that happen with developers. This sounds like something that the production and operations teams do with live running services.

ANA MARGARITA MEDINA: Yeah, definitely. We see that a lot of folks when they're thinking about implementing chaos engineering or they're actually doing it, it's usually the site reliability engineers that are the ones that bring some tooling into it. They build something in-house. And operations, of course, is a big part of that.

But we also see that a lot of devs are starting to look into this space as, hey, I actually want to make sure that as I'm deploying my applications, and I think about my dependencies going down, I'm actually handling that failure quite well.

So we also see that folks on the front end UI spaces, they're actually thinking of how do I think about failure holistically, that let's say the entire application is built on React, and one of the components that is getting pulled in from the React, that entire microservice goes down, well, if you don't handle that failure quite well, your entire UI is going to go broken.

So I think we're getting to the point that chaos engineering is kind of growing a little bit more, and we start seeing more devs actually be like, hey, I don't need to just wait before my code goes into production or for the SRE team to just be running this on the actual servers. I can start doing this in my development cycle and actually implement it with something as like your CI/CD tools.

CRAIG BOX: Testing went from something that people would do manually, perhaps just trying things at random, then to a runbook, then to an automation, and then eventually to things like fuzzes where you would go through and just test things to see what happened basically. Is chaos engineering following a similar path or has it all been automated from the beginning?

ANA MARGARITA MEDINA: We see that a lot of folks have started from actually just shutting down an actual host, like unplugging it, to actually folks just running kubectl like delete pods and then doing it just like that. Then some folks have decided to build something internally to actually script that.

And of course, there's like vendors out there, such as the company that I work for, that have built a resiliency tool that allows you to do experiments with chaos engineering. But we do it in a way that you actually have to be a little bit thoughtful and planned in doing these experiments.

It's not about just walking into your company on Monday and be like, hey, we're doing chaos engineering and production. And we're going to bring down 50% of our data center and see what happens. It's like, don't run a chaos engineer experiment if you know it's actually going to break something.

So the flow that we like thinking about doing chaos engineering is that we follow the scientific method, where you look at your system, you look at your architecture, and you form a hypothesis of something that you actually wonder how is it going to handle failure? And after you form this hypothesis, you scope out an experiment.

You then also define the blast radius and the magnitude of this experiment. The blast radius ends up being the amount of hosts or containers that you want to target, and then on the magnitude, it's going to be the impact of such experiment.

So basically, don't run it on 50% of your containers if you don't know how it's acting on one or two of your containers, to the point of don't inject 2,500 milliseconds of latency if you don't know what 100 milliseconds of latency is going to do.
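The "start small" advice above can be sketched as a ramp plan that grows the blast radius first and the magnitude second. This is a minimal illustration only: the `ExperimentStep` structure, the `ramp_plan` function, and the doubling schedule are all assumptions made for the example, not part of Gremlin's product or any tool mentioned in the episode. Only the 50% ceiling and the 100 and 2,500 millisecond figures come from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentStep:
    """One stage of a latency experiment (hypothetical structure)."""
    targets: int      # blast radius: how many containers are affected
    latency_ms: int   # magnitude: how much latency is injected


def ramp_plan(total_containers, max_fraction=0.5,
              start_latency_ms=100, max_latency_ms=2500):
    """Build a plan that starts small and grows one dimension at a time."""
    max_targets = max(1, int(total_containers * max_fraction))
    plan = []

    # Grow the blast radius first, holding the magnitude at its smallest value.
    targets = 1
    while targets < max_targets:
        plan.append(ExperimentStep(targets, start_latency_ms))
        targets *= 2
    plan.append(ExperimentStep(max_targets, start_latency_ms))

    # Then grow the magnitude, doubling the latency up to the ceiling.
    if max_latency_ms > start_latency_ms:
        latency = start_latency_ms * 2
        while latency < max_latency_ms:
            plan.append(ExperimentStep(max_targets, latency))
            latency *= 2
        plan.append(ExperimentStep(max_targets, max_latency_ms))
    return plan


if __name__ == "__main__":
    for step in ramp_plan(total_containers=20):
        print(step)
```

In practice each step would only run after the previous one confirmed the hypothesis, with a human or an automated check deciding whether to continue or halt.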

CRAIG BOX: Google runs an annual test called DiRT, for Disaster Recovery Testing. And a lot has been talked about publicly -- I'm not sure how many of the exact tests that we run are public -- but I understand there are tests for things like what happens if this country is offline or the telephony provider that would page someone goes offline?

Then from there, we go to actual tests that are run in line, CI, as you mentioned. Do you think that there is a progression between those sort of one-off events through to getting a chaos experiment run with every piece of code that you deploy?

ANA MARGARITA MEDINA: Yeah. I think you definitely start off by doing one-off instances of testing, where you just had something happen in your development environment. And you actually want to make sure that you patch it up. So you go and you make that system a little bit more reliable, but we also see that sometimes there's just outages that continue happening that make you learn that one of your systems is a little bit more fragile to failure.

So you get to maybe put in some patches after this outage happens, but you never really go back to give your system those outage conditions unless it was to actually happen. So with chaos engineering, you're actually able to start doing that a little bit more.

CRAIG BOX: How would you recommend people map processes like a vendor is unavailable or a person that has the keys to something isn't online at the time of the failure? Things that can't as easily be tested automatically, how would you bring them into your chaos engineering plan?

ANA MARGARITA MEDINA: There's a lot of parts that we talk about, like the human side of chaos engineering, where part of it is that maybe you actually take away the manager, the lead tech that's on call for that month. And you actually see how the rest of your engineering team can actually handle not having that other person and like that person acts as their single point of failure.

But we also see that some of the other things that are not necessarily easy to test of like training folks to go on call and making sure that folks are not getting pager fatigue, these are things that kind of get to be tuned a little bit by following this practice.

ADAM GLICK: It's interesting. So you're not only just testing the software, but you're also testing the human systems and processes. What was the most interesting bug or incident that you discovered while doing chaos testing?

ANA MARGARITA MEDINA: One of the ones that was the most fun to find actually may have been one that I got to share on stage at AWS re:Invent where I got to build out a demo. And the idea was that I was just going to use some open source tooling and straight out of the box see how resilient it was.

So the open source tool that I chose is actually Redis. I deployed it the way that the tutorial was telling me to do it. It was like a guest book application. And in the back, I was going to have a primary Redis container that will hold all my information and make sure the replica container was there too.

And I decided to run a shut down chaos engineer experiment. What would happen if I just shut down my Redis primary container? And well, as I did that, right before it officially shut down, it decided to go empty. And in that moment, replica looks at primary and becomes empty too. Then as primary shuts down, replica gets promoted to primary. And we suffer a data loss.

So it was very much of a lot of open source tooling or just any tooling in general is not resilient to begin with. So we now get to talk about the idea that you get to take tooling, whether you built it internally or it's just an open source tool, run chaos engineer experiments before you actually bring it in to your environments.
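The failure sequence described above can be walked through with a toy model. To be clear, this is not Redis and does not use a Redis client: `ToyStore` and `run_shutdown_experiment` are hypothetical names, and the only behavior modeled is the assumption that a replica copies whatever dataset its primary currently holds, including an empty one.

```python
class ToyStore:
    """A toy key-value node. NOT Redis; it only mimics whole-dataset copying."""

    def __init__(self, name, role):
        self.name = name
        self.role = role      # "primary" or "replica"
        self.data = {}
        self.running = True

    def sync_from(self, primary):
        # The replica mirrors the primary's dataset wholesale, even if empty.
        if primary.running:
            self.data = dict(primary.data)


def run_shutdown_experiment():
    primary = ToyStore("primary", "primary")
    replica = ToyStore("replica", "replica")

    # Steady state: a guest book entry is written and replicated.
    primary.data["guestbook"] = ["hello", "world"]
    replica.sync_from(primary)
    before = dict(replica.data)

    # Fault: the primary's dataset is emptied just before it stops.
    primary.data.clear()
    replica.sync_from(primary)   # the replica faithfully copies the empty state
    primary.running = False

    # Failover: the replica is promoted and now serves the empty dataset.
    replica.role = "primary"
    return before, replica


if __name__ == "__main__":
    before, promoted = run_shutdown_experiment()
    print("before:", before)
    print("after promotion:", promoted.data)
```

The point of the toy is that replication alone is not a backup: a replica that trusts its primary will copy data loss just as readily as data.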

CRAIG BOX: And especially before you do a keynote.

ANA MARGARITA MEDINA: Yes.

ADAM GLICK: Chaos engineering is a relatively new discipline. How did you get started in it?

ANA MARGARITA MEDINA: I have a very interesting story in terms of my way that I got into tech but also got into chaos engineering. I come from a very untraditional route. I taught myself how to code at 13, which later led to me running my own freelance web design business at the age of 15.

And I continued working in front-end technologies, ventured out to doing iOS applications, Android applications. Then all of a sudden, I was trying to find my new gig as I had basically decided to drop out of college, and I ended up working at Uber as their first site reliability engineering intern.

But they didn't just throw me into site reliability engineering with not necessarily having the systems experience. I had a very interesting task that I got placed on the chaos engineering team. So my first time that I ever worked on anything infrastructure related was actually chaos engineering.

So in my first two, three weeks at Uber, a lot of it was ramping up on production, how do servers work, learning Linux. But I did have the chance to have some amazing engineers around that were willing to let me put time in their calendar and ask them, hey, can you give me an explanation on how this actually works on this routing to, like, iptables getting matched up to actually deployments?

And folks were pretty willing to explain to me how these thousands of microservices were working at Uber. And then I got a chance to also sit down with the lead engineers for these microservices and start thinking of, hey, we can actually inject a little bit of chaos on this service and actually test what happens if some of the ports happen to be shut off in this other microservice that you're talking to.

So I then started learning more about infrastructure by actually just breaking more infrastructure. So it kind of became this whole thing where all of a sudden, I just kept breaking stuff and learning. And that's kind of my path in the resiliency site reliability engineering space.

And I kind of have joked around that the way that I have learned a lot of things in monitoring and observability has been by just doing it backwards, where I deploy something, I try to instrument it. And then I go run chaos engineer experiments to actually see if I properly instrumented it.

ADAM GLICK: How was it going from the shiny front end to the critical back end?

ANA MARGARITA MEDINA: It was hard. But in a sense, it made me be creative in different ways. I decided to go into the front end because I also had a passion for graphic design and photography. So front end kind of fit in to those other passions I had. But then when I got to infrastructure, it was like, well, I don't necessarily get to have pretty websites or make things sparkly in the colors that I like.

But then it made me just think of other different ways that I can think about constantly pushing the envelope forward in the ways that folks were having namings and organization things in parts of the infrastructure to taking it like the extra level, that I always was. And it's like, oh, how do I make sure that the terminals that I'm constantly working on, like the code editors are the colors that I want, because I don't get to develop shiny websites anymore.

CRAIG BOX: Do you think that having an affinity for front-end work and the experiences of interaction that people have bring something to back-end work, I think empathy for people that isn't necessarily there in all back-end systems?

ANA MARGARITA MEDINA: I definitely think so. I mean, it was always thinking of building a good user experience, where things were going to work and things were quite fast. So I always develop with the customer or the end user in mind. And then when I move on to infrastructure, you now have thousands of users you never get to actually even know who they are. So when I start thinking of like, hey, we want to make sure that we're building everything at the back end to constantly be up in a way that when things fail, they're degrading in a proper way.

So let's say your application is not loading something properly due to latency. Is this something that you actually even need to be loading on your website or can you just display your website with that container missing or that microservice not displaying information? So it's constantly being like, what's the best user experience we can give an end user when something in the back is not necessarily working as expected?
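The idea of displaying the page with a slow or failing section left out can be sketched as follows. This is one possible approach under stated assumptions, not a prescribed pattern: `fetch_or_none`, `render_page`, the 50 millisecond budget, and the stand-in dependency functions are all hypothetical and exist only for the illustration.

```python
import concurrent.futures
import time


def fetch_or_none(fetch, timeout_s):
    """Call a dependency, returning None if it is too slow or raises."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an error raised by the dependency.
        return None
    finally:
        # Do not block the page on a dependency that is still running.
        pool.shutdown(wait=False)


def render_page(sections, timeout_s=0.05):
    """Render every section that answers in time; omit the rest."""
    rendered = {}
    for name, fetch in sections.items():
        content = fetch_or_none(fetch, timeout_s)
        if content is not None:
            rendered[name] = content
    return rendered


# Hypothetical stand-ins for real microservice calls.
def header_service():
    return "Welcome"


def slow_recommendations():
    time.sleep(0.3)   # simulated injected latency
    return "You might also like..."


def broken_reviews():
    raise RuntimeError("reviews service is down")


if __name__ == "__main__":
    page = render_page({
        "header": header_service,
        "recommendations": slow_recommendations,
        "reviews": broken_reviews,
    })
    print(page)
```

A latency experiment against the recommendations service would then confirm whether the page really does render without that section, rather than hanging or failing outright.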

ADAM GLICK: You're a Latinx person working in tech that has a strong deficiency of people like yourself in it. You also have a passion for diversity and inclusion. How do you bring that to the work that you do in the companies that you're a part of?

ANA MARGARITA MEDINA: I do that in various ways. I think the number one is that I bring my whole self to work. I love listening to my Hispanic music, very much into the reggaeton scene. So probably listening to that as I'm working on any coding projects or just preparing for like giving a talk.

And then I spend a lot of my time just working with a lot of underrepresented communities and marginalized folks and being able to give them resume tips, tell them how I have navigated my career throughout the years. And the biggest one is realizing that when I was growing up, I didn't have anyone to look up to that looked like me. And knowing that I can now be that role model that I wish I had when I was 13.

CRAIG BOX: What is a day in the life of a chaos engineer?

ANA MARGARITA MEDINA: The life in a chaos engineer kind of changes. There's various days where you're just reading postmortems, trying to learn what has gone wrong in the last quarter, in the last few months in my company or my organization that I can be actually building chaos engineering experiments on.

There are also some days that you're maybe following your favorite company. It's having an outage, and you're like #hugops to everyone I know from it. But you're also kind of like questioning, hey, how could they have prepared better until the point they were just like, hey, how can I reach out to them and tell them, hey, you know, maybe you shouldn't just look at your cloud as your single point of failure.

Maybe you should have a little bit more of a hybrid cloud, a multi-cloud strategy around it. To some other days, it's very much running game days. At the company you're helping some of our customers run game days and actually make their systems more resilient.

ADAM GLICK: Is chaos engineering more or less relevant in a world of Kubernetes and Kubernetes-like infrastructure, where you have dynamically restoring and automatic replication that happens for things that disappear?

ANA MARGARITA MEDINA: I personally think that now that we're in a world where there's more Kubernetes, we actually need more things like chaos engineering, just because we've actually built an abstraction layer on these complex applications that we're running, that it gets a little bit harder to know what's running under the hood.

And of course, when you're running this on the cloud, you also already had those abstractions of being on the cloud. But when we look at Kubernetes, a lot of folks have been able to move onto it and adopt it. And a lot of folks are sometimes scared to adopt it, just because they know it's complex and abstract.

So with chaos engineering, you kind of get to possibly even strategize how to bring in Kubernetes into your workspace but in a way that you're doing it in a very thoughtful and incremental way. And you're also building the assurance that the applications that you're putting on Kubernetes are being resilient.

And one of the things that Gremlin has been doing in the past few months is actually researching what are the largest outages that have happened in the Kubernetes space? And turns out that it looks like around 50% of the largest outages for Kubernetes end up being just auto-scaling.

CRAIG BOX: I thought you were going to say "etcd".

ANA MARGARITA MEDINA: [CHUCKLES] Almost. But with auto-scaling, we see folks are not necessarily doing any horizontal pod auto-scaling or maybe even to the point that their nodes and their clusters are not doing anything in the auto-scaling term, where now your containers are just waiting for getting allocated and the resources are maxed out.

And the other side is that we see the other ones fall in the network layer so things that could kind of become a little bit more resilient by running chaos engineer experiments with like latency, packet loss, and black-holing traffic. So it's like now that you have distributed these things across different containers, pods, deployments, how are these things actually handling the failure as you go on through your company's day-to-day operations?

CRAIG BOX: When something does go wrong, it could be anywhere in the stack. It could be your application, it could be Kubernetes, it could be the kernel. Do you have to have a really broad understanding of all those areas to succeed in debugging them?

ANA MARGARITA MEDINA: I wouldn't necessarily say it's a complete understanding, but you need to have a lot of instrumentation on the monitoring and observability side. But there is a part that you can't wait until your monitoring and observability is perfect, because we're constantly still trying to get there.

So in a way, you can also be learning about your systems by introducing chaos engineering experiments when you have just infrastructure metrics and a little bit of metrics around your service, in terms of knowing how the resources on your infrastructure layer are doing, how the network is doing, and how your service requests are doing: how many HTTP 400 and 500 errors am I having?
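The service-level metric mentioned here can be as simple as the share of responses in the 4xx and 5xx classes. A minimal sketch, assuming the status codes have already been collected into a list; the function name `error_rates` is hypothetical.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the fraction of 4xx and 5xx responses in a list of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    # Group codes by their hundreds digit: 404 -> 4, 503 -> 5.
    classes = Counter(code // 100 for code in status_codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


if __name__ == "__main__":
    before = error_rates([200, 200, 200, 404])
    during = error_rates([200, 500, 503, 404])
    print("before experiment:", before)
    print("during experiment:", during)
```

Comparing the rates before and during an experiment gives a basic signal of whether the injected failure reached users.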

ADAM GLICK: If someone wants to get into chaos engineering, it's relatively new. There's not a lot of coursework. There's not even a lot of books about it at this point. How can someone get to where you're at?

ANA MARGARITA MEDINA: One of the best things about the chaos engineering space is that there's a large community that has been growing around it. There are over 3,500 members in the chaos engineering Slack channel. And the link to that is tinyurl.com/chaoseng.

And basically, there you get to come hop online and learn from folks that have been doing chaos engineering from 10 years ago to folks that are like, hey, what the hell is chaos engineering and how do I get started, to I am currently running a monolith architecture, how can I actually instrument this, that I can actually possibly start thinking about chaos engineering?

And we have those folks that are like, hey, I'm actually running Kubernetes. Can you actually talk to me about some experiments that I can get started on? And then with that, I work at a company that offers a free tool that allows you to get started with chaos engineering.

So Gremlin has a free-forever community offering called Gremlin Free that allows you to get started with chaos engineering by running shutdown experiments and chaos engineering experiments. And you can run this on anything from your regular hosts to your containers to your Kubernetes clusters as well.

ADAM GLICK: In the testing days, people used to talk about white box versus black box testing, of understanding all the code that's inside or just kind of poking at it and seeing what comes out the other side. Do you have to understand all the technology that you're testing to be a good chaos engineer or can you just understand the system pieces and flip the bits there and see what happens?

ANA MARGARITA MEDINA: There is this part with chaos engineering that we're not doing it just as one person. We try to do chaos engineering with your entire team, or you come together in a group of, like, your tech leads and maybe some of the junior engineers that you're also trying to train up. And when you do that, you have folks that have all sorts of expertise in the room as you're running these chaos engineering experiments, whether it's in development or production.

So there's never this one person that has to be the mastermind of how your entire architecture of the company works. And you're all able to have different roles. You have folks that are going to be basically just focusing on looking at error rates, looking at the traffic of your customers. Is it still constantly coming in as you're running these experiments?

You have folks that are just going to be looking at your observability, your monitoring. And then there's the folks that are just going to be actually taking notes of all the experiments and everything you're doing. And there's going to be that chaos commander that is the one in charge of actually owning and running this chaos engineering experiment.

ADAM GLICK: I totally want that as my job title, chaos commander.

CRAIG BOX: Your company is Gremlin. What is the story of that name?

ANA MARGARITA MEDINA: Gremlin actually got influenced by the Royal Air Force in World War II, where part of it is that an author called Roald Dahl made it popular with a book after he had been in the Air Force and experienced a crash landing. And this later turned out to be a Disney movie. So this little Gremlin mascot ends up being this little, like, mischievous character that can be good and can be bad.

CRAIG BOX: And has been through mythology obviously for many hundreds of years.

ANA MARGARITA MEDINA: Yes.

CRAIG BOX: I understood you watched the movie as a team event.

ANA MARGARITA MEDINA: Yeah. Gremlin is a remote first company, and we try to be a little bit creative in the ways that we do a little bit of team bonding. So one of the fun things that we did is that folks were realizing that a lot of the company had actually never watched "Gremlins," myself included.

So we actually got together on a Zoom call, and we were streaming the "Gremlins" movie. And we also were using Slack to have a little bit of a channel where folks were actually sharing commentary of the movie. So very similar to just having an off-site, going to go watch a movie, enjoying it with your co-workers, your friends. We're like, how do we actually bring that kind of experience where we're building a remote first culture?

CRAIG BOX: You really had to get a copy on VHS and post it around to everyone if you wanted the true '80s experience. All right. Ana, thank you so much for joining us today.

ANA MARGARITA MEDINA: Thanks for having me.

CRAIG BOX: You can find Ana on Twitter at @ana_m_medina.

[MUSIC PLAYING]

CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. Please leave a rating on iTunes, if that's your thing. If you have any feedback for us, you can find us on Twitter @kubernetespod or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: You can also check us out at our website, kubernetespodcast.com, where you'll find transcripts and show notes. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]
