Kubernetes Podcast from Google: Episode 121 - TiKV, TiDB and PingCAP, with Ed Huang

#121 September 15, 2020

Hosts: Craig Box, Adam Glick

Ed Huang is co-founder and CTO of PingCAP, creators of the TiDB distributed database and the TiKV key value store. Ed worked on clustering Redis while at Wandou Labs, creating and open-sourcing a tool called Codis. Deciding to focus on this space, he created TiDB and then TiKV, and founded PingCAP. He shares the story behind the projects, bridging the gap between China and the West with open source, and his Desert Island Disc.

ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: I made an interesting observation this week.

CRAIG BOX: Oh, really?

ADAM GLICK: My wife had gone to Starbucks and gotten herself a coffee. And she came home. And I looked at the cup and then I looked at what I was feeding my daughter. And realized that a Starbucks cup-- really, any of the disposable coffee cups that you see-- is really just a sippy cup for adults.

CRAIG BOX: Interesting. I would say that the main criteria for a sippy cup, at least in the child/toddler situation, is that you can tip the cup upside down and the liquid doesn't fall out of it. Is that true of a Starbucks cup?

ADAM GLICK: You know, I haven't tried. I'm a bad person to ask on that. I assume that there is some amount of force you could apply and you could cause the lid to come off, but they are meant that if you tip it over it doesn't all go--

CRAIG BOX: Yeah, if you shake it wildly. Like, I imagine your daughter taking it and banging it up and down on the table or something, saying I can get the liquid out. This is a science experiment.

ADAM GLICK: It does make me wonder, what are the other things that you look at and you're like hey, I thought that was a kid's thing or I thought that was an adult's thing. And the reality is, it's the same product and people just have found a new market and a new group of people to use it for.

CRAIG BOX: I have seen a number of takeaway coffee cups and things in the past-- back in the day, of course, when we could go and do such things. But some of them had a little flap that you could actually close. You could get your hot drink, open the flap, close it again. So I reckon if you were to turn that one upside down and not bang it on the table with great force, it probably would meet the requirements for being the proverbial Tommee Tippee cup.

ADAM GLICK: Perhaps we should start up a GitHub page with the child version, the adult version. You know, your kids have a high chair and we have bar stools.

CRAIG BOX: Oh, the things you do when you become a parent.

ADAM GLICK: [CHUCKLES] Shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: Version 3.6 of the Lens Kubernetes IDE has been launched, the first release since Mirantis took control of the project. This release changes to using kubeconfig files for access as well as improving terminal configuration options and integration with OpenID providers. Additional bug fixes and underlying changes are also a part of the release.

CRAIG BOX: AWS has introduced security groups for Kubernetes pods on EKS. This gives users the same firewall-like security group controls at the pod level that they had on EC2 instances, implemented in the CNI plugin. The update is available for new clusters, and AWS has promised to roll it out to existing clusters in the future.

ADAM GLICK: The CNCF End User Community has released its second Technology Radar, this time on observability. The radar works something like a maturity model and breaks various projects and vendors into the categories of assess, trial, and adopt. Congratulations to Datadog, Elastic, Grafana, OpenMetrics, and Prometheus for landing in the adopt category.

CRAIG BOX: The Crédit Agricole platform group has launched Kotary with a K, an operator to manage quotas with confidence. Kotary brings a layer of verification and policy to the native resource quotas mechanism. Instead of your scheduling failing because there are no resources to provide it, your claim object will be updated to tell you why it couldn't be fulfilled, either due to a lack of resources or violating a policy that was set.

ADAM GLICK: Another new project this week is Onepanel, a Kubernetes native deep learning platform for computer vision. It builds on CVAT, the Computer Vision Annotation Tool, and provides fully integrated components for model building, semi-automated labeling, data processing and model training pipelines.

CRAIG BOX: Idit Levine of solo.io, our guest on episode 55, has announced that they are creating a specification for WebAssembly, or Wasm, modules to be packaged in an OCI image. The images will contain two layers-- one for the module and one for the metadata. The first use case for these images will be the Envoy support built by the Istio team, but the spec is general enough for other uses. It currently has the advanced version number of 0.0.0, but it's being worked on in line with the upcoming Istio 1.8 release.

ADAM GLICK: Red Hat announced the Red Hat Marketplace operated by IBM, an app store for OpenShift that can deploy to clusters regardless of whether they are running on prem or in the cloud. For an additional fee, Red Hat Marketplace Select is also available. The Select version of Marketplace lets an organization limit which software is available to be installed by users within their company.

CRAIG BOX: Kubernetes security vendor StackRox has announced a $26.5 million financing round to expand internationally and continue investing in R&D. StackRox says its business grew by 240% in the first half of this year, citing companies accelerating their digital transformation initiatives in response to the COVID pandemic.

ADAM GLICK: At their .NEXT event last week, Nutanix announced the Karbon platform services, with a K-- a PaaS that can deploy on prem, in the cloud, or at the edge. The platform brings a managed Kubernetes and PaaS with serverless functionality, along with a host of features like ingress, service mesh, observability, security, AI, and a message bus.

CRAIG BOX: Kubernetes 1.18 has hit Google's GKE, and with it, the beta of using confidential VMs as GKE nodes. Confidential GKE nodes use the AMD EPYC Hardware Encryption to enable processing data while encrypted, keeping the keys in the node hardware with Google having no access to those keys. You can learn more about it in the linked video by none other than Vint Cerf, father of the internet.

ADAM GLICK: Are you curious about running serverless applications on Kubernetes? If so, the CNCF has a new course on the edX platform for you. Aptly named "Introduction to Serverless on Kubernetes," the course is a free intro on how to build serverless functions and run them on Kubernetes. The course targets doers in IT and development and assumes a basic set of cloud-native principles knowledge as well as some understanding of the Python programming language. It's written by Alex Ellis, our guest on episode 116, so no prizes for guessing which serverless platform it will teach you.

CRAIG BOX: Finally, the discipline of SRE is often described as changing the engine while the plane is in flight and Jetstack had to do just that. A customer wanted to change the CNI plugin used by the Kubernetes cluster and customer reliability engineer Josh Van Leeuwen wrote up how Jetstack achieved this. Copious examples and a full set of scripts are provided.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Ed Huang is the co-founder and CTO of PingCAP, the creators of TiDB and TiKV. He's also the co-author of Codis, a widely-used Redis cache solution. Welcome to the show, Ed.

ED HUANG: Hello, everyone.

ADAM GLICK: Back in 2015, you were working at Wandou Labs and you released an open source project called Codis. What sort of company was Wandou Labs?

ED HUANG: Wandou Labs is a fast-growing internet company in China, and it is like an app store for the Android platform. I was working in the storage infrastructure team at the time.

ADAM GLICK: What problem were you trying to solve?

ED HUANG: You know, at the time, we used Redis a lot. More than 99% of our online read workload was served by Redis. We used Redis as the front end of our database. At the time, I think the amount of data was around two terabytes total in Redis. I still remember.

But Redis didn't have a cluster mode back then. You know, it was version 2.3 or 2.4 I think. So of course you could not fit that amount of data into a single node server, right? Sharding the Redis cluster to many nodes was the only way. Back then, we only had an open source tool for sharding Redis. It was called twemproxy. It was open sourced by Twitter, I think. It is a static sharding tool, it is a middleware for Redis.

At the time, our business-- our data was growing so fast. So re-sharding and re-balancing the data in Redis was super painful for us. It was the problem we were trying to solve. And that's why we wanted to create Codis-- we wanted to create a new Redis middleware.

On the application side, you can just use it like Redis, but under the hood, Codis will automatically handle the data sharding and then the data re-balancing, data placement; it is totally transparent to the application layer. And I think we came up with a very clever idea to keep the strong consistency and 100% availability while re-balancing the data in Codis. That would save a lot of time and manpower on Redis at the time.
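Editor's note: the transparent sharding and re-balancing Ed describes can be sketched as a routing table that maps a fixed set of hash slots to backend nodes. This is a minimal illustration only. The slot count, the hash function, the class and the node names are assumptions made for the example, not details from the interview, and the sketch does not attempt the consistency guarantees during migration that Ed mentions.

```python
import zlib

NUM_SLOTS = 1024  # illustrative slot count, an assumption of this sketch


def slot_for(key: str) -> int:
    """Hash a key into a fixed slot; the slot never changes for a given key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SLOTS


class SlotRouter:
    """Hypothetical proxy-side routing table: slot -> backend node."""

    def __init__(self, nodes):
        # Spread slots round-robin across the initial nodes.
        self.slot_to_node = {s: nodes[s % len(nodes)] for s in range(NUM_SLOTS)}

    def node_for(self, key: str) -> str:
        return self.slot_to_node[slot_for(key)]

    def migrate_slot(self, slot: int, target: str) -> None:
        # Re-balancing moves whole slots; applications keep using the same keys.
        self.slot_to_node[slot] = target


router = SlotRouter(["redis-a", "redis-b"])
slot = slot_for("user:42")
router.migrate_slot(slot, "redis-c")  # e.g. after adding a new node
print(router.node_for("user:42"))     # prints "redis-c"
```

Because the application only ever sees keys, moving a slot to a new node changes the routing table and nothing else, which is what makes the re-sharding invisible to the application layer.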

ADAM GLICK: You built this and you're doing sharding on your DB, presumably, and then sharding on the caching layer in front of your database. You decided to release and open-source this work. What made you decide to open-source it?

ED HUANG: Codis, at the time-- I think it was 2013, or 2014-- was written in Go. The [language] was very popular, widely used in my company at the time. And there were not many open-source projects in the Go community at that time. And we thought that the code of Codis was very elegant. At the time, we didn't have [built in] Redis clustering. And the problem Codis was trying to solve was a general problem in the industry. So that means open source was a natural choice for us and we didn't think too much of it. We wanted to have more engineers [be able] to solve the scalability problem for Redis.

And secondly, I really like the Go programming language. I hoped more people would use Go. And Codis is a good example to copy some code from. If you wanted to build middleware or a distributed system, you could borrow some code from Codis. And we could prove that we can really get performance and engineering productivity at the same time. So that's why we open sourced it. We wanted to show off how beautiful the code was.

ADAM GLICK: Is this what sparked your interest in open-source databases overall?

ED HUANG: Yeah, actually; that was the beginning of TiDB and TiKV.

ADAM GLICK: In 2012 to 2013, Google released the Spanner and F1 papers. Did they have any influence on you or what you were creating?

ED HUANG: Yeah, you know, I'm a big fan of Google and I think I have read most of the papers published by Google in the distributed system space. So, yeah, I still remember the first time I read the Spanner paper. I think the Spanner paper was published first and followed up with the F1 paper. I would say the TiDB project and TiKV projects were, at the very beginning, inspired by these two papers. I thought "this is the future of databases".

ADAM GLICK: Not long after releasing Codis, you left Wandou and created PingCAP, where you released TiKV and TiDB. Which came first, the company or the projects?

ED HUANG: [CHUCKLING] Yeah. After the success of Codis, I would say we were a little bit over confident, right? I just said that I was working on the storage infrastructure team at the time, at Wandou Labs. Behind the Codis cluster was a huge sharded MySQL cluster. It had the same problem as the cache layer. Scaling out a MySQL cluster and then re-sharding MySQL was also really painful.

So that means you have to give up the features of relational databases, like transactions, like complex queries. You know, at the time, like with Codis, we wanted to build a new distributed database to replace the sharded MySQL cluster. We had this idea at Wandou Labs and we talked to our boss, saying, "hey, we want to build a new database". But it was a very ambitious project, even for a big internet company. And so that means a lot of investment. So our boss wasn't supporting us to do this, because it was like a moon shot and we were just an app store for the Android platform.

But we thought we could build it, and the database is also a common pain point in our industry. So that means a big potential market, even though our boss wasn't supporting us. But we had a great reputation, because of Codis, in the open source community in China. So it was not very hard for us to get some seed money from the venture capital firms. So, we just decided to quit our jobs and go full time to create a startup to solve this problem. That was the beginning of the company and project.

ADAM GLICK: What is the origin of the name PingCAP? I noticed that the C-A-P are all capitalized. Is that a reference to the CAP theorem?

ED HUANG: Yeah, it is the CAP theorem. Yeah, and you know, TiDB is a CP system. Consistency and partition tolerance-- CP. But still with a very high availability, just like Spanner. I think, in a sense, we are close to breaking the CAP theorem. [CHUCKLES] We also used ping. Ping is the network command that implies connectivity. So this name means we want to link C and A and P. Yeah, it is a good name and is a very geeky name. According to this, you can see this company is built by some engineers and it is a distributed system company. That's why we used this name.

ADAM GLICK: You've created two databases-- TiKV and TiDB. What order should people think about these technologies? Do they build on each other or are they separate databases?

ED HUANG: Actually, TiDB here-- the repo pingcap/tidb-- works together with TiKV. And the TiDB project is-- TiDB, in this context, not the whole TiDB platform or TiDB project-- is a stateless SQL layer built on top of TiKV. And that TiDB SQL layer exposes a stateless MySQL-compatible endpoint to the application layer and translates the SQL statements into key-value operations and sends them to TiKV. So the data is actually stored within TiKV, and TiKV is a transactional key-value distributed database. Just like the relationship between Spanner and F1. TiDB is the F1 part, the TiKV is the Spanner part of this key-value interface.
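Editor's note: the split Ed describes, a stateless SQL layer that turns statements into key-value operations against a separate store, can be illustrated with a toy mapping from table rows to keys. This is a sketch under stated assumptions: the string key layout, the function names and the in-memory dict standing in for the key-value store are all hypothetical and are not TiDB's actual encoding or API.

```python
def row_key(table_id: int, row_id: int) -> str:
    # Zero-padding keeps lexicographic key order equal to numeric row order.
    return f"t{table_id:010d}_r{row_id:019d}"


def insert_row(kv: dict, table_id: int, row_id: int, row: dict) -> None:
    """Roughly what INSERT becomes: a single put into the key-value store."""
    kv[row_key(table_id, row_id)] = row


def select_by_id(kv: dict, table_id: int, row_id: int):
    """Roughly what SELECT ... WHERE id = ? becomes: a point get."""
    return kv.get(row_key(table_id, row_id))


def scan_table(kv: dict, table_id: int) -> list:
    """Roughly what a full-table SELECT becomes: an ordered prefix scan."""
    prefix = f"t{table_id:010d}_r"
    return [kv[k] for k in sorted(kv) if k.startswith(prefix)]


store = {}  # stand-in for the distributed key-value layer
insert_row(store, 1, 10, {"name": "bob"})
insert_row(store, 1, 2, {"name": "alice"})
print(scan_table(store, 1))  # rows come back in row-id order: alice, then bob
```

The SQL layer in this sketch holds no data of its own, so any number of such layers could sit in front of the same store, which is the sense in which the layer is stateless.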

ADAM GLICK: If TiDB is the SQL-querying layer on top of TiKV, did you create them simultaneously? Or did you start with one and then later realize that the other would make a great addition to it?

ED HUANG: It is a long story. We started from the SQL layer first, because we thought it was closer to the application layer. At that time we used HBase-- you still remember HBase?-- as the storage part under the SQL layer. So it was TiDB on top of HBase. We made HBase look like a MySQL cluster.

But the performance was really terrible, you know? I think the big selling point of TiDB is the transactions part, right? Just like Spanner has the ACID transactions, cross-row transactions. But, you know, based on HBase means that we had to implement another distributed transaction layer on top of HBase, which made the performance even worse. So we decided to build our distributed storage engine to replace HBase. So that's the origin of TiKV.

At the beginning was the SQL layer and then we found that we have to build the high performance key-value NoSQL database. And then we have TiKV. I think it was 2015 at the beginning, we built SQL. In 2016 or '17 we started to build TiKV.

ADAM GLICK: If you put these two products together, how do you distinguish them from traditional databases people may be used to, like MySQL or Postgres?

ED HUANG: The difference between MySQL and Postgres? You can consider TiDB as a super, super huge MySQL where you don't need to worry about scalability, just like Spanner. From the application developer's perspective, you can still use your JDBC, MySQL drivers to connect to TiDB, so you don't need to worry about scalability. But at the same time, you can still keep the good features from traditional databases, like transactions, like complex joins-- you can use it like a single-node database, but you have the good features of SQL. Something like that.

ADAM GLICK: Do you have any plans to merge these two projects, since it sounds like they are often used together?

ED HUANG: Actually, the idea and the philosophy behind the whole TiDB platform is highly layered, just like Spanner and F1, that separated the computing and storage. And we think we should separate it, yeah, because they are totally different projects. And, you know, we use a different programming language for them. So it is-- I would say, we will not merge them together because I'm a big fan of the Unix philosophy -- do one thing and do it well.

ADAM GLICK: You mentioned that you use different languages for each product. If I recall correctly, TiKV is written in Rust while TiDB is written in Go. Why did you choose different languages for each of these projects?

ED HUANG: I really love Go. Before PingCAP, I basically used Go to build everything if possible. Go has a very simple syntax and grammar, which means very good engineering productivity, although Go has more CPU overhead compared to Rust or C++. Before PingCAP, I hadn’t used Rust a lot. But it was really early, like 2016, and the Rust community was not very big.

And I really liked Go. At the time, Go had a great big problem on GC [garbage collection] performance. But right now, in recent versions of Go, they have greatly improved the GC performance. So GC is not a problem in most cases. So we chose Go to build the SQL layer, even though we had CPU overhead. But I thought it was acceptable.

The problem with Go is the scheduler, if you have many Goroutines. Goroutine scheduling policy is invisible to developers, which is not very friendly in some performance-critical scenarios, like building a storage engine. So we chose Rust to build a storage layer. Like C++, it is closer to the bare metal: full control of the hardware with zero overhead. Even better, compared to C++, Rust has a very strict syntax to prevent memory leaks and unsafe pointer references at compile time. So we can have zero overhead.

And compared to Go, Rust has a modern type system and modern syntax. But this comes with the cost of a much steeper learning curve compared to Go. The TiKV team is not very huge, but they are very sharp engineers. And even though from the contributor level, TiKV, compared to TiDB-- the SQL layer-- they have a smaller developer community. We think it is because of the learning curve of Rust.

ADAM GLICK: If you were to restart one of these projects today, would you still write them in the same two languages you chose? Or because of some of the garbage collection changes that have happened, as well as other improvements, would you pick a different language for one or both of these projects?

ED HUANG: Well, from a technical point of view, I think I made the right choice. I don't want to change. But from a community perspective, probably I would use Go. That would give TiKV a larger developer community, I guess. And in the future, the performance for Go, the scheduler performance, the scheduler control may no longer be a problem. But I really love Go. They have a really good learning curve for the community and developers.

ADAM GLICK: And they have a really cute mascot. Some things you just have to go with your gut on. [ED CHUCKLES] Speaking of which, you embraced Kubernetes pretty early on with your project. Why did you choose to build on Kubernetes?

ED HUANG: Well, I just mentioned that I am a big fan of Google's infrastructure. Yeah, I like a lot of systems published by Google -- you know, Spanner, F1, Megastore, Omega, Borg. And I think one of the hardest things for building a distributed system is not only the implementation. The hardest thing is the deployment, orchestration.

I really think most people in the past, before Kubernetes, underestimated the complexity of operating and maintaining huge distributed systems. Engineers tend to focus on the algorithms and the code and how to implement it, but in the real world, we are talking about a cluster with hundreds or thousands of nodes. And the complexity in operation, like how to do failover, how to do a rolling upgrade, is something we have to face every day.

So one of the biggest contributions of Kubernetes is to simplify the development of a lot of automated maintenance logic for the infrastructure software. I'm talking about the operator pattern and CRDs. Instead of a lot of fragile shell scripts, the infrastructure software can put that deployment and maintenance logic into an operator. So the users of this software do not need to understand a lot of the operational details. Just like you use APT package management to install or update your software on Linux.
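Editor's note: the operator pattern Ed refers to puts maintenance logic into a control loop that compares a declared desired state with what is actually running and emits the next corrective step. The sketch below illustrates that idea in plain Python without any Kubernetes client library. The `ClusterSpec` and `Pod` types, the `store-N` naming and the one-pod-at-a-time upgrade policy are assumptions made for the example, not the behavior of any real operator.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    """Desired state, as a user might declare it in a custom resource."""
    replicas: int
    version: str


@dataclass
class Pod:
    """Observed state of one running instance."""
    name: str
    version: str


def reconcile(spec: ClusterSpec, pods: list) -> list:
    """Return the next actions that move observed state toward the spec."""
    # Rolling upgrade: replace at most one outdated pod per reconcile pass.
    outdated = [p for p in pods if p.version != spec.version]
    if outdated:
        return [("upgrade", outdated[0].name)]
    # Scale out: create the missing instances.
    if len(pods) < spec.replicas:
        return [("create", f"store-{i}") for i in range(len(pods), spec.replicas)]
    # Scale in: remove the surplus instances.
    if len(pods) > spec.replicas:
        return [("delete", p.name) for p in pods[spec.replicas:]]
    return []  # observed state already matches the spec


spec = ClusterSpec(replicas=3, version="v2")
running = [Pod("store-0", "v1"), Pod("store-1", "v2")]
print(reconcile(spec, running))  # prints [('upgrade', 'store-0')]
```

The user only edits the spec. The loop, run repeatedly, decides whether the next step is an upgrade, a scale-out or nothing at all, which is the logic that would otherwise live in hand-written shell scripts.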

And second, I used to envy Google because of Borg. You know, Borg is very elegant and abstracts the data center into the operating system, right? And it hides the complexity of the physical nodes for applications. So what you can see is a well-defined resource and APIs. I think it is very similar to something that happens with a standalone operating system like Linux. Just like we use POSIX API, syscalls, and turn the bare metal hardware into standard resources.

So just like we program against the POSIX C API provided by Linux, maybe in the future, let's say we assume that our system is running on Kubernetes. Why couldn’t we rely on the resources and standard API provided by Kubernetes as part of our distributed application?

So for example, if today I develop a cloud-native database, I can use persistent volumes as my storage and place my cold, persistent data using such API. And I can use local volumes, local storage APIs to place some hot data for performance, because sometimes the local PV is served by the SSD.

And you don't need to care how to get and how to maintain this kind of resource. It is Kubernetes' job. So that will greatly reduce the complexity of building a new kind of distributed system. I think this is the future of distributed systems. At the very beginning of TiDB, we wanted to build a large-scale distributed system. We thought we should find a great foundation for that and prepare for the future. So I think Kubernetes is a very natural choice.

ADAM GLICK: For many years, there was debate about if running databases in Kubernetes was a good idea. You've clearly proven that you can. What were some of the biggest technical challenges in getting a distributed database running on top of Kubernetes?

ED HUANG: This is a very sad story. For me, the hardest part is that we used Kubernetes from the very early versions, like 1.2, 1.3 or 1.4. Databases are not like normal stateless applications. We have to leverage the performance of local disk, especially SSDs. So we cannot directly use the network storage, like persistent volumes. We are not Google. We don't have Colossus. Because of the performance and the latency, at the time-- I still remember, at the beginning-- we used TPR, the third-party resource API, and custom controllers to implement a local disk management controller to manage the local storage.

So that's exactly what local PV API and local PV provisioner do today. So that means after Kubernetes 1.7 was released, we basically rewrote our operator using CRD to replace all the TPRs. And after 1.10, I think, Kubernetes released the local PV API. So we dropped our local disk management module and embraced the new API.

ADAM GLICK: Always the risk of writing something early on.

ED HUANG: [LAUGHING] Yes, yes yes. That's the sad story I mentioned.

ADAM GLICK: As you've continued to work in this area, are there any improvements that you would like to see in Kubernetes?

ED HUANG: Well, yes. I always want Kubernetes to have better support for cross-datacenter, cross-region deployment. And multiple Kubernetes [cluster] federation management. I think that would be nice. Because we are building something like Spanner, sometimes people want to deploy that database across multiple regions, multiple datacenters. But I think Kubernetes today does not work well on the super huge clusters, cross-region.

ADAM GLICK: You focused on things that are really high scale. What is the use case that you had in mind when you created TiDB? What separates it from other products already available that inspired you to go build this?

ED HUANG: A scalable relational database with MySQL protocol-- so that means this system can be the single source of truth for the entire enterprise. In the old days, if we want to scale out a relational database, the only way is to, you know-- sharding. But I just mentioned, if you start using some sharding middleware on top of your database, that means you have to give up a lot of good features from the relational database. So if you can have a relational database interface but with near-unlimited scalability, that means that we reduce the complexity of building the cloud native application or microservice architecture.

Even for some-- you know, any legacy application which relies on MySQL and has a scalability issue, TiDB would provide an alternative solution with very low migration costs. So I think even for Google, before F1 and Spanner, I think Google was using a huge MySQL sharded cluster for their Google Ads. According to the paper published about F1, Google said that hey, we want to build a new database to replace the MySQL sharded cluster in Google Ads.

So I think that's the same use case. We want TiDB to help people.

ADAM GLICK: Stepping back, you decided to submit TiKV, but not TiDB, to the CNCF. What made you decide not to donate TiDB?

ED HUANG: You know, after releasing the first version of TiKV, we found that the scalable transactional key-value API was a very basic storage primitive for building different kinds of systems. SQL is not a very basic semantic for building different kinds of systems, but a key-value API is very basic. And we found that-- for example, imagine that if you want to build a new distributed file system, just like "hey, I want to build Colossus". The problem you may have is how to build a scalable metadata storage. Like in Hadoop, in HDFS, the name node is a single node. It is not scalable.

So if you have a storage system with a very simple API, but still have some features like cross-row transactions, that will really reduce the complexity for building a large distributed system. Another great example is, sometimes, if you have a super huge Kubernetes cluster, etcd will be the bottleneck. That's why we wanted to donate TiKV.

ADAM GLICK: Are you hoping to replace etcd at some point?

ED HUANG: [LAUGHING] Etcd works well on, you know, thousands of node-level clusters. But I think, what I mentioned is, some friends of mine in China, they have super huge Kubernetes clusters. But, you know, etcd works well in a small cluster, but it is not scalable, right? All the data is stored within one node. So that's why some people are very interested in using TiKV as the storage part of etcd. So they replace the local storage with TiKV for etcd and that etcd serves a super-huge Kubernetes cluster.

TiKV is like a building block. So we don't want to donate the whole product. Like, oh hey, we are promoting our product using CNCF as a marketing team. No, we just want to donate the building blocks, helping people to do other things.

ADAM GLICK: It's like an open-core model?

ED HUANG: Something like that, yeah, open-core. [CHUCKLES]

ADAM GLICK: TiKV was accepted in the CNCF sandbox in August 2018. What happened at that point? What changed for you and for the project?

ED HUANG: The one thing we really like about CNCF is, CNCF does not get involved too much within the project's decisions. We have our own governance policy, we have our own operational rules for our community. And so CNCF to us, the significant support is the branding and the marketing and the logos at KubeCon. And the different kinds of reports published by CNCF that people really like to read. And especially outside China. We have a lot of adoption in China, but outside of Asia, there is a high probability that our users learn about TiDB and TiKV, or PingCAP, the first time from CNCF Landscape. So we really appreciate that.

But from the technical side, the CNCF isn't involved too much, which we really appreciate.

ADAM GLICK: When you put it into the CNCF, did that change the adoption as people became aware of it outside the greater China region? What has that meant for contributions to the project?

ED HUANG: I think, for open-source software, adoption is everything. If you have more users, you have more adoption, which means you can have more potential contributors. I just mentioned that outside China, I think most of the adoption is driven by CNCF. So that built a great brand for us. And I think CNCF helped a lot in driving adoption and the building of trust in the open-source project. That's the contribution of the CNCF, not only to the code.

ADAM GLICK: Not long after you were accepted into the sandbox, you closed a $50 million funding round. These two things must have been happening simultaneously. What did your venture backers think about the CNCF and you donating your code to open source?

ED HUANG: I think at that time, in Asia and especially in China, there were not too many open-source projects working very well with the Western world, like CNCF. But the TiKV project and PingCAP is a great example-- proof that a project started in China, or started outside Silicon Valley, can work well with the global open-source community. So our VCs thought, "hey, this company is a global company. They work very closely to the global open-source community". And, yeah, so I think it was good news for them. Yeah.

But actually, from my point of view, I didn't care too much about how VCs thought about it. We just want to [be open source].

ADAM GLICK: I also wanted to say congratulations on your recent graduation of TiKV in the CNCF.

ED HUANG: Thank you.

ADAM GLICK: In July of this year, the CNCF accepted PingCAP's second project into the CNCF. That project is Chaos Mesh. This is something different from the last two open-source database-related technologies that you've done. What was your involvement in this project?

ED HUANG: Chaos Mesh is a very-- we have been building it since the beginning of PingCAP. At the very beginning of PingCAP and TiDB, we were building a testing framework. You know, we were building a database. Distributed databases are very complicated. It is a huge, ambitious project. So how do you make sure your code is correct, or how do you reproduce bugs effectively?

At the beginning, we used some technologies like fault injection. At that time, we didn’t even have a word called "chaos engineering." That's something that was brought out by Netflix, like, I think in 2018. But even before that, we invested a lot in fault injection and simulation testing. That's the beginning of Chaos Mesh.

Chaos Mesh is an independent tool that works with Kubernetes. You can use Chaos Mesh to mess up your Kubernetes environment for your application running on Kubernetes. This tool, before we donated it to the CNCF, we used it widely internally. It helped us to find a lot of bugs and find a lot of issues in TiDB. We think it is a great tool.

And it not only works well with TiDB. It will be very helpful if we want to build a microservice architecture. Imagine that if you have hundreds of microservices running on your Kubernetes cluster, you may worry about if some of the services crash, if that may cause the whole cluster or the whole service to go down. So Chaos Mesh is a tool that intentionally does some fault injection to the cluster and you can see if the system is stable and the failover works just like you want it to work.

And on the other hand, Chaos Mesh uses a lot of technology. It is very cloud native. We use sidecars. We use some CRDs to define the chaos. So it is very easy to use. Because before Chaos Mesh, we used a lot of customized shell scripts that were not very stable and were very hard to use. So that's the reason we built Chaos Mesh.
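Editor's note: the fault-injection idea Ed describes, deliberately breaking part of a system to check that failover behaves as intended, can be shown in miniature. The sketch below is an in-process illustration only. It is not how Chaos Mesh works internally, and the `FlakyService` class, the `failure_rate` parameter and the `call_with_failover` helper are hypothetical names invented for the example.

```python
import random


class FlakyService:
    """A service wrapper that fails a configurable fraction of calls."""

    def __init__(self, name: str, failure_rate: float, rng: random.Random):
        self.name = name
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng

    def call(self) -> str:
        # Injected fault: raise instead of answering.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"ok from {self.name}"


def call_with_failover(replicas: list) -> str:
    """Try each replica in order and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.call()
        except ConnectionError:
            continue
    raise RuntimeError("all replicas failed")


rng = random.Random(0)
primary = FlakyService("primary", failure_rate=1.0, rng=rng)  # fault injected
backup = FlakyService("backup", failure_rate=0.0, rng=rng)
print(call_with_failover([primary, backup]))  # prints "ok from backup"
```

Setting the primary's failure rate to 1.0 plays the role of the injected fault. The check that matters is that the caller still gets an answer from the backup rather than an outage.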

ADAM GLICK: It's clear that you're a big fan and user of open-source technology. Are there any projects out there that are on your radar and that you think are really interesting, that people should know about?

ED HUANG: Besides distributed systems and databases, I'm really interested in programming language technology. Especially Wasm. I think Wasm will be the next JVM. Recently, there is a very promising project called Wasmer, which is a browser independent, stand-alone Wasm runtime. It is very interesting, so people can check it out. Yeah, I really like this project.

ADAM GLICK: Finally, I know you used to be a fan of metal music. If there was one song that you had to listen to for the rest of your life, what would it be?

ED HUANG: Well, I used to be a metal fan, but nowadays-- I would choose one song from Pink Floyd. But it is not metal! [LAUGHING].

ADAM GLICK: Craig would be very happy. What song?

ED HUANG: "Dark Side of the Moon." The whole album, I think it's one song. [LAUGHING]

ADAM GLICK: Ed, it's been great having you on the show.

ED HUANG: Thank you, thank you.

ADAM GLICK: You can find Ed Huang on Twitter @dxhuang.

ED HUANG: Yeah, or you can just query on the search box. "Ed PingCAP TiDB" and I will pop up.

[MUSIC PLAYING]

CRAIG BOX: Good taste in music there Ed. "Dark Side of the Moon" should definitely be thought of as a single track. Or at least two sides, but you have to listen to them consecutively.

ADAM GLICK: Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes as well as links to subscribe. Until next time, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]

[This transcript has been lightly edited for clarity and readability.]
