Kubernetes Podcast from Google: Episode 84 - Monitoring, Metrics and M3, with Martin Mao and Rob Skillington

#84 December 17, 2019

Monitoring, Metrics and M3, with Martin Mao and Rob Skillington

Hosts: Craig Box, Adam Glick

Martin Mao and Rob Skillington are co-founders of Chronosphere; CEO and CTO respectively. They both worked on the monitoring team at Uber, where they created M3: a metrics platform with an open source time-series database built for scale. They join Craig and Adam to talk about monitoring, metrics and M3 on the last episode of 2019.

CRAIG BOX: Hi and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: I hear Delta Airlines are testing in production.

ADAM GLICK: Yes. They always say if you want to be sure of it, test in production. And I haven't taken a Delta flight in quite some time. But I do have their app on my phone. And they have conveniently sent me a notification, saying, test. So apparently, someone's doing some production-based testing today. And always a good reminder-- remember when you're testing in production, that's actually touching some customer.

CRAIG BOX: You should reply and say "pass".

ADAM GLICK: [CHUCKLES] This is going to be our last show of the year.

CRAIG BOX: It is!

ADAM GLICK: It's been a fantastic year. I had a lot of fun doing this. Craig and I like to take the last two weeks of the year off, spend it with our families, recoup a little to come back for next year. We're excited for a lot of the shows we've got lined up, then, but just want to let you know there'll be a couple of weeks gap.

When you look back on the last year, anything jump out at you -- things that you've seen in terms of trends in the community?

CRAIG BOX: I think that it's fair to say that all the vendors last year were trying to provide Kubernetes as a service. And that's now accepted as table stakes. I think that the bar has moved up slightly, and all vendors are now having to offer some sort of multi-cloud or multi-cluster management. And I think a lot of the choices that are being made about those products this year are based on that, is the roadmap and the current technology for being able to manage multiple clusters. We've seen it, obviously, with Anthos from Google, but also in announcements from VMware and Microsoft over the course of the year. There is now a requirement to have that multi-cluster support with clusters running in multiple locations in a hybrid environment. How about yourself?

ADAM GLICK: Yeah, I noticed a couple of things, especially coming out of the last KubeCon. Hybrid and multi-cloud tech-- like, Kubernetes has basically become the platform for that. As you see companies talking about it, Kubernetes is the way that they're talking about how they're doing hybrid and multi-cloud. And that seemed to be fairly ubiquitous across most of the major players.

Speaking of which, a lot of those players have really shown up. We continue to see large enterprise players getting into the Kubernetes space, offering their own Kubernetes distributions or partnering with cloud vendors or on-prem vendors in order to make some sort of distribution available. It's really becoming an enterprise technology, and people are focused on what can it do for the enterprise segment of the market. So I like to put that in a little category I call "Boring Got Hot."

CRAIG BOX: It's boring to call it boring these days. I think everybody is doing that. Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: The Kubernetes blog has continued its tradition of following up a release with dedicated deep dive blog posts on new features. The first two of these have been posted, both focusing on storage: one on migrating Kubernetes in-tree storage into Container Storage Interface volumes, and one on volume snapshots. Both of these features entered Beta with the 1.17 release. Moving in-tree storage to the CSI architecture continues the pattern of reducing what is native Kubernetes and further building on the extensibility of the platform. The volume snapshots feature brings a standardized and native way to capture the state of a storage disk, which is helpful for backups and for stateful applications.

ADAM GLICK: New in managed Kubernetes services this week: Microsoft Azure announced private AKS clusters in preview, which allow you to have an internal IP address for your cluster master and ensure that your nodes only talk to it over the private network. Google Cloud's GKE, meanwhile, made maintenance windows and maintenance window exclusions generally available. Both clouds also launched Kubernetes 1.15 to general availability.

CRAIG BOX: Last week, Google Cloud launched a new class of VMs, the E2 series, which can run on your GKE nodes almost 30% cheaper. E2 instances have their virtual CPUs scheduled on available CPU cores as needed, instead of being allocated one-to-one with actual hardware. Live migration and the new CPU scheduler round out the series of improvements, which Google refers to as "dynamic resource management".

ADAM GLICK: Google also posted a rundown of new features in Cloud Run for Anthos, built on the Knative platform, including traffic management between service versions, support for on-premises clusters, and the ability to reference existing Kubernetes secrets, as well as ConfigMaps.

Another post this week offers guidance on container forensics on GKE clusters, including how to ensure you can get your log data from containers that have come and gone and how to build on, as well as act upon, a proper incident response plan. New documentation also covers how to mitigate a security incident, as an ounce of prevention is worth a pound of cure.

CRAIG BOX: Moving now to other search engines, we turn to Cliqz with a Q and a Z, presumably so named as to score more Scrabble points. Cliqz has built a search product on open source software, including Kubernetes and its whole set of cloud native complements for its infrastructure. They go on to describe their Hydra system for managing the propagation and lifecycle of data sets in a cloud native environment. Cliqz say they intend to eventually open source the Hydra software.

ADAM GLICK: Three CVEs have been announced and patched in the Envoy proxy, including one with a score of Critical. Buffer overflows, crashes, and security bypasses are all potentially possible. So please check mitigations and new patch releases for both Envoy and Istio linked in the show notes. The bugs were found and fixed by engineers at Google, Dropbox, and Lyft.

CRAIG BOX: Over at The New Stack, Zach Jory sums up the top three service mesh developments for 2019. He says the need is growing, Istio will be hard to beat, and the core use cases will solidify in 2020. Meanwhile, if you don't know how a service mesh works, IBM's Ram Vennam has published a lightboard video, where he explains the basic concepts behind Istio in five minutes.

ADAM GLICK: Datawire announced the launch of the Ambassador Edge Stack, a commercial product bundling their Ambassador API gateway, which is also built on the Envoy proxy. In this case, Edge refers to the cluster edge, as opposed to where all the 5G modems are, or the guitarist from U2. It added a developer portal and CRD-based management to allow self-service from developers. Ambassador Edge Stack is available in early access with a rate-limited community edition, or as a commercial enterprise edition. The core Ambassador API gateway remains open source, and it too released a patched version this week for the Envoy CVEs.

CRAIG BOX: Envoy can now be dynamically extended using WebAssembly code, and Solo.io, our guest in episode 55, has launched an Envoy WebAssembly Hub. It's a catalog and service for building, sharing, discovering, and deploying extensions for Envoy-based API gateways and service meshes. It launches with two such extensions from Solo, with four more coming soon.

ADAM GLICK: Envoy has recently gained native support for the Kafka protocol, and the team at Banzai Cloud have been finishing off the support needed by writing an Envoy filter for Kafka traffic. Instead of treating the traffic as TCP, Envoy can now understand Kafka's semantics at the protocol level. They've tested the support and demonstrated it in Backyards, their service mesh powered by Istio, and are working with the Envoy project to finish it. A patched Envoy is available if you want to try it out.

CRAIG BOX: Talos, a self-hosting OS for running Kubernetes, has released version 0.3. New features include support for Kubernetes 1.17, moving away from static pods and towards using bootkube, and five distinct release channels for varying degrees of bleeding at your edge. The entire OS can now be run from RAM, which improves security by not writing anything to the disk.

ADAM GLICK: PingCAP posted this week about a new tool called AutoTiKV, an open source tool that uses machine learning to automatically tune TiKV, the key-value storage engine based upon RocksDB. TiKV is currently in incubation with the CNCF. AutoTiKV aims to do some of the hard work that database administrators have done in the past to make sure you can get optimal performance out of your data store by using machine learning. An interesting side note is that this blog post was originally written in Chinese and was translated to English for the CNCF audience, showing just how global the Kubernetes community is.

CRAIG BOX: The Open Policy Agent Project held a one-day event before KubeCon recently and summarized it in a blog post this week. Their aim was to showcase a variety of use cases from companies running OPA in production, many of whom were using the Gatekeeper integration with Kubernetes. If you're interested in policy and configuration management, check out Episode 42 with John Murray.

ADAM GLICK: Alex Brand has posted a first look at Antrea, the Container Network Interface plug-in for powering Kubernetes networking, using Open vSwitch. Using a daemon set on each node, the cluster network can be configured to provide a bridge network for all the pods you run. Alex's post covers the installation process and how all the pieces work together and should help you if you're curious about Antrea or Open vSwitch.

CRAIG BOX: There are a lot of TODO comments in the Kubernetes code base. How many? Patrick DeVivo from Augmentable analyzed the code with his company's new tool. He found over 2,300 TODO comments, more than the number of open GitHub issues, including from the very first commit. Joe Beda, if you're listening, you still need to "provide a way to override the cloud project".

ADAM GLICK: John Schnake from VMware has posted about how to use their Sonobuoy tool to get updates on Kubernetes test runs without waiting for the full run to complete. Sonobuoy installs a sidecar into your cluster and sends updates out as tests run so you can get progress reports as the tests are executing. This functionality requires Kubernetes 1.17, so you'll need to be up to date with a fresh Kubernetes release to take advantage of this functionality.

CRAIG BOX: The CNCF has posted a couple of case studies this week on how various companies achieve success with their projects. The first of these was a post on how the Chinese cloud provider Alibaba manages tens of thousands of Kubernetes clusters. It discusses their design, how they scale, ways to do capacity planning, and global observability on their infrastructure. The second covers how Grafana Labs uses Jaeger for distributed tracing to look into performance impacting bugs.

ADAM GLICK: Did you know that Quora uses Kubernetes? Are you curious about how they went about adopting it? Quora Engineering has posted a blog talking about their adoption of Kubernetes over the past year. They talk about how they built a separate team inside their company to focus on this project, as well as what other supporting technologies and projects they use to take full advantage of the cloud native ecosystem. In particular, they spent some time on describing and diagramming their CI/CD system. They provide a number of takeaways for other orgs considering the move, and they state that one of their focuses is to increase the hardening of their production clusters and planning for scaling before adding additional applications to their environment.

CRAIG BOX: Finally, the CNCF has announced the schedules for the Kubernetes forums in Bengaluru and Delhi. These are smaller events than KubeCon designed to reach the broader community of Kubernetes users in locations that do not get a dedicated KubeCon event. The events are one day in length and have both beginner and advanced tracks for content. Registration is now open, and you can save $15 if you register before January the 10th.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Martin Mao and Rob Skillington are co-founders of Chronosphere, and CEO and CTO respectively. Rob, welcome to the show.

ROB SKILLINGTON: Thanks a lot, Adam. It's great to be here.

ADAM GLICK: Martin, welcome as well.

MARTIN MAO: Thanks, Adam.

CRAIG BOX: M3 was born at Uber around 2015, when you both joined that company. Did you join explicitly to do monitoring?

MARTIN MAO: One of us did. So yeah, I was hired to join the monitoring team and solve the monitoring problem. And actually, my co-founder, Rob, he was the one that convinced me to join Uber. And now I'll let him tell his story, I guess.

ROB SKILLINGTON: Yeah, when I joined, I'd actually joined the marketplace platform team. And then I definitely got fascinated with infrastructure. As I spent some time there scaling some of the dispatch systems, I found that stats were basically the fundamental underpinnings of what let us actually understand how we were breaking the system or pushing it to its limits. And I was fascinated with it. I wanted to work more on infrastructure. And thankfully, eventually, we got the opportunity to come in and try to help the company be able to scale that system up. And Martin actually was one of the first engineers on the team there. And after I convinced him to join, he managed to convince me to join his efforts soon after that.

CRAIG BOX: And you'd worked together in the past.

MARTIN MAO: We have worked together before. Yes, we were both graduates working at Microsoft and we were moving Office to Office 365. Quite a long time ago now, yeah.

CRAIG BOX: So what was that monitoring stack at Uber before you started M3?

MARTIN MAO: Before we started M3, the monitoring stack was a fairly standard open source stack for 2015. It was a Graphite-based stack that was running on top of a Whisper DB. And folks who know Whisper DB, it's a file system based database with no replication and no horizontal scale. So that did give us quite a bit of pain.

CRAIG BOX: Graphite is a tool that may not be as familiar to people who are used to hearing about some of the older tools, the Nagios and those kind of things, and then the more modern time series databases. What is Graphite?

MARTIN MAO: Graphite is an open source monitoring tool. It has its own metric ingestion format. You can ingest in either the statsd format or the Carbon format. These formats are pretty different from Prometheus in the sense that they are hierarchical based. They don't really have tags. So most of the metrics had really long dot-separated names. And it had its own separate query language, which, again, is slightly different from what we know today as PromQL because it was not a tag-based query language.
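
To make the contrast between hierarchical names and tags concrete, here is a minimal sketch in Python. The metric name, the field order in `FIELDS`, and both helper functions are hypothetical and chosen only for illustration; Graphite itself does not prescribe what each position in a dotted path means, which is part of the point being made.

```python
# Hypothetical naming scheme: what each position in the dotted path means
# is an assumption made for this example, not a Graphite standard.
FIELDS = ("region", "service", "endpoint", "metric")


def graphite_to_tags(path: str) -> dict:
    """Split a dot-separated Graphite-style name into named tags."""
    parts = path.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} segments, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def to_prometheus_selector(tags: dict) -> str:
    """Render the tags as a Prometheus-style series selector."""
    name = tags["metric"]
    labels = ",".join(
        f'{key}="{value}"' for key, value in sorted(tags.items()) if key != "metric"
    )
    return f"{name}{{{labels}}}"


tags = graphite_to_tags("us_east.dispatch.get_driver.request_count")
print(to_prometheus_selector(tags))
# request_count{endpoint="get_driver",region="us_east",service="dispatch"}
```

With the dotted form, a query has to know that the second segment is the service. With tags, the same series can be selected by any label in any order.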

CRAIG BOX: Let's dig a little bit into metrics and how they work. What are the kinds of things that Uber - or companies who have people who might be listening to this show - what are they trying to do with metrics, and how are they collecting them?

ROB SKILLINGTON: We find that a lot of people have many different use cases for metrics. At the companies that we've worked for and especially recently, in the open source ecosystem, we've found that people are monitoring applications, they're monitoring their infrastructure, and they're monitoring their business now. Or even just like fundamental operations that they're serving traffic to real users and they're trying to monitor how the code paths that are serving that traffic, how that's working and how their applications in the real world are basically working.

CRAIG BOX: So someone might have a set-up that will query the state of something every five seconds or minutes or something to that effect?

MARTIN MAO: Exactly. Exactly. And as Rob mentioned, you're not just querying your infrastructure applications. You're probably querying all the way from the business layer. And then from that point onwards, you're then querying the applications that your business runs on. And then from that point onwards, you're sort of querying the infrastructure that those applications run on. So you're sort of querying all the way up and down your stack.

ROB SKILLINGTON: And I would say that typically a lot of people used to deploy applications, and that would be how they introduced change to their system and how they controlled how their application worked in the real world. However, more recently, there's a lot of tools that let you switch things on and off dynamically. So I know LaunchDarkly is a company in this space, and a lot more products and applications are becoming more dynamic, based on the market that they're running in.

And so monitoring is becoming increasingly more important because it's not just about watching things when you do a deploy. It could be when someone at your business, at your company, or at your organization is making changes to the application, dynamically in real time without your knowledge. So yeah, we've definitely found it's exploding in popularity due to the complexity of systems and businesses and applications.

ADAM GLICK: How does monitoring change when you're doing something at Uber's scale?

MARTIN MAO: The first thing is the scale, right? So when you have a large complex technology stack, more monitoring data gets produced. So then you need technologies that can both store that data, and then you need technologies that can make sense of that data. So you're just sort of solving it at a much larger scale than you would at other companies, I guess.

ROB SKILLINGTON: Yeah, and I would also say that at companies that grow to certain sizes, you tend to find it's really hard to get everyone together and say, hey, we've got to do things this one way, and we've got to keep everything pristine. What tends to occur is that every sub-organization wants to do things just slightly differently. And then varying companies allow varying different amounts of entropy in how they operate. But it's becoming a lot more popular that how you want to do things and how you monitor things is up to you, or at least, some part of it is up to you. And that means that the monitoring is being used in a variety of different ways in a single company.

CRAIG BOX: When we talk about moving to microservices, a lot of that is around the idea, "well, I can use a different language for whatever makes sense to me as a team". Should teams be using different monitoring systems?

MARTIN MAO: What we believe is that teams should maybe be using different clients, since those clients are obviously written in different languages. But it's actually really beneficial for those clients to be emitting metrics in the same format, and for those metrics to be queried and stored in the same system, because what you really want to do is make your queries across a lot of these microservices. So if the metrics in these microservices have been produced in different formats and they've been stored in different systems, it actually makes the monitoring of your overall stack much more difficult. It is actually preferable for everybody to consolidate on one single query exposition format at least, regardless of the clients. And having that stored in one central place has actually been beneficial from what we've seen.

ADAM GLICK: How should a user go from a world of no monitoring into a world of appropriate levels of monitoring? I mean, there is a whole rabbit hole you can go down. Where do you start? What are the things to start with? What are most important? And then how do they expand on that?

MARTIN MAO: I think where people should get started, especially in our cloud native world and with the rise of the popularity of Kubernetes, people should definitely get started with a tool like Prometheus. I mean, it's a very simple tool to get started with. It's a single binary, so you can emit metrics into the single storage system. And you can query it and set up your alerts. That's definitely a great way to get started for sure. And then as you sort of scale that up and have more Prometheus instances, at a certain point, you may want to look at a system that's slightly more horizontally scalable. And that's when some of your long-term storage systems come into play here. And M3 is one such system, but there's also Cortex and Thanos options as well.

ROB SKILLINGTON: Yeah, and I would also say that you find a lot of the time that people don't necessarily instrument their application themselves if they've never used metrics and other things extensively. So what we tend to see works great is that you can take some existing set of data that exports automatically to metrics and start using that. So for instance, in the Kubernetes ecosystem, kube-state-metrics obviously exposes a whole bunch of internal Kubernetes metrics.

If you're using something like haproxy or some more advanced service mesh and you're able to collect metrics from that, then that basically allows anyone that develops backend software at your company to at least have some level of insight into how their software is running, without a developer having to add a Prometheus client library and start to instrument their application. So I think it's always great to get some stats out of the box for free into your system, just to allow developers to even explore what's possible.

CRAIG BOX: This concept of time series metrics and so on is a little different in the sense that you have your application just expose some metrics. And rather than pushing them somewhere, they have the agent, Prometheus in this case, connect to those services, pull that down, and then submit it to the database. Is there a leap that makes sense for people to start with?

MARTIN MAO: There's definitely two ways of doing it. One is scraping and one is pushing to an agent. Prometheus goes by the scraping route, and we think that's very applicable in a lot of cases. And in particular, that gives you the sort of control so that your system does not get overloaded by a lot of inputted metrics. But there are also a lot of other use cases out there where push-based is a very fine solution as well. So for a lot of things like batch jobs, where you want to run and emit a metric at the end of the batch job, the job might not be running at the moment when those metrics are going to be scraped. So I think both are very valid ways of producing metrics.
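
The difference between the two models can be sketched in a few lines of Python. Everything here is a stand-in written for this illustration: `Counter`, `render`, and `Collector` are hypothetical and are not the Prometheus client library or any real agent API. The only point is who decides when data moves.

```python
class Counter:
    """A tiny in-process counter, standing in for a real client library."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


def render(counters):
    """Render counters as 'name value' lines, loosely like a text metrics page."""
    return "\n".join(f"{c.name} {c.value}" for c in counters)


class Collector:
    """Stand-in for the monitoring backend; keeps the last payload per source."""

    def __init__(self):
        self.latest = {}

    def scrape(self, source, expose):
        # Pull model: the collector chooses when to call the target.
        self.latest[source] = expose()

    def receive_push(self, source, payload):
        # Push model: the sender chooses when to deliver.
        self.latest[source] = payload


collector = Collector()

# Long-running service: it only exposes its state and waits to be scraped.
requests = Counter("http_requests_total")
for _ in range(3):
    requests.inc()
collector.scrape("api", lambda: render([requests]))


# Short-lived batch job: it pushes once at the end, because it may have
# exited before the next scrape would have happened.
def nightly_batch(push):
    rows = Counter("rows_processed_total")
    rows.inc(500)
    push("nightly_batch", render([rows]))


nightly_batch(collector.receive_push)
print(collector.latest)
```

In the pull case the collector sets the pace, which is the overload protection mentioned above. In the push case the job controls delivery, which suits work that finishes between scrapes.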

ADAM GLICK: A lot of people use metrics gathering and dashboards to really look at real time or near real time performance of a system and know what they need to do to tune that appropriately. Sometimes people need to go back and take a look at things forensically, especially if something has happened and take a look at what caused failures to happen or degradations in service. How long should people keep data, and what does that mean in terms of how they should think about designing their monitoring systems?

MARTIN MAO: I don't know if there's one answer there. The way we thought about solving this problem is that you probably want to keep data around for different periods of time at different resolutions, right? So if you're looking at the more recent data, you probably really want that at high resolution, like a per-10 second basis or a per-second basis. And then as you sort of persist this data for like long-term capacity planning purposes or deep dive analytics, later on, you can have that at a much lower resolution. So it really depends on the use case, but you want a tool that can help you generate these different resolutions and store them for different retention periods.
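
As a rough illustration of storing the same series at a lower resolution, the following Python sketch averages 10-second samples into one-minute buckets. Averaging is an assumption made for this example; a real system might keep the last value, the maximum, or a sum, depending on the metric type.

```python
from collections import defaultdict


def downsample(samples, resolution):
    """Average (timestamp, value) samples into buckets `resolution` seconds wide."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % resolution].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# One sample every 10 seconds for two minutes: values 0.0, 1.0, ..., 11.0.
raw = [(t, float(t // 10)) for t in range(0, 120, 10)]

per_minute = downsample(raw, 60)
print(per_minute)  # [(0, 2.5), (60, 8.5)]
```

Twelve raw points become two, so the long-retention copy is six times smaller while still showing the overall trend.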

ROB SKILLINGTON: We're firm believers of basically being able to keep a default policy that makes sense, and apply that to basically all metrics that you collect, and then basically opt-in cases where you want to keep it for longer. And so, especially with M3, we focused really early on allowing people to match labels on your Prometheus metrics and then define policies that match those labels. So essentially, you can dynamically kind of work out where metrics are going to be stored and for how long, just by defining these policies.
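
The idea of a default policy plus label-matched opt-ins could look something like the Python sketch below. The rule format, the label names, and the first-match-wins behaviour are all assumptions made for illustration; this is not M3's actual configuration syntax or matching semantics.

```python
from fnmatch import fnmatchcase

# (resolution, retention) applied when no rule matches. Values are illustrative.
DEFAULT_POLICY = ("10s", "2d")

# Hypothetical rules: every label pattern in a rule must match for it to apply.
RULES = [
    ({"service": "billing*"}, ("1m", "400d")),
    ({"team": "capacity", "env": "prod"}, ("1h", "5y")),
]


def policy_for(labels):
    """Return the storage policy for a series, given its labels."""
    for matcher, policy in RULES:
        if all(fnmatchcase(labels.get(key, ""), pattern) for key, pattern in matcher.items()):
            return policy  # first matching rule wins in this sketch
    return DEFAULT_POLICY


print(policy_for({"service": "billing-api", "env": "prod"}))  # ('1m', '400d')
print(policy_for({"team": "capacity", "env": "prod"}))        # ('1h', '5y')
print(policy_for({"service": "web", "env": "prod"}))          # ('10s', '2d')
```

Because the decision is made from labels alone, a team can opt a set of metrics into longer retention without changing the applications that emit them.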

CRAIG BOX: Some of the older members of the audience may remember MRTG and RRDtool. These are systems that effectively stored, like you said, more recent data at a higher granularity and then fell off over time. Is that a similar pattern that we're seeing today, just with more advanced engines?

ROB SKILLINGTON: Yeah, I would say that it's definitely quite similar. What I guess is interesting today is that as I was kind of implying that we're basically seeing the ability to dynamically choose which ones go through that process. So typically, if you apply a five-year retention for all of your metrics, even if you're sampling it at one hour, that's still a very expensive operation. And so we obviously want you to be able to choose subsets of those metrics to go through that operation. But fundamentally, the down-sampling procedure is just as it was before.
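
A back-of-the-envelope calculation shows why applying five-year retention to everything adds up, even at one sample per hour. The series count and the bytes per sample below are assumptions picked purely for illustration; they are not figures from the conversation or from M3.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years


def samples_per_series(retention_years, resolution_seconds):
    """How many samples one series accumulates over the retention period."""
    return retention_years * SECONDS_PER_YEAR // resolution_seconds


SERIES = 10_000_000    # assumed number of active series
BYTES_PER_SAMPLE = 2   # assumed compressed size of one sample

per_series = samples_per_series(5, 3600)
everything = per_series * SERIES * BYTES_PER_SAMPLE
subset = per_series * (SERIES // 100) * BYTES_PER_SAMPLE  # opt in only 1% of series

print(per_series)   # 43800 samples per series
print(everything)   # 876000000000 bytes, per replica
print(subset)       # 8760000000 bytes, per replica
```

Under these assumptions, keeping every series costs about a hundred times more than keeping the one percent that long-term planning actually needs, which is the case for choosing subsets.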

ADAM GLICK: Why did you build M3?

MARTIN MAO: Inside Uber, because of the environment we were in and the complexity of our technology stack, as we were moving to microservices and moving away from physical hosts to containers, a lot more monitoring data was just being produced. You're tracking a lot more things now. And not only are you tracking a lot more things, but you're tracking a lot more relationships between these things.

So as the number of things grow, the number of relationships exponentially grows. So all of that sort of led us to having to at least find or build a system that could store all of this data. And having a look on the market back in the day in 2015, there was nothing available in open source or even on the commercial market that could really store the amount of data that we were producing. And that was what led us down the path of building our own solution. And we're happy we decided to do it in open source.

ROB SKILLINGTON: Yeah, and at the time, Prometheus had just been open sourced. It was in its early days and was very explicitly wanting to ensure that you operate a single Prometheus, and Prometheuses don't talk to any other Prometheus instances unless it was via federation. And so within an organization as large as ours, we didn't really want to have to work out all the various different stats that would get federated and wouldn't get federated. And we just wanted a single pane of glass where basically all the metrics could get collected without really having to partition or shard any of the metric space.

CRAIG BOX: You reached out to other companies who were operating at web scale at the time, and talked to their logging teams. What did you learn from some of those teams?

MARTIN MAO: Where we couldn't figure out how to solve the problem, we went and asked other people who had solved the problem before. There's only very few companies at the web scale that had solved this problem before. And when we talked to all of them, they recommended us that they would-- the only way to solve the problem from talking to them was that we would have to build our own. So they sort of gave us a lot of their lessons learned from building their own systems.

And actually, one piece of advice they gave us was to actually build the thing in open source from day one. So one interesting thing was that all of them had thought about open-sourcing their projects internally at some point in time, but because they didn't build in open source from day one, the project took on some sort of internal dependency. And because of that, they could never open source it eventually. So one recommendation we got was to build in open source from day one. And that's what we ended up doing.

CRAIG BOX: Now there's a lot of love behind Prometheus in the community. Is M3 complementary to Prometheus, or is it a different tool entirely?

ROB SKILLINGTON: It's remote storage for Prometheus the way that users are using it. And that is how we view it. It's essentially a way to keep data at very long retentions all together in one place. And it is also a clustered solution. A lot of people think of Prometheus as something that gets you one thing out of the box very quickly, and that's metrics, some level of visualization, and an end-to-end alerting platform. And then it focuses very well on doing that. And everyone should get started with Prometheus. And just Prometheus suits most organizations.

And then essentially, when you need to think about, yes, storing data from a whole bunch of different Prometheus instances, that's when you really want to think about remote storage. And that's kind of where M3 comes into the picture. There are other projects out there as well, such as Thanos, which is a CNCF project, and Cortex as well.

With M3, we focused on the database being always at least three replicas. And we do quorum reads and writes for storing and reading the data. So by its very nature, it is made to solve a scale-out scenario and is therefore high availability without any data loss and ensuring we have consistency, as you kind of lose machines and replace them, it'll stream data between one node and another as you add it. So it's very much like a typical distributed database. It doesn't store things in object storage and that kind of matter. So we really think about it like an industrial grade solution, I guess. And that's kind of as users approach that, that's where we think M3 fits best.
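
To show what quorum reads and writes across three replicas mean in practice, here is a deliberately simplified Python sketch. The `Replica` class and both functions are hypothetical and are not M3's implementation; a real database also has to deal with versioning, repair, and streaming data to replacement nodes, none of which is modelled here.

```python
from collections import Counter


def quorum(replica_count):
    """Smallest majority of the replica set (2 out of 3, 3 out of 5, ...)."""
    return replica_count // 2 + 1


class Replica:
    """An in-memory stand-in for one database node."""

    def __init__(self):
        self.up = True
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False
        self.data[key] = value
        return True


def quorum_write(replicas, key, value):
    """The write succeeds only if a majority of replicas acknowledge it."""
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= quorum(len(replicas))


def quorum_read(replicas, key):
    """Return the value a majority of replicas agree on, or raise."""
    responses = [r.data[key] for r in replicas if r.up and key in r.data]
    if len(responses) < quorum(len(replicas)):
        raise RuntimeError("not enough replicas responded")
    return Counter(responses).most_common(1)[0][0]


nodes = [Replica(), Replica(), Replica()]

nodes[2].up = False                      # lose one machine
ok_one_down = quorum_write(nodes, "cpu", 0.42)
value = quorum_read(nodes, "cpu")

nodes[1].up = False                      # lose a second machine
ok_two_down = quorum_write(nodes, "cpu", 0.99)

print(ok_one_down, value, ok_two_down)   # True 0.42 False
```

With three replicas the cluster keeps accepting reads and writes after losing one node, and refuses rather than risking inconsistency after losing two.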

ADAM GLICK: You've stored a lot of this data. How do people access that data?

MARTIN MAO: With the open source M3 platform, you can pair it up with open source visualization tools, like Grafana, to visualize your data. On the alerting side of things, you can still use Prometheus Alertmanager to read data straight out of M3 as well. So there's a lot of open source tools you can piece together with M3 that let you leverage the scale of the underlying storage system.

ROB SKILLINGTON: With regards to the types of data people are pulling out of this at long retention, some of the times, it's like around capacity planning. Say I want to look at, say, how many of my applications are running and how much resources are they using per unit of transactions that the business or organization is doing. And so if that over time means you're using way, way more resources and you're doing about the same number of operations, that's usually a sign that performance is degrading overall. And then you're using your resources less wisely.
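
The capacity planning signal described here is a ratio tracked over time. The Python sketch below uses made-up monthly numbers, clearly not real data, to show the calculation: resources used per unit of business transactions, and how that ratio drifts.

```python
# Illustrative, made-up numbers: (month, average CPU cores in use, transactions served).
monthly = [
    ("month-1", 400.0, 2_000_000),
    ("month-2", 440.0, 2_000_000),
    ("month-3", 480.0, 2_000_000),
]


def cores_per_million(cpu_cores, transactions):
    """CPU cores consumed per one million transactions."""
    return cpu_cores / (transactions / 1_000_000)


ratios = [(month, cores_per_million(cores, txns)) for month, cores, txns in monthly]
first, last = ratios[0][1], ratios[-1][1]
growth = (last - first) / first

for month, ratio in ratios:
    print(f"{month}: {ratio:.1f} cores per million transactions")
print(f"change over the window: {growth:+.0%}")
```

Here the transaction volume is flat while the cores in use rise, so the ratio climbs by a fifth. That is the pattern described above as a sign that performance is degrading.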

ADAM GLICK: Is there a query language people can use?

MARTIN MAO: We only support the open source query languages. We decided not to build a proprietary query language with M3, so you can query either using PromQL, which is the Prometheus-based query language, or, if you have legacy statsd or Carbon metrics, you can query them using the open source Graphite query language as well.

CRAIG BOX: At the time of building M3, you will have had to handle ingestion and query, as well as database. And Prometheus didn't exist at that point in time. What did you build in M3? And then what did you do when you saw the community building behind Prometheus?

MARTIN MAO: We actually built a tag-based custom query language, called M3QL, as well as a custom sort of tag-based ingestion format as well. Because we really believed in the power of tag-based query languages, as opposed to the legacy flat-based query languages. So we sort of had that internally - that was in use inside Uber Technologies when we first built M3. And then when we saw the popularity of Prometheus, we decided to ditch that and then sort of support the native Prometheus ingestion formats and support PromQL as a query language. Because that's just one that a lot more of the open source community understand.

ADAM GLICK: You've decided to write all of your code in Go. Why did you choose Go?

ROB SKILLINGTON: Go's an incredibly flexible language. And it's very pragmatic. We're Australians. We're pragmatic people. I mean, the team all basically got behind it. But everyone was very effective at being able to write Go from day one. And although there's a lot of Rust fans on our team, Go honestly is an extremely diverse effective language that compiles extremely quickly. It allows you to manage memory exceptionally well with a lot of control. And so it honestly lets us do some pretty amazing things with basically people that only picked it up one or two years ago as well.

MARTIN MAO: Yeah, I would say if you think about running a database, Go is probably not the first language that comes to mind. Probably something like C++ would be a more obvious choice. But I think for us, because we had to build this in such a short period of time, we were trading off how easy it would be to ramp up an engineer on the project. And having the project written in Go made that sort of ramp up much quicker than trying to find 20 senior C++ engineers to work on a project.

ROB SKILLINGTON: The memory management model of Go, while it is garbage collected, it does allow you to really control what's on the heap and what's on the stack a lot more, as opposed to something like Java or other typical early garbage collected languages. So it's really been fantastic to be able to-- you kind of end up writing it not like your typical Go, some parts of the code base. But it really lets you choose when and when not to use features and allows you to use it as a purely systems language in certain parts of the code base.

CRAIG BOX: M3 is obviously being targeted at the cloud native community who are in large part using Kubernetes. Uber were using Apache Mesos. I'm not sure what their current situation is. But was that built by the team at Uber, even though they weren't going to use it?

MARTIN MAO: We did end up building a Kubernetes operator for M3 just because a lot of the community and a lot of the users of M3 outside of Uber were on Kubernetes. So it made sense to build that for the community. And Uber was behind that decision as well. They were supportive of the decision to build that operator.

ROB SKILLINGTON: And we also performed basically monitoring of the on-premise data centers from a cloud-based deployment. And so there, we operated on top of Kubernetes in the cloud. And so the operator was able to help us scale up and scale down that deployment collaboratively, and give us insight into all of our on-premise networking, routing, data center temperatures, network fabric, and that kind of thing.

ADAM GLICK: M3 is a relatively new open source project, or at least, publicly available open source project. How are you working to build a community around M3 and its contributions?

MARTIN MAO: M3 is a fairly large complex piece of technology, so it's not something that any developer can sort of pick up and get going with straight away. And we struggled a little bit in that sense. So the things we're doing now to help build a community around it is we're actually dedicating a lot of engineering effort to help engineers who are interested in contributing to M3 ramp up on the project. Because that's generally a process that takes a bit of time. So we dedicate engineering effort to help them sort of ramp up on the project, get them comfortable with the concepts, and of course, do all the code reviews for them as well.

ROB SKILLINGTON: M3 itself is built on etcd. And the founders there recommended that we run a community meeting every month. And so, that is one piece of advice that we took, and we basically do that every month. And then we have of course a Slack channel that allows people to ask us questions about their experiences. And we can give them feedback as quickly as humanly possible. So we're really just trying to provide those avenues for access for people to get started and for people to learn quickly.

MARTIN MAO: I just want to add one clarification there when Rob said, it's built on etcd. It's the cluster management that is built on etcd. The actual time series database is custom.

ROB SKILLINGTON: Correct, yep.

CRAIG BOX: You are the two co-founders of Chronosphere, a company which you founded this year to take M3 forward. How did you come to found that company?

MARTIN MAO: It got to a point earlier this year where we got a lot more adoption of the open source M3 project. And Rob and I have loved working on the project, so we decided to create a company whose sole goal is to continue to support the community moving forward.

CRAIG BOX: How did you decide when the right time was to make the announcement and come out of stealth?

MARTIN MAO: We were in stealth for more than six months. The right time was mostly because we had purchased a booth at a conference [CHUCKLING], so we had decided that we probably needed to come out of stealth before people came to our booth and were quite confused at what the company was.

ROB SKILLINGTON: Yeah, and we also have been developing our hosted platform. So we wanted to make sure that when we were talking to people about our products, that they had something tangible they could get their hands on and something that they could see the value of the platform, rather than talking about it in an abstract manner. So that helped guide us towards as well when to kind of talk to people about the company and the platform.

ADAM GLICK: We talked earlier about what people would get started with. And you mentioned that getting started with something like Prometheus is a great way to get started. At what point should people be taking a look at what you've created?

ROB SKILLINGTON: When you're starting to run into a lot of either organizational or technology pains running your monitoring stack. Essentially, either you have multiple teams all running different things, or they're leveraging the same type of monitoring but they're all doing it slightly differently, and not a single Prometheus can hold all that data. Or you may even be able to scale a Prometheus deployment to 10s, 20s, and 30s of instances. And at some organizations, that's sane.

However, when you get to a point where it becomes quite difficult because you're pinning applications to certain Prometheuses, that is when you're basically performing human partitioning of the monitoring data. And so if you can find a way to ensure that these metrics can all be collected in a fair and sane way and then exported at the same time to a central system that becomes quite attractive to you, that's probably when it's worth looking at a system like M3.
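To make "human partitioning" concrete, the sketch below contrasts a hand-maintained map that pins applications to monitoring instances with assigning each series to a shard by hashing its identifier. The application names, instance names, and the use of SHA-256 are assumptions made for the illustration; this is not a description of how M3 actually places data.

```python
import hashlib

# Human partitioning: someone maintains this map by hand and must
# update it whenever an application is added or outgrows its instance.
MANUAL_PINNING = {
    "checkout": "prometheus-a",
    "search": "prometheus-b",
}


def pinned_instance(app: str) -> str:
    """Look up the hand-assigned instance; unknown apps raise KeyError."""
    return MANUAL_PINNING[app]


def shard_for(series_id: str, num_shards: int) -> int:
    """Assign a series to a shard from a stable hash of its identifier.

    No per-application decision is needed: any new series gets a
    shard automatically and deterministically.
    """
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


series_id = 'http_requests{service="payments"}'  # hypothetical series
shard = shard_for(series_id, num_shards=64)
print(shard)
```

In the first model a new application such as `payments` needs a human to decide where it goes; in the second, placement follows from the series identifier alone.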

CRAIG BOX: When you're building a company, how do you decide which parts of your product you want to be commercial and which parts you will continue to contribute to your open source project?

MARTIN MAO: I think the way we look at this is a lot of the features that we're not putting into the open source product are features that large enterprises need to run this across multiple teams in the organization. So we want the open source product to be able to be scaled up to billions of time series. But particular enterprises want particular security controls, help with compliance of this data. Those types of features are things that we do not put in the open source product.

ROB SKILLINGTON: Getting someone from nothing to something quickly, depending on your organization, may be easy or more difficult. And so we basically try to make sure that features that can help you run at very high scale without having to have 10 to 20 engineers helping you run that system is where we're targeted for the hosted platform. So we're really just making sure that if it's not your core business at high scale and you don't want a dedicated team towards it, those are the type of features we're looking to build. So that's why there's still a whole bunch of room for the community to get a whole bunch of value out of the system all in the open source Apache 2 licensed version of M3.

ADAM GLICK: The open source project's name is M3. Are you really big car fans, or is there another story there?

ROB SKILLINGTON: At Uber, the project internally was just called Stats. And essentially, that project name wasn't cutting it, and we were told we had to name it something. And we refused to name it anything at all. So someone basically took the initials of someone on the team and added the word metrics to it and shoved it into an acronym. And that's how it came to be.

CRAIG BOX: No comment on which team member it is, Martin?

MARTIN MAO: I mean, you know, yeah, no comment there. [LAUGHING] It could really be anyone on the team.

CRAIG BOX: All right, Rob and Martin, thank you so much for coming on the show.

MARTIN MAO: Perfect, awesome. Thank you for having us.

ROB SKILLINGTON: Brilliant. Thank you so much.

CRAIG BOX: You can find Martin Mao on Twitter at @martin_c_mao. You can find Rob Skillington on Twitter at @roskilli. You can learn more about Chronosphere at chronosphere.io. And you can learn more about M3 at m3db.io.

[MUSIC PLAYING]

CRAIG BOX: It has been another wonderful year of this podcast, so thank you all for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend that they should binge listen if they have a big holiday break. If you have any feedback for us, you can find us on Twitter at @kubernetespod, or reach us by email to be read in January at kubernetespodcast@google.com.

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes. Until next year, take care.

CRAIG BOX: See you next year.

[MUSIC PLAYING]
