#37 January 22, 2019

Prometheus and OpenMetrics, with Richard Hartmann

Hosts: Craig Box, Adam Glick

Richard Hartmann is a member of the Prometheus Team and the founder of the OpenMetrics project, which aims to replace SNMP with a modern format for transmitting metrics. He joins your hosts to discuss both projects, and how Cloud Native technology can improve the datacenter.

No soup for you! Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

ADAM GLICK: Hi. And welcome to the "Kubernetes Podcast from Google." I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

Are you a LEGO collector, Adam?

ADAM GLICK: I used to love LEGOs when I was a kid. I still have a bunch of them sitting around, actually.

CRAIG BOX: Yeah. A lot of people like to not only buy LEGO to build, but they sort of hang onto it. I think, first of all, it might have been as a childhood thing.

ADAM GLICK: Oh, yeah.

CRAIG BOX: Nostalgia. And then there's now people who are-- apparently it's a really, really good investment. An article in Bloomberg last week said that in one case, a Star Wars kit that retailed in 2014 for $4 [was] one year later going for $28 on eBay-- a 613% premium.

ADAM GLICK: Not bad.

CRAIG BOX: Unfortunately, I don't think that works if everyone tries to do it. If we have a glut of 2014 Star Wars Darth Revan, whoever that was, I just don't think the market will be there.

ADAM GLICK: You could end up with the great crash of the plastic blockchain. Ba-doom tsssh.

[LAUGHING]

CRAIG BOX: We left a little Easter egg in the show notes last week, and I'd like to say thank you to everyone who reached out-- more than I expected, actually, a couple of comments. Unfortunately, no one commented on the correct pronunciation of the tool, the command line tool name. We thought that one was obviously an easy setup.

ADAM GLICK: Forever it will remain a mystery and debate.

CRAIG BOX: And no one asked for your hot and sour soup recipe.

ADAM GLICK: Very, very sad. Although we still have some in the freezer here. We will not send that out, though. You want to get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

ADAM GLICK: Knative, the serverless framework for Kubernetes, has announced its 0.3 release. This is the first release in a six-week release cycle. New features include scaling on CPU usage as well as on requests, and publishing services only within a cluster.

CRAIG BOX: Tetrate have announced Service Mesh Day, the first ever industry conference dedicated to service mesh technology. It's a two-day event on March 28th and 29th in San Francisco. Speakers announced so far include Google VP and CAP theorist Eric Brewer, and Matt Klein, author of Envoy, who we spoke to in episode 33. A call for papers is open until February the 8th. And there are 30 full-pass scholarships available for students and underrepresented minorities.

ADAM GLICK: Apple has open-sourced the Record Layer for FoundationDB and published a paper on it. FoundationDB is a distributed database which was acquired and open-sourced by Apple. Various layers on top of the FoundationDB base enable different behaviors like traditional relational SQL and document storage. The Record Layer is also of note as it powers Apple's CloudKit service.

CRAIG BOX: Tumblr is moving to Kubernetes and has open-sourced three utilities they've been using along the way. A sidecar injector helps ensure migrated workloads are configured with Tumblr's defaults. A config projector takes config from files and Git and pushes it into ConfigMaps. And the secret projector performs a similar function with secrets and keys. These utilities may be useful to you as is, or they might give you some idea of the challenges you may have migrating an application to Kubernetes.

ADAM GLICK: Interested in gVisor, the sandbox container runtime? Adin Scannell, one of the primary authors, gave a presentation at KubeCon in November, which has been published as a video and transcript on InfoQ. Adin talks about the architectural challenges associated with user-space kernels, the positive and negative experiences with writing in Go, and how gVisor ensures API coverage and compatibility.

CRAIG BOX: Aleksa Sarai from SUSE loves the Open Container Initiative, but doesn't have much love for tar, the venerable tape archive file format. He's posted a long but surprisingly readable rant pointing out the problems with tar and how OCI v2 will address them. More posts are promised in the series. So if you're a fan of container internals, check it out.

ADAM GLICK: Do you know how much money you spend doing nothing with your Kubernetes cluster? Webb Brown from kubecost gives you some insight into how to monitor the cost of your cluster using Grafana and Prometheus. He builds on the work by Karl Stoney at Auto Trader UK. And this week's focus is on idle resources, talking about the trade-off of maintaining overhead versus paying for services you're not using. We look forward to Webb trying out the new vertical pod autoscaler on GKE.

CRAIG BOX: RK Kuppala from Searce has been looking at running Microsoft SQL Server on Kubernetes. In two posts, he looks first at the standard installation, and then at setting up redundancy using AlwaysOn Availability Groups, the native HA feature of SQL Server. Along with the posts, he includes example YAML files, which make installing SQL Server on GKE as easy as rolling out an open-source database.

ADAM GLICK: Namely, an HR SaaS company, was an early adopter of Istio and has started a blog series on their experience with Istio in production. Shray Kumar has written a crash course-- hopefully with little to no crashing.

The first post explores the relationship between Istio, Envoy, and Kubernetes concepts and includes a great diagram posted recently by Jimmy Song, which shows traffic flow through IP tables with the Istio sidecar. This should be recommended reading for anyone interested in Istio.

CRAIG BOX: Zalando, a German e-commerce company, had a failure in their Kubernetes environment last year. To their credit, they shared the story so others could learn from it. And on top of that, architect Henning Jacobs has set up a repository of Kubernetes-related failure stories and postmortems for the community to read.

One of the most recent stories published is from Dan Woods at Target, an actual commerce company in the US. After a network outage in their Kafka message queue system, their Kubernetes logging sidecars were unable to connect, which caused them to start using much more CPU than usual. The nodes went unhealthy, and Kubernetes kindly rescheduled the pods onto other nodes, which went unhealthy also.

The churn in the Kubernetes environment impacted their Consul service catalog, as all the pods attempted to register with it. That, in turn, impacted Vault and the deployment system. Dan's write-up reaffirms his suggestion to run smaller clusters, and more of them, in order to restrict the blast radius of a distributed failure.

ADAM GLICK: The CNCF this week published a nice beginners view of basic security practices which fall into the categories of patch, isolate, control, and log. It's a nice primer if you're new to security. If you're old hat, this might be a bit repetitive for you.

CRAIG BOX: Finally, TechCrunch reports that Google is responsible for about 53% of all commits to CNCF projects. It's a nice article that shows the depth of commitment and contribution that Google makes to the CNCF and the world of containerized applications and infrastructure.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

Richard Hartmann is the monitoring lead for SpaceNet, a member of the Prometheus team, and the creator of OpenMetrics. Welcome to the show, Richard.

RICHARD HARTMANN: Thank you. Thanks for having me.

CRAIG BOX: I understand SpaceNet is a data center business?

RICHARD HARTMANN: Data center and ISP.

CRAIG BOX: OK.

RICHARD HARTMANN: So basically, we do services very much on bare metal for customers who are not yet cloud native. Course, that's the niche we found, and people keep paying good money for it.

CRAIG BOX: And you've described yourself as a Swiss Army chainsaw.

RICHARD HARTMANN: Yes.

CRAIG BOX: What does that mean?

RICHARD HARTMANN: I tend to be thrown at problems which are kind of hard. And I'm stubborn enough to just follow through. And that's kind of how I got that name.

CRAIG BOX: How does cloud-native technology like Prometheus relate to that kind of business? How did you get involved with these projects?

RICHARD HARTMANN: I was looking for a new monitoring system, both for the conference I run and for my company. And we were looking at tons of things. And basically, Prometheus came out as /the/ thing. I took a look in September of 2015, and after literally one day of POC-ing I was already in love with it. I've just stuck around ever since.

ADAM GLICK: For those that aren't familiar, can you describe what Prometheus is?

RICHARD HARTMANN: Prometheus came into being because a few ex-Googlers were quite unhappy with what they found in the open-source and also in the paid software world. So they basically reimplemented large parts of Borgmon into Prometheus.

It's a time series database. It's highly scalable. It can ingest, as of right now, more than a million data samples per second, consistently and long-term. And that's the one thing which is really interesting about it.

The other thing is basically it's like a vector math engine for metric data. So you don't just have your monitoring data, and you do a few simple things. You can run really, really complex analysis and math on that data, which is insanely powerful when you start to combine your data.

To be able to actually combine this data, you need to be able to slice and dice it. Traditional monitoring systems would have hierarchical data models where, for example, you have your region, you have your data center, you have your site, you have your rack, you have your customer.

Someone else might need the same data, but in a view where the customer comes before the region. So already, your hierarchical data model breaks down. Course, once you're done doing that, it's already wrong for everyone else and probably even for yourself.

By attaching label sets, as in just key-value pairs, to data, you can create your own structure. And you can just slice and dice your n-dimensional label set, or your n-dimensional matrix, into whatever you currently need.

And there is a ton of label-matching, as in just magic happening in the background, which, if you do proper work on your labels and on the data you put into Prometheus, will just come out as the right thing more often than not.
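
For illustration, here is the kind of PromQL query this label model enables-- the metric and label names are hypothetical:

    # Per-region request rate for one customer, across all EU data centers
    sum by (region) (
      rate(http_requests_total{customer="acme", datacenter=~"eu-.*"}[5m])
    )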

CRAIG BOX: The state of the art for monitoring tooling before Prometheus was presumably MRTG, for example. I understand that would have a round robin database that's not storing every data point. My recollection of this is that it's keeping five-minute granularity for the last hour, and then just throwing that away, and so on.

So the model of Prometheus is more we keep every data point? Or do we only keep them for a certain period of time?

RICHARD HARTMANN: Yes. There are two approaches-- MRTG, based on RRDtool, which basically decays data as the data gets older. And there's also something newer, like, for example, Nagios or Icinga or Zabbix. But all of these are quite limited in what they do.

Prometheus doesn't actually change the data. Once it's persisted on disk, it stays that way. So if you scraped your service every minute 10 years ago, you still have that one-minute granularity back at that time, which is a plus in some regards and a con in other regards. There are ways we're working on to work around this and to actually compact that data. But this is not yet ready.

CRAIG BOX: Is it feasible for someone to maintain all the historical data for their service? Or would you think the most common use cases today are saying, well, disks are getting bigger, and so on. Should I keep only the last two years? Or should I aim to keep everything?

RICHARD HARTMANN: This is a highly contentious point, even within the Prometheus team. The common wisdom used to be that you retain your data for two weeks, and you drop everything which is older. Personally, I've kept all data since late 2015. So data which we collected back then is still available today.

The truth is probably somewhere in the middle for most users. If you really care about persisting your data long-term, you would probably be looking at something like Cortex or Thanos or InfluxDB-- one of those other tools where we have our remote read/write API, where you can just push data to those other systems and persist it over there.
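
As a minimal sketch of what wiring up such a remote store looks like in prometheus.yml-- the endpoint URL here is made up:

    # Forward samples to a long-term store via the remote write API
    remote_write:
      - url: "http://cortex.example.com/api/prom/push"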

CRAIG BOX: A lot of people have multiple environments that they monitor with Prometheus. The general recommendation is to run Prometheus inside the environment it monitors. So if you're monitoring, say, different data centers, you run a Prometheus for each data center inside that data center?

RICHARD HARTMANN: Yes, but not always. So what you would typically do is you would try and keep your Prometheus server as close to your data as possible. So if you have a Kubernetes cluster, typically you would keep Prometheus close to that cluster. If you have a data center, you would run the Prometheus servers in that data center, and so on.

But also, you obviously want to have visibility into the state of those Prometheus servers from the outside. Of course, if that whole site goes down and you don't have any monitoring anymore, that's not very useful.

CRAIG BOX: Exactly.

RICHARD HARTMANN: Also, you want to aggregate the data between those data centers or those deployments or whatever you have. So what you would typically do, let's say, you have 10 data centers. You would have at least two Prometheus servers per data center. You would probably have more, maybe for different teams, maybe for different deployments. You might have different ones for your cloud stuff and for your actual hardware which is running the data center.

And then you would use so-called federation, which is the system for Prometheus to emit data to other Prometheus servers. You can define rules for what type or kind of data is emitted, based on labels, based on names. And then you can-- well, you don't really compress that data, but you just limit what you emit to those other servers.

CRAIG BOX: Right. So you'll collect data on your one-minute granularity within your cluster. But you might send only-- every 10 minutes you might send that off to the--

RICHARD HARTMANN: Exactly, for example. Or you might collect really detailed CPU analysis for stuff within your data center. But you might only emit large averages or something which you care about on the global scale.
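
For reference, federation as described is configured as an ordinary scrape job on the global Prometheus server-- the hostnames below are hypothetical:

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job="node"}'    # only pull series matching these selectors
        static_configs:
          - targets:
            - 'prometheus-dc1.example.com:9090'
            - 'prometheus-dc2.example.com:9090'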

CRAIG BOX: So once we have all this data in a time series database, we have access to the state of our environment now. And we can then obviously query back to as long as we've been collecting it.

Prometheus is often mentioned in the same breath as Grafana or other tools that are able to do aggregation. Basically, are they doing queries against Prometheus in its own query language and then using that to draw graphs? What's the process of visualizing this data or making it useful?

RICHARD HARTMANN: Grafana is not so much about aggregating data. It's more about just visualizing that data. Grafana has a ton of different backends for different databases, one of them being Prometheus. And basically, you define PromQL queries within Grafana, and those get executed against your Prometheus servers.

There is a templating engine-- or they renamed it to variables-- where you can basically do stuff within Grafana. Like, for example, collect customer numbers and then have a dropdown of customers or of sites or whatever, and re-inject those variables into PromQL queries, which is quite powerful for building premade dashboards for drilldown.
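
As a sketch of that pattern, with hypothetical metric and label names-- a Grafana variable is populated from Prometheus, then re-injected into a panel's query:

    # Variable query: fill the 'customer' dropdown from a label's values
    label_values(http_requests_total, customer)

    # Panel query: use the selected value inside PromQL
    rate(http_requests_total{customer="$customer"}[5m])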

ADAM GLICK: There are a number of projects out there, both commercial and open-source, that take that time series data-- in many cases, logging and monitoring data-- and not only collect it, but also handle the visualization, when you think about tying the Grafana pieces in.

How do you think using Prometheus and Grafana is different from those others? Why should people focus on using that, especially in the world of Kubernetes, where I often hear it used?

RICHARD HARTMANN: There are different answers to that question. The first and most simple one: if you're using Kubernetes, you have full, native integration between Prometheus and Kubernetes. So you have service discovery, and you have all these things. So it's basically more or less turnkey. You just deploy both, and it just runs and does the right thing. So that's the really short answer for anyone running Kubernetes.

As to why the combination of Prometheus and Grafana, basically we still have our exploration UI within Prometheus. We used to have something for dashboarding within Prometheus, but we deprecated it in favor of Grafana. Of course, Prometheus people tend to be good at backends and stuff and not so much at UI. So we decided to let the people handle it who can actually handle it as opposed to us.

And the third one was-- so the combination of Prometheus and Grafana itself is basically just: you have a good data backend and a good UI. How does this compare to others? If you look at the more classical ones, where you don't really even have the concept of time series, that's a completely different league.

But even looking at, for example, InfluxDB, we tend to be quicker than Influx. I mean, they have the advantage of supporting more than just metric-based or number-based time series, so obviously, they have more complexity than us. So we can optimize a lot more in certain regards. Still, if you follow the old Unix thing of doing one thing and doing it well, that's what Prometheus is there for.

ADAM GLICK: I think of a lot of these things as part of the cloud-native infrastructure and architecture. What about on the data center side of things? Is this something that people use commonly in the data center? Or is this really about the new way of building applications?

RICHARD HARTMANN: I see Prometheus as firmly in the world of cloud native, but not limited to the world of cloud native. So to me, cloud native is basically a lot of good operating principles adapted to what we currently consider a good and modern stack of technology. But that doesn't change the fact that the underlying good principles of good operations haven't really changed all that much over time.

So if we're trying to do things right in the past or right now, you would probably end up at roughly the same operating principles. Even though terminology may change, you might have new concepts, for example, a shared responsibility between developers and operations, common error budgets, all these things, these tend to come as new concepts. But the underlying truth is still about good operations.

So using Prometheus for old stuff like, for example, network hardware, as in keeping the internet alive or for data centers is becoming more and more common. Because people who really care about good operations in those fields, they look at the tools available, and they see that Prometheus is really good. So they adapt it for their needs. And it's quite easy to do so in the meantime.

CRAIG BOX: If we think of Prometheus's goal as providing a monitoring system, it needs to have a query language in order to be able to query. But then, it also needs to have a place to store that, a persistence format. And so basically, it includes a database.

Most line-of-business software doesn't include a from-scratch implementation of a database. It uses something else. It connects out to MySQL or a Mongo or whichever paradigm works for it. Why does Prometheus include a time series database? Was there none at the time, for example?

RICHARD HARTMANN: Yeah. There was just nothing else which we thought would be working. And we are really, really efficient in what we do with the current data engine. So we basically have an int64 for the timestamp in milliseconds, and we have a float64 for the actual value. So you end up at 16 bytes per data point. If you compress this with our storage, you end up at a little bit more than one byte per data point. So you compress by more than an order of magnitude. This is obviously something we didn't do from day one. But still, we just didn't find anything which was as quick and as reliable and as good as we wanted it to be.
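
To put rough numbers on that, as a worked example at the ingest rate mentioned earlier, and assuming the commonly cited average of ~1.3 bytes per sample for the Prometheus 2.x storage engine:

    1,000,000 samples/s x 16.0 bytes = ~16.0 MB/s raw
    1,000,000 samples/s x  1.3 bytes = ~ 1.3 MB/s on disk  (~12x smaller)

Actual ratios depend on how regular the scraped values and timestamps are.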

CRAIG BOX: You said before that there is integration between Kubernetes and Prometheus. And a lot of people will install Prometheus out of the box. And they know that it just works. And then there are libraries that they can include in their application if they want to expose metrics in a format that Prometheus can consume.

But for a layperson, how would you explain the process by which data from my application gets pulled by or pushed to Prometheus?

RICHARD HARTMANN: Data always gets pulled in a Prometheus world. You have an HTTP endpoint, which serves basically plain text. We used to also have a proto format, but we basically only have text now, because we didn't need anything else. And you expose this text over HTTP. Prometheus comes along, scrapes all the data, and that's it.

You have your libraries, as you already said. There's also libraries by other people. I know someone who just does prints within his C code and dumps it onto an HTTP server. It's really, really simple to emit data to Prometheus, which is part of the reason why we have so many adoptions.

We have more than 300 registered ports on our wiki from people who wrote exporters or integrations. And it's really, really easy to do that. Even if you're not a good coder, even if you're stuck with shell scripting, it's really easy to emit data to Prometheus. The barrier to entry is really low.
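
For illustration, the plain text a target exposes can be as simple as this-- the metric here is hypothetical:

    # HELP http_requests_total Total HTTP requests served.
    # TYPE http_requests_total counter
    http_requests_total{method="get",code="200"} 1027
    http_requests_total{method="get",code="500"} 3

Anything that can print lines like these behind an HTTP endpoint-- a shell script included-- can be scraped by Prometheus.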

ADAM GLICK: Given all the work that you're doing with Prometheus, how did that lead you into creating the OpenMetrics Project?

RICHARD HARTMANN: Politics. It's really hard for other projects, and especially for other companies, to support something with a different name on it. Even though Prometheus itself doesn't have a profit motive, so we don't have to have sales or anything, it was hard for others to accept stuff which is named after Prometheus in their own product. And this is basically why I decided to do that.

Of course, initially, I come out of the networking space. And the thing is, we have SNMP. And SNMP is horrible. And I wanted to have this new cool thing. But I was certain they wouldn't be supporting something which wasn't generic.

CRAIG BOX: If it was called the Prometheus Network Model, PNMP, they wouldn't take that.

RICHARD HARTMANN: Exactly. So my initial goal was to write an RFC which I could then slip into a tender and tell others about, so they could also slip it into tenders or conditional deals, and just force people to support it. Course, that's how it works in the networking space. And also to do it like this in other spaces.

This kind of snowballed from there, because more people heard of it. More people got interested. And we actually had input, especially from Google and Uber, about certain aspects which they wanted to have differently, just to adapt to their use cases.

The most prominent example would be that we now have exemplars in OpenMetrics, which basically allows you, if you have a bucket in your histogram, to attach the ID of a trace, to directly link to that trace. So you know your latency in that bucket is more than, let's say, 60 seconds. And you want to look at why this is happening. So you now have exactly this link over to that other trace ID.

For example, Stackdriver will support this, and it will ingest the data, and it'll just work. Prometheus decided to just drop that data and ignore the exemplar. So OpenMetrics is about enabling systems to emit data in a certain wire format, and agreeing on what that wire format should be-- and also to ingest data with that same wire format.
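
As a sketch of what an exemplar looks like in the OpenMetrics wire format-- the metric name and trace ID are made up:

    # TYPE request_latency_seconds histogram
    request_latency_seconds_bucket{le="1.0"} 131 # {trace_id="KOO5S4vxi0o"} 0.67

Everything after the '#' is the exemplar: a label set linking to a trace, plus the observed value. A consumer that doesn't understand exemplars, like Prometheus at the time of recording, can simply drop that part of the line.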

But it's not so much about prescribing in detail what you must do on the other end. We are kind of sneaking the concept of labels into the world by doing it this way. Course, we mandate label use within OpenMetrics. And by this, we totally kill the concept of hierarchical data, which is a purpose of doing this.

CRAIG BOX: How many labels can you apply to a single data point?

RICHARD HARTMANN: You can apply more or less as many as you want. With Prometheus we suggest that you don't go above the tens of thousands of labels or label values within one time series.

CRAIG BOX: You described before the process by which you would expose data that Prometheus can retrieve. You mentioned OpenMetrics has a standard for exposing that data. Is it just standardizing what you emit? Or is it standardizing the connection between the pulling server and that data?

RICHARD HARTMANN: Primarily, it's just a wire format. And we mandate data pulls over HTTP. There are considerations-- for example, the concept of an info metric, where you can put stuff like build version or such.

We will probably also include things like how to emit information about whether you have a lower or upper threshold for warnings or alarms-- I don't know, temperature in your router or whatever. So there is basically a set of: this is how the wire format must look; it must go over HTTP; and here's a bunch of considerations and best practices to please implement.
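
For illustration, an info metric as described might look like this on the wire-- the names are hypothetical:

    # TYPE build info
    build_info{version="1.2.3",revision="abc123"} 1

The constant value of 1 signals that the interesting data lives entirely in the labels.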

CRAIG BOX: Is there something if I'm writing an application, I will now use an OpenMetrics library in order to make sure that what I'm exposing is compliant rather than using a Prometheus library?

RICHARD HARTMANN: The Prometheus libraries are actually being transformed into just emitting OpenMetrics. Of course, Prometheus will switch to OpenMetrics. Prometheus 2.5 actually already has experimental OpenMetrics support. And the Python library is also able to emit OpenMetrics. So you can already do that.
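
As a minimal sketch of using such a library-- here the Python client with a made-up metric name; recent versions of the client can also serve the OpenMetrics format from the same endpoint:

    from prometheus_client import start_http_server, Counter
    import random
    import time

    # A counter only ever goes up
    REQUESTS = Counter('app_requests_total', 'Total requests handled.')

    if __name__ == '__main__':
        start_http_server(8000)  # metrics at http://localhost:8000/metrics
        while True:
            REQUESTS.inc()       # record one handled request
            time.sleep(random.random())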

CRAIG BOX: Can I still emit them with print statements in C?

RICHARD HARTMANN: Yes, of course. That will not go away, at least not in the near future. Of course, this is one of the main selling points-- to make it really, really easy. I don't care if you use our library, or someone else's, or make your own. Doesn't matter. I care about having a lingua franca in the whole observability story.

Of course, what you are currently seeing in the observability space is that we have all these new projects, and basically it's exploding. And we must reconverge to something. Course, there's just too many bits and pieces floating around at the moment.

So by having this inflection point where everything comes together and they all speak the same language, I don't really have to care about what they do before or what they do after. I just care about them being able to talk to each other. And then you can just pull bits and pieces together. And that's your stack. And that's totally fine.

ADAM GLICK: What comes next for these projects?

RICHARD HARTMANN: Next is-- first is actually finishing the internet draft, which then will hopefully become an IETF RFC. So OpenMetrics actually goes the old way of becoming a standard.

We are already looking at also having something which is basically like statsd, where you push data. On the wire, it looks the same as OpenMetrics. But the point is that the underlying working assumptions differ between the pull and push models, and also for when you want to do either.

Aside from the fact that it's a religious question, there are certain use cases where one or the other is actually better. And we want to make it easier for people not to get confused-- because we realized, even amongst ourselves, when we identified this valid use case of pushing data out, that we got confused about what we were currently talking about.

So to avoid that confusion we just want to have different names for basically the same thing. That's the first step. After that, I want to tackle events and such. I already talked to OpenTracing. I already talked to CloudEvents, which obviously look at different parts of this whole observability story.

But what I care about is I want to have one new lingua franca, which is not SNMP, which you can use for monitoring, like the whole monitoring and observability story-- be it metrics, be it traces, be it syslog, be it whatever, that everything goes over the same similar wire format. And everything has labels attached and no hierarchical data models at all.

CRAIG BOX: Would it be possible to take SNMP output and translate it to OpenMetrics?

RICHARD HARTMANN: Yes. There is an SNMP exporter. It used to be written in Python. It got reimplemented in Go when we ran into scaling issues. And by now, it scales really well.
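
As a sketch of how that is typically wired up, a Prometheus scrape job points at the SNMP exporter and passes the device as a parameter-- the addresses here are hypothetical:

    scrape_configs:
      - job_name: 'snmp'
        metrics_path: /snmp
        params:
          module: [if_mib]                  # which set of SNMP objects to walk
        static_configs:
          - targets: ['192.0.2.1']          # the network device to query
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target    # pass the device as ?target=
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 'localhost:9116'   # actually scrape the exporter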

CRAIG BOX: While we're talking about internet drafts, I see that we're looking now at QUIC becoming HTTP version 3. You've mentioned that OpenMetrics is a wire protocol as opposed to just a metadata standard that you publish on any particular endpoint. As the wire protocol used by the internet changes, how will OpenMetrics change?

RICHARD HARTMANN: That's actually an open question. We'll just look at how things progress and go from there.

ADAM GLICK: If other people want to get involved in Prometheus and the work that you're doing with OpenMetrics, where would they go?

RICHARD HARTMANN: Probably the mailing list. And we would love for people to join the Prometheus community and to try and become Prometheus team members. Course, a few of us are actually funded by their companies to work on Prometheus. But quite a large portion, amongst them myself, are just doing this in their free time, basically.

So more hands to help, or also companies who just want to invest-- either by paying people or by just assigning engineers to Prometheus-- would be totally awesome. Course, there's tons of stuff that we would like to do. But we just don't have the resources to do it.

As to OpenMetrics, we are currently-- before we actually release 1.0, we are not taking on new people. Course, most of the stuff is in our heads, which is a really bad place for it to be. And the burden of onboarding people, even people who left for only a few months and then returned, was insane. So we basically decided to just try and push out that standard and then go from there.

CRAIG BOX: All right, Richard. Thank you very much for joining us today.

RICHARD HARTMANN: No worries. Thank you.

CRAIG BOX: You can find Richard on Twitter @TwitchiH, T-W-I-T-C-H-I-H, and learn about Prometheus at Prometheus.io, and OpenMetrics at OpenMetrics.io.

[MUSIC PLAYING]

ADAM GLICK: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word by telling a friend. If you have any feedback for us, you can find us on Twitter @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: Please, someone write to us and ask for that soup recipe already. You can also check out our website at Kubernetespodcast.com. Until next week, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]