#4 May 22, 2018

Stackdriver Kubernetes Monitoring, with JD Velasquez

Hosts: Craig Box, Adam Glick

On this week's Kubernetes Podcast, your hosts talk to JD Velasquez from Google Cloud about Stackdriver Kubernetes Monitoring, a new product that brings first-class Kubernetes monitoring and Prometheus support to the Stackdriver monitoring and observability suite.

ADAM GLICK: Hello again, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.


ADAM GLICK: Today is the week of places that start with M. Where are you going to be, Craig?

CRAIG BOX: Oh, I'm here in Melbourne, Australia, where I'm giving a talk at Container Camp on Thursday, which is on Istio 1.0. And the interesting news is that Istio 1.0 has not been released. So the good news is it turns out that it's actually one year to the day since Istio was released to the public. So it is now the "Istio 1.0 years old" talk, where, everything else remaining the same, I'm talking about customers who are running Istio in production even before they've hit the 1.0 boundary.

Well, a lot of brave users, a lot of great stories. So if you have a chance, come along. If not, I will give the same talk at a couple of meet-ups in New Zealand-- one in Wellington and one in Auckland next week. And I'm sure you'll hear more about it as we move around the world the rest of the year.

ADAM GLICK: It's pretty impressive how fast Istio is moving. I believe they do a dot release almost every month, isn't it?

CRAIG BOX: It should be. That's one of the great things about technology: we say, let's do a monthly release, and then the schedule slips a little bit. And that, of course, is why we're not at 1.0 yet. And to tell you more about that, you'll have to come along to my talk, I'm afraid.

ADAM GLICK: Ooh, look at the teaser. Nicely done.

CRAIG BOX: And yourself, Adam?

ADAM GLICK: I'm going to be meeting with customers at the Google Cloud Summit in Milan. So if any of you listening will be in Milan, please stop by the booth area, and I'd love to chat with you and catch up. And then I'll do a little more travel next week before I'm back in my usual location back in the States.

CRAIG BOX: Let's check out the news of the week.


Rackspace this week announced Kubernetes as a managed service. They'll act as a managed service provider offering management of Kubernetes wherever the customer wants to run it, including on premises, in the Rackspace data centers, or in public clouds. Rackspace Kubernetes as a service will be available on their private cloud in all regions later in May. And support for public clouds is planned at a later date this year. In their announcement, Rackspace called out the 451 Research prediction, that the application container software market will grow 40% over the next few years to nearly $3 billion in 2020.

ADAM GLICK: Cisco released several patches this week for their Digital Network Architecture Center. Cisco's DNA embeds Kubernetes, and this CVE stems from a misconfiguration of that environment, allowing an attacker to bypass security protections within your running containers. The exploit was accessible to any attacker who had access to the service port. The attacker could gain elevated privileges within the running containers in your Kubernetes pod and could completely compromise the affected containers. There is currently no workaround for this vulnerability, so installing the patch release this week is highly recommended by Cisco.

CRAIG BOX: eSecurityPlanet-- little e capital S, capital P, very important-- this week released a video interview with Brandon Philips about how Kubernetes responds to security threats. In the interview, the CTO of Red Hat's CoreOS division and member of the volunteer team that handles Kubernetes security reports, Philips details how security reports are handled and how the Kubernetes subpath vulnerability issue was managed this March. He says that there are currently eight volunteers on the Kubernetes Security Response Team who act as project managers, triaging issues as they come in, then engaging with the right engineers involved with Kubernetes SIGs to get the fixes implemented quickly.

ADAM GLICK: Kubernetes The Hard Way has been updated to support Kubernetes 1.10. Kelsey Hightower, developer advocate with Google Cloud, updated his popular online guide to deploying Kubernetes from scratch. If you haven't worked through it yet, it's an excellent way to understand the inner workings that are a part of the Kubernetes setup and will make you truly appreciate what managed Kubernetes services provide. There's also a module that lets you use the runsc runtime powered by gVisor. If you're just tuning in, we had a great interview with Yoshi and Nick from the gVisor team last week.

CRAIG BOX: Finally, buried in a session video from Red Hat Summit last week, we found a gem of a service called Kiali, a service mesh visualization tool for Istio. Kiali, which is the Greek word for "monocular," ticks both boxes-- being both Greek and starting with a K. That's bingo on Kubernetes-related project naming. It was also the name of 10 babies born in the USA between 1880 and 2016.

More importantly, it's an observability display visualization platform for Istio. It's still early stages, but it looks to be a very interesting way to see the status of your environment, linking in to some of the services that Istio provides for monitoring and tracing. We have a link in the show notes to a couple of demo videos very much worth checking out.

ADAM GLICK: And that's the news.

CRAIG BOX: Our guest this week is JD Velasquez, product manager with Stackdriver, who recently launched Stackdriver Kubernetes Monitoring.

ADAM GLICK: Welcome, JD.

JD VELASQUEZ: Thank you, Adam. Thanks, Craig. It's good to be here.

CRAIG BOX: Congratulations on the launch of Stackdriver Kubernetes Monitoring. Tell us a little bit about the product and what it means to you.

JD VELASQUEZ: Thank you. So we were very excited to announce the product at KubeCon in Copenhagen. Stackdriver Kubernetes Monitoring, what it does is it really offers very rich and comprehensive observability on your Kubernetes environment, which simplifies operations for developers, operators, and SREs.

ADAM GLICK: When you say "observability," what do you mean by that?

JD VELASQUEZ: I'm glad you asked that question. It's a term that's come recently as being really paralleled with monitoring. They're not exactly the same, so at Google, the way we think of observability is more of a property of a system. And so what we try to do is to increase that property, increase observability, which, in specific terms, what we're trying to do is to find a way to increase signals so that you can debug, essentially, your system in multiple ways, so that you can diagnose and understand failure in production or that you can understand, actually, the normal behavior of a system, which is better in terms of usage and so forth.

CRAIG BOX: Is this normally something you engineer into the application, or is it something that's provided by the platform and the tooling around it?

JD VELASQUEZ: That's the beauty of the product, in a sense. So normally, it requires a lot of effort to instrument your infrastructure, and your application, and so forth, so that you can actually get that rich observability. What we're doing with the product is precisely that we're taking that toil away from you. So if you're a developer, you come in and you already have this observability right from the start. You get a lot of visibility into your Kubernetes objects, and you also further have the ability to inspect those objects in the right context.

And what I mean by that is if you're the developer, for instance, you may be interested in your Kubernetes environment and your application from the higher level. So you want to look at your workloads, and your services, and so forth. So you can do that. But if you're an SRE, if you're an operator, and an SRE, a Site Reliability Engineer, you may actually want to look at your Kubernetes environment from the perspective of your infrastructure.


JD VELASQUEZ: And so Kubernetes Monitoring gives you, in a single place, the possibility to monitor multiple clusters, and you can get an at-a-glance view of the health of those clusters. And then as you focus your attention-- because there's something you may need to look into-- you can drill down or up and then inspect those objects further.

CRAIG BOX: And you're obviously talking about observability. That brings in things like logging and tracing. And how do we integrate with those features?

JD VELASQUEZ: Right. Exactly. So what I mentioned earlier in terms of observability is trying to increase as many signals as possible from your environment. Usually in the monitoring space, this has referred to metrics, or some other people would use logs. What we're doing here is that we actually integrate metrics, logs, events, and metadata about your Kubernetes environment all together in a single place. So that's where this rich and comprehensive observability comes from.

ADAM GLICK: That's really interesting, because when I think about traditional monitoring that people do, it tends to be very basic things of CPU usage, say, for instance, or how much memory is used. Do you go beyond that? How does that work in a containerized world where you're kind of dealing with a virtual layer inside a virtual layer?

JD VELASQUEZ: Right. So the main thing here, Adam, is that this rich observability is the possibility of getting those diagnostics across the entire stack. And so Stackdriver Kubernetes Monitoring will do from a container at a microscale, in a sense, getting those system-level metrics, and logs, and so forth, all the way up to a Kubernetes service, getting even into traces. So you can actually understand the latency, for instance, of your application and bottlenecks that may exist.

But further, to get to your point, is we may be talking about infrastructure-related signals and Kubernetes signals altogether, but there's also the application aspect of it. So we know, for instance, that many customers use and instrument their applications, meaning they want to have application-specific metrics or data.

And they use Prometheus. It's very common in the Kubernetes space. We're very aware of that. And of course, we want to make sure that we get to customers where they are, they could pick the tools that they use. So as part of this product, we integrated with Prometheus. So if you have an instrumentation and configuration in Prometheus, you can take that into Stackdriver Kubernetes Monitoring without modification. So that gives you, in a single place, the possibility of getting infrastructure-related metrics for Kubernetes all the way to your own application metrics as well.
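
Editor's sketch: for readers who have not instrumented an application for Prometheus, the snippet below shows roughly what "application-specific metrics" look like on the wire. It renders one counter in the Prometheus text exposition format using only the Python standard library. The metric name, labels, and the render_metric helper are hypothetical and chosen for illustration; this is not the Stackdriver integration itself, and label-value escaping is deliberately left out.

```python
# Hypothetical illustration: render application metrics in the Prometheus
# text exposition format using only the standard library.
# The helper name and the sample metric are assumptions, not part of any
# Prometheus or Stackdriver client library.

def render_metric(name, help_text, metric_type, samples):
    """Return exposition-format text for one metric family.

    samples is a list of (labels_dict, value) pairs. Label values are
    assumed to need no escaping in this simplified sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable from run to run.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "checkout_requests_total",
    "Total checkout requests handled.",
    "counter",
    [
        ({"method": "POST", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(text)
```

A scraper reads text like this over HTTP from the application; in practice an official Prometheus client library would generate it rather than hand-written code.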

CRAIG BOX: So if we were to think of this as a hosted Prometheus equivalent, in some sense-- there's obviously more to it in terms of alerting and monitoring-- are there things that Prometheus does not bring?

JD VELASQUEZ: Well, the thing is that you can see Prometheus in many different ways. You have Prometheus as a server, a protocol, a time-series database, and so forth. The idea here is that if you're already using something, for instance, on-prem and you're using Prometheus for it, now you have a choice. And it's very easy for you to switch back from an on-prem and then now take things into a different back-end in the cloud with Stackdriver. That's what we give you.

ADAM GLICK: So you've mentioned on-prem and the cloud here. Stackdriver's a cloud-based service. Does this only work within the cloud, or can you run this anywhere that you are running your Kubernetes clusters?

JD VELASQUEZ: Great. Yes. That's the main thing with Stackdriver Kubernetes Monitoring-- is that you can run it everywhere. So it is, of course, right out of the box. You get it pre-integrated with Kubernetes Engine. But if you have a cluster deployed on-prem or in a different cloud provider, then you can also configure Kubernetes Monitoring there and you bring everything into that single place. You can have that multi-cluster-- one cluster in Kubernetes Engine, one cluster on-prem, one cluster wherever it is.

ADAM GLICK: So it's a true kind of one view across all of your clusters that you're operating.

JD VELASQUEZ: Right. Stackdriver was built from the ground up to be supporting a multi-cloud and hybrid environment. So essentially, this is our mission. We want to help developers and operators keep their applications running fast, being available, and doing, most importantly, what they're meant to do, no matter where they run.

ADAM GLICK: If I'm running on premises, how should I think about this connection? And are there any special challenges that I might have, say, with firewall rules or security concerns that people might have?

JD VELASQUEZ: Yeah, I think the main thing from our perspective is that the key aspect is the metadata. So in cloud, because we have access to APIs and so forth, it's very easy to organize your environment in a way that is coherent and that aligns with your mental model. On prem, we don't necessarily have that visibility. That's why we developed a new metadata agent that customers can install on prem and then send metadata and organize their environment in a way that will be understandable by Stackdriver, so that you can have the same type of configurations and so forth.

CRAIG BOX: Does that get installed on every node in their on-prem cluster?

JD VELASQUEZ: That is the idea. Yes.

CRAIG BOX: And how influential has the Google SRE culture, be it the book or individuals in the SRE organization, been in helping guide this product as you built it out?

JD VELASQUEZ: This particular offering is an example of our commitment to externalize many of our practices. So Google has, of course, more than a decade of experience of running, and deploying, and managing container-based applications. And we're taking all of that and putting them precisely into our main offering here. So a lot of this is the ability to do monitoring and using this observability in the same way that Google SRE does.

ADAM GLICK: So what's probably the newest thing that, with this announcement, people are able to do with Stackdriver that they weren't able to do before?

JD VELASQUEZ: Oh, the main thing is that you would have a single source of truth, so to speak, in terms of those signals. In the past, what people have had to do is that they need to stitch together manually all those different sources of data as well as tools to get your metrics, your logs, and so forth together. And we now give you that in a single place. It's integrated with open source. So if you use it, then you could immediately get those signals also into Stackdriver. It runs everywhere. And it externalizes those SRE practices.

ADAM GLICK: So if I think about something-- traditionally, I think about, hey, I'm going to run Prometheus, that's going to aggregate my data, I'll use something like Grafana to do my visualization of those pieces. Is this complementary to that stack? Does it augment some of that? How should I think about that in comparison to that stack that I'm used to?

JD VELASQUEZ: So Stackdriver gives you that single place where you can do all of these things. The Stackdriver Kubernetes Monitoring piece, it works great on its own, but it really works best when you use it with the rest of the tools. So for instance, when I'm using Stackdriver Kubernetes Monitoring, I can still have access to the very comprehensive alerting that Stackdriver has and the powerful visualization tools. You have dashboards and so forth. So in that sense, it is in the same space of that Grafana and other tools that you mentioned.

Now, if you want to get into latency analysis and bottlenecks, distributed tracing, you can go and use Stackdriver Trace. If you want to understand your consumption of resources, then you could use Stackdriver Profiler. If you're diagnosing an issue, trying to do root cause analysis and so forth, then you would get into Stackdriver Logging, for instance, to get those signals.

Altogether, Stackdriver is meant to help you understand the behavior of your system or your application to speed root cause of any issues when they happen and to minimize the time to repair those issues, right? Now, we cannot be everything to everybody. So we're building an ecosystem. And we, of course, rely on partners that will build on some of these same tools. And if a customer is using Grafana, as you mentioned before, you can totally take those signals out of Stackdriver and visualize them with Grafana.

CRAIG BOX: How does Stackdriver Trace relate to some of CNCF's tracing projects, like OpenTracing and Jaeger?

JD VELASQUEZ: Yes. As part of our efforts in Stackdriver, we have this project, OpenCensus. And in OpenCensus, we've been leading precisely to get into the multiple open projects and initiatives to define a common, shared way to get to instrumentation. So we have an initiative with OpenMetrics, we have an initiative with OpenTracing, and we're starting something with Open Logs as well.


JD VELASQUEZ: The idea here being that, as a developer, I would only need to instrument my application once. And if I need to switch back ends, that's all right. You still are using the same way. You don't need to modify your application instrumentation.
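
Editor's sketch: the "instrument once, switch back ends" idea can be illustrated with a small interface. The application code below calls one recording function, and the destination is a pluggable exporter. All class and function names here are hypothetical and are not the OpenCensus API; they only show the shape of the design.

```python
# Hypothetical illustration of "instrument once, swap back ends".
# None of these names come from OpenCensus or Stackdriver; they are
# assumptions made for the sketch.
from typing import Dict, List, Protocol, Tuple


class Exporter(Protocol):
    """Anything that can receive a measurement."""

    def export(self, name: str, value: float, labels: Dict[str, str]) -> None: ...


class InMemoryExporter:
    """Stand-in for an on-prem back end: keeps points in a list."""

    def __init__(self) -> None:
        self.points: List[Tuple[str, float, Dict[str, str]]] = []

    def export(self, name: str, value: float, labels: Dict[str, str]) -> None:
        self.points.append((name, value, dict(labels)))


class PrintExporter:
    """Stand-in for a hosted back end: just prints each point."""

    def export(self, name: str, value: float, labels: Dict[str, str]) -> None:
        print(f"{name} {labels} {value}")


class Recorder:
    """The only object application code talks to."""

    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, value: float, **labels: str) -> None:
        self._exporter.export(name, value, labels)


def handle_request(recorder: Recorder, latency_ms: float) -> None:
    # Application instrumentation: written once, unaware of the back end.
    recorder.record("request_latency_ms", latency_ms, route="/checkout")


backend = InMemoryExporter()
handle_request(Recorder(backend), 12.5)   # first back end
handle_request(Recorder(PrintExporter()), 12.5)  # swapped back end, same app code
```

The point of the design is that handle_request never changes; only the exporter passed to Recorder does.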

CRAIG BOX: Is that similar to the Mixer service in Istio, for example?



ADAM GLICK: As a developer versus being an operator in Kubernetes, why would this matter to me?

JD VELASQUEZ: It matters primarily because we do all the instrumentation for you primarily about your entire Kubernetes environment. So right out of the box, you get all that observability you need. And this allows you to do several things, but most importantly, it allows you to focus on doing what you love best, which is writing code and building apps.

ADAM GLICK: How would someone install this? If you needed to do this, do you need to actually change, say, your Docker files and then redeploy your application in order to put it out there? Or does it kind of attach natively to things that are already there?

JD VELASQUEZ: We do all of this for you if you have this in Kubernetes Engine. It's pre-configured as soon as you spin up a cluster and just select if you want to enable Stackdriver Kubernetes Monitoring. And then you get it right out of the box. If you want to configure this on prem or somewhere else, then you would have to install our agents to that effect. You can have some configuration there.

CRAIG BOX: If I have a simple microservices application running on a hosted Kubernetes environment, how important is it to use the rest of the traditional Stackdriver tooling to manage the health of the nodes? Am I just now saying, all right, as long as my service is up, then I don't need to do alerting on my underlying infrastructure? Or is there still a place for doing management of the underlying infrastructure with more traditional Stackdriver toolkits?

JD VELASQUEZ: I think this is where the solution provides that comprehensive observability that we talked about, right? So it really depends on the use case and the role that you have. Normally, as I mentioned earlier, if you're a developer, you may want to really focus, at a high level, on your workloads on your application itself, on the services.

But at times when things happen, you do need to dig deeper and go into your infrastructure to understand if there's something specific happening with your pods. If you have many pod restarts in a period of time, for instance, and so forth, you may really want to get sometimes at your nodes and the infrastructure itself. So Stackdriver gives you that full stack, from infrastructure monitoring and observability to cloud services to application and so forth in a single place.


CRAIG BOX: The Stackdriver Monitoring system before the Kubernetes launch obviously monitors both Google Cloud and Amazon EC2. You have a system here that enables you to monitor a Kubernetes environment wherever it might happen to be-- on other public clouds or on premise. Is it your intention that you will have customers who have running infrastructure that's no connection to Google do their monitoring with the service?

JD VELASQUEZ: That is exactly right, Craig. It's part of our mission. We want to make sure that we help developers and operators run their applications, get them fast, keep them available, doing what they're supposed to do in a multi-cloud and hybrid world.

ADAM GLICK: In terms of being able to gather those metrics, you mentioned that you have a dashboard for visualization, can I plug that back in to be able to have my Kubernetes clusters actually take action on them? Can I put those into a queue or someplace where I can actually use those to automate reaction to metrics that I'm seeing, say, for auto-scaling? Or is this purely a kind of visualization understanding framework?

JD VELASQUEZ: Well, it's both. From a perspective of you as a user, you may take that visualization, as I said earlier, to understand the normal behavior of your system, so that you can do two things-- prevent failure. But failure will always happen. That's inevitable. That's where you need that observability. So then you can root-cause the issue faster. And then minimize that time to repair that failure.

But on the other hand, underlying what we have here is a set of metrics that Kubernetes Engine also uses to do autoscaling based on the same Stackdriver metrics.
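
Editor's sketch: as background on metric-driven autoscaling, the Kubernetes Horizontal Pod Autoscaler documentation describes a proportional rule: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance of 1. The function below applies that rule. It is an assumption that this is the mechanism JD is referring to, and the min_replicas and max_replicas defaults are hypothetical values for the example.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Proportional scaling rule as described in the Kubernetes HPA docs.

    min_replicas and max_replicas defaults are illustrative assumptions.
    """
    ratio = current_value / target_value
    # Within the tolerance band, leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Metric is 50% over target: scale 4 replicas up to 6.
print(desired_replicas(4, current_value=90, target_value=60))
# Metric is at half the target: scale 4 replicas down to 2.
print(desired_replicas(4, current_value=30, target_value=60))
```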

CRAIG BOX: JD, do you use Stackdriver to monitor Stackdriver?

JD VELASQUEZ: We do indeed. We actually have several, we call them internally, Stackdriver workspaces or scopes, where we actually monitor our own environment using Stackdriver. So it is a tool, definitely, that we use at Stackdriver. More important is the fact that the underlying infrastructure is a massive infrastructure we have built to even provide some of that single source of truth that I mentioned earlier to monitor other Google services. That's where it's an important aspect of ours.

ADAM GLICK: I'm going to throw one more in there, which is when I think about the future, can you share anything in terms of what you've got on your roadmap, what's coming next?

JD VELASQUEZ: Specifically for the Kubernetes piece or Stackdriver in general?

ADAM GLICK: For the Kubernetes piece in specifics. That's our audience.

JD VELASQUEZ: So the main things that you will start to see are, first, we're going to make the configuration for Stackdriver Kubernetes Monitoring on prem and in other places much easier. So we'll have a certified solution in the short-term future.

We're also going to be working on integrations with Container Builder, for instance. We want to get information about deployments, so you can start to correlate your deployments with the state of a Kubernetes environment. Usually when there is an issue, it's most likely because something changed very recently, you did a roll-out, right? So we want to give that information to our customers as well, to get them to do that faster.

CRAIG BOX: And Spinnaker integration also?

JD VELASQUEZ: Right. That is in our roadmap as well. And then the last thing-- when we do things at scale, as we do at Google, you may need a different type of visualization to get that at-a-glance sense of health. So we're going to be working on a different type of association for your Kubernetes clusters. You should see that in the short term as well.

CRAIG BOX: JD, thank you so much for your time today.

JD VELASQUEZ: Thank you, guys. I really appreciate being here.

ADAM GLICK: Great talking to you.



ADAM GLICK: That's about all we have time for this week. If you want to learn more about Stackdriver Kubernetes Monitoring, you can check out the documentation at Cloud.Google.com/Monitoring/Kubernetes-Engine.

CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show, please help spread the word. Tell a friend. More importantly, tell us. If you have any feedback on the show so far, you can find us on Twitter @KubernetesPod. Or reach us by email at kubernetespodcast@google.com.

ADAM GLICK: Also check out our website at kubernetespodcast.com. Until next time, have a great week.

CRAIG BOX: Take care.