#97 March 31, 2020

Jaeger, with Yuri Shkuro

Hosts: Craig Box, Adam Glick

Jaeger is a distributed tracing platform built at Uber, and open-sourced in 2016. It traces its evolution from a Google paper on distributed tracing, the OpenZipkin project, and the OpenTracing libraries. Yuri Shkuro, creator of Jaeger and author of Mastering Distributed Tracing, joins Craig and Adam to tell the story, and explain the hows and whys of distributed tracing.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.


CRAIG BOX: One of the small advantages of the lockdown situation that I believe 1/4 of the world's population now finds themselves in, is there are a lot of musicians who are sitting at home with a bit more time on their hands than they used to, unable to tour. But instead, they are putting out a whole bunch of great things on the web. Some people are recording little guitar tutorials. Brian May from Queen, for example, puts a couple of minutes out each day, where he tells you his thoughts and plays one of his solos, and sometimes slows it down so you can learn to play it yourself.

Another favorite musician of mine, Neil Finn from Crowded House, is locked up in Los Angeles with his sons, who are also now in the band. And they were rehearsing for a show, which obviously is now canceled. But instead, they're now doing a daily webcast where you can listen to them play some Crowded House greatest hits, and obscurities, as well, which are always my favorite.

ADAM GLICK: So how do you have a crowded house when you're social distancing?

CRAIG BOX: Well, you have a big family, I guess. You're allowed to stay with your own household. They say things over here like, you must restrict the size of gatherings to two people. And I don't know, if your household has more than two people in it and you're out for a walk, that presumably means you're not allowed to stop at any point because then you become a gathering.

ADAM GLICK: Well, you could put your house in the middle of the street, but that might be Madness.

CRAIG BOX: How have you been keeping in touch with friends this troubling time?

ADAM GLICK: We've been experimenting with a little bit of social interaction in the age of social distancing, playing some online games, as you might expect. And this past week, we were playing around with an old favorite. If people remember from back in the '90s, the You Don't Know Jack games have now morphed online. And they now have an online version that you can play with folks. And so we've been playing that. And also, we tried out a game this week called Galaxy Trucker, which is actually a board game. It also has a digital version. And we got a chance to try that one for lots of fun and interesting things.

For those looking for interesting things to do with little ones, the Sesame Workshop, the people that put out Sesame Street, have made a good percentage of their online books available, the e-book formats for the books that they've written, something over 100. They come in multiple languages. I've seen them in German and Chinese and Spanish and in English. And many of the digital bookstores that you'd normally think of-- Google Play, Amazon, Barnes and Noble, Kobo-- all have them available right now. You can go and download those and read with your little ones.

CRAIG BOX: For free?


CRAIG BOX: Including the best children's book of all time, "The Monster at the End of this Book?"


CRAIG BOX: That book is a classic. Now, you have a Little Free Library out the front of your house for people who like paper-copy books. I understand that with the enforced one-walk-per-day that we're all allowed, you're getting a little bit more traffic with your library than usual.

ADAM GLICK: With the library closed, we've noticed that there are a lot of people who come down our street. And the velocity of books, if you think of it like a CI/CD pipeline, the churn of books coming in, books going out, has greatly accelerated. So it's kind of neat to watch that, as access to the regular library has been restricted in this time of shutting down non-essential businesses, people are finding other ways. And it's really great to see new books popping up in the library and people coming by and taking them, and just it being a great little resource for the people in the neighborhood.

CRAIG BOX: Hopefully one day, someone will drop you off a copy of "The Monster at the End of This Book."

ADAM GLICK: [CHUCKLES] If not, I have a digital version, thanks to the aforementioned Sesame Workshop giveaways. Shall we get to the news?

CRAIG BOX: Let's get to the news.

ADAM GLICK: This week would have been KubeCon EU, and the CNCF has now announced a rescheduled date for the conference in some form. The new dates are Thursday, August 13 to Sunday, the 16th. A live event is predicated on the WHO, CDC, and Netherlands Health Ministry, all providing guidance that holding a large in-person event would be safe. If this guidance isn't provided, the CNCF intends to shift to an online event, with the same sessions and keynotes happening on the same days.

CRAIG BOX: Also this week would have been the Cloud Native Rejekts conference, which we talked about in episode 79. That event has become Virtual Rejekts, still with a K, a global interactive live-streamed event, following the sun around the world. It happens on April the 1st, no joke. So if you missed it, you'll be able to catch all the videos on replay soon.

ADAM GLICK: DataStax, a company that commercializes and supports the Apache Cassandra database, has announced a Cassandra operator and corresponding management API. The operator promises zero downtime, zero lock-in, and global scale. It joins a crowded field-- there are at least three alternatives in the market already-- but has the support of marquee Cassandra users, like Netflix and Sky TV, as well as powering DataStax's database-as-a-service, Astra. The management API runs as a sidecar app, but in the same container, allowing for standardized control of actions across the cluster. It works with or without the operator. Both are available on GitHub. And you'll hear more about them on next week's show.

CRAIG BOX: Monitoring company Sysdig has released PromCat, a resource catalog for enterprise class Prometheus monitoring. PromCat is a curated repository of Sysdig tested and documented Prometheus integrations for monitoring applications, infrastructure, and services. The catalog launches with monitoring resources for the Kubernetes control plane, the Istio service mesh, and selected AWS services, plus the promise of many more products and platforms to be added soon.

ADAM GLICK: Another week, another autoscaling innovation-- this week, Jamie Thompson talks about the work he's done over the past six months with his Custom Pod Autoscaler framework and a Predictive Horizontal Pod Autoscaler, or PHPA. Jamie's model uses linear regression and the Holt-Winters model. The latter is interesting as it can model seasonality. In this case, he shows that the PHPA can recognize patterns in traffic and proactively scale clusters to provide lower latency for systems in peak times. The results are encouraging and the data is posted on his blog. Jamie has also released a test suite so you can try it out yourself.

CRAIG BOX: Sidecars in the Istio service mesh get their TLS certificates using the Envoy secret discovery service. For services that don't need to use the sidecar for traffic, but still want to be able to communicate securely, Lei Wang of Google has posted on the Istio blog, explaining how you can use the sidecar in a provisioning-only mode. You can then take that TLS certificate and use it in your own application, allowing connections to other services that run in the mesh.

ADAM GLICK: Lewis Marshall has posted a blog on securing your GKE cluster. He goes over what you get by default, some of the current challenges with locking down clusters, and a step-by-step walkthrough on how to apply his adjusted settings. He concludes with a call-out of both why these changes are needed and also why they are often hard to implement.

CRAIG BOX: If you're running EKS on Amazon, watch out for an upcoming change. Nodes in a managed node pool were previously given public IP addresses. Going forward, they will only be given public addresses if the subnet they are being deployed to is configured to explicitly allow it. Product manager Nathan Taber explains this in the blog post, with a second post explaining the networking setup of public and private endpoints. AWS also announced they have increased the EKS SLA to 99.95%, after Google did the same earlier in the month.

ADAM GLICK: Some tips on ops from a blog called ops.tips. Ciro S. Costa has posted interesting write-ups exploring Kubernetes secrets from the kubelet's perspective, and looking at the details of resource reservations and how processes are killed when you run out of memory. If you still rock RSS, you might like to subscribe.

CRAIG BOX: Last week, the Cloud Foundry Foundation announced the adoption of KubeCF as an incubating project. This week, they announced that Google Cloud has joined the foundation as a platinum member. Google's Jennifer Philips will join the foundation's board of directors.

ADAM GLICK: Finally, the CNCF has released another case study, this time with Vodafone. They call out their challenges with operating in many countries and with many applications. They express how Kubernetes gives them the flexibility to build the platform they need to be software-defined, provide a universal experience, and be compatible with the security mindset and tools they feel they need moving forward.

CRAIG BOX: And that's the news.


CRAIG BOX: Yuri Shkuro is a software engineer at Uber, the creator of the Jaeger project, and the author of "Mastering Distributed Tracing." Welcome to the show, Yuri.

YURI SHKURO: Thank you. Thanks for having me.

CRAIG BOX: You had quite a career in the world of finance before technology. How did you get from there to here?

YURI SHKURO: Finance was an interesting journey. But one thing that always turned me off from finance was that there is a lot of focus on business knowledge and process and not so much on technology. And so I was really looking to do more of a distributed systems and more hardcore technology work. That's why I decided to leave and go to Uber, which definitely provided the scale and good background to do that kind of work in infrastructure.

ADAM GLICK: The finance world has often seen their technology as a competitive advantage and differentiator. How did that impact your decision to take on open-source work later on?

YURI SHKURO: Well, finance, I think, does a little bit of open source. But most of the work in finance that, at least, I was involved with had to do with kind of domain knowledge and business knowledge and business processes of finance and trade processing and all of that. And obviously, that's never going to be open source because it's proprietary to every company. And when I went to Uber, the part of Uber that I was working on was infrastructure, which is not specific to Uber's business.

Like the distributed tracing and observability space where I am is really generally applicable to any company. And so Uber didn't really have any reason not to open-source it. And Uber has a very robust open source program, with guidelines of when we want to open-source projects or not. And when we started distributed tracing, it was clearly a good candidate to be an open-source project, so we went with this route.

ADAM GLICK: We've spoken before to a number of folks who've come from Uber doing open-source work -- the M3 team comes to mind on that one. How does Uber make the decisions on what projects they think should be open-sourced versus what projects really are a proprietary company technology?

YURI SHKURO: We do have a set of criteria or guidelines for projects. One of the more recent criteria is that we want projects to actually run at Uber for at least six months in production, so that when we open-source something, we know that it's a good project worthy of showing it to the world and getting people excited about it, rather than something that's half-baked or started just now. And so that's one of the approaches.

But as far as which projects are even allowed to be open-sourced or not, as I said, this is mostly related to whether the projects are critical to Uber business and whether they represent any specific know-how for Uber. So obviously, if you take any sort of driver and rider matching algorithms, we would not open-source those. But for example, visualization of trips in the city, visualizations of kind of activities within the city, we actually have a very large portfolio of open-source visualization tools.

Because again, even though they are core to the business, they're not a competitive advantage at the same time. They're more of a side effect of the business. And most of the infrastructure that we build is also allowed to be open-sourced. But again, we have other criteria, like how stable it is, how mature the product is, whether there is a competitive advantage, as a project itself, being open-sourced or not. Because sometimes you want to look at what other projects exist in open source, and you don't necessarily want to come out and compete with something. It's better to just collaborate with other projects if possible.

CRAIG BOX: It's fair to say you have quite the passion for distributed tracing, having literally written the book on the subject and created a project in the space. How did you get interested in distributed tracing?

YURI SHKURO: This was mostly an accident, actually. When I came to Uber, the Uber New York office was just starting. We had less than 10 people at the time. And one of the areas that group of people were already working on was observability. Specifically, that was the metrics system that eventually became M3DB.

However, that team was mostly well-formed already. And even though I joined the infrastructure team, there wasn't really anything that they needed me to do in the metrics space. And so we started looking for other observability projects at Uber. And tracing was one that came to mind. Because it wasn't really being done by any other team at Uber, and the New York office wanted to become the observability hub of Uber, we decided to take that project on.

And that's kind of, by accident, how I ended up leading that project. And from that point, I started, basically, reading more about it and getting interested in it. I found it to be a very exciting area to work in, especially given the explosion of microservices that Uber was undergoing at the time.

ADAM GLICK: How would you define distributed tracing?

YURI SHKURO: Distributed tracing is a technique for monitoring and troubleshooting transactions in highly distributed architectures. So think about when you're ordering a ride from the Uber app. It's a simple push of a button. But when the request comes to Uber's data center, there are potentially several dozen microservices cooperating to execute that one single request, from basic services, like authentication and some edge gateway, API gateways, all the way to business services, for matching, for payment checking, for fraud detection, and things like that. And so a single request actually spreads out into a whole bunch of other small pieces, and all the services must cooperate to produce one single response back to the app, saying, yeah, you got a ride.
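The fan-out Yuri describes can be sketched as a tree of spans that all share one trace ID. This is a toy model in Python, not the real Jaeger data model; all class and service names here are made up:

```python
# A minimal, illustrative model of one request fanning out into many
# operations, recorded as a tree of spans that share a trace ID.
import time
import uuid

class Span:
    def __init__(self, operation, trace_id=None, parent=None):
        self.operation = operation
        # Every span of one request shares the same trace ID.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.parent = parent
        self.children = []
        self.start = time.time()

    def child(self, operation):
        s = Span(operation, trace_id=self.trace_id, parent=self)
        self.children.append(s)
        return s

# One hypothetical "request a ride" trace fanning out downstream.
root = Span("POST /request-ride")
auth = root.child("auth-service")
match = root.child("matching-service")
match.child("payment-check")
match.child("fraud-detection")

# The shared trace ID is what lets a tracing backend reassemble the
# request's full execution across all the services involved.
assert auth.trace_id == root.trace_id == match.children[0].trace_id
```

A real span would also carry an end timestamp, tags, and logs, but the tree shape and shared trace ID are the core of the model.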

And once we have that kind of architecture in place, it becomes very hard to monitor the execution of those requests, and especially to troubleshoot them. If something goes wrong and you now have 40 or 50 services involved, it's very hard to figure out what specifically goes wrong and how you troubleshoot certain things. And that's what distributed tracing is aiming to address. It's a different type of observability tool. Compared to metrics and logging, it really addresses the troubleshooting use case for complex requests at the transaction level.

CRAIG BOX: You published an article in 2017 talking about the evolution of tracing systems at Uber before landing on Jaeger, which we'll talk about in a moment. Can you summarize that journey for us?

YURI SHKURO: When I joined Uber in the observability team, Uber, at the time, was undergoing this transition from a big monolith written in Node.js and Python into a whole bunch of microservices mostly written in Go. And specifically, in the Python world, there was already a homegrown tracing system, kind of a tracing system, which conceptually was very similar to distributed tracing. Except it wasn't distributed, because the monolith itself wasn't distributed. It was one single process.

CRAIG BOX: Life was so much easier back then.

YURI SHKURO: Yes. But conceptually, it was very similar. You could still tag executions within that monolith, like capturing the execution of individual functions, capturing database calls. And you would get something resembling a distributed trace, except that you never actually left a single process. But it still gave you the structure of the request execution.

And so that system existed, and there were lots of users who were pretty happy with it. But there was no alternative in the distributed microservices world for that, especially in a different language like Go. Because in Python, it was monkey patching a lot of stuff. And in parallel, Uber, historically, back in maybe 2014, they started a project to develop their own RPC framework, somewhat similar to gRPC, somewhat more low-level than gRPC.

And gRPC was not available at the time, but they needed a standard framework for all of the services at Uber to communicate. That framework was called TChannel. And it's still being used at Uber, but being phased out. And that framework came with a notion of distributed tracing built in. The notion was very similar to how the Zipkin project did it, with essentially almost the same data model.

And so when we started the Jaeger project, the TChannel framework was already in production in many services. So we were already getting the instrumentation and data in production, but didn't have any tracing backend to collect them and start actually analyzing that data. And so there was one group that started a prototype running the Zipkin server.

But again, because the data was coming from a different type of framework, they needed to write their own servers to adapt that data into the Zipkin format. And so when we moved the mission for tracing into the New York office, we sort of took over the work that that team did. But we redid it in a cleaner way. We minimized the dependencies on other infrastructure. We used supported databases that Uber had experience with, like Cassandra instead of Riak, for example. And we started using part of the Zipkin stack as part of the product that we built internally at Uber.

But eventually, that product diverged from what the Zipkin project was doing. One big reason for divergence was that we got involved with the OpenTracing project from CNCF. And all instrumentation at Uber is based on OpenTracing. But Zipkin was not really compatible with the OpenTracing model. There were some things that it couldn't represent properly. And so we decided to change the data model because we already were running on our own servers anyway. We were only using part of the Zipkin stack. So we decided to change our internal data model to fully match OpenTracing.

And at that point, we kind of started on the path of divergence from Zipkin more and more, to the point where we said, you know what, let's just build our own UI. Because again, the data model of Zipkin UI was more restrictive than the OpenTracing model. And so at that point, we completely dropped Zipkin from our stack. And at that point, we decided, well, let's just open source what we have. In fact, the instrumentation libraries for Jaeger were open-sourced from the beginning. They were initially developed in open source. But the backend kind of went through that cycle of open-source projects at Uber, where it ran for I think more than a year in production. And then we said, OK, let's move it to GitHub.

ADAM GLICK: How did you generalize what you had at Uber and turn it into an open-source project?

YURI SHKURO: I think the critical piece was really the OpenTracing support. Because if you are a third-party company and you're looking to adopt a solution, you don't want to be bound to that solution forever. You want to have the ability to change your mind later and, let's say, go to a paid vendor or to another open-source project. And developing and deploying distributed tracing in an organization is a pretty difficult task. And the most expensive part of it is getting instrumentation into all of your applications to collect the tracing data.

And so for that, OpenTracing was the perfect candidate, because it says, you know what, you're not dependent on anything. You're just dependent on this very thin API library to instrument your services and your applications. And then even when you decide which tracing backend or which tracing infrastructure you're going to use, you always have the freedom of switching, without going through the massive expense of changing all of your instrumentation to another API and other instrumentation libraries.

And so Jaeger was designed with that in mind, even though Jaeger was one specific vendor, you could say, one specific application or type of tracing system. But if you instrument your business applications for Jaeger, then Jaeger is almost never mentioned in the application-- maybe once, when you instantiate the tracer. But all other code is generic OpenTracing code, and so it's very easy for people to switch.

And that, I think, was a very big selling point for people to start adopting distributed tracing with Jaeger, as opposed to Zipkin, where you would have to instrument with very specific Zipkin APIs. It's pretty hard to, let's say, switch from Zipkin to, like, Google Stackdriver or AWS X-Ray or any vendor, because they all have their own proprietary APIs, and you don't want that in your business code. You want generic code, which is only dependent on OpenTracing. So that, I think, was one aspect of what we did with the project.
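The vendor-neutrality argument can be sketched in a few lines: business code depends only on a thin, generic tracer interface, and the concrete backend is chosen once, at startup. This is an illustrative stand-in, not the actual OpenTracing API; all names here are hypothetical:

```python
# Business code depends only on an abstract Tracer; swapping the backend
# (no-op, Jaeger, Zipkin, a vendor) touches exactly one line at startup.
from abc import ABC, abstractmethod

class Tracer(ABC):
    @abstractmethod
    def start_span(self, operation):
        ...

class NoopTracer(Tracer):
    """Tracing disabled: the same business code still runs unchanged."""
    def start_span(self, operation):
        return None

class RecordingTracer(Tracer):
    """Stand-in for a concrete backend client, e.g. a Jaeger tracer."""
    def __init__(self):
        self.spans = []
    def start_span(self, operation):
        self.spans.append(operation)
        return operation

def handle_order(tracer: Tracer):
    # The business code mentions only the generic interface,
    # never the backend it happens to be wired to.
    tracer.start_span("get-order")
    return "order-123"

# The backend is chosen once, at startup; nothing else changes.
tracer = RecordingTracer()
handle_order(tracer)
```

This is the shape of what Yuri describes: the only backend-specific line is the one that instantiates the tracer.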

And the second one was just to make it more accessible to people as an open-source project. I think we did a couple things. One is we created a very easy way to run Jaeger, whether it's a full-blown production installation or just a single-instance, all-in-one Docker container. So from the beginning, we had Docker images published for Jaeger, where with just one command you can bring up the whole installation.
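For reference, that one-command installation looks like this. The image name and ports are taken from the Jaeger documentation; check the current docs before relying on them:

```shell
# Runs the whole Jaeger stack in one container with in-memory storage.
# Port 16686 serves the web UI; 6831/udp receives spans from client libraries.
docker run --rm --name jaeger \
  -p 16686:16686 \
  -p 6831:6831/udp \
  jaegertracing/all-in-one
```

After this, the Jaeger UI is available at http://localhost:16686.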

And the second thing is, we invested the time. We actually had an intern at Uber who developed support for Elasticsearch. So Uber originally developed Jaeger with Cassandra as the backend storage. And Elasticsearch was an alternative, which we didn't really need at Uber. But we felt that it was kind of an important thing for an open-source project to allow more flexibility. And Elasticsearch was actually a lot more accessible storage solution for people to use than Cassandra.

Cassandra, people think, is hard to manage. Although these days, you can get managed, hosted versions, like on Amazon. But like five years ago, Elasticsearch was actually much easier to get as a hosted version. And so we spent a bit of time to develop that support, which I think also allowed people to be more comfortable trying Jaeger with whatever storage they were more familiar with.

ADAM GLICK: How would you describe Jaeger to someone who is familiar with microservices but doesn't know a lot about observability?

YURI SHKURO: Assuming that you already have microservices, let's say you have an alert saying, my SLA is broken for one of my top-level services, which is something that happens at Uber. And then we have first responders, and they look at it, and they say, OK, great, so we know this request. We know this endpoint in this service was hit at the top level. But then, we also know that that service probably talked to 40 to 50 other services before it produced a response.

So how do we troubleshoot that specific alert? What's the root cause of that alert? You could go and look at metrics for your service. Well, your metrics will clearly show that you have an SLA broken. Let's say your latency increased. They don't tell you why it increased. You can go look into the logs. And in the logs, you may see there is an error coming from one of the downstream services, maybe, if you're lucky. But they don't really explain whether that microservice was responsible. Maybe something downstream of that was responsible.

And so you're kind of in this situation where you're trying to debug something for which you have very little visibility using traditional monitoring tools, like logging and metrics. Because again, metrics don't explain anything. They just indicate certain things. Whereas logs might explain, but they're very hard to find. Because if you have 50 services involved, you have to go through logs for over 50 services. And all those logs are intermixed between different concurrent requests, because presumably you're serving more than one request per second in your system, and only some of them are failing.

So you really need the visibility into, OK, well, what happened to this one request? I know that that request failed, but what happened in all of its lifecycle within this microservices architecture? And that's what tracing gives you. It gives you this macro view of everything that happened within the system when processing this one single request. And at the same time, if, given that macro view, you decide, oh, I think the problem is in this service, tracing also has the capability of giving you a micro view. You can dive into that microservice and you can see a lot of details captured just for this one request.

Let's say that service was talking to the database. What kind of query was it executing? You can see that from the trace, typically, if you have good instrumentation. And you can't get that kind of visibility with any other monitoring tools. Like clearly, logging wouldn't help you, because with logs, you would not even be able to associate logs with just one single request, unless you do something similar to distributed tracing and tag your logs with some sort of request ID. So in a nutshell, that's what distributed tracing gives you.
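The request-ID idea at the end can be sketched with plain dictionaries. This is a hypothetical logging helper, not Jaeger code: once every log line carries the request's ID, recovering one request's story from otherwise intermixed logs is a simple filter:

```python
# Without a shared ID, logs from concurrent requests interleave; with one,
# a single request's lifecycle can be pulled back out.
import uuid

log_lines = []

def log(request_id, service, message):
    """Append a structured log line tagged with the request's ID."""
    log_lines.append({"request_id": request_id, "service": service, "msg": message})

# Two concurrent requests interleaving their log output.
req_a, req_b = uuid.uuid4().hex, uuid.uuid4().hex
log(req_a, "gateway", "received")
log(req_b, "gateway", "received")
log(req_a, "payment", "charge failed")
log(req_b, "payment", "charge ok")

# Reconstructing one request's story is now just a filter.
story_a = [line for line in log_lines if line["request_id"] == req_a]
```

Distributed tracing systems do this propagation and correlation for you, across process boundaries.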

CRAIG BOX: Developers may be used to stepping or tracing through their code to find bugs in specific parts of it. Is this a good analogy to what Jaeger does in the distributed sense?

YURI SHKURO: It's a pretty good analogy, in the way that you can think of a trace as a distributed stack trace. From a single service, if you put a breakpoint somewhere and you get a stack trace at that breakpoint, it's just the functions called in this service. But you can't get the same picture from other services. Whereas, with a distributed trace, you kind of step back a bit. You usually don't have it at the individual function level, because it's too expensive to collect. But you see that same kind of picture at the macro level, saying, OK, well, this is how all the services interacted, and these are the functions they called within each other. In that sense, it's very similar. Like I said, it's almost like a distributed stack trace.

ADAM GLICK: How does Jaeger work with Kubernetes? Is it something that you install in your cluster? Does it have its own database for storing traces? How does that whole system work?

YURI SHKURO: There is nothing specific in Jaeger for Kubernetes. Jaeger works with any distributed systems, really. Jaeger consists of several pieces. The first piece, which is actually not part of Jaeger project officially, is the instrumentation that you put in your application. Let's say your services are using gRPC framework to communicate to each other. You can go to OpenTracing registry and pull a library which implements OpenTracing instrumentation for gRPC in your language. Then you install that as a middleware within your application.

And at that point, your code is instrumented. You're good. The only thing you need then is something that actually implements that OpenTracing API. And so you also grab the Jaeger library for your language and create an instance of a tracer within your application. And then you pass that instance of a tracer to the instrumentation library, which wraps the gRPC endpoints.

So once you've done that, you are good on your application side. You have achieved instrumentation, and the Jaeger tracer will start sending data out. And then you need to have something that receives that data. And for that, the Jaeger backend comes into play, and there are several deployment strategies that you can use for the Jaeger backend. The one that we use at Uber is, we run a host agent. So all the services running on a given host just send tracing data to a UDP port, and the Jaeger agent picks it up and sends it to the central backend for Jaeger.

You can do the same thing in Kubernetes as a daemon set host agent. Or more often, people use Jaeger agent as a sidecar so that each application instance has its own small Jaeger agent running next to it because that allows you to do better things if you want something like multi-tenancy. Whereas if a single agent runs on a host, then the multi-tenancy becomes more difficult.
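A sidecar deployment along those lines might look like the following pod spec. This is a hypothetical sketch with placeholder image and service names; the agent image and collector-address flag follow the Jaeger documentation of the time:

```yaml
# Hypothetical pod: the app emits spans to localhost:6831/udp, and the
# jaeger-agent sidecar forwards them to the central collectors.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: example/my-app          # placeholder application image
  - name: jaeger-agent
    image: jaegertracing/jaeger-agent
    args: ["--reporter.grpc.host-port=jaeger-collector:14250"]
    ports:
    - containerPort: 6831          # compact thrift span port
      protocol: UDP
```

Because the agent shares the pod, each application instance gets its own agent, which is what makes per-tenant configuration easier than with one shared host agent.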

And then once the agent gets the data, the next step is sending that data to the Jaeger collectors. And that's why you actually have Jaeger agents, so that your application doesn't need to know where the Jaeger collectors sit, how to find them, et cetera. If you want, you could have the application talk to the collectors directly as well. But typically you don't. You just run the agent locally and then you just talk to the local port for sending data.

And then the agent figures out how to find the rest of the Jaeger backend. And the Jaeger collectors are the main piece of the backend. They combine all this data from all the microservices and store it in the database, so you do need to run a database. Jaeger supports Elasticsearch and Cassandra as its primary distributed databases, at production scale. It also supports Badger, which is sort of a local, in-process embedded database. If you're only playing with Jaeger and you're running just a single instance of the Jaeger collector, then Badger might be a good choice, because it's much easier to manage than Cassandra as a distributed system.

So once the data ends up in the database, the last piece of Jaeger is the Jaeger query service, which combines the UI and some API for retrieving those traces from the database. And so Jaeger by itself is also a distributed system. I mentioned three components already: the agent, collector, and query service. There is also a way to deploy Jaeger with Kafka between your collectors and the database. That's what we use at Uber.

Having Kafka in the middle is sometimes beneficial because Kafka is a more elastic, storage-type platform than a typical database. With a database, you have certain limitations on a spike in traffic. So if you get a really high spike of traffic and a lot of traces generated, your collectors may not be able to write all of that to the database, and they will have to drop data.

Whereas with Kafka-- because writing to Kafka is very dumb, you just write the byte stream-- it's much more elastic, usually, and you can absorb a lot more spikes in this way. And then once you have data in Kafka, you can read it into the storage with another component in Jaeger called Ingester. Another benefit of having Kafka in that pipeline is that, because you already have tracing data in Kafka, you can start running additional data mining tools, like the Flink jobs that we run to extract service diagrams, to build all other kinds of intelligence that we use at Uber, extracted from the traces.

CRAIG BOX: A common way that people use distributed tracing is via a service mesh, which promises that you don't need to instrument your code directly, because that can be done by the mesh sidecar. Istio, for example, has supported Jaeger for tracing since version 0.2 in 2017. What are your thoughts on tracing in a service mesh?

YURI SHKURO: The biggest benefit of a service mesh is the standardization of the telemetry. When you instrument something explicitly, let's say with OpenTracing or with OpenTelemetry, and let's say you have an endpoint, you say, OK, I'm going to start a trace for this endpoint, and I'm going to call this trace 'get order'. But then you say, oh, but I also want metrics for that.

And maybe another person is writing the metrics instrumentation, and they name that same thing slightly differently: in the metrics, they tag it with 'get-order'. And now you get this discrepancy of telemetry signals describing the very same thing but being named differently. And this is the problem that you avoid with the sidecar, because the sidecar is the one that actually generates telemetry for you. And so it can provide a very consistent view, so that all your telemetry signals, metrics, logs, and traces, all use the same taxonomy for both business operations and for other infrastructure labels, like which pod you're running in and which zone, things like that. I think that's a very powerful thing to have.

However, on the flip side, the service mesh by itself cannot provide full distributed tracing without the help of the application. If you read the fine print in Istio or any other sidecar, they will say that, if you want to get tracing working, you actually have to implement context propagation inside the application, at a minimum in the form of passing the headers that you received on the inbound request along to your outbound requests. And to implement header propagation, the complexity of that problem is actually on the same scale as just doing the regular tracing instrumentation anyway.

I even have a whole chapter in my book dedicated to that: how do you deploy tracing instrumentation in a large organization like Uber, which uses multiple languages and multiple frameworks? It's a very challenging problem. And that problem does not change just because you use a sidecar. The very simple notion of propagating the header actually requires the very same rich context propagation framework to be built into your applications. And so that's the critical piece that people often miss with a sidecar.
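The header propagation requirement described here can be sketched in a few lines. This is a minimal, hypothetical Python sketch, not real instrumentation: `uber-trace-id` is the context header Jaeger actually uses, but the handler and header values are invented for illustration.

```python
# Minimal sketch of in-process context propagation for tracing.
# The application must carry the trace context from each inbound
# request to every outbound request it makes, or traces break apart.

TRACE_HEADER = "uber-trace-id"  # the context header Jaeger uses

def extract_context(inbound_headers):
    """Pull the trace context off an inbound request, if present."""
    return inbound_headers.get(TRACE_HEADER)

def inject_context(context, outbound_headers):
    """Copy the trace context onto an outbound request's headers."""
    if context is not None:
        outbound_headers[TRACE_HEADER] = context
    return outbound_headers

def handle_request(inbound_headers):
    """A hypothetical handler that calls one downstream service."""
    ctx = extract_context(inbound_headers)
    downstream_headers = inject_context(ctx, {"content-type": "application/json"})
    return downstream_headers

# A request arriving with a trace context carries that same
# context on the outbound call the handler makes.
out = handle_request({"uber-trace-id": "abc123:span1:0:1"})
print(out[TRACE_HEADER])
```

The hard part in a real codebase is not these few lines; it is threading `ctx` through every framework, thread pool, and async boundary between the inbound and outbound calls, which is exactly the rich context propagation framework mentioned above.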

And so if you have context propagation already implemented, great, then yes, you can just use a sidecar to do all the tracing. If you don't have context propagation implemented and you need to implement it, you might as well go with OpenTelemetry, which will implement it for you.

Sometimes you don't even have to do anything, because of the auto-instrumentation project that exists in OpenTelemetry. But that essentially puts the instrumentation inside the application, which in this case makes the sidecar tracing a bit redundant, because you can already get the exact same thing from the application. But like I said, the sidecar does give you a lot of standardization, which is very useful.

CRAIG BOX: We've gone from the very specific way that you instrument your code to be traced by Zipkin, to Jaeger, which then helped contribute to the early OpenTracing project. Now we've gone to an OpenTelemetry project, which brings together the benefits of OpenTracing and some instrumentation based on Google's work as well. How did that journey start?

YURI SHKURO: This was a somewhat painful journey, actually, because the aim of OpenTracing was to become the de facto standard for instrumenting things for tracing and to be completely vendor agnostic, which I think it succeeded at. But it also had one specific challenge that OpenCensus, the project that came from Google, was trying to address: OpenTracing is just an API. You can't run OpenTracing. You can instrument with that API, but in order to get the data out of the application, you still need to compile it with, let's say, the Jaeger tracer or the LightStep tracer or whatever the vendor implementation of the OpenTracing API is.

And this compilation step was actually a bit of a roadblock for adoption, especially for applications which are distributed in compiled form, let's say Postgres or Nginx. You get a binary, and so there is no way for you to compile it again with another tracer. If Nginx wanted to adopt OpenTracing, they would either have to compile it with a whole bunch of different tracers for different vendors, which wasn't palatable to them, or they would have to support some complex form of plugins.

And so OpenCensus, in that sense, had the better solution, because it came with batteries included: it was not just an API, it was also an SDK. But the downside of OpenCensus was that it was only an SDK and not an API, so it didn't actually have the flexibility of the OpenTracing API, where implementations could do very different things. I've seen people implement the OpenTracing API to do unit testing, for example. I don't even know how that works.

But the idea is that it's a completely different domain, and you just use the fact that your code is self-describing what it's doing in terms of the OpenTracing API. You take that description and you do something with the data. And that something may be different from tracing.
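The API-versus-SDK distinction can be made concrete with a small sketch. This is not the real OpenTracing interface, just an invented miniature in Python showing why a pure API permits unusual implementations, such as a tracer used for unit testing rather than for exporting spans.

```python
# Sketch of the API-versus-SDK split: the API is just an interface;
# what happens to the spans depends on the implementation you bind
# at runtime. All names here are invented for illustration.

class Tracer:
    """The 'API': application code instruments against this interface."""
    def start_span(self, operation_name):
        raise NotImplementedError

class NoopTracer(Tracer):
    """Default when no backend is wired in: spans go nowhere."""
    def start_span(self, operation_name):
        return None

class RecordingTracer(Tracer):
    """An 'SDK'-style implementation that actually keeps the spans --
    or, as in the unit-testing example, uses them to assert behavior."""
    def __init__(self):
        self.finished = []
    def start_span(self, operation_name):
        self.finished.append(operation_name)
        return operation_name

def get_order(tracer):
    # The application code only ever sees the Tracer interface.
    tracer.start_span("get-order")
    return {"order": 42}

tracer = RecordingTracer()
get_order(tracer)
print(tracer.finished)  # a test or backend can now inspect what ran
```

With only an SDK and no API, the concrete `RecordingTracer`-style swap is impossible; with only an API and no SDK, you must compile in a vendor implementation to get any data out, which is the roadblock described above.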

And so there were these two philosophical differences between OpenTracing and OpenCensus. And they created a whole rift in the community and in the industry, because other projects didn't really know which project to adopt, since both were solving the same problem and competing. And you can't have two standards; you should only have one standard. And that's how OpenTelemetry came to be.

After a certain amount of work, we finally got the leaders of the two projects together and agreed on certain goals which were acceptable to both projects. And then we said, OK, let's merge them, because two projects is actually worse than having one standard project. So that's how OpenTelemetry started. And again, it has the same goal: a unified, standard way of instrumenting your code which is vendor neutral. The difference from OpenTracing is that OpenTelemetry is not only an API but also an SDK. And the difference from OpenCensus is that it's also an API, where you could, theoretically, implement it with something completely different from what the default SDK does.

But other than that, I think it's a pretty interesting project, and it's got a lot of momentum. A lot of vendors and other people in the industry came together. I think within four or five months of the project being officially announced, there were more than 100 people contributing to it, building SDKs and participating in the specification discussions, defining APIs and all that.

ADAM GLICK: Now that we have the OpenTracing API and Zipkin has adopted it, as well, how would you describe the difference between the Jaeger and Zipkin projects today?

YURI SHKURO: I try not to get into point-by-point comparisons between the two projects. I don't want to say anything bad about Zipkin. I think it's a great project, it has a good community behind it, and it's very popular as well. So as far as comparing with Jaeger: they're very similar. They're both built after the original design published by Google for their Dapper tracing system. They are generally compatible, even on the wire; Jaeger can support all the Zipkin formats.

Zipkin doesn't support the Jaeger formats, because it's an older project; they didn't need to. But we decided to support them as a sort of adoption step. If you already have services instrumented with Zipkin and you want to start sending data to Jaeger, then Jaeger is happy to accept that data. And a bunch of other open-source projects do the same thing with other projects' formats. As far as the feature set, like I said, it's hard to do a point-by-point comparison. The projects run on different schedules and have slightly different goals.

So we position Jaeger as more of a tracing platform at this point, rather than a distributed tracing tool that you only use for one purpose, meaning that we focus more on the data collection being very generic. That's why we always supported the OpenTracing API, and OpenTelemetry now natively exports to Jaeger. And we focus a lot on tools like data mining and built-in visualizations for complex data structures, like the trace comparisons that we released a couple of years ago.

But again, overall, if you just take the very core features, I think the projects are very similar. So it's really a matter of preference which one you go with. I don't think you're going to be wrong by choosing Zipkin or by choosing Jaeger. You can get similar types of data and similar front-ends, just for plain inspection of the traces.

ADAM GLICK: Are there any performance impacts to running Jaeger?

YURI SHKURO: Yes, performance impacts to running any tracing infrastructure definitely exist. It's very hard to quantify what they are, because they're highly dependent on how busy a service is, in terms of CPU load and QPS. It can vary wildly depending on the service. Especially if you think about a proxy like Nginx: that's a service which-- I don't know what their benchmarks are, but I would assume it can handle many tens of thousands of requests per second.

And so at that scale, adding anything, even a microsecond of overhead to processing every single request, is going to matter. It's potentially going to be a high percentage of your overall CPU usage, or whatever other throughput you're measuring. Whereas if you have a business application which maybe makes a database call, and that database call takes 20 milliseconds, then adding 1 microsecond of overhead to that is nothing. You're not even going to feel it. And so the overhead depends on the service.

I've heard numbers that, for very high QPS services, 5% overhead can easily be observed from tracing instrumentation alone, which may be high. If you're running many thousands of hosts with that particular service, then it comes to real money, in terms of both network traffic and CPU usage. But there's no straightforward answer here. In general, I think the better answer is that it's worth having that overhead, because of the visibility into your infrastructure that you get from tracing.
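The numbers in this answer work out roughly as follows. This is a back-of-envelope calculation, assuming a flat 1 microsecond of tracing work per request; the 50,000 QPS figure is an assumption in the range mentioned above, not a measured benchmark.

```python
# Back-of-envelope for the overhead numbers discussed above.
# Assumes a fixed 1 microsecond of tracing work per request.

OVERHEAD_PER_REQUEST_S = 1e-6  # 1 microsecond

# Case 1: a business service whose requests take ~20 ms anyway.
request_latency_s = 20e-3
relative_overhead = OVERHEAD_PER_REQUEST_S / request_latency_s
print(f"slow service: {relative_overhead:.3%} added latency")  # 0.005%

# Case 2: a high-QPS proxy handling tens of thousands of requests/sec.
qps = 50_000
cpu_seconds_per_second = qps * OVERHEAD_PER_REQUEST_S
print(f"proxy: {cpu_seconds_per_second:.0%} of one core spent on tracing")  # 5%
```

The same microsecond is invisible against a 20 ms database call but adds up to 5% of a core on a service pushing 50,000 requests per second, which matches the rough 5% figure quoted for high-QPS services.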

ADAM GLICK: Where did the name Jaeger come from?

YURI SHKURO: When we started the project, we definitely did a bit of soul-searching in terms of what the name should be. And we went through a whole bunch of different names, which I won't repeat, some of them quite embarrassing. But we were looking for something with the theme of a detective, or a tracker, or a tracer, obviously.

And 'jaeger' is a very common word. It's actually an English word, if you look in the dictionary: it means hunting assistant. And I think it's the same word, with almost the same pronunciation, in multiple languages, like in Russian, where I'm from, and in German. And this is the one that felt the best of all the alternatives that we looked at. And it definitely fit the profile of something that helps you troubleshoot issues and hunt for latencies. That's where the 'hunting assistant' comes in.

I don't actually know what hunting assistants do, because people don't hunt that much these days. But in the past, I imagine, if you were the aristocracy type, then you would have these jaegers tracking boars for you, looking at their footsteps. And then you just come in on your horse for the kill. So that's kind of what Jaeger does. It's a tool to help you analyze the data and look at the problems with the architecture, so that you can nail down the problem and its root causes.

ADAM GLICK: Does that mean that you, as the founder, can be called the Jägermeister?

YURI SHKURO: It feels like a trademark violation, a bit. I prefer distributed tracing maestro.

ADAM GLICK: The logo itself has a couple of components to it. One of them is a gopher that looks like the Go gopher. Is that an allusion to the language that Jaeger is written in?

YURI SHKURO: Oh, definitely. Like I said, when we were going through the names, one other theme that we were exploring was the detective type. And that was the actual original logo that we had: a gopher with a sort of detective hat, Conan Doyle style. But then, because Jaeger became a hunter theme, we changed it a bit to a hunter.

And it actually used to hold a rifle in the first iteration of the Jaeger logo. But then, for political correctness, we were told to remove the rifle. So now it just looks at the tracks, the footsteps, trying to figure out why there are six fingers on one of them, if you look closely. And definitely, yeah, it was a nod to the Go gopher logo.

CRAIG BOX: Why are there six fingers in the footprint?

YURI SHKURO: Because it's an anomaly. That's what the gopher is investigating.

CRAIG BOX: How did Jaeger become a CNCF project?

YURI SHKURO: When we published that blog post that you mentioned before, we got a lot of interest from various people, including from Red Hat. The Red Hat team at the time had their own tracing system called Hawkular. It was an APM: Hawkular APM. And when we open-sourced Jaeger, they were very interested.

And because Jaeger was already OpenTracing compatible and Hawkular wasn't, they said, let's just combine forces and work on Jaeger. Which I think was a huge help for Jaeger as a project, because we suddenly got three more official maintainers from a completely different company. And that was a great foundation to go to the CNCF.

And in fact, the Red Hat team were the ones who poked us to apply for CNCF membership, because it would be more comfortable for them to work on a project not officially owned by Uber as a company. And for Uber, it was also kind of interesting: it was the very first project that Uber had ever donated to a foundation. We kind of opened the door after that; there have been a lot of other projects that moved from Uber to foundations.

And when the Red Hat team spoke to us, we said, yeah, why not, it sounds like a good idea. And plus, OpenTracing, at the time, was already in the Cloud Native Computing Foundation; I think it was project number three. And so Jaeger was definitely in good company joining there. And I think it was a very good move on our part, because it definitely helped Jaeger gain visibility and popularity across all of the cloud native folks.

CRAIG BOX: And building on that, of course, you became the seventh project to graduate from the CNCF last October. What's next for Jaeger?

YURI SHKURO: As I mentioned earlier, Jaeger positions itself not just as this front-end tool that lets you look at the traces. We want to be more of a general platform for trace collection and processing, because I found that a lot of teams have very specific needs for what they want to get from the tracing data. And those needs are sometimes niche, or pretty hard to generalize and implement in a general-purpose front-end tool.

So rather than focusing on those very specific needs, we said, what can we do to make the data more accessible for users to consume? And visualization is not the only part of that. And so what we've been doing recently is that we started another project, called Jaeger Analytics (it's also on GitHub), which allows you to do data mining on traces, including in a Jupyter notebook, where you can load the traces.

And right now, I think only Java is supported as a language, because we provide an API for it. But you can load a trace and programmatically look for things in that trace. And the reason why it's very powerful is, like I said, there are very specific things people are sometimes looking for in traces. And it's just very hard to build UIs for every single use case. Whereas, programmatically, it's sometimes very easy to write a few lines of code, maybe like five lines of code, saying, you know what, I'm really interested in this particular feature from the trace, maybe how many calls my specific service generally makes on average, or what's the fan-out.

This is a feature that you can extract from a trace, and it's very easy to write that code. And so this is, I think, our primary focus at this point: to build an ecosystem of data mining, access patterns, and even pre-built features. Some of these features you can already just take off the shelf and run to analyze the traces. So this is one of the directions that we're going in.
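The kind of few-line trace feature mentioned above, such as fan-out, might look like the following Python sketch. The span structure here is invented for illustration and is not Jaeger's actual data model or the Jaeger Analytics API (which, as noted, is Java).

```python
# Sketch of programmatic trace feature extraction: compute each
# service's fan-out (distinct downstream services it calls) from a
# flat list of spans. The span shape is invented for illustration.

from collections import defaultdict

def fan_out(spans):
    """Map each service to the number of distinct services it calls."""
    by_id = {s["span_id"]: s for s in spans}
    calls = defaultdict(set)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Count only cross-service edges, not internal child spans.
        if parent and parent["service"] != span["service"]:
            calls[parent["service"]].add(span["service"])
    return {svc: len(children) for svc, children in calls.items()}

trace = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "a", "service": "payments"},
    {"span_id": "d", "parent_id": "b", "service": "db"},
]
print(fan_out(trace))  # {'frontend': 2, 'orders': 1}
```

A UI for every such question would be impractical, but as code it really is only a handful of lines, which is the point being made here.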

Another direction is, despite what I said about the specific UI features, we have been investing a lot in the UI. We realized that, when you look at promotional materials from many APM vendors, they show you these cute graphs, saying, oh, our system automatically detects your architecture and all of that fun. And then you look at those graphs and they have like 10 nodes. And that makes me so sad, because I have never seen a trace of 10 nodes that was remotely interesting.

At Uber, a typical trace is at least several hundred spans and dozens of services. And I can't even imagine how those APM vendor tools can cope with that number, with that complexity of architecture. When I looked at the example with 10 services and tried to imagine what it would look like with 60 services, I just couldn't. It seems like the UI would just break, in terms of how the layout works and how the annotations work.

And so this is something that we actually needed, because Uber has a very complex architecture. For us, dozens of services are very normal, and dependencies between services can count into the hundreds in some cases. And so we needed tools that are able to visualize that. That's another area that we invested in heavily.

So we have the service graphs, which can actually handle that number of nodes and still be usable. We also invested in another tool, which allows you to visualize an individual trace as a graph, and then compare it with another trace or a set of traces, using color coding similar to code diffs, red and green. And those tools were very instrumental in us building our internal troubleshooting tools. Again, the use case I mentioned: when you have an alert and you try to figure out what is actually broken based on that alert, it's very difficult.

But having these visualizations of the architecture as color-coded graphs actually speeds up that process a lot. It allows you to immediately see where you need to go, and then dig deeper into one specific service or area, rather than scratching your head and saying, OK, 50 services, where do I start? So those are the two main directions that Jaeger is evolving in as a project: data mining and visualizations for complex architectures, rather than 10-node architectures.

And on the other side, it's actually worth mentioning the things that we are subtracting from Jaeger, because we are very involved with OpenTelemetry. One of the components that OpenTelemetry has is called the collector. The collector is a backend process similar to the Jaeger collector, for receiving all kinds of data, transforming it into the OpenTelemetry format, and then sending it to the tracing backend. OpenTelemetry is not meant to be a tracing backend, but the collector is a generic piece that can be reused.

And so what we are doing is, we're planning to replace some of the Jaeger components with those from OpenTelemetry, definitely the collector. And most likely we will also eventually deprecate all the Jaeger tracer client libraries in favor of the OpenTelemetry SDKs, because there wouldn't really be any reason for us to support those. And that will free up a lot more resources in the Jaeger project, from supporting those to building a better backend, better data mining tools, and better visualizations, and not worrying about the data collection piece on the telemetry side.

CRAIG BOX: Well I'm sure a lot of our listeners will be running 10-node clusters. But it will make them sleep easier at night knowing that if their company grows to Uber scale, they will still be able to trace it. Yuri, thank you so much for joining us today.

YURI SHKURO: Yeah, thank you for having me. It was fun.

CRAIG BOX: You can find Yuri Shkuro on Twitter at @yurishkuro, or on the web at shkuro.com. You can find more information about the Jaeger project at jaegertracing.io.


ADAM GLICK: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word by telling a friend or giving us a rating online. If you have any feedback for us, you can find us on Twitter @kubernetespod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes. Until next time, take care.

ADAM GLICK: Catch you next week.