#247 February 13, 2025
Kakeru is the initiator of the Kubernetes History Inspector (KHI), an open-source tool that lets you visualize Kubernetes logs and troubleshoot issues. We discussed what the tool does, how it's built, and the motivation behind open-sourcing it.
Do you have something cool to share? Some questions? Let us know.
ABDEL SGHIOUAR: Hello, and welcome to the "Kubernetes Podcast" from Google. I'm your host, Abdel Sghiouar.
[MUSIC PLAYING]
This week, I am alone. Kaslin is on vacation, and no one else was available. So I have to do this by myself.
This week I spoke to Kakeru Ishii. Kakeru is the initiator of the Kubernetes History Inspector, or KHI, an open-source tool that allows you to visualize Kubernetes logs and troubleshoot issues. We discussed what the tool does, how it was built, and what was the motivation behind open-sourcing it.
But let's get to the news.
The schedule for the KubeCon and CloudNativeCon 2025 Maintainer Summit is live. The event, in its new format, takes place on March 31 at ExCeL London.
The CNCF published their 2024 review of the top 30 projects. The ranking measures the projects by number of contributions, and surprisingly, the podium is taken by Kubernetes, followed very closely by OpenTelemetry. Then Argo, Backstage, and Prometheus are all in the top five.
The CNCF is looking for an end user study to highlight during the keynote of KubeCon and CloudNativeCon London this year. If you have an interesting case and you want to get an opportunity to speak about it for five minutes, fill in the form in the show notes. Applications are open until Friday, March 7, 2025.
Google, AWS, and Azure announced kro, the Kube Resource Orchestrator. Kro is a Kubernetes-native, cloud-agnostic framework that allows platform teams to define groupings of resources that users can consume as standard Kubernetes APIs. Check out the announcement blog and GitHub links in the show notes.
AWS announced the general availability of EKS Hybrid Nodes. The feature was announced at re:Invent 2024. It allows users to connect on-prem and edge nodes to managed EKS clusters on AWS. The company says this feature could help with modernization and migration of existing applications.
CoreWeave announced the availability of NVIDIA GB200 NVL72 instances on their platform. With this announcement, CoreWeave becomes the first cloud provider to make the NVIDIA Blackwell platform generally available.
And that's the news.
Hello, everyone, we are talking to Kakeru today. Kakeru is the initiator of the Kubernetes History Inspector, a new open-source project released under the Google Cloud GitHub organization. This project helps visualize logs, and it has already helped the support team at Google troubleshoot GKE problems through logs. Kakeru built it drawing on his experience working on the support team, obviously.
Welcome to the show, Kakeru.
KAKERU ISHII: Hello. Thank you for inviting me to this podcast, Abdel. I'm really excited to be here.
ABDEL SGHIOUAR: I just have to say, you're based in Japan, right?
KAKERU ISHII: Yes, I'm based in Tokyo.
ABDEL SGHIOUAR: And what time is it for you right now?
KAKERU ISHII: It's 5:00 PM.
ABDEL SGHIOUAR: All right.
KAKERU ISHII: I know you are in the early morning, right?
ABDEL SGHIOUAR: It's 9:00 AM for me. It's not too bad.
KAKERU ISHII: I'm sorry to wake you up early.
ABDEL SGHIOUAR: It's fine. It's 9:00 AM. Actually, it's funny, I live in Sweden, but I am in the very far north part of Sweden. I'm in a ski resort very, very far north. So wherever I look, there is just snow everywhere right now.
KAKERU ISHII: Nice. Tokyo rarely sees snow, and I like snow.
ABDEL SGHIOUAR: Oh, yeah. I've seen-- I have a friend who went to Sapporo, and I think it snows in Sapporo, right?
KAKERU ISHII: Yeah, Sapporo is a really cold place, and it's a good place for skiing, maybe. But here in Tokyo, it's relatively warm compared to Sapporo.
ABDEL SGHIOUAR: Got it, got it.
KAKERU ISHII: We've rarely seen snow recently.
ABDEL SGHIOUAR: Got it. All right.
So, all right, let's talk about this tool that you guys open-sourced recently, the Kubernetes History Inspector. Can you tell us what it is?
KAKERU ISHII: Yes, sure. It's a rich log visualizer designed for troubleshooting Kubernetes issues. Before this tool existed, we had to troubleshoot a pod or something by checking its container log. But if the problem couldn't be solved with the container log alone, we needed to rely on various kinds of logs, like the kubelet log, the containerd log, the kube-apiserver log, the kube-controller-manager log.
There are so many different logs needed for troubleshooting. So it requires high expertise to craft the log filters to gather these logs and investigate them, because a cluster just generates tons of logs. And we needed to understand what happened in the past around the pod just from lines of logs. It was so hard for us.
So this tool provides a detailed timeline visualization and a resource relationship diagram just from the logs available on your log backend. Currently it only supports Cloud Logging, but we are expanding support to other backends, especially for open-source Kubernetes clusters.
ABDEL SGHIOUAR: Nice. And so one important detail actually that I want to talk about is that it says history in the name, which means that the tool actually allows you to go back in time.
KAKERU ISHII: Yes. So when people hear "inspector" or something related to Kubernetes, maybe they think this is some kind of agent tool, like Prometheus or something, but actually this is just a log visualizer. It visualizes the history of the cluster's resources just from logs, by parsing all of them.
ABDEL SGHIOUAR: Yeah, we're going to talk about it. But it's important for people to understand, you don't need to install anything in your cluster.
KAKERU ISHII: Yeah.
ABDEL SGHIOUAR: I tried this yesterday. I just fired up the Cloud Shell in the console and started the Docker container. And as long as it has permission to pull the logs, it will just pull them, right?
KAKERU ISHII: Yeah.
ABDEL SGHIOUAR: And give you this rich visualization, as you said.
Where did the idea of open-sourcing the tool come from?
KAKERU ISHII: Well, I work as a support engineer at Google, and I need to troubleshoot customer cluster issues when I get a ticket. But the problem is, that may be my first day ever seeing the customer's cluster, or the customer says, my pod died yesterday or something. But when I check the cluster, maybe it's running healthy without any issue.
But troubleshooting a current, ongoing incident is easier than troubleshooting a past issue, because I can interact with the cluster if the customer allows it. Troubleshooting past issues, however, requires me to go through many kinds of logs, and it takes a really long time. I wanted a macroscopic view of the cluster. That's why I needed to create it.
After I created the prototype, I showed it to my colleagues and other support team members, and it gained popularity in my support team and many other teams at Google, and I decided to make it available to customers on the internet.
ABDEL SGHIOUAR: Nice. Yeah, so the tool is obviously-- is on GitHub. So everybody who's listening to us, you should check it out and maybe give it a little star if you want. It's very easy to start. It's just a Docker container.
I did a little bit of troubleshooting and support back in the day, before my current role. And I remember, yes, troubleshooting past problems is difficult, but so is troubleshooting problems that don't happen very often, when you have a transient issue. So how does this tool help support engineers in particular? How does it help them troubleshoot these kinds of problems?
KAKERU ISHII: Well, this is very important for the support team, because querying the logs also requires skill. First, we need to pin down the exact time the incident happened. But once this tool was adopted in the support team, after one support engineer queries the logs, they can show the visualized timeline to another support engineer and hand the case over. So they can continue troubleshooting with a detailed understanding of the cluster.
ABDEL SGHIOUAR: I see.
And so can you talk about some issues that the tool has helped you actually solve, that the Kubernetes History Inspector has helped solve?
KAKERU ISHII: Well, let me describe a rather complex ticket from a while back.
ABDEL SGHIOUAR: All right.
KAKERU ISHII: So this is about GKE with Workload Identity, which is a feature for getting an access token to call GCP APIs from a pod, after checking the permissions granted to the Kubernetes service account.
So my customer asked me to troubleshoot an intermittent error with Workload Identity. The customer said, my pod gets an authentication error intermittently, but it only happens a few times a month or something. And I realized it was caused by a third-party security product restarting containerd, because Workload Identity needs to communicate with containerd to verify the pod is actually running on the node before returning the access token.
But to troubleshoot this kind of issue, I needed to check the customer's pod log, the containerd log, the third-party security product's pod log, and also the Workload Identity system workload logs. Crafting a log query that checks all of those would be a little hard. And even if I could get the list of log lines, it wouldn't make sense on its own.
But this tool shows me how these logs happened. I can see the dots of the logs lining up at the same time on the visualization. So I could easily see, oh, this was caused by containerd restarting, triggered by the third-party security product. This kind of difficult problem involving multiple components in the cluster can be solved with KHI easily.
ABDEL SGHIOUAR: Yeah. And so I think what's important in this particular use case you're talking about is the correlation part.
KAKERU ISHII: Yeah.
ABDEL SGHIOUAR: Like, how can you correlate events happening in multiple parts of the system using the logs, and understand that those events are related to each other? You were talking about Workload Identity, which has its own pod. Then you have the customer pod. Then you have containerd. And then you have the pod of this third-party security tool. And you have to pull the logs across all these things and try to understand how they line up with each other, right?
So I was using the tool yesterday, as I said. I fired up the Docker container. I launched the interface. It has a graphical interface, a rather interesting one, because it's built with WebGL. I hadn't heard of WebGL in a very long time, but we're going to talk about it.
But how does that correlation work? Is it the tool that correlates these things together? Or do you have to find that correlation yourself?
KAKERU ISHII: Well, meaning users need to customize the correlation settings or something?
ABDEL SGHIOUAR: Or does the tool find those events and then say, OK, these events are related in time? How does that work?
KAKERU ISHII: So basically, KHI has various parsers already implemented in its code base. It parses the structured log fields, and it decides which resources are related to which logs.
But log parsers are not so simple. Some log parsers need to run before other log parsers. For example, the containerd log parser shows container behavior with a container ID, but it won't show any pod name.
ABDEL SGHIOUAR: Yeah.
KAKERU ISHII: But the kubelet logs both the container ID and the pod name. So I can correlate a container ID with a pod, because I run the containerd parser after getting the pod names from the other logs. But this gets a little hard, because many parsers depend on other parsers. So KHI is basically built on a directed acyclic graph. It's a DAG-based log parser system. That makes KHI extensible.
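The two-pass idea Kakeru describes, running the kubelet parser first to learn pod names and then letting the containerd parser join on container IDs, can be sketched like this. The log line formats, regexes, and function names here are invented for illustration; real kubelet and containerd log formats differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical log line shapes, not the real formats.
var kubeletLine = regexp.MustCompile(`pod="([^"]+)" containerID="([^"]+)"`)
var containerdLine = regexp.MustCompile(`container ([0-9a-f]+) exited`)

// parseKubelet runs first: it extracts the containerID -> pod name
// mapping that later parsers depend on.
func parseKubelet(lines []string) map[string]string {
	podByID := map[string]string{}
	for _, l := range lines {
		if m := kubeletLine.FindStringSubmatch(l); m != nil {
			podByID[m[2]] = m[1]
		}
	}
	return podByID
}

// parseContainerd runs after parseKubelet, so container events can be
// attributed to a pod even though containerd logs never name one.
func parseContainerd(lines []string, podByID map[string]string) []string {
	var events []string
	for _, l := range lines {
		if m := containerdLine.FindStringSubmatch(l); m != nil {
			events = append(events, fmt.Sprintf("pod %s: container %s exited", podByID[m[1]], m[1]))
		}
	}
	return events
}

func main() {
	podByID := parseKubelet([]string{`SyncPod pod="default/nginx" containerID="ab12cd"`})
	for _, e := range parseContainerd([]string{`container ab12cd exited`}, podByID) {
		fmt.Println(e) // container event now carries the pod name
	}
}
```

The ordering constraint is the whole point: run `parseContainerd` before `parseKubelet` and the pod name is simply not available yet, which is why the parsers end up arranged in a DAG.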
Currently we haven't published any documentation on how to extend the KHI log parsers, but I will do that later. That will help customers extend the log parsers to support their own custom controllers or something.
ABDEL SGHIOUAR: I see.
KAKERU ISHII: And this extensibility lets KHI support many kinds of logs.
ABDEL SGHIOUAR: OK, cool.
Then I want you to talk to us quickly about how it actually works behind the scenes. I tried it. Essentially, you start a Docker container. You select, well, in our case, you select a project ID. Then you select which cluster you want. Then there's a little indicator on the interface that says, I'm running, I'm pulling some logs. And then you get the visualization. But how does it actually work behind the scenes?
KAKERU ISHII: Well, each parser has dependencies of the form, this parser needs to have the project ID before querying, or something like that. These dependencies are defined for each parser in the DAG-based graph. So KHI just runs this graph-based task runner and generates one single log binary file, and then that is parsed by the front end to show the diagram.
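A dependency-driven task runner of the kind described can be sketched as a small DAG executor with depth-first topological ordering. The `Task` type, the task names, and the string results are all hypothetical; KHI's real task system is more elaborate.

```go
package main

import "fmt"

// Task is a hypothetical DAG node: each task names the tasks whose
// results it needs, mirroring how a parser needs, say, the project ID
// before it can query logs.
type Task struct {
	Name string
	Deps []string
	Run  func(results map[string]string) string
}

// runDAG executes tasks in dependency order via depth-first
// topological sort, passing earlier results to later tasks and
// rejecting cycles.
func runDAG(tasks []Task) (map[string]string, error) {
	byName := map[string]Task{}
	for _, t := range tasks {
		byName[t.Name] = t
	}
	results := map[string]string{}
	visiting := map[string]bool{}
	var visit func(name string) error
	visit = func(name string) error {
		if _, done := results[name]; done {
			return nil
		}
		if visiting[name] {
			return fmt.Errorf("cycle at %s", name)
		}
		visiting[name] = true
		for _, d := range byName[name].Deps {
			if err := visit(d); err != nil {
				return err
			}
		}
		results[name] = byName[name].Run(results)
		return nil
	}
	for _, t := range tasks {
		if err := visit(t.Name); err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	tasks := []Task{
		{Name: "containerd-parser", Deps: []string{"kubelet-parser"},
			Run: func(r map[string]string) string { return "events joined with " + r["kubelet-parser"] }},
		{Name: "kubelet-parser", Deps: []string{"project-id"},
			Run: func(r map[string]string) string { return "pod-name map" }},
		{Name: "project-id", Run: func(r map[string]string) string { return "my-project" }},
	}
	results, _ := runDAG(tasks)
	fmt.Println(results["containerd-parser"])
}
```

Notice the tasks are declared out of order: the runner discovers from the `Deps` edges that `project-id` must run first, which is exactly what makes the parser set easy to extend with new parsers that declare their inputs.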
ABDEL SGHIOUAR: I see. And as I said, it uses WebGL for the rendering of the interface, right?
KAKERU ISHII: Yeah.
ABDEL SGHIOUAR: Did you learn WebGL to build this? Or had you worked with WebGL before?
KAKERU ISHII: Before joining Google (actually, I joined Google as a new grad), my hobby was doing open-source work, especially on a WebGL framework.
ABDEL SGHIOUAR: OK.
KAKERU ISHII: So I made a WebGL framework before joining Google. That's why I had experience with WebGL. Actually, for me, ordinary web development is harder than WebGL, because I needed to learn Angular to build this application. But on the WebGL side, I know the basics really well, and I could achieve this performant visualization with my existing knowledge.
ABDEL SGHIOUAR: I see, I see. All right. So I think I have to ask this question, because we are in 2025 and AI is all around us these days. Before I go there, the tool is pretty much in memory only, right? So when you fire up the container, all the logs are in memory, right?
KAKERU ISHII: Yeah, actually, keeping it in memory is intentional. If the tool were for storage, I think that should be done on the back end. But I built the application for investigating quickly. That's why I wanted to keep all the logs in memory on the front-end side.
ABDEL SGHIOUAR: Yeah.
KAKERU ISHII: That's designed intentionally, and yeah, it works in memory.
ABDEL SGHIOUAR: Nice. And so do you see a future in which an LLM could be integrated into KHI and help troubleshoot issues?
KAKERU ISHII: Well, maybe I can consider it. But the important role for KHI alongside AI would be, even if an LLM says, OK, this problem happened because of this configuration issue, or this intermittent issue was triggered by this pod or something, maybe people wouldn't be convinced. They want to understand why it happens with a visualization, not just with text.
ABDEL SGHIOUAR: Yeah.
KAKERU ISHII: So that's why this visualization is still important, even with LLMs.
ABDEL SGHIOUAR: I see, I see. Nice, cool.
Well, this is actually pretty cool. I highly recommend people go check it out. It's on GitHub. We'll leave the link in the show notes. Go check it out. Go try it out. Give it a star on GitHub. If there are any features missing, either implement them or open an issue, I guess, right?
Kakeru is on Twitter, but your Twitter is mostly in Japanese?
KAKERU ISHII: Yeah. You can follow me, but maybe it's only limited for-- no, no-- it will be a little hard for non-Japanese speakers.
ABDEL SGHIOUAR: Yeah, so that's fine. Maybe they can talk to you on GitHub.
KAKERU ISHII: Yeah.
ABDEL SGHIOUAR: Yeah. Well, thank you for joining us on the show, Kakeru.
KAKERU ISHII: Thank you.
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at Kubernetes Pod or reach us by email at kubernetespodcast@google.com.
You can also check out the website KubernetesPodcast.com, where you will find transcripts and show notes and links to subscribe. Please consider rating us in your podcast player so we can help people find and enjoy the show.
Thank you for listening, and we'll see you next time.
[MUSIC PLAYING]