#77 October 30, 2019

Engineering Productivity and Testing, with Katharine Berry

Hosts: Craig Box, Adam Glick

Katharine Berry works in the Engineering Productivity team at Google Cloud, and works in SIG Testing on the Kubernetes project. She joins Adam and Craig to discuss Prow, Pebble and ponies.

ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

ADAM GLICK: Craig, I wanted to congratulate you as someone who lives in London. I saw that your rugby team did exceptionally well this week.

CRAIG BOX: Yes, it's been a tough week to be a New Zealander. [ADAM CHUCKLING] You may recall, last week, I was talking about the DevOpsDays conference in Auckland. I was there on the first day, did our podcast, and then on the second day, the building next door caught fire. There was a giant fire at a convention center in Auckland, and it was right next door to the hotel we were staying at. And it's basically national news when anything like that happens.

The center hadn't been finished building yet. It was due to open next year. It was due to host a big APEC leaders meeting in 2020, which is now under question as to whether that will happen.

But we were all reeling from that disaster when the next disaster happened, which of course was that England beat New Zealand in the Rugby World Cup semi-final. And in fairness to New Zealand, we've won every game that we've played in the Rugby World Cup for the last 12 years. We are two times holders. So every now and then, you have to have a bad day. And it just happened that this was ours.

ADAM GLICK: Well, hopefully, no one was hurt in the fire. And I'm sure there'll be another chance for the Rugby World Cup.

CRAIG BOX: Yes, there's more just hurt feelings on the pitch, I believe. How's things in the world of restaurants and board games?

ADAM GLICK: Following on the discussion we had a couple of weeks ago, it now turns out that there is a board game that's opening a restaurant.

CRAIG BOX: Oh, really?

ADAM GLICK: The makers of Cards Against Humanity have decided to open themselves up a game restaurant, with a couple of escape rooms, and lots of board games, of course. And I was just like, wow, that's a wonderful flip-around from restaurants creating games. Now games are going to create restaurants.

CRAIG BOX: One of the more on-topic things the Cards Against Humanity team have done, I remember watching them live stream digging a giant hole a couple of years ago.

ADAM GLICK: [LAUGHS] Yes, indeed. Their Christmas gift. They dug a do-nothing hole. You've got to give them credit, like just clever stuff that they've been doing.

CRAIG BOX: It's a fun game, and they're good at staying in the news.

ADAM GLICK: Shall we get to the news?

[MUSIC PLAYING]

ADAM GLICK: Google Cloud has launched GKE release channels into beta. First announced in April, release channels let you pick from three different versions of Kubernetes based upon the stability versus freshness tradeoff you want to make for a cluster and its workloads. The rapid, regular, and stable channels currently offer 1.15, 1.14, and 1.13 versions, respectively.

CRAIG BOX: GKE usage metering, the topic of episode 40 back in February, is now generally available. The feature allows you to see your GKE cluster's resource usage broken down by namespaces and labels, and attribute it to meaningful entities, like department, customer, or environment. New features include the ability to query consumption metrics and compare them against requests, letting you size your pods more appropriately and reduce overprovisioning. You can also join your usage data with your billing data to get a cost breakdown per namespace and label.

ADAM GLICK: Google Cloud this week released a solution guide on implementing PCI DSS, the Payment Card Industry Data Security Standard, on top of GKE. Anyone dealing with payment card data online has to ensure their solution complies with the many requirements set out in the standard. And the new guide helps you address concerns unique to GKE applications. The solution also includes a starter project with Terraform scripts to help you set up multiple Google Cloud projects to keep in-scope and out-of-scope data separate.

CRAIG BOX: In many of our news segments, we bring you information about a CVE, or security vulnerability, in the Kubernetes project. Most are quickly accompanied by a new patch release. But how does this all work?

CJ Cullen from Google Cloud is one of the Kubernetes Product Security Committee members, and with colleague Ann Bertuccio, has written a guide to demystify the process. The seven-person committee contains two previous podcast guests, and along with two associate members in training, represents five different companies. The group has an on-call roster, and triages incoming security reports, working with maintainers and other projects to coordinate the release of fixes. The post ends with information about GKE's security bulletin process and a suggestion to familiarize yourself with the security process from whoever your vendor may be.

ADAM GLICK: Hewlett Packard Enterprise has teased a new Kubernetes-based platform at their annual securities analyst meeting. HPE claims the platform will unify technology from AI and data analytics companies BlueData and MapR, according to comments from CEO Antonio Neri, adding that no matter what hybrid cloud stack their customers choose, they want to differentiate their experience with HPE's software.

CRAIG BOX: Did you learn about kubectl plugins and krew through our interview with Luk Burchard and Ahmet Alp Balkan in episode 66? You will have heard that some plugins are written in Bash, and the maintainers have been trying to get more robust implementations written for these. Jonas-Taha El Sesiy stepped up and rewrote the view-secret plugin in Go, adding new functionality to automatically base64 decode secrets for administrative convenience. He documents his path to contribution in a blog post, which celebrates the ease of working with the krew team.
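For readers unfamiliar with why decoding is needed: values in a Kubernetes Secret's `data` field are base64-encoded. The following is a minimal Python sketch of that decoding step only. It is not the view-secret plugin's code (which is written in Go), and the function name `decode_secret_data` is hypothetical.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode each base64-encoded value of a Secret-style `data` mapping.

    Assumes every value decodes to UTF-8 text, which is not guaranteed
    for real secrets (they may hold arbitrary bytes).
    """
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }
```

Called on a mapping such as the `data` section of a Secret fetched as JSON, this returns the same keys with human-readable values.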

ADAM GLICK: At Mobile World Congress last week, NVIDIA announced the aerial application framework for running software at the edge of 5G mobile networks, presumably on the top of tall towers that are made to look like trees. They also announced a partnership with Red Hat to run this stack on top of OpenShift and a GPU operator which can be used in any Kubernetes cluster.

CRAIG BOX: Red Hat has updated the OpenShift container storage product to 4.2, matching the underlying engine. The product uses Ceph, Rook and NooBaa, a 2018 acquisition, to provide block, file, and object storage services. Like everything Red Hat does, it's now powered by an operator and available through their Operator Hub. The announcement post says it runs anywhere OpenShift does. But two sentences below, points out that it's only supported on AWS or VMware.

ADAM GLICK: Version 2.3 of the Kontena Lens dashboard is out, touting massive performance improvements provided by fixes pushed up to the Kubernetes client libraries. Lens is free, but not open source. Other dashboard options include VMware's Octant, which released 0.8 recently, with highlights including a dark mode.

CRAIG BOX: Finally, Zoho, a company producing a suite of SaaS tools for CRM and online productivity, has launched Catalyst, a platform for building integrations with its services. Catalyst lets you build applications on the same backend that Zoho uses for its 45 SaaS applications, and offers a proprietary functions-as-a-service layer based on Kubernetes to interact with it. No word yet on what serverless framework they're using behind the scenes.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

ADAM GLICK: Katharine Berry works on the engineering productivity team at Google Cloud, and works in SIG Testing on the Kubernetes project. Welcome to the show, Katharine.

KATHARINE BERRY: Hi. Thanks.

CRAIG BOX: Before Google, you worked at Pebble, a smartwatch company. I was one of the Kickstarter Pebble watch members, wore it for many years. And it just so happens that my brother, who came to visit this weekend, he left his smartwatch charger for some newer device at home. And I got the trusty Pebble out of the cupboard, and that's what he will be wearing for his trip around Europe for the next few weeks.

KATHARINE BERRY: Cool.

CRAIG BOX: What was it like working there?

KATHARINE BERRY: Working at Pebble was very interesting. When I joined, it was, like, 30 people. And it was 200 people by the time I left.

CRAIG BOX: Did you leave because of the company's sale and closure?

KATHARINE BERRY: I would have stayed at Pebble. But unfortunately, it ran out of money and closed, and sold the remaining assets to Fitbit.

CRAIG BOX: I understand that Fitbit kept everything online for a while. But after they shut the backend services down, you-- you and perhaps a team of people-- did something about it.

KATHARINE BERRY: Yes. So Fitbit was pretty generous and kept the servers running for a year or so. But eventually, they did shut down, as Fitbit had promised. So me and a bunch of other people started a project called Rebble and founded a company called the Rebble Alliance to run it--

CRAIG BOX: Of course.

KATHARINE BERRY: --because we love our puns. But basically, we re-implemented the entire Pebble backend service via reverse engineering, and set up a system so that basically, anyone could switch their watches to use that instead. And that now has over 100,000 users.

CRAIG BOX: Fantastic.

KATHARINE BERRY: There are still some people using their Pebbles.

CRAIG BOX: Is there any Kubernetes involved in that infrastructure?

KATHARINE BERRY: Currently, no.

CRAIG BOX: OK.

ADAM GLICK: Who pays for running the infrastructure?

KATHARINE BERRY: It's actually a sort of freemium thing, where you get a couple of extra features, mostly features that cost us a lot of money, if you pay us $3 a month. So that funds the servers as well.

ADAM GLICK: I mentioned in the introduction that you work in engineering productivity -- sometimes called EngProd -- but people outside of Google might not be familiar with the term EngProd. Can you explain what EngProd is?

KATHARINE BERRY: EngProd works with developers in order to help make them more productive and efficient. We help work on things like tests and ensuring that they can basically get their jobs done as efficiently as possible and for greatest velocity.

CRAIG BOX: A number of companies have engineers that work solely on testing. Your role is more on the infrastructure that runs testing. In the Google case, for example, do you have a different team who builds the tests, or do the engineers build the tests for the code that they write?

KATHARINE BERRY: For the most part, the engineers are expected to build the tests of the code they write. And EngProd generally does not own tests.

CRAIG BOX: I believe that follows through to Kubernetes SIG Testing, who also don't own tests?

KATHARINE BERRY: Yes, I imagine this was inherited from Google, but SIG Testing in Kubernetes also does not own tests. And again, we provide the way to run the tests, and perhaps the test frameworks they run in. But we do not actually build tests, and we are not responsible for them.

ADAM GLICK: When you're working with the engineering teams that are building the products with Kubernetes, how do you work together in terms of the test infrastructure? Is it a mandated set of infrastructure that they need to write their tests in to test it? Or if they have new test frameworks they want to use, can that be implemented? You work with them?

KATHARINE BERRY: The only thing we really enforce is that they must run their tests on Prow, which is our test infrastructure CI system. In the main Kubernetes repo there are a whole bunch of conventions that they are generally expected to follow. But SIG Testing does not inherently enforce those. And assorted sub-projects can do more or less whatever they like, as long as they have tests and they run them on our infrastructure.

CRAIG BOX: You mentioned the tool Prow, which is the portion of a ship's bow that is above water. So we went for the nautical theme, rather than the Greek theme with that name.

KATHARINE BERRY: We did, yes.

CRAIG BOX: Why a custom tool?

KATHARINE BERRY: In days past, we ran Jenkins. And that was tricky because we have substantial testing load, and a whole bunch of tests. And Jenkins didn't handle things like cleaning up after tests that aborted midway or anything, which resulted in maintaining it was more work than building our own tool, essentially. So we built our own tool. We also tried using Travis, I think, a long time ago. And that just crashed.

ADAM GLICK: The things you're mentioning here are normally thought of as CI tools.

KATHARINE BERRY: Yes.

ADAM GLICK: Was there any thought put into taking the CI process and tools that you built and turning that into an open source project itself?

KATHARINE BERRY: Prow is indeed an open source project, and we have contributors and users from all sorts of companies. I know Jetstack uses it. OpenShift uses it. It's part of Jenkins X - so that's kind-of come full circle there.

ADAM GLICK: Will it be part of the CDF eventually?

KATHARINE BERRY: I have no idea. I think it's probably going to stay in the Kubernetes project for the foreseeable future.

CRAIG BOX: For people who are unfamiliar with the Kubernetes development process, my assumption is that whenever somebody commits code, or makes a pull request to the Kubernetes infrastructure, they are proposing, I would like to submit this code to it. And then it has to undergo a barrage of tests. So all of the areas that it touches will need to have a test run for it, and Prow is the system that runs that. What is the actual end-to-end experience for someone from making a pull request to the test running?

KATHARINE BERRY: When you make a pull request against a Kubernetes repo, Prow will indeed start running some tests. And if you're new, also leave you a comment explaining how to contribute to Kubernetes as a user, which was done in partnership with [SIG] Contribex. Once Prow starts running those tests, which are actually all just essentially Kubernetes pods that run jobs, Prow will report the status of the tests back to the PR. And if they are all successful, then the PR is considered mergeable.

This isn't all that Prow does, though. Prow is also responsible for handling OWNERS files in Kubernetes, which essentially control who is allowed to approve these PRs. So Prow will essentially check for approvals from users who are listed as owners. And if it gets enough of them, then it adds a label saying it's approved.

And then another part of Prow, which is called Tide, will see that the thing has all the labels it needs and proceeds to actually start merging it into the code base. Because we have a lot of tests and a lot of PRs, we also actually rerun the tests on the merged results before we commit it. And because of volume, we actually do this in batches. So we take, say, 15 PRs that are ready to merge, merge them all together, test them, and then merge all of them if, and only if, they all pass.
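The approval and batch-merge flow described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not Prow's or Tide's actual code: the names `PullRequest`, `apply_approval_label`, `merge_batch` and the `run_tests` callback are hypothetical, and the number of approvals needed is a made-up parameter.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class PullRequest:
    number: int
    approvers: Set[str] = field(default_factory=set)  # users who approved
    labels: Set[str] = field(default_factory=set)


def apply_approval_label(pr: PullRequest, owners: Set[str], needed: int = 1) -> None:
    """Add an 'approved' label once enough listed owners have approved."""
    if len(pr.approvers & owners) >= needed:
        pr.labels.add("approved")


def merge_batch(
    ready: List[PullRequest],
    run_tests: Callable[[List[PullRequest]], bool],
    batch_size: int = 15,
) -> List[int]:
    """Test a batch of labelled PRs together.

    Returns the PR numbers to merge: the whole batch if the combined
    result passes, and nothing at all if it fails.
    """
    batch = [pr for pr in ready if "approved" in pr.labels][:batch_size]
    if batch and run_tests(batch):
        return [pr.number for pr in batch]
    return []
```

The all-or-nothing return value is the point of the sketch: a single failing combination blocks every PR in that batch rather than merging a subset.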

CRAIG BOX: Does that second part of the system you mentioned run Tide Pods?

KATHARINE BERRY: [CHUCKLING] Yes.

CRAIG BOX: Where's the state for Prow kept? You mentioned, obviously, the PR itself had some state, and that it's updated when various states happen. But does Prow have a back end database for each commit?

KATHARINE BERRY: Prow is resolutely stateless. It keeps all of its actual state either in the PR in the form of GitHub labels, or sometimes in text in the comments it leaves, and also in our Kubernetes CRDs. In particular, we have a ProwJob CRD, which knows the status of any jobs that are supposed to be running.
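One way to picture this stateless design: a PR's status is recomputed on demand from state stored elsewhere (its labels and its job records), so nothing is lost if the service restarts. The Python sketch below is an assumption-laden illustration, not Prow's logic. The function `pr_status`, the label names, the job-state strings and the returned status strings are all chosen for the example.

```python
from typing import Dict, Iterable, Set

# Illustrative label names; a real repository configures its own requirements.
REQUIRED_LABELS = {"lgtm", "approved"}


def pr_status(labels: Set[str], job_records: Iterable[Dict[str, str]]) -> str:
    """Recompute a PR's status purely from externally stored state:
    the labels on the PR and the recorded state of each of its jobs."""
    states = [record["state"] for record in job_records]
    if any(state == "failure" for state in states):
        return "tests-failed"
    if any(state in ("triggered", "pending") for state in states):
        return "tests-running"
    if not REQUIRED_LABELS <= labels:
        return "needs-labels"
    return "mergeable"
```

Because the function takes everything it needs as arguments and remembers nothing between calls, it can be run at any time, by any replica, and give the same answer.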

ADAM GLICK: Is there a way for developers to run the tests locally before they submit it to the system?

KATHARINE BERRY: For unit tests, it is possible by using the standard "make test" construct. For integration tests, essentially, no. But we are attempting to work on improving this by using Ben Elder's kind to help run some of the tests. In some other cases, it is near impossible, because they test, like, GCE resources or similar that you can't really run locally without paying someone money.

CRAIG BOX: You have an interface which developers can check into to see the status of the tests, and in some cases why the tests have failed. What is that interface and where is it keeping its data?

KATHARINE BERRY: When a test finishes running, it uploads all of its information to a GCS bucket, which includes the test logs, any artifacts they happen to generate, JUnit files that contain the output of tests, etc. On the GitHub PR page, we drop links to a page called Spyglass, because we're sticking with our boat theme. You may have noticed everything is boat-themed here. Spyglass is a page that I built that basically allows us to create pluggable artifact viewers.

So in essence, it reads the things from GCS and attempts to render them in a way that will be understandable to you. So we do things like log highlighting, breaking out your test successes and failures. If you generated coverage information, we'll render that, and so forth.
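The "pluggable artifact viewer" idea can be sketched as a small registry that maps artifact names to rendering functions. This Python example is a hypothetical illustration, not Spyglass's actual design or API: the names `register_viewer` and `render` and the filename patterns are assumptions, and artifacts are passed in as an in-memory dict rather than read from a GCS bucket.

```python
import fnmatch
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

Viewer = Callable[[str], str]
_VIEWERS: List[Tuple[str, Viewer]] = []


def register_viewer(pattern: str) -> Callable[[Viewer], Viewer]:
    """Decorator registering a viewer for artifact names matching `pattern`."""
    def decorator(func: Viewer) -> Viewer:
        _VIEWERS.append((pattern, func))
        return func
    return decorator


@register_viewer("*.log")
def view_log(content: str) -> str:
    # Crude "log highlighting": mark lines that mention an error.
    return "\n".join(
        ">> " + line if "error" in line.lower() else line
        for line in content.splitlines()
    )


@register_viewer("junit*.xml")
def view_junit(content: str) -> str:
    # Summarise test cases by whether they contain a <failure> element.
    cases = list(ET.fromstring(content).iter("testcase"))
    failed = [case for case in cases if case.find("failure") is not None]
    return f"{len(cases) - len(failed)} passed, {len(failed)} failed"


def render(artifacts: Dict[str, str]) -> Dict[str, str]:
    """Render each artifact with the first viewer whose pattern matches it."""
    rendered = {}
    for name, content in artifacts.items():
        for pattern, viewer in _VIEWERS:
            if fnmatch.fnmatch(name, pattern):
                rendered[name] = viewer(content)
                break
    return rendered
```

Adding support for a new artifact type (coverage output, for instance) then means registering one more function, without touching `render`.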

ADAM GLICK: Do you have any requirements in terms of amount of code coverage that the tests cover in order for it to be integrated?

KATHARINE BERRY: Currently, we do not, for the main Kubernetes project. There is an ongoing effort to improve the coverage of Kubernetes, but it is a long haul, and we've only just started being able to measure it again reliably.

CRAIG BOX: We hear a lot about the concept of a flaky test.

KATHARINE BERRY: Yeah.

CRAIG BOX: What exactly is a flaky test?

KATHARINE BERRY: A flaky test is a test that you can run it, and it fails. And then you run it again on the same code, and it passes. Which is deeply frustrating, because it suggests that your code is not the reason the test failed.

CRAIG BOX: So it could be time of day or solar flares, or anything else?

KATHARINE BERRY: Yes. Solar flares are an especially common failure mode.

ADAM GLICK: So that comes to the question of who watches the quality of the code that watches the quality?

KATHARINE BERRY: Yes.

ADAM GLICK: Do you have a way to identify tests that are not reliable, and so you know what failures coming through the system are likely to be actual failures, versus which ones are likely to be these, quote, unquote, "flaky" tests?

KATHARINE BERRY: We have some dashboards that attempt to measure the flakiness of individual jobs. We are currently less good at measuring the flakiness of individual tests. In some cases, we literally just rerun the tests until they pass on a per test basis, which masks the flakes and arguably exacerbates the problem. But it does make the developers happier.
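One simple way to measure per-job flakiness, following the definition given earlier (the same code both failing and passing), is sketched below in Python. This is an assumed metric for illustration, not how the project's dashboards actually compute it. The `Run` tuple layout and the function `flake_rates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Run = Tuple[str, str, bool]  # (job name, commit SHA, passed?)


def flake_rates(runs: Iterable[Run]) -> Dict[str, float]:
    """For each job, the fraction of commits that saw both a pass and a failure."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for job, commit, passed in runs:
        outcomes[(job, commit)].add(passed)

    commits: Dict[str, int] = defaultdict(int)
    flaky: Dict[str, int] = defaultdict(int)
    for (job, _commit), seen in outcomes.items():
        commits[job] += 1
        if seen == {True, False}:  # same commit, different results
            flaky[job] += 1
    return {job: flaky[job] / commits[job] for job in commits}
```

A commit that only ever failed does not count as a flake under this measure, since a consistent failure points at the code rather than at the test.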

[LAUGHTER]

CRAIG BOX: In the case of these flaky tests, where a certain percentage of them will fail, can you be sure that the code is good in that situation? Or would the project look to enforce a change to the test, to make sure that they're validating the thing they're actually trying to validate.

KATHARINE BERRY: If a test is too flaky, then we can't trust that it's testing what it means to test. Or we can't trust that the code is correct, and the flakes really do mean that things are just broken sometimes. By policy, we are trying to reduce the number of flakes, and we attempt to report them when we see them and get them fixed. We have issues where a lot of our flakes are infrastructure, rather than tests. So that is where SIG Testing is attempting to improve the state of our infrastructure, so we don't have infrastructure flakes either.

ADAM GLICK: Do you have a policy as to what percentage of tests must pass in order for a PR to get checked in?

KATHARINE BERRY: All of them.

ADAM GLICK: So you need a 100% pass rate?

CRAIG BOX: Unless they're flaky.

KATHARINE BERRY: 100% pass rate. And if your test flakes, you have to keep rerunning it until it passes, which people do. We actually have a bot that will just repeatedly drop "/retest", which is a command that reruns your tests, until your tests have passed.
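The retest behaviour amounts to a retry loop. The Python sketch below illustrates the idea only. It is not the bot's code, the function `retest_until_green` is hypothetical, and the cap on attempts is an added assumption that the conversation does not mention.

```python
from typing import Callable


def retest_until_green(run_tests: Callable[[], bool], max_attempts: int = 5) -> int:
    """Rerun the tests until they pass.

    Returns the number of attempts used, or raises RuntimeError if the
    (assumed) attempt cap is reached without a passing run.
    """
    for attempt in range(1, max_attempts + 1):
        if run_tests():
            return attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts")
```

A loop like this hides flakes from the person waiting on the merge, which is exactly the trade-off discussed: the PR gets through, but the underlying unreliability stays in place.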

CRAIG BOX: And that's considered acceptable?

KATHARINE BERRY: Usually, when this happens a lot, it's because the test suite was failing, because some cloud provider ran out of VMs or something. And it's usually not considered a reflection of the state of your PR.

ADAM GLICK: Many people talk about wanting to automate themselves out of a role. And you're one of the few people I've actually met that has actually done this. Normally, you see people, they get halfway through it, and then basically end up embedding themselves further into the process, with a bunch of custom automation that only they know and no one else has. When we talked to Lachie in episode 42, he mentioned that you were the test-infra lead for 1.15, and that you automated the role away. Can you talk a little bit about what it's like to automate yourself out of a job?

KATHARINE BERRY: When I joined the release team for the 1.15 release, and took a look at what the job actually involved doing, it was a bunch of almost mechanical config file munging, specifically to create jobs to run on new release branches, and to turn code freeze on and off. So this was pretty labor-intensive in part, because it meant that you had to go to the release team meetings every week, and also primarily because the role involved an awful lot of editing config files. Like, thousands and thousands of lines of config files. And while it was pretty mechanical, it was also not entirely clear what you were supposed to do with them, which had resulted in there being errors made in every single release so far.

I helped automate this by making changes to the tools and to the processes involved. So part of the problem was that we needed to build TestGrid config, TestGrid being the tool we use to view whether our tests are passing, and which the release team used extensively to know if a build is green and can be released. So I substantially improved the ease with which one could edit that configuration, and additionally made it possible to automatically generate configuration for it, which was then important when I built tooling that could be run by a member of a release team that would read new annotations in the job configuration and figure out what to do with it.

So all of our jobs, I added markers to them that say that this job needs to also be run on release branches and should have these differences when it does so. And in essence, I added those annotations to all of the jobs. And the tools now will create new files for every release which can be automatically maintained, so that humans don't need to edit almost any configuration files. And that took the workload from hours down to minutes of work, which enabled us to essentially move it, or merge it, to another release team role.
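The annotation-driven approach can be sketched as follows: jobs carry markers saying they should be copied for each release branch, and a tool generates the copies. This Python example is a simplified illustration under stated assumptions, not the real tooling. The annotation keys, the job dictionary layout and the function `fork_jobs` are all hypothetical.

```python
import copy
from typing import Dict, List

# Hypothetical annotation keys, invented for this sketch.
FORK_ANNOTATION = "example.io/fork-per-release"
EXTRA_ARG_ANNOTATION = "example.io/release-extra-arg"


def fork_jobs(jobs: List[Dict], version: str) -> List[Dict]:
    """Generate release-branch copies of every job marked for forking."""
    forked = []
    for job in jobs:
        annotations = job.get("annotations", {})
        if annotations.get(FORK_ANNOTATION) != "true":
            continue  # job is not meant to run on release branches
        new_job = copy.deepcopy(job)
        new_job["name"] = f"{job['name']}-{version}"
        new_job["branch"] = f"release-{version}"
        # Apply the per-release difference that the annotation asks for.
        extra_arg = annotations.get(EXTRA_ARG_ANNOTATION)
        if extra_arg:
            new_job["args"] = job.get("args", []) + [extra_arg.format(version=version)]
        forked.append(new_job)
    return forked
```

With the intent recorded next to each job, cutting a new release becomes one tool invocation instead of hand-editing thousands of lines, which is the shift from hours to minutes described above.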

CRAIG BOX: If we made you a release manager, could you automate that role away, too?

KATHARINE BERRY: I wish. [CHUCKLING]

CRAIG BOX: When we spoke to Jorge Castro in episode 74, he talked about the work that you'd done in automating Slack and working on the Kubernetes Slack. You're a moderator on that Slack. Can you talk a little bit about your responsibilities and what you've built to make that job easier?

KATHARINE BERRY: Sure. This stems from an incident in March, I think, where some trolls raided the Kubernetes Slack channel. And we ended up discovering that, in essence, we had no process, and did not have the manpower to handle the task of dealing with these events, which resulted in Kubernetes Slack actually being closed for new members for a month or two. As part of reopening it, a whole bunch of new moderators were drafted. I think maybe 20, one of whom was me. And I additionally built a bunch of tooling in order to make this easier.

One of the most important tools I added was a new tool that lets Slack users report messages from people. So they can just click a button on a message, and it will be reported to the moderators with an explanation of why it was reported, and potentially anonymously if they so choose. Which has helped us enforce our code of conduct, even in lieu of Slack somewhat lacking community management tools.

Another useful tool I built is something called Tempelis, going back to the Greek theme. And that basically lets us configure our entire Slack setup, like all the channels, all of the user groups, using YAML config files in true Kubernetes fashion.

In essence, anyone can make PRs against those files. And Tempelis will act on the results and update Slack to match what the config files say it should be. And this has actually enabled us to have user groups which are, in essence, a Slack feature that lets you ping multiple people at once, which we had previously not been able to use, because only admins can update them. And that would have been substantial work for the admin team.
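Making Slack "match what the config files say" is a reconciliation: compare the desired state with the actual state and compute the changes. The Python sketch below shows that idea for user groups only. It is an illustration, not Tempelis's code: the function `plan_usergroup_changes` is hypothetical, and the desired state is passed in as a plain dict where the real tool would parse it from YAML and read the actual state from Slack.

```python
from typing import Dict, List, Set, Tuple

Action = Tuple[str, str, str]  # (verb, group name, user)


def plan_usergroup_changes(
    desired: Dict[str, Set[str]],
    actual: Dict[str, Set[str]],
) -> List[Action]:
    """Compute the membership changes needed to make the actual user
    groups match the configured ones. Groups absent from the config
    are left untouched in this sketch."""
    actions: List[Action] = []
    for group in sorted(desired):
        current = actual.get(group, set())
        for user in sorted(desired[group] - current):
            actions.append(("add", group, user))
        for user in sorted(current - desired[group]):
            actions.append(("remove", group, user))
    return actions
```

Since the plan is derived entirely from the reviewed config, a merged PR is the only approval step, and admins no longer have to apply each membership change by hand.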

CRAIG BOX: Have these tools been adopted by other projects?

KATHARINE BERRY: I don't believe so, although other projects have occasionally come by and asked how they could use them. I'd never heard back, so I don't know if they were used or not. I would love to see them be used by more projects.

I think they are generally useful, especially the ability to report users, and also the ability to ban users, which Slack barely has. The tools are all open source, and live in a Kubernetes repo. So people could pick them up if they wanted to.

ADAM GLICK: For the Kubernetes testing infrastructure, how big is that infrastructure and how large is the team that you work on that manages it?

KATHARINE BERRY: The build cluster that we run the builds in has something like 800 cores, and runs many thousands of tests a day. On top of that, a lot of the tests themselves spin up Kubernetes clusters. And those use I don't know how many resources-- huge amounts.

ADAM GLICK: How big is the team that's managing all of that?

KATHARINE BERRY: The team managing this infrastructure is probably about eight people, I would say, mostly at Google, but not entirely.

ADAM GLICK: Since the infrastructure is up and running right now, and you've done a bunch of automation, what comes next?

KATHARINE BERRY: Our biggest project at the moment is to migrate it over to the CNCF instead of having it be controlled by Google, which will enable far more people to be involved. It will enable us to have 24/7 on-call monitoring, so that when the infrastructure goes down at 3:00 AM my time, and the rest of my team's time, the tools don't stay down until one of us wakes up to fix it.

ADAM GLICK: If anyone listening is interested in getting involved in helping with that, where should they go?

KATHARINE BERRY: The SIG Testing channel on Slack is a great place to go. We also have a SIG meeting every other Tuesday, which they can attend. And we discuss where we're going there, basically.

CRAIG BOX: Finally, your Twitter avatar is a My Little Pony. And it feels like that might just be scratching the surface.

KATHARINE BERRY: Yes. I have a bunch of ponies around in my life. I quite like "My Little Pony," and have done for some years now. One cool thing I did with ponies and computers recently was, I took a generative adversarial network, which is a type of neural network, and basically fed it a large number of pictures of ponies.

And the results are not perfect. But they are pretty interesting. So I have taken to posting those on Twitter sometimes, and even selling them on badges at "My Little Pony" conventions.

CRAIG BOX: Well, you'll be able to find Katharine not only at KubeCon, but at the next BronyCon? Is "Brony" a gendered term?

KATHARINE BERRY: "Brony" used to be a gendered term, but no longer is, because the alternatives were all bad.

ADAM GLICK: Fair enough. Katharine, thank you very much for joining us today.

KATHARINE BERRY: Thank you for having me.

ADAM GLICK: You can find Katharine on Twitter, @KatharineBerry, or on her website, kathar.in.

[MUSIC PLAYING]

ADAM GLICK: Thanks for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com, where you will find transcripts and show notes. Until next time, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]