#161 September 2, 2021

Unicron, with Daniel Megyesi

Hosts: Craig Box, Jimmy Moore

Adevinta is an online classified ads company, operating many local brands. Daniel Megyesi is a DevOps engineer at Adevinta and maintainer of their central big data and Machine Learning platform, Unicron. Learn why they wanted to replace Mesos, how they aligned their engineering efforts to do so, and the choices that had to be made to provide an easy experience for their data engineers.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box with my very special guest host, Jimmy Moore.

[MUSIC PLAYING]

CRAIG BOX: Let's compare picnic stories. Where did you get to this weekend, Jimmy?

JIMMY MOORE: [CHUCKLES] Well, we were talking about it, and I told you that I went this weekend to a park here in San Francisco, Dolores Park. It's pretty well known if you've been in the Bay Area. It has this outstanding view of the city, really sunny. People go there to kind of hang out. And people are still social distancing, wearing masks. But you put out a blanket, get some overpriced sandwiches from a local bodega, and drink some beers or whatever and have a good time.

So I spent my afternoon doing that. But I always bring my Sport-Brella because I'm a little allergic to the sun. So I sit under an umbrella in the sun outside with friends. It was great. Actually, it was really relaxing after a pretty active Saturday night. So it's just what I needed. How about you?

CRAIG BOX: Well, on the bank holiday Monday here, I went to a picnic at Buckingham Palace.

JIMMY MOORE: Oh, well, why are we wasting time with my story? [CHUCKLES] Tell me more, Craig.

CRAIG BOX: It's not as interesting as it sounds, in complete honesty. The Queen goes away for the summer, or what we shall call the summer. It was raining, as we'll get to later on.

JIMMY MOORE: So she doesn't bring out the biscuits?

CRAIG BOX: She does not, not here at least. She's off in her Balmoral estate up in Scotland.

JIMMY MOORE: I learned about that one on "The Crown." A very nice place, very rainy as well.

CRAIG BOX: Indeed. But while she is away, tradition dictates that Buckingham Palace is open to the public.

JIMMY MOORE: Hmm.

CRAIG BOX: Tradition dictates this because a few years back Windsor Castle caught fire and the Queen needed a way to raise some funds. They're renovating it this year, and with COVID as well, they are unable to have tours of the inside. So what they've done instead is they've opened the garden up for people to picnic. You can get yourself a ticket, you can go along with some friends. It was not as full as one might expect. I think they thought that people would all need to be 2 meters apart from each other, but we all came in groups, and so it did seem reasonably empty.

There were fun family activities. There were a couple of tent-like structures that people huddled under when the inevitable rain happened. We did a tour of the gardens, which let us see a couple of areas that the Queen apparently does like to wander around herself. But it does remind you that while it is a beautiful green space in the middle of London, there are parts where you can very clearly hear that on the other side of the beautiful wall, London is right there, with the road and the traffic and everything. I'd like to think the Queen probably spends time in different parts of the garden instead.

JIMMY MOORE: Hmm. A few things stand out to me about that story. Number one, the fact that the Queen needs to do a fundraiser, frankly. I guess you never really think about that and how the finances of the Royal Family work.

CRAIG BOX: It was political at the time because the Queen does not pay tax. And there were substantial repairs required to the collapsed roof of Windsor Castle. I want to say in 1992. But at that point, it was decided that rather than the country having to pay for the repair of what was privately owned by the Crown, the Queen would find a way to contribute to it.

JIMMY MOORE: Well, as I hear on many podcasts, ZipRecruiter has a good place for a job if she's looking for some extra income.

CRAIG BOX: I think she's got plenty of things to do. We actually did check out the Royal Circular to see what she had been up to, and she'd had a phone call with an ambassador thanking him for his service, even though she was on holiday.

JIMMY MOORE: Nice. Well, I find her presence in the world to be kind of a lovely thing. And it actually gives me great comfort whenever I see her.

CRAIG BOX: It is. And it's a thing that I think is important, not so much about the monarchy, but about the person herself. While certain countries in the Commonwealth have had thoughts about becoming republics over the years, they've kind of faded away as the Queen has gotten older. Because having that one person there who's been around for so long, who's met so many world leaders, has that element of consistency, it's been harder and harder to argue against it.

There may be an upswell at the point where there's a new monarch at some point in the future, but it's a nice thing to have, it seems. The instance we have of the monarch is quite nice at the moment. And as long as that continues, I can see the monarchy maintained in some form. Someone's got to maintain the garden at least so that we can go and have picnics in it every now and then.

JIMMY MOORE: Exactly. Plus, she's probably the person on the most money in the world, I'd say.

CRAIG BOX: And stamps.

JIMMY MOORE: And stamps.

CRAIG BOX: We're going to have to replace a lot of stamps.

JIMMY MOORE: That being said, shall we get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: Docker has changed its subscription plan, predominantly affecting Docker Desktop. The company is changing their free plan, which it is fair to assume a majority of its users were using, to a personal subscription and requiring a paid license for most business use cases. If your company has more than 250 staff or $10 million in revenue, you will now need to be on the Pro plan or higher, starting at $5 per user per month, to use Docker Desktop. The changes take effect immediately, with a grace period for license purchase until January the 31st.

JIMMY MOORE: Google announced a $10 billion investment over the next five years to strengthen cybersecurity, including expanding zero-trust programs, helping secure the software supply chain, and enhancing open-source security. They also pledged to provide $100 million to support third-party foundations, like OpenSSF, that manage open-source security priorities and help fix vulnerabilities. Google and other partners, including IBM and Microsoft, are collaborating on a new framework to improve the security and integrity of the technology supply chain. The work will be published by the US National Institute of Standards and Technology. To learn more about the software supply chain, check out episode 155 with Priya Wadhwa.

CRAIG BOX: The NGINX Ingress Controller has released version 1.0. The project, first created in 2016, was an early reference implementation of an ingress and has long been community maintained. At their Sprint Conference last week, the company NGINX announced that they would dedicate a full-time employee to the project and several engineers to work on the Kubernetes Gateway API. The new version is a breaking change as it uses the new V1 Ingress API and requires Kubernetes 1.19 or higher.

JIMMY MOORE: The OpenTelemetry project has ascended from the sandbox to become an incubation-stage project in the CNCF. OpenTelemetry was formed from the merger of the OpenCensus and OpenTracing projects in May 2019. More than 500 developers from 220 companies have contributed to the project. OpenTelemetry currently includes libraries for metrics and tracing, with logging on the roadmap.

CRAIG BOX: IBM Research has open-sourced Tornjak, with a "K," a management plane for SPIFFE identities managed by the SPIRE runtime. Its goals are to provide global visibility, auditability, and configuration and policy management for identities across a multi-cloud environment. The project is named Tornjak not just because it contains a "K," but because its other name, the Bosnian-Herzegovinian-Croatian shepherd dog, was too hard to type.

JIMMY MOORE: SUSE Rancher 2.6 has been released, the first big launch since Rancher's acquisition by SUSE last year. The new version offers full lifecycle management of clusters in GKE, EKS, and AKS, and a new user experience. It also promises a fortified security and compliance posture with new image scans, improved log traceability, and integration of SUSE Linux Enterprise Base Container Images. Commercial and open source versions are available with the same feature set.

CRAIG BOX: VMware has announced the Tanzu Application Platform at their SpringOne conference this week. They have called TAP the spiritual successor to Pivotal Cloud Foundry, perhaps as a way to distinguish it from Tanzu Application Service, which appeared to be the actual successor to Pivotal Cloud Foundry when it launched in early 2020. TAP lets you build on and deliver to any Kubernetes distribution, including GKE and the like, as well as VMware's Tanzu Kubernetes Grid. In its initial beta release, it includes foundational elements, including application templates, automated container builds, runtimes, API discovery and routing, and insights into running applications.

JIMMY MOORE: Finally, funding news. Rafay Systems, makers of an enterprise Kubernetes operations platform, has raised $25 million in a series B round. Grafana Labs has raised $220 million in a series C round that values the company at $3 billion. Check out our interview with Grafana Labs Co-founder Torkel Odegaard in episode 122.

CRAIG BOX: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Daniel Megyesi is a DevOps engineer at Adevinta, based in Barcelona. Welcome to the show, Daniel.

DANIEL MEGYESI: Hi. Thank you for having me.

CRAIG BOX: You joined Adevinta in 2019, but you've been blogging about Kubernetes since before then. How did you come across containers in cloud-native?

DANIEL MEGYESI: OK, this is quite a funny story. So it starts with an April Fools Day in, I think, 2015.

CRAIG BOX: OK.

DANIEL MEGYESI: So I was a sysadmin for a company called Liligo, a French flight metasearch company. We had about 40, 50 people in the office. And I was looking to prank them somehow for April Fool's. So I started digging on the internet, and I found some really cool scripts that would manipulate images — so either using ImageMagick on Linux, or different Perl and Python scripts that I found.

So, I thought, OK, let's set up a transparent reverse proxy on our FreeBSD firewall in the office, so I would redirect everyone's HTTP traffic to my rogue proxy and basically replace every image they loaded. Back at that time, HTTPS was not as popular or as widespread, which means that most of the traffic was not encrypted, so I was able to catch it and replace it.

So I quickly started to set up the scripts, and I realized I needed a lot of libraries and all kinds of runtime environments installed on my computer, which was not so fun. There was this cool new technology called Docker, which I was reading about at the time. OK, so why don't I try it? So I built the very first Docker image of my life for this project and started running it, testing it on my laptop. And I did some performance benchmarks, some tests, and I realized that if I start routing 40 employees' traffic to my laptop, that's going to be a problem. So I'm going to need to scale somehow.

I picked up three or four desktop PCs from storage and started to install them. OK, so let me find an Ubuntu live CD and quickly install these computers. Then I realized that, OK, maybe that's going to take a lot of time. And eventually I just want to run a simple Docker container, so why do I need all this hassle with the installations and everything? There was another project that I had read about at the time called CoreOS. Oh, that sounds like the perfect fit. It's extremely easy to install, it runs Docker containers, and it doesn't know anything else. So it does one job, but it does that job very well.

In a couple of hours, we had a cluster of four CoreOS machines running together and running my Docker image. And that is what I connected to our firewall. Eventually, half a year later, we ended up having CoreOS in production on 10 machines, and we migrated all of our frontend stack from classical bare-metal servers running Apache and NGINX to Docker containers on CoreOS.

CRAIG BOX: There's a lot to unpack there. That's a very overengineered solution for an April Fool's joke, so congratulations.

DANIEL MEGYESI: Thank you.

CRAIG BOX: I remember doing something like that many, many years ago using the Squid reverse proxy. But what was the transformation that you did to the images and how long did it take people to notice?

DANIEL MEGYESI: I flipped some images upside down, put some waves on them so you would at least get nausea. And my personal favorite was putting a watermark image of my boss on some of the pictures. Actually, which effect it would choose was random. A funny story with this: one of the girls came to me, a French girl, and she said, OK, I just opened the main page of "Le Monde," which is one of the biggest French newspapers, and I saw that the face of Jeremy, our boss, is on the main page.

CRAIG BOX: Wow.

DANIEL MEGYESI: How did you do that? I cannot imagine how you have the contacts at the editors of "Le Monde" to put Jeremy on the main page. How did you do that?

CRAIG BOX: How did they know it was you?

DANIEL MEGYESI: Obviously, they started suspecting it, especially the engineers: OK, maybe it's fishy that everyone's computer, with all the traffic, is kind of slow. Because eventually even these four machines were not enough. And when they saw the photo of our boss, it was obvious there was something fishy going on. I also put some extra text on some of the pictures, like, OK, hey, bastards, get back to work, and stuff like that.
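Daniel doesn't share the actual scripts, but as a rough illustration of the kind of transformation he describes, here is a minimal Python sketch using Pillow. The flip, the watermark, and the file names are all assumptions made for the example.

```python
# A rough sketch of the kind of image mangling described above: flip a picture
# upside down and stamp a watermark on it. File names are made up.
from PIL import Image, ImageOps

def prank_image(source_path: str, watermark_path: str, output_path: str) -> None:
    image = Image.open(source_path).convert("RGBA")
    watermark = Image.open(watermark_path).convert("RGBA")

    # Flip the picture upside down, as in the prank.
    flipped = ImageOps.flip(image)

    # Shrink the watermark, make it semi-transparent, and paste it in a corner.
    watermark = watermark.resize((image.width // 4, image.height // 4))
    watermark.putalpha(128)
    flipped.paste(
        watermark,
        (image.width - watermark.width, image.height - watermark.height),
        watermark,
    )

    flipped.convert("RGB").save(output_path, "JPEG")

prank_image("original.jpg", "boss.png", "mangled.jpg")
```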

CRAIG BOX: That's a very good way to learn a technology, I would say. Because quite often I'll say, well, hey, I'd like to learn something, but I don't have a problem that necessitates solving it in a way to learn something. And by coming across this crazy idea, effectively, you had a reason to teach yourself Docker and CoreOS and clustering and so on. So that seems like it worked out really well for you.

DANIEL MEGYESI: Exactly. This is a technology that I was already reading and learning about, but, as you mentioned, I was lacking a really interesting real-life project that I could do. So this was the perfect starting point.

CRAIG BOX: Adevinta isn't a name that I was familiar with, but it turns out I'm very familiar with some of your brands. What does the company do?

DANIEL MEGYESI: So our company is one of the biggest players in the classified ads market. So this means we own a couple of local brands that sell or help the user sell second-hand stuff. So second-hand stuff like your used bicycle, a used iPhone, or even a house.

CRAIG BOX: Big difference.

DANIEL MEGYESI: Yes, absolutely.

CRAIG BOX: Some of your brands that people may be familiar with-- Kijiji. They were the classified system when I lived in Canada. And over here in the UK there's Gumtree and Shpock.

DANIEL MEGYESI: Yes.

CRAIG BOX: They are two brands that were Adevinta brands.

DANIEL MEGYESI: Yes, unfortunately, we had to sell these with the recent merger with eBay Classifieds. But, yes, these were our brands.

CRAIG BOX: With businesses and brands in several different countries, how does Adevinta handle centralized services? Have you acquired companies and then you have to support the systems that they operate, or do you bring them on to some central stack that you maintain?

DANIEL MEGYESI: In general, we try to give them autonomy so they can work with the environment that they are used to. But also, the reason we are buying them is because we see potential in them. We detect synergies: OK, we could be more efficient if we start merging certain services. So just to give you an example, we have maybe five different brands where you can sell your house. One of the common problems for these platforms is how to upload photos of your house. We could have five different teams implementing an image uploader service. Each of them would handle, OK, how to upload, how to save, how to serve these images.

CRAIG BOX: How to insert the watermark of your boss?

DANIEL MEGYESI: Exactly. So instead, we have a central team in our headquarters in Barcelona who are dedicated to doing one thing, but doing that one thing very well. So they are maintaining and operating an in-house image serving and image uploading service, which is handling all of this image manipulation and image storage. When we acquire a new brand, or there are some interested teams who would like to onboard our system, then they gradually start shifting their traffic and using our centralized services.

So, actually, this is the same way we ended up implementing our big data and machine learning platform. To avoid the data scientists, in general, installing maybe Hadoop, or upgrading Spark versions, or messing with all of these things, instead there is a centralized product and solution that they can use and onboard to. And we don't force anyone to onboard onto these systems. Instead, we want them to come by themselves. So we want to make a product that is truly useful and truly covers the needs of the users, so they actually want to use us and not a competitor or an in-house solution.

CRAIG BOX: I can imagine that a business like yours, running at such a scale, will have a large need for analytics, which would be well served by some of these big data tools. What are some of the things that you and your brands use big data for?

DANIEL MEGYESI: As I mentioned, we have a bunch of marketplaces. Most of these are collecting user activity. So we want to see what the users are interested in so we can give them better recommendations, for example.

CRAIG BOX: You looked at an iPhone. Would you like to buy a house?

DANIEL MEGYESI: Something like that, exactly. Users usually like that.

CRAIG BOX: Do you then also feed information back to sellers? And then say, hey, these ads perform well?

DANIEL MEGYESI: Yes. Actually, our revenue model is mostly coming from external partners and agencies. So in the case of selling a house, most of our products are completely free for the end user. So if you just want to sell one house or rent an apartment, it will be completely free for you. And our revenue model is based on the professional advertisers, like real estate companies, who want to purchase maybe highlighted ads, or they want to get insights about the market; they want to know which types of houses are more popular. For example, now during the lockdown, there is an elevated need for apartments with a terrace, or bigger apartments, or more bedrooms. So these are the kinds of things that we can monetize.

CRAIG BOX: Now, previously, you were using Spark and Hadoop, on top of Mesos. They sound like technologies that were built to operate together. What was that old infrastructure that you were using and what were its bottlenecks?

DANIEL MEGYESI: We were using Mesos with Marathon and Chronos. This cluster was connecting to an AWS EMR cluster, a Hadoop cluster. So the users would be able to build their jobs and submit them to the Mesos cluster, and then it would schedule a task on the Chronos framework. Eventually it would launch the Spark driver and executors, process the job, come back with some result, and write it to an S3 bucket, for example.
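As a hedged illustration of the kind of batch job Daniel describes (Spark driver and executors process some data and write the result to S3), here is a minimal PySpark sketch. The bucket paths and column names are invented for the example, not Adevinta's.

```python
# A minimal sketch of a scheduled Spark batch job: read event data, aggregate
# it, and write the result back to S3. Paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-listing-views").getOrCreate()

# Read a day's worth of raw events from an S3 prefix.
events = spark.read.json("s3a://example-events/listing-views/2021-09-01/")

# Count views per listing.
daily_counts = events.groupBy("listing_id").agg(F.count("*").alias("views"))

# Write the aggregated result to another bucket as Parquet.
daily_counts.write.mode("overwrite").parquet(
    "s3a://example-results/listing-views/2021-09-01/"
)

spark.stop()
```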

CRAIG BOX: What was the interface to that? Were people writing queries in SQL? Or did you have a front end that was a little more interactive?

DANIEL MEGYESI: So actually they didn't need a front end for this. All of the development was on their local computers. So they would write the code. They would have some prepared common templates that they could use. And they would launch a tool that was built for them that would package and compile all of this code and build a Docker image. This Docker image would then be uploaded to our Docker registry and submitted to the Mesos cluster.

CRAIG BOX: Was there something that was missing in terms of what the users could do with a system that made you consider changing it? Or was it simply from the backend perspective?

DANIEL MEGYESI: The fact that the build was happening locally on their computers was already a big bottleneck and a problem. Also, the system started to get dated. We didn't have the capacity or the chance to upgrade and keep the components up to date. This meant that, for example, the users maybe wanted a better, newer Spark version or a newer Hadoop version, but then we had a lot of dependencies that would prevent us from upgrading.

And the final nail in the coffin was when they split our team due to business needs and reorganization. We had a team of, I don't know, maybe eight, nine engineers, and they split us into two teams. One of them would manage our existing Kubernetes platform, which was hosting web microservices, and the other team would manage the big data platform. When I joined, I joined this second team with the big data platform. And we had one engineer left who was familiar with the whole setup. So we had four other engineers who had absolutely no idea about the system. I would say it was partially business requirements and partially technical requirements that caused us to come up with a new idea.

CRAIG BOX: That new idea is a system that you call Unicron. Did I pronounce that right?

DANIEL MEGYESI: Yes.

CRAIG BOX: The logo is a unicorn, so I did wonder if the "O" should have been a zero.

DANIEL MEGYESI: Actually, we don't have an official logo still for our product, but we are working on it. The name is coming from uni-cron — so we were building a unified common platform where you can run your cron jobs.

CRAIG BOX: I'm still going to choose to believe that the name is just "unicorn," but leet. Who named it?

DANIEL MEGYESI: Funny thing. If you asked me half a year ago, I wouldn't be able to answer this question. But my colleagues told me it was actually me who came up with this idea. And I had no idea about it.

CRAIG BOX: Oh. Do you know how you came up with it? Did they tell you that as well?

DANIEL MEGYESI: We were brainstorming. This was before the pandemic time. So all of the planning and brainstorming was done in the office, with a lot of whiteboard planning. And one of the topics was, OK, what kind of name should we come up with? We had maybe 10 different candidates. Probably one of these was coming from me. And eventually this one was the most popular.

CRAIG BOX: Now, your team is going to build a new platform to replace something which presumably is responsible for a large part of the revenue of the company, indirectly at least. How do you get the buy-in from the rest of the business, that this is a change that's worth making?

DANIEL MEGYESI: We had the different teams, the different marketplaces, internal users who wanted to have maybe better features, newer library versions, newer Hadoop versions, or maybe just in general a more comfortable user experience. So there was already some kind of pressure coming from these teams that, oh, hey, this product kind of feels abandoned. Can you do something about it?

So we as a team started to think about, OK, what could we do? And, as I mentioned, we had four out of five people who didn't know anything about the platform. So we were not super eager to learn about something that had years of legacy, years of infrastructure built up, which we would have to be on call for very soon. Because, obviously, we cannot put it on just one guy to be responsible for it 24/7. That was also a motivation for us. So we started brainstorming and asking the management, OK, is this something that we could maybe leverage? And will you give us a chance to come up with something better? Because we have a lot of cool ideas, so why don't we implement them?

CRAIG BOX: And then at that point, were you operating a little bit like a startup? You had a small team of founders and you needed to decide on tech stack and how the people on the teams would interact, not only with themselves, but with your external customers?

DANIEL MEGYESI: Exactly. Luckily, we were quite a small team and a fresh team with a lot of new faces. So we had to get used to working together. Still, we had to operate the old system, of course, in parallel. But, meanwhile, this was even more motivation for us to get up to speed and start planning the new system. And they gave us quite a blank sheet. We could do whatever we wanted. Of course, we still had to coordinate with project managers, make sure the product would fit the requirements and the business needs, and also that it would be delivered on time. But besides that, we got an absolutely blank sheet to do whatever we wanted.

CRAIG BOX: From an engineering perspective, how did you work on how you would collaborate with one another?

DANIEL MEGYESI: We came up with an idea of setting up so-called discussions. These are similar to maybe RFCs, but in a smaller version. For a small team you don't need some complex bureaucratic process to discuss ideas. Sometimes we could maybe just have a coffee in the canteen, and while having our coffee, we would already come up with a partial solution for a problem. So this is something that we try to formalize as kind of a democratic process.

We would set up a meeting every Monday afternoon for one hour where we would discuss a topic that was proposed by someone. And that person would ask some questions, propose a problem, propose some solutions, and ask for feedback. And everyone would share their ideas, whatever came to their minds. And then we would debate it and eventually come to a conclusion, shake hands on it, and then actually document it in our internal knowledge base. So if we have any newcomers, or later we need to search back for what the agreement was about, then we have everything documented. Since we also have these discussions in GitHub, it means we have all the details that people can search back through and try to understand the reasoning why we made a decision.

CRAIG BOX: What were some of the more important or large decisions you needed to make at the outset of the project?

DANIEL MEGYESI: One of the biggest decisions was actually which Kubernetes solution to choose. We were quite sure that we wanted to go with Kubernetes for various reasons. One of the biggest challenges that we faced, and we actually debated it for weeks, was whether we should deploy our own solution and have a self-hosted Kubernetes control plane. Because this is what our team had done, before the split, for the microservices platform.

Or should we go with a packaged solution, for example, EKS in AWS. There was also one other project that we were testing and actually benchmarking, a project called Gardener. This is a Kubernetes-in-Kubernetes solution, which is basically a multitenant solution for hosting Kubernetes control planes inside a master control plane. So this was one of the projects that we tested and tried. But, eventually, we found out that this would be way too complex. And, to be honest, just paying $70 a month for EKS to do everything for us, the etcd cluster, the Kubernetes control plane, everything, that is a no-brainer. We couldn't find any good reason why we would do it ourselves.

CRAIG BOX: What are the things that are missing from a Kubernetes distro?

DANIEL MEGYESI: So one common misconception about Kubernetes is, OK, you just take it off the shelf, you install a cluster, and you are ready to run your workload. There are new, emerging projects, like GKE Autopilot, which are tackling this problem, making sure that customers can onboard these clusters more easily and that the necessary components get installed. But some of this stuff is still missing when you deploy a new cluster.

For example, there is no cluster autoscaler running at all in these clusters, depending on the provider. Or you wouldn't have the vertical pod autoscaler installed, or if it's installed, it's not tuned correctly. Same with CoreDNS or any DNS service: you need to make sure that you have proportional autoscaling based on the number of nodes, and you need to configure this, or customize the CNI network plugins.

The other thing is the logging setup. For example, in our case, we wanted to ship our logs to S3 buckets. This would be a very low-effort task to do; there are basically no bottlenecks except Amazon's capacity to ingest raw logs. And later we could query them with Athena, or even with Spark jobs if you want, or upload them to Loki in Grafana Cloud. And also monitoring and observability. This is a big thing that I think is still missing from many providers.

So for this, we also had to come up with some custom solutions, but then also some other smaller pieces, like, OK, how to handle graceful termination for the nodes. If you want to shut down a node or scale down, you need to evict some pods, but make sure they are going to wait for a task to finish, or send a graceful termination request. So these are things that are not handled correctly or fully in Kubernetes today. For these, you still need to install a lot of small pieces.

CRAIG BOX: It almost sounds like you need a test suite that says, here are not just the things that make a cluster Kubernetes compliant and certified, but here are the things that you actually need. It needs to be able to have the vertical pod autoscaler installed. It needs to have these things. Some vendors' systems will have some of them installed out of the box, but others will not. And it's like you need to have a way of testing that and being able to say, all right, here are the things you actually need to run Kubernetes, and you may have to install some of them yourself.

DANIEL MEGYESI: Maybe it's available, and in that case, please let us know, because we would definitely be interested in hearing about such a project.
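The "test suite for what you actually need" that Craig describes could be approximated with a small script. Here is a minimal sketch using the official Kubernetes Python client that looks for a few well-known add-on Deployments in kube-system; the add-on names are assumptions and vary by distribution and install method.

```python
# A rough sketch of an add-on audit: look for a few well-known Deployments in
# kube-system and report what is missing. Names are examples only; they differ
# between distributions, so treat the list as something to adapt.
from kubernetes import client, config

EXPECTED_ADDONS = [
    "cluster-autoscaler",
    "vpa-recommender",
    "coredns-autoscaler",  # cluster-proportional autoscaling for DNS
    "metrics-server",
]

def audit_addons(namespace: str = "kube-system") -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    deployed = {d.metadata.name for d in apps.list_namespaced_deployment(namespace).items}

    for addon in EXPECTED_ADDONS:
        # Substring match, since real deployment names often carry prefixes.
        status = "present" if any(addon in name for name in deployed) else "MISSING"
        print(f"{addon}: {status}")

if __name__ == "__main__":
    audit_addons()
```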

CRAIG BOX: With all of the different brands and business units that you have, where was the decision about multitenancy? You can have one big cluster and you can, obviously, get better utilization, but then you're sharing people's workloads on that cluster. Or you can have individual clusters per business units or perhaps individual clusters per workload. There's a continuum of choices available to you. How did you decide where to sit on that line?

DANIEL MEGYESI: So this is exactly what we have done before. We had one big multitenant cluster with Mesos. So we had all the different teams with all of their different workloads running in the same infrastructure. So we had small teams who would run maybe every day one smaller job, which would take 20 minutes and just take 2 gigabytes of RAM. And then we would have other teams who are running once a week a giant job, which is taking maybe one entire day to compute and is using terabytes of RAM.

So these different sizes and types of workloads could affect each other very much. We tried, of course, to set up all the isolation we were able to, but still that was not enough. It was a single point of failure as well. So we were afraid to touch this cluster, to do upgrades on it or modify anything, because we knew that the business relied on it a lot. If something went wrong, we would have a big problem, because nobody in the company could work, none of the data engineers.

CRAIG BOX: So then what management tools and processes did you put in place to have your small team be able to manage a large number of clusters?

DANIEL MEGYESI: We got inspired a lot by Zalando. They made some really good presentations at the Google Cloud conference and later at AWS re:Invent about how they manage their clusters. We set up a similar environment where we would have different channels for the clusters. So we would have a channel where we can deploy dev clusters, then alpha clusters, beta clusters, and stable clusters. We would set up a continuous deployment pipeline for all of the components we have — so the cluster autoscaler deployment, or a specific component, a DNS service, anything.

We would deploy it first to the dev clusters, which are the personal clusters of the engineers in my team. If this is working fine, then we would merge it to the next channel, the next tier, which is the alpha channel. The alpha channel is our production, so production for our team. This is where we run our management stuff; this is where we run the pipelines that are doing the deployments. If it also passes this channel, then it would be deployed and rolled out to the beta channel, which is the dev environment for our customers. Most of our clusters live in the beta channel. We would give the customers a few days to test it and make sure everything works fine. And if it works fine, then later we would roll out to the stable channel, which is hosting the pre-production and production environments.

So it was important to have staging and production always on the same version, to make sure there is no divergence. And later we added some additional testing. Even before merging to the dev channel, we would deploy so-called PR clusters. Each pull request that passed the different tests we had for each repo would eventually end up in this master repo, where we host a list of the components that we know we want to deploy to our clusters. And we would spin up a new, temporary cluster for this, to make sure we are testing from scratch, to make sure that deploying a totally new cluster with no previous life, no previous mess in the configs or anything, works fine.

CRAIG BOX: You have a lot of applications that you need to run to manage and maintain your clusters, and they will run in containers as well. Are you able to use the same tooling to manage and maintain the container applications that run the platform as your users do to deploy their workload containers?

DANIEL MEGYESI: It's similar and also not similar, in a way. Because we are using the GitOps methodology for our users. Previously, as I mentioned, they were building from their local computers and pushing all of their stuff. This time we are kind of forcing them to always use remote infrastructure. So they would write code, commit it to a git repo, and this would later create a deployment manifest as a commit to their GitOps repository, which is then attached to a service called Argo CD that we are running in each cluster. Argo CD is a tool that is pulling a git repo and making sure that everything is deployed according to the specifications in the repo.

CRAIG BOX: You could, though, extend Argo CD to do the same for clusters using something like the Cluster API?

DANIEL MEGYESI: Actually, we do that. We are deploying most of our clusters with pipelines running in Argo Workflows. And then we have some optional components that are being deployed by Argo CD. We call this the Unicron Store. This is basically like an application store where you can decide to install some optional components. This is something we first used for ourselves.

So we would install, for example, the Spark operator or Luigi or, later, the machine learning components that some users needed and some did not. This is not something that we always want to deploy, but only on certain occasions. And later, we added some extra components where the users can actually decide what they want to install. So maybe some users wanted to use Datadog to monitor their jobs. They would be able to install it in a self-service way, and then Argo CD would take care of the deployment.

CRAIG BOX: Once you've defined the things that you want in your GitOps manifests, are you using Helm or operators or something like that to install the software as well? Is there room for overlap between those components?

DANIEL MEGYESI: Yes, we actually use Helm for most of our deployments. And Argo CD supports not just Helm 2 and 3, but also Kustomize, or even just simple Kubernetes YAML manifests. So we are combining all of these, each time using whichever is the most suitable for the task.
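As a rough sketch of what attaching a Helm chart to Argo CD can look like, here is an Application custom resource created through the Kubernetes Python client. The repository URL, chart, versions, and namespaces are placeholders rather than Unicron's actual configuration.

```python
# A minimal sketch of registering a Helm-based Argo CD Application via the
# Kubernetes API. Repo URL, chart name, and namespaces are placeholders.
from kubernetes import client, config

application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "spark-operator", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://example.com/helm-charts",  # illustrative repo
            "chart": "spark-operator",
            "targetRevision": "1.1.0",
            "helm": {"values": "replicas: 1\n"},  # illustrative values
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "spark-operator",
        },
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argocd",
    plural="applications",
    body=application,
)
```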

CRAIG BOX: You've been honest in your writing about incurring technical debt throughout this process. Can you give me an example of some of the things that came up as technical debt and how you've gone about resolving them?

DANIEL MEGYESI: Yes. One example is how we implemented Prometheus as the monitoring stack on top of our clusters. Initially, we implemented the simplest, easiest solution we could, basically just a proof of concept to see if it works or not. And then, of course, as it usually happens, you end up using it in more and more stable or production-grade workflows. And that means we know this is something that we must rely on, and it must work very reliably. So we identified these shortcuts. What we usually do is try to identify tech debt at the time of creating it. We have an agreement, a discussion about this, that it's OK to take on tech debt, because sometimes we need to take a shortcut. But then we make sure it is not forgotten.

We set up a GitHub issue in the corresponding repository. We put a label on it to make sure we can find it in the future. And every quarter we would set up a one or two-hour meeting where we would go through all of the tech debt issues we have ever created and prioritize them, trying to decide which is the most important, which is the one that is going to burn us in the near future. Or, if we already burned ourselves, we identify the tech debt after it was created, because we didn't create it on purpose, we made a mistake. So we would prioritize these things and agree that, OK, these are the three or four issues that we are going to solve in the next quarter. We would have a dedicated two weeks per quarter where we would sit down and only work on fixing tech debt.
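The quarterly triage Daniel describes only needs labelled issues. As a small, hypothetical sketch, this script pulls every open issue carrying a tech-debt label from the GitHub REST API; the organization, repository, label, and token are illustrative.

```python
# A small sketch of the quarterly triage described above: list every open
# issue carrying a tech-debt label so the team can prioritize it.
# Organization, repository, label, and token are placeholders.
import requests

def open_tech_debt(owner: str, repo: str, token: str, label: str = "tech-debt"):
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        params={"labels": label, "state": "open", "per_page": 100},
        headers={"Authorization": f"token {token}"},
    )
    response.raise_for_status()
    return [(issue["number"], issue["title"]) for issue in response.json()]

for number, title in open_tech_debt("example-org", "unicron-components", "ghp_..."):
    print(f"#{number} {title}")
```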

CRAIG BOX: Is that a hard sell to your business stakeholders to tell them that, yes, there are new features that you want, but for two weeks we're going to effectively show no progress from the outside perspective?

DANIEL MEGYESI: Actually, I was quite surprised, personally. This was my first workplace where they would encourage us to do our best for reliability. So, actually, we spent two entire quarters building this platform before we even started welcoming customers in production. That was already a good sign at the beginning. And also later we realized, OK, we have some issues that are not allowing us to scale. We would occasionally get one month or even a quarter to work on this. Of course, we need to keep the lights on. We need to still help our customers with their problems. But besides that, they allowed us to spend time on reliability, because they understand, especially now with the eBay acquisition, that maybe today we have 60 different Kubernetes clusters operated by only four on-call engineers, and this works fine.

We have maybe one page per month or two pages per month. And even those are sometimes false alarms, or just a small issue that can be resolved very quickly. So this is something that made us very proud. Honestly, it's quite unbelievable, because none of my previous jobs were ever this peaceful. We need to make sure that when we start to scale-- so with these new acquisitions we have significantly more companies onboarding, more brands and marketplaces onboarding onto our platform, hopefully-- this is going to be sustainable and it's going to scale well. The business and the managers understand this, and they encourage us to proactively try to find these hotspots that could be a problem. And they let us plan, in our OKR planning sessions for the next quarter, to tackle these one by one.

CRAIG BOX: You mentioned a few foreign concepts before, like everyone standing around the whiteboard, or all going and getting a coffee together, which kind of dates the period in which this work was done to the before times. As people started going into lockdown, you had customers that were already running on the system in production. What was it like to bring people onto it remotely, when they weren't all able to be in the same room anymore?

DANIEL MEGYESI: Yeah, this was definitely challenging. Previously, we would hold workshops in person. We would book a meeting room where you can sit down with the engineers, and we can explain to them how the system works, do a demo, explain the basic concepts. But then, anyway, we started to onboard some customers from different countries. We had our brand Yapo onboarding from Chile. Then we had Leboncoin from France. These people were obviously remote anyway, so there was no chance to onboard them in person. So even before the lockdown, we already had some experience with how to onboard these people. Doing it remotely came quite naturally.

CRAIG BOX: Now you have the core of the system, which is able to run Spark jobs and any other big data related things. You've also been building out machine learning capabilities on top of that. What have you used there?

DANIEL MEGYESI: Here we are using Kubeflow, mostly, and some in-house components. So, for example, we have squads in our team. Eventually we started growing; we had more than 10 engineers, and we decided to split our team into two squads. One squad would work on the core platform-- so making sure that Kubernetes is up and running, it's up to date, always with the latest versions, the latest components, and everything is working fine.

And then we have the so-called data squad, who are developing the customer-facing interfaces. They are responsible for managing and maintaining Kubeflow. They are the ones who debug issues if, for example, Spark or Luigi is not working fine. And they also develop some custom tooling to make the life of our customers and users easier.

So the idea is that we don't want any data engineer or data scientist to have to know how the system works in the background. They don't need to know what kubectl is, or even have Amazon credentials, maybe. They can just use GitOps and some UIs that we developed for them.

One of these examples is a UI we developed for launching Jupyter notebooks for analysts. Our team wanted to learn Rust. They already had some experience with it, but they wanted to leverage this knowledge and find an actual production use case where they could work on it. So they got the time and the opportunity to spend enough effort to make sure that this is a stable product, and at the same time work on something that they are passionate about or find exciting to learn.

CRAIG BOX: Find them, for all instances, in the corner and let them test out their Rust on April Fool's Day?

DANIEL MEGYESI: Yes, something like that.

CRAIG BOX: What's next for this platform? What capabilities would you like to build on top of it now that it's stable?

DANIEL MEGYESI: Right now we are focusing all of our efforts on machine learning. Actually, this is a big data and machine learning platform, which means we don't want to receive live traffic on these clusters. So what we are going to do right now is run the training jobs on Unicron, on our system, and then the actual model serving part would be done by our microservices platform, so that each platform is the best at what it does.

So we want to make sure we are running these preemptible jobs that can be terminated at any time, and there will be no problem if we replace a cluster or redeploy it in the middle of the night or during business hours. And then we have the other service, managed by a different team, who are running live production websites. That is the platform that would host the machine learning models. And now we are working on the connection between these two platforms.

CRAIG BOX: You've described yourself as an occasional tech blogger. What occasions cause you to put proverbial pen to paper?

DANIEL MEGYESI: In the past, I faced some challenges or technical questions that I couldn't find a solution to on the internet. So, for example, we were trying to deploy Let's Encrypt wildcard certificates for our internal websites. At the time, this was quite a new thing; it had been announced maybe one or two months before that Let's Encrypt now supported wildcard certificates. To issue the certificates, you would need to have an HTTP-based validation or a TXT record-based validation in a public place.

To be able to issue valid, working certificates for internal websites, you would need to expose how your internal domain structure is built up. This was one of the challenges that I wanted to solve in a secure way. And I realized I simply could not find any solution on the internet, nothing in forums, on Stack Overflow, anywhere. So I started digging into it: OK, how can I solve this? And I thought, OK, why don't I write an article about this, because maybe this is something that other people will also find interesting and useful. I don't write a lot of articles, but when I do, they are about these kinds of topics, things that are hard to find or that I couldn't find before.
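For context on why the internal domain structure leaks: the TXT record validation Daniel mentions is the ACME DNS-01 challenge (RFC 8555), where the proof is published in public DNS at _acme-challenge.<name>. Here is a minimal sketch of how that record value is derived, with placeholder values.

```python
# A sketch of the DNS-01 validation mentioned above (RFC 8555): the CA gives
# you a token, you combine it with your ACME account key thumbprint, and you
# publish the digest as a TXT record at _acme-challenge.<domain> in public DNS,
# which is why internal host names become visible. Values are placeholders.
import base64
import hashlib

def dns01_txt_value(token: str, account_key_thumbprint: str) -> str:
    key_authorization = f"{token}.{account_key_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

domain = "internal-tool.corp.example.com"
value = dns01_txt_value("example-token", "example-thumbprint")
print(f'_acme-challenge.{domain}. 300 IN TXT "{value}"')
```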

One similar example was the vertical pod autoscaler. I realized about a year ago or so that there is no good documentation for this in Kubernetes. There is some basic stuff that you can read about how it works, but then nobody actually goes into the details: how to tune it, how the requests and limits ratio works, what happens if you have only one replica and it turns out the VPA needs at least two replicas running to be able to scale.

So these kinds of things I couldn't find in any documentation. That also inspired me to write an article. And half a year later, my colleague came to me with a question about the VPA. I was not sure, so he went off and started googling, and he sent me a link. Hey, look, I found this article. This looks pretty cool. This guy is explaining how to tune it to make sure it works for our use cases. I looked at the link and said, hey, did you scroll up? Did you see the author? It was actually my article, but I had already forgotten about it completely. So once I write something down, it just leaves my brain and I don't even know it exists anymore.
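As a hedged sketch of the kind of VerticalPodAutoscaler object such an article covers, here is a manifest built in Python and printed as YAML. The target, namespace, and resource bounds are illustrative, not a recommendation.

```python
# A sketch of a VerticalPodAutoscaler object touching the tuning points above.
# Note: by default the VPA updater will not evict pods of a workload with fewer
# than two replicas, which is the single-replica gotcha mentioned here.
# Names and resource bounds are illustrative.
import yaml

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "example-worker", "namespace": "default"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "example-worker"},
        "updatePolicy": {"updateMode": "Auto"},
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "*",
                    "minAllowed": {"cpu": "100m", "memory": "256Mi"},
                    "maxAllowed": {"cpu": "2", "memory": "4Gi"},
                    # RequestsAndLimits scales limits proportionally to requests,
                    # preserving the original request:limit ratio.
                    "controlledValues": "RequestsAndLimits",
                }
            ]
        },
    },
}

print(yaml.safe_dump(vpa, sort_keys=False))
```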

CRAIG BOX: I hear you completely. I've done that on many occasions. And the worst thing is that you ask a question about how to do something on the forum thread or something like that, and then you go back and google for the same problem later on, and the only reference you can find on the internet is you having previously asked about it, and it still isn't solved.

Many, many years ago, I wrote an article on the steps to connect a couple of networking devices to Linux machines. And I would get emails about that on and off for years afterwards. And I just had a quick search. It's still on the first page of Google hits if you know the particular device names to search for.

DANIEL MEGYESI: Yeah, I know the feeling.

CRAIG BOX: Finally, I understand you're building a 1:8 replica of the DeLorean from the "Back to the Future" movies. And my question to you is, why not a 1:1 replica?

DANIEL MEGYESI: That would be quite fun. But even the 1:8 replica is taking up more space in my home than I would like to provide for it. It started as a lockdown project in March 2020. Actually, I wanted to build a life-size Terminator replica. I found some really cool modeling communities who are doing this. But then I realized this is not available in Europe. You can do it in the UK, and mostly in the US and Mexico and Australia, but you just couldn't find the issues and the components in Europe, or it would be extremely difficult.

So I started checking these modeling communities, and I bumped into this Eaglemoss DeLorean group, where they are building this 1:8 replica of the DeLorean. I was like, OK, this is actually one of my favorite movies, so why don't I start building this? I onboarded on this project. Sometimes I regretted it because it could be quite cumbersome. So it's built up from many, many small components, mostly metal and plastic pieces.

Right now my car is about 10 kilos, or 20 pounds, and about 60 centimeters, like 2 feet, long. So it's quite a big model already. And you can do all kinds of things with it. You can open the doors. You can turn on the lights. There is a real flux capacitor inside, you can have the time circuits, everything. So this is a very good replica. But it also takes a lot of time. Eventually, though, I think this is what kept me sane during lockdown.

CRAIG BOX: How far are you from complete?

DANIEL MEGYESI: Right now I already own all the components. At the beginning of the lockdown, there were a lot of issues with getting my hands on the components. I tried to buy most of them in Spain, but some of them were only available in the UK or in the US or in Australia. Some of them I had to buy from Japan, from a Japanese newspaper vendor who only had a Japanese web page available. And they could only deliver to Japanese addresses, so I had to find all kinds of solutions to buy them. Some stuff I actually bought on our own second-hand marketplaces in different countries.

Eventually, it took me a lot of time and effort to collect these pieces. And now I'm at the point where I have everything. But I would need to disassemble the car again, because there were maybe some pieces missing from the middle of the car. I kept building anyway because I was really curious to see how it would look. So now I'm at a point where I need to disassemble everything.

It will take maybe two days, and I'll have to remember all 200 screws, where I pulled each one from and where I put it, then install the new components and put it back together again. Maybe it will only be one or two weeks of work, but this is something I keep postponing for months and months now, because the amount of work just scares me, and it looks so nice right now.

CRAIG BOX: Well, if you do need a larger house in order to store it, I can recommend a website you can go to to find one.

DANIEL MEGYESI: Yes, I think if I ask my company, we would come up with some ideas as well.

CRAIG BOX: Thank you very much for joining us today, Daniel.

DANIEL MEGYESI: Thank you for having me.

CRAIG BOX: You can find Daniel's write-up of the Unicron story, his personal blog, and links to Adevinta in the show notes.

[MUSIC PLAYING]

CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter @kubernetespod, or you can reach us by email at kubernetespodcast@google.com.

JIMMY MOORE: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes, as well as links to subscribe. Until next time, take care.

CRAIG BOX: See you later.

[MUSIC PLAYING]