#62 July 16, 2019

Large Hadron Kubernetes at CERN, with Ricardo Rocha, Lukas Heinrich, and Clemens Lange

Hosts: Craig Box, Adam Glick

Back in 2012, CERN announced one of its most important achievements: the discovery of the Higgs boson. This work led to the 2013 Nobel Prize in Physics. Ricardo Rocha, Lukas Heinrich, and Clemens Lange of CERN redid the data analysis on top of Kubernetes this year, which Ricardo and Lukas demonstrated in a keynote at KubeCon EU. All three join Adam and Craig for a short physics lesson and a view into computing at the largest scale, for particles at the smallest.

Do you have something cool to share? Some questions? Let us know:

Chatter of the week

News of the week

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

CRAIG BOX: If you are listening to this on the day of release, it is the 50th anniversary of the moon takeoff, which is like the moon landing, but happened four days earlier.

ADAM GLICK: Quite a momentous thing. I guess you weren't alive for that, were you?

CRAIG BOX: No. I don't think you were either, in fairness.

ADAM GLICK: No, no. But it certainly inspired a lot of folks, my brother included, in an earlier generation, to get into science and technology. And you can still see folks around the office here in Seattle, and I believe in San Francisco as well, who have the rocket model that they've built.

CRAIG BOX: It's very popular, the LEGO Saturn V rocket. A number of people I know have built that.

ADAM GLICK: Indeed. Speaking of people who have built things for that, I saw a wonderful float this weekend. There was the Milk Box Derby, where a bunch of people build various floats that are propped up by milk containers and then race them at the lake. And someone had built one with the Apollo rocket on top of it as an interesting float. Probably not the most aerodynamic thing to build, but it was great to see, and certainly very creative.

CRAIG BOX: I saw those pictures. In my mind, a Milk Box Derby is something that you would do in the old milk crates. You used to get milk in glass bottles and they'd come in crates, and people would put those crates on wheels and race them down the street. But this is the modern world where the milk comes in plastic containers, and now we're actually making rafts out of our milk cartons, is that right?

ADAM GLICK: Indeed. It was a fair sized gathering with lots of really interesting floats and boats that you can find pictures of online from the Green Lake Milk Box Derby.

CRAIG BOX: You must have to head off to the airport soon because I understand you're in Chicago this week.

ADAM GLICK: I am. And if anyone is listening and will be at our Chicago summit or the Kubernetes pre-day, please feel free to stop by-- I'll make sure I have some stickers with me-- and say hi. But we'll be out there talking to folks about Kubernetes and getting a chance to meet some of the folks in the Windy City.

CRAIG BOX: Brilliant.

ADAM GLICK: I do hear that there is a birthday coming up.

CRAIG BOX: Today.

ADAM GLICK: Indeed. Happy birthday to Craig's mom, right? Your mom?

CRAIG BOX: Yes. Happy birthday, Mum. This is the only part of the show that she listens to, so she'll be chuffed.

ADAM GLICK: Wonderful. Want to get to the news?

CRAIG BOX: Let's get to the news.

[MUSIC PLAYING]

CRAIG BOX: IBM announced it closed its acquisition of Red Hat last week, purchasing all the Red Hat shares at a price of $34 billion. While talking up the advantages of the union, both companies reiterated Red Hat's independence and neutrality, including that they remain committed to partnerships with other cloud vendors.

ADAM GLICK: HashiCorp hosted the European HashiConf last week and announced Consul 1.6, bringing Layer 7 traffic management to their Consul Connect service mesh. Similar to Istio, Layer 7 features in Consul Connect are powered by the Envoy proxy server. This release adds the ability to direct traffic based on HTTP path. It also adds mesh gateways to allow connection between two locations on different networks.

CRAIG BOX: A common question regarding service mesh is how much additional latency will it add to your application. Google Cloud's Megan O'Keefe has published a blog on how to benchmark Istio for production use. Her key findings are that users must install the software correctly, focus on the difference between the application with and without the mesh, and focus on the utilization of the data plane, as that's the thing that will end up using the most resources.

Additionally, users should understand that the trade-off of using Mixer, a centralized engine which lets you add adapters for policy and telemetry, is that it can add performance cost. Future work in Istio will move the Mixer functionality into the Envoy proxy and remove this overhead.

ADAM GLICK: IPv6 is coming to Kubernetes. Support landed this week in kind, or Kubernetes in Docker, which means that IPv6 could be added to CI conformance tests. Such tests were one of the last blockers for beta, and now IPv6 support is being proposed for inclusion in the upcoming 1.16 release.

CRAIG BOX: Google Cloud's training division announced an Architecting with Google Kubernetes Engine specialization. It is made up of four courses, which include one on GCP fundamentals and three related to GKE-- foundations, workloads, and running in production. Attendees of a webinar on July 26 can get one month's free access to the training, or it can be purchased through Coursera.

ADAM GLICK: Weaveworks celebrated their fifth birthday this week with the release of Weave Ignite, a tool for automating deployments of Firecracker microVMs, a project announced by Amazon last November. Ignite works a little bit like Docker, but with Weave's GitOps model included. You commit to a repo with what you want, and an agent that watches that repo will actuate it for you. The images deployed are OCI or Docker images, but each will run with its own kernel.
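For listeners less familiar with the GitOps model described above, here is a minimal sketch of that watch-and-actuate loop in Python. The repo path, the vms.json manifest, and the runtime object with list_vms/start_vm/stop_vm methods are hypothetical stand-ins for illustration, not Ignite's actual implementation.

```python
# Minimal sketch of a GitOps reconcile loop: desired state is committed to a
# Git repo; an agent keeps comparing it with what is actually running.
# REPO_DIR, vms.json, and the `runtime` object are hypothetical stand-ins.
import json
import subprocess
import time

REPO_DIR = "/var/lib/gitops/desired-state"  # hypothetical local clone of the repo


def sync_repo():
    # Fetch the latest committed desired state.
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)


def desired_vms():
    # Hypothetical manifest: a JSON file listing the microVMs that should exist.
    with open(f"{REPO_DIR}/vms.json") as f:
        return set(json.load(f)["vms"])


def reconcile(runtime):
    # Start anything declared but missing; stop anything running but undeclared.
    want = desired_vms()
    have = set(runtime.list_vms())   # hypothetical runtime API
    for name in want - have:
        runtime.start_vm(name)
    for name in have - want:
        runtime.stop_vm(name)


def agent_loop(runtime, interval=30):
    # The agent that "watches the repo and actuates it for you".
    while True:
        sync_repo()
        reconcile(runtime)
        time.sleep(interval)
```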

CRAIG BOX: Tekton Pipelines, introduced by Kim Lewandowski in episode 47, has made it to Red Hat's platform as OpenShift Pipelines, now in developer preview. The release allows integration with many Tekton and Kubernetes-native tools and aims to address problems users were having with the Jenkins-based platform in previous releases.

ADAM GLICK: In episode 57, Darren Shepherd from Rancher Labs hinted at a new project called k3v. He has just released a proof of concept of what is a virtual control plane for Kubernetes. k3v uses the compute, storage, and networking resources of a real Kubernetes cluster, but runs multiple control planes for different tenants. This allows you to take one physical Kubernetes cluster and chop it up into smaller virtual clusters. Given the number of objects installed into a modern Kubernetes cluster and the fact that hard multi-tenancy was never a design concern, this model may help provide stronger multi-tenancy by allowing each tenant to have their own control plane.

k3v continues Rancher's unpronounceable naming scheme, so from here on out we're going to call it Kev, just like your buddy who is a builder.

CRAIG BOX: Finally, if you can't trust your metrics, what can you trust? Lawrence Jones from GoCardless realized when debugging an incident that it was not possible to accurately establish what his async workers were doing and when, due to bias in the metrics implementation. He shows how the way Prometheus scrapes metrics can lead to over- or under-representation, and gives an example of how he fixed his Ruby code base to solve this problem. The fix has been spun out into a library, which you can now install as a Ruby gem.
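The GoCardless fix itself is a Ruby gem, but the underlying issue is language-agnostic. Below is a rough Python sketch of the pattern using the prometheus_client package, with a hypothetical job object: if busy time is only recorded when a job finishes, scrapes that land mid-job miss it entirely, skewing the picture toward short jobs; also exporting the in-flight work reduces that bias.

```python
# Sketch of the scrape-bias issue, assuming a hypothetical `job` object.
# Not the GoCardless library -- just the shape of the problem and one mitigation.
import time
from prometheus_client import Counter, Gauge, start_http_server

# Biased on its own: busy time is only credited when a job finishes, so a job
# spanning many scrape intervals contributes nothing until it completes.
busy_seconds_total = Counter(
    "worker_busy_seconds_total", "Seconds spent processing jobs")

# Mitigation: also export how long the current job has been running, so a
# scrape that lands mid-job still sees the in-flight work.
current_job_seconds = Gauge(
    "worker_current_job_seconds", "Runtime of the job currently in progress")


def run_job(job):
    started = time.time()
    try:
        while not job.done():          # hypothetical job interface
            job.step()
            current_job_seconds.set(time.time() - started)
    finally:
        current_job_seconds.set(0)
        busy_seconds_total.inc(time.time() - started)


if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus to scrape
```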

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

CRAIG BOX: Our guests today are a computer scientist and two physicists from CERN, the European Organization for Nuclear Research. Ricardo Rocha is a software engineer on the Cloud team, working on networking and container deployments. He helped develop and deploy several components of the worldwide LHC computing grid, a network of 200 sites collaborating around the world, helping to analyze the Large Hadron Collider data.

Lukas Heinrich is a particle physicist working on the ATLAS experiment. He focuses on introducing modern cloud computing tools to more systematically search for phenomena beyond the Standard Model of particle physics.

Clemens Lange is a particle physicist working on the CMS experiment, searching for new particles in the collision data of the LHC and preserving the analysis using modern cloud tools. Welcome to the show, Ricardo.

RICARDO ROCHA: Hi.

CRAIG BOX: Lukas.

LUKAS HEINRICH: Hi.

CRAIG BOX: And Clemens.

CLEMENS LANGE: Hello.

ADAM GLICK: So you all work with CERN. What is CERN?

RICARDO ROCHA: CERN is, as Craig said, the European Organization for Nuclear Research. We're a big particle physics laboratory-- the largest in the world. We're based in Geneva, in Switzerland. And the main mission is fundamental research-- to try to understand what the universe is made of, to try to understand the 96% of the universe that we still don't know exactly what it's made of, to look for the state of matter right after the Big Bang, and to try to understand why we don't see any antimatter. So we build large experiments where we try to answer those kinds of questions.

CRAIG BOX: The 4% that we do know about, what's that made of?

LUKAS HEINRICH: You might know there's actually a nice show by Neil deGrasse Tyson that explains this as well. 5% of everything that we see around us-- all the galaxies, all the planets, all the dust in the universe-- that is the 5% that we understand, and it's called baryonic matter. But we know that 95% is out there, because we can measure the energy content of the universe. And we know from that 95%, roughly 20 percentage points are dark matter, and then the rest is what we call dark energy-- because we see that the universe is expanding, and that hints at this dark energy. So that's energy content in the universe that is not really bound into matter. But there is also dark matter. We can see galaxies using our astronomical observations, and we can look at the rotation of these galaxies, and these rotation curves suggest that there's a lot more matter than what we can measure from the light. And that's why it's called dark matter.

CRAIG BOX: CERN is famous for two things. The first one, in the area of computer science, is the World Wide Web. Obviously predates your history there, perhaps. But Ricardo, what's the story of computing at CERN?

RICARDO ROCHA: We've always had a requirement for a lot of computing power to process all the data. But we also had a requirement to share information among the physicists that work with us. And that's what basically triggered the initial World Wide Web proposal in 1989. It was a way for physicists to easily share data and information between themselves and interconnect the spaces.

So this was very popular at the start, and in '93, CERN realized that this could be much more valuable if it were made available to everyone. And in 1993, they made it public domain. So we kind of donated this technology. And of course, the protocols that were part of the proposal became very popular and grew very fast. And that's one reason why we're all here, I guess. But actually, it expanded a lot in terms of what this means for computing power, too. In the last 20 years, we realized we didn't have enough resources to compute all the data from the LHC. So we built a network that is very similar in its goals to what we did with data-- to share resources. So that's why we have all these interconnected centers around the world.

CRAIG BOX: And the other thing CERN is famous for, as you just mentioned, is the LHC-- the Large Hadron Collider itself. What is the Large Hadron Collider?

CLEMENS LANGE: The Large Hadron Collider-- LHC-- is the largest particle collider that we have in the world. It has a circumference of 27 kilometers-- so something like 17 miles-- and it's actually underground, in the border region between France and Switzerland. And it collides particles at almost the speed of light-- so that's something like 99.999991% of the speed of light. And that, you can basically translate into energies, and the energies that we obtain with the collider are the highest that we can produce man-made on Earth.

CRAIG BOX: So in layman's terms, we're shooting beams of light in opposite directions around the ring, and they crash together in the middle, and you're looking at what the output of that is, in terms of energy or spectrums or so on?

CLEMENS LANGE: Exactly. Most of the time, we're colliding protons. And as you said, they circulate around the collider-- one beam clockwise, the other one counterclockwise. And then these particle beams are brought into collision in the so-called experiments-- one of them being the CMS experiment, another one being the ATLAS experiment, and there are two other large-scale experiments as well. And these experiments are then made, basically, to detect everything that's happening in these collisions, and to detect and measure all the particles that were created in these collision events.

CRAIG BOX: Is the theory that the Big Bang was probably caused by the collision of some two things, or more things?

LUKAS HEINRICH: Sometimes it's said that at the LHC, we recreate the same environment as at the Big Bang. But it's not that what happens during the collision is equivalent to the Big Bang at all. So the Big Bang happened, and what happened-- I said that the universe is expanding, and so that means if you go into the past, that means the universe was smaller and smaller and smaller. And that means it got hotter and hotter and hotter, and the particles were zipping around at higher and higher energies. So what happens if we accelerate the particles to high energies is that we are kind of recreating a similar environment. So these type of collisions that we recreate at LHC are types of collision that also happened shortly after the Big Bang, but it's not the other way around that we are creating a Big Bang at all.

ADAM GLICK: How does this relate to the Higgs boson?

LUKAS HEINRICH: The Higgs boson is basically a particle of the Standard Model. So the Standard Model is the main theory that we have to compute what happens during these collisions of fundamental particles. And it's the last particle that was missing from the Standard Model. The Standard Model predicts a bunch of particles, and the Higgs boson we hadn't seen before we turned on the LHC. It was one of the main motivations to build the LHC-- to find this Higgs boson.

CRAIG BOX: Protons and neutrons, they're standard particles?

LUKAS HEINRICH: No, they're not. So the history is a little bit-- the ancient Greeks already kind of had this idea that there is this notion of atoms, that they're the smallest building blocks. But it turns out that atoms are not the smallest building blocks. In atoms, you have the atomic nucleus, and then you have electrons zipping around that nucleus. Also, the atomic nucleus is not elementary, because it's made out of protons and neutrons.

CRAIG BOX: Yes.

LUKAS HEINRICH: And then it turns out that protons and neutrons are not elementary. They are made of what are called quarks. Every proton and neutron consists of three quarks. And so quarks are, as far as we know, elementary particles, and electrons are also elementary particles. And these are described through the Standard Model.

The Higgs boson is a very special particle in the Standard Model. It has very unique properties, and it also explains-- or it signals that we understand-- how all the other particles have mass. So for example, the electron has mass, and that's why the electron is not going at the speed of light. The photon is massless, and it goes at the speed of light, but the electron does not. But it was kind of a mystery how these particles gain mass, and the Higgs is part of a mechanism that explains that, so it's very important. And so it was predicted in the 1960s by Peter Higgs, building on some other theoretical work, and that's why Peter Higgs got the Nobel Prize after we discovered it.

CRAIG BOX: So you're saying my high school physics teacher lied to me?

LUKAS HEINRICH: It's not a lie. So there are the different levels of abstraction, the same as in the tech ecosystems. You don't explain everything in terms of bits flipping around from 0 to 1, but you have different levels of abstraction. And then usually, in school, you start at a quite high level of abstraction, and then the more-- if you go study physics, obviously, then you peel off every layer, and then you learn more and more.

CLEMENS LANGE: Just to add, I guess your teacher just didn't tell you the full story. A thing that's important to know is that, basically, we create lots of particles that immediately decay after their creation, and we can only observe them in these collider experiments and our detection experiments here. If we didn't have this kind of setup, then we wouldn't see them. The Higgs boson-- or the Higgs field, as we would call it-- is basically everywhere around us. But we can actually only observe it in these collisions that we have at the LHC.

ADAM GLICK: So if you have these particles with incredibly short half-lives, how do you detect that? And what volume of data are you collecting in order to be able to do that detection?

CLEMENS LANGE: Basically, the Standard Model of particle physics is a very nice theory. As Lukas said, it's been around since the 1960s. And it basically describes how all these particles interact with each other. And what we actually detect in the experiments are the decay products of these particles that were created. Some of them, like quarks, actually cannot exist freely in nature. That's why we don't see them in nature. We only see the protons and neutrons, all the nuclei and atoms and above. But we're basically able to detect a lot of those traces and, knowing that the collision happened in the center of the detector, calculate back to what happened there.

And in terms of data rates, we have collisions happening in the experiments every 25 nanoseconds. So that's basically 40 million collisions per second. And during these collisions, something like 30 to 40 protons collide each time. And each event has a size of something like two or three megabytes. So if you then do the math, it's a huge data rate, and we actually cannot record all of this, because we have something like 100 million channels. So you can see, it's like a digital camera that needs to take 40 million pictures per second. And so we have a trigger system that reduces the rate to something like 1 kilohertz, which is something that we can then store on disk and tape. And this data is then brought from our experiments to the CERN computing center-- the so-called Tier 0-- and then processed further.
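As a rough back-of-the-envelope check on those figures, using only the approximate numbers quoted above:

```python
# Back-of-the-envelope version of the rates described above, using the
# approximate figures from the conversation (illustrative only).
bunch_crossing_rate = 40_000_000    # collisions per second (one every 25 ns)
event_size_bytes = 2.5e6            # "two or three megabytes" per event
trigger_output_rate = 1_000         # ~1 kHz after the trigger system

raw_rate = bunch_crossing_rate * event_size_bytes        # what the detector "sees"
recorded_rate = trigger_output_rate * event_size_bytes   # what actually hits storage

print(f"raw: ~{raw_rate / 1e12:.0f} TB/s")           # ~100 TB/s -- impossible to record
print(f"recorded: ~{recorded_rate / 1e9:.1f} GB/s")  # ~2.5 GB/s from one experiment
```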

CRAIG BOX: So Ricardo, what happens to that data once it's come out of the collider?

RICARDO ROCHA: The first step is to store it. So aggregated from all the experiments, the four experiments we've been describing, we get something like 10 gigabytes a second, which means we are generating in the order of 70 petabytes a year, because the collider is not always operating. The first thing we do is to make sure that we don't lose any data. So we have a large tape store in the computer center, and we archive everything. That's the first thing. So right now, we have something like 400 petabytes of data, and we keep adding in the order of 70 petabytes.

Then the next step is to do what we call reconstruction, which is looking at the raw data and using a large computing farm to generate more data out of it. And this is something that the physicists can then more easily look at. We don't have enough capacity at CERN to do this, so we also distribute the data around many centers in the world that offer more compute capacity to process all this.

The last step, once all this is done, is that the physicists select the data they're interested in, depending on the groups they are working in. They make a request, and they just submit their job and get back the analyzed data.
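A similar quick check on the archive figures Ricardo mentions (roughly 10 gigabytes a second aggregated while taking data, and on the order of 70 petabytes added per year) shows how the two are consistent with a collider that is not always operating:

```python
# Rough consistency check on the ingest figures mentioned above.
ingest_rate = 10e9                 # ~10 GB/s aggregated across the experiments
seconds_per_year = 365 * 24 * 3600
archived_per_year = 70e15          # ~70 PB added to the archive each year

if_continuous = ingest_rate * seconds_per_year    # what a year of non-stop running would produce
duty_cycle = archived_per_year / if_continuous    # fraction of the year spent taking data

print(f"~{if_continuous / 1e15:.0f} PB/year if running continuously")  # ~315 PB
print(f"~{duty_cycle:.0%} effective data-taking duty cycle")           # roughly 20-25%
```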

ADAM GLICK: What does your technology stack look like for that?

RICARDO ROCHA: It's been changing a lot. Visiting the computer center at CERN is a bit of computing history. It's pretty old. It's a big building.

CRAIG BOX: It's all NeXT cubes.

ADAM GLICK: Do you all walk around with the bags full of punch cards, and you can just feed them in?

RICARDO ROCHA: You can see photos of this. We still have the NeXT machine where Tim Berners-Lee created the World Wide Web. So if you visit us and go to the computer center, we have it in an exhibition there. It's quite interesting to see how the building was made so that people could walk around and sit at tables, and then suddenly the mainframe times came, and then eventually, commodity hardware. And right now, the data center looks very much like a traditional data center. Then for the computing stack, we have a virtualization layer-- we use OpenStack for this. And then Kubernetes has been growing a lot in popularity, so we deploy Kubernetes on top of OpenStack today. For storage, we have our own in-house storage systems-- actually, we have several, but for the physics data we have a custom storage system. And for all private cloud data, block storage, and object storage, we use Ceph.

CRAIG BOX: The Higgs boson was discovered in July 2012, which obviously predates Kubernetes. What was the technology stack at the time, and what experiments needed to be performed on that data in order to prove that this particle existed?

RICARDO ROCHA: We used large batch systems, and it's a very traditional deployment in terms of offering people, basically, Linux systems. The experiments developed their software, and then our responsibility in IT is to make sure that their software works in a scalable way. So it's a very large batch farm. The computer center today has something like 300,000 cores, and 80% or 90% of that is dedicated to this batch system. So for the experiments, they see it like a batch submission system, and their responsibility is to package the jobs and submit them in. The tricky part there is that different experiments will have different requirements, and before all this container era, we would have to make sure that the packages available in the batch system would be compatible with the jobs that were to be run. This was always very tricky, so upgrade campaigns were very complicated. If we migrated from, say, CentOS 6 to CentOS 7-- OK, at the time it was more like 4 to 5-- this was always very tricky to do. It would take months.

CRAIG BOX: And Lukas, as a user of the system, what was the experience like?

LUKAS HEINRICH: At CERN, the experience is always pretty reliable. But since it's a very international collaboration, the level of support obviously changes, because we're not operating under a huge budget. And so the software stack that the experiments have is also pretty interesting, because it's not like we're a single-language shop, where we're deploying a Python application or we're deploying a Node.js application. In our scientific analysis software, we use a lot of different things. Some parts of our software are in C++. All the mission-critical stuff that needs to be fast is in C++. And then some things are still in Fortran, because we took some code that we used in the previous experiments--

CRAIG BOX: Read them off the punch cards?

LUKAS HEINRICH: Yeah. No, we have some code that we used for the magnetic field that was in Fortran until a couple of years ago. And then also, some stuff for the steering of our-- the configuration for a job, that is done in Python. And so deploying them-- when we talk about containerization, actually putting it into a container was a little bit challenging, because it turns out that our container images become huge. So the container image that we used for the keynote was something like 20 gigabytes. And so one of the challenges that we were facing-- obviously there are a lot of advantages to containerization, but one is actually distributing these images to a lot of compute nodes in an efficient way, because a single job that actually runs and analyzes the data only reads 6% or something, or 10%, of the container image. But it needs, potentially, everything in there. And so we had some in-house systems to do the software distribution before we had containerization, and now we are trying to merge these two technologies together.

ADAM GLICK: Do you maintain all of the software from previous experiments, so that if you ever needed to-- you keep the data. And so if you ever needed to re-run an experiment done 10 years earlier on earlier technologies, you'll be able to do it?

LUKAS HEINRICH: Now, that's a super crucial point that I've also been working on a lot, is this idea of reusing old software. So the software in the keynote was 10 years old, and as you said, it predates Kubernetes. But we managed to put it in a container. And this idea of reproducibility and reusability is extremely important, because these are unique experiments. It's not like we were going to build a new Large Hadron Collider somewhere else, because they're very expensive. And so this idea of open data and open software, and also archiving software and being able to reprocess data, is pretty important.

CRAIG BOX: Ricardo and Lukas recently delivered a keynote at KubeCon EU, where they took the experiment which was performed in 2012 and reperformed it today, on top of Kubernetes on the cloud. I understand that initially running the experiment in 2012, it would take around 20 hours to get the results?

LUKAS HEINRICH: So the experiment that we did was very streamlined now, and while we were on stage, it produced a single plot. And that is only possible because we knew what we wanted to produce, because now we know the Higgs boson exists. Actually, when you're developing these kinds of analyses, it's a much longer process. It's also a very simplified version of it. So you have systematic uncertainties, because we also simulate collisions. We need to have simulations to see and explain our data. And we simulate under the different hypotheses and check whether the data fits one hypothesis or the other. But these simulations are not perfect. There are some uncertainties, and we need to change the simulation here and there. And so the actual development of the software that led to the discovery is obviously much more involved. You can't do it in five minutes.

CRAIG BOX: But if you-- as an analogy, it might take me a long time to figure out the way to the center of the maze, but once I know that, let's say it takes me five minutes to walk to the center of the maze? So the happy path that was developed in 2012 took 20 hours. How long does it take now to run on the cloud?

RICARDO ROCHA: We started by trying the experiment in-house, and it still took quite a long time. And we only had 20 minutes for the keynote and with introductions and things, so we started looking around, and we--

CRAIG BOX: Maybe you should look for something easier. Try and prove the proton exists. You can do that in 20 minutes.

RICARDO ROCHA: [CHUCKLING] Exactly. We somehow decided that this live demo was a good idea. And then when we had to actually make it work, we started doubting this idea. But in the end, it all worked, so it was a good process. But the main thing is that, at CERN also, the problem we have is that we would need a large amount of resources. So in this case, we used 25,000 cores, and we needed a data set which is around 70 terabytes.

The storage one is a problem, but actually, to get 25,000 cores at CERN is not that easy, because we have a production system. We are basically maxing out the resource usage all the time. So when we went to our managers and asked for all these resources, they said, you can't do it. But we have an industry collaboration group which is called OpenLab, and through this, we were able to convince the nice Google Cloud people to give us some resources to go through this demo. So that's where Kubernetes played a big role-- we could easily take what we had been developing in-house and just re-run it on the public cloud without too much effort, which was one of the key points.

CRAIG BOX: And so you got that down to the point where you were able to run it in the scope of a 20-minute keynote.

RICARDO ROCHA: Right. Initially, we were expecting to do it in 12 minutes, but actually--

CRAIG BOX: We skipped some steps.

RICARDO ROCHA: Well, we actually ran the whole thing, but the public cloud worked a bit better than we were expecting, I would say. And so we added some more content on describing what we were doing. We were able to bring it down to five, six minutes, which was perfect.

ADAM GLICK: How much data were you processing, and how much compute power were you throwing at that data in order to make that happen?

RICARDO ROCHA: Exactly. So the Higgs data set that we were going through is 70 terabytes, and we processed it in five minutes using 25,000 parallel jobs-- which meant that we were actually pre-pulling the data before the analysis. And we were getting rates of 200 gigabytes a second pulling the data in. And so that was pretty spectacular to look at, actually.
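A quick sanity check on those demo figures, again using only the approximate numbers quoted in the conversation:

```python
# Sanity check on the keynote demo figures quoted above.
dataset_bytes = 70e12     # ~70 TB Higgs open data set
wall_time_s = 5 * 60      # processed in roughly five minutes
jobs = 25_000             # one single-core job per input file

aggregate_rate = dataset_bytes / wall_time_s   # bytes per second pulled in
per_job_input = dataset_bytes / jobs           # average input per job

print(f"~{aggregate_rate / 1e9:.0f} GB/s aggregate read rate")   # ~233 GB/s, in line with ~200 GB/s
print(f"~{per_job_input / 1e9:.1f} GB of input per job on average")  # ~2.8 GB, a few GB per file
```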

CRAIG BOX: You mentioned that the container images were 20 gigabytes.

RICARDO ROCHA: Yeah.

CRAIG BOX: Do you need to have an individual container image for every 20 gigabytes of your 70-terabyte data set?

RICARDO ROCHA: The data set is 25,000 files, so it's 70 terabytes in total. But it's actually an interesting point-- the software is from 2010, as we mentioned, and it's single-core. It's from before the experiment software migrated to multicore, which meant this was our maximum parallelism: 25,000 parallel jobs, because it's one per file, basically. And each one was processing something like four gigabytes of data. But the big thing ended up being the container image. So we pre-pulled the images on all the nodes to make sure that at least that part wouldn't take three or four minutes. And then the actual jobs would pull only four gigabytes each, but we had 25,000 of them.

ADAM GLICK: How many clusters and nodes were part of that entire process?

RICARDO ROCHA: In total, we were thinking we could get away with one single cluster, and it kind of worked. We had some tricky bits on the Kubernetes scheduling and configuration. And one thing is that, in-house, we can tune the masters quite easily in the Kubernetes setup, but in the public cloud, we don't have that much flexibility. So some of the features we would need to run in a single cluster were not quite there, so we kind of split the clusters. But the total number of nodes was something like 700 nodes, and they were 16 cores each. And the network throughput was the thing that puzzled us most, because in Google Cloud, we are able to get two gigabits per core guaranteed, which is something that we don't have in-house. So in total, we had 25,000 cores and around 700 nodes across the clusters.

CLEMENS LANGE: The interesting part is, then, once we scaled up to 25,000 jobs, we actually also needed to aggregate this data in the end. And usually, what we would do in a standard experiment or data analysis is, we'd actually write out the data to some local storage and then add it up, scale it correctly, and then produce the final plot that would then show the Higgs boson mass peak, for instance.

But this was actually far too slow for us at the moment when we tried this. So we also had to work around this by basically just writing out the information in a smaller format-- so basically, skipping some of the tools that we would usually use-- using a JSON format, then pushing that into Redis, and then again into our Jupyter Notebook to actually make the final plots, and also make that plot look nice and update every second. As you can see if you look at the video, that worked really well.

ADAM GLICK: So if you've got 70 terabytes of data spread out across all this, and it's all in JSON format, how many curly braces is that?

LUKAS HEINRICH: So the 70 terabytes, that is not JSON data. We have an in-house data format called ROOT. Actually, when I was talking about the software stack, I meant to mention that. So ROOT is our main data analysis framework. It is quite interesting-- if you're in the data analysis ecosystem, you might have also come across the stuff that Bloomberg is working on with interactive C++. And that actually comes originally from CERN.

So this is the ROOT data analysis system, and it also comes with its in-house data format. So these 70 terabytes are in this format. But then what we do is, we analyze this-- every job analyzes four gigabytes of data that is in this ROOT format, but only extracts the minimal amount of information that is needed to build up this plot. So the summary data of what was extracted from those four gigabytes is a kilobyte-sized JSON file, which we then pushed to Redis, and then streamed into the Jupyter Notebook.
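As an illustration of the fan-in pattern Clemens and Lukas describe-- each job reduces its slice of data to a kilobyte-sized JSON summary, pushes it to Redis, and a notebook merges the summaries into a continuously updating plot-- here is a rough Python sketch. The queue name, bin edges, and payload shape are invented for the example; this is not the actual demo code.

```python
# Illustrative fan-in: many jobs push tiny JSON summaries to Redis, and one
# consumer (e.g. a notebook) merges them into a continuously updated histogram.
# The queue name, binning, and payload shape are made up for this sketch.
import json
import redis

QUEUE = "higgs-demo:partial-histograms"          # hypothetical Redis list
BIN_EDGES = [100 + 2 * i for i in range(26)]     # hypothetical 100-150 (GeV) bins


def publish_summary(masses, conn=None):
    """Run at the end of each analysis job: reduce its events to per-bin counts."""
    conn = conn or redis.Redis()
    counts = [0] * (len(BIN_EDGES) - 1)
    for m in masses:                             # e.g. reconstructed masses from one file
        for i in range(len(BIN_EDGES) - 1):
            if BIN_EDGES[i] <= m < BIN_EDGES[i + 1]:
                counts[i] += 1
                break
    conn.rpush(QUEUE, json.dumps({"counts": counts}))   # kilobyte-sized summary


def merged_histograms(conn=None):
    """Run on the notebook side: yield the running total as summaries arrive."""
    conn = conn or redis.Redis()
    total = [0] * (len(BIN_EDGES) - 1)
    while True:
        _, payload = conn.blpop(QUEUE)           # block until the next job reports
        partial = json.loads(payload)["counts"]
        total = [a + b for a, b in zip(total, partial)]
        yield total                              # caller redraws the plot each time
```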

CRAIG BOX: What you're describing sounds very much to me like the MapReduce paradigm, where you start with large parallel data sets and end up with something very small.

LUKAS HEINRICH: Yeah. So this is basically the gift that particle physics gives us. We have 40 million collisions every second, but every collision is independent of each other. It's like rolling dice. Every roll of the dice is independent, so it's actually an embarrassingly parallel workload. And so that's why we are using worldwide distributed computing so heavily, because we know that we can, in principle, process every event separately. Obviously, we chunk it into files so as not to have trillions and trillions of events flying around. But yeah, we can completely parallelize a lot of the workload.

CRAIG BOX: Is it feasible that you could use the modern tooling-- the Hadoop and Cloud Dataflow tooling-- that exists on that problem?

RICARDO ROCHA: Yes. There are groups at CERN that are looking at using Spark and similar tools. We actually also started looking at Kubeflow because of the MPI integration. But for the Spark deployments, we are also looking at using Kubernetes. So just last year, while we were at KubeCon EU, we heard about the Google Spark operator. And just by passing it to our colleagues in the Spark team, they created something in-house which is called Spark-as-a-Service. Basically, it's just deploying Kubernetes clusters behind the scenes, and then people, instead of using the traditional Spark submission, just use the Google Spark operator and submit their jobs directly to Kubernetes.

CRAIG BOX: How much does it cost to operate the Large Hadron Collider?

LUKAS HEINRICH: I think the budget of CERN, correct me if I'm wrong, is something-- a billion a year?

CRAIG BOX: But is it a sunk cost? Do you have to pay when it runs? Or is it--

LUKAS HEINRICH: So one of the things that is important about these large-scale scientific experiments is that they are very long-running. The planning for the Large Hadron Collider started at the end of the '80s, and it will run until at least 2030, or 2040. And so that cost is spread across a very long timespan. I think I read some number somewhere that, basically, for a European citizen, the cost is a cup of coffee per year, actually.

CRAIG BOX: If you want to run an experiment, do you go to the team who are shooting photons around and say, please shoot some photons around? Or are they continually doing that, and you choose to look for different things in the output that they generate?

CLEMENS LANGE: Actually, collisions are basically taking place all of the time. So the data set that we analyzed for the demo is actually the collision data set from 2010, and a fraction of the 2011 data set. And the LHC basically operates all the time-- I think at that point, we basically had collisions for something like eight to 10 hours. And so the particles would be injected into the main collider, then they would be accelerated and then brought into collision. At some point, you wouldn't have enough protons left in the beam. There are billions of protons in the beam, and then we basically dump them, ramp down the magnets that are there to steer the beam, and then start again. So the experiments would basically turn off for an hour or so, and then everything would start again.

And we'd have people sitting in the control room 24/7 overseeing the operation of this. And the LHC would basically operate all year long-- with the short exception during winter because of the electricity costs, because in winter you'd have people use electricity for heating, and that would make it much more expensive, while in summer-- since there's not too much air conditioning here, because it's not that hot, at least not in that region-- you could just continue operating all the time.

ADAM GLICK: So the LHC takes the holidays off.

RICARDO ROCHA: That's exactly the case, yeah. So we all go home for two weeks.

LUKAS HEINRICH: One thing that's maybe interesting or understandable for everyone is that, as physicists, we work at these experiments at the collision points, and we are basically served by these collisions. So we consume the beams of the Large Hadron Collider as a service, and we are just served 40 million collisions a second. And then our job is to record these collisions, and then we just take the data and analyze it. And so most of the time, we're colliding protons. And for a couple of weeks every year, we switch and we collide lead nuclei. And then there is one experiment that specializes in those collisions, and they analyze that data.

CRAIG BOX: Now, I understand that you're making a lot of these data sets open. Can you talk a little bit about that?

CLEMENS LANGE: Yes. So there's an open data initiative at CERN, and actually, the CMS experiment is basically leading that effort in terms of open data accessibility. And the analysis that we use now for the demo is actually an example analysis that uses exactly this data set, because what's really important is actually to preserve, also, the knowledge. It's not only the software, actually. You need to know how to run it. You need to have the old Linux distribution in the container. And then on top, you need to have the software installation. And you still need to know, how do we actually execute this? How do you chain this together? And what's the output? And what is an electron in your collision data set? And all this knowledge needs to be preserved and needs to be documented.

And there's this opendata.cern web portal that you can go to, and there you'll see a plethora of information on all this. It gives you the software. It gives you information on the so-called luminosity-- so that's basically the data statistics-- and also the simulation data sets that we used to compare to the data, and how to process all this and finally get an answer, if you want to study something on it.

ADAM GLICK: Is there any data that you're not making available, or is all the data that's collected part of the open data project?

CLEMENS LANGE: So at the moment, the policy is that within five years of data-taking, we make available 50%. And that's only for the CMS collaboration. We make available 50% of the collision data set. And within 10 years, we actually provide the full data set. So there's going to be more and more data being added over time.

ADAM GLICK: What drives the decision of half of the data being available and then half being held back for another five years?

CLEMENS LANGE: I think it's basically about giving us some time to actually analyze the data and also understand it really well before we make it available, because we have lots of reconstruction campaigns that are happening. So we have the so-called prompt reconstruction that we have almost immediately after data-taking that we can look into. And then you might understand that one detector maybe wasn't working as well as it was supposed to, or it was actually working much better than it was supposed to. And then you want to make sure that the simulation and the data are in very good agreement. And we put all the knowledge that we accumulated over the years into these data before we actually make them available for people to play around with.

ADAM GLICK: Have you had any external people go and try and recreate the experiments, given that you're making the data and how people can run these processes on their own-- has anyone gone and done that?

CLEMENS LANGE: Yes. I mean, it's used mostly for education-- so you have so-called master classes where high school students, for instance, can actually look at these data-- reduced data sets, basically. But we also had people-- particle physics theorists-- who actually took this data, and it took them something like two to three years to analyze these data. But then they made a publication that is based on the CMS open data and did a really nice analysis of these data. So it was really nice to see them actually make it work. And just to add, coming back to the demo that we did, we actually weren't sure that this would work, because nobody had actually tried running this code at this scale. And we actually found a couple of bugs in the code when we were developing the demo, and we fixed them. And now we can be sure that this is all working.

LUKAS HEINRICH: I think to some extent, you could actually say that I'm one such person, because I'm actually external. I work on a different experiment, and usually I do not have access to the embargoed part of the CMS data set. And basically, we were on stage using it-- both Ricardo and I are not members of the collaboration that made the data available, and we were able to reproduce this analysis.

Now, I think this is actually where Kubernetes comes in. Part of it is that we're not just dumping petabytes of data-- we need to figure out how you can actually make this practically available to people. And so while we are happy to make the data available, you need to somehow develop the processes to analyze this data and teach people how to do it and do it at scale, because not everybody has a data center available to them. And I think one part of why I was super excited to do this demonstration is that we can show that we can analyze this data in a pretty short amount of time, and then maybe a lot of people might use this more for actual science.

CRAIG BOX: We have two physicists who are working in computer science, and we have a computer scientist who's working in physics. Do you feel that they are similar disciplines?

LUKAS HEINRICH: No. I don't think they are similar disciplines, but they are very adjacent. And I think CERN really exemplifies that-- there is this fusion, a lot of exchange, between the IT department and the physics department. So I personally walk over to the IT department-- they're on the CERN campus, but somewhat physically separated a little bit. And so every week, I run back and forth because I have some questions or some ideas that we might try. And I think CERN gives us this fertile ground to actually experiment and do different things that you would normally not do. I think the computing people enjoy this interaction with the physics people, we really appreciate it, and good things can happen.

RICARDO ROCHA: I think it goes beyond just computer science and physics, because we have requirements-- we were describing the accelerator. One of the things, for example, is that to achieve these high energies, we have to cool the accelerator quite a lot-- almost to the absolute zero, 1.7 Kelvin. So we have cryogenics. We have electrical engineers. We have mechanical engineers. And it's a very diverse environment. Everyone understands very well what the mission is, and we all work together in different areas.

It's very common that a physicist, for example, gets a job in the CERN IT department, because they convert themselves to computer scientists. It's a bit less common that a computer scientist becomes a physicist, because it's usually a bit too late to go back to school for that long. But everyone works on a mission that-- we understand that the more data we can process per second, the better the science will be. So we measure our capacity in events per second. So whatever it is that has to be done, we'll always find someone with the skills. So it's very heterogeneous.

CLEMENS LANGE: You might picture us in lab coats, but actually, what we're doing as physicists most of the time, if we're not talking about physics in meetings, is writing analysis code. So we're programming in C++ or in Python, and we write a data analysis. And it's really advantageous if you can talk to someone really knowledgeable from the IT department, so the computer scientists can actually tell you what to do.

ADAM GLICK: Speaking of what you folks wear, you're wearing a shirt with a very interesting image on it. I'm curious which data set that came from.

CLEMENS LANGE: So what you see on my t-shirt is basically a simulated collision. It's extremely simplified, but what we can see is lots of bent particle tracks. You can see that not all of them are straight, and that's because of the magnetic fields that we have in the detector. And you can see different particles with different energies-- or momenta-- passing through the detector, and what this would look like if you actually made a photo.

CRAIG BOX: And we'll have a picture of that in the show notes.

ADAM GLICK: Ricardo, Lukas, Clemens, thank you very much for coming on the show today.

RICARDO ROCHA: Thank you.

LUKAS HEINRICH: Thank you.

CLEMENS LANGE: It was a pleasure to be here.

ADAM GLICK: You can find CERN at home.cern, and our three guests' Twitter handles can be found in the show notes.

[MUSIC PLAYING]

CRAIG BOX: Thanks for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter at @KubernetesPod, or reach us by email at kubernetespodcast@google.com

ADAM GLICK: You can also check out our website at kubernetespodcast.com, where you'll find transcripts and show notes for each show. Until next time, take care.

CRAIG BOX: See you next week.

[MUSIC PLAYING]