#17 August 21, 2018

Shopify and Security, with Jon Pulsifer

Hosts: Craig Box, Adam Glick

Jon Pulsifer is a Production Security Engineer at Shopify, and Canada’s biggest Kubernetes fan. Adam and Craig dig into why, and what Adam’s new mode of transport is going to be.

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

So what happened last week, Craig?

CRAIG BOX: I flew back from America. I recovered from jet lag. I think that was last week? I lose track of time easily.

ADAM GLICK: Did you see the article about Sling TV talking publicly?

CRAIG BOX: No, tell me more.

ADAM GLICK: They're now a case study up on the Kubernetes site talking about how they're using Kubernetes to back the Sling TV service. So if you're using a Slingbox to shoot video around the world, you're taking part in the growth of Kubernetes.

CRAIG BOX: Well, a little bit of a word on the grapevine about electric car maker Tesla this week. I hear they may be doing some more with Kubernetes. We'll hope to get you more on that story as it breaks.

ADAM GLICK: You have any other details there? Or is that just a teaser?

CRAIG BOX: No, it's just a short story right now.

ADAM GLICK: [CHUCKLES] Got you. I spent some time this week playing with mitmproxy. Have you ever played with it?

CRAIG BOX: Not mitmproxy itself. I've used Charles on the Mac, and then before that, Fiddler on Windows I think are similar tools.

ADAM GLICK: Yep. So for anyone who's looking for a way to kind of take a look at network packet inspection, it's a pretty good tool on Mac and Linux. It's open source. I got a chance to play with that a little bit this weekend, taking a look at access to containers, and what's going on in container networking, and monitoring the network traffic there. There's some nice articles out there. And if you do any debugging or want to take a look at the traffic and play with it, especially because you should be always encrypting all of your communications traffic, it's really a great tool to be able to look at those. I was really impressed with what I was able to do.

CRAIG BOX: ABC, Always Be Cryptographing.

ADAM GLICK: Indeed.

CRAIG BOX: I see you're in California, in Sunnyvale this week. Are you there for long?

ADAM GLICK: I am indeed. I'm glad you said that. I've reached the diamond level with my hotel, which probably means that I've spent far too much time away. I'm not quite at the Matt Ray or Michael Coté view of, like, analyzing travel points. But I have to ask myself, as I looked at the system, that what will I do with a quarter million in Condolodge points?

CRAIG BOX: Is that enough to get you a Harrier Jump Jet?

ADAM GLICK: I only could hope.

CRAIG BOX: Do you remember that story?

ADAM GLICK: No, but I would love a Harrier.

CRAIG BOX: There was a Pepsi competition, I want to say maybe in the '80s, where with 20 magillion Pepsi Points, you could get a Harrier Jump Jet, and you could buy Pepsi Points for $0.10 each or something. So the exchange rate was such that you could get your Jump Jet for 800,000 points or dollars or something. And so someone tried. And then Pepsi said, no, we didn't-- like, the ad said you can have this, but why would you think we were telling the truth?

ADAM GLICK: Are you serious?

CRAIG BOX: And there was a court case about it, but yeah.

ADAM GLICK: Fascinating.

CRAIG BOX: I think they had to withdraw it in the end. But unfortunately, no Jump Jets were given out or harmed in the story.

ADAM GLICK: Let's get to the news.

[MUSIC PLAYING]

Praveen Shukla of Go-Jek posted an article this week talking about his experience connecting VMs and Kubernetes networks. He describes the various barriers he ran into, such as having a different network addressing space due to Kubernetes using its own virtual network, and how he solved these issues. If you're building a Kubernetes infrastructure that needs to talk to existing VM-based resources, this article provides some helpful tips to save you time and headaches.

CRAIG BOX: The first KubeCon and CloudNativeCon to be held in China is on the 13th to 15th of November. They've just this week announced their schedule, including-- drum roll-- yours truly talking about customer use cases for the Istio service mesh. In case that alone isn't reason enough to buy a ticket, there are 180 other great speakers from all over the world. And sessions will be simultaneously available in English and Chinese via translation. Early bird registration is open until September the 7th.

ADAM GLICK: Théo Chamley of Google Cloud posted this week about best practices for operating containers. In particular, he calls out seven best practices. Spoiler alert, here they go. Number one, use the native logging mechanisms for containers. Number two, ensure your containers are stateless and immutable. Number three, avoid privileged containers. Number four, closely related, avoid running as root. Number five, make your applications easy to monitor. Number six, expose the health of your application. And number seven, carefully choose the image version you use.
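Several of the practices listed above can be checked mechanically against a container spec. The following is a minimal sketch, not a tool from the article: `audit_container` is a hypothetical helper, and the field names are the standard Kubernetes container fields (`securityContext.privileged`, `securityContext.runAsNonRoot`, `livenessProbe`, `readinessProbe`).

```python
def audit_container(container):
    """Return a list of findings for one container spec (a plain dict).

    Hypothetical helper for illustration; it covers only four of the
    seven practices: privileged, root, health probes, and image version.
    """
    findings = []
    security = container.get("securityContext", {})
    if security.get("privileged", False):
        findings.append("privileged container")
    if not security.get("runAsNonRoot", False):
        findings.append("may run as root")

    image = container.get("image", "")
    # Look only at the part after the last "/" so that a registry port
    # such as "registry.example:5000" is not mistaken for a tag.
    last = image.rsplit("/", 1)[-1]
    pinned_by_digest = "@" in last
    has_fixed_tag = ":" in last and not last.endswith(":latest")
    if not (pinned_by_digest or has_fixed_tag):
        findings.append("image not pinned to a specific version")

    if "livenessProbe" not in container and "readinessProbe" not in container:
        findings.append("no health probe")
    return findings


if __name__ == "__main__":
    risky = {
        "name": "web",
        "image": "registry.example:5000/web",
        "securityContext": {"privileged": True},
    }
    for finding in audit_container(risky):
        print(finding)
```

A check like this could run in a CI step before deployment, so that problems surface at build time rather than in the cluster.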

CRAIG BOX: If you're a Mac user, you'll be familiar with the Homebrew tool for downloading and installing software. In case you listened to Episode 7 and want to get kustomize, with a K, you can now get it with brew install kustomize, thanks to Ahmet Alp Balkan.

ADAM GLICK: Anoop Vijayan Maniankara posted a nice blog diving into the Container Storage Interface, or CSI. His article focuses on dynamic provisioning and provides a nice high-level overview of CSIs and is intended to be a starting point for people looking to implement CSIs.

CRAIG BOX: If you like Istio and you have up to 5 and 1/2 hours to spare, the Istio community hosted a live stream on Twitch last Friday, which is now available to watch on demand. It's a bit like a virtual conference day with presentations, and fireside chats, and participants from companies like Google, IBM, and Pivotal. And as a bonus, at about 2 hours and 20 minutes, Dan Ciruli shows up. And that has to be my new favorite part.

ADAM GLICK: And that's the news.

CRAIG BOX: Today's guest is Jon Pulsifer, a production security engineer for Shopify, based in Ottawa, who has been described as Canada's biggest Kubernetes fan. Welcome to the show, Jon.

JON PULSIFER: Hey. Thanks for having me.

ADAM GLICK: Jon, tell me a little bit about your security background.

JON PULSIFER: Before I came to Shopify, I spent nine years in the Royal Canadian Navy, where I ended up working at the Canadian Forces Network Operations Center. So doing defensive cyber operations in a military context is really where I had earned my tinfoil hat, as it were. So getting to see some fun attacks against the nation that we haven't been able to see at Shopify has just given me that extra little bit of context for that.

So while I was working there, I ended up becoming a SANS instructor and doing some teaching for the SANS Institute and was a-- well, the only-- network defense instructor in the Canadian Armed Forces for a couple of years. So I'm very fortunate in that respect.

CRAIG BOX: And you're now working at Shopify as a production engineer?

JON PULSIFER: Yeah, a production security engineer. So we went through a little bit of a rebranding. We have a production engineering team at Shopify. We're not SRE or anything like this. So it just seemed to fit that we would become production security engineers to fit that model. We're really sort of hanging onto the coattails of production engineering and securing all the things.

CRAIG BOX: What does production engineering as a discipline mean to Shopify? And then how does production security engineering relate to that?

JON PULSIFER: Well, they follow a production engineering model at Facebook. And so we took a lot of lessons learned from Facebook's production engineering. And we took some principles from SRE, the Site Reliability Engineering. And we jammed them all up together into our production engineering model at Shopify, which we blogged about. I can't really talk too much about it because I'm not on that team. But previously to becoming a production security engineer, we were an infrastructure security team. And so given that production engineering is maintaining all of the servers and these sorts of things, we care about them as well. Right? So we needed just to follow them with their branding and work alongside them. Integrate with them is really what we're trying to do today, so a little bit more integration with the production security team.

CRAIG BOX: What's the story of how Kubernetes came to Shopify?

JON PULSIFER: It's really curious. My arc is a little bit fun. We actually played with it at a company Hack Day. Not myself, but a few production engineers wanted to try it out. They heard that Kubernetes was here. This was like Kubernetes 1.5, maybe 1.4 at the time when we were playing with it for Hack Days. And we proved the model. Right? We've been playing with containers for a long time, so we were used to these sorts of things, but scheduling with Kubernetes was what we wanted to try out.

So after a successful Hack Days at Shopify of getting Kubernetes up and running, what we did is there was a conversation happening on Slack, where one of our directors of production engineering, Camilo Lopez, said, Hey, we're going to go down to Mountain View and talk to Google. And Jon-- he's a fanboy and a customer-- said, hey, you're going to the mother ship? Well, shouldn't you have a security person go with you? That would be super neat. And Camilo, god love him, said, yeah, I don't think that's a bad idea. So that's really how we ended up in Mountain View together, and how I ended up doing security on Kubernetes. So that's how we started down that road.

ADAM GLICK: And did you start working on the security pieces from day one? Or was that something that came in later, after you started to adopt Kubernetes?

JON PULSIFER: That's a really fun question. So when we had started on GKE, it was Kube 1.5. Right? And Kube 1.5 was before RBAC was even a thing. And so a lot of the security features that we have in Kubernetes today weren't available at that time. So we had to solve these problems in unique ways.

One particular way that I recall is every workload on Kubernetes gets a service account token mounted with it. And what we would do at that point is actually just mount an empty volume over where that secret would be, such that if our container were to be compromised, they couldn't get an identity to talk to the API server.
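The workaround Jon describes can be sketched as a pod manifest built in Python. This is an illustration, not Shopify's actual configuration: the pod name, image, and volume name are hypothetical placeholders, while the mount path is the standard location where Kubernetes places the service account token. Later Kubernetes releases added the `automountServiceAccountToken: false` setting, which achieves the same goal more directly.

```python
import json

# Standard path where Kubernetes mounts the service account token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount"


def pod_with_masked_token(name, image):
    """Build a pod manifest that mounts an empty volume over the token path.

    The name, image, and the volume name "no-token" are hypothetical.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "volumeMounts": [
                        {"name": "no-token", "mountPath": TOKEN_PATH}
                    ],
                }
            ],
            # An emptyDir volume starts out empty, so a compromised
            # container finds no credentials at the usual path.
            "volumes": [{"name": "no-token", "emptyDir": {}}],
        },
    }


if __name__ == "__main__":
    manifest = pod_with_masked_token("demo", "example.invalid/demo:1.0")
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be fed to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.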

CRAIG BOX: That was best practice at the time.

JON PULSIFER: It was. Right? I remember sending an email into Google. And I was like, I noticed this thing. And I was a noob on Kubernetes. And is this real life? Like, did I find this? Is this what's happening here? And that's what it was, and that was pretty cool.

So let's see. We have been focused on security since the beginning. There are what I like to call solving old problems in new ways with Kubernetes. Right? So previously on metal we were doing AppArmor and seccomp for some of our workloads. But bringing those into Kubernetes, those features weren't really there in a nice scalable way. So we have been focusing, but we've been chomping at the bit. For every release, there's some new security feature that we just want to turn on. Right? So it's been a really fun journey so far.

ADAM GLICK: That leads me to believe that there are other ones that you're looking forward to. What security features are coming up that you're looking forward to?

JON PULSIFER: So at the last KubeCon, Google announced gVisor and open sourcing that. And I'm really excited for this because I really commend the marketing that was done on containers and Kubernetes. It's widely adopted now across many enterprises. But the one thing that was failed to be mentioned-- and Jessie Frazelle will tell you the same thing-- is that we forgot to mention isolation and how that is not happening here. So with gVisor, and Kata Containers, and these new technologies, I'm really looking forward to putting sandbox: true in my pod security policy. And I want that isolation. So this is the next want for Jon, that's for sure.

CRAIG BOX: Shopify provides software in the retail space. And you have SaaS customers who consume that software. Presumably, you don't want people to be able to see each other's environment, so you need to have some sort of multi-tenancy. Can you tell us a little bit about Shopify's environment and the multi-tenancy needs you have?

JON PULSIFER: Yeah, no problem. There's two different use cases here. And I'll try to tackle them both. So Shopify is a leading multichannel commerce platform. So the online store is just one small piece of how we enable our merchants to sell things. We offer a number of other ways to sell-- social media, mobile POS, these sorts of things. But a big part of Shopify's success story is our resiliency.

And we've built Shopify to fail in such a way that, even if you're, like, a one-person business, you're running on the same infrastructure that our larger brands, like Kylie Cosmetics-- the flash sales that destroy a lot of other e-commerce platforms. Even as a small business, you're running on top of that same infrastructure. So we've had a lot of history in decoupling our stateful services from our ones that are not. So any request that comes into Shopify could hit any worker at that point, and that's how we deal with those resiliency primitives there.

So tenancy for Shopify Core-- that's what we call our main application-- it is only a single tenant. Right? There are hundreds of workloads for these stateful and stateless services, but it's only a single tenant. Now, that's pretty easy to solve. We can wrap all that up and put a bow on it with some mandatory access control and seccomp and further sandboxing. But for all the other services at Shopify, this is really where it gets fun.

So Shopify is one of-- I don't know how many we have now. I wish I could say. I'm going to say 400 to 500 services at Shopify, and a service is described as the whole suite of technologies that make up an application or web application. So what I mean by this is a Ruby on Rails app will typically have a MySQL or a Memcached or a Redis. That's where the N workloads come in. So each of these workloads-- it could be anything from an internal office map, or something to check the tap levels, or something at the office, but also business-critical applications as well, supporting applications. Like, if you're going to change a theme on your theme store at Shopify, their theme store is a separate application itself, a service.

So multi-tenancy, for us, has been really challenging. We run over 60 clusters now, Kubernetes clusters, to try to solve these problems. And what this means is-- company Hack Days is the best example that I can give for this. So a couple of times a year, we get together for a few days and solve some problems really resourcefully. That's how I'll put that. Not vetting any libraries, very scrappy work, to use the term. And these are deployed to production Kubernetes clusters.

Are these workloads trusted? I'd argue that, no, they're not. And so the argument for Kubernetes is it's bin packing. Right? We want to keep the servers at max capacity. But these workloads are generally untrusted. So what I'm really excited for are the isolation primitives that are coming out-- Kubernetes security profiles, gVisor, Kata Containers-- that will enable us to collapse our many clusters into hopefully that one mega Borg. Right? That is the pipe dream.

We can't do that today. So multi-tenancy has been really challenging. So we've just been using things like network policies, AppArmor, seccomp, capability drops. Pod security policies has been challenging for us to put on, but we're trying to flip all the security bits. Trying to make some sense of all of this is sort of what that means for us right now.

ADAM GLICK: Google recently announced binary authorization, which I understand you are involved with. What was your part in that?

JON PULSIFER: Sandra, the product manager for binary authorization, described us as a design partner. And that was really-- I hadn't heard that term before. I'm really proud of that and the work that we've done. We've been working side by side with the binary authorization team for the last year and a half, almost two years, in trying to build up this capability. So binary authorization allows you to digitally sign attestations of a given container. Now, what that really means is how about a container going through the build pipeline?

Build time is the most critical, arguably, point in a container's lifecycle, where we can make a lot of decisions about what to do with that container, what sort of signatures to generate. So if we could generate a signature that says, this container has been built by us, it's been through our CI/CD pipelines, that increases its trust. If we can sign off that also it does not run as root, that's another big win for us. Right? So a number of these signatures-- and the idea of binary authorization is sort of decoupling all these to build your own policies, whatever policy fits the cluster or your organization. And it's these policies that enable or restrict the deployment of a given workload inside Kubernetes. And this is what we're excited for this year. Finally happy it hit beta.
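The idea of combining independent signatures into a per-cluster policy can be illustrated with a small sketch. This is not the Binary Authorization or Kritis API: the `Image` class, the `is_deployable` helper, the attestation names, and the digest are all hypothetical, and real attestations are cryptographically signed and verified rather than plain strings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Image:
    """A container image plus the attestations gathered at build time."""
    digest: str
    attestations: frozenset = field(default_factory=frozenset)


def is_deployable(image, required):
    """Admit the image only if every required attestation is present."""
    return set(required) <= image.attestations


# Hypothetical policies: each cluster decides which attestations it requires.
PROD_POLICY = {"built-by-ci", "does-not-run-as-root"}
HACK_DAYS_POLICY = {"built-by-ci"}

if __name__ == "__main__":
    image = Image("sha256:abc123", frozenset({"built-by-ci"}))
    print(is_deployable(image, HACK_DAYS_POLICY))  # True
    print(is_deployable(image, PROD_POLICY))       # False
```

Keeping the attestations separate from the policies is what lets the same image be accepted by one cluster and rejected by another.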

CRAIG BOX: How does binary authorization relate to the Kritis open source project?

JON PULSIFER: That's really cool. So the container analysis API in Google Cloud Platform is a reference to Grafeas. github.com/grafeas/grafeas is the API spec for the container analysis API. But as part of binary authorization, binary authorization is a Google-ism. Right?

But Google really likes to give back. And so because of this, we've actually provided an open-source signer for binary authorization and a reference implementation, which is called Kritis. So Kritis - being the judge - is the open source reference implementation for bin auth. So we want you to be able to do this everywhere. We try to make it as open as possible, but it's still offered as a product on GCP.

CRAIG BOX: There was a really interesting blog post, or a security write up if you will, a few months back about a server side request forgery which was found in Shopify and resulted in a $25,000 payout. I'd love if you'd tell us a little bit of that story.

JON PULSIFER: So that's a great case study for multi-tenancy. I tweeted about it right after we had done the payout and these sorts of things, and it became public. This is why multi-tenancy is important. The SSRF came in-- oh, let me just back up a bit.

If you have a Shopify store, if you're a business and it's becoming successful, but maybe it's taking up too much of your time, perhaps you'd like to sell that business, or exchange them. We built this product called Shopify Exchange, which allows merchants to essentially buy and sell their own stores as a whole.

CRAIG BOX: Very meta.

JON PULSIFER: Right. Exactly. And a big part of this is having a screenshot. We want to see how pretty your store is. So the screenshot service was actually the service that was vulnerable to the server side request forgery. So what happened was one of our security researchers got the screenshot service to take a screenshot of the metadata service that runs alongside every GCE VM.

And there's this one endpoint. For some reason, over the last couple of years as we're working to build this, there were two endpoints, a v1 and a v1beta1 endpoint. And one of those endpoints does not require a specific HTTP header, which means that it was vulnerable to a server side request forgery in this case. So the researcher took a screenshot of our metadata server. And on GKE, because of its bootstrap model, in that were the kubelet credentials. Right? So it's at that moment that they took a screenshot, and actually had to hand jam the entire certificate from the screenshot. But aside from that, that's where that compromise came from.
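One general mitigation for this class of bug is to refuse URLs that point at link-local or internal addresses before handing them to something like a screenshot service. The sketch below is an assumption about how such a check could look, not Shopify's fix. `is_safe_target` is a hypothetical helper, and it only inspects the literal host in the URL, so a real check would also need to resolve DNS names and follow redirects. The address 169.254.169.254 and the name `metadata.google.internal` are the well-known locations of the GCE metadata server, whose v1 endpoint expects the `Metadata-Flavor: Google` request header.

```python
import ipaddress
from urllib.parse import urlparse

# Well-known hostname of the GCE metadata server.
BLOCKED_HOSTS = {"metadata.google.internal"}


def is_safe_target(url):
    """Return False for URLs that obviously point at internal services.

    Hypothetical helper: it checks only the literal host in the URL.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname  # already lower-cased by urlparse
    if host is None or host in BLOCKED_HOSTS:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real check would resolve the name here
        # and test every address it resolves to.
        return True
    return not (
        address.is_link_local or address.is_private or address.is_loopback
    )


if __name__ == "__main__":
    for candidate in (
        "https://example.com/store",
        "http://169.254.169.254/computeMetadata/v1beta1/",
        "http://metadata.google.internal/computeMetadata/v1/",
    ):
        print(candidate, is_safe_target(candidate))
```

A filter like this is only one layer; as the conversation goes on to say, where the service runs and what its credentials can reach matter just as much.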

So we had these kubelet credentials. And so they were admin, essentially, on our clusters. And well, you can read the report, and that's really what happened there.

But this is a really big case study for multi-tenancy. Exchange is running in a completely different cluster than the screenshot service. And so it's a little respecting the-- Google calls it the resource hierarchy for access control in GCP. So it's just not app starting in my GCP, it's just knowing where your security boundaries are and making some decisions around where you think those security boundaries should be. Had we done it differently, had binary authorization been in beta, I don't know that we would have paid 25K. But I'm glad that we did. It was a great case study.

ADAM GLICK: How would you recommend users get started with securing Kubernetes?

JON PULSIFER: That's a great question. It's important to know that your Kubernetes clusters by default are not secure out of the box. You need to flip all the security bits, but that decreases usability. And we have to remember that the major cloud providers want you to pay for these things, so they make it as easy for you to turn on Kubernetes entities as possible. But not all the bits are flipped. And so where I would start reading about Kubernetes security, there's a really good website called kubesec.io which runs through all kinds of security primitives.

But there are blogs. There are talks. The greatest thing around this community, even at Google Next, at Kubernetes Con -- KubeCon, CloudNativeCon, all the talks are recorded, and they're available for free after the fact. So if you can't make it to one of these conferences, that's OK, because all of these resources are available for free on the internet. So it's all about just taking the time to read, to watch, to listen, to learn, to get intimate with the components of Kubernetes and see how they operate, see how they break. That's really how I would say.

CRAIG BOX: We had a discussion on a previous show about whether those bits should be flipped out of the box. What's your opinion on that?

JON PULSIFER: It's my opinion that they should. Right? But I'm a security person, so that's really easy for me to say. Usability, though, is very important. So it's not about flipping all of the bits. It's about taking care of an entire class of attacks. And what I mean by this is if you turn on Istio and the mTLS mesh, that's super complex in order to turn this on.

CRAIG BOX: It's one line of config, is it not?

JON PULSIFER: Something like that. [CHUCKLES] Absolutely not. But given that, the mTLS service mesh is an example. That solves network-level attacks across your entire infrastructure. So if you're going to move to something like Zero Trust, BeyondCorp, that's what you would want to do to move towards that model. But what I'd like to see is your default AppArmor, default seccomp, because these defaults are sane. They are good, but they're not turned on by default. And why not?
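For readers who want to see what opting in to those defaults looked like around the time of this episode, here is a sketch of a pod manifest built in Python. The annotation keys are the alpha and beta annotations Kubernetes used for seccomp and AppArmor in that era; later releases moved both settings into `securityContext` fields. The `hardened_pod` helper, pod name, and image are hypothetical, and the capability drop is added only as an example of the other "bits" mentioned earlier in the conversation.

```python
import json


def hardened_pod(name, image):
    """Build a pod manifest that opts in to the runtime's default profiles.

    Hypothetical helper; the single container is named after the pod.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {
                # Pod-wide: use the container runtime's default seccomp profile.
                "seccomp.security.alpha.kubernetes.io/pod": "runtime/default",
                # Per container: the key ends with the container's name.
                f"container.apparmor.security.beta.kubernetes.io/{name}":
                    "runtime/default",
            },
        },
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "securityContext": {
                        "runAsNonRoot": True,
                        "capabilities": {"drop": ["ALL"]},
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(hardened_pod("web", "example.invalid/web:1.0"), indent=2))
```

Whether a given workload runs cleanly under these defaults still has to be tested, which is the usability concern raised in the exchange that follows.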

CRAIG BOX: I think it was a concern that software wouldn't work out of the box.

JON PULSIFER: That's always the concern. It's always the usability versus security fight. But I know Jessie Frazelle and co have spent many, many, many hours researching this and have built these profiles that work out of the box for all the workloads. We've done it at Shopify. We've had no arbitrary seccomp kill-9s. It's just worked. And that's been really helpful for us in our tenancy story.

CRAIG BOX: You're a big part of the Kubernetes community in Ottawa specifically and Canada in general. Can you tell us what the scene's like up there?

JON PULSIFER: The scene is really healthy. It's amazing for me because I came from the government. And things move a little bit slowly there. But what I'm seeing is that with this reinvigoration of containers and Cloud Native on the back of Kubernetes, there's a really healthy community of members in Ottawa who just want to learn how containers work and how they schedule. And at Shopify, we're described as an edgy company. We're leading edge in technology. We're a technology company.

And so these individuals will come into our offices. And they look up to us in many respects to see how is it being done today. And so taking that concept, I started to want to give back a little bit more. Shopify asked me to start teaching internally, start giving talks internally. I said, yeah, OK, there's value in that, sure. But how can we do it better?

And so I'm happy to say that we started a GCP meetup to talk about cloud stuff, and that turned into a GDG cloud chapter, which is amazing, but also the CNCF and Kubernetes meetups that we're hosting. There are a ton of businesses in Ottawa and even government departments who are now using Kubernetes and really find value in this. So there was an opportunity, so I just jumped at it.

A nice example of this is I was in Austin for KubeCon. And I saw some person walk by, and they had on this badge that said City of Ottawa. And I said, no way. Here I am, as a Shopify person, who just so happens to live in Ottawa, but had no idea that our city was actually using Kubernetes or at least invested enough to go down to Austin to KubeCon. And I found that really exciting. And I think that speaks to the velocity of Kubernetes and its massive adoption. And we're seeing it not just in private or public businesses, but also in governments now too, which is exciting.

CRAIG BOX: Tell me about your car.

JON PULSIFER: My car. [LAUGHS] I'm really happy about my car. I don't know what you call car people, but I was one of those in a past life. I used to do some track-side marshaling at Atlantic Motorsport Park in Nova Scotia and got into the smell of race cars, that sort of thing. And I had never had the opportunity to buy a car that I really liked.

So anyway, long story short, I came up with this idea. I'm going to buy this car but -- a vanity plate? Am I that kind of person to buy a vanity plate? I don't know that I am. But I started playing around on the website and seeing what ones were available. And aw, you know, just JON's not available. Go figure.

But what was available? What am I doing right now? Of course, the second thing that I plugged in there was KUBECTL. And it was available. And I said, all right, I got to have it. Right? I'm a fanboy. I love this stuff. I love the cloud.

So yeah, indeed, my pretty little black Volkswagen Golf now has KUBECTL license plates. And it's hilarious. I actually have a lot of people who stop me. Of course, downtown, near work, near Shopify, they see me drive around. And they'll actually go, Kubernetes! Hey, yeah, that's me.

CRAIG BOX: I know what that means. I'm in the club.

JON PULSIFER: Absolutely.

ADAM GLICK: You have a picture of that on your Twitter.

JON PULSIFER: Yeah, I do.

ADAM GLICK: It begs the question, if that was the second thing that you plugged in, what was the first?

JON PULSIFER: CLOUDSEC. CLOUDSEC was the first one. And I was like, I am a CLOUDSEC person. As a production security engineer, that's a broad term for myself. But as we move to cloud, we've had these-- we'll call them project teams. And Cloud Security has been our project team. And so being one of the people in CLOUDSEC, I said, well, maybe I should CLOUDSEC. Yeah, maybe that's my thing. Right?

But thinking about that, I don't really care about CLOUDSEC. And what I mean by this is if we're coupled to any technology at Shopify for a deployment, it is Kubernetes. It's not the cloud. And so I'm a certified Kubernetes administrator as well, so there's a little bit of pride there and like being involved in the community. Even being on the show is something that I take great value in. So I don't know. It just felt good.

CRAIG BOX: And that's why you've been described as Canada's biggest Kubernetes fan.

JON PULSIFER: Yeah, exactly.

ADAM GLICK: Awesome. Jonathan, it has been great to have you on the show. Thanks for coming on.

JON PULSIFER: Hey, thank you so much.

CRAIG BOX: You can find Jon's writings and pictures of his car on Twitter @JonPulsifer, and you can find links to that and all the things we discussed today in our show notes.

[MUSIC PLAYING]

Thanks again for listening. As always, if you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can email kubernetespodcast@google.com, or you can find us on Twitter @KubernetesPod.

ADAM GLICK: You can also check out our website at kubernetespodcast.com. Until next time, take care.

CRAIG BOX: See you later.

[MUSIC PLAYING]