Kubernetes Podcast from Google: Episode 53 - Optiva and Arctiq, with Dan Dyer and Kyle Bassett

#53 May 14, 2019

Optiva and Arctiq, with Dan Dyer and Kyle Bassett

Hosts: Craig Box, Adam Glick

Dan Dyer is Senior Vice President of Technical Product Management at Optiva, a provider of business support services to the telecommunications industry. Optiva have been moving services to Kubernetes, and with the help of Kyle Bassett and team from Arctiq, a cloud-native consultancy, kicking the tyres of Anthos and GKE On-Prem. Adam and Craig learn about this journey from Dan and Kyle, and discuss dragons and foxes.

Chatter of the week

Baby foxes
Aaron Crickenberger interview on the Kubernetes blog
Dragon research

Transcript

ADAM GLICK: Hi, and welcome to the Kubernetes Podcast from Google. I'm Adam Glick.

CRAIG BOX: And I'm Craig Box.

[MUSIC PLAYING]

Spring is well and truly upon us here in the UK. We can tell because we have a family of baby foxes living with us in the backyard.

ADAM GLICK: Adorable. Did you post a picture of that?

CRAIG BOX: I did. I have a video, even. We'll have a link in the show notes to my tweet with the six baby foxes and their mother, who, as far as we can tell, probably were born underground in a den a couple of houses over from us. And they have a little path. They snuggle under one fence, and then they jump across the back of the fence between us and our neighbor and like to cavort around on the stones of our backyard.

ADAM GLICK: Awesome. I saw the posting, and they look adorable. I also wanted to congratulate you on your latest blog post I saw up on the Kubernetes blog.

CRAIG BOX: Yes, every now and then we have an interview with a Kubernetes release manager. This time around, it was with Aaron Crickenberger in episode 46. And we've taken the transcript of that interview and we've put it up on the Kubernetes.io blog. So if you don't follow our transcripts on our website, hopefully you have a chance to read this one and get a little bit of insight into why cat shirts are important and why Kubernetes releases are like "Groundhog Day."

ADAM GLICK: Oh, nice teaser. So any news on the dragon research front?

CRAIG BOX: Well, I get why you're asking here. It's not a "Game of Thrones" reference, although you'd be forgiven for thinking that it was. You'd be in the dark on that situation, I'm told. But it is a letter that was written to the Prime Minister of New Zealand, who was asked to commit the government to researching the possibility of there being dragons with a $5 bribe. To point out, it was from an 11-year-old girl.

ADAM GLICK: I hope they turned down that shameless attempt to manipulate the government.

CRAIG BOX: Yeah, it was one of those things that the internet, first of all, would just think, oh, I'd made up a letter. People claim all the time that their kid did this cute thing, and they just made the letter up on their behalf. But this was actually real. Someone contacted the New Zealand government. And the Office of the Prime Minister confirmed it was real, and the $5 bribe was returned to the 11-year-old.

ADAM GLICK: That was an awesome story. I loved that. I also wanted to say thank you to Sai Teja Penugonda for his review on iTunes. It was very nice of you. Thank you.

CRAIG BOX: If you would like to have a similar shout-out in an upcoming episode of the podcast, please do log onto iTunes and leave us a review. It helps us understand what we're doing well and what you'd like to hear more of.

ADAM GLICK: Speaking of which, if you want to chat with us in person, you can run into us at KubeCon in Barcelona.

CRAIG BOX: Yes.

ADAM GLICK: We will both be there, and there will be a meetup.

CRAIG BOX: We've arranged to have this meetup in the Google Cloud Lounge at 6:00 PM on Tuesday. So there is a Booth Crawl. It is just after the evening keynote on the Tuesday. And the rest of the evening is dedicated to wandering around looking at the lovely sponsors and vendors and all the things they have to offer. But you should instead spend the first half-hour coming down and enjoying a libation with us. If you bring your own shot glass, we'll tell you more when you get there.

ADAM GLICK: Excellent. Let's get to the news.

CRAIG BOX: Red Hat hosted their annual summit in Boston last week. Over three days of conference, they announced incremental updates to their major products, Enterprise Linux and OpenShift. Red Hat Enterprise Linux 8 includes the Universal Base Image, which is a set of publicly-available container-based images built from RHEL parts and supported when used on RHEL as a host OS by the customers.

In a week where the base container distro is under scrutiny due to a root elevation CVE in Alpine Linux, you might want to look at the direction of base images like Red Hat's UBI, or a distroless image, like that from Google Cloud. More on that later.

ADAM GLICK: Next up, OpenShift V4, which is actually the second major version of the platform based on Kubernetes, which was adopted into the platform in V3. 15 months after the acquisition of CoreOS, its fingerprints can now be seen in the new version, which now is container-based. It's all operators, all the time, with 40 included in the catalog and 22 deemed supported by Red Hat themselves. Other features in OpenShift 4 include the OpenShift service mesh, a.k.a. Istio, and Knative, a.k.a. Knative.

CRAIG BOX: I'm sure they'll rename that when it leaves developer preview.

Microsoft CEO Satya Nadella popped up in the Red Hat keynote, which is not a sentence you could have predicted uttering 10 years ago. He was there in support of Azure Red Hat OpenShift-- not just a series of brands all in a row, but a managed version of OpenShift on Azure, jointly supported by the two companies, paid for on your Microsoft bill, and now generally available.

In coincidental good news, the US Department of Justice has approved IBM's $34 billion acquisition of Red Hat. IBM continues to work with other competition authorities, but expects the deal to close in the second half of the year.

ADAM GLICK: Speaking of acquisitions, F5 Networks closed its acquisition of NGINX.

After DockerCon last week, Docker announced that Steve Singh would be stepping down as Docker's CEO. Taking over the top job will be Rob Bearden, who was previously the CEO of Hortonworks. Rob brings the experience of managing a large open-source company to Docker, as well as the experience of having that company acquired. No word on a strategy shift with the change, but Steve Singh will stay on as Docker's chairman of the board.

CRAIG BOX: Back to that Alpine Linux vulnerability. Turns out, they didn't set a root password. That's not as bad as it sounds. Because unless you had installed the shadow or Linux PAM packages, you couldn't log onto the machine with no password, and that was actually more secure. Update your base images or have a look at Google's distroless project if you want to cut out the distro entirely.

ADAM GLICK: GitHub released a package registry acting as NPM, NuGet, RubyGems, and Docker Hub all in one. GitHub Package Registry lets you host packages next to your source with the same credentials and permissions. Similar to source repositories, package repositories are free for public hosting of open-source software and will be paid for private use. They're being introduced in limited beta.

CRAIG BOX: Also from Microsoft this week, the open-source VS Code extension for Kubernetes has reached 1.0. The extension adds support for building and troubleshooting applications in Kubernetes clusters. New in version 1 is an API for adding new features to the extension.

ADAM GLICK: Microsoft announced the successor to Windows Subsystem for Linux, or WSL, and it is-- wait for it-- WSL 2. Version 2 comes with better file system performance and full system call compatibility, which means you can run Docker natively. If you've ever tried to install and run Docker in Windows, you will surely appreciate this improvement.

As part of the change, Microsoft will be shipping a full Linux kernel with Windows, which is another thing you wouldn't have thought you could say 10 years ago. Patching will come through Windows Update, and the kernel will be fully open-sourced and available on GitHub. It will be based on the latest upstream LTS kernel, currently 4.19. Microsoft's Windows Insider Program will have builds of WSL 2 available by the end of June for users who want to try the beta.

CRAIG BOX: I wonder if the next version will be WSL 6 for no apparent reason.

Gravitational stoked the public-versus-private cloud debate this week with their post suggesting people up sticks from Amazon and start reopening the doors to the co-location facilities. The article itself pointed out some of the ways you could save money, but ends with what must certainly be the statement of the year, "the costs can be much lower or much higher." Hacker News offers its regular mix of insight and opinions.

ADAM GLICK: Did you enjoy episode 38, where we discussed Kubernetes failure stories with Henning Jacobs? Then you'll love the new website that catalogs Henning's ongoing list. The website is k8s.af, and it looks k8s AF.

CRAIG BOX: I am not down with the language of the kids and have no idea what you just said.

Google Cloud has announced that GKE, as well as their key cloud products, are available in a second Japanese region, asia-northeast2. The new region in Osaka is the seventh region in Asia-Pacific and the 20th in the world, bringing the global number of zones to 61.

ADAM GLICK: KubeCon North America is happening in San Diego, California from November 18th to the 21st, and the CNCF has just put out their call for proposals for the event. Do you fancy talking at KubeCon, and have something great to share with the community? Then here's your chance. Proposals are due by July 12th, and the CNCF has plenty of resources available for folks who are first-time submitters or speakers to help you be successful. If you want to give a talk, it's a good time to start warming up your writing skills.

CRAIG BOX: Payment platform Stripe has described their Kubernetes-based machine learning platform, named Railyard, in an engineering blog post. As the platform itself isn't open source, the more interesting part was how they adapted Kubernetes for running it and the lessons from scaling a machine learning workload over the past 12 months.

ADAM GLICK: Loodse has announced a new open-source lifecycle management tool for highly-available Kubernetes clusters, called KubeOne. The new tool that I'm sure will soon have the shorthand of "a number 1 to the third power" is designed to allow for the installation, configuration, upgrading, and maintenance of Kubernetes clusters, out of the box, both in-cloud and on-prem.

CRAIG BOX: Do you get it, cube 1?

Over at the Kubedex, Steven Acreman has noticed the growing popularity of container-focused operating systems, and has done his trademark write-up and comparison of the more popular ones. He suggests that while you may want to consider RancherOS if you are using a Raspberry Pi, or k3OS, or "key-oss," if you are doing IoT, Edge, or continuous integration work, the one to watch is Talos, which Steven has his eye on for running on general cluster nodes.

ADAM GLICK: Kontena has announced Akrobateo, from the Greek word for "walking on tiptoes," as a general-purpose load balancer for Kubernetes. In implementation, Akrobateo is a Kubernetes operator to expose in-cluster load balancer services as node host ports using daemon sets. This is meant to be a general-purpose load balancer for when you don't want to rely on external pieces, like specific hardware or configuration of MetalLB. They've posted the project to GitHub and are looking for feedback.

CRAIG BOX: Finally, if you're interested in optimizing your etcd performance for high-scale applications, then this week's CNCF blog post from Xingyu Chen is for you. He talks about the work they have done at Alibaba to optimize etcd, which makes querying 100 gigabytes of data as fast as querying 1 gigabyte. The changes have been checked into both bbolt, the database underneath etcd, and etcd itself, and the computer science behind it is described in the post.

ADAM GLICK: And that's the news.

[MUSIC PLAYING]

Dan Dyer is the senior vice president of technical product management with Optiva, and Kyle Bassett is a partner with Arctiq, a DevOps company focused on emerging technologies. Welcome to the show.

CRAIG BOX: Thanks.

DAN DYER: Good to be here.

KYLE BASSETT: Yeah, thanks. One of my favorite podcasts.

ADAM GLICK: Thank you.

CRAIG BOX: Perhaps you'd both like to start by discussing what Optiva and Arctiq do.

DAN DYER: So Optiva is a provider of business support solutions for telecom providers. So we do the billing and charging systems. We have two main products. One is the Optiva Charging Engine, which is a tier 1 charging application for large telecom providers. And then we have a secondary product, which is more of a broad solution, that does a lot more than just the charging. And that's more geared towards smaller businesses that want an all-in-one solution.

KYLE BASSETT: Yeah, and I work with Arctiq, one of the partners and founders of the company. We're a services-led company, and we focus on automation, containers, microservices, all things Kubernetes, and help customers move to cloud, and also help move applications into microservices.

CRAIG BOX: The workloads that your customers run, Dan, do they run in their own data center environments? Do they run behind a firewall, in the cloud? What's the world we're talking about before the modernization?

DAN DYER: So a typical legacy BSS application is running on-premise in the telecom provider's data center. It's typically running on bare-metal, although sometimes it's virtualized. It's behind a variety of firewalls and things, but it's connected directly to their network, which is then connected to mobile providers or fixed-line providers, things like that.

CRAIG BOX: And so this is the infrastructure that basically decides who pays and how much.

DAN DYER: It's not only who pays but also whether they're authorized to pay. So there's a sensitivity, in terms of real-time, for some of these business models. There's a variety of business models in the telecom business. You can have prepaid, or you can have postpaid, subscription-based, things like that. But in a lot of cases, the network is actually calling into our application to say, OK, can this person actually make this call, do they have the balance to do it, are they authorized, that kind of stuff. So it's more than just the billing. And so there's a lot of time sensitivity to it.

CRAIG BOX: And so that, then, needs to be available as people make calls.

DAN DYER: Yeah, correct. I mean, we got things like 100-millisecond latency requirements. So there's a high sensitivity in the real-time charging.
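
The real-time check Dan describes (is the caller authorized, and is there enough balance, all inside roughly 100 milliseconds) can be sketched as follows. This is a minimal illustration, not Optiva's implementation: `Subscriber`, `authorize_call`, `timed_authorize`, the per-minute rate, and the `LATENCY_BUDGET_MS` constant are all hypothetical names chosen for the example.

```python
import time
from dataclasses import dataclass

# Assumption: a 100 ms budget, taken from the figure quoted in the interview.
LATENCY_BUDGET_MS = 100


@dataclass
class Subscriber:
    """Hypothetical prepaid subscriber record."""
    subscriber_id: str
    balance_cents: int
    active: bool


def authorize_call(sub: Subscriber, rate_cents_per_min: int, minutes: int) -> bool:
    """Return True if the subscriber is active and can afford the reserved minutes."""
    if not sub.active:
        return False
    return sub.balance_cents >= rate_cents_per_min * minutes


def timed_authorize(sub: Subscriber, rate_cents_per_min: int, minutes: int):
    """Run the check and report whether it finished inside the latency budget."""
    start = time.perf_counter()
    allowed = authorize_call(sub, rate_cents_per_min, minutes)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return allowed, elapsed_ms <= LATENCY_BUDGET_MS
```

A real charging engine would reserve the balance atomically and talk to the network over telecom-specific protocols; the sketch only shows the shape of the decision.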

CRAIG BOX: What stack did all that run on beforehand?

DAN DYER: The OCE, which is the one that we are primarily focused on initially, our tier 1 application, was essentially built on a proprietary stack. A lot of the things that are now available off-the-shelf, things like Kubernetes and other platforms, they basically built themselves years ago. So this is a relatively old application. And so it's a Java application with a bunch of, essentially, homegrown proprietary application server logic. You know, they have things like pool management for worker nodes and things like that.

All the things that you would get now out of a modern platform, they built themselves. And so obviously we're trying to get rid of all that proprietary stuff and move to something that's a little more standard in the industry now.

ADAM GLICK: How did you decide on the new stack that you're using?

DAN DYER: We went through quite a bit of analysis in terms of, so first of all, what was compatible with what we needed to actually deliver, what we could actually deliver without making too many radical changes, because there is a lot of legacy knowledge built into this application suite. And we're trying to make this sort of an incremental process for our customers, so we get the benefits of moving to some of these newer technologies while still minimizing the risk for the customer. So those are all sort of components of what we were considering when we decided how to proceed.

CRAIG BOX: Were all your vendors on-board with this kind of change?

DAN DYER: It's an ongoing journey. Let's put it that way. You know, the telecom business is pretty conservative. They see the conceptual value that we're providing. We start talking about level of reliability that we can deliver, TCO savings that they can get, some of the other benefits of moving to the cloud and to things like Kubernetes. On the other hand, this is something that they drive their business from, and so they have low tolerance for risk.

So we're working them through it. I wouldn't say everybody is totally convinced, but they're very interested. And we've gotten a couple of customers to make the leap, and a lot of the other ones are sitting, waiting to see what happens with those.

ADAM GLICK: What does your stack look like now, in terms of technology?

DAN DYER: Like I said, it's primarily a bunch of homegrown Java components. We have our own-- there's a component called CAF, which is essentially an application framework that has a lot of overlap with a bit of Kubernetes. It's got its homegrown application server that we run with the Oracle database. We have a reporting stack that's built off Hadoop and HBase. And then we do a little bit of internal configuration management via the PostgreSQL database.

The internals of that application are all proprietary. So we have things like proprietary application server, session management, pools of workers. And then there is a variety of components that, in the telecom business, you have specific protocols you have to support to ingest data from the networks. And so we have several of those as well.

CRAIG BOX: Migration is big and expensive. You must have looked at the existing stack and done the research to figure out what it would take to bring that up to the kind of level of reliability that you wanted.

DAN DYER: Reliability is already there to a certain extent. I mean, in some ways, this application is-- certainly the OCE one, the tier 1, is maybe even a little over-engineered. So really I don't think it's a matter of making it more reliable. It's a matter of making it easier to maintain, and easier to run, and cheaper to run. So I think if you look at the level of reliability there is now, the customers are pretty satisfied with that. But it's really hard to figure out what's going on in the application. And because we have so many proprietary components in there-- just as an example, right, so our whole monitoring stack is all a proprietary bunch of applications. And so we're trying to move to something more like Prometheus and the ELK Stack. Then we're not spending our time developing our own tools in something that's pretty commoditized at this point.

ADAM GLICK: How is the migration going? What does it look like? Is it just the new components, or are you actually taking core components and moving them over to these new technologies?

DAN DYER: We're taking a relatively incremental approach. So the first thing we did, because this was already sort of componentized in this CAF framework I was talking about, it actually lent itself pretty well to containerizing some of the components. But there are pieces, because of things like the context management, and the component management, things like that, that are not exactly compatible with existing frameworks and capabilities. Some of that we had to sort of leave as-is. So our first step was to basically break up the pieces that were easiest to break into components, containerize those, run those as pods in Kubernetes, and then take advantage of the orchestration that Kubernetes gave.

Oh, and we also, for the cloud part, moved to some managed services like Spanner for our database, as opposed to Oracle. What we want to do moving forward is get rid of some of the internals, like the transaction and context management. We'd like to move to some managed services in the cases where we're going to be running in the cloud. And so things like the Hadoop and the Oracle get replaced by Spanner and the managed SQL, and BigQuery, things like that.

ADAM GLICK: Are there any of the components that you're looking at and deciding, hey, you want to keep those with the open-source technologies and just run those open-source technologies in the cloud versus moving over to solely cloud-hosted tech?

DAN DYER: We're certainly looking at that. So for example, with the reporting, we expect that we're going to have to provide an on-prem solution with our customers for the foreseeable future. And so the trade-off between using managed services and using something else like, say, Hadoop really comes down to whether we think we can actually get people to move to, say, even a hybrid model sooner for some of these less critical pieces. If we can do that, then we'll jump to the managed services just because it'll help with reliability and lower the level of operational requirements. But if we have to maintain an on-prem solution, then there's always that trade-off where the managed part is going to be problematic for us.

CRAIG BOX: How much of that do you want to be your own core skill set versus someone else doing it for you?

DAN DYER: The place we want to be able to focus on is really around the intelligence around the actual telecom and the billing pieces themselves. So we have some very elaborate policy-based rules engines that we use to handle hierarchical billing and real-time billing. There is some magic sauce in terms of how we handle this latency and these high-performance calculations of the charges that we have. So those are places that we want to continue to invest in. The places where we've currently built our own platform because previously that wasn't available, those are the ones we want to get rid of.

ADAM GLICK: Kyle, where does Arctiq come into this story?

KYLE BASSETT: So to give you a little background on the project that we did, we started it just five weeks ago, actually. So I got in touch with Dan and team, and they had some interest around GKE On-Prem, and porting their application. Obviously, Dan mentioned it runs in GKE today, so they're leveraging a lot of GKE technology. But they wanted to be able to test it out on-prem for some of these customers that are in different regions and lots of different business cases.

So it was a really interesting one. And so we formed a remote team. One of the cool things about this project is everyone's been kind of spread out all over the world, actually. And we've been using collaboration tools, and doing Google meets, and getting through it.

So over the past five weeks, we've stood up a VMware cluster in Berlin. It's a five-node cluster. The app's pretty sizable. There's a lot of pods. We're using almost 90 virtual CPUs and 400 gigs of memory for the application.

So that was kind of week one. Then we worked through architecture. The good part is they were pretty mature. They were leveraging Helm templates. They have a Cloud Build process. They're using GCR. So we stood up a Harbor registry on-prem. We did some syncing so we could do the deployments. And by that time, we got into deploying the app. So I think everyone was pretty pleased with the amount of rework that was required. We didn't really need a lot of rework. I think that's-- kudos to the team and how they've been managing their cloud environment.

And then we jumped into-- we got to beat on this thing. We've got to throw some load at it. So Optiva's got some nice tools to be able to simulate callers, so number of calls per second. So we were able to test them. They've got pod scaling built into their applications. We were able to test that. We also did some integration with config management to be able to do some auto-scaling capabilities through Git pull requests, so being able to add more nodes in the clusters, things like that.

So we really went pretty fast, and we were, in the end, successful getting the app up. I think there's still some work to do on really testing every edge case and putting more load through it. But in four or five weeks, to be able to get that done, it was pretty satisfying to be able to get it done in time for Next and be able to show off what we did.

CRAIG BOX: So you've been one of the earliest partners working with us on GKE On-Prem. How did you get involved with it to start with?

KYLE BASSETT: Yeah, that's an interesting one. Last year, before Next, I ended up having a chat with some of the Googlers. They reached out. I was actually at my cottage in PEI, and got asked if I wanted to talk about a potential new project.

CRAIG BOX: That's Prince Edward Island, for the non-Canadians in the audience. I spent two years living in Canada; I can translate for you.

KYLE BASSETT: Oh, did you? Oh, well, we won't derail the podcast with PEI stories, then. I'll throw my accent on here in a second. So really it was-- we had a lot of experience with deploying OpenShift and things like that over the last years with large customers. So Google was looking for a partner who was pretty nimble and was willing to get into the weeds and bring that customer perspective to the project. And so we got involved really early, got access to the first release, and we were able to work through the deployment and have lots of debates and complain about things. And the engineering team has done a great job. They've ramped up-- I've made several flights out to Seattle to kind of get to know them.

And yeah, it's been really cool. I'm outside Google, but a partner-- we're a premier partner, so it's been cool to be able to work through it and just see it come to market, and then obviously the announcements and the focus has been great. So bigger things are going to happen now that it's out the door and we can actually talk about it and start getting more customers on it.

CRAIG BOX: You mentioned some debate with the engineering team. What are some areas that your feedback, you feel, has helped them with this product?

KYLE BASSETT: I think a lot of the debates we had early on were like, what's the platform that we're going to support, right? So VMware is everywhere. It's probably not the most forward-thinking approach as far as when we talk to customers. Customers, even some of them, they're saying, I just want to get back down to bare-metal. I want to get rid of the virtualized hypervisor altogether. But it was everywhere, so it was safe to kind of go down that road.

And you've seen the announcements that they're going to support Azure and AWS. And my vision of the whole thing was, we should be able to run this anywhere. And it's that common experience, whether it's on-prem or even running in another cloud, or running on bare-metal. Some of the other pieces, I say, is just how it's deployed, how it's provisioned, so the tool kit they use to do that, and just the process, and pushing for everything to be declarative so that we can manage and scale in that way.

I mean, we really beat on this thing-- storage. We've been through the network stack. We've deployed a lot of extra things on top of it. So we've got Vault deployed on it, we're leveraging Consul for applications, and we've put a lot of monitoring tools in place. Dan mentioned Prometheus and Grafana, and leveraging things like partners like Sysdig, security scanners.

So our idea is, it's not really that elegant to go install a cluster. Where we want to be is on the reference architecture and be able to go in and help customers get off to a fast start. And a lot of these tools are new to them, and then they're hard. So if we can do the testing and walk in with a reference architecture, we've solved 80% of the problem, we feel, and then we can do customization. So that's what we're working hard on. And also the whole CI/CD pipeline. I don't see it as an on-prem, really, product. I see it as a hybrid answer, really, is what I think.

ADAM GLICK: You're speaking to a lot of the scale and consistency pieces there. I'm curious what you did around configuration management.

KYLE BASSETT: The Configuration Manager product, I think, is going to have a lot of legs. Everything we do is, we believe in building all of our code, putting it into Git, and driving automation from there. And no one's in mucking around with the servers. If you want to make a change to your application, start in dev, do a build, get it through.

Then the other part that I think we can do-- we've been working on the Config Manager-- is the entire install. I mean, I mentioned it's declarative. We need a file that configures it, but that should be driven from a Git repo, which means now we can do clusters on demand via that. So what we're going to show a little bit out at the dev keynote is, why can't a developer come in, ask for a cluster, and get it in two minutes, and deploy their application? Or why can't we build that into the CI flow that when they check in code, that we spin up a cluster, we run it through a test case? If it passes, maybe the production version is on-prem, but we leverage ephemeral infrastructure in GCP for all the other pieces.

We've messed around with some auto-scaling, so being able to update a pull request, and then send a webhook, and then drive all the node provisioning. And then just consistency across clusters. So I somewhat see the power of config management as solving some of the federation 2.0 problems of, I've got all these clusters everywhere, I need to manage secrets and config across them. If something changes, I need to reset back to state. So you're going to see us put a lot of work into that, because I think that there's a lot of legs there.
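
The "reset back to state" idea Kyle describes, where a Git repo holds the desired configuration and each cluster is driven toward it, can be sketched as a small reconcile step. This is an illustrative sketch only, not how Google's config management product is implemented: `plan_reconcile`, `apply_actions`, and the flat dictionary model of cluster state are assumptions made for the example.

```python
def plan_reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to make `actual` match `desired`.

    Both arguments are flat mappings of resource name to value, a
    deliberately simplified stand-in for real cluster objects.
    """
    actions = []
    for key, value in sorted(desired.items()):
        if key not in actual:
            actions.append(("create", key, value))
        elif actual[key] != value:
            actions.append(("update", key, value))
    for key in sorted(actual):
        if key not in desired:
            actions.append(("delete", key, None))
    return actions


def apply_actions(actual: dict, actions: list) -> dict:
    """Apply planned actions to a copy of the observed state and return it."""
    result = dict(actual)
    for op, key, value in actions:
        if op == "delete":
            result.pop(key, None)
        else:
            result[key] = value
    return result


if __name__ == "__main__":
    # Desired state as it might be checked into Git; the keys are made up.
    desired = {"node-count": 5, "secret/db": "v2"}
    # Observed state after someone changed the cluster by hand.
    actual = {"node-count": 4, "debug-pod": True}
    for action in plan_reconcile(desired, actual):
        print(action)
```

Running the same plan against every cluster is what gives the consistency Kyle is after: a merged pull request changes `desired`, and any drift in `actual` shows up as a non-empty action list.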

CRAIG BOX: And, Dan, how has this resonated with you as a customer?

DAN DYER: I think there's a couple of things that have been really nice about this. So first of all, as I mentioned, we need to be able to move people incrementally, either on-prem, to hybrid, to cloud. That's our goal, right, to get everybody to cloud. And the experience that we've had, and specifically the POC that we did here, showed, as Kyle talked about, how little it took us to actually run this on two different platforms.

So we took, essentially, that code that we'd already built, and we were able to run that with relatively minor changes in the on-prem kind of scenario. And the fact that that's connected back to the Google mothership, and it's built around this idea that you can start to distribute those workloads in a hybrid manner really lines up really well with what we want to do. Because like I said, we want to start to move pieces of this into the cloud for those who aren't willing to move right away.

I think the other piece that really has resonated with us and resonates with our customers specifically is around this TCO. So the demo we did around the auto-scaling and the fact that we could show that we could do that relatively easily, that's something that's really appealing to our customers. Because they can see that-- this is a very cyclical kind of workload, both on a daily basis and over the year. And so if they can adapt their infrastructure to do that, then this is one of the reasons why they'll be willing to move to the cloud and to move to something like GKE.
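
The TCO argument for a cyclical workload comes down to sizing capacity to the current load rather than to the peak. A rough sketch of that calculation follows, under stated assumptions: the function name `replicas_for_load`, the per-pod capacity of 100 calls per second, and the replica bounds are all invented for illustration and do not come from Optiva's demo.

```python
import math


def replicas_for_load(calls_per_second: float, calls_per_pod: float,
                      min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Return how many pods are needed for the load, clamped to the allowed range."""
    needed = math.ceil(calls_per_second / calls_per_pod)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Hypothetical daily cycle: quiet at night, busy at midday and in the evening.
    for hour, cps in [(3, 50), (12, 950), (20, 1800)]:
        print(hour, replicas_for_load(cps, calls_per_pod=100))
```

With a fixed on-prem footprint the customer pays for the evening peak all day; with scaling they pay for the quiet hours at the minimum replica count.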

ADAM GLICK: How similar are the environments? We've seen things as hosted products and then other things that are on-premises. Is it really the same experience in both?

KYLE BASSETT: Obviously, storage is different. So we had to solve-- the default storage provider today is leveraging VMware. It ends up carving a VMDK, but it's dynamic. So at the same time, if the developer is asking for a PV in their YAML config, they're going to get one.

So storage-- we're going to look for other solutions. We've been working with some other partners, and also the CNCF community, things like Rook. But also being able to have cloud-native storage and scale out across larger clusters is a big one.
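To make Kyle's point about "asking for a PV" concrete: a developer usually declares a PersistentVolumeClaim, and the cluster's default storage class provisions the backing volume (a VMDK in the on-prem setup Kyle describes). The following is only a minimal sketch. The claim name `billing-data`, the 10Gi size, and the `build_pvc` helper are hypothetical and not from the interview. It emits JSON, which `kubectl apply -f` accepts in the same way as YAML.

```python
import json
from typing import Optional


def build_pvc(name: str, size: str, storage_class: Optional[str] = None) -> dict:
    """Return a PersistentVolumeClaim manifest as a plain dict.

    Leaving storage_class unset omits storageClassName, so the cluster's
    default StorageClass (if one is configured) handles provisioning.
    """
    spec = {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical claim: the name and size are illustrative only.
manifest = build_pvc("billing-data", "10Gi")
print(json.dumps(manifest, indent=2))
```

Because the claim names no storage backend, the same manifest could in principle be applied to a cloud cluster or an on-prem one, which is the portability being discussed here.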

CRAIG BOX: Is that not the promise, though, that the platform abstracts away the underlying thing? Is the challenge that VMDKs aren't resilient enough, they're not replicated?

KYLE BASSETT: It's that, and also, I think, just under extreme load, we're just seeing-- I'm worried about scale a little bit, so just distribution. I mean, it doesn't mean you couldn't architect it. But when I say storage-- when we think of storage in the cloud, it just runs, right? It's always fast. So it's just something you have to think about when you're doing an on-prem deployment. You know, ingress is a little bit different right now. We deploy F5 virtual appliances for most of our projects, or we leverage a hardware appliance. But the full integration is there. So when you carve out, you want an endpoint, it dynamically builds it in the F5. So you'll see more load balancer support coming.

But other than that, I mean, it's a deployment target. I keep saying it just looks like another region. We didn't mention-- you can see it in the Google Cloud console. You can dive into the pods so you're not having to bang around and do kubectl commands. You can really see what's going on in the console. So it's that common experience. And I think the goal is for it to be just another region. So you do have to worry about some of the infrastructure pieces still, but it's a one-time setup for the most part.

DAN DYER: Yeah, I think, from an application perspective, most of that, once you get the deployment going-- I mean, obviously it's tricky to get these deployments all set up correctly. But once you get it working, it doesn't look significantly different. I think that, obviously, the big promise in the cloud is this infinite resource pool that you always have. It's going to take a little more management in the customer environment. And there's the fact that we're not going to have some of these managed services we might like to. So we're still stuck with things like running Oracle on-prem and stuff.

But overall, I think the experience is going to be significantly better than at least what's there. We actually hope it's a little bit better in the cloud than it is on-prem. Because that's where we want to move to. And so if there is some delta there, I don't think that's particularly a terrible thing as long as there's a nice path forward to get to it.

CRAIG BOX: Are you comfortable using one environment as a proxy for the other? You mentioned earlier that you want to, on each commit, cause an environment to spin up. Obviously you have more dynamic availability in the cloud. If your deployment target is on-prem, would you still want to test in the cloud?

DAN DYER: We're working through that, right? I mean, I think that's going to take some more work on pounding on this and trying to find the right correlation. As I mentioned, this has a high sensitivity to performance and latency on this application. So I don't think, right now, we would be comfortable just doing all our testing in the cloud. But we can do the majority of the testing in the cloud.

So our CI/CD system, most of our functional testing, our basic acceptance testing and everything, we can run on the cloud. And then we do some final performance testing in the customer environment or in the hardware in our data center. And that's going to be significantly better than what we do right now, which is everything in our own data center is on-prem. Having to manage all that crap, it's really hard to deal with, right?

ADAM GLICK: GKE On-Prem is one of the key features of the new Anthos environment. What other features are you looking forward to adopting?

KYLE BASSETT: Stuff I'm looking at is-- I guess I do have a big vision of using both cloud and on-prem, and leveraging cloud for ephemeral and testing. As Dan mentioned, sooner or later, you've got to stage this stuff and test it where it's going to run, make sure it's going to run. But there's a lot of-- when you're trying to automate your entire test flow, and getting developers to drink the Kool-Aid of unit testing, it's nice to just have that instant feedback for them.

Developers want to write code, and security and ops are worried about stability and change. So we really need to solve that through some type of policy manager, which I think Config Manager can solve.

I think there's other use cases. Things like Cloud Build-- I'd like to see us be able to leverage Cloud Build but maybe spin up build workers inside an environment. Because a lot of our customers are financial services. So they are sensitive to putting their container images and source code in cloud in some cases. So there's no reason why we couldn't have the orchestration in cloud but then build some kind of worker, and then do the build, and then push to a local registry-- so leveraging things like that.

Spinnaker, the whole CI/CD flow, managing all these clusters. So I think there's a lot of legs with all of it. We need to, obviously, make sure we make customers comfortable with the security model and make sure we have the right connections up to cloud if we're going to truly use both hybrid services. But I think that's where people are going to be able to move past. Because we shouldn't be repeating these services and rebuilding them all on-prem. And I think you're going to see Google productize more pieces of cloud and allow it to run on-prem. And it's just a container running, so it's not that far of a stretch, I don't think. So I think that's where you'll see it grow up.

CRAIG BOX: And, Dan, you're the customer in this situation. How do you feel about the security of cloud?

DAN DYER: Well, I mean, from our perspective, we don't really want to deal with infrastructure, right? We'd like to be an application developer that focuses on the business logic that we have. And we're convinced that Kubernetes is going to become the infrastructure of choice moving forward. And so at some point, what we would like to be able to just do is say, OK, Kubernetes is going to be running in a customer's environment, so for the on-prem-- which there are a lot of sensitivities that are going to drive that, right? So there is obviously the data security. Although, I would say, in general, that a lot of our customers are maybe not adapting to the change as fast as they should, and they are probably in more danger of security problems than they would be if they just moved to some of these new technologies. Because things tend to stagnate after a while.

But the long term, really, it's going to come down to what kind of legal limitations they have in terms of where they have to process things, what kind of rules they have to follow in terms of local countries, because we have deployments all over the world, and then where is that data located? But from the perspective of what runs in the cloud and what doesn't, we would like to get to the point where, as long as local laws are being followed and we can essentially deliver that, either on-prem or in the cloud, we don't really want to think about the infrastructure at that point. We just want to be able to say, you've got a stable Kubernetes environment, we can deploy into it, and we run well in that, whether it's on-prem, hybrid, or in the cloud.

CRAIG BOX: What things do you think are missing from today's cloud-native stack? If there are some things that you would say startups should be focused on, or open-source projects need to fill a gap, what are those things to you?

DAN DYER: I mean, I think it's still hard to use them, right? I mean, some of this is just the level of maturity. I mean, as a technologist, it's really easy for you to forget that. I mean, I'm always focused on the new stuff. So it's easy to forget that a lot of people aren't always that far ahead on things, and so making it easier for people to consume that, to understand these new models, to leverage that. So the basic tools are there-- being able to monitor things, being able to deploy things. But giving people the level of comfort that things are working properly and that they can deploy them without a whole lot of inside knowledge about all the different components of Kubernetes, and how they work, and what kind of messages go across the API, and things like that, that's a place where I think you'll get more adoption in a traditional IT environment, when you can deliver those.

KYLE BASSETT: Yeah, I'm not willing to tell you about my next startup, so--

[ALL CHUCKLING]

I'm just kidding. But I agree with Dan. Also, just wrangling YAML is too hard for people now. It's just-- they're getting frustrated. So I think there's some good stuff going on in the CNCF community, but tooling to allow people to move monoliths into containers, and just understand and ease the pain of that, where we're spending way too much time in editors and on trial and error. So I think a workflow or a toolkit that kind of eases that pain, it speaks to the complication and the knowledge side. I mean, the part I worry a little bit about is, as everything gets abstracted and we make these tools to make it easier-- we can kubectl apply, and now we have these applications-- it doesn't mean they're going to run properly.

And we still need to understand them so that we can support these things. So it's sometimes a little too easy to deploy an app in Kubernetes, and then it gets its way to production. And we still need to think about architecture. We still need to understand all these pieces. Because when apps fall down, who's going to fix them? You just can't keep running the same command.

So as we abstract more things, I worry that the new crew that's coming in to learn this technology doesn't get in the weeds and suffer those late-night digging in to logs, because it's all abstracted. But I'm optimistic. That's where we spend all our time, is trying to understand the system and focus on good architecture.

DAN DYER: Yeah, and that's that visibility I was talking about, right? Because, I mean, I think, as we decomposed our application, we changed where our bottlenecks were, we fixed some problems, but we introduced new ones. And so being able to see the impact of your changes and what you did to it, what was good, what you hosed up, those kinds of things-- that, like Kyle said, is critical to the success of that. And so the better tools you have around that, the better visibility you have to what your application is doing, the decisions you made about how you broke it up and how it's running, those are all, I think, core pieces of making this more consumable.

KYLE BASSETT: Monitoring should've been solved a long time ago, and we still spend way too much time doing it. Lay down a graph when you finish your build. And then you have a baseline of where you started. And then when you throw a load at it, you can dig into it. So it's often overlooked-- just simple time graphs of performance and stats. So there's some great tools out there, and they don't take that long to set up. So just spend some time doing it. It'll save you a bad night when your app's in production and not working so good.
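The baseline idea Kyle describes can be sketched in a few lines: capture latency samples right after a build, capture them again under load, and flag any statistic that has drifted too far. This is an illustrative sketch only. The sample values, the 1.5x tolerance, and the `summarize` and `regressions` helpers are assumptions, not anything from the interview. In practice a monitoring tool would collect and graph these numbers.

```python
import math
import statistics


def summarize(samples_ms):
    """Reduce latency samples (milliseconds) to a median and a nearest-rank p95."""
    ordered = sorted(samples_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"median": statistics.median(ordered), "p95": ordered[p95_index]}


def regressions(baseline, current, tolerance=1.5):
    """List the statistics where current exceeds baseline by more than tolerance x."""
    return [key for key in baseline if current[key] > baseline[key] * tolerance]


# Hypothetical samples: one set taken right after the build, one under load.
baseline = summarize([10, 12, 11, 13, 12, 11, 10, 14, 12, 11])
under_load = summarize([15, 18, 40, 22, 19, 17, 55, 21, 20, 16])
flagged = regressions(baseline, under_load)
print(baseline, under_load, flagged)
```

Keeping the baseline from the moment the build finishes is what makes the later comparison meaningful. Without it, there is nothing to dig into when the load arrives.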

ADAM GLICK: On that note, both of you are kind of on the leading edge in terms of these technologies. And I was wondering if you had any advice that you would give a dev team or an ops team that's starting to walk down the path that you've been headed down to help them accelerate that process, possibly bypass some of the pitfalls that you may have found along the way.

KYLE BASSETT: Yeah, I mean, advice for me is always dig in and get involved in the community side. The great part about today is, if you're building a new application today, you're probably going to turn to open source for sure, unless you're picking something-- you're buying a solution, something like Optiva has to solve a very unique problem. And most likely they're leveraging open-source tools underneath it.

So learn those. I mean, that's where the future is going. Look at how cloud providers are building what they've got. I mean, the DevOps stuff is often talked about. But it still comes down to people. Collaborate, go have a coffee with someone, have some empathy about their job-- it's probably not that easy.

There's still a lot of divide between development and people that are operating platforms. But one thing I do see is we're hitting a world where the people that are actually building are, in many cases, having to run these platforms. So the divide is kind of falling over.

CRAIG BOX: It definitely gives you a lot more empathy as to how you should operate a platform if you're the person responsible for building it as well.

KYLE BASSETT: Yeah, and I think there's a lot of responsibility there, that developers can't just write code and throw it over the fence. You've got to write unit testing. We need to be waking them up at night if their app doesn't work. But also, testing is often overlooked. A lot of companies aren't testing stuff enough, and they're finding out about problems when it's way too late. So this is pretty fundamental stuff, back to when they used to build mainframes, too. So we can't lose sight of that.

DAN DYER: Yeah, I think those are all things that I would emphasize as well, right? So if you're going to use open source, you've got to commit to the community. You can't be a consumer of it. You've got to participate. I've run into this even in my own company, because some of these people are new to the open source. I've been doing it for quite a while. You're not going to get value by just consuming it and then trying to find somebody to write a contract to say, I'll support it. You need to be part of that.

I think the automation is critical. You've got to start that automation from the beginning. It's got to be built into your process. Especially you look at-- like, for example, for us, I mean, we've got close to 100 deployments, and every one of those is slightly different. We would like to get to the point where they're not different, but that's not going to happen anytime soon. And this is pretty typical of most enterprise apps.

So building the right automation pipeline to handle that upfront so you can do all your standard stuff in an automated fashion, and then lay down all your customizations on top of that, and automate that, that's core to deliverables. And then I think whether you do a DevOps model or you're more traditionally partitioned, getting the ops guys engaged in this early so that you have the right operational input into there so your functionality and your features are all balanced with how to actually make the thing work, those are all core in my mind.

KYLE BASSETT: And security. Security-- please come to the first meeting. It's important.

[DAN CHUCKLING]

DAN DYER: Yeah, you should have security. That would be good.

CRAIG BOX: Dan, Kyle, thank you very much for joining us today.

DAN DYER: Sure.

KYLE BASSETT: Thanks for having me.

CRAIG BOX: You can learn more about Optiva at optiva.com. And you can find Arctiq at arctiq.ca, where they're blogging extensively about Anthos and GKE On-Prem.

[MUSIC PLAYING]

ADAM GLICK: Thank you for listening. As always, if you've enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @KubernetesPod, or reach us by email at KubernetesPodcast@google.com.

CRAIG BOX: You can also check out our website at kubernetespodcast.com. Don't forget to come and see us at KubeCon next week. We have stickers aplenty. Until then, take care.

ADAM GLICK: Catch you next week.

[MUSIC PLAYING]
