The Pros of On-Prem Kubernetes with Justin Garrison
Justin Garrison: How does this actually work when you're doing this drastically? And in most cases it says like, no, actually you just want to pin. You want to pin your workload to a type of node and always make sure you have the same amount of cores.
Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn, and I'm joined today by Justin Garrison, who these days is the director of DevRel over at Sidero. Justin, thank you for joining me.
Justin Garrison: Thanks for having me, Corey.
Corey Quinn: This episode's been sponsored by our friends at Panoptica, part of Cisco. This is one of those real rarities where it's a security product that you can get started with for free, but also scale to enterprise grade.
Take a look. In fact, if you sign up for an enterprise account, they'll even throw you one of the limited, heavily discounted AWS Skill Builder licenses they got, because believe it or not, unlike so many companies out there, they do understand AWS. To learn more, please visit panoptica.app/lastweekinaws.
That's panoptica.app/lastweekinaws. I have to say this. One of the things I adore about having you on this show is that there's a standard intake questionnaire form I send people when I invite them onto the show. And one of the fields that's there is, does this need any PR review? Think legal counsel, corporate comms, PR folks, etc.
And your answer to it was, "HAHAHA, no," in all caps, which is just the best answer I think I've ever gotten to that, and it basically perfectly embodies my philosophy on these things. So, thank you for making me smile.
Justin Garrison: Yeah, you know, I do what I can. I've spent my career, uh, mostly at very old or very large companies.
And every company I've worked for before Sidero was either a hundred years old or over a hundred thousand people. And this is the first time that I'm not in one of those situations, and it feels really good, actually.
Corey Quinn: You finally get to say what you want to say, how you want to say it, to whom you wish to say it, and that is no small thing.
Justin Garrison: I mean, having an opinion that doesn't align directly with the company is a good thing sometimes.
Corey Quinn: You were recently at AWS, and before that you were at Disney. To say that they're large companies dramatically understates the situation by a fair bit. Before I did this, my last job was at BlackRock, and I basically sat there and stewed for almost a year, unable to say anything that looked like an opinion in public, particularly in the way that I like to share opinions.
I can only imagine, given that you've spent many years in those types of environments, what shenanigans you're going to get up to now, given that, to my understanding, Sidero is a little bit smaller than, you know, companies with four commas in their market cap.
Justin Garrison: Yeah, it's been, uh, it's been really different.
Uh, I've been at Sidero now for four months and, uh, already it's just... I was thinking back to what it was like at Amazon at four months: I had just gotten my laptop. It took three months to get a laptop when I started at Amazon. I was like, oh, okay, this is big company business. And I remember at Disney, it was about three and a half months before I had my first, like, ticket assigned to me, my first project.
It was just three months of, like, reading docs and going to meetings and figuring out what people do.
Corey Quinn: I understand the onboarding and reading docs. I spent six weeks on a consulting project at a big company once, the first four of which were spent waiting for them to turn on my AD account, which was okay.
Big companies are going to big company, but without a laptop, how could you even read the docs?
Justin Garrison: Yeah, I couldn't. I actually had people sending PDFs to me in various ways that weren't supposed to be allowed because I had nothing to do. It was actually during OP1 season when I started too. And if you know OP1 season at Amazon, there's a lot of docs to read.
And so, uh, but it was entertaining because I had a personal Chromebook, and nothing that they had supported Chromebooks. And so it was just like, okay, well, I have a phone and a Chromebook and I can't do what I'm supposed to be doing. So I spent some time, you know, playing with new services in my personal AWS account to learn what they did.
Corey Quinn: I do a lot of my computing work, especially on the road, from an iPad, and most things, yeah, don't tend to support that as a primary means of interaction. All my dev work is done on an EC2 box for a number of reasons, but it just strikes me as weird in that at big companies in particular, I would never, as an employee, expect to be able to use my personal devices for things, because it's always going to come with these restrictions of, great, install this MDM on your stuff so we can do the corporate management thing.
And it's like, how about you provide the equipment you want to have me, uh, engage with your corporate systems on, or you leave me alone? Like, oh, when you start rolling out MDM for mobile devices, it's, great, we want to be able to wipe your cell phone. My position on that is, great, you're going to get me a corporate cell phone.
Alternately, I'll just be in touch when I happen to be on the laptop. It's not a great position to be in, but I've heard too many horror stories. It's not just malfeasance you have to worry about, it's accidents. Someone in corp IT accidentally wipes your personal phone while you're on a trip, through no ill intent.
Great, now what? My entire life lives on that thing.
Justin Garrison: I was part of the team at Disney Animation that rolled out MDM for people, and so I definitely know the struggle of, like, uh, well, I don't want to do this. But it's my job, right? Like, I have to do this thing.
Corey Quinn: Even on corporate machines here, we have a very light touch Jamf profile that is only on stuff that we own and control.
And all it does is enforce screensavers, password strength, and encrypting the disk, so I don't have to report a data breach when it gets swiped out of your car trunk. Awesome. Great. This is stuff that anyone who works here can pull up at any point. Hey, pull this up and take a look at it.
I want to make sure you're not doing something underhanded. Absolutely. I'm not. It's very straightforward. My entire philosophy is I will never ask employees to do something that I won't do myself. And not just because I happen to be the one holding the switch, because those things can change pretty quickly.
Justin Garrison: Yeah, and I actually took your lead on joining the new company, where my primary or only mobile device is an iPad now. And partially because they got good enough, right? Like, the software is still in the middle there, but I was like, I need something... I do a lot of drawing. I like animation. I like doing that side of it.
And I like the iPad and the pencil format. And I do a lot of video editing and DaVinci works on the iPad. And so like that combo has worked really well for me of just like, Hey, I want a single purpose device. I do writing and those sorts of things. And then I have a shell that remotes back to my desktop in my home lab.
And I can do the things that I need to do for work. Um, it's been working really well for me.
Corey Quinn: We are at very different ends of a particular spectrum, where my, my version of a user interface that's good talks about, like, the position of command line arguments in something. I use VI for most of my writing, uh, although I do admit for code, increasingly I'm drifting in a direction toward VS Code, because it's gotten pretty decent.
But, yeah, there's a, everything I do is just green screen terminal style stuff.
Justin Garrison: Oh, no, Blink on the iPad is great. Blink is an excellent terminal app on the iPad, and that's where I do all my writing. Like, it's in Vim, it's remote back to my box, uh, with Tailscale and a 5G connection.
Corey Quinn: It sounds like we're using the exact same stack on that. I've used their, uh, VS Code implementation, just by typing code and the path to a directory.
Justin Garrison: I gave up on Code. Uh, like, I was using Code at Amazon. I was like, I want to switch. I want to switch off Vim. I'm finally going to go into the GUI world of IDEs.
And I used it for four years at Amazon. And I'm like, I don't like this anymore. And I just went back, actually went to Neovim with LazyVim as, like, a framework on top. And it brings a lot of those things that I liked in Code automatically, like all the pop-ups and, uh, things that get in the way.
Uh, all those come with it. It's great.
Corey Quinn: One thing that I've done that I think is probably just this side of a war crime is I've gotten GitHub Copilot to work in Vim when prompted. But unlike in a lot of other environments, it doesn't automatically slap the text in. It has to explicitly be asked, which is kind of important.
Justin Garrison: Yeah, it was auto-filling for me, so it's disabled by default. It just gets in the way too much.
Corey Quinn: Yeah, when I want a robot's opinion, I will ask explicitly for it. It's like that old, uh, shitposting meme picture of, a robot will absolutely not speak to me in my holy language. I am a divine being.
You are a pile of bolts. How dare you? Just this over-the-top screaming at a computer thing, which, yeah, that's very much my bag. So what are you doing at Sidero these days? What do you folks do exactly?
Justin Garrison: Uh, at Sidero, we mainly focus on people having problems with on-prem Linux and Kubernetes.
And so it came from actually starting Talos Linux, which was a single-purpose Linux distro. In my days at Disney, it was at Disney Animation, I was doing on-prem Kubernetes, and we were using CoreOS. And it was, like, fantastic: oh, it simplified the RHEL stack to be basically just systemd and a container runtime, enough that I could bash script my way into running a Kubernetes cluster.
And around that time, Andrew, the CTO, had said, oh, he started this Talos Linux thing. And it's like, no, like, PID 1 is just an API. There's no SSH. You don't need users. Everything is an API; it's API-driven Linux. And all it does is run Kubernetes. It does nothing else useful besides run Kubernetes.
And I was like, that's an appliance. That's amazing. And so I remember when he announced it on Hacker News, I emailed him immediately. I was working at Disney. I'm like, I want this to exist in the world. This should be a thing that is available. And throughout my career, I've stayed in touch. I've kind of used it here and there.
And I went over to Amazon and we launched Bottlerocket as a direct competitor to Talos Linux. And then Bottlerocket kind of went off and got reorg'd a couple times and does some other things. No, you're kidding, a reorg at Amazon? It doesn't do the same things that it used to. And it's not really focused on Kubernetes anymore.
It's like this weird container sort of thing. And of course, CoreOS got bought, uh, twice, and it does some other weird things, and that kind of, like, shifted off. And then this Talos Linux thing just kind of stayed in this, like, we do Kubernetes and that's it. And it's everything from bare metal. You can see on my bench behind me, I know it's an audio podcast, but I have, like, a Raspberry Pi running it and a laptop running it.
Like, I'm setting up a lab for a talk later, and I'm like, oh yeah, it runs on very small stuff and then very big stuff. We have an Oxide rack that we're testing it with. And so it's, like, big metal. And then also, like, cloud providers. It just runs everywhere. It's Linux and it just runs Kubernetes.
And that's, like, the really key point: we have this streamlined Linux OS. Because people are like, you have to learn Linux to know Kubernetes. I was like, actually, you can lower the bar. You can make Linux disappear and just make it an API.
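For anyone curious what "Linux as just an API" looks like in practice, here is a minimal sketch of the publicly documented talosctl bootstrap flow, wrapped in a bit of Python only so the steps read in order. The node IP, cluster name, and generated file names are placeholders rather than anything from the episode; treat it as an illustration, not Sidero's official procedure.

```python
"""Minimal sketch: standing up a single-node Talos control plane with the
documented talosctl flow. Node IP and cluster name are made-up placeholders."""
import subprocess

NODE_IP = "192.168.1.50"   # hypothetical machine already booted from a Talos ISO/USB
CLUSTER = "homelab"        # hypothetical cluster name

def run(*args: str) -> None:
    """Run one talosctl command and fail loudly if it errors."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# 1. Generate machine configs (controlplane.yaml, worker.yaml) and a talosconfig.
run("talosctl", "gen", "config", CLUSTER, f"https://{NODE_IP}:6443")

# 2. Push the control-plane config to the bare machine; --insecure is only
#    appropriate for the very first apply, before the node has its own PKI.
run("talosctl", "apply-config", "--insecure",
    "--nodes", NODE_IP, "--file", "controlplane.yaml")

# 3. Bootstrap etcd once, on exactly one control-plane node.
run("talosctl", "bootstrap",
    "--talosconfig", "./talosconfig",
    "--nodes", NODE_IP, "--endpoints", NODE_IP)

# 4. Fetch a kubeconfig -- from here on out it's just the Kubernetes API.
run("talosctl", "kubeconfig",
    "--talosconfig", "./talosconfig",
    "--nodes", NODE_IP, "--endpoints", NODE_IP)
```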
Corey Quinn: You have to dial it in super well to get there, but I believe it is possible.
It's a, that is kind of the dream where you don't have to think about these things anymore. I mean, I wrote a blog post a couple of years back on the idea that Nobody cares about the operating system anymore. Which, of course, was dunking on Red Hat at the time for deciding they're going to basically backtrack on the CentOS long term support commitment.
Surprise! You need to pay us now. Everyone loves hearing that. And it was a bad move from my perspective because most companies don't care that much about the operating system the way that they once had to. It's, some people need to care very much, but it's not a necessary prerequisite to
Justin Garrison: build things now.
Especially when you have something like a Kubernetes clustering system on top, you want the OS to disappear. It should not be the hindrance. It should get out of your way. It should let you debug some things. It should have some access to do like, Hey, what's going on? But beyond that, it should just disappear.
And that's exactly what Talos was meant to do. And even when I started, I started digging into it more, because I hadn't touched it for, you know, a little while at Amazon. And I was like, hey, how many binaries do we have on the system? And it was like, oh, I can just list out this directory.
I was like, oh, there's 12, there's 12 total binaries on the system. I was like, what? Like, how is this possible? What are you doing here? And I started counting other Linux distros, and it's like, you know, RHEL has like 7,000, Ubuntu 6,000. Even the smallest Bottlerocket is like, you know, 1,300, 1,400 or something like that.
Corey Quinn: I mean, technically you could get a lot of it down to near one with BusyBox. It just changes its behavior based upon what symlink invokes it.
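Corey's BusyBox aside is easy to demonstrate: a single "multi-call" binary looks at the name it was invoked under and dispatches accordingly. The toy below is a hypothetical Python stand-in, not BusyBox itself, and the applet names are made up for illustration.

```python
"""Toy illustration of the BusyBox-style multi-call binary trick:
one program, many symlinks, behavior chosen from argv[0]."""
import os
import sys

def applet_echo(args):      # stand-in for a real 'echo'
    print(" ".join(args))

def applet_true(args):      # stand-in for a real 'true'
    sys.exit(0)

APPLETS = {"echo": applet_echo, "true": applet_true}

def main():
    # BusyBox inspects the name it was invoked as (e.g. /bin/echo -> busybox).
    name = os.path.basename(sys.argv[0])
    handler = APPLETS.get(name)
    if handler is None:
        sys.exit(f"{name}: applet not found")
    handler(sys.argv[1:])

if __name__ == "__main__":
    main()
```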
Justin Garrison: And there are some symlinks in there; like, if you count all the linked binaries, there's like 30, but most of them are LVM, right?
It's like, half of the binaries are for disk management.
Corey Quinn: The thing that drives me nuts, and I've wanted this for years, even the stripped down minimalist distros aren't stripped down and minimalist enough because their argument is, well, if you get rid of some of these things, it'll be really uncomfortable to work in the environment.
It's production. It's not supposed to be comfortable. If you're copying your dotfiles to production to customize your shell, you are doing it wrong in almost every case. Stop that. I mean, I got into Kubernetes finally after basically losing a bet with the internet earlier this year, and gave a talk, Terrible Ideas in Kubernetes.
To do that, I now have a 10-node, uh, Kubernetes cluster of my own running in the next room on a bunch of Raspberries Pi. And the most annoying part of running the thing, to be very direct with you, is in fact the underlying operating system. Having to keep those things patched and updated and caring about those things.
I just want it to basically run Kubernetes as an appliance, and please shut up and leave me alone for other things that aren't that. And I can't get there.
Justin Garrison: That was exactly it. At Amazon, I was helping build EKS Anywhere, the on-prem EKS version of, you should run Kubernetes in this environment with this set of Linux distros and stuff.
And it was automated, but it was difficult. And it was like, oh, we could do this stuff. And it had Cluster API and all these things that got way too complicated. And I was like, hey, I need to look at what the competition's doing. Like, what are other people doing? And again, I was like, oh, I know Talos.
It's like, I'm going to spin it up. And I was like, okay, well, I get the API. Like, I can get the commands. And I actually booted it. The main product that we sell at Sidero is Omni, and it's a SaaS version where you can literally put it on a USB drive, boot a machine, and it connects back.
You have a Kubernetes cluster, like magic. The first time I did it, I was like, wait a minute, I missed something. There's gotta be something I'm doing wrong, because this part of it was too easy. Like, I went from booting a USB drive to a Kubernetes API in, like, two clicks. And I was like, I don't know what I just did wrong, but let me go try it again.
And I started, internally at Amazon, like, making competitive comparisons of, hey, we should look at what's going on over here. Like, this is not Cluster API. This is not complicated. Like, I didn't have to get the dev team to, like, troubleshoot anything.
Corey Quinn: It's like an ERP system. You buy the thing and the real money is for the consultants who wind up spending the next four years of their lives dialing in all the hundred thousand configuration options just right for you.
I don't think that we're at a point where that needs to be the case. You can, I imagine it could be tunable in a bunch of different ways if you absolutely had to be. But the mean path, the common path for most folks should not involve that level of obscene customization.
Justin Garrison: And absolutely, there's a place for, I want to learn it.
I want to do it the hard way. I want to go through it, like, I want to learn the Linux steps because I need to for a career, whatever. I've done it, right? Like, I was doing that. I was part of the initial, uh, SIG On-Prem inside of Kubernetes. We were building Kubernetes on-prem. I was one of the chairs for it when it originally started.
And it was really, really hard. All of my systemd units were, like, curling down hyperkube and all this stuff. And it was automated, but it was hard. And now it's like, this should just disappear. That whole layer, from the TPM, like, you want trusted boot, all the way up to the Kubernetes API, that should just be handled.
Like, we should be able to not just automate that but abstract it away, like Kubernetes does for a lot of other stuff when you run a pod, right? Like, I want a deployment. I just want the service load balancer available. I don't care how you get there.
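As a concrete version of "I want a deployment, I just want the service load balancer available," here is a minimal sketch using the official Kubernetes Python client. The names, image, and port are placeholders; the point is that you declare a Deployment and a type: LoadBalancer Service and let the platform decide how to fulfill them.

```python
"""Sketch: declare a Deployment plus a LoadBalancer Service and let the
cluster figure out the rest. Names, image, and port are placeholders."""
from kubernetes import client, config

config.load_kube_config()                       # uses your current kubeconfig context
apps, core = client.AppsV1Api(), client.CoreV1Api()

labels = {"app": "hello"}

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="hello"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="hello",
                    image="nginx:1.27",          # placeholder image
                    ports=[client.V1ContainerPort(container_port=80)],
                )
            ]),
        ),
    ),
)

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="hello"),
    spec=client.V1ServiceSpec(
        selector=labels,
        type="LoadBalancer",                     # the platform decides how to fulfill this
        ports=[client.V1ServicePort(port=80, target_port=80)],
    ),
)

apps.create_namespaced_deployment(namespace="default", body=deployment)
core.create_namespaced_service(namespace="default", body=service)
```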
Corey Quinn: Few things are better for your career and your company than achieving more expertise in the cloud.
Security improves, compensation goes up, employee retention skyrockets. Panoptica, a cloud security platform from Cisco, has created an academy of free courses just for you. Head on over to academy.panoptica.app to get started. How do you, uh, handle things like Kubernetes version upgrades when there are no moving parts?
Justin Garrison: All the Kubernetes components are part of containers that get run, right? So, like, we have system containers that run. And so you can shift them out; like, I have a, you know, Talos OS version, and I say, oh, upgrade Kubernetes. And when you have a declarative spec, right, Kubernetes can say, I know how to roll through upgrades of a service or a pod or whatever.
We can roll through upgrades of the components of a control plane. We're like, okay, well, etcd goes first. We know how to make sure etcd is healthy. We know how to roll that out. If you're highly available, we do one at a time, and we just do the steps you're supposed to, just like Kubernetes does with the application layer.
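The one-node-at-a-time roll Justin describes is something Talos and Omni drive for you declaratively (for example via talosctl upgrade-k8s), but the general shape is easy to sketch. The loop below is a hypothetical illustration using plain kubectl; node names and the per-node upgrade step are placeholders.

```python
"""Hedged sketch of the one-node-at-a-time rolling pattern described above.
In practice Talos/Omni drives this for you; node names are placeholders."""
import subprocess

CONTROL_PLANE_NODES = ["cp-1", "cp-2", "cp-3"]   # hypothetical node names

def sh(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

for node in CONTROL_PLANE_NODES:
    # Take one member out of rotation...
    sh("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")

    # ...perform the actual upgrade of that node here (OS image, kubelet,
    # static control-plane pods). With Talos the whole roll is one declarative
    # call, e.g. `talosctl upgrade-k8s --to 1.30.0` (version is a placeholder),
    # shown here only for context rather than run per node.

    # ...then wait until it reports Ready again before touching the next one.
    sh("kubectl", "wait", f"node/{node}", "--for=condition=Ready", "--timeout=10m")
    sh("kubectl", "uncordon", node)
```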
Corey Quinn: So you don't, in this case, need to do this from the perspective of someone who is having to babysit the process through. You just tell the API to do it, and it's a solved problem, effectively. It knows how to do it; just expect that it might be in an intermediate state for the next period of time until it completes.
So maybe this is not the time to also roll out a major deploy.
Justin Garrison: Yeah, there are, you know, things in any infrastructure where you're like, oh, if I'm going to upgrade this thing, let's hold off and make sure that we're not going to cross a deploy and send errors because something else happened, right? Change one thing at a time.
Corey Quinn: So one of the things that I've learned through dealing with the Kubernetes has been that it's been the first time in a long time that I've done stuff on prem. I'm used to doing everything in a cloud provider environment, but I didn't do that with one of the managed Kubernetes services because I've seen the bills when those things start misbehaving, and frankly, they are not pleasant.
I don't want to spend thousands of dollars for basically a fun talk that I'm giving, in terms of oops-a-doozy charges. And I'd forgotten, on some level, just how annoying some of the aspects of running things in your own environment can be. Uh, deviations like some bad cables, or a bad chip that winds up getting sent out.
The joy of running power and the rest. The challenge inherent in, uh, the storage subsystem. For example, Longhorn is apparently terrible; it's just the second worst option. Everything else is tied for first. And EBS in AWS land largely just works, so you don't have to think about these things. And you learn that, oh, just like DNS, when the storage subsystem starts acting up, so does everything else.
You will not be going to space today because you have surprise unplanned work. Now, I would never run things like this in a production-style environment. There'd be a lot more care and whatnot. But this just runs a bunch of homelab stuff that I want to exist, but if it goes down for a day or so, it doesn't destroy my ability to operate.
Justin Garrison: Yeah, for sure. And I mean, there's always that, like, people ask all the time, like, hey, how do I get my storage inside this Kubernetes cluster? And I'm like, do you have an external way you run storage today? You don't have to reinvent everything. Everything should not be a Kubernetes problem, right? That like NetApp that you paid for, keep using the NetApp.
It's really good at doing storage, right? Like, that's fine. Don't think you have to shove everything in here. And I understand people, they want the interface. They want something that looks familiar. I'm like, no, it's okay to, like, have a team that manages NFS over there, and then you mount that in.
And those things are great. And having those things separate. But again, the Kubernetes mindset of, like, I have to run everything myself because I can, is just kind of... we've gotten in trouble with that with everything in the past, right? Like config management, and everything had to be an Ansible playbook for a long time.
I'm like, actually, no, I don't want to SSH and try to do this in YAML.
Corey Quinn: I do manage the individual Pis with Ansible, because it was either that or a bunch of shell scripts running in loops, but no, thank you. But yeah, it's a neat approach to doing these things. Now, most of my use cases are very much contrived here, in that there are containers that I want to exist and run.
I could run those anywhere. Probably Docker Swarm or just running them in Docker Desktop would be more than sufficient for my use case. But I needed some workloads that were actually running things I care about. And these are mostly singleton-style container approaches. A couple of them have services that have a few containers talking to each other.
But it's, it's really not a Kubernetes shaped problem. I have introduced a raft of unnecessary complexity for what I'm doing, but now that I have it, I find that just spinning up new containers of services here to do things to be relatively straightforward. I have become something of a reluctant convert to Kubernetes.
I like it more on-prem than I do in a cloud provider environment, which is what I want to get into with you. Because it feels like it's a good way to basically build a cloud-like platform of your own. But when you do that on top of AWS, it's like, well, crap, why? If you want to go work at a cloud provider so bad, fill out an application.
You can go and do it there for real. If you're going to work in the cloud, use the primitives that they provide for you, and you'll generally wind up happier. That has been my position on this, historically.
Justin Garrison: I would fight back on that a bit. Disney Plus, when I was there, was all built on ECS, and it was one of the larger, uh, installations of ECS. ECS is a native AWS service, and all of the things that you would want to do are very native to what AWS can do. And even starting small, the first AWS-native primitive you're going to find yourself hitting is account limits. Like, that, as a thing that happens in the cloud, is, oh, this artificial thing to protect the service. I understand why it exists and why I would want to build it. But then it's like, I have to deal with that all the time. And there were so many account limits.
Corey Quinn: And that's why people always want to have development done in separate accounts. Because if you don't, you can have all the permissions isolation in the world.
But if you exhaust the rate limit for EC2 DescribeInstances calls, well, suddenly things like auto scaling groups and load balancers will not be able to make those calls, and your production environment will, once again, not be going to space today.
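For the DescribeInstances throttling Corey mentions, one common mitigation inside a busy account is to lean on boto3's built-in retry modes so bursts back off instead of erroring. A small sketch follows; the retry settings and the filter are illustrative, not a recommendation from the episode.

```python
"""Sketch: calling DescribeInstances in a busy account with boto3's
adaptive retry mode, so throttling degrades gracefully."""
import boto3
from botocore.config import Config

# 'adaptive' adds client-side rate limiting on top of exponential backoff.
ec2 = boto3.client(
    "ec2",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

paginator = ec2.get_paginator("describe_instances")
running = 0
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        running += len(reservation["Instances"])

print(f"running instances: {running}")
```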
Justin Garrison: Yeah, in a lot of cases, even at medium-size scale, I just basically say, you could abandon a lot of your labels in AWS and just go with different accounts, right?
Because that's the real limiting factor of AWS: oh, this is that account, and they're not talking to each other, and you get that separate, clean bill. And for every application you're deploying, every team that wants some resources, stick them in a new account.
And that becomes easier to a lot of degrees than trying to divvy things up and say, oh, well, these are these labels and they shouldn't talk to this other one. And then you get this IAM nightmare. Of course, when they have to talk to each other, you get into a lot of other problems. But in a lot of cases, it was just like, you know what?
Separate accounts as the boundary, just like a separate Kubernetes cluster as the boundary, for a lot of teams just makes a lot more sense. Because even though you have RBAC and namespaces and other things, it just gets cleaner and easier when you just say, this is a different API, and you just go over there and you do your own thing, and when you break it and come back to me and say, I need another one, I'm just going to make you another clean one.
I don't have to talk to anyone else.
Corey Quinn: You've been doing a lot of work in the hybrid space, and I wound up going on a bit of a rant a few years back, where multi-cloud is a worst practice. That was approaching a very specific thing that has been misinterpreted a number of times, but cool, I'll live with it; I should have been more clear, so I'll live with that on me. My point has been that, for most folks, setting out to build something from day one with the idea that it can run seamlessly and just the same in any or all cloud provider environments is probably not what you want to be doing without there being an extenuating circumstance. Because rather than embracing a provider, it matters not which one, and taking advantage of all the things that they operate for you, instead you're doing things like rolling your own effective load balancers, because no one has a consistent load balancing experience or a provisioned database service.
That's a lot of undifferentiated heavy lifting that you could be spending on the thing that your business does. There are exceptions to all of this, but that's the general guidance that I've taken on this. I don't feel the same way about hybrid, because hybrid I don't think is something that people set out to achieve; I think it is something that happens to them.
My theory has been that most hybrid environments are someone trying to do a full-on cloud migration, running into a snag somewhere along the way, such as, we have a mainframe and there is no AWS/400, oops-a-doozy, declaring victory midway through, saying we're hybrid now, and then going to focus on something else.
It's cynical, but also directionally accurate from what I've seen. Agree? Disagree?
Justin Garrison: Two parts there. On multiple clouds, I absolutely agree: you should go into one cloud specifically. And I call it undifferentiated heavy clouding when you try to make your own cloud on top of clouds. Any significant-size enterprise is going to have a cloud team that basically abstracts some of that away, of, oh, we want this to work everywhere, and this Terraform provider should go to every cloud.
I'm like, actually, maybe you don't. Maybe you should just keep that as clean, as simple of an interface as possible, because, again, account limits.
Corey Quinn: And also, if you're not actively testing the stuff by running it active active across multiple providers, it's like a DR plan that isn't being tested.
The next commit after you test and validate your DR plan has the potential to render the DR plan irrelevant. So it has to be consistently tested. If you've never done a restore, you don't have
Justin Garrison: backups.
Corey Quinn: Exactly. No one cares about backups. Everyone cares about the restore. And similarly, there are some people that desperately need certain things to run multi-cloud.
If you're a telemetry vendor like Datadog or whatnot, you need that stuff to live where people are running their workloads. So yeah, that thing needs to live everywhere. You know what doesn't necessarily, though, is your account management stuff or your marketing site or a bunch of other services.
Justin Garrison: Even when I was a Datadog customer, it was like, does that need to run in AWS?
The only reason that needs to run in AWS is because there's egress fees, right? Like that's the, like the, the business model of the cloud directly impacts the architecture of someone else. And that, that becomes a problem. And so that for sure is like a thing that exists, but yeah, you have to be where your customers are because of that stuff.
Um, I actually just saw a headline today that Azure is dropping cross AZ charges.
Corey Quinn: I did not expect Microsoft to be the leader in that particular direction, but I usually think of Microsoft when I want someone to basically treat security as a joke, not to be avant-garde in terms of data transfer pricing.
Justin Garrison: Yeah, competition is good.
Um, but yeah, to your point, like you either have to go with Azure and hope you don't get hacked or go with Google and hope that they don't shut it down. Um, like those are your choices right now for a lot of things. Uh, but back to the hybrid conversation. Absolutely. Like it, it becomes a thing of like, is this architected to be hybrid or is this a, a failed half attempt of like, actually that stuff just is going to stay on prem.
And the successful situations that I've seen are the, we just need to burst and we have compute, right? Like, we want the elasticity of compute resources in the cloud, but we want to keep the majority on-prem. Anything that would be a reserved instance or a savings plan, that should be on-prem.
It's going to be always cheaper, and you don't have to ever worry about cross-AZ because you own the switches, um, that sort of stuff. And then anything extra you would run in Spot, yeah, go run it in Spot in a cloud provider. And how you get there becomes a lot harder for a lot of people, and they think that Kubernetes is going to solve that problem.
It's like, oh, well, I have a Kubernetes. So now I can have another Kubernetes and we can just deploy twice. I'm like, Oh, not really. Because there's, there's a lot there. Right?
Corey Quinn: Yeah. And no, no two Kubernetes installations are alike. There's always different prerequisites of what services and how people have approached these things.
It feels like we've come full circle because this was the guidance people gave back in the, in the noughts when cloud first came out before EBS was a thing. It was, Oh, great. Just scale, use this to scale up. So you own the base, rent the peak. Yeah. And that's the way that people always approach it. Somehow that changed to, oh, put everything in the cloud.
And there are values and validity to that approach, especially at small scale when you're a startup building something. You should not be negotiating data center leases. At some point of scale, though, the economics start to turn, and ideally that's when the scale discounting starts to come into play. But I've never yet, well I used to be able to say this, I can't anymore, but I used to say that I don't know too many companies that spend more on cloud infrastructure than they do on people to operate that cloud infrastructure.
And then we had a bunch of AI companies go and basically spend 100 million that they can't pay on their cloud bill. Stability AI, I'm looking at news articles about you on this. And okay, VCs out of money, and then step two is you go buy all the GPUs from cloud providers. You got your order of operations confused and now you're having a fire sale.
That's unfortunate.
Justin Garrison: Sure. And I mean, to the first point of bursting, peaking into the cloud, the most successful times I've seen that done: when I was at Disney Animation, our render jobs were like, hey, the movie has a date. We know how long it takes us to render things. We don't have enough computers, right?
And we could just do that math. We're like, this is where it lines up. Okay, well, let's just go find cheap compute somewhere, and we'll spin up, you know, more. We'll get more CPUs. We'll render the movie. And then when it's done, the movie's done, and we shut it all down. And literally, we were creating boxed software, right?
Right. And that sort of business model, we can shut it all down as soon as we were done with the movie. And it was great because we could just cleanly turn it off. We're like, Hey, this is just coming out of the budget and we got to make it back in the movie, right? Like this is how money works. Um, the other one is like those big launches where it's like, Oh, it's a huge peak at the front, like gaming, right?
Like, they're like, oh, we want on-prem because we have to be close to users, but we just don't want people to not be able to log in on day one. And after two weeks, everything kind of settles down and we can figure it out from there. And those are perfect examples of, how do you extend that? How do you make a stretched environment that isn't, you know, completely separate in one area or another?
And in a lot of those cases, you want that consistent scheduling. Like, hey, at Disney Animation, we had a custom scheduler that we would, you know, spin up jobs anywhere we wanted to with. And with Sidero, we have this thing called KubeSpan, which is a WireGuard connection that makes a mesh between your nodes, and it joins the same cluster.
And at that point, you have one cluster where you say, hey, I can still run Cluster Autoscaler or Karpenter or whatever and say, hey, when I need compute, automatically grab me a few. They're not going to live forever. It's only going to be, you know, for a temporary time, but I'd rather give a user a good experience than bifurcate my deployment engines and run two things, and that becomes a really hard thing to maintain.
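To make the burst idea concrete: the decision is roughly "if pods are stuck Pending on the on-prem base, grow a cloud node group; otherwise shrink it back." In practice Cluster Autoscaler or Karpenter makes this call for you; the sketch below only illustrates the shape, and the Auto Scaling group name and sizing heuristic are invented placeholders.

```python
"""Hedged sketch of the burst decision: count unschedulable pods, then
size a cloud-side node group. Real autoscalers do this properly."""
import boto3
from kubernetes import client, config

ASG_NAME = "burst-workers"      # hypothetical cloud node group
MAX_BURST_NODES = 10

config.load_kube_config()
core = client.CoreV1Api()
asg = boto3.client("autoscaling")

# Pods the scheduler can't currently place on the on-prem nodes.
pending = core.list_pod_for_all_namespaces(
    field_selector="status.phase=Pending"
).items

# Crude heuristic: one extra cloud node per four pending pods, capped.
desired = min(MAX_BURST_NODES, (len(pending) + 3) // 4)

asg.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME,
    DesiredCapacity=desired,
    HonorCooldown=False,
)
print(f"{len(pending)} pending pods -> burst capacity {desired}")
```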
Corey Quinn: Gaming companies are fascinating in that space because, in many cases, they have to both get as close to users as physically possible, and they have a very strong ebb and flow throughout the day, almost a follow-the-sun type of style. And I haven't looked at the details, it depends on a case-by-case basis, but very often it's the evenings that you're following around there, because most people don't play video games from work.
Except during a pandemic, but that's neither here nor there. So they basically distill something down into: it just needs to be close to users, it's highly dynamic throughout the day, and it basically can run anywhere that can handle a Docker container, give or take. So you have an x86 or ARM instruction set that you can run there.
Great. You can run the thing that we need to do. What I love looking at is people with sustained workloads in clouds. What is the average lifetime of an instance? How much are you actually scaling? Are you just treating the cloud like a big dumb data center? That's not economically great. It does solve a number of problems, don't get me wrong, and I don't have enough context without more information to say whether this is good or bad.
When I see something where, oh yeah, we're running heavily on Spot, we are scaling up and down constantly throughout the course of the day, here are the workloads and here are the patterns where it shifts, yeah, I'm not going to suggest those people move to on-prem in almost any case, just because it makes little sense for them.
But that's not everyone, especially in the enterprise. There's an awful lot of effectively permanent pet-style EC2 instances living around where, if you already have a data center and the staff able to run it, maybe there's an economic story to not be there.
Justin Garrison: I spent a lot of time at AWS working with the Karpenter team on the node scheduler in Kubernetes. Karpenter is, like, a workload-native thing that can do that dynamic scaling really well.
And I've changed my mind a lot about how autoscaling works, uh, in a lot of cases, because it just takes a lot of engineering. And in a lot of cases, I still prefer, like, hey, you know what? If you're autoscaling, let's say you're going up and down by 30%, which is a pretty big shift throughout a day, of, we get 30 percent more instances and then we shut them all down.
And I'm like, actually, again, that steady state of just over-provisioning on-prem is going to be cheaper. Not only cheaper just from the budget perspective of, hey, on-prem machines cost less, but also the engineering time of making sure that you're getting the instances that you need, that you're not getting ICE'd by your cloud provider saying, oh, those instances aren't there, or even the performance impacts in the Linux kernel when you're scheduling a process and that same application lands on a machine that has, say, four cores versus a machine that has 24 cores.
Even if you have resource reservations set in your application, the performance characteristics of how the kernel schedules you on the processor are going to be drastically different. And we've had a lot of performance analysis of, how does this actually work when you're doing this drastically?
And in most cases it says, no, actually you just want to pin. You want to pin your workload to a type of node and always make sure you have the same amount of cores. I'm like, well, then you should just buy a handful of racks of servers and just stick it on there, and then just not use it as much.
Right? Because I know that it looks like waste, but really it's not. It's like, hey, it's still cheaper than doing the engineering work of all of that effort to say, I want to scale up and down. Which is cool. You can; it looks really neat. But also, you wasted a lot of time when you could have just bought two boxes and said, hey, guess what?
We're done.
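The pinning Justin describes maps to something fairly small in Kubernetes: select one machine shape and give the container equal CPU and memory requests and limits, which yields Guaranteed QoS and, with the kubelet's static CPU manager policy, exclusive cores. A hedged sketch with the official Python client follows; the instance-type label value and image are placeholders.

```python
"""Sketch: pin a workload to one node type with a fixed core count.
Equal integer requests/limits give Guaranteed QoS; with the kubelet's
static CPU-manager policy that means exclusive cores."""
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="pinned-worker"),
    spec=client.V1PodSpec(
        # Only land on one machine shape, so the workload never bounces
        # between a 4-core box and a 24-core box.
        node_selector={"node.kubernetes.io/instance-type": "c5.2xlarge"},  # placeholder
        containers=[
            client.V1Container(
                name="worker",
                image="myorg/worker:1.0",        # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "8Gi"},
                    limits={"cpu": "4", "memory": "8Gi"},   # equal => Guaranteed QoS
                ),
            )
        ],
    ),
)

core.create_namespaced_pod(namespace="default", body=pod)
```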
Corey Quinn: I do find it fun when I'm talking to customers about this, when they talk about auto scaling, because it's hard. It's great at giving you the capacity you need 20 minutes after you needed it. And I see some significant peaks in some cases, and I was talking to one, and I'll change some details a little bit here, because, you know, we don't out customers here, but it was effectively the equivalent of, yeah, we do have to scale massively for an incoming rush, but those spikes are also, you know, at Formula One races, and, spoiler, they schedule when those things are going to be.
We don't wake up at two in the morning to a surprise, impromptu race where we have to scramble to auto scale. So yeah, things like that make a fair bit of sense. I will say that there is a bit of a countervailing point now, which is the proliferation of AI, by which I mean the latest scheme people have come up with to sell NVIDIA GPUs.
Before this it was crypto, before that it was gaming. And they're so hard to get. Exactly: you're holding an NVIDIA, you know what you're up to. Increasingly, the hard part has been finding them, particularly at scale. So it's almost an inversion of data gravity, where people are moving workloads to wherever they can get the requisite GPUs.
They'll eat the data transfer costs. They just need to be able to get them somewhere. And looking at the cost that, you know, cloud providers charge for these things versus what NVIDIA does, assuming you can get them: yeah, you'll hit breakeven in a disturbingly short number of months, if you can get them.
And right now, that seems to be the hard part.
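Circling back to the "the races are on the calendar" point from a moment ago: when the spike is scheduled, pre-scaling on a schedule is usually simpler than reacting to it. A small boto3 sketch follows; the Auto Scaling group name, times, and sizes are placeholders.

```python
"""Sketch: scheduled scale-up/scale-down for a spike you can see coming.
All names, cron schedules, and sizes are placeholders."""
import boto3

asg = boto3.client("autoscaling")

# Scale up an hour before the known event...
asg.put_scheduled_update_group_action(
    AutoScalingGroupName="race-day-frontend",
    ScheduledActionName="pre-race-scale-up",
    Recurrence="0 11 * * SUN",      # cron, UTC: Sundays at 11:00
    MinSize=20,
    DesiredCapacity=40,
    MaxSize=60,
)

# ...and back down once the rush is over.
asg.put_scheduled_update_group_action(
    AutoScalingGroupName="race-day-frontend",
    ScheduledActionName="post-race-scale-down",
    Recurrence="0 18 * * SUN",
    MinSize=2,
    DesiredCapacity=4,
    MaxSize=60,
)
```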
Justin Garrison: Uh, I mean, I know companies and people that have enough money that they have them, and they can't use them, because they don't have a data center that is capable of racking them. Like, it's not just buying the thing. It is prep work of, hey, guess what? At some point, that colo doesn't have the cooling system and power requirements needed for those, you know, A100s, H100s, whatever it is.
That's like, oh yeah, we have these giant NVIDIA GPUs, we want you to put them in a colo. I'm like, no, actually, you do have to plan something.
Corey Quinn: You take them home, you plug one of them in, they're half rack sized things and suddenly all the lights on your block go out.
Justin Garrison: Yeah, I mean, I have, I got solar power installed in my house, which has been great, but like my batteries would drain in two seconds if I was like running actual stuff here.
But, yeah, there has to be some planning. And I think in a lot of cases, this knee-jerk reaction of catching up has caught up to too many people, of saying, I have to have that now. And it's like, actually, let's just take two minutes and do a little bit of math and see: what do you actually need?
What are your actual needs? And I know, again, a lot of it's stock price, a lot of it's quarterly earnings; a lot of those things have to be in this framework of business decisions. But at some level, someone should be smart enough to say, okay, where's the point where these lines cross? What do I need to do?
Hey, guess what? I can build that in my next budget, right? That can be a six month project, not a six day project.
Corey Quinn: The more I do this, the more I realize that the common guidance falls apart when it comes to specifics. Because I talk to very large companies that are doing a lot of stuff on prem as well as in the cloud.
And I talk to equally large companies that are only doing things in the cloud. And something I've learned is that neither one of those two company profiles generally hires idiots to make their decisions for them. Uh, there are contextual reasons, in the bounds of what it is that they're doing, that make sense.
I would not expect, uh, one of those companies to listen to us talking on this podcast and suddenly do a hard pivot in another direction because they heard this on the podcast. It doesn't work that way. There is always going to be nuance in how budgets are done, the expertise that you have available, where you can get power, large, complicated contract commitments that you have in different directions.
The world is complicated, and generalized guidance does not substitute for using judgment in the world that you're in. I feel the need to say that just because periodically I have people come back and say, well, we're deciding to pivot our entire strategy based upon what you said. It's like, ah, could I have a little more context?
Don't put that juju on me. You understand? I understand that when I call it PostgreSQL, or when I say that Route 53 is a database, that is shitposting. You should not actually do it, right? I just want to make sure that people are taking the right lesson away from these things, which is evaluate things in a full context of what you're doing, not because some overconfident pair of white dudes on a podcast had some thoughts.
Justin Garrison: And that's one thing that I've learned a lot about over my years, from writing a book about cloud infrastructure to today. Most of my time has been spent working in these environments, you know, in a cloud, but also on-prem, and seeing what that balance is. And it very much depends, but I think that a lot of times people get blinded by just forgetting to make the decision of, oh, should I not go to the cloud?
Or should I not be on-prem? And they just blindly go to one or another, and they make decisions about that environment. They kind of just forget about the other side of it completely, of, oh, what is the math of buying a rack of servers and putting it in a colo and saying, oh, I get infinite bandwidth on those things?
I'm like, okay, is that possible? I don't know. Should I? I don't know. I have to hire people. What does that mean? Okay, what does it mean when it depreciates? Wow. Okay. Like, that may be the wrong decision, right? If you need GPUs today and you don't want contracts on them, I love the cloud. The cloud is amazing for experimenting.
I've learned a lot being able to do things, and I've built plenty of infrastructure and plenty of applications that run in the cloud. And then at the same time, I love being able to plug a physical cable in and being able to feel the heat from a computer. I still think today that all of this, um, uh, greenwashing of data centers being green and whatnot, of saying, oh, we can go to the cloud because it's renewable energy.
And it's like, well, it's kind of buybacks. And I was like, the more time people spend in a hot aisle of a data center, the more they will realize what they are doing to the environment. And touching metal and feeling heat from actual machines will make it actually real to you, of saying, oh, those thousand computers I just created have an impact.
And what does that actually mean for what I'm doing is what I'm doing worth it for the world and for humanity, like all of those things, GPUs and AI included.
Corey Quinn: That's the challenge I have with a lot of the narrative around this. It feels like it's trying to shift the blame in some ways onto individual users.
Whereas, if I'm building some Twitter for pets style thing, and I start Googling about the most cost effective way to run my serverless architecture from a green perspective, I have a bigger carbon footprint for those Google searches than I do for the actual application that I'm running, because it turns out dogs don't tweet.
Or, technically they do, but we couldn't find a way to make them racist enough to bootstrap a Twitter clone.
Justin Garrison: Especially with AI built into Google now. Oh my god, yes.
Corey Quinn: Hey, instead of answers, how about we just make things up? Because it sounds good. Have you put any glue on your pizza lately? Oh, I couldn't believe that when I saw that earlier today.
That was Yeah, that strikes me as the kind of thing computers come up with. And, okay, great. But why are we putting that front and foremost instead of actual expertise from people who have been there before?
Justin Garrison: I think a lot of companies have forgotten that they've built trust over years and years and years.
And that trust can break really quickly with something that was not well thought through. And, and trust is a very hard thing, uh, to, to keep.
Corey Quinn: It feels like Amazon earned trust that they can then squander by telling everyone that they are leaders in AI and then demonstrating again and again and again just how far from reality that is.
It's, it's frustrating to me.
Justin Garrison: You don't have to be a leader in AI as long as you're ahead of someone else, right? As long as you're ahead of your customer. Like, that is the message for a lot of this: oh, I'm a leader because I took one more step before you did.
Corey Quinn: I'm old and conservative when it comes to particular expressions of technologies.
File systems, databases, the stuff where mistakes will show. I'm one of the last kids on my block to adopt those things, and it seems to work out reasonably well for me. I don't think that you need to basically pivot your entire company for what right now is extraordinarily hype driven. Yes, there's value in AI, but no, customers aren't expecting you to find it, unlock it, and then deliver it to them with a gift-wrapped bow on it.
It's too early for that.
Justin Garrison: But the stock market is, and that's the problem.
Corey Quinn: Remember it was customer obsession and not market financial analyst obsession?
Justin Garrison: When the customers are giving you money.
Corey Quinn: Exactly. That's, well, that's the trick of it too. When I talk to my customers who are spending, in some cases, hundreds of millions a year on AWS, and half our consulting work is contract negotiation for them with AWS, we're not seeing people do what the narrative would have you believe, which is, well, we're spending 100 million a year right now, but on our next commit, let's make it 150 because of all of that gen AI stuff. It's small-scale experiments in almost every case, and those same small experiments are trumpeted in keynotes as, this company is radically transforming the way that they do business with the power of AI.
Meanwhile, a week beforehand, we're talking to them, and it's, yeah, we've done this thing, it's an experiment, and yeah, there's a press release or two in here, but we're winding the effort down because it hasn't been worth the effort and energy and cost it takes to do this, so we're keeping an eye on it, but it's not substantial to what we're doing. So it's the intentional misrepresentation of what people are doing with these things that starts to irk me more and more, because they can't change their minds and they won't change the subject.
Sorry, I'll rant about this all day if you'll let me. But I want to thank you for taking the time to speak with me. If people want to learn more about what you're up to now, where's the best place for them to find you?
Justin Garrison: JustinGarrison.com is my website. I highly encourage people to buy a domain and run their own websites.
I've been pushing that for years now, as more and more things shift to platforms. I love owning a website. I've been running a website for almost 20 years now and blogging almost monthly for 20 years. And after 20 years of blogging, it turns out that I have a lot of bad takes. Um, but also it's just the place that you can find me.
And it's the place that I primarily want to put my thoughts and want people to reach out to me.
Corey Quinn: That's a really good philosophy. I've done something very similar historically. People like, oh, you have a big Twitter audience. Yeah, but I have a bigger newsletter audience because I own that domain. I have people's email addresses.
I can land in their inbox whenever I feel I have something to say because double opt in and consent are things. And I don't have to please the whims of a given algorithm in order to reach out to people who've expressly stated they want to hear what I have to say from time to time. There's so much value in that.
Justin Garrison: Except I used Revue for my newsletter, which got shut down when, uh, someone decided to shut it down at X. So, yes.
Corey Quinn: Yeah, I was talking to some of those folks early on. I built my own custom system. Never do that, but I can port the thing between any email service provider or, God forbid, I can go back to the olden days of running my own series of Postfix mail exchangers.
If it really comes down to it, I'm not saying that that migration would be painless or that there might not be a week or two of delivery issues, but it's definitely something that is possible to do, because, surprise, I back up the database from time to time.
Justin Garrison: Owning the stack gives you options.
Corey Quinn: We will, of course, put a link to that in the show notes.
Thank you so much for taking the time to speak with me. I appreciate it.
Justin Garrison: Yeah. Thanks Corey.
Corey Quinn: Justin Garrison, director of DevRel at Sidero. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five star review on your podcast platform of choice.
Whereas if you've hated this podcast, please leave a five star review on your podcast platform of choice, along with an angry, insulting comment disparaging any of the opinions we've just had. But be sure to mention which on-prem provider or which cloud provider you work for so we understand the needless, grievous personal attacks.