Getting the Basics Right in Cloud Security with Fouad Matin

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is brought to us in part by our friends at Min.io

With more than 1.1 billion docker pulls - Most of which were not due to an unfortunate loop mistake, like the kind I like to make - and more than 37 thousand github stars, (which are admittedly harder to get wrong), MinIO has become the industry standard alternative to S3. It runs everywhere - public clouds, private clouds, Kubernetes distributions, baremetal, raspberry’s pi, colocations - even in AWS Local Zones.

The reason people like it comes down to its simplicity, scalability, enterprise features and best in class throughput. Software-defined and capable of running on almost any hardware you can imagine and some you probably can’t, MinIO can handle everything you can throw at it - and AWS has imagined a lot of things - from datalakes to databases.

Don’t take their word for it though - check it out at www.min.io and see for yourself. That’s www.min.io

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. One of the things I find myself Screaming in the Cloud about entirely too frequently is the concept or idea or implementation of the high-level principle of least privilege. I figured I’m probably not the only person who has strong opinions on this, so I went searching and sure enough, I found a friend who feels the same way. Fouad Matin is the co-founder and CEO over at Indent.com, colloquially known as Indent. Fouad, thank you for joining me.

Fouad: Thank you for having me.

Corey: You are one of those weird friends that continually surprises me in weird ways. I threw a drink up when I was out visiting the New York summit that AWS threw and you showed up for it, which was great. And flattering and that was great, a wonderful surprise, don’t get me wrong, but buddy, we both live in San Francisco. What gives?

Fouad: [laugh]. Yeah, we were just out there for the AWS Summit to meet some of our customers, and I saw that you were in town. We exchanged on Twitter or over email or something and I figured, you know what, nothing better than in person. So, sure enough, I just showed up.

Corey: One of these days, we’re going to hang out together in the same city without there having to be a big conference around. I think RSA was the first time we wound up hanging out in that sense.

Fouad: [laugh]. Yeah, that’s right.

Corey: So, to give a little context on the idea of where I stand on security, and then I’ll let you run with it, I’ve been getting a bit of flak lately for saying that Google Cloud is the number one cloud when it comes to security and AWS is number two. Azure is a distant third place eating Crayons and telling you which flavor tastes the best, confidently but wrongly because that’s what the AI said. Now, the reason I believe that is not because I think the security fundamentals or the primitives are in any way, shape or form, better in Google than they are on AWS, but the user experience, the understanding that they absolutely understand what the customer is trying to do and balancing it accordingly is awesome. With AWS, it feels like I am perpetually blocked by default from doing anything. Add a permission; mm-mm, doesn’t work. And another one; mm-mm, still doesn’t work.

Hell with it. I’ll add a star; you can do everything. Now, it works. And I put a to-do in to go back and fix this later. And that’s a lie. We know that’s never going to happen. So, things wind up massively over-permissioned in perpetuity. It drives me nuts and I haven’t found a good way around it. Please agree, disagree, rebut or turn into a flagrant sales pitch. Your call.

Fouad: [laugh]. Well, I feel like that idea of everyone just needs administrator access just to do the most basic things—you need to create a bucket? You need admin access. You need to restart an RDS server? Well, here’s some admin access.

And I think it all stems from people just trying to unblock their team. And so, no one really wants to stand in the way of other people just trying to help out or trying to get work done. And I think the source of all the frustration is that people are just trying to do their work and then they’re hit with these 403s or some sort of error page that doesn’t really tell them what exactly they should do, and that results in the snowballing of all this admin, and typically very sensitive admin, access.

Corey: In fact, if you wind up scoping down permissions and then log into the AWS console, even with fairly broad scoping, you’re still going to find that as you maneuver your way through the joy-slash-terror-slash-pain that is the AWS console, you wind up encountering a whole bunch of giant red banners yelling at you and saying, “Oh, nope. That’s not the way and the light.” Or winding up just trying to do something innocuous and not being allowed to do it. And the way that that UX works, you feel, on some level, like, oh, your account is broken because you don’t have the permissions to do what you think is fairly innocuous. It just becomes a very limiting and broken user experience that you can tell was designed from the perspective of, “Oh, anyone using the console, of course they’re going to have full admin.”

Fouad: Exactly. And I think it actually—what we’ve seen is the reverse happen where people just end up sharing these admin keys across the team through the CLI. And so, it’s not just the console users. It’s really anyone who ends up running some scripts. Well, oh, turns out our script doesn’t work anymore because we tried to scope it down a little bit, but it really has to do with how people are setting up and architecting their identity interactions within AWS, but also even how they set up their accounts.

I think we see a lot of teams start with one account that has everything, and so naturally, if you need to create a bucket or you need to take some sort of operation in RDS, you end up needing more permissions than you need just for that one operation because you’re just doing a lot of different things. And as teams grow, they start to have multiple accounts and maybe it’s okay to have a production account where there’s a few people who have access, so sure, they all get admin access. But that approach doesn’t actually really work in practice because inevitably, everyone ends up needing to do one operation, and that one operation requires admin access as a result of how it was architected. And I think that’s the, kind of, root issue is how you set up that foundation, how you set up that structure. And it’s really hard to change behavior when people are used to just having standing admin access.

Corey: One of the areas that you and I bonded over—which, you know, this should really say a lot about the kinds of conversations we have in person and the company that we keep—is the idea that you should never have long-lived credentials hanging around on disk, I mean, on my systems, the only things you’ll find in the AWS credentials file other than the occasional, effectively, credential process that grabs out to some sort of secured key store for some legacy implementation of something or other is a set of credentials that is strictly a canary token. Someone grabs it and immediately I get alerts that, “Hey, this particular box has been compromised. Perhaps do something about it before it gets oh, so very much worse.” And that is the right approach.

I’m a huge believer in having things work automatically from a credentialing perspective, and also with credentials that expire in relatively short order. So, for humans, that means SSO or something like that. And for instances and things like it, it means using instance roles. And for things like CI/CD, it means using something like the OIDC permissions dance, which I don’t pretend to fully understand, despite having gotten it working with GitHub Actions, but now there’s no long-term credential source where attackers can get to it. And I think that’s pretty cool.
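As a rough illustration of the OIDC dance Corey mentions, the AWS side mostly comes down to an IAM role whose trust policy accepts tokens from GitHub's OIDC provider instead of a stored access key. The sketch below only builds that trust policy document; the account ID, repository name, and function name are placeholders chosen for illustration, not details from the conversation.

```python
import json

# GitHub's OIDC issuer hostname, as registered as an IAM OIDC provider.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build a role trust policy that only one repo/branch may assume.

    account_id, repo and branch are hypothetical inputs for this sketch.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust the OIDC provider rather than an IAM user with keys.
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # The token must be minted for AWS STS...
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # ...and must come from this repository and branch.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = github_oidc_trust_policy("123456789012", "example-org/example-repo")
    print(json.dumps(policy, indent=2))
```

The workflow then exchanges its short-lived GitHub token for temporary role credentials at run time, so nothing long-lived has to be stored as a CI secret.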

Fouad: Absolutely, yeah. I think the improvements to tooling, I mean, just looking at a couple of years ago, back when we were all using the AWS Vault CLI as a way of kind of vending credentials compared to now, where you’re not just assuming that you have this credentials file that has not just one credential, but usually at the time, had many different account credentials all just sitting in one place. And thankfully, you know, to make it a little bit easier for anyone trying to compromise your computer, it was just the same name on every single device. And where we sit today, where things are driven by [entity 00:07:33] and I think one of the key differences around, kind of, how people are even managing their accounts, used to have—you know, you would have [fouad-dev 00:07:40], you would have [fouad-prod 00:07:41] and those were different accounts. And we were trying to keep them separated.

But in reality, it should just be tied to my email address. I should just be able to log in with fouad@indent.com and then that then lists, here’s the different roles that you can assume. And if I, let’s say, I need to go and restart the database, then I just go and get a role that has just the permissions I need to restart a database. I can’t just go in and start deleting buckets or viewing objects that are in buckets. I don’t need to do that. I just need the one role that I need it for.

And so, I think having that really, kind of, limited focus where you’re, kind of, entering this privileged session, rather than I just, kind of, open up the admin console and the world is my oyster, but also I might step on some barbed wire or going to enter a Home Alone situation and start having paint dropped on my head as a result.
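To make the "one role for one job" idea concrete, here is a minimal sketch of the permissions policy such a role might carry: it allows rebooting a single RDS instance and nothing else. The region, account ID, instance identifier, and function name are hypothetical values used only for illustration.

```python
import json


def reboot_only_policy(region: str, account_id: str, db_instance_id: str) -> dict:
    """Permissions policy for a role that can reboot exactly one RDS instance.

    All three arguments are placeholder inputs for this sketch.
    """
    instance_arn = f"arn:aws:rds:{region}:{account_id}:db:{db_instance_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RebootOneInstance",
                "Effect": "Allow",
                # No S3 actions, no delete actions: just the one operation.
                "Action": "rds:RebootDBInstance",
                "Resource": instance_arn,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(reboot_only_policy("us-east-1", "123456789012", "orders-db"), indent=2))
```

Someone who assumes a role with only this policy attached can restart that database, but a request to delete a bucket or read an object is denied because nothing allows it.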

Corey: Does your position change based upon the nature of the AWS account you’re getting into, whether it’s production, whether it has sensitive data or not, whether it’s a development account, whether it’s something you’re just using to kick the tires on a new service, et cetera?

Fouad: I think there’s definitely a spectrum, like with any kind of risk posture or security posture. I think when you’re dealing with development and things are pretty low risk, I think it’s fine for people to just have these kind of elevated permissions on an ongoing basis, with the caveat that—and this was, you know, some experience I’ve had before where, you know, one day I needed—I didn’t need—I wanted a GPU instance, I wanted to train some model, once upon a time. The unfortunate thing is I forgot to stop said instance. And as you probably know, and maybe this was just a great ad for your business, it was quite expensive and I found out a year later.

So, that was a little bit of a bullet to have to bite and say, “Okay, yeah. That was definitely my fault. And no one really to blame other than myself for that one.” And so, it’s not necessarily just about security per se, as much as it is about whether there are sensitive operations, regardless of what level of account it is. I think just acknowledging what the sensitivity is.

So, in that case, it might be cost sensitivity in development. And you want to be mindful of and have some sort of, kind of, review practice around, okay, are we having long-running expensive instances that are running and maybe we want to manage that a little bit better. But then when it comes to production, I have much stronger opinions about when you’re storing customer data or any kind of sensitive data—or confidential data for that matter—you really do want to protect it as if it were your own. And I think one of the analogies that I’ve heard that I really like is this idea of handling it like hazardous materials, where you don’t collect it. If you have to collect it, don’t store it. And if you have to store it, then don’t keep it.

And so, taking that approach to how you handle your own customer data, or even access to that customer data, I think really goes a long way in improving just that kind of standing operating procedures at a company of any size, regardless of if it’s production or even staging instances where sometimes, inevitably, customer data lands, for better or worse.

Corey: One of the things that I find that has also worked out super well for me and would have caused less excitement at some jobs in previous years, is these days with trusted computing being what it is to some extent, I use a program called Secretive on the Mac that winds up generating an SSH key pair, but the private portion only lives inside the Secure Enclave and can never leave, which means you cannot export what that is. All it can do is sign individual requests and as a result, when I SSH into a node, I have set it up so that I have to authenticate with Touch ID—although you can disable that part—but at that point, it just means that there is nothing sensitive living on disk to wind up getting compromised, which is a nice way to live.

Fouad: Oh, it’s a huge improvement, I think being able to push more of the—especially when it comes to encryption, but in general pushing more of the security best practices into the hardware itself where there isn’t any kind of software, kind of, workaround where, I think we kind of saw this when we worked on some voting projects back in 2016 and 2018, we were looking at using client-side encryption, and the consistent requirements from some teams that we were working with where they just wanted us to send them plain text ballot information or voter information that we just couldn’t even do. And it was so validating to just say it is literally impossible for us to decrypt this without the customer’s key that is only on their device. We just don’t have access to it. And I think that approach where you’re really pushing the security protections, not just within, you know, a couple of if statements that you have on your back-end, but directly to user devices really goes a long way. And I think it’s just been a major improvement that we’ve seen from consumer tech in general.

Corey: I wish that there were better awarenesses around things like this. I mean, let us be clear though. When we’re talking consumer tech, I would love if people would stop using the same password of ‘Kitty’ on everything that they have. I’ve been tracking with… we’ll call it depression I suppose, the unfolding trickle truthing of the various customer-slash-victims of LastPass where, “There was not an incident.” “Okay, there was an incident.” “Okay, there was an incident and data was breached.” “Okay, and it was everything.”

And it just gets worse and worse and worse. Well, I migrated off of LastPass in 2017, and for better or worse, every account that I use has its own unique password even back then. In most cases, I use a tagged email address for most things, I have a few wildcard domains for just that. And this was an excuse to spend an afternoon changing the probably 200 passwords I hadn’t rotated since then because when you have a 40-character password, what does it matter if you rotate it often or not? But it was reassuring, just from a personal perspective that there were some high-value accounts hidden in there that had never been compromised because I would have noticed.

So, at least that was useful. But I wish people would use a password manager, whichever one that they happen to pick that isn’t LastPass. I wish that people would enable MFA on high-value accounts. I wish that people would stop sharing passwords back and forth. And trying to get them to do that, on the one hand, feels like it is a million miles away from trying to talk to companies about, “Oh, yeah. Now, let’s talk more about least privilege. And you know, you’re doing it when working with AWS becomes actively painful.”

Fouad: [laugh]. Yeah, absolutely. I think that kind of idea of getting people to do quote, “The basics.” That’s really what it’s all about. I think that there’s a lot of new and interesting patterns that are emerging and technologies that are emerging, but ultimately, a lot of the breaches that we see all stem from a lack of the basics.

And so, I think one thing just that you had mentioned briefly around 2FA that I think is really interesting is that it’s not just around what people consider high value. So, people might think their bank account where you can’t even log into your bank with a lot of providers now without having 2FA at least in some form, even if it is just text-based. But I think what’s interesting is people are actually experiencing this not just with their most sensitive accounts, but even what they saw as mundane, like their Facebook or their Instagram or their Twitter, where people are having their account compromised because they were using an insecure password and a lack of 2FA. That’s then what opens the door to their account getting breached. And I think getting the basics right and explaining why 2FA matters and how to do it and making it as obvious as possible—and kudos to—there are teams like Epic Games who, beyond just their own product and their own games, getting people into the habit of setting 2FA and having that across every consumer tech that people use is really important.

Then ultimately it stems back to what we were talking about with least privilege in AWS because I think as engineers, we know, in theory or in principle, you should be doing these things, but we don’t actually because it’s inconvenient or because it slows things down. But if done correctly, it shouldn’t. And I think that’s the kind of balancing act that is important to strike here, where it’s not about using, you know, one piece of technology or a product or anything like that. There’s many different ways to kind of get the job done. But ultimately, kind of focusing on the basics—in this case, maybe not everyone should have access to production; maybe not everyone should be able to just log into the database whenever they want—I think that’s the kind of analogy that I think about most often is, no one would ever suggest that every engineer should just have an open network connection to the database all day, but if you have admin access—or even just kind of standing database access within AWS—you could.

You could just have an open connection to the database all day and just poke around whenever you want and just see what’s going on in there. And that’s not good. We would not feel good if we saw that one of the products that we use, their engineers were doing that. And even if they had some reason for doing it, it just doesn’t really make sense. And I think making sure we’re doing the things that we would expect of others in our own homes, and in our own companies, I think is really important.

Corey: I wonder if there are, I guess, market pressures on these sorts of things. And what I mean by that is not that AWS is going to yell at you for these things. But I still periodically will be kicking the tires on a variety of different vendor solutions that do all kinds of things by reaching into my AWS account. And I still see, “Oh, yeah. Generate long-lived IAM credentials and upload them into our web form here.” It’s, “No, no, no, no, no.”

And then, okay, the smart ones are, “Build a role that has the following permissions.” And you check, and they’re using IAM credentials that are long-lived on their side to go ahead and access that, which, okay, still not terrific. And my least favorite of all of them are, despite the fact that I use all of these tools in a dedicated test account that has nothing sensitive whatsoever inside of it, but there are still a couple of them where, “Wait, I need to roll my own completely separate AWS account just for you?” Because yeah, you wind up, for example, restricting all kinds of permissions around what you can and can’t do with the role, but then you allow yourself to attach arbitrary IAM policies to things. So no, you can just give yourself admin rights and this all becomes security theater. I don’t think it’s malicious. I think it’s just not a whole lot of, I guess, decent thinking.

Fouad: Yeah. I think that kind of concept of making sure that you’re not just doing things for the sake of showing that you’re doing it and performing it I think that’s really important. I think that’s—we’ve heard from a lot of teams who end up quote, “Implementing least privilege,” where, “Yeah, we did it, we checked the box for SOC 2,” but actually, everyone ends up having access, or if you go through this process, you can get it anyway, but yeah, you’re supposed to submit a ticket or something. And I think those flows, those kind of logical inconsistencies is really where you—one, you lose trust, both internally because then what was the point of this exercise where we added some friction, even though you can circumvent it anyway, but also it doesn’t really make you any more secure than you were before if people can still go in and, you know, run a script that can just grant them admin access. And in a pinch, they might just use that anyway.

Or if you happen to know the IP address of your production database then, and you happen to have already created a user account even before you locked things down, well, now I can still just do whatever I need to do, so I can work around this issue for now. And it, one, I think reinforces this kind of seniority bias, where, you know, being a senior engineer just means that you just know where the bodies are buried. But also, I think, on the kind of new team member side, it’s just really hard to just get your basic work done. Regardless of kind of what tool or what vendor you’re using, you end up just reaching every single time for the super admin, for whatever kind of escape hatch that you can possibly use.

Corey: You just put your finger on the pulse of the real problem here from my perspective, which is you’re trying to get work done. “Well, everyone’s job is security.” No, it’s not because I assure you, I am not metric’ed at the end of the year on how my business did based upon security unless I’ve completely failed at it and we know that there’s been a breach. As an engineer, I am graded on what features have I shipped, not how secure were those features. And as a result, I’m trying to do my job and I want to do that job as quickly and efficiently as possible, so any solution that winds up enforcing or creating least privilege has definitionally got to be easier and—

Fouad: Exactly.

Corey: More straightforward and lower friction than just using admin for everything. And that’s a tall order.

Fouad: It is hard. But I think if set up correctly, it is possible. And I think this kind of distinction of how do people get access the first time? And how do people—I think kind of to your point—not just how do people get access, but how do people do their work? And sometimes that can require getting access to something. Like, maybe I’m on an integration team, and I just deployed a service and I want to confirm that it works.

Focusing on what that workflow is and understanding the critical paths within a business and within a team, whether that’s, you know, I need to access Rails Console and so really focusing on the critical path to how do we make it as easy as possible and as fast as possible for people to get production Rails Console access, but in a time-limited and really audited way where you can still record the logs of what’s happening on those servers, but also make sure that every time that someone is requesting that access, they’re providing a reason. And I think that piece of providing a reason, while it can seem like it’s just for, you know, a security or a compliance reason, it actually provides a lot of value for the team overall because if let’s say I needed access to run a migration and I get that access only for the next couple hours, which is how long it should take for me to run that migration and then I request three more times to get access for running a migration, that’s probably a good indicator that maybe my migration wasn’t well tested enough to actually go directly into production and maybe there’s an issue here. And I think that kind of distinction of, I probably need some help, as opposed to let me just keep getting a little bit more access and keep poking around until I find out what’s going on, I’m at least less likely to drop anything in that event, compared to if I’m requesting and I know that it’s visible for everyone, I’m more likely to actually both reach out for help and also, at the very least, it’s more visible to everyone else, so someone’s going to come and say, “Hey, why do you need this much access?” As opposed to the default standing where everyone has admin, it’s impossible to tell. You kind of need a, kind of, needle-in-a-haystack finder to figure out what’s even happening within your team.
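The workflow Fouad describes, where access is time-limited, every request carries a reason, and repeated requests for the same thing become a visible signal, can be sketched in a few lines. Everything here is hypothetical: the `AccessRequest` type, the field names, and the threshold of three repeats are assumptions made for illustration, not a description of any particular product.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessRequest:
    """One time-boxed grant. All field names are illustrative."""
    user: str
    role: str
    reason: str
    granted_at: datetime
    duration: timedelta

    def __post_init__(self) -> None:
        # A request without a reason is rejected outright.
        if not self.reason.strip():
            raise ValueError("a reason is required for every access request")

    @property
    def expires_at(self) -> datetime:
        return self.granted_at + self.duration

    def is_active(self, now: datetime) -> bool:
        # Access exists only inside the granted window.
        return self.granted_at <= now < self.expires_at


def repeated_requests(requests: list, threshold: int = 3) -> list:
    """Return (user, role, reason) keys requested at least `threshold` times.

    The threshold is an arbitrary assumption; the point is that repeats are visible.
    """
    counts = Counter((r.user, r.role, r.reason.strip().lower()) for r in requests)
    return [key for key, n in counts.items() if n >= threshold]


if __name__ == "__main__":
    start = datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)
    log = [
        AccessRequest("fouad", "prod-rails-console", "Run migration", start + timedelta(hours=3 * i), timedelta(hours=2))
        for i in range(4)
    ]
    print(log[0].is_active(start + timedelta(hours=1)))  # inside the two-hour window
    print(repeated_requests(log))
```

A review process could then ask whoever shows up in `repeated_requests` whether the migration needs help, rather than silently extending access.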

Corey: This episode is sponsored in part by our friends at Strata. Are you struggling to keep up with the demands of managing and securing identity in your distributed enterprise IT environment? You're not alone, but you shouldn’t let that hold you back. With Strata’s Identity Orchestration Platform, you can secure all your apps on any cloud with any IDP, so your IT teams will never have to refactor for identity again. Imagine modernizing app identity in minutes instead of months, deploying passwordless on any tricky old app, and achieving business resilience with always-on identity, all from one lightweight and flexible platform.

Want to see it in action? Share your identity challenge with them on a discovery call and they'll hook you up with a complimentary pair of AirPods Pro. Don't miss out, visit Strata.io/ScreamingCloud. That's Strata dot io slash ScreamingCloud.

Corey: So, I have to ask, since you’ve built a company around the entire approach, how do you get to least privilege in practice because even AWS has their IAM Access Analyzer, which is designed as a native offering to let you build least-privilege policies? So, this is great. I’ve run this on things I have in production that are over-scoped and it’ll come back and say things like, “Ah, it was reading and writing to DynamoDB tables.” “Okay, great. Well, you tell me which ones?” “Nuh-uh. Guess.” “Okay.” “You were making some S3 operations?” “Will you tell me more than that?” “I will not.”

I know that it’s doing this; I can look at that from static code analysis around which Boto calls it’s making. The end. I’m trying to be a little bit more granular here. And I’m sure that I’m missing oceans of nuance on this, but I cannot escape the feeling that there is a way for it to build out an absolutely incredibly scoped policy that I then loosen. Okay, great.

Maybe I don’t just put that one key and that one value into Dynamo, but maybe I just let myself have access to that entire table or expand it out beyond just the single S3 object to an entire prefix or an entire bucket, but I still won’t need to build the entire thing from scratch. It’s easier to broaden than it is to tighten in, especially when you’re not entirely sure what some of the syntax stuff looks like. But no, it just feels like it is not built for humans who are not steeped in security.

Fouad: As someone who’s been working on security for a while, I still find [laugh] the analyzer to be a little bit difficult to use and end up having to just look at logs [laugh] to find out what’s going on. But I think the core issue that you’re getting at is exactly right, which is, how do I get the work that I’m trying to get done with the least amount of access to do it? And I think it all comes down to this core issue, which is, you kind of have to start from zero—you actually just pointed this out a moment ago—where it’s much easier to broaden than it is to tighten. And it’s easiest when you just don’t have unlimited access.

And so, I think starting from zero, you know, like when you show up on day one where you don’t even have an account, you’re actually better off then. The entire organization is more secure if no one has access. But obviously, that’s not realistic. That just can’t work; people do need access.

And what you can do is start from what are people trying to accomplish. Define the role based on that. So, let’s say, I need to go in and—you know, we rely on Dynamo in this scenario—and I have this table that is our, kind of, most critical, like, this is all of our customer data that is really sensitive in Table A. In Table B, we have some product analytics that, you know, a couple different teams are going to need access to, and then Table C is just something that we kind of just store kind of as a cache; it typically is not going to have anything that’s even remotely sensitive, but we still want to manage it in a somewhat sane way.

That way you can actually say I’m going to have three different roles—maybe even two different roles—but at the very least, I have one role that is hyper-locked down because Table A has the most sensitive of all of our data and rather than having, just by default, people are used to just having access to all three tables, instead I’m already in the mindset that this is a sensitive table, so when I need to go in and perform operations, I’m a little bit more cautious about what am I doing inside of that table. And maybe that means you also separate that and say, let’s have a write and a read. And I think that’s the most common pattern that we’ve seen people implement is just separating those two out is already half the battle. But really, it’s not just about giving people one or the other; it’s typically having three categories where people have the, kind of, most minimal, just kind of log-only access; they have read access to whatever system, and then they have write access in the system, which people should practically never have unless they have a very good reason, and they should only have it for 30 minutes or a couple hours at a time. You don’t need it for longer than a day. You don’t need to just go in and change whatever you want in our Dynamo table. There’s no good reason for that.
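The three categories described above (log-only, read, and rarely granted write) could be written down as data, with a cap on how long each can be held. This is a sketch under stated assumptions: the tier names, the specific action lists, and the caps (two hours for write, one day for read, none for logs) are illustrative choices loosely based on the durations mentioned, not a recommended standard.

```python
from datetime import timedelta

# Hypothetical tiers: the actions are real IAM action names, but the grouping
# and the duration caps are assumptions made for this sketch.
TIERS = {
    "logs": {
        "actions": ["logs:GetLogEvents", "logs:FilterLogEvents"],
        "max_duration": None,  # standing access in this sketch
    },
    "read": {
        "actions": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan"],
        "max_duration": timedelta(days=1),
    },
    "write": {
        "actions": ["dynamodb:PutItem", "dynamodb:UpdateItem", "dynamodb:DeleteItem"],
        "max_duration": timedelta(hours=2),
    },
}


def grant_duration(tier: str, requested: timedelta) -> timedelta:
    """Clamp a requested duration to the tier's cap, if it has one."""
    cap = TIERS[tier]["max_duration"]
    return requested if cap is None else min(requested, cap)


def policy_for(tier: str, resource_arn: str) -> dict:
    """Build a policy allowing only the tier's actions on one resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": TIERS[tier]["actions"], "Resource": resource_arn}
        ],
    }


if __name__ == "__main__":
    table_a = "arn:aws:dynamodb:us-east-1:123456789012:table/TableA"  # placeholder ARN
    print(grant_duration("write", timedelta(hours=8)))  # clamped to the write cap
    print(policy_for("read", table_a))
```

Keeping the tiers in one place makes it obvious that write access to the sensitive table is a separate, short-lived thing rather than a side effect of being able to read it.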

And one analogy that I heard from an engineer that I really liked is this idea of thinking of changes to access as a migration in their own form. And so, the easiest way to perform a migration is if you already have everything in a structured way where you already know what are you actually changing as opposed to, “Well, there’s this unbounded mess of things we’ve already created. We’ve accumulated and snowballed a lot of different resources and identities and everything in place.” And that’s really hard to manage. And unfortunately, that’s where most companies sit today.

And so, some of these ideas of well just start out with the right approach is just not really going to be helpful and that’s kind of where I would go back to start from zero, not from the perspective of your account, but from people’s relative access. And think about what the bare minimum—rather than the bare maximum, which is what most people have—but the bare minimum viable access to perform a given task, whether that’s within RDS or that’s within Dynamo, or let’s take S3 as an example, where you might have one bucket that stores all your customer data under different prefixes. Well, that’s probably a good example of where you should have very tightly controlled IAM policies that limit what prefixes you have access to as opposed to a different scenario where you have a bucket per customer—which is what we would recommend—that you’re limiting access on a per-bucket level. But in either case, the least privileged control would be, people should not just have GetObject on every single bucket. That should be limited and really looked at not as a, “Oh, well, it’s read-only so it’s fine,” but actually, that’s an anti-pattern because read-only in that context could be looking at all the customer data.
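For the shared-bucket case, limiting access by prefix might look like the sketch below: `s3:GetObject` is allowed only under one customer's prefix, and listing is restricted to that same prefix through the `s3:prefix` condition key. The bucket name, prefix, and function name are placeholders for illustration.

```python
import json


def prefix_scoped_read_policy(bucket: str, prefix: str) -> dict:
    """Read-only policy confined to a single prefix in a shared bucket.

    bucket and prefix are hypothetical inputs for this sketch.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Object-level actions target object ARNs under the prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "ListOnlyThatPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                # ListBucket targets the bucket ARN; the condition narrows it.
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(prefix_scoped_read_policy("example-customer-data", "customers/acme"), indent=2))
```

With a bucket per customer, the same idea collapses to naming one bucket ARN in the `Resource` field instead of a prefix; in both cases nobody gets `GetObject` on everything.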

Corey: I’m a firm believer that this is not that hard of a problem to get to if you’re able to start from a place of building completely fresh in a greenfield scenario. Unfortunately, we don’t have that available to us in most cases. I am staring at a pile of legacy stuff and I’ve only been in business here for six years. And I look at some of the stuff I built early on, it’s like, “What moron did this?” And I don’t have to ask; I know because it was only me for the first two of them. How do you wind up approaching the idea of migrating people from whatever hellacious scenario they’re currently in that’s also load-bearing and getting them to a point where this makes sense for them across the board? Knife-switch cutovers suck and don’t work.

Fouad: Yeah, so most of our customers actually did not start at greenfield as you can probably imagine. We work with some companies that had a lot of, whether you call it legacy or really just critical, infrastructure where there was no option to say, “Oh, we’ll just spin up a new account and start anew.” That was not an option. And so, it’s really about looking at what are people already doing today? And I think it stems from what we were talking about earlier of, “Well, what’s the workflows that people are already doing?”

What are the reasons why people need access to production? It’s really hard to figure that out when everyone already has it because no one’s going around asking for it, per se, but there’s really two sides of the equation where either it’s open by default where everyone already has access, in which case, you have to try to unwind people’s natural instinct, and say, “Well, I am an admin. It’s one of my core personality traits. I’m an administrator access in our production AWS account.” Or on the other side, you have everything locked down and only a handful of people get to be the, kind of, curmudgeons deciding who gets bestowed access in the context of an incident or for a given project, and hopefully remembers to revoke that access.

But ultimately, it all stemmed from access to sensitive data, whether that’s in a blob storage or some sort of structured SQL database, they’re running servers where we’re actually computing with that data, and then typically, there’s some set of pipelining that, whether it’s SQS or any of the many different, kind of, messaging systems within AWS and some of the managed products. There’s really three categories that we saw companies try to approach this problem. So, step one was, “Let’s secure the thing where the data is actually being processed.” So, the servers, where if you were to make a change in there, you just go SSH into that server, you download something onto it, that would be really bad, we would look really embarrassed. And so, locking that down where there’s not really that much good reason for you just SSHing into production servers, willy-nilly.

And so, that was the first place that we see most companies start and limit that access. Because people still need it, but now they can just get it when they need and that kind of on-demand nature to people’s access, I think that’s really important. And then for the others, it’s really about understanding the workflows before you revoke it. And I think a key part of that is you can shift to on-demand access, where people are requesting it and they don’t have it by default, but it can be automatically approved. So, you get people to start saying why they need access.

And they’ll just tell you, “Hey, I needed to do this workflow,” or, “To do this ticket,” or, “I’m debugging this integration.” And then you can start to ratchet up the controls and say, okay, for these really sensitive buckets, which we now know, are extremely sensitive and have very sensitive workflows, but are really time-sensitive, let’s allow this data science team to be able to self-approve and everyone else has to go through their manager. And then on the flip side, there might be some resources where it’s like CloudWatch or ECS, cluster info is a common one where, well, it turns out if you look at the quotas, it’s pretty easy—if you have a growing team—to hit those quotas and then all of a sudden, no one can see cluster info anymore. And there are workflows where people just start looking at different pages in the AWS dashboard because they’re curious, as opposed to, they actually need to do that thing. And then the people who need to do whatever that workflow is are unable to do it anymore because of either a quota getting hit or someone accidentally takes an action that they thought they were in staging instead of production.

And I think just separating out those kinds of accidental mistakes [laugh] or moments of curiosity that maybe should not have been run in production and shifting that to a process where people are saying, “I need it for this reason. I need it for this amount of time.” And given that, people are much more cautious about what are they actually doing within that system. And I think that that change of behavior is really the first place to start.
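The ratcheting described above, where requests start out auto-approved as long as people state a reason and then get stricter for sensitive resources, could be sketched roughly as follows in Python. The resource names, team names, and the `route` function are hypothetical placeholders, not Indent's actual policy model.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    team: str
    resource: str
    reason: str          # "I need it for this reason"
    duration: timedelta  # "I need it for this amount of time"


# Hypothetical configuration: which resources are sensitive, and which
# teams may self-approve access to them.
SENSITIVE_RESOURCES = {"s3://example-sensitive-bucket"}
SELF_APPROVE_TEAMS = {"data-science"}


def route(request: AccessRequest) -> str:
    """Decide how a request gets approved."""
    if not request.reason.strip():
        return "rejected: reason required"
    if request.resource not in SENSITIVE_RESOURCES:
        # Step one: on-demand, but automatically approved.
        return "auto-approved"
    if request.team in SELF_APPROVE_TEAMS:
        # Time-sensitive workflows on sensitive data: the team self-approves.
        return "self-approve"
    # Everyone else goes through their manager.
    return "manager-approval"
```

Starting with auto-approval everywhere still collects the reasons, which is what tells you which resources belong in the sensitive set later.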

Corey: Speaking of changing behaviors, one position that you have staked out publicly lately that I’m a fan of and you started quoting me—I’ll even be charitable in context on—has been that nobody should have production access. I like the pattern. And I know people are going to yell at me for it, so I will of course, caveat it—and maybe you’ll disagree on this one—yes, fine. Build yourself a break glass option to get in when things are completely hosed. Fine. Of course. But no one should be able to do that without a whole bunch of other people knowing. Now, if you accept that you can always get in if you need to, now how do you build tooling and operational practices so that you don’t need to?

Fouad: Exactly, exactly. And I think there will always be a situation where you need that break-glass backup. I think when we look at incidents like the Uber incident from last year, where the unfortunate reality when you don’t have the right tooling in place to temporarily elevate people’s access, what ends up happening is that break-glass account, this, you know, only-break-in-case-of-emergency account, ends up getting used constantly to the point where it gets put in a Google Drive folder with credentials in plain text. And that was the root cause of not the initial incident and the initial compromise, but the escalation from, “This is kind of bad,” to catastrophic, where an outsider now had full admin, super-admin across everything. And I think that distinction of if something is a break-glass account, it should do just that: it should break glass and set off alarms, and everyone should be notified by triggered alerts, because that should be the sensitivity of that account.
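One way to picture a break-glass path that always sets off alarms is the minimal Python sketch below. The `use_break_glass` function and the notifier callables are hypothetical; in practice each notifier would wrap whatever paging or chat system you already use.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass(frozen=True)
class BreakGlassEvent:
    user: str
    reason: str
    at: datetime


Notifier = Callable[[BreakGlassEvent], None]


def use_break_glass(user: str, reason: str, notifiers: List[Notifier],
                    now: Optional[datetime] = None) -> BreakGlassEvent:
    """Record a break-glass use and alert every configured channel."""
    if not notifiers:
        # Never allow a silent break-glass: no alert channels, no access.
        raise RuntimeError("refusing break-glass: no alert channels configured")
    event = BreakGlassEvent(user, reason, now or datetime.now(timezone.utc))
    for notify in notifiers:
        notify(event)
    return event
```

The design choice here is that alerting is not optional: the function refuses to proceed if nobody would be told.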

Otherwise, people should be going through a process that’s not quite break glass as much as it is an escalation. And I think this kind of idea of an escalation, where it’s not—not necessarily a sense of, like, an incident escalation, where this is a, kind of, one-off thing because these end up happening constantly. It’s escalation mechanics of elevating someone’s access from Viewer to S3 Reader to RDS Backup Recovery, understanding the workflows that people are trying to accomplish. And AWS, to their credit, has made some improvements around Identity Center to make defining these job-specific functions a little bit easier when you have an accounting team who needs access to AWS, doesn’t mean they need access to production continuously; it just means they need the billing account info. That’s really all they’re looking for. They don’t want to go look in S3. Unless maybe they’re Duckbill, but then they do need to go into S3. And I think that having that approach—

Corey: We have to go into everything, on some level.

Fouad: [laugh].

Corey: That’s where the bills live.

Fouad: Exactly [laugh]. But I think that having the right procedure and the right tooling to enable that procedure is really important because it all comes down to what is the easiest way to compromise the system. And if the easiest way is this, there’s this account that people can just go into, or there’s a handful of people who have these accounts, then the inevitable result is people just sharing credentials, or I ask you—if you have admin access, I just ask you to go run this query for me and you’re just trying to help me out to help me ship this feature faster, we’re more likely to make a mistake compared to when we have a process in place that is as lightweight as possible. I think that’s the key part of—process can really add a lot of friction and it can be really this kind of terrible plight when it comes to engineering because you’re just going through these hoops for not really that much reason. But if the process is more of a byproduct of you just kind of doing your work and saying, “Hey, I need—” just like you need access to a Google Doc or something, you just click a button and say, “I need it,” and then it goes to the right person.

I don’t need to go find who in the SRE team or on the IT team I need to ask, I can just click a button and say, “Hey, I need this thing, and here’s why I need it.” It makes it so much smoother for people to get that access, to just get on with their lives and just keep doing work. And I think setting up—the other half of that, setting up auto-approvals for cases where these people will almost certainly always be approved and just encoding that into a policy. I think that’s the key distinction that a lot of people don’t, kind of, take into account is that you can just connect your approval process to your PagerDuty or to your Opsgenie or whatever incident management system you’re using to make sure that people aren’t just stuck around waiting if it is, let’s say 2 a.m. and people need to, quote, “Break glass,” that shouldn’t really be a break glass moment because that person is already on call or they’re already responding to an incident.
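A minimal Python sketch of that auto-approval rule might look like the following. The `is_on_call` and `is_incident_responder` lookups are injected as plain callables because the actual PagerDuty or Opsgenie integration is not described here; both names are assumptions for illustration.

```python
from typing import Callable

Lookup = Callable[[str], bool]


def decide(user: str, is_on_call: Lookup, is_incident_responder: Lookup) -> str:
    """Auto-approve people who are already on call or responding to an incident."""
    if is_on_call(user) or is_incident_responder(user):
        # At 2 a.m. this person should not be stuck waiting for an approver.
        return "auto-approved"
    return "needs-approval"


if __name__ == "__main__":
    # Hypothetical in-memory stand-ins for an incident management system.
    on_call = {"alice"}
    responders = {"bob"}
    for person in ("alice", "bob", "carol"):
        print(person, decide(person, on_call.__contains__, responders.__contains__))
```

Keeping the lookups as parameters means the policy itself stays the same whichever incident management system answers the question.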

So, understanding the situations in which people are going to have to use these kind of escalations and just accounting for it as part of your process. And I think that’s where it is hard to do. There are companies—and credit to Segment where I used to work where they actually built—they spent many months and some really smart engineers building an access service in-house to solve this problem because they had a lot of data, as you can probably imagine. And credit to them because they really invested and they took it seriously. A lot of companies just don’t have that capacity, don’t have the bandwidth, especially now they don’t have that capacity.

And so, I think finding the right solution for you, whether that’s some amount of tooling that kind of removes people’s need to have production access or it’s using products. There’s a lot of different options, but ultimately getting to the right process where people don’t just have that standing admin access, whether it’s not keeping production access, or candidly, even as one of the co-founders, I don’t have admin access on our own systems. I can get it when I need it, but I think it’s a really important procedure to make sure it’s typically going to be the kind of most senior leadership in a company or senior engineers in the company who are going to be so confident that they’re going to do the right thing and, kind of, you know, just cut once and be so confident that—they’re going to make a mistake. And we heard about this recently where someone was debugging an incident and the CTO hops on, tries to help, and accidentally drops the production index. They thought they were on staging because they wanted to test something to see, would that solve the problem, but they actually made the incident much, much worse because now not only were they already debugging this incident, but now all their other alerts started going off because everything started becoming so much slower. And I think that kind of confidence can actually hurt you in the moments like that, and having a little bit more procedure and process can help.

Corey: I really want to thank you for taking so much time to explain how you feel about these, unfortunately, polarizing topics. If people want to learn more, where can they find you?

Fouad: You can check out Indent at indent.com. And we’re on Twitter, LinkedIn, I’m also on Twitter and LinkedIn; Fouad Matin. A little bit harder to spell than Indent. But if anyone, kind of, disagrees or feels strongly about this topic, definitely willing to scream into the clouds and more and hash it out. But thanks again for having me.

Corey: Yes. “You’re encouraged to fight me in real life.”

Fouad: [laugh]. I felt like I was getting pretty close to that [laugh].

Corey: Exactly. Ahh. Thanks once again for your time. I really do appreciate it.

Fouad: Thank you.

Corey: Fouad Matin, co-founder and CEO at Indent. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that I will then log in as you and remove because you used the same password for everything, including production and your podcast platform of choice.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.


2021 Duckbill Group, LLC