The Darth Vader of AWS with Eric Brandwine

Eric Brandwine is a VP and distinguished engineer at AWS, where he focuses on the cloud and security and has worked for more than 13 years. Prior to joining Amazon, Eric worked as a principal engineer and senior engineer at MITRE for 10 years and a network security engineer at UUNet. He earned a bachelor’s degree in computer science from Cornell University, with a concentration on engineering physics and research in operating systems. Join Corey and Eric as they talk about why Eric is kind of the Darth Vader of AWS, how meeting coworkers for the first time during security events isn’t the best way to win friends and influence people, how security is job zero at AWS and what that means, why businesses shouldn’t be terrified of making a single misstep but why they should take every security event very seriously, the importance of earning customer trust every day, the two things Eric thinks make security difficult, how cyberattacks are like playing a blind game of chess against an unknown adversary, why Eric’s favorite word in AWS security is “escalate,” and more.

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of Cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is brought to you in part by our friends at FireHydrant where they want to help you master the mayhem. What does that mean? Well, they’re an incident management platform founded by SREs who couldn’t find the tools they wanted, so they built one. Sounds easy enough. No one’s ever tried that before. Except they’re good at it. Their platform allows teams to create consistency for the entire incident response lifecycle so that your team can focus on fighting fires faster. From alert handoff to retrospectives and everything in between, things like, you know, tracking, communicating, reporting: all the stuff no one cares about. FireHydrant will automate processes for you, so you can focus on resolution. Visit firehydrant.io to get your team started today, and tell them I sent you because I love watching people wince in pain.

Corey: This episode is sponsored in part by ChaosSearch. As basically everyone knows, trying to do log analytics at scale with an ELK stack is expensive, unstable, time-sucking, demeaning, and just basically all-around horrible. So why are you still doing it—or even thinking about it—when there’s ChaosSearch? ChaosSearch is a fully managed scalable log analysis service that lets you add new workloads in minutes, and easily retain weeks, months, or years of data. With ChaosSearch you store, connect, and analyze and you’re done. The data lives and stays within your S3 buckets, which means no managing servers, no data movement, and you can save up to 80 percent versus running an ELK stack the old-fashioned way. It’s why companies like Equifax, HubSpot, Klarna, Alert Logic, and many more have all turned to ChaosSearch. So if you’re tired of your ELK stacks falling over before it suffers, or of having your log analytics data retention squeezed by the cost, then try ChaosSearch today and tell them I sent you. To learn more, visit chaossearch.io.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Eric Brandwine, who's a Distinguished Engineer and VP at AWS. Eric, welcome to the show.

Eric: Hi, Corey. Thanks for having me.

Corey: So, what is it you actually do at AWS? Every time I've mentioned your name to folks in passing, they get this sort of stricken look, and all I can assume is that you're basically Darth Vader.

Eric: Darth Vader, I think is a slightly unfair characterization; perhaps not wholly unfair—

Corey: Because he had a redemption arc?

Eric: Well, there were only three movies. And—

Corey: Exactly. If they'd made a prequel or sequel, it would have probably been a really good movie. Shame they never did.

Eric: There were three Indiana Jones movies, there were three Star Wars movies. And yeah, at the end of the third movie, it kind of sort of worked its way out. Every year at re:Invent, Andy Jassy gets up on stage and he says, “Security is job zero.” And I love it when he says this because one, he's counting from zero, which is how all good computer scientists count, but two, he's very publicly saying how important security is to what we do. And this isn't just Andy getting up on stage at re:Invent; when he comes back to Seattle, this is the behavior that he models for his leaders.

And so, unfortunately, my first interaction with a lot of our employees is during a security event. And so I have met a good number of my coworkers via SEV 2 tickets—SEV 2s or pager tickets. And it's not the best way to make friends and influence people. And so, you've got to put a lot of work into building those relationships and making sure you reach out and contact people after the dust has settled. But I would say that my primary job is making sure that we hold the security bar high, and relentlessly so.

Corey: The idea of security being job zero, when I first heard of that—my instinctive reaction was, “Okay, we made a list of all the things we have to do, oh, crap, we forgot security, so we'll do what everyone does with security: bolt it on at the end and put it at the top so we don't have to renumber anything.” And it's a funny joke. And it's great to make the cheap shots and whatnot, but let's be very clear here; it is blindingly apparent to everyone who has used AWS in depth that security is baked in. You cannot bolt it on after the fact and expect to see the level of success that AWS has from a security perspective. I want to be explicitly clear on this. I have a laundry list of grievances around AWS—mostly around service naming—but I have never had a problem with how seriously you folks take security.

Eric: That is excellent to hear. I'm glad that that's coming through.

Corey: It's a no-win game when it comes to security because either it's in the way or it's invisible—if it's done well—but no one ever stops to say, “I really like the security there.” In fact, the only time people really seem to talk about it is after they've had a data breach in public, and they're saying, “We take security seriously,” right after it was exquisitely clear that they did not take security seriously.

Eric: Well, I've heard that perspective many times, and I disagree with it. Because when you're unsure of your security, when you're surrounded by ambiguity, it's deeply unsettling, and the effect of that, the materialization of that in the business is friction and low velocity. And when you work with someone, whether it's one of our service teams or one of our customers, and you give them the data that they need, and the tools to manage that data so that they can understand the security risks that they're facing, and so that they can make data-driven, informed decisions about how quickly they want to move, about which risks they need to mitigate, about which risks they can accept, they become way more comfortable, and that leads to greater velocity for the business, it leads to greater confidence for the leadership, and it leads to greater delivery to customers, which is the reason that the business is there. And so, over my time at AWS, I've seen security go from something that nobody talked about, to something that could only be a deficit, to something that is actually an enabler for us and for our customers.

Corey: I think you're the first person I've spoken to in my life who ever pushed back on the idea of, “Well, obviously, security wasn't the right answer here,” yadda, yadda, yadda. No, I think you're onto something, though. It's always a spectrum between usability and security, and there are trade-offs that have to get made. And on some level, given what AWS is and who your customers are, you can't ever get it wrong in a serious, big way. Because you don't get a second bite at that apple.

Every Cloud-doubter on the planet is going to come back with saying, “See, see. I told you, I told you.” And that's kind of weird. It's a very high-risk story, and so far you've delivered. There haven't been these horrifying nightmare things that grumpy on-prem sysadmin types—of which I used to be one—were long predicting. It's a track record for what I can only imagine to be a thorough defense-in-depth position. What am I missing?

Eric: Absolutely. It's something we spend a tremendous amount of time and energy on; it does not happen by accident, and we're constantly looking for the maximum leverage we can get out of any defensive mechanism. But I don't think that the world is as boolean as you're describing here. When security events happen, they're absolutely serious. We take them very seriously. We respond immediately.

But if you look at the things that have happened in the world at large, even some of the large newsmaking security issues that we've had recently, it's never a complete company extinction. It's never the end of a line of business. It's definitely disruptive to roadmaps, it is damaging to customer trust, it's not something to be taken casually but it's not like you make a single misstep and it's all over. And I think it's really important to reinforce that. I'm regularly humbled by the amount of customer trust that we've earned.

And I don't take that lightly, I'm not going to play casually with it, but if you're caught up in the belief that a single misstep is going to lead to business extinction, then you're going to be paralyzed, you're going to be unable to move forward, you're going to be unable to objectively consider the risks. And I see security as highly parallel to availability. Availability is something that every service provider thinks about and has deep experience in, and we all think about the risks that we face. It is possible to build a system that is incredibly resilient: you run it in multiple availability zones, you run it in multiple regions, you build two completely separate implementations of it using two different languages and two different runtimes, you completely don't share a [fate 00:09:08] across anything, and you can build an incredibly robust system. Almost no one does that because it's expensive.

And the business either implicitly or explicitly is making decisions about how much they're willing to spend on availability, and which availability risks they're willing to take. And all of these services have some availability risks and sometimes they have availability events. And security is exactly the same. We are surrounded by security risks; no business is without security risks. And so the way to succeed here is to, as objectively as possible, think about those risks, mitigate the ones that aren't acceptable, prepare to mitigate the ones that are acceptable if it turns out that your analysis was wrong, and to move the business forward.

Corey: The problem that I see is that what you've just said is, first, accurate; I disagree with absolutely none of it. But it's also nuanced. It doesn't fit in easy sound bites; it doesn't fit in tweets—which is my primary form of shitposting—it requires a level of maturity on the part of the listener to understand the nuances. Why is it such a hard concept to convey repeatedly, and well?

Eric: I think there are two things that make security difficult. If you look at availability, we have models for availability. What are the odds that a backhoe is going to cut this fiber? What are the odds that a bit is going to flip in this again? What are the odds that a human is going to violate operational procedures and push code that wasn't completely tested?

And we have models for that and you build this chain of events, and you've got some idea of the likelihood of each of these events, and you multiply through and you come up with some level of assurance that this is an acceptable risk to take, or this is a risk that we need to mitigate this much, we need to drive the likelihood of this event down below this threshold. And when you're dealing with security, you're dealing with a motivated human adversary. There's some reason. And it may be just kids out for the lolz, they're rattling all the doors down the hallway, and if they happen to find yours, you might have an issue. But in general, you're dealing with a motivated human adversary and at that point, probabilities go out the window.

It's wildly unlikely that this event, followed by this event, followed by this event are going to happen unless there's a human at the keyboard making them happen. And I found a lot of engineers shy away from that kind of thinking. And I don't have an explanation for that. I don't know why, but the idea that you're basically playing a blind game of chess with an unknown adversary is unsettling to them.
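The availability reasoning Eric describes, multiplying the likelihood of each event in a chain and comparing the result to a threshold, can be sketched roughly as follows. The event names, probabilities, and threshold are invented for illustration, and the calculation assumes the events are independent, which is exactly the assumption that stops holding once a motivated human adversary is driving the chain.

```python
from math import prod

# Hypothetical per-year likelihoods for each link in a failure chain.
# These numbers are made up for illustration; they are not AWS figures.
chain = {
    "backhoe_cuts_fiber": 0.02,
    "backup_path_also_down": 0.05,
    "failover_automation_fails": 0.10,
}


def chain_probability(likelihoods):
    """Probability that every event in the chain occurs, assuming independence."""
    return prod(likelihoods)


def is_acceptable(likelihoods, threshold):
    """True when the chained probability is at or below the risk threshold."""
    return chain_probability(likelihoods) <= threshold


p = chain_probability(chain.values())
print(f"chain probability: {p:.6f}")
print("acceptable" if is_acceptable(chain.values(), 1e-3) else "needs mitigation")
```

With an adversary at the keyboard the events are no longer independent draws, so a product like this understates the risk; that is the gap between the availability model and the security one.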

Corey: And of course, there's more than one adversary and only one of you. And despite what you say, there is the perception that security is a—you only get to fail once, and then it's all over.

Eric: So, you have to be confident in what you're doing. And this is one of the things I love about the culture at AWS. It is an incredibly important thing. If you ask anyone in AWS security what my favorite word is, they will immediately respond ‘Escalate.’ We have a culture of aggressive escalation in Amazon in general, but definitely within AWS. And an escalation is not a vote of no confidence. It's not me saying that you're bad at your job, and I don't trust you, and I'm going to go get a second opinion.

Corey: “I'm going to grab your boss because clearly, you're incompetent.” Yeah, that is, in some cultures, how it's perceived. Not when AWS does it; when people in those environments do it.

Eric: That is correct. That is not what we're doing. We're saying, “I don't think we have the right decision-makers in the room, and rather than getting caught around the axle and having a repetitive conversation that doesn't converge, or having a Groundhog Day meeting where we have the same document and the same argument again and again and again, we're going to get the right decision-makers in the room and we're going to make high-quality, high-velocity decisions.” And so, if I'm uncomfortable with something, I know that I can pick up the phone and I can get a hold of literally any leader in the company. And they trust me: I'm not going to call Andy Jassy because something went bump in the night and I'm scared.

But if I need to get his attention, I know that I can get his attention. And I know that he will listen to what I have to say. And so given that, I know that if there's a decision that I'm uncomfortable with, if there's a path forward that's unclear, I can go get high judgment people that I trust to help me with that decision. And then when we make that decision, it's made with much higher confidence, and that enables me to continue to do my job, to continue to stare into the ambiguity of security. The other thing, I think, that makes security different from other disciplines is availability events happen much more frequently than security events.

And so we just have a larger data set, a larger training set. And so the humans that have to deal with availability issues, have dealt with them way more often than the humans that need to deal with large-scale security issues. And it's a much harder problem to quantify. And I think that's one of the things that the Cloud makes uniquely possible. I've spent a lot of time in security in multiple positions, and I have never had access to the data and the tools that I have access to here at AWS.

Between DNS logging, and Flow Logging, and CloudTrail, and all of the other data sources that we have, the amount of visibility that I have, the ability to reconstruct the past, to set up alarming, and then the tools to deal with this data—not just S3 to host it and all of the machine learning and analytics tools, but things like Lambda, where setting up alarming on a new condition is the job for an engineer for an hour, not a major system design—it has completely changed the way that I think about security and the way that the team thinks about security.
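As a rough illustration of why alarming on a new condition can be an hour of work rather than a system design, here is a minimal sketch of a Lambda-style handler. It assumes the function is invoked with an EventBridge-style event whose `detail` field holds a single CloudTrail record; the choice of condition (someone stopping or deleting a trail) and the shape of the returned alarm are assumptions for this example, not a description of anything AWS runs internally.

```python
import json

# Hypothetical condition: someone tries to turn off or delete audit logging.
# StopLogging and DeleteTrail are CloudTrail API actions; the rule is illustrative.
SUSPICIOUS_EVENTS = {"StopLogging", "DeleteTrail"}


def lambda_handler(event, context):
    """Assumes an EventBridge-style event whose "detail" holds one CloudTrail record."""
    detail = event.get("detail", {})
    if (detail.get("eventSource") == "cloudtrail.amazonaws.com"
            and detail.get("eventName") in SUSPICIOUS_EVENTS):
        alarm = {
            "alarm": True,
            "eventName": detail["eventName"],
            "principal": detail.get("userIdentity", {}).get("arn", "unknown"),
            "sourceIPAddress": detail.get("sourceIPAddress", "unknown"),
        }
        # A real function would notify someone here; printing only writes to the logs.
        print(json.dumps(alarm))
        return alarm
    return {"alarm": False}
```

The whole rule is one set of event names and one comparison, so adding another condition is a small change instead of a new pipeline.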

Corey: Something to emphasize is you're able to do all of that and have that visibility from the hypervisor and network perspective, but not from within the customer environment. And the fact that you could achieve all of this without effectively forcing your customers to make a privacy or data security trade-off is sort of its own minor miracle from where I sit.

Eric: So, I am very happy with how far we've gotten with the data sources that we have. I'm very impressed with the team and what they've managed to accomplish. One of the things that we think about: we're surrounded by constraints. I mean, that's the nature of all human endeavors. And so we have limited time, we have limited money, we have limited human resources, and the human resources are the biggest constraint.

[unintelligible 00:16:08] engineers are a hot commodity, and so every engineer-hour is precious, and making sure that we allocate those—and not optimally because then you wind up spending a lot of time optimizing and not actually delivering, but acceptably optimally is really important. And you look at the leverage of the coverage you're going to get for an invested engineer-hour. And something like Flow Logs was expensive to build, and analyzing Flow Logs is expensive to build as well, but every single thing in AWS talks IP. You can't get into or out of an EC2 instance without talking IP, and so Flow Logs gives us ubiquitous coverage, literally one hundred percent coverage; every packet is accounted for.

And that's huge. It doesn't matter what version of the kernel you're running, it doesn't matter what operating system you're running, it doesn't matter if you're playing with the latest container micro-operating system that we don't have support for, eventually, it's going to turn into IP packets and it's going to wind up in the Flow Logs. And so that's one of the things that we consider when we decide where to invest. And that sort of ubiquitous coverage is incredibly valuable.
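To make the point concrete from the customer side: whatever runs inside the instance, the traffic shows up as flow records that can be parsed the same way. The sketch below assumes the default (version 2) VPC Flow Logs record layout available to customers; it says nothing about how AWS analyzes this data internally, and the sample records use placeholder account and interface IDs.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]


def parse_record(line):
    """Split one space-separated flow log line into a field-name -> value dict."""
    return dict(zip(FIELDS, line.split()))


def bytes_by_action(lines):
    """Total bytes per action (ACCEPT / REJECT), skipping records with no data."""
    totals = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec["log_status"] != "OK":  # NODATA / SKIPDATA carry "-" in numeric fields
            continue
        totals[rec["action"]] += int(rec["bytes"])
    return totals


# Placeholder records in the default format.
sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]

print(dict(bytes_by_action(sample)))
```

Nothing in the parser depends on the operating system, kernel, or container runtime that generated the traffic, which is the leverage Eric is describing.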

Corey: For those of us who are doing things that are—how do I put it—not particularly serious in an AWS environment, where for example, I'm building a Lambda function to take the status page and make it sarcastic, and worse. And I'm having trouble with it. It's irritating on some level where I'm not able to push a button and grant support access into the environment to look at these things because it's a toy app, and I don't care. And it's easy to lose sight of the fact that, yeah, it doesn't matter. Whether it's a toy app that's doing some nonsense like that, or a bank that is doing something that is incredibly sensitive and valuable and regulated, I get the same level of protection as those workloads. And that's a powerful thing, though it's, I admit, easy to lose sight of that when it's two o'clock in the morning, and I just want the funny joke to work.

Eric: I hear you. And for me, this is one of the most enticing challenges of working at AWS. We don't have grades of service; we don't have different levels of complexity. We have a single suite of services that we offer to our customers. And a novice customer that reads a blog post and wants to try something out, is going to use the same EC2, the same Lambda, the same S3, the same IAM as our most sophisticated government or financial services customers.

And in fact, that novice customer may themselves work for one of these very demanding large customers. And this may be their first foray into AWS, and so today, they're a one instance, one Lambda, one bucket kind of customer, but they're going to evolve over time into one of these very sophisticated, very demanding customers. And so there's this continuum here. And you can't tell the customer, “I’m sorry, that was great. I'm so happy that you liked that. In order to move to the next level, you need to shut everything down, pack it up, and move it over here to the much more rich-featured, complex cloud.”

You have to be able to accommodate the getting started use case, and the mildly more complicated use case, and the early production use case, and all the way on through full corporate governance, multiple accounts, organizations, security audits, compliance audits, et cetera, in a single suite of services. And I don't think we've got it perfect; I don't think we'll ever get it perfect, but figuring out how to accommodate that entire spectrum of use cases in a single service and to grow with your customers and to enable them to tackle complexity incrementally as it becomes meaningful to them, is honestly my favorite part of designing a service.

Corey: The thing that continually eludes me is I accept as fact—because you've clearly demonstrated it—that you can handle, for example, the security in all its sharp and difficult edges around things like an EC2 instance talking to RDS and then storing something in an S3 bucket. That makes sense to me. I don't know how you did it, but you clearly have done it. But then you wind up with the almost Cambrian explosion of higher-level AWS services that are in machine learning, and, “Hey, we have this thing that talks to satellites in orbit,” and oh, “There's this other thing that's Lookout for Equipment,” which is apparently named after a sign on the factory floor somewhere. And all of those things in all those different directions have the same level of security guaranteed, despite what is, in many cases, a nearly completely alien workflow compared to what the historical expertise has been aimed at. At least that's what it seems like from the outside. Is that accurate? Is there something fundamental that I'm missing, or is this just another demonstration of Amazon doing its operational excellence thing?

Eric: This is my favorite thing about security—as opposed to designing an AWS service—is you have someone come to you, and for example, they say, “We would like to have a farm of iOS and Android devices that mobile developers can use to test their applications, and they're going to be awesome because they're going to be located right in our data centers, right next to the EC2 instances that they're using for their development work.” And you go to the bookshelf, and you pull down the big binder of policy—ask anyone in AWS security what my least favorite word is, and they'll tell you ‘Policy’—and the policy is you're not allowed to have mobile devices in the data center, you're not allowed to have cameras, you’re not allowed to have Bluetooth, you're not allowed to have WiFi. And so you run the flow chart that's in the policy, and the answer is clearly no. That is obviously the wrong answer. The right answer is, “Wow, that sounds cool. I bet our customers would love that. Let's figure out how to do it.”

Which leads to the next question, which is, “How?” I have no idea. I have never built a device farm before, but we're going to figure it out. And so we go and we find people that have the specific expertise that's necessary, but there are patterns that crop up over, and over, and over again. Multi-tenancy is really challenging, but it's an acquirable skill. Capacity management is really hard, but it's something that you can build expertise in.

And so we have a whole bunch of the fundamental building blocks lying around in different parts of the organization, it's just a matter of getting the specific knowledge necessary to apply to that domain, whether it's the device farm or ground station, or whatever absolutely insane idea our service teams are going to come up with next that's going to delight customers. And it's these crazy ideas, the ones that, prima facie, seem absolutely ludicrous that wind up being really, really valuable to our customers and totally feasible.

Corey: I would be remiss if I didn't make a feature request while I have you in a circumstance in which you can't possibly say no. Now, let me preface this with, I have never yet come to AWS with a feature request and gotten a response of, “Holy crap. We never thought of that.” The answer is always, “The reason we can't do it, quite like you're thinking, is nuanced and complicated.” And a couple of times I've been taken down that path, and yeah, there are dragons everywhere and computers are awful is what I take away from it.

But IAM is one of those really—how do I put this—esoteric things for an awful lot of people. It's easier to just grant access to everything, and then in turn—like, later, we'll go back and fix that. Yeah, ten years later, it doesn't work that way. We all write terrible things and we lie to ourselves and others about what we're going to be able to come back and do. It feels like there's an opportunity to build almost a warn-if-reject style IAM approach where, in a test environment—and please only use this in test environments—you could [have 00:24:19] run a Lambda function, for example, through its paces, and it looks at what function it was able to use, it’s allowed to do basically everything, and then it spits out a narrowly scoped down approach.

This is a sort of thing that people have been asking for for a long time, but to my understanding, the closest we've gotten is the IAM Access Analyzer. Is that a reasonable customer request? Is there something that winds up getting missed somewhere when people are asking for this? Or is this one of the ridiculously rare, “Wow, no one ever mentioned that to us. We'll get right on it,” moments?

Eric: I hate to disappoint you, Corey, but this is not the first time we've had this conversation with a customer.

Corey: Well, I am reassured by that, if that helps.

Eric: So, I think that things like IAM Access Analyzer are our preferred path here. And I think that over time IAM Access Analyzer will evolve to be more closely that kind of shrinkwrap that you describe, but what we've often found is that in order to get the right shrinkwrap policy, you have to exercise all of the functionality of that Lambda function, or whatever resource it is that you're attempting to shrinkwrap. If you miss any branches, and in particular you often miss the error branches, there are actions that your code takes when things aren't working well that are incredibly important to the survivability of your application. And so it turns out that everything's running fine for a long time, then there's some sort of failure. It's a failure that didn't occur while you were running in test mode to generate the shrinkwrapped policy.

And your code following exactly what you wrote says, “Oh, no. I have to post to this SNS topic in order to let them know that I've had a failure.” And it can't because that wasn't included in the policy. And that kind of latent failure is in some ways worse than an over-scoped policy. And so there's a balancing act here and it winds up, as you said, being nuanced and complicated in practice.
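The failure mode Eric describes can be shown with a toy version of the shrinkwrap idea. This is a sketch under stated assumptions: `shrinkwrap` and `is_allowed` are hypothetical helpers written for this example, the ARNs are placeholders, and the exact-match check stands in for real IAM policy evaluation, which also handles wildcards, explicit denies, and conditions. It is not how IAM Access Analyzer works.

```python
import json


def shrinkwrap(observed_calls):
    """Build an allow-only policy document from (action, resource) pairs seen in a test run."""
    by_resource = {}
    for action, resource in observed_calls:
        by_resource.setdefault(resource, set()).add(action)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(actions), "Resource": resource}
            for resource, actions in sorted(by_resource.items())
        ],
    }


def is_allowed(policy, action, resource):
    """Exact-match check only; real IAM evaluation is far richer than this."""
    return any(
        action in stmt["Action"] and resource == stmt["Resource"]
        for stmt in policy["Statement"]
    )


# Hypothetical ARNs, and calls recorded while tests exercised only the happy path.
BUCKET_OBJECTS = "arn:aws:s3:::example-bucket/*"
FAILURE_TOPIC = "arn:aws:sns:us-east-1:123456789012:example-failures"
happy_path = [("s3:GetObject", BUCKET_OBJECTS), ("s3:PutObject", BUCKET_OBJECTS)]

policy = shrinkwrap(happy_path)
print(json.dumps(policy, indent=2))
print(is_allowed(policy, "s3:GetObject", BUCKET_OBJECTS))  # seen during the test run
print(is_allowed(policy, "sns:Publish", FAILURE_TOPIC))    # error path never exercised
```

Because the error branch never ran during the recording, the generated policy has no statement for the SNS topic, and the failure notification is the first thing to break when something actually goes wrong.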

And this is one of the philosophies that we try and help our customers and our service teams understand is that you want to do successive refinement here. The tighter you make the policy, the closer to least privilege you get, the more work you're going to have to do with that policy. You're going to have to spend more time, and in the fully realized corporate governance version of this, there's going to be some other team that has to review your policy changes and approve them. And if you've got a really, really tight policy that allows exactly and only the things that you need, and then you add a feature and that feature happens to use a new SQS Queue, or takes some new feature of S3 and requires yet another API call that's not currently allowed, then you have to go through this whole process of getting this approval, and doing the review, and making sure that it's acceptable. And so as your applications mature, you want the policies to get tighter and tighter, you want the restrictions on changes to have a higher and higher bar, not just for security reasons, but for availability reasons.

The thing that you're playing around with on your own personal time, if it has a complete outage, no one's even going to notice; you might not notice. That production app that your customers are depending on, if it has an outage everyone's going to notice. And so you want to perform successive refinement here, where you keep making the policies tighter, you keep making the operations tighter until you get to a level that's appropriate for your current level of maturity, your current scale of operations, the criticality of the data you're currently dealing with. And so I'm not a huge fan of going all the way to least privilege right off the bat.

Corey: Forget dozens of visualization tools and view your entire system in one place with New Relic Explorer, the latest addition to New Relic One. See your system-wide health at a glance with a dense hex view that has your hosts, services, containers, and everything else.
And get an estate-wide view of sudden changes, so you can catch issues before they impact customers.
So go to NewRelic.com, sign up for free, and start exploring your system today.

Corey: Like everything, security feels like more of a journey than a destination. But that does change, for example, when you find yourself on the expo floor of RSA, at which point security is then transformed into something people are attempting to sell you. And my question across the board around that, I think is, do you see that there's a place in the security space for third party offerings to thrive in the context of a—assume a pure AWS environment along with the spherical cow. That's great. Is there a place for partners in that space?

Eric: Absolutely. One of the things that we say all the time is that we're not as smart as the aggregate of our customers. If you're building an AWS service, one of the ways that you know that you got it right is when you learn of some customer that's doing something with your service that you never anticipated, that's absolutely glorious and clever, and enabling for their business, and you got out of their way. And you never even thought about this use case, and they managed to do something that stuns you, even though you helped build this service. And so we're also not as smart as the aggregate of our partners, or as the aggregate of the internet as a whole, and we want to make sure that all of these people that have something to offer, that have these differentiating ideas that can make our customers’ experience in the Cloud better, have an opportunity to do so.

There's a set of fundamental building blocks that we have to own, things like EC2 itself, or S3, or IAM, or CloudTrail. There's a set of things that customers expect us to offer. For example, GuardDuty: the feedback from our customers is overwhelmingly clear that, as the owners of AWS and as the owners of CloudTrail, they expected us to have a service that would perform security analysis over those logs. And one of the data sources used by GuardDuty, one of our external security services, is CloudTrail. And so that was in response to direct customer feedback.

But we have a very rich ecosystem of partners that help customers out at all sorts of places, and some of these are born-in-the-cloud partners. Some of these are partners that have been working with our customers for years and have made the journey with them from their on-premises data centers into the Cloud. And there is a long and bright future there.

Corey: It seems on some level like there's a bit of a series of terms of art, or its own unique dialect in the security space, where compared to almost every other line of Cloud offerings, or SaaS offerings, or developer tool offerings, that it feels like it speaks in a much more enterprise-style focus way, even when marketing to startups. Is that just because it's so difficult to message that everyone is going from the same playbook, or is there a cultural aspect of infosec done properly at a lot of these companies, that means that I'm just not in that target market, so it's a language that isn't speaking to me?

Eric: You asked me a marketing question.

Corey: Oh, yeah, I'm trying to understand. You started off, once upon a time, as an engineering-focused type. I mean, you don't generally become a Distinguished Engineer without writing at least a couple lines of code. And you used to be hands-on-keyboard, and now you're talking to exactly those folks, and every time I talk to someone in the security space who does speak that dialect, they come away impressed at having spoken with you. So, that tells me that whether you like it or not, you do speak it. I'm just hoping you can sort of act as my security translator.

Eric: I do think that we've been very clear in our messaging, however. My boss, Steve Schmidt, who is the Chief Information Security Officer of AWS, has talked a lot very publicly about how we think about security and how we treat security as something that's baked in from the beginning; how our messaging with our customers is around helping them move forward with confidence, not about sowing fear, uncertainty, and doubt; it's about making the pie larger and enabling more people to succeed, not about scaring people off from doing things. And so I think, to a large extent, our security marketing, if not our security product marketing, is very much in our own voice. And I think it does a good job of conveying the message that we want to convey.

Corey: So, a challenge that I have to imagine is frustrating, if nothing else, is that the reality of AWS and the perception of AWS have some significant gaps, where on the one hand, it's the idea of two-pizza teams, and people iterating rapidly, and a bunch of small service teams, each building something as part of a collective whole, and on the other, you take a step back, and it's, “You’re Amazon; your market cap is measured in the trillions. Why is ‘Insert whatever thing annoys you today’ such a bad experience?” or whatever it is. How does that tension wind up manifesting in your world?

Eric: So it's true, that the security team has gotten to be reasonably large. And you look across all of AWS—and I've been with the company now for 13 years, and it is dramatically larger than it was when I started. But the job that we're taking on is also dramatically larger than it was when I started. It does not feel like our budgets have gotten any richer. It's just customer expectations have gone up, the expectations in terms of compliance, in terms of security, in terms of availability, in terms of operational excellence, have all gone up at the same time that we've been launching more and more services and features.

And so we're still incredibly parsimonious with our engineer time. And a lot of our best security tools are things that an engineer was tired of dealing with and they went off; in the space of a couple of days, they made an absolutely horrendous prototype. Like, this code should never even have been typed into the computer in the first place, but it made their lives better; it made their job easier, and so another engineer contributed some code and it became a little bit less eye-searing. And over the course of a couple of months, we wind up with a system that's actually really useful. And at some point, you have a discussion, you're like, “Wow, this thing is no longer really useful. This thing is essential to our operations. We've hit another level of scale, and if we didn't have this automation, we wouldn't be able to keep up anymore.”

And so then you build a team around it. And when we say build a team, there's the whole two-pizza team thing, and we don't really talk about buying pizzas and thinking in those terms, but these tend to be very, very small teams (you know, a handful of engineers, a software development manager), and now that team owns that thing, and they evolve that thing. And all of the security tools that I see that I really like are things that started off as small tactical answers to an actual problem that we had, and that accreted functionality over years. And it usually means that they're not beautiful, that there isn't some grand design that some architect sat down and sketched out and thought about all of the future scaling concerns; it means that they tend to be kind of patched together and evolved and as-built. But the reality is that the grand designs that the architect sits down to sketch out usually don't take into account the future that actually happens, and so you wind up with patches and changes, and emergent feature requests anyway. And it is incredible how quickly that value accretes.

Corey: Oh, absolutely. People are familiar on some level with the idea of the mythical man-month; it feels like this is almost a parallel of that, the mythical “Just throw $5 billion at it and wait,” where throwing additional resources at a problem doesn't lead to better outcomes and in many cases can lead to materially worse ones.

Eric: Absolutely. And so when I'm talking to customers and they want the tools that we have, one of the reasons that our tools are valuable is that they're tightly integrated with the way we do things. At Amazon, we have a ticketing system, and everything is a ticket: if your laptop needs more memory, it's a ticket; if you want to bring your dog to work, it's a ticket; if the website is down, it's a ticket; if your parking token doesn't work, it's a ticket. Everything is a ticket. And so all of our security tooling is integrated with the ticketing system; we even have security tooling that monitors the ticketing system to make sure that the tickets we've already cut are in a healthy state, and to take metrics on that, so we can report on it, so we can understand if we're spending our time in the right places.

And none of that integration translates. And so what I tell customers that are looking to get started on this journey, customers that want the kinds of tooling that we have, I tell them to just get started. Rather than writing a catalog of all the things you'd like to check and all the Lambdas you'd like to write, just write one. Just pick one.

Corey: Check a single thing, write a quick three-liner, that'll do it, and see how it goes.

Eric: Yes.

Corey: Yeah.

Eric: And the most important thing is not that you have that check, it's that you have the feedback loop. It's that the next time something goes wrong, you think, “Why did this go wrong? What can I check that would prevent this from going wrong?” And then you add that check. And so over time, you're going to accrete this library of validations.

And the way we think about this is in terms of invariants. We call them security invariants. These are statements that should always be true. And they can be incredibly simple, like, “This IAM policy matches this text document exactly.” Or they can be incredibly nuanced, like, “There is no path from the internet through any combination of nodes to any host that's tagged blue.”

And so the validators can be very simple; they can be very complicated. You build this library of invariants. And every time something happens that you don't like, or during the application security process ahead of time, you come up with invariants. And you just keep building this library of invariants. And every single time we've done this, the library of invariants that we've wound up with is very different from the library of invariants we thought we needed, and because it's driven by things that have actually happened or things that we specifically identified in our threat models, they're the things we actually need. And that value accretes incredibly quickly.
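[Editor's sketch: the two example invariants Eric describes, an exact policy match and a "no path from the internet to anything tagged blue" check, can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions, not how AWS implements its tooling. The function names, the hard-coded policy text, and the toy network graph are all hypothetical; a real check would load the deployed policy and the topology from your own environment.]

```python
from collections import deque
from typing import Callable, Dict, List, Set

# An invariant is a named check that returns True when the statement holds.
Invariant = Callable[[], bool]


def policy_matches_document(deployed_policy: str, expected_policy: str) -> bool:
    """Simple invariant: the deployed policy text equals the reviewed text exactly."""
    return deployed_policy == expected_policy


def no_path_to_tagged_hosts(
    edges: Dict[str, List[str]], source: str, tags: Dict[str, Set[str]], tag: str
) -> bool:
    """Nuanced invariant: no node carrying `tag` is reachable from `source`."""
    seen = {source}
    queue = deque([source])
    while queue:  # breadth-first search over the (hypothetical) network graph
        node = queue.popleft()
        if tag in tags.get(node, set()):
            return False
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return True


def run_invariants(library: Dict[str, Invariant]) -> List[str]:
    """Evaluate every invariant and return the names of the ones that failed."""
    return [name for name, check in library.items() if not check()]


# Hypothetical data, hard-coded so the sketch runs without any cloud account.
EXPECTED = '{"Version": "2012-10-17", "Statement": []}'
DEPLOYED = '{"Version": "2012-10-17", "Statement": []}'
EDGES = {"internet": ["load-balancer"], "load-balancer": ["web-1"], "web-1": [], "db-1": []}
TAGS = {"db-1": {"blue"}}

# The library grows one entry at a time, each added after an incident or a threat-model review.
LIBRARY: Dict[str, Invariant] = {
    "policy-matches-reviewed-text": lambda: policy_matches_document(DEPLOYED, EXPECTED),
    "no-internet-path-to-blue": lambda: no_path_to_tagged_hosts(EDGES, "internet", TAGS, "blue"),
}

failures = run_invariants(LIBRARY)
print(failures or "all invariants hold")
```

[The design choice worth noting is that the library is just a dictionary of named checks: adding the next invariant after the next incident is a single new entry, which keeps the feedback loop cheap.]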

Corey: It's a matter of taking a bunch of little things and composing them into something fantastic at the end. It's almost like the microservices story, or some of the architectural diagrams that list a borderline sarcastic number of services, but the outcome is really neat.

Eric: Absolutely. And over time, you will learn that past-you was not as smart as current-you, but that's fine. The principal engineer community has a set of tenets, and one of the tenets is ‘Respect what came before,’ and it's incredibly important to me as an engineer. I've been around long enough that I've seen things where I've said, “Oh, my gosh, what idiot did that?” And you look in the source repo (it's git blame these days, but it was CVS blame back in the day), and your name is next to that line.

Corey: And then you immediately fire up git blame someone else.

Eric: No, no you own it. Like, “I made this decision.” And this—

Corey: Yes. That’s why you use the tool that rewrites history. So it's someone else's fault and not your own. Oh, yeah, I'm right there with you.

Eric: So, the idiots that built the systems of the past weren't idiots. In fact, they're the ones that got us to where we are today. Those systems are what enabled our current business, our current success. Now, every single thing I've ever worked on we've outgrown. You got a couple orders of magnitude scaling out of your design, and then you've got to go back to the drawing board, but you do so making sure that you respect what came before; that you value the systems that got you to where you are, even though they've scaled beyond their utility, even though you think they're old and broken, they embody lessons; they're wise; they're battle-tested.

And you make sure that you take as many of the lessons as you can from the systems that got you to where you are, and you treat them with respect, even as you turn them off in favor of the new shiny thing. And after you've been through that cycle a couple of times the new shiny thing is going to be one of those legacy systems someday soon.

Corey: I tend to view legacy through a lens of being a disparaging engineering term for ‘It makes money.’ It turns out that unlike what we learn in conference talks, you can't generally throw the entire banking system away and replace it with something you built in a weekend off of Hacker News. So, I have an awful lot of sympathy for not just the greenfield stuff, but how you get what exists today into an environment that is better tomorrow? And there's no easy answer.

So, I want to thank you for taking so much time to speak with me about what you're up to, and how you folks view these things. If people want to learn more about what you're up to, where can they find you?

Eric: So, I am on Twitter, @ebrandwine. I'm not very good at the whole social media thing, so caveat emptor. We also have a wealth of material on the AWS Security Blog. A lot of the stuff that I've talked about here about how we think about security and about making incremental progress is well covered there.

Corey: Excellent. We will, of course, throw links to that in the show notes. Thanks so much for taking the time, I really appreciate it. One never knows what one's reputation is with different groups at Amazon; there's no unified single opinion Amazon has, so it's nice to know that at least some people will still take my calls, and it's very much appreciated.

Eric: I think that your taste is terrible, and the fact that you had me on just confirms that. And the fact that anyone wants to listen to this is mind-boggling to me.

Corey: One person's trash is another person's treasure, and I'm generally the trash. Thanks so much. I appreciate it.

Eric: Thank you, Corey. It's a pleasure.

Corey: Eric Brandwine, Distinguished Engineer and VP at AWS. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment saying that actually there's a job negative one, and tell me what it is.

Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com, or wherever fine snark is sold.

This has been a HumblePod production. Stay humble.

2021 Duckbill Group, LLC