Screaming in the Cloud | Transcript: A Chaos Engineering & Jeli Sandwich with Nora Jones

February 11, 2021 • 30 Minutes

Nora Jones is the founder and CEO at Jeli, makers of an incident analysis platform that leverages data to recommend productive solutions to the problems at hand. Before this role, she was Head of Chaos Engineering and Human Factors at Slack, a senior software engineer at Netflix, and a team lead/senior software engineer at Jet.com, among other positions. She also had a four-month stint working on restricted research for the U.S. Navy and literally wrote the book on Chaos Engineering. Join Corey and Nora as they talk about just what the heck it is that Jeli does, how incidents can help organizations learn more about themselves, what it was like to work at Jet when it was scaling rapidly, how if everything is an incident then nothing is an incident, why businesses need to define exactly what an incident means to them, what the purpose of chaos engineering is, the unintended positive consequences of chaos engineering, why Nora thinks the word ‘post-mortem’ should be removed from the incident response lexicon, what’s surprised Nora as her role has evolved over her career, and more.

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of Cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Nora Jones who, despite having a storied history, is probably best known these days for being the founder and CEO of Jeli. Nora, welcome to the show.

Nora: Thank you, Corey.

Corey: So that's Jeli—J-E-L-I. I’ll avoid the various jam puns we could wind up going with. Let's start at the very beginning. What the heck is Jeli?

Nora: So, first of all, please never stop using the puns. [laugh]. We all use puns internally; we call ourselves ‘Jeli Beans,’ so it's complete pun-dom in our Slack. But Jeli is an incident analysis platform. So we've built the first incident analysis platform that allows companies to not only learn from their incidents, but address everything they can from them so that they're actually understanding what's contributing to some of their major failure modes, what they're doing well at versus what they think they're doing well at, and exposing the delta between those two worlds.

And honestly, using incidents as a catalyst for helping orgs understand themselves better so that they can make better decisions. This can lead to things like helping them with their OKRs, helping them with their headcount on teams, it can lead to a number of things. Really, Jeli is using the incident as a catalyst for helping you understand how you think your org works versus how it actually works.

Corey: Okay, let's back up a little bit. You have a history as a software engineer. You were at Jet, and then you wound up at Netflix, you quite literally wrote the book on chaos engineering, then you went to Slack, and now you found a company of your own that's aimed at this. What's the common thread there?

Nora: So, I was actually in hardware prior to Jet and I was working on reliability there. And I've been focused on developer productivity, reliability, honestly, my entire career. And I was seeing some of almost the exact same flavor of incidents happening at some of the companies I was working at.

Corey: It's always [DNS 00:04:03], a disk fills up, that sort of thing, or are you talking about something beyond that?

Nora: Honestly, even with certain tools. Like, I love Consul, I think it's a really great mechanism, but I was seeing the same types of Consul incidents at certain companies, like, five or six years apart from working at them. And I thought that was kind of incredible how folks were using it in the same unintended ways that were leading to certain failure modes. And I just started thinking, “Wow, it would be really helpful if we, as an industry, even started sharing this stuff with each other a little bit more.” And also even giving these companies tools to understand it more.

When I was at Jet, we were having incidents pretty regularly. Our marketing team was crushing it and we were just growing really fast. And it was a trade-off at the time and we had amazing engineers. It was just, it was a lot of speed. And so we were trying to figure out what to do with incidents at that time.

Corey: Well, when you say, a lot of incidents, I mean, I've worked in shops where that means—so how many—what is a lot? And everyone's going to have questions about that. From my perspective, it's, “Oh, we've had eight incidents.” “Oh, well, of course. Of your company? That’s not that—” “No, no. This morning.”

And it comes down to at that point when everything's an incident, nothing is, and everyone feels sort of trapped into wherever they are because architectural decisions, because of business requirements, et cetera. And it's easy to sit here without running an infrastructure myself and have these conversations. But when you're in the middle of it, it feels exhausting, never-ending, and the rest.

Nora: And that's such a good point, Corey, is that what is an incident at our company? And I've challenged a few orgs that I've worked with in the past to ask—I’ll say, “If I asked five different people at this company, ‘what is an incident here?’ How many different answers am I going to get?” And I'll usually get the answer, “Five.” Some folks will be cheeky, and say, “You'll get 10 answers.”

But that's the problem at some of these organizations is that when everything's an incident, nothing is. And so there needs to be some key business metrics, and that has to be something that people can consistently answer from legal all the way to engineering, to marketing. Everyone should be able to have a consistent answer on what an incident means. And I don't mean, like, a document that you can pull up that's five pages, that I have to figure out if this is an incident, what level incident and I'm trying to find the document, I've wasted ten minutes at this point, and it's two in the morning, and I'm really tired. None of that should happen.

This should just be kind of a consistent KPI-level metric that you can grok and people can pool on. And I realize that's different with different products, but at Jet, we were having a lot of incidents. And when I say incidents, I mean we were having incidents every day. And I worked with amazing folks there that were working around the clock to pull things back together and some of the best and brightest engineers I've ever seen. But when I went to Netflix, I noticed that everyone kind of knew what an incident was at Netflix.

Everyone knew what the key business metrics were, and how they impacted things, and it was just, there was this alignment. And so it was never a question on whether something was an incident, on whether something was worth waking up at two in the morning for, it was just understood and baked into the fabric of the culture pretty early on.

Corey: Unfortunately, this kind of doesn't help in some respects because it feels like it's just another example of, “Oh, well, Netflix is of course, otherworldly, and far beyond what any other mortal company could wind up doing.” And I don't know that that's necessarily true. But it also feels reminiscent of chaos engineering, insofar as getting buy-in to fix things by breaking them on purpose is often a very heavy lift for folks who can't get to a point of stability. Similarly, it feels like learning from incidents is going to be very hard with respect to finding the time to do it when you're buried in them. It almost feels like you have to educate your customer before you can help them. Is that at all accurate or am I misunderstanding something dramatic?

Nora: No, it's a really interesting point, Corey. And when I was at Jet when we were having all of those incidents, we were kind of reaching a point where we were like, “Let's just try a few different things.” And I think a lot of companies reach that point where they're willing to try something, where they're not wanting to wake their engineers up anymore. And so that's when we started trying chaos engineering. And it was helpful from the perspective of helping us understand our culture a little bit more in who we need to rely on.

The hard part was figuring out what to fail, where to inject chaos, and even what to do with the results afterwards. As a lot of software engineers do, we kind of thought of it almost as a “let’s automate it away” situation: we can just have a tool running in the background, but that kind of defeats the purpose of chaos engineering. And so when I went to Netflix, I was really excited to join the chaos engineering team there because Netflix had made this percolate and work throughout the company, and I was really excited to be in an organization where it was so widely understood. But as I worked on the team a bit more, I mean, I was programming probably, like, 80% of my time at Netflix, as was the rest of the chaos engineering team, but when I went to check who was actually using the chaos engineering tools on our team, it was mostly the four of us building these tools. Which was fine from a certain perspective, but we weren't getting enough ROI out of it.

The whole purpose of chaos engineering is to actually learn where your weak spots are so that you can be a bit more proactive to them. And we were focusing very much on the injecting failure, and we were really focused on mitigating the blast radius so we could do it safely. We were doing some very fascinating things technically, and we were doing some really great stuff with distributed systems in general and working with other teams, but what we weren't super focused on was the creating the experiment, and what to do with the results. And so it was usually us nudging people to create experiments, some folks would put them in their continuous deployment pipelines and stuff, which can add a little bit more of the benefit. But actually sitting and taking the time to think about where you want to experiment and what you want to do with the results, causes, I think, a bit more ROI from doing chaos engineering.

And so I realized those were problem areas at Netflix. And so I started analyzing incidents to try to make a catalyst for like, “Okay, here's where we should create chaos experiments, and here are the areas where we need to do a lot of stuff with the results.” Basically, I started looking at incidents to try to help my chaos tools a little bit. And then I realized there was so much more to learning from incidents than that. I was finding things like, wow, we bring in this particular engineer all the time.

They are a knowledge island in this organization. Or this team is severely underwater right now. Maybe we should staff them up. And so, yes, it was helpful in informing where we should chaos experiment and what we should do with those results, like if we should prioritize them in our action items, but it was also helpful in a number of things in the business. And we started writing incident reports that were getting read by folks all over the company.

And people were learning more about the system because we had taken a deeper look at analyzing these incidents. And when I say analyzing these incidents, I mean looking through the chat transcripts, understanding who got paged, figuring out what team they were on, what tenure they have, things like that. And so that was clearly beneficial, and it was stuff I started doing at Slack, too, but it was a lot of manual work. And I can't imagine that most companies would invest that time doing that manual work. And so we wanted to help do some of that for them.

And so, basically, Jeli gives you shoulders to stand on with your incidents so that you're not coming in at zero. We're directing your attention towards places that could use your attention organizationally. And so this post mortem that you're doing is not a chore or a checklist item just because it's part of this process you engrained five years ago; it’s actually something that's useful for you. So we're showing you places that deserve more attention, maybe an engineer that got brought in that we hadn't planned on being there, and understanding what specific knowledge that they had; or understanding that with Kubernetes incidents, we don't do a great job as an organization figuring out the right folks to get in the room; or we throw out a lot of theories before we actually figure out what's going on. These are the things that we're showing you. We're helping you get to those places faster so that you can do your post mortems faster. But we're also enhancing the quality of the output at the same time.

Corey: So, when I look back to my operations days, the dealing with incidents was always obnoxious. Let me walk you through a minor example of one and then you can figure out I guess, well, the audience can figure out more easily what is wrong with the places that I've worked. So, things are breaking, getting the right people on the call is important and almost impossible, so you wind up with your great, great grand-boss on, and you have the CEO breathing down your neck—“Is it fixed? Is it fixed? Is it fixed?”—and then it finally comes up and cool, now it's time to do a post mortem, but we don't call them that, so it's going to be an incident retrospective.

And you're sitting in the room and it's a blameless post mortem. Cool. And you say great. “So, that engineer over there screwed this up.” It's like, whoa, whoa, whoa: blameless. Okay. So, an unnamed engineer screwed this up. And it becomes an iterative process, and invariably, it almost feels like it's a “justify why you're still good at your job” exercise at a lot of these places. Help.

Nora: Yeah it is.

Corey: How do I fix that?

Nora: It's what people know. If your incident is hitting Twitter, or you have customers calling, that's when someone from your C suite is probably going to jump in and they're probably doing more harm than good. We're actually giving you tools to show you where some of that is hurting. If certain folks jumping in are hurting or helping the situation, we're allowing you to analyze that a little bit better so that you can reduce your costs of coordination during these incidents. And I think costs of coordination are not something that folks tend to look at.

I think a lot of companies look at what quote-unquote, “Caused the incident,” and they look at, quote-unquote, “How to prevent it from ever happening again,” when really, they should also be looking at how they worked together in that moment. Did this involve teams that had never spoken to each other prior to this event? Had this—

Corey: Well, not politely anyway.

Nora: Yeah. Had they been in an incident before? Did the CTO actually hurt jumping in or were they helpful? Who knew the right people to bring in the room? How many hops did it take to get to those right people?

And even imagine a world where you can understand who exactly has the information that you're looking for in a particular moment, and allowing the incident to go much more smoothly. And also just showing you how it didn't go smoothly. I think post mortems and incident reviews—and I don't like the word post mortem either; I tend to use incident review, but if that's a word that you want to hold on to, you can hold on to it, but I recommend making little changes at a time. It's already a super-charged event; removing the word post mortem can make it a little less charged, and I think having a tool to help you during that event to point to areas can also make it a little less awkward of a situation where it doesn't feel as blame-y and finger-pointing, even if you are using the term ‘blameless post mortem’ to describe what that meeting is, it can still sometimes feel like that.

Corey: Whenever you talk about something like a tool in this space, I start getting flashbacks to a number of—I don't want to say failed attempts, but that's really what they were—looking at previous patterns and then making AI or machine learning-driven suggestions about what the outage is likely to be. Which generally means you're trying to swindle someone; if you're not sure who it is, it's probably you. And it looks at previous things and it pops up—if you ever get it to this level of development—with exceedingly unhelpful things. Like, “Last week, a disk filled up. Maybe this time, it's a disk that filled up,” when it is very clearly not that.

And with anything that's suggestion-oriented or machine learning-based, it feels like two or three bad suggestions in a row means that no one will trust anything it has to say forever, even if it improves. I mean, take a look at the various digital assistants we have floating around. When you ask Siri to do something and it doesn't work the way you expect it to, you feel a bit dumb for having asked in the first place. Never mind the fact that a week later it does that thing; you won't go back and try it again.

Nora: Yeah, I completely agree. And I am so skeptical of AI ops and something that's automating everything for you, I think where we need to go as a software engineering industry—and some insights are helpful, and bubbling those things up for you is super helpful. I think about different things that GSuite does sometimes; some of the automated responses are useful, some of them are not. Setting up the Zoom meetings accordingly. But I think the tools that work the best are the ones that you treat like a member of your team, almost.

It's not something that's doing something for you, it's something that you're working with to achieve the best outcome. And they're still putting something on the table. And that's the mindset that we're building Jeli with. We are showing you some insights, but you still have to do some work on your own. What we're really giving you is a playground to play with your incident a little bit more.

Something that's dedicated and built to help you understand this incident a little bit more. To the point where if you signed on to the incident, we've given you some areas to direct your attention, but you still need to put in some time to understand those areas as you would any post mortem. We want to help you facilitate that discussion so that it's productive, and people leave the discussion feeling like, “Wow, this was really a good discussion for us.” It's not like we are telling you which questions to ask or which things to fix, like the disk usage stuff. We are just giving you focus areas so that you can see themes over time.

And I think how we're really different, too, is we are really focusing on the people and how to best enable and help them. I've done this sort of pattern analysis and incident analysis at a number of organizations, and it is useful, and it can provide a lot of recommendations. And I don't want to just automate exactly what I was doing at these organizations, but I do want to automate the beginning stages of that. And that's what we're doing with Jeli is letting you not start from scratch with a post mortem. Because let's face it, when you get assigned a post mortem, you're like, “I have to remember how to do this. Okay, let me open up a Google doc. Okay, let me pull up 15 Chrome tabs. Okay, I needed to DM so-and-so—”

Corey: Or it’s the other side where you're so used to it, it's habit, and it winds up auto completing automatically. And that doesn't feel great either.

Nora: No, it doesn't. And so, we're helping that be productive. We're helping you look better. We're helping your organization be a bit more collaborative in these events, and just feel more confident about your incidents.

Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.

Corey: So, help me understand, who's your target customer for something like this? Is it going to be the hyperscale companies who already have attained a certain level of operational maturity? Is it a brand new startup that has just committed their first line of code yesterday and won't figure out until the end of the month that it has a disastrous effect on their AWS bill? Or is it folks in between? I mean, who is your ideal customer these days?

Nora: We're working with a number of different companies right now that are getting value out of Jeli in very different ways. We're working with a Series B company that has about 100 people; we are working with a company that has 10,000 people that's been around for a while; we're working with a company that can measure the exact cost of their incidents at this point. The primary criteria is that it's companies that have incidents right now and they want something a little bit better. I think everyone in the industry doesn't feel great about their post-incident process—I think that's a pretty common thread—and wants it to be a little bit better. And so we want to help that be a more delightful and less, kind of, awkward experience for folks, that they're actually getting value out of, and that it's helping with their internal relationships, it's helping with their customer relationships, it's helping with a number of things.

Corey: So, framing it a little bit differently. If I'm an engineer, and I'm constantly frustrated, by the way that incidents seem to always happen, turn into blame festivals, et cetera, is that something Jeli can help with? In other words, what is the pain that I have that is going to transform into you jumping up and saying, “Yes, yes, that's what we fix.” What is the symptom that lets me know as I walk through the world, that I'm a prospective fit for what Jeli is doing?

Nora: So, the pain that engineers are experiencing right now, or anyone that has to write this post-incident document, is a pain around creating a timeline, and copying and pasting items, and figuring out what to focus on in their meeting, and the time it takes to do a good job doing this. So, we reduce that time for you, we make that faster for you, and we enhance the quality of the output. And so the target audience, the target pain point is folks that have pain putting this together today, and feel like it's a chore, and feel like it's seldom a great experience. We want to help make that faster for you. And we want you to have a better and different experience. And so that's really the pain, which is why we're working with an array of companies because no matter where you are, you are having incidents if you're having customers. And so it's a matter of how you're addressing those—

Corey: Well, not if you ignore them sufficiently.

Nora: That's true. [laugh].

Corey: But if you do that they become not customers anymore and that sort of solves the problem, but not in the way that anyone really wanted it to.

Nora: Yeah.

Corey: So you've been a software engineer, you've been a senior technical leader at a number of different companies, and now you're a founder. What has changed for you or surprised you the most as you went through that path?

Nora: [laugh]. Yeah, it's an interesting question. A lot of things. I mean, I'm definitely a software engineer at heart. So, I still love architecting and writing code. And it's definitely been a big shift to enabling the folks around me to build in this vision, too, and add to it.

I think that's been, it's not really a surprise because I think we have a really great team, but it's been amazing. It's exactly what I want to be doing right now. I can't imagine doing anything else right now. It's just, I kept having this itch at every company I was at that there has to be something better around incidents. And after seeing these patterns at a number of places, I got the urge to go build it. And there's a lot of folks in the industry that are feeling this pain too, and are dedicated towards making a better solution. And I think that's been a lot of fun.

Corey: It's hard to go back on some level. Once you've started a company and had the autonomy, it's scary, and it's hard, and it's one of those, I don't ever see a future where I go back to what I used to be. It's sort of a one-way door that you never really realize when you're going through it.

Nora: Yeah, absolutely.

Corey: So, something I've noticed about every company, no matter what it does, I mean for my own where I fix Amazon bills, people have hilarious misunderstandings about it. In my case, it's, “Oh, great. How can I save money on socks?” And the answer is I don't really have a good answer, except I actually kind of do. If you get their Prime credit card, it knocks five percent off, but don't quote me on that. And even if it's something relatively straightforward, people don't always get it. What are the most hilarious misapprehensions you've seen so far about what Jeli does?

Nora: I think some of what you alluded to earlier. It's the AI ops kind of thing. We're certainly providing insights for folks, but you're also participating in the insights. And it's not this AI-focused engine. I think folks are not used to understanding the value that they can get from looking at the chat transcript, but there is so much in there that is just kind of waiting to be analyzed. And I get it, I don't want to go read a Slack conversation after it occurs. And so we're making it easier for you to do that, and glean those insights so that you can get the most value out of them. But.

Corey: Yeah. Assuming you can, that alone is valuable. People have always been saying, “Oh, the chat logs become super valuable just as soon as we learn how to work with them.” And they’ve been saying that for 15 years—

Nora: Right.

Corey: —but I've yet to see it really become valuable. I don't find myself scrolling back to look at how conversations unfolded. I search for specific terms: “Oh, there's the URL I was looking for. Oh there's the image.” Getting more signal than that seems inevitable, but I don't see people doing much with it yet.

Nora: No. And what I was doing at various companies I was at is, I was reading the chat transcript. I would sometimes print them out, go at my desk, highlight them, write notes on them, figure out who the people were, figure out what teams they were on, figure out who was getting paged, and I just, you know, I ended up having [crosstalk 00:26:20]—

Corey: “D minus. You can do better than this, please see me after class.” And then mail it to someone.

Nora: [laugh]. I ended up having a desk that looked like a crime scene where I had sticky notes everywhere, and I had yarn attached to different sticky notes, and just trying to connect all the pieces. Like, an investigation was unfolding because that's exactly what it was. And what we're doing is—

Corey: Oh, if someone didn’t know any better, they’d think you were trying to put together Google's messaging strategy.

Nora: [laugh]. But at Jeli, we're aggregating all of that for folks. So, you can have a more comprehensive picture about how people were coordinating in that moment so that you can reduce those coordination costs in the future. And no one's really looking at that today because it's not easy to do. But there's so much data in there that could actually help you really, really improve and make incidents not such a stressful, time-consuming experience.

Corey: So, one other thing that you've been involved with that I wanted to make sure that we got to talk about was you are also the founder of the learningfromincidents.io. Is it a community? Is it a movement? I'm not entirely clear, but it sounds directly aligned with what you're doing now, what you have been doing, and what Jeli is setting out to solve for. What is the relationship between Learning From Incidents as an entity and Jeli as an entity?

Nora: Yeah. While I was at Netflix and I was getting more deeply into incident analysis, I kind of had this thought, “Wow I really want to talk to folks from other organizations that are also looking at incidents under a deep lens. Surely there are more folks.” And I posted something on Twitter, and I think I got, like, hundreds and hundreds of DMs that night. And I kind of wanted to get like-minded folks together so that we could share our experiences and learn from each other in, kind of like, a safe space.

And so I started a Slack community around that, and I got some great people in it. And as we talked for over a year together, we kind of wanted to open-source some of these learnings. As I was mentioning earlier, it would be so helpful if companies talked a little bit more openly about their incidents and understood that they can do so without revealing proprietary business information. And so we started open-sourcing some of our learnings on the learningfromincidents.io website.

And so it's a community of folks that want to use incidents as a catalyst for helping their organization and helping their businesses, but it's also a place to open source some of those learnings and get stories from folks that are doing it as well. And so, yeah, it's a movement; it's definitely a community; it's both of those things. And I think it's a new take on how the software industry is progressing in the reliability world is kind of taking a more human-centered and learning-focused approach because ultimately it can be really good for your business.

Corey: So we've covered an awful lot of ground over the course of this episode. What are the next steps? What should people who are interested in what you're up to do next if they want to learn more, or figure out whether they're potentially a fit for some of the stuff that you're talking about, and offering solutions to very real painful problems?

Nora: Yeah. So, for learningfromincidents.io, definitely go to the webpage and read some of the posts. There's posts from brilliant folks in that community that are actually doing real things, and chopping the wood, and carrying the water. And my focus with that website was I don't want to talk about the theory, I actually want to do this stuff at companies and have folks talk about how that worked out.

And that's what that website is. And I think if your organization is not feeling like you're getting a lot out of your incidents right now and wants a boost, and you're interested in Jeli, you can use the Contact Us form on our webpage right now and we'll reach out to you to set something up.

Corey: Excellent. We will, of course, include links to that in the [show notes 00:30:12]. Nora, thank you so much for taking the time to speak with me today. I really appreciate it.

Nora: Thanks, Corey.

Corey: Nora Jones, founder and CEO of Jeli. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice along with a lengthy comment arguing about exactly whose fault it is.

Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com, or wherever fine snark is sold.

This has been a HumblePod production. Stay humble.

2021 Duckbill Group, LLC