How Cloudflare is Working to Fix the Internet with Matthew Prince
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. One of the things we talk about here an awful lot is cloud providers. There sure are a lot of them, and there are the usual suspects that you would tend to expect to come up, and there are companies that work within their ecosystem. And then there are the enigmas.
Today, I’m talking to returning guest Matthew Prince, Cloudflare CEO and co-founder, who… well first, welcome back, Matthew. I appreciate your taking the time to come and suffer the slings and arrows a second time.
Matthew: Corey, thanks for having me.
Corey: What I’m trying to do at the moment is figure out where Cloudflare lives in the context of the broad ecosystem because you folks have released an awful lot. You had this vaporware-style announcement of R2, which was an S3 competitor, that then turned out to be real. And oh, it’s always interesting when vapor congeals into something that actually exists. Cloudflare Workers have been around for a while and I find that they become more capable every time I turn around. You have Cloudflare Tunnel which, to my understanding, is effectively a VPN without the VPN overhead. And it feels like you are coming at building a cloud provider almost from the other side than the traditional cloud provider path. Is that accurate? Am I missing something obvious? How do you see yourselves?
Matthew: Hey, you know, I think that you can often tell a lot about a company by what they measure and what they measure themselves by. And so, if you’re at a traditional, you know, hyperscale public cloud, an AWS or a Microsoft Azure or a Google Cloud, the key KPI that they focus on is how much of a customer’s data are they hoarding, effectively? They’re all hoarding clouds, fundamentally. Whereas at Cloudflare, we focus on something that’s very different, which is: how effectively are we moving a customer’s data from one place to another? And so, while the traditional hyperscale public clouds are all focused on keeping your data and making sure that they have as much of it, what we’re really focused on is: how do we make sure your data is wherever you need it to be, and how do we connect all of the various things together?
So, I think it’s exactly right, where we start with a network and are kind of building more functions on top of that network, whereas other companies start really with a database—the traditional hyperscale public clouds—and the network is sort of an afterthought on top of it, just, you know, a cost center on what they’re delivering. And I think that describes a lot of the difference between us and everyone else. And so, oftentimes, we work very much in conjunction with them: a lot of our customers use hyperscale public clouds and Cloudflare, but increasingly, there are certain applications, there’s certain data that just makes sense to live inside the network itself, and in those cases, customers are using things like R2, they’re using our Workers platform in order to be able to build applications that will be available everywhere around the world and incredibly performant. And I think that is fundamentally the difference. We’re all about moving data between places, making sure it’s available everywhere, whereas the traditional hyperscale public clouds are all about hoarding that data in one place.
Corey: I want to clarify that when you say hoard, I think of this, from my position as a cloud economist, as effectively an economic story: by hoarding the data, they get to charge you for hosting it, and they get to charge you serious prices for egress. I’ve had people mishear that before in a variety of ways, usually distilled down to, “Oh, they’re data mining all of their customers’ data.” And I want to make sure that that’s not the direction that you intend the term to be used. If it is, then great, we can talk about that, too. I just want to make sure that I don’t get letters, because God forbid we get letters for things that we say in public.
Matthew: No, I mean, I had an aunt who was a hoarder, and she collected every piece of everything and stored it somewhere in her tiny little apartment in the panhandle of Florida. I don’t think she looked at any of it, and for the most part, I don’t think that AWS or Google or Microsoft are really using your data in any way that’s nefarious, but they’re definitely not going to make it easy for you to get it out of those places; they’re going to make it very, very expensive. And again, what they’re measuring is how much of a customer’s data they are holding onto, whereas at Cloudflare, we’re measuring how much we can enable you to move your data around and connect it wherever you need it. And again, I think that kind of gets to the fundamental difference between how we think of the world and how I think the hyperscale public clouds think of the world. And it also gets to where are the places where it makes sense to use Cloudflare, and where are the places that it makes sense to use an AWS or Google Cloud or Microsoft Azure.
Corey: So, I have to ask, and this gets into the origin story trope a bit, but what radicalized you? For me, it was the realization one day that I could download two terabytes of data from S3 once, and it would cost significantly more than having Amazon.com ship me a two-terabyte hard drive from their store.
Matthew: I think that—so, Cloudflare started with the basic idea that the internet’s not as good as it should be. If we all knew what the internet was going to be used for and what we were all going to depend on it for, we would have made very different decisions in how it was designed. We would have made sure that security was built in from day one. You know, the internet is very reliable and available, but there are now airplanes that can’t land if the internet goes offline, there are shopping transactions that shut down if the internet goes offline. And so, I don’t think we understood—we made it available to some extent, but not nearly to the level that we all now depend on it. And it wasn’t as fast or as efficient as it possibly could be. It’s still very dependent on the geography of where data is located.
And so, Cloudflare started out by saying, “Can we fix that? Can we go back and effectively patch the internet and make it what it should have been when we set down the original protocols in the ’60s, ’70s, and ’80s? Can we build a new, sort of, overlay on the internet that solves those problems: make it more secure, make it more reliable, make it faster and more efficient?” And so, I think that’s where we started, and as a result of, again, starting from that place, it just made fundamental sense that our job was: how do you move data from one place to another and do it in all of those ways? And so, where I think, again, the hyperscale public clouds measure themselves by how much of a customer’s data they are hoarding, we measure ourselves by how easy we are making it to securely, reliably, and efficiently move any piece of data from one place to another.
And so, I guess, that is radical compared to some of the business models of the traditional cloud providers, but it just seems like what the internet should be. And that’s our North Star and that’s what just continues to drive us and I think is a big reason why more and more customers continue to rely on Cloudflare.
Corey: The thing that irks me potentially the most in the entire broad strokes of cloud is how the actions of the existing hyperscalers have reflected mostly what’s going on in the larger world. Moore’s law has been going on for decades now, and compute continues to get faster all the time. Storage continues to cost less year over year in a variety of ways. But they have, on some level, tricked an entire generation of businesses into believing that network bandwidth is this precious, very finite thing, and of course, it’s going to be ridiculously expensive. You know, unless you’re taking it inbound, in which case, oh, by all means back the truck around. It’ll be great.
So, I’ve talked to founders—or prospective founders—who had ideas but were firmly convinced that there was no economical way to build it. Because oh, if I were to start doing real-time video stuff, well, great, let’s do the numbers on this. And hey, that’ll be $50,000 a minute if I read the pricing page correctly. It’s like, well, you could get some discounts if you ask nicely, but it doesn’t occur to them that they could wind up asking for a 98% discount on these things. Everything is measured in a per-gigabyte dimension, and that just becomes one of those things where people are starting to think about and meter something in a way that, from my days in data centers, where you care about the size of the pipe and not what’s passing through it, seems to be the wrong way of thinking about things.
Matthew: A little of this is that everybody is colored by their experience of dealing with their ISP at home. And in the United States, and in a lot of the world, ISPs are built on the old cable infrastructure. And if you think about the cable infrastructure when it was originally laid down, it was all one-directional. So, you know, if you were turning on cable in your house in a pre-internet world, data fl—
Corey: Oh, you’d watch a show and your feedback was yelling at the TV, and that’s okay. They would drop those packets.
Matthew: And there was a tiny, tiny, tiny bit of data that would go back the other direction, but cable was one-directional. And so, it actually took an enormous amount of engineering to make cable bi-directional. And that’s the reason why, if you’re using a traditional cable company as your ISP, typically you will have a large amount of download capacity, you’ll have, you know, 100 megabits of down capacity, but you might only have a tenth of that, so maybe ten megabits, of upload capacity. That is an artifact of the cable system. That is not just the natural way that the internet works.
And the way wholesale bandwidth works is different: when you sign up for wholesale bandwidth—again, as you phrased it—you’re not buying some number of bytes that flow over the line; you’re buying, effectively, a pipe. You know, the late Senator Ted Stevens said that the internet is just a series of tubes and got mocked mercilessly, but the internet is just a series of tubes. And when Cloudflare or AWS or Google or Microsoft buys one of those tubes, what they pay for is the diameter of the tube, the amount that can fit through it. And the nature of this is you don’t just get one tube, you get two: one that is down and one that is up. And they’re the same size.
And so, if you’ve got a terabit of traffic coming down and zero going up, that costs exactly the same as a terabit going up and zero going down, which costs exactly the same as a terabit going down and a terabit going up. It is different than your home, you know, cable internet connection. And that’s the thing that I think a lot of people don’t understand. And so, as you pointed out, the great tragedy of the cloud is that for nothing other than business reasons, these hyperscale public cloud companies don’t charge you anything to accept data—even though that is actually the more expensive of the two operations, because writes are more expensive than reads—but the inherent fact that they were able to suck the data in means that they have the capacity, at no additional cost, to be able to send that data back out. And so, I think that, you know, the good news is that you’re starting to see some providers—so Cloudflare, we’ve never charged for egress because, again, we think that over time, bandwidth prices go to zero because it just makes sense; it makes sense for ISPs to be connected to us.
And that’s something that we can do, but even in the cases of the cloud providers where maybe they’re all in one place and somebody has to pay to backhaul the traffic around the world, maybe there’s some cost, but you’re starting to see some pressure from some of the more forward-leaning providers. So Oracle, I think has done a good job of leaning in and showing how egress fees are just out of control. But it’s crazy that in some cases, you have a 4,000x markup on AWS bandwidth fees. And that’s assuming that they’re paying the same rates as what we would get at Cloudflare, you know, even though we are a much smaller company than they are, and they should be able to get even better prices.
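To make the two billing models concrete, here is a rough, illustrative sketch. The $0.09/GB figure is AWS’s published first-tier internet egress rate in its major US regions; the $0.50 per Mbps per month port price is a hypothetical wholesale transit rate chosen purely for illustration, not Cloudflare’s actual cost:

```python
# A rough, illustrative comparison of per-gigabyte metering (the hyperscaler
# model) vs. paying for the diameter of the pipe (the wholesale model).
# The prices below are assumptions for illustration, not anyone's real costs.

SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.59 million seconds

def metered_cost(gb_transferred: float, price_per_gb: float = 0.09) -> float:
    """Cost under per-gigabyte egress metering."""
    return gb_transferred * price_per_gb

def port_cost(port_gbps: float, price_per_mbps_month: float = 0.50) -> float:
    """Flat monthly cost for a symmetric port: you pay for capacity,
    in both directions, regardless of how many bytes actually flow."""
    return port_gbps * 1000 * price_per_mbps_month

# A 10 Gbps port running flat out all month moves ~3.24 PB each direction.
gb_moved = 10 / 8 * SECONDS_PER_MONTH  # 10 Gbps = 1.25 GB/s
print(f"Data moved:   {gb_moved / 1e6:.2f} PB")
print(f"Metered bill: ${metered_cost(gb_moved):,.0f}")  # ~ $291,600
print(f"Port bill:    ${port_cost(10):,.0f}")           # $5,000
```

The exact numbers vary wildly by market; the point is the shape of the model: wholesale bandwidth is priced by the size of the pipe, in both directions, no matter how many bytes flow through it.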
Corey: Yes, if there’s one thing Amazon is known for, it’s being bad at negotiating. Yeah, sure it is. I’m sure that they’re just a terrific joy to be a vendor to.
Matthew: Yeah, and I think that, fundamentally, the price of bandwidth is tied very closely to the cost of a port on a router. And what we’ve seen over the course of the last ten years is that cost has gone down enormously: the capacity of a port has gone way up while the physical, depreciated cost of that port has gone down. And yet, when you look at Amazon, you just haven’t seen a decrease in the cost of bandwidth that they’re passing on to customers. And so, again, I think that this is one of the places where you’re starting to see regulators pay attention. We’ve seen efforts in the EU to say whatever you charge to take data out should be the same as what you charge to put data in. We’re seeing the FTC start to look at this, and we’re seeing customers that are saying that this is a purely anti-competitive action.
And, you know, I think what would be the best and healthiest thing for the cloud, by far, is if we made it easy to move between the various cloud providers. Because right now the choice is: do I use AWS or Google or Microsoft? Whereas what I think any company out there really wants to be able to say is, “I want to use this feature at AWS because they’re really good at that, and I want to use this other feature at Google because they’re really good at that, and I want to use this other feature at Microsoft, and I want to mix and match between those various things.” And I think that if you actually got cloud providers to start competing on features as opposed to competing on their overall platform, we’d actually have a much richer and more robust cloud environment, where you’d see significantly more going on, as opposed to what we have now, which is AWS being mediocre at everything.
Corey: I think that there’s also a story where, for me, the egress is annoying, but so is the cross-region, and so is the cross-AZ, which in many cases costs exactly the same. And that frustrates me from the perspective of: yes, if you have two data centers ten miles apart, there is some startup cost to you in running fiber between them, however you want to wind up with that working, but it’s a sunk cost. At the end of that, though, when you continue to charge customers on a per-gigabyte basis, you’re making them decide on a very explicit trade-off: do I care more about cost or do I care more about reliability? And it’s always going to be an investment decision between those two things, but when you take the reasonable approach of, well, okay, an availability zone rarely goes down, and then it does, you get castigated by everyone: “Oh, it even says in their best practice documents to go ahead and build it this way.” It’s funny how a lot of the best practice documents wind up suggesting things that accrue primarily to a cloud provider’s benefit. But that’s the way of the world, I suppose.
I just know there’s a lot of customer frustration around it. In my client environments, it doesn’t seem to be very acute until we tear apart a bill and look at where they’re spending money, and on what, at which point you can watch the dawning realization happen, where they suddenly realize exactly where their money is going—because it’s relatively impenetrable without that—and then they get angry. And I feel like if people don’t know what they’re being charged for, on some level, you’ve messed up.
Matthew: Yeah. So, there’s a cost to running a network, but there’s no reason, other than limiting competition, why you would charge more to take data out than you would to put data in. And that’s a puzzle. The cross-region thing, you know, I think where we’re seeing a lot of that is actually when you’ve got new technologies that come out and they need to take advantage of some scarce resource. And so, AI—and all the AI companies are a classic example of this—right now, if you’re trying to build an AI model, you are hunting the world for available GPUs at a reasonable price because there’s an enormous scarcity of them.
And so, you need to move from AWS East to AWS West, to AWS, you know, Singapore, to AWS in Luxembourg, and bounce around to find wherever there’s GPU availability. And then that is crossed against the fact that these training datasets are huge. You know, I mean, they’re just massive, massive, massive amounts of data. And so, these AI companies are really getting hit in the face by this, where they literally can’t get the capacity they need because whatever cloud provider, in whatever region they’ve selected to store their data, doesn’t have that capacity. And so, they’re getting hit by a double whammy: “I need to move my data to wherever there’s capacity, and if I don’t do that, then I have to pay some premium, an ever-escalating price, for the underlying GPUs.” And God forbid you have to move from AWS to Google to chase that.
And so, we’re seeing a lot of companies that are saying, “This doesn’t make any sense. We have this enormous training set. If we just put it with Cloudflare, this is data that makes sense to live in the network, fundamentally.” And not everything does. Like, we’re not the right place to store your long-term transaction logs that you’re only going to look at if you get sued. There are much better places, much more effective places to do it.
But in those cases where you’ve got to read data frequently, you’ve got to read it from different places around the world, and you need to decrease the cost of each one of those reads, what we’re seeing is just an enormous amount of demand for that. And I think these AI startups are really just a very clear example of what company after company after company needs, and why R2—which is our zero-egress-cost S3 competitor—is seeing such explosive growth from a broad set of customers.
Corey: Because I enjoy pushing the bounds of how ridiculous I can be on the internet, I wound up grabbing a copy of the model, the Llama 2 model that Meta just released earlier this week as we’re recording this. And it was great. It took a little while to download here. I have gigabit internet, so okay, it took some time. But then I wound up with something like 330 gigs of models. Great, awesome.
Except for the fact that I do the math on that, and just for me as one person to download that, had they been paying the listed price on the AWS website, they would have spent a bit over $30—just for me, as one random user, to download the model once. Now, extend that to the idea that this is a model that’s absolutely perfect for whatever use case, but we want to have it run with some great GPUs available at another cloud provider. Let’s move the model over there, ignoring the data it’s operating on as well, and it becomes completely untenable. It really strikes me as an anti-competitiveness issue.
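As a rough sanity check on that figure, assuming the roughly 330 GB model size above and AWS’s published first-tier internet egress rate of about $0.09/GB (both approximations, not authoritative numbers), the arithmetic looks like this:

```python
# Back-of-the-envelope egress math; the rate and model size are
# approximations, not authoritative AWS or Meta figures.
model_gb = 330           # approximate size of the Llama 2 weights
egress_per_gb = 0.09     # USD/GB, AWS first-tier internet egress rate

one_download = model_gb * egress_per_gb
print(f"One user, one download: ${one_download:.2f}")          # ~ $29.70
print(f"A thousand downloads:   ${one_download * 1000:,.0f}")  # ~ $29,700
```

The per-download cost looks modest until it is multiplied by every user who pulls the weights, which is exactly the dynamic that makes hosting large models behind metered egress untenable.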
Matthew: Yeah. I think that’s right. And that’s just the model. To build that model, you would have literally millions of times more data feeding it. And so, the training sets for that model would be many, many, many orders of magnitude larger. And so, I think the AI space is really illustrating how, when you have this scarce resource that you need to chase around the world and these enormous datasets, these egress fees are actually holding back the ability for innovation to happen.
And again, they are absolutely—there is no valid reason why you would charge more for egress than you do for ingress other than limiting competition. And I think the good news, again, is that’s something that’s gotten regulators’ attention, that’s something that’s gotten customers’ attention, and over time, I think we all benefit. And I think actually, AWS and Google and Microsoft actually become better if we start to have more competition on a feature-by-feature basis as opposed to on an overall platform. The choice shouldn’t be, “I use AWS.” And any big company, like, nobody is all-in only on one cloud provider. Everyone is multi-cloud, whether they want to be or not because people end up buying another company or some skunkworks team goes off and uses some other function.
So, you are across multiple different clouds, whether you want to be or not. But the ideal, when I talk to customers, is that they want to say, “Well, you know that stuff that they’re doing over at Microsoft with AI? That sounds really interesting. I want to use that, but I really like the maturity and robustness of some of the EC2 APIs, so I want to use that at AWS. And Google is still, you know, the best in the world at doing search and indexing and everything, so I want to use that as well in order to build my application.” And the applications of the future will inherently stitch together different features from different cloud providers, different startups.
And at Cloudflare, what we see as our, sort of, purpose for being is: how do we make that stitching as easy as possible, as cost-effective as possible, and make it just make sense so that you have one consistent security layer? And again, we’re not about hoarding the data; we’re about connecting all of those things together. And again, you know, from the last time we talked to now, I’m actually much more optimistic that you’re going to see, kind of, this revolution where egress prices go down, you get competition on a feature-by-feature basis, and that’s just going to make every cloud provider better over the long term.
Corey: This episode is sponsored in part by Panoptica. Panoptica simplifies container deployment, monitoring, and security, protecting the entire application stack from build to runtime. Scalable across clusters and multi-cloud environments, Panoptica secures containers, serverless APIs, and Kubernetes with a unified view, reducing operational complexity and promoting collaboration by integrating with commonly used developer, SRE, and SecOps tools. Panoptica ensures compliance with regulatory mandates and CIS benchmarks for best practice conformity. Privacy teams can monitor API traffic and identify sensitive data, while identifying open-source components vulnerable to attacks that require patching. Proactively addressing security issues with Panoptica allows businesses to focus on mitigating critical risks and protecting their interests. Learn more about Panoptica today at panoptica.app.
Corey: I don’t know that I would trust you folks with the long-term storage of critical data, or as the store of record on that. You don’t have the track record on that as a company the way that you do for being the network interchange that makes everything just work together. There are areas where I’m thrilled to explore and see how it works, but it takes time, at least from the sensible infrastructure perspective of trusting people with track records on these things. And you clearly have the network track record on these things to make this stick. It almost seems unfair to you folks, but I view Cloudflare as a CDN that also dabbles in a few other things here and there, though increasingly, it seems “CDN” and “security company” are becoming synonymous.
Matthew: It’s interesting. I remember—and this really is going back to the origin story—but when we were starting Cloudflare, you know, what we saw was that we watched as software—starting with companies like Salesforce—transitioned from something that you bought in a box to something that you bought as a service in the cloud. We watched as, sort of, storage and compute transitioned from something that you bought from Dell or HP to something that you rented as a service. And so, the fundamental premise that Cloudflare started out with was: if the software and the storage and compute are going to move, inherently the security and the networking are going to move as well, because they have to be delivered as a service too. There’s no way you can buy a, you know, Cisco firewall and stick it in front of your cloud service. You have to be in the cloud as well.
So, we actually started very much as a security company. And the objection that everybody had, as we would go out and describe what we were planning on doing, was, “You know, that sounds great, but you’re going to slow everything down.” And so, we became just obsessed with latency. Michelle, my co-founder, and I were business students, and we had an advisor in business school, a guy named Tom Eisenmann. And I remember going in, and that was his objection as well, and so we did all this work to figure it out.
And obviously, you know—I’d say in computer science, anytime that you have a problem around latency or speed, caching is an obvious part of the solution to that. And so, we went in and we said, “Here’s how we’re going to do it: all this protocol optimization stuff, and here’s how we’re going to distribute it around the world and get close to where users are. And we’re going to use caching in the places where we can do caching.” And Tom said, “Oh, you’re building a CDN.” And I remember looking at him and then looking at Michelle. And Michelle is Canadian, and so I was like, “I don’t know that I’m building a Canadian, but I guess. I don’t know.”
And then, you know, we walked out into the hall and Michelle looked at me and she’s like, “We have to go figure out what the CDN thing is.” And we had no idea what a CDN was. And even when we learned about it, we were like, that business doesn’t make any sense. Because, again, the CDNs were the first ones to really charge for bandwidth. And so today, we have effectively built, you know, a giant CDN, and are the fastest in the world, and do all those things.
But we’ve always given it away basically for free because, fundamentally, what we’re trying to do is all that other stuff. And so, we actually started with security. I’ve been working in security now for over 25 years and that’s where my background comes from, and if you go back and look at what the original plan was, it was: how do we provide that security as a service? And yeah, you need to have caching because caching makes sense. What I think is the difference is that in order to do that, in order to be able to build that, we had to build a set of developer tools for our own team to allow them to build things as quickly as possible.
And, you know, if you look at Cloudflare, I think one of the things we’re known for is just the rapid, rapid, rapid pace of innovation. And so, over time, customers would ask us, “How do you innovate so fast? How do you build things so fast?” And part of the answer to that, among the lots of ways that we’ve been able to do it, is that we built a developer platform for our own team, one that was incredibly flexible, allowed you to scale to almost any level, and took care of a lot of those traditional SRE functions behind the scenes without you having to think about them. And it allowed our team to be really fast. And our customers were like, “Wow, I want that too.”
And so, customer after customer after customer was asking, saying, you know, “We have those same problems. If we’re a big e-commerce player, we need to be able to build something that can scale up incredibly quickly, and we don’t want to have to think about spinning up VMs or containers or whatever. Our customers are around the world. We don’t want to have to pick a region for where we’re going to deploy code.” And so, while we built Cloudflare Workers for ourselves first, customers really pushed us to make it available to them as well.
And that’s the way that almost any good developer platform starts out. That’s how AWS started. That’s how the Microsoft developer platform, the Apple developer platform, the Salesforce developer platform—they all started out as internal tools, and then someone said, “Can you expose this to us as well?” And that’s how, you know, I think we have built this. And again, it’s very opinionated, it is right for certain applications, and it’s never going to be the right place to run SAP HANA, but the company that builds the tool—
Corey: I’m not convinced there is a right place to run SAP HANA, but that’s probably unfair of me.
Matthew: Yeah, but there is a startup out there, I guarantee you, that’s building whatever the replacement for SAP HANA is. And I think it’s a better-than-even bet that Cloudflare Workers is part of their stack because it solves a lot of those fundamental challenges. And that’s been great because it is now allowing customer after customer after customer, startups and multinationals alike, to do things that you just can’t do with the traditional legacy hyperscale public clouds. And so, I think we’re sort of the next generation of building that. And again, I don’t think we set out to build a developer platform for third parties, but we needed to build it for ourselves, and that’s how we built such an effective tool that now so many companies are relying on.
Corey: As a Cloudflare customer myself, I think that one of the things that makes you folks stand out—it’s why I included security, as well as CDN, as one of the things I trust you folks with—has been—
Matthew: I still think CDN is Canadian. You will never see us use that term. It’s like, Gartner was like, “You have to submit something for the CDN-like ser—” and we ended up, like, being absolute top-right in it. But it’s a space that is inherently going to zero because, again, if bandwidth is free, I’m not sure what you’re paying a CDN for—this is how the internet should work. So yeah, anyway.
Corey: I agree wholeheartedly. But what I’ve always enjoyed, and this is probably going to make me sound meaner than I intend it to, has been your outages. Because when computers inherently break at some point, which is what they do, you personally and you as a company have both taken a tone that I don’t want to say is gleeful, but it’s sort of the next closest thing to it, regarding the postmortem that winds up getting published and the explanation of what caused it. The transparency is unheard of at companies of your scale, where usually they want to talk about these things as little as possible. Whereas you’ve turned these into things that are educational to those of us who don’t have the same scale to worry about, but who can take helpful lessons from them. And that transparency just counts for so much when we’re talking about things as critical as security.
Matthew: I would definitely not describe it as gleeful. It is incredibly painful. And we, you know, we know we let customers down anytime we have an issue. But we tend not to make the same mistake twice. And the only way that we really can reliably do that is by being just as transparent as possible about exactly what happened.
And we hope that others can learn from the mistakes that we made. And so, we own the mistakes we make and we talk about them, and we’re transparent, both internally but also externally, when there’s a problem. And it’s really amazing to just see how much, you know, we’ve improved over time. So, it’s actually interesting: we test and measure all the big hyperscale public clouds, what their availability and reliability is, and measure ourselves against them, and across the board, the second half of 2021 into the first half of 2022 was the worst for every cloud provider in terms of reliability. And the question is why?
And the answer is, Covid. I mean, the answer to most things over the last three years is, in one way or another, directly or indirectly, Covid. But what happened over that period of time was that in April of 2020, internet traffic, and traffic to our service and everyone like us, doubled over the course of a two-week period. And there are not many utilities that you can imagine where, if their usage doubled, they wouldn’t have a problem. Imagine the sewer system all of a sudden has twice as much sewage, or the electrical grid has twice as much demand, or the freeways have twice as many cars. Like, things break down.
And especially the European internet came incredibly close to just completely failing at that time. And we all saw where our bottlenecks were. And what’s interesting is actually the availability wasn’t so bad in 2020, because people understood the absolute critical importance that, while we were in the middle of a pandemic, we had to make sure the internet worked. And so, there were a lot of sleepless nights, and not just with us, but with every provider that’s out there. We were all doing Herculean tasks in order to make sure that things came online.
By the time we got to the second half of 2021, what everybody did, Cloudflare included, was look at it and say, “Okay, here’s where the bottlenecks were. Here were the problems. What can we do to rearchitect our systems to deal with that?” And one of the things that we saw was that we effectively treated large data centers as one big block, and if certain pieces of equipment failed in a particular way, you would take that entire data center down, and then that could have cascading effects across traffic as it shifted around our network. And so, we did the work to say, “Let’s take that one big data center and divide it, effectively, into multiple independent units, where you make sure that they’re all on different power supplies, you make sure they’re all in different—”
Corey: Which is harder than it sounds. When you have redundant things, very often the thing that takes you down the most is the heartbeat that determines whether something next to it is up or not. It gets a false reading and suddenly, they’re basically trying to clobber each other to death. So, this is a lot harder than it sounds.
Matthew: Yeah, and it was—but what’s interesting is, like, we took all of that into account, but in the act of fixing things, you break things. And that was not just true at Cloudflare. If you look across Google and Microsoft and Amazon, everybody, their worst availability was the second half of 2021 into 2022. But both internally and externally, we talked about the mistakes we made, we talked about the challenges we had, and today, we’re significantly more resilient and more reliable because of that. And so, transparency has been built into Cloudflare from the beginning.
The earliest story of this, I remember: there was a 15-year-old kid living in Long Beach, California, who bought my social security number off of a Russian website that had hacked a bank that I’d once used to get a mortgage. He then used that to redirect my cell phone voicemail to a voicemail box he controlled. He then used that to get into my personal email. He then used that to find a zero-day vulnerability in Google’s corporate email where he could privilege-escalate from my personal email into Google’s corporate email, which is the provider that we use for our email service. And then he used that, as an administrator on our email at the time—this is back in the early days of Cloudflare—to get into another administrative account that he then used to redirect one of Cloudflare’s customers to a website that he controlled.
And thankfully, it wasn’t, you know, the FBI or the Central Bank of Brazil, which were all Cloudflare customers. Instead, it was 4chan, because he was a 15-year-old hacker kid. And we fixed it pretty quickly, and nobody knew who Cloudflare was at the time. And so, potential—
Corey: The potential damage that could have been caused at that point with that level of access to things, like, that is such a ridiculous way to use it.
Matthew: And—yeah [laugh]—my temptation, because it was embarrassing—he took a bunch of stuff from my personal email and put it up on a website which, just to add insult to injury, was actually using Cloudflare as well—was to sweep it under the rug. And our team was like, “That’s not the right thing to do. We’re fundamentally a security company and we need to talk about it when we make mistakes on security.” And so, we wrote a huge postmortem on, “Here are all the stupid things that we did that caused this hack to happen.” And by the way, it wasn’t just us. It was AT&T, it was Google. I mean, there were a lot of people that ended up being involved.
Corey: It builds trust with that stuff. It’s painful in the short term, but I believe with the benefit of hindsight, it was clearly the right call.
Matthew: And I remember, you know, pushing ‘publish’ on the blog post and thinking, “This is going to be the end of the company.” And quite the opposite happened: all of a sudden, we saw just an incredible number of people who signed up the next day saying, “If you’re going to be that transparent about something that was incredibly embarrassing when you didn’t have to be, then that’s the sort of thing that actually makes me trust that you’re going to be transparent in the future.” And learning that lesson early on has been just incredibly valuable for us and made us the company that we are today.
Corey: A question that I have for you about the idea of there being no reason to charge in one direction but not the other: there’s something that I’m not sure that I understand on this. If I run a website—to use your numbers, a terabit out, because it’s a web server—and effectively nothing in—because it’s a web server; other than the requests, nothing really is going to come in—that ingress bandwidth becomes effectively unused and also free. So, if I have another use case where I’m paying for it anyway, where I primarily care about the outward direction, sure, you can send things in for free. Now, there’s a lot of nuance that goes into that. But I’m curious: is there a fundamental misunderstanding in that analysis of the bandwidth market?
Matthew: No. And I think that’s exactly, exactly right. And it’s actually interesting. At Cloudflare, our infrastructure team—which is the one that manages our connections to the outside world, manages the hardware we have—meets on a quarterly basis with our product team. It’s called the Hot and Cold Meeting.
And what they do is they go over our infrastructure, and they say, “Okay, where are we hot? Where do we not have enough capacity?” If you think of any given server, an easy way to think of it is that it has, sort of, four resources available to it. This is a vast simplification, but one is the connectivity to the outside world, both transit in and out. The second is the—
Corey: Otherwise it’s just a complicated space heater.
Matthew: Yeah [laugh]. The second is the CPU. The third is the longer-term storage—we use only SSDs, but, you know, hard-drive or SSD storage. And then the fourth is the short-term storage, or RAM, that’s in that server.
And so, at any given moment, there are going to be places where we are running hot, where we have a sort of capacity level that we’re targeting and we’re over that capacity level, but we’re also going to be running cold in some of those areas. And so, the infrastructure team and the product team get together, and the product team has requests: you know, “Here are some places where it would be great to have more infrastructure.” And we’re really good at deploying that when we need to, but the infrastructure team then also says, “Here are the places where we’re cold, where we have excess capacity.” And that turns into products at Cloudflare. So, for instance, you know, the reason that we got into the zero-trust space was very much because we had all this excess capacity.
We have 100 times the capacity of something like Zscaler across our network, and we can layer that on: where most of our older products are all about outbound traffic, the zero-trust products are all about inbound traffic. And the reason that we can do everything that Zscaler does, but at much, much more affordable prices, is that we basically just layer it on the network that already exists. The reason we don’t charge for the bandwidth behind DDoS attacks is that DDoS attacks are always about inbound traffic, and we have just a ton of excess capacity around that inbound traffic. And so, that unused capacity is a resource that we can then turn into products, and very much that conversation between our product team and our infrastructure team drives how we think about building new products. And we’re always trying to ask: how can we get as much utilization as possible out of every single piece of equipment that we run everywhere in the world?
The way we build our network, we don’t have custom machines or different networks for every product. We build all of our machines—they come in generations. So, we’re on, I think, generation 14 of servers, where we spec a server and it has, again, a certain amount of each of those four bits of capacity. But we can then deploy that server all around the world, and we’re buying many, many, many of them at any given time so we can get the best cost on that. But our product team is very much in constant communication with our infrastructure team, saying, “What more can we do with the capacity that we have?” And then we pass that on to our customers by adding additional features that work across our network, and doing it in a way that’s incredibly cost-effective.
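Here is a toy sketch of what that quarterly “hot and cold” review might look like in code. The four resources track the ones Matthew lists; the target utilization, the numbers, and the function itself are hypothetical illustrations, not Cloudflare’s actual tooling:

```python
# Hypothetical model of a "Hot and Cold" review: flag which of a server
# generation's four resources run hot (over a target utilization) and
# which run cold (excess capacity that could back a new product).
from dataclasses import dataclass

TARGET_UTILIZATION = 0.70  # illustrative target, not a real Cloudflare number

@dataclass
class ServerUtilization:
    network: float  # connectivity to the outside world, transit in and out
    cpu: float
    disk: float     # longer-term (SSD) storage
    ram: float      # short-term storage

def hot_and_cold(u: ServerUtilization) -> dict[str, str]:
    """Label each resource 'hot' or 'cold' relative to the target."""
    return {
        name: "hot" if value > TARGET_UTILIZATION else "cold"
        for name, value in vars(u).items()
    }

# Spare inbound network capacity is the kind of "cold" resource described
# above, the sort that zero-trust and unmetered DDoS mitigation build on:
print(hot_and_cold(ServerUtilization(network=0.35, cpu=0.80, disk=0.55, ram=0.60)))
# {'network': 'cold', 'cpu': 'hot', 'disk': 'cold', 'ram': 'cold'}
```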
Corey: I really want to thank you for taking the time to, basically once again, suffer slings and arrows about networking, security, cloud, economics, and so much more. If people want to learn more, where’s the best place for them to find you?
Matthew: You know, it used to be an easy question to answer because it was just, you know, go on Twitter and find me, but now we have all these new mediums. So, I’m @eastdakota on Twitter. I’m eastdakota.com on Bluesky. I’m @real_eastdakota on Threads. And so, you know, one way or another, if you search for eastdakota, you’ll come across me somewhere out there in the ether.
Corey: And we will, of course, put links to that in the show notes. Thank you so much for your time. I appreciate it.
Matthew: It’s great to talk to you, Corey.
Corey: Matthew Prince, CEO and co-founder of Cloudflare. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that I will of course not charge you inbound data rates on.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.