The Complexity and Value of Scaling Reliability with Kannan Solaiappan
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is brought to us by our friends at Wiz.
Wiz is on a mission to help every organization rapidly identify and remove critical risks in their cloud environments. Purpose-built for the cloud, Wiz delivers full stack visibility, accurate risk prioritization, and enhanced business agility. Wiz connects in minutes, using an agentless approach that scans both platform configurations and inside every workload. We perform a deep assessment that goes beyond what standalone CSPM and CWPP tools offer and find the toxic combination of flaws that represent real risk. To learn more, visit Wiz.io
Corey: This episode is brought to us in part by our friends at Min.io
With more than 1.1 billion Docker pulls - most of which were not due to an unfortunate loop mistake, like the kind I like to make - and more than 37 thousand GitHub stars (which are admittedly harder to get wrong), MinIO has become the industry standard alternative to S3. It runs everywhere - public clouds, private clouds, Kubernetes distributions, bare metal, Raspberry Pis, colocations - even in AWS Local Zones.
The reason people like it comes down to its simplicity, scalability, enterprise features and best in class throughput. Software-defined and capable of running on almost any hardware you can imagine and some you probably can’t, MinIO can handle everything you can throw at it - and AWS has imagined a lot of things - from datalakes to databases.
Don’t take their word for it though - check it out at www.min.io and see for yourself. That’s www.min.io
Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. This is one of those fun episodes because it is brought to us as a promoted guest episode by our friends at Severalnines. However, no one from Severalnines is on this conversation today. Instead, I am joined by Kannan Solaiappan, who is currently the Head of Reliability and Data Engineering at Circles.Life. Now, they happen to be a Severalnines customer, but let’s dive into it. Kannan, thank you for joining me. I appreciate it.
Kannan: Thanks, Corey. Glad to be part of the call.
Corey: So, let’s start at the very beginning. What does Circles.Life do?
Kannan: Circles.Life is a digital telco. We are trying to disrupt the market by providing a vertical SaaS telco operating system to all the telco partners. And our mission is giving the power back to the customer, so they do not get themselves into lock-in contracts with multiple [unintelligible 00:01:31], and giving all the power and, you know, savings back to the customer while providing a high-performance and highly reliable network of data and call services to them across the world.
Corey: Working in telco is one of those interesting areas because it’s one of those problems where mistakes are absolutely going to show, but when you get it right, no one notices or cares because, “Of course, the phone is going to talk to the other end. Why wouldn’t it? And of course, the data is going to flow.” Until suddenly TCP terminates on the floor and then nothing gets to talk to anything ever again. Severalnines is an offering around the idea of database as a service, which on some level, sounds to me like, wait, aren’t there a lot of managed database offerings?
But when I started digging into them a little bit further, that’s not really how they position themselves. They talk about the idea of ‘Sovereign Database as a Service’ offerings and the ability to wind up running a wide variety of different data stores in ways that are much more portable and, to be direct, responsible than running it yourself in a different bespoke way in every environment that winds up coming across your desk. Now, that’s the official story that their shiny marketing pages say. What’s your perspective on them? What do they do for you folks that leaves you at least happy enough to wind up showing up and claiming you’re going to say nice things, but we’ll see how it plays out?
Kannan: Fantastic, we’ll see, right? So, that’s the biggest challenge when you’re, you know, trying to build a product as a startup and, you know, grow to a multinational company expanding to hundreds of countries, right? So, one of the big challenges for us is that, based on the data regulation, we are compelled to use the public clouds that are available in the local markets, which also pushes you to choose the managed services of those public clouds in those markets. So, we decided to go cloud agnostic, especially on the persistent stack. And that brings a new challenge of, you know, how you are going to set up the environments and how you’re going to operate those environments, the time taken to provision those hundreds of databases, and building the data recovery, backups, fault tolerance, and observability on top of it.
And Severalnines actually gives us a hand in speeding up those areas, especially disaster recovery planning and provisioning, and backup and recovery provisioning for the multiple variations of databases we have, as well as observability of those database instances. Yes, of course, they are not a hundred percent match for our needs, but in the areas where we are using them, we are more than happy to get their help in speeding up those operations.
Corey: I can already hear the plaintive comments coming in before I even wind up putting this on the internet because people are going to be accusing me of, “Now, hang on a second, you’ve always been very down on the idea of multi-cloud as a design principle.” Because I have been. But when you talk to individual companies that are building with an idea toward running a workload in multiple environments and designing it in, my default assumption is not, “Well, they’re doing it wrong.” It’s instead, “Oh, you’re probably doing something right,” because you have much closer insight to your strategic requirements. And when you’re talking about taking a workload as you are and putting it in a bunch of different places based upon where the customer happens to be, you’ve got to be able to wind up deploying that in a bunch of different places because we’ll all grow old and wither and die waiting for our cloud provider of choice to build a region where we really, really need one to be. At least that’s always been my philosophy on this.
Now, that doesn’t mean every aspect of every workload needs to be run in a completely autonomous way, but there are some core functions that seem to. At least that’s my approach on it. How do you think about it?
Kannan: So, some aspect of it is, you know, entirely true. The way we are trying to build our telco as a platform, [unintelligible 00:05:27] as a platform, is that, after many thoughtful architectural discussions and, you know, brainstorming sessions with key architects and industry experts, we decided to go with the single-platform, multiple-instances model, right? So, which means you need to add utmost precision in building your core platform. So, a single platform is going to act as multiple instances in different countries, even though you are using multiple clouds in those areas. So, that brings us an opportunity to, you know, optimize the design aspect and the complexities you will be facing with the multi-cloud environment.
So, that also introduces new challenges like, you know, you should have a hundred percent observability and you should have high availability and fault tolerance, and you should have alerts in place for all the areas, and the slow queries should be identified before your customers complain to you. So, most of these areas are covered in our single-platform, multiple-instances model, and especially using Severalnines, last year, we were able to identify slow-performing queries and optimize them just before customers reached out.
Corey: I think at some level, there’s a bit of a reduction into what almost tries to be a binary when there needs to be a spectrum instead when it comes to the idea of independence. And on the one hand, it’s, “Oh, we’re not going to trust any of those cloud providers. We’re going to build everything in our data centers ourselves with the sweat of our brow and the bad grounding of our electrical supply.” Et cetera, et cetera. And the other side of it is, “No, no, no, we’re just going to make everything that we do run by other people because they’re good at things and we just want to focus on the one thing that we do.” In practice, it really feels to me like independence is a spectrum. Where do you see yourselves falling on that spectrum?
Kannan: Great question. So, if you see the large, successful SaaS companies, the way they started is, they put everything on one plate and try to expand, and, you know, in a multi-tenant model, your servers will be operated from a single country, and you are serving the data to the entire world. Now, the world is moving towards a highly regulated environment, right? So, those companies are losing their freedom. We got all those learnings when we started our journey and we built our systems in such a way that your persistent data layer can reside in the country where you are operating and you have a core platform where your nonregulated data can reside, right? So, that level of initial thought process let us design the systems optimistically, so that when you’re trying to scale, you can reduce all this complexity [unintelligible 00:08:11].
Corey: On some level, like, some of the worst takes around, “Oh, we’re going to wind up building these things to go in a bunch of different places,” come right in the wake of significant cloud outages because, “Oh, wow, AWS went down in a particular region for a couple hours and it made headlines everywhere. So, we’re going to go ahead and run in multiple cloud providers to avoid it.” In practice, what I see happening instead is that people are just doubling their exposure to outages. Now, whenever GCP or AWS have a problem, now we’re going to be hard down. It feels like it’s going in the exact opposite direction than things really should be moving.
You are the head of both reliability (which I spent a bit of time in back when I was hands-on keyboard in my engineering life) and also data engineering, something I stayed the hell away from because I’m unlucky, have an aura, and standing too close to the data warehouse means I don’t have a company anymore. So, in my experience, playing around on the reliability side was always an area of trade-offs and concerns. You’ve been in that area for almost your entire career by my understanding. What do you think that a lot of the industry is getting wrong now?
Kannan: Fantastic question, Corey. I would love to answer this, right? So, people on the operations engineering side are passionate about adding, you know, as much reliability as possible to all the services they are offering. The biggest mistake happening in those areas is, you know, when you are buying a new house, there is no specification of how many locks you can add to that house, right? So, you can put a hundred locks to make it more secure, or you can just have one lock, right?
So, this analogy is for how much security you need to add and in what layers, right? So, the same goes for reliability, right? So, there is a trade-off, right? So, the more reliability, the more SLA you’re committing to your customer, the more you’re going to spend on your platform, and then there is not going to be an ROI for your business. So, it is about the right trade-off, and the right trade-off comes from what the customer expectation is and which customer journeys need to be highly available, right?
So, think about if you’re running a ride-hailing application: all you need to worry about is that booking a cab or riding a cab and making the payment should be your key customer journey. Apart from that, you know, looking at different, you know, interests, areas to visit, and static content can be three-nines availability. But your ride-hailing app should have, you know, five-nines-plus availability. So, designing your system on the [unintelligible 00:10:47] different, right, so whether it is adding observability, or, you know, adding high availability and creating SLAs and SLOs, and committing an SLA with the customer, the fine line is how you are going to balance between high-precision engineering versus customer requirement [unintelligible 00:11:06], right. So, it’s a fantastic journey for us so far, and we had a lot of learnings in our B2C country launches. And those learnings brought us to, [unintelligible 00:11:15] us to build a state-of-the-art futuristic SaaS telco. And we are really balancing out and the [unintelligible 00:11:22] promising the quality of the customer platforms.
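To put rough numbers on the three-nines versus five-nines distinction in this answer, here is a minimal sketch of the arithmetic. It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the journey labels are illustrative only and are not taken from any Circles.Life system.

```python
# Allowed downtime per year for a given number of "nines" of availability.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(nines: int) -> float:
    """Availability fraction for N nines, e.g. 3 -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Minutes per year a service may be down while still meeting N nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


if __name__ == "__main__":
    # Illustrative journeys only: low-stakes content vs. the key customer journey.
    for journey, nines in [("static content", 3), ("booking and payment", 5)]:
        minutes = downtime_minutes_per_year(nines)
        print(f"{journey}: {nines} nines allows about {minutes:.2f} minutes/year")
```

Three nines leaves a budget of several hours a year, while five nines leaves only a few minutes, which is why the stricter target is usually reserved for the journeys customers cannot do without.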
Corey: I think that’s something that a lot of businesses, when they start out at a business level don’t fully understand. Because you ask them the question of, “Oh, how much downtime is acceptable for this platform?” And the default expected answer is, “No downtime whatsoever.” “Okay, we’re never going to achieve that, but to start, I’m going to need $20 billion and I’ll let you know when that runs out,” et cetera, et cetera. And then they wind up saying, “Whoa, what do you mean?”
And come to find out what they really mean is that they want the email server that powers Outlook to be up during business hours. Okay, great. And we talk about what the trade-offs look like as we have the SLO and SLA negotiations with the business, and eventually we come to an idea of these things are core and need to be up. Other things don’t. One of the things I love personally about the positioning of Severalnines just as a brand is it’s an area that is highly focused on being very precise, as far as the exact levels of service that are being guaranteed, and their name effectively just cuts against that in a really fun way. “How many nines of reliability do you offer?” “Several.” And you know, I love that approach. I think there’s really something very human about that.
Kannan: That’s true. That’s true. So, as we discussed in the previous question, right, so how many more nines you’d like to add, versus, you know, how many nines are expected by the customer? Balancing that act will give you the ability to, you know, design a strategy, and you know, achieve and adopt that strategy. When we started discussing our SaaS platform years ago, our first question was, you know, instead of, you know, starting from the product side, we just started from the reliability side.
Because for any SaaS product, being as reliable as possible is going to be the core value you will be providing to the customer. And you are adding more features as you go, right? So, Severalnines is one of the partners who helped us in, you know, achieving that high availability, and you know, disaster recovery, and backup recovery, and observability capabilities when we tried to launch a country with hundreds of databases. Another pain point we had was when you are trying to launch multiple countries at one time, you need a lot of operational engineers to provision and operate those components of your platform. But that need goes down when you stabilize those platforms, right?
So, that is where tools like Severalnines helped us to achieve that elasticity without having to attract more operations engineers to our platform to do that. Instead, we utilized that technology, which served us as a single pane of glass to provision those many hundreds of databases and other instances, and to operate and, you know, scale them as well. Yeah, it’s a balancing act, and thinking well ahead is going to help you in architecting correctly, choosing the right tool, and using it in the right way, right? If I use Severalnines for something else, I’m going to fail. I really use Severalnines for the purpose for which it is built. And we are one of the customers who have had a lot of feature requests because we are a highly demanding customer for them.
Corey: I’m personally a big fan of misusing things as databases like, you know, DNS TXT records, or the contents of certain databases that were never designed to serve DNS records and then use it back again. Again, there’s all kinds of ways to misuse things in horrifying ways that no one’s happy with. But using the right tool the right way is an absolute pleasure and a joy to wind up working with. That said, I tend to be relatively skeptical of experience reports when people have nothing but good things to say about a company’s product, full stop, the end. For examples of this, you can look at any conference keynote where they have a customer testimonial, ever, because it turns out getting in front of a few thousand customers of a company and saying, “Here’s what they’re terrible at,” gets you basically taken out by a sniper, if you’re not careful.
We don’t have any of those here. So, have you had any experiences working with Severalnines that left you a little bit… I guess, I don’t know if dissatisfied is the right word, but learning, okay, this is not necessarily the best way to apply it in different ways? What hasn’t worked out super well?
Kannan: So, [our cloud 00:15:41] is right so that Severalnines is a tool that helps us in achieving what we try to achieve. But when we tried to operationalize it for a running country, that’s, you know, when you are setting up the systems, everything is green. But when your customers start using your systems, that’s where the rubber meets the road.
Corey: Because architecture would be perfect if it worked for the users or customers. It would be glorious. So, why don’t we just keep them out forever? Yeah, it turns out that’s not sustainable.
Kannan: [laugh]. Exactly. So, the corner cases are like your backup failing due to some port issues on the network side of customer control side, and then we work with the engineers. And the engineers are really good and they quickly pick up the call and, you know, get into the meeting and quickly sort that out. So, when we started launching, there were multiple, you know, service requests to solve those corner cases, mostly related to the setups and, you know, scaling and utilizing some of the configurations in the right way.
Apart from that, I don’t recollect anything major happening to us in this engagement. And Severalnines is really correct, you know, observability, we use a multi-observability-tools strategy. We don’t rely on a single observability tool, with the sense of, you know, what if, for example, if you were using the New Relics or Dynatraces of the world, what if they go down, right, their service goes down, right? You should have at least no additional [unintelligible 00:17:01] of looking into those alerts and, you know, monitoring capabilities. So, we are actually a multi-observability-strategy company. And even in those cases, we make sure there is no customer impact there.
Corey: This episode is sponsored in part by our friends at Strata. Are you struggling to keep up with the demands of managing and securing identity in your distributed enterprise IT environment? You're not alone, but you shouldn’t let that hold you back. With Strata’s Identity Orchestration Platform, you can secure all your apps on any cloud with any IDP, so your IT teams will never have to refactor for identity again. Imagine modernizing app identity in minutes instead of months, deploying passwordless on any tricky old app, and achieving business resilience with always-on identity, all from one lightweight and flexible platform.
Want to see it in action? Share your identity challenge with them on a discovery call and they'll hook you up with a complimentary pair of AirPods Pro. Don't miss out, visit Strata.io/ScreamingCloud. That's Strata dot io slash ScreamingCloud.
Corey: There’s something to be said for being a very large user of a given tool or product. And one of the joys, I imagine, of being as telco-focused as you are with the scale of your customer base and how you operate, is that you tend to wind up straining some of the bounds of what a lot of things were designed to do. It’s a strange challenge because some vendors seem able to rise to that occasion and others, they try and gaslight you, or they’re like, “Oh, your site fell over, but here’s an SLA credit.” “Great. I have had unhappy customers on my side that I have to talk to and getting 40 bucks back on the enterprise deal I’m spending doesn’t help anything.”
It really does tend to separate out the vendors that are known and trustworthy in this space from those who become an experience report because experience is what you get when you didn’t get what you wanted.
Kannan: That’s true. So, the most recent incidents, right, I think last year, we are all aware that, you know, one of the major CDN providers, you know, went for an outage and Atlassian went for an outage as well. So, the wonderful part of all those companies’ outages, partly, is that it taught the entire world how you can build fault tolerance on the entry points as well, right? So, we all talk much and, you know, discuss more and architect our systems and platforms to have fault tolerance on the back-end and services, right, to make sure that the customer does not get impacted. And we are making an assumption that your entry points, those single entry points, are always secured, highly available, a hundred percent.
So, those incidents broke those concepts, and architects started discussing how I can make my entry points also fault-tolerant and highly available, and how I can add disaster recovery capabilities. So, those incidents really taught a good lesson to the entire world. And they gave us an opportunity to re-architect some of our components before we met with such incidents.
Corey: One of the things that I didn’t really have a keen appreciation for is scale itself. When I’m sitting here building a Hello World-style application and okay, I even have a microservices architecture with a half-dozen different things all talking to one another, it’s not that hard for me to start tracing what’s going on through the application as I hit various, you know, syntax errors because I don’t know what a linter is, and I’m terrible at programming. Great. But once you’re at a point of significant scale where every individual transaction is like looking for a needle in a haystack, it feels like what I used to call monitoring and most people call observability and I refer to as hipster monitoring now seems to have taken on a very different tone. But it does seem like you need to be at a certain point of scale before any of it really makes sense and starts to resonate. Has that been your experience or am I missing something fundamental?
Kannan: Yes, some part of it, right? So, when you grow up from your, you know, startup base with a few thousand customers to a few million, right, that’s the journey, you’re going to make and [deliver 00:20:18] initial assumptions that are going to be proven wrong as you go along that journey, right? So, on the major aspect of scaling, especially in Circles.Life, we take a proactive approach to scaling. And I would say that the rest of the world, the major tech companies, are doing the same as well, right?
So, we don’t rely entirely on auto-scaling capabilities, right? And also, we don’t rely entirely on monitoring and observability for us to make scaling decisions. So, when you build complex systems, right, in a SaaS environment, right, so you just need to build in such a way that [unintelligible 00:20:53] user journey can be performance-tested, and the metrics from that testing should be taken as input, number one. And number two, you also need to do microservice-to-microservice-based, individual, independent testing to make sure that it can [unintelligible 00:21:09] withstand this N number of transactions per second without breaking any of the underlying systems. And number three, are the databases scaled appropriately? Then you add the observability piece on top of that.
Together, it is going to give you a proactive approach of keeping your systems scaled before your customers actually get into the system in large numbers. And also you have situations like, you know, for example, you have customers, and the new year is coming up, and you are running campaigns, and you’re going to acquire a large number of customers on that particular day. And how can I enable dynamic scaling and, you know, disable it at a later point of time? So, these scaling strategies are going to really help you achieve, you know, lower customer impact and high availability of this [unintelligible 00:21:57]. And we are ahead of the curve, especially in the digital telco arena. And we spend a lot of time and money on this area of reliability.
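The proactive approach described in this answer (size the system from performance-test results before the customers arrive, rather than waiting for auto-scaling to react) can be sketched as a small capacity calculation. This is an illustrative sketch only: the function name, the 30 percent headroom default, and the example figures are assumptions, not numbers from the conversation.

```python
import math


def instances_needed(
    expected_peak_tps: float,
    tested_tps_per_instance: float,
    headroom: float = 0.3,
) -> int:
    """Instances to provision ahead of an expected peak.

    expected_peak_tps: forecast peak load, e.g. for a campaign day (assumed input).
    tested_tps_per_instance: sustained throughput one instance handled in a
        performance test without breaking downstream systems (assumed input).
    headroom: extra safety margin on top of the forecast (0.3 means 30 percent).
    """
    if tested_tps_per_instance <= 0:
        raise ValueError("tested_tps_per_instance must be positive")
    if expected_peak_tps < 0 or headroom < 0:
        raise ValueError("expected_peak_tps and headroom must be non-negative")
    target_tps = expected_peak_tps * (1 + headroom)
    return math.ceil(target_tps / tested_tps_per_instance)


if __name__ == "__main__":
    # Hypothetical figures: a normal day versus a new-year campaign day.
    print("normal day:", instances_needed(400, 250))
    print("campaign day:", instances_needed(1200, 250))
```

The same calculation would be repeated per microservice and for the databases, since each has its own tested limit; scaling back down after the campaign is simply running it again with the lower forecast.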
Corey: It really feels on some level, like, all of these things get lumped together. And at small-scale, they absolutely do. The idea of reliability of the infrastructure underlying things, performance engineering, observability as a whole. It’s all—basically, if it plugs into the wall, keep it running and make sure that the site is up and we find out about it before customers start calling to tell us that there’s a problem. But as you wind up scaling out, it feels like these things start to become—they gain a bit of organizational distance between them where they’re related, but they tend to be handled by different teams focusing on different areas. Given that all of these roads effectively lead to you in your current role, how do you think that they should be structured in an organizational context?
Kannan: Yeah, so that’s a great question. So, any structuring of your team in any organization, including ours, right, so we start from somewhere, and [unintelligible 00:23:00] more matured and, you know, more optimal way. And nobody can start from an ideal team on day one, right? Being a startup, we started with, you know, three to four... you know, our founder started in a room with an engineer, and now, you know, a thousand-plus engineers are working in the company. The way I see it, you need to structure your team based on, if you’re a tech company, right?
So, based on what architecture, what enterprise architecture, you’re going to adopt, right? Think about this: for example, if you have isolation between your infrastructure, reliability, and DevOps engineers and your developers and, you know, your QAs and the automation engineers, that is going to create silos. We learned this lesson the hard way and we made changes accordingly to build full-stack DevOps and SRE teams. That brings a lot of ability for us to improve throughput and, you know, reliability as well as, you know, speed. So, structuring the team depends on how you architect your product and how you can make full-stack teams to support delivering that architecture and have software and guardrails to minimize any impact.
Think about this, right? So, you are outsourcing your deliverables to a country where you have junior engineers, right? So, your platform should not be broken by even an intern, right? So, you need to build guardrails accordingly. So, your CI/CD pipelines are completely automated. You check in any junk, and if it is junk, the pipeline is going to reject it and it is not going to be, you know, promoted to [laugh] another [unintelligible 00:24:37]. When we [crosstalk 00:24:38]...
Corey: It sounds almost like you’re suggesting the radical idea that it’s not sustainable when every engineer, to work in your Kubernetes environment, needs to have 15 years of experience running large-scale applications otherwise, they’re a menace to themselves and others. There’s a scalability problem. How do we [laugh] make this stuff more accessible where you can be a junior engineer and not be a walking disaster waiting to happen? And guardrails are the answer, but everyone hears the term and almost seems to recoil because, “Ew, governance. I don’t like being told what I can’t do.” It’s, “Yeah, and I don’t like getting a notice that my data has just been leaked by yet another vendor.” So, you know, we all have choices to make.
Kannan: Absolutely right. When you say governance, people will hate that word, right? When I was an [unintelligible 00:25:17] engineer, the most hated word was governance as well because, you know, when you are young, you’re exploring and when the freedoms are cut and, you know, you feel bad [laugh]. So, that’s [unintelligible 00:25:28] engineers [unintelligible 00:25:30] work as a lab environment of an R&D environment where more freedom.
But now the world has changed, right? So, now we can bring guardrails within the technology. I’ll give an example. If you’re in the public cloud, if you hire an engineer to play around with, you know, your AWS components or GCP components, you can build policies internally. For example, if your engineer is creating an S3 bucket and makes it a [unintelligible 00:25:53], your policy will prevent that from happening.
Immediately, that transaction will be rolled back. He won’t be able to create an S3 bucket. So, I’m not keeping people to explain to people. So, you have a central policy which will be shared across the table; if you adhere to it, you are good. Now, you can deliver your stuff.
If you’re not, then your system is not going to allow you [laugh]. So, now companies have evolved to build automation into every aspect of your, you know, guardrails. Whether it is CI/CD or the release management side of it or the access control side of it, or you are trying to expose the data or access the data, everything needs to be digitally [unintelligible 00:26:33], audited, and actioned.
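The guardrail idea in this answer, a central policy that is checked automatically so a non-compliant request is rejected before anything is created, can be sketched in a few lines. Everything here is hypothetical: the `BucketRequest` type, the two rules, and the `provision` function are stand-ins for illustration and are not an AWS or GCP API. The transcript does not say which bucket setting the policy blocks, so this sketch assumes "no publicly accessible buckets" purely as an example. In a real account this kind of rule would normally live in the cloud provider's own policy mechanism.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a request breaks the central policy."""


@dataclass(frozen=True)
class BucketRequest:
    # Hypothetical request shape, not a real cloud API object.
    name: str
    public_access: bool
    encrypted: bool


def violations(request: BucketRequest) -> list[str]:
    """Return the central-policy rules this request breaks (assumed rules)."""
    problems = []
    if request.public_access:
        problems.append("buckets must not be publicly accessible")
    if not request.encrypted:
        problems.append("buckets must be encrypted at rest")
    return problems


def provision(request: BucketRequest, created: list[str]) -> None:
    """Create the bucket only if the request passes every policy rule."""
    problems = violations(request)
    if problems:
        # Reject before any state changes, so there is nothing to roll back.
        raise PolicyViolation(f"{request.name}: " + "; ".join(problems))
    created.append(request.name)


if __name__ == "__main__":
    created: list[str] = []
    provision(BucketRequest("billing-exports", public_access=False, encrypted=True), created)
    try:
        provision(BucketRequest("scratch", public_access=True, encrypted=True), created)
    except PolicyViolation as err:
        print("rejected:", err)
    print("created:", created)
```

The point of the design is that the rule is written once and enforced by the system, so nobody has to explain it to each new engineer.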
Corey: I really want to thank you for taking so much time out of your day to speak with me about how you’re approaching a lot of these things, and for giving what I perceive to be a very positive but also honest and fair assessment of your experience as a Severalnines customer. If people want to learn more about what you’re up to, where’s the best place for them to find you?
Kannan: Well, thanks, Corey. First of all, thanks for hosting me. And thanks to Severalnines as well. Yeah, so Circles.Life is a digital telco, as I mentioned. We are going to expand into multiple countries in a short span of time, an exciting journey with partners like Severalnines helping us.
You can find us on Facebook, on Circles.Life website, and Instagram, Twitter, and Facebook, [unintelligible 00:27:18], every social network. And we will be talking in TM forums and other [unintelligible 00:27:24] telco conferences where you can find our colleagues there.
Corey: It’s an area I find myself paying an increased amount of attention to as the world continues to progress towards, well, hopefully something. Thank you once again for your time. I really appreciate it.
Kannan: Thank you, Corey. Thanks for your time as well.
Corey: Kannan Solaiappan, Head of Reliability and Data Engineering at Circles.Life, brought to us by our friends at Severalnines. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry insulting comment that will not get posted because your podcast platform of choice is having a problem with one of the 17 different cloud providers that they deployed to, so as a result, nothing works until it’s fixed.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
2021 Duckbill Group, LLC