A series of recent cloud service outages has impacted people across the globe. In this latest podcast we talk about the underlying causes, why there seem to be more outages despite reassurances that the cloud is safe, and what your organization can do to mitigate the risk.
Scott: A series of recent cloud service outages has impacted people across the globe. What's the cause of it? Here to talk about it is Mike Burke with IR. Mike, are we seeing more outages than normal or does it just seem that way?
Mike: Well, I don't think we're seeing more outages than normal. It's that there's more emphasis on, and migration to, the cloud for day-to-day computing and application functionality, so when an outage happens it affects more people and more of the things they've come to rely on as a daily part of both their work life and their personal life. And so when an outage occurs, it's more widely felt within the community at large, and there's both business and personal impact. You know, when something like, say, Netflix is interrupted, or when AWS goes down for a brief period of time, it's not just a business outage. It's a personal outage as well.
Scott: What would you say to those companies who rely on those services? What should they do, knowing that, you know, this is going to happen from time to time? What are some things that they can do to prepare for it?
Mike: Well, get ahead of the game. Companies such as Netflix have a very active fault injection routine that they go through: they continuously inject faults into their network, see what happens, and use that information proactively to continuously improve the network, so that should that fault happen in the real world in an uncontrolled situation, they've got a workaround in place already. And the idea of having self-healing networks, if you will, or multiple levels of redundancy is very helpful. It's thinking ahead: these are the failure scenarios that we've tried, and this is what we know we can do to mitigate those. The other thing is, put your services on more than one cloud. There are multiple ones out there. You know, Google, Azure from Microsoft, Amazon Web Services, AWS. There are multiple cloud environments available for hosting platforms, infrastructure, and applications.
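As a rough illustration of the fault injection idea Mike describes, here is a minimal Python sketch. It is not Netflix's tooling: the function names (`with_fault_injection`, `call_with_fallback`, `fetch_from_primary`, `fetch_from_backup`) and the 30% failure rate are assumptions made up for this example. It deliberately fails a fraction of calls to a primary service and checks that a fallback still serves every request.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose by the injector to stand in for a real failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail deliberately."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(primary, fallback, *args):
    """Try the primary; if a fault was injected, use the fallback instead."""
    try:
        return primary(*args), "primary"
    except InjectedFault:
        return fallback(*args), "fallback"


# Hypothetical services: both return the same answer, so a caller
# should not notice which one actually handled the request.
def fetch_from_primary(key):
    return f"value-for-{key}"


def fetch_from_backup(key):
    return f"value-for-{key}"


def run_experiment(calls=1000, failure_rate=0.3, seed=42):
    """Inject faults into the primary and count who served each call."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(fetch_from_primary, failure_rate, rng)
    served = {"primary": 0, "fallback": 0}
    for i in range(calls):
        value, source = call_with_fallback(flaky, fetch_from_backup, i)
        assert value == f"value-for-{i}"  # every request still gets an answer
        served[source] += 1
    return served


if __name__ == "__main__":
    print(run_experiment())
```

The point of running something like this continuously is the one made above: the workaround gets exercised under controlled conditions before an uncontrolled failure forces the issue.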
Scott: For some of the outages that we have seen recently, what are some causes? I mean, is it just the normal course of business, it's gonna happen, nothing you can do about it? Or is that just where we are right now with the technology?
Mike: Well, the technology is pretty sophisticated. It can be higher-than-expected usage, or perhaps a memory leak within an application, that causes resources to be constricted at some point in time, and then things just start to spin down. Think about computing as maybe like a superhighway. You know, when you look at it at 2 o'clock in the morning and you see fifteen lanes that are empty, you can't imagine that it would ever spin down. But all that has to happen is one car blows a tire or something like that, and all of a sudden everything can grind to a halt in both directions, both from the incident and from the gawkers. It's an intriguing analogy, but data processing can be a lot like that. Something goes wrong, and all of a sudden services that are used to running along at a hundred miles an hour have to stop and take care of something else, fix it as you go, and then things start to back up and back up, and sooner or later the system runs out of resources and it just has to spin down to be able to recover itself.
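The superhighway analogy can be put into numbers with a small Python sketch. All figures here are invented for illustration: 90 requests arrive per tick, the system can normally process 100 per tick, and a hypothetical five-tick stall plays the part of the blown tire. The function `simulate_backlog` and its parameters are assumptions, not taken from any real system.

```python
def simulate_backlog(ticks, arrivals_per_tick, capacity_per_tick,
                     stall_start, stall_length, queue_limit):
    """Track the request backlog tick by tick.

    During the stall the system processes nothing. Returns a pair
    (failed_at, history): failed_at is the tick where the backlog first
    exceeded queue_limit (None if it never did), and history is the
    backlog recorded after each tick.
    """
    backlog = 0
    history = []
    for tick in range(ticks):
        stalled = stall_start <= tick < stall_start + stall_length
        capacity = 0 if stalled else capacity_per_tick
        backlog = max(0, backlog + arrivals_per_tick - capacity)
        history.append(backlog)
        if backlog > queue_limit:
            return tick, history  # out of resources: has to spin down
    return None, history


if __name__ == "__main__":
    # Plenty of room to queue: the system survives but recovers slowly.
    failed_at, history = simulate_backlog(100, 90, 100, 10, 5, 1000)
    print(failed_at, max(history))

    # Less room to queue: the same stall now exhausts resources.
    failed_at, _ = simulate_backlog(100, 90, 100, 10, 5, 400)
    print(failed_at)
```

With only 10 requests per tick of spare capacity, a five-tick stall builds a backlog of 450 requests that takes 45 further ticks to clear, which is why a brief incident can be felt long after it is fixed.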
Scott: You suggested some steps that businesses can take to sort of prepare for this, and you mentioned some examples like Netflix, that sort of thing. What if I'm a small business owner and I find myself just at the mercy of these outages, and when it happens, there's nothing I can do about it? I mean, are some of these same steps that you mentioned applicable even for the little guy?
Mike: I think for a little guy that would be a pretty expensive thought process. But the idea is making sure that when you choose to go to the cloud, you're working with a cloud services supplier that has multiple regions, so the supplier has taken it upon itself to compartmentalize the delivery that it has available. Then if a failure in one spot happens to pop up unexpectedly, they can transfer the data processing and the communications connectivity over to another region that hasn't been affected. So make sure you're working with a cloud services provider that has got ample capacity and is itself interested in multiple levels of redundancy, so that you're not dependent upon just one computer like we all might be at home. You know, do your homework and make sure you find a provider that's got the proper level of risk mitigation in place. It's a cost tradeoff analysis. You can't always spend enough money to go to the moon on all these things, but what you can do is take a look at the risk mitigation that's in place with the various cloud services providers and pick one that's got the right level of redundancy and risk mitigation in place that fits your budget.
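The regional failover Mike describes can be sketched in a few lines of Python. This is only an illustration of the idea, not any provider's API: the function `pick_region`, the region names, and the health map are all hypothetical, and a real provider would determine health through its own monitoring.

```python
def pick_region(preferred_order, health):
    """Return the first region in preferred_order that is marked healthy.

    health maps a region name to True (up) or False (down). A region
    missing from the map is treated as down.
    """
    for region in preferred_order:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    # Hypothetical region names, listed in order of preference.
    regions = ["region-a", "region-b", "region-c"]

    # Normal operation: traffic goes to the preferred region.
    print(pick_region(regions, {"region-a": True, "region-b": True}))

    # region-a fails unexpectedly: traffic moves to an unaffected region.
    print(pick_region(regions, {"region-a": False, "region-b": True}))
```

A provider with only one region has nowhere to send that second request, which is the dependency on "just one computer" that the answer above warns against.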
Scott: Join us next time when we discuss the steps needed to be sure your system is ready for the demand of Thanksgiving weekend. For more podcasts, visit IR.com.