Microservice architectures offer IT organizations many advantages over traditional monolithic applications. This is especially true in cloud environments, where resource optimization works hand-in-hand with microservice architectures.
So it’s no mystery that so many organizations are transitioning their application development strategies to a microservices mindset. But even in the realm of microservices, building and operating an application at scale can be daunting.
Problems can range from something as fundamental as having too few resources and too little time to continue developing and operating your application, to underestimating the needs of your rapidly growing customer base. At best, failure to build for scale can be frustrating. At worst, it can cause entire projects, even whole companies, to fail.
Realistically, we know that it’s impossible to remove all risk from an application. There is no magic eight ball, no crystal ball, that allows you to see into the future and understand how the decisions you make today will impact your application tomorrow. Risk will always be a burden to you and your application. But we can learn to mitigate risk. We can learn to lessen the impact of risk before the problems associated with it negatively impact you and your applications.
I’ve worked in many organizations and have observed many more. Planning for problems is very hard, and it is something most organizations fail to do properly. Technical debt is often a nebulous concept. Quantifying risk is the first step to understanding vulnerability. It also helps set priorities and goals. Is fixing one potential risk more important than another? How can you decide if the risks aren’t understood and quantified?
In this episode, we’re going to talk about how to measure risk, so that you can build, maintain, and operate large, complex, modern applications at scale.
There is a great quote by Donald Rumsfeld, who twice served as Secretary of Defense for the United States. It starts “Reports that say that something hasn’t happened are always interesting to me”.
He goes on to say: “because, as we know, there are known knowns, there’re things we know we know. We also know there are known unknowns, that is to say we know there are some things we do not know.”
“But there are also unknown unknowns. The ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.”
This is true in running a country, and a country’s military, and it is true in running a modern digital application at scale.
This one quote encompasses the entire meaning of risk management. Risk management is about dealing with the unknown unknowns.
You will often hear me talk about my big game example. This is the example where you invite 20 of your closest friends over to your house to watch the big game on your brand new big screen TV. Only, once the party — and the game — start, the power goes off in your home. Your big day is over, and your friends go home disappointed. Now, what would you do if you called the power company to report this outage, and their response was “What are you complaining about, you have power most of the time. In fact, we see you have power 95% of the time. Who cares if power goes out the other 5% of the time?”
Who cares indeed.
The reality is the power company can’t operate in this way. They cannot be satisfied with “good enough” service. They have to strive to provide power to you 100% of the time. 24 hours a day. 7 days a week.
They have to strive for perfection.
This difference from perfection, this extra 5%, is driven by the expected actions and problems that we see. It’s driven by the unknowns. It’s driven by the things we don’t even know that we don’t know.
It’s driven by the unknown unknowns.
Preparing for these unknown unknowns is what risk management is all about.
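To put the power company’s 95% figure in perspective, here is a minimal Python sketch that converts an availability percentage into hours of downtime per year. The function name and the 99% and 99.9% comparison points are my own illustrative assumptions, not figures from the example, and the calculation ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours per year a service is unavailable at the given availability."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


# 95% comes from the power company example; 99% and 99.9% are
# hypothetical comparison points added for illustration.
for pct in (95.0, 99.0, 99.9):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% availability -> about {hours:.1f} hours of downtime per year")
```

Under these assumptions, 95% availability works out to about 438 hours a year without power, which is a little over 18 days.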
Risk, like anything else, can be quantified. There are two fundamental metrics that matter most when quantifying risk: likelihood and severity.
Likelihood is the measure of the chance of a particular risk triggering. Or, put another way, it’s the measure of the chance of a particular risk occurring. We say “what’s the likelihood of our pipes freezing tonight?” Or “what’s the likelihood of us getting rain tomorrow?” Or “what’s the likelihood of a tornado hitting our house?”
Likelihood measures the possibility of an event happening. The likelihood of you getting rain tomorrow, for instance, is significantly higher than the likelihood of your house getting hit by a tornado tomorrow. That’s likelihood.
Severity is the measure of the cost of a risk that triggers. If a risk occurs, what really happens and how severe are the ramifications? Using the examples above, the severity of rain hitting you on the head is pretty low. The severity of your pipes freezing is greater. But the severity of a tornado hitting your house is very high. The impact of each of those three things happening is different, and that difference is measured by severity.
It’s important to keep these two things distinct and understand the difference between them.
Likelihood is IF an event will occur. Severity is WHAT it costs when the event occurs.
The chance of rain tomorrow might be high (likelihood) but it doesn’t hurt you too much if it does (severity). The chance of your house getting hit by a tornado is very low (likelihood) but the impact of that event would be catastrophic for you (severity).
These two measures work together to quantify the risk of a particular event. Together, they are what we use to track and measure risk, whether that risk is a weather-related risk or a risk of an application failure in your business systems.
In my book, Architecting for Scale, I give an example of risk measurement in a modern application using a T-shirt e-commerce store. We can measure the risk of a failure of components of this application using likelihood and severity.
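As a minimal sketch of how these two measures might be recorded side by side, the Python below models a risk as a name plus a likelihood and a severity. The `Level` and `Risk` names and the simple low/high scale are my own assumptions for illustration; a real risk register might use a finer-grained scale.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Hypothetical two-point scale; real systems may use more levels."""
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: Level  # IF the event will occur
    severity: Level    # WHAT it costs when it does

    def label(self) -> str:
        """Return the two measures as 'likelihood/severity'."""
        return f"{self.likelihood.value}/{self.severity.value}"


# The two weather examples from the text above.
rain = Risk("rain tomorrow", Level.HIGH, Level.LOW)
tornado = Risk("tornado hits the house", Level.LOW, Level.HIGH)

for risk in (rain, tornado):
    print(f"{risk.name}: {risk.label()}")
```

Keeping likelihood and severity as separate fields, rather than folding them into one number, preserves the distinction between the two measures.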
The e-commerce store probably has a top ten list component, a service that generates a top ten list of products sold through the site. What’s the risk of the top ten list not appearing? The likelihood of the list not appearing is probably relatively low, because it’s a simple component without a lot of complexity to it. Likewise, the severity of the problem of the top ten list not appearing is also low. If customers can’t see the top ten list, it doesn’t significantly impact their buying experience. This would be a low likelihood, low severity problem. In shorthand, it’d be a low/low risk.
But what about the order database? What’s the risk if the order database stops accepting new orders? Well, once again, the likelihood of that happening is probably relatively low. It’s an important subsystem that we’ll assume is probably well maintained. But if it does happen, the severity of that problem is quite high. If you can’t accept orders, your entire business suffers. This would be a low likelihood, high severity problem. Shorthand, a low/high risk.
Moving on, let’s say your store uses a custom font to make the display more visually pleasing. What’s the risk of the font not loading in a user’s browser? Well, the likelihood of this happening might in fact be quite high. You can imagine scenarios where a user’s browser has a poor internet connection and the font file doesn’t load correctly. Or maybe you are using a third-party font service to provide the fonts dynamically. The likelihood of this problem occurring could actually be quite high. But what about the severity? Here the severity is probably quite low. If the custom font doesn’t load, the browser will just substitute a different font for the page. The page will still work, it just might not look quite as visually appealing as you would like. This would be a high likelihood, low severity problem. Shorthand, a high/low risk.
Finally, let’s take a look at the t-shirt photos that appear in the store. These are the pictures of products that customers might buy. What’s the risk of the photos not appearing on a page? Well, the likelihood of this risk could be high, because showing photos on a page means loading them from a cache server or maybe a third-party CDN, and this system might not be working quite right. The photos may not be available, or the user’s internet connection could flake out and not show them. The likelihood of this problem occurring is, in this example, high. What about severity? Well, it’s hard to imagine that a customer would buy a t-shirt that they could not see a photograph of, so if the photos aren’t appearing, that could have a big impact on your business since people would buy fewer t-shirts. The severity of this problem is also high. This would be a high likelihood, high severity problem. In shorthand, this would be a high/high risk.
These are four examples of problems that might occur in an e-commerce store, and the risk measurement associated with each of them. Now that we can measure the risk, we can use that measurement to prioritize work to mitigate or remove those risks. We can imagine that mitigating or removing a high/high risk would be more critical than a high/low or low/high risk, and all of them would be more important than working on a low/low risk.
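The ordering described above can be sketched in a few lines of Python. This is only an illustration: the tuple layout and the `priority` function are my own assumptions, and the scoring simply counts how many of the two measures are high. It deliberately treats high/low and low/high as a tie, since nothing here ranks one above the other.

```python
# Each entry: (risk, likelihood, severity), taken from the four store examples.
risks = [
    ("top ten list not appearing", "low", "low"),
    ("order database not accepting orders", "low", "high"),
    ("custom font not loading", "high", "low"),
    ("t-shirt photos not appearing", "high", "high"),
]


def priority(risk: tuple[str, str, str]) -> int:
    """Hypothetical score: the number of measures rated 'high' (0, 1, or 2)."""
    _, likelihood, severity = risk
    return int(likelihood == "high") + int(severity == "high")


# Highest score first; sorted() is stable, so tied risks keep their listed order.
ranked = sorted(risks, key=priority, reverse=True)

for name, likelihood, severity in ranked:
    print(f"{likelihood}/{severity}: {name}")
```

With this scoring, the t-shirt photos come first, the order database and custom font share the middle, and the top ten list comes last. How to break the tie in the middle is a judgment call for your own business.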
We can properly determine which risks are most helpful for us to work on, and we can measure the impact of our work to mitigate those risks.
In future episodes, I will continue the topic of risk management and discuss tools and techniques for monitoring, reporting, and mitigating risk in our applications, with the ultimate goal of reducing the impact that risk has on the availability of our applications.