Modern applications require high availability. Our customers expect it, our customers demand it. But building a modern scalable application that has high availability is not easy and does not happen automatically. Problems happen. And when problems happen, availability suffers. Sometimes availability problems come from the simplest of places, but sometimes they can be highly complex.
In this episode, we will continue our discussion from last week with the remainder of the five strategies for keeping your modern application highly available.
This is How to Improve Application Availability, on Modern Digital Applications.
Building a scalable application that has high availability is not easy and does not come automatically. Problems can crop up in unexpected ways that can cause your application to stop working for some or all of your customers.
No one can anticipate where problems will come from, and no amount of testing will find all issues. Many of these are systemic problems, not merely code problems.
To find these availability problems, we need to step back and take a systemic look at our applications and how they work.
What follows are five things you can and should focus on when building a system to make sure that, as its use scales upwards, availability remains high. In part 1 of this series, we discussed two of these focuses. The first was building with failure in mind. The second was always thinking about scaling. In part 2 of this series, we conclude with the remaining three focuses.
Number 3 - Mitigate risk
Keeping a system highly available requires removing risk from the system. When a system fails, often the cause of the failure could have been identified as a risk before the failure actually occurred. Identifying risk is a key method of increasing availability.
All systems have risk in them. There is risk that:
A server will crash
A database will become corrupted
A returned answer will be incorrect
A network connection will fail
A newly deployed piece of software will fail
Keeping a system available requires removing risk. But as systems become more and more complicated, this becomes less and less possible. Keeping a large system available is more about managing what your risk is, how much risk is acceptable, and what you can do to mitigate that risk.
This is risk management, and it is at the heart of building highly available systems.
Part of risk management is risk mitigation. Risk mitigation is knowing what to do when a problem occurs in order to reduce the impact of the problem as much as possible. Mitigation is about making sure your application works as best and as completely as possible, even when services and resources fail. Risk mitigation requires thinking about the things that can go wrong, and putting a plan together now, to be able to handle the situation when it does happen.
For example, consider a typical online e-commerce store. Being able to search for products is critical to almost any online store. But what happens if search breaks?
To prepare for this, you need to have “Failed Search Engine” listed as a risk in your application risk plan. And in that risk, you need to specify a mitigation plan to execute if that risk ever triggers.
For example, we might know from history that 60 percent of people who search our site end up looking at and buying our famous red striped shirt. So, if our search service stops functioning, rather than simply failing, we could display an appropriate “I’m Sorry” page, followed by a list of our most popular T-Shirts, including our red striped shirts. For some number of customers, this would be a success. For the rest, it might create alternatives for them other than simply leaving in frustration. Combine this I’m sorry page with showing a coupon for 10% off their next visit, and you’ve turned a bad customer experience into an experience that might just create some return customers.
This is an example of a risk mitigation plan. It’s a plan that you build in advance of a potential but serious problem, so that you are able to implement it if that problem occurs.
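The search example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular store implements it: the names `SearchPage`, `search_with_fallback`, `POPULAR_PRODUCTS`, and the coupon code `SORRY10` are all made up for this sketch, and the search service is passed in as a plain function so that a failure is easy to simulate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchPage:
    """What the storefront renders for a search request."""
    results: List[str]
    degraded: bool = False   # True when the mitigation plan was used
    message: str = ""
    coupon_code: str = ""


# Hypothetical values; a real store would load these from configuration.
POPULAR_PRODUCTS = ["Red striped shirt", "Blue T-shirt", "White T-shirt"]
APOLOGY_COUPON = "SORRY10"  # stands in for the 10% off coupon


def search_with_fallback(
    query: str, search_service: Callable[[str], List[str]]
) -> SearchPage:
    """Try the real search; if it fails, serve the mitigation page instead."""
    try:
        return SearchPage(results=search_service(query))
    except Exception:
        # Mitigation plan for the "Failed Search Engine" risk.
        return SearchPage(
            results=list(POPULAR_PRODUCTS),
            degraded=True,
            message="I'm sorry, search is unavailable. Here are our most popular items.",
            coupon_code=APOLOGY_COUPON,
        )


def broken_search(query: str) -> List[str]:
    """Simulates the search service being down."""
    raise ConnectionError("search service is down")


page = search_with_fallback("striped shirt", broken_search)
print(page.degraded, page.results[0], page.coupon_code)
```

The important design choice is that the fallback is decided in advance and lives next to the call that can fail, so nobody has to invent a response in the middle of an outage.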
Other risk mitigation plans might be entirely technical. They might involve failover servers, or rapid response plans to resolve an issue. Whatever the plan is, risk mitigation is the process of creating and putting these plans into place.
Number 4 - Monitor availability
You can’t know if there is a problem in your application unless you can see the problem. Make sure your application is properly instrumented so that you can see how the application is performing.
Proper monitoring depends on the specifics of your application and needs, but usually entails some or all of the following capabilities:
Server monitoring. Monitoring the health of the server infrastructure that your application is running on. This might be physical resources, or cloud-based virtual resources.
Configuration change monitoring. Understanding when and how your system infrastructure changes and how those changes impact the operation of your application.
Application performance monitoring. Look inside your application and services to make sure they are operating the way you expect them to operate.
Synthetic testing. Monitor how your application works from an external perspective in order to catch problems as customers may see them before they actually see them.
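As a rough illustration of the last item, a synthetic test can be as simple as a scripted request that is judged the way a customer would judge it: did it succeed, and was it fast enough? The sketch below assumes a hypothetical `fetch` callable that performs one request and returns an HTTP-style status code; the two-second latency limit is an arbitrary example value.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    latency_seconds: float
    detail: str


def run_synthetic_check(
    fetch: Callable[[], int], max_latency_seconds: float = 2.0
) -> CheckResult:
    """Make one request from the outside and decide whether it passed."""
    start = time.monotonic()
    try:
        status = fetch()
    except Exception as exc:
        return CheckResult(False, time.monotonic() - start, f"request failed: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return CheckResult(False, latency, f"unexpected status {status}")
    if latency > max_latency_seconds:
        return CheckResult(False, latency, "response too slow")
    return CheckResult(True, latency, "ok")


# Stand-ins for real requests: one healthy endpoint, one returning an error.
healthy = run_synthetic_check(lambda: 200)
failing = run_synthetic_check(lambda: 503)
print(healthy.ok, failing.ok, failing.detail)
```

Run on a schedule from outside your own infrastructure, a check like this can raise an alert before a customer ever notices the problem.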
Monitoring involves improving two aspects of modern application operation: MTTD and MTTR. That’s mean time to detection and mean time to resolution. Looking at key performance indicators for changes in patterns and alerting you of those changes will improve your mean time to detection. Giving you a wealth of data that can be used to diagnose the source of a problem will improve your mean time to resolution. Both of these are important measures for improving application availability.
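To make these two measures concrete, here is a small worked example. It assumes each incident record carries three timestamps (when the problem started, when it was detected, and when it was resolved) and that resolution time is measured from detection. Teams define MTTR in different ways, so treat this as one possible convention. The `Incident` class and the sample data are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_time_to_detection(incidents: List[Incident]) -> timedelta:
    """Average gap between a problem starting and someone knowing about it."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_resolution(incidents: List[Incident]) -> timedelta:
    """Average gap between detection and resolution (assumed convention)."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved - i.detected for i in incidents), timedelta())
    return total / len(incidents)


# Invented sample data: detection took 10 and 20 minutes,
# resolution took 60 and 30 minutes.
incidents = [
    Incident(datetime(2020, 1, 1, 2, 0), datetime(2020, 1, 1, 2, 10), datetime(2020, 1, 1, 3, 10)),
    Incident(datetime(2020, 1, 8, 14, 0), datetime(2020, 1, 8, 14, 20), datetime(2020, 1, 8, 14, 50)),
]

mttd = mean_time_to_detection(incidents)
mttr = mean_time_to_resolution(incidents)
print(mttd, mttr)  # 0:15:00 0:45:00
```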
Number 5 - Respond to issues in a predictable and well defined manner
Monitoring systems are useless unless you are prepared to act on the issues that arise. This means being alerted when problems occur so that you can take action. Additionally, you should establish processes and procedures that your team can follow to help diagnose issues and easily fix common failure scenarios.
For example, if a service becomes unresponsive, you might have a set of remedies to try to make the service responsive. This might include tasks such as running a test to help diagnose where the problem is, restarting a daemon that is known to cause the service to become unresponsive, or rebooting a server if all else fails. Having standard processes in place for handling common failure scenarios will decrease the amount of time your system is unavailable. It will help with improving Mean Time to Resolution.
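A standard process like this can also be written down as an ordered list of remedies that are tried until the service is healthy again. The sketch below is a simplified, hypothetical runbook: `run_runbook`, the remedy names, and the simulated service state are all invented here, and real remedies would call your own tooling.

```python
from typing import Callable, Dict, List, Tuple

Remedy = Tuple[str, Callable[[], None]]


def run_runbook(is_healthy: Callable[[], bool], remedies: List[Remedy]) -> List[str]:
    """Apply remedies in order until the service is healthy; return the steps taken."""
    log: List[str] = []
    for name, action in remedies:
        if is_healthy():
            break
        log.append(f"trying: {name}")
        action()
    log.append("resolved" if is_healthy() else "escalate to on-call engineer")
    return log


# Simulated service that only recovers once its daemon is restarted.
state: Dict[str, bool] = {"responsive": False}


def restart_daemon() -> None:
    state["responsive"] = True


steps = run_runbook(
    is_healthy=lambda: state["responsive"],
    remedies=[
        ("run diagnostic test", lambda: None),
        ("restart daemon", restart_daemon),
        ("reboot server", lambda: None),
    ],
)
print(steps)
```

The returned log doubles as a record of what was tried, which is the kind of followup diagnosis information an engineering team can use later.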
Additionally, they can provide useful followup diagnosis information to your engineering teams to help them deduce the root cause of common ailments, in order to reduce the likelihood of a reoccurrence of a problem, or the occurrence of a similar problem.
When an alert is triggered indicating that a service is or might be failing, the owners of that service must of course be alerted so they can deal with the issue in a timely manner.
However, other teams that are closely connected to the problem service may also want to be alerted. If you own a service that depends on the failing service, you might want to be informed of the problem even before it impacts your service, so that you could take preventative measures or institute actions from your risk management plan before they become critical. Additionally, if the failing service is a consumer of your service, you may want to be aware of the fact that traffic patterns from the failing service may change as the problem occurs and is being resolved. You may want to keep a close eye on your service to make sure any changes do not negatively impact you.
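One way to picture this is as a lookup over a service dependency map: page the owner of the failing service, and inform the teams on either side of it. The service names, team names, and the `teams_to_notify` function below are hypothetical, and a real alerting system would have its own way of expressing these relationships.

```python
from typing import Dict, List, Set

# Hypothetical service graph: each service maps to the services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["search", "payments"],
    "search": ["catalog"],
    "payments": [],
    "catalog": [],
}

# Hypothetical ownership table.
OWNERS: Dict[str, str] = {
    "checkout": "team-checkout",
    "search": "team-search",
    "payments": "team-payments",
    "catalog": "team-catalog",
}


def teams_to_notify(failing: str) -> Dict[str, Set[str]]:
    """Decide who is paged and who is merely informed when a service fails."""
    owner = {OWNERS[failing]}
    # Teams whose services depend on the failing service.
    dependents = {OWNERS[s] for s, deps in DEPENDS_ON.items() if failing in deps}
    # Teams whose services the failing service consumes.
    providers = {OWNERS[d] for d in DEPENDS_ON[failing]}
    return {"page": owner, "inform": (dependents | providers) - owner}


notify = teams_to_notify("search")
print(sorted(notify["page"]), sorted(notify["inform"]))
```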
Documented processes and operations are an essential part of this process. Support artifacts should be well documented and available to all parties that require them. They should also be frequently reviewed and updated, and updating support artifacts should be a regular part of your process for adding new features and capabilities.
These processes and procedures are especially useful because outages often occur at inconvenient times, such as the middle of the night or on weekends, when your on-call team might not perform at peak mental efficiency.
These recommendations will assist your team in making smarter and safer moves toward restoring your system to operational status.
No one can anticipate where and when availability issues will occur. But you can assume that they will occur, especially as your system scales to larger customer demands and more complex applications.
Preparation and planning are critical to improving availability and maintaining availability as your application scales.
That’s it for the five focuses to help improve your modern application availability. More information on these five focuses can be found in my book, Architecting for Scale, published by O’Reilly Media. A link can be found in the shownotes.
Tech Tapas — Can’t Scale? Time to go out of business
Why is it essential that your application scale? Well, why not ask Robinhood Financial? Robinhood is an investment company that provides investment management services for its tech-savvy clientele.
Robinhood learned the hard way the cost of success.
On Monday, March 2nd, 2020, the United States stock market had a record-breaking day. After previous significant drops due to virus pandemic scares, good news caused the stock market to rally. The result was record-breaking traffic in the stock market.
For a young investment company, like Robinhood Financial, this would typically be considered great news! The company thrives on new account signups and on customer market transactions. Both of these were available in record numbers to Robinhood on this day.
The problem? Their traffic was *too* high.
You see, companies like Robinhood need to be able to respond to variable loads, as spikes in traffic occur all the time.
Still, this record-breaking traffic spike and the volatile market conditions that went with it were more than Robinhood’s infrastructure could handle. The result? Their systems started to fail. This failure created a “thundering herd” effect, as Robinhood founders described it, leading to a failure of their DNS system.
The result was that Robinhood systems were down for approximately one and a half days.
One and a half days.
This one-and-a-half-day outage occurred during a peak stock market time, a time when their customers needed them most.
During their most critical time, their systems were unavailable.
That’s often the problem: scaling-related outages more often than not occur during the good times, not the bad times. This can very quickly turn a significant business opportunity for success...into an utter failure.
Scaling and availability problems can take the moment of greatness you’ve been working towards all your life, and turn it into an event that can shutter your business.
This isn’t true just for Robinhood — all modern companies face problems like this. One of the most challenging things for an online business to handle...is success.
Success can be the killer of a business.
Success can shut you down.
Avoiding problems like this is why it is essential that you consider scaling and availability needs of your application well before your needs arise. The day when success is staring you in the face is too late to be planning for scaling in your system architecture.
In the shownotes, I have a link to the announcement of the outage by Robinhood in their blog, along with links to other useful information.
There are many resources you can use to help you build scalability and availability into your business processes and your applications. My book, Architecting for Scale, published by O’Reilly Media, has lots of useful high-level information on the systems and processes involved in building highly scaled, highly available applications. You can also listen to this podcast. I talk often about scaling and availability topics as they relate to building and operating your modern digital applications.