Modern applications require high availability. Our customers expect it, our customers demand it. But building a modern scalable application that has high availability is not easy and does not happen automatically. Problems happen. And when problems happen, availability suffers. Sometimes availability problems come from the simplest of places, but sometimes they can be highly complex.
In this episode, we will discuss five strategies for keeping your modern application highly available.
This is How to Improve Application Availability, on Modern Digital Applications.
Building a scalable application that has high availability is not easy and does not come automatically. Problems can crop up in unexpected ways that can cause your application to stop working for some or all of your customers.
These availability problems often arise from the areas you least expect, and some of the most serious availability problems can originate from extremely simple sources.
Let’s take a simple example from a real world application that I’ve worked on in the past. This problem really happened.
The software was a SaaS application. Customers could log in to the application and receive a customized experience for their personal use. One of the ways that customers could tell they were logged in was that an avatar of themselves appeared in the top right-hand corner. It wasn’t a big deal, but it was a handy indicator that you were receiving a personalized environment. We’ve all seen this sort of thing; it’s pretty common in online software applications nowadays.
Anyway, by default, when we showed the page, we read the avatar from a third-party avatar service that told us what avatar to display for the current user. One day, that third-party system failed. Our application, which made the poor assumption that the avatar service would always be working, also failed. Simply because we were unable to display a picture of the user in the upper right-hand corner, our entire application crashed and nobody could use it. It was, of course, a major problem for us. It was harder, too, because the avatar service was out of our control. Our business was directly tied to a third-party service we had no control over, and we weren’t even aware of the dependency.
A very minor feature crashed our entire business…Our business crashed because of an icon.
Obviously, that was unacceptable.
How could we have avoided this problem? There were a thousand solutions. By far the easiest would have been to notice and catch any failure of the third-party service in real time and, if it did fail, show a default generic avatar instead. There was no need to bring down our entire application over this simple problem. A simple check, some error recovery logic, some fallback options: that’s all it would have taken to avoid crashing our entire business.
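To make that concrete, here is a minimal sketch in Python of the kind of check and fallback described above. It is not the code from the original application. The names avatar_url_for, fetch_avatar, and DEFAULT_AVATAR_URL are hypothetical, and the third-party lookup is passed in as a plain function so the sketch stays self-contained.

```python
from typing import Callable, Optional

# Hypothetical location of a generic avatar served by our own application.
DEFAULT_AVATAR_URL = "/static/default-avatar.png"


def avatar_url_for(user_id: str, fetch_avatar: Callable[[str], Optional[str]]) -> str:
    """Return the user's avatar URL, or a generic default if the lookup fails.

    fetch_avatar stands in for the call to the third-party avatar service.
    """
    try:
        url = fetch_avatar(user_id)
    except Exception:
        # Any failure in the avatar service degrades to the default picture
        # instead of failing the whole page.
        return DEFAULT_AVATAR_URL
    # An empty or missing answer is treated the same way as a failure.
    return url or DEFAULT_AVATAR_URL
```

The page keeps rendering whether or not the avatar service answers. The only visible difference for the customer is a generic picture in the corner.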
No one can anticipate where problems will come from, and no amount of testing will find all issues. Many of these are systemic problems, not merely code problems.
To find these availability problems, we need to step back and take a systemic look at our applications and how they work.
What follows are five things you can and should focus on when building a system to make sure that, as its use scales upwards, availability remains high.
Number 1 - Build with Failure in Mind
As Werner Vogels, CTO of Amazon, says:
“Everything fails all the time.”
You should plan on your applications and services failing.
It will happen.
Now, deal with it.
Assuming your application will fail, how will it fail? As you build your system, you need to consider availability during all aspects of your architecture, design, construction, and testing.
What design constructs and patterns have you considered or are you using that will help improve the availability of your software?
Are you using simple but effective methods for detecting problems with other services? This can be simply catching errors. It might mean appropriate retry logic. But it could also involve things like circuit breaker patterns in order to validate external systems are behaving as they should.
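As an illustration of "appropriate retry logic", here is a small Python sketch that retries a failing call a limited number of times, waiting longer between each attempt. The function name call_with_retry and its parameters are assumptions made for this example, not part of any particular library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on failure with exponential backoff.

    The delay doubles after each failed attempt. The last failure is
    re-raised so the caller can fall back or report the error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Retries help with brief, transient failures. They should stay bounded, because unlimited retries against a struggling service only add to its load.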
Circuit breaker patterns are particularly useful for handling dependency failures because they can reduce the impact a dependency failure has on your system as a whole, without having to continuously rediscover the problem and impact performance.
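A circuit breaker can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: the class name CircuitBreaker, the threshold and timeout values, and the injected clock are all assumptions chosen for the example. After a number of consecutive failures the breaker "opens" and stops calling the dependency for a while, returning a fallback immediately. Once the timeout passes, it lets one trial call through to see whether the dependency has recovered.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Simplified circuit breaker around calls to one dependency."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: skip the dependency entirely and answer right away.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback without waiting on a dependency that is known to be down. That is what keeps the system from rediscovering the same failure on every request.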
Once you detect a failure, what can you do? Is this a hard dependency that you cannot recover from if it fails? Or can you replace the call with a temporary result that may degrade the experience but allow other functions to continue? This is the case with the avatar service: you can show a default avatar as a degraded experience while the main part of your service continues operating.
Besides taking care of failures of systems that you depend on, you need to consider failures of systems that depend on you. What happens if a service that is calling your system behaves poorly? Can you handle excessive load from your consumers? Can you handle bad incoming requests and provide graceful responses?
Sometimes, denial-of-service attacks can come from “friendly” sources. A service calling you could have a bug that causes it to go into a tight infinite loop, calling you at an unacceptably high rate. Can you detect this and throttle traffic appropriately in a way that doesn’t impact valid requests that might also be coming at a high rate?
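One common way to throttle a misbehaving caller without hurting everyone else is to give each caller its own token bucket. The Python sketch below is an assumption about how this could look, not a description of any specific system. The class name PerCallerRateLimiter and the rate and burst values are made up for the example.

```python
import time
from typing import Callable, Dict


class PerCallerRateLimiter:
    """Token bucket per caller: each caller may burst up to `burst` requests,
    then is limited to `rate_per_second` sustained requests."""

    def __init__(
        self,
        rate_per_second: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self._tokens: Dict[str, float] = {}
        self._last_seen: Dict[str, float] = {}

    def allow(self, caller_id: str) -> bool:
        now = self.clock()
        tokens = self._tokens.get(caller_id, float(self.burst))
        last = self._last_seen.get(caller_id, now)
        # Refill according to the time elapsed, never above the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        self._last_seen[caller_id] = now
        if tokens >= 1.0:
            self._tokens[caller_id] = tokens - 1.0
            return True
        self._tokens[caller_id] = tokens
        return False  # the caller should be rejected or delayed
```

Because each caller has a separate bucket, a service stuck in a tight loop exhausts only its own tokens. Other callers, including legitimately busy ones with a suitably sized limit, are unaffected.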
Number 2 - Always think about scaling
Speaking about scaling…
Always be thinking about scaling during all aspects of your system architecture, design, and construction. Thinking about scaling should be ingrained into your culture and part of every decision you make.
Just because your application works now does not mean it will work tomorrow. Most web applications have increasing traffic patterns. A website that generates a certain amount of traffic today might generate significantly more traffic sooner than you anticipate. As you build your system, don’t build it for today’s traffic; build it for tomorrow’s traffic. Build it for traffic spikes and bulges. Build it for your biggest days.
This might involve things as simple as increasing the size and capacity of your databases. But it might involve rethinking a design to reduce dependency on certain types of resources, or utilize caching and other scale-based-optimization techniques.
Think about what logical limits exist to your scaling. Where are your sensitive and vulnerable spots? What can you do today to reduce your impact on these vulnerable spots?
Can content that you are currently generating dynamically be generated statically and cached instead?
Be creative. You might be surprised what scaling can do to your thought process.
One simple example from my early days with Amazon. We were looking at how to scale an early version of the menu bar on the top of the amazon.com home page. That page was dynamically generated on every page load. It “had to be”, because the banner contained a “Hi Dave” personalized message for each customer. It had to be generated dynamically.
Until we discovered an interesting fact. A significantly large portion of our traffic came using only one of a hundred or so possible names. By creating static versions of the banner for each of these high traffic names and caching these, we would only have to dynamically create the banner for pages that were personalized using a different, less common name. The result was that a huge amount of our dynamic traffic could be cached.
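The idea can be sketched in Python as follows. This only illustrates the technique described above and is not Amazon's implementation. The function names, the render function, and the list of common names are hypothetical.

```python
from typing import Callable, Dict, Iterable


def build_banner_cache(
    common_names: Iterable[str], render: Callable[[str], str]
) -> Dict[str, str]:
    """Pre-render the banner once for each high-traffic name."""
    return {name: render(name) for name in common_names}


def banner_for(name: str, cache: Dict[str, str], render: Callable[[str], str]) -> str:
    """Serve the pre-rendered banner when one exists; render on demand otherwise."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    return render(name)  # less common names still take the dynamic path
```

With a cache built for the hundred or so most common names, only requests for other names pay the cost of dynamic rendering.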
Now, I don’t know if they ever implemented that approach or if other techniques were ultimately used for improving scale. That’s all irrelevant now. But the point is that this was a big lesson for me and I’ve always remembered it. It can be surprising what scale can do to help you make things more optimized.
*** High scale is the friend of optimization, not the enemy.
That is worth repeating…
*** High scale is the friend of optimization, not the enemy.
We often think of optimization as a tool to fight the negative impact of increased scaling…negative impacts like resource depletion. But if you only take one lesson out of this discussion, take this. The higher the scale an application operates at, the more options are available to you to improve system optimization and efficiency. The more you scale, the more you can optimize for scale.
That’s it for part one of this story in this episode. We will continue with the remaining three focuses for improved availability in the next episode.
More information on these five focuses to improve availability can be found in my book, Architecting for Scale, published by O’Reilly Media. A link can be found in the shownotes.
Tech Tapas — What happened to HP cloud?
What happened to HP Cloud? HP Cloud never was at the scale of Amazon Web Services, nor Microsoft Azure, nor Google Cloud. Yet it was a player for a while in the early cloud days. Like much of HP at the time, HP Cloud was focused from the beginning on meeting the needs of enterprise customers.
HP had a great idea, focusing on the enterprise, at a time before the enterprise was a huge focus for any of the major cloud providers. While AWS was growing by focusing on small-to-medium-sized businesses and individuals, HP was focusing on the enterprise.
But by the time HP was able to execute on their enterprise strategy, the big three public cloud providers — Amazon, Microsoft, Google — all had switched their focus to the enterprise customer as well. This left HP out in the cold.
HP is now virtually entirely out of the public cloud space. In fact, in 2015, HP announced it was shutting down its Helion public cloud offering and instead started focusing only on private and hybrid clouds. HP ceded the public cloud market to Amazon, Microsoft, and Google...
HP has been able to gain some traction in this space...the private and hybrid cloud space... The reason? Their focus on OpenStack, a technology that has not shown much traction in the public cloud markets but has gained significant traction in the private and hybrid cloud markets.
Maybe HP found a groove that will work for them. But in this space, they’ll have to compete with the likes of IBM, which is also focusing on the hybrid cloud space...
How long will HP remain a player in the private and hybrid cloud niche? We’ll have to wait and see, but as far as the public cloud is concerned, HP is nowhere to be found. That’s a shame. I would like to see both HP and IBM be larger players in the public cloud. We need more and better competition for AWS in the public cloud. But it looks like that competition isn’t going to come from HP.