So, now that we’ve defined service tiers, how do you use them?
Service tiers have two distinct uses: determining the required responsiveness to problems, and setting requirements for dependencies between individual services.
Let’s first talk about responsiveness. The service tier of a service can be used to determine how quickly a problem with that service should be addressed. Of course, the higher the severity of a problem, the faster it should be addressed. But, in general, ***the lower the service tier number, the more important the problem likely is…and therefore the faster it should be addressed***. A low-to-medium severity problem in a Tier-1 service is likely more important and impactful than a high severity problem in a Tier-4 service.
Given this, you can use the service tier, in conjunction with the severity of the problem, to determine how fast your team should respond to a problem. Should we be alerted immediately, 24 hours a day, 7 days a week, and fix the problem no matter what time of the day or night? Or is this a problem that can wait until the next morning? Or is it a problem we can add to a queue and fix when we get to it in our overall list of priorities? Or should we simply add it to our backlog for future consideration? Service tiers, in conjunction with problem severity, can give you the right procedural guidelines for how to handle a problem. You can use them to set SLAs on service responsiveness. You can even use them to set availability SLAs for your services.
For example, you could create a policy that says that all Tier 1 services need to have an availability of 99.95%. This might dictate that all high severity problems must be resolved within 2 hours of identification, meaning that you must have an on-call support team available 24 hours a day, 7 days a week, and that support team must have enough knowledge and experience to fix any serious problems that arise. This would likely mean the owning development team would need to staff the support rotation for this service.
Meanwhile, a Tier 3 service might have an availability SLA of only 99.8%. A Tier 4 service might not even have an availability SLA. This would mean that all but the most serious problems could probably wait until the next business day to be fixed, meaning an on-call support role may not be needed, or may not need to be as formal or have tight mean-time-to-repair goals.
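To make those availability numbers concrete, here is a minimal sketch in Python that converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; the percentages are the example targets mentioned above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in hours, implied by an availability target."""
    return period_hours * (1 - availability_percent / 100)


# Example targets from the discussion above.
for tier, target in [("Tier 1", 99.95), ("Tier 3", 99.8)]:
    hours = allowed_downtime_hours(target)
    print(f"{tier}: {target}% allows about {hours:.1f} hours of downtime per year")
```

Under these assumptions, a 99.95% target leaves a little over four hours of downtime per year, so a two-hour resolution window allows only about two serious incidents annually. A 99.8% target leaves roughly four times as much room.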
Service tiers help set policy on responsiveness requirements for your services, which can then dictate many requirements for your other policies and procedures.
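One way to picture such a policy is as a small lookup that combines tier and severity into a response. The sketch below is purely illustrative: the scoring scheme, the thresholds, and the names `Response` and `response_for` are assumptions, not a standard. The only property it is designed to preserve is the one described earlier, that a medium severity problem in a Tier 1 service outranks a high severity problem in a Tier 4 service.

```python
from enum import Enum


class Response(Enum):
    PAGE_NOW = "page the on-call engineer immediately, 24x7"
    NEXT_BUSINESS_DAY = "fix on the next business day"
    QUEUE = "add to the team's work queue"
    BACKLOG = "add to the backlog for future consideration"


# Hypothetical weights: a lower score means a more urgent problem.
SEVERITY_POINTS = {"high": 0, "medium": 1, "low": 2}


def response_for(tier: int, severity: str) -> Response:
    """Combine service tier (1-4) and problem severity into a response."""
    if tier not in (1, 2, 3, 4):
        raise ValueError(f"unknown tier: {tier}")
    score = (tier - 1) + SEVERITY_POINTS[severity]
    if score <= 1:
        return Response.PAGE_NOW
    if score == 2:
        return Response.NEXT_BUSINESS_DAY
    if score == 3:
        return Response.QUEUE
    return Response.BACKLOG


print(response_for(1, "medium").value)  # Tier 1, medium severity
print(response_for(4, "high").value)    # Tier 4, high severity
```

Your own thresholds will differ. The point is that the policy is written down once and applied consistently, rather than decided in the middle of an incident.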
Now, let’s talk about how service tiers can help with inter-service dependencies. Services at different service tier levels have different responsiveness requirements, and this impacts your dependency map between services and the assumptions you can make about your service dependencies.
For example, if a Tier-4 service, a low priority service, makes a call to a Tier-1 service, a high priority service, then it probably is safe for the Tier-4 service to assume that the Tier-1 service will always respond and will always be available. If for some reason the Tier-1 service does not respond, it would typically be acceptable for the Tier-4 service to simply fail itself. After all, if a Tier-1 service for your application is down, significant efforts will be immediately put into place to try to resolve that service problem. The fact that a Tier-4 service is also down will not be of any significant consequence.
Think of the case where your web application is down because users cannot log in…this would be a Tier 1 service problem. In such a situation, how concerning will it be that the marketing email service is down, delaying the delivery of your day’s marketing emails? The marketing email service is a Tier-4 service problem, so it is dwarfed by the Tier 1 login problem.
But the reverse is not true. If a Tier-1 service depends on a Tier-4 service, then that Tier-1 service ***must have*** developed contingency plans and failover recovery plans for when that Tier-4 service might be down. After all, you don’t want a Tier-1 service to fail simply because a much lower priority Tier-4 service is not functioning.
As an example, let’s say you have a Tier-1 web application running. The web application displays the current customer’s avatar in the upper right-hand corner whenever they are logged in. It does this by calling the customer avatar service, which is likely a Tier-3 service. After all, if you cannot display the avatar, it really has no significant impact on the customer’s experience.
So, what happens to the Tier-1 web application if it tries to get the customer’s avatar from the Tier-3 avatar service, and the Tier-3 service is down? In this case, it would *not*, under any circumstances, be acceptable for the Tier-1 application to fail simply because the avatar service was failing. Instead, the Tier-1 service, when it calls the avatar service, should have safeguards in place to handle service failures. It should have contingency plans for what to do if the Tier-3 service call fails.
In this case, the web application can simply not show the avatar if it is not available. But it would not be acceptable for the entire application to fail simply because it couldn’t display the avatar!
This may sound like an unrealistic example, but I actually saw this happen once, with a company that was not paying attention to service tiers. They had a SaaS application that went down and was completely unavailable to their customers, simply because they could not load the customer’s avatar due to a problem in the external avatar service. It was an embarrassing problem for them to admit.
Therefore, it is not acceptable for a Tier 1 service to fail simply because a Tier 3 service it depends on is unavailable. Instead, the Tier 1 service should be built in such a way that it can work around and deal with failures of the Tier 3 service.
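The avatar case can be sketched in a few lines of Python. Everything here is hypothetical: `fetch_avatar_url` stands in for whatever client calls the avatar service (and simply simulates an outage), and `render_header` stands in for the part of the web application that draws the page header. The idea is only that the failure of the lower tier call is contained at the call site.

```python
from typing import Callable, Optional


class AvatarServiceError(Exception):
    """Raised by the hypothetical avatar client when the service fails."""


def fetch_avatar_url(customer_id: str) -> str:
    """Stand-in for the real Tier-3 call; here it simulates an outage."""
    raise AvatarServiceError("avatar service unavailable")


def avatar_or_none(customer_id: str,
                   fetch: Callable[[str], str] = fetch_avatar_url) -> Optional[str]:
    """Return the avatar URL, or None if the avatar service fails."""
    try:
        return fetch(customer_id)
    except Exception:
        # Any failure of the lower tier dependency stops here, so it
        # cannot take the Tier-1 application down with it.
        return None


def render_header(customer_name: str, customer_id: str,
                  fetch: Callable[[str], str] = fetch_avatar_url) -> str:
    """Render the page header, leaving the avatar out if it is unavailable."""
    avatar = avatar_or_none(customer_id, fetch)
    if avatar is None:
        return f"Welcome, {customer_name}"
    return f"[{avatar}] Welcome, {customer_name}"


print(render_header("Pat", "c-123"))  # the header still renders during the outage
```

A real implementation would also want a short timeout on the call, since a slow dependency can hurt a Tier 1 service just as much as a dead one.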
So, we can see that service tiers can be used to determine the criticality of inter-service dependencies. If a lower priority service, such as a Tier 4 service, depends on a higher priority service, such as a Tier 1 service, then the Tier 4 service can generally ignore failures of the Tier 1 service. But the reverse is not true. If a higher priority service, such as a Tier 1 or 2 service, depends on a lower priority service, such as a Tier 3 or 4 service, then the Tier 1 or 2 service must be written to handle and manage failures of the lower priority service it depends on.
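This rule is simple enough to check mechanically against a dependency map. The sketch below assumes a hypothetical catalog in which each service records its tier, the services it calls, and which of those calls it has a fallback for. The field names, the service names, and the `risky_dependencies` function are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Service:
    name: str
    tier: int  # 1 is the highest priority, 4 the lowest
    depends_on: List[str] = field(default_factory=list)
    # Dependencies whose failure this service is known to handle.
    handles_failure_of: Set[str] = field(default_factory=set)


def risky_dependencies(services: List[Service]) -> List[Tuple[str, str]]:
    """Find calls from a higher priority service to a lower priority one
    that have no recorded fallback."""
    by_name = {s.name: s for s in services}
    findings = []
    for svc in services:
        for dep_name in svc.depends_on:
            dep = by_name[dep_name]
            if dep.tier > svc.tier and dep_name not in svc.handles_failure_of:
                findings.append((svc.name, dep_name))
    return findings


catalog = [
    Service("login", tier=1),
    Service("avatar", tier=3),
    Service("web-app", tier=1, depends_on=["login", "avatar"]),
    Service("marketing-email", tier=4, depends_on=["login"]),
]
print(risky_dependencies(catalog))  # flags web-app's call to avatar
```

The Tier 4 marketing email service calling the Tier 1 login service is not flagged, because it is allowed to fail when login is down. The Tier 1 web application calling the Tier 3 avatar service is flagged until a fallback is recorded for it.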
This is the power of service tiers.
Tech Tapas - First AWS Service
What was the first cloud service that Amazon put out as part of Amazon Web Services? The first cloud service was the Simple Queue Service. The Simple Queue Service, or SQS, is still available today. It provides a message queuing system for delivery of messages between loosely connected application components. SQS was launched publicly in November of 2004. AWS was relaunched in March of 2006 as a combination of three initial services: SQS, the S3 object storage service, and the EC2 virtual server service. These were all very early implementations. The EC2 instances had no persistent storage attached, so they could not be used to build persistent storage systems, such as relational databases. The services were also all exposed to the public internet, so there was no security available for any of the services. The EC2 instances were openly vulnerable on the public internet.
EC2 didn’t get persistent attached storage until EBS was launched in 2008. It didn’t get private network security until VPC launched in September of 2009. It was after those two capabilities were added that EC2 became a viable option for application computation, and cloud computing took off with a vengeance.
Tech Tapas - Space Shuttle Redundancy
One requirement of modern digital applications is that they must maintain a high degree of resiliency...resiliency to failure. The ability to stay operational even in adverse conditions, and the ability to recover easily and quickly when problems do occur, in a manner that reduces or eliminates the impact on the end user, is critical to modern digital applications.
There is an older, yet great, example of building a high level of resiliency into a complex software application. That is the United States Space Shuttle program, which ran from 1981 to 2011.
The space shuttle program had its problems, disastrous problems. But there was one aspect of the program that worked incredibly well...its computers and software systems.
The computer systems aboard the space shuttle were state of the art when they were built. They had to be. A failure of the software on board the space shuttle would be disastrous. It was critical that the software not fail.
To that end, a sophisticated system to maintain availability and reliability was employed. Let’s take the main computer system. There were four identical, redundant copies of the main computer aboard the shuttle. They consisted of identical hardware, had identical inputs and outputs, and ran an identical copy of the software. Every computation the computer had to make was made simultaneously but independently by all four computers. The results of the four computations were then compared, and the results were expected to be the same from all four computers. If they were, all was well.
If, however, the results differed among the computers, then the four computers voted on which result was correct. The winning result was used, and the computers that generated the losing results were disconnected and turned off for the duration of the voyage. The losers were disabled. This is what I call the ultimate in democratic voting!
The space shuttle could operate with as few as two computers, although the mission might be changed or shortened. As long as two of the four computers were operating, the shuttle could be operated safely.
What happened if the computers voted and the result was a tie? Well, then a fifth computer system was consulted. This system was a very different computer...a much simpler one. It had software for only the most critical calculations, and that software was written, and the hardware was built, by teams completely independent of the ones that built the main computers. That way, a problem that caused the main computers to fail would not likely impact this fifth computer. If the main computers voted, disagreed, and tied, then the fifth computer would become the tiebreaker.
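The voting scheme, as described here, can be sketched in a few lines of Python. This is a simplified model of the description above, not a reconstruction of the actual flight software: the function name, the computer labels, and the way the fifth computer’s answer is passed in are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Set, Tuple


def vote(results: Dict[str, Hashable],
         backup_result: Optional[Hashable] = None) -> Tuple[Hashable, Set[str]]:
    """Pick the majority result among the active computers.

    `results` maps each active computer's label to the value it computed.
    Returns the winning value and the set of computers that disagreed
    with it, which would then be disabled.
    """
    counts = Counter(results.values())
    top_count = max(counts.values())
    leaders = [value for value, count in counts.items() if count == top_count]

    if len(leaders) == 1:
        winner = leaders[0]
    elif backup_result in leaders:
        # Tie: the independently built backup computer breaks it.
        winner = backup_result
    else:
        raise RuntimeError("tie could not be resolved by the backup computer")

    losers = {label for label, value in results.items() if value != winner}
    return winner, losers


# Three computers agree and one disagrees: the odd one out is disabled.
print(vote({"A": 10, "B": 10, "C": 10, "D": 7}))

# A two-against-two tie is settled by the backup computer's answer.
print(vote({"A": 10, "B": 10, "C": 7, "D": 7}, backup_result=7))
```

The same pattern shows up in modern systems as quorum or majority voting among replicas, with an independent component to settle ties.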
During the 30-year operation of the space shuttle, there was never a case where a serious, life-threatening problem occurred as a result of a software problem, even though the software was the most complex software ever built for a space program at the time.
It was a sophisticated system for high availability, especially for the 1970s, when this system was first built. But sometimes it’s useful to look at older techniques, such as this, when deciding how to build and operate modern digital applications.
We will be talking about the space shuttle software system in more detail during a main story in a later episode.