The 5 Pillars of Good Solution Architecture: Availability and Recoverability
By Sergio Barbosa (CIO – Global Kinetic)
To date, we have looked at the security and the performance and scalability aspects of good solution architecture. If you did not catch those two posts, be sure to read up on them before checking out this next architectural pillar, which focuses on Availability and Recoverability.
Availability vs. Recoverability
Firstly, what is the difference between these two? They seem similar?
Availability refers to a measure of how available a system is for its end-users and/or external integrated systems. Recoverability, however, refers to how quickly a system can recover from a catastrophic failure. Availability is usually measured as a percentage of time that a system is available, whereas recoverability is usually measured by the number of hours of data loss or downtime. The two are inter-related because the requirements for the one imply the requirements for the other, but they are not the same thing. From an architectural perspective, it is necessary to implement different measures and designs to address the availability and recoverability requirements of a system separately. These requirements are usually dictated by a Service Level Agreement (SLA) promised to end-users and/or external integrated systems, and they are the departure point in designing for availability and recoverability.
Let us take availability first by way of example. The table below shows the allowable downtime by availability. You can see the huge difference between an SLA requiring a system to be available 99% of the time vs. one that is required to be available 99.999% of the time. Measured over a year, the first allows the system to be unavailable for days (roughly 3.65 days), whereas with the second, if the system is down for more than about 5 minutes in total, you’re in big trouble.
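The arithmetic behind these figures is easy to reproduce. The sketch below is only an illustration: it assumes the SLA is measured over a 365-day year, and the function name `allowable_downtime` is made up for this example. Real SLAs often define their own measurement window (monthly, for instance), which changes the numbers.

```python
from datetime import timedelta

# Assumption: availability is measured over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowable_downtime(availability_percent: float) -> timedelta:
    """Return the downtime per year permitted by an availability percentage."""
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(minutes=MINUTES_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99, 99.999):
        print(f"{sla}% available -> up to {allowable_downtime(sla)} of downtime per year")
```

Each extra "nine" divides the allowance by ten, which is why the gap between 99% and 99.999% is a factor of a thousand.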
A system requiring 99.99% availability or higher will likely need to be self-diagnosing and self-healing to some degree. Services should be able to restart themselves automatically and scale out and/or up automatically if load increases to prevent the system from becoming unavailable. This requires specific design to allow the system to work with load balancers for example or to have the ability to restart automatically in the event of a failure.
There are some patterns here that can be utilized as well, such as the circuit breaker pattern. With this pattern, a service will automatically return a failed status to consumers after a certain number of failures (threshold) has been reached (much like an electrical circuit breaker). The service will do so over a pre-defined or calculated time period, after which the service will let a certain number of test calls through. On success of all test calls the service will then return to normal operation. This pattern can be particularly useful if the solution you are designing is heavily dependent on external third-party integrations where the availability of those dependent systems or applications is outside of your control. The management of dependent third-party applications and integrations is a whole topic on its own, and probably warrants a separate blog post… hmm.
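To make the pattern concrete, here is a minimal, single-threaded sketch of a circuit breaker. It is not taken from any particular library: the class name, the parameter names (`failure_threshold`, `reset_timeout`, `test_calls`) and the choice to count consecutive failures are all assumptions made for illustration. A production version would also need thread safety and some form of monitoring.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Illustrative circuit breaker with closed, open and half-open states."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, test_calls=2,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.test_calls = test_calls                # successes needed to close again
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.test_successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of calling the unhealthy dependency.
                raise CircuitOpenError("dependency marked unavailable")
            # The open period has elapsed: start letting test calls through.
            self.state = "half_open"
            self.test_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failure during the test phase, or too many in a row, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        if self.state == "half_open":
            self.test_successes += 1
            if self.test_successes >= self.test_calls:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # a success resets the consecutive-failure count
```

A caller would wrap each outbound request, for example `breaker.call(fetch_rates)` where `fetch_rates` is whatever function talks to the third party, and treat `CircuitOpenError` as a signal to fall back to a default or cached response.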
A 99.99% availability SLA will most likely also require the system to be deployed across multiple availability zones, and these will most likely need to be in different geographies to mitigate the risk of a data centre becoming unavailable. To support this, the solution must be able to work behind a DNS Traffic Manager, and the data that the system relies upon should be replicated across the availability zones, so that normal system operation can resume in the event of a failover from one data centre to another.
In addition to normal operation, one needs to consider the deployment of new versions of the software into production and the impact that these can have on the availability of the system. For example, if you need to incur some downtime for the deployment of a new version, you are eating into the allowable downtime that your SLA permits. You should be banking this time allowance for when you really need it, i.e., in the event of a disaster. To support zero downtime deployments, you will need to be able to route only some requests/traffic to the newly deployed version of the system or service and then, on successful validation of the new version, route the rest of the requests/traffic to the new version and retire the old version. You need to design versioning into your system, otherwise you will not be able to accomplish this.
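The routing step described above can be sketched as a weighted choice between two versions. This is a toy model rather than how any specific gateway or traffic manager works: `CanaryRouter`, the version labels `v1` and `v2`, and the `promote`/`rollback` methods are hypothetical names chosen for the example. In practice this logic usually lives in the gateway or load balancer configuration, not in application code.

```python
import random


class CanaryRouter:
    """Sends a configurable percentage of requests to a newly deployed version."""

    def __init__(self, stable="v1", canary="v2", canary_percent=0.0, rng=None):
        self.stable = stable
        self.canary = canary
        self.canary_percent = canary_percent  # 0 = all stable, 100 = all canary
        self.rng = rng or random.Random()

    def route(self) -> str:
        """Pick the version that should handle the next request."""
        if self.rng.random() * 100 < self.canary_percent:
            return self.canary
        return self.stable

    def promote(self) -> None:
        """The new version has been validated: send it all traffic."""
        self.canary_percent = 100.0

    def rollback(self) -> None:
        """The new version failed validation: send all traffic back to stable."""
        self.canary_percent = 0.0


if __name__ == "__main__":
    router = CanaryRouter(canary_percent=10.0)
    sample = [router.route() for _ in range(1000)]
    print("share sent to canary:", sample.count("v2") / len(sample))
    router.promote()
    print("after promotion:", router.route())
```

Because both versions serve traffic at the same time, they must be able to coexist, which is where versioned APIs and backward-compatible data changes come in.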
Building a highly available architecture
The diagram below is an example of a high availability system architecture. The solution is deployed to multiple regions behind a Traffic Manager. In each availability zone, the services are deployed behind a gateway that can direct requests to different services according to certain rules. Load Balancers have been implemented to manage additional load on the system. What does this picture tell you about the system itself?
Well firstly, it looks as though the system is data-heavy, and that there is potentially no real separation in the design between data reads and writes. Additionally, it looks as though the system serves up a lot of content that is independent of processing. Also, the processing that the system does is probably confined to a small number of features and is potentially single purpose, because there does not appear to be much separation of concerns between application instances. What are the single points of failure? Well, they are clearly the Gateway, Load Balancers, and the Traffic Manager. So, to support high availability, the application should be able to scale out with more VM instances in the event of heavy load and have a robust Disaster Recovery (DR) plan in place for the 3 single points of failure listed.
Disaster Recovery strategies
Much like availability, recoverability is driven by the SLAs that are promised by the system to end-users and/or external integrated systems. There are two recovery objectives to take into consideration here:
- Recovery Point Objective (RPO) = Maximum acceptable data loss in hours
- Recovery Time Objective (RTO) = Maximum acceptable downtime in hours
The diagram below shows the difference between these two. The RPO and RTO need to be defined for each service and workload in the system.
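One way to see the two objectives side by side is to measure an incident against them. The sketch below is illustrative only: the names `RecoveryObjectives` and `evaluate_incident` and the timestamps are invented for the example. Data loss is taken as the time between the last recovery point (a backup or replica sync) and the failure, and downtime as the time between the failure and the restoration of service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable downtime


def evaluate_incident(objectives: RecoveryObjectives,
                      last_recovery_point: datetime,
                      failure_at: datetime,
                      restored_at: datetime) -> dict:
    """Compare the actual data loss and downtime of an incident with the objectives."""
    data_loss = failure_at - last_recovery_point  # work done since the last backup is lost
    downtime = restored_at - failure_at           # time the service was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    # Hypothetical incident: backup at 02:00, failure at 05:30, service back at 07:00.
    objectives = RecoveryObjectives(rpo=timedelta(hours=4), rto=timedelta(hours=1))
    report = evaluate_incident(
        objectives,
        last_recovery_point=datetime(2024, 1, 1, 2, 0),
        failure_at=datetime(2024, 1, 1, 5, 30),
        restored_at=datetime(2024, 1, 1, 7, 0),
    )
    print(report)
```

In this made-up incident the RPO is met (3.5 hours of data lost against a 4 hour objective) but the RTO is missed (1.5 hours of downtime against a 1 hour objective), which shows why the two need to be designed for separately.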
Once the RPO and RTO for the system are determined, a DR plan needs to be put into place with detailed recovery steps. I am not going to go through the details here about best practices for DR strategies and planning, but it goes without saying that the only useful DR plan is one that has been tested and gets tested end to end at least twice a year. From an architecture perspective though, there are a couple of key areas to design for in terms of recoverability.
Firstly, and most importantly, data. To lose as little data as possible during a disaster, you need to have regular backup processes in place, have less reliance on cached data as an interim persistence mechanism and/or be able to rebuild your data from event sources. Each one of these tactics has its advantages and disadvantages, and an ideal scenario is to take a bit from each into your architecture design. It is not always practical to do regular backups, so replicated data stores at different locations can offer some relief as an alternative method of protecting your data.
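Rebuilding data from event sources means that the current state can be recomputed by replaying a durable log of events, optionally starting from the last backup. The sketch below shows the idea with a toy account ledger. The event shape `(kind, account, amount)` and the function names are assumptions for illustration, not a prescribed design.

```python
def apply_event(state: dict, event: tuple) -> dict:
    """Return a new state with one (kind, account, amount) event applied."""
    kind, account, amount = event
    new_state = dict(state)
    if kind == "deposit":
        new_state[account] = new_state.get(account, 0) + amount
    elif kind == "withdraw":
        new_state[account] = new_state.get(account, 0) - amount
    else:
        raise ValueError(f"unknown event kind: {kind}")
    return new_state


def rebuild(events, snapshot=None) -> dict:
    """Replay events in order on top of an optional snapshot (e.g. the last backup)."""
    state = dict(snapshot or {})
    for event in events:
        state = apply_event(state, event)
    return state


if __name__ == "__main__":
    # Hypothetical recovery: restore last night's backup, then replay today's events.
    backup = {"alice": 100}
    events_since_backup = [("deposit", "alice", 50), ("withdraw", "alice", 30),
                           ("deposit", "bob", 20)]
    print(rebuild(events_since_backup, snapshot=backup))
```

The trade-off is that the event log itself becomes critical data and has to be stored durably and replicated, so this tactic complements backups rather than replacing them.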
Secondly, there is downtime, which is typically dictated by the system’s single points of failure and the DR plan’s ability to restore any one of these to full operation. Obviously, the more single points of failure you have, the higher the likelihood of a disaster. A good solution architecture will keep single points of failure to a minimum. Again, there is no silver bullet, but it is key to consider that the solution’s recoverability must be driven by the SLA requirements and the RPO and RTO of the system. There is no point in over-engineering a system when there is no requirement to do so, but where there is a requirement it is imperative that it is met. This plays into the cost pillar, which we cover in the last post in the series.
In the next and penultimate post of this series, we look at designing for the efficiency of operations. A system architected without considering its operational maintenance and enhancement can be crippling to run, and this is often one of the most overlooked aspects of good solution architecture.