SRE Engagement Playbook — Month 4

Feb 8, 2021

This is a continuation of my SRE Engagement series of articles. Check out Month 1, Month 2, and Month 3.

During your first month with the team, you learned how to work with a new system and came to understand the bottlenecks and the team’s areas of concern. In your second month, you implemented observability tooling and took the initiative to solve some common issues exposed by reviewing that observability data. In your third month, you used tracking systems and alerts to remediate failures and repair the system. Now we’re going to predict failure in systems, create systems that self-heal, and design distributed systems that seem like they never go down.

Fault Tolerance

A complex system is made up of multiple individual components working together, and it depends on the success of those components. The ability of a system to keep working after the failure of one or more components is what we call fault tolerance. Here are some ways you can implement fault tolerance in a system. To keep the discussion focused, we’ll assume everything runs in a single region.

Load Balancers

One of the easiest ways to create a fault tolerant system is to build it so that it can be placed behind a Load Balancer. An example would be a public API or website that accepts requests and produces responses. If a single copy of that system shuts down or fails, you can no longer send responses, so it has a single point of failure. If you run multiple copies of your application behind a Load Balancer, you can still respond to requests even if one of them fails.
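As a rough illustration of the idea above, here is a minimal sketch of a round-robin balancer that skips copies marked as failed. The class name, the backend names, and the manual `mark_down` call are assumptions for the example; a real Load Balancer would detect failures through its own health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotate requests across backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._rotation = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call so we never loop forever.
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_down("app-2")  # simulate one copy failing
served_by = [balancer.next_backend() for _ in range(4)]
print(served_by)
```

With one copy down, requests keep flowing to the remaining two, which is the whole point of removing the single point of failure.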

A common fault tolerant design uses two Load Balancers that communicate in a way that makes them Highly Available, meaning they can handle the outage of a single Load Balancer. You can configure a pair of Load Balancers like this using Active/Active or Active/Passive communication between the Load Balancers and a Domain Name System (DNS) server. The tradeoff between an Active/Passive system and an Active/Active system is the complexity of the DNS configuration. In an Active/Passive system, both Load Balancers are assigned an IP address. The Active Load Balancer is the one whose IP address has requests directed to it, while the Passive Load Balancer has an IP address and is configured to receive requests but isn’t sent any.

If the Active Load Balancer fails, the Passive Load Balancer becomes the Active Load Balancer, with the DNS server directing requests to it. This sort of system is configured using Address Records (A Records) and Canonical Name Records (CNAME Records) within a DNS server, which allow multiple servers to share a common name, as shown above.
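The Active/Passive decision described above can be sketched in a few lines. This is not how any particular DNS server implements failover; it only models the choice of where to direct requests. The names `lb-active` and `lb-passive`, the documentation-range IP addresses, and the `health` dictionary are all assumptions for the example.

```python
def resolve_active(load_balancers, is_healthy):
    """Return the address of the first healthy load balancer in priority order."""
    for name, address in load_balancers:
        if is_healthy(name):
            return address
    raise RuntimeError("no healthy load balancer")


# Priority order: the active load balancer first, then the passive one.
load_balancers = [("lb-active", "203.0.113.10"), ("lb-passive", "203.0.113.11")]

health = {"lb-active": True, "lb-passive": True}
before = resolve_active(load_balancers, health.get)

health["lb-active"] = False  # the active load balancer fails
after = resolve_active(load_balancers, health.get)

print(before, after)
```

Requests go to the first address until the health check fails, and then move to the second without the clients needing to know anything changed.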

High Availability (HA) Applications

Within applications, the things to focus on for fault tolerance are your dependencies and external data handling. One of the easiest ways to identify the critical components of a system is to look at how your application stores data and where it runs computation.

Looking at the example above, we can see there are four main components to every application. The main thing we want to ensure is that our input/output (I/O) actions keep functioning. A great way to do this is to add different actions that handle failure in I/O resources.

Retry — On submission of data, keep track of the status response you receive. If there is a failure, submit the data again to handle an unexpected loss of connection to the dependency. An issue with this pattern can occur when there is a flood of retry events, which we’ll fix with the next action.

Back off — Another form of protection for input and output is making requests wait for a period of time before they are sent again. This keeps a system from overloading its external dependency by slowing down the number of requests it submits.
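Retry and Back off are usually combined. Below is a minimal sketch, assuming the dependency signals a lost connection by raising `ConnectionError`. The function name, the default delays, and the `flaky_submit` stand-in are hypothetical, and the random jitter is an addition not mentioned above that helps keep many clients from retrying at the same instant.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call operation(); on ConnectionError wait, then retry, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads retries out so clients do not all retry at once.
            sleep(delay + random.uniform(0, delay))


attempts = {"count": 0}


def flaky_submit():
    """Hypothetical submission that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("lost connection to dependency")
    return "accepted"


waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_submit, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

Passing `sleep` as a parameter is only there so the example runs instantly; in a service you would leave it as `time.sleep`.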

Circuit Breaker — If you have a critical I/O resource that you cannot run without, another option is to send a signal to other services that you cannot reach it. This allows other services to stop sending requests to you, instead of you sending failure responses back to them.
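One common way to build the Circuit Breaker action is a small wrapper that counts failures and then rejects calls immediately for a cooling-off period. The sketch below assumes that shape; the class names, the threshold, and the timeout are illustrative, and the `CircuitOpenError` it raises stands in for the signal to callers that the resource is unreachable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def broken_dependency():
    raise ConnectionError("cannot reach critical resource")


now = [0.0]  # fake clock so the example does not need to wait
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(broken_dependency)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")
print(outcomes)
```

After two real failures the third call is rejected without touching the dependency, which gives the failing resource room to recover.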

There are many ways to deal with failures in different types of applications, but in my experience these are the most common.

High Availability (HA) Databases

The database is one of the most critical parts of a system, and creating a fault tolerant database layer can mean the difference between immediate recovery and hours or even days of downtime.

Left (Cache ahead system), Right (API ahead system)

Cache Ahead — Sometimes the loss of a database is due to network instability or a thread issue. In these situations, having a cached set of data outside of the database (ahead of calls to the database) will offset connection issues to your database and give you access to a high-speed, localized data source. The cache can be populated during transactions or on a schedule, so that read data is stored for use when the database is lost.

API Ahead — Using an API layer above the database, you can create complex logic ahead of the calls to your database. This includes the ability to add an internal caching layer for your database.
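Here is a minimal sketch of the Cache Ahead behaviour, which an API Ahead layer could also host as its internal cache. It assumes the database client raises `ConnectionError` when the database is unreachable; the class name, the in-memory dictionary cache, and the `query_database` stand-in are hypothetical.

```python
class CacheAheadReader:
    """Read from the database, keep a copy of each result, and serve the copy on outage."""

    def __init__(self, query_database):
        self.query_database = query_database
        self.cache = {}

    def get(self, key):
        try:
            value = self.query_database(key)
        except ConnectionError:
            if key in self.cache:
                return self.cache[key]  # possibly stale, but available
            raise
        self.cache[key] = value  # refresh the copy during the transaction
        return value


database = {"user:1": "Ada"}
state = {"up": True}


def query_database(key):
    """Stand-in for a real database call."""
    if not state["up"]:
        raise ConnectionError("database unreachable")
    return database[key]


reader = CacheAheadReader(query_database)
first = reader.get("user:1")   # served by the database, copied to the cache
state["up"] = False            # simulate network instability
second = reader.get("user:1")  # served from the cache
print(first, second)
```

The tradeoff is staleness: while the database is unreachable, readers see the last value that was cached, not necessarily the latest one.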

Leader Elected — If you are using a data storage platform that supports leader election, I would recommend using a five node quorum, which allows for the loss of two nodes.
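The five node recommendation follows from majority arithmetic, assuming the platform needs a strict majority of nodes to elect a leader. The function names below are my own.

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a majority still remains."""
    return nodes - quorum_size(nodes)


for nodes in (3, 4, 5):
    print(nodes, quorum_size(nodes), tolerated_failures(nodes))
```

Five nodes need three for a majority, so two can be lost. Four nodes tolerate only one loss, the same as three, which is why odd cluster sizes are the usual choice.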

Primary-Secondary — Finally, a common pattern in database reliability is the use of a primary database and a secondary backup database. The secondary database is written to by the primary database, and when the primary database fails, the system switches to the secondary database. This lets the system survive the failure of your primary database. Many systems have more than one secondary database, but for this example we’ve only talked about a single secondary database.
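A toy model of the Primary-Secondary pattern, assuming synchronous replication and a single secondary. The class and its in-memory dictionaries are hypothetical; real databases handle replication and promotion with their own tooling.

```python
class ReplicatedStore:
    """Writes go to the primary, which copies them to the secondary; reads fail over."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.primary_up = True

    def write(self, key, value):
        if not self.primary_up:
            raise ConnectionError("primary unavailable")
        self.primary[key] = value
        self.secondary[key] = value  # synchronous replication, for simplicity

    def fail_over(self):
        # Promote the secondary to primary and start a new, empty secondary.
        self.primary, self.secondary = self.secondary, {}
        self.primary_up = True

    def read(self, key):
        if not self.primary_up:
            self.fail_over()
        return self.primary[key]


store = ReplicatedStore()
store.write("order:7", "paid")
store.primary_up = False           # simulate losing the primary
recovered = store.read("order:7")  # triggers the fail over
print(recovered)
```

Because every write reached the secondary before the primary failed, the promoted copy can answer the read with no data loss. With asynchronous replication, the most recent writes could be missing.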

Cache Ahead and API Ahead are both built with the expectation that network instability is a common issue in your system, while Leader Elected and Primary-Secondary are built to handle the loss of a server due to unforeseen errors.

Recovery

No matter what type of system you build, you will see failures. The goal isn’t to eliminate failure scenarios; it’s to decrease the number of customers affected during a failure scenario. Any complex system will always be in some kind of error state, and there can be multiple failures with little to no impact on a customer. These systems “appear” up and running, with everything functioning as normal or slightly slower than expected. Once a system enters a failure state, the goal of the engineering team should be rapid recovery and mitigation. Here are some ways you can recover from and mitigate failure scenarios quickly.

Database

Database recovery boils down to when your last backup took place, how your backup is stored, and whether you have an automated process for recovery. We’ll go over where you should store a backup and how to transition from one database to another.

Backups Rule of 3 — The Rule of 3 is a common saying among data storage specialists, and its meaning can be summed up as: “Have 3 copies of your data: 2 in primary locations for quick access, stored on different media from each other, and 1 in an offsite backup.” Since many programmers are not data storage specialists, you can break this down as 3 backups: one on your database server, one on a block storage service, and one on a different cloud provider’s block storage.
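If you keep an inventory of your backups, the rule above can be checked automatically. This sketch assumes each backup is described by a small dictionary; the field names (`location`, `media`, `offsite`) and the example plan are made up for illustration.

```python
def check_backup_plan(copies):
    """Return a list of problems; an empty list means the plan satisfies the rule."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c["media"] for c in copies}) < 2:
        problems.append("all copies share one storage medium")
    if not any(c["offsite"] for c in copies):
        problems.append("no offsite copy")
    return problems


plan = [
    {"location": "database server", "media": "local disk", "offsite": False},
    {"location": "block storage service", "media": "block storage", "offsite": False},
    {"location": "second cloud provider", "media": "block storage", "offsite": True},
]
print(check_backup_plan(plan))
print(check_backup_plan(plan[:1]))
```

A check like this could run on a schedule, so a retired backup target is noticed before you need it.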

Traffic Routing — During database recovery, you don’t want to take down your current set of databases unless you really need to. Having a traffic routing plan set up for shifting your requests from one set of databases to another is critical. A common pattern is a blue/green deployment: create your new databases, using backups to restore them as close to the current state of the database as you can. Then, using a load balancer or DNS entry, redirect your requests to the new database set, recovering the most recent requests through transaction replay or manual remediation.

Transaction

Re-Play — Recovery of transactions (queries, events, requests) requires effective logging or an event hub of some kind where requests are recorded and queued up for services. The ability to “re-play” these events means that if an incident occurs at 9AM due to a deployment at 7AM, you can mark the transactions from 7AM to 9AM as queued and allow the services to run them again.
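The 7AM to 9AM example above might look like this in code, assuming each logged event carries a timestamp. The event structure, the `replay` function, and the handler are hypothetical; a real event hub would provide its own way to re-read a time range.

```python
from datetime import datetime


def replay(event_log, start, end, handler):
    """Re-run every logged event whose timestamp falls in [start, end), oldest first."""
    window = [e for e in event_log if start <= e["at"] < end]
    for event in sorted(window, key=lambda e: e["at"]):
        handler(event)
    return len(window)


event_log = [
    {"id": 1, "at": datetime(2021, 2, 8, 6, 30), "payload": "a"},
    {"id": 2, "at": datetime(2021, 2, 8, 7, 15), "payload": "b"},
    {"id": 3, "at": datetime(2021, 2, 8, 8, 45), "payload": "c"},
    {"id": 4, "at": datetime(2021, 2, 8, 9, 30), "payload": "d"},
]

replayed_ids = []
count = replay(
    event_log,
    start=datetime(2021, 2, 8, 7, 0),  # the deployment
    end=datetime(2021, 2, 8, 9, 0),    # the incident
    handler=lambda event: replayed_ids.append(event["id"]),
)
print(count, replayed_ids)
```

Replay is only safe when running an event twice has the same effect as running it once, so it is worth making handlers idempotent before relying on this.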

Store at State — Sometimes a system will lose a single component in a chain of tools, which causes requests to fail at a specific point in the chain of services. Storing the state of the request prior to that point allows you to fix that single component and continue requests from that point, instead of restarting from the beginning. An example would be four services that work together, where each service sends its data to the next service in the chain. Services 1 and 2 each take 15 minutes to run their computation, meaning they take 30 minutes in total. If there is a failure on Service 3, you’ll save that time by storing the state of the transactions after Service 2 instead of replaying them from Service 1.
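The four-service example can be sketched as a pipeline that saves each stage’s output as a checkpoint. The in-memory `checkpoints` dictionary, the `make_stage` helper, and the service names are assumptions for the example; in practice the state would live in durable storage.

```python
def run_pipeline(stages, request, checkpoints):
    """Run stages in order, saving each output so a rerun can skip completed stages."""
    data = request
    for name, stage in stages:
        if name in checkpoints:
            data = checkpoints[name]  # already done, reuse the stored state
            continue
        data = stage(data)
        checkpoints[name] = data
    return data


calls = []


def make_stage(name, fail=False):
    """Build a stand-in service that records when it runs and can be made to fail."""
    def stage(data):
        calls.append(name)
        if fail:
            raise RuntimeError(f"{name} is down")
        return data + [name]
    return stage


checkpoints = {}
broken = [
    ("service-1", make_stage("service-1")),
    ("service-2", make_stage("service-2")),
    ("service-3", make_stage("service-3", fail=True)),
    ("service-4", make_stage("service-4")),
]
try:
    run_pipeline(broken, [], checkpoints)
except RuntimeError:
    pass  # Service 3 failed; the state after Service 2 is stored

fixed = broken[:2] + [("service-3", make_stage("service-3"))] + broken[3:]
result = run_pipeline(fixed, [], checkpoints)
print(result, calls)
```

On the second run Services 1 and 2 are never called again, which is where the 30 minutes in the example would be saved.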

Application

Scale-Up — In the event of the loss of an application, regional scale-up can be used to handle the extra traffic in new regions. Having a plan in place that automatically watches for the loss of a region and for resource usage can be critical to recovering from the loss of applications.

Load Shedding — A hard down outage means the loss of an application for a period of time, with no access for customers. Bringing an application back up after an event like this can be dangerous, and teams need to be wary of the number of requests per second the application receives during startup. Applications that receive too many requests during startup can become overwhelmed and fail as they are starting. A common pattern to protect against this is to feed requests per second metrics to a load balancing service and ignore a set number of requests per second until the application has reached its normal request rate. If you have a way to weight these requests, it’s recommended to drop low priority requests first. This protects your application on startup while you gradually work through the large number of requests directed to your application once it has recovered.
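A minimal sketch of the Load Shedding pattern above, assuming a linear warm-up and a numeric priority on each request (higher means more important). The function names, the 60 second warm-up, and the floor of 10 requests per second are illustrative choices, not recommendations.

```python
def allowed_rps(seconds_since_start, normal_rps, warmup_seconds=60, floor_rps=10):
    """Ramp the accepted request rate linearly from floor_rps to normal_rps during warm-up."""
    if seconds_since_start >= warmup_seconds:
        return normal_rps
    progress = seconds_since_start / warmup_seconds
    return int(floor_rps + (normal_rps - floor_rps) * progress)


def shed(requests, limit):
    """Keep the `limit` highest-priority requests from one second of traffic; drop the rest."""
    ranked = sorted(requests, key=lambda r: r["priority"], reverse=True)
    return ranked[:limit], ranked[limit:]


incoming = [
    {"id": "health-report", "priority": 1},
    {"id": "checkout", "priority": 3},
    {"id": "search", "priority": 2},
]
accepted, dropped = shed(incoming, limit=2)

print(allowed_rps(0, 1000), allowed_rps(30, 1000), allowed_rps(60, 1000))
print([r["id"] for r in accepted], [r["id"] for r in dropped])
```

In practice `limit` would come from `allowed_rps` for the current second, so the application sees a rate that grows as it warms up while the lowest priority requests are the first to be dropped.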

These are all examples of situations you might experience, and hopefully they can help you identify a solution for some of the issues you’ll face during a recovery scenario.

Please stay tuned for the next two articles in the series, where we’ll be covering automation/pipelines and security. These will be the last two articles in the SRE Engagement Playbook series, which should cover the first 6 months of your time with a new team. If you’re interested in work like this or are looking for more information, please let me know in the comments below.