After your first month with the team, you have learned how to work with a new system and you understand the team’s bottlenecks and areas of concern. You’ve implemented observability tooling and taken the initiative to solve some of the common issues that a review of that observability data exposes. Now you need to make sure things are reliably up and running as intended for consumers of the service, through alerts and on-call rotations.
One of the first things you’ll do after getting comfortable on a team is identify common issues from the metrics or failure data collected about the application. Use this data to create a system of monitors that accurately catches issues and notifies the correct people with the appropriate severity. A very common and useful monitor is a health check (sometimes called a synthetic check) for the service(s) you are running. Define something that will always be true if the service is up and running. This could be an endpoint on the API service that returns 200 OK when you call it. It could also be a check that logs are being generated by the service in a certain location and that they don’t include any “Error” or “Fatal” lines. It could even be a request for data, as long as you aren’t requesting extremely large amounts of data every time. Run these health checks frequently, preferably at an interval of 10–30 seconds, so you can tell when the service went down to within one check interval.
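As a rough illustration of the two checks described above, here is a minimal Python sketch using only the standard library. The function names (`http_probe`, `log_probe`, `watch`) and the idea of reporting only the up-to-down transition are assumptions made for this example, not a prescribed design.

```python
import http.client
import time
import urllib.request
from typing import Callable, Iterable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Healthy only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses, so refused
        # connections, timeouts, and 4xx/5xx answers all land here.
        return False


def log_probe(lines: Iterable[str]) -> bool:
    """Healthy only if logs exist and none mention Error or Fatal."""
    seen_any = False
    for line in lines:
        seen_any = True
        if "Error" in line or "Fatal" in line:
            return False
    return seen_any


def watch(probe: Callable[[], bool], interval_seconds: float, checks: int,
          on_down: Callable[[float], None],
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.time) -> None:
    """Run the probe `checks` times and report each up-to-down transition."""
    was_up = True
    for _ in range(checks):
        is_up = probe()
        if was_up and not is_up:
            on_down(clock())  # timestamp of the first failed check
        was_up = is_up
        sleep(interval_seconds)
```

A call such as `watch(lambda: http_probe("http://localhost:8080/health"), 15, 1000, print)` would poll a hypothetical health endpoint every 15 seconds; the URL is a placeholder for whatever your service exposes.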
There are trade-offs to health check alerts like this. Unlike checking the rate of disk usage or the memory usage of an application, a health check won’t tell you ahead of time that the service is going to go down; you’ll be dependent on the health check for all your information. You’ll also have to ensure that whatever runs these health checks is itself working, which means you’ll want a second check that ensures the first check is running.
This brings me to my next point on alerting: you want a system of automatic responses to alerts, and fail-over systems that respond to these alerts without you. Having a single system in charge of system health is a recipe for disaster, because there is a chance that even that system will fail. Build your system with the expectation that anything can fail, and will fail given time. A common pattern is having a monitor for your monitor. If you can afford multiple regions and load balancers, you can create a complex and nearly perfect fail-over pattern with multiple monitors. Sometimes you can’t afford all the bells and whistles, though, which is when you’ll want two services watching each other, each ensuring that its twin is still up and running. This ensures at least some degree of cross-checking on the status of services and creates the most basic, but effective, health-check system.
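The twin-monitor idea can be sketched in a few lines of Python. Everything here is hypothetical: the `Monitor` class, the heartbeat timestamps, and the `max_age_seconds` threshold are assumptions chosen to show the shape of the cross-check, and a real setup would exchange heartbeats over the network.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """A monitor that records its own heartbeat and can inspect its twin's."""
    name: str
    last_heartbeat: float = 0.0

    def beat(self, now: float) -> None:
        # Called every time this monitor finishes a round of health checks.
        self.last_heartbeat = now

    def twin_is_stale(self, twin: "Monitor", now: float,
                      max_age_seconds: float) -> bool:
        # The twin is considered down if it has not beaten recently enough.
        return now - twin.last_heartbeat > max_age_seconds


# Two monitors watching each other.
a, b = Monitor("monitor-a"), Monitor("monitor-b")
a.beat(100.0)
b.beat(100.0)
a.beat(140.0)  # monitor-b has stopped beating by this point
if a.twin_is_stale(b, now=140.0, max_age_seconds=30.0):
    print(f"{a.name}: {b.name} missed its heartbeat, alerting")
```

Neither monitor is a single point of failure for detection here: if either one stops, the other notices on its next pass.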
Another very common and consistent source of alerts that I’ve found relates to space utilization, especially disk space. An application without memory will simply be killed (and hopefully restarted); a server without disk space will stop writing data, fail to restart, or worse, shut down entirely without any chance of an alert being sent. Always set alerts for disk utilization, especially for critical data services like a database, cache, or file storage system. Once you’ve set these alerts, make sure you also have a process for measuring the rate of change in disk usage. If a system steadily uses 10 GB/day, you can normally handle manual fixes. If the application’s usage swings anywhere between 1 GB and 10 GB, you will sooner or later see a very aggressive rate of disk usage that you couldn’t predict, and an alert based on rate of change will save you in this situation.
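A rate-of-change alert can be sketched as follows, assuming you already collect periodic `(timestamp, used_bytes)` samples (for instance from `shutil.disk_usage`). The function names and the `min_days_left` threshold are illustrative assumptions, and the growth estimate is a simple first-to-last slope rather than anything statistically robust.

```python
from typing import List, Optional, Tuple

GB = 10**9
SECONDS_PER_DAY = 86_400

Sample = Tuple[float, int]  # (epoch seconds, bytes used)


def growth_per_day(samples: List[Sample]) -> float:
    """Average growth in bytes/day between the first and last sample."""
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    elapsed_days = (t1 - t0) / SECONDS_PER_DAY
    if elapsed_days <= 0:
        raise ValueError("need samples spanning a positive time range")
    return (used1 - used0) / elapsed_days


def days_until_full(total_bytes: int, samples: List[Sample]) -> Optional[float]:
    """Projected days until the disk fills, or None if usage is not growing."""
    rate = growth_per_day(samples)
    if rate <= 0:
        return None
    return (total_bytes - samples[-1][1]) / rate


def should_alert(total_bytes: int, samples: List[Sample],
                 min_days_left: float) -> bool:
    """Alert when the projected time to full drops below the threshold."""
    remaining = days_until_full(total_bytes, samples)
    return remaining is not None and remaining < min_days_left


# A 100 GB disk growing by 10 GB/day with 20 GB already used.
samples = [(0.0, 10 * GB), (float(SECONDS_PER_DAY), 20 * GB)]
print(days_until_full(100 * GB, samples))
```

The point of projecting time-to-full, rather than alerting on a fixed percentage alone, is that a sudden jump in the growth rate triggers the alert while there is still room to react.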
My preferred option for disk space alerts is to set a limit based on a data retention proposal and, at that limit, run log rotation, data compression, and backups. Once those are complete, move the backups to large-scale tape storage, keep them retrievable for 2–3 times the retention period, and delete them for good after that (there are concerns with data deletion, so always check with your leadership before deleting data).
For example, say you have a database with transaction records and you want to keep 24 months of records for data retention. Each record is 480 bytes and you get an average of 200,000–400,000 records per month, so the data grows by 96 MB/month to 192 MB/month, which means that by 24 months you’ll have 2.3 GB to 4.6 GB to deal with.
So, based on the 2–3 times rule I described, you should compact, back up, and rotate out the logs at the 24-month mark, with a max backup size of 192 MB for each month; these backups should be retained for a max of 72 months (depending on compaction size). It also tells you that a 20 GB server can hold about 104 months’ worth of data (not including backups) on your database server at the 192 MB/month rate. So if you move data off your server to a cold backup every 24 months, and assuming you have 4 GB–8 GB of logging for the database application and server, you should never use more than 75% of your database disk. Set alerts on your server to confirm that disk usage dips every 24 months and that it never goes above 75% of the disk.
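The arithmetic in this example is easy to check with a short script. This is only a sketch of the calculation using the figures from the text, with decimal units (1 MB = 1,000,000 bytes) assumed; the variable and function names are made up for illustration.

```python
RECORD_BYTES = 480
RETENTION_MONTHS = 24
DISK_MB = 20_000          # the 20 GB server
LOGGING_MB_MAX = 8_000    # upper end of the 4 GB-8 GB logging estimate


def monthly_mb(records_per_month: int) -> float:
    """Data written per month, in MB, for a given record count."""
    return records_per_month * RECORD_BYTES / 1_000_000


low, high = monthly_mb(200_000), monthly_mb(400_000)
print(f"monthly growth: {low:.0f} MB to {high:.0f} MB")

retained_low = low * RETENTION_MONTHS
retained_high = high * RETENTION_MONTHS
print(f"after {RETENTION_MONTHS} months: {retained_low / 1000:.1f} GB "
      f"to {retained_high / 1000:.1f} GB")

# Whole months of worst-case data that fit on the disk (ignoring backups).
months_of_capacity = int(DISK_MB // high)
print(f"disk holds about {months_of_capacity} months of data")

# Worst case just before rotation: 24 months of data plus logging.
peak_fraction = (retained_high + LOGGING_MB_MAX) / DISK_MB
print(f"peak disk usage: {peak_fraction:.0%}")
```

With the worst-case inputs the peak comes out below the 75% alert line, which is what leaves headroom for the alert to mean something when it fires.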
Now that you’ve configured all your alerts and have a good idea of how to track failures, it’s time to start responding to those alerts. Having an on-call rotation is the first part of dealing with application failures and reliability. There are a couple of ways to handle an on-call rotation; some are less intense than others, and some are very involved but can be altered to suit the team. This is where having a diverse team is a huge advantage: folks with different holidays, folks who wake up and go to sleep at different times, and folks at different stages of life. In the event of an incident, having lots of different approaches to a problem means you will be able to solve it faster and with more trade-offs to pick from. There will be a lot of shuffling around when someone wants to be on call or can’t be, so having a schedule that is balanced and allows for shift trading is important. I normally follow one of these shift schedules:
The weekend warrior: 1 person is on call during the week and 1 person is on call during the weekend. The person who was on call during the week handles backup for the person working the weekend. (1 OnCall, 1 Backup OnCall)
The 7 day shuffle: 1 person is on call for a week at a time, and the week prior to their shift they are the backup for the person who is on call that week. This helps to slowly ramp up the person handling on-call by having them be the backup first. (1 OnCall, 1 Backup OnCall)
Night Shift/Day Shift: This is a great way to handle things if you’ve got an even split of fresh college grads who don’t want to sleep and senior staff who don’t want to wake up their partner. It is great for geo-diverse teams as well, where it can be a daytime shift for everyone. 1 engineer takes the day shift starting at 6 AM and hands off to another engineer at 6 PM, and in the morning it shifts back. This happens for a week at a time. If a backup is needed, treat it like a 7 day shuffle, where each engineer works as a backup the week ahead of their shift. (2 OnCall, 2 Backup OnCall)
Each of these rotations has trade-offs, but the main thing to focus on here is that during an on-call shift you should be comfortable with the application stack. So the on-call rotation should ensure that every engineer working on the codebase has a chance to be on call at least once every other month; otherwise engineers who haven’t been on call will be calling their backup frequently due to lack of experience in the on-call role.
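To make the 7 day shuffle concrete, here is a small Python sketch that generates the schedule. The function names are hypothetical, and treating "every other month" as a gap of at most 8 weeks is an assumption for illustration.

```python
from datetime import date, timedelta
from typing import List, Tuple

Shift = Tuple[date, str, str]  # (week start, primary on-call, backup on-call)


def seven_day_shuffle(engineers: List[str], start: date, weeks: int) -> List[Shift]:
    """Each week's backup is the engineer who becomes primary the week after."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a backup")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        backup = engineers[(week + 1) % n]  # ramps up before their own shift
        schedule.append((start + timedelta(weeks=week), primary, backup))
    return schedule


def everyone_on_call_often_enough(engineers: List[str],
                                  max_gap_weeks: int = 8) -> bool:
    """In a weekly round-robin, each engineer is primary once every len(engineers) weeks."""
    return len(engineers) <= max_gap_weeks


for week_start, primary, backup in seven_day_shuffle(
        ["ana", "bo", "cy"], date(2024, 1, 1), 4):
    print(week_start, "primary:", primary, "backup:", backup)
```

Shift trades can then be handled by swapping entries in the generated list, which keeps the rule that the backup is always next week's primary easy to re-check.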
SRE Red Team vs. Blue Team
If you have engineers calling for backup support constantly, then I’d recommend Incident Fridays. Make a clone of your production systems and ask one experienced engineer to create an issue with the system, something that the team will need to debug together. Have them create at least 4 clues as to what is wrong.
For example: a bad record was inserted into the system database. The clues are below.
- Requests for data from the front-end or through the API return a 500 error code
- There was an issue 4 days ago where we had to restore records manually due to a database connection bug
- Only when pulling all data from the API do we hit the 500 error code
- The errors start appearing around the same time as the records restore
Treat it like a team game: the team works through the process of debugging the issue, and the experienced engineer tests their knowledge of the system until they are comfortable.
The last thing I’m going to talk about is the health of your on-call engineering staff and the number of incidents they’re dealing with. This depends on the type of team that you have: sometimes the team loves dealing with weird bugs, other times the team wants to get right back to coding. If you realize that your on-call engineer is dealing with a ticket every day, or more than one per day, and each ticket takes more than two hours to deal with, you have a reliability problem, and it isn’t going to be solved by adding more on-call engineering staff. You need to make a very real investment in fixing the issues with your system, most likely by adding testing, by adding multiple load-balanced instances of the service that can handle a single outage, or possibly by asking your most productive engineers to stop what they are doing and fix bugs instead of writing features. If you have to stop writing features and start fixing bugs, I’d recommend declaring a Code Yellow to the teams around you, meaning you’ve had a large number of failures and any help they can provide with the system’s reliability over the next few weeks is critical. A software project is like an entropy system: the more features you write, the higher the level of entropy in the system, and the higher the entropy, the harder it is to add features or keep the system running.
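The rule of thumb above (at least one ticket per on-call day, each taking more than two hours) can be written down as a quick check against your ticket history. This is a hedged sketch: the input shape, the function name, and the use of an average for time per ticket are assumptions, not something your ticketing system provides out of the box.

```python
from typing import List


def has_reliability_problem(ticket_hours_by_day: List[List[float]],
                            hours_threshold: float = 2.0) -> bool:
    """Each inner list holds the hours spent on every ticket for one on-call day."""
    days = len(ticket_hours_by_day)
    all_hours = [hours for day in ticket_hours_by_day for hours in day]
    if days == 0 or not all_hours:
        return False
    tickets_per_day = len(all_hours) / days
    average_hours = sum(all_hours) / len(all_hours)
    # Flag only when both conditions hold: frequent tickets and slow fixes.
    return tickets_per_day >= 1 and average_hours > hours_threshold


busy_week = [[3.0], [2.5, 4.0], [3.0]]
quiet_week = [[0.5], [], [1.0]]
print(has_reliability_problem(busy_week), has_reliability_problem(quiet_week))
```

Reviewing a number like this at the end of each rotation gives you something concrete to point at when you ask for time to fix bugs instead of writing features.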
If any of this is new to you, try asking other teams around you how often they’re on call, what types of tools they’re using, and whether you can share the tools they use to monitor and alert on incidents. Doing this will help limit the amount of time you have to spend setting up your infrastructure and could help you understand issues that are impacting multiple applications across the company.
If you enjoy the articles that I am writing, please feel free to request more like the ones you’ve read. If you are interested in the style of work that I am writing about, please feel free to speak with me; we’re looking for wonderful SRE folks who want to solve problems like this!