SRE Engagement Playbook — Month 2

John Stupka
9 min read · Oct 2, 2020

This is a continuation of my SRE Engagement series.

Given that you’ve been on a team for at least a month, you should have a basic idea about the team, their skill levels, and what their main issues are. Many teams, especially those in need of an SRE team, have a major issue with instability and outages. Last month we helped identify tasks to help the team; now we’re going to validate where those issues actually are.

Observability

(Image: the “Queen of Observability” and head of https://www.honeycomb.io/)

There are two ways to identify issues within a system: ask someone who knows, or pull information together and see for yourself. That’s where monitoring and observability come together; they help create a picture of the system in real time.

There are a lot of things you can pull together to create this picture, but I’d recommend focusing on things related to your specific service. Look at the list of issues that came up in the first month and identify what could be better understood through data collected about the service. Start with the smaller issues and move up the list from there.

Here are some of the things you may decide to collect, along with some recommendations on each:

Metrics

Your metrics will be the smallest and fastest way to deal with information related to your application. These are the numeric values tied to the health of an application. This information should be collected for a specific window of time defined by a data retention plan; my normal recommendation is three to six months, which gives you at least two business quarters to determine impact on the service over time. This timeline can, and should, be shortened for a product that is rapidly changing.

For metrics there are two ways to supply this information to a collection service: push (1) and pull (2). There is one additional way to handle metrics, called watch (3), which can be used for larger systems. I won’t go into detail about it here; it can be ignored for the time being since it depends on very high-throughput applications and extensive infrastructure work. It’s good to be aware of, though.

Depending on how you’ve architected your system, you’ll want to select one of the patterns above (probably 1 or 2). The most common pattern I’ve seen is number 2: exposing metrics in a pull pattern and allowing a collection service to grab them at will. This eliminates issues with a service attempting to push while the metrics collection service is down, and it limits load on the collection service by letting it decide when to pull metrics from an application service, protecting it from being overloaded by requests. Additionally, this gives you a free health check against your service: if there are no metrics, the service is down or can’t be reached.
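As a concrete example, here is a minimal sketch of the pull pattern, assuming the Prometheus Go client as the collection stack; the metric name, label, and port are made up for illustration:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is a hypothetical counter tracking handled requests.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_requests_total",
		Help: "Total requests handled, labeled by HTTP status.",
	},
	[]string{"status"},
)

func main() {
	prometheus.MustRegister(requestsTotal)

	// Application endpoint: increment the counter on every request.
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues("200").Inc()
		w.Write([]byte("ok"))
	})

	// Pull pattern: the collector scrapes /metrics whenever it decides to,
	// which doubles as a free health check: no metrics means unreachable.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```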

Events

After collecting simple numeric information, you’ll realize you’re able to identify issues, but not investigate them. That’s why you need to tie this into some sort of identification system, which is where events come in.

When I say events, I mean something that can be understood as an end-to-end action tagged with a unique identification code. A simple example is a database entry for an action that has occurred. Let’s say you have a service that creates new users. When a new user is created, they are given a randomized base64 code such as “FEfe8VRWW1TLXrY”, which gives us a unique identifier for that user. For every action the user makes, include this unique tag, and ensure it lives only for the length of their session. We can now pull all events that have occurred in relation to our user “FEfe8VRWW1TLXrY”, giving us an outline of what the user has done as a list of events. If the memory of our application spikes drastically and we can see it happens at 8 PM every night, you should be able to determine who and what was causing that “event” and start to understand what they were doing.
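Here is a minimal sketch of that tagging idea in Go; the session ID generator and the event type are hypothetical stand-ins for whatever your user service actually stores:

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"time"
)

// Event is one end-to-end action, tagged with the session's unique ID.
type Event struct {
	SessionID string
	Action    string
	At        time.Time
}

// newSessionID returns a short randomized base64 identifier for a session.
func newSessionID() string {
	b := make([]byte, 12)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return base64.RawURLEncoding.EncodeToString(b)
}

func main() {
	// A user logs in: mint an ID that lives only for this session.
	id := newSessionID()

	// Every action the user takes carries the same tag, so we can later
	// pull all events for this ID and reconstruct what they did.
	events := []Event{
		{SessionID: id, Action: "create_user", At: time.Now()},
		{SessionID: id, Action: "upload_photo", At: time.Now()},
	}
	for _, e := range events {
		fmt.Printf("%s %s %s\n", e.At.Format(time.RFC3339), e.SessionID, e.Action)
	}
}
```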

Logging

The most common and sometimes most difficult tool, logging is something that gives you context and can be collected as a metric or an event. Logging can become a major expense, since the volume of a system’s log messages can explode when not handled correctly. The biggest recommendation I can give for a logging system is for the development team to be persistent about applying the correct logging levels to their output messages, and to right-size how long you need to retain logs.

When handling a logging system, you want to identify issues early and hand that information back to a development team quickly for debugging and remediation. That’s why it’s important to break the messages you collect into groupings the team can use. This also helps protect against accidental leaks of information that can end up in a log message. The pattern that’s usually followed is below (a small sketch follows the list):

  • Trace — A message that contains as much information as possible
  • Debug — A message that is meant for dumping out extra information in development
  • Info — Information that should be used for features currently in development or early release
  • Warn — Anything related to unexpected output, possible invalid input/output handling
  • Error — Critical issues related to data integrity, lost dependencies, or anything affecting the service’s ability to run reliably
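Here is a minimal sketch of those levels using Go’s standard log/slog package; slog has no built-in Trace level, so a custom level stands in for it, and the messages are purely illustrative:

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// slog has no built-in Trace level; anything below Debug (-4) works.
const LevelTrace = slog.Level(-8)

func main() {
	// In production you would raise the minimum level (Info or Warn)
	// instead of shipping Trace/Debug noise into the logging pipeline.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelDebug,
	}))
	slog.SetDefault(logger)

	ctx := context.Background()
	slog.Log(ctx, LevelTrace, "full request dump", "payload_bytes", 2048)
	slog.Debug("cache lookup", "key", "user:FEfe8VRWW1TLXrY", "hit", false)
	slog.Info("new feature flag evaluated", "flag", "beta_checkout")
	slog.Warn("unexpected empty response from dependency", "dep", "billing")
	slog.Error("failed to persist record", "table", "users", "err", "connection reset")
}
```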

Development teams can fall into a habit of not trusting their tools or failing to test. A large amount of logging at maximum verbosity can be a signal that the team you are working with hasn’t tested thoroughly, or is moving too quickly to trust its own work. That in turn can cause the logging pipeline to fail under the weight of what it processes daily. Use the current team setup to determine whether this is an issue you need to address.

Trace

The last and most critical tool you should be using with a team is tagged events that flow through the system as logs. These are known as traces, which can be injected into a system to automatically apply a unique event identification code to their messages. The biggest benefit of a tracing system is the timing-based debug information it builds into the system. Implementing this allows you to detect the TTX (time to X) of actions, which helps you debug latency and bottlenecks within the system at a glance. This additional tool does put a much larger weight on any existing monitoring system, though, so you should build it with the ability to turn application tracing on and off as needed inside the application. My preference is a set of environment variables that can be injected into the running application to toggle tracing on the fly.
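Here is a minimal sketch of that on/off switch using only the standard library and a hypothetical TRACING_ENABLED environment variable; a real tracing stack such as OpenTelemetry would replace the hand-rolled ID and timing:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
	"os"
	"time"
)

// tracingEnabled is checked per request, so the flag can be flipped by
// whatever mechanism re-injects environment values into the process.
func tracingEnabled() bool {
	return os.Getenv("TRACING_ENABLED") == "true"
}

// withTrace tags each request with a trace ID and records its duration,
// giving a rough "time to X" for the wrapped handler.
func withTrace(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !tracingEnabled() {
			next.ServeHTTP(w, r)
			return
		}
		b := make([]byte, 8)
		rand.Read(b)
		traceID := hex.EncodeToString(b)

		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("trace=%s path=%s duration=%s", traceID, r.URL.Path, time.Since(start))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withTrace(mux))
}
```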

Fixing Issues

With the information you’ve collected, you should be able to identify some issues right away. Some of the most common issues you will see are external to the team you are working with: issues caused by another team or by a resource the team uses. Let’s look at some possible solutions for issues related to external resources or teams.

Dependencies

Check to ensure the things you depend on are up and running correctly. These are the things you have no control over but are required to use for one reason or another.

  1. Health Status — How often does this dependency go down?
  2. Response Timing — How fast are the responses from this dependency?
  3. Load Volume — At what point do the requests we make break our dependency?

These checks help identify critical issues with a dependency. Assuming you have no control over it, there are still some ways we can support it; a small probe sketch for the first two checks follows.
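Here is a minimal probe covering Health Status and Response Timing, assuming the dependency exposes an HTTP endpoint (the URL and timeout are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe answers the first two questions for one dependency:
// is it up right now, and how fast does it respond?
func probe(url string) (up bool, latency time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	start := time.Now()
	resp, err := client.Get(url)
	latency = time.Since(start)
	if err != nil {
		return false, latency
	}
	defer resp.Body.Close()
	return resp.StatusCode < 500, latency
}

func main() {
	// Run this on an interval and feed the results into your metrics,
	// so Health Status and Response Timing trends build up over time.
	up, latency := probe("https://dependency.internal/healthz")
	fmt.Printf("up=%v latency=%s\n", up, latency)
}
```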

Circuit breaker pattern applied to a database issue

A common issue for many applications is the loss of a backend resource, sometimes due to high request rates, sometimes due to network instability. There are design patterns that can help minimize the impact of this loss of service. My recommended solution is the Circuit Breaker, a pattern that signals an issue with lower-level services to the user-facing applications. This solution halts transactions from our service that require the backend service and signals all the way up the service chain that we can’t accept requests.
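Here is a minimal sketch of the core circuit-breaker logic; in practice you would likely reach for an existing library, and the thresholds here are arbitrary:

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit open: backend unavailable, rejecting request")

// Breaker trips after maxFailures consecutive errors and rejects calls
// until cooldown has passed, signalling callers up the chain to back off.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrCircuitOpen // fail fast instead of hammering the backend
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the circuit
		}
		return err
	}
	b.failures = 0 // a success closes the circuit again
	return nil
}
```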

This deals with the possibility of a Health Status issue in our system: we are tied directly to our backend dependency and can’t work without it. If you’ve added a load balancer in front of your service with multiple regions and a health-check feature, you can now re-balance your requests to a region whose backend is still running.

Caching Layer Dependency Injection

If you find you still need this dependency, then it’s time to inject a cache into your system: something just above the dependency to hold the data short-term and provide you with available information even when the data source goes down. This gives you a healthy source while you wait for the service to return. This solution can also be used when you have issues with Response Timing, since a localized cache can improve your response times.
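Here is a minimal sketch of a cache sitting just above the dependency: it serves fresh data when it can and falls back to the last known value when the dependency is unreachable (the fetch function and TTL are placeholders):

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value   string
	fetched time.Time
}

// Cache sits just above a dependency: serve fresh data when possible,
// fall back to the last known value if the dependency is unreachable.
type Cache struct {
	mu    sync.Mutex
	ttl   time.Duration
	data  map[string]entry
	fetch func(key string) (string, error) // call to the real dependency
}

func New(ttl time.Duration, fetch func(string) (string, error)) *Cache {
	return &Cache{ttl: ttl, data: make(map[string]entry), fetch: fetch}
}

func (c *Cache) Get(key string) (string, error) {
	c.mu.Lock()
	cached, ok := c.data[key]
	c.mu.Unlock()

	// Fresh enough: answer locally and keep load off the dependency.
	if ok && time.Since(cached.fetched) < c.ttl {
		return cached.value, nil
	}

	value, err := c.fetch(key)
	if err != nil {
		if ok {
			return cached.value, nil // dependency down: serve stale data
		}
		return "", err // nothing cached, surface the failure
	}

	c.mu.Lock()
	c.data[key] = entry{value: value, fetched: time.Now()}
	c.mu.Unlock()
	return value, nil
}
```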

Another possible issue you could see during a high-load scenario is the sudden loss of your backend dependency and dropped messages caused by a dependency that can’t keep up. This type of situation can be handled with built-in throttling or artificial backoff against the resource. A handy way to do this is to add some sort of request-limiting resource.
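One way to add that limiting, assuming golang.org/x/time/rate as the limiter (the rate and burst values are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most 50 requests per second to the dependency, with a burst of 10.
	limiter := rate.NewLimiter(rate.Limit(50), 10)
	ctx := context.Background()

	for i := 0; i < 100; i++ {
		// Wait blocks until the limiter permits another call,
		// smoothing spikes instead of passing them straight through.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("context cancelled:", err)
			return
		}
		callDependency(i)
	}
}

// callDependency stands in for the real request to the backend.
func callDependency(i int) {
	fmt.Println("request", i, "at", time.Now().Format(time.RFC3339Nano))
}
```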

Adding an eventing tool to the system delays the write of data into the backend. This relieves load on the backend dependency by letting it handle fewer messages at a time, while still allowing you to write to and read from a localized cache of data. There is a time limit to this approach based on the amount of available space (a good thing you are collecting metrics), but it buys you much-needed time to protect your backend dependency. This handles Load Volume on the dependency, slowing the addition of data until it can be handled effectively.
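A minimal sketch of that buffering idea uses a bounded channel as the queue; a real system would likely use a message broker, and the sizes here are arbitrary:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Bounded queue: its capacity is the time limit mentioned above,
	// so its depth is worth exporting as a metric.
	writes := make(chan string, 1000)
	done := make(chan struct{})

	// A single worker drains the queue at a pace the backend can handle.
	go func() {
		for record := range writes {
			time.Sleep(20 * time.Millisecond) // simulated slow backend write
			fmt.Println("persisted:", record)
		}
		close(done)
	}()

	// Producers enqueue instead of hitting the backend directly;
	// reads in the meantime come from the localized cache.
	for i := 0; i < 10; i++ {
		writes <- fmt.Sprintf("event-%d", i)
	}
	close(writes)
	<-done
}
```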

Finally, a problem you can see with a system that requires a dependency is contention around access to a very specific part of that dependency. Most likely you will need only certain parts of the dependency at certain times: perhaps lots of users read from the backend at a certain hour, or pictures and video pulled from a dependency during a user’s session.

A way to handle this is to break the data out into shards based on the number of read/write requests you are tracking. If you see a grouping of data that is read heavily during a certain window, that data needs to be “hot”, reserved in a cache or index for its high-access time window, then removed and replaced with data that will be needed later. Breaking the data out across two groups of applications also allows for faster access and higher availability. This means you can handle a much larger number of requests by spreading the data across multiple regions or groups, while also protecting you if things go down in one region or group.
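Here is a minimal sketch of routing keys to shards by hash; the time-windowed “hot” placement described above would layer on top of this, and the shard names are made up:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of n shards; a given user's data always
// lands on the same shard, so hot keys can be tracked per shard.
func shardFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

func main() {
	shards := []string{"shard-us-east", "shard-us-west", "shard-eu"}
	for _, user := range []string{"FEfe8VRWW1TLXrY", "aB3dE5fG7hI9jK1", "zYxWvUtSrQpOnMl"} {
		idx := shardFor(user, uint32(len(shards)))
		fmt.Printf("user %s -> %s\n", user, shards[idx])
	}
}
```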

Dashboard

After identifying and fixing issues with the application using these observability and monitoring tools, you should be ready to build dashboards or reports that help the team understand their application better. You can also use them to make recommendations on how to improve things, and give the team’s leadership the tools they need to make informed decisions.
