SRE Engagement Playbook — Month 5
After your first month with the team, you learned how to work with a new system and came to understand its bottlenecks and the team's areas of concern. In your second month you implemented observability tooling and took the initiative to solve some common issues that reviewing that observability data exposed. During your third month you used tracking systems and alerts to detect failures and repair the system. Last month we predicted failure in our systems, created systems that self-heal, and designed distributed systems that seem like they never go down. Now we're going to ensure that the quality of those systems is consistent every time and that we can run every possible quality check before shipping our application.
Working like a Factory
Sometimes folks joke that SRE stands for "Simply Restart Everything". I'd like to say that's not completely true, but sometimes starting from a baseline will reset your issues and resolve a broken state. That's why the main goal of any SRE team should be to create a system of automation that can repeatedly test, validate, build, and store your application, getting us back to a state we understand. These tasks can be broken down into simple steps that keep our automated tasks self-contained and limited in scope, reducing complexity in the pipeline itself. I like to think of it as building a factory: we want every product that comes out of our factory to meet a certain level of quality, and to meet that same level for every product we create. The idea being, we don't want to give a customer a faulty device or recall products after they've already been released.
That’s why we’re going to talk about how you can build a system with correct deployments and automation pipelines.
The first thing your factory should do is pull your code and start running a barrage of tests against your product. This is an attempt to identify as many of the ways someone will use your product as possible and check each of them. Test pipelines are a process of gaining confidence in the product you are about to release. They do not guarantee success; they only increase confidence in your product.
UIE ("Unit-Integration-End to End") testing is the first step in the factory, the idea being that the sooner we find a problem, the sooner we can fix it before the customer sees it. Testing identifies a failure through a scenario we've identified as a possible usage case. We break tests up into different patterns and levels of coverage; the most precise type of test, and often the fastest, is the unit test.
These tests should cover a very small amount of functionality that can be self-contained, which means you should be able to identify an "in" and an "out" for your functionality. This allows you to test whatever type of input you could expect, as well as unexpected input. This can become more complex or less complex, but generally boils down to ensuring the program has these specific "in" and "out" points. Since unit tests are some of the fastest tests to run, it's recommended to encourage their use during local development workflows.
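As a minimal sketch of the "in" and "out" idea, here is a hypothetical `parse_timeout` function with a clearly defined input (a string) and output (a non-negative integer), plus unit tests covering both expected and unexpected input:

```python
def parse_timeout(raw: str) -> int:
    """The 'in' is a raw config string; the 'out' is a non-negative
    integer number of seconds. Everything else is rejected."""
    value = int(raw.strip())  # raises ValueError for non-numeric input
    if value < 0:
        raise ValueError("timeout must be non-negative")
    return value

def test_expected_input():
    # Typical input, including surrounding whitespace.
    assert parse_timeout(" 30 ") == 30

def test_unexpected_input():
    # Unexpected inputs must fail loudly, not silently succeed.
    for bad in ("-5", "not-a-number"):
        try:
            parse_timeout(bad)
            assert False, f"{bad!r} should have been rejected"
        except ValueError:
            pass

test_expected_input()
test_unexpected_input()
```

Because the function has a single entry and exit point, the tests need no setup beyond calling it, which is what keeps unit tests fast enough for local development loops.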
The next form of testing, integration testing, exercises the actual interaction of the pieces that make up the application. In the above diagram you might have tested the way a string was read in from the config reader. Now you need to test many actions all at once: does the configuration file location matter, once the file is found can you read it correctly, and how does the data structure you store the file in handle different configurations? Some of these things might have been covered during a unit test and don't necessarily need to be re-tested, but those unit tests are based on a very small view of the larger application. Since integration tests can depend on different components of the system, you should run these before merging code and encourage a passing integration test before pull requests are approved.
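A sketch of the config-reading example above, with a hypothetical `load_config` helper: instead of testing one string parse in isolation, the test exercises finding the file, reading it, and parsing it into the application's data structure together:

```python
import json
import tempfile
from pathlib import Path

def load_config(path: Path) -> dict:
    """Find the file, read it, and parse it into the structure the
    app uses — three actions exercised together, not in isolation."""
    if not path.exists():
        raise FileNotFoundError(f"config not found: {path}")
    with path.open() as fh:
        return json.load(fh)

def test_config_round_trip():
    # Write a real file to a real location, then read it back.
    with tempfile.TemporaryDirectory() as tmp:
        cfg_path = Path(tmp) / "app.json"
        cfg_path.write_text(json.dumps({"timeout": 30, "retries": 3}))
        cfg = load_config(cfg_path)
        assert cfg["timeout"] == 30 and cfg["retries"] == 3

def test_missing_file_is_reported():
    # The file-location question: a bad path must surface a clear error.
    try:
        load_config(Path("/nonexistent/app.json"))
        assert False, "missing config should raise"
    except FileNotFoundError:
        pass

test_config_round_trip()
test_missing_file_is_reported()
```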
The last form of testing, end-to-end (E2E) testing, simulates exactly what a customer does. The downside to E2E testing is that it takes more effort to implement, and even more to implement correctly, since it requires testing the whole system at once. E2E tests interact with the system at specific entry points and check the result at the end of the system. This is also the most critical type of testing (in my own opinion) because it gives you the most coverage with the least amount of system knowledge needed. These tests should run during any deployment into an environment, ensuring that once a system is promoted from one environment to another it has been validated.
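As a toy illustration of the E2E idea (with a stand-in service rather than a real deployment): the test starts the whole system, drives it the way a customer would — over the network — and checks only what is observable at the end:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the deployed system: one endpoint a customer would hit."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):
        pass  # keep test output quiet

# Start the "system" on an ephemeral port.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The E2E test knows nothing about internals: it makes a real HTTP
# request and checks only the externally observable result.
url = f"http://127.0.0.1:{server.server_port}/health"
with urllib.request.urlopen(url, timeout=5) as resp:
    status, body = resp.status, resp.read()
server.shutdown()

assert status == 200 and body == b"ok"
```

Note how little system knowledge the test needed — just a URL and an expected response — which is exactly why E2E coverage is so broad.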
The next step in our test pipeline is verifying that the system can withstand failures, which can happen in several different ways. The best way to ensure a system can succeed during a failure event is to run your system somewhere that creates failures and review its state while those failures are happening.
The easiest type of failure to create is an outage failure: turning the entire machine running part of your system off and on (sometimes multiple times). To do this you can simply run a shutdown or reboot command against a machine and follow along with the logs in your system to see what happens. Depending on how your application is structured, you should be able to test failure in different components of the system and collect data on each failure to see what happened. This type of exercise is called a chaos experiment and can give engineers an idea of how a system will operate under certain failure scenarios.
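A scaled-down sketch of this kind of chaos experiment, using a child process as a stand-in for the machine: kill the running "service", confirm the supervisor notices the death, and verify it comes back:

```python
import subprocess
import sys

# Stand-in service: a child process that would otherwise run for a long time.
SERVICE_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]

def outage_experiment() -> dict:
    """Simulate a hard outage of one component, then verify the
    'supervisor' (here, this script) detects it and restarts it."""
    proc = subprocess.Popen(SERVICE_CMD)
    proc.kill()                        # the chaos step: hard outage
    exit_code = proc.wait()
    died = exit_code != 0              # a killed process exits non-zero

    replacement = subprocess.Popen(SERVICE_CMD)  # supervisor restarts it
    recovered = replacement.poll() is None       # still running after restart
    replacement.kill()                 # clean up the experiment
    replacement.wait()
    return {"died": died, "recovered": recovered}

result = outage_experiment()
assert result["died"] and result["recovered"]
```

In a real experiment the kill would be a reboot command against a host and "recovered" would be judged from your logs and metrics; the shape of the exercise — inject, observe, record — is the same.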
Another form of failure is in communication, or the lack of communication, within a system. What happens when a system can't communicate and certain components continue to run? Will you cause more damage, or will everything be fine? One way to test this type of failure is by turning a firewall rule on and shutting off communication, sometimes toggling it on and off to test sporadic failure. Another way is to slow a call's traffic by limiting the network message size.
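A small simulation of sporadic communication failure (a flapping firewall) and the question it forces you to answer — does the caller cope, or cause more damage? Here a hypothetical client retries with a cap, which is one sensible answer:

```python
import random

def flaky_call(failure_rate: float, rng: random.Random) -> str:
    """Simulate a call through a firewall that is flapping on and off."""
    if rng.random() < failure_rate:
        raise ConnectionError("simulated dropped connection")
    return "ok"

def call_with_retries(failure_rate: float, attempts: int, seed: int = 42):
    """Retry with a cap, recording whether we got through and how many
    tries it took — the data you would collect during the experiment."""
    rng = random.Random(seed)  # deterministic for a repeatable experiment
    for attempt in range(1, attempts + 1):
        try:
            flaky_call(failure_rate, rng)
            return True, attempt
        except ConnectionError:
            continue
    return False, attempts

ok, tries = call_with_retries(failure_rate=0.5, attempts=10)
assert ok and tries >= 1
```

The interesting output of a real experiment is the same pair of facts: did the component eventually succeed, and how much sporadic failure did it absorb on the way.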
Finally, another form of failure that you might want to test is loss of storage or the inability to write more information due to a failure at the data layer. Failures like this can often take out a system, giving it no time to report the failure. In this type of failure, does your system detect this issue and remediate the problems or does your application fail to recognize the failure? A common way to test this is by changing the ownership of a filesystem to stop an entire system from writing data.
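The key question in the storage-failure paragraph — does the system detect the write failure or die without reporting it — can be sketched with a hypothetical `safe_write` helper that turns the OS-level error into a signal the system can act on:

```python
import tempfile
from pathlib import Path

def safe_write(path: Path, data: bytes) -> bool:
    """Attempt a write at the data layer; report failure instead of
    letting the whole system die with no time to log anything."""
    try:
        path.write_bytes(data)
        return True
    except OSError:
        # In a real system this is where you would alert and fail over.
        return False

# Healthy storage: the write succeeds.
with tempfile.TemporaryDirectory() as tmp:
    ok = safe_write(Path(tmp) / "state.db", b"checkpoint")

# Simulated storage failure: the volume is gone, so the write fails
# the same way a lost filesystem or revoked ownership would.
failed = safe_write(Path("/nonexistent-volume/state.db"), b"checkpoint")

assert ok is True and failed is False
```

The chaos experiment described in the text (changing filesystem ownership out from under the process) exercises exactly this branch: if your code has no equivalent of the `except OSError` path, the experiment will show the system failing to recognize the failure.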
These types of chaos experiments are the things that will determine the success of your system in many adverse situations.
The final form of pipeline testing that should be implemented is load testing, which tests the application under many requests. These should focus on two types of load test: integration load testing and end-to-end load testing, meaning we can test the system component by component as well as the entire system at once.
Testing individual components of a system is critical because it ensures the overall system's bottlenecks are well understood. A common way to load test application components is to start a large number of services in a similar location, to decrease the amount of network interference, and run requests until you start to see failures. In the above diagram, each service can handle the number of requests from our load tester except Service 3, which drops requests. This is a great example of a bottleneck in the system, which can be fixed by splitting requests from Service 2 between Service 3 and a copy of Service 3 in order to handle the extra load.
These tests can produce a good metric for you when building a system that will eventually undergo a large amount of load during a set period of time. A service that is used in shopping may see a large number of requests during a certain time of the year where a sale is happening. Having a service that can handle that additional load or knowing when to scale your service up can be the difference between a loss of customers and the best sales season a company has seen. Testing requests per second, load on a CPU for a server, and memory usage during a load test can isolate bottlenecks.
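A minimal sketch of the measurement side of a load test, using a stand-in service function: drive it with a fixed number of requests and report the metrics named above — requests per second and latency percentiles, the numbers that isolate a bottleneck:

```python
import statistics
import time

def service(request_id: int) -> int:
    """Stand-in for the component under test; simulate ~1 ms of work."""
    time.sleep(0.001)
    return request_id

def load_test(total_requests: int) -> dict:
    """Closed-loop load generator: send requests back to back and
    record per-request latency plus overall throughput."""
    latencies = []
    start = time.perf_counter()
    for i in range(total_requests):
        t0 = time.perf_counter()
        service(i)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    ordered = sorted(latencies)
    return {
        "requests_per_second": total_requests / elapsed,
        "p50_ms": statistics.median(ordered) * 1000,
        "p95_ms": ordered[int(len(ordered) * 0.95)] * 1000,
    }

report = load_test(200)
assert report["requests_per_second"] > 0
assert report["p95_ms"] >= report["p50_ms"]
```

Against a real service you would also sample CPU and memory on the server side during the run; the point of the sketch is that the load tester's own output — throughput and tail latency — is where the bottleneck first shows up.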
Once you’ve tested and validated your product, you need to put all the parts together and add it to a box with tracking information in the event of a recall or if you need to switch out components. This type of pipeline handles the last-minute details for your product and getting it ready for shipping.
The process of code scanning can mean several things; one of the more critical meanings is static code analysis. Code analysis can be handy when trying to identify code that, despite having passed validation testing, could be improved or isn't up to the team's defined best practices. Many build pipelines will run a linting step, which is mainly focused on finding errors, bugs, stylistic problems, and suspicious constructs in the codebase. These are things that we don't see as a problem in the test cases but could be fixed to make the codebase more maintainable or to help avoid code that will create issues in the future.
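A toy version of such a lint check, built on Python's own `ast` module: it walks a snippet's syntax tree and flags bare `except:` handlers, a classic "suspicious construct" that tests rarely catch because the code still passes:

```python
import ast

SOURCE = '''
def read_value(path):
    try:
        return open(path).read()
    except:
        return None
'''

def find_bare_excepts(source: str) -> list:
    """Minimal linting pass: flag bare `except:` handlers, which
    silently swallow every error, including the ones you care about."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(node.lineno)
    return findings

warnings = find_bare_excepts(SOURCE)
assert warnings == [5]  # the bare except sits on line 5 of the snippet
```

Real linters (flake8, pylint, and friends) are collections of hundreds of passes like this one; the pipeline's job is just to run them and fail the build on findings the team has agreed to enforce.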
Another form of code scanning is data-flow analysis, which looks at the way the code was written and warns when the nesting of logical loops seems less than optimal. An example would be when the scanner sees three or more loops layered on top of each other. To the scanner this looks like an area for improvement because the code may be running with O(n³) — that is, O(n·n·n) — complexity. Having a tool that can isolate issues like this can give you insight into a codebase that functions correctly but has room for large improvements.
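The nested-loop warning can be sketched with a few lines of `ast` as well: measure the deepest run of nested loops in a snippet and flag anything at depth three or more:

```python
import ast

SOURCE = '''
def all_triples(items):
    result = []
    for a in items:
        for b in items:
            for c in items:
                result.append((a, b, c))
    return result
'''

def max_loop_depth(node, depth=0):
    """Report the deepest nesting of for/while loops under `node`;
    depth 3 suggests O(n*n*n) work worth a second look."""
    if isinstance(node, (ast.For, ast.While)):
        depth += 1
    child_depths = [max_loop_depth(c, depth) for c in ast.iter_child_nodes(node)]
    return max(child_depths, default=depth)

depth = max_loop_depth(ast.parse(SOURCE))
assert depth == 3  # three nested loops: a candidate for the scanner's warning
```

A real data-flow analyzer does far more than count nesting, but the report it hands you is the same kind of signal: correct code, measurable room for improvement.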
Aside from checking the code itself, a good code scanning tool will look for high-entropy values stored as constants in the code. These values will be flagged as possible secrets committed to the codebase, which can protect you from exposing secrets without even knowing it. These scans can be set up for specific secret formats like OAuth tokens, RSA keys, or X.509 SSL certificates, all of which can end up accidentally stored in a codebase by developers who are moving quickly.
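The entropy side of secret scanning is simple enough to sketch directly: compute Shannon entropy per character and flag long, high-entropy string constants. The constants and threshold below are hypothetical; real scanners combine entropy with format-specific patterns:

```python
import math

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; random-looking tokens score high,
    ordinary words score low."""
    if not s:
        return 0.0
    counts = {c: s.count(c) for c in set(s)}
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def flag_possible_secrets(constants, threshold=4.0, min_len=16):
    """Flag string constants whose length and entropy look like credentials."""
    return [
        name for name, value in constants.items()
        if isinstance(value, str)
        and len(value) >= min_len
        and shannon_entropy(value) >= threshold
    ]

constants = {
    "GREETING": "hello world",
    "API_KEY": "q9Zf3kX7bLw2mNd8RtYvC4sHj6",  # hypothetical high-entropy token
}
flagged = flag_possible_secrets(constants)
assert flagged == ["API_KEY"]
```

The threshold is the tuning knob: too low and you drown in false positives on ordinary identifiers, too high and short keys slip through, which is why production scanners pair entropy with known secret formats.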
Using a code scanner should be a default practice when implementing an automation pipeline and should be encouraged during local development workflows.
Once we've passed the last layer of code verification, we need to ensure we know what code is going out and where it's located in the wild. A great way to do this is by including a file along with the build that records metadata about the build at that point in time.
Having build information included in your code release will help you track exactly what code you are looking at when it's running in different systems, especially if you have a build debug tool that can pull this file from a build. This gives you a detailed artifact that can be used for real-time analysis of a system, with a direct link from the running application back to the code.
If you can alter this build metadata artifact to append information, it also makes a wonderful "pipeline tracing" document. Append what types of tests you've run, when you ran them, on what servers, and what the outcome of each was. Then, in the event your code needs to be reviewed, a developer can be given the error output along with this full build artifact.
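A sketch of what such an artifact might look like, with hypothetical values (the version, commit, and runner names would come from your build system, e.g. CI environment variables or `git rev-parse HEAD`):

```python
import json
from datetime import datetime, timezone

# Hypothetical point-in-time build metadata.
metadata = {
    "version": "1.4.2",
    "commit": "9f3c2ab",
    "built_at": datetime.now(timezone.utc).isoformat(),
    "builder": "ci-runner-07",
    "pipeline_trace": [],
}

def record_stage(meta, stage, outcome, server):
    """Append a pipeline-tracing entry: what ran, where, and how it ended."""
    meta["pipeline_trace"].append({
        "stage": stage,
        "outcome": outcome,
        "server": server,
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_stage(metadata, "unit-tests", "passed", "ci-runner-07")
record_stage(metadata, "integration-tests", "passed", "ci-runner-12")

# This JSON document is the file you ship alongside the build.
artifact = json.dumps(metadata, indent=2)
assert len(metadata["pipeline_trace"]) == 2
```

Because each stage appends rather than overwrites, the final artifact reads as a chronological trace of everything the pipeline did to this exact build.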
Adding data about the build to your product will give you a direct link to the code that was released and cut down on time you spend identifying what was released and where it came from.
The last thing your build pipeline will need to do is put the product you’ve been creating somewhere it can be received by your customers.
Creating a compressed release of your build is a legitimate way to package your application. Depending on the language you’ve been working with you may already have compressed your application. Once you’ve compressed your application, place it in an object store and send the link out to your consumers with a checksum for them to validate the package hasn’t been altered.
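The checksum side of that hand-off can be sketched in a few lines: hash the release archive, publish the digest, and have the consumer recompute and compare before installing. The filename and contents below are placeholders:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Checksum the release so consumers can verify it wasn't altered."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large archives don't need to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    release = Path(tmp) / "app-1.4.2.tar.gz"   # placeholder archive
    release.write_bytes(b"pretend this is the compressed build")

    # Producer side: publish this hex digest next to the download link.
    published_checksum = sha256_of(release)

    # Consumer side: recompute and compare before trusting the package.
    assert sha256_of(release) == published_checksum
```

In practice the published digest usually lives in a small `.sha256` file next to the archive in the object store, so consumers can run `sha256sum -c` or the equivalent in their own tooling.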
You might want to create a scripted download and install process, something like a shell script or a configuration playbook. These can be written in different languages or tools but will give your consumer a view of all the steps it takes to configure and run your application.
If you're trying to do something a bit more complex and you want to ensure that your consumers will have all the dependencies and tools they need for their application, you may want to use a container image. A container image is smaller than a machine image but allows you to select specific dependencies and place them in an image together. Building a container image ensures these layers of dependencies are saved into your product. Once you've created the image, you'll need to store it in a container registry for retrieval by your consumers.
If you’re looking to create a perfect replica of your application running on a machine, you might look at creating a virtual machine image, similar to that of a container image. This can be started from an Infrastructure as a Service provider and gives you a running product that is exactly as you expect right away. These should also be stored in an object store of some type and tagged with the correct extension for retrieval and use by consumers.
Building a package for your product depends on the consumer of your product, so use the tools and locations most familiar to them.
Please stay tuned for the last article in the series! We'll be covering security in the final post. This will be the last article in the SRE Engagement Playbook series, which should cover the first 6 months of your time with a new team. If you're interested in work like this or are looking for more information, please let me know in the comments below.