3 Things to Improve Software Reliability

John Stupka
4 min read · Nov 17, 2020


Photo by Matthew Henry on Unsplash

You’re staring at an email from your boss. It says:

Hello engineering team, our customers have seen another outage of the service today. I’ve been informed that if we have another outage, our biggest customer will move to another service. We can’t afford to lose this account, so I’m asking you to find and solve the issues you are seeing with the system.

If you’ve ever received an email like this, you know one thing: you’re exhausted and the whole world feels like it’s falling on top of you. So I’m going to talk about three things you can do today that will help protect you from failures that can take down your services.

Number 1: Always have a Backup

High Availability Database Service

When I say “always have a backup” I’m not just asking you to create a copy of your database (though please do that too); I’m saying you shouldn’t have one database holding all your records. You should have two databases that replicate records back and forth. This is called a high availability setup, and it means you can lose one database and still have a running service.
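To make the idea concrete, here’s a minimal sketch of read failover between a primary and a replica. The `HighAvailabilityDB` and `FakeDB` names are hypothetical stand-ins, not a real client library; a production setup would use your database’s own replication and failover tooling.

```python
class FakeDB:
    """A stand-in for a real database connection (hypothetical)."""

    def __init__(self, data, healthy=True):
        self.data = data
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise ConnectionError("database unreachable")
        return self.data[key]


class HighAvailabilityDB:
    """Wraps a primary and a replica; reads fail over automatically."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def read(self, key):
        # Try the primary first; on failure, fall back to the replica.
        try:
            return self.primary.get(key)
        except ConnectionError:
            return self.replica.get(key)


primary = FakeDB({"user:1": "alice"}, healthy=False)  # simulate an outage
replica = FakeDB({"user:1": "alice"})
ha = HighAvailabilityDB(primary, replica)
print(ha.read("user:1"))  # the replica still serves the record
```

Even with the primary down, the read succeeds, which is exactly the property we want from a high availability setup.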

Number 2: Use Throttling, Queues, and Load Balancing

If you have a service that can’t handle a massive increase in requests, you’ll never be able to recover, which is why it’s so important to build a system of back-pressure and requests-per-second protections.

One of the quickest ways to handle this is adding a Throttle to a particular function or API interface. If you know a function is called 100 times per second and you can only handle 120 requests per second, maybe you should only allow 110 requests per second. It’s better for a customer to wait than to lose access to the service entirely due to failure.
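A common way to implement such a throttle is a token bucket. This is a minimal sketch, assuming the 110 requests-per-second limit from the example above; the `Throttle` class here is illustrative, not a real library API.

```python
import time


class Throttle:
    """Token-bucket throttle: allow at most `rate` requests per second."""

    def __init__(self, rate):
        self.rate = rate
        self.tokens = rate            # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the full rate.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should ask the client to wait and retry


throttle = Throttle(rate=110)
accepted = sum(throttle.allow() for _ in range(150))
print(f"accepted {accepted} of 150 burst requests")
```

A burst of 150 requests gets roughly 110 accepted; the rest receive the “wait” signal instead of overwhelming the service.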

If you don’t want to send back a “wait” signal to the customer but still want to accept requests and execute them more slowly, you can always add a Queue to your service, which slows down execution while still allowing you to accept customer requests. For example, say you have a database insert operation but can only perform 10 inserts per second. Instead of blocking a customer who requests 15 operations at once, execute 10 of them immediately and store the other 5 to execute slightly later. This lets you absorb large request bursts and slowly eat away at the work your application needs to complete.
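The insert example above can be sketched as a bounded-rate work queue. The `WorkQueue` class and its `tick` method are hypothetical names for illustration; in practice this role is often played by a message broker or background worker.

```python
from collections import deque


class WorkQueue:
    """Accepts bursts immediately; drains at a fixed rate per tick."""

    def __init__(self, drain_rate):
        self.drain_rate = drain_rate   # e.g. 10 inserts per second
        self.pending = deque()
        self.completed = []

    def submit(self, op):
        self.pending.append(op)  # accept instantly, never block the customer

    def tick(self):
        # One "second" of work: execute up to drain_rate queued operations.
        for _ in range(min(self.drain_rate, len(self.pending))):
            self.completed.append(self.pending.popleft())


q = WorkQueue(drain_rate=10)
for i in range(15):                  # a burst of 15 inserts arrives at once
    q.submit(f"insert-{i}")
q.tick()
print(len(q.completed), len(q.pending))  # 10 done, 5 still queued
```

All 15 requests were accepted immediately; the 5 that exceed the insert rate simply wait in the queue for the next tick.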

If you’ve added a Throttle and a Queue to your service and you’re still running into issues handling the load, the last thing you can do is create more instances of your application and put them behind a service that passes requests to each instance.

For example, say a service with a Throttle and a Queue can handle 500 requests per second, but you need to handle 2,000 requests per second. If you create 5 copies of your service and put them behind a DNS name like custom.service.com, you can route requests to each of those 5 instances through round robin, meaning every instance you’ve created only needs to manage 400 requests per second. This protects your customers from failure and gives you more control over the number of requests you can handle.
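Round-robin routing is simple enough to sketch in a few lines. This is an illustrative model of the technique, assuming the 5 instances and 2,000 requests from the example above, not a real load balancer (in practice this is handled by DNS, a reverse proxy, or a cloud load balancer).

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distributes incoming requests evenly across service instances."""

    def __init__(self, instances):
        self._next = cycle(instances)  # endlessly loop over the instance list

    def route(self, request):
        return next(self._next)


instances = [f"instance-{i}" for i in range(5)]
balancer = RoundRobinBalancer(instances)

counts = {}
for _ in range(2000):  # 2,000 requests arriving at custom.service.com
    target = balancer.route("GET /")
    counts[target] = counts.get(target, 0) + 1
print(counts)  # each of the 5 instances handles exactly 400 requests
```

Each instance ends up with 400 requests, comfortably under its 500 requests-per-second limit.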

Number 3: Create Service Shards and Partitions

Let’s say you’re running your service on a single server, in a single data center, on a single server rack. If a hardware engineer walking through that data center with their morning coffee accidentally sneezes and spills a drop of coffee on the server you’re using, here’s what could happen: that drop completes an electrical circuit and shorts out the entire server, bringing down your application. You’ve just lost your entire service through no fault of your own.

Service is only in one Region, but two different zones

So, given that situation, how can we protect our service so customers can always access it? The answer is to not run everything on a single server, in a single datacenter, on a single server rack. That means making sure your service can run in multiple places, so that when one instance shuts off, another picks up the slack. We’ve already done this with a load balancer; the way we do it with an application in a datacenter is by knowing where our service is running and making sure we have another instance in a completely separate physical location. This is called Multi-datacenter/High Availability (MD/HA), because we’re in more than one datacenter and we’re highly available, meaning we can survive a service going down as long as we have a copy somewhere else.
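A tiny sketch of the idea, assuming two availability zones in one region as in the figure; the zone names and the `deployment` structure are made up for illustration.

```python
# A toy model of a service deployed across two zones in one region.
deployment = {
    "us-east-1a": {"instance": "svc-1", "healthy": True},
    "us-east-1b": {"instance": "svc-2", "healthy": True},
}


def healthy_instances(deployment):
    """Return the instances that can still serve traffic."""
    return [z["instance"] for z in deployment.values() if z["healthy"]]


# The coffee spill takes out the rack hosting zone us-east-1a...
deployment["us-east-1a"]["healthy"] = False

survivors = healthy_instances(deployment)
print(survivors)  # the copy in zone us-east-1b keeps serving traffic
```

Because a second copy lives in a physically separate location, losing one zone never takes the whole service down.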

A Service that is MD/HA compliant

If you can do these three things, you will drastically decrease the number of outages caused by unforeseen and unexpected failures, saving you hours of debugging and resolving outages and giving you more time for development, testing, and design.
