Learning from Incidents - Andrew Hatch & Rolling Out Error Budgets - John Viner
The presentations by Andrew Hatch from Seek and
John Viner from ZenDesk both related to errors,
outages and system failures, so I have elected to group them together and combine my thoughts on this topic.
Due to the scale of both Seek and ZenDesk, Andrew and John have encountered a number of issues I have previously seen,
but they have had to deal with them as a priority due to the size and complexity of their technical and people structures.
In an Agile and DevOps environment, teams that build a product or feature are expected to own it. This is done with a
view to improving customer value and to help reduce the complexity for an individual to a sustainable level. But it
has a negative aspect as well. Like organisations that are siloed by technical skills or job function, the larger
organisation can become siloed by product; developers and teams don’t always know who their customer is as more
consumers throughout the organisation integrate with the product or feature they offer. This leads to the localisation
of technical knowledge, which in turn leads to the localisation of incident handling and site reliability knowledge.
During these two presentations John focussed on how to manage the risk associated with an outage, while Andrew focussed
on the impact of an outage. Using some of the advice they presented, I will walk through the process in roughly
chronological order relative to an outage occurring.
John talked about the implementation of targets, service level indicators (SLIs), service level objectives (SLOs) and
the resulting error budget. For context: a target is a simple measurement, perhaps the availability of a system,
the response time of a request, or the total processing time of an item. An SLI is the number of good events
divided by the total number of events. An SLO is the desired value of an SLI over time; that could be an uptime measure,
the percentage of requests within an acceptable target, or the percentage of items processed within the target. The
resulting error budget is then the inverse of the SLO. None of these should be confused with service level agreements
(SLAs), which are promises made to customers about the availability and/or speed of a system.
To provide a concrete example: if a target response time of 200ms is set, 10,000 requests are received, and 9,900
of these complete within 200ms, we can calculate our SLI as 9,900 divided by 10,000, giving us 99%. If the SLO is 98%
then we have exceeded our SLO. We can also calculate our error budget by subtracting our SLO from 100%, so in this case
we have an error budget of 2%. By looking at our SLI, we can see that half of the error budget has been consumed. For
clarity around SLAs, an SLA of 96% may be attached to this indicator; that means that should the service breach the
agreement, the customer will be compensated. The internal SLO will likely be higher than the customer-facing SLA, as
this provides a margin of error, and as an organisation we want to exceed expectations, not just meet them.
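The arithmetic above can be sketched in a few lines; this is my own illustration using the numbers from the example, not anything presented in the talks:

```python
# The SLI / error-budget arithmetic from the worked example above.

def sli(good_events: int, total_events: int) -> float:
    """SLI = good events / total events."""
    return good_events / total_events

def error_budget(slo: float) -> float:
    """The error budget is the inverse of the SLO."""
    return 1.0 - slo

def budget_consumed(sli_value: float, slo: float) -> float:
    """Fraction of the error budget consumed so far."""
    return (1.0 - sli_value) / error_budget(slo)

requests_total = 10_000
requests_within_target = 9_900  # responses under the 200ms target

current_sli = sli(requests_within_target, requests_total)
print(f"SLI: {current_sli:.2%}")                            # 99.00%
print(f"Error budget: {error_budget(0.98):.2%}")            # 2.00%
print(f"Budget consumed: {budget_consumed(current_sli, 0.98):.0%}")  # 50%
```

With a 98% SLO the budget is 2%, and a 99% SLI means 1% of requests were bad, so exactly half the budget is spent.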
When determining the SLIs, SLOs and error budgets, the system has to be viewed from a customer perspective. The items
that provide the most value for customers should have the most stringent SLIs and SLOs; for example an outage on an
authentication system that prevents people from accessing the service would have a much greater negative impact than
an outage on an asynchronous search index updater.
In the organisations I have worked at, almost all have had a central location for API specifications. This
enables developers (and sometimes customers) to easily find the available services and interfaces. Where such a system
exists, it would be a perfect place to publish the SLIs, SLOs and error budgets. Publishing these values
alongside the interface specifications helps to remind people that systems are fallible, errors will occur, and
any system that is developed should be capable of coping with errors; this leads to a resilient system architecture
and ultimately improves the reliability of all the components in a system.
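As a sketch of what publishing these values might look like, the service levels could be structured metadata that sits beside each interface definition; the endpoints, field names and numbers here are hypothetical, not from either presentation:

```python
# Hypothetical SLO metadata published alongside each interface definition.
# Higher-value interfaces carry tighter objectives, as discussed above.

service_levels = {
    "POST /v1/search": {
        "target": "99% of requests complete within 200ms",
        "sli": "requests_within_target / total_requests",
        "slo": 0.98,           # desired SLI over a rolling window
        "error_budget": 0.02,  # the inverse of the SLO
    },
    "POST /v1/index-update": {
        # an asynchronous, lower-impact path can carry a looser objective
        "target": "95% of items processed within 5 minutes",
        "sli": "items_within_target / total_items",
        "slo": 0.95,
        "error_budget": 0.05,
    },
}

for endpoint, levels in service_levels.items():
    print(f"{endpoint}: {levels['target']}")
```

Seeing the error budget next to the interface spec is a standing reminder to every consumer that the dependency is fallible and their own design should cope with that.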
So, why are we promoting acceptable outage levels? Simply because we know that it is impossible to have a system
that is 100% reliable. Couple this with our need to keep innovating and improving, and the constant change this
requires, and we increase the likelihood of something unexpected occurring. By implementing and promoting an
error budget we acknowledge that outages and service degradation are unavoidable, we set acceptable limits
on the outages, and we remove the fear that is associated with these issues. With the implementation of an error
budget we are also working to find an objective balance between the reliability inherent in a stable system and
the risk of losing reliability when changes are made to a system.
As a developer, even after more than 25 years of coding, I still get nervous when a feature is released to production;
it doesn’t matter how much testing has been performed, it doesn’t matter how many times my code has been reviewed, I
worry that if something goes wrong when it is deployed it could have an unintended consequence and cause an outage; by
accepting that this will sometimes occur and being prepared to take action to rectify the problem the fear is reduced.
The reduction in the fear of degradation or outage increases my confidence in implementing new features or making
changes with the goal of improving the system.
With the implementation of an error budget, and an acknowledgement that sometimes things will go wrong, I can focus on
how to reduce the mean time to recovery (MTTR), accepting that the mean time between failures (MTBF) is less important.
I know there will be a number of people who will be horrified by the thought that the MTBF is of little importance,
but in the typical SaaS organisation the length of continuous uptime provides little value to the customer. If we were
building aircraft, then the MTBF is a life-or-death matter (just see the Boeing 737 MAX issues that happened
recently), but most SaaS services aren't putting lives at risk. Customer value is delivered through the total usable
time of a system in any given period. If we have an MTBF of 2 years and an MTTR of 8 hours, then every 2
years there will be 1 business day where the system is unavailable; a situation like this will have a huge impact on
customers and will rapidly drive them away. But if we have an MTBF of 7 days and an MTTR of 5 minutes, we will have far
less impact on the customers' real and perceived value.
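A quick back-of-the-envelope comparison makes the point: the two scenarios have roughly the same overall availability, but very different failure experiences. The arithmetic below is my own illustration, not from the talks:

```python
# Compare the two MTBF/MTTR scenarios by steady-state availability,
# availability = MTBF / (MTBF + MTTR). All times in hours.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Scenario A: a full business day of downtime roughly every 2 years.
a = availability(mtbf_hours=2 * 365 * 24, mttr_hours=8)

# Scenario B: a 5-minute outage roughly every week.
b = availability(mtbf_hours=7 * 24, mttr_hours=5 / 60)

print(f"A: {a:.4%}")  # ~99.95% available, but each outage lasts 8 hours
print(f"B: {b:.4%}")  # ~99.95% available, but each outage lasts 5 minutes
```

Both scenarios sit near 99.95% availability, yet a customer who hits an 8-hour outage has a far worse experience than one who occasionally sees a 5-minute blip, which is why optimising MTTR pays off.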
So far, I haven’t touched on what happens when an incident occurs and some of the error budget is consumed. It’s
important to remember that having an error budget doesn’t negate the need to assess each service degradation or outage.
If anything, it is more important, because there is now an objective measure of what is acceptable and what is not, and
we must strive to stay within those limits. As part of using error budgets, I believe that every incident that
consumes some of the error budget should be assessed in a similar way to an incident when no error budgets exist. This
leads into aspects of the presentation by Andrew and how to ensure that incidents are a learning experience and benefit
the entire organisation.
When reviewing an incident, many organisations will focus on the people and what individuals could have done to prevent
the incident. If this focus is maintained people will become protective and will seek ways to avoid reporting
incidents. Instead of focussing on the people, an incident post-mortem needs to balance the technical causes with the
ways people can be better equipped to counteract those risks.
In many cases an incident isn’t the result of a single failure. If you’ve watched as many episodes of Air Crash
Investigations as I have, you will be familiar with most incidents being caused by a chain of events; the same is true
with IT systems. A well architected system will have resilience built in: it will be able to cope with isolated
failures, and it will take a number of concurrent failures for an incident to have a significant impact. To minimise
the prospect of concurrent failures, each isolated incident should be assessed in terms of impact, contributing
factors and potential ways to prevent the incident from recurring. In the case of a significant incident, the same
basic process and assessments apply, but across a much wider section of the product, potentially involving
systems that were not affected, in order to find what they had done differently to prevent impact or increase resilience.
When assessing an incident, focus should first be placed on supporting those who were impacted. Ensuring the health and
well-being of staff involved in resolving the incident will help them to be open about potential causes and will also
help them to be better prepared the next time an incident occurs.
During the assessment, Seek uses a technical staff member to facilitate the post-mortem and a product staff member
to act as a scribe; this allows input from both the technical and the customer perspective in a process that is often
tech focussed. During the assessment it is important to acknowledge that the complexity of the system will always
increase: adding resilience to a system adds additional layers of complexity and more places for errors to appear, and
adding features increases the amount of communication required between systems and deepens the dependency
tree. As complexity increases, unintended feedback loops will appear. By acknowledging these factors, and by working to
identify all the contributing factors in an incident, a holistic approach can be taken to find a solution. The
knowledge gained from the incident can then be shared throughout the organisation, so others are able to make allowances
for the contributing factors and can assess whether the solutions found are applicable to their features.
So far I’ve covered the benefits of error budgets and how to handle incidents and the consumption of error budgets, but
I still need to touch on what to do when an SLO is at risk (the error budget is almost consumed), and what to do when
an SLO is broken (the error budget is consumed).
When an error budget is at risk, the team will need to prioritise the resilience of the system over new feature
development. This doesn’t mean that feature development should stop, but it should be reduced to allow time for the
improvement of the existing system. If the error budget has been consumed due to unavailability of external services
then action must be taken to reduce the dependencies on these services; if it has been consumed due to issues within
the system then ways to increase the reliability should be found.
If the error budget is completely depleted, then the team must place even greater importance on improving the resilience
of the system. In this case it is likely that development of new features will cease for a short period of time (maybe
one or two sprints) to focus on getting back to an acceptable level of error.
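These two thresholds can be sketched as a simple policy check; the state names and the "at risk" cut-off below are my own illustration rather than anything prescribed in the talks:

```python
# A sketch of an error-budget policy: given the SLO and the observed
# good/total event counts for the window, report how much budget remains
# and what the team should prioritise. Thresholds are illustrative.

def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is broken)."""
    error_budget = 1.0 - slo
    errors_so_far = 1.0 - (good / total)
    return 1.0 - errors_so_far / error_budget

def policy(good: int, total: int, slo: float, at_risk_threshold: float = 0.2) -> str:
    remaining = budget_remaining(good, total, slo)
    if remaining <= 0:
        # SLO broken: pause feature work for a sprint or two, focus on resilience.
        return "depleted"
    if remaining <= at_risk_threshold:
        # SLO at risk: slow feature work, prioritise reliability improvements.
        return "at-risk"
    return "healthy"

print(policy(9_900, 10_000, slo=0.98))  # prints "healthy": half the budget remains
print(policy(9_830, 10_000, slo=0.98))  # prints "at-risk": only 15% of budget left
print(policy(9_790, 10_000, slo=0.98))  # prints "depleted": the SLO is broken
```

Wiring a check like this into reporting makes the trade-off visible before the budget runs out, rather than after.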
By implementing error budgets with appropriate measures and limits, ensuring that incidents are assessed and learnings
are both implemented and shared, acknowledging that external dependencies and internal functionality are fallible,
reducing the MTTR and by ensuring focus is maintained on delivering value, it is possible to reduce stress on staff,
increase system resilience and ultimately deliver better value to customers.