
Operational Risk

Rob Moffat edited this page Dec 12, 2018 · 56 revisions

For Review

"The risk of loss resulting from inadequate or failed internal processes, people and systems or from external events." - Operational Risk, Wikipedia

In this section we're going to start considering the realities of running software systems in the real world.

Here, we're going to set the scene by looking at what constitutes an Operational Risk, and then look at the related discipline of Operations Management. Following this background, we'll apply the Risk-First model and dive into the various mitigations for Operational Risk.

Operational Risks

When building software, it's tempting to take a very narrow view of the dependencies of a system, but Operational Risks are often caused by dependencies we don't consider - i.e. the Operational Context within which the system is operating. Here are some examples:

  • Staff Risks:

    • Freak weather conditions affecting the ability of staff to get to work, interrupting the development and support teams.
    • Reputational damage caused when staff are rude to the customers.
  • Reliability Risks:

    • A data-centre going off-line, causing your customers to lose access.
    • A power cut causing backups to fail.
    • Not having enough desks for everyone to sit at.
  • Process Risks:

    • Regulatory change, which means you have to adapt your business model.
    • Insufficient controls, which means you don't notice when some transactions are failing, leaving you out of pocket.
    • Data loss because of bugs introduced during an untested release.
  • Software Dependency Risk:

    • Hackers exploit weaknesses in a piece of 3rd party software, bringing your service down.
  • Agency Risk:

    • Suppliers deciding to stop supplying you with something you need.
    • Workers going on strike.
    • Employees trying to steal from the company (bad actors).
    • Other crime, such as hackers stealing data.

... basically, a long laundry-list of everything that can go wrong due to operating in "The Real World". Although we've spent a lot of time looking at the varieties of Dependency Risk on a software project, with Operational Risk we have to consider that these dependencies will fail in any number of unusual ways, and we can't be ready for all of them. Nevertheless, preparing for this comes under the umbrella of Operations Management.

Operations Management

A Risk-First Model of Operations Management, inspired by the work of Slack et al.

If we are designing a software system to "live" in the real world, we have to be mindful of the Operational Context we're working in, and craft our software and processes accordingly. This view of the "wider" system is the discipline of Operations Management.

The diagram above is a Risk-First interpretation of Slack et al.'s model of Operations Management. This model breaks down some of the key abstractions of the discipline: a Transform Process (the Operation itself) is embedded in the wider Operational Context, which supplies it with three key dependencies:

  • Resources: Whether transformed resources (like electricity or information, say) or transforming resources (like staff or equipment).
  • Customers: Which supply it with money in return for goods and services.
  • Operational Strategy: The goals and objectives of the operation, informed by the reality of the environment it operates in.

We have looked at processes like the Transform Process in the section on Process Risk. The healthy functioning of this process is the domain of Operations Management, and (as per Slack et al.) this involves the following types of actions:

  • Control: Ensuring that the Operation is working according to its targets. This includes day-to-day quality control and monitoring of the Transform Process.
  • Planning: This covers aspects such as capacity planning, forecasting and project planning. This is about making sure the Transform Process has targets to meet and the resources to meet them.
  • Design: Ensuring that the design of the product and the transform process itself fulfils an Operational Strategy.
  • Improvement: Improving the operation in response to changes in the Environment and the Operational Strategy, detecting failure and recovering from it.

Let's look at each of these actions in turn.

Control

Control, Monitoring And Detection

Since humans and machines have different areas of expertise, and because Operational Risks are often novel, it's often not optimal to try to automate everything. A good operation will consist of a mix of human and machine actors, each playing to their strengths (see the table below).

Humans Are...              | Machines Are...
Good at novel situations   | Good at repetitive situations
Good at adaptation         | Good at consistency
Expensive at scale         | Cheap at scale
Reacting and Anticipating  | Recording

The aim is to build a human-machine operational system that is Homeostatic. Homeostasis is the tendency of living things to maintain an equilibrium (for example, body temperature or blood glucose levels), but it also applies to systems at any scale. The key to homeostasis is to build systems with feedback loops, even though this leads to more complex systems overall. The diagram above shows some of the actions involved in these kinds of feedback loops.
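To make the idea of a homeostatic feedback loop concrete, here is a minimal sketch in Python. It assumes a hypothetical `WorkerPool` whose size we are allowed to change, and a measured `utilisation` figure supplied by some monitoring system; none of these names come from Slack et al. or from any particular tool. Each step measures the current state, compares it with a target and corrects the pool size, which is the same pattern as a thermostat.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkerPool:
    """Hypothetical pool of workers whose size the feedback loop may change."""
    workers: int
    min_workers: int = 1
    max_workers: int = 20


def desired_workers(pool: WorkerPool, utilisation: float, target: float = 0.5) -> int:
    """Return the pool size that should bring utilisation back to the target.

    `utilisation` is the measured fraction of capacity in use (assumed to
    come from monitoring); `target` is the equilibrium we want to hold.
    """
    if utilisation < 0:
        raise ValueError("utilisation must be non-negative")
    if target <= 0:
        raise ValueError("target must be positive")
    # Proportional correction: at twice the target load we need roughly
    # twice the workers; at half the target load, roughly half.
    ideal = pool.workers * utilisation / target
    # Round away floating-point noise, then round up so that we err on the
    # side of spare capacity.
    proposed = math.ceil(round(ideal, 6))
    # Clamp to the limits the operation has agreed in advance.
    return min(pool.max_workers, max(pool.min_workers, proposed))


def control_step(pool: WorkerPool, utilisation: float, target: float = 0.5) -> int:
    """One turn of the loop: measure (passed in), compare, correct."""
    pool.workers = desired_workers(pool, utilisation, target)
    return pool.workers
```

Calling `control_step` repeatedly with fresh measurements moves the pool towards the target and then leaves it alone once utilisation matches it. The clamping limits are where human judgement stays in the loop: the machine handles the repetitive correction, while people decide the bounds within which it may act.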

As we saw in Map and Territory Risk, it's very easy to fool yourself, especially around Key Performance Indicators (KPIs) and metrics. Good Operations Management is about going beyond this and looking for trouble. Large organisations have Audit functions precisely to guard against their own failing internal processes and Agency Risk. Audits could be around software tools, processes, practices, quality and so on. Practices such as Continuous Improvement and Total Quality Management also figure here.

The Operational Context

There are plenty of Hidden Risks in the environment within which the operation exists, and these change all the time in response to economic or political change. In order to manage a risk, you have to uncover it, so part of Operations Management is to look for trouble:

  • Environmental Scanning is all about trying to determine which changes in the environment are going to impact your operation. Here, we are trying to determine the level of Dependency Risk we face for external dependencies, such as suppliers, customers and markets. Tools like PEST are relevant here, as is
  • Penetration Testing is looking for security weaknesses within the operation. See OWASP for examples.
  • Vulnerability Management is keeping up-to-date with vulnerabilities in Software Dependencies.
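As an illustration of Vulnerability Management, the sketch below compares the versions of installed dependencies against a list of advisories. Everything here is an assumption made for the example: the package names are invented, the advisory format (package name mapped to the first fixed version) is hypothetical, and versions are assumed to be simple dotted integers. A real operation would draw this data from a vulnerability feed or a dedicated scanning tool.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.2.10' into (1, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def vulnerable_dependencies(
    installed: Dict[str, str], advisories: Dict[str, str]
) -> List[str]:
    """List installed packages that are older than the first fixed version.

    `installed` maps package name to installed version.
    `advisories` maps package name to the first version containing the fix
    (a hypothetical format chosen for this sketch).
    """
    flagged = []
    for name, version in sorted(installed.items()):
        fixed_in = advisories.get(name)
        if fixed_in is None:
            continue  # no known advisory for this package
        if parse_version(version) < parse_version(fixed_in):
            flagged.append(f"{name} {version} (fixed in {fixed_in})")
    return flagged
```

Comparing versions as tuples of integers rather than as strings matters here: as strings, "1.2.9" would sort after "1.2.10" and the vulnerable package would be missed.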

Planning

Forecasting and Planning Actions.

In order to control an operation, we need targets and plans to control against. For a system to run well, it needs to carefully manage unreliable dependencies, and ensure their safety and availability. In the example of humans, say, it's the difference between Hunter-Gathering (picking up food where we find it) and Agriculture (controlling the environment and the resources to grow crops).

As the diagram above shows, we can bring Forecasting and Planning to bear on dependency management, and this usually falls to the more human end of the operation.
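A simple example of forecasting applied to a dependency is estimating when a resource will run out. The sketch below fits a straight line to daily usage readings by least squares and projects when the line reaches capacity. The function names, the daily sampling interval and the assumption of linear growth are all ours, chosen for illustration; real capacity planning would need to allow for seasonality and sudden changes in demand.

```python
from typing import Optional, Sequence, Tuple


def fit_trend(usage: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (0, usage[0]), (1, usage[1]), ...

    Returns (slope, intercept), where slope is growth per day.
    """
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(usage: Sequence[float], capacity: float) -> Optional[float]:
    """Days from the latest observation until the trend line hits capacity.

    Returns None when usage is flat or shrinking, and 0.0 when the trend
    says capacity has already been reached.
    """
    slope, intercept = fit_trend(usage)
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    latest_day = len(usage) - 1
    return max(0.0, day_full - latest_day)
```

For instance, with readings of 100, 110, 120 and 130 units on consecutive days and a capacity of 200, the trend grows by 10 units a day and the forecast is seven days of headroom. A person can then decide whether that is enough time to order more capacity.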

Design

Design and Change Activities

While planning is a day-to-day operational feedback loop, design is a longer feedback loop that changes not just the parameters of the operation, but the operation itself.

You might think that for an IT operation, tasks like Design belong within the Development function within an organisation. Often, this is the case. However, separating design from operation implies Boundary Risk between these two functions. For example, the developers might employ different tools, equipment and processes to the operations team, resulting in a mismatch when software is delivered.

In recent years, the "DevOps" movement has brought this Boundary Risk into sharper focus. This specifically means:

  • Using code to automate previously manual Operations functions, like monitoring and releasing.
  • Involving Operations in the planning and design, so that the delivered software is optimised for the environment it runs in.
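As a small example of using code for a previously manual monitoring task, the sketch below tracks the failure rate over the most recent transactions and reports whether it is safe to proceed with a release. The class name, window size and threshold are hypothetical values chosen for illustration, not taken from any particular DevOps tool.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks the failure rate over a sliding window of recent outcomes."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # deque with maxlen discards the oldest outcome automatically.
        self._outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one transaction."""
        self._outcomes.append(succeeded)

    def failure_rate(self) -> float:
        """Fraction of outcomes in the window that failed (0.0 if empty)."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_alert(self) -> bool:
        """True when the failure rate exceeds the agreed threshold."""
        return self.failure_rate() > self.threshold


def safe_to_release(monitor: FailureRateMonitor) -> bool:
    """A release gate: only proceed while the operation looks healthy."""
    return not monitor.should_alert()
```

A deployment script could call `safe_to_release` before each stage of a rollout, so that the monitoring built by Development feeds directly into a decision that Operations would otherwise make by hand.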

DevOps Activities: Development and Operations activities overlap one another (Credit: Kharnagy, Wikipedia)

Since our operation exists in a world of risks like Red Queen Risk and Feature Drift Risk, we would expect our Forecasting and Planning activities to result in changes to our operation.

Improvement

Taking action against Operational Risk by Meeting Reality

Once exposed to the real world, no system is perfect. This means we must design in ways for the systems we build to improve and change. Since we don't have a perfect understanding of the world, most of the Operational Risk we face is Hidden Risk.

Reputational Risk

Our production systems are Meeting Reality all the time, and in order to mitigate Operational Risk we need to take as much advantage of this as possible. Conversely, Operational Risk includes Reputational Risk, which gives us pause: we don't want to destroy the goodwill created for our organisation, as this is very hard to rebuild.

So there is a tension between "you only get one chance to make a first impression" and "gilding the lily" (perfectionism). In the past I've seen this stated as:

"Pressure to ship vs pressure to improve"

A Risk-First re-framing of this might be the balance between:

  • The perceived Reputational Risk, Feature Risk and Operational Risk of going to production (pressure to improve).
  • The perceived Scarcity Risks (such as funding, time available, etc) of staying in development (pressure to ship).

Balance of Risks from Delivering Software

The "should we ship?" decision is therefore a complex one. In Meeting Reality, we discussed that it's better to do this "sooner, more frequently, in smaller chunks and with feedback". We can meet Operational Risk on our own terms by doing so:

Meet Reality...    | Techniques
Sooner             | Quality Control Processes, Limited Early-Access Programs, Beta Programs, Soft Launches, Business Continuity Testing
More Frequently    | Continuous Delivery, Sprints
In Smaller Chunks  | Modular Releases, Microservices, Feature Toggles, Trial Populations
With Feedback      | User Communities, Support Groups, Monitoring, Logging, Analytics
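Feature Toggles and Trial Populations from the table above can be combined in a few lines of code. The sketch below decides whether a given user sees a feature by hashing the feature name and user ID into one of 100 buckets. The function name and the bucketing scheme are assumptions made for this example; production toggle systems typically add targeting rules, overrides and audit trails.

```python
import hashlib


def in_trial_population(feature: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the trial for this feature.

    The decision is deterministic: the same user always gets the same
    answer for the same feature, so their experience does not flicker.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash feature and user together so that different features pick
    # different trial populations.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent
```

Because a user's bucket never changes, raising `percent` from 10 to 50 only adds users to the trial; nobody who already has the feature loses it. This lets us meet reality in smaller chunks, widening exposure as confidence grows and setting `percent` back to 0 if something goes wrong.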

In a way, we are now back where we started: identifying the Dependency Risk, Feature Risk and Complexity Risk that hinder our operation, and mitigating them through tasks like software development. Our safari of risk is finally complete. It's time to look back at what we've seen in Staging and Classifying.
