In This Issue
Winter Issue of The Bridge on Complex Unifiable Systems
December 15, 2020 Volume 50 Issue 4
The articles in this issue are a first step toward exploring the notion of unifiability, not merely as an engineering ethos but also as a broader cultural responsibility.

Managing Failure Risks

Friday, December 18, 2020

Author: Elisabeth Paté-Cornell

The complexity of engineered systems can be baffling, scary, and paralyzing. From fear of flying to fear of nuclear power plants, people have expressed their reluctance about technologies that are useful but, to varying degrees, risky.

Failure risks generally have to be managed under uncertainties, budget constraints, and other social complexities. Herein lies a paradox. It is sometimes stated that system complexity increases failure risk. That is not necessarily true: fly-by-wire aircraft are more complex and safer than classic turbojets. Redundancies generally increase both the complexity and the safety of a system, although their benefits can be limited if their failures are dependent.

Other questionable statements are that more information means less uncertainty and, as a corollary, that more uncertainty implies less knowledge. Neither is necessarily true: new information may reveal a previously undiscovered scenario under which the system has a higher probability of failure than was believed.

Many types of rare failures can be anticipated while acknowledging uncertainties. Invoking “black swans” after the fact is often a poor excuse for not doing so. Pandemics, earthquakes, and plane crashes will continue to occur, and safety measures need to be taken proactively instead of waiting for a disaster.

Bayesian probability allows combining uncertainties, both epistemic (lack of fundamental knowledge) and aleatory (randomness), and machine learning allows the failure risk to be updated automatically as new information arrives. Model-based systems engineering has been an accepted tool for decades, generally involving deterministic models. Uncertainties, however, need to be considered, such as those related to the loads on a system and to its capacity, in order to assess the chances that the former exceed the latter over the system's lifetime.
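
The load-versus-capacity assessment described above can be sketched as a small Monte Carlo simulation. All distributions and parameters below are made up for illustration: capacity is drawn once per trial (epistemic uncertainty about what the system can actually withstand), while yearly peak loads vary randomly (aleatory uncertainty).

```python
import random

random.seed(42)

def lifetime_failure_prob(n_sims=100_000, years=30):
    """Estimate the probability that the annual peak load exceeds
    capacity at least once over the system's lifetime.
    Distributions and parameters are illustrative, not from the article."""
    failures = 0
    for _ in range(n_sims):
        # Epistemic uncertainty: the true capacity is imperfectly known,
        # so it is sampled once and then held fixed for this trial
        capacity = random.gauss(100.0, 5.0)
        # Aleatory uncertainty: yearly peak loads fluctuate randomly
        if any(random.gauss(70.0, 12.0) > capacity for _ in range(years)):
            failures += 1
    return failures / n_sims

p = lifetime_failure_prob()
print(f"Estimated lifetime failure probability: {p:.3f}")
```

Note that the two kinds of uncertainty play different roles: sampling capacity per trial (rather than per year) makes the yearly exceedance events correlated, which a purely aleatory model would miss.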

Management negligence, inappropriate incentives, and poor information are among the main causes of operator errors. This was the case in the accidents that destroyed the Piper Alpha (1988) and Deepwater Horizon (2010) offshore oil platforms. Risk management thus starts at the top of the organization, and in hiring, training, and rewarding people.

Complete risk management involves linking management decisions (M) (e.g., setting incentives), operator actions (A) (e.g., response to signals of potential problems), and system safety (S) based on the performance of critical subsystems (a model called SAM).
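
The SAM structure can be illustrated by chaining conditional probabilities: a management decision (M) shifts the distribution of operator actions (A), which in turn shifts the probability of a system failure (S). The numbers below are hypothetical, chosen only to show the marginalization.

```python
# Hypothetical SAM-style numbers, not from the article.
# P(operator responds correctly to a warning | management incentive policy)
p_good_response = {"safety_rewarded": 0.95, "production_only": 0.70}

# P(system failure | quality of the operator's response)
p_failure = {"correct": 0.001, "incorrect": 0.02}

def system_failure_prob(policy):
    """Marginalize over operator actions given a management decision."""
    p_ok = p_good_response[policy]
    return p_ok * p_failure["correct"] + (1 - p_ok) * p_failure["incorrect"]

for policy in p_good_response:
    print(policy, round(system_failure_prob(policy), 5))
```

Even this toy version shows the point of the model: the failure probability of the technical system cannot be assessed in isolation from the incentives set at the top.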

Moreover, risk analysis need not be more complex than the situation requires. Some risk management measures, such as maintaining the brakes of a car, do not require a formal analysis. Beyond common sense, one can observe, in practice, several levels of complexity in implicit or explicit risk assessment, from a simple identification of the worst case to central values of the loss distribution and, finally, a full analysis based on scenario probabilities and outcomes. In all cases, the aim is to ensure that extreme values, if they are significant, are properly accounted for.

But risks may change over time. Modeling the dynamics of failure risk may be essential when systems, procedures, risk attitudes, or information are changing. Analyses relying solely on past experience may then be simply wrong. In that case, although statistical information may have become irrelevant, it is often tempting to stick with it because it looks more “objective.” But it is not relevant if elements of the system or its environment have changed, or if one has received new information.
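
The danger of clinging to stale statistics can be made concrete with a conjugate Beta-Bernoulli update, the standard Bayesian mechanism for revising a failure probability as evidence arrives. The counts below are invented for illustration.

```python
# Bayesian update of a failure probability as new evidence arrives,
# illustrating why a static historical estimate can become stale.
# All counts are made up for illustration.

def beta_update(alpha, beta, failures, successes):
    """Conjugate Beta-Bernoulli update; returns the posterior mean."""
    a, b = alpha + failures, beta + successes
    return a / (a + b)

# Long historical record: 2 failures in 1,000 trials
historical_mean = beta_update(1, 1, 2, 998)

# After a design or procedural change, only the new regime's data apply:
# 3 failures in the first 50 trials shift the estimate sharply upward
updated_mean = beta_update(1, 1, 3, 47)

print(f"historical: {historical_mean:.4f}, updated: {updated_mean:.4f}")
```

The "objective-looking" historical figure is the wrong input once the system has changed; the smaller but relevant sample dominates the assessment.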

If one is considering long-term risk management, the analysis must include the dynamics of a decision sequence and the possible outcomes of the various options.

In risk communication, warning systems can serve as powerful tools, provided that both their false positives and their false negatives are understood. Near misses and other important signals are sometimes dismissed at the operational level because, although precursors occurred, the accident did not actually happen.
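
Why false positives matter so much can be shown with a short Bayes' rule calculation. With made-up rates, even a sensitive detector of a rare event produces mostly false alarms, which helps explain why operators learn to dismiss warnings.

```python
# Bayes' rule for a warning system; all rates are illustrative.

def prob_event_given_alarm(base_rate, detection_rate, false_alarm_rate):
    """P(real event | alarm) = P(alarm | event) P(event) / P(alarm)."""
    p_alarm = (detection_rate * base_rate
               + false_alarm_rate * (1 - base_rate))
    return detection_rate * base_rate / p_alarm

# Rare precursor: 1-in-1,000 base rate, 99% detection, 5% false alarms
ppv = prob_event_given_alarm(0.001, 0.99, 0.05)
print(f"P(event | alarm) = {ppv:.3f}")
```

Under these assumed rates only about 2% of alarms correspond to real events, so filtering and escalation procedures, not detector sensitivity alone, determine whether warnings reach decision makers with credibility.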

At the management level, the structure and procedures of the organization must ensure that warnings reach the right decision maker. Obviously, not all signals should be transmitted to the top, but the filters in place should be designed to recognize the importance of messages even if they include uncertainties. For instance, in 2001 the FBI in Phoenix had received signals that individuals were taking unusual flying lessons, but the information was not acted upon.[1]

In flat organizations, information may circulate easily and decisions may be widely understood, thus facilitating risk management, but managing a program that requires a number of systems and organizations can be particularly complex. Space programs such as Apollo or Artemis have involved a large number of contractors, interfaces, techniques, assumptions, and risk tolerance levels. It is the role of management to ensure that these programs interact effectively to gather and share information and warnings in order to ensure consistency, compatibility, and safety.

Compatibility and consistency are key to managing the failure risk of complex systems and programs. Risk analysis does not yield predictions but rather the chances that a failure may occur and the effectiveness of various safety measures.

When assessing a complex system, a key issue is the model formulation. This may require simplifying the system’s representation to make the analysis manageable. For example, the heat shield of space shuttle orbiters involved about 25,000 different tiles, and the formulation of the risk analysis model required grouping them in zones with similar values of key parameters.
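
The grouping step described above can be sketched as binning components by their key parameters. The tiles, parameters, and binning scheme below are hypothetical stand-ins; the point is only that discretizing a few driving variables collapses thousands of components into a tractable number of zones.

```python
# Model-simplification sketch: group many components into zones with
# similar values of key parameters. All data here are illustrative.
from collections import defaultdict

tiles = [
    {"id": i, "debris_exposure": (i % 5) / 4, "criticality": (i % 3) / 2}
    for i in range(25)  # small stand-in for a much larger inventory
]

def zone_key(tile, bins=2):
    """Discretize the key parameters into coarse bins."""
    return (round(tile["debris_exposure"] * bins) / bins,
            round(tile["criticality"] * bins) / bins)

zones = defaultdict(list)
for t in tiles:
    zones[zone_key(t)].append(t["id"])

print(f"{len(tiles)} tiles grouped into {len(zones)} zones")
```

The granularity of the bins is itself a modeling decision: coarser zones make the analysis cheaper but blur differences among components, which is exactly the trade-off the formulation step must manage.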

A challenge for unifying systems is to find better ways to communicate the risk assessment results and the uncertainties, rather than presenting the most likely hypothesis as if one could be sure of it. For the risk message to be effective, one should avoid large numbers of complex scenarios.

In the end, one needs to check that the results fit common sense and if not, determine whether it is the model or the intuition that needs to be reassessed.


[1] Final Report of the Senate Select Committee on Intelligence and the House Permanent Select Committee on Intelligence Joint Inquiry into the Terrorist Attacks of September 11, 2001, Part 1, section 5e, Dec 10, 2002, pp. 325–35 (documents/CRPT-107srpt351-5.pdf).

About the Author: Elisabeth Paté-Cornell (NAE) is the Burt and Deedee McMurtry Professor in the School of Engineering and professor and founding chair of the Department of Management Science and Engineering at Stanford University.