In This Issue
Summer Bridge Issue on Aeronautics
June 26, 2020 Volume 50 Issue 2
The articles in this issue present the scope of progress and possibility in modern aviation. Challenges are being addressed through innovative developments that will support and enhance air travel in the decades to come.

Embracing the Risk Sciences to Enhance Air Travel Safety

Thursday, June 25, 2020

Author: B. John Garrick and Ali Mosleh

Increased dependence on autonomous systems is a significant driver for advocating more rigorous proactive risk analyses (Ramos et al. 2019) of the safety of air travel. Modern aircraft alert systems represent a target area for quantitative risk assessments (QRAs). Examples of such systems are the traffic alert and collision avoidance system (TCAS), wind shear warning system, enhanced ground proximity warning system, and maneuvering characteristics augmentation system (MCAS). This paper highlights the value added by applying rigorous and quantitative methods from the risk sciences to assess and enhance the safety and performance of air travel.[1] The scope is limited to methods of analysis and their value added.

Introduction

In industries such as nuclear power where QRA methods have matured over several decades, the methods have provided major economic benefits in decision making on plant design, operations, and maintenance (Garrick 2014). In fact, the economic advantages of QRA became apparent very early (PLG et al. 1981, 1982), resulting in savings of hundreds of millions of 1982 dollars to the plant owners by demonstrating compliance with safety regulations without the major modifications recommended by intervenors.[2]

A second early example of QRA involved the Beznau nuclear plant in Switzerland. A QRA (PLG 1989) provided compelling evidence to the authorities that a two-train safety system could be reduced to a single train with little to no impact on risk. The savings to the plant owner were in excess of 100 million 1988 dollars.

To illustrate the application of QRA methods to enhance the safety of air travel we focus on the risks of TCAS failure and pilot error.

The QRA Framework and Methods of Analysis

Complex systems, whether natural or human designed, pose formidable predictability challenges in terms of understanding and quantifying risks. These challenges have two main dimensions: (1) complexity in terms of topological, functional, and behavioral features and (2) limitations in data and knowledge needed to understand the complexity.

In specific applications the focus is on identifying the scenarios[3] that lead to extremely rare but highly significant system states (e.g., unanticipated catastrophic failures). Such scenarios are often at or outside the boundaries of scientific and engineering knowledge and also are easily masked by model abstractions and solution techniques.

Over the past several decades QRA (also known as probabilistic risk assessment, PRA; Garrick 2008) has offered a proactive way to think about and analyze the safety and performance of complex systems. The primary techniques include deductive and inductive logic models (e.g., event trees and fault trees) for modeling sequences of events (i.e., risk scenarios) and their contributing subevents.

The 1975 Reactor Safety Study (USNRC 1975) was the first large-scale application of such methods. It developed and used a combination of fault tree and event tree modeling techniques to not only identify accident conditions and their consequences but also evaluate their probabilities and the associated uncertainties. Analysis of scenarios individually and in the aggregate makes it possible to identify safety vulnerabilities, rank contributors to risk by importance, quantify uncertainties, and make risk-informed decisions.

QRA techniques have gone through many upgrades and been used to assess the safety of other complex industrial installations and systems such as chemical process plants (Spouge 1999), space missions (Stamatelatos and Dezfuli 2011), and civil aviation (Mosleh et al. 2007). The aerospace industry is employing QRA techniques selectively (Kuchar et al. 2004), but their use has not advanced to becoming a basic part of the industry safety culture.

Airline safety has benefited greatly from comprehensive reactive accident investigations. As the industry transitions to increased dependence on autonomous systems, advanced QRA methods can provide near-term added value to the safety and economic performance of air travel. The result is added clarity about critical interactions between system elements (hardware, software, and human) and better accounting for nonlinearities.

The Triplet Definition of Risk

The framework for quantifying risks that has been very successful in many industries and adopted by some regulatory agencies is the “triplet definition of risk” (Kaplan and Garrick 1981). This definition is founded on the principle that when one asks “what is the risk of something?” one is really asking three questions:

  • What can go wrong?
  • How likely is it to go wrong?
  • What are the consequences?

The first question is answered by a set of scenarios, the second by the available evidence accounting for the uncertainties, and the third by the various end states of the scenarios.

Formally, this is written as a complete set of triplets:

R = {<S_i, L_i, X_i>}_c,

where R denotes the risk attendant to the system or activity of interest, S_i denotes risk scenario i, L_i denotes the likelihood of that scenario, and X_i denotes the consequences or damage level of the scenario. The angle brackets < > enclose the triplet, the curly brackets mean “a set of,” and the subscript c denotes complete, meaning that all of the important scenarios are included in the set. This notion of risk can be generalized to any metric of performance of the system being analyzed.
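
To make the notation concrete, the following minimal Python sketch represents a deliberately tiny and entirely hypothetical set of triplets and computes one simple aggregate, the frequency of exceeding a given damage level; the scenario names, likelihoods, and consequence values are invented for illustration and are not taken from any assessment.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One element <S_i, L_i, X_i> of the risk set."""
    scenario: str       # S_i: what can go wrong
    likelihood: float   # L_i: annual frequency (illustrative units)
    consequence: float  # X_i: damage level on a chosen scale

# A tiny, hypothetical "complete" set of triplets.
risk = [
    Triplet("TCAS fails to issue RA",             likelihood=1e-5, consequence=10.0),
    Triplet("Pilot responds late to RA",          likelihood=3e-4, consequence=4.0),
    Triplet("Pilot follows conflicting ATC call", likelihood=5e-5, consequence=9.0),
]

# One simple way to aggregate: frequency of exceeding a damage level x.
def exceedance_frequency(triplets, x):
    return sum(t.likelihood for t in triplets if t.consequence >= x)

print(exceedance_frequency(risk, 5.0))  # combined frequency of the more severe scenarios
```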

The rigorous risk assessment methods proposed for alert systems are the triplet definition of risk in conjunction with the theory of scenarios, the hybrid causal logic (HCL) method, and methods of human reliability analysis (cognitive and behavioral sciences, simulation, experimental validation). A key assessment requirement is a characterization of the pilot response to resolution advisories (RAs).

The HCL method (Groth et al. 2010) is a multilayered structure that integrates event sequence diagrams (ESDs), fault trees, and Bayesian networks (BNs). This hybrid modeling capability allows appropriate modeling techniques for different aspects of the system or process and their environment.

In the HCL approach, risk scenarios are modeled in the first layer using ESDs. Fault trees are used in the second layer to model the response behavior of the physical system as possible causes or contributing factors to the events delineated by the ESDs. The BNs in the third layer extend the causal chain of events to potential human, organizational, and process roots, and probability of failure models for equipment. Other layers of analysis are added as appropriate. For the TCAS example, an additional layer models the interactions between air traffic control (ATC), the TCAS, and the pilot. The results are used to aid risk management activities.

The QRA process generally involves the following:

  1. Define the system being analyzed in terms of what constitutes successful operation.
  2. Structure and process scenarios for both “success” and “what can go wrong.”
  3. Quantify the scenarios while characterizing their uncertainties.
  4. Assemble and integrate the scenarios into definitive risk metrics.
  5. Interpret the results to aid the risk management process.

The following example emphasizes steps 1 and 2, which capture the framework of QRA and are foundational to the computational steps 3 and 4. The critical TCAS components of detection, evaluation and response, and pilot execution are represented in the logic models developed for steps 1 and 2. The goal is to perform the types of analyses on alert systems that could reduce the risks of disasters such as those currently attributed to faults in the Boeing 737 MAX MCAS (Baker 2019).

Illustrative Example: TCAS

This example uses the following assumptions:

  1. The intruding aircraft’s TCAS and pilot are performing according to specifications and procedures.
  2. This scenario is taking place in midair at cruising altitude (35,000 feet). There are no geographical or meteorological factors in play.
  3. No other aircraft are in close vicinity of the two under consideration.
  4. The only entities under fault consideration are the TCAS, the pilot, and the aircraft.

Define the System

The TCAS mitigates collision risk by surveilling and tracking nearby air traffic and issuing avoidance instructions (RAs and traffic advisories) to pilots when a threat is determined.

A Mode-S transponder regularly broadcasts location information over a particular range. When the signal is received, the TCAS logic unit processes the information, which contains data about the other aircraft, typically the heading (direction), speed, altitude, position, and the aircraft’s unique identification code. The logic unit then begins a series of “interrogations,” using the transponder, directed at the other aircraft’s unique address. Thus both aircraft have information about each other. There are three TCAS detection regions for an intruder aircraft—the caution area, warning area, and collision area, as illustrated in figure 1.

Figure 1 

When the intruder aircraft enters the RA zone, the TCAS increases the frequency of interrogation and issues RAs to both pilots in the form of a red dot on the display and an audio announcement of commands directing the pilot to execute a standard maneuver. For example, the command “climb, climb” directs the pilot to climb at a rate of 1,500 to 2,000 feet per minute. The TCASs of both aircraft now issue RA commands and inform each other to synchronize decision making.
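
The actual alerting logic is specified in the TCAS standard (FAA 2011) and depends on range, closure rate, and altitude; the sketch below is only a schematic of the idea of escalating advisories as an intruder crosses the caution and warning regions, with an invented function and invented thresholds.

```python
# Illustrative only: real TCAS logic (FAA 2011) uses range, closure rate, and
# altitude-dependent criteria; the thresholds and structure here are invented.

def advisory(range_nm: float, closing: bool) -> str:
    """Map an intruder's position to a notional advisory level."""
    if not closing:
        return "no advisory"
    if range_nm < 2.0:   # hypothetical warning-area (RA) threshold
        return "resolution advisory: climb, climb"
    if range_nm < 6.0:   # hypothetical caution-area (traffic advisory) threshold
        return "traffic advisory"
    return "monitor"

print(advisory(5.0, closing=True))   # traffic advisory
print(advisory(1.5, closing=True))   # resolution advisory: climb, climb
```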

Structure and Process Scenarios for Success and Failure

Success and failure states of the system are defined using logic tools such as an event sequence diagram. An ESD for the TCAS is illustrated in figure 2.

Figure 2 

From the moment two planes enter each other’s RA zone, the ESD describes several scenarios. Ideally, the TCAS provides the optimal solution; a successful end state (denoted “safe” in the figure) is where both planes maneuver to safety. Multiple failure end states are represented by different colors. Dotted lines indicate extensions in the sequence.
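
One way to see what an ESD encodes is to enumerate its paths. In the Python sketch below the three binary pivotal events and their probabilities are invented; a real ESD has more branch points, several distinct failure end states, and prunes branches once an end state is reached.

```python
from itertools import product

# Hypothetical pivotal events after the initiating event "two aircraft enter
# each other's RA zone"; event names and probabilities are invented.
events = [
    ("TCAS issues correct RA",     0.999),
    ("Pilot follows RA (not ATC)", 0.95),
    ("Maneuver executed in time",  0.99),
]

def enumerate_scenarios():
    """Yield (branch outcomes, path probability, end state) for every path."""
    for outcomes in product([True, False], repeat=len(events)):
        p = 1.0
        for (_name, p_success), ok in zip(events, outcomes):
            p *= p_success if ok else (1.0 - p_success)
        # Simplification: any failed pivotal event leads to the same end state.
        end_state = "safe" if all(outcomes) else "collision risk"
        yield outcomes, p, end_state

for path, p, end in enumerate_scenarios():
    print(path, f"{p:.2e}", end)
```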

An example scenario based on the ESD is described by the yellow lines in figure 2. The scenario breaks down the events that led to the 2002 Überlingen midair collision.[4] Starting at the initiating event, the intruder aircraft (DHL flight 611, a Boeing 757 cargo jet) enters the RA threshold zone of Bashkirian Airlines flight 2937 (a Tupolev Tu-154 passenger jet). TCAS computes the correct steps for safety and sends the information to the pilot of the Tu-154, but ATC relays the wrong instructions to the same pilot. The pilot thus has conflicting instructions from the TCAS and ATC, forcing him to choose between the two advisories with limited visual information to evaluate them. The pilot decides to follow the ATC instructions and maneuvers the aircraft accordingly. Since the other pilot is following TCAS (as in the scenario and assumptions), both aircraft execute descent maneuvers, resulting in a midair collision.

Causes of TCAS failure fall into two categories: failures of the TCAS itself and errors in the pilot response. TCAS failures may involve sensors, transponder-interrogating systems, ground-based primary and secondary surveillance radars, and automatic dependent surveillance–broadcast (ADS-B). Airborne transponder-interrogating systems provide air-to-air surveillance of other transponder-equipped aircraft. Alternatively, ground-based radars can provide surveillance information via a digital datalink. ADS-B relies on aircraft self-reporting their position as determined by GPS or some other navigation system.

Pilot errors include response error, failure to understand the alert, failure to assign priority to an alert, failure to select the appropriate action, and failure to comply with the alert.

In addition to the three layers of the HCL model, another layer models the interactions between the system (all hardware and software) and the pilot, using the concurrent task analysis (CoTA) method (Ramos et al. 2020). This is the first application of the CoTA method in the context of aviation safety.

The CoTA, developed from the system’s ESD, is a success-oriented model that allows for a more detailed understanding of the tasks that must be accomplished for the ESD events to take place. It translates the events from the ESD to specific tasks in three categories: TCAS, pilot, and aircraft. Logically all three sets are part of the same CoTA scenario. The development of the scenario-specific CoTA starts with identification of the events involved in the selected ESD path. The analyst then highlights the human task actions that belong to that sequence.
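
The CoTA method itself is defined in Ramos et al. (2020); the sketch below, with invented task names, only illustrates the bookkeeping idea behind it: parallel TCAS, pilot, and aircraft task sets linked through interface nodes, from which one can read off which tasks may proceed concurrently.

```python
# Invented task names; illustrates only the idea of parallel task sets
# (TCAS, pilot, aircraft) linked by interface nodes, not the CoTA method itself.

tasks = {
    "tcas.collect_surveillance": {"actor": "TCAS",     "needs": []},
    "tcas.issue_ra":             {"actor": "TCAS",     "needs": ["tcas.collect_surveillance"]},
    "pilot.respond_to_alert":    {"actor": "pilot",    "needs": ["tcas.issue_ra"]},        # interface node
    "pilot.decide_maneuver":     {"actor": "pilot",    "needs": ["pilot.respond_to_alert"]},
    "aircraft.execute_climb":    {"actor": "aircraft", "needs": ["pilot.decide_maneuver"]}, # interface node
}

def ready(done):
    """Tasks whose prerequisites (including cross-actor interfaces) are complete."""
    return [t for t, spec in tasks.items()
            if t not in done and all(n in done for n in spec["needs"])]

done = set()
while len(done) < len(tasks):
    batch = ready(done)          # tasks that can run concurrently in this step
    print("concurrent step:", batch)
    done.update(batch)
```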

Figure 3 

Figure 3 shows a TCAS task using CoTA modeling. The TCAS collects information and communicates with the pilot in parallel with all other tasks. The TCAS tasks do not show up on the ESD as they are parallel, nonsequential tasks that appear only in CoTAs. Because some pilot actions (such as response to an alarm) depend on TCAS communication and data collection tasks, outgoing interface nodes for human task actions are linked to these tasks.

Figure 4 

The pilot CoTA in figure 4 shows one of the pilot tasks after a TCAS alert. The pilot must respond to the alert, process available information, and make a decision about which actions to take, all while constantly monitoring the situation using information from TCAS, ATC, and visual senses.

Figure 5 

Fault trees can be linked to CoTA events to show failure in a single task. Interfaces between tasks can be added as extensions to the fault trees. As an example, failure of the pilot task “Resolve incorrect or inconsistent info” is modeled using the Phoenix method (Ekanem et al. 2016) in the fault tree of figure 5. Phoenix is a human reliability analysis method that identifies and quantifies the probability of possible human errors in interactions with complex systems. Errors are analyzed for three distinct phases of human-system interactions: information gathering, situation assessment/decision making, and action execution. Since failure in task “Resolve incorrect or inconsistent info” represents a decision-making failure, it is broken down into a failure either to assess the situation or to decide on an action. In the Überlingen case, the failure was due to the pilot’s incorrect conclusion.
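
Once basic-event probabilities are available, the OR decomposition in figure 5 is evaluated directly. The Python sketch below uses invented probabilities and assumes the two contributing failures are independent.

```python
# Invented probabilities; shows only how an OR-gate decomposition such as the
# one in figure 5 is evaluated, assuming independent basic events.

def p_or(*ps):
    """Probability that at least one of several independent events occurs."""
    q = 1.0
    for p in ps:
        q *= 1.0 - p
    return 1.0 - q

p_bad_situation_assessment = 1e-2   # hypothetical basic-event probability
p_bad_action_decision      = 5e-3   # hypothetical basic-event probability

p_resolve_failure = p_or(p_bad_situation_assessment, p_bad_action_decision)
print(f"P(fail to resolve incorrect/inconsistent info) = {p_resolve_failure:.3e}")
```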

Performance-influencing factors (PIFs) that influence or cause the human failure events (HFEs) identified in the human response fault tree can be modeled using BNs; figure 5 includes such a BN extension of the fault tree for the HFE “inappropriate conclusion” (the yellow path). These factors (e.g., knowledge/abilities, resources, stress, bias, time constraint) can each play a role in the pilot’s misdiagnosis. For example, in the 2002 Überlingen midair collision, TCASs were new and not well established. This would create a strong bias in favor of the information provided by ATC.
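
A minimal numerical illustration of such a BN fragment follows; the two PIFs, their probabilities, and the conditional probability table are invented, and the Phoenix model's actual PIF set and quantification differ.

```python
# Invented two-PIF fragment of a Bayesian network; the real Phoenix PIF set and
# quantification differ. Shows how PIF states shift the HFE probability.

# P(HFE "inappropriate conclusion" | bias toward ATC, time pressure) -- made up.
cpt = {
    (False, False): 0.001,
    (False, True):  0.01,
    (True,  False): 0.05,
    (True,  True):  0.20,
}

p_bias = 0.7           # hypothetical: TCAS new and not well established
p_time_pressure = 0.9  # hypothetical: only seconds available to act

# Marginalize over the PIF states to get the unconditional HFE probability.
p_hfe = sum(
    cpt[(bias, rushed)]
    * (p_bias if bias else 1 - p_bias)
    * (p_time_pressure if rushed else 1 - p_time_pressure)
    for bias in (False, True)
    for rushed in (False, True)
)
print(f"P(inappropriate conclusion) = {p_hfe:.3f}")
```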

The BN extension in figure 5 is taken from the Phoenix human reliability model but not all PIFs are included. The justification for including or removing individual PIFs is given in table 1 based on how each PIF would affect the pilot’s decision making.

Table 1 

System failures are also modeled through corresponding hardware fault trees. Once the full model is constructed using the steps illustrated above, it needs to be processed using mathematical and computational procedures to identify risk scenarios and contributing factors. For example, the highlighted (yellow) path in figure 2 and contributing events highlighted in figure 5 form a scenario.

Quantify Scenarios and Characterize Their Uncertainties

The probability, p(S), of any given risk scenario is the product of the conditional probabilities of its constituent events (E_0, E_1, …, E_N), each conditioned on the events that precede it:

p(S) = P(E_0) P(E_1 | E_0) P(E_2 | E_0, E_1) … P(E_N | E_0, E_1, …, E_(N-1)).

The failure probabilities (hardware, software, or human) are estimated based on all available evidence. Advanced quantitative risk assessments use Bayesian inference methods for estimation of the risk model parameters including probabilities and rates of events. The main reason is that failures and accidents in highly reliable and safe systems are rare, forcing the analyst to use partially relevant evidence, generic data, or expert subject-matter knowledge.

Bayes’ theorem is

P(x | E) = P(E | x) P(x) / P(E),

where x is the event or quantity of interest, P(x) the initial or prior probability of x, P(x | E) the posterior or updated probability of x given any available data or evidence E, P(E | x) the likelihood of the evidence given an assumed value of x, and P(E) the total probability of the evidence.
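
A standard worked example of such an update (with illustrative numbers, not from the paper) is the conjugate Beta-binomial model for a per-demand failure probability, in which a prior shaped by generic data or expert judgment is combined with sparse operating evidence.

```python
# Conjugate Beta-binomial update, a common QRA parameter-estimation pattern.
# Prior parameters and observed evidence below are illustrative only.
from scipy import stats

# Prior belief about a per-demand failure probability x: Beta(a, b),
# here with prior mean a / (a + b) = 0.005.
a, b = 0.5, 99.5

# Sparse evidence E: 1 failure observed in 500 demands.
failures, demands = 1, 500

# Posterior P(x | E) is Beta(a + failures, b + demands - failures).
posterior = stats.beta(a + failures, b + demands - failures)
print("posterior mean:", posterior.mean())
print("90% credible interval:",
      posterior.ppf(0.05), "to", posterior.ppf(0.95))
```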

Essential to any credible risk analysis is a comprehensive characterization and quantification of uncertainties: aleatory, epistemic, parametric, or modeling.[5] Advanced QRA methods allow these uncertainties to be identified and quantified during modeling and quantification of the risk scenarios.

Assemble and Integrate Scenarios in Interpretable Risk Metrics

Before performing a risk assessment, a decision has to be made about the consequences to be quantified and the desired form of the risk metric. Typical consequences are the failure of a system to perform its intended function, physical damage, injuries, fatalities, or a combination of these. Typical risk metrics are probability, frequency, or a combination such as probability of frequency; the frequencies are represented by probability distributions to account for their uncertainties.

Figure 6 

For complex systems the number of scenarios could be in the millions, hence the need to assemble the results in a form that captures the full story for interpretation. Frequency of exceedance curves of the type shown in figure 6 can be developed for each consequence of interest. Each curve is constructed by ordering the scenarios by increasing levels of damage and cumulating the probabilities from the bottom up, with probability P represented as a family of curves. Plotted on log-log scales, figure 6 illustrates that there is a probability P_3 that a consequence of X_1 or greater has an annual frequency of F_1. The family of curves also facilitates presentation of the results in terms of confidence intervals if desired.
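
The construction just described takes only a few lines; in the Python sketch below the scenario frequencies and damage levels are invented, and only the mean curve is computed (characterizing uncertainty would add the family of curves).

```python
# Invented scenario results; shows the cumulation that produces a (mean)
# frequency-of-exceedance curve.

scenarios = [  # (annual frequency, damage level X), both hypothetical
    (1e-3, 1.0),
    (4e-4, 2.5),
    (1e-5, 8.0),
    (2e-6, 20.0),
]

# Order by increasing damage, then cumulate frequencies from the bottom up:
# F(X) = total frequency of scenarios with damage >= X.
ordered = sorted(scenarios, key=lambda s: s[1])
curve, cumulative = [], 0.0
for freq, damage in reversed(ordered):
    cumulative += freq
    curve.append((damage, cumulative))

for damage, freq in reversed(curve):
    print(f"frequency of exceeding damage {damage:>5}: {freq:.2e}")
```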

Risk metrics provide insights and guidance not only on design, operational, and maintenance measures for reducing risk but also on training and qualification of personnel such as pilots and on recovery from life-threatening events.

Concluding Remarks

The above modeling and process steps illustrate that comprehensive and rigorous quantitative risk models are possible for safety-critical aircraft systems and can contribute to air travel safety. The proactive methods involve the processing of direct and indirect evidence, accounting for system hardware and software logic and dynamics, human performance analysis, and the ability to simulate large numbers of concurrent tasks. Typical output is a ranked list of risk scenarios and contributors by probability and consequence.

The methodology used in this paper to illustrate the essential steps of QRA of complex systems, while quite advanced and powerful in providing risk insights, may still be inadequate to capture the complexity of some systems. Advanced simulation-based methods (Mosleh 2019) have proven to be particularly effective for systems with control loops and complex interactions between elements—hardware, software, or human; they provide a natural probabilistic environment with physical models of system behavior (e.g., coupled processes), mechanistic models of materials or hardware systems to predict failure, and models of natural hazards. Simulation-based methods are expected to play a critical role in the assessment of autonomous systems (e.g., aircraft, ocean vessels, and ground vehicles) with humans transitioning toward a monitoring and recovery role.

Acknowledgments

The authors acknowledge significant contributions of UCLA Garrick Institute research scientist Marilia Ramos and graduate students Karthik Sankaran and Theresa Stewart in developing the example application.

References

Baker M. 2019. The inside story of MCAS: How Boeing’s 737 MAX system gained power and lost safeguards. Seattle Times, Jun 22 (updated Jun 24).

Ekanem N, Mosleh A, Shen SH. 2016. Phoenix—A model-based human reliability analysis methodology: Qualitative analysis procedure. Reliability Engineering & System Safety 145:301–15.

FAA [Federal Aviation Administration]. 2011. Introduction to TCAS II Version 7.1. Washington.

Garrick BJ. 2008. Quantifying and Controlling Catastrophic Risks. Amsterdam: Elsevier.

Garrick BJ. 2014. PRA-based risk management: History and perspectives. Nuclear News 57(8):48–53.

Groth K, Wang C, Mosleh A. 2010. Hybrid causal methodology and software platform for probabilistic risk assessment and safety monitoring of socio-technical systems. Reliability Engineering & System Safety 95:1276–85.

Kaplan S, Garrick BJ. 1981. On the quantitative definition of risk. Risk Analysis 1(1):11–27.

Kuchar J, Andrews J, Drumm A, Hall T, Heinz V, Thompson S, Welch J. 2004. Safety analysis process for the traffic alert and collision avoidance system (TCAS) and see-and-avoid systems on remotely piloted vehicles. Presented at AIAA 3rd “Unmanned Unlimited” Technical Conf, Workshop, and Exhibit, Sep 20–23, Chicago.

Mosleh A, Groth K, Wang C, Groen F, Hzu D. 2007. Integrated Risk Information System, Methodology and Software Platform developed for the US Federal Aviation Administration by the University of Maryland Center for Risk and Reliability.

Mosleh A. 2019. Architecture for guided simulation of probabilistic evolution of complex systems. Publication of the UCLA B. John Garrick Institute for the Risk Sciences (GIRS-2019-10/L). Available from GIRS.

NASEM [National Academies of Sciences, Engineering, and Medicine]. 2014. Lessons Learned from the Fukushima Nuclear Accident for Improving Safety of US Nuclear Plants. Washington: National Academies Press.

PLG [Pickard, Lowe, and Garrick, Inc.], Westinghouse Electric Corporation, Fauske and Associates, Inc. 1981. Zion probabilistic safety study. Prepared for Commonwealth Edison Company, Chicago.

PLG, Westinghouse Electric Corporation, Fauske and Associates, Inc. 1982. Indian Point probabilistic safety study. Prepared for Consolidated Edison Company of New York and New York Power Authority.

PLG. 1989. Beznau Station Risk Assessment: Plant with NANO (PLG0511). Prepared for and available from Nordostschweizerische Kraftwerke AG, Zürich.

Ramos MA, Thieme CA, Utne IB, Mosleh A. 2019. Autonomous systems safety: State of the art and challenges. Proceedings, First Internatl Workshop on Autonomous Systems Safety, Mar 11–13, Trondheim, Norway.

Ramos MA, Thieme CA, Utne IB, Mosleh A. 2020. Human-system concurrent task analysis for maritime autonomous surface ship operation and safety. Reliability Engineering & System Safety 195:106697.

Spouge J. 1999. A Guide to Quantitative Risk Assessment for Offshore Installations. Aberdeen UK: Centre for Marine and Petroleum Technology.

Stamatelatos M, Dezfuli H. 2011. Probabilistic Risk Assessment Procedures Guide for NASA Managers and Practitioners (NASA/SP-2011-3421, 2nd ed). Washington: NASA Headquarters.

USNRC [US Nuclear Regulatory Commission]. 1975. Reactor Safety Study: An Assessment of Accident Risks in US Commercial Nuclear Power Plants (WASH-1400, NUREG-75/014). Washington.


[1]  The value added of such analysis has been cited by the National Academies in several reports, including one on the Fukushima nuclear accident (NASEM 2014).

[2]  Correspondence between B. John Garrick and Tom Wellock, historian, US Nuclear Regulatory Commission, about early probabilistic risk assessment development activities, March 18, 2016.

[3]  A scenario is a sequence of events starting with an initiating or triggering event, going through all possible aggravating or mitigating events or conditions, and resulting in either success or adverse consequences.

[4]  https://en.wikipedia.org/wiki/2002_Überlingen_mid-air_collision

[5]  Uncertainties can be characterized by the following types:

  • Aleatory: natural or inherent randomness or stochastic nature of the phenomena being analyzed; this type of uncertainty is considered irreducible
  • Epistemic: uncertainty due to limitations in knowledge; considered reducible by increasing knowledge, data, or evidence
  • Parametric: uncertainty about the individual parameters of the model
  • Modeling: uncertainty about the structure and completeness of the model(s).
About the Authors: B. John Garrick (NAE) is distinguished adjunct professor at the B. John Garrick Institute for the Risk Sciences and Ali Mosleh (NAE) is distinguished professor and Evelyn Knight Chair in Engineering at the Henry Samueli School of Engineering and Applied Science, both at the University of California, Los Angeles.