Fall Issue of The Bridge on Social Sciences and Engineering Practice
September 5, 2012, Volume 42, Issue 3

Complex Organizational Failures: Culture, High Reliability, and the Lessons from Fukushima


Author: Nick Pidgeon

The principal causes of the Fukushima disaster were organizational culture and system complexity.

Most academics and practitioners in engineering quite rightly focus their attention on the science and performance of physical structures and systems, but delivering and operating engineered systems in an effective and safe manner has always depended on society and human beings. People, organizations, and ultimately their cultures are all involved in decisions about the design, building, and management of complex engineered systems. So, too, when things go badly wrong, people and their organizations are implicated in the events that lead to disaster. The recently published official Japanese inquiry into the Fukushima Daiichi nuclear accident acknowledges this with its description of the events as a “man-made disaster” (National Diet of Japan, 2012).

In addition to the human element, failures of complex engineered systems are rarely due to a single technical or environmental cause. Thus, although the Fukushima plants shut down as designed when the earthquake struck, the quake caused the loss of all off-site power to the complex. The subsequent tsunami then overwhelmed the flood defenses, destroying many of the remaining backup power and safety systems.

The inquiry report highlights other important contributory factors and concludes that this chain of events should have been foreseen and prevented. The site’s vulnerability to loss of power in a major tsunami had been identified several years before the accident, but the report documents an insular and defensive attitude on the part of the plant’s operator, Tokyo Electric Power Company (TEPCO), that, combined with a culture of deference and a cozy relationship with regulators, meant these and other safety warnings were not given sufficient priority. The result was a failure, over a number of years, to either properly examine the risks to the plant or improve safety measures.

This and other relatively recent examples of major system failure (e.g., the Deepwater Horizon oil spill in the Gulf of Mexico; Hopkins, 2012; Wassel, 2012, in this issue) are particularly disturbing because the underlying pathology of such situations has been well understood for more than 30 years. In this article I describe the contributions of research in the social and engineering sciences to the understanding of so-called socio-technical or man-made disasters.

Man-Made Disasters and the “Climatology of Accidents”

Several seminal texts published within a decade of each other showed that major accidents do not simply happen on the day of the visible failure. They have a background, a social and cultural context, and a history. Taken together, these texts revealed that system complexity and an incomplete understanding of such complexity can defeat the best attempts at anticipating risks.

Writing in the Proceedings of the First International Conference on Structural Safety and Reliability, the structural engineer Sir Alfred Pugsley (1969) coined the phrase “engineering climatology of structural accidents.” By this he meant the combination of political, financial, professional, and industrial pressures that may bear on a project to induce critical human errors or the oversight of critical safety issues that might lead to major structural failure. The final dramatic event might appear entirely technical in nature—a bridge or roof collapse, a major fire, a catastrophic aircraft fatigue failure—but the underlying origins and causes must be sought in the organizational and societal preconditions. The poorer the engineering climatology, the greater the likelihood of failure.

During the following decade detailed analytic work in engineering and the social sciences validated Pugsley’s explanation. The British sociologist Barry Turner published his own account of major socio-technical system failures in his influential book Man-Made Disasters (1978). Based on careful analysis of common patterns underlying more than 80 accidents and disasters that occurred during a 10-year period in the United Kingdom (including major structural failures such as the collapse of the London Ronan Point apartment tower and the Aberfan mining debris disaster in South Wales), Turner found that all of them could be explained using theories of human and organizational behavior. He demonstrated that very few major accidents have a singular cause: it was far more typical to find that several precursor events had “incubated” to produce a situation he described as an “accident waiting to happen.” In addition, surveying different domains, Turner showed that identical organizational and human causes recurred in seemingly disparate engineering sectors (e.g., structural, chemical, or electrical engineering), suggesting that engineers in different disciplines could learn vital lessons by talking to each other about failures.

Working independently of Turner, and drawing material from a range of prominent cases of civil engineering failures such as the collapses of the West Gate Bridge (Melbourne) and Tacoma (Wash.) Narrows Bridge, the structural engineer David Blockley (1980) arrived at similar conclusions in his book The Nature of Structural Design and Safety. According to Blockley, the practice and theory of engineering safety and reliability cannot progress without, first, greater attention to the organizational and political conditions that can induce human error and, second, the development of vulnerability metrics to measure such factors. Although these ideas are now widely accepted in engineering education, they were considered radical when first proposed.

System Complexity and “Normal Accidents”: The Example of Three Mile Island

Theory and research on organizational accidents gained recognition outside the academic arena after the 1979 accident at the Three Mile Island (TMI) nuclear power plant in the United States and the subsequent publication of Normal Accidents by Charles Perrow (1984). Now one of America’s foremost authorities on complex organizations, Perrow admitted that he came to the topic of risk and technology almost by accident (as it were), when he was invited by a former student to provide evidence to the President’s Commission inquiring into the causes underlying the TMI nuclear accident. Perrow used a sociological approach to unravel the causes of the disaster, form his “normal accidents” thesis, and inform his book.

The TMI incident was the result of a series of minor failures: a seemingly inconsequential leak of water triggered a chain of events involving both technical component malfunctions and operator misunderstandings and errors. The collective outcome was a major loss of coolant, something that the recent tragic events in Japan once again demonstrate is still an Achilles’ heel of the older generation of pressurized water reactors. No single contributory cause was sufficient to trigger the TMI meltdown (or indeed the Fukushima disaster), but taken together the events, not fully anticipated by the plant designers, conspired to defeat multiple safety systems designed to prevent loss of coolant.

Perrow concluded that the TMI accident was a direct consequence of the sheer complexity of the organizational and technical systems involved: some modern high-risk systems, such as nuclear power plants, are so complex as to be inevitably vulnerable to failure no matter how well managed. As Perrow put it, they eventually suffer a “normal accident.” However, inasmuch as background preconditions likely incubate over relatively long periods of time, there is at least some possibility of detection and prevention, even for highly complex systems.

For his analysis of system complexity, Perrow developed the concepts of “interactive complexity,” meaning the number and degree of system interrelationships, and “tight coupling,” or the degree to which initial failures can rapidly concatenate to affect the functioning of other parts of the system. Universities, for example, are interactively complex but only loosely coupled, whereas modern production lines often have tight coupling but typically rely on simple linear interactions. Neither tends to suffer systemic accidents. But a high-risk system with both high complexity and tight coupling, as at TMI, may require radical redesign or even abandonment of the technology entirely.1
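
To make the two dimensions concrete, the following minimal Python sketch (all names are hypothetical and illustrative, not drawn from Perrow) encodes the examples above as a simple two-dimensional classification: only a system that is both interactively complex and tightly coupled is flagged as a candidate for a “normal accident.”

```python
from dataclasses import dataclass

# Illustrative sketch only: a toy encoding of Perrow's two dimensions as
# discussed above. In his scheme, only systems combining high interactive
# complexity with tight coupling are prone to systemic ("normal") accidents.

@dataclass
class SystemProfile:
    name: str
    interactively_complex: bool  # many unplanned, nonlinear interrelationships
    tightly_coupled: bool        # little slack; local failures propagate quickly

    @property
    def normal_accident_prone(self) -> bool:
        # The hazardous case is the combination of both dimensions.
        return self.interactively_complex and self.tightly_coupled

examples = [
    SystemProfile("university", interactively_complex=True, tightly_coupled=False),
    SystemProfile("assembly line", interactively_complex=False, tightly_coupled=True),
    SystemProfile("nuclear power plant", interactively_complex=True, tightly_coupled=True),
]

for s in examples:
    status = "prone to systemic accidents" if s.normal_accident_prone else "not prone to systemic accidents"
    print(f"{s.name}: {status}")
```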

Perrow’s analysis of the TMI incident can be supplemented by insights from man-made disaster theory, which describes the background organizational, management, and communication failings that occur in the days, months, and years before an accident. The precise events at TMI, as is now known, had been foreshadowed by similar near-miss events in other U.S. pressurized water nuclear plants, raising the question of why safety information and learning were not shared among the various organizations involved (Hopkins, 2001).

These insightful analyses of major accidents were unfortunately followed in very short order by a string of major technological disasters around the globe during the 1980s (e.g., the Challenger explosion, the Chernobyl nuclear disaster, and the gas leak in Bhopal), all of which called for sophisticated analysis. In light of the arguments for examining technological failures as the product of complex interacting systems, and the identification of the critical roles of organizational and management factors as primary causes of failures, such “technological” disasters could no longer be ascribed, as previously, to isolated malfunctions, operator error, or “random acts of God.”

High-Reliability Organizations and Safety Cultures

The 1980s not only were important for crystallizing and disseminating new theories of organizational accidents but also served as an intellectual turning point. By the 1990s it was clear that, while there was a need for analysis of the causes of past accidents, analysts also had to consider ways to improve safety management in complex engineered systems. How might engineers and risk managers encourage organizational safety? Could it be designed?

Of course, in a straightforward sense, risk and safety must always be considered together, and the goal of related research is to help improve safety and resilience. However, understanding how vulnerability to failures and accidents arises does not automatically confer predictive knowledge to prevent future catastrophes. The question then becomes, can a theory of vulnerability to error and failure be used to build a theory of resilience and safety (e.g., Blockley, 1992)?

Perrow’s approach embodied a tension between foresight and fatalism. Studies over the past two decades have explored this tension through analysis of high-reliability organizations and safety culture.

High-Reliability Organizations

Researchers who examined high reliability worked from very detailed empirical case studies (e.g., concerning flight operations aboard aircraft carriers) in which the conditions for normal accidents existed but the systems operated safely and reliably on a day-to-day basis (Roberts, 1993). The results of their examination indicated that research on the conditions leading to failures should be supplemented by studies of successful risk management.

Analysts identified the following organizational and cultural factors as key reasons for the safe management of the otherwise toxic combination of high system complexity and high risk (see, e.g., Weick and Sutcliffe, 2001).

  • Collective “mindfulness” is the idea that a design or operations team can, by collaborating, develop a more comprehensive picture than that of any one individual alone.
  • Group norms stressing open communication and deference to expertise (wherever it resides in the organization) can promote identification and response to signs of rapidly escalating failure conditions before the onset of a full-scale disaster.
  • High-reliability organizations place a heavy premium on maximizing long-term learning opportunities, both within the organization and from other related industrial organizations and sectors, to identify and address underlying systemic faults before they combine with other events.

Discussions of high-reliability organizations eventually played out with no satisfactory resolution of the fundamental question at hand: Were normal accidents inevitable, as authors such as Sagan (1993) argued, or could complex systems indeed be safely managed as the high reliability researchers claimed? Part of the problem lay in the inherent difficulty of identifying a truly normal accident (which, by definition, is very rare) and part in the impossibility of definitively proving that a system was reliably safe (beyond the absence of any history of accidents).

Both the normal accident and high-reliability approaches are now viewed less as traditional theories, which would yield propositions falsifiable through clear empirical tests, and more as “sensitizing concepts” that enable a more effective approach to thinking about how high-risk systems both work and fail (Rijpma, 1997).

Safety Culture

Accident prevention research focused on the somewhat different concept of safety culture. In a development now likely to be replayed in light of the conclusions of the Fukushima inquiry report, intense academic and regulatory interest in safety culture followed the accident at Chernobyl in 1986. The errors and violations of operating procedures that contributed to the disaster were described as evidence of a poor safety culture both at the plant and in the former Soviet nuclear industry more generally (OECD Nuclear Energy Agency, 1987).

Implicit in the man-made disasters model was a view of culture in terms of the symbols and systems through which a given group or profession understands the world. A safety culture is built on assumptions and associated practices that inform beliefs about danger and safety. Such a culture is repeatedly created and recreated as members behave and communicate in ways that seem to them natural, obvious, and unquestionable and as such contribute to a particular version of risk, danger, and safety.

To maximize the chances that an organization can recognize and respond appropriately to signs of potential emerging hazards, a good safety culture should reflect at least four facets (Pidgeon and O’Leary, 2000):

• senior management commitment to safety,

• shared care and concern about hazards and their impacts on people,

• realistic and flexible norms and rules about dealing with risks, and

• continual reflection on and improvement of practice through monitoring, analysis, and feedback systems (organizational learning).

In exploring safety cultures as a route to resilient technical systems it is thus necessary to go beyond individual attitudes about safety to the level of shared thinking and the administrative structures and resources that support, rather than constrict, the development of organizational understandings of risk and danger.

The Importance of Organizational Learning

It is clear that organizational learning is a key component of both good safety cultures and high-reliability organizations (Pidgeon, 1997). But learning can be thwarted by well-known difficulties in handling information—too much information, inappropriate communication channels, incomplete or inappropriate information sources, or failure to connect available data—and these difficulties can pose acute challenges for safety. For example, an incomplete or inaccurate problem representation might develop at the level of the organization as a whole and thus influence the interpretations and decisions of the organization’s individual members. Such a representation may arise through organizational rigidity of beliefs about what is and is not to be considered a “hazard.”

The Fukushima disaster is an instructive example of such organizational thinking. The plant owners developed a group mindset about the risks of tsunami, minimizing the significance of the knowledge that flooding across the site could lead to a total loss of power (and hence the cooling function). They also failed to take account of the risk of a tsunami larger than the projections made by the Japanese Society of Civil Engineers, even though it was clear that such an event could disable the plant and seriously damage the reactors, with catastrophic consequences.

The lack of adequate preparation at Fukushima illustrates the point that an organization is defined as much by what its members attend to as by what they choose to ignore. As Diane Vaughan succinctly put it in her detailed analysis of the Challenger Space Shuttle disaster, a deficient safety culture at NASA “provided a way of seeing that was simultaneously a way of not seeing” (Vaughan, 1996, p. 392).

Avoiding disaster therefore involves an element of thinking both within defined frames of reference to deal with well-defined hazards that fall within an organization’s existing worldview, and outside those frames to at least consider the possibility of emergent or ill-defined hazards that have not been identified or that perhaps fall outside an individual’s or organization’s strict professional or legal remit. In effect, engineers should cultivate the art of scanning for the unintended consequences of their decisions—they should embrace the use of what I call “safety imagination”—as a routine part of their professional practice.

Box 1. A checklist for fostering safety imagination, adapted from U.S. Forest Service firefighter training materials (Thomas, 1994).

Box 1 provides one of the best checklists for safety imagination I have come across, although it was not created as such. The list is adapted from teaching materials developed for training firefighters in the U.S. Forest Service (Thomas, 1994). Most fire service training revolves around a military style of command and control, emphasizing hierarchical organizational structure and response, since many of the hazards involved in firefighting are well known and relevant precautions or procedures can accordingly be specified and trained for in advance. Some hazards, however, are far less well understood by firefighters on the ground. For such circumstances, and for any professional facing a potentially ill-structured and changing risk system, the points in Box 1 outline a useful approach.

The intention of the guidance presented in Box 1 is to counter well-known information difficulties and organizational rigidities of thinking by

  • extending the scope of potential scenarios relevant to the risk issue at hand (e.g., by eliciting varied viewpoints, playing the “what if” game, visualizing near misses becoming accidents),
  • countering complacency and the view that “it won’t happen here” (i.e., always fear the worst, thoroughly consider worst-case scenarios),
  • forcing the recognition that during an incubation period the most dangerous ill-structured hazards are by definition surrounded by ambiguity and uncertainty (i.e., tolerate ambiguity), and, perhaps most critically,
  • attempting to step temporarily beyond, or even suspend, institutionally defined assumptions about what the likely hazard and its consequences will comprise (i.e., suspend assumptions about how the safety task was completed in the past).

Concluding Comments

In light of recent serious challenges, it is clear that the lessons to be gained from analyses of past organizational accidents and disasters may need to be learned all over again by a new generation of engineers, risk regulators, and industry managers. The inquiries following Fukushima highlight the fact that the importance of cultural and organizational factors should never be underestimated (and it would be a further mistake to attribute these events to the unique culture and society of Japan).

For engineers seeking to understand and manage complex risks, theories of high-reliability organizations, safety culture, and organizational accidents should be required reading. Failure to anticipate hazards in complex engineered systems is an affliction that can strike anybody, any time, and anywhere!

Acknowledgments

The author acknowledges the support of the Leverhulme Trust (F/00 407/AG) and the UK Energy Research Centre (NE/G007748/1).

References

Blockley, D.I. 1980. The Nature of Structural Design and Safety. Chichester, U.K.: Ellis Horwood.

Blockley, D.I. 1992. Engineering Safety. London: McGraw-Hill.

Hopkins, A. 2001. Was Three Mile Island a “Normal Accident”? Journal of Contingencies and Crisis Management 9(2): 65–72.

Hopkins, A. 2012. Disastrous Decisions: The Human and Organisational Causes of the Gulf of Mexico Blowout. Sydney: CCH Australia.

National Diet of Japan. 2012. Fukushima Nuclear Accident Independent Investigation Commission (NAIIC) Final Report, edited by K. Kurokawa et al. Tokyo.

OECD Nuclear Energy Agency. 1987. Chernobyl and the Safety of Nuclear Reactors in OECD Countries. Paris: Organization for Economic Cooperation and Development.

Perrow, C. 1984. Normal Accidents: Living with High-Risk Technologies. New York: Basic Books.

Pidgeon, N.F. 1997. The limits to safety? Culture, politics, learning and man-made disasters. Journal of Contingencies and Crisis Management 5(1): 1–14.

Pidgeon, N.F., and M. O’Leary. 2000. Man-made disasters: Why technology and organizations (sometimes) fail. Safety Science 34: 15–30.

Pugsley, A.G. 1969. The engineering climatology of structural accidents. Pp. 335–340 in Proceedings of the First International Conference on Structural Safety and Reliability (ICOSSAR I). Washington, D.C.

Rijpma, J. 1997. Complexity, tight coupling and reliability: Connecting normal accidents with high reliability theory. Journal of Contingencies and Crisis Management 5(1): 15–23.

Roberts, K.H. 1993. New Challenges to Understanding Organizations. New York: Macmillan.

Sagan, S.D. 1993. The Limits of Safety: Organizations, Accidents, and Nuclear Weapons. Princeton, N.J.: Princeton University Press.

Thomas, D. 1994. Prescribed fire safety: Preventing accidents and disasters, Part II. Unit 2-G in course Prescribed Fire Behavior Analyst. Marana, Ariz.: National Advanced Resource Technology Center.

Turner, B.A. 1978. Man-made Disasters. London: Wykeham Science Press.

Vaughan, D. 1996. The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. Chicago, Ill.: University of Chicago Press.

Wassel, R. 2012. Safer offshore drilling: Lessons from the Macondo well blowout in the Gulf of Mexico. The Bridge 42(3): 46–53.

Weick, K.E., and K.M. Sutcliffe. 2001. Managing the Unexpected: Assuring High Performance in an Age of Complexity. San Francisco, Calif.: Jossey-Bass.

Footnotes

1 Perrow’s account implies that simply adding more safety devices—the standard response to the previous unanticipated failure—might paradoxically reduce margins of safety if they add to the opaqueness and complexity of the system.

About the Author: Nick Pidgeon is professor of applied psychology and director, Understanding Risk Research Group, Cardiff University, Wales.