Chapter 2. Questioning the Foundations of Traditional Safety Engineering.

It's never what we don't know that stops us. It's what we do know that just ain't so.

Paradigm changes necessarily start with questioning the basic assumptions underlying what we do today. Many beliefs about safety and why accidents occur have been widely accepted without question. This chapter examines and questions some of the most important assumptions about the cause of accidents and how to prevent them that "just ain't so." There is, of course, some truth in each of these assumptions, and many were true for the systems of the past. The real question is whether they still fit today's complex sociotechnical systems and what new assumptions need to be substituted or added.

Section 2.1. Confusing Safety with Reliability.

Assumption 1. Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur.
This assumption is one of the most pervasive in engineering and other fields. The problem is that it's not true. Safety and reliability are different properties. One does not imply nor require the other. A system can be reliable but unsafe. It can also be safe but unreliable. In some cases, these two properties even conflict, that is, making the system safer may decrease reliability and enhancing reliability may decrease safety. The confusion on this point is exemplified by the primary focus on failure events in most accident and incident analysis. Some researchers in organizational aspects of safety also make this mistake by suggesting that high reliability organizations will be safe.

Because this assumption about the equivalence between safety and reliability is so widely held, the distinction between these two properties needs to be carefully considered. First, let's consider accidents where none of the system components fail.

Reliable but Unsafe.

In complex systems, accidents often result from interactions among components that are all satisfying their individual requirements, that is, they have not failed. The loss of the Mars Polar Lander was attributed to noise (spurious signals) generated when the landing legs were deployed during the spacecraft's descent to the planet surface. This noise was normal and expected and did not represent a failure in the landing leg system. The onboard software interpreted these signals as an indication that landing had occurred (which the software engineers were told such signals would indicate) and shut down the descent engines prematurely, causing the spacecraft to crash into the Mars surface. The landing legs and the software performed correctly (as specified in their requirements) and reliably, but the accident occurred because the system designers did not account for all the potential interactions between landing leg deployment and the descent engine control software.
The Mars Polar Lander loss is a component interaction accident. Such accidents arise in the interactions among system components (electromechanical, digital, human, and social) rather than in the failure of individual components. In contrast, the other main type of accident, a component failure accident, results from component failures, including the possibility of multiple and cascading failures. In component failure accidents, the failures are usually treated as random phenomena. In component interaction accidents, there may be no failures, and the system design errors giving rise to unsafe behavior are not random events.

A failure in engineering can be defined as the nonperformance or inability of a component (or system) to perform its intended function. Intended function (and thus failure) is defined with respect to the component's behavioral requirements. If the behavior of a component satisfies its specified requirements (such as turning off the descent engines when a signal from the landing legs is received), even though the requirements may include behavior that is undesirable from a larger system context, that component has not failed.

Component failure accidents have received the most attention in engineering, but component interaction accidents are becoming more common as the complexity of our system designs increases. In the past, our designs were more intellectually manageable, and the potential interactions among components could be thoroughly planned, understood, anticipated, and guarded against. In addition, thorough testing was possible and could be used to eliminate design errors before use. Modern, high-tech systems no longer have these properties, and system design errors are increasingly the cause of major accidents, even when all the components have operated reliably, that is, the components have not failed.
Consider another example of a component interaction accident that occurred in a batch chemical reactor in England. The design of this system is shown in figure 2.1. The computer was responsible for controlling the flow of catalyst into the reactor and also the flow of water into the reflux condenser to cool off the reaction. Additionally, sensor inputs to the computer were supposed to warn of any problems in various parts of the plant. The programmers were told that if a fault occurred in the plant, they were to leave all controlled variables as they were and to sound an alarm.

On one occasion, the computer received a signal indicating a low oil level in a gearbox. The computer reacted as the requirements specified: it sounded an alarm and left everything as it was. By coincidence, a catalyst had just been added to the reactor, but the computer had only started to increase the cooling-water flow to the reflux condenser; the flow was therefore kept at a low rate. The reactor overheated, the relief valve lifted, and the content of the reactor was discharged into the atmosphere.

Note that there were no component failures involved in this accident: the individual components, including the software, worked as specified, but together they created a hazardous system state. The problem was in the overall system design. Merely increasing the reliability of the individual components or protecting against their failure would not have prevented this accident because none of the components failed. Prevention required identifying and eliminating or mitigating unsafe interactions among the system components. High component reliability does not prevent component interaction accidents.
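To make the interaction concrete, the following is a minimal sketch in Python of the specified fault response described above. All names and the control structure are hypothetical; the point is only that code satisfying its stated requirement (on any plant fault, hold the controlled variables and sound an alarm) can still leave the overall system in a hazardous state.

    # Minimal sketch (hypothetical names and structure) of the specified behavior:
    # on any plant fault, hold all controlled variables at their current values
    # and sound an alarm. The controller meets this requirement, yet the system
    # state that results can be hazardous.

    class ReactorController:
        def __init__(self):
            self.catalyst_flow = 0.0       # controlled variable
            self.cooling_water_flow = 0.0  # controlled variable (reflux condenser)
            self.alarm = False
            self.frozen = False            # set once any plant fault is reported

        def add_catalyst(self, rate):
            if not self.frozen:
                self.catalyst_flow = rate
                self.cooling_water_flow = 0.1  # cooling is ramped up gradually afterward

        def ramp_cooling(self, target):
            if not self.frozen:
                self.cooling_water_flow = target

        def on_fault(self, description):
            # Specified requirement: leave everything as it is and sound an alarm.
            self.alarm = True
            self.frozen = True

    ctrl = ReactorController()
    ctrl.add_catalyst(rate=1.0)                 # reaction starts, cooling still low
    ctrl.on_fault("low oil level in gearbox")   # unrelated fault freezes the low cooling flow
    ctrl.ramp_cooling(target=1.0)               # ignored: outputs are frozen, as specified
    print(ctrl.catalyst_flow, ctrl.cooling_water_flow, ctrl.alarm)  # 1.0 0.1 True -> overheating

Nothing in this sketch "fails"; the hazard arises from the timing of an unrelated fault relative to catalyst addition, which is exactly the kind of interaction a component-reliability view does not see.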
Safe but Unreliable.

Accidents like the Mars Polar Lander or the British batch chemical reactor losses, where the cause lies in dysfunctional interactions of nonfailing, reliable components (i.e., the problem is in the overall system design), illustrate reliable components in an unsafe system. There can also be safe systems with unreliable components if the system is designed and operated so that component failures do not create hazardous system states. Design techniques to prevent accidents are described in chapter 16 of Safeware. One obvious example is systems that are fail-safe, that is, they are designed to fail into a safe state.

For an example of behavior that is unreliable but safe, consider human operators. If operators do not follow the specified procedures, then they are not operating reliably. In some cases, that can lead to an accident. In other cases, it may prevent an accident when the specified procedures turn out to be unsafe under the particular circumstances existing at that time. Examples abound of operators ignoring prescribed procedures in order to prevent an accident. At the same time, accidents have resulted precisely because the operators did follow the predetermined instructions provided to them in their training, such as at Three Mile Island. When the results of deviating from procedures are positive, operators are lauded, but when the results are negative, they are punished for being "unreliable." In the successful case (deviating from specified procedures averts an accident), their behavior is unreliable but safe. It satisfies the behavioral safety constraints for the system, but not individual reliability requirements with respect to following specified procedures.

It may be helpful at this point to provide some additional definitions. Reliability in engineering is defined as the probability that something satisfies its specified behavioral requirements over time and under given conditions, that is, it does not fail. Reliability is often quantified as mean time between failure. Every hardware component (and most humans) can be made to "break" or fail given some set of conditions or a long enough time. The limitations in time and operating conditions in the definition are required to differentiate between (1) unreliability under the assumed operating conditions and (2) situations where no component or component design could have continued to operate.
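As a small illustration of that quantification, here is a sketch assuming the common constant-failure-rate (exponential) model; both the model choice and the numbers are assumptions for illustration, not something given in the text.

    import math

    # Reliability under an assumed constant failure rate: the probability that a
    # component meets its specified behavior for a mission time t, given its
    # mean time between failures (MTBF). Illustrative numbers only.
    def reliability(t_hours, mtbf_hours):
        return math.exp(-t_hours / mtbf_hours)

    print(reliability(t_hours=1000, mtbf_hours=50000))  # about 0.98

Note that a figure like this says nothing by itself about safety: a component can meet such a target while the system containing it behaves hazardously, as in the examples above.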
If a driver engages the brakes of a car too late to avoid hitting the car in front, we would not say that the brakes "failed" because they did not stop the car under circumstances for which they were not designed. The brakes, in this case, were not unreliable. They operated reliably, but the requirements for safety went beyond the capabilities of the brake design. Failure and reliability are always related to requirements and assumed operating (environmental) conditions. If there are no requirements either specified or assumed, then there can be no failure, as any behavior is acceptable, and no unreliability.

Safety, in contrast, is defined as the absence of accidents, where an accident is an event involving an unplanned and unacceptable loss. To increase safety, the focus should be on eliminating or preventing hazards, not eliminating failures. Making all the components highly reliable will not necessarily make the system safe.
Conflicts between Safety and Reliability.

At this point you may be convinced that reliable components are not enough for system safety. But surely, if the system as a whole is reliable it will be safe, and vice versa, if the system is unreliable it will be unsafe. That is, reliability and safety are the same thing at the system level, aren't they? This common assumption is also untrue. A chemical plant may very reliably manufacture chemicals while occasionally (or even continually) releasing toxic materials into the surrounding environment. The plant is reliable but unsafe.

Not only are safety and reliability not the same thing, but they sometimes conflict. Increasing reliability may decrease safety, and increasing safety may decrease reliability. Consider the following simple example in physical design. Increasing the working pressure to burst ratio (essentially the strength) of a tank will make the tank more reliable, that is, it will increase the mean time between failure. When a failure does occur, however, more serious damage may result because of the higher pressure at the time of the rupture.

Reliability and safety may also conflict in engineering design when a choice has to be made between retreating to a fail-safe state (and protecting people and property) versus attempting to continue to achieve the system objectives but with increased risk of an accident.

Understanding the conflicts between reliability and safety requires distinguishing between requirements and constraints. Requirements are derived from the mission or reason for the existence of the organization. The mission of the chemical plant is to produce chemicals. Constraints represent acceptable ways the system or organization can achieve the mission goals. Not exposing bystanders to toxins and not polluting the environment are constraints on the way the mission (producing chemicals) can be achieved.

While in some systems safety is part of the mission or reason for existence, such as air traffic control or healthcare, in others safety is not the mission but instead is a constraint on how the mission can be achieved. The best way to ensure the constraints are enforced in such a system may be not to build or operate the system at all. Not building a nuclear bomb is the surest protection against accidental detonation. We may be unwilling to make that compromise, but some compromise is almost always necessary. The most effective design protections (besides not building the bomb at all) against accidental detonation also decrease the likelihood of detonation when it is required.

Not only do safety constraints sometimes conflict with mission goals, but the safety requirements may even conflict among themselves. One safety constraint on an automated train door system, for example, is that the doors must not open unless the train is stopped and properly aligned with a station platform. Another safety constraint is that the doors must open anywhere for emergency evacuation. Resolving these conflicts is one of the important steps in safety and system engineering.
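As a small illustration of why such constraints must be analyzed and refined rather than simply listed, the sketch below (hypothetical predicates, Python used only as notation) states the two door constraints literally and checks them during an emergency between stations; as written, no door state satisfies both.

    # Two safety constraints on an automated train door controller, stated literally.
    def constraint_doors_closed_unless_at_platform(doors_open, stopped, aligned):
        # Doors must not open unless the train is stopped and aligned with a platform.
        return (not doors_open) or (stopped and aligned)

    def constraint_doors_open_for_evacuation(doors_open, emergency):
        # Doors must open anywhere for emergency evacuation.
        return (not emergency) or doors_open

    # An emergency while stopped between stations (not aligned with any platform):
    for doors_open in (True, False):
        ok1 = constraint_doors_closed_unless_at_platform(doors_open, stopped=True, aligned=False)
        ok2 = constraint_doors_open_for_evacuation(doors_open, emergency=True)
        print(doors_open, ok1, ok2)  # neither choice satisfies both constraints as stated

Resolving the conflict means restating the constraints, for example by making emergency evacuation an explicit exception to the first one, which is the kind of tradeoff decision the text is describing.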
Even systems with mission goals that include assuring safety, such as air traffic control (ATC), usually have other conflicting goals. ATC systems commonly have the mission to both increase system throughput and ensure safety. One way to increase throughput is to decrease safety margins by operating aircraft closer together. Keeping the aircraft separated adequately to assure acceptable risk may decrease system throughput.

There are always multiple goals and constraints for any system; the challenge in engineering design and risk management is to identify and analyze the conflicts, to make appropriate tradeoffs among the conflicting requirements and constraints, and to find ways to increase system safety without decreasing system reliability.

Safety versus Reliability at the Organizational Level.

So far the discussion has focused on safety versus reliability at the physical level. But what about the social and organizational levels above the physical system? Are safety and reliability the same here, as implied by High Reliability Organization (HRO) advocates who suggest that High Reliability Organizations (HROs) will be safe? The answer, again, is no.
Figure 2.2 shows Rasmussen's analysis of the Zeebrugge ferry mishap. Some background is necessary to understand the figure. On the day the ferry capsized, the Herald of Free Enterprise was working the route between Dover and the Belgian port of Bruges–Zeebrugge. This route was not her normal one, and the linkspan at Zeebrugge had not been designed specifically for the Spirit type of ships. The linkspan used spanned a single deck and so could not be used to load decks E and G simultaneously. The ramp could also not be raised high enough to meet the level of deck E due to the high spring tides at that time. This limitation was commonly known and was overcome by filling the forward ballast tanks to lower the ferry's bow in the water. The Herald was due to be modified during its refit later that year to overcome this limitation in the ship's design.

Before dropping moorings, it was normal practice for a member of the crew, the assistant boatswain, to close the ferry doors. The first officer also remained on deck to ensure they were closed before returning to the wheelhouse. On the day of the accident, in order to keep on schedule, the first officer returned to the wheelhouse before the ship dropped its moorings (which was common practice), leaving the closing of the doors to the assistant boatswain, who had taken a short break after cleaning the car deck upon arrival at Zeebrugge. He had returned to his cabin and was still asleep when the ship left the dock. The captain could only assume that the doors had been closed because he could not see them from the wheelhouse due to their construction, and there was no indicator light in the wheelhouse to show door position. Why nobody else closed the door is unexplained in the accident report.

Other factors also contributed to the loss. One was the depth of the water: if the ship's speed had been below 18 knots (33 km/h) and the ship had not been in shallow water, it was speculated in the accident report that the people on the car deck would probably have had time to notice the bow doors were open and close them. But open bow doors were not alone enough to cause the final capsizing. A few years earlier, one of the Herald's sister ships sailed from Dover to Zeebrugge with the bow doors open and made it to her destination without incident.

Almost all ships are divided into watertight compartments below the waterline so that in the event of flooding, the water will be confined to one compartment, keeping the ship afloat. The Herald's design had an open car deck with no dividers, allowing vehicles to drive in and out easily, but this design allowed water to flood the car deck. As the ferry turned, the water on the car deck moved to one side and the vessel capsized. One hundred and ninety-three passengers and crew were killed.

In this accident, those making decisions about vessel design, harbor design, cargo management, passenger management, traffic scheduling, and vessel operation were unaware of the impact (side effects) of their decisions on the others and the overall impact on the process leading to the ferry accident. Each operated "reliably" in terms of making decisions based on the information they had.

Bottom-up decentralized decision making can lead, and has led, to major accidents in complex sociotechnical systems. Each local decision may be "correct" in the limited context in which it was made but lead to an accident when the independent decisions and organizational behaviors interact in dysfunctional ways. Safety is a system property, not a component property, and must be controlled at the system level, not the component level. We return to this topic in chapter 3.

Assumption 1 is clearly untrue. A new assumption needs to be substituted.

New Assumption 1. High reliability is neither necessary nor sufficient for safety.

Building safer systems requires going beyond the usual focus on component failure and reliability to focus on system hazards and eliminating or reducing their occurrence. This fact has important implications for analyzing and designing for safety. Bottom-up reliability engineering analysis techniques, such as failure modes and effects analysis (FMEA), are not appropriate for safety analysis. Even top-down techniques, such as fault trees, if they focus on component failure, are not adequate. Something else is needed.
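To illustrate the limitation just described, here is a deliberately toy FMEA-style table (all entries hypothetical) for the batch reactor example of section 2.1. Every row starts from a component failure mode, so the analysis never surfaces the interaction that actually caused the loss.

    # Toy, illustrative FMEA-style rows for the batch reactor example.
    # (component, failure mode, local effect, typical safeguard)
    fmea_rows = [
        ("catalyst valve",   "stuck closed",           "no reaction",               "flow alarm"),
        ("cooling pump",     "fails off",              "loss of condenser cooling", "redundant pump"),
        ("gearbox sensor",   "spurious low-oil alarm", "nuisance alarm",            "sensor validation"),
        ("control computer", "crashes",                "outputs freeze",            "watchdog restart"),
    ]

    for component, mode, effect, safeguard in fmea_rows:
        print(f"{component:16s} | {mode:22s} | {effect:26s} | {safeguard}")

    # No row describes the hazardous interaction that occurred: an unrelated fault,
    # handled exactly as specified, freezing the cooling-water flow at a low rate
    # just after catalyst was added. A hazard-focused analysis is needed for that.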
Section 2.2. Modeling Accident Causation as Event Chains.

Assumption 2. Accidents are caused by chains of directly related events. We can understand accidents and assess risk by looking at the chain of events leading to the loss.

Some of the most important assumptions in safety lie in our models of how the world works. Models are important because they provide a means for understanding phenomena like accidents or potentially hazardous system behavior and for recording that understanding in a way that can be communicated to others.

A particular type of model, an accident causality model (or accident model for short), underlies all efforts to engineer for safety. Our accident models provide the foundation for (1) investigating and analyzing the cause of accidents, (2) designing to prevent future losses, and (3) assessing the risk associated with using the systems and products we create. Accident models explain why accidents occur, and they determine the approaches we take to prevent them. While you might not be consciously aware you are using a model when engaged in these activities, some (perhaps subconscious) model of the phenomenon is always part of the process.

All models are abstractions; they simplify the thing being modeled by abstracting away what are assumed to be irrelevant details and focusing on the features of the phenomenon that are judged to be the most relevant. Selecting some factors as relevant and others as irrelevant is, in most cases, arbitrary and entirely the choice of the modeler. That choice, however, is critical in determining the usefulness and accuracy of the model in predicting future events.

An underlying assumption of all accident models is that there are common patterns in accidents and that they are not simply random events. Accident models impose patterns on accidents and influence the factors considered in any safety analysis. Because the accident model influences what cause(s) is ascribed to an accident, the countermeasures taken to prevent future accidents, and the evaluation of the risk in operating a system, the power and features of the accident model used will greatly affect our ability to identify and control hazards and thus prevent accidents.
The earliest formal accident models came from industrial safety (sometimes called occupational safety) and reflect the factors inherent in protecting workers from injury or illness. Later, these same models or variants of them were applied to the engineering and operation of complex technical and social systems. At the beginning, the focus in industrial accident prevention was on unsafe conditions, such as open blades and unprotected belts. While this emphasis on preventing unsafe conditions was very successful in reducing workplace injuries, the decrease naturally started to slow down as the most obvious hazards were eliminated. The emphasis then shifted to unsafe acts. Accidents began to be regarded as someone's fault rather than as an event that could have been prevented by some change in the plant or product.

Heinrich's Domino Model, published in 1931, was one of the first published general accident models and was very influential in shifting the emphasis in safety to human error. Heinrich compared the general sequence of accidents to five dominoes standing on end in a line (figure 2.3). When the first domino falls, it automatically knocks down its neighbor and so on until the injury occurs. In any accident sequence, according to this model, ancestry or social environment leads to a fault of a person, which is the proximate reason for an unsafe act or condition (mechanical or physical), which results in an accident, which leads to an injury. In 1976, Bird and Loftus extended the basic Domino Model to include management decisions as a factor in accidents:

1. Lack of control by management, permitting
2. Basic causes (personal and job factors) that lead to
3. Immediate causes (substandard practices/conditions/errors), which are the proximate cause of
4. An accident or incident, which results in
5. A loss.

In the same year, Adams suggested a different management-augmented model that included:

1. Management structure (objectives, organization, and operations)
2. Operational errors (management or supervisory behavior)
3. Tactical errors (caused by employee behavior and work conditions)
4. Accident or incident
5. Injury or damage to persons or property.

Reason reinvented the Domino Model twenty years later in what he called the Swiss Cheese model, with layers of Swiss cheese substituted for dominos and the layers or dominos labeled as layers of defense that have failed.

The basic Domino Model is inadequate for complex systems, and other models were developed (see Safeware, chapter 10), but the assumption that there is a single or root cause of an accident unfortunately persists, as does the idea of dominos (or layers of Swiss cheese) and chains of failures, each directly causing or leading to the next one in the chain. It also lives on in the emphasis on human error in identifying accident causes.
The most common accident models today explain accidents in terms of multiple events sequenced as a forward chain over time. The events included almost always involve some type of "failure" event or human error, or they are energy related (for example, an explosion). The chains may be branching (as in fault trees) or there may be multiple chains synchronized by time or common events. Lots of notations have been developed to represent the events in a graphical form, but the underlying model is the same. Figure 2.4 shows an example for the rupture of a pressurized tank.

The use of event-chain models of causation has important implications for the way engineers design for safety. If an accident is caused by a chain of events, then the most obvious preventive measure is to break the chain before the loss occurs. Because the most common events considered in these models are component failures, preventive measures tend to be focused on preventing failure events: increasing component integrity or introducing redundancy to reduce the likelihood of the event occurring. If corrosion can be prevented in the tank rupture accident, for example, then the tank rupture is averted.

Figure 2.5 is annotated with mitigation measures designed to break the chain. These mitigation measures are examples of the most common design techniques based on event-chain models of accidents, such as barriers (for example, preventing the contact of moisture with the metal used in the tank by coating it with plate carbon steel or providing mesh screens to contain fragments), interlocks (using a burst diaphragm), overdesign (increasing the metal thickness), and operational procedures (reducing the amount of pressure as the tank ages).
For this simple example involving only physical failures, designing to prevent such failures works well. But even this simple example omits any consideration of factors indirectly related to the events in the chain. An example of a possible indirect or systemic factor is competitive or financial pressure to increase efficiency that could lead to not following the plan to reduce the operating pressure as the tank ages. A second factor might be changes over time to the plant design that require workers to spend time near the tank while it is pressurized.

Formal and informal notations for representing the event chain may contain only the events, or they may also contain the conditions that led to the events. Events create conditions that, along with existing conditions, lead to events that create new conditions, and so on (figure 2.6). The tank corrodes event leads to a corrosion exists in tank condition, which leads to a metal weakens event, which leads to a weakened metal condition, and so forth.

The difference between events and conditions is that events are limited in time, while conditions persist until some event occurs that results in new or changed conditions. For example, the three conditions that must exist before a flammable mixture will explode (the event) are the flammable gases or vapors themselves, air, and a source of ignition. Any one or two of these may exist for a period of time before the other(s) occurs and leads to the explosion. An event (the explosion) creates new conditions, such as uncontrolled energy or toxic chemicals in the air.
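The structure being described can be made explicit with a small sketch (hypothetical representation; the rupture and release entries extend the text's example in the spirit of figure 2.4): events are point-in-time occurrences, while conditions persist until some later event changes them.

    # Alternating events and conditions in the pressurized-tank example.
    chain = [
        ("event",     "tank corrodes"),
        ("condition", "corrosion exists in tank"),
        ("event",     "metal weakens"),
        ("condition", "weakened metal"),
        ("event",     "tank ruptures under operating pressure"),   # assumed continuation
        ("condition", "fragments and contents released"),          # assumed continuation
    ]

    for kind, description in chain:
        print(f"{kind:9s} -> {description}")

Each mitigation in figure 2.5 (coating, burst diaphragm, thicker metal, reduced operating pressure as the tank ages) aims to break one of these direct links; systemic factors such as cost or schedule pressure have no natural place in such a chain, which is the limitation discussed next.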
Causality models based on event chains (or dominos or layers of Swiss cheese) are simple and therefore appealing. But they are too simple and do not include what is needed to understand why accidents occur and how to prevent them. Some important limitations include requiring direct causality relationships, subjectivity in selecting the events to include, subjectivity in identifying chaining conditions, and exclusion of systemic factors.

Section 2.2.1. Direct Causality.

The causal relationships between the events in event-chain models (or between dominoes or Swiss cheese slices) are required to be direct and linear, representing the notion that the preceding event must have occurred and the linking conditions must have been present for the subsequent event to occur: if event A had not occurred, then the following event B would not have occurred. As such, event-chain models encourage limited notions of linear causality, and it is difficult or impossible to incorporate nonlinear relationships. Consider the statement "Smoking causes lung cancer." Such a statement would not be allowed in the event-chain model of causality because there is no direct relationship between the two. Many smokers do not get lung cancer, and some people who get lung cancer are not smokers. It is widely accepted, however, that there is some relationship between the two, although it may be quite complex and nonlinear.

In addition to limitations in the types of causality considered, the causal factors identified using event-chain models depend on the events that are considered and on the selection of the conditions that link the events. Other than the physical events immediately preceding or directly involved in the loss, however, the choice of events to include is subjective, and the conditions selected to explain the events are even more so. Each of these two limitations is considered in turn.
Section 2.2.2. Subjectivity in Selecting Events.

The selection of events to include in an event chain is dependent on the stopping rule used to determine how far back the sequence of explanatory events goes. Although the first event in the chain is often labeled the initiating event or root cause, the selection of an initiating event is arbitrary, and previous events and conditions could always be added.

Sometimes the initiating event is selected (the backward chaining stops) because it represents a type of event that is familiar and thus acceptable as an explanation for the accident or it is a deviation from a standard. In other cases, the initiating event or root cause is chosen because it is the first event in the backward chain for which it is felt that something can be done for correction.

The backward chaining may also stop because the causal path disappears due to lack of information. Rasmussen suggests that a practical explanation for why actions by operators actively involved in the dynamic flow of events are so often identified as the cause of an accident is the difficulty in continuing the backtracking "through" a human.

A final reason why a "root cause" may be selected is that it is politically acceptable as the identified cause. Other events or explanations may be excluded or not examined in depth because they raise issues that are embarrassing to the organization or its contractors or are politically unacceptable.

The accident report on a friendly fire shootdown of a U.S. Army helicopter over the Iraqi no-fly zone in 1994, for example, describes the chain of events leading to the shootdown. Included in these events is the fact that the helicopter pilots did not change to the radio frequency required in the no-fly zone when they entered it (they stayed on the en route frequency). Stopping at this event in the chain (which the official report does), it appears that the helicopter pilots were partially at fault for the loss by not following radio procedures. An independent account of the accident, however, notes that the U.S. commander of the operation had made an exception about the radio frequency to be used by the helicopters in order to mitigate a different safety concern (see chapter 5), and therefore the pilots were simply following orders when they did not switch to the "required" frequency. The command to the helicopter pilots not to follow official radio procedures is not included in the chain of events provided in the official government accident report, but it suggests a very different understanding of the role of the helicopter pilots in the loss.

In addition to a root cause or causes, some events or conditions may be identified as proximate or direct causes while others are labeled as contributory. There is no more basis for this distinction than the selection of a root cause. Making such distinctions between causes or limiting the factors considered can be a hindrance in learning from and preventing future accidents. Consider the following aircraft examples.
In the crash of an American Airlines DC-10 at Chicago's O'Hare Airport in 1979, the U.S. National Transportation Safety Board (NTSB) blamed only a "maintenance-induced crack," and not also a design error that allowed the slats to retract if the wing was punctured. Because of this omission, McDonnell Douglas was not required to change the design, leading to future accidents related to the same design flaw.

Similar omissions of causal factors in aircraft accidents have occurred more recently. One example is the crash of a China Airlines A300 on April 26, 1994, while approaching the Nagoya, Japan, airport. One of the factors involved in the accident was the design of the flight control computer software. Previous incidents with the same type of aircraft had led to a Service Bulletin being issued for a modification of the two flight control computers to fix the problem. But because the computer problem had not been labeled a "cause" of the previous incidents (for perhaps at least partially political reasons), the modification was labeled recommended rather than mandatory. China Airlines concluded, as a result, that the implementation of the changes to the computers was not urgent and decided to delay modification until the next time the flight computers on the plane needed repair. Because of that delay, 264 passengers and crew died.
In another DC-10 saga, explosive decompression played a critical role in a near miss over Windsor, Ontario. An American Airlines DC-10 lost part of its passenger floor, and thus all of the control cables that ran through it, when a cargo door opened in flight in June 1972. Thanks to the extraordinary skill and poise of the pilot, Bryce McCormick, the plane landed safely. In a remarkable coincidence, McCormick had trained himself to fly the plane using only the engines because he had been concerned about a decompression-caused collapse of the floor. After this close call, McCormick recommended that every DC-10 pilot be informed of the consequences of explosive decompression and trained in the flying techniques that he and his crew
had used to save their passengers and aircraft. FAA investigators, the National Transportation Safety Board, and engineers at a subcontractor to McDonnell Douglas that designed the fuselage of the plane all recommended changes in the design of the aircraft. Instead, McDonnell Douglas attributed the Windsor incident totally to human error on the part of the baggage handler responsible for closing the cargo compartment door (a convenient event in the event chain) and not to any error on the part of their designers or engineers, and decided all they had to do was to come up with a fix that would prevent baggage handlers from forcing the door.

One of the discoveries after the Windsor incident was that the door could be improperly closed but the external signs, such as the position of the external handle, made it appear to be closed properly. In addition, this incident proved that the cockpit warning system could fail, and the crew would then not know that the plane was taking off without a properly closed door.
The aviation industry does not normally receive such manifest warnings of basic design flaws in an aircraft without cost to human life. Windsor deserved to be celebrated as an exceptional case when every life was saved through a combination of crew skill and the sheer luck that the plane was so lightly loaded. If there had been more passengers and thus more weight, damage to the control cables would undoubtedly have been more severe, and it is highly questionable if any amount of skill could have saved the plane.

Almost two years later, in March 1974, a fully loaded Turkish Airlines DC-10 crashed near Paris, resulting in 346 deaths, one of the worst accidents in aviation history. Once again, the cargo door had opened in flight, causing the cabin floor to collapse, severing the flight control cables. Immediately after the accident, Sanford McDonnell stated the official McDonnell Douglas position that once again placed the blame on the baggage handler and the ground crew. This time, however, the FAA finally ordered modifications to all DC-10s that eliminated the hazard. In addition, an FAA regulation issued in July 1975 required all wide-bodied jets to be able to tolerate a hole in the fuselage of twenty square feet. By labeling the root cause in the event chain as baggage handler error and attempting only to eliminate that event or link in the chain rather than the basic engineering design flaws, fixes that could have prevented the Paris crash were not made.

Until we do a better job of identifying causal factors in accidents, we will continue to have unnecessary repetition of incidents and accidents.

Footnote. As an example, a NASA Procedures and Guidelines document (NPG 8621 Draft 1) defines a root cause as: "Along a chain of events leading to a mishap, the first causal action or failure to act that could have been controlled systematically either by policy/practice/procedure or individual adherence to policy/practice/procedure."
Section 2.2.3. Subjectivity in Selecting the Chaining Conditions.

In addition to subjectivity in selecting the events and the root cause event, the links between the events that are chosen to explain them are subjective and subject to bias. Leplat notes that the links are justified by knowledge or rules of different types, including physical and organizational knowledge. The same event can give rise to different types of links according to the mental representations the analyst has of the production of this event. When several types of rules are possible, the analyst will apply those that agree with his or her mental model of the situation.

Consider, for example, the loss of an American Airlines B757 near Cali, Colombia, in 1995. Two significant events in this loss were

1. Pilot asks for clearance to take the ROZO approach

followed later by

2. Pilot types R into the FMS.

In fact, the pilot should have typed the four letters ROZO instead of R; the latter was the symbol for a different radio beacon (called Romeo) near Bogota. As a result, the aircraft incorrectly turned toward mountainous terrain. While these events are noncontroversial, the link between the two events could be explained by any of the following:

• Pilot Error: In the rush to start the descent, the pilot executed a change of course without verifying its effect on the flight path.

• Crew Procedure Error: In the rush to start the descent, the captain entered the name of the waypoint without normal verification from the other pilot.

• Approach Chart and FMS Inconsistencies: The identifier used to identify ROZO on the approach chart (R) did not match the identifier used to call up ROZO in the FMS.

• FMS Design Deficiency: The FMS did not provide the pilot with feedback that choosing the first identifier listed on the display was not the closest beacon having that identifier.

• American Airlines Training Deficiency: The pilots flying into South America were not warned about duplicate beacon identifiers nor adequately trained on the logic and priorities used in the FMS on the aircraft.

• Manufacturer Deficiency: Jeppesen Sanderson did not inform airlines operating FMS-equipped aircraft of the differences between navigation information provided by Jeppesen Sanderson Flight Management System navigation databases and Jeppesen Sanderson approach charts or the logic and priorities employed in the display of electronic FMS navigation information.

• International Standards Deficiency: No single worldwide standard provides unified criteria for the providers of electronic navigation databases used in Flight Management Systems.

The selection of the linking condition (or events) will greatly influence the cause ascribed to the accident, yet in the example all are plausible and each could serve as an explanation of the event sequence. The choice may reflect more on the person or group making the selection than on the accident itself. In fact, understanding this accident and learning enough from it to prevent future accidents requires identifying all of these factors to explain the incorrect input. The accident model used should encourage and guide a comprehensive analysis at multiple technical and social system levels.

Footnote. An FMS is an automated flight management system that assists the pilots in various ways. In this case, it was being used to provide navigation information.
Section 2.2.4. Discounting Systemic Factors.

The problem with event chain models is not simply that the selection of the events to include and the labeling of some of them as causes are arbitrary or that the selection of which conditions to include is also arbitrary and usually incomplete. Even more important is that viewing accidents as chains of events and conditions may limit understanding and learning from the loss and omit causal factors that cannot be included in an event chain.

Event chains developed to explain an accident usually concentrate on the proximate events immediately preceding the loss. But the foundation for an accident is often laid years before. One event simply triggers the loss, but if that event had not happened, another one would have led to a loss. The Bhopal disaster provides a good example.

The release of methyl isocyanate (MIC) from the Union Carbide chemical plant in Bhopal, India, in December 1984 has been called the worst industrial accident in history. Conservative estimates point to 2,000 fatalities, 10,000 permanent disabilities (including blindness), and 200,000 injuries. The Indian government blamed the accident on human error: the improper cleaning of a pipe at the plant. A relatively new worker was assigned to wash out some pipes and filters, which were clogged. MIC produces large amounts of heat when in contact with water, and the worker properly closed the valves to isolate the MIC tanks from the pipes and filters being washed. Nobody, however, inserted a required safety disk (called a slip blind) to back up the valves in case they leaked.

A chain of events describing the accident mechanism for Bhopal might include:

E1. Worker washes pipes without inserting a slip blind.
E2. Water leaks into MIC tank.
E3. Explosion occurs.
E4. Relief valve opens.
E5. MIC vented into air.
E6. Wind carries MIC into populated area around plant.

Both Union Carbide and the Indian government blamed the worker washing the pipes for the accident. A different operator error might be identified as the root cause (initiating event) if the chain is followed back farther. The worker who had been assigned the task of washing the pipes reportedly knew that the valves leaked, but he did not check to see whether the pipe was properly isolated because, he said, it was not his job to do so. Inserting the safety disks was the job of the maintenance department, but the maintenance sheet contained no instruction to insert this disk. The pipe-washing operation should have been supervised by the second shift supervisor, but that position had been eliminated in a cost-cutting effort. So the root cause might instead have been assigned to the person responsible for inserting the slip blind or to the lack of a second shift supervisor.

But the selection of a stopping point and the specific operator action to label as the root cause (and operator actions are almost always selected as root causes) is not the real problem here. The problem is the oversimplification implicit in using a chain of events to understand why this accident occurred. Given the design and operating conditions of the plant, an accident was waiting to happen.
However the water got in, it would not have caused the severe explosion had the refrigeration unit not been disconnected and drained of freon, or had the gauges been properly working and monitored, or had various steps been taken at the first smell of MIC instead of being put off until after the tea break, or had the scrubber been in service, or had the water sprays been designed to go high enough to douse the emissions, or had the flare tower been working and been of sufficient capacity to handle a large excursion.

It is not uncommon for a company to turn off passive safety devices, such as refrigeration units, to save money. The operating manual specified that the refrigeration unit must be operating whenever MIC was in the system. The chemical has to be maintained at a temperature no higher than 5 degrees Celsius to avoid uncontrolled reactions. A high temperature alarm was to sound if the MIC reached 11 degrees. The refrigeration unit was turned off, however, to save money, and the MIC was usually stored at nearly 20 degrees. The plant management adjusted the threshold of the alarm, accordingly, from 11 degrees to 20 degrees, and logging of tank temperatures was halted, thus eliminating the possibility of an early warning of rising temperatures.
Gauges at plants are frequently out of service. At the Bhopal facility, there were few alarms or interlock devices in critical locations that might have warned operators of abnormal conditions, a system design deficiency.

Other protection devices at the plant had inadequate design thresholds. The vent scrubber, had it worked, was designed to neutralize only small quantities of gas at fairly low pressures and temperatures. The pressure of the escaping gas during the accident exceeded the scrubber's design by nearly two and a half times, and the temperature of the escaping gas was at least 80 degrees Celsius more than the scrubber could handle. Similarly, the flare tower (which was supposed to burn off released vapor) was totally inadequate to deal with the estimated 40 tons of MIC that escaped during the accident. In addition, the MIC was vented from the vent stack 108 feet above the ground, well above the height of the water curtain intended to knock down the gas. The water curtain reached only 40 to 50 feet above the ground. The water jets could reach as high as 115 feet, but only if operated individually.

Leaks were routine occurrences and the reasons for them were seldom investigated. Problems were either fixed without further examination or were ignored. A safety audit two years earlier by a team from Union Carbide had noted many safety problems at the plant, including several involved in the accident, such as filter-cleaning operations without using slip blinds, leaking valves, the possibility of contaminating the tank with material from the vent gas scrubber, and bad pressure gauges. The safety auditors had recommended increasing the capability of the water curtain and had pointed out that the alarm at the flare tower from which the MIC leaked was nonoperational, and thus any leak could go unnoticed for a long time. None of the recommended changes were made. There is debate about whether the audit information was fully shared with the Union Carbide India subsidiary and about who was responsible for making sure changes were made. In any event, there was no follow-up to make sure that the problems identified in the audit had been corrected.

A year before the accident, the chemical engineer managing the MIC plant resigned because he disapproved of falling safety standards, and still no changes were made. He was replaced by an electrical engineer. Measures for dealing with a chemical release once it occurred were no better. Alarms at the plant sounded so often (the siren went off twenty to thirty times a week for various purposes) that an actual alert could not be distinguished from routine events or practice alerts. Ironically, the warning siren was not turned on until two hours after the MIC leak was detected (and after almost all the injuries had occurred) and then was turned off after only five minutes, which was company policy. Moreover, the numerous practice alerts did not seem to be effective in preparing for an emergency. When the danger during the release became known, many employees ran from the contaminated areas of the plant, totally ignoring the buses that were sitting idle ready to evacuate workers and nearby residents. Plant workers had only a bare minimum of emergency equipment (a shortage of oxygen masks, for example, was discovered after the accident started), and they had almost no knowledge or training about how to handle nonroutine events.
|
||
|
||
The police were not notified when the cheMICal release began. In fact, when
|
||
called by police and reporters, plant spokesmen first denied the accident and then
|
||
claimed that M I C was not dangerous. Nor was the surrounding community warned
|
||
of the dangers, before or during the release, or informed of the simple precautions
|
||
that could have saved them from lethal exposure, such as putting a wet cloth over
|
||
their face and closing their eyes. If the community had been alerted and provided
|
||
with this simple information, many (if not most) lives would have been saved and
|
||
injuries prevented .
|
||
Some of the reasons why the poor conditions in the plant were allowed to persist
|
||
are financial. Demand for M I C had dropped sharply after 19 81, leading to reduc
|
||
tions in production and pressure on the company to cut costs. The plant was operat
|
||
ing at less than half capacity when the accident occurred. Union Carbide put pressure
|
||
on the Indian management to reduce losses, but gave no specific details on how
|
||
to achieve the reductions. In response, the maintenance and operating personnel
|
||
were cut in half. Maintenance procedures were severely cut back and the shift reliev
|
||
ing system was suspended.if no replacement showed up at the end of the shift,
|
||
the following shift went unmanned. The person responsible for inserting the
|
||
slip blind in the pipe had not showed up for his shift. Top management justified the
|
||
cuts as merely reducing avoidable and wasteful expenditures without affecting
|
||
overall safety.
|
||
As the plant lost money, many of the skilled workers left for more secure jobs. They either were not replaced or were replaced by unskilled workers. When the plant was first built, operators and technicians had the equivalent of two years of college education in chemistry or chemical engineering. In addition, Union Carbide provided them with six months of training. When the plant began to lose money, educational standards and staffing levels were reportedly reduced. In the past, Union Carbide flew plant personnel to West Virginia for intensive training and had teams of U.S. engineers make regular on-site safety inspections. But by 1982, financial pressures led Union Carbide to give up direct supervision of safety at the plant, even though it retained general financial and technical control. No American advisors were resident at Bhopal after 1982.

Management and labor problems followed the financial losses. Morale at the plant was low. “There was widespread belief among employees that the management had taken drastic and imprudent measures to cut costs and that attention to details that ensure safe operation were absent.”

These are only a few of the factors involved in this catastrophe, which also include other technical and human errors within the plant, design errors, management negligence, regulatory deficiencies on the part of the U.S. and Indian governments, and general agricultural and technology transfer policies related to the reason they were making such a dangerous chemical in India in the first place. Any one of these perspectives or “causes” is inadequate by itself to understand the accident and to prevent future ones. In particular, identifying only operator error or sabotage as the root cause of the accident ignores most of the opportunities for the prevention of similar accidents in the future. Many of the systemic causal factors are only indirectly related to the proximate events and conditions preceding the loss.

When all the factors, including indirect and systemic ones, are considered, it becomes clear that the maintenance worker was, in fact, only a minor and somewhat irrelevant player in the loss. Instead, degradation in the safety margin occurred over time and without any particular single decision to do so but simply as a series of decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident. Given the overall state of the Bhopal Union Carbide plant and its operation, if the action of inserting the slip disk had not been left out of the pipe washing operation that December day in 1984, something else would have triggered an accident. In fact, a similar leak had occurred the year before, but did not have the same catastrophic consequences, and the true root causes of that incident were neither identified nor fixed.

To label one event (such as a maintenance worker leaving out the slip disk) or even several events as the root cause or the start of an event chain leading to the Bhopal accident is misleading at best. Rasmussen writes:

The stage for an accidental course of events very likely is prepared through time by the normal efforts of many actors in their respective daily work context, responding to the standing request to be more productive and less costly. Ultimately, a quite normal variation in somebody’s behavior can then release an accident. Had this “root cause” been avoided by some additional safety measure, the accident would very likely be released by another cause at another point in time. In other words, an explanation of the accident in terms of events, acts, and errors is not very useful for design of improved systems.

In general, event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. A narrow focus on technological components and pure engineering activities or a similar narrow focus on operator errors may lead to ignoring some of the most important factors in terms of preventing future accidents. The accident model used to explain why the accident occurred should not only encourage the inclusion of all the causal factors but should provide guidance in identifying these factors.

footnote. Union Carbide lawyers argued that the introduction of water into the MIC tank was an act of sabotage rather than a maintenance worker’s mistake. While this differing interpretation of the initiating event has important implications with respect to legal liability, it makes no difference in the argument presented here regarding the limitations of event-chain models of accidents or even, as will be seen, understanding why this accident occurred.

section 2 2 5. Including Systems Factors in Accident Models.

Large-scale engineered systems are more than just a collection of technological artifacts. They are a reflection of the structure, management, procedures, and culture of the engineering organization that created them. They are usually also a reflection of the society in which they were created. Ralph Miles Jr., in describing the basic concepts of systems theory, notes:

Underlying every technology is at least one basic science, although the technology may be well developed long before the science emerges. Overlying every technical or civil system is a social system that provides purpose, goals, and decision criteria.

Effectively preventing accidents in complex systems requires using accident models that include that social system as well as the technology and its underlying science. Without understanding the purpose, goals, and decision criteria used to construct and operate systems, it is not possible to completely understand and most effectively prevent accidents.

Awareness of the importance of social and organizational aspects of safety goes back to the early days of System Safety. In 1968, Jerome Lederer, then the director of the NASA Manned Flight Safety Program for Apollo, wrote:

System safety covers the total spectrum of risk management. It goes beyond the hardware and associated procedures of system safety engineering. It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other nontechnical but vital influences on the attainment of an acceptable level of risk control. These nontechnical aspects of system safety cannot be ignored.

Too often, however, these nontechnical aspects are ignored.

At least three types of factors need to be considered in accident causation. The first is the proximate event chain, which for the Herald of Free Enterprise includes the assistant boatswain’s not closing the doors and the return of the first officer to the wheelhouse prematurely. Note that there was a redundant design here, with the first officer checking the work of the assistant boatswain, but it did not prevent the accident, as is often the case with redundancy.

The second type of information includes the conditions that allowed the events to occur: the high spring tides, the inadequate design of the ferry loading ramp for this harbor, and the desire of the first officer to stay on schedule (thus leaving the car deck before the doors were closed). All of these conditions can be directly mapped to the events.

The third type of factor consists of the systemic factors that are only indirectly related to the events and conditions, but these indirect factors are critical in fully understanding why the accident occurred and thus how to prevent future accidents. In this case, the systemic factors include the owner of the ferry (Townsend Thoresen) needing ships that were designed to permit fast loading and unloading and quick acceleration in order to remain competitive in the ferry business, and pressure by company management on the captain and first officer to strictly adhere to schedules, also related to competitive factors.

Several attempts have been made to graft systemic factors onto event models, but all have important limitations. The most common approach has been to add hierarchical levels above the event chain. In the seventies, Johnson proposed a model and sequencing method that described accidents as chains of direct events and causal factors arising from contributory factors, which in turn arise from systemic factors (figure 2.7).

Johnson also tried to put management factors into fault trees (a technique called MORT, or Management Oversight Risk Tree), but ended up simply providing a general checklist for auditing management practices. While such a checklist can be very useful, it presupposes that every error can be predefined and put into a checklist form. The checklist consists of a set of questions that should be asked during an accident investigation. Examples of the questions from a DOE MORT User’s Manual are: Was there sufficient training to update and improve needed supervisory skills? Did the supervisors have their own technical staff or access to such individuals? Was there technical support of the right discipline(s) sufficient for the needs of supervisory programs and review functions? Were there established methods for measuring performance that permitted the effectiveness of supervisory programs to be evaluated? Was a maintenance plan provided before startup? Was all relevant information provided to planners and managers? Was it used? Was concern for safety displayed by vigorous, visible personal action by top executives? And so forth. Johnson originally provided hundreds of such questions, and additions have been made to his checklist since he created it in the 1970s, so it is now even larger. The use of the MORT checklist is feasible because the items are so general, but that same generality also limits its usefulness. Something more effective than checklists is needed.

The most sophisticated of the hierarchical add-ons to event chains is Rasmussen and Svedung’s model of the sociotechnical system involved in risk management. As shown in figure 2.8, at the social and organizational levels they use a hierarchical control structure, with levels for government, regulators and associations, company, management, and staff. At all levels they map information flow. The model concentrates on operations; information from the system design and analysis process is treated as input to the operations process. At each level, they model the factors involved using event chains, with links to the event chains at the level below. Notice that they still assume there is a root cause and causal chain of events. A generalization of the Rasmussen and Svedung model, which overcomes these limitations, is presented in chapter 4.

Once again, a new assumption is needed to make progress in learning how to design and operate safer systems.

New Assumption 2. Accidents are complex processes involving the entire sociotechnical system. Traditional event-chain models cannot describe this process adequately.

Most of the accident models underlying safety engineering today stem from the days when the types of systems we were building and the context in which they were built were much simpler. As noted in chapter 1, new technology and social factors are making fundamental changes in the etiology of accidents, requiring changes in the explanatory mechanisms used to understand them and in the engineering techniques applied to prevent them.

Event-based models are limited in their ability to represent accidents as complex processes, particularly at representing systemic accident factors such as structural deficiencies in the organization, management deficiencies, and flaws in the safety culture of the company or industry. We need to understand how the whole system, including the organizational and social components, operating together, led to the loss. While some extensions to event-chain models have been proposed, all are unsatisfactory in important ways.

An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. A narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents. The whole concept of “root cause” needs to be reconsidered.

footnote. When this term is capitalized in this book, it denotes the specific form of safety engineering developed originally by the Defense Department and its contractors for the early ICBM systems and defined by MIL-STD-882. System safety (uncapitalized) or safety engineering denotes all the approaches to engineering for safety.

section 2 3.

Limitations of Probabilistic Risk Assessment.

Assumption 3. Probabilistic risk analysis based on event chains is the best way to assess and communicate safety and risk information.

The limitations of event-chain models are reflected in the current approaches to quantitative risk assessment, most of which use trees or other forms of event chains. Probabilities (or probability density functions) are assigned to the events in the chain and an overall likelihood of a loss is calculated.

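As a small illustration of the arithmetic this involves, the sketch below (the event names and numbers are invented for illustration, not taken from any real assessment) multiplies the probability of a hypothetical initiating event by the conditional probabilities of the later events in the chain:

    # A minimal sketch of an event-chain likelihood calculation.
    # The events and the probabilities are hypothetical.
    chain = [
        ("initiating failure event", 1e-3),
        ("first protection barrier fails, given the initiating event", 1e-2),
        ("second protection barrier fails, given the first has failed", 1e-1),
    ]

    p_loss = 1.0
    for description, probability in chain:
        p_loss *= probability

    print(f"estimated likelihood of the loss: {p_loss:.0e}")  # 1e-06

The discussion that follows questions both ingredients of this calculation: the choice of the initiating event and the assumption that the individual probabilities are independent and well defined.
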
In performing a probabilistic risk assessment (PRA), initiating events in the chain are usually assumed to be mutually exclusive. While this assumption simplifies the mathematics, it may not match reality. As an example, consider the following description of an accident chain for an offshore oil platform:

An initiating event is an event that triggers an accident sequence, e.g., a wave that exceeds the jacket’s capacity that, in turn, triggers a blowout that causes failures of the foundation. As initiating events, they are mutually exclusive; only one of them starts the accident sequence. A catastrophic platform failure can start by failure of the foundation, failure of the jacket, or failure of the deck. These initiating failures are also (by definition) mutually exclusive and constitute the basic events of the probabilistic risk assessment model in its simplest form.

The selection of the failure of the foundation, jacket, or deck as the initiating event is arbitrary, as we have seen, and eliminates from consideration prior events leading to them such as manufacturing or construction problems. The failure of the foundation, for example, might be related to the use of inferior construction materials, which in turn might be related to budget deficiencies or lack of government oversight.

In addition, there does not seem to be any reason for assuming that initiating failures are mutually exclusive and that only one starts the accident, except perhaps again to simplify the mathematics. In accidents, seemingly independent failures may have a common systemic cause (often not a failure) that results in coincident failures. For example, the same pressures to use inferior materials in the foundation may result in their use in the jacket and the deck, leading to a wave causing coincident, dependent failures in all three. Alternatively, the design of the foundation (a systemic factor rather than a failure event) may lead to pressures on the jacket and deck when stresses cause deformities in the foundation. Treating such events as independent may lead to unrealistic risk assessments.

In the Bhopal accident, the vent scrubber, flare tower, water spouts, refrigeration unit, and various monitoring instruments were all out of operation simultaneously. Assigning probabilities to all these seemingly unrelated events and assuming independence would lead one to believe that this accident was merely a matter of a once-in-a-lifetime coincidence. A probabilistic risk assessment based on an event-chain model most likely would have treated these conditions as independent failures and then calculated their coincidence as being so remote as to be beyond consideration. Reason, in his popular Swiss Cheese Model of accident causation based on defense in depth, does the same, arguing that in general “the chances of such a trajectory of opportunity finding loopholes in all the defences at any one time is very small indeed”. As suggested earlier, a closer look at Bhopal and, indeed, most accidents paints a quite different picture and shows these were not random failure events but were related to engineering and management decisions stemming from common systemic factors.

Most accidents in well-designed systems involve two or more low-probability events occurring in the worst possible combination. When people attempt to predict system risk, they explicitly or implicitly multiply events with low probability, assuming independence, and come out with impossibly small numbers, when, in fact, the events are dependent. This dependence may be related to common systemic factors that do not appear in an event chain. Machol calls this phenomenon the Titanic coincidence.

A number of “coincidences” contributed to the Titanic accident and the subsequent loss of life. For example, the captain was going far too fast for existing conditions, a proper watch for icebergs was not kept, the ship was not carrying enough lifeboats, lifeboat drills were not held, the lifeboats were lowered properly but arrangements for manning them were insufficient, and the radio operator on a nearby ship was asleep and so did not hear the distress call. Many of these events or conditions may be considered independent but appear less so when we consider that overconfidence due to incorrect engineering analyses about the safety and unsinkability of the ship most likely contributed to the excessive speed, the lack of a proper watch, and the insufficient number of lifeboats and drills. That the collision occurred at night contributed to the iceberg not being easily seen, made abandoning ship more difficult than it would have been during the day, and was a factor in why the nearby ship’s operator was asleep. Assuming independence here leads to a large underestimate of the true risk.

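A small numerical sketch (with invented probabilities) shows how large the underestimate can be when two nominally independent protection failures actually share a common systemic cause, such as the cost pressures described earlier:

    # Hypothetical numbers: each protection system fails with probability 0.01.
    p_single_failure = 0.01

    # Independence assumption: simply multiply the two failure probabilities.
    p_both_independent = p_single_failure ** 2  # 1.0e-04

    # Now assume a common systemic condition, present 10% of the time, that
    # raises each system's failure probability to 0.08 while it is present.
    p_condition = 0.10
    p_fail_given_condition = 0.08
    # Chosen so that each system's overall failure probability stays at 0.01.
    p_fail_otherwise = (p_single_failure
                        - p_condition * p_fail_given_condition) / (1 - p_condition)

    p_both_dependent = (p_condition * p_fail_given_condition ** 2
                        + (1 - p_condition) * p_fail_otherwise ** 2)

    print(f"assuming independence: {p_both_independent:.1e}")  # 1.0e-04
    print(f"with a common cause:   {p_both_dependent:.1e}")    # about 6.4e-04

Each system still fails 1 percent of the time overall, yet the joint failure is roughly six times more likely than the independence calculation suggests, and the gap grows rapidly as more “independent” layers of protection share the same systemic cause.
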
Another problem in probabilistic risk assessment (PRA) is the emphasis on failure events: design errors are usually omitted and only come into the calculation indirectly through the probability of the failure event. Accidents involving dysfunctional interactions among nonfailing (operational) components, that is, component interaction accidents, are usually not considered. Systemic factors also are not reflected. In the offshore oil platform example at the beginning of this section, the true probability density function for the failure of the deck might reflect a poor design for the conditions the deck must withstand (a human design error) or, as noted earlier, the use of inadequate construction materials due to lack of government oversight or project budget limitations.

When historical data are used to determine the failure probabilities used in the PRA, nonfailure factors, such as design errors or unsafe management decisions, may differ between the historic systems from which the data were derived and the system under consideration. It is possible (and obviously desirable) for each PRA to include a description of the conditions under which the probabilities were derived. If such a description is not included, it may not be possible to determine whether conditions in the platform being evaluated differ from those built previously in ways that might significantly alter the risk. The introduction of a new design feature or of active control by a computer might greatly affect the probability of failure, and the usefulness of data from previous experience then becomes highly questionable.

The most dangerous result of using PRA arises from considering only immediate physical failures. Latent design errors may be ignored and go uncorrected due to overconfidence in the risk assessment. An example, which is a common but dangerous practice judging from its implication in a surprising number of accidents, is wiring a valve to detect only that power has been applied to open or close it and not that the valve position has actually changed. In one case, an Air Force system included a relief valve to be opened by the operator to protect against overpressurization. A second, backup relief valve was installed in case the primary valve failed. The operator needed to know that the first valve had not opened, however, in order to determine that the backup valve must be activated. One day, the operator issued a command to open the primary valve. The position indicator and open indicator lights both illuminated, but the primary relief valve was not open. The operator, thinking the primary valve had opened, did not activate the backup valve, and an explosion occurred.

A post-accident investigation discovered that the indicator light circuit was wired to indicate presence of power at the valve, but it did not indicate valve position. Thus, the indicator showed only that the activation button had been pushed, not that the valve had operated. An extensive probabilistic risk assessment of this design had correctly assumed a low probability of simultaneous failure for the two relief valves, but had ignored the possibility of a design error in the electrical wiring. The probability of that design error was not quantifiable. If it had been identified, of course, the proper solution would have been to eliminate the design error, not to assign a probability to it. The same type of design flaw was a factor in the Three Mile Island accident: an indicator misleadingly showed that a discharge valve had been ordered closed but not that it had actually closed. In fact, the valve was blocked in an open position.

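The distinction the investigation turned on can be made concrete with a short sketch. The names and interfaces below are invented for illustration and are not the actual design of either system mentioned above; the first indicator follows only the command signal, as in the accident, while the second is driven by position sensors on the valve itself:

    # Illustrative sketch only; the functions and signals are hypothetical.

    def command_based_indicator(open_command_powered: bool) -> str:
        # Wired like the accident system: the light follows the command
        # signal, so it reads "OPEN" whether or not the valve actually moved.
        return "OPEN" if open_command_powered else "CLOSED"

    def position_based_indicator(open_limit_switch: bool,
                                 closed_limit_switch: bool) -> str:
        # Driven by independent position sensors on the valve itself.
        if open_limit_switch and not closed_limit_switch:
            return "OPEN"
        if closed_limit_switch and not open_limit_switch:
            return "CLOSED"
        return "FAULT"  # no confirmed position: alert the operator

    # The accident scenario: the open command was issued, but the valve stuck closed.
    print(command_based_indicator(open_command_powered=True))    # OPEN (misleading)
    print(position_based_indicator(open_limit_switch=False,
                                   closed_limit_switch=True))    # CLOSED

An indicator of the second kind would have shown the operator that the primary relief valve had not opened, which is exactly the information needed to decide to activate the backup.
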
In addition to these limitations of PRA for electromechanical systems, current methods for quantifying risk that are based on combining probabilities of individual component failures and mutually exclusive events are not appropriate for systems controlled by software and by humans making cognitively complex decisions, and there is no effective way to incorporate management and organizational factors, such as flaws in the safety culture, despite many well-intentioned efforts to do so. As a result, these critical factors in accidents are often omitted from risk assessment because analysts do not know how to obtain a “failure” probability, or alternatively, a number is pulled out of the air for convenience. If we knew enough to measure these types of design flaws, it would be better to fix them than to try to measure them.

Another possibility for future progress is usually not considered:

New Assumption 3. Risk and safety may be best understood and communicated in ways other than probabilistic risk analysis.

Understanding risk is important in decision making. Many people assume that risk information is most appropriately communicated in the form of a probability. Much has been written, however, about the difficulty people have in interpreting probabilities. Even if people could use such values appropriately, the tools commonly used to compute these quantities, which are based on computing probabilities of failure events, have serious limitations. An accident model that is not based on failure events, such as the one introduced in this book, could provide an entirely new basis for understanding and evaluating safety and, more generally, risk.

footnote. Watt defined a related phenomenon he called the Titanic effect to explain the fact that major accidents are often preceded by a belief that they cannot happen. The Titanic effect says that the magnitude of disasters decreases to the extent that people believe that disasters are possible and plan to prevent them or to minimize their effects.

section 2 4.

The Role of Operators in Accidents.

Assumption 4. Most accidents are caused by operator error. Rewarding safe behavior and punishing unsafe behavior will eliminate or reduce accidents significantly.

As we have seen, the definition of “caused by” is debatable. But the fact remains that if there are operators in the system, they are most likely to be blamed for an accident. This phenomenon is not new. In the nineteenth century, coupling accidents on railroads were one of the main causes of injury and death to railroad workers. In the seven years between 1888 and 1894, 16,000 railroad workers were killed in coupling accidents and 170,000 were crippled. Managers claimed that such accidents were due only to worker error and negligence, and therefore nothing could be done aside from telling workers to be more careful. The government finally stepped in and required that automatic couplers be installed. As a result, fatalities dropped sharply. According to the June 1896 issue of Scientific American (three years after Congress acted on the problem):

Few battles in history show so ghastly a fatality. A large percentage of these deaths were caused by the use of imperfect equipment by the railroad companies; twenty years ago it was practically demonstrated that cars could be automatically coupled, and that it was no longer necessary for a railroad employee to imperil his life by stepping between two cars about to be connected. In response to appeals from all over, the U.S. Congress passed the Safety Appliance Act in March 1893. It has or will cost the railroads $50,000,000 to fully comply with the provisions of the law. Such progress has already been made that the death rate has dropped by 35 per cent.

section 2 4 1. Do Operators Cause Most Accidents?

The tendency to blame the operator is not simply a nineteenth-century problem, but persists today. During and after World War II, the Air Force had serious problems with aircraft accidents. From 1952 to 1966, for example, 7,715 aircraft were lost and 8,547 people killed. Most of these accidents were blamed on pilots. Some aerospace engineers in the 1950s did not believe the cause was so simple and argued that safety must be designed and built into aircraft just as are performance, stability, and structural integrity. Although a few seminars were conducted and papers written about this approach, the Air Force did not take it seriously until they began to develop intercontinental ballistic missiles: there were no pilots to blame for the frequent and devastating explosions of these liquid-propellant missiles. In having to confront factors other than pilot error, the Air Force began to treat safety as a system problem, and System Safety programs were developed to deal with them. Similar adjustments in attitude and practice may be forced in the future by the increasing use of unmanned autonomous aircraft and other automated systems.

It is still common to see statements that 70 percent to 80 percent of aircraft accidents are caused by pilot error or that 85 percent of work accidents are due to unsafe acts by workers rather than unsafe conditions. However, closer examination shows that the data may be biased and incomplete: the less that is known about an accident, the more likely it will be attributed to operator error. Thorough investigation of serious accidents almost invariably finds other factors.

Part of the problem stems from the use of the chain-of-events model in accident investigation, because it is difficult to find an event preceding and causal to the operator behavior, as mentioned earlier. If the problem is in the system design, there is no proximal event to explain the error, only a flawed decision during system design.

Even if a technical failure precedes the human action, the tendency is to put the blame on an inadequate response to the failure by an operator. Perrow claims that even in the best of industries, there is rampant attribution of accidents to operator error, to the neglect of errors by designers or managers. He cites a U.S. Air Force study of aviation accidents demonstrating that the designation of human error (pilot error in this case) is a convenient classification for mishaps whose real cause is uncertain, complex, or embarrassing to the organization.

Besides the fact that operator actions represent a convenient stopping point in an event chain, other reasons for the operator error statistics include: (1) operator actions are generally reported only when they have a negative impact on safety and not when they are responsible for preventing accidents; (2) blame may be based on unrealistic expectations that operators can overcome every emergency; (3) operators may have to intervene at the limits of system behavior, when the consequences of not succeeding are likely to be serious and the situation is often one that the designer never anticipated and that was not covered by the operator’s training; and (4) hindsight often allows us to identify a better decision in retrospect, but detecting and correcting potential errors before they have been made obvious by an accident is far more difficult.

section 2 4 2. Hindsight Bias.

The psychological phenomenon called hindsight bias plays such an important role in attribution of causes to accidents that it is worth spending time on it. The report on the Clapham Junction railway accident in Britain concluded:

There is almost no human action or decision that cannot be made to look flawed and less sensible in the misleading light of hindsight. It is essential that the critic should keep himself constantly aware of that fact.

After an accident, it is easy to see where people went wrong, what they should have done or not done, to judge people for missing a piece of information that turned out to be critical, and to see exactly the kind of harm that they should have foreseen or prevented. Before the event, such insight is difficult and, perhaps, impossible.

Dekker points out that hindsight allows us to:

1. Oversimplify causality because we can start from the outcome and reason backward to presumed or plausible “causes.”

2. Overestimate the likelihood of the outcome, and people’s ability to foresee it, because we already know what the outcome is.

3. Overrate the role of rule or procedure “violations.” There is always a gap between written guidance and actual practice, but this gap almost never leads to trouble. It only takes on causal significance once we have a bad outcome to look at and reason about.

4. Misjudge the prominence or relevance of data presented to people at the time.

5. Match outcome with the actions that went before it. If the outcome was bad, then the actions leading up to it must have also been bad: missed opportunities, bad assessments, wrong decisions, and misperceptions.

Avoiding hindsight bias requires changing our emphasis in analyzing the role of humans in accidents from what they did wrong to why it made sense for them to act the way they did.

section 2 4 3. The Impact of System Design on Human Error.

All human activity takes place within and is influenced by the environment, both physical and social, in which it takes place. It is, therefore, often very difficult to separate system design error from operator error. In highly automated systems, the operator is often at the mercy of the system design and operational procedures. One of the major mistakes made by the operators at Three Mile Island was following the procedures provided to them by the utility. The instrumentation design also did not provide the information they needed to act effectively in recovering from the hazardous state.

In the lawsuits following the 1995 B757 Cali accident, American Airlines was held liable for the crash based on the Colombian investigators blaming crew error entirely for the accident. The official accident investigation report cited the following four causes for the loss:

1. The flightcrew’s failure to adequately plan and execute the approach to runway 19 and their inadequate use of automation.

2. Failure of the flightcrew to discontinue their approach, despite numerous cues alerting them of the inadvisability of continuing the approach.

3. The lack of situational awareness of the flightcrew regarding vertical navigation, proximity to terrain, and the relative location of critical radio aids.

4. Failure of the flightcrew to revert to basic radio navigation at a time when the FMS-assisted navigation became confusing and demanded an excessive workload in a critical phase of the flight.

Look in particular at the fourth identified cause: the blame is placed on the pilots when the automation became confusing and demanded an excessive workload, rather than on the design of the automation. To be fair, the report also identifies two “contributory factors” (but not causes):

1. FMS logic that dropped all intermediate fixes from the display(s) in the event of execution of a direct routing.

2. FMS-generated navigational information that used a different naming convention from that published in navigational charts.

These two “contributory factors” are highly related to the third cause, the pilots’ “lack of situational awareness.” Even using an event-chain model of accidents, the FMS-related events preceded and contributed to the pilot errors. There seems to be no reason why, at the least, they should be treated any differently than the labeled “causes.” There were also many other factors in this accident that were not reflected in either the identified causes or contributory factors.

In this case, the Cali accident report conclusions were challenged in court. A U.S. appeals court rejected the conclusion of the report about the four causes of the accident, which led to a lawsuit by American Airlines in a federal court in which American alleged that components of the automated aircraft system made by Honeywell Air Transport Systems and Jeppesen Sanderson helped cause the crash. American blamed the software, saying Jeppesen stored the location of the Cali airport beacon in a different file from most other beacons. Lawyers for the computer companies argued that the beacon code could have been properly accessed and that the pilots were in error. The jury concluded that the two companies produced a defective product and that Jeppesen was 17 percent responsible, Honeywell was 8 percent at fault, and American was held to be 75 percent responsible. While such distribution of responsibility may be important in determining how much each company will have to pay, it is arbitrary and does not provide any important information with respect to accident prevention in the future. The verdict is interesting, however, because the jury rejected the oversimplified notion of causality being argued. It was also one of the first cases not settled out of court where the role of software in the loss was acknowledged.

This case, however, does not seem to have had much impact on the attribution of pilot error in later aircraft accidents.

Part of the problem is engineers’ tendency to equate people with machines. Human “failure” usually is treated the same as a physical component failure: a deviation from the performance of a specified or prescribed sequence of actions. This definition is equivalent to that of machine failure. Alas, human behavior is much more complex than that of machines.

As many human factors experts have found, instructions and written procedures are almost never followed exactly as operators try to become more efficient and productive and to deal with time pressures. In studies of operators, even in such highly constrained and high-risk environments as nuclear power plants, modification of instructions is repeatedly found. When examined, these violations of rules appear to be quite rational, given the workload and timing constraints under which the operators must do their job. The explanation lies in the basic conflict between error viewed as a deviation from the normative procedure and error viewed as a deviation from the rational and normally used effective procedure.

One implication is that following an accident, it will be easy to find someone involved in the dynamic flow of events who has violated a formal rule by following established practice rather than specified practice. Given the frequent deviation of established practice from normative work instructions and rules, it is not surprising that operator “error” is found to be the cause of 70 percent to 80 percent of accidents. As noted in the discussion of assumption 2, a root cause is often selected because that event involves a deviation from a standard.

section 2 4 4. The Role of Mental Models.

The updating of human mental models plays a significant role here (figure 2.9). Both the designer and the operator will have their own mental models of the plant. It is quite natural for the designer’s and operator’s models to differ and even for both to have significant differences from the actual plant as it exists. During development, the designer evolves a model of the plant to the point where it can be built. The designer’s model is an idealization formed before the plant is constructed. Significant differences may exist between this ideal model and the actual constructed system. Besides construction variances, the designer always deals with ideals or averages, not with the actual components themselves. Thus, a designer may have a model of a valve with an average closure time, while real valves have closure times that fall somewhere along a continuum of timing behavior that reflects manufacturing and material differences. The designer’s idealized model is used to develop operator work instructions and training. But the actual system may differ from the designer’s model because of manufacturing and construction variances and evolution and changes over time.

The operator’s model of the system will be based partly on formal training created from the designer’s model and partly on experience with the system. The operator must cope with the system as it is constructed and not as it may have been envisioned. As the physical system changes and evolves over time, the operator’s model and operational procedures must change accordingly. While the formal procedures, work instructions, and training will be updated periodically to reflect the current operating environment, there is necessarily always a time lag. In addition, the operator may be working under time and productivity pressures that are not reflected in the idealized procedures and training.

Operators use feedback to update their mental models of the system as the system evolves. The only way for the operator to determine that the system has changed and that his or her mental model must be updated is through experimentation: to learn where the boundaries of safe behavior currently are, occasionally they must be crossed.

Experimentation is important at all levels of control. For manual tasks where the optimization criteria are speed and smoothness, the limits of acceptable adaptation and optimization can only be known from the error experienced when occasionally crossing a limit. Errors are an integral part of maintaining a skill at an optimal level and a necessary part of the feedback loop to achieve this goal. The role of such experimentation in accidents cannot be understood by treating human errors as events in a causal chain separate from the feedback loops in which they operate.

At higher levels of cognitive control and supervisory decision making, experimentation is needed for operators to update procedures to handle changing conditions or to evaluate hypotheses while engaged in reasoning about the best response to unexpected situations. Actions that are quite rational and important during the search for information and test of hypotheses may appear to be unacceptable mistakes in hindsight, without access to the many details of a “turbulent” situation.

The ability to adapt mental models through experience in interacting with the operating system is what makes the human operator so valuable. For the reasons discussed, the operators’ actual behavior may differ from the prescribed procedures because it is based on current inputs and feedback. When the deviation is correct (the designers’ models are less accurate than the operators’ models at that particular instant in time), then the operators are considered to be doing their job. When the operators’ models are incorrect, they are often blamed for any unfortunate results, even though their incorrect mental models may have been reasonable given the information they had at the time.

Providing feedback and allowing for experimentation in system design, then, is critical in allowing operators to optimize their control ability. In the less automated system designs of the past, operators naturally had this ability to experiment and update their mental models of the current system state. Designers of highly automated systems sometimes do not understand this requirement and design automation that takes operators “out of the loop.” Everyone is then surprised when the operator makes a mistake based on an incorrect mental model. Unfortunately, the reaction to such a mistake is to add even more automation and to marginalize the operators even more, thus exacerbating the problem.

Flawed decisions may also result from limitations in the boundaries of the operator’s or designer’s model. Decision makers may simply have too narrow a view of the system their decisions will impact. Recall figure 2.2 and the discussion of the Herald of Free Enterprise accident. The boundaries of the system model relevant to a particular decision maker may depend on the activities of several other decision makers found within the total system. Accidents may result from the interaction and side effects of their decisions based on their limited model. Before an accident, it will be difficult for the individual decision makers to see the full picture during their daily operational decision making and to judge the current state of the multiple defenses and safety margins that are partly dependent on decisions made by other people in other departments and organizations.

Rasmussen stresses that most decisions are sound using local judgment criteria and given the time and budget pressures and short-term incentives that shape behavior. Experts do their best to meet local conditions and in the busy daily flow of activities may be unaware of the potentially dangerous side effects of their behavior. Each individual decision may appear safe and rational within the context of the individual work environments and local pressures, but may be unsafe when considered as a whole: it is difficult, if not impossible, for any individual to judge the safety of their decisions when it is dependent on the decisions made by other people in other departments and organizations.

Decentralized decision making is, of course, required in some time-critical situations. But like all safety-critical decision making, the decentralized decisions must be made in the context of system-level information and from a total systems perspective in order to be effective in reducing accidents. One way to make distributed decision making safe is to decouple the system components in the overall system design, if possible, so that decisions do not have system-wide repercussions. Another common way to deal with the problem is to specify and train standard emergency responses. Operators may be told to sound the evacuation alarm any time an indicator reaches a certain level. In this way, safe procedures are determined at the system level and operators are socialized and trained to provide uniform and appropriate responses to crisis situations.

There are situations, of course, when unexpected conditions occur and avoiding losses requires the operators to violate the specified (and in such cases unsafe) procedures. If the operators are expected to make decisions in real time and not just follow a predetermined procedure, then they usually must have the relevant system-level information about the situation in order to make safe decisions. This is not required, of course, if the system design decouples the components and thus allows operators to make independent safe decisions. Such decoupling must be designed into the system, however.

Some high reliability organization (HRO) theorists have argued just the opposite. They have asserted that HROs are safe because they allow professionals at the front lines to use their knowledge and judgment to maintain safety. During crises, they argue, decision making in HROs migrates to the front-line workers who have the necessary judgment to make decisions. The problem is that the assumption that front-line workers will have the necessary knowledge and judgment to make decisions is not necessarily true. One example is the friendly fire accident analyzed in chapter 5, where the pilots ignored the rules of engagement they were told to follow and decided to make real-time decisions on their own based on the inadequate information they had.

Many of the HRO theories were derived from studying safety-critical systems, such as aircraft carrier flight operations. La Porte and Consolini, for example, argue that while the operation of aircraft carriers is subject to the Navy’s chain of command, even the lowest-level seaman can abort landings. Clearly, this local authority is necessary in the case of aborted landings because decisions must be made too quickly to go up a chain of command. But note that such low-level personnel can only make decisions in one direction, that is, they may only abort landings. In essence, they are allowed to change to an inherently safe state (a go-around) with respect to the hazard involved. System-level information is not needed because a safe state exists that has no conflicts with other hazards, and the actions governed by these decisions and the conditions for making them are relatively simple. Aircraft carriers are usually operating in areas containing little traffic (they are decoupled from the larger system), and therefore localized decisions to abort are almost always safe and can be allowed from a larger system safety viewpoint.

Consider a slightly different situation, however, where a pilot makes a decision to go around (abort a landing) at a busy urban airport. While executing a go-around when a clear danger exists if the pilot lands is obviously the right decision, there have been near misses when a pilot executed a go-around and came too close to another aircraft that was taking off on a perpendicular runway. The solution to this problem is not at the decentralized level: the individual pilot lacks the system-level information to avoid hazardous system states in this case. Instead, the solution must be at the system level, where the danger must be reduced by instituting different landing and takeoff procedures, building new runways, redistributing air traffic, or by making other system-level changes. We want pilots to be able to execute a go-around if they feel it is necessary, but unless the encompassing system is designed to prevent collisions, the action decreases one hazard while increasing a different one. Safety is a system property.

section 2 4 5. An Alternative View of Human Error.

Traditional decision-making research views decisions as discrete processes that can be separated from the context in which the decisions are made and studied as an isolated phenomenon. This view is starting to be challenged. Instead of thinking of operations as predefined sequences of actions, human interaction with a system is increasingly being considered to be a continuous control task in which separate “decisions” or errors are difficult to identify.

Edwards, back in 1962, was one of the first to argue that decisions can only be understood as part of an ongoing process. The state of the system is perceived in terms of possible actions, one of these actions is chosen, and the resulting response from the controlled system acts as a background for the next actions. Errors then are difficult to localize in the stream of behavior; the effects of less successful actions are a natural part of the search by the operator for optimal performance. As an example, consider steering a boat. The helmsman of ship A may see an obstacle ahead (perhaps another ship) and decide to steer the boat to the left to avoid it. The wind, current, and wave action may require the helmsman to make continual adjustments in order to hold the desired course. At some point, the other ship may also change course, making the helmsman’s first decision about what would be a safe course no longer correct and needing to be revised. Steering then can be perceived as a continuous control activity or process, with what is the correct and safe behavior changing over time and with respect to the results of prior behavior. The helmsman’s mental model of the effects of the actions of the sea and the assumed behavior of the other ship has to be continually adjusted.

Not only are individual unsafe actions difficult to identify in this nontraditional control model of human decision making, but the study of decision making cannot be separated from a simultaneous study of the social context, the value system in which it takes place, and the dynamic work process it is intended to control. This view is the foundation of some modern trends in decision-making research, such as dynamic decision making, the new field of naturalistic decision making, and the approach to safety described in this book.

As argued by Rasmussen and others, devising more effective accident models that go beyond the simple event chain and human failure models requires shifting the emphasis in explaining the role of humans in accidents from error (that is, deviations from normative procedures) to focus instead on the mechanisms and factors that shape human behavior, that is, the performance-shaping context in which human actions take place and decisions are made. Modeling human behavior by decomposing it into decisions and actions and studying it as a phenomenon isolated from the context in which the behavior takes place is not an effective way to understand behavior.

The alternative view requires a new approach to representing and understanding human behavior, focused not on human error and violation of rules but on the mechanisms generating behavior in the actual, dynamic context. Such an approach must take into account the work system constraints, the boundaries of acceptable performance, the need for experimentation, and the subjective criteria guiding adaptation to change. In this approach, traditional task analysis is replaced or augmented with cognitive work analysis. Behavior is modeled in terms of the objectives of the decision maker, the boundaries of acceptable performance, the behavior-shaping constraints of the environment (including the value system and safety constraints), and the adaptive mechanisms of the human actors.

Such an approach leads to new ways of dealing with the human contribution to accidents and human “error.” Instead of trying to control human behavior by fighting deviations from specified procedures, focus should be on controlling behavior by identifying the boundaries of safe performance (the behavioral safety constraints), by making the boundaries explicit and known, by giving opportunities to develop coping skills at the boundaries, by designing systems to support safe optimization and adaptation of performance in response to contextual influences and pressures, by providing means for identifying potentially dangerous side effects of individual decisions in the network of decisions over the entire system, by designing for error tolerance (making errors observable and reversible before safety constraints are violated), and by counteracting the pressures that drive operators and decision makers to violate safety constraints.

Once again, future progress in accident reduction requires tossing out the old assumption and substituting a new one:

New Assumption 4. Operator behavior is a product of the environment in which it occurs. To reduce operator “error” we must change the environment in which the operator works.

Human behavior is always influenced by the environment in which it takes place. Changing that environment will be much more effective in changing operator error than the usual behaviorist approach of using reward and punishment. Without changing the environment, human error cannot be reduced for long. We design systems in which operator error is inevitable, and then blame the operator and not the system design.

As argued by Rasmussen and others, devising more effective accident causality models requires shifting the emphasis in explaining the role that humans play in accidents from error (deviations from normative procedures) to focus on the mechanisms and factors that shape human behavior, that is, the performance-shaping features and context in which human actions take place and decisions are made. Modeling behavior by decomposing it into decisions and actions or events, which nearly all current accident models do, and studying it as a phenomenon isolated from the context in which the behavior takes place is not an effective way to understand behavior.

section 2 5.

The Role of Software in Accidents.

Assumption 5: Highly reliable software is safe.

The most common approach to ensuring safety when the system includes software is to try to make the software highly reliable. To help readers who are not software professionals see the flaws in this assumption, a few words about software in general may be helpful.

The uniqueness and power of the digital computer over other machines stems from the fact that, for the first time, we have a general purpose machine: We no longer need to build a mechanical or analog autopilot from scratch, for example, but simply to write down the “design” of an autopilot in the form of instructions or steps to accomplish the desired goals. These steps are then loaded into the computer, which, while executing the instructions, in effect becomes the special-purpose machine (the autopilot). If changes are needed, the instructions can be changed and the same physical machine (the computer hardware) is used instead of having to build a different physical machine from scratch. Software in essence is the design of a machine abstracted from its physical realization. In other words, the logical design of a machine (the software) is separated from the physical design of that machine (the computer hardware).

Machines that previously were physically impossible or impractical to build become feasible, and the design of a machine can be changed quickly without going through an entire retooling and manufacturing process. In essence, the manufacturing phase is eliminated from the lifecycle of these machines: the physical parts of the machine (the computer hardware) can be reused, leaving only the design and verification phases. The design phase also has changed: The designer can concentrate on identifying the steps to be achieved without having to worry about how those steps will be realized physically.

These advantages of using computers (along with others specific to particular applications, such as reduced size and weight) have led to an explosive increase in their use, including their introduction into potentially dangerous systems. There are, however, some potential disadvantages of using computers and some important changes that their use introduces into the traditional engineering process that are leading to new types of accidents as well as creating difficulties in investigating accidents and preventing them.

One of the most important changes is that with computers, the design of the special-purpose machine is usually created by someone who is not an expert on designing such machines. The autopilot design expert, for example, decides how the autopilot should work, and then provides that information to a software engineer, who is an expert in software design but not autopilots. It is the software engineer who then creates the detailed design of the autopilot. The extra communication step between the engineer and the software developer is the source of the most serious problems with software today.

It should not be surprising, then, that most errors found in operational software can be traced to requirements flaws, particularly incompleteness. Completeness is a quality often associated with requirements but rarely defined. The most appropriate definition in the context of this book has been proposed by Jaffe: Software requirements specifications are complete if they are sufficient to distinguish the desired behavior of the software from that of any other undesired program that might be designed.

Nearly all the serious accidents in which software has been involved in the past twenty years can be traced to requirements flaws, not coding errors. The requirements may reflect incomplete or wrong assumptions

• About the operation of the system components being controlled by the software (for example, how quickly the component can react to a software-generated control command) or

• About the required operation of the computer itself

In the Mars Polar Lander loss, the software requirements did not include information about the potential for the landing leg sensors to generate noise or, alternatively, an instruction to ignore any inputs from the sensors while the spacecraft was more than forty meters above the planet surface. In the batch chemical reactor accident, the software engineers were never told to open the water valve before the catalyst valve and apparently thought the ordering was therefore irrelevant.
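
To make the point concrete, the two omitted requirements above can be written down as explicit, reviewable constraints in the control logic. The following sketch is hypothetical: the names, the forty-meter arming check, and the valve-ordering routine are illustrative assumptions made for this book-style example, not the actual flight or plant software.

    # Hypothetical sketch only: illustrative names and structure, not the real code.

    ARMING_ALTITUDE_M = 40.0  # requirement discussed above: ignore leg-sensor
                              # signals while still above this altitude

    def touchdown_detected(leg_sensor_signals, altitude_m):
        """Treat leg-sensor signals as touchdown only when the spacecraft is low
        enough for them to be meaningful; deployment transients are ignored above."""
        if altitude_m > ARMING_ALTITUDE_M:
            return False
        return any(leg_sensor_signals)

    def start_batch_reaction(open_valve):
        """Reactor start-up with the valve ordering stated explicitly: cooling
        water must flow before the catalyst is admitted."""
        open_valve("water")     # safety constraint: water valve first
        open_valve("catalyst")  # only then open the catalyst valve

The value of writing such constraints down is not the code itself but that the assumptions become visible and can be checked against the system-level hazard analysis.
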
The problems may also stem from unhandled controlled-system states and environmental conditions. An F-18 was lost when a mechanical failure in the aircraft led to the inputs arriving faster than expected, which overwhelmed the software. Another F-18 loss resulted from the aircraft getting into an attitude that the engineers had assumed was impossible and that the software was not programmed to handle.
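
One defensive pattern suggested by these examples is to make the design assumptions about the environment explicit and checkable at run time rather than leaving them implicit. The sketch below is hypothetical (the limits, names, and fallback behavior are illustrative assumptions); it does not by itself make the system safe, but it turns a violated assumption into a detectable event instead of unpredictable behavior.

    # Hypothetical sketch: environmental assumptions stated and checked explicitly.

    MAX_INPUTS_PER_CYCLE = 10    # assumed worst-case input arrival rate
    MAX_ANGLE_OF_ATTACK = 40.0   # flight condition the design assumed possible

    def control_cycle(inputs, angle_of_attack, handle_input, enter_safe_mode):
        # Assumption 1: inputs never arrive faster than the design anticipated.
        if len(inputs) > MAX_INPUTS_PER_CYCLE:
            return enter_safe_mode("input rate exceeds design assumption")
        # Assumption 2: the aircraft never reaches a state assumed "impossible."
        if angle_of_attack > MAX_ANGLE_OF_ATTACK:
            return enter_safe_mode("flight state outside assumed envelope")
        for item in inputs:
            handle_input(item)
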
In these cases, simply trying to get the software “correct” in terms of accurately implementing the requirements will not make it safer. Software may be highly reliable and correct and still be unsafe when:

1. The software correctly implements the requirements, but the specified behavior is unsafe from a system perspective (illustrated in the sketch following this list).

2. The software requirements do not specify some particular behavior required for system safety (that is, they are incomplete).

3. The software has unintended (and unsafe) behavior beyond what is specified in the requirements.
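
The first case can be illustrated with a deliberately simplified, hypothetical example (the function, the test, and the spec wording are assumptions made for illustration): verification against the stated requirement succeeds, yet the requirement itself commands behavior that is unsafe at the system level.

    # Hypothetical illustration of case 1: correct to its specification, unsafe anyway.

    def descent_engine_command(touchdown_signal_seen):
        """Specified behavior: command engine shutdown as soon as any touchdown
        signal has been observed."""
        return "shutdown" if touchdown_signal_seen else "continue_burn"

    def test_matches_specification():
        # The implementation is "correct" and "reliable" with respect to the spec...
        assert descent_engine_command(True) == "shutdown"
        assert descent_engine_command(False) == "continue_burn"
        # ...but the spec says nothing about spurious signals generated during leg
        # deployment, so a flawless implementation can still shut the engine down
        # far above the surface.
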
If the problems stem from the software doing what the software engineer thought it should do when that is not what the original design engineer wanted, the use of integrated product teams and other project management schemes to help with communication is useful. The most serious problems arise, however, when nobody understands what the software should do or even what it should not do. We need better techniques to assist in determining these requirements.

There is not only anecdotal but some hard data to support the hypothesis that safety problems in software stem from requirements flaws and not coding errors. Lutz examined 387 software errors uncovered during integration and system testing of the Voyager and Galileo spacecraft. She concluded that the software errors identified as potentially hazardous to the system tended to be produced by different error mechanisms than non-safety-related software errors. She showed that for these two spacecraft, the safety-related software errors arose most commonly from (1) discrepancies between the documented requirements specifications and the requirements needed for correct functioning of the system and (2) misunderstandings about the software’s interface with the rest of the system. They did not involve coding errors in implementing the documented requirements.

Many software requirements problems arise from what could be called the curse of flexibility. The computer is so powerful and so useful because it has eliminated many of the physical constraints of previous machines. This is both its blessing and its curse: We no longer have to worry about the physical realization of our designs, but we also no longer have physical laws that limit the complexity of our designs. Physical constraints enforce discipline on the design, construction, and modification of our design artifacts. Physical constraints also control the complexity of what we build. With software, the limits of what is possible to accomplish are different than the limits of what can be accomplished successfully and safely; the limiting factors change from the structural integrity and physical constraints of our materials to limits on our intellectual capabilities.

It is possible and even quite easy to build software that we cannot understand in terms of being able to determine how it will behave under all conditions. We can construct software (and often do) that goes beyond human intellectual limits. The result has been an increase in component interaction accidents stemming from intellectual unmanageability that allows potentially unsafe interactions to go undetected during development. The software often controls the interactions among the system components, so its close relationship with component interaction accidents should not be surprising. But this fact has important implications for how software must be engineered when it controls potentially unsafe systems or products: Software or system engineering techniques that simply ensure software reliability or correctness (consistency of the code with the requirements) will have little or no impact on safety.

Techniques that are effective will rest on a new assumption:

New Assumption 5: Highly reliable software is not necessarily safe. Increasing software reliability or reducing implementation errors will have little impact on safety.

section 2 6.

Static versus Dynamic Views of Systems.

Assumption 6: Major accidents occur from the chance simultaneous occurrence of random events.

Most current safety engineering techniques suffer from the limitation of considering only the events underlying an accident and not the entire accident process. Accidents are often viewed as some unfortunate coincidence of factors that come together at one particular point in time and lead to the loss. This belief arises from too narrow a view of the causal time line. Looking only at the immediate time of the Bhopal MIC release, it does seem to be a coincidence that the refrigeration system, flare tower, vent scrubber, alarms, water curtain, and so on had all been inoperable at the same time. But viewing the accident through a larger lens makes it clear that the causal factors were all related to systemic causes that had existed for a long time.

Systems are not static. Rather than accidents being a chance occurrence of multiple independent events, they tend to involve a migration to a state of increasing risk over time. A point is reached where an accident is inevitable unless the high risk is detected and reduced. The particular events involved at the time of the loss are somewhat irrelevant: if those events had not occurred, something else would have led to the loss. This concept is reflected in the common observation that a loss was “an accident waiting to happen.” The proximate cause of the Columbia Space Shuttle loss was the foam coming loose from the external tank and damaging the reentry heat control structure. But many potential problems that could have caused the loss of the Shuttle had preceded this event and an accident was avoided by luck or unusual circumstances. The economic and political pressures led the Shuttle program to migrate to a state where any slight deviation could have led to a loss.

Any approach to enhancing safety that includes the social system and humans must account for adaptation. To paraphrase a familiar saying, the only constant is that nothing ever remains constant. Systems and organizations continually experience change as adaptations are made in response to local pressures and short-term productivity and cost goals. People adapt to their environment or they change their environment to better suit their purposes. A corollary to this propensity for systems and people to adapt over time is that safety defenses are likely to degenerate systematically through time, particularly when pressure toward cost-effectiveness and increased productivity is the dominant element in decision making. Rasmussen noted that the critical factor here is that such adaptation is not a random process; it is an optimization process depending on search strategies, and thus it should be predictable and potentially controllable.

Woods has stressed the importance of adaptation in accidents. He describes organizational and human failures as breakdowns in adaptations directed at coping with complexity, and accidents as involving a “drift toward failure as planned defenses erode in the face of production pressures and change.”

Similarly, Rasmussen has argued that major accidents are often caused not by a coincidence of independent failures but instead reflect a systematic migration of organizational behavior to the boundaries of safe behavior under pressure toward cost-effectiveness in an aggressive, competitive environment. One implication of this viewpoint is that the struggle for a good safety culture will never end because it must continually fight against the functional pressures of the work environment. Improvement of the safety culture will therefore require an analytical approach directed toward the behavior-shaping factors in the environment. A way of achieving this goal is described in part 3.

Humans and organizations can adapt and still maintain safety as long as they stay within the area bounded by safety constraints. But in the search for optimal operations, humans and organizations will close in on and explore the boundaries of established practice. Such exploration implies the risk of occasionally crossing the limits of safe practice unless the constraints on safe behavior are enforced.

The natural migration toward the boundaries of safe behavior, according to Rasmussen, is complicated by the fact that it results from the decisions of multiple people, in different work environments and contexts within the overall sociotechnical system, all subject to competitive or budgetary stresses and each trying to optimize their decisions within their own immediate context. Several decision makers at different times, in different parts of the company or organization, all striving locally to optimize cost-effectiveness, may be setting the stage for an accident, as illustrated by the Zeebrugge ferry accident (see figure 2.2) and the friendly fire accident described in chapter 5. The dynamic flow of events can then be released by a single act.

Our new assumption is therefore:

New Assumption 6: Systems will tend to migrate toward states of higher risk. Such migration is predictable and can be prevented by appropriate system design or detected during operations using leading indicators of increasing risk.
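
As one hedged illustration of what a leading indicator of increasing risk might look like operationally, the sketch below tracks a single illustrative metric (the monthly count of open safety waivers, an assumed example) against a defined boundary and flags a sustained upward drift. Real programs would monitor many such indicators identified from the hazard analysis; the threshold, window, and 80 percent margin here are assumptions chosen only to make the idea concrete.

    # Hypothetical sketch: detecting migration toward a defined safety boundary.

    def drifting_toward_boundary(samples, boundary, window=6):
        """Flag a sustained rise that brings the latest value near the boundary."""
        recent = samples[-window:]
        if len(recent) < window:
            return False
        rising = sum(later - earlier for earlier, later in zip(recent, recent[1:])) > 0
        near_limit = recent[-1] >= 0.8 * boundary
        return rising and near_limit

    open_safety_waivers_per_month = [3, 4, 4, 6, 7, 9, 11, 12]
    if drifting_toward_boundary(open_safety_waivers_per_month, boundary=15):
        print("Leading indicator trending toward the safety boundary; investigate.")
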
To handle system adaptation over time, our causal models and safety techniques must consider the processes involved in accidents and not simply events and conditions: Processes control a sequence of events and describe system and human behavior as it changes and adapts over time rather than considering individual events and human actions. To talk about the cause or causes of an accident makes no sense in this systems or process view of accidents. As Rasmussen argues, deterministic causal models are inadequate to explain the organizational and social factors in highly adaptive sociotechnical systems. Instead, accident causation must be viewed as a complex process involving the entire sociotechnical system including legislators, government agencies, industry associations and insurance companies, company management, technical and engineering personnel, operations, and so on.

section 2 7.

The Focus on Determining Blame.

Assumption 7: Assigning blame is necessary to learn from and prevent accidents or incidents.

Beyond the tendency to blame operators described under assumption 3, other types of subjectivity in ascribing cause exist. Rarely are all the causes of an accident perceived identically by everyone involved, including engineers, managers, operators, union officials, insurers, lawyers, politicians, the press, the state, and the victims and their families. Such conflicts are typical in situations that involve normative, ethical, and political considerations about which people may legitimately disagree. Some conditions may be considered unnecessarily hazardous by one group yet adequately safe and necessary by another. In addition, judgments about the cause of an accident may be affected by the threat of litigation or by conflicting interests.

Research data validates this hypothesis. Various studies have found that the selection of a cause or causes depends on characteristics of the victim and of the analyst (e.g., hierarchical status, degree of involvement, and job satisfaction) as well as on the relationships between the victim and the analyst and on the severity of the accident.

For example, one study found that workers who were satisfied with their jobs and who were integrated into and participating in the enterprise attributed accidents mainly to personal causes. In contrast, workers who were not satisfied and who had a low degree of integration and participation more often cited nonpersonal causes that implied that the enterprise was responsible. Another study found differences in the attribution of accident causes among victims, safety managers, and general managers. Other researchers have suggested that accidents are attributed to factors in which the individuals are less directly involved. A further consideration may be position in the organization: The lower the position in the hierarchy, the greater the tendency to blame accidents on factors linked to the organization; individuals who have a high position in the hierarchy tend to blame workers for accidents.

There even seem to be differences in causal attribution between accidents and incidents: Accident investigation data on near-miss (incident) reporting suggest that causes for these events are mainly attributed to technical deviations, while similar events that result in losses are more often blamed on operator error.

Causal identification may also be influenced by the data collection methods. Data are usually collected in the form of textual descriptions of the sequence of events of the accident, which, as we have seen, tend to concentrate on obvious conditions or events closely preceding the accident in time and tend to leave out less obvious or indirect events and factors. There is no simple solution to this inherent bias: On one hand, report forms that do not specifically ask for nonproximal factors often do not elicit them while, on the other hand, more directive report forms that do request particular information may limit the categories or conditions considered.

Other factors affecting causal filtering in accident and incident reports may be related to the design of the reporting system itself. For example, the NASA Aviation Safety Reporting System (ASRS) has a category that includes nonadherence to FARs (Federal Aviation Regulations). In a NASA study of reported helicopter incidents and accidents over a nine-year period, this category was by far the largest category cited. The NASA study concluded that the predominance of FAR violations in the incident data may reflect the motivation of the ASRS reporters to obtain immunity from perceived or real violations of FARs and not necessarily the true percentages.

A final complication is that human actions always involve some interpretation of the person’s goals and motives. The individuals involved may be unaware of their actual goals and motivation or may be subject to various types of pressures to reinterpret their actions. Explanations by accident analysts after the fact may be influenced by their own mental models or additional goals and pressures.

Note the difference between an explanation based on goals and one based on motives: a goal represents an end state, while a motive explains why that end state was chosen. Consider the hypothetical case where a car is driven too fast during a snowstorm and slides into a telephone pole. An explanation based on goals for this chain of events might include the fact that the driver wanted to get home quickly. An explanation based on motives might include the fact that guests were coming for dinner and the driver had to prepare the food before they arrived.

Explanations based on goals and motives depend on assumptions that cannot be directly measured or observed by the accident investigator. Leplat illustrates this dilemma by describing three different motives for the event “operator sweeps the floor”: (1) the floor is dirty, (2) the supervisor is present, or (3) the machine is broken and the operator needs to find other work. Even if the people involved survive the accident, true goals and motives may not be revealed for a variety of reasons.

Where does all this leave us? There are two possible reasons for conducting an accident investigation: (1) to assign blame for the accident and (2) to understand why it happened so that future accidents can be prevented. When the goal is to assign blame, the backward chain of events considered often stops when someone or something appropriate to blame is found, such as the baggage handler in the DC-10 case or the maintenance worker at Bhopal. As a result, the selected initiating event may provide too superficial an explanation of why the accident occurred to prevent similar losses in the future.

As another example, stopping at the O-ring failure in the Challenger accident and fixing that particular design flaw would not have eliminated the systemic flaws that could lead to accidents in the future. For Challenger, examples of those systemic problems include flawed decision making and the political and economic pressures that led to it, poor problem reporting, lack of trend analysis, a “silent” or ineffective safety program, communication problems, etc. None of these are “events” (although they may be manifested in particular events) and thus do not appear in the chain of events leading to the accident. Wisely, the authors of the Challenger accident report used an event chain only to identify the proximate physical cause and not the reasons those events occurred, and the report’s recommendations led to many important changes at NASA, or at least attempts to make such changes.

Twenty years later, another Space Shuttle was lost. While the proximate cause for the Columbia accident (foam hitting the wing of the orbiter) was very different than that for Challenger, many of the systemic causal factors were similar and reflected either inadequate fixes of these factors after the Challenger accident or their reemergence in the years between these losses.

Blame is not an engineering concept; it is a legal or moral one. Usually there is no objective criterion for distinguishing one factor or several factors from other factors that contribute to an accident. While lawyers and insurers recognize that many factors contribute to a loss event, for practical reasons and particularly for establishing liability, they often oversimplify the causes of accidents and identify what they call the proximate (immediate or direct) cause. The goal is to determine the parties in a dispute that have the legal liability to pay damages, which may be affected by the ability to pay or by public policy considerations, such as discouraging company management or even an entire industry from acting in a particular way in the future.

When learning how to engineer safer systems is the goal, rather than identifying whom to punish and establishing liability, then the emphasis in accident analysis needs to shift from cause (in terms of events or errors), which has a limiting, blame orientation, to understanding accidents in terms of reasons, that is, why the events and errors occurred. In an analysis by the author of recent aerospace accidents involving software, most of the reports stopped after assigning blame, usually to the operators who interacted with the software, and never got to the root of why the accident occurred, e.g., why the operators made the errors they did and how to prevent such errors in the future (perhaps by changing the software), or why the software requirements specified unsafe behavior, why that requirements error was introduced, and why it was not detected and fixed before the software was used.

When trying to understand operator contributions to accidents, just as with overcoming hindsight bias, it is more helpful in learning how to prevent future accidents to focus not on what the operators did “wrong” but on why it made sense for them to behave that way under those conditions. Most people are not malicious but are simply trying to do the best they can under the circumstances and with the information they have. Understanding why those efforts were not enough will help in changing features of the system and environment so that sincere efforts are more successful in the future. Focusing on assigning blame contributes nothing toward achieving this goal and may impede it by reducing openness during accident investigations, thereby making it more difficult to find out what really happened.

A focus on blame can also lead to a lot of finger pointing and arguments that someone or something else was more to blame. Much effort is usually spent in accident investigations on determining which factors were the most important and assigning them to categories such as root cause, primary cause, or contributory cause. In general, determining the relative importance of various factors to an accident may not be useful in preventing future accidents. Haddon argues, reasonably, that countermeasures to accidents should not be determined by the relative importance of the causal factors; instead, priority should be given to the measures that will be most effective in reducing future losses. Explanations involving events in an event chain often do not provide the information necessary to prevent future losses, and spending a lot of time determining the relative contributions of events or conditions to accidents (such as arguing about whether an event is the root cause or a contributory cause) is not productive outside the legal system. Rather, Haddon suggests that engineering effort should be devoted to identifying the factors (1) that are easiest or most feasible to change, (2) that will prevent large classes of accidents, and (3) over which we have the greatest control.

Because the goal of this book is to describe a new approach to understanding and preventing accidents rather than assigning blame, the emphasis is on identifying all the factors involved in an accident and understanding the relationship among these causal factors in order to provide an explanation of why the accident occurred. That explanation can then be used to generate recommendations for preventing losses in the future. Building safer systems will be more effective when we consider all causal factors, both direct and indirect. In the new approach presented in this book, there is no attempt to determine which factors are more “important” than others but rather how they all relate to each other and to the final loss event or near miss.

One final new assumption is needed to complete the foundation for future progress:

New Assumption 7: Blame is the enemy of safety. Focus should be on understanding how the system behavior as a whole contributed to the loss and not on who or what to blame for it.

We will be more successful in enhancing safety by focusing on why accidents occur rather than on blame.

Updating our assumptions about accident causation will allow us to make greater progress toward building safer systems in the twenty-first century. The old and new assumptions are summarized in table 2.1. The new assumptions provide the foundation for a new view of accident causation.

section 2 8.

Goals for a New Accident Model.

Event-based models work best for accidents where one or several components fail, leading to a system failure or hazard. Accident models and explanations involving only simple chains of failure events, however, can easily miss subtle and complex couplings and interactions among failure events and omit entirely accidents involving no component failure at all. The event-based models developed to explain physical phenomena (which they do well) are inadequate to explain accidents involving organizational and social factors, human decisions, and software design errors in highly adaptive, tightly coupled, interactively complex sociotechnical systems, namely, those accidents related to the new factors (described in chapter 1) in the changing environment in which engineering is taking place.

The search for a new model, resulting in the accident model presented in part II, was driven by the following goals:

1. Expand accident analysis by forcing consideration of factors other than component failures and human errors. The model should encourage a broad view of accident mechanisms, expanding the investigation from simply considering proximal events to considering the entire sociotechnical system. Such a model should include societal, regulatory, and cultural factors. While some accident reports do this well, for example the space shuttle Challenger report, such results appear to be ad hoc and dependent on the personalities involved in the investigation rather than being guided by the accident model itself.

2. Provide a more scientific way to model accidents that produces a better and less subjective understanding of why the accident occurred and how to prevent future ones. Event-chain models provide little guidance in the selection of events to include in the accident explanation or the conditions to investigate. The model should provide more assistance in identifying and understanding a comprehensive set of factors involved, including the adaptations that led to the loss.

3. Include system design errors and dysfunctional system interactions. The models used widely were created before computers and digital components and do not handle them well. In fact, many of the event-based models were developed to explain industrial accidents, such as workers falling into holes or injuring themselves during the manufacturing process, and do not fit system safety at all. A new model must be able to account for accidents arising from dysfunctional interactions among the system components.

4. Allow for and encourage new types of hazard analyses and risk assessments that go beyond component failures and can deal with the complex role software and humans are assuming in high-tech systems. Traditional hazard analysis techniques, such as fault tree analysis and the various other types of failure analysis techniques, do not work well for human errors and for software and other system design errors. An appropriate model should suggest hazard analysis techniques to augment these failure-based methods and encourage a wider variety of risk reduction measures than redundancy and monitoring. In addition, risk assessment is currently firmly rooted in the probabilistic analysis of failure events. Attempts to extend current probabilistic risk assessment techniques to software and other new technology, to management, and to cognitively complex human control activities have been disappointing. Continuing down this path may lead to a dead end, but starting from a different theoretical foundation may allow significant progress in finding new, more comprehensive approaches to risk assessment for complex systems.

5. Shift the emphasis in the role of humans in accidents from errors (deviations from normative behavior) to focus on the mechanisms and factors that shape human behavior (i.e., the performance-shaping mechanisms and context in which human actions take place and decisions are made). A new model should account for the complex role that human decisions and behavior are playing in the accidents occurring in high-tech systems and handle not simply individual decisions but also sequences of decisions and the interactions among decisions by multiple, interacting decision makers. The model must include examining the possible goals and motives behind human behavior as well as the contextual factors that influenced that behavior.

6. Encourage a shift in the emphasis in accident analysis from “cause,” which has a limiting, blame orientation, to understanding accidents in terms of reasons, that is, why the events and errors occurred. Learning how to engineer safer systems is the goal here, not identifying whom to punish.

7. Examine the processes involved in accidents and not simply events and conditions. Processes control a sequence of events and describe changes and adaptations over time rather than considering events and human actions individually.

8. Allow for and encourage multiple viewpoints and multiple interpretations when appropriate. Operators, managers, and regulatory agencies may all have different views of the flawed processes underlying an accident, depending on the hierarchical level of the sociotechnical control structure from which the process is viewed. At the same time, the factual data should be separated from the interpretation of that data.

9. Assist in defining operational metrics and analyzing performance data. Computers allow the collection of massive amounts of operational data, but analyzing that data to determine whether the system is moving toward the boundaries of safe behavior is difficult. A new accident model should provide directions for identifying appropriate safety metrics and operational auditing procedures to evaluate decisions made during design and development, to determine whether controls over hazards are adequate, to detect erroneous operational and environmental assumptions underlying the hazard analysis and design process, to identify leading indicators and dangerous trends and changes in operations before they lead to accidents, and to identify any maladaptive system or environment changes over time that could increase accident risk to unacceptable levels.

These goals are achievable if models based on systems theory, rather than reliability theory, underlie our safety engineering activities.