Chapter 11.
Analyzing Accidents and Incidents (CAST).

The causality model used in accident or incident analysis determines what we look for, how we go about looking for “facts,” and what we see as relevant. In our experience using STAMP-based accident analysis, we find that even if we use only the information presented in an existing accident report, we come up with a very different view of the accident and its causes.

Most accident reports are written from the perspective of an event-based model. They almost always clearly describe the events, and usually one or several of these events are chosen as the “root causes.” Sometimes “contributory causes” are identified. But the analysis of why those events occurred is usually incomplete: The analysis frequently stops after finding someone to blame, usually a human operator, and the opportunity to learn important lessons is lost.

An accident analysis technique should provide a framework or process to assist in understanding the entire accident process and identifying the most important systemic causal factors involved. This chapter describes an approach to accident analysis, based on STAMP, called CAST (Causal Analysis based on STAMP). CAST can be used to identify the questions that need to be answered to fully understand why the accident occurred. It provides the basis for maximizing learning from the events.

The use of CAST does not lead to identifying single causal factors or variables. Instead it provides the ability to examine the entire sociotechnical system design to identify the weaknesses in the existing safety control structure and to identify changes that will not simply eliminate symptoms but potentially eliminate all the causal factors, including the systemic ones.

One goal of CAST is to get away from assigning blame and instead to shift the focus to why the accident occurred and how to prevent similar losses in the future. To accomplish this goal, it is necessary to minimize hindsight bias and instead to determine why people behaved the way they did, given the information they had at the time.

An example of the results of an accident analysis using CAST is presented in chapter 5. Additional examples are in appendixes B and C. This chapter describes the steps to go through in producing such an analysis. An accident at a fictional chemical plant called Citichem [174] is used to demonstrate the process. The accident scenario was developed by Risk Management Pro to train accident investigators and describes a realistic accident process similar to many accidents that have occurred in chemical plants. While the loss involves release of a toxic chemical, the analysis serves as an example of how to do an accident or incident analysis for any industry.

An accident investigation process is not being specified here, but only a way to document and analyze the results of such a process. Accident investigation is a much larger topic that goes beyond the goals of this book. This chapter only considers how to analyze the data once it has been collected and organized. The accident analysis process described in this chapter does, however, contribute to determining what questions should be asked during the investigation. When attempting to apply STAMP-based analysis to existing accident reports, it often becomes apparent that crucial information needed to fully understand why the loss occurred and how to prevent future occurrences was not obtained, or at least was not included in the report.

footnote. Maggie Stringfellow and John Thomas, two MIT graduate students, contributed to the CAST analysis of the fictional accident used in this chapter.
section 11.1. The General Process of Applying STAMP to Accident Analysis.

In STAMP, an accident is regarded as involving a complex process, not just individual events. Accident analysis in CAST then entails understanding the dynamic process that led to the loss. That accident process is documented by showing the sociotechnical safety control structure for the system involved and the safety constraints that were violated at each level of this control structure and why. The analysis results in multiple views of the accident, depending on the perspective and level from which the loss is being viewed.

Although the process is described in terms of steps or parts, no implication is being made that the analysis process is linear or that one step must be completed before the next one is started. The first three steps are the same ones that form the basis of all the STAMP-based techniques described so far.

1. Identify the system(s) and hazard(s) involved in the loss.

2. Identify the system safety constraints and system requirements associated with that hazard.

3. Document the safety control structure in place to control the hazard and enforce the safety constraints. This structure includes the roles and responsibilities of each component in the structure as well as the controls provided or created to execute their responsibilities and the relevant feedback provided to them to help them do this. This structure may be completed in parallel with the later steps.

4. Determine the proximate events leading to the loss.

5. Analyze the loss at the physical system level. Identify the contribution of each of the following to the events: physical and operational controls, physical failures, dysfunctional interactions, communication and coordination flaws, and unhandled disturbances. Determine why the physical controls in place were ineffective in preventing the hazard.

6. Moving up the levels of the safety control structure, determine how and why each successive higher level allowed or contributed to the inadequate control at the current level. For each system safety constraint, either the responsibility for enforcing it was never assigned to a component in the safety control structure or a component or components did not exercise adequate control to ensure their assigned responsibilities (safety constraints) were enforced in the components below them. Any human decisions or flawed control actions need to be understood in terms of (at least): the information available to the decision maker as well as any required information that was not available, the behavior-shaping mechanisms (the context and influences on the decision-making process), the value structures underlying the decision, and any flaws in the process models of those making the decisions and why those flaws existed.

7. Examine overall coordination and communication contributors to the loss.

8. Determine the dynamics and changes in the system and the safety control structure relating to the loss and any weakening of the safety control structure over time.

9. Generate recommendations.

In general, the description of the role of each component in the control structure will include the following:

• Safety Requirements and Constraints
• Controls
• Context
  – Roles and responsibilities
  – Environmental and behavior-shaping factors
• Dysfunctional interactions, failures, and flawed decisions leading to erroneous control actions
• Reasons for the flawed control actions and dysfunctional interactions
  – Control algorithm flaws
  – Incorrect process or interface models
  – Inadequate coordination or communication among multiple controllers
  – Reference channel flaws
  – Feedback flaws

The next sections detail the steps in the analysis process, using Citichem as a running example.
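The outline above can also be kept as structured data so that the component-by-component findings accumulate in one place as the analysis proceeds. The following is a minimal sketch, not part of the original method description: the Python structure and the example entries (drawn from the Citichem maintenance manager discussed later in this chapter) are illustrative assumptions only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentAnalysis:
    # One control-structure component, following the outline above.
    name: str
    safety_requirements: List[str] = field(default_factory=list)
    controls: List[str] = field(default_factory=list)
    context: List[str] = field(default_factory=list)                 # roles, behavior-shaping factors
    flawed_control_actions: List[str] = field(default_factory=list)  # dysfunctional interactions, failures
    reasons: List[str] = field(default_factory=list)                 # algorithm, process model, coordination,
                                                                     # reference channel, and feedback flaws

maintenance_manager = ComponentAnalysis(
    name="Maintenance manager",
    safety_requirements=["Investigate hazardous conditions before restarting operations"],
    context=["Extreme time pressure", "Inadequate staff"],
    flawed_control_actions=["Did not order a check of spare tank 702 for water"],
    reasons=["Accepted condensation explanation; no channel for reporting hazardous events"],
)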
section 11.2. Creating the Proximal Event Chain.

While the event chain does not provide the most important causality information, the basic events related to the loss do need to be identified so that the physical process involved in the loss can be understood.

For Citichem, the physical process events are relatively simple: A chemical reaction occurred in storage tanks 701 and 702 of the Citichem plant when the chemical contained in the tanks, K34, came in contact with water. K34 is made up of some extremely toxic and dangerous chemicals that react violently with water and thus need to be kept away from it. The runaway reaction led to the release of a toxic cloud of tetrachloric cyanide (TCC) gas, which is flammable, corrosive, and volatile. The TCC blew toward a nearby park and housing development, in a city called Oakbridge, killing more than four hundred people.

The direct events leading to the release and deaths are:

1. Rain gets into tank 701 (and presumably 702), both of which are in Unit 7 of the Citichem Oakbridge plant. Unit 7 was shut down at the time due to lowered demand for K34.

2. Unit 7 is restarted when a large order for K34 is received.

3. A small amount of water is found in tank 701 and an order is issued to make sure the tank is dry before startup.

4. T34 transfer is started at Unit 7.

5. The level gauge transmitter in the 701 storage tank shows more than it should.

6. A request is sent to maintenance to put in a new level transmitter.

7. The level transmitter from tank 702 is moved to tank 701. (Tank 702 is used as a spare tank for overflow from tank 701 in case there is a problem.)

8. Pressure in Unit 7 reads as too high.

9. The backup cooling compressor is activated.

10. Tank 701 temperature exceeds 12 degrees Celsius.

11. A sample is run, an operator is sent to check tank pressure, and the plant manager is called.

12. Vibration is detected in tank 701.

13. The temperature and pressure in tank 701 continue to increase.

14. Water is found in the sample that was taken (see event 11).

15. Tank 701 is dumped into the spare tank 702.

16. A runaway reaction occurs in tank 702.

17. The emergency relief valve jams and runoff is not diverted into the backup scrubber.

18. An uncontrolled gas release occurs.

19. An alarm sounds in the plant.

20. Nonessential personnel are ordered into units 2 and 3, which have positive pressure and filtered air.

21. People faint outside the plant fence.

22. Police evacuate a nearby school.

23. The engineering manager calls the local hospital and gives them the chemical name and a hotline phone number to learn more about the chemical.

24. The public road becomes jammed and emergency crews cannot get into the surrounding community.

25. Hospital personnel cannot keep up with the steady stream of victims.

26. Emergency medical teams are airlifted in.

These events are presented as one list here, but separation into separate interacting component event chains may be useful sometimes in understanding what happened, as shown in the friendly fire event description in chapter 5.

The Citichem event chain here provides a superficial analysis of what happened. A deep understanding of why the events occurred requires much more information. Remember that the goal of a STAMP-based analysis is to determine why the events occurred—not who to blame for them—and to identify the changes that could prevent them and similar events in the future.
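One simple way to support the separation into interacting component event chains mentioned above is to tag each proximal event with the part of the system it belongs to. The sketch below is not from the original analysis; the component tags and the selection of events are assumptions for illustration.

# Proximal events tagged with an (assumed) component, so that the single list can be
# split into interacting component event chains when that helps understanding.
events = [
    (1,  "Unit 7 tanks",       "Rain gets into tank 701 (and presumably 702)"),
    (4,  "Unit 7 operations",  "T34 transfer is started"),
    (7,  "Maintenance",        "Level transmitter moved from tank 702 to tank 701"),
    (16, "Unit 7 tanks",       "Runaway reaction occurs in tank 702"),
    (22, "Community response", "Police evacuate a nearby school"),
]

def chain_for(component):
    """Return the sub-chain of events attributed to one component."""
    return [(number, text) for number, tag, text in events if tag == component]

print(chain_for("Unit 7 tanks"))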
section 11.3. Defining the System(s) and Hazards Involved in the Loss.

Citichem has two relevant physical processes being controlled: the physical plant and public health. Because separate and independent controllers were controlling these two processes, it makes sense to consider them as two interacting but independent systems: (1) the chemical company, which controls the chemical process, and (2) the public political structure, which has responsibilities for public health.

Figure 11.1 shows the major components of the two safety control structures and interactions between them. Only the major structures are shown in the figure; the details will be added throughout this chapter. No information was provided about the design and engineering process for the Citichem plant in the accident description, so details about it are omitted. A more complete example of a development control structure and analysis of its role can be found in appendix B.

The analyst(s) also needs to identify the hazard(s) being avoided and the safety constraint(s) to be enforced. An accident or loss event for the combined chemical plant and public health structure can be defined as death, illness, or injury due to exposure to toxic chemicals.

The hazards being controlled by the two control structures are related but different. The public health structure hazard is exposure of the public to toxic chemicals. The system-level safety constraints for the public health control system are that:

1. The public must not be exposed to toxic chemicals.

2. Measures must be taken to reduce exposure if it occurs.

3. Means must be available, effective, and used to treat exposed individuals outside the plant.

The hazard for the chemical plant process is uncontrolled release of toxic chemicals. Accordingly, the system-level constraints are that:

1. Chemicals must be under positive control at all times.

2. Measures must be taken to reduce exposure if inadvertent release occurs.

3. Warnings and other measures must be available to protect workers in the plant and minimize losses to the outside community.

4. Means must be available, effective, and used to treat exposed individuals inside the plant.

Hazards and safety constraints must be within the design space of those who developed the system and within the operational space of those who operate it. For example, the chemical plant designers cannot be responsible for those things outside the boundaries of the chemical plant over which they have no control, although they may have some influence over them. Control over the environment of a plant is usually the responsibility of the community and various levels of government. As another example, while the operators of the plant may cooperate with local officials in providing public health and emergency response facilities, responsibility for this function normally lies in the public domain. Similarly, while the community and local government may have some influence on the design of the chemical plant, the company engineers and managers control detailed design and operations.

Once the goals and constraints are determined, the controls in place to enforce them must be identified.

footnote. OSHA, the Occupational Safety and Health Administration, is part of a third larger governmental control structure, which has many other components. For simplicity, only OSHA is shown and considered in the example analysis.
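Because later steps trace inadequate controls back to the system-level constraints they were supposed to enforce, it can help to keep the two systems, their hazards, and their constraints as plain data. The following is only an illustrative sketch of one way to do that; the dictionary layout and the shortened wording are assumptions, with the constraints paraphrased from the lists above.

# The two systems analyzed for Citichem, each with its hazard and system-level
# safety constraints, recorded so that responsibilities can later be traced to them.
SYSTEMS = {
    "public health structure": {
        "hazard": "Exposure of the public to toxic chemicals",
        "constraints": [
            "The public must not be exposed to toxic chemicals",
            "Measures must be taken to reduce exposure if it occurs",
            "Means must be available, effective, and used to treat exposed individuals outside the plant",
        ],
    },
    "chemical plant": {
        "hazard": "Uncontrolled release of toxic chemicals",
        "constraints": [
            "Chemicals must be under positive control at all times",
            "Measures must be taken to reduce exposure if inadvertent release occurs",
            "Warnings and other measures must protect workers and minimize losses to the community",
            "Means must be available, effective, and used to treat exposed individuals inside the plant",
        ],
    },
}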
section 11.4. Documenting the Safety Control Structure.

If STAMP has been used as the basis for previous safety activities, such as the original engineering process or the investigation and analysis of previous incidents and accidents, a model of the safety control structure may already exist. If not, it must be created, although it can then be reused in the future. Chapters 12 and 13 provide information about the design of safety control structures.

The components of the structure as well as each component’s responsibility with respect to enforcing the system safety constraints must be identified. Determining what these are (or what they should be) can start from system safety requirements. The following are some example system safety requirements that might be appropriate for the Citichem chemical plant example:

1. Chemicals must be stored in their safest form.

2. The amount of toxic chemicals stored should be minimized.

3. Release of toxic chemicals and contamination of the environment must be prevented.

4. Safety devices must be operable and properly maintained at all times when potentially toxic chemicals are being processed or stored.

5. Safety equipment and emergency procedures (including warning devices) must be provided to reduce exposure in the event of an inadvertent chemical release.

6. Emergency procedures and equipment must be available and operable to treat exposed individuals.

7. All areas of the plant must be accessible to emergency personnel and equipment during emergencies. Delays in providing emergency treatment must be minimized.

8. Employees must be trained to
a. Perform their jobs safely and understand proper use of safety equipment
b. Understand their responsibilities with regard to safety and the hazards related to their job
c. Respond appropriately in an emergency

9. Those responsible for safety in the surrounding community must be educated about potential hazards from the plant and provided with information about how to respond appropriately.

A similar list of safety-related requirements and responsibilities might be generated for the community safety control structure.

These general system requirements must be enforced somewhere in the safety control structure. As the accident analysis proceeds, they are used as the starting point for generating more specific constraints, such as constraints for the specific chemicals being handled. For example, requirement 4, when instantiated for TCC, might generate a requirement to prevent contact of the chemical with water. As the accident analysis proceeds, the identified responsibilities of the components can be mapped to the system safety requirements—the opposite of the forward tracing used in safety-guided design. If STPA was used in the design or analysis of the system, then the safety control structure documentation should already exist.
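The mapping just described also makes it easy to spot requirements that no component in the documented structure actually enforces, which is one of the two failure modes named in step 6 of section 11.1. The sketch below is an assumption-laden illustration, not part of the method as published: the component names, the requirement paraphrases, and the assignments are hypothetical.

# Map each component's identified responsibilities back to the numbered system safety
# requirements, then flag requirements with no assigned enforcer.
requirements = {
    4: "Safety devices must be operable and properly maintained at all times",
    8: "Employees must be trained",
    9: "Community safety officials must be educated about plant hazards",
}

responsibilities = {
    "Maintenance management": [4],
    "Plant management": [8],
    # No component claims requirement 9 in this illustrative example.
}

assigned = {req for reqs in responsibilities.values() for req in reqs}
unassigned = [text for number, text in requirements.items() if number not in assigned]
print("Requirements with no assigned enforcer:", unassigned)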
In some cases, general requirements and policies for an industry are established by the government or by professional associations. These can be used during an accident analysis to assist in comparing the actual safety control structure (both in the plant and in the community) at the time of the accidents with the standards or best practices of the industry and country. Accident analyses can in this way be made less arbitrary and more guidance provided to the analysts as to what should be considered to be inadequate controls.

The specific designed controls need not all be identified before the rest of the analysis starts. Additional controls will be identified as the analysts go through the next steps of the process, but a good start can usually be made early in the analysis process.
section 11.5. Analyzing the Physical Process.

Analysis starts with the physical process, identifying the physical and operational controls and any potential physical failures, dysfunctional interactions and communication, or unhandled external disturbances that contributed to the events. The goal is to determine why the physical controls in place were ineffective in preventing the hazard. Most accident analyses do a good job of identifying the physical contributors to the events.

Figure 11.2 shows the requirements and controls at the Citichem physical plant level as well as failures and inadequate controls. The physical contextual factors contributing to the events are included.

The most likely reason for water getting into tanks 701 and 702 was that inadequate controls were provided to keep water out during a recent rainstorm (an unhandled external disturbance to the system in figure 4.8), but there is no way to determine that for sure.

Accident investigations, when the events and physical causes are not obvious, often make use of a hazard analysis technique, such as fault trees, to create scenarios to consider. STPA can be used for this purpose. Using control diagrams of the physical system, scenarios can be generated that could lead to the lack of enforcement of the safety constraint(s) at the physical level. The safety design principles in chapter 9 can provide assistance in identifying design flaws.

As is common in the process industry, the physical plant safety equipment (controls) at Citichem was designed as a series of barriers to satisfy the system safety constraints identified earlier, that is, to protect against runaway reactions, protect against inadvertent release of toxic chemicals or an explosion (uncontrolled energy), convert any released chemicals into a non-hazardous or less hazardous form, provide protection against human or environmental exposure after release, and provide emergency equipment to treat exposed individuals. Citichem had the standard types of safety equipment installed, including gauges and other indicators of the physical system state. In addition, it had an emergency relief system and devices to minimize the danger from released chemicals, such as a scrubber to reduce the toxicity of any released chemicals and a flare tower to burn off gas before it gets into the atmosphere.

A CAST accident analysis examines the controls to determine which ones did not work adequately and why. While a reasonable number of physical safety controls were provided at Citichem, much of this equipment was inadequate or not operational—a common finding after chemical plant accidents.

In particular, rainwater got into the tank, which implies the tanks were not adequately protected against rain despite the serious hazard created by the mixing of TCC with water. While the inadequate protection against rainwater should be investigated, no information was provided in the Citichem accident description. Did the hazard analysis process, which in the process industry often involves HAZOP, identify this hazard? If not, then the hazard analysis process used by the company needs to be examined to determine why an important factor was omitted. If it was not omitted, then the flaw lies in the translation of the hazard analysis results into protection against the hazard in the design and operations. Were controls to protect against water getting into the tank provided? If not, why not? If so, why were they ineffective?

Critical gauges and monitoring equipment were missing or inoperable at the time of the runaway reaction. As one important example, the plant at the time of the accident had no operational level indicator on tank 702 despite the fact that this equipment provided safety-critical information. One task for the accident analysis, then, is to determine whether the indicator was designated as safety-critical, which would (or should) trigger more controls at the higher levels, such as higher priority in maintenance activities. The inoperable level indicator also indicates a need to look at higher levels of the control structure that are responsible for providing and maintaining safety-critical equipment.

As a final example, the design of the emergency relief system was inadequate: The emergency relief valve jammed and excess gas could not be sent to the scrubber. The pop-up relief valves in Unit 7 (and Unit 9) at the plant were too small to allow the venting of the gas if non-gas material was present. The relief valve lines were also too small to relieve the pressure fast enough, in effect providing a single point of failure for the emergency relief system. Why an inadequate design existed also needs to be examined in the higher-level control structure. What group was responsible for the design and why did a flawed design result? Or was the design originally adequate but conditions changed over time?

The physical contextual factors identified in figure 11.2 play a role in the accident causal analysis, such as the limited access to the plant, but their importance becomes obvious only at higher levels of the control structure.

At this point in the analysis, several recommendations are reasonable: add protection against rainwater getting into the tanks, change the design of the valves and vent pipes in the emergency relief system, put a level indicator on Tank 702, and so on. Accident investigations often stop here with the physical process analysis or go one step higher to determine what the operators (the direct controllers of the physical process) did wrong.

The other physical process being controlled here, public health, must be examined in the same way. There were very few controls over public health instituted in Oakbridge, the community surrounding the plant, and the ones that did exist were inadequate. The public had no training in what to do in case of an emergency, the emergency response system was woefully inadequate, and unsafe development was allowed, such as the creation of a children’s park right outside the walls of the plant. The reasons for these inadequacies, as well as the inadequacies of the controls on the physical plant process, are considered in the next section.
section 11.6. Analyzing the Higher Levels of the Safety Control Structure.

While the physical control inadequacies are relatively easy to identify in the analysis and are usually handled well in any accident analysis, understanding why those physical failures or design inadequacies existed requires examining the higher levels of safety control: Fully understanding the behavior at any level of the sociotechnical safety control structure requires understanding how and why the control at the next higher level allowed or contributed to the inadequate control at the current level. Most accident reports include some of the higher-level factors, but usually incompletely and inconsistently, and they focus on finding someone or something to blame.

Each relevant component of the safety control structure, starting with the lowest physical controls and progressing upward to the social and political controls, needs to be examined. How are the components to be examined determined? Considering everything is not practical or cost effective. By starting at the bottom, the relevant components to consider can be identified. At each level, the flawed behavior or inadequate controls are examined to determine why the behavior occurred and why the controls at higher levels were not effective at preventing that behavior. For example, in the STAMP-based analysis of an accident where an aircraft took off from the wrong runway during construction at the airport, it was discovered that the airport maps provided to the pilot were out of date [142]. That led to examining the procedures at the company that provided the maps and the FAA procedures for ensuring that maps are up-to-date.

Stopping after identifying inadequate control actions by the lower levels of the safety control structure is common in accident investigation. The result is that the cause is attributed to “operator error,” which does not provide enough information to prevent accidents in the future. It also does not overcome the problems of hindsight bias. In hindsight, it is always possible to see that a different behavior would have been safer. But the information necessary to identify that safer behavior is usually only available after the fact. To improve safety, we need to understand the reasons people acted the way they did. Then we can determine if and how to change conditions so that better decisions can be made in the future.

The analyst should start from the assumption that most people have good intentions and do not purposely cause accidents. The goal then is to understand why people did not or could not act differently. People acted the way they did for very good reasons; we need to understand why the behavior of the people involved made sense to them at the time [51].

Identifying these reasons requires examining the context and behavior-shaping factors in the safety control structure that influenced that behavior. What contextual factors should be considered? Usually the important contextual and behavior-shaping factors become obvious in the process of explaining why people acted the way they did. Stringfellow has suggested a set of general factors to consider [195]:

• History: Experiences, education, cultural norms, behavioral patterns: how the historical context of a controller or organization may impact their ability to exercise adequate control.

• Resources: Staff, finances, time.

• Tools and Interfaces: Quality, availability, design, and accuracy of tools. Tools may include such things as risk assessments, checklists, and instruments as well as the design of interfaces such as displays, control levers, and automated tools.

• Training.

• Human Cognition Characteristics: Person–task compatibility, individual tolerance of risk, control role, innate human limitations.

• Pressures: Time, schedule, resource, production, incentive, compensation, political. Pressures can include any positive or negative force that can influence behavior.

• Safety Culture: Values and expectations around such things as incident reporting, workarounds, and safety management procedures.

• Communication: How the communication techniques, form, styles, or content impacted behavior.

• Human Physiology: Intoxication, sleep deprivation, and the like.

We also need to look at the process models used in the decision making. What information did the decision makers have or did they need related to the inadequate control actions? What other information could they have had that would have changed their behavior? If the analysis determines that the person was truly incompetent (not usually the case), then the focus shifts to ask why an incompetent person was hired to do this job and why they were retained in their position. A useful method to assist in understanding human behavior is to show the process model of the human controller at each important event in which he or she participated, that is, what information they had about the controlled process when they made their decisions.
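One way to record such process-model snapshots, shown only as an illustrative sketch and not as part of the published method, is a list of entries, one per important event, capturing what the controller believed at that moment. The structure and the wording of the entries (which paraphrase the maintenance worker’s situation described below) are assumptions.

# Process-model snapshots for one human controller: what was believed about the
# controlled process at each important event in which he participated.
process_model_snapshots = [
    {
        "event": "Ordered to make tank 701 'bone dry' before startup",
        "believed": {
            "source of water": "condensation, not rain",
            "time available": "insufficient (fatigued, fourteen-hour day)",
        },
    },
    {
        "event": "Reported the tank ready for startup",
        "believed": {"tank 701": "dry"},
        "actual": {"tank 701": "unknown; water found in a later sample"},
    },
]

for snapshot in process_model_snapshots:
    print(snapshot["event"], "->", snapshot["believed"])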
Let’s follow some of the physical plant inadequacies up the safety control structure at Citichem. Three examples of STAMP-based analyses of the inadequate control at Citichem are shown in figure 11.3: a maintenance worker, the maintenance manager, and the operations manager.

During the investigation, it was discovered that a maintenance worker had found water in tank 701. He was told to check the Unit 7 tanks to ensure they were ready for the T34 production startup. Unit 7 had been shut down previously (see “Physical Plant Context”). The startup was scheduled for 10 days after the decision to produce additional K34 was made. The worker found a small amount of water in tank 701, reported it to the maintenance manager, and was told to make sure the tank was “bone dry.” However, water was found in the sample taken from tank 701 right before the uncontrolled reaction. It is unknown (and probably unknowable) whether the worker did not get all the water out or more water entered later through the same path it entered previously or via a different path. We do know he was fatigued and working a fourteen-hour day, and he may not have had time to do the job properly. He also believed that the tank’s residual water was from condensation, not rain. No independent check was made to determine whether all the water was removed.

Some potential recommendations from what has been described so far include establishing procedures for quality control and checking safety-critical activities. Any existence of a hazardous condition—such as finding water in a tank that is to be used to produce a chemical that is highly reactive to water—should trigger an in-depth investigation of why it occurred before any dangerous operations are started or restarted. In addition, procedures should be instituted to ensure that those performing safety-critical operations have the appropriate skills, knowledge, and physical resources, which, in this case, include adequate rest. Independent checks of critical activities also seem to be needed.

The maintenance worker was just following the orders of the maintenance manager, so the role of maintenance management in the safety control structure also needs to be investigated. The runaway reaction was the result of TCC coming in contact with water. The operator who worked for the maintenance manager told him about finding water in tank 701 after the rain and was directed to remove it. The maintenance manager did not tell him to check the spare tank 702 for water and does not appear to have made any other attempts to perform that check. He apparently accepted the explanation of condensation as the source of the water and did not, therefore, investigate the leak further.

Why did the maintenance manager, a long-time employee who had always been safety conscious in the past, not investigate further? The maintenance manager was working under extreme time pressure and with inadequate staff to perform the jobs that were necessary. There was no reporting channel to someone with specified responsibility for investigating hazardous events, such as finding water in a tank used for a toxic chemical that should never contact water. Normally an investigation would not be the responsibility of the maintenance manager but would fall under the purview of the engineering or safety engineering staff. There did not appear to be anyone at Citichem with the responsibility to perform the type of investigation and risk analysis required to understand the reason for water being in the tank. Such events should be investigated thoroughly by a group with designated responsibility for process safety, which presumes, of course, that such a group exists.

The maintenance manager did protest (to the plant manager) about the unsafe orders he was given and the inadequate time and resources he had to do his job adequately. At the same time, he did not tell the plant manager about some of the things that had occurred. For example, he did not inform the plant manager about finding water in tank 701. If the plant manager had known these things, he might have acted differently. There was no problem-reporting system in this plant for such information to be reliably communicated to decision makers: Communication relied on chance meetings and informal channels.

Lots of recommendations for changes could be generated from this part of the analysis, such as providing rigorous procedures for hazard analysis when a hazardous condition is detected and training and assigning personnel to do such an analysis. Better communication channels are also indicated, particularly problem reporting channels.

The operations manager (figure 11.3) also played a role in the accident process. He too was under extreme pressure to get Unit 7 operational. He was unaware that the maintenance group had found water in tank 701 and thought 702 was empty. During the effort to get Unit 7 online, the level indicator on tank 701 was found not to be working. When it was determined that there were no spare level indicators at the plant and that delivery would require two weeks, he ordered the level indicator on 702 to be temporarily placed on tank 701—tank 702 was only used for overflow in case of an emergency, and he assessed the risk of such an emergency as low. This flawed decision clearly needs to be carefully analyzed. What types of risk and safety analyses were performed at Citichem? What training was provided on the hazards? What policies were in place with respect to disabling safety-critical equipment? Additional analysis also seems warranted for the inventory control procedures at the plant and determining why safety-critical replacement parts were out of stock.

Clearly, safety margins were reduced at Citichem when operations continued despite serious failures of safety devices. Nobody noticed the degradation in safety. Any change of the sort that occurred here—startup of operations in a previously shut-down unit and temporary removal of safety-critical equipment—should have triggered a hazard analysis and a management of change (MOC) process. Lots of accidents in the chemical industry (and others) involve unsafe workarounds. The causal analysis so far should trigger additional investigation to determine whether adequate management of change and control of work procedures had been provided but not enforced or were not provided at all. The first step in such an analysis is to determine who was responsible (if anyone) for creating such procedures and who was responsible for ensuring they were followed. The goal again is not to find someone to blame but simply to identify the flaws in the process for running Citichem so they can be fixed.
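To make the MOC point concrete, a management of change process essentially encodes a list of change types that must trigger a hazard analysis before operations continue. The following is a minimal sketch under that assumption; the trigger list, the change records, and their wording are hypothetical and only paraphrase the Citichem changes discussed above.

# A simple MOC-style check: flag changes that should have triggered a hazard
# analysis but for which none was performed.
MOC_TRIGGERS = {
    "restart of a shut-down unit",
    "removal or bypass of safety-critical equipment",
}

changes = [
    {"description": "Restart Unit 7 to fill the K34 order",
     "type": "restart of a shut-down unit", "hazard_analysis_done": False},
    {"description": "Move level transmitter from tank 702 to tank 701",
     "type": "removal or bypass of safety-critical equipment", "hazard_analysis_done": False},
]

for change in changes:
    if change["type"] in MOC_TRIGGERS and not change["hazard_analysis_done"]:
        print("MOC violation:", change["description"])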
At this point, it appears that decision making by higher-level management (above the maintenance and operations managers) and management controls were inadequate at Citichem. Figures 11.4 and 11.5 show example STAMP-based analysis results for the Citichem plant manager and Citichem corporate management. The plant manager made many unsafe decisions and issued unsafe control actions that directly contributed to the accident or did not initiate control actions necessary for safety (as shown in figure 11.4). At the same time, it is clear that he was under extreme pressure to increase production and was missing information necessary to make better decisions. An appropriate safety control structure at the plant had not been established, leading to unsafe operational practices and inaccurate risk assessment by most of the managers, especially those higher in the control structure. Some of the lower-level employees tried to warn against the high-risk practices, but appropriate communication channels had not been established to express these concerns.

Safety controls were almost nonexistent at the corporate management level. The upper levels of management provided inadequate leadership, oversight, and management of safety. There was either no adequate company safety policy or it was not followed, either of which would lead to further causal analysis. A proper process safety management system clearly did not exist at Citichem. Management was under great competitive pressures, which may have led to corporate safety controls being ignored, or adequate controls may never have been established. Everyone had very flawed mental models of the risks of increasing production without taking the proper precautions. The recommendations should include consideration of what kinds of changes might be made to provide better information about risks to management decision makers and about the state of plant operations with respect to safety.

As in any major accident, the process leading to the loss, when analyzed thoroughly, is complex and multi-faceted. A complete analysis of this accident is not needed here. But a look at some of the factors involved in the plant’s environment, including the control of public health, is instructive.

Figure 11.6 shows the STAMP-based analysis of the Oakbridge city emergency response system. Planning was totally inadequate or out of date. The fire department did not have the proper equipment and training for a chemical emergency, the hospital also did not have adequate emergency resources or a backup plan, and the evacuation plan was ten years out of date and inadequate for the current level of population.

Understanding why these inadequate controls existed requires understanding the context and process model flaws. For example, the police chief had asked for resources to update equipment and plans, but the city had turned him down. Plans had been made to widen the road to Oakbridge so that emergency equipment could be brought in, but those plans were never implemented and the planners never went back to their plans to see if they were realistic for the current conditions. Citichem had a policy against disclosing what chemicals they produce and use, justifying this policy by the need for secrecy from their competitors, making it impossible for the hospital to stockpile the supplies and provide the training required for emergencies, all of which contributed to the fatalities in the accident. The government had no disclosure laws requiring chemical companies to provide such information to emergency responders.

Clear recommendations for changes result from this analysis, for example, updating evacuation plans and making changes to the planning process. But again, stopping at this level does not help to identify systemic changes that could improve community safety: The analysts should work their way up the control structure to understand the entire accident process. For example, why was an inadequate emergency response system allowed to exist?

The analysis in figure 11.7 helps to answer this question. For example, the members of the city government had inadequate knowledge of the hazards associated with the plant, and they did not try to obtain more information about them or about the impact of increased development close to the plant. At the same time, they turned down requests for the funding to upgrade the emergency response system as the population increased as well as attempts by city employees to provide emergency response pamphlets for the citizens and set up appropriate communication channels.

Why did they make what in retrospect look like such bad decisions? With inadequate knowledge about the risks, the benefits of increased development were ranked above the dangers from the plant in the priorities used by the city managers. A misunderstanding about the dangers involved in the chemical processing at the plant also contributed to the lack of planning and approval for emergency preparedness activities.

The city government officials were subjected to pressures from local developers and local businesses that would benefit financially from increased development. The developer sold homes before the development was approved in order to increase pressure on the city council. He also campaigned against a proposed emergency response pamphlet for local residents because he was afraid it would reduce his sales. The city government was subjected to additional pressure from local businessmen who wanted more development in order to increase their business and profits.

The residents did not provide opposing pressure to counteract the business influences and trusted that government would protect them: No community organizations existed to provide oversight of the local government safety controls and to ensure that government was adequately considering their health and safety needs (figure 11.8).

The city manager had the right instincts and concern for public safety, but she lacked the freedom to make decisions on her own and the clout to influence the mayor or city council. She was also subject to external pressures to back down on her demands and had no structure to assist her in resisting those pressures.

In general, there are few requirements for serving on city councils. In the United States, they are often made up primarily of those with conflicts of interest, such as real estate agents and developers. Mayors of small communities are often not paid a full salary and must therefore have other sources of income, and city council members are likely to be paid even less, if at all.

If community-level management is unable to provide adequate controls, controls might be enforced by higher levels of government. A full analysis of this accident would consider what controls existed at the state and federal levels and why they were not effective in preventing the accident.
section 11.7. A Few Words about Hindsight Bias and Examples.

One of the most common mistakes in accident analyses is the use of hindsight bias. Words such as “could have” or “should have” in accident reports are judgments that are almost always the result of such bias [50]. It is not the role of the accident analyst to render judgment in terms of what people did or did not do (although that needs to be recorded) but to understand why they acted the way they did.

Although hindsight bias is usually applied to the operators in an accident report, because most accident reports focus on the operators, it theoretically could be applied to people at any level of the organization: “The plant manager should have known …”

The biggest problem with hindsight bias in accident reports is not that it is unfair (which it usually is), but that an opportunity to learn from the accident and prevent future occurrences is lost. It is always possible to identify a better decision in retrospect—or there would not have been a loss or near miss—but it may have been difficult or impossible to identify that the decision was flawed at the time it had to be made. To improve safety and to reduce errors, we need to understand why the decision made sense to the person at the time and redesign the system to help people make better decisions.

Accident investigation should start with the assumption that most people have good intentions and do not purposely cause accidents. The goal of the investigation, then, is to understand why they did the wrong thing in that particular situation. In particular, what were the contextual or systemic factors and flaws in the safety control structure that influenced their behavior? Often, the person had an inaccurate view of the state of the process and, given that view, did what appeared to be the right thing at the time but turned out to be wrong with respect to the actual state. The solution then is to redesign the system so that the controller has better information on which to make decisions.

As an example, consider a real accident report on a chemical overflow from a tank, which injured several workers in the vicinity [118]. The control room operator issued an instruction to open a valve to start the flow of liquid into the tank. The flow meter did not indicate a flow, so the control room operator asked an outside operator to check the manual valves near the tank to see if they were closed. The control room operator believed that the valves were normally left in an open position to facilitate conducting the operation remotely. The tank level at this time was 7.2 feet.

The outside operator checked and found the manual valves at the tank open. The outside operator also saw no indication of flow on the flow meter and made an effort to visually verify that there was no flow. He then began to open and close the valves manually to try to fix the problem. He reported to the control room operator that he heard a clunk that may have cleared an obstruction, and the control room operator tried opening the valve remotely again. Both operators still saw no flow on the flow meter. The outside operator at this time got a call to deal with a problem in a different part of the plant and left. He did not make another attempt to visually verify whether there was flow. The control room operator left the valve in the closed position. In retrospect, it appears that the tank level at this time was approximately 7.7 feet.

Twelve minutes later, the high-level alarm on the tank sounded in the control room. The control room operator acknowledged the alarm and turned it off. In retrospect, it appears that the tank level at this time was approximately 8.5 feet, although there was no indication of the actual level on the control board. The control room operator got an alarm about an important condition in another part of the plant and turned his attention to dealing with that alarm. A few minutes later, the tank overflowed.

The accident report concluded, “The available evidence should have been sufficient to give the control room operator a clear indication that (the tank) was indeed filling and required immediate attention.” This statement is a classic example of hindsight bias—note the use of the words “should have …” The report does not identify what that evidence was. In fact, the majority of the evidence that both operators had at this time was that the tank was not filling.

To overcome hindsight bias, it is useful to examine exactly what evidence the operators had at the time of each decision in the sequence of events. One way to do this is to draw the operator’s process model and the values of each of the relevant variables in it. In this case, both operators thought the control valve was closed—the control room operator had closed it and the control panel indicated that it was closed, the flow meter showed no flow, and the outside operator had visually checked and there was no flow. The situation is complicated by the occurrence of other alarms that the operators had to attend to at the same time.
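Such a process-model comparison can be laid out directly from the narrative above. The sketch below is only an illustrative reconstruction of the control room operator’s beliefs at two decision points, contrasted with what was actually true in retrospect; the structure is an assumption, while the values are taken from the report as summarized here.

# What the control room operator believed versus the actual state, at two decisions.
snapshots = [
    {
        "decision": "Leave the valve commanded closed and resume other work",
        "believed": {"control valve": "closed", "flow into tank": "none",
                     "tank level": "below 7.5 ft (7.5-ft alarm silent)"},
        "actual":   {"control valve": "open", "flow into tank": "present",
                     "tank level": "about 7.7 ft", "7.5-ft alarm": "not working"},
    },
    {
        "decision": "Acknowledge and silence the high-level alarm",
        "believed": {"alarm": "nuisance alarm (occurred about monthly)",
                     "flow into tank": "none"},
        "actual":   {"tank level": "about 8.5 ft", "flow into tank": "present"},
    },
]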
Why did the control board show the control valve was closed when it must have actually been open? It turns out that there is no way for the control room operator to get confirmation that the valve has actually closed after he commands it closed. The valve was not equipped with a valve stem position monitor, so the control room operator only knows that a signal has gone to the valve for it to close but not whether it has actually done so. Operators in many accidents, including Three Mile Island, have been confused about the actual position of valves due to similar designs.

An additional complication is that while there is an alarm in the tank that should sound when the liquid level reaches 7.5 feet, that alarm was not working at the time, and the operator did not know it was not working. So the operator had extra reason to believe the liquid level had not risen above 7.5 feet, given that he believed there was no flow into the tank and the 7.5-foot alarm had not sounded. The level transmitter (which provided the information to the 7.5-foot alarm) had been operating erratically for a year and a half, but a work order had not been written to repair it until the month before. It had supposedly been fixed two weeks earlier, but it clearly was not working at the time of the spill.

The investigators, in retrospect knowing that there indeed had to have been some flow, suggested that the control room operator “could have” called up trend data on the control board and detected the flow. But this suggestion is classic hindsight bias. The control room operator had no reason to perform this extra check and was busy taking care of critical alarms in other parts of the plant. Dekker notes the distinction between data availability, which is what can be shown to have been physically available somewhere in the situation, and data observability, which is what was observable given the features of the interface and the multiple interleaving tasks, goals, interests, and knowledge of the people looking at it [51]. The trend data were available to the control room operator, but they were not observable without taking special actions that did not seem necessary at the time.

While that explains why the operator did not know the tank was filling, it does not fully explain why he did not respond to the high-level alarm. The operator said that he thought the liquid was “tickling” the sensor and triggering a false alarm. The accident report concludes that the operator should have had sufficient evidence the tank was indeed filling and responded to the alarm. Not included in the official accident report was the fact that nuisance alarms were relatively common in this unit: they occurred for this alarm about once a month and were caused by sampling errors or other routine activities. This alarm had never previously signaled a serious problem. Given that all the observable evidence showed the tank was not filling and that the operator needed to respond to a serious alarm in another part of the plant at the time, the operator not responding immediately to the alarm does not seem unreasonable.

An additional alarm was involved in the sequence of events. This alarm was at the tank and denoted that a gas from the liquid in the tank was detected in the air outside the tank. The outside operator went to investigate. Both operators are faulted in the report for waiting thirty minutes to sound the evacuation horn after this alarm went off. The official report says:

Interviews with operations personnel did not produce a clear reason why the response to the [gas] alarm took 31 minutes. The only explanation was that there was not a sense of urgency since, in their experience, previous [gas] alarms were attributed to minor releases that did not require a unit evacuation.

This statement is puzzling, because the statement itself provides a clear explanation for the behavior, that is, the previous experience. In addition, the alarm maxed out at 25 ppm, which is much lower than the actual amount in the air, but the control room operator had no way of knowing what the actual amount was. Moreover, there were no established criteria in any written procedure for what level of this gas or what alarms constitute an emergency condition that should trigger sounding the evacuation alarm. Also, none of the alarms were designated as critical alarms, which the accident report does concede might have “elicited a higher degree of attention amongst the competing priorities” of the control room operator. Finally, there was no written procedure for responding to an alarm for this gas. The “standard response” was for an outside operator to conduct a field assessment of the situation, which he did.

While training information was provided about the hazards of the particular gas that escaped, this information was not incorporated in standard operating or emergency procedures. The operators were apparently on their own to decide if an emergency existed and then were chastised for not responding (in hindsight) correctly. If there is a potential for operators to make poor decisions in safety-critical situations, then they need to be provided with the criteria to make such a decision. Expecting operators under stress and perhaps with limited information about the current system state and inadequate training to make such critical decisions based on their own judgment is unrealistic. It simply ensures that operators will be blamed when their decisions turn out, in hindsight, to be wrong.
|
||
|
||
|
||
One of the actions the operators were criticized for was trying to fix the problem rather than calling in emergency personnel immediately after the gas alarm sounded. In fact, this response is the normal one for humans (see chapter 9 and [115], as well as the following discussion): if it is not the desirable response, then procedures and training must be used to ensure that a different response is elicited. The accident report states that the safety policy for this company is:
At units, any employee shall assess the situation and determine what level of evacuation and what equipment shutdown is necessary to ensure the safety of all personnel, mitigate the environmental impact and potential for equipment/property damage. When in doubt, evacuate.
There are two problems with such a policy.
The first problem is that evacuation responsibilities (or emergency procedures more generally) do not seem to be assigned to anyone but can be initiated by all employees. While this may seem like a good idea, it has a serious drawback because one consequence of such a lack of assigned control responsibility is that everyone may think that someone else will take the initiative—and the blame if the alarm is a false one. Although everyone should report problems and even sound an emergency alert when necessary, there must be someone who has the actual responsibility, authority, and accountability to do so. There should also be backup procedures for others to step in when that person does not execute his or her responsibility acceptably.
The second problem with this safety policy is that unless the procedures clearly say to execute emergency procedures, humans are very likely to try to diagnose the situation first. The same problem pops up in many accident reports—humans who are overwhelmed with information that they cannot digest quickly or do not understand will first try to understand what is going on before sounding an alarm [115]. If management wants employees to sound alarms expeditiously and consistently, then the safety policy needs to specify exactly when alarms are required, not leave it up to personnel to “evaluate the situation” when they are probably confused and unsure as to what is going on (as in this case) and under pressure to make quick decisions in stressful situations. How many people, instead of dialing 911 immediately, try to put out a small kitchen fire themselves? That it often works simply reinforces the tendency to act in the same way during the next emergency. And it avoids the embarrassment of the firemen arriving for a non-emergency. As it turns out, the evacuation alert had been delayed in the past in this same plant, but nobody had investigated why that occurred.
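One way to remove that ambiguity is to write the evacuation criteria down as an explicit, checkable rule. The sketch below is purely illustrative: the thresholds, alarm designations, and time limit are invented assumptions, not values taken from the Citichem scenario or from any company's actual procedures.

# Hypothetical sketch of an explicit evacuation-decision rule, so that the
# decision does not depend on an individual operator's judgment under stress.
# All thresholds and alarm designations here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class UnitStatus:
    gas_ppm: float               # highest current reading from area gas detectors
    detector_maxed_out: bool     # True if any detector is pegged at full scale
    critical_alarm_active: bool  # any alarm designated "critical" is active
    minutes_unresolved: float    # time since the first alarm without a diagnosis

def evacuation_required(s: UnitStatus) -> bool:
    """Written criteria: evacuate when any one condition is met."""
    return (
        s.gas_ppm >= 20.0                 # illustrative action level, in ppm
        or s.detector_maxed_out           # reading beyond instrument range
        or s.critical_alarm_active
        or s.minutes_unresolved >= 10.0   # diagnosis is taking too long
    )

# Example: a pegged detector alone is sufficient to sound the evacuation horn.
status = UnitStatus(gas_ppm=25.0, detector_maxed_out=True,
                    critical_alarm_active=False, minutes_unresolved=4.0)
print("Sound evacuation horn" if evacuation_required(status) else "Continue assessment")

The particular numbers matter far less than the fact that, once written down, the decision no longer rests on an individual's judgment under stress and the criteria can be trained, reviewed, and audited.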
The accident report concludes with a recommendation that “operator duty to respond to alarms needs to be reinforced with the work force.” This recommendation is inadequate because it ignores why the operators did not respond to the alarms. More useful recommendations might have included designing more accurate and more observable feedback about the actual position of the control valve (rather than just the commanded position), about the state of flow into the tank, about the level of the liquid in the tank, and so on. The recommendation also ignores the ambiguous state of the company policy on responding to alarms.
Because the official report focused only on the role of the operators in the accident and did not even examine that in depth, a chance to detect flaws in the design and operation of the plant that could lead to future accidents was lost. To prevent future accidents, the report needed to explain such things as why the HAZOP performed on the unit did not identify any of the alarms in this unit as critical. Is there some deficiency in HAZOP or in the way it is being performed in this company? Why were there no procedures in place, or why were the ones in place ineffective, to respond to the emergency? Either the hazard was not identified, the company does not have a policy to create procedures for dealing with hazards, or it was an oversight and there was no procedure in place to check that there is a response for all identified hazards.
The report does recommend that a risk-assessed procedure for filling this tank be created that defines critical operational parameters such as the sequence of steps required to initiate the filling process, the associated process control parameters, the safe level at which the tank is considered full, the sequence of steps necessary to conclude and secure the tank-filling process, and the appropriate response to alarms. It does not say anything, however, about performing the same task for other processes in the plant. Either this tank and its safety-critical process are the only ones missing such procedures or the company is playing a sophisticated game of Whack-a-Mole (see chapter 13), in which only symptoms of the real problems are removed with each set of events investigated.
The official accident report concludes that the control room operator “did not demonstrate an awareness of risks associated with overflowing the tank and potential to generate high concentrations of [gas] if the [liquid in the tank] was spilled.” No further investigation of why this was true was included in the report. Was there a deficiency in the training procedures about the hazards associated with his job responsibilities? Even if the explanation is that this particular operator was simply incompetent (probably not true) and, although exposed to potentially effective training, did not profit from it, then the question becomes why such an operator was allowed to continue in that job and why the evaluation of his training outcomes did not detect this deficiency. It seemed that the outside operator also had a poor understanding of the risks from this gas, so there is clearly evidence that a systemic problem exists. An audit should have been performed to determine if a spill in this tank is the only hazard that is not understood and if these two operators are the only ones who are confused. Is this unit simply a poorly designed and managed one in the plant, or do similar deficiencies exist in other units?
Other important causal factors and questions also were not addressed in the report, such as why the level transmitter was not working so soon after it was supposedly fixed, why safety orders were so delayed (the average age of a safety-related work order in this plant was three months), why critical processes were allowed to operate with non-functioning or erratically functioning safety-related equipment, whether the plant management knew this was happening, and so on.
Hindsight bias and focusing only on the operator’s role in accidents prevent us from fully learning from accidents and making significant progress in improving safety.
section 11.8. Coordination and Communication.
The analysis so far has looked at each component separately. But coordination and communication between controllers are important sources of unsafe behavior. Whenever a component has two or more controllers, coordination should be examined carefully. Each controller may have different responsibilities, but the control actions provided may conflict. The controllers may also control the same aspects of the controlled component’s behavior, leading to confusion about who is responsible for providing control at any time. In the Walkerton E. coli water supply contamination example provided in appendix C, three control components were responsible for following up on inspection reports and ensuring the required changes were made: the Walkerton Public Utility Commission (WPUC), the Ministry of the Environment (MOE), and the Ministry of Health (MOH). The WPUC commissioners had no expertise in running a water utility and simply left the changes to the manager. The MOE and MOH both were responsible for performing the same oversight: the local MOH facility assumed that the MOE was performing this function, but the MOE’s budget had been cut, and follow-ups were not done. In this case, each of the three responsible groups assumed the other two controllers were providing the needed oversight, a common finding after an accident.
A different type of coordination problem occurred in an aircraft collision near Überlingen, Germany, in 2002 [28, 212]. The two controllers—the automated on-board TCAS system and the ground air traffic controller—provided uncoordinated control instructions that conflicted and actually caused a collision. The loss would have been prevented if both pilots had followed their TCAS alerts or both had followed the ground ATC instructions.
In the friendly fire accident analyzed in chapter 5, the responsibility of the AWACS controllers had officially been disambiguated by assigning one to control aircraft within the no-fly zone and the other to monitor and control aircraft outside it. This partitioning of control broke down over time, however, with the result that neither controlled the Black Hawk helicopter on that fateful day. No performance auditing occurred to ensure that the assumed and designed behavior of the safety control structure components was actually occurring.
Communication, both feedback and exchange of information, is also critical. All communication links should be examined to ensure they worked properly and, if they did not, the reasons for the inadequate communication must be determined. The Überlingen collision, between a Russian Tupolev aircraft and a DHL Boeing aircraft, provides a useful example. Wong used STAMP to analyze this accident and demonstrated how the communications breakdown on the night of the accident played an important role [212]. Figure 11.9 shows the components surrounding the controller at the Air Traffic Control Center in Zürich that was controlling both aircraft at the time and the feedback loops and communication links between the components. Dashed lines represent partial communication channels that are not available all the time. For example, only partial communication is available between the controller and multiple aircraft because only one party can transmit at one time when they are sharing a single radio frequency. In addition, the controller cannot directly receive information about TCAS advisories—the Pilot Not Flying (PNF) is supposed to report TCAS advisories to the controller over the radio. Finally, communicating all the time with all the aircraft requires the presence of two controllers at two different consoles, but only one controller was present at the time.
Nearly all the communication links were broken or ineffective at the time of the accident (see figure 11.10). A variety of conditions contributed to the lost links.
The first reason for the dysfunctional communication was unsafe practices such as inadequate briefings given to the two controllers scheduled to work the night shift, the second controller being in the break room (which was not officially allowed but was known and tolerated by management during times of low traffic), and the reluctance of the controller’s assistant to speak up with ideas to assist in the situation due to feeling that he would be overstepping his bounds. The inadequate briefings were due to a lack of information as well as each party believing they were not responsible for conveying specific information, a result of poorly defined roles and responsibilities.
More links were broken due to maintenance work that was being done in the control room to reorganize the physical sectors. This work led to unavailability of the direct phone line used to communicate with adjacent ATC centers (including ATC Karlsruhe, which saw the impending collision and tried to call ATC Zürich) and the loss of an optical short-term conflict alert (STCA) on the console. The aural short-term conflict alert was theoretically working, but nobody in the control room heard it.
Unusual situations led to the loss of additional links. These include the failure of the bypass telephone system from adjacent ATC centers and the appearance of a delayed A320 aircraft landing at Friedrichshafen. To communicate with all three aircraft, the controller had to alternate between two consoles, changing all the aircraft–controller communication channels to partial links.
Finally, some links were unused because the controller did not realize they were available. These include possible help from the other staff present in the control room (but working on the resectorization) and a third telephone system that the controller did not know about. In addition, the link between the crew of the Tupolev aircraft and its TCAS unit was broken due to the crew ignoring the TCAS advisory.
Figure 11.10 shows the remaining links after all these losses. At the time of the accident, there were no complete feedback loops left in the system and the few remaining connections were partial ones. The exception was the connection between the TCAS units of the two aircraft, which were still communicating with each other. The TCAS unit can only provide information to the crew, however, so this remaining loop was unable to exert any control over the aircraft.
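This kind of link analysis can be made mechanical. The sketch below, which uses an invented and greatly simplified set of links rather than the actual data behind figures 11.9 and 11.10, records the working links as a directed graph and then checks whether a controller still sits on a complete control loop, that is, whether it can both influence a controlled process and receive feedback from it.

# Hypothetical sketch: represent communication/feedback links as a directed
# graph and check whether a controller still has a complete control loop
# (a path out to a controlled process and a path back).  The link set below
# is a simplified illustration, not the actual data from the Überlingen analysis.

from collections import defaultdict

links = [                      # (from, to) pairs that were still working
    ("TCAS_Tupolev", "TCAS_DHL"),
    ("TCAS_DHL", "TCAS_Tupolev"),
    ("TCAS_DHL", "DHL_crew"),
    ("DHL_crew", "DHL_aircraft"),
    # Note: no working link to or from the ground controller in this example.
]

graph = defaultdict(list)
for src, dst in links:
    graph[src].append(dst)

def reachable(start):
    """All nodes reachable from start by following directed links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def has_complete_loop(controller, processes):
    """True if the controller can influence some process and get feedback back."""
    out = reachable(controller)
    return any(p in out and controller in reachable(p) for p in processes)

print(has_complete_loop("Ground_ATC", ["DHL_aircraft", "Tupolev_aircraft"]))  # False
print(has_complete_loop("TCAS_DHL", ["DHL_aircraft"]))  # False: information flows out, but no feedback returns

Run over the full set of links identified in an investigation, a check of this kind makes it easy to show which control loops remained intact at each point in the accident sequence.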
Another common type of communication failure is in the problem-reporting channels. In a large number of accidents, the investigators find that the problems were identified in time to prevent the loss but that the required problem-reporting channels were not used. Recommendations in the ensuing accident reports usually involve training people to use the reporting channels—based on an assumption that the lack of use reflected poor training—or attempting to enforce their use by reiterating the requirement that all problems be reported. These investigations, however, usually stop short of finding out why the reporting channels were not used. Often an examination and a few questions reveal that the formal reporting channels are difficult, awkward, and time-consuming to use. Redesigning such a poorly designed system will be more effective in ensuring future use than simply telling people they have to use it. Unless design changes are made, over time the poorly designed communication channels will again become underused.
At Citichem, all problems were reported orally to the control room operator, who was supposed to report them to someone above him. One conduit for information, of course, leads to a very fragile reporting system. At the same time, there were few formal communication and feedback channels established—communication was informal and ad hoc, both within Citichem and between Citichem and the local government.
section 11.9. Dynamics and Migration to a High-Risk State.
As noted previously, most major accidents result from a migration of the system toward reduced safety margins over time. In the Citichem example, pressure from commercial competition was one cause of this degradation in safety. It is, of course, a very common one. Operational safety practices at Citichem had been better in the past, but the current market conditions led management to cut the safety margins and ignore established safety practices. Usually there are precursors signaling the increasing risks associated with these changes in the form of minor incidents and accidents, but in this case, as in so many others, these precursors were not recognized. Ironically, the death of the Citichem maintenance manager in an accident led the management to make changes in the way they were operating, but it was too late to prevent the toxic chemical release.
The corporate leaders pressured the Citichem plant manager to operate at higher levels of risk by threatening to move operations to Mexico, leaving the current workers without jobs. Without any way of maintaining an accurate model of the risk in current operations, the plant manager allowed the plant to move to a state of higher and higher risk.
Another change over time that affected safety in this system was the physical change in the separation of the population from the plant. Usually hazardous facilities are originally placed far from population centers, but the population shifts after the facility is created. People want to live near where they work and do not like long commutes. Land and housing may be cheaper near smelly, polluting plants. In third world countries, utilities (such as power and water) and transportation facilities may be more readily available near heavy industrial plants, as was the case at Bhopal.
At Citichem, an important change over time was the obsolescence of the emergency preparations as the population increased. Roads, hospital facilities, firefighting equipment, and other emergency resources became inadequate. Not only were there insufficient resources to handle the changes in population density and location, but financial and other pressures militated against those wanting to update the emergency resources and plans.
Considering the Oakbridge community dynamics, the city contributed to the accident through the erosion of the safety controls due to the normal pressures facing any city government. Without any history of accidents, or risk assessments indicating otherwise, the plant was deemed safe, and officials allowed developers to build on previously restricted land. A contributing factor was the desire to increase city finances and business relationships that would assist in the reelection of the city officials. The city moved toward a state where casualties would be massive when an accident did occur.
The goal of understanding the dynamics is to redesign the system and the safety control structure to make them more conducive to system safety. For example, behavior is influenced by recent accidents or incidents: as safety efforts are successfully employed, the feeling grows that accidents cannot occur, leading to reduction in the safety efforts, an accident, and then increased controls for a while until the system drifts back to an unsafe state and complacency again increases . . .
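A toy simulation can make this loop visible. The sketch below is a hypothetical illustration of the complacency dynamics just described, not a calibrated model of Citichem or any real organization; every coefficient, threshold, and initial value is an arbitrary assumption chosen only to show the cycle of quiet operation, eroding effort, rising risk, incident, and temporary recovery.

# Hypothetical toy model of the drift toward higher risk described above.
# All coefficients, thresholds, and initial values are arbitrary; the point
# is only to illustrate the feedback structure, not to predict anything.

def simulate(months=120):
    effort, perceived_risk, actual_risk = 0.9, 0.8, 0.2
    events = []
    for month in range(1, months + 1):
        # Quiet operation breeds complacency: perceived risk decays slowly.
        perceived_risk *= 0.95
        # Safety effort follows perceived risk (kept between 0 and 1).
        effort = min(1.0, max(0.0, effort + 0.3 * perceived_risk - 0.2 * effort))
        # Actual risk creeps upward on its own and is pushed down by effort.
        actual_risk = min(1.0, max(0.05, actual_risk + 0.04 - 0.06 * effort))
        # A high enough actual risk produces an incident, which restores the
        # perception of risk (and, with it, safety effort) for a while.
        if actual_risk >= 0.7:
            events.append(f"month {month:3d}: incident at simulated risk {actual_risk:.2f}")
            perceived_risk, actual_risk = 1.0, 0.3
    return events

print("\n".join(simulate()))

Even a crude model of this kind helps explain why a long quiet period is not, by itself, evidence that risk is low.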
This complacency factor is so common that any system safety effort must include ways to deal with it. SUBSAFE, the U.S. nuclear submarine safety program, has been particularly successful at accomplishing this goal. The SUBSAFE program is described in chapter 14.
One way to combat this erosion of safety is to provide ways to maintain accurate risk assessments in the process models of the system controllers. The more and better information controllers have, the more accurate will be their process models and therefore their decisions.
In the Citichem example, the dynamics of the city’s migration toward higher risk might be improved by doing better hazard analyses, increasing communication between the city and the plant (e.g., learning about incidents that are occurring), and forming community citizen groups to provide counterbalancing pressures on city officials to maintain the emergency response system and the other public safety measures.
Finally, understanding the reason for such migration provides an opportunity to design the safety control structure to prevent it or to detect it when it occurs. Thorough investigation of incidents using CAST and the insight it provides can be used to redesign the system or to establish operational controls to stop the migration toward increasing risk before an accident occurs.
section 11.10. Generating Recommendations from the CAST Analysis.
The goal of an accident analysis should not be just to address symptoms, to assign blame, or to determine which group or groups are more responsible than others. Blame is difficult to eliminate, but, as discussed in section 2.7, blame is antithetical to improving safety. It hinders accident and incident investigations and the reporting of errors before a loss occurs, and it hinders finding the most important factors that need to be changed to prevent accidents in the future. Often, blame is assigned to the least politically powerful in the control hierarchy or to those people or physical components physically and operationally closest to the actual loss events. Understanding why inadequate control was provided and why it made sense for the controllers to act in the way they did helps to defuse what seems to be a natural desire to assign blame for events. In addition, looking at how the entire safety control structure was flawed and conceptualizing accidents as complex processes rather than the result of independent events should reduce the finger pointing and the arguments about others being more to blame that often arise when system components other than the operators are identified as being part of the accident process. “More to blame” is not a relevant concept in a systems approach to accident analysis and should be resisted and avoided. The components of a system work together to obtain the results, and no part is more important than another.
The goal of the accident analysis should instead be to determine how to change or reengineer the entire safety control structure in the most cost-effective and practical way to prevent similar accident processes in the future. Once the STAMP analysis has been completed, generating recommendations is relatively simple and follows directly from the analysis results.
One consequence of the completeness of a STAMP analysis is that many possible recommendations may result—in some cases, too many to be practical to include in the final accident report. A determination of the relative importance of the potential recommendations may be required in terms of having the greatest impact on the largest number of potential future accidents. There is no algorithm for identifying these recommendations, nor can there be. Political and situational factors will always be involved in such decisions. Understanding the entire accident process and the overall safety control structure should help with this identification, however.
Some sample recommendations for the Citichem example are shown throughout the chapter. A more complete list of the recommendations that might result from a STAMP-based Citichem accident analysis follows. The list is divided into four parts: physical equipment and design, corporate management, plant operations and management, and government and community.
Physical Equipment and Design.
1. Add protection against rainwater getting into tanks.
2. Consider measures for preventing and detecting corrosion.
3. Change the design of the valves and vent pipes to respond to the two-phase flow problem (which was responsible for the valves and pipes being jammed).
4. Etc. (the rest of the physical plant factors are omitted)
Corporate Management.
1. Establish a corporate safety policy that specifies:
a. Responsibility, authority, and accountability of everyone with respect to safety.
b. Criteria for evaluating decisions and for designing and implementing safety controls.
2. Establish a corporate process safety organization to provide oversight that is responsible for:
a. Enforcing the safety policy.
b. Advising corporate management on safety-related decisions.
c. Performing risk analyses and overseeing safety in operations, including performing audits and setting reporting requirements (to keep corporate process models accurate). A safety working group at the corporate level should be considered.
d. Setting minimum requirements for safety engineering and operations at plants and overseeing the implementation of these requirements, as well as management of change requirements for evaluating all changes for their impact on safety.
e. Providing a conduit for safety-related information from below (a formal safety reporting system) as well as an independent feedback channel about process safety concerns raised by employees.
f. Setting minimum physical and operational standards (including functioning equipment and backups) for operations involving dangerous chemicals.
g. Establishing incident/accident investigation standards and ensuring recommendations are adequately implemented.
h. Creating and maintaining a corporate process safety information system.
3. Improve process safety communication channels within the corporate level as well as the information and feedback channels from Citichem plants to corporate management.
4. Ensure that appropriate communication and coordination is occurring between the Citichem plants and the local communities in which they reside.
5. Strengthen or create an inventory control system for safety-critical parts at the corporate level. Ensure that safety-related equipment is in stock at all times.
Citichem Oakbridge Plant Management and Operations.
1. Create a safety policy for the plant. Derive it from the corporate safety policy and make sure everyone understands it. Include minimum requirements for operations: for example, safety devices must be operational, and production should be shut down if they are not.
2. Establish a plant process safety organization and assign responsibility, authority, and accountability for this organization. Include a process safety manager whose primary responsibility is process safety. The responsibilities of this organization should include at least the following:
a. Perform hazard and risk analysis.
b. Advise plant management on safety-related decisions.
c. Create and maintain a plant process safety information system.
d. Perform or organize process safety audits and inspections using hazard analysis results as the preconditions for operations and maintenance.
e. Investigate hazardous conditions, incidents, and accidents.
f. Establish leading indicators of risk.
g. Collect data to ensure process safety policies and procedures are being followed.
3. Ensure that everyone has appropriate training in process safety and the specific hazards associated with plant operations.
4. Regularize and improve communication channels. Create the operational feedback channels from controlled components to controllers necessary to maintain accurate process models to assist in safety-related decision making. If the channels exist but are not used, then the reason why they are unused should be determined and appropriate changes made.
5. Establish a formal problem reporting system along with channels for problem reporting that include management and rank-and-file workers. Avoid communication channels with a single point of failure for safety-related messages. Decisions on whether management is informed about hazardous operational events should be proceduralized. Any operational conditions found to exist that involve hazards should be reported and thoroughly investigated by those responsible for system safety.
6. Consider establishing employee safety committees with union representation (if there are unions at the plant). Consider also setting up a plant process safety working group.
7. Require that all changes affecting safety equipment be approved by the plant manager or by his or her designated representative for safety. Any outage of safety-critical equipment must be reported immediately.
8. Establish procedures for quality control and checking of safety-critical activities and follow-up investigation of safety excursions (hazardous conditions).
9. Ensure that those performing safety-critical operations have appropriate skills and physical resources (including adequate rest).
10. Improve inventory control procedures for safety-critical parts at the Oakbridge plant.
11. Review procedures for turnarounds, maintenance, changes, operations, etc. that involve potential hazards and ensure that these are being followed. Create an MOC procedure that includes hazard analysis on all planned changes.
12. Enforce maintenance schedules. If delays are unavoidable, a safety analysis should be performed to understand the risks involved.
13. Establish incident/accident investigation standards and ensure that they are being followed and recommendations are implemented.
14. Create a periodic audit system on the safety of operations and the state of the plant. Audit scope might be defined by such information as the hazard analysis, identified leading indicators of risk, and past incident/accident investigations.
15. Establish communication channels with the surrounding community and provide appropriate information for better decision making by community leaders and information to emergency responders and the medical establishment. Coordinate with the surrounding community to provide information and assistance in establishing effective emergency preparedness and response measures. These measures should include a warning siren or other notification of an emergency and citizen information about what to do in the case of an emergency.
Government and Community.
1. Set policy with respect to safety and ensure that the policy is enforced.
2. Establish communication channels with hazardous industry in the community.
3. Establish and monitor information channels about the risks in the community. Collect and disseminate information on hazards, the measures citizens can take to protect themselves, and what to do in case of an emergency.
4. Encourage citizens to take responsibility for their own safety and to encourage local, state, and federal government to do the things necessary to protect them.
5. Encourage the establishment of a community safety committee and/or a safety ombudsman office that is not elected but represents the public in safety-related decision making.
6. Ensure that safety controls are in place before approving new development in hazardous areas, and if not (e.g., inadequate roads, communication channels, emergency response facilities), then perhaps make developers pay for them. Consider requiring developers to provide an analysis of the impact of new development on the safety of the community. Hire outside consultants to evaluate these impact analyses if such expertise is not available locally.
7. Establish an emergency preparedness plan and re-evaluate it periodically to determine if it is up to date. Include procedures for coordination among emergency responders.
8. Plan temporary measures for additional manpower in emergencies.
9. Acquire adequate equipment.
10. Provide drills and ensure alerting and communication channels exist and are operational.
11. Train emergency responders.
12. Ensure that transportation and other facilities exist for an emergency.
13. Set up formal communications between emergency responders (hospital staff, police, firefighters, Citichem). Establish emergency plans and means to periodically update them.
One thing to note from this example is that many of the recommendations are simply good safety management practices. While this particular example involved a system that was devoid of the standard safety practices common to most industries, many accident investigations conclude that standard safety management practices were not observed. This fact points to a great opportunity to prevent accidents simply by establishing standard safety controls using the techniques described in this book. While we want to learn as much as possible from each loss, preventing the losses in the first place is a much better strategy than waiting to learn from our mistakes.
These recommendations and those resulting from other thoroughly investigated accidents also provide an excellent resource to assist in generating the system safety requirements and constraints for similar types of systems and in designing improved safety control structures.
Just investigating the incident or accident is, of course, not enough. Recommendations must be implemented to be useful. Responsibility must be assigned for ensuring that changes are actually made. In addition, feedback channels should be established to determine whether the recommendations and changes were successful in reducing risk.
section 11.11. Experimental Comparisons of CAST with Traditional Accident Analysis.
Although CAST is new, several evaluations have been done, mostly aviation-related.
Robert Arnold, in a master’s thesis for Lund University, conducted a qualitative comparison of SOAM and STAMP in an Air Traffic Management (ATM) occurrence investigation. SOAM (Systemic Occurrence Analysis Methodology) is used by Eurocontrol to analyze ATM incidents. In Arnold’s experiment, an incident was investigated using SOAM and STAMP and the usefulness of each in identifying systemic countermeasures was compared. The results showed that SOAM is a useful heuristic and a powerful communication device, but that it is weak with respect to emergent phenomena and nonlinear interactions. SOAM directs the investigator to consider the context in which the events occur, the barriers that failed, and the organizational factors involved, but not the processes that created them or how the entire system can migrate toward the boundaries of safe operation. In contrast, the author concludes,
STAMP directs the investigator more deeply into the mechanism of the interactions between system components, and how systems adapt over time. STAMP helps identify the controls and constraints necessary to prevent undesirable interactions between system components. STAMP also directs the investigation through a structured analysis of the upper levels of the system’s control structure, which helps to identify high level systemic countermeasures. The global ATM system is undergoing a period of rapid technological and political change. . . . The ATM is moving from centralized human controlled systems to semi-automated distributed decision making. . . . Detailed new systemic models like STAMP are now necessary to prevent undesirable interactions between normally functioning system components and to understand changes over time in increasingly complex ATM systems.
Paul Nelson, in another Lund University master’s thesis, used STAMP and CAST to analyze the crash of Comair 5191 at Lexington, Kentucky, on August 27, 2006, when the pilots took off from the wrong runway [142]. The accident, of course, has been thoroughly investigated by the NTSB. Nelson concludes that the NTSB report narrowly targeted causes and potential solutions. No recommendations were put forth to correct the underlying safety control structure, which fostered process model inconsistencies, inadequate and dysfunctional control actions, and unenforced safety constraints. The CAST analysis, on the other hand, uncovered these useful levers for eliminating future loss.
Stringfellow compared the use of STAMP, augmented with guidewords for organizational and human error analysis, with the use of HFACS (Human Factors Analysis and Classification System) on the crash of a Predator-B unmanned aircraft near Nogales, Arizona [195]. HFACS, based on the Swiss Cheese Model (event-chain model), is an error-classification list that can be used to label types of errors, problems, or poor decisions made by humans and organizations [186]. Once again, although the analysis of the unmanned vehicle based on STAMP found all the factors found in the published analysis of the accident using HFACS [31, 195], the STAMP-based analysis identified additional factors, particularly those at higher levels of the safety control structure, for example, problems in the FAA’s COA approval process (see the footnote below). Stringfellow concludes:
The organizational influences listed in HFACS . . . do not go far enough for engineers to create recommendations to address organizational problems. . . . Many of the factors cited in Swiss Cheese-based methods don’t point to solutions; many are just another label for human error in disguise [195, p. 154].
In general, most accident analyses do a good job in describing what happened, but not why.
footnote. The COA or Certificate of Operation allows an air vehicle that does not nominally meet FAA safety standards access to the National Airspace System. The COA application process includes measures to mitigate risks, such as sectioning off the airspace to be used by the unmanned aircraft and preventing other aircraft from entering the space.
section 11.12. Summary.
In this chapter, the process for performing accident analysis using STAMP as the basis is described and illustrated using a chemical plant accident as an example. Stopping the analysis at the lower levels of the safety control structure, in this case at the physical controls and the plant operators, provides a distorted and incomplete view of the causative factors in the loss. Both a better understanding of why the accident occurred and the ability to prevent future ones are enhanced by a more complete analysis. As the entire accident process becomes better understood, individual mistakes and actions assume a much less important role in comparison to the role played by the environment and context in which those decisions and control actions take place. What may look like an error or even negligence by the low-level operators and controllers may appear much more reasonable given the full picture. In addition, changes at the lower levels of the safety control structure often have much less ability to impact the causal factors in major accidents than those at higher levels.
At all levels, focusing on assessing blame for the accident does not provide the information necessary to prevent future accidents. Accidents are complex processes, and understanding the entire process is necessary to provide recommendations that are going to be effective in preventing a large number of accidents and not just preventing the symptoms implicit in a particular set of events. There is too much repetition of the same causes of accidents in most industries. We need to improve our ability to learn from the past.
Improving accident investigation may require training accident investigators in systems thinking and in the types of environmental and behavior-shaping factors to consider during an analysis, some of which are discussed in later chapters. Tools to assist in the analysis, particularly graphical representations that illustrate interactions and causality, will help. But often the limitations of accident reports do not stem from the sincere efforts of the investigators but from political and other pressures to limit the causal factors identified to those at the lower levels of the management or political hierarchy. Combating these pressures is beyond the scope of this book. Removing blame from the process will help somewhat. Management also has to be educated to understand that safety pays and, in the longer term, costs less than the losses that result from weak safety programs and incomplete accident investigations.