PART 2.
STAMP: AN ACCIDENT MODEL BASED ON
SYSTEMS THEORY.
Part 2 introduces an expanded accident causality model based on the new assumptions in chapter 2 and satisfying the goals stemming from them. The theoretical
foundation for the new model is systems theory, as introduced in chapter 3. Using
this new causality model, called STAMP (Systems-Theoretic Accident Model and
Processes), changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but
our conception of causality is extended to include component interaction accidents.
Safety is reformulated as a control problem rather than a reliability problem. This
change leads to much more powerful and effective ways to engineer safer systems,
including the complex sociotechnical systems of most concern today.
The three main concepts in this model (safety constraints, hierarchical control
structures, and process models) are introduced first in chapter 4. Then the STAMP
causality model is described, along with a classification of accident causes implied
by the new model.
To provide additional understanding of STAMP, it is used to describe the causes
of several very different types of losses: a friendly fire shootdown of a U.S. Army
helicopter by a U.S. Air Force fighter jet over northern Iraq, the contamination of
a public water system with E. coli bacteria in a small town in Canada, and the loss
of a Milstar satellite. Chapter 5 presents the friendly fire accident analysis. The other
accident analyses are contained in appendixes B and C.
chapter 4.
A Systems-Theoretic View of Causality.
In the traditional causality models, accidents are considered to be caused by chains
of failure events, each failure directly causing the next one in the chain. Part 1
explained why these simple models are no longer adequate for the more complex
sociotechnical systems we are attempting to build today. The definition of accident
causation needs to be expanded beyond failure events so that it includes component
interaction accidents and indirect or systemic causal mechanisms.
The first step is to generalize the definition of an accident. An accident is an
unplanned and undesired loss event. That loss may involve human death and injury,
but it may also involve other major losses, including mission, equipment, financial,
and information losses.
Losses result from component failures, disturbances external to the system, interactions among system components, and behavior of individual system components
that lead to hazardous system states. Examples of hazards include the release of
toxic chemicals from an oil refinery, a patient receiving a lethal dose of medicine,
two aircraft violating minimum separation requirements, and commuter train doors
opening between stations.
In systems theory, emergent properties, such as safety, arise from the interactions
among the system components. The emergent properties are controlled by imposing
constraints on the behavior of and interactions among the components. Safety then
becomes a control problem where the goal of the control is to enforce the safety
constraints. Accidents result from inadequate control or enforcement of safety-related constraints on the development, design, and operation of the system.
At Bhopal, the safety constraint that was violated was that the MIC must not
come in contact with water. In the Mars Polar Lander, the safety constraint was that
the spacecraft must not impact the planet surface with more than a maximum force.
In the batch chemical reactor accident described in chapter 2, one safety constraint
is a limitation on the temperature of the contents of the reactor.
The problem then becomes one of control where the goal is to control the behavior of the system by enforcing the safety constraints in its design and operation.
Controls must be established to accomplish this goal. These controls need not necessarily involve a human or automated controller. Component behavior (including
failures) and unsafe interactions may be controlled through physical design, through
process (such as manufacturing processes and procedures, maintenance processes,
and operations), or through social controls. Social controls include organizational
(management), governmental, and regulatory structures, but they may also be cultural, policy, or individual (such as self-interest). As an example of the latter, one
explanation that has been given for the 2009 financial crisis is that when investment
banks went public, individual controls to reduce personal risk and long-term profits
were eliminated and risk shifted to shareholders and others who had few and weak
controls over those taking the risks.
In this framework, understanding why an accident occurred requires determining
why the control was ineffective. Preventing future accidents requires shifting from
a focus on preventing failures to the broader goal of designing and implementing
controls that will enforce the necessary constraints.
The STAMP (Systems-Theoretic Accident Model and Processes) accident model
is based on these principles. Three basic constructs underlie STAMP: safety constraints, hierarchical safety control structures, and process models.
section 4.1.
Safety Constraints.
The most basic concept in STAMP is not an event, but a constraint. Events leading
to losses occur only because safety constraints were not successfully enforced.
The difficulty of identifying and enforcing safety constraints in design and operations has increased over time. In many of our older and less automated systems,
physical and operational constraints were often imposed by the limitations of technology and of the operational environments. Physical laws and the limits of our
materials imposed natural constraints on the complexity of physical designs and
allowed the use of passive controls.
In engineering, passive controls are those that maintain safety by their presence:
basically, the system fails into a safe state or simple interlocks are used to limit
the interactions among system components to safe ones. Some examples of passive
controls that maintain safety by their presence are shields or barriers such as
containment vessels, safety harnesses, hardhats, passive restraint systems in vehicles,
and fences. Passive controls may also rely on physical principles, such as gravity,
to fail into a safe state. An example is an old railway semaphore that used weights
to ensure that if the cable (controlling the semaphore) broke, the arm would automatically drop into the stop position. Other examples include mechanical relays
designed to fail with their contacts open, and retractable landing gear for aircraft in
which the wheels drop and lock in the landing position if the pressure system that
raises and lowers them fails. For the batch chemical reactor example in chapter 2,
where the order in which the valves are opened is crucial, designers might have used a physical
interlock that did not allow the catalyst valve to be opened while the water valve
was closed.
In contrast, active controls require some action(s) to provide protection: (1) detection of a hazardous event or condition (monitoring), (2) measurement of some
variable(s), (3) interpretation of the measurement (diagnosis), and (4) response
(recovery or fail-safe procedures), all of which must be completed before a loss
occurs. These actions are usually implemented by a control system, which now commonly includes a computer.
Consider the simple passive safety control where the circuit for a high-power
outlet is run through a door that shields the power outlet. When the door is opened,
the circuit is broken and the power disabled. When the door is closed and the power
enabled, humans cannot touch the high-power outlet. Such a design is simple and
foolproof. An active safety control design for the same high-power source requires
some type of sensor to detect when the access door to the power outlet is opened
and an active controller to issue a control command to cut the power. The failure
modes for the active control system are greatly increased over the passive design,
as is the complexity of the system component interactions. In the railway semaphore
example, there must be a way to detect that the cable has broken (probably now a
digital system is used instead of a cable so the failure of the digital signaling system
must be detected) and some type of active controls used to warn operators to stop
the train. The design of the batch chemical reactor described in chapter 2 used a
computer to control the valve opening and closing order instead of a simple mechanical interlock.
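The contrast between the passive door circuit and its active replacement can be made concrete with a minimal Python sketch. Everything named here (DoorSensor, PowerRelay, the stuck_closed flag) is a hypothetical illustration, not a design from the text; it only shows how the four active-control steps introduce a failure mode that the passive circuit does not have.

```python
from dataclasses import dataclass


@dataclass
class DoorSensor:
    """Hypothetical door sensor; stuck_closed models one added failure mode."""
    door_open: bool = False
    stuck_closed: bool = False

    def read(self) -> bool:
        # A failed sensor keeps reporting "closed" whatever the door does.
        return False if self.stuck_closed else self.door_open


@dataclass
class PowerRelay:
    """Hypothetical relay that supplies the high-power outlet."""
    energized: bool = True

    def cut_power(self) -> None:
        self.energized = False


def active_interlock_step(sensor: DoorSensor, relay: PowerRelay) -> None:
    """One pass through detect, measure, interpret, respond."""
    reading = sensor.read()   # (1) detection and (2) measurement
    hazardous = reading       # (3) interpretation: an open door is hazardous
    if hazardous:
        relay.cut_power()     # (4) response


def outlet_is_hazardous(sensor: DoorSensor, relay: PowerRelay) -> bool:
    # The hazard depends on the true door state, not on what the sensor reports.
    return sensor.door_open and relay.energized
```

With a working sensor, opening the door leads the controller to cut power. With the sensor stuck, the same loop runs without error and the outlet stays live behind an open door, a state the passive circuit cannot reach because opening the door physically breaks the circuit.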
While simple examples are used here for practical reasons, the complexity of our
designs is reaching and exceeding the limits of our intellectual manageability with
a resulting increase in component interaction accidents and lack of enforcement of
the system safety constraints. Even the relatively simple computer-based batch
chemical reactor valve control design resulted in a component interaction accident.
There are often very good reasons to use active controls instead of passive ones,
including increased functionality, more flexibility in design, ability to operate over
large distances, weight reduction, and so on. But the difficulty of the engineering
problem is increased and more potential for design error is introduced.
A similar argument can be made for the interactions between operators and
the processes they control. Cook suggests that when controls were primarily
mechanical and were operated by people located close to the operating process,
proximity allowed sensory perception of the status of the process via direct physical
feedback such as vibration, sound, and temperature (figure 4.1). Displays were
directly linked to the process and were essentially a physical extension of it. For
example, the flicker of a gauge needle in the cab of a train indicated that (1) the
engine valves were opening and closing in response to slight pressure fluctuations,
(2) the gauge was connected to the engine, (3) the pointing indicator was free, and
so on. In this way, the displays provided a rich source of information about the
controlled process and the state of the displays themselves.
The introduction of electromechanical controls allowed operators to control
processes from a greater distance (both physical and conceptual) than possible with
purely mechanically linked controls (figure 4.2). That distance, however, meant that
operators lost a lot of direct information about the process: they could no longer
sense the process state directly and the control and display surfaces no longer provided as rich a source of information about the process or the state of the controls
themselves. The system designers had to synthesize and provide an image of the
process state to the operators. An important new source of design errors was introduced by the need for the designers to determine beforehand what information the
operator would need under all conditions to safely control the process. If the designers had not anticipated that a particular situation could occur and provided for it in the
original system design, they might also not anticipate the need of the operators for
information about it during operations.
Designers also had to provide feedback on the actions of the operators and on
any failures that might have occurred. The controls could now be operated without
the desired effect on the process, and the operators might not know about it. Accidents started to occur due to incorrect feedback. For example, major accidents
(including Three Mile Island) have involved the operators commanding a valve to
open and receiving feedback that the valve had opened, when in reality it had not.
In this case and others, the valves were wired to provide feedback indicating that
power had been applied to the valve, but not that the valve had actually opened.
Not only could the design of the feedback about success and failures of control
actions be misleading in these systems, but the return links were also subject
to failure.
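The valve feedback problem described above can be illustrated with a short Python sketch. The Valve class, its jammed flag, and the two feedback functions are hypothetical names introduced only for this illustration; the point is the difference between feedback that reports the command and feedback that reports the result.

```python
from dataclasses import dataclass


@dataclass
class Valve:
    """Hypothetical valve; jammed means it does not move when powered."""
    powered: bool = False
    is_open: bool = False
    jammed: bool = False

    def command_open(self) -> None:
        # Power is always applied, but a jammed valve stays where it is.
        self.powered = True
        if not self.jammed:
            self.is_open = True


def feedback_from_power(valve: Valve) -> bool:
    # Indicator wired to the actuator supply: reports the command, not the result.
    return valve.powered


def feedback_from_position(valve: Valve) -> bool:
    # Indicator wired to an assumed position switch on the valve itself.
    return valve.is_open
```

For a jammed valve, feedback_from_power reports success after command_open while feedback_from_position does not. Note that the position-based link is itself a component that can fail, so it reduces but does not remove the possibility of misleading feedback.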
Electromechanical controls relaxed constraints on the system design, allowing
greater functionality (figure 4.3). At the same time, they created new possibilities
for designer and operator error that had not existed or were much less likely in
mechanically controlled systems. The later introduction of computer and digital
controls afforded additional advantages and removed even more constraints on the
control system design, and introduced more possibility for error. Proximity in our
old mechanical systems provided rich sources of feedback that involved almost all
of the senses, enabling early detection of potential problems. We are finding it hard
to capture and provide these same qualities in new systems that use automated
controls and displays.
It is the freedom from constraints that makes the design of such systems so difficult. Physical constraints enforced discipline and limited complexity in system
design, construction, and modification. The physical constraints also shaped system
design in ways that efficiently transmitted valuable physical component and process
information to operators and supported their cognitive processes.
The same argument applies to the increasing complexity in organizational and
social controls and in the interactions among the components of sociotechnical
systems. Some engineering projects today employ thousands of engineers. The Joint
Strike Fighter, for example, has eight thousand engineers spread over most of the
United States. Corporate operations have become global, with greatly increased
interdependencies and producing a large variety of products. A new holistic approach
to safety, based on control and enforcing safety constraints in the entire sociotechnical system, is needed to ensure safety.
To accomplish this goal, system-level constraints must be identified, and responsibility for enforcing them must be divided up and allocated to appropriate groups.
For example, the members of one group might be responsible for performing hazard
analyses. The manager of this group might be assigned responsibility for ensuring
that the group has the resources, skills, and authority to perform such analyses and
for ensuring that high-quality analyses result. Higher levels of management might
have responsibility for budgets, for establishing corporate safety policies, and for
providing oversight to ensure that safety policies and activities are being carried out
successfully and that the information provided by the hazard analyses is used in
design and operations.
During system and product design and development, the safety constraints will
be broken down and sub-requirements or constraints allocated to the components
of the design as it evolves. In the batch chemical reactor, for example, the system
safety requirement is that the temperature in the reactor must always remain below
a particular level. A design decision may be made to control this temperature using
a reflux condenser. This decision leads to a new constraint: “Water must be flowing
into the reflux condenser whenever catalyst is added to the reactor.” After a decision
is made about what component(s) will be responsible for operating the catalyst and
water valves, additional requirements will be generated. If, for example, a decision
is made to use software rather than (or in addition to) a physical interlock, the
software must be assigned the responsibility for enforcing the constraint: “The
water valve must always be open when the catalyst valve is open.”
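As a minimal sketch of what assigning this constraint to software could look like, the following Python class refuses any valve command that would leave the catalyst valve open while the water valve is closed. The class and method names are hypothetical, and the check covers only the commanded valve states: it assumes that an open water valve means water is actually flowing, which a real design would have to confirm through feedback.

```python
class InterlockViolation(Exception):
    """Raised when a command would violate the valve-ordering constraint."""


class ReactorValves:
    """Hypothetical software interlock for the batch reactor valves."""

    def __init__(self) -> None:
        self.water_open = False
        self.catalyst_open = False

    def open_water(self) -> None:
        self.water_open = True

    def close_water(self) -> None:
        # Closing water while catalyst is open would violate the constraint.
        if self.catalyst_open:
            raise InterlockViolation("close the catalyst valve first")
        self.water_open = False

    def open_catalyst(self) -> None:
        # Catalyst may be opened only when the water valve is already open.
        if not self.water_open:
            raise InterlockViolation("open the water valve first")
        self.catalyst_open = True

    def close_catalyst(self) -> None:
        self.catalyst_open = False

    def constraint_holds(self) -> bool:
        # "The water valve must always be open when the catalyst valve is open."
        return self.water_open or not self.catalyst_open
```

Rejecting the unsafe command at the point where it is issued is one design choice among several; an alternative would be to monitor the constraint continuously and force the catalyst valve closed whenever it is violated.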
In order to provide the level of safety demanded by society today, we first need
to identify the safety constraints to enforce and then to design effective controls to
enforce them. This process is much more difficult for today's complex and often
high-tech systems than in the past, and new techniques, such as those described in
part 3, are going to be required to solve it; for example, methods to assist in generating the component safety constraints from the system safety constraints.
The alternative (building only the simple electromechanical systems of the past or
living with higher levels of risk) is for the most part not going to be considered an
acceptable solution.
section 4.2.
The Hierarchical Safety Control Structure.
In systems theory (see section 3.3), systems are viewed as hierarchical structures,
where each level imposes constraints on the activity of the level beneath it; that is,
constraints or lack of constraints at a higher level allow or control lower-level
behavior.
Control processes operate between levels to control the processes at lower levels
in the hierarchy. These control processes enforce the safety constraints for which
the control process is responsible. Accidents occur when these processes provide
inadequate control and the safety constraints are violated in the behavior of the
lower-level components.
Describing accidents in terms of a hierarchy of control based on adaptive
feedback mechanisms gives adaptation a central role in the understanding and
prevention of accidents.
At each level of the hierarchical structure, inadequate control may result from
missing constraints (unassigned responsibility for safety), inadequate safety control
commands, commands that were not executed correctly at a lower level, or inadequately communicated or processed feedback about constraint enforcement. For
example, an operations manager may provide unsafe work instructions or procedures to the operators, or the manager may provide instructions that enforce the
safety constraints, but the operators may ignore them. The operations manager may
not have the feedback channels established to determine that unsafe instructions
were provided or that his or her safety-related instructions are not being followed.
Figure 4.4 shows a typical sociotechnical hierarchical safety control structure
common in a regulated, safety-critical industry in the United States, such as air
transportation. Each system, of course, must be modeled to include its specific
features. Figure 4.4 has two basic hierarchical control structures, one for system
development (on the left) and one for system operation (on the right), with interactions between them. An aircraft manufacturer, for example, might have only
system development under its immediate control, but safety involves both development and operational use of the aircraft, and neither can be accomplished successfully in isolation. Safety during operation depends partly on the original design and
development and partly on effective control over operations. Communication channels may be needed between the two structures. For example, aircraft manufacturers must communicate to their customers the assumptions about the operational
environment upon which the safety analysis was based, as well as information about
safe operating procedures. The operational environment (e.g., the commercial airline
industry), in turn, provides feedback to the manufacturer about the performance of
the system over its lifetime.
Between the hierarchical levels of each safety control structure, effective communication channels are needed, both a downward reference channel providing the
information necessary to impose safety constraints on the level below and an upward
measuring channel to provide feedback about how effectively the constraints are
being satisfied (figure 4.5). Feedback is critical in any open system in order to
provide adaptive control. The controller uses the feedback to adapt future control
commands to more readily achieve its goals.
Government, general industry groups, and the court system are the top two
levels of each of the generic control structures shown in figure 4.4. The government
control structure in place to control development may differ from that controlling
operations: responsibility for certifying the aircraft developed by aircraft manufacturers is assigned to one group at the FAA, while responsibility for supervising
airline operations is assigned to a different group. The appropriate constraints in
each control structure and at each level will vary but in general may include technical design and process constraints, management constraints, manufacturing constraints, and operational constraints.
At the highest level in both the system development and system operation hierarchies are Congress and state legislatures. Congress controls safety by passing laws
and by establishing and funding government regulatory structures. Feedback as to
the success of these controls or the need for additional ones comes in the form of
government reports, congressional hearings and testimony, lobbying by various
interest groups, and, of course, accidents.
The next level contains government regulatory agencies, industry associations,
user associations, insurance companies, and the court system. Unions have always
played an important role in ensuring safe operations, such as the air traffic controllers union in the air transportation system, or in ensuring worker safety in
manufacturing. The legal system tends to be used when there is no regulatory
authority and the public has no other means to encourage a desired level of concern
for safety in company management. The constraints generated at this level and
imposed on companies are usually in the form of policy, regulations, certification,
standards (by trade or user associations), or threat of litigation. Where there is a
union, safety-related constraints on operations or manufacturing may result from
union demands and collective bargaining.
Company management takes the standards, regulations, and other general controls on its behavior and translates them into specific policy and standards for the
company. Many companies have a general safety policy (it is required by law in
Great Britain) as well as more detailed standards documents. Feedback may come
in the form of status reports, risk assessments, and incident reports.
In the development control structure (shown on the left of figure 4.4), company
policies and standards are usually tailored and perhaps augmented by each engineering project to fit the needs of the particular project. The higher-level control
process may provide only general goals and constraints and the lower levels may
then add many details to operationalize the general goals and constraints given the
immediate conditions and local goals. For example, while government or company
standards may require a hazard analysis be performed, the system designers and
documenters (including those designing the operational procedures and writing user
manuals) may have control over the actual hazard analysis process used to identify
specific safety constraints on the design and operation of the system. These detailed
procedures may need to be approved by the level above.
The design constraints identified as necessary to control system hazards are
passed to the implementers and assurers of the individual system components
along with standards and other requirements. Success is determined through feedback provided by test reports, reviews, and various additional hazard analyses. At
the end of the development process, the results of the hazard analyses as well
as documentation of the safety-related design features and design rationale should
be passed on to the maintenance group to be used in the system evolution and
sustainment process.
A similar process involving layers of control is found in the system operation
control structure. In addition, there will be (or at least should be) interactions
between the two structures. For example, the safety design constraints used during
development should form the basis for operating procedures and for performance
and process auditing.
As in any control loop, time lags may affect the flow of control actions and feedback and may impact the effectiveness of the control loop in enforcing the safety
constraints. For example, standards can take years to develop or change, a time
scale that may keep them behind current technology and practice. At the physical
level, new technology may be introduced in different parts of the system at different
rates, which may result in asynchronous evolution of the control structure. In the
accidental shootdown of two U.S. Army Black Hawk helicopters by two U.S. Air
Force F-15s in the no-fly zone over northern Iraq in 1994, for example, the fighter
jet aircraft and the helicopters were inhibited in communicating by radio because
the F-15 pilots used newer jam-resistant radios that could not communicate with
the older-technology Army helicopter radios. Hazard analysis needs to include the
influence of these time lags and potential changes over time.
A common way to deal with time lags leading to delays is to delegate responsibility to lower levels that are not subject to as great a delay in obtaining information
or feedback from the measuring channels. In periods of quickly changing technology,
time lags may make it necessary for the lower levels to augment the control processes passed down from above or to modify them to fit the current situation. Time
lags at the lowest levels, as in the Black Hawk shootdown example, may require the
use of feedforward control to overcome lack of feedback or may require temporary
controls on behavior. Communication between the F-15s and the Black Hawks
would have been possible if the F-15 pilots had been told to use an older radio
technology available to them, as they were commanded to do for other types of
friendly aircraft.
More generally, control structures always change over time, particularly those
that include humans and organizational components. Physical devices also change
with time, but usually much more slowly and in more predictable ways. If we are to handle
social and human aspects of safety, then our accident causality models must include
the concept of change. In addition, controls and assurance that the safety control
structure remains effective in enforcing the constraints over time are required.
Control does not necessarily imply rigidity and authoritarian management
styles. Rasmussen notes that control at each level may be enforced in a very prescriptive command-and-control structure or it may be loosely implemented as performance objectives with many degrees of freedom in how the objectives are met.
Recent trends from management by oversight to management by insight
reflect differing levels of feedback control that are exerted over the lower levels and
a change from prescriptive management control to management by objectives,
where the objectives are interpreted and satisfied according to the local context.
Management insight, however, does not mean abdication of safety-related responsibility. In the Milstar satellite and
Mars Polar Lander losses, the accident reports all note that a poor transition from oversight to insight was a factor in the losses. Attempts to delegate decisions and to manage by objectives require an explicit formulation of the value
criteria to be used and an effective means for communicating the values down
through society and organizations. In addition, the impact of specific decisions at
each level on the objectives and values passed down needs to be adequately and
formally evaluated. Feedback is required to measure how successfully the functions
are being performed.
Although regulatory agencies are included in the figure 4.4 example, there is no
implication that government regulation is required for safety. The only requirement
is that responsibility for safety is distributed in an appropriate way throughout
the sociotechnical system. In aircraft safety, for example, manufacturers play the
major role while the FAA type certification authority simply provides oversight that
safety is being successfully engineered into aircraft at the lower levels of the hierarchy. If companies or industries are unwilling or incapable of performing their
public safety responsibilities, then government has to step in to achieve the overall
public safety goals. But a much better solution is for company management to take
responsibility, as it has direct control over the system design and manufacturing and
over operations.
The safety-control structure will differ among industries and examples are spread
among the following chapters. Figure C.1 in appendix C shows the control structure
and safety constraints for the hierarchical water safety control system in Ontario,
Canada. The structure is drawn on its side (as is more common for control diagrams)
so that the top of the hierarchy is on the left side of the figure. The system hazard
is exposure of the public to E. coli or other health-related contaminants through the
public drinking water system; therefore, the goal of the safety control structure is to
prevent such exposure. This goal leads to two system safety constraints:
1. Water quality must not be compromised.
2. Public health measures must reduce the risk of exposure if water quality is
somehow compromised (such as notification and procedures to follow).
The physical processes being controlled by this control structure (shown at the
right of the figure) are the water system, the wells used by the local public utilities,
and public health. Details of the control structure are discussed in appendix C, but
appropriate responsibility, authority, and accountability must be assigned to each
component with respect to the role it plays in the overall control structure. For
example, the responsibility of the Canadian federal government is to establish a
nationwide public health system and ensure that it is operating effectively. The
provincial government must establish regulatory bodies and codes, provide resources
to the regulatory bodies, provide oversight and feedback loops to ensure that the
regulators are doing their job adequately, and ensure that adequate risk assessment
is conducted and effective risk management plans are in place. Local public utility
operations must apply adequate doses of chlorine to kill bacteria, measure the
chlorine residuals, and take further steps if evidence of bacterial contamination is
found. While chlorine residuals are a quick way to get feedback about possible
contamination, more accurate feedback is provided by analyzing water samples, but
this takes longer (it has a greater time lag). Both have their uses in the overall safety
control structure of the public water supply.
Safety control structures may be very complex. Abstracting and concentrating on
parts of the overall structure may be useful in understanding and communicating
about the controls. In examining different hazards, only subsets of the overall structure may be relevant and need to be considered in detail and the rest can be treated
as the inputs to or the environment of the substructure. The only critical part is that
the hazards must first be identified at the system level and the process must then
proceed top-down and not bottom-up to identify the safety constraints for the parts
of the overall control structure.
The operation of sociotechnical safety control structures at all levels is facing the
stresses noted in chapter 1, such as rapidly changing technology, competitive and
time-to-market pressures, and changing public and regulatory views of responsibility
for safety. These pressures can lead to a need for new procedures or new controls
to ensure that required safety constraints are not ignored.
section 4 3.
Process Models.
The third concept used in STAMP, along with safety constraints and hierarchical
safety control structures, is process models. Process models are an important part of
control theory. The four conditions required to control a process are described in
chapter 3. The first is a goal, which in STAMP is the safety constraints that must
be enforced by each controller in the hierarchical safety control structure. The
action condition is implemented in the .(downward). control channels and the observability condition is embodied in the .(upward). feedback or measuring channels. The
final condition is the model condition. Any controller, human or automated,
needs a model of the process being controlled to control it effectively .(figure 4.6).
At one extreme, this process model may contain only one or two variables, such
as the model required for a simple thermostat, which contains the current temperature and the setpoint and perhaps a few control laws about how temperature is
changed. At the other extreme, effective control may require a very complex model
with a large number of state variables and transitions, such as the model needed to
control air traffic.
Whether the model is embedded in the control logic of an automated controller
or in the mental model maintained by a human controller, it must contain the same
type of information: the required relationship among the system variables .(the
control laws), the current state .(the current values of the system variables), and the
ways the process can change state. This model is used to determine what control
actions are needed, and it is updated through various forms of feedback. If the model
of the room temperature shows that the ambient temperature is less than the setpoint, then the thermostat issues a control command to start a heating element.
Temperature sensors provide feedback about the .(hopefully rising). temperature.
This feedback is used to update the thermostat's model of the current room temperature. When the setpoint is reached, the thermostat turns off the heating element.
In the same way, human operators also require accurate process or mental models
to provide safe control actions.
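The thermostat example can be sketched in a few lines of code. This is a minimal illustration, not a real thermostat design: the class and function names are hypothetical, and the control law is reduced to a single comparison. It shows the three elements named above: the model of the current state, the control law, and the feedback that updates the model.

```python
from dataclasses import dataclass


@dataclass
class ThermostatModel:
    """The controller's process model: its beliefs, not the true room state."""
    setpoint: float          # desired temperature (the goal)
    believed_temp: float     # current state as last reported by feedback
    heater_on: bool = False  # last command the controller issued


class Thermostat:
    def __init__(self, setpoint: float, initial_reading: float):
        self.model = ThermostatModel(setpoint, initial_reading)

    def update_from_feedback(self, sensor_reading: float) -> None:
        # Feedback channel: refresh the model's view of the current state.
        self.model.believed_temp = sensor_reading

    def decide(self) -> str:
        # Control law: heat while the believed temperature is below the setpoint.
        self.model.heater_on = self.model.believed_temp < self.model.setpoint
        return "HEAT_ON" if self.model.heater_on else "HEAT_OFF"


def run(readings, setpoint=20.0):
    """Feed a sequence of sensor readings to the thermostat; return its commands."""
    thermostat = Thermostat(setpoint, readings[0])
    commands = []
    for reading in readings:
        thermostat.update_from_feedback(reading)
        commands.append(thermostat.decide())
    return commands
```

If the sensor sticks at 18 degrees, run([18.0, 18.0, 18.0]) keeps returning HEAT_ON no matter how hot the room actually becomes: the control law is working as designed, but the process model no longer matches the process.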
Component interaction accidents can usually be explained in terms of incorrect
process models. For example, the Mars Polar Lander software thought the spacecraft
had landed and issued a control instruction to shut down the descent engines. The
captain of the Herald of Free Enterprise thought the ferry doors were closed and
ordered the ship to leave the mooring. The pilots in the Cali Colombia B757 crash
thought R was the symbol denoting the radio beacon near Cali.
In general, accidents often occur, particularly component interaction accidents
and accidents involving complex digital technology or human error, when the
process model used by the controller .(automated or human). does not match the
process and, as a result:
1. Incorrect or unsafe control commands are given
2. Required control actions .(for safety). are not provided
3. Potentially correct control commands are provided at the wrong time .(too
early or too late), or
4. Control is stopped too soon or applied too long.
These four types of inadequate control actions are used in the new hazard analysis technique described in chapter 8.
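One way to make the four types concrete is to write them as an enumeration with a small classifier. This is only a sketch under stated assumptions: the parameter names, the tolerance, and the rule that any command issued when none was required counts as type 1 are illustrative choices, not definitions taken from the hazard analysis technique itself.

```python
from enum import Enum
from typing import Optional


class UnsafeControl(Enum):
    INCORRECT_COMMAND = 1  # an incorrect or unsafe command is given
    NOT_PROVIDED = 2       # a required control action is not provided
    WRONG_TIMING = 3       # provided too early or too late
    WRONG_DURATION = 4     # stopped too soon or applied too long


def classify(required: bool, issued: bool, command_ok: bool = True,
             start_error: float = 0.0, duration_error: float = 0.0,
             tolerance: float = 0.0) -> Optional[UnsafeControl]:
    """Classify one control action; return None if it was adequate.

    start_error and duration_error are signed deviations (in seconds) from
    the required start time and duration; both are hypothetical inputs.
    """
    if issued and (not required or not command_ok):
        return UnsafeControl.INCORRECT_COMMAND
    if required and not issued:
        return UnsafeControl.NOT_PROVIDED
    if not issued:
        return None  # nothing was required and nothing was done
    if abs(start_error) > tolerance:
        return UnsafeControl.WRONG_TIMING
    if abs(duration_error) > tolerance:
        return UnsafeControl.WRONG_DURATION
    return None
```

For example, classify(required=True, issued=True, start_error=2.0, tolerance=0.5) returns UnsafeControl.WRONG_TIMING: the command itself was correct but arrived outside the acceptable window.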
A model of the process being controlled is required not just at the lower physical
levels of the hierarchical control structure, but at all levels. In order to make proper
decisions, the manager of an oil refinery may need to have a model of the current
maintenance level of the safety equipment of the refinery, the state of safety training
of the workforce, and the degree to which safety requirements are being followed
or are effective, among other things. The CEO of the global oil conglomerate has a
much less detailed model of the state of the refineries he controls but at the same
time requires a broader view of the state of safety of all the corporate assets in order
to make appropriate corporate-level decisions impacting safety.
Process models are not only used during operations but also during system development activities. Designers use both models of the system being designed and
models of the development process itself. The developers may have an incorrect
model of the system or software behavior necessary for safety or the physical laws
controlling the system. Safety may also be impacted by developers' incorrect models
of the development process itself.
As an example of the latter, a Titan/Centaur satellite launch system, along with
the Milstar satellite it was transporting into orbit, was lost due to a typo in a load
tape used by the computer to determine the attitude change instructions to issue to
the engines. The information on the load tape was essentially part of the process
model used by the attitude control software. The typo was not caught during the
development process partly because of flaws in the developers' models of the testing
process: each thought someone else was testing the software using the actual load
tape when, in fact, nobody was .(see appendix B).
In summary, process models play an important role .(1). in understanding why
accidents occur and why humans provide inadequate control over safety-critical
systems and .(2). in designing safer systems.
section 4.4.
STAMP.
The STAMP .(Systems-Theoretic Accident Model and Process). model of accident
causation is built on these three basic concepts, namely safety constraints, a hierarchical
safety control structure, and process models, along with basic systems theory concepts. All the pieces for a new causation model have been presented. It is now simply
a matter of putting them together.
In STAMP, systems are viewed as interrelated components kept in a state of
dynamic equilibrium by feedback control loops. Systems are not treated as static
but as dynamic processes that are continually adapting to achieve their ends and to
react to changes in themselves and their environment.
Safety is an emergent property of the system that is achieved when appropriate
constraints on the behavior of the system and its components are satisfied. The
original design of the system must not only enforce appropriate constraints on
behavior to ensure safe operation, but the system must continue to enforce the
safety constraints as changes and adaptations to the system design occur over time.
Accidents are the result of flawed processes involving interactions among people,
societal and organizational structures, engineering activities, and physical system
components that lead to violating the system safety constraints. The process leading
up to an accident is described in STAMP in terms of an adaptive feedback function
that fails to maintain safety as system performance changes over time to meet a
complex set of goals and values.
Instead of defining safety management in terms of preventing component
failures, it is defined as creating a safety control structure that will enforce the
behavioral safety constraints and ensure its continued effectiveness as changes
and adaptations occur over time. Effective safety .(and risk). management may
require limiting the types of changes that occur but the goal is to allow as much
flexibility and performance enhancement as possible while enforcing the safety
constraints.
Accidents can be understood, using STAMP, by identifying the safety constraints
that were violated and determining why the controls were inadequate in enforcing
them. For example, understanding the Bhopal accident requires determining not
simply why the maintenance personnel did not insert the slip blind, but also why
the controls that had been designed into the system to prevent the release of hazardous chemicals and to mitigate the consequences of such occurrences, including
maintenance procedures and oversight of maintenance processes, refrigeration units,
gauges and other monitoring units, a vent scrubber, water spouts, a flare tower,
safety audits, alarms and practice alerts, emergency procedures and equipment, and
others, were not successful.
STAMP not only allows consideration of more accident causes than simple component failures, but it also allows more sophisticated analysis of failures and component failure accidents. Component failures may result from inadequate constraints
on the manufacturing process; inadequate engineering design such as missing or
incorrectly implemented fault tolerance; lack of correspondence between individual
component capacity .(including human capacity). and task requirements; unhandled
environmental disturbances .(e.g., electromagnetic interference or EMI); inadequate
maintenance; physical degradation .(wearout); and so on.
Component failures may be prevented by increasing the integrity or resistance
of the component to internal or external influences or by building in safety margins
or safety factors. They may also be avoided by operational controls, such as
operating the component within its design envelope and by periodic inspections and
preventive maintenance. Manufacturing controls can reduce deficiencies or flaws
introduced during the manufacturing process. The effects of physical component
failure on system behavior may be eliminated or reduced by using redundancy. The
important difference from other causality models is that STAMP goes beyond
simply blaming component failure for accidents by requiring that the reasons be
identified for why those failures occurred .(including systemic factors). and led to an
accident, that is, why the controls instituted for preventing such failures or for minimizing their impact on safety were missing or inadequate. And it includes other
types of accident causes, such as component interaction accidents, which are becoming more frequent with the introduction of new technology and new roles for
humans in system control.
STAMP does not lend itself to a simple graphic representation of accident causality .(see figure 4.7). While dominoes, event chains, and holes in Swiss cheese are very
compelling because they are easy to grasp, they oversimplify causality and thus the
approaches used to prevent accidents.
section 4.5.
A General Classification of Accident Causes.
Starting from the basic definitions in STAMP, the general causes of accidents can
be identified using basic systems and control theory. The resulting classification is
useful in accident analysis and accident prevention activities.
Accidents in STAMP are the result of a complex process that results in the system
behavior violating the safety constraints. The safety constraints are enforced by the
control loops between the various levels of the hierarchical control structure that
are in place during design, development, manufacturing, and operations.
Using the STAMP causality model, if there is an accident, one or more of the
following must have occurred:
1. The safety constraints were not enforced by the controller.
a. The control actions necessary to enforce the associated safety constraint at
each level of the sociotechnical control structure for the system were not
provided.
b. The necessary control actions were provided but at the wrong time .(too
early or too late). or stopped too soon.
c. Unsafe control actions were provided that caused a violation of the safety
constraints.
2. Appropriate control actions were provided but not followed.
These same general factors apply at each level of the sociotechnical control structure, but the interpretation .(application). of the factor at each level may differ.
Classification of accident causal factors starts by examining each of the basic
components of a control loop .(see figure 3.2). and determining how their improper
operation may contribute to the general types of inadequate control.
Figure 4.8 shows the classification. The causal factors in accidents can be divided
into three general categories: .(1). the controller operation, .(2). the behavior of actuators and controlled processes, and .(3). communication and coordination among
controllers and decision makers. When humans are involved in the control structure, context and behavior-shaping mechanisms also play an important role in
causality.
section 4.5.1. Controller Operation.
Controller operation has three primary parts: control inputs and other relevant
external information sources, the control algorithms, and the process model. Inadequate, ineffective, or missing control actions necessary to enforce the safety constraints and ensure safety can stem from flaws in each of these parts. For human
controllers and actuators, context is also an important factor.
Unsafe Inputs .(① in figure 4.8).
Each controller in the hierarchical control structure is itself controlled by higher-level controllers. The control actions and other information provided by the higher
level and required for safe behavior may be missing or wrong. Using the Black Hawk
friendly fire example again, the F-15 pilots patrolling the no-fly zone were given
instructions to switch to a non-jammed radio mode for a list of aircraft types that
did not have the ability to interpret jammed broadcasts. Black Hawk helicopters
had not been upgraded with new anti-jamming technology but were omitted from
the list and so could not hear the F-15 radio broadcasts. Other types of missing or
wrong noncontrol inputs may also affect the operation of the controller.
Unsafe Control Algorithms .(② in figure 4.8).
Algorithms in this sense are both the procedures designed by engineers for hardware controllers and the procedures that human controllers use. Control algorithms
may not enforce safety constraints because the algorithms are inadequately designed
originally, the process may change and the algorithms become unsafe, or the control
algorithms may be inadequately modified by maintainers if the algorithms are automated or through various types of natural adaptation if they are implemented by
humans. Human control algorithms are affected by initial training, by the procedures
provided to the operators to follow, and by feedback and experimentation over time
(see figure 2.9).
Time delays are an important consideration in designing control algorithms. Any
control loop includes time lags, such as the time between the measurement of
process parameters and receiving those measurements or between issuing a
command and the time the process state actually changes. For example, pilot
response delays are important time lags that must be considered in designing the
control function for TCAS or other aircraft systems, as are time lags in the controlled process, such as the aircraft trajectory, caused by aircraft performance limitations.
Delays may not be directly observable, but may need to be inferred. Depending
on where in the feedback loop the delay occurs, different control algorithms are
required to cope with the delays: dead time and time constants require an
algorithm that makes it possible to predict when an action is needed before the
need. Feedback delays generate requirements to predict when a prior control action
has taken effect and when resources will be available again. Such requirements may
impose the need for some type of open loop or feedforward strategy to cope with
delays. When time delays are not adequately considered in the control algorithm,
accidents can result.
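The effect of dead time can be illustrated with a toy simulation. The sketch below assumes a deliberately simple process (a level that rises by one unit for each effective command) and a fixed dead time between issuing a command and its effect; the function and parameter names are hypothetical. A controller that reacts only to the current measurement overshoots the target by the number of commands still in the pipeline, while one whose process model accounts for commands already issued does not.

```python
from collections import deque


def simulate(target: int, dead_time: int, predictive: bool, steps: int = 50) -> int:
    """Return the peak level reached by a process whose commands take
    `dead_time` steps to have any effect."""
    level = 0
    in_flight = deque([0] * dead_time)  # commands issued but not yet effective
    peak = 0
    for _ in range(steps):
        # The predictive controller's model includes the pending commands;
        # the naive controller's model contains only the current level.
        expected = level + sum(in_flight) if predictive else level
        command = 1 if expected < target else 0
        # The new command enters the delay line; the oldest one takes effect.
        in_flight.append(command)
        level += in_flight.popleft()
        peak = max(peak, level)
    return peak
```

With a target of 10 and a dead time of 3 steps, the naive controller peaks at 13 and the predictive one at 10. The predictive controller's model of pending commands plays the role of the feedforward strategy mentioned above.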
Leplat has noted that many accidents relate to asynchronous evolution,
where one part of a system .(in this case the hierarchical safety control structure)
changes without the related necessary changes in other parts. Changes to subsystems
may be carefully designed, but consideration of their effects on other parts of the
system, including the safety control aspects, may be neglected or inadequate. Asynchronous evolution may also occur when one part of a properly designed system
deteriorates.
In both these cases, the erroneous expectations of users or system components
about the behavior of the changed or degraded subsystem may lead to accidents.
The Ariane 5 trajectory changed from that of the Ariane 4, but the inertial reference
system software was not changed. As a result, an assumption of the inertial reference
software was violated and the spacecraft was lost shortly after launch. One factor
in the loss of contact with SOHO .(SOlar Heliospheric Observatory), a scientific
spacecraft, in 1998 was the failure to communicate to operators that a functional
change had been made in a procedure to perform gyro spin down. The Black Hawk
friendly fire accident .(analyzed in chapter 5). had several examples of asynchronous
evolution; for example, the mission changed and an individual key to communication
between the Air Force and Army left, leaving the safety control structure without
an important component.
Communication is a critical factor here as well as monitoring for changes that
may occur and feeding back this information to the higher-level control. For example,
the safety analysis process that generates constraints always involves some basic
assumptions about the operating environment of the process. When the environment changes such that those assumptions are no longer true, as in the Ariane 5 and
SOHO examples, the controls in place may become inadequate. Embedded pacemakers provide another example. These devices were originally assumed to be used
only in adults, who would lie quietly in the doctor's office while the pacemaker was
being “programmed.” Later these devices began to be used in children, and the
assumptions under which the hazard analysis was conducted and the controls were
designed no longer held and needed to be revisited. A requirement for effective
updating of the control algorithms is that the assumptions of the original .(and subsequent). analysis are recorded and retrievable.
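What "recorded and retrievable" might look like in practice can be sketched as a small assumption log that links each recorded assumption to the controls that depend on it. Everything here is hypothetical: the class names, the identifiers, and the example control names are illustrations only, not part of any established tool.

```python
from dataclasses import dataclass, field


@dataclass
class Assumption:
    ident: str
    statement: str
    controls: list  # names of controls whose adequacy rests on this assumption


@dataclass
class AssumptionLog:
    entries: dict = field(default_factory=dict)

    def record(self, assumption: Assumption) -> None:
        self.entries[assumption.ident] = assumption

    def controls_to_revisit(self, invalidated_ids) -> list:
        """Controls that depend on any assumption that no longer holds."""
        affected = []
        for ident in invalidated_ids:
            for control in self.entries[ident].controls:
                if control not in affected:
                    affected.append(control)
        return affected


# Illustrative entry based on the pacemaker example; control names are invented.
log = AssumptionLog()
log.record(Assumption(
    ident="A1",
    statement="Patients are adults who lie quietly during programming",
    controls=["programming procedure", "programming-session hazard analysis"],
))
```

When the devices begin to be used in children, log.controls_to_revisit(["A1"]) returns the controls whose hazard analysis has to be reopened.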
Inconsistent, Incomplete, or Incorrect Process Models .(③ in figure 4.8)
Section 4.3 stated that effective control is based on a model of the process state.
Accidents, particularly component interaction accidents, most often result from
inconsistencies between the models of the process used by the controllers .(both
human and automated). and the actual process state. When the controller's model of
the process .(either the human mental model or the software or hardware model)
diverges from the process state, erroneous control commands .(based on the incorrect model). can lead to an accident: for example, .(1). the software does not know that
the plane is on the ground and raises the landing gear, or .(2). the controller .(automated or human). does not identify an object as friendly and shoots a missile at it, or
(3). the pilot thinks the aircraft controls are in speed mode but the computer has
changed the mode to open descent and the pilot behaves inappropriately for that
mode, or .(4). the computer does not think the aircraft has landed and overrides the
pilot's attempts to operate the braking system. All of these examples have actually
occurred.
The mental models of the system developers are also important. During software
development, for example, the programmers' models of required behavior may not
match the engineers' models .(commonly referred to as a software requirements
error), or the software may be executed on computer hardware or may control
physical systems during operations that differ from what was assumed by the programmer and used during testing. The situation becomes even more complicated
when there are multiple controllers .(both human and automated). because each of
their process models must also be kept consistent.
The most common form of inconsistency occurs when one or more process
models is incomplete in terms of not defining appropriate behavior for all possible
process states or all possible disturbances, including unhandled or incorrectly
handled component failures. Of course, no models are complete in the absolute
sense. The goal is to make them complete enough that no safety constraints are
violated when they are used. Criteria for completeness in this sense are presented
in Safeware, and completeness analysis is integrated into the new hazard analysis
method as described in chapter 9.
How does the process model become inconsistent with the actual process state?
The process model designed into the system .(or provided by training if the controller is human). may be wrong from the beginning, there may be missing or incorrect
feedback for updating the process model as the controlled process changes state,
the process model may be updated incorrectly .(an error in the algorithm of the
controller), or time lags may not be accounted for. The result can be uncontrolled
disturbances, unhandled process states, inadvertent commanding of the system into
a hazardous state, unhandled or incorrectly handled controlled process component
failures, and so forth.
Feedback is critically important to the safe operation of the controller. A basic
principle of system theory is that no control system will perform better than its
measuring channel. Feedback may be missing or inadequate because such feedback
is not included in the system design, flaws exist in the monitoring or feedback
communication channel, the feedback is not timely, or the measuring instrument
operates inadequately.
A contributing factor cited in the Cali B757 accident report, for example, was the
omission of the waypoints behind the aircraft from cockpit displays, which contributed to the crew not realizing that the waypoint for which they were searching was
behind them .(missing feedback). The model of the Ariane 501 attitude used by the
attitude control software became inconsistent with the launcher attitude when an
error message sent by the inertial reference system was interpreted by the attitude
control system as data .(incorrect processing of feedback), causing the spacecraft
onboard computer to issue an incorrect and unsafe command to the booster and
main engine nozzles.
Other reasons for the process models to diverge from the true system state may
be more subtle. Information about the process state has to be inferred from measurements. For example, in the TCAS II aircraft collision avoidance system, relative
range positions of other aircraft are computed based on round-trip message propagation time. The theoretical control function .(control law). uses the true values of
the controlled variables or component states .(e.g., true aircraft positions). However,
at any time, the controller has only measured values, which may be subject to time
lags or inaccuracies. The controller must use these measured values to infer the true
conditions in the process and, if necessary, to derive corrective actions to maintain
the required process state. In the TCAS example, sensors include on-board devices
such as altimeters that provide measured altitude .(not necessarily true altitude). and
antennas for communicating with other aircraft. The primary TCAS actuator is the
pilot, who may or may not respond to system advisories. The mapping between the
measured or assumed values and the true values can be flawed.
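The gap between measured and true values can be made concrete with the round-trip calculation mentioned above. The sketch below is not the actual TCAS algorithm; it is the generic time-of-flight relation, with an assumed fixed reply delay as a hypothetical parameter. It shows how a small error in the measured time becomes an error in the controller's model of the other aircraft's position.

```python
C = 299_792_458.0  # speed of light in m/s


def inferred_range_m(round_trip_s: float, reply_delay_s: float = 0.0) -> float:
    """Range inferred from a measured round-trip time.

    reply_delay_s is an assumed fixed turnaround delay in the replying
    transponder (a hypothetical parameter for this sketch).
    """
    return C * (round_trip_s - reply_delay_s) / 2.0


def range_error_m(timing_error_s: float) -> float:
    """Range error caused by an error in the measured round-trip time."""
    return C * timing_error_s / 2.0
```

A timing error of one microsecond corresponds to a range error of about 150 meters, so the inferred position is only as good as the measurement and the assumptions built into the mapping.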
To summarize, process models can be incorrect from the beginning, where
correct is defined in terms of consistency with the current process state and with
the models being used by other controllers, or they can become incorrect due to
erroneous or missing feedback or measurement inaccuracies. They may also be
incorrect only for short periods of time due to time lags in the process loop.
section 4.5.2. Actuators and Controlled Processes .(④ in figure 4.8).
The factors discussed so far have involved inadequate control. The other case occurs
when the control commands maintain the safety constraints, but the controlled
process may not implement these commands. One reason might be a failure or flaw
in the reference channel, that is, in the transmission of control commands. Another
reason might be an actuator or controlled component fault or failure. A third is that
the safety of the controlled process may depend on inputs from other system components, such as power, for the execution of the control actions provided. If these
process inputs are missing or inadequate in some way, the controlled process may
be unable to execute the control commands and accidents may result. Finally, there
may be external disturbances that are not handled by the controller.
In a hierarchical control structure, the actuators and controlled process may
themselves be a controller of a lower-level process. In this case, the flaws in executing the control are the same as those described earlier for a controller.
Once again, these types of flaws do not simply apply to operations or to the
technical system but also to system design and development. For example, a common
flaw in system development is that the safety information gathered or created by
the system safety engineers .(the hazards and the necessary design constraints to
control them). is inadequately communicated to the system designers and testers, or
that flaws exist in the use of this information in the system development process.
section 4.5.3. Coordination and Communication among Controllers and Decision Makers.
When there are multiple controllers .(human and/or automated), control actions
may be inadequately coordinated, including unexpected side effects of decisions
or actions or conflicting control actions. Communication flaws play an important
role here.
Leplat suggests that accidents are most likely in overlap areas or in boundary
areas or where two or more controllers .(human or automated). control the same
process or processes with common boundaries .(figure 4.9). In both boundary
and overlap areas, the potential exists for ambiguity and for conflicts among
independent decisions.
Responsibility for the control functions in boundary areas is often poorly defined.
For example, Leplat cites an iron and steel plant where frequent accidents occurred
at the boundary of the blast furnace department and the transport department. One
conflict arose when a signal informing transport workers of the state of the blast
furnace did not work and was not repaired because each department was waiting
for the other to fix it. Faverge suggests that such dysfunction can be related to the
number of management levels separating the workers in the departments from a
common manager. The greater the distance, the more difficult the communication,
and thus the greater the uncertainty and risk.
Coordination problems in the control of boundary areas are rife. As mentioned
earlier, a Milstar satellite was lost due to inadequate attitude control of the Titan/
Centaur launch vehicle, which used an incorrect process model based on erroneous
inputs on a software load tape. After the accident, it was discovered that nobody
had tested the software using the actual load tape: each group involved in testing
and assurance had assumed some other group was doing so. In the system development process, system engineering and mission assurance activities were missing or
ineffective, and a common control or management function was quite distant from
the individual development and assurance groups .(see appendix B). One factor
in the loss of the Black Hawk helicopters to friendly fire over northern Iraq was
that the helicopters normally flew only in the boundary areas of the no-fly zone and
procedures for handling aircraft in those areas were ill defined. Another factor was
that an Army base controlled the flights of the Black Hawks, while an Air Force
base controlled all the other components of the airspace. A common control point
once again was high above where the accident occurred in the control structure. In
addition, communication problems existed between the Army and Air Force bases
at the intermediate control levels.
Overlap areas exist when a function is achieved by the cooperation of two controllers or when two controllers exert influence on the same object. Such overlap
creates the potential for conflicting control actions .(dysfunctional interactions
among control actions). Leplat cites a study of the steel industry that found 67
percent of technical incidents with material damage occurred in areas of co-activity,
although these represented only a small percentage of the total activity areas. In an
A320 accident in Bangalore, India, the pilot had disconnected his flight director
during approach and assumed that the copilot would do the same. The result would
have been a mode configuration in which airspeed is automatically controlled by
the autothrottle .(the speed mode), which is the recommended procedure for the
approach phase. However, the copilot had not turned off his flight director, which
meant that open descent mode became active when a lower altitude was selected
instead of speed mode, eventually contributing to the crash of the aircraft short of
the runway. In the Black Hawk shootdown by friendly fire, the aircraft surveillance officer .(ASO). thought she was responsible only for identifying and tracking aircraft south of the 36th Parallel, while the air traffic controller for the area
north of the 36th Parallel thought the ASO was also tracking and identifying aircraft
in his area and acted accordingly.
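Mismatched beliefs about who is responsible for what can be checked mechanically once they are written down. The sketch below is a simplified reading of the shootdown example, with hypothetical names throughout: each controller lists the regions in which it believes it performs a given function, and the check reports regions nobody covers (gaps) and regions more than one controller covers (overlaps).

```python
def coverage_report(regions, believed_responsibility):
    """Compare regions against what each controller believes it covers.

    regions: iterable of region names.
    believed_responsibility: dict mapping controller name -> set of regions
        in which that controller believes it performs the function.
    Returns (gaps, overlaps): a list of uncovered regions and a dict mapping
    each multiply covered region to the sorted list of its controllers.
    """
    gaps, overlaps = [], {}
    for region in regions:
        owners = sorted(c for c, covered in believed_responsibility.items()
                        if region in covered)
        if not owners:
            gaps.append(region)
        elif len(owners) > 1:
            overlaps[region] = owners
    return gaps, overlaps


# Tracking and identification, as each party believed it to be assigned.
beliefs = {
    "ASO": {"south of 36th Parallel"},
    "northern controller": set(),  # believed the ASO covered his area too
}
gaps, overlaps = coverage_report(
    ["north of 36th Parallel", "south of 36th Parallel"], beliefs)
```

Here gaps contains "north of 36th Parallel": each controller's process model assigned the function there to someone else, so nobody performed it.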
In 2002, two aircraft collided over southern Germany. An important factor in the
accident was the lack of coordination between the airborne TCAS .(collision avoidance). system and the ground air traffic controller. They each gave different and
conflicting advisories on how to avoid a collision. If both pilots had followed one
or the other, the loss would have been avoided, but one followed the TCAS advisory
and the other followed the ground air traffic control advisory.
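The logic of this kind of coordination failure can be shown with a small sketch. The aircraft labels and the specific advisories below are illustrative assumptions chosen to reproduce the pattern described, not data from the accident report: each source of advice is safe on its own, but mixing them can leave both aircraft making the same vertical maneuver.

```python
# Hypothetical advisories from two uncoordinated controllers to aircraft A and B.
ADVISORIES = {
    "TCAS": {"A": "climb", "B": "descend"},
    "ATC": {"A": "descend", "B": "maintain"},
}


def maneuvers(source_followed):
    """Map each aircraft to the maneuver given by the source its crew follows."""
    return {aircraft: ADVISORIES[source][aircraft]
            for aircraft, source in source_followed.items()}


def separated(source_followed) -> bool:
    """In this toy model, separation is achieved only if the two aircraft
    make different vertical maneuvers."""
    chosen = maneuvers(source_followed)
    return chosen["A"] != chosen["B"]
```

With these assumed advisories, separated({"A": "TCAS", "B": "TCAS"}) and separated({"A": "ATC", "B": "ATC"}) are both True, while separated({"A": "ATC", "B": "TCAS"}) is False because both aircraft descend.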
section 4.5.4. Context and Environment.
Flawed human decision making can result from incorrect information and inaccurate process models, as described earlier. But human behavior is also greatly
impacted by the context and environment in which the human is working. These
factors have been called “behavior shaping mechanisms.” While value systems and
other influences on decision making can be considered to be inputs to the controller,
describing them in this way oversimplifies their role and origin. A classification of
the contextual and behavior-shaping mechanisms is premature at this point, but
relevant principles and heuristics are elucidated throughout the rest of the book.
section 4.6.
Applying the New Model.
To summarize, STAMP focuses particular attention on the role of constraints in
safety management. Accidents are seen as resulting from inadequate control or
enforcement of constraints on safety-related behavior at each level of the system
development and system operations control structures. Accidents can be understood
in terms of why the controls that were in place did not prevent or detect maladaptive changes.
Accident causal analysis based on STAMP starts with identifying the safety constraints that were violated and then determines why the controls designed to enforce
the safety constraints were inadequate or, if they were potentially adequate, why
the system was unable to exert appropriate control over their enforcement.
In this conception of safety, there is no “root cause.” Instead, the accident “cause”
consists of an inadequate safety control structure that under some circumstances
leads to the violation of a behavioral safety constraint. Preventing future accidents
requires reengineering or designing the safety control structure to be more effective.
Because the safety control structure and the behavior of the individuals in it, like
any physical or social system, change over time, accidents must be viewed as
dynamic processes. Looking only at the time of the proximal loss events distorts and
omits from view the most important aspects of the larger accident process that are
needed to prevent reoccurrences of losses from the same causes in the future.
Without that view, we see and fix only the symptoms, that is, the results of the flawed
processes and inadequate safety control structure without getting to the sources of
those symptoms.
To understand the dynamic aspects of accidents, the process leading to the loss
can be viewed as an adaptive feedback function where the safety control system
performance degrades over time as the system attempts to meet a complex set of
goals and values. Adaptation is critical in understanding accidents, and the adaptive
feedback mechanism inherent in the model allows a STAMP analysis to incorporate
adaptation as a fundamental system property.
We have found in practice that using this model helps us to separate factual
data from the interpretations of that data. While the events and physical data
involved in accidents may be clear, their importance and the explanations for why
the factors were present are often subjective, as is the selection of the events to
consider.
STAMP models are also more complete than most accident reports and other
models. Each of the explanations for the incorrect
FMS input of R in the Cali American Airlines accident described in chapter 2, for
example, appears in the STAMP analysis of that accident at the appropriate levels
of the control structure where they operated. The use of STAMP helps not only to
identify the factors but also to understand the relationships among them.
While STAMP models will probably not be useful in lawsuits, as they do not
assign blame for the accident to a specific person or group, they do provide more
help in understanding accidents by forcing examination of each part of the sociotechnical system to see how it contributed to the loss, and there will usually be
contributions at each level. Such understanding should help in learning how to
engineer safer systems, including the technical, managerial, organizational, and regulatory aspects.
To accomplish this goal, a framework for classifying the factors that lead to accidents was derived from the basic underlying conceptual accident model (see figure
4.8). This classification can be used in identifying the factors involved in a particular
accident and in understanding their role in the process leading to the loss. The accident investigation after the Black Hawk shootdown (analyzed in detail in the next
chapter) identified 130 different factors involved in the accident. In the end, only
the AWACS senior director was court-martialed, and he was acquitted. The more
one knows about an accident process, the more difficult it is to find one person or
part of the system responsible, but the easier it is to find effective ways to prevent
similar occurrences in the future.
STAMP is useful not only in analyzing accidents that have occurred but also in developing new and potentially more effective system engineering methodologies to
prevent accidents. Hazard analysis can be thought of as investigating an accident
before it occurs. Traditional hazard analysis techniques, such as fault tree analysis
and various types of failure analysis techniques, do not work well for very complex
systems, for software errors, human errors, and system design errors. Nor do they
usually include organizational and management flaws. The problem is that these
hazard analysis techniques are limited by a focus on failure events and the role of
component failures in accidents; they do not account for component interaction
accidents, the complex roles that software and humans are assuming in high-tech
systems, the organizational factors in accidents, and the indirect relationships
between events and actions required to understand why accidents occur.
STAMP provides a direction to take in creating these new hazard analysis and
prevention techniques. Because in a system accident model everything starts from
constraints, the new approach focuses on identifying the constraints required to
maintain safety; identifying the flaws in the control structure that can lead to an
accident (inadequate enforcement of the safety constraints); and then designing
a control structure, physical system, and operating conditions that enforce the
constraints.
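As a rough illustration of the second step (looking for flaws in the control structure before any loss occurs), the sketch below checks each safety constraint for three simple structural gaps: no responsible controller, no control action, or no feedback channel. The Controller fields, the constraint identifiers, and the example controllers are hypothetical simplifications. A real hazard analysis would examine far more than these three conditions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Controller:
    name: str
    enforces: FrozenSet[str]   # ids of the constraints this controller is responsible for
    has_control_action: bool   # can it act on the controlled process?
    has_feedback: bool         # does it receive feedback about the process state?


def find_control_flaws(constraint_ids: List[str],
                       controllers: List[Controller]) -> List[Tuple[str, str]]:
    """Return (constraint id, flaw description) pairs for structural gaps."""
    flaws = []
    for cid in constraint_ids:
        owners = [c for c in controllers if cid in c.enforces]
        if not owners:
            flaws.append((cid, "no controller is responsible"))
        for c in owners:
            if not c.has_control_action:
                flaws.append((cid, f"{c.name} has no control action"))
            if not c.has_feedback:
                flaws.append((cid, f"{c.name} receives no feedback"))
    return flaws


# Invented example: three constraints, two controllers.
controllers = [
    Controller("operator", frozenset({"SC1"}), True, False),
    Controller("maintenance group", frozenset({"SC2"}), True, True),
]
flaws = find_control_flaws(["SC1", "SC2", "SC3"], controllers)
```

In this invented example the check reports that the operator has no feedback for SC1 and that nobody is responsible for SC3. Each finding points to a design change in the control structure rather than to added redundancy.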
Such hazard analysis techniques augment the typical failure-based design focus
and encourage a wider variety of risk reduction measures than simply adding redundancy and overdesign to deal with component failures. The new techniques also
provide a way to implement safety-guided design so that safety analysis guides the
design generation rather than waiting until a design is complete to discover it is
unsafe. Part 3 describes ways to use techniques based on STAMP to prevent accidents through system design, including design of the operating conditions and the
safety management control structure.
STAMP can also be used to improve performance analysis. Performance monitoring of complex systems has created some dilemmas. Computers allow the collection
of massive amounts of data, but analyzing that data to determine whether the system
is moving toward the boundaries of safe behavior is difficult. The use of an accident
model based on systems theory and the basic concept of safety constraints may
provide directions for identifying appropriate safety metrics and leading indicators;
determining whether control over the safety constraints is adequate; evaluating the
assumptions about the technical failures and potential design errors, organizational
structure, and human behavior underlying the hazard analysis; detecting errors in
the operational and environmental assumptions underlying the design and the organizational culture; and identifying any maladaptive changes over time that could
increase risk of accidents to unacceptable levels.
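One very simple way to picture a leading indicator is as a trend check on a measured safety margin. The sketch below fits a least-squares line to equally spaced margin samples and estimates how many further sampling intervals remain before the margin reaches the boundary of safe behavior. Everything here is an assumption made for illustration: the function name, the idea that a single scalar margin is available, the convention that larger values are safer, and the linear trend. None of it is a metric prescribed by STAMP.

```python
def steps_until_boundary(margins, boundary=0.0):
    """Estimate sampling intervals left before the margin reaches `boundary`.

    `margins` holds equally spaced samples of a hypothetical safety margin,
    where larger values are safer. Returns None if the fitted trend is flat
    or moving away from the boundary.
    """
    n = len(margins)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")

    # Ordinary least-squares fit of margin against sample index 0..n-1.
    mean_t = (n - 1) / 2
    mean_m = sum(margins) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in enumerate(margins))
    den = sum((t - mean_t) ** 2 for t in range(n))
    slope = num / den
    if slope >= 0:
        return None  # not drifting toward the boundary

    intercept = mean_m - slope * mean_t
    t_cross = (boundary - intercept) / slope   # index where the fitted line hits the boundary
    return max(0.0, t_cross - (n - 1))         # intervals remaining after the last sample


remaining = steps_until_boundary([10.0, 8.0, 6.0, 4.0])
```

For the sample values shown, the fitted line loses two units per interval, so the estimate is two more intervals. A check like this only detects drift in what is being measured; choosing what to measure is the harder problem that the safety constraints are meant to guide.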
Finally, STAMP points the way to very different approaches to risk assessment.
Currently, risk assessment is firmly rooted in the probabilistic analysis of failure
events. Attempts to extend current PRA techniques to software and other new
technology, to management, and to cognitively complex human control activities
have been disappointing. This way forward may lead to a dead end. Significant
progress in risk assessment for complex systems will require innovative approaches
starting from a completely different theoretical foundation.