
chore: add more chapters and replacements

xuu 2025-03-15 19:07:36 -06:00
parent c1804744bf
commit ff069b52c4
Signed by: xuu
GPG Key ID: 8B3B0604F164E04F
7 changed files with 2770 additions and 26 deletions

1
.gitignore vendored

@@ -1,4 +1,5 @@
piper/
*.wav
*.ogg
*.mp3
*.onnx*

@@ -2,20 +2,20 @@
PATH:=./piper:$(PATH)
WAV_FILES := $(patsubst %.txt,%.wav,$(wildcard *.txt))
OGG_FILES := $(patsubst %.txt,%.ogg,$(wildcard *.txt))
MP3_FILES := $(patsubst %.txt,%.mp3,$(wildcard *.txt))
MODEL=en_GB-alan-medium.onnx
CONFIG=en_GB-alan-medium.onnx.json
complete: $(OGG_FILES)
complete: $(MP3_FILES)
	echo $@ $^
$(WAV_FILES): %.wav: %.txt
	cat $^ | piper -m $(MODEL) -c $(CONFIG) -f $@
$(OGG_FILES): %.ogg: %.wav
	ffmpeg -i $^ $@
$(MP3_FILES): %.mp3: %.wav
	ffmpeg -y -i $^ $@
install:

@@ -311,14 +311,14 @@ started to slow down as the most obvious hazards were eliminated. The emphasis
then shifted to unsafe acts. Accidents began to be regarded as someones fault rather
than as an event that could have been prevented by some change in the plant
or product.
Heinrichs Domino Model, published in 1931, was one of the first published
Heinrichs Domino Model, published in 19 31, was one of the first published
general accident models and was very influential in shifting the emphasis in safety
to human error. Heinrich compared the general sequence of accidents to five domi
noes standing on end in a line (figure 2 3). When the first domino falls, it automati
cally knocks down its neighbor and so on until the injury occurs. In any accident
sequence, according to this model, ancestry or social environment leads to a fault
of a person, which is the proximate reason for an unsafe act or condition (mechani
cal or physical), which results in an accident, which leads to an injury. In 1976, Bird
cal or physical), which results in an accident, which leads to an injury. In 19 76, Bird
and Loftus extended the basic Domino Model to include management decisions as
a factor in accidents.
1. Lack of control by management, permitting.
@@ -439,7 +439,7 @@ able as the identified cause. Other events or explanations may be excluded or no
examined in depth because they raise issues that are embarrassing to the organiza
tion or its contractors or are politically unacceptable.
The accident report on a friendly fire shootdown of a U.S. Army helicopter over
the Iraqi nofly zone in 1994, for example, describes the chain of events leading to
the Iraqi nofly zone in 19 94, for example, describes the chain of events leading to
the shootdown. Included in these events is the fact that the helicopter pilots did not
change to the radio frequency required in the nofly zone when they entered it (they
stayed on the enroute frequency). Stopping at this event in the chain (which the
@@ -459,14 +459,14 @@ more basis for this distinction than the selection of a root cause.
Making such distinctions between causes or limiting the factors considered
can be a hindrance in learning from and preventing future accidents. Consider the
following aircraft examples.
In the crash of an American Airlines D C 10 at Chicagos OHare Airport in 1979,
In the crash of an American Airlines D C 10 at Chicagos OHare Airport in 19 79,
the U.S. National Transportation Safety Board (N T S B) blamed only a “mainte
nanceinduced crack,” and not also a design error that allowed the slats to retract
if the wing was punctured. Because of this omission, McDonnell Douglas was not
required to change the design, leading to future accidents related to the same design
flaw.
Similar omissions of causal factors in aircraft accidents have occurred more
recently. One example is the crash of a China Airlines A300 on April 26, 1994, while
recently. One example is the crash of a China Airlines A300 on April 26, 19 94, while
approaching the Nagoya, Japan, airport. One of the factors involved in the accident
was the design of the flight control computer software. Previous incidents with the
same type of aircraft had led to a Service Bulletin being issued for a modification
@@ -480,7 +480,7 @@ that delay, 264 passengers and crew died.
In another D C 10 saga, explosive decompression played a critical role in a near
miss over Windsor, Ontario. An American Airlines D C 10 lost part of its passenger
floor, and thus all of the control cables that ran through it, when a cargo door opened
in flight in June 1972. Thanks to the extraordinary skill and poise of the pilot, Bryce
in flight in June 19 72. Thanks to the extraordinary skill and poise of the pilot, Bryce
McCormick, the plane landed safely. In a remarkable coincidence, McCormick had
trained himself to fly the plane using only the engines because he had been con
cerned about a decompressioncaused collapse of the floor. After this close call,
@@ -545,14 +545,14 @@ exceptional case when every life was saved through a combination of crew skill a
sheer luck that the plane was so lightly loaded. If there had been more passengers and
thus more weight, damage to the control cables would undoubtedly have been more
severe, and it is highly questionable if any amount of skill could have saved the plane .
Almost two years later, in March 1974, a fully loaded Turkish Airlines D C 10 crashed
Almost two years later, in March 19 74, a fully loaded Turkish Airlines D C 10 crashed
near Paris, resulting in 346 deaths.one of the worst accidents in aviation history.
Once again, the cargo door had opened in flight, causing the cabin floor to collapse,
severing the flight control cables. Immediately after the accident, Sanford McDon
nell stated the official McDonnellDouglas position that once again placed the
blame on the baggage handler and the ground crew. This time, however, the FAA
finally ordered modifications to all D C 10s that eliminated the hazard. In addition,
an FAA regulation issued in July 1975 required all widebodied jets to be able to
an FAA regulation issued in July 19 75 required all widebodied jets to be able to
tolerate a hole in the fuselage of twenty square feet. By labeling the root cause in
the event chain as baggage handler error and attempting only to eliminate that event
or link in the chain rather than the basic engineering design flaws, fixes that could
@@ -575,7 +575,7 @@ different types of links according to the mental representations the analyst has
the production of this event. When several types of rules are possible, the analyst
will apply those that agree with his or her mental model of the situation .
Consider, for example, the loss of an American Airlines B757 near Cali,
Colombia, in 1995 . Two significant events in this loss were
Colombia, in 19 95 . Two significant events in this loss were
(1.) Pilot asks for clearance to take the R O Z O. approach
followed later by
(2.) Pilot types R into the F M S. 5.
@@ -630,7 +630,7 @@ often laid years before. One event simply triggers the loss, but if that event h
happened, another one would have led to a loss. The Bhopal disaster provides a
good example.
The release of methyl isocyanate. (M I C.) from the Union Carbide chemical plant
in Bhopal, India, in December 1984 has been called the worst industrial accident
in Bhopal, India, in December 19 84 has been called the worst industrial accident
in history. Conservative estimates point to 2,000 fatalities, 10,000 permanent dis
abilities (including blindness), and 200,000 injuries . The Indian government
blamed the accident on human error.the improper cleaning of a pipe at the plant.
@@ -733,7 +733,7 @@ their face and closing their eyes. If the community had been alerted and provide
with this simple information, many (if not most) lives would have been saved and
injuries prevented .
Some of the reasons why the poor conditions in the plant were allowed to persist
are financial. Demand for M I C had dropped sharply after 1981, leading to reduc
are financial. Demand for M I C had dropped sharply after 19 81, leading to reduc
tions in production and pressure on the company to cut costs. The plant was operat
ing at less than half capacity when the accident occurred. Union Carbide put pressure
on the Indian management to reduce losses, but gave no specific details on how
@@ -776,7 +776,7 @@ time and without any particular single decision to do so but simply as a series
decisions that moved the plant slowly toward a situation where any slight error
would lead to a major accident. Given the overall state of the Bhopal Union Carbide
plant and its operation, if the action of inserting the slip disk had not been left out
of the pipe washing operation that December day in 1984, something else would
of the pipe washing operation that December day in 19 84, something else would
have triggered an accident. In fact, a similar leak had occurred the year before, but
did not have the same catastrophic consequences and the true root causes of that
incident were neither identified nor fixed.
@@ -822,7 +822,7 @@ Without understanding the purpose, goals, and decision criteria used to construc
and operate systems, it is not possible to completely understand and most effectively
prevent accidents.
Awareness of the importance of social and organizational aspects of safety goes
back to the early days of System Safety.7 In 1968, Jerome Lederer, then the director
back to the early days of System Safety.7 In 19 68, Jerome Lederer, then the director
of the NASA Manned Flight Safety Program for Apollo, wrote.
System safety covers the total spectrum of risk management. It goes beyond the hardware
and associated procedures of system safety engineering. It involves. attitudes and motiva
@@ -876,7 +876,7 @@ be evaluated? Was a maintenance plan provided before startup? Was all relevant
information provided to planners and managers? Was it used? Was concern for
safety displayed by vigorous, visible personal action by top executives? And so forth.
Johnson originally provided hundreds of such questions, and additions have been
made to his checklist since Johnson created it in the 1970s so it is now even larger.
made to his checklist since Johnson created it in the 19 70s so it is now even larger.
The use of the MORT checklist is feasible because the items are so general, but that
same generality also limits its usefulness. Something more effective than checklists
is needed.
@@ -1090,9 +1090,9 @@ rate has dropped by 35 per cent.
section 2 4 1. Do Operators Cause Most Accidents?
The tendency to blame the operator is not simply a nineteenth century problem,
but persists today. During and after World War 2, the Air Force had serious prob
lems with aircraft accidents. From 1952 to 1966, for example, 7,715 aircraft were lost
lems with aircraft accidents. From 19 52 to 19 66, for example, 7,715 aircraft were lost
and 8,547 people killed .. Most of these accidents were blamed on pilots. Some
aerospace engineers in the 1950s did not believe the cause was so simple and
aerospace engineers in the 19 50s did not believe the cause was so simple and
argued that safety must be designed and built into aircraft just as are performance,
stability, and structural integrity. Although a few seminars were conducted and
papers written about this approach, the Air Force did not take it seriously until

387
chapter03.txt Normal file

@@ -0,0 +1,387 @@
chapter 3.
Systems Theory and Its Relationship to Safety.
To achieve the goals set at the end of the last chapter, a new theoretical underpinning is needed for system safety. Systems theory provides that foundation. This
chapter introduces some basic concepts in systems theory, how this theory is reflected
in system engineering, and how all of this relates to system safety.
section 3 1.
An Introduction to Systems Theory.
Systems theory dates from the 19 30s and 19 40s and was a response to limitations of
the classic analysis techniques in coping with the increasingly complex systems starting to be built at that time . Norbert Wiener applied the approach to control
and communications engineering , while Ludwig von Bertalanffy developed
similar ideas for biology . Bertalanffy suggested that the emerging ideas in
various fields could be combined into a general theory of systems.
In the traditional scientific method, sometimes referred to as divide and conquer,
systems are broken into distinct parts so that the parts can be examined separately.
Physical aspects of systems are decomposed into separate physical components,
while behavior is decomposed into discrete events over time.
This decomposition .(formally called analytic reduction).assumes that the separation
is feasible. that is, each component or subsystem operates independently, and analysis results are not distorted when these components are considered separately. This
assumption in turn implies that the components or events are not subject to feedback loops and other nonlinear interactions and that the behavior of the components is the same when examined singly as when they are playing their part in the
whole. A third fundamental assumption is that the principles governing the assembling of the components into the whole are straightforward, that is, the interactions
among the subsystems are simple enough that they can be considered separate from
the behavior of the subsystems themselves.
These are reasonable assumptions, it turns out, for many of the physical
regularities of the universe. System theorists have described these systems as
displaying organized simplicity .(figure 3 1.).. Such systems can be separated
into non-interacting subsystems for analysis purposes. the precise nature of the
component interactions is known and interactions can be examined pairwise. Analytic reduction has been highly effective in physics and is embodied in structural
mechanics.
Other types of systems display what systems theorists have labeled unorganized
complexity.that is, they lack the underlying structure that allows reductionism to
be effective. They can, however, often be treated as aggregates. They are complex,
but regular and random enough in their behavior that they can be studied statistically. This study is simplified by treating them as a structureless mass with interchangeable parts and then describing them in terms of averages. The basis of this
approach is the law of large numbers. The larger the population, the more likely that
observed values are close to the predicted average values. In physics, this approach
is embodied in statistical mechanics.
A third type of system lies between these two, displaying what systems theorists have called organized complexity. These systems are too complex for complete analysis and too organized for statistics;
the averages are deranged by the underlying structure . Many of the complex
engineered systems of the postWorld War 2 era, as well as biological systems and
social systems, fit into this category. Organized complexity also represents particularly well the problems that are faced by those attempting to build complex software,
and it explains the difficulty computer scientists have had in attempting to apply
analysis and statistics to software.
Systems theory was developed for this third type of system. The systems approach
focuses on systems taken as a whole, not on the parts taken separately. It assumes
that some properties of systems can be treated adequately only in their entirety,
taking into account all facets relating the social to the technical aspects . These
system properties derive from the relationships between the parts of systems. how
the parts interact and fit together . Concentrating on the analysis and design of
the whole as distinct from the components or parts provides a means for studying
systems exhibiting organized complexity.
The foundation of systems theory rests on two pairs of ideas. .(1).emergence and
hierarchy and .(2).communication and control .
section 3 2. Emergence and Hierarchy.
A general model of complex systems can be expressed in terms of a hierarchy of
levels of organization, each more complex than the one below, where a level is characterized by having emergent properties. Emergent properties do not exist at lower
levels; they are meaningless in the language appropriate to those levels. The shape of
an apple, although eventually explainable in terms of the cells of the apple, has no
meaning at that lower level of description. The operation of the processes at the
lower levels of the hierarchy results in a higher level of complexity.that of the whole
apple itself.that has emergent properties, one of them being the apples shape .
The concept of emergence is the idea that at a given level of complexity, some properties characteristic of that level .(emergent at that level).are irreducible.
Hierarchy theory deals with the fundamental differences between one level of
complexity and another. Its ultimate aim is to explain the relationships between
different levels. what generates the levels, what separates them, and what links
them. Emergent properties associated with a set of components at one level in a
hierarchy are related to constraints upon the degree of freedom of those components.
Describing the emergent properties resulting from the imposition of constraints
requires a language at a higher level .(a metalevel).different than that describing the
components themselves. Thus, different languages of description are appropriate at
different levels.
Reliability is a component property.1 Conclusions can be reached about the
reliability of a valve in isolation, where reliability is defined as the probability that
the behavior of the valve will satisfy its specification over time and under given
conditions.
Safety, on the other hand, is clearly an emergent property of systems. Safety can
be determined only in the context of the whole. Determining whether a plant is
acceptably safe is not possible, for example, by examining a single valve in the plant.
In fact, statements about the “safety of the valve” without information about the
context in which that valve is used are meaningless. Safety is determined by the
relationship between the valve and the other plant components. As another example,
pilot procedures to execute a landing might be safe in one aircraft or in one set of
circumstances but unsafe in another.
Although they are often confused, reliability and safety are different properties.
The pilots may reliably execute the landing procedures on a plane or at an airport
in which those procedures are unsafe. A gun when discharged out on a desert with
no other humans or animals for hundreds of miles may be both safe and reliable.
When discharged in a crowded mall, the reliability will not have changed, but the
safety most assuredly has.
Because safety is an emergent property, it is not possible to take a single system
component, like a software module or a single human action, in isolation and assess
its safety. A component that is perfectly safe in one system or in one environment
may not be when used in another.
The new model of accidents introduced in part 2 of this book incorporates the
basic systems theory idea of hierarchical levels, where constraints or lack of constraints at the higher levels control or allow lower-level behavior. Safety is treated
as an emergent property at each of these levels. Safety depends on the enforcement
of constraints on the behavior of the components in the system, including constraints
on their potential interactions. Safety in the batch chemical reactor in the previous
chapter, for example, depends on the enforcement of a constraint on the relationship
between the state of the catalyst valve and the water valve.
footnote. 1. This statement is somewhat of an oversimplification, because the reliability of a system component
can, under some conditions .(e.g., magnetic interference or excessive heat).be impacted by its environment. The basic reliability of the component, however, can be defined and measured in isolation, whereas
the safety of an individual component is undefined except in a specific environment.
section 3 3.
Communication and Control.
The second major pair of ideas in systems theory is communication and control. An
example of regulatory or control action is the imposition of constraints upon the
activity at one level of a hierarchy, which define the “laws of behavior” at that level.
Those laws of behavior yield activity meaningful at a higher level. Hierarchies are
characterized by control processes operating at the interfaces between levels .
The link between control mechanisms studied in natural systems and those engineered in man-made systems was provided by a part of systems theory known as
cybernetics. Checkland writes.
Control is always associated with the imposition of constraints, and an account of a control
process necessarily requires our taking into account at least two hierarchical levels. At a
given level, it is often possible to describe the level by writing dynamical equations, on the
assumption that one particle is representative of the collection and that the forces at other
levels do not interfere. But any description of a control process entails an upper level
imposing constraints upon the lower. The upper level is a source of an alternative .(simpler)
description of the lower level in terms of specific functions that are emergent as a result
of the imposition of constraints .
Note Checklands statement about control always being associated with the
imposition of constraints. Imposing safety constraints plays a fundamental role in
the approach to safety presented in this book. The limited focus on avoiding failures,
which is common in safety engineering today, is replaced by the larger concept of
imposing constraints on system behavior to avoid unsafe events or conditions, that
is, hazards.
Control in open systems .(those that have inputs and outputs from their environment).implies the need for communication. Bertalanffy distinguished between
closed systems, in which unchanging components settle into a state of equilibrium,
and open systems, which can be thrown out of equilibrium by exchanges with their
environment.
In control theory, open systems are viewed as interrelated components that are
kept in a state of dynamic equilibrium by feedback loops of information and control.
The plants overall performance has to be controlled in order to produce the desired
product while satisfying cost, safety, and general quality constraints.
In order to control a process, four conditions are required .
•Goal Condition. The controller must have a goal or goals .(for example, to
maintain the setpoint).
•Action Condition. The controller must be able to affect the state of the system.
In engineering, control actions are implemented by actuators.
•Model Condition. The controller must be .(or contain).a model of the system
(see section 4.3).
•Observability Condition. The controller must be able to ascertain the state of
the system. In engineering terminology, observation of the state of the system
is provided by sensors.
Figure 3 2. shows a typical control loop. The plant controller obtains information
about .(observes).the process state from measured variables .(feedback).and uses this
information to initiate action by manipulating controlled variables to keep the
process operating within predefined limits or set points .(the goal).despite disturbances to the process. In general, the maintenance of any open-system hierarchy
(either biological or man-made).will require a set of processes in which there is
communication of information for regulation or control .
Control actions will generally lag in their effects on the process because of delays
in signal propagation around the control loop. an actuator may not respond immediately to an external command signal .(called dead time); the process may have
delays in responding to manipulated variables .(time constants); and the sensors
may obtain values only at certain sampling intervals .(feedback delays). Time lags
restrict the speed and extent with which the effects of disturbances, both within the
process itself and externally derived, can be reduced. They also impose extra requirements on the controller, for example, the need to infer delays that are not directly
observable.
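To make the four conditions and the loop in figure 3 2 concrete, here is a minimal sketch in Python (an illustration only, not from the book; the process dynamics, the gain of 0.4, the set point of 100, and the noise and disturbance ranges are all invented). The controller has a goal (the set point), observes the process through a noisy sensor, keeps a simple process model, and acts through a rate-limited actuator; a one-step feedback delay stands in for the time lags just described.

import random

SET_POINT = 100.0          # goal condition: the value the controller tries to hold

def sensor(true_value):
    # observability condition: the measured variable, with some measurement noise
    return true_value + random.uniform(-0.5, 0.5)

def actuator(command):
    # action condition: the manipulated variable, limited by actuator capacity
    return max(-5.0, min(5.0, command))

process_value = 90.0       # actual state of the controlled process
model_estimate = 90.0      # model condition: the controller's belief about that state
delayed_measurement = sensor(process_value)   # feedback arrives one step late

for step in range(20):
    model_estimate = delayed_measurement                      # update the process model from feedback
    command = actuator(0.4 * (SET_POINT - model_estimate))    # simple proportional control action
    disturbance = random.uniform(-1.0, 1.0)                   # external disturbance on the process
    delayed_measurement = sensor(process_value)               # sample taken before the process moves
    process_value += command + disturbance                    # process responds to control and disturbance
    print(f"step {step:2d}: estimate={model_estimate:6.2f} command={command:5.2f}")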
The model condition plays an important role in accidents and safety. In order to
create effective control actions, the controller must know the current state of the
controlled process and be able to estimate the effect of various control actions on
that state. As discussed further in section 4.3, many accidents have been caused by
the controller incorrectly assuming the controlled system was in a particular state
and imposing a control action .(or not providing one).that led to a loss. the Mars
Polar Lander descent engine controller, for example, assumed that the spacecraft
was on the surface of the planet and shut down the descent engines. The captain
of the Herald of Free Enterprise thought the car deck doors were shut and left
the mooring.
section 3 4.
Using Systems Theory to Understand Accidents.
Safety approaches based on systems theory consider accidents as arising from the
interactions among system components and usually do not specify single causal
variables or factors . Whereas industrial .(occupational).safety models and
event chain models focus on unsafe acts or conditions, classic system safety models
instead look at what went wrong with the systems operation or organization to
allow the accident to take place.
This systems approach treats safety as an emergent property that arises when
the system components interact within an environment. Emergent properties like
safety are controlled or enforced by a set of constraints .(control laws).related to
the behavior of the system components. For example, the spacecraft descent engines
must remain on until the spacecraft reaches the surface of the planet and the car
deck doors on the ferry must be closed before leaving port. Accidents result from
interactions among components that violate these constraints.in other words,
from a lack of appropriate constraints on the interactions. Component interaction
accidents, as well as component failure accidents, can be explained using these
concepts.
Safety then can be viewed as a control problem. Accidents occur when component failures, external disturbances, and/or dysfunctional interactions among system
components are not adequately controlled. In the space shuttle Challenger loss, the
O-rings did not adequately control propellant gas release by sealing a tiny gap in
the field joint. In the Mars Polar Lander loss, the software did not adequately control
the descent speed of the spacecraft.it misinterpreted noise from a Hall effect
sensor .(feedback of a measured variable).as an indication the spacecraft had reached
the surface of the planet. Accidents such as these, involving engineering design
errors, may in turn stem from inadequate control over the development process. A
Milstar satellite was lost when a typo in the software load tape was not detected
during the development and testing. Control is also imposed by the management
functions in an organization.the Challenger and Columbia losses, for example,
involved inadequate controls in the launch-decision process.
While events reflect the effects of dysfunctional interactions and inadequate
enforcement of safety constraints, the inadequate control itself is only indirectly
reflected by the events.the events are the result of the inadequate control. The
control structure itself must be examined to determine why it was inadequate to
maintain the constraints on safe behavior and why the events occurred.
As an example, the unsafe behavior .(hazard).in the Challenger loss was the
release of hot propellant gases from the field joint. The miscreant O-ring was used
to control the hazard.that is, its role was to seal a tiny gap in the field joint created
by pressure at ignition. The loss occurred because the system design, including the
O-ring, did not effectively impose the required constraint on the propellant gas
release. Starting from here, there are then several questions that need to be answered
to understand why the accident occurred and to obtain the information necessary
to prevent future accidents. Why was this particular design unsuccessful in imposing
the constraint, why was it chosen .(what was the decision process), why was the
flaw not found during development, and was there a different design that might
have been more successful? These questions and others consider the original
design process.
Understanding the accident also requires examining the contribution of the
operations process. Why were management decisions made to launch despite warnings that it might not be safe to do so? One constraint that was violated during
operations was the requirement to correctly handle feedback about any potential
violation of the safety design constraints, in this case, feedback during operations
that the control by the O-rings of the release of hot propellant gases from the field
joints was not being adequately enforced by the design. There were several instances
of feedback that was not adequately handled, such as data about O-ring blowby and
erosion during previous shuttle launches and feedback by engineers who were concerned about the behavior of the O-rings in cold weather. Although the lack of
redundancy provided by the second O-ring was known long before the loss of Challenger, that information was never incorporated into the NASA Marshall Space
Flight Center database and was unknown by those making the launch decision.
In addition, there was missing feedback about changes in the design and testing
procedures during operations, such as the use of a new type of putty and the introduction of new O-ring leak checks without adequate verification that they satisfied
system safety constraints on the field joints. As a final example, the control processes
that ensured unresolved safety concerns were fully considered before each flight,
that is, the flight readiness reviews and other feedback channels to project management making flight decisions, were flawed.
Systems theory provides a much better foundation for safety engineering than
the classic analytic reduction approach underlying event-based models of accidents.
It provides a way forward to much more powerful and effective safety and risk
analysis and management procedures that handle the inadequacies and needed
extensions to current practice described in chapter 2.
Combining a systems-theoretic approach to safety with system engineering
processes will allow designing safety into the system as it is being developed or
reengineered. System engineering provides an appropriate vehicle for this process
because it rests on the same systems theory foundation and involves engineering
the system as a whole.
section 3 5.
Systems Engineering and Safety.
The emerging theory of systems, along with many of the historical forces noted in
chapter 1, gave rise after World War 2 to a new emphasis in engineering, eventually
called systems engineering. During and after the war, technology expanded rapidly
and engineers were faced with designing and building more complex systems than
had been attempted previously. Much of the impetus for the creation of this new
discipline came from military programs in the 19 50s and 19 60s, particularly intercontinental ballistic missile .(ICBM).systems. Apollo was the first nonmilitary government program in which systems engineering was recognized from the beginning
as an essential function .
System Safety, as defined in MIL-STD-882, is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex
new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and
near misses. When these early aerospace accidents were investigated, the causes of
a large percentage of them were traced to deficiencies in design, operations, and
management. Clearly, big changes were needed. System engineering along with its
subdiscipline, System Safety, were developed to tackle these problems.
Systems theory provides the theoretical foundation for systems engineering,
which views each system as an integrated whole even though it is composed of
diverse, specialized components. The objective is to integrate the subsystems into
the most effective system possible to achieve the overall objectives, given a prioritized set of design criteria. Optimizing the system design often requires making
tradeoffs between these design criteria .(goals).
The development of systems engineering as a discipline enabled the solution of
enormously more complex and difficult technological problems than previously
. Many of the elements of systems engineering can be viewed merely as good
engineering. It represents more a shift in emphasis than a change in content. In
addition, while much of engineering is based on technology and science, systems
engineering is equally concerned with overall management of the engineering
process.
A systems engineering approach to safety starts with the basic assumption that
some properties of systems, in this case safety, can only be treated adequately in the
context of the social and technical system as a whole. A basic assumption of systems
engineering is that optimization of individual components or subsystems will not in
general lead to a system optimum; in fact, improvement of a particular subsystem
may actually worsen the overall system performance because of complex, nonlinear
interactions among the components. When each aircraft tries to optimize its path
from its departure point to its destination, for example, the overall air transportation
system throughput may not be optimized when they all arrive at a popular hub at
the same time. One goal of the air traffic control system is to optimize the overall
air transportation system throughput while, at the same time, trying to allow as much
flexibility for the individual aircraft and airlines to achieve their goals. In the end,
if system engineering is successful, everyone gains. Similarly, each pharmaceutical
company acting to optimize its profits, which is a legitimate and reasonable company
goal, will not necessarily optimize the larger societal system goal of producing safe
and effective pharmaceutical and biological products to enhance public health.
These system engineering principles are applicable even to systems beyond those
traditionally thought of as in the engineering realm. The financial system and its
meltdown starting in 2007 is an example of a social system that could benefit from
system engineering concepts.
Another assumption of system engineering is that individual component behavior .(including events or actions).cannot be understood without considering the
components role and interaction within the system as a whole. This basis for systems
engineering has been stated as the principle that a system is more than the sum of
its parts. Attempts to improve long-term safety in complex systems by analyzing and
changing individual components have often proven to be unsuccessful over the long
term. For example, Rasmussen notes that over many years of working in the field
of nuclear power plant safety, he found that attempts to improve safety from models
of local features were compensated for by people adapting to the change in an
unpredicted way .
Approaches used to enhance safety in complex systems must take these basic
systems engineering principles into account. Otherwise, our safety engineering
approaches will be limited in the types of accidents and systems they can handle.
At the same time, approaches that include them, such as those described in this
book, have the potential to greatly improve our ability to engineer safer and more
complex systems.
section 3 6.
Building Safety into the System Design.
System Safety, as practiced by the U.S. defense and aerospace communities as well
as the new approach outlined in this book, fit naturally within the general systems
engineering process and the problem-solving approach that a system view provides.
This problem-solving process entails several steps. First, a need or problem is specified in terms of objectives that the system must satisfy along with criteria that can
be used to rank alternative designs. For a system that has potential hazards, the
objectives will include safety objectives and criteria along with high-level requirements and safety design constraints. The hazards for an automated train system, for
example, might include the train doors closing while a passenger is in the doorway.
The safety-related design constraint might be that obstructions in the path of a
closing door must be detected and the door closing motion reversed.
After the high-level requirements and constraints on the system design are identified, a process of system synthesis takes place that results in a set of alternative
designs. Each of these alternatives is analyzed and evaluated in terms of the stated
objectives and design criteria, and one alternative is selected to be implemented. In
practice, the process is highly iterative. The results from later stages are fed back to
early stages to modify objectives, criteria, design alternatives, and so on. Of course,
the process described here is highly simplified and idealized.
The following are some examples of basic systems engineering activities and the
role of safety within them.
•Needs analysis. The starting point of any system design project is a perceived
need. This need must first be established with enough confidence to justify the
commitment of resources to satisfy it and understood well enough to allow
appropriate solutions to be generated. Criteria must be established to provide
a means to evaluate both the evolving and final system. If there are hazards
associated with the operation of the system, safety should be included in the
needs analysis.
•Feasibility studies. The goal of this step in the design process is to generate a
set of realistic designs. This goal is accomplished by identifying the principal
constraints and design criteria.including safety constraints and safety design
criteria.for the specific problem being addressed and then generating plausible solutions to the problem that satisfy the requirements and constraints and
are physically and economically feasible.
•Trade studies. In trade studies, the alternative feasible designs are evaluated
with respect to the identified design criteria. A hazard might be controlled by
any one of several safeguards. A trade study would determine the relative
desirability of each safeguard with respect to effectiveness, cost, weight, size,
safety, and any other relevant criteria. For example, substitution of one material
for another may reduce the risk of fire or explosion, but may also reduce reliability or efficiency. Each alternative design may have its own set of safety
constraints .(derived from the system hazards).as well as other performance
goals and constraints that need to be assessed. Although decisions ideally should
be based upon mathematical analysis, quantification of many of the key factors
is often difficult, if not impossible, and subjective judgment often has to be used.
•System architecture development and analysis. In this step, the system engineers break down the system into a set of subsystems, together with the functions and constraints, including safety constraints, imposed upon the individual
subsystem designs, the major system interfaces, and the subsystem interface
topology. These aspects are analyzed with respect to desired system performance characteristics and constraints .(again including safety constraints).and
the process is iterated until an acceptable system design results. The preliminary
design at the end of this process must be described in sufficient detail that
subsystem implementation can proceed independently.
•Interface analysis. The interfaces define the functional boundaries of the
system components. From a management standpoint, interfaces must .(1).optimize visibility and control and .(2).isolate components that can be implemented
independently and for which authority and responsibility can be delegated
. From an engineering standpoint, interfaces must be designed to separate
independent functions and to facilitate the integration, testing, and operation
of the overall system. One important factor in designing the interfaces is safety,
and safety analysis should be a part of the system interface analysis. Because
interfaces tend to be particularly susceptible to design error and are implicated
in the majority of accidents, a paramount goal of interface design is simplicity.
Simplicity aids in ensuring that the interface can be adequately designed, analyzed, and tested prior to integration and that interface responsibilities can be
clearly understood.
Any specific realization of this general systems engineering process depends on
the engineering models used for the system components and the desired system
qualities. For safety, the models commonly used to understand why and how accidents occur have been based on events, particularly failure events, and the use of
reliability engineering techniques to prevent them. Part 2 of this book further
details the alternative systems approach to safety introduced in this chapter, while
part 3 provides techniques to perform many of these safety and system engineering
activities.

890
chapter04.txt Normal file

@@ -0,0 +1,890 @@
PART 2.
STAMP. AN ACCIDENT MODEL BASED ON
SYSTEMS THEORY.
Part 2 introduces an expanded accident causality model based on the new assumptions in chapter 2 and satisfying the goals stemming from them. The theoretical
foundation for the new model is systems theory, as introduced in chapter 3. Using
this new causality model, called STAMP .(Systems-Theoretic Accident Model and
Processes), changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but
our conception of causality is extended to include component interaction accidents.
Safety is reformulated as a control problem rather than a reliability problem. This
change leads to much more powerful and effective ways to engineer safer systems,
including the complex sociotechnical systems of most concern today.
The three main concepts in this model.safety constraints, hierarchical control
structures, and process models.are introduced first in chapter 4. Then the STAMP
causality model is described, along with a classification of accident causes implied
by the new model.
To provide additional understanding of STAMP, it is used to describe the causes
of several very different types of losses.a friendly fire shootdown of a U.S. Army
helicopter by a U.S. Air Force fighter jet over northern Iraq, the contamination of
a public water system with E. coli bacteria in a small town in Canada, and the loss
of a Milstar satellite. Chapter 5 presents the friendly fire accident analysis. The other
accident analyses are contained in appendixes B and C.
chapter 4.
A Systems-Theoretic View of Causality.
In the traditional causality models, accidents are considered to be caused by chains
of failure events, each failure directly causing the next one in the chain. Part I
explained why these simple models are no longer adequate for the more complex
sociotechnical systems we are attempting to build today. The definition of accident
causation needs to be expanded beyond failure events so that it includes component
interaction accidents and indirect or systemic causal mechanisms.
The first step is to generalize the definition of an accident.1 An accident is an
unplanned and undesired loss event. That loss may involve human death and injury,
but it may also involve other major losses, including mission, equipment, financial,
and information losses.
Losses result from component failures, disturbances external to the system, interactions among system components, and behavior of individual system components
that lead to hazardous system states. Examples of hazards include the release of
toxic chemicals from an oil refinery, a patient receiving a lethal dose of medicine,
two aircraft violating minimum separation requirements, and commuter train doors
opening between stations.
In systems theory, emergent properties, such as safety, arise from the interactions
among the system components. The emergent properties are controlled by imposing
constraints on the behavior of and interactions among the components. Safety then
becomes a control problem where the goal of the control is to enforce the safety
constraints. Accidents result from inadequate control or enforcement of safetyrelated constraints on the development, design, and operation of the system.
At Bhopal, the safety constraint that was violated was that the MIC must not
come in contact with water. In the Mars Polar Lander, the safety constraint was that
the spacecraft must not impact the planet surface with more than a maximum force.
In the batch chemical reactor accident described in chapter 2, one safety constraint
is a limitation on the temperature of the contents of the reactor.
The problem then becomes one of control where the goal is to control the behavior of the system by enforcing the safety constraints in its design and operation.
Controls must be established to accomplish this goal. These controls need not necessarily involve a human or automated controller. Component behavior .(including
failures). and unsafe interactions may be controlled through physical design, through
process .(such as manufacturing processes and procedures, maintenance processes,
and operations), or through social controls. Social controls include organizational
(management), governmental, and regulatory structures, but they may also be cultural, policy, or individual .(such as self-interest). As an example of the latter, one
explanation that has been given for the 2 thousand 9 financial crisis is that when investment
banks went public, individual controls to reduce personal risk and long-term profits
were eliminated and risk shifted to shareholders and others who had few and weak
controls over those taking the risks.
In this framework, understanding why an accident occurred requires determining
why the control was ineffective. Preventing future accidents requires shifting from
a focus on preventing failures to the broader goal of designing and implementing
controls that will enforce the necessary constraints.
The STAMP .(System-Theoretic Accident Model and Processes). accident model
is based on these principles. Three basic constructs underlie STAMP. safety constraints, hierarchical safety control structures, and process models.
section 4 1.
Safety Constraints.
The most basic concept in STAMP is not an event, but a constraint. Events leading
to losses occur only because safety constraints were not successfully enforced.
The difficulty in identifying and enforcing safety constraints in design and operations has increased from the past. In many of our older and less automated systems,
physical and operational constraints were often imposed by the limitations of technology and of the operational environments. Physical laws and the limits of our
materials imposed natural constraints on the complexity of physical designs and
allowed the use of passive controls.
In engineering, passive controls are those that maintain safety by their presence.
basically, the system fails into a safe state or simple interlocks are used to limit
the interactions among system components to safe ones. Some examples of passive
controls that maintain safety by their presence are shields or barriers such as
containment vessels, safety harnesses, hardhats, passive restraint systems in vehicles,
and fences. Passive controls may also rely on physical principles, such as gravity,
to fail into a safe state. An example is an old railway semaphore that used weights
to ensure that if the cable .(controlling the semaphore). broke, the arm would automatically drop into the stop position. Other examples include mechanical relays
designed to fail with their contacts open, and retractable landing gear for aircraft in
which the wheels drop and lock in the landing position if the pressure system that
raises and lowers them fails. For the batch chemical reactor example in chapter 2,
where the order valves are opened is crucial, designers might have used a physical
interlock that did not allow the catalyst valve to be opened while the water valve
was closed.
In contrast, active controls require some action(s). to provide protection. .(1). detection of a hazardous event or condition .(monitoring), .(2). measurement of some
variable(s), .(3). interpretation of the measurement .(diagnosis), and .(4). response
(recovery or fail-safe procedures), all of which must be completed before a loss
occurs. These actions are usually implemented by a control system, which now commonly includes a computer.
Consider the simple passive safety control where the circuit for a high-power
outlet is run through a door that shields the power outlet. When the door is opened,
the circuit is broken and the power disabled. When the door is closed and the power
enabled, humans cannot touch the high power outlet. Such a design is simple and
foolproof. An active safety control design for the same high power source requires
some type of sensor to detect when the access door to the power outlet is opened
and an active controller to issue a control command to cut the power. The failure
modes for the active control system are greatly increased over the passive design,
as is the complexity of the system component interactions. In the railway semaphore
example, there must be a way to detect that the cable has broken .(probably now a
digital system is used instead of a cable so the failure of the digital signaling system
must be detected). and some type of active controls used to warn operators to stop
the train. The design of the batch chemical reactor described in chapter 2 used a
computer to control the valve opening and closing order instead of a simple mechanical interlock.
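The difference can be sketched in a few lines of Python (a hypothetical illustration, not any real design discussed here; the sensor reading, diagnosis, and relay interfaces are invented). Where the passive design removes the hazard by construction, the active version must carry out each of the four steps listed above, and every step, the sensor reading, the interpretation, and the commanded response, is something that can fail.

# Hypothetical active control for the high-power outlet example. The steps
# correspond loosely to the four conditions for active protection:
# monitoring, measurement, diagnosis, and response.

def read_door_switch():
    # (1) and (2) detection and measurement: poll the door sensor; in a real
    # system this reading could be stale, stuck, or simply wrong.
    return {"door_open": True, "valid": True}

def diagnose(reading):
    # (3) interpretation of the measurement
    if not reading["valid"]:
        return "unknown"
    return "hazard" if reading["door_open"] else "safe"

def respond(relay, state):
    # (4) response: cut power on a hazard, and also on doubt (fail toward safety)
    if state in ("hazard", "unknown"):
        relay["energized"] = False

relay = {"energized": True}
reading = read_door_switch()
respond(relay, diagnose(reading))
print("power enabled:", relay["energized"])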
While simple examples are used here for practical reasons, the complexity of our
designs is reaching and exceeding the limits of our intellectual manageability with
a resulting increase in component interaction accidents and lack of enforcement of
the system safety constraints. Even the relatively simple computer-based batch
chemical reactor valve control design resulted in a component interaction accident.
There are often very good reasons to use active controls instead of passive ones,
including increased functionality, more flexibility in design, ability to operate over
large distances, weight reduction, and so on. But the difficulty of the engineering
problem is increased and more potential for design error is introduced.
A similar argument can be made for the interactions between operators and
the processes they control. Cook suggests that when controls were primarily
mechanical and were operated by people located close to the operating process,
proximity allowed sensory perception of the status of the process via direct physical
feedback such as vibration, sound, and temperature .(figure 4.1). Displays were
directly linked to the process and were essentially a physical extension of it. For
example, the flicker of a gauge needle in the cab of a train indicated that .(1). the
engine valves were opening and closing in response to slight pressure fluctuations,
(2). the gauge was connected to the engine, .(3). the pointing indicator was free, and
so on. In this way, the displays provided a rich source of information about the
controlled process and the state of the displays themselves.
The introduction of electromechanical controls allowed operators to control
processes from a greater distance .(both physical and conceptual). than possible with
pure mechanically linked controls .(figure 4.2). That distance, however, meant that
operators lost a lot of direct information about the process.they could no longer
sense the process state directly and the control and display surfaces no longer provided as rich a source of information about the process or the state of the controls
themselves. The system designers had to synthesize and provide an image of the
process state to the operators. An important new source of design errors was introduced by the need for the designers to determine beforehand what information the
operator would need under all conditions to safely control the process. If the designers had not anticipated a particular situation could occur and provided for it in the
original system design, they might also not anticipate the need of the operators for
information about it during operations.
Designers also had to provide feedback on the actions of the operators and on
any failures that might have occurred. The controls could now be operated without
the desired effect on the process, and the operators might not know about it. Accidents started to occur due to incorrect feedback. For example, major accidents
(including Three Mile Island). have involved the operators commanding a valve to
open and receiving feedback that the valve had opened, when in reality it had not.
In this case and others, the valves were wired to provide feedback indicating that
power had been applied to the valve, but not that the valve had actually opened.
Not only could the design of the feedback about success and failures of control
actions be misleading in these systems, but the return links were also subject
to failure.
Electromechanical controls relaxed constraints on the system design allowing
greater functionality .(figure 4.3). At the same time, they created new possibilities
for designer and operator error that had not existed or were much less likely in
mechanically controlled systems. The later introduction of computer and digital
controls afforded additional advantages and removed even more constraints on the
control system design.and introduced more possibility for error. Proximity in our
old mechanical systems provided rich sources of feedback that involved almost all
of the senses, enabling early detection of potential problems. We are finding it hard
to capture and provide these same qualities in new systems that use automated
controls and displays.
It is the freedom from constraints that makes the design of such systems so difficult. Physical constraints enforced discipline and limited complexity in system
design, construction, and modification. The physical constraints also shaped system
design in ways that efficiently transmitted valuable physical component and process
information to operators and supported their cognitive processes.
The same argument applies to the increasing complexity in organizational and
social controls and in the interactions among the components of sociotechnical
systems. Some engineering projects today employ thousands of engineers. The Joint
Strike Fighter, for example, has eight thousand engineers spread over most of the
United States. Corporate operations have become global, with greatly increased
interdependencies and producing a large variety of products. A new holistic approach
to safety, based on control and enforcing safety constraints in the entire sociotechnical system, is needed to ensure safety.
To accomplish this goal, system-level constraints must be identified, and responsibility for enforcing them must be divided up and allocated to appropriate groups.
For example, the members of one group might be responsible for performing hazard
analyses. The manager of this group might be assigned responsibility for ensuring
that the group has the resources, skills, and authority to perform such analyses and
for ensuring that high-quality analyses result. Higher levels of management might
have responsibility for budgets, for establishing corporate safety policies, and for
providing oversight to ensure that safety policies and activities are being carried out
successfully and that the information provided by the hazard analyses is used in
design and operations.
During system and product design and development, the safety constraints will
be broken down and sub-requirements or constraints allocated to the components
of the design as it evolves. In the batch chemical reactor, for example, the system
safety requirement is that the temperature in the reactor must always remain below
a particular level. A design decision may be made to control this temperature using
a reflux condenser. This decision leads to a new constraint. “Water must be flowing
into the reflux condenser whenever catalyst is added to the reactor.” After a decision
is made about what component(s). will be responsible for operating the catalyst and
water valves, additional requirements will be generated. If, for example, a decision
is made to use software rather than .(or in addition to). a physical interlock, the
software must be assigned the responsibility for enforcing the constraint. “The
water valve must always be open when the catalyst valve is open.”
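
The following short Python sketch illustrates how such a component-level constraint might be enforced by a software interlock. It is a minimal illustration only; the valve names and the controller interface are hypothetical and are not taken from any real plant design.

# Minimal sketch of a software interlock enforcing the component-level
# constraint derived above: the water valve must be open whenever the
# catalyst valve is open. Valve names and the controller interface are
# hypothetical, for illustration only.

class ReactorInterlock:
    def __init__(self):
        self.water_valve_open = False
        self.catalyst_valve_open = False

    def command_water_valve(self, open_valve: bool) -> None:
        # Closing the water valve is refused while catalyst is flowing.
        if not open_valve and self.catalyst_valve_open:
            raise RuntimeError("constraint violation: water must flow while catalyst valve is open")
        self.water_valve_open = open_valve

    def command_catalyst_valve(self, open_valve: bool) -> None:
        # Opening the catalyst valve is refused unless water is already flowing.
        if open_valve and not self.water_valve_open:
            raise RuntimeError("constraint violation: open the water valve before the catalyst valve")
        self.catalyst_valve_open = open_valve
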
In order to provide the level of safety demanded by society today, we first need
to identify the safety constraints to enforce and then to design effective controls to
enforce them. This process is much more difficult for todays complex and often
high-tech systems than in the past and new techniques, such as those described in
part THREE, are going to be required to solve it, for example, methods to assist in generating the component safety constraints from the system safety constraints.
The alternative.building only the simple electromechanical systems of the past or
living with higher levels of risk.is for the most part not going to be considered an
acceptable solution.
section 4 2.
The Hierarchical Safety Control Structure.
In systems theory .(see section 3 3.), systems are viewed as hierarchical structures,
where each level imposes constraints on the activity of the level beneath it.that is,
constraints or lack of constraints at a higher level allow or control lower-level
behavior.
Control processes operate between levels to control the processes at lower levels
in the hierarchy. These control processes enforce the safety constraints for which
the control process is responsible. Accidents occur when these processes provide
inadequate control and the safety constraints are violated in the behavior of the
lower-level components.
By describing accidents in terms of a hierarchy of control based on adaptive
feedback mechanisms, adaptation plays a central role in the understanding and
prevention of accidents.
At each level of the hierarchical structure, inadequate control may result from
missing constraints .(unassigned responsibility for safety), inadequate safety control
commands, commands that were not executed correctly at a lower level, or inadequately communicated or processed feedback about constraint enforcement. For
example, an operations manager may provide unsafe work instructions or procedures to the operators, or the manager may provide instructions that enforce the
safety constraints, but the operators may ignore them. The operations manager may
not have the feedback channels established to determine that unsafe instructions
were provided or that his or her safety-related instructions are not being followed.
Figure 4.4 shows a typical sociotechnical hierarchical safety control structure
common in a regulated, safety-critical industry in the United States, such as air
transportation. Each system, of course, must be modeled to include its specific
features. Figure 4.4 has two basic hierarchical control structures.one for system
development .(on the left). and one for system operation .(on the right).with interactions between them. An aircraft manufacturer, for example, might have only
system development under its immediate control, but safety involves both development and operational use of the aircraft, and neither can be accomplished successfully in isolation. Safety during operation depends partly on the original design and
development and partly on effective control over operations. Communication channels may be needed between the two structures. For example, aircraft manufacturers must communicate to their customers the assumptions about the operational
environment upon which the safety analysis was based, as well as information about
safe operating procedures. The operational environment .(e.g., the commercial airline
industry), in turn, provides feedback to the manufacturer about the performance of
the system over its lifetime.
Between the hierarchical levels of each safety control structure, effective communication channels are needed, both a downward reference channel providing the
information necessary to impose safety constraints on the level below and an upward
measuring channel to provide feedback about how effectively the constraints are
being satisfied .(figure 4.5). Feedback is critical in any open system in order to
provide adaptive control. The controller uses the feedback to adapt future control
commands to more readily achieve its goals.
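
A minimal Python sketch of these two channels may make the structure concrete. The class layout and the example constraint and report are illustrative assumptions, not a prescribed modeling notation.

# Minimal sketch of the two channels described above between adjacent
# levels of a safety control structure: a downward reference channel
# carrying constraints and an upward measuring channel carrying feedback.
# The class layout and example strings are illustrative only.

from dataclasses import dataclass, field

@dataclass
class ControlLevel:
    name: str
    constraints_imposed: list[str] = field(default_factory=list)   # reference channel, downward
    feedback_received: list[str] = field(default_factory=list)     # measuring channel, upward

    def impose_constraint(self, lower: "ControlLevel", constraint: str) -> None:
        lower.constraints_imposed.append(constraint)

    def report_feedback(self, upper: "ControlLevel", report: str) -> None:
        upper.feedback_received.append(report)

company = ControlLevel("company management")
project = ControlLevel("project management")
company.impose_constraint(project, "perform a hazard analysis before design freeze")
project.report_feedback(company, "hazard analysis status report")
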
Government, general industry groups, and the court system occupy the top two
levels of each of the generic control structures shown in figure 4.4. The government
control structure in place to control development may differ from that controlling
operations.responsibility for certifying the aircraft developed by aircraft manufacturers is assigned to one group at the FAA, while responsibility for supervising
airline operations is assigned to a different group. The appropriate constraints in
each control structure and at each level will vary but in general may include technical design and process constraints, management constraints, manufacturing constraints, and operational constraints.
At the highest level in both the system development and system operation hierarchies are Congress and state legislatures. Congress controls safety by passing laws
and by establishing and funding government regulatory structures. Feedback as to
the success of these controls or the need for additional ones comes in the form of
government reports, congressional hearings and testimony, lobbying by various
interest groups, and, of course, accidents.
The next level contains government regulatory agencies, industry associations,
user associations, insurance companies, and the court system. Unions have always
played an important role in ensuring safe operations, such as the air traffic controllers union in the air transportation system, or in ensuring worker safety in
manufacturing. The legal system tends to be used when there is no regulatory
authority and the public has no other means to encourage a desired level of concern
for safety in company management. The constraints generated at this level and
imposed on companies are usually in the form of policy, regulations, certification,
standards .(by trade or user associations), or threat of litigation. Where there is a
union, safety-related constraints on operations or manufacturing may result from
union demands and collective bargaining.
Company management takes the standards, regulations, and other general controls on its behavior and translates them into specific policy and standards for the
company. Many companies have a general safety policy .(it is required by law in
Great Britain). as well as more detailed standards documents. Feedback may come
in the form of status reports, risk assessments, and incident reports.
In the development control structure .(shown on the left of figure 4.4), company
policies and standards are usually tailored and perhaps augmented by each engineering project to fit the needs of the particular project. The higher-level control
process may provide only general goals and constraints and the lower levels may
then add many details to operationalize the general goals and constraints given the
immediate conditions and local goals. For example, while government or company
standards may require a hazard analysis be performed, the system designers and
documenters .(including those designing the operational procedures and writing user
manuals). may have control over the actual hazard analysis process used to identify
specific safety constraints on the design and operation of the system. These detailed
procedures may need to be approved by the level above.
The design constraints identified as necessary to control system hazards are
passed to the implementers and assurers of the individual system components
along with standards and other requirements. Success is determined through feedback provided by test reports, reviews, and various additional hazard analyses. At
the end of the development process, the results of the hazard analyses as well
as documentation of the safety-related design features and design rationale should
be passed on to the maintenance group to be used in the system evolution and
sustainment process.
A similar process involving layers of control is found in the system operation
control structure. In addition, there will be .(or at least should be). interactions
between the two structures. For example, the safety design constraints used during
development should form the basis for operating procedures and for performance
and process auditing.
As in any control loop, time lags may affect the flow of control actions and feedback and may impact the effectiveness of the control loop in enforcing the safety
constraints. For example, standards can take years to develop or change.a time
scale that may keep them behind current technology and practice. At the physical
level, new technology may be introduced in different parts of the system at different
rates, which may result in asynchronous evolution of the control structure. In the
accidental shootdown of two U.S. Army Black Hawk helicopters by two U.S. Air
Force F-15s in the no-fly zone over northern Iraq in 1994, for example, the fighter
jet aircraft and the helicopters were inhibited in communicating by radio because
the F-15 pilots used newer jam-resistant radios that could not communicate with
the older-technology Army helicopter radios. Hazard analysis needs to include the
influence of these time lags and potential changes over time.
A common way to deal with such time lags is to delegate responsibility to lower levels that are not subject to as great a delay in obtaining information
or feedback from the measuring channels. In periods of quickly changing technology,
time lags may make it necessary for the lower levels to augment the control processes passed down from above or to modify them to fit the current situation. Time
lags at the lowest levels, as in the Black Hawk shootdown example, may require the
use of feedforward control to overcome lack of feedback or may require temporary
controls on behavior. Communication between the F-15s and the Black Hawks
would have been possible if the F-15 pilots had been told to use an older radio
technology available to them, as they were commanded to do for other types of
friendly aircraft.
More generally, control structures always change over time, particularly those
that include humans and organizational components. Physical devices also change
with time, but usually much more slowly and in more predictable ways. If we are to handle
social and human aspects of safety, then our accident causality models must include
the concept of change. In addition, controls and assurance that the safety control
structure remains effective in enforcing the constraints over time are required.
Control does not necessarily imply rigidity and authoritarian management
styles. Rasmussen notes that control at each level may be enforced in a very prescriptive command and control structure or it may be loosely implemented as performance objectives with many degrees of freedom in how the objectives are met. Recent trends from management by oversight to management by insight
reflect differing levels of feedback control that are exerted over the lower levels and
a change from prescriptive management control to management by objectives,
where the objectives are interpreted and satisfied according to the local context.
Management insight, however, does not mean abdication of safety-related responsibility. In the Milstar satellite and Mars Polar Lander losses, for example, the accident reports note that a poor transition from oversight to insight was a factor in the losses. Attempts to delegate decisions and to manage by objectives require an explicit formulation of the value
criteria to be used and an effective means for communicating the values down
through society and organizations. In addition, the impact of specific decisions at
each level on the objectives and values passed down needs to be adequately and
formally evaluated. Feedback is required to measure how successfully the functions
are being performed.
Although regulatory agencies are included in the figure 4.4 example, there is no
implication that government regulation is required for safety. The only requirement
is that responsibility for safety is distributed in an appropriate way throughout
the sociotechnical system. In aircraft safety, for example, manufacturers play the
major role, while the FAA type certification authority simply provides oversight to ensure that
safety is being successfully engineered into aircraft at the lower levels of the hierarchy. If companies or industries are unwilling or incapable of performing their
public safety responsibilities, then government has to step in to achieve the overall
public safety goals. But a much better solution is for company management to take
responsibility, as it has direct control over the system design and manufacturing and
over operations.
The safety-control structure will differ among industries and examples are spread
among the following chapters. Figure C.1 in appendix C shows the control structure
and safety constraints for the hierarchical water safety control system in Ontario,
Canada. The structure is drawn on its side .(as is more common for control diagrams)
so that the top of the hierarchy is on the left side of the figure. The system hazard
is exposure of the public to E. coli or other health-related contaminants through the
public drinking water system; therefore, the goal of the safety control structure is to
prevent such exposure. This goal leads to two system safety constraints.
1. Water quality must not be compromised.
2. Public health measures must reduce the risk of exposure if water quality is
somehow compromised .(such as notification and procedures to follow).
The physical processes being controlled by this control structure .(shown at the
right of the figure). are the water system, the wells used by the local public utilities,
and public health. Details of the control structure are discussed in appendix C, but
appropriate responsibility, authority, and accountability must be assigned to each
component with respect to the role it plays in the overall control structure. For
example, the responsibility of the Canadian federal government is to establish a
nationwide public health system and ensure that it is operating effectively. The
provincial government must establish regulatory bodies and codes, provide resources
to the regulatory bodies, provide oversight and feedback loops to ensure that the
regulators are doing their job adequately, and ensure that adequate risk assessment
is conducted and effective risk management plans are in place. Local public utility
operations must apply adequate doses of chlorine to kill bacteria, measure the
chlorine residuals, and take further steps if evidence of bacterial contamination is
found. While chlorine residuals are a quick way to get feedback about possible
contamination, more accurate feedback is provided by analyzing water samples but
takes longer .(it has a greater time lag). Both have their uses in the overall safety
control structure of the public water supply.
Safety control structures may be very complex. Abstracting and concentrating on
parts of the overall structure may be useful in understanding and communicating
about the controls. In examining different hazards, only subsets of the overall structure may be relevant and need to be considered in detail and the rest can be treated
as the inputs to or the environment of the substructure. The only critical part is that
the hazards must first be identified at the system level and the process must then
proceed top-down and not bottom-up to identify the safety constraints for the parts
of the overall control structure.
The operation of sociotechnical safety control structures at all levels is facing the
stresses noted in chapter 1, such as rapidly changing technology, competitive and
time-to-market pressures, and changing public and regulatory views of responsibility
for safety. These pressures can lead to a need for new procedures or new controls
to ensure that required safety constraints are not ignored.
section 4 3.
Process Models.
The third concept used in STAMP, along with safety constraints and hierarchical
safety control structures, is process models. Process models are an important part of
control theory. The four conditions required to control a process are described in
chapter 3. The first is a goal, which in STAMP is the safety constraints that must
be enforced by each controller in the hierarchical safety control structure. The
action condition is implemented in the .(downward). control channels and the observability condition is embodied in the .(upward). feedback or measuring channels. The
final condition is the model condition. Any controller.human or automated.
needs a model of the process being controlled to control it effectively .(figure 4.6).
At one extreme, this process model may contain only one or two variables, such
as the model required for a simple thermostat, which contains the current temperature and the setpoint and perhaps a few control laws about how temperature is
changed. At the other extreme, effective control may require a very complex model
with a large number of state variables and transitions, such as the model needed to
control air traffic.
Whether the model is embedded in the control logic of an automated controller
or in the mental model maintained by a human controller, it must contain the same
type of information. the required relationship among the system variables .(the
control laws), the current state .(the current values of the system variables), and the
ways the process can change state. This model is used to determine what control
actions are needed, and it is updated through various forms of feedback. If the model
of the room temperature shows that the ambient temperature is less than the setpoint, then the thermostat issues a control command to start a heating element.
Temperature sensors provide feedback about the .(hopefully rising). temperature.
This feedback is used to update the thermostats model of the current room temperature. When the setpoint is reached, the thermostat turns off the heating element.
In the same way, human operators also require accurate process or mental models
to provide safe control actions.
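
The thermostat example can be written out as a few lines of Python to show how the goal, action, observability, and model conditions fit together. This is a minimal sketch under the assumptions stated in the text; the names are illustrative.

# Minimal sketch of the thermostat example: the process model holds only
# the current temperature estimate and the setpoint, and is updated from
# sensor feedback before each control decision. Names are illustrative,
# not drawn from any particular product.

class Thermostat:
    def __init__(self, setpoint: float):
        self.setpoint = setpoint          # goal condition
        self.modeled_temperature = None   # process model: current state estimate

    def update_model(self, measured_temperature: float) -> None:
        # Observability condition: feedback updates the process model.
        self.modeled_temperature = measured_temperature

    def control_action(self) -> str:
        # Action condition: the command is chosen by comparing the model to the goal.
        if self.modeled_temperature is None:
            return "no_action"            # no feedback yet, model not initialized
        if self.modeled_temperature < self.setpoint:
            return "heater_on"
        return "heater_off"
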
Component interaction accidents can usually be explained in terms of incorrect
process models. For example, the Mars Polar Lander software thought the spacecraft
had landed and issued a control instruction to shut down the descent engines. The
captain of the Herald of Free Enterprise thought the ferry doors were closed and
ordered the ship to leave the mooring. The pilots in the Cali Colombia B757 crash
thought R was the symbol denoting the radio beacon near Cali.
In general, accidents often occur, particularly component interaction accidents
and accidents involving complex digital technology or human error, when the
process model used by the controller .(automated or human). does not match the
process and, as a result.
1. Incorrect or unsafe control commands are given
2. Required control actions .(for safety). are not provided
3. Potentially correct control commands are provided at the wrong time .(too
early or too late), or
4. Control is stopped too soon or applied too long.
These four types of inadequate control actions are used in the new hazard analysis technique described in chapter 8.
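
As a minimal sketch of how these four types might be recorded during a hazard analysis, the following Python fragment defines them as an enumeration and captures one illustrative entry based on the Mars Polar Lander example mentioned earlier in the chapter. The data layout is an assumption for illustration, not a prescribed format.

# Minimal sketch of recording the four types of inadequate control
# actions listed above. The record layout is hypothetical.

from dataclasses import dataclass
from enum import Enum

class UnsafeControlActionType(Enum):
    UNSAFE_PROVIDED = "incorrect or unsafe command given"
    NOT_PROVIDED = "required control action not provided"
    WRONG_TIMING = "correct command given too early or too late"
    WRONG_DURATION = "control stopped too soon or applied too long"

@dataclass
class UnsafeControlAction:
    controller: str
    control_action: str
    uca_type: UnsafeControlActionType
    hazard: str

example = UnsafeControlAction(
    controller="descent engine controller",
    control_action="shut down descent engines",
    uca_type=UnsafeControlActionType.WRONG_TIMING,
    hazard="engines shut down before the lander reaches the surface",
)
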
A model of the process being controlled is required not just at the lower physical
levels of the hierarchical control structure, but at all levels. In order to make proper
decisions, the manager of an oil refinery may need to have a model of the current
maintenance level of the safety equipment of the refinery, the state of safety training
of the workforce, and the degree to which safety requirements are being followed
or are effective, among other things. The CEO of the global oil conglomerate has a
much less detailed model of the state of the refineries he controls but at the same
time requires a broader view of the state of safety of all the corporate assets in order
to make appropriate corporate-level decisions impacting safety.
Process models are not only used during operations but also during system development activities. Designers use both models of the system being designed and
models of the development process itself. The developers may have an incorrect
model of the system or software behavior necessary for safety or the physical laws
controlling the system. Safety may also be impacted by developers incorrect models
of the development process itself.
As an example of the latter, a Titan/Centaur satellite launch system, along with
the Milstar satellite it was transporting into orbit, was lost due to a typo in a load
tape used by the computer to determine the attitude change instructions to issue to
the engines. The information on the load tape was essentially part of the process
model used by the attitude control software. The typo was not caught during the
development process partly because of flaws in the developers models of the testing
process.each thought someone else was testing the software using the actual load
tape when, in fact, nobody was .(see appendix B).
In summary, process models play an important role .(1). in understanding why
accidents occur and why humans provide inadequate control over safety-critical
systems and .(2). in designing safer systems.
section 4.4.
STAMP.
The STAMP .(Systems-Theoretic Accident Model and Process). model of accident
causation is built on these three basic concepts.safety constraints, a hierarchical
safety control structure, and process models.along with basic systems theory concepts. All the pieces for a new causation model have been presented. It is now simply
a matter of putting them together.
In STAMP, systems are viewed as interrelated components kept in a state of
dynamic equilibrium by feedback control loops. Systems are not treated as static
but as dynamic processes that are continually adapting to achieve their ends and to
react to changes in themselves and their environment.
Safety is an emergent property of the system that is achieved when appropriate
constraints on the behavior of the system and its components are satisfied. The
original design of the system must not only enforce appropriate constraints on
behavior to ensure safe operation, but the system must continue to enforce the
safety constraints as changes and adaptations to the system design occur over time.
Accidents are the result of flawed processes involving interactions among people,
societal and organizational structures, engineering activities, and physical system
components that lead to violating the system safety constraints. The process leading
up to an accident is described in STAMP in terms of an adaptive feedback function
that fails to maintain safety as system performance changes over time to meet a
complex set of goals and values.
Instead of defining safety management in terms of preventing component
failures, it is defined as creating a safety control structure that will enforce the
behavioral safety constraints and ensure its continued effectiveness as changes
and adaptations occur over time. Effective safety .(and risk). management may
require limiting the types of changes that occur but the goal is to allow as much
flexibility and performance enhancement as possible while enforcing the safety
constraints.
Accidents can be understood, using STAMP, by identifying the safety constraints
that were violated and determining why the controls were inadequate in enforcing
them. For example, understanding the Bhopal accident requires determining not
simply why the maintenance personnel did not insert the slip blind, but also why
the controls that had been designed into the system to prevent the release of hazardous chemicals and to mitigate the consequences of such occurrences.including
maintenance procedures and oversight of maintenance processes, refrigeration units,
gauges and other monitoring units, a vent scrubber, water spouts, a flare tower,
safety audits, alarms and practice alerts, emergency procedures and equipment, and
others.were not successful.
STAMP not only allows consideration of more accident causes than simple component failures, but it also allows more sophisticated analysis of failures and component failure accidents. Component failures may result from inadequate constraints
on the manufacturing process; inadequate engineering design such as missing or
incorrectly implemented fault tolerance; lack of correspondence between individual
component capacity .(including human capacity). and task requirements; unhandled
environmental disturbances .(e.g., electromagnetic interference or EMI); inadequate
maintenance; physical degradation .(wearout); and so on.
Component failures may be prevented by increasing the integrity or resistance
of the component to internal or external influences or by building in safety margins
or safety factors. They may also be avoided by operational controls, such as
operating the component within its design envelope and by periodic inspections and
preventive maintenance. Manufacturing controls can reduce deficiencies or flaws
introduced during the manufacturing process. The effects of physical component
failure on system behavior may be eliminated or reduced by using redundancy. The
important difference from other causality models is that STAMP goes beyond
simply blaming component failure for accidents by requiring that the reasons be
identified for why those failures occurred .(including systemic factors). and led to an
accident, that is, why the controls instituted for preventing such failures or for minimizing their impact on safety were missing or inadequate. And it includes other
types of accident causes, such as component interaction accidents, which are becoming more frequent with the introduction of new technology and new roles for
humans in system control.
STAMP does not lend itself to a simple graphic representation of accident causality .(see figure 4.7). While dominoes, event chains, and holes in Swiss cheese are very
compelling because they are easy to grasp, they oversimplify causality and thus the
approaches used to prevent accidents.
section 4.5.
A General Classification of Accident Causes.
Starting from the basic definitions in STAMP, the general causes of accidents can
be identified using basic systems and control theory. The resulting classification is
useful in accident analysis and accident prevention activities.
Accidents in STAMP are the result of a complex process that results in the system
behavior violating the safety constraints. The safety constraints are enforced by the
control loops between the various levels of the hierarchical control structure that
are in place during design, development, manufacturing, and operations.
Using the STAMP causality model, if there is an accident, one or more of the
following must have occurred.
1. The safety constraints were not enforced by the controller.
a. The control actions necessary to enforce the associated safety constraint at
each level of the sociotechnical control structure for the system were not
provided.
b. The necessary control actions were provided but at the wrong time .(too
early or too late). or stopped too soon.
c. Unsafe control actions were provided that caused a violation of the safety
constraints.
2. Appropriate control actions were provided but not followed.
These same general factors apply at each level of the sociotechnical control structure, but the interpretation .(application). of the factor at each level may differ.
Classification of accident causal factors starts by examining each of the basic
components of a control loop .(see figure 3.2). and determining how their improper
operation may contribute to the general types of inadequate control.
Figure 4.8 shows the classification. The causal factors in accidents can be divided
into three general categories. .(1). the controller operation, .(2). the behavior of actuators and controlled processes, and .(3). communication and coordination among
controllers and decision makers. When humans are involved in the control structure, context and behavior-shaping mechanisms also play an important role in
causality.
4.5.1 Controller Operation
Controller operation has three primary parts. control inputs and other relevant
external information sources, the control algorithms, and the process model. Inadequate, ineffective, or missing control actions necessary to enforce the safety constraints and ensure safety can stem from flaws in each of these parts. For human
controllers and actuators, context is also an important factor.
Unsafe Inputs .(① in figure 4.8).
Each controller in the hierarchical control structure is itself controlled by higher-level controllers. The control actions and other information provided by the higher
level and required for safe behavior may be missing or wrong. Using the Black Hawk
friendly fire example again, the F-15 pilots patrolling the no-fly zone were given
instructions to switch to a non-jammed radio mode for a list of aircraft types that
did not have the ability to interpret jammed broadcasts. Black Hawk helicopters
had not been upgraded with new anti-jamming technology but were omitted from
the list and so could not hear the F-15 radio broadcasts. Other types of missing or
wrong noncontrol inputs may also affect the operation of the controller.
Unsafe Control Algorithms .(② in figure 4.8).
Algorithms in this sense are both the procedures designed by engineers for hardware controllers and the procedures that human controllers use. Control algorithms
may not enforce safety constraints because the algorithms are inadequately designed
originally, the process may change and the algorithms become unsafe, or the control
algorithms may be inadequately modified by maintainers if the algorithms are automated or through various types of natural adaptation if they are implemented by
humans. Human control algorithms are affected by initial training, by the procedures
provided to the operators to follow, and by feedback and experimentation over time
(see figure 2.9).
Time delays are an important consideration in designing control algorithms. Any
control loop includes time lags, such as the time between the measurement of
process parameters and receiving those measurements or between issuing a
command and the time the process state actually changes. For example, pilot
response delays are important time lags that must be considered in designing the
control function for TCAS or other aircraft systems, as are time lags in the controlled process.the aircraft trajectory, for example.caused by aircraft performance limitations.
Delays may not be directly observable, but may need to be inferred. Depending
on where in the feedback loop the delay occurs, different control algorithms are
required to cope with the delays. Dead time and time constants require an algorithm that makes it possible to predict when an action will be needed before the need arises. Feedback delays generate requirements to predict when a prior control action
has taken effect and when resources will be available again. Such requirements may
impose the need for some type of open loop or feedforward strategy to cope with
delays. When time delays are not adequately considered in the control algorithm,
accidents can result.
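
A small simulation can make the point concrete. In the following Python sketch, a bang-bang heater controller acts on temperature feedback that arrives several steps late and therefore overshoots its setpoint. The dynamics and numbers are arbitrary illustrative assumptions.

# Minimal sketch showing how a feedback time lag can defeat a control
# algorithm that ignores it. The controller acts on a temperature reading
# that is several steps old, so it keeps heating past the setpoint.

def simulate(delay_steps: int, setpoint: float = 20.0, steps: int = 40) -> float:
    temperature = 10.0
    history = [temperature] * (delay_steps + 1)   # measurements still in transit
    peak = temperature
    for _ in range(steps):
        observed = history[0]                     # stale feedback
        heater_on = observed < setpoint           # control law ignores the lag
        temperature += 1.0 if heater_on else -0.5
        peak = max(peak, temperature)
        history = history[1:] + [temperature]
    return peak

print("peak temperature with no lag:    ", simulate(0))
print("peak temperature with 4-step lag:", simulate(4))
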
Leplat has noted that many accidents relate to asynchronous evolution,
where one part of a system .(in this case the hierarchical safety control structure)
changes without the related necessary changes in other parts. Changes to subsystems
may be carefully designed, but consideration of their effects on other parts of the
system, including the safety control aspects, may be neglected or inadequate. Asynchronous evolution may also occur when one part of a properly designed system
deteriorates.
In both these cases, the erroneous expectations of users or system components
about the behavior of the changed or degraded subsystem may lead to accidents.
The Ariane 5 trajectory changed from that of the Ariane 4, but the inertial reference
system software was not changed. As a result, an assumption of the inertial reference
software was violated and the spacecraft was lost shortly after launch. One factor
in the loss of contact with SOHO .(SOlar Heliospheric Observatory), a scientific
spacecraft, in 19 98 was the failure to communicate to operators that a functional
change had been made in a procedure to perform gyro spin down. The Black Hawk
friendly fire accident .(analyzed in chapter 5). had several examples of asynchronous
evolution, for example the mission changed and an individual key to communication
between the Air Force and Army left, leaving the safety control structure without
an important component.
Communication is a critical factor here as well as monitoring for changes that
may occur and feeding back this information to the higher-level control. For example,
the safety analysis process that generates constraints always involves some basic
assumptions about the operating environment of the process. When the environment changes such that those assumptions are no longer true, as in the Ariane 5 and
SOHO examples, the controls in place may become inadequate. Embedded pacemakers provide another example. These devices were originally assumed to be used
only in adults, who would lie quietly in the doctors office while the pacemaker was
being “programmed.” Later these devices began to be used in children, and the
assumptions under which the hazard analysis was conducted and the controls were
designed no longer held and needed to be revisited. A requirement for effective
updating of the control algorithms is that the assumptions of the original .(and subsequent). analysis are recorded and retrievable.
Inconsistent, Incomplete, or Incorrect Process Models .(③ in figure 4.8)
Section 4.3 stated that effective control is based on a model of the process state.
Accidents, particularly component interaction accidents, most often result from
inconsistencies between the models of the process used by the controllers .(both
human and automated). and the actual process state. When the controllers model of
the process .(either the human mental model or the software or hardware model)
diverges from the process state, erroneous control commands .(based on the incorrect model). can lead to an accident. for example, .(1). the software does not know that
the plane is on the ground and raises the landing gear, or .(2). the controller .(automated or human). does not identify an object as friendly and shoots a missile at it, or
(3). the pilot thinks the aircraft controls are in speed mode but the computer has
changed the mode to open descent and the pilot behaves inappropriately for that
mode, or .(4). the computer does not think the aircraft has landed and overrides the
pilots attempts to operate the braking system. All of these examples have actually
occurred.
The mental models of the system developers are also important. During software
development, for example, the programmers models of required behavior may not
match the engineers models .(commonly referred to as a software requirements
error), or the software may be executed on computer hardware or may control
physical systems during operations that differ from what was assumed by the programmer and used during testing. The situation becomes even more complicated
when there are multiple controllers .(both human and automated). because each of
their process models must also be kept consistent.
The most common form of inconsistency occurs when one or more process
models is incomplete in terms of not defining appropriate behavior for all possible
process states or all possible disturbances, including unhandled or incorrectly
handled component failures. Of course, no models are complete in the absolute
sense. The goal is to make them complete enough that no safety constraints are
violated when they are used. Criteria for completeness in this sense are presented
in Safeware, and completeness analysis is integrated into the new hazard analysis
method as described in chapter 9.
How does the process model become inconsistent with the actual process state?
The process model designed into the system .(or provided by training if the controller is human). may be wrong from the beginning, there may be missing or incorrect
feedback for updating the process model as the controlled process changes state,
the process model may be updated incorrectly .(an error in the algorithm of the
controller), or time lags may not be accounted for. The result can be uncontrolled
disturbances, unhandled process states, inadvertent commanding of the system into
a hazardous state, unhandled or incorrectly handled controlled process component
failures, and so forth.
Feedback is critically important to the safe operation of the controller. A basic
principle of system theory is that no control system will perform better than its
measuring channel. Feedback may be missing or inadequate because such feedback
is not included in the system design, flaws exist in the monitoring or feedback
communication channel, the feedback is not timely, or the measuring instrument
operates inadequately.
A contributing factor cited in the Cali B757 accident report, for example, was the
omission of the waypoints behind the aircraft from cockpit displays, which contributed to the crew not realizing that the waypoint for which they were searching was
behind them .(missing feedback). The model of the Ariane 501 attitude used by the
attitude control software became inconsistent with the launcher attitude when an
error message sent by the inertial reference system was interpreted by the attitude
control system as data .(incorrect processing of feedback), causing the spacecraft
onboard computer to issue an incorrect and unsafe command to the booster and
main engine nozzles.
Other reasons for the process models to diverge from the true system state may
be more subtle. Information about the process state has to be inferred from measurements. For example, in the TCAS TWO aircraft collision avoidance system, relative
range positions of other aircraft are computed based on round-trip message propagation time. The theoretical control function .(control law). uses the true values of
the controlled variables or component states .(e.g., true aircraft positions). However,
at any time, the controller has only measured values, which may be subject to time
lags or inaccuracies. The controller must use these measured values to infer the true
conditions in the process and, if necessary, to derive corrective actions to maintain
the required process state. In the TCAS example, sensors include on-board devices
such as altimeters that provide measured altitude .(not necessarily true altitude). and
antennas for communicating with other aircraft. The primary TCAS actuator is the
pilot, who may or may not respond to system advisories. The mapping between the
measured or assumed values and the true values can be flawed.
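
The following Python sketch illustrates the general idea of inferring range from round-trip message time and how a small timing error propagates into the process model of the controller. The turnaround delay and the timing error used here are illustrative assumptions, not values taken from the TCAS specification.

# Minimal sketch: relative range is inferred from round-trip message
# propagation time, so the controller holds a measured value, not the
# true one. Turnaround delay and timing error are illustrative only.

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def estimated_range_m(round_trip_s: float, turnaround_s: float) -> float:
    # One-way propagation time is half the round trip once the fixed
    # transponder reply delay has been removed.
    return SPEED_OF_LIGHT_M_PER_S * (round_trip_s - turnaround_s) / 2.0

true_range_m = 10_000.0
turnaround_s = 128e-6     # assumed fixed reply delay
true_round_trip_s = 2.0 * true_range_m / SPEED_OF_LIGHT_M_PER_S + turnaround_s

timing_error_s = 0.2e-6   # assumed clock and measurement error
modeled = estimated_range_m(true_round_trip_s + timing_error_s, turnaround_s)
print(f"true range {true_range_m:.0f} m, modeled range {modeled:.0f} m")
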
To summarize, process models can be incorrect from the beginning.where
correct is defined in terms of consistency with the current process state and with
the models being used by other controllers.or they can become incorrect due to
erroneous or missing feedback or measurement inaccuracies. They may also be
incorrect only for short periods of time due to time lags in the process loop.
4.5.2. Actuators and Controlled Processes .(④ in figure 4.8)
The factors discussed so far have involved inadequate control. The other case occurs
when the control commands maintain the safety constraints, but the controlled
process may not implement these commands. One reason might be a failure or flaw
in the reference channel, that is, in the transmission of control commands. Another
reason might be an actuator or controlled component fault or failure. A third is that
the safety of the controlled process may depend on inputs from other system components, such as power, for the execution of the control actions provided. If these
process inputs are missing or inadequate in some way, the controlled process may
be unable to execute the control commands and accidents may result. Finally, there
may be external disturbances that are not handled by the controller.
In a hierarchical control structure, the actuators and controlled process may
themselves be a controller of a lower-level process. In this case, the flaws in executing the control are the same as those described earlier for a controller.
Once again, these types of flaws do not simply apply to operations or to the
technical system but also to system design and development. For example, a common
flaw in system development is that the safety information gathered or created by
the system safety engineers .(the hazards and the necessary design constraints to
control them). is inadequately communicated to the system designers and testers, or
that flaws exist in the use of this information in the system development process.
section 4.5.3. Coordination and Communication among Controllers and Decision Makers.
When there are multiple controllers .(human and/or automated), control actions
may be inadequately coordinated, including unexpected side effects of decisions
or actions or conflicting control actions. Communication flaws play an important
role here.
Leplat suggests that accidents are most likely in boundary areas or in overlap areas, where two or more controllers .(human or automated). control the same process or processes with common boundaries .(figure 4.9). In both boundary
and overlap areas, the potential exists for ambiguity and for conflicts among
independent decisions.
Responsibility for the control functions in boundary areas is often poorly defined.
For example, Leplat cites an iron and steel plant where frequent accidents occurred
at the boundary of the blast furnace department and the transport department. One
conflict arose when a signal informing transport workers of the state of the blast
furnace did not work and was not repaired because each department was waiting
for the other to fix it. Faverge suggests that such dysfunction can be related to the
number of management levels separating the workers in the departments from a
common manager. The greater the distance, the more difficult the communication,
and thus the greater the uncertainty and risk.
Coordination problems in the control of boundary areas are rife. As mentioned
earlier, a Milstar satellite was lost due to inadequate attitude control of the Titan/
Centaur launch vehicle, which used an incorrect process model based on erroneous
inputs on a software load tape. After the accident, it was discovered that nobody
had tested the software using the actual load tape.each group involved in testing
and assurance had assumed some other group was doing so. In the system development process, system engineering and mission assurance activities were missing or
ineffective, and a common control or management function was quite distant from
the individual development and assurance groups .(see appendix B). One factor
in the loss of the Black Hawk helicopters to friendly fire over northern Iraq was
that the helicopters normally flew only in the boundary areas of the no-fly zone and
procedures for handling aircraft in those areas were ill defined. Another factor was
that an Army base controlled the flights of the Black Hawks, while an Air Force
base controlled all the other components of the airspace. A common control point
once again was high above where the accident occurred in the control structure. In
addition, communication problems existed between the Army and Air Force bases
at the intermediate control levels.
Overlap areas exist when a function is achieved by the cooperation of two controllers or when two controllers exert influence on the same object. Such overlap
creates the potential for conflicting control actions .(dysfunctional interactions
among control actions). Leplat cites a study of the steel industry that found 67
percent of technical incidents with material damage occurred in areas of co-activity,
although these represented only a small percentage of the total activity areas. In an
A320 accident in Bangalore, India, the pilot had disconnected his flight director
during approach and assumed that the copilot would do the same. The result would
have been a mode configuration in which airspeed is automatically controlled by
the autothrottle .(the speed mode), which is the recommended procedure for the
approach phase. However, the copilot had not turned off his flight director, which
meant that open descent mode became active when a lower altitude was selected
instead of speed mode, eventually contributing to the crash of the aircraft short of
the runway. In the Black Hawks shootdown by friendly fire, the aircraft surveillance officer .(A S O). thought she was responsible only for identifying and tracking aircraft south of the 36th Parallel, while the air traffic controller for the area
north of the 36th Parallel thought the A S O was also tracking and identifying aircraft
in his area and acted accordingly.
In 2002, two aircraft collided over southern Germany. An important factor in the
accident was the lack of coordination between the airborne TCAS .(collision avoidance). system and the ground air traffic controller. They each gave different and
conflicting advisories on how to avoid a collision. If both pilots had followed one
or the other, the loss would have been avoided, but one followed the TCAS advisory
and the other followed the ground air traffic control advisory.
section 4.5.4. Context and Environment.
Flawed human decision making can result from incorrect information and inaccurate process models, as described earlier. But human behavior is also greatly
impacted by the context and environment in which the human is working. These
factors have been called “behavior shaping mechanisms.” While value systems and
other influences on decision making can be considered to be inputs to the controller,
describing them in this way oversimplifies their role and origin. A classification of
the contextual and behavior-shaping mechanisms is premature at this point, but
relevant principles and heuristics are elucidated throughout the rest of the book.
section 4.6.
Applying the New Model.
To summarize, STAMP focuses particular attention on the role of constraints in
safety management. Accidents are seen as resulting from inadequate control or
enforcement of constraints on safety-related behavior at each level of the system
development and system operations control structures. Accidents can be understood
in terms of why the controls that were in place did not prevent or detect maladaptive changes.
Accident causal analysis based on STAMP starts with identifying the safety constraints that were violated and then determines why the controls designed to enforce
the safety constraints were inadequate or, if they were potentially adequate, why
the system was unable to exert appropriate control over their enforcement.
In this conception of safety, there is no “root cause.” Instead, the accident “cause”
consists of an inadequate safety control structure that under some circumstances
leads to the violation of a behavioral safety constraint. Preventing future accidents
requires reengineering or designing the safety control structure to be more effective.
Because the safety control structure and the behavior of the individuals in it, like
any physical or social system, change over time, accidents must be viewed as
dynamic processes. Looking only at the time of the proximal loss events distorts and
omits from view the most important aspects of the larger accident process that are
needed to prevent reoccurrences of losses from the same causes in the future.
Without that view, we see and fix only the symptoms, that is, the results of the flawed
processes and inadequate safety control structure without getting to the sources of
those symptoms.
To understand the dynamic aspects of accidents, the process leading to the loss
can be viewed as an adaptive feedback function where the safety control system
performance degrades over time as the system attempts to meet a complex set of
goals and values. Adaptation is critical in understanding accidents, and the adaptive
feedback mechanism inherent in the model allows a STAMP analysis to incorporate
adaptation as a fundamental system property.
We have found in practice that using this model helps us to separate factual
data from the interpretations of that data. While the events and physical data
involved in accidents may be clear, their importance and the explanations for why
the factors were present are often subjective as is the selection of the events to
consider.
STAMP models are also more complete than most accident reports and other
models. Each of the explanations for the incorrect
FMS input of R in the Cali American Airlines accident described in chapter 2, for
example, appears in the STAMP analysis of that accident at the appropriate levels
of the control structure where they operated. The use of STAMP helps not only to
identify the factors but also to understand the relationships among them.
While STAMP models will probably not be useful in lawsuits, as they do not
assign blame for the accident to a specific person or group, they do provide more
help in understanding accidents by forcing examination of each part of the sociotechnical system to see how it contributed to the loss.and there will usually be
contributions at each level. Such understanding should help in learning how to
engineer safer systems, including the technical, managerial, organizational, and regulatory aspects.
To accomplish this goal, a framework for classifying the factors that lead to accidents was derived from the basic underlying conceptual accident model .(see figure
4.8). This classification can be used in identifying the factors involved in a particular
accident and in understanding their role in the process leading to the loss. The accident investigation after the Black Hawk shootdown .(analyzed in detail in the next
chapter). identified 130 different factors involved in the accident. In the end, only
the AWACS senior director was court-martialed, and he was acquitted. The more
one knows about an accident process, the more difficult it is to find one person or
part of the system responsible, but the easier it is to find effective ways to prevent
similar occurrences in the future.
STAMP is useful not only in analyzing accidents that have occurred but in developing new and potentially more effective system engineering methodologies to
prevent accidents. Hazard analysis can be thought of as investigating an accident
before it occurs. Traditional hazard analysis techniques, such as fault tree analysis
and various types of failure analysis techniques, do not work well for very complex
systems, for software errors, human errors, and system design errors. Nor do they
usually include organizational and management flaws. The problem is that these
hazard analysis techniques are limited by a focus on failure events and the role of
component failures in accidents; they do not account for component interaction
accidents, the complex roles that software and humans are assuming in high-tech
systems, the organizational factors in accidents, and the indirect relationships
between events and actions required to understand why accidents occur.
STAMP provides a direction to take in creating these new hazard analysis and
prevention techniques. Because in a system accident model everything starts from
constraints, the new approach focuses on identifying the constraints required to
maintain safety; identifying the flaws in the control structure that can lead to an
accident .(inadequate enforcement of the safety constraints); and then designing
a control structure, physical system, and operating conditions that enforce the
constraints.
Such hazard analysis techniques augment the typical failure-based design focus
and encourage a wider variety of risk reduction measures than simply adding redundancy and overdesign to deal with component failures. The new techniques also
provide a way to implement safety-guided design so that safety analysis guides the
design generation rather than waiting until a design is complete to discover it is
unsafe. Part THREE describes ways to use techniques based on STAMP to prevent accidents through system design, including design of the operating conditions and the
safety management control structure.
STAMP can also be used to improve performance analysis. Performance monitoring of complex systems has created some dilemmas. Computers allow the collection
of massive amounts of data, but analyzing that data to determine whether the system
is moving toward the boundaries of safe behavior is difficult. The use of an accident
model based on system theory and the basic concept of safety constraints may
provide directions for identifying appropriate safety metrics and leading indicators;
determining whether control over the safety constraints is adequate; evaluating the
assumptions about the technical failures and potential design errors, organizational
structure, and human behavior underlying the hazard analysis; detecting errors in
the operational and environmental assumptions underlying the design and the organizational culture; and identifying any maladaptive changes over time that could
increase risk of accidents to unacceptable levels.
Finally, STAMP points the way to very different approaches to risk assessment.
Currently, risk assessment is firmly rooted in the probabilistic analysis of failure
events. Attempts to extend current P R A techniques to software and other new
technology, to management, and to cognitively complex human control activities
have been disappointing. This way forward may lead to a dead end. Significant
progress in risk assessment for complex systems will require innovative approaches
starting from a completely different theoretical foundation.
1425
chapter05.raw Normal file
File diff suppressed because it is too large
@ -1,6 +1,47 @@
: .
— .
\[.+\]
-\n
HMO H M O
MIC M I C
DC-10 D C 10.
19(\d\d) 19 $1
200(\d) 2 thousand $1
20(\d\d) 20 $1
\( .(
\) ).
III 3
II 2
IV 4
ASO A S O
PRA P R A
HMO H M O
MIC M I C
DC-10 D C 10
OPC O P C
TAOR T A O R
AAI A A I
ACO A C O
AFB A F B
AI A I
ATO A T O
BH B H
BSD B S D
CTF C T F
CFAC C FACK
DO D O
GAO GAOW
HQ-II H Q-2
IFF I F F
JOIC J O I C
JSOC J SOCK
JTIDS J tides
MCC M C C
MD M D
NCA N C A
NFZ N F Z
OPC O P C
ROE R O E
SD S D
SITREP SIT Rep
TACSAT Tack sat
TAOR T A O R
USCINCEUR U S C in E U R
WD W D