chapter 8.
STPA: A New Hazard Analysis Technique.
Hazard analysis can be described as “investigating an accident before it occurs.” The
goal is to identify potential causes of accidents, that is, scenarios that can lead
to losses, so they can be eliminated or controlled in design or operations before
damage occurs.
The most widely used existing hazard analysis techniques were developed fifty
years ago and have serious limitations in their applicability to today's more complex,
software-intensive, sociotechnical systems. This chapter describes a new approach
to hazard analysis, based on the STAMP causality model, called STPA (System-
Theoretic Process Analysis).
section 8.1.
Goals for a New Hazard Analysis Technique.
Three hazard analysis techniques are currently used widely: Fault Tree Analysis,
Event Tree Analysis, and HAZOP. Variants that combine aspects of these three
techniques, such as Cause-Consequence Analysis (combining top-down fault trees
and forward analysis Event Trees) and Bowtie Analysis (combining forward and
backward chaining techniques) are also sometimes used. Safeware and other basic
textbooks contain more information about these techniques for those unfamiliar
with them. FMEA (Failure Modes and Effects Analysis) is sometimes used as a
hazard analysis technique, but it is a bottom-up reliability analysis technique and
has very limited applicability for safety analysis.
The primary reason for developing STPA was to include the new causal factors
identified in STAMP that are not handled by the older techniques. More specifically,
the hazard analysis technique should include design errors, including software flaws;
component interaction accidents; cognitively complex human decision-making
errors; and social, organizational, and management factors contributing to accidents.
In short, the goal is to identify accident scenarios that encompass the entire accident
process, not just the electromechanical components. While attempts have been
made to add new features to traditional hazard analysis techniques to handle new
technology, these attempts have had limited success because the underlying assump-
tions of the old techniques and the causality models on which they are based do not
fit the characteristics of these new causal factors. STPA is based on the new causality
assumptions identified in chapter 2.
An additional goal in the design of STPA was to provide guidance to the users
in getting good results. Fault tree and event tree analysis provide little guidance to
the analyst—the tree itself is simply the result of the analysis. Both the model of the
system being used by the analyst and the analysis itself are only in the analyst's
head. Analyst expertise in using these techniques is crucial, and the quality of the
fault or event trees that result varies greatly.
HAZOP, widely used in the process industries, provides much more guidance to
the analysts. HAZOP is based on a slightly different accident model than fault and
event trees, namely that accidents result from deviations in system parameters, such
as too much flow through a pipe or backflow when forward flow is required.
HAZOP uses a set of guidewords to examine each part of a plant piping and wiring
diagram, such as more than, less than, and opposite. Both guidance in performing
the process and a concrete model of the physical structure of the plant are therefore
available.
Like HAZOP, STPA works on a model of the system and has “guidewords” to
assist in the analysis, but because in STAMP accidents are seen as resulting from
inadequate control, the model used is a functional control diagram rather than a
physical component diagram. In addition, the set of guidewords is based on lack of
control rather than physical parameter deviations. While engineering expertise is
still required, guidance is provided for the STPA process to provide some assurance
of completeness in the analysis.
The third and final goal for STPA is that it can be used before a design has been
created, that is, it provides the information necessary to guide the design process,
rather than requiring a design to exist before the analysis can start. Designing
safety into a system, starting in the earliest conceptual design phases, is the most
cost-effective way to engineer safer systems. The analysis technique must also, of
course, be applicable to existing designs or systems when safety-guided design is
not possible.
section 8.2.
The STPA Process.
STPA (System-Theoretic Process Analysis) can be used at any stage of the system
life cycle. It has the same general goals as any hazard analysis technique: accumulat-
ing information about how the behavioral safety constraints, which are derived
from the system hazards, can be violated. Depending on when it is used, it provides
the information and documentation necessary to ensure the safety constraints are
enforced in system design, development, manufacturing, and operations, including
the natural changes in these processes that will occur over time.
STPA uses a functional control diagram and the requirements, system hazards,
and the safety constraints and safety requirements for the component as defined in
chapter 7. When STPA is applied to an existing design, this information is available
when the analysis process begins. When STPA is used for safety-guided design, only
the system-level requirements and constraints may be available at the beginning
of the process. In the latter case, these requirements and constraints are refined
and traced to individual system components as the iterative design and analysis
process proceeds.
STPA has two main steps:
1. Identify the potential for inadequate control of the system that could lead to
a hazardous state. Hazardous states result from inadequate control or enforce-
ment of the safety constraints, which can occur because:
a. A control action required for safety is not provided or not followed.
b. An unsafe control action is provided.
c. A potentially safe control action is provided too early or too late, that is, at
the wrong time or in the wrong sequence.
d. A control action required for safety is stopped too soon or applied too long.
2. Determine how each potentially hazardous control action identified in step 1
could occur.
a. For each unsafe control action, examine the parts of the control loop to see
if they could cause it. Design controls and mitigation measures if they do not
already exist or evaluate existing measures if the analysis is being performed
on an existing design. For multiple controllers of the same component or
safety constraint, identify conflicts and potential coordination problems.
b. Consider how the designed controls could degrade over time and build in
protection, including
b.1. Management of change procedures to ensure safety constraints are
enforced in planned changes.
b.2. Performance audits where the assumptions underlying the hazard analy-
sis are the preconditions for the operational audits and controls so that
unplanned changes that violate the safety constraints can be detected.
b.3. Accident and incident analysis to trace anomalies to the hazards and to
the system design.
While the analysis can be performed in one step, dividing the process into
discrete steps reduces the analytical burden on the safety engineers and provides a
structured process for hazard analysis. The information from the first step (identify-
ing the unsafe control actions) is required to perform the second step (identifying
the causes of the unsafe control actions).
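Although this chapter documents the analysis with tables and diagrams, the step 1 results can also be captured in a simple data structure. The following Python sketch is illustrative only, with assumed class and field names; it encodes the four categories of hazardous control actions listed above, together with the context that makes an action hazardous and the hazard it can lead to.

from dataclasses import dataclass
from enum import Enum, auto

# The four categories of hazardous control actions from step 1 of STPA.
class UCACategory(Enum):
    NOT_PROVIDED = auto()                  # required control action not provided or not followed
    UNSAFE_PROVIDED = auto()               # an unsafe control action is provided
    WRONG_TIMING_OR_ORDER = auto()         # too early, too late, or out of sequence
    STOPPED_TOO_SOON_OR_TOO_LONG = auto()  # applies to continuous control actions

@dataclass
class UnsafeControlAction:
    controller: str       # the controller issuing the action
    control_action: str   # the action itself
    category: UCACategory
    context: str          # the conditions under which the action is hazardous
    hazard: str           # the system hazard the action can lead to

# Example entry, anticipating the interlock example introduced below:
example = UnsafeControlAction(
    controller="power controller",
    control_action="power off",
    category=UCACategory.NOT_PROVIDED,
    context="door is opened",
    hazard="exposure to a high-energy source",
)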
The assumption in this chapter is that the system design exists when STPA
is performed. The next chapter describes safety-guided design using STPA and
principles for safe design of control systems.
STPA is defined in this chapter using two examples. The first is a simple, generic
interlock. The hazard involved is exposure of a human to a potentially dangerous
energy source, such as high power. The power controller, which is responsible for
turning the energy on or off, implements an interlock to prevent the hazard. In the
physical controlled system, a door or barrier over the power source prevents expo-
sure while it is active. To simplify the example, we will assume that humans cannot
physically be inside the area when the barrier is in place—that is, the barrier is
simply a cover over the energy source. The door or cover will be manually operated
so the only function of the automated controller is to turn the power off when the
door is opened and to turn it back on when the door is closed.
Given this design, the process starts from:
Hazard: Exposure to a high-energy source.
Constraint: The energy source must be off when the door is not closed.
Figure 8.1 shows the control structure for this simple system. In this figure, the
components of the system are shown along with the control instructions each com-
ponent can provide and some potential feedback and other information or control
sources for each component. Control operations by the automated controller include
turning the power off and turning it on. The human operator can open and close
the door. Feedback to the automated controller includes an indication of whether
the door is open or not. Other feedback may be required or useful as determined
during the STPA (hazard analysis) process.
The control structure for a second more complex example to be used later in the
chapter, a fictional but realistic ballistic missile intercept system (FMIS), is shown
in figure 8.2. Pereira, Lee, and Howard [154] created this example to describe their
use of STPA to assess the risk of inadvertent launch in the U.S. Ballistic Missile
Defense System (BMDS) before its first deployment and field test.
The BMDS is a layered defense to defeat all ranges of threats in all phases of
flight (boost, midcourse, and terminal). The example used in this chapter is, for
security reasons, changed from the real system, but it is realistic, and the problems
identified by STPA in this chapter are similar to some that were found using STPA
on the real system.
The U.S. BMDS has a variety of components, including sea-based sensors
in the Aegis shipborne platform; upgraded early warning systems; new and upgraded
radars, ground-based midcourse defense, fire control, and communications; a
Command and Control Battle Management and Communications component;
and ground-based interceptors. Future upgrades will add features. Some parts
of the system have been omitted in the example, such as the Aegis (ship-based)
platform.
Figure 8.2 shows the control structure for the FMIS components included in the
example. The command authority controls the operators by providing such things
as doctrine, engagement criteria, and training. As feedback, the command authority
gets the exercise results, readiness information, wargame results, and other informa-
tion. The operators are responsible for controlling the launch of interceptors by
sending instructions to the fire control subsystem and receiving status information
as feedback.
Fire control receives instructions from the operators and information from the
radars about any current threats. Using these inputs, fire control provides instruc-
tions to the launch station, which actually controls the launch of any interceptors.
Fire control can enable firing, disable firing, and so forth, and, of course, it receives
feedback from the launch station about the status of any previously provided
control actions and the state of the system itself. The launch station controls the
actual launcher and the flight computer, which in turn controls the interceptor
hardware.
There is one other component of the system. To ensure operational readiness, the
FMIS contains an interceptor simulator that periodically is used to mimic the flight
computer in order to detect a failure in the system.
footnote. The phrase “when the door is open” would be incorrect because a case is missing (a common problem):
in the power controller's model of the controlled process, which enforces the constraint, the door may
be open, closed, or the door position may be unknown to the controller. The phrase “is open or the door
position is unknown” could be used instead. See section 9.3.2 for a discussion of why the difference is
important.
section 8.3.
Identifying Potentially Hazardous Control Actions (Step 1)
Starting from the fundamentals defined in chapter 7, the first step in STPA is to
assess the safety controls provided in the system design to determine the potential
for inadequate control, leading to a hazard. The assessment of the hazard controls
uses the fact that control actions can be hazardous in four ways (as noted earlier):
1. A control action required for safety is not provided or is not followed.
2. An unsafe control action is provided that leads to a hazard.
3. A potentially safe control action is provided too late, too early, or out of
sequence.
4. A safe control action is stopped too soon or applied too long (for a continuous
or nondiscrete control action).
For convenience, a table can be used to record the results of this part of the analysis.
Other ways to record the information are also possible. In a classic System Safety
program, the information would be included in the hazard log. Figure 8.3 shows the
results of step 1 for the simple interlock example. The table contains four hazardous
types of behavior:
1. A power off command is not given when the door is opened;
2. The door is opened and the controller waits too long to turn the power off;
3. A power on command is given while the door is open; and
4. A power on command is provided too early (when the door has not yet fully closed).
Incorrect but non-hazardous behavior is not included in the table. For example,
not providing a power on command when the power is off and the door is opened
or closed is not hazardous, although it may represent a quality-assurance problem.
Another example of a mission assurance problem but not a hazard occurs when the
power is turned off while the door is closed. Thomas has created a procedure to
assist the analyst in considering the effect of all possible combinations of environ-
mental and process variables for each control action in order to avoid missing any
cases that should be included in the table [199a].
The final column of the table, Stopped Too Soon or Applied Too Long, is not
applicable to the discrete interlock commands. An example where it does apply is
in an aircraft collision avoidance system where the pilot may be told to climb or
descend to avoid another aircraft. If the climb or descend control action is stopped
too soon, the collision may not be avoided.
The identified hazardous behaviors can now be translated into safety constraints
(requirements) on the system component behavior. For this example, four con-
straints must be enforced by the power controller (interlock):
1. The power must always be off when the door is open;
2. A power off command must be provided within x milliseconds after the door
is opened;
3. A power on command must never be issued when the door is open;
4. The power on command must never be given until the door is fully closed.
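As an illustration only (the chapter does not prescribe an implementation), the following sketch shows one way the power controller's control algorithm could enforce these four constraints. It uses the three-valued door model from the footnote in section 8.2: the controller's model of the door may be open, closed, or unknown, and power is commanded on only when the door is known to be fully closed.

from enum import Enum, auto

# The controller's model of the door position (see the footnote in
# section 8.2): open, closed, or unknown to the controller.
class DoorModel(Enum):
    OPEN = auto()
    CLOSED = auto()
    UNKNOWN = auto()

def select_power_command(door: DoorModel, power_is_on: bool):
    """Return 'power off', 'power on', or None (no command needed)."""
    # Constraints 1 and 2: the power must be off whenever the door is not
    # known to be closed; the off command must be issued promptly (the
    # x-millisecond deadline depends on how often this logic runs).
    if door is not DoorModel.CLOSED and power_is_on:
        return "power off"
    # Constraints 3 and 4: never command power on unless the door is
    # known to be fully closed.
    if door is DoorModel.CLOSED and not power_is_on:
        return "power on"
    return None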
For more complex examples, the mode in which the system is operating may deter-
mine the safety of the action or event. In that case, the operating mode may need
to be included in the table, perhaps as an additional column. For example, some
spacecraft mission control actions may only be hazardous during the launch or
reentry phase of the mission.
In chapter 2, it was stated that many accidents, particularly component interac-
tion accidents, stem from incomplete requirements specifications. Examples were
provided such as missing constraints on the order of valve position changes in a
batch chemical reactor and the conditions under which the descent engines should
be shut down on the Mars Polar Lander spacecraft. The information provided
in this first step of STPA can be used to identify the necessary constraints on com-
ponent behavior to prevent the identified system hazards, that is, the safety require-
ments. In the second step of STPA, the information required by the component to
properly implement the constraint is identified as well as additional safety con-
straints and information necessary to eliminate or control the hazards in the design
or to design the system properly in the first place.
The FMIS system provides a less trivial example of step 1. Remember, the hazard
is inadvertent launch. Consider the fire enable command, which can be sent by the
fire control module to the launch station to allow launch commands subsequently
received by the launch station to be executed. As described in Pereira, Lee, and
Howard [154], the fire enable control command directs the launch station to enable
the live fire of interceptors. Prior to receiving this command, the launch station will
return an error message when it receives commands to fire an interceptor and will
discard the fire commands.2
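A minimal sketch of the launch station behavior just described, with illustrative names not taken from the real system: fire commands are rejected with an error and discarded until a fire enable command has been received.

class LaunchStation:
    """Illustrative only: command handling before and after fire enable."""

    def __init__(self):
        self.fire_enabled = False

    def handle_fire_enable(self):
        self.fire_enabled = True      # live fire of interceptors is now allowed

    def handle_fire_disable(self):
        self.fire_enabled = False

    def handle_fire(self, interceptor_tasking):
        if not self.fire_enabled:
            # Prior to fire enable, fire commands are discarded with an error.
            return "error: fire command discarded"
        return f"launch sequence started for {interceptor_tasking}"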
Figure 8.4 shows the results of performing STPA Step 1 on the fire enable
command. If this command is missing (column 2), a launch will not take place. While
this omission might potentially be a mission assurance concern, it does not contrib-
ute to the hazard being analyzed (inadvertent launch).
If the fire enable command is provided to a launch station incorrectly, the launch
station will transition to a state where it accepts interceptor tasking and can progress
through a launch sequence. In combination with other incorrect or mistimed com-
mands, this control action could contribute to an inadvertent launch.
A late fire enable command will only delay the launch station's ability to
process a launch sequence, which will not contribute to an inadvertent launch. A
fire enable command sent too early could open a window of opportunity for
inadvertently progressing toward an inadvertent launch, similar to the incorrect
fire enable considered above. In the third case, a fire enable command might
be out of sequence with a fire disable command. If this incorrect sequencing is
possible in the system as designed and constructed, the system could be left
capable of processing interceptor tasking and launching an interceptor when not
intended.
Finally, the fire enable command is a discrete command sent to the launch
station to signal that it should allow processing of interceptor tasking. Because
fire enable is not a continuous command, the “stopped too soon” category does
not apply.
footnote. Section 9.4.4 explains the safety-related reasons for breaking up potentially hazardous actions into
multiple steps.
section 8.4.
Determining How Unsafe Control Actions Could Occur. (Step 2)
Performing the first step of STPA provides the component safety requirements,
which may be sufficient for some systems. A second step can be performed, however,
to identify the scenarios leading to the hazardous control actions that violate the
component safety constraints. Once the potential causes have been identified, the
design can be checked to ensure that the identified scenarios have been eliminated
or controlled in some way. If not, then the design needs to be changed. If the design
does not already exist, then the designers at this point can try to eliminate or control
the behaviors as the design is created, that is, use safety-guided design as described
in the next chapter.
Why is the second step needed? While providing the engineers with the safety
constraints to be enforced is necessary, it is not sufficient. Consider the chemical
batch reactor described in section 2.1. The hazard is overheating of the reactor
contents. At the system level, the engineers may decide (as in this design) to use
water and a reflux condenser to control the temperature. After this decision is made,
controls need to be enforced on the valves controlling the flow of catalyst and water.
Applying step 1 of STPA determines that opening the valves out of sequence is
dangerous, and the software requirements would accordingly be augmented with
constraints on the order of the valve opening and closing instructions, namely that
the water valve must be opened before the catalyst valve and the catalyst valve must
be closed before the water valve is closed or, more generally, that the water valve
must always be open when the catalyst valve is opened. If the software already exists,
the hazard analysis would ensure that this ordering of commands has been enforced
in the software. Clearly, building the software to enforce this ordering is a great deal
easier than proving the ordering is true after the software already exists.
But enforcing these safety constraints is not enough to ensure safe software
behavior. Suppose the software has commanded the water valve to open but some-
thing goes wrong and the valve does not actually open or it opens but water flow
is restricted in some way (the no flow guideword in HAZOP). Feedback is needed
for the software to determine if water is flowing through the pipes and the software
needs to check this feedback before opening the catalyst valve. The second step of
STPA is used to identify the ways that the software safety constraint, even if pro-
vided to the software engineers, might still not be enforced by the software logic
and system design. In essence, step 2 identifies the scenarios or paths to a hazard
found in a classic hazard analysis. This step is the usual “magic” one that creates the
contents of a fault tree, for example. The difference is that guidance is provided to
help create the scenarios and more than just failures are considered.
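The batch reactor feedback check described above can be sketched as follows, assuming hypothetical valve and flow-sensor interfaces and an arbitrary timeout: the catalyst valve is opened only after feedback confirms that water is actually flowing, not merely that the open command was sent.

import time

def open_catalyst_safely(water_valve, catalyst_valve, flow_sensor, timeout_s=5.0):
    """Open the catalyst valve only after water flow is confirmed by feedback."""
    water_valve.open()
    # Do not trust the command alone: the valve may fail to open, or flow may
    # be restricted (the HAZOP "no flow" deviation). Wait for sensor feedback.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if flow_sensor.water_flowing():
            catalyst_valve.open()
            return True
        time.sleep(0.1)
    # Flow was never confirmed: leave the catalyst valve closed and let the
    # caller raise an alarm or move the process to a safe state.
    return False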
To create causal scenarios, the control structure diagram must include the process
models for each component. If the system exists, then the content of these models
should be easily determined by looking at the system functional design and its docu-
mentation. If the system does not yet exist, the analysis can start with a best guess
and then be refined and changed as the analysis process proceeds.
For the high-power interlock example, the process model is simple and is shown in
figure 8.5. The general causal factors, shown in figure 4.8 and repeated here in figure
8.6 for convenience, are used to identify the scenarios.
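The process model itself can be pictured as a small set of state variables updated from feedback. The sketch below is illustrative only: the controller's belief about the door starts as unknown and reverts to unknown when feedback is missing, which is one of the ways the process model can come to disagree with the actual process in the step 2 analysis that follows.

from enum import Enum, auto

class DoorBelief(Enum):
    OPEN = auto()
    CLOSED = auto()
    UNKNOWN = auto()

class InterlockProcessModel:
    """Illustrative process model for the power controller (compare figure 8.5)."""

    def __init__(self):
        self.door = DoorBelief.UNKNOWN   # unknown until feedback arrives
        self.power_on = False            # the controller's belief, not necessarily reality

    def update_from_door_sensor(self, reading):
        # reading is "open", "closed", or None when feedback is lost or the
        # sensor fails (causal factors examined in step 2).
        if reading is None:
            self.door = DoorBelief.UNKNOWN
        elif reading == "open":
            self.door = DoorBelief.OPEN
        else:
            self.door = DoorBelief.CLOSED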
section 8.4.1. Identifying Causal Scenarios.
Starting with each hazardous control action identified in step 1, the analysis in step
2 involves identifying how it could happen. To gather information about how the
hazard could occur, the parts of the control loop for each of the hazardous control
actions identified in step 1 are examined to determine if they could cause or con-
tribute to it. Once the potential causes are identified, the engineers can design
controls and mitigation measures if they do not already exist or evaluate existing
measures if the analysis is being performed on an existing design.
Each potentially hazardous control action must be considered. As an example,
consider the unsafe control action of not turning off the power when the door is
opened. Figure 8.7 shows the results of the causal analysis in a graphical form. Other
ways of documenting the results are, of course, possible.
The hazard in figure 8.7 is that the door is open but the power is not turned off.
Looking first at the controller itself, the hazard could result if the requirement is
not passed to the developers of the controller, the requirement is not implemented
correctly, or the process model incorrectly shows the door closed and/or the power
off when that is not true. Working around the loop, the causal factors for each of
the loop components are similarly identified using the general causal factors shown
in figure 8.6. These causes include that the power off command is sent but not
received by the actuator, the actuator received the command but does not imple-
ment it (actuator failure), the actuator delays in implementing the command, the
power on and power off commands are received or executed in the wrong order,
the door open event is not detected by the door sensor or there is an unacceptable
delay in detecting it, the sensor fails or provides spurious feedback, and the feedback
about the state of the door or the power is not received by the controller or is not
incorporated correctly into the process model.
More detailed causal analysis can be performed if a specific design is being con-
sidered. For example, the features of the communication channels used will deter-
mine the potential way that commands or feedback could be lost or delayed.
Once the causal analysis is completed, each of the causes that cannot be shown
to be physically impossible must be checked to determine whether they are
adequately handled in the design (if the design exists) or design features added to
control them if the design is being developed with support from the analysis.
The first step in designing for safety is to try to eliminate the hazard completely.
In this example, the hazard can be eliminated by redesigning the system to have the
circuit run through the door in such a way that the circuit is broken as soon as the
door opens. Let's assume, however, that for some reason this design alternative is
rejected, perhaps as impractical. Design precedence then suggests that the next best
alternatives in order are to reduce the likelihood of the hazard occurring, to prevent
the hazard from leading to a loss, and finally to minimize damage. More about safe
design can be found in chapters 16 and 17 of Safeware and chapter 9 of this book.
Because design almost always involves tradeoffs with respect to achieving mul-
tiple objectives, the designers may have good reasons not to select the most effective
way to control the hazard but one of the other alternatives instead. It is important
that the rationale behind the choice is documented for future analysis, certification,
reuse, maintenance, upgrades, and other activities.
For this simple example, one way to mitigate many of the causes is to add a light
that identifies whether the power supply is on or off. How do human operators know
that the power has been turned off before inserting their hands into the high-energy
power source? In the original design, they will most likely assume it is off because
they have opened the door, which may be an incorrect assumption. Additional
feedback and assurance can be attained from the light. In fact, protection systems
in automated factories commonly are designed to provide humans in the vicinity
with aural or visual information that they have been detected by the protection
system. Of course, once a change has been made, such as adding a light, that change
must then be analyzed for new hazards or causal scenarios. For example, a light bulb
can burn out. The design might ensure that the safe state (the power is off) is rep-
resented by the light being on rather than the light being off, or two colors might
be used. Every solution for a safety problem usually has its own drawbacks and
limitations and therefore they will need to be compared and decisions made about
the best design given the particular situation involved.
In addition to the factors shown in figure 8.6, the analysis must consider the
impact of having two controllers of the same component whenever this occurs in
the system safety control structure. In the friendly fire example in chapter 5, for
example, confusion existed between the two AWACS operators responsible for
tracking aircraft inside and outside of the no-fly-zone about who was responsible
for aircraft in the boundary area between the two. The FMIS example below con-
tains such a scenario. An analysis must be made to determine that no path to a
hazard exists because of coordination problems.
The FMIS system provides a more complex example of STPA step 2. Consider
the fire enable command provided by fire control to the launch station. In step 1,
it was determined that if this command is provided incorrectly, the launch station
will transition to a state where it accepts interceptor tasking and can progress
through a launch sequence. In combination with other incorrect or mistimed control
actions, this incorrect command could contribute to an inadvertent launch.
The following are two examples of causal factors identified using STPA step 2 as
potentially leading to the hazardous state (violation of the safety constraint). Neither
of these examples involves component failures, but both instead result from unsafe
component interactions and other more complex causes that are for the most part
not identifiable by current hazard analysis methods.
In the first example, the fire enable command can be sent inadvertently due to
a missing case in the requirements—a common occurrence in accidents where soft-
ware is involved.
The fire enable command is sent when the fire control receives a weapons free
command from the operators and the fire control system has at least one active
track. An active track indicates that the radars have detected something that might
be an incoming missile. Three criteria are specified for declaring a track inactive:
(1) a given period passes with no radar input, (2) the total predicted impact time
elapses for the track, and (3) an intercept is confirmed. Operators are allowed to
deselect any of these options. One case was not considered by the designers: if an
operator deselects all of the options, no tracks will be marked as inactive. Under
these conditions, the inadvertent entry of a weapons free command would send the
fire enable command to the launch station immediately, even if there were no
threats currently being tracked by the system.
Once this potential cause is identified, the solution is obvious—fix the software
requirements and the software design to include the missing case. While the opera-
tor might instead be warned not to deselect all the options, this kind of human error
is possible and the software should be able to handle the error safely. Depending
on humans not to make mistakes is an almost certain way to guarantee that acci-
dents will happen.
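The missing case can be made concrete with a short sketch (names are illustrative, not taken from the real system). Each of the three inactivity criteria can be deselected by the operator; with all three deselected, no track is ever marked inactive, so a stray weapons free entry immediately satisfies the "at least one active track" condition.

def track_is_inactive(track, enabled_criteria):
    """enabled_criteria is any subset of the three deselectable checks."""
    if "no_radar_input" in enabled_criteria and track.time_since_radar > track.no_input_limit:
        return True
    if "impact_time_elapsed" in enabled_criteria and track.elapsed > track.predicted_impact_time:
        return True
    if "intercept_confirmed" in enabled_criteria and track.intercept_confirmed:
        return True
    return False  # with enabled_criteria empty, every track stays "active" forever

def should_send_fire_enable(weapons_free, tracks, enabled_criteria):
    active_tracks = [t for t in tracks if not track_is_inactive(t, enabled_criteria)]
    # The flaw: stale tracks count as active when all criteria are deselected.
    # One fix is to treat the empty criteria set as an explicit, separately
    # handled case rather than letting it fall through.
    return weapons_free and len(active_tracks) > 0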
The second example involves confusion between the regular and the test soft-
ware. The FMIS undergoes periodic system operability testing using an interceptor
simulator that mimics the interceptor flight computer. The original hazard analysis
had identified the possibility that commands intended for test activities could be
sent to the operational system. As a result, the system status information provided
by the launch station includes whether the launch station is connected only to
missile simulators or to any live interceptors. If the fire control computer detects a
change in this state, it will warn the operator and offer to reset into a matching state.
There is, however, a small window of time before the launch station notifies the fire
control component of the change. During this time interval, the fire control software
could send a fire enable command intended for test to the live launch station. This
latter example is a coordination problem arising because there are multiple control-
lers of the launch station and two operating modes (e.g., testing and live fire). A
potential mode confusion problem exists where the launch station can think it is in
one mode but really be in the other one. Several different design changes could be
used to prevent this hazardous state.
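One of those possible design changes, sketched below with hypothetical names, is to tag every command with the mode it was intended for so that the launch station itself rejects a test command that arrives while it is connected to live interceptors, closing the timing window described above.

from dataclasses import dataclass

@dataclass
class Command:
    name: str            # e.g., "fire enable"
    intended_mode: str   # "test" or "live"

class LaunchStation:
    def __init__(self, connected_to_live_interceptors: bool):
        self.mode = "live" if connected_to_live_interceptors else "test"

    def handle(self, cmd: Command):
        # Reject any command whose intended mode does not match the station's
        # actual connection state, regardless of notification timing.
        if cmd.intended_mode != self.mode:
            return f"rejected: {cmd.name} was intended for {cmd.intended_mode} mode"
        return f"accepted: {cmd.name}"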
In the use of STPA on the real missile defense system, the risks involved in inte-
grating separately developed components into a larger system were assessed, and
several previously unknown scenarios for inadvertent launch were identified. Those
conducting the assessment concluded that the STPA analysis and supporting data
provided management with a sound basis on which to make risk acceptance deci-
sions [154]. The assessment results were used to plan mitigations for open safety
risks deemed necessary to change before deployment and field-testing of the system.
As system changes are proposed, they are assessed by updating the control structure
diagrams and assessment analysis results.
section 8.4.2. Considering the Degradation of Controls over Time.
A final step in STPA is to consider how the designed controls could degrade over
time and to build in protection against it. The mechanisms for the degradation could
be identified and mitigated in the design: for example, if corrosion is identified as a
potential cause, a stronger or less corrosive material might be used. Protection might
also include planned performance audits where the assumptions underlying the
hazard analysis are the preconditions for the operational audits and controls. For
example, an assumption for the interlock system with a light added to warn the
operators is that the light is operational and operators will use it to determine
whether it is safe to open the door. Performance audits might check to validate that
the operators know the purpose of the light and the importance of not opening the
door while the warning light is on. Over time, operators might create workarounds to bypass this feature if it slows them down too much in their work or if they do not understand its purpose; the light might be partially blocked from view because of workplace changes; and so on. The assumptions and required audits should be iden-
tified during the system design process and then passed to the operations team.
Along with performance audits, management of change procedures need to be
developed and the STPA analysis revisited whenever a planned change is made in
the system design. Many accidents occur after changes have been made in the
system. If appropriate documentation is maintained along with the rationale for the
control strategy selected, this reanalysis should not be overly burdensome. How to
accomplish this goal is discussed in chapter 10.
Finally, after accidents and incidents, the design and the hazard analysis should
be revisited to determine why the controls were not effective. The hazard of foam
damaging the thermal surfaces of the Space Shuttle had been identified during
design, for example, but over the years before the Columbia loss the process for
updating the hazard analysis after anomalies occurred in flight was eliminated. The
Space Shuttle standard for hazard analyses (NSTS 22254, Methodology for Conduct
of Space Shuttle Program Hazard Analyses) specified that hazards be revisited only
when there was a new design or the design was changed: There was no process for
updating the hazard analyses when anomalies occurred or even for determining
whether an anomaly was related to a known hazard [117].
Chapter 12 provides more information about the use of the STPA results during
operations.
section 8.5. Human Controllers.
Humans in the system can be treated in the same way as automated components in
step 1 of STPA, as was seen in the interlock system above where a person controlled
the position of the door. The causal analysis and detailed scenario generation for
human controllers, however, is much more complex than that of electromechanical
devices and even software, where at least the algorithm is known and can be evalu-
ated. Even if operators are given a procedure to follow, for reasons discussed in
chapter 2, it is very likely that operators will feel the need to change the procedures over time.
The first major difference between human and automated controllers is that
humans need an additional process model. All controllers need a model of the
process they are controlling directly, but human controllers also need a model of
any process, such as an oil refinery or an aircraft, they are indirectly controlling
through an automated controller. If the human is being asked to supervise the
automated controller or to monitor it for wrong or dangerous behavior then he
or she needs to have information about the state of both the automated controller
and the controlled process. Figure 8.8 illustrates this requirement. The need for
an additional process model explains why supervising an automated system
requires extra training and skill. A wrong assumption is sometimes made that if the
human is supervising a computer, training requirements are reduced, but this belief is untrue. Human skill levels and required knowledge almost always go up in
this situation.
Figure 8.8 includes dotted lines to indicate that the human controller may need
direct access to the process actuators if the human is to act as a backup to the
automated controller. In addition, if the human is to monitor the automation, he
or she will need direct input from the sensors to detect when the automation is
confused and is providing incorrect information as feedback about the state of the
controlled process.
The system design, training, and operational procedures must support accurate
creation and updating of the extra process model required by the human supervisor.
More generally, when a human is supervising an automated controller, there are
extra analysis and design requirements. For example, the control algorithm used by
the automation must be learnable and understandable. Inconsistent behavior or
unnecessary complexity in the automation function can lead to increased human
error. Additional design requirements are discussed in the next chapter.
With respect to STPA, the extra process model and complexity in the system
design requires additional causal analysis when performing step 2 to determine the
ways that both process models can become inaccurate.
The second important difference between human and automated controllers is
that, as noted by Thomas [199], while automated systems have basically static control
algorithms (although they may be updated periodically), humans employ dynamic
control algorithms that they change as a result of feedback and changes in goals.
Human error is best modeled and understood using feedback loops, not as a chain
of directly related events or errors as found in traditional accident causality models.
Less successful actions are a natural part of the search by operators for optimal
performance [164].
Consider again figure 2.9. Operators are often provided with procedures to follow
by designers. But designers are dealing with their own models of the controlled
process, which may not reflect the actual process as constructed and changed over
time. Human controllers must deal with the system as it exists. They update their
process models using feedback, just as in any control loop. Sometimes humans use
experimentation to understand the behavior of the controlled system and its current
state and use that information to change their control algorithm. For example, after
picking up a rental car, drivers may try the brakes and the steering system to get a
feel for how they work before driving on a highway.
If human controllers suspect a failure has occurred in a controlled process, they
may experiment to try to diagnose it and determine a proper response. Humans
also use experimentation to determine how to optimize system performance. The
driver's control algorithm may change over time as the driver learns more about the automated system and learns how to optimize the car's behavior. Driver goals
and motivation may also change over time. In contrast, automated controllers by
necessity must be designed with a single set of requirements based on the designers'
model of the controlled process and its environment.
Thomas provides an example [199] using cruise control. Designers of an auto-
mated cruise control system may choose a control algorithm based on their model
of the vehicle (such as weight, engine power, response time), the general design of
roadways and vehicle traffic, and basic engineering design principles for propulsion
and braking systems. A simple control algorithm might control the throttle in pro-
portion to the difference between current speed (monitored through feedback) and
desired speed (the goal).
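The "simple control algorithm" just described might look like the following sketch; the gain and the throttle limits are arbitrary illustrative values.

def throttle_command(desired_speed, current_speed, gain=0.05):
    """Throttle proportional to the difference between desired and current speed."""
    error = desired_speed - current_speed   # feedback compared against the goal
    throttle = gain * error
    # Throttle is limited to the range [0, 1]; braking is not commanded here.
    return max(0.0, min(1.0, throttle))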
Like the automotive cruise control designer, the human driver also has a process
model of the car's propulsion system, although perhaps simpler than that of the
automotive control expert, including the approximate rate of car acceleration for
each accelerator position. This model allows the driver to construct an appropriate
control algorithm for the current road conditions (slippery with ice or clear and dry)
and for a given goal (obeying the speed limit or arriving at the destination at a
required time). Unlike the static control algorithm designed into the automated
cruise control, the human driver may dynamically change his or her control algo-
rithm over time based on changes in the car's performance, in goals and motivation,
or driving experience.
The differences between automated and human controllers lead to different
requirements for hazard analysis and system design. Simply identifying human
“failures” or errors is not enough to design safer systems. Hazard analysis must
identify the specific human behaviors that can lead to the hazard. In some cases, it
may be possible to identify why the behaviors occur. In either case, we are not able
to “redesign” humans. Training can be helpful, but not nearly enough—training can
do only so much in avoiding human error even when operators are highly trained
and skilled. In many cases, training is impractical or minimal, as it is for automobile
drivers. The only real solution lies in taking the information obtained in the hazard
analysis about worst-case human behavior and using it in the design of the other
system components and the system as a whole to eliminate, reduce, or compensate
for that behavior. Chapter 9 discusses why we need human operators in systems and
how to design to eliminate or reduce human errors.
STPA as currently defined provides much more useful information about the
cause of human errors than traditional hazard analysis methods, but augmenting
STPA could provide more information for designers. Stringfellow has suggested
some additions to STPA for human controllers [195]. In general, engineers need
better tools for including humans in hazard analyses in order to cope with the unique
aspects of human control.
section 8.6. Using STPA on Organizational Components of the Safety Control Structure.
The examples above focus on the lower levels of safety control structures, but STPA
can also be used on the organizational and management components. Less experi-
mentation has been done on applying it at these levels, and, once again, more needs
to be done.
Two examples are used in this section: one was a demonstration for NASA of
risk analysis using STPA on a new management structure proposed after the Colum-
bia accident. The second is pharmaceutical safety. The fundamental activities of
identifying system hazards, safety requirements and constraints, and of documenting
the safety control structure were described for these two examples in chapter 7.
This section starts from that point and illustrates the actual risk analysis process.
section 8.6.1. Programmatic and Organizational Risk Analysis.
The Columbia Accident Investigation Board (CAIB) found that one of the causes
of the Columbia loss was the lack of independence of the safety program from the
Space Shuttle program manager. The CAIB report recommended that NASA insti-
tute an Independent Technical Authority (ITA) function similar to that used in
SUBSAFE (see chapter 14), and individuals with SUBSAFE experience were
recruited to help design and implement the new NASA Space Shuttle program
organizational structure. After the program was designed and implementation
started, a risk analysis of the program was performed to assist in a planned review
of the program's effectiveness. A classic programmatic risk analysis, which used
experts to identify the risks in the program, was performed. In parallel, a group at
MIT developed a process to use STAMP as a foundation for the same type of pro-
grammatic risk analysis to understand the risks and vulnerabilities of this new
organizational structure and recommend improvements [125].3 This section describes
the STAMP-based process and results as an example of what can be done for other
systems and other emergent properties. Laracy [108] used a similar process to
examine transportation system security, for example.
The STAMP-based analysis rested on the basic STAMP concept that most major
accidents do not result simply from a unique set of proximal, physical events but
from the migration of the organization to a state of heightened risk over time as
safeguards and controls are relaxed due to conflicting goals and tradeoffs. In such
a high-risk state, events are bound to occur that will trigger an accident. In both the
Challenger and Columbia losses, organizational risk had been increasing to unac-
ceptable levels for quite some time as behavior and decision-making evolved in
response to a variety of internal and external performance pressures. Because risk
increased slowly, nobody noticed, that is, the boiled frog phenomenon. In fact, con-
fidence and complacency were increasing at the same time as risk due to the lack
of accidents.
The goal of the STAMP-based analysis was to apply a classic system safety
engineering process to the analysis and redesign of this organizational structure.
Figure 8.9 shows the basic process used, which started with a preliminary hazard
analysis to identify the system hazards and the safety requirements and constraints.
In the second step, a STAMP model of the ITA safety control structure was created
(as designed by NASA; see figure 7.4) and a gap analysis was performed to map the
identified safety requirements and constraints to the assigned responsibilities in the
safety control structure and identify any gaps. A detailed hazard analysis using STPA
was then performed to identify the system risks and to generate recommendations
for improving the designed new safety control structure and for monitoring the
implementation and long-term health of the new program. Only enough of the
modeling and analysis is included here to allow the reader to understand the process.
The complete modeling and analysis effort is documented elsewhere [125].
The hazard identification, system safety requirements, and safety control struc-
ture for this example are described in section 7.4.1, so the example starts from this
basic information.
footnote. Many people contributed to the analysis described in this section, including Nicolas Dulac, Betty
Barrett, Joel Cutcher-Gershenfeld, John Carroll, and Stephen Friedenthal.
section 8.6.2. Gap Analysis.
In analyzing an existing organizational or social safety control structure, one of the
first steps is to determine where the responsibility for implementing each require-
ment rests and to perform a gap analysis to identify holes in the current design, that
is, requirements that are not being implemented (enforced) anywhere. Then the
safety control structure needs to be evaluated to determine whether it is potentially
effective in enforcing the system safety requirements and constraints.
A mapping was made between the system-level safety requirements and con-
straints and the individual responsibilities of each component in the NASA safety
control structure to see where and how requirements are enforced. The ITA program
was at the time being carefully defined and documented. In other situations, where
such documentation may be lacking, interview or other techniques may need to be
used to elicit how the organizational control structure actually works. In the end,
complete documentation should exist in order to maintain and operate the system
safely. While most organizations have job descriptions for each employee, the safety-
related responsibilities are not necessarily separated out or identified, which can
lead to unidentified gaps or overlaps.
As an example, in the ITA structure the responsibility for the system-level safety
requirement:
1a. State-of-the-art safety standards and requirements for NASA missions must
be established, implemented, enforced, and maintained that protect the astro-
nauts, the workforce, and the public
was assigned to the NASA Chief Engineer, but the Discipline Technical Warrant
Holders, the Discipline Trusted Agents, the NASA Technical Standards Program,
and the headquarters Office of Safety and Mission Assurance also play a role in
implementing this Chief Engineer responsibility. More specifically, system require-
ment 1a was implemented in the control structure by the following responsibility
assignments:
•Chief Engineer: Develop, monitor, and maintain technical standards and policy.
•Discipline Technical Warrant Holders:
1. Recommend priorities for development and updating of technical standards.
2. Approve all new or updated NASA Preferred Standards within their assigned discipline (the NASA Chief Engineer retains Agency approval).
3. Participate in (lead) development, adoption, and maintenance of NASA Preferred Technical Standards in the warranted discipline.
4. Participate as members of technical standards working groups.
•Discipline Trusted Agents: Represent the Discipline Technical Warrant Holders on technical standards committees.
•NASA Technical Standards Program: Coordinate with Technical Warrant Holders when creating or updating standards.
•NASA Headquarters Office of Safety and Mission Assurance:
1. Develop and improve generic safety, reliability, and quality process standards and requirements, including FMEA, risk, and the hazard analysis process.
2. Ensure that safety and mission assurance policies and procedures are adequate and properly documented.
Once the mapping is complete, a gap analysis can be performed to ensure that each
system safety requirement and constraint is embedded in the organizational design
and to find holes or weaknesses in the design. In this analysis, concerns surfaced,
particularly about requirements not reflected in the defined ITA organizational
structure.
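The mapping and gap analysis can be sketched as a simple cross-check (the requirement identifiers and assignments below are placeholders, except for requirement 1a, which follows the example above): requirements with no assigned component are gaps, and requirements assigned to several components are candidates for the coordination analysis discussed in section 8.6.3.

requirements = ["1a", "1b", "2a"]

# Which safety control structure components implement each requirement.
responsibility_map = {
    "1a": ["Chief Engineer", "Discipline Technical Warrant Holders",
           "Discipline Trusted Agents", "NASA Technical Standards Program",
           "Office of Safety and Mission Assurance"],
    "1b": [],                                   # a gap: no component enforces it
    "2a": ["System Technical Warrant Holder"],  # placeholder assignment
}

gaps = [r for r in requirements if not responsibility_map.get(r)]
shared = {r: who for r, who in responsibility_map.items() if len(who) > 1}

print("Requirements not implemented anywhere (gaps):", gaps)
print("Requirements with multiple implementers (check for coordination risks):", shared)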
As an example, one omission detected was appeals channels for complaints
and concerns about the components of the ITA structure itself that may not
function appropriately. All channels for expressing what NASA calls “technical
conscience” go through the warrant holders, but there was no defined way to express
concerns about the warrant holders themselves or about aspects of ITA that are not
working well.
A second example was the omission in the documentation of the ITA implemen-
tation plans of the person(s) who was to be responsible to see that engineers and
managers are trained to use the results of hazard analyses in their decision making.
More generally, a distributed and ill-defined responsibility for the hazard analysis
process made it difficult to determine responsibility for ensuring that adequate
resources are applied; that hazard analyses are elaborated (refined and extended)
and updated as the design evolves and test experience is acquired; that hazard logs
are maintained and used as experience is acquired; and that all anomalies are evalu-
ated for their hazard potential. Before ITA, many of these responsibilities were
assigned to each Center's Safety and Mission Assurance Office, but with much of
this process moving to engineering (which is where it should be) under the new ITA
structure, clear responsibilities for these functions need to be specified. One of the
basic causes of accidents in STAMP is multiple controllers with poorly defined or
overlapping responsibilities.
A final example involved the ITA program assessment process. An assessment
of how well ITA is working is part of the plan and is an assigned responsibility of
the chief engineer. The official risk assessment of the ITA program performed in
parallel with the STAMP-based one was an implementation of that chief engineer's
responsibility and was planned to be performed periodically. We recommended the
addition of specific organizational structures and processes for implementing a
continual learning and improvement process and making adjustments to the design
of ITA itself when necessary outside of the periodic review.
section 8.6.3. Hazard Analysis to Identify Organizational and Programmatic Risks.
A risk analysis to identify ITA programmatic risks and to evaluate these risks peri-
odically had been specified as one of the chief engineer's responsibilities. To accom-
plish this goal, NASA identified the programmatic risks using a classic process in which risk analysis experts interviewed stakeholders and held meetings where risks were identified and discussed. The STAMP-based analysis used a more formal,
structured approach.
Risks in STAMP terms can be divided into two types: (1) basic inadequacies in
the way individual components in the control structure fulfill their responsibilities
and (2) risks involved in the coordination of activities and decision making that can
lead to unintended interactions and consequences.
Basic Risks
Applying the four types of inadequate control identified in STPA and interpreted
for the hazard, which in this case is unsafe decision-making leading to an accident,
ITA has four general types of risks:
1. Unsafe decisions are made or approved by the chief engineer or warrant
holders.
2. Safe decisions are disallowed (e.g., overly conservative decision making that
undermines the goals of NASA and long-term support for ITA).
3. Decision making takes too long, minimizing impact and also reducing support
for the ITA.
4. Good decisions are made by the ITA, but do not have adequate impact on
system design, construction, and operation.
The specific potentially unsafe control actions by those in the ITA safety control
structure that could lead to these general risks are the ITA programmatic risks. Once
identified, they must be eliminated or controlled just like any unsafe control actions.
Using the responsibilities and control actions defined for the components of the
safety control structure, the STAMP-based risk analysis applied the four general
types of inadequate control actions, omitting those that did not make sense for the
particular responsibility or did not impact risk. To accomplish this, the general
responsibilities must be refined into more specific control actions.
As an example, the chief engineer is responsible as the ITA for the technical
standards and system requirements and all changes, variances, and waivers to the
requirements, as noted earlier. The control actions the chief engineer has available
to implement this responsibility are:
•To develop, monitor, and maintain technical standards and policy.
•In coordination with programs and projects, to establish or approve the technical requirements and ensure they are enforced and implemented in the programs and projects (ensure the design is compliant with the requirements).
•To approve all changes to the initial technical requirements.
•To approve all variances (waivers, deviations, exceptions) to the requirements.
•Etc.
Taking just one of these, the control responsibility to develop, monitor, and maintain
technical standards and policy, the risks (potentially inadequate or unsafe control
actions) identified using STPA step 1 include:
1. General technical and safety standards are not created.
2. Inadequate standards and requirements are created.
3. Standards degrade over time due to external pressures to weaken them. The
process for approving changes is flawed.
4. Standards are not changed over time as the environment changes.
As another example, the chief engineer cannot perform all these duties himself, so
he has a network of people below him in the hierarchy to whom he delegates or
“warrants” some of the responsibilities. The chief engineer retains responsibility for
ensuring that the warrant holders perform their duties adequately as in any hierar-
chical management structure.
The chief engineer responsibility to approve all variances and waivers to technical
requirements is assigned to the System Technical Warrant Holder (STWH). The
risks or potentially unsafe control actions of the STWH with respect to this respon-
sibility are:
•An unsafe engineering variance or waiver is approved.
•Designs are approved without determining conformance with safety requirements. Waivers become routine.
•Reviews and approvals take so long that ITA becomes a bottleneck. Mission achievement is threatened. Engineers start to ignore the need for approvals and work around the STWH in other ways.
Although a long list of risks was identified in this experimental application of STPA
to a management structure, many of the risks for different participants in the ITA
process were closely related. The risks listed for each participant reflect his or her
particular role and responsibilities, so participants with similar roles or responsibilities
generate similar risks. The relationships were made clear in the
earlier step tracing from system requirements to the roles and responsibilities for
each of the components of the ITA.
Coordination Risks.
Coordination risks arise when multiple people or groups control the same process.
The types of unsafe interactions that may result include: (1) both controllers
assume that the other is performing the control responsibilities, and as a result
nobody does, or (2) controllers provide conflicting control actions that have unin-
tended side effects.
Potential coordination risks are identified by the mapping from the system
requirements to the component requirements used in the gap analysis described
earlier. When similar responsibilities related to the same system requirement are
identified, the potential for new coordination risks needs to be considered.
As an example, the original ITA design documentation was ambiguous about
who had the responsibility for performing many of the safety engineering func-
tions. Safety engineering had previously been the responsibility of the Center
Safety and Mission Assurance Offices but the plan envisioned that these functions
would shift to the ITA in the new organization, leading to several obvious
risks.
Another example involves the transition of responsibility for the production of
standards to the ITA from the NASA Headquarters Office of Safety and Mission
Assurance (OSMA). In the plan, some of the technical standards responsibilities
were retained by OSMA, such as the technical design standards for human rating
spacecraft and for conducting hazard analyses, while others were shifted to the ITA
without a clear demarcation of who was responsible for what. At the same time,
responsibilities for the assurance that the plans are followed, which seems to logi-
cally belong to the mission assurance group, were not cleanly divided. Both overlaps
raised the potential for some functions not being accomplished or conflicting stan-
dards being produced.
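The mapping itself can be kept as a simple table and scanned for these situations. The sketch below is an illustrative Python fragment, not the actual analysis data; the requirement names are abbreviated from the examples above, and the unassigned entry is hypothetical, included only to show how a coverage gap would surface alongside the coordination risks.

# System safety requirements mapped to the components assigned a related responsibility.
responsibility_map = {
    "perform safety engineering functions": ["Center Safety and Mission Assurance Offices", "ITA"],
    "produce technical design standards": ["OSMA", "ITA"],
    "maintain technical policy": ["Chief Engineer"],
    "assure that plans are followed": [],  # hypothetical unassigned requirement
}

for requirement, controllers in responsibility_map.items():
    if not controllers:
        print(f"GAP: no component is responsible for '{requirement}'")
    elif len(controllers) > 1:
        print(f"COORDINATION RISK: {', '.join(controllers)} share '{requirement}'")
        print("  review: (1) does each assume the other performs the control?")
        print("          (2) could their control actions conflict or overlap?")

Flagged requirements are not necessarily problems; they simply identify where the division of responsibility, and the coordination between the components involved, must be examined explicitly.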
section 8.6.4. Use of the Analysis and Potential Extensions.
While risk mitigation and control measures could be generated from the list of risks
themselves, applying STPA step 2 to identify the causes of the risks will help
to produce better control measures, just as it does for physical systems. Taking the
responsibility of the System Technical Warrant
Holder to approve all variances and waivers to technical requirements in the
example above, potential causes for approving an unsafe engineering variance or
waiver include: inadequate or incorrect information about the safety of the action,
inadequate training, bowing to pressure about programmatic concerns, lack of
support from management, inadequate time or resources to evaluate the requested
variance properly, and so on. These causal factors were generated using the generic
factors in figure 8.6 but defined in a more appropriate way. Stringfellow has exam-
ined in more depth how STPA can be applied to organizational factors [195].
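The step 2 results for a single unsafe control action can be recorded in the same structured way. The Python sketch below is illustrative rather than the worksheet actually used; the causal factors are those named above, but their grouping under generic categories of the kind shown in figure 8.6 is an assumption made for the example.

unsafe_control_action = (
    "System Technical Warrant Holder approves an unsafe engineering variance or waiver"
)

causal_factors = {
    "inadequate feedback or flawed process model": [
        "inadequate or incorrect information about the safety of the action",
    ],
    "inadequate control algorithm (skills, procedures, training)": [
        "inadequate training",
    ],
    "context and external pressures": [
        "bowing to pressure about programmatic concerns",
        "lack of support from management",
        "inadequate time or resources to evaluate the requested variance",
    ],
}

print(unsafe_control_action)
for category, factors in causal_factors.items():
    print(f"  {category}:")
    for factor in factors:
        print(f"    - {factor}")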
The analysis can be used to identify potential changes to the safety control struc-
ture (the ITA program) that could eliminate or mitigate identified risks. General
design principles for safety are described in the next chapter.
A goal of the NASA risk analysis was to determine what to include in a planned
special assessment of the ITA early in its existence. To accomplish the same goal,
the MIT group categorized their identified risks as (1) immediate, (2) long-term, or
(3) controllable by standard ongoing processes. These categories were defined in
the following way:
Immediate concern: An immediate and substantial concern that should be part
of a near-term assessment.
Longer-term concern: A substantial longer-term concern that should potentially
be part of future assessments, because the risk will increase over time or cannot be
evaluated without future knowledge of the system or environment behavior.
Standard process: An important concern that should be addressed through
standard processes, such as inspections, rather than an extensive special assess-
ment procedure.
This categorization made it possible to identify a manageable subset of risks to include
in the planned near-term risk assessment, as well as those that could wait for future
assessments or be controlled by ongoing procedures. For example, it is important to assess
immediately the degree of “buy-in” to the ITA program. Without such support, ITA
cannot be sustained and the risk of dangerous decision making is very high. On the
other hand, the ability to find appropriate successors to the current warrant holders
is a longer-term concern identified in the STAMP-based risk analysis that would be
difficult to assess early in the existence of the new ITA control structure. The perfor-
mance of the current technical warrant holders, for example, is one factor that will
have an impact on whether the most qualified people will want the job in the future.
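A small illustration of how the categorization scopes the assessments is sketched below in Python; the entries shown are abbreviated from the examples in this section (the third is hypothetical) rather than the full list of identified risks.

IMMEDIATE, LONGER_TERM, STANDARD_PROCESS = "immediate", "longer-term", "standard process"

categorized_risks = [
    ("insufficient buy-in to the ITA program", IMMEDIATE),
    ("inability to find qualified successors to the current warrant holders", LONGER_TERM),
    ("routine documentation lapses", STANDARD_PROCESS),  # hypothetical example
]

near_term_assessment = [risk for risk, cat in categorized_risks if cat == IMMEDIATE]
future_assessments = [risk for risk, cat in categorized_risks if cat == LONGER_TERM]
ongoing_controls = [risk for risk, cat in categorized_risks if cat == STANDARD_PROCESS]

print("near-term assessment: ", near_term_assessment)
print("future assessments:   ", future_assessments)
print("standard processes:   ", ongoing_controls)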
section 8.6.5. Comparisons with Traditional Programmatic Risk Analysis Techniques.
The traditional risk analysis performed by NASA on ITA identified about one
hundred risks. The more rigorous, structured STAMP-based analysis—done inde-
pendently and without any knowledge of the results of the NASA process—
identified about 250 risks, all the risks identified by NASA plus additional ones. A
small part of the difference was related to the consideration by the STAMP group
of more components in the safety control structure, such as the NASA administrator,
Congress, and the Executive Branch (White House). There is no way to determine
whether the other additional risks identified by the STAMP-based process were
simply missed in the NASA analysis or were discarded for some reason.
The NASA analysis did not include a causal analysis of the risks and thus no
comparison is possible. Their goal was to determine what should be included in the
upcoming ITA risk assessment process and thus was narrower than the STAMP
demonstration risk analysis effort.
section 8.7. Reengineering a Sociotechnical System: Pharmaceutical Safety and the Vioxx
Tragedy.
The previous section describes the use of STPA on the management structure of an
organization that develops and operates high-tech systems. STPA and other types
of analysis are potentially also applicable to social systems. This section provides an
example using pharmaceutical safety.
Couturier has performed a STAMP-based causal analysis of the incidents associ-
ated with the introduction and withdrawal of Vioxx [43]. Once the causes of such
losses are determined, changes need to be made to prevent a recurrence. Many sug-
gestions for changes as a result of the Vioxx losses
have been proposed. After the Vioxx recall, three main reports were written by the
Government Accountability Office (GAO) [73], the Institute of Medicine (IOM)
[16], and one commissioned by Merck. The publication of these reports led to two
waves of changes, the first initiated within the FDA and the second by Congress in
the form of a new set of rules called FDAAA (FDA Amendments Act). Couturier
[43, 44], with inputs from others,4 used the Vioxx events to demonstrate how these
proposed and implemented policy and structural changes could be analyzed to
predict their potential effectiveness using STAMP.
footnote. Many people provided input to the analysis described in this section, including Stan Finkelstein, John
Thomas, John Carroll, Margaret Stringfellow, Meghan Dierks, Bruce Psaty, David Wierz, and various
other reviewers.
section 8.7.1. The Events Surrounding the Approval and Withdrawal of Vioxx.
Vioxx (Rofecoxib) is a prescription COX-2 inhibitor manufactured by Merck. It was
approved by the Food and Drug Administration (FDA) in May 1999 and was widely
used for pain management, primarily from osteoarthritis. Vioxx was one of the major
sources of revenue for Merck while on the market: It was marketed in more than
eighty countries with worldwide sales totaling $2.5 billion in 2003.
In September 2004, Merck voluntarily withdrew the drug from the market
because of safety concerns: The drug was suspected to increase the risk of cardio-
vascular events (heart attacks and stroke) for the patients taking it long term at high
dosages. Vioxx was one of the most widely used drugs ever to be withdrawn from
the market. According to an epidemiological study done by Graham, an FDA sci-
entist, Vioxx has been associated with more than 27,000 heart attacks or deaths and
may be the “single greatest drug safety catastrophe in the history of this country or
the history of the world” [76].
The important questions to be considered are how such a dangerous drug got on
the market and stayed there so long despite warnings of problems, and how this
type of loss can be avoided in the future.
The major events that occurred in this saga start with the discovery of the Vioxx
molecule in 1994. Merck sought FDA approval in November 1998.
In May 1999 the FDA approved Vioxx for the relief of osteoarthritis symptoms
and management of acute pain. Nobody had suggested that the COX-2 inhibitors
were more effective than the classic NSAIDS in relieving pain, but their selling point
had been that they were less likely to cause bleeding and other digestive tract com-
plications. The FDA was not convinced and required that the drug carry a warning
on its label about possible digestive problems. By December, Vioxx had more than
40 percent of the new prescriptions in its class.
In order to validate their claims about Rofecoxib having fewer digestive system
complications, Merck launched studies to prove their drugs should not be lumped
with other NSAIDS. The studies backfired.
In January 1999, before Vioxx was approved, Merck started a trial called VIGOR
(Vioxx Gastrointestinal Outcomes Research) to compare the efficacy and adverse
effects of Rofecoxib and Naproxen, an older nonsteroidal anti-inflammatory drug
or NSAID. In March 2000, Merck announced that the VIGOR trial had shown that
Vioxx was safer on the digestive tract than Naproxen, but it doubled the risk of
cardiovascular problems. Merck argued that the increased risk resulted not because
Vioxx caused the cardiovascular problems but because the Naproxen used in
the trial protected against them. Merck continued to minimize unfavorable findings
for Vioxx up to a month before withdrawing it from the market in 2004.
Another study, ADVANTAGE, was started soon after the VIGOR trial.
ADVANTAGE had the same goal as VIGOR, but it targeted osteoarthritis,
whereas VIGOR was for rheumatoid arthritis. Although the ADVANTAGE trial
did demonstrate that Vioxx was safer on the digestive tract than Naproxen, it
failed to show that Rofecoxib had any advantage over Naproxen in terms of pain
relief. Long after the report on ADVANTAGE was published, it turned out that its
first author had no involvement in the study until Merck presented him with a copy
of the manuscript written by Merck authors. This turned out to be one of the more
prominent recent examples of ghostwriting of journal articles where company
researchers wrote the articles and included the names of prominent researchers as
authors [178].
In addition, Merck documents later came to light that appear to show the
ADVANTAGE trial emerged from the Merck marketing division and was actually
a “seeding” trial, designed to market the drug by putting “its product in the hands
of practicing physicians, hoping that the experience of treating patients with the
study drug and a pleasant, even profitable interaction with the company will result
in more loyal physicians who prescribe the drug” [83].
Although the studies did demonstrate that Vioxx was safer on the digestive tract
than Naproxen, they also again unexpectedly found that the COX-2 inhibitor
doubled the risk of cardiovascular problems. In April 2002, the FDA required that
Merck note a possible link to heart attacks and strokes on Vioxx’s label. But it never
ordered Merck to conduct a trial comparing Vioxx with a placebo to determine
whether a link existed. In April 2000 the FDA recommended that Merck conduct
an animal study with Vioxx to evaluate cardiovascular safety, but no such study was
ever conducted.
For both the VIGOR and ADVANTAGE studies, claims have been made that
cardiovascular events were omitted from published reports [160]. In May 2000
Merck published the results from the VIGOR trial. The data included only seven-
teen of the twenty heart attacks the Vioxx patients had. When the omission was
later detected, Merck argued that the events occurred after the trial was over and
therefore did not have to be reported. The data showed a four times higher risk of
heart attacks compared with Naproxen. In October 2000, Merck officially told the
FDA about the other three heart attacks in the VIGOR study.
Merck marketed Vioxx heavily to doctors and spent more than $100 million
a year on direct-to-the-consumer advertising using popular athletes including
Dorothy Hamill and Bruce Jenner. In September 2001, the FDA sent Merck a letter
warning the company to stop misleading doctors about Vioxx’s effect on the cardio-
vascular system.
In 2001, Merck started a new study called APPROVe (Adenomatous Polyp
PRevention On Vioxx) in order to expand its market by showing the efficacy of
Vioxx on colorectal polyps. APPROVe was halted early when the preliminary data
showed an increased relative risk of heart attacks and strokes after eighteen months
of Vioxx use. The long-term use of Rofecoxib resulted in nearly twice the risk of
suffering a heart attack or stroke compared to patients receiving a placebo.
David Graham, an FDA researcher, did an analysis of a database of 1.4 million
Kaiser Permanente members and found that those who took Vioxx were more likely
to suffer a heart attack or sudden cardiac death than those who took Celebrex,
Vioxx’s main rival. Graham testified to a congressional committee that the FDA
tried to block publication of his findings. He described an environment “where he
was ostracized; subjected to veiled threats and intimidation.” Graham gave the
committee copies of email that support his claims that his superiors at the FDA
suggested watering down his conclusions [178].
Despite all their efforts to deny the risks associated with Vioxx, Merck withdrew
the drug from the market in September 2004. In October 2004, the FDA approved
a replacement drug for Vioxx by Merck, called Arcoxia.
Because of the extensive litigation associated with Vioxx, many questionable
practices in the pharmaceutical industry have come to light [6]. Merck has been
accused of several unsafe “control actions” in this sequence of events, including not
accurately reporting trial results to the FDA, not having a proper data safety monitoring board
(DSMB) overseeing the safety of the patients in at least one of the trials, misleading
marketing efforts, ghostwriting journal articles about Rofecoxib studies, and paying
publishers to create fake medical journals to publish favorable articles [45]. Post-
market safety studies recommended by the FDA were never done, only studies
directed at increasing the market.
section 8.7.2. Analysis of the Vioxx Case.
The hazards, system safety requirements and constraints, and documentation of the
safety control structure for pharmaceutical safety were shown in chapter 7. Using
these, Couturier performed several types of analysis.
He first traced the system requirements to the responsibilities assigned to each
of the components in the safety control structure, that is, he performed a gap analysis
as described above for the NASA ITA risk analysis. The goal was to check that at
least one controller was responsible for enforcing each of the safety requirements,
to identify when multiple controllers had the same responsibility, and to study each
of the controllers independently to determine if they are capable of carrying out
their assigned responsibilities.
In the gap analysis, no obvious gaps or missing responsibilities were found, but
multiple controllers are in charge of enforcing some of the same safety requirements.
For example, the FDA, the pharmaceutical companies, and physicians are all respon-
sible for monitoring drugs for adverse events. This redundancy is helpful if the
controllers work together and share the information they have. Problems can occur,
however, if efforts are not coordinated and gaps occur.
The assignment of responsibilities does not necessarily mean they are carried out
effectively. As in the NASA ITA analysis, potentially inadequate control actions can
be identified using STPA step 1, potential causes identified using step 2, and controls
to protect against these causes designed and implemented. Contextual factors must
be considered such as external or internal pressures militating against effective
implementation or application of the controls. For example, given the financial
incentives involved in marketing a blockbuster drug—Vioxx in 2003 provided $2.5
billion, or 11 percent of Merck’s revenue [66]—it may be unreasonable to expect
pharmaceutical companies to be responsible for drug safety without strong external
oversight and controls or even to be responsible at all: Suggestions have been made
that responsibility for drug development and testing be taken away from the phar-
maceutical manufacturers [67].
Controllers must also have the resources and information necessary to enforce
the safety constraints they have been assigned. Physicians need information about
drug safety and efficacy that is independent from the pharmaceutical company
representatives in order to adequately protect their patients. One of the first steps
in performing an analysis of the drug safety control structure is to identify the con-
textual factors that can influence whether each component’s responsibilities are
carried out and the information required to create an accurate process model to
support informed decision making in exercising the controls they have available to
carry out their responsibilities.
Couturier also used the drug safety control structure, system safety requirements
and constraints, the events in the Vioxx losses, and STPA and system dynamics
models (see appendix D) to investigate the potential effectiveness of the changes
implemented after the Vioxx events to control the marketing of unsafe drugs and
the impact of the changes on the system as a whole. For example, the Food and Drug
Administration Amendments Act of 2007 (FDAAA) increased the responsibilities of the FDA and
provided it with new authority. Couturier examined the recommendations from the
FDAAA, the IOM report, and those generated from his STAMP causal analysis of
the Vioxx events.
System dynamics modeling was used to show the relationship among the contex-
tual factors and unsafe control actions and the reasons why the safety control struc-
ture migrated toward ineffectiveness over time. Most modeling techniques provide
only direct relationships (arrows), which are inadequate to understand the indirect
relationships between causal factors. System dynamics provides a way to show such
indirect and nonlinear relationships. Appendix D explains this modeling technique.
First, system dynamics models were created to model the contextual influences
on the behavior of each component (patients, pharmaceutical companies, the FDA,
and so on) in the pharmaceutical safety control structure. Then the models were
combined to assist in understanding the behavior of the system as a whole and the
interactions among the components. The complete analysis can be found in [43] and
a shorter paper on some of the results [44]. An overview and some examples are
provided here.
Figure 8.10 shows a simple model of two types of pressures in this system that
militate against drugs being recalled. The loop on the left describes pressures within
the pharmaceutical company related to drug recalls while the loop on the right
describes pressures on the FDA related to drug recalls.
Once a drug has been approved, the pharmaceutical company, which invested
large resources in developing, testing, and marketing the drug, has incentives to
maximize profits from the drug and keep it on the market. Those pressures are
accentuated in the case of expected blockbuster drugs where the company’s finan-
cial well-being potentially depends on the success of the product. This goal creates
a reinforcing loop within the company to try to keep the drug on the market. The
company also has incentives to pressure the FDA to increase the number of approved
indications (and thus purchasers), resist label changes, and prevent drug recalls. If the
company is successful at preventing recalls, the expectations for the drug increase,
creating another reinforcing loop. External pressures to recall the drug limit the
reinforcing dynamics, but they have a lot of inertia to overcome.
Figure 8.11 includes more details, more complex feedback loops, and more outside
pressures, such as the availability of a replacement drug, the time left on the drug’s
patent, and the amount of time spent on drug development. Pressures on the FDA
from the pharmaceutical companies are elaborated, including the pressures on the
Office of New Drugs (OND) through PDUFA fees,5 pressures from advisory boards
to keep the drug (which are, in turn, subject to pressures from patient advocacy
groups and lucrative consulting contracts with the pharmaceutical companies), and
pressures from the FDA Office of Surveillance and Epidemiology (OSE) to recall
the drug.
Figures 8.12 and 8.13 show the pressures leading to overprescribing drugs. The
overview in figure 8.12 has two primary feedback loops. The loop on the left describes
pressures to lower the number of prescriptions based on the number of adverse
events and negative studies. The loop on the right shows the pressures within the
pharmaceutical company to increase the number of prescriptions based on company
earnings and marketing efforts.
For a typical pharmaceutical product, more drug prescriptions lead to higher
earnings for the drug manufacturer, part of which can be used to pay for more
advertising to get doctors to continue to prescribe the drug. This reinforcing loop is
usually balanced by the adverse effects of the drug. The more the drug is prescribed,
the more likely negative side effects are to be observed, and their observation serves to balance
the pressures from the pharmaceutical companies. The two loops then theoretically
reach a dynamic equilibrium where drugs are prescribed only when their benefits
outweigh the risks.
As demonstrated in the Vioxx case, delays within a loop can significantly alter
the behavior of the system. By the time the first severe side effects were discovered,
millions of prescriptions had been given out. The balancing influences of the side-
effects loop were delayed so long that they could not effectively control the reinforc-
ing pressures coming from the pharmaceutical companies. Figure 8.13 shows how
additional factors can be incorporated including the quality of collected data, the
market size, and patient drug requests.
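The effect of such delays can be illustrated with a toy simulation. The Python sketch below is not one of Couturier’s models; it implements two generic loops of the kind shown in figure 8.12, a reinforcing marketing loop and a balancing side-effects loop, with all parameter values invented purely for illustration.

def simulate(reporting_delay_months, months=120):
    # Stocks: monthly prescriptions and prescribers' perceived risk (0 to 1).
    prescriptions = 1000.0
    perceived_risk = 0.0
    peak = prescriptions

    adverse_event_rate = 0.001                 # adverse events per prescription per month
    pipeline = [0.0] * reporting_delay_months  # events awaiting detection and publication

    for _ in range(months):
        # Reinforcing loop: earnings fund marketing, which increases prescriptions.
        growth = 0.05 * prescriptions * (1.0 - perceived_risk)

        # Adverse events occur now but reach prescribers only after the reporting delay.
        pipeline.append(adverse_event_rate * prescriptions)
        reported = pipeline.pop(0)

        # Balancing loop: reported events raise perceived risk, which cuts prescribing.
        perceived_risk = min(1.0, perceived_risk + 0.0002 * reported)
        prescriptions = max(0.0, prescriptions + growth - 0.5 * perceived_risk * prescriptions)
        peak = max(peak, prescriptions)

    return peak

for delay in (3, 18, 36):
    print(f"reporting delay of {delay} months gives peak prescriptions of about {simulate(delay):,.0f}")

Lengthening the delay between the occurrence of adverse events and their effect on prescribing lets the reinforcing loop run unopposed for longer, so the peak grows sharply with the delay. This is the qualitative behavior seen in the Vioxx case, where millions of prescriptions had been written before the balancing loop took hold.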
Couturier incorporated into the system dynamics models the changes that were
proposed by the IOM after the Vioxx events, the changes actually implemented
in FDAAA, and the recommendations coming out of the STAMP-based causal
analysis. One major difference was that the STAMP-based recommendations had
a broader scope. While the IOM and FDAAA changes focused on the FDA, the
STAMP analysis considered the contributions of all the components of the pharma-
ceutical safety control structure to the Vioxx events and the STAMP causal analysis
led to recommendations for changes in nearly all of them.
Couturier concluded, not surprisingly, that most of the FDAAA changes are
useful and will have the intended effects. He also determined that a few may be
counterproductive and others need to be added. The added ones come from the fact
that the IOM recommendations and the FDAAA focus on a single component of
the system (the FDA). The FDA does not operate in a vacuum, and the proposed
changes do not take into account the safety role played by other components in the
system, particularly physicians. As a result, the pressures that led to the erosion of
the overall system safety controls were left unaddressed and are likely to lead to
changes in the system’s static and dynamic safety controls that will undermine the
improvements implemented by FDAAA. See Couturier [43] for the complete results.
A potential contribution of such an analysis is the ability to consider the impact
of multiple changes within the entire safety control structure. Less than effective
controls may be implemented when they are created piecemeal to fix a current set
of adverse events. Existing pressures and influences, not changed by the new pro-
cedures, can defeat the intent of the changes by leading to unintended and counter-
balancing actions in the components of the safety control structure. STAMP-based
analysis suggests how to reengineer the safety control structure as a whole to achieve
the system goals, both enhancing the safety of current drugs and encouraging the
development of new drugs.
footnote. The Prescription Drug User Fee Act (PDUFA) was first passed by Congress in 1992. It allows the FDA
to collect fees from the pharmaceutical companies to pay the expenses for the approval of new drugs.
In return, the FDA agrees to meet drug review performance goals. The main goal of PDUFA is to accel-
erate the drug review process. Between 1993 and 2002, user fees allowed the FDA to increase by 77
percent the number of personnel assigned to review applications. In 2004, more than half the funding
for the CDER was coming from user fees [148]. A growing group of scientists and regulators have
expressed fears that in allowing the FDA to be sponsored by the pharmaceutical companies, the FDA
has shifted its priorities to satisfying the companies, its “client,” instead of protecting the public.
section 8.8.
Comparison of STPA with Traditional Hazard Analysis Techniques.
Few formal comparisons have been made yet between STPA and traditional tech-
niques such as fault tree analysis and HAZOP. Theoretically, because STAMP
extends the causality model underlying the hazard analysis, non-failures and addi-
tional causes should be identifiable, as well as the failure-related causes found by
the traditional techniques. The few comparisons that have been made, both informal
and formal, have confirmed this hypothesis.
In the use of STPA on the U.S. missile defense system, potential paths to inad-
vertent launch were identified that had not been identified by previous analyses or
in extensive hazard analyses on the individual components of the system [BMDS].
Each element of the system had an active safety program, but the complexity and
coupling introduced by their integration into a single system created new subtle and
complex hazard scenarios. While the scenarios identified using STPA included those
caused by potential component failures, as expected, scenarios were also identified
that involved unsafe interactions among the components without any components
actually failing—each operated according to its specified requirements, but the
interactions could lead to hazardous system states. In the evaluation of this effort,
two other advantages were noted:
1. The effort was bounded and predictable and assisted the engineers in scoping
their efforts. Once all the control actions have been examined, the assessment
is complete.
2. As the control structure was developed and the potential inadequate control
actions were identified, the engineers were able to prioritize required changes
according to which control actions had the greatest role in keeping the system
from transitioning to a hazardous state.
A paper published on this effort concluded:
The STPA safety assessment methodology . . . provided an orderly, organized fashion in
which to conduct the analysis. The effort successfully assessed safety risks arising from the
integration of the Elements. The assessment provided the information necessary to char-
acterize the residual safety risk of hazards associated with the system. The analysis and
supporting data provided management a sound basis on which to make risk acceptance
decisions. Lastly, the assessment results were also used to plan mitigations for open safety
risks. As changes are made to the system, the differences are assessed by updating the
control structure diagrams and assessment analysis templates.
Another informal comparison was made in the ITA (Independent Technical Author-
ity) analysis described in section 8.6. An informal review of the risks identified by
using STPA showed that they included all the risks identified by the informal NASA
risk analysis process using the traditional method common to such analyses. The
additional risks identified by STPA appeared on the surface to be as important as
those identified by the NASA analysis. As noted, there is no way to determine
whether the less formal NASA process identified additional risks and discarded
them for some reason or simply missed them.
A more careful comparison has also been made. JAXA (the Japanese Space
Agency) and MIT engineers compared the use of STPA on a JAXA unmanned
spacecraft (HTV) to transfer cargo to the International Space Station (ISS). Because
human life is potentially involved (one hazard is collision with the International
Space Station), rigorous NASA hazard analysis standards using fault trees and other
analyses had been employed and reviewed by NASA. In an STPA analysis of the
HTV used in an evaluation of the new technique for potential use at JAXA, all of
the hazard causal factors identified by the fault tree analysis were identified also by
STPA [88]. As with the BMDS comparison, additional causal factors were identified
by STPA alone. These additional causal factors again involved those related to more
sophisticated types of errors beyond simple component failures and those related
to software and human errors.
Additional independent comparisons (not done by the author or her students)
have been made between accident causal analysis methods comparing STAMP and
more traditional methods. The results are described in chapter 11 on accident analy-
sis based on STAMP.
section 8.9.
Summary.
Some new approaches to hazard and risk analysis based on STAMP and systems
theory have been suggested in this chapter. We are only beginning to develop such
techniques and hopefully others will work on alternatives and improvements. The
only thing for sure is that applying the techniques developed for simple electrome-
chanical systems to complex, human and software-intensive systems without funda-
mentally changing the foundations of the techniques is futile. New ideas are
desperately needed if we are going to solve the problems and respond to the changes
in the world of engineering described in chapter 1.