chapter 9.
Safety-Guided Design.
In the examples of STPA in the last chapter, the development of the design was
assumed to occur independently. Most of the time, hazard analysis is done after the
major design decisions have been made. But STPA can be used in a proactive way
to help guide the design and system development, rather than as simply a hazard
analysis technique on an existing design. This integrated design and analysis process
is called safety-guided design (figure 9.1).
As the systems we build and operate increase in size and complexity, the use of
sophisticated system engineering approaches becomes more critical. Important
system-level (emergent) properties, such as safety, must be built into the design of
these systems; they cannot be effectively added on or simply measured afterward.
Adding barriers or protection devices after the fact is not only enormously more
expensive, it is also much less effective than designing safety in from the beginning
(see Safeware, chapter 16). This chapter describes the process of safety-guided
design, which is enhanced by defining accident prevention as a control problem
rather than a “prevent failures” problem. The next chapter shows how safety engineering and safety-guided design can be integrated into basic system engineering
processes.
section 9.1.
The Safety-Guided Design Process.
One key to having a cost-effective safety effort is to embed it into a system engineering process from the very beginning and to design safety into the system as the
design decisions are made. Once again, the process starts with the fundamental
activities in chapter 7. After the hazards and system-level safety requirements and
constraints have been identified, the design process starts:
1. Try to eliminate the hazards from the conceptual design.
2. If any of the hazards cannot be eliminated, then identify the potential for their
control at the system level.
3. Create a system control structure and assign responsibilities for enforcing
safety constraints. Some guidance for this process is provided in the operations
and management chapters.
4. Refine the constraints and design in parallel.
a. Identify potentially hazardous control actions by each of the system components that would violate the system design constraints, using STPA step 1. Restate the identified hazardous control actions as component design
constraints.
b. Using STPA step 2, determine what factors could lead to a violation of the
safety constraints.
c. Augment the basic design to eliminate or control potentially unsafe control
actions and behaviors.
d. Iterate over the process, that is, perform STPA steps 1 and 2 on the new augmented design and continue to refine the design until all hazardous scenarios are eliminated, mitigated, or controlled (this iteration is sketched below).
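The iterative structure of step 4 can be summarized as a simple loop. The following Python sketch is only an illustration of that control flow under assumed interfaces; the callables passed in (step1, step2, restate, augment, handled) are hypothetical placeholders for project-specific activities and are not part of STPA itself.

def safety_guided_design(design, system_constraints, step1, step2, restate, augment, handled):
    # Iterate STPA steps 1 and 2, augmenting the design until no open hazardous
    # scenarios remain (all are eliminated, mitigated, or controlled).
    while True:
        unsafe_actions = step1(design, system_constraints)         # STPA step 1
        component_constraints = [restate(a) for a in unsafe_actions]
        scenarios = step2(design, component_constraints)           # STPA step 2
        remaining = [s for s in scenarios if not handled(design, s)]
        if not remaining:
            return design
        design = augment(design, remaining)                        # step 4c: augment the design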
The next section provides an example of the process. The rest of the chapter discusses safe design principles for physical processes, automated controllers, and
human controllers.
section 9.2.
An Example of Safety-Guided Design for an Industrial Robot.
The process of safety-guided design and the use of STPA to support it are illustrated here with the design of an experimental Space Shuttle robotic Thermal Tile Processing System (TTPS), based on a design created for a research project at CMU.
The goal of the TTPS is to inspect and waterproof the thermal protection
tiles on the belly of the Space Shuttle, thus saving humans from a laborious task,
typically lasting three to four months, that begins within minutes after the Shuttle
lands and ends just prior to launch. Upon landing at either the Dryden facility in
California or Kennedy Space Center in Florida, the orbiter is brought to either the
Mate-Demate Device (MDD) or the Orbiter Processing Facility (OPF). These large
structures provide access to all areas of the orbiters.
The Space Shuttle is covered with several types of heat-resistant tiles that protect
the orbiter's aluminum skin during the heat of reentry. While the majority of the
upper surfaces are covered with flexible insulation blankets, the lower surfaces are
covered with silica tiles. These tiles have a glazed coating over soft and highly porous
silica fibers. The tiles are 95 percent air by volume, which makes them extremely
light but also makes them capable of absorbing a tremendous amount of water.
Water in the tiles causes a substantial weight problem that can adversely affect
launch and orbit capabilities for the shuttles. Because the orbiters may be exposed
to rain during transport and on the launch pad, the tiles must be waterproofed. This
task is accomplished through the use of a specialized hydrophobic chemical, DMES,
which is injected into each tile. There are approximately 17,000 lower surface tiles
covering an area that is roughly 25m × 40m.
In the standard process, DMES is injected into a small hole in each tile by a
handheld tool that pumps a small quantity of chemical into the nozzle. The nozzle
is held against the tile and the chemical is forced through the tile by a pressurized
nitrogen purge for several seconds. It takes about 240 hours to waterproof the tiles
on an orbiter. Because the chemical is toxic, human workers have to wear heavy
suits and respirators while injecting the chemical and, at the same time, maneuvering
in a crowded work area. One goal for using a robot to perform this task was to
eliminate a very tedious, uncomfortable, and potentially hazardous human activity.
The tiles must also be inspected. A goal for the TTPS was to inspect the tiles
more accurately than the human eye and therefore reduce the need for multiple
inspections. During launch, reentry, and transport, a number of defects can occur on
the tiles in the form of scratches, cracks, gouges, discoloring, and erosion of surfaces.
The examination of the tiles determines if they need to be replaced or repaired. The
typical procedures involve visual inspection of each tile to see if there is any damage
and then assessment and categorization of the defects according to detailed checklists. Later, work orders are issued for repair of individual tiles.
Like any design process, safety-guided design starts with identifying the goals for
the system and the constraints under which the system must operate. The high-level
goals for the TTPS are to:
1. Inspect the thermal tiles for damage caused during launch, reentry, and
transport
2. Apply waterproofing chemicals to the thermal tiles
Environmental constraints delimit how these goals can be achieved, and identifying those constraints, particularly the safety constraints, is an early goal in safety-guided design.
The environmental constraints on the system design stem from physical properties of the Orbiter Processing Facility (OPF) at KSC, such as size constraints on the physical system components and the necessity for any mobile robotic components to deal with crowded work areas and with humans in the area. Example work area environmental constraints for the TTPS are:
EA1. The work areas of the Orbiter Processing Facility (OPF) can be very
crowded. The facilities provide access to all areas of the orbiters through the
use of intricate platforms that are laced with plumbing, wiring, corridors, lifting
devices, and so on. After entering the facility, the orbiters are jacked up and
leveled. Substantial structure then swings around and surrounds the orbiter on
all sides and at all levels. With the exception of the jack stands that support
the orbiters, the floor space directly beneath the orbiter is initially clear but
the surrounding structure can be very crowded.
EA2. The mobile robot must enter the facility through personnel access doors 1.1 meters (42″) wide. The layout within the OPF allows a length of 2.5 meters (100″) for the robot. There are some structural beams whose heights are as low as 1.75 meters (70″), but once under the orbiter the tile heights range from about 2.9 meters to 4 meters. The compact roll-in form of the mobile system must maneuver through these spaces and also raise its inspection and injection equipment up to heights of 4 meters to reach individual tiles while still meeting a 1 millimeter accuracy requirement.
EA3. Additional constraints involve moving around the crowded workspace. The
robot must negotiate jack stands, columns, work stands, cables, and hoses. In
addition, there are hanging cords, clamps, and hoses. Because the robot might
cause damage to the ground obstacles, cable covers will be used for protection
and the robot system must traverse these covers.
Other design constraints on the TTPS include:
1. Use of the TTPS must not negatively impact the flight schedules of the orbiters more than that of the manual system being replaced.
2. Maintenance costs of the TTPS must not exceed x dollars per year.
3. Use of the TTPS must not cause or contribute to an unacceptable loss (accident) as defined by Shuttle management.
As with many systems, prioritizing the hazards by severity is enough in this case to
assist the engineers in making decisions during design. Sometimes a preliminary
hazard analysis is performed using a risk matrix to determine how much effort will
be put into eliminating or controlling the hazards and in making tradeoffs in design.
Likelihood, at this point, is unknowable but some type of surrogate, like mitigatibility, as demonstrated in section 10.3.4, could be used. In the TTPS example, severity
plus the NASA policy described earlier is adequate. To decide not to consider some
of the hazards at all would be pointless and dangerous at this stage of development
as likelihood is not determinable. As the design proceeds and decisions must be
made, specific additional information may be found to be useful and acquired at
that time. After the system design is completed, if it is determined that some hazards
cannot be adequately handled or the compromises required to handle them are too
great, then the limitations would be documented (as described in chapter 10), and
decisions would have to be made at that point about the risks of using the system.
At that time, however, the information necessary to make those decisions will more
likely be available than before the development process begins.
After the hazards are identified, system-level safety-related requirements and
design constraints are derived from them. As an example, for hazard H7 (inadequate
thermal protection), a system-level safety design constraint is that the mobile robot
processing must not result in any tiles being missed in the inspection or waterproofing process. More detailed design constraints will be generated during the safety-guided design process.
To get started, a general system architecture must be selected (figure 9.2). Let's
assume that the initial TTPS architecture consists of a mobile base on which tools
will be mounted, including a manipulator arm that performs the processing and
contains the vision and waterproofing tools. This very early decision may be changed
after the safety-guided design process starts, but some very basic initial assumptions
are necessary to get going. As the concept development and detailed design process
proceeds, information generated about hazards and design tradeoffs may lead to
changes in the initial configuration. Alternatively, multiple design configurations
may be considered in parallel.
In the initial candidate architecture (control structure), a decision is made to introduce a human operator to supervise robot movement because so many of the hazards are related to movement. At the same time, it may be impractical for
an operator to monitor all the activities so the first version of the system architecture
is to have the TTPS control system in charge of the non-movement activities and
to have both the TTPS and the control room operator share control of movement.
The safety-guided design process, including STPA, will identify the implications of
this decision and will assist in analyzing the allocation of tasks to the various components to determine the safety tradeoffs involved.
In the candidate starting architecture (control structure), there is an automated
robot work planner to provide the overall processing goals and tasks for the
TTPS. A location system is needed to provide information to the movement controller about the current location of the robot. A camera is used to provide information to the human controller, as the control room will be located at a distance
from the orbiter. The role of the other components should be obvious.
The proposed design has two potential movement controllers, so coordination
problems will have to be eliminated. The operator could control all movement, but
that may be considered impractical given the processing requirements. To assist with
this decision process, engineers may create a concept of operations and perform a
human task analysis.
The design process is now ready to start. Using the information already specified,
particularly the general functional responsibilities assigned to each component,
designers will identify potentially hazardous control actions by each of the system
components that could violate the safety constraints, determine the causal factors
that could lead to these hazardous control actions, and prevent or control them in
the system design. The process thus involves a top-down identification of scenarios
in which the safety constraints could be violated. The scenarios can then be used to
guide more detailed design decisions.
In general, safety-guided design involves first attempting to eliminate the
hazard from the design and, if that is not possible or requires unacceptable
tradeoffs, reducing the likelihood the hazard will occur, reducing the negative
consequences of the hazard if it does occur, and implementing contingency plans
for limiting damage. More about design procedures is presented in the next
section.
As design decisions are made, an STPA-based hazard analysis is used to
inform these decisions. Early in the system design process, little information is
available, so the hazard analysis will be very general at first and will be refined
and augmented as additional information emerges through the system design
activities.
For the example, let's focus on the robot instability hazard. The first goal should
be to eliminate the hazard in the system design. One way to eliminate potential
instability is to make the robot base so heavy that it cannot become unstable, no
matter how the manipulator arm is positioned. A heavy base, however, could increase
the damage caused by the base coming into contact with a human or object or make
it difficult for workers to manually move the robot out of the way in an emergency
situation. An alternative solution is to make the base long and wide so the moment
created by the operation of the manipulator arm is compensated by the moments
created by base supports that are far from the robot's center of mass. A long and
wide base could remove the hazard but may violate the environmental constraints
in the facility layout, such as the need to maneuver through doors and in the
crowded OPF.
The environmental constraint EA2 above implies a maximum length for the
robot of 2.5 meters and a width no larger than 1.1 meters. Given the required
maximum extension length of the manipulator arm and the estimated weight of
the equipment that will need to be carried on the mobile base, a calculation might
show that the length of the robot base is sufficient to prevent any longitudinal
instability, but that the width of the base is not sufficient to prevent lateral
instability.
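The calculation referred to here is essentially a static moment balance about the edge of the base's support footprint. The Python sketch below uses entirely hypothetical masses and reach values, not TTPS data; it only illustrates why a 2.5-meter length can pass such a check while a 1.1-meter width fails it.

G = 9.81  # m/s^2

def is_stable(base_mass_kg, half_footprint_m, arm_payload_mass_kg, arm_reach_m):
    # The base tips about the edge of its footprint when the overturning moment of
    # the extended arm and payload exceeds the restoring moment of the base weight.
    restoring = base_mass_kg * G * half_footprint_m
    overturning = arm_payload_mass_kg * G * (arm_reach_m - half_footprint_m)
    return restoring >= overturning

# Made-up values: a 250 kg base carrying 100 kg of arm and equipment reaching
# 2.0 m from the base centerline.
print(is_stable(250, 1.25, 100, 2.0))  # longitudinal (2.5 m length, 1.25 m half-length): True
print(is_stable(250, 0.55, 100, 2.0))  # lateral (1.1 m width, 0.55 m half-width): False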
If eliminating the hazard is determined to be impractical (as in this case) or not desirable for some reason, the alternative is to identify ways to control it. The decision to try to control it may turn out not to be practical, or it may later seem less satisfactory than increasing the weight (the solution discarded earlier). All decisions should remain open as more information is obtained about alternatives, and backtracking is an option.
At the initial stages in design, we identified only the general hazards, for example, instability of the robot base, and the related system design constraint that the mobile base must not be capable of falling over under worst-case operational conditions. As design decisions are proposed and analyzed, they will lead to additional refinements in the hazards and the design constraints.
For example, a potential solution to the stability problem is to use lateral stabilizer legs that are deployed when the manipulator arm is extended but must be
retracted when the robot base moves. Let's assume that a decision is made to at least consider this solution. That potential design decision generates a new refined hazard from the high-level stability hazard (H2):
H2.1. The manipulator arm is extended while the stabilizer legs are not fully
extended.
Damage to the mobile base or other equipment around the OPF is another potential
hazard introduced by the addition of the legs if the mobile base moves while the
stability legs are extended. Again, engineers would consider whether this hazard
could be eliminated by appropriate design of the stability legs. If it cannot, then that
is a second additional hazard that must be controlled in the design with a corresponding design constraint that the mobile base must not move with the stability
legs extended.
There are now two new refined hazards that must be translated into design
constraints:
1. The manipulator arm must never be extended if the stabilizer legs are not
extended.
2. The mobile base must not move with the stability legs extended.
STPA can be used to further refine these constraints and to evaluate the resulting
designs. In the process, the safety control structure will be refined and perhaps
changed. In this case, a controller must be identified for the stabilizer legs, which
were previously not in the design. Let's assume that the legs are controlled by the TTPS movement controller (figure 9.3).
Using the augmented control structure, the remaining activities in STPA are to
identify potentially hazardous control actions by each of the system components
that could violate the safety constraints, determine the causal factors that could lead
to these hazardous control actions, and prevent or control them in the system design.
The process thus involves a top-down identification of scenarios in which the safety
constraints could be violated so that they can be used to guide more detailed design
decisions.
The unsafe control actions associated with the stability hazard are shown in
figure 9.4. Movement and thermal tile processing hazards are also identified in the
table. Combining similar entries for H1 in the table leads to the following unsafe
control actions by the leg controller with respect to the instability hazard:
1. The leg controller does not command a deployment of the stabilizer legs before
the arm is extended.
2. The leg controller commands a retraction of the stabilizer legs before the
manipulator arm is fully stowed.
3. The leg controller commands a retraction of the stabilizer legs after the arm
has been extended or commands a retraction of the stabilizer legs before the
manipulator arm is stowed.
4. The leg controller stops extension of the stabilizer legs before they are fully
extended.
and by the arm controller:
1. The arm controller extends the manipulator arm when the stabilizer legs are
not extended or before they are fully extended.
The inadequate control actions can be restated as system safety constraints on the
controller behavior (whether the controller is automated or human):
1. The leg controller must ensure the stabilizer legs are fully extended before arm
movements are enabled.
2. The leg controller must not command a retraction of the stabilizer legs when
the manipulator arm is not in a fully stowed position.
3. The leg controller must command a deployment of the stabilizer legs before
arm movements are enabled; the leg controller must not command a retraction
of the stabilizer legs before the manipulator arm is stowed.
4. The leg controller must not stop the leg extension until the legs are fully
extended.
Similar constraints will be identified for all hazardous commands; for example, the arm controller must not extend the manipulator arm before the stabilizer legs are fully extended.
These system safety constraints might be enforced through physical interlocks,
human procedures, and so on. Performing STPA step 2 will provide information
during detailed design (1) to evaluate and compare the different design choices, (2) to design the controllers and design fault tolerance features for the system, and (3) to guide the test and verification procedures (or training for humans). As design
decisions and safety constraints are identified, the functional specifications for the
controllers can be created.
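As one illustration of how these constraints might be captured in such a functional specification, the sketch below encodes the leg and arm interlocks in Python. The state names, the single controller class, and the feedback interface are assumptions made for the example, not part of the actual TTPS design.

from enum import Enum

class LegState(Enum):
    UNKNOWN = 0          # default at startup and after a restart
    RETRACTED = 1
    EXTENDING = 2
    FULLY_EXTENDED = 3

class RobotController:
    def __init__(self):
        # Process model variables; set only from measured feedback, never from
        # the commands that were issued.
        self.leg_state = LegState.UNKNOWN
        self.arm_stowed = None   # None means unknown

    def on_feedback(self, leg_state, arm_stowed):
        self.leg_state = leg_state
        self.arm_stowed = arm_stowed

    def extend_arm(self):
        # Constraint: the arm must never be extended unless the legs are fully extended.
        if self.leg_state is not LegState.FULLY_EXTENDED:
            raise RuntimeError("interlock: stabilizer legs not fully extended")
        return "EXTEND_ARM"

    def retract_legs(self):
        # Constraint: the legs must not be retracted unless the arm is fully stowed.
        if self.arm_stowed is not True:
            raise RuntimeError("interlock: manipulator arm not fully stowed")
        return "RETRACT_LEGS"

    def move_base(self):
        # Constraint: the base must not move while the stabilizer legs are extended.
        if self.leg_state is not LegState.RETRACTED:
            raise RuntimeError("interlock: stabilizer legs not retracted")
        return "MOVE_BASE"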
To produce detailed scenarios for the violation of safety constraints, the control
structure is augmented with process models. The preliminary design of the process
models comes from the information necessary to ensure the system safety constraints hold. For example, the constraint that the arm controller must not enable
manipulator movement before the stabilizer legs are completely extended implies
there must be some type of feedback to the arm controller to determine when the
leg extension has been completed.
While a preliminary functional decomposition of the system components is
created to start the process, as more information is obtained from the hazard analysis and the system design continues, this decomposition may be altered to optimize
fault tolerance and communication requirements. For example, at this point the need
for the process models of the leg and arm controllers to be consistent and the communication required to achieve this goal may lead the designers to decide to combine
the leg and arm controllers (figure 9.5).
Causal factors for the stability hazard being violated can be determined using
STPA step 2. Feedback about the position of the legs is clearly critical to ensure
that the process model of the state of the stabilizer legs is consistent with the actual
state. The movement and arm controller cannot assume the legs are extended simply
because a command was issued to extend them. The command may not be executed
or may only be executed partly. One possible scenario, for example, involves an
external object preventing the complete extension of the stabilizer legs. In that case,
the robot controller (either human or automated) may assume the stabilizer legs are extended because the extension motors have been powered up (a common type
of design error). Subsequent movement of the manipulator arm would then violate
the identified safety constraints. Just as the analysis assists in refining the component
safety constraints (functional requirements), the causal analysis can be used to
further refine those requirements and to design the control algorithm, the control
loop components, and the feedback necessary to implement them.
Many of the causes of inadequate control actions are so common that they can
be restated as general design principles for safety-critical control loops. The requirement discussed in the previous paragraph, for feedback about whether a command has been executed, is one of these. The rest of this chapter presents those general design
principles.
section 9.3.
Designing for Safety.
Hazard analysis using STPA will identify application-specific safety design constraints that must be enforced by the control algorithm. For the thermal-tile processing robot, a safety constraint identified above is that the manipulator arm must
never be extended if the stabilizer legs are not fully extended. Causal analysis (step 2 of STPA) can identify specific causes for the constraint to be violated, and design
features can be created to eliminate or control them.
More general principles of safe control algorithm functional design can also be
identified by using the general causes of accidents as defined in STAMP (and used
in STPA step 2), general engineering principles, and common design flaws that have
led to accidents in the past.
Accidents related to software or system logic design often result from incompleteness and unhandled cases in the functional design of the controller. This incompleteness can be considered a requirements or functional design problem. Some
requirements completeness criteria were identified in Safeware and specified using
a state machine model. Here those criteria plus additional design criteria are translated into functional design principles for the components of the control loop.
In STAMP, accidents are caused by inadequate control. The controllers can be
human or physical. This section focuses on design principles for the components of
the control loop that are important whether a human is in the loop or not. Section
9.4 describes extra safety-related design principles that apply for systems that
include human controllers. We cannot “design” human controllers, but we can design
the environment or context in which they operate, and we can design the procedures
they use, the control loops in which they operate, the processes they control, and
the training they receive.
section 9.3.1. Controlled Process and Physical Component Design.
Protection against component failure accidents is well understood in engineering.
Principles for safe design of common hardware systems (including sensors and actuators) with standard safety constraints are often systematized and encoded in
checklists for an industry, such as mechanical design or electrical design. In addition,
most engineers have learned about the use of redundancy and overdesign (safety margins) to protect against component failures.
These standard design techniques are still relevant today but provide little or no
protection against component interaction accidents. The added complexity of redundant designs may even increase the occurrence of these accidents. Figure 9.6 shows
the design precedence described in Safeware. The highest precedence is to eliminate
the hazard. If the hazard cannot be eliminated, then its likelihood of occurrence
should be reduced, the likelihood of it leading to an accident should be reduced
and, at the lowest precedence, the design should reduce the potential damage
incurred. Clearly, the higher the precedence level, the more effective and less costly
will be the safety design effort. As there is little that is new here that derives from
using the STAMP causality model, the reader is referred to Safeware and standard
engineering references for more information.
section 9.3.2. Functional Design of the Control Algorithm.
Design for safety involves more than just the physical components; it also includes the control components. We start by considering the design of the control algorithm.
The controller algorithm is responsible for processing inputs and feedback, initializing and updating the process model, and using the process model plus other knowledge and inputs to produce control outputs. Each of these is considered in turn.
Designing and Processing Inputs and Feedback
The basic function of the algorithm is to implement a feedback control loop, as
defined by the controller responsibilities, along with appropriate checks to detect
internal or external failures or errors.
Feedback is critical for safe control. Without feedback, controllers do not know
whether their control actions were received and performed properly or whether the controlled process is in the state their process model assumes.
The controller must be designed to respond appropriately to the arrival of any
possible (i.e., detectable by the sensors) input at any time as well as the lack of an expected input over a given time period. Humans are better (and more flexible)
than automated controllers at this task. Often automation is not designed to handle
input arriving unexpectedly, for example, a target detection report from a radar that
was previously sent a message to shut down.
All inputs should be checked for out-of-range or unexpected values and a
response designed into the control algorithm. A surprising number of losses still
occur due to software not being programmed to handle unexpected inputs.
In addition, the time bounds (minimum and maximum) for every input should be checked and appropriate behavior provided in case the input does not arrive within these bounds. There should also be a response for the non-arrival of an input within a given amount of time (a timeout) for every variable in the process model. The controller must also be designed to respond to excessive inputs (overload conditions) in a safe way.
Because sensors and input channels can fail, there should be a minimum-arrival-rate check for each physically distinct communication path, and the controller
should have the ability to query its environment with respect to inactivity over a
given communication path. Traditionally these queries are called sanity or health
checks. Care needs to be taken, however, to ensure that the design of the response
to a health check is distinct from the normal inputs and that potential hardware
failures cannot impact the sanity checks. As an example of the latter, in June 1980, warnings were received at the U.S. command and control headquarters that a major nuclear attack had been launched against the United States. The military
prepared for retaliation, but the officers at command headquarters were able to
ascertain from direct contact with warning sensors that no incoming missile had
been detected and the alert was canceled. Three days later, the same thing happened again. The false alerts were caused by the failure of a computer chip in a
multiplexor system that formats messages sent out continuously to command posts
indicating that communication circuits are operating properly. This health check
message was designed to report that there were 000 ICBMs and 000 SLBMs
detected. Instead, the integrated circuit failure caused some of the zeros to be
replaced with twos. After the problem was diagnosed, the message formats were
changed to report only the status of the communication system and nothing about
detecting ballistic missiles. Most likely, the developers thought it would be easier to
have one common message format but did not consider the impact of erroneous
hardware behavior.
STAMP identifies inconsistency between the process model and the actual
system state as a common cause of accidents. Besides incorrect feedback, as in the
example early warning system, a common way for the process model to become
inconsistent with the state of the actual process is for the controller to assume that
an output command has been executed when it has not. The TTPS controller, for
example, assumes that because it has sent a command to extend the stabilizer legs,
the legs will, after a suitable amount of time, be extended. If commands cannot be
executed for any reason, including timeouts, controllers have to know about it. To
detect errors and failures in the actuators or controlled process, there should be an
input (feedback) that the controller can use to detect the effect of any output on
the process.
This feedback, however, should not simply be an indication that the command arrived at the controlled process, for example, that the command to open a valve was received by the valve; it should indicate that the valve actually opened. An explosion occurred
in a U.S. Air Force system due to overpressurization when a relief valve failed to
open after the operator sent a command to open it. Both the position indicator
light and open indicator light were illuminated on the control board. Believing
the primary valve had opened, the operator did not open the secondary valve,
which was to be used if the primary valve failed. A post-accident examination
discovered that the indicator light circuit was wired to indicate presence of a signal
at the valve, but it did not indicate valve position. The indicator therefore showed
only that the activation button had been pushed, not that the valve had opened.
An extensive quantitative safety analysis of this design had assumed a low probability of simultaneous failure for the two relief valves, but it ignored the possibility
of a design error in the electrical wiring; the probability of the design error was
not quantifiable. Many other accidents have involved a similar design flaw, including Three Mile Island.
When the feedback associated with an output is received, the controller must be
able to handle the normal response as well as deal with feedback that is missing,
too late, too early, or has an unexpected value.
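A sketch of the distinction these examples illustrate: the process model should be updated from the measured state of the valve, not from the fact that a command signal was sent. The actuator and sensor interfaces and the timeout below are hypothetical.

import time

def open_relief_valve(actuator, position_sensor, timeout_s=5.0):
    actuator.send("OPEN")                        # command issued
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if position_sensor.read() == "OPEN":     # measured valve position, not the command signal
            return True                          # process model may now record valve = OPEN
        time.sleep(0.1)
    # Command not confirmed within the time bound: the process model keeps the valve
    # state as not-open and the controller must respond, for example by commanding
    # the secondary valve.
    return False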
Initializing and Updating the Process Model.
Because the process model is used by the controller to determine what control commands to issue and when, the accuracy of the process model with respect to the
controlled process is critical. As noted earlier, many software-related losses have
resulted from such inconsistencies. STPA will identify which process model variables
are critical to safety; the controller design must ensure that the controller receives
and processes updates for these variables in a timely manner.
Sometimes normal updating of the process model is done correctly by the controller, but problems arise in initialization at startup and after a temporary shutdown. The process model must reflect the actual process state at initial startup and
after a restart. It seems to be common, judging from the number of incidents
and accidents that have resulted, for software designers to forget that the world
continues to change even though the software may not be operating. When the
computer controlling a process is temporarily shut down, perhaps for maintenance
or updating of the software, it may restart with the assumption that the controlled
process is still in the state it was when the software was last operating. In addition,
assumptions may be made about when the operation of the controller will be started,
which may be violated. For example, an assumption may be made that a particular
aircraft system will be powered up and initialized before takeoff and appropriate
default values used in the process model for that case. In the event it was not started
at that time or was shut down and then restarted after takeoff, the default startup
values in the process model may not apply and may be hazardous.
Consider the mobile tile-processing robot at the beginning of this chapter. The
mobile base may be designed to allow manually retracting the stabilizer legs if an
emergency occurs while the robot is servicing the tiles and the robot must be physically moved out of the way. When the robot is restarted, the controller may assume
that the stabilizer legs are still extended and arm movements may be commanded
that would violate the safety constraints.
The use of an unknown value can assist in protecting against this type of design
flaw. At startup and after temporary shutdown, process variables that reflect the
state of the controlled process should be initialized with the value unknown and
updated when new feedback arrives. This procedure will result in resynchronizing
the process model and the controlled process state. The control algorithm must also
account, of course, for the proper behavior in case it needs to use a process model
variable that has the unknown value.
Just as timeouts must be specified and handled for basic input processing as
described earlier, the maximum time the controller will wait for the first input after startup needs to be determined, along with what to do if this time limit is violated. Once
again, while human controllers will likely detect such a problem eventually, such as
a failed input channel or one that was not restarted on system startup, computers
will patiently wait forever if they are not given instructions to detect such a timeout
and to respond to it.
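A minimal sketch of these two startup rules, assuming hypothetical variable names and interfaces: process model variables begin as unknown and are overwritten only by fresh feedback, and the wait for the first input after startup is bounded and has a defined outcome.

import time

UNKNOWN = object()   # sentinel meaning "no feedback received yet"

class ProcessModel:
    def __init__(self, variables):
        self.values = {v: UNKNOWN for v in variables}   # never assume the pre-shutdown state

    def update(self, variable, value):
        self.values[variable] = value

    def require(self, variable):
        value = self.values[variable]
        if value is UNKNOWN:
            raise RuntimeError(variable + " unknown: withhold commands that depend on it")
        return value

def wait_for_first_feedback(model, variable, source, max_wait_s=10.0):
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        reading = source.poll()                 # hypothetical sensor interface
        if reading is not None:
            model.update(variable, reading)
            return True
        time.sleep(0.1)
    return False   # a defined response is needed: alarm, safe state, or retry, not waiting forever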
In general, the system and control loop should start in a safe state. Interlocks may
need to be initialized or checked to be operational at system startup, including
startup after temporarily overriding the interlocks.
Finally, the behavior of the controller with respect to input received before startup, after shutdown, or while the controller is temporarily disconnected from the process (offline) must be considered, and it must be determined whether this information
can be safely ignored or how it will be stored and later processed if it cannot. One
factor in the loss of an aircraft that took off from the wrong runway at Lexington
Airport, for example, is that information about temporary changes in the airport
taxiways was not reflected in the airport maps provided to the crew. The information
about the changes, which was sent by the National Flight Data Center, was received
by the map-provider computers at a time when they were not online, leading to
airport charts that did not match the actual state of the airport. The document
control system software used by the map provider was designed to make reports only of information received during business hours, Monday through Friday.
Producing Outputs.
The primary responsibility of the process controller is to produce commands to
fulfill its control responsibilities. Again, the STPA hazard analysis and safety-guided
design process will produce the application-specific behavioral safety requirements
and constraints on controller behavior to ensure safety. But some general guidelines
are also useful.
One general safety constraint is that the behavior of an automated controller
should be deterministic: it should exhibit only one behavior for arrival of any input in a particular state. While it is easy to design software with nondeterministic behavior, and doing so in some cases actually has advantages from a software point of view, nondeterministic behavior makes testing more difficult and, more important, makes it much harder for humans to learn how an automated system works and to monitor it. If humans are expected to control or monitor an automated system
or an automated controller, then the behavior of the automation should be
deterministic.
Just as inputs can arrive faster than they can be processed by the controller, the
absorption rate of the actuators and recipients of output from the controller must
be considered. Again, the problem usually arises when a fast output device (such as a computer) is providing input to a slower device, such as a human. A contingency action must be designed for the case when the output absorption rate limit is exceeded.
Three additional general considerations in the safe design of controllers are data
age, latency, and fault handling.
Data age. No inputs or output commands are valid forever. The control loop
design must account for inputs that are no longer valid and should not be used by
the controller and for outputs that cannot be executed immediately. All inputs used
in the generation of output commands must be properly limited in the time they
are used and marked as obsolete once that time limit has been exceeded. At the
same time, the design of the control loop must account for outputs that are not
executed within a given amount of time. As an example of what can happen when
data age is not properly handled in the design, an engineer working in the cockpit
of a B-1A aircraft issued a close weapons bay door command during a test. At the
time, a mechanic working on the door had activated a mechanical inhibit on it. The
close door command was not executed, but it remained active. Several hours later,
when the door maintenance was completed, the mechanical inhibit was removed.
The door closed unexpectedly, killing the worker.
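One way to make data age explicit, sketched below with hypothetical names and limits: every pending command carries a validity window, and anything that outlives its window is treated as obsolete rather than remaining active the way the door-close command did.

import time

class TimedCommand:
    # A command that is valid only within a bounded time window.
    def __init__(self, value, valid_for_s):
        self.value = value
        self.expires = time.monotonic() + valid_for_s

    def is_current(self):
        return time.monotonic() < self.expires

def dispatch(command):
    # A pending command that has outlived its validity window is discarded and
    # reported rather than executed.
    if command.is_current():
        return ("EXECUTE", command.value)
    return ("DISCARD_AND_NOTIFY", command.value)

print(dispatch(TimedCommand("CLOSE_WEAPONS_BAY_DOOR", valid_for_s=30.0)))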
Latency. Latency is the time interval during which receipt of new information
cannot change an output even though it arrives prior to the output. While latency
time can be reduced by using various types of design techniques, it cannot be eliminated completely. Controllers need to be informed about the arrival of feedback
affecting previously issued commands and, if possible, provided with the ability to
undo or to mitigate the effects of the now unwanted command.
Fault-handling. Most accidents involve off-nominal processing modes, including
startup and shutdown and fault handling. The design of the control loop should assist
the controller in handling these modes and the designers need to focus particular
attention on them.
The system design may allow for performance degradation and may be designed
to fail into safe states or to allow partial shutdown and restart. Any fail-safe behavior
that occurs in the process should be reported to the controller. In some cases, automated systems have been designed to fail so gracefully that human controllers
are not aware of what is going on until they need to take control and may not be
prepared to do so. Also, hysteresis needs to be provided in the control algorithm
for transitions between off-nominal and nominal processing modes to avoid ping-ponging when the conditions that caused the controlled process to leave the normal
state still exist or recur.
Hazardous functions have special requirements. Clearly, interlock failures should
result in the halting of the functions they are protecting. In addition, the control
algorithm design may differ after failures are detected, depending on whether the
controller outputs are hazard-reducing or hazard-increasing. A hazard-increasing
output is one that moves the controlled process to a more hazardous state, for
example, arming a weapon. A hazard-reducing output is a command that leads to a
reduced risk state, for example, safing a weapon or any other command whose
purpose is to maintain safety.
If a failure in the control loop, such as a sensor or actuator, could inhibit the
production of a hazard-reducing command, there should be multiple ways to trigger
such commands. On the other hand, multiple inputs should be required to trigger
commands that can lead to hazardous states so they are not inadvertently issued.
Any failure should inhibit the production of a hazard-increasing command. As an example of the latter condition, loss of the controller's ability to receive an input, such as through failure of a sensor, that might otherwise inhibit the production of a hazardous output should prevent such an output from being issued.
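A sketch of this asymmetry, using a hypothetical structure: any detected failure in the control loop inhibits hazard-increasing outputs, which also require more than one independent confirmation, while a hazard-reducing output can be triggered by any single available path.

def issue_output(command, is_hazard_increasing, confirmations, detected_failures):
    if is_hazard_increasing:
        # e.g., arming a weapon: block on any detected loop failure and require
        # multiple independent inputs before issuing the command
        if detected_failures or len(confirmations) < 2:
            return "INHIBIT"
        return command
    # hazard-reducing, e.g., safing a weapon: any one trigger path suffices,
    # even in the presence of other failures
    return command if confirmations else "NO_ACTION"

# Example: a sensor failure inhibits arming but does not block safing.
print(issue_output("ARM", True, confirmations=["operator"], detected_failures=["sensor_3"]))   # INHIBIT
print(issue_output("SAFE", False, confirmations=["watchdog"], detected_failures=["sensor_3"])) # SAFE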
section 9.4.
Special Considerations in Designing for Human Controllers.
The design principles in section 9.3 apply when the controller is automated or
human, particularly when designing procedures for human controllers to follow. But
humans do not always follow procedures, nor should they. We use humans to control
systems because of their flexibility and adaptability to changing conditions and to
the incorrect assumptions made by the designers. Human error is an inevitable and
unavoidable consequence. But appropriate design can assist in reducing human
error and increasing safety in human-controlled systems.
Human error is not random. It results from basic human mental abilities and
physical skills combined with the features of the tools being used, the tasks assigned,
and the operating environment. We can use what is known about human mental
abilities and design the other aspects of the system (the tools, the tasks, and the operating environment) to reduce and control human error to a significant degree.
The previous section described general principles for safe design. This section
focuses on additional design principles that apply when humans control, either
directly or indirectly, safety-critical systems.
section 9.4.1. Easy but Ineffective Approaches.
One easy solution for engineers is simply to use human factors checklists. While
many such checklists exist, they often do not distinguish among the qualities they
enhance, which may not be related to safety and may even conflict with safety. The
only way such universal guidelines could be useful is if all design qualities were
complementary and achieved in exactly the same way, which is not the case. Qualities are conflicting and require design tradeoffs and decisions about priorities.
Usability and safety, in particular, are often conflicting; an interface that is easy
to use may not necessarily be safe. As an example, a common guideline is to ensure
that a user must enter data only once and that the computer can access that data if
needed later for the same task or for different tasks. Duplicate entry, however,
is required for the computer to detect entry errors unless the errors are so extreme
that they violate reasonableness criteria. A small slip usually cannot be detected
and such entry errors have led to many accidents. Multiple entry of critical data can
prevent such losses.
As another example, a design that involves displaying data or instructions on a
screen for an operator to check and verify by pressing the enter button minimizes
the typing an operator must do. Over time, however, and after few errors are
detected, operators will get in the habit of pressing the enter key multiple times in
rapid succession. This design feature has been implicated in many losses. For example,
the Therac-25 was a linear accelerator that overdosed multiple patients during radiation therapy. In the original Therac-25 design, operators were required to enter the
treatment parameters at the treatment site as well as on the computer console. After
the operators complained about the duplication, the parameters entered at the
treatment site were instead displayed on the console and the operator needed only
to press the return key if they were correct. Operators soon became accustomed to
pushing the return key quickly the required number of times without checking the
parameters carefully.
The second easy but not very effective solution is to write procedures for human
operators to follow and then assume the engineering job is done. Enforcing the
following of procedures is unlikely, however, to lead to a high level of safety.
Dekker notes what he called the “Following Procedures Dilemma”. Operators must balance adapting procedures in the face of unanticipated conditions against sticking to procedures rigidly when cues suggest they should be
adapted. If human controllers choose the former, that is, they adapt procedures
when it appears the procedures are wrong, a loss may result when the human controller does not have complete knowledge of the circumstances or system state. In
this case, the humans will be blamed for deviations and nonadherence to the procedures. On the other hand, if they stick to procedures (the control algorithm provided) rigidly when the procedures turn out to be wrong, they will be blamed for
their inflexibility and the application of the rules in the wrong context. Hindsight
bias is often involved in identifying what the operator should have known and done.
Insisting that operators always follow procedures does not guarantee safety
although it does usually guarantee that there is someone to blame, either for following the procedures or for not following them, when things go wrong. Safety
comes from controllers being skillful in judging when and how procedures apply. As
discussed in chapter 12, organizations need to monitor adherence to procedures not
simply to enforce compliance but to understand how and why the gap between
procedures and practice grows and to use that information to redesign both the
system and the procedures.
Section 8.5 of chapter 8 describes important differences between human and
automated controllers. One of these differences is that the control algorithm used
by humans is dynamic. This dynamic aspect of human control is why humans are
kept in systems. They provide the flexibility to deviate from procedures when it turns
out the assumptions underlying the engineering design are wrong. But with this
flexibility comes the possibility of unsafe changes in the dynamic control algorithm, which raises new design requirements for engineers and system designers to understand the reason for such unsafe changes and prevent them through appropriate
system design.
Just as engineers have the responsibility to understand the hazards in the physical
systems they are designing and to control and mitigate them, engineers also must
understand how their system designs can lead to human error and how they can
design to reduce errors.
Designing to prevent human error requires some basic understanding about the
role humans play in systems and about human error.
section 9.4.2. The Role of Humans in Control Systems.
Humans can play a variety of roles in a control system. In the simplest cases, they
create the control commands and apply them directly to the controlled process. For
a variety of reasons, particularly speed and efficiency, the system may be designed
with a computer between the human controller and the system. The computer may
exist only in the feedback loop to process and present data to the human operator.
In other systems, the computer actually issues the control instructions with the
human operator either providing high-level supervision of the computer or simply
monitoring the computer to detect errors or problems.
An unanswered question is what the best role is for humans in safety-critical process control. There are three choices beyond direct control: the human can
monitor an automated control system, the human can act as a backup to the automation, or the human and automation can both participate in the control through
some type of partnership. These choices are discussed in depth in Safeware and are
only summarized here.
Unfortunately for the first option, humans make very poor monitors. They cannot
sit and watch something without active control duties for any length of time and
maintain vigilance. Tasks that require little active operator behavior may result in
lowered alertness and can lead to complacency and overreliance on the automation.
Complacency and lowered vigilance are exacerbated by the high reliability and low
failure rate of automated systems.
But even if humans could remain vigilant while simply sitting and monitoring a
computer that is performing the control tasks (and usually doing the right thing),
Bainbridge has noted the irony that automatic control systems are installed because
they can do the job better than humans, but then humans are assigned the task of
monitoring the automated system. Two questions arise:
1. The human monitor needs to know what the correct behavior of the controlled
or monitored process should be; however, in complex modes of operation, for example, where the variables in the process have to follow a particular trajectory over time, evaluating whether the automated control system is performing correctly requires special displays and information that may only be
available from the automated system being monitored. How will human monitors know when the computer is wrong if the only information they have comes
from that computer? In addition, the information provided by an automated
controller is more indirect, which may make it harder for humans to get a clear
picture of the system. Failures may be silent or masked by the automation.
2. If the decisions can be specified fully, then a computer can make them more
quickly and accurately than a human. How can humans monitor such a system?
Whitfield and Ord found, for example, that air traffic controllers' appreciation
of the traffic situation was reduced at the high traffic levels made feasible by
using computers. In such circumstances, humans must monitor the automated controller at some metalevel, deciding whether the computer's decisions are acceptable rather than completely correct. In case of a disagreement,
should the human or the computer be the final arbiter?
Employing humans as backups is equally ineffective. Controllers need to have accurate process models to control effectively, but not being in active control leads to a
degradation of their process models. At the time they need to intervene, it may take
a while to “get their bearings,” in other words, to update their process models so
that effective and safe control commands can be given. In addition, controllers need
both manual and cognitive skills, but both of these decline in the absence of practice.
If human backups need to take over control from automated systems, they may be
unable to do so effectively and safely. Computers are often introduced into safety-critical control loops because they increase system reliability, but at the same time,
that high reliability can provide little opportunity for human controllers to practice
and maintain the skills and knowledge required to intervene when problems
do occur.
It appears, at least for now, that humans will have to provide direct control or
will have to share control with automation unless adequate confidence can be established in the automation to justify eliminating monitors completely. Few systems
exist today where such confidence can be achieved when safety is at stake. The
problem then becomes one of finding the correct partnership and allocation of tasks
between humans and computers. Unfortunately, this problem has not been solved,
although some guidelines are presented later.
One of the things that make the problem difficult is that it is not just a matter of
splitting responsibilities. Computer control is changing the cognitive demands on
human controllers. Humans are increasingly supervising a computer rather than
directly monitoring the process, leading to more cognitively complex decision
making. Automation logic complexity and the proliferation of control modes are
confusing humans. In addition, whenever there are multiple controllers, the requirements for cooperation and communication are increased, not only between the
human and the computer but also between humans interacting with the same computer, for example, the need for coordination among multiple people making entries
to the computer. The consequences can be increased memory demands, new skill
and knowledge requirements, and new difficulties in the updating of the humans'
process models.
A basic question that must be answered and implemented in the design is who
will have the final authority if the human and computers disagree about the proper
control actions. In the loss of an Airbus 320 while landing at Warsaw in 1993, one
of the factors was that the automated system prevented the pilots from activating
the braking system until it was too late to prevent crashing into a bank built at the
end of the runway. This automation feature was a protection device included to
prevent the reverse thrusters accidentally being deployed in flight, a presumed cause
of a previous accident. For a variety of reasons, including water on the runway
causing the aircraft wheels to hydroplane, the criteria used by the software logic to
determine that the aircraft had landed were not satisfied by the feedback received
by the automation. Other incidents have occurred where the pilots have been
confused about who is in control, the pilot or the automation, and found themselves
fighting the automation.
One common design mistake is to set a goal of automating everything and then
leaving some miscellaneous tasks that are difficult to automate for the human controllers to perform. The result is that the operator is left with an arbitrary collection
of tasks for which little thought was given to providing support, particularly support
for maintaining accurate process models. The remaining tasks may, as a consequence,
be significantly more complex and error-prone. New tasks may be added, such as
maintenance and monitoring, that introduce new types of errors. Partial automation,
in fact, may not reduce operator workload but merely change the type of demands
on the operator, leading to potentially increased workload. For example, cockpit
automation may increase the demands on the pilots by creating a lot of data entry
tasks during approach when there is already a lot to do. These automation interaction tasks also create “heads down” work at a time when increased monitoring of
nearby traffic is necessary.
By taking away the easy parts of the operator's job, automation may make the more difficult ones even harder. One causal factor here is that taking away or
changing some operator tasks may make it difficult or even impossible for the operators to receive the feedback necessary to maintain accurate process models.
When designing the automation, these factors need to be considered. A basic
design principle is that automation should be designed to augment human abilities,
not replace them, that is, to aid the operator, not to take over.
To design safe automated controllers with humans in the loop, designers need
some basic knowledge about human error related to control tasks. In fact, Rasmussen has suggested that the term human error be replaced by considering such events
as human-task mismatches.
section 9.4.3. Human Error Fundamentals.
Human error can be divided into the general categories of slips and mistakes [143,
144]. Basic to the difference is the concept of intention or desired action. A mistake
is an error in the intention, that is, an error that occurs during the planning of an
action. A slip, on the other hand, is an error in carrying out the intention. As an
example, suppose an operator decides to push button A. If the operator instead
pushes button B, then it would be called a slip because the action did not match the
intention. If the operator pushed A .(carries out the intention correctly), but it turns
out that the intention was wrong, that is, button A should not have been pushed,
then this is called a mistake.
Designing to prevent slips involves applying different principles than designing
to prevent mistakes. For example, making controls look very different or placing
them far apart from each other may reduce slips, but not mistakes. In general, designing to reduce mistakes is more difficult than reducing slips, which is relatively
straightforward.
One of the difficulties in eliminating planning errors or mistakes is that such
errors are often only visible in hindsight. With the information available at the
time, the decisions may seem reasonable. In addition, planning errors are a necessary side effect of human problem-solving ability. Completely eliminating mistakes
or planning errors .(if possible). would also eliminate the need for humans as
controllers.
Planning errors arise from the basic human cognitive ability to solve problems.
Human error in one situation is human ingenuity in another. Human problem
solving rests on several unique human capabilities, one of which is the ability to
create hypotheses and to test them and thus create new solutions to problems not
previously considered. These hypotheses, however, may be wrong. Rasmussen has
suggested that human error is often simply unsuccessful experiments in an unkind
environment, where an unkind environment is defined as one in which it is not possible for the human to correct the effects of inappropriate variations in performance
before they lead to unacceptable consequences . He concludes that human
performance is a balance between a desire to optimize skills and a willingness to
accept the risk of exploratory acts.
A second basic human approach to problem solving is to try solutions that
worked in other circumstances for similar problems. Once again, this approach is
not always successful but the inapplicability of old solutions or plans .(learned procedures). may not be determinable without the benefit of hindsight.
The ability to use these problem-solving methods provides the advantages of
human controllers over automated controllers, but success is not assured. Designers,
if they understand the limitations of human problem solving, can provide assistance
in the design to avoid common pitfalls and enhance human problem solving. For
example, they may provide ways for operators to obtain extra information or to
test hypotheses safely. At the same time, there are some additional basic human
cognitive characteristics that must be considered.
Hypothesis testing can be described in terms of basic feedback control concepts.
Using the information in the process model, the controller generates a hypothesis
about the controlled process. A test composed of control actions is created to generate feedback useful in evaluating the hypothesis, which in turn is used to update the
process model and the hypothesis.
When controllers have no accurate diagnosis of a problem, they must make provisional assessments of what is going on based on uncertain, incomplete, and often
contradictory information . That provisional assessment will guide their information gathering, but it may also lead to over attention to confirmatory evidence
when processing feedback and updating process models while, at the same time,
discounting information that contradicts their current diagnosis. Psychologists call
this phenomenon cognitive fixation. The alternative is called thematic vagabonding,
where the controller jumps around from explanation to explanation, driven by the
loudest or latest feedback or alarm and never develops a coherent assessment of
what is going on. Only hindsight can determine whether the controller should have
abandoned one explanation for another. Sticking to one assessment can lead to
more progress in many situations than jumping around and not pursuing a consistent
planning process.
Plan continuation is another characteristic of human problem solving related to
cognitive fixation. Commitment to a preliminary diagnosis can lead to sticking with
the original plan even though the situation has changed and calls for a different
plan. Orasanu notes that early cues that suggest an initial plan is correct are
usually very strong and unambiguous, helping to convince people to continue
the plan. Later feedback that suggests the plan should be abandoned is typically
more ambiguous and weaker. Conditions may deteriorate gradually. Even when
controllers receive and acknowledge this feedback, the new information may not
change their plan, especially if abandoning the plan is costly in terms of organizational and economic consequences. In the latter case, it is not surprising that controllers will seek and focus on confirmatory evidence and will need a lot of contradictory
evidence to justify changing their plan.
Cognitive fixation and plan continuation are compounded by stress and fatigue.
These two factors make it more difficult for controllers to juggle multiple hypotheses about a problem or to project a situation into the future by mentally simulating
the effects of alternative plans .
Automated tools can be designed to assist the controller in planning and decision
making, but they must embody an understanding of these basic cognitive limitations
and assist human controllers in overcoming them. At the same time, care must be
taken that any simulation or other planning tools to assist human problem solving
do not rest on the same incorrect assumptions about the system that led to the
problems in the first place.
Another useful distinction is between errors of omission and errors of commission. Sarter and Woods note that in older, less complex aircraft cockpits, most
pilot errors were errors of commission that occurred as a result of a pilot control
action. Because the controller, in this case the pilot, took a direct action, he or she
is likely to check that the intended effect of the action has actually occurred. The
short feedback loops allow the operators to repair most errors before serious
consequences result. This type of error is still the prevalent one for relatively
simple devices.
In contrast, studies of more advanced automation in aircraft find that errors of
omission are the dominant form of error . Here the controller does not implement a control action that is required. The operator may not notice that the automation has done something because that automation behavior was not explicitly
invoked by an operator action. Because the behavioral changes are not expected,
the human controller is less likely to pay attention to relevant indications and
feedback, particularly during periods of high workload.
Errors of omission are related to the change of human roles in systems from
direct controllers to monitors, exception handlers, and supervisors of automated
controllers. As their roles change, the cognitive demands may not be reduced but
instead may change in their basic nature. The changes tend to be more prevalent at
high-tempo and high-criticality periods. So while some types of human errors have
declined, new types of errors have been introduced.
The difficulty and perhaps impossibility of eliminating human error does not
mean that greatly improved system design in this respect is not possible. System
design can be used to take advantage of human cognitive capabilities and to minimize the errors that may result from them. The rest of the chapter provides some
principles to create designs that better support humans in controlling safety-critical
processes and reduce human errors.
section 9.4.4. Providing Control Options.
If the system design goal is to make humans responsible for safety in control systems,
then they must have adequate flexibility to cope with undesired and unsafe behavior
and not be constrained by inadequate control options. Three general design principles apply: design for redundancy, design for incremental control, and design for
error tolerance.
Design for redundant paths. One helpful design feature is to provide multiple
physical devices and logical paths to ensure that a single hardware failure or
software error cannot prevent the operator from taking action to maintain a
safe system state and avoid hazards. There should also be multiple ways to change
from an unsafe to a safe state, but only one way to change from a safe to an
unsafe state.
Design for incremental control. Incremental control makes a system easier to
control, both for humans and computers, by performing critical steps incrementally
rather than in one control action. The common use of incremental arm, aim, fire
sequences is an example. The controller should have the ability to observe the
system and get feedback to test the validity of the assumptions and models upon
which the decisions are made. The system design should also provide the controller
with compensating control actions to allow modifying or aborting previous control
actions before significant damage is done. An important consideration in designing
for controllability in general is to lower the time pressures on the controllers, if
possible.
The design of incremental control algorithms can become complex when a human
controller is controlling a computer, which is controlling the actual physical process,
in a stressful and busy environment, such as a military aircraft. If one of the commands in an incremental control sequence cannot be executed within a specified
period of time, the human operator needs to be informed about any delay or postponement or the entire sequence should be canceled and the operator informed. At
the same time, interrupting the pilot with a lot of messages that may not be critical
at a busy time could also be dangerous. Careful analysis is required to determine
when multistep controller inputs can be preempted or interrupted before they are
complete and when feedback should occur that this happened .
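As a minimal sketch of these ideas, the following Python fragment (with illustrative names and an arbitrary timeout value) shows an arm, aim, fire style sequence in which each step must be commanded separately and in order, and a step that does not complete within its allotted time cancels the whole sequence and informs the operator. It illustrates only the principle, not any particular system's logic.

    import time

    class SequenceCancelled(Exception):
        """Raised when a step cannot be completed within its allotted time."""

    class IncrementalSequence:
        # Steps must be commanded one at a time, in order.
        STEPS = ["arm", "aim", "fire"]

        def __init__(self, step_timeout_s=2.0, notify=print):
            self.step_timeout_s = step_timeout_s   # illustrative value
            self.notify = notify                   # channel used to inform the operator
            self.completed = []                    # steps already carried out

        def command(self, step, execute):
            """Carry out one step; cancel the sequence and inform the operator
            if the step is out of order or does not complete in time."""
            if len(self.completed) == len(self.STEPS):
                self.notify("Sequence already complete.")
                return False
            expected = self.STEPS[len(self.completed)]
            if step != expected:
                self.notify(f"Rejected '{step}': next required step is '{expected}'.")
                return False
            start = time.monotonic()
            succeeded = execute()                  # returns True once the step has taken effect
            if not succeeded or time.monotonic() - start > self.step_timeout_s:
                self.completed.clear()             # abort; the operator must restart deliberately
                self.notify(f"Step '{step}' not completed in time; sequence cancelled.")
                raise SequenceCancelled(step)
            self.completed.append(step)
            self.notify(f"Step '{step}' complete; awaiting next command.")
            return True

The compensating action here is simply cancelling the sequence; a real design would also have to decide, as discussed above, which notifications are important enough to interrupt a busy operator.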
Design for error tolerance. Rasmussen notes that people make errors all the time,
but we are able to detect and correct them before adverse consequences occur .
System design can limit people's ability to detect and recover from their errors. He
defined a system design goal of error tolerant systems. In these systems, errors are
observable .(within an appropriate time limit). and they are reversible before unacceptable consequences occur. The same applies to computer errors: they should be
observable and reversible.
The general goal is to allow controllers to monitor their own performance. To
achieve this goal, the system design needs to:
1. Help operators monitor their actions and recover from errors.
2. Provide feedback about actions operators took and their effects, in case the
actions were inadvertent. Common examples are echoing back operator inputs
or requiring confirmation of intent.
3. Allow for recovery from erroneous actions. The system should provide control
options, such as compensating or reversing actions, and enough time for recovery actions to be taken before adverse consequences result.
Incremental control, as described earlier, is a type of error-tolerant design
technique.
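The sketch below illustrates, in Python with hypothetical names, the three elements listed above: operator inputs are echoed back, confirmation of intent is requested before an action takes effect, and a compensating (undo) action remains available afterward. It is a sketch of the error-tolerance principle under those assumptions, not a complete design.

    class ErrorTolerantConsole:
        """Sketch: echo inputs, confirm intent, and keep actions reversible
        so that errors are observable and recoverable (names illustrative)."""

        def __init__(self, apply_action, reverse_action, confirm, notify=print):
            self.apply_action = apply_action        # takes effect on the process
            self.reverse_action = reverse_action    # compensating action for recovery
            self.confirm = confirm                  # asks the operator to confirm intent
            self.notify = notify
            self.history = []

        def request(self, action):
            self.notify(f"You entered: {action!r}")         # echo back the input
            if not self.confirm(f"Apply {action!r}?"):      # catch inadvertent entries
                self.notify("Action cancelled before taking effect.")
                return False
            self.apply_action(action)
            self.history.append(action)
            self.notify(f"Applied {action!r}; it can still be reversed.")
            return True

        def undo_last(self):
            # Compensating action so an erroneous command can be recovered from.
            if self.history:
                self.reverse_action(self.history.pop())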
section 9.4.5. Matching Tasks to Human Characteristics.
In general, the designer should tailor systems to human requirements instead of
the opposite. Engineered systems are easier to change in their behavior than are
humans.
Because humans without direct control tasks will lose vigilance, the design
should combat lack of alertness by designing human tasks to be stimulating and
varied, to provide good feedback, and to require active involvement of the human
controllers in most operations. Maintaining manual involvement is important, not
just for alertness but also in getting the information needed to update process
models.
Maintaining active engagement in the tasks means that designers must distinguish between providing help to human controllers and taking over. The human
tasks should not be oversimplified and tasks involving passive or repetitive actions
should be minimized. Allowing latitude in how tasks are accomplished will not only
reduce monotony and error proneness, but can introduce flexibility to assist operators in improvising when a problem cannot be solved by only a limited set of behaviors. Many accidents have been avoided when operators jury-rigged devices or
improvised procedures to cope with unexpected events. Physical failures may cause
some paths to become nonfunctional and flexibility in achieving goals can provide
alternatives.
Designs should also be avoided that require or encourage management by exception, which occurs when controllers wait for alarm signals before taking action.
Management by exception does not allow controllers to prevent disturbances by
looking for early warnings and trends in the process state. For operators to anticipate
undesired events, they need to continuously update their process models. Experiments by Swaanenburg and colleagues found that management by exception is not
the strategy adopted by human controllers as their normal supervisory mode .
Avoiding management by exception requires active involvement in the control task
and adequate feedback to update process models. A display that provides only an
overview and no detailed information about the process state, for example, may not
provide the information necessary for detecting imminent alarm conditions.
Finally, if designers expect operators to react correctly to emergencies, they need
to design to support them in these tasks and to help fight some basic human tendencies described previously such as cognitive fixation and plan continuation. The
system design should support human controllers in decision making and planning
activities during emergencies.
section 9.4.6. Designing to Reduce Common Human Errors.
Some human errors are so common and unnecessary that there is little excuse for
not designing to prevent them. Care must be taken though that the attempt to
reduce erroneous actions does not prevent the human controller from intervening
in an emergency when the assumptions made during design about what should and
should not be done turn out to be incorrect.
One fundamental design goal is to make safety-enhancing actions easy, natural,
and difficult to omit or do wrong. In general, the design should make it more difficult
for the human controller to operate unsafely than safely. If safety-enhancing actions
are easy, they are less likely to be bypassed intentionally or accidentally. Stopping
an unsafe action or leaving an unsafe state should be possible with a single keystroke
that moves the system into a safe state. The design should make fail-safe actions
easy and natural, and difficult to avoid, omit, or do wrong.
In contrast, two or more unique operator actions should be required to start any
potentially hazardous function or sequence of functions. Hazardous actions should
be designed to minimize the potential for inadvertent activation; they should not,
for example, be initiated by pushing a single key or button .(see the preceding discussion of incremental control).
The general design goal should be to enhance the ability of the human controller
to act safely while making it more difficult to behave unsafely. Initiating a potentially
unsafe process change, such as a spacecraft launch, should require multiple keystrokes or actions while stopping a launch should require only one.
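A minimal sketch of this asymmetry, in Python with hypothetical names and states: starting the hazardous operation requires two distinct operator actions (arm, then launch), while returning to a safe state always takes a single action.

    class LaunchControl:
        """Sketch of the asymmetry described above: two distinct operator
        actions are needed to start the hazardous operation, one to stop it."""

        def __init__(self):
            self.armed = False
            self.launching = False

        def arm(self):
            self.armed = True              # first, deliberate action

        def launch(self):
            # Second, distinct action; rejected unless arming came first.
            if not self.armed:
                return False
            self.launching = True
            return True

        def abort(self):
            # Returning to the safe state always takes a single action.
            self.launching = False
            self.armed = False

The point is the asymmetry itself; the specific actions and states shown are only placeholders.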
Safety may be enhanced by using procedural safeguards, where the operator is
instructed to take or avoid specific actions, or by designing safeguards into the
system. The latter is much more effective. For example, if the potential error involves
leaving out a critical action, either the operator can be instructed to always take
that action or the action can be made an integral part of the process. A typical error
during maintenance is not to return equipment .(such as safety interlocks). to the
operational mode. The accident sequence at Three Mile Island was initiated by such
an error. An action that is isolated and has no immediate relation to the “gestalt”
of the repair or testing task is easily forgotten. Instead of stressing the need to be
careful .(the usual approach), change the system by integrating the act physically
into the task, make detection a physical consequence of the tool design, or change
operations planning or review. That is, change design or management rather than
trying to change the human .
To enhance decision making, references should be provided for making judgments, such as marking meters with safe and unsafe limits. Because humans often
revert to stereotype and cultural norms, such norms should be followed in design.
Keeping things simple, natural, and similar to what has been done before .(not
making gratuitous design changes). is a good way to avoid errors when humans are
working under stress, are distracted, or are performing tasks while thinking about
something else.
To assist in preventing sequencing errors, controls should be placed in the
sequence in which they are to be used. At the same time, similarity, proximity, interference, or awkward location of critical controls should be avoided. Where operators
have to perform different classes or types of control actions, sequences should be
made as dissimilar as possible.
Finally, one of the most effective design techniques for reducing human error is
to design so that the error is not physically possible or so that errors are obvious.
For example, valves can be designed so they cannot be interchanged by making the
connections different sizes or preventing assembly errors by using asymmetric or
male and female connections. Connection errors can also be made obvious by color
coding. Amazingly, in spite of hundreds of deaths due to misconnected tubes in
hospitals that have occurred over decades, such as a feeding tube inadvertently
connected to a tube that is inserted in a patient's vein, regulators, hospitals, and
tube manufacturers have taken no action to implement this standard safety design
technique .
section 9.4.7. Support in Creating and Maintaining Accurate Process Models.
Human controllers who are supervising automation have two process models to
maintain: one for the process being controlled by the automation and one for the
automated controller itself. The design should support human controllers in maintaining both of these models. An appropriate goal here is to provide humans with
the facilities to experiment and learn about the systems they are controlling, either
directly or indirectly. Operators should also be allowed to maintain manual involvement to update process models, to maintain skills, and to preserve self-confidence.
Simply observing will degrade human supervisory skills and confidence.
When human controllers are supervising automated controllers, the automation
has extra design requirements. The control algorithm used by the automation must
be learnable and understandable. Two common design flaws in automated controllers are inconsistent behavior by the automation and unintended side effects.
Inconsistent Behavior.
Carroll and Olson define a consistent design as one where a similar task or goal is
associated with similar or identical actions . Consistent behavior on the part of
the automated controller makes it easier for the human providing supervisory
control to learn how the automation works, to build an appropriate process model
for it, and to anticipate its behavior.
An example of inconsistency, detected in an A320 simulator study, involved an
aircraft go-around below 100 feet above ground level. Sarter and Woods found that
pilots failed to anticipate and realize that the autothrust system did not arm when
they selected takeoff/go-around .(TOGA). power under these conditions because it
did so under all other circumstances where TOGA power is applied .
Another example of inconsistent automation behavior, which was implicated in
an A320 accident, is a protection function that is provided in all automation configurations except the specific mode .(in this case altitude acquisition). in which the
autopilot was operating .
Human factors for critical systems have most extensively been studied in aircraft
cockpit design. Studies have found that consistency is most important in high-tempo,
highly dynamic phases of flight where pilots have to rely on their automatic systems
to work as expected without constant monitoring. Even in more low-pressure
situations, consistency .(or predictability). is important in light of the evidence from
pilot surveys that their normal monitoring behavior may change on high-tech flight
decks .
Pilots on conventional aircraft use a highly trained instrument-scanning pattern
of recurrently sampling a given set of basic flight parameters. In contrast, some A320
pilots report that they no longer scan but instead allocate their attention within
and across cockpit displays on the basis of expected automation states and behaviors. Parameters that are not expected to change may be neglected for a long time.
If the automation behavior is not consistent, errors of omission may occur
where the pilot does not intervene when necessary.
In section 9.3.2, determinism was identified as a safety design feature for automated controllers. Consistency, however, requires more than deterministic behavior.
If the operator provides the same inputs but different outputs .(behaviors). result for
some reason other than what the operator has done .(or may even know about),
then the behavior is inconsistent from the operator viewpoint even though it is
deterministic. While the designers may have good reasons for including inconsistent
behavior in the automated controller, there should be a careful tradeoff made with
the potential hazards that could result.
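The distinction can be made concrete with a small Python sketch, loosely based on the go-around example above (the names and logic are illustrative, not the actual autothrust design): the controller below is fully deterministic, yet because its response to the same operator input depends on internal state the operator may not see, its behavior is inconsistent from the operator's viewpoint.

    class AutothrustSketch:
        """Deterministic, but inconsistent from the operator's viewpoint:
        the same command has a different outcome depending on internal state
        that the operator may not be monitoring (illustrative only)."""

        def __init__(self):
            self.radio_altitude_ft = 1000.0   # internal state, updated from sensors

        def select_goaround_power(self):
            # Same operator input, different behavior below 100 feet.
            if self.radio_altitude_ft < 100.0:
                return "TOGA power set, autothrust NOT armed"
            return "TOGA power set, autothrust armed"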
Unintended Side Effects.
Incorrect process models can result when an action intended to have one effect has
an additional side effect not easily anticipated by the human controller. An example
occurred in the Sarter and Woods A320 aircraft simulator study cited earlier. Because
the approach to the destination airport is such a busy time for the pilots and the
automation requires so much heads down work, pilots often program the automation as soon as the air traffic controllers assign them a runway. Sarter and Woods
found that the experienced pilots in their study were not aware that entering a
runway change after entering data for the assigned approach results in the deletion
by the automation of all the previously entered altitude and speed constraints, even
though they may still apply.
Once again, there may be good reason for the automation designers to include
such side effects, but they need to consider the potential for human error that
can result.
Mode Confusion.
Modes define mutually exclusive sets of automation behaviors. Modes can be used
to determine how to interpret inputs or to define required controller behavior. Four
general types of modes are common: controller operating modes, supervisory modes,
display modes, and controlled process modes.
Controller operating modes define sets of related behavior in the controller, such
as shutdown, nominal behavior, and fault-handling.
Supervisory modes determine who or what is controlling the component at any
time when multiple supervisors can assume control responsibilities. For example, a
flight guidance system in an aircraft may be issued direct commands by the pilot(s)
or by another computer that is itself being supervised by the pilot(s). The movement
controller in the thermal tile processing system might be designed to be in either
manual supervisory mode .(by a human controller). or automated mode .(by the
TTPS task controller). Coordination of control actions among multiple supervisors
can be defined in terms of these supervisory modes. Confusion about the current
supervisory mode can lead to hazardous system behavior.
A third type of common mode is a display mode. The display mode will
affect the information provided on the display and how the user interprets that
information.
A final type of mode is the operating mode of the controlled process. For example,
the mobile thermal tile processing robot may be in a moving mode .(between work
areas). or in a work mode .(in a work area and servicing tiles, during which time it
may be controlled by a different controller). The value of this mode may determine
whether various operations, for example, extending the stabilizer legs or the
manipulator arm, are safe.
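One way to picture the four mode types is as separate state variables in the controller's design. The sketch below uses Python enumerations and the thermal tile servicing robot as an example; the mode names and the rule at the end are illustrative only.

    from enum import Enum

    class ControllerOperatingMode(Enum):
        SHUTDOWN = 1
        NOMINAL = 2
        FAULT_HANDLING = 3

    class SupervisoryMode(Enum):
        MANUAL = 1        # commanded directly by a human controller
        AUTOMATED = 2     # commanded by the TTPS task controller

    class DisplayMode(Enum):
        OVERVIEW = 1
        DETAIL = 2

    class ProcessMode(Enum):
        MOVING = 1        # traveling between work areas
        WORKING = 2       # stopped in a work area, servicing tiles

    def stabilizer_extension_permitted(process_mode: ProcessMode) -> bool:
        # Illustrative rule only: the controlled-process mode determines whether
        # an operation such as extending the stabilizer legs is safe.
        return process_mode is ProcessMode.WORKING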
Early automated systems had a fairly small number of independent modes. They
provided a passive background on which the operator would act by entering target
data and requesting system operations. They also had only one overall mode setting
for each function performed. Indications of currently active mode and of transitions
between modes could be dedicated to one location on the display.
The consequences of breakdown in mode awareness were fairly small in these
system designs. Operators seemed able to detect and recover from erroneous actions
relatively quickly before serious problems resulted. Sarter and Woods conclude that,
in most cases, mode confusion in these simpler systems is associated with errors
of commission, that is, with errors that require a controller action in order for the
problem to occur . Because the human controller has taken an explicit action,
he or she is likely to check that the intended effect of the action has actually
occurred. The short feedback loops allow the controller to repair most errors quickly,
as noted earlier.
The flexibility of advanced automation allows designers to develop more complicated, mode-rich systems. The result is numerous mode indications often spread
over multiple displays, each containing just that portion of mode status data corresponding to a particular system or subsystem. The designs also allow for interactions across modes. The increased capabilities of automation can, in addition, lead
to increased delays between user input and feedback about system behavior.
These new mode-rich systems increase the need for and difficulty of maintaining
mode awareness, which can be defined in STAMP terms as keeping the controlled-system operating mode in the controller's process model consistent with the actual
controlled system mode. A large number of modes challenges human ability to
maintain awareness of active modes, armed modes, interactions between environmental status and mode behavior, and interactions across modes. It also increases
the difficulty of error or failure detection and recovery.
Calling for systems with fewer or less complex modes is probably unrealistic.
Simplifying modes and automation behavior often requires tradeoffs with precision
or efficiency and with marketing demands from a diverse set of customers .
Systems with accidental .(unnecessary). complexity, however, can be redesigned to
reduce the potential for human error without sacrificing system capabilities. Where
tradeoffs with desired goals are required to eliminate potential mode confusion
errors, system and interface design, informed by hazard analysis, can help find solutions that require the fewest tradeoffs. For example, accidents most often occur
during transitions between modes, particularly normal and nonnormal modes, so
they should have more stringent design constraints applied to them.
Understanding more about particular types of mode confusion errors can assist
with design. Two common types leading to problems are interface interpretation
modes and indirect mode changes.
Interface Interpretation Mode Confusion. Interface mode errors are the classic
form of mode confusion error.
1. Input-related errors. The software interprets user-entered values differently
than intended.
2. Output-related errors. The software maps multiple conditions onto the same
output, depending on the active controller mode, and the operator interprets
the interface incorrectly.
A common example of an input interface interpretation error occurs with many
word processors where the user may think they are in insert mode but instead they
are in insert and delete mode or in command mode and their input is interpreted
in a different way and results in different behavior than they intended.
A more complex example occurred in what is believed to be a cause of an A320
aircraft accident. The crew directed the automated system to fly in the track/flight
path angle mode, which is a combined mode related to both lateral .(track). and
vertical .(flight path angle). navigation.
When they were given radar vectors by the air traffic controller, they may have switched
from the track to the hdg sel mode to be able to enter the heading requested by the
controller. However, pushing the button to change the lateral mode also automatically
changes the vertical mode from flight path angle to vertical speed.the mode switch
button affects both lateral and vertical navigation. When the pilots subsequently entered
“33” to select the desired flight path angle of 3.3 degrees, the automation interpreted their
input as a desired vertical speed of 3,300 feet per minute. This was not intended by the pilots, who were
not aware of the active “interface mode” and failed to detect the problem. As a consequence of the too-steep descent, the airplane crashed into a mountain.
An example of an output interface mode problem was identified by Cook et al.
in a medical operating room device with two operating modes. warmup and normal.
The device starts in warmup mode when turned on and changes from normal mode
to warmup mode whenever either of two particular settings is adjusted by the operator. The meaning of alarm messages and the effect of controls are different in these
two modes, but neither the current device operating mode nor a change in mode is
indicated to the operator. In addition, four distinct alarm-triggering conditions are
mapped onto two alarm messages so that the same message has different meanings
depending on the operating mode. In order to understand what internal condition
triggered the message, the operator must infer which malfunction is being indicated
by the alarm.
Several design constraints can assist in reducing interface interpretation errors.
At a minimum, any mode used to control interpretation of the supervisory interface
should be annunciated to the supervisor. More generally, the current operating
mode of the automation should be displayed at all times. In addition, any change of
operating mode should trigger a change in the current operating mode reflected in
the interface and thus displayed to the operator, that is, the annunciated mode must
be consistent with the internal mode.
A stronger design choice, but perhaps less desirable for various reasons, might
be not to condition the interpretation of the supervisory interface on modes at all.
Another possibility is to simplify the relationships between modes, for example in
the A320, the lateral and vertical modes might be separated with respect to the
heading select mode. Other alternatives are to make the required inputs different
to lessen confusion .(such as 3.3 and 3,300 rather than 33), or the mode indicator
on the control panel could be made clearer as to the current mode. While simply
annunciating the mode may be adequate in some cases, annunciations can easily
be missed for a variety of reasons and additional design features should be
considered.
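The constraints above can be combined, as in the following Python sketch (the values and mode names are illustrative and drawn loosely from the A320 example): every mode change is annunciated, and the two modes require input formats that cannot be mistaken for one another, so an ambiguous entry such as "33" is rejected rather than silently reinterpreted.

    class VerticalGuidancePanel:
        """Sketch: the active interface mode is always annunciated, and each
        mode requires a distinct input format (illustrative values only)."""

        def __init__(self, display=print):
            self.mode = "FLIGHT_PATH_ANGLE"     # or "VERTICAL_SPEED"
            self.display = display

        def set_mode(self, mode):
            self.mode = mode
            self.display(f"ACTIVE MODE: {mode}")    # annunciate every mode change

        def enter_target(self, text):
            if self.mode == "FLIGHT_PATH_ANGLE":
                value = float(text)
                if "." not in text or not 0 < value < 10:
                    raise ValueError("Enter flight path angle as d.d degrees, e.g. 3.3")
                return ("degrees", value)
            value = int(text)                        # whole feet per minute, e.g. 3300
            if not 100 <= value <= 8000:
                raise ValueError("Enter vertical speed in feet per minute, e.g. 3300")
            return ("ft_per_min", value)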
Mode Confusion Arising from Indirect Mode Changes. Indirect mode changes
occur when the automation changes mode without an explicit instruction or direct
command by the operator. Such transitions may be triggered on conditions in the
automation, such as preprogrammed envelope protection. They may also result from
sensor input to the computer about the state of the computer-controlled process,
such as achievement of a preprogrammed target or an armed mode with a preselected mode transition. An example of the latter is a mode in which the autopilot
might command leveling off of the plane once a particular altitude is reached: the
operating mode of the aircraft .(leveling off). is changed when the altitude is reached
without a direct command to do so by the pilot. In general, the problem occurs when
activating one mode can result in the activation of different modes depending on
the system status at the time.
There are four ways to trigger a mode change.
1. The automation supervisor explicitly selects a new mode.
2. The automation supervisor enters data .(such as a target altitude). or a command
that leads to a mode change.
a. Under all conditions.
b. When the automation is in a particular state.
c. When the automation's controlled system model or environment is in a
particular state.
3. The automation supervisor does not do anything, but the automation logic
changes mode as a result of a change in the system it is controlling.
4. The automation supervisor selects a mode change but the automation does
something else, either because of the state of the automation at the time or
the state of the controlled system.
Again, errors related to mode confusion are related to problems that human supervisors of automated controllers have in maintaining accurate process models.
Changes in human controller behavior in highly automated systems, such as the
changes in pilot scanning behavior described earlier, are also related to these types
of mode confusion error.
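One design response, sketched below in Python with illustrative names and thresholds, is to annunciate every mode change and to flag explicitly those transitions that were not directly commanded by the supervisor, so that indirect changes of the kind described above are harder to miss.

    class ModeManager:
        """Sketch: every mode change is annunciated, and changes not directly
        commanded by the supervisor are flagged as indirect (illustrative)."""

        def __init__(self, annunciate=print):
            self.mode = "ALTITUDE_HOLD"
            self.annunciate = annunciate

        def _change(self, new_mode, commanded_by_supervisor):
            old = self.mode
            self.mode = new_mode
            tag = "" if commanded_by_supervisor else " (INDIRECT - not directly commanded)"
            self.annunciate(f"Mode change: {old} -> {new_mode}{tag}")

        def supervisor_selects(self, new_mode):
            self._change(new_mode, commanded_by_supervisor=True)

        def on_altitude_update(self, current_alt_ft, target_alt_ft):
            # Example of an indirect transition driven by process state:
            # approaching the target altitude triggers altitude acquisition mode.
            if abs(current_alt_ft - target_alt_ft) < 200 and self.mode != "ALTITUDE_ACQUISITION":
                self._change("ALTITUDE_ACQUISITION", commanded_by_supervisor=False)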
Behavioral expectations about the automated controller behavior are formed
based on the human supervisor's knowledge of the input to the automation and
on their process models of the automation. Gaps or misconceptions in this model
may interfere with predicting and tracking indirect mode transitions or with understanding the interactions among modes.
An example of an accident that has been attributed to an indirect mode change
occurred while an A320 was landing in Bangalore, India. The pilot's selection
of a lower altitude while the automation was in the altitude acquisition mode
resulted in the activation of the open descent mode, where speed is controlled only
by the pitch of the aircraft and the throttles go to idle. In that mode, the automation
ignores any preprogrammed altitude constraints. To maintain pilot-selected speed
without power, the automation had to use an excessive rate of descent, which led
to the aircraft crashing short of the runway.
Understanding how this could happen is instructive in understanding just how
complex mode logic can get. There are three different ways to activate open descent
mode on the A320.
1. Pull the altitude knob after selecting a lower altitude.
2. Pull the speed knob when the aircraft is in expedite mode.
3. Select a lower altitude while in altitude acquisition mode.
It was the third condition that is suspected to have occurred. The pilot must not
have been aware the aircraft was within 200 feet of the previously entered target
altitude, which triggers altitude acquisition mode. He therefore may not have
expected selection of a lower altitude at that time to result in a mode transition and
did not closely monitor his mode annunciations during this high workload time. He
discovered what happened ten seconds before impact, but that was too late to
recover with the engines at idle .
Other factors contributed to his not discovering the problem until too late, one
of which is the problem in maintaining consistent process models when there are
multiple controllers as discussed in the next section. The pilot flying .(PF). had disengaged his flight director1 during approach and was assuming the pilot not flying
(PNF). would do the same. The result would have been a mode configuration in
which airspeed is automatically controlled by the autothrottle .(the speed mode),
which is the recommended procedure for the approach phase of flight. The PNF
never turned off his flight director, however, and the open descent mode became
active when a lower altitude was selected. This indirect mode change led to the
hazardous state and eventually the accident, as noted earlier. But a complicating
factor was that each pilot only received an indication of the status of his own flight
director and not all the information necessary to determine whether the desired
mode would be engaged. The lack of feedback and resulting incomplete knowledge
of the aircraft state .(incorrect aircraft process model). contributed to the pilots not
detecting the unsafe state in time to correct it.
Indirect mode transitions can be identified in software designs. What to do in
response to identifying them or deciding not to include them in the first place is
more problematic and the tradeoffs and mitigating design features must be considered for each particular system. The decision is just one of the many involving the
benefits of complexity in system design versus the hazards that can result.
footnote. The flight director is automation that gives visual cues to the pilot via an easily interpreted display of
the aircraft's flight path. The preprogrammed path, automatically computed, furnishes the steering commands necessary to obtain and hold a desired path.
Coordination of Multiple Controller Process Models.
When multiple controllers are engaging in coordinated control of a process, inconsistency between their process models can lead to hazardous control actions. Careful
design of communication channels and coordinated activity is required. In aircraft,
this coordination, called crew resource management, is accomplished through careful
design of the roles of each controller to enhance communication and to ensure
consistency among their process models.
A special case of this problem occurs when one human controller takes over
for another. The handoff of information about both the state of the controlled
process and any automation being supervised by the human must be carefully
designed.
Thomas describes an incident involving loss of communication for an extended
time between ground air traffic control and an aircraft . In this incident, a
ground controller had taken over after a controller shift change. Aircraft are passed
from one air traffic control sector to another through a carefully designed set of
exchanges, called a handoff, during which the aircraft is told to switch to the radio
frequency for the new sector. When, after a shift change the new controller gave an
instruction to a particular aircraft and received no acknowledgment, the controller
decided to take no further action; she assumed that the lack of acknowledgment
was an indication that the aircraft had already switched to the new sector and was
talking to the next controller.
Process model coordination during shift changes is partially controlled in a
position relief briefing. This briefing normally covers all aircraft that are currently
on the correct radio frequency or have not checked in yet. When the particular flight
in question was not mentioned in the briefing, the new controller interpreted that
as meaning that the aircraft was no longer being controlled by this station. She did
not call the next controller to verify this status because the aircraft had not been
mentioned in the briefing.
The design of the air traffic control system includes redundancy to try to avoid
errors: if the aircraft does not check in with the next controller, then that controller
would call her. When she saw the aircraft .(on her display). leave her airspace and
no such call was received, she interpreted that as another indication that the aircraft
was indeed talking to the next controller.
A final factor implicated in the loss of communication was that when the new
controller took over, there was little traffic at the aircraft's altitude and no danger
of collision. Common practice for controllers in this situation is to initiate an early
handoff to the next controller. So although the aircraft was only halfway through
her sector, the new controller assumed an early handoff had occurred.
An additional causal factor in this incident involves the way controllers track
which aircraft have checked in and which have already been handed off to the
next controller. The old system was based on printed flight progress strips and
included a requirement to mark the strip when an aircraft had checked in. The
new system uses electronic flight progress strips to display the same information,
but there is no standard method to indicate the check-in has occurred. Instead,
each individual controller develops his or her own personal method to keep track
of this status. In this particular loss of communication case, the controller involved
would type a symbol in a comment area to mark any aircraft that she had already
handed off to the next sector. The controller that was relieved reported that he
usually relied on his memory or checked a box to indicate which aircraft he was
communicating with.
That a carefully designed and coordinated process such as air traffic control can
suffer such problems with coordinating multiple controller process models .(and
procedures). attests to the difficulty of this design problem and the necessity for
careful design and analysis.
section 9.4.8. Providing Information and Feedback.
Designing feedback in general was covered in section 9.3.2. This section covers
feedback design principles specific to human controllers. Important problems in
designing feedback include what information should be provided, how to make the
feedback process more robust, and how the information should be presented to
human controllers.
Types of Feedback.
Hazard analysis using STPA will provide information about the types of feedback
needed and when. Some additional guidance can be provided to the designer, once
again, using general safety design principles.
Two basic types of feedback are needed.
1. The state of the controlled process. This information is used to .(1). update the
controller's process models and .(2). to detect faults and failures in the other
parts of the control loop, system, and environment.
2. The effect of the controllers actions. This feedback is used to detect human
errors. As discussed in the section on design for error tolerance, the key to
making errors observable, and therefore remediable, is to provide feedback
about them. This feedback may be in the form of information about the effects
of controller actions, or it may simply be information about the action itself
on the chance that it was inadvertent.
Updating Process Models.
Updating process models requires feedback about the current state of the system
and any changes that occur. In a system where rapid response by operators is necessary, timing requirements must be placed on the feedback information that the
controller uses to make decisions. In addition, when task performance requires or
implies need for the controller to assess timeliness of information, the feedback
display should include time and date information associated with data.
When a human controller is supervising or monitoring automation, the automation should provide an indication to the controller and to bystanders that it is functioning. The addition of a light to the power interlock example in chapter 8 is a simple
example of this type of feedback. For robot systems, bystanders should be signaled
when the machine is powered up or warning provided when a hazardous zone is
entered. An assumption should not be made that humans will not have to enter the
robots area. In one fully automated plant, an assumption was made that the robots
would be so reliable that the human controllers would not have to enter the plant
often and, therefore, the entire plant could be powered down when entry was
required. The designers did not provide the usual safety features such as elevated
walkways for the humans and alerts, such as aural warnings, when a robot was moving
or about to move. After plant startup, the robots turned out to be so unreliable that
the controllers had to enter the plant and bail them out several times during a shift.
Because powering down the entire plant had such a negative impact on productivity,
the humans got into the habit of entering the automated area of the plant without
powering everything down. The inevitable occurred and someone was killed .
The automation should provide information about its internal state .(such as the
state of sensors and actuators), its control actions, its assumptions about the state
of the system, and any anomalies that might have occurred. Processing requiring
several seconds should provide a status indicator so human controllers can distinguish automated system processing from failure. In one nuclear power plant, the
analog component that provided alarm annunciation to the operators was replaced
with a digital component performing the same function. An argument was made
that a safety analysis was not required because the replacement was “like for like.”
Nobody considered, however, that while the functional behavior might be the same,
the failure behavior could be different. When the previous analog alarm annunciator
failed, the screens went blank and the failure was immediately obvious to the human
operators. When the new digital system failed, however, the screens froze, which was
not immediately apparent to the operators, delaying critical feedback that the alarm
system was not operating.
While the detection of nonevents is relatively simple for automated controllers
(for instance, watchdog timers can be used), such detection is very difficult for
humans. The absence of a signal, reading, or key piece of information is not usually
immediately obvious to humans and they may not be able to recognize that a missing
signal can indicate a change in the process state. In the Turkish Airlines flight TK
1951 accident at Amsterdam's Schiphol Airport in 2009, for example, the pilots did
not notice the absence of a critical mode shift . The design must ensure that lack
of important signals will be registered and noticed by humans.
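For the automated parts of the loop, the watchdog idea mentioned above can convert a nonevent into a positive indication the human can notice. The following Python sketch is one simple form of it; the timeout value and alert message are illustrative.

    import threading

    class FeedbackWatchdog:
        """Sketch: turn a nonevent (missing expected feedback) into a positive
        alert the human controller can notice (timeout value illustrative)."""

        def __init__(self, timeout_s=5.0, alert=print):
            self.timeout_s = timeout_s
            self.alert = alert
            self._timer = None

        def _expired(self):
            self.alert("ALERT: expected feedback not received - sensor, link, or mode change?")

        def feedback_received(self):
            # Call this each time the expected signal arrives; restart the countdown.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout_s, self._expired)
            self._timer.daemon = True
            self._timer.start()

        def stop(self):
            if self._timer is not None:
                self._timer.cancel()

Each expected feedback message restarts the countdown; if nothing arrives within the timeout, the alert fires.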
While safety interlocks are being overridden for test or maintenance, their status
should be displayed to the operators and testers. Before allowing resumption of
normal operations, the design should require confirmation that the interlocks have
been restored. In one launch control system being designed by NASA, the operator
could turn off alarms temporarily. There was no indication on the display, however,
that the alarms had been disabled. If a shift change occurred and another operator
took over the position, the new operator would have no way of knowing that alarms
were not being annunciated.
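A sketch of this principle in Python (names illustrative): overridden interlocks remain continuously visible, so a relieving operator sees them, and normal operations cannot resume until each override has been confirmed restored.

    class InterlockRegistry:
        """Sketch: overridden interlocks stay visible on the display, persist
        across shift changes, and normal operations cannot resume until each
        override is confirmed restored (illustrative)."""

        def __init__(self, display=print):
            self.overridden = set()
            self.display = display

        def override(self, interlock, reason):
            self.overridden.add(interlock)
            self.display(f"OVERRIDDEN: {interlock} ({reason})")

        def confirm_restored(self, interlock):
            self.overridden.discard(interlock)
            self.display(f"RESTORED: {interlock}")

        def status_banner(self):
            # Shown continuously, so a relieving operator sees active overrides.
            if self.overridden:
                return "INTERLOCKS OVERRIDDEN: " + ", ".join(sorted(self.overridden))
            return "All interlocks active"

        def may_resume_normal_operations(self):
            return not self.overridden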
If the information an operator needs to efficiently and safely control the process
is not readily available, controllers will use experimentation to test their hypotheses
about the state of the controlled system. If this kind of testing can be hazardous,
then a safe way for operators to test their hypotheses should be provided rather
than simply forbidding it. Such facilities will have additional benefits in handling
emergencies.
The problem of feedback in emergencies is complicated by the fact that disturbances may lead to failure of sensors. The information available to the controllers
(or to an automated system). becomes increasingly unreliable as the disturbance
progresses. Alternative means should be provided to check safety-critical information as well as ways for human controllers to get additional information the designer
did not foresee would be needed in a particular situation.
Decision aids need to be designed carefully. With the goal of providing assistance
to the human controller, automated systems may provide feedforward .(as well as
feedback). information. Predictor displays show the operator one or more future
states of the process parameters, as well as their present state or value, through a
fast-time simulation, a mathematical model, or other analytic method that projects
forward the effects of a particular control action or the progression of a disturbance
if nothing is done about it.
Incorrect feedforward information can lead to process upsets and accidents.
Humans can become dependent on automated assistance and stop checking
whether the advice is reasonable if few errors occur. At the same time, if the
process .(control algorithm). truly can be accurately predetermined along with all
future states of the system, then it should be automated. Humans are usually kept
in systems when automation is introduced because they can vary their process
models and control algorithms when conditions change or errors are detected in
the original models and algorithms. Automated assistance such as predictor displays may lead to overconfidence and complacency and therefore overreliance by
the operator. Humans may stop performing their own mental predictions and
checks if few discrepancies are found over time. The operator then will begin to
rely on the decision aid.
If decision aids are used, they need to be designed to reduce overdependence
and to support operator skills and motivation rather than to take over functions in
the name of support. Decision aids should provide assistance only when requested
and their use should not become routine. People need to practice making decisions
if we expect them to do so in emergencies or to detect erroneous decisions by
automation.
Detecting Faults and Failures.
A second use of feedback is to detect faults and failures in the controlled system,
including the physical process and any computer controllers and displays. If
the operator is expected to monitor a computer or automated decision making,
then the computer must make decisions in a manner and at a rate that operators
can follow. Otherwise they will not be able to detect faults and failures reliably
in the system being supervised. In addition, the loss of confidence in the automation may lead the supervisor to disconnect it, perhaps under conditions where that
could be hazardous, such as during critical points in the automatic landing of an
airplane. When human supervisors can observe on the displays that proper corrections are being made by the automated system, they are less likely to intervene
inappropriately, even in the presence of disturbances that cause large control
actions.
For operators to anticipate or detect hazardous states, they need to be continuously updated about the process state so that the system progress and dynamic state
can be monitored. Because of the poor ability of humans to perform monitoring
over extended periods of time, they will need to be involved in the task in some
way, as discussed earlier. If possible, the system should be designed to fail obviously
or to make graceful degradation obvious to the supervisor.
The status of safety-critical components or state variables should be highlighted
and presented unambiguously and completely to the controller. If an unsafe condition is detected by an automated system being supervised by a human controller,
then the human controller should be told what anomaly was detected, what action
was taken, and the current system configuration. Overrides of potentially hazardous
failures or any clearing of the status data should not be permitted until all of the
data has been displayed and probably not until the operator has acknowledged
seeing it. A system may have a series of faults that can be overridden safely if they
occur singly, but multiple faults could result in a hazard. In this case, the supervisor
should be made aware of all safety-critical faults prior to issuing an override
command or resetting a status display.
Alarms are used to alert controllers to events or conditions in the process that
they might not otherwise notice. They are particularly important for low-probability
events. The overuse of alarms, however, can lead to management by exception,
overload and the incredulity response.
Designing a system that encourages or forces an operator to adopt a management-by-exception strategy, where the operator waits for alarm signals before taking
action, can be dangerous. This strategy does not allow operators to prevent disturbances by looking for early warning signals and trends in the process state.
The use of computers, which can check a large number of system variables in a
short amount of time, has made it easy to add alarms and to install large numbers
of them. In such plants, it is common for alarms to occur frequently, often five to
seven times an hour . Having to acknowledge a large number of alarms may
leave operators with little time to do anything else, particularly in an emergency
. A shift supervisor at the Three Mile Island .(TMI). hearings testified that the
control room never had less than 52 alarms lit . During the TMI incident, more
than a hundred alarm lights were lit on the control board, each signaling a different
malfunction, but providing little information about sequencing or timing. So many
alarms occurred at TMI that the computer printouts were running hours behind the
events and, at one point jammed, losing valuable information. Brooks claims that
operators commonly suppress alarms in order to destroy historical information
when they need real-time alarm information for current decisions . Too many
alarms can cause confusion and a lack of confidence and can elicit exactly the wrong
response, interfering with the operators ability to rectify the problems causing
the alarms.
Another phenomenon associated with alarms is the incredulity response, which
leads to not believing and ignoring alarms after many false alarms have occurred.
The problem is that in order to issue alarms early enough to avoid drastic countermeasures, the alarm limits must be set close to the desired operating point. This goal
is difficult to achieve for some dynamic processes that have fairly wide operating
ranges, leading to the problem of spurious alarms. Statistical and measurement
errors may add to the problem.
A great deal has been written about alarm management, particularly in the
nuclear power arena, and sophisticated disturbance and alarm analysis systems have
been developed. Those designing alarm systems should be familiar with current
knowledge about such systems. The following are just a few simple guidelines.
1. Keep spurious alarms to a minimum. This guideline will reduce overload and
the incredulity response.
2. Provide checks to distinguish correct from faulty instruments. When response
time is not critical, most operators will attempt to check the validity of the alarm
. Providing information in a form where this validity check can be made
quickly and accurately, and not become a source of distraction, increases the
probability of the operator acting properly.
3. Provide checks on the alarm system itself. The operator has to know whether the
problem is in the alarm or in the system. Analog devices can have simple checks
such as “press to test” for smoke detectors or buttons to test the bulbs in a
lighted gauge. Computer-displayed alarms are more difficult to check; checking
usually requires some additional hardware or redundant information that
does not come through the computer. One complication comes in the form
of alarm analysis systems that check alarms and display a prime cause along
with associated effects. Operators may not be able to perform validity checks
on the complex logic necessarily involved in these systems, leading to overreliance. Wiener and Curry also worry that the priorities might not always
be appropriate in automated alarm analysis and that operators may not recognize this fact.
4. Distinguish between routine and safety-critical alarms. The form of the alarm,
such as auditory cues or message highlighting, should indicate degree or urgency.
Alarms should be categorized as to which are the highest priority.
5. Provide temporal information about events and state changes. Proper decision
making often requires knowledge about the timing and sequencing of events.
Because of system complexity and built-in time delays due to sampling intervals, however, information about conditions or events is not always timely or
even presented in the sequence in which the events actually occurred. Complex
systems are often designed to sample monitored variables at different frequencies: some variables may be sampled every few seconds while, for others, the
intervals may be measured in minutes. Changes that are negated within the
sampling period may not be recorded at all. Events may become separated from
their circumstances, both in sequence and time.
6.•Require corrective action when necessary. When faced with a lot of undigested
and sometimes conflicting information, humans will first try to figure out what
is going wrong. They may become so involved in attempts to save the system
that they wait too long to abandon the recovery efforts. Alternatively, they may
ignore alarms they do not understand or they think are not safety critical. The
system design may need to ensure that the operator cannot clear a safety-critical alert without taking corrective action or without performing subsequent actions required to complete an interrupted operation. The Therac-25, a linear accelerator that massively overdosed multiple patients, allowed operators to proceed with treatment five times after an error message appeared simply by pressing one key. No distinction was made between errors that could be safety-critical and those that were not.
7.•Indicate which condition is responsible for the alarm. System designs with more than one mode, or where more than one condition can trigger the alarm for a mode, must clearly indicate which condition is responsible for
the alarm. In the Therac-25, one message meant that the dosage given was
either too low or too high, without providing information to the operator
about which of these errors had occurred. In general, determining the cause of
an alarm may be difficult. In complex, tightly coupled plants, the point where
the alarm is first triggered may be far away from where the fault actually
occurred. (A brief sketch illustrating guidelines 4, 6, and 7 appears after this list.)
8.•Minimize the use of alarms when they may lead to management by exception. After studying thousands of near accidents reported voluntarily by aircraft crews and ground support personnel, one U.S. government report recommended that the altitude alert signal .(an aural sound). be disabled for all but a few long-distance flights. Investigators found that this signal had caused decreased altitude awareness in the flight crew, resulting in more frequent overshoots: instead of leveling off at 10,000 feet, for example, the aircraft continues to climb or descend until the alarm sounds. A study of such overshoots noted that they rarely occur in bad weather, when the crew is most attentive.
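As promised above, the following minimal Python sketch illustrates guidelines 4, 6, and 7 together: each alarm carries an explicit priority and the specific condition that triggered it, and a safety-critical alarm cannot be cleared until a corrective action is recorded. The class and alarm names are invented for illustration; this is one possible rendering of the guidelines, not a prescribed implementation.

# Illustrative sketch of guidelines 4, 6, and 7: alarms carry a priority and
# the specific triggering condition, and safety-critical alarms cannot be
# cleared until a corrective action is recorded. Names are invented.

from enum import Enum


class Priority(Enum):
    ROUTINE = 1
    SAFETY_CRITICAL = 2


class Alarm:
    def __init__(self, name, priority, condition):
        self.name = name
        self.priority = priority
        self.condition = condition          # which condition actually triggered the alarm
        self.corrective_action = None

    def annunciate(self):
        # Guideline 4: the presentation should reflect the degree of urgency.
        marker = "***" if self.priority is Priority.SAFETY_CRITICAL else "   "
        # Guideline 7: show the responsible condition, not just a generic code.
        print(f"{marker} {self.name}: {self.condition}")

    def clear(self, corrective_action=None):
        # Guideline 6: a safety-critical alarm cannot simply be acknowledged away.
        if self.priority is Priority.SAFETY_CRITICAL and corrective_action is None:
            raise ValueError(f"{self.name} requires a corrective action before clearing")
        self.corrective_action = corrective_action
        return True


if __name__ == "__main__":
    a = Alarm("DOSE_RATE_HIGH", Priority.SAFETY_CRITICAL,
              "measured dose above prescribed limit")
    a.annunciate()
    try:
        a.clear()                            # rejected: no corrective action given
    except ValueError as err:
        print("refused:", err)
    a.clear("beam interrupted and setup re-verified")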
Robustness of the Feedback Process.
Because feedback is so important to safety, robustness must be designed into feedback channels. The problem of feedback in emergencies is complicated by the fact
that disturbances may lead to failure of sensors. The information available to the
controllers .(or to an automated system). becomes increasingly unreliable as the
disturbance progresses.
One way to prepare for failures is to provide alternative sources of information
and alternative means to check safety-critical information. It is also useful for the
operators to get additional information the designers did not foresee would be
needed in a particular situation. The emergency may have occurred because the
designers made incorrect assumptions about the operation of the controlled
system, the environment in which it would operate, or the information needs of the
controller.
If automated controllers provide the only information about the controlled
system state, the human controller supervising the automation can provide little
oversight. The human supervisor must have access to independent sources of information to detect faults and failures, except in the case of a few failure modes such
as total inactivity. Several incidents involving the command and control warning
system at NORAD headquarters in Cheyenne Mountain involved situations where
the computer had bad information and thought the United States was under nuclear
attack. Human supervisors were able to ascertain that the computer was incorrect
through direct contact with the warning sensors .(satellites and radars). This direct
contact showed the sensors were operating and had received no evidence of incoming missiles. The error detection would not have been possible if the humans
could only get information about the sensors from the computer, which had the
wrong information. Many of these direct sensor inputs are being removed in the
mistaken belief that only computer displays are required.
The main point is that human supervisors of automation cannot monitor its performance if the information used in monitoring is not independent from the thing
being monitored. There needs to be provision made for failure of computer displays
or incorrect process models in the software by providing alternate sources of information. Of course, any instrumentation to deal with a malfunction must not be
disabled by the malfunction, that is, common-cause failures must be eliminated or
controlled. As an example of the latter, an engine and pylon came off the wing of
a DC-10, severing the cables that controlled the leading edge flaps and also four hydraulic lines. These failures disabled several warning signals, including a flap mismatch signal and a stall warning light. If the crew had known the slats were
retracted and had been warned of a potential stall, they might have been able to
save the plane.
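The independence principle can be made concrete with a small sketch: the supervisor's check compares the value the automation reports with a reading obtained over a separate channel and treats disagreement, or loss of the independent channel, as a reason to distrust the automated picture. The function and channel names below are invented for illustration, and in a real system the second channel must share no hardware or software with the first.

# Minimal sketch of cross-checking automation output against an independent
# feedback channel. The channels, values, and tolerance are illustrative; the
# second channel must not share components with the first (no common cause).

def cross_check(reported_value, independent_read, tolerance):
    """Return (trustworthy, reason) for the automation's reported value."""
    try:
        direct = independent_read()          # read the sensor over a separate path
    except IOError as err:
        return False, f"independent channel unavailable: {err}"
    if abs(reported_value - direct) > tolerance:
        return False, (f"disagreement: automation reports {reported_value}, "
                       f"independent channel reads {direct}")
    return True, "values agree"


if __name__ == "__main__":
    # Simulated channels: the automation's process model has drifted.
    automation_report = 250.0
    independent_sensor = lambda: 312.5

    ok, reason = cross_check(automation_report, independent_sensor, tolerance=10.0)
    print("trust automation:", ok, "-", reason)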
Displaying Feedback to Human Controllers.
Computer displays are now ubiquitous in providing feedback information to human
controllers, as are complaints about their design.
Many computer displays are criticized for providing too much data .(data overload). where the human controller has to sort through large amounts of data to find
the pieces needed. Information located in different places may then need to
be integrated. Bainbridge suggests that operators should not have to page between
displays to obtain information about abnormal states in the parts of the process
other than the one they are currently thinking about; neither should they have to
page between displays that provide information needed for a single decision
process.
These design problems are difficult to eliminate, but performing a task analysis
coupled with a hazard analysis can assist in better design, as will making all the
information needed for a single decision process visible at the same time, placing
frequently used displays centrally, and grouping displays of information using the
information obtained in the task analysis. It may also be helpful to provide alternative ways to display information or easy ways to request what is needed.
Much has been written about how to design computer displays, although a surprisingly large number of displays still seem to be poorly designed. The difficulty of
such design is increased by the problem that, once again, conflicts can exist. For
example, intuition seems to support providing information to users in a form that
can be quickly and easily interpreted. This assumption is true if rapid reactions are
required. Some psychological research, however, suggests that cognitive processing
for meaning leads to better information retention. A display that requires little
thought and work on the part of the operator may not support acquisition of the
knowledge and thinking skills needed in abnormal conditions.
Once again, the designer needs to understand the tasks the user of the display is
performing. To increase safety, the displays should reflect what is known about how
the information is used and what kinds of displays are likely to cause human error.
Even slight changes in the way information is presented can have dramatic effects
on performance.
The rest of this section concentrates only on a few design guidelines that are
especially important for safety. The reader is referred to the standard literature on
display design for more information.
Safety-related information should be distinguished from non-safety-related
information and highlighted. In addition, when safety interlocks are being overridden, their status should be displayed. Similarly, if safety-related alarms are temporarily inhibited, which may be reasonable to allow so that the operator can deal
with the problem without being continually interrupted by additional alarms, the
inhibit status should be shown on the display. Make warning displays brief and
simple.
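One way to make these requirements concrete is to treat interlock-override and alarm-inhibit status as part of the display state itself, so that it is always rendered regardless of which page the operator is viewing. The sketch below is illustrative only; the field names are assumptions, not part of any particular system.

# Illustrative sketch: the display state always carries interlock-override and
# alarm-inhibit status so it cannot silently disappear from view. Field names
# are invented for the example.

from dataclasses import dataclass, field


@dataclass
class SafetyStatusBanner:
    overridden_interlocks: list = field(default_factory=list)
    inhibited_alarms: list = field(default_factory=list)

    def render(self):
        lines = []
        # Overridden interlocks and inhibited alarms are always shown,
        # even when the rest of the display changes pages.
        if self.overridden_interlocks:
            lines.append("INTERLOCKS OVERRIDDEN: " + ", ".join(self.overridden_interlocks))
        if self.inhibited_alarms:
            lines.append("ALARMS INHIBITED: " + ", ".join(self.inhibited_alarms))
        return lines or ["all interlocks active, no alarms inhibited"]


if __name__ == "__main__":
    banner = SafetyStatusBanner(
        overridden_interlocks=["door interlock"],
        inhibited_alarms=["low-flow warning"],
    )
    for line in banner.render():
        print(line)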
A common mistake is to make all the information displays digital simply because
the computer is a digital device. Analog displays have tremendous advantages for
processing by humans. For example, humans are excellent at pattern recognition,
so providing scannable displays that allow operators to process feedback and diagnose problems using pattern recognition will enhance human performance. A great
deal of information can be absorbed relatively easily when it is presented in the
form of patterns.
Avoid displaying absolute values unless the human requires the absolute values.
It is hard to notice changes such as events and trends when digital values are going
up and down. A related guideline is to provide references for judgment. Often, for
example, the user of the display does not need the absolute value but only the fact
that it is over or under a limit. Showing the value on an analog dial with references
to show the limits will minimize the required amount of extra and error-prone processing by the user. The overall goal is to minimize the need for extra mental processing to get the information the users of the display need for decision making or
for updating their process models.
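As a purely illustrative example of providing references for judgment, the sketch below renders a reading as a position within its allowable band, with an explicit in-limits or out-of-limits indication, rather than as a bare number. The variable names and limits are invented.

# Illustrative sketch: show a reading relative to its limits instead of as a
# bare absolute value, approximating the "analog dial with limit marks" idea.
# The variable names and limits are invented.

def limit_referenced(label, value, low, high, width=30):
    """Render value as a position inside [low, high] with limit markers."""
    span = high - low
    position = (value - low) / span
    clamped = min(max(position, 0.0), 1.0)
    marker = int(round(clamped * (width - 1)))
    bar = "".join("|" if i == marker else "-" for i in range(width))
    status = "OK" if low <= value <= high else "OUT OF LIMITS"
    return f"{label:12s} [{bar}] {status}"


if __name__ == "__main__":
    print(limit_referenced("coolant flow", 72.0, low=60.0, high=90.0))
    print(limit_referenced("pressure", 96.0, low=60.0, high=90.0))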
Another typical problem occurs when computer displays must be requested and
accessed sequentially by the user, which makes greater memory demands upon the
operator, negatively affecting difficult decision-making tasks. With conventional instrumentation, all process information is constantly available to the operator: an
overall view of the process state can be obtained by a glance at the console. Detailed
readings may be needed only if some deviation from normal conditions is detected.
The alternative, a process overview display on a computer console, is more time
consuming to process. To obtain additional information about a limited part of the
process, the operator has to select consciously among displays.
In a study of computer displays in the process industry, Swaanenburg and colleagues found that most operators considered a computer display more difficult to
work with than conventional parallel interfaces, especially with respect to getting
an overview of the process state. In addition, operators felt the computer overview
displays were of limited use in keeping them updated on task changes; instead,
operators tended to rely to a large extent on group displays for their supervisory
tasks. The researchers conclude that a group display, showing different process variables in reasonable detail .(such as measured value, setpoint, and valve position),
clearly provided the type of data operators preferred. Keeping track of the progress
of a disturbance is very difficult with sequentially presented information. One
general lesson to be learned here is that the operators of the system need to be
involved in display design decisions. The designers should not just do what is easiest
to implement or satisfies their aesthetic senses.
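To make the idea of a group display concrete, the following sketch lays several control loops out in parallel, each showing measured value, setpoint, valve position, and deviation at once, so an overview can be obtained at a glance rather than by paging. The loop names and values are invented for the example.

# Illustrative sketch of a group display: several loops shown together, each
# with measured value, setpoint, and valve position, giving a parallel
# overview rather than one loop at a time. Loop names and values are invented.

loops = [
    {"name": "FC-101", "measured": 48.2, "setpoint": 50.0, "valve_pct": 62},
    {"name": "TC-204", "measured": 181.5, "setpoint": 180.0, "valve_pct": 44},
    {"name": "LC-310", "measured": 73.9, "setpoint": 75.0, "valve_pct": 51},
]


def render_group(loops):
    header = f"{'loop':8s} {'measured':>9s} {'setpoint':>9s} {'valve %':>8s} {'dev':>7s}"
    rows = [header]
    for loop in loops:
        deviation = loop["measured"] - loop["setpoint"]
        rows.append(f"{loop['name']:8s} {loop['measured']:9.1f} "
                    f"{loop['setpoint']:9.1f} {loop['valve_pct']:8d} {deviation:7.1f}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_group(loops))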
Whenever possible, software designers should try to copy the standard displays
with which operators have become familiar, and which were often developed for
good psychological reasons, instead of trying to be creative or unique. For example,
icons with a standard interpretation should be used. Researchers have found that
icons often pleased system designers but irritated users. Air traffic controllers,
for example, found the arrow icons for directions on a new display useless and
preferred numbers. Once again, including experienced operators in the design
process and understanding why the current analog displays have developed as they
have will help to avoid these basic types of design errors.
An excellent way to enhance human interpretation and processing is to design
the control panel to mimic the physical layout of the plant or system. For example,
graphical displays allow the status of valves to be shown within the context of piping
diagrams and even the flow of materials. Plots of variables can be shown, highlighting important relationships.
The graphical capabilities of computer displays provide exciting potential for
improving on traditional instrumentation, but the designs need to be based on psychological principles and not just on what appeals to the designer, who may never
have operated a complex process. As Lees has suggested, the starting point should
be consideration of the operator's tasks and problems; the display should evolve as a solution to these.
Operator inputs to the design process as well as extensive simulation and testing
will assist in designing usable computer displays. Remember that the overall goal is
to reduce the mental workload of the human in updating their process models and
to reduce human error in interpreting feedback.
section 9.5.
Summary.
A process for safety-guided design using STPA and some basic principles for safe
design have been described in this chapter. The topic is an important one and more
still needs to be learned, particularly with respect to safe system design for human
controllers. Including skilled and experienced operators in the design process from
the beginning will help, as will performing sophisticated human task analyses rather
than relying primarily on operators interacting with computer simulations.
The next chapter describes how to integrate the disparate information and techniques provided so far in part 3 into a system-engineering process that integrates
safety into the design process from the beginning, as suggested in chapter 6.