1
0
piper/chapter06.raw
2025-03-15 21:11:29 -06:00

350 lines
24 KiB
Plaintext
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

part 3. USING STAMP.
STAMP provides a new theoretical foundation for system safety on which new, more
powerful techniques and tools for system safety can be constructed. Part III presents
some practical methods for engineering safer systems. All the techniques described
in part III have been used successfully on real systems. The surprise to those trying
them has been how well they work on enormously complex systems and how eco-
nomical they are to use. Improvements and even more applications of the theory to
practice will undoubtedly be created in the future.
chapter 6.
Engineering and Operating Safer Systems Using
STAMP.
Part III of this book is for those who want to build safer systems without incurring
enormous and perhaps impractical financial, time, and performance costs. The belief
that building and operating safer systems requires such penalties is widespread and
arises from the way safety engineering is usually done today. It need not be the case.
The use of top-down system safety engineering and safety-guided design based on
STAMP can not only enhance the safety of these systems but also potentially reduce
the costs associated with engineering for safety. This chapter provides an overview,
while the chapters following it provide details about how to implement this cost-
effective safety process.
section 6.1.
Why Are Safety Efforts Sometimes Not Cost-Effective?
While there are certainly some very effective safety engineering programs, too
many expend a large amount of resources with little return on the investment in
terms of improved safety. To fix a problem, we first need to understand it. Why are
safety efforts sometimes not cost-effective? There are five general answers to this
question:
1. Safety efforts may be superficial, isolated, or misdirected.
2. Safety activities often start too late.
3. The techniques used are not appropriate for the systems we are building today
and for new technology.
4. Efforts may be narrowly focused on the technical components.
5. Systems are usually assumed to be static throughout their lifetime.
Superficial, isolated, or misdirected safety engineering activities: Often, safety
engineering consists of performing a lot of very costly and tedious activities of
limited usefulness in improving safety in the final system design. Childs calls this
“cosmetic system safety” [37]. Detailed hazard logs are created and analyses
performed, but these have limited impact on the actual system design. Numbers are
associated with unquantifiable properties. These numbers always seem to support
whatever numerical requirement is the goal, and all involved feel as if they have
done their jobs. The safety analyses provide the answer the customer or designer
wants—that the system is safe—and everyone is happy. Haddon-Cave, in the 2009
Nimrod MR2 accident report, called such efforts compliance only exercises [78]. The
results impact certification of the system or acceptance by management, but despite
all the activity and large amounts of money spent, the safety of the system has been
unaffected.
A variant of this problem is that safety activities may be isolated from the engi-
neers and developers building the system. Too often, safety professionals are sepa-
rated from engineering design and placed within a mission assurance organization.
Safety cannot be assured without its already being part of the design; systems must
be constructed to be safe from the beginning. Separating safety engineering from
design engineering is almost guaranteed to make the effort and resources expended
a poor investment. Safety engineering is effective when it participates in and pro-
vides input to the design process, not when it focuses on making arguments about
the artifacts created after the major safety-related decisions have been made.
Sometimes the major focus of the safety engineering efforts is on creating a safety
case that proves the completed design is safe, often by showing that a particular
process was followed during development. Simply following a process does not
mean that the process was effective, which is the basic limitation of many process
assurance activities. In other cases the arguments go beyond the process, but they
start from the assumption that the system is safe and then focus on showing the
conclusion is true. Most of the effort is spent in seeking evidence that shows the
system is safe while not looking for evidence that the system is not safe. The basic
mindset is wrong, so the conclusions are biased.
One of the reasons System Safety has been so successful is that it takes the oppo-
site approach: an attempt is made to show that the system is unsafe and to identify
hazardous scenarios. By using this alternative perspective, paths to hazards are often
identified that were missed by the engineers, who tend to focus on what they want
to happen, not what they do not want to happen.
If safety-guided design, as defined in part III of this book, is used, the “safety
case” is created along with the design. Developing the certification argument
becomes trivial and consists primarily of simply gathering the documentation that
has been created during the development process.
Safety efforts start too late: Unlike the examples of ineffective safety activities
above, the safety efforts may involve potentially useful activities, but they may start
too late. Frola and Miller claim that 7080 percent of the most critical decisions
related to the safety of the completed system are made during early concept devel-
opment [70]. Unless the safety engineering effort impacts these decisions, it is
unlikely to have much effect on safety. Too often, safety engineers are busy doing
safety analyses, while the system engineers are in parallel making critical decisions
about system design and concepts of operation that are not based on that hazard
analysis. By the time the system engineers get the information generated by the
safety engineers, it is too late to have a significant impact on design decisions.
Of course, engineers normally do try to consider safety early, but the information
commonly available is only whether a particular function is safety-critical or not.
They are told that the function they are designing can contribute to an accident,
with perhaps some letter or numerical “score” of how critical it is, but not much else.
Armed only with this very limited information, they have no choice but to focus
safety design efforts on increasing the components reliability by adding redundancy
or safety margins. These features are often added without careful analysis of whether
they are needed or will be effective for the specific hazards related to that system
function. The design then becomes expensive to build and maintain without neces-
sarily having the maximum possible (or sometimes any) impact on eliminating
or reducing hazards. As argued earlier, redundancy and overdesign, such as building
in safety margins, are effective primarily for purely electromechanical components
and component failure accidents. They do not apply to software and miss component
interaction accidents entirely. In some cases, such design techniques can even
contribute to component interaction accidents when they add to the complexity of
the design.
Most of our current safety engineering techniques start from detailed designs. So
even if they are conscientiously applied, they are useful only in evaluating the safety
of a completed design, not in guiding the decisions made early in the design creation
process. One of the results of evaluating designs after they are created is that engi-
neers are confronted with important safety concerns only after it is too late or too
expensive to make significant changes. If and when the system and component
design engineers get the results of the safety activities, often in the form of a critique
of the design late in the development process, the safety concerns are frequently
ignored or argued away because changing the design at that time is too costly.
Design reviews then turn into contentious exercises where one side argues that the
system has serious safety limitations while the other side argues that those limita-
tions do not exist, they are not serious, or the safety analysis is wrong.
The problem is not a lack of concern by designers; its simply that safety concerns
about their design are raised at a time when major design changes are not possible—
the design engineers have no other option than to defend the design they have.
If they lose that argument, then they must try to patch the current design; starting
over with a safer design is, in almost all cases, impractical. If the designers had the
information necessary to factor safety into their early decision making, then the
process of creating safer designs need cost no more and, in fact, will cost less due
to two factors: (1) reduced rework after the decisions made are found to be flawed
or to provide inadequate safety and (2) less unnecessary overdesign and unneeded
protection.
The key to having a cost-effective safety effort is to embed it into a system
engineering process starting from early concept development and then to design
safety into the system as the design decisions are made. Costs are much less when
safety is built into the system design from the beginning rather than added on or
retrofitted later.
The techniques used are not appropriate for todays systems and new technol-
ogy: The assumptions of the major safety engineering techniques currently used,
almost all of which stem from decades past, do not match the assumptions underlying
the technology and complexity of the systems being built today or the new emerging
causes of accidents: They do not apply to human or software errors or flawed man-
agement decision making, and they certainly do not apply to weaknesses in the
organizational structure or social infrastructure systems. These contributors to acci-
dents do not “fail” in the same way assumed by the current safety analysis tools.
But with no other tools to use, safety engineers attempt to force square pegs into
round holes, hoping this will be sufficient. As a result, nothing much is accomplished
beyond expending time, money, and other resources. Its time we face up to the fact
that new safety engineering techniques are needed to handle those aspects of
systems that go beyond the analog hardware components and the relatively simple
designs of the past for which the current techniques were invented. Chapter 8
describes a new hazard analysis technique based on STAMP, called STPA, but others
are possible. The important thing is to confront these problems head on and not
ignore them and waste our time misapplying or futilely trying to extend techniques
that do not apply to todays systems.
The safety efforts are focused on the technical components of the system: Many
safety engineering (and system engineering, for that matter) efforts focus on the
technical system details. Little effort is made to consider the social, organizational,
and human components of the system in the design process. Assumptions are made
that operators will be trained to do the right things and that they will adapt to
whatever design they are given. Sophisticated human factors and system analysis
input is lacking, and when accidents inevitably result, they are blamed on the opera-
tors for not behaving the way the designers thought they would. To give just one
example (although most accident reports contain such examples), one of the four
causes, all of which cited pilot error, identified in the loss of the American Airlines
B757 near Cali, Colombia (see chapter 2), was “Failure of the flight crew to revert
to basic radio navigation when the FMS-assisted navigation became confusing and
demanded an excessive workload in a critical phase of the flight.” A more useful
alternative statement of the cause might have been “An FMS system that confused
the operators and demanded an excessive workload in a critical phase of flight.”
Virtually all systems contain humans, but engineers are often not taught much
about human factors and draw convenient boundaries around the technical com-
ponents, focusing their attention inside these artificial boundaries. Human factors
experts have complained about the resulting technology-centered automation [208],
where the designers focus on technical issues and not on supporting operator tasks.
The result is what has been called “clumsy” automation that increases the chance
of human error [183, 22, 208]. One of the new assumptions for safety in chapter 2
is that operator “error” is a product of the environment in which it occurs.
A variant of the problem is common in systems using information technology.
Many medical information systems, for example, have not been as successful as they
might have been in increasing safety and have even led to new types of hazards and
losses [104, 140]. Often, little effort is invested during development in considering
the usability of the system by medical professionals or of the impact, not always
positive, that the information system design will have on workflow and on the
practice of medicine.
Automation is commonly assumed to be safer than manual systems because
the hazards associated with the manual systems are eliminated. Inadequate con-
sideration is given to whether new, and maybe even worse, hazards are introduced
by the automated system and how to prevent or minimize these new hazards. The
aviation industry has, for the most part, learned this lesson for cockpit and flight
control design, where eliminating errors of commission simply created new errors
of omission [181, 182] (see chapter 9), but most other industries are far behind in
this respect.
Like other safety-related system properties that are ignored until too late, opera-
tors and human-factors experts often are not brought into the early design process
or they work in isolation from the designers until changes are extremely expensive
to make. Sometimes, human factors design is not considered until after an accident,
and occasionally not even then, almost guaranteeing that more accidents will occur.
To provide cost-effective safety engineering, the system and safety analysis
and design process needs to consider the humans in systems—including those that
are not directly controlling the physical processes—not separately or after the fact
but starting at concept development and continuing throughout the life cycle of
the system.
Systems are assumed to be static throughout their lifetimes: It is rare for engi-
neers to consider how the system will evolve and change over time. While designing
for maintainability may be considered, unintended changes are often ignored.
Change is a constant for all systems: physical equipment ages and degrades over
its lifetime and may not be maintained properly; human behavior and priorities
usually change over time; organizations change and evolve, which means the safety
control structure itself will evolve. Change may also occur in the physical and social
environment within which the system operates and with which it interacts. To be
effective, controls need to be designed that will reduce the risk associated with all
these types of changes. Not only are accidents expensive, but once again planning
for system change can reduce the costs associated with the change itself. In addition,
much of the effort in operations needs to be focused on managing and reacting
to change.
section 6.2.
The Role of System Engineering in Safety.
As the systems we build and operate increase in size and complexity, the use of
sophisticated system engineering approaches becomes more critical. Important
system-level (emergent) properties, such as safety, must be built into the design of
these systems; they cannot be effectively added on or simply measured afterward.
While system engineering was developed originally for technical systems, the
approach is just as important and applicable to social systems or the social compo-
nents of systems that are usually not thought of as “engineered.” All systems are
engineered in the sense that they are designed to achieve specific goals, namely to
satisfy requirements and constraints. So ensuring hospital safety or pharmaceutical
safety, for example, while not normally thought of as engineering problems, falls
within the broad definition of engineering. The goal of the system engineering
process is to create a system that satisfies the mission while maintaining the con-
straints on how the mission is achieved.
Engineering is a way of organizing that design process to achieve the most
cost-effective results. Social systems may not have been “designed” in the sense of
a purposeful design process but may have evolved over time. Any effort to change
such systems in order to improve them, however, can be thought of as a redesign or
reengineering process and can again benefit from a system engineering approach.
When using STAMP as the underlying causality model, engineering or reengineer-
ing safer systems means designing (or redesigning) the safety-control structure and
the controls designed into it to ensure the system operates safely, that is, without
unacceptable losses. What is being controlled—chemical manufacturing processes,
spacecraft or aircraft, public health, safety of the food supply, corporate fraud, risks
in the financial system—is irrelevant in terms of the general process, although
significant differences will exist in the types of controls applicable and the design
of those controls. The process, however, is very similar to a regular system engineer-
ing process.
The problem is that most engineering and even many system engineering tech-
niques were developed under conditions and assumptions that do not hold for
complex social systems, as discussed in part I. But STAMP and new system-theoretic
approaches to safety can point the way forward for both complex technical and
social processes. The general engineering and reengineering process described in
part III applies to all systems.
section 6.3.
A System Safety Engineering Process.
In STAMP, accidents and losses result from not enforcing safety constraints on
behavior. Not only must the original system design incorporate appropriate con-
straints to ensure safe operations, but the safety constraints must continue to be
enforced as changes and adaptations to the system design occur over time. This goal
forms the basis for safe management, development, and operations.
There is no agreed upon best system engineering process and probably cannot
be one—the process needs to match the specific problem and environment in which
it is being used. What is described in part III of this book is how to integrate system
safety into any reasonable system engineering process. Figure 6.1 shows the three
major components of a cost-effective system safety process: management, develop-
ment, and operations.
section 6.3.1. Management.
Safety starts with management leadership and commitment. Without these, the
efforts of others in the organization are almost doomed to failure. Leadership
creates culture, which drives behavior.
Besides setting the culture through their own behavior, managers need to estab-
lish the organizational safety policy and create a safety control structure with appro-
priate responsibilities, accountability and authority, safety controls, and feedback
channels. Management must also establish a safety management plan and ensure
that a safety information system and continual learning and improvement processes
are in place and effective.
Chapter 13 discusses managements role and responsibilities in safety.
section 6.3.2. Engineering Development.
The key to having a cost-effective safety effort is to embed it into a system engineer-
ing process from the very beginning and to design safety into the system as the
design decisions are made. All viewpoints and system components must be included
in the process and information used and documented in a way that is accessible,
understandable, and helpful.
System engineering starts with first determining the goals of the system. Potential
hazards to be avoided are then identified. From the goals and system hazards, a set
of system functional and safety requirements and constraints are identified that set
the foundation for design, operations, and management. Chapter 7 describes how
to establish these fundamentals.
To start safety engineering early enough to be cost-effective, safety must be con-
sidered from the early concept formation stages of development and continue
throughout the life cycle of the system. Design decisions should be guided by safety
considerations while at the same time taking other system requirements and con-
straints into account and resolving conflicts. The hazard analysis techniques used
must not require a completed design and must include all the factors involved
in accidents. Chapter 8 describes a new hazard analysis technique, based on the
STAMP model of causation, that provides the information necessary to design
safety into the system, and chapter 9 shows how to use it in a safety-guided design
process. Chapter 9 also presents general principles for safe design including how to
design systems and system components used by humans that do not contribute to
human error.
Documentation is critical not only for communication in the design and develop-
ment process but also because of inevitable changes over time. That documentation
must include the rationale for the design decisions and traceability from high-level
requirements and constraints down to detailed design features. After the original
system development is finished, the information necessary to operate and maintain
it safely must be passed in a usable form to operators and maintainers. Chapter 10
describes how to integrate safety considerations into specifications and the general
system engineering process.
Engineers have often concentrated more on the technological aspects of system
development while assuming that humans in the system will either adapt to what-
ever is given to them or will be trained to do the “right thing.” When an accident
occurs, it is blamed on the operator. This approach to safety, as argued above, is
one of the reasons safety engineering is not as effective as it could be. The system
design process needs to start by considering the human controller and continuing
that perspective throughout development. The best way to reach that goal is to
involve operators in the design decisions and safety analyses. Operators are
sometimes left out of the conceptual design stages and only brought in later in
development. To design safer systems, operators and maintainers must be included
in the design process starting from the conceptual development stage and con-
siderations of human error and preventing it should be at the forefront of the
design effort.
Many companies, particularly in aerospace, use integrated product teams that
include, among others, design engineers, safety engineers, human factors experts,
potential users of the system (operators), and maintainers. But the development
process used may not necessarily take maximum advantage of this potential for
collaboration. The process outlined in part III tries to do that.
section 6.3.3. Operations.
Once the system is built, it must be operated safely. System engineering creates the
basic information needed to do this in the form of the safety constraints and operat-
ing assumptions upon which the safety of the design was based. These constraints
and assumptions must be passed to operations in a form that they can understand
and use.
Because changes in the physical components, human behavior, and the organiza-
tional safety control structure are almost guaranteed to occur over the life of the
system, operations must manage change in order to ensure that the safety con-
straints are not violated. The requirements for safe operations are discussed in
chapter 12.
Its now time to look at the changes in system engineering, operations, and man-
agement, based on STAMP, that can assist in engineering a safer world.