
chore: add through 14

This commit is contained in:
xuu 2025-03-21 21:30:30 -06:00
parent 48e35621ed
commit 79ff642e56
Signed by: xuu
GPG Key ID: 8B3B0604F164E04F
14 changed files with 10210 additions and 1 deletions

@@ -8,6 +8,13 @@ MP3_FILES := $(patsubst %.txt,%.mp3,$(wildcard *.txt))
MODEL=en_GB-alan-medium.onnx
CONFIG=en_GB-alan-medium.onnx.json
# MODEL=en_GB-aru-medium.onnx
# CONFIG=en_GB-aru-medium.onnx.json
# MODEL=en_GB-cori-high.onnx
# CONFIG=en_GB-cori-high.onnx.json
complete: $(TXT_FILES) $(MP3_FILES)
echo $@ $^

1362
chapter10.raw Normal file

File diff suppressed because it is too large

1238
chapter10.txt Normal file

File diff suppressed because it is too large

1355
chapter11.raw Normal file

File diff suppressed because it is too large

1237
chapter11.txt Normal file

File diff suppressed because it is too large

925
chapter12.raw Normal file

@@ -0,0 +1,925 @@
Chapter 12.
Controlling Safety during Operations.
In some industries, system safety is viewed as having its primary role in development
and most of the activities occur before operations begin. Those concerned with
safety may lose influence and resources after that time. As an example, one of
the chapters in the Challenger accident report, titled “The Silent Safety Program,”
lamented:
Following the successful completion of the orbital flight test phase of the Shuttle program,
the system was declared to be operational. Subsequently, several safety, reliability, and
quality assurance organizations found themselves with reduced and/or reorganized func-
tional capabilities. . . . The apparent reason for such actions was a perception that less
safety, reliability, and quality assurance activity would be required during “routine” Shuttle
operations. This reasoning was faulty.
While safety-guided design eliminates some hazards and creates controls for others,
hazards and losses may still occur in operations due to:
1. Inadequate attempts to eliminate or control the hazards in the system design,
perhaps due to inappropriate assumptions about operations.
2. Inadequate implementation of the controls that designers assumed would exist
during operations.
3. Changes that occur over time, including violation of the assumptions underly-
ing the design.
4. Unidentified hazards, sometimes new ones that arise over time and were not
anticipated during design and development.
Treating operational safety as a control problem requires facing and mitigating these
potential reasons for losses.
A complete system safety program spans the entire life of the system and, in some
ways, the safety program during operations is even more important than during
development. System safety does not stop after development; it is just getting started.
The focus now, however, shifts to the operations safety control structure.
This chapter describes the implications of STAMP on operations. Some topics
that are relevant here are left to the next chapter on management: organizational
design, safety culture and leadership, assignment of appropriate responsibilities
throughout the safety control structure, the safety information system, and corpo-
rate safety policies. These topics span both development and operations and many
of the same principles apply to each, so they have been put into a separate chapter.
A final section of this chapter considers the application of STAMP and systems
thinking principles to occupational safety.
section 12.1.
Operations Based on STAMP.
Applying the basic principles of STAMP to operations means that, like develop-
ment, the goal during operations is enforcement of the safety constraints, this time
on the operating system rather than in its design. Specific responsibilities and control
actions required during operations are outlined in chapter 13.
Figure 12.1 shows the interactions between development and operations. At the
end of the development process, the safety constraints, the results of the hazard
analyses, as well as documentation of the safety-related design features and design
rationale, should be passed on to those responsible for the maintenance and evo-
lution of the system. This information forms the baseline for safe operations. For
example, the identification of safety-critical items in the hazard analysis should be
used as input to the maintenance process for prioritization of effort.
At the same time, the accuracy and efficacy of the hazard analyses performed
during development and the safety constraints identified need to be evaluated using
the operational data and experience. Operational feedback on trends, incidents, and
accidents should trigger reanalysis when appropriate. Linking the assumptions
throughout the system specification with the parts of the hazard analysis based on
those assumptions will assist in performing safety maintenance activities. During field
testing and operations, the links and recorded assumptions and design rationale can
be used in safety change analysis, incident and accident analysis, periodic audits and
performance monitoring as required to ensure that the operational system is and
remains safe.
For example, consider the TCAS requirement that TCAS provide collision avoid-
ance protection for any two aircraft closing horizontally at any rate up to 1,200 knots
and vertically up to 10,000 feet per minute. As noted in the rationale, this require-
ment is based on aircraft performance limits at the time TCAS was created. It is
also based on minimum horizontal and vertical separation requirements. The safety
analysis originally performed on TCAS is based on these assumptions. If aircraft
performance limits change or if there are proposed changes in airspace manage-
ment, as is now occurring with new Reduced Vertical Separation Minimums (RVSM),
hazard analysis to determine the safety of such changes will require the design
rationale and the tracing from safety constraints to specific system design features
as recorded in intent specifications. Without such documentation, the cost of reanal-
ysis could be enormous and in some cases even impractical. In addition, the links
between design and operations and user manuals in level 6 will ease updating when
design changes are made.
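The traceability just described can also be kept in machine-readable form. The
following sketch is purely illustrative and not taken from any intent-specification
tool; the class names and the ASSUM and HA identifiers are hypothetical. It shows
how linking each recorded assumption to the hazard-analysis items that depend on
it lets a proposed change be mapped to the analyses that must be revisited.

# Illustrative sketch (hypothetical identifiers): recorded design assumptions
# linked to the hazard-analysis items that depend on them, so a proposed change
# can be traced to the analyses that need to be redone.
from dataclasses import dataclass, field

@dataclass
class Assumption:
    ident: str                      # e.g., "ASSUM-CLOSING-RATE"
    text: str                       # the recorded assumption and its rationale
    linked_hazards: list = field(default_factory=list)

class IntentSpec:
    def __init__(self):
        self.assumptions = {}

    def add(self, assumption):
        self.assumptions[assumption.ident] = assumption

    def affected_analyses(self, invalidated):
        """Hazard-analysis items whose validity rests on invalidated assumptions."""
        affected = set()
        for ident in invalidated:
            affected.update(self.assumptions[ident].linked_hazards)
        return sorted(affected)

spec = IntentSpec()
spec.add(Assumption(
    ident="ASSUM-CLOSING-RATE",
    text="Aircraft close at no more than 1,200 knots horizontally and "
         "10,000 feet per minute vertically.",
    linked_hazards=["HA-NMAC-01", "HA-NMAC-04"],
))

# A proposed airspace change (e.g., reduced separation minimums) invalidates
# the closing-rate assumption; the affected analyses are retrieved directly.
print(spec.affected_analyses(["ASSUM-CLOSING-RATE"]))   # ['HA-NMAC-01', 'HA-NMAC-04']
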
In a traditional System Safety program, much of this information is found
in or can be derived from the hazard log, but it needs to be pulled out and pro-
vided in a form that makes it easy to locate and use in operations. Recording
design rationale and assumptions in intent specifications allows using that informa-
tion both as the criteria under which enforcement of the safety constraints is
predicated and in the inevitable upgrades and changes that will need to be made
during operations. Chapter 10 shows how to identify and record the necessary
information.
The design of the operational safety controls is based on assumptions about the
conditions during operations. Examples include assumptions about how the opera-
tors will operate the system and the environment (both social and physical) in which
the system will operate. These conditions may change. Therefore, not only must the
assumptions and design rationale be conveyed to those who will operate the system,
but there also need to be safeguards against changes over time that violate those
assumptions.
The changes may be in the behavior of the system itself:
•Physical changes: the equipment may degrade or not be maintained properly.
•Human changes: human behavior and priorities usually change over time.
•Organizational changes: change is a constant in most organizations, including
changes in the safety control structure itself, or in the physical and social envi-
ronment within which the system operates or with which it interacts.
Controls need to be established to reduce the risk associated with all these types of
changes.
The safeguards may be in the design of the system itself or in the design of the
operational safety control structure. Because operational safety depends on the
accuracy of the assumptions and models underlying the design and hazard analysis
processes, the operational system should be monitored to ensure that:
1. The system is constructed, operated, and maintained in the manner assumed
by the designers.
2. The models and assumptions used during initial decision making and design
are correct.
3. The models and assumptions are not violated by changes in the system, such
as workarounds or unauthorized changes in procedures, or by changes in the
environment.
Designing the operations safety control structure requires establishing controls and
feedback loops to (1) identify and handle flaws in the original hazard analysis and
system design and (2) detect unsafe changes in the system during operations
before the changes lead to losses. Changes may be intentional or they may be unin-
tended and simply normal changes in system component behavior or the environ-
ment over time. Whether intended or unintended, system changes that violate the
safety constraints must be controlled.
section 12.2.
Detecting Development Process Flaws during Operations.
Losses can occur due to flaws in the original assumptions and rationale underlying
the system design. Errors may also have been made in the hazard analysis process
used during system design. During operations, three goals and processes to achieve
these goals need to be established:
1. Detect safety-related flaws in the system design and in the safety control
structure, hopefully before major losses, and fix them.
2. Determine what was wrong in the development process that allowed the flaws
to exist and improve that process to prevent the same thing from happening
in the future.
3. Determine whether the identified flaws in the process might have led to other
vulnerabilities in the operational system.
If losses are to be reduced over time and companies are not going to simply
engage in constant firefighting, then mechanisms to implement learning and con-
tinual improvement are required. Identified flaws must not only be fixed (symptom
removal), but the larger operational and development safety control structures must
be improved, as well as the process that allowed the flaws to be introduced in the
first place. The overall goal is to change the culture from a fixing orientation—
identifying and eliminating deviations or symptoms of deeper problems—to a learn-
ing orientation where systemic causes are included in the search for the source of
safety problems [33].
To accomplish these goals, a feedback control loop is needed to regularly track
and assess the effectiveness of the development safety control structure and its
controls. Were hazards overlooked or incorrectly assessed as unlikely or not serious?
Were some potential failures or design errors not included in the hazard analysis?
Were identified hazards inappropriately accepted rather than being fixed? Were the
designed controls ineffective? If so, why?
When numerical risk assessment techniques are used, operational experience can
provide insight into the accuracy of the models and probabilities used. In various
studies of the DC-10 by McDonnell Douglas, the chance of engine power loss with
resulting slat damage during takeoff was estimated to be less than one in a billion
flights. However, this highly improbable event occurred four times in DC-10s in the
first few years of operation without raising alarm bells before it led to an accident
and changes were made. Even one event should have warned someone that the
models used might be incorrect. Surprisingly little scientific evaluation of probabi-
listic risk assessment techniques has ever been conducted [115], yet these techniques
are regularly taught to most engineering students and widely used in industry. Feed-
back loops to evaluate the assumptions underlying the models and the assessments
produced are an obvious way to detect problems.
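The claim that even one event should have raised an alarm can be checked with a
short calculation. The numbers below are illustrative (the flight count is a
hypothetical round figure, not drawn from the original studies): under the claimed
rate of one in a billion flights, even a single occurrence in half a million flights
is already very improbable, and four occurrences are essentially impossible.

# Illustrative calculation with a hypothetical flight count. If the per-flight
# probability really were 1e-9, how likely are one or more, and four or more,
# occurrences in 500,000 flights?
from math import comb

p = 1e-9           # claimed per-flight probability of the event
n = 500_000        # hypothetical number of flights in the early operating years

p_at_least_one = 1 - (1 - p) ** n
print(f"P(>=1 event in {n} flights) = {p_at_least_one:.1e}")   # about 5e-4

def p_exactly(k):
    # Binomial probability of exactly k events under the claimed model.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_four_or_more = 1 - sum(p_exactly(k) for k in range(4))
print(f"P(>=4 events in {n} flights) = {p_four_or_more:.1e}")  # on the order of 1e-15
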
Most companies have an accident/incident analysis process that identifies the
proximal failures that led to an incident, for example, a flawed design of the pressure
relief valve in a tank. Typical follow-up would include replacement of that valve with
an improved design. On top of fixing the immediate problem, companies should
have procedures to evaluate and potentially replace all the uses of that pressure
relief valve design in tanks throughout the plant or company. Even better would be
to reevaluate pressure relief valve design for all uses in the plant, not just in tanks.
But for long-term improvement, a causal analysis—CAST or something similar—
needs to be performed on the process that created the flawed design and that
process improved. If the development process was flawed, perhaps in the hazard
analysis or design and verification, then fixing that process can prevent a large
number of incidents and accidents in the future.
Responsibility for this goal has to be assigned to an appropriate component in
the safety control structure and feedback-control loops established. Feedback may
come from accident and incident reports as well as detected and reported design
and behavioral anomalies. To identify flaws before losses occur, which is clearly
desirable, audits and performance assessments can be used to collect data for vali-
dating and informing the safety design and analysis process without waiting for a
crisis. There must also be feedback channels to the development safety control
structure so that appropriate information can be gathered and used to implement
improvements. The design of these control loops is discussed in the rest of this
chapter. Potential challenges in establishing such control loops are discussed in the
next chapter on management.
section 12.3. Managing or Controlling Change.
Systems are not static but instead are dynamic processes that are continually adapt-
ing to achieve their ends and to react to changes in themselves and their environ-
ment. In STAMP, adaptation or change is assumed to be an inherent part of any
system, particularly those that include humans and organizational components:
Humans and organizations optimize and change their behavior, adapting to the
changes in the world and environment in which the system operates.
To avoid losses, not only must the original design enforce the safety constraints
on system behavior, but the safety control structure must continue to enforce them
as changes to the designed system, including the safety control structure itself, occur
over time.
While engineers usually try to anticipate potential changes and to design for
changeability, the bulk of the effort in dealing with change must necessarily occur
during operations. Controls are needed both to prevent unsafe changes and to detect
them if they occur.
In the friendly fire example in chapter 5, the AWACS controllers stopped handing
off helicopters as they entered and left the no-fly zone. They also stopped using the
Delta Point system to describe flight plans, although the helicopter pilots assumed
the coded destination names were still being used and continued to provide them.
Communication between the helicopters and the AWACS controllers was seriously
degraded although nobody realized it. The basic safety constraint that all aircraft
in the no-fly zone and their locations would be known to the AWACS controllers
became untrue over time as the AWACS controllers optimized their procedures.
This type of change is normal; it needs to be identified by checking that the assump-
tions upon which safety is predicated remain true over time.
The deviation from assumed behavior during operations was not, in the friendly
fire example, detected until after an accident. Obviously, finding the deviations at
this time is less desirable than using audits and other types of feedback mechanisms
to detect hazardous changes, that is, those that violate the safety constraints, before
losses occur. Then something needs to be done to ensure that the safety constraints
are enforced in the future.
Controls are required for both intentional (planned) and unintentional changes.
section 12.3.1. Planned Changes.
Intentional system changes are a common factor in accidents, including physical,
process, and safety control structure changes [115]. The Flixborough explosion pro-
vides an example of a temporary physical change resulting in a major loss: Without
first performing a proper hazard analysis, a temporary pipe was used to replace a
reactor that had been removed to repair a crack. The crack itself was the result of
a previous process modification [54]. The Walkerton water contamination loss in
appendix C provides an example of a control structure change when the government
water testing lab was privatized without considering how that would affect feedback
to the Ministry of the Environment.
Before any planned changes are made, including organizational and safety
control structure changes, their impact on safety must be evaluated. Whether
this process is expensive depends on how the original hazard analysis was per-
formed and particularly how it was documented. Part of the rationale behind the
design of intent specifications was to make it possible to retrieve the information
needed.
While implementing change controls limits flexibility and adaptability, at least in
terms of the time it takes to make changes, the high accident rate associated with
intentional changes attests to the importance of controlling them and the high level
of risk being assumed by not doing so. Decision makers need to understand these
risks before they waive the change controls.
Most systems and industries do include such controls, usually called Management
of Change (MOC) procedures. But the large number of accidents occurring after
system changes without evaluating their safety implies widespread nonenforcement
of these controls. Responsibility needs to be assigned for ensuring compliance with
the MOC procedures so that change analyses are conducted and the results are not
ignored. One way to do this is to reward people for safe behavior when they choose
safety over other system goals and to hold them accountable when they choose to
ignore the MOC procedures, even when no accident results. Achieving this goal, in
turn, requires management commitment to safety (see chapter 13), as does just
about every aspect of building and operating a safe system.
section 12.3.2. Unplanned Changes.
While dealing with planned changes is relatively straightforward (even if difficult
to enforce), unplanned changes that move systems toward states of higher risk are
less straightforward. There need to be procedures established to prevent or detect
changes that impact the ability of the operations safety control structure and the
designed controls to enforce the safety constraints.
As noted earlier, people will tend to optimize their performance over time to
meet a variety of goals. If an unsafe change is detected, it is important to respond
quickly. People incorrectly reevaluate their perception of risk after a period of
success. One way to interrupt this risk-reevaluation process is to intervene quickly
to stop it before it leads to a further reduction in safety margins or a loss occurs.
But that requires an alerting function to provide feedback to someone who is
responsible for ensuring that the safety constraints are satisfied.
At the same time, change is a normal part of any system. Successful systems are
continually changing and adapting to current conditions. Change should be allowed
as long as it does not violate the basic constraints on safe behavior and therefore
increase risk to unacceptable levels. While in the short term relaxing the safety con-
straints may allow other system goals to be achieved to a greater degree, in the longer
term accidents and losses can cost a great deal more than the short-term gains.
The key is to allow flexibility in how safety goals are achieved, but not flexibility
in violating them, and to provide the information that creates accurate risk percep-
tion by decision makers.
Detecting migration toward riskier behavior starts with identifying baseline
requirements. The requirements follow from the hazard analysis. These require-
ments may be general (“Equipment will not be operated above the identified safety-
critical limits” or “Safety-critical equipment must be operational when the system
is operating”) or specifically tied to the hazard analysis (“AWACS operators must
always hand off aircraft when they enter and leave the no-fly zone” or “Pilots must
always follow the TCAS alerts and continue to do so until they are canceled”).
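One way to make such baseline requirements auditable is to state each one as a
check that can be evaluated against operational data. The sketch below is only an
illustration; the requirement identifiers, record fields, and limits are
hypothetical and not drawn from the text.

# Illustrative sketch (hypothetical identifiers, fields, and limits): baseline
# safety requirements expressed as named checks over operational records, so an
# audit can report which requirements were violated and when.
def equipment_within_limits(record):
    # "Equipment will not be operated above the identified safety-critical limits."
    return record["operating_pressure"] <= record["pressure_limit"]

def safety_equipment_operational(record):
    # "Safety-critical equipment must be operational when the system is operating."
    return (not record["system_operating"]) or record["safety_equipment_ok"]

BASELINE_CHECKS = {
    "REQ-LIMITS": equipment_within_limits,
    "REQ-SAFETY-EQUIP": safety_equipment_operational,
}

def audit(records):
    """Return (requirement id, timestamp) pairs for every violation found."""
    violations = []
    for rec in records:
        for req_id, check in BASELINE_CHECKS.items():
            if not check(rec):
                violations.append((req_id, rec["timestamp"]))
    return violations

records = [
    {"timestamp": "day 1", "operating_pressure": 95, "pressure_limit": 100,
     "system_operating": True, "safety_equipment_ok": True},
    {"timestamp": "day 2", "operating_pressure": 104, "pressure_limit": 100,
     "system_operating": True, "safety_equipment_ok": False},
]
print(audit(records))   # [('REQ-LIMITS', 'day 2'), ('REQ-SAFETY-EQUIP', 'day 2')]
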
The next step is to assign responsibility to appropriate places in the safety control
structure to ensure the baseline requirements are not violated, while allowing
changes that do not raise risk. If the baseline requirements make it impossible for
the system to achieve its goals, then instead of waiving them, the entire safety control
structure should be reconsidered and redesigned. For example, consider the foam
shedding problems on the Space Shuttle. Foam had been coming off the external
tank for most of the operational life of the Shuttle. During development, a hazard
had been identified and documented related to the foam damaging the thermal
control surfaces of the spacecraft. Attempts had been made to eliminate foam shed-
ding, but none of the proposed fixes worked. The response was to simply waive the
requirement before each flight. In fact, at the time of the Columbia loss, more than
three thousand potentially critical failure modes were regularly waived on the
pretext that nothing could be done about them and the Shuttle had to fly [74].
More than a third of these waivers had not been reviewed in the ten years before
the accident.
After the Columbia loss, controls and mitigation measures for foam shedding
were identified and implemented, such as changing the fabrication procedures and
adding cameras and inspection and repair capabilities and other contingency actions.
The same measures could, theoretically, have been implemented before the loss of
Columbia. Most of the other waived hazards were also resolved in the aftermath of
the accident. While the operational controls to deal with foam shedding raise the
risk associated with a Shuttle accident above actually fixing the problem, the risk is
lower than simply ignoring the hazards and waiting for them to occur. Understanding and
explicitly accepting risk is better than simply denying and ignoring it.
The NASA safety program and safety control structure had seriously degraded
before both the Challenger and Columbia losses [117]. Waiving requirements
interminably represents an abdication of the responsibility to redesign the system,
including the controls during operations, after the current design is determined to
be unsafe.
Is such a hard line approach impractical? SUBSAFE, the U.S. nuclear submarine
safety program established after the Thresher loss, described in chapter 14, has not
allowed waiving the SUBSAFE safety requirements for more than forty-five years,
with one exception. In 1967, four years after SUBSAFE was established, SUBSAFE
requirements for one submarine were waived in order to satisfy pressing Navy per-
formance goals. That submarine and its crew were lost less than a year later. The
same mistake has not been made again.
If there is absolutely no way to redesign the system to be safe and at the same
time to satisfy the system requirements that justify its existence, then the existence
of the system itself should be rethought and a major replacement or new design
considered. After the first accident, much more stringent and perhaps unacceptable
controls will be forced on operations. While the decision to live with risk is usually
accorded to management, those who will suffer the losses should have a right to
participate in that decision. Luckily, the choice is usually not so stark if flexibility is
allowed in the way the safety constraints are maintained and long-term rather than
short-term thinking prevails.
Like any set of controls, unplanned change controls involve designing appropri-
ate control loops. In general, the process involves identifying the responsibility
of the controller(s); collecting data (feedback); turning the feedback into useful
information (analysis) and updating the process models; generating any necessary
control actions and appropriate communication to other controllers; and measuring
how effective the whole process is (feedback again).
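Stated schematically in code, the loop just described has a simple shape. The
sketch below is an illustration only; every function and data structure in it is a
placeholder standing in for whatever feedback channels, analyses, and control
actions a real organization uses.

# Schematic sketch (all names are placeholders) of the control loop described
# above: collect feedback, turn it into information, update the process model,
# generate control actions, and measure the loop's own effectiveness.
class ProcessModel:
    """The controller's current beliefs about the controlled process."""
    def __init__(self):
        self.findings = []

    def update(self, new_findings):
        self.findings.extend(new_findings)

def analyze(feedback, safety_constraints):
    # Turn raw feedback into findings: here, keep any item that reports a
    # violated safety constraint the controller is responsible for.
    return [item for batch in feedback for item in batch
            if item.get("violated_constraint") in safety_constraints]

def decide_actions(findings):
    # Placeholder control actions, one per finding.
    return ["investigate and correct: " + f["violated_constraint"] for f in findings]

def control_loop_step(channels, model, safety_constraints):
    feedback = [channel() for channel in channels]       # 1. collect data (feedback)
    findings = analyze(feedback, safety_constraints)      # 2. turn it into information
    model.update(findings)                                 #    and update the process model
    actions = decide_actions(findings)                     # 3. generate control actions
    effectiveness = {"findings": len(findings), "actions": len(actions)}  # 4. measure the loop
    return actions, effectiveness

# Hypothetical usage with two feedback channels:
audit_channel = lambda: [{"violated_constraint": "aircraft handoff at no-fly zone boundary"}]
report_channel = lambda: []
actions, eff = control_loop_step(
    [audit_channel, report_channel], ProcessModel(),
    {"aircraft handoff at no-fly zone boundary"})
print(actions)
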
section 12.4. Feedback Channels.
Feedback is a basic part of STAMP and of treating safety as a control problem.
Information flow is key in maintaining safety.
There is often a belief—or perhaps hope—that a small number of “leading indi-
cators” can identify increasing risk of accidents, or, in STAMP terms, migration
toward states of increased risk. It is unlikely that general leading indicators appli-
cable to large industry segments exist or will be useful. The identification of system
safety constraints does, however, provide the possibility of identifying leading
indicators applicable to a specific system.
The desire to predict the future often leads to collecting a large amount of infor-
mation based on the hope that something useful will be obtained and noticed. The
NASA Space Shuttle program was collecting six hundred metrics a month before
the loss of Columbia. Companies often collect data on occupational safety, such as
days without a lost time accident, and they assume that these data reflect on system
safety [17], which of course they do not. Not only is this misuse of data potentially
misleading, but collecting information that may not be indicative of real risk diverts
limited resources and attention from more effective risk-reduction efforts.
Poorly defined feedback can lead to a decrease in safety. As an incentive to
reduce the number of accidents in the California construction industry, for example,
workers with the best safety records—as measured by fewest reported incidents—
were rewarded [126]. The reward created an incentive to withhold information
about small accidents and near misses, so they could not be investigated and their
causes eliminated. Under-reporting of incidents created the illusion that the
system was becoming safer, when in reality only the reporting had been suppressed. The inac-
curate risk perception by management led to not taking the necessary control
actions to reduce risk. Instead, the reporting of accidents should have been rewarded.
Feedback requirements should be determined with respect to the design of the
organization's safety control structure, the safety constraints (derived from the
system hazards) that must be enforced on system operation, and the assumptions
and rationale underlying the system design for safety. They will be similar for dif-
ferent organizations only to the extent that the hazards, safety constraints, and
system design are similar.
The hazards and safety constraints, as well as the causal information derived by
the use of STPA, form the foundation for determining what feedback is necessary
to provide the controllers with the information they need to satisfy their safety
responsibilities. In addition, there must be mechanisms to ensure that feedback
channels are operating effectively.
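A feedback channel that has gone silent is itself a warning: it may mean there is
nothing to report, or that reports are no longer reaching the controller. The
sketch below is an illustrative way to monitor the channels themselves, separate
from what the feedback says; the channel names and expected reporting gaps are
hypothetical.

# Illustrative sketch (hypothetical channels and thresholds): flag feedback
# channels whose silence has exceeded the longest gap considered normal.
from datetime import datetime, timedelta

EXPECTED_MAX_GAP = {
    "incident_reports": timedelta(days=30),
    "audit_findings": timedelta(days=180),
    "maintenance_anomalies": timedelta(days=14),
}

def stale_channels(last_received, now):
    """Channels whose time since the last item exceeds the expected gap."""
    return [name for name, last in last_received.items()
            if now - last > EXPECTED_MAX_GAP[name]]

last_received = {
    "incident_reports": datetime(2010, 1, 5),
    "audit_findings": datetime(2010, 2, 20),
    "maintenance_anomalies": datetime(2009, 11, 30),
}
print(stale_channels(last_received, now=datetime(2010, 3, 1)))
# ['incident_reports', 'maintenance_anomalies']
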
The feedback is used to update the controllers' process models and understand-
ing of the risks in the processes they are controlling, to update their control algo-
rithms, and to execute appropriate control actions.
Sometimes, cultural problems interfere with feedback about the state of the
controlled process. If the culture does not encourage sharing information and if
there is a perception that the information can be used in a way that is detrimental
to those providing it, then cultural changes will be necessary. Such changes require
leadership and freedom from blame (see “Just Culture” in chapter 13). Effective
feedback collection requires that those making the reports are convinced that the
information will be used for constructive improvements in safety and not as a basis
for criticism or disciplinary action. Resistance to airing dirty laundry is understand-
able, but this quickly transitions into an organizational culture where only good
news is passed on for fear of retribution. Everyone's past experience includes indi-
vidual mistakes, and avoiding repeating the same mistakes requires a culture that
encourages sharing.
Three general types of feedback are commonly used: audits and performance
assessments; reporting systems; and anomaly, incident, and accident investigation.
section 12.4.1. Audits and Performance Assessments.
Once again, audits and performance assessments should start from the safety con-
straints and design assumptions and rationale. The goal should be to determine
whether the safety constraints are being enforced in the operation of the system
and whether the assumptions underlying the safety design and rationale are still
true. Audits and performance assessments provide a chance to detect whether the
behavior of the system and the system components still satisfies the safety con-
straints and whether the way the controllers think the system is working—as
reflected in their process models—is accurate.
The entire safety control structure must be audited, not just the lower-level pro-
cesses. Auditing the upper levels of the organization will require buy-in and com-
mitment from management and an independent group at a high enough level to
control audits as well as explicit rules for conducting them.
Audits are often less effective than they might be. When auditing is performed
through contracts with independent companies, there may be subtle pressures on
the audit team to be unduly positive or less than thorough in order to maintain their
customer base. In addition, behavior or conditions may be changed in anticipation
of an audit and then revert to their normal state immediately afterward.
Overcoming these limitations requires changes in organizational culture and
in the use of the audit results. Safety controllers (managers) must feel personal
responsibility for safety. One way to encourage this view is to trust them and expect
them to be part of the solution and to care about safety. “Safety is everyone's
responsibility” must be more than an empty slogan; it must be part of the orga-
nizational culture.
A participatory audit philosophy can have an important impact on these cultural
goals. Some features of such a philosophy are:
1. Audits should not be punitive. Audits need to be viewed as a chance to improve
safety and to evaluate the process rather than a way to evaluate employees.
2. To increase buy-in and commitment, those controlling the processes being
audited should participate in creating the rules and procedures and understand
the reasons for the audit and how the results will be used. Everyone should
have a chance to learn from the audit without it having negative consequences—
it should be viewed as an opportunity to learn how to improve.
3. People from the process being audited should participate on the audit team. In
order to get an outside but educated view, using process experts from other
parts of the organization not directly being audited is a better approach than
using outside audit companies. Various stakeholders in safety may be included
such as unions. The goal should be to inculcate the attitude that this is our audit
and a chance to improve our practices. Audits should be treated as a learning
experience for everyone involved—including the auditors.
4. Immediate feedback should be provided and solutions discussed. Often audit
results are not available until after the audit and are presented in a written
report. Feedback and discussion with the audit team during the audit are dis-
couraged. One of the best times to discuss problems found and how to design
solutions, however, is when the team is together and on the spot. Doing this
will also reinforce the understanding that the goal is to improve the process,
not to punish or evaluate those involved.
5. All levels of the safety control structure should be audited, along with the
physical process and its immediate operators. Accepting being audited and
implementing improvements as a result—that is, leading by example—is a
powerful way for leaders to convey their commitment to safety and to its
improvement.
6. A part of the audit should be to determine the level of safety knowledge and
training that actually exists, not what managers believe exists or what exists in
the training programs and user manuals. These results can be fed back into the
training materials and education programs. Under no circumstances, of course,
should such assessments be used in a negative way or one that is viewed as
punitive by those being assessed.
Because these rules for audits are so far from common practice, they may be
viewed as unrealistic. But this type of audit is carried out today with great success.
See chapter 14 for an example. The underlying philosophy behind these practices is
that most people do not want to harm others and have an innate belief in safety as a
goal. The problems arise when other goals are rewarded or emphasized over safety.
When safety is highly valued in an organizational culture, obtaining buy-in is usually
not difficult. The critical step lies in conveying that commitment.
section 12.4.2. Anomaly, Incident, and Accident Investigation.
Anomaly, incident, and accident investigations often focus on a single “root” cause
and look for contributory causes near the events. The belief that there is a root cause,
sometimes called root cause seduction [32], is powerful because it provides an illu-
sion of control. If the root cause can simply be eliminated and if that cause is low
in the safety control structure, then changes can easily be made that will eliminate
accidents without implicating management or requiring changes that are costly or
disruptive to the organization. The result is that physical design characteristics or
low-level operators are usually identified as the root cause.
Causality is, however, much more complex than this simple but very entrenched
belief, as has been argued throughout this book. To effect high-leverage policies and
changes that are able to prevent large classes of future losses, the weaknesses in the
entire safety control structure related to the loss need to be identified and the
control structure redesigned to be more effective.
In general, effective learning from experience requires a change from a fixing
orientation to a continual learning and improvement culture. To create such a
culture requires high-level leadership by management, and sometimes organiza-
tional changes.
Chapter 11 describes a way to perform better analyses of anomalies, incidents,
and accidents. But having a process is not enough; the process must be embedded
in an organizational structure that allows the successful exploitation of that process.
Two important organizational factors will impact the successful use of CAST: train-
ing and follow-up.
Applying systems thinking to accident analysis requires training and experience.
Large organizations may be able to train a group of investigators or teams to
perform CAST analyses. This group should be managerially and financially inde-
pendent. Some managers prefer to have accident/incident analysis reports focus on
the low-level system operators and physical processes, and prefer that the reports never go
beyond those factors. In other cases, those involved in accident analysis, while well-
meaning, have too limited a view to provide the perspective required to perform an
adequate causal analysis. Even when intentions are good and local skills and knowl-
edge are available, budgets may be so tight and pressures to maintain performance
schedules so high that it is difficult to find the time and resources to do a thorough
causal analysis using local personnel. Trained teams with independent budgets
can overcome some of these obstacles. But while the leaders of investigations and
causal analysis can be independent, participation by those with local knowledge is
also important.
A second requirement is follow-up. Often the process stops after recommenda-
tions are made and accepted. No follow-up is provided to ensure that the recom-
mendations are implemented or that the implementations were effective. Deadlines
and assignment of responsibility for implementing recommendations, as well as responsi-
bility for ensuring that they are carried out, are required. The findings in the causal analysis
should be an input to future audits and performance assessments. If the same or
similar causes recur, then that itself requires an analysis of why the problem was
not fixed when it was first detected. Was the fix unsuccessful? Did the system migrate
back to the same high-risk state because the underlying causal factors were never
successfully controlled? Were factors missed in the original causal analysis? Trend
analysis is important to ensure that progress is being made in controlling safety.
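A simple form of this trend analysis is to track how often the same systemic
causal factor reappears across investigations, since recurrence means the earlier
fix either failed or was never really implemented. The sketch below is
illustrative; the investigation identifiers and factor labels are hypothetical.

# Illustrative sketch (hypothetical data): count causal factors across
# investigations and flag any factor that recurs.
from collections import Counter

investigations = [
    {"id": "INV-07", "causal_factors": ["inadequate MOC review", "missing feedback channel"]},
    {"id": "INV-11", "causal_factors": ["inadequate MOC review"]},
    {"id": "INV-15", "causal_factors": ["inadequate MOC review", "training gap"]},
]

counts = Counter(factor for inv in investigations for factor in inv["causal_factors"])
recurring = [factor for factor, n in counts.items() if n >= 2]
print(recurring)   # ['inadequate MOC review']
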
section 12.4.3. Reporting Systems.
Accident reports very often note that before a loss, someone detected an anomaly
but never reported it using the official reporting system. The response in accident
investigation reports is often to recommend that the requirement to use reporting
systems be emphasized to personnel or to provide additional training in using them.
This response may be effective for a short time, but eventually people revert
to their prior behavior. A basic assumption about human behavior in this book (and
in systems approaches to human factors) is that human behavior can usually be
explained by looking at the system in which the human is operating. The reason in
the system design for the behavior must be determined and changed: Simply trying
to force people to behave in ways that are unnatural for them will usually be
unsuccessful.
So the first step is to ask why people do not use reporting systems and then to fix
those factors. One obvious reason is that the systems may be designed poorly. They may
require extra, time-consuming steps, such as logging into a web-based system, that
are not part of their normal operating procedures or environment. Once they
get to the website, they may be faced with a poorly designed form that requires
them to provide a lot of extraneous information or does not allow the flexibility
necessary to enter the information they want to provide.
A second reason people do not report is that the information they provided in
the past appeared to go into a black hole, with nobody responding to it. There is
little incentive to continue to provide information under these conditions, particu-
larly when the reporting system is time-consuming and awkward to use.
A final reason for lack of reporting is a fear that the information provided may
be used against them or there are other negative repercussions such as a necessity
to spend time filling out additional reports.
Once the reason for failing to use reporting systems is understood, the solutions
usually become obvious. For example, the system may need to be redesigned so it
is easy to use and integrated into normal work procedures. As an example, email is
becoming a primary means of communication at work. The natural first response on
finding a problem is to contact those who can fix it, not to report it to some database
where there is no assurance it will be processed quickly or get to the right people.
A successful solution to this problem used on one large air traffic control system
was to require only that the reporter add an extra “cc:” on their emails in order to
get it reported officially to safety engineering and those responsible for problem
reports [94].
In addition, the receipt of a problem report should result in both an acknowledg-
ment of receipt and a thank-you. Later, when a resolution is identified, information
should be provided to the reporter of the problem about what was done about it.
If there is no resolution within a reasonable amount of time, that too should be
acknowledged. There is little incentive to use reporting systems if the reporters do
not think the information will be acted upon.
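The acknowledgment and follow-up described above can be built into the intake
step itself. The following is a minimal sketch with hypothetical message text and
deadlines; it is not a description of any particular reporting system.

# Minimal illustrative sketch (hypothetical): every report is acknowledged on
# receipt, and the reporter is notified when it is resolved or when it remains
# open past a deadline.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Report:
    reporter: str
    text: str
    received: date
    resolution: Optional[str] = None

def acknowledge(report):
    # Placeholder for an email or message to the reporter.
    print(f"To {report.reporter}: thank you, your report was received on {report.received}.")

def follow_up(report, today, deadline=timedelta(days=30)):
    if report.resolution is not None:
        print(f"To {report.reporter}: your report was resolved: {report.resolution}")
    elif today - report.received > deadline:
        print(f"To {report.reporter}: your report is still open and has not been forgotten.")

r = Report(reporter="operator_17",
           text="handoff procedure skipped on night shift",
           received=date(2010, 3, 1))
acknowledge(r)
follow_up(r, today=date(2010, 4, 15))   # still unresolved past the deadline
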
Most important, an effective reporting system requires that those making the
reports are convinced the information will be used for constructive improvements
in safety and not as a basis for criticism or disciplinary action. If reporting is con-
sidered to have negative consequences for the reporter, then anonymity may be
necessary and a written policy provided for the use of such reporting systems, includ-
ing the rights of the reporters and how the reported information will be used. Much
has been written about this aspect of reporting systems (e.g., see Dekker [51]). One
warning is that trust is hard to gain and easy to lose. Once it is lost, regaining it is
even harder than getting buy-in at the beginning.
When reporting involves an outside regulatory agency or industry group, pro-
tection of safety information and proprietary data from disclosure and use for
purposes other than improving safety must be provided.
Designing effective reporting systems is very difficult. Examining two successful
efforts, in nuclear power and in commercial aviation, along with the challenges they
face is instructive.
Nuclear Power.
Operators of nuclear power plants in the United States are required to file a
Licensee Event Report (LER) with the Nuclear Regulatory Commission (NRC)
whenever an irregular event occurs during plant operation. While the NRC collected
an enormous amount of information on the operating experience of plants in this
way, the data were not consistently analyzed until after the Three Mile Island (TMI)
accident. The General Accounting Office (GAO) had earlier criticized the NRC for
this failure, but no corrective action was taken until after the events at TMI [98].
The system also had a lack of closure: important safety issues were raised and
studied to some degree, but were not carried through to resolution [115]. Many
of the conditions involved in the TMI accident had occurred previously at other
plants, but nothing had been done to correct them. Babcock and Wilcox,
the engineering firm for TMI, had no formal procedures to analyze ongoing pro-
blems at plants they had built or to review the LERs on their plants filed with
the NRC.
The TMI accident sequence started when a pilot-operated relief valve stuck open.
In the nine years before the TMI incident, eleven of those valves had stuck open at
other plants, and only a year before, a sequence of events similar to those at TMI
had occurred at another U.S. plant.
The information needed to prevent TMI was available, including the prior
incidents at other plants, recurrent problems with the same equipment at TMI, and
engineers' critiques that operators had been taught to do the wrong thing in specific
circumstances, yet nothing had been done to incorporate this information into
operating practices.
In reflecting on TMI, the utility's president, Herman Dieckamp, said:
To me that is probably one of the most significant learnings of the whole accident [TMI],
the degree to which the inadequacies of that experience feedback loop . . . significantly
contributed to making us and the plant vulnerable to this accident [98].
As a result of this wake-up call, the nuclear industry initiated better evaluation and
follow-up procedures on LERs. It also created the Institute for Nuclear Power
Operations (INPO) to promote safety and reliability through external reviews of
performance and processes, training and accreditation programs, events analysis,
sharing of operating information and best practices, and special assistance to member
utilities. The IAEA (International Atomic Energy Agency) and World Association
of Nuclear Operators (WANO) share these goals and serve similar functions
worldwide.
The reporting system now provides a way for operators of each nuclear power
plant to reflect on their own operating experience in order to identify problems,
interpret the reasons for these problems, and select corrective actions to ameliorate
the problems and their causes. Incident reviews serve as important vehicles for self-
analysis, knowledge sharing across boundaries inside and outside specific plants, and
development of problem-resolution efforts. Both INPO and the NRC issue various
letters and reports to make the industry aware of incidents as part of operating
experience feedback, as does IAEA's Incident Reporting System.
The nuclear engineering experience is not perfect, of course, but real strides have
been made since the TMI wakeup call, which luckily occurred without major human
losses. To their credit, an improvement and learning effort was initiated and has
continued. High-profile incidents like TMI are rare, but smaller scale self-analyses
and problem-solving efforts follow detection of small defects, near misses, precur-
sors, and negative trends. Occasionally the NRC has stepped in and required
changes. For example, in 1996 the NRC ordered the Millstone nuclear power plant
in Connecticut to remain closed until management could demonstrate a “safety
conscious work environment” after identified problems were allowed to continue
without remedial action [34].
Commercial Aviation.
The highly regarded ASRS (Aviation Safety Reporting System) has been copied by
many individual airline information systems. Although much information is now
collected, there still exist problems in evaluating and learning from it. The breadth
and type of information acquired is much greater than the NRC reporting system
described above. The sheer number of ASRS reports and the free form entry of the
information make evaluation very difficult. Few mechanisms exist to determine
whether a report was accurate or evaluated the problem correctly.
Subjective causal attribution and inconsistency in terminology and information
included in the reports make comparative analysis and categorization difficult and
sometimes impossible.
Existing categorization schemes have also become inadequate as technology
has changed, for example, with increased use of digital technology and computers
in aircraft and ground operations. New categorizations are being implemented,
but that creates problems when comparing data that used older categorization
schemes.
Another problem, arising from the goal of encouraging use of the system, is the
accuracy of the data. Filing an ASRS report assures a limited form of indemnity
against punishment. Many of the reports are biased by personal protection con-
siderations, as evidenced by the large percentage of the filings that report FAA
regulation violations. For example, in a NASA Langley study of reported helicopter
incidents in the ASRS over a nine-year period, nonadherence to FARs (Federal
Aviation Regulations) was by far the largest category of reports. The predominance
of FAR violations in the incident data may reflect the motivation of the ASRS
reporters to obtain immunity from perceived or real violations of FARs and not
necessarily the true percentages.
But with all these problems and limitations, most agree that the ASRS and
similar industry reporting systems have been very successful and the information
obtained extremely useful in enhancing safety. For example, reported unsafe airport
conditions have been corrected quickly and improvements in air traffic control and
other types of procedures made on the basis of ASRS reports.
The success of the ASRS has led to the creation of other reporting systems in
this industry. The Aviation Safety Action Program (ASAP) in the United States,
for example, encourages air carrier and repair station personnel to voluntarily
report safety information to be used to develop corrective actions for identified
safety concerns. An ASAP involves a partnership between the FAA and the cer-
tified organization (called the certificate holder) and may also include a third
party, such as the employees' labor organization. It provides a vehicle for employ-
ees of the ASAP participants to identify and report safety issues to management
and to the FAA without fear that the FAA will use the reports accepted under
the program to take legal enforcement action against them or the company or
that companies will use the information to take disciplinary action against the
employee.
Certificate holders may develop ASAP programs and submit them to the FAA
for review and acceptance. Ordinarily, programs are developed for specific employee
groups, such as members of the flightcrew, flight attendants, mechanics, or dispatch-
ers. The FAA may also suggest, but not require, that a certificate holder develop an
ASAP to resolve an identified safety problem.
When ASAP reports are submitted, an event review committee (ERC) reviews
and analyzes them. The ERC usually includes a management representative from
the certificate holder, a representative from the employee labor association (if
applicable), and a specially trained FAA inspector. The ERC considers each ASAP
report for acceptance or denial, and if accepted, analyzes the report to determine
the necessary controls to put in place to respond to the identified problem.
Single ASAP reports can generate corrective actions and, in addition, analysis of
aggregate ASAP data can also reveal trends that require action. Under an ASAP,
safety issues are resolved through corrective action rather than through punishment
or discipline.
To prevent abuse of the immunity provided by ASAP programs, reports are
accepted only for inadvertent regulatory violations that do not appear to involve
an intentional disregard for safety and events that do not appear to involve criminal
activity, substance abuse, or intentional falsification.
Additional reporting programs provide for sharing data that is collected by air-
lines for their internal use. FOQA (Flight Operational Quality Assurance) is an
example. Air carriers often instrument their aircraft with extensive flight data
recording systems or use pilot generated checklists and reports for gathering infor-
mation internally to improve operations and safety. FOQA provides a voluntary
means for the airlines to share this information with other airlines and with the FAA
so that national trends can be monitored and the FAA can target its resources to
address the most important operational risk issues.
In contrast with the ASAP voluntary reporting of single events, FOQA programs
allow the accumulation of accurate operational performance information covering
all flights by multiple aircraft types such that single events or overall patterns of
aircraft performance data can be identified and analyzed. Such aggregate data can
determine trends specific to aircraft types, local flight path conditions, and overall
flight performance trends for the commercial aircraft industry. FOQA data has been
used to identify the need for changing air carrier operating procedures for specific
aircraft fleets and for changing air traffic control practices at certain airports with
unique traffic pattern limitations.
FOQA and other such voluntary reporting programs allow early identification
of trends and changes in behavior (i.e., migration of systems toward states of increas-
ing risk) before they lead to accidents. Follow-up is provided to ensure that unsafe
conditions are effectively remediated by corrective actions.
A cornerstone of FOQA programs, once again, is the understanding that aggre-
gate data provided to the FAA will be kept confidential and the identity of reporting
personnel or airlines will remain anonymous. Data that could be used to identify
flight crews are removed from the electronic record as part of the initial processing
of the collected data. Air carrier FOQA programs, however, typically provide a
gatekeeper who can securely retrieve identifying information for a limited amount
of time, in order to enable follow-up requests for additional information from the
specific flight crew associated with a FOQA event. The gatekeeper is typically a line
captain designated by the air carrier's pilot association. FOQA programs usually
involve agreements between pilot organizations and the carriers that define how
the collected information can be used.
footnote. FOQA is voluntary in the United States but required in some countries.
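The initial-processing step described above can be sketched as follows. This is an
illustration only, with hypothetical field names and retention period; it is not a
description of any actual FOQA implementation.

# Illustrative sketch (hypothetical fields and retention window): identifying
# fields are stripped from each record, and only a gatekeeper key, valid for a
# limited time, can recover them.
import uuid
from datetime import date, timedelta

IDENTIFYING_FIELDS = ("crew_ids", "flight_number")
RETENTION = timedelta(days=14)            # assumed re-identification window

gatekeeper_store = {}                     # event key -> (identifying data, expiry)

def deidentify(record, today):
    key = str(uuid.uuid4())
    identifying = {f: record.pop(f) for f in IDENTIFYING_FIELDS if f in record}
    gatekeeper_store[key] = (identifying, today + RETENTION)
    record["event_key"] = key
    return record

def gatekeeper_lookup(key, today):
    """Only the designated gatekeeper may call this, and only before expiry."""
    identifying, expiry = gatekeeper_store[key]
    if today > expiry:
        raise KeyError("re-identification window has expired")
    return identifying

record = {"crew_ids": ["P123", "P456"], "flight_number": "XY789",
          "exceedance": "unstable approach below 1,000 feet"}
clean = deidentify(record, today=date(2010, 3, 1))
print(clean)                                                     # no identifying fields
print(gatekeeper_lookup(clean["event_key"], date(2010, 3, 10)))  # within the window
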
section 12.5.
Using the Feedback.
Once feedback is obtained, it needs to be used to update the controllers' process
models and perhaps control algorithms. The feedback and its analysis may be passed
to others in the control structure who need it.
Information must be provided in a form that people can learn from, apply to
their daily jobs, and use throughout the system life cycle.
Various types of analysis may be performed by the controller on the feedback,
such as trend analysis. If flaws in the system design or unsafe changes are detected,
obviously actions are required to remedy the problems.
In major accidents, precursors and warnings are almost always present but ignored
or mishandled. While what appear to be warnings are sometimes simply a matter
of hindsight, sometimes clear evidence does exist. In 1982, two years before the
Bhopal accident, for example, an audit was performed that identified many of the
deficiencies involved in the loss. The audit report noted factors related to
the later tragedy, such as filter-cleaning operations without using slip blinds, leaking
valves, and bad pressure gauges. The report recommended raising the capability
of the water curtain and pointed out that the alarm at the flare tower was nonop-
erational and thus any leakage could go unnoticed for a long time. The report also
noted that a number of hazardous conditions were known and allowed to persist
for considerable amounts of time, or that inadequate precautions were taken against
them. In addition, there was no follow-up to ensure that deficiencies were corrected.
According to the Bhopal manager, all improvements called for in the report
had been implemented, but obviously that was either untrue or the fixes were
ineffective.
As with accidents and incidents, warning signs or anomalies also need to be
analyzed using CAST. Because practice will naturally deviate from procedures, often
for very good reasons, the gap between procedures and practice needs to be moni-
tored and understood [50].
section 12.6.
Education and Training.
Everyone in the safety control structure, not just the lower-level controllers of the
physical systems, must understand their roles and responsibilities with respect to
safety and why the system—including the organizational aspects of the safety control
structure—was designed the way it was.
People, both managers and operators, need to understand the risks they are taking
in the decisions they make. Often bad decisions are made because the decision
makers have an incorrect assessment of the risks being assumed, which has implica-
tions for training. Controllers must know exactly what to look for, not just be told
to look for “weak signals,” a common suggestion in the HRO literature. Before a
bad outcome occurs, weak signals are simply noise; they take on the appearance of
signals only in hindsight, when their relevance becomes obvious. Telling managers
and operators to “be mindful of weak signals” simply creates a pretext for blame
after a loss event occurs. Instead, the people involved need to be knowledgeable
about the hazards associated with the operation of the system if we expect them to
recognize the precursors to an accident. Knowledge turns unidentifiable weak signals
into identifiable strong signals. People need to know what to look for.
Decision makers at all levels of the safety control structure also need to under-
stand the risks they are taking in the decisions they make: Training should include
not just what but why. For good decision making about operational safety, decision
makers must understand the system hazards and their responsibilities with respect
to avoiding them. Understanding the safety rationale, that is, the “why,” behind the
system design will also have an impact on combating complacency and unintended
changes leading to hazardous states. This rationale includes understanding why
previous accidents occurred. The Columbia Accident Investigation Board was sur-
prised at the number of NASA engineers in the Space Shuttle program who had
never read the official Challenger accident report [74]. In contrast, everyone in the
U.S. nuclear Navy has training about the Thresher loss every year.
Training should not be a one-time event for employees but should be continual
throughout their employment, if only as a reminder of their responsibilities and the
system hazards. Learning about recent events and trends can be a focus of this
training.
Finally, assessing training effectiveness, perhaps during regular audits, can
help establish an effective improvement and learning process.
With highly automated systems, an assumption is often made that less training is
required. In fact, training requirements go up, not down, in automated systems, and
their nature changes. Training needs to be more extensive and deeper when
using automation. One of the reasons for this requirement is that human operators
of highly automated systems not only need a model of the current process state and
how it can change state but also a model of the automation and its operation, as
discussed in chapter 8.
To control complex and highly automated systems safely, operators (controllers)
need to learn more than just the procedures to follow: If we expect them to control
and monitor the automation, they must also have an in-depth understanding of the
controlled physical process and the logic used in any automated controllers they
may be supervising. System controllers—at all levels—need to know:
• The system hazards and the reason behind safety-critical procedures and opera-
tional rules.
• The potential result of removing or overriding controls, changing prescribed
procedures, and inattention to safety-critical features and operations: Past acci-
dents and their causes should be reviewed and understood.
•How to interpret feedback: Training needs to include different combinations of
alerts and sequences of events, not just single events.
•How to think flexibly when solving problems: Controllers need to be provided
with the opportunity to practice problem solving.
•General strategies rather than specific responses: Controllers need to develop
skills for dealing with unanticipated events.
•How to test hypotheses in an appropriate way: To update mental models,
human controllers often use hypothesis testing to understand the system state
better and update their process models. Such hypothesis testing is common with
computers and automated systems where documentation is usually so poor
and hard to use that experimentation is often the only way to understand the
automation behavior and design. Such testing can, however, lead to losses.
Designers need to provide operators with the ability to test hypotheses safely
and controllers must be educated on how to do so.
Finally, as with any system, emergency procedures must be overlearned and continu-
ally practiced. Controllers must be provided with operating limits and specific
actions to take in case they are exceeded. Requiring operators to make decisions
under stress and without full information is simply another way to ensure that they
will be blamed for the inevitable loss event, usually based on hindsight bias. Critical
limits must be established and provided to the operators, and emergency procedures
must be stated explicitly.
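As a minimal illustration of what explicit critical limits with prescribed actions can look like when supported by software, the following Python sketch pairs each safety-critical variable with its limit and a predefined response, so the operator is never asked to improvise under stress. The variable names, limit values, and actions are invented for illustration and are not taken from any particular system.

    # Sketch only: variables, limits, and actions are hypothetical examples.
    CRITICAL_LIMITS = {
        "tank_pressure_kpa": (1200, "Open relief valve and begin controlled shutdown"),
        "tank_level_percent": (95, "Stop inflow pump and notify shift supervisor"),
        "coolant_temp_c": (85, "Switch to backup coolant loop"),
    }

    def check_limits(readings):
        """Return the prescribed action for every exceeded limit."""
        actions = []
        for variable, value in readings.items():
            limit, action = CRITICAL_LIMITS.get(variable, (None, None))
            if limit is not None and value > limit:
                actions.append((variable, value, limit, action))
        return actions

    # A reading above its limit yields the pre-planned response.
    for variable, value, limit, action in check_limits({"coolant_temp_c": 91.0}):
        print(variable, value, "exceeds limit", limit, ":", action)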
section 12.7.
Creating an Operations Safety Management Plan.
The operations safety management plan is used to guide operational control of
safety. The plan describes the objectives of the operations safety program and how
they will be achieved. It provides a baseline to evaluate compliance and progress.
Like every other part of the safety program, the plan will need buy-in and oversight.
The organization should have a template and documented expectations for oper-
ations safety management plans, but this template may need to be tailored for
particular project requirements.
The information need not all be contained in one document, but there should be
a central reference with pointers to where the information can be found. As is true
for every other part of the safety control structure, the plan should include review
procedures for the plan itself as well as how the plan will be updated and improved
through feedback from experience.
Some things that might be included in the plan:
1.• General Considerations.
   Scope and objectives.
   Applicable standards (company, industry).
   Documentation and reports.
   Review of plan and progress reporting procedures.
2.• Safety Organization (safety control structure).
   Personnel qualifications and duties.
   Staffing and manpower.
   Communication channels.
   Responsibility, authority, accountability (functional organization, organizational structure).
   Information requirements (feedback requirements, process model, updating requirements).
   Subcontractor responsibilities.
   Coordination.
   Working groups.
   System safety interfaces with other groups, such as maintenance and test, occupational safety, quality assurance, and so on.
3.• Procedures.
   Problem reporting (processes, follow-up).
   Incident and accident investigation: procedures, staffing (participants), follow-up (tracing to hazard and risk analyses, communication).
   Testing and audit program: procedures, scheduling, review and follow-up.
   Metrics and trend analysis.
   Operational assumptions from hazard and risk analyses.
   Emergency and contingency planning and procedures.
   Management of change procedures.
   Training.
   Decision making, conflict resolution.
4.• Schedule.
   Critical checkpoints and milestones.
   Start and completion dates for tasks, reports, reviews.
   Review procedures and participants.
5.• Safety Information System.
   Hazard and risk analyses, hazard logs (controls, review and feedback procedures).
   Hazard tracking and reporting system.
   Lessons learned.
   Safety data library (documentation and files).
   Records retention policies.
6.• Operations hazard analysis.
   Identified hazards.
   Mitigations for hazards.
7.• Evaluation and planned use of feedback to keep the plan up-to-date and
improve it over time.
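One way to keep such a plan reviewable is to hold its skeleton as structured data, so a review can mechanically flag sections a project plan has not yet addressed. The Python sketch below is illustrative only; the field names simply mirror the outline above and imply no prescribed format.

    # Illustrative skeleton of an operations safety management plan; the
    # section names mirror the outline above and are not a required format.
    PLAN_TEMPLATE = {
        "general": ["scope and objectives", "applicable standards",
                    "documentation and reports", "review and progress reporting"],
        "safety_organization": ["personnel qualifications and duties", "staffing",
                                "communication channels", "responsibility and authority",
                                "information requirements", "subcontractor responsibilities",
                                "coordination", "working groups", "interfaces with other groups"],
        "procedures": ["problem reporting", "incident and accident investigation",
                       "testing and audit program", "metrics and trend analysis",
                       "operational assumptions", "emergency and contingency planning",
                       "management of change", "training", "decision making and conflict resolution"],
        "schedule": ["critical checkpoints and milestones", "task dates", "review procedures"],
        "safety_information_system": ["hazard and risk analyses", "hazard tracking",
                                      "lessons learned", "safety data library", "records retention"],
        "operations_hazard_analysis": ["identified hazards", "mitigations"],
        "evaluation": ["feedback used to update and improve the plan"],
    }

    def missing_sections(plan):
        """Flag template sections a project plan has not yet addressed."""
        return [s for s in PLAN_TEMPLATE if s not in plan or not plan[s]]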
section 12.8. Applying STAMP to Occupational Safety.
Occupational safety has, traditionally, not taken a systems approach but instead has
focused on individuals and changing their behavior. In applying systems theory to
occupational safety, more emphasis would be placed on understanding the impact
of system design on behavior and would focus on changing the system rather than
people. For example, vehicles used in large plants could be equipped with speed
regulators rather than depending on humans to follow speed limits and then punish-
ing them when they do not. The same design for safety principles presented in
chapter 9 for human controllers apply to designing for occupational safety.
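As a toy illustration of changing the system rather than the person, a speed governor enforces the plant speed limit in the vehicle design itself instead of relying on compliance and punishment. The limit value and function below are invented for the example.

    # Toy sketch: a governor clamps the commanded speed to the plant limit,
    # so the safety constraint is enforced by design rather than by discipline.
    PLANT_SPEED_LIMIT_KPH = 15.0

    def governed_speed(pedal_command_kph: float) -> float:
        return min(max(pedal_command_kph, 0.0), PLANT_SPEED_LIMIT_KPH)

    assert governed_speed(40.0) == 15.0   # over-limit demand is capped
    assert governed_speed(8.0) == 8.0     # normal operation is unaffected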
With the increasing complexity and automation of our plants, the line between
occupational safety and engineering safety is blurring. By designing the system to
be safe despite normal human error or judgment errors under competing work
pressures, workers will be better protected against injury while fulfilling their job
responsibilities.

842
chapter12.txt Normal file

@ -0,0 +1,842 @@
Chapter 12.
Controlling Safety during Operations.
In some industries, system safety is viewed as having its primary role in development
and most of the activities occur before operations begin. Those concerned with
safety may lose influence and resources after that time. As an example, one of
the chapters in the Challenger accident report, titled “The Silent Safety Program,”
lamented.
Following the successful completion of the orbital flight test phase of the Shuttle program,
the system was declared to be operational. Subsequently, several safety, reliability, and
quality assurance organizations found themselves with reduced and/or reorganized functional capabilities. . . . The apparent reason for such actions was a perception that less
safety, reliability, and quality assurance activity would be required during “routine” Shuttle
operations. This reasoning was faulty.
While safety-guided design eliminates some hazards and creates controls for others,
hazards and losses may still occur in operations due to.
1.•Inadequate attempts to eliminate or control the hazards in the system design,
perhaps due to inappropriate assumptions about operations.
2.•Inadequate implementation of the controls that designers assumed would exist
during operations.
3.•Changes that occur over time, including violation of the assumptions underlying the design.
4.•Unidentified hazards, sometimes new ones that arise over time and were not
anticipated during design and development.
Treating operational safety as a control problem requires facing and mitigating these
potential reasons for losses.
A complete system safety program spans the entire life of the system and, in some
ways, the safety program during operations is even more important than during
development. System safety does not stop after development; it is just getting started.
The focus now, however, shifts to the operations safety control structure.
This chapter describes the implications of STAMP on operations. Some topics
that are relevant here are left to the next chapter on management. organizational
design, safety culture and leadership, assignment of appropriate responsibilities
throughout the safety control structure, the safety information system, and corporate safety policies. These topics span both development and operations and many
of the same principles apply to each, so they have been put into a separate chapter.
A final section of this chapter considers the application of STAMP and systems
thinking principles to occupational safety.
section 12.1.
Operations Based on STAMP.
Applying the basic principles of STAMP to operations means that, like development, the goal during operations is enforcement of the safety constraints, this time
on the operating system rather than in its design. Specific responsibilities and control
actions required during operations are outlined in chapter 13.
Figure 12.1 shows the interactions between development and operations. At the
end of the development process, the safety constraints, the results of the hazard
analyses, as well as documentation of the safety-related design features and design
rationale, should be passed on to those responsible for the maintenance and evolution of the system. This information forms the baseline for safe operations. For
example, the identification of safety-critical items in the hazard analysis should be
used as input to the maintenance process for prioritization of effort.
At the same time, the accuracy and efficacy of the hazard analyses performed
during development and the safety constraints identified need to be evaluated using
the operational data and experience. Operational feedback on trends, incidents, and
accidents should trigger reanalysis when appropriate. Linking the assumptions
throughout the system specification with the parts of the hazard analysis based on
that assumption will assist in performing safety maintenance activities. During field
testing and operations, the links and recorded assumptions and design rationale can
be used in safety change analysis, incident and accident analysis, periodic audits and
performance monitoring as required to ensure that the operational system is and
remains safe.
For example, consider the TCAS requirement that TCAS provide collision avoidance protection for any two aircraft closing horizontally at any rate up to 1,200 knots
and vertically up to 10,000 feet per minute. As noted in the rationale, this requirement is based on aircraft performance limits at the time TCAS was created. It is
also based on minimum horizontal and vertical separation requirements. The safety
analysis originally performed on TCAS is based on these assumptions. If aircraft
performance limits change or if there are proposed changes in airspace management, as is now occurring in new Reduced Vertical Separation Minimums .(RVSM),
hazard analysis to determine the safety of such changes will require the design
rationale and the tracing from safety constraints to specific system design features
as recorded in intent specifications. Without such documentation, the cost of reanalysis could be enormous and in some cases even impractical. In addition, the links
between design and operations and user manuals in level 6 will ease updating when
design changes are made.
In a traditional System Safety program, much of this information is found
in or can be derived from the hazard log, but it needs to be pulled out and provided in a form that makes it easy to locate and use in operations. Recording
design rationale and assumptions in intent specifications allows using that information both as the criteria under which enforcement of the safety constraints is
predicated and in the inevitable upgrades and changes that will need to be made
during operations. Chapter 10 shows how to identify and record the necessary
information.
The design of the operational safety controls is based on assumptions about the
conditions during operations. Examples include assumptions about how the operators will operate the system and the environment .(both social and physical). in which
the system will operate. These conditions may change. Therefore, not only must the
assumptions and design rationale be conveyed to those who will operate the system,
but there also need to be safeguards against changes over time that violate those
assumptions.
The changes may be in the behavior of the system itself.
•Physical changes. the equipment may degrade or not be maintained properly.
•Human changes. human behavior and priorities usually change over time.
•Organizational changes. change is a constant in most organizations, including
changes in the safety control structure itself, or in the physical and social environment within which the system operates or with which it interacts.
Controls need to be established to reduce the risk associated with all these types of
changes.
The safeguards may be in the design of the system itself or in the design of the
operational safety control structure. Because operational safety depends on the
accuracy of the assumptions and models underlying the design and hazard analysis
processes, the operational system should be monitored to ensure that.
1. The system is constructed, operated, and maintained in the manner assumed
by the designers.
2. The models and assumptions used during initial decision making and design
are correct.
3. The models and assumptions are not violated by changes in the system, such
as workarounds or unauthorized changes in procedures, or by changes in the
environment.
Designing the operations safety control structure requires establishing controls and
feedback loops to .(1). identify and handle flaws in the original hazard analysis and
system design and .(2). to detect unsafe changes in the system during operations
before the changes lead to losses. Changes may be intentional or they may be unintended and simply normal changes in system component behavior or the environment over time. Whether intended or unintended, system changes that violate the
safety constraints must be controlled.
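A minimal sketch, assuming each recorded design assumption can be paired with a check against operational feedback, of how the monitoring described above might be made concrete. The assumption texts, data fields, and thresholds are invented placeholders, not part of any actual system.

    # Sketch: each recorded design assumption is paired with a feedback check
    # that can be run against operational data. All entries are invented examples.
    ASSUMPTIONS = [
        ("Maintenance is performed at the interval assumed by the designers",
         lambda data: data["days_since_maintenance"] <= 30),
        ("Operators follow the documented handoff procedure",
         lambda data: data["handoffs_completed"] == data["handoffs_required"]),
        ("The alarm system is operational",
         lambda data: data["alarm_self_test_passed"]),
    ]

    def violated_assumptions(operational_data):
        """Return the assumptions that current operational feedback contradicts."""
        return [text for text, check in ASSUMPTIONS if not check(operational_data)]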
section 12.2.
Detecting Development Process Flaws during Operations.
Losses can occur due to flaws in the original assumptions and rationale underlying
the system design. Errors may also have been made in the hazard analysis process
used during system design. During operations, three goals and processes to achieve
these goals need to be established.
1. Detect safety-related flaws in the system design and in the safety control
structure, hopefully before major losses, and fix them.
2. Determine what was wrong in the development process that allowed the flaws
to exist and improve that process to prevent the same thing from happening
in the future.
3. Determine whether the identified flaws in the process might have led to other
vulnerabilities in the operational system.
If losses are to be reduced over time and companies are not going to simply
engage in constant firefighting, then mechanisms to implement learning and continual improvement are required. Identified flaws must not only be fixed .(symptom
removal), but the larger operational and development safety control structures must
be improved, as well as the process that allowed the flaws to be introduced in the
first place. The overall goal is to change the culture from a fixing orientation,
that is, identifying and eliminating deviations or symptoms of deeper problems, to a learning
orientation where systemic causes are included in the search for the source of
safety problems.
To accomplish these goals, a feedback control loop is needed to regularly track
and assess the effectiveness of the development safety control structure and its
controls. Were hazards overlooked or incorrectly assessed as unlikely or not serious?
Were some potential failures or design errors not included in the hazard analysis?
Were identified hazards inappropriately accepted rather than being fixed? Were the
designed controls ineffective? If so, why?
When numerical risk assessment techniques are used, operational experience can
provide insight into the accuracy of the models and probabilities used. In various
studies of the D C 10 by McDonnell Douglas, the chance of engine power loss with
resulting slat damage during takeoff was estimated to be less than one in a billion
flights. However, this highly improbable event occurred four times in D C 10 s in the
first few years of operation without raising alarm bells before it led to an accident
and changes were made. Even one event should have warned someone that the
models used might be incorrect. Surprisingly little scientific evaluation of probabilistic risk assessment techniques has ever been conducted , yet these techniques
are regularly taught to most engineering students and widely used in industry. Feedback loops to evaluate the assumptions underlying the models and the assessments
produced are an obvious way to detect problems.
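A back-of-the-envelope calculation shows how strongly even one such event contradicts the published model. Assuming, purely for illustration, a few million takeoffs of exposure in those early years, the Python sketch below computes the Poisson probability of seeing at least one, and at least four, events if the true rate really were one in a billion flights.

    import math

    p_per_flight = 1e-9           # claimed probability of the event per flight
    flights = 3_000_000           # assumed early-operation exposure (illustrative)
    lam = p_per_flight * flights  # expected number of events under the model

    p_at_least_one = 1 - math.exp(-lam)
    p_at_least_four = 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(4))

    print(f"expected events: {lam:.6f}")            # about 0.003
    print(f"P(at least 1):  {p_at_least_one:.6f}")  # roughly 0.3 percent
    print(f"P(at least 4):  {p_at_least_four:.2e}") # effectively zero

Under these illustrative assumptions the expected number of events is about 0.003, a single event has a probability of roughly 0.3 percent, and four events are essentially impossible, so the operational observations alone should have triggered a reexamination of the model.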
Most companies have an accident/incident analysis process that identifies the
proximal failures that led to an incident, for example, a flawed design of the pressure
relief valve in a tank. Typical follow-up would include replacement of that valve with
an improved design. On top of fixing the immediate problem, companies should
have procedures to evaluate and potentially replace all the uses of that pressure
relief valve design in tanks throughout the plant or company. Even better would be
to reevaluate pressure relief valve design for all uses in the plant, not just in tanks.
But for long-term improvement, a causal analysis.CAST or something similar.
needs to be performed on the process that created the flawed design and that
process improved. If the development process was flawed, perhaps in the hazard
analysis or design and verification, then fixing that process can prevent a large
number of incidents and accidents in the future.
Responsibility for this goal has to be assigned to an appropriate component in
the safety control structure and feedback-control loops established. Feedback may
come from accident and incident reports as well as detected and reported design
and behavioral anomalies. To identify flaws before losses occur, which is clearly
desirable, audits and performance assessments can be used to collect data for validating and informing the safety design and analysis process without waiting for a
crisis. There must also be feedback channels to the development safety control
structure so that appropriate information can be gathered and used to implement
improvements. The design of these control loops is discussed in the rest of this
chapter. Potential challenges in establishing such control loops are discussed in the
next chapter on management.
section 12.3. Managing or Controlling Change.
Systems are not static but instead are dynamic processes that are continually adapting to achieve their ends and to react to changes in themselves and their environment. In STAMP, adaptation or change is assumed to be an inherent part of any
system, particularly those that include humans and organizational components.
Humans and organizations optimize and change their behavior, adapting to the
changes in the world and environment in which the system operates.
To avoid losses, not only must the original design enforce the safety constraints
on system behavior, but the safety control structure must continue to enforce them
as changes to the designed system, including the safety control structure itself, occur
over time.
While engineers usually try to anticipate potential changes and to design for
changeability, the bulk of the effort in dealing with change must necessarily occur
during operations. Controls are needed both to prevent unsafe changes and to detect
them if they occur.
In the friendly fire example in chapter 5, the A Wacks controllers stopped handing
off helicopters as they entered and left the no-fly zone. They also stopped using the
Delta Point system to describe flight plans, although the helicopter pilots assumed
the coded destination names were still being used and continued to provide them.
Communication between the helicopters and the A Wacks controllers was seriously
degraded although nobody realized it. The basic safety constraint that all aircraft
in the no-fly zone and their locations would be known to the A Wacks controllers
became over time untrue as the A Wacks controllers optimized their procedures.
This type of change is normal; it needs to be identified by checking that the assumptions upon which safety is predicated remain true over time.
The deviation from assumed behavior during operations was not, in the friendly
fire example, detected until after an accident. Obviously, finding the deviations at
this time is less desirable than using audits and other types of feedback mechanisms
to detect hazardous changes, that is, those that violate the safety constraints, before
losses occur. Then something needs to be done to ensure that the safety constraints
are enforced in the future.
Controls are required for both intentional .(planned). and unintentional changes.
section 12.3.1. Planned Changes.
Intentional system changes are a common factor in accidents, including physical,
process, and safety control structure changes . The Flixborough explosion provides an example of a temporary physical change resulting in a major loss. Without
first performing a proper hazard analysis, a temporary pipe was used to replace a
reactor that had been removed to repair a crack. The crack itself was the result of
a previous process modification . The Walkerton water contamination loss in
appendix C provides an example of a control structure change when the government
water testing lab was privatized without considering how that would affect feedback
to the Ministry of the Environment.
Before any planned changes are made, including organizational and safety
control structure changes, their impact on safety must be evaluated. Whether
this process is expensive depends on how the original hazard analysis was performed and particularly how it was documented. Part of the rationale behind the
design of intent specifications was to make it possible to retrieve the information
needed.
While implementing change controls limits flexibility and adaptability, at least in
terms of the time it takes to make changes, the high accident rate associated with
intentional changes attests to the importance of controlling them and the high level
of risk being assumed by not doing so. Decision makers need to understand these
risks before they waive the change controls.
Most systems and industries do include such controls, usually called Management
of Change .(MOC). procedures. But the large number of accidents occurring after
system changes without evaluating their safety implies widespread nonenforcement
of these controls. Responsibility needs to be assigned for ensuring compliance with
the MOC procedures so that change analyses are conducted and the results are not
ignored. One way to do this is to reward people for safe behavior when they choose
safety over other system goals and to hold them accountable when they choose to
ignore the MOC procedures, even when no accident results. Achieving this goal, in
turn, requires management commitment to safety .(see chapter 13), as does just
about every aspect of building and operating a safe system.
section 12.3.2. Unplanned Changes.
While dealing with planned changes is relatively straightforward .(even if difficult
to enforce), unplanned changes that move systems toward states of higher risk are
less straightforward. There need to be procedures established to prevent or detect
changes that impact the ability of the operations safety control structure and the
designed controls to enforce the safety constraints.
As noted earlier, people will tend to optimize their performance over time to
meet a variety of goals. If an unsafe change is detected, it is important to respond
quickly. People incorrectly reevaluate their perception of risk after a period of
success. One way to interrupt this risk-reevaluation process is to intervene quickly
to stop it before it leads to a further reduction in safety margins or a loss occurs.
But that requires an alerting function to provide feedback to someone who is
responsible for ensuring that the safety constraints are satisfied.
At the same time, change is a normal part of any system. Successful systems are
continually changing and adapting to current conditions. Change should be allowed
as long as it does not violate the basic constraints on safe behavior and therefore
increase risk to unacceptable levels. While in the short term relaxing the safety constraints may allow other system goals to be achieved to a greater degree, in the longer
term accidents and losses can cost a great deal more than the short-term gains.
The key is to allow flexibility in how safety goals are achieved, but not flexibility
in violating them, and to provide the information that creates accurate risk perception by decision makers.
Detecting migration toward riskier behavior starts with identifying baseline
requirements. The requirements follow from the hazard analysis. These requirements may be general .(“Equipment will not be operated above the identified safety-critical limits” or “Safety-critical equipment must be operational when the system
is operating”). or specifically tied to the hazard analysis .(“A Wacks operators must
always hand off aircraft when they enter and leave the no-fly zone” or “Pilots must
always follow the TCAS alerts and continue to do so until they are canceled”).
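To make the detection of migration concrete, a sketch along the following lines (the requirement texts and record fields are invented) checks each operational record against the baseline requirements and watches the violation count over time; a rising count is the signal that behavior is drifting away from the analyzed baseline.

    from collections import Counter

    # Invented baseline requirements, each paired with a predicate over one
    # operational record (for example, one flight, one shift, or one transfer).
    BASELINE = {
        "equipment within safety-critical limits": lambda r: r["max_pressure"] <= r["pressure_limit"],
        "safety-critical equipment operational":   lambda r: r["interlock_online"],
        "aircraft handed off at zone boundary":    lambda r: r["handoff_done"],
    }

    def violations_by_month(records):
        """Count baseline violations per month to expose drift toward higher risk."""
        counts = Counter()
        for rec in records:
            for name, check in BASELINE.items():
                if not check(rec):
                    counts[rec["month"]] += 1
        return dict(counts)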
The next step is to assign responsibility to appropriate places in the safety control
structure to ensure the baseline requirements are not violated, while allowing
changes that do not raise risk. If the baseline requirements make it impossible for
the system to achieve its goals, then instead of waiving them, the entire safety control
structure should be reconsidered and redesigned. For example, consider the foam
shedding problems on the Space Shuttle. Foam had been coming off the external
tank for most of the operational life of the Shuttle. During development, a hazard
had been identified and documented related to the foam damaging the thermal
control surfaces of the spacecraft. Attempts had been made to eliminate foam shedding, but none of the proposed fixes worked. The response was to simply waive the
requirement before each flight. In fact, at the time of the Columbia loss, more than
three thousand potentially critical failure modes were regularly waived on the
pretext that nothing could be done about them and the Shuttle had to fly .
More than a third of these waivers had not been reviewed in the ten years before
the accident.
After the Columbia loss, controls and mitigation measures for foam shedding
were identified and implemented, such as changing the fabrication procedures and
adding cameras and inspection and repair capabilities and other contingency actions.
The same measures could, theoretically, have been implemented before the loss of
Columbia. Most of the other waived hazards were also resolved in the aftermath of
the accident. While the operational controls to deal with foam shedding raise the
risk associated with a Shuttle accident above actually fixing the problem, the risk is
lower than simply ignoring and waiting for the hazards to occur. Understanding and
explicitly accepting risk is better than simply denying and ignoring it.
The NASA safety program and safety control structure had seriously degraded
before both the Challenger and Columbia losses . Waiving requirements
interminably represents an abdication of the responsibility to redesign the system,
including the controls during operations, after the current design is determined to
be unsafe.
Is such a hard line approach impractical? SUBSAFE, the U.S. nuclear submarine
safety program established after the Thresher loss, described in chapter 14, has not
allowed waiving the SUBSAFE safety requirements for more than forty-five years,
with one exception. In 19 67 , four years after SUBSAFE was established, SUBSAFE
requirements for one submarine were waived in order to satisfy pressing Navy performance goals. That submarine and its crew were lost less than a year later. The
same mistake has not been made again.
If there is absolutely no way to redesign the system to be safe and at the same
time to satisfy the system requirements that justify its existence, then the existence
of the system itself should be rethought and a major replacement or new design
considered. After the first accident, much more stringent and perhaps unacceptable
controls will be forced on operations. While the decision to live with risk is usually
accorded to management, those who will suffer the losses should have a right to
participate in that decision. Luckily, the choice is usually not so stark if flexibility is
allowed in the way the safety constraints are maintained and long-term rather than
short-term thinking prevails.
Like any set of controls, unplanned change controls involve designing appropriate control loops. In general, the process involves identifying the responsibility
of the controller(s); collecting data .(feedback); turning the feedback into useful
information .(analysis). and updating the process models; generating any necessary
control actions and appropriate communication to other controllers; and measuring
how effective the whole process is .(feedback again).
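The steps just listed can be read as a single loop. The sketch below is only a schematic of that loop; the callables are invented placeholders standing in for the real activities of a particular controller in the safety control structure.

    # Schematic of the change-control loop described above; the callables are
    # placeholders for the activities of a real controller.
    def run_control_loop(collect, analyze, update_model, act, assess, cycles=1):
        for _ in range(cycles):
            feedback = collect()              # gather data from the feedback channels
            findings = analyze(feedback)      # turn the data into useful information
            update_model(findings)            # keep the controller's process model current
            for action in act(findings):      # generate and communicate control actions
                print("issuing control action:", action)
            assess(findings)                  # measure how well the loop itself is working

    # Toy usage with trivial placeholders.
    run_control_loop(
        collect=lambda: {"audit_findings": 2},
        analyze=lambda fb: fb,
        update_model=lambda f: None,
        act=lambda f: ["schedule follow-up audit"] if f["audit_findings"] else [],
        assess=lambda f: None,
    )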
section 12.4. Feedback Channels.
Feedback is a basic part of STAMP and of treating safety as a control problem.
Information flow is key in maintaining safety.
There is often a belief.or perhaps hope.that a small number of “leading indicators” can identify increasing risk of accidents, or, in STAMP terms, migration
toward states of increased risk. It is unlikely that general leading indicators applicable to large industry segments exist or will be useful. The identification of system
safety constraints does, however, provide the possibility of identifying leading
indicators applicable to a specific system.
The desire to predict the future often leads to collecting a large amount of information based on the hope that something useful will be obtained and noticed. The
NASA Space Shuttle program was collecting six hundred metrics a month before
the loss of Columbia. Companies often collect data on occupational safety, such as
days without a lost time accident, and they assume that these data reflect on system
safety, which of course they do not. Not only is this misuse of data potentially
misleading, but collecting information that may not be indicative of real risk diverts
limited resources and attention from more effective risk-reduction efforts.
Poorly defined feedback can lead to a decrease in safety. As an incentive to
reduce the number of accidents in the California construction industry, for example,
workers with the best safety records, as measured by fewest reported incidents,
were rewarded. The reward created an incentive to withhold information
about small accidents and near misses, and they could not therefore be investigated
and the causes eliminated. Under-reporting of incidents created the illusion that the
system was becoming safer, when instead risk had merely been muted. The inaccurate risk perception by management led to not taking the necessary control
actions to reduce risk. Instead, the reporting of accidents should have been rewarded.
Feedback requirements should be determined with respect to the design of the
organization's safety control structure, the safety constraints .(derived from the
system hazards). that must be enforced on system operation, and the assumptions
and rationale underlying the system design for safety. They will be similar for different organizations only to the extent that the hazards, safety constraints, and
system design are similar.
The hazards and safety constraints, as well as the causal information derived by
the use of STPA, form the foundation for determining what feedback is necessary
to provide the controllers with the information they need to satisfy their safety
responsibilities. In addition, there must be mechanisms to ensure that feedback
channels are operating effectively.
The feedback is used to update the controllers' process models and understanding of the risks in the processes they are controlling, to update their control algorithms, and to execute appropriate control actions.
Sometimes, cultural problems interfere with feedback about the state of the
controlled process. If the culture does not encourage sharing information and if
there is a perception that the information can be used in a way that is detrimental
to those providing it, then cultural changes will be necessary. Such changes require
leadership and freedom from blame .(see “Just Culture” in chapter 13). Effective
feedback collection requires that those making the reports are convinced that the
information will be used for constructive improvements in safety and not as a basis
for criticism or disciplinary action. Resistance to airing dirty laundry is understandable, but this quickly transitions into an organizational culture where only good
news is passed on for fear of retribution. Everyone's past experience includes individual mistakes, and avoiding repeating the same mistakes requires a culture that
encourages sharing.
Three general types of feedback are commonly used. audits and performance
assessments; reporting systems; and anomaly, incident, and accident investigation.
section 12.4.1. Audits and Performance Assessments.
Once again, audits and performance assessments should start from the safety constraints and design assumptions and rationale. The goal should be to determine
whether the safety constraints are being enforced in the operation of the system
and whether the assumptions underlying the safety design and rationale are still
true. Audits and performance assessments provide a chance to detect whether the
behavior of the system and the system components still satisfies the safety constraints
and whether the way the controllers think the system is working, as reflected
in their process models, is accurate.
The entire safety control structure must be audited, not just the lower-level processes. Auditing the upper levels of the organization will require buy-in and commitment from management and an independent group at a high enough level to
control audits as well as explicit rules for conducting them.
Audits are often less effective than they might be. When auditing is performed
through contracts with independent companies, there may be subtle pressures on
the audit team to be unduly positive or less than thorough in order to maintain their
customer base. In addition, behavior or conditions may be changed in anticipation
of an audit and then revert back to their normal state immediately afterward.
Overcoming these limitations requires changes in organizational culture and
in the use of the audit results. Safety controllers .(managers). must feel personal
responsibility for safety. One way to encourage this view is to trust them and expect
them to be part of the solution and to care about safety. “Safety is everyone's
responsibility” must be more than an empty slogan, and instead a part of the organizational culture.
A participatory audit philosophy can have an important impact on these cultural
goals. Some features of such a philosophy are.
1.• Audits should not be punitive. Audits need to be viewed as a chance to improve
safety and to evaluate the process rather than a way to evaluate employees.
2.• To increase buy-in and commitment, those controlling the processes being
audited should participate in creating the rules and procedures and understand
the reasons for the audit and how the results will be used. Everyone should
have a chance to learn from the audit without it having negative consequences;
it should be viewed as an opportunity to learn how to improve.
3.•People from the process being audited should participate on the audit team. In
order to get an outside but educated view, using process experts from other
parts of the organization not directly being audited is a better approach than
using outside audit companies. Various stakeholders in safety may be included
such as unions. The goal should be to inculcate the attitude that this is our audit
and a chance to improve our practices. Audits should be treated as a learning
experience for everyone involved.including the auditors.
4.•Immediate feedback should be provided and solutions discussed. Often audit
results are not available until after the audit and are presented in a written
report. Feedback and discussion with the audit team during the audit are discouraged. One of the best times to discuss problems found and how to design
solutions, however, is when the team is together and on the spot. Doing this
will also reinforce the understanding that the goal is to improve the process,
not to punish or evaluate those involved.
5.• All levels of the safety control structure should be audited, along with the
physical process and its immediate operators. Accepting being audited and
implementing improvements as a result.that is, leading by example.is a
powerful way for leaders to convey their commitment to safety and to its
improvement.
6.• A part of the audit should be to determine the level of safety knowledge and
training that actually exists, not what managers believe exists or what exists in
the training programs and user manuals. These results can be fed back into the
training materials and education programs. Under no circumstances, of course,
should such assessments be used in a negative way or one that is viewed as
punitive by those being assessed.
Because these rules for audits are so far from common practice, they may be
viewed as unrealistic. But this type of audit is carried out today with great success.
See chapter 14 for an example. The underlying philosophy behind these practices is
that most people do not want to harm others and have innate belief in safety as a
goal. The problems arise when other goals are rewarded or emphasized over safety.
When safety is highly valued in an organizational culture, obtaining buy-in is usually
not difficult. The critical step lies in conveying that commitment.
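One way to act on the principle that audits start from the safety constraints and design assumptions is to generate the audit checklist directly from those records, so every constraint and assumption receives at least one question at every level being audited. The constraints, assumptions, and levels in the sketch are invented examples.

    # Illustrative: derive audit questions from the recorded constraints and
    # assumptions rather than from a generic checklist.
    SAFETY_CONSTRAINTS = ["Relief valves must be tested before each campaign"]
    DESIGN_ASSUMPTIONS = ["Two trained operators are present during transfers"]
    LEVELS = ["operators", "shift supervision", "plant management"]

    def audit_checklist():
        questions = []
        for constraint in SAFETY_CONSTRAINTS:
            for level in LEVELS:
                questions.append(f"[{level}] Is this constraint still enforced: {constraint}?")
        for assumption in DESIGN_ASSUMPTIONS:
            for level in LEVELS:
                questions.append(f"[{level}] Is this design assumption still true: {assumption}?")
        return questions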
section 12.4.2. Anomaly, Incident, and Accident Investigation.
Anomaly, incident, and accident investigations often focus on a single “root” cause
and look for contributory causes near the events. The belief that there is a root cause,
sometimes called root cause seduction , is powerful because it provides an illusion of control. If the root cause can simply be eliminated and if that cause is low
in the safety control structure, then changes can easily be made that will eliminate
accidents without implicating management or requiring changes that are costly or
disruptive to the organization. The result is that physical design characteristics or
low-level operators are usually identified as the root cause.
Causality is, however, much more complex than this simple but very entrenched
belief, as has been argued throughout this book. To effect high-leverage policies and
changes that are able to prevent large classes of future losses, the weaknesses in the
entire safety control structure related to the loss need to be identified and the
control structure redesigned to be more effective.
In general, effective learning from experience requires a change from a fixing
orientation to a continual learning and improvement culture. To create such a
culture requires high-level leadership by management, and sometimes organizational changes.
Chapter 11 describes a way to perform better analyses of anomalies, incidents,
and accidents. But having a process is not enough; the process must be embedded
in an organizational structure that allows the successful exploitation of that process.
Two important organizational factors will impact the successful use of CAST. training and follow-up.
Applying systems thinking to accident analysis requires training and experience.
Large organizations may be able to train a group of investigators or teams to
perform CAST analyses. This group should be managerially and financially independent. Some managers prefer to have accident/incident analysis reports focus on
the low-level system operators and physical processes and the reports never go
beyond those factors. In other cases, those involved in accident analysis, while well-meaning, have too limited a view to provide the perspective required to perform an
adequate causal analysis. Even when intentions are good and local skills and knowledge are available, budgets may be so tight and pressures to maintain performance
schedules so high that it is difficult to find the time and resources to do a thorough
causal analysis using local personnel. Trained teams with independent budgets
can overcome some of these obstacles. But while the leaders of investigations and
causal analysis can be independent, participation by those with local knowledge is
also important.
A second requirement is follow-up. Often the process stops after recommendations are made and accepted. No follow-up is provided to ensure that the recommendations are implemented or that the implementations were effective. Deadlines
and assignment of responsibility for making recommendations, as well as responsibility for ensuring that they are made, are required. The findings in the causal analysis
should be an input to future audits and performance assessments. If the same or
similar causes recur, then that itself requires an analysis of why the problem was
not fixed when it first was detected. Was the fix unsuccessful? Did the system migrate
back to the same high-risk state because the underlying causal factors were never
successfully controlled? Were factors missed in the original causal analysis? Trend
analysis is important to ensure that progress is being made in controlling safety.
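A minimal sketch of the follow-up bookkeeping described here, with invented fields: each recommendation carries an owner and a deadline, and causal factors that reappear after a recommendation was supposedly closed are flagged so the question of why the earlier fix failed gets asked.

    import datetime

    # Illustrative follow-up record for a recommendation from a causal analysis.
    recommendations = [
        {"id": "R-12", "causal_factor": "inadequate handoff procedure",
         "owner": "ops manager", "deadline": datetime.date(2011, 6, 1),
         "implemented": False, "verified_effective": False},
    ]

    def overdue(recs, today):
        """Recommendations past their deadline and still not implemented."""
        return [r for r in recs if not r["implemented"] and r["deadline"] < today]

    def recurring(recs, new_causal_factors):
        """Causal factors that reappear after a recommendation was closed as effective."""
        closed = {r["causal_factor"] for r in recs if r["verified_effective"]}
        return sorted(closed & set(new_causal_factors))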
section 12.4.3. Reporting Systems.
Accident reports very often note that before a loss, someone detected an anomaly
but never reported it using the official reporting system. The response in accident
investigation reports is often to recommend that the requirement to use reporting
systems be emphasized to personnel or to provide additional training in using them.
This response may be effective for a short time, but eventually people revert back
to their prior behavior. A basic assumption about human behavior in this book .(and
in systems approaches to human factors). is that human behavior can usually be
explained by looking at the system in which the human is operating. The reason for
the behavior must be found in the system design, and the design changed. Simply trying
to force people to behave in ways that are unnatural for them will usually be
unsuccessful.
So the first step is to ask why people do not use reporting systems and then to fix
those factors. One obvious reason is that they may be designed poorly. They may
require extra, time-consuming steps, such as logging into a web-based system, that
are not part of their normal operating procedures or environment. Once they
get to the website, they may be faced with a poorly designed form that requires
them to provide a lot of extraneous information or does not allow the flexibility
necessary to enter the information they want to provide.
A second reason people do not report is that the information they provided in
the past appeared to go into a black hole, with nobody responding to it. There is
little incentive to continue to provide information under these conditions, particularly when the reporting system is time-consuming and awkward to use.
A final reason for lack of reporting is a fear that the information provided may
be used against them or there are other negative repercussions such as a necessity
to spend time filling out additional reports.
Once the reason for failing to use reporting systems is understood, the solutions
usually become obvious. For example, the system may need to be redesigned so it
is easy to use and integrated into normal work procedures. As an example, email is
becoming a primary means of communication at work. The first natural response in
finding a problem is to contact those who can fix it, not to report it to some database
where there is no assurance it will be processed quickly or get to the right people.
A successful solution to this problem used on one large air traffic control system
was to require only that the reporter add an extra “cc.” on their emails in order to
get it reported officially to safety engineering and those responsible for problem
reports.
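A sketch of the same idea, with an invented mailbox address and message format rather than the actual system cited above: anything carbon-copied to a designated safety address is filed automatically as a problem report, so reporting becomes a side effect of the email people already send to get help.

    # Sketch: anything cc'ed to the (hypothetical) safety mailbox becomes a report.
    SAFETY_MAILBOX = "safety-reports@example.org"   # invented address

    def maybe_file_report(message, report_db):
        if SAFETY_MAILBOX in message.get("cc", []):
            report_db.append({
                "from": message["from"],
                "subject": message["subject"],
                "body": message["body"],
                "status": "received",               # an acknowledgment should follow
            })
            return True
        return False

    db = []
    maybe_file_report({"from": "controller7", "subject": "display freeze",
                       "body": "radar display froze for three seconds",
                       "cc": [SAFETY_MAILBOX]}, db)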
In addition, the receipt of a problem report should result in both an acknowledgment of receipt and a thank-you. Later, when a resolution is identified, information
should be provided to the reporter of the problem about what was done about it.
If there is no resolution within a reasonable amount of time, that too should be
acknowledged. There is little incentive to use reporting systems if the reporters do
not think the information will be acted upon.
Most important, an effective reporting system requires that those making the
reports are convinced the information will be used for constructive improvements
in safety and not as a basis for criticism or disciplinary action. If reporting is considered to have negative consequences for the reporter, then anonymity may be
necessary and a written policy provided for the use of such reporting systems, including the rights of the reporters and how the reported information will be used. Much
has been written about this aspect of reporting systems .(e.g., see Dekker ). One
warning is that trust is hard to gain and easy to lose. Once it is lost, regaining it is
even harder than getting buy-in at the beginning.
When reporting involves an outside regulatory agency or industry group, protection of safety information and proprietary data from disclosure and use for
purposes other than improving safety must be provided.
Designing effective reporting systems is very difficult. Examining two successful
efforts, in nuclear power and in commercial aviation, along with the challenges they
face is instructive.
Nuclear Power.
Operators of nuclear power plants in the United States are required to file a
Licensee Event Report .(LER). with the Nuclear Regulatory Commission .(NRC)
whenever an irregular event occurs during plant operation. While the NRC collected
an enormous amount of information on the operating experience of plants in this
way, the data were not consistently analyzed until after the Three Mile Island .(TMI)
accident. The General Accounting Office .(GAO). had earlier criticized the NRC for
this failure, but no corrective action was taken until after the events at TMI .
The system also had a lack of closure. Important safety issues were raised and
studied to some degree, but were not carried through to resolution. Many
of the conditions involved in the TMI accident had occurred previously at other
plants but nothing had been done about correcting them. Babcock and Wilcox,
the engineering firm for TMI, had no formal procedures to analyze ongoing problems at plants they had built or to review the LERs on their plants filed with
the NRC.
The TMI accident sequence started when a pilot-operated relief valve stuck open.
In the nine years before the TMI incident, eleven of those valves had stuck open at
other plants, and only a year before, a sequence of events similar to those at TMI
had occurred at another U.S. plant.
The information needed to prevent TMI was available, including the prior
incidents at other plants, recurrent problems with the same equipment at TMI, and
engineers' critiques that operators had been taught to do the wrong thing in specific
circumstances, yet nothing had been done to incorporate this information into
operating practices.
In reflecting on TMI, the utility's president, Herman Dieckamp, said.
To me that is probably one of the most significant learnings of the whole accident
the degree to which the inadequacies of that experience feedback loop . . . significantly
contributed to making us and the plant vulnerable to this accident .
As a result of this wake-up call, the nuclear industry initiated better evaluation and
follow-up procedures on LERs. It also created the Institute for Nuclear Power
Operations .(INPO). to promote safety and reliability through external reviews of
performance and processes, training and accreditation programs, events analysis,
sharing of operating information and best practices, and special assistance to member
utilities. The IAEA .(International Atomic Energy Agency). and World Association
of Nuclear Operators .(WANO). share these goals and serve similar functions
worldwide.
The reporting system now provides a way for operators of each nuclear power
plant to reflect on their own operating experience in order to identify problems,
interpret the reasons for these problems, and select corrective actions to ameliorate
the problems and their causes. Incident reviews serve as important vehicles for self-analysis, knowledge sharing across boundaries inside and outside specific plants, and
development of problem-resolution efforts. Both INPO and the NRC issue various
letters and reports to make the industry aware of incidents as part of operating
experience feedback, as does IAEAs Incident Reporting System.
The nuclear engineering experience is not perfect, of course, but real strides have
been made since the TMI wakeup call, which luckily occurred without major human
losses. To their credit, an improvement and learning effort was initiated and has
continued. High-profile incidents like TMI are rare, but smaller scale self-analyses
and problem-solving efforts follow detection of small defects, near misses, and precursors and negative trends. Occasionally the NRC has stepped in and required
changes. For example, in 19 96 the NRC ordered the Millstone nuclear power plant
in Connecticut to remain closed until management could demonstrate a “safety
conscious work environment” after identified problems were allowed to continue
without remedial action .
Commercial Aviation.
The highly regarded ASRS .(Aviation Safety Reporting System). has been copied by
many individual airline information systems. Although much information is now
collected, there still exist problems in evaluating and learning from it. The breadth
and type of information acquired is much greater than the NRC reporting system
described above. The sheer number of ASRS reports and the free-form entry of the
information make evaluation very difficult. Few mechanisms are in place to
determine whether the report was accurate or evaluated the problem correctly.
Subjective causal attribution and inconsistency in terminology and information
included in the reports make comparative analysis and categorization difficult and
sometimes impossible.
Existing categorization schemes have also become inadequate as technology
has changed, for example, with increased use of digital technology and computers
in aircraft and ground operations. New categorizations are being implemented,
but that creates problems when comparing data that used older categorization
schemes.
Another problem arising from the goal to encourage use of the system is in the
accuracy of the data. By filing an ASRS report, a limited form of indemnity against
punishment is assured. Many of the reports are biased by personal protection considerations, as evidenced by the large percentage of the filings that report FAA
regulation violations. For example, in a NASA Langley study of reported helicopter
incidents in the ASRS over a nine-year period, nonadherence to FARs .(Federal
Aviation Regulations). was by far the largest category of reports. The predominance
of FAR violations in the incident data may reflect the motivation of the ASRS
reporters to obtain immunity from perceived or real violations of FARs and not
necessarily the true percentages.
But with all these problems and limitations, most agree that the ASRS and
similar industry reporting systems have been very successful and the information
obtained extremely useful in enhancing safety. For example, reported unsafe airport
conditions have been corrected quickly and improvements in air traffic control and
other types of procedures made on the basis of ASRS reports.
The success of the ASRS has led to the creation of other reporting systems in
this industry. The Aviation Safety Action Program .(ASAP). in the United States,
for example, encourages air carrier and repair station personnel to voluntarily
report safety information to be used to develop corrective actions for identified
safety concerns. An ASAP involves a partnership between the FAA and the certified organization .(called the certificate holder). and may also include a third
party, such as the employees labor organization. It provides a vehicle for employees of the ASAP participants to identify and report safety issues to management
and to the FAA without fear that the FAA will use the reports accepted under
the program to take legal enforcement action against them or the company or
that companies will use the information to take disciplinary action against the
employee.
Certificate holders may develop ASAP programs and submit them to the FAA
for review and acceptance. Ordinarily, programs are developed for specific employee
groups, such as members of the flightcrew, flight attendants, mechanics, or dispatchers. The FAA may also suggest, but not require, that a certificate holder develop an
ASAP to resolve an identified safety problem.
When ASAP reports are submitted, an event review committee .(ERC). reviews
and analyzes them. The ERC usually includes a management representative from
the certificate holder, a representative from the employee labor association .(if
applicable), and a specially trained FAA inspector. The ERC considers each ASAP
report for acceptance or denial, and if accepted, analyzes the report to determine
the necessary controls to put in place to respond to the identified problem.
Single ASAP reports can generate corrective actions and, in addition, analysis of
aggregate ASAP data can also reveal trends that require action. Under an ASAP,
safety issues are resolved through corrective action rather than through punishment
or discipline.
To prevent abuse of the immunity provided by ASAP programs, reports are
accepted only for inadvertent regulatory violations that do not appear to involve
an intentional disregard for safety and events that do not appear to involve criminal
activity, substance abuse, or intentional falsification.
Additional reporting programs provide for sharing data that is collected by airlines for their internal use. FOQA .(Flight Operational Quality Assurance). is an
example. Air carriers often instrument their aircraft with extensive flight data
recording systems or use pilot generated checklists and reports for gathering information internally to improve operations and safety. FOQA provides a voluntary
means for the airlines to share this information with other airlines and with the FAA
so that national trends can be monitored and the FAA can target its resources to
address the most important operational risk issues.
In contrast with the ASAP voluntary reporting of single events, FOQA programs
allow the accumulation of accurate operational performance information covering
all flights by multiple aircraft types such that single events or overall patterns of
aircraft performance data can be identified and analyzed. Such aggregate data can
determine trends specific to aircraft types, local flight path conditions, and overall
flight performance trends for the commercial aircraft industry. FOQA data has been
used to identify the need for changing air carrier operating procedures for specific
aircraft fleets and for changing air traffic control practices at certain airports with
unique traffic pattern limitations.
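As a sketch of what aggregate analysis of de-identified flight data can look like, with invented record fields and an arbitrary threshold, events can be grouped by aircraft type and airport so that a pattern invisible in any single flight shows up as a rate that can be tracked and flagged.

    from collections import Counter

    def event_rates(flights):
        """Count de-identified exceedance events per (aircraft type, airport) pair."""
        counts, exposure = Counter(), Counter()
        for f in flights:                          # each flight record is de-identified
            key = (f["aircraft_type"], f["airport"])
            exposure[key] += 1
            counts[key] += len(f["exceedance_events"])
        return {k: counts[k] / exposure[k] for k in exposure}

    def flagged(rates, threshold=0.05):            # invented threshold
        return {k: r for k, r in rates.items() if r > threshold}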
FOQA and other such voluntary reporting programs allow early identification
of trends and changes in behavior .(i.e., migration of systems toward states of increasing risk). before they lead to accidents. Follow-up is provided to ensure that unsafe
conditions are effectively remediated by corrective actions.
A cornerstone of FOQA programs, once again, is the understanding that aggregate data provided to the FAA will be kept confidential and the identity of reporting
personnel or airlines will remain anonymous. Data that could be used to identify
flight crews are removed from the electronic record as part of the initial processing
of the collected data. Air carrier FOQA programs, however, typically provide a
gatekeeper who can securely retrieve identifying information for a limited amount
of time, in order to enable follow-up requests for additional information from the
specific flight crew associated with a FOQA event. The gatekeeper is typically a line
captain designated by the air carriers pilot association. FOQA programs usually
involve agreements between pilot organizations and the carriers that define how
the collected information can be used.
footnote. FOQA is voluntary in the United States but required in some countries.
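A minimal sketch of the de-identification and gatekeeper arrangement described above. The record fields, the retention window, and the in-memory store are assumptions made for illustration; actual FOQA implementations and the agreements with pilot organizations will differ.

```python
import uuid
from datetime import datetime, timedelta

# Gatekeeper-held mapping from anonymous event IDs back to crew identities.
# In practice this is held by a designated line captain, not shared with the FAA.
_gatekeeper_store: dict[str, dict] = {}
RETENTION = timedelta(days=14)  # assumed limited retention window

def deidentify_flight_record(record: dict) -> dict:
    """Strip crew-identifying fields and keep them only in the gatekeeper store."""
    anon_id = str(uuid.uuid4())
    identifying = {k: record[k] for k in ("captain", "first_officer") if k in record}
    _gatekeeper_store[anon_id] = {
        "identity": identifying,
        "expires": datetime.utcnow() + RETENTION,
    }
    cleaned = {k: v for k, v in record.items() if k not in ("captain", "first_officer")}
    cleaned["event_id"] = anon_id
    return cleaned

def gatekeeper_lookup(anon_id: str) -> dict | None:
    """Allow limited-time follow-up with the crew associated with a FOQA event."""
    entry = _gatekeeper_store.get(anon_id)
    if entry and datetime.utcnow() < entry["expires"]:
        return entry["identity"]
    return None
```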
section 12.5.
Using the Feedback.
Once feedback is obtained, it needs to be used to update the controllers process
models and perhaps control algorithms. The feedback and its analysis may be passed
to others in the control structure who need it.
Information must be provided in a form that people can learn from, apply to
their daily jobs, and use throughout the system life cycle.
Various types of analysis may be performed by the controller on the feedback,
such as trend analysis. If flaws in the system design or unsafe changes are detected,
obviously actions are required to remedy the problems.
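As one illustration of the kind of trend analysis a controller might apply to its feedback, the sketch below flags categories of reported anomalies whose counts keep rising from month to month. The event format, categories, and threshold are hypothetical.

```python
from collections import defaultdict

def rising_trends(events, min_increase=2):
    """Flag anomaly categories whose monthly report counts keep increasing.

    `events` is an iterable of (month, category) pairs, e.g. ("2025-01", "valve leak").
    Returns categories whose counts rose by at least `min_increase` reports in
    every month-to-month step, over at least three months of data.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for month, category in events:
        counts[category][month] += 1

    flagged = []
    for category, by_month in counts.items():
        series = [by_month[m] for m in sorted(by_month)]
        if len(series) >= 3 and all(b - a >= min_increase
                                    for a, b in zip(series, series[1:])):
            flagged.append(category)
    return flagged
```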
In major accidents, precursors and warnings are almost always present but ignored
or mishandled. While what appear to be warnings are sometimes simply a matter
of hindsight, sometimes clear evidence does exist. In 19 82 , two years before the
Bhopal accident, for example, an audit was performed that identified many of the
deficiencies involved in the loss. The audit report noted factors related to the later tragedy, such as filter-cleaning operations performed without slip blinds, leaking valves, and bad pressure gauges. The report recommended raising the capability
of the water curtain and pointed out that the alarm at the flare tower was nonoperational and thus any leakage could go unnoticed for a long time. The report also
noted that a number of hazardous conditions were known and allowed to persist
for considerable amounts of time or inadequate precautions were taken against
them. In addition, there was no follow-up to ensure that deficiencies were corrected.
According to the Bhopal manager, all improvements called for in the report
had been implemented, but obviously that was either untrue or the fixes were
ineffective.
As with accidents and incidents, warning signs or anomalies also need to be
analyzed using CAST. Because practice will naturally deviate from procedures, often
for very good reasons, the gap between procedures and practice needs to be monitored and understood .
section 12.6.
Education and Training.
Everyone in the safety control structure, not just the lower-level controllers of the
physical systems, must understand their roles and responsibilities with respect to
safety and why the system.including the organizational aspects of the safety control
structure.was designed the way it was.
People, both managers and operators, need to understand the risks they are taking
in the decisions they make. Often bad decisions are made because the decision
makers have an incorrect assessment of the risks being assumed, which has implications for training. Controllers must know exactly what to look for, not just be told
to look for “weak signals,” a common suggestion in the HRO literature. Before a
bad outcome occurs, weak signals are simply noise; they take on the appearance of
signals only in hindsight, when their relevance becomes obvious. Telling managers
and operators to “be mindful of weak signals” simply creates a pretext for blame
after a loss event occurs. Instead, the people involved need to be knowledgeable
about the hazards associated with the operation of the system if we expect them to
recognize the precursors to an accident. Knowledge turns unidentifiable weak signals
into identifiable strong signals. People need to know what to look for.
Decision makers at all levels of the safety control structure also need to understand the risks they are taking in the decisions they make. Training should include
not just what but why. For good decision making about operational safety, decision
makers must understand the system hazards and their responsibilities with respect
to avoiding them. Understanding the safety rationale, that is, the “why,” behind the
system design will also have an impact on combating complacency and unintended
changes leading to hazardous states. This rationale includes understanding why
previous accidents occurred. The Columbia Accident Investigation Board was surprised at the number of NASA engineers in the Space Shuttle program who had
never read the official Challenger accident report . In contrast, everyone in the
U.S. nuclear Navy has training about the Thresher loss every year.
Training should not be a one-time event for employees but should be continual
throughout their employment, if only as a reminder of their responsibilities and the
system hazards. Learning about recent events and trends can be a focus of this
training.
Finally, assessing for training effectiveness, perhaps during regular audits, can
assist in establishing an effective improvement and learning process.
With highly automated systems, an assumption is often made that less training is
required. In fact, training requirements go up .(not down). in automated systems, and
they change their nature. Training needs to be more extensive and deeper when
using automation. One of the reasons for this requirement is that human operators
of highly automated systems not only need a model of the current process state and
how it can change state but also a model of the automation and its operation, as
discussed in chapter 8.
To control complex and highly automated systems safely, operators .(controllers)
need to learn more than just the procedures to follow. If we expect them to control
and monitor the automation, they must also have an in-depth understanding of the
controlled physical process and the logic used in any automated controllers they
may be supervising. System controllers.at all levels.need to know.
• The system hazards and the reason behind safety-critical procedures and operational rules.
• The potential result of removing or overriding controls, changing prescribed
procedures, and inattention to safety-critical features and operations. Past accidents and their causes should be reviewed and understood.
•How to interpret feedback. Training needs to include different combinations of
alerts and sequences of events, not just single events.
•How to think flexibly when solving problems. Controllers need to be provided
with the opportunity to practice problem solving.
•General strategies rather than specific responses. Controllers need to develop
skills for dealing with unanticipated events.
•How to test hypotheses in an appropriate way. To update mental models,
human controllers often use hypothesis testing to understand the system state
better and update their process models. Such hypothesis testing is common with
computers and automated systems where documentation is usually so poor
and hard to use that experimentation is often the only way to understand the
automation behavior and design. Such testing can, however, lead to losses.
Designers need to provide operators with the ability to test hypotheses safely
and controllers must be educated on how to do so.
Finally, as with any system, emergency procedures must be overlearned and continually practiced. Controllers must be provided with operating limits and specific
actions to take in case they are exceeded. Requiring operators to make decisions
under stress and without full information is simply another way to ensure that they
will be blamed for the inevitable loss event, usually based on hindsight bias. Critical
limits must be established and provided to the operators, and emergency procedures
must be stated explicitly.
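A minimal sketch of what stating critical limits and their associated actions explicitly might look like: each limit is paired with a pre-decided action so the operator is not forced to improvise under stress. The parameter names, limit values, and actions are invented for illustration and are not taken from any particular system.

```python
# Each critical operating limit is paired with an explicit, pre-decided action.
CRITICAL_LIMITS = {
    "tank_pressure_kpa": {"max": 850,
                          "action": "open relief valve RV-2 and shut down feed pump"},
    "reactor_temp_c":    {"max": 180,
                          "action": "initiate emergency cooling and notify shift supervisor"},
}

def check_limits(readings: dict) -> list[str]:
    """Return the explicit emergency actions for any exceeded limits."""
    actions = []
    for name, reading in readings.items():
        limit = CRITICAL_LIMITS.get(name)
        if limit and reading > limit["max"]:
            actions.append(f"{name} = {reading} exceeds {limit['max']}: {limit['action']}")
    return actions

print(check_limits({"tank_pressure_kpa": 900, "reactor_temp_c": 150}))
```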
section 12.7.
Creating an Operations Safety Management Plan.
The operations safety management plan is used to guide operational control of
safety. The plan describes the objectives of the operations safety program and how
they will be achieved. It provides a baseline to evaluate compliance and progress.
Like every other part of safety program, the plan will need buy-in and oversight.
The organization should have a template and documented expectations for operations safety management plans, but this template may need to be tailored for
particular project requirements.
The information need not all be contained in one document, but there should be
a central reference with pointers to where the information can be found. As is true
for every other part of the safety control structure, the plan should include review
procedures for the plan itself as well as how the plan will be updated and improved
through feedback from experience.
Some things that might be included in the plan.
1.•
General Considerations.
Scope and objectives.
Applicable standards .(company, industry)
Documentation and reports.
Review of plan and progress reporting procedures.
2.•
Safety Organization .(safety control structure)
Personnel qualifications and duties.
Staffing and manpower.
Communication channels
Responsibility, authority, accountability .(functional organization, organizational structure)
Information requirements .(feedback requirements, process model, updating
requirements)
Subcontractor responsibilities.
Coordination.
Working groups.
System safety interfaces with other groups, such as maintenance and test,
occupational safety, quality assurance, and so on.
3.•
Procedures.
Problem reporting .(processes, follow-up)
Incident and accident investigation, including procedures, staffing .(participants), and follow-up .(tracing to hazard and risk analyses, communication)
Testing and audit program, including procedures, scheduling, review and follow-up, metrics and trend analysis, and operational assumptions from hazard and risk analyses.
Emergency and contingency planning and procedures.
Management of change procedures.
Training.
Decision making, conflict resolution.
4.•
Schedule.
Critical checkpoints and milestones.
Start and completion dates for tasks, reports, reviews.
Review procedures and participants.
5.•
Safety Information System.
Hazard and risk analyses, hazard logs .(controls, review and feedback procedures)
Hazard tracking and reporting system.
Lessons learned.
Safety data library .(documentation and files)
Records retention policies.
6.•
Operations hazard analysis.
Identified hazards.
Mitigations for hazards.
7.•Evaluation and planned use of feedback to keep the plan up-to-date and improve it over time.
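As a minimal sketch of how the outline above might be kept as a living document rather than a static report, each top-level section can carry an owner and a next-review date so that plan review and updating can be tracked. The section names follow the outline; the owners, dates, and Python structure are illustrative assumptions, not a prescribed format.

```python
from datetime import date

# Illustrative plan skeleton: each section has an explicit owner and review date.
plan_sections = {
    "General considerations":       {"owner": "operations safety lead",      "next_review": date(2025, 6, 1)},
    "Safety organization":          {"owner": "program manager",             "next_review": date(2025, 6, 1)},
    "Procedures":                   {"owner": "operations safety lead",      "next_review": date(2025, 3, 1)},
    "Schedule":                     {"owner": "program manager",             "next_review": date(2025, 3, 1)},
    "Safety information system":    {"owner": "safety information manager",  "next_review": date(2025, 9, 1)},
    "Operations hazard analysis":   {"owner": "system safety engineer",      "next_review": date(2025, 3, 1)},
    "Feedback and plan improvement":{"owner": "operations safety lead",      "next_review": date(2025, 9, 1)},
}

def overdue_reviews(today: date) -> list[str]:
    """Return plan sections whose scheduled review date has passed."""
    return [name for name, info in plan_sections.items()
            if info["next_review"] < today]

print(overdue_reviews(date(2025, 7, 1)))
```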
section 12.8. Applying STAMP to Occupational Safety.
Occupational safety has, traditionally, not taken a systems approach but instead has
focused on individuals and changing their behavior. In applying systems theory to
occupational safety, more emphasis would be placed on understanding the impact
of system design on behavior and would focus on changing the system rather than
people. For example, vehicles used in large plants could be equipped with speed
regulators rather than depending on humans to follow speed limits and then punishing them when they do not. The same design for safety principles presented in
chapter 9 for human controllers apply to designing for occupational safety.
With the increasing complexity and automation of our plants, the line between
occupational safety and engineering safety is blurring. By designing the system to
be safe despite normal human error or judgment errors under competing work
pressures, workers will be better protected against injury while fulfilling their job
responsibilities.

1113
chapter13.raw Normal file

File diff suppressed because it is too large Load Diff

995
chapter13.txt Normal file

@ -0,0 +1,995 @@
chapter 13.
Managing Safety and the Safety Culture.
The key to effectively accomplishing any of the goals described in the previous
chapters lies in management. Simply having better tools is not enough if they are
not used. Studies have shown that management commitment to the safety goals is
the most important factor distinguishing safe from unsafe systems and companies
. Poor management decision making can undermine any attempts to improve
safety and ensure that accidents continue to occur.
This chapter outlines some of the most important management factors in reducing accidents. The first question is why managers should care about and invest in
safety. The answer, in short, is that safety pays and investment in safety provides
large returns over the long run.
If managers understand the importance of safety in achieving organizational
goals and decide they want to improve safety in their organizations, then three basic
organizational requirements are necessary to achieve that goal. The first is an effective safety control structure. Because of the importance of the safety culture in how
effectively the safety control structure operates, the second requirement is to implement and sustain a strong safety culture. But even the best of intentions will not
suffice without the appropriate information to carry them out, so the last critical
factor is the safety information system.
The previous chapters in this book focus on what needs to be done during design
and operations to control safety and enforce the safety constraints. This chapter
describes the overarching role of management in this process.
section 13.1. Why Should Managers Care about and Invest in Safety?
Most managers do care about safety. The problems usually arise because of misunderstandings about what is required to achieve high safety levels and what the
costs really are if safety is done right. Safety need not entail enormous financial or
other costs.
A classic myth is that safety conflicts with achieving other goals and that tradeoffs
are necessary to prevent losses. In fact, this belief is totally wrong. Safety is a prerequisite for achieving most organizational goals, including profits and continued
existence.
History is replete with examples of major accidents leading to enormous financial
losses and the demise of companies as a result. Even the largest global corporations
may not be able to withstand the costs associated with such losses, including loss of
reputation and customers. After all these examples, it is surprising that few seem to
learn from them about their own vulnerabilities. Perhaps it is in the nature of
mankind to be optimistic and to assume that disasters cannot happen to us, only
to others. In addition, in the simpler societies of the past, holding governments
and organizations responsible for safety was less common. But with loss of control
over our own environment and its hazards, and with rising wealth and living standards, the public is increasingly expecting higher standards of behavior with respect
to safety.
The “conflict” myth arises because of a misunderstanding about how safety is
achieved and the long-term consequences of operating under conditions of high risk.
Often, with the best of intentions, we simply do the wrong things in our attempts to
improve safety. Its not a matter of lack of effort or resources applied, but how they
are used that is the problem. Investments in safety need to be funneled to the most
effective activities in achieving it.
Sometimes it appears that organizations are playing a sophisticated version of
Whack-a-Mole, where symptoms are found and fixed but not the processes that
allow these symptoms to occur. Enormous resources may be expended with little
return on the investment. So many incidents occur that they cannot all be investigated in depth, so only superficial analysis of a few is attempted. If, instead, a few
were investigated in depth and the systemic factors fixed, the number of incidents
would decrease by orders of magnitude.
Such groups find themselves in continual firefighting mode and eventually conclude that accidents are inevitable and investments to prevent them are not cost-effective, thus, like Sisyphus, condemning themselves to traverse the same vicious
circle in perpetuity. Often they convince themselves that their industry is just more
hazardous than others and that accidents in their world are inevitable and are the
price of productivity.
This belief that accidents are inevitable and occur because of random chance
arises from our own inadequate efforts to prevent them. When accident causes are
examined in depth, using the systems approach in this book, it becomes clear that
there is nothing random about them. In fact, we seem to have the same accident
over and over again, with only the symptoms differing, but the causes remaining
fairly constant. Most of these causes could be eliminated, but they are not. The
precipitating immediate factors, like a stuck valve, may have some randomness
associated with them, such as which valve actually precipitates a loss. But there is
nothing random about systemic factors that have not been corrected and exist
over long periods of time, such as flawed valve design and analysis or inadequate
maintenance practices.
As described in previous chapters, organizations tend to move inexorably toward
states of higher risk under various types of performance pressures until an accident
becomes inevitable. Under external or internal pressures, projects start to violate
their own rules. “Well do it just this once.its critical that we get this procedure
finished today.” In the Deepwater Horizon oil platform explosion of 20 10 , cost pressures led to not following standard safety procedures and, in the end, to enormous
financial losses . Similar dynamics occurred, with slightly different pressures, in
the Columbia Space Shuttle loss where the tensions among goals were created by
forces largely external to NASA. What appear to be short-term conflicts of other
organizational goals with safety goals, however, may not exist over the long term,
as witnessed in both these cases.
When operating at elevated levels of risk, the only question is which of many
potential events will trigger the loss. Before the Columbia accident, NASA manned
space operations was experiencing a slew of problems in the orbiters. The head of
the NASA Manned Space Program at the time misinterpreted the fact that they
were finding and fixing problems and wrote a report that concluded risk had been
reduced by more than a factor of five . The same unrealistic perception of risk
led to another report in 19 95 recommending that NASA “restructure and reduce
overall safety, reliability, and quality assurance elements” .
Figure 13.1 shows some of the dynamics at work. The model demonstrates the
major sources of the high risk in the Shuttle program at the time of the Columbia
loss. In order to get the funding needed to build and operate the space shuttle,
NASA had made unachievable performance promises. The need to justify expenditures and prove the value of manned space flight has been a major and consistent
tension between NASA and other governmental entities. The more missions the
Shuttle could fly, the better able the program was to generate funding. Adding to
these pressures was a commitment to get the International Space Station construction complete by February 20 04 .(called “core complete”), which required deliveries
of large items that could only be carried by the shuttle. The only way to meet the
deadline was to have no launch delays, a level of performance that had never previously been achieved . As just one indication of the pressure, computer screen
savers were mailed to managers in NASAs human spaceflight program that depicted
a clock counting down .(in seconds). to the core complete deadline .
The control loop in the lower left corner of figure 13.1, labeled R1 or Pushing
the Limit, shows how as external pressures increased, performance pressure
increased, which led to increased launch rates and thus success in meeting the launch
rate expectations, which in turn led to increased expectations and increasing performance pressures. This reinforcing loop represents an unstable system and cannot
be maintained indefinitely, but NASA is a “can-do” organization that believes
anything can be accomplished with enough effort .
The upper left loop represents the Space Shuttle safety program, which when
operating effectively is meant to balance the risks associated with loop R1. The external influences of budget cuts and increasing performance pressures, however, reduced
the priority of safety procedures and led to a decrease in system safety efforts.
Adding to the problems is the fact that system safety efforts led to launch delays
when problems were found, which created another reason for reducing the priority
of the safety efforts in the face of increasing launch pressures.
While reduction in safety efforts and lower prioritization of safety concerns may
lead to accidents, accidents usually do not occur for a while so false confidence is
created that the reductions are having no impact on safety and therefore pressures
increase to reduce the efforts and priority even further as the external and internal
performance pressures mount.
The combination of the decrease in safety efforts along with loop B2 in which
fixing the problems that were being found increased complacency, which also
contributed to reduction of system safety efforts, eventually led to a situation of
unrecognized high risk.
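The qualitative behavior of these interacting loops can be sketched with a toy simulation: the reinforcing Pushing the Limit loop drives pressure up while safety effort erodes and unrecognized risk grows. All variable names, coefficients, and initial values below are invented for the sketch; they are not taken from the actual system dynamics model in figure 13.1.

```python
# Toy illustration of the loop structure described above:
# R1 ("Pushing the Limit"): success under pressure raises expectations, which raise pressure.
# Safety loop: pressure and complacency erode safety effort, so risk quietly grows.
performance_pressure = 1.0
safety_effort = 1.0
risk = 0.1

for year in range(10):
    launch_rate = performance_pressure               # more pressure leads to more launches
    expectations = 0.9 * launch_rate                 # meeting launch goals raises expectations (R1)
    performance_pressure += 0.1 * expectations       # higher expectations raise pressure (R1)
    safety_effort = max(0.1, safety_effort - 0.08 * performance_pressure)  # erosion of safety efforts
    risk = min(1.0, risk + 0.05 * (1.0 - safety_effort))                   # unrecognized growth in risk
    print(f"year {year}: pressure={performance_pressure:.2f} "
          f"safety_effort={safety_effort:.2f} risk={risk:.2f}")
```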
When working at such elevated levels of risk, the only question is which of many
potential events will trigger the loss. The fact that it was the foam and not one of
the other serious problems identified both before and after the loss was the only
random part of the accident. At the time of the Columbia accident, NASA was
regularly flying the Shuttle with many uncontrolled hazards; the foam was just one
of them.
Often, ironically, our successful efforts to eliminate or reduce accidents contribute to the march toward higher risk. Perception of the risk associated with an activity
often decreases over a period of time when no losses occur even though the real
risk has not changed at all. This misperception leads to reducing the very factors
that are preventing accidents because they are seen as no longer needed and available to trade off with other needs. The result is that risk increases until a major loss
occurs. This vicious cycle needs to be broken to prevent accidents. In STAMP terms,
the weakening of the safety control structure over time needs to be prevented or
detected before the conditions occur that lead to a loss.
System migration toward states of higher risk is potentially controllable and
detectable . The migration results from weakening of the safety control structure. To achieve lasting results, strong operational safety efforts are needed that
provide protection from and appropriate responses to the continuing environmental
influences and pressures that tend to degrade safety over time and that change the
safety control structure and the behavior of those in it.
The experience in the nuclear submarine community is a testament to the fact
that such dynamics can be overcome. The SUBSAFE program .(described in the
next chapter). was established after the loss of the Thresher in 19 63 . Since that time,
no submarine in the SUBSAFE program, that is, satisfying the SUBSAFE requirements, has been lost, although such losses were common before SUBSAFE was
established.
The leaders in SUBSAFE describe other benefits beyond preventing the loss of
critical assets. Because those operating the submarines have complete confidence
in their ships, they can focus solely on the completion of their mission. The U.S.
nuclear submarine programs experience over the past forty-five years belies the
myth that increasing safety necessarily decreases system performance. Over a sustained period, a safer operation is generally more efficient. One reason is that stoppages and delays are eliminated.
Examples can also be found in private industry. As just one example, because of
a number of serious accidents, OSHA tried to prohibit the use of power presses
where employees had to place one or both hands beneath the ram during the production cycle . After vehement protests that the expense would be too great in
terms of reduced productivity, the requirement was dropped. Preliminary motion
studies showed that reduced production would result if all loading and unloading
were done with the die out from under the ram. Some time after OSHA gave up
on the idea, one manufacturer who used power presses decided, purely as a safety
and humanitarian measure, to accept the production penalty. Instead of reducing
production, however, the effect was to increase production from 5 to 15 percent,
even though the machine cycle was longer. Other examples of similar experiences
can be found in Safeware .
The belief that safer systems cost more or that building safety in from the beginning necessarily requires unacceptable compromises with other goals is simply not
justified. The costs, like anything else, depend on the methods used to achieve
increased safety. In another ironic twist, in the attempt to avoid making tradeoffs
with safety, systems are often designed to optimize mission goals and safety devices
added grudgingly when the design is complete. This approach, however, is the most
expensive and least effective that could be used. The costs are much less and in
fact can be eliminated if safety is built into the system design from the beginning
rather than added on or retrofitted later, usually in the form of redundancy
or elaborate protection systems. Eliminating or reducing hazards early in design
often results in a simpler design, which in itself may reduce both risk and costs.
The reduced risk makes it more likely that the mission or system goals will be
achieved.
Sometimes it takes a disaster to “get religion” but it should not have to. This
chapter was written for those managers who are wise enough to know that investment in safety pays dividends, even before this fact is brought home .(usually too
late). by a tragedy.
footnote. Appendix D explains how to read system dynamics models, for those unfamiliar with them.
section 13.2. General Requirements for Achieving Safety Goals.
Escaping from the Whack-a-Mole trap requires identifying and eliminating the
systemic factors behind accidents. Some common reasons why safety efforts are
often not cost-effective were identified in chapter 6, including.
1.•Superficial, isolated, or misdirected safety engineering activities, such as spending most of the effort proving the system is safe rather than making it so.
2.•Starting too late.
3.•Using techniques inappropriate for todays complex systems and new
technology.
4.•Focusing only on the technical parts of the system, and
5.• Assuming systems are static throughout their lifetime and decreasing attention
to safety during operations.
Safety needs to be managed and appropriate controls established. The major ingredients of effective safety management include.
1.•
Commitment and leadership
2.• A corporate safety policy
3.•Risk awareness and communication channels
4.•Controls on system migration toward higher risk
5.• A strong corporate safety culture
6.• A safety control structure with appropriate assignment of responsibility, authority, and accountability
7.• A safety information system
8.•Continual improvement and learning
9.•Education, training, and capability development
Each of these is described in what follows.
section 13.2.1. Management Commitment and Leadership.
Top management concern about safety is the most important factor in discriminating between safe and unsafe companies matched on other variables . This
commitment must be genuine, not just a matter of sloganeering. Employees need
to feel they will be supported if they show concern for safety. An Air Force study
of system safety concluded.
Air Force top management support of system safety has not gone unnoticed by contractors. They now seem more than willing to include system safety tasks, not as “window
dressing” but as a meaningful activity .
The B1-B program is an example of how this result was achieved. In that development program, the program manager or deputy program manager chaired the meetings of the group where safety decisions were made. “An unmistaken image of
the importance of system safety in the program was conveyed to the contractors”
.
A managers open and sincere concern for safety in everyday dealings with
employees and contractors can have a major impact on the reception given to safety-related activities . Studies have shown that top managements support for and
participation in safety efforts is the most effective way to control and reduce accidents . Support for safety is shown by personal involvement, by assigning capable
people and giving them appropriate objectives and resources, by establishing comprehensive organizational safety control structures, and by responding to initiatives
by others.
section 13.2.2. Corporate Safety Policy.
A policy is a written statement of the wisdom, intentions, philosophy, experience,
and belief of an organizations senior managers that states the goals for the organization and guides their attainment . The corporate safety policy provides employees with a clear, shared vision of the organizations safety goals and values and a
strategy to achieve them. It documents and shows managerial priorities where safety
is involved.
The author has found companies that justify not having a safety policy on the
grounds that “everyone knows safety is important in our business.” While safety may
seem important for a particular business, management remaining mute on their
policy conveys the impression that tradeoffs are acceptable when safety seems to
conflict with other goals. The safety policy provides a way for management to clearly
define the priority between conflicting goals they expect to be used in decision
making. The safety policy should define the relationship of safety to other organizational goals and provide the scope for discretion, initiative, and judgment in
deciding what should be done in specific situations.
Safety policy should be broken into two parts. The first is a short and concise
statement of the safety values of the corporation and what is expected from employees with respect to safety. Details about how the policy will be implemented should
be separated into other documents.
A complete safety policy contains such things as the goals of the safety program;
a set of criteria for assessing the short- and long-term success of that program with
respect to the goals; the values to be used in tradeoff decisions; and a clear statement
of responsibilities, authority, accountability, and scope. The policy should be explicit
and state in clear and understandable language what is expected, not a set of lofty
goals that cannot be operationalized. An example sometimes found .(as noted in the
previous chapter). is a policy for employees to “be mindful of weak signals”. This
policy provides no useful guidance on what to do.both “mindful” and “weak
signals” are undefined and undefinable. An alternative might be, “If you see
something that you think is unsafe, you are responsible for reporting it immediately.”
In addition, employees need to be trained on the hazards in the processes they
control and what to look for.
Simply having a safety policy is not enough. Employees need to believe the
safety policy reflects true commitment by management. The only way this commitment can be effectively communicated is through actions by management that
demonstrate that commitment. Employees need to feel that management will
support them when they make reasonable decisions in favor of safety over alternative goals. Incentives and reward structures must encourage the proper handling of
tradeoffs between safety and other goals. Not only the formal rewards and rules
but also the informal rules .(social processes). of the organizational culture must
support the overall safety policy. A practical test is whether employees believe that
company management will support them if they choose safety over the demands of
production .
To encourage proper decision making, the flexibility to respond to safety problems needs to be built into the organizational procedures. Schedules, for example,
should be adaptable to allow for uncertainties and possibilities of delay due to
legitimate safety concerns, and production goals must be reasonable.
Finally, not only must a safety policy be defined, it must be disseminated and
followed. Management needs to ensure that safety receives appropriate attention
in decision making. Feedback channels must be established and progress in achieving the goals should be monitored and improvements identified, prioritized, and
implemented.
section 13.2.3. Communication and Risk Awareness.
Awareness of the risk in the controlled process is a major component of safety-related decision making by controllers. The problem is that risk, when defined
as the severity of a loss event combined with its likelihood, is not calculable or
knowable. It can only be estimated from a set of variables, some of which may be
unknown, or the information to evaluate likelihood of these variables may be
lacking or incorrect. But decisions need to be made based on this unknowable
property.
In the absence of accurate information about the state of the process, risk perception may be reevaluated downward as time passes without an accident. In fact, risk
probably has not changed, only our perception of it. In this trap, risk is assumed to
be reflected by a lack of accidents or incidents and not by the state of the safety
control structure.
When STAMP is used as the foundation of the safety program, safety and risk
are a function of the effectiveness of the controls to enforce safe system behavior, that
is, the safety constraints and the control structure used to enforce those constraints.
Poor safety-related decision making on the part of management, for example, is
commonly related to inadequate feedback and inaccurate process models. As such,
risk is potentially knowable and not some amorphous property denoted by probability estimates. This new definition of risk can be used to create new risk assessment
procedures.
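A minimal sketch of what such a control-based assessment might look like: each safety constraint is scored by the state of the controls and feedback that enforce it, rather than by estimated likelihoods of loss events. The constraints, evidence categories, and scoring weights are hypothetical.

```python
# Hypothetical audit evidence for each safety constraint: does a control exist,
# is feedback on it current, and have recent audit findings been closed out?
constraint_status = {
    "relief valves maintained and tested": {"control_in_place": True,
                                            "feedback_current": True,
                                            "audit_findings_open": 0},
    "operators trained on alarm response": {"control_in_place": True,
                                            "feedback_current": False,
                                            "audit_findings_open": 3},
    "changes reviewed for safety impact":  {"control_in_place": False,
                                            "feedback_current": False,
                                            "audit_findings_open": 1},
}

def control_weakness(status: dict) -> int:
    """Score a constraint by the weakness of its controls, not by accident counts."""
    score = 0
    if not status["control_in_place"]:
        score += 3
    if not status["feedback_current"]:
        score += 2
    score += status["audit_findings_open"]
    return score

for constraint, status in sorted(constraint_status.items(),
                                 key=lambda kv: control_weakness(kv[1]),
                                 reverse=True):
    print(f"{constraint}: weakness score {control_weakness(status)}")
```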
While lack of accidents could reflect a strong safety control structure, it may also
simply reflect delays between the relaxation of the controls and negative consequences. The delays encourage relaxation of more controls, which then leads to
accidents. The basic problem is inaccurate risk perception and calculating risk using
the wrong factors. This process is behind the frequently used but rarely defined label
of “complacency.” Complacency results from inaccurate process models and risk
awareness.
Risk perception is directly related to communication and feedback. The more and
better the information we have about the potential causes of accidents in our system
and the state of the controls implemented to prevent them, the more accurate will
be our perception of risk. Consider the loss of an aircraft when it took off from the
wrong runway in Lexington, Kentucky, in August 20 06 . One of the factors in the
accident was that construction was occurring and the pilots were confused about
temporary changes in taxi patterns. Although similar instances of crew confusion
had occurred in the week before the accident, there were no effective communication channels to get this information to the proper authorities. After the loss, a small
group of aircraft maintenance workers told the investigators that they also had
experienced confusion when taxiing to conduct engine tests.they were worried
that an accident could happen, but did not know how to effectively notify people
who could make a difference .
Another communication disconnect in this accident leading to a misperception
of risk involved a misunderstanding by management about the staffing of the control
tower at the airport. Terminal Services management had ordered the airport air
traffic control management to both reduce control tower budgets and to ensure
separate staffing of the tower and radar functions. It was impossible to comply with
both directives. Because of an ineffective feedback mechanism, management did not
know about the impossible and dangerous goal conflicts they had created or that
the resolution of the conflict was to reduce the budget and ignore the extra staffing
requirements.
Another example occurred in the Deepwater Horizon accident. Reports after the
accident indicated that workers felt comfortable raising safety concerns and ideas
for safety improvement to managers on the rig, but they felt that they could not
raise concerns at the divisional or corporate level without reprisal. In a confidential
survey of workers on Deepwater Horizon taken before the oil platform exploded,
workers expressed concerns about safety.
“Im petrified of dropping anything from heights not because Im afraid of hurting anyone
(the area is barriered off), but because Im afraid of getting fired,” one worker wrote. “The
company is always using fear tactics,” another worker said. “All these games and your
mind gets tired.” Investigators also said “nearly everyone among the workers they interviewed believed that Transoceans system for tracking health and safety issues on the rig
was counter productive.” Many workers entered fake data to try to circumvent the system,
known as See, Think, Act, Reinforce, Track .(or START). As a result, the companys perception of safety on the rig was distorted, the report concluded.
Formal methods of operation and strict hierarchies can limit communication. When
information is passed up hierarchies, it may be distorted, depending on the interests
of managers and the way they interpret the information. Concerns about safety may
even be completely silenced as it passes up the chain of command. Employees may
not feel comfortable going around a superior who does not respond to their concerns. The result may be a misperception of risk, leading to inadequate control
actions to enforce the safety constraints.
In other accidents, reporting and feedback systems are simply unused for a
variety of reasons. In many losses, there was evidence that a problem occurred
in time to prevent the loss, but there was either no communication channel established for getting the information to those who could understand it and to those
making decisions or, alternatively, the problem-reporting channel was ineffective or
simply unused.
Communication is critical in both providing information and executing control
actions and in providing feedback to determine whether the control actions were
successful and what further actions are required. Decision makers need accurate
and timely information. Channels for information dissemination and feedback need
to be established that include a means for comparing actual performance with
desired performance and ensuring that required action is taken.
In summary, both the design of the communication channels and the communication dynamics must be considered as well as potential feedback delays. As an
example of communication dynamics, reliance on face-to-face verbal reports during
group meetings is a common method of assessing lower-level operations , but,
particularly when subordinates are communicating with superiors, there is a tendency for adverse situations to be underemphasized .
section 13.2.4. Controls on System Migration toward Higher Risk.
One of the key assumptions underlying the approach to safety described in this
book is that systems adapt and change over time. Under various types of pressures,
that adaptation often moves in the direction of higher risk. The good news is, as
stated earlier, that adaptation is predictable and potentially controllable. The safety
control structure must provide protection from and appropriate responses to the
continuing influences and pressures that tend to degrade safety over time. More
specifically, the potential reasons for and types of migration toward higher risk need
to be identified and controls instituted to prevent it. In addition, audits and performance assessments based on the safety constraints identified during system development can be used to detect migration and the violation of the constraints as described
in chapter 12.
One way to prevent such migration is to anchor safety efforts beyond short-term
program management pressures. At one time, NASA had a strong agency-wide
system safety program with common standards and requirements levied on everyone. Over time, agency-wide standards were eviscerated, and programs were allowed
to set their own standards under the control of the program manager. While the
manned space program started out with strong safety standards, under budget and
performance pressures they were progressively weakened .
As one example, a basic requirement for an effective operational safety program
is that all potentially hazardous incidents during operations are thoroughly investigated. Debris shedding had been identified as a potential hazard during Shuttle
development, but the standard for performing hazard analyses in the Space Shuttle
program was changed to specify that hazards would be revisited only when there
was a new design or the Shuttle design was changed, not after an anomaly .(such as
foam shedding). occurred .
After the Columbia accident, safety standards in the Space Shuttle program .(and
the rest of NASA). were effectively anchored and protected from dilution over time
by moving responsibility for them outside the projects.
section 13.2.5. Safety, Culture, and Blame.
The high-level goal in managing safety is to create and maintain an effective safety
control structure. Because of the importance of safety culture in how the control
structure operates, achieving this goal requires implementing and sustaining a strong
safety culture.
Proper function of the safety control structure relies on decision making by the
controllers in the structure. Decision making always rests upon a set of industry or
organizational values and assumptions. A culture is a set of shared values and norms,
a way of looking at and interpreting the world and events around us and of taking
action in a social context. Safety culture is that subset of culture that reflects the
general attitude and approaches to safety and risk management.
Schein divides culture into three levels .(figure 13.2). . At the top are the
surface-level cultural artifacts or routine aspects of everyday practice including
hazard analyses and control algorithms and procedures. The second, middle level is
the stated organizational rules, values, and practices that are used to create the top-level artifacts, such as safety policy, standards, and guidelines. At the lowest level is
the often invisible but pervasive underlying deep cultural operating assumptions
upon which actions are taken and decisions are made and thus upon which the upper
levels rest.
Trying to change safety outcomes by simply changing the organizational
structures.including policies, goals, missions, job descriptions, and standard operating procedures.may lower risk over the short term, but superficial fixes that do
not address the set of shared values and social norms are very likely to be undone
over time. Changes are required in the organizational values that underlie peoples
behavior.
Safety culture is primarily set by the leaders of the organization as they establish
the basic values under which decisions will be made. This fact explains why leadership and commitment by leaders is critical in achieving high levels of safety.
To engineer a safety culture requires identifying the desired organizational safety
principles and values and then establishing a safety control structure to achieve
those values and to sustain them over time. Sloganeering or jawboning is not
enough. All aspects of the safety control structure must be engineered to be in alignment with the organizational safety principles, and the leaders must be committed
to the stated policies and principles related to safety in the organization.
Along with leadership and commitment to safety as a basic value of the organization, achieving safety goals requires open communication. In an interview after the
Columbia loss, the new center director at Kennedy Space Center suggested that the
most important cultural issue the Shuttle program faced was establishing a feeling
of openness and honesty with all employees, where everybodys voice was valued.
Statements during the Columbia accident investigation and messages posted to the
NASA Watch website describe a lack of trust of NASA employees to speak up. At
the same time, a critical observation in the C A I B report focused on the engineers
claims that the managers did not hear the engineers concerns . The report concluded that this was in part due to the managers not asking or listening. Managers
created barriers against dissenting opinions by stating preconceived conclusions
based on subjective knowledge and experience rather than on solid data. Much of
the time they listened to those who told them what they wanted to hear. One indication of the poor communication around safety and the atmosphere at the time
were statements in the 19 95 Kraft report that dismissed concerns about Space
Shuttle safety by accusing those who made them of being partners in an unneeded
“safety shield conspiracy.”
Unhealthy work atmospheres with respect to safety and communication are not
limited to NASA. Carroll documents a similarly dysfunctional safety culture at the
Millstone nuclear power plant . An NRC review in 19 96 concluded the safety
culture at the plant was dangerously flawed. It did not tolerate dissenting views and
stifled questioning attitudes among employees.
Changing such interaction patterns is not easy. Management style can be addressed
through training, mentoring, and proper selection of people to fill management
positions, but trust is hard to gain and easy to lose. Employees need to feel psychologically safe about reporting concerns and to believe that managers can be trusted
to hear their concerns and to take appropriate action, while managers have to
believe that employees are worth listening to and worthy of respect.
The difficulty is in getting people to change their view of reality. Gareth Morgan,
a social anthropologist, defines culture as an ongoing, proactive process of reality
construction. According to this view, organizations are socially constructed realities
that rest as much in the heads and minds of their members as they do in concrete
sets of rules and regulations. Morgan asserts that organizations are “sustained by
belief systems that emphasize the importance of rationality” . This myth of
rationality “helps us to see certain patterns of action as legitimate, credible, and
normal, and hence to avoid the wrangling and debate that would arise if we were
to recognize the basic uncertainty and ambiguity underlying many of our values and
actions” .
For both the Challenger and Columbia accidents, as well as most other major
accidents where decision making was flawed, the decision makers saw their actions
as rational. Understanding and preventing poor decision making under conditions
of uncertainty requires providing environments and tools that help to stretch our
belief systems and to see patterns that we do not necessarily want to see.
Some common types of dysfunctional safety cultures can be identified that are
common to industries or organizations. Hopkins coined the term “culture of denial”
after investigating accidents in the mining industry, but mining is not the only industry in which denial is pervasive. In such cultures, risk assessment is unrealistic and
credible warnings are dismissed without appropriate action. Management only
wants to hear good news and may ensure that is what they hear by punishing bad
news, sometimes in a subtle way and other times not so subtly. Often arguments are
made in these industries that the conditions are inherently more dangerous than
others and therefore little can be done about improving safety or that accidents are
the price of productivity and cannot be eliminated. Of course, this rationale is untrue
but it is convenient.
A second type of dysfunctional safety culture might be termed a “paperwork
culture.” In these organizations, employees spend all their time proving the system
is safe but little time actually doing the things necessary to make it so. After the
Nimrod aircraft loss in Afghanistan in 20 06 , the accident report noted a “culture of
paper safety” at the expense of real safety .
So what are the aspects of a good safety culture, that is, the core values and norms
that allow us to make better decisions around safety?
1.•Safety commitment is valued.
2.•Safety information is surfaced without fear and incident analysis is conducted
without blame.
3.•Incidents and accidents are valued as an important window into systems that
are not functioning as they should.triggering in-depth and uncircumscribed
causal analysis and improvement actions.
4.• There is a feeling of openness and honesty, where everyones voice is respected.
Employees feel that managers are listening.
5a.•There is trust among all parties.
5b.•Employees feel psychologically safe about reporting concerns.
5c.•
Employees believe that managers can be trusted to hear their concerns and
will take appropriate action.
5d.•Managers believe that employees are worth listening to and are worthy of
respect.
Common ingredients of a safety culture based on these values include management
commitment to safety and the safety values, management involvement in achieving
the safety goals, employee empowerment, and appropriate and effective incentive
structures and reporting systems.
When these ingredients form the basis of the safety culture, the organization has
the following characteristics.
1.•Safety is integrated into the dominant culture; it is not a separate subculture.
2.•Safety is integrated into both development and operations. Safety activities
employ a mixture of top-down engineering or reengineering and bottom-up
process improvement.
3.•Individuals have required knowledge, skills, and ability.
4.•Early warning systems for migration toward states of high risk are established
and effective.
5.• The organization has a clearly articulated safety vision, values and procedures,
shared among the stakeholders.
6.• Tensions between safety priorities and other system priorities are addressed
through a constructive, negotiated process.
7.•Key stakeholders .(including all employees and groups such as unions). have full
partnership roles and responsibilities regarding safety.
8.•Passionate, effective leadership exists at all levels of the organization .(particularly the top), and all parts of the safety control structure are committed to
safety as a high priority for the organization.
9.•Effective communication channels exist for disseminating safety information.
10.•High levels of visibility of the state of safety .(i.e., risk awareness). exist at all
levels of the safety control structure through appropriate and effective
feedback.
11.• The results of operating experience, process hazard analyses, audits, near misses,
or accident investigations are used to improve operations and the safety control
structure.
12.•
Deficiencies found during assessments, audits, inspections, and incident investigation are addressed promptly and tracked to completion.
The Just Culture Movement.
The Just Culture movement is an attempt to avoid the type of unsafe cultural values
and professional interactions that have been implicated in so many accidents.
Its origins are in aviation although some in the medical community, particularly
hospitals, have also taken steps down this road. Much has been written on Just
Culture.only a summary is provided here. The reader is directed in particular
to Dekkers book Just Culture , which is the source of much of what follows in
this section.
A foundational principle of Just Culture is that the difference between a safe and
unsafe organization is how it deals with reported incidents. This principle stems from
the belief that an organization can benefit more by learning from mistakes than by
punishing people who make them.
In an organization that promotes such a Just Culture .
1.•
Reporting errors and suggesting changes is normal, expected, and without
jeopardy for anyone involved.
2.• A mistake or incident is not seen as a failure but as a free lesson, an opportunity
to focus attention and to learn.
3.•Rather than making people afraid, the system makes people participants in
change and improvement.
4.•Information provided in good faith is not used against those who report it.
Most people have a genuine concern for the safety and quality of their work. If
through reporting problems they contribute to visible improvements, few other
motivations or exhortations to report are necessary. In general, empowering people
to affect their work conditions and making the reporters of safety problems part of
the change process promotes their willingness to shoulder their responsibilities and
to share information about safety problems.
Beyond the obvious safety implications, a Just Culture may improve morale, commitment to the organization, job satisfaction, and peoples willingness to do extra and to step outside their roles. It encourages people to participate in improvement efforts and
gets them actively involved in creating a safer system and workplace.
There are several reasons why people may not report safety problems, which were
covered in chapter 12. To summarize, the reporting channels may be difficult or time-consuming to use; people may feel there is no point in reporting because the organization will not do anything anyway; or they may fear negative consequences from reporting. Each of these reasons must be and can be mitigated through better system
design. Reporting should be easy and not require excessive time or effort that takes
away from direct job responsibilities. A response must be made to the initial report indicating that it was received and read, and information should later be provided about the resolution of the reported problem.
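A minimal sketch of a reporting workflow that makes these two feedback obligations explicit: acknowledgment that the report was received and read, and later notification of its resolution. The states, fields, and example values are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SafetyReport:
    reporter: str
    description: str
    submitted: datetime = field(default_factory=datetime.utcnow)
    acknowledged: bool = False
    resolution: str | None = None

    def acknowledge(self) -> str:
        """First obligation: confirm to the reporter that the report was received and read."""
        self.acknowledged = True
        return f"Report received and under review (submitted {self.submitted:%Y-%m-%d})."

    def resolve(self, action_taken: str) -> str:
        """Second obligation: tell the reporter what was done about the problem."""
        self.resolution = action_taken
        return f"Resolution of your report: {action_taken}"

# Hypothetical example of use
report = SafetyReport("line mechanic", "confusing temporary taxi route markings")
print(report.acknowledge())
print(report.resolve("temporary signage revised and crews briefed"))
```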
Promoting a Just Culture requires getting away from blame and punishment
as a solution to safety problems. One of the new assumptions in chapter 2 for an
accident model and underlying STAMP was.
Blame is the enemy of safety. Focus should instead be on understanding how the entire
system behavior led to the loss and not on who or what to blame.
Blame and punishment discourage reporting problems and mistakes so improvements can be made to the system. As has been argued throughout this book, changing the system is the best way to achieve safety, not trying to change people.
When blame is a primary component of the safety culture, people stop reporting
incidents. This basic understanding underlies the Aviation Safety Reporting System
(ASRS). where pilots and others are given protection from punishment if they report
mistakes .(see chapter 12). A decision was made in establishing the ASRS and other
aviation reporting systems that organizational and industry learning from mistakes
was more important than punishing people for them. If most errors stem from the
design of the system or can be prevented by changing the design of the system, then
blaming the person who made the mistake is misplaced anyway.
A culture of blame creates a climate of fear that makes people reluctant to share
information. It also hampers the potential to learn from incidents; people may even
tamper with safety recording devices, turning them off, for example. A culture of
blame interferes with regulatory work and the investigation of accidents because
people and organizations are less willing to cooperate. The role of lawyers can
impede safety efforts and actually make accidents more likely. Organizations may
focus on creating paper trails instead of utilizing good safety engineering practices.
Some companies avoid standard safety practices under the advice of their lawyers
that this will protect them in legal proceedings, thus almost guaranteeing that accidents and legal proceedings will occur.
Blame and the overuse of punishment as a way to change behavior can directly
lead to accidents that might not otherwise have occurred. As an example, a train accident in Japan, the 2005 Fukuchiyama line derailment, occurred when a train driver was on the phone trying to ensure that he would not be reported for a minor infraction. Because of this distraction, he did not slow down for a curve, resulting in the deaths of 106 passengers and the train driver, along with injuries to 562 passengers. Blame and punishment for mistakes cause stress and isolation and make people perform less well.
The alternative is to see mistakes as an indication of an organizational, operational, educational, or political problem. The question then becomes what should be
done about the problem and who should bear responsibility for implementing the
changes. The mistake and any harm from it should be acknowledged, but the
response should be to lay out the opportunities for reducing such mistakes by everyone (not just this particular person), and the responsibilities for making changes so
that the probability of it happening again is reduced. This approach allows people
and organizations to move forward to prevent mistakes in the future and not just
focus on punishing past behavior. Punishment is usually not a long-term deterrent for mistakes if the system in which the person operates has not changed the
reason for the mistake. Just Culture principles allow us to learn from minor incidents
instead of waiting until tragedies occur.
A common misunderstanding is that a Just Culture means a lack of accountability.
But, in reality, it is just the opposite. Accountability is increased in a Just Culture by
not simply assigning responsibility and accountability to the person at the bottom
of the safety control structure who made the direct action involved in the mistake.
All components of the safety control structure involved are held accountable, including (1) those in operations who contribute to mistakes by creating operational pressures and providing inadequate oversight to ensure safe procedures are being followed, and (2) those in development who create a system design that contributes
to mistakes.
The difference in a Just Culture is not in the accountability for safety problems
but how accountability is implemented. Punishment is an appropriate response to
gross negligence and disregard for other people's safety, which, of course, applies to
everyone in the safety control structure, including higher-level management and
developers as well as the lower level controllers. But if mistakes were made or
inadequate controls over safety provided because of flaws in the design of the controlled system or the safety control structure, then punishment is not the appropriate
response; fixing the system or the safety control structure is. Dekker has suggested
that accountability be defined in terms of responsibility for finding solutions to the
system design problems from which the mistakes arose.
Overcoming our cultural bias to punish people for their mistakes and the common
belief that punishment is the only way to change behavior can be very difficult. But
the payoff is enormous if we want to significantly reduce accident rates. Trust is a
critical requirement for encouraging people to share their mistakes and safety problems with others so something can be done before major losses occur.
section 13.2.6. Creating an Effective Safety Control Structure.
In some industries, the safety control structure is called the safety management
system (SMS). In civil aviation, ICAO (the International Civil Aviation Organization) has
created standards and recommended practices for safety management systems and
individual countries have strongly recommended or required certified air carriers
to establish such systems in order to control organizational factors that contribute
to accidents.
There is no right or wrong design of a safety control structure or SMS. Most of
the principles for design of safe control loops in chapter 9 also apply here. The
culture of the industry and the organization will play a role in what is practical and
effective. There are some general rules of thumb, however, that have been found to
be important in practice.
General Safety Control Structure Design Principles.
Making everyone responsible for safety is a well-meaning misunderstanding of
what is required. While, of course, everyone should try to behave safely and to
achieve safety goals, someone has to be assigned responsibility for ensuring that
the goals are being achieved. This lesson was learned long ago in the U.S. Intercontinental Ballistic Missile System (ICBM). Because safety was such an important consideration in building the early 1950s missile systems, safety was not assigned as a specific responsibility, but was instead considered to be everyone's responsibility.
The large number of resulting incidents, particularly those involving the interfaces
between subsystems, led to the understanding that safety requires leadership
and focus.
There needs to be assignment of responsibility for ensuring that hazardous
behaviors are eliminated or, if not possible, mitigated in design and operations.
Almost all attention during development is focused on what the system and its
components are supposed to do. System safety engineering is responsible for ensuring that adequate attention is also paid to what the system is not supposed to do
and verifying that hazardous behavior will not occur. It is this unique focus that has
made the difference in systems where safety engineering successfully identified
problems that were not found by the other engineering processes.
At the other extreme, safety efforts may be assigned to a separate group that
is isolated from critical decision making. During system development, responsibility for safety may be concentrated in a separate quality assurance group rather
than in the system engineering organization. During operations, safety may be
the responsibility of a staff position with little real power or impact on line
operations.
The danger inherent in this isolation of the safety efforts is argued repeatedly
throughout this book. To be effective, the safety efforts must have impact, and they
must be integrated into mainstream system engineering and operations.
Putting safety into the quality assurance organization is the worst place for it. For
one thing, it sets up the expectation that safety is an after-the-fact or auditing activity
only; safety must be intimately integrated into design and decision-making activities.
Safety permeates every part of development and operations. While there may be
staff positions performing safety functions that affect everyone at their level of the
organization and below, safety must be integrated into all of engineering development and line operations. Important safety functions will be performed by most
everyone, but someone needs the responsibility to ensure that they are being carried
out effectively.
At the same time, independence is also important. The CAIB report addresses this issue:
Organizations that successfully operate high-risk technologies have a major characteristic
in common: they place a premium on safety and reliability by structuring their programs
so that technical and safety engineering organizations own the process of determining,
maintaining, and waiving technical requirements with a voice that is equal to yet independent of Program Managers, who are governed by cost, schedule, and mission-accomplishment goals.
Besides associating safety with after-the-fact assurance and isolating it from system
engineering, placing it in an assurance group can have a negative impact on its
stature, and thus its influence. Assurance groups often do not have the prestige
necessary to have the influence on decision making that safety requires. A case can
be made that the centralization of system safety in quality assurance at NASA,
matrixed to other parts of the organization, was a major factor in the decline of the
safety culture preceding the Columbia loss. Safety was neither fully independent
nor sufficiently influential to prevent the loss events.
Safety responsibilities should be assigned at every level of the organization,
although they will differ from level to level. At the corporate level, system safety
responsibilities may include defining and enforcing corporate safety policy, and
establishing and monitoring the safety control structure. In some organizations that
build extremely hazardous systems, a group at the corporate or headquarters level
certifies these systems as safe for use. For example, the U.S. Navy has a Weapons
Systems Explosives Safety Review Board that assures the incorporation of explosive
safety criteria in all weapon systems by reviews conducted throughout all the system's life cycle phases. For some companies, it may be reasonable to have such a
review process at more than just the highest level.
Communication is important because safety motivated changes in one subsystem
may affect other subsystems and the system as a whole. In military procurement
groups, oversight and communication is enhanced through the use of safety working
groups. In establishing any oversight process, two extremes must be avoided: “getting
into bed” with the project and losing objectivity or backing off too far and losing
insight. Working groups are an effective way of avoiding these extremes. They assure
comprehensive and unified planning and action while allowing for independent
review and reporting channels.
Working groups usually operate at different levels of the organization. As an
example, the Navy Aegis system development, a very large and complex system,
included a System Safety Working Group at the top level chaired by the Navy Principal for Safety, with the permanent members being the prime contractor's system
safety lead and representatives from various Navy offices. Contractor representatives attended meetings as required. Members of the group were responsible for
coordinating safety efforts within their respective organizations, for reporting the
status of outstanding safety issues to the group, and for providing information to
the Navy Weapons Systems Explosives Safety Review Board. Working groups also
functioned at lower levels, providing the necessary coordination and communication
for that level and to the levels above and below.
A surprisingly large percentage of the reports on recent aerospace accidents have
implicated improper transition from an oversight to an insight process. This transition implies the use of different levels of feedback
control and a change from prescriptive management control to management by
objectives, where the objectives are interpreted and satisfied according to the local
context. For these accidents, the change in management role from oversight to
insight seems to have been implemented simply as a reduction in personnel and
budgets without assuring that anyone was responsible for specific critical tasks.
footnote. The Aegis Combat System is an advanced command and control and weapon control system that uses
powerful computers and radars to track and guide weapons to destroy enemy targets.
Assigning Responsibilities.
An important question is what responsibilities should be assigned to the control
structure components. The list below is derived from the author's experience on a
large number and variety of projects. Many also appear in accident report recommendations, particularly those generated using CAST.
The list is meant only to be a starting point for those establishing a comprehensive safety control structure and a checklist for those who already have sophisticated
safety management systems. It should be supplemented using other sources and
experiences.
The list does not imply that each responsibility will be assigned to a single person
or group. The responsibilities will probably need to be separated into multiple individual responsibilities and assigned throughout the safety control structure, with one
group actually implementing the responsibilities and others above them supervising,
leading .(directing), or overseeing the activity. Of course, each responsibility assumes
the need for associated authority and accountability plus the controls, feedback, and
communication channels necessary to implement the responsibility. The list may
also be useful in accident and incident analysis to identify inadequate controls and
control structures.
Management and General Responsibilities.
1.•Provide leadership, oversight, and management of safety at all levels of the
organization.
2.•Create a corporate or organizational safety policy. Establish criteria for evaluating safety-critical decisions and implementing safety controls. Establish distribution channels for the policy. Establish feedback channels to determine
whether employees understand it, are following it, and whether it is effective.
Update the policy as needed.
3.•Establish corporate or organizational safety standards and then implement,
update, and enforce them. Set minimum requirements for safety engineering
in development and operations and oversee the implementation of those
requirements. Set minimum physical and operational standards for hazardous
operations.
4.•Establish incident and accident investigation standards and ensure recommendations are implemented and effective. Use feedback to improve the standards.
5.•Establish management of change requirements for evaluating all changes for
their impact on safety, including changes in the safety control structure. Audit
the safety control structure for unplanned changes and migration toward states
of higher risk.
6.•Create and monitor the organizational safety control structure. Assign responsibility, authority, and accountability for safety.
7.•Establish working groups.
8.•Establish robust and reliable communication channels to ensure that management has accurate risk awareness of the design of the system being developed and of the state of the operating process.
9.•Provide physical and personnel resources for safety-related activities. Ensure
that those performing safety-critical activities have the appropriate skills,
knowledge, and physical resources.
10.•Create an easy-to-use problem reporting system and then monitor it for needed
changes and improvements.
11.•Establish safety education and training for all employees and establish feedback channels to determine whether it is effective along with processes for
continual improvement. The education should include reminders of past
accidents and causes and input from lessons learned and trouble reports.
Assessment of effectiveness may include information obtained from knowledge
assessments during audits.
12.•Establish organizational and management structures to ensure that safety-related technical decision making is independent from programmatic considerations, including cost and schedule.
13.•Establish defined, transparent, and explicit resolution procedures for conflicts
between safety-related technical decisions and programmatic considerations.
Ensure that the conflict resolution procedures are being used and are
effective.
14.•Ensure that those who are making safety-related decisions are fully informed
and skilled. Establish mechanisms to allow and encourage all employees and
contractors to contribute to safety-related decision making.
15.•Establish an assessment and improvement process for safety-related decision
making.
16.•Create and update the organizational safety information system.
17.•Create and update safety management plans.
18.•Establish communication channels, resolution processes, and adjudication procedures for employees and contractors to surface complaints and concerns
about the safety of the system or parts of the safety control structure that are
not functioning appropriately. Evaluate the need for anonymity in reporting
concerns.
Development.
1.•Implement special training for developers and development managers in safety-guided design and other necessary skills. Update this training as events occur
and more is learned from experience. Create feedback, assessment, and improvement processes for the training.
2.•Create and maintain the hazard log.
3.•Establish working groups.
4.•Design safety into the system using system hazards and safety constraints.
Iterate and refine the design and the safety constraints as the design process
proceeds. Ensure the system design includes consideration of how to reduce
human error.
5.•Document operational assumptions, safety constraints, safety-related design
features, operating assumptions, safety-related operational limitations, training
and operating instructions, audits and performance assessment requirements,
operational procedures, and safety verification and analysis results. Document
both what and why, including tracing between safety constraints and the design
features to enforce them.
6.•Perform high-quality and comprehensive hazard analyses to be available
and usable when safety-related decisions need to be made, starting with early
decision making and continuing through the system's life. Ensure that the
hazard analysis results are communicated in a timely manner to those who need
them. Establish a communication structure that allows communication downward, upward, and sideways (i.e., among those building subsystems). Ensure
that hazard analyses are updated as the design evolves and test experience is
acquired.
7.• Train engineers and managers to use the results of hazard analyses in their
decision making.
8.•Maintain and use hazard logs and hazard analyses as experience with the
system is acquired. Ensure communication of safety-related requirements and
constraints to everyone involved in development.
9.•Gather lessons learned in operations (including accident and incident reports) and use them to improve the development processes. Use operating
experience to identify flaws in the development safety controls and implement
improvements.
Operations.
1.•Develop special training for operators and operations management to create
needed skills and update this training as events occur and more is learned from
experience. Create feedback, assessment, and improvement processes for this
training. Train employees to perform their jobs safely, understand proper use
of safety equipment, and respond appropriately in an emergency.
2.•Establish working groups.
3.•Maintain and use hazard logs and hazard analyses during operations as experience is acquired.
4.•Ensure all emergency equipment and safety devices are operable at all times
during hazardous operations. Before safety-critical, nonroutine, potentially hazardous operations are started, inspect all safety equipment to ensure it is operational, including the testing of alarms.
5.•Perform an in-depth investigation of any operational anomalies, including
hazardous conditions (such as water in a tank that will contain chemicals that react to water) or events. Determine why they occurred before any
potentially dangerous operations are started or restarted. Provide the training
necessary to do this type of investigation and proper feedback channels to
management.
6.•Create management of change procedures and ensure they are being followed.
These procedures should include hazard analyses on all proposed changes and
approval of all changes related to safety-critical operations. Create and enforce
policies about disabling safety-critical equipment.
7.•Perform safety audits, performance assessments, and inspections using the
hazard analysis results as the preconditions for operations and maintenance.
Collect data to ensure safety policies and procedures are being followed and
that education and training about safety is effective. Establish feedback channels for leading indicators of increasing risk.
8.•Use the hazard analysis and documentation created during development and
passed to operations to identify leading indicators of migration toward states
of higher risk. Establish feedback channels to detect the leading indicators and
respond appropriately.
9.•Establish communication channels from operations to development to pass
back information about operational experience.
10.•Perform in-depth incident and accident investigations, including all systemic
factors. Assign responsibility for implementing all recommendations. Follow
up to determine whether recommendations were fully implemented and
effective.
11.•Perform independent checks of safety-critical activities to ensure they have
been done properly.
12.•Prioritize maintenance for identified safety-critical items. Enforce maintenance
schedules.
13.•Create and enforce policies about disabling safety-critical equipment and
making changes to the physical system.
14.•Create and execute special procedures for the startup of operations in a previously shutdown unit or after maintenance activities.
15.•Investigate and reduce the frequency of spurious alarms.
16.•Clearly mark malfunctioning alarms and gauges. In general, establish procedures for communicating information about all current malfunctioning
equipment to operators and ensure the procedures are being followed. Eliminate all barriers to reporting malfunctioning equipment.
17.•Define and communicate safe operating limits for all safety-critical equipment
and alarm procedures. Ensure that operators are aware of these limits. Assure
that operators are rewarded for following the limits and emergency procedures,
even when it turns out no emergency existed. Provide for tuning the operating
limits and alarm procedures over time as required.
18.•Ensure that spare safety-critical items are in stock or can be acquired quickly.
19.•Establish communication channels to plant management about all events and
activities that are safety-related. Ensure management has the information and
risk awareness they need to make safe decisions about operations.
20.•Ensure emergency equipment and response is available and operable to treat
injured workers.
21.•Establish communication channels to the community to provide information
about hazards and necessary contingency actions and emergency response
requirements.
section 13.2.7. The Safety Information System.
The safety information system is a critical component in managing safety. It acts as
a source of information about the state of safety in the controlled system so that
controllers process models can be kept accurate and coordinated, resulting in better
decision making. Because it in essence acts as a shared process model or a source
for updating individual process models, accurate and timely feedback and data are
important. After studying organizations and accidents, Kjellan concluded that an
effective safety information system ranked second only to top management concern
about safety in discriminating between safe and unsafe companies matched on other
variables.
Setting up a long-term information system can be costly and time consuming, but
the savings in terms of losses prevented will more than make up for the effort. As
an example, a Lessons Learned Information System was created at Boeing for commercial jet transport structural design and analysis. The time constants are large in
this industry, but they finally were able to validate the system after using it in the
design of the 757 and 767. A tenfold reduction in maintenance costs due to corrosion and fatigue was attributed to the use of recorded lessons learned from
past designs. All the problems experienced in the introduction of new carbon-fiber
aircraft structures like the B787 show how valuable such learning from the past can
be and the problems that result when it does not exist.
Lessons learned information systems in general are often inadequate to meet
the requirements for improving safety: collected data may be improperly filtered
and thus inaccurate, methods may be lacking for the analysis and summarization
of causal data, information may not be available to decision makers in a form
that is meaningful to them, and such long-term information system efforts
may fail to survive after the original champions and initiators move on to different
projects and management does not provide the resources and leadership to
continue the efforts. Often, lots of information is collected about occupational
safety because it is required for government reports but less for engineering
safety.
Setting up a safety information system for a single project or product may be
easier. The effort starts in the development process and then is passed on for use in
operations. The information accumulated during the safety-driven design process
provides the baseline for operations, as described in chapter 12. For example, the
identification of critical items in the hazard analysis can be used as input to the
maintenance process for prioritization. Another example is the use of the assumptions underlying the hazard analysis to guide the audit and performance assessment
process. But first the information needs to be recorded and easily located and used
by operations personnel.
In general, the safety information system includes:
1.• A safety management plan (for both development and operations)
2.• The status of all safety-related activities
3.• The safety constraints and assumptions underlying the design, including operational limitations
4.• The results of the hazard analyses (hazard logs) and performance audits and
assessments
5.• Tracking and status information on all known hazards
6.•Incident and accident investigation reports and corrective actions taken
7.•Lessons learned and historical information
8.• Trend analysis
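To make these contents more concrete, the sketch below shows, in Python, what a single hazard-log entry in such a system might hold; the field names and structure are hypothetical, chosen only for illustration rather than taken from any particular safety information system.

from dataclasses import dataclass, field
from typing import List

@dataclass
class HazardLogEntry:
    # Illustrative sketch of one hazard-log record; all field names are assumptions.
    hazard_id: str                       # a project-assigned identifier
    description: str                     # statement of the hazard
    safety_constraints: List[str]        # constraints the design must enforce
    design_assumptions: List[str]        # assumptions to be audited during operations
    operational_limitations: List[str]   # limits passed on to operations
    status: str = "open"                 # open, mitigated, or closed
    leading_indicators: List[str] = field(default_factory=list)  # signals of migration toward higher risk
    audit_findings: List[str] = field(default_factory=list)      # results of audits and assessments

def unmonitored_assumptions(entries: List[HazardLogEntry]) -> List[str]:
    # Flag hazards whose design assumptions have no leading indicator to watch
    # during operations, supporting the audit use described earlier.
    return [e.hazard_id for e in entries
            if e.design_assumptions and not e.leading_indicators]

A query like unmonitored_assumptions is only a sketch of the idea of using the hazard analysis assumptions to guide audits and performance assessments, not a prescribed implementation.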
One of the first components of the safety information system for a particular project
or product is a safety program plan. This plan describes the objectives of the program
and how they will be achieved. In addition to other things, the plan provides a
baseline to evaluate compliance and progress. While the organization may have a
general format and documented expectations for safety management plans, this
template may need to be tailored for specific project requirements. The plan should
include review procedures for the plan itself as well as how the plan will be updated
and improved through feedback from experience.
All of the information in the safety information system will probably not be in
one document, but there should be a central location containing pointers to where
all the information can be found. Chapter 12 contains a list of what should be in an
operations safety management plan. The overall safety management plan will
contain similar information with some additions for development.
When safety information is being shared among companies or with regulatory
agencies, there needs to be protection from disclosure and use of proprietary data
for purposes other than safety improvement.
section 13.2.8. Continual Improvement and Learning.
Processes and structures need to be established to allow continual improvement and
learning. Experimentation is an important part of the learning process, and trying
new ideas and approaches to improving safety needs to be allowed and even
encouraged.
In addition, accidents and incidents should be treated as opportunities for learning and investigated thoroughly, as described in chapter 11. Learning will be inhibited if a thorough understanding of the systemic factors involved is not sought.
Simply identifying the causal factors is not enough: recommendations to
eliminate or control these factors must be created along with concrete plans for
implementing the recommendations. Feedback loops are necessary to ensure that
the recommendations are implemented in a timely manner and that controls are
established to detect and react to reappearance of those same causal factors in
the future.
section 13.2.9. Education, Training, and Capability Development.
If employees understand the intent of the safety program and commit to it, they are more likely to comply with its intent rather than simply following the rules only when it is convenient to do so.
Some properties of effective training programs are presented in chapter 12.
Everyone involved in controlling a potentially dangerous process needs to have
safety training, not just the low-level controllers or operators. The training must
include not only information about the hazards and safety constraints to be
implemented in the control structure and the safety controls, but also about priorities and how decisions about safety are to be made.
One interesting option is to have managers serve as teachers. In this education program design, training experts help manage group dynamics and curriculum
development, but the training itself is delivered by the project leaders. Ford Motor
Company used this approach as part of what they term their Business Leadership
Initiative and have since extended it as part of the Safety Leadership Initiative. They
found that employees pay more attention to a message delivered by their boss than
by a trainer or safety official. By learning to teach the materials, supervisors and
managers are also more likely to absorb and practice the key principles.
section 13.3. Final Thoughts.
Management is key to safety. Top-level management sets the culture, creates the
safety policy, and establishes the safety control structure. Middle management
enforces safe behavior through the designed controls.
Most people want to run safe organizations, but they may misunderstand the
tradeoffs required and how to accomplish the goals. This chapter and the book as a
whole have tried to correct misperceptions and provide advice on how to create
safer products and organizations. The next chapter provides a real-life example of
a successful systems approach to safety.

550
chapter14.raw Normal file

@ -0,0 +1,550 @@
chapter 14.
SUBSAFE: An Example of a Successful Safety
Program.
This book is filled with examples of accidents and of what not to do. One possible
conclusion might be that despite our best efforts accidents are inevitable in
complex systems. That conclusion would be wrong. Many industries and companies
are able to avoid accidents: the nuclear Navy SUBSAFE program is a shining
example. By any measure, SUBSAFE has been remarkably successful: In nearly
fifty years since the beginning of SUBSAFE, no submarine in the program has
been lost.
Looking at a successful safety program and trying to understand why it has been
successful can be very instructive. This chapter looks at the history of the program
and what it is, and proposes some explanations for its great success. SUBSAFE also
provides a good example of most of the principles expounded in this book.
Although SUBSAFE exists in a government and military environment, most of
the important components could be translated into the commercial, profit-making
world. Also note that the success is not related to small size—there are 40,000
people involved in the U.S. submarine safety program, a large percentage of whom
are private contractors and not government employees. Both private and public
shipyards are involved. SUBSAFE is distributed over large parts of the United
States, although mostly on the coasts (for obvious reasons). Five submarine classes
are included, as well as worldwide naval operations.
footnote. I am particularly grateful to Rear Admiral Walt Cantrell, Al Ford, and Commander Jim Hassett for
their insights on and information about the SUBSAFE program.
section 14.1.
History.
The SUBSAFE program was created after the loss of the nuclear submarine
Thresher. The USS Thresher was the first ship of her class and the leading edge of
U.S. submarine technology, combining nuclear power with modern hull design and
newly designed equipment and components. On April 10, 1963, while performing a
deep test dive approximately two hundred miles off the northeastern coast of the
United States, the USS Thresher was lost at sea with all persons aboard: 112 naval
personnel and 17 civilians died.
The head of the U.S. nuclear Navy, Admiral Hyman Rickover, gathered his staff
after the Thresher loss and ordered them to design a program that would ensure
such a loss never happened again. The program was to be completed by June and
operational by that December. To date, that goal has been achieved. Between 1915
and 1963, the U.S. had lost fifteen submarines to noncombat causes, an average of
one loss every three years, with a total of 454 casualties. Thresher was the first
nuclear submarine lost, the worst submarine disaster in history in terms of lives lost
(figure 14.1).
SUBSAFE was established just fifty-four days after the loss of Thresher. It was
created on June 3, 1963, and the program requirements were issued on December
20 of that same year. Since that date, no SUBSAFE-certified submarine has ever
been lost.
One loss did occur in 1968—the USS Scorpion—but it was not SUBSAFE certi-
fied. In a rush to get Scorpion ready for service after it was scheduled for a major
overhaul in 1967, the Chief of Naval Operations allowed a reduced overhaul process
and deferred the required SUBSAFE inspections. The design changes deemed nec-
essary after the loss of Thresher were not made, such as newly designed central valve
control and emergency blow systems, which had not operated properly on Thresher.
Cold War pressures prompted the Navy to search for ways to reduce the duration
of overhauls. By not following SUBSAFE requirements, the Navy reduced the time
Scorpion was out of commission.
In addition, the high quality of the submarine components required by SUBSAFE,
along with intensified structural inspections, had reduced the availability of critical
parts such as seawater piping [8]. A year later, in May 1968, Scorpion was lost at
sea. Although some have attributed its loss to a Soviet attack, a later investigation
of the debris field revealed the most likely cause of the loss was one of its own
torpedoes exploding inside the torpedo room [8]. After the Scorpion loss, the need
for SUBSAFE was reaffirmed and accepted.
The rest of this chapter outlines the SUBSAFE program and provides some
hypotheses to explain its remarkable success. The reader will notice that much
of the program rests on the same systems thinking fundamentals advocated in
this book.
Details of the Thresher Loss.
The accident was thoroughly investigated including, to the Navy's credit, the sys-
temic factors as well as the technical failures and deficiencies. Deep sea photogra-
phy, recovered artifacts, and an evaluation of the Thresher's design and operational
history led a court of inquiry to conclude that the failure of a deficient silver-braze
joint in a salt water piping system, which relied on silver brazing instead of welding,
led to flooding in the engine room. The crew was unable to access vital equipment
to stop the flooding. As a result of the flooding, saltwater spray on the electrical
components caused short circuits, shutdown of the nuclear reactor, and loss of pro-
pulsion. When the crew attempted to blow the main ballast tanks in order to surface,
excessive moisture in the air system froze, causing a loss of airflow and inability
to surface.
The accident report included recommendations to fix the design problems, for
example, to add high-pressure air compressors to permit the emergency blow
system to operate properly. The finding that there were no centrally located isola-
tion valves for the main and auxiliary seawater systems led to the use of flood-
control levers that allowed isolation valves to be closed remotely from a central
panel.
Most accident analyses stop at this point, particularly in that era. To their credit,
however, the investigation continued and looked at why the technical deficiencies
existed, that is, the management and systemic factors involved in the loss. They found
deficient specifications, deficient shipbuilding practices, deficient maintenance prac-
tices, inadequate documentation of construction and maintenance actions, and defi-
cient operational procedures. With respect to documentation, there appeared to be
incomplete or no records of the work that had been done on the submarine and the
critical materials and processes used.
As one example, Thresher had about three thousand silver-brazed pipe joints
exposed to full pressure when the submarine was submerged. During her last ship-
yard maintenance, 145 of these joints were inspected on a “not-to-delay” vessel basis
using what was then the new technique called ultrasonic testing. Fourteen percent
of the 145 joints showed substandard joint integrity. Extrapolating these results to
the entire complement of three thousand joints suggests that more than four hundred
joints could have been substandard. The ship was allowed to go to sea in this con-
dition. The Thresher loss investigators looked at whether the full scope of the joint
problem had been determined and what rationale could have been used to allow
the ship to sail without fixing the joints.
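As a rough check of that extrapolation, assuming the 145 inspected joints were representative of the full complement of joints:

0.14 × 3,000 joints ≈ 420 joints

which is consistent with the estimate of more than four hundred potentially substandard joints.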
One of the conclusions of the accident investigation was that Navy risk manage-
ment practices had not advanced as fast as submarine capability.
section 14.2. SUBSAFE Goals and Requirements.
A decision was made in 1963 to concentrate the SUBSAFE program on the essen-
tials, and a program was designed to provide maximum reasonable assurance of two
things:
1.• Watertight integrity of the submarine's hull.
2.• Operability and integrity of critical systems to control and recover from a flooding hazard.
By being focused, the SUBSAFE program does not spread or dilute its focus beyond
this stated purpose. For example, mission assurance is not a focus of SUBSAFE,
although it benefits from it. Similarly, fire safety, weapons safety, occupational health
and safety, and nuclear reactor systems safety are not in SUBSAFE. These addi-
tional concerns are handled by regular System Safety programs and mission assur-
ance activities focused on the additional hazards. In this way, the extra rigor required
by SUBSAFE is limited to those activities that ensure U.S. submarines can surface
and return to port safely in an emergency, making the program more acceptable and
practical than it might otherwise be.
SUBSAFE requirements, as documented in the SUBSAFE manual, permeate the
entire submarine community. These requirements are invoked in design, construc-
tion, operations, and maintenance and cover the following aspects of submarine
development and operations:
1.• Administrative
2.• Organizational
3.• Technical
4.• Unique design
5.• Material control
6.• Fabrication
7.• Testing
8.• Work control
9.• Audits
10.• Certification
These requirements are invoked in design contracts, construction contracts, overhaul
contracts, the fleet maintenance manual and spare parts procurement specifications,
and so on.
Notice that the requirements encompass not only the technical aspects of the
program but the administrative and organizational aspects as well. The program
requirements are reviewed periodically and renewed when deemed necessary. The
Submarine Safety Working Group, consisting of the SUBSAFE Program Directors
from all SUBSAFE facilities around the country, convenes twice a year to discuss
program issues of mutual concern. This meeting often leads to changes and improve-
ments to the program.
section 14.3. SUBSAFE Risk Management Fundamentals.
SUBSAFE is founded on a basic set of risk management principles, both technical
and cultural. These fundamentals are:
• Work discipline: Knowledge of and compliance with requirements
•Material control: The correct material installed correctly
•Documentation: (1) Design products (specifications, drawings, maintenance
standards, system diagrams, etc.), and (2) objective quality evidence (defined
later)
•Compliance verification: Inspections, surveillance, technical reviews, and audits
•Learning from inspections, audits, and nonconformances
These fundamentals, coupled with a questioning attitude and what those in
SUBSAFE term a chronic uneasiness, are credited for SUBSAFE success. The fun-
damentals are taught and embraced throughout the submarine community. The
members of this community believe that it is absolutely critical that they do not
allow themselves to drift away from the fundamentals.
The Navy, in particular, expends a lot of effort in assuring compliance verification
with the SUBSAFE requirements. A common saying in this community is, “Trust
everybody, but check up.” Whenever a significant issue arises involving compliance
with SUBSAFE requirements, including material defects, system malfunctions, defi-
cient processes, equipment damage, and so on, the Navy requires that an initial
report be provided to Naval Sea Systems Command (NAVSEA) headquarters
within twenty-four hours. The report must describe what happened and must contain
preliminary information concerning apparent root cause(s) and immediate correc-
tive actions taken. Beyond providing the information to prevent recurrence, this
requirement also demonstrates top management commitment to safety and the
SUBSAFE program.
In addition to the technical and managerial risk management fundamentals listed
earlier, SUBSAFE also has cultural principles built into the program:
1.• A questioning attitude
2.•Critical self-evaluation
3.•Lessons learned and continual improvement
4.•Continual training
5.•Separation of powers (a management structure that provides checks and bal-
ances and assures appropriate attention to safety)
As is the case with most risk management programs, the foundation of SUBSAFE
is the personal integrity and responsibility of those individuals who are involved in
the program. The cement bonding this foundation is the selection, training, and
cultural mentoring of those individuals who perform SUBSAFE work. Ultimately,
these people attest to their adherence to technical requirements by documenting
critical data, parameters, and statements, and by providing their personal signature verifying that work
has been properly completed.
section 14.4.
Separation of Powers.
SUBSAFE has created a unique management structure they call separation of
powers or, less formally, the three-legged stool (figure 14.2). This structure is the
cornerstone of the SUBSAFE program. Responsibility is divided among three dis-
tinct entities providing a system of checks and balances.
The new construction and in-service Platform Program Managers are responsible
for the cost, schedule, and quality of the ships under their control. To ensure that
safety is not traded off under cost and schedule pressures, the Program Managers
can only select from a set of acceptable design options. The Independent Technical
Authority has the responsibility to approve those acceptable options.
The third leg of the stool is the Independent Safety and Quality Assurance
Authority. This group is responsible for administering the SUBSAFE program and
for enforcing compliance. It is staffed by engineers with the authority to question
and challenge the Independent Technical Authority and the Program Managers on
their compliance with SUBSAFE requirements.
The Independent Technical Authority (ITA) is responsible for establishing and
assuring adherence to technical standards and policy. More specifically, they:
1.•Set and enforce technical standards.
2.•Maintain technical subject matter expertise.
3.• Assure safe and reliable operations.
4.•Ensure effective and efficient systems engineering.
5.•Make unbiased, independent technical decisions.
6.•Provide stewardship of technical and engineering capabilities.
Accountability is important in SUBSAFE and the ITA is held accountable for
exercising these responsibilities.
This management structure only works because of support from top manage-
ment. When Program Managers complain that satisfying the SUBSAFE require-
ments will make them unable to satisfy their program goals and deliver new
submarines, SUBSAFE requirements prevail.
section 14.5.
Certification.
In 1963, a SUBSAFE certification boundary was defined. Certification focuses on
the structures, systems, and components that are critical to the watertight integrity
and recovery capability of the submarine.
Certification is also strictly based on what the SUBSAFE program defines as
Objective Quality Evidence (OQE). OQE is defined as any statement of fact, either
quantitative or qualitative, pertaining to the quality of a product or service, based
on observations, measurements, or tests that can be verified. Probabilistic risk assess-
ment, which usually cannot be verified, is not used.
OQE is evidence that deliberate steps were taken to comply with requirements.
It does not matter who did the work or how well they did it: if there is no OQE,
then there is no basis for certification.
The goal of certification is to provide maximum reasonable assurance through
the initial SUBSAFE certification and by maintaining certification throughout the
submarines life. SUBSAFE inculcates the basic STAMP assumption that systems
change throughout their existence. SUBSAFE certification is not a one-time activity
but has to be maintained over time: SUBSAFE certification is a process, not just a
final step. This rigorous process structures the construction program through a speci-
fied sequence of events leading to formal authorization for sea trials and delivery
to the Navy. Certification then applies to the maintenance and operations programs
and must be maintained throughout the life of the ship.
section 14.5.1. Initial Certification.
Initial certification is separated into four elements (figure 14.3):
1. Design certification: Design certification consists of design product approval
and design review approval, both of which are based on OQE. For design
product approval, the OQE is reviewed to confirm that the appropriate techni-
cal authority has approved the design products, such as the technical drawings.
Most drawings are produced by the submarine design yard. Approval may be
given by the Navy's Supervisor of Shipbuilding, which administers and oversees the contract at each of the private shipyards, or, in some cases, NAVSEA may act as the review and approval technical authority. Design
approval is considered complete only after the proper technical authority has
reviewed the OQE and at that point the design is certified.
2. Material certification: After the design is certified, the material procured to
build the submarine must meet the requirements of that design. Technical
specifications must be embodied in the purchase documents. Once the material
is received, it goes through a rigorous receipt inspection process to confirm
and certify that it meets the technical specifications. This process usually
involves examining the vendor-supplied chemical and physical OQE for the
material. Records of chemical assay results, heat treatment applied to the mate-
rial, and nondestructive testing conducted on the material constitute OQE.
3. Fabrication certification: Once the certified material is obtained, the next
step is fabrication where industrial processes such as machining, welding, and
assembly are used to construct components, systems, and ships. OQE is used
to document the industrial processes. Separately, and prior to actual fabrication
of the final product, the facility performing the work is certified in the indus-
trial processes necessary to perform the work. An example is a specific
high-strength steel welding procedure. In addition to the weld procedure, the
individual welder using this particular process in the actual fabrication receives
documented training and successfully completes a formal qualification in the
specific weld procedure to be used. Other industrial processes have similar
certification and qualification requirements. In addition, steps are taken to
ensure that the measurement devices, such as temperature sensors, pressure
gauges, torque wrenches, micrometers, and so on, are included in a robust
calibration program at the facility.
4. Testing certification: Finally, a series of tests is used to prove that the assem-
bly, system, or ship meets design parameters. Testing occurs throughout the
fabrication of a submarine, starting at the component level and continuing
through system assembly, final assembly, and sea trials. The material and com-
ponents may receive any of the typical nondestructive tests, such as radiogra-
phy, magnetic particle, and representative tests. Systems are also subjected to
strength testing and operational testing. For certain components, destructive
tests are performed on representative samples.
Each of these certification elements is defined by detailed, documented SUBSAFE
requirements.
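To illustrate how the four elements combine, the following sketch, written in Python with invented names that are not SUBSAFE terminology, models certification as possible only when every element is backed by verifiable OQE and has been approved by the proper technical authority.

from dataclasses import dataclass
from typing import Dict

# The four initial-certification elements described above.
ELEMENTS = ("design", "material", "fabrication", "testing")

@dataclass
class ElementRecord:
    oqe_on_file: bool   # verifiable Objective Quality Evidence exists
    approved: bool      # reviewed and approved by the proper technical authority

def may_certify(records: Dict[str, ElementRecord]) -> bool:
    # No OQE means no basis for certification, regardless of who did the work.
    return all(
        e in records and records[e].oqe_on_file and records[e].approved
        for e in ELEMENTS
    )

This is only a schematic of the gating logic; the real process rests on detailed requirements, reviews, and audits rather than a single check.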
At some point near the end of the new construction period, usually lasting five
or so years, every submarine obtains its initial SUBSAFE certification. This process
is very formal and preceded by scrutiny and audit conducted by the shipbuilder, the
supervising authority, and finally, by a NAVSEA Certification Audit Team assem-
bled and led by the Office of Safety and Quality Assurance at NAVSEA. The initial
certification is ultimately granted at the flag officer level.
section 14.5.2. Maintaining Certification.
After the submarine enters the fleet, SUBSAFE certification must be maintained
through the life of the ship. Three tools are used: the Reentry Control (REC) Process,
the Unrestricted Operations Maintenance Requirements Card (URO MRC)
program, and the audit program.
The Reentry Control (REC) process carefully controls work and testing within
the SUBSAFE boundary, that is, the structures, systems, and components that are
critical to the watertight integrity and recovery capability of the submarine. The
purpose of REC is to provide maximum reasonable assurance that the areas dis-
turbed have been restored to their fully certified condition. The procedures used
provide an identifiable, accountable, and auditable record of the work performed.
REC control procedures have three goals: (1) to maintain work discipline by
identifying the work to be performed and the standards to be met, (2) to establish
personal accountability by having the responsible personnel sign their names on the
reentry control document, and (3) to collect the OQE needed for maintaining
certification.
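Purely as an illustrative sketch, with invented field names rather than actual SUBSAFE documentation formats, a reentry-control record can be pictured as capturing those three goals directly:

from dataclasses import dataclass, field
from typing import List

@dataclass
class ReentryControlRecord:
    # Hypothetical structure; each field maps to one of the three REC goals.
    work_description: str                                    # goal 1: the work to be performed
    applicable_standards: List[str]                          # goal 1: the standards to be met
    signatures: List[str] = field(default_factory=list)      # goal 2: responsible personnel sign by name
    oqe_references: List[str] = field(default_factory=list)  # goal 3: OQE collected for certification

    def is_auditable(self) -> bool:
        # The record supports maintaining certification only when accountability
        # and evidence are both present.
        return bool(self.signatures) and bool(self.oqe_references)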
The second process, the Unrestricted Operations Maintenance Requirements
Card (URO MRC) program, involves periodic inspections and tests of critical
items to ensure they have not degraded to an unacceptable level due to use, age,
or environment. In fact, URO MRC did not originate with SUBSAFE, but was
developed to extend the operating cycle of USS Queenfish by one year in 1969. It
now provides the technical basis for continued unrestricted operation of subma-
rines to test depth.
The third aspect of maintaining certification is the audit program. Because the
audit process is used for more general purposes than simply maintaining certifica-
tion, it is considered in a separate section.
section 14.6. Audit Procedures and Approach.
Compliance verification in SUBSAFE is treated as a process, not just one step in a
process or program. The Navy demands that each Navy facility participate fully in
the process, including the use of inspection, surveillance, and audits to confirm their
own compliance. Audits are used to verify that this process is working. They are
conducted either at fixed intervals or when a specific condition is found to exist that
needs attention.
Audits are multi-layered: they exist at the contractor and shipyard level, at the
local government level, and at Navy headquarters. Using the terminology adopted
in this book, responsibilities are assigned to all the components of the safety control
structure as shown in figure 14.4. Contractors and shipyard responsibilities include
implementing specified SUBSAFE requirements, establishing processes for control-
ling work, establishing processes to verify compliance and certify its own work, and
presenting the certification OQE to the local government oversight authority. The
processes established to verify compliance and certify their work include a quality
management system, surveillance, inspections, witnessing critical contractor work
(contractor quality assurance), and internal audits.
Local government oversight responsibilities include surveillance, inspections,
assuring quality, witnessing critical contractor work, audits of the contractor,
and certifying the work of the contractor to Navy headquarters.
The responsibilities of Navy headquarters include establishing and specifying
SUBSAFE requirements, verifying compliance with the requirements, and provid-
ing SUBSAFE certification for each submarine. Compliance is verified through two
types of audits: (1) ship-specific and (2) functional or facility audits.
A ship-specific audit looks at the OQE associated with an individual ship to
ensure that the material condition of that submarine is satisfactory for sea trial and
unrestricted operations. This audit represents a significant part of the certification
process that a submarine's condition meets SUBSAFE requirements and is safe to
go to sea.
Functional or facility audits (such as contractors or shipyards) include reviews
of policies, procedures, and practices to confirm compliance with the SUBSAFE
program requirements, the health of processes, and the capability of producing
certifiable hardware or design products.
Both types of audits are carried out with structured audit plans and qualified
auditors.
The audit philosophy is part of the reason for SUBSAFE success. Audits are
treated as a constructive, learning experience. Audits start from the assumption
that policies, procedures, and practices are in compliance with requirements. The
goal of the audit is to confirm that compliance. Audit findings must be based
on a clear violation of requirements or must be identified as an “operational
improvement.”
The objective of audits is “to make our submarines safer,” not to evaluate indi-
vidual performance or to assign blame. Note the use of the word “our”: the SUBSAFE
program emphasizes common safety goals and group effort to achieve them. Every-
one owns the safety goals and is assumed to be committed to them and working to
the same purpose. SUBSAFE literature and training talks about those involved as
being part of a “very special family of people who design, build, maintain, and
operate our nations submarines.”
To this end, audits are a peer review. A typical audit team consists of twenty to
thirty people with approximately 80 percent of the team coming from various
SUBSAFE facilities around the country and the remaining 20 percent coming from
NAVSEA headquarters. An audit is considered a team effort—the facility being
audited is expected to help the audit team make the audit report as accurate and
meaningful as possible.
Audits are conducted under rules of continuous communication—when a problem
is found, the emphasis is on full understanding of the identified problem as well as
identification of potential solutions. Deficiencies are documented and adjudicated.
Contentious issues sometimes arise, but an attempt is made to resolve them during
the audit process.
A significant byproduct of a SUBSAFE audit is the learning experience it pro-
vides to the auditors as well as those being audited. Expected results include cross-
pollination of successful procedures and process improvements. The rationale
behind having SUBSAFE participants on the audit team is not only their under-
standing of the SUBSAFE program and requirements, but also their ability to learn
from the audits and apply that learning to their own SUBSAFE groups.
The current audit philosophy is a product of experience and learning. Before
1986, only ship-specific audits were conducted, not facility or headquarters audits.
In 1986, it was determined that the program had become complacent, assuming that once a facility had passed an audit, a follow-up audit would produce no findings. They also decided that the ship-specific audits were not rigor-
ous or complete enough. In STAMP terms, only the lowest level of the safety control
structure was being audited and not the other components. After that time, biennial
audits were conducted at all levels of the safety control structure, even the highest
levels of management. A biennial NAVSEA internal audit gives the field activities
a chance to evaluate operations at headquarters. Headquarters personnel must be
willing to accept and resolve audit findings just like any other member of the nuclear
submarine community.
One lesson learned has been that developing a robust compliance verification
program is difficult. Along the way they learned that (1) clear ground rules for audits
must be established, communicated, and adhered to; (2) it is not possible to “audit
in” requirements; and (3) the compliance verification organization must be equal
with the program managers and the technical authority. In addition, they determined
that not just anyone can do SUBSAFE work. The number of activities authorized
to perform SUBSAFE activities is strictly controlled.
section 14.7. Problem Reporting and Critiques.
SUBSAFE believes that lessons learned are integral to submarine safety and puts
emphasis on problem reporting and critiques. Significant problems are defined as
those that affect ship safety, cause significant damage to the ship or its equipment,
delay ship deployment or incur substantial cost increase, or involve severe personnel
injury. Trouble reports are prepared for all significant problems encountered in
the construction, repair, and maintenance of naval ships. Systemic problems and
issues that constitute significant lessons learned for other activities can also be
identified by trouble reports. Critiques are similar to trouble reports and are utilized
by the fleet.
Trouble reports are distributed to all SUBSAFE responsible activities and are
used to report significant problems to NAVSEA. NAVSEA evaluates the reports to
identify SUBSAFE program improvements.
section 14.8. Challenges.
The leaders of SUBSAFE consider their biggest challenges to be:
•Ignorance: The state of not knowing;
•Arrogance: Behavior based on pride, self-importance, conceit, or the assumption
of intellectual superiority and the presumption of knowledge that is not
supported by facts; and
•Complacency: Satisfaction with one's accomplishments accompanied by a
lack of awareness of actual dangers or deficiencies.
Combating these challenges is a “constant struggle every day” [69]. Many features
of the program are designed to control these challenges, particularly training and
education.
section 14.9. Continual Training and Education.
Continual training and education are a hallmark of SUBSAFE. The goals are to:
1.•Serve as a reminder of the consequences of complacency in one's job.
2.•Emphasize the need to proactively correct and prevent problems.
3.•Stress the need to adhere to program fundamentals.
4.•Convey management support for the program.
Continual improvement and feedback to the SUBSAFE training programs
comes not only from trouble reports and incidents but also from the level of knowl-
edge assessments performed during the audits of organizations that perform
SUBSAFE work.
Annual training is required for all headquarters SUBSAFE workers, from the
apprentice craftsman to the admirals. A periodic refresher is also held at each of the
contractors' facilities. At the meetings, a video about the loss of Thresher is shown
and an overview of the SUBSAFE program and their responsibilities is provided as
well as recent lessons learned and deficiency trends encountered over the previous
years. The need to avoid complacency and to proactively correct and prevent prob-
lems is reinforced.
Time is also taken at the annual meetings to remind everyone involved about the
history of the program. By guaranteeing that no one forgets what happened to USS
Thresher, the SUBSAFE program has helped to create a culture that is conducive
to strict adherence to policies and procedures. Everyone is recommitted each year
to ensure that a tragedy like the one that occurred in 1963 never happens again.
SUBSAFE is described by those in the program as “a requirement, an attitude, and
a responsibility.”
section 14.10. Execution and Compliance over the Life of a Submarine.
The design, construction, and initial certification are only a small percentage of the
life of the certified ship. The success of the program during the vast majority of the
certified ship's life depends on the knowledge, compliance, and audit by those oper-
ating and maintaining the submarines. Without the rigor of compliance and sustain-
ing knowledge from the petty officers, ships' officers, and fleet staff, all of the great
virtues of SUBSAFE would “come to naught” [30]. The following anecdote by
Admiral Walt Cantrell provides an indication of how SUBSAFE principles per-
meate the entire nuclear Navy:
I remember vividly when I escorted the first group of NASA skeptics to a submarine and
they figured they would demonstrate that I had exaggerated the integrity of the program
by picking a member of ship's force at random and asked him about SUBSAFE. The
NASA folks were blown away. A second class machinist's mate gave a cogent, complete,
correct description of the elements of the program and how important it was that all levels
in the Submarine Force comply. That part of the program is essential to its success—just
as much, if not more so, than all the other support staff effort [30].
section 14.11. Lessons to Be Learned from SUBSAFE.
Those involved in SUBSAFE are very proud of their achievements and the fact that
even after nearly fifty years of no accidents, the program is still strong and vibrant.
On January 8, 2005, USS San Francisco, a twenty-six-year-old ship, crashed head-on
into an underwater mountain. While several crew members were injured and one
died, this incident is considered by SUBSAFE to be a success story: In spite of the
massive damage to her forward structure, there was no flooding, and the ship sur-
faced and returned to port under her own power. There was no breach of the pres-
sure hull, the nuclear reactor remained on line, the emergency main ballast tank
blow system functioned as intended, and the control surfaces functioned properly.
Those in the SUBSAFE program attribute this success to the work discipline, mate-
rial control, documentation, and compliance verification exercised during the design,
construction, and maintenance of USS San Francisco.
Can the SUBSAFE principles be transferred from the military to commercial
companies and industries? The answer lies in why the program has been so effective
and whether these factors can be maintained in other implementations of the prin-
ciples more appropriate to non-military venues. Remember, of course, that private
contractors form the bulk of the companies and workers in the nuclear Navy, and
they seem to be able to satisfy the SUBSAFE program requirements. The primary
difference is in the basic goals of the organization itself.
Some factors that can be identified as contributing to the success of SUBSAFE,
most of which could be translated into a safety program in private industry, are:
1.•Leadership support and commitment to the program.
2.•Management (NAVSEA) is not afraid to say “no” when faced with pressures
to compromise the SUBSAFE principles and requirements. Top management
also agrees to be audited for adherence to the principles of SUBSAFE and to
correct any deficiencies that are found.
3.•Establishment of clear and written safety requirements.
4.•Education, not just training, with yearly reminders of the past, continual
improvement, and input from lessons learned, trouble reports, and assessments
during audits.
5.•Updating the SUBSAFE program requirements and the commitment to it
periodically.
6.•Separation of powers and assignment of responsibility.
7.•Emphasis on rigor, technical compliance, and work discipline.
8.•Documentation capturing what they do and why they do it.
9.• The participatory audit philosophy and the requirement for objective quality
evidence.
10.• A program based on written procedures, not personality-driven.
11.•Continual feedback and improvement. When something does not conform to
SUBSAFE specifications, it must be reported to NAVSEA headquarters along
with the causal analysis (including the systemic factors) of why it happened.
Everyone at every level of the organization is willing to examine his or her role
in the incident.
12.•Continual certification throughout the life of the ship; it is not a one-time event.
13.• Accountability accompanying responsibility. Personal integrity and personal
responsibility are stressed. The program is designed to foster everyone's pride in
his or her work.
14.• A culture of shared responsibility for safety and the SUBSAFE requirements.
15.• Special efforts to be vigilant against complacency and to fight it when it is detected.

493
chapter14.txt Normal file

@ -0,0 +1,493 @@
chapter 14.
SUBSAFE. An Example of a Successful Safety
Program.
This book is filled with examples of accidents and of what not to do. One possible
conclusion might be that despite our best efforts accidents are inevitable in
complex systems. That conclusion would be wrong. Many industries and companies
are able to avoid accidents. the nuclear Navy SUBSAFE program is a shining
example. By any measure, SUBSAFE has been remarkably successful. In nearly
fifty years since the beginning of SUBSAFE, no submarine in the program has
been lost.
Looking at a successful safety program and trying to understand why it has been
successful can be very instructive. This chapter looks at the history of the program
and what it is, and proposes some explanations for its great success. SUBSAFE also
provides a good example of most of the principles expounded in this book.
Although SUBSAFE exists in a government and military environment, most of
the important components could be translated into the commercial, profit-making
world. Also note that the success is not related to small size.there are 40,000
people involved in the U.S. submarine safety program, a large percentage of whom
are private contractors and not government employees. Both private and public
shipyards are involved. SUBSAFE is distributed over large parts of the United
States, although mostly on the coasts .(for obvious reasons). Five submarine classes
are included, as well as worldwide naval operations.
footnote. I am particularly grateful to Rear Admiral Walt Cantrell, Al Ford, and Commander Jim Hassett for
their insights on and information about the SUBSAFE program.
section 14.1.
History.
The SUBSAFE program was created after the loss of the nuclear submarine
Thresher. The USS Thresher was the first ship of her class and the leading edge of
U.S. submarine technology, combining nuclear power with modern hull design and
newly designed equipment and components. On April 10, 19 63 , while performing a
deep test dive approximately two hundred miles off the northeastern coast of the
United States, the USS Thresher was lost at sea with all persons aboard. 112 naval
personnel and 17 civilians died.
The head of the U.S. nuclear Navy, Admiral Hyman Rickover, gathered his staff
after the Thresher loss and ordered them to design a program that would ensure
such a loss never happened again. The program was to be completed by June and
operational by that December. To date, that goal has been achieved. Between 19 15
and 19 63 , the U.S. had lost fifteen submarines to noncombat causes, an average of
one loss every three years, with a total of 454 casualties. Thresher was the first
nuclear submarine lost, the worst submarine disaster in history in terms of lives lost
(figure 14.1).
SUBSAFE was established just fifty-four days after the loss of Thresher. It was
created on June 3, 19 63 , and the program requirements were issued on December
20 of that same year. Since that date, no SUBSAFE-certified submarine has ever
been lost.
One loss did occur in 19 68 .the USS Scorpion.but it was not SUBSAFE certified. In a rush to get Scorpion ready for service after it was scheduled for a major
overhaul in 19 67 , the Chief of Naval Operations allowed a reduced overhaul process
and deferred the required SUBSAFE inspections. The design changes deemed necessary after the loss of Thresher were not made, such as newly designed central valve
control and emergency blow systems, which had not operated properly on Thresher.
Cold War pressures prompted the Navy to search for ways to reduce the duration
of overhauls. By not following SUBSAFE requirements, the Navy reduced the time
Scorpion was out of commission.
In addition, the high quality of the submarine components required by SUBSAFE,
along with intensified structural inspections, had reduced the availability of critical
parts such as seawater piping . A year later, in May 19 68 , Scorpion was lost at
sea. Although some have attributed its loss to a Soviet attack, a later investigation
of the debris field revealed the most likely cause of the loss was one of its own
torpedoes exploding inside the torpedo room . After the Scorpion loss, the need
for SUBSAFE was reaffirmed and accepted.
The rest of this chapter outlines the SUBSAFE program and provides some
hypotheses to explain its remarkable success. The reader will notice that much
of the program rests on the same systems thinking fundamentals advocated in
this book.
Details of the Thresher Loss.
The accident was thoroughly investigated including, to the Navys credit, the systemic factors as well as the technical failures and deficiencies. Deep sea photography, recovered artifacts, and an evaluation of the Threshers design and operational
history led a court of inquiry to conclude that the failure of a deficient silver-braze
joint in a salt water piping system, which relied on silver brazing instead of welding,
led to flooding in the engine room. The crew was unable to access vital equipment
to stop the flooding. As a result of the flooding, saltwater spray on the electrical
components caused short circuits, shutdown of the nuclear reactor, and loss of propulsion. When the crew attempted to blow the main ballast tanks in order to surface,
excessive moisture in the air system froze, causing a loss of airflow and inability
to surface.
The accident report included recommendations to fix the design problems, for
example, to add high-pressure air compressors to permit the emergency blow
system to operate properly. The finding that there were no centrally located isolation valves for the main and auxiliary seawater systems led to the use of flood-control levers that allowed isolation valves to be closed remotely from a central
panel.
Most accident analyses stop at this point, particularly in that era. To their credit,
however, the investigation continued and looked at why the technical deficiencies
existed, that is, the management and systemic factors involved in the loss. They found
deficient specifications, deficient shipbuilding practices, deficient maintenance practices, inadequate documentation of construction and maintenance actions, and deficient operational procedures. With respect to documentation, there appeared to be
incomplete or no records of the work that had been done on the submarine and the
critical materials and processes used.
As one example, Thresher had about three thousand silver-brazed pipe joints
exposed to full pressure when the submarine was submerged. During her last shipyard maintenance, 145 of these joints were inspected on a “not-to-delay” vessel basis
using what was then the new technique called ultrasonic testing. Fourteen percent
of the 145 joints showed substandard joint integrity. Extrapolating these results to
the entire complement of three thousand joints suggests that more than four hundred
joints could have been substandard. The ship was allowed to go to sea in this condition. The Thresher loss investigators looked at whether the full scope of the joint
problem had been determined and what rationale could have been used to allow
the ship to sail without fixing the joints.
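A minimal sketch of that proportional arithmetic, using only the figures quoted above (the Python form and the variable names here are purely illustrative, not part of the inquiry), is:

# Back-of-the-envelope check of the joint extrapolation described above.
joints_total = 3000        # silver-brazed joints exposed to full submergence pressure
joints_inspected = 145     # joints ultrasonically tested during the last overhaul
substandard_rate = 0.14    # fraction of inspected joints found substandard

substandard_in_sample = substandard_rate * joints_inspected   # roughly 20 joints
substandard_projected = substandard_rate * joints_total       # roughly 420 joints

print(round(substandard_in_sample), round(substandard_projected))
# Prints 20 and 420, consistent with "more than four hundred" substandard joints.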
One of the conclusions of the accident investigation is that Navy risk management practices had not advanced as fast as submarine capability.
section 14.2. SUBSAFE Goals and Requirements.
A decision was made in 19 63 to concentrate the SUBSAFE program on the essentials, and a program was designed to provide maximum reasonable assurance of two
things.
1.• Watertight integrity of the submarines hull.
2.• Operability and integrity of critical systems to control and recover from a flooding hazard.
By being focused, the SUBSAFE program does not spread or dilute its focus beyond
this stated purpose. For example, mission assurance is not a focus of SUBSAFE,
although it benefits from it. Similarly, fire safety, weapons safety, occupational health
and safety, and nuclear reactor systems safety are not in SUBSAFE. These additional concerns are handled by regular System Safety programs and mission assurance activities focused on the additional hazards. In this way, the extra rigor required
by SUBSAFE is limited to those activities that ensure U.S. submarines can surface
and return to port safely in an emergency, making the program more acceptable and
practical than it might otherwise be.
SUBSAFE requirements, as documented in the SUBSAFE manual, permeate the
entire submarine community. These requirements are invoked in design, construction, operations, and maintenance and cover the following aspects of submarine
development and operations.
1.• Administrative
2.• Organizational
3.• Technical
4.•Unique design
5.•Material control
6.•Fabrication
7.• Testing
8.• Work control
9.• Audits
10.• Certification
These requirements are invoked in design contracts, construction contracts, overhaul
contracts, the fleet maintenance manual and spare parts procurement specifications,
and so on.
Notice that the requirements encompass not only the technical aspects of the
program but the administrative and organizational aspects as well. The program
requirements are reviewed periodically and renewed when deemed necessary. The
Submarine Safety Working Group, consisting of the SUBSAFE Program Directors
from all SUBSAFE facilities around the country, convenes twice a year to discuss
program issues of mutual concern. This meeting often leads to changes and improvements to the program.
section 14.3. SUBSAFE Risk Management Fundamentals.
SUBSAFE is founded on a basic set of risk management principles, both technical
and cultural. These fundamentals are.
• Work discipline. Knowledge of and compliance with requirements
•Material control. The correct material installed correctly
•Documentation. .(1). Design products .(specifications, drawings, maintenance
standards, system diagrams, etc.), and .(2). objective quality evidence .(defined
later)
•Compliance verification. Inspections, surveillance, technical reviews, and audits
•Learning from inspections, audits, and nonconformances
These fundamentals, coupled with a questioning attitude and what those in
SUBSAFE term a chronic uneasiness, are credited for SUBSAFE success. The fundamentals are taught and embraced throughout the submarine community. The
members of this community believe that it is absolutely critical that they do not
allow themselves to drift away from the fundamentals.
The Navy, in particular, expends a lot of effort in assuring compliance verification
with the SUBSAFE requirements. A common saying in this community is, “Trust
everybody, but check up.” Whenever a significant issue arises involving compliance
with SUBSAFE requirements, including material defects, system malfunctions, deficient processes, equipment damage, and so on, the Navy requires that an initial
report be provided to Naval Sea Systems Command .(NAVSEA). headquarters
within twenty-four hours. The report must describe what happened and must contain
preliminary information concerning apparent root cause(s). and immediate corrective actions taken. Beyond providing the information to prevent recurrence, this
requirement also demonstrates top management commitment to safety and the
SUBSAFE program.
In addition to the technical and managerial risk management fundamentals listed
earlier, SUBSAFE also has cultural principles built into the program.
1.• A questioning attitude
2.•Critical self-evaluation
3.•Lessons learned and continual improvement
4.•Continual training
5.•Separation of powers .(a management structure that provides checks and balances and assures appropriate attention to safety)
As is the case with most risk management programs, the foundation of SUBSAFE
is the personal integrity and responsibility of those individuals who are involved in
the program. The cement bonding this foundation is the selection, training, and
cultural mentoring of those individuals who perform SUBSAFE work. Ultimately,
these people attest to their adherence to technical requirements by documenting
critical data, parameters, statements and their personal signature verifying that work
has been properly completed.
section 14.4.
Separation of Powers.
SUBSAFE has created a unique management structure they call separation of
powers or, less formally, the three-legged stool .(figure 14.2). This structure is the
cornerstone of the SUBSAFE program. Responsibility is divided among three distinct entities providing a system of checks and balances.
The new construction and in-service Platform Program Managers are responsible
for the cost, schedule, and quality of the ships under their control. To ensure that
safety is not traded off under cost and schedule pressures, the Program Managers
can only select from a set of acceptable design options. The Independent Technical
Authority has the responsibility to approve those acceptable options.
The third leg of the stool is the Independent Safety and Quality Assurance
Authority. This group is responsible for administering the SUBSAFE program and
for enforcing compliance. It is staffed by engineers with the authority to question
and challenge the Independent Technical Authority and the Program Managers on
their compliance with SUBSAFE requirements.
The Independent Technical Authority .(ITA). is responsible for establishing and
assuring adherence to technical standards and policy. More specifically, they.
1.•Set and enforce technical standards.
2.•Maintain technical subject matter expertise.
3.• Assure safe and reliable operations.
4.•Ensure effective and efficient systems engineering.
5.•Make unbiased, independent technical decisions.
6.•Provide stewardship of technical and engineering capabilities.
Accountability is important in SUBSAFE and the ITA is held accountable for
exercising these responsibilities.
This management structure only works because of support from top management. When Program Managers complain that satisfying the SUBSAFE requirements will make them unable to satisfy their program goals and deliver new
submarines, SUBSAFE requirements prevail.
section 14.5.
Certification.
In 19 63 , a SUBSAFE certification boundary was defined. Certification focuses on
the structures, systems, and components that are critical to the watertight integrity
and recovery capability of the submarine.
Certification is also strictly based on what the SUBSAFE program defines as
Objective Quality Evidence .(OQE). OQE is defined as any statement of fact, either
quantitative or qualitative, pertaining to the quality of a product or service, based
on observations, measurements, or tests that can be verified. Probabilistic risk assessment, which usually cannot be verified, is not used.
OQE is evidence that deliberate steps were taken to comply with requirements.
It does not matter who did the work or how well they did it, if there is no OQE
then there is no basis for certification.
The goal of certification is to provide maximum reasonable assurance through
the initial SUBSAFE certification and by maintaining certification throughout the
submarines life. SUBSAFE inculcates the basic STAMP assumption that systems
change throughout their existence. SUBSAFE certification is not a one-time activity
but has to be maintained over time. SUBSAFE certification is a process, not just a
final step. This rigorous process structures the construction program through a specified sequence of events leading to formal authorization for sea trials and delivery
to the Navy. Certification then applies to the maintenance and operations programs
and must be maintained throughout the life of the ship.
section 14.5.1. Initial Certification.
Initial certification is separated into four elements .(figure 14.3).
1. Design certification. Design certification consists of design product approval
and design review approval, both of which are based on OQE. For design
product approval, the OQE is reviewed to confirm that the appropriate technical authority has approved the design products, such as the technical drawings.
Most drawings are produced by the submarine design yard. Approval may be
given by the Navys Supervisor of Shipbuilding, which administers and oversees the contract at each of the private shipyards, or, in some cases, the
NAVSEA may act as the review and approval technical authority. Design
approval is considered complete only after the proper technical authority has
reviewed the OQE and at that point the design is certified.
2. Material certification. After the design is certified, the material procured to
build the submarine must meet the requirements of that design. Technical
specifications must be embodied in the purchase documents. Once the material
is received, it goes through a rigorous receipt inspection process to confirm
and certify that it meets the technical specifications. This process usually
involves examining the vendor-supplied chemical and physical OQE for the
material. Records of chemical assay results, heat treatment applied to the material, and nondestructive testing conducted on the material constitute OQE.
3. Fabrication certification. Once the certified material is obtained, the next
step is fabrication where industrial processes such as machining, welding, and
assembly are used to construct components, systems, and ships. OQE is used
to document the industrial processes. Separately, and prior to actual fabrication
of the final product, the facility performing the work is certified in the industrial processes necessary to perform the work. An example is a specific
high-strength steel welding procedure. In addition to the weld procedure, the
individual welder using this particular process in the actual fabrication receives
documented training and successfully completes a formal qualification in the
specific weld procedure to be used. Other industrial processes have similar
certification and qualification requirements. In addition, steps are taken to
ensure that the measurement devices, such as temperature sensors, pressure
gauges, torque wrenches, micrometers, and so on, are included in a robust
calibration program at the facility.
4. Testing certification. Finally, a series of tests is used to prove that the assembly, system, or ship meets design parameters. Testing occurs throughout the
fabrication of a submarine, starting at the component level and continuing
through system assembly, final assembly, and sea trials. The material and components may receive any of the typical nondestructive tests, such as radiography, magnetic particle, and representative tests. Systems are also subjected to
strength testing and operational testing. For certain components, destructive
tests are performed on representative samples.
Each of these certification elements is defined by detailed, documented SUBSAFE
requirements.
At some point near the end of the new construction period, usually lasting five
or so years, every submarine obtains its initial SUBSAFE certification. This process
is very formal and preceded by scrutiny and audit conducted by the shipbuilder, the
supervising authority, and finally, by a NAVSEA Certification Audit Team assembled and led by the Office of Safety and Quality Assurance at NAVSEA. The initial
certification is in the end granted at the flag officer level.
section 14.5.2. Maintaining Certification.
After the submarine enters the fleet, SUBSAFE certification must be maintained
through the life of the ship. Three tools are used. the Reentry Control .(REC). Process,
the Unrestricted Operations Maintenance Requirements Card .(URO MRC)
program, and the audit program.
The Reentry Control .(REC). process carefully controls work and testing within
the SUBSAFE boundary, that is, the structures, systems, and components that are
critical to the watertight integrity and recovery capability of the submarine. The
purpose of REC is to provide maximum reasonable assurance that the areas disturbed have been restored to their fully certified condition. The procedures used
provide an identifiable, accountable, and auditable record of the work performed.
REC control procedures have three goals. .(1). to maintain work discipline by
identifying the work to be performed and the standards to be met, .(2). to establish
personal accountability by having the responsible personnel sign their names on the
reentry control document, and .(3). to collect the OQE needed for maintaining
certification.
The second process, the Unrestricted Operations Maintenance Requirements
Card .(URO MRC). program, involves periodic inspections and tests of critical
items to ensure they have not degraded to an unacceptable level due to use, age,
or environment. In fact, URO MRC did not originate with SUBSAFE, but was
developed to extend the operating cycle of USS Queenfish by one year in 19 69 . It
now provides the technical basis for continued unrestricted operation of submarines to test depth.
The third aspect of maintaining certification is the audit program. Because the
audit process is used for more general purposes than simply maintaining certification, it is considered in a separate section.
section 14.6. Audit Procedures and Approach.
Compliance verification in SUBSAFE is treated as a process, not just one step in a
process or program. The Navy demands that each Navy facility participate fully in
the process, including the use of inspection, surveillance, and audits to confirm their
own compliance. Audits are used to verify that this process is working. They are
conducted either at fixed intervals or when a specific condition is found to exist that
needs attention.
Audits are multi-layered. they exist at the contractor and shipyard level, at the
local government level, and at Navy headquarters. Using the terminology adopted
in this book, responsibilities are assigned to all the components of the safety control
structure as shown in figure 14.4. Contractors and shipyard responsibilities include
implementing specified SUBSAFE requirements, establishing processes for controlling work, establishing processes to verify compliance and certify its own work, and
presenting the certification OQE to the local government oversight authority. The
processes established to verify compliance and certify their work include a quality
management system, surveillance, inspections, witnessing critical contractor work
(contractor quality assurance), and internal audits.
Local government oversight responsibilities include surveillance, inspections,
assuring quality, and witnessing critical contractor work, audits of the contractor,
and certifying the work of the contractor to Navy headquarters.
The responsibilities of Navy headquarters include establishing and specifying
SUBSAFE requirements, verifying compliance with the requirements, and providing SUBSAFE certification for each submarine. Compliance is verified through two
types of audits. .(1). ship-specific and .(2). functional or facility audits.
A ship-specific audit looks at the OQE associated with an individual ship to
ensure that the material condition of that submarine is satisfactory for sea trial and
unrestricted operations. This audit represents a significant part of the certification
process that a submarines condition meets SUBSAFE requirements and is safe to
go to sea.
Functional or facility audits .(such as contractors or shipyards). include reviews
of policies, procedures, and practices to confirm compliance with the SUBSAFE
program requirements, the health of processes, and the capability of producing
certifiable hardware or design products.
Both types of audits are carried out with structured audit plans and qualified
auditors.
The audit philosophy is part of the reason for SUBSAFE success. Audits are
treated as a constructive, learning experience. Audits start from the assumption
that policies, procedures, and practices are in compliance with requirements. The
goal of the audit is to confirm that compliance. Audit findings must be based
on a clear violation of requirements or must be identified as an “operational
improvement.”
The objective of audits is “to make our submarines safer” not to evaluate individual performance or to assign blame. Note the use of the word “our”. the SUBSAFE
program emphasizes common safety goals and group effort to achieve them. Everyone owns the safety goals and is assumed to be committed to them and working to
the same purpose. SUBSAFE literature and training talks about those involved as
being part of a “very special family of people who design, build, maintain, and
operate our nations submarines.”
To this end, audits are a peer review. A typical audit team consists of twenty to
thirty people with approximately 80 percent of the team coming from various
SUBSAFE facilities around the country and the remaining 20 percent coming from
NAVSEA headquarters. An audit is considered a team effort.the facility being
audited is expected to help the audit team make the audit report as accurate and
meaningful as possible.
Audits are conducted under rules of continuous communication.when a problem
is found, the emphasis is on full understanding of the identified problem as well as
identification of potential solutions. Deficiencies are documented and adjudicated.
Contentious issues sometimes arise, but an attempt is made to resolve them during
the audit process.
A significant byproduct of a SUBSAFE audit is the learning experience it provides to the auditors as well as those being audited. Expected results include cross-pollination of successful procedures and process improvements. The rationale
behind having SUBSAFE participants on the audit team is not only their understanding of the SUBSAFE program and requirements, but also their ability to learn
from the audits and apply that learning to their own SUBSAFE groups.
The current audit philosophy is a product of experience and learning. Before
1986, only ship-specific audits were conducted, not facility or headquarters audits.
In 19 86 , there was a determination that they had gotten complacent and were assuming that once an audit was completed, there would be no findings if a follow-up
audit was performed. They also decided that the ship-specific audits were not rigorous or complete enough. In STAMP terms, only the lowest level of the safety control
structure was being audited and not the other components. After that time, biennial
audits were conducted at all levels of the safety control structure, even the highest
levels of management. A biennial NAVSEA internal audit gives the field activities
a chance to evaluate operations at headquarters. Headquarters personnel must be
willing to accept and resolve audit findings just like any other member of the nuclear
submarine community.
One lesson learned has been that developing a robust compliance verification
program is difficult. Along the way they learned that .(1). clear ground rules for audits
must be established, communicated, and adhered to; .(2). it is not possible to “audit
in” requirements; and .(3). the compliance verification organization must be equal
with the program managers and the technical authority. In addition, they determined
that not just anyone can do SUBSAFE work. The number of activities authorized
to perform SUBSAFE activities is strictly controlled.
section 14.7. Problem Reporting and Critiques.
SUBSAFE believes that lessons learned are integral to submarine safety and puts
emphasis on problem reporting and critiques. Significant problems are defined as
those that affect ship safety, cause significant damage to the ship or its equipment,
delay ship deployment or incur substantial cost increase, or involve severe personnel
injury. Trouble reports are prepared for all significant problems encountered in
the construction, repair, and maintenance of naval ships. Systemic problems and
issues that constitute significant lessons learned for other activities can also be
identified by trouble reports. Critiques are similar to trouble reports and are utilized
by the fleet.
Trouble reports are distributed to all SUBSAFE responsible activities and are
used to report significant problems to NAVSEA. NAVSEA evaluates the reports to
identify SUBSAFE program improvements.
section 14.8. Challenges.
The leaders of SUBSAFE consider their biggest challenges to be.
•Ignorance. The state of not knowing;
•Arrogance. Behavior based on pride, self-importance, conceit, or the assumption of intellectual superiority and the presumption of knowledge that is not
supported by facts; and
•Complacency. Satisfaction with ones accomplishments accompanied by a
lack of awareness of actual dangers or deficiencies.
Combating these challenges is a “constant struggle every day” . Many features
of the program are designed to control these challenges, particularly training and
education.
section 14.9. Continual Training and Education.
Continual training and education are a hallmark of SUBSAFE. The goals are to.
1.•Serve as a reminder of the consequences of complacency in ones job.
2.•Emphasize the need to proactively correct and prevent problems.
3.•Stress the need to adhere to program fundamentals.
4.•Convey management support for the program.
Continual improvement and feedback to the SUBSAFE training programs
comes not only from trouble reports and incidents but also from the level of knowledge assessments performed during the audits of organizations that perform
SUBSAFE work.
Annual training is required for all headquarters SUBSAFE workers, from the
apprentice craftsman to the admirals. A periodic refresher is also held at each of the
contractors facilities. At the meetings, a video about the loss of Thresher is shown
and an overview of the SUBSAFE program and their responsibilities is provided as
well as recent lessons learned and deficiency trends encountered over the previous
years. The need to avoid complacency and to proactively correct and prevent problems is reinforced.
Time is also taken at the annual meetings to remind everyone involved about the
history of the program. By guaranteeing that no one forgets what happened to USS
Thresher, the SUBSAFE program has helped to create a culture that is conducive
to strict adherence to policies and procedures. Everyone is recommitted each year
to ensure that a tragedy like the one that occurred in 19 63 never happens again.
SUBSAFE is described by those in the program as “a requirement, an attitude, and
a responsibility.”
section 14.10. Execution and Compliance over the Life of a Submarine.
The design, construction, and initial certification are only a small percentage of the
life of the certified ship. The success of the program during the vast majority of the
certified ships life depends on the knowledge, compliance, and audit by those operating and maintaining the submarines. Without the rigor of compliance and sustaining knowledge from the petty officers, ships officers, and fleet staff, all of the great
virtues of SUBSAFE would “come to naught” . The following anecdote by
Admiral Walt Cantrell provides an indication of how SUBSAFE principles permeate the entire nuclear Navy.
I remember vividly when I escorted the first group of NASA skeptics to a submarine and
they figured they would demonstrate that I had exaggerated the integrity of the program
by picking a member of ships force at random and asked him about SUBSAFE. The
NASA folks were blown away. A second class machinists mate gave a cogent, complete,
correct description of the elements of the program and how important it was that all levels
in the Submarine Force comply. That part of the program is essential to its success.just
as much, if not more so, than all the other support staff effort .
section 14.11. Lessons to Be Learned from SUBSAFE.
Those involved in SUBSAFE are very proud of their achievements and the fact that
even after nearly fifty years of no accidents, the program is still strong and vibrant.
On January 8, 20 05 , USS San Francisco, a twenty-six-year-old ship, crashed head-on
into an underwater mountain. While several crew members were injured and one
died, this incident is considered by SUBSAFE to be a success story. In spite of the
massive damage to her forward structure, there was no flooding, and the ship surfaced and returned to port under her own power. There was no breach of the pressure hull, the nuclear reactor remained on line, the emergency main ballast tank
blow system functioned as intended, and the control surfaces functioned properly.
Those in the SUBSAFE program attribute this success to the work discipline, material control, documentation, and compliance verification exercised during the design,
construction, and maintenance of USS San Francisco.
Can the SUBSAFE principles be transferred from the military to commercial
companies and industries? The answer lies in why the program has been so effective
and whether these factors can be maintained in other implementations of the principles more appropriate to non-military venues. Remember, of course, that private
contractors form the bulk of the companies and workers in the nuclear Navy, and
they seem to be able to satisfy the SUBSAFE program requirements. The primary
difference is in the basic goals of the organization itself.
Some factors that can be identified as contributing to the success of SUBSAFE,
most of which could be translated into a safety program in private industry are.
1.•Leadership support and commitment to the program.
2.•Management .(NAVSEA). is not afraid to say “no” when faced with pressures
to compromise the SUBSAFE principles and requirements. Top management
also agrees to be audited for adherence to the principles of SUBSAFE and to
correct any deficiencies that are found.
3.•Establishment of clear and written safety requirements.
4.•Education, not just training, with yearly reminders of the past, continual
improvement, and input from lessons learned, trouble reports, and assessments
during audits.
5.•Updating the SUBSAFE program requirements and the commitment to it
periodically.
6.•Separation of powers and assignment of responsibility.
7.•Emphasis on rigor, technical compliance, and work discipline.
8.•Documentation capturing what they do and why they do it.
9.• The participatory audit philosophy and the requirement for objective quality
evidence.
10.• A program based on written procedures, not personality-driven.
11.•Continual feedback and improvement. When something does not conform to
SUBSAFE specifications, it must be reported to NAVSEA headquarters along
with the causal analysis .(including the systemic factors). of why it happened.
Everyone at every level of the organization is willing to examine his or her role
in the incident.
12.•Continual certification throughout the life of the ship; it is not a one-time event.
13.• Accountability accompanying responsibility. Personal integrity and personal
responsibility is stressed. The program is designed to foster everyones pride in
his or her work.
14.• A culture of shared responsibility for safety and the SUBSAFE requirements.
15.• Special efforts to be vigilant against complacency and to fight it when it is detected.

43
epilogue.raw Normal file

@ -0,0 +1,43 @@
Epilogue.
In the simpler world of the past, classic safety engineering techniques that focus on
preventing failures and chains of failure events were adequate. They no longer
suffice for the types of systems we want to build, which are stretching the limits of
complexity human minds and our current tools can handle. Society is also expecting
more protection from those responsible for potentially dangerous systems.
Systems theory provides the foundation necessary to build the tools required
to stretch our human limits on dealing with complexity. STAMP translates basic
system theory ideas into the realm of safety and thus provides a foundation for
our future.
As demonstrated in the previous chapter, some industries have been very suc-
cessful in preventing accidents. The U.S. nuclear submarine program is not the only
one. Others seem to believe that accidents are the price of progress or of profits,
and they have been less successful. What seems to distinguish those experiencing
success is that they:
1.• Take a systems approach to safety in both development and operations
2.•Have instituted a learning culture where they have effective learning from
events
3.•Have established safety as a priority and understand that their long-term
success depends on it
This book suggests a new approach to engineering for safety that changes the focus
from “prevent failures” to “enforce behavioral safety constraints,” from reliability
to control. The approach is constructed on an extended model of accident causation
that includes more than the traditional models, adding those factors that are increas-
ingly causing accidents today. It allows us to deal with much more complex systems.
What is surprising is that the techniques and tools described in part III that are built
on STAMP and have been applied in practice on extremely complex systems have
been easier to use and much more effective than the old ones.
Others will improve these first tools and techniques. What is critical is the overall
philosophy of safety as a function of control. This philosophy is not new: It stems
from the prescient engineers who created System Safety after World War II in the
military aviation and ballistic missile defense systems. What they lacked, and what
we have been hindered in our progress by not having, is a more powerful accident
causality model that matches todays new technology and social drivers. STAMP
provides that. Upon this foundation and using systems theory, new more powerful
hazard analysis, design, specification, system engineering, accident/incident analysis,
operations, and management techniques can be developed to engineer a safer world.
Mueller in 1968 described System Safety as “organized common sense” [109]. I
hope that you have found that to be an accurate description of the contents of this
book. In closing I remind you of the admonition by Bertrand Russell: “A life without
adventure is likely to be unsatisfying, but a life in which adventure is allowed to
take any form it will is sure to be short” [179, p. 21].

41
epilogue.txt Normal file

@ -0,0 +1,41 @@
Epilogue.
In the simpler world of the past, classic safety engineering techniques that focus on
preventing failures and chains of failure events were adequate. They no longer
suffice for the types of systems we want to build, which are stretching the limits of
complexity human minds and our current tools can handle. Society is also expecting
more protection from those responsible for potentially dangerous systems.
Systems theory provides the foundation necessary to build the tools required
to stretch our human limits on dealing with complexity. STAMP translates basic
system theory ideas into the realm of safety and thus provides a foundation for
our future.
As demonstrated in the previous chapter, some industries have been very successful in preventing accidents. The U.S. nuclear submarine program is not the only
one. Others seem to believe that accidents are the price of progress or of profits,
and they have been less successful. What seems to distinguish those experiencing
success is that they.
1.• Take a systems approach to safety in both development and operations
2.•Have instituted a learning culture where they have effective learning from
events
3.•Have established safety as a priority and understand that their long-term
success depends on it
This book suggests a new approach to engineering for safety that changes the focus
from “prevent failures” to “enforce behavioral safety constraints,” from reliability
to control. The approach is constructed on an extended model of accident causation
that includes more than the traditional models, adding those factors that are increasingly causing accidents today. It allows us to deal with much more complex systems.
What is surprising is that the techniques and tools described in part 3 that are built
on STAMP and have been applied in practice on extremely complex systems have
been easier to use and much more effective than the old ones.
Others will improve these first tools and techniques. What is critical is the overall
philosophy of safety as a function of control. This philosophy is not new. It stems
from the prescient engineers who created System Safety after World War 2 in the
military aviation and ballistic missile defense systems. What they lacked, and what
we have been hindered in our progress by not having, is a more powerful accident
causality model that matches todays new technology and social drivers. STAMP
provides that. Upon this foundation and using systems theory, new more powerful
hazard analysis, design, specification, system engineering, accident/incident analysis,
operations, and management techniques can be developed to engineer a safer world.
Mueller in 19 68 described System Safety as “organized common sense” . I
hope that you have found that to be an accurate description of the contents of this
book. In closing I remind you of the admonition by Bertrand Russell. “A life without
adventure is likely to be unsatisfying, but a life in which adventure is allowed to
take any form it will is sure to be short” .

@ -60,6 +60,8 @@ ROE R O E
SD S D
SITREP SIT Rep
STPA S T P A
SpecTRM-RL Spec T R M R L
SpecTRM Spec T R M
TACSAT Tack sat
TAOR T A O R
TAOR T A O R
@ -67,4 +69,10 @@ TCAS T Cass
TMI T M I
TTPS T T P S
USCINCEUR U S C in E U R
WD W D
WD W D
ZTHR Z T H R
INPO In Poh
LERs Leers
FARs Farzz
SUBSAFE Sub Safe
NAVSEA Nav Sea