diff --git a/Makefile b/Makefile index 46aca3d..35526e9 100644 --- a/Makefile +++ b/Makefile @@ -8,6 +8,13 @@ MP3_FILES := $(patsubst %.txt,%.mp3,$(wildcard *.txt)) MODEL=en_GB-alan-medium.onnx CONFIG=en_GB-alan-medium.onnx.json +# MODEL=en_GB-aru-medium.onnx +# CONFIG=en_GB-aru-medium.onnx.json + +# MODEL=en_GB-cori-high.onnx +# CONFIG=en_GB-cori-high.onnx.json + + complete: $(TXT_FILES) $(MP3_FILES) echo $@ $^ diff --git a/chapter10.raw b/chapter10.raw new file mode 100644 index 0000000..b6c6816 --- /dev/null +++ b/chapter10.raw @@ -0,0 +1,1362 @@ +chapter 10. +Integrating Safety into System Engineering. +Previous chapters have provided the individual pieces of the solution to engineering +a safer world. This chapter demonstrates how to put these pieces together to inte- +grate safety into a system engineering process. No one process is being proposed: +Safety must be part of any system engineering process. +The glue that integrates the activities of engineering and operating complex +systems is specifications and the safety information system. Communication is criti- +cal in handling any emergent property in a complex system. Our systems today are +designed and built by hundreds and often thousands of engineers and then operated +by thousands and even tens of thousands more people. Enforcing safety constraints +on system behavior requires that the information needed for decision making is +available to the right people at the right time, whether during system development, +operations, maintenance, or reengineering. +This chapter starts with a discussion of the role of specifications and how systems +theory can be used as the foundation for the specification of complex systems. Then +an example of how to put the components together in system design and develop- +ment is presented. Chapters 11 and 12 cover how to maximize learning from acci- +dents and incidents and how to enforce safety constraints during operations. The +design of safety information systems is discussed in chapter 13. + +section 10.1. The Role of Specifications and the Safety Information System. +While engineers may have been able to get away with minimal specifications during +development of the simpler electromechanical systems of the past, specifications are +critical to the successful engineering of systems of the size and complexity we are +attempting to build today. Specifications are no longer simply a means of archiving +information; they need to play an active role in the system engineering process. They +are a critical tool in stretching our intellectual capabilities to deal with increasing +complexity. + + +Our specifications must reflect and support the system safety engineering process +and the safe operation, evolution and change of the system over time. Specifications +should support the use of notations and techniques for reasoning about hazards and +safety, designing the system to eliminate or control hazards, and validating—at each +step, starting from the very beginning of system development—that the evolving +system has the desired safety level. Later, specifications must support operations +and change over time. +Specification languages can help (or hinder) human performance of the various +problem-solving activities involved in system requirements analysis, hazard analysis, +design, review, verification and validation, debugging, operational use, and mainte- +nance and evolution (sustainment). 
They do this by including notations and tools +that enhance our ability to: (1) reason about particular properties, (2) construct the +system and the software in it to achieve them, and (3) validate—at each step, starting +from the very beginning of system development—that the evolving system has the +desired qualities. In addition, systems and particularly the software components are +continually changing and evolving; they must be designed to be changeable and the +specifications must support evolution without compromising the confidence in the +properties that were initially verified. +Documenting and tracking hazards and their resolution are basic requirements +for any effective safety program. But simply having the safety engineer track them +and maintain a hazard log is not enough—information must be derived from the +hazards to inform the system engineering process and that information needs to be +specified and recorded in a way that has an impact on the decisions made during +system design and operations. To have such an impact, the safety-related informa- +tion required by the engineers needs to be integrated into the environment in which +safety-related engineering decisions are made. Engineers are unlikely to be able to +read through volumes of hazard analysis information and relate it easily to the +specific component upon which they are working. The information the system safety +engineer has generated must be presented to the system designers, implementers, +maintainers, and operators in such a way that they can easily find what they need +to make safer decisions. +Safety information is not only important during system design; it also needs to +be presented in a form that people can learn from, apply to their daily jobs, and use +throughout the life cycle of projects. Too often, preventable accidents have occurred +due to changes that were made after the initial design period. Accidents are fre- +quently the result of safe designs becoming unsafe over time when changes in the +system itself or in its environment violate the basic assumptions of the original +hazard analysis. Clearly, these assumptions must be recorded and easily retrievable +when changes occur. Good documentation is the most important in complex systems + + +where nobody is able to keep all the information necessary to make safe decisions +in their head. +What types of specifications are needed to support humans in system safety +engineering and operations? Design decisions at each stage must be mapped into +the goals and constraints they are derived to satisfy, with earlier decisions mapped +or traced to later stages of the process. The result should be a seamless and gapless +record of the progression from high-level requirements down to component require- +ments and designs or operational procedures. The rationale behind the design deci- +sions needs to be recorded in a way that is easily retrievable by those reviewing or +changing the system design. The specifications must also support the various types +of formal and informal analysis used to decide between alternative designs and to +verify the results of the design process. Finally, specifications must assist in the +coordinated design of the component functions and the interfaces between them. +The notations used in specification languages must be easily readable and learn- +able. 
Usability is enhanced by using notations and models that are close to the +mental models created by the users of the specification and the standard notations +in their fields of expertise. +The structure of the specification is also important for usability. The structure will +enhance or limit the ability to retrieve needed information at the appropriate times. +Finally, specifications should not limit the problem-solving strategies of the users +of the specification. Not only do different people prefer different strategies for +solving problems, but the most effective problem solvers have been found to change +strategies frequently [167, 58]. Experts switch problem-solving strategy when they +run into difficulties following a particular strategy and as new information is obtained +that changes the objectives or subgoals or the mental workload needed to use a +particular strategy. Tools often limit the strategies that can be used, usually imple- +menting the favorite strategy of the tool designer, and therefore limiting the problem +solving strategies supported by the specification. +One way to implement these principles is to use intent specifications [120]. + +section 10.2. +Intent Specifications. +Intent specifications are based on systems theory, system engineering principles, and +psychological research on human problem solving and how to enhance it. The goal +is to assist humans in dealing with complexity. While commercial tools exist that +implement intent specifications directly, any specification languages and tools can +be used that allow implementing the properties of an intent specification. +An intent specification differs from a standard specification primarily in its struc- +ture, not its content: no extra information is involved that is not commonly found + + +in detailed specifications—the information is simply organized in a way that has +been found to assist in its location and use. Most complex systems have voluminous +documentation, much of it redundant or inconsistent, and it degrades quickly as +changes are made over time. Sometimes important information is missing, particu- +larly information about why something was done the way it was—the intent or +design rationale. Trying to determine whether a change might have a negative +impact on safety, if possible at all, is usually enormously expensive and often involves +regenerating analyses and work that was already done but either not recorded or +not easily located when needed. Intent specifications were designed to help with +these problems: Design rationale, safety analysis results, and the assumptions upon +which the system design and validation are based are integrated directly into the +system specification and its structure, rather than stored in separate documents, so +the information is at hand when needed for decision making. +The structure of an intent specification is based on the fundamental concept of +hierarchy in systems theory (see chapter 3) where complex systems are modeled in +terms of a hierarchy of levels of organization, each level imposing constraints on +the degree of freedom of the components at the lower level. Different description +languages may be appropriate at the different levels. Figure 10.1 shows the seven +levels of an intent specification. +Intent specifications are organized along three dimensions: intent abstraction, +part-whole abstraction, and refinement. These dimensions constitute the problem +space in which the human navigates. 
Part-whole abstraction (along the horizontal +dimension) and refinement (within each level) allow users to change their +focus of attention to more or less detailed views within each level or model. +The vertical dimension specifies the level of intent at which the problem is being +considered. +Each intent level contains information about the characteristics of the environ- +ment, human operators or users, the physical and functional system components, +and requirements for and results of verification and validation activities for that +level. The safety information is embedded in each level, instead of being maintained +in a separate safety log, but linked together so that it can easily be located and +reviewed. +The vertical intent dimension has seven levels. Each level represents a different +model of the system from a different perspective and supports a different type of +reasoning about it. Refinement and decomposition occurs within each level of the +specification, rather than between levels. Each level provides information not just +about what and how, but why, that is, the design rationale and reasons behind the +design decisions, including safety considerations. +Figure 10.2 shows an example of the information that might be contained in each +level of the intent specification. + + +The top level (level 0) provides a project management view and insight into the +relationship between the plans and the project development status through links +to the other parts of the intent specification. This level might contain the project +management plans, the safety plan, status information, and so on. +Level 1 is the customer view and assists system engineers and customers in +agreeing on what should be built and, later, whether that has been accomplished. It +includes goals, high-level requirements and constraints (both physical and operator), +environmental assumptions, definitions of accidents, hazard information, and system +limitations. +Level 2 is the system engineering view and helps system engineers record and +reason about the system in terms of the physical principles and system-level design +principles upon which the system design is based. +Level 3 specifies the system architecture and serves as an unambiguous interface +between system engineers and component engineers or contractors. At level 3, the +system functions defined at level 2 are decomposed, allocated to components, and +specified rigorously and completely. Black-box behavioral component models may +be used to specify and reason about the logical design of the system as a whole and + + +the interactions among individual system components without being distracted by +implementation details. +If the language used at level 3 is formal (rigorously defined), then it can play an +important role in system validation. For example, the models can be executed in +system simulation environments to identify system requirements and design errors +early in development. They can also be used to automate the generation of system +and component test data, various types of mathematical analyses, and so forth. It is +important, however, that the black-box (that is, transfer function) models be easily +reviewed by domain experts—most of the safety-related errors in specifications will +be found by expert review, not by automated tools or formal proofs. 
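To make the idea concrete, here is a rough sketch, written in ordinary Python rather than in a requirements specification language, of what a small executable black-box model might look like. The inputs, thresholds, and advisory logic are invented for illustration only; they are not taken from TCAS or from the specification language described next.

from dataclasses import dataclass

@dataclass
class Inputs:
    range_nmi: float      # slant range to the intruder, in nautical miles
    closing: bool         # True if the range is decreasing
    rel_alt_ft: float     # relative altitude of the intruder, in feet

def advisory(inp: Inputs) -> str:
    # Black-box transfer function: externally visible inputs -> externally
    # visible output. Nothing about the internal implementation appears here.
    threat = inp.closing and inp.range_nmi < 2.0 and abs(inp.rel_alt_ft) < 600
    nearby = inp.closing and inp.range_nmi < 6.0
    if threat:
        return "RESOLUTION ADVISORY"
    if nearby:
        return "TRAFFIC ADVISORY"
    return "CLEAR"

# Because the model is executable, reviewers can step through scenarios:
for inp in (Inputs(8.0, True, 1500), Inputs(4.0, True, 900), Inputs(1.5, True, 300)):
    print(inp.range_nmi, "nmi ->", advisory(inp))

In practice a tabular, rigorously defined notation is used rather than a programming language, but the same property holds: the specified behavior can be simulated and reviewed long before any implementation exists.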
+A readable but formal and executable black-box requirements specification lan- +guage was developed by the author and her students while helping the FAA specify +the TCAS (Traffic Alert and Collision Avoidance System) requirements [123]. +Reviewers can learn to read the specifications with a few minutes of instruction +about the notation. Improvements have been made over the years, and it is being +used successfully on real systems. This language provides an existence case that a + + +readable and easily learnable but formal specification language is possible. Other +languages with the same properties, of course, can also be used effectively. +The next two levels, Design Representation and Physical Representation, +provide the information necessary to reason about individual component design +and implementation issues. Some parts of level 4 may not be needed if at least por- +tions of the physical design can be generated automatically from the models at +level 3. +The final level, Operations, provides a view of the operational system and acts as +the interface between development and operations. It assists in designing and per- +forming system safety activities during system operations. It may contain required +or suggested operational audit procedures, user manuals, training materials, main- +tenance requirements, error reports and change requests, historical usage informa- +tion, and so on. +Each level of an intent specification supports a different type of reasoning about +the system, with the highest level assisting systems engineers in their reasoning +about system-level goals, constraints, priorities, and tradeoffs. The second level, +System Design Principles, allows engineers to reason about the system in terms of +the physical principles and laws upon which the design is based. The Architecture +level enhances reasoning about the logical design of the system as a whole, the +interactions between the components, and the functions computed by the compo- +nents without being distracted by implementation issues. The lowest two levels +provide the information necessary to reason about individual component design and +implementation issues. The mappings between levels provide the relational informa- +tion that allows reasoning across hierarchical levels and traceability of requirements +to design. +Hyperlinks are used to provide the relational information that allows reasoning +within and across levels, including the tracing from high-level requirements down +to implementation and vice versa. Examples can be found in the rest of this +chapter. +The structure of an intent specification does not imply that the development must +proceed from the top levels down to the bottom levels in that order, only that at +the end of the development process, all levels are complete. Almost all development +involves work at all of the levels at the same time. +When the system changes, the environment in which the system operates changes, +or components are reused in a different system, a new or updated safety analysis is +required. Intent specifications can make that process feasible and practical. +Examples of intent specifications are available [121, 151] as are commercial tools +to support them. But most of the principles can be implemented without special +tools beyond a text editor and hyperlinking facilities. The rest of this chapter assumes +only these very limited facilities are available. + + +section 10.3. An Integrated System and Safety Engineering Process. 
+There is no agreed upon best system engineering process and probably cannot be +one—the process needs to match the specific problem and environment in which it +is being used. What is described in this section is how to integrate safety engineering +into any reasonable system engineering process. +The system engineering process provides a logical structure for problem solving. +Briefly, first a need or problem is specified in terms of objectives that the system +must satisfy and criteria that can be used to rank alternative designs. Then a process +of system synthesis takes place that usually involves considering alternative designs. +Each of the alternatives is analyzed and evaluated in terms of the stated objectives +and design criteria, and one alternative is selected. In practice, the process is highly +iterative: The results from later stages are fed back to early stages to modify objec- +tives, criteria, design decisions, and so on. +Design alternatives are generated through a process of system architecture devel- +opment and analysis. The system engineers first develop requirements and design +constraints for the system as a whole and then break the system into subsystems +and design the subsystem interfaces and the subsystem interface topology. System +functions and constraints are refined and allocated to the individual subsystems. The +emerging design is analyzed with respect to desired system performance character- +istics and constraints, and the process is iterated until an acceptable system design +results. +The difference in safety-guided design is that hazard analysis is used throughout +the process to generate the safety constraints that are factored into the design deci- +sions as they are made. The preliminary design at the end of this process must be +described in sufficient detail that subsystem implementation can proceed indepen- +dently. The subsystem requirements and design processes are subsets of the larger +system engineering process. +This general system engineering process has some particularly important aspects. +One of these is the focus on interfaces. System engineering views each system as an +integrated whole even though it is composed of diverse, specialized components, +which may be physical, logical (software), or human. The objective is to design +subsystems that when integrated into the whole provide the most effective system +possible to achieve the overall objectives. The most challenging problems in building +complex systems today arise in the interfaces between components. One example +is the new highly automated aircraft where most incidents and accidents have been +blamed on human error, but more properly reflect difficulties in the collateral design +of the aircraft, the avionics systems, the cockpit displays and controls, and the +demands placed on the pilots. + + + +A second critical factor is the integration of humans and nonhuman system +components. As with safety, a separate group traditionally does human factors +design and analysis. Building safety-critical systems requires integrating both +system safety and human factors into the basic system engineering process, which +in turn has important implications for engineering education. Unfortunately, +neither safety nor human factors plays an important role in most engineering +education today. 
+During program and project planning, a system safety plan, standards, and +project development safety control structure need to be designed including +policies, procedures, the safety management and control structure, and communica- +tion channels. More about safety management plans can be found in chapters 12 +and 13. +Figure 10.3 shows the types of activities that need to be performed in such an +integrated process and the system safety and human factors inputs and products. +Standard validation and verification activities are not shown, since they should be +included throughout the entire process. +The rest of this chapter provides an example using TCAS II. Other examples are +interspersed where TCAS is not appropriate or does not provide an interesting +enough example. +section 10.3.1. Establishing the Goals for the System. +The first step in any system engineering process is to identify the goals of the effort. +Without agreeing on where you are going, it is not possible to determine how to get +there or when you have arrived. +TCAS II is a box required on most commercial and some general aviation aircraft +that assists in avoiding midair collisions. The goals for TCAS II are to: +G1: Provide affordable and compatible collision avoidance system options for a +broad spectrum of National Airspace System users. +G2: Detect potential midair collisions with other aircraft in all meteorological +conditions; throughout navigable airspace, including airspace not covered +by ATC primary or secondary radar systems; and in the absence of ground +equipment. +TCAS was intended to be an independent backup to the normal Air Traffic Control +(ATC) system and the pilot’s “see and avoid” responsibilities. It interrogates air +traffic control transponders on aircraft in its vicinity and listens for the transponder +replies. By analyzing these replies with respect to slant range and relative altitude, +TCAS determines which aircraft represent potential collision threats and provides +appropriate display indications, called advisories, to the flight crew to assure proper + + +separation. Two types of advisories can be issued. Resolution advisories (RAs) +provide instructions to the pilots to ensure safe separation from nearby traffic in +the vertical plane. Traffic advisories (TAs) indicate the positions of intruding air- +craft that may later cause resolution advisories to be displayed. +TCAS is an example of a system created to directly impact safety where the goals +are all directly related to safety. But system safety engineering and safety-driven +design can be applied to systems where maintaining safety is not the only goal and, +in fact, human safety is not even a factor. The example of an outer planets explorer +spacecraft was shown in chapter 7. Another example is the air traffic control system, +which has both safety and nonsafety (throughput) goals. + +footnote. Horizontal advisories were originally planned for later versions of TCAS but have not yet been +implemented. + +section 10.3.2. Defining Accidents. +Before any safety-related activities can start, the definition of an accident needs to +be agreed upon by the system customer and other stakeholders. This definition, in +essence, establishes the goals for the safety effort. +Defining accidents in TCAS is straightforward—only one is relevant, a midair +collision. Other more interesting examples are shown in chapter 7. 
+Basically, the criterion for specifying events as accidents is that the losses are so +important that they need to play a central role in the design and tradeoff process. +In the outer planets explorer example in chapter 7, some of the losses involve the +mission goals themselves while others involve losses to other missions or a negative +impact on our solar system ecology. +Priorities and evaluation criteria may be assigned to the accidents to indicate how +conflicts are to be resolved, such as conflicts between safety goals or conflicts +between mission goals and safety goals and to guide design choices at lower levels. +The priorities are then inherited by the hazards related to each of the accidents and +traced down to the safety-related design features. + +section 10.3.3. Identifying the System Hazards. +Once the set of accidents has been agreed upon, hazards can be derived from them. +This process is part of what is called Preliminary Hazard Analysis (PHA) in System +Safety. The hazard log is usually started as soon as the hazards to be considered are +identified. While much of the information in the hazard log will be filled in later, +some information is available at this time. +There is no right or wrong list of hazards—only an agreement by all involved on +what hazards will be considered. Some hazards that were considered during the +design of TCAS are listed in chapter 7 and are repeated here for convenience: + + +1. TCAS causes or contributes to a near midair collision (NMAC), defined as a +pair of controlled aircraft violating minimum separation standards. +2. TCAS causes or contributes to a controlled maneuver into the ground. +3. TCAS causes or contributes to the pilot losing control over the aircraft. +4. TCAS interferes with other safety-related aircraft systems (for example, +ground proximity warning). +5. TCAS interferes with the ground-based air traffic control system (e.g., tran- +sponder transmissions to the ground or radar or radio services). +6. TCAS interferes with an ATC advisory that is safety-related (e.g., avoiding a +restricted area or adverse weather conditions). +Once accidents and hazards have been identified, early concept formation (some- +times called high-level architecture development) can be started for the integrated +system and safety engineering process. + +section 10.3.4. Integrating Safety into Architecture Selection and System Trade Studies. +An early activity in the system engineering of complex systems is the selection of +an overall architecture for the system, or as it is sometimes called, system concept +formation. For example, an architecture for manned space exploration might include +a transportation system with parameters and options for each possible architectural +feature related to technology, policy, and operations. Decisions will need to be made +early, for example, about the number and type of vehicles and modules, the destina- +tions for the vehicles, the roles and activities for each vehicle including dockings +and undockings, trajectories, assembly of the vehicles (in space or on Earth), discard- +ing of vehicles, prepositioning of vehicles in orbit and on the planet surface, and so +on. Technology options include type of propulsion, level of autonomy, support +systems (water and oxygen if the vehicle is used to transport humans), and many +others. Policy and operational options may include crew size, level of international +investment, types of missions and their duration, landing sites, and so on. 
Decisions +about these overall system concepts clearly must precede the actual implementation +of the system. +How are these decisions made? The selection process usually involves extensive +tradeoff analysis that compares the different feasible architectures with respect to +some important system property or properties. Cost, not surprisingly, usually plays +a large role in the selection process while other properties, including system safety, +are usually left as a problem to be addressed later in the development lifecycle. +Many of the early architectural decisions, however, have a significant and lasting +impact on safety and may not be reversible after the basic architectural decisions +have been made. For example, the decision not to include a crew escape system on + + +the Space Shuttle was an early architectural decision and has been impacting Shuttle +safety for more than thirty years [74, 136]. After the Challenger accident and again +after the Columbia loss, the idea resurfaced, but there was no cost-effective way to +add crew escape at that time. +The primary reason why safety is rarely factored in during the early architectural +tradeoff process, except perhaps informally, is that practical methods for analyzing +safety, that is, hazard analysis methods that can be applied at that time, do not exist. +But if information about safety were available early, it could be used in the selection +process and hazards could be eliminated by the selection of appropriate architec- +tural options or mitigated early when the cost of doing so is much less than later in +the system lifecycle. Making basic design changes downstream becomes increasingly +costly and disruptive as development progresses and, often, compromises in safety +must be accepted that could have been eliminated if safety had been considered in +the early architectural evaluation process. +While it is relatively easy to identify hazards at system conception, performing a +hazard or risk assessment before a design is available is more problematic. At best, +only a very rough estimate is possible. Risk is usually defined as a combination of +severity and likelihood. Because these two different qualities (severity and likeli- +hood) cannot be combined mathematically, they are commonly qualitatively com- +bined using a risk matrix. Figure 10.4 shows a fairly standard form for such a matrix. + + +High-level hazards are first identified and, for each identified hazard, a qualitative +evaluation is performed by classifying the hazard according to its severity and +likelihood. +While severity can usually be evaluated using the worst possible consequences +of that hazard, likelihood is almost always unknown and, arguably, unknowable for +complex systems before any system design decisions have been made. The problem +is even worse before a system architecture has been selected. Some probabilistic +information is usually available about physical events, of course, and historical +information may theoretically be available. But new systems are usually being +created because existing systems and designs are not adequate to achieve the system +goals, and the new systems will probably use new technology and design features +that limit the accuracy of historical information. For example, historical information +about the likelihood of propulsion-related losses may not be accurate for new space- +craft designs using nuclear propulsion. 
Similarly, historical information about the +errors air traffic controllers make has no relevance for new air traffic control systems, +where the type of errors may change dramatically. +The increasing use of software in most complex systems complicates the situation +further. Much or even most of the software in the system will be new and have no +historical usage information. In addition, statistical techniques that assume random- +ness are not applicable to software design flaws. Software and digital systems also +introduce new ways for hazards to occur, including new types of component interac- +tion accidents. Safety is a system property, and, as argued in part I, combining the +probability of failure of the system components to be used has little or no relation- +ship to the safety of the system as a whole. +There are no known or accepted rigorous or scientific ways to obtain probabilistic +or even subjective likelihood information using historical data or analysis in the case +of non-random failures and system design errors, including unsafe software behav- +ior. When forced to come up with such evaluations, engineering judgment is usually +used, which in most cases amounts to pulling numbers out of the air, often influ- +enced by political and other nontechnical factors. Selection of a system architecture +and early architectural trade evaluations on such a basis is questionable and perhaps +one reason why risk usually does not play a primary role in the early architectural +trade process. +Alternatives to the standard risk matrix are possible, but they tend to be applica- +tion specific and so must be constructed for each new system. For many systems, +the use of severity alone is often adequate to categorize the hazards in trade studies. +Two examples of other alternatives are presented here, one created for augmented +air traffic control technology and the other created and used in the early architec- +tural trade study of NASA’s Project Constellation, the program to return to the +moon and later go on to Mars. The reader is encouraged to come up with their own + + +methods appropriate for their particular application. The examples are not meant +to be definitive, but simply illustrative of what is possible. +Example 1: A Human-Intensive System: Air Traffic Control Enhancements +Enhancements to the air traffic control (ATC) system are unique in that the problem +is not to create a new or safer system but to maintain the very high level of safety +built into the current system: The goal is to not degrade safety. The risk likelihood +estimate can be restated, in this case, as the likelihood that safety will be degraded +by the proposed changes and new tools. To tackle this problem, we created a set of +criteria to be used in the evaluation of likelihood. The criteria ranked various high- +level architectural design features of the proposed set of ATC tools on a variety of +factors related to risk in these systems. The ranking was qualitative and most criteria +were ranked as having low, medium, or high impact on the likelihood of safety being +degraded from the current level. For the majority of factors, “low” meant insignifi- +cant or no change in safety with respect to that factor in the new versus the current +system, “medium” denoted the potential for a minor change, and “high” signified +potential for a significant change in safety. 
Many of the criteria involve human- +automation interaction, since ATC is a very human-intensive system and the new +features being proposed involved primarily new automation to assist human air +traffic controllers. Here are examples of the likelihood level criteria used: +1.•Safety margins: Does the new feature have the potential for (1) an insignifi- +cant or no change to the existing safety margins, (2) a minor change, or (3) a +significant change. +2.•Situation awareness: What is the level of change in the potential for reducing +situation awareness. +3.•Skills currently used and those necessary to backup and monitor the new deci- +sion-support tools: Is there an insignificant or no change in the controller +skills, a minor change, or a significant change. +4.•Introduction of new failure modes and hazard causes: Do the new tools have +the same function and failure modes as the system components they are replac- +ing, are new failure modes and hazards introduced but well understood and +effective mitigation measures can be designed, or are the new failure modes +and hazard causes difficult to control. +5.•Effect of the new software functions on the current system hazard mitigation +measures: Can the new features render the current safety measures ineffective +or are they unrelated to current safety features. + + +6. Need for new system hazard mitigation measures: Will the proposed changes +require new hazard mitigation measures. +These criteria and others were converted into a numerical scheme so they could be +combined and used in an early risk assessment of the changes being contemplated +and their potential likelihood for introducing significant new risk into the system. +The criteria were weighted to reflect their relative importance in the risk analysis. + +footnote. These criteria were developed for a NASA contract by the author and have not been published +previously. + + +Example 2: Early Risk Analysis of Manned Space Exploration +A second example was created by Nicolas Dulac and others as part of an MIT and +Draper Labs contract with NASA to perform an architectural tradeoff analysis for +future human space exploration [59]. The system engineers wanted to include safety +along with the usual factors, such as mass, to evaluate the candidate architectures, +but once again little information was available at this early stage of system engineer- +ing. It was not possible to evaluate likelihood using historical information; all of the +potential architectures involved new technology, new missions, and significant +amounts of software. +In the procedure developed to achieve the goal, the hazards were first identified +as shown in figure 10.5. As is the case at the beginning of any project, identifying +system hazards involved ten percent creativity and ninety percent experience. +Hazards were identified for each mission phase by domain experts under the guid- +ance of the safety experts. Some hazards, such as fire, explosion, or loss of life- +support span multiple (if not all) mission phases and were grouped as General +Hazards. The control strategies used to mitigate them, however, may depend on the +mission phase in which they occur. +Once the hazards were identified, the severity of each hazard was evaluated by +considering the worst-case loss associated with the hazard. In the example, the losses +are evaluated for each of three categories: humans (H), mission (M), and equipment +(E). 
Initially, potential damage to the Earth and planet surface environment was +included in the hazard log. In the end, the environment component was left out of +the analysis because project managers decided to replace the analysis with manda- +tory compliance with NASA’s planetary protection standards. A risk analysis can be +replaced by a customer policy on how the hazards are to be treated. A more com- +plete example, however, for a different system would normally include environmen- +tal hazards. +A severity scale was created to account for the losses associated with each of the +three categories. The scale used is shown in figure 10.6, but obviously a different +scale could easily be created to match the specific policies or standard practice in +different industries and companies. +As usual, severity was relatively easy to handle but the likelihood of the potential +hazard occurring was unknowable at this early stage of system engineering. In + + +addition, space exploration is the polar opposite of the ATC example above as the +system did not already exist and the architectures and missions would involve things +never attempted before, which created a need for a different approach to estimating +likelihood. +We decided to use the mitigation potential of the hazard in the candidate archi- +tecture as an estimator of, or surrogate for, likelihood. Hazards that are more easily +mitigated in the design and operations are less likely to lead to accidents. Similarly, +hazards that have been eliminated during system design, and thus are not part of +that candidate architecture or can easily be eliminated in the detailed design process, +cannot lead to an accident. +The safety goal of the architectural analysis process was to assist in selecting the +architecture with the fewest serious hazards and highest mitigation potential for +those hazards that were not eliminated. Not all hazards will be eliminated even if +they can be. One reason for not eliminating hazards might be that it would reduce +the potential for achieving other important system goals or constraints. Obviously, +safety is not the only consideration in the architecture selection process, but it is +important enough in this case to be a criterion in the selection process. +Mitigation potential was chosen as a surrogate for likelihood for two reasons: +(1) the potential for eliminating or controlling the hazard in the design or operations +has a direct and important bearing on the likelihood of the hazard occurring +(whether traditional or new designs and technology are used) and (2) mitigatibility +of the hazard can be determined before an architecture or design is selected— +indeed, it assists in the selection process. +Figure 10.7 shows an example from the hazard log created during the PHA effort. +The example hazard shown is nuclear reactor overheating. Nuclear power generation +and use, particularly during planetary surface operations, was considered to be an +important option in the architectural tradeoffs. The potential accident and its effects +are described in the hazard log as: +Nuclear core meltdown would cause loss of power, and possibly radiation exposure. +Surface operations must abort mission and evacuate. If abort is unsuccessful or unavailable +at the time, the crew and surface equipment could be lost. There would be no environ- +mental impact on Earth. +The hazard is defined as the nuclear reactor operating at temperatures above the +design limits. 
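The kind of record kept in such a hazard log entry, and the way severity and mitigation potential can later be combined into a single number, can be sketched in a few lines of Python. The scales, category weights, and combination rule below are invented placeholders; they are not the actual scales of figures 10.6 and 10.8 or the weighting scheme used in the study, and only illustrate the shape of the data and the calculation.

from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical scales: severity 1 (negligible) .. 5 (loss of life or mission);
# mitigation rating 1 (only reduces damage) .. 4 (eliminates the hazard).
CATEGORY_WEIGHTS = {"H": 0.5, "M": 0.3, "E": 0.2}   # humans, mission, equipment

@dataclass
class Mitigation:
    description: str
    cost: str            # "low", "medium", or "high"
    rating: int          # hypothetical 1..4 mitigation potential

@dataclass
class HazardEntry:
    hazard: str
    worst_case: str
    severity: Dict[str, int]                         # worst-case severity per category
    mitigations: List[Mitigation] = field(default_factory=list)

def residual_risk(entry: HazardEntry, selected: List[Mitigation]) -> float:
    # Toy surrogate for likelihood: weighted severity, discounted by the best
    # mitigation actually present in the candidate architecture.
    sev = sum(CATEGORY_WEIGHTS[c] * s for c, s in entry.severity.items())
    best = max((m.rating for m in selected), default=1)
    return sev * (1.0 - (best - 1) / 3.0)            # rating 4 drives residual risk to zero

reactor = HazardEntry(
    hazard="Nuclear reactor operating above design temperature limits",
    worst_case="Core meltdown: loss of power, possible radiation exposure, "
               "possible loss of crew and surface equipment if abort fails",
    severity={"H": 5, "M": 5, "E": 5},
    mitigations=[
        Mitigation("Do not use nuclear power generation", "medium", 4),
        Mitigation("Backup power generation for surface operations", "high", 1),
    ],
)

# Architecture that keeps surface nuclear power and only adds a backup:
print(residual_risk(reactor, [reactor.mitigations[1]]))   # 5.0: severity essentially unreduced
# Architecture that eliminates the hazard by not using nuclear power:
print(residual_risk(reactor, [reactor.mitigations[0]]))   # 0.0: hazard cannot occur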
+Although some causal factors can be hypothesized early, a hazard analysis using +STPA can be used to generate a more complete list of causal factors later in the +development process to guide the design process after an architecture is chosen. +Like severity, mitigatibility was evaluated by domain experts under the guidance +of safety experts. Both the cost of the potential mitigation strategy and its + + +effectiveness were evaluated. For the nuclear power example, two strategies were +identified: the first is not to use nuclear power generation at all. The cost of this option +was evaluated as medium (on a low, medium, high scale). But the mitigation potential +was rated as high because it eliminates the hazard completely. The mitigation priority +scale used is shown in figure 10.8. The second mitigation potential identified by +the engineers was to provide a backup power generation system for surface opera- +tions. The difficulty and cost was rated high and the mitigation rating was 1, which was +the lowest possible level, because at best it would only reduce the damage if an acci- +dent occurred but potential serious losses would still occur. Other mitigation strate- +gies are also possible but have been omitted from the sample hazard log entry shown. +None of the effort expended here is wasted. The information included in the +hazard log about the mitigation strategies will be useful later in the design process +if the final architecture selected uses surface nuclear power generation. NASA might +also be able to use the information in future projects and the creation of such early +risk analysis information might be common to companies or industries and not have +to be created for each project. As new technologies are introduced to an industry, +new hazards or mitigation possibilities could be added to the previously stored +information. +The final step in the process is to create safety risk metrics for each candidate +architecture. Because the system engineers on the project created hundreds of fea- +sible architectures, the evaluation process was automated. The actual details of the +mathematical procedures used are of limited general interest and are available +elsewhere [59]. Weighted averages were used to combine mitigation factors and +severity factors to come up with a final Overall Residual Safety-Risk Metric. This +metric was then used in the evaluation and ranking of the potential manned space +exploration architectures. +By selecting and deselecting options in the architecture description, it was also +possible to perform a first-order assessment of the relative importance of each +architectural option in determining the Overall Residual Safety-Risk Metric. +While hundreds of parameters were considered in the risk analysis, the process +allowed the identification of major contributors to the hazard mitigation potential +of selected architectures and thus informed the architecture selection process and + + +the tradeoff analysis. For example, important contributors to increased safety were +determined to include the use of heavy module and equipment prepositioning on +the surface of Mars and the use of minimal rendezvous and docking maneuvers. +Prepositioning modules allows for pretesting and mitigates the hazards associated +with loss of life support, equipment damage, and so on. On the other hand, prepo- +sitioning modules increases the reliance on precision landing to ensure that all +landed modules are within range of each other. 
Consequently, using heavy preposi- +tioning may require additional mitigation strategies and technology development +to reduce the risk associated with landing in the wrong location. All of this infor- +mation must be considered in selecting the best architecture. As another example, +on one hand, a transportation architecture requiring no docking at Mars orbit +or upon return to Earth inherently mitigates hazards associated with collisions or +failed rendezvous and docking maneuvers. On the other hand, having the capability +to dock during an emergency, even though it is not required during nominal opera- +tions, provides additional mitigation potential for loss of life support, especially in +Earth orbit. +Reducing these considerations to a number is clearly not ideal, but with hundreds +of potential architectures it was necessary in this case in order to pare down the +choices to a smaller number. More careful tradeoff analysis is then possible on the +reduced set of choices. +While mitigatibility is widely applicable as a surrogate for likelihood in many +types of domains, the actual process used above is just one example of how it might +be used. Engineers will need to adapt the scales and other features of the process +to the customary practices in their own industry. Other types of surrogates or ways +to handle likelihood estimates in early phases of projects are possible beyond the +two examples provided in this section. While none of these approaches is ideal, they +are much better than ignoring safety in decision making or selecting likelihood +estimates based solely on wishful thinking or the politics that often surround the +preliminary hazard analysis process. +After a conceptual design is chosen, development begins. + +section 10.3.5. Documenting Environmental Assumptions. +An important part of the system development process is to determine and document +the assumptions under which the system requirements and design features are +derived and upon which the hazard analysis is based. Assumptions will be identified +and specified throughout the system engineering process and the engineering speci- +fications to explain decisions or to record fundamental information upon which the +design is based. If the assumptions change over time or the system changes and the +assumptions are no longer true, then the requirements and the safety constraints +and design features based on those assumptions need to be revisited to ensure safety +has not been compromised by the change. + + +Because operational safety depends on the accuracy of the assumptions and +models underlying the design and hazard analysis processes, the operational system +should be monitored to ensure that: +1. The system is constructed, operated, and maintained in the manner assumed +by the designers. +2. The models and assumptions used during initial decision making and design +are correct. +3. The models and assumptions are not violated by changes in the system, such +as workarounds or unauthorized changes in procedures, or by changes in the +environment. +Operational feedback on trends, incidents, and accidents should trigger reanalysis +when appropriate. Linking the assumptions throughout the document with the parts +of the hazard analysis based on that assumption will assist in performing safety +maintenance activities. +Several types of assumptions are relevant. One is the assumptions under which +the system will be used and the environment in which the system will operate. 
Not +only will these assumptions play an important role in system development, but they +also provide part of the basis for creating the operational safety control structure +and other operational safety controls such as creating feedback loops to ensure the +assumptions underlying the system design and the safety analyses are not violated +during operations as the system and its environment change over time. +While many of the assumptions that originate in the existing environment into +which the new system will be integrated can be identified at the beginning of devel- +opment, additional assumptions will be identified as the design process continues +and new requirements and design decisions and features are identified. In addition, +assumptions that the emerging system design imposes on the surrounding environ- +ment will become clear only after detailed decisions are made in the design and +safety analyses. +Examples of important environment assumptions for TCAS II are that: +EA1: High-integrity communications exist between aircraft. +EA2: The TCAS-equipped aircraft carries a Mode-S air traffic control transponder. + + +EA3: All aircraft have operating transponders.j +EA4: All aircraft have legal identification numbers. +EA5: Altitude information is available from intruding targets with a minimum +precision of 100 feet. +EA6: The altimetry system that provides own aircraft pressure altitude to the TCAS +equipment will satisfy the requirements in RTCA Standard . . . +EA7: Threat aircraft will not make an abrupt maneuver that thwarts the TCAS +escape maneuver. + + +footnote. An aircraft transponder sends information to help air traffic control maintain aircraft separation. +Primary radar generally provides bearing and range position information, but lacks altitude information. +Mode A transponders transmit only an identification signal, while Mode C and Mode S transponders +also report pressure altitude. Mode S is newer and has more capabilities than Mode C, some of which +are required for the collision avoidance functions in TCAS. + + + +As noted, these assumptions must be enforced in the overall safety control struc- +ture. With respect to assumption EA4, for example, identification numbers are +usually provided by the aviation authorities in each country, and that requirement +will need to be ensured by international agreement or by some international agency. +The assumption that aircraft have operating transponders (EA3) may be enforced +by the airspace rules in a particular country and, again, must be ensured by some +group. Clearly, these assumptions play an important role in the construction of the +safety control structure and assignments of responsibilities for the final system. For +TCAS, some of these assumptions will already be imposed by the existing air trans- +portation safety control structure while others may need to be added to the respon- +sibilities of some group(s) in the control structure. The last assumption, EA7, imposes +constraints on pilots and the air traffic control system. +Environment requirements and constraints may lead to restrictions on the use of +the new system (in this case, TCAS) or may indicate the need for system safety and +other analyses to determine the constraints that must be imposed on the system +being created (TCAS again) or the larger encompassing system to ensure safety. The +requirements for the integration of the new subsystem safely into the larger system +must be determined early. 
Examples for TCAS include:
E1: The behavior or interaction of non-TCAS equipment with TCAS must not degrade the performance of the TCAS equipment or the performance of the equipment with which TCAS interacts.
E2: Among the aircraft environmental alerts, the hierarchy shall be: Windshear has first priority, then the Ground Proximity Warning System (GPWS), then TCAS.
E3: The TCAS alerts and advisories must be independent of those using the master caution and warning system.

section 10.3.6. System-Level Requirements Generation.
Once the goals and hazards have been identified and a conceptual system architecture has been selected, system-level requirements generation can begin. Usually, in the early stages of a project, goals are stated in very general terms, as shown in G1 and G2. One of the first steps in the design process is to refine the goals into testable and achievable high-level requirements (the “shall” statements). Examples of high-level functional requirements implementing the goals for TCAS are:
1.18: TCAS shall provide collision avoidance protection for any two aircraft closing horizontally at any rate up to 1200 knots and vertically up to 10,000 feet per minute.
Assumption: This requirement is derived from the assumption that commercial aircraft can operate up to 600 knots and 5000 fpm during vertical climb or controlled descent (and therefore two planes can close horizontally up to 1200 knots and vertically up to 10,000 fpm).
1.19.1: TCAS shall operate in enroute and terminal areas with traffic densities up to 0.3 aircraft per square nautical mile (i.e., 24 aircraft within 5 nmi).
Assumption: Traffic density may increase to this level by 1990, and this will be the maximum density over the next 20 years.
As stated earlier, assumptions should continue to be specified when appropriate to explain a decision or to record fundamental information on which the design is based. Assumptions are an important component of the documentation of design rationale and form the basis for safety audits during operations. Consider the above requirement labeled 1.18, for example. In the future, if aircraft performance limits change or there are proposed changes in airspace management, the origin of the specific numbers in the requirement (1,200 and 10,000) can be determined and evaluated for their continued relevance. In the absence of the documentation of such assumptions and how they impact the detailed design decisions, numbers tend to become “gospel,” and everyone is afraid to change them.
Requirements (and constraints) must also be included for the human operator and for the human–computer interface. These requirements will in part be derived from the concept of operations, which should in turn include a human task analysis [48, 47], to determine how TCAS is expected to be used by pilots (which, again, should be checked in safety audits during operations). These analyses use information about the goals of the system, the constraints on how the goals are achieved, including safety constraints, how the automation will be used, how humans now control the system and work in the system without automation, and the tasks humans need to perform and how the automation will support them in performing these tasks. The task analysis must also consider workload and its impact on operator performance. Note that a low workload may be more dangerous than a high one.
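One benefit of recording assumptions such as those attached to requirements 1.18 and 1.19.1 above is that the derived numbers can be re-checked mechanically whenever an assumption changes. A few lines of Python reproduce the arithmetic; reading the traffic-density figure as roughly 24 aircraft inside a 5 nmi radius circle is taken from the requirement text itself.

import math

# Assumption behind 1.18: each aircraft can operate up to 600 knots horizontally
# and 5,000 fpm vertically, so two aircraft can close at twice those rates.
max_speed_kt, max_vertical_fpm = 600, 5_000
print(2 * max_speed_kt, "knots closing;", 2 * max_vertical_fpm, "fpm closing")
# -> 1200 knots closing; 10000 fpm closing

# Assumption behind 1.19.1: a density of 0.3 aircraft per square nautical mile
# corresponds to about 24 aircraft within a 5 nmi radius.
density = 0.3
print(round(density * math.pi * 5 ** 2))    # -> 24

If the 600-knot or 5,000 fpm assumptions are later revised, the requirement values can be regenerated and the links followed to every design decision that depends on them.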
Requirements on the operator (in this case, the pilot) are used to guide the design of the TCAS-pilot interface, the design of the automation logic, flight-crew tasks and procedures, aircraft flight manuals, and training plans and programs. Traceability links should be provided to show the relationships. Links should also be provided to the parts of the hazard analysis from which safety-related requirements are derived. Examples of TCAS II operator safety requirements and constraints are:
OP.4: After the threat is resolved, the pilot shall return promptly and smoothly to his/her previously assigned flight path (→ HA-560, ↓3.3).
OP.9: The pilot must not maneuver on the basis of a Traffic Advisory only (→ HA-630, ↓2.71.3).
The requirements and constraints include links to the hazard analysis that produced the information and to design documents and decisions to show where the requirements are applied. These two examples have links to the parts of the hazard analysis from which they were derived, links to the system design and operator procedures where they are enforced, and links to the user manuals (in this case, the pilot manuals) to explain why certain activities or behaviors are required.
The links not only provide traceability from requirements to implementation and vice versa to assist in review activities, but they also embed the design rationale information into the specification. If changes need to be made to the system, it is easy to follow the links and determine why and how particular design decisions were made.

section 10.3.7. Identifying High-Level Design and Safety Constraints.
Design constraints are restrictions on how the system can achieve its purpose. For example, TCAS is not allowed to interfere with the ground-level air traffic control system while it is trying to maintain adequate separation between aircraft. Avoiding interference is not a goal or purpose of TCAS—the best way to achieve that goal would be not to build the system at all. It is instead a constraint on how the system can achieve its purpose, that is, a constraint on the potential system designs. Because of the need to evaluate and clarify tradeoffs among alternative designs, separating these two types of intent information (goals and design constraints) is important.
For safety-critical systems, constraints should be further separated into safety-related and not safety-related. One nonsafety constraint identified for TCAS, for example, was that requirements for new hardware and equipment on the aircraft be minimized or the airlines would not be able to afford this new collision avoidance system. Examples of nonsafety constraints for TCAS II are:
C.1: The system must use the transponders routinely carried by aircraft for ground ATC purposes (↓2.3, 2.6).
Rationale: To be acceptable to airlines, TCAS must minimize the amount of new hardware needed.
C.4: TCAS must comply with all applicable FAA and FCC policies, rules, and philosophies (↓2.30, 2.79).
The physical environment with which TCAS interacts is shown in figure 10.9. The constraints imposed by these existing environmental components must also be identified before system design can begin.
Safety-related constraints should have two-way links to the system hazard log and to any analysis results that led to that constraint being identified, as well as links to the design features (usually level 2) included to eliminate or control them.
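One way to picture these two-way links is as a small traceability record attached to each requirement or constraint. The identifiers below (OP.4, HA-560, 3.3) are the ones used in the examples above; the record layout and the query are only a hypothetical sketch of what a text editor with hyperlinking, or a simple script, needs to support.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceableItem:
    ident: str                                             # e.g. "OP.4" or "SC.2.1"
    text: str
    derived_from: List[str] = field(default_factory=list)  # upward links (hazards, analyses)
    satisfied_by: List[str] = field(default_factory=list)  # downward links (design, procedures)

op4 = TraceableItem(
    ident="OP.4",
    text="After the threat is resolved, the pilot shall return promptly and "
         "smoothly to the previously assigned flight path.",
    derived_from=["HA-560"],      # the part of the hazard analysis it came from
    satisfied_by=["3.3"],         # the design/procedure section that enforces it
)

def affected_by_change(items: List[TraceableItem], changed: str) -> List[str]:
    # Following the links answers: what must be re-reviewed if this item changes?
    return [i.ident for i in items if changed in i.derived_from + i.satisfied_by]

print(affected_by_change([op4], "HA-560"))   # -> ['OP.4']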
Hazard +analyses are linked to level 1 requirements and constraints, to design features on +level 2, and to system limitations (or accepted risks). An example of a level 1 safety +constraint derived to prevent hazards is: +SC.3: TCAS must generate advisories that require as little deviation as possible +from ATC clearances (→ H6, HA-550, ↓2.30). + + +The link in SC.3 to 2.30 points to the level 2 system design feature that implements +this safety constraint. The other links provide traceability to the hazard (H6) from +which the constraint was derived and to the parts of the hazard analysis involved, +in this case the part of the hazard analysis labeled HA-550. +The following is another example of a safety constraint for TCAS II and some +constraints refined from it, all of which stem from a high-level environmental con- +straint derived from safety considerations in the encompassing system into which +TCAS will be integrated. The refinement will occur as safety-related decisions are +made and guided by an STPA hazard analysis: +SC.2: TCAS must not interfere with the ground ATC system or other aircraft +transmissions to the ground ATC system (→ H5). +SC.2.1: The system design must limit interference with ground-based second- +ary surveillance radar, distance-measuring equipment channels, and with +other radio services that operate in the 1030/1090 MHz frequency band +(↓2.5.1). +SC.2.1.1: The design of the Mode S waveforms used by TCAS must provide +compatibility with Modes A and C of the ground-based secondary surveil- +lance radar system (↓2.6). +SC.2.1.2: The frequency spectrum of Mode S transmissions must be +controlled to protect adjacent distance-measuring equipment channels +(↓2.13). +SC.2.1.3: The design must ensure electromagnetic compatibility between +TCAS and [...] (↓2.14). +SC.2.2: Multiple TCAS units within detection range of one another (approxi- +mately 30 nmi) must be designed to limit their own transmissions. As the +number of such TCAS units within this region increases, the interrogation +rate and power allocation for each of them must decrease in order to prevent +undesired interference with ATC (↓2.13). +Assumptions are also associated with safety constraints. As an example of such an +assumption, consider: +SC.6: TCAS must not disrupt the pilot and ATC operations during critical +phases of flight nor disrupt aircraft operation (→ H3, ↓2.2.3, 2.19, +2.24.2). +SC.6.1: The pilot of a TCAS-equipped aircraft must have the option to switch +to the Traffic-Advisory-Only mode where TAs are displayed but display of +resolution advisories is inhibited (↓ 2.2.3). + + +Assumption: This feature will be used during final approach to parallel +runways, when two aircraft are projected to come close to each other and +TCAS would call for an evasive maneuver (↓ 6.17). +The specified assumption is critical for evaluating safety during operations. Humans +tend to change their behavior over time and use automation in different ways than +originally intended by the designers. Sometimes, these new uses are dangerous. The +hyperlink at the end of the assumption (↓ 6.17) points to the required auditing +procedures for safety during operations and to where the procedures for auditing +this assumption are specified. +Where do these safety constraints come from? Is the system engineer required +to simply make them up? While domain knowledge and expertise are always going +to be required, there are procedures that can be used to guide this process.
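Before turning to that process, note that a constraint such as SC.2.2 above must eventually be refined into a quantitative rule at level 2. The sketch below is hypothetical (the interference-limiting algorithm actually used by TCAS is considerably more involved), but it shows the kind of monotone relationship, more nearby units and therefore less power per unit, that the constraint demands.

def allowed_interrogation_fraction(nearby_tcas_units: int,
                                   full_power_threshold: int = 5) -> float:
    # Hypothetical degradation rule for SC.2.2: as the number of TCAS units
    # within detection range (about 30 nmi) grows, each unit's interrogation
    # rate and power allocation must shrink. The threshold value and the 1/n
    # shape are placeholders, not the real TCAS algorithm.
    if nearby_tcas_units <= full_power_threshold:
        return 1.0
    return full_power_threshold / nearby_tcas_units

# The property the constraint requires is that the allocation never increases
# as the number of nearby units increases.
fractions = [allowed_interrogation_fraction(n) for n in range(1, 31)]
assert all(a >= b for a, b in zip(fractions, fractions[1:]))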
+The highest-level safety constraints come directly from the identified hazards for +the system. For example, TCAS must not cause or contribute to a near miss (H1), +TCAS must not cause or contribute to a controlled maneuver into the ground (H2), +and TCAS must not interfere with the ground-based ATC system. STPA can be used +to refine these high-level design constraints into more detailed design constraints +as described in chapter 8. +The first step in STPA is to create the high-level TCAS operational safety control +structure. For TCAS, this structure is shown in figure 10.10. For simplicity, much of +the structure above ATC operations management has been omitted and the roles and +responsibilities have been simplified here. In a real design project, roles and respon- +sibilities will be augmented and refined as development proceeds, analyses are per- +formed, and design decisions are made. Early in the system concept formation, +specific roles may not all have been determined, and more will be added as the design +concepts are refined. One thing to note is that there are three groups with potential +responsibilities over the pilot’s response to a potential NMAC: TCAS, the ground +ATC, and the airline operations center which provides the airline procedures for +responding to TCAS alerts. Clearly any potential conflicts and coordination prob- +lems between these three controllers will need to be resolved in the overall air traffic +management system design. In the case of TCAS, the designers decided that because +there was no practical way, at that time, to downlink information to the ground con- +trollers about any TCAS advisories that might have been issued for the crew, the pilot +was to immediately implement the TCAS advisory and the co-pilot would transmit +the TCAS alert information by radio to ground ATC. The airline would provide the +appropriate procedures and training to implement this protocol. +Part of defining this control structure involves identifying the responsibilities of +each of the components related to the goal of the system, in this case collision avoid- +ance. For TCAS, these responsibilities include: + + +1.•Aircraft Components (e.g., transponders, antennas): Execute control maneu- +vers, read and send messages to other aircraft, etc. +2.•TCAS: Receive information about its own and other aircraft, analyze the +information received and provide the pilot with (1) information about where +other aircraft in the vicinity are located and (2) an escape maneuver to avoid +potential NMAC threats. +3.•Aircraft Components (e.g., transponders, antennas): Execute pilot-generated +TCAS control maneuvers, read and send messages to and from other aircraft, +etc. +4.•Pilot: Maintain separation between own and other aircraft, monitor the TCAS +displays, and implement TCAS escape maneuvers. The pilot must also follow +ATC advisories. +5.•Air Traffic Control: Maintain separation between aircraft in the controlled +airspace by providing advisories (control actions) for the pilot to follow. TCAS +is designed to be independent of and a backup for the air traffic controller so +ATC does not have a direct role in the TCAS safety control structure but clearly +has an indirect one. +6.•Airline Operations Management: Provide procedures for using TCAS and +following TCAS advisories, train pilots, and audit pilot performance. +7.•ATC Operations Management: Provide procedures, train controllers, audit +performance of controllers and of the overall collision avoidance system. 
+8.•ICAO: Provide worldwide procedures and policies for the use of TCAS and +provide oversight that each country is implementing them. +After the general control structure has been defined (or alternative candidate +control structures identified), the next step is to determine how the controlled +system (the two aircraft) can get into a hazardous state. That information will be +used to generate safety constraints for the designers. STAMP assumes that hazard- +ous states (states that violate the safety constraints) are the result of ineffective +control. Step 1 of STPA is to identify the potentially inadequate control actions. +Control actions in TCAS are called resolution advisories or RAs. An RA is an +aircraft escape maneuver created by TCAS for the pilots to follow. Example reso- +lution advisories are descend, increase rate of climb to 2500 fpm, and don’t +descend. Consider the TCAS component of the control structure (see figure 10.10) +and the NMAC hazard. The four types of control flaws for this example translate +into: +1. The aircraft are on a near collision course, and TCAS does not provide an RA +that avoids it (that is, does not provide an RA, or provides an RA that does +not avoid the NMAC). +2. The aircraft are in close proximity and TCAS provides an RA that degrades +vertical separation (causes an NMAC). +3. The aircraft are on a near collision course and TCAS provides a maneuver too +late to avoid an NMAC. +4. TCAS removes an RA too soon. +These inadequate control actions can be restated as high-level constraints on the +behavior of TCAS: +1. TCAS must provide resolution advisories that avoid near midair collisions. +2. TCAS must not provide resolution advisories that degrade vertical separation +between two aircraft (that is, cause an NMAC). +3. TCAS must provide the resolution advisory while enough time remains for +the pilot to avoid an NMAC. (A human factors and aerodynamic analysis +should be performed at this point to determine exactly how much time that +implies.) +4. TCAS must not remove the resolution advisory before the NMAC is resolved. + + +Similarly, for the pilot, the inadequate control actions are: +1. The pilot does not provide a control action to avoid a near midair collision. +2. The pilot provides a control action that does not avoid the NMAC. +3. The pilot provides a control action that causes an NMAC that would not oth- +erwise have occurred. +4. The pilot provides a control action that could have avoided the NMAC but it +was too late. +5. The pilot starts a control action to avoid an NMAC but stops it too soon. +Again, these inadequate pilot control actions can be restated as safety constraints +that can be used to generate pilot procedures. Similar hazardous control actions and +constraints must be identified for each of the other system components. In addition, +inadequate control actions must be identified for the other functions provided by +TCAS (beyond RAs) such as traffic advisories. +Once the high-level design constraints have been identified, they must be refined +into more detailed design constraints to guide the system design and then aug- +mented with new constraints as design decisions are made, creating a seamless, +integrated, and iterative process of system design and hazard analysis. +Refinement of the constraints involves determining how they could be violated.
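Determining how the constraints could be violated starts from the inadequate control actions just identified. For the RA control action, the four general types can be tabulated directly; the sketch below simply restates the list above in executable form and is not an STPA tool (STPA itself is a model-based analysis performed by people, not an algorithm).

# The four general ways a control action can be unsafe, instantiated for the
# TCAS resolution advisory as listed above.
RA_INADEQUATE_CONTROL_ACTIONS = {
    "not provided": "TCAS does not provide an RA that avoids an NMAC when the "
                    "aircraft are on a near collision course",
    "provided but causes hazard": "TCAS provides an RA that degrades vertical "
                                  "separation (causes an NMAC)",
    "provided too late": "TCAS provides the RA too late for the pilot to avoid "
                         "the NMAC",
    "stopped too soon": "TCAS removes the RA before the NMAC is resolved",
}

for kind, description in RA_INADEQUATE_CONTROL_ACTIONS.items():
    # Each entry is inverted into a high-level safety constraint on TCAS behavior.
    print(f"{kind}: the design must prevent the case where {description}.")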
+The refined constraints will be used to guide attempts to eliminate or control the +hazards in the system design or, if that is not possible, to prevent or control them +in the system or component design. This process of scenario development is exactly +the goal of hazard analysis and STPA. As an example of how the results of the +analysis are used to refine the high-level safety constraints, consider the second +high-level TCAS constraint: that TCAS must not provide resolution advisories that +degrade vertical separation between two aircraft (cause an NMAC): +SC.7: TCAS must not create near misses (result in a hazardous level of vertical +separation that would not have occurred had the aircraft not carried TCAS) +. +SC.7.1: Crossing Maneuvers must be avoided if possible . +SC.7.2: The reversal of a displayed advisory must be extremely rare . +SC.7.3: TCAS must not reverse an advisory if the pilot will have insufficient +time to respond to the RA before the closest point of approach (four seconds + + +or less) or if own and intruder aircraft are separated by less than 200 feet +vertically when ten seconds or less remain to closest point of approach +. +Note again that pointers are used to trace these constraints into the design features +used to implement them. + +footnote. This requirement is clearly vague and untestable. Unfortunately, I could find no definition of “extremely +rare” in any of the TCAS documentation to which I had access. + + +section 10.3.8. System Design and Analysis. +Once the basic requirements and design constraints have been at least partially +specified, the system design features that will be used to implement them must be +created. A strict top-down design process is, of course, not usually feasible. As design +decisions are made and the system behavior becomes better understood, additions +and changes will likely be made in the requirements and constraints. The specifica- +tion of assumptions and the inclusion of traceability links will assist in this process +and in ensuring that safety is not compromised by later decisions and changes. It is +surprising how quickly the rationale behind the decisions that were made earlier is +forgotten. +Once the system design features are determined, (1) an internal control structure +for the system itself is constructed along with the interfaces between the com- +ponents and (2) functional requirements and design constraints, derived from the +system-level requirements and constraints, are allocated to the individual system +components. +System Design +What has been presented so far in this chapter would appear in level 1 of an intent +specification. The second level of an intent specification contains System Design +Principles—the basic system design and scientific and engineering principles needed +to achieve the behavior specified in the top level, as well as any derived require- +ments and design features not related to the level 1 requirements. +While traditional design processes can be used, STAMP and STPA provide the +potential for safety-driven design. In safety-driven design, the refinement of the +high-level hazard analysis is intertwined with the refinement of the system design +to guide the development of the system design and system architecture. 
STPA can +be used to generate safe design alternatives or applied to the design alternatives +generated in some other way to continually evaluate safety as the design progresses +and to assist in eliminating or controlling hazards in the emerging design, as described +in chapter 9. +For TCAS, this level of the intent specification includes such general principles +as the basic tau concept, which is related to all the high-level alerting goals and +constraints: + + +2.2: Each TCAS-equipped aircraft is surrounded by a protected volume of air- +space. The boundaries of this volume are shaped by the tau and DMOD criteria +. +2.2.1: TAU: In collision avoidance, time-to-go to the closest point of approach +(CPA) is more important than distance-to-go to the CPA. Tau is an approxi- +mation of the time in seconds to CPA. Tau equals 3600 times the slant range +in nmi, divided by the closing speed in knots. +2.2.2: DMOD: If the rate of closure is very low, a target could slip in very +close without crossing the tau boundaries and triggering an advisory. In order +to provide added protection against a possible maneuver or speed change by +either aircraft, the tau boundaries are modified (called DMOD). DMOD +varies depending on own aircraft’s altitude regime. +The principles are linked to the related higher-level requirements, constraints, +assumptions, limitations, and hazard analysis as well as to lower-level system design +and documentation and to other information at the same level. Assumptions used +in the formulation of the design principles should also be specified at this level. +For example, design principle 2.51 (related to safety constraint SC-7.2 shown in +the previous section) describes how sense reversals are handled: +2.51: Sense Reversals: (↓ Reversal-Provides-More-Separation) In most encoun- +ter situations, the resolution advisory will be maintained for the duration of an +encounter with a threat aircraft . However, under certain circumstances, +it may be necessary for that sense to be reversed. For example, a conflict between +two TCAS-equipped aircraft will, with very high probability, result in selection +of complementary advisory senses because of the coordination protocol between +the two aircraft. However, if coordination communication between the two air- +craft is disrupted at a critical time of sense selection, both aircraft may choose +their advisories independently (↑HA-130). This could possibly result in selec- +tion of incompatible senses . + +footnote. The sense is the direction of the advisory, such as descend or climb. + +2.51.1: . . . information about how incompatibilities are handled. +Design principle 2.51 describes the conditions under which reversals of TCAS advi- +sories can result in incompatible senses and lead to the creation of a hazard by +TCAS. The pointer labeled HA-395 points to the part of the hazard analysis analyz- +ing that problem. The hazard analysis portion labeled HA-395 would have a com- +plementary pointer to section 2.51. The design decisions made to handle such + + +incompatibilities are described in 2.51.1, but that part of the specification is omitted +here. 2.51 also contains a hyperlink (↓Reversal-Provides-More-Separation) to the +detailed functional level 3 logic (component black-box requirements specification) +used to implement the design decision. 
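Design principle 2.2.1 gives tau explicitly, so it can be computed directly from the quantities named there. The sketch below uses only the formula quoted above; the alerting threshold and the DMOD distance are passed in as parameters because their actual values depend on the sensitivity level and altitude regime and are not reproduced here.

def tau_seconds(slant_range_nmi: float, closing_speed_kt: float) -> float:
    # Tau per 2.2.1: 3600 times the slant range in nmi divided by the closing
    # speed in knots, an approximation of the time in seconds to CPA.
    if closing_speed_kt <= 0:
        return float("inf")     # diverging or stationary geometry: no CPA ahead
    return 3600.0 * slant_range_nmi / closing_speed_kt

def inside_protected_boundary(slant_range_nmi: float, closing_speed_kt: float,
                              tau_threshold_s: float, dmod_nmi: float) -> bool:
    # Sketch of the idea in 2.2.2: alert when tau falls below the threshold,
    # or, for very slow closure rates, when the range itself falls inside DMOD.
    if slant_range_nmi <= dmod_nmi:
        return True
    return tau_seconds(slant_range_nmi, closing_speed_kt) <= tau_threshold_s

print(tau_seconds(5.0, 600.0))                            # 30.0 seconds to CPA
print(inside_protected_boundary(5.0, 600.0, 35.0, 1.0))   # True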
+Information about the allocation of these design decisions to individual system +components and the logic involved is located in level 3, which in turn has links to +the implementation of the logic in lower levels. If a change has to be made to a +system component (such as a change to a software module), it is possible to trace +the function computed by that module upward in the intent specification levels to +determine whether the module is safety critical and if (and how) the change might +affect system safety. +As another example, the TCAS design has a built-in bias against generating +advisories that would result in the aircraft crossing paths (called altitude crossing +advisories). +2.36.2: A bias against altitude crossing RAs is also used in situations involving +intruder level-offs at least 600 feet above or below the TCAS aircraft. +In such a situation, an altitude-crossing advisory is deferred if an intruder +aircraft that is projected to cross own aircraft’s altitude is more than 600 feet +away vertically. +Assumption: In most cases, the intruder will begin a level-off maneuver +when it is more than 600 feet away and so should have a greatly reduced +vertical rate by the time it is within 200 feet of its altitude clearance (thereby +either not requiring an RA if it levels off more than ZTHR feet away or +requiring a non-crossing advisory for level-offs begun after ZTHR is crossed +but before the 600 foot threshold is reached). + +footnote. The vertical dimension, called ZTHR, used to determine whether advisories should be issued varies +from 750 to 950 feet, depending on the TCAS aircraft’s altitude. + +Again, the example above includes a pointer down to the part of the black box +component requirements (functional) specification (Alt_Separation_Test) that +embodies the design principle. Links could also be provided to detailed mathemati- +cal analyses used to support and validate the design decisions. +As another example of using links to embed design rationale in the specification +and of specifying limitations (defined later) and potential hazardous behavior that +could not be controlled in the design, consider the following. TCAS II advisories +may need to be inhibited because of inadequate climb performance for the par- +ticular aircraft on which TCAS is installed. The collision avoidance maneuvers +posted as advisories (called RAs or resolution advisories) by TCAS assume an +aircraft’s ability to safely achieve them. If it is likely they are beyond the capability + + +of the aircraft, then TCAS must know beforehand so it can change its strategy and +issue an alternative advisory. The performance characteristics are provided to TCAS +through the aircraft interface (via what are called aircraft discretes). In some cases, +no feasible solutions to the problem could be found. An example design principle +related to this problem found at level 2 of the TCAS intent specification is: +2.39: Because of the limited number of inputs to TCAS for aircraft performance +inhibits, in some instances where inhibiting RAs would be appropriate it is not +possible to do so (↑L6). In these cases, TCAS may command maneuvers that +may significantly reduce stall margins or result in stall warning (↑SC9.1). Con- +ditions where this may occur include . . .
The aircraft flight manual or flight +manual supplement should provide information concerning this aspect of TCAS +so that flight crews may take appropriate action (↓ [Pointers to pilot procedures +on level 3 and Aircraft Flight Manual on level 6). +Finally, design principles may reflect tradeoffs between higher-level goals and con- +straints. As examples: +2.2.3: Tradeoffs must be made between necessary protection (↑1.18) and unnec- +essary advisories (↑SC.5, SC.6). This is accomplished by controlling the +sensitivity level, which controls the tau, and therefore the dimensions of the +protected airspace around each TCAS-equipped aircraft. The greater the +sensitivity level, the more protection is provided but the higher is the incidence +of unnecessary alerts. Sensitivity level is determined by . . . +2.38: The need to inhibit climb RAs because of inadequate aircraft climb perfor- +mance will increase the likelihood of TCAS II (a) issuing crossing maneuvers, +which in turn increases the possibility that an RA may be thwarted by the +intruder maneuvering (↑SC7.1, HA-115), (b) causing an increase in descend +RAs at low altitude (↑SC8.1), and (c) providing no RAs if below the descend +inhibit level (1200 feet above ground level on takeoff and 1000 feet above +ground level on approach). +Architectural Design, Functional Allocation, and Component Implementation +(Level 3) +Once the general system design concepts are agreed upon, the next step usually +involves developing the design architecture and allocating behavioral requirements +and constraints to the subsystems and components. Once again, two-way tracing +should exist between the component requirements and the system design principles +and requirements. These links will be available to the subsystem developers to be +used in their implementation and development activities and in verification (testing +and reviews). Finally, during field testing and operations, the links and recorded +assumptions and design rationale can be used in safety change analysis, incident and + + +accident analysis, periodic audits, and performance monitoring as required to ensure +that the operational system is and remains safe. +Level 3 of an intent specification contains the system architecture, that is, the +allocation of functions to components and the designed communication paths +among those components (including human operators). At this point, a black-box +functional requirements specification language becomes useful, particularly a formal +language that is executable. SpecTRM-RL is used as the example specification +language in this section [85, 86]). An early version of the language was developed +in 1990 to specify the requirements for TCAS II and has been refined and improved +since that time. SpecTRM-RL is part of a larger specification management system +called SpecTRM (Specification Tools and Requirements Methodology). Other +languages, of course, can be used. +One of the first steps in low-level architectural design is to break the system into +a set of components. For TCAS, only three components were used: surveillance, +collision avoidance, and performance monitoring. +The environment description at level 3 includes the assumed behavior of the +external components (such as the altimeters and transponders for TCAS), including +perhaps failure behavior, upon which the correctness of the system design is pre- +dicated, along with a description of the interfaces between the TCAS system +and its environment. 
Figure 10.11 shows part of a SpecTRM-RL description of an +environment component, in this case an altimeter. + + +The division between the system and its environment is somewhat arbitrary and can be drawn wherever is conv- +enient for the purposes of the specifier. In this example, the environment includes +any component that was already on the aircraft or in the airspace control system +and was not newly designed or built as part of the TCAS effort. +All communications between the system and external components need to be +described in detail, including the designed interfaces. The black-box behavior of +each component also needs to be specified. This specification serves as the func- +tional requirements for the components. What is included in the component speci- +fication will depend on whether the component is part of the environment or part +of the system being constructed. Figure 10.12 shows part of the SpecTRM-RL +description of the behavior of the CAS (collision avoidance system) subcomponent. +SpecTRM-RL specifications are intended to be both easily readable with minimum +instruction and formally analyzable. They are also executable and can be used in a + + +system simulation environment. Readability was a primary goal in the design of +SpecTRM-RL, as was completeness with regard to safety. Most of the requirements +completeness criteria described in Safeware and rewritten as functional design prin- +ciples in chapter 9 of this book are included in the syntax of the language to assist +in system safety reviews of the requirements. +SpecTRM-RL explicitly shows the process model used by the controller and +describes the required behavior in terms of this model. A state machine model is used +to describe the system component’s process model, in this case the state of the air- +craft and the air space around it, and the ways the process model can change state. +Logical behavior is specified in SpecTRM-RL using and/or tables. Figure 10.12 +shows a small part of the specification of the TCAS collision avoidance logic. For +TCAS, an important state variable is the status of the other aircraft around the +TCAS aircraft, called intruders. Intruders are classified into four groups: Other +Traffic, Proximate Traffic, Potential Threat, and Threat. The figure shows the logic for classifying an +intruder as Other Traffic using an and/or table. The information in the tables can +be visualized in additional ways. +The rows of the table represent and relationships, while the columns represent +or. The state variable takes the specified value (in this case, Other Traffic) if any of +the columns evaluate to true. A column evaluates to true if all the rows have the +value specified for that row in the column. A dot in the table indicates that the value +for the row is irrelevant. Underlined variables represent hyperlinks. For example, +clicking on “Alt Reporting” would show how the Alt Reporting variable is defined: +In our TCAS intent specification [121], the altitude report for an aircraft is defined +as Lost if no valid altitude report has been received in the past six seconds. Bearing +Valid, Range Valid, Proximate Traffic Condition, and Proximate Threat Condition +are macros, which simply means that they are defined using separate logic tables. +The additional logic for the macros could have been inserted here, but sometimes +the logic gets very complex and it is easier for specifiers and reviewers if, in those +cases, the tables are broken up into smaller pieces (a form of refinement abstrac- +tion). This decision is, of course, up to the creator of the table.
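The and/or table semantics just described (rows are ANDed within a column, columns are ORed, and a dot means the row is irrelevant) are simple enough to prototype. The sketch below illustrates only those semantics; it is not SpecTRM-RL itself, and the toy table does not reproduce the real Other Traffic logic.

def eval_and_or_table(row_names, columns, current_values):
    # row_names: the conditions down the left side of the table.
    # columns: one list per column, each entry True, False, or "*" (the dot).
    # current_values: dict mapping each row name to its current truth value.
    def column_is_true(column):
        return all(entry == "*" or current_values[name] == entry
                   for name, entry in zip(row_names, column))
    return any(column_is_true(column) for column in columns)

rows = ["Alt Reporting Lost", "Proximate Traffic Condition"]
columns = [
    [True, "*"],      # altitude report lost, other condition irrelevant
    [False, False],   # reporting, but not even proximate traffic
]
print(eval_and_or_table(rows, columns,
                        {"Alt Reporting Lost": True,
                         "Proximate Traffic Condition": False}))   # True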
+The behavioral descriptions at this level are purely black-box: They describe the +inputs and outputs of each component and their relationships only in terms of +externally visible behavior. Essentially it represents the transfer function across the +component. Any of these components (except the humans, of course) could be +implemented either in hardware or software. Some of the TCAS surveillance + +functions are, in fact, implemented using analog devices by some vendors and digital +by others. Decisions about physical implementation, software design, internal vari- +ables, and so on are limited to levels of the specification below this one. Thus, this +level serves as a rugged interface between the system designers and the component +designers and implementers (including subcontractors). +Software need not be treated any differently than the other parts of the system. +Most safety-related software problems stem from requirements flaws. The system +requirements and system hazard analysis should be used to determine the behav- +ioral safety constraints that must be enforced on software behavior and that the +software must enforce on the controlled system. Once that is accomplished, those +requirements and constraints are passed to the software developers (through the +black-box requirements specifications), and they use them to generate and validate +their designs just as the hardware developers do. +Other information at this level might include flight crew requirements such as +description of tasks and operational procedures, interface requirements, and the +testing requirements for the functionality described on this level. If the black-box +requirements specification is executable, system testing can be performed early to +validate requirements using system and environment simulators or hardware-in- +the-loop simulation. Including a visual operator task-modeling language permits +integrated simulation and analysis of the entire system, including human–computer +interactions [15, 177]. +Models at this level are reusable, and we have found that these models provide the +best place to provide component reuse and build component libraries [119]. Reuse +of application software at the code level has been problematic at best, contributing +to a surprising number of accidents [116]. Level 3 black-box behavioral specifications +provide a way to make the changes almost always necessary to reuse software in a +format that is both reviewable and verifiable. In addition, the black-box models can +be used to maintain the system and to specify and validate changes before they are +made in the various manufacturers’ products. Once the changed level 3 specifications +have been validated, the links to the modules implementing the modeled behavior +can be used to determine which modules need to be changed and how. Libraries of +component models can also be developed and used in a plug-and-play fashion, +making changes as required, in order to develop product families [211]. +The rest of the development process, involving the implementation of the com- +ponent requirements and constraints and documented at levels 4 and 5 of intent +specifications, is straightforward and differs little from what is normally done today. + + + +footnote. A SpecTRM-RL model of TCAS was created by the author and her students Jon Reese, Mats Heim- +dahl, and Holly Hildreth to assist in the certification of TCAS II. 
Later, as an experiment to show the +feasibility of creating intent specifications, the author created the level 1 and level 2 intent specification +for TCAS. Jon Reese rewrote the level 3 collision avoidance system logic from the early version of the +language into SpecTRM-RL. + + + +section 10.3.9. Documenting System Limitations. +When the system is completed, the system limitations need to be identified and +documented. Some of the identification will, of course, be done throughout the + +development. This information is used by management and stakeholders to deter- +mine whether the system is adequately safe to use, along with information about +each of the identified hazards and how they were handled. +Limitations should be included in level 1 of the intent specification, because they +properly belong in the customer view of the system and will affect both acceptance +and certification. +Some limitations may be related to the basic functional requirements, such as +these: +L4: TCAS does not currently indicate horizontal escape maneuvers and therefore +does not (and is not intended to) increase horizontal separation. +Limitations may also relate to environment assumptions. For example: +L1: TCAS provides no protection against aircraft without transponders or with +nonoperational transponders (→EA3, HA-430). +L6: Aircraft, performance limitations constrain the magnitude of the escape +maneuver that the flight crew can safely execute in response to a resolution +advisory. It is possible for these limitations to preclude a successful resolution +of the conflict (→H3, ↓2.38, 2.39). +L4: TCAS is dependent on the accuracy of the threat aircraft’s reported altitude. +Separation assurance may be degraded by errors in intruder pressure altitude +as reported by the transponder of the intruder aircraft (→EA5). +Assumption: This limitation holds for the airspace existing at the time of the +initial TCAS deployment, where many aircraft use pressure altimeters rather +than GPS. As more aircraft install GPS systems with greater accuracy than +current pressure altimeters, this limitation will be reduced or eliminated. +Limitations are often associated with hazards or hazard causal factors that could +not be completely eliminated or controlled in the design. Thus they represent +accepted risks. For example, +L3: TCAS will not issue an advisory if it is turned on or enabled to issue resolution +advisories in the middle of a conflict (→ HA-405). +L5: If only one of two aircraft is TCAS equipped while the other has only ATCRBS +altitude-reporting capability, the assurance of safe separation may be reduced +(→ HA-290). +In the specification, both of these system limitations would have pointers to the +relevant parts of the hazard analysis along with an explanation of why they could +not be eliminated or adequately controlled in the system design. Decisions about +deployment and certification of the system will need to be based partially on these + + +limitations and their impact on the safety analysis and safety assumptions of the +encompassing system, which, in the case of TCAS, is the overall air traffic system. +A final type of limitation is related to problems encountered or tradeoffs made +during system design. For example, TCAS has a high-level performance-monitoring +requirement that led to the inclusion of a self-test function in the system design to +determine whether TCAS is operating correctly. 
The following system limitation +relates to this self-test facility: +L9: Use by the pilot of the self-test function in flight will inhibit TCAS operation +for up to 20 seconds depending upon the number of targets being tracked. The +ATC transponder will not function during some portion of the self-test sequence +(↓6.52). +These limitations should be linked to the relevant parts of the development and, +most important, operational specifications. For example, L9 may be linked to the +pilot operations manual. + +section 10.3.10. System Certification, Maintenance, and Evolution. +At this point in development, the safety requirements and constraints are docu- +mented and traced to the design features used to implement them. A hazard log +contains the hazard information (or links to it) generated during the development +process and the results of the hazard analysis performed. The log will contain +embedded links to the resolution of each hazard, such as functional requirements, +design constraints, system design features, operational procedures, and system limi- +tations. The information documented should be easy to collect into a form that can +be used for the final safety assessment and certification of the system. +Whenever changes are made in safety-critical systems or software (during devel- +opment or during maintenance and evolution), the safety of the change needs to be +reevaluated. This process can be difficult and expensive if it has to start from scratch +each time. By providing links throughout the specification, it should be easy to assess +whether a particular design decision or piece of code was based on the original +safety analysis or safety-related design constraint and only that part of the safety +analysis process repeated or reevaluated. \ No newline at end of file diff --git a/chapter10.txt b/chapter10.txt new file mode 100644 index 0000000..4a92e15 --- /dev/null +++ b/chapter10.txt @@ -0,0 +1,1238 @@ +chapter 10. +Integrating Safety into System Engineering. +Previous chapters have provided the individual pieces of the solution to engineering +a safer world. This chapter demonstrates how to put these pieces together to integrate safety into a system engineering process. No one process is being proposed. +Safety must be part of any system engineering process. +The glue that integrates the activities of engineering and operating complex +systems is specifications and the safety information system. Communication is critical in handling any emergent property in a complex system. Our systems today are +designed and built by hundreds and often thousands of engineers and then operated +by thousands and even tens of thousands more people. Enforcing safety constraints +on system behavior requires that the information needed for decision making is +available to the right people at the right time, whether during system development, +operations, maintenance, or reengineering. +This chapter starts with a discussion of the role of specifications and how systems +theory can be used as the foundation for the specification of complex systems. Then +an example of how to put the components together in system design and development is presented. Chapters 11 and 12 cover how to maximize learning from accidents and incidents and how to enforce safety constraints during operations. The +design of safety information systems is discussed in chapter 13. + +section 10.1. The Role of Specifications and the Safety Information System. 
+While engineers may have been able to get away with minimal specifications during +development of the simpler electromechanical systems of the past, specifications are +critical to the successful engineering of systems of the size and complexity we are +attempting to build today. Specifications are no longer simply a means of archiving +information; they need to play an active role in the system engineering process. They +are a critical tool in stretching our intellectual capabilities to deal with increasing +complexity. + + +Our specifications must reflect and support the system safety engineering process +and the safe operation, evolution and change of the system over time. Specifications +should support the use of notations and techniques for reasoning about hazards and +safety, designing the system to eliminate or control hazards, and validating.at each +step, starting from the very beginning of system development.that the evolving +system has the desired safety level. Later, specifications must support operations +and change over time. +Specification languages can help .(or hinder). human performance of the various +problem-solving activities involved in system requirements analysis, hazard analysis, +design, review, verification and validation, debugging, operational use, and maintenance and evolution .(sustainment). They do this by including notations and tools +that enhance our ability to. .(1). reason about particular properties, .(2). construct the +system and the software in it to achieve them, and .(3). validate.at each step, starting +from the very beginning of system development.that the evolving system has the +desired qualities. In addition, systems and particularly the software components are +continually changing and evolving; they must be designed to be changeable and the +specifications must support evolution without compromising the confidence in the +properties that were initially verified. +Documenting and tracking hazards and their resolution are basic requirements +for any effective safety program. But simply having the safety engineer track them +and maintain a hazard log is not enough.information must be derived from the +hazards to inform the system engineering process and that information needs to be +specified and recorded in a way that has an impact on the decisions made during +system design and operations. To have such an impact, the safety-related information required by the engineers needs to be integrated into the environment in which +safety-related engineering decisions are made. Engineers are unlikely to be able to +read through volumes of hazard analysis information and relate it easily to the +specific component upon which they are working. The information the system safety +engineer has generated must be presented to the system designers, implementers, +maintainers, and operators in such a way that they can easily find what they need +to make safer decisions. +Safety information is not only important during system design; it also needs to +be presented in a form that people can learn from, apply to their daily jobs, and use +throughout the life cycle of projects. Too often, preventable accidents have occurred +due to changes that were made after the initial design period. Accidents are frequently the result of safe designs becoming unsafe over time when changes in the +system itself or in its environment violate the basic assumptions of the original +hazard analysis. 
Clearly, these assumptions must be recorded and easily retrievable +when changes occur. Good documentation is the most important in complex systems + + +where nobody is able to keep all the information necessary to make safe decisions +in their head. +What types of specifications are needed to support humans in system safety +engineering and operations? Design decisions at each stage must be mapped into +the goals and constraints they are derived to satisfy, with earlier decisions mapped +or traced to later stages of the process. The result should be a seamless and gapless +record of the progression from high-level requirements down to component requirements and designs or operational procedures. The rationale behind the design decisions needs to be recorded in a way that is easily retrievable by those reviewing or +changing the system design. The specifications must also support the various types +of formal and informal analysis used to decide between alternative designs and to +verify the results of the design process. Finally, specifications must assist in the +coordinated design of the component functions and the interfaces between them. +The notations used in specification languages must be easily readable and learnable. Usability is enhanced by using notations and models that are close to the +mental models created by the users of the specification and the standard notations +in their fields of expertise. +The structure of the specification is also important for usability. The structure will +enhance or limit the ability to retrieve needed information at the appropriate times. +Finally, specifications should not limit the problem-solving strategies of the users +of the specification. Not only do different people prefer different strategies for +solving problems, but the most effective problem solvers have been found to change +strategies frequently . Experts switch problem-solving strategy when they +run into difficulties following a particular strategy and as new information is obtained +that changes the objectives or subgoals or the mental workload needed to use a +particular strategy. Tools often limit the strategies that can be used, usually implementing the favorite strategy of the tool designer, and therefore limiting the problem +solving strategies supported by the specification. +One way to implement these principles is to use intent specifications . + +section 10.2. +Intent Specifications. +Intent specifications are based on systems theory, system engineering principles, and +psychological research on human problem solving and how to enhance it. The goal +is to assist humans in dealing with complexity. While commercial tools exist that +implement intent specifications directly, any specification languages and tools can +be used that allow implementing the properties of an intent specification. +An intent specification differs from a standard specification primarily in its structure, not its content. no extra information is involved that is not commonly found + + +in detailed specifications.the information is simply organized in a way that has +been found to assist in its location and use. Most complex systems have voluminous +documentation, much of it redundant or inconsistent, and it degrades quickly as +changes are made over time. Sometimes important information is missing, particularly information about why something was done the way it was.the intent or +design rationale. 
Trying to determine whether a change might have a negative +impact on safety, if possible at all, is usually enormously expensive and often involves +regenerating analyses and work that was already done but either not recorded or +not easily located when needed. Intent specifications were designed to help with +these problems. Design rationale, safety analysis results, and the assumptions upon +which the system design and validation are based are integrated directly into the +system specification and its structure, rather than stored in separate documents, so +the information is at hand when needed for decision making. +The structure of an intent specification is based on the fundamental concept of +hierarchy in systems theory .(see chapter 3). where complex systems are modeled in +terms of a hierarchy of levels of organization, each level imposing constraints on +the degree of freedom of the components at the lower level. Different description +languages may be appropriate at the different levels. Figure 10.1 shows the seven +levels of an intent specification. +Intent specifications are organized along three dimensions. intent abstraction, +part-whole abstraction, and refinement. These dimensions constitute the problem +space in which the human navigates. Part-whole abstraction .(along the horizontal +dimension). and refinement .(within each level). allow users to change their +focus of attention to more or less detailed views within each level or model. +The vertical dimension specifies the level of intent at which the problem is being +considered. +Each intent level contains information about the characteristics of the environment, human operators or users, the physical and functional system components, +and requirements for and results of verification and validation activities for that +level. The safety information is embedded in each level, instead of being maintained +in a separate safety log, but linked together so that it can easily be located and +reviewed. +The vertical intent dimension has seven levels. Each level represents a different +model of the system from a different perspective and supports a different type of +reasoning about it. Refinement and decomposition occurs within each level of the +specification, rather than between levels. Each level provides information not just +about what and how, but why, that is, the design rationale and reasons behind the +design decisions, including safety considerations. +Figure 10.2 shows an example of the information that might be contained in each +level of the intent specification. + + +The top level .(level 0). provides a project management view and insight into the +relationship between the plans and the project development status through links +to the other parts of the intent specification. This level might contain the project +management plans, the safety plan, status information, and so on. +Level 1 is the customer view and assists system engineers and customers in +agreeing on what should be built and, later, whether that has been accomplished. It +includes goals, high-level requirements and constraints .(both physical and operator), +environmental assumptions, definitions of accidents, hazard information, and system +limitations. +Level 2 is the system engineering view and helps system engineers record and +reason about the system in terms of the physical principles and system-level design +principles upon which the system design is based. 
+Level 3 specifies the system architecture and serves as an unambiguous interface +between system engineers and component engineers or contractors. At level 3, the +system functions defined at level 2 are decomposed, allocated to components, and +specified rigorously and completely. Black-box behavioral component models may +be used to specify and reason about the logical design of the system as a whole and + + +the interactions among individual system components without being distracted by +implementation details. +If the language used at level 3 is formal .(rigorously defined), then it can play an +important role in system validation. For example, the models can be executed in +system simulation environments to identify system requirements and design errors +early in development. They can also be used to automate the generation of system +and component test data, various types of mathematical analyses, and so forth. It is +important, however, that the black-box .(that is, transfer function). models be easily +reviewed by domain experts.most of the safety-related errors in specifications will +be found by expert review, not by automated tools or formal proofs. +A readable but formal and executable black-box requirements specification language was developed by the author and her students while helping the FAA specify +the TCAS .(Traffic Alert and Collision Avoidance System). requirements . +Reviewers can learn to read the specifications with a few minutes of instruction +about the notation. Improvements have been made over the years, and it is being +used successfully on real systems. This language provides an existence case that a + + +readable and easily learnable but formal specification language is possible. Other +languages with the same properties, of course, can also be used effectively. +The next two levels, Design Representation and Physical Representation, +provide the information necessary to reason about individual component design +and implementation issues. Some parts of level 4 may not be needed if at least portions of the physical design can be generated automatically from the models at +level 3. +The final level, Operations, provides a view of the operational system and acts as +the interface between development and operations. It assists in designing and performing system safety activities during system operations. It may contain required +or suggested operational audit procedures, user manuals, training materials, maintenance requirements, error reports and change requests, historical usage information, and so on. +Each level of an intent specification supports a different type of reasoning about +the system, with the highest level assisting systems engineers in their reasoning +about system-level goals, constraints, priorities, and tradeoffs. The second level, +System Design Principles, allows engineers to reason about the system in terms of +the physical principles and laws upon which the design is based. The Architecture +level enhances reasoning about the logical design of the system as a whole, the +interactions between the components, and the functions computed by the components without being distracted by implementation issues. The lowest two levels +provide the information necessary to reason about individual component design and +implementation issues. The mappings between levels provide the relational information that allows reasoning across hierarchical levels and traceability of requirements +to design. 
+Hyperlinks are used to provide the relational information that allows reasoning +within and across levels, including the tracing from high-level requirements down +to implementation and vice versa. Examples can be found in the rest of this +chapter. +The structure of an intent specification does not imply that the development must +proceed from the top levels down to the bottom levels in that order, only that at +the end of the development process, all levels are complete. Almost all development +involves work at all of the levels at the same time. +When the system changes, the environment in which the system operates changes, +or components are reused in a different system, a new or updated safety analysis is +required. Intent specifications can make that process feasible and practical. +Examples of intent specifications are available as are commercial tools +to support them. But most of the principles can be implemented without special +tools beyond a text editor and hyperlinking facilities. The rest of this chapter assumes +only these very limited facilities are available. + + +section 10.3. An Integrated System and Safety Engineering Process. +There is no agreed upon best system engineering process and probably cannot be +one.the process needs to match the specific problem and environment in which it +is being used. What is described in this section is how to integrate safety engineering +into any reasonable system engineering process. +The system engineering process provides a logical structure for problem solving. +Briefly, first a need or problem is specified in terms of objectives that the system +must satisfy and criteria that can be used to rank alternative designs. Then a process +of system synthesis takes place that usually involves considering alternative designs. +Each of the alternatives is analyzed and evaluated in terms of the stated objectives +and design criteria, and one alternative is selected. In practice, the process is highly +iterative. The results from later stages are fed back to early stages to modify objectives, criteria, design decisions, and so on. +Design alternatives are generated through a process of system architecture development and analysis. The system engineers first develop requirements and design +constraints for the system as a whole and then break the system into subsystems +and design the subsystem interfaces and the subsystem interface topology. System +functions and constraints are refined and allocated to the individual subsystems. The +emerging design is analyzed with respect to desired system performance characteristics and constraints, and the process is iterated until an acceptable system design +results. +The difference in safety-guided design is that hazard analysis is used throughout +the process to generate the safety constraints that are factored into the design decisions as they are made. The preliminary design at the end of this process must be +described in sufficient detail that subsystem implementation can proceed independently. The subsystem requirements and design processes are subsets of the larger +system engineering process. +This general system engineering process has some particularly important aspects. +One of these is the focus on interfaces. System engineering views each system as an +integrated whole even though it is composed of diverse, specialized components, +which may be physical, logical .(software), or human. 
The objective is to design +subsystems that when integrated into the whole provide the most effective system +possible to achieve the overall objectives. The most challenging problems in building +complex systems today arise in the interfaces between components. One example +is the new highly automated aircraft where most incidents and accidents have been +blamed on human error, but more properly reflect difficulties in the collateral design +of the aircraft, the avionics systems, the cockpit displays and controls, and the +demands placed on the pilots. + + + +A second critical factor is the integration of humans and nonhuman system +components. As with safety, a separate group traditionally does human factors +design and analysis. Building safety-critical systems requires integrating both +system safety and human factors into the basic system engineering process, which +in turn has important implications for engineering education. Unfortunately, +neither safety nor human factors plays an important role in most engineering +education today. +During program and project planning, a system safety plan, standards, and +project development safety control structure need to be designed including +policies, procedures, the safety management and control structure, and communication channels. More about safety management plans can be found in chapters 12 +and 13. +Figure 10.3 shows the types of activities that need to be performed in such an +integrated process and the system safety and human factors inputs and products. +Standard validation and verification activities are not shown, since they should be +included throughout the entire process. +The rest of this chapter provides an example using TCAS 2 . Other examples are +interspersed where TCAS is not appropriate or does not provide an interesting +enough example. +section 10.3.1. Establishing the Goals for the System. +The first step in any system engineering process is to identify the goals of the effort. +Without agreeing on where you are going, it is not possible to determine how to get +there or when you have arrived. +TCAS 2 is a box required on most commercial and some general aviation aircraft +that assists in avoiding midair collisions. The goals for TCAS 2 are to. +G1. Provide affordable and compatible collision avoidance system options for a +broad spectrum of National Airspace System users. +G2. Detect potential midair collisions with other aircraft in all meteorological +conditions; throughout navigable airspace, including airspace not covered +by ATC primary or secondary radar systems; and in the absence of ground +equipment. +TCAS was intended to be an independent backup to the normal Air Traffic Control +(ATC). system and the pilot’s “see and avoid” responsibilities. It interrogates air +traffic control transponders on aircraft in its vicinity and listens for the transponder +replies. By analyzing these replies with respect to slant range and relative altitude, +TCAS determines which aircraft represent potential collision threats and provides +appropriate display indications, called advisories, to the flight crew to assure proper + + +separation. Two types of advisories can be issued. Resolution advisories .(RAs) +provide instructions to the pilots to ensure safe separation from nearby traffic in +the vertical plane. Traffic advisories .(TAs). indicate the positions of intruding aircraft that may later cause resolution advisories to be displayed. 
TCAS is an example of a system created to directly impact safety, where the goals are all directly related to safety. But system safety engineering and safety-driven design can be applied to systems where maintaining safety is not the only goal and, in fact, human safety is not even a factor. The example of an outer planets explorer spacecraft was shown in chapter 7. Another example is the air traffic control system, which has both safety and nonsafety (throughput) goals.

footnote. Horizontal advisories were originally planned for later versions of TCAS but have not yet been implemented.

section 10.3.2. Defining Accidents.

Before any safety-related activities can start, the definition of an accident needs to be agreed upon by the system customer and other stakeholders. This definition, in essence, establishes the goals for the safety effort.

Defining accidents in TCAS is straightforward: only one is relevant, a midair collision. Other more interesting examples are shown in chapter 7.

Basically, the criterion for specifying events as accidents is that the losses are so important that they need to play a central role in the design and tradeoff process. In the outer planets explorer example in chapter 7, some of the losses involve the mission goals themselves while others involve losses to other missions or a negative impact on our solar system ecology.

Priorities and evaluation criteria may be assigned to the accidents to indicate how conflicts are to be resolved, such as conflicts between safety goals or conflicts between mission goals and safety goals, and to guide design choices at lower levels. The priorities are then inherited by the hazards related to each of the accidents and traced down to the safety-related design features.

section 10.3.3. Identifying the System Hazards.

Once the set of accidents has been agreed upon, hazards can be derived from them. This process is part of what is called Preliminary Hazard Analysis (PHA) in System Safety. The hazard log is usually started as soon as the hazards to be considered are identified. While much of the information in the hazard log will be filled in later, some information is available at this time.

There is no right or wrong list of hazards: only an agreement by all involved on what hazards will be considered. Some hazards that were considered during the design of TCAS are listed in chapter 7 and are repeated here for convenience.

1. TCAS causes or contributes to a near midair collision (NMAC), defined as a pair of controlled aircraft violating minimum separation standards.

2. TCAS causes or contributes to a controlled maneuver into the ground.

3. TCAS causes or contributes to the pilot losing control over the aircraft.

4. TCAS interferes with other safety-related aircraft systems (for example, ground proximity warning).

5. TCAS interferes with the ground-based air traffic control system (e.g., transponder transmissions to the ground or radar or radio services).

6. TCAS interferes with an ATC advisory that is safety-related (e.g., avoiding a restricted area or adverse weather conditions).

Once accidents and hazards have been identified, early concept formation (sometimes called high-level architecture development) can be started for the integrated system and safety engineering process.

section 10.3.4. Integrating Safety into Architecture Selection and System Trade Studies.
An early activity in the system engineering of complex systems is the selection of an overall architecture for the system or, as it is sometimes called, system concept formation. For example, an architecture for manned space exploration might include a transportation system with parameters and options for each possible architectural feature related to technology, policy, and operations. Decisions will need to be made early, for example, about the number and type of vehicles and modules, the destinations for the vehicles, the roles and activities for each vehicle including dockings and undockings, trajectories, assembly of the vehicles (in space or on Earth), discarding of vehicles, prepositioning of vehicles in orbit and on the planet surface, and so on. Technology options include type of propulsion, level of autonomy, support systems (water and oxygen if the vehicle is used to transport humans), and many others. Policy and operational options may include crew size, level of international investment, types of missions and their duration, landing sites, and so on. Decisions about these overall system concepts clearly must precede the actual implementation of the system.

How are these decisions made? The selection process usually involves extensive tradeoff analysis that compares the different feasible architectures with respect to some important system property or properties. Cost, not surprisingly, usually plays a large role in the selection process, while other properties, including system safety, are usually left as a problem to be addressed later in the development lifecycle. Many of the early architectural decisions, however, have a significant and lasting impact on safety and may not be reversible after the basic architectural decisions have been made. For example, the decision not to include a crew escape system on the Space Shuttle was an early architectural decision and has been impacting Shuttle safety for more than thirty years. After the Challenger accident and again after the Columbia loss, the idea resurfaced, but there was no cost-effective way to add crew escape at that time.

The primary reason why safety is rarely factored in during the early architectural tradeoff process, except perhaps informally, is that practical methods for analyzing safety, that is, hazard analysis methods that can be applied at that time, do not exist. But if information about safety were available early, it could be used in the selection process, and hazards could be eliminated by the selection of appropriate architectural options or mitigated early, when the cost of doing so is much less than later in the system lifecycle. Making basic design changes downstream becomes increasingly costly and disruptive as development progresses and, often, compromises in safety must be accepted that could have been eliminated if safety had been considered in the early architectural evaluation process.

While it is relatively easy to identify hazards at system conception, performing a hazard or risk assessment before a design is available is more problematic. At best, only a very rough estimate is possible. Risk is usually defined as a combination of severity and likelihood. Because these two different qualities (severity and likelihood) cannot be combined mathematically, they are commonly qualitatively combined using a risk matrix. Figure 10.4 shows a fairly standard form for such a matrix.
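As a rough illustration of how such a matrix is typically used, the sketch below looks up a qualitative risk level from severity and likelihood categories. The category names and cell values are hypothetical and are not taken from figure 10.4.

# Hypothetical risk matrix: each row is a severity category and each column a
# likelihood category; the cell holds the qualitative risk level for that pair.
LIKELIHOODS = ["improbable", "remote", "occasional", "probable", "frequent"]
MATRIX = {
    "catastrophic": ["medium", "serious", "high", "high", "high"],
    "critical": ["low", "medium", "serious", "high", "high"],
    "marginal": ["low", "low", "medium", "serious", "serious"],
    "negligible": ["low", "low", "low", "medium", "medium"],
}

def risk_level(severity, likelihood):
    # Qualitative lookup: no arithmetic combination of the two qualities is implied.
    return MATRIX[severity][LIKELIHOODS.index(likelihood)]

# Example: a catastrophic hazard judged "remote" falls in the "serious" cell.
print(risk_level("catastrophic", "remote"))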
High-level hazards are first identified and, for each identified hazard, a qualitative evaluation is performed by classifying the hazard according to its severity and likelihood.

While severity can usually be evaluated using the worst possible consequences of that hazard, likelihood is almost always unknown and, arguably, unknowable for complex systems before any system design decisions have been made. The problem is even worse before a system architecture has been selected. Some probabilistic information is usually available about physical events, of course, and historical information may theoretically be available. But new systems are usually being created because existing systems and designs are not adequate to achieve the system goals, and the new systems will probably use new technology and design features that limit the accuracy of historical information. For example, historical information about the likelihood of propulsion-related losses may not be accurate for new spacecraft designs using nuclear propulsion. Similarly, historical information about the errors air traffic controllers make has no relevance for new air traffic control systems, where the types of errors may change dramatically.

The increasing use of software in most complex systems complicates the situation further. Much or even most of the software in the system will be new and have no historical usage information. In addition, statistical techniques that assume randomness are not applicable to software design flaws. Software and digital systems also introduce new ways for hazards to occur, including new types of component interaction accidents. Safety is a system property, and, as argued in part I, combining the probability of failure of the system components to be used has little or no relationship to the safety of the system as a whole.

There are no known or accepted rigorous or scientific ways to obtain probabilistic or even subjective likelihood information using historical data or analysis in the case of non-random failures and system design errors, including unsafe software behavior. When engineers are forced to come up with such evaluations, they usually fall back on engineering judgment, which in most cases amounts to pulling numbers out of the air, often influenced by political and other nontechnical factors. Selecting a system architecture and performing early architectural trade evaluations on such a basis is questionable, and it is perhaps one reason why risk usually does not play a primary role in the early architectural trade process.

Alternatives to the standard risk matrix are possible, but they tend to be application specific and so must be constructed for each new system. For many systems, the use of severity alone is often adequate to categorize the hazards in trade studies. Two examples of other alternatives are presented here, one created for augmented air traffic control technology and the other created and used in the early architectural trade study of NASA's Project Constellation, the program to return to the moon and later go on to Mars. The reader is encouraged to come up with their own methods appropriate for their particular application. The examples are not meant to be definitive, but simply illustrative of what is possible.

Example 1. A Human-Intensive System: Air Traffic Control Enhancements

Enhancements to the air traffic control (ATC)
system are unique in that the problem is not to create a new or safer system but to maintain the very high level of safety built into the current system. The goal is to not degrade safety. The risk likelihood estimate can be restated, in this case, as the likelihood that safety will be degraded by the proposed changes and new tools. To tackle this problem, we created a set of criteria to be used in the evaluation of likelihood. The criteria ranked various high-level architectural design features of the proposed set of ATC tools on a variety of factors related to risk in these systems. The ranking was qualitative, and most criteria were ranked as having low, medium, or high impact on the likelihood of safety being degraded from the current level. For the majority of factors, "low" meant insignificant or no change in safety with respect to that factor in the new versus the current system, "medium" denoted the potential for a minor change, and "high" signified potential for a significant change in safety. Many of the criteria involve human-automation interaction, since ATC is a very human-intensive system and the new features being proposed involved primarily new automation to assist human air traffic controllers. Here are examples of the likelihood level criteria used:

1. Safety margins. Does the new feature have the potential for (1) an insignificant or no change to the existing safety margins, (2) a minor change, or (3) a significant change?

2. Situation awareness. What is the level of change in the potential for reducing situation awareness?

3. Skills currently used and those necessary to back up and monitor the new decision-support tools. Is there an insignificant or no change in the controller skills, a minor change, or a significant change?

4. Introduction of new failure modes and hazard causes. Do the new tools have the same functions and failure modes as the system components they are replacing; are new failure modes and hazards introduced but well understood, so that effective mitigation measures can be designed; or are the new failure modes and hazard causes difficult to control?

5. Effect of the new software functions on the current system hazard mitigation measures. Can the new features render the current safety measures ineffective, or are they unrelated to current safety features?

6. Need for new system hazard mitigation measures. Will the proposed changes require new hazard mitigation measures?

These criteria and others were converted into a numerical scheme so they could be combined and used in an early risk assessment of the changes being contemplated and their potential likelihood for introducing significant new risk into the system. The criteria were weighted to reflect their relative importance in the risk analysis.

footnote. These criteria were developed for a NASA contract by the author and have not been published previously.

Example 2. Early Risk Analysis of Manned Space Exploration

A second example was created by Nicolas Dulac and others as part of an MIT and Draper Labs contract with NASA to perform an architectural tradeoff analysis for future human space exploration. The system engineers wanted to include safety along with the usual factors, such as mass, to evaluate the candidate architectures, but once again little information was available at this early stage of system engineering.
It was not possible to evaluate likelihood using historical information; all of the potential architectures involved new technology, new missions, and significant amounts of software.

In the procedure developed to achieve the goal, the hazards were first identified as shown in figure 10.5. As is the case at the beginning of any project, identifying system hazards involved ten percent creativity and ninety percent experience. Hazards were identified for each mission phase by domain experts under the guidance of the safety experts. Some hazards, such as fire, explosion, or loss of life support, span multiple (if not all) mission phases and were grouped as General Hazards. The control strategies used to mitigate them, however, may depend on the mission phase in which they occur.

Once the hazards were identified, the severity of each hazard was evaluated by considering the worst-case loss associated with the hazard. In the example, the losses are evaluated for each of three categories: humans (H), mission (M), and equipment (E). Initially, potential damage to the Earth and planet surface environment was included in the hazard log. In the end, the environment component was left out of the analysis because project managers decided to replace the analysis with mandatory compliance with NASA's planetary protection standards. A risk analysis can be replaced by a customer policy on how the hazards are to be treated. A more complete example for a different system, however, would normally include environmental hazards.

A severity scale was created to account for the losses associated with each of the three categories. The scale used is shown in figure 10.6, but obviously a different scale could easily be created to match the specific policies or standard practice in different industries and companies.

As usual, severity was relatively easy to handle, but the likelihood of the potential hazard occurring was unknowable at this early stage of system engineering. In addition, space exploration is the polar opposite of the ATC example above, as the system did not already exist and the architectures and missions would involve things never attempted before, which created a need for a different approach to estimating likelihood.

We decided to use the mitigation potential of the hazard in the candidate architecture as an estimator of, or surrogate for, likelihood. Hazards that are more easily mitigated in the design and operations are less likely to lead to accidents. Similarly, hazards that have been eliminated during system design, and thus are not part of that candidate architecture or can easily be eliminated in the detailed design process, cannot lead to an accident.

The safety goal of the architectural analysis process was to assist in selecting the architecture with the fewest serious hazards and the highest mitigation potential for those hazards that were not eliminated. Not all hazards will be eliminated even if they can be. One reason for not eliminating hazards might be that it would reduce the potential for achieving other important system goals or constraints. Obviously, safety is not the only consideration in the architecture selection process, but it is important enough in this case to be a criterion in the selection process.

Mitigation potential was chosen as a surrogate for likelihood for two reasons: (1)
the potential for eliminating or controlling the hazard in the design or operations has a direct and important bearing on the likelihood of the hazard occurring (whether traditional or new designs and technology are used), and (2) the mitigatibility of the hazard can be determined before an architecture or design is selected; indeed, it assists in the selection process.

Figure 10.7 shows an example from the hazard log created during the PHA effort. The example hazard shown is nuclear reactor overheating. Nuclear power generation and use, particularly during planetary surface operations, was considered to be an important option in the architectural tradeoffs. The potential accident and its effects are described in the hazard log as:

Nuclear core meltdown would cause loss of power, and possibly radiation exposure. Surface operations must abort the mission and evacuate. If abort is unsuccessful or unavailable at the time, the crew and surface equipment could be lost. There would be no environmental impact on Earth.

The hazard is defined as the nuclear reactor operating at temperatures above the design limits.

Although some causal factors can be hypothesized early, a hazard analysis using STPA can be used to generate a more complete list of causal factors later in the development process to guide the design process after an architecture is chosen.

Like severity, mitigatibility was evaluated by domain experts under the guidance of safety experts. Both the cost of the potential mitigation strategy and its effectiveness were evaluated. For the nuclear power example, two strategies were identified. The first is not to use nuclear power generation at all. The cost of this option was evaluated as medium (on a low, medium, high scale), but the mitigation potential was rated as high because it eliminates the hazard completely. The mitigation priority scale used is shown in figure 10.8. The second mitigation strategy identified by the engineers was to provide a backup power generation system for surface operations. The difficulty and cost were rated high, and the mitigation rating was 1, the lowest possible level, because at best it would only reduce the damage if an accident occurred; potentially serious losses would still result. Other mitigation strategies are also possible but have been omitted from the sample hazard log entry shown.

None of the effort expended here is wasted. The information included in the hazard log about the mitigation strategies will be useful later in the design process if the final architecture selected uses surface nuclear power generation. NASA might also be able to use the information in future projects, and such early risk analysis information might be common to companies or industries and not have to be created anew for each project. As new technologies are introduced to an industry, new hazards or mitigation possibilities could be added to the previously stored information.

The final step in the process is to create safety risk metrics for each candidate architecture. Because the system engineers on the project created hundreds of feasible architectures, the evaluation process was automated. The actual details of the mathematical procedures used are of limited general interest and are available elsewhere. Weighted averages were used to combine mitigation factors and severity factors to come up with a final Overall Residual Safety-Risk Metric.
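The actual procedure is documented in the cited work; the following is only a minimal sketch, under assumed scales and weights, of how severity and mitigation ratings for the hazards of one candidate architecture might be combined into a single residual-risk figure. All of the numbers, weights, and hazard entries below are hypothetical.

# Each hazard carries a severity rating and a mitigation-potential rating
# (the surrogate for likelihood discussed above). All values are assumed.
hazards = [
    # (hazard, severity on a 1-5 scale, mitigation potential on a 1-4 scale)
    ("nuclear reactor overheating", 5, 4),
    ("loss of life support", 5, 2),
    ("failed rendezvous or docking", 4, 3),
]

SEVERITY_WEIGHT = 0.6     # assumed relative importance of severity
MITIGATION_WEIGHT = 0.4   # assumed relative importance of mitigatibility

def residual_risk(severity, mitigation, max_severity=5, max_mitigation=4):
    # Normalize both ratings to 0..1; high severity and low mitigation potential
    # both push the residual risk of the hazard upward.
    return (SEVERITY_WEIGHT * (severity / max_severity)
            + MITIGATION_WEIGHT * (1.0 - mitigation / max_mitigation))

def overall_residual_safety_risk(hazards):
    # Weighted average over all hazards of one candidate architecture.
    return sum(residual_risk(s, m) for _, s, m in hazards) / len(hazards)

print(round(overall_residual_safety_risk(hazards), 3))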
This +metric was then used in the evaluation and ranking of the potential manned space +exploration architectures. +By selecting and deselecting options in the architecture description, it was also +possible to perform a first-order assessment of the relative importance of each +architectural option in determining the Overall Residual Safety-Risk Metric. +While hundreds of parameters were considered in the risk analysis, the process +allowed the identification of major contributors to the hazard mitigation potential +of selected architectures and thus informed the architecture selection process and + + +the tradeoff analysis. For example, important contributors to increased safety were +determined to include the use of heavy module and equipment prepositioning on +the surface of Mars and the use of minimal rendezvous and docking maneuvers. +Prepositioning modules allows for pretesting and mitigates the hazards associated +with loss of life support, equipment damage, and so on. On the other hand, prepositioning modules increases the reliance on precision landing to ensure that all +landed modules are within range of each other. Consequently, using heavy prepositioning may require additional mitigation strategies and technology development +to reduce the risk associated with landing in the wrong location. All of this information must be considered in selecting the best architecture. As another example, +on one hand, a transportation architecture requiring no docking at Mars orbit +or upon return to Earth inherently mitigates hazards associated with collisions or +failed rendezvous and docking maneuvers. On the other hand, having the capability +to dock during an emergency, even though it is not required during nominal operations, provides additional mitigation potential for loss of life support, especially in +Earth orbit. +Reducing these considerations to a number is clearly not ideal, but with hundreds +of potential architectures it was necessary in this case in order to pare down the +choices to a smaller number. More careful tradeoff analysis is then possible on the +reduced set of choices. +While mitigatibility is widely applicable as a surrogate for likelihood in many +types of domains, the actual process used above is just one example of how it might +be used. Engineers will need to adapt the scales and other features of the process +to the customary practices in their own industry. Other types of surrogates or ways +to handle likelihood estimates in early phases of projects are possible beyond the +two examples provided in this section. While none of these approaches is ideal, they +are much better than ignoring safety in decision making or selecting likelihood +estimates based solely on wishful thinking or the politics that often surround the +preliminary hazard analysis process. +After a conceptual design is chosen, development begins. + +section 10.3.5. Documenting Environmental Assumptions. +An important part of the system development process is to determine and document +the assumptions under which the system requirements and design features are +derived and upon which the hazard analysis is based. Assumptions will be identified +and specified throughout the system engineering process and the engineering specifications to explain decisions or to record fundamental information upon which the +design is based. 
If the assumptions change over time, or the system changes and the assumptions are no longer true, then the requirements and the safety constraints and design features based on those assumptions need to be revisited to ensure safety has not been compromised by the change.

Because operational safety depends on the accuracy of the assumptions and models underlying the design and hazard analysis processes, the operational system should be monitored to ensure that:

1. The system is constructed, operated, and maintained in the manner assumed by the designers.

2. The models and assumptions used during initial decision making and design are correct.

3. The models and assumptions are not violated by changes in the system, such as workarounds or unauthorized changes in procedures, or by changes in the environment.

Operational feedback on trends, incidents, and accidents should trigger reanalysis when appropriate. Linking the assumptions throughout the document with the parts of the hazard analysis based on each assumption will assist in performing safety maintenance activities.

Several types of assumptions are relevant. One is the assumptions under which the system will be used and the environment in which the system will operate. Not only will these assumptions play an important role in system development, but they also provide part of the basis for creating the operational safety control structure and other operational safety controls, such as creating feedback loops to ensure the assumptions underlying the system design and the safety analyses are not violated during operations as the system and its environment change over time.

While many of the assumptions that originate in the existing environment into which the new system will be integrated can be identified at the beginning of development, additional assumptions will be identified as the design process continues and new requirements and design decisions and features are identified. In addition, assumptions that the emerging system design imposes on the surrounding environment will become clear only after detailed decisions are made in the design and safety analyses.

Examples of important environment assumptions for TCAS 2 are that:

EA1. High-integrity communications exist between aircraft.

EA2. The TCAS-equipped aircraft carries a Mode-S air traffic control transponder.

EA3. All aircraft have operating transponders.

EA4. All aircraft have legal identification numbers.

EA5. Altitude information is available from intruding targets with a minimum precision of 100 feet.

EA6. The altimetry system that provides own aircraft pressure altitude to the TCAS equipment will satisfy the requirements in RTCA Standard . . .

EA7. Threat aircraft will not make an abrupt maneuver that thwarts the TCAS escape maneuver.

footnote. An aircraft transponder sends information to help air traffic control maintain aircraft separation. Primary radar generally provides bearing and range position information, but lacks altitude information. Mode A transponders transmit only an identification signal, while Mode C and Mode S transponders also report pressure altitude. Mode S is newer and has more capabilities than Mode C, some of which are required for the collision avoidance functions in TCAS.

As noted, these assumptions must be enforced in the overall safety control structure.
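One minimal sketch of how such environmental assumptions might be recorded, so that each is tied to the part of the control structure expected to enforce it and to an operational audit procedure, is shown below. The field names, audit identifiers, and enforcement assignments are illustrative only, not part of the TCAS specification.

from dataclasses import dataclass

@dataclass
class EnvironmentalAssumption:
    ident: str         # e.g., "EA3"
    text: str          # the assumption itself
    enforced_by: str   # the part of the control structure expected to enforce it
    audit_link: str    # pointer to the operational audit procedure (illustrative)

assumptions = [
    EnvironmentalAssumption("EA3", "All aircraft have operating transponders.",
                            "national airspace rules / regulator", "audit-EA3"),
    EnvironmentalAssumption("EA4", "All aircraft have legal identification numbers.",
                            "aviation authorities and international agreement", "audit-EA4"),
]

def enforced_by(owner_fragment):
    # Query: which assumptions has a given organization been assigned to enforce?
    return [a.ident for a in assumptions if owner_fragment in a.enforced_by]

print(enforced_by("aviation authorities"))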
With respect to assumption EA4, for example, identification numbers are usually provided by the aviation authorities in each country, and that requirement will need to be ensured by international agreement or by some international agency. The assumption that aircraft have operating transponders (EA3) may be enforced by the airspace rules in a particular country and, again, must be ensured by some group. Clearly, these assumptions play an important role in the construction of the safety control structure and the assignments of responsibilities for the final system. For TCAS, some of these assumptions will already be imposed by the existing air transportation safety control structure, while others may need to be added to the responsibilities of some group(s) in the control structure. The last assumption, EA7, imposes constraints on pilots and the air traffic control system.

Environment requirements and constraints may lead to restrictions on the use of the new system (in this case, TCAS) or may indicate the need for system safety and other analyses to determine the constraints that must be imposed on the system being created (TCAS again) or the larger encompassing system to ensure safety. The requirements for the integration of the new subsystem safely into the larger system must be determined early. Examples for TCAS include:

E1. The behavior or interaction of non-TCAS equipment with TCAS must not degrade the performance of the TCAS equipment or the performance of the equipment with which TCAS interacts.

E2. Among the aircraft environmental alerts, the hierarchy shall be: Windshear has first priority, then the Ground Proximity Warning System (GPWS), then TCAS.

E3. The TCAS alerts and advisories must be independent of those using the master caution and warning system.

section 10.3.6. System-Level Requirements Generation.

Once the goals and hazards have been identified and a conceptual system architecture has been selected, system-level requirements generation can begin. Usually, in the early stages of a project, goals are stated in very general terms, as shown in G1 and G2. One of the first steps in the design process is to refine the goals into testable and achievable high-level requirements (the "shall" statements). Examples of high-level functional requirements implementing the goals for TCAS are:

1.18. TCAS shall provide collision avoidance protection for any two aircraft closing horizontally at any rate up to 1200 knots and vertically up to 10,000 feet per minute.

Assumption. This requirement is derived from the assumption that commercial aircraft can operate up to 600 knots and 5000 fpm during vertical climb or controlled descent (and therefore two planes can close horizontally up to 1200 knots and vertically up to 10,000 fpm).

1.19.1. TCAS shall operate in enroute and terminal areas with traffic densities up to 0.3 aircraft per square nautical mile (i.e., 24 aircraft within 5 nmi).

Assumption. Traffic density may increase to this level by 1990, and this will be the maximum density over the next 20 years.

As stated earlier, assumptions should continue to be specified when appropriate to explain a decision or to record fundamental information on which the design is based. Assumptions are an important component of the documentation of design rationale and form the basis for safety audits during operations. Consider the above requirement labeled 1.18, for example.
In the future, if aircraft performance limits change or there are proposed changes in airspace management, the origin of the specific numbers in the requirement (1200 and 10,000) can be determined and evaluated for their continued relevance. In the absence of the documentation of such assumptions and how they impact the detailed design decisions, numbers tend to become "gospel," and everyone is afraid to change them.

Requirements (and constraints) must also be included for the human operator and for the human–computer interface. These requirements will in part be derived from the concept of operations, which should in turn include a human task analysis to determine how TCAS is expected to be used by pilots (which, again, should be checked in safety audits during operations). These analyses use information about the goals of the system, the constraints on how the goals are achieved, including safety constraints, how the automation will be used, how humans now control the system and work in the system without automation, and the tasks humans need to perform and how the automation will support them in performing these tasks. The task analysis must also consider workload and its impact on operator performance. Note that a low workload may be more dangerous than a high one.

Requirements on the operator (in this case, the pilot) are used to guide the design of the TCAS-pilot interface, the design of the automation logic, flight-crew tasks and procedures, aircraft flight manuals, and training plans and programs. Traceability links should be provided to show the relationships. Links should also be provided to the parts of the hazard analysis from which safety-related requirements are derived. Examples of TCAS 2 operator safety requirements and constraints are:

OP.4. After the threat is resolved, the pilot shall return promptly and smoothly to his/her previously assigned flight path (→ HA-560, ↓3.3).

OP.9. The pilot must not maneuver on the basis of a Traffic Advisory only (→ HA-630, ↓2.71.3).

The requirements and constraints include links to the hazard analysis that produced the information and to design documents and decisions to show where the requirements are applied. These two examples have links to the parts of the hazard analysis from which they were derived, links to the system design and operator procedures where they are enforced, and links to the user manuals (in this case, the pilot manuals) to explain why certain activities or behaviors are required.

The links not only provide traceability from requirements to implementation and vice versa to assist in review activities, but they also embed the design rationale information into the specification. If changes need to be made to the system, it is easy to follow the links and determine why and how particular design decisions were made.

section 10.3.7. Identifying High-Level Design and Safety Constraints.

Design constraints are restrictions on how the system can achieve its purpose. For example, TCAS is not allowed to interfere with the ground-based air traffic control system while it is trying to maintain adequate separation between aircraft. Avoiding interference is not a goal or purpose of TCAS: the best way to achieve that goal would be not to build the system at all. It is instead a constraint on how the system can achieve its purpose, that is, a constraint on the potential system designs.
Because of the need to evaluate and clarify tradeoffs among alternative designs, separating these two types of intent information (goals and design constraints) is important.

For safety-critical systems, constraints should be further separated into safety-related and non-safety-related. One nonsafety constraint identified for TCAS, for example, was that requirements for new hardware and equipment on the aircraft be minimized or the airlines would not be able to afford this new collision avoidance system. Examples of nonsafety constraints for TCAS 2 are:

C.1. The system must use the transponders routinely carried by aircraft for ground ATC purposes (↓2.3, 2.6).

Rationale. To be acceptable to airlines, TCAS must minimize the amount of new hardware needed.

C.4. TCAS must comply with all applicable FAA and FCC policies, rules, and philosophies (↓2.30, 2.79).

The physical environment with which TCAS interacts is shown in figure 10.9. The constraints imposed by these existing environmental components must also be identified before system design can begin.

Safety-related constraints should have two-way links to the system hazard log and to any analysis results that led to that constraint being identified, as well as links to the design features (usually level 2) included to eliminate or control them. Hazard analyses are linked to level 1 requirements and constraints, to design features on level 2, and to system limitations (or accepted risks). An example of a level 1 safety constraint derived to prevent hazards is:

SC.3. TCAS must generate advisories that require as little deviation as possible from ATC clearances (→ H6, HA-550, ↓2.30).

The link in SC.3 to 2.30 points to the level 2 system design feature that implements this safety constraint. The other links provide traceability to the hazard (H6) from which the constraint was derived and to the parts of the hazard analysis involved, in this case the part of the hazard analysis labeled HA-550.

The following is another example of a safety constraint for TCAS 2 and some constraints refined from it, all of which stem from a high-level environmental constraint derived from safety considerations in the encompassing system into which TCAS will be integrated. The refinement will occur as safety-related decisions are made and guided by an STPA hazard analysis.

SC.2. TCAS must not interfere with the ground ATC system or other aircraft transmissions to the ground ATC system (→ H5).

SC.2.1. The system design must limit interference with ground-based secondary surveillance radar, distance-measuring equipment channels, and with other radio services that operate in the 1030/1090 MHz frequency band (↓2.5.1).

SC.2.1.1. The design of the Mode S waveforms used by TCAS must provide compatibility with Modes A and C of the ground-based secondary surveillance radar system (↓2.6).

SC.2.1.2. The frequency spectrum of Mode S transmissions must be controlled to protect adjacent distance-measuring equipment channels (↓2.13).

SC.2.1.3. The design must ensure electromagnetic compatibility between TCAS and . . . (↓2.14).

SC.2.2. Multiple TCAS units within detection range of one another (approximately 30 nmi) must be designed to limit their own transmissions. As the number of such TCAS units within this region increases, the interrogation rate and power allocation for each of them must decrease in order to prevent undesired interference with ATC (↓2.13).

Assumptions are also associated with safety constraints.
As an example of such an assumption, consider:

SC.6. TCAS must not disrupt the pilot and ATC operations during critical phases of flight nor disrupt aircraft operation (→ H3, ↓2.2.3, 2.19, 2.24.2).

SC.6.1. The pilot of a TCAS-equipped aircraft must have the option to switch to the Traffic-Advisory-Only mode where TAs are displayed but display of resolution advisories is inhibited (↓ 2.2.3).

Assumption. This feature will be used during final approach to parallel runways, when two aircraft are projected to come close to each other and TCAS would call for an evasive maneuver (↓ 6.17).

The specified assumption is critical for evaluating safety during operations. Humans tend to change their behavior over time and use automation in different ways than originally intended by the designers. Sometimes, these new uses are dangerous. The hyperlink at the end of the assumption (↓ 6.17) points to the required auditing procedures for safety during operations and to where the procedures for auditing this assumption are specified.

Where do these safety constraints come from? Is the system engineer required to simply make them up? While domain knowledge and expertise are always going to be required, there are procedures that can be used to guide this process.

The highest-level safety constraints come directly from the identified hazards for the system. For example, TCAS must not cause or contribute to a near miss (H1), TCAS must not cause or contribute to a controlled maneuver into the ground (H2), and TCAS must not interfere with the ground-based ATC system. STPA can be used to refine these high-level design constraints into more detailed design constraints as described in chapter 8.

The first step in STPA is to create the high-level TCAS operational safety control structure. For TCAS, this structure is shown in figure 10.10. For simplicity, much of the structure above ATC operations management has been omitted, and the roles and responsibilities have been simplified here. In a real design project, roles and responsibilities will be augmented and refined as development proceeds, analyses are performed, and design decisions are made. Early in the system concept formation, specific roles may not all have been determined, and more will be added as the design concepts are refined. One thing to note is that there are three groups with potential responsibilities over the pilot's response to a potential NMAC: TCAS, the ground ATC, and the airline operations center, which provides the airline procedures for responding to TCAS alerts. Clearly any potential conflicts and coordination problems between these three controllers will need to be resolved in the overall air traffic management system design. In the case of TCAS, the designers decided that, because there was no practical way at that time to downlink information to the ground controllers about any TCAS advisories that might have been issued for the crew, the pilot was to immediately implement the TCAS advisory and the co-pilot would transmit the TCAS alert information by radio to ground ATC. The airline would provide the appropriate procedures and training to implement this protocol.

Part of defining this control structure involves identifying the responsibilities of each of the components related to the goal of the system, in this case collision avoidance. For TCAS, these responsibilities include:
1. Aircraft Components (e.g., transponders, antennas). Execute pilot-generated TCAS control maneuvers, read and send messages to and from other aircraft, etc.

2. TCAS. Receive information about its own and other aircraft, analyze the information received, and provide the pilot with (1) information about where other aircraft in the vicinity are located and (2) an escape maneuver to avoid potential NMAC threats.

3. Pilot. Maintain separation between own and other aircraft, monitor the TCAS displays, and implement TCAS escape maneuvers. The pilot must also follow ATC advisories.

4. Air Traffic Control. Maintain separation between aircraft in the controlled airspace by providing advisories (control actions) for the pilot to follow. TCAS is designed to be independent of and a backup for the air traffic controller, so ATC does not have a direct role in the TCAS safety control structure but clearly has an indirect one.

5. Airline Operations Management. Provide procedures for using TCAS and following TCAS advisories, train pilots, and audit pilot performance.

6. ATC Operations Management. Provide procedures, train controllers, and audit the performance of controllers and of the overall collision avoidance system.

7. ICAO. Provide worldwide procedures and policies for the use of TCAS and provide oversight that each country is implementing them.

After the general control structure has been defined (or alternative candidate control structures identified), the next step is to determine how the controlled system (the two aircraft) can get into a hazardous state. That information will be used to generate safety constraints for the designers. STAMP assumes that hazardous states (states that violate the safety constraints) are the result of ineffective control. Step 1 of STPA is to identify the potentially inadequate control actions.

Control actions in TCAS are called resolution advisories or RAs. An RA is an aircraft escape maneuver created by TCAS for the pilots to follow. Example resolution advisories are descend, increase rate of climb to 2500 fpm, and don't descend. Consider the TCAS component of the control structure (see figure 10.10) and the NMAC hazard. The four types of control flaws for this example translate into:

1. The aircraft are on a near collision course, and TCAS does not provide an RA that avoids it (that is, does not provide an RA, or provides an RA that does not avoid the NMAC).

2. The aircraft are in close proximity and TCAS provides an RA that degrades vertical separation (causes an NMAC).

3. The aircraft are on a near collision course and TCAS provides a maneuver too late to avoid an NMAC.

4. TCAS removes an RA too soon.

These inadequate control actions can be restated as high-level constraints on the behavior of TCAS:

1. TCAS must provide resolution advisories that avoid near midair collisions.

2. TCAS must not provide resolution advisories that degrade vertical separation between two aircraft (that is, cause an NMAC).

3. TCAS must provide the resolution advisory while enough time remains for the pilot to avoid an NMAC. (A human factors and aerodynamic analysis should be performed at this point to determine exactly how much time that implies.)

4. TCAS must not remove the resolution advisory before the NMAC is resolved.

Similarly, for the pilot, the inadequate control actions are:

1.
The pilot does not provide a control action to avoid a near midair collision.

2. The pilot provides a control action that does not avoid the NMAC.

3. The pilot provides a control action that causes an NMAC that would not otherwise have occurred.

4. The pilot provides a control action that could have avoided the NMAC, but it was too late.

5. The pilot starts a control action to avoid an NMAC but stops it too soon.

Again, these inadequate pilot control actions can be restated as safety constraints that can be used to generate pilot procedures. Similar hazardous control actions and constraints must be identified for each of the other system components. In addition, inadequate control actions must be identified for the other functions provided by TCAS (beyond RAs), such as traffic advisories.

Once the high-level design constraints have been identified, they must be refined into more detailed design constraints to guide the system design and then augmented with new constraints as design decisions are made, creating a seamless, integrated, and iterative process of system design and hazard analysis.

Refinement of the constraints involves determining how they could be violated. The refined constraints will be used to guide attempts to eliminate or control the hazards in the system design or, if that is not possible, to prevent or control them in the system or component design. This process of scenario development is exactly the goal of hazard analysis and STPA. As an example of how the results of the analysis are used to refine the high-level safety constraints, consider the second high-level TCAS constraint: that TCAS must not provide resolution advisories that degrade vertical separation between two aircraft (cause an NMAC).

SC.7. TCAS must not create near misses (result in a hazardous level of vertical separation that would not have occurred had the aircraft not carried TCAS).

SC.7.1. Crossing maneuvers must be avoided if possible.

SC.7.2. The reversal of a displayed advisory must be extremely rare.

SC.7.3. TCAS must not reverse an advisory if the pilot will have insufficient time to respond to the RA before the closest point of approach (four seconds or less) or if own and intruder aircraft are separated by less than 200 feet vertically when ten seconds or less remain to closest point of approach.

Note again that pointers are used to trace these constraints into the design features used to implement them.

footnote. This requirement is clearly vague and untestable. Unfortunately, I could find no definition of "extremely rare" in any of the TCAS documentation to which I had access.

section 10.3.8. System Design and Analysis.

Once the basic requirements and design constraints have been at least partially specified, the system design features that will be used to implement them must be created. A strict top-down design process is, of course, not usually feasible. As design decisions are made and the system behavior becomes better understood, additions and changes will likely be made in the requirements and constraints. The specification of assumptions and the inclusion of traceability links will assist in this process and in ensuring that safety is not compromised by later decisions and changes. It is surprising how quickly the rationale behind the decisions that were made earlier is forgotten.

Once the system design features are determined, (1)
an internal control structure for the system itself is constructed along with the interfaces between the components, and (2) functional requirements and design constraints, derived from the system-level requirements and constraints, are allocated to the individual system components.

System Design

What has been presented so far in this chapter would appear in level 1 of an intent specification. The second level of an intent specification contains System Design Principles: the basic system design and scientific and engineering principles needed to achieve the behavior specified in the top level, as well as any derived requirements and design features not related to the level 1 requirements.

While traditional design processes can be used, STAMP and STPA provide the potential for safety-driven design. In safety-driven design, the refinement of the high-level hazard analysis is intertwined with the refinement of the system design to guide the development of the system design and system architecture. STPA can be used to generate safe design alternatives or applied to the design alternatives generated in some other way to continually evaluate safety as the design progresses and to assist in eliminating or controlling hazards in the emerging design, as described in chapter 9.

For TCAS, this level of the intent specification includes such general principles as the basic tau concept, which is related to all the high-level alerting goals and constraints.

2.2. Each TCAS-equipped aircraft is surrounded by a protected volume of airspace. The boundaries of this volume are shaped by the tau and DMOD criteria.

2.2.1. TAU. In collision avoidance, time-to-go to the closest point of approach (CPA) is more important than distance-to-go to the CPA. Tau is an approximation of the time in seconds to CPA. Tau equals 3600 times the slant range in nmi, divided by the closing speed in knots.

2.2.2. DMOD. If the rate of closure is very low, a target could slip in very close without crossing the tau boundaries and triggering an advisory. In order to provide added protection against a possible maneuver or speed change by either aircraft, the tau boundaries are modified (called DMOD). DMOD varies depending on own aircraft's altitude regime.

The principles are linked to the related higher-level requirements, constraints, assumptions, limitations, and hazard analysis as well as to lower-level system design and documentation and to other information at the same level. Assumptions used in the formulation of the design principles should also be specified at this level.

For example, design principle 2.51 (related to safety constraint SC.7.2 shown in the previous section) describes how sense reversals are handled.

2.51. Sense Reversals (↓Reversal-Provides-More-Separation). In most encounter situations, the sense of the resolution advisory will be maintained for the duration of an encounter with a threat aircraft. However, under certain circumstances, it may be necessary for that sense to be reversed. For example, a conflict between two TCAS-equipped aircraft will, with very high probability, result in selection of complementary advisory senses because of the coordination protocol between the two aircraft. However, if coordination communication between the two aircraft is disrupted at a critical time of sense selection, both aircraft may choose their advisories independently (↑HA-130). This could possibly result in selection of incompatible senses.

footnote.
The sense is the direction of the advisory, such as descend or climb.

2.51.1. . . . information about how incompatibilities are handled.

Design principle 2.51 describes the conditions under which reversals of TCAS advisories can result in incompatible senses and lead to the creation of a hazard by TCAS. The pointer labeled HA-395 points to the part of the hazard analysis analyzing that problem. The hazard analysis portion labeled HA-395 would have a complementary pointer to section 2.51. The design decisions made to handle such incompatibilities are described in 2.51.1, but that part of the specification is omitted here. 2.51 also contains a hyperlink (↓Reversal-Provides-More-Separation) to the detailed functional level 3 logic (component black-box requirements specification) used to implement the design decision.

Information about the allocation of these design decisions to individual system components and the logic involved is located in level 3, which in turn has links to the implementation of the logic in lower levels. If a change has to be made to a system component (such as a change to a software module), it is possible to trace the function computed by that module upward in the intent specification levels to determine whether the module is safety critical and if (and how) the change might affect system safety.

As another example, the TCAS design has a built-in bias against generating advisories that would result in the aircraft crossing paths (called altitude crossing advisories).

2.36.2. A bias against altitude crossing RAs is also used in situations involving intruder level-offs at least 600 feet above or below the TCAS aircraft. In such a situation, an altitude-crossing advisory is deferred if an intruder aircraft that is projected to cross own aircraft's altitude is more than 600 feet away vertically.

Assumption. In most cases, the intruder will begin a level-off maneuver when it is more than 600 feet away and so should have a greatly reduced vertical rate by the time it is within 200 feet of its altitude clearance (thereby either not requiring an RA if it levels off more than ZTHR feet away or requiring a non-crossing advisory for level-offs begun after ZTHR is crossed but before the 600 foot threshold is reached).

footnote. The vertical dimension, called ZTHR, used to determine whether advisories should be issued varies from 750 to 950 feet, depending on the TCAS aircraft's altitude.

Again, the example above includes a pointer down to the part of the black-box component requirements (functional) specification (Alt_Separation_Test) that embodies the design principle. Links could also be provided to detailed mathematical analyses used to support and validate the design decisions.

As another example of using links to embed design rationale in the specification and of specifying limitations (defined later) and potential hazardous behavior that could not be controlled in the design, consider the following. TCAS 2 advisories may need to be inhibited because of inadequate climb performance for the particular aircraft on which TCAS is installed. The collision avoidance maneuvers posted as advisories (called RAs or resolution advisories) by TCAS assume an aircraft's ability to safely achieve them. If it is likely they are beyond the capability of the aircraft, then TCAS must know beforehand so it can change its strategy and issue an alternative advisory.
The performance characteristics are provided to TCAS through the aircraft interface (via what are called aircraft discretes). In some cases, no feasible solutions to the problem could be found. An example design principle related to this problem found at level 2 of the TCAS intent specification is:

2.39. Because of the limited number of inputs to TCAS for aircraft performance inhibits, in some instances where inhibiting RAs would be appropriate it is not possible to do so (↑L6). In these cases, TCAS may command maneuvers that may significantly reduce stall margins or result in stall warning (↑SC.9.1). Conditions where this may occur include . . . The aircraft flight manual or flight manual supplement should provide information concerning this aspect of TCAS so that flight crews may take appropriate action (↓ pointers to pilot procedures on level 3 and the Aircraft Flight Manual on level 6).

Finally, design principles may reflect tradeoffs between higher-level goals and constraints. As examples:

2.2.3. Tradeoffs must be made between necessary protection (↑1.18) and unnecessary advisories (↑SC.5, SC.6). This is accomplished by controlling the sensitivity level, which controls the tau and therefore the dimensions of the protected airspace around each TCAS-equipped aircraft. The greater the sensitivity level, the more protection is provided, but the higher is the incidence of unnecessary alerts. Sensitivity level is determined by . . .

2.38. The need to inhibit climb RAs because of inadequate aircraft climb performance will increase the likelihood of TCAS 2 (a) issuing crossing maneuvers, which in turn increases the possibility that an RA may be thwarted by the intruder maneuvering (↑SC.7.1, HA-115), (b) causing an increase in descend RAs at low altitude (↑SC.8.1), and (c) providing no RAs if below the descend inhibit level (1200 feet above ground level on takeoff and 1000 feet above ground level on approach).

Architectural Design, Functional Allocation, and Component Implementation (Level 3)

Once the general system design concepts are agreed upon, the next step usually involves developing the design architecture and allocating behavioral requirements and constraints to the subsystems and components. Once again, two-way tracing should exist between the component requirements and the system design principles and requirements. These links will be available to the subsystem developers to be used in their implementation and development activities and in verification (testing and reviews). Finally, during field testing and operations, the links and recorded assumptions and design rationale can be used in safety change analysis, incident and accident analysis, periodic audits, and performance monitoring as required to ensure that the operational system is and remains safe.

Level 3 of an intent specification contains the system architecture, that is, the allocation of functions to components and the designed communication paths among those components (including human operators). At this point, a black-box functional requirements specification language becomes useful, particularly a formal language that is executable. SpecTRM-RL is used as the example specification language in this section. An early version of the language was developed in 1990 to specify the requirements for TCAS 2 and has been refined and improved since that time.
SpecTRM-RL is part of a larger specification management system
called SpecTRM .(Specification Tools and Requirements Methodology). Other
languages, of course, can be used.
One of the first steps in low-level architectural design is to break the system into
a set of components. For TCAS, only three components were used. surveillance,
collision avoidance, and performance monitoring.
The environment description at level 3 includes the assumed behavior of the
external components .(such as the altimeters and transponders for TCAS), including
perhaps failure behavior, upon which the correctness of the system design is predicated, along with a description of the interfaces between the TCAS system
and its environment. Figure 10.11 shows part of a SpecTRM-RL description of an
environment component, in this case an altimeter.


What is treated as part of the environment is somewhat arbitrary and can be chosen as convenient for the purposes of the specifier. In this example, the environment includes
any component that was already on the aircraft or in the airspace control system
and was not newly designed or built as part of the TCAS effort.
All communications between the system and external components need to be
described in detail, including the designed interfaces. The black-box behavior of
each component also needs to be specified. This specification serves as the functional requirements for the components. What is included in the component specification will depend on whether the component is part of the environment or part
of the system being constructed. Figure 10.12 shows part of the SpecTRM-RL
description of the behavior of the CAS .(collision avoidance system). subcomponent.
SpecTRM-RL specifications are intended to be both easily readable with minimum
instruction and formally analyzable. They are also executable and can be used in a


system simulation environment. Readability was a primary goal in the design of
SpecTRM-RL, as was completeness with regard to safety. Most of the requirements
completeness criteria described in Safeware and rewritten as functional design principles in chapter 9 of this book are included in the syntax of the language to assist
in system safety reviews of the requirements.
SpecTRM-RL explicitly shows the process model used by the controller and
describes the required behavior in terms of this model. A state machine model is used
to describe the system component’s process model, in this case the state of the aircraft and the airspace around it, and the ways the process model can change state.
Logical behavior is specified in SpecTRM-RL using and/or tables. Figure 10.12
shows a small part of the specification of the TCAS collision avoidance logic. For
TCAS, an important state variable is the status of the other aircraft around the
TCAS aircraft, called intruders. Intruders are classified into four groups. Other
Traffic, Proximate Traffic, Potential Threat, and Threat. The figure shows the logic
for classifying an intruder as Other Traffic using an and/or table. The information
in the tables can be visualized in additional ways.
The rows of the table represent and relationships, while the columns represent
or. The state variable takes the specified value .(in this case, Other Traffic). if any of
the columns evaluate to true. A column evaluates to true if all the rows have the
value specified for that row in the column. A dot in the table indicates that the value
for the row is irrelevant. Underlined variables represent hyperlinks. For example,
clicking on “Alt Reporting” would show how the Alt Reporting variable is defined.
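To make the table semantics concrete, the following short sketch (in Python, which is not the notation used by SpecTRM-RL itself) evaluates an and/or table represented as a list of columns, where each column maps row variables to required values and a don't-care value plays the role of the dot. The row names are taken from the discussion above, but the conditions shown are simplified placeholders rather than the actual Other Traffic logic.

# A sketch of and/or table evaluation: each column is a conjunction (and) of
# required row values, and the table as a whole is the disjunction (or) of its
# columns. DONT_CARE corresponds to a dot in the table.

DONT_CARE = None

def column_is_true(column, current_values):
    # A column evaluates to true if every row is either a don't-care or
    # currently has the value specified for that row in the column.
    return all(required is DONT_CARE or current_values.get(row) == required
               for row, required in column.items())

def table_is_true(columns, current_values):
    # The state variable takes the table's value if any column is true.
    return any(column_is_true(column, current_values) for column in columns)

# Illustrative columns only; the real Other Traffic table has different conditions.
other_traffic_table = [
    {"Alt Reporting": "Lost", "Bearing Valid": DONT_CARE, "Range Valid": True},
    {"Alt Reporting": DONT_CARE, "Proximate Traffic Condition": False},
]

current_values = {"Alt Reporting": "Lost", "Range Valid": True}
print(table_is_true(other_traffic_table, current_values))  # prints True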
In our TCAS intent specification, the altitude report for an aircraft is defined
as Lost if no valid altitude report has been received in the past six seconds. Bearing
Valid, Range Valid, Proximate Traffic Condition, and Proximate Threat Condition
are macros, which simply means that they are defined using separate logic tables.
The additional logic for the macros could have been inserted here, but sometimes
the logic gets very complex and it is easier for specifiers and reviewers if, in those
cases, the tables are broken up into smaller pieces .(a form of refinement abstraction). This decision is, of course, up to the creator of the table.
The behavioral descriptions at this level are purely black-box. They describe the
inputs and outputs of each component and their relationships only in terms of
externally visible behavior. Essentially, each description represents the transfer function across the
component. Any of these components .(except the humans, of course). could be
implemented either in hardware or software. Some of the TCAS surveillance

functions are, in fact, implemented using analog devices by some vendors and digital
by others. Decisions about physical implementation, software design, internal variables, and so on are limited to levels of the specification below this one. Thus, this
level serves as a rugged interface between the system designers and the component
designers and implementers .(including subcontractors).
Software need not be treated any differently than the other parts of the system.
Most safety-related software problems stem from requirements flaws. The system
requirements and system hazard analysis should be used to determine the behavioral safety constraints that must be enforced on software behavior and that the
software must enforce on the controlled system. Once that is accomplished, those
requirements and constraints are passed to the software developers .(through the
black-box requirements specifications), and they use them to generate and validate
their designs just as the hardware developers do.
Other information at this level might include flight crew requirements such as
description of tasks and operational procedures, interface requirements, and the
testing requirements for the functionality described on this level. If the black-box
requirements specification is executable, system testing can be performed early to
validate requirements using system and environment simulators or hardware-in-the-loop simulation. Including a visual operator task-modeling language permits
integrated simulation and analysis of the entire system, including human–computer
interactions.
Models at this level are reusable, and we have found that these models provide the
best place to support component reuse and build component libraries. Reuse
of application software at the code level has been problematic at best, contributing
to a surprising number of accidents. Level 3 black-box behavioral specifications
provide a way to make the changes almost always necessary to reuse software in a
format that is both reviewable and verifiable. In addition, the black-box models can
be used to maintain the system and to specify and validate changes before they are
made in the various manufacturers’ products. Once the changed level 3 specifications
have been validated, the links to the modules implementing the modeled behavior
can be used to determine which modules need to be changed and how.
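The change-analysis use of these links can be pictured with a small sketch. The identifiers are drawn from the TCAS examples earlier in this section (2.36.2 and Alt_Separation_Test), except for the module name, which is invented for illustration; the two entries shown stand in for what would be a specification-wide link table.

# A sketch of two-way traceability between intent-specification levels.
# Each entry maps an element to the lower-level elements derived from it;
# the reverse direction is recovered by searching the same table.

links = {
    ("level 2", "2.36.2"): [("level 3", "Alt_Separation_Test")],
    # The module name below is hypothetical, standing in for whatever
    # level 4/5 code implements the level 3 logic.
    ("level 3", "Alt_Separation_Test"): [("level 4", "crossing_bias_module")],
}

def implemented_by(element):
    # Downward: what must be re-verified if this element changes.
    return links.get(element, [])

def rationale_for(element):
    # Upward: which higher-level decisions this element was derived from.
    return [parent for parent, children in links.items() if element in children]

changed = ("level 3", "Alt_Separation_Test")
print("re-examine:", implemented_by(changed))   # the implementing module(s)
print("derived from:", rationale_for(changed))  # the level 2 design principle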
Libraries of
component models can also be developed and used in a plug-and-play fashion,
making changes as required, in order to develop product families.
The rest of the development process, involving the implementation of the component requirements and constraints and documented at levels 4 and 5 of intent
specifications, is straightforward and differs little from what is normally done today.



footnote. A SpecTRM-RL model of TCAS was created by the author and her students Jon Reese, Mats Heimdahl, and Holly Hildreth to assist in the certification of TCAS 2. Later, as an experiment to show the
feasibility of creating intent specifications, the author created the level 1 and level 2 intent specification
for TCAS. Jon Reese rewrote the level 3 collision avoidance system logic from the early version of the
language into SpecTRM-RL.



section 10.3.9. Documenting System Limitations.
When the system is completed, the system limitations need to be identified and
documented. Some of the identification will, of course, be done throughout the

development. This information is used by management and stakeholders to determine whether the system is adequately safe to use, along with information about
each of the identified hazards and how they were handled.
Limitations should be included in level 1 of the intent specification, because they
properly belong in the customer view of the system and will affect both acceptance
and certification.
Some limitations may be related to the basic functional requirements, such as
these.
L4. TCAS does not currently indicate horizontal escape maneuvers and therefore
does not .(and is not intended to). increase horizontal separation.
Limitations may also relate to environment assumptions. For example.
L1. TCAS provides no protection against aircraft without transponders or with
nonoperational transponders .(→EA3, HA-430).
L6. Aircraft performance limitations constrain the magnitude of the escape
maneuver that the flight crew can safely execute in response to a resolution
advisory. It is possible for these limitations to preclude a successful resolution
of the conflict .(→H3, ↓2.38, 2.39).
L4. TCAS is dependent on the accuracy of the threat aircraft’s reported altitude.
Separation assurance may be degraded by errors in intruder pressure altitude
as reported by the transponder of the intruder aircraft .(→EA5).
Assumption. This limitation holds for the airspace existing at the time of the
initial TCAS deployment, where many aircraft use pressure altimeters rather
than GPS. As more aircraft install GPS systems with greater accuracy than
current pressure altimeters, this limitation will be reduced or eliminated.
Limitations are often associated with hazards or hazard causal factors that could
not be completely eliminated or controlled in the design. Thus they represent
accepted risks. For example.
L3. TCAS will not issue an advisory if it is turned on or enabled to issue resolution
advisories in the middle of a conflict .(→ HA-405).
L5. If only one of two aircraft is TCAS equipped while the other has only ATCRBS
altitude-reporting capability, the assurance of safe separation may be reduced
.(→ HA-290).
In the specification, both of these system limitations would have pointers to the
relevant parts of the hazard analysis along with an explanation of why they could
not be eliminated or adequately controlled in the system design.
Decisions about
deployment and certification of the system will need to be based partially on these


limitations and their impact on the safety analysis and safety assumptions of the
encompassing system, which, in the case of TCAS, is the overall air traffic system.
A final type of limitation is related to problems encountered or tradeoffs made
during system design. For example, TCAS has a high-level performance-monitoring
requirement that led to the inclusion of a self-test function in the system design to
determine whether TCAS is operating correctly. The following system limitation
relates to this self-test facility.
L9. Use by the pilot of the self-test function in flight will inhibit TCAS operation
for up to 20 seconds depending upon the number of targets being tracked. The
ATC transponder will not function during some portion of the self-test sequence
(↓6.52).
These limitations should be linked to the relevant parts of the development and,
most important, operational specifications. For example, L9 may be linked to the
pilot operations manual.

section 10.3.10. System Certification, Maintenance, and Evolution.
At this point in development, the safety requirements and constraints are documented and traced to the design features used to implement them. A hazard log
contains the hazard information .(or links to it). generated during the development
process and the results of the hazard analysis performed. The log will contain
embedded links to the resolution of each hazard, such as functional requirements,
design constraints, system design features, operational procedures, and system limitations. The information documented should be easy to collect into a form that can
be used for the final safety assessment and certification of the system.
Whenever changes are made in safety-critical systems or software .(during development or during maintenance and evolution), the safety of the change needs to be
reevaluated. This process can be difficult and expensive if it has to start from scratch
each time. By providing links throughout the specification, it should be easy to assess
whether a particular design decision or piece of code was based on the original
safety analysis or safety-related design constraint and only that part of the safety
analysis process repeated or reevaluated. \ No newline at end of file diff --git a/chapter11.raw b/chapter11.raw new file mode 100644 index 0000000..6cac438 --- /dev/null +++ b/chapter11.raw @@ -0,0 +1,1355 @@ +Chapter 11.
+Analyzing Accidents and Incidents (CAST).
+The causality model used in accident or incident analysis determines what we look
+for, how we go about looking for “facts,” and what we see as relevant. In our experi-
+ence using STAMP-based accident analysis, we find that even if we use only the
+information presented in an existing accident report, we come up with a very dif-
+ferent view of the accident and its causes.
+Most accident reports are written from the perspective of an event-based model.
+They almost always clearly describe the events and usually one or several of these
+events is chosen as the “root causes.” Sometimes “contributory causes” are identi-
+fied. But the analysis of why those events occurred is usually incomplete: The analy-
+sis frequently stops after finding someone to blame, usually a human operator, and
+the opportunity to learn important lessons is lost.
+An accident analysis technique should provide a framework or process to assist in +understanding the entire accident process and identifying the most important sys- +temic causal factors involved. This chapter describes an approach to accident analy- +sis, based on STAMP, called CAST (Causal Analysis based on STAMP). CAST can +be used to identify the questions that need to be answered to fully understand why +the accident occurred. It provides the basis for maximizing learning from the events. +The use of CAST does not lead to identifying single causal factors or variables. +Instead it provides the ability to examine the entire sociotechnical system design to +identify the weaknesses in the existing safety control structure and to identify +changes that will not simply eliminate symptoms but potentially all the causal +factors, including the systemic ones. +One goal of CAST is to get away from assigning blame and instead to shift the +focus to why the accident occurred and how to prevent similar losses in the future. +To accomplish this goal, it is necessary to minimize hindsight bias and instead to +determine why people behaved the way they did, given the information they had at +the time. +An example of the results of an accident analysis using CAST is presented in +chapter 5. Additional examples are in appendixes B and C. This chapter describes + + +the steps to go through in producing such an analysis. An accident at a fictional +chemical plant called Citichem [174] is used to demonstrate the process. The acci- +dent scenario was developed by Risk Management Pro to train accident investiga- +tors and describes a realistic accident process similar to many accidents that have +occurred in chemical plants. While the loss involves release of a toxic chemical, the +analysis serves as an example of how to do an accident or incident analysis for any +industry. +An accident investigation process is not being specified here, but only a way to +document and analyze the results of such a process. Accident investigation is a much +larger topic that goes beyond the goals of this book. This chapter only considers +how to analyze the data once it has been collected and organized. The accident +analysis process described in this chapter does, however, contribute to determining +what questions should be asked during the investigation. When attempting to apply +STAMP-based analysis to existing accident reports, it often becomes apparent that +crucial information was not obtained, or at least not included in the report, that +is needed to fully understand why the loss occurred and how to prevent future +occurrences. + +footnote. Maggie Stringfellow and John Thomas, two MIT graduate students, contributed to the CAST analysis +of the fictional accident used in this chapter. + +section 11.1. +The General Process of Applying STAMP to Accident Analysis. +In STAMP, an accident is regarded as involving a complex process, not just indi- +vidual events. Accident analysis in CAST then entails understanding the dynamic +process that led to the loss. That accident process is documented by showing the +sociotechnical safety control structure for the system involved and the safety con- +straints that were violated at each level of this control structure and why. The analy- +sis results in multiple views of the accident, depending on the perspective and level +from which the loss is being viewed. 
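Before walking through the parts of the analysis, it may help to see one possible, purely illustrative form for recording what the documentation described above calls for: each component of the safety control structure, the safety constraints it was responsible for enforcing, and the findings about what was violated and why. The field names below are assumptions made for this sketch in Python; CAST itself does not prescribe a notation.

# An illustrative record for one component of the safety control structure.
# One such record per component keeps the findings organized as the
# analysis moves up the structure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentAnalysis:
    name: str                                   # e.g., "plant manager" or "regulator"
    level: str                                  # position in the control structure
    safety_responsibilities: List[str]          # constraints it was to enforce
    controls_available: List[str]               # controls it could exercise
    feedback_received: List[str]                # feedback channels it relied on
    inadequate_control_actions: List[str] = field(default_factory=list)
    context_and_reasons: List[str] = field(default_factory=list)

Filling in one such record per component, from the physical process upward, is one way to capture the multiple views of the accident referred to above.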
Although the process is described in terms of steps or parts, no implication is
being made that the analysis process is linear or that one step must be completed
before the next one is started. The first three steps are the same ones that form the
basis of all the STAMP-based techniques described so far.
1. Identify the system(s) and hazard(s) involved in the loss.
2. Identify the system safety constraints and system requirements associated with
that hazard.
3. Document the safety control structure in place to control the hazard and
enforce the safety constraints. This structure includes the roles and responsi-
bilities of each component in the structure as well as the controls provided or
created to execute their responsibilities and the relevant feedback provided to
them to help them do this. This structure may be completed in parallel with
the later steps.
4. Determine the proximate events leading to the loss.
5. Analyze the loss at the physical system level. Identify the contribution of each
of the following to the events: physical and operational controls, physical fail-
ures, dysfunctional interactions, communication and coordination flaws, and
unhandled disturbances. Determine why the physical controls in place were
ineffective in preventing the hazard.
6. Moving up the levels of the safety control structure, determine how and why
each successive higher level allowed or contributed to the inadequate control
at the current level. For each system safety constraint, either the responsibility
for enforcing it was never assigned to a component in the safety control struc-
ture or a component or components did not exercise adequate control to
ensure their assigned responsibilities (safety constraints) were enforced in the
components below them. Any human decisions or flawed control actions need
to be understood in terms of (at least): the information available to the deci-
sion maker as well as any required information that was not available, the
behavior-shaping mechanisms (the context and influences on the decision-
making process), the value structures underlying the decision, and any flaws
in the process models of those making the decisions and why those flaws
existed.
7. Examine overall coordination and communication contributors to the loss.
8. Determine the dynamics and changes in the system and the safety control
structure relating to the loss and any weakening of the safety control structure
over time.
9. Generate recommendations.
In general, the description of the role of each component in the control structure
will include the following:
1. Safety Requirements and Constraints
2. Controls
3. Context
3.1. Roles and responsibilities
3.2. Environmental and behavior-shaping factors
4. Dysfunctional interactions, failures, and flawed decisions leading to erroneous
control actions

5. Reasons for the flawed control actions and dysfunctional interactions
5.1. Control algorithm flaws
5.2. Incorrect process or interface models
5.3. Inadequate coordination or communication among multiple controllers
5.4. Reference channel flaws
5.5. Feedback flaws
The next sections detail the steps in the analysis process, using Citichem as a
running example.


section 11.2.
Creating the Proximal Event Chain.
While the event chain does not provide the most important causality information,
the basic events related to the loss do need to be identified so that the physical
process involved in the loss can be understood.
+For Citichem, the physical process events are relatively simple: A chemical reac- +tion occurred in storage tanks 701 and 702 of the Citichem plant when the chemical +contained in the tanks, K34, came in contact with water. K34 is made up of some +extremely toxic and dangerous chemicals that react violently to water and thus need +to be kept away from it. The runaway reaction led to the release of a toxic cloud of +tetrachloric cyanide (TCC) gas, which is flammable, corrosive, and volatile. The TCC +blew toward a nearby park and housing development, in a city called Oakbridge, +killing more than four hundred people. +The direct events leading to the release and deaths are: +1. Rain gets into tank 701 (and presumably 702), both of which are in Unit 7 of +the Citichem Oakbridge plant. Unit 7 was shut down at the time due to +lowered demand for K34. +2. Unit 7 is restarted when a large order for K34 is received. +3. A small amount of water is found in tank 701 and an order is issued to make +sure the tank is dry before startup. +4. T34 transfer is started at unit 7. +5. The level gauge transmitter in the 701 storage tank shows more than it +should. +6. A request is sent to maintenance to put in a new level transmitter. +7. The level transmitter from tank 702 is moved to tank 701. (Tank 702 is used +as a spare tank for overflow from tank 701 in case there is a problem.) +8. Pressure in Unit 7 reads as too high. + + +9. The backup cooling compressor is activated. +10. Tank 701 temperature exceeds 12 degrees Celsius. +11. A sample is run, an operator is sent to check tank pressure, and the plant +manager is called. +12. Vibration is detected in tank 701. +13. The temperature and pressure in tank 701 continue to increase. +14. Water is found in the sample that was taken (see event 11). +15. Tank 701 is dumped into the spare tank 702 +16. A runaway reaction occurs in tank 702. +17. The emergency relief valve jams and runoff is not diverted into the backup +scrubber. +18. An uncontrolled gas release occurs. +19. An alarm sounds in the plant. +20. Nonessential personnel are ordered into units 2 and 3, which have positive +pressure and filtered air. +21. People faint outside the plant fence. +22. Police evacuate a nearby school. +23. The engineering manager calls the local hospital, gives them the chemical +name and a hotline phone number to learn more about the chemical. +24. The public road becomes jammed and emergency crews cannot get into the +surrounding community. +25. Hospital personnel cannot keep up with steady stream of victims. +26. Emergency medical teams are airlifted in. +These events are presented as one list here, but separation into separate interacting +component event chains may be useful sometimes in understanding what happened, +as shown in the friendly fire event description in chapter 5. +The Citichem event chain here provides a superficial analysis of what happened. +A deep understanding of why the events occurred requires much more information. +Remember that the goal of a STAMP-based analysis is to determine why the events +occurred—not who to blame for them—and to identify the changes that could +prevent them and similar events in the future. + +section 11.3. Defining the System(s) and Hazards Involved in the Loss. +Citichem has two relevant physical processes being controlled: the physical plant +and public health. 
Because separate and independent controllers were controlling

these two processes, it makes sense to consider them as two interacting but inde-
pendent systems: (1) the chemical company, which controls the chemical process,
and (2) the public political structure, which has responsibilities for public health.
Figure 11.1 shows the major components of the two safety control structures and
interactions between them. Only the major structures are shown in the figure;
the details will be added throughout this chapter. No information was provided


about the design and engineering process for the Citichem plant in the accident
description, so details about it are omitted. A more complete example of a develop-
ment control structure and analysis of its role can be found in appendix B.
The analyst(s) also needs to identify the hazard(s) being avoided and the safety
constraint(s) to be enforced. An accident or loss event for the combined chemical
plant and public health structure can be defined as death, illness, or injury due to
exposure to toxic chemicals.
The hazards being controlled by the two control structures are related but
different. The public health structure hazard is exposure of the public to toxic
chemicals. The system-level safety constraints for the public health control system
are that:
1. The public must not be exposed to toxic chemicals.
2. Measures must be taken to reduce exposure if it occurs.
3. Means must be available, effective, and used to treat exposed individuals
outside the plant.
The hazard for the chemical plant process is uncontrolled release of toxic chemicals.
Accordingly, the system-level constraints are that:
1. Chemicals must be under positive control at all times.
2. Measures must be taken to reduce exposure if inadvertent release occurs.
3. Warnings and other measures must be available to protect workers in the plant
and minimize losses to the outside community.
4. Means must be available, effective, and used to treat exposed individuals inside
the plant.
Hazards and safety constraints must be within the design space of those who devel-
oped the system and within the operational space of those who operate it. For
example, the chemical plant designers cannot be responsible for those things
outside the boundaries of the chemical plant over which they have no control,
although they may have some influence over them. Control over the environment
of a plant is usually the responsibility of the community and various levels of gov-
ernment. As another example, while the operators of the plant may cooperate with
local officials in providing public health and emergency response facilities, respon-
sibility for this function normally lies in the public domain. Similarly, while the
community and local government may have some influence on the design of the
chemical plant, the company engineers and managers control detailed design and
operations.
Once the goals and constraints are determined, the controls in place to enforce
them must be identified.




footnote. OSHA, the Occupational Safety and Health Administration, is part of a third larger governmental
control structure, which has many other components. For simplicity, only OSHA is shown and considered
in the example analysis.


section 11.4. Documenting the Safety Control Structure.
+If STAMP has been used as the basis for previous safety activities, such as the origi- +nal engineering process or the investigation and analysis of previous incidents and +accidents, a model of the safety-control structure may already exist. If not, it must +be created although it can be reused in the future. Chapters 12 and 13 provide +information about the design of safety-control structures. +The components of the structure as well as each component’s responsibility with +respect to enforcing the system safety constraints must be identified. Determining +what these are (or what they should be) can start from system safety requirements. +The following are some example system safety requirements that might be appropri- +ate for the Citichem chemical plant example: +1. Chemicals must be stored in their safest form. +2. The amount of toxic chemicals stored should be minimized. +3. Release of toxic chemicals and contamination of the environment must be +prevented. +4. Safety devices must be operable and properly maintained at all times when +potentially toxic chemicals are being processed or stored. +5. Safety equipment and emergency procedures (including warning devices) +must be provided to reduce exposure in the event of an inadvertent chemical +release. +6. Emergency procedures and equipment must be available and operable to treat +exposed individuals. +7. All areas of the plant must be accessible to emergency personnel and equip- +ment during emergencies. Delays in providing emergency treatment must be +minimized. +8. Employees must be trained to +a. Perform their jobs safely and understand proper use of safety equipment +b. Understand their responsibilities with regards to safety and the hazards +related to their job +c. Respond appropriately in an emergency +9. Those responsible for safety in the surrounding community must be educated +about potential hazards from the plant and provided with information about +how to respond appropriately. +A similar list of safety-related requirements and responsibilities might be gener- +ated for the community safety control structure. + + +These general system requirements must be enforced somewhere in the safety +control structure. As the accident analysis proceeds, they are used as the starting +point for generating more specific constraints, such as constraints for the specific +chemicals being handled. For example, requirement 4, when instantiated for TCC, +might generate a requirement to prevent contact of the chemical with water. As the +accident analysis proceeds, the identified responsibilities of the components can be +mapped to the system safety requirements—the opposite of the forward tracing +used in safety-guided design. If STPA was used in the design or analysis of the +system, then the safety control structure documentation should already exist. +In some cases, general requirements and policies for an industry are established +by the government or by professional associations. These can be used during an +accident analysis to assist in comparing the actual safety control structure (both in +the plant and in the community) at the time of the accidents with the standards or +best practices of the industry and country. Accident analyses can in this way be made +less arbitrary and more guidance provided to the analysts as to what should be +considered to be inadequate controls. +The specific designed controls need not all be identified before the rest of the +analysis starts. 
Additional controls will be identified as the analysts go through +the next steps of the process, but a good start can usually be made early in the +analysis process. + +section 11.5. +Analyzing the Physical Process. +Analysis starts with the physical process, identifying the physical and operational +controls and any potential physical failures, dysfunctional interactions and commu- +nication, or unhandled external disturbances that contributed to the events. The goal +is to determine why the physical controls in place were ineffective in preventing the +hazard. Most accident analyses do a good job of identifying the physical contributors +to the events. +Figure 11.2 shows the requirements and controls at the Citichem physical plant +level as well as failures and inadequate controls. The physical contextual factors +contributing to the events are included. +The most likely reason for water getting into tanks 701 and 702 were inadequate +controls provided to keep water out during a recent rainstorm (an unhandled exter- +nal disturbance to the system in figure 4.8), but there is no way to determine that +for sure. +Accident investigations, when the events and physical causes are not obvious, +often make use of a hazard analysis technique, such as fault trees, to create scenarios +to consider. STPA can be used for this purpose. Using control diagrams of the physi- +cal system, scenarios can be generated that could lead to the lack of enforcement + +of the safety constraint(s) at the physical level. The safety design principles in +chapter 9 can provide assistance in identifying design flaws. +As is common in the process industry, the physical plant safety equipment (con- +trols) at Citichem were designed as a series of barriers to satisfy the system safety +constraints identified earlier, that is, to protect against runaway reactions, protect +against inadvertent release of toxic chemicals or an explosion (uncontrolled energy), +convert any released chemicals into a non-hazardous or less hazardous form, provide +protection against human or environmental exposure after release, and provide +emergency equipment to treat exposed individuals. Citichem had the standard +types of safety equipment installed, including gauges and other indicators of the +physical system state. In addition, it had an emergency relief system and devices to +minimize the danger from released chemicals such as a scrubber to reduce the toxic- +ity of any released chemicals and a flare tower to burn off gas before it gets into +the atmosphere. +A CAST accident analysis examines the controls to determine which ones did +not work adequately and why. While there was a reasonable amount of physical +safety controls provided at Citichem, much of this equipment was inadequate or not +operational—a common finding after chemical plant accidents. +In particular, rainwater got into the tank, which implies the tanks were not +adequately protected against rain despite the serious hazard created by the mixing +of TCC with water. While the inadequate protection against rainwater should be +investigated, no information was provided in the Citichem accident description. Did +the hazard analysis process, which in the process industry often involves HAZOP, +identify this hazard? If not, then the hazard analysis process used by the company +needs to be examined to determine why an important factor was omitted. 
If it was +not omitted, then the flaw lies in the translation of the hazard analysis results into +protection against the hazard in the design and operations. Were controls to protect +against water getting into the tank provided? If not, why not? If so, why were they +ineffective? +Critical gauges and monitoring equipment were missing or inoperable at the time +of the runaway reaction. As one important example, the plant at the time of the +accident had no operational level indicator on tank 702 despite the fact that this +equipment provided safety-critical information. One task for the accident analysis, +then, is to determine whether the indicator was designated as safety-critical, which +would (or should) trigger more controls at the higher levels, such as higher priority +in maintenance activities. The inoperable level indicator also indicates a need to +look at higher levels of the control structure that are responsible for providing and +maintaining safety-critical equipment. +As a final example, the design of the emergency relief system was inadequate: +The emergency relief valve jammed and excess gas could not be sent to the scrubber. + + +The pop-up relief valves in Unit 7 (and Unit 9) at the plant were too small to allow +the venting of the gas if non-gas material was present. The relief valve lines were +also too small to relieve the pressure fast enough, in effect providing a single point +of failure for the emergency relief system. Why an inadequate design existed also +needs to be examined in the higher-level control structure. What group was respon- +sible for the design and why did a flawed design result? Or was the design originally +adequate but conditions changed over time? +The physical contextual factors identified in figure 11.2 play a role in the accident +causal analysis, such as the limited access to the plant, but their importance becomes +obvious only at higher levels of the control structure. +At this point of the analysis, several recommendations are reasonable: add +protection against rainwater getting into the tanks, change the design of the valves +and vent pipes in the emergency relief system, put a level indicator on Tank 702, +and so on. Accident investigations often stop here with the physical process analysis +or go one step higher to determine what the operators (the direct controllers of the +physical process) did wrong. +The other physical process being controlled here, public health, must be exam- +ined in the same way. There were very few controls over public health instituted in +Oakbridge, the community surrounding the plant, and the ones that did exist were +inadequate. The public had no training in what to do in case of an emergency, the +emergency response system was woefully inadequate, and unsafe development was +allowed, such as the creation of a children’s park right outside the walls of the plant. +The reasons for these inadequacies, as well as the inadequacies of the controls on +the physical plant process, are considered in the next section. + + +section 11.6. Analyzing the Higher Levels of the Safety Control Structure. 
+While the physical control inadequacies are relatively easy to identify in the analysis +and are usually handled well in any accident analysis, understanding why those +physical failures or design inadequacies existed requires examining the higher levels +of safety control: Fully understanding the behavior at any level of the sociotechnical +safety control structure requires understanding how and why the control at the +next higher level allowed or contributed to the inadequate control at the current +level. Most accident reports include some of the higher-level factors, but usually +incompletely and inconsistently, and they focus on finding someone or something +to blame. +Each relevant component of the safety control structure, starting with the lowest +physical controls and progressing upward to the social and political controls, needs +to be examined. How are the components to be examined determined? Considering +everything is not practical or cost effective. By starting at the bottom, the relevant + + +components to consider can be identified. At each level, the flawed behavior or +inadequate controls are examined to determine why the behavior occurred and why +the controls at higher levels were not effective at preventing that behavior. For +example, in the STAMP-based analysis of an accident where an aircraft took off +from the wrong runway during construction at the airport, it was discovered that +the airport maps provided to the pilot were out of date [142]. That led to examining +the procedures at the company that provided the maps and the FAA procedures +for ensuring that maps are up-to-date. +Stopping after identifying inadequate control actions by the lower levels of the +safety control structure is common in accident investigation. The result is that the +cause is attributed to “operator error,” which does not provide enough information +to prevent accidents in the future. It also does not overcome the problems of hind- +sight bias. In hindsight, it is always possible to see that a different behavior would +have been safer. But the information necessary to identify that safer behavior is +usually only available after the fact. To improve safety, we need to understand the +reasons people acted the way they did. Then we can determine if and how to change +conditions so that better decisions can be made in the future. +The analyst should start from the assumption that most people have good inten- +tions and do not purposely cause accidents. The goal then is to understand why +people did not or could not act differently. People acted the way they did for very +good reasons; we need to understand why the behavior of the people involved made +sense to them at the time [51]. +Identifying these reasons requires examining the context and behavior-shaping +factors in the safety control structure that influenced that behavior. What contextual +factors should be considered? Usually the important contextual and behavior- +shaping factors become obvious in the process of explaining why people acted the +way they did. Stringfellow has suggested a set of general factors to consider [195]: +•History: Experiences, education, cultural norms, behavioral patterns: how the +historical context of a controller or organization may impact their ability to +exercise adequate control. +•Resources: Staff, finances, time. +•Tools and Interfaces: Quality, availability, design, and accuracy of tools. 
Tools
may include such things as risk assessments, checklists, and instruments as well
as the design of interfaces such as displays, control levers, and automated tools.
•Training.
•Human Cognition Characteristics: Person–task compatibility, individual toler-
ance of risk, control role, innate human limitations.

•Pressures: Time, schedule, resource, production, incentive, compensation,
political. Pressures can include any positive or negative force that can influence
behavior.
•Safety Culture: Values and expectations around such things as incident report-
ing, workarounds, and safety management procedures.
•Communication: How the communication techniques, form, styles, or content
impacted behavior.
•Human Physiology: Intoxication, sleep deprivation, and the like.
We also need to look at the process models used in the decision making. What
information did the decision makers have or did they need related to the inadequate
control actions? What other information could they have had that would have
changed their behavior? If the analysis determines that the person was truly incom-
petent (not usually the case), then the focus shifts to ask why an incompetent person
was hired to do this job and why they were retained in their position. A useful
method to assist in understanding human behavior is to show the process model of
the human controller at each important event in which he or she participated, that
is, what information they had about the controlled process when they made their
decisions.
Let’s follow some of the physical plant inadequacies up the safety control struc-
ture at Citichem. Three examples of STAMP-based analyses of the inadequate
control at Citichem are shown in figure 11.3: a maintenance worker, the maintenance
manager, and the operations manager.
During the investigation, it was discovered that a maintenance worker had found
water in tank 701. He was told to check the Unit 7 tanks to ensure they were ready
for the T34 production startup. Unit 7 had been shut down previously (see “Physical
Plant Context”). The startup was scheduled for 10 days after the decision to produce
additional K34 was made. The worker found a small amount of water in tank 701,
reported it to the maintenance manager, and was told to make sure the tank was
“bone dry.” However, water was found in the sample taken from tank 701 right
before the uncontrolled reaction. It is unknown (and probably unknowable) whether
the worker did not get all the water out or more water entered later through the same
path it entered previously or via a different path. We do know he was fatigued and
working a fourteen-hour day, and he may not have had time to do the job properly.
He also believed that the tank’s residual water was from condensation, not rain. No
independent check was made to determine whether all the water was removed.
Some potential recommendations from what has been described so far include
establishing procedures for quality control and checking safety-critical activities.
Any existence of a hazardous condition—such as finding water in a tank that is to

be used to produce a chemical that is highly reactive to water—should trigger an
in-depth investigation of why it occurred before any dangerous operations are
started or restarted. In addition, procedures should be instituted to ensure that those
performing safety-critical operations have the appropriate skills, knowledge, and
physical resources, which, in this case, include adequate rest.
Independent checks of +critical activities also seem to be needed. +The maintenance worker was just following the orders of the maintenance +manager, so the role of maintenance management in the safety-control structure +also needs to be investigated. The runaway reaction was the result of TCC coming +in contact with water. The operator who worked for the maintenance manager told +him about finding water in tank 701 after the rain and was directed to remove it. +The maintenance manager does not tell him to check the spare tank 702 for water +and does not appear to have made any other attempts to perform that check. He +apparently accepted the explanation of condensation as the source of the water and +did not, therefore, investigate the leak further. +Why did the maintenance manager, a long-time employee who had always been +safety conscious in the past, not investigate further? The maintenance manager was +working under extreme time pressure and with inadequate staff to perform the jobs +that were necessary. There was no reporting channel to someone with specified +responsibility for investigating hazardous events, such as finding water in a tank +used for a toxic chemical that should never contact water. Normally an investigation +would not be the responsibility of the maintenance manager but would fall under +the purview of the engineering or safety engineering staff. There did not appear to +be anyone at Citichem with the responsibility to perform the type of investigation +and risk analysis required to understand the reason for water being in the tank. Such +events should be investigated thoroughly by a group with designated responsibility +for process safety, which presumes, of course, such a group exists. +The maintenance manager did protest (to the plant manager) about the unsafe +orders he was given and the inadequate time and resources he had to do his job +adequately. At the same time, he did not tell the plant manager about some of the +things that had occurred. For example, he did not inform the plant manager about +finding water in tank 701. If the plant manager had known these things, he might +have acted differently. There was no problem-reporting system in this plant for such +information to be reliably communicated to decision makers: Communication relied +on chance meetings and informal channels. +Lots of recommendations for changes could be generated from this part of +the analysis, such as providing rigorous procedures for hazard analysis when a haz- +ardous condition is detected and training and assigning personnel to do such an +analysis. Better communication channels are also indicated, particularly problem +reporting channels. + + +The operations manager (figure 11.3) also played a role in the accident process. +He too was under extreme pressure to get Unit 7 operational. He was unaware that +the maintenance group had found water in tank 701 and thought 702 was empty. +During the effort to get Unit 7 online, the level indicator on tank 701 was found to +be not working. When it was determined that there were no spare level indicators +at the plant and that delivery would require two weeks, he ordered the level indica- +tor on 702 to be temporarily placed on tank 701—tank 702 was only used for over- +flow in case of an emergency, and he assessed the risk of such an emergency as low. +This flawed decision clearly needs to be carefully analyzed. What types of risk and +safety analyses were performed at Citichem? What training was provided on the +hazards? 
What policies were in place with respect to disabling safety-critical equip- +ment? Additional analysis also seems warranted for the inventory control pro- +cedures at the plant and determining why safety-critical replacement parts were +out of stock. +Clearly, safety margins were reduced at Citichem when operations continued +despite serious failures of safety devices. Nobody noticed the degradation in safety. +Any change of the sort that occurred here—startup of operations in a previously +shut down unit and temporary removal of safety-critical equipment—should have +triggered a hazard analysis and a management of change (MOC) process. Lots of +accidents in the chemical industry (and others) involve unsafe workarounds. The +causal analysis so far should trigger additional investigation to determine whether +adequate management of change and control of work procedures had been provided +but not enforced or were not provided at all. The first step in such an analysis is to +determine who was responsible (if anyone) for creating such procedures and who +was responsible for ensuring they were followed. The goal again is not to find +someone to blame but simply to identify the flaws in the process for running +Citichem so they can be fixed. +At this point, it appears that decision making by higher-level management (above +the maintenance and operations manager) and management controls were inade- +quate at Citichem. Figures 11.4 and 11.5 show example STAMP-based analysis results +for the Citichem plant manager and Citichem corporate management. The plant +manager made many unsafe decisions and issued unsafe control actions that directly +contributed to the accident or did not initiate control actions necessary for safety +(as shown in figure 11.4). At the same time, it is clear that he was under extreme +pressure to increase production and was missing information necessary to make +better decisions. An appropriate safety control structure at the plant had not been +established leading to unsafe operational practices and inaccurate risk assessment +by most of the managers, especially those higher in the control structure. Some of +the lower level employees tried to warn against the high-risk practices, but appropri- +ate communication channels had not been established to express these concerns. + + +Safety controls were almost nonexistent at the corporate management level. +The upper levels of management provided inadequate leadership, oversight and +management of safety. There was either no adequate company safety policy or it +was not followed, either of which would lead to further causal analysis. A proper +process safety management system clearly did not exist at Citichem. Management +was under great competitive pressures, which may have led to ignoring corporate +safety controls or adequate controls may never have been established. Everyone +had very flawed mental models of the risks of increasing production without taking +the proper precautions. The recommendations should include consideration of +what kinds of changes might be made to provide better information about risks to +management decision makers and about the state of plant operations with respect +to safety. +Like any major accident, when analyzed thoroughly, the process leading to +the loss is complex and multi-faceted. A complete analysis of this accident is not +needed here. But a look at some of the factors involved in the plant’s environment, +including the control of public health, is instructive. 
+Figure 11.6 shows the STAMP-based analysis of the Oakbridge city emergency- +response system. Planning was totally inadequate or out of date. The fire department +did not have the proper equipment and training for a chemical emergency, the hos- +pital also did not have adequate emergency resources or a backup plan, and the +evacuation plan was ten years out of date and inadequate for the current level of +population. +Understanding why these inadequate controls existed requires understanding the +context and process model flaws. For example, the police chief had asked for +resources to update equipment and plans, but the city had turned him down. Plans +had been made to widen the road to Oakbridge so that emergency equipment could +be brought in, but those plans were never implemented and the planners never went +back to their plans to see if they were realistic for the current conditions. Citichem +had a policy against disclosing what chemicals they produce and use, justifying this +policy by the need for secrecy from their competitors, making it impossible for the +hospital to stockpile the supplies and provide the training required for emergencies, +all of which contributed to the fatalities in the accident. The government had no +disclosure laws requiring chemical companies to provide such information to emer- +gency responders. +Clear recommendations for changes result from this analysis, for example, updat- +ing evacuation plans and making changes to the planning process. But again, stop- +ping at this level does not help to identify systemic changes that could improve +community safety: The analysts should work their way up the control structure to +understand the entire accident process. For example, why was an inadequate emer- +gency response system allowed to exist? + + +The analysis in figure 11.7 helps to answer this question. For example, the +members of the city government had inadequate knowledge of the hazards associ- +ated with the plant, and they did not try to obtain more information about them or +about the impact of increased development close to the plant. At the same time, +they turned down requests for the funding to upgrade the emergency response +system as the population increased as well as attempts by city employees to provide +emergency response pamphlets for the citizens and set up appropriate communica- +tion channels. +Why did they make what in retrospect look like such bad decisions? With inad- +equate knowledge about the risks, the benefits of increased development were +ranked above the dangers from the plant in the priorities used by the city managers. +A misunderstanding about the dangers involved in the chemical processing at +the plant contributed also to the lack of planning and approval for emergency- +preparedness activities. +The city government officials were subjected to pressures from local developers +and local businesses that would benefit financially from increased development. The +developer sold homes before the development was approved in order to increase +pressure on the city council. He also campaigned against a proposed emergency +response pamphlet for local residents because he was afraid it would reduce his +sales. The city government was subjected to additional pressure from local business- +men who wanted more development in order to increase their business and profits. 
+The residents did not provide opposing pressure to counteract the business +influences and trusted that government would protect them: No community orga- +nizations existed to provide oversight of the local government safety controls and +to ensure that government was adequately considering their health and safety needs +(figure 11.8). +The city manager had the right instincts and concern for public safety, but she +lacked the freedom to make decisions on her own and the clout to influence the +mayor or city council. She was also subject to external pressures to back down on +her demands and no structure to assist her in resisting those pressures. +In general, there are few requirements for serving on city councils. In the United +States, they are often made up primarily of those with conflicts of interest, such as +real estate agents and developers. Mayors of small communities are often not paid +a full salary and must therefore have other sources of income, and city council +members are likely to be paid even less, if at all. +If community-level management is unable to provide adequate controls, controls +might be enforced by higher levels of government. A full analysis of this accident +would consider what controls existed at the state and federal levels and why they +were not effective in preventing the accident. + + +section 11.7. +A Few Words about Hindsight Bias and Examples. +One of the most common mistakes in accident analyses is the use of hindsight bias. +Words such as “could have” or “should have” in accident reports are judgments that +are almost always the result of such bias [50]. It is not the role of the accident analyst +to render judgment in terms of what people did or did not do (although that needs +to be recorded) but to understand why they acted the way they did. +Although hindsight bias is usually applied to the operators in an accident report, +because most accident reports focus on the operators, it theoretically could be +applied to people at any level of the organization: “The plant manager should have +known …” +The biggest problem with hindsight bias in accident reports is not that it is +unfair (which it usually is), but that an opportunity to learn from the accident and +prevent future occurrences is lost. It is always possible to identify a better decision +in retrospect—or there would not have been a loss or near miss—but it may have +been difficult or impossible to identify that the decision was flawed at the time it +had to be made. To improve safety and to reduce errors, we need to understand why + + +the decision made sense to the person at the time and redesign the system to help +people make better decisions. +Accident investigation should start with the assumption that most people have +good intentions and do not purposely cause accidents. The goal of the investigation, +then, is to understand why they did the wrong thing in that particular situation. In +particular, what were the contextual or systemic factors and flaws in the safety +control structure that influenced their behavior? Often, the person had an inaccu- +rate view of the state of the process and, given that view, did what appeared to be +the right thing at the time but turned out to be wrong with respect to the actual +state. The solution then is to redesign the system so that the controller has better +information on which to make decisions. +As an example, consider a real accident report on a chemical overflow from a +tank, which injured several workers in the vicinity [118]. 
The control room operator +issued an instruction to open a valve to start the flow of liquid into the tank. The +flow meter did not indicate a flow, so the control room operator asked an outside +operator to check the manual valves near the tank to see if they were closed. +The control room operator believed that the valves were normally left in an open +position to facilitate conducting the operation remotely. The tank level at this time +was 7.2 feet. +The outside operator checked and found the manual valves at the tank open. The +outside operator also saw no indication of flow on the flow meter and made an effort +to visually verify that there was no flow. He then began to open and close the valves +manually to try to fix the problem. He reported to the control room operator that +he heard a clunk that may have cleared an obstruction, and the control room opera- +tor tried opening the valve remotely again. Both operators still saw no flow on the +flow meter. The outside operator at this time got a call to deal with a problem in a +different part of the plant and left. He did not make another attempt to visually verify +if there was flow. The control room operator left the valve in the closed position. In +retrospect, it appears that the tank level at this time was approximately 7.7 feet. +Twelve minutes later, the high-level alarm on the tank sounded in the control +room. The control room operator acknowledged the alarm and turned it off. In +retrospect, it appears that the tank level at this time was approximately 8.5 feet, +although there was no indication of the actual level on the control board. The control +room operator got an alarm about an important condition in another part of the +plant and turned his attention to dealing with that alarm. A few minutes later, the +tank overflowed. +The accident report concluded, “The available evidence should have been suffi- +cient to give the control room operator a clear indication that (the tank) was indeed +filling and required immediate attention.” This statement is a classic example of +hindsight bias—note the use of the words “should have …” The report does not + +identify what that evidence was. In fact, the majority of the evidence that both +operators had at this time was that the tank was not filling. +To overcome hindsight bias, it is useful to examine exactly what evidence the +operators had at time of each decision in the sequence of events. One way to do +this is to draw the operator’s process model and the values of each of the relevant +variables in it. In this case, both operators thought the control valve was closed—the +control room operator had closed it and the control panel indicated that it was +closed, the flow meter showed no flow, and the outside operator had visually checked +and there was no flow. The situation is complicated by the occurrence of other +alarms that the operators had to attend to at the same time. +Why did the control board show the control valve was closed when it must have +actually been open? It turns out that there is no way for the control room operator +to get confirmation that the valve has actually closed after he commands it closed. +The valve was not equipped with a valve stem position monitor, so the control +room operator only knows that a signal has gone to the valve for it to close but not +whether it has actually done so. The operators in many accidents, including Three +Mile Island, have been confused about the actual position of valves due to similar +designs. 
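+One way to lay this out, shown below purely as an illustration, is to record the operator's believed process model next to
+the actual plant state at a given decision point and list the mismatches. The variable names and the reconstructed values
+in the sketch are assumptions for illustration, not data taken from the report.
+
+# Illustrative sketch: the control room operator's process model versus the
+# actual plant state around the time of the high-level alarm. Variable names
+# and values are hypothetical reconstructions for illustration only.
+believed = {
+    "control_valve": "closed",              # commanded closed; panel showed closed
+    "flow_into_tank": "none",               # flow meter showed no flow; visual check agreed
+    "tank_level": "steady near 7.2 ft",     # with no believed flow, no rise expected
+}
+actual = {
+    "control_valve": "open",                # no valve stem position feedback existed
+    "flow_into_tank": "present",
+    "tank_level": "rising, about 8.5 ft at the alarm",
+}
+for variable in believed:
+    if believed[variable] != actual[variable]:
+        print(f"{variable}: believed {believed[variable]!r}, actual {actual[variable]!r}")
+
+Laid out this way, every observable piece of evidence supported the operators' incorrect model; the redesign question is
+how to make the mismatched variables observable, not how the operators should have guessed them.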
+An additional complication is that while there is an alarm in the tank that should +sound when the liquid level reaches 7.5 feet, that alarm was not working at the time, +and the operator did not know it was not working. So the operator had extra reason +to believe the liquid level had not risen above 7.5 feet, given that he believed there +was no flow into the tank and the 7.5-foot alarm had not sounded. The level trans- +mitter (which provided the information to the 7.5-foot alarm) had been operating +erratically for a year and a half, but a work order had not been written to repair it +until the month before. It had supposedly been fixed two weeks earlier, but it clearly +was not working at the time of the spill. +The investigators, in retrospect knowing that there indeed had to have been some +flow, suggested that the control room operator “could have” called up trend data on +the control board and detected the flow. But this suggestion is classic hindsight bias. +The control room operator had no reason to perform this extra check and was busy +taking care of critical alarms in other parts of the plant. Dekker notes the distinction +between data availability, which is what can be shown to have been physically avail- +able somewhere in the situation, and data observability, which is what was observ- +able given the features of the interface and the multiple interleaving tasks, goals, +interests, and knowledge of the people looking at it [51]. The trend data were avail- +able to the control room operator, but they were not observable without taking +special actions that did not seem necessary at the time. +While that explains why the operator did not know the tank was filling, it does +not fully explain why he did not respond to the high-level alarm. The operator said +that he thought the liquid was “tickling” the sensor and triggering a false alarm. The + + +accident report concludes that the operator should have had sufficient evidence the +tank was indeed filling and responded to the alarm. Not included in the official +accident report was the fact that nuisance alarms were relatively common in this +unit: they occurred for this alarm about once a month and were caused by sampling +errors or other routine activities. This alarm had never previously signaled a serious +problem. Given that all the observable evidence showed the tank was not filling and +that the operator needed to respond to a serious alarm in another part of the plant +at the time, the operator not responding immediately to the alarm does not seem +unreasonable. +An additional alarm was involved in the sequence of events. This alarm was at +the tank and denoted that a gas from the liquid in the tank was detected in the air +outside the tank. The outside operator went to investigate. Both operators are +faulted in the report for waiting thirty minutes to sound the evacuation horn after +this alarm went off. The official report says: +Interviews with operations personnel did not produce a clear reason why the response to +the [gas] alarm took 31 minutes. The only explanation was that there was not a sense of +urgency since, in their experience, previous [gas] alarms were attributed to minor releases +that did not require a unit evacuation. +This statement is puzzling, because the statement itself provides a clear explanation +for the behavior, that is, the previous experience. 
In addition, the alarm maxed out +at 25 ppm, which is much lower than the actual amount in the air, but the control +room operator had no way of knowing what the actual amount was. In addition, +there are no established criteria in any written procedure for what level of this gas +or what alarms constitute an emergency condition that should trigger sounding +the evacuation alarm. Also, none of the alarms were designated as critical alarms, +which the accident report does concede might have “elicited a higher degree of +attention amongst the competing priorities” of the control room operator. Finally, +there was no written procedure for responding to an alarm for this gas. The “stan- +dard response” was for an outside operator to conduct a field assessment of the +situation, which he did. +While there is training information provided about the hazards of the particular +gas that escaped, this information was not incorporated in standard operating or +emergency procedures. The operators were apparently on their own to decide if an +emergency existed and then were chastised for not responding (in hindsight) cor- +rectly. If there is a potential for operators to make poor decisions in safety-critical +situations, then they need to be provided with the criteria to make such a decision. +Expecting operators under stress and perhaps with limited information about the +current system state and inadequate training to make such critical decisions based +on their own judgment is unrealistic. It simply ensures that operators will be blamed +when their decisions turn out, in hindsight, to be wrong. + + +One of the actions the operators were criticized for was trying to fix the problem +rather than calling in emergency personnel immediately after the gas alarm sounded. +In fact, this response is the normal one for humans (see chapter 9 and [115], as well +as the following discussion): if it is not the desirable response, then procedures and +training must be used to ensure that a different response is elicited. The accident +report states that the safety policy for this company is: +At units, any employee shall assess the situation and determine what level of evacuation +and what equipment shutdown is necessary to ensure the safety of all personnel, mitigate +the environmental impact and potential for equipment/property damage. When in doubt, +evacuate. +There are two problems with such a policy. +The first problem is that evacuation responsibilities (or emergency procedures +more generally) do not seem to be assigned to anyone but can be initiated by all +employees. While this may seem like a good idea, it has a serious drawback because one +consequence of such a lack of assigned control responsibility is that everyone may +think that someone else will take the initiative—and the blame if the alarm is a false +one. Although everyone should report problems and even sound an emergency alert +when necessary, there must be someone who has the actual responsibility, authority, +and accountability to do so. There should also be backup procedures for others to step +in when that person does not execute his or her responsibility acceptably. +The second problem with this safety policy is that unless the procedures clearly +say to execute emergency procedures, humans are very likely to try to diagnose the +situation first. 
The same problem pops up in many accident reports—humans who +are overwhelmed with information that they cannot digest quickly or do not under- +stand, will first try to understand what is going on before sounding an alarm [115]. +If management wants employees to sound alarms expeditiously and consistently, +then the safety policy needs to specify exactly when alarms are required, not leave +it up to personnel to “evaluate the situation” when they are probably confused and +unsure as to what is going on (as in this case) and under pressure to make quick +decisions under stressful situations. How many people, instead of dialing 911 imme- +diately, try to put out a small kitchen fire themselves? That it often works simply +reinforces the tendency to act in the same way during the next emergency. And it +avoids the embarrassment of the firemen arriving for a non-emergency. As it turns +out, the evacuation alert had been delayed in the past in this same plant, but nobody +had investigated why that occurred. +The accident report concludes with a recommendation that “operator duty to +respond to alarms needs to be reinforced with the work force.” This recommenda- +tion is inadequate because it ignores why the operators did not respond to the +alarms. More useful recommendations might have included designing more accurate + +and more observable feedback about the actual position of the control valve (rather +than just the commanded position), about the state of flow into the tank, about the +level of the liquid in the tank, and so on. The recommendation also ignores the +ambiguous state of the company policy on responding to alarms. +Because the official report focused only on the role of the operators in the acci- +dent and did not even examine that in depth, a chance to detect flaws in the design +and operation of the plant that could lead to future accidents was lost. To prevent +future accidents, the report needed to explain such things as why the HAZOP per- +formed on the unit did not identify any of the alarms in this unit as critical. Is there +some deficiency in HAZOP or in the way it is being performed in this company? +Why were there no procedures in place, or why were the ones in place ineffective, +to respond to the emergency? Either the hazard was not identified, the company +does not have a policy to create procedures for dealing with hazards, or it was an +oversight and there was no procedure in place to check that there is a response for +all identified hazards. +The report does recommend that a risk assessed procedure for filling this tank +be created that defines critical operational parameters such as the sequence of steps +required to initiate the filling process, the associated process control parameters, the +safe level at which the tank is considered full, the sequence of steps necessary to +conclude and secure the tank-filling process, and appropriate response to alarms. It +does not say anything, however, about performing the same task for other processes +in the plant. Either this tank and its safety-critical process are the only ones missing +such procedures or the company is playing a sophisticated game of Whack-a-Mole +(see chapter 13), in which only symptoms of the real problems are removed with +each set of events investigated. 
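+What such explicit criteria might look like is sketched below, only to make the idea concrete. The alarm names,
+classifications, and required actions are invented for illustration; they are not taken from the plant's procedures or
+from the accident report.
+
+# Illustrative sketch: proceduralized alarm-response criteria, so that the
+# decision to evacuate does not rest on individual judgment under stress.
+# All names, classifications, and actions here are hypothetical.
+ALARM_RESPONSES = {
+    "tank_high_level": {
+        "classification": "critical",
+        "required_action": "Stop the transfer and verify the level by an independent means",
+        "respond_within_minutes": 5,
+    },
+    "gas_detected_at_tank": {
+        "classification": "critical",
+        "required_action": "Sound the evacuation alert, then begin the field assessment",
+        "respond_within_minutes": 2,
+    },
+}
+
+def required_response(alarm_name: str) -> dict:
+    # A KeyError here means the procedure has a gap for this alarm.
+    return ALARM_RESPONSES[alarm_name]
+
+print(required_response("gas_detected_at_tank")["required_action"])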
+The official accident report concludes that the control room operator “did not +demonstrate an awareness of risks associated with overflowing the tank and poten- +tial to generate high concentrations of [gas] if the [liquid in the tank] was spilled.” +No further investigation of why this was true was included in the report. Was there +a deficiency in the training procedures about the hazards associated with his job +responsibilities? Even if the explanation is that this particular operator is simply +incompetent (probably not true) and although exposed to potentially effective train- +ing did not profit from it, then the question becomes why such an operator was +allowed to continue in that job and why the evaluation of his training outcomes did +not detect this deficiency. It seemed that the outside operator also had a poor +understanding of the risks from this gas so there is clearly evidence that a systemic +problem exists. An audit should have been performed to determine if a spill in this +tank is the only hazard that is not understood and if these two operators are the +only ones who are confused. Is this unit simply a poorly designed and managed one +in the plant or do similar deficiencies exist in other units? + + + +Other important causal factors and questions also were not addressed in the +report such as why the level transmitter was not working so soon after it was sup- +posedly fixed, why safety orders were so delayed (the average age of a safety-related +work order in this plant was three months), why critical processes were allowed to +operate with non-functioning or erratically functioning safety-related equipment, +whether the plant management knew this was happening, and so on. +Hindsight bias and focusing only on the operator’s role in accidents prevents us +from fully learning from accidents and making significant progress in improving +safety. +section 11.8. +Coordination and Communication. +The analysis so far has looked at each component separately. But coordination and +communication between controllers are important sources of unsafe behavior. +Whenever a component has two or more controllers, coordination should be +examined carefully. Each controller may have different responsibilities, but the +control actions provided may conflict. The controllers may also control the same +aspects of the controlled component’s behavior, leading to confusion about who is +responsible for providing control at any time. In the Walkerton E. coli water supply +contamination example provided in appendix C, three control components were +responsible for following up on inspection reports and ensuring the required changes +were made: the Walkerton Public Utility Commission (WPUC), the Ministry of the +Environment (MOE), and the Ministry of Health (MOH). The WPUC commission- +ers had no expertise in running a water utility and simply left the changes to the +manager. The MOE and MOH both were responsible for performing the same +oversight: The local MOH facility assumed that the MOE was performing this func- +tion, but the MOE’s budget had been cut, and follow-ups were not done. In this +case, each of the three responsible groups assumed the other two controllers were +providing the needed oversight, a common finding after an accident. +A different type of coordination problem occurred in an aircraft collision near +Überlingen, Germany, in 2002 [28, 212]. 
The two controllers—the automated on- +board TCAS system and the ground air traffic controller—provided uncoordinated +control instructions that conflicted and actually caused a collision. The loss would +have been prevented if both pilots had followed their TCAS alerts or both had fol- +lowed the ground ATC instructions. +In the friendly fire accident analyzed in chapter 5, the responsibility of the +AWACS controllers had officially been disambiguated by assigning one to control +aircraft within the no-fly zone and the other to monitor and control aircraft outside +it. This partitioning of control broke down over time, however, with the result that +neither controlled the Black Hawk helicopter on that fateful day. No performance + + +auditing occurred to ensure that the assumed and designed behavior of the safety +control structure components was actually occurring. +Communication, both feedback and exchange of information, is also critical. All +communication links should be examined to ensure they worked properly and, if +they did not, the reasons for the inadequate communication must be determined. +The Überlingen collision, between a Russian Tupolev aircraft and a DHL Boeing +aircraft, provides a useful example. Wong used STAMP to analyze this accident and +demonstrated how the communications breakdown on the night of the accident +played an important role [212]. Figure 11.9 shows the components surrounding the +controller at the Air Traffic Control Center in Zürich that was controlling both +aircraft at the time and the feedback loops and communication links between the +components. Dashed lines represent partial communication channels that are not +available all the time. For example, only partial communication is available between +the controller and multiple aircraft because only one party can transmit at one time +when they are sharing a single radio frequency. In addition, the controller cannot +directly receive information about TCAS advisories—the Pilot Not Flying (PNF) is + + +supposed to report TCAS advisories to the controller over the radio. Finally, com- +municating all the time with all the aircraft requires the presence of two controllers +at two different consoles, but only one controller was present at the time. +Nearly all the communication links were broken or ineffective at the time of the +accident (see figure 11.10). A variety of conditions contributed to the lost links. +The first reason for the dysfunctional communication was unsafe practices such +as inadequate briefings given to the two controllers scheduled to work the night +shift, the second controller being in the break room (which was not officially allowed +but was known and tolerated by management during times of low traffic), and the +reluctance of the controller’s assistant to speak up with ideas to assist in the situa- +tion due to feeling that he would be overstepping his bounds. The inadequate brief- +ings were due to a lack of information as well as each party believing they were not +responsible for conveying specific information, a result of poorly defined roles and +responsibilities. +More links were broken due to maintenance work that was being done in the +control room to reorganize the physical sectors. This work led to unavailability of +the direct phone line used to communicate with adjacent ATC centers (including +ATC Karlsruhe, which saw the impending collision and tried to call ATC Zurich) +and the loss of an optical short-term conflict alert (STCA) on the console. 
The aural +short-term conflict alert was theoretically working, but nobody in the control room +heard it. +Unusual situations led to the loss of additional links. These include the failure of +the bypass telephone system from adjacent ATC centers and the appearance of a +delayed A320 aircraft landing at Friedrichshafen. To communicate with all three +aircraft, the controller had to alternate between two consoles, changing all the air- +craft–controller communication channels to partial links. +Finally, some links were unused because the controller did not realize they were +available. These include possible help from the other staff present in the control room +(but working on the resectorization) and a third telephone system that the controller +did not know about. In addition, the link between the crew of the Tupolev aircraft +and its TCAS unit was broken due to the crew ignoring the TCAS advisory. +Figure 11.10 shows the remaining links after all these losses. At the time of the +accident, there were no complete feedback loops left in the system and the few +remaining connections were partial ones. The exception was the connection between +the TCAS units of the two aircraft, which were still communicating with each other. +The TCAS unit can only provide information to the crew, however, so this remaining +loop was unable to exert any control over the aircraft. +Another common type of communication failure is in the problem-reporting +channels. In a large number of accidents, the investigators find that the problems +were identified in time to prevent the loss but that the required problem-reporting + + +channels were not used. Recommendations in the ensuing accident reports usually +involve training people to use the reporting channels—based on an assumption that +the lack of use reflected poor training—or attempting to enforce their use by reit- +erating the requirement that all problems be reported. These investigations, however, +usually stop short of finding out why the reporting channels were not used. Often +an examination and a few questions reveal that the formal reporting channels are +difficult or awkward and time-consuming to use. Redesign of a poorly designed +system will be more effective in ensuring future use than simply telling people they +have to use a poorly designed system. Unless design changes are made, over time +the poorly designed communication channels will again become underused. +At Citichem, all problems were reported orally to the control room operator, who +was supposed to report them to someone above him. One conduit for information, +of course, leads to a very fragile reporting system. At the same time, there were few +formal communication and feedback channels established—communication was +informal and ad hoc, both within Citichem and between Citichem and the local +government. + +section 11.9. Dynamics and Migration to a High-Risk State. +As noted previously, most major accidents result from a migration of the system +toward reduced safety margins over time. In the Citichem example, pressure from +commercial competition was one cause of this degradation in safety. It is, of course, +a very common one. Operational safety practices at Citichem had been better in the +past, but the current market conditions led management to cut the safety margins +and ignore established safety practices. 
Usually there are precursors signaling the +increasing risks associated with these changes in the form of minor incidents and +accidents, but in this case, as in so many others, these precursors were not recognized. +Ironically, the death of the Citichem maintenance manager in an accident led the +management to make changes in the way they were operating, but it was too late +to prevent the toxic chemical release. +The corporate leaders pressured the Citichem plant manager to operate at higher +levels of risk by threatening to move operations to Mexico, leaving the current +workers without jobs. Without any way of maintaining an accurate model of the risk +in current operations, the plant manager allowed the plant to move to a state of +higher and higher risk. +Another change over time that affected safety in this system was the physical +change in the separation of the population from the plant. Usually hazardous facili- +ties are originally placed far from population centers, but the population shifts +after the facility is created. People want to live near where they work and do not +like long commutes. Land and housing may be cheaper near smelly, polluting plants. +In third world countries, utilities (such as power and water) and transportation +facilities may be more readily available near heavy industrial plants, as was the case +at Bhopal. +At Citichem, an important change over time was the obsolescence of the emer- +gency preparations as the population increased. Roads, hospital facilities, firefighting +equipment, and other emergency resources became inadequate. Not only were there +insufficient resources to handle the changes in population density and location, +but financial and other pressures militated against those wanting to update the +emergency resources and plans. +Considering the Oakbridge community dynamics, the city of Oakbridge con- +tributed to the accident through the erosion of the safety controls due to the normal +pressures facing any city government. Without any history of accidents, or risk +assessments indicating otherwise, the plant was deemed safe, and officials allowed +developers to build on previously restricted land. A contributing factor was the +desire to increase city finances and business relationships that would assist in reelec- +tion of the city officials. The city moved toward a state where casualties would be +massive when an accident did occur. + + +The goal of understanding the dynamics is to redesign the system and the safety +control structure to make them more conducive to system safety. For example, +behavior is influenced by recent accidents or incidents: As safety efforts are success- +fully employed, the feeling grows that accidents cannot occur, leading to reduction +in the safety efforts, an accident, and then increased controls for a while until the +system drifts back to an unsafe state and complacency again increases . . . +This complacency factor is so common that any system safety effort must include +ways to deal with it. SUBSAFE, the U.S. nuclear submarine safety program, has +been particularly successful at accomplishing this goal. The SUBSAFE program is +described in chapter 14. +One way to combat this erosion of safety is to provide ways to maintain accurate +risk assessments in the process models of the system controllers. The more and +better information controllers have, the more accurate will be their process models +and therefore their decisions. 
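+The drift-and-reset pattern described above can be made concrete with a toy model. The sketch below is only for intuition;
+the equations and numbers are invented and not calibrated to any real system. It simply shows complacency eroding safety
+effort, eroded effort raising the actual risk, and an incident temporarily restoring the controls.
+
+# Toy illustration of migration toward a high-risk state; all numbers invented.
+import random
+
+random.seed(1)
+safety_effort = 1.0   # fraction of recommended safety activities actually performed
+incident_risk = 0.02  # chance of an incident in a given month
+
+for month in range(1, 61):
+    # With no recent incidents, complacency grows and effort is cut back.
+    safety_effort = max(0.2, safety_effort - 0.02)
+    # Reduced effort lets the actual risk drift upward.
+    incident_risk = min(0.5, incident_risk + 0.01 * (1.0 - safety_effort))
+    if random.random() < incident_risk:
+        print(f"Month {month}: incident while effort was {safety_effort:.2f}")
+        # The incident triggers renewed controls, and the cycle starts again.
+        safety_effort = 1.0
+        incident_risk = 0.02
+
+The point of the sketch is only that, without an explicit control acting on the effort itself (for example, audits or
+leading indicators of risk), the loop settles into repeated drift and loss.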
+In the Citichem example, the dynamics of the city migration toward higher risk +might be improved by doing better hazard analyses, increasing communication +between the city and the plant (e.g., learning about incidents that are occurring), +and the formation of community citizen groups to provide counterbalancing pres- +sures on city officials to maintain the emergency response system and the other +public safety measures. +Finally, understanding the reason for such migration provides an opportunity to +design the safety control structure to prevent it or to detect it when it occurs. Thor- +ough investigation of incidents using CAST and the insight it provides can be used +to redesign the system or to establish operational controls to stop the migration +toward increasing risk before an accident occurs. + +section 11.10. Generating Recommendations from the CAST Analysis. +The goal of an accident analysis should not be just to address symptoms, to assign +blame, or to determine which group or groups are more responsible than others. +Blame is difficult to eliminate, but, as discussed in section 2.7, blame is antitheti- +cal to improving safety. It hinders accident and incident investigations and the +reporting of errors before a loss occurs, and it hinders finding the most important +factors that need to be changed to prevent accidents in the future. Often, blame is +assigned to the least politically powerful in the control hierarchy or to those people +or physical components physically and operationally closest to the actual loss +events. Understanding why inadequate control was provided and why it made +sense for the controllers to act in the way they did helps to diffuse what seems to +be a natural desire to assign blame for events. In addition, looking at how the entire +safety control structure was flawed and conceptualizing accidents as complex + + +processes rather than the result of independent events should reduce the finger +pointing and arguments about others being more to blame that often arises when +system components other than the operators are identified as being part of the +accident process. “More to blame” is not a relevant concept in a systems approach +to accident analysis and should be resisted and avoided. Each component in a +system works together to obtain the results, and no part is more important than +another. +The goal of the accident analysis should instead be to determine how to change +or reengineer the entire safety-control structure in the most cost-effective and prac- +tical way to prevent similar accident processes in the future. Once the STAMP +analysis has been completed, generating recommendations is relatively simple and +follows directly from the analysis results. +One consequence of the completeness of a STAMP analysis is that many possi- +ble recommendations may result—in some cases, too many to be practical to +include in the final accident report. A determination of the relative importance of +the potential recommendations may be required in terms of having the greatest +impact on the largest number of potential future accidents. There is no algorithm +for identifying these recommendations, nor can there be. Political and situational +factors will always be involved in such decisions. Understanding the entire accident +process and the overall safety control structure should help with this identification, +however. +Some sample recommendations for the Citichem example are shown throughout +the chapter. 
A more complete list of the recommendations that might result from a +STAMP-based Citichem accident analysis follows. The list is divided into four parts: +physical equipment and design, corporate management, plant operations and man- +agement, and government and community. +Physical Equipment and Design +1. Add protection against rainwater getting into tanks. +2. Consider measures for preventing and detecting corrosion. +3. Change the design of the valves and vent pipes to respond to the two-phase +flow problem (which was responsible for the valves and pipes being jammed). +4. Etc. (the rest of the physical plant factors are omitted) +Corporate Management +1. Establish a corporate safety policy that specifies: +a. Responsibility, authority, accountability of everyone with respect to safety +b. Criteria for evaluating decisions and for designing and implementing safety +controls. + + +2. Establish a corporate process safety organization to provide oversight that is +responsible for: +a. Enforcing the safety policy +b. Advising corporate management on safety-related decisions +c. Performing risk analyses and overseeing safety in operations including +performing audits and setting reporting requirements (to keep corporate +process models accurate). A safety working group at the corporate level +should be considered. +d. Setting minimum requirements for safety engineering and operations at +plants and overseeing the implementation of these requirements as well as +management of change requirements for evaluating all changes for their +impact on safety. +e. Providing a conduit for safety-related information from below (a formal +safety reporting system) as well as an independent feedback channel about +process safety concerns by employees. +f. Setting minimum physical and operational standards (including functioning +equipment and backups) for operations involving dangerous chemicals. +g. Establishing incident/accident investigation standards and ensuring recom- +mendations are adequately implemented. +h. Creating and maintaining a corporate process safety information system. +3. Improve process safety communication channels both within the corporate +level as well as information and feedback channels from Citichem plants to +corporate management. +4. Ensure that appropriate communication and coordination is occurring between +the Citichem plants and the local communities in which they reside. +5. Strengthen or create an inventory control system for safety-critical parts at the +corporate level. Ensure that safety-related equipment is in stock at all times. + +Citichem Oakbridge Plant Management and Operations. +1. Create a safety policy for the plant. Derive it from the corporate safety policy +and make sure everyone understands it. Include minimum requirements for +operations: for example, safety devices must be operational, and production +should be shut down if they are not. +2. Establish a plant process safety organization and assign responsibility, author- +ity, and accountability for this organization. Include a process safety manager +whose primary responsibility is process safety. The responsibilities of this +organization should include at least the following: +a. Perform hazard and risk analysis. + +b. Advise plant management on safety-related decisions. +c. Create and maintain a plant process safety information system. +d. Perform or organize process safety audits and inspections using hazard +analysis results as the preconditions for operations and maintenance. +e. 
Investigate hazardous conditions, incidents, and accidents. +f. Establish leading indicators of risk. +g. Collect data to ensure process safety policies and procedures are being +followed. +3. Ensure that everyone has appropriate training in process safety and the spe- +cific hazards associated with plant operations. +4. Regularize and improve communication channels. Create the operational +feedback channels from controlled components to controllers necessary to +maintain accurate process models to assist in safety-related decision making. +If the channels exist but are not used, then the reason why they are unused +should be determined and appropriate changes made. +5. Establish a formal problem reporting system along with channels for problem +reporting that include management and rank and file workers. Avoid com- +munication channels with a single point of failure for safety-related messages. +Decisions on whether management is informed about hazardous operational +events should be proceduralized. Any operational conditions found to exist +that involve hazards should be reported and thoroughly investigated by those +responsible for system safety. +6. Consider establishing employee safety committees with union representation +(if there are unions at the plant). Consider also setting up a plant process safety +working group. +7. Require that all changes affecting safety equipment be approved by the plant +manager or by his or her designated representative for safety. Any outage of +safety-critical equipment must be reported immediately. +8. Establish procedures for quality control and checking of safety-critical activi- +ties and follow-up investigation of safety excursions (hazardous conditions). +9. Ensure that those performing safety-critical operations have appropriate skills +and physical resources (including adequate rest). +10. Improve inventory control procedures for safety-critical parts at the +Oakbridge plant. +11. Review procedures for turnarounds, maintenance, changes, operations, etc. +that involve potential hazards and ensure that these are being followed. Create +an MOC procedure that includes hazard analysis on all planned changes. + +12. Enforce maintenance schedules. If delays are unavoidable, a safety analysis +should be performed to understand the risks involved. +13. Establish incident/accident investigation standards and ensure that they are +being followed and recommendations are implemented. +14. Create a periodic audit system on the safety of operations and the state of +the plant. Audit scope might be defined by such information as the hazard +analysis, identified leading indicators of risk, and past incident/accident +investigations. +15. Establish communication channels with the surrounding community and +provide appropriate information for better decision making by community +leaders and information to emergency responders and the medical establish- +ment. Coordinate with the surrounding community to provide information +and assistance in establishing effective emergency preparedness and response +measures. These measures should include a warning siren or other notifica- +tion of an emergency and citizen information about what to do in the case of +an emergency. + +Government and Community. +1. Set policy with respect to safety and ensure that the policy is enforced. +2. Establish communication channels with hazardous industry in the com- +munity. +3. Establish and monitor information channels about the risks in the community. 
+Collect and disseminate information on hazards, the measures citizens can take +to protect themselves, and what to do in case of an emergency. +4. Encourage citizens to take responsibility for their own safety and to encourage +local, state, and federal government to do the things necessary to protect them. +5. Encourage the establishment of a community safety committee and/or a safety +ombudsman office that is not elected but represents the public in safety-related +decision making. +6. Ensure that safety controls are in place before approving new development in +hazardous areas, and if not (e.g., inadequate roads, communication channels, +emergency response facilities), then perhaps make developers pay for them. +Consider requiring developers to provide an analysis of the impact of new +development on the safety of the community. Hire outside consultants to +evaluate these impact analyses if such expertise is not available locally. +7. Establish an emergency preparedness plan and re-evaluate it periodically to +determine if it is up to date. Include procedures for coordination among emer- +gency responders. + + +8. Plan temporary measures for additional manpower in emergencies. +9. Acquire adequate equipment. +10. Provide drills and ensure alerting and communication channels exist and are +operational. +11. Train emergency responders. +12. Ensure that transportation and other facilities exist for an emergency. +13. Set up formal communications between emergency responders (hospital staff, +police, firefighters, Citichem). Establish emergency plans and means to peri- +odically update them. +One thing to note from this example is that many of the recommendations are +simply good safety management practices. While this particular example involved a +system that was devoid of the standard safety practices common to most industries, +many accident investigations conclude that standard safety management practices +were not observed. This fact points to a great opportunity to prevent accidents simply +by establishing standard safety controls using the techniques described in this book. +While we want to learn as much as possible from each loss, preventing the losses in +the first place is a much better strategy than waiting to learn from our mistakes. +These recommendations and those resulting from other thoroughly investigated +accidents also provide an excellent resource to assist in generating the system safety +requirements and constraints for similar types of systems and in designing improved +safety control structures. +Just investigating the incident or accident is, of course, not enough. Recommenda- +tions must be implemented to be useful. Responsibility must be assigned for ensur- +ing that changes are actually made. In addition, feedback channels should be +established to determine whether the recommendations and changes were success- +ful in reducing risk. + +section 11.11. Experimental Comparisons of CAST with Traditional Accident Analysis. +Although CAST is new, several evaluations have been done, mostly aviation- +related. +Robert Arnold, in a master’s thesis for Lund University, conducted a qualitative +comparison of SOAM and STAMP in an Air Traffic Management (ATM) occur- +rence investigation. SOAM (Systemic Occurrence Analysis Methodology) is used +by Eurocontrol to analyze ATM incidents. In Arnold’s experiment, an incident was +investigated using SOAM and STAMP and the usefulness of each in identifying +systemic countermeasures was compared. 
The results showed that SOAM is a useful +heuristic and a powerful communication device, but that it is weak with respect to + + +emergent phenomena and nonlinear interactions. SOAM directs the investigator to +consider the context in which the events occur, the barriers that failed, and the +organizational factors involved, but not the processes that created them or how +the entire system can migrate toward the boundaries of safe operation. In contrast, +the author concludes, +STAMP directs the investigator more deeply into the mechanism of the interactions +between system components, and how systems adapt over time. STAMP helps identify the +controls and constraints necessary to prevent undesirable interactions between system +components. STAMP also directs the investigation through a structured analysis of the +upper levels of the system’s control structure, which helps to identify high level systemic +countermeasures. The global ATM system is undergoing a period of rapid technological +and political change. . . . The ATM is moving from centralized human controlled systems +to semi-automated distributed decision making. . . . Detailed new systemic models like +STAMP are now necessary to prevent undesirable interactions between normally func- +tioning system components and to understand changes over time in increasingly complex +ATM systems. +Paul Nelson, in another Lund University master’s thesis, used STAMP and CAST +to analyze the crash of Comair 5191 at Lexington, Kentucky, on August 27, 2006, +when the pilots took off from the wrong runway [142]. The accident, of course, has +been thoroughly investigated by the NTSB. Nelson concludes that the NTSB report +narrowly targeted causes and potential solutions. No recommendations were put +forth to correct the underlying safety control structure, which fostered process +model inconsistencies, inadequate and dysfunctional control actions, and unenforced +safety constraints. The CAST analysis, on the other hand, uncovered these useful +levers for eliminating future loss. +Stringfellow compared the use of STAMP, augmented with guidewords for orga- +nizational and human error analysis, with the use of HFACS (Human Factors Analy- +sis and Classification System) on the crash of a Predator-B unmanned aircraft near +Nogales, Arizona [195]. HFACS, based on the Swiss Cheese Model (event-chain +model), is an error-classification list that can be used to label types of errors, prob- +lems, or poor decisions made by humans and organizations [186]. Once again, +although the analysis of the unmanned vehicle based on STAMP found all the +factors found in the published analysis of the accident using HFACS [31, 195], the +STAMP-based analysis identified additional factors, particularly those at higher +levels of the safety control structure, for example, problems in the FAA’s COA3 +approval process. Stringfellow concludes: + +The organizational influences listed in HFACS . . . do not go far enough for engineers to +create recommendations to address organizational problems. . . . Many of the factors cited +in Swiss Cheese-based methods don’t point to solutions; many are just another label for +human error in disguise [195, p. 154]. +In general, most accident analyses do a good job in describing what happened, but +not why. + + +footnote. The COA or Certificate of Operation allows an air vehicle that does not nominally meet FAA safety +standards access to the National Airspace System. 
The COA application process includes measures to +mitigate risks, such as sectioning off the airspace to be used by the unmanned aircraft and preventing +other aircraft from entering the space. + + +section 11.12. Summary. +In this chapter, the process for performing accident analysis using STAMP as the +basis is described and illustrated using a chemical plant accident as an example. +Stopping the analysis at the lower levels of the safety-control structure, in this case +at the physical controls and the plant operators, provides a distorted and incomplete +view of the causative factors in the loss. Both a better understanding of why the +accident occurred and how to prevent future ones are enhanced with a more com- +plete analysis. As the entire accident process becomes better understood, individual +mistakes and actions assume a much less important role in comparison to the role +played by the environment and context in which their decisions and control actions +take place. What may look like an error or even negligence by the low-level opera- +tors and controllers may appear much more reasonable given the full picture. In +addition, changes at the lower levels of the safety-control structure often have much +less ability to impact the causal factors in major accidents than those at higher levels. +At all levels, focusing on assessing blame for the accident does not provide the +information necessary to prevent future accidents. Accidents are complex processes, +and understanding the entire process is necessary to provide recommendations that +are going to be effective in preventing a large number of accidents and not just +preventing the symptoms implicit in a particular set of events. There is too much +repetition of the same causes of accidents in most industries. We need to improve +our ability to learn from the past. +Improving accident investigation may require training accident investigators in +systems thinking and in the types of environmental and behavior shaping factors to +consider during an analysis, some of which are discussed in later chapters. Tools to +assist in the analysis, particularly graphical representations that illustrate interactions +and causality, will help. But often the limitations of accident reports do not stem from +the sincere efforts of the investigators but from political and other pressures to limit +the causal factors identified to those at the lower levels of the management or politi- +cal hierarchy. Combating these pressures is beyond the scope of this book. Removing +blame from the process will help somewhat. Management also has to be educated to +understand that safety pays and, in the longer term, costs less than the losses that +result from weak safety programs and incomplete accident investigations. \ No newline at end of file diff --git a/chapter11.txt b/chapter11.txt new file mode 100644 index 0000000..ecfb9b9 --- /dev/null +++ b/chapter11.txt @@ -0,0 +1,1237 @@ +Chapter 11. +Analyzing Accidents and Incidents .(CAST). +The causality model used in accident or incident analysis determines what we look +for, how we go about looking for “facts,” and what we see as relevant. In our experience using STAMP-based accident analysis, we find that even if we use only the +information presented in an existing accident report, we come up with a very different view of the accident and its causes. +Most accident reports are written from the perspective of an event-based model. 
+They almost always clearly describe the events and usually one or several of these
+events is chosen as the “root causes.” Sometimes “contributory causes” are identified. But the analysis of why those events occurred is usually incomplete. The analysis frequently stops after finding someone to blame, usually a human operator, and
+the opportunity to learn important lessons is lost.
+An accident analysis technique should provide a framework or process to assist in
+understanding the entire accident process and identifying the most important systemic causal factors involved. This chapter describes an approach to accident analysis, based on STAMP, called CAST .(Causal Analysis based on STAMP). CAST can
+be used to identify the questions that need to be answered to fully understand why
+the accident occurred. It provides the basis for maximizing learning from the events.
+The use of CAST does not lead to identifying single causal factors or variables.
+Instead it provides the ability to examine the entire sociotechnical system design to
+identify the weaknesses in the existing safety control structure and to identify
+changes that will not simply eliminate symptoms but potentially all the causal
+factors, including the systemic ones.
+One goal of CAST is to get away from assigning blame and instead to shift the
+focus to why the accident occurred and how to prevent similar losses in the future.
+To accomplish this goal, it is necessary to minimize hindsight bias and instead to
+determine why people behaved the way they did, given the information they had at
+the time.
+An example of the results of an accident analysis using CAST is presented in
+chapter 5. Additional examples are in appendixes B and C. This chapter describes
+
+
+the steps to go through in producing such an analysis. An accident at a fictional
+chemical plant called Citichem is used to demonstrate the process. The accident scenario was developed by Risk Management Pro to train accident investigators and describes a realistic accident process similar to many accidents that have
+occurred in chemical plants. While the loss involves release of a toxic chemical, the
+analysis serves as an example of how to do an accident or incident analysis for any
+industry.
+An accident investigation process is not being specified here, but only a way to
+document and analyze the results of such a process. Accident investigation is a much
+larger topic that goes beyond the goals of this book. This chapter only considers
+how to analyze the data once it has been collected and organized. The accident
+analysis process described in this chapter does, however, contribute to determining
+what questions should be asked during the investigation. When attempting to apply
+STAMP-based analysis to existing accident reports, it often becomes apparent that
+crucial information was not obtained, or at least not included in the report, that
+is needed to fully understand why the loss occurred and how to prevent future
+occurrences.
+
+footnote. Maggie Stringfellow and John Thomas, two MIT graduate students, contributed to the CAST analysis
+of the fictional accident used in this chapter.
+
+section 11.1.
+The General Process of Applying STAMP to Accident Analysis.
+In STAMP, an accident is regarded as involving a complex process, not just individual events. Accident analysis in CAST then entails understanding the dynamic
+process that led to the loss.
That accident process is documented by showing the +sociotechnical safety control structure for the system involved and the safety constraints that were violated at each level of this control structure and why. The analysis results in multiple views of the accident, depending on the perspective and level +from which the loss is being viewed. +Although the process is described in terms of steps or parts, no implication is +being made that the analysis process is linear or that one step must be completed +before the next one is started. The first three steps are the same ones that form the +basis of all the STAMP-based techniques described so far. +1. Identify the system(s). and hazard(s). involved in the loss. +2. Identify the system safety constraints and system requirements associated with +that hazard. +3. Document the safety control structure in place to control the hazard and +enforce the safety constraints. This structure includes the roles and responsibilities of each component in the structure as well as the controls provided or +created to execute their responsibilities and the relevant feedback provided to +them to help them do this. This structure may be completed in parallel with +the later steps. +4. Determine the proximate events leading to the loss. +5. Analyze the loss at the physical system level. Identify the contribution of each +of the following to the events. physical and operational controls, physical failures, dysfunctional interactions, communication and coordination flaws, and +unhandled disturbances. Determine why the physical controls in place were +ineffective in preventing the hazard. +6. Moving up the levels of the safety control structure, determine how and why +each successive higher level allowed or contributed to the inadequate control +at the current level. For each system safety constraint, either the responsibility +for enforcing it was never assigned to a component in the safety control structure or a component or components did not exercise adequate control to +ensure their assigned responsibilities .(safety constraints). were enforced in the +components below them. Any human decisions or flawed control actions need +to be understood in terms of .(at least). the information available to the decision maker as well as any required information that was not available, the +behavior-shaping mechanisms .(the context and influences on the decisionmaking process), the value structures underlying the decision, and any flaws +in the process models of those making the decisions and why those flaws +existed. +7. Examine overall coordination and communication contributors to the loss. +8. Determine the dynamics and changes in the system and the safety control +structure relating to the loss and any weakening of the safety control structure +over time. +9. Generate recommendations. +In general, the description of the role of each component in the control structure +will include the following. +1.•Safety Requirements and Constraints +2.•Controls +3.•Context +3.1.– Roles and responsibilities +3.2.– Environmental and behavior-shaping factors +4.•Dysfunctional interactions, failures, and flawed decisions leading to erroneous +control actions + +5.Reasons for the flawed control actions and dysfunctional interactions +5.1.– Control algorithm flaws +5.2.– Incorrect process or interface models. 
+5.3.– Inadequate coordination or communication among multiple controllers +5.4.– Reference channel flaws +5.5.– Feedback flaws +The next sections detail the steps in the analysis process, using Citichem as a +running example. + + +section 11.2. +Creating the Proximal Event Chain. +While the event chain does not provide the most important causality information, +the basic events related to the loss do need to be identified so that the physical +process involved in the loss can be understood. +For Citichem, the physical process events are relatively simple. A chemical reaction occurred in storage tanks 701 and 702 of the Citichem plant when the chemical +contained in the tanks, K34, came in contact with water. K34 is made up of some +extremely toxic and dangerous chemicals that react violently to water and thus need +to be kept away from it. The runaway reaction led to the release of a toxic cloud of +tetrachloric cyanide .(TCC). gas, which is flammable, corrosive, and volatile. The TCC +blew toward a nearby park and housing development, in a city called Oakbridge, +killing more than four hundred people. +The direct events leading to the release and deaths are. +1. Rain gets into tank 701 .(and presumably 702), both of which are in Unit 7 of +the Citichem Oakbridge plant. Unit 7 was shut down at the time due to +lowered demand for K34. +2. Unit 7 is restarted when a large order for K34 is received. +3. A small amount of water is found in tank 701 and an order is issued to make +sure the tank is dry before startup. +4. T34 transfer is started at unit 7. +5. The level gauge transmitter in the 701 storage tank shows more than it +should. +6. A request is sent to maintenance to put in a new level transmitter. +7. The level transmitter from tank 702 is moved to tank 701. .(Tank 702 is used +as a spare tank for overflow from tank 701 in case there is a problem.) +8. Pressure in Unit 7 reads as too high. + + +9. The backup cooling compressor is activated. +10. Tank 701 temperature exceeds 12 degrees Celsius. +11. A sample is run, an operator is sent to check tank pressure, and the plant +manager is called. +12. Vibration is detected in tank 701. +13. The temperature and pressure in tank 701 continue to increase. +14. Water is found in the sample that was taken .(see event 11). +15. Tank 701 is dumped into the spare tank 702 +16. A runaway reaction occurs in tank 702. +17. The emergency relief valve jams and runoff is not diverted into the backup +scrubber. +18. An uncontrolled gas release occurs. +19. An alarm sounds in the plant. +20. Nonessential personnel are ordered into units 2 and 3, which have positive +pressure and filtered air. +21. People faint outside the plant fence. +22. Police evacuate a nearby school. +23. The engineering manager calls the local hospital, gives them the chemical +name and a hotline phone number to learn more about the chemical. +24. The public road becomes jammed and emergency crews cannot get into the +surrounding community. +25. Hospital personnel cannot keep up with steady stream of victims. +26. Emergency medical teams are airlifted in. +These events are presented as one list here, but separation into separate interacting +component event chains may be useful sometimes in understanding what happened, +as shown in the friendly fire event description in chapter 5. +The Citichem event chain here provides a superficial analysis of what happened. +A deep understanding of why the events occurred requires much more information. 
Remember that the goal of a STAMP-based analysis is to determine why the events occurred, not who to blame for them, and to identify the changes that could prevent them and similar events in the future.

section 11.3. Defining the System(s) and Hazards Involved in the Loss.
Citichem has two relevant physical processes being controlled: the physical plant and public health. Because separate and independent controllers were controlling these two processes, it makes sense to consider them as two interacting but independent systems: (1) the chemical company, which controls the chemical process, and (2) the public political structure, which has responsibilities for public health.
Figure 11.1 shows the major components of the two safety control structures and the interactions between them. Only the major structures are shown in the figure; the details will be added throughout this chapter (see the footnote below). No information was provided about the design and engineering process for the Citichem plant in the accident description, so details about it are omitted. A more complete example of a development control structure and analysis of its role can be found in appendix B.
The analyst(s) also needs to identify the hazard(s) being avoided and the safety constraint(s) to be enforced. An accident or loss event for the combined chemical plant and public health structure can be defined as death, illness, or injury due to exposure to toxic chemicals.
The hazards being controlled by the two control structures are related but different. The public health structure hazard is exposure of the public to toxic chemicals. The system-level safety constraints for the public health control system are:
1. The public must not be exposed to toxic chemicals.
2. Measures must be taken to reduce exposure if it occurs.
3. Means must be available, effective, and used to treat exposed individuals outside the plant.
The hazard for the chemical plant process is uncontrolled release of toxic chemicals. Accordingly, the system-level constraints are:
1. Chemicals must be under positive control at all times.
2. Measures must be taken to reduce exposure if inadvertent release occurs.
3. Warnings and other measures must be available to protect workers in the plant and minimize losses to the outside community.
4. Means must be available, effective, and used to treat exposed individuals inside the plant.
Hazards and safety constraints must be within the design space of those who developed the system and within the operational space of those who operate it. For example, the chemical plant designers cannot be responsible for those things outside the boundaries of the chemical plant over which they have no control, although they may have some influence over them. Control over the environment of a plant is usually the responsibility of the community and various levels of government. As another example, while the operators of the plant may cooperate with local officials in providing public health and emergency response facilities, responsibility for this function normally lies in the public domain. Similarly, while the community and local government may have some influence on the design of the chemical plant, the company engineers and managers control detailed design and operations.
Once the goals and constraints are determined, the controls in place to enforce them must be identified.
Footnote. OSHA, the Occupational Safety and Health Administration, is part of a third, larger governmental control structure, which has many other components. For simplicity, only OSHA is shown and considered in the example analysis.

section 11.4. Documenting the Safety Control Structure.
If STAMP has been used as the basis for previous safety activities, such as the original engineering process or the investigation and analysis of previous incidents and accidents, a model of the safety-control structure may already exist. If not, it must be created, although it can then be reused in the future. Chapters 12 and 13 provide information about the design of safety-control structures.
The components of the structure as well as each component's responsibility with respect to enforcing the system safety constraints must be identified. Determining what these are (or what they should be) can start from system safety requirements. The following are some example system safety requirements that might be appropriate for the Citichem chemical plant example.
1. Chemicals must be stored in their safest form.
2. The amount of toxic chemicals stored should be minimized.
3. Release of toxic chemicals and contamination of the environment must be prevented.
4. Safety devices must be operable and properly maintained at all times when potentially toxic chemicals are being processed or stored.
5. Safety equipment and emergency procedures (including warning devices) must be provided to reduce exposure in the event of an inadvertent chemical release.
6. Emergency procedures and equipment must be available and operable to treat exposed individuals.
7. All areas of the plant must be accessible to emergency personnel and equipment during emergencies. Delays in providing emergency treatment must be minimized.
8. Employees must be trained to:
a. Perform their jobs safely and understand proper use of safety equipment.
b. Understand their responsibilities with regard to safety and the hazards related to their job.
c. Respond appropriately in an emergency.
9. Those responsible for safety in the surrounding community must be educated about potential hazards from the plant and provided with information about how to respond appropriately.
A similar list of safety-related requirements and responsibilities might be generated for the community safety control structure.
These general system requirements must be enforced somewhere in the safety control structure. As the accident analysis proceeds, they are used as the starting point for generating more specific constraints, such as constraints for the specific chemicals being handled. For example, requirement 4, when instantiated for TCC, might generate a requirement to prevent contact of the chemical with water. As the accident analysis proceeds, the identified responsibilities of the components can be mapped to the system safety requirements, the opposite of the forward tracing used in safety-guided design. If STPA was used in the design or analysis of the system, then the safety control structure documentation should already exist.
In some cases, general requirements and policies for an industry are established by the government or by professional associations. These can be used during an accident analysis to assist in comparing the actual safety control structure (both in the plant and in the community) at the time of the accident with the standards or best practices of the industry and country.
Accident analyses can in this way be made less arbitrary, and more guidance can be provided to the analysts as to what should be considered inadequate control.
The specific designed controls need not all be identified before the rest of the analysis starts. Additional controls will be identified as the analysts go through the next steps of the process, but a good start can usually be made early in the analysis process.

section 11.5. Analyzing the Physical Process.
Analysis starts with the physical process, identifying the physical and operational controls and any potential physical failures, dysfunctional interactions and communication, or unhandled external disturbances that contributed to the events. The goal is to determine why the physical controls in place were ineffective in preventing the hazard. Most accident analyses do a good job of identifying the physical contributors to the events.
Figure 11.2 shows the requirements and controls at the Citichem physical plant level as well as failures and inadequate controls. The physical contextual factors contributing to the events are included.
The most likely reason for water getting into tanks 701 and 702 was inadequate controls to keep water out during a recent rainstorm (an unhandled external disturbance to the system in figure 4.8), but there is no way to determine that for sure.
Accident investigations, when the events and physical causes are not obvious, often make use of a hazard analysis technique, such as fault trees, to create scenarios to consider. STPA can be used for this purpose. Using control diagrams of the physical system, scenarios can be generated that could lead to the lack of enforcement of the safety constraint(s) at the physical level. The safety design principles in chapter 9 can provide assistance in identifying design flaws.
As is common in the process industry, the physical plant safety equipment (controls) at Citichem was designed as a series of barriers to satisfy the system safety constraints identified earlier, that is, to protect against runaway reactions, protect against inadvertent release of toxic chemicals or an explosion (uncontrolled energy), convert any released chemicals into a non-hazardous or less hazardous form, provide protection against human or environmental exposure after release, and provide emergency equipment to treat exposed individuals. Citichem had the standard types of safety equipment installed, including gauges and other indicators of the physical system state. In addition, it had an emergency relief system and devices to minimize the danger from released chemicals, such as a scrubber to reduce the toxicity of any released chemicals and a flare tower to burn off gas before it gets into the atmosphere.
A CAST accident analysis examines the controls to determine which ones did not work adequately and why. While a reasonable number of physical safety controls were provided at Citichem, much of this equipment was inadequate or not operational, a common finding after chemical plant accidents.
In particular, rainwater got into the tank, which implies the tanks were not adequately protected against rain despite the serious hazard created by the mixing of TCC with water. While the inadequate protection against rainwater should be investigated, no information was provided in the Citichem accident description. Did the hazard analysis process, which in the process industry often involves HAZOP, identify this hazard?
If not, then the hazard analysis process used by the company needs to be examined to determine why an important factor was omitted. If it was not omitted, then the flaw lies in the translation of the hazard analysis results into protection against the hazard in the design and operations. Were controls to protect against water getting into the tank provided? If not, why not? If so, why were they ineffective?
Critical gauges and monitoring equipment were missing or inoperable at the time of the runaway reaction. As one important example, the plant at the time of the accident had no operational level indicator on tank 702, despite the fact that this equipment provided safety-critical information. One task for the accident analysis, then, is to determine whether the indicator was designated as safety-critical, which would (or should) trigger more controls at the higher levels, such as higher priority in maintenance activities. The inoperable level indicator also indicates a need to look at higher levels of the control structure that are responsible for providing and maintaining safety-critical equipment.
As a final example, the design of the emergency relief system was inadequate. The emergency relief valve jammed and excess gas could not be sent to the scrubber. The pop-up relief valves in Unit 7 (and Unit 9) at the plant were too small to allow the venting of the gas if non-gas material was present. The relief valve lines were also too small to relieve the pressure fast enough, in effect providing a single point of failure for the emergency relief system. Why an inadequate design existed also needs to be examined in the higher-level control structure. What group was responsible for the design, and why did a flawed design result? Or was the design originally adequate but conditions changed over time?
The physical contextual factors identified in figure 11.2 play a role in the accident causal analysis, such as the limited access to the plant, but their importance becomes obvious only at higher levels of the control structure.
At this point of the analysis, several recommendations are reasonable: add protection against rainwater getting into the tanks, change the design of the valves and vent pipes in the emergency relief system, put a level indicator on tank 702, and so on. Accident investigations often stop here with the physical process analysis or go one step higher to determine what the operators (the direct controllers of the physical process) did wrong.
The other physical process being controlled here, public health, must be examined in the same way. There were very few controls over public health instituted in Oakbridge, the community surrounding the plant, and the ones that did exist were inadequate. The public had no training in what to do in case of an emergency, the emergency response system was woefully inadequate, and unsafe development was allowed, such as the creation of a children's park right outside the walls of the plant. The reasons for these inadequacies, as well as the inadequacies of the controls on the physical plant process, are considered in the next section.

section 11.6. Analyzing the Higher Levels of the Safety Control Structure.
While the physical control inadequacies are relatively easy to identify in the analysis and are usually handled well in any accident analysis, understanding why those physical failures or design inadequacies existed requires examining the higher levels of safety control.
Fully understanding the behavior at any level of the sociotechnical safety control structure requires understanding how and why the control at the next higher level allowed or contributed to the inadequate control at the current level. Most accident reports include some of the higher-level factors, but usually incompletely and inconsistently, and they focus on finding someone or something to blame.
Each relevant component of the safety control structure, starting with the lowest physical controls and progressing upward to the social and political controls, needs to be examined. How are the components to be examined determined? Considering everything is not practical or cost-effective. By starting at the bottom, the relevant components to consider can be identified. At each level, the flawed behavior or inadequate controls are examined to determine why the behavior occurred and why the controls at higher levels were not effective at preventing that behavior. For example, in the STAMP-based analysis of an accident where an aircraft took off from the wrong runway during construction at the airport, it was discovered that the airport maps provided to the pilot were out of date. That led to examining the procedures at the company that provided the maps and the FAA procedures for ensuring that maps are up to date.
Stopping after identifying inadequate control actions by the lower levels of the safety control structure is common in accident investigation. The result is that the cause is attributed to "operator error," which does not provide enough information to prevent accidents in the future. It also does not overcome the problems of hindsight bias. In hindsight, it is always possible to see that a different behavior would have been safer. But the information necessary to identify that safer behavior is usually only available after the fact. To improve safety, we need to understand the reasons people acted the way they did. Then we can determine if and how to change conditions so that better decisions can be made in the future.
The analyst should start from the assumption that most people have good intentions and do not purposely cause accidents. The goal then is to understand why people did not or could not act differently. People acted the way they did for very good reasons; we need to understand why the behavior of the people involved made sense to them at the time.
Identifying these reasons requires examining the context and behavior-shaping factors in the safety control structure that influenced that behavior. What contextual factors should be considered? Usually the important contextual and behavior-shaping factors become obvious in the process of explaining why people acted the way they did. Stringfellow has suggested a set of general factors to consider:
•History. Experiences, education, cultural norms, and behavioral patterns; that is, how the historical context of a controller or organization may impact their ability to exercise adequate control.
•Resources. Staff, finances, time.
•Tools and Interfaces. Quality, availability, design, and accuracy of tools. Tools may include such things as risk assessments, checklists, and instruments, as well as the design of interfaces such as displays, control levers, and automated tools.
•Training.
•Human Cognition Characteristics. Person–task compatibility, individual tolerance of risk, control role, innate human limitations.
•Pressures.
Time, schedule, resource, production, incentive, compensation, political. Pressures can include any positive or negative force that can influence behavior.
•Safety Culture. Values and expectations around such things as incident reporting, workarounds, and safety management procedures.
•Communication. How the communication techniques, form, styles, or content impacted behavior.
•Human Physiology. Intoxication, sleep deprivation, and the like.
We also need to look at the process models used in the decision making. What information did the decision makers have, or what information did they need, related to the inadequate control actions? What other information could they have had that would have changed their behavior? If the analysis determines that the person was truly incompetent (not usually the case), then the focus shifts to asking why an incompetent person was hired to do this job and why they were retained in their position. A useful method to assist in understanding human behavior is to show the process model of the human controller at each important event in which he or she participated, that is, what information they had about the controlled process when they made their decisions.
Let's follow some of the physical plant inadequacies up the safety control structure at Citichem. Three examples of STAMP-based analyses of the inadequate control at Citichem are shown in figure 11.3: a maintenance worker, the maintenance manager, and the operations manager.
During the investigation, it was discovered that a maintenance worker had found water in tank 701. He was told to check the Unit 7 tanks to ensure they were ready for the T34 production startup. Unit 7 had been shut down previously (see "Physical Plant Context"). The startup was scheduled for 10 days after the decision to produce additional K34 was made. The worker found a small amount of water in tank 701, reported it to the maintenance manager, and was told to make sure the tank was "bone dry." However, water was found in the sample taken from tank 701 right before the uncontrolled reaction. It is unknown (and probably unknowable) whether the worker did not get all the water out or more water entered later, through the same path it entered previously or via a different path. We do know he was fatigued and working a fourteen-hour day, and he may not have had time to do the job properly. He also believed that the tank's residual water was from condensation, not rain. No independent check was made to determine whether all the water was removed.
Some potential recommendations from what has been described so far include establishing procedures for quality control and checking safety-critical activities. Any existence of a hazardous condition, such as finding water in a tank that is to be used to produce a chemical that is highly reactive to water, should trigger an in-depth investigation of why it occurred before any dangerous operations are started or restarted. In addition, procedures should be instituted to ensure that those performing safety-critical operations have the appropriate skills, knowledge, and physical resources, which, in this case, include adequate rest. Independent checks of critical activities also seem to be needed.
The maintenance worker was just following the orders of the maintenance manager, so the role of maintenance management in the safety-control structure also needs to be investigated. The runaway reaction was the result of TCC coming in contact with water.
The operator who worked for the maintenance manager told +him about finding water in tank 701 after the rain and was directed to remove it. +The maintenance manager does not tell him to check the spare tank 702 for water +and does not appear to have made any other attempts to perform that check. He +apparently accepted the explanation of condensation as the source of the water and +did not, therefore, investigate the leak further. +Why did the maintenance manager, a long-time employee who had always been +safety conscious in the past, not investigate further? The maintenance manager was +working under extreme time pressure and with inadequate staff to perform the jobs +that were necessary. There was no reporting channel to someone with specified +responsibility for investigating hazardous events, such as finding water in a tank +used for a toxic chemical that should never contact water. Normally an investigation +would not be the responsibility of the maintenance manager but would fall under +the purview of the engineering or safety engineering staff. There did not appear to +be anyone at Citichem with the responsibility to perform the type of investigation +and risk analysis required to understand the reason for water being in the tank. Such +events should be investigated thoroughly by a group with designated responsibility +for process safety, which presumes, of course, such a group exists. +The maintenance manager did protest .(to the plant manager). about the unsafe +orders he was given and the inadequate time and resources he had to do his job +adequately. At the same time, he did not tell the plant manager about some of the +things that had occurred. For example, he did not inform the plant manager about +finding water in tank 701. If the plant manager had known these things, he might +have acted differently. There was no problem-reporting system in this plant for such +information to be reliably communicated to decision makers. Communication relied +on chance meetings and informal channels. +Lots of recommendations for changes could be generated from this part of +the analysis, such as providing rigorous procedures for hazard analysis when a hazardous condition is detected and training and assigning personnel to do such an +analysis. Better communication channels are also indicated, particularly problem +reporting channels. + + +The operations manager .(figure 11.3). also played a role in the accident process. +He too was under extreme pressure to get Unit 7 operational. He was unaware that +the maintenance group had found water in tank 701 and thought 702 was empty. +During the effort to get Unit 7 online, the level indicator on tank 701 was found to +be not working. When it was determined that there were no spare level indicators +at the plant and that delivery would require two weeks, he ordered the level indicator on 702 to be temporarily placed on tank 701.tank 702 was only used for overflow in case of an emergency, and he assessed the risk of such an emergency as low. +This flawed decision clearly needs to be carefully analyzed. What types of risk and +safety analyses were performed at Citichem? What training was provided on the +hazards? What policies were in place with respect to disabling safety-critical equipment? Additional analysis also seems warranted for the inventory control procedures at the plant and determining why safety-critical replacement parts were +out of stock. 
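As suggested above, showing the controller's process model at each important decision point helps explain why a flawed control action made sense at the time. The following short Python sketch is one possible way to record such a snapshot and flag the mismatches; the field names and values are assumptions based on the narrative above, not data from the Citichem analysis itself.

# Illustrative process-model snapshot for one decision point.
operations_manager_snapshot = {
    "controller": "Operations manager",
    "decision": "Move the level transmitter from tank 702 to tank 701",
    "believed": {
        "tank_702_contents": "empty",
        "water_found_in_tank_701": False,
        "risk_of_needing_spare_tank": "low",
    },
    "actual": {
        "tank_702_contents": "unknown, possibly containing water",
        "water_found_in_tank_701": True,
        "risk_of_needing_spare_tank": "high",
    },
}

# Flag every process-model variable whose believed value differs from the actual one.
for variable, believed in operations_manager_snapshot["believed"].items():
    actual = operations_manager_snapshot["actual"][variable]
    if believed != actual:
        print(f"{variable}: believed {believed!r}, actually {actual!r}")

Laying the believed and actual values side by side in this way makes it easier to ask why the controller's model was wrong and what feedback would have corrected it, rather than simply labeling the decision an error.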
Clearly, safety margins were reduced at Citichem when operations continued despite serious failures of safety devices. Nobody noticed the degradation in safety. Any change of the sort that occurred here, such as startup of operations in a previously shut down unit or temporary removal of safety-critical equipment, should have triggered a hazard analysis and a management of change (MOC) process. Many accidents in the chemical industry (and others) involve unsafe workarounds. The causal analysis so far should trigger additional investigation to determine whether adequate management of change and control of work procedures had been provided but not enforced, or were not provided at all. The first step in such an analysis is to determine who was responsible (if anyone) for creating such procedures and who was responsible for ensuring they were followed. The goal again is not to find someone to blame but simply to identify the flaws in the process for running Citichem so they can be fixed.
At this point, it appears that decision making by higher-level management (above the maintenance and operations managers) and management controls were inadequate at Citichem. Figures 11.4 and 11.5 show example STAMP-based analysis results for the Citichem plant manager and Citichem corporate management. The plant manager made many unsafe decisions and issued unsafe control actions that directly contributed to the accident, or did not initiate control actions necessary for safety (as shown in figure 11.4). At the same time, it is clear that he was under extreme pressure to increase production and was missing information necessary to make better decisions. An appropriate safety control structure at the plant had not been established, leading to unsafe operational practices and inaccurate risk assessment by most of the managers, especially those higher in the control structure. Some of the lower-level employees tried to warn against the high-risk practices, but appropriate communication channels had not been established to express these concerns.
Safety controls were almost nonexistent at the corporate management level. The upper levels of management provided inadequate leadership, oversight, and management of safety. There was either no adequate company safety policy or it was not followed, either of which would lead to further causal analysis. A proper process safety management system clearly did not exist at Citichem. Management was under great competitive pressure, which may have led to ignoring corporate safety controls, or adequate controls may never have been established. Everyone had very flawed mental models of the risks of increasing production without taking the proper precautions. The recommendations should include consideration of what kinds of changes might be made to provide better information to management decision makers about risks and about the state of plant operations with respect to safety.
Like any major accident, when analyzed thoroughly, the process leading to the loss is complex and multifaceted. A complete analysis of this accident is not needed here. But a look at some of the factors involved in the plant's environment, including the control of public health, is instructive.
Figure 11.6 shows the STAMP-based analysis of the Oakbridge city emergency-response system. Planning was totally inadequate or out of date.
The fire department +did not have the proper equipment and training for a chemical emergency, the hospital also did not have adequate emergency resources or a backup plan, and the +evacuation plan was ten years out of date and inadequate for the current level of +population. +Understanding why these inadequate controls existed requires understanding the +context and process model flaws. For example, the police chief had asked for +resources to update equipment and plans, but the city had turned him down. Plans +had been made to widen the road to Oakbridge so that emergency equipment could +be brought in, but those plans were never implemented and the planners never went +back to their plans to see if they were realistic for the current conditions. Citichem +had a policy against disclosing what chemicals they produce and use, justifying this +policy by the need for secrecy from their competitors, making it impossible for the +hospital to stockpile the supplies and provide the training required for emergencies, +all of which contributed to the fatalities in the accident. The government had no +disclosure laws requiring chemical companies to provide such information to emergency responders. +Clear recommendations for changes result from this analysis, for example, updating evacuation plans and making changes to the planning process. But again, stopping at this level does not help to identify systemic changes that could improve +community safety. The analysts should work their way up the control structure to +understand the entire accident process. For example, why was an inadequate emergency response system allowed to exist? + + +The analysis in figure 11.7 helps to answer this question. For example, the +members of the city government had inadequate knowledge of the hazards associated with the plant, and they did not try to obtain more information about them or +about the impact of increased development close to the plant. At the same time, +they turned down requests for the funding to upgrade the emergency response +system as the population increased as well as attempts by city employees to provide +emergency response pamphlets for the citizens and set up appropriate communication channels. +Why did they make what in retrospect look like such bad decisions? With inadequate knowledge about the risks, the benefits of increased development were +ranked above the dangers from the plant in the priorities used by the city managers. +A misunderstanding about the dangers involved in the chemical processing at +the plant contributed also to the lack of planning and approval for emergencypreparedness activities. +The city government officials were subjected to pressures from local developers +and local businesses that would benefit financially from increased development. The +developer sold homes before the development was approved in order to increase +pressure on the city council. He also campaigned against a proposed emergency +response pamphlet for local residents because he was afraid it would reduce his +sales. The city government was subjected to additional pressure from local businessmen who wanted more development in order to increase their business and profits. +The residents did not provide opposing pressure to counteract the business +influences and trusted that government would protect them. No community organizations existed to provide oversight of the local government safety controls and +to ensure that government was adequately considering their health and safety needs +(figure 11.8). 
The city manager had the right instincts and concern for public safety, but she lacked the freedom to make decisions on her own and the clout to influence the mayor or city council. She was also subject to external pressures to back down on her demands and had no structure to assist her in resisting those pressures.
In general, there are few requirements for serving on city councils. In the United States, they are often made up primarily of those with conflicts of interest, such as real estate agents and developers. Mayors of small communities are often not paid a full salary and must therefore have other sources of income, and city council members are likely to be paid even less, if at all.
If community-level management is unable to provide adequate controls, controls might be enforced by higher levels of government. A full analysis of this accident would consider what controls existed at the state and federal levels and why they were not effective in preventing the accident.

section 11.7. A Few Words about Hindsight Bias and Examples.
One of the most common mistakes in accident analyses is the use of hindsight bias. Words such as "could have" or "should have" in accident reports are judgments that are almost always the result of such bias. It is not the role of the accident analyst to render judgment in terms of what people did or did not do (although that needs to be recorded) but to understand why they acted the way they did.
Although hindsight bias is usually applied to the operators in an accident report, because most accident reports focus on the operators, it theoretically could be applied to people at any level of the organization: "The plant manager should have known …"
The biggest problem with hindsight bias in accident reports is not that it is unfair (which it usually is), but that an opportunity to learn from the accident and prevent future occurrences is lost. It is always possible to identify a better decision in retrospect (or there would not have been a loss or near miss), but it may have been difficult or impossible to identify that the decision was flawed at the time it had to be made. To improve safety and to reduce errors, we need to understand why the decision made sense to the person at the time and redesign the system to help people make better decisions.
Accident investigation should start with the assumption that most people have good intentions and do not purposely cause accidents. The goal of the investigation, then, is to understand why they did the wrong thing in that particular situation. In particular, what were the contextual or systemic factors and flaws in the safety control structure that influenced their behavior? Often, the person had an inaccurate view of the state of the process and, given that view, did what appeared to be the right thing at the time but turned out to be wrong with respect to the actual state. The solution then is to redesign the system so that the controller has better information on which to make decisions.
As an example, consider a real accident report on a chemical overflow from a tank, which injured several workers in the vicinity. The control room operator issued an instruction to open a valve to start the flow of liquid into the tank. The flow meter did not indicate a flow, so the control room operator asked an outside operator to check the manual valves near the tank to see if they were closed.
+The control room operator believed that the valves were normally left in an open +position to facilitate conducting the operation remotely. The tank level at this time +was 7.2 feet. +The outside operator checked and found the manual valves at the tank open. The +outside operator also saw no indication of flow on the flow meter and made an effort +to visually verify that there was no flow. He then began to open and close the valves +manually to try to fix the problem. He reported to the control room operator that +he heard a clunk that may have cleared an obstruction, and the control room operator tried opening the valve remotely again. Both operators still saw no flow on the +flow meter. The outside operator at this time got a call to deal with a problem in a +different part of the plant and left. He did not make another attempt to visually verify +if there was flow. The control room operator left the valve in the closed position. In +retrospect, it appears that the tank level at this time was approximately 7.7 feet. +Twelve minutes later, the high-level alarm on the tank sounded in the control +room. The control room operator acknowledged the alarm and turned it off. In +retrospect, it appears that the tank level at this time was approximately 8.5 feet, +although there was no indication of the actual level on the control board. The control +room operator got an alarm about an important condition in another part of the +plant and turned his attention to dealing with that alarm. A few minutes later, the +tank overflowed. +The accident report concluded, “The available evidence should have been sufficient to give the control room operator a clear indication that .(the tank). was indeed +filling and required immediate attention.” This statement is a classic example of +hindsight bias.note the use of the words “should have …” The report does not + +identify what that evidence was. In fact, the majority of the evidence that both +operators had at this time was that the tank was not filling. +To overcome hindsight bias, it is useful to examine exactly what evidence the +operators had at time of each decision in the sequence of events. One way to do +this is to draw the operator’s process model and the values of each of the relevant +variables in it. In this case, both operators thought the control valve was closed.the +control room operator had closed it and the control panel indicated that it was +closed, the flow meter showed no flow, and the outside operator had visually checked +and there was no flow. The situation is complicated by the occurrence of other +alarms that the operators had to attend to at the same time. +Why did the control board show the control valve was closed when it must have +actually been open? It turns out that there is no way for the control room operator +to get confirmation that the valve has actually closed after he commands it closed. +The valve was not equipped with a valve stem position monitor, so the control +room operator only knows that a signal has gone to the valve for it to close but not +whether it has actually done so. The operators in many accidents, including Three +Mile Island, have been confused about the actual position of valves due to similar +designs. +An additional complication is that while there is an alarm in the tank that should +sound when the liquid level reaches 7.5 feet, that alarm was not working at the time, +and the operator did not know it was not working. 
So the operator had extra reason +to believe the liquid level had not risen above 7.5 feet, given that he believed there +was no flow into the tank and the 7.5-foot alarm had not sounded. The level transmitter .(which provided the information to the 7.5-foot alarm). had been operating +erratically for a year and a half, but a work order had not been written to repair it +until the month before. It had supposedly been fixed two weeks earlier, but it clearly +was not working at the time of the spill. +The investigators, in retrospect knowing that there indeed had to have been some +flow, suggested that the control room operator “could have” called up trend data on +the control board and detected the flow. But this suggestion is classic hindsight bias. +The control room operator had no reason to perform this extra check and was busy +taking care of critical alarms in other parts of the plant. Dekker notes the distinction +between data availability, which is what can be shown to have been physically available somewhere in the situation, and data observability, which is what was observable given the features of the interface and the multiple interleaving tasks, goals, +interests, and knowledge of the people looking at it . The trend data were available to the control room operator, but they were not observable without taking +special actions that did not seem necessary at the time. +While that explains why the operator did not know the tank was filling, it does +not fully explain why he did not respond to the high-level alarm. The operator said +that he thought the liquid was “tickling” the sensor and triggering a false alarm. The + + +accident report concludes that the operator should have had sufficient evidence the +tank was indeed filling and responded to the alarm. Not included in the official +accident report was the fact that nuisance alarms were relatively common in this +unit. they occurred for this alarm about once a month and were caused by sampling +errors or other routine activities. This alarm had never previously signaled a serious +problem. Given that all the observable evidence showed the tank was not filling and +that the operator needed to respond to a serious alarm in another part of the plant +at the time, the operator not responding immediately to the alarm does not seem +unreasonable. +An additional alarm was involved in the sequence of events. This alarm was at +the tank and denoted that a gas from the liquid in the tank was detected in the air +outside the tank. The outside operator went to investigate. Both operators are +faulted in the report for waiting thirty minutes to sound the evacuation horn after +this alarm went off. The official report says. +Interviews with operations personnel did not produce a clear reason why the response to +the alarm took 31 minutes. The only explanation was that there was not a sense of +urgency since, in their experience, previous alarms were attributed to minor releases +that did not require a unit evacuation. +This statement is puzzling, because the statement itself provides a clear explanation +for the behavior, that is, the previous experience. In addition, the alarm maxed out +at 25 ppm, which is much lower than the actual amount in the air, but the control +room operator had no way of knowing what the actual amount was. In addition, +there are no established criteria in any written procedure for what level of this gas +or what alarms constitute an emergency condition that should trigger sounding +the evacuation alarm. 
Also, none of the alarms were designated as critical alarms, which the accident report does concede might have "elicited a higher degree of attention amongst the competing priorities" of the control room operator. Finally, there was no written procedure for responding to an alarm for this gas. The "standard response" was for an outside operator to conduct a field assessment of the situation, which he did.
While training information was provided about the hazards of the particular gas that escaped, this information was not incorporated in standard operating or emergency procedures. The operators were apparently on their own to decide if an emergency existed, and then were chastised for not responding (in hindsight) correctly. If there is a potential for operators to make poor decisions in safety-critical situations, then they need to be provided with the criteria to make such a decision. Expecting operators under stress, perhaps with limited information about the current system state and inadequate training, to make such critical decisions based on their own judgment is unrealistic. It simply ensures that operators will be blamed when their decisions turn out, in hindsight, to be wrong.
One of the actions the operators were criticized for was trying to fix the problem rather than calling in emergency personnel immediately after the gas alarm sounded. In fact, this response is the normal one for humans (see chapter 9, as well as the following discussion); if it is not the desirable response, then procedures and training must be used to ensure that a different response is elicited. The accident report states that the safety policy for this company is:
At units, any employee shall assess the situation and determine what level of evacuation and what equipment shutdown is necessary to ensure the safety of all personnel, mitigate the environmental impact and potential for equipment/property damage. When in doubt, evacuate.
There are two problems with such a policy.
The first problem is that evacuation responsibilities (or emergency procedures more generally) do not seem to be assigned to anyone but can be initiated by all employees. While this may seem like a good idea, it has a serious drawback: one consequence of such a lack of assigned control responsibility is that everyone may think that someone else will take the initiative (and the blame if the alarm is a false one). Although everyone should report problems and even sound an emergency alert when necessary, there must be someone who has the actual responsibility, authority, and accountability to do so. There should also be backup procedures for others to step in when that person does not execute his or her responsibility acceptably.
The second problem with this safety policy is that unless the procedures clearly say to execute emergency procedures, humans are very likely to try to diagnose the situation first. The same problem pops up in many accident reports: humans who are overwhelmed with information that they cannot digest quickly or do not understand will first try to understand what is going on before sounding an alarm. If management wants employees to sound alarms expeditiously and consistently, then the safety policy needs to specify exactly when alarms are required, not leave it up to personnel to "evaluate the situation" when they are probably confused and unsure as to what is going on (as in this case)
and under pressure to make quick decisions in stressful situations. How many people, instead of dialing 911 immediately, try to put out a small kitchen fire themselves? That it often works simply reinforces the tendency to act in the same way during the next emergency. And it avoids the embarrassment of the firemen arriving for a non-emergency. As it turns out, the evacuation alert had been delayed in the past in this same plant, but nobody had investigated why that occurred.
The accident report concludes with a recommendation that "operator duty to respond to alarms needs to be reinforced with the work force." This recommendation is inadequate because it ignores why the operators did not respond to the alarms. More useful recommendations might have included designing more accurate and more observable feedback about the actual position of the control valve (rather than just the commanded position), about the state of flow into the tank, about the level of the liquid in the tank, and so on. The recommendation also ignores the ambiguous state of the company policy on responding to alarms.
Because the official report focused only on the role of the operators in the accident, and did not even examine that in depth, a chance to detect flaws in the design and operation of the plant that could lead to future accidents was lost. To prevent future accidents, the report needed to explain such things as why the HAZOP performed on the unit did not identify any of the alarms in this unit as critical. Is there some deficiency in HAZOP or in the way it is being performed in this company? Why were there no procedures in place, or why were the ones in place ineffective, to respond to the emergency? Either the hazard was not identified, the company does not have a policy to create procedures for dealing with hazards, or it was an oversight and there was no procedure in place to check that there is a response for all identified hazards.
The report does recommend that a risk-assessed procedure for filling this tank be created that defines critical operational parameters, such as the sequence of steps required to initiate the filling process, the associated process control parameters, the safe level at which the tank is considered full, the sequence of steps necessary to conclude and secure the tank-filling process, and the appropriate response to alarms. It does not say anything, however, about performing the same task for other processes in the plant. Either this tank and its safety-critical process are the only ones missing such procedures, or the company is playing a sophisticated game of Whack-a-Mole (see chapter 13), in which only symptoms of the real problems are removed with each set of events investigated.
The official accident report concludes that the control room operator "did not demonstrate an awareness of risks associated with overflowing the tank and potential to generate high concentrations of [the gas] when material was spilled."
No further investigation of why this was true was included in the report. Was there a deficiency in the training procedures about the hazards associated with his job responsibilities? Even if the explanation is that this particular operator was simply incompetent (probably not true) and that, although exposed to potentially effective training, he did not profit from it, then the question becomes why such an operator was allowed to continue in that job and why the evaluation of his training outcomes did not detect this deficiency.
It seemed that the outside operator also had a poor understanding of the risks from this gas, so there is clear evidence that a systemic problem exists. An audit should have been performed to determine if a spill in this tank is the only hazard that is not understood and if these two operators are the only ones who are confused. Is this unit simply a poorly designed and managed one in the plant, or do similar deficiencies exist in other units?
Other important causal factors and questions also were not addressed in the report, such as why the level transmitter was not working so soon after it was supposedly fixed, why safety orders were so delayed (the average age of a safety-related work order in this plant was three months), why critical processes were allowed to operate with non-functioning or erratically functioning safety-related equipment, whether the plant management knew this was happening, and so on.
Hindsight bias and focusing only on the operator's role in accidents prevent us from fully learning from accidents and making significant progress in improving safety.

section 11.8. Coordination and Communication.
The analysis so far has looked at each component separately. But coordination and communication between controllers are important sources of unsafe behavior.
Whenever a component has two or more controllers, coordination should be examined carefully. Each controller may have different responsibilities, but the control actions provided may conflict. The controllers may also control the same aspects of the controlled component's behavior, leading to confusion about who is responsible for providing control at any time. In the Walkerton E. coli water supply contamination example provided in appendix C, three control components were responsible for following up on inspection reports and ensuring the required changes were made: the Walkerton Public Utility Commission (WPUC), the Ministry of the Environment (MOE), and the Ministry of Health (MOH). The WPUC commissioners had no expertise in running a water utility and simply left the changes to the manager. The MOE and MOH were both responsible for performing the same oversight. The local MOH facility assumed that the MOE was performing this function, but the MOE's budget had been cut, and follow-ups were not done. In this case, each of the three responsible groups assumed the other two controllers were providing the needed oversight, a common finding after an accident.
A different type of coordination problem occurred in an aircraft collision near Überlingen, Germany, in 2002. The two controllers, the automated onboard TCAS system and the ground air traffic controller, provided uncoordinated control instructions that conflicted and actually caused a collision. The loss would have been prevented if both pilots had followed their TCAS alerts or both had followed the ground ATC instructions.
In the friendly fire accident analyzed in chapter 5, the responsibility of the AWACS controllers had officially been disambiguated by assigning one to control aircraft within the no-fly zone and the other to monitor and control aircraft outside it. This partitioning of control broke down over time, however, with the result that neither controlled the Black Hawk helicopter on that fateful day. No performance auditing occurred to ensure that the assumed and designed behavior of the safety control structure components was actually occurring.
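One way to make such coordination gaps visible during an analysis is to list each safety-related responsibility together with the controllers assigned to it and then flag the responsibilities that are unassigned or shared. The following minimal Python sketch is purely illustrative, using the Walkerton follow-up example above; the data structure and checks are assumptions for illustration, not part of any published CAST tooling.

# Map each responsibility to the controllers that are supposed to enforce it.
oversight_responsibilities = {
    "follow up on inspection reports": ["WPUC", "MOE", "MOH"],  # shared: coordination must be examined
    "verify that required changes were made": [],               # unassigned: nobody enforces it
}

for responsibility, controllers in oversight_responsibilities.items():
    if not controllers:
        print(f"UNASSIGNED: '{responsibility}' has no responsible controller")
    elif len(controllers) > 1:
        print(f"SHARED: '{responsibility}' is held by {', '.join(controllers)}; "
              "check that each controller knows who actually performs it")

A shared responsibility is not necessarily a flaw, but it marks a place where the analysis should confirm that the controllers coordinated rather than each assuming the others were acting.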
+Communication, both feedback and exchange of information, is also critical. All +communication links should be examined to ensure they worked properly and, if +they did not, the reasons for the inadequate communication must be determined. +The Überlingen collision, between a Russian Tupolev aircraft and a DHL Boeing +aircraft, provides a useful example. Wong used STAMP to analyze this accident and +demonstrated how the communications breakdown on the night of the accident +played an important role . Figure 11.9 shows the components surrounding the +controller at the Air Traffic Control Center in Zürich that was controlling both +aircraft at the time and the feedback loops and communication links between the +components. Dashed lines represent partial communication channels that are not +available all the time. For example, only partial communication is available between +the controller and multiple aircraft because only one party can transmit at one time +when they are sharing a single radio frequency. In addition, the controller cannot +directly receive information about TCAS advisories.the Pilot Not Flying .(PNF). is + + +supposed to report TCAS advisories to the controller over the radio. Finally, communicating all the time with all the aircraft requires the presence of two controllers +at two different consoles, but only one controller was present at the time. +Nearly all the communication links were broken or ineffective at the time of the +accident .(see figure 11.10). A variety of conditions contributed to the lost links. +The first reason for the dysfunctional communication was unsafe practices such +as inadequate briefings given to the two controllers scheduled to work the night +shift, the second controller being in the break room .(which was not officially allowed +but was known and tolerated by management during times of low traffic), and the +reluctance of the controller’s assistant to speak up with ideas to assist in the situation due to feeling that he would be overstepping his bounds. The inadequate briefings were due to a lack of information as well as each party believing they were not +responsible for conveying specific information, a result of poorly defined roles and +responsibilities. +More links were broken due to maintenance work that was being done in the +control room to reorganize the physical sectors. This work led to unavailability of +the direct phone line used to communicate with adjacent ATC centers .(including +ATC Karlsruhe, which saw the impending collision and tried to call ATC Zurich) +and the loss of an optical short-term conflict alert .(STCA). on the console. The aural +short-term conflict alert was theoretically working, but nobody in the control room +heard it. +Unusual situations led to the loss of additional links. These include the failure of +the bypass telephone system from adjacent ATC centers and the appearance of a +delayed A320 aircraft landing at Friedrichshafen. To communicate with all three +aircraft, the controller had to alternate between two consoles, changing all the aircraft–controller communication channels to partial links. +Finally, some links were unused because the controller did not realize they were +available. These include possible help from the other staff present in the control room +(but working on the resectorization). and a third telephone system that the controller +did not know about. 
In addition, the link between the crew of the Tupolev aircraft +and its TCAS unit was broken due to the crew ignoring the TCAS advisory. +Figure 11.10 shows the remaining links after all these losses. At the time of the +accident, there were no complete feedback loops left in the system and the few +remaining connections were partial ones. The exception was the connection between +the TCAS units of the two aircraft, which were still communicating with each other. +The TCAS unit can only provide information to the crew, however, so this remaining +loop was unable to exert any control over the aircraft. +Another common type of communication failure is in the problem-reporting +channels. In a large number of accidents, the investigators find that the problems +were identified in time to prevent the loss but that the required problem-reporting + + +channels were not used. Recommendations in the ensuing accident reports usually +involve training people to use the reporting channels.based on an assumption that +the lack of use reflected poor training.or attempting to enforce their use by reiterating the requirement that all problems be reported. These investigations, however, +usually stop short of finding out why the reporting channels were not used. Often +an examination and a few questions reveal that the formal reporting channels are +difficult or awkward and time-consuming to use. Redesign of a poorly designed +system will be more effective in ensuring future use than simply telling people they +have to use a poorly designed system. Unless design changes are made, over time +the poorly designed communication channels will again become underused. +At Citichem, all problems were reported orally to the control room operator, who +was supposed to report them to someone above him. One conduit for information, +of course, leads to a very fragile reporting system. At the same time, there were few +formal communication and feedback channels established.communication was +informal and ad hoc, both within Citichem and between Citichem and the local +government. + +section 11.9. Dynamics and Migration to a High-Risk State. +As noted previously, most major accidents result from a migration of the system +toward reduced safety margins over time. In the Citichem example, pressure from +commercial competition was one cause of this degradation in safety. It is, of course, +a very common one. Operational safety practices at Citichem had been better in the +past, but the current market conditions led management to cut the safety margins +and ignore established safety practices. Usually there are precursors signaling the +increasing risks associated with these changes in the form of minor incidents and +accidents, but in this case, as in so many others, these precursors were not recognized. +Ironically, the death of the Citichem maintenance manager in an accident led the +management to make changes in the way they were operating, but it was too late +to prevent the toxic chemical release. +The corporate leaders pressured the Citichem plant manager to operate at higher +levels of risk by threatening to move operations to Mexico, leaving the current +workers without jobs. Without any way of maintaining an accurate model of the risk +in current operations, the plant manager allowed the plant to move to a state of +higher and higher risk. +Another change over time that affected safety in this system was the physical +change in the separation of the population from the plant. 
Usually hazardous facilities are originally placed far from population centers, but the population shifts after the facility is created. People want to live near where they work and do not like long commutes. Land and housing may be cheaper near smelly, polluting plants. In third world countries, utilities (such as power and water) and transportation facilities may be more readily available near heavy industrial plants, as was the case at Bhopal.

At Citichem, an important change over time was the obsolescence of the emergency preparations as the population increased. Roads, hospital facilities, firefighting equipment, and other emergency resources became inadequate. Not only were there insufficient resources to handle the changes in population density and location, but financial and other pressures militated against those wanting to update the emergency resources and plans.

Considering the Oakbridge community dynamics, the city of Oakbridge contributed to the accident through the erosion of the safety controls due to the normal pressures facing any city government. Without any history of accidents, or risk assessments indicating otherwise, the plant was deemed safe, and officials allowed developers to build on previously restricted land. A contributing factor was the desire to increase city finances and business relationships that would assist in the reelection of the city officials. The city moved toward a state where casualties would be massive when an accident did occur.

The goal of understanding the dynamics is to redesign the system and the safety control structure to make them more conducive to system safety. For example, behavior is influenced by recent accidents or incidents. As safety efforts are successfully employed, the feeling grows that accidents cannot occur, leading to reduction in the safety efforts, an accident, and then increased controls for a while until the system drifts back to an unsafe state and complacency again increases . . .

This complacency factor is so common that any system safety effort must include ways to deal with it. SUBSAFE, the U.S. nuclear submarine safety program, has been particularly successful at accomplishing this goal. The SUBSAFE program is described in chapter 14.

One way to combat this erosion of safety is to provide ways to maintain accurate risk assessments in the process models of the system controllers. The more and better the information controllers have, the more accurate their process models, and therefore their decisions, will be.

In the Citichem example, the dynamics of the city migration toward higher risk might be improved by doing better hazard analyses, increasing communication between the city and the plant (e.g., learning about incidents that are occurring), and forming community citizen groups to provide counterbalancing pressures on city officials to maintain the emergency response system and the other public safety measures.

Finally, understanding the reason for such migration provides an opportunity to design the safety control structure to prevent it or to detect it when it occurs. Thorough investigation of incidents using CAST and the insight it provides can be used to redesign the system or to establish operational controls to stop the migration toward increasing risk before an accident occurs.

section 11.10. Generating Recommendations from the CAST Analysis.
+
The goal of an accident analysis should not be just to address symptoms, to assign blame, or to determine which group or groups are more responsible than others. Blame is difficult to eliminate, but, as discussed in section 2.7, blame is antithetical to improving safety. It hinders accident and incident investigations and the reporting of errors before a loss occurs, and it hinders finding the most important factors that need to be changed to prevent accidents in the future. Often, blame is assigned to the least politically powerful in the control hierarchy or to those people or physical components physically and operationally closest to the actual loss events. Understanding why inadequate control was provided and why it made sense for the controllers to act in the way they did helps to defuse what seems to be a natural desire to assign blame for events. In addition, looking at how the entire safety control structure was flawed and conceptualizing accidents as complex processes rather than the result of independent events should reduce the finger pointing and arguments about others being more to blame that often arise when system components other than the operators are identified as being part of the accident process. "More to blame" is not a relevant concept in a systems approach to accident analysis and should be resisted and avoided. Each component in a system works together to obtain the results, and no part is more important than another.

The goal of the accident analysis should instead be to determine how to change or reengineer the entire safety-control structure in the most cost-effective and practical way to prevent similar accident processes in the future. Once the STAMP analysis has been completed, generating recommendations is relatively simple and follows directly from the analysis results.

One consequence of the completeness of a STAMP analysis is that many possible recommendations may result; in some cases, too many to be practical to include in the final accident report. A determination of the relative importance of the potential recommendations may be required in terms of having the greatest impact on the largest number of potential future accidents. There is no algorithm for identifying these recommendations, nor can there be. Political and situational factors will always be involved in such decisions. Understanding the entire accident process and the overall safety control structure should help with this identification, however.

Some sample recommendations for the Citichem example are shown throughout the chapter. A more complete list of the recommendations that might result from a STAMP-based Citichem accident analysis follows. The list is divided into four parts: physical equipment and design, corporate management, plant operations and management, and government and community.

Physical Equipment and Design.
1. Add protection against rainwater getting into tanks.
2. Consider measures for preventing and detecting corrosion.
3. Change the design of the valves and vent pipes to respond to the two-phase flow problem (which was responsible for the valves and pipes being jammed).
4. Etc. (the rest of the physical plant factors are omitted).

Corporate Management.
1. Establish a corporate safety policy that specifies:
a. Responsibility, authority, and accountability of everyone with respect to safety.
b. Criteria for evaluating decisions and for designing and implementing safety controls.

2.
Establish a corporate process safety organization to provide oversight that is responsible for:
a. Enforcing the safety policy.
b. Advising corporate management on safety-related decisions.
c. Performing risk analyses and overseeing safety in operations, including performing audits and setting reporting requirements (to keep corporate process models accurate). A safety working group at the corporate level should be considered.
d. Setting minimum requirements for safety engineering and operations at plants and overseeing the implementation of these requirements, as well as management of change requirements for evaluating all changes for their impact on safety.
e. Providing a conduit for safety-related information from below (a formal safety reporting system) as well as an independent feedback channel about process safety concerns by employees.
f. Setting minimum physical and operational standards (including functioning equipment and backups) for operations involving dangerous chemicals.
g. Establishing incident/accident investigation standards and ensuring recommendations are adequately implemented.
h. Creating and maintaining a corporate process safety information system.
3. Improve process safety communication channels within the corporate level as well as the information and feedback channels from Citichem plants to corporate management.
4. Ensure that appropriate communication and coordination is occurring between the Citichem plants and the local communities in which they reside.
5. Strengthen or create an inventory control system for safety-critical parts at the corporate level. Ensure that safety-related equipment is in stock at all times.

Citichem Oakbridge Plant Management and Operations.
1. Create a safety policy for the plant. Derive it from the corporate safety policy and make sure everyone understands it. Include minimum requirements for operations: for example, safety devices must be operational, and production should be shut down if they are not.
2. Establish a plant process safety organization and assign responsibility, authority, and accountability for this organization. Include a process safety manager whose primary responsibility is process safety. The responsibilities of this organization should include at least the following:
a. Perform hazard and risk analysis.
b. Advise plant management on safety-related decisions.
c. Create and maintain a plant process safety information system.
d. Perform or organize process safety audits and inspections using hazard analysis results as the preconditions for operations and maintenance.
e. Investigate hazardous conditions, incidents, and accidents.
f. Establish leading indicators of risk.
g. Collect data to ensure process safety policies and procedures are being followed.
3. Ensure that everyone has appropriate training in process safety and the specific hazards associated with plant operations.
4. Regularize and improve communication channels. Create the operational feedback channels from controlled components to controllers necessary to maintain accurate process models to assist in safety-related decision making. If the channels exist but are not used, then the reason why they are unused should be determined and appropriate changes made.
5. Establish a formal problem reporting system along with channels for problem reporting that include management and rank-and-file workers. Avoid communication channels with a single point of failure for safety-related messages.
+
Decisions on whether management is informed about hazardous operational events should be proceduralized. Any operational conditions found to exist that involve hazards should be reported and thoroughly investigated by those responsible for system safety.
6. Consider establishing employee safety committees with union representation (if there are unions at the plant). Consider also setting up a plant process safety working group.
7. Require that all changes affecting safety equipment be approved by the plant manager or by his or her designated representative for safety. Any outage of safety-critical equipment must be reported immediately.
8. Establish procedures for quality control and checking of safety-critical activities and follow-up investigation of safety excursions (hazardous conditions).
9. Ensure that those performing safety-critical operations have appropriate skills and physical resources (including adequate rest).
10. Improve inventory control procedures for safety-critical parts at the Oakbridge plant.
11. Review procedures for turnarounds, maintenance, changes, operations, etc. that involve potential hazards and ensure that these are being followed. Create a management of change (MOC) procedure that includes hazard analysis on all planned changes.
12. Enforce maintenance schedules. If delays are unavoidable, a safety analysis should be performed to understand the risks involved.
13. Establish incident/accident investigation standards and ensure that they are being followed and recommendations are implemented.
14. Create a periodic audit system on the safety of operations and the state of the plant. Audit scope might be defined by such information as the hazard analysis, identified leading indicators of risk, and past incident/accident investigations.
15. Establish communication channels with the surrounding community and provide appropriate information for better decision making by community leaders and information to emergency responders and the medical establishment. Coordinate with the surrounding community to provide information and assistance in establishing effective emergency preparedness and response measures. These measures should include a warning siren or other notification of an emergency and citizen information about what to do in the case of an emergency.

Government and Community.
1. Set policy with respect to safety and ensure that the policy is enforced.
2. Establish communication channels with hazardous industry in the community.
3. Establish and monitor information channels about the risks in the community. Collect and disseminate information on hazards, the measures citizens can take to protect themselves, and what to do in case of an emergency.
4. Encourage citizens to take responsibility for their own safety and to press local, state, and federal government to do what is necessary to protect them.
5. Encourage the establishment of a community safety committee and/or a safety ombudsman office that is not elected but represents the public in safety-related decision making.
6. Ensure that safety controls are in place before approving new development in hazardous areas, and if they are not (e.g., inadequate roads, communication channels, emergency response facilities), then perhaps make developers pay for them. Consider requiring developers to provide an analysis of the impact of new development on the safety of the community. Hire outside consultants to evaluate these impact analyses if such expertise is not available locally.
+
7. Establish an emergency preparedness plan and re-evaluate it periodically to determine if it is up to date. Include procedures for coordination among emergency responders.
8. Plan temporary measures for additional manpower in emergencies.
9. Acquire adequate equipment.
10. Provide drills and ensure alerting and communication channels exist and are operational.
11. Train emergency responders.
12. Ensure that transportation and other facilities exist for an emergency.
13. Set up formal communications between emergency responders (hospital staff, police, firefighters, Citichem). Establish emergency plans and means to periodically update them.

One thing to note from this example is that many of the recommendations are simply good safety management practices. While this particular example involved a system that was devoid of the standard safety practices common to most industries, many accident investigations conclude that standard safety management practices were not observed. This fact points to a great opportunity to prevent accidents simply by establishing standard safety controls using the techniques described in this book. While we want to learn as much as possible from each loss, preventing the losses in the first place is a much better strategy than waiting to learn from our mistakes.

These recommendations and those resulting from other thoroughly investigated accidents also provide an excellent resource to assist in generating the system safety requirements and constraints for similar types of systems and in designing improved safety control structures.

Just investigating the incident or accident is, of course, not enough. Recommendations must be implemented to be useful. Responsibility must be assigned for ensuring that changes are actually made. In addition, feedback channels should be established to determine whether the recommendations and changes were successful in reducing risk.

section 11.11. Experimental Comparisons of CAST with Traditional Accident Analysis.
Although CAST is new, several evaluations have been done, mostly aviation-related.

Robert Arnold, in a master's thesis for Lund University, conducted a qualitative comparison of SOAM and STAMP in an Air Traffic Management (ATM) occurrence investigation. SOAM (Systemic Occurrence Analysis Methodology) is used by Eurocontrol to analyze ATM incidents. In Arnold's experiment, an incident was investigated using SOAM and STAMP, and the usefulness of each in identifying systemic countermeasures was compared. The results showed that SOAM is a useful heuristic and a powerful communication device, but that it is weak with respect to emergent phenomena and nonlinear interactions. SOAM directs the investigator to consider the context in which the events occur, the barriers that failed, and the organizational factors involved, but not the processes that created them or how the entire system can migrate toward the boundaries of safe operation. In contrast, the author concludes,

STAMP directs the investigator more deeply into the mechanism of the interactions between system components, and how systems adapt over time. STAMP helps identify the controls and constraints necessary to prevent undesirable interactions between system components. STAMP also directs the investigation through a structured analysis of the upper levels of the system's control structure, which helps to identify high level systemic countermeasures.
The global ATM system is undergoing a period of rapid technological and political change. . . . The ATM is moving from centralized human controlled systems to semi-automated distributed decision making. . . . Detailed new systemic models like STAMP are now necessary to prevent undesirable interactions between normally functioning system components and to understand changes over time in increasingly complex ATM systems.

Paul Nelson, in another Lund University master's thesis, used STAMP and CAST to analyze the crash of Comair 5191 at Lexington, Kentucky, on August 27, 2006, when the pilots took off from the wrong runway. The accident, of course, has been thoroughly investigated by the NTSB. Nelson concludes that the NTSB report narrowly targeted causes and potential solutions. No recommendations were put forth to correct the underlying safety control structure, which fostered process model inconsistencies, inadequate and dysfunctional control actions, and unenforced safety constraints. The CAST analysis, on the other hand, uncovered these useful levers for eliminating future loss.

Stringfellow compared the use of STAMP, augmented with guidewords for organizational and human error analysis, with the use of HFACS (Human Factors Analysis and Classification System) on the crash of a Predator-B unmanned aircraft near Nogales, Arizona. HFACS, based on the Swiss Cheese Model (event-chain model), is an error-classification list that can be used to label types of errors, problems, or poor decisions made by humans and organizations. Once again, although the analysis of the unmanned vehicle based on STAMP found all the factors found in the published analysis of the accident using HFACS, the STAMP-based analysis identified additional factors, particularly those at higher levels of the safety control structure, for example, problems in the FAA's COA approval process. Stringfellow concludes:

The organizational influences listed in HFACS . . . do not go far enough for engineers to create recommendations to address organizational problems. . . . Many of the factors cited in Swiss Cheese-based methods don't point to solutions; many are just another label for human error in disguise.

In general, most accident analyses do a good job in describing what happened, but not why.

footnote. The COA or Certificate of Operation allows an air vehicle that does not nominally meet FAA safety standards access to the National Airspace System. The COA application process includes measures to mitigate risks, such as sectioning off the airspace to be used by the unmanned aircraft and preventing other aircraft from entering the space.

section 11.12. Summary.
In this chapter, the process for performing accident analysis using STAMP as the basis is described and illustrated using a chemical plant accident as an example. Stopping the analysis at the lower levels of the safety-control structure, in this case at the physical controls and the plant operators, provides a distorted and incomplete view of the causative factors in the loss. A more complete analysis enhances both the understanding of why the accident occurred and the ability to prevent future accidents. As the entire accident process becomes better understood, individual mistakes and actions assume a much less important role in comparison to the role played by the environment and context in which their decisions and control actions take place.
What may look like an error or even negligence by the low-level operators and controllers may appear much more reasonable given the full picture. In +addition, changes at the lower levels of the safety-control structure often have much +less ability to impact the causal factors in major accidents than those at higher levels. +At all levels, focusing on assessing blame for the accident does not provide the +information necessary to prevent future accidents. Accidents are complex processes, +and understanding the entire process is necessary to provide recommendations that +are going to be effective in preventing a large number of accidents and not just +preventing the symptoms implicit in a particular set of events. There is too much +repetition of the same causes of accidents in most industries. We need to improve +our ability to learn from the past. +Improving accident investigation may require training accident investigators in +systems thinking and in the types of environmental and behavior shaping factors to +consider during an analysis, some of which are discussed in later chapters. Tools to +assist in the analysis, particularly graphical representations that illustrate interactions +and causality, will help. But often the limitations of accident reports do not stem from +the sincere efforts of the investigators but from political and other pressures to limit +the causal factors identified to those at the lower levels of the management or political hierarchy. Combating these pressures is beyond the scope of this book. Removing +blame from the process will help somewhat. Management also has to be educated to +understand that safety pays and, in the longer term, costs less than the losses that +result from weak safety programs and incomplete accident investigations. \ No newline at end of file diff --git a/chapter12.raw b/chapter12.raw new file mode 100644 index 0000000..6b8fbe0 --- /dev/null +++ b/chapter12.raw @@ -0,0 +1,925 @@ +Chapter 12. +Controlling Safety during Operations. +In some industries, system safety is viewed as having its primary role in development +and most of the activities occur before operations begin. Those concerned with +safety may lose influence and resources after that time. As an example, one of +the chapters in the Challenger accident report, titled “The Silent Safety Program,” +lamented: +Following the successful completion of the orbital flight test phase of the Shuttle program, +the system was declared to be operational. Subsequently, several safety, reliability, and +quality assurance organizations found themselves with reduced and/or reorganized func- +tional capabilities. . . . The apparent reason for such actions was a perception that less +safety, reliability, and quality assurance activity would be required during “routine” Shuttle +operations. This reasoning was faulty. +While safety-guided design eliminates some hazards and creates controls for others, +hazards and losses may still occur in operations due to: +1.•Inadequate attempts to eliminate or control the hazards in the system design, +perhaps due to inappropriate assumptions about operations. +2.•Inadequate implementation of the controls that designers assumed would exist +during operations. +3.•Changes that occur over time, including violation of the assumptions underly- +ing the design. +4.•Unidentified hazards, sometimes new ones that arise over time and were not +anticipated during design and development. 
+Treating operational safety as a control problem requires facing and mitigating these +potential reasons for losses. +A complete system safety program spans the entire life of the system and, in some +ways, the safety program during operations is even more important than during +development. System safety does not stop after development; it is just getting started. +The focus now, however, shifts to the operations safety control structure. + + +This chapter describes the implications of STAMP on operations. Some topics +that are relevant here are left to the next chapter on management: organizational +design, safety culture and leadership, assignment of appropriate responsibilities +throughout the safety control structure, the safety information system, and corpo- +rate safety policies. These topics span both development and operations and many +of the same principles apply to each, so they have been put into a separate chapter. +A final section of this chapter considers the application of STAMP and systems +thinking principles to occupational safety. +section 12.1. +Operations Based on STAMP. +Applying the basic principles of STAMP to operations means that, like develop- +ment, the goal during operations is enforcement of the safety constraints, this time +on the operating system rather than in its design. Specific responsibilities and control +actions required during operations are outlined in chapter 13. +Figure 12.1 shows the interactions between development and operations. At the +end of the development process, the safety constraints, the results of the hazard +analyses, as well as documentation of the safety-related design features and design +rationale, should be passed on to those responsible for the maintenance and evo- +lution of the system. This information forms the baseline for safe operations. For +example, the identification of safety-critical items in the hazard analysis should be +used as input to the maintenance process for prioritization of effort. + +At the same time, the accuracy and efficacy of the hazard analyses performed +during development and the safety constraints identified need to be evaluated using +the operational data and experience. Operational feedback on trends, incidents, and +accidents should trigger reanalysis when appropriate. Linking the assumptions +throughout the system specification with the parts of the hazard analysis based on +that assumption will assist in performing safety maintenance activities. During field +testing and operations, the links and recorded assumptions and design rationale can +be used in safety change analysis, incident and accident analysis, periodic audits and +performance monitoring as required to ensure that the operational system is and +remains safe. +For example, consider the TCAS requirement that TCAS provide collision avoid- +ance protection for any two aircraft closing horizontally at any rate up to 1,200 knots +and vertically up to 10,000 feet per minute. As noted in the rationale, this require- +ment is based on aircraft performance limits at the time TCAS was created. It is +also based on minimum horizontal and vertical separation requirements. The safety +analysis originally performed on TCAS is based on these assumptions. 
If aircraft +performance limits change or if there are proposed changes in airspace manage- +ment, as is now occurring in new Reduced Vertical Separation Minimums (RVSM), +hazard analysis to determine the safety of such changes will require the design +rationale and the tracing from safety constraints to specific system design features +as recorded in intent specifications. Without such documentation, the cost of reanal- +ysis could be enormous and in some cases even impractical. In addition, the links +between design and operations and user manuals in level 6 will ease updating when +design changes are made. +In a traditional System Safety program, much of this information is found +in or can be derived from the hazard log, but it needs to be pulled out and pro- +vided in a form that makes it easy to locate and use in operations. Recording +design rationale and assumptions in intent specifications allows using that informa- +tion both as the criteria under which enforcement of the safety constraints is +predicated and in the inevitable upgrades and changes that will need to be made +during operations. Chapter 10 shows how to identify and record the necessary +information. +The design of the operational safety controls are based on assumptions about the +conditions during operations. Examples include assumptions about how the opera- +tors will operate the system and the environment (both social and physical) in which +the system will operate. These conditions may change. Therefore, not only must the +assumptions and design rationale be conveyed to those who will operate the system, +but there also need to be safeguards against changes over time that violate those +assumptions. + + +The changes may be in the behavior of the system itself: +•Physical changes: the equipment may degrade or not be maintained properly. +•Human changes: human behavior and priorities usually change over time. +•Organizational changes: change is a constant in most organizations, including +changes in the safety control structure itself, or in the physical and social envi- +ronment within which the system operates or with which it interacts. +Controls need to be established to reduce the risk associated with all these types of +changes. +The safeguards may be in the design of the system itself or in the design of the +operational safety control structure. Because operational safety depends on the +accuracy of the assumptions and models underlying the design and hazard analysis +processes, the operational system should be monitored to ensure that: +1. The system is constructed, operated, and maintained in the manner assumed +by the designers. +2. The models and assumptions used during initial decision making and design +are correct. +3. The models and assumptions are not violated by changes in the system, such +as workarounds or unauthorized changes in procedures, or by changes in the +environment. +Designing the operations safety control structure requires establishing controls and +feedback loops to (1) identify and handle flaws in the original hazard analysis and +system design and (2) to detect unsafe changes in the system during operations +before the changes lead to losses. Changes may be intentional or they may be unin- +tended and simply normal changes in system component behavior or the environ- +ment over time. Whether intended or unintended, system changes that violate the +safety constraints must be controlled. + +section 12.2. +Detecting Development Process Flaws during Operations. 
+Losses can occur due to flaws in the original assumptions and rationale underlying +the system design. Errors may also have been made in the hazard analysis process +used during system design. During operations, three goals and processes to achieve +these goals need to be established: +1. Detect safety-related flaws in the system design and in the safety control +structure, hopefully before major losses, and fix them. + + +2. Determine what was wrong in the development process that allowed the flaws +to exist and improve that process to prevent the same thing from happening +in the future. +3. Determine whether the identified flaws in the process might have led to other +vulnerabilities in the operational system. +If losses are to be reduced over time and companies are not going to simply +engage in constant firefighting, then mechanisms to implement learning and con- +tinual improvement are required. Identified flaws must not only be fixed (symptom +removal), but the larger operational and development safety control structures must +be improved, as well as the process that allowed the flaws to be introduced in the +first place. The overall goal is to change the culture from a fixing orientation— +identifying and eliminating deviations or symptoms of deeper problems—to a learn- +ing orientation where systemic causes are included in the search for the source of +safety problems [33]. +To accomplish these goals, a feedback control loop is needed to regularly track +and assess the effectiveness of the development safety control structure and its +controls. Were hazards overlooked or incorrectly assessed as unlikely or not serious? +Were some potential failures or design errors not included in the hazard analysis? +Were identified hazards inappropriately accepted rather than being fixed? Were the +designed controls ineffective? If so, why? +When numerical risk assessment techniques are used, operational experience can +provide insight into the accuracy of the models and probabilities used. In various +studies of the DC-10 by McDonnell Douglas, the chance of engine power loss with +resulting slat damage during takeoff was estimated to be less than one in a billion +flights. However, this highly improbable event occurred four times in DC-10s in the +first few years of operation without raising alarm bells before it led to an accident +and changes were made. Even one event should have warned someone that the +models used might be incorrect. Surprisingly little scientific evaluation of probabi- +listic risk assessment techniques has ever been conducted [115], yet these techniques +are regularly taught to most engineering students and widely used in industry. Feed- +back loops to evaluate the assumptions underlying the models and the assessments +produced are an obvious way to detect problems. +Most companies have an accident/incident analysis process that identifies the +proximal failures that led to an incident, for example, a flawed design of the pressure +relief valve in a tank. Typical follow-up would include replacement of that valve with +an improved design. On top of fixing the immediate problem, companies should +have procedures to evaluate and potentially replace all the uses of that pressure +relief valve design in tanks throughout the plant or company. Even better would be +to reevaluate pressure relief valve design for all uses in the plant, not just in tanks. 
+ + +But for long-term improvement, a causal analysis—CAST or something similar— +needs to be performed on the process that created the flawed design and that +process improved. If the development process was flawed, perhaps in the hazard +analysis or design and verification, then fixing that process can prevent a large +number of incidents and accidents in the future. +Responsibility for this goal has to be assigned to an appropriate component in +the safety control structure and feedback-control loops established. Feedback may +come from accident and incident reports as well as detected and reported design +and behavioral anomalies. To identify flaws before losses occur, which is clearly +desirable, audits and performance assessments can be used to collect data for vali- +dating and informing the safety design and analysis process without waiting for a +crisis. There must also be feedback channels to the development safety control +structure so that appropriate information can be gathered and used to implement +improvements. The design of these control loops is discussed in the rest of this +chapter. Potential challenges in establishing such control loops are discussed in the +next chapter on management. +section 12.3. Managing or Controlling Change. +Systems are not static but instead are dynamic processes that are continually adapt- +ing to achieve their ends and to react to changes in themselves and their environ- +ment. In STAMP, adaptation or change is assumed to be an inherent part of any +system, particularly those that include humans and organizational components: +Humans and organizations optimize and change their behavior, adapting to the +changes in the world and environment in which the system operates. +To avoid losses, not only must the original design enforce the safety constraints +on system behavior, but the safety control structure must continue to enforce them +as changes to the designed system, including the safety control structure itself, occur +over time. +While engineers usually try to anticipate potential changes and to design for +changeability, the bulk of the effort in dealing with change must necessarily occur +during operations. Controls are needed both to prevent unsafe changes and to detect +them if they occur. +In the friendly fire example in chapter 5, the AWACS controllers stopped handing +off helicopters as they entered and left the no-fly zone. They also stopped using the +Delta Point system to describe flight plans, although the helicopter pilots assumed +the coded destination names were still being used and continued to provide them. +Communication between the helicopters and the AWACS controllers was seriously +degraded although nobody realized it. The basic safety constraint that all aircraft +in the no-fly zone and their locations would be known to the AWACS controllers + + +became over time untrue as the AWACS controllers optimized their procedures. +This type of change is normal; it needs to be identified by checking that the assump- +tions upon which safety is predicated remain true over time. +The deviation from assumed behavior during operations was not, in the friendly +fire example, detected until after an accident. Obviously, finding the deviations at +this time is less desirable than using audits, and other types of feedback mechanisms +to detect hazardous changes, that is, those that violate the safety constraints, before +losses occur. Then something needs to be done to ensure that the safety constraints +are enforced in the future. 
+Controls are required for both intentional (planned) and unintentional changes. + +section 12.3.1. Planned Changes. +Intentional system changes are a common factor in accidents, including physical, +process, and safety control structure changes [115]. The Flixborough explosion pro- +vides an example of a temporary physical change resulting in a major loss: Without +first performing a proper hazard analysis, a temporary pipe was used to replace a +reactor that had been removed to repair a crack. The crack itself was the result of +a previous process modification [54]. The Walkerton water contamination loss in +appendix C provides an example of a control structure change when the government +water testing lab was privatized without considering how that would affect feedback +to the Ministry of the Environment. +Before any planned changes are made, including organizational and safety +control structure changes, their impact on safety must be evaluated. Whether +this process is expensive depends on how the original hazard analysis was per- +formed and particularly how it was documented. Part of the rationale behind the +design of intent specifications was to make it possible to retrieve the information +needed. +While implementing change controls limits flexibility and adaptability, at least in +terms of the time it takes to make changes, the high accident rate associated with +intentional changes attests to the importance of controlling them and the high level +of risk being assumed by not doing so. Decision makers need to understand these +risks before they waive the change controls. +Most systems and industries do include such controls, usually called Management +of Change (MOC) procedures. But the large number of accidents occurring after +system changes without evaluating their safety implies widespread nonenforcement +of these controls. Responsibility needs to be assigned for ensuring compliance with +the MOC procedures so that change analyses are conducted and the results are not +ignored. One way to do this is to reward people for safe behavior when they choose +safety over other system goals and to hold them accountable when they choose to +ignore the MOC procedures, even when no accident results. Achieving this goal, in + + +turn, requires management commitment to safety (see chapter 13), as does just +about every aspect of building and operating a safe system. + +section 12.3.2. Unplanned Changes. +While dealing with planned changes is relatively straightforward (even if difficult +to enforce), unplanned changes that move systems toward states of higher risk are +less straightforward. There need to be procedures established to prevent or detect +changes that impact the ability of the operations safety control structure and the +designed controls to enforce the safety constraints. +As noted earlier, people will tend to optimize their performance over time to +meet a variety of goals. If an unsafe change is detected, it is important to respond +quickly. People incorrectly reevaluate their perception of risk after a period of +success. One way to interrupt this risk-reevaluation process is to intervene quickly +to stop it before it leads to a further reduction in safety margins or a loss occurs. +But that requires an alerting function to provide feedback to someone who is +responsible for ensuring that the safety constraints are satisfied. +At the same time, change is a normal part of any system. Successful systems are +continually changing and adapting to current conditions. 
Change should be allowed +as long as it does not violate the basic constraints on safe behavior and therefore +increase risk to unacceptable levels. While in the short term relaxing the safety con- +straints may allow other system goals to be achieved to a greater degree, in the longer +term accidents and losses can cost a great deal more than the short-term gains. +The key is to allow flexibility in how safety goals are achieved, but not flexibility +in violating them, and to provide the information that creates accurate risk percep- +tion by decision makers. +Detecting migration toward riskier behavior starts with identifying baseline +requirements. The requirements follow from the hazard analysis. These require- +ments may be general (“Equipment will not be operated above the identified safety- +critical limits” or “Safety-critical equipment must be operational when the system +is operating”) or specifically tied to the hazard analysis (“AWACS operators must +always hand off aircraft when they enter and leave the no-fly zone” or “Pilots must +always follow the TCAS alerts and continue to do so until they are canceled”). +The next step is to assign responsibility to appropriate places in the safety control +structure to ensure the baseline requirements are not violated, while allowing +changes that do not raise risk. If the baseline requirements make it impossible for +the system to achieve its goals, then instead of waiving them, the entire safety control +structure should be reconsidered and redesigned. For example, consider the foam +shedding problems on the Space Shuttle. Foam had been coming off the external +tank for most of the operational life of the Shuttle. During development, a hazard +had been identified and documented related to the foam damaging the thermal + +control surfaces of the spacecraft. Attempts had been made to eliminate foam shed- +ding, but none of the proposed fixes worked. The response was to simply waive the +requirement before each flight. In fact, at the time of the Columbia loss, more than +three thousand potentially critical failure modes were regularly waived on the +pretext that nothing could be done about them and the Shuttle had to fly [74]. +More than a third of these waivers had not been reviewed in the ten years before +the accident. +After the Columbia loss, controls and mitigation measures for foam shedding +were identified and implemented, such as changing the fabrication procedures and +adding cameras and inspection and repair capabilities and other contingency actions. +The same measures could, theoretically, have been implemented before the loss of +Columbia. Most of the other waived hazards were also resolved in the aftermath of +the accident. While the operational controls to deal with foam shedding raise the +risk associated with a Shuttle accident above actually fixing the problem, the risk is +lower than simply ignoring and waiting for the hazards to occur. Understanding and +explicitly accepting risk is better than simply denying and ignoring it. +The NASA safety program and safety control structure had seriously degraded +before both the Challenger and Columbia losses [117]. Waiving requirements +interminably represents an abdication of the responsibility to redesign the system, +including the controls during operations, after the current design is determined to +be unsafe. +Is such a hard line approach impractical? SUBSAFE, the U.S. 
nuclear submarine +safety program established after the Thresher loss, described in chapter 14, has not +allowed waiving the SUBSAFE safety requirements for more than forty-five years, +with one exception. In 1967, four years after SUBSAFE was established, SUBSAFE +requirements for one submarine were waived in order to satisfy pressing Navy per- +formance goals. That submarine and its crew were lost less than a year later. The +same mistake has not been made again. +If there is absolutely no way to redesign the system to be safe and at the same +time to satisfy the system requirements that justify its existence, then the existence +of the system itself should be rethought and a major replacement or new design +considered. After the first accident, much more stringent and perhaps unacceptable +controls will be forced on operations. While the decision to live with risk is usually +accorded to management, those who will suffer the losses should have a right to +participate in that decision. Luckily, the choice is usually not so stark if flexibility is +allowed in the way the safety constraints are maintained and long-term rather than +short-term thinking prevails. +Like any set of controls, unplanned change controls involve designing appropri- +ate control loops. In general, the process involves identifying the responsibility +of the controller(s); collecting data (feedback); turning the feedback into useful + + +information (analysis) and updating the process models; generating any necessary +control actions and appropriate communication to other controllers; and measuring +how effective the whole process is (feedback again). +section 12.4. Feedback Channels. +Feedback is a basic part of STAMP and of treating safety as a control problem. +Information flow is key in maintaining safety. +There is often a belief—or perhaps hope—that a small number of “leading indi- +cators” can identify increasing risk of accidents, or, in STAMP terms, migration +toward states of increased risk. It is unlikely that general leading indicators appli- +cable to large industry segments exist or will be useful. The identification of system +safety constraints does, however, provide the possibility of identifying leading +indicators applicable to a specific system. +The desire to predict the future often leads to collecting a large amount of infor- +mation based on the hope that something useful will be obtained and noticed. The +NASA Space Shuttle program was collecting six hundred metrics a month before +the loss of Columbia. Companies often collect data on occupational safety, such as +days without a lost time accident, and they assume that these data reflect on system +safety [17], which of course it does not. Not only is this misuse of data potentially +misleading, but collecting information that may not be indicative of real risk diverts +limited resources and attention from more effective risk-reduction efforts. +Poorly defined feedback can lead to a decrease in safety. As an incentive to +reduce the number of accidents in the California construction industry, for example, +workers with the best safety records—as measured by fewest reported incidents— +were rewarded [126]. The reward created an incentive to withhold information +about small accidents and near misses, and they could not therefore be investigated +and the causes eliminated. Under-reporting of incidents created the illusion that the +system was becoming safer, when instead risk had merely been muted. 
The inac- +curate risk perception by management led to not taking the necessary control +actions to reduce risk. Instead, the reporting of accidents should have been rewarded. +Feedback requirements should be determined with respect to the design of the +organization’s safety control structure, the safety constraints (derived from the +system hazards) that must be enforced on system operation, and the assumptions +and rationale underlying the system design for safety. They will be similar for dif- +ferent organizations only to the extent that the hazards, safety constraints, and +system design are similar. +The hazards and safety constraints, as well as the causal information derived by +the use of STPA, form the foundation for determining what feedback is necessary +to provide the controllers with the information they need to satisfy their safety + +responsibilities. In addition, there must be mechanisms to ensure that feedback +channels are operating effectively. +The feedback is used to update the controller’s process models and understand- +ing of the risks in the processes they are controlling, to update their control algo- +rithms, and to execute appropriate control actions. +Sometimes, cultural problems interfere with feedback about the state of the +controlled process. If the culture does not encourage sharing information and if +there is a perception that the information can be used in a way that is detrimental +to those providing it, then cultural changes will be necessary. Such changes require +leadership and freedom from blame (see “Just Culture” in chapter 13). Effective +feedback collection requires that those making the reports are convinced that the +information will be used for constructive improvements in safety and not as a basis +for criticism or disciplinary action. Resistance to airing dirty laundry is understand- +able, but this quickly transitions into an organizational culture where only good +news is passed on for fear of retribution. Everyone’s past experience includes indi- +vidual mistakes, and avoiding repeating the same mistakes requires a culture that +encourages sharing. +Three general types of feedback are commonly used: audits and performance +assessments; reporting systems; and anomaly, incident, and accident investigation. +section 12.4.1. Audits and Performance Assessments. +Once again, audits and performance assessments should start from the safety con- +straints and design assumptions and rationale. The goal should be to determine +whether the safety constraints are being enforced in the operation of the system +and whether the assumptions underlying the safety design and rationale are still +true. Audits and performance assessments provide a chance to detect whether the +behavior of the system and the system components still satisfies the safety con- +straints and whether the way the controllers think the system is working—as +reflected in their process models—is accurate. +The entire safety control structure must be audited, not just the lower-level pro- +cesses. Auditing the upper levels of the organization will require buy-in and com- +mitment from management and an independent group at a high enough level to +control audits as well as explicit rules for conducting them. +Audits are often less effective than they might be. When auditing is performed +through contracts with independent companies, there may be subtle pressures on +the audit team to be unduly positive or less than thorough in order to maintain their +customer base. 
In addition, behavior or conditions may be changed in anticipation +of an audit and then revert back to their normal state immediately afterward. +Overcoming these limitations requires changes in organizational culture and +in the use of the audit results. Safety controllers (managers) must feel personal + + +responsibility for safety. One way to encourage this view is to trust them and expect +them to be part of the solution and to care about safety. “Safety is everyone’s +responsibility” must be more than an empty slogan, and instead a part of the orga- +nizational culture. +A participatory audit philosophy can have an important impact on these cultural +goals. Some features of such a philosophy are: +1.• Audits should not be punitive. Audits need to be viewed as a chance to improve +safety and to evaluate the process rather than a way to evaluate employees. +2.• To +increase buy-in and commitment, those controlling the processes being +audited should participate in creating the rules and procedures and understand +the reasons for the audit and how the results will be used. Everyone should +have a chance to learn from the audit without it having negative consequences— +it should be viewed as an opportunity to learn how to improve. +3.•People from the process being audited should participate on the audit team. In +order to get an outside but educated view, using process experts from other +parts of the organization not directly being audited is a better approach than +using outside audit companies. Various stakeholders in safety may be included +such as unions. The goal should be to inculcate the attitude that this is our audit +and a chance to improve our practices. Audits should be treated as a learning +experience for everyone involved—including the auditors. +4.•Immediate feedback should be provided and solutions discussed. Often audit +results are not available until after the audit and are presented in a written +report. Feedback and discussion with the audit team during the audit are dis- +couraged. One of the best times to discuss problems found and how to design +solutions, however, is when the team is together and on the spot. Doing this +will also reinforce the understanding that the goal is to improve the process, +not to punish or evaluate those involved. +5.• All levels of the safety control structure should be audited, along with the +physical process and its immediate operators. Accepting being audited and +implementing improvements as a result—that is, leading by example—is a +powerful way for leaders to convey their commitment to safety and to its +improvement. +6.• A part of the audit should be to determine the level of safety knowledge and +training that actually exists, not what managers believe exists or what exists in +the training programs and user manuals. These results can be fed back into the +training materials and education programs. Under no circumstances, of course, +should such assessments be used in a negative way or one that is viewed as +punitive by those being assessed. + +Because these rules for audits are so far from common practice, they may be +viewed as unrealistic. But this type of audit is carried out today with great success. +See chapter 14 for an example. The underlying philosophy behind these practices is +that most people do not want to harm others and have innate belief in safety as a +goal. The problems arise when other goals are rewarded or emphasized over safety. 
+When safety is highly valued in an organizational culture, obtaining buy-in is usually +not difficult. The critical step lies in conveying that commitment. + +section 12.4.2. Anomaly, Incident, and Accident Investigation. +Anomaly, incident, and accident investigations often focus on a single “root” cause +and look for contributory causes near the events. The belief that there is a root cause, +sometimes called root cause seduction [32], is powerful because it provides an illu- +sion of control. If the root cause can simply be eliminated and if that cause is low +in the safety control structure, then changes can easily be made that will eliminate +accidents without implicating management or requiring changes that are costly or +disruptive to the organization. The result is that physical design characteristics or +low-level operators are usually identified as the root cause. +Causality is, however, much more complex than this simple but very entrenched +belief, as has been argued throughout this book. To effect high-leverage policies and +changes that are able to prevent large classes of future losses, the weaknesses in the +entire safety control structure related to the loss need to be identified and the +control structure redesigned to be more effective. +In general, effective learning from experience requires a change from a fixing +orientation to a continual learning and improvement culture. To create such a +culture requires high-level leadership by management, and sometimes organiza- +tional changes. +Chapter 11 describes a way to perform better analyses of anomalies, incidents, +and accidents. But having a process is not enough; the process must be embedded +in an organizational structure that allows the successful exploitation of that process. +Two important organizational factors will impact the successful use of CAST: train- +ing and follow-up. +Applying systems thinking to accident analysis requires training and experience. +Large organizations may be able to train a group of investigators or teams to +perform CAST analyses. This group should be managerially and financially inde- +pendent. Some managers prefer to have accident/incident analysis reports focus on +the low-level system operators and physical processes and the reports never go +beyond those factors. In other cases, those involved in accident analysis, while well- +meaning, have too limited a view to provide the perspective required to perform an +adequate causal analysis. Even when intentions are good and local skills and knowl- +edge are available, budgets may be so tight and pressures to maintain performance + +schedules so high that it is difficult to find the time and resources to do a thorough +causal analysis using local personnel. Trained teams with independent budgets +can overcome some of these obstacles. But while the leaders of investigations and +causal analysis can be independent, participation by those with local knowledge is +also important. +A second requirement is follow-up. Often the process stops after recommenda- +tions are made and accepted. No follow-up is provided to ensure that the recom- +mendations are implemented or that the implementations were effective. Deadlines +and assignment of responsibility for making recommendations, as well as responsi- +bility for ensuring that they are made, are required. The findings in the causal analysis +should be an input to future audits and performance assessments. 
If the same or +similar causes recur, then that itself requires an analysis of why the problem was +not fixed when it first was detected. Was the fix unsuccessful? Did the system migrate +back to the same high-risk state because the underlying causal factors were never +successfully controlled? Were factors missed in the original causal analysis? Trend +analysis is important to ensure that progress is being made in controlling safety. +section 12.4.3. Reporting Systems. +Accident reports very often note that before a loss, someone detected an anomaly +but never reported it using the official reporting system. The response in accident +investigation reports is often to recommend that the requirement to use reporting +systems be emphasized to personnel or to provide additional training in using them. +This response may be effective for a short time, but eventually people revert back +to their prior behavior. A basic assumption about human behavior in this book (and +in systems approaches to human factors) is that human behavior can usually be +explained by looking at the system in which the human is operating. The reason in +the system design for the behavior must be determined and changed: Simply trying +to force people to behave in ways that are unnatural for them will usually be +unsuccessful. +So the first question to ask is why people do not use reporting systems and to fix +those factors. One obvious reason is that they may be designed poorly. They may +require extra, time-consuming steps, such as logging into a web-based system, that +are not part of their normal operating procedures or environment. Once they +get to the website, they may be faced with a poorly designed form that requires +them to provide a lot of extraneous information or does not allow the flexibility +necessary to enter the information they want to provide. +A second reason people do not report is that the information they provided in +the past appeared to go into a black hole, with nobody responding to it. There is +little incentive to continue to provide information under these conditions, particu- +larly when the reporting system is time-consuming and awkward to use. + +A final reason for lack of reporting is a fear that the information provided may +be used against them or there are other negative repercussions such as a necessity +to spend time filling out additional reports. +Once the reason for failing to use reporting systems is understood, the solutions +usually become obvious. For example, the system may need to be redesigned so it +is easy to use and integrated into normal work procedures. As an example, email is +becoming a primary means of communication at work. The first natural response in +finding a problem is to contact those who can fix it, not to report it to some database +where there is no assurance it will be processed quickly or get to the right people. +A successful solution to this problem used on one large air traffic control system +was to require only that the reporter add an extra “cc:” on their emails in order to +get it reported officially to safety engineering and those responsible for problem +reports [94]. +In addition, the receipt of a problem report should result in both an acknowledg- +ment of receipt and a thank-you. Later, when a resolution is identified, information +should be provided to the reporter of the problem about what was done about it. +If there is no resolution within a reasonable amount of time, that too should be +acknowledged. 
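Read as a workflow, this amounts to: acknowledge every report at once, and later tell the reporter either what was done or that the report is still open. The sketch below is only meant to make that loop concrete; the send_email helper is a hypothetical stand-in for whatever channel the organization already uses.

# Sketch of the acknowledgment and feedback loop described above.
# send_email() is a hypothetical placeholder for the organization's normal
# communication channel (the point is to reuse channels people already use).
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

def send_email(to: str, subject: str, body: str) -> None:
    print(f"To: {to}\nSubject: {subject}\n{body}\n")

@dataclass
class ProblemReport:
    reporter: str
    description: str
    received: date
    resolution: Optional[str] = None

def acknowledge(report: ProblemReport) -> None:
    # Immediate receipt plus a thank-you, as recommended in the text.
    send_email(report.reporter, "Report received - thank you",
               "Your safety report has been received and will be reviewed.")

def follow_up(report: ProblemReport, today: date, limit: timedelta = timedelta(days=30)) -> None:
    # Tell the reporter what was done, or acknowledge that nothing has been
    # resolved yet once a reasonable amount of time has passed.
    if report.resolution is not None:
        send_email(report.reporter, "Report resolved", report.resolution)
    elif today - report.received > limit:
        send_email(report.reporter, "Report still open",
                   "No resolution yet; your report is still being worked.")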
There is little incentive to use reporting systems if the reporters do +not think the information will be acted upon. +Most important, an effective reporting system requires that those making the +reports are convinced the information will be used for constructive improvements +in safety and not as a basis for criticism or disciplinary action. If reporting is con- +sidered to have negative consequences for the reporter, then anonymity may be +necessary and a written policy provided for the use of such reporting systems, includ- +ing the rights of the reporters and how the reported information will be used. Much +has been written about this aspect of reporting systems (e.g., see Dekker [51]). One +warning is that trust is hard to gain and easy to lose. Once it is lost, regaining it is +even harder than getting buy-in at the beginning. +When reporting involves an outside regulatory agency or industry group, pro- +tection of safety information and proprietary data from disclosure and use for +purposes other than improving safety must be provided. +Designing effective reporting systems is very difficult. Examining two successful +efforts, in nuclear power and in commercial aviation, along with the challenges they +face is instructive. + +Nuclear Power. +Operators of nuclear power plants in the United States are required to file a +Licensee Event Report (LER) with the Nuclear Regulatory Commission (NRC) +whenever an irregular event occurs during plant operation. While the NRC collected +an enormous amount of information on the operating experience of plants in this + +way, the data were not consistently analyzed until after the Three Mile Island (TMI) +accident. The General Accounting Office (GAO) had earlier criticized the NRC for +this failure, but no corrective action was taken until after the events at TMI [98]. +The system also had a lack of closure: important safety issues were raised and +studied to some degree, but were not carried through to resolution [115]. Many +of the conditions involved in the TMI accident had occurred previously at other +plants but nothing had been done about correcting them. Babcock and Wilcox, +the engineering firm for TMI, had no formal procedures to analyze ongoing pro- +blems at plants they had built or to review the LERs on their plants filed with +the NRC. +The TMI accident sequence started when a pilot-operated relief valve stuck open. +In the nine years before the TMI incident, eleven of those valves had stuck open at +other plants, and only a year before, a sequence of events similar to those at TMI +had occurred at another U.S. plant. +The information needed to prevent TMI was available, including the prior +incidents at other plants, recurrent problems with the same equipment at TMI, and +engineers’ critiques that operators had been taught to do the wrong thing in specific +circumstances, yet nothing had been done to incorporate this information into +operating practices. +In reflecting on TMI, the utility’s president, Herman Dieckamp, said: +To me that is probably one of the most significant learnings of the whole accident [TMI] +the degree to which the inadequacies of that experience feedback loop . . . significantly +contributed to making us and the plant vulnerable to this accident [98]. +As a result of this wake-up call, the nuclear industry initiated better evaluation and +follow-up procedures on LERs. 
It also created the Institute for Nuclear Power +Operations (INPO) to promote safety and reliability through external reviews of +performance and processes, training and accreditation programs, events analysis, +sharing of operating information and best practices, and special assistance to member +utilities. The IAEA (International Atomic Energy Agency) and World Association +of Nuclear Operators (WANO) share these goals and serve similar functions +worldwide. +The reporting system now provides a way for operators of each nuclear power +plant to reflect on their own operating experience in order to identify problems, +interpret the reasons for these problems, and select corrective actions to ameliorate +the problems and their causes. Incident reviews serve as important vehicles for self- +analysis, knowledge sharing across boundaries inside and outside specific plants, and +development of problem-resolution efforts. Both INPO and the NRC issue various +letters and reports to make the industry aware of incidents as part of operating +experience feedback, as does IAEA’s Incident Reporting System. + +The nuclear engineering experience is not perfect, of course, but real strides have +been made since the TMI wakeup call, which luckily occurred without major human +losses. To their credit, an improvement and learning effort was initiated and has +continued. High-profile incidents like TMI are rare, but smaller scale self-analyses +and problem-solving efforts follow detection of small defects, near misses, and pre- +cursors and negative trends. Occasionally the NRC has stepped in and required +changes. For example, in 1996 the NRC ordered the Millstone nuclear power plant +in Connecticut to remain closed until management could demonstrate a “safety +conscious work environment” after identified problems were allowed to continue +without remedial action [34]. +Commercial Aviation. +The highly regarded ASRS (Aviation Safety Reporting System) has been copied by +many individual airline information systems. Although much information is now +collected, there still exist problems in evaluating and learning from it. The breadth +and type of information acquired is much greater than the NRC reporting system +described above. The sheer number of ASRS reports and the free form entry of the +information make evaluation very difficult. There are few ways implemented to +determine whether the report was accurate or evaluated the problem correctly. +Subjective causal attribution and inconsistency in terminology and information +included in the reports makes comparative analysis and categorization difficult and +sometimes impossible. +Existing categorization schemes have also become inadequate as technology +has changed, for example, with increased use of digital technology and computers +in aircraft and ground operations. New categorizations are being implemented, +but that creates problems when comparing data that used older categorization +schemes. +Another problem arising from the goal to encourage use of the system is in the +accuracy of the data. By filing an ASRS report, a limited form of indemnity against +punishment is assured. Many of the reports are biased by personal protection con- +siderations, as evidenced by the large percentage of the filings that report FAA +regulation violations. For example, in a NASA Langley study of reported helicopter +incidents in the ASRS over a nine-year period, nonadherence to FARs (Federal +Aviation Regulations) was by far the largest category of reports. 
The predominance +of FAR violations in the incident data may reflect the motivation of the ASRS +reporters to obtain immunity from perceived or real violations of FARs and not +necessarily the true percentages. +But with all these problems and limitations, most agree that the ASRS and +similar industry reporting systems have been very successful and the information +obtained extremely useful in enhancing safety. For example, reported unsafe airport + +conditions have been corrected quickly and improvements in air traffic control and +other types of procedures made on the basis of ASRS reports. +The success of the ASRS has led to the creation of other reporting systems in +this industry. The Aviation Safety Action Program (ASAP) in the United States, +for example, encourages air carrier and repair station personnel to voluntarily +report safety information to be used to develop corrective actions for identified +safety concerns. An ASAP involves a partnership between the FAA and the cer- +tified organization (called the certificate holder) and may also include a third +party, such as the employees’ labor organization. It provides a vehicle for employ- +ees of the ASAP participants to identify and report safety issues to management +and to the FAA without fear that the FAA will use the reports accepted under +the program to take legal enforcement action against them or the company or +that companies will use the information to take disciplinary action against the +employee. +Certificate holders may develop ASAP programs and submit them to the FAA +for review and acceptance. Ordinarily, programs are developed for specific employee +groups, such as members of the flightcrew, flight attendants, mechanics, or dispatch- +ers. The FAA may also suggest, but not require, that a certificate holder develop an +ASAP to resolve an identified safety problem. +When ASAP reports are submitted, an event review committee (ERC) reviews +and analyzes them. The ERC usually includes a management representative from +the certificate holder, a representative from the employee labor association (if +applicable), and a specially trained FAA inspector. The ERC considers each ASAP +report for acceptance or denial, and if accepted, analyzes the report to determine +the necessary controls to put in place to respond to the identified problem. +Single ASAP reports can generate corrective actions and, in addition, analysis of +aggregate ASAP data can also reveal trends that require action. Under an ASAP, +safety issues are resolved through corrective action rather than through punishment +or discipline. +To prevent abuse of the immunity provided by ASAP programs, reports are +accepted only for inadvertent regulatory violations that do not appear to involve +an intentional disregard for safety and events that do not appear to involve criminal +activity, substance abuse, or intentional falsification. +Additional reporting programs provide for sharing data that is collected by air- +lines for their internal use. FOQA (Flight Operational Quality Assurance) is an +example. Air carriers often instrument their aircraft with extensive flight data +recording systems or use pilot generated checklists and reports for gathering infor- +mation internally to improve operations and safety. FOQA provides a voluntary +means for the airlines to share this information with other airlines and with the FAA + + +so that national trends can be monitored and the FAA can target its resources to +address the most important operational risk issues. 
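A minimal sketch of the kind of aggregate trend check such programs rely on is shown below: it counts accepted reports per category per month and flags categories whose counts keep rising. The data format, window size, and example category are illustrative assumptions, not features of any particular ASAP or FOQA implementation.

# Illustrative only: flags report categories whose monthly counts are rising,
# the kind of aggregate pattern that triggers corrective action.
from collections import Counter
from typing import Dict, List, Tuple

def monthly_counts(reports: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """reports: (month "YYYY-MM", category) pairs -> per-category monthly counts."""
    by_category: Dict[str, Counter] = {}
    for month, category in reports:
        by_category.setdefault(category, Counter())[month] += 1
    return by_category

def rising_categories(reports: List[Tuple[str, str]], window: int = 3) -> List[str]:
    """Categories whose counts increased in each of the last `window` months."""
    flagged = []
    for category, counts in monthly_counts(reports).items():
        months = sorted(counts)[-window:]
        values = [counts[m] for m in months]
        if len(values) == window and all(a < b for a, b in zip(values, values[1:])):
            flagged.append(category)
    return flagged

# Hypothetical example: a category reported once, then twice, then three times
# in consecutive months would be flagged for review.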
+In contrast with the ASAP voluntary reporting of single events, FOQA programs +allow the accumulation of accurate operational performance information covering +all flights by multiple aircraft types such that single events or overall patterns of +aircraft performance data can be identified and analyzed. Such aggregate data can +determine trends specific to aircraft types, local flight path conditions, and overall +flight performance trends for the commercial aircraft industry. FOQA data has been +used to identify the need for changing air carrier operating procedures for specific +aircraft fleets and for changing air traffic control practices at certain airports with +unique traffic pattern limitations. +FOQA and other such voluntary reporting programs allow early identification +of trends and changes in behavior (i.e., migration of systems toward states of increas- +ing risk) before they lead to accidents. Follow-up is provided to ensure that unsafe +conditions are effectively remediated by corrective actions. +A cornerstone of FOQA programs, once again, is the understanding that aggre- +gate data provided to the FAA will be kept confidential and the identity of reporting +personnel or airlines will remain anonymous. Data that could be used to identify +flight crews are removed from the electronic record as part of the initial processing +of the collected data. Air carrier FOQA programs, however, typically provide a +gatekeeper who can securely retrieve identifying information for a limited amount +of time, in order to enable follow-up requests for additional information from the +specific flight crew associated with a FOQA event. The gatekeeper is typically a line +captain designated by the air carrier’s pilot association. FOQA programs usually +involve agreements between pilot organizations and the carriers that define how +the collected information can be used. + +footnote. FOQA is voluntary in the United States but required in some countries. + +section 12.5. +Using the Feedback. +Once feedback is obtained, it needs to be used to update the controllers’ process +models and perhaps control algorithms. The feedback and its analysis may be passed +to others in the control structure who need it. +Information must be provided in a form that people can learn from, apply to +their daily jobs, and use throughout the system life cycle. +Various types of analysis may be performed by the controller on the feedback, +such as trend analysis. If flaws in the system design or unsafe changes are detected, +obviously actions are required to remedy the problems. + +In major accidents, precursors and warnings are almost always present but ignored +or mishandled. While what appear to be warnings are sometimes simply a matter +of hindsight, sometimes clear evidence does exist. In 1982, two years before the +Bhopal accident, for example, an audit was performed that identified many of the +deficiencies involved in the loss. The audit report noted such factors related to +the later tragedy such as filter-cleaning operations without using slip blinds, leaking +valves, and bad pressure gauges. The report recommended raising the capability +of the water curtain and pointed out that the alarm at the flare tower was nonop- +erational and thus any leakage could go unnoticed for a long time. The report also +noted that a number of hazardous conditions were known and allowed to persist +for considerable amounts of time or inadequate precautions were taken against +them. 
In addition, there was no follow-up to ensure that deficiencies were corrected. According
to the Bhopal manager, all improvements called for in the report had been implemented,
but obviously that was either untrue or the fixes were ineffective.
As with accidents and incidents, warning signs or anomalies also need to be analyzed using
CAST. Because practice will naturally deviate from procedures, often for very good reasons,
the gap between procedures and practice needs to be monitored and understood [50].

section 12.6. Education and Training.
Everyone in the safety control structure, not just the lower-level controllers of the physical
systems, must understand their roles and responsibilities with respect to safety and why the
system (including the organizational aspects of the safety control structure) was designed
the way it was.
People, both managers and operators, need to understand the risks they are taking in the
decisions they make. Often bad decisions are made because the decision makers have an
incorrect assessment of the risks being assumed, which has implications for training.
Controllers must know exactly what to look for, not just be told to look for “weak signals,”
a common suggestion in the HRO literature. Before a bad outcome occurs, weak signals are
simply noise; they take on the appearance of signals only in hindsight, when their relevance
becomes obvious. Telling managers and operators to “be mindful of weak signals” simply
creates a pretext for blame after a loss event occurs. Instead, the people involved need to
be knowledgeable about the hazards associated with the operation of the system if we
expect them to recognize the precursors to an accident. Knowledge turns unidentifiable
weak signals into identifiable strong signals. People need to know what to look for.
Decision makers at all levels of the safety control structure also need to understand the
risks they are taking in the decisions they make: Training should include not just what but
why. For good decision making about operational safety, decision makers must understand
the system hazards and their responsibilities with respect to avoiding them. Understanding
the safety rationale, that is, the “why,” behind the system design will also have an impact
on combating complacency and unintended changes leading to hazardous states. This
rationale includes understanding why previous accidents occurred. The Columbia Accident
Investigation Board was surprised at the number of NASA engineers in the Space Shuttle
program who had never read the official Challenger accident report [74]. In contrast,
everyone in the U.S. nuclear Navy has training about the Thresher loss every year.
Training should not be a one-time event for employees but should be continual throughout
their employment, if only as a reminder of their responsibilities and the system hazards.
Learning about recent events and trends can be a focus of this training.
Finally, assessing training effectiveness, perhaps during regular audits, can assist in
establishing an effective improvement and learning process.
With highly automated systems, an assumption is often made that less training is required.
In fact, training requirements go up (not down) in automated systems, and their nature
changes. Training needs to be more extensive and deeper when using automation.
One of the reasons for this requirement is that human operators +of highly automated systems not only need a model of the current process state and +how it can change state but also a model of the automation and its operation, as +discussed in chapter 8. +To control complex and highly automated systems safely, operators (controllers) +need to learn more than just the procedures to follow: If we expect them to control +and monitor the automation, they must also have an in-depth understanding of the +controlled physical process and the logic used in any automated controllers they +may be supervising. System controllers—at all levels—need to know: +• The system hazards and the reason behind safety-critical procedures and opera- +tional rules. +• The potential result of removing or overriding controls, changing prescribed +procedures, and inattention to safety-critical features and operations: Past acci- +dents and their causes should be reviewed and understood. +•How to interpret feedback: Training needs to include different combinations of +alerts and sequences of events, not just single events. +•How to think flexibly when solving problems: Controllers need to be provided +with the opportunity to practice problem solving. +•General strategies rather than specific responses: Controllers need to develop +skills for dealing with unanticipated events. + + +•How to test hypotheses in an appropriate way: To update mental models, +human controllers often use hypothesis testing to understand the system state +better and update their process models. Such hypothesis testing is common with +computers and automated systems where documentation is usually so poor +and hard to use that experimentation is often the only way to understand the +automation behavior and design. Such testing can, however, lead to losses. +Designers need to provide operators with the ability to test hypotheses safely +and controllers must be educated on how to do so. +Finally, as with any system, emergency procedures must be overlearned and continu- +ally practiced. Controllers must be provided with operating limits and specific +actions to take in case they are exceeded. Requiring operators to make decisions +under stress and without full information is simply another way to ensure that they +will be blamed for the inevitable loss event, usually based on hindsight bias. Critical +limits must be established and provided to the operators, and emergency procedures +must be stated explicitly. +section 12.7. +Creating an Operations Safety Management Plan. +The operations safety management plan is used to guide operational control of +safety. The plan describes the objectives of the operations safety program and how +they will be achieved. It provides a baseline to evaluate compliance and progress. +Like every other part of safety program, the plan will need buy-in and oversight. +The organization should have a template and documented expectations for oper- +ations safety management plans, but this template may need to be tailored for +particular project requirements. +The information need not all be contained in one document, but there should be +a central reference with pointers to where the information can be found. As is true +for every other part of the safety control structure, the plan should include review +procedures for the plan itself as well as how the plan will be updated and improved +through feedback from experience. +Some things that might be included in the plan: +1.• +General Considerations. +– Scope and objectives. 
– Applicable standards (company, industry).
– Documentation and reports.
– Review of plan and progress reporting procedures.
2.• Safety Organization (safety control structure).
– Personnel qualifications and duties.
– Staffing and manpower.
– Communication channels.
– Responsibility, authority, accountability (functional organization, organizational
structure).
– Information requirements (feedback requirements, process model, updating requirements).
– Subcontractor responsibilities.
– Coordination.
– Working groups.
– System safety interfaces with other groups, such as maintenance and test, occupational
safety, quality assurance, and so on.
3.• Procedures.
– Problem reporting (processes, follow-up).
– Incident and accident investigation: procedures, staffing (participants), follow-up
(tracing to hazard and risk analyses, communication).
– Testing and audit program: procedures, scheduling, review and follow-up, metrics and
trend analysis, operational assumptions from hazard and risk analyses.
– Emergency and contingency planning and procedures.
– Management of change procedures.
– Training.
– Decision making, conflict resolution.
4.• Schedule.
– Critical checkpoints and milestones.
– Start and completion dates for tasks, reports, reviews.
– Review procedures and participants.
5.• Safety Information System.
– Hazard and risk analyses, hazard logs (controls, review and feedback procedures).
– Hazard tracking and reporting system.
– Lessons learned.
– Safety data library (documentation and files).
– Records retention policies.
6.• Operations hazard analysis.
– Identified hazards.
– Mitigations for hazards.
7.• Evaluation and planned use of feedback to keep the plan up-to-date and improve it
over time.

section 12.8. Applying STAMP to Occupational Safety.

Occupational safety has, traditionally, not taken a systems approach but instead has
focused on individuals and changing their behavior. In applying systems theory to
occupational safety, more emphasis would be placed on understanding the impact of system
design on behavior, and the focus would shift to changing the system rather than the people
in it. For example, vehicles used in large plants could be equipped with speed regulators
rather than depending on humans to follow speed limits and then punishing them when
they do not. The same design-for-safety principles presented in chapter 9 for human
controllers apply to designing for occupational safety.
With the increasing complexity and automation of our plants, the line between occupational
safety and engineering safety is blurring. By designing the system to be safe despite normal
human error or judgment errors under competing work pressures, workers will be better
protected against injury while fulfilling their job responsibilities.
\ No newline at end of file
diff --git a/chapter12.txt b/chapter12.txt
new file mode 100644
index 0000000..8a6e3fe
--- /dev/null
+++ b/chapter12.txt
@@ -0,0 +1,842 @@
+Chapter 12.
+Controlling Safety during Operations.
+In some industries, system safety is viewed as having its primary role in development
+and most of the activities occur before operations begin. Those concerned with
+safety may lose influence and resources after that time. As an example, one of
+the chapters in the Challenger accident report, titled “The Silent Safety Program,”
+lamented.
+Following the successful completion of the orbital flight test phase of the Shuttle program, +the system was declared to be operational. Subsequently, several safety, reliability, and +quality assurance organizations found themselves with reduced and/or reorganized functional capabilities. . . . The apparent reason for such actions was a perception that less +safety, reliability, and quality assurance activity would be required during “routine” Shuttle +operations. This reasoning was faulty. +While safety-guided design eliminates some hazards and creates controls for others, +hazards and losses may still occur in operations due to. +1.•Inadequate attempts to eliminate or control the hazards in the system design, +perhaps due to inappropriate assumptions about operations. +2.•Inadequate implementation of the controls that designers assumed would exist +during operations. +3.•Changes that occur over time, including violation of the assumptions underlying the design. +4.•Unidentified hazards, sometimes new ones that arise over time and were not +anticipated during design and development. +Treating operational safety as a control problem requires facing and mitigating these +potential reasons for losses. +A complete system safety program spans the entire life of the system and, in some +ways, the safety program during operations is even more important than during +development. System safety does not stop after development; it is just getting started. +The focus now, however, shifts to the operations safety control structure. + + +This chapter describes the implications of STAMP on operations. Some topics +that are relevant here are left to the next chapter on management. organizational +design, safety culture and leadership, assignment of appropriate responsibilities +throughout the safety control structure, the safety information system, and corporate safety policies. These topics span both development and operations and many +of the same principles apply to each, so they have been put into a separate chapter. +A final section of this chapter considers the application of STAMP and systems +thinking principles to occupational safety. +section 12.1. +Operations Based on STAMP. +Applying the basic principles of STAMP to operations means that, like development, the goal during operations is enforcement of the safety constraints, this time +on the operating system rather than in its design. Specific responsibilities and control +actions required during operations are outlined in chapter 13. +Figure 12.1 shows the interactions between development and operations. At the +end of the development process, the safety constraints, the results of the hazard +analyses, as well as documentation of the safety-related design features and design +rationale, should be passed on to those responsible for the maintenance and evolution of the system. This information forms the baseline for safe operations. For +example, the identification of safety-critical items in the hazard analysis should be +used as input to the maintenance process for prioritization of effort. + +At the same time, the accuracy and efficacy of the hazard analyses performed +during development and the safety constraints identified need to be evaluated using +the operational data and experience. Operational feedback on trends, incidents, and +accidents should trigger reanalysis when appropriate. 
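One way to picture this baseline is sketched below: each safety constraint is recorded together with the hazards it controls, the design features that enforce it, and the assumptions under which the hazard analysis holds, so that operational feedback can be checked against those assumptions. The structure and field names are assumptions of the sketch, not a prescribed format.

# Sketch of a traceable baseline record passed from development to operations.
# Each safety constraint carries its enforcing design features and the
# assumptions under which the hazard analysis holds, so operational feedback
# can trigger reanalysis when an assumption no longer matches reality.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Assumption:
    text: str                            # e.g. "closing speeds stay below 1,200 knots"
    still_valid: Callable[[dict], bool]  # check against current operational data

@dataclass
class SafetyConstraintRecord:
    constraint: str                      # the safety constraint to be enforced
    hazards: List[str]                   # hazards it controls
    design_features: List[str]           # design features that enforce it
    assumptions: List[Assumption] = field(default_factory=list)

def needs_reanalysis(record: SafetyConstraintRecord, ops_data: dict) -> bool:
    """True if any recorded assumption is violated by current operational data."""
    return any(not a.still_valid(ops_data) for a in record.assumptions)

# Hypothetical example of an assumption check drawn from the TCAS discussion below:
# Assumption("closing speeds stay below 1,200 knots",
#            lambda ops: ops.get("max_closing_speed_kt", 0) <= 1200)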
Linking the assumptions +throughout the system specification with the parts of the hazard analysis based on +that assumption will assist in performing safety maintenance activities. During field +testing and operations, the links and recorded assumptions and design rationale can +be used in safety change analysis, incident and accident analysis, periodic audits and +performance monitoring as required to ensure that the operational system is and +remains safe. +For example, consider the TCAS requirement that TCAS provide collision avoidance protection for any two aircraft closing horizontally at any rate up to 1,200 knots +and vertically up to 10,000 feet per minute. As noted in the rationale, this requirement is based on aircraft performance limits at the time TCAS was created. It is +also based on minimum horizontal and vertical separation requirements. The safety +analysis originally performed on TCAS is based on these assumptions. If aircraft +performance limits change or if there are proposed changes in airspace management, as is now occurring in new Reduced Vertical Separation Minimums .(RVSM), +hazard analysis to determine the safety of such changes will require the design +rationale and the tracing from safety constraints to specific system design features +as recorded in intent specifications. Without such documentation, the cost of reanalysis could be enormous and in some cases even impractical. In addition, the links +between design and operations and user manuals in level 6 will ease updating when +design changes are made. +In a traditional System Safety program, much of this information is found +in or can be derived from the hazard log, but it needs to be pulled out and provided in a form that makes it easy to locate and use in operations. Recording +design rationale and assumptions in intent specifications allows using that information both as the criteria under which enforcement of the safety constraints is +predicated and in the inevitable upgrades and changes that will need to be made +during operations. Chapter 10 shows how to identify and record the necessary +information. +The design of the operational safety controls are based on assumptions about the +conditions during operations. Examples include assumptions about how the operators will operate the system and the environment .(both social and physical). in which +the system will operate. These conditions may change. Therefore, not only must the +assumptions and design rationale be conveyed to those who will operate the system, +but there also need to be safeguards against changes over time that violate those +assumptions. + + +The changes may be in the behavior of the system itself. +•Physical changes. the equipment may degrade or not be maintained properly. +•Human changes. human behavior and priorities usually change over time. +•Organizational changes. change is a constant in most organizations, including +changes in the safety control structure itself, or in the physical and social environment within which the system operates or with which it interacts. +Controls need to be established to reduce the risk associated with all these types of +changes. +The safeguards may be in the design of the system itself or in the design of the +operational safety control structure. Because operational safety depends on the +accuracy of the assumptions and models underlying the design and hazard analysis +processes, the operational system should be monitored to ensure that. +1. 
The system is constructed, operated, and maintained in the manner assumed +by the designers. +2. The models and assumptions used during initial decision making and design +are correct. +3. The models and assumptions are not violated by changes in the system, such +as workarounds or unauthorized changes in procedures, or by changes in the +environment. +Designing the operations safety control structure requires establishing controls and +feedback loops to .(1). identify and handle flaws in the original hazard analysis and +system design and .(2). to detect unsafe changes in the system during operations +before the changes lead to losses. Changes may be intentional or they may be unintended and simply normal changes in system component behavior or the environment over time. Whether intended or unintended, system changes that violate the +safety constraints must be controlled. + +section 12.2. +Detecting Development Process Flaws during Operations. +Losses can occur due to flaws in the original assumptions and rationale underlying +the system design. Errors may also have been made in the hazard analysis process +used during system design. During operations, three goals and processes to achieve +these goals need to be established. +1. Detect safety-related flaws in the system design and in the safety control +structure, hopefully before major losses, and fix them. + + +2. Determine what was wrong in the development process that allowed the flaws +to exist and improve that process to prevent the same thing from happening +in the future. +3. Determine whether the identified flaws in the process might have led to other +vulnerabilities in the operational system. +If losses are to be reduced over time and companies are not going to simply +engage in constant firefighting, then mechanisms to implement learning and continual improvement are required. Identified flaws must not only be fixed .(symptom +removal), but the larger operational and development safety control structures must +be improved, as well as the process that allowed the flaws to be introduced in the +first place. The overall goal is to change the culture from a fixing orientation. +identifying and eliminating deviations or symptoms of deeper problems.to a learning orientation where systemic causes are included in the search for the source of +safety problems . +To accomplish these goals, a feedback control loop is needed to regularly track +and assess the effectiveness of the development safety control structure and its +controls. Were hazards overlooked or incorrectly assessed as unlikely or not serious? +Were some potential failures or design errors not included in the hazard analysis? +Were identified hazards inappropriately accepted rather than being fixed? Were the +designed controls ineffective? If so, why? +When numerical risk assessment techniques are used, operational experience can +provide insight into the accuracy of the models and probabilities used. In various +studies of the D C 10 by McDonnell Douglas, the chance of engine power loss with +resulting slat damage during takeoff was estimated to be less than one in a billion +flights. However, this highly improbable event occurred four times in D C 10 s in the +first few years of operation without raising alarm bells before it led to an accident +and changes were made. Even one event should have warned someone that the +models used might be incorrect. 
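A rough calculation makes the mismatch concrete. Under the claimed rate of less than one event in a billion flights, even a deliberately generous guess at the number of early DC-10 flights leaves essentially no chance of seeing four occurrences; the flight count below is an assumption for illustration only.

# Rough plausibility check of the claimed one-in-a-billion-flights estimate.
# The number of flights is an assumed, generous figure for illustration;
# the exact value hardly matters because the conclusion is so one-sided.
from math import exp, factorial

p_per_flight = 1e-9        # claimed probability of the event per flight
flights = 2_000_000        # assumed early-service flights (illustrative)
observed = 4               # events that actually occurred

expected = p_per_flight * flights            # about 0.002 expected events
# Poisson probability of seeing at least `observed` events if the claim were true
p_at_least = 1 - sum(exp(-expected) * expected**k / factorial(k) for k in range(observed))
print(f"expected events: {expected:.4f}")
print(f"P(>= {observed} events | claimed rate): {p_at_least:.1e}")   # on the order of 1e-12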
Surprisingly little scientific evaluation of probabilistic risk assessment techniques has ever been conducted , yet these techniques +are regularly taught to most engineering students and widely used in industry. Feedback loops to evaluate the assumptions underlying the models and the assessments +produced are an obvious way to detect problems. +Most companies have an accident/incident analysis process that identifies the +proximal failures that led to an incident, for example, a flawed design of the pressure +relief valve in a tank. Typical follow-up would include replacement of that valve with +an improved design. On top of fixing the immediate problem, companies should +have procedures to evaluate and potentially replace all the uses of that pressure +relief valve design in tanks throughout the plant or company. Even better would be +to reevaluate pressure relief valve design for all uses in the plant, not just in tanks. + + +But for long-term improvement, a causal analysis.CAST or something similar. +needs to be performed on the process that created the flawed design and that +process improved. If the development process was flawed, perhaps in the hazard +analysis or design and verification, then fixing that process can prevent a large +number of incidents and accidents in the future. +Responsibility for this goal has to be assigned to an appropriate component in +the safety control structure and feedback-control loops established. Feedback may +come from accident and incident reports as well as detected and reported design +and behavioral anomalies. To identify flaws before losses occur, which is clearly +desirable, audits and performance assessments can be used to collect data for validating and informing the safety design and analysis process without waiting for a +crisis. There must also be feedback channels to the development safety control +structure so that appropriate information can be gathered and used to implement +improvements. The design of these control loops is discussed in the rest of this +chapter. Potential challenges in establishing such control loops are discussed in the +next chapter on management. +section 12.3. Managing or Controlling Change. +Systems are not static but instead are dynamic processes that are continually adapting to achieve their ends and to react to changes in themselves and their environment. In STAMP, adaptation or change is assumed to be an inherent part of any +system, particularly those that include humans and organizational components. +Humans and organizations optimize and change their behavior, adapting to the +changes in the world and environment in which the system operates. +To avoid losses, not only must the original design enforce the safety constraints +on system behavior, but the safety control structure must continue to enforce them +as changes to the designed system, including the safety control structure itself, occur +over time. +While engineers usually try to anticipate potential changes and to design for +changeability, the bulk of the effort in dealing with change must necessarily occur +during operations. Controls are needed both to prevent unsafe changes and to detect +them if they occur. +In the friendly fire example in chapter 5, the A Wacks controllers stopped handing +off helicopters as they entered and left the no-fly zone. They also stopped using the +Delta Point system to describe flight plans, although the helicopter pilots assumed +the coded destination names were still being used and continued to provide them. 
+Communication between the helicopters and the A Wacks controllers was seriously +degraded although nobody realized it. The basic safety constraint that all aircraft +in the no-fly zone and their locations would be known to the A Wacks controllers + + +became over time untrue as the A Wacks controllers optimized their procedures. +This type of change is normal; it needs to be identified by checking that the assumptions upon which safety is predicated remain true over time. +The deviation from assumed behavior during operations was not, in the friendly +fire example, detected until after an accident. Obviously, finding the deviations at +this time is less desirable than using audits, and other types of feedback mechanisms +to detect hazardous changes, that is, those that violate the safety constraints, before +losses occur. Then something needs to be done to ensure that the safety constraints +are enforced in the future. +Controls are required for both intentional .(planned). and unintentional changes. + +section 12.3.1. Planned Changes. +Intentional system changes are a common factor in accidents, including physical, +process, and safety control structure changes . The Flixborough explosion provides an example of a temporary physical change resulting in a major loss. Without +first performing a proper hazard analysis, a temporary pipe was used to replace a +reactor that had been removed to repair a crack. The crack itself was the result of +a previous process modification . The Walkerton water contamination loss in +appendix C provides an example of a control structure change when the government +water testing lab was privatized without considering how that would affect feedback +to the Ministry of the Environment. +Before any planned changes are made, including organizational and safety +control structure changes, their impact on safety must be evaluated. Whether +this process is expensive depends on how the original hazard analysis was performed and particularly how it was documented. Part of the rationale behind the +design of intent specifications was to make it possible to retrieve the information +needed. +While implementing change controls limits flexibility and adaptability, at least in +terms of the time it takes to make changes, the high accident rate associated with +intentional changes attests to the importance of controlling them and the high level +of risk being assumed by not doing so. Decision makers need to understand these +risks before they waive the change controls. +Most systems and industries do include such controls, usually called Management +of Change .(MOC). procedures. But the large number of accidents occurring after +system changes without evaluating their safety implies widespread nonenforcement +of these controls. Responsibility needs to be assigned for ensuring compliance with +the MOC procedures so that change analyses are conducted and the results are not +ignored. One way to do this is to reward people for safe behavior when they choose +safety over other system goals and to hold them accountable when they choose to +ignore the MOC procedures, even when no accident results. Achieving this goal, in + + +turn, requires management commitment to safety .(see chapter 13), as does just +about every aspect of building and operating a safe system. + +section 12.3.2. Unplanned Changes. 
+While dealing with planned changes is relatively straightforward .(even if difficult +to enforce), unplanned changes that move systems toward states of higher risk are +less straightforward. There need to be procedures established to prevent or detect +changes that impact the ability of the operations safety control structure and the +designed controls to enforce the safety constraints. +As noted earlier, people will tend to optimize their performance over time to +meet a variety of goals. If an unsafe change is detected, it is important to respond +quickly. People incorrectly reevaluate their perception of risk after a period of +success. One way to interrupt this risk-reevaluation process is to intervene quickly +to stop it before it leads to a further reduction in safety margins or a loss occurs. +But that requires an alerting function to provide feedback to someone who is +responsible for ensuring that the safety constraints are satisfied. +At the same time, change is a normal part of any system. Successful systems are +continually changing and adapting to current conditions. Change should be allowed +as long as it does not violate the basic constraints on safe behavior and therefore +increase risk to unacceptable levels. While in the short term relaxing the safety constraints may allow other system goals to be achieved to a greater degree, in the longer +term accidents and losses can cost a great deal more than the short-term gains. +The key is to allow flexibility in how safety goals are achieved, but not flexibility +in violating them, and to provide the information that creates accurate risk perception by decision makers. +Detecting migration toward riskier behavior starts with identifying baseline +requirements. The requirements follow from the hazard analysis. These requirements may be general .(“Equipment will not be operated above the identified safetycritical limits” or “Safety-critical equipment must be operational when the system +is operating”). or specifically tied to the hazard analysis .(“A Wacks operators must +always hand off aircraft when they enter and leave the no-fly zone” or “Pilots must +always follow the TCAS alerts and continue to do so until they are canceled”). +The next step is to assign responsibility to appropriate places in the safety control +structure to ensure the baseline requirements are not violated, while allowing +changes that do not raise risk. If the baseline requirements make it impossible for +the system to achieve its goals, then instead of waiving them, the entire safety control +structure should be reconsidered and redesigned. For example, consider the foam +shedding problems on the Space Shuttle. Foam had been coming off the external +tank for most of the operational life of the Shuttle. During development, a hazard +had been identified and documented related to the foam damaging the thermal + +control surfaces of the spacecraft. Attempts had been made to eliminate foam shedding, but none of the proposed fixes worked. The response was to simply waive the +requirement before each flight. In fact, at the time of the Columbia loss, more than +three thousand potentially critical failure modes were regularly waived on the +pretext that nothing could be done about them and the Shuttle had to fly . +More than a third of these waivers had not been reviewed in the ten years before +the accident. 
+After the Columbia loss, controls and mitigation measures for foam shedding +were identified and implemented, such as changing the fabrication procedures and +adding cameras and inspection and repair capabilities and other contingency actions. +The same measures could, theoretically, have been implemented before the loss of +Columbia. Most of the other waived hazards were also resolved in the aftermath of +the accident. While the operational controls to deal with foam shedding raise the +risk associated with a Shuttle accident above actually fixing the problem, the risk is +lower than simply ignoring and waiting for the hazards to occur. Understanding and +explicitly accepting risk is better than simply denying and ignoring it. +The NASA safety program and safety control structure had seriously degraded +before both the Challenger and Columbia losses . Waiving requirements +interminably represents an abdication of the responsibility to redesign the system, +including the controls during operations, after the current design is determined to +be unsafe. +Is such a hard line approach impractical? SUBSAFE, the U.S. nuclear submarine +safety program established after the Thresher loss, described in chapter 14, has not +allowed waiving the SUBSAFE safety requirements for more than forty-five years, +with one exception. In 19 67 , four years after SUBSAFE was established, SUBSAFE +requirements for one submarine were waived in order to satisfy pressing Navy performance goals. That submarine and its crew were lost less than a year later. The +same mistake has not been made again. +If there is absolutely no way to redesign the system to be safe and at the same +time to satisfy the system requirements that justify its existence, then the existence +of the system itself should be rethought and a major replacement or new design +considered. After the first accident, much more stringent and perhaps unacceptable +controls will be forced on operations. While the decision to live with risk is usually +accorded to management, those who will suffer the losses should have a right to +participate in that decision. Luckily, the choice is usually not so stark if flexibility is +allowed in the way the safety constraints are maintained and long-term rather than +short-term thinking prevails. +Like any set of controls, unplanned change controls involve designing appropriate control loops. In general, the process involves identifying the responsibility +of the controller(s); collecting data .(feedback); turning the feedback into useful + + +information .(analysis). and updating the process models; generating any necessary +control actions and appropriate communication to other controllers; and measuring +how effective the whole process is .(feedback again). +section 12.4. Feedback Channels. +Feedback is a basic part of STAMP and of treating safety as a control problem. +Information flow is key in maintaining safety. +There is often a belief.or perhaps hope.that a small number of “leading indicators” can identify increasing risk of accidents, or, in STAMP terms, migration +toward states of increased risk. It is unlikely that general leading indicators applicable to large industry segments exist or will be useful. The identification of system +safety constraints does, however, provide the possibility of identifying leading +indicators applicable to a specific system. 
+The desire to predict the future often leads to collecting a large amount of information based on the hope that something useful will be obtained and noticed. The +NASA Space Shuttle program was collecting six hundred metrics a month before +the loss of Columbia. Companies often collect data on occupational safety, such as +days without a lost time accident, and they assume that these data reflect on system +safety , which of course it does not. Not only is this misuse of data potentially +misleading, but collecting information that may not be indicative of real risk diverts +limited resources and attention from more effective risk-reduction efforts. +Poorly defined feedback can lead to a decrease in safety. As an incentive to +reduce the number of accidents in the California construction industry, for example, +workers with the best safety records.as measured by fewest reported incidents. +were rewarded . The reward created an incentive to withhold information +about small accidents and near misses, and they could not therefore be investigated +and the causes eliminated. Under-reporting of incidents created the illusion that the +system was becoming safer, when instead risk had merely been muted. The inaccurate risk perception by management led to not taking the necessary control +actions to reduce risk. Instead, the reporting of accidents should have been rewarded. +Feedback requirements should be determined with respect to the design of the +organization’s safety control structure, the safety constraints .(derived from the +system hazards). that must be enforced on system operation, and the assumptions +and rationale underlying the system design for safety. They will be similar for different organizations only to the extent that the hazards, safety constraints, and +system design are similar. +The hazards and safety constraints, as well as the causal information derived by +the use of STPA, form the foundation for determining what feedback is necessary +to provide the controllers with the information they need to satisfy their safety + +responsibilities. In addition, there must be mechanisms to ensure that feedback +channels are operating effectively. +The feedback is used to update the controller’s process models and understanding of the risks in the processes they are controlling, to update their control algorithms, and to execute appropriate control actions. +Sometimes, cultural problems interfere with feedback about the state of the +controlled process. If the culture does not encourage sharing information and if +there is a perception that the information can be used in a way that is detrimental +to those providing it, then cultural changes will be necessary. Such changes require +leadership and freedom from blame .(see “Just Culture” in chapter 13). Effective +feedback collection requires that those making the reports are convinced that the +information will be used for constructive improvements in safety and not as a basis +for criticism or disciplinary action. Resistance to airing dirty laundry is understandable, but this quickly transitions into an organizational culture where only good +news is passed on for fear of retribution. Everyone’s past experience includes individual mistakes, and avoiding repeating the same mistakes requires a culture that +encourages sharing. +Three general types of feedback are commonly used. audits and performance +assessments; reporting systems; and anomaly, incident, and accident investigation. +section 12.4.1. 
Audits and Performance Assessments. +Once again, audits and performance assessments should start from the safety constraints and design assumptions and rationale. The goal should be to determine +whether the safety constraints are being enforced in the operation of the system +and whether the assumptions underlying the safety design and rationale are still +true. Audits and performance assessments provide a chance to detect whether the +behavior of the system and the system components still satisfies the safety constraints and whether the way the controllers think the system is working.as +reflected in their process models.is accurate. +The entire safety control structure must be audited, not just the lower-level processes. Auditing the upper levels of the organization will require buy-in and commitment from management and an independent group at a high enough level to +control audits as well as explicit rules for conducting them. +Audits are often less effective than they might be. When auditing is performed +through contracts with independent companies, there may be subtle pressures on +the audit team to be unduly positive or less than thorough in order to maintain their +customer base. In addition, behavior or conditions may be changed in anticipation +of an audit and then revert back to their normal state immediately afterward. +Overcoming these limitations requires changes in organizational culture and +in the use of the audit results. Safety controllers .(managers). must feel personal + + +responsibility for safety. One way to encourage this view is to trust them and expect +them to be part of the solution and to care about safety. “Safety is everyone’s +responsibility” must be more than an empty slogan, and instead a part of the organizational culture. +A participatory audit philosophy can have an important impact on these cultural +goals. Some features of such a philosophy are. +1.• Audits should not be punitive. Audits need to be viewed as a chance to improve +safety and to evaluate the process rather than a way to evaluate employees. +2.• To +increase buy-in and commitment, those controlling the processes being +audited should participate in creating the rules and procedures and understand +the reasons for the audit and how the results will be used. Everyone should +have a chance to learn from the audit without it having negative consequences. +it should be viewed as an opportunity to learn how to improve. +3.•People from the process being audited should participate on the audit team. In +order to get an outside but educated view, using process experts from other +parts of the organization not directly being audited is a better approach than +using outside audit companies. Various stakeholders in safety may be included +such as unions. The goal should be to inculcate the attitude that this is our audit +and a chance to improve our practices. Audits should be treated as a learning +experience for everyone involved.including the auditors. +4.•Immediate feedback should be provided and solutions discussed. Often audit +results are not available until after the audit and are presented in a written +report. Feedback and discussion with the audit team during the audit are discouraged. One of the best times to discuss problems found and how to design +solutions, however, is when the team is together and on the spot. Doing this +will also reinforce the understanding that the goal is to improve the process, +not to punish or evaluate those involved. 
+5.• All levels of the safety control structure should be audited, along with the physical process and its immediate operators. Accepting being audited and implementing improvements as a result (that is, leading by example) is a powerful way for leaders to convey their commitment to safety and to its improvement.

6.• A part of the audit should be to determine the level of safety knowledge and training that actually exists, not what managers believe exists or what exists in the training programs and user manuals. These results can be fed back into the training materials and education programs. Under no circumstances, of course, should such assessments be used in a negative way or one that is viewed as punitive by those being assessed.

Because these rules for audits are so far from common practice, they may be viewed as unrealistic. But this type of audit is carried out today with great success; see chapter 14 for an example. The underlying philosophy behind these practices is that most people do not want to harm others and have an innate belief in safety as a goal. The problems arise when other goals are rewarded or emphasized over safety. When safety is highly valued in an organizational culture, obtaining buy-in is usually not difficult. The critical step lies in conveying that commitment.

section 12.4.2. Anomaly, Incident, and Accident Investigation.

Anomaly, incident, and accident investigations often focus on a single "root" cause and look for contributory causes near the events. The belief that there is a root cause, sometimes called root cause seduction, is powerful because it provides an illusion of control. If the root cause can simply be eliminated, and if that cause is low in the safety control structure, then changes can easily be made that will eliminate accidents without implicating management or requiring changes that are costly or disruptive to the organization. The result is that physical design characteristics or low-level operators are usually identified as the root cause.

Causality is, however, much more complex than this simple but very entrenched belief, as has been argued throughout this book. To effect high-leverage policies and changes that are able to prevent large classes of future losses, the weaknesses in the entire safety control structure related to the loss need to be identified and the control structure redesigned to be more effective.

In general, effective learning from experience requires a change from a fixing orientation to a continual learning and improvement culture. Creating such a culture requires high-level leadership by management, and sometimes organizational changes.

Chapter 11 describes a way to perform better analyses of anomalies, incidents, and accidents. But having a process is not enough; the process must be embedded in an organizational structure that allows the successful exploitation of that process. Two important organizational factors will impact the successful use of CAST: training and follow-up.

Applying systems thinking to accident analysis requires training and experience. Large organizations may be able to train a group of investigators or teams to perform CAST analyses. This group should be managerially and financially independent. Some managers prefer to have accident/incident analysis reports focus on the low-level system operators and physical processes, and in those organizations the reports never go beyond those factors.
In other cases, those involved in accident analysis, while well-meaning, have too limited a view to provide the perspective required to perform an adequate causal analysis. Even when intentions are good and local skills and knowledge are available, budgets may be so tight and pressures to maintain performance schedules so high that it is difficult to find the time and resources to do a thorough causal analysis using local personnel. Trained teams with independent budgets can overcome some of these obstacles. But while the leaders of investigations and causal analysis can be independent, participation by those with local knowledge is also important.

A second requirement is follow-up. Often the process stops after recommendations are made and accepted. No follow-up is provided to ensure that the recommendations are implemented or that the implementations were effective. Deadlines and assignment of responsibility for carrying out the recommendations, as well as responsibility for ensuring that they are carried out, are required. The findings in the causal analysis should be an input to future audits and performance assessments. If the same or similar causes recur, then that itself requires an analysis of why the problem was not fixed when it was first detected. Was the fix unsuccessful? Did the system migrate back to the same high-risk state because the underlying causal factors were never successfully controlled? Were factors missed in the original causal analysis? Trend analysis is important to ensure that progress is being made in controlling safety.

section 12.4.3. Reporting Systems.

Accident reports very often note that before a loss, someone detected an anomaly but never reported it using the official reporting system. The response in accident investigation reports is often to recommend that the requirement to use reporting systems be emphasized to personnel or that additional training in using them be provided. This response may be effective for a short time, but eventually people revert back to their prior behavior. A basic assumption about human behavior in this book (and in systems approaches to human factors) is that human behavior can usually be explained by looking at the system in which the human is operating. The reason for the behavior must be found in the design of that system, and the design changed. Simply trying to force people to behave in ways that are unnatural for them will usually be unsuccessful.

So the first step is to ask why people do not use reporting systems and then to fix those factors. One obvious reason is that the systems may be designed poorly. They may require extra, time-consuming steps, such as logging into a web-based system, that are not part of the reporters' normal operating procedures or environment. Once they get to the website, reporters may be faced with a poorly designed form that requires them to provide a lot of extraneous information or does not allow the flexibility necessary to enter the information they want to provide.

A second reason people do not report is that the information they provided in the past appeared to go into a black hole, with nobody responding to it. There is little incentive to continue to provide information under these conditions, particularly when the reporting system is time-consuming and awkward to use.

A final reason for lack of reporting is a fear that the information provided may be used against the reporter or may lead to other negative repercussions, such as having to spend time filling out additional reports.
+Once the reason for failing to use reporting systems is understood, the solutions usually become obvious. For example, the system may need to be redesigned so it is easy to use and integrated into normal work procedures. Email, for instance, is becoming a primary means of communication at work, and the first natural response on finding a problem is to contact those who can fix it, not to report it to some database where there is no assurance it will be processed quickly or get to the right people. A successful solution to this problem used on one large air traffic control system was to require only that the reporter add an extra "cc" to their emails in order to get the problem reported officially to safety engineering and those responsible for problem reports.

In addition, the receipt of a problem report should result in both an acknowledgment of receipt and a thank-you. Later, when a resolution is identified, information should be provided to the reporter of the problem about what was done about it. If there is no resolution within a reasonable amount of time, that too should be acknowledged. There is little incentive to use reporting systems if the reporters do not think the information will be acted upon.

Most important, an effective reporting system requires that those making the reports are convinced the information will be used for constructive improvements in safety and not as a basis for criticism or disciplinary action. If reporting is considered to have negative consequences for the reporter, then anonymity may be necessary, along with a written policy for the use of such reporting systems that covers the rights of the reporters and how the reported information will be used. Much has been written about this aspect of reporting systems (see, for example, Dekker). One warning is that trust is hard to gain and easy to lose. Once it is lost, regaining it is even harder than getting buy-in at the beginning.

When reporting involves an outside regulatory agency or industry group, protection of safety information and proprietary data from disclosure and use for purposes other than improving safety must be provided.

Designing effective reporting systems is very difficult. Examining two successful efforts, in nuclear power and in commercial aviation, along with the challenges they face, is instructive.

Nuclear Power.

Operators of nuclear power plants in the United States are required to file a Licensee Event Report (LER) with the Nuclear Regulatory Commission (NRC) whenever an irregular event occurs during plant operation. While the NRC collected an enormous amount of information on the operating experience of plants in this way, the data were not consistently analyzed until after the Three Mile Island (TMI) accident. The General Accounting Office (GAO) had earlier criticized the NRC for this failure, but no corrective action was taken until after the events at TMI.

The system also had a lack of closure: important safety issues were raised and studied to some degree but were not carried through to resolution. Many of the conditions involved in the TMI accident had occurred previously at other plants, but nothing had been done about correcting them. Babcock and Wilcox, the engineering firm for TMI, had no formal procedures to analyze ongoing problems at plants they had built or to review the LERs on their plants filed with the NRC.

The TMI accident sequence started when a pilot-operated relief valve stuck open.
+In the nine years before the TMI incident, eleven of those valves had stuck open at other plants, and only a year before, a sequence of events similar to those at TMI had occurred at another U.S. plant.

The information needed to prevent TMI was available, including the prior incidents at other plants, recurrent problems with the same equipment at TMI, and engineers' critiques that operators had been taught to do the wrong thing in specific circumstances, yet nothing had been done to incorporate this information into operating practices.

In reflecting on TMI, the utility's president, Herman Dieckamp, said:

To me that is probably one of the most significant learnings of the whole accident, the degree to which the inadequacies of that experience feedback loop . . . significantly contributed to making us and the plant vulnerable to this accident.

As a result of this wake-up call, the nuclear industry initiated better evaluation and follow-up procedures on LERs. It also created the Institute for Nuclear Power Operations (INPO) to promote safety and reliability through external reviews of performance and processes, training and accreditation programs, events analysis, sharing of operating information and best practices, and special assistance to member utilities. The IAEA (International Atomic Energy Agency) and the World Association of Nuclear Operators (WANO) share these goals and serve similar functions worldwide.

The reporting system now provides a way for operators of each nuclear power plant to reflect on their own operating experience in order to identify problems, interpret the reasons for these problems, and select corrective actions to ameliorate the problems and their causes. Incident reviews serve as important vehicles for self-analysis, knowledge sharing across boundaries inside and outside specific plants, and development of problem-resolution efforts. Both INPO and the NRC issue various letters and reports to make the industry aware of incidents as part of operating experience feedback, as does the IAEA's Incident Reporting System.

The nuclear engineering experience is not perfect, of course, but real strides have been made since the TMI wake-up call, which luckily occurred without major human losses. To the industry's credit, an improvement and learning effort was initiated and has continued. High-profile incidents like TMI are rare, but smaller scale self-analyses and problem-solving efforts follow detection of small defects, near misses, precursors, and negative trends. Occasionally the NRC has stepped in and required changes. For example, in 1996 the NRC ordered the Millstone nuclear power plant in Connecticut to remain closed until management could demonstrate a "safety conscious work environment" after identified problems were allowed to continue without remedial action.

Commercial Aviation.

The highly regarded ASRS (Aviation Safety Reporting System) has been copied by many individual airline information systems. Although much information is now collected, there still exist problems in evaluating and learning from it. The breadth and type of information acquired is much greater than in the NRC reporting system described above. The sheer number of ASRS reports and the free-form entry of the information make evaluation very difficult, and few mechanisms are in place to determine whether a report was accurate or evaluated the problem correctly.
+Subjective causal attribution and inconsistency in the terminology and information included in the reports make comparative analysis and categorization difficult and sometimes impossible.

Existing categorization schemes have also become inadequate as technology has changed, for example, with increased use of digital technology and computers in aircraft and ground operations. New categorizations are being implemented, but that creates problems when comparing data that used older categorization schemes.

Another problem, arising from the goal of encouraging use of the system, is the accuracy of the data. Filing an ASRS report assures the reporter a limited form of indemnity against punishment. Many of the reports are therefore biased by personal protection considerations, as evidenced by the large percentage of the filings that report FAA regulation violations. For example, in a NASA Langley study of reported helicopter incidents in the ASRS over a nine-year period, nonadherence to FARs (Federal Aviation Regulations) was by far the largest category of reports. The predominance of FAR violations in the incident data may reflect the motivation of the ASRS reporters to obtain immunity from perceived or real violations of FARs and not necessarily the true percentages.

But with all these problems and limitations, most agree that the ASRS and similar industry reporting systems have been very successful and the information obtained extremely useful in enhancing safety. For example, reported unsafe airport conditions have been corrected quickly, and improvements in air traffic control and other types of procedures have been made on the basis of ASRS reports.

The success of the ASRS has led to the creation of other reporting systems in this industry. The Aviation Safety Action Program (ASAP) in the United States, for example, encourages air carrier and repair station personnel to voluntarily report safety information to be used to develop corrective actions for identified safety concerns. An ASAP involves a partnership between the FAA and the certified organization (called the certificate holder) and may also include a third party, such as the employees' labor organization. It provides a vehicle for employees of the ASAP participants to identify and report safety issues to management and to the FAA without fear that the FAA will use the reports accepted under the program to take legal enforcement action against them or the company, or that companies will use the information to take disciplinary action against the employee.

Certificate holders may develop ASAP programs and submit them to the FAA for review and acceptance. Ordinarily, programs are developed for specific employee groups, such as members of the flight crew, flight attendants, mechanics, or dispatchers. The FAA may also suggest, but not require, that a certificate holder develop an ASAP to resolve an identified safety problem.

When ASAP reports are submitted, an event review committee (ERC) reviews and analyzes them. The ERC usually includes a management representative from the certificate holder, a representative from the employee labor association (if applicable), and a specially trained FAA inspector. The ERC considers each ASAP report for acceptance or denial and, if a report is accepted, analyzes it to determine the necessary controls to put in place to respond to the identified problem.
+Single ASAP reports can generate corrective actions; in addition, analysis of aggregate ASAP data can reveal trends that require action. Under an ASAP, safety issues are resolved through corrective action rather than through punishment or discipline.

To prevent abuse of the immunity provided by ASAP programs, reports are accepted only for inadvertent regulatory violations that do not appear to involve an intentional disregard for safety and for events that do not appear to involve criminal activity, substance abuse, or intentional falsification.

Additional reporting programs provide for sharing data that is collected by airlines for their internal use. FOQA (Flight Operational Quality Assurance) is an example. Air carriers often instrument their aircraft with extensive flight data recording systems or use pilot-generated checklists and reports for gathering information internally to improve operations and safety. FOQA provides a voluntary means for the airlines to share this information with other airlines and with the FAA so that national trends can be monitored and the FAA can target its resources to address the most important operational risk issues.

In contrast with the ASAP voluntary reporting of single events, FOQA programs allow the accumulation of accurate operational performance information covering all flights by multiple aircraft types, such that single events or overall patterns in aircraft performance data can be identified and analyzed. Such aggregate data can reveal trends specific to aircraft types, local flight path conditions, and overall flight performance trends for the commercial aircraft industry. FOQA data has been used to identify the need for changing air carrier operating procedures for specific aircraft fleets and for changing air traffic control practices at certain airports with unique traffic pattern limitations.

FOQA and other such voluntary reporting programs allow early identification of trends and changes in behavior (i.e., migration of systems toward states of increasing risk) before they lead to accidents. Follow-up is provided to ensure that unsafe conditions are effectively remediated by corrective actions.

A cornerstone of FOQA programs, once again, is the understanding that aggregate data provided to the FAA will be kept confidential and the identity of reporting personnel or airlines will remain anonymous. Data that could be used to identify flight crews are removed from the electronic record as part of the initial processing of the collected data. Air carrier FOQA programs, however, typically provide a gatekeeper who can securely retrieve identifying information for a limited amount of time, in order to enable follow-up requests for additional information from the specific flight crew associated with a FOQA event. The gatekeeper is typically a line captain designated by the air carrier's pilot association. FOQA programs usually involve agreements between pilot organizations and the carriers that define how the collected information can be used.

footnote. FOQA is voluntary in the United States but required in some countries.

section 12.5. Using the Feedback.

Once feedback is obtained, it needs to be used to update the controllers' process models and perhaps their control algorithms. The feedback and its analysis may be passed to others in the control structure who need it.
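One form such use of feedback can take is simple trend detection over aggregated event data, in the spirit of the FOQA and ASAP aggregate analyses described above. The short Python sketch below is only a notional illustration, not part of any FOQA or ASAP toolchain: the event records, field names, window length, and threshold are all invented for the example. It flags fleets whose recent rate of a given event type has climbed well above their own longer-term baseline.

from collections import defaultdict
from datetime import date

# Hypothetical event records: (event_date, fleet, event_kind).
events = [
    (date(2011, 3, 2), "B737", "unstable_approach"),
    (date(2011, 3, 9), "B737", "unstable_approach"),
    (date(2011, 4, 20), "A320", "unstable_approach"),
    # ... many more records in a real data set
]

def monthly_counts(events, kind):
    # Count events of one kind per (fleet, year, month).
    counts = defaultdict(int)
    for day, fleet, event_kind in events:
        if event_kind == kind:
            counts[(fleet, day.year, day.month)] += 1
    return counts

def rising_fleets(events, kind, recent_months=3, factor=1.5):
    # Flag fleets whose recent monthly rate exceeds their own baseline by `factor`.
    by_fleet = defaultdict(list)
    for (fleet, year, month), n in sorted(monthly_counts(events, kind).items()):
        by_fleet[fleet].append(n)
    flagged = []
    for fleet, series in by_fleet.items():
        if len(series) <= recent_months:
            continue  # not enough history to form a baseline
        baseline = sum(series[:-recent_months]) / len(series[:-recent_months])
        recent = sum(series[-recent_months:]) / recent_months
        if baseline > 0 and recent > factor * baseline:
            flagged.append((fleet, baseline, recent))
    return flagged

print(rising_fleets(events, "unstable_approach"))

A real program would of course normalize by flight hours or departures and feed the flagged trends back into the audit and follow-up activities described earlier; the point here is only that aggregate data supports this kind of systematic comparison, which single-event reports cannot.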
+Information must be provided in a form that people can learn from, apply to their daily jobs, and use throughout the system life cycle.

Various types of analysis may be performed by the controller on the feedback, such as trend analysis. If flaws in the system design or unsafe changes are detected, actions are obviously required to remedy the problems.

In major accidents, precursors and warnings are almost always present but ignored or mishandled. While what appear to be warnings are sometimes simply a matter of hindsight, sometimes clear evidence does exist. In 1982, two years before the Bhopal accident, for example, an audit was performed that identified many of the deficiencies involved in the loss. The audit report noted factors related to the later tragedy such as filter-cleaning operations without using slip blinds, leaking valves, and bad pressure gauges. The report recommended raising the capability of the water curtain and pointed out that the alarm at the flare tower was nonoperational and thus any leakage could go unnoticed for a long time. The report also noted that a number of hazardous conditions were known and allowed to persist for considerable amounts of time, or inadequate precautions were taken against them. In addition, there was no follow-up to ensure that deficiencies were corrected. According to the Bhopal manager, all improvements called for in the report had been implemented, but obviously that was either untrue or the fixes were ineffective.

As with accidents and incidents, warning signs or anomalies also need to be analyzed using CAST. Because practice will naturally deviate from procedures, often for very good reasons, the gap between procedures and practice needs to be monitored and understood.

section 12.6. Education and Training.

Everyone in the safety control structure, not just the lower-level controllers of the physical systems, must understand their roles and responsibilities with respect to safety and why the system (including the organizational aspects of the safety control structure) was designed the way it was.

People, both managers and operators, need to understand the risks they are taking in the decisions they make. Often bad decisions are made because the decision makers have an incorrect assessment of the risks being assumed, which has implications for training. Controllers must know exactly what to look for, not just be told to look for "weak signals," a common suggestion in the HRO literature. Before a bad outcome occurs, weak signals are simply noise; they take on the appearance of signals only in hindsight, when their relevance becomes obvious. Telling managers and operators to "be mindful of weak signals" simply creates a pretext for blame after a loss event occurs. Instead, the people involved need to be knowledgeable about the hazards associated with the operation of the system if we expect them to recognize the precursors to an accident. Knowledge turns unidentifiable weak signals into identifiable strong signals. People need to know what to look for.

Decision makers at all levels of the safety control structure also need to understand the risks they are taking in the decisions they make. Training should include not just what but why. For good decision making about operational safety, decision makers must understand the system hazards and their responsibilities with respect to avoiding them.
Understanding the safety rationale, that is, the "why," behind the system design will also have an impact on combating complacency and unintended changes leading to hazardous states. This rationale includes understanding why previous accidents occurred. The Columbia Accident Investigation Board was surprised at the number of NASA engineers in the Space Shuttle program who had never read the official Challenger accident report. In contrast, everyone in the U.S. nuclear Navy has training about the Thresher loss every year.

Training should not be a one-time event for employees but should be continual throughout their employment, if only as a reminder of their responsibilities and the system hazards. Learning about recent events and trends can be a focus of this training.

Finally, assessing training effectiveness, perhaps during regular audits, can assist in establishing an effective improvement and learning process.

With highly automated systems, an assumption is often made that less training is required. In fact, training requirements go up (not down) in automated systems, and they change in nature. Training needs to be more extensive and deeper when using automation. One of the reasons for this requirement is that human operators of highly automated systems not only need a model of the current process state and how it can change state but also a model of the automation and its operation, as discussed in chapter 8.

To control complex and highly automated systems safely, operators (controllers) need to learn more than just the procedures to follow. If we expect them to control and monitor the automation, they must also have an in-depth understanding of the controlled physical process and the logic used in any automated controllers they may be supervising. System controllers, at all levels, need to know:

• The system hazards and the reasons behind safety-critical procedures and operational rules.

• The potential results of removing or overriding controls, changing prescribed procedures, and inattention to safety-critical features and operations. Past accidents and their causes should be reviewed and understood.

• How to interpret feedback. Training needs to include different combinations of alerts and sequences of events, not just single events.

• How to think flexibly when solving problems. Controllers need to be provided with the opportunity to practice problem solving.

• General strategies rather than specific responses. Controllers need to develop skills for dealing with unanticipated events.

• How to test hypotheses in an appropriate way. To update their mental models, human controllers often use hypothesis testing to understand the system state better and update their process models. Such hypothesis testing is common with computers and automated systems, where documentation is usually so poor and hard to use that experimentation is often the only way to understand the automation behavior and design. Such testing can, however, lead to losses. Designers need to provide operators with the ability to test hypotheses safely, and controllers must be educated on how to do so.

Finally, as with any system, emergency procedures must be overlearned and continually practiced. Controllers must be provided with operating limits and specific actions to take in case those limits are exceeded.
Requiring operators to make decisions under stress and without full information is simply another way to ensure that they will be blamed for the inevitable loss event, usually based on hindsight bias. Critical limits must be established and provided to the operators, and emergency procedures must be stated explicitly.

section 12.7. Creating an Operations Safety Management Plan.

The operations safety management plan is used to guide operational control of safety. The plan describes the objectives of the operations safety program and how they will be achieved. It provides a baseline to evaluate compliance and progress. Like every other part of the safety program, the plan will need buy-in and oversight. The organization should have a template and documented expectations for operations safety management plans, but this template may need to be tailored for particular project requirements.

The information need not all be contained in one document, but there should be a central reference with pointers to where the information can be found. As is true for every other part of the safety control structure, the plan should include review procedures for the plan itself as well as how the plan will be updated and improved through feedback from experience.

Some things that might be included in the plan:

1.• General Considerations.
– Scope and objectives.
– Applicable standards (company, industry).
– Documentation and reports.
– Review of plan and progress reporting procedures.

2.• Safety Organization (safety control structure).
– Personnel qualifications and duties.
– Staffing and manpower.
– Communication channels.
– Responsibility, authority, accountability (functional organization, organizational structure).
– Information requirements (feedback requirements, process model, updating requirements).
– Subcontractor responsibilities.
– Coordination.
– Working groups.
– System safety interfaces with other groups, such as maintenance and test, occupational safety, quality assurance, and so on.

3.• Procedures.
– Problem reporting (processes, follow-up).
– Incident and accident investigation: procedures; staffing (participants); follow-up (tracing to hazard and risk analyses, communication).
– Testing and audit program: procedures; scheduling; review and follow-up; metrics and trend analysis; operational assumptions from hazard and risk analyses.
– Emergency and contingency planning and procedures.
– Management of change procedures.
– Training.
– Decision making, conflict resolution.

4.• Schedule.
– Critical checkpoints and milestones.
– Start and completion dates for tasks, reports, reviews.
– Review procedures and participants.

5.• Safety Information System.
– Hazard and risk analyses, hazard logs (controls, review and feedback procedures).
– Hazard tracking and reporting system.
– Lessons learned.
– Safety data library (documentation and files).
– Records retention policies.

6.• Operations hazard analysis.
– Identified hazards.
– Mitigations for hazards.

7.• Evaluation and planned use of feedback to keep the plan up-to-date and improve it over time.

section 12.8. Applying STAMP to Occupational Safety.

Occupational safety has, traditionally, not taken a systems approach but instead has focused on individuals and changing their behavior.
In applying systems theory to +occupational safety, more emphasis would be placed on understanding the impact +of system design on behavior and would focus on changing the system rather than +people. For example, vehicles used in large plants could be equipped with speed +regulators rather than depending on humans to follow speed limits and then punishing them when they do not. The same design for safety principles presented in +chapter 9 for human controllers apply to designing for occupational safety. +With the increasing complexity and automation of our plants, the line between +occupational safety and engineering safety is blurring. By designing the system to +be safe despite normal human error or judgment errors under competing work +pressures, workers will be better protected against injury while fulfilling their job +responsibilities. \ No newline at end of file diff --git a/chapter13.raw b/chapter13.raw new file mode 100644 index 0000000..2e3631c --- /dev/null +++ b/chapter13.raw @@ -0,0 +1,1113 @@ +chapter 13. +Managing Safety and the Safety Culture. +The key to effectively accomplishing any of the goals described in the previous +chapters lies in management. Simply having better tools is not enough if they are +not used. Studies have shown that management commitment to the safety goals is +the most important factor distinguishing safe from unsafe systems and companies +[101]. Poor management decision making can undermine any attempts to improve +safety and ensure that accidents continue to occur. +This chapter outlines some of the most important management factors in reduc- +ing accidents. The first question is why managers should care about and invest in +safety. The answer, in short, is that safety pays and investment in safety provides +large returns over the long run. +If managers understand the importance of safety in achieving organizational +goals and decide they want to improve safety in their organizations, then three basic +organizational requirements are necessary to achieve that goal. The first is an effec- +tive safety control structure. Because of the importance of the safety culture in how +effectively the safety control structure operates, the second requirement is to imple- +ment and sustain a strong safety culture. But even the best of intentions will not +suffice without the appropriate information to carry them out, so the last critical +factor is the safety information system. +The previous chapters in this book focus on what needs to be done during design +and operations to control safety and enforce the safety constraints. This chapter +describes the overarching role of management in this process. +section 13.1. Why Should Managers Care about and Invest in Safety? +Most managers do care about safety. The problems usually arise because of mis- +understandings about what is required to achieve high safety levels and what the +costs really are if safety is done right. Safety need not entail enormous financial or +other costs. + + + + +A classic myth is that safety conflicts with achieving other goals and that tradeoffs +are necessary to prevent losses. In fact, this belief is totally wrong. Safety is a pre- +requisite for achieving most organizational goals, including profits and continued +existence. +History is replete with examples of major accidents leading to enormous financial +losses and the demise of companies as a result. 
Even the largest global corporations +may not be able to withstand the costs associated with such losses, including loss of +reputation and customers. After all these examples, it is surprising that few seem to +learn from them about their own vulnerabilities. Perhaps it is in the nature of +mankind to be optimistic and to assume that disasters cannot happen to us, only +to others. In addition, in the simpler societies of the past, holding governments +and organizations responsible for safety was less common. But with loss of control +over our own environment and its hazards, and with rising wealth and living stan- +dards, the public is increasingly expecting higher standards of behavior with respect +to safety. +The “conflict” myth arises because of a misunderstanding about how safety is +achieved and the long-term consequences of operating under conditions of high risk. +Often, with the best of intentions, we simply do the wrong things in our attempts to +improve safety. It’s not a matter of lack of effort or resources applied, but how they +are used that is the problem. Investments in safety need to be funneled to the most +effective activities in achieving it. +Sometimes it appears that organizations are playing a sophisticated version of +Whack-a-Mole, where symptoms are found and fixed but not the processes that +allow these symptoms to occur. Enormous resources may be expended with little +return on the investment. So many incidents occur that they cannot all be investi- +gated in depth, so only superficial analysis of a few is attempted. If, instead, a few +were investigated in depth and the systemic factors fixed, the number of incidents +would decrease by orders of magnitude. +Such groups find themselves in continual firefighting mode and eventually con- +clude that accidents are inevitable and investments to prevent them are not cost- +effective, thus, like Sisyphus, condemning themselves to traverse the same vicious +circle in perpetuity. Often they convince themselves that their industry is just more +hazardous than others and that accidents in their world are inevitable and are the +price of productivity. +This belief that accidents are inevitable and occur because of random chance +arises from our own inadequate efforts to prevent them. When accident causes are +examined in depth, using the systems approach in this book, it becomes clear that +there is nothing random about them. In fact, we seem to have the same accident +over and over again, with only the symptoms differing, but the causes remaining +fairly constant. Most of these causes could be eliminated, but they are not. The + +precipitating immediate factors, like a stuck valve, may have some randomness +associated with them, such as which valve actually precipitates a loss. But there is +nothing random about systemic factors that have not been corrected and exist +over long periods of time, such as flawed valve design and analysis or inadequate +maintenance practices. +As described in previous chapters, organizations tend to move inexorably toward +states of higher risk under various types of performance pressures until an accident +become inevitable. Under external or internal pressures, projects start to violate +their own rules: “We’ll do it just this once—it’s critical that we get this procedure +finished today.” In the Deepwater Horizon oil platform explosion of 2010, cost pre- +ssures led to not following standard safety procedures and, in the end, to enormous +financial losses [18]. 
Similar dynamics occurred, with slightly different pressures, in +the Columbia Space Shuttle loss where the tensions among goals were created by +forces largely external to NASA. What appear to be short-term conflicts of other +organizational goals with safety goals, however, may not exist over the long term, +as witnessed in both these cases. +When operating at elevated levels of risk, the only question is which of many +potential events will trigger the loss. Before the Columbia accident, NASA manned +space operations was experiencing a slew of problems in the orbiters. The head of +the NASA Manned Space Program at the time misinterpreted the fact that they +were finding and fixing problems and wrote a report that concluded risk had been +reduced by more than a factor of five [74]. The same unrealistic perception of risk +led to another report in 1995 recommending that NASA “restructure and reduce +overall safety, reliability, and quality assurance elements” [105]. +Figure 13.1 shows some of the dynamics at work. The model demonstrates the +major sources of the high risk in the Shuttle program at the time of the Columbia +loss. In order to get the funding needed to build and operate the space shuttle, +NASA had made unachievable performance promises. The need to justify expendi- +tures and prove the value of manned space flight has been a major and consistent +tension between NASA and other governmental entities: The more missions the +Shuttle could fly, the better able the program was to generate funding. Adding to +these pressures was a commitment to get the International Space Station construc- +tion complete by February 2004 (called “core complete”), which required deliveries +of large items that could only be carried by the shuttle. The only way to meet the +deadline was to have no launch delays, a level of performance that had never previ- +ously been achieved [117]. As just one indication of the pressure, computer screen +savers were mailed to managers in NASA’s human spaceflight program that depicted +a clock counting down (in seconds) to the core complete deadline [74]. + + +The control loop in the lower left corner of figure 13.1, labeled R1 or Pushing +the Limit, shows how as external pressures increased, performance pressure +increased, which led to increased launch rates and thus success in meeting the launch +rate expectations, which in turn led to increased expectations and increasing per- +formance pressures. This reinforcing loop represents an unstable system and cannot +be maintained indefinitely, but NASA is a “can-do” organization that believes +anything can be accomplished with enough effort [136]. +The upper left loop represents the Space Shuttle safety program, which when +operating effectively is meant to balance the risks associated with loop R1. The exter- +nal influences of budget cuts and increasing performance pressures, however, reduced +the priority of safety procedures and led to a decrease in system safety efforts. + +Adding to the problems is the fact that system safety efforts led to launch delays +when problems were found, which created another reason for reducing the priority +of the safety efforts in the face of increasing launch pressures. 
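The dynamics just described can be made concrete with a toy simulation. The short sketch below is only an illustrative exercise in the spirit of a system dynamics model: the variables, coefficients, and time scale are invented for the example and are not taken from figure 13.1 or from NASA data. It shows how a reinforcing performance-pressure loop, combined with steadily eroding safety efforts and growing complacency, drives risk upward even though no single step looks dramatic.

# Toy simulation of a reinforcing "Pushing the Limit" loop with eroding safety
# efforts. Illustrative only; the values are invented, not taken from figure 13.1.
performance_pressure = 1.0   # rises as expectations are met and then raised
safety_effort = 1.0          # erodes under pressure and complacency
risk = 0.1                   # grows when pressure outpaces safety effort

for month in range(60):
    # R1: meeting the current expectations raises the expectations themselves.
    performance_pressure += 0.05 * performance_pressure
    # Pressure, plus complacency from the absence of visible losses, erodes
    # the safety program that is supposed to balance R1.
    complacency = 0.02
    safety_effort = max(0.0, safety_effort - 0.03 * performance_pressure - complacency)
    # Risk climbs as the gap between pressure and safety effort widens.
    risk = min(1.0, risk + 0.01 * max(0.0, performance_pressure - safety_effort))
    if month % 12 == 0:
        print(f"year {month // 12}: pressure={performance_pressure:.2f}, "
              f"safety effort={safety_effort:.2f}, risk={risk:.2f}")

Appendix D explains how to read the real system dynamics models; the point of the sketch is only that the reinforcing structure itself, rather than any single decision, produces the drift toward an unrecognized high-risk state described next.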
+While reduction in safety efforts and lower prioritization of safety concerns may +lead to accidents, accidents usually do not occur for a while so false confidence is +created that the reductions are having no impact on safety and therefore pressures +increase to reduce the efforts and priority even further as the external and internal +performance pressures mount. +The combination of the decrease in safety efforts along with loop B2 in which +fixing the problems that were being found increased complacency, which also +contributed to reduction of system safety efforts, eventually led to a situation of +unrecognized high risk. +When working at such elevated levels of risk, the only question is which of many +potential events will trigger the loss. The fact that it was the foam and not one of +the other serious problems identified both before and after the loss was the only +random part of the accident. At the time of the Columbia accident, NASA was +regularly flying the Shuttle with many uncontrolled hazards; the foam was just one +of them. +Often, ironically, our successful efforts to eliminate or reduce accidents contrib- +ute to the march toward higher risk. Perception of the risk associated with an activity +often decreases over a period of time when no losses occur even though the real +risk has not changed at all. This misperception leads to reducing the very factors +that are preventing accidents because they are seen as no longer needed and avail- +able to trade off with other needs. The result is that risk increases until a major loss +occurs. This vicious cycle needs to be broken to prevent accidents. In STAMP terms, +the weakening of the safety control structure over time needs to be prevented or +detected before the conditions occur that lead to a loss. +System migration toward states of higher risk is potentially controllable and +detectable [167]. The migration results from weakening of the safety control struc- +ture. To achieve lasting results, strong operational safety efforts are needed that +provide protection from and appropriate responses to the continuing environmental +influences and pressures that tend to degrade safety over time and that change the +safety control structure and the behavior of those in it. +The experience in the nuclear submarine community is a testament to the fact +that such dynamics can be overcome. The SUBSAFE program (described in the +next chapter) was established after the loss of the Thresher in 1963. Since that time, +no submarine in the SUBSAFE program, that is, satisfying the SUBSAFE require- +ments, has been lost, although such losses were common before SUBSAFE was +established. +The leaders in SUBSAFE describe other benefits beyond preventing the loss of +critical assets. Because those operating the submarines have complete confidence + + +in their ships, they can focus solely on the completion of their mission. The U.S. +nuclear submarine program’s experience over the past forty-five years belies the +myth that increasing safety necessarily decreases system performance. Over a sus- +tained period, a safer operation is generally more efficient. One reason is that stop- +pages and delays are eliminated. +Examples can also be found in private industry. As just one example, because of +a number of serious accidents, OSHA tried to prohibit the use of power presses +where employees had to place one or both hands beneath the ram during the pro- +duction cycle [96]. 
After vehement protests that the expense would be too great in +terms of reduced productivity, the requirement was dropped: Preliminary motion +studies showed that reduced production would result if all loading and unloading +were done with the die out from under the ram. Some time after OSHA gave up +on the idea, one manufacturer who used power presses decided, purely as a safety +and humanitarian measure, to accept the production penalty. Instead of reducing +production, however, the effect was to increase production from 5 to 15 percent, +even though the machine cycle was longer. Other examples of similar experiences +can be found in Safeware [115]. +The belief that safer systems cost more or that building safety in from the begin- +ning necessarily requires unacceptable compromises with other goals is simply not +justified. The costs, like anything else, depend on the methods used to achieve +increased safety. In another ironic twist, in the attempt to avoid making tradeoffs +with safety, systems are often designed to optimize mission goals and safety devices +added grudgingly when the design is complete. This approach, however, is the most +expensive and least effective that could be used. The costs are much less and in +fact can be eliminated if safety is built into the system design from the beginning +rather than added on or retrofitted later, usually in the form of redundancy +or elaborate protection systems. Eliminating or reducing hazards early in design +often results in a simpler design, which in itself may reduce both risk and costs. +The reduced risk makes it more likely that the mission or system goals will be +achieved. +Sometimes it takes a disaster to “get religion” but it should not have to. This +chapter was written for those managers who are wise enough to know that invest- +ment in safety pays dividends, even before this fact is brought home (usually too +late) by a tragedy. + +footnote. Appendix D explains how to read system dynamics models, for those unfamiliar with them. + + +section 13.2. General Requirements for Achieving Safety Goals. +Escaping from the Whack-a-Mole trap requires identifying and eliminating the +systemic factors behind accidents. Some common reasons why safety efforts are +often not cost-effective were identified in chapter 6, including: + +1.•Superficial, isolated, or misdirected safety engineering activities, such as spend- +ing most of the effort proving the system is safe rather than making it so. +2.•Starting too late. +3.•Using techniques inappropriate for today’s complex systems and new +technology. +4.•Focusing only on the technical parts of the system, and +5.• Assuming systems are static throughout their lifetime and decreasing attention +to safety during operations + +Safety needs to be managed and appropriate controls established. The major ingre- +dients of effective safety management include: +1.• +Commitment and leadership +2.• A corporate safety policy +3.•Risk awareness and communication channels +4.•Controls on system migration toward higher risk +5.• A strong corporate safety culture +6.• A safety control structure with appropriate assignment of responsibility, author- +ity, and accountability +7.• A safety information system +8.•Continual improvement and learning +9.•Education, training, and capability development +Each of these is described in what follows. + +section 13.2.1. Management Commitment and Leadership. 
+Top management concern about safety is the most important factor in discriminating between safe and unsafe companies matched on other variables [100]. This commitment must be genuine, not just a matter of sloganeering. Employees need to feel they will be supported if they show concern for safety. An Air Force study of system safety concluded:

Air Force top management support of system safety has not gone unnoticed by contractors. They now seem more than willing to include system safety tasks, not as "window dressing" but as a meaningful activity [70, pp. 5–11].

The B1-B program is an example of how this result was achieved. In that development program, the program manager or deputy program manager chaired the meetings of the group where safety decisions were made. "An unmistaken image of the importance of system safety in the program was conveyed to the contractors" [70, p. 5].

A manager's open and sincere concern for safety in everyday dealings with employees and contractors can have a major impact on the reception given to safety-related activities [157]. Studies have shown that top management's support for and participation in safety efforts is the most effective way to control and reduce accidents [93]. Support for safety is shown by personal involvement, by assigning capable people and giving them appropriate objectives and resources, by establishing comprehensive organizational safety control structures, and by responding to initiatives by others.

section 13.2.2. Corporate Safety Policy.

A policy is a written statement of the wisdom, intentions, philosophy, experience, and belief of an organization's senior managers that states the goals for the organization and guides their attainment [93]. The corporate safety policy provides employees with a clear, shared vision of the organization's safety goals and values and a strategy to achieve them. It documents and shows managerial priorities where safety is involved.

The author has found companies that justify not having a safety policy on the grounds that "everyone knows safety is important in our business." While safety may seem important for a particular business, management remaining mute on their policy conveys the impression that tradeoffs are acceptable when safety seems to conflict with other goals. The safety policy provides a way for management to clearly define the priority between conflicting goals that they expect to be used in decision making. The safety policy should define the relationship of safety to other organizational goals and provide the scope for discretion, initiative, and judgment in deciding what should be done in specific situations.

Safety policy should be broken into two parts. The first is a short and concise statement of the safety values of the corporation and what is expected from employees with respect to safety. Details about how the policy will be implemented should be separated into other documents.

A complete safety policy contains such things as the goals of the safety program; a set of criteria for assessing the short- and long-term success of that program with respect to the goals; the values to be used in tradeoff decisions; and a clear statement of responsibilities, authority, accountability, and scope. The policy should be explicit and state in clear and understandable language what is expected, not a set of lofty goals that cannot be operationalized.
An example sometimes found (as noted in the +previous chapter) is a policy for employees to “be mindful of weak signals”: This +policy provides no useful guidance on what to do—both “mindful” and “weak +signals” are undefined and undefinable. An alternative might be, “If you see + + +something that you think is unsafe, you are responsible for reporting it immediately.” +In addition, employees need to be trained on the hazards in the processes they +control and what to look for. +Simply having a safety policy is not enough. Employees need to believe the +safety policy reflects true commitment by management. The only way this commit- +ment can be effectively communicated is through actions by management that +demonstrate that commitment. Employees need to feel that management will +support them when they make reasonable decisions in favor of safety over alterna- +tive goals. Incentives and reward structures must encourage the proper handling of +tradeoffs between safety and other goals. Not only the formal rewards and rules +but also the informal rules (social processes) of the organizational culture must +support the overall safety policy. A practical test is whether employees believe that +company management will support them if they choose safety over the demands of +production [128]. +To encourage proper decision making, the flexibility to respond to safety prob- +lems needs to be built into the organizational procedures. Schedules, for example, +should be adaptable to allow for uncertainties and possibilities of delay due to +legitimate safety concerns, and production goals must be reasonable. +Finally, not only must a safety policy be defined, it must be disseminated and +followed. Management needs to ensure that safety receives appropriate attention +in decision making. Feedback channels must be established and progress in achiev- +ing the goals should be monitored and improvements identified, prioritized, and +implemented. +section 13.2.3. Communication and Risk Awareness. +Awareness of the risk in the controlled process is a major component of safety- +related decision making by controllers. The problem is that risk, when defined +as the severity of a loss event combined with its likelihood, is not calculable or +knowable. It can only be estimated from a set of variables, some of which may be +unknown, or the information to evaluate likelihood of these variables may be +lacking or incorrect. But decisions need to be made based on this unknowable +property. +In the absence of accurate information about the state of the process, risk percep- +tion may be reevaluated downward as time passes without an accident. In fact, risk +probably has not changed, only our perception of it. In this trap, risk is assumed to +be reflected by a lack of accidents or incidents and not by the state of the safety +control structure. +When STAMP is used as the foundation of the safety program, safety and risk +are a function of the effectiveness of the controls to enforce safe system behavior, that +is, the safety constraints and the control structure used to enforce those constraints. + + +Poor safety-related decision making on the part of management, for example, is +commonly related to inadequate feedback and inaccurate process models. As such, +risk is potentially knowable and not some amorphous property denoted by probabil- +ity estimates. This new definition of risk can be used to create new risk assessment +procedures. 
+While lack of accidents could reflect a strong safety control structure, it may also +simply reflect delays between the relaxation of the controls and negative conse- +quences. The delays encourage relaxation of more controls, which then leads to +accidents. The basic problem is inaccurate risk perception and calculating risk using +the wrong factors. This process is behind the frequently used but rarely defined label +of “complacency.” Complacency results from inaccurate process models and risk +awareness. +Risk perception is directly related to communication and feedback. The more and +better the information we have about the potential causes of accidents in our system +and the state of the controls implemented to prevent them, the more accurate will +be our perception of risk. Consider the loss of an aircraft when it took off from the +wrong runway in Lexington, Kentucky, in August 2006. One of the factors in the +accident was that construction was occurring and the pilots were confused about +temporary changes in taxi patterns. Although similar instances of crew confusion +had occurred in the week before the accident, there were no effective communica- +tion channels to get this information to the proper authorities. After the loss, a small +group of aircraft maintenance workers told the investigators that they also had +experienced confusion when taxiing to conduct engine tests—they were worried +that an accident could happen, but did not know how to effectively notify people +who could make a difference [142]. +Another communication disconnect in this accident leading to a misperception +of risk involved a misunderstanding by management about the staffing of the control +tower at the airport. Terminal Services management had ordered the airport air +traffic control management to both reduce control tower budgets and to ensure +separate staffing of the tower and radar functions. It was impossible to comply with +both directives. Because of an ineffective feedback mechanism, management did not +know about the impossible and dangerous goal conflicts they had created or that +the resolution of the conflict was to reduce the budget and ignore the extra staffing +requirements. +Another example occurred in the Deepwater Horizon accident. Reports after the +accident indicated that workers felt comfortable raising safety concerns and ideas +for safety improvement to managers on the rig, but they felt that they could not +raise concerns at the divisional or corporate level without reprisal. In a confidential +survey of workers on Deepwater Horizon taken before the oil platform exploded, +workers expressed concerns about safety: + + +“I’m petrified of dropping anything from heights not because I’m afraid of hurting anyone +(the area is barriered off), but because I’m afraid of getting fired,” one worker wrote. “The +company is always using fear tactics,” another worker said. “All these games and your +mind gets tired.” Investigators also said “nearly everyone among the workers they inter- +viewed believed that Transocean’s system for tracking health and safety issues on the rig +was counter productive.” Many workers entered fake data to try to circumvent the system, +known as See, Think, Act, Reinforce, Track (or START). As a result, the company’s percep- +tion of safety on the rig was distorted, the report concluded [27, p. Al] +Formal methods of operation and strict hierarchies can limit communication. 
When +information is passed up hierarchies, it may be distorted, depending on the interests +of managers and the way they interpret the information. Concerns about safety may +even be completely silenced as it passes up the chain of command. Employees may +not feel comfortable going around a superior who does not respond to their con- +cerns. The result may be a misperception of risk, leading to inadequate control +actions to enforce the safety constraints. +In other accidents, reporting and feedback systems are simply unused for a +variety of reasons. In many losses, there was evidence that a problem occurred +in time to prevent the loss, but there was either no communication channel estab- +lished for getting the information to those who could understand it and to those +making decisions or, alternatively, the problem-reporting channel was ineffective or +simply unused. +Communication is critical in both providing information and executing control +actions and in providing feedback to determine whether the control actions were +successful and what further actions are required. Decision makers need accurate +and timely information. Channels for information dissemination and feedback need +to be established that include a means for comparing actual performance with +desired performance and ensuring that required action is taken. +In summary, both the design of the communication channels and the communica- +tion dynamics must be considered as well as potential feedback delays. As an +example of communication dynamics, reliance on face-to-face verbal reports during +group meetings is a common method of assessing lower-level operations [189], but, +particularly when subordinates are communicating with superiors, there is a ten- +dency for adverse situations to be underemphasized [20]. + +section 13.2.4. Controls on System Migration toward Higher Risk. +One of the key assumptions underlying the approach to safety described in this +book is that systems adapt and change over time. Under various types of pressures, +that adaptation often moves in the direction of higher risk. The good news is, as +stated earlier, that adaptation is predictable and potentially controllable. The safety +control structure must provide protection from and appropriate responses to the +continuing influences and pressures that tend to degrade safety over time. More + + +specifically, the potential reasons for and types of migration toward higher risk need +to be identified and controls instituted to prevent it. In addition, audits and perfor- +mance assessments based on the safety constraints identified during system develop- +ment can be used to detect migration and the violation of the constraints as described +in chapter 12. +One way to prevent such migration is to anchor safety efforts beyond short-term +program management pressures. At one time, NASA had a strong agency-wide +system safety program with common standards and requirements levied on every- +one. Over time, agency-wide standards were eviscerated, and programs were allowed +to set their own standards under the control of the program manager. While the +manned space program started out with strong safety standards, under budget and +performance pressures they were progressively weakened [117]. +As one example, a basic requirement for an effective operational safety program +is that all potentially hazardous incidents during operations are thoroughly investi- +gated. 
Debris shedding had been identified as a potential hazard during Shuttle +development, but the standard for performing hazard analyses in the Space Shuttle +program was changed to specify that hazards would be revisited only when there +was a new design or the Shuttle design was changed, not after an anomaly (such as +foam shedding) occurred [117]. +After the Columbia accident, safety standards in the Space Shuttle program (and +the rest of NASA) were effectively anchored and protected from dilution over time +by moving responsibility for them outside the projects. + +section 13.2.5. Safety, Culture, and Blame. +The high-level goal in managing safety is to create and maintain an effective safety +control structure. Because of the importance of safety culture in how the control +structure operates, achieving this goal requires implementing and sustaining a strong +safety culture. +Proper function of the safety control structure relies on decision making by the +controllers in the structure. Decision making always rests upon a set of industry or +organizational values and assumptions. A culture is a set of shared values and norms, +a way of looking at and interpreting the world and events around us and of taking +action in a social context. Safety culture is that subset of culture that reflects the +general attitude and approaches to safety and risk management. +Shein divides culture into three levels (figure 13.2) [188]. At the top are the +surface-level cultural artifacts or routine aspects of everyday practice including +hazard analyses and control algorithms and procedures. The second, middle level is +the stated organizational rules, values, and practices that are used to create the top- +level artifacts, such as safety policy, standards, and guidelines. At the lowest level is +the often invisible but pervasive underlying deep cultural operating assumptions + + +upon which actions are taken and decisions are made and thus upon which the upper +levels rest. +Trying to change safety outcomes by simply changing the organizational +structures—including policies, goals, missions, job descriptions, and standard operat- +ing procedures—may lower risk over the short term, but superficial fixes that do +not address the set of shared values and social norms are very likely to be undone +over time. Changes are required in the organizational values that underlie people’s +behavior. +Safety culture is primarily set by the leaders of the organization as they establish +the basic values under which decisions will be made. This fact explains why leader- +ship and commitment by leaders is critical in achieving high levels of safety. +To engineer a safety culture requires identifying the desired organizational safety +principles and values and then establishing a safety control structure to achieve +those values and to sustain them over time. Sloganeering or jawboning is not +enough: all aspects of the safety control structure must be engineered to be in align- +ment with the organizational safety principles, and the leaders must be committed +to the stated policies and principles related to safety in the organization. +Along with leadership and commitment to safety as a basic value of the organiza- +tion, achieving safety goals requires open communication. 
In an interview after the Columbia loss, the new center director at Kennedy Space
Center suggested that the most important cultural issue the Shuttle program faced
was establishing a feeling of openness and honesty with all employees, where
everybody’s voice was valued. Statements during the Columbia accident investigation
and messages posted to the NASA Watch website describe a lack of trust that left
NASA employees reluctant to speak up. At the same time, a critical observation in
the CAIB report focused on the engineers’ claims that the managers did not hear the
engineers’ concerns [74]. The report concluded that this was in part due to the
managers not asking or listening. Managers created barriers against dissenting
opinions by stating preconceived conclusions based on subjective knowledge and
experience rather than on solid data. Much of the time they listened to those who
told them what they wanted to hear. One indication of the poor communication
around safety and the atmosphere at the time was the 1995 Kraft report [105], which
dismissed concerns about Space Shuttle safety by accusing those who made them of
being partners in an unneeded “safety shield conspiracy.”
Unhealthy work atmospheres with respect to safety and communication are not
limited to NASA. Carroll documents a similarly dysfunctional safety culture at the
Millstone nuclear power plant [33]. An NRC review in 1996 concluded that the safety
culture at the plant was dangerously flawed: it did not tolerate dissenting views and
stifled questioning attitudes among employees.
Changing such interaction patterns is not easy. Management style can be addressed
through training, mentoring, and proper selection of people to fill management
positions, but trust is hard to gain and easy to lose. Employees need to feel
psychologically safe about reporting concerns and to believe that managers can be
trusted to hear their concerns and to take appropriate action, while managers have to
believe that employees are worth listening to and worthy of respect.
The difficulty is in getting people to change their view of reality. Gareth Morgan,
a social anthropologist, defines culture as an ongoing, proactive process of reality
construction. According to this view, organizations are socially constructed realities
that rest as much in the heads and minds of their members as they do in concrete
sets of rules and regulations. Morgan asserts that organizations are “sustained by
belief systems that emphasize the importance of rationality” [139]. This myth of
rationality “helps us to see certain patterns of action as legitimate, credible, and
normal, and hence to avoid the wrangling and debate that would arise if we were
to recognize the basic uncertainty and ambiguity underlying many of our values and
actions” [139].
For both the Challenger and Columbia accidents, as well as most other major
accidents where decision making was flawed, the decision makers saw their actions
as rational. Understanding and preventing poor decision making under conditions
of uncertainty requires providing environments and tools that help to stretch our
belief systems and to see patterns that we do not necessarily want to see.
Some common types of dysfunctional safety cultures can be identified that are
common to industries or organizations. Hopkins coined the term “culture of denial”
after investigating accidents in the mining industry, but mining is not the only
industry in which denial is pervasive.
In such cultures, risk assessment is unrealistic and +credible warnings are dismissed without appropriate action. Management only +wants to hear good news and may ensure that is what they hear by punishing bad +news, sometimes in a subtle way and other times not so subtly. Often arguments are + + +made in these industries that the conditions are inherently more dangerous than +others and therefore little can be done about improving safety or that accidents are +the price of productivity and cannot be eliminated. Of course, this rationale is untrue +but it is convenient. +A second type of dysfunctional safety culture might be termed a “paperwork +culture.” In these organizations, employees spend all their time proving the system +is safe but little time actually doing the things necessary to make it so. After the +Nimrod aircraft loss in Afghanistan in 2006, the accident report noted a “culture of +paper safety” at the expense of real safety [78]. +So what are the aspects of a good safety culture, that is, the core values and norms +that allow us to make better decisions around safety? +1.•Safety commitment is valued. +2.•Safety information is surfaced without fear and incident analysis is conducted +without blame. +3.•Incidents and accidents are valued as an important window into systems that +are not functioning as they should—triggering in-depth and uncircumscribed +causal analysis and improvement actions. +4.• There is a feeling of openness and honesty, where everyone’s voice is respected. +Employees feel that managers are listening. +5a.•There is trust among all parties. +5b.•Employees feel psychologically safe about reporting concerns. + +5c.• +Employees believe that managers can be trusted to hear their concerns and +will take appropriate action. + +5d.Managers believe that employees are worth listening to and are worthy of +respect. + +Common ingredients of a safety culture based on these values include management +commitment to safety and the safety values, management involvement in achieving +the safety goals, employee empowerment, and appropriate and effective incentive +structures and reporting systems. +When these ingredients form the basis of the safety culture, the organization has +the following characteristics: +1.•Safety is integrated into the dominant culture; it is not a separate subculture. +2.•Safety is integrated into both development and operations. Safety activities +employ a mixture of top-down engineering or reengineering and bottom-up +process improvement. +3.•Individuals have required knowledge, skills, and ability. + +4.•Early warning systems for migration toward states of high risk are established +and effective. +5.• The organization has a clearly articulated safety vision, values and procedures, +shared among the stakeholders. +6.• Tensions between safety priorities and other system priorities are addressed +through a constructive, negotiated process. +7.•Key stakeholders (including all employees and groups such as unions) have full +partnership roles and responsibilities regarding safety. +8.•Passionate, effective leadership exists at all levels of the organization (particu- +larly the top), and all parts of the safety control structure are committed to +safety as a high priority for the organization. +9.•Effective communication channels exist for disseminating safety information. +10.•High levels of visibility of the state of safety (i.e., risk awareness) exist at all +levels of the safety control structure through appropriate and effective +feedback. 
+11.• The results of operating experience, process hazard analyses, audits, near misses, +or accident investigations are used to improve operations and the safety control +structure. +12.• +Deficiencies found during assessments, audits, inspections, and incident inves- +tigation are addressed promptly and tracked to completion. +The Just Culture Movement +The Just Culture movement is an attempt to avoid the type of unsafe cultural values +and professional interactions that have been implicated in so many accidents. +Its origins are in aviation although some in the medical community, particularly +hospitals, have also taken steps down this road. Much has been written on Just +Culture—only a summary is provided here. The reader is directed in particular +to Dekker’s book Just Culture [51], which is the source of much of what follows in +this section. +A foundational principle of Just Culture is that the difference between a safe and +unsafe organization is how it deals with reported incidents. This principle stems from +the belief that an organization can benefit more by learning from mistakes than by +punishing people who make them. +In an organization that promotes such a Just Culture [51]: +1.• +Reporting errors and suggesting changes is normal, expected, and without +jeopardy for anyone involved. +2.• A mistake or incident is not seen as a failure but as a free lesson, an opportunity +to focus attention and to learn. + + +3.•Rather than making people afraid, the system makes people participants in +change and improvement. +4.•Information provided in good faith is not used against those who report it. +Most people have a genuine concern for the safety and quality of their work. If +through reporting problems they contribute to visible improvements, few other +motivations or exhortations to report are necessary. In general, empowering people +to affect their work conditions and making the reporters of safety problems part of +the change process promotes their willingness to shoulder their responsibilities and +to share information about safety problems. +Beyond the obvious safety implications, a Just Culture may improve morale, com- +mitment to the organization, job satisfaction, and willingness to do extra, to step +outside their role. It encourages people to participate in improvement efforts and +gets them actively involved in creating a safer system and workplace. +There are several reasons why people may not report safety problems, which were +covered in chapter 12. To summarize, the reporting channels may be difficult or time +consuming to use, they may feel there is no point in reporting because the organiza- +tion will not do anything anyway or they may fear negative consequences in report- +ing. Each of these reasons must be and can be mitigated through better system +design. Reporting should be easy and not require excessive time or effort that takes +away from direct job responsibilities. There must be responses made both to the +initial report that indicates it was received and read and later information should +be provided about the resolution of the reported problem. +Promoting a Just Culture requires getting away from blame and punishment +as a solution to safety problems. One of the new assumptions in chapter 2 for an +accident model and underlying STAMP was: +Blame is the enemy of safety. Focus should instead be on understanding how the entire +system behavior led to the loss and not on who or what to blame. 
+Blame and punishment discourage reporting problems and mistakes so improve- +ments can be made to the system. As has been argued throughout this book, chang- +ing the system is the best way to achieve safety, not trying to change people. +When blame is a primary component of the safety culture, people stop reporting +incidents. This basic understanding underlies the Aviation Safety Reporting System +(ASRS) where pilots and others are given protection from punishment if they report +mistakes (see chapter 12). A decision was made in establishing the ASRS and other +aviation reporting systems that organizational and industry learning from mistakes +was more important than punishing people for them. If most errors stem from the +design of the system or can be prevented by changing the design of the system, then +blaming the person who made the mistake is misplaced anyway. + + +A culture of blame creates a climate of fear that makes people reluctant to share +information. It also hampers the potential to learn from incidents; people may even +tamper with safety recording devices, turning them off, for example. A culture of +blame interferes with regulatory work and the investigation of accidents because +people and organizations are less willing to cooperate. The role of lawyers can +impede safety efforts and actually make accidents more likely: Organizations may +focus on creating paper trails instead of utilizing good safety engineering practices. +Some companies avoid standard safety practices under the advice of their lawyers +that this will protect them in legal proceedings, thus almost guaranteeing that acci- +dents and legal proceedings will occur. +Blame and the overuse of punishment as a way to change behavior can directly +lead to accidents that might not have otherwise occurred. As an example, a train +accident in Japan—the 2005 Fukuchiyama line derailment— occurred when a train +driver was on the phone trying to ensure that he would not be reported for a minor +infraction. Because of this distraction, he did not slow down for a curve, resulting +in the deaths of 106 passengers and the train driver along with injury of 562 pas- +sengers [150]. Blame and punishment for mistakes causes stress and isolation and +makes people perform less well. +The alternative is to see mistakes as an indication of an organizational, opera- +tional, educational, or political problem. The question then becomes what should be +done about the problem and who should bear responsibility for implementing the +changes. The mistake and any harm from it should be acknowledged, but the +response should be to lay out the opportunities for reducing such mistakes by every- +one (not just this particular person), and the responsibilities for making changes so +that the probability of it happening again is reduced. This approach allows people +and organizations to move forward to prevent mistakes in the future and not just +focus on punishing past behavior [51]. Punishment is usually not a long-term deter- +rent for mistakes if the system in which the person operates has not changed the +reason for the mistake. Just Culture principles allow us to learn from minor incidents +instead of waiting until tragedies occur. +A common misunderstanding is that a Just Culture means a lack of accountability. +But, in reality, it is just the opposite. 
Accountability is increased in a Just Culture by +not simply assigning responsibility and accountability to the person at the bottom +of the safety control structure who made the direct action involved in the mistake. +All components of the safety control structure involved are held accountable includ- +ing (1) those in operations who contribute to mistakes by creating operational +pressures and providing inadequate oversight to ensure safe procedures are being +followed, and (2) those in development who create a system design that contributes +to mistakes. +The difference in a Just Culture is not in the accountability for safety problems +but how accountability is implemented. Punishment is an appropriate response to + + + +gross negligence and disregard for other people’s safety, which, of course, applies to +everyone in the safety control structure, including higher-level management and +developers as well as the lower level controllers. But if mistakes were made or +inadequate controls over safety provided because of flaws in the design of the con- +trolled system or the safety control structure, then punishment is not the appropriate +response—fixing the system or the safety control structure is. Dekker has suggested +that accountability be defined in terms of responsibility for finding solutions to the +system design problems from which the mistakes arose [51]. +Overcoming our cultural bias to punish people for their mistakes and the common +belief that punishment is the only way to change behavior can be very difficult. But +the payoff is enormous if we want to significantly reduce accident rates. Trust is a +critical requirement for encouraging people to share their mistakes and safety prob- +lems with others so something can be done before major losses occur. +section 13.2.6. Creating an Effective Safety Control Structure. +In some industries, the safety control structure is called the safety management +system (SMS). In civil aviation, ICAO (International Civil Aviation Authority) has +created standards and recommended practices for safety management systems and +individual countries have strongly recommended or required certified air carriers +to establish such systems in order to control organizational factors that contribute +to accidents. +There is no right or wrong design of a safety control structure or SMS. Most of +the principles for design of safe control loops in chapter 9 also apply here. The +culture of the industry and the organization will play a role in what is practical and +effective. There are some general rules of thumb, however, that have been found to +be important in practice. +General Safety Control Structure Design Principles. +Making everyone responsible for safety is a well-meaning misunderstanding of +what is required. While, of course, everyone should try to behave safely and to +achieve safety goals, someone has to be assigned responsibility for ensuring that +the goals are being achieved. This lesson was learned long ago in the U.S. Intercon- +tinental Ballistic Missile System (ICBM). Because safety was such an important +consideration in building the early 1950s missile systems, safety was not assigned as +a specific responsibility, but was instead considered to be everyone’s responsibility. +The large number of resulting incidents, particularly those involving the interfaces +between subsystems, led to the understanding that safety requires leadership +and focus. 
+There needs to be assignment of responsibility for ensuring that hazardous +behaviors are eliminated or, if not possible, mitigated in design and operations. +Almost all attention during development is focused on what the system and its + + +components are supposed to do. System safety engineering is responsible for ensur- +ing that adequate attention is also paid to what the system is not supposed to do +and verifying that hazardous behavior will not occur. It is this unique focus that has +made the difference in systems where safety engineering successfully identified +problems that were not found by the other engineering processes. +At the other extreme, safety efforts may be assigned to a separate group that +is isolated from critical decision making. During system development, responsibil- +ity for safety may be concentrated in a separate quality assurance group rather +than in the system engineering organization. During operations, safety may be +the responsibility of a staff position with little real power or impact on line +operations. +The danger inherent in this isolation of the safety efforts is argued repeatedly +throughout this book. To be effective, the safety efforts must have impact, and they +must be integrated into mainstream system engineering and operations. +Putting safety into the quality assurance organization is the worst place for it. For +one thing, it sets up the expectation that safety is an after-the-fact or auditing activity +only: safety must be intimately integrated into design and decision-making activities. +Safety permeates every part of development and operations. While there may be +staff positions performing safety functions that affect everyone at their level of the +organization and below, safety must be integrated into all of engineering develop- +ment and line operations. Important safety functions will be performed by most +everyone, but someone needs the responsibility to ensure that they are being carried +out effectively. +At the same time, independence is also important. The CAIB report addresses +this issue: +Organizations that successfully operate high-risk technologies have a major characteristic +in common: they place a premium on safety and reliability by structuring their programs +so that technical and safety engineering organizations own the process of determining, +maintaining, and waiving technical requirements with a voice that is equal to yet inde- +pendent of Program Managers, who are governed by cost, schedule, and mission- +accomplishment goals [74, p. 184]. +Besides associating safety with after-the-fact assurance and isolating it from system +engineering, placing it in an assurance group can have a negative impact on its +stature, and thus its influence. Assurance groups often do not have the prestige +necessary to have the influence on decision making that safety requires. A case can +be made that the centralization of system safety in quality assurance at NASA, +matrixed to other parts of the organization, was a major factor in the decline of the +safety culture preceding the Columbia loss. Safety was neither fully independent +nor sufficiently influential to prevent the loss events [117]. + +Safety responsibilities should be assigned at every level of the organization, +although they will differ from level to level. At the corporate level, system safety +responsibilities may include defining and enforcing corporate safety policy, and +establishing and monitoring the safety control structure. 
In some organizations that build extremely hazardous systems, a group at the
corporate or headquarters level certifies these systems as safe for use. For example,
the U.S. Navy has a Weapons Systems Explosives Safety Review Board that assures
the incorporation of explosive safety criteria in all weapon systems by reviews
conducted throughout all the system’s life cycle phases. For some companies, it may
be reasonable to have such a review process at more than just the highest level.
Communication is important because safety-motivated changes in one subsystem
may affect other subsystems and the system as a whole. In military procurement
groups, oversight and communication are enhanced through the use of safety working
groups. In establishing any oversight process, two extremes must be avoided: “getting
into bed” with the project and losing objectivity, or backing off too far and losing
insight. Working groups are an effective way of avoiding these extremes. They assure
comprehensive and unified planning and action while allowing for independent
review and reporting channels.
Working groups usually operate at different levels of the organization. As an
example, the Navy Aegis system development, a very large and complex system,
included a System Safety Working Group at the top level chaired by the Navy
Principal for Safety, with the permanent members being the prime contractor’s system
safety lead and representatives from various Navy offices. Contractor representatives
attended meetings as required. Members of the group were responsible for
coordinating safety efforts within their respective organizations, for reporting the
status of outstanding safety issues to the group, and for providing information to
the Navy Weapons Systems Explosives Safety Review Board. Working groups also
functioned at lower levels, providing the necessary coordination and communication
for that level and to the levels above and below.
A surprisingly large percentage of the reports on recent aerospace accidents have
implicated improper transition from an oversight to an insight process (for example,
see [193, 215, 153]). This transition implies the use of different levels of feedback
control and a change from prescriptive management control to management by
objectives, where the objectives are interpreted and satisfied according to the local
context. For these accidents, the change in management role from oversight to
insight seems to have been implemented simply as a reduction in personnel and
budgets without assuring that anyone was responsible for specific critical tasks.

footnote. The Aegis Combat System is an advanced command and control and weapon control system that uses
powerful computers and radars to track and guide weapons to destroy enemy targets.

Assigning Responsibilities.
An important question is what responsibilities should be assigned to the control
structure components. The list below is derived from the author’s experience on a
large number and variety of projects. Many also appear in accident report
recommendations, particularly those generated using CAST.
The list is meant only to be a starting point for those establishing a comprehensive
safety control structure and a checklist for those who already have sophisticated
safety management systems. It should be supplemented using other sources and
experiences.
The list does not imply that each responsibility will be assigned to a single person
or group.
The responsibilities will probably need to be separated into multiple indi- +vidual responsibilities and assigned throughout the safety control structure, with one +group actually implementing the responsibilities and others above them supervising, +leading (directing), or overseeing the activity. Of course, each responsibility assumes +the need for associated authority and accountability plus the controls, feedback, and +communication channels necessary to implement the responsibility. The list may +also be useful in accident and incident analysis to identify inadequate controls and +control structures. + +Management and General Responsibilities. +1.•Provide leadership, oversight, and management of safety at all levels of the +organization. +2.•Create a corporate or organizational safety policy. Establish criteria for evaluat- +ing safety-critical decisions and implementing safety controls. Establish distri- +bution channels for the policy. Establish feedback channels to determine +whether employees understand it, are following it, and whether it is effective. +Update the policy as needed. +3.•Establish corporate or organizational safety standards and then implement, +update, and enforce them. Set minimum requirements for safety engineering +in development and operations and oversee the implementation of those +requirements. Set minimum physical and operational standards for hazardous +operations. +4.•Establish incident and accident investigation standards and ensure recommen- +dations are implemented and effective. Use feedback to improve the standards. +5.•Establish management of change requirements for evaluating all changes for +their impact on safety, including changes in the safety control structure. Audit +the safety control structure for unplanned changes and migration toward states +of higher risk. + +6.•Create and monitor the organizational safety control structure. Assign respon- +sibility, authority, and accountability for safety. +7.•Establish working groups. +8.•Establish robust and reliable communication channels to ensure accurate +management risk awareness of the development system design and the state of +the operating process. +9.•Provide physical and personnel resources for safety-related activities. Ensure +that those performing safety-critical activities have the appropriate skills, +knowledge, and physical resources. +10.•Create an easy-to-use problem reporting system and then monitor it for needed +changes and improvements. +11.•Establish safety education and training for all employees and establish feed- +back channels to determine whether it is effective along with processes for +continual improvement. The education should include reminders of past +accidents and causes and input from lessons learned and trouble reports. +Assessment of effectiveness may include information obtained from knowledge +assessments during audits. +12.•Establish organizational and management structures to ensure that safety- +related technical decision making is independent from programmatic con- +siderations, including cost and schedule. +13.•Establish defined, transparent, and explicit resolution procedures for conflicts +between safety-related technical decisions and programmatic considerations. +Ensure that the conflict resolution procedures are being used and are +effective. +14.•Ensure that those who are making safety-related decisions are fully informed +and skilled. 
Establish mechanisms to allow and encourage all employees and +contractors to contribute to safety-related decision making. +15.•Establish an assessment and improvement process for safety-related decision +making. +16.•Create and update the organizational safety information system. +17.•Create and update safety management plans. +18.•Establish communication channels, resolution processes, and adjudication pro- +cedures for employees and contractors to surface complaints and concerns +about the safety of the system or parts of the safety control structure that are +not functioning appropriately. Evaluate the need for anonymity in reporting +concerns. + +Development. +1.•Implement special training for developers and development managers in safety- +guided design and other necessary skills. Update this training as events occur +and more is learned from experience. Create feedback, assessment, and improve- +ment processes for the training. +2.•Create and maintain the hazard log. +3.•Establish working groups. +4.•Design safety into the system using system hazards and safety constraints. +Iterate and refine the design and the safety constraints as the design process +proceeds. Ensure the system design includes consideration of how to reduce +human error. +5.•Document operational assumptions, safety constraints, safety-related design +features, operating assumptions, safety-related operational limitations, training +and operating instructions, audits and performance assessment requirements, +operational procedures, and safety verification and analysis results. Document +both what and why, including tracing between safety constraints and the design +features to enforce them. +6.•Perform high-quality and comprehensive hazard analyses to be available +and usable when safety-related decisions need to be made, starting with early +decision making and continuing through the system’s life. Ensure that the +hazard analysis results are communicated in a timely manner to those who need +them. Establish a communication structure that allows communication down- +ward, upward, and sideways (i.e., among those building subsystems). Ensure +that hazard analyses are updated as the design evolves and test experience is +acquired. +7.• Train engineers and managers to use the results of hazard analyses in their +decision making. +8.•Maintain and use hazard logs and hazard analyses as experience with the +system is acquired. Ensure communication of safety-related requirements and +constraints to everyone involved in development. +•Gather lessons learned in operations (including accident and incident +reports) and use them to improve the development processes. Use operating +experience to identify flaws in the development safety controls and implement +improvements. + +Operations. +1.• +Develop special training for operators and operations management to create +needed skills and update this training as events occur and more is learned from + +experience. Create feedback, assessment, and improvement processes for this +training. Train employees to perform their jobs safely, understand proper use +of safety equipment, and respond appropriately in an emergency. +2.•Establish working groups. +3.•Maintain and use hazard logs and hazard analyses during operations as experi- +ence is acquired. +4.•Ensure all emergency equipment and safety devices are operable at all times +during hazardous operations. 
Before safety-critical, nonroutine, potentially haz- +ardous operations are started, inspect all safety equipment to ensure it is opera- +tional, including the testing of alarms. +5.•Perform an in-depth investigation of any operational anomalies, including +hazardous conditions (such as water in a tank that will contain chemicals +that react to water) or events. Determine why they occurred before any +potentially dangerous operations are started or restarted. Provide the training +necessary to do this type of investigation and proper feedback channels to +management. +6.•Create management of change procedures and ensure they are being followed. +These procedures should include hazard analyses on all proposed changes and +approval of all changes related to safety-critical operations. Create and enforce +policies about disabling safety-critical equipment. +7.•Perform safety audits, performance assessments, and inspections using the +hazard analysis results as the preconditions for operations and maintenance. +Collect data to ensure safety policies and procedures are being followed and +that education and training about safety is effective. Establish feedback chan- +nels for leading indicators of increasing risk. +8.•Use the hazard analysis and documentation created during development and +passed to operations to identify leading indicators of migration toward states +of higher risk. Establish feedback channels to detect the leading indicators and +respond appropriately. +9.•Establish communication channels from operations to development to pass +back information about operational experience. +10.•Perform in-depth incident and accident investigations, including all systemic +factors. Assign responsibility for implementing all recommendations. Follow +up to determine whether recommendations were fully implemented and +effective. +11.•Perform independent checks of safety-critical activities to ensure they have +been done properly. + +12.•Prioritize maintenance for identified safety-critical items. Enforce maintenance +schedules. +13.•Create and enforce policies about disabling safety-critical equipment and +making changes to the physical system. +14.•Create and execute special procedures for the startup of operations in a pre- +viously shutdown unit or after maintenance activities. +15.•Investigate and reduce the frequency of spurious alarms. +16.•Clearly mark malfunctioning alarms and gauges. In general, establish pro- +cedures for communicating information about all current malfunctioning +equipment to operators and ensure the procedures are being followed. Elimi- +nate all barriers to reporting malfunctioning equipment. +17.•Define and communicate safe operating limits for all safety-critical equipment +and alarm procedures. Ensure that operators are aware of these limits. Assure +that operators are rewarded for following the limits and emergency procedures, +even when it turns out no emergency existed. Provide for tuning the operating +limits and alarm procedures over time as required. +18.•Ensure that spare safety-critical items are in stock or can be acquired quickly. +19.•Establish communication channels to plant management about all events and +activities that are safety-related. Ensure management has the information and +risk awareness they need to make safe decisions about operations. +20.•Ensure emergency equipment and response is available and operable to treat +injured workers. 
21.•Establish communication channels to the community to provide information
about hazards and necessary contingency actions and emergency response
requirements.

section 13.2.7. The Safety Information System.
The safety information system is a critical component in managing safety. It acts as
a source of information about the state of safety in the controlled system so that
controllers’ process models can be kept accurate and coordinated, resulting in better
decision making. Because it acts, in essence, as a shared process model or a source
for updating individual process models, accurate and timely feedback and data are
important. After studying organizations and accidents, Kjellan concluded that an
effective safety information system ranked second only to top management concern
about safety in discriminating between safe and unsafe companies matched on other
variables [101].
Setting up a long-term information system can be costly and time-consuming, but
the savings in terms of losses prevented will more than make up for the effort. As
an example, a Lessons Learned Information System was created at Boeing for
commercial jet transport structural design and analysis. The time constants are large
in this industry, but Boeing was finally able to validate the system after using it in the
design of the 757 and 767 [87]. A tenfold reduction in maintenance costs due to
corrosion and fatigue was attributed to the use of recorded lessons learned from
past designs. All the problems experienced in the introduction of new carbon-fiber
aircraft structures like the B787 show how valuable such learning from the past can
be, and what can happen when it does not exist.
Lessons learned information systems in general are often inadequate to meet
the requirements for improving safety: collected data may be improperly filtered
and thus inaccurate; methods may be lacking for the analysis and summarization
of causal data; information may not be available to decision makers in a form
that is meaningful to them; and such long-term information system efforts
may fail to survive after the original champions and initiators move on to different
projects and management does not provide the resources and leadership to
continue the efforts. Often, much information is collected about occupational
safety because it is required for government reports, but far less is collected about
engineering safety.
Setting up a safety information system for a single project or product may be
easier. The effort starts in the development process and then is passed on for use in
operations. The information accumulated during the safety-driven design process
provides the baseline for operations, as described in chapter 12. For example, the
identification of critical items in the hazard analysis can be used as input to the
maintenance process for prioritization. Another example is the use of the assumptions
underlying the hazard analysis to guide the audit and performance assessment
process. But first the information needs to be recorded and easily located and used
by operations personnel.
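As a concrete illustration of the kind of record this implies, the sketch below shows one hypothetical hazard-log entry carrying the traceability called for here and in the development responsibilities earlier in this section: from the hazard, to the safety constraints derived from it, to the design features that enforce them, and on to the operating assumptions and leading indicators that operations needs for audits and maintenance prioritization. The field names and example values (loosely based on the Shuttle foam-shedding discussion earlier in the chapter) are assumptions for illustration, not a prescribed schema.

from dataclasses import dataclass, field

@dataclass
class HazardLogEntry:
    hazard_id: str
    description: str
    safety_constraints: list    # constraints derived from the hazard
    design_features: dict       # constraint -> design feature enforcing it (the what and the why)
    operating_assumptions: list # assumptions that must remain true during operations
    leading_indicators: list    # signals that an assumption is being violated
    status: str = "open"        # open, controlled, or verified
    related_reports: list = field(default_factory=list)  # incident, audit, and anomaly report ids

entry = HazardLogEntry(
    hazard_id="H-7",
    description="Debris shed during ascent strikes the thermal protection system",
    safety_constraints=["Debris must not be shed from the external tank during ascent"],
    design_features={"Debris must not be shed from the external tank during ascent":
                     "Foam application process controls and preflight inspection"},
    operating_assumptions=["Any foam-shedding anomaly triggers a revisit of this hazard analysis"],
    leading_indicators=["Foam loss observed during ascent or on post-flight inspection"])

A record like this is what allows a proposed change or an observed anomaly to be checked quickly against the assumptions under which the original hazard analysis was performed.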
+In general, the safety information system includes +1.• A safety management plan (for both development and operations) +2.• The status of all safety-related activities +3.• The safety constraints and assumptions underlying the design, including opera- +tional limitations +4.• The results of the hazard analyses (hazard logs) and performance audits and +assessments +5.• Tracking and status information on all known hazards +6.•Incident and accident investigation reports and corrective actions taken +7.•Lessons learned and historical information +8.• Trend analysis + +One of the first components of the safety information system for a particular project +or product is a safety program plan. This plan describes the objectives of the program +and how they will be achieved. In addition to other things, the plan provides a +baseline to evaluate compliance and progress. While the organization may have a +general format and documented expectations for safety management plans, this +template may need to be tailored for specific project requirements. The plan should +include review procedures for the plan itself as well as how the plan will be updated +and improved through feedback from experience. +All of the information in the safety information system will probably not be in +one document, but there should be a central location containing pointers to where +all the information can be found. Chapter 12 contains a list of what should be in an +operations safety management plan. The overall safety management plan will +contain similar information with some additions for development. +When safety information is being shared among companies or with regulatory +agencies, there needs to be protection from disclosure and use of proprietary data +for purposes other than safety improvement. + +section 13.2.8. Continual Improvement and Learning. +Processes and structures need to be established to allow continual improvement and +learning. Experimentation is an important part of the learning process, and trying +new ideas and approaches to improving safety needs to be allowed and even +encouraged. +In addition, accidents and incidents should be treated as opportunities for learn- +ing and investigated thoroughly, as described in chapter 11. Learning will be inhib- +ited if a thorough understanding of the systemic factors involved is not sought. +Simply identifying the causal factors is not enough: recommendations to +eliminate or control these factors must be created along with concrete plans for +implementing the recommendations. Feedback loops are necessary to ensure that +the recommendations are implemented in a timely manner and that controls are +established to detect and react to reappearance of those same causal factors in +the future. + +section 13.2.9. Education, Training, and Capability Development. +If employees understand the intent of the safety program and commit to it, they are +more likely to comply with that intention rather than simply follow rules when it is +convenient to do so. +Some properties of effective training programs are presented in chapter 12. +Everyone involved in controlling a potentially dangerous process needs to have +safety training, not just the low-level controllers or operators. The training must +include not only information about the hazards and safety constraints to be + + +implemented in the control structure and the safety controls, but also about priori- +ties and how decisions about safety are to be made. 
+One interesting option is to have managers serve as teachers [46]. In this educa- +tion program design, training experts help manage group dynamics and curriculum +development, but the training itself is delivered by the project leaders. Ford Motor +Company used this approach as part of what they term their Business Leadership +Initiative and have since extended it as part of the Safety Leadership Initiative. They +found that employees pay more attention to a message delivered by their boss than +by a trainer or safety official. By learning to teach the materials, supervisors and +managers are also more likely to absorb and practice the key principles [46]. +section 13.3. Final Thoughts. +Management is key to safety. Top-level management sets the culture, creates the +safety policy, and establishes the safety control structure. Middle management +enforces safe behavior through the designed controls. +Most people want to run safe organizations, but they may misunderstand the +tradeoffs required and how to accomplish the goals. This chapter and the book as a +whole have tried to correct misperceptions and provide advice on how to create +safer products and organizations. The next chapter provides a real-life example of +a successful systems approach to safety. \ No newline at end of file diff --git a/chapter13.txt b/chapter13.txt new file mode 100644 index 0000000..56ffc41 --- /dev/null +++ b/chapter13.txt @@ -0,0 +1,995 @@ +chapter 13. +Managing Safety and the Safety Culture. +The key to effectively accomplishing any of the goals described in the previous +chapters lies in management. Simply having better tools is not enough if they are +not used. Studies have shown that management commitment to the safety goals is +the most important factor distinguishing safe from unsafe systems and companies + . Poor management decision making can undermine any attempts to improve +safety and ensure that accidents continue to occur. +This chapter outlines some of the most important management factors in reducing accidents. The first question is why managers should care about and invest in +safety. The answer, in short, is that safety pays and investment in safety provides +large returns over the long run. +If managers understand the importance of safety in achieving organizational +goals and decide they want to improve safety in their organizations, then three basic +organizational requirements are necessary to achieve that goal. The first is an effective safety control structure. Because of the importance of the safety culture in how +effectively the safety control structure operates, the second requirement is to implement and sustain a strong safety culture. But even the best of intentions will not +suffice without the appropriate information to carry them out, so the last critical +factor is the safety information system. +The previous chapters in this book focus on what needs to be done during design +and operations to control safety and enforce the safety constraints. This chapter +describes the overarching role of management in this process. +section 13.1. Why Should Managers Care about and Invest in Safety? +Most managers do care about safety. The problems usually arise because of misunderstandings about what is required to achieve high safety levels and what the +costs really are if safety is done right. Safety need not entail enormous financial or +other costs. + + + + +A classic myth is that safety conflicts with achieving other goals and that tradeoffs +are necessary to prevent losses. 
In fact, this belief is totally wrong. Safety is a prerequisite for achieving most organizational goals, including profits and continued +existence. +History is replete with examples of major accidents leading to enormous financial +losses and the demise of companies as a result. Even the largest global corporations +may not be able to withstand the costs associated with such losses, including loss of +reputation and customers. After all these examples, it is surprising that few seem to +learn from them about their own vulnerabilities. Perhaps it is in the nature of +mankind to be optimistic and to assume that disasters cannot happen to us, only +to others. In addition, in the simpler societies of the past, holding governments +and organizations responsible for safety was less common. But with loss of control +over our own environment and its hazards, and with rising wealth and living standards, the public is increasingly expecting higher standards of behavior with respect +to safety. +The “conflict” myth arises because of a misunderstanding about how safety is +achieved and the long-term consequences of operating under conditions of high risk. +Often, with the best of intentions, we simply do the wrong things in our attempts to +improve safety. It’s not a matter of lack of effort or resources applied, but how they +are used that is the problem. Investments in safety need to be funneled to the most +effective activities in achieving it. +Sometimes it appears that organizations are playing a sophisticated version of +Whack-a-Mole, where symptoms are found and fixed but not the processes that +allow these symptoms to occur. Enormous resources may be expended with little +return on the investment. So many incidents occur that they cannot all be investigated in depth, so only superficial analysis of a few is attempted. If, instead, a few +were investigated in depth and the systemic factors fixed, the number of incidents +would decrease by orders of magnitude. +Such groups find themselves in continual firefighting mode and eventually conclude that accidents are inevitable and investments to prevent them are not costeffective, thus, like Sisyphus, condemning themselves to traverse the same vicious +circle in perpetuity. Often they convince themselves that their industry is just more +hazardous than others and that accidents in their world are inevitable and are the +price of productivity. +This belief that accidents are inevitable and occur because of random chance +arises from our own inadequate efforts to prevent them. When accident causes are +examined in depth, using the systems approach in this book, it becomes clear that +there is nothing random about them. In fact, we seem to have the same accident +over and over again, with only the symptoms differing, but the causes remaining +fairly constant. Most of these causes could be eliminated, but they are not. The + +precipitating immediate factors, like a stuck valve, may have some randomness +associated with them, such as which valve actually precipitates a loss. But there is +nothing random about systemic factors that have not been corrected and exist +over long periods of time, such as flawed valve design and analysis or inadequate +maintenance practices. +As described in previous chapters, organizations tend to move inexorably toward +states of higher risk under various types of performance pressures until an accident +become inevitable. Under external or internal pressures, projects start to violate +their own rules. 
“We’ll do it just this once.it’s critical that we get this procedure +finished today.” In the Deepwater Horizon oil platform explosion of 20 10 , cost pressures led to not following standard safety procedures and, in the end, to enormous +financial losses . Similar dynamics occurred, with slightly different pressures, in +the Columbia Space Shuttle loss where the tensions among goals were created by +forces largely external to NASA. What appear to be short-term conflicts of other +organizational goals with safety goals, however, may not exist over the long term, +as witnessed in both these cases. +When operating at elevated levels of risk, the only question is which of many +potential events will trigger the loss. Before the Columbia accident, NASA manned +space operations was experiencing a slew of problems in the orbiters. The head of +the NASA Manned Space Program at the time misinterpreted the fact that they +were finding and fixing problems and wrote a report that concluded risk had been +reduced by more than a factor of five . The same unrealistic perception of risk +led to another report in 19 95 recommending that NASA “restructure and reduce +overall safety, reliability, and quality assurance elements” . +Figure 13.1 shows some of the dynamics at work. The model demonstrates the +major sources of the high risk in the Shuttle program at the time of the Columbia +loss. In order to get the funding needed to build and operate the space shuttle, +NASA had made unachievable performance promises. The need to justify expenditures and prove the value of manned space flight has been a major and consistent +tension between NASA and other governmental entities. The more missions the +Shuttle could fly, the better able the program was to generate funding. Adding to +these pressures was a commitment to get the International Space Station construction complete by February 20 04 .(called “core complete”), which required deliveries +of large items that could only be carried by the shuttle. The only way to meet the +deadline was to have no launch delays, a level of performance that had never previously been achieved . As just one indication of the pressure, computer screen +savers were mailed to managers in NASA’s human spaceflight program that depicted +a clock counting down .(in seconds). to the core complete deadline . + + +The control loop in the lower left corner of figure 13.1, labeled R1 or Pushing +the Limit, shows how as external pressures increased, performance pressure +increased, which led to increased launch rates and thus success in meeting the launch +rate expectations, which in turn led to increased expectations and increasing performance pressures. This reinforcing loop represents an unstable system and cannot +be maintained indefinitely, but NASA is a “can-do” organization that believes +anything can be accomplished with enough effort . +The upper left loop represents the Space Shuttle safety program, which when +operating effectively is meant to balance the risks associated with loop R1. The external influences of budget cuts and increasing performance pressures, however, reduced +the priority of safety procedures and led to a decrease in system safety efforts. + +Adding to the problems is the fact that system safety efforts led to launch delays +when problems were found, which created another reason for reducing the priority +of the safety efforts in the face of increasing launch pressures. 
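A toy simulation can make the behavior of these loops concrete. The sketch below is a hypothetical illustration only, not the system dynamics model shown in figure 13.1 (appendix D explains how to read those models); the variable names, update rules, and coefficients are invented simply to show the qualitative pattern in which the reinforcing loop keeps raising pressure, the safety loop erodes under that pressure and under growing complacency, and risk climbs even though no loss has yet occurred.

# Hypothetical sketch only; not the model in figure 13.1. The coefficients and
# variable names are invented to show the qualitative behavior of the loops.

def simulate(steps=20):
    expectations = 1.0   # launch-rate expectations
    pressure = 1.0       # performance pressure on the program
    safety_effort = 1.0  # priority and resources given to system safety
    risk = 0.1           # residual, largely unrecognized, risk
    for step in range(steps):
        launch_rate = pressure                          # R1: pressure drives the launch rate
        expectations += 0.10 * launch_rate              # meeting expectations raises them further
        pressure = 0.5 * pressure + 0.5 * expectations  # which raises the pressure again
        complacency = 0.05 * step                       # grows while nothing goes wrong
        # The balancing safety loop erodes under pressure and complacency.
        safety_effort = max(0.0, safety_effort - 0.03 * pressure - 0.01 * complacency)
        risk = min(1.0, risk + 0.05 * (1.0 - safety_effort))
        print(f"step {step:2d}: pressure={pressure:5.2f} "
              f"safety effort={safety_effort:4.2f} risk={risk:4.2f}")

if __name__ == "__main__":
    simulate()

However the coefficients are chosen, the qualitative result is the same: the reinforcing loop is unstable, and unless something outside it restores the safety loop, risk ratchets upward while the accident record still looks clean.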
While reductions in safety efforts and lower prioritization of safety concerns may lead to accidents, accidents usually do not occur for a while, so false confidence is created that the reductions are having no impact on safety. Pressure therefore builds to reduce the efforts and their priority even further as the external and internal performance pressures mount.
The decrease in safety efforts, combined with loop B2, in which fixing the problems that were being found increased complacency and thus further reduced system safety efforts, eventually led to a situation of unrecognized high risk.
When working at such elevated levels of risk, the only question is which of many potential events will trigger the loss. The fact that it was the foam and not one of the other serious problems identified both before and after the loss was the only random part of the accident. At the time of the Columbia accident, NASA was regularly flying the Shuttle with many uncontrolled hazards; the foam was just one of them.
Often, ironically, our successful efforts to eliminate or reduce accidents contribute to the march toward higher risk. Perception of the risk associated with an activity often decreases over a period of time when no losses occur, even though the real risk has not changed at all. This misperception leads to reducing the very factors that are preventing accidents because they are seen as no longer needed and available to trade off with other needs. The result is that risk increases until a major loss occurs. This vicious cycle needs to be broken to prevent accidents. In STAMP terms, the weakening of the safety control structure over time needs to be prevented or detected before the conditions occur that lead to a loss.
System migration toward states of higher risk is potentially controllable and detectable. The migration results from weakening of the safety control structure. To achieve lasting results, strong operational safety efforts are needed that provide protection from and appropriate responses to the continuing environmental influences and pressures that tend to degrade safety over time and that change the safety control structure and the behavior of those in it.
The experience in the nuclear submarine community is a testament to the fact that such dynamics can be overcome. The SUBSAFE program .(described in the next chapter). was established after the loss of the Thresher in 1963. Since that time, no submarine in the SUBSAFE program, that is, satisfying the SUBSAFE requirements, has been lost, although such losses were common before SUBSAFE was established.
The leaders in SUBSAFE describe other benefits beyond preventing the loss of critical assets. Because those operating the submarines have complete confidence in their ships, they can focus solely on the completion of their mission. The U.S. nuclear submarine program’s experience over the past forty-five years belies the myth that increasing safety necessarily decreases system performance. Over a sustained period, a safer operation is generally more efficient. One reason is that stoppages and delays are eliminated.
Examples can also be found in private industry. As just one example, because of a number of serious accidents, OSHA tried to prohibit the use of power presses where employees had to place one or both hands beneath the ram during the production cycle.
After vehement protests that the expense would be too great in +terms of reduced productivity, the requirement was dropped. Preliminary motion +studies showed that reduced production would result if all loading and unloading +were done with the die out from under the ram. Some time after OSHA gave up +on the idea, one manufacturer who used power presses decided, purely as a safety +and humanitarian measure, to accept the production penalty. Instead of reducing +production, however, the effect was to increase production from 5 to 15 percent, +even though the machine cycle was longer. Other examples of similar experiences +can be found in Safeware . +The belief that safer systems cost more or that building safety in from the beginning necessarily requires unacceptable compromises with other goals is simply not +justified. The costs, like anything else, depend on the methods used to achieve +increased safety. In another ironic twist, in the attempt to avoid making tradeoffs +with safety, systems are often designed to optimize mission goals and safety devices +added grudgingly when the design is complete. This approach, however, is the most +expensive and least effective that could be used. The costs are much less and in +fact can be eliminated if safety is built into the system design from the beginning +rather than added on or retrofitted later, usually in the form of redundancy +or elaborate protection systems. Eliminating or reducing hazards early in design +often results in a simpler design, which in itself may reduce both risk and costs. +The reduced risk makes it more likely that the mission or system goals will be +achieved. +Sometimes it takes a disaster to “get religion” but it should not have to. This +chapter was written for those managers who are wise enough to know that investment in safety pays dividends, even before this fact is brought home .(usually too +late). by a tragedy. + +footnote. Appendix D explains how to read system dynamics models, for those unfamiliar with them. + + +section 13.2. General Requirements for Achieving Safety Goals. +Escaping from the Whack-a-Mole trap requires identifying and eliminating the +systemic factors behind accidents. Some common reasons why safety efforts are +often not cost-effective were identified in chapter 6, including. + +1.•Superficial, isolated, or misdirected safety engineering activities, such as spending most of the effort proving the system is safe rather than making it so. +2.•Starting too late. +3.•Using techniques inappropriate for today’s complex systems and new +technology. +4.•Focusing only on the technical parts of the system, and +5.• Assuming systems are static throughout their lifetime and decreasing attention +to safety during operations + +Safety needs to be managed and appropriate controls established. The major ingredients of effective safety management include. +1.• +Commitment and leadership +2.• A corporate safety policy +3.•Risk awareness and communication channels +4.•Controls on system migration toward higher risk +5.• A strong corporate safety culture +6.• A safety control structure with appropriate assignment of responsibility, authority, and accountability +7.• A safety information system +8.•Continual improvement and learning +9.•Education, training, and capability development +Each of these is described in what follows. + +section 13.2.1. Management Commitment and Leadership. 
Top management concern about safety is the most important factor in discriminating between safe and unsafe companies matched on other variables. This commitment must be genuine, not just a matter of sloganeering. Employees need to feel they will be supported if they show concern for safety. An Air Force study of system safety concluded.
Air Force top management support of system safety has not gone unnoticed by contractors. They now seem more than willing to include system safety tasks, not as “window dressing” but as a meaningful activity.
The B-1B program is an example of how this result was achieved. In that development program, the program manager or deputy program manager chaired the meetings of the group where safety decisions were made. “An unmistaken image of the importance of system safety in the program was conveyed to the contractors.”
A manager’s open and sincere concern for safety in everyday dealings with employees and contractors can have a major impact on the reception given to safety-related activities. Studies have shown that top management’s support for and participation in safety efforts is the most effective way to control and reduce accidents. Support for safety is shown by personal involvement, by assigning capable people and giving them appropriate objectives and resources, by establishing comprehensive organizational safety control structures, and by responding to initiatives by others.
section 13.2.2. Corporate Safety Policy.
A policy is a written statement of the wisdom, intentions, philosophy, experience, and belief of an organization’s senior managers that states the goals for the organization and guides their attainment. The corporate safety policy provides employees with a clear, shared vision of the organization’s safety goals and values and a strategy to achieve them. It documents and shows managerial priorities where safety is involved.
The author has found companies that justify not having a safety policy on the grounds that “everyone knows safety is important in our business.” While safety may seem important for a particular business, management remaining mute on their policy conveys the impression that tradeoffs are acceptable when safety seems to conflict with other goals. The safety policy provides a way for management to clearly define the priority between conflicting goals that they expect to be used in decision making. The safety policy should define the relationship of safety to other organizational goals and provide the scope for discretion, initiative, and judgment in deciding what should be done in specific situations.
Safety policy should be broken into two parts. The first is a short and concise statement of the safety values of the corporation and what is expected from employees with respect to safety. Details about how the policy will be implemented should be separated into other documents.
A complete safety policy contains such things as the goals of the safety program; a set of criteria for assessing the short- and long-term success of that program with respect to the goals; the values to be used in tradeoff decisions; and a clear statement of responsibilities, authority, accountability, and scope. The policy should be explicit and state in clear and understandable language what is expected, not a set of lofty goals that cannot be operationalized. An example sometimes found .(as noted in the previous chapter). is a policy for employees to “be mindful of weak signals”.
This policy provides no useful guidance on what to do; both “mindful” and “weak signals” are undefined and undefinable. An alternative might be, “If you see something that you think is unsafe, you are responsible for reporting it immediately.” In addition, employees need to be trained on the hazards in the processes they control and what to look for.
Simply having a safety policy is not enough. Employees need to believe the safety policy reflects true commitment by management. The only way this commitment can be effectively communicated is through actions by management that demonstrate that commitment. Employees need to feel that management will support them when they make reasonable decisions in favor of safety over alternative goals. Incentives and reward structures must encourage the proper handling of tradeoffs between safety and other goals. Not only the formal rewards and rules but also the informal rules .(social processes). of the organizational culture must support the overall safety policy. A practical test is whether employees believe that company management will support them if they choose safety over the demands of production.
To encourage proper decision making, the flexibility to respond to safety problems needs to be built into the organizational procedures. Schedules, for example, should be adaptable to allow for uncertainties and possibilities of delay due to legitimate safety concerns, and production goals must be reasonable.
Finally, not only must a safety policy be defined, it must be disseminated and followed. Management needs to ensure that safety receives appropriate attention in decision making. Feedback channels must be established, and progress in achieving the goals should be monitored and improvements identified, prioritized, and implemented.
section 13.2.3. Communication and Risk Awareness.
Awareness of the risk in the controlled process is a major component of safety-related decision making by controllers. The problem is that risk, when defined as the severity of a loss event combined with its likelihood, is not calculable or knowable. It can only be estimated from a set of variables, some of which may be unknown, or the information needed to evaluate the likelihood of these variables may be lacking or incorrect. But decisions need to be made based on this unknowable property.
In the absence of accurate information about the state of the process, risk perception may be reevaluated downward as time passes without an accident. In fact, risk probably has not changed, only our perception of it. In this trap, risk is assumed to be reflected by a lack of accidents or incidents and not by the state of the safety control structure.
When STAMP is used as the foundation of the safety program, safety and risk are a function of the effectiveness of the controls to enforce safe system behavior, that is, the safety constraints and the control structure used to enforce those constraints.
Poor safety-related decision making on the part of management, for example, is commonly related to inadequate feedback and inaccurate process models. As such, risk is potentially knowable and not some amorphous property denoted by probability estimates. This new definition of risk can be used to create new risk assessment procedures.
While lack of accidents could reflect a strong safety control structure, it may also simply reflect delays between the relaxation of the controls and negative consequences.
The delays encourage relaxation of more controls, which then leads to accidents. The basic problem is inaccurate risk perception and calculating risk using the wrong factors. This process is behind the frequently used but rarely defined label of “complacency.” Complacency results from inaccurate process models and inaccurate risk awareness.
Risk perception is directly related to communication and feedback. The more and better the information we have about the potential causes of accidents in our system and the state of the controls implemented to prevent them, the more accurate will be our perception of risk. Consider the loss of an aircraft when it took off from the wrong runway in Lexington, Kentucky, in August 2006. One of the factors in the accident was that construction was occurring and the pilots were confused about temporary changes in taxi patterns. Although similar instances of crew confusion had occurred in the week before the accident, there were no effective communication channels to get this information to the proper authorities. After the loss, a small group of aircraft maintenance workers told the investigators that they also had experienced confusion when taxiing to conduct engine tests; they were worried that an accident could happen, but did not know how to effectively notify people who could make a difference.
Another communication disconnect in this accident leading to a misperception of risk involved a misunderstanding by management about the staffing of the control tower at the airport. Terminal Services management had ordered the airport air traffic control management both to reduce control tower budgets and to ensure separate staffing of the tower and radar functions. It was impossible to comply with both directives. Because of an ineffective feedback mechanism, management did not know about the impossible and dangerous goal conflicts they had created or that the resolution of the conflict was to reduce the budget and ignore the extra staffing requirements.
Another example occurred in the Deepwater Horizon accident. Reports after the accident indicated that workers felt comfortable raising safety concerns and ideas for safety improvement to managers on the rig, but they felt that they could not raise concerns at the divisional or corporate level without reprisal. In a confidential survey of workers on Deepwater Horizon taken before the oil platform exploded, workers expressed concerns about safety.
“I’m petrified of dropping anything from heights not because I’m afraid of hurting anyone (the area is barriered off), but because I’m afraid of getting fired,” one worker wrote. “The company is always using fear tactics,” another worker said. “All these games and your mind gets tired.” Investigators also said “nearly everyone among the workers they interviewed believed that Transocean’s system for tracking health and safety issues on the rig was counterproductive.” Many workers entered fake data to try to circumvent the system, known as See, Think, Act, Reinforce, Track .(or START). As a result, the company’s perception of safety on the rig was distorted, the report concluded.
Formal methods of operation and strict hierarchies can limit communication. When information is passed up hierarchies, it may be distorted, depending on the interests of managers and the way they interpret the information. Concerns about safety may even be completely silenced as they pass up the chain of command.
Employees may +not feel comfortable going around a superior who does not respond to their concerns. The result may be a misperception of risk, leading to inadequate control +actions to enforce the safety constraints. +In other accidents, reporting and feedback systems are simply unused for a +variety of reasons. In many losses, there was evidence that a problem occurred +in time to prevent the loss, but there was either no communication channel established for getting the information to those who could understand it and to those +making decisions or, alternatively, the problem-reporting channel was ineffective or +simply unused. +Communication is critical in both providing information and executing control +actions and in providing feedback to determine whether the control actions were +successful and what further actions are required. Decision makers need accurate +and timely information. Channels for information dissemination and feedback need +to be established that include a means for comparing actual performance with +desired performance and ensuring that required action is taken. +In summary, both the design of the communication channels and the communication dynamics must be considered as well as potential feedback delays. As an +example of communication dynamics, reliance on face-to-face verbal reports during +group meetings is a common method of assessing lower-level operations , but, +particularly when subordinates are communicating with superiors, there is a tendency for adverse situations to be underemphasized . + +section 13.2.4. Controls on System Migration toward Higher Risk. +One of the key assumptions underlying the approach to safety described in this +book is that systems adapt and change over time. Under various types of pressures, +that adaptation often moves in the direction of higher risk. The good news is, as +stated earlier, that adaptation is predictable and potentially controllable. The safety +control structure must provide protection from and appropriate responses to the +continuing influences and pressures that tend to degrade safety over time. More + + +specifically, the potential reasons for and types of migration toward higher risk need +to be identified and controls instituted to prevent it. In addition, audits and performance assessments based on the safety constraints identified during system development can be used to detect migration and the violation of the constraints as described +in chapter 12. +One way to prevent such migration is to anchor safety efforts beyond short-term +program management pressures. At one time, NASA had a strong agency-wide +system safety program with common standards and requirements levied on everyone. Over time, agency-wide standards were eviscerated, and programs were allowed +to set their own standards under the control of the program manager. While the +manned space program started out with strong safety standards, under budget and +performance pressures they were progressively weakened . +As one example, a basic requirement for an effective operational safety program +is that all potentially hazardous incidents during operations are thoroughly investigated. Debris shedding had been identified as a potential hazard during Shuttle +development, but the standard for performing hazard analyses in the Space Shuttle +program was changed to specify that hazards would be revisited only when there +was a new design or the Shuttle design was changed, not after an anomaly .(such as +foam shedding). occurred . 
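The use of safety constraints and hazard analysis assumptions as the baseline for operational audits can be made concrete with a small data structure. The sketch below is illustrative only, assuming a hypothetical hazard-log entry; the class names, fields, and example constraint are invented for this illustration and are not taken from any actual program.

# A minimal sketch of tracking safety constraints and their supporting assumptions
# so that audits can detect migration toward higher risk. The data structures and
# the example entry are illustrative assumptions only.

from dataclasses import dataclass, field

@dataclass
class Assumption:
    text: str            # assumption made during the original hazard analysis
    audit_check: str     # what an audit must confirm for the assumption to still hold
    holds: bool = True   # updated from audit and performance-assessment results

@dataclass
class SafetyConstraint:
    identifier: str
    constraint: str
    assumptions: list = field(default_factory=list)

    def violated_assumptions(self):
        return [a for a in self.assumptions if not a.holds]

hazard_log = [
    SafetyConstraint(
        "SC-1",
        "Potentially hazardous anomalies must be analyzed before the next operation is approved.",
        [Assumption(
            "All anomalies are reported through the problem-reporting channel.",
            "Sample recent anomalies and confirm that each appears in the reporting system.")],
    ),
]

# An audit or performance assessment feeds its findings back into the log;
# a violated assumption is a leading indicator of migration toward higher risk.
hazard_log[0].assumptions[0].holds = False

for constraint in hazard_log:
    for assumption in constraint.violated_assumptions():
        print(f"Leading indicator for {constraint.identifier}: assumption no longer holds - {assumption.text}")

The point of the sketch is simply that when an audit finds a documented assumption no longer holds, that finding is itself a leading indicator of migration toward higher risk, whether or not an incident has yet occurred.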
After the Columbia accident, safety standards in the Space Shuttle program .(and the rest of NASA). were effectively anchored and protected from dilution over time by moving responsibility for them outside the projects.
section 13.2.5. Safety, Culture, and Blame.
The high-level goal in managing safety is to create and maintain an effective safety control structure. Because of the importance of safety culture in how the control structure operates, achieving this goal requires implementing and sustaining a strong safety culture.
Proper function of the safety control structure relies on decision making by the controllers in the structure. Decision making always rests upon a set of industry or organizational values and assumptions. A culture is a set of shared values and norms, a way of looking at and interpreting the world and events around us and of taking action in a social context. Safety culture is that subset of culture that reflects the general attitude and approaches to safety and risk management.
Schein divides culture into three levels .(figure 13.2). At the top are the surface-level cultural artifacts or routine aspects of everyday practice, including hazard analyses and control algorithms and procedures. The second, middle level is the stated organizational rules, values, and practices that are used to create the top-level artifacts, such as safety policy, standards, and guidelines. At the lowest level are the often invisible but pervasive underlying deep cultural operating assumptions upon which actions are taken and decisions are made and thus upon which the upper levels rest.
Trying to change safety outcomes by simply changing the organizational structures, including policies, goals, missions, job descriptions, and standard operating procedures, may lower risk over the short term, but superficial fixes that do not address the set of shared values and social norms are very likely to be undone over time. Changes are required in the organizational values that underlie people’s behavior.
Safety culture is primarily set by the leaders of the organization as they establish the basic values under which decisions will be made. This fact explains why leadership and commitment by leaders are critical in achieving high levels of safety.
To engineer a safety culture requires identifying the desired organizational safety principles and values and then establishing a safety control structure to achieve those values and to sustain them over time. Sloganeering or jawboning is not enough. All aspects of the safety control structure must be engineered to be in alignment with the organizational safety principles, and the leaders must be committed to the stated policies and principles related to safety in the organization.
Along with leadership and commitment to safety as a basic value of the organization, achieving safety goals requires open communication. In an interview after the Columbia loss, the new center director at Kennedy Space Center suggested that the most important cultural issue the Shuttle program faced was establishing a feeling of openness and honesty with all employees, where everybody’s voice was valued. Statements during the Columbia accident investigation and messages posted to the NASA Watch website describe a lack of trust of NASA employees to speak up. At the same time, a critical observation in the CAIB report focused on the engineers’ claims that the managers did not hear the engineers’ concerns.
The report concluded that this was in part due to the managers not asking or listening. Managers created barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience rather than on solid data. Much of the time they listened to those who told them what they wanted to hear. One indication of the poor communication around safety and the atmosphere at the time was the 1995 Kraft report, which dismissed concerns about Space Shuttle safety by accusing those who made them of being partners in an unneeded “safety shield conspiracy.”
Unhealthy work atmospheres with respect to safety and communication are not limited to NASA. Carroll documents a similarly dysfunctional safety culture at the Millstone nuclear power plant. An NRC review in 1996 concluded that the safety culture at the plant was dangerously flawed. It did not tolerate dissenting views and stifled questioning attitudes among employees.
Changing such interaction patterns is not easy. Management style can be addressed through training, mentoring, and proper selection of people to fill management positions, but trust is hard to gain and easy to lose. Employees need to feel psychologically safe about reporting concerns and to believe that managers can be trusted to hear their concerns and to take appropriate action, while managers have to believe that employees are worth listening to and worthy of respect.
The difficulty is in getting people to change their view of reality. Gareth Morgan, a social anthropologist, defines culture as an ongoing, proactive process of reality construction. According to this view, organizations are socially constructed realities that rest as much in the heads and minds of their members as they do in concrete sets of rules and regulations. Morgan asserts that organizations are “sustained by belief systems that emphasize the importance of rationality.” This myth of rationality “helps us to see certain patterns of action as legitimate, credible, and normal, and hence to avoid the wrangling and debate that would arise if we were to recognize the basic uncertainty and ambiguity underlying many of our values and actions.”
For both the Challenger and Columbia accidents, as well as most other major accidents where decision making was flawed, the decision makers saw their actions as rational. Understanding and preventing poor decision making under conditions of uncertainty requires providing environments and tools that help to stretch our belief systems and to see patterns that we do not necessarily want to see.
Some types of dysfunctional safety culture are common across industries and organizations. Hopkins coined the term “culture of denial” after investigating accidents in the mining industry, but mining is not the only industry in which denial is pervasive. In such cultures, risk assessment is unrealistic and credible warnings are dismissed without appropriate action. Management only wants to hear good news and may ensure that is what they hear by punishing bad news, sometimes in a subtle way and other times not so subtly. Often arguments are made in these industries that the conditions are inherently more dangerous than in others and therefore little can be done about improving safety, or that accidents are the price of productivity and cannot be eliminated. Of course, this rationale is untrue, but it is convenient.
A second type of dysfunctional safety culture might be termed a “paperwork culture.” In these organizations, employees spend all their time proving the system is safe but little time actually doing the things necessary to make it so. After the Nimrod aircraft loss in Afghanistan in 2006, the accident report noted a “culture of paper safety” at the expense of real safety.
So what are the aspects of a good safety culture, that is, the core values and norms that allow us to make better decisions around safety?
1.•Safety commitment is valued.
2.•Safety information is surfaced without fear, and incident analysis is conducted without blame.
3.•Incidents and accidents are valued as an important window into systems that are not functioning as they should, triggering in-depth and uncircumscribed causal analysis and improvement actions.
4.•There is a feeling of openness and honesty, where everyone’s voice is respected. Employees feel that managers are listening.
5.•There is trust among all parties.
6.•Employees feel psychologically safe about reporting concerns.
7.•Employees believe that managers can be trusted to hear their concerns and will take appropriate action.
8.•Managers believe that employees are worth listening to and are worthy of respect.
Common ingredients of a safety culture based on these values include management commitment to safety and the safety values, management involvement in achieving the safety goals, employee empowerment, and appropriate and effective incentive structures and reporting systems.
When these ingredients form the basis of the safety culture, the organization has the following characteristics.
1.•Safety is integrated into the dominant culture; it is not a separate subculture.
2.•Safety is integrated into both development and operations. Safety activities employ a mixture of top-down engineering or reengineering and bottom-up process improvement.
3.•Individuals have the required knowledge, skills, and abilities.
4.•Early warning systems for migration toward states of high risk are established and effective.
5.•The organization has a clearly articulated safety vision, values, and procedures, shared among the stakeholders.
6.•Tensions between safety priorities and other system priorities are addressed through a constructive, negotiated process.
7.•Key stakeholders .(including all employees and groups such as unions). have full partnership roles and responsibilities regarding safety.
8.•Passionate, effective leadership exists at all levels of the organization .(particularly the top), and all parts of the safety control structure are committed to safety as a high priority for the organization.
9.•Effective communication channels exist for disseminating safety information.
10.•High levels of visibility of the state of safety .(i.e., risk awareness). exist at all levels of the safety control structure through appropriate and effective feedback.
11.•The results of operating experience, process hazard analyses, audits, near misses, or accident investigations are used to improve operations and the safety control structure.
12.•Deficiencies found during assessments, audits, inspections, and incident investigations are addressed promptly and tracked to completion.
The Just Culture Movement.
The Just Culture movement is an attempt to avoid the type of unsafe cultural values and professional interactions that have been implicated in so many accidents.
Its origins are in aviation, although some in the medical community, particularly hospitals, have also taken steps down this road. Much has been written on Just Culture; only a summary is provided here. The reader is directed in particular to Dekker’s book Just Culture, which is the source of much of what follows in this section.
A foundational principle of Just Culture is that the difference between a safe and unsafe organization is how it deals with reported incidents. This principle stems from the belief that an organization can benefit more by learning from mistakes than by punishing people who make them.
In an organization that promotes such a Just Culture.
1.•Reporting errors and suggesting changes is normal, expected, and without jeopardy for anyone involved.
2.•A mistake or incident is not seen as a failure but as a free lesson, an opportunity to focus attention and to learn.
3.•Rather than making people afraid, the system makes people participants in change and improvement.
4.•Information provided in good faith is not used against those who report it.
Most people have a genuine concern for the safety and quality of their work. If through reporting problems they contribute to visible improvements, few other motivations or exhortations to report are necessary. In general, empowering people to affect their work conditions and making the reporters of safety problems part of the change process promotes their willingness to shoulder their responsibilities and to share information about safety problems.
Beyond the obvious safety implications, a Just Culture may improve morale, commitment to the organization, job satisfaction, and people’s willingness to do extra work and to step outside their roles. It encourages people to participate in improvement efforts and gets them actively involved in creating a safer system and workplace.
There are several reasons why people may not report safety problems, which were covered in chapter 12. To summarize, the reporting channels may be difficult or time consuming to use; people may feel there is no point in reporting because the organization will not do anything anyway; or they may fear negative consequences of reporting. Each of these reasons must be and can be mitigated through better system design. Reporting should be easy and not require excessive time or effort that takes away from direct job responsibilities. A response must be made to the initial report indicating that it was received and read, and later information should be provided about the resolution of the reported problem.
Promoting a Just Culture requires getting away from blame and punishment as a solution to safety problems. One of the new assumptions in chapter 2 for an accident model and underlying STAMP was.
Blame is the enemy of safety. Focus should instead be on understanding how the entire system behavior led to the loss and not on who or what to blame.
Blame and punishment discourage the reporting of the problems and mistakes that would allow improvements to be made to the system. As has been argued throughout this book, changing the system is the best way to achieve safety, not trying to change people.
When blame is a primary component of the safety culture, people stop reporting incidents. This basic understanding underlies the Aviation Safety Reporting System (ASRS), where pilots and others are given protection from punishment if they report mistakes .(see chapter 12).
A decision was made in establishing the ASRS and other aviation reporting systems that organizational and industry learning from mistakes was more important than punishing people for them. If most errors stem from the design of the system or can be prevented by changing the design of the system, then blaming the person who made the mistake is misplaced anyway.
A culture of blame creates a climate of fear that makes people reluctant to share information. It also hampers the potential to learn from incidents; people may even tamper with safety recording devices, turning them off, for example. A culture of blame interferes with regulatory work and the investigation of accidents because people and organizations are less willing to cooperate. The role of lawyers can impede safety efforts and actually make accidents more likely. Organizations may focus on creating paper trails instead of utilizing good safety engineering practices. Some companies avoid standard safety practices on the advice of their lawyers that this will protect them in legal proceedings, thus almost guaranteeing that accidents and legal proceedings will occur.
Blame and the overuse of punishment as a way to change behavior can directly lead to accidents that might not have otherwise occurred. As an example, a train accident in Japan, the 2005 Fukuchiyama line derailment, occurred when a train driver was on the phone trying to ensure that he would not be reported for a minor infraction. Because of this distraction, he did not slow down for a curve, resulting in the deaths of 106 passengers and the train driver along with injuries to 562 passengers. Blame and punishment for mistakes cause stress and isolation and make people perform less well.
The alternative is to see mistakes as an indication of an organizational, operational, educational, or political problem. The question then becomes what should be done about the problem and who should bear responsibility for implementing the changes. The mistake and any harm from it should be acknowledged, but the response should be to lay out the opportunities for reducing such mistakes by everyone .(not just this particular person), and the responsibilities for making changes so that the probability of it happening again is reduced. This approach allows people and organizations to move forward to prevent mistakes in the future and not just focus on punishing past behavior. Punishment is usually not a long-term deterrent for mistakes if the system in which the person operates has not changed the reason for the mistake. Just Culture principles allow us to learn from minor incidents instead of waiting until tragedies occur.
A common misunderstanding is that a Just Culture means a lack of accountability. But, in reality, it is just the opposite. Accountability is increased in a Just Culture by not simply assigning responsibility and accountability to the person at the bottom of the safety control structure who took the direct action involved in the mistake. All components of the safety control structure involved are held accountable, including .(1). those in operations who contribute to mistakes by creating operational pressures and providing inadequate oversight to ensure safe procedures are being followed, and .(2). those in development who create a system design that contributes to mistakes.
The difference in a Just Culture is not in the accountability for safety problems but in how accountability is implemented.
Punishment is an appropriate response to gross negligence and disregard for other people’s safety, which, of course, applies to everyone in the safety control structure, including higher-level management and developers as well as the lower-level controllers. But if mistakes were made or inadequate controls over safety were provided because of flaws in the design of the controlled system or the safety control structure, then punishment is not the appropriate response; fixing the system or the safety control structure is. Dekker has suggested that accountability be defined in terms of responsibility for finding solutions to the system design problems from which the mistakes arose.
Overcoming our cultural bias to punish people for their mistakes and the common belief that punishment is the only way to change behavior can be very difficult. But the payoff is enormous if we want to significantly reduce accident rates. Trust is a critical requirement for encouraging people to share their mistakes and safety problems with others so something can be done before major losses occur.
section 13.2.6. Creating an Effective Safety Control Structure.
In some industries, the safety control structure is called the safety management system .(SMS). In civil aviation, ICAO .(the International Civil Aviation Organization). has created standards and recommended practices for safety management systems, and individual countries have strongly recommended or required certified air carriers to establish such systems in order to control organizational factors that contribute to accidents.
There is no right or wrong design of a safety control structure or SMS. Most of the principles for design of safe control loops in chapter 9 also apply here. The culture of the industry and the organization will play a role in what is practical and effective. There are some general rules of thumb, however, that have been found to be important in practice.
General Safety Control Structure Design Principles.
Making everyone responsible for safety is a well-meaning misunderstanding of what is required. While, of course, everyone should try to behave safely and to achieve safety goals, someone has to be assigned responsibility for ensuring that the goals are being achieved. This lesson was learned long ago in the U.S. Intercontinental Ballistic Missile System .(ICBM). Because safety was such an important consideration in building the early 1950s missile systems, safety was not assigned as a specific responsibility, but was instead considered to be everyone’s responsibility. The large number of resulting incidents, particularly those involving the interfaces between subsystems, led to the understanding that safety requires leadership and focus.
Responsibility must be assigned for ensuring that hazardous behaviors are eliminated or, if that is not possible, mitigated in design and operations. Almost all attention during development is focused on what the system and its components are supposed to do. System safety engineering is responsible for ensuring that adequate attention is also paid to what the system is not supposed to do and for verifying that hazardous behavior will not occur. It is this unique focus that has made the difference in systems where safety engineering successfully identified problems that were not found by the other engineering processes.
At the other extreme, safety efforts may be assigned to a separate group that is isolated from critical decision making.
During system development, responsibility for safety may be concentrated in a separate quality assurance group rather than in the system engineering organization. During operations, safety may be the responsibility of a staff position with little real power or impact on line operations.
The danger inherent in this isolation of the safety efforts is argued repeatedly throughout this book. To be effective, the safety efforts must have impact, and they must be integrated into mainstream system engineering and operations.
The quality assurance organization is the worst place to put safety. For one thing, it sets up the expectation that safety is an after-the-fact or auditing activity only. Safety must be intimately integrated into design and decision-making activities. Safety permeates every part of development and operations. While there may be staff positions performing safety functions that affect everyone at their level of the organization and below, safety must be integrated into all of engineering development and line operations. Important safety functions will be performed by almost everyone, but someone needs the responsibility to ensure that they are being carried out effectively.
At the same time, independence is also important. The CAIB report addresses this issue.
Organizations that successfully operate high-risk technologies have a major characteristic in common. They place a premium on safety and reliability by structuring their programs so that technical and safety engineering organizations own the process of determining, maintaining, and waiving technical requirements with a voice that is equal to yet independent of Program Managers, who are governed by cost, schedule, and mission-accomplishment goals.
Besides associating safety with after-the-fact assurance and isolating it from system engineering, placing it in an assurance group can have a negative impact on its stature, and thus its influence. Assurance groups often do not have the prestige necessary to have the influence on decision making that safety requires. A case can be made that the centralization of system safety in quality assurance at NASA, matrixed to other parts of the organization, was a major factor in the decline of the safety culture preceding the Columbia loss. Safety was neither fully independent nor sufficiently influential to prevent the loss events.
Safety responsibilities should be assigned at every level of the organization, although they will differ from level to level. At the corporate level, system safety responsibilities may include defining and enforcing corporate safety policy, and establishing and monitoring the safety control structure. In some organizations that build extremely hazardous systems, a group at the corporate or headquarters level certifies these systems as safe for use. For example, the U.S. Navy has a Weapons Systems Explosives Safety Review Board that assures the incorporation of explosive safety criteria in all weapon systems through reviews conducted throughout all the system’s life cycle phases. For some companies, it may be reasonable to have such a review process at more than just the highest level.
Communication is important because safety-motivated changes in one subsystem may affect other subsystems and the system as a whole. In military procurement groups, oversight and communication are enhanced through the use of safety working groups. In establishing any oversight process, two extremes must be avoided.
One extreme is “getting into bed” with the project and losing objectivity; the other is backing off too far and losing insight. Working groups are an effective way of avoiding these extremes. They assure comprehensive and unified planning and action while allowing for independent review and reporting channels.
Working groups usually operate at different levels of the organization. As an example, the Navy Aegis system development, a very large and complex system, included a System Safety Working Group at the top level chaired by the Navy Principal for Safety, with the permanent members being the prime contractor’s system safety lead and representatives from various Navy offices. Contractor representatives attended meetings as required. Members of the group were responsible for coordinating safety efforts within their respective organizations, for reporting the status of outstanding safety issues to the group, and for providing information to the Navy Weapons Systems Explosives Safety Review Board. Working groups also functioned at lower levels, providing the necessary coordination and communication for that level and to the levels above and below.
A surprisingly large percentage of the reports on recent aerospace accidents have implicated an improper transition from an oversight to an insight process. This transition implies the use of different levels of feedback control and a change from prescriptive management control to management by objectives, where the objectives are interpreted and satisfied according to the local context. For these accidents, the change in management role from oversight to insight seems to have been implemented simply as a reduction in personnel and budgets without assuring that anyone was responsible for specific critical tasks.
footnote. The Aegis Combat System is an advanced command and control and weapon control system that uses powerful computers and radars to track and guide weapons to destroy enemy targets.
Assigning Responsibilities.
An important question is what responsibilities should be assigned to the control structure components. The list below is derived from the author’s experience on a large number and variety of projects. Many also appear in accident report recommendations, particularly those generated using CAST.
The list is meant only to be a starting point for those establishing a comprehensive safety control structure and a checklist for those who already have sophisticated safety management systems. It should be supplemented using other sources and experiences.
The list does not imply that each responsibility will be assigned to a single person or group. The responsibilities will probably need to be separated into multiple individual responsibilities and assigned throughout the safety control structure, with one group actually implementing the responsibilities and others above them supervising, leading .(directing), or overseeing the activity. Of course, each responsibility assumes the need for associated authority and accountability plus the controls, feedback, and communication channels necessary to implement the responsibility. The list may also be useful in accident and incident analysis to identify inadequate controls and control structures.
Management and General Responsibilities.
1.•Provide leadership, oversight, and management of safety at all levels of the organization.
2.•Create a corporate or organizational safety policy.
Establish criteria for evaluating safety-critical decisions and implementing safety controls. Establish distribution channels for the policy. Establish feedback channels to determine +whether employees understand it, are following it, and whether it is effective. +Update the policy as needed. +3.•Establish corporate or organizational safety standards and then implement, +update, and enforce them. Set minimum requirements for safety engineering +in development and operations and oversee the implementation of those +requirements. Set minimum physical and operational standards for hazardous +operations. +4.•Establish incident and accident investigation standards and ensure recommendations are implemented and effective. Use feedback to improve the standards. +5.•Establish management of change requirements for evaluating all changes for +their impact on safety, including changes in the safety control structure. Audit +the safety control structure for unplanned changes and migration toward states +of higher risk. + +6.•Create and monitor the organizational safety control structure. Assign responsibility, authority, and accountability for safety. +7.•Establish working groups. +8.•Establish robust and reliable communication channels to ensure accurate +management risk awareness of the development system design and the state of +the operating process. +9.•Provide physical and personnel resources for safety-related activities. Ensure +that those performing safety-critical activities have the appropriate skills, +knowledge, and physical resources. +10.•Create an easy-to-use problem reporting system and then monitor it for needed +changes and improvements. +11.•Establish safety education and training for all employees and establish feedback channels to determine whether it is effective along with processes for +continual improvement. The education should include reminders of past +accidents and causes and input from lessons learned and trouble reports. +Assessment of effectiveness may include information obtained from knowledge +assessments during audits. +12.•Establish organizational and management structures to ensure that safetyrelated technical decision making is independent from programmatic considerations, including cost and schedule. +13.•Establish defined, transparent, and explicit resolution procedures for conflicts +between safety-related technical decisions and programmatic considerations. +Ensure that the conflict resolution procedures are being used and are +effective. +14.•Ensure that those who are making safety-related decisions are fully informed +and skilled. Establish mechanisms to allow and encourage all employees and +contractors to contribute to safety-related decision making. +15.•Establish an assessment and improvement process for safety-related decision +making. +16.•Create and update the organizational safety information system. +17.•Create and update safety management plans. +18.•Establish communication channels, resolution processes, and adjudication procedures for employees and contractors to surface complaints and concerns +about the safety of the system or parts of the safety control structure that are +not functioning appropriately. Evaluate the need for anonymity in reporting +concerns. + +Development. +1.•Implement special training for developers and development managers in safetyguided design and other necessary skills. Update this training as events occur +and more is learned from experience. Create feedback, assessment, and improvement processes for the training. 
+2.•Create and maintain the hazard log. +3.•Establish working groups. +4.•Design safety into the system using system hazards and safety constraints. +Iterate and refine the design and the safety constraints as the design process +proceeds. Ensure the system design includes consideration of how to reduce +human error. +5.•Document operational assumptions, safety constraints, safety-related design +features, operating assumptions, safety-related operational limitations, training +and operating instructions, audits and performance assessment requirements, +operational procedures, and safety verification and analysis results. Document +both what and why, including tracing between safety constraints and the design +features to enforce them. +6.•Perform high-quality and comprehensive hazard analyses to be available +and usable when safety-related decisions need to be made, starting with early +decision making and continuing through the system’s life. Ensure that the +hazard analysis results are communicated in a timely manner to those who need +them. Establish a communication structure that allows communication downward, upward, and sideways .(i.e., among those building subsystems). Ensure +that hazard analyses are updated as the design evolves and test experience is +acquired. +7.• Train engineers and managers to use the results of hazard analyses in their +decision making. +8.•Maintain and use hazard logs and hazard analyses as experience with the +system is acquired. Ensure communication of safety-related requirements and +constraints to everyone involved in development. +•Gather lessons learned in operations .(including accident and incident +reports). and use them to improve the development processes. Use operating +experience to identify flaws in the development safety controls and implement +improvements. + +Operations. +1.• +Develop special training for operators and operations management to create +needed skills and update this training as events occur and more is learned from + +experience. Create feedback, assessment, and improvement processes for this +training. Train employees to perform their jobs safely, understand proper use +of safety equipment, and respond appropriately in an emergency. +2.•Establish working groups. +3.•Maintain and use hazard logs and hazard analyses during operations as experience is acquired. +4.•Ensure all emergency equipment and safety devices are operable at all times +during hazardous operations. Before safety-critical, nonroutine, potentially hazardous operations are started, inspect all safety equipment to ensure it is operational, including the testing of alarms. +5.•Perform an in-depth investigation of any operational anomalies, including +hazardous conditions .(such as water in a tank that will contain chemicals +that react to water). or events. Determine why they occurred before any +potentially dangerous operations are started or restarted. Provide the training +necessary to do this type of investigation and proper feedback channels to +management. +6.•Create management of change procedures and ensure they are being followed. +These procedures should include hazard analyses on all proposed changes and +approval of all changes related to safety-critical operations. Create and enforce +policies about disabling safety-critical equipment. +7.•Perform safety audits, performance assessments, and inspections using the +hazard analysis results as the preconditions for operations and maintenance. 
+Collect data to ensure safety policies and procedures are being followed and +that education and training about safety is effective. Establish feedback channels for leading indicators of increasing risk. +8.•Use the hazard analysis and documentation created during development and +passed to operations to identify leading indicators of migration toward states +of higher risk. Establish feedback channels to detect the leading indicators and +respond appropriately. +9.•Establish communication channels from operations to development to pass +back information about operational experience. +10.•Perform in-depth incident and accident investigations, including all systemic +factors. Assign responsibility for implementing all recommendations. Follow +up to determine whether recommendations were fully implemented and +effective. +11.•Perform independent checks of safety-critical activities to ensure they have +been done properly. + +12.•Prioritize maintenance for identified safety-critical items. Enforce maintenance +schedules. +13.•Create and enforce policies about disabling safety-critical equipment and +making changes to the physical system. +14.•Create and execute special procedures for the startup of operations in a previously shutdown unit or after maintenance activities. +15.•Investigate and reduce the frequency of spurious alarms. +16.•Clearly mark malfunctioning alarms and gauges. In general, establish procedures for communicating information about all current malfunctioning +equipment to operators and ensure the procedures are being followed. Eliminate all barriers to reporting malfunctioning equipment. +17.•Define and communicate safe operating limits for all safety-critical equipment +and alarm procedures. Ensure that operators are aware of these limits. Assure +that operators are rewarded for following the limits and emergency procedures, +even when it turns out no emergency existed. Provide for tuning the operating +limits and alarm procedures over time as required. +18.•Ensure that spare safety-critical items are in stock or can be acquired quickly. +19.•Establish communication channels to plant management about all events and +activities that are safety-related. Ensure management has the information and +risk awareness they need to make safe decisions about operations. +20.•Ensure emergency equipment and response is available and operable to treat +injured workers. +21.•Establish communication channels to the community to provide information +about hazards and necessary contingency actions and emergency response +requirements. + +section 13.2.7. The Safety Information System. +The safety information system is a critical component in managing safety. It acts as +a source of information about the state of safety in the controlled system so that +controllers’ process models can be kept accurate and coordinated, resulting in better +decision making. Because it in essence acts as a shared process model or a source +for updating individual process models, accurate and timely feedback and data are +important. After studying organizations and accidents, Kjellan concluded that an +effective safety information system ranked second only to top management concern +about safety in discriminating between safe and unsafe companies matched on other +variables . +Setting up a long-term information system can be costly and time consuming, but +the savings in terms of losses prevented will more than make up for the effort. 
As + + +an example, a Lessons Learned Information System was created at Boeing for commercial jet transport structural design and analysis. The time constants are large in +this industry, but they finally were able to validate the system after using it in the +design of the 757 and 767 . A tenfold reduction in maintenance costs due to +corrosion and fatigue were attributed to the use of recorded lessons learned from +past designs. All the problems experienced in the introduction of new carbon-fiber +aircraft structures like the B787 show how valuable such learning from the past can +be and the problems that result when it does not exist. +Lessons learned information systems in general are often inadequate to meet +the requirements for improving safety. collected data may be improperly filtered +and thus inaccurate, methods may be lacking for the analysis and summarization +of causal data, information may not be available to decision makers in a form +that is meaningful to them, and such long-term information system efforts +may fail to survive after the original champions and initiators move on to different +projects and management does not provide the resources and leadership to +continue the efforts. Often, lots of information is collected about occupational +safety because it is required for government reports but less for engineering +safety. +Setting up a safety information system for a single project or product may be +easier. The effort starts in the development process and then is passed on for use in +operations. The information accumulated during the safety-driven design process +provides the baseline for operations, as described in chapter 12. For example, the +identification of critical items in the hazard analysis can be used as input to the +maintenance process for prioritization. Another example is the use of the assumptions underlying the hazard analysis to guide the audit and performance assessment +process. But first the information needs to be recorded and easily located and used +by operations personnel. +In general, the safety information system includes +1.• A safety management plan .(for both development and operations) +2.• The status of all safety-related activities +3.• The safety constraints and assumptions underlying the design, including operational limitations +4.• The results of the hazard analyses .(hazard logs). and performance audits and +assessments +5.• Tracking and status information on all known hazards +6.•Incident and accident investigation reports and corrective actions taken +7.•Lessons learned and historical information +8.• Trend analysis + +One of the first components of the safety information system for a particular project +or product is a safety program plan. This plan describes the objectives of the program +and how they will be achieved. In addition to other things, the plan provides a +baseline to evaluate compliance and progress. While the organization may have a +general format and documented expectations for safety management plans, this +template may need to be tailored for specific project requirements. The plan should +include review procedures for the plan itself as well as how the plan will be updated +and improved through feedback from experience. +All of the information in the safety information system will probably not be in +one document, but there should be a central location containing pointers to where +all the information can be found. Chapter 12 contains a list of what should be in an +operations safety management plan. 
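+To make the idea of a central location with pointers concrete, a minimal sketch follows.
+It is purely illustrative and not part of the text above: the Python record layout, field
+names, and example values are assumptions chosen for the example, not a prescribed format.
+
+# Illustrative sketch only: a hypothetical central index for a project safety
+# information system; every name and field here is an assumption, not a standard.
+from dataclasses import dataclass, field
+from typing import Dict, List
+
+@dataclass
+class HazardEntry:
+    hazard_id: str                     # e.g., "H-1"
+    description: str
+    safety_constraints: List[str]      # constraints derived from the hazard
+    design_features: List[str]         # design features that enforce them
+    assumptions: List[str]             # assumptions underlying the hazard analysis
+    status: str = "open"               # open, controlled, or closed
+    documents: Dict[str, str] = field(default_factory=dict)  # name -> location pointer
+
+@dataclass
+class SafetyInformationSystem:
+    # Central location containing pointers to where the full information can be found.
+    hazard_log: Dict[str, HazardEntry] = field(default_factory=dict)
+
+    def add(self, entry: HazardEntry) -> None:
+        self.hazard_log[entry.hazard_id] = entry
+
+    def open_hazards(self) -> List[HazardEntry]:
+        return [h for h in self.hazard_log.values() if h.status != "closed"]
+
+sis = SafetyInformationSystem()
+sis.add(HazardEntry(
+    hazard_id="H-1",
+    description="Example hazard (illustrative only)",
+    safety_constraints=["Example safety constraint"],
+    design_features=["Design feature that enforces the constraint"],
+    assumptions=["Operational assumption to be checked by audits"],
+    documents={"hazard analysis": "pointer to hazard analysis results",
+               "audit results": "pointer to performance audits"},
+))
+print(len(sis.open_hazards()))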
The overall safety management plan will +contain similar information with some additions for development. +When safety information is being shared among companies or with regulatory +agencies, there needs to be protection from disclosure and use of proprietary data +for purposes other than safety improvement. + +section 13.2.8. Continual Improvement and Learning. +Processes and structures need to be established to allow continual improvement and +learning. Experimentation is an important part of the learning process, and trying +new ideas and approaches to improving safety needs to be allowed and even +encouraged. +In addition, accidents and incidents should be treated as opportunities for learning and investigated thoroughly, as described in chapter 11. Learning will be inhibited if a thorough understanding of the systemic factors involved is not sought. +Simply identifying the causal factors is not enough. recommendations to +eliminate or control these factors must be created along with concrete plans for +implementing the recommendations. Feedback loops are necessary to ensure that +the recommendations are implemented in a timely manner and that controls are +established to detect and react to reappearance of those same causal factors in +the future. + +section 13.2.9. Education, Training, and Capability Development. +If employees understand the intent of the safety program and commit to it, they are +more likely to comply with that intention rather than simply follow rules when it is +convenient to do so. +Some properties of effective training programs are presented in chapter 12. +Everyone involved in controlling a potentially dangerous process needs to have +safety training, not just the low-level controllers or operators. The training must +include not only information about the hazards and safety constraints to be + + +implemented in the control structure and the safety controls, but also about priorities and how decisions about safety are to be made. +One interesting option is to have managers serve as teachers . In this education program design, training experts help manage group dynamics and curriculum +development, but the training itself is delivered by the project leaders. Ford Motor +Company used this approach as part of what they term their Business Leadership +Initiative and have since extended it as part of the Safety Leadership Initiative. They +found that employees pay more attention to a message delivered by their boss than +by a trainer or safety official. By learning to teach the materials, supervisors and +managers are also more likely to absorb and practice the key principles . +section 13.3. Final Thoughts. +Management is key to safety. Top-level management sets the culture, creates the +safety policy, and establishes the safety control structure. Middle management +enforces safe behavior through the designed controls. +Most people want to run safe organizations, but they may misunderstand the +tradeoffs required and how to accomplish the goals. This chapter and the book as a +whole have tried to correct misperceptions and provide advice on how to create +safer products and organizations. The next chapter provides a real-life example of +a successful systems approach to safety. \ No newline at end of file diff --git a/chapter14.raw b/chapter14.raw new file mode 100644 index 0000000..ba3bcbd --- /dev/null +++ b/chapter14.raw @@ -0,0 +1,550 @@ +chapter 14. +SUBSAFE: An Example of a Successful Safety +Program. 
+This book is filled with examples of accidents and of what not to do. One possible +conclusion might be that despite our best efforts accidents are inevitable in +complex systems. That conclusion would be wrong. Many industries and companies +are able to avoid accidents: the nuclear Navy SUBSAFE program is a shining +example. By any measure, SUBSAFE has been remarkably successful: In nearly +fifty years since the beginning of SUBSAFE, no submarine in the program has +been lost. +Looking at a successful safety program and trying to understand why it has been +successful can be very instructive. This chapter looks at the history of the program +and what it is, and proposes some explanations for its great success. SUBSAFE also +provides a good example of most of the principles expounded in this book. +Although SUBSAFE exists in a government and military environment, most of +the important components could be translated into the commercial, profit-making +world. Also note that the success is not related to small size—there are 40,000 +people involved in the U.S. submarine safety program, a large percentage of whom +are private contractors and not government employees. Both private and public +shipyards are involved. SUBSAFE is distributed over large parts of the United +States, although mostly on the coasts (for obvious reasons). Five submarine classes +are included, as well as worldwide naval operations. + +footnote. I am particularly grateful to Rear Admiral Walt Cantrell, Al Ford, and Commander Jim Hassett for +their insights on and information about the SUBSAFE program. + +section 14.1. +History. +The SUBSAFE program was created after the loss of the nuclear submarine +Thresher. The USS Thresher was the first ship of her class and the leading edge of +U.S. submarine technology, combining nuclear power with modern hull design and +newly designed equipment and components. On April 10, 1963, while performing a + + + +deep test dive approximately two hundred miles off the northeastern coast of the +United States, the USS Thresher was lost at sea with all persons aboard: 112 naval +personnel and 17 civilians died. +The head of the U.S. nuclear Navy, Admiral Hyman Rickover, gathered his staff +after the Thresher loss and ordered them to design a program that would ensure +such a loss never happened again. The program was to be completed by June and +operational by that December. To date, that goal has been achieved. Between 1915 +and 1963, the U.S. had lost fifteen submarines to noncombat causes, an average of +one loss every three years, with a total of 454 casualties. Thresher was the first +nuclear submarine lost, the worst submarine disaster in history in terms of lives lost +(figure 14.1). +SUBSAFE was established just fifty-four days after the loss of Thresher. It was +created on June 3, 1963, and the program requirements were issued on December +20 of that same year. Since that date, no SUBSAFE-certified submarine has ever +been lost. +One loss did occur in 1968—the USS Scorpion—but it was not SUBSAFE certi- +fied. In a rush to get Scorpion ready for service after it was scheduled for a major +overhaul in 1967, the Chief of Naval Operations allowed a reduced overhaul process +and deferred the required SUBSAFE inspections. The design changes deemed nec- +essary after the loss of Thresher were not made, such as newly designed central valve +control and emergency blow systems, which had not operated properly on Thresher. 
+Cold War pressures prompted the Navy to search for ways to reduce the duration
+of overhauls. By not following SUBSAFE requirements, the Navy reduced the time
+Scorpion was out of commission.
+In addition, the high quality of the submarine components required by SUBSAFE,
+along with intensified structural inspections, had reduced the availability of critical
+parts such as seawater piping [8]. A year later, in May 1968, Scorpion was lost at
+sea. Although some have attributed its loss to a Soviet attack, a later investigation
+of the debris field revealed the most likely cause of the loss was one of its own
+torpedoes exploding inside the torpedo room [8]. After the Scorpion loss, the need
+for SUBSAFE was reaffirmed and accepted.
+The rest of this chapter outlines the SUBSAFE program and provides some
+hypotheses to explain its remarkable success. The reader will notice that much
+of the program rests on the same systems thinking fundamentals advocated in
+this book.
+Details of the Thresher Loss.
+The accident was thoroughly investigated including, to the Navy’s credit, the sys-
+temic factors as well as the technical failures and deficiencies. Deep sea photogra-
+phy, recovered artifacts, and an evaluation of the Thresher’s design and operational
+
+history led a court of inquiry to conclude that the failure of a deficient silver-braze
+joint in a salt water piping system, which relied on silver brazing instead of welding,
+led to flooding in the engine room. The crew was unable to access vital equipment
+to stop the flooding. As a result of the flooding, saltwater spray on the electrical
+components caused short circuits, shutdown of the nuclear reactor, and loss of pro-
+pulsion. When the crew attempted to blow the main ballast tanks in order to surface,
+excessive moisture in the air system froze, causing a loss of airflow and inability
+to surface.
+The accident report included recommendations to fix the design problems, for
+example, to add high-pressure air compressors to permit the emergency blow
+system to operate properly. The finding that there were no centrally located isola-
+tion valves for the main and auxiliary seawater systems led to the use of flood-
+control levers that allowed isolation valves to be closed remotely from a central
+panel.
+Most accident analyses stop at this point, particularly in that era. To their credit,
+however, the investigation continued and looked at why the technical deficiencies
+existed, that is, the management and systemic factors involved in the loss. They found
+deficient specifications, deficient shipbuilding practices, deficient maintenance prac-
+tices, inadequate documentation of construction and maintenance actions, and defi-
+cient operational procedures. With respect to documentation, there appeared to be
+incomplete or no records of the work that had been done on the submarine and the
+critical materials and processes used.
+As one example, Thresher had about three thousand silver-brazed pipe joints
+exposed to full pressure when the submarine was submerged. During her last ship-
+yard maintenance, 145 of these joints were inspected on a “not-to-delay” vessel basis
+using what was then the new technique called ultrasonic testing. Fourteen percent
+of the 145 joints showed substandard joint integrity. Extrapolating these results to
+the entire complement of three thousand joints suggests that more than four hundred
+joints could have been substandard. The ship was allowed to go to sea in this con-
+dition.
The Thresher loss investigators looked at whether the full scope of the joint +problem had been determined and what rationale could have been used to allow +the ship to sail without fixing the joints. +One of the conclusions of the accident investigation is that Navy risk manage- +ment practices had not advanced as fast as submarine capability. +section 14.2. SUBSAFE Goals and Requirements. +A decision was made in 1963 to concentrate the SUBSAFE program on the essen- +tials, and a program was designed to provide maximum reasonable assurance of two +things: + +1.• Watertight integrity of the submarine’s hull. +2.• +Operability and integrity of critical systems to control and recover from a flood- +ing hazard. +By being focused, the SUBSAFE program does not spread or dilute its focus beyond +this stated purpose. For example, mission assurance is not a focus of SUBSAFE, +although it benefits from it. Similarly, fire safety, weapons safety, occupational health +and safety, and nuclear reactor systems safety are not in SUBSAFE. These addi- +tional concerns are handled by regular System Safety programs and mission assur- +ance activities focused on the additional hazards. In this way, the extra rigor required +by SUBSAFE is limited to those activities that ensure U.S. submarines can surface +and return to port safely in an emergency, making the program more acceptable and +practical than it might otherwise be. +SUBSAFE requirements, as documented in the SUBSAFE manual, permeate the +entire submarine community. These requirements are invoked in design, construc- +tion, operations, and maintenance and cover the following aspects of submarine +development and operations: +1.• Administrative +2.• +Organizational +3.• Technical +4.•Unique design +5.•Material control +6.•Fabrication +7.• Testing +8.• Work control +9.• Audits +10.• +Certification +These requirements are invoked in design contracts, construction contracts, overhaul +contracts, the fleet maintenance manual and spare parts procurement specifications, +and so on. +Notice that the requirements encompass not only the technical aspects of the +program but the administrative and organizational aspects as well. The program +requirements are reviewed periodically and renewed when deemed necessary. The +Submarine Safety Working Group, consisting of the SUBSAFE Program Directors +from all SUBSAFE facilities around the country, convenes twice a year to discuss +program issues of mutual concern. This meeting often leads to changes and improve- +ments to the program. + +section 14.3. SUBSAFE Risk Management Fundamentals. +SUBSAFE is founded on a basic set of risk management principles, both technical +and cultural. These fundamentals are: +• Work discipline: Knowledge of and compliance with requirements +•Material control: The correct material installed correctly +•Documentation: (1) Design products (specifications, drawings, maintenance +standards, system diagrams, etc.), and (2) objective quality evidence (defined +later) +•Compliance verification: Inspections, surveillance, technical reviews, and audits +•Learning from inspections, audits, and nonconformances +These fundamentals, coupled with a questioning attitude and what those in +SUBSAFE term a chronic uneasiness, are credited for SUBSAFE success. The fun- +damentals are taught and embraced throughout the submarine community. The +members of this community believe that it is absolutely critical that they do not +allow themselves to drift away from the fundamentals. 
+The Navy, in particular, expends a lot of effort in assuring compliance verification +with the SUBSAFE requirements. A common saying in this community is, “Trust +everybody, but check up.” Whenever a significant issue arises involving compliance +with SUBSAFE requirements, including material defects, system malfunctions, defi- +cient processes, equipment damage, and so on, the Navy requires that an initial +report be provided to Naval Sea Systems Command (NAVSEA) headquarters +within twenty-four hours. The report must describe what happened and must contain +preliminary information concerning apparent root cause(s) and immediate correc- +tive actions taken. Beyond providing the information to prevent recurrence, this +requirement also demonstrates top management commitment to safety and the +SUBSAFE program. +In addition to the technical and managerial risk management fundamentals listed +earlier, SUBSAFE also has cultural principles built into the program: +1.• A questioning attitude +2.•Critical self-evaluation +3.•Lessons learned and continual improvement +4.•Continual training +5.•Separation of powers (a management structure that provides checks and bal- +ances and assures appropriate attention to safety) + + +As is the case with most risk management programs, the foundation of SUBSAFE +is the personal integrity and responsibility of those individuals who are involved in +the program. The cement bonding this foundation is the selection, training, and +cultural mentoring of those individuals who perform SUBSAFE work. Ultimately, +these people attest to their adherence to technical requirements by documenting +critical data, parameters, statements and their personal signature verifying that work +has been properly completed. +section 14.4. +Separation of Powers. +SUBSAFE has created a unique management structure they call separation of +powers or, less formally, the three-legged stool (figure 14.2). This structure is the +cornerstone of the SUBSAFE program. Responsibility is divided among three dis- +tinct entities providing a system of checks and balances. +The new construction and in-service Platform Program Managers are responsible +for the cost, schedule, and quality of the ships under their control. To ensure that +safety is not traded off under cost and schedule pressures, the Program Managers +can only select from a set of acceptable design options. The Independent Technical +Authority has the responsibility to approve those acceptable options. +The third leg of the stool is the Independent Safety and Quality Assurance +Authority. This group is responsible for administering the SUBSAFE program and +for enforcing compliance. It is staffed by engineers with the authority to question +and challenge the Independent Technical Authority and the Program Managers on +their compliance with SUBSAFE requirements. + + +The Independent Technical Authority (ITA) is responsible for establishing and +assuring adherence to technical standards and policy. More specifically, they: +1.•Set and enforce technical standards. +2.•Maintain technical subject matter expertise. +3.• Assure safe and reliable operations. +4.•Ensure effective and efficient systems engineering. +5.•Make unbiased, independent technical decisions. +6.•Provide stewardship of technical and engineering capabilities. +Accountability is important in SUBSAFE and the ITA is held accountable for +exercising these responsibilities. +This management structure only works because of support from top manage- +ment. 
When Program Managers complain that satisfying the SUBSAFE require- +ments will make them unable to satisfy their program goals and deliver new +submarines, SUBSAFE requirements prevail. +section 14.5. +Certification. +In 1963, a SUBSAFE certification boundary was defined. Certification focuses on +the structures, systems, and components that are critical to the watertight integrity +and recovery capability of the submarine. +Certification is also strictly based on what the SUBSAFE program defines as +Objective Quality Evidence (OQE). OQE is defined as any statement of fact, either +quantitative or qualitative, pertaining to the quality of a product or service, based +on observations, measurements, or tests that can be verified. Probabilistic risk assess- +ment, which usually cannot be verified, is not used. +OQE is evidence that deliberate steps were taken to comply with requirements. +It does not matter who did the work or how well they did it, if there is no OQE +then there is no basis for certification. +The goal of certification is to provide maximum reasonable assurance through +the initial SUBSAFE certification and by maintaining certification throughout the +submarine’s life. SUBSAFE inculcates the basic STAMP assumption that systems +change throughout their existence. SUBSAFE certification is not a one-time activity +but has to be maintained over time: SUBSAFE certification is a process, not just a +final step. This rigorous process structures the construction program through a speci- +fied sequence of events leading to formal authorization for sea trials and delivery +to the Navy. Certification then applies to the maintenance and operations programs +and must be maintained throughout the life of the ship. + + +section 14.5.1. Initial Certification. +Initial certification is separated into four elements (figure 14.3): +1. Design certification: Design certification consists of design product approval +and design review approval, both of which are based on OQE. For design +product approval, the OQE is reviewed to confirm that the appropriate techni- +cal authority has approved the design products, such as the technical drawings. +Most drawings are produced by the submarine design yard. Approval may be +given by the Navy’s Supervisor of Shipbuilding, which administers and over- +sees the contract at each of the private shipyards, or, in some cases, the +NAVSEA may act as the review and approval technical authority. Design +approval is considered complete only after the proper technical authority has +reviewed the OQE and at that point the design is certified. +2. Material certification: After the design is certified, the material procured to +build the submarine must meet the requirements of that design. Technical +specifications must be embodied in the purchase documents. Once the material +is received, it goes through a rigorous receipt inspection process to confirm +and certify that it meets the technical specifications. This process usually +involves examining the vendor-supplied chemical and physical OQE for the +material. Records of chemical assay results, heat treatment applied to the mate- +rial, and nondestructive testing conducted on the material constitute OQE. +3. Fabrication certification: Once the certified material is obtained, the next +step is fabrication where industrial processes such as machining, welding, and +assembly are used to construct components, systems, and ships. OQE is used +to document the industrial processes. 
Separately, and prior to actual fabrication
+of the final product, the facility performing the work is certified in the indus-
+trial processes necessary to perform the work. An example is a specific
+
+
+high-strength steel welding procedure. In addition to the weld procedure, the
+individual welder using this particular process in the actual fabrication receives
+documented training and successfully completes a formal qualification in the
+specific weld procedure to be used. Other industrial processes have similar
+certification and qualification requirements. In addition, steps are taken to
+ensure that the measurement devices, such as temperature sensors, pressure
+gauges, torque wrenches, micrometers, and so on, are included in a robust
+calibration program at the facility.
+4. Testing certification: Finally, a series of tests is used to prove that the assem-
+bly, system, or ship meets design parameters. Testing occurs throughout the
+fabrication of a submarine, starting at the component level and continuing
+through system assembly, final assembly, and sea trials. The material and com-
+ponents may receive any of the typical nondestructive tests, such as radiogra-
+phy, magnetic particle, and representative tests. Systems are also subjected to
+strength testing and operational testing. For certain components, destructive
+tests are performed on representative samples.
+Each of these certification elements is defined by detailed, documented SUBSAFE
+requirements.
+At some point near the end of the new construction period, usually lasting five
+or so years, every submarine obtains its initial SUBSAFE certification. This process
+is very formal and preceded by scrutiny and audit conducted by the shipbuilder, the
+supervising authority, and finally, by a NAVSEA Certification Audit Team assem-
+bled and led by the Office of Safety and Quality Assurance at NAVSEA. The initial
+certification is in the end granted at the flag officer level.
+
+section 14.5.2. Maintaining Certification.
+After the submarine enters the fleet, SUBSAFE certification must be maintained
+through the life of the ship. Three tools are used: the Reentry Control (REC) Process,
+the Unrestricted Operations Maintenance Requirements Card (URO MRC)
+program, and the audit program.
+The Reentry Control (REC) process carefully controls work and testing within
+the SUBSAFE boundary, that is, the structures, systems, and components that are
+critical to the watertight integrity and recovery capability of the submarine. The
+purpose of REC is to provide maximum reasonable assurance that the areas dis-
+turbed have been restored to their fully certified condition. The procedures used
+provide an identifiable, accountable, and auditable record of the work performed.
+REC control procedures have three goals: (1) to maintain work discipline by
+identifying the work to be performed and the standards to be met, (2) to establish
+personal accountability by having the responsible personnel sign their names on the
+
+reentry control document, and (3) to collect the OQE needed for maintaining
+certification.
+The second process, the Unrestricted Operations Maintenance Requirements
+Card (URO MRC) program, involves periodic inspections and tests of critical
+items to ensure they have not degraded to an unacceptable level due to use, age,
+or environment. In fact, URO MRC did not originate with SUBSAFE, but was
+developed to extend the operating cycle of USS Queenfish by one year in 1969.
It
+now provides the technical basis for continued unrestricted operation of subma-
+rines to test depth.
+The third aspect of maintaining certification is the audit program. Because the
+audit process is used for more general purposes than simply maintaining certifica-
+tion, it is considered in a separate section.
+
+section 14.6. Audit Procedures and Approach.
+Compliance verification in SUBSAFE is treated as a process, not just one step in a
+process or program. The Navy demands that each Navy facility participate fully in
+the process, including the use of inspection, surveillance, and audits to confirm their
+own compliance. Audits are used to verify that this process is working. They are
+conducted either at fixed intervals or when a specific condition is found to exist that
+needs attention.
+Audits are multi-layered: they exist at the contractor and shipyard level, at the
+local government level, and at Navy headquarters. Using the terminology adopted
+in this book, responsibilities are assigned to all the components of the safety control
+structure as shown in figure 14.4. Contractor and shipyard responsibilities include
+implementing specified SUBSAFE requirements, establishing processes for control-
+ling work, establishing processes to verify compliance and certify their own work, and
+presenting the certification OQE to the local government oversight authority. The
+processes established to verify compliance and certify their work include a quality
+management system, surveillance, inspections, witnessing critical contractor work
+(contractor quality assurance), and internal audits.
+Local government oversight responsibilities include surveillance, inspections,
+assuring quality, witnessing critical contractor work, audits of the contractor,
+and certifying the work of the contractor to Navy headquarters.
+The responsibilities of Navy headquarters include establishing and specifying
+SUBSAFE requirements, verifying compliance with the requirements, and provid-
+ing SUBSAFE certification for each submarine. Compliance is verified through two
+types of audits: (1) ship-specific and (2) functional or facility audits.
+A ship-specific audit looks at the OQE associated with an individual ship to
+ensure that the material condition of that submarine is satisfactory for sea trial and
+
+unrestricted operations. This audit represents a significant part of the process of
+certifying that a submarine’s condition meets SUBSAFE requirements and that it is
+safe to go to sea.
+Functional or facility audits (of contractors or shipyards, for example) include reviews
+of policies, procedures, and practices to confirm compliance with the SUBSAFE
+program requirements, the health of processes, and the capability of producing
+certifiable hardware or design products.
+Both types of audits are carried out with structured audit plans and qualified
+auditors.
+
+The audit philosophy is part of the reason for SUBSAFE success. Audits are
+treated as a constructive, learning experience. Audits start from the assumption
+that policies, procedures, and practices are in compliance with requirements. The
+goal of the audit is to confirm that compliance. Audit findings must be based
+on a clear violation of requirements or must be identified as an “operational
+improvement.”
+The objective of audits is “to make our submarines safer,” not to evaluate indi-
+vidual performance or to assign blame. Note the use of the word “our”: the SUBSAFE
+program emphasizes common safety goals and group effort to achieve them.
Every- +one owns the safety goals and is assumed to be committed to them and working to +the same purpose. SUBSAFE literature and training talks about those involved as +being part of a “very special family of people who design, build, maintain, and +operate our nation’s submarines.” +To this end, audits are a peer review. A typical audit team consists of twenty to +thirty people with approximately 80 percent of the team coming from various +SUBSAFE facilities around the country and the remaining 20 percent coming from +NAVSEA headquarters. An audit is considered a team effort—the facility being +audited is expected to help the audit team make the audit report as accurate and +meaningful as possible. +Audits are conducted under rules of continuous communication—when a problem +is found, the emphasis is on full understanding of the identified problem as well as +identification of potential solutions. Deficiencies are documented and adjudicated. +Contentious issues sometimes arise, but an attempt is made to resolve them during +the audit process. +A significant byproduct of a SUBSAFE audit is the learning experience it pro- +vides to the auditors as well as those being audited. Expected results include cross- +pollination of successful procedures and process improvements. The rationale +behind having SUBSAFE participants on the audit team is not only their under- +standing of the SUBSAFE program and requirements, but also their ability to learn +from the audits and apply that learning to their own SUBSAFE groups. +The current audit philosophy is a product of experience and learning. Before +1986, only ship-specific audits were conducted, not facility or headquarters audits. +In 1986, there was a determination that they had gotten complacent and were assum- +ing that once an audit was completed, there would be no findings if a follow-up +audit was performed. They also decided that the ship-specific audits were not rigor- +ous or complete enough. In STAMP terms, only the lowest level of the safety control +structure was being audited and not the other components. After that time, biennial +audits were conducted at all levels of the safety control structure, even the highest +levels of management. A biennial NAVSEA internal audit gives the field activities + + +a chance to evaluate operations at headquarters. Headquarters personnel must be +willing to accept and resolve audit findings just like any other member of the nuclear +submarine community. +One lesson learned has been that developing a robust compliance verification +program is difficult. Along the way they learned that (1) clear ground rules for audits +must be established, communicated, and adhered to; (2) it is not possible to “audit +in” requirements; and (3) the compliance verification organization must be equal +with the program managers and the technical authority. In addition, they determined +that not just anyone can do SUBSAFE work. The number of activities authorized +to perform SUBSAFE activities is strictly controlled. + +section 14.7. Problem Reporting and Critiques. + +SUBSAFE believes that lessons learned are integral to submarine safety and puts +emphasis on problem reporting and critiques. Significant problems are defined as +those that affect ship safety, cause significant damage to the ship or its equipment, +delay ship deployment or incur substantial cost increase, or involve severe personnel +injury. 
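+As a purely illustrative aside, the classification just described can be read as a
+simple decision rule. The sketch below (in Python, with hypothetical field names
+that merely paraphrase the criteria above) is not a Navy format; it only shows how
+such a rule might be captured before the trouble-report step described next.
+
+# Illustrative sketch only: field names paraphrase the 'significant problem'
+# criteria in the text; the record layout is hypothetical, not a Navy format.
+from dataclasses import dataclass
+
+@dataclass
+class ReportedProblem:
+    affects_ship_safety: bool
+    significant_ship_or_equipment_damage: bool
+    delays_deployment_or_substantial_cost: bool
+    severe_personnel_injury: bool
+    description: str = ""
+
+def is_significant(problem: ReportedProblem) -> bool:
+    # A problem is significant if it meets any one of the stated criteria.
+    return any([problem.affects_ship_safety,
+                problem.significant_ship_or_equipment_damage,
+                problem.delays_deployment_or_substantial_cost,
+                problem.severe_personnel_injury])
+
+example = ReportedProblem(affects_ship_safety=False,
+                          significant_ship_or_equipment_damage=True,
+                          delays_deployment_or_substantial_cost=False,
+                          severe_personnel_injury=False,
+                          description="Equipment damage found during maintenance")
+print(is_significant(example))  # True, so a trouble report would be prepared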
Trouble reports are prepared for all significant problems encountered in
+the construction, repair, and maintenance of naval ships. Systemic problems and
+issues that constitute significant lessons learned for other activities can also be
+identified by trouble reports. Critiques are similar to trouble reports and are utilized
+by the fleet.
+Trouble reports are distributed to all SUBSAFE responsible activities and are
+used to report significant problems to NAVSEA. NAVSEA evaluates the reports to
+identify SUBSAFE program improvements.
+
+section 14.8. Challenges.
+The leaders of SUBSAFE consider their biggest challenges to be:
+•Ignorance: The state of not knowing;
+•Arrogance: Behavior based on pride, self-importance, conceit, or the assump-
+tion of intellectual superiority and the presumption of knowledge that is not
+supported by facts; and
+•Complacency: Satisfaction with one’s accomplishments accompanied by a
+lack of awareness of actual dangers or deficiencies.
+Combating these challenges is a “constant struggle every day” [69]. Many features
+of the program are designed to control these challenges, particularly training and
+education.
+
+
+section 14.9. Continual Training and Education.
+Continual training and education are a hallmark of SUBSAFE. The goals are to:
+1.•Serve as a reminder of the consequences of complacency in one’s job.
+2.•Emphasize the need to proactively correct and prevent problems.
+3.•Stress the need to adhere to program fundamentals.
+4.•Convey management support for the program.
+Continual improvement and feedback to the SUBSAFE training programs
+come not only from trouble reports and incidents but also from the level of knowl-
+edge assessments performed during the audits of organizations that perform
+SUBSAFE work.
+Annual training is required for all headquarters SUBSAFE workers, from the
+apprentice craftsman to the admirals. A periodic refresher is also held at each of the
+contractor’s facilities. At the meetings, a video about the loss of Thresher is shown
+and an overview of the SUBSAFE program and the attendees’ responsibilities is
+provided, as well as recent lessons learned and deficiency trends encountered over
+the previous years. The need to avoid complacency and to proactively correct and
+prevent problems is reinforced.
+Time is also taken at the annual meetings to remind everyone involved about the
+history of the program. By guaranteeing that no one forgets what happened to USS
+Thresher, the SUBSAFE program has helped to create a culture that is conducive
+to strict adherence to policies and procedures. Everyone is recommitted each year
+to ensure that a tragedy like the one that occurred in 1963 never happens again.
+SUBSAFE is described by those in the program as “a requirement, an attitude, and
+a responsibility.”
+
+section 14.10. Execution and Compliance over the Life of a Submarine.
+The design, construction, and initial certification are only a small percentage of the
+life of the certified ship. The success of the program during the vast majority of the
+certified ship’s life depends on the knowledge, compliance, and audit by those oper-
+ating and maintaining the submarines. Without the rigor of compliance and sustain-
+ing knowledge from the petty officers, ship’s officers, and fleet staff, all of the great
+virtues of SUBSAFE would “come to naught” [30].
The following anecdote by +Admiral Walt Cantrell provides an indication of how SUBSAFE principles per- +meate the entire nuclear Navy: +I remember vividly when I escorted the first group of NASA skeptics to a submarine and +they figured they would demonstrate that I had exaggerated the integrity of the program + +by picking a member of ship’s force at random and asked him about SUBSAFE. The +NASA folks were blown away. A second class machinist’s mate gave a cogent, complete, +correct description of the elements of the program and how important it was that all levels +in the Submarine Force comply. That part of the program is essential to its success—just +as much, if not more so, than all the other support staff effort [30]. + +section 14.11 Lessons to Be Learned from SUBSAFE. +Those involved in SUBSAFE are very proud of their achievements and the fact that +even after nearly fifty years of no accidents, the program is still strong and vibrant. +On January 8, 2005, USS San Francisco, a twenty-six-year-old ship, crashed head-on +into an underwater mountain. While several crew members were injured and one +died, this incident is considered by SUBSAFE to be a success story: In spite of the +massive damage to her forward structure, there was no flooding, and the ship sur- +faced and returned to port under her own power. There was no breach of the pres- +sure hull, the nuclear reactor remained on line, the emergency main ballast tank +blow system functioned as intended, and the control surfaces functioned properly. +Those in the SUBSAFE program attribute this success to the work discipline, mate- +rial control, documentation, and compliance verification exercised during the design, +construction, and maintenance of USS San Francisco. +Can the SUBSAFE principles be transferred from the military to commercial +companies and industries? The answer lies in why the program has been so effective +and whether these factors can be maintained in other implementations of the prin- +ciples more appropriate to non-military venues. Remember, of course, that private +contractors form the bulk of the companies and workers in the nuclear Navy, and +they seem to be able to satisfy the SUBSAFE program requirements. The primary +difference is in the basic goals of the organization itself. +Some factors that can be identified as contributing to the success of SUBSAFE, +most of which could be translated into a safety program in private industry are: +1.•Leadership support and commitment to the program. +2.•Management (NAVSEA) is not afraid to say “no” when faced with pressures +to compromise the SUBSAFE principles and requirements. Top management +also agrees to be audited for adherence to the principles of SUBSAFE and to +correct any deficiencies that are found. +3.•Establishment of clear and written safety requirements. +4.•Education, not just training, with yearly reminders of the past, continual +improvement, and input from lessons learned, trouble reports, and assessments +during audits. +5.•Updating the SUBSAFE program requirements and the commitment to it +periodically. + + +6.Separation of powers and assignment of responsibility. +7.•Emphasis on rigor, technical compliance, and work discipline. +8.•Documentation capturing what they do and why they do it. + +9.• The participatory audit philosophy and the requirement for objective quality +evidence. +10.• A program based on written procedures, not personality-driven. +11.•Continual feedback and improvement. 
When something does not conform to +SUBSAFE specifications, it must be reported to NAVSEA headquarters along +with the causal analysis (including the systemic factors) of why it happened. +Everyone at every level of the organization is willing to examine his or her role +in the incident. +12.•Continual certification throughout the life of the ship; it is not a one-time event. +13.• Accountability accompanying responsibility. Personal integrity and personal +responsibility is stressed. The program is designed to foster everyone’s pride in +his or her work. +14.• A culture of shared responsibility for safety and the SUBSAFE requirements. +15.• +Special efforts to be vigilant against complacency and to fight it when it is +detected. + diff --git a/chapter14.txt b/chapter14.txt new file mode 100644 index 0000000..b40dd2c --- /dev/null +++ b/chapter14.txt @@ -0,0 +1,493 @@ +chapter 14. +SUBSAFE. An Example of a Successful Safety +Program. +This book is filled with examples of accidents and of what not to do. One possible +conclusion might be that despite our best efforts accidents are inevitable in +complex systems. That conclusion would be wrong. Many industries and companies +are able to avoid accidents. the nuclear Navy SUBSAFE program is a shining +example. By any measure, SUBSAFE has been remarkably successful. In nearly +fifty years since the beginning of SUBSAFE, no submarine in the program has +been lost. +Looking at a successful safety program and trying to understand why it has been +successful can be very instructive. This chapter looks at the history of the program +and what it is, and proposes some explanations for its great success. SUBSAFE also +provides a good example of most of the principles expounded in this book. +Although SUBSAFE exists in a government and military environment, most of +the important components could be translated into the commercial, profit-making +world. Also note that the success is not related to small size.there are 40,000 +people involved in the U.S. submarine safety program, a large percentage of whom +are private contractors and not government employees. Both private and public +shipyards are involved. SUBSAFE is distributed over large parts of the United +States, although mostly on the coasts .(for obvious reasons). Five submarine classes +are included, as well as worldwide naval operations. + +footnote. I am particularly grateful to Rear Admiral Walt Cantrell, Al Ford, and Commander Jim Hassett for +their insights on and information about the SUBSAFE program. + +section 14.1. +History. +The SUBSAFE program was created after the loss of the nuclear submarine +Thresher. The USS Thresher was the first ship of her class and the leading edge of +U.S. submarine technology, combining nuclear power with modern hull design and +newly designed equipment and components. On April 10, 19 63 , while performing a + + + +deep test dive approximately two hundred miles off the northeastern coast of the +United States, the USS Thresher was lost at sea with all persons aboard. 112 naval +personnel and 17 civilians died. +The head of the U.S. nuclear Navy, Admiral Hyman Rickover, gathered his staff +after the Thresher loss and ordered them to design a program that would ensure +such a loss never happened again. The program was to be completed by June and +operational by that December. To date, that goal has been achieved. Between 19 15 +and 19 63 , the U.S. 
had lost fifteen submarines to noncombat causes, an average of +one loss every three years, with a total of 454 casualties. Thresher was the first +nuclear submarine lost, the worst submarine disaster in history in terms of lives lost +(figure 14.1). +SUBSAFE was established just fifty-four days after the loss of Thresher. It was +created on June 3, 19 63 , and the program requirements were issued on December +20 of that same year. Since that date, no SUBSAFE-certified submarine has ever +been lost. +One loss did occur in 19 68 .the USS Scorpion.but it was not SUBSAFE certified. In a rush to get Scorpion ready for service after it was scheduled for a major +overhaul in 19 67 , the Chief of Naval Operations allowed a reduced overhaul process +and deferred the required SUBSAFE inspections. The design changes deemed necessary after the loss of Thresher were not made, such as newly designed central valve +control and emergency blow systems, which had not operated properly on Thresher. +Cold War pressures prompted the Navy to search for ways to reduce the duration +of overhauls. By not following SUBSAFE requirements, the Navy reduced the time +Scorpion was out of commission. +In addition, the high quality of the submarine components required by SUBSAFE, +along with intensified structural inspections, had reduced the availability of critical +parts such as seawater piping . A year later, in May 19 68 , Scorpion was lost at +sea. Although some have attributed its loss to a Soviet attack, a later investigation +of the debris field revealed the most likely cause of the loss was one of its own +torpedoes exploding inside the torpedo room . After the Scorpion loss, the need +for SUBSAFE was reaffirmed and accepted. +The rest of this chapter outlines the SUBSAFE program and provides some +hypotheses to explain its remarkable success. The reader will notice that much +of the program rests on the same systems thinking fundamentals advocated in +this book. +Details of the Thresher Loss. +The accident was thoroughly investigated including, to the Navy’s credit, the systemic factors as well as the technical failures and deficiencies. Deep sea photography, recovered artifacts, and an evaluation of the Thresher’s design and operational + +history led a court of inquiry to conclude that the failure of a deficient silver-braze +joint in a salt water piping system, which relied on silver brazing instead of welding, +led to flooding in the engine room. The crew was unable to access vital equipment +to stop the flooding. As a result of the flooding, saltwater spray on the electrical +components caused short circuits, shutdown of the nuclear reactor, and loss of propulsion. When the crew attempted to blow the main ballast tanks in order to surface, +excessive moisture in the air system froze, causing a loss of airflow and inability +to surface. +The accident report included recommendations to fix the design problems, for +example, to add high-pressure air compressors to permit the emergency blow +system to operate property. The finding that there were no centrally located isolation valves for the main and auxiliary seawater systems led to the use of floodcontrol levers that allowed isolation valves to be closed remotely from a central +panel. +Most accident analyses stop at this point, particularly in that era. To their credit, +however, the investigation continued and looked at why the technical deficiencies +existed, that is, the management and systemic factors involved in the loss. 
They found +deficient specifications, deficient shipbuilding practices, deficient maintenance practices, inadequate documentation of construction and maintenance actions, and deficient operational procedures. With respect to documentation, there appeared to be +incomplete or no records of the work that had been done on the submarine and the +critical materials and processes used. +As one example, Thresher had about three thousand silver-brazed pipe joints +exposed to full pressure when the submarine was submerged. During her last shipyard maintenance, 145 of these joints were inspected on a “not-to-delay” vessel basis +using what was then the new technique called ultrasonic testing. Fourteen percent +of the 145 joints showed substandard joint integrity. Extrapolating these results to +the entire complement of three thousand joints suggests that more than four hundred +joints could have been substandard. The ship was allowed to go to sea in this condition. The Thresher loss investigators looked at whether the full scope of the joint +problem had been determined and what rationale could have been used to allow +the ship to sail without fixing the joints. +One of the conclusions of the accident investigation is that Navy risk management practices had not advanced as fast as submarine capability. +section 14.2. SUBSAFE Goals and Requirements. +A decision was made in 19 63 to concentrate the SUBSAFE program on the essentials, and a program was designed to provide maximum reasonable assurance of two +things. + +1.• Watertight integrity of the submarine’s hull. +2.• +Operability and integrity of critical systems to control and recover from a flooding hazard. +By being focused, the SUBSAFE program does not spread or dilute its focus beyond +this stated purpose. For example, mission assurance is not a focus of SUBSAFE, +although it benefits from it. Similarly, fire safety, weapons safety, occupational health +and safety, and nuclear reactor systems safety are not in SUBSAFE. These additional concerns are handled by regular System Safety programs and mission assurance activities focused on the additional hazards. In this way, the extra rigor required +by SUBSAFE is limited to those activities that ensure U.S. submarines can surface +and return to port safely in an emergency, making the program more acceptable and +practical than it might otherwise be. +SUBSAFE requirements, as documented in the SUBSAFE manual, permeate the +entire submarine community. These requirements are invoked in design, construction, operations, and maintenance and cover the following aspects of submarine +development and operations. +1.• Administrative +2.• +Organizational +3.• Technical +4.•Unique design +5.•Material control +6.•Fabrication +7.• Testing +8.• Work control +9.• Audits +10.• +Certification +These requirements are invoked in design contracts, construction contracts, overhaul +contracts, the fleet maintenance manual and spare parts procurement specifications, +and so on. +Notice that the requirements encompass not only the technical aspects of the +program but the administrative and organizational aspects as well. The program +requirements are reviewed periodically and renewed when deemed necessary. The +Submarine Safety Working Group, consisting of the SUBSAFE Program Directors +from all SUBSAFE facilities around the country, convenes twice a year to discuss +program issues of mutual concern. This meeting often leads to changes and improvements to the program. + +section 14.3. 
SUBSAFE Risk Management Fundamentals. +SUBSAFE is founded on a basic set of risk management principles, both technical +and cultural. These fundamentals are. +• Work discipline. Knowledge of and compliance with requirements +•Material control. The correct material installed correctly +•Documentation. .(1). Design products .(specifications, drawings, maintenance +standards, system diagrams, etc.), and .(2). objective quality evidence .(defined +later) +•Compliance verification. Inspections, surveillance, technical reviews, and audits +•Learning from inspections, audits, and nonconformances +These fundamentals, coupled with a questioning attitude and what those in +SUBSAFE term a chronic uneasiness, are credited for SUBSAFE success. The fundamentals are taught and embraced throughout the submarine community. The +members of this community believe that it is absolutely critical that they do not +allow themselves to drift away from the fundamentals. +The Navy, in particular, expends a lot of effort in assuring compliance verification +with the SUBSAFE requirements. A common saying in this community is, “Trust +everybody, but check up.” Whenever a significant issue arises involving compliance +with SUBSAFE requirements, including material defects, system malfunctions, deficient processes, equipment damage, and so on, the Navy requires that an initial +report be provided to Naval Sea Systems Command .(NAVSEA). headquarters +within twenty-four hours. The report must describe what happened and must contain +preliminary information concerning apparent root cause(s). and immediate corrective actions taken. Beyond providing the information to prevent recurrence, this +requirement also demonstrates top management commitment to safety and the +SUBSAFE program. +In addition to the technical and managerial risk management fundamentals listed +earlier, SUBSAFE also has cultural principles built into the program. +1.• A questioning attitude +2.•Critical self-evaluation +3.•Lessons learned and continual improvement +4.•Continual training +5.•Separation of powers .(a management structure that provides checks and balances and assures appropriate attention to safety) + + +As is the case with most risk management programs, the foundation of SUBSAFE +is the personal integrity and responsibility of those individuals who are involved in +the program. The cement bonding this foundation is the selection, training, and +cultural mentoring of those individuals who perform SUBSAFE work. Ultimately, +these people attest to their adherence to technical requirements by documenting +critical data, parameters, statements and their personal signature verifying that work +has been properly completed. +section 14.4. +Separation of Powers. +SUBSAFE has created a unique management structure they call separation of +powers or, less formally, the three-legged stool .(figure 14.2). This structure is the +cornerstone of the SUBSAFE program. Responsibility is divided among three distinct entities providing a system of checks and balances. +The new construction and in-service Platform Program Managers are responsible +for the cost, schedule, and quality of the ships under their control. To ensure that +safety is not traded off under cost and schedule pressures, the Program Managers +can only select from a set of acceptable design options. The Independent Technical +Authority has the responsibility to approve those acceptable options. +The third leg of the stool is the Independent Safety and Quality Assurance +Authority. 
This group is responsible for administering the SUBSAFE program and +for enforcing compliance. It is staffed by engineers with the authority to question +and challenge the Independent Technical Authority and the Program Managers on +their compliance with SUBSAFE requirements. + + +The Independent Technical Authority .(ITA). is responsible for establishing and +assuring adherence to technical standards and policy. More specifically, they. +1.•Set and enforce technical standards. +2.•Maintain technical subject matter expertise. +3.• Assure safe and reliable operations. +4.•Ensure effective and efficient systems engineering. +5.•Make unbiased, independent technical decisions. +6.•Provide stewardship of technical and engineering capabilities. +Accountability is important in SUBSAFE and the ITA is held accountable for +exercising these responsibilities. +This management structure only works because of support from top management. When Program Managers complain that satisfying the SUBSAFE requirements will make them unable to satisfy their program goals and deliver new +submarines, SUBSAFE requirements prevail. +section 14.5. +Certification. +In 19 63 , a SUBSAFE certification boundary was defined. Certification focuses on +the structures, systems, and components that are critical to the watertight integrity +and recovery capability of the submarine. +Certification is also strictly based on what the SUBSAFE program defines as +Objective Quality Evidence .(OQE). OQE is defined as any statement of fact, either +quantitative or qualitative, pertaining to the quality of a product or service, based +on observations, measurements, or tests that can be verified. Probabilistic risk assessment, which usually cannot be verified, is not used. +OQE is evidence that deliberate steps were taken to comply with requirements. +It does not matter who did the work or how well they did it, if there is no OQE +then there is no basis for certification. +The goal of certification is to provide maximum reasonable assurance through +the initial SUBSAFE certification and by maintaining certification throughout the +submarine’s life. SUBSAFE inculcates the basic STAMP assumption that systems +change throughout their existence. SUBSAFE certification is not a one-time activity +but has to be maintained over time. SUBSAFE certification is a process, not just a +final step. This rigorous process structures the construction program through a specified sequence of events leading to formal authorization for sea trials and delivery +to the Navy. Certification then applies to the maintenance and operations programs +and must be maintained throughout the life of the ship. + + +section 14.5.1. Initial Certification. +Initial certification is separated into four elements .(figure 14.3). +1. Design certification. Design certification consists of design product approval +and design review approval, both of which are based on OQE. For design +product approval, the OQE is reviewed to confirm that the appropriate technical authority has approved the design products, such as the technical drawings. +Most drawings are produced by the submarine design yard. Approval may be +given by the Navy’s Supervisor of Shipbuilding, which administers and oversees the contract at each of the private shipyards, or, in some cases, the +NAVSEA may act as the review and approval technical authority. Design +approval is considered complete only after the proper technical authority has +reviewed the OQE and at that point the design is certified. +2. 
Material certification. After the design is certified, the material procured to +build the submarine must meet the requirements of that design. Technical +specifications must be embodied in the purchase documents. Once the material +is received, it goes through a rigorous receipt inspection process to confirm +and certify that it meets the technical specifications. This process usually +involves examining the vendor-supplied chemical and physical OQE for the +material. Records of chemical assay results, heat treatment applied to the material, and nondestructive testing conducted on the material constitute OQE. +3. Fabrication certification. Once the certified material is obtained, the next +step is fabrication where industrial processes such as machining, welding, and +assembly are used to construct components, systems, and ships. OQE is used +to document the industrial processes. Separately, and prior to actual fabrication +of the final product, the facility performing the work is certified in the industrial processes necessary to perform the work. An example is a specific + + +high-strength steel welding procedure. In addition to the weld procedure, the +individual welder using this particular process in the actual fabrication receives +documented training and successfully completes a formal qualification in the +specific weld procedure to be used. Other industrial processes have similar +certification and qualification requirements. In addition, steps are taken to +ensure that the measurement devices, such as temperature sensors, pressure +gauges, torque wrenches, micrometers, and so on, are included in a robust +calibration program at the facility. +4. Testing certification. Finally, a series of tests is used to prove that the assembly, system, or ship meets design parameters. Testing occurs throughout the +fabrication of a submarine, starting at the component level and continuing +through system assembly, final assembly, and sea trials. The material and components may receive any of the typical nondestructive tests, such as radiography, magnetic particle, and representative tests. Systems are also subjected to +strength testing and operational testing. For certain components, destructive +tests are performed on representative samples. +Each of these certification elements is defined by detailed, documented SUBSAFE +requirements. +At some point near the end of the new construction period, usually lasting five +or so years, every submarine obtains its initial SUBSAFE certification. This process +is very formal and preceded by scrutiny and audit conducted by the shipbuilder, the +supervising authority, and finally, by a NAVSEA Certification Audit Team assembled and led by the Office of Safety and Quality Assurance at NAVSEA. The initial +certification is in the end granted at the flag officer level. + +section 14.5.2. Maintaining Certification. +After the submarine enters the fleet, SUBSAFE certification must be maintained +through the life of the ship. Three tools are used. the Reentry Control .(REC). Process, +the Unrestricted Operations Maintenance Requirements Card .(URO MRC) +program, and the audit program. +The Reentry Control .(REC). process carefully controls work and testing within +the SUBSAFE boundary, that is, the structures, systems, and components that are +critical to the watertight integrity and recovery capability of the submarine. The +purpose of REC is to provide maximum reasonable assurance that the areas disturbed have been restored to their fully certified condition.
The procedures used +provide an identifiable, accountable, and auditable record of the work performed. +REC control procedures have three goals. .(1). to maintain work discipline by +identifying the work to be performed and the standards to be met, .(2). to establish +personal accountability by having the responsible personnel sign their names on the + +reentry control document, and .(3). to collect the OQE needed for maintaining +certification. +The second process, the Unrestricted Operations Maintenance Requirements +Card .(URO MRC). program, involves periodic inspections and tests of critical +items to ensure they have not degraded to an unacceptable level due to use, age, +or environment. In fact, URO MRC did not originate with SUBSAFE, but was +developed to extend the operating cycle of USS Queenfish by one year in 19 69 . It +now provides the technical basis for continued unrestricted operation of submarines to test depth. +The third aspect of maintaining certification is the audit program. Because the +audit process is used for more general purposes than simply maintaining certification, it is considered in a separate section. +section 14.6. Audit Procedures and Approach. +Compliance verification in SUBSAFE is treated as a process, not just one step in a +process or program. The Navy demands that each Navy facility participate fully in +the process, including the use of inspection, surveillance, and audits to confirm their +own compliance. Audits are used to verify that this process is working. They are +conducted either at fixed intervals or when a specific condition is found to exist that +needs attention. +Audits are multi-layered. they exist at the contractor and shipyard level, at the +local government level, and at Navy headquarters. Using the terminology adopted +in this book, responsibilities are assigned to all the components of the safety control +structure as shown in figure 14.4. Contractor and shipyard responsibilities include +implementing specified SUBSAFE requirements, establishing processes for controlling work, establishing processes to verify compliance and certify their own work, and +presenting the certification OQE to the local government oversight authority. The +processes established to verify compliance and certify their work include a quality +management system, surveillance, inspections, witnessing critical contractor work +(contractor quality assurance), and internal audits. +Local government oversight responsibilities include surveillance, inspections, +assuring quality and witnessing critical contractor work, audits of the contractor, +and certifying the work of the contractor to Navy headquarters. +The responsibilities of Navy headquarters include establishing and specifying +SUBSAFE requirements, verifying compliance with the requirements, and providing SUBSAFE certification for each submarine. Compliance is verified through two +types of audits. .(1). ship-specific and .(2). functional or facility audits. +A ship-specific audit looks at the OQE associated with an individual ship to +ensure that the material condition of that submarine is satisfactory for sea trial and + +unrestricted operations. This audit represents a significant part of the certification +process, confirming that a submarine’s condition meets SUBSAFE requirements and that it is safe to +go to sea. +Functional or facility audits .(such as of contractors or shipyards).
include reviews +of policies, procedures, and practices to confirm compliance with the SUBSAFE +program requirements, the health of processes, and the capability of producing +certifiable hardware or design products. +Both types of audits are carried out with structured audit plans and qualified +auditors. + +The audit philosophy is part of the reason for SUBSAFE success. Audits are +treated as a constructive, learning experience. Audits start from the assumption +that policies, procedures, and practices are in compliance with requirements. The +goal of the audit is to confirm that compliance. Audit findings must be based +on a clear violation of requirements or must be identified as an “operational +improvement.” +The objective of audits is “to make our submarines safer,” not to evaluate individual performance or to assign blame. Note the use of the word “our”. the SUBSAFE +program emphasizes common safety goals and group effort to achieve them. Everyone owns the safety goals and is assumed to be committed to them and working to +the same purpose. SUBSAFE literature and training talk about those involved as +being part of a “very special family of people who design, build, maintain, and +operate our nation’s submarines.” +To this end, audits are a peer review. A typical audit team consists of twenty to +thirty people with approximately 80 percent of the team coming from various +SUBSAFE facilities around the country and the remaining 20 percent coming from +NAVSEA headquarters. An audit is considered a team effort. the facility being +audited is expected to help the audit team make the audit report as accurate and +meaningful as possible. +Audits are conducted under rules of continuous communication. when a problem +is found, the emphasis is on full understanding of the identified problem as well as +identification of potential solutions. Deficiencies are documented and adjudicated. +Contentious issues sometimes arise, but an attempt is made to resolve them during +the audit process. +A significant byproduct of a SUBSAFE audit is the learning experience it provides to the auditors as well as those being audited. Expected results include cross-pollination of successful procedures and process improvements. The rationale +behind having SUBSAFE participants on the audit team is not only their understanding of the SUBSAFE program and requirements, but also their ability to learn +from the audits and apply that learning to their own SUBSAFE groups. +The current audit philosophy is a product of experience and learning. Before +1986, only ship-specific audits were conducted, not facility or headquarters audits. +In 19 86 , there was a determination that they had gotten complacent and were assuming that once an audit was completed, there would be no findings if a follow-up +audit was performed. They also decided that the ship-specific audits were not rigorous or complete enough. In STAMP terms, only the lowest level of the safety control +structure was being audited and not the other components. After that time, biennial +audits were conducted at all levels of the safety control structure, even the highest +levels of management. A biennial NAVSEA internal audit gives the field activities + + +a chance to evaluate operations at headquarters. Headquarters personnel must be +willing to accept and resolve audit findings just like any other member of the nuclear +submarine community. +One lesson learned has been that developing a robust compliance verification +program is difficult.
Along the way they learned that .(1). clear ground rules for audits +must be established, communicated, and adhered to; .(2). it is not possible to “audit +in” requirements; and .(3). the compliance verification organization must be equal +with the program managers and the technical authority. In addition, they determined +that not just anyone can do SUBSAFE work. The number of activities authorized +to perform SUBSAFE activities is strictly controlled. + +section 14.7. Problem Reporting and Critiques. + +SUBSAFE believes that lessons learned are integral to submarine safety and puts +emphasis on problem reporting and critiques. Significant problems are defined as +those that affect ship safety, cause significant damage to the ship or its equipment, +delay ship deployment or incur substantial cost increase, or involve severe personnel +injury. Trouble reports are prepared for all significant problems encountered in +the construction, repair, and maintenance of naval ships. Systemic problems and +issues that constitute significant lessons learned for other activities can also be +identified by trouble reports. Critiques are similar to trouble reports and are utilized +by the fleet. +Trouble reports are distributed to all SUBSAFE responsible activities and are +used to report significant problems to NAVSEA. NAVSEA evaluates the reports to +identify SUBSAFE program improvements. + +section 14.8. Challenges. +The leaders of SUBSAFE consider their biggest challenges to be. +•Ignorance. The state of not knowing; +•Arrogance. Behavior based on pride, self-importance, conceit, or the assumption of intellectual superiority and the presumption of knowledge that is not +supported by facts; and +•Complacency. Satisfaction with one’s accomplishments accompanied by a +lack of awareness of actual dangers or deficiencies. +Combating these challenges is a “constant struggle every day”. Many features +of the program are designed to control these challenges, particularly training and +education. + + +section 14.9. Continual Training and Education. +Continual training and education are a hallmark of SUBSAFE. The goals are to. +1.•Serve as a reminder of the consequences of complacency in one’s job. +2.•Emphasize the need to proactively correct and prevent problems. +3.•Stress the need to adhere to program fundamentals. +4.•Convey management support for the program. +Continual improvement and feedback to the SUBSAFE training programs +come not only from trouble reports and incidents but also from the level of knowledge assessments performed during the audits of organizations that perform +SUBSAFE work. +Annual training is required for all headquarters SUBSAFE workers, from the +apprentice craftsman to the admirals. A periodic refresher is also held at each of the +contractors’ facilities. At the meetings, a video about the loss of Thresher is shown +and an overview of the SUBSAFE program and their responsibilities is provided as +well as recent lessons learned and deficiency trends encountered over the previous +years. The need to avoid complacency and to proactively correct and prevent problems is reinforced. +Time is also taken at the annual meetings to remind everyone involved about the +history of the program. By guaranteeing that no one forgets what happened to USS +Thresher, the SUBSAFE program has helped to create a culture that is conducive +to strict adherence to policies and procedures.
Everyone is recommitted each year +to ensure that a tragedy like the one that occurred in 19 63 never happens again. +SUBSAFE is described by those in the program as “a requirement, an attitude, and +a responsibility.” + +section 14.10. Execution and Compliance over the Life of a Submarine. +The design, construction, and initial certification are only a small percentage of the +life of the certified ship. The success of the program during the vast majority of the +certified ship’s life depends on the knowledge, compliance, and audit by those operating and maintaining the submarines. Without the rigor of compliance and sustaining knowledge from the petty officers, ship’s officers, and fleet staff, all of the great +virtues of SUBSAFE would “come to naught”. The following anecdote by +Admiral Walt Cantrell provides an indication of how SUBSAFE principles permeate the entire nuclear Navy. +I remember vividly when I escorted the first group of NASA skeptics to a submarine and +they figured they would demonstrate that I had exaggerated the integrity of the program + +by picking a member of ship’s force at random and asked him about SUBSAFE. The +NASA folks were blown away. A second class machinist’s mate gave a cogent, complete, +correct description of the elements of the program and how important it was that all levels +in the Submarine Force comply. That part of the program is essential to its success. just +as much, if not more so, than all the other support staff effort. + +section 14.11. Lessons to Be Learned from SUBSAFE. +Those involved in SUBSAFE are very proud of their achievements and the fact that +even after nearly fifty years of no accidents, the program is still strong and vibrant. +On January 8, 20 05 , USS San Francisco, a twenty-six-year-old ship, crashed head-on +into an underwater mountain. While several crew members were injured and one +died, this incident is considered by SUBSAFE to be a success story. In spite of the +massive damage to her forward structure, there was no flooding, and the ship surfaced and returned to port under her own power. There was no breach of the pressure hull, the nuclear reactor remained on line, the emergency main ballast tank +blow system functioned as intended, and the control surfaces functioned properly. +Those in the SUBSAFE program attribute this success to the work discipline, material control, documentation, and compliance verification exercised during the design, +construction, and maintenance of USS San Francisco. +Can the SUBSAFE principles be transferred from the military to commercial +companies and industries? The answer lies in why the program has been so effective +and whether these factors can be maintained in other implementations of the principles more appropriate to non-military venues. Remember, of course, that private +contractors form the bulk of the companies and workers in the nuclear Navy, and +they seem to be able to satisfy the SUBSAFE program requirements. The primary +difference is in the basic goals of the organization itself. +Some factors that can be identified as contributing to the success of SUBSAFE, +most of which could be translated into a safety program in private industry, are. +1.•Leadership support and commitment to the program. +2.•Management .(NAVSEA). is not afraid to say “no” when faced with pressures +to compromise the SUBSAFE principles and requirements. Top management +also agrees to be audited for adherence to the principles of SUBSAFE and to +correct any deficiencies that are found.
+3.•Establishment of clear and written safety requirements. +4.•Education, not just training, with yearly reminders of the past, continual +improvement, and input from lessons learned, trouble reports, and assessments +during audits. +5.•Updating the SUBSAFE program requirements and the commitment to it +periodically. + + +6.•Separation of powers and assignment of responsibility. +7.•Emphasis on rigor, technical compliance, and work discipline. +8.•Documentation capturing what they do and why they do it. + +9.• The participatory audit philosophy and the requirement for objective quality +evidence. +10.• A program based on written procedures, not personality-driven. +11.•Continual feedback and improvement. When something does not conform to +SUBSAFE specifications, it must be reported to NAVSEA headquarters along +with the causal analysis .(including the systemic factors). of why it happened. +Everyone at every level of the organization is willing to examine his or her role +in the incident. +12.•Continual certification throughout the life of the ship; it is not a one-time event. +13.• Accountability accompanying responsibility. Personal integrity and personal +responsibility are stressed. The program is designed to foster everyone’s pride in +his or her work. +14.• A culture of shared responsibility for safety and the SUBSAFE requirements. +15.• +Special efforts to be vigilant against complacency and to fight it when it is +detected. + diff --git a/epilogue.raw b/epilogue.raw new file mode 100644 index 0000000..72575ef --- /dev/null +++ b/epilogue.raw @@ -0,0 +1,43 @@ +Epilogue. +In the simpler world of the past, classic safety engineering techniques that focus on +preventing failures and chains of failure events were adequate. They no longer +suffice for the types of systems we want to build, which are stretching the limits of +complexity human minds and our current tools can handle. Society is also expecting +more protection from those responsible for potentially dangerous systems. +Systems theory provides the foundation necessary to build the tools required +to stretch our human limits on dealing with complexity. STAMP translates basic +system theory ideas into the realm of safety and thus provides a foundation for +our future. +As demonstrated in the previous chapter, some industries have been very suc- +cessful in preventing accidents. The U.S. nuclear submarine program is not the only +one. Others seem to believe that accidents are the price of progress or of profits, +and they have been less successful. What seems to distinguish those experiencing +success is that they: +1.• Take a systems approach to safety in both development and operations +2.•Have instituted a learning culture where they have effective learning from +events +3.•Have established safety as a priority and understand that their long-term +success depends on it +This book suggests a new approach to engineering for safety that changes the focus +from “prevent failures” to “enforce behavioral safety constraints,” from reliability +to control. The approach is constructed on an extended model of accident causation +that includes more than the traditional models, adding those factors that are increas- +ingly causing accidents today. It allows us to deal with much more complex systems. +What is surprising is that the techniques and tools described in part III that are built +on STAMP and have been applied in practice on extremely complex systems have +been easier to use and much more effective than the old ones.
+ +Others will improve these first tools and techniques. What is critical is the overall +philosophy of safety as a function of control. This philosophy is not new: It stems +from the prescient engineers who created System Safety after World War II in the +military aviation and ballistic missile defense systems. What they lacked, and what +we have been hindered in our progress by not having, is a more powerful accident +causality model that matches today’s new technology and social drivers. STAMP +provides that. Upon this foundation and using systems theory, new more powerful +hazard analysis, design, specification, system engineering, accident/incident analysis, +operations, and management techniques can be developed to engineer a safer world. +Mueller in 1968 described System Safety as “organized common sense” [109]. I +hope that you have found that to be an accurate description of the contents of this +book. In closing I remind you of the admonition by Bertrand Russell: “A life without +adventure is likely to be unsatisfying, but a life in which adventure is allowed to +take any form it will is sure to be short” [179, p. 21]. \ No newline at end of file diff --git a/epilogue.txt b/epilogue.txt new file mode 100644 index 0000000..1c94d98 --- /dev/null +++ b/epilogue.txt @@ -0,0 +1,41 @@ +Epilogue. +In the simpler world of the past, classic safety engineering techniques that focus on +preventing failures and chains of failure events were adequate. They no longer +suffice for the types of systems we want to build, which are stretching the limits of +complexity human minds and our current tools can handle. Society is also expecting +more protection from those responsible for potentially dangerous systems. +Systems theory provides the foundation necessary to build the tools required +to stretch our human limits on dealing with complexity. STAMP translates basic +system theory ideas into the realm of safety and thus provides a foundation for +our future. +As demonstrated in the previous chapter, some industries have been very successful in preventing accidents. The U.S. nuclear submarine program is not the only +one. Others seem to believe that accidents are the price of progress or of profits, +and they have been less successful. What seems to distinguish those experiencing +success is that they. +1.• Take a systems approach to safety in both development and operations +2.•Have instituted a learning culture where they have effective learning from +events +3.•Have established safety as a priority and understand that their long-term +success depends on it +This book suggests a new approach to engineering for safety that changes the focus +from “prevent failures” to “enforce behavioral safety constraints,” from reliability +to control. The approach is constructed on an extended model of accident causation +that includes more than the traditional models, adding those factors that are increasingly causing accidents today. It allows us to deal with much more complex systems. +What is surprising is that the techniques and tools described in part 3 that are built +on STAMP and have been applied in practice on extremely complex systems have +been easier to use and much more effective than the old ones. + +Others will improve these first tools and techniques. What is critical is the overall +philosophy of safety as a function of control. This philosophy is not new. 
It stems +from the prescient engineers who created System Safety after World War 2 in the +military aviation and ballistic missile defense systems. What they lacked, and what +we have been hindered in our progress by not having, is a more powerful accident +causality model that matches today’s new technology and social drivers. STAMP +provides that. Upon this foundation and using systems theory, new more powerful +hazard analysis, design, specification, system engineering, accident/incident analysis, +operations, and management techniques can be developed to engineer a safer world. +Mueller in 19 68 described System Safety as “organized common sense” . I +hope that you have found that to be an accurate description of the contents of this +book. In closing I remind you of the admonition by Bertrand Russell. “A life without +adventure is likely to be unsatisfying, but a life in which adventure is allowed to +take any form it will is sure to be short” . \ No newline at end of file diff --git a/replacements b/replacements index 6a22b12..4eb8e76 100644 --- a/replacements +++ b/replacements @@ -60,6 +60,8 @@ ROE R O E SD S D SITREP SIT Rep STPA S T P A +SpecTRM-RL Spec T R M R L +SpecTRM Spec T R M TACSAT Tack sat TAOR T A O R TAOR T A O R @@ -67,4 +69,10 @@ TCAS T Cass TMI T M I TTPS T T P S USCINCEUR U S C in E U R -WD W D \ No newline at end of file +WD W D +ZTHR Z T H R +INPO In Poh +LERs Leers +FARs Farzz +SUBSAFE Sub Safe +NAVSEA Nav Sea \ No newline at end of file
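The hunks above extend the replacements pronunciation table (SpecTRM-RL, SpecTRM, ZTHR, INPO, LERs, FARs, SUBSAFE, NAVSEA) so the synthesized voice spells or sounds out these terms instead of reading them as single words. The script that actually consumes this table is not part of this diff, so what follows is only a minimal sketch of how it might be applied, assuming each line holds a term followed by its spoken form after the first run of whitespace, that substitution is whole-word, and that longer terms such as SpecTRM-RL must be rewritten before their prefixes such as SpecTRM; the script name and invocation are hypothetical.

#!/usr/bin/env python3
# Hypothetical helper (not part of this diff): rewrite chapter text using the
# `replacements` table before it is handed to the text-to-speech voice.
# Assumed table format: one entry per line, term first, spoken form after the
# first run of whitespace, e.g. "NAVSEA Nav Sea".
import re
import sys


def load_replacements(path="replacements"):
    table = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            parts = raw.strip().split(None, 1)
            if len(parts) == 2:
                table[parts[0]] = parts[1]
    return table


def apply_replacements(text, table):
    # Substitute longer terms first so SpecTRM-RL wins over SpecTRM.
    for term in sorted(table, key=len, reverse=True):
        text = re.sub(r"\b{}\b".format(re.escape(term)), table[term], text)
    return text


if __name__ == "__main__":
    sys.stdout.write(apply_replacements(sys.stdin.read(), load_replacements()))

An illustrative invocation would be python3 apply_replacements.py < chapter10.txt, with the output piped on to the speech-synthesis step; the file names and pipeline wiring here are assumptions, not part of this change.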