From ff069b52c4bc69caca22d363791388a2ea3e1b13 Mon Sep 17 00:00:00 2001 From: xuu Date: Sat, 15 Mar 2025 19:07:36 -0600 Subject: [PATCH] chore: add more chapters and replacements --- .gitignore | 1 + Makefile | 8 +- chapter02.txt | 38 +- chapter03.txt | 387 ++++++++++++++ chapter04.txt | 890 ++++++++++++++++++++++++++++++ chapter05.raw | 1425 +++++++++++++++++++++++++++++++++++++++++++++++++ replacements | 47 +- 7 files changed, 2770 insertions(+), 26 deletions(-) create mode 100644 chapter03.txt create mode 100644 chapter04.txt create mode 100644 chapter05.raw diff --git a/.gitignore b/.gitignore index acded50..d665f98 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,5 @@ piper/ *.wav *.ogg +*.mp3 *.onnx* diff --git a/Makefile b/Makefile index c58663a..41da950 100644 --- a/Makefile +++ b/Makefile @@ -2,20 +2,20 @@ PATH:=./piper:$(PATH) WAV_FILES := $(patsubst %.txt,%.wav,$(wildcard *.txt)) -OGG_FILES := $(patsubst %.txt,%.ogg,$(wildcard *.txt)) +MP3_FILES := $(patsubst %.txt,%.mp3,$(wildcard *.txt)) MODEL=en_GB-alan-medium.onnx CONFIG=en_GB-alan-medium.onnx.json -complete: $(OGG_FILES) +complete: $(MP3_FILES) echo $@ $^ $(WAV_FILES): %.wav: %.txt cat $^ | piper -m $(MODEL) -c $(CONFIG) -f $@ -$(OGG_FILES): %.ogg: %.wav - ffmpeg -i $^ $@ +$(MP3_FILES): %.mp3: %.wav + ffmpeg -y -i $^ $@ install: diff --git a/chapter02.txt b/chapter02.txt index c77eddb..bd9d5db 100644 --- a/chapter02.txt +++ b/chapter02.txt @@ -311,14 +311,14 @@ started to slow down as the most obvious hazards were eliminated. The emphasis then shifted to unsafe acts. Accidents began to be regarded as someone’s fault rather than as an event that could have been prevented by some change in the plant or product. -Heinrich’s Domino Model, published in 1931, was one of the first published +Heinrich’s Domino Model, published in 19 31, was one of the first published general accident models and was very influential in shifting the emphasis in safety to human error. Heinrich compared the general sequence of accidents to five domi noes standing on end in a line (figure 2 3). When the first domino falls, it automati cally knocks down its neighbor and so on until the injury occurs. In any accident sequence, according to this model, ancestry or social environment leads to a fault of a person, which is the proximate reason for an unsafe act or condition (mechani -cal or physical), which results in an accident, which leads to an injury. In 1976, Bird +cal or physical), which results in an accident, which leads to an injury. In 19 76, Bird and Loftus extended the basic Domino Model to include management decisions as a factor in accidents. 1. Lack of control by management, permitting. @@ -439,7 +439,7 @@ able as the identified cause. Other events or explanations may be excluded or no examined in depth because they raise issues that are embarrassing to the organiza tion or its contractors or are politically unacceptable. The accident report on a friendly fire shootdown of a U.S. Army helicopter over -the Iraqi nofly zone in 1994, for example, describes the chain of events leading to +the Iraqi nofly zone in 19 94, for example, describes the chain of events leading to the shootdown. Included in these events is the fact that the helicopter pilots did not change to the radio frequency required in the nofly zone when they entered it (they stayed on the enroute frequency). Stopping at this event in the chain (which the @@ -459,14 +459,14 @@ more basis for this distinction than the selection of a root cause. 
Making such distinctions between causes or limiting the factors considered can be a hindrance in learning from and preventing future accidents. Consider the following aircraft examples. -In the crash of an American Airlines D C 10 at Chicago’s O’Hare Airport in 1979, +In the crash of an American Airlines D C 10 at Chicago’s O’Hare Airport in 19 79, the U.S. National Transportation Safety Board (N T S B) blamed only a “mainte nanceinduced crack,” and not also a design error that allowed the slats to retract if the wing was punctured. Because of this omission, McDonnell Douglas was not required to change the design, leading to future accidents related to the same design flaw. Similar omissions of causal factors in aircraft accidents have occurred more -recently. One example is the crash of a China Airlines A300 on April 26, 1994, while +recently. One example is the crash of a China Airlines A300 on April 26, 19 94, while approaching the Nagoya, Japan, airport. One of the factors involved in the accident was the design of the flight control computer software. Previous incidents with the same type of aircraft had led to a Service Bulletin being issued for a modification @@ -480,7 +480,7 @@ that delay, 264 passengers and crew died. In another D C 10 saga, explosive decompression played a critical role in a near miss over Windsor, Ontario. An American Airlines D C 10 lost part of its passenger floor, and thus all of the control cables that ran through it, when a cargo door opened -in flight in June 1972. Thanks to the extraordinary skill and poise of the pilot, Bryce +in flight in June 19 72. Thanks to the extraordinary skill and poise of the pilot, Bryce McCormick, the plane landed safely. In a remarkable coincidence, McCormick had trained himself to fly the plane using only the engines because he had been con cerned about a decompressioncaused collapse of the floor. After this close call, @@ -499,14 +499,14 @@ more basis for this distinction than the selection of a root cause. Making such distinctions between causes or limiting the factors considered can be a hindrance in learning from and preventing future accidents. Consider the following aircraft examples. -In the crash of an American Airlines D C 10 at Chicago’s O’Hare Airport in 1979, +In the crash of an American Airlines D C 10 at Chicago’s O’Hare Airport in 19 79, the U.S. National Transportation Safety Board (N T S B) blamed only a “mainte nanceinduced crack,” and not also a design error that allowed the slats to retract if the wing was punctured. Because of this omission, McDonnell Douglas was not required to change the design, leading to future accidents related to the same design flaw . Similar omissions of causal factors in aircraft accidents have occurred more -recently. One example is the crash of a China Airlines A300 on April 26, 1994, while +recently. One example is the crash of a China Airlines A300 on April 26, 19 94, while approaching the Nagoya, Japan, airport. One of the factors involved in the accident was the design of the flight control computer software. Previous incidents with the same type of aircraft had led to a Service Bulletin being issued for a modification @@ -520,7 +520,7 @@ that delay, 264 passengers and crew died. In another D C 10 saga, explosive decompression played a critical role in a near miss over Windsor, Ontario. An American Airlines D C 10 lost part of its passenger floor, and thus all of the control cables that ran through it, when a cargo door opened -in flight in June 1972. 
Thanks to the extraordinary skill and poise of the pilot, Bryce +in flight in June 19 72. Thanks to the extraordinary skill and poise of the pilot, Bryce McCorMICk, the plane landed safely. In a remarkable coincidence, McCorMICk had trained himself to fly the plane using only the engines because he had been con cerned about a decompressioncaused collapse of the floor. After this close call, @@ -545,14 +545,14 @@ exceptional case when every life was saved through a combination of crew skill a sheer luck that the plane was so lightly loaded. If there had been more passengers and thus more weight, damage to the control cables would undoubtedly have been more severe, and it is highly questionable if any amount of skill could have saved the plane . -Almost two years later, in March 1974, a fully loaded Turkish Airlines D C 10 crashed +Almost two years later, in March 19 74, a fully loaded Turkish Airlines D C 10 crashed near Paris, resulting in 346 deaths.one of the worst accidents in aviation history. Once again, the cargo door had opened in flight, causing the cabin floor to collapse, severing the flight control cables. Immediately after the accident, Sanford McDon nell stated the official McDonnellDouglas position that once again placed the blame on the baggage handler and the ground crew. This time, however, the FAA finally ordered modifications to all D C 10s that eliminated the hazard. In addition, -an FAA regulation issued in July 1975 required all widebodied jets to be able to +an FAA regulation issued in July 19 75 required all widebodied jets to be able to tolerate a hole in the fuselage of twenty square feet. By labeling the root cause in the event chain as baggage handler error and attempting only to eliminate that event or link in the chain rather than the basic engineering design flaws, fixes that could @@ -575,7 +575,7 @@ different types of links according to the mental representations the analyst has the production of this event. When several types of rules are possible, the analyst will apply those that agree with his or her mental model of the situation . Consider, for example, the loss of an American Airlines B757 near Cali, -Colombia, in 1995 . Two significant events in this loss were +Colombia, in 19 95 . Two significant events in this loss were (1.) Pilot asks for clearance to take the R O Z O. approach followed later by (2.) Pilot types R into the F M S. 5. @@ -630,7 +630,7 @@ often laid years before. One event simply triggers the loss, but if that event h happened, another one would have led to a loss. The Bhopal disaster provides a good example. The release of methyl isocyanate. (M I C.) from the Union Carbide chemical plant -in Bhopal, India, in December 1984 has been called the worst industrial accident +in Bhopal, India, in December 19 84 has been called the worst industrial accident in history. Conservative estimates point to 2,000 fatalities, 10,000 permanent dis abilities (including blindness), and 200,000 injuries . The Indian government blamed the accident on human error.the improper cleaning of a pipe at the plant. @@ -733,7 +733,7 @@ their face and closing their eyes. If the community had been alerted and provide with this simple information, many (if not most) lives would have been saved and injuries prevented . Some of the reasons why the poor conditions in the plant were allowed to persist -are financial. Demand for M I C had dropped sharply after 1981, leading to reduc +are financial. 
Demand for M I C had dropped sharply after 19 81, leading to reduc tions in production and pressure on the company to cut costs. The plant was operat ing at less than half capacity when the accident occurred. Union Carbide put pressure on the Indian management to reduce losses, but gave no specific details on how @@ -776,7 +776,7 @@ time and without any particular single decision to do so but simply as a series decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident. Given the overall state of the Bhopal Union Carbide plant and its operation, if the action of inserting the slip disk had not been left out -of the pipe washing operation that December day in 1984, something else would +of the pipe washing operation that December day in 19 84, something else would have triggered an accident. In fact, a similar leak had occurred the year before, but did not have the same catastrophic consequences and the true root causes of that incident were neither identified nor fixed. @@ -822,7 +822,7 @@ Without understanding the purpose, goals, and decision criteria used to construc and operate systems, it is not possible to completely understand and most effectively prevent accidents. Awareness of the importance of social and organizational aspects of safety goes -back to the early days of System Safety.7 In 1968, Jerome Lederer, then the director +back to the early days of System Safety.7 In 19 68, Jerome Lederer, then the director of the NASA Manned Flight Safety Program for Apollo, wrote. System safety covers the total spectrum of risk management. It goes beyond the hardware and associated procedures of system safety engineering. It involves. attitudes and motiva @@ -876,7 +876,7 @@ be evaluated? Was a maintenance plan provided before startup? Was all relevant information provided to planners and managers? Was it used? Was concern for safety displayed by vigorous, visible personal action by top executives? And so forth. Johnson originally provided hundreds of such questions, and additions have been -made to his checklist since Johnson created it in the 1970s so it is now even larger. +made to his checklist since Johnson created it in the 19 70s so it is now even larger. The use of the MORT checklist is feasible because the items are so general, but that same generality also limits its usefulness. Something more effective than checklists is needed. @@ -1090,9 +1090,9 @@ rate has dropped by 35 per cent. sectio 2 4 1. Do Operators Cause Most Accidents? The tendency to blame the operator is not simply a nineteenth century problem, but persists today. During and after World War 2, the Air Force had serious prob -lems with aircraft accidents. From 1952 to 1966, for example, 7,715 aircraft were lost +lems with aircraft accidents. From 19 52 to 19 66, for example, 7,715 aircraft were lost and 8,547 people killed .. Most of these accidents were blamed on pilots. Some -aerospace engineers in the 1950s did not believe the cause was so simple and +aerospace engineers in the 19 50s did not believe the cause was so simple and argued that safety must be designed and built into aircraft just as are performance, stability, and structural integrity. Although a few seminars were conducted and papers written about this approach, the Air Force did not take it seriously until diff --git a/chapter03.txt b/chapter03.txt new file mode 100644 index 0000000..396cbf3 --- /dev/null +++ b/chapter03.txt @@ -0,0 +1,387 @@ +chapter 3. 

Systems Theory and Its Relationship to Safety.
To achieve the goals set at the end of the last chapter, a new theoretical underpinning is needed for system safety. Systems theory provides that foundation. This
chapter introduces some basic concepts in systems theory, how this theory is reflected
in system engineering, and how all of this relates to system safety.
section 3 1.
An Introduction to Systems Theory.
Systems theory dates from the 19 30s and 19 40s and was a response to limitations of
the classic analysis techniques in coping with the increasingly complex systems starting to be built at that time . Norbert Wiener applied the approach to control
and communications engineering , while Ludwig von Bertalanffy developed
similar ideas for biology . Bertalanffy suggested that the emerging ideas in
various fields could be combined into a general theory of systems.
In the traditional scientific method, sometimes referred to as divide and conquer,
systems are broken into distinct parts so that the parts can be examined separately.
Physical aspects of systems are decomposed into separate physical components,
while behavior is decomposed into discrete events over time.
This decomposition .(formally called analytic reduction).assumes that the separation
is feasible. that is, each component or subsystem operates independently, and analysis results are not distorted when these components are considered separately. This
assumption in turn implies that the components or events are not subject to feedback loops and other nonlinear interactions and that the behavior of the components is the same when examined singly as when they are playing their part in the
whole. A third fundamental assumption is that the principles governing the assembling of the components into the whole are straightforward, that is, the interactions

among the subsystems are simple enough that they can be considered separate from
the behavior of the subsystems themselves.
These are reasonable assumptions, it turns out, for many of the physical
regularities of the universe. System theorists have described these systems as
displaying organized simplicity .(figure 3 1.).. Such systems can be separated
into non-interacting subsystems for analysis purposes. the precise nature of the
component interactions is known and interactions can be examined pairwise. Analytic reduction has been highly effective in physics and is embodied in structural
mechanics.
Other types of systems display what systems theorists have labeled unorganized
complexity.that is, they lack the underlying structure that allows reductionism to
be effective. They can, however, often be treated as aggregates. They are complex,
but regular and random enough in their behavior that they can be studied statistically. This study is simplified by treating them as a structureless mass with interchangeable parts and then describing them in terms of averages. The basis of this
approach is the law of large numbers. The larger the population, the more likely that
observed values are close to the predicted average values. In physics, this approach
is embodied in statistical mechanics.

A third type of system lies between these two extremes, displaying what systems
theorists have called organized complexity. These systems are too complex for complete analysis and too organized for statistics;
the averages are deranged by the underlying structure .
Many of the complex
engineered systems of the post–World War 2 era, as well as biological systems and
social systems, fit into this category. Organized complexity also represents particularly well the problems that are faced by those attempting to build complex software,
and it explains the difficulty computer scientists have had in attempting to apply
analysis and statistics to software.
Systems theory was developed for this third type of system. The systems approach
focuses on systems taken as a whole, not on the parts taken separately. It assumes
that some properties of systems can be treated adequately only in their entirety,
taking into account all facets relating the social to the technical aspects . These
system properties derive from the relationships between the parts of systems. how
the parts interact and fit together . Concentrating on the analysis and design of
the whole as distinct from the components or parts provides a means for studying
systems exhibiting organized complexity.
The foundation of systems theory rests on two pairs of ideas. .(1).emergence and
hierarchy and .(2).communication and control .

section 3 2. Emergence and Hierarchy.
A general model of complex systems can be expressed in terms of a hierarchy of
levels of organization, each more complex than the one below, where a level is characterized by having emergent properties. Emergent properties do not exist at lower
levels; they are meaningless in the language appropriate to those levels. The shape of
an apple, although eventually explainable in terms of the cells of the apple, has no
meaning at that lower level of description. The operation of the processes at the
lower levels of the hierarchy results in a higher level of complexity.that of the whole
apple itself.that has emergent properties, one of them being the apple’s shape .
The concept of emergence is the idea that at a given level of complexity, some properties characteristic of that level .(emergent at that level).are irreducible.
Hierarchy theory deals with the fundamental differences between one level of
complexity and another. Its ultimate aim is to explain the relationships between
different levels. what generates the levels, what separates them, and what links
them. Emergent properties associated with a set of components at one level in a
hierarchy are related to constraints upon the degree of freedom of those components.
Describing the emergent properties resulting from the imposition of constraints
requires a language at a higher level .(a metalevel).different than that describing the
components themselves. Thus, different languages of description are appropriate at
different levels.

Reliability is a component property.1 Conclusions can be reached about the
reliability of a valve in isolation, where reliability is defined as the probability that
the behavior of the valve will satisfy its specification over time and under given
conditions.
Safety, on the other hand, is clearly an emergent property of systems. Safety can
be determined only in the context of the whole. Determining whether a plant is
acceptably safe is not possible, for example, by examining a single valve in the plant.
In fact, statements about the “safety of the valve” without information about the
context in which that valve is used are meaningless. Safety is determined by the
relationship between the valve and the other plant components.
As another example, +pilot procedures to execute a landing might be safe in one aircraft or in one set of +circumstances but unsafe in another. +Although they are often confused, reliability and safety are different properties. +The pilots may reliably execute the landing procedures on a plane or at an airport +in which those procedures are unsafe. A gun when discharged out on a desert with +no other humans or animals for hundreds of miles may be both safe and reliable. +When discharged in a crowded mall, the reliability will not have changed, but the +safety most assuredly has. +Because safety is an emergent property, it is not possible to take a single system +component, like a software module or a single human action, in isolation and assess +its safety. A component that is perfectly safe in one system or in one environment +may not be when used in another. +The new model of accidents introduced in part 2 of this book incorporates the +basic systems theory idea of hierarchical levels, where constraints or lack of constraints at the higher levels control or allow lower-level behavior. Safety is treated +as an emergent property at each of these levels. Safety depends on the enforcement +of constraints on the behavior of the components in the system, including constraints +on their potential interactions. Safety in the batch chemical reactor in the previous +chapter, for example, depends on the enforcement of a constraint on the relationship +between the state of the catalyst valve and the water valve. + +footnote. 1. This statement is somewhat of an oversimplification, because the reliability of a system component +can, under some conditions .(e.g., magnetic interference or excessive heat).be impacted by its environment. The basic reliability of the component, however, can be defined and measured in isolation, whereas +the safety of an individual component is undefined except in a specific environment. + + +section 3 3. +Communication and Control. +The second major pair of ideas in systems theory is communication and control. An +example of regulatory or control action is the imposition of constraints upon the +activity at one level of a hierarchy, which define the “laws of behavior” at that level. +Those laws of behavior yield activity meaningful at a higher level. Hierarchies are +characterized by control processes operating at the interfaces between levels . +The link between control mechanisms studied in natural systems and those engineered in man-made systems was provided by a part of systems theory known as +cybernetics. Checkland writes. +Control is always associated with the imposition of constraints, and an account of a control +process necessarily requires our taking into account at least two hierarchical levels. At a +given level, it is often possible to describe the level by writing dynamical equations, on the +assumption that one particle is representative of the collection and that the forces at other +levels do not interfere. But any description of a control process entails an upper level +imposing constraints upon the lower. The upper level is a source of an alternative .(simpler) +description of the lower level in terms of specific functions that are emergent as a result +of the imposition of constraints . +Note Checkland’s statement about control always being associated with the +imposition of constraints. Imposing safety constraints plays a fundamental role in +the approach to safety presented in this book. 
The limited focus on avoiding failures, +which is common in safety engineering today, is replaced by the larger concept of +imposing constraints on system behavior to avoid unsafe events or conditions, that +is, hazards. +Control in open systems .(those that have inputs and outputs from their environment).implies the need for communication. Bertalanffy distinguished between +closed systems, in which unchanging components settle into a state of equilibrium, +and open systems, which can be thrown out of equilibrium by exchanges with their +environment. +In control theory, open systems are viewed as interrelated components that are +kept in a state of dynamic equilibrium by feedback loops of information and control. +The plant’s overall performance has to be controlled in order to produce the desired +product while satisfying cost, safety, and general quality constraints. +In order to control a process, four conditions are required . +•Goal Condition. The controller must have a goal or goals .(for example, to +maintain the setpoint). +•Action Condition. The controller must be able to affect the state of the system. +In engineering, control actions are implemented by actuators. +•Model Condition. The controller must be .(or contain).a model of the system +(see section 4.3). +•Observability Condition. The controller must be able to ascertain the state of +the system. In engineering terminology, observation of the state of the system +is provided by sensors. + + +Figure 3 2. shows a typical control loop. The plant controller obtains information +about .(observes).the process state from measured variables .(feedback).and uses this +information to initiate action by manipulating controlled variables to keep the +process operating within predefined limits or set points .(the goal).despite disturbances to the process. In general, the maintenance of any open-system hierarchy +(either biological or man-made).will require a set of processes in which there is +communication of information for regulation or control . +Control actions will generally lag in their effects on the process because of delays +in signal propagation around the control loop. an actuator may not respond immediately to an external command signal .(called dead time); the process may have +delays in responding to manipulated variables .(time constants); and the sensors +may obtain values only at certain sampling intervals .(feedback delays). Time lags +restrict the speed and extent with which the effects of disturbances, both within the +process itself and externally derived, can be reduced. They also impose extra requirements on the controller, for example, the need to infer delays that are not directly +observable. +The model condition plays an important role in accidents and safety. In order to +create effective control actions, the controller must know the current state of the +controlled process and be able to estimate the effect of various control actions on +that state. As discussed further in section 4.3, many accidents have been caused by +the controller incorrectly assuming the controlled system was in a particular state +and imposing a control action .(or not providing one).that led to a loss. the Mars +Polar Lander descent engine controller, for example, assumed that the spacecraft + + +was on the surface of the planet and shut down the descent engines. The captain +of the Herald of Free Enterprise thought the car deck doors were shut and left +the mooring. + + +section 3 4. +Using Systems Theory to Understand Accidents. 
+Safety approaches based on systems theory consider accidents as arising from the +interactions among system components and usually do not specify single causal +variables or factors . Whereas industrial .(occupational).safety models and +event chain models focus on unsafe acts or conditions, classic system safety models +instead look at what went wrong with the system’s operation or organization to +allow the accident to take place. +This systems approach treats safety as an emergent property that arises when +the system components interact within an environment. Emergent properties like +safety are controlled or enforced by a set of constraints .(control laws).related to +the behavior of the system components. For example, the spacecraft descent engines +must remain on until the spacecraft reaches the surface of the planet and the car +deck doors on the ferry must be closed before leaving port. Accidents result from +interactions among components that violate these constraints.in other words, +from a lack of appropriate constraints on the interactions. Component interaction +accidents, as well as component failure accidents, can be explained using these +concepts. +Safety then can be viewed as a control problem. Accidents occur when component failures, external disturbances, and/or dysfunctional interactions among system +components are not adequately controlled. In the space shuttle Challenger loss, the +O-rings did not adequately control propellant gas release by sealing a tiny gap in +the field joint. In the Mars Polar Lander loss, the software did not adequately control +the descent speed of the spacecraft.it misinterpreted noise from a Hall effect +sensor .(feedback of a measured variable).as an indication the spacecraft had reached +the surface of the planet. Accidents such as these, involving engineering design +errors, may in turn stem from inadequate control over the development process. A +Milstar satellite was lost when a typo in the software load tape was not detected +during the development and testing. Control is also imposed by the management +functions in an organization.the Challenger and Columbia losses, for example, +involved inadequate controls in the launch-decision process. +While events reflect the effects of dysfunctional interactions and inadequate +enforcement of safety constraints, the inadequate control itself is only indirectly +reflected by the events.the events are the result of the inadequate control. The +control structure itself must be examined to determine why it was inadequate to +maintain the constraints on safe behavior and why the events occurred. + +As an example, the unsafe behavior .(hazard).in the Challenger loss was the +release of hot propellant gases from the field joint. The miscreant O-ring was used +to control the hazard.that is, its role was to seal a tiny gap in the field joint created +by pressure at ignition. The loss occurred because the system design, including the +O-ring, did not effectively impose the required constraint on the propellant gas +release. Starting from here, there are then several questions that need to be answered +to understand why the accident occurred and to obtain the information necessary +to prevent future accidents. Why was this particular design unsuccessful in imposing +the constraint, why was it chosen .(what was the decision process), why was the +flaw not found during development, and was there a different design that might +have been more successful? 
These questions and others consider the original +design process. +Understanding the accident also requires examining the contribution of the +operations process. Why were management decisions made to launch despite warnings that it might not be safe to do so? One constraint that was violated during +operations was the requirement to correctly handle feedback about any potential +violation of the safety design constraints, in this case, feedback during operations +that the control by the O-rings of the release of hot propellant gases from the field +joints was not being adequately enforced by the design. There were several instances +of feedback that was not adequately handled, such as data about O-ring blowby and +erosion during previous shuttle launches and feedback by engineers who were concerned about the behavior of the O-rings in cold weather. Although the lack of +redundancy provided by the second O-ring was known long before the loss of Challenger, that information was never incorporated into the NASA Marshall Space +Flight Center database and was unknown by those making the launch decision. +In addition, there was missing feedback about changes in the design and testing +procedures during operations, such as the use of a new type of putty and the introduction of new O-ring leak checks without adequate verification that they satisfied +system safety constraints on the field joints. As a final example, the control processes +that ensured unresolved safety concerns were fully considered before each flight, +that is, the flight readiness reviews and other feedback channels to project management making flight decisions, were flawed. +Systems theory provides a much better foundation for safety engineering than +the classic analytic reduction approach underlying event-based models of accidents. +It provides a way forward to much more powerful and effective safety and risk +analysis and management procedures that handle the inadequacies and needed +extensions to current practice described in chapter 2. +Combining a systems-theoretic approach to safety with system engineering +processes will allow designing safety into the system as it is being developed or +reengineered. System engineering provides an appropriate vehicle for this process + +because it rests on the same systems theory foundation and involves engineering +the system as a whole. +section 3 5. +Systems Engineering and Safety. +The emerging theory of systems, along with many of the historical forces noted in +chapter 1, gave rise after World War 2 to a new emphasis in engineering, eventually +called systems engineering. During and after the war, technology expanded rapidly +and engineers were faced with designing and building more complex systems than +had been attempted previously. Much of the impetus for the creation of this new +discipline came from military programs in the 19 50s and 19 60s, particularly intercontinental ballistic missile .(ICBM).systems. Apollo was the first nonmilitary government program in which systems engineering was recognized from the beginning +as an essential function . +System Safety, as defined in MIL-STD-882, is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex +new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and +near misses. 
When these early aerospace accidents were investigated, the causes of
a large percentage of them were traced to deficiencies in design, operations, and
management. Clearly, big changes were needed. System engineering, along with its
subdiscipline, System Safety, was developed to tackle these problems.
Systems theory provides the theoretical foundation for systems engineering,
which views each system as an integrated whole even though it is composed of
diverse, specialized components. The objective is to integrate the subsystems into
the most effective system possible to achieve the overall objectives, given a prioritized set of design criteria. Optimizing the system design often requires making
tradeoffs between these design criteria .(goals).
The development of systems engineering as a discipline enabled the solution of
enormously more complex and difficult technological problems than previously
. Many of the elements of systems engineering can be viewed merely as good
engineering. It represents more a shift in emphasis than a change in content. In
addition, while much of engineering is based on technology and science, systems
engineering is equally concerned with overall management of the engineering
process.
A systems engineering approach to safety starts with the basic assumption that
some properties of systems, in this case safety, can only be treated adequately in the
context of the social and technical system as a whole. A basic assumption of systems
engineering is that optimization of individual components or subsystems will not in

general lead to a system optimum; in fact, improvement of a particular subsystem
may actually worsen the overall system performance because of complex, nonlinear
interactions among the components. When each aircraft tries to optimize its path
from its departure point to its destination, for example, the overall air transportation
system throughput may not be optimized when they all arrive at a popular hub at
the same time. One goal of the air traffic control system is to optimize the overall
air transportation system throughput while, at the same time, trying to allow as much
flexibility for the individual aircraft and airlines to achieve their goals. In the end,
if system engineering is successful, everyone gains. Similarly, each pharmaceutical
company acting to optimize its profits, which is a legitimate and reasonable company
goal, will not necessarily optimize the larger societal system goal of producing safe
and effective pharmaceutical and biological products to enhance public health.
These system engineering principles are applicable even to systems beyond those
traditionally thought of as in the engineering realm. The financial system and its
meltdown starting in 2007 is an example of a social system that could benefit from
system engineering concepts.
Another assumption of system engineering is that individual component behavior .(including events or actions).cannot be understood without considering the
components’ role and interaction within the system as a whole. This basis for systems
engineering has been stated as the principle that a system is more than the sum of
its parts. Attempts to improve long-term safety in complex systems by analyzing and
changing individual components have often proven to be unsuccessful over the long
term.
For example, Rasmussen notes that over many years of working in the field +of nuclear power plant safety, he found that attempts to improve safety from models +of local features were compensated for by people adapting to the change in an +unpredicted way . +Approaches used to enhance safety in complex systems must take these basic +systems engineering principles into account. Otherwise, our safety engineering +approaches will be limited in the types of accidents and systems they can handle. +At the same time, approaches that include them, such as those described in this +book, have the potential to greatly improve our ability to engineer safer and more +complex systems. +section 3 6. +Building Safety into the System Design. +System Safety, as practiced by the U.S. defense and aerospace communities as well +as the new approach outlined in this book, fit naturally within the general systems +engineering process and the problem-solving approach that a system view provides. +This problem-solving process entails several steps. First, a need or problem is specified in terms of objectives that the system must satisfy along with criteria that can + +be used to rank alternative designs. For a system that has potential hazards, the +objectives will include safety objectives and criteria along with high-level requirements and safety design constraints. The hazards for an automated train system, for +example, might include the train doors closing while a passenger is in the doorway. +The safety-related design constraint might be that obstructions in the path of a +closing door must be detected and the door closing motion reversed. +After the high-level requirements and constraints on the system design are identified, a process of system synthesis takes place that results in a set of alternative +designs. Each of these alternatives is analyzed and evaluated in terms of the stated +objectives and design criteria, and one alternative is selected to be implemented. In +practice, the process is highly iterative. The results from later stages are fed back to +early stages to modify objectives, criteria, design alternatives, and so on. Of course, +the process described here is highly simplified and idealized. +The following are some examples of basic systems engineering activities and the +role of safety within them. +•Needs analysis. The starting point of any system design project is a perceived +need. This need must first be established with enough confidence to justify the +commitment of resources to satisfy it and understood well enough to allow +appropriate solutions to be generated. Criteria must be established to provide +a means to evaluate both the evolving and final system. If there are hazards +associated with the operation of the system, safety should be included in the +needs analysis. +•Feasibility studies. The goal of this step in the design process is to generate a +set of realistic designs. This goal is accomplished by identifying the principal +constraints and design criteria.including safety constraints and safety design +criteria.for the specific problem being addressed and then generating plausible solutions to the problem that satisfy the requirements and constraints and +are physically and economically feasible. +•Trade studies. In trade studies, the alternative feasible designs are evaluated +with respect to the identified design criteria. A hazard might be controlled by +any one of several safeguards. 
A trade study would determine the relative +desirability of each safeguard with respect to effectiveness, cost, weight, size, +safety, and any other relevant criteria. For example, substitution of one material +for another may reduce the risk of fire or explosion, but may also reduce reliability or efficiency. Each alternative design may have its own set of safety +constraints .(derived from the system hazards).as well as other performance +goals and constraints that need to be assessed. Although decisions ideally should +be based upon mathematical analysis, quantification of many of the key factors +is often difficult, if not impossible, and subjective judgment often has to be used. + + +•System architecture development and analysis. In this step, the system engineers break down the system into a set of subsystems, together with the functions and constraints, including safety constraints, imposed upon the individual +subsystem designs, the major system interfaces, and the subsystem interface +topology. These aspects are analyzed with respect to desired system performance characteristics and constraints .(again including safety constraints).and +the process is iterated until an acceptable system design results. The preliminary +design at the end of this process must be described in sufficient detail that +subsystem implementation can proceed independently. +•Interface analysis. The interfaces define the functional boundaries of the +system components. From a management standpoint, interfaces must .(1).optimize visibility and control and .(2).isolate components that can be implemented +independently and for which authority and responsibility can be delegated +. From an engineering standpoint, interfaces must be designed to separate +independent functions and to facilitate the integration, testing, and operation +of the overall system. One important factor in designing the interfaces is safety, +and safety analysis should be a part of the system interface analysis. Because +interfaces tend to be particularly susceptible to design error and are implicated +in the majority of accidents, a paramount goal of interface design is simplicity. +Simplicity aids in ensuring that the interface can be adequately designed, analyzed, and tested prior to integration and that interface responsibilities can be +clearly understood. +Any specific realization of this general systems engineering process depends on +the engineering models used for the system components and the desired system +qualities. For safety, the models commonly used to understand why and how accidents occur have been based on events, particularly failure events, and the use of +reliability engineering techniques to prevent them. Part 2 of this book further +details the alternative systems approach to safety introduced in this chapter, while +part 3 provides techniques to perform many of these safety and system engineering +activities. \ No newline at end of file diff --git a/chapter04.txt b/chapter04.txt new file mode 100644 index 0000000..9b45869 --- /dev/null +++ b/chapter04.txt @@ -0,0 +1,890 @@ +PART 2. + + +STAMP. AN ACCIDENT MODEL BASED ON +SYSTEMS THEORY. +Part 2 introduces an expanded accident causality model based on the new assumptions in chapter 2 and satisfying the goals stemming from them. The theoretical +foundation for the new model is systems theory, as introduced in chapter 3. 
Using
this new causality model, called STAMP .(Systems-Theoretic Accident Model and
Processes), changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but
our conception of causality is extended to include component interaction accidents.
Safety is reformulated as a control problem rather than a reliability problem. This
change leads to much more powerful and effective ways to engineer safer systems,
including the complex sociotechnical systems of most concern today.
The three main concepts in this model.safety constraints, hierarchical control
structures, and process models.are introduced first in chapter 4. Then the STAMP
causality model is described, along with a classification of accident causes implied
by the new model.
To provide additional understanding of STAMP, it is used to describe the causes
of several very different types of losses.a friendly fire shootdown of a U.S. Army
helicopter by a U.S. Air Force fighter jet over northern Iraq, the contamination of
a public water system with E. coli bacteria in a small town in Canada, and the loss
of a Milstar satellite. Chapter 5 presents the friendly fire accident analysis. The other
accident analyses are contained in appendixes B and C.


chapter 4.
A Systems-Theoretic View of Causality.

In the traditional causality models, accidents are considered to be caused by chains
of failure events, each failure directly causing the next one in the chain. Part 1
explained why these simple models are no longer adequate for the more complex
sociotechnical systems we are attempting to build today. The definition of accident
causation needs to be expanded beyond failure events so that it includes component
interaction accidents and indirect or systemic causal mechanisms.
The first step is to generalize the definition of an accident.1 An accident is an
unplanned and undesired loss event. That loss may involve human death and injury,
but it may also involve other major losses, including mission, equipment, financial,
and information losses.
Losses result from component failures, disturbances external to the system, interactions among system components, and behavior of individual system components
that lead to hazardous system states. Examples of hazards include the release of
toxic chemicals from an oil refinery, a patient receiving a lethal dose of medicine,
two aircraft violating minimum separation requirements, and commuter train doors
opening between stations.
In systems theory, emergent properties, such as safety, arise from the interactions
among the system components. The emergent properties are controlled by imposing
constraints on the behavior of and interactions among the components. Safety then
becomes a control problem where the goal of the control is to enforce the safety
constraints. Accidents result from inadequate control or enforcement of safety-related
constraints on the development, design, and operation of the system.
At Bhopal, the safety constraint that was violated was that the M I C must not
come in contact with water. In the Mars Polar Lander, the safety constraint was that
the spacecraft must not impact the planet surface with more than a maximum force.


In the batch chemical reactor accident described in chapter 2, one safety constraint
is a limitation on the temperature of the contents of the reactor.
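
These constraints can be stated precisely as conditions that must hold in every
system state. The following minimal Python sketch is an illustration only, not
material from the original text. the class, field names, and force threshold are
hypothetical stand-ins for the Bhopal and Mars Polar Lander constraints just
described.

from dataclasses import dataclass

MAX_IMPACT_FORCE = 100.0  # illustrative threshold only, not a real requirement

@dataclass
class SystemState:
    mic_in_contact_with_water: bool  # Bhopal
    surface_impact_force: float      # Mars Polar Lander

def bhopal_constraint(state: SystemState) -> bool:
    # "The M I C must not come in contact with water."
    return not state.mic_in_contact_with_water

def lander_constraint(state: SystemState) -> bool:
    # "The spacecraft must not impact the planet surface with more
    # than a maximum force."
    return state.surface_impact_force <= MAX_IMPACT_FORCE

def constraints_enforced(state: SystemState) -> bool:
    # An accident becomes possible exactly when some constraint is violated.
    return bhopal_constraint(state) and lander_constraint(state)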
The problem then becomes one of control where the goal is to control the behavior of the system by enforcing the safety constraints in its design and operation.
Controls must be established to accomplish this goal. These controls need not necessarily involve a human or automated controller. Component behavior .(including
failures). and unsafe interactions may be controlled through physical design, through
process .(such as manufacturing processes and procedures, maintenance processes,
and operations), or through social controls. Social controls include organizational
(management), governmental, and regulatory structures, but they may also be cultural, policy, or individual .(such as self-interest). As an example of the latter, one
explanation that has been given for the 2 thousand 9 financial crisis is that when investment
banks went public, individual controls to reduce personal risk and long-term profits
were eliminated and risk shifted to shareholders and others who had few and weak
controls over those taking the risks.
In this framework, understanding why an accident occurred requires determining
why the control was ineffective. Preventing future accidents requires shifting from
a focus on preventing failures to the broader goal of designing and implementing
controls that will enforce the necessary constraints.
The STAMP .(System-Theoretic Accident Model and Processes). accident model
is based on these principles. Three basic constructs underlie STAMP. safety constraints, hierarchical safety control structures, and process models.


section 4 1.
Safety Constraints.
The most basic concept in STAMP is not an event, but a constraint. Events leading
to losses occur only because safety constraints were not successfully enforced.
The difficulty in identifying and enforcing safety constraints in design and operations has increased from the past. In many of our older and less automated systems,
physical and operational constraints were often imposed by the limitations of technology and of the operational environments. Physical laws and the limits of our
materials imposed natural constraints on the complexity of physical designs and
allowed the use of passive controls.
In engineering, passive controls are those that maintain safety by their presence.
basically, the system fails into a safe state or simple interlocks are used to limit
the interactions among system components to safe ones. Some examples of passive
controls that maintain safety by their presence are shields or barriers such as
containment vessels, safety harnesses, hardhats, passive restraint systems in vehicles,
and fences. Passive controls may also rely on physical principles, such as gravity,
to fail into a safe state. An example is an old railway semaphore that used weights


to ensure that if the cable .(controlling the semaphore). broke, the arm would automatically drop into the stop position. Other examples include mechanical relays
designed to fail with their contacts open, and retractable landing gear for aircraft in
which the wheels drop and lock in the landing position if the pressure system that
raises and lowers them fails. For the batch chemical reactor example in chapter 2,
where the order in which the valves are opened is crucial, designers might have used
a physical interlock that did not allow the catalyst valve to be opened while the water valve
was closed.
In contrast, active controls require some action(s). to provide protection. .(1).
detection of a hazardous event or condition .(monitoring), .(2). measurement of some
variable(s), .(3). interpretation of the measurement .(diagnosis), and .(4). response
(recovery or fail-safe procedures), all of which must be completed before a loss
occurs. These actions are usually implemented by a control system, which now commonly includes a computer.
Consider the simple passive safety control where the circuit for a high-power
outlet is run through a door that shields the power outlet. When the door is opened,
the circuit is broken and the power disabled. When the door is closed and the power
enabled, humans cannot touch the high power outlet. Such a design is simple and
foolproof. An active safety control design for the same high power source requires
some type of sensor to detect when the access door to the power outlet is opened
and an active controller to issue a control command to cut the power. The failure
modes for the active control system are greatly increased over the passive design,
as is the complexity of the system component interactions. In the railway semaphore
example, there must be a way to detect that the cable has broken .(probably now a
digital system is used instead of a cable so the failure of the digital signaling system
must be detected). and some type of active controls used to warn operators to stop
the train. The design of the batch chemical reactor described in chapter 2 used a
computer to control the valve opening and closing order instead of a simple mechanical interlock.
While simple examples are used here for practical reasons, the complexity of our
designs is reaching and exceeding the limits of our intellectual manageability with
a resulting increase in component interaction accidents and lack of enforcement of
the system safety constraints. Even the relatively simple computer-based batch
chemical reactor valve control design resulted in a component interaction accident.
There are often very good reasons to use active controls instead of passive ones,
including increased functionality, more flexibility in design, ability to operate over
large distances, weight reduction, and so on. But the difficulty of the engineering
problem is increased and more potential for design error is introduced.
A similar argument can be made for the interactions between operators and
the processes they control. Cook suggests that when controls were primarily
That distance, however, meant that +operators lost a lot of direct information about the process.they could no longer +sense the process state directly and the control and display surfaces no longer provided as rich a source of information about the process or the state of the controls +themselves. The system designers had to synthesize and provide an image of the +process state to the operators. An important new source of design errors was introduced by the need for the designers to determine beforehand what information the +operator would need under all conditions to safely control the process. If the designers had not anticipated a particular situation could occur and provided for it in the +original system design, they might also not anticipate the need of the operators for +information about it during operations. + + +Designers also had to provide feedback on the actions of the operators and on +any failures that might have occurred. The controls could now be operated without +the desired effect on the process, and the operators might not know about it. Accidents started to occur due to incorrect feedback. For example, major accidents +(including Three Mile Island). have involved the operators commanding a valve to +open and receiving feedback that the valve had opened, when in reality it had not. +In this case and others, the valves were wired to provide feedback indicating that +power had been applied to the valve, but not that the valve had actually opened. +Not only could the design of the feedback about success and failures of control +actions be misleading in these systems, but the return links were also subject +to failure. +Electromechanical controls relaxed constraints on the system design allowing +greater functionality .(figure 4.3). At the same time, they created new possibilities +for designer and operator error that had not existed or were much less likely in +mechanically controlled systems. The later introduction of computer and digital +controls afforded additional advantages and removed even more constraints on the +control system design.and introduced more possibility for error. Proximity in our +old mechanical systems provided rich sources of feedback that involved almost all +of the senses, enabling early detection of potential problems. We are finding it hard +to capture and provide these same qualities in new systems that use automated +controls and displays. +It is the freedom from constraints that makes the design of such systems so difficult. Physical constraints enforced discipline and limited complexity in system +design, construction, and modification. The physical constraints also shaped system +design in ways that efficiently transmitted valuable physical component and process +information to operators and supported their cognitive processes. +The same argument applies to the increasing complexity in organizational and +social controls and in the interactions among the components of sociotechnical +systems. Some engineering projects today employ thousands of engineers. The Joint + + +Strike Fighter, for example, has eight thousand engineers spread over most of the +United States. Corporate operations have become global, with greatly increased +interdependencies and producing a large variety of products. A new holistic approach +to safety, based on control and enforcing safety constraints in the entire sociotechnical system, is needed to ensure safety. 
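
The misleading valve feedback described earlier in this section can be made
concrete with a small sketch. The classes below are hypothetical illustrations,
not a real control system interface. The point is only that the feedback signal
must measure the actual valve position rather than the command sent to the valve.

class Valve:
    """Hypothetical valve that can stick: power may be applied without motion."""
    def __init__(self):
        self.power_applied = False
        self.is_open = False

    def command_open(self, stuck=False):
        self.power_applied = True  # the open command always energizes the valve
        if not stuck:
            self.is_open = True    # a stuck valve never actually moves

def command_feedback(valve):
    # Flawed design: reports that power was applied to the valve.
    return valve.power_applied

def position_feedback(valve):
    # Safer design: a sensor on the valve itself reports its true position.
    return valve.is_open

valve = Valve()
valve.command_open(stuck=True)
assert command_feedback(valve)       # the operator display shows "open"
assert not position_feedback(valve)  # but the valve never opened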
To accomplish this goal, system-level constraints must be identified, and responsibility for enforcing them must be divided up and allocated to appropriate groups. For example, the members of one group might be responsible for performing hazard analyses. The manager of this group might be assigned responsibility for ensuring that the group has the resources, skills, and authority to perform such analyses and for ensuring that high-quality analyses result. Higher levels of management might have responsibility for budgets, for establishing corporate safety policies, and for providing oversight to ensure that safety policies and activities are being carried out successfully and that the information provided by the hazard analyses is used in design and operations.

During system and product design and development, the safety constraints will be broken down and sub-requirements or constraints allocated to the components of the design as it evolves. In the batch chemical reactor, for example, the system safety requirement is that the temperature in the reactor must always remain below a particular level. A design decision may be made to control this temperature using a reflux condenser. This decision leads to a new constraint. “Water must be flowing into the reflux condenser whenever catalyst is added to the reactor.” After a decision is made about what component(s). will be responsible for operating the catalyst and water valves, additional requirements will be generated. If, for example, a decision is made to use software rather than .(or in addition to). a physical interlock, the software must be assigned the responsibility for enforcing the constraint. “The water valve must always be open when the catalyst valve is open.”

In order to provide the level of safety demanded by society today, we first need to identify the safety constraints to enforce and then to design effective controls to enforce them. This process is much more difficult for today’s complex and often high-tech systems than in the past, and new techniques, such as those described in part THREE, are going to be required to solve it, for example, methods to assist in generating the component safety constraints from the system safety constraints. The alternative.building only the simple electromechanical systems of the past or living with higher levels of risk.is for the most part not going to be considered an acceptable solution.

section 4 2.
The Hierarchical Safety Control Structure.
In systems theory .(see section 3 3.), systems are viewed as hierarchical structures, where each level imposes constraints on the activity of the level beneath it.that is, constraints or lack of constraints at a higher level allow or control lower-level behavior.

Control processes operate between levels to control the processes at lower levels in the hierarchy. These control processes enforce the safety constraints for which the control process is responsible. Accidents occur when these processes provide inadequate control and the safety constraints are violated in the behavior of the lower-level components.

By describing accidents in terms of a hierarchy of control based on adaptive feedback mechanisms, adaptation plays a central role in the understanding and prevention of accidents.
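As a concrete illustration of the kind of component-level enforcement just described, consider the following minimal sketch, in Python, of a low-level software controller enforcing the batch reactor constraint that the water valve must be open whenever the catalyst valve is open. The class and method names are hypothetical; this is an illustration under assumptions, not the actual reactor software discussed in chapter 2.

class Valve:
    def __init__(self):
        self._open = False
    def open(self):
        self._open = True
    def close(self):
        self._open = False
    def is_open(self):
        # In a real plant this would read a position sensor,
        # not merely echo the last command.
        return self._open

class ReactorController:
    def __init__(self, water_valve, catalyst_valve):
        self.water_valve = water_valve
        self.catalyst_valve = catalyst_valve

    def add_catalyst(self):
        # Safety constraint: the water valve must always be open
        # when the catalyst valve is open.
        self.water_valve.open()
        if not self.water_valve.is_open():
            # Feedback, not the issued command, determines whether
            # it is safe to proceed.
            raise RuntimeError("water flow not confirmed; catalyst withheld")
        self.catalyst_valve.open()

The essential design choice is that the controller acts on sensed valve position .(feedback). rather than on its own last command; a version that simply issued the two open commands in order would repeat the component interaction accident described in chapter 2.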
At each level of the hierarchical structure, inadequate control may result from missing constraints .(unassigned responsibility for safety), inadequate safety control commands, commands that were not executed correctly at a lower level, or inadequately communicated or processed feedback about constraint enforcement. For example, an operations manager may provide unsafe work instructions or procedures to the operators, or the manager may provide instructions that enforce the safety constraints, but the operators may ignore them. The operations manager may not have the feedback channels established to determine that unsafe instructions were provided or that his or her safety-related instructions are not being followed.

Figure 4.4 shows a typical sociotechnical hierarchical safety control structure common in a regulated, safety-critical industry in the United States, such as air transportation. Each system, of course, must be modeled to include its specific features. Figure 4.4 has two basic hierarchical control structures.one for system development .(on the left). and one for system operation .(on the right).with interactions between them. An aircraft manufacturer, for example, might have only system development under its immediate control, but safety involves both development and operational use of the aircraft, and neither can be accomplished successfully in isolation. Safety during operation depends partly on the original design and development and partly on effective control over operations. Communication channels may be needed between the two structures. For example, aircraft manufacturers must communicate to their customers the assumptions about the operational environment upon which the safety analysis was based, as well as information about safe operating procedures. The operational environment .(e.g., the commercial airline industry), in turn, provides feedback to the manufacturer about the performance of the system over its lifetime.

Between the hierarchical levels of each safety control structure, effective communication channels are needed, both a downward reference channel providing the information necessary to impose safety constraints on the level below and an upward measuring channel to provide feedback about how effectively the constraints are being satisfied .(figure 4.5). Feedback is critical in any open system in order to provide adaptive control. The controller uses the feedback to adapt future control commands to more readily achieve its goals.

Government, general industry groups, and the court system are the top two levels of each of the generic control structures shown in figure 4.4. The government control structure in place to control development may differ from that controlling operations.responsibility for certifying the aircraft developed by aircraft manufacturers is assigned to one group at the FAA, while responsibility for supervising airline operations is assigned to a different group. The appropriate constraints in each control structure and at each level will vary but in general may include technical design and process constraints, management constraints, manufacturing constraints, and operational constraints.

At the highest level in both the system development and system operation hierarchies are Congress and state legislatures. Congress controls safety by passing laws and by establishing and funding government regulatory structures.
Feedback as to the success of these controls or the need for additional ones comes in the form of government reports, congressional hearings and testimony, lobbying by various interest groups, and, of course, accidents.

The next level contains government regulatory agencies, industry associations, user associations, insurance companies, and the court system. Unions have always played an important role in ensuring safe operations, such as the air traffic controllers union in the air transportation system, or in ensuring worker safety in manufacturing. The legal system tends to be used when there is no regulatory authority and the public has no other means to encourage a desired level of concern for safety in company management. The constraints generated at this level and imposed on companies are usually in the form of policy, regulations, certification, standards .(by trade or user associations), or threat of litigation. Where there is a union, safety-related constraints on operations or manufacturing may result from union demands and collective bargaining.

Company management takes the standards, regulations, and other general controls on its behavior and translates them into specific policy and standards for the company. Many companies have a general safety policy .(it is required by law in Great Britain). as well as more detailed standards documents. Feedback may come in the form of status reports, risk assessments, and incident reports.

In the development control structure .(shown on the left of figure 4.4), company policies and standards are usually tailored and perhaps augmented by each engineering project to fit the needs of the particular project. The higher-level control process may provide only general goals and constraints and the lower levels may then add many details to operationalize the general goals and constraints given the immediate conditions and local goals. For example, while government or company standards may require a hazard analysis be performed, the system designers and documenters .(including those designing the operational procedures and writing user manuals). may have control over the actual hazard analysis process used to identify specific safety constraints on the design and operation of the system. These detailed procedures may need to be approved by the level above.

The design constraints identified as necessary to control system hazards are passed to the implementers and assurers of the individual system components along with standards and other requirements. Success is determined through feedback provided by test reports, reviews, and various additional hazard analyses. At the end of the development process, the results of the hazard analyses as well as documentation of the safety-related design features and design rationale should be passed on to the maintenance group to be used in the system evolution and sustainment process.

A similar process involving layers of control is found in the system operation control structure. In addition, there will be .(or at least should be). interactions between the two structures. For example, the safety design constraints used during development should form the basis for operating procedures and for performance and process auditing.

As in any control loop, time lags may affect the flow of control actions and feedback and may impact the effectiveness of the control loop in enforcing the safety constraints.
For example, standards can take years to develop or change.a time scale that may keep them behind current technology and practice. At the physical level, new technology may be introduced in different parts of the system at different rates, which may result in asynchronous evolution of the control structure. In the accidental shootdown of two U.S. Army Black Hawk helicopters by two U.S. Air Force F-15s in the no-fly zone over northern Iraq in 1994, for example, the fighter jet aircraft and the helicopters were inhibited in communicating by radio because the F-15 pilots used newer jam-resistant radios that could not communicate with the older-technology Army helicopter radios. Hazard analysis needs to include the influence of these time lags and potential changes over time.

A common way to deal with time lags leading to delays is to delegate responsibility to lower levels that are not subject to as great a delay in obtaining information or feedback from the measuring channels. In periods of quickly changing technology, time lags may make it necessary for the lower levels to augment the control processes passed down from above or to modify them to fit the current situation. Time lags at the lowest levels, as in the Black Hawk shootdown example, may require the use of feedforward control to overcome lack of feedback or may require temporary controls on behavior. Communication between the F-15s and the Black Hawks would have been possible if the F-15 pilots had been told to use an older radio technology available to them, as they were commanded to do for other types of friendly aircraft.

More generally, control structures always change over time, particularly those that include humans and organizational components. Physical devices also change with time, but usually much more slowly and in more predictable ways. If we are to handle social and human aspects of safety, then our accident causality models must include the concept of change. In addition, controls and assurance that the safety control structure remains effective in enforcing the constraints over time are required.

Control does not necessarily imply rigidity and authoritarian management styles. Rasmussen notes that control at each level may be enforced in a very prescriptive command and control structure or it may be loosely implemented as performance objectives with many degrees of freedom in how the objectives are met. Recent trends from management by oversight to management by insight reflect differing levels of feedback control that are exerted over the lower levels and a change from prescriptive management control to management by objectives, where the objectives are interpreted and satisfied according to the local context. Management insight, however, does not mean abdication of safety-related responsibility. In the Milstar satellite and Mars Polar Lander losses, the accident reports note that a poor transition from oversight to insight was a factor in the losses. Attempts to delegate decisions and to manage by objectives require an explicit formulation of the value criteria to be used and an effective means for communicating the values down through society and organizations. In addition, the impact of specific decisions at each level on the objectives and values passed down needs to be adequately and formally evaluated. Feedback is required to measure how successfully the functions are being performed.
Although regulatory agencies are included in the figure 4.4 example, there is no implication that government regulation is required for safety. The only requirement is that responsibility for safety is distributed in an appropriate way throughout the sociotechnical system. In aircraft safety, for example, manufacturers play the major role while the FAA type certification authority simply provides oversight that safety is being successfully engineered into aircraft at the lower levels of the hierarchy. If companies or industries are unwilling or incapable of performing their public safety responsibilities, then government has to step in to achieve the overall public safety goals. But a much better solution is for company management to take responsibility, as it has direct control over the system design and manufacturing and over operations.

The safety-control structure will differ among industries, and examples are spread among the following chapters. Figure C.1 in appendix C shows the control structure and safety constraints for the hierarchical water safety control system in Ontario, Canada. The structure is drawn on its side .(as is more common for control diagrams). so that the top of the hierarchy is on the left side of the figure. The system hazard is exposure of the public to E. coli or other health-related contaminants through the public drinking water system; therefore, the goal of the safety control structure is to prevent such exposure. This goal leads to two system safety constraints.
1. Water quality must not be compromised.
2. Public health measures must reduce the risk of exposure if water quality is somehow compromised .(such as notification and procedures to follow).

The physical processes being controlled by this control structure .(shown at the right of the figure). are the water system, the wells used by the local public utilities, and public health. Details of the control structure are discussed in appendix C, but appropriate responsibility, authority, and accountability must be assigned to each component with respect to the role it plays in the overall control structure. For example, the responsibility of the Canadian federal government is to establish a nationwide public health system and ensure that it is operating effectively. The provincial government must establish regulatory bodies and codes, provide resources to the regulatory bodies, provide oversight and feedback loops to ensure that the regulators are doing their job adequately, and ensure that adequate risk assessment is conducted and effective risk management plans are in place. Local public utility operations must apply adequate doses of chlorine to kill bacteria, measure the chlorine residuals, and take further steps if evidence of bacterial contamination is found. While chlorine residuals are a quick way to get feedback about possible contamination, more accurate feedback is provided by analyzing water samples, but that takes longer .(it has a greater time lag). Both have their uses in the overall safety control structure of the public water supply.

Safety control structures may be very complex. Abstracting and concentrating on parts of the overall structure may be useful in understanding and communicating about the controls. In examining different hazards, only subsets of the overall structure may be relevant and need to be considered in detail and the rest can be treated as the inputs to or the environment of the substructure.
The only critical part is that the hazards must first be identified at the system level and the process must then proceed top-down and not bottom-up to identify the safety constraints for the parts of the overall control structure.

The operation of sociotechnical safety control structures at all levels is facing the stresses noted in chapter 1, such as rapidly changing technology, competitive and time-to-market pressures, and changing public and regulatory views of responsibility for safety. These pressures can lead to a need for new procedures or new controls to ensure that required safety constraints are not ignored.

section 4 3.
Process Models.
The third concept used in STAMP, along with safety constraints and hierarchical safety control structures, is process models. Process models are an important part of control theory. The four conditions required to control a process are described in chapter 3. The first is a goal, which in STAMP is the safety constraints that must be enforced by each controller in the hierarchical safety control structure. The action condition is implemented in the .(downward). control channels and the observability condition is embodied in the .(upward). feedback or measuring channels. The final condition is the model condition. Any controller.human or automated. needs a model of the process being controlled to control it effectively .(figure 4.6).

At one extreme, this process model may contain only one or two variables, such as the model required for a simple thermostat, which contains the current temperature and the setpoint and perhaps a few control laws about how temperature is changed. At the other extreme, effective control may require a very complex model with a large number of state variables and transitions, such as the model needed to control air traffic.

Whether the model is embedded in the control logic of an automated controller or in the mental model maintained by a human controller, it must contain the same type of information. the required relationship among the system variables .(the control laws), the current state .(the current values of the system variables), and the ways the process can change state. This model is used to determine what control actions are needed, and it is updated through various forms of feedback. If the model of the room temperature shows that the ambient temperature is less than the setpoint, then the thermostat issues a control command to start a heating element. Temperature sensors provide feedback about the .(hopefully rising). temperature. This feedback is used to update the thermostat’s model of the current room temperature. When the setpoint is reached, the thermostat turns off the heating element. In the same way, human operators also require accurate process or mental models to provide safe control actions.

Component interaction accidents can usually be explained in terms of incorrect process models. For example, the Mars Polar Lander software thought the spacecraft had landed and issued a control instruction to shut down the descent engines. The captain of the Herald of Free Enterprise thought the ferry doors were closed and ordered the ship to leave the mooring. The pilots in the Cali Colombia B757 crash thought R was the symbol denoting the radio beacon near Cali.
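The thermostat’s process model can be made concrete with a minimal sketch, in Python. The names are hypothetical and the controller deliberately simplified; this is an illustration of the concept, not code from any real thermostat. The controller never sees the true room temperature. It acts only on its model of it, which stays accurate only as long as feedback keeps arriving.

class Thermostat:
    def __init__(self, setpoint):
        self.setpoint = setpoint
        self.modeled_temperature = None  # process model: believed room state
        self.heater_on = False

    def receive_feedback(self, measured_temperature):
        # Feedback updates the process model.
        self.modeled_temperature = measured_temperature

    def control_action(self):
        # Control commands are computed from the model, not from the
        # (directly unobservable) true room temperature.
        if self.modeled_temperature is None:
            return  # no model yet, so no basis for a command
        self.heater_on = self.modeled_temperature < self.setpoint

thermostat = Thermostat(setpoint=20.0)
thermostat.receive_feedback(17.5)
thermostat.control_action()  # heater_on becomes True

If the temperature sensor fails silently, receive_feedback is never called again, the model diverges from the actual room state, and the controller keeps issuing commands that are correct for its model but wrong for the room, the same pattern at small scale as the component interaction accidents just described.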
In general, accidents often occur, particularly component interaction accidents and accidents involving complex digital technology or human error, when the process model used by the controller .(automated or human). does not match the process and, as a result.
1. Incorrect or unsafe control commands are given
2. Required control actions .(for safety). are not provided
3. Potentially correct control commands are provided at the wrong time .(too early or too late), or
4. Control is stopped too soon or applied too long.

These four types of inadequate control actions are used in the new hazard analysis technique described in chapter 8.

A model of the process being controlled is required not just at the lower physical levels of the hierarchical control structure, but at all levels. In order to make proper decisions, the manager of an oil refinery may need to have a model of the current maintenance level of the safety equipment of the refinery, the state of safety training of the workforce, and the degree to which safety requirements are being followed or are effective, among other things. The CEO of the global oil conglomerate has a much less detailed model of the state of the refineries he controls but at the same time requires a broader view of the state of safety of all the corporate assets in order to make appropriate corporate-level decisions impacting safety.

Process models are not only used during operations but also during system development activities. Designers use both models of the system being designed and models of the development process itself. The developers may have an incorrect model of the system or software behavior necessary for safety or the physical laws controlling the system. Safety may also be impacted by developers’ incorrect models of the development process itself.

As an example of the latter, a Titan/Centaur satellite launch system, along with the Milstar satellite it was transporting into orbit, was lost due to a typo in a load tape used by the computer to determine the attitude change instructions to issue to the engines. The information on the load tape was essentially part of the process model used by the attitude control software. The typo was not caught during the development process partly because of flaws in the developers’ models of the testing process.each thought someone else was testing the software using the actual load tape when, in fact, nobody was .(see appendix B).

In summary, process models play an important role .(1). in understanding why accidents occur and why humans provide inadequate control over safety-critical systems and .(2). in designing safer systems.

section 4.4.
STAMP.
The STAMP .(Systems-Theoretic Accident Model and Process). model of accident causation is built on these three basic concepts.safety constraints, a hierarchical safety control structure, and process models.along with basic systems theory concepts. All the pieces for a new causation model have been presented. It is now simply a matter of putting them together.

In STAMP, systems are viewed as interrelated components kept in a state of dynamic equilibrium by feedback control loops. Systems are not treated as static but as dynamic processes that are continually adapting to achieve their ends and to react to changes in themselves and their environment.

Safety is an emergent property of the system that is achieved when appropriate constraints on the behavior of the system and its components are satisfied.
The original design of the system must not only enforce appropriate constraints on behavior to ensure safe operation, but the system must continue to enforce the safety constraints as changes and adaptations to the system design occur over time.

Accidents are the result of flawed processes involving interactions among people, societal and organizational structures, engineering activities, and physical system components that lead to violating the system safety constraints. The process leading up to an accident is described in STAMP in terms of an adaptive feedback function that fails to maintain safety as system performance changes over time to meet a complex set of goals and values.

Instead of defining safety management in terms of preventing component failures, it is defined as creating a safety control structure that will enforce the behavioral safety constraints and ensure its continued effectiveness as changes and adaptations occur over time. Effective safety .(and risk). management may require limiting the types of changes that occur, but the goal is to allow as much flexibility and performance enhancement as possible while enforcing the safety constraints.

Accidents can be understood, using STAMP, by identifying the safety constraints that were violated and determining why the controls were inadequate in enforcing them. For example, understanding the Bhopal accident requires determining not simply why the maintenance personnel did not insert the slip blind, but also why the controls that had been designed into the system to prevent the release of hazardous chemicals and to mitigate the consequences of such occurrences.including maintenance procedures and oversight of maintenance processes, refrigeration units, gauges and other monitoring units, a vent scrubber, water spouts, a flare tower, safety audits, alarms and practice alerts, emergency procedures and equipment, and others.were not successful.

STAMP not only allows consideration of more accident causes than simple component failures, but it also allows more sophisticated analysis of failures and component failure accidents. Component failures may result from inadequate constraints on the manufacturing process; inadequate engineering design such as missing or incorrectly implemented fault tolerance; lack of correspondence between individual component capacity .(including human capacity). and task requirements; unhandled environmental disturbances .(e.g., electromagnetic interference or EMI); inadequate maintenance; physical degradation .(wearout); and so on.

Component failures may be prevented by increasing the integrity or resistance of the component to internal or external influences or by building in safety margins or safety factors. They may also be avoided by operational controls, such as operating the component within its design envelope and by periodic inspections and preventive maintenance. Manufacturing controls can reduce deficiencies or flaws introduced during the manufacturing process. The effects of physical component failure on system behavior may be eliminated or reduced by using redundancy. The important difference from other causality models is that STAMP goes beyond simply blaming component failure for accidents by requiring that the reasons be identified for why those failures occurred .(including systemic factors).
and led to an accident, that is, why the controls instituted for preventing such failures or for minimizing their impact on safety were missing or inadequate. And it includes other types of accident causes, such as component interaction accidents, which are becoming more frequent with the introduction of new technology and new roles for humans in system control.

STAMP does not lend itself to a simple graphic representation of accident causality .(see figure 4.7). While dominoes, event chains, and holes in Swiss cheese are very compelling because they are easy to grasp, they oversimplify causality and thus the approaches used to prevent accidents.

section 4.5.
A General Classification of Accident Causes.
Starting from the basic definitions in STAMP, the general causes of accidents can be identified using basic systems and control theory. The resulting classification is useful in accident analysis and accident prevention activities.

Accidents in STAMP are the result of a complex process that results in the system behavior violating the safety constraints. The safety constraints are enforced by the control loops between the various levels of the hierarchical control structure that are in place during design, development, manufacturing, and operations.

Using the STAMP causality model, if there is an accident, one or more of the following must have occurred.
1. The safety constraints were not enforced by the controller.
a. The control actions necessary to enforce the associated safety constraint at each level of the sociotechnical control structure for the system were not provided.
b. The necessary control actions were provided but at the wrong time .(too early or too late). or stopped too soon.
c. Unsafe control actions were provided that caused a violation of the safety constraints.
2. Appropriate control actions were provided but not followed.

These same general factors apply at each level of the sociotechnical control structure, but the interpretation .(application). of the factor at each level may differ. Classification of accident causal factors starts by examining each of the basic components of a control loop .(see figure 3.2). and determining how their improper operation may contribute to the general types of inadequate control.

Figure 4.8 shows the classification. The causal factors in accidents can be divided into three general categories. .(1). the controller operation, .(2). the behavior of actuators and controlled processes, and .(3). communication and coordination among controllers and decision makers. When humans are involved in the control structure, context and behavior-shaping mechanisms also play an important role in causality.

section 4.5.1. Controller Operation.
Controller operation has three primary parts. control inputs and other relevant external information sources, the control algorithms, and the process model. Inadequate, ineffective, or missing control actions necessary to enforce the safety constraints and ensure safety can stem from flaws in each of these parts. For human controllers and actuators, context is also an important factor.

Unsafe Inputs .(① in figure 4.8).
Each controller in the hierarchical control structure is itself controlled by higher-level controllers. The control actions and other information provided by the higher level and required for safe behavior may be missing or wrong.
Using the Black Hawk friendly fire example again, the F-15 pilots patrolling the no-fly zone were given instructions to switch to a non-jammed radio mode for a list of aircraft types that did not have the ability to interpret jammed broadcasts. Black Hawk helicopters had not been upgraded with new anti-jamming technology but were omitted from the list and so could not hear the F-15 radio broadcasts. Other types of missing or wrong noncontrol inputs may also affect the operation of the controller.

Unsafe Control Algorithms .(② in figure 4.8).
Algorithms in this sense are both the procedures designed by engineers for hardware controllers and the procedures that human controllers use. Control algorithms may not enforce safety constraints because the algorithms are inadequately designed originally, the process may change and the algorithms become unsafe, or the control algorithms may be inadequately modified by maintainers if the algorithms are automated or through various types of natural adaptation if they are implemented by humans. Human control algorithms are affected by initial training, by the procedures provided to the operators to follow, and by feedback and experimentation over time .(see figure 2.9).

Time delays are an important consideration in designing control algorithms. Any control loop includes time lags, such as the time between the measurement of process parameters and receiving those measurements or between issuing a command and the time the process state actually changes. For example, pilot response delays are important time lags that must be considered in designing the control function for TCAS or other aircraft systems, as are time lags in the controlled process.the aircraft trajectory, for example.caused by aircraft performance limitations.

Delays may not be directly observable, but may need to be inferred. Depending on where in the feedback loop the delay occurs, different control algorithms are required to cope with the delays. Dead time and time constants require an algorithm that makes it possible to predict when an action is needed before the need arises. Feedback delays generate requirements to predict when a prior control action has taken effect and when resources will be available again. Such requirements may impose the need for some type of open loop or feedforward strategy to cope with delays. When time delays are not adequately considered in the control algorithm, accidents can result.

Leplat has noted that many accidents relate to asynchronous evolution, where one part of a system .(in this case the hierarchical safety control structure). changes without the related necessary changes in other parts. Changes to subsystems may be carefully designed, but consideration of their effects on other parts of the system, including the safety control aspects, may be neglected or inadequate. Asynchronous evolution may also occur when one part of a properly designed system deteriorates.

In both these cases, the erroneous expectations of users or system components about the behavior of the changed or degraded subsystem may lead to accidents. The Ariane 5 trajectory changed from that of the Ariane 4, but the inertial reference system software was not changed. As a result, an assumption of the inertial reference software was violated and the spacecraft was lost shortly after launch.
One factor in the loss of contact with SOHO .(SOlar Heliospheric Observatory), a scientific spacecraft, in 19 98 was the failure to communicate to operators that a functional change had been made in a procedure to perform gyro spin down. The Black Hawk friendly fire accident .(analyzed in chapter 5). had several examples of asynchronous evolution, for example the mission changed and an individual key to communication between the Air Force and Army left, leaving the safety control structure without an important component.

Communication is a critical factor here as well as monitoring for changes that may occur and feeding back this information to the higher-level control. For example, the safety analysis process that generates constraints always involves some basic assumptions about the operating environment of the process. When the environment changes such that those assumptions are no longer true, as in the Ariane 5 and SOHO examples, the controls in place may become inadequate. Embedded pacemakers provide another example. These devices were originally assumed to be used only in adults, who would lie quietly in the doctor’s office while the pacemaker was being “programmed.” Later these devices began to be used in children, and the assumptions under which the hazard analysis was conducted and the controls were designed no longer held and needed to be revisited. A requirement for effective updating of the control algorithms is that the assumptions of the original .(and subsequent). analysis are recorded and retrievable.

Inconsistent, Incomplete, or Incorrect Process Models .(③ in figure 4.8).
Section 4.3 stated that effective control is based on a model of the process state. Accidents, particularly component interaction accidents, most often result from inconsistencies between the models of the process used by the controllers .(both human and automated). and the actual process state. When the controller’s model of the process .(either the human mental model or the software or hardware model). diverges from the process state, erroneous control commands .(based on the incorrect model). can lead to an accident. for example, .(1). the software does not know that the plane is on the ground and raises the landing gear, or .(2). the controller .(automated or human). does not identify an object as friendly and shoots a missile at it, or .(3). the pilot thinks the aircraft controls are in speed mode but the computer has changed the mode to open descent and the pilot behaves inappropriately for that mode, or .(4). the computer does not think the aircraft has landed and overrides the pilots’ attempts to operate the braking system. All of these examples have actually occurred.

The mental models of the system developers are also important. During software development, for example, the programmers’ models of required behavior may not match the engineers’ models .(commonly referred to as a software requirements error), or the software may be executed on computer hardware or may control physical systems during operations that differ from what was assumed by the programmer and used during testing. The situation becomes even more complicated when there are multiple controllers .(both human and automated). because each of their process models must also be kept consistent.
The most common form of inconsistency occurs when one or more process models is incomplete in terms of not defining appropriate behavior for all possible process states or all possible disturbances, including unhandled or incorrectly handled component failures. Of course, no models are complete in the absolute sense. The goal is to make them complete enough that no safety constraints are violated when they are used. Criteria for completeness in this sense are presented in Safeware, and completeness analysis is integrated into the new hazard analysis method as described in chapter 9.

How does the process model become inconsistent with the actual process state? The process model designed into the system .(or provided by training if the controller is human). may be wrong from the beginning, there may be missing or incorrect feedback for updating the process model as the controlled process changes state, the process model may be updated incorrectly .(an error in the algorithm of the controller), or time lags may not be accounted for. The result can be uncontrolled disturbances, unhandled process states, inadvertent commanding of the system into a hazardous state, unhandled or incorrectly handled controlled process component failures, and so forth.

Feedback is critically important to the safe operation of the controller. A basic principle of system theory is that no control system will perform better than its measuring channel. Feedback may be missing or inadequate because such feedback is not included in the system design, flaws exist in the monitoring or feedback communication channel, the feedback is not timely, or the measuring instrument operates inadequately.

A contributing factor cited in the Cali B757 accident report, for example, was the omission of the waypoints behind the aircraft from cockpit displays, which contributed to the crew not realizing that the waypoint for which they were searching was behind them .(missing feedback). The model of the Ariane 501 attitude used by the attitude control software became inconsistent with the launcher attitude when an error message sent by the inertial reference system was interpreted by the attitude control system as data .(incorrect processing of feedback), causing the spacecraft onboard computer to issue an incorrect and unsafe command to the booster and main engine nozzles.

Other reasons for the process models to diverge from the true system state may be more subtle. Information about the process state has to be inferred from measurements. For example, in the TCAS TWO aircraft collision avoidance system, relative range positions of other aircraft are computed based on round-trip message propagation time. The theoretical control function .(control law). uses the true values of the controlled variables or component states .(e.g., true aircraft positions). However, at any time, the controller has only measured values, which may be subject to time lags or inaccuracies. The controller must use these measured values to infer the true conditions in the process and, if necessary, to derive corrective actions to maintain the required process state. In the TCAS example, sensors include on-board devices such as altimeters that provide measured altitude .(not necessarily true altitude). and antennas for communicating with other aircraft. The primary TCAS actuator is the pilot, who may or may not respond to system advisories.
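The range inference just described reduces to simple arithmetic, sketched below in Python. The numbers are illustrative assumptions, and the sketch ignores transponder turnaround time and other details of the real TCAS protocol; it is not specification behavior. The point is that any error in the measured round-trip time appears directly as an error in the controller’s process model of the other aircraft’s position.

C = 299_792_458.0  # radio propagation speed (speed of light), meters/second

def inferred_range_meters(round_trip_seconds):
    # One-way distance inferred from round-trip message propagation time.
    return C * round_trip_seconds / 2.0

true_separation = 9_000.0                    # meters
true_round_trip = 2.0 * true_separation / C  # about 60 microseconds
measured_round_trip = true_round_trip + 5e-6 # hypothetical 5-microsecond timing error

print(inferred_range_meters(measured_round_trip))  # about 9,750 meters

Here a five-microsecond timing error puts the modeled range roughly 750 meters off, even though no component has failed in the usual sense; the measuring channel itself bounds how good the process model can be.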
The mapping between the measured or assumed values and the true values can be flawed.

To summarize, process models can be incorrect from the beginning.where correct is defined in terms of consistency with the current process state and with the models being used by other controllers.or they can become incorrect due to erroneous or missing feedback or measurement inaccuracies. They may also be incorrect only for short periods of time due to time lags in the process loop.

section 4.5.2. Actuators and Controlled Processes .(④ in figure 4.8).
The factors discussed so far have involved inadequate control. The other case occurs when the control commands maintain the safety constraints, but the controlled process may not implement these commands. One reason might be a failure or flaw in the reference channel, that is, in the transmission of control commands. Another reason might be an actuator or controlled component fault or failure. A third is that the safety of the controlled process may depend on inputs from other system components, such as power, for the execution of the control actions provided. If these process inputs are missing or inadequate in some way, the controlled process may be unable to execute the control commands and accidents may result. Finally, there may be external disturbances that are not handled by the controller.

In a hierarchical control structure, the actuators and controlled process may themselves be a controller of a lower-level process. In this case, the flaws in executing the control are the same as those described earlier for a controller.

Once again, these types of flaws do not simply apply to operations or to the technical system but also to system design and development. For example, a common flaw in system development is that the safety information gathered or created by the system safety engineers .(the hazards and the necessary design constraints to control them). is inadequately communicated to the system designers and testers, or that flaws exist in the use of this information in the system development process.

section 4.5.3. Coordination and Communication among Controllers and Decision Makers.
When there are multiple controllers .(human and/or automated), control actions may be inadequately coordinated, including unexpected side effects of decisions or actions or conflicting control actions. Communication flaws play an important role here.

Leplat suggests that accidents are most likely in overlap areas or in boundary areas, where two or more controllers .(human or automated). control the same process or processes with common boundaries .(figure 4.9). In both boundary and overlap areas, the potential exists for ambiguity and for conflicts among independent decisions.

Responsibility for the control functions in boundary areas is often poorly defined. For example, Leplat cites an iron and steel plant where frequent accidents occurred at the boundary of the blast furnace department and the transport department. One conflict arose when a signal informing transport workers of the state of the blast furnace did not work and was not repaired because each department was waiting for the other to fix it. Faverge suggests that such dysfunction can be related to the number of management levels separating the workers in the departments from a common manager. The greater the distance, the more difficult the communication, and thus the greater the uncertainty and risk.
Coordination problems in the control of boundary areas are rife. As mentioned earlier, a Milstar satellite was lost due to inadequate attitude control of the Titan/Centaur launch vehicle, which used an incorrect process model based on erroneous inputs on a software load tape. After the accident, it was discovered that nobody had tested the software using the actual load tape.each group involved in testing and assurance had assumed some other group was doing so. In the system development process, system engineering and mission assurance activities were missing or ineffective, and a common control or management function was quite distant from the individual development and assurance groups .(see appendix B). One factor in the loss of the Black Hawk helicopters to friendly fire over northern Iraq was that the helicopters normally flew only in the boundary areas of the no-fly zone and procedures for handling aircraft in those areas were ill defined. Another factor was that an Army base controlled the flights of the Black Hawks, while an Air Force base controlled all the other components of the airspace. A common control point once again was high above where the accident occurred in the control structure. In addition, communication problems existed between the Army and Air Force bases at the intermediate control levels.

Overlap areas exist when a function is achieved by the cooperation of two controllers or when two controllers exert influence on the same object. Such overlap creates the potential for conflicting control actions .(dysfunctional interactions among control actions). Leplat cites a study of the steel industry that found 67 percent of technical incidents with material damage occurred in areas of co-activity, although these represented only a small percentage of the total activity areas. In an A320 accident in Bangalore, India, the pilot had disconnected his flight director during approach and assumed that the copilot would do the same. The result would have been a mode configuration in which airspeed is automatically controlled by the autothrottle .(the speed mode), which is the recommended procedure for the approach phase. However, the copilot had not turned off his flight director, which meant that open descent mode became active when a lower altitude was selected instead of speed mode, eventually contributing to the crash of the aircraft short of the runway. In the Black Hawks’ shootdown by friendly fire, the aircraft surveillance officer .(A S O). thought she was responsible only for identifying and tracking aircraft south of the 36th Parallel, while the air traffic controller for the area north of the 36th Parallel thought the A S O was also tracking and identifying aircraft in his area and acted accordingly.

In 2002, two aircraft collided over southern Germany. An important factor in the accident was the lack of coordination between the airborne TCAS .(collision avoidance). system and the ground air traffic controller. They each gave different and conflicting advisories on how to avoid a collision. If both pilots had followed one or the other, the loss would have been avoided, but one followed the TCAS advisory and the other followed the ground air traffic control advisory.
But human behavior is also greatly impacted by the context and environment in which the human is working. These factors have been called “behavior shaping mechanisms.” While value systems and other influences on decision making can be considered to be inputs to the controller, describing them in this way oversimplifies their role and origin. A classification of the contextual and behavior-shaping mechanisms is premature at this point, but relevant principles and heuristics are elucidated throughout the rest of the book.

section 4.6.
Applying the New Model.
To summarize, STAMP focuses particular attention on the role of constraints in safety management. Accidents are seen as resulting from inadequate control or enforcement of constraints on safety-related behavior at each level of the system development and system operations control structures. Accidents can be understood in terms of why the controls that were in place did not prevent or detect maladaptive changes.

Accident causal analysis based on STAMP starts with identifying the safety constraints that were violated and then determines why the controls designed to enforce the safety constraints were inadequate or, if they were potentially adequate, why the system was unable to exert appropriate control over their enforcement.

In this conception of safety, there is no “root cause.” Instead, the accident “cause” consists of an inadequate safety control structure that under some circumstances leads to the violation of a behavioral safety constraint. Preventing future accidents requires reengineering or designing the safety control structure to be more effective.

Because the safety control structure and the behavior of the individuals in it, like any physical or social system, changes over time, accidents must be viewed as dynamic processes. Looking only at the time of the proximal loss events distorts and omits from view the most important aspects of the larger accident process that are needed to prevent reoccurrences of losses from the same causes in the future. Without that view, we see and fix only the symptoms, that is, the results of the flawed processes and inadequate safety control structure without getting to the sources of those symptoms.

To understand the dynamic aspects of accidents, the process leading to the loss can be viewed as an adaptive feedback function where the safety control system performance degrades over time as the system attempts to meet a complex set of goals and values. Adaptation is critical in understanding accidents, and the adaptive feedback mechanism inherent in the model allows a STAMP analysis to incorporate adaptation as a fundamental system property.

We have found in practice that using this model helps us to separate factual data from the interpretations of that data. While the events and physical data involved in accidents may be clear, their importance and the explanations for why the factors were present are often subjective, as is the selection of the events to consider.

STAMP models are also more complete than most accident reports and other models. Each of the explanations for the incorrect FMS input of R in the Cali American Airlines accident described in chapter 2, for example, appears in the STAMP analysis of that accident at the appropriate levels of the control structure where they operated. The use of STAMP helps not only to identify the factors but also to understand the relationships among them.
While STAMP models will probably not be useful in lawsuits as they do not assign blame for the accident to a specific person or group, they do provide more help in understanding accidents by forcing examination of each part of the sociotechnical system to see how it contributed to the loss.and there will usually be contributions at each level. Such understanding should help in learning how to engineer safer systems, including the technical, managerial, organizational, and regulatory aspects.

To accomplish this goal, a framework for classifying the factors that lead to accidents was derived from the basic underlying conceptual accident model .(see figure 4.8). This classification can be used in identifying the factors involved in a particular accident and in understanding their role in the process leading to the loss. The accident investigation after the Black Hawk shootdown .(analyzed in detail in the next chapter). identified 130 different factors involved in the accident. In the end, only the AWACS senior director was court-martialed, and he was acquitted. The more one knows about an accident process, the more difficult it is to find one person or part of the system responsible, but the easier it is to find effective ways to prevent similar occurrences in the future.

STAMP is useful not only in analyzing accidents that have occurred but in developing new and potentially more effective system engineering methodologies to prevent accidents. Hazard analysis can be thought of as investigating an accident before it occurs. Traditional hazard analysis techniques, such as fault tree analysis and various types of failure analysis techniques, do not work well for very complex systems, for software errors, human errors, and system design errors. Nor do they usually include organizational and management flaws. The problem is that these hazard analysis techniques are limited by a focus on failure events and the role of component failures in accidents; they do not account for component interaction accidents, the complex roles that software and humans are assuming in high-tech systems, the organizational factors in accidents, and the indirect relationships between events and actions required to understand why accidents occur.

STAMP provides a direction to take in creating these new hazard analysis and prevention techniques. Because in a system accident model everything starts from constraints, the new approach focuses on identifying the constraints required to maintain safety; identifying the flaws in the control structure that can lead to an accident .(inadequate enforcement of the safety constraints); and then designing a control structure, physical system, and operating conditions that enforce the constraints.

Such hazard analysis techniques augment the typical failure-based design focus and encourage a wider variety of risk reduction measures than simply adding redundancy and overdesign to deal with component failures. The new techniques also provide a way to implement safety-guided design so that safety analysis guides the design generation rather than waiting until a design is complete to discover it is unsafe. Part THREE describes ways to use techniques based on STAMP to prevent accidents through system design, including design of the operating conditions and the safety management control structure.

STAMP can also be used to improve performance analysis. Performance monitoring of complex systems has created some dilemmas.
Computers allow the collection of massive amounts of data, but analyzing that data to determine whether the system is moving toward the boundaries of safe behavior is difficult. The use of an accident model based on system theory and the basic concept of safety constraints may provide directions for identifying appropriate safety metrics and leading indicators; determining whether control over the safety constraints is adequate; evaluating the assumptions about the technical failures and potential design errors, organizational structure, and human behavior underlying the hazard analysis; detecting errors in the operational and environmental assumptions underlying the design and the organizational culture; and identifying any maladaptive changes over time that could increase risk of accidents to unacceptable levels.

Finally, STAMP points the way to very different approaches to risk assessment. Currently, risk assessment is firmly rooted in the probabilistic analysis of failure events. Attempts to extend current P R A techniques to software and other new technology, to management, and to cognitively complex human control activities have been disappointing. This way forward may lead to a dead end. Significant progress in risk assessment for complex systems will require innovative approaches starting from a completely different theoretical foundation.
\ No newline at end of file
diff --git a/chapter05.raw b/chapter05.raw
new file mode 100644
index 0000000..77cf1bd
--- /dev/null
+++ b/chapter05.raw
@@ -0,0 +1,1425 @@
chapter 5.

A Friendly Fire Accident.
The goal of STAMP is to assist in understanding why accidents occur and to use that understanding to create new and better ways to prevent losses. This chapter and several of the appendices provide examples of how STAMP can be used to analyze and understand accident causation. The particular examples were selected to demonstrate the applicability of STAMP to very different types of systems and industries. A process, called CAST (Causal Analysis based on STAMP), is described in chapter 11 to assist in performing the analysis.

This chapter delves into the causation of the loss of a U.S. Army Black Hawk helicopter and all its occupants from friendly fire by a U.S. Air Force F-15 over northern Iraq in 1994. This example was chosen because the controversy and multiple viewpoints and books about the shootdown provide the information necessary to create most of the STAMP analysis. Accident reports often leave out important causal information (as did the official accident report in this case). Because of the nature of the accident, most of the focus is on operations. Appendix B presents an example of an accident where engineering development plays an important role. Social issues involving public health are the focus of the accident analysis in appendix C.

section 5.1.
Background.
After the Persian Gulf War, Operation Provide Comfort (OPC) was created as a multinational humanitarian effort to relieve the suffering of hundreds of thousands of Kurdish refugees who fled into the hills of northern Iraq during the war. The goal of the military efforts was to provide a safe haven for the resettlement of the refugees and to ensure the security of relief workers assisting them.
The formal mission +statement for OPC read: “To deter Iraqi behavior that may upset peace and order +in northern Iraq.” +In addition to operations on the ground, a major component of OPC’s mission +was to occupy the airspace over northern Iraq. To accomplish this task, a no-fly zone + + +(also called the TAOR or Tactical Area of Responsibility) was established that +included all airspace within Iraq north of the 36th Parallel (see figure 5.1). Air +operations were led by the Air Force to prohibit Iraqi aircraft from entering the +no-fly zone while ground operations were organized by the Army to provide human- +itarian assistance to the Kurds and other ethnic groups in the area. +U.S., Turkish, British, and French fighter and support aircraft patrolled the no-fly +zone daily to prevent Iraqi warplanes from threatening the relief efforts. The mission +of the Army helicopters was to support the ground efforts; the Army used them +primarily for troop movement, resupply, and medical evacuation. +On April 15, 1994, after nearly three years of daily operations over the TAOR +(Tactical Area of Responsibility), two U.S. Air Force F-15’s patrolling the area shot +down two U.S. Army Black Hawk helicopters, mistaking them for Iraqi Hind heli- +copters. The Black Hawks were carrying twenty-six people, fifteen U.S. citizens and +eleven others, among them British, French, and Turkish military officers as well as +Kurdish citizens. All were killed in one of the worst air-to-air friendly fire accidents +involving U.S. aircraft in military history. +All the aircraft involved were flying in clear weather with excellent visibility, an +AWACS (Airborne Warning and Control System) aircraft was providing surveil- +lance and control for the aircraft in the area, and all the aircraft were equipped with +electronic identification and communication equipment (apparently working prop- +erly) and flown by decorated and highly experienced pilots. + +The hazard being controlled was mistaking a “friendly” (coalition) aircraft for a +threat and shooting at it. This hazard, informally called friendly fire, was well known, +and a control structure was established to prevent it. Appropriate constraints were +established and enforced at each level, from the Joint Chiefs of Staff down to the +aircraft themselves. Understanding why this accident occurred requires understand- +ing why the control structure in place was ineffective in preventing the loss. Prevent- +ing future accidents involving the same control flaws requires making appropriate +changes to the control structure, including establishing monitoring and feedback +loops to detect when the controls are becoming ineffective and the system is migrat- +ing toward an accident, that is, moving toward a state of increased risk. The more +comprehensive the model and factors identified, the larger the class of accidents +that can be prevented. +For this STAMP example, information about the accident and the control struc- +ture was obtained from the original accident report [5], a GAO (Government +Accountability Office) report on the accident investigation process and results [200], +and two books on the shootdown—one originally a Ph.D. dissertation by Scott +Snook [191] and one by Joan Piper, the mother of one of the victims [159]. Because +of the extensive existing analysis, much of the control structure (shown in figure 5.3) +can be reconstructed from these sources. A large number of acronyms are used in +this chapter. They are defined in figure 5.2. 
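+Because the analysis that follows repeatedly asks whether each control loop
+in this structure had both a control channel and a feedback channel, it may
+help to see that question in schematic form. The fragment below is a
+deliberately simplified sketch in Python: the component names abbreviate the
+control structure, and the channels listed are assumptions made for the
+example, not the actual communication paths.
+
+control_structure = {
+    # controller: (controlled components, feedback channels back to it)
+    "National Command Authority": (["Combined Task Force"], ["reports"]),
+    "Combined Task Force":        (["CFAC", "MCC"], ["mission reports"]),
+    "CFAC":                       (["AWACS crew", "F-15 pilots"], ["radio"]),
+    "MCC":                        (["Black Hawk pilots"], []),  # assumed gap
+}
+
+def loops_without_feedback(structure):
+    # A control loop with no feedback channel cannot detect that its
+    # controls are becoming ineffective, that is, that the system is
+    # migrating toward a state of increased risk.
+    return [controller
+            for controller, (subordinates, feedback) in structure.items()
+            if subordinates and not feedback]
+
+print(loops_without_feedback(control_structure))   # -> ['MCC']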
+
+section 5.2.
+The Hierarchical Safety Control Structure to Prevent Friendly Fire Accidents.
+National Command Authority and Commander-in-Chief Europe
+When the National Command Authority (the President and Secretary of Defense)
+directed the military to conduct Operation Provide Comfort, the U.S. Commander
+in Chief Europe (USCINCEUR) directed the creation of Combined Task Force
+(CTF) Provide Comfort.
+A series of orders and plans established the general command and control struc-
+ture of the CTF. These orders and plans also transmitted sufficient authority and
+guidance to subordinate component commands and operational units so that they
+could then develop the local procedures that were necessary to bridge the gap
+between general mission orders and specific subunit operations.
+At the top of the control structure, the National Command Authority (the Presi-
+dent and Secretary of Defense, who operate through the Joint Chiefs of Staff)
+provided guidelines for establishing Rules of Engagement (ROE). ROE govern the
+actions allowed by U.S. military forces to protect themselves and other personnel
+and property against attack or hostile incursion and specify a strict sequence of
+procedures to be followed prior to any coalition aircraft firing its weapons. They are
+based on legal, political, and military considerations and are intended to provide for
+adequate self-defense to ensure that military activities are consistent with current
+national objectives and that appropriate controls are placed on combat activities.
+Commanders establish ROE for their areas of responsibility that are consistent with
+the Joint Chiefs of Staff guidelines, modifying them for special operations and for
+changing conditions.
+Because the ROE dictate how hostile aircraft or military threats are treated,
+they play an important role in any friendly fire accidents. The ROE in force for
+OPC were the peacetime ROE for the United States European Command with
+OPC modifications approved by the National Command Authority. These conserva-
+tive ROE required a strict sequence of procedures to be followed prior to any
+coalition aircraft firing its weapons. The less aggressive peacetime rules of engage-
+ment were used even though the area had been designated a combat zone because
+of the number of countries involved in the joint task force. The goal of the ROE
+was to slow down any military confrontation in order to prevent the type of friendly
+fire accidents that had been common during Operation Desert Storm. Understand-
+ing the reasons for the shootdown of the Black Hawk helicopters requires under-
+standing why the ROE did not provide an effective control to prevent friendly fire
+accidents.
+Three System-Level Safety Constraints Related to This Accident:
+1. The NCA and USCINCEUR must establish a command and control structure
+that provides the ability to prevent friendly fire accidents.
+2. The guidelines for ROE generated by the Joint Chiefs of Staff (with tailoring
+to suit specific operational conditions) must be capable of preventing friendly
+fire accidents in all types of situations.
+3. The European Commander-in-Chief must review and monitor operational
+plans generated by the Combined Task Force, ensure they are updated as the
+mission changes, and provide the personnel required to carry out the plans.
+Controls: The controls in place included the ROE guidelines, the operational
+orders, and review procedures for the controls (e.g., the actual ROE and Operational
+Plans) generated at the control levels below.
+Combined Task Force (CTF)
+The components of the Combined Task Force (CTF) organization relevant to the
+accident (and to preventing friendly fire) were a Combined Task Force staff, a Com-
+bined Forces Air Component (CFAC), and an Army Military Coordination Center.
+The Air Force fighter aircraft were co-located with CTF Headquarters and CFAC
+at Incirlik Air Base in Turkey while the U.S. Army helicopters were located with the
+Army headquarters at Diyarbakir, also in Turkey (see figure 5.1).
+The Combined Task Force had three components under it (figure 5.3):
+1.
The Military Coordination Center (MCC) monitored conditions in the security +zone and had operational control of Eagle Flight helicopters (the Black +Hawks), which provided general aviation support to the MCC and the CTF. +2. The Joint Special Operations Component (JSOC) was assigned primary +responsibility to conduct search-and-rescue operations should any coalition +aircraft go down inside Iraq. +3. The Combined Forces Air Component (CFAC) was tasked with exercising +tactical control of all OPC aircraft operating in the Tactical Area of Respon- +sibility (TAOR) and operational control over Air Force aircraft.1 The CFAC +commander exercised daily control of the OPC flight mission through a Direc- +tor of Operations (CFAC/DO), as well as a ground-based Mission Director at +the Combined Task Force (CTF) headquarters in Incirlik and an Airborne +Command Element (ACE) aboard the AWACS. +Operational orders were generated at the European Command level of authority +that defined the initial command and control structure and directed the CTF +commanders to develop an operations plan to govern OPC. In response, the CTF +commander created an operations plan in July 1991 delineating the command rela- +tionships and organizational responsibilities within the CTF. In September 1991, the +U.S. Commander-in-Chief, Europe, modified the original organizational structure in +response to the evolving mission in northern Iraq, directing an increase in the size +of the Air Force and the withdrawal of a significant portion of the ground forces. +The CTF was ordered to provide a supporting plan to implement the changes +necessary in their CTF operations plan. The Accident Investigation Board found +that although an effort was begun in 1991 to revise the operations plan, no evidence +could be found in 1994 to indicate that the plan was actually updated to reflect the +change in command and control relationships and responsibilities. The critical +element of the plan with respect to the shootdown was that the change in mission +led to the departure of an individual key to the communication between the Air +Force and Army, without his duties being assigned to someone else. This example +of asynchronous evolution plays a role in the loss. + +footnote. Tactical control involves a fairly limited scope of authority, that is, the detailed and usually local direc- +tion and control of movement and maneuvers necessary to accomplish the assigned mission. Operational +control, on the other hand, involves a broader authority to command subordinate forces, assign tasks, +designate objectives, and give the authoritative direction necessary to accomplish the mission. + + + +Command-Level Safety Constraints Related to the Accident: +1. Rules of engagement and operational orders and plans must be established at +the command level that prevent friendly fire accidents. The plans must include +allocating responsibility and establishing and monitoring communication +channels to allow for coordination of flights into the theater of action. +2. Compliance with the ROE and operational orders and plans must be moni- +tored. Alterations must be made in response to changing conditions and +changing mission. +Controls: The controls included the ROE and operational plans plus feedback +mechanisms on their effectiveness and application. +CFAC and MCC +The two parts of the Combined Task Force involved in the accident were the Army +Military Coordination Center (MCC) and the Air Force Combined Forces Air +Component (CFAC). 
+The shootdown obviously involved a communication failure: the F-15 pilots did +not know the U.S. Army Black Hawks were in the area or that they were targeting +friendly aircraft. Problems in communication between the three services (Air Force, +Army, and Navy) are legendary. Procedures had been established to attempt to +eliminate these problems in Operation Provide Comfort. +The Military Coordination Center (MCC) coordinated land and U.S. helicopter +missions that supported the Kurdish people. In addition to providing humanitarian +relief and protection to the Kurds, another important function of the Army detach- +ment was to establish an ongoing American presence in the Kurdish towns and +villages by showing the U.S. flag. This U.S. Army function was supported by a +helicopter detachment called Eagle Flight. +All CTF components, with the exception of the Army Military Coordination +Center lived and operated out of Incirlik Air Base in Turkey. The MCC operated +out of two locations. A forward headquarters was located in the small village of +Zakhu (see figure 5.1), just inside Iraq. Approximately twenty people worked in +Zakhu, including operations, communications, and security personnel, medics, trans- +lators, and coalition chiefs. Zakhu operations were supported by a small administra- +tive contingent working out of Pirinclik Air Base in Diyarbakir, Turkey. Pirinclik is +also where the Eagle Flight Platoon of UH-60 Black Hawk helicopters was located. +Eagle Flight helicopters made numerous (usually daily) trips to Zakhu to support +MCC operations. +The Combined Forces Air Component (CFAC) Commander was responsible for +coordinating the employment of all air operations to accomplish the OPC mission. +He was delegated operational control of the Airborne Warning and Control System + + +(AWACS), U.S. Air Force (USAF) airlift, and the fighter forces. He had tactical +control of the U.S. Army, U.S. Navy, Turkish, French, and British fixed wing and +helicopter aircraft. The splintering of control between the CFAC and MCC com- +manders, along with communication problems between them, were major contribu- +tors to the accident. +In a complex coordination problem of this sort, communication is critical. Com- +munications were implemented through the Joint Operations and Intelligence +Center (JOIC). The JOIC received, delivered, and transmitted communications up, +down, and across the CTF control structure. No Army liaison officer was assigned +to the JOIC, but one was available on request to provide liaison between the MCC +helicopter detachment and the CTF staff. +To prevent friendly fire accidents, pilots need to know exactly what friendly air- +craft are flying in the no-fly zone at all times as well as know and follow the ROE +and other procedures for preventing such accidents. The higher levels of control +delegated the authority and guidance to develop local procedures2 to the CTF level +and below. These local procedures included: +•Airspace Control Order (ACO): The ACO contains the authoritative guidance +for all local air operations in OPC. It covers such things as standard altitudes +and routes, air refueling procedures, recovery procedures, airspace deconfliction +responsibilities, and jettison procedures. The deconfliction procedures were a +way to prevent interactions between aircraft that might result in accidents. 
For
+the Iraqi TAOR, fighter aircraft, which usually operated at high altitudes, were
+to stay above 10,000 feet above ground level while helicopters, which normally
+conducted low-altitude operations, were to stay below 400 feet. All flight crews
+were responsible for reviewing and complying with the information contained
+in the ACO. The CFAC Director of Operations was responsible for publishing
+the guidance, including the Airspace Control Order, for conducting OPC
+missions.
+•Aircrew Read Files (ARFs): The Aircrew Read Files supplement the ACOs
+and are also required reading by all flight crews. They contain the classified
+rules of engagement (ROE), changes to the ACO, and recent amplification of
+how local commanders want air missions executed.
+•Air Tasking Orders (ATOs): While the ACO and ARFs contain general infor-
+mation that applies to all aircraft in OPC, specific mission guidance was pub-
+lished in the daily ATOs. They contained the daily flight schedule, radio
+frequencies to be used, IFF codes (used to identify an aircraft as friend or foe),
+and other late-breaking information necessary to fly on any given day. All air-
+craft are required to have a hard copy of the current ATO with Special Instruc-
+tions (SPINS) on board before flying. Each morning around 11:30 (1130 hours,
+in military time), the mission planning cell (or Frag shop) publishes the ATO for
+the following day, and copies are distributed to all units by late afternoon.
+•Battle Staff Directives (BSDs): Any late scheduling changes that do not make
+it onto the ATO are published in last-minute Battle Staff Directives, which are
+distributed separately and attached to all ATOs prior to any missions flying the
+next morning.
+•Daily Flowsheets: Military pilots fly with a small clipboard attached to their
+knees. These kneeboards contain boiled-down reference information essential
+to have handy while flying a mission, including the daily flowsheet and radio
+frequencies. The flowsheets are graphical depictions of the chronological flow
+of aircraft scheduled into the no-fly zone for that day. Critical information is
+taken from the ATO, translated into timelines, and reduced on a copier to
+provide pilots with a handy in-flight reference.
+•Local Operating Procedures and Instructions, Standard Operating Procedures,
+Checklists, and so on: In addition to written material, real-time guidance is
+provided to pilots after taking off via radio through an unbroken command
+chain that runs from the OPC Commanding General, through the CFAC,
+through the mission director, through an Airborne Command Element (ACE)
+on board the AWACS, and ultimately to pilots.
+The CFAC commander of operations was responsible for ensuring that aircrews
+were informed of all unique aspects of the OPC mission, including the ROE, upon
+their arrival. He was also responsible for publishing the Aircrew Read File (ARF),
+the Airspace Control Order (ACO), the daily Air Tasking Order, and mission-
+related special instructions (SPINS).
+
+footnote. The term procedures as used in the military denotes standard and detailed courses of action that
+describe how to perform a task.
+
+Safety Constraints Related to the Accident:
+1. Coordination and communication among all flights into the TAOR must be
+established. Procedures must be established for determining who should be
+and is in the TAOR at all times.
+2.
Procedures must be instituted and monitored to ensure that all aircraft in the
+TAOR are tracked and fighters are aware of the location of all friendly aircraft
+in the TAOR.
+3. The ROE must be understood and followed by those at lower levels.
+4. All aircraft must be able to communicate effectively in the TAOR.
+
+Controls: The controls in place included the ACO, ARFs, flowsheets, intelligence
+and other briefings, training (on the ROE, on aircraft identification, etc.), AWACS
+procedures for identifying and tracking aircraft, established radio frequencies and
+radar signals for the no-fly zone, a chain of command (OPC Commander to Mission
+Director to ACE to pilots), disciplinary actions for those not following the written
+rules, and a group (the JOIC) responsible for ensuring effective communication
+occurred.
+Mission Director and Airborne Command Element
+The Airborne Command Element (ACE) flies in the AWACS and is the com-
+mander’s representative in the air, armed with up-to-the-minute situational infor-
+mation to make time-critical decisions. The ACE monitors all air operations and
+is in direct contact with the Mission Director located in the ground command
+post. He must also interact with the AWACS crew to identify reported unidentified
+aircraft.
+The ground-based Mission Director maintains constant communication links
+with both the ACE up in the AWACS and with the CFAC commander on the
+ground. The Mission Director must inform the OPC commander immediately if
+anything happens over the no-fly zone that might require a decision by the com-
+mander or his approval. Should the ACE run into any situation that would involve
+committing U.S. or coalition forces, the Mission Director will communicate with him
+to provide command guidance. The Mission Director is also responsible for making
+weather-related decisions, implementing safety procedures, scheduling aircraft, and
+ensuring that the ATO is executed correctly.
+The ROE in place at the time of the shootdown stated that aircrews experiencing
+unusual circumstances were to pass details to the ACE or AWACS, who would
+provide guidance on the appropriate response [200]. Exceptions were possible, of
+course, in cases of imminent threat. Aircrews were directed to first contact the ACE
+and, if that individual was unavailable, to then contact the AWACS. The six unusual
+circumstances/occurrences to be reported, as defined in the ROE, included “any
+intercept run on an unidentified aircraft.” As stated, the ROE were specifically
+designed to slow down a potential engagement to allow time for those in the chain
+of command to check things out.
+Although the written guidance was clear, there was controversy with respect to
+how it was or should have been implemented and who had decision-making author-
+ity. Conflicting testimony during the investigation of the shootdown about respon-
+sibility may either reflect after-the-fact attempts to justify actions or may instead
+reflect real confusion on the part of everyone, including those in charge, as to where
+the responsibility lay—perhaps a little of both.
+
+Safety Constraints Related to the Accident:
+1. The ACE and MD must follow procedures specified and implied by the
+ROE.
+2. The ACE must ensure that pilots follow the ROE.
+3. The ACE must interact with the AWACS crew to identify reported unidenti-
+fied aircraft.
+Controls: Controls to enforce the safety constraints included the ROE to provide
+overall principles for decision-making and to slow down engagements in order to
+prevent individual error or erratic behavior, the ACE up in the AWACS to augment
+communication by getting up-to-the-minute information about the state of the
+TAOR airspace and communicating with the pilots and AWACS crews, and the
+Mission Director on the ground to provide a chain of command from the pilots to
+the CFAC commander for real-time decision making.
+AWACS Controllers
+The AWACS (Airborne Warning and Control System) acts as an air traffic control
+tower in the sky. The AWACS OPC mission was to:
+1. Control aircraft en route to and from the no-fly zone
+2. Coordinate air refueling (for the fighter aircraft and the AWACS itself)
+3. Provide airborne threat warning and control for all OPC aircraft operating
+inside the no-fly zone
+4. Provide surveillance, detection, and identification of all unknown aircraft
+An AWACS is a modified Boeing 707, with a saucer-shaped radar dome on the top,
+equipped inside with powerful radars and radio equipment that scan the sky for
+aircraft. A computer takes raw data from the radar dome, processes it, and ultimately
+displays tactical information on fourteen color consoles arranged in rows of three
+throughout the rear of the aircraft. AWACS have the capability to track approxi-
+mately one thousand enemy aircraft at once while directing one hundred friendly
+ones [159].
+The AWACS carries a flight crew (pilot, copilot, navigator, and flight engineer)
+responsible for safe ground and flight operation of the AWACS aircraft and a
+mission crew that has overall responsibility for the AWACS command, control,
+surveillance, communications, and sensor systems.
+The mission crew of approximately nineteen people is under the direction of
+a mission crew commander (MCC). The MCC has overall responsibility for the
+AWACS mission and the management, supervision, and training of the mission crew.
+The mission crew members were divided into three sections:
+1. Technicians: The technicians are responsible for operating, monitoring, and
+maintaining the physical equipment on the aircraft.
+2. Surveillance: The surveillance section is responsible for the detection, track-
+ing, identification, height measurement, display, and recording of surveillance
+data. As unknown targets appear on the radarscopes, surveillance technicians
+follow a detailed procedure to identify the tracks. They are responsible for
+handling unidentified and non-OPC aircraft detected by the AWACS elec-
+tronic systems. The section is supervised by the air surveillance officer, and the
+work is carried out by an advanced air surveillance technician and three air
+surveillance technicians.
+3. Weapons: The weapons controllers are supervised by the senior director
+(SD). This section is responsible for the control of all assigned aircraft and
+weapons systems in the TAOR. The SD and three weapons directors are
+together responsible for locating, identifying, tracking, and controlling all
+friendly aircraft flying in support of OPC. Each weapons director was assigned
+responsibility for a specific task:
+•The enroute controller controlled the flow of OPC aircraft to and from the
+TAOR. This person also conducted radio and IFF checks on friendly aircraft
+outside the TAOR.
+•The TAOR controller provided threat warning and tactical control for all
+OPC aircraft within the TAOR.
+•
+The tanker controller coordinated all air refueling operations (and played no +part in the accident so is not mentioned further). +To facilitate communication and coordination, the SD’s console was physically +located in the “pit” right between the MCC and the ACE (Airborne Command +Element). Through internal radio nets, the SD synchronized the work of the +weapons section with that of the surveillance section. He also monitored and coor- +dinated the actions of his weapons directors to meet the demands of both the ACE +and MCC. +Because those who had designed the control structure recognized the potential +for some distance to develop between the training of the AWACS crew members +and the continually evolving practice in the no-fly zone (another example of asyn- +chronous evolution of the safety control structure), they had instituted a control by +creating staff or instructor personnel permanently stationed in Turkey. Their job was +to help provide continuity for U.S. AWACS crews who rotated through OPC on +temporary duty status, usually for thirty-day rotations. This shadow crew flew with +each new AWACS crew on their first mission in the TAOR to alert them as to how +things were really done in OPC. Their job was to answer any questions the new crew + +might have about local procedures, recent occurrences, or changes in policy or inter- +pretation that had come about since the last time they had been in the theater. +Because the accident occurred on the first day for a new AWACS crew, instructor +or staff personnel were also on board. +In addition to all these people, a Turkish controller flew on all OPC missions to +help the crew interface with local air traffic control systems. +The AWACS typically takes off from Incirlik AFB approximately two hours +before the first air refueling and fighter aircraft. Once the AWACS is airborne, the +systems of the AWACS are brought on line, and a Joint Tactical Information Distri- +bution System (JTIDS3) link is established with a Turkish Sector Operations Center +(radar site). After the JTIDS link is confirmed, the CFAC airborne command +element (ACE) initiates the planned launch sequence for the rest of the force. +Normally, within a one-hour period, tanker and fighter aircraft take off and proceed +to the TAOR in a carefully orchestrated flow. Fighters may not cross the political +border into Iraq without AWACS coverage. + + +footnote. The Joint Tactical Information Distribution System acts as a central component of the mission +command and control system, providing ground commanders with a real-time downlink of the current +air picture from AWACS. This information is then integrated with data from other sources to provide +commanders with a more complete picture of the situation. + + +Safety Constraints Related to the Accident: +1. The AWACS mission crew must identify and track all aircraft in the TAOR. +Friendly aircraft must not be identified as a threat (hostile). +2. The AWACS mission crew must accurately inform fighters about the status of +all tracked aircraft when queried. +3. The AWACS mission crew must alert aircraft in the TAOR to any coalition +aircraft not appearing on the flowsheet (ATO). +4. The AWACS crew must not fail to warn fighters about any friendly aircraft +the fighters are targeting. +5. The JTIDS must provide the ground with an accurate picture of the airspace +and its occupants. 
+Controls: Controls included procedures for identifying and tracking aircraft, train- +ing (including simulator missions), briefings, staff controllers, and communication +channels. The SD and ASO provided real-time oversight of the crew’s activities. +Pilots +Fighter aircraft, flying in formations of two and four aircraft, must always have a +clear line of command. In the two-aircraft formation involved in the accident, the + + +lead pilot is completely in charge of the flight and the wingman takes all of his com- +mands from the lead. +The ACO (Airspace Control Order) stipulates that fighter aircraft may not cross +the political border into Iraq without AWACS coverage and no aircraft may enter +the TAOR until fighters with airborne intercept (AI) radars have searched the +TAOR for Iraqi aircraft. Once the AI radar-equipped aircraft have “sanitized” the +no-fly zone, they establish an orbit and continue their search for Iraqi aircraft and +provide air cover while other aircraft are in the area. When they detect non-OPC +aircraft, they are to intercept, identify, and take appropriate action as prescribed by +the rules of engagement (ROE) and specified in the ACO. +After the area is sanitized, additional fighters and tankers flow to and from the +TAOR throughout the six- to eight-hour daily flight schedule. This flying window is +randomly selected to avoid predictability. +Safety Constraints Related to the Accident: +1. Pilots must know and follow the rules of engagement established and com- +municated from the levels above. +2. Pilots must know who is in the no-fly zone at all times and whether they should +be there or not, i.e., they must be able to accurately identify the status of all +other aircraft in the no-fly zone at all times and must not misidentify a friendly +aircraft as a threat. +3. Pilots of aircraft in the area must be able to hear radio communications. +4. Fixed-wing aircraft must fly above 10,000 feet and helicopters must remain +below 400 feet. +Controls: Controls included the ACO, the ATO, flowsheets, radios, IFF, the ROE, +training, the AWACS, procedures to keep fighters and helicopters from coming into +contact (for example, they fly at different altitudes), and special tactical radio fre- +quencies when operating in the TAOR. Flags were displayed prominently on all +aircraft in order to identify their origin. +Communication: Communication is important in preventing friendly fire acci- +dents. The U.S. Army Black Hawk helicopters carried a full array of standard avion- +ics, radio, IFF, and radar equipment as well as communication equipment consisting +of FM, UHF, and VHF radios. Each day the FM and UHF radios were keyed with +classified codes to allow pilots to talk secure in encrypted mode. The ACO directed +that special frequencies were to be used when flying inside the TAOR. +Due to the line-of-sight limitations of their radios, the high mountainous terrain +in northern Iraq, and the fact that helicopters tried to fly at low altitudes to use the +terrain to mask them from enemy air defense radars, all Black Hawk flights into the + + +no-fly zone also carried tactical satellite radios (TACSATs). These TACSATS were +used to communicate with MCC operations. The helicopters had to land to place the +TACSATs in operation; they cannot be operated from inside a moving helicopter. 
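+Every control above that depends on the radios assumes that a channel
+actually exists between the two ends, and a channel exists only when both
+endpoints match. The sketch below makes that point schematically; the field
+names and values are invented for the example, and the real radios are
+considerably more complicated.
+
+def can_communicate(a, b, line_of_sight=True):
+    # Both ends must share a frequency and an anti-jam mode; hopping
+    # radios must additionally share the key that synchronizes their
+    # hopping pattern. Terrain can break the channel even when the
+    # radios themselves agree.
+    if not line_of_sight:
+        return False
+    if a["frequency"] != b["frequency"] or a["hopping"] != b["hopping"]:
+        return False
+    return (not a["hopping"]) or a["hop_key"] == b["hop_key"]
+
+f15 = {"frequency": "TAOR", "hopping": True, "hop_key": "daily-key"}
+uh60 = {"frequency": "enroute", "hopping": False, "hop_key": None}
+
+print(can_communicate(f15, uh60))   # False: different frequencies
+uh60["frequency"] = "TAOR"
+print(can_communicate(f15, uh60))   # still False: anti-jam modes differ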
+The F-15’s were equipped with avionics, communications, and electronic equip- +ment similar to that on the Black Hawks, except that the F-15’s were equipped with +HAVE QUICK II (HQ-II) frequency-hopping radios while the helicopters were +not. HQ-II defeated most enemy attempts to jam transmissions by changing fre- +quencies many times per second. Although the F-15 pilots preferred to use the more +advanced HQ technology, the F-15 radios were capable of communicating in a clear, +non-HQ-II mode. The ACO directed that F-15s use the non-HQ-II frequency when +specified aircraft that were not HQ-II capable flew in the TAOR. One factor involved +in the accident was that Black Hawk helicopters (UH-60s) were not on the list of +non-HQ-II aircraft that must be contacted using a non-HQ-II mode. +Identification: Identification of aircraft was assisted by systems called AAI/IFF +(electronic Air-to-Air Interrogation/Identification Friend or Foe). Each coalition +aircraft was equipped with an IFF transponder. Friendly radars (located in the +AWACS, a fighter aircraft, or a ground site) execute what is called a parrot check +to determine if the target being reflected on their radar screens is friendly or hostile. +The AAI component (the interrogator) sends a signal to an airborne aircraft to +determine its identity, and the IFF component answers or squawks back with a +secret code—a numerically identifying pulse that changes daily and must be uploaded +into aircraft using secure equipment prior to takeoff. If the return signal is valid, it +appears on the challenging aircraft’s visual display (radarscope). A compatible code +has to be loaded into the cryptographic system of both the challenging and the +responding aircraft to produce a friendly response. +An F-15’s AAI/IFF system can interrogate using four identification signals or +modes. The different types of IFF signals provide a form of redundancy. Mode I is +a general identification signal that permits selection of 32 codes. Two Mode I codes +were designated for use in OPC at the time of the accident: one for inside the TAOR +and the other for outside. Mode II is an aircraft-specific identification mode allowing +the use of 4,096 possible codes. Mode III provides a nonsecure friendly identification +of both military and civilian aircraft and was not used in the TAOR. Mode IV is +secure and provides high-confidence identification of friendly targets. According to +the ACO, the primary means of identifying friendly aircraft in the Iraqi no-fly zone +were to be modes I and IV in the IFF interrogation process. +Physical identification is also important in preventing friendly fire accidents. +The ROE require that the pilots perform a visual identification of the potential +threat. To assist in this identification, the Black Hawks were marked with six two- +by-three-foot American flags. An American flag was painted on each door, on both + + +sponsons,4 on the nose, and on the belly of each helicopter [159]. A flag had been +added to the side of each sponson because the Black Hawks had been the target of +small-arms ground fire several months before. + +footnote. Sponsons are auxiliary fuel tanks. + + +section 5.3. + +The Accident Analysis Using STAMP. +With all these controls and this elaborate control structure to protect against friendly +fire accidents, which was a well-known hazard, how could the shootdown occur on +a clear day with all equipment operational? 
As the Chairman of the Joint Chiefs of
+Staff said after the accident:
+In place were not just one, but a series of safeguards—some human, some procedural,
+some technical—that were supposed to ensure an accident of this nature could never
+happen. Yet, quite clearly, these safeguards failed.5
+Using STAMP to understand why this accident occurred and to learn how to prevent
+such losses in the future requires determining why these safeguards were not suc-
+cessful in preventing the friendly fire. Various explanations for the accident have
+been posited. Making sense out of these conflicting explanations and understanding
+the accident process involved, including not only failures of individual system com-
+ponents but the unsafe interactions and miscommunications between components,
+requires understanding the role played in this process by each of the elements of
+the safety control structure in place at the time.
+The next section contains a description of the proximate events involved in the
+loss. Then the STAMP analysis providing an explanation of why these events
+occurred is presented.
+
+footnote. John Shalikashvili, chairman of the Joint Chiefs of Staff, from a cover letter to the twenty-one-volume
+report of the Aircraft Accident Investigation Board, 1994a, page 1.
+
+section 5.3.1. Proximate Events.
+Figure 5.4, taken from the official Accident Investigation Board Report, shows a
+timeline of the actions of each of the main actors in the proximate events—the
+AWACS, the F-15s, and the Black Hawks. It may also be helpful to refer back to
+figure 5.1, which contains a map of the area showing the relative locations of the
+important activities.
+After receiving a briefing on the day’s mission, the AWACS took off from Incirlik
+Air Base. When they arrived on station and started to track aircraft, the AWACS
+surveillance section noticed unidentified radar returns (from the Black Hawks). A
+“friendly general” track symbol was assigned to the aircraft and labeled as H,
+denoting a helicopter. The Black Hawks (Eagle Flight) later entered the TAOR
+(no-fly zone) through Gate 1, checked in with the AWACS controllers who anno-
+tated the track with the identifier EE01, and flew to Zakhu. The Black Hawk pilots
+did not change their IFF (Identify Friend or Foe) Mode I code: The code for all
+friendly fixed-wing aircraft flying in Turkey on that day was 42, and the code for the
+TAOR was 52. They also remained on the enroute radio frequency instead of chang-
+ing to the frequency to be used in the TAOR. When the helicopters landed at Zakhu,
+their radar and IFF returns on the AWACS radarscopes
+faded. Thirty minutes later, Eagle Flight reported their departure from Zakhu to
+the AWACS and said they were enroute from Whiskey (code name for Zakhu) to
+Lima (code name for Irbil, a town deep in the TAOR). The enroute controller
+reinitiated tracking of the helicopters.
+Two F-15s were tasked that day to be the first aircraft in the TAOR and to sanitize
+it (check for hostile aircraft) before other coalition aircraft entered the area. The
+F-15s reached their final checkpoint before entering the TAOR approximately an
+hour after the helicopters had entered. They turned on all combat systems, switched
+their IFF Mode I code from 42 to 52, and switched to the TAOR radio frequency.
+They reported their entry into the TAOR to the AWACS.
+At this point, the Black Hawks’ radar and IFF contacts faded as the helicopters
+entered mountainous terrain.
The AWACS computer continued to move the heli- +copter tracks on the radar display at the last known speed and direction, but +the identifying H symbol (for helicopter) on the track was no longer displayed. The +ASO placed an “attention arrow” (used to point out an area of interest) on the +SD’s scope at the point of the Black Hawk’s last known location. This large arrow +is accompanied by a blinking alert light on the SD’s console. The SD did not +acknowledge the arrow and after sixty seconds, both the arrow and the light +were automatically dropped. The ASO then adjusted the AWACS radar to detect +slow-moving objects. +Before entering the TAOR, the lead F-15 pilot checked in with the ACE and was +told there were no relevant changes from previously briefed information (“negative +words”). Five minutes later, the F-15’s entered the TAOR, and the lead pilot reported +their arrival to the TAOR controller. One minute later, the enroute controller finally +dropped the symbol for the helicopters from the scope, the last remaining visual +reminder that there were helicopters inside the TAOR. +Two minutes after entering the TAOR, the lead F-15 picked up hits on its instru- +ments indicating that it was getting radar returns from a low and slow-flying aircraft. +The lead F-15 pilot alerted his wingman and then locked onto the contact and used +the F-15’s air-to-air interrogator to query the target’s IFF code. If it was a coalition +aircraft, it should be squawking Mode I, code 52. The scope showed it was not. He +reported the radar hits to the controllers in the AWACS, and the TAOR controller + + +told him they had no radar contacts in that location (“clean there”). The wing pilot +replied to the lead pilot’s alert, noting that his radar also showed the target. +The lead F-15 pilot then switched the interrogation to the second mode (Mode +IV) that all coalition aircraft should be squawking. For the first second it showed +the right symbol, but for the rest of the interrogation (4 to 5 seconds) it said the +target was not squawking Mode IV. The lead F-15 pilot then made a second contact +call to the AWACS over the main radio, repeating the location, altitude, and heading +of his target. This time the AWACS enroute controller responded that he had radar +returns on his scope at the spot (“hits there”) but did not indicate that these returns +might be from a friendly aircraft. At this point, the Black Hawk IFF response was +continuous but the radar returns were intermittent. The enroute controller placed +an “unknown, pending, unevaluated” track symbol in the area of the helicopter’s +radar and IFF returns and attempted to make an IFF identification. +The lead F-15 pilot, after making a second check of Modes I and IV and again +receiving no response, executed a visual identification pass to confirm that the target +was hostile—the next step required in the rules of engagement. He saw what he +thought were Iraqi helicopters. He pulled out his “goody book” with aircraft pictures +in it, checked the silhouettes, and identified the helicopters as Hinds, a type of +Russian aircraft flown by the Iraqis (“Tally two Hinds”). The F-15 wing pilot also +reported seeing two helicopters (“Tally two”), but never confirmed that he had +identified them as Hinds or as Iraqi aircraft. +The lead F-15 pilot called the AWACS and said they were engaging enemy air- +craft (“Tiger Two6 has tallied two Hinds, engaged”), cleared his wingman to shoot +(“Arm hot”), and armed his missiles. 
He then did one final Mode I check, received +a negative response, and pressed the button that released the missiles. The wingman +fired at the other helicopter, and both were destroyed. +This description represents the chain of events, but it does not explain “why” the +accident occurred except at the most superficial level and provides few clues as to +how to redesign the system to prevent future occurrences. Just looking at these basic +events surrounding the accident, it appears that mistakes verging on gross negli- +gence were involved—undisciplined pilots shot down friendly aircraft in clear skies, +and the AWACS crew and others who were supposed to provide assistance simply +sat and watched without telling the F-15 pilots that the helicopters were there. An +analysis using STAMP, as will be seen, provides a very different level of understand- +ing. In the following analysis, the goal is to understand why the controls in place did +not prevent the accident and to identify the changes necessary to prevent similar +accidents in the future. A related type of hazard analysis can be used during system + +design and development (see chapters 8 and 9) to prevent such occurrences in the +first place. +In the following analysis, the basic failures and dysfunctional interactions leading +to the loss at the physical level are identified first. Then each level of the hierarchical +safety control structure is considered in turn, starting from the bottom. +At each level, the context in which the behaviors took place is considered. The +context for each level includes the hazards, the safety requirements and constraints, +the controls in place to prevent the hazard, and aspects of the environment or situ- +ation relevant to understanding the control flaws, including the people involved, +their assigned tasks and responsibilities, and any relevant environmental behavior- +shaping factors. Following a description of the context, the dysfunctional interac- +tions and failures at that level are described, along with the accident factors (see +figure 4.8) that were involved. + +footnote. Tiger One was the code name for the F-15 lead pilot, while Tiger Two denoted the wing pilot. + + +section 5.3.2. Physical Process Failures and Dysfunctional Interactions. +The first step in the analysis is to understand the physical failures and dysfunctional +interactions within the physical process that were related to the accident. Figure 5.5 +shows this information. +All the physical components worked exactly as intended, except perhaps for the +IFF system. The fact that the Mode IV IFF gave an intermittent response has never +been completely explained. Even after extensive equipment teardowns and reenact- +ments with the same F-15s and different Black Hawks, no one has been able to +explain why the F-15 IFF interrogator did not receive a Mode IV response [200]. +The Accident Investigation Board report states: “The reason for the unsuccessful + + +Mode IV interrogation attempts cannot be established, but was probably attribut- +able to one or more of the following factors: incorrect selection of interrogation +modes, faulty air-to-air interrogators, incorrectly loaded IFF transponder codes, +garbling of electronic responses, and intermittent loss of line-of-sight radar contact.”7 +There were several dysfunctional interactions and communication inadequacies +among the correctly operating aircraft equipment. 
The most obvious unsafe interac-
+tion was the release of two missiles in the direction of two friendly aircraft, but there
+were also four obstacles to the type of fighter–helicopter communications that might
+have prevented that release.
+1. The Black Hawks and F-15s were on different radio frequencies and thus the
+pilots could not speak to each other or hear the transmissions between others
+involved in the incident, the most critical of which were the radio transmissions
+between the two F-15 pilots and between the lead F-15 pilot and personnel
+onboard the AWACS. The Black Hawks, according to the Airspace Control
+Order, should have been communicating on the TAOR frequency. Stopping
+here and looking only at this level, it appears that the Black Hawk pilots were
+at fault in not changing to the TAOR frequency, but an examination of the
+higher levels of control points to a different conclusion.
+2. Even if they had been on the same frequency, the Air Force fighter aircraft
+were equipped with HAVE QUICK II (HQ-II) radios, while the Army heli-
+copters were not. The only way the F-15 and Black Hawk pilots could have
+communicated would have been if the F-15 pilots switched to non-HQ mode.
+The procedures the pilots were given to follow did not tell them to do so. In
+fact, with respect to the two helicopters that were shot down, one contained
+an outdated version called HQ-I, which was not compatible with HQ-II. The
+other was equipped with HQ-II, but because not all of the Army helicopters
+supported HQ-II, CFAC refused to provide Army helicopter operations with
+the necessary cryptographic support required to synchronize their radios with
+the other OPC components.
+If the objective of the accident analysis is to assign blame, then the different
+radio frequencies could be considered irrelevant because the differing technol-
+ogy meant they could not have communicated even if they had been on the
+same frequency. If the objective, however, is to learn enough to prevent future
+accidents, then the different radio frequencies are relevant.
+3. The Black Hawks were not squawking the required IFF Mode I code for those
+flying within the TAOR. The GAO report states that Black Hawk pilots told
+them they routinely used the same Mode I code for outside the TAOR while
+operating within the TAOR and no one had advised them that it was incorrect
+to do so. But, again, the wrong Mode I code is only part of the story.
+The Accident Investigation Board report concluded that the use of the
+incorrect Mode I IFF code by the Black Hawks was responsible for the F-15
+pilots’ failure to receive a Mode I response when they interrogated the heli-
+copters. However, an Air Force special task force concluded that based on the
+descriptions of the system settings that the pilots testified they had used on
+the interrogation attempts, the F-15s should have received and displayed any
+Mode I or II response regardless of the code [200]. The AWACS was receiving
+friendly Mode I and II returns from the helicopters at the same time that the
+F-15s received no response. The GAO report concluded that the helicopters’
+use of the wrong Mode I code should not have prevented the F-15s from
+receiving a response.
Confusing the situation even further, the GAO report
+cites the Accident Board president as telling the GAO investigators that
+because of the difference between the lead F-15 pilot’s statement on the day
+of the incident and his testimony to the investigation board, it was difficult to
+determine the number of times the lead pilot had interrogated the helicopters
+[200].
+4. Communication was also impeded by physical line-of-sight restrictions. The
+Black Hawks were flying in narrow valleys among very high mountains that
+disrupted communication depending on line-of-sight transmissions.
+One reason for these dysfunctional interactions lies in the asynchronous evolu-
+tion of the Army and Air Force technology, leaving the different services with largely
+incompatible radios. Looking only at the event chain or at the failures and dysfunc-
+tional interactions in the technical process—a common stopping point in accident
+investigations—gives a very misleading picture of the reasons this accident occurred.
+Examining the higher levels of control is necessary to obtain the information neces-
+sary to prevent future occurrences.
+After the shootdown, the following changes were made:
+•Updated radios were placed on Black Hawk helicopters to enable communica-
+tion with fighter aircraft. Until the time the conversion was complete, fighters
+were directed to remain on the TAOR clear frequencies for deconfliction with
+helicopters.
+•Helicopter pilots were directed to monitor the common TAOR radio frequency
+and to squawk the TAOR IFF codes.
+
+footnote. The commander of the U.S. Army in Europe objected to this sentence. He argued that nothing in the
+board report supported the possibility that the codes had been loaded improperly and that it was clear
+the Army crews were not at fault in this matter. The U.S. Commander in Chief, Europe, agreed with his
+view. Although the language in the opinion was not changed, the former said his concerns were addressed
+because the complaint had been included as an attachment to the board report.
+
+section 5.3.3. The Controllers of the Aircraft and Weapons.
+The pilots directly control the aircraft, including the activation of weapons (figure
+5.6). The context in which their decisions and actions took place is first described, fol-
+lowed by the dysfunctional interactions at this level of the control structure. Then the
+inadequate control actions are outlined and the factors that led to them are described.
+Context in Which Decisions and Actions Took Place
+Safety Requirements and Constraints: The safety constraints that must be enforced
+at this level of the sociotechnical control structure were described earlier. The F-15
+pilots must know who is in the TAOR and whether they should be there or not—
+that is, they must be able to identify accurately the status of all other aircraft in the
+TAOR at all times so that a friendly aircraft is not identified as a threat. They must
+also follow the rules of engagement (ROE), which specify the procedures to be
+executed before firing weapons at any targets. As noted earlier in this chapter, the
+OPC ROE were devised by the OPC commander, based on guidelines created by
+the Joint Chiefs of Staff, and were purposely conservative because of the many
+multinational participants in OPC and the potential for friendly fire accidents. The
+ROE were designed to slow down any military confrontation, but were unsuccessful
+in this case.
An important part of understanding this accident process and prevent- +ing repetitions is understanding why this goal was not achieved. +Controls: As noted in the previous section, the controls at this level included the +rules and procedures for operating in the TAOR (specified in the ACO), informa- +tion provided about daily operations in the TAOR (specified in the Air Tasking +Order or ATO), flowsheets, communication and identification channels (radios and +IFF), training, AWACS oversight, and procedures to keep fighters and helicopters +from coming into contact (for example, the F-15s fly at different altitudes). National +flags were required to be displayed prominently on all aircraft in order to facilitate +identification of their origin. +Roles and Responsibilities of the F-15 Pilots: When conducting combat missions, +aerial tactics dictate that F-15s always fly in pairs with one pilot as the lead and one +as the wingman. They fly and fight as a team, but the lead is always in charge. The +mission that day was to conduct a thorough radar search of the area to ensure that +the TAOR was clear of hostile aircraft (to sanitize the airspace) before the other +aircraft entered. They were also tasked to protect the AWACS from any threats. The +wing pilot was responsible for looking 20,000 feet and higher with his radar while +the lead pilot was responsible for the area 25,000 feet and below. The lead pilot had +final responsibility for the 5,000-foot overlap area. +Environmental and Behavior-Shaping Factors for the F-15 Pilots: The lead pilot +that day was a captain with nine years’ experience in the Air Force. He had flown + + +F-15s for over three years, including eleven combat missions over Bosnia and nine- +teen over northern Iraq protecting the no-fly zone. The mishap occurred on his sixth +flight during his second tour flying in support of OPC. +The wing pilot was a lieutenant colonel and Commander of the 53rd Fighter +Squadron at the time of the shootdown, and he was a highly experienced pilot. +He had flown combat missions out of Incirlik during Desert Storm and had served +in the initial group that set up OPC afterward. He was credited with the only +confirmed kill of an enemy Hind helicopter during the Gulf War. That downing +involved a beyond visual range shot, which means he never actually saw the +helicopter. +F-15 pilots were rotated through every six to eight weeks. Serving in the no-fly +zone was an unusual chance for peacetime pilots to have a potential for engaging +in combat. The pilots were very aware they were going to be flying in unfriendly +skies. They drew personal sidearms with live rounds, removed wedding bands and +other personal items that could be used by potential captors, were supplied with +blood chits offering substantial rewards for returning downed pilots, and were +briefed about threats in the area. Every part of their preparation that morning drove +home the fact that they could run into enemy aircraft: The pilots were making deci- +sions in the context of being in a war zone and were ready for combat. +Another factor that might have influenced behavior, according to the GAO +report, was rivalry between the F-15 and F-16 pilots engaged in Operation Provide +Comfort (OPC). While such rivalry was normally perceived as healthy and leading +to positive professional competition, at the time of the shootdown the rivalry had +become more pronounced and intense. 
The Combined Task Force Commander attributed this atmosphere to the F-16
+community’s having executed the only fighter shootdown in OPC and all the
+shootdowns in Bosnia [200]. F-16 pilots are better trained and equipped to
+intercept low-flying helicopters. The F-15 pilots knew that F-16s would follow
+them into the TAOR that day. Any hesitation might have resulted in the F-16s
+getting another kill.
+A final factor was a strong cultural norm of “radio discipline” (called minimum
+communication or min comm), which led to abbreviated phraseology in
+communication and a reluctance to clarify potential miscommunications. Fighter
+pilots are kept extremely busy in the cockpit; their cognitive capabilities are
+often stretched to the limit. As a result, any unnecessary interruptions on the
+radio are a significant distraction from important competing demands [191].
+Hence, there was a great deal of pressure within the fighter community to
+minimize talking on the radio, which discouraged efforts to check accuracy and
+understanding.
+Roles and Responsibilities of the Black Hawk Pilots: The Army helicopter pilots
+flew daily missions into the TAOR to visit Zakhu. On this particular day, a
+change of command had taken place at the U.S. Army Command Center at Zakhu. The
+outgoing commander was to escort his replacement into the no-fly zone in order
+to introduce him to the two Kurdish leaders who controlled the area. The pilots
+were first scheduled to fly the routine leg into Zakhu, where they would pick up
+two Army colonels and carry other high-ranking VIPs representing the major
+players in OPC to the two Iraqi towns of Irbil and Salah ad Din. It was not
+uncommon for the Black Hawks to fly this far into the TAOR; they had done it
+frequently during the three preceding years of Operation Provide Comfort.
+Environmental and Behavior-Shaping Factors for the Black Hawk Pilots: Inside
+Iraq, helicopters flew in terrain flight mode, that is, they hugged the ground,
+both to avoid midair collisions and to mask their presence from threatening
+ground-to-air Iraqi radars. There are three types of terrain flight, and pilots
+select the appropriate mode based on a wide range of tactical and
+mission-related variables. Low-level terrain flight is flown when enemy contact
+is not likely. Contour flying is closer to the ground than low level, and
+nap-of-the-earth flying is the lowest and slowest form of terrain flight, flown
+only when enemy contact is expected. Eagle Flight helicopters flew contour mode
+most of the time in northern Iraq. They liked to fly in the valleys and the
+low-level areas. The route they were taking the day of the shootdown was through
+a green valley between two steep, rugged mountains. The mountainous terrain
+provided them with protection from Iraqi air defenses during the one-hour flight
+to Irbil, but it also led to disruptions in communication.
+Because of the distance and thus time required for the mission, the Black Hawks
+were fitted with sponsons, or pontoon-shaped fuel tanks. The sponsons are
+mounted below the side doors, and each holds 230 gallons of extra fuel. The
+Black Hawks were painted with green camouflage, while the Iraqi Hinds’
+camouflage scheme was light brown and desert tan. To assist with identification,
+the Black Hawks were marked with three two-by-three-foot American flags—one on
+each door and one on the nose—and a fourth larger flag on the belly of the
+helicopter. In addition, two American flags had been painted on the side of each
+sponson.
+
+Dysfunctional Interactions at This Level
+Communication between the F-15 and Black Hawk pilots was obviously
+dysfunctional and related to the dysfunctional interactions in the physical
+process (incompatible radio frequencies, IFF codes, and anti-jamming
+technology), which resulted in the ends of the communication channels not
+matching and information not being transmitted along the channel. Communication
+between the F-15 pilots was also hindered by the minimum communication policy
+that led to abbreviated messages and a reluctance to clarify potential
+miscommunications, as described above, as well as by the physical terrain.
+
+Flawed or Inadequate Decisions and Control Actions.
+Both the Army helicopter pilots and the F-15 pilots executed inappropriate or
+inadequate control actions during their flights, beyond the obviously incorrect
+F-15 pilot commands to fire on two friendly aircraft.
+Black Hawk Pilots:
+1. The Army helicopters entered the TAOR before it had been sanitized by the
+Air Force. The Air Control Order or ACO specified that a fighter sweep of the
+area must precede any entry of allied aircraft. However, because of the frequent
+trips of Eagle Flight helicopters to Zakhu, an official exception had been made
+to this policy for the Army helicopters. The Air Force fighter pilots had not
+been informed about this exception. Understanding this miscommunication requires
+looking at the higher levels of the control structure, particularly the
+communication structure at those levels.
+2. The Army pilots did not change to the appropriate radio frequency to be used
+in the TAOR. As noted earlier, however, even if they had been on the same
+frequency, they would have been unable to communicate with the F-15s because of
+the different anti-jamming technology of the radios.
+3. The Army pilots did not change to the appropriate IFF Mode I signal for the
+TAOR. Again, as noted above, the F-15s should still have been able to receive
+the Mode I response.
+F-15 Lead Pilot: The accounts of and explanations for the unsafe control actions
+of the F-15 pilots differ greatly among those who have written about the
+accident. Analysis is complicated because any statements the pilots made after
+the accident were likely influenced by their being investigated on charges of
+negligent homicide—their stories changed significantly over time. Also, in the
+excitement of the moment, the lead pilot did not make the required radio call to
+his wingman requesting that he turn on the HUD (head-up display) tape, and he
+also forgot to turn on his own tape. Therefore, evidence about certain aspects
+of what occurred and what was observed is limited to pilot testimony during the
+post-accident investigations and trials.
+Complications also arise in determining whether the pilots followed the rules of
+engagement (ROE) specified for the no-fly zone, because the ROE are not public
+and the relevant section of the Accident Investigation Board Report is censored.
+Other sources of information about the accident, however, reference clear
+instances of Air Force pilot violations of the ROE.
+The following inadequate decisions and control actions can be identified for the
+lead F-15 pilot:
+1. He did not perform a proper visual ID as required by the ROE and did not take
+a second pass to confirm the identification. F-15 pilots are not accustomed to
+flying close to the ground or to terrain.
The lead pilot testified that because of
+concerns about being fired on from the ground and the danger associated with
+flying in a narrow valley surrounded by high mountains, he had remained high as
+long as possible and then dropped briefly for a visual identification that
+lasted between 3 and 4 seconds. He passed the helicopter on his left while
+flying more than 500 miles an hour and at a distance of about 1,000 feet off to
+the side and about 300 feet above the helicopter. He testified:
+I was trying to keep my wing tips from hitting mountains and I accomplished two
+tasks simultaneously, making a call on the main radio and pulling out a guide
+that had the silhouettes of helicopters. I got only three quick interrupted
+glances of less than 1.25 seconds each. [159].
+The dark green Black Hawk camouflage blended into the green background of the
+valley, adding to the difficulty of the identification.
+The Accident Investigation Board used pilots flying F-15s and Black Hawks to
+recreate the circumstances under which the visual identification was made. The
+test pilots were unable to identify the Black Hawks, and they could not see any
+of the six American flags on each helicopter. The F-15 pilots could not have
+satisfied the ROE identification requirements using the type of visual
+identification passes they testified that they made.
+2. He misidentified the helicopters as Iraqi Hinds. There were two basic
+incorrect decisions involved in this misidentification. The first was
+identifying the UH-60 (Black Hawk) helicopters as Russian Hinds, and the second
+was assuming that the Hinds were Iraqi. Both Syria and Turkey flew Hinds, and
+the helicopters could have belonged to one of the U.S. coalition partners. The
+Commander of the Operations Support Squadron, whose job was to run the weekly
+detachment squadron meetings, testified that as long as he had been in OPC, he
+had reiterated to the squadrons each week that they should be careful about
+misidentifying aircraft over the no-fly zone because there were so many nations
+and so many aircraft in the area and that any time F-15s or anyone else picked
+up a helicopter on radar, it was probably a U.S., Turkish, or United Nations
+helicopter:
+Any time you intercept a helicopter as an unknown, there is always a question of
+procedures, equipment failure, and high terrain masking the line-of-sight radar.
+There are numerous reasons why you would not be able to electronically identify
+a helicopter. Use discipline. It is better to miss a shot than be wrong. [159].
+4. He did not confirm, as required by the ROE, that the helicopters had hostile
+intent before firing. The ROE required that the pilot not only determine the
+type of aircraft and nationality but also take into consideration the
+possibility that the aircraft was lost, in distress, on a medical mission, or
+was being flown by pilots who were defecting.
+5. He violated the rules of engagement by not reporting to the Air Command
+Element (ACE). According to the ROE, the pilot should have reported to the ACE
+(who was in his chain of command and physically located in the AWACS) that he
+had encountered an unidentified aircraft. He did not wait for the ACE to approve
+the release of the missiles.
+6. He acted with undue and unnecessary haste that did not allow time for those
+above him in the control structure (who were responsible for controlling the
+engagement) to act.
The entire incident, from the first time the pilots received
+an indication about helicopters in the TAOR to shooting them down, lasted only
+seven minutes. Pilots are allowed by the ROE to take action on their own in an
+emergency, so the question then becomes whether this situation was an emergency.
+CFAC officials testified that there had been no need for haste. The slow-flying
+helicopters had traveled less than fourteen miles since the F-15s first picked
+them up on radar, they were not flying in a threatening manner, and they were
+flying southeast away from the Security Zone. The GAO report cites the Mission
+Director as stating that given the speed of the helicopters, the fighters had
+time to return to Turkish airspace, refuel, and still return and engage the
+helicopters before they could have crossed south of the 36th Parallel.
+The helicopters also posed no threat to the F-15s or to their mission, which was
+to protect the AWACS and determine whether the area was clear. One expert later
+commented that even if they had been Iraqi Hinds, “A Hind is only a threat to an
+F-15 if the F-15 is parked almost stationary directly in front of it and says
+‘Kill me.’ Other than that, it’s probably not very vulnerable” [191].
+Piper quotes Air Force Lt. Col. Tony Kern, a professor at the U.S. Air Force
+Academy, who wrote about this accident:
+Mistakes happen, but there was no rush to shoot these helicopters. The F-15s
+could have done multiple passes, or even followed the helicopters to their
+destination to determine their intentions. [159].
+Any explanation behind the pilot’s hasty action can only be the product of
+speculation. Snook attributes the fast reaction to the overlearned defensive
+responses taught to fighter pilots. Both Snook and the GAO report mention the
+rivalry with the F-16 pilots and a desire of the lead F-15 pilot to shoot down
+an enemy aircraft. F-16s would have entered the TAOR ten to fifteen minutes
+after the F-15s, potentially allowing the F-16 pilots to get credit for the
+downing of an enemy aircraft; F-16s are better trained and equipped to intercept
+low-flying helicopters. If the F-15 pilots had involved the chain of command,
+the pace would have slowed down, ruining the pilots’ chance for a shootdown. In
+addition, Snook argues that this was a rare opportunity for peacetime pilots to
+engage in combat.
+The goals and motivation behind any human action are unknowable (see section
+2.7). Even in this case, where the F-15 pilots survived the accident, there are
+many reasons to discount their own explanations, not the least of which is the
+potential for jail sentences. The explanations provided by the pilots right
+after the engagement differ significantly from their explanations a week later
+during the official investigations to determine whether they should be
+court-martialed. But in any case, there was no chance that such slow-flying
+helicopters could have escaped two supersonic jet fighters in the open terrain
+of northern Iraq, nor were they ever a serious threat to the F-15s. This
+situation, therefore, was not an emergency.
+7. He did not wait for a positive ID from the wing pilot before firing on the
+helicopters and did not question the vague response when he got it: When the
+lead pilot called out that he had visually identified two Iraqi helicopters, he
+asked the wing pilot to confirm the identification.
The wingman called out “Tally Two”
+on his radio, which the lead pilot took as confirmation, but which the wing
+pilot later testified only meant he saw two helicopters, not necessarily Iraqi
+Hinds. The lead pilot did not wait for a positive identification from the
+wingman before starting the engagement.
+8. He violated altitude restrictions without permission: According to Piper, the
+commander of OPC testified at one of the hearings,
+I regularly, routinely imposed altitude limitations in northern Iraq. On the
+fourteenth of April, the restrictions were a minimum of ten thousand feet for
+fixed-wing aircraft. This information was in each squadron’s Aircrew Read File.
+Any exceptions had to have my approval. [159]
+None of the other accident reports, including the official one, mentions this
+erroneous action on the part of the pilots. Because this control flaw was never
+investigated, it is not possible to determine whether the action resulted from a
+“reference channel” problem (i.e., the pilots did not know about the altitude
+restriction) or an “actuator” error (i.e., the pilots knew about it but chose to
+ignore it for an unknown reason).
+9. He deviated from the basic mission to protect the AWACS, leaving the AWACS
+open to attack: The helicopters could have been a diversionary ploy. The mission
+of the first flight into the TAOR was to make sure it was safe for the AWACS and
+other aircraft to enter the restricted operating zone. Piper emphasizes that
+that was the only purpose of their mission [159]. Piper, who again is the only
+one who mentions it, cites testimony of the commander of OPC during one of the
+hearings when asked whether the F-15s exposed the AWACS to other air threats
+when they attacked and shot down the helicopters. The commander replied:
+Yes, when the F-15s went down to investigate the helicopters, made numerous
+passes, engaged the helicopters and then made more passes to visually
+reconnaissance the area, AWACS was potentially exposed for that period of time.
+[159]
+
+Wing Pilot: The wing pilot, like the lead pilot, violated altitude restrictions
+and deviated from the basic mission. In addition:
+1. He did not make a positive identification of the helicopters: His visual
+identification pass was not even as close to the helicopters as the lead F-15
+pilot’s, which had itself been inadequate to recognize the helicopters, and his
+ID lasted only between two and three seconds. According to a Washington Post
+article, he told investigators that he never clearly saw the helicopters before
+reporting “Tally Two.” In a transcript of one of his interviews with
+investigators, he said: “I did not identify them as friendly; I did not identify
+them as hostile. I expected to see Hinds based on the call my flight leader had
+made. I didn’t see anything that disputed that.”
+Although the wing pilot had originally testified that he could not identify the
+helicopters as Hinds, he reversed his statement six months later, when he
+testified at the hearing on whether to court-martial him that “I could identify
+them as Hinds” [159]. There is no way to determine which of these contradictory
+statements is true.
+
+Explanations for continuing the engagement without an identification could
+include an inadequate mental model of the ROE; following the orders of the lead
+pilot and assuming that his identification had been proper; the strong influence
+of what one expects to see on what one sees; wanting the helicopters to be
+hostile; or any combination of these.
+2. He did not tell the lead pilot that he had not identified the helicopters: In
+the hearings to place blame for the shootdown, the lead pilot testified that he
+had radioed the wing pilot and said, “Tiger One has tallied two Hinds, confirm.”
+Both pilots agree up to this point, but then the testimony becomes
+contradictory.
+The hearing in the fall of 1994 on whether the wing pilot should be charged with
+twenty-six counts of negligent homicide rested on the very narrow question of
+whether the lead pilot had called the AWACS announcing the engagement before or
+after the wing pilot responded to the lead pilot’s directive to confirm whether
+the helicopters were Iraqi Hinds. The lead pilot testified that he had
+identified the helicopters as Hinds and then asked the wing to confirm the
+identification. When the wing responded with “Tally Two,” the lead believed this
+response signaled confirmation of the identification. The lead then radioed the
+AWACS and reported, “Tiger Two has tallied two Hinds, engaged.” The wing pilot,
+on the other hand, testified that the lead had called the AWACS with the
+“engaged” message before he (the wing pilot) had made his “Tally Two” radio call
+to the lead. He said his “Tally Two” call was in response to the “engaged” call,
+not the “confirm” call, and simply meant that he had both target aircraft in
+sight. He argued that once the engaged call had been made, he correctly
+concluded that an identification was no longer needed.
+The fall 1994 hearing conclusion about which of these scenarios actually
+occurred differs from the conclusions in the official Air Force accident report
+and that of the hearing officer in another hearing. Again, it is neither
+possible nor necessary to determine blame here, or to determine exactly which
+scenario is correct, in order to conclude that the communications were
+ambiguous. The minimum communication policy was a factor here, as was probably
+the excitement of a potential combat engagement. Snook suggests that what the
+pilots expected to hear resulted in a filtering of the inputs. Such filtering is
+a well-known problem in airline pilots’ communications with controllers. The use
+of well-established phraseology is meant to reduce it. But the calls by the wing
+pilot were nonstandard. In fact, Piper notes that since the shootdown, pilot
+training bases and fighter training programs have used these radio calls as
+examples of “the poorest radio communications possibly ever given by pilots
+during a combat intercept” [159].
+3. He continued the engagement despite the lack of an adequate identification:
+Explanations for continuing the engagement without an identification could
+include an inadequate mental model of the ROE; following the orders of the lead
+pilot and assuming that the lead pilot’s identification had been proper; wanting
+the helicopters to be hostile; or any combination of these. With only his
+contradictory testimony, it is not possible to determine the reason.
+
+Some Reasons for the Flawed Control Actions and Dysfunctional Interactions
+The accident factors shown in figure 4.8 can be used to provide an explanation
+for the flawed control actions. The factors are divided here into incorrect
+control algorithms, inaccurate mental models, poor coordination among multiple
+controllers, and inadequate feedback from the controlled process.
+Incorrect Control Algorithms: The Black Hawk pilots correctly followed the
+procedures they had been given (see the discussion of the CFAC–MCC level
+later). These procedures were unsafe and were changed after the accident.
+The F-15 pilots apparently did not execute their control algorithms (the
+procedures required by the rules of engagement) correctly, although the secrecy
+surrounding the ROE makes this conclusion difficult to prove. After the
+accident, the ROE were changed, but the exact changes made are not public.
+Inaccurate Mental Models of the F-15 Pilots: There were many inconsistencies
+between the mental models of the Air Force pilots and the actual process state.
+First, they had an inaccurate model of what a Black Hawk helicopter looked like.
+There are several explanations for this, including poor visual recognition
+training and the fact that Black Hawks with sponsons attached resemble Hinds.
+None of the pictures of Black Hawks on which the F-15 pilots had been trained
+had these wing-mounted fuel tanks. Additional factors include the speeds at
+which the F-15 pilots do their visual identification (VID) passes and the angle
+at which the pilots passed over their targets.
+Both F-15 pilots received only limited visual recognition training in the
+previous four months, partly due to the disruption of normal training caused by
+their wing’s physical relocation from one base to another in Germany. But the
+training was probably inadequate even if it had been completed. Because the
+primary mission of F-15s is air-to-air combat against other fast-moving
+aircraft, most of the operational training is focused on their most dangerous
+and likely threats—other high-altitude fighters. In the last training before the
+accident, only five percent of the slides depicted helicopters. None of the F-15
+intelligence briefings or training ever covered the camouflage scheme of Iraqi
+helicopters, which was light brown and desert tan (in contrast to the forest
+green camouflage of the Black Hawks).
+Pilots are taught to recognize many different kinds of aircraft at high speeds
+using “beer shots,” which are blurry pictures that resemble how the pilot might
+see those aircraft while in flight. The Air Force pilots, however, received very
+little training in the recognition of Army helicopters, which they rarely
+encountered because of the different altitudes at which they flew. All the
+helicopter photos they did see during training, which were provided by the Army,
+were taken from the ground—a perspective from which it was common for Army
+personnel to view them but not useful for a fighter pilot in flight above them.
+None of the photographs were taken from the above aft quadrant—the position from
+which most fighters would view a helicopter. Air Force visual recognition
+training and procedures were changed after this accident.
+The F-15 pilots also had an inaccurate model of the current airspace occupants,
+based on the information they had received about who would be in the airspace
+that day and when.
They assumed and had been told in multiple ways that they
+would be the first coalition aircraft in the TAOR:
+1. The ACO specified that no coalition aircraft (fixed or rotary wing) was
+allowed to enter the TAOR before it was sanitized by a fighter sweep.
+2. The daily ATO and ARF included a list of all flights scheduled to be in the
+TAOR that day. The ATO listed the Army Black Hawk flights only in terms of their
+call signs, aircraft numbers, type of mission (transport), and general route
+(from Diyarbakir to the TAOR and back to Diyarbakir). All departure times were
+listed “as required,” and no helicopters were mentioned on the daily flowsheet.
+Pilots fly with the flowsheet on kneeboards as a primary reference during the
+mission. The F-15s were listed as the very first mission into the TAOR; all
+other aircraft were scheduled to follow them.
+3. During preflight briefings that morning, the ATO and flowsheet were reviewed
+in detail. No mention was made of any Army helicopter flights not appearing on
+the flowsheet.
+4. The Battle Sheet Directive (a handwritten sheet containing last-minute
+changes to information published in the ATO and the ARF) handed to them before
+going to their aircraft contained no information about Black Hawk flights.
+5. In a radio call to the ground-based Mission Director just after engine start,
+the lead F-15 pilot was told that no new information had been received since the
+ATO was published.
+6. Right before entering the TAOR, the lead pilot checked in again, this time
+with the ACE in the AWACS. Again, he was not told about any Army helicopters in
+the area.
+7. At 10 20, the lead pilot reported that they were on station. Usually at this
+time, the AWACS gives them a “picture” of any aircraft in the area. No
+information was provided to the F-15 pilots at this time, although the Black
+Hawks had already checked in with the AWACS on three separate occasions.
+8. The AWACS continued not to inform the pilots about Army helicopters during
+the encounter. The lead F-15 pilot twice reported unsuccessful attempts to
+identify radar contacts they were receiving, but in response they were not
+informed about the presence of Black Hawks in the area. After the first report,
+the TAOR controller responded with “Clean there,” meaning he did not have a
+radar hit in that location. Three minutes later, after the second call, the TAOR
+controller replied, “Hits there.” If the radar signal had been identified as a
+friendly aircraft, the controller would have responded, “Paint there.”
+9. The IFF transponders on the F-15s did not identify the signals as coming from
+a friendly aircraft, as discussed earlier.
+Various complex analyses have been proposed to explain why the F-15 pilots’
+mental models of the airspace occupants were incorrect and not open to
+reexamination once they received conflicting input. But a possible simple
+explanation is that they believed what they were told. It is well known in
+cognitive psychology that mental models are slow to change, particularly in the
+face of ambiguous evidence like that provided in this case. When operators
+receive input about the state of the system being controlled, they will first
+try to fit that information into their current mental model and will find
+reasons to exclude information that does not fit.
Because operators are continually testing their mental models against reality
+(see figure 2.9), the longer a model has been held and the more sources of
+information that led to that incorrect model, the more resistant the model will
+be to change in the face of conflicting information, particularly ambiguous
+information. The pilots had been told repeatedly and by almost everyone involved
+that there were no friendly helicopters in the TAOR at that time.
+The F-15 pilots also may have had a misunderstanding about (an incorrect model
+of) the ROE and the procedures required when they detected an unidentified
+aircraft. The accident report says that the ROE were reduced in briefings and in
+individual crew members’ understandings to a simplified form. This
+simplification led to some pilots not being aware of specific considerations
+required prior to engagement, including identification difficulties, the need to
+give defectors safe conduct, and the possibility of an aircraft being in
+distress and the crew being unaware of their position. On the other hand, there
+had been an incident the week before, and the F-15 pilots had been issued an
+oral directive reemphasizing the requirement for fighter pilots to report to the
+ACE. That directive was the result of an incident on April 7 in which F-15
+pilots had initially ignored directions from the ACE to “knock off” or stop an
+intercept with an Iraqi aircraft. The ACE overheard the pilots preparing to
+engage the aircraft and contacted them, telling them to stop the engagement
+because he had determined that the hostile aircraft was outside the no-fly zone
+and because he was leery of a “bait and trap” situation.
+The GAO report stated that CFAC officials told the GAO that the F-15 community
+was “very upset” about the intervention of the ACE during the knock-off incident
+and felt he had interfered with the carrying out of the F-15 pilots’ duties
+[200]. As discussed in chapter 2, there is no way to determine the motivation
+behind an individual’s actions. Accident analysts can only present the
+alternative explanations.
+Additional reasons for the lead pilot’s incorrect mental model stem from
+ambiguous or missing feedback from the F-15 wing pilot, dysfunctional
+communication with the Black Hawks, and inadequate information provided over the
+reference channels from the AWACS and CFAC operations.
+
+footnote. According to the GAO report, in such a strategy, a fighter aircraft is
+lured into an area by one or more enemy targets and then attacked by other
+fighter aircraft or surface-to-air missiles.
+
+Inaccurate Mental Models of the Black Hawk Pilots: The Black Hawk control
+actions can also be linked to inaccurate mental models; that is, the pilots were
+unaware that there were separate IFF codes for flying inside and outside the
+TAOR and that they were supposed to change radio frequencies inside the TAOR. As
+will be seen later, they were actually told not to change frequencies. They had
+also been told that the ACO restriction on the entry of allied aircraft into the
+TAOR before the fighter sweep did not apply to them—an official exception had
+been made for helicopters. They understood that helicopters were allowed inside
+the TAOR without AWACS coverage as long as they stayed inside the security zone.
+In practice, the Black Hawk pilots frequently entered the TAOR prior to AWACS
+and fighter support without incident or comment, and therefore it became
+accepted practice.
+
+In addition, because their radios were unable to pick up the HAVE QUICK
+communications between the F-15 pilots and between the F-15s and the AWACS, the
+Black Hawk pilots’ mental models of the situation were incomplete. According to
+Snook, Black Hawk pilots testified during the investigation,
+We were not integrated into the entire system. We were not aware of what was
+going on with the F-15s and the sweep and the refuelers and the recon missions
+and AWACS. We had no idea who was where and when they were there. [191]
+Coordination among Multiple Controllers: At this level, each component
+(aircraft) had a single controller, and thus coordination problems did not
+occur. They were rife, however, at the higher control levels.
+Feedback from the Controlled Process: The F-15 pilots received ambiguous
+information from their visual identification pass. At the speeds and altitudes
+they were traveling, it is unlikely that they would have detected the unique
+Black Hawk markings that identified them as friendly. The mountainous terrain in
+which they were flying limited their ability to perform an adequate
+identification pass, and the green helicopter camouflage added to the
+difficulty. The feedback from the wingman to the lead F-15 pilot was also
+ambiguous and was most likely misinterpreted by the lead pilot. Both pilots
+apparently received incorrect IFF feedback.
+
+Changes after the Accident
+After the accident, Black Hawk pilots were:
+1. Required to adhere strictly to their ATO-published routing and timing.
+2. Not allowed to operate in the TAOR unless under positive control of AWACS.
+Without AWACS coverage, only administrative helicopter flights between
+Diyarbakir and Zakhu were allowed, provided they were listed on the ATO.
+3. Required to monitor the common TAOR radio frequency.
+4. Required to confirm radio contact with AWACS at least every twenty minutes
+unless they were on the ground.
+5. Required to inform AWACS upon landing and to make mandatory radio calls at
+each enroute point.
+6. If radio contact could not be established, required to climb to
+line-of-sight with AWACS until contact was reestablished.
+7. Prior to landing in the TAOR (including Zakhu), required to inform the AWACS
+of anticipated delays on the ground that would preclude taking off at the
+scheduled time.
+8. Immediately after takeoff, required to contact the AWACS and reconfirm that
+IFF Modes I, II, and IV were operating. If they had either a negative radio
+check with AWACS or an inoperative Mode IV, they could not proceed into the
+TAOR.
+All fighter pilots were:
+9. Required to check in with the AWACS when entering the low-altitude
+environment and remain on the TAOR clear frequencies for deconfliction with
+helicopters.
+10. Required to make contact with AWACS using UHF, HAVE QUICK, or UHF clear
+radio frequencies and confirm IFF Modes I, II, and IV before entering the TAOR.
+If there was either negative radio contact with AWACS or an inoperative Mode IV,
+they could not enter the TAOR.
+Finally, white recognition strips were painted on the Black Hawk rotor blades to
+enhance their identification from the air.
+
+section 5.3.4. The ACE and Mission Director.
+Context in Which Decisions and Actions Took Place
+Safety Requirements and Constraints: The ACE and Mission Director must follow
+the procedures specified and implied by the ROE, the ACE must ensure that pilots
+follow the ROE, and the ACE must interact with the AWACS crew to identify
+reported unidentified aircraft (see figure 5.7).
+
+Controls: The controls include the ROE to slow down the engagement and a chain
+of command to prevent individual error or erratic behavior.
+Roles and Responsibilities: The ACE was responsible for controlling combat
+operations and for ensuring that the ROE were enforced. He flew in the AWACS so
+he could get up-to-the-minute information about the state of the TAOR airspace.
+The ACE was always a highly experienced person with a fighter background. That
+day, the ACE was a major with nineteen years in the Air Force. He had perhaps
+more combat experience than anyone else in the Air Force under forty. He had
+logged 2,000 total hours of flight time and flown 125 combat missions, including
+27 in the Gulf War, during which time he earned the Distinguished Flying Cross
+and two air medals for heroism. At the time of the accident, he had worked for
+four months as an ACE and flown approximately fifteen to twenty missions on the
+AWACS [191].
+The Mission Director on the ground provided a chain of command for real-time
+decision making from the pilots to the CFAC commander. On the day of the
+accident, the Mission Director was a lieutenant colonel with more than eighteen
+years in the Air Force. He had logged more than 1,000 hours in the F-4 in Europe
+and an additional 100 hours worldwide in the F-15 [191].
+Environmental and Behavior-Shaping Factors: No pertinent factors were identified
+in the reports and books on the accident.
+
+Dysfunctional Interactions at This Level.
+The ACE was supposed to get information about unidentified or enemy aircraft
+from the AWACS mission crew, but in this instance they did not provide it.
+
+Flawed or Inadequate Decisions and Control Actions.
+The ACE did not provide any control commands to the F-15s with respect to
+following the ROE or engaging and firing on the U.S. helicopters.
+
+Reasons for Flawed Control Actions and Dysfunctional Interactions.
+Incorrect Control Algorithms: The control algorithms should theoretically have
+been effective, but they were never executed.
+Inaccurate Mental Models: CFAC, and thus the Mission Director and ACE, exercised
+ultimate tactical control of the helicopters, but they shared the common view
+with the AWACS crew that helicopter activities were not an integral part of OPC
+air operations. In testimony after the accident, the ACE commented, “The way I
+understand it, only as a courtesy does the AWACS track Eagle Flight.”
\ No newline at end of file
diff --git a/replacements b/replacements
index f8c63a7..bafe65b 100644
--- a/replacements
+++ b/replacements
@@ -1,6 +1,47 @@
+: .
+— .
+\[.+\] 
-\n 
-HMO H M O
-MIC M I C
-DC-10 D C 10. 
\ No newline at end of file
+ 19(\d\d) 19 $1
+ 200(\d) 2 thousand $1
+ 20(\d\d) 20 $1
+ \( .(
+ \) ).
+ III 3
+ II 2
+ IV 4
+ ASO A S O
+ PRA P R A
+ HMO H M O
+ MIC M I C
+ DC-10 D C 10
+ OPC O P C
+ TAOR T A O R
+ AAI A A I
+ ACO A C O
+ AFB A F B
+ AI A I
+ ATO A T O
+ BH B H
+ BSD B S D
+ CTF C T F
+ CFAC C FACK
+ DO D O
+ GAO GAOW
+ HQ-II H Q-2
+ IFF I F F
+ JOIC J O I C
+ JSOC J SOCK
+ JTIDS J tides
+ MCC M C C
+ MD M D
+ NCA N C A
+ NFZ N F Z
+ ROE R O E
+ SD S D
+ SITREP SIT Rep
+ TACSAT Tack sat
+ USCINCEUR U S C in E U R
+ WD W D
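Note on the replacements table: it is an ordered list of regular-expression substitutions applied to the chapter text before synthesis. Below is a minimal sketch of how such a table could be applied, under stated assumptions: each nonblank line is a "pattern replacement" pair split at the first space, patterns themselves contain no spaces, patterns are Python-compatible regexes, and backreferences use Perl-style $1. The script name apply_replacements.py and the raw-to-txt step are illustrative, not part of this patch.

    #!/usr/bin/env python3
    # apply_replacements.py -- hypothetical helper, not part of this patch.
    # Applies the `replacements` table as ordered regular-expression
    # substitutions over stdin. Assumes each nonblank line holds
    # "<pattern> <replacement>" split at the first space (after stripping
    # leading whitespace), that patterns contain no spaces, and that
    # replacements may use Perl-style $1 backreferences.
    import re
    import sys

    def load_rules(path):
        rules = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line.strip():
                    continue
                pattern, _, replacement = line.lstrip().partition(" ")
                # Convert Perl-style $1 backreferences to Python's \1 form.
                replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
                rules.append((re.compile(pattern), replacement))
        return rules

    def apply_rules(text, rules):
        # Rules run in file order, so longer tokens (e.g. III) must be
        # listed before their prefixes (II), as they are in the table.
        for pattern, replacement in rules:
            text = pattern.sub(replacement, text)
        return text

    if __name__ == "__main__":
        sys.stdout.write(apply_rules(sys.stdin.read(), load_rules("replacements")))

Under those assumptions, the sketch could be wired in ahead of the existing txt-to-wav rule, for example: python3 apply_replacements.py < chapter05.raw > chapter05.txt.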