From 48e35621ed772111ec5fc00f53d3d38972d38176 Mon Sep 17 00:00:00 2001 From: xuu Date: Sun, 16 Mar 2025 20:38:47 -0600 Subject: [PATCH] chore: add 7 8 9 --- chapter07.raw | 633 +++++++++++++++++ chapter07.txt | 586 ++++++++++++++++ chapter08.raw | 1276 +++++++++++++++++++++++++++++++++ chapter08.txt | 1175 +++++++++++++++++++++++++++++++ chapter09.raw | 1864 +++++++++++++++++++++++++++++++++++++++++++++++++ chapter09.txt | 1703 ++++++++++++++++++++++++++++++++++++++++++++ replacements | 102 +-- 7 files changed, 7299 insertions(+), 40 deletions(-) create mode 100644 chapter07.raw create mode 100644 chapter07.txt create mode 100644 chapter08.raw create mode 100644 chapter08.txt create mode 100644 chapter09.raw create mode 100644 chapter09.txt diff --git a/chapter07.raw b/chapter07.raw new file mode 100644 index 0000000..e970a61 --- /dev/null +++ b/chapter07.raw @@ -0,0 +1,633 @@ +chapter 7. Fundamentals. +All the parts of the process described in the following chapters start from the same +fundamental system engineering activities. These include defining, for the system +involved, accidents or losses, hazards, safety requirements and constraints, and the +safety control structure. + +section 7.1. +Defining Accidents and Unacceptable Losses. +The first step in any safety effort involves agreeing on the types of accidents or +losses to be considered. +In general, the definition of an accident comes from the customer and occasion- +ally from the government for systems that are regulated by government agencies. +Other sources might be user groups, insurance companies, professional societies, +industry standards, and other stakeholders. If the company or group developing the +system is free to build whatever they want, then considerations of liability and the +cost of accidents will come into play. +Definitions of basic terms differ greatly among industries and engineering disci- +plines. A set of basic definitions is used in this book (see appendix A) that reflect +common usage in System Safety. An accident is defined as: +Accident: An undesired or unplanned event that results in a loss, including loss +of human life or human injury, property damage, environmental pollution, +mission loss, etc. +An accident need not involve loss of life, but it does result in some loss that is unac- +ceptable to the stakeholders. System Safety has always considered non-human +losses, but for some reason, many other approaches to safety engineering have +limited the definition of a loss to human death or injury. As an example of an +inclusive definition, a spacecraft accident might include loss of the astronauts (if +the spacecraft is manned), death or injury to support personnel or the public, non- +accomplishment of the mission, major equipment damage (such as damage to launch + + +facilities), environmental pollution of planets, and so on. An accident definition used +in the design of an explorer spacecraft to characterize the icy moon of a planet in +the Earth’s solar system, for example, was [151]: +A1. Humans or human assets on earth are killed or damaged. +A2. Humans or human assets off of the earth are killed or damaged. +A3. Organisms on any of the moons of the outer planet (if they exist) are killed +or mutated by biological agents of Earth origin. +Rationale: Contamination of an icy outer planet moon with biological agents +of Earth origin could have catastrophically adverse effects on any biological +agents indigenous to the icy outer planet moon. +A4. 
The scientific data corresponding to the mission goals is not collected. +A5. The scientific data corresponding to the mission goals is rendered unusable +(i.e., deleted or corrupted) before it can be fully investigated. +A6. Organisms of Earth origin are mistaken for organisms indigenous to any of +the moons of the outer planet in future missions to study the outer planet’s +moon. +Rationale: Contamination of a moon of an outer planet with biological +agents of Earth origin could lead to a situation in which a future mission +discovers the biological agents and falsely concludes that they are indige- +nous to the moon of the outer planet. +A7. An incident during this mission directly causes another mission to fail +to collect, return, or use the scientific data corresponding to its mission +goals. +Rationale: It is possible for this mission to interfere with the completion of +other missions through denying the other missions access to the space +exploration infrastructure (for example, overuse of limited Deep Space +Network1 (DSN) resources, causing another mission to miss its launch +window because of damage to the launch pad during this mission, etc.) + +footnote. The Deep Space Network is an international network of large antennas and communication facilities +that supports interplanetary spacecraft missions and radio and radar astronomy observations for +the exploration of the solar system and the universe. The network also supports some Earth-orbiting +missions. + +Prioritizing or assigning a level of severity to the identified losses may be useful +when tradeoffs among goals are required in the design process. As an example, +consider an industrial robot to service the thermal tiles on the Space Shuttle, which + + +is used as an example in chapter 9. The goals for the robot are (1) to inspect the +thermal tiles for damage caused during launch, reentry, and transport of a Space +Shuttle and (2) to apply waterproofing chemicals to the thermal tiles. +Level 1: +A1.1: Loss of the orbiter and crew. (e.g., inadequate thermal protection) +A1.2: Loss of life or serious injury in the processing facility. +Level 2: +A2.1: Damage to the orbiter or to objects in the processing facility that results. +in the delay of a launch or in a loss of greater than x dollars. +A2.2: Injury to humans requiring hospitalization or medical attention and +leading to long-term or permanent physical effects. +Level 3: +A3.1: Minor human injury. (does not require medical attention or requires only +minimal intervention and does not lead to long-term or permanent physical +effects) +A3.2: Damage to orbiter that does not delay launch and results in a loss of less +than x dollars. +A3.3: Damage to objects in the processing facility (both on the floor or sus- +pended) that does not result in delay of a launch or a loss of greater than x +dollars. +A3.4: Damage to the mobile robot. +Assumption: It is assumed that there is a backup plan in place for servicing +the orbiter thermal tiles in case the tile processing robot has a mechanical +failure and that the same backup measures can be used in the event the +robot is out of commission due to other reasons. +The customer may also have a safety policy that must be followed by the contractor +or those designing the thermal tile servicing robot. As an example, the following is +similar to a typical NASA safety policy: +General Safety Policy: All hazards related to human injury or damage to the +orbiter must be eliminated or mitigated by the system design. 
A reasonable +effort must be made to eliminate or mitigate hazards resulting at most in +damage to the robot or objects in the work area. For any hazards that cannot +be eliminated, the hazard analysis as well as the design features and develop- +ment procedures, including any tradeoff studies, must be documented and +presented to the customer for acceptance. + + +One (but only one) of the controls used to avoid this type of accident is an air- +borne collision avoidance system like TCAS (Traffic alert and Collision Avoidance +System), which is now required on most commercial aircraft. While the goal of TCAS +is increased safety, TCAS itself introduces new hazards associated with its use. Some +hazards that were considered during the design of TCAS are: +H1. TCAS causes or contributes to a near midair collision (NMAC), defined as +a pair of controlled aircraft violating minimum separation standards. +H2. TCAS causes or contributes to a controlled maneuver into the ground. +H3. TCAS causes or contributes to the pilot losing control over the aircraft. +H4. TCAS interferes with other safety-related aircraft systems. +H5. TCAS interferes with the ground-based Air Traffic Control system (e.g., +transponder transmissions to the ground or radar or radio services). +H6. TCAS interferes with an ATC advisory that is safety-related (e.g., avoiding +a restricted area or adverse weather conditions). +Ground-based air traffic control also plays an important role in collision avoidance, +although it has responsibility for a larger and different set of hazards: +H1. Controlled aircraft violate minimum separation standards (NMAC). +H2. An airborne controlled aircraft enters an unsafe atmospheric region. +H3. A controlled airborne aircraft enters restricted airspace without author- +ization. +H4. A controlled airborne aircraft gets too close to a fixed obstacle other than a +safe point of touchdown on assigned runway (known as controlled flight into +terrain or CFIT). +H5. A controlled airborne aircraft and an intruder in controlled airspace violate +minimum separation. +H6. Loss of controlled flight or loss of airframe integrity. +H7. An aircraft on the ground comes too close to moving objects or collides with +stationary objects or leaves the paved area. +H8. An aircraft enters a runway for which it does not have a clearance (called +runway incursion). +Unsafe behavior (hazards) at the system level can be mapped into hazardous +behaviors at the component or subsystem level. Note, however, that the reverse +(bottom-up) process is not possible, that is, it is not possible to identify the system- +level hazards by looking only at individual component behavior. Safety is a system +property, not a component property. Consider an automated door system. One + + +reasonable hazard when considering the door alone is the door closing on someone. +The associated safety constraint is that the door must not close on anyone in the +doorway. This hazard is relevant if the door system is used in any environment. If +the door is in a building, another important hazard is not being able to get out of a +dangerous environment, for example, if the building is on fire. Therefore, a reason- +able design constraint would be that the door opens whenever a door open request +is received. But if the door is used on a moving train, an additional hazard must +be considered, namely, the door opening while the train is moving and between +stations. 
In a moving train, different safety design constraints would apply compared +to an automated door system in a building. Hazard identification is a top-down +process that must consider the encompassing system and its hazards and potential +accidents. +Let’s assume that the automated door system is part of a train control system. +The system-level train hazards related to train doors include a person being hit by +closing doors, someone falling from a moving train or from a stationary train that +is not properly aligned with a station platform, and passengers and staff being unable +to escape from a dangerous environment in the train compartment. Tracing these +system hazards into the related hazardous behavior of the automated door compo- +nent of the train results in the following hazards: +1. Door is open when the train starts. +2. Door opens while train is in motion. +3. Door opens while not properly aligned with station platform. +4. Door closes while someone is in the doorway. +5. Door that closes on an obstruction does not reopen or reopened door does +not reclose. +6. Doors cannot be opened for emergency evacuation between stations. +The designers of the train door controller would design to control these hazards. +Note that constraints 3 and 6 are conflicting, and the designers will have to reconcile +such conflicts. In general, attempts should first be made to eliminate hazards at +the system level. If they cannot be eliminated or adequately controlled at the +system level, then they must be refined into hazards to be handled by the system +components. +Unfortunately, no tools exist for identifying hazards. It takes domain expertise +and depends on subjective evaluation by those constructing the system. Chapter 13 +in Safeware provides some common heuristics that may be helpful in the process. +The good news is that identifying hazards is usually not a difficult process. The later +steps in the hazard analysis process are where most of the mistakes and effort occurs. + + +There is also no right or wrong set of hazards, only a set that the system stake- +holders agree is important to avoid. Some government agencies have mandated the +hazards they want considered for the systems they regulate or certify. For example, +the U.S. Department of Defense requires that producers of nuclear weapons +consider four hazards: +1. Weapons involved in accident or incidents, or jettisoned weapons, produce a +nuclear yield. +2. Nuclear weapons are deliberately prearmed, armed, launched, fired, or released +without execution of emergency war orders or without being directed to do so +by a competent authority. +3. Nuclear weapons are inadvertently prearmed, armed, launched, fired, or +released. +4. Inadequate security is applied to nuclear weapons. +Sometimes user or professional associations define the hazards for the systems they +use and that they want developers to eliminate or control. In most systems, however, +the hazards to be considered are up to the developer and their customer(s). +section 7.3. +System Safety Requirements and Constraints. +After the system and component hazards have been identified, the next major goal +is to specify the system-level safety requirements and design constraints necessary +to prevent the hazards from occurring. These constraints will be used to guide the +system design and tradeoff analyses. +The system-level constraints are refined and allocated to each component during +the system engineering decomposition process. 
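To make the allocation step concrete, the short sketch below records two of the train-door hazards, the safety constraint derived from each, and the component to which each constraint is allocated, flagging the conflict noted earlier. It is only an illustration; the class and field names are invented for this example and are not part of any tool described in this book.

from dataclasses import dataclass, field

@dataclass
class SafetyConstraint:
    cid: str
    text: str
    allocated_to: str                      # component expected to enforce the constraint
    conflicts_with: list = field(default_factory=list)

@dataclass
class Hazard:
    hid: str
    description: str
    constraints: list = field(default_factory=list)

h3 = Hazard("H3", "Door opens while not properly aligned with station platform")
h6 = Hazard("H6", "Doors cannot be opened for emergency evacuation between stations")

c3 = SafetyConstraint("SC3", "Door must not open unless aligned with a station platform",
                      allocated_to="door controller", conflicts_with=["SC6"])
c6 = SafetyConstraint("SC6", "Door must be openable for emergency evacuation between stations",
                      allocated_to="door controller", conflicts_with=["SC3"])
h3.constraints.append(c3)
h6.constraints.append(c6)

# Surface conflicting constraint pairs so they are resolved at the system level,
# not discovered later in component design.
for hazard in (h3, h6):
    for constraint in hazard.constraints:
        if constraint.conflicts_with:
            print(f"{hazard.hid}: {constraint.cid} conflicts with "
                  f"{', '.join(constraint.conflicts_with)}; needs system-level resolution")

Capturing the allocation in even this simple a form makes the conflicting pair visible early, when resolving it is still relatively cheap.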
The process then iterates over the +individual components as they are refined (and perhaps further decomposed) and +as design decisions are made. +Figure 7.1 shows an example of the design constraints that might be generated +from the automated train door hazards. Again, note that the third constraint poten- +tially conflicts with the last one and the resolution of this conflict will be an impor- +tant part of the system design process. Identifying these types of conflicts early in +the design process will lead to better solutions. Choices may be more limited later +on when it may not be possible or practical to change the early decisions. +As the design process progresses and design decisions are made, the safety +requirements and constraints are further refined and expanded. For example, a +safety constraint on TCAS is that it must not interfere with the ground-based air +traffic control system. Later in the process, this constraint will be refined into more +detailed constraints on the ways this interference might occur. Examples include + + +constraints on TCAS design to limit interference with ground-based surveillance +radar, with distance-measuring equipment channels, and with radio services. Addi- +tional constraints include how TCAS can process and transmit information (see +chapter 10). +Figure 7.2 shows the high-level requirements and constraints for some of the air +traffic control hazards identified above. Comparing the ATC high-level constraints +with the TCAS high-level constraints (figure 7.3) is instructive. Ground-based air +traffic control has additional requirements and constraints related to aspects of the +collision problem that TCAS cannot handle alone, as well as other hazards and +potential aircraft accidents that it must control. +Some constraints on the two system components (ATC and TCAS) are closely +related, such as the requirement to provide advisories that maintain safe separation +between aircraft. This example of overlapping control raises important concerns +about potential conflicts and coordination problems that need to be resolved. As +noted in section 4.5, accidents often occur in the boundary areas between controllers +and when multiple controllers control the same process. The inadequate resolution +of the conflict between multiple controller responsibilities for aircraft separation +contributed to the collision of two aircraft over the town of Überlingen (Germany) + + +in July 2002 when TCAS and the ground air traffic controller provided conflicting +advisories to the pilots. Potentially conflicting responsibilities must be carefully +handled in system design and operations and identifying such conflicts are part of +the new hazard analysis technique described in chapter 8. +Hazards related to the interaction among components, for example the inter- +action between attempts by air traffic control and by TCAS to prevent collisions, +need to be handled in the safety control structure design, perhaps by mandating +how the pilot is to select between conflicting advisories. There may be considerations +in handling these hazards in the subsystem design that will impact the behavior of +multiple subsystems and therefore must be resolved at a higher level and passed to +them as constraints on their behavior. + +section 7.4. +The Safety Control Structure. 
+The safety requirements and constraints on the physical system design shown in +section 7.3 act as input to the standard system engineering process and must be +incorporated into the physical system design and safety control structure. An +example of how they are used is provided in chapter 10. +Additional system safety requirements and constraints, including those on opera- +tions and maintenance or upgrades will be used in the design of the safety control +structure at the organizational and social system levels above the physical system. +There is no one correct safety control structure: what is practical and effective will +depend greatly on cultural and other factors. Some general principles that apply to +all safety control structures are described in chapter 13. These principles need to be +combined with specific system safety requirements and constraints for the particular +system involved to design the control structure. +The process for engineering social systems is very similar to the regular system +engineering process and starts, like any system engineering project, with identifying +system requirements and constraints. The responsibility for implementing each +requirement needs to be assigned to the components of the control structure, along +with requisite authority and accountability, as in any management system; controls +must be designed to ensure that the responsibilities can be carried out; and feedback +loops created to assist the controller in maintaining accurate process models. + +section 7.4.1. The Safety Control Structure for a Technical System. +An example from the world of space exploration is used in this section, but many +of the same requirements and constraints could easily be adapted for other types +of technical system development and operations. +The requirements in this example were generated to perform a programmatic +risk assessment of a new NASA management structure called Independent + +Technical Authority (ITA) recommended in the report of the Columbia Accident +Investigation Board. The risk analysis itself is described in the chapter on the new +hazard analysis technique called STPA (chapter 8). But the first step in the safety +or risk analysis is the same as for technical systems: to identify the system hazards +to be avoided, to generate a set of requirements for the new management structure, +and to design the control structure. +The new safety control structure for the NASA manned space program was +introduced to improve the flawed engineering and management decision making +leading to the Columbia loss. The hazard to be eliminated or mitigated was: + +System Hazard: +Poor engineering and management decision making leading to a loss. + +Four high-level system safety requirements and constraints for preventing the +hazard were identified and then refined into more specific requirements and +constraints. +1. Safety considerations must be first and foremost in technical decision +making. +a. State-of-the art safety standards and requirements for NASA missions must +be established, implemented, enforced, and maintained that protect the +astronauts, the workforce, and the public. +b. Safety-related technical decision making must be independent from pro- +grammatic considerations, including cost and schedule. +c. Safety-related decision making must be based on correct, complete, and +up-to-date information. +d. Overall (final) decision making must include transparent and explicit con- +sideration of both safety and programmatic concerns. +e. 
The Agency must provide for effective assessment and improvement in +safety-related decision making. +2. Safety-related technical decision making must be done by eminently qualified +experts, with broad participation of the full workforce. +a. Technical decision making must be credible (executed using credible per- +sonnel, technical requirements, and decision-making tools) . +b. Technical decision making must be clear and unambiguous with respect to +authority, responsibility, and accountability. +c. All safety-related technical decisions, before being implemented by the +Program, must have the approval of the technical decision maker assigned +responsibility for that class of decisions. + +d. Mechanisms and processes must be created that allow and encourage all +employees and contractors to contribute to safety-related decision making. +3. Safety analyses must be available and used starting in the early acquisition, +requirements development, and design processes and continuing through the +system life cycle. +a. High-quality system hazard analyses must be created. +b. Personnel must have the capability to produce high-quality safety +analyses. +c. Engineers and managers must be trained to use the results of hazard analy- +ses in their decision making. +d. Adequate resources must be applied to the hazard analysis process. +e. Hazard analysis results must be communicated in a timely manner to those +who need them. A communication structure must be established that +includes contractors and allows communication downward, upward, and +sideways (e.g., among those building subsystems). +f. Hazard analyses must be elaborated (refined and extended) and updated +as the design evolves and test experience is acquired. +g. During operations, hazard logs must be maintained and used as experience +is acquired. All in-flight anomalies must be evaluated for their potential to +contribute to hazards. +4. The Agency must provide avenues for the full expression of technical con- +science (for safety-related technical concerns) and provide a process for full +and adequate resolution of technical conflicts as well as conflicts between +programmatic and technical concerns. +a. Communication channels, resolution processes, adjudication procedures +must be created to handle expressions of technical conscience. +b. Appeals channels must be established to surface complaints and concerns +about aspects of the safety-related decision making and technical conscience +structures that are not functioning appropriately. +Where do these requirements and constraints come from? Many of them are based +on fundamental safety-related development, operations and management principles +identified in various chapters of this book, particularly chapters 12 and 13. Others +are based on experience, such as the causal factors identified in the Columbia and +Challenger accident reports or other critiques of the NASA safety culture and of +NASA safety management. The requirements listed obviously reflect the advanced +technology and engineering domain of NASA and the space program that was the +focus of the ITA program along with some of the unique aspects of the NASA + + +culture. Other industries will have their own requirements. An example for the +pharmaceutical industry is shown in the next section of this chapter. +There is unlikely to be a universal set of requirements that holds for every safety +control structure beyond a small set of requirements too general to be very useful +in a risk analysis. 
Each organization needs to determine what its particular safety +goals are and the system requirements and constraints that are likely to ensure that +it reaches them. +Clearly buy-in and approval of the safety goals and requirements by the stake- +holders, such as management and the broader workforce as well as anyone oversee- +ing the group being analyzed, such as a regulatory agency, is important when +designing and analyzing a safety control structure. +Independent Technical Authority is a safety control structure used in the nuclear +Navy SUBSAFE program described in chapter 14. In this structure, safety-related +decision making is taken out of the hands of the program manager and assigned to +a Technical Authority. In the original NASA implementation, the technical authority +rested in the NASA Chief Engineer, but changes have since been made. The overall +safety control structure for the original NASA ITA is shown in figure 7.4.3 +For each component of the structure, information must be determined about its +overall role, responsibilities, controls, process model requirements, coordination and +communication requirements, contextual (environmental and behavior-shaping) +factors that might bear on the component’s ability to fulfill its responsibilities, and +inputs and outputs to other components in the control structure. The responsibilities +are shown in figure 7.5. A risk analysis on ITA and the safety control structure is +described in chapter 8. + +footnote. The control structure was later changed to have ITA under the control of the NASA center directors +rather than the NASA chief engineer; therefore, this control structure does not reflect the actual +implementation of ITA at NASA, but it was the design at the time of the hazard analysis described in +chapter 8. + + +section 7.4.2. Safety Control Structures in Social Systems. +Social system safety control structures often are not designed but evolve over time. +They can, however, be analyzed for inherent risk and redesigned or “reengineered” +to prevent accidents or to eliminate or control past causes of losses as determined +in an accident analysis. +The reengineering process starts with the definition of the hazards to be elimi- +nated or mitigated, system requirements and constraints necessary to increase safety, +and the design of the current safety-control structure. Analysis can then be used to +drive the redesign of the safety controls. But once again, just like every system that +has been described so far in this chapter, the process starts by identifying the hazards + + +and safety requirements and constraints derived from them. The process is illus- +trated using drug safety. +Dozens of books have been written about the problems in the pharmaceutical +industry. Everyone appears to have good intentions and are simply striving to opti- +mize their performance within the existing incentive structure. The result is that the +system has evolved to the point where each group’s individual best interests do not +necessarily add up to or are not aligned with the best interests of society as a whole. +A safety control structure exists, but does not necessarily provide adequate satisfac- +tion of the system-level goals, as opposed to the individual component goals. +This problem can be viewed as a classic system engineering problem: optimizing +each component does not necessarily add up to a system optimum. Consider the air +transportation system, as noted earlier. 
When each aircraft tries to optimize its path +from its departure point to its destination, the overall system throughput may not +be optimized when they all arrive in a popular hub at the same time. One goal of +the air traffic control system is to control individual aircraft movement in order to + + +optimize overall system throughput while trying to allow as much flexibility as pos- +sible for the individual aircraft and airlines to achieve their goals. The air traffic +control system and the rules of operation of the air transportation system resolve +conflicting goals when public safety is at stake. Each airline might want its own +aircraft to land as quickly as possible, but the air traffic controllers ensure adequate +spacing between aircraft to preserve safety margins. These same principles can be +applied to non-engineered systems. +The ultimate goal is to determine how to reengineer or redesign the overall +pharmaceutical safety control structure in a way that aligns incentives for the greater +good of society. A well-designed system would make it easier for all stakeholders +to do the right thing, both scientifically and ethically, while achieving their own goals +as much as possible. By providing the decision makers with information about ways +to achieve the overall system objectives and the tradeoffs involved, better decision +making can result. +While system engineering is applicable to pharmaceutical (and more generally +medical) safety and risk management, there are important differences from the +classic engineering problem that require changes to the traditional system safety +approaches. In most technical systems, managing risk is simpler because not doing +something (e.g., not inadvertently launching the missile) is usually safe and the +problem revolves around preventing the hazardous event (inadvertent launch): a +risk/no risk situation. The traditional engineering approach identifies and evaluates +the costs and potential effectiveness of different ways to eliminate or control the +hazards involved in the operational system. Tradeoffs require comparing the costs +of various solutions, including costs that involve reduction in desirable system func- +tions or system reliability. +The problem in pharmaceutical safety is different: there is risk in prescribing a +potentially unsafe drug, but there is also risk in not prescribing the drug (the patient +dies from their medical condition): a risk/risk situation. The risks and benefits +conflict in ways that greatly increase the complexity of decision making and the +information needed to make decisions. New, more powerful system engineering +techniques are required to deal with risk/risk decisions. +Once again, the basic goals, hazards, and safety requirements must first be identi- +fied [43]. +System Goal: To provide safe and effective pharmaceuticals to enhance the long- +term health of the population. + +Important loss events (accidents) we are trying to avoid are: +1. Patients get a drug treatment that negatively impacts their health. +2. Patients do not get the treatment they need. + + +Three system hazards can be identified that are related to these loss events: +H1: The public is exposed to an unsafe drug. +1. The drug is released with a label that does not correctly specify the condi- +tions for its safe use. +2. An approved drug is found to be unsafe and appropriate responses are +not taken (warnings, withdrawals from the market, etc.) +3. Patients are subjected to unacceptable risk during clinical trials. 
+H2: Drugs are taken unsafely. +1. The wrong drug is prescribed for the indication. +2. The pharmacist provides a different medication than was prescribed. +3. Drugs are taken in an unsafe combination. +4. Drugs are not taken according to directions (dosage, timing). +H3: Patients do not get an effective treatment they require. +1. Safe and effective drugs are not developed, are not approved for use, or +are withdrawn from the market. +2. Safe and effective drugs are not affordable by those who need them. +3. Unnecessary delays are introduced into development and marketing. +4. Physicians do not prescribe needed drugs or patients have no access to +those who could provide the drugs to them. +5. Patients stop taking a prescribed drug due to perceived ineffectiveness or +intolerable side effects. +From these hazards, a set of system requirements can be derived to prevent them: +1. Pharmaceutical products are developed to enhance long-term health. +a. Continuous appropriate incentives exist to develop and market needed +drugs. +b. The scientific knowledge and technology needed to develop new drugs and +optimize their use is available. +2. Drugs on the market are adequately safe and effective. +a. Drugs are subjected to effective and timely safety testing. +b. New drugs are approved by the FDA based upon a validated and reproduc- +ible decision-making process. +c. The labels attached to drugs provide correct information about safety and +efficacy. + + +d. Drugs are manufactured according to good manufacturing practices. +e. Marketed drugs are monitored for adverse events, side effects, and potential +negative interactions. Long-term studies after approval are conducted to +detect long-term effects and effects on subpopulations not in the original +study. +f. New information about potential safety risk is reviewed by an independent +advisory board. Marketed drugs found to be unsafe after they are approved +are removed, recalled, restricted, or appropriate risk/benefit information is +provided. +3. Patients get and use the drugs they need for good health. +a. Drug approval is not unnecessarily delayed. +b. Drugs are obtainable by patients. +c. Accurate information is available to support decision making about risks +and benefits. +d. Patients get the best intervention possible, practical, and reasonable for +their health needs. +e. Patients get drugs with the required dosage and purity. +4. Patients take the drugs in a safe and effective manner. +a. Patients get correct instructions about dosage and follow them. +b. Patients do not take unsafe combinations of drugs. +c. Patients are properly monitored by a physician while they are being treated. +d. Patients are not subjected to unacceptable risk during clinical trials. +In system engineering, the requirements may not be totally achievable in any practi- +cal design. For one thing, they may be conflicting among themselves (as was dem- +onstrated in the train door example) or with other system (non-safety) requirements +or constraints. The goal is to design a system or to evaluate and improve an existing +system that satisfies the requirements as much as possible today and to continually +improve the design over time using feedback and new scientific and engineering +advances. Tradeoffs that must be made in the design process are carefully evaluated +and considered and revisited when necessary. +Figure 7.6 shows the general pharmaceutical safety control structure in the +United States. 
Each component’s assigned responsibilities are those assumed in the +design of the structure. In fact, at any time, they may not be living up to these +responsibilities. +Congress provides guidance to the FDA by passing laws and providing directives, +provides any necessary legislation to ensure drug safety, ensures that the FDA has + + +enough funding to operate independently, provides legislative oversight on the +effectiveness of FDA activities, and holds committee hearings and investigations of +industry practices. +The FDA CDER (Center for Drug Evaluation and Research) ensures that the +prescription, generic, and over-the-counter drug products are adequately available +to the public and are safe and effective; monitors marketed drug products for +unexpected health risks; and monitors and enforces the quality of marketed drug +products. CDER staff members are responsible for selecting competent FDA advi- +sory committee members, establishing and enforcing conflict of interest rules, and +providing researchers with access to accurate and useful adverse event reports. +There are three major components within CDER. The Office of New Drugs +(OND) is in charge of approving new drugs, setting drug labels and, when required, +recalling drugs. More specifically, OND is responsible to: +1.•Oversee all U.S. human trials and development programs for investigational +medical products to ensure safety of participants in clinical trials and provide +oversight of the Institutional Review Boards (IRBs) that actually perform these +functions for the FDA. +2.•Set the requirements and process for the approval of new drugs. +3.•Critically examine a sponsor’s claim that a drug is safe for intended use (New +Drug Application Safety Review). Impartially evaluate new drugs for safety +and efficacy and approve them for sale if deemed appropriate. +4.•Upon approval, set the label for the drug. +5.•Not unnecessarily delay drugs that may have a beneficial effect. +6.•Require Phase IV (after-market) safety testing if there is a potential for long- +term safety risk. +7.•Remove a drug from the market if new evidence shows that the risks outweigh +the benefits. +8.•Update the label information when new information about drug safety is +discovered. +The second office within the FDA CDER is the Division of Drug Marketing, Adver- +tising, and Communications (DDMAC). This group provides oversight of the mar- +keting and promotion of drugs. It reviews advertisements for accuracy and balance. +The third component of the FDA CDER is the Office of Surveillance and Epi- +demiology. This group is responsible for ongoing reviews of product safety, efficacy, +and quality. It accomplishes this goal by performing statistical analysis of adverse +event data it receives to determine whether there is a safety problem. This office +reassesses risks based on new data learned after a drug is marketed and recommends + + +ways to manage risk. Its staff members may also serve as consultants to OND with +regard to drug safety issues. While they can recommend that a drug be removed +from the market if new evidence shows significant risks, only OND can actually +require that it be removed. +The FDA performs its duties with input from FDA Advisory Boards. These +boards are made up of academic researchers whose responsibility is to provide +independent advice and recommendations that are in the best interest of the general +public. They must disclose any conflicts of interest related to subjects on which +advice is being given. 
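To show how such assigned responsibilities can be captured for later analysis, the sketch below records a few of them in a simple table. The component names follow the text, but the data structure and the traced requirement are illustrative only, not an actual analysis tool.

# Capture a subset of the assigned responsibilities so a later analysis can ask
# which controller is expected to enforce a given safety requirement.
responsibilities = {
    "Congress": ["pass laws and directives guiding the FDA",
                 "provide funding and legislative oversight"],
    "FDA OND": ["approve new drugs based on safety and efficacy",
                "set and update drug labels",
                "require Phase IV safety testing when long-term risk is possible"],
    "FDA Office of Surveillance and Epidemiology": [
                "analyze adverse-event data for safety signals",
                "recommend risk-management actions for marketed drugs"],
    "FDA Advisory Boards": ["provide independent advice",
                            "disclose conflicts of interest"],
}

# Requirement 2e above (marketed drugs are monitored for adverse events and
# long-term effects), traced to the components expected to enforce it.
requirement_2e = "Marketed drugs are monitored for adverse events and long-term effects"
responsible = [component for component, duties in responsibilities.items()
               if any("adverse-event" in duty or "Phase IV" in duty for duty in duties)]
print(requirement_2e, "->", responsible or "NO COMPONENT ASSIGNED")

A hazard analysis such as the one described in chapter 8 can then ask, for each system requirement, which controller is expected to enforce it and what feedback that controller needs to do so.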
+Research scientists and centers are responsible for providing independent and +objective research on a drug’s safety, efficacy, and new uses and give their unbiased +expert opinion when it is requested by the FDA. They should disclose all their con- +flicts of interest when publishing and take credit only for papers on which they have +significantly contributed. +Scientific journals are responsible for publishing articles of high scientific quality +and provide accurate and balanced information to doctors. +Payers and insurers pay the medical costs for the people insured as needed +and only reimburse for drugs that are safe and effective. They control the use of +drugs by providing formularies or lists of approved drugs for which they will reim- +burse claims. +Pharmaceutical developers and manufacturers also have responsibilities within +the drug safety control structure. They must ensure that patients are protected from +avoidable risks by providing safe and effective drugs, testing drugs for effectiveness, +properly labeling their drugs, protecting patients during clinical trials by properly +monitoring the trial, not promoting unsafe use of their drugs, removing a drug from +the market if it is no longer considered safe, and manufacturing their drugs accord- +ing to good manufacturing practice. They are also responsible for monitoring drugs +for safety by running long-term, post-approval studies as required by the FDA; +running new trials to test for potential hazards; and providing, maintaining, and +incentivizing adverse-event reporting channels. +Pharmaceutical companies must also give accurate and up-to-date information +to doctors and the FDA about drug safety by educating doctors, providing all avail- +able information about the safety of the drug to the FDA, and informing the FDA +of potential new safety issues in a timely manner. Pharmaceutical companies also +sponsor research for the development of new drugs and treatments. +Last, but not least, are the physicians and patients. Physicians have the responsi- +bility to: +1.• +Make treatment decisions based on the best interests of their clients. +2.• Weigh the risks of treatment and non-treatment. + + +3.•Prescribe drugs according to the limitations on the label +4.•Maintain up-to-date knowledge of the risk/benefit profile of the drugs they are +prescribing +5.•Monitor the symptoms of their patients under treatment for adverse events +and negative interactions +6.•Report adverse events potentially linked to the use of the drugs they +prescribe +Patients are taking increasing responsibility for their own health in today’s world, +limited by what is practical. Traditionally they have been responsible to follow their +physician’s instructions and take drugs as prescribed, accede to the doctor’s superior +knowledge when appropriate, and go through physicians or appropriate channels to +get prescription drugs. +As designed, this safety control structure looks strong and potentially effective. +Unfortunately, it has not always worked the way it was supposed to work and the +individual components have not always satisfied their responsibilities. Chapter 8 +describes the use of the new hazard analysis technique, STPA, as well as other basic +STAMP concepts in analyzing the potential risks in this structure. \ No newline at end of file diff --git a/chapter07.txt b/chapter07.txt new file mode 100644 index 0000000..88c93bf --- /dev/null +++ b/chapter07.txt @@ -0,0 +1,586 @@ +chapter 7. Fundamentals. 
+All the parts of the process described in the following chapters start from the same +fundamental system engineering activities. These include defining, for the system +involved, accidents or losses, hazards, safety requirements and constraints, and the +safety control structure. + +section 7.1. +Defining Accidents and Unacceptable Losses. +The first step in any safety effort involves agreeing on the types of accidents or +losses to be considered. +In general, the definition of an accident comes from the customer and occasionally from the government for systems that are regulated by government agencies. +Other sources might be user groups, insurance companies, professional societies, +industry standards, and other stakeholders. If the company or group developing the +system is free to build whatever they want, then considerations of liability and the +cost of accidents will come into play. +Definitions of basic terms differ greatly among industries and engineering disciplines. A set of basic definitions is used in this book .(see appendix A). that reflect +common usage in System Safety. An accident is defined as. +Accident. An undesired or unplanned event that results in a loss, including loss +of human life or human injury, property damage, environmental pollution, +mission loss, etc. +An accident need not involve loss of life, but it does result in some loss that is unacceptable to the stakeholders. System Safety has always considered non-human +losses, but for some reason, many other approaches to safety engineering have +limited the definition of a loss to human death or injury. As an example of an +inclusive definition, a spacecraft accident might include loss of the astronauts .(if +the spacecraft is manned), death or injury to support personnel or the public, nonaccomplishment of the mission, major equipment damage .(such as damage to launch + + +facilities), environmental pollution of planets, and so on. An accident definition used +in the design of an explorer spacecraft to characterize the icy moon of a planet in +the Earth’s solar system, for example, was . +A1. Humans or human assets on earth are killed or damaged. +A2. Humans or human assets off of the earth are killed or damaged. +A3. Organisms on any of the moons of the outer planet .(if they exist). are killed +or mutated by biological agents of Earth origin. +Rationale. Contamination of an icy outer planet moon with biological agents +of Earth origin could have catastrophically adverse effects on any biological +agents indigenous to the icy outer planet moon. +A4. The scientific data corresponding to the mission goals is not collected. +A5. The scientific data corresponding to the mission goals is rendered unusable +(i.e., deleted or corrupted). before it can be fully investigated. +A6. Organisms of Earth origin are mistaken for organisms indigenous to any of +the moons of the outer planet in future missions to study the outer planet’s +moon. +Rationale. Contamination of a moon of an outer planet with biological +agents of Earth origin could lead to a situation in which a future mission +discovers the biological agents and falsely concludes that they are indigenous to the moon of the outer planet. +A7. An incident during this mission directly causes another mission to fail +to collect, return, or use the scientific data corresponding to its mission +goals. +Rationale. 
It is possible for this mission to interfere with the completion of +other missions through denying the other missions access to the space +exploration infrastructure .(for example, overuse of limited Deep Space +Network1 .(DSN). resources, causing another mission to miss its launch +window because of damage to the launch pad during this mission, etc.) + +footnote. The Deep Space Network is an international network of large antennas and communication facilities +that supports interplanetary spacecraft missions and radio and radar astronomy observations for +the exploration of the solar system and the universe. The network also supports some Earth-orbiting +missions. + +Prioritizing or assigning a level of severity to the identified losses may be useful +when tradeoffs among goals are required in the design process. As an example, +consider an industrial robot to service the thermal tiles on the Space Shuttle, which + + +is used as an example in chapter 9. The goals for the robot are .(1). to inspect the +thermal tiles for damage caused during launch, reentry, and transport of a Space +Shuttle and .(2). to apply waterproofing chemicals to the thermal tiles. +Level 1. +A1.1. Loss of the orbiter and crew. .(e.g., inadequate thermal protection) +A1.2. Loss of life or serious injury in the processing facility. +Level 2. +A2.1. Damage to the orbiter or to objects in the processing facility that results. +in the delay of a launch or in a loss of greater than x dollars. +A2.2. Injury to humans requiring hospitalization or medical attention and +leading to long-term or permanent physical effects. +Level 3. +A3.1. Minor human injury. .(does not require medical attention or requires only +minimal intervention and does not lead to long-term or permanent physical +effects) +A3.2. Damage to orbiter that does not delay launch and results in a loss of less +than x dollars. +A3.3. Damage to objects in the processing facility .(both on the floor or suspended). that does not result in delay of a launch or a loss of greater than x +dollars. +A3.4. Damage to the mobile robot. +Assumption. It is assumed that there is a backup plan in place for servicing +the orbiter thermal tiles in case the tile processing robot has a mechanical +failure and that the same backup measures can be used in the event the +robot is out of commission due to other reasons. +The customer may also have a safety policy that must be followed by the contractor +or those designing the thermal tile servicing robot. As an example, the following is +similar to a typical NASA safety policy. +General Safety Policy. All hazards related to human injury or damage to the +orbiter must be eliminated or mitigated by the system design. A reasonable +effort must be made to eliminate or mitigate hazards resulting at most in +damage to the robot or objects in the work area. For any hazards that cannot +be eliminated, the hazard analysis as well as the design features and development procedures, including any tradeoff studies, must be documented and +presented to the customer for acceptance. + + +One .(but only one). of the controls used to avoid this type of accident is an airborne collision avoidance system like T Cass .(Traffic alert and Collision Avoidance +System), which is now required on most commercial aircraft. While the goal of T Cass +is increased safety, T Cass itself introduces new hazards associated with its use. Some +hazards that were considered during the design of T Cass are. +H1. 
T Cass causes or contributes to a near midair collision .(N Mack), defined as +a pair of controlled aircraft violating minimum separation standards. +H2. T Cass causes or contributes to a controlled maneuver into the ground. +H3. T Cass causes or contributes to the pilot losing control over the aircraft. +H4. T Cass interferes with other safety-related aircraft systems. +H5. T Cass interferes with the ground-based Air Traffic Control system .(e.g., +transponder transmissions to the ground or radar or radio services). +H6. T Cass interferes with an A T C advisory that is safety-related .(e.g., avoiding +a restricted area or adverse weather conditions). +Ground-based air traffic control also plays an important role in collision avoidance, +although it has responsibility for a larger and different set of hazards. +H1. Controlled aircraft violate minimum separation standards .(N Mack). +H2. An airborne controlled aircraft enters an unsafe atmospheric region. +H3. A controlled airborne aircraft enters restricted airspace without authorization. +H4. A controlled airborne aircraft gets too close to a fixed obstacle other than a +safe point of touchdown on assigned runway .(known as controlled flight into +terrain or C Fit). +H5. A controlled airborne aircraft and an intruder in controlled airspace violate +minimum separation. +H6. Loss of controlled flight or loss of airframe integrity. +H7. An aircraft on the ground comes too close to moving objects or collides with +stationary objects or leaves the paved area. +H8. An aircraft enters a runway for which it does not have a clearance .(called +runway incursion). +Unsafe behavior .(hazards). at the system level can be mapped into hazardous +behaviors at the component or subsystem level. Note, however, that the reverse +(bottom-up). process is not possible, that is, it is not possible to identify the systemlevel hazards by looking only at individual component behavior. Safety is a system +property, not a component property. Consider an automated door system. One + + +reasonable hazard when considering the door alone is the door closing on someone. +The associated safety constraint is that the door must not close on anyone in the +doorway. This hazard is relevant if the door system is used in any environment. If +the door is in a building, another important hazard is not being able to get out of a +dangerous environment, for example, if the building is on fire. Therefore, a reasonable design constraint would be that the door opens whenever a door open request +is received. But if the door is used on a moving train, an additional hazard must +be considered, namely, the door opening while the train is moving and between +stations. In a moving train, different safety design constraints would apply compared +to an automated door system in a building. Hazard identification is a top-down +process that must consider the encompassing system and its hazards and potential +accidents. +Let’s assume that the automated door system is part of a train control system. +The system-level train hazards related to train doors include a person being hit by +closing doors, someone falling from a moving train or from a stationary train that +is not properly aligned with a station platform, and passengers and staff being unable +to escape from a dangerous environment in the train compartment. Tracing these +system hazards into the related hazardous behavior of the automated door component of the train results in the following hazards. +1. Door is open when the train starts. 
+2. Door opens while train is in motion. +3. Door opens while not properly aligned with station platform. +4. Door closes while someone is in the doorway. +5. Door that closes on an obstruction does not reopen or reopened door does +not reclose. +6. Doors cannot be opened for emergency evacuation between stations. +The designers of the train door controller would design to control these hazards. +Note that constraints 3 and 6 are conflicting, and the designers will have to reconcile +such conflicts. In general, attempts should first be made to eliminate hazards at +the system level. If they cannot be eliminated or adequately controlled at the +system level, then they must be refined into hazards to be handled by the system +components. +Unfortunately, no tools exist for identifying hazards. It takes domain expertise +and depends on subjective evaluation by those constructing the system. Chapter 13 +in Safeware provides some common heuristics that may be helpful in the process. +The good news is that identifying hazards is usually not a difficult process. The later +steps in the hazard analysis process are where most of the mistakes and effort occurs. + + +There is also no right or wrong set of hazards, only a set that the system stakeholders agree is important to avoid. Some government agencies have mandated the +hazards they want considered for the systems they regulate or certify. For example, +the U.S. Department of Defense requires that producers of nuclear weapons +consider four hazards. +1. Weapons involved in accident or incidents, or jettisoned weapons, produce a +nuclear yield. +2. Nuclear weapons are deliberately prearmed, armed, launched, fired, or released +without execution of emergency war orders or without being directed to do so +by a competent authority. +3. Nuclear weapons are inadvertently prearmed, armed, launched, fired, or +released. +4. Inadequate security is applied to nuclear weapons. +Sometimes user or professional associations define the hazards for the systems they +use and that they want developers to eliminate or control. In most systems, however, +the hazards to be considered are up to the developer and their customer(s). +section 7.3. +System Safety Requirements and Constraints. +After the system and component hazards have been identified, the next major goal +is to specify the system-level safety requirements and design constraints necessary +to prevent the hazards from occurring. These constraints will be used to guide the +system design and tradeoff analyses. +The system-level constraints are refined and allocated to each component during +the system engineering decomposition process. The process then iterates over the +individual components as they are refined .(and perhaps further decomposed). and +as design decisions are made. +Figure 7.1 shows an example of the design constraints that might be generated +from the automated train door hazards. Again, note that the third constraint potentially conflicts with the last one and the resolution of this conflict will be an important part of the system design process. Identifying these types of conflicts early in +the design process will lead to better solutions. Choices may be more limited later +on when it may not be possible or practical to change the early decisions. +As the design process progresses and design decisions are made, the safety +requirements and constraints are further refined and expanded. 
For example, a +safety constraint on T Cass is that it must not interfere with the ground-based air +traffic control system. Later in the process, this constraint will be refined into more +detailed constraints on the ways this interference might occur. Examples include + + +constraints on T Cass design to limit interference with ground-based surveillance +radar, with distance-measuring equipment channels, and with radio services. Additional constraints include how T Cass can process and transmit information .(see +chapter 10). +Figure 7.2 shows the high-level requirements and constraints for some of the air +traffic control hazards identified above. Comparing the A T C high-level constraints +with the T Cass high-level constraints .(figure 7.3). is instructive. Ground-based air +traffic control has additional requirements and constraints related to aspects of the +collision problem that T Cass cannot handle alone, as well as other hazards and +potential aircraft accidents that it must control. +Some constraints on the two system components .(A T C and T Cass). are closely +related, such as the requirement to provide advisories that maintain safe separation +between aircraft. This example of overlapping control raises important concerns +about potential conflicts and coordination problems that need to be resolved. As +noted in section 4.5, accidents often occur in the boundary areas between controllers +and when multiple controllers control the same process. The inadequate resolution +of the conflict between multiple controller responsibilities for aircraft separation +contributed to the collision of two aircraft over the town of Überlingen .(Germany) + + +in July 2 thousand 2 when T Cass and the ground air traffic controller provided conflicting +advisories to the pilots. Potentially conflicting responsibilities must be carefully +handled in system design and operations and identifying such conflicts are part of +the new hazard analysis technique described in chapter 8. +Hazards related to the interaction among components, for example the interaction between attempts by air traffic control and by T Cass to prevent collisions, +need to be handled in the safety control structure design, perhaps by mandating +how the pilot is to select between conflicting advisories. There may be considerations +in handling these hazards in the subsystem design that will impact the behavior of +multiple subsystems and therefore must be resolved at a higher level and passed to +them as constraints on their behavior. + +section 7.4. +The Safety Control Structure. +The safety requirements and constraints on the physical system design shown in +section 7.3 act as input to the standard system engineering process and must be +incorporated into the physical system design and safety control structure. An +example of how they are used is provided in chapter 10. +Additional system safety requirements and constraints, including those on operations and maintenance or upgrades will be used in the design of the safety control +structure at the organizational and social system levels above the physical system. +There is no one correct safety control structure. what is practical and effective will +depend greatly on cultural and other factors. Some general principles that apply to +all safety control structures are described in chapter 13. These principles need to be +combined with specific system safety requirements and constraints for the particular +system involved to design the control structure. 
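As a concrete illustration of what such a structure records, the sketch below lists two of the control loops discussed earlier in this chapter, each with its control actions, feedback, and the process model its controller must maintain. The field names and entries are hypothetical, written here only to show the idea; overlapping control of the same process, such as aircraft separation, becomes visible when the loops are written down side by side.

from dataclasses import dataclass
from typing import List

@dataclass
class ControlLoop:
    controller: str
    controlled_process: str
    control_actions: List[str]
    feedback: List[str]
    process_model: List[str]     # what the controller must know to control safely

tcas_loop = ControlLoop(
    controller="TCAS",
    controlled_process="own-aircraft trajectory (via the pilot)",
    control_actions=["resolution advisory (climb or descend)"],
    feedback=["transponder replies from nearby aircraft", "own altitude and rate"],
    process_model=["positions and closing rates of intruder aircraft"],
)

atc_loop = ControlLoop(
    controller="Ground-based ATC",
    controlled_process="aircraft separation in controlled airspace",
    control_actions=["clearances", "advisories"],
    feedback=["surveillance radar returns", "pilot reports"],
    process_model=["traffic picture for the sector"],
)

# Two controllers issuing commands that affect the same process (separation) is
# exactly the kind of overlap discussed in section 7.3; listing the loops together
# makes the coordination requirement explicit.
for loop in (tcas_loop, atc_loop):
    print(loop.controller, "->", loop.controlled_process)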
The process for engineering social systems is very similar to the regular system engineering process and starts, like any system engineering project, with identifying system requirements and constraints. The responsibility for implementing each requirement needs to be assigned to the components of the control structure, along with requisite authority and accountability, as in any management system; controls must be designed to ensure that the responsibilities can be carried out; and feedback loops created to assist the controller in maintaining accurate process models.

section 7.4.1. The Safety Control Structure for a Technical System.
An example from the world of space exploration is used in this section, but many of the same requirements and constraints could easily be adapted for other types of technical system development and operations.
The requirements in this example were generated to perform a programmatic risk assessment of a new NASA management structure called Independent Technical Authority (ITA), recommended in the report of the Columbia Accident Investigation Board. The risk analysis itself is described in the chapter on the new hazard analysis technique called STPA (chapter 8). But the first step in the safety or risk analysis is the same as for technical systems: to identify the system hazards to be avoided, to generate a set of requirements for the new management structure, and to design the control structure.
The new safety control structure for the NASA manned space program was introduced to improve the flawed engineering and management decision making leading to the Columbia loss. The hazard to be eliminated or mitigated was:

System Hazard.
Poor engineering and management decision making leading to a loss.

Four high-level system safety requirements and constraints for preventing the hazard were identified and then refined into more specific requirements and constraints.
1. Safety considerations must be first and foremost in technical decision making.
a. State-of-the-art safety standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public.
b. Safety-related technical decision making must be independent from programmatic considerations, including cost and schedule.
c. Safety-related decision making must be based on correct, complete, and up-to-date information.
d. Overall (final) decision making must include transparent and explicit consideration of both safety and programmatic concerns.
e. The Agency must provide for effective assessment and improvement in safety-related decision making.
2. Safety-related technical decision making must be done by eminently qualified experts, with broad participation of the full workforce.
a. Technical decision making must be credible (executed using credible personnel, technical requirements, and decision-making tools).
b. Technical decision making must be clear and unambiguous with respect to authority, responsibility, and accountability.
c. All safety-related technical decisions, before being implemented by the Program, must have the approval of the technical decision maker assigned responsibility for that class of decisions.
d. Mechanisms and processes must be created that allow and encourage all employees and contractors to contribute to safety-related decision making.
3.
Safety analyses must be available and used starting in the early acquisition, requirements development, and design processes and continuing through the system life cycle.
a. High-quality system hazard analyses must be created.
b. Personnel must have the capability to produce high-quality safety analyses.
c. Engineers and managers must be trained to use the results of hazard analyses in their decision making.
d. Adequate resources must be applied to the hazard analysis process.
e. Hazard analysis results must be communicated in a timely manner to those who need them. A communication structure must be established that includes contractors and allows communication downward, upward, and sideways (e.g., among those building subsystems).
f. Hazard analyses must be elaborated (refined and extended) and updated as the design evolves and test experience is acquired.
g. During operations, hazard logs must be maintained and used as experience is acquired. All in-flight anomalies must be evaluated for their potential to contribute to hazards.
4. The Agency must provide avenues for the full expression of technical conscience (for safety-related technical concerns) and provide a process for full and adequate resolution of technical conflicts as well as conflicts between programmatic and technical concerns.
a. Communication channels, resolution processes, and adjudication procedures must be created to handle expressions of technical conscience.
b. Appeals channels must be established to surface complaints and concerns about aspects of the safety-related decision making and technical conscience structures that are not functioning appropriately.
Where do these requirements and constraints come from? Many of them are based on fundamental safety-related development, operations, and management principles identified in various chapters of this book, particularly chapters 12 and 13. Others are based on experience, such as the causal factors identified in the Columbia and Challenger accident reports or other critiques of the NASA safety culture and of NASA safety management. The requirements listed obviously reflect the advanced technology and engineering domain of NASA and the space program that was the focus of the ITA program, along with some of the unique aspects of the NASA culture. Other industries will have their own requirements. An example for the pharmaceutical industry is shown in the next section of this chapter.
There is unlikely to be a universal set of requirements that holds for every safety control structure beyond a small set of requirements too general to be very useful in a risk analysis. Each organization needs to determine what its particular safety goals are and the system requirements and constraints that are likely to ensure that it reaches them.
Clearly, buy-in and approval of the safety goals and requirements by the stakeholders, such as management and the broader workforce, as well as anyone overseeing the group being analyzed, such as a regulatory agency, is important when designing and analyzing a safety control structure.
Independent Technical Authority is a safety control structure used in the nuclear Navy SUBSAFE program described in chapter 14. In this structure, safety-related decision making is taken out of the hands of the program manager and assigned to a Technical Authority. In the original NASA implementation, the technical authority rested in the NASA Chief Engineer, but changes have since been made.
The overall safety control structure for the original NASA ITA is shown in figure 7.4.3
For each component of the structure, information must be determined about its overall role, responsibilities, controls, process model requirements, coordination and communication requirements, contextual (environmental and behavior-shaping) factors that might bear on the component's ability to fulfill its responsibilities, and inputs and outputs to other components in the control structure. The responsibilities are shown in figure 7.5. A risk analysis on ITA and the safety control structure is described in chapter 8.

footnote. The control structure was later changed to have ITA under the control of the NASA center directors rather than the NASA chief engineer; therefore, this control structure does not reflect the actual implementation of ITA at NASA, but it was the design at the time of the hazard analysis described in chapter 8.

section 7.4.2. Safety Control Structures in Social Systems.
Social system safety control structures often are not designed but evolve over time. They can, however, be analyzed for inherent risk and redesigned or "reengineered" to prevent accidents or to eliminate or control past causes of losses as determined in an accident analysis.
The reengineering process starts with the definition of the hazards to be eliminated or mitigated, the system requirements and constraints necessary to increase safety, and the design of the current safety control structure. Analysis can then be used to drive the redesign of the safety controls. But once again, just like every system that has been described so far in this chapter, the process starts by identifying the hazards and the safety requirements and constraints derived from them. The process is illustrated using drug safety.
Dozens of books have been written about the problems in the pharmaceutical industry. Everyone appears to have good intentions and to be simply striving to optimize their performance within the existing incentive structure. The result is that the system has evolved to the point where each group's individual best interests do not necessarily add up to or are not aligned with the best interests of society as a whole. A safety control structure exists, but it does not necessarily provide adequate satisfaction of the system-level goals, as opposed to the individual component goals.
This problem can be viewed as a classic system engineering problem: optimizing each component does not necessarily add up to a system optimum. Consider the air transportation system, as noted earlier. When each aircraft tries to optimize its path from its departure point to its destination, the overall system throughput may not be optimized when they all arrive at a popular hub at the same time. One goal of the air traffic control system is to control individual aircraft movement in order to optimize overall system throughput while trying to allow as much flexibility as possible for the individual aircraft and airlines to achieve their goals. The air traffic control system and the rules of operation of the air transportation system resolve conflicting goals when public safety is at stake. Each airline might want its own aircraft to land as quickly as possible, but the air traffic controllers ensure adequate spacing between aircraft to preserve safety margins. These same principles can be applied to non-engineered systems.
The ultimate goal is to determine how to reengineer or redesign the overall pharmaceutical safety control structure in a way that aligns incentives for the greater good of society. A well-designed system would make it easier for all stakeholders to do the right thing, both scientifically and ethically, while achieving their own goals as much as possible. By providing the decision makers with information about ways to achieve the overall system objectives and the tradeoffs involved, better decision making can result.
While system engineering is applicable to pharmaceutical (and, more generally, medical) safety and risk management, there are important differences from the classic engineering problem that require changes to the traditional system safety approaches. In most technical systems, managing risk is simpler because not doing something (e.g., not inadvertently launching the missile) is usually safe and the problem revolves around preventing the hazardous event (inadvertent launch), a risk/no-risk situation. The traditional engineering approach identifies and evaluates the costs and potential effectiveness of different ways to eliminate or control the hazards involved in the operational system. Tradeoffs require comparing the costs of various solutions, including costs that involve reduction in desirable system functions or system reliability.
The problem in pharmaceutical safety is different: there is risk in prescribing a potentially unsafe drug, but there is also risk in not prescribing the drug (the patient dies from their medical condition), a risk/risk situation. The risks and benefits conflict in ways that greatly increase the complexity of decision making and the information needed to make decisions. New, more powerful system engineering techniques are required to deal with risk/risk decisions.
Once again, the basic goals, hazards, and safety requirements must first be identified.
System Goal. To provide safe and effective pharmaceuticals to enhance the long-term health of the population.

Important loss events (accidents) we are trying to avoid are:
1. Patients get a drug treatment that negatively impacts their health.
2. Patients do not get the treatment they need.

Three system hazards can be identified that are related to these loss events.
H1. The public is exposed to an unsafe drug.
1. The drug is released with a label that does not correctly specify the conditions for its safe use.
2. An approved drug is found to be unsafe and appropriate responses are not taken (warnings, withdrawals from the market, etc.).
3. Patients are subjected to unacceptable risk during clinical trials.
H2. Drugs are taken unsafely.
1. The wrong drug is prescribed for the indication.
2. The pharmacist provides a different medication than was prescribed.
3. Drugs are taken in an unsafe combination.
4. Drugs are not taken according to directions (dosage, timing).
H3. Patients do not get an effective treatment they require.
1. Safe and effective drugs are not developed, are not approved for use, or are withdrawn from the market.
2. Safe and effective drugs are not affordable by those who need them.
3. Unnecessary delays are introduced into development and marketing.
4. Physicians do not prescribe needed drugs or patients have no access to those who could provide the drugs to them.
5. Patients stop taking a prescribed drug due to perceived ineffectiveness or intolerable side effects.
From these hazards, a set of system requirements can be derived to prevent them.
1. Pharmaceutical products are developed to enhance long-term health.
a. Continuous appropriate incentives exist to develop and market needed drugs.
b. The scientific knowledge and technology needed to develop new drugs and optimize their use is available.
2. Drugs on the market are adequately safe and effective.
a. Drugs are subjected to effective and timely safety testing.
b. New drugs are approved by the FDA based upon a validated and reproducible decision-making process.
c. The labels attached to drugs provide correct information about safety and efficacy.
d. Drugs are manufactured according to good manufacturing practices.
e. Marketed drugs are monitored for adverse events, side effects, and potential negative interactions. Long-term studies after approval are conducted to detect long-term effects and effects on subpopulations not in the original study.
f. New information about potential safety risk is reviewed by an independent advisory board. Marketed drugs found to be unsafe after they are approved are removed, recalled, or restricted, or appropriate risk/benefit information is provided.
3. Patients get and use the drugs they need for good health.
a. Drug approval is not unnecessarily delayed.
b. Drugs are obtainable by patients.
c. Accurate information is available to support decision making about risks and benefits.
d. Patients get the best intervention possible, practical, and reasonable for their health needs.
e. Patients get drugs with the required dosage and purity.
4. Patients take the drugs in a safe and effective manner.
a. Patients get correct instructions about dosage and follow them.
b. Patients do not take unsafe combinations of drugs.
c. Patients are properly monitored by a physician while they are being treated.
d. Patients are not subjected to unacceptable risk during clinical trials.
In system engineering, the requirements may not be totally achievable in any practical design. For one thing, they may conflict among themselves (as was demonstrated in the train door example) or with other system (non-safety) requirements or constraints. The goal is to design a system, or to evaluate and improve an existing system, that satisfies the requirements as much as possible today and to continually improve the design over time using feedback and new scientific and engineering advances. Tradeoffs that must be made in the design process are carefully evaluated and considered and revisited when necessary.
Figure 7.6 shows the general pharmaceutical safety control structure in the United States. Each component's assigned responsibilities are those assumed in the design of the structure. In fact, at any time, the components may not be living up to these responsibilities.
Congress provides guidance to the FDA by passing laws and providing directives, provides any necessary legislation to ensure drug safety, ensures that the FDA has enough funding to operate independently, provides legislative oversight on the effectiveness of FDA activities, and holds committee hearings and investigations of industry practices.
The FDA CDER (Center for Drug Evaluation and Research) ensures that the prescription, generic, and over-the-counter drug products are adequately available to the public and are safe and effective; monitors marketed drug products for unexpected health risks; and monitors and enforces the quality of marketed drug products.
CDER staff members are responsible for selecting competent FDA advisory committee members, establishing and enforcing conflict of interest rules, and providing researchers with access to accurate and useful adverse event reports.
There are three major components within CDER. The Office of New Drugs (OND) is in charge of approving new drugs, setting drug labels, and, when required, recalling drugs. More specifically, OND is responsible to:
1. Oversee all U.S. human trials and development programs for investigational medical products to ensure safety of participants in clinical trials and provide oversight of the Institutional Review Boards (IRBs) that actually perform these functions for the FDA.
2. Set the requirements and process for the approval of new drugs.
3. Critically examine a sponsor's claim that a drug is safe for intended use (New Drug Application Safety Review). Impartially evaluate new drugs for safety and efficacy and approve them for sale if deemed appropriate.
4. Upon approval, set the label for the drug.
5. Not unnecessarily delay drugs that may have a beneficial effect.
6. Require Phase 4 (after-market) safety testing if there is a potential for long-term safety risk.
7. Remove a drug from the market if new evidence shows that the risks outweigh the benefits.
8. Update the label information when new information about drug safety is discovered.
The second office within the FDA CDER is the Division of Drug Marketing, Advertising, and Communications (DDMAC). This group provides oversight of the marketing and promotion of drugs. It reviews advertisements for accuracy and balance.
The third component of the FDA CDER is the Office of Surveillance and Epidemiology. This group is responsible for ongoing reviews of product safety, efficacy, and quality. It accomplishes this goal by performing statistical analysis of the adverse event data it receives to determine whether there is a safety problem. This office reassesses risks based on new data learned after a drug is marketed and recommends ways to manage risk. Its staff members may also serve as consultants to OND with regard to drug safety issues. While they can recommend that a drug be removed from the market if new evidence shows significant risks, only OND can actually require that it be removed.
The FDA performs its duties with input from FDA Advisory Boards. These boards are made up of academic researchers whose responsibility is to provide independent advice and recommendations that are in the best interest of the general public. They must disclose any conflicts of interest related to subjects on which advice is being given.
Research scientists and centers are responsible for providing independent and objective research on a drug's safety, efficacy, and new uses and for giving their unbiased expert opinion when it is requested by the FDA. They should disclose all their conflicts of interest when publishing and take credit only for papers to which they have significantly contributed.
Scientific journals are responsible for publishing articles of high scientific quality and providing accurate and balanced information to doctors.
Payers and insurers pay the medical costs for the people insured as needed and only reimburse for drugs that are safe and effective. They control the use of drugs by providing formularies, or lists of approved drugs for which they will reimburse claims.
Pharmaceutical developers and manufacturers also have responsibilities within the drug safety control structure. They must ensure that patients are protected from avoidable risks by providing safe and effective drugs, testing drugs for effectiveness, properly labeling their drugs, protecting patients during clinical trials by properly monitoring the trial, not promoting unsafe use of their drugs, removing a drug from the market if it is no longer considered safe, and manufacturing their drugs according to good manufacturing practice. They are also responsible for monitoring drugs for safety by running long-term, post-approval studies as required by the FDA; running new trials to test for potential hazards; and providing, maintaining, and incentivizing adverse-event reporting channels.
Pharmaceutical companies must also give accurate and up-to-date information to doctors and the FDA about drug safety by educating doctors, providing all available information about the safety of the drug to the FDA, and informing the FDA of potential new safety issues in a timely manner. Pharmaceutical companies also sponsor research for the development of new drugs and treatments.
Last, but not least, are the physicians and patients. Physicians have the responsibility to:
1. Make treatment decisions based on the best interests of their clients.
2. Weigh the risks of treatment and non-treatment.
3. Prescribe drugs according to the limitations on the label.
4. Maintain up-to-date knowledge of the risk/benefit profile of the drugs they are prescribing.
5. Monitor the symptoms of their patients under treatment for adverse events and negative interactions.
6. Report adverse events potentially linked to the use of the drugs they prescribe.
Patients are taking increasing responsibility for their own health in today's world, limited by what is practical. Traditionally they have been responsible to follow their physician's instructions and take drugs as prescribed, accede to the doctor's superior knowledge when appropriate, and go through physicians or appropriate channels to get prescription drugs.
As designed, this safety control structure looks strong and potentially effective. Unfortunately, it has not always worked the way it was supposed to work, and the individual components have not always satisfied their responsibilities. Chapter 8 describes the use of the new hazard analysis technique, STPA, as well as other basic STAMP concepts, in analyzing the potential risks in this structure.
\ No newline at end of file
diff --git a/chapter08.raw b/chapter08.raw
new file mode 100644
index 0000000..16f4068
--- /dev/null
+++ b/chapter08.raw
@@ -0,0 +1,1276 @@
+chapter 8.
+STPA: A New Hazard Analysis Technique.
Hazard analysis can be described as "investigating an accident before it occurs." The goal is to identify potential causes of accidents, that is, scenarios that can lead to losses, so they can be eliminated or controlled in design or operations before damage occurs.
The most widely used existing hazard analysis techniques were developed fifty years ago and have serious limitations in their applicability to today's more complex, software-intensive, sociotechnical systems. This chapter describes a new approach to hazard analysis, based on the STAMP causality model, called STPA (System-Theoretic Process Analysis).
section 8.1.
Goals for a New Hazard Analysis Technique.
+Three hazard analysis techniques are currently used widely: Fault Tree Analysis, +Event Tree Analysis, and HAZOP. Variants that combine aspects of these three +techniques, such as Cause-Consequence Analysis (combining top-down fault trees +and forward analysis Event Trees) and Bowtie Analysis (combining forward and +backward chaining techniques) are also sometimes used. Safeware and other basic +textbooks contain more information about these techniques for those unfamiliar +with them. FMEA (Failure Modes and Effects Analysis) is sometimes used as a +hazard analysis technique, but it is a bottom-up reliability analysis technique and +has very limited applicability for safety analysis. +The primary reason for developing STPA was to include the new causal factors +identified in STAMP that are not handled by the older techniques. More specifically, +the hazard analysis technique should include design errors, including software flaws; +component interaction accidents; cognitively complex human decision-making +errors; and social, organizational, and management factors contributing to accidents. +In short, the goal is to identify accident scenarios that encompass the entire accident +process, not just the electromechanical components. While attempts have been +made to add new features to traditional hazard analysis techniques to handle new + + +technology, these attempts have had limited success because the underlying assump- +tions of the old techniques and the causality models on which they are based do not +fit the characteristics of these new causal factors. STPA is based on the new causality +assumptions identified in chapter 2. +An additional goal in the design of STPA was to provide guidance to the users +in getting good results. Fault tree and event tree analysis provide little guidance to +the analyst—the tree itself is simply the result of the analysis. Both the model of the +system being used by the analyst and the analysis itself are only in the analyst’s +head. Analyst expertise in using these techniques is crucial, and the quality of the +fault or event trees that result varies greatly. +HAZOP, widely used in the process industries, provides much more guidance to +the analysts. HAZOP is based on a slightly different accident model than fault and +event trees, namely that accidents result from deviations in system parameters, such +as too much flow through a pipe or backflow when forward flow is required. +HAZOP uses a set of guidewords to examine each part of a plant piping and wiring +diagram, such as more than, less than, and opposite. Both guidance in performing +the process and a concrete model of the physical structure of the plant are therefore +available. +Like HAZOP, STPA works on a model of the system and has “guidewords” to +assist in the analysis, but because in STAMP accidents are seen as resulting from +inadequate control, the model used is a functional control diagram rather than a +physical component diagram. In addition, the set of guidewords is based on lack of +control rather than physical parameter deviations. While engineering expertise is +still required, guidance is provided for the STPA process to provide some assurance +of completeness in the analysis. +The third and final goal for STPA is that it can be used before a design has been +created, that is, it provides the information necessary to guide the design process, +rather than requiring a design to exist before the analysis can start. 
Designing +safety into a system, starting in the earliest conceptual design phases, is the most +cost-effective way to engineer safer systems. The analysis technique must also, of +course, be applicable to existing designs or systems when safety-guided design is +not possible. +section 8.2. +The STPA Process. +STPA (System-Theoretic Process Analysis) can be used at any stage of the system +life cycle. It has the same general goals as any hazard analysis technique: accumulat- +ing information about how the behavioral safety constraints, which are derived +from the system hazards, can be violated. Depending on when it is used, it provides +the information and documentation necessary to ensure the safety constraints are + + +enforced in system design, development, manufacturing, and operations, including +the natural changes in these processes that will occur over time. +STPA uses a functional control diagram and the requirements, system hazards, +and the safety constraints and safety requirements for the component as defined in +chapter 7. When STPA is applied to an existing design, this information is available +when the analysis process begins. When STPA is used for safety-guided design, only +the system-level requirements and constraints may be available at the beginning +of the process. In the latter case, these requirements and constraints are refined +and traced to individual system components as the iterative design and analysis +process proceeds. +STPA has two main steps: +1. Identify the potential for inadequate control of the system that could lead to +a hazardous state. Hazardous states result from inadequate control or enforce- +ment of the safety constraints, which can occur because: +a. A control action required for safety is not provided or not followed. +b. An unsafe control action is provided. +c. A potentially safe control action is provided too early or too late, that is, at +the wrong time or in the wrong sequence. +d. A control action required for safety is stopped too soon or applied too long. +2. Determine how each potentially hazardous control action identified in step 1 +could occur. +a. For each unsafe control action, examine the parts of the control loop to see +if they could cause it. Design controls and mitigation measures if they do not +already exist or evaluate existing measures if the analysis is being performed +on an existing design. For multiple controllers of the same component or +safety constraint, identify conflicts and potential coordination problems. +b. Consider how the designed controls could degrade over time and build in +protection, including +b.1. Management of change procedures to ensure safety constraints are +enforced in planned changes. +b.2. Performance audits where the assumptions underlying the hazard analy- +sis are the preconditions for the operational audits and controls so that +unplanned changes that violate the safety constraints can be detected. +b.3. Accident and incident analysis to trace anomalies to the hazards and to +the system design. +While the analysis can be performed in one step, dividing the process into +discrete steps reduces the analytical burden on the safety engineers and provides a + + +structured process for hazard analysis. The information from the first step (identify- +ing the unsafe control actions) is required to perform the second step (identifying +the causes of the unsafe control actions). +The assumption in this chapter is that the system design exists when STPA +is performed. 
The next chapter describes safety-guided design using STPA and principles for safe design of control systems.
STPA is defined in this chapter using two examples. The first is a simple, generic interlock. The hazard involved is exposure of a human to a potentially dangerous energy source, such as high power. The power controller, which is responsible for turning the energy on or off, implements an interlock to prevent the hazard. In the physical controlled system, a door or barrier over the power source prevents exposure while it is active. To simplify the example, we will assume that humans cannot physically be inside the area when the barrier is in place—that is, the barrier is simply a cover over the energy source. The door or cover will be manually operated so the only function of the automated controller is to turn the power off when the door is opened and to turn it back on when the door is closed.
Given this design, the process starts from:

Hazard: Exposure to a high-energy source.
Constraint: The energy source must be off when the door is not closed.

Figure 8.1 shows the control structure for this simple system. In this figure, the components of the system are shown along with the control instructions each component can provide and some potential feedback and other information or control sources for each component. Control operations by the automated controller include turning the power off and turning it on. The human operator can open and close the door. Feedback to the automated controller includes an indication of whether the door is open or not. Other feedback may be required or useful as determined during the STPA (hazard analysis) process.
The control structure for a second, more complex example to be used later in the chapter, a fictional but realistic ballistic missile intercept system (FMIS), is shown in figure 8.2. Pereira, Lee, and Howard [154] created this example to describe their use of STPA to assess the risk of inadvertent launch in the U.S. Ballistic Missile Defense System (BMDS) before its first deployment and field test.
The BMDS is a layered defense to defeat all ranges of threats in all phases of flight (boost, midcourse, and terminal). The example used in this chapter is, for security reasons, changed from the real system, but it is realistic, and the problems identified by STPA in this chapter are similar to some that were found using STPA on the real system.
The U.S. BMDS has a variety of components, including sea-based sensors in the Aegis shipborne platform; upgraded early warning systems; new and upgraded radars, ground-based midcourse defense, fire control, and communications; a Command and Control Battle Management and Communications component; and ground-based interceptors. Future upgrades will add features. Some parts of the system have been omitted in the example, such as the Aegis (ship-based) platform.
Figure 8.2 shows the control structure for the FMIS components included in the example. The command authority controls the operators by providing such things as doctrine, engagement criteria, and training. As feedback, the command authority gets the exercise results, readiness information, wargame results, and other information. The operators are responsible for controlling the launch of interceptors by sending instructions to the fire control subsystem and receiving status information as feedback.
+ + +Fire control receives instructions from the operators and information from the +radars about any current threats. Using these inputs, fire control provides instruc- +tions to the launch station, which actually controls the launch of any interceptors. +Fire control can enable firing, disable firing, and so forth, and, of course, it receives +feedback from the launch station about the status of any previously provided +control actions and the state of the system itself. The launch station controls the +actual launcher and the flight computer, which in turn controls the interceptor +hardware. +There is one other component of the system. To ensure operational readiness, the +FMIS contains an interceptor simulator that periodically is used to mimic the flight +computer in order to detect a failure in the system. + +footnote. The phrase “when the door is open” would be incorrect because a case is missing (a common problem): +in the power controller’s model of the controlled process, which enforces the constraint, the door may +be open, closed, or the door position may be unknown to the controller. The phrase “is open or the door +position is unknown” could be used instead. See section 9.3.2 for a discussion of why the difference is +important. + + + +section 8.3. +Identifying Potentially Hazardous Control Actions (Step 1) +Starting from the fundamentals defined in chapter 7, the first step in STPA is to +assess the safety controls provided in the system design to determine the potential +for inadequate control, leading to a hazard. The assessment of the hazard controls +uses the fact that control actions can be hazardous in four ways (as noted earlier): +1. A control action required for safety is not provided or is not followed. +2. An unsafe control action is provided that leads to a hazard. +3. A potentially safe control action is provided too late, too early, or out of +sequence. +4. A safe control action is stopped too soon or applied too long (for a continuous +or nondiscrete control action). +For convenience, a table can be used to record the results of this part of the analysis. +Other ways to record the information are also possible. In a classic System Safety +program, the information would be included in the hazard log. Figure 8.3 shows the +results of step 1 for the simple interlock example. The table contains four hazardous +types of behavior: +1. A power off command is not given when the door is opened, +2. The door is opened and the controller waits too long to turn the power off; +3. A power on command is given while the door is open, and +4. A power on command is provided too early (when the door has not yet fully +closed). +Incorrect but non-hazardous behavior is not included in the table. For example, +not providing a power on command when the power is off and the door is opened + + +or closed is not hazardous, although it may represent a quality-assurance problem. +Another example of a mission assurance problem but not a hazard occurs when the +power is turned off while the door is closed. Thomas has created a procedure to +assist the analyst in considering the effect of all possible combinations of environ- +mental and process variables for each control action in order to avoid missing any +cases that should be included in the table [199a]. +The final column of the table, Stopped Too Soon or Applied Too Long, is not +applicable to the discrete interlock commands. 
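One convenient way to record these step 1 results is sketched below as a small data structure. This encoding is only an illustration (the record and field names are assumed here, not a required STPA notation); its columns mirror the four ways a control action can be hazardous.

from dataclasses import dataclass

# Illustrative encoding of the step 1 results for the interlock example:
# one record per control action, one field per hazardous-behavior category.
@dataclass
class ControlActionAnalysis:
    control_action: str
    not_provided: str
    provided_causes_hazard: str
    wrong_timing_or_order: str
    stopped_too_soon_or_applied_too_long: str

step1_results = [
    ControlActionAnalysis(
        control_action="power off",
        not_provided="Hazardous: door is opened but power off is not commanded",
        provided_causes_hazard="Not hazardous (possibly a mission assurance issue)",
        wrong_timing_or_order="Hazardous: door opened and power off commanded too late",
        stopped_too_soon_or_applied_too_long="Not applicable (discrete command)",
    ),
    ControlActionAnalysis(
        control_action="power on",
        not_provided="Not hazardous",
        provided_causes_hazard="Hazardous: power on commanded while door is open",
        wrong_timing_or_order="Hazardous: power on commanded before door fully closed",
        stopped_too_soon_or_applied_too_long="Not applicable (discrete command)",
    ),
]

As in figure 8.3, the last field is marked not applicable because both interlock commands are discrete.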
An example where it does apply is +in an aircraft collision avoidance system where the pilot may be told to climb or +descend to avoid another aircraft. If the climb or descend control action is stopped +too soon, the collision may not be avoided. +The identified hazardous behaviors can now be translated into safety constraints +(requirements) on the system component behavior. For this example, four con- +straints must be enforced by the power controller (interlock): +1. The power must always be off when the door is open; +2. A power off command must be provided within x milliseconds after the door +is opened; +3. A power on command must never be issued when the door is open; +4. The power on command must never be given until the door is fully closed. +For more complex examples, the mode in which the system is operating may deter- +mine the safety of the action or event. In that case, the operating mode may need +to be included in the table, perhaps as an additional column. For example, some +spacecraft mission control actions may only be hazardous during the launch or +reentry phase of the mission. +In chapter 2, it was stated that many accidents, particularly component interac- +tion accidents, stem from incomplete requirements specifications. Examples were + + +provided such as missing constraints on the order of valve position changes in a +batch chemical reactor and the conditions under which the descent engines should +be shut down on the Mars Polar Lander spacecraft. The information provided +in this first step of STPA can be used to identify the necessary constraints on com- +ponent behavior to prevent the identified system hazards, that is, the safety require- +ments. In the second step of STPA, the information required by the component to +properly implement the constraint is identified as well as additional safety con- +straints and information necessary to eliminate or control the hazards in the design +or to design the system properly in the first place. +The FMIS system provides a less trivial example of step 1. Remember, the hazard +is inadvertent launch. Consider the fire enable command, which can be sent by the +fire control module to the launch station to allow launch commands subsequently +received by the launch station to be executed. As described in Pereira, Lee, and +Howard [154], the fire enable control command directs the launch station to enable +the live fire of interceptors. Prior to receiving this command, the launch station will +return an error message when it receives commands to fire an interceptor and will +discard the fire commands.2 +Figure 8.4 shows the results of performing STPA Step 1 on the fire enable +command. If this command is missing (column 2), a launch will not take place. While +this omission might potentially be a mission assurance concern, it does not contrib- +ute to the hazard being analyzed (inadvertent launch). + + +If the fire enable command is provided to a launch station incorrectly, the launch +station will transition to a state where it accepts interceptor tasking and can progress +through a launch sequence. In combination with other incorrect or mistimed com- +mands, this control action could contribute to an inadvertent launch. +A late fire enable command will only delay the launch station’s ability to +process a launch sequence, which will not contribute to an inadvertent launch. 
A +fire enable command sent too early could open a window of opportunity for +inadvertently progressing toward an inadvertent launch, similar to the incorrect +fire enable considered above. In the third case, a fire enable command might +be out of sequence with a fire disable command. If this incorrect sequencing is +possible in the system as designed and constructed, the system could be left +capable of processing interceptor tasking and launching an interceptor when not +intended. +Finally, the fire enable command is a discrete command sent to the launch +station to signal that it should allow processing of interceptor tasking. Because +fire enable is not a continuous command, the “stopped too soon” category does +not apply. + +footnote. Section 9.4.4 explains the safety-related reasons for breaking up potentially hazardous actions into +multiple steps. + + +section 8.4. +Determining How Unsafe Control Actions Could Occur. (Step 2) +Performing the first step of STPA provides the component safety requirements, +which may be sufficient for some systems. A second step can be performed, however, +to identify the scenarios leading to the hazardous control actions that violate the +component safety constraints. Once the potential causes have been identified, the +design can be checked to ensure that the identified scenarios have been eliminated +or controlled in some way. If not, then the design needs to be changed. If the design +does not already exist, then the designers at this point can try to eliminate or control +the behaviors as the design is created, that is, use safety-guided design as described +in the next chapter. +Why is the second step needed? While providing the engineers with the safety +constraints to be enforced is necessary, it is not sufficient. Consider the chemical +batch reactor described in section 2.1. The hazard is overheating of the reactor +contents. At the system level, the engineers may decide (as in this design) to use +water and a reflux condenser to control the temperature. After this decision is made, +controls need to be enforced on the valves controlling the flow of catalyst and water. +Applying step 1 of STPA determines that opening the valves out of sequence is +dangerous, and the software requirements would accordingly be augmented with +constraints on the order of the valve opening and closing instructions, namely that +the water valve must be opened before the catalyst valve and the catalyst valve must +be closed before the water valve is closed or, more generally, that the water valve + + +must always be open when the catalyst valve is opened. If the software already exists, +the hazard analysis would ensure that this ordering of commands has been enforced +in the software. Clearly, building the software to enforce this ordering is a great deal +easier than proving the ordering is true after the software already exists. +But enforcing these safety constraints is not enough to ensure safe software +behavior. Suppose the software has commanded the water valve to open but some- +thing goes wrong and the valve does not actually open or it opens but water flow +is restricted in some way (the no flow guideword in HAZOP). Feedback is needed +for the software to determine if water is flowing through the pipes and the software +needs to check this feedback before opening the catalyst valve. 
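As a rough sketch of what enforcing this ordering might look like in the control software, the fragment below checks the commanded valve states before issuing a new command. The class and method names are assumptions made for illustration; they are not part of the actual reactor design discussed here.

class UnsafeCommandError(Exception):
    """Raised when a command would violate a safety constraint."""

class ValveController:
    def __init__(self):
        # Process model kept by the controller: the believed valve states.
        self.water_open = False
        self.catalyst_open = False

    def open_water(self):
        self.water_open = True  # command the water valve open

    def open_catalyst(self):
        # Safety constraint: never open the catalyst valve unless the
        # water valve is (believed to be) open.
        if not self.water_open:
            raise UnsafeCommandError("catalyst valve requested while water valve closed")
        self.catalyst_open = True

    def close_water(self):
        # Safety constraint: never close the water valve while the
        # catalyst valve is still open.
        if self.catalyst_open:
            raise UnsafeCommandError("water valve close requested while catalyst valve open")
        self.water_open = False

    def close_catalyst(self):
        self.catalyst_open = False

controller = ValveController()
controller.open_water()
controller.open_catalyst()   # allowed: the water valve was commanded open first
# controller.close_water()   # would raise UnsafeCommandError while the catalyst valve is open

Note that this sketch checks only the controller's own process model of the valve states, which is exactly the limitation discussed next.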
The second step of +STPA is used to identify the ways that the software safety constraint, even if pro- +vided to the software engineers, might still not be enforced by the software logic +and system design. In essence, step 2 identifies the scenarios or paths to a hazard +found in a classic hazard analysis. This step is the usual “magic” one that creates the +contents of a fault tree, for example. The difference is that guidance is provided to +help create the scenarios and more than just failures are considered. +To create causal scenarios, the control structure diagram must include the process +models for each component. If the system exists, then the content of these models +should be easily determined by looking at the system functional design and its docu- +mentation. If the system does not yet exist, the analysis can start with a best guess +and then be refined and changed as the analysis process proceeds. +For the high power interlock example, the process model is simple and shown in +figure 8.5. The general causal factors, shown in figure 4.8 and repeated here in figure +8.6 for convenience, are used to identify the scenarios. + +section 8.4.1. Identifying Causal Scenarios. +Starting with each hazardous control action identified in step 1, the analysis in step +2 involves identifying how it could happen. To gather information about how the +hazard could occur, the parts of the control loop for each of the hazardous control +actions identified in step 1 are examined to determine if they could cause or con- +tribute to it. Once the potential causes are identified, the engineers can design +controls and mitigation measures if they do not already exist or evaluate existing +measures if the analysis is being performed on an existing design. +Each potentially hazardous control action must be considered. As an example, +consider the unsafe control action of not turning off the power when the door is +opened. Figure 8.7 shows the results of the causal analysis in a graphical form. Other +ways of documenting the results are, of course, possible. +The hazard in figure 8.7 is that the door is open but the power is not turned off. +Looking first at the controller itself, the hazard could result if the requirement is +not passed to the developers of the controller, the requirement is not implemented + + +correctly, or the process model incorrectly shows the door closed and/or the power +off when that is not true. Working around the loop, the causal factors for each of +the loop components are similarly identified using the general causal factors shown +in figure 8.6. These causes include that the power off command is sent but not +received by the actuator, the actuator received the command but does not imple- +ment it (actuator failure), the actuator delays in implementing the command, the +power on and power off commands are received or executed in the wrong order, +the door open event is not detected by the door sensor or there is an unacceptable +delay in detecting it, the sensor fails or provides spurious feedback, and the feedback +about the state of the door or the power is not received by the controller or is not +incorporated correctly into the process model. +More detailed causal analysis can be performed if a specific design is being con- +sidered. For example, the features of the communication channels used will deter- +mine the potential way that commands or feedback could be lost or delayed. 
+Once the causal analysis is completed, each of the causes that cannot be shown +to be physically impossible must be checked to determine whether they are + + +adequately handled in the design (if the design exists) or design features added to +control them if the design is being developed with support from the analysis. +The first step in designing for safety is to try to eliminate the hazard completely. +In this example, the hazard can be eliminated by redesigning the system to have the +circuit run through the door in such a way that the circuit is broken as soon as the +door opens. Let’s assume, however, that for some reason this design alternative is +rejected, perhaps as impractical. Design precedence then suggests that the next best +alternatives in order are to reduce the likelihood of the hazard occurring, to prevent +the hazard from leading to a loss, and finally to minimize damage. More about safe +design can be found in chapters 16 and 17 of Safeware and chapter 9 of this book. +Because design almost always involves tradeoffs with respect to achieving mul- +tiple objectives, the designers may have good reasons not to select the most effective +way to control the hazard but one of the other alternatives instead. It is important +that the rationale behind the choice is documented for future analysis, certification, +reuse, maintenance, upgrades, and other activities. +For this simple example, one way to mitigate many of the causes is to add a light +that identifies whether the power supply is on or off. How do human operators know +that the power has been turned off before inserting their hands into the high-energy + + +power source? In the original design, they will most likely assume it is off because +they have opened the door, which may be an incorrect assumption. Additional +feedback and assurance can be attained from the light. In fact, protection systems +in automated factories commonly are designed to provide humans in the vicinity +with aural or visual information that they have been detected by the protection +system. Of course, once a change has been made, such as adding a light, that change +must then be analyzed for new hazards or causal scenarios. For example, a light bulb +can burn out. The design might ensure that the safe state (the power is off) is rep- +resented by the light being on rather than the light being off, or two colors might +be used. Every solution for a safety problem usually has its own drawbacks and +limitations and therefore they will need to be compared and decisions made about +the best design given the particular situation involved. +In addition to the factors shown in figure 8.6, the analysis must consider the +impact of having two controllers of the same component whenever this occurs in +the system safety control structure. In the friendly fire example in chapter 5, for +example, confusion existed between the two AWACS operators responsible for +tracking aircraft inside and outside of the no-fly-zone about who was responsible +for aircraft in the boundary area between the two. The FMIS example below con- +tains such a scenario. An analysis must be made to determine that no path to a +hazard exists because of coordination problems. +The FMIS system provides a more complex example of STPA step 2. Consider +the fire enable command provided by fire control to the launch station. 
In step 1, +it was determined that if this command is provided incorrectly, the launch station +will transition to a state where it accepts interceptor tasking and can progress +through a launch sequence. In combination with other incorrect or mistimed control +actions, this incorrect command could contribute to an inadvertent launch. +The following are two examples of causal factors identified using STPA step 2 as +potentially leading to the hazardous state (violation of the safety constraint). Neither +of these examples involves component failures, but both instead result from unsafe +component interactions and other more complex causes that are for the most part +not identifiable by current hazard analysis methods. +In the first example, the fire enable command can be sent inadvertently due to +a missing case in the requirements—a common occurrence in accidents where soft- +ware is involved. +The fire enable command is sent when the fire control receives a weapons free +command from the operators and the fire control system has at least one active +track. An active track indicates that the radars have detected something that might +be an incoming missile. Three criteria are specified for declaring a track inactive: +(1) a given period passes with no radar input, (2) the total predicted impact time +elapses for the track, and (3) an intercept is confirmed. Operators are allowed to + + +deselect any of these options. One case was not considered by the designers: if an +operator deselects all of the options, no tracks will be marked as inactive. Under +these conditions, the inadvertent entry of a weapons free command would send the +fire enable command to the launch station immediately, even if there were no +threats currently being tracked by the system. +Once this potential cause is identified, the solution is obvious—fix the software +requirements and the software design to include the missing case. While the opera- +tor might instead be warned not to deselect all the options, this kind of human error +is possible and the software should be able to handle the error safely. Depending +on humans not to make mistakes is an almost certain way to guarantee that acci- +dents will happen. +The second example involves confusion between the regular and the test soft- +ware. The FMIS undergoes periodic system operability testing using an interceptor +simulator that mimics the interceptor flight computer. The original hazard analysis +had identified the possibility that commands intended for test activities could be +sent to the operational system. As a result, the system status information provided +by the launch station includes whether the launch station is connected only to +missile simulators or to any live interceptors. If the fire control computer detects a +change in this state, it will warn the operator and offer to reset into a matching state. +There is, however, a small window of time before the launch station notifies the fire +control component of the change. During this time interval, the fire control software +could send a fire enable command intended for test to the live launch station. This +latter example is a coordination problem arising because there are multiple control- +lers of the launch station and two operating modes (e.g., testing and live fire). A +potential mode confusion problem exists where the launch station can think it is in +one mode but really be in the other one. Several different design changes could be +used to prevent this hazardous state. 
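A small sketch can make the first example concrete. The code below is purely illustrative; the names Track, RADAR_TIMEOUT, and handle_weapons_free are invented for this sketch and do not come from the real FMIS. It shows how deselecting every inactive-track criterion leaves all tracks permanently active, so a stray weapons free entry immediately produces a fire enable, and how making the empty selection an explicit case closes the gap.

from dataclasses import dataclass

RADAR_TIMEOUT = 10.0  # assumed: seconds without radar input before a track may go inactive

@dataclass
class Track:
    seconds_since_radar_input: float
    impact_time_elapsed: bool
    intercept_confirmed: bool

ALL_CRITERIA = ("no_radar_input", "impact_time_elapsed", "intercept_confirmed")

def track_is_inactive(track, selected_criteria):
    checks = {
        "no_radar_input": track.seconds_since_radar_input > RADAR_TIMEOUT,
        "impact_time_elapsed": track.impact_time_elapsed,
        "intercept_confirmed": track.intercept_confirmed,
    }
    # Missing case: if the operator has deselected every criterion,
    # selected_criteria is empty, any() returns False, and no track is
    # ever marked inactive, so stale tracks keep the system "armed".
    return any(checks[name] for name in selected_criteria)

def handle_weapons_free(tracks, selected_criteria):
    active = [t for t in tracks if not track_is_inactive(t, selected_criteria)]
    if active:
        print("fire enable sent to launch station")   # stands in for the real command
    else:
        print("no active tracks; fire enable withheld")

def track_is_inactive_fixed(track, selected_criteria):
    # One way to close the gap: treat "no criteria selected" as an explicit
    # case the software handles safely instead of relying on the operator.
    criteria = selected_criteria or ALL_CRITERIA
    checks = {
        "no_radar_input": track.seconds_since_radar_input > RADAR_TIMEOUT,
        "impact_time_elapsed": track.impact_time_elapsed,
        "intercept_confirmed": track.intercept_confirmed,
    }
    return any(checks[name] for name in criteria)

# A stale track that should long since have been declared inactive:
stale = Track(seconds_since_radar_input=999.0, impact_time_elapsed=True, intercept_confirmed=True)
handle_weapons_free([stale], selected_criteria=[])  # flaw: fire enable sent with no real threat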
+In the use of STPA on the real missile defense system, the risks involved in inte- +grating separately developed components into a larger system were assessed, and +several previously unknown scenarios for inadvertent launch were identified. Those +conducting the assessment concluded that the STPA analysis and supporting data +provided management with a sound basis on which to make risk acceptance deci- +sions [154]. The assessment results were used to plan mitigations for open safety +risks deemed necessary to change before deployment and field-testing of the system. +As system changes are proposed, they are assessed by updating the control structure +diagrams and assessment analysis results. + +section 8.4.2. Considering the Degradation of Controls over Time. +A final step in STPA is to consider how the designed controls could degrade over +time and to build in protection against it. The mechanisms for the degradation could + +be identified and mitigated in the design: for example, if corrosion is identified as a +potential cause, a stronger or less corrosive material might be used. Protection might +also include planned performance audits where the assumptions underlying the +hazard analysis are the preconditions for the operational audits and controls. For +example, an assumption for the interlock system with a light added to warn the +operators is that the light is operational and operators will use it to determine +whether it is safe to open the door. Performance audits might check to validate that +the operators know the purpose of the light and the importance of not opening the +door while the warning light is on. Over time, operators might create workarounds +to bypass this feature if it slows them up too much in their work or if they do not +understand the purpose, the light might be partially blocked from view because of +workplace changes, and so on. The assumptions and required audits should be iden- +tified during the system design process and then passed to the operations team. +Along with performance audits, management of change procedures need to be +developed and the STPA analysis revisited whenever a planned change is made in +the system design. Many accidents occur after changes have been made in the +system. If appropriate documentation is maintained along with the rationale for the +control strategy selected, this reanalysis should not be overly burdensome. How to +accomplish this goal is discussed in chapter 10. +Finally, after accidents and incidents, the design and the hazard analysis should +be revisited to determine why the controls were not effective. The hazard of foam +damaging the thermal surfaces of the Space Shuttle had been identified during +design, for example, but over the years before the Columbia loss the process for +updating the hazard analysis after anomalies occurred in flight was eliminated. The +Space Shuttle standard for hazard analyses (NSTS 22254, Methodology for Conduct +of Space Shuttle Program Hazard Analyses) specified that hazards be revisited only +when there was a new design or the design was changed: There was no process for +updating the hazard analyses when anomalies occurred or even for determining +whether an anomaly was related to a known hazard [117]. +Chapter 12 provides more information about the use of the STPA results during +operations. + +section 8.5. Human Controllers. 
+Humans in the system can be treated in the same way as automated components in +step 1 of STPA, as was seen in the interlock system above where a person controlled +the position of the door. The causal analysis and detailed scenario generation for +human controllers, however, is much more complex than that of electromechanical +devices and even software, where at least the algorithm is known and can be evalu- +ated. Even if operators are given a procedure to follow, for reasons discussed in + +chapter 2, it is very likely that the operator may feel the need to change the proce- +dure over time. +The first major difference between human and automated controllers is that +humans need an additional process model. All controllers need a model of the +process they are controlling directly, but human controllers also need a model of +any process, such as an oil refinery or an aircraft, they are indirectly controlling +through an automated controller. If the human is being asked to supervise the +automated controller or to monitor it for wrong or dangerous behavior then he +or she needs to have information about the state of both the automated controller +and the controlled process. Figure 8.8 illustrates this requirement. The need for +an additional process model explains why supervising an automated system +requires extra training and skill. A wrong assumption is sometimes made that if the + +human is supervising a computer, training requirements are reduced but this +belief is untrue. Human skill levels and required knowledge almost always go up in +this situation. +Figure 8.8 includes dotted lines to indicate that the human controller may need +direct access to the process actuators if the human is to act as a backup to the +automated controller. In addition, if the human is to monitor the automation, he +or she will need direct input from the sensors to detect when the automation is +confused and is providing incorrect information as feedback about the state of the +controlled process. +The system design, training, and operational procedures must support accurate +creation and updating of the extra process model required by the human supervisor. +More generally, when a human is supervising an automated controller, there are +extra analysis and design requirements. For example, the control algorithm used by +the automation must be learnable and understandable. Inconsistent behavior or +unnecessary complexity in the automation function can lead to increased human +error. Additional design requirements are discussed in the next chapter. +With respect to STPA, the extra process model and complexity in the system +design requires additional causal analysis when performing step 2 to determine the +ways that both process models can become inaccurate. +The second important difference between human and automated controllers is +that, as noted by Thomas [199], while automated systems have basically static control +algorithms (although they may be updated periodically), humans employ dynamic +control algorithms that they change as a result of feedback and changes in goals. +Human error is best modeled and understood using feedback loops, not as a chain +of directly related events or errors as found in traditional accident causality models. +Less successful actions are a natural part of the search by operators for optimal +performance [164]. +Consider again figure 2.9. Operators are often provided with procedures to follow +by designers. 
But designers are dealing with their own models of the controlled +process, which may not reflect the actual process as constructed and changed over +time. Human controllers must deal with the system as it exists. They update their +process models using feedback, just as in any control loop. Sometimes humans use +experimentation to understand the behavior of the controlled system and its current +state and use that information to change their control algorithm. For example, after +picking up a rental car, drivers may try the brakes and the steering system to get a +feel for how they work before driving on a highway. +If human controllers suspect a failure has occurred in a controlled process, they +may experiment to try to diagnose it and determine a proper response. Humans +also use experimentation to determine how to optimize system performance. The +driver’s control algorithm may change over time as the driver learns more about + + +the automated system and learns how to optimize the car’s behavior. Driver goals +and motivation may also change over time. In contrast, automated controllers by +necessity must be designed with a single set of requirements based on the designer’s +model of the controlled process and its environment. +Thomas provides an example [199] using cruise control. Designers of an auto- +mated cruise control system may choose a control algorithm based on their model +of the vehicle (such as weight, engine power, response time), the general design of +roadways and vehicle traffic, and basic engineering design principles for propulsion +and braking systems. A simple control algorithm might control the throttle in pro- +portion to the difference between current speed (monitored through feedback) and +desired speed (the goal). +Like the automotive cruise control designer, the human driver also has a process +model of the car’s propulsion system, although perhaps simpler than that of the +automotive control expert, including the approximate rate of car acceleration for +each accelerator position. This model allows the driver to construct an appropriate +control algorithm for the current road conditions (slippery with ice or clear and dry) +and for a given goal (obeying the speed limit or arriving at the destination at a +required time). Unlike the static control algorithm designed into the automated +cruise control, the human driver may dynamically change his or her control algo- +rithm over time based on changes in the car’s performance, in goals and motivation, +or driving experience. +The differences between automated and human controllers lead to different +requirements for hazard analysis and system design. Simply identifying human +“failures” or errors is not enough to design safer systems. Hazard analysis must +identify the specific human behaviors that can lead to the hazard. In some cases, it +may be possible to identify why the behaviors occur. In either case, we are not able +to “redesign” humans. Training can be helpful, but not nearly enough—training can +do only so much in avoiding human error even when operators are highly trained +and skilled. In many cases, training is impractical or minimal, such as automobile +drivers. The only real solution lies in taking the information obtained in the hazard +analysis about worst-case human behavior and using it in the design of the other +system components and the system as a whole to eliminate, reduce, or compensate +for that behavior. 
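+
+To make the cruise-control contrast above concrete, a static automated control
+algorithm of the kind described might look like the sketch below. The proportional
+gain, the clamping range, and the crude vehicle response are invented for
+illustration and are not taken from any actual cruise control design.
+
+    # Minimal sketch of a fixed, designer-chosen control algorithm.
+    def throttle_command(desired_speed: float, measured_speed: float,
+                         gain: float = 0.05) -> float:
+        # Throttle is proportional to the speed error, clamped to [0, 1].
+        error = desired_speed - measured_speed
+        return max(0.0, min(1.0, gain * error))
+
+    speed = 20.0                       # m/s, from the speed sensor (feedback)
+    for _ in range(5):                 # a few control cycles
+        u = throttle_command(desired_speed=30.0, measured_speed=speed)
+        speed += 2.0 * u               # crude stand-in for vehicle response
+        print(f"throttle={u:.2f} speed={speed:.1f}")
+
+The gain and goal here are fixed at design time; that static character is what the
+hazard analysis can rely on for the automated controller but not for the human one.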
Chapter 9 discusses why we need human operators in systems and +how to design to eliminate or reduce human errors. +STPA as currently defined provides much more useful information about the +cause of human errors than traditional hazard analysis methods, but augmenting +STPA could provide more information for designers. Stringfellow has suggested +some additions to STPA for human controllers [195]. In general, engineers need +better tools for including humans in hazard analyses in order to cope with the unique +aspects of human control. + + +section 8.6. Using STPA on Organizational Components of the Safety Control Structure. +The examples above focus on the lower levels of safety control structures, but STPA +can also be used on the organizational and management components. Less experi- +mentation has been done on applying it at these levels, and, once again, more needs +to be done. +Two examples are used in this section: one was a demonstration for NASA of +risk analysis using STPA on a new management structure proposed after the Colum- +bia accident. The second is pharmaceutical safety. The fundamental activities of +identifying system hazards, safety requirements and constraints, and of documenting +the safety control structure were described for these two examples in chapter 7. +This section starts from that point and illustrates the actual risk analysis process. + +section 8.6.1. Programmatic and Organizational Risk Analysis. +The Columbia Accident Investigation Board (CAIB) found that one of the causes +of the Columbia loss was the lack of independence of the safety program from the +Space Shuttle program manager. The CAIB report recommended that NASA insti- +tute an Independent Technical Authority (ITA) function similar to that used in +SUBSAFE (see chapter 14), and individuals with SUBSAFE experience were +recruited to help design and implement the new NASA Space Shuttle program +organizational structure. After the program was designed and implementation +started, a risk analysis of the program was performed to assist in a planned review +of the program’s effectiveness. A classic programmatic risk analysis, which used +experts to identify the risks in the program, was performed. In parallel, a group at +MIT developed a process to use STAMP as a foundation for the same type of pro- +grammatic risk analysis to understand the risks and vulnerabilities of this new +organizational structure and recommend improvements [125].3 This section describes +the STAMP-based process and results as an example of what can be done for other +systems and other emergent properties. Laracy [108] used a similar process to +examine transportation system security, for example. +The STAMP-based analysis rested on the basic STAMP concept that most major +accidents do not result simply from a unique set of proximal, physical events but +from the migration of the organization to a state of heightened risk over time as +safeguards and controls are relaxed due to conflicting goals and tradeoffs. In such +a high-risk state, events are bound to occur that will trigger an accident. In both the +Challenger and Columbia losses, organizational risk had been increasing to unac- +ceptable levels for quite some time as behavior and decision-making evolved in + +response to a variety of internal and external performance pressures. Because risk +increased slowly, nobody noticed, that is, the boiled frog phenomenon. 
In fact, con- +fidence and complacency were increasing at the same time as risk due to the lack +of accidents. +The goal of the STAMP-based analysis was to apply a classic system safety +engineering process to the analysis and redesign of this organizational structure. +Figure 8.9 shows the basic process used, which started with a preliminary hazard +analysis to identify the system hazards and the safety requirements and constraints. +In the second step, a STAMP model of the ITA safety control structure was created +(as designed by NASA; see figure 7.4) and a gap analysis was performed to map the +identified safety requirements and constraints to the assigned responsibilities in the +safety control structure and identify any gaps. A detailed hazard analysis using STPA +was then performed to identify the system risks and to generate recommendations +for improving the designed new safety control structure and for monitoring the +implementation and long-term health of the new program. Only enough of the +modeling and analysis is included here to allow the reader to understand the process. +The complete modeling and analysis effort is documented elsewhere [125]. +The hazard identification, system safety requirements, and safety control struc- +ture for this example are described in section 7.4.1, so the example starts from this +basic information. + + +footnote. Many people contributed to the analysis described in this section, including Nicolas Dulac, Betty +Barrett, Joel Cutcher-Gershenfeld, John Carroll, and Stephen Friedenthal. + + +section 8.6.2. Gap Analysis. +In analyzing an existing organizational or social safety control structure, one of the +first steps is to determine where the responsibility for implementing each require- +ment rests and to perform a gap analysis to identify holes in the current design, that +is, requirements that are not being implemented (enforced) anywhere. Then the +safety control structure needs to be evaluated to determine whether it is potentially +effective in enforcing the system safety requirements and constraints. +A mapping was made between the system-level safety requirements and con- +straints and the individual responsibilities of each component in the NASA safety +control structure to see where and how requirements are enforced. The ITA program +was at the time being carefully defined and documented. In other situations, where +such documentation may be lacking, interview or other techniques may need to be +used to elicit how the organizational control structure actually works. In the end, +complete documentation should exist in order to maintain and operate the system +safely. While most organizations have job descriptions for each employee, the safety- +related responsibilities are not necessarily separated out or identified, which can +lead to unidentified gaps or overlaps. +As an example, in the ITA structure the responsibility for the system-level safety +requirement: + +1a. State-of-the art safety standards and requirements for NASA missions must +be established, implemented, enforced, and maintained that protect the astro- +nauts, the workforce, and the public +was assigned to the NASA Chief Engineer but the Discipline Technical Warrant +Holders, the Discipline Trusted Agents, the NASA Technical Standards Program, +and the headquarters Office of Safety and Mission Assurance also play a role in +implementing this Chief Engineer responsibility. 
More specifically, system require- +ment 1a was implemented in the control structure by the following responsibility +assignments: +•Chief Engineer: Develop, monitor, and maintain technical standards and +policy. +•Discipline Technical Warrant Holders: +1.– Recommend priorities for development and updating of technical +standards. +2.– Approve all new or updated NASA Preferred Standards within their assigned +discipline (the NASA Chief Engineer retains Agency approval) +3.– Participate in (lead) development, adoption, and maintenance of NASA +Preferred Technical Standards in the warranted discipline. +4.– Participate as members of technical standards working groups. +•Discipline Trusted Agents: Represent the Discipline Technical Warrant +Holders on technical standards committees +•NASA Technical Standards Program: Coordinate with Technical Warrant +Holders when creating or updating standards +•NASA Headquarters Office Safety and Mission Assurance: +1.– Develop and improve generic safety, reliability, and quality process standards +and requirements, including FMEA, risk, and the hazard analysis process. +2.– Ensure that safety and mission assurance policies and procedures are ade- +quate and properly documented. +Once the mapping is complete, a gap analysis can be performed to ensure that each +system safety requirement and constraint is embedded in the organizational design +and to find holes or weaknesses in the design. In this analysis, concerns surfaced, +particularly about requirements not reflected in the defined ITA organizational +structure. +As an example, one omission detected was appeals channels for complaints +and concerns about the components of the ITA structure itself that may not +function appropriately. All channels for expressing what NASA calls “technical +conscience” go through the warrant holders, but there was no defined way to express + + +concerns about the warrant holders themselves or about aspects of ITA that are not +working well. +A second example was the omission in the documentation of the ITA implemen- +tation plans of the person(s) who was to be responsible to see that engineers and +managers are trained to use the results of hazard analyses in their decision making. +More generally, a distributed and ill-defined responsibility for the hazard analysis +process made it difficult to determine responsibility for ensuring that adequate +resources are applied; that hazard analyses are elaborated (refined and extended) +and updated as the design evolves and test experience is acquired; that hazard logs +are maintained and used as experience is acquired; and that all anomalies are evalu- +ated for their hazard potential. Before ITA, many of these responsibilities were +assigned to each Center’s Safety and Mission Assurance Office, but with much of +this process moving to engineering (which is where it should be) under the new ITA +structure, clear responsibilities for these functions need to be specified. One of the +basic causes of accidents in STAMP is multiple controllers with poorly defined or +overlapping responsibilities. +A final example involved the ITA program assessment process. An assessment +of how well ITA is working is part of the plan and is an assigned responsibility of +the chief engineer. The official risk assessment of the ITA program performed in +parallel with the STAMP-based one was an implementation of that chief engineer’s +responsibility and was planned to be performed periodically. 
We recommended the +addition of specific organizational structures and processes for implementing a +continual learning and improvement process and making adjustments to the design +of ITA itself when necessary outside of the periodic review. + +section 8.6.3. Hazard Analysis to Identify Organizational and Programmatic Risks. +A risk analysis to identify ITA programmatic risks and to evaluate these risks peri- +odically had been specified as one of the chief engineer’s responsibilities. To accom- +plish this goal, NASA identified the programmatic risks using a classic process using +experts in risk analysis interviewing stakeholders and holding meetings where risks +were identified and discussed. The STAMP-based analysis used a more formal, +structured approach. +Risks in STAMP terms can be divided into two types: (1) basic inadequacies in +the way individual components in the control structure fulfill their responsibilities +and (2) risks involved in the coordination of activities and decision making that can +lead to unintended interactions and consequences. +Basic Risks +Applying the four types of inadequate control identified in STPA and interpreted +for the hazard, which in this case is unsafe decision-making leading to an accident, +ITA has four general types of risks: + + +1. Unsafe decisions are made or approved by the chief engineer or warrant +holders. +2. Safe decisions are disallowed (e.g., overly conservative decision making that +undermines the goals of NASA and long-term support for ITA). +3. Decision making takes too long, minimizing impact and also reducing support +for the ITA. +4. Good decisions are made by the ITA, but do not have adequate impact on +system design, construction, and operation. +The specific potentially unsafe control actions by those in the ITA safety control +structure that could lead to these general risks are the ITA programmatic risks. Once +identified, they must be eliminated or controlled just like any unsafe control actions. +Using the responsibilities and control actions defined for the components of the +safety control structure, the STAMP-based risk analysis applied the four general +types of inadequate control actions, omitting those that did not make sense for the +particular responsibility or did not impact risk. To accomplish this, the general +responsibilities must be refined into more specific control actions. +As an example, the chief engineer is responsible as the ITA for the technical +standards and system requirements and all changes, variances, and waivers to the +requirements, as noted earlier. The control actions the chief engineer has available +to implement this responsibility are: +1.• To develop, monitor, and maintain technical standards and policy. +2.•In coordination with programs and projects, to establish or approve the techni- +cal requirements and ensure they are enforced and implemented in the pro- +grams and projects (ensure the design is compliant with the requirements). +3.• To approve all changes to the initial technical requirements. +4.• To approve all variances (waivers, deviations, exceptions to the requirements. +5.•Etc. +Taking just one of these, the control responsibility to develop, monitor, and maintain +technical standards and policy, the risks (potentially inadequate or unsafe control +actions) identified using STPA step 1 include: +1. General technical and safety standards are not created. +2. Inadequate standards and requirements are created. +3. 
Standards degrade over time due to external pressures to weaken them. The +process for approving changes is flawed. +4. Standards are not changed over time as the environment changes. + +As another example, the chief engineer cannot perform all these duties himself, so +he has a network of people below him in the hierarchy to whom he delegates or +“warrants” some of the responsibilities. The chief engineer retains responsibility for +ensuring that the warrant holders perform their duties adequately as in any hierar- +chical management structure. +The chief engineer responsibility to approve all variances and waivers to technical +requirements is assigned to the System Technical Warrant Holder (STWH). The +risks or potentially unsafe control actions of the STWH with respect to this respon- +sibility are: +1.• An unsafe engineering variance or waiver is approved. +2.•Designs are approved without determining conformance with safety require- +ments. Waivers become routine. +3.•Reviews and approvals take so long that ITA becomes a bottleneck. Mission +achievement is threatened. Engineers start to ignore the need for approvals +and work around the STWH in other ways. +Although a long list of risks was identified in this experimental application of STPA +to a management structure, many of the risks for different participants in the ITA +process were closely related. The risks listed for each participant are related to his +or her particular role and responsibilities and therefore those with related roles or +responsibilities will generate related risks. The relationships were made clear in the +earlier step tracing from system requirements to the roles and responsibilities for +each of the components of the ITA. + +Coordination Risks. +Coordination risks arise when multiple people or groups control the same process. +The types of unsafe interactions that may result include: (1) both controllers +assume that the other is performing the control responsibilities, and as a result +nobody does, or (2) controllers provide conflicting control actions that have unin- +tended side effects. +Potential coordination risks are identified by the mapping from the system +requirements to the component requirements used in the gap analysis described +earlier. When similar responsibilities related to the same system requirement are +identified, the potential for new coordination risks needs to be considered. +As an example, the original ITA design documentation was ambiguous about +who had the responsibility for performing many of the safety engineering func- +tions. Safety engineering had previously been the responsibility of the Center +Safety and Mission Assurance Offices but the plan envisioned that these functions +would shift to the ITA in the new organization leading to several obvious +risks. + + +Another example involves the transition of responsibility for the production of +standards to the ITA from the NASA Headquarters Office of Safety and Mission +Assurance (OSMA). In the plan, some of the technical standards responsibilities +were retained by OSMA, such as the technical design standards for human rating +spacecraft and for conducting hazard analyses, while others were shifted to the ITA +without a clear demarcation of who was responsible for what. At the same time, +responsibilities for the assurance that the plans are followed, which seems to logi- +cally belong to the mission assurance group, were not cleanly divided. 
Both overlaps +raised the potential for some functions not being accomplished or conflicting stan- +dards being produced. + +section 8.6.4. Use of the Analysis and Potential Extensions. +While risk mitigation and control measures could be generated from the list of risks +themselves, the application of step 2 of STPA to identify causes of the risks will help +to provide better control measures in the same way STPA step 2 plays a similar role +in physical systems. Taking the responsibility of the System Technical Warrant +Holder to approve all variances and waivers to technical requirements in the +example above, potential causes for approving an unsafe engineering variance or +waiver include: inadequate or incorrect information about the safety of the action, +inadequate training, bowing to pressure about programmatic concerns, lack of +support from management, inadequate time or resources to evaluate the requested +variance properly, and so on. These causal factors were generated using the generic +factors in figure 8.6 but defined in a more appropriate way. Stringfellow has exam- +ined in more depth how STPA can be applied to organizational factors [195]. +The analysis can be used to identify potential changes to the safety control struc- +ture (the ITA program) that could eliminate or mitigate identified risks. General +design principles for safety are described in the next chapter. +A goal of the NASA risk analysis was to determine what to include in a planned +special assessment of the ITA early in its existence. To accomplish the same goal, +the MIT group categorized their identified risks as (1) immediate, (2) long-term, or +(3) controllable by standard ongoing processes. These categories were defined in +the following way: +Immediate concern: An immediate and substantial concern that should be part +of a near-term assessment. +Longer-term concern: A substantial longer-term concern that should potentially +be part of future assessments; as the risk will increase over time or cannot be +evaluated without future knowledge of the system or environment behavior. +Standard process: An important concern that should be addressed through +standard processes, such as inspections, rather than an extensive special assess- +ment procedure. + +This categorization allowed identifying a manageable subset of risks to be part of the +planned near-term risk assessment and those that could wait for future assessments +or could be controlled by on-going procedures. For example, it is important to assess +immediately the degree of “buy-in” to the ITA program. Without such support, ITA +cannot be sustained and the risk of dangerous decision making is very high. On the +other hand, the ability to find appropriate successors to the current warrant holders +is a longer-term concern identified in the STAMP-based risk analysis that would be +difficult to assess early in the existence of the new ITA control structure. The perfor- +mance of the current technical warrant holders, for example, is one factor that will +have an impact on whether the most qualified people will want the job in the future. + +section 8.6.5. Comparisons with Traditional Programmatic Risk Analysis Techniques. +The traditional risk analysis performed by NASA on ITA identified about one +hundred risks. The more rigorous, structured STAMP-based analysis—done inde- +pendently and without any knowledge of the results of the NASA process— +identified about 250 risks, all the risks identified by NASA plus additional ones. 
A +small part of the difference was related to the consideration by the STAMP group +of more components in the safety control structure, such as the NASA administrator, +Congress, and the Executive Branch (White House). There is no way to determine +whether the other additional risks identified by the STAMP-based process were +simply missed in the NASA analysis or were discarded for some reason. +The NASA analysis did not include a causal analysis of the risks and thus no +comparison is possible. Their goal was to determine what should be included in the +upcoming ITA risk assessment process and thus was narrower than the STAMP +demonstration risk analysis effort. + +section 8.7. Reengineering a Sociotechnical System: Pharmaceutical Safety and the Vioxx +Tragedy. +The previous section describes the use of STPA on the management structure of an +organization that develops and operates high-tech systems. STPA and other types +of analysis are potentially also applicable to social systems. This section provides an +example using pharmaceutical safety. +Couturier has performed a STAMP-based causal analysis of the incidents associ- +ated with the introduction and withdrawal of Vioxx [43]. Once the causes of such +losses are determined, changes need to be made to prevent a recurrence. Many sug- +gestions for changes as a result of the Vioxx losses +have been proposed. After the Vioxx recall, three main reports were written by the +Government Accountability Office (GAO) [73], the Institute of Medicine (IOM) +[16], and one commissioned by Merck. The publication of these reports led to two +waves of changes, the first initiated within the FDA and the second by Congress in + + +the form of a new set of rules called FDAAA (FDA Amendments Act). Couturier +[43, 44], with inputs from others,4 used the Vioxx events to demonstrate how these +proposed and implemented policy and structural changes could be analyzed to +predict their potential effectiveness using STAMP. + +footnote. Many people provided input to the analysis described in this section, including Stan Finkelstein, John +Thomas, John Carroll, Margaret Stringfellow, Meghan Dierks, Bruce Psaty, David Wierz, and various +other reviewers. + +section 8.7.1. The Events Surrounding the Approval and Withdrawal of Vioxx. +Vioxx (Rofecoxib) is a prescription COX-2 inhibitor manufactured by Merck. It was +approved by the Food and Drug Administration (FDA) in May 1999 and was widely +used for pain management, primarily from osteoarthritis. Vioxx was one of the major +sources of revenue for Merck while on the market: It was marketed in more than +eighty countries with worldwide sales totaling $2.5 billion in 2003. +In September 2004, Merck voluntarily withdrew the drug from the market +because of safety concerns: The drug was suspected to increase the risk of cardio- +vascular events (heart attacks and stroke) for the patients taking it long term at high +dosages. Vioxx was one of the most widely used drugs ever to be withdrawn from +the market. According to an epidemiological study done by Graham, an FDA sci- +entist, Vioxx has been associated with more than 27,000 heart attacks or deaths and +may be the “single greatest drug safety catastrophe in the history of this country or +the history of the world” [76]. +The important question to be considered is how did such a dangerous drug get +on the market and stay there so long despite warnings of problems and how can +this type of loss be avoided in the future. 
+
The major events that occurred in this saga start with the discovery of the Vioxx
+molecule in 1994. Merck sought FDA approval in November 1998.
+In May 1999 the FDA approved Vioxx for the relief of osteoarthritis symptoms
+and management of acute pain. Nobody had suggested that the COX-2 inhibitors
+are more effective than the classic NSAIDS in relieving pain, but their selling point
+had been that they were less likely to cause bleeding and other digestive tract com-
+plications. The FDA was not convinced and required that the drug carry a warning
+on its label about possible digestive problems. By December, Vioxx had more than
+40 percent of the new prescriptions in its class.
+In order to validate their claims about Rofecoxib having fewer digestive system
+complications, Merck launched studies to prove their drugs should not be lumped
+with other NSAIDS. The studies backfired.
+In January 1999, before Vioxx was approved, Merck started a trial called VIGOR
+(Vioxx Gastrointestinal Outcomes Research) to compare the efficacy and adverse
+
+
+effects of Rofecoxib and Naproxen, an older nonsteroidal anti-inflammatory drug
+or NSAID. In March 2000, Merck announced that the VIGOR trial had shown that
+Vioxx was safer on the digestive tract than Naproxen, but it doubled the risk of
+cardiovascular problems. Merck argued that the increased risk resulted not because
+Vioxx caused the cardiovascular problems but that Naproxen (the comparator drug
+used in the trial) protected against them. Merck continued to minimize unfavorable
+findings for Vioxx up to a month before withdrawing it from the market in 2004.
+Another study, ADVANTAGE, was started soon after the VIGOR trial.
+ADVANTAGE had the same goal as VIGOR, but it targeted osteoarthritis,
+whereas VIGOR was for rheumatoid arthritis. Although the ADVANTAGE trial
+did demonstrate that Vioxx was safer on the digestive tract than Naproxen, it
+failed to show that Rofecoxib had any advantage over Naproxen in terms of pain
+relief. Long after the report on ADVANTAGE was published, it turned out that its
+first author had no involvement in the study until Merck presented him with a copy
+of the manuscript written by Merck authors. This turned out to be one of the more
+prominent recent examples of ghostwriting of journal articles where company
+researchers wrote the articles and included the names of prominent researchers as
+authors [178].
+In addition, Merck documents later came to light that appear to show the
+ADVANTAGE trial emerged from the Merck marketing division and was actually
+a “seeding” trial, designed to market the drug by putting “its product in the hands
+of practicing physicians, hoping that the experience of treating patients with the
+study drug and a pleasant, even profitable interaction with the company will result
+in more loyal physicians who prescribe the drug” [83].
+Although the studies did demonstrate that Vioxx was safer on the digestive tract
+than Naproxen, they also again unexpectedly found that the COX-2 inhibitor
+doubled the risk of cardiovascular problems. In April 2002, the FDA required that
+Merck note a possible link to heart attacks and strokes on Vioxx’s label. But it never
+ordered Merck to conduct a trial comparing Vioxx with a placebo to determine
+whether a link existed. In April 2000 the FDA recommended that Merck conduct
+an animal study with Vioxx to evaluate cardiovascular safety, but no such study was
+ever conducted. 
+For both the VIGOR and ADVANTAGE studies, claims have been made that +cardiovascular events were omitted from published reports [160]. In May 2000 +Merck published the results from the VIGOR trial. The data included only seven- +teen of the twenty heart attacks the Vioxx patients had. When the omission was +later detected, Merck argued that the events occurred after the trial was over and +therefore did not have to be reported. The data showed a four times higher risk of +heart attacks compared with Naproxen. In October 2000, Merck officially told the +FDA about the other three heart attacks in the VIGOR study. + + +Merck marketed Vioxx heavily to doctors and spent more than $100 million +a year on direct-to-the-consumer advertising using popular athletes including +Dorothy Hamill and Bruce Jenner. In September 2001, the FDA sent Merck a letter +warning the company to stop misleading doctors about Vioxx’s effect on the cardio- +vascular system. +In 2001, Merck started a new study called APPROVe (Adenomatous Polyp +PRevention On Vioxx) in order to expand its market by showing the efficacy of +Vioxx on colorectal polyps. APPROVe was halted early when the preliminary data +showed an increased relative risk of heart attacks and strokes after eighteen months +of Vioxx use. The long-term use of Rofecoxib resulted in nearly twice the risk of +suffering a heart attack or stroke compared to patients receiving a placebo. +David Graham, an FDA researcher, did an analysis of a database of 1.4 million +Kaiser Permanente members and found that those who took Vioxx were more likely +to suffer a heart attack or sudden cardiac death than those who took Celebrex, +Vioxx’s main rival. Graham testified to a congressional committee that the FDA +tried to block publication of his findings. He described an environment “where he +was ‘ostracized’; ‘subjected to veiled threats’ and ‘intimidation.’” Graham gave the +committee copies of email that support his claims that his superiors at the FDA +suggested watering down his conclusions [178]. +Despite all their efforts to deny the risks associated with Vioxx, Merck withdrew +the drug from the market in September 2004. In October 2004, the FDA approved +a replacement drug for Vioxx by Merck, called Arcoxia. +Because of the extensive litigation associated with Vioxx, many questionable +practices in the pharmaceutical industry have come to light [6]. Merck has been +accused of several unsafe “control actions” in this sequence of events, including not +accurately reporting trial results to the FDA, not having a proper control board +(DSMB) overseeing the safety of the patients in at least one of the trials, misleading +marketing efforts, ghostwriting journal articles about Rofecoxib studies, and paying +publishers to create fake medical journals to publish favorable articles [45]. Post- +market safety studies recommended by the FDA were never done, only studies +directed at increasing the market. + + +section 8.7.2. Analysis of the Vioxx Case. +The hazards, system safety requirements and constraints, and documentation of the +safety control structure for pharmaceutical safety were shown in chapter 7. Using +these, Couturier performed several types of analysis. +He first traced the system requirements to the responsibilities assigned to each +of the components in the safety control structure, that is, he performed a gap analysis +as described above for the NASA ITA risk analysis. 
The goal was to check that at +least one controller was responsible for enforcing each of the safety requirements, +to identify when multiple controllers had the same responsibility, and to study each + +of the controllers independently to determine if they are capable of carrying out +their assigned responsibilities. +In the gap analysis, no obvious gaps or missing responsibilities were found, but +multiple controllers are in charge of enforcing some of the same safety requirements. +For example, the FDA, the pharmaceutical companies, and physicians are all respon- +sible for monitoring drugs for adverse events. This redundancy is helpful if the +controllers work together and share the information they have. Problems can occur, +however, if efforts are not coordinated and gaps occur. +The assignment of responsibilities does not necessarily mean they are carried out +effectively. As in the NASA ITA analysis, potentially inadequate control actions can +be identified using STPA step 1, potential causes identified using step 2, and controls +to protect against these causes designed and implemented. Contextual factors must +be considered such as external or internal pressures militating against effective +implementation or application of the controls. For example, given the financial +incentives involved in marketing a blockbuster drug—Vioxx in 2003 provided $2.5 +billion, or 11 percent of Merck’s revenue [66]—it may be unreasonable to expect +pharmaceutical companies to be responsible for drug safety without strong external +oversight and controls or even to be responsible at all: Suggestions have been made +that responsibility for drug development and testing be taken away from the phar- +maceutical manufacturers [67]. +Controllers must also have the resources and information necessary to enforce +the safety constraints they have been assigned. Physicians need information about +drug safety and efficacy that is independent from the pharmaceutical company +representatives in order to adequately protect their patients. One of the first steps +in performing an analysis of the drug safety control structure is to identify the con- +textual factors that can influence whether each component’s responsibilities are +carried out and the information required to create an accurate process model to +support informed decision making in exercising the controls they have available to +carry out their responsibilities. +Couturier also used the drug safety control structure, system safety requirements +and constraints, the events in the Vioxx losses, and STPA and system dynamics +models (see appendix D) to investigate the potential effectiveness of the changes +implemented after the Vioxx events to control the marketing of unsafe drugs and +the impact of the changes on the system as a whole. For example, the Food and Drug +Amendments Act of 2007 (FDAAA) increased the responsibilities of the FDA and +provided it with new authority. Couturier examined the recommendations from the +FDAAA, the IOM report, and those generated from his STAMP causal analysis of +the Vioxx events. +System dynamics modeling was used to show the relationship among the contex- +tual factors and unsafe control actions and the reasons why the safety control struc- +ture migrated toward ineffectiveness over time. Most modeling techniques provide + + +only direct relationships (arrows), which are inadequate to understand the indirect +relationships between causal factors. 
System dynamics provides a way to show such +indirect and nonlinear relationships. Appendix D explains this modeling technique. +First, system dynamics models were created to model the contextual influences +on the behavior of each component (patients, pharmaceutical companies, the FDA, +and so on) in the pharmaceutical safety control structure. Then the models were +combined to assist in understanding the behavior of the system as a whole and the +interactions among the components. The complete analysis can be found in [43] and +a shorter paper on some of the results [44]. An overview and some examples are +provided here. +Figure 8.10 shows a simple model of two types of pressures in this system that +militate against drugs being recalled. The loop on the left describes pressures within +the pharmaceutical company related to drug recalls while the loop on the right +describes pressures on the FDA related to drug recalls. +Once a drug has been approved, the pharmaceutical company, which invested +large resources in developing, testing, and marketing the drug, has incentives to +maximize profits from the drug and keep it on the market. Those pressures are +accentuated in the case of expected blockbuster drugs where the company’s finan- +cial well-being potentially depends on the success of the product. This goal creates +a reinforcing loop within the company to try to keep the drug on the market. The +company also has incentives to pressure the FDA to increase the number of approved + +indications, and thus purchasers, resist label changes, and prevent drug recalls. If the +company is successful at preventing recalls, the expectations for the drug increase, +creating another reinforcing loop. External pressures to recall the drug limit the +reinforcing dynamics, but they have a lot of inertia to overcome. +Figure 8.11 includes more details, more complex feedback loops, and more outside +pressures, such as the availability of a replacement drug, the time left on the drug’s +patent, and the amount of time spent on drug development. Pressures on the FDA +from the pharmaceutical companies are elaborated including the pressures on the +Office of New Drugs (OND) through PDUFA fees,5 pressures from advisory boards + + +to keep the drug (which are, in turn, subject to pressures from patient advocacy +groups and lucrative consulting contracts with the pharmaceutical companies), and +pressures from the FDA Office of Surveillance and Epidemiology (OSE) to recall +the drug. +Figures 8.12 and 8.13 show the pressures leading to overprescribing drugs. The +overview in figure 8.12 has two primary feedback loops. The loop on the left describes +pressures to lower the number of prescriptions based on the number of adverse +events and negative studies. The loop on the right shows the pressures within the +pharmaceutical company to increase the number of prescriptions based on company +earnings and marketing efforts. +For a typical pharmaceutical product, more drug prescriptions lead to higher +earnings for the drug manufacturer, part of which can be used to pay for more +advertising to get doctors to continue to prescribe the drug. This reinforcing loop is +usually balanced by the adverse effects of the drug. The more the drug is prescribed, +the more likely is observation of negative side effects, which will serve to balance +the pressures from the pharmaceutical companies. The two loops then theoretically +reach a dynamic equilibrium where drugs are prescribed only when their benefits +outweigh the risks. 
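+
+The balance between these two loops can be sketched with a toy simulation. The
+structure follows the loops just described, but the coefficients, the reporting delay,
+and the initial values are invented for illustration and are not drawn from the
+Vioxx data or from Couturier's models.
+
+    # Toy two-loop model: marketing reinforces prescriptions, observed adverse
+    # events balance them, but only after a reporting delay (in months).
+    def simulate(months: int = 60, report_delay: int = 0) -> float:
+        prescriptions = 1000.0
+        adverse_events = []                    # queue modeling the delay
+        for _ in range(months):
+            growth = 0.08 * prescriptions      # reinforcing loop (marketing)
+            adverse_events.append(0.02 * prescriptions)
+            visible = (adverse_events[-1 - report_delay]
+                       if len(adverse_events) > report_delay else 0.0)
+            prescriptions += growth - 4.0 * visible   # balancing loop
+        return prescriptions
+
+    # With no delay the loops hold prescriptions steady; with a two-year
+    # reporting delay the reinforcing loop runs far ahead before the
+    # balancing loop has any effect.
+    print(round(simulate(report_delay=0)), round(simulate(report_delay=24)))
+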
+As demonstrated in the Vioxx case, delays within a loop can significantly alter +the behavior of the system. By the time the first severe side effects were discovered, +millions of prescriptions had been given out. The balancing influences of the side- +effects loop were delayed so long that they could not effectively control the reinforc- +ing pressures coming from the pharmaceutical companies. Figure 8.13 shows how +additional factors can be incorporated including the quality of collected data, the +market size, and patient drug requests. + +Couturier incorporated into the system dynamics models the changes that were +proposed by the IOM after the Vioxx events, the changes actually implemented +in FDAAA, and the recommendations coming out of the STAMP-based causal +analysis. One major difference was that the STAMP-based recommendations had +a broader scope. While the IOM and FDAAA changes focused on the FDA, the +STAMP analysis considered the contributions of all the components of the pharma- +ceutical safety control structure to the Vioxx events and the STAMP causal analysis +led to recommendations for changes in nearly all of them. +Couturier concluded, not surprisingly, that most of the FDAAA changes are +useful and will have the intended effects. He also determined that a few may be +counterproductive and others need to be added. The added ones come from the fact +that the IOM recommendations and the FDAAA focus on a single component of +the system (the FDA). The FDA does not operate in a vacuum, and the proposed +changes do not take into account the safety role played by other components in the +system, particularly physicians. As a result, the pressures that led to the erosion of +the overall system safety controls were left unaddressed and are likely to lead to +changes in the system static and dynamic safety controls that will undermine the +improvements implemented by FDAAA. See Couturier [43] for the complete results. + +A potential contribution of such an analysis is the ability to consider the impact +of multiple changes within the entire safety control structure. Less than effective +controls may be implemented when they are created piecemeal to fix a current set +of adverse events. Existing pressures and influences, not changed by the new pro- +cedures, can defeat the intent of the changes by leading to unintended and counter- +balancing actions in the components of the safety control structure. STAMP-based +analysis suggest how to reengineer the safety control structure as a whole to achieve +the system goals, including both enhancing the safety of current drugs while at the +same time encouraging the development of new drugs. + + +footnote. The Prescription Drug Use Fee Act (PDUFA) was first passed by Congress in 1992. It allows the FDA +to collect fees from the pharmaceutical companies to pay the expenses for the approval of new drugs. +In return, the FDA agrees to meet drug review performance goals. The main goal of PDUFA is to accel- +erate the drug review process. Between 1993 and 2002, user fees allowed the FDA to increase by 77 +percent the number of personnel assigned to review applications. In 2004, more than half the funding +for the CDEH was coming from user fees [148]. A growing group of scientists and regulators have +expressed fears that in allowing the FDA to be sponsored by the pharmaceutical companies, the FDA +has shifted its priorities to satisfying the companies, its “client,” instead of protecting the public. + + +section 8.8. 
+Comparison of STPA with Traditional Hazard Analysis Techniques. +Few formal comparisons have been made yet between STPA and traditional tech- +niques such as fault tree analysis and HAZOP. Theoretically, because STAMP +extends the causality model underlying the hazard analysis, non-failures and addi- +tional causes should be identifiable, as well as the failure-related causes found by +the traditional techniques. The few comparisons that have been made, both informal +and formal, have confirmed this hypothesis. +In the use of STPA on the U.S. missile defense system, potential paths to inad- +vertent launch were identified that had not been identified by previous analyses or +in extensive hazard analyses on the individual components of the system [BMDS]. +Each element of the system had an active safety program, but the complexity and +coupling introduced by their integration into a single system created new subtle and +complex hazard scenarios. While the scenarios identified using STPA included those +caused by potential component failures, as expected, scenarios were also identified +that involved unsafe interactions among the components without any components +actually failing—each operated according to its specified requirements, but the +interactions could lead to hazardous system states. In the evaluation of this effort, +two other advantages were noted: +1. The effort was bounded and predictable and assisted the engineers in scoping +their efforts. Once all the control actions have been examined, the assessment +is complete. +2. As the control structure is developed and the potential inadequate control +actions are identified, they were able to prioritize required changes according +to which control actions have the greatest role in keeping the system from +transitioning to a hazardous state. +A paper published on this effort concluded: +The STPA safety assessment methodology . . . provided an orderly, organized fashion in +which to conduct the analysis. The effort successfully assessed safety risks arising from the + +integration of the Elements. The assessment provided the information necessary to char- +acterize the residual safety risk of hazards associated with the system. The analysis and +supporting data provided management a sound basis on which to make risk acceptance +decisions. Lastly, the assessment results were also used to plan mitigations for open safety +risks. As changes are made to the system, the differences are assessed by updating the +control structure diagrams and assessment analysis templates. +Another informal comparison was made in the ITA (Independent Technical Author- +ity) analysis described in section 8.6. An informal review of the risks identified by +using STPA showed that they included all the risks identified by the informal NASA +risk analysis process using the traditional method common to such analyses. The +additional risks identified by STPA appeared on the surface to be as important as +those identified by the NASA analysis. As noted, there is no way to determine +whether the less formal NASA process identified additional risks and discarded +them for some reason or simply missed them. +A more careful comparison has also been made. JAXA (the Japanese Space +Agency) and MIT engineers compared the use of STPA on a JAXA unmanned +spacecraft (HTV) to transfer cargo to the International Space Station (ISS). 
Because +human life is potentially involved (one hazard is collision with the International +Space Station), rigorous NASA hazard analysis standards using fault trees and other +analyses had been employed and reviewed by NASA. In an STPA analysis of the +HTV used in an evaluation of the new technique for potential use at JAXA, all of +the hazard causal factors identified by the fault tree analysis were identified also by +STPA [88]. As with the BMDS comparison, additional causal factors were identified +by STPA alone. These additional causal factors again involved those related to more +sophisticated types of errors beyond simple component failures and those related +to software and human errors. +Additional independent comparisons (not done by the author or her students) +have been made between accident causal analysis methods comparing STAMP and +more traditional methods. The results are described in chapter 11 on accident analy- +sis based on STAMP. + + +section 8.9. +Summary. +Some new approaches to hazard and risk analysis based on STAMP and systems +theory have been suggested in this chapter. We are only beginning to develop such +techniques and hopefully others will work on alternatives and improvements. The +only thing for sure is that applying the techniques developed for simple electrome- +chanical systems to complex, human and software-intensive systems without funda- +mentally changing the foundations of the techniques is futile. New ideas are +desperately needed if we are going to solve the problems and respond to the changes +in the world of engineering described in chapter 1. \ No newline at end of file diff --git a/chapter08.txt b/chapter08.txt new file mode 100644 index 0000000..83b9261 --- /dev/null +++ b/chapter08.txt @@ -0,0 +1,1175 @@ +chapter 8. +S T P A. A New Hazard Analysis Technique. +Hazard analysis can be described as “investigating an accident before it occurs.” The +goal is to identify potential causes of accidents, that is, scenarios that can lead +to losses, so they can be eliminated or controlled in design or operations before +damage occurs. +The most widely used existing hazard analysis techniques were developed fifty +years ago and have serious limitations in their applicability to today’s more complex, +software-intensive, sociotechnical systems. This chapter describes a new approach +to hazard analysis, based on the STAMP causality model, called S T P A .(SystemTheoretic Process Analysis). +section 8.1. +Goals for a New Hazard Analysis Technique. +Three hazard analysis techniques are currently used widely. Fault Tree Analysis, +Event Tree Analysis, and Haz Op. Variants that combine aspects of these three +techniques, such as Cause-Consequence Analysis .(combining top-down fault trees +and forward analysis Event Trees). and Bowtie Analysis .(combining forward and +backward chaining techniques). are also sometimes used. Safeware and other basic +textbooks contain more information about these techniques for those unfamiliar +with them. F M E A .(Failure Modes and Effects Analysis). is sometimes used as a +hazard analysis technique, but it is a bottom-up reliability analysis technique and +has very limited applicability for safety analysis. +The primary reason for developing S T P A was to include the new causal factors +identified in STAMP that are not handled by the older techniques. 
More specifically, +the hazard analysis technique should include design errors, including software flaws; +component interaction accidents; cognitively complex human decision-making +errors; and social, organizational, and management factors contributing to accidents. +In short, the goal is to identify accident scenarios that encompass the entire accident +process, not just the electromechanical components. While attempts have been +made to add new features to traditional hazard analysis techniques to handle new + + +technology, these attempts have had limited success because the underlying assumptions of the old techniques and the causality models on which they are based do not +fit the characteristics of these new causal factors. S T P A is based on the new causality +assumptions identified in chapter 2. +An additional goal in the design of S T P A was to provide guidance to the users +in getting good results. Fault tree and event tree analysis provide little guidance to +the analyst.the tree itself is simply the result of the analysis. Both the model of the +system being used by the analyst and the analysis itself are only in the analyst’s +head. Analyst expertise in using these techniques is crucial, and the quality of the +fault or event trees that result varies greatly. +Haz Op, widely used in the process industries, provides much more guidance to +the analysts. Haz Op is based on a slightly different accident model than fault and +event trees, namely that accidents result from deviations in system parameters, such +as too much flow through a pipe or backflow when forward flow is required. +Haz Op uses a set of guidewords to examine each part of a plant piping and wiring +diagram, such as more than, less than, and opposite. Both guidance in performing +the process and a concrete model of the physical structure of the plant are therefore +available. +Like Haz Op, S T P A works on a model of the system and has “guidewords” to +assist in the analysis, but because in STAMP accidents are seen as resulting from +inadequate control, the model used is a functional control diagram rather than a +physical component diagram. In addition, the set of guidewords is based on lack of +control rather than physical parameter deviations. While engineering expertise is +still required, guidance is provided for the S T P A process to provide some assurance +of completeness in the analysis. +The third and final goal for S T P A is that it can be used before a design has been +created, that is, it provides the information necessary to guide the design process, +rather than requiring a design to exist before the analysis can start. Designing +safety into a system, starting in the earliest conceptual design phases, is the most +cost-effective way to engineer safer systems. The analysis technique must also, of +course, be applicable to existing designs or systems when safety-guided design is +not possible. +section 8.2. +The S T P A Process. +S T P A .(System-Theoretic Process Analysis). can be used at any stage of the system +life cycle. It has the same general goals as any hazard analysis technique. accumulating information about how the behavioral safety constraints, which are derived +from the system hazards, can be violated. Depending on when it is used, it provides +the information and documentation necessary to ensure the safety constraints are + + +enforced in system design, development, manufacturing, and operations, including +the natural changes in these processes that will occur over time. 
+S T P A uses a functional control diagram and the requirements, system hazards, +and the safety constraints and safety requirements for the component as defined in +chapter 7. When S T P A is applied to an existing design, this information is available +when the analysis process begins. When S T P A is used for safety-guided design, only +the system-level requirements and constraints may be available at the beginning +of the process. In the latter case, these requirements and constraints are refined +and traced to individual system components as the iterative design and analysis +process proceeds. +S T P A has two main steps. +1. Identify the potential for inadequate control of the system that could lead to +a hazardous state. Hazardous states result from inadequate control or enforcement of the safety constraints, which can occur because. +a. A control action required for safety is not provided or not followed. +b. An unsafe control action is provided. +c. A potentially safe control action is provided too early or too late, that is, at +the wrong time or in the wrong sequence. +d. A control action required for safety is stopped too soon or applied too long. +2. Determine how each potentially hazardous control action identified in step 1 +could occur. +a. For each unsafe control action, examine the parts of the control loop to see +if they could cause it. Design controls and mitigation measures if they do not +already exist or evaluate existing measures if the analysis is being performed +on an existing design. For multiple controllers of the same component or +safety constraint, identify conflicts and potential coordination problems. +b. Consider how the designed controls could degrade over time and build in +protection, including +b.1. Management of change procedures to ensure safety constraints are +enforced in planned changes. +b.2. Performance audits where the assumptions underlying the hazard analysis are the preconditions for the operational audits and controls so that +unplanned changes that violate the safety constraints can be detected. +b.3. Accident and incident analysis to trace anomalies to the hazards and to +the system design. +While the analysis can be performed in one step, dividing the process into +discrete steps reduces the analytical burden on the safety engineers and provides a + + +structured process for hazard analysis. The information from the first step .(identifying the unsafe control actions). is required to perform the second step .(identifying +the causes of the unsafe control actions). +The assumption in this chapter is that the system design exists when S T P A +is performed. The next chapter describes safety-guided design using S T P A and +principles for safe design of control systems. +S T P A is defined in this chapter using two examples. The first is a simple, generic +interlock. The hazard involved is exposure of a human to a potentially dangerous +energy source, such as high power. The power controller, which is responsible for +turning the energy on or off, implements an interlock to prevent the hazard. In the +physical controlled system, a door or barrier over the power source prevents exposure while it is active. To simplify the example, we will assume that humans cannot +physically be inside the area when the barrier is in place.that is, the barrier is +simply a cover over the energy source. 
The door or cover will be manually operated +so the only function of the automated controller is to turn the power off when the +door is opened and to turn it back on when the door is closed. +Given this design, the process starts from. + +Hazard. Exposure to a high-energy source. +Constraint. The energy source must be off when the door is not closed. + +Figure 8.1 shows the control structure for this simple system. In this figure, the +components of the system are shown along with the control instructions each component can provide and some potential feedback and other information or control +sources for each component. Control operations by the automated controller include +turning the power off and turning it on. The human operator can open and close +the door. Feedback to the automated controller includes an indication of whether +the door is open or not. Other feedback may be required or useful as determined +during the S T P A .(hazard analysis). process. +The control structure for a second more complex example to be used later in the +chapter, a fictional but realistic ballistic missile intercept system .(F Miss), is shown +in figure 8.2. Pereira, Lee, and Howard created this example to describe their +use of S T P A to assess the risk of inadvertent launch in the U.S. Ballistic Missile +Defense System .(B M D S). before its first deployment and field test. +The B M D S is a layered defense to defeat all ranges of threats in all phases of +flight .(boost, midcourse, and terminal). The example used in this chapter is, for + +security reasons, changed from the real system, but it is realistic, and the problems +identified by S T P A in this chapter are similar to some that were found using S T P A +on the real system. +The U S BDMS system has a variety of components, including sea-based sensors +in the Aegis shipborne platform; upgraded early warning systems; new and upgraded +radars, ground-based midcourse defense, fire control, and communications; a +Command and Control Battle Management and Communications component; +and ground-based interceptors. Future upgrades will add features. Some parts +of the system have been omitted in the example, such as the Aegis .(ship-based) +platform. +Figure 8.2 shows the control structure for the F Miss components included in the +example. The command authority controls the operators by providing such things +as doctrine, engagement criteria, and training. As feedback, the command authority +gets the exercise results, readiness information, wargame results, and other information. The operators are responsible for controlling the launch of interceptors by +sending instructions to the fire control subsystem and receiving status information +as feedback. + + +Fire control receives instructions from the operators and information from the +radars about any current threats. Using these inputs, fire control provides instructions to the launch station, which actually controls the launch of any interceptors. +Fire control can enable firing, disable firing, and so forth, and, of course, it receives +feedback from the launch station about the status of any previously provided +control actions and the state of the system itself. The launch station controls the +actual launcher and the flight computer, which in turn controls the interceptor +hardware. +There is one other component of the system. 
To ensure operational readiness, the +F Miss contains an interceptor simulator that periodically is used to mimic the flight +computer in order to detect a failure in the system. + +footnote. The phrase “when the door is open” would be incorrect because a case is missing .(a common problem). +in the power controller’s model of the controlled process, which enforces the constraint, the door may +be open, closed, or the door position may be unknown to the controller. The phrase “is open or the door +position is unknown” could be used instead. See section 9.3.2 for a discussion of why the difference is +important. + + + +section 8.3. +Identifying Potentially Hazardous Control Actions .(Step 1) +Starting from the fundamentals defined in chapter 7, the first step in S T P A is to +assess the safety controls provided in the system design to determine the potential +for inadequate control, leading to a hazard. The assessment of the hazard controls +uses the fact that control actions can be hazardous in four ways .(as noted earlier). +1. A control action required for safety is not provided or is not followed. +2. An unsafe control action is provided that leads to a hazard. +3. A potentially safe control action is provided too late, too early, or out of +sequence. +4. A safe control action is stopped too soon or applied too long .(for a continuous +or nondiscrete control action). +For convenience, a table can be used to record the results of this part of the analysis. +Other ways to record the information are also possible. In a classic System Safety +program, the information would be included in the hazard log. Figure 8.3 shows the +results of step 1 for the simple interlock example. The table contains four hazardous +types of behavior. +1. A power off command is not given when the door is opened, +2. The door is opened and the controller waits too long to turn the power off; +3. A power on command is given while the door is open, and +4. A power on command is provided too early .(when the door has not yet fully +closed). +Incorrect but non-hazardous behavior is not included in the table. For example, +not providing a power on command when the power is off and the door is opened + + +or closed is not hazardous, although it may represent a quality-assurance problem. +Another example of a mission assurance problem but not a hazard occurs when the +power is turned off while the door is closed. Thomas has created a procedure to +assist the analyst in considering the effect of all possible combinations of environmental and process variables for each control action in order to avoid missing any +cases that should be included in the table . +The final column of the table, Stopped Too Soon or Applied Too Long, is not +applicable to the discrete interlock commands. An example where it does apply is +in an aircraft collision avoidance system where the pilot may be told to climb or +descend to avoid another aircraft. If the climb or descend control action is stopped +too soon, the collision may not be avoided. +The identified hazardous behaviors can now be translated into safety constraints +(requirements). on the system component behavior. For this example, four constraints must be enforced by the power controller .(interlock). +1. The power must always be off when the door is open; +2. A power off command must be provided within x milliseconds after the door +is opened; +3. A power on command must never be issued when the door is open; +4. 
The power on command must never be given until the door is fully closed. +For more complex examples, the mode in which the system is operating may determine the safety of the action or event. In that case, the operating mode may need +to be included in the table, perhaps as an additional column. For example, some +spacecraft mission control actions may only be hazardous during the launch or +reentry phase of the mission. +In chapter 2, it was stated that many accidents, particularly component interaction accidents, stem from incomplete requirements specifications. Examples were + + +provided such as missing constraints on the order of valve position changes in a +batch chemical reactor and the conditions under which the descent engines should +be shut down on the Mars Polar Lander spacecraft. The information provided +in this first step of S T P A can be used to identify the necessary constraints on component behavior to prevent the identified system hazards, that is, the safety requirements. In the second step of S T P A, the information required by the component to +properly implement the constraint is identified as well as additional safety constraints and information necessary to eliminate or control the hazards in the design +or to design the system properly in the first place. +The F Miss system provides a less trivial example of step 1. Remember, the hazard +is inadvertent launch. Consider the fire enable command, which can be sent by the +fire control module to the launch station to allow launch commands subsequently +received by the launch station to be executed. As described in Pereira, Lee, and +Howard , the fire enable control command directs the launch station to enable +the live fire of interceptors. Prior to receiving this command, the launch station will +return an error message when it receives commands to fire an interceptor and will +discard the fire commands.2 +Figure 8.4 shows the results of performing S T P A Step 1 on the fire enable +command. If this command is missing .(column 2), a launch will not take place. While +this omission might potentially be a mission assurance concern, it does not contribute to the hazard being analyzed .(inadvertent launch). + + +If the fire enable command is provided to a launch station incorrectly, the launch +station will transition to a state where it accepts interceptor tasking and can progress +through a launch sequence. In combination with other incorrect or mistimed commands, this control action could contribute to an inadvertent launch. +A late fire enable command will only delay the launch station’s ability to +process a launch sequence, which will not contribute to an inadvertent launch. A +fire enable command sent too early could open a window of opportunity for +inadvertently progressing toward an inadvertent launch, similar to the incorrect +fire enable considered above. In the third case, a fire enable command might +be out of sequence with a fire disable command. If this incorrect sequencing is +possible in the system as designed and constructed, the system could be left +capable of processing interceptor tasking and launching an interceptor when not +intended. +Finally, the fire enable command is a discrete command sent to the launch +station to signal that it should allow processing of interceptor tasking. Because +fire enable is not a continuous command, the “stopped too soon” category does +not apply. + +footnote. 
Section 9.4.4 explains the safety-related reasons for breaking up potentially hazardous actions into +multiple steps. + + +section 8.4. +Determining How Unsafe Control Actions Could Occur. .(Step 2) +Performing the first step of S T P A provides the component safety requirements, +which may be sufficient for some systems. A second step can be performed, however, +to identify the scenarios leading to the hazardous control actions that violate the +component safety constraints. Once the potential causes have been identified, the +design can be checked to ensure that the identified scenarios have been eliminated +or controlled in some way. If not, then the design needs to be changed. If the design +does not already exist, then the designers at this point can try to eliminate or control +the behaviors as the design is created, that is, use safety-guided design as described +in the next chapter. +Why is the second step needed? While providing the engineers with the safety +constraints to be enforced is necessary, it is not sufficient. Consider the chemical +batch reactor described in section 2.1. The hazard is overheating of the reactor +contents. At the system level, the engineers may decide .(as in this design). to use +water and a reflux condenser to control the temperature. After this decision is made, +controls need to be enforced on the valves controlling the flow of catalyst and water. +Applying step 1 of S T P A determines that opening the valves out of sequence is +dangerous, and the software requirements would accordingly be augmented with +constraints on the order of the valve opening and closing instructions, namely that +the water valve must be opened before the catalyst valve and the catalyst valve must +be closed before the water valve is closed or, more generally, that the water valve + + +must always be open when the catalyst valve is opened. If the software already exists, +the hazard analysis would ensure that this ordering of commands has been enforced +in the software. Clearly, building the software to enforce this ordering is a great deal +easier than proving the ordering is true after the software already exists. +But enforcing these safety constraints is not enough to ensure safe software +behavior. Suppose the software has commanded the water valve to open but something goes wrong and the valve does not actually open or it opens but water flow +is restricted in some way .(the no flow guideword in Haz Op). Feedback is needed +for the software to determine if water is flowing through the pipes and the software +needs to check this feedback before opening the catalyst valve. The second step of +S T P A is used to identify the ways that the software safety constraint, even if provided to the software engineers, might still not be enforced by the software logic +and system design. In essence, step 2 identifies the scenarios or paths to a hazard +found in a classic hazard analysis. This step is the usual “magic” one that creates the +contents of a fault tree, for example. The difference is that guidance is provided to +help create the scenarios and more than just failures are considered. +To create causal scenarios, the control structure diagram must include the process +models for each component. If the system exists, then the content of these models +should be easily determined by looking at the system functional design and its documentation. 
If the system does not yet exist, the analysis can start with a best guess +and then be refined and changed as the analysis process proceeds. +For the high power interlock example, the process model is simple and shown in +figure 8.5. The general causal factors, shown in figure 4.8 and repeated here in figure +8.6 for convenience, are used to identify the scenarios. + +section 8.4.1. Identifying Causal Scenarios. +Starting with each hazardous control action identified in step 1, the analysis in step +2 involves identifying how it could happen. To gather information about how the +hazard could occur, the parts of the control loop for each of the hazardous control +actions identified in step 1 are examined to determine if they could cause or contribute to it. Once the potential causes are identified, the engineers can design +controls and mitigation measures if they do not already exist or evaluate existing +measures if the analysis is being performed on an existing design. +Each potentially hazardous control action must be considered. As an example, +consider the unsafe control action of not turning off the power when the door is +opened. Figure 8.7 shows the results of the causal analysis in a graphical form. Other +ways of documenting the results are, of course, possible. +The hazard in figure 8.7 is that the door is open but the power is not turned off. +Looking first at the controller itself, the hazard could result if the requirement is +not passed to the developers of the controller, the requirement is not implemented + + +correctly, or the process model incorrectly shows the door closed and/or the power +off when that is not true. Working around the loop, the causal factors for each of +the loop components are similarly identified using the general causal factors shown +in figure 8.6. These causes include that the power off command is sent but not +received by the actuator, the actuator received the command but does not implement it .(actuator failure), the actuator delays in implementing the command, the +power on and power off commands are received or executed in the wrong order, +the door open event is not detected by the door sensor or there is an unacceptable +delay in detecting it, the sensor fails or provides spurious feedback, and the feedback +about the state of the door or the power is not received by the controller or is not +incorporated correctly into the process model. +More detailed causal analysis can be performed if a specific design is being considered. For example, the features of the communication channels used will determine the potential way that commands or feedback could be lost or delayed. +Once the causal analysis is completed, each of the causes that cannot be shown +to be physically impossible must be checked to determine whether they are + + +adequately handled in the design .(if the design exists). or design features added to +control them if the design is being developed with support from the analysis. +The first step in designing for safety is to try to eliminate the hazard completely. +In this example, the hazard can be eliminated by redesigning the system to have the +circuit run through the door in such a way that the circuit is broken as soon as the +door opens. Let’s assume, however, that for some reason this design alternative is +rejected, perhaps as impractical. 
Design precedence then suggests that the next best +alternatives in order are to reduce the likelihood of the hazard occurring, to prevent +the hazard from leading to a loss, and finally to minimize damage. More about safe +design can be found in chapters 16 and 17 of Safeware and chapter 9 of this book. +Because design almost always involves tradeoffs with respect to achieving multiple objectives, the designers may have good reasons not to select the most effective +way to control the hazard but one of the other alternatives instead. It is important +that the rationale behind the choice is documented for future analysis, certification, +reuse, maintenance, upgrades, and other activities. +For this simple example, one way to mitigate many of the causes is to add a light +that identifies whether the power supply is on or off. How do human operators know +that the power has been turned off before inserting their hands into the high-energy + + +power source? In the original design, they will most likely assume it is off because +they have opened the door, which may be an incorrect assumption. Additional +feedback and assurance can be attained from the light. In fact, protection systems +in automated factories commonly are designed to provide humans in the vicinity +with aural or visual information that they have been detected by the protection +system. Of course, once a change has been made, such as adding a light, that change +must then be analyzed for new hazards or causal scenarios. For example, a light bulb +can burn out. The design might ensure that the safe state .(the power is off). is represented by the light being on rather than the light being off, or two colors might +be used. Every solution for a safety problem usually has its own drawbacks and +limitations and therefore they will need to be compared and decisions made about +the best design given the particular situation involved. +In addition to the factors shown in figure 8.6, the analysis must consider the +impact of having two controllers of the same component whenever this occurs in +the system safety control structure. In the friendly fire example in chapter 5, for +example, confusion existed between the two A Wacks operators responsible for +tracking aircraft inside and outside of the no-fly-zone about who was responsible +for aircraft in the boundary area between the two. The F Miss example below contains such a scenario. An analysis must be made to determine that no path to a +hazard exists because of coordination problems. +The F Miss system provides a more complex example of S T P A step 2. Consider +the fire enable command provided by fire control to the launch station. In step 1, +it was determined that if this command is provided incorrectly, the launch station +will transition to a state where it accepts interceptor tasking and can progress +through a launch sequence. In combination with other incorrect or mistimed control +actions, this incorrect command could contribute to an inadvertent launch. +The following are two examples of causal factors identified using S T P A step 2 as +potentially leading to the hazardous state .(violation of the safety constraint). Neither +of these examples involves component failures, but both instead result from unsafe +component interactions and other more complex causes that are for the most part +not identifiable by current hazard analysis methods. 
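Before turning to those two F Miss examples, it may help to see the bookkeeping behind step 2 for the simple interlock in compact form. The Python sketch below is only an illustration, assuming a paraphrased subset of the general causal factors discussed above; the wording of the factors and the grouping by loop part are paraphrased assumptions rather than an official checklist. Each unsafe control action from step 1 is paired with each causal factor, and every resulting scenario that cannot be shown to be physically impossible must be eliminated or controlled in the design.

# Illustrative sketch only: pairing the step 1 result for the interlock
# with paraphrased causal factors from the general control loop model.

UNSAFE_CONTROL_ACTIONS = [
    "power off command not provided when the door is opened",
]

# Paraphrased causal factors, grouped by the part of the control loop
# they concern.
CAUSAL_FACTORS = {
    "controller": [
        "safety requirement not passed to developers or not implemented",
        "process model wrongly shows the door closed or the power off",
    ],
    "actuator": [
        "power off command sent but not received",
        "command received but not executed, or executed too late",
    ],
    "sensor": [
        "door open event not detected, or detected too late",
        "sensor failure or spurious feedback",
    ],
    "feedback path": [
        "door or power state not received, or not incorporated in the process model",
    ],
}

def candidate_scenarios():
    # Pair every unsafe control action with every causal factor. Each pair
    # is a candidate scenario to be shown impossible, eliminated by design,
    # or controlled and then audited during operations.
    for uca in UNSAFE_CONTROL_ACTIONS:
        for loop_part, factors in CAUSAL_FACTORS.items():
            for factor in factors:
                yield uca, loop_part, factor

for uca, loop_part, factor in candidate_scenarios():
    print(f"[{loop_part}] {factor} -> could lead to: {uca}")

With that bookkeeping in mind, the two F Miss causal factors identified in step 2 can now be described.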
+In the first example, the fire enable command can be sent inadvertently due to +a missing case in the requirements.a common occurrence in accidents where software is involved. +The fire enable command is sent when the fire control receives a weapons free +command from the operators and the fire control system has at least one active +track. An active track indicates that the radars have detected something that might +be an incoming missile. Three criteria are specified for declaring a track inactive. +(1). a given period passes with no radar input, .(2). the total predicted impact time +elapses for the track, and .(3). an intercept is confirmed. Operators are allowed to + + +deselect any of these options. One case was not considered by the designers. if an +operator deselects all of the options, no tracks will be marked as inactive. Under +these conditions, the inadvertent entry of a weapons free command would send the +fire enable command to the launch station immediately, even if there were no +threats currently being tracked by the system. +Once this potential cause is identified, the solution is obvious.fix the software +requirements and the software design to include the missing case. While the operator might instead be warned not to deselect all the options, this kind of human error +is possible and the software should be able to handle the error safely. Depending +on humans not to make mistakes is an almost certain way to guarantee that accidents will happen. +The second example involves confusion between the regular and the test software. The F Miss undergoes periodic system operability testing using an interceptor +simulator that mimics the interceptor flight computer. The original hazard analysis +had identified the possibility that commands intended for test activities could be +sent to the operational system. As a result, the system status information provided +by the launch station includes whether the launch station is connected only to +missile simulators or to any live interceptors. If the fire control computer detects a +change in this state, it will warn the operator and offer to reset into a matching state. +There is, however, a small window of time before the launch station notifies the fire +control component of the change. During this time interval, the fire control software +could send a fire enable command intended for test to the live launch station. This +latter example is a coordination problem arising because there are multiple controllers of the launch station and two operating modes .(e.g., testing and live fire). A +potential mode confusion problem exists where the launch station can think it is in +one mode but really be in the other one. Several different design changes could be +used to prevent this hazardous state. +In the use of S T P A on the real missile defense system, the risks involved in integrating separately developed components into a larger system were assessed, and +several previously unknown scenarios for inadvertent launch were identified. Those +conducting the assessment concluded that the S T P A analysis and supporting data +provided management with a sound basis on which to make risk acceptance decisions . The assessment results were used to plan mitigations for open safety +risks deemed necessary to change before deployment and field-testing of the system. +As system changes are proposed, they are assessed by updating the control structure +diagrams and assessment analysis results. + +section 8.4.2. 
Considering the Degradation of Controls over Time. +A final step in S T P A is to consider how the designed controls could degrade over +time and to build in protection against it. The mechanisms for the degradation could + +be identified and mitigated in the design. for example, if corrosion is identified as a +potential cause, a stronger or less corrosive material might be used. Protection might +also include planned performance audits where the assumptions underlying the +hazard analysis are the preconditions for the operational audits and controls. For +example, an assumption for the interlock system with a light added to warn the +operators is that the light is operational and operators will use it to determine +whether it is safe to open the door. Performance audits might check to validate that +the operators know the purpose of the light and the importance of not opening the +door while the warning light is on. Over time, operators might create workarounds +to bypass this feature if it slows them up too much in their work or if they do not +understand the purpose, the light might be partially blocked from view because of +workplace changes, and so on. The assumptions and required audits should be identified during the system design process and then passed to the operations team. +Along with performance audits, management of change procedures need to be +developed and the S T P A analysis revisited whenever a planned change is made in +the system design. Many accidents occur after changes have been made in the +system. If appropriate documentation is maintained along with the rationale for the +control strategy selected, this reanalysis should not be overly burdensome. How to +accomplish this goal is discussed in chapter 10. +Finally, after accidents and incidents, the design and the hazard analysis should +be revisited to determine why the controls were not effective. The hazard of foam +damaging the thermal surfaces of the Space Shuttle had been identified during +design, for example, but over the years before the Columbia loss the process for +updating the hazard analysis after anomalies occurred in flight was eliminated. The +Space Shuttle standard for hazard analyses .(NSTS 22254, Methodology for Conduct +of Space Shuttle Program Hazard Analyses). specified that hazards be revisited only +when there was a new design or the design was changed. There was no process for +updating the hazard analyses when anomalies occurred or even for determining +whether an anomaly was related to a known hazard . +Chapter 12 provides more information about the use of the S T P A results during +operations. + +section 8.5. Human Controllers. +Humans in the system can be treated in the same way as automated components in +step 1 of S T P A, as was seen in the interlock system above where a person controlled +the position of the door. The causal analysis and detailed scenario generation for +human controllers, however, is much more complex than that of electromechanical +devices and even software, where at least the algorithm is known and can be evaluated. Even if operators are given a procedure to follow, for reasons discussed in + +chapter 2, it is very likely that the operator may feel the need to change the procedure over time. +The first major difference between human and automated controllers is that +humans need an additional process model. 
All controllers need a model of the +process they are controlling directly, but human controllers also need a model of +any process, such as an oil refinery or an aircraft, they are indirectly controlling +through an automated controller. If the human is being asked to supervise the +automated controller or to monitor it for wrong or dangerous behavior then he +or she needs to have information about the state of both the automated controller +and the controlled process. Figure 8.8 illustrates this requirement. The need for +an additional process model explains why supervising an automated system +requires extra training and skill. A wrong assumption is sometimes made that if the + +human is supervising a computer, training requirements are reduced but this +belief is untrue. Human skill levels and required knowledge almost always go up in +this situation. +Figure 8.8 includes dotted lines to indicate that the human controller may need +direct access to the process actuators if the human is to act as a backup to the +automated controller. In addition, if the human is to monitor the automation, he +or she will need direct input from the sensors to detect when the automation is +confused and is providing incorrect information as feedback about the state of the +controlled process. +The system design, training, and operational procedures must support accurate +creation and updating of the extra process model required by the human supervisor. +More generally, when a human is supervising an automated controller, there are +extra analysis and design requirements. For example, the control algorithm used by +the automation must be learnable and understandable. Inconsistent behavior or +unnecessary complexity in the automation function can lead to increased human +error. Additional design requirements are discussed in the next chapter. +With respect to S T P A, the extra process model and complexity in the system +design requires additional causal analysis when performing step 2 to determine the +ways that both process models can become inaccurate. +The second important difference between human and automated controllers is +that, as noted by Thomas , while automated systems have basically static control +algorithms .(although they may be updated periodically), humans employ dynamic +control algorithms that they change as a result of feedback and changes in goals. +Human error is best modeled and understood using feedback loops, not as a chain +of directly related events or errors as found in traditional accident causality models. +Less successful actions are a natural part of the search by operators for optimal +performance . +Consider again figure 2.9. Operators are often provided with procedures to follow +by designers. But designers are dealing with their own models of the controlled +process, which may not reflect the actual process as constructed and changed over +time. Human controllers must deal with the system as it exists. They update their +process models using feedback, just as in any control loop. Sometimes humans use +experimentation to understand the behavior of the controlled system and its current +state and use that information to change their control algorithm. For example, after +picking up a rental car, drivers may try the brakes and the steering system to get a +feel for how they work before driving on a highway. +If human controllers suspect a failure has occurred in a controlled process, they +may experiment to try to diagnose it and determine a proper response. 
Humans +also use experimentation to determine how to optimize system performance. The +driver’s control algorithm may change over time as the driver learns more about + + +the automated system and learns how to optimize the car’s behavior. Driver goals +and motivation may also change over time. In contrast, automated controllers by +necessity must be designed with a single set of requirements based on the designer’s +model of the controlled process and its environment. +Thomas provides an example using cruise control. Designers of an automated cruise control system may choose a control algorithm based on their model +of the vehicle .(such as weight, engine power, response time), the general design of +roadways and vehicle traffic, and basic engineering design principles for propulsion +and braking systems. A simple control algorithm might control the throttle in proportion to the difference between current speed .(monitored through feedback). and +desired speed .(the goal). +Like the automotive cruise control designer, the human driver also has a process +model of the car’s propulsion system, although perhaps simpler than that of the +automotive control expert, including the approximate rate of car acceleration for +each accelerator position. This model allows the driver to construct an appropriate +control algorithm for the current road conditions .(slippery with ice or clear and dry) +and for a given goal .(obeying the speed limit or arriving at the destination at a +required time). Unlike the static control algorithm designed into the automated +cruise control, the human driver may dynamically change his or her control algorithm over time based on changes in the car’s performance, in goals and motivation, +or driving experience. +The differences between automated and human controllers lead to different +requirements for hazard analysis and system design. Simply identifying human +“failures” or errors is not enough to design safer systems. Hazard analysis must +identify the specific human behaviors that can lead to the hazard. In some cases, it +may be possible to identify why the behaviors occur. In either case, we are not able +to “redesign” humans. Training can be helpful, but not nearly enough.training can +do only so much in avoiding human error even when operators are highly trained +and skilled. In many cases, training is impractical or minimal, such as automobile +drivers. The only real solution lies in taking the information obtained in the hazard +analysis about worst-case human behavior and using it in the design of the other +system components and the system as a whole to eliminate, reduce, or compensate +for that behavior. Chapter 9 discusses why we need human operators in systems and +how to design to eliminate or reduce human errors. +S T P A as currently defined provides much more useful information about the +cause of human errors than traditional hazard analysis methods, but augmenting +S T P A could provide more information for designers. Stringfellow has suggested +some additions to S T P A for human controllers . In general, engineers need +better tools for including humans in hazard analyses in order to cope with the unique +aspects of human control. + + +section 8.6. Using S T P A on Organizational Components of the Safety Control Structure. +The examples above focus on the lower levels of safety control structures, but S T P A +can also be used on the organizational and management components. 
Less experimentation has been done on applying it at these levels, and, once again, more needs +to be done. +Two examples are used in this section. one was a demonstration for NASA of +risk analysis using S T P A on a new management structure proposed after the Columbia accident. The second is pharmaceutical safety. The fundamental activities of +identifying system hazards, safety requirements and constraints, and of documenting +the safety control structure were described for these two examples in chapter 7. +This section starts from that point and illustrates the actual risk analysis process. + +section 8.6.1. Programmatic and Organizational Risk Analysis. +The Columbia Accident Investigation Board .(CAIB). found that one of the causes +of the Columbia loss was the lack of independence of the safety program from the +Space Shuttle program manager. The CAIB report recommended that NASA institute an Independent Technical Authority .(I T A). function similar to that used in +SUBSAFE .(see chapter 14), and individuals with SUBSAFE experience were +recruited to help design and implement the new NASA Space Shuttle program +organizational structure. After the program was designed and implementation +started, a risk analysis of the program was performed to assist in a planned review +of the program’s effectiveness. A classic programmatic risk analysis, which used +experts to identify the risks in the program, was performed. In parallel, a group at +MIT developed a process to use STAMP as a foundation for the same type of programmatic risk analysis to understand the risks and vulnerabilities of this new +organizational structure and recommend improvements .3 This section describes +the STAMP-based process and results as an example of what can be done for other +systems and other emergent properties. Laracy used a similar process to +examine transportation system security, for example. +The STAMP-based analysis rested on the basic STAMP concept that most major +accidents do not result simply from a unique set of proximal, physical events but +from the migration of the organization to a state of heightened risk over time as +safeguards and controls are relaxed due to conflicting goals and tradeoffs. In such +a high-risk state, events are bound to occur that will trigger an accident. In both the +Challenger and Columbia losses, organizational risk had been increasing to unacceptable levels for quite some time as behavior and decision-making evolved in + +response to a variety of internal and external performance pressures. Because risk +increased slowly, nobody noticed, that is, the boiled frog phenomenon. In fact, confidence and complacency were increasing at the same time as risk due to the lack +of accidents. +The goal of the STAMP-based analysis was to apply a classic system safety +engineering process to the analysis and redesign of this organizational structure. +Figure 8.9 shows the basic process used, which started with a preliminary hazard +analysis to identify the system hazards and the safety requirements and constraints. +In the second step, a STAMP model of the I T A safety control structure was created +(as designed by NASA; see figure 7.4). and a gap analysis was performed to map the +identified safety requirements and constraints to the assigned responsibilities in the +safety control structure and identify any gaps. 
A detailed hazard analysis using S T P A +was then performed to identify the system risks and to generate recommendations +for improving the designed new safety control structure and for monitoring the +implementation and long-term health of the new program. Only enough of the +modeling and analysis is included here to allow the reader to understand the process. +The complete modeling and analysis effort is documented elsewhere . +The hazard identification, system safety requirements, and safety control structure for this example are described in section 7.4.1, so the example starts from this +basic information. + + +footnote. Many people contributed to the analysis described in this section, including Nicolas Dulac, Betty +Barrett, Joel Cutcher-Gershenfeld, John Carroll, and Stephen Friedenthal. + + +section 8.6.2. Gap Analysis. +In analyzing an existing organizational or social safety control structure, one of the +first steps is to determine where the responsibility for implementing each requirement rests and to perform a gap analysis to identify holes in the current design, that +is, requirements that are not being implemented .(enforced). anywhere. Then the +safety control structure needs to be evaluated to determine whether it is potentially +effective in enforcing the system safety requirements and constraints. +A mapping was made between the system-level safety requirements and constraints and the individual responsibilities of each component in the NASA safety +control structure to see where and how requirements are enforced. The I T A program +was at the time being carefully defined and documented. In other situations, where +such documentation may be lacking, interview or other techniques may need to be +used to elicit how the organizational control structure actually works. In the end, +complete documentation should exist in order to maintain and operate the system +safely. While most organizations have job descriptions for each employee, the safetyrelated responsibilities are not necessarily separated out or identified, which can +lead to unidentified gaps or overlaps. +As an example, in the I T A structure the responsibility for the system-level safety +requirement. + +1a. State-of-the art safety standards and requirements for NASA missions must +be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public +was assigned to the NASA Chief Engineer but the Discipline Technical Warrant +Holders, the Discipline Trusted Agents, the NASA Technical Standards Program, +and the headquarters Office of Safety and Mission Assurance also play a role in +implementing this Chief Engineer responsibility. More specifically, system requirement 1a was implemented in the control structure by the following responsibility +assignments. +•Chief Engineer. Develop, monitor, and maintain technical standards and +policy. +•Discipline Technical Warrant Holders. +1.– Recommend priorities for development and updating of technical +standards. +2.– Approve all new or updated NASA Preferred Standards within their assigned +discipline .(the NASA Chief Engineer retains Agency approval) +3.– Participate in .(lead). development, adoption, and maintenance of NASA +Preferred Technical Standards in the warranted discipline. +4.– Participate as members of technical standards working groups. +•Discipline Trusted Agents. Represent the Discipline Technical Warrant +Holders on technical standards committees +•NASA Technical Standards Program. 
Coordinate with Technical Warrant +Holders when creating or updating standards +•NASA Headquarters Office Safety and Mission Assurance. +1.– Develop and improve generic safety, reliability, and quality process standards +and requirements, including F M E A, risk, and the hazard analysis process. +2.– Ensure that safety and mission assurance policies and procedures are adequate and properly documented. +Once the mapping is complete, a gap analysis can be performed to ensure that each +system safety requirement and constraint is embedded in the organizational design +and to find holes or weaknesses in the design. In this analysis, concerns surfaced, +particularly about requirements not reflected in the defined I T A organizational +structure. +As an example, one omission detected was appeals channels for complaints +and concerns about the components of the I T A structure itself that may not +function appropriately. All channels for expressing what NASA calls “technical +conscience” go through the warrant holders, but there was no defined way to express + + +concerns about the warrant holders themselves or about aspects of I T A that are not +working well. +A second example was the omission in the documentation of the I T A implementation plans of the person(s). who was to be responsible to see that engineers and +managers are trained to use the results of hazard analyses in their decision making. +More generally, a distributed and ill-defined responsibility for the hazard analysis +process made it difficult to determine responsibility for ensuring that adequate +resources are applied; that hazard analyses are elaborated .(refined and extended) +and updated as the design evolves and test experience is acquired; that hazard logs +are maintained and used as experience is acquired; and that all anomalies are evaluated for their hazard potential. Before I T A, many of these responsibilities were +assigned to each Center’s Safety and Mission Assurance Office, but with much of +this process moving to engineering .(which is where it should be). under the new I T A +structure, clear responsibilities for these functions need to be specified. One of the +basic causes of accidents in STAMP is multiple controllers with poorly defined or +overlapping responsibilities. +A final example involved the I T A program assessment process. An assessment +of how well I T A is working is part of the plan and is an assigned responsibility of +the chief engineer. The official risk assessment of the I T A program performed in +parallel with the STAMP-based one was an implementation of that chief engineer’s +responsibility and was planned to be performed periodically. We recommended the +addition of specific organizational structures and processes for implementing a +continual learning and improvement process and making adjustments to the design +of I T A itself when necessary outside of the periodic review. + +section 8.6.3. Hazard Analysis to Identify Organizational and Programmatic Risks. +A risk analysis to identify I T A programmatic risks and to evaluate these risks periodically had been specified as one of the chief engineer’s responsibilities. To accomplish this goal, NASA identified the programmatic risks using a classic process using +experts in risk analysis interviewing stakeholders and holding meetings where risks +were identified and discussed. The STAMP-based analysis used a more formal, +structured approach. +Risks in STAMP terms can be divided into two types. .(1). 
basic inadequacies in +the way individual components in the control structure fulfill their responsibilities +and .(2). risks involved in the coordination of activities and decision making that can +lead to unintended interactions and consequences. +Basic Risks +Applying the four types of inadequate control identified in S T P A and interpreted +for the hazard, which in this case is unsafe decision-making leading to an accident, +I T A has four general types of risks. + + +1. Unsafe decisions are made or approved by the chief engineer or warrant +holders. +2. Safe decisions are disallowed .(e.g., overly conservative decision making that +undermines the goals of NASA and long-term support for I T A). +3. Decision making takes too long, minimizing impact and also reducing support +for the I T A. +4. Good decisions are made by the I T A, but do not have adequate impact on +system design, construction, and operation. +The specific potentially unsafe control actions by those in the I T A safety control +structure that could lead to these general risks are the I T A programmatic risks. Once +identified, they must be eliminated or controlled just like any unsafe control actions. +Using the responsibilities and control actions defined for the components of the +safety control structure, the STAMP-based risk analysis applied the four general +types of inadequate control actions, omitting those that did not make sense for the +particular responsibility or did not impact risk. To accomplish this, the general +responsibilities must be refined into more specific control actions. +As an example, the chief engineer is responsible as the I T A for the technical +standards and system requirements and all changes, variances, and waivers to the +requirements, as noted earlier. The control actions the chief engineer has available +to implement this responsibility are. +1.• To develop, monitor, and maintain technical standards and policy. +2.•In coordination with programs and projects, to establish or approve the technical requirements and ensure they are enforced and implemented in the programs and projects .(ensure the design is compliant with the requirements). +3.• To approve all changes to the initial technical requirements. +4.• To approve all variances .(waivers, deviations, exceptions to the requirements. +5.•Etc. +Taking just one of these, the control responsibility to develop, monitor, and maintain +technical standards and policy, the risks .(potentially inadequate or unsafe control +actions). identified using S T P A step 1 include. +1. General technical and safety standards are not created. +2. Inadequate standards and requirements are created. +3. Standards degrade over time due to external pressures to weaken them. The +process for approving changes is flawed. +4. Standards are not changed over time as the environment changes. + +As another example, the chief engineer cannot perform all these duties himself, so +he has a network of people below him in the hierarchy to whom he delegates or +“warrants” some of the responsibilities. The chief engineer retains responsibility for +ensuring that the warrant holders perform their duties adequately as in any hierarchical management structure. +The chief engineer responsibility to approve all variances and waivers to technical +requirements is assigned to the System Technical Warrant Holder .(STWH). The +risks or potentially unsafe control actions of the STWH with respect to this responsibility are. 
+1.• An unsafe engineering variance or waiver is approved. +2.•Designs are approved without determining conformance with safety requirements. Waivers become routine. +3.•Reviews and approvals take so long that I T A becomes a bottleneck. Mission +achievement is threatened. Engineers start to ignore the need for approvals +and work around the STWH in other ways. +Although a long list of risks was identified in this experimental application of S T P A +to a management structure, many of the risks for different participants in the I T A +process were closely related. The risks listed for each participant are related to his +or her particular role and responsibilities and therefore those with related roles or +responsibilities will generate related risks. The relationships were made clear in the +earlier step tracing from system requirements to the roles and responsibilities for +each of the components of the I T A. + +Coordination Risks. +Coordination risks arise when multiple people or groups control the same process. +The types of unsafe interactions that may result include. .(1). both controllers +assume that the other is performing the control responsibilities, and as a result +nobody does, or .(2). controllers provide conflicting control actions that have unintended side effects. +Potential coordination risks are identified by the mapping from the system +requirements to the component requirements used in the gap analysis described +earlier. When similar responsibilities related to the same system requirement are +identified, the potential for new coordination risks needs to be considered. +As an example, the original I T A design documentation was ambiguous about +who had the responsibility for performing many of the safety engineering functions. Safety engineering had previously been the responsibility of the Center +Safety and Mission Assurance Offices but the plan envisioned that these functions +would shift to the I T A in the new organization leading to several obvious +risks. + + +Another example involves the transition of responsibility for the production of +standards to the I T A from the NASA Headquarters Office of Safety and Mission +Assurance .(OSMA). In the plan, some of the technical standards responsibilities +were retained by OSMA, such as the technical design standards for human rating +spacecraft and for conducting hazard analyses, while others were shifted to the I T A +without a clear demarcation of who was responsible for what. At the same time, +responsibilities for the assurance that the plans are followed, which seems to logically belong to the mission assurance group, were not cleanly divided. Both overlaps +raised the potential for some functions not being accomplished or conflicting standards being produced. + +section 8.6.4. Use of the Analysis and Potential Extensions. +While risk mitigation and control measures could be generated from the list of risks +themselves, the application of step 2 of S T P A to identify causes of the risks will help +to provide better control measures in the same way S T P A step 2 plays a similar role +in physical systems. Taking the responsibility of the System Technical Warrant +Holder to approve all variances and waivers to technical requirements in the +example above, potential causes for approving an unsafe engineering variance or +waiver include. 
inadequate or incorrect information about the safety of the action, +inadequate training, bowing to pressure about programmatic concerns, lack of +support from management, inadequate time or resources to evaluate the requested +variance properly, and so on. These causal factors were generated using the generic +factors in figure 8.6 but defined in a more appropriate way. Stringfellow has examined in more depth how S T P A can be applied to organizational factors . +The analysis can be used to identify potential changes to the safety control structure .(the I T A program). that could eliminate or mitigate identified risks. General +design principles for safety are described in the next chapter. +A goal of the NASA risk analysis was to determine what to include in a planned +special assessment of the I T A early in its existence. To accomplish the same goal, +the MIT group categorized their identified risks as .(1). immediate, .(2). long-term, or +(3). controllable by standard ongoing processes. These categories were defined in +the following way. +Immediate concern. An immediate and substantial concern that should be part +of a near-term assessment. +Longer-term concern. A substantial longer-term concern that should potentially +be part of future assessments; as the risk will increase over time or cannot be +evaluated without future knowledge of the system or environment behavior. +Standard process. An important concern that should be addressed through +standard processes, such as inspections, rather than an extensive special assessment procedure. + +This categorization allowed identifying a manageable subset of risks to be part of the +planned near-term risk assessment and those that could wait for future assessments +or could be controlled by on-going procedures. For example, it is important to assess +immediately the degree of “buy-in” to the I T A program. Without such support, I T A +cannot be sustained and the risk of dangerous decision making is very high. On the +other hand, the ability to find appropriate successors to the current warrant holders +is a longer-term concern identified in the STAMP-based risk analysis that would be +difficult to assess early in the existence of the new I T A control structure. The performance of the current technical warrant holders, for example, is one factor that will +have an impact on whether the most qualified people will want the job in the future. + +section 8.6.5. Comparisons with Traditional Programmatic Risk Analysis Techniques. +The traditional risk analysis performed by NASA on I T A identified about one +hundred risks. The more rigorous, structured STAMP-based analysis.done independently and without any knowledge of the results of the NASA process. +identified about 250 risks, all the risks identified by NASA plus additional ones. A +small part of the difference was related to the consideration by the STAMP group +of more components in the safety control structure, such as the NASA administrator, +Congress, and the Executive Branch .(White House). There is no way to determine +whether the other additional risks identified by the STAMP-based process were +simply missed in the NASA analysis or were discarded for some reason. +The NASA analysis did not include a causal analysis of the risks and thus no +comparison is possible. Their goal was to determine what should be included in the +upcoming I T A risk assessment process and thus was narrower than the STAMP +demonstration risk analysis effort. + +section 8.7. 
Reengineering a Sociotechnical System: Pharmaceutical Safety and the Vioxx Tragedy.
The previous section describes the use of STPA on the management structure of an organization that develops and operates high-tech systems. STPA and other types of analysis are potentially also applicable to social systems. This section provides an example using pharmaceutical safety.
Couturier has performed a STAMP-based causal analysis of the incidents associated with the introduction and withdrawal of Vioxx. Once the causes of such losses are determined, changes need to be made to prevent a recurrence. Many suggestions for changes as a result of the Vioxx losses have been proposed. After the Vioxx recall, three main reports were written: one by the Government Accountability Office (GAO), one by the Institute of Medicine (IOM), and one commissioned by Merck. The publication of these reports led to two waves of changes, the first initiated within the FDA and the second by Congress in the form of a new set of rules called the FDAAA (FDA Amendments Act). Couturier, with inputs from others,4 used the Vioxx events to demonstrate how these proposed and implemented policy and structural changes could be analyzed to predict their potential effectiveness using STAMP.

footnote. Many people provided input to the analysis described in this section, including Stan Finkelstein, John Thomas, John Carroll, Margaret Stringfellow, Meghan Dierks, Bruce Psaty, David Wierz, and various other reviewers.

section 8.7.1. The Events Surrounding the Approval and Withdrawal of Vioxx.
Vioxx (Rofecoxib) is a prescription COX-2 inhibitor manufactured by Merck. It was approved by the Food and Drug Administration (FDA) in May 1999 and was widely used for pain management, primarily pain from osteoarthritis. Vioxx was one of the major sources of revenue for Merck while on the market. It was marketed in more than eighty countries, with worldwide sales totaling $2.5 billion in 2003.
In September 2004, Merck voluntarily withdrew the drug from the market because of safety concerns. The drug was suspected of increasing the risk of cardiovascular events (heart attacks and stroke) for patients taking it long term at high dosages. Vioxx was one of the most widely used drugs ever to be withdrawn from the market. According to an epidemiological study done by Graham, an FDA scientist, Vioxx has been associated with more than 27,000 heart attacks or deaths and may be the "single greatest drug safety catastrophe in the history of this country or the history of the world."
The important questions to consider are how such a dangerous drug got on the market and stayed there so long despite warnings of problems, and how this type of loss can be avoided in the future.
The major events in this saga start with the discovery of the Vioxx molecule in 1994. Merck sought FDA approval in November 1998.
In May 1999 the FDA approved Vioxx for the relief of osteoarthritis symptoms and management of acute pain. Nobody had suggested that the COX-2 inhibitors were more effective than the classic NSAIDs in relieving pain; their selling point had been that they were less likely to cause bleeding and other digestive tract complications. The FDA was not convinced and required that the drug carry a warning on its label about possible digestive problems. By December, Vioxx accounted for more than 40 percent of the new prescriptions in its class.
In order to validate their claims that Rofecoxib caused fewer digestive system complications, Merck launched studies to prove that their drug should not be lumped in with other NSAIDs. The studies backfired.
In January 1999, before Vioxx was approved, Merck started a trial called VIGOR (Vioxx Gastrointestinal Outcomes Research) to compare the efficacy and adverse effects of Rofecoxib and Naproxen, an older nonsteroidal anti-inflammatory drug, or NSAID. In March 2000, Merck announced that the VIGOR trial had shown that Vioxx was safer on the digestive tract than Naproxen but that it doubled the risk of cardiovascular problems. Merck argued that the increased risk arose not because Vioxx caused the cardiovascular problems but because Naproxen, the comparison drug used in the trial, protected against them. Merck continued to minimize unfavorable findings for Vioxx up to a month before withdrawing it from the market in 2004.
Another study, ADVANTAGE, was started soon after the VIGOR trial. ADVANTAGE had the same goal as VIGOR, but it targeted osteoarthritis, whereas VIGOR was for rheumatoid arthritis. Although the ADVANTAGE trial did demonstrate that Vioxx was safer on the digestive tract than Naproxen, it failed to show that Rofecoxib had any advantage over Naproxen in terms of pain relief. Long after the report on ADVANTAGE was published, it turned out that its first author had had no involvement in the study until Merck presented him with a copy of the manuscript written by Merck authors. This turned out to be one of the more prominent recent examples of ghostwriting of journal articles, in which company researchers write the articles and include the names of prominent researchers as authors.
In addition, Merck documents later came to light that appear to show that the ADVANTAGE trial emerged from the Merck marketing division and was actually a "seeding" trial, designed to market the drug by putting "its product in the hands of practicing physicians, hoping that the experience of treating patients with the study drug and a pleasant, even profitable interaction with the company will result in more loyal physicians who prescribe the drug."
Although the studies did demonstrate that Vioxx was safer on the digestive tract than Naproxen, they also again unexpectedly found that the COX-2 inhibitor doubled the risk of cardiovascular problems. In April 2002, the FDA required that Merck note a possible link to heart attacks and strokes on Vioxx's label. But it never ordered Merck to conduct a trial comparing Vioxx with a placebo to determine whether a link existed. In April 2000 the FDA had recommended that Merck conduct an animal study with Vioxx to evaluate cardiovascular safety, but no such study was ever conducted.
For both the VIGOR and ADVANTAGE studies, claims have been made that cardiovascular events were omitted from the published reports. In May 2000, Merck published the results from the VIGOR trial. The data included only seventeen of the twenty heart attacks the Vioxx patients had. When the omission was later detected, Merck argued that the events occurred after the trial was over and therefore did not have to be reported. The data showed a fourfold higher risk of heart attacks compared with Naproxen. In October 2000, Merck officially told the FDA about the other three heart attacks in the VIGOR study.
Merck marketed Vioxx heavily to doctors and spent more than $100 million a year on direct-to-consumer advertising using popular athletes, including Dorothy Hamill and Bruce Jenner. In September 2001, the FDA sent Merck a letter warning the company to stop misleading doctors about Vioxx's effect on the cardiovascular system.
In 2001, Merck started a new study called APPROVe (Adenomatous Polyp PRevention On Vioxx) in order to expand its market by showing the efficacy of Vioxx against colorectal polyps. APPROVe was halted early when the preliminary data showed an increased relative risk of heart attacks and strokes after eighteen months of Vioxx use. Long-term use of Rofecoxib resulted in nearly twice the risk of suffering a heart attack or stroke compared with patients receiving a placebo.
David Graham, an FDA researcher, analyzed a database of 1.4 million Kaiser Permanente members and found that those who took Vioxx were more likely to suffer a heart attack or sudden cardiac death than those who took Celebrex, Vioxx's main rival. Graham testified to a congressional committee that the FDA tried to block publication of his findings. He described an environment "where he was 'ostracized'; 'subjected to veiled threats' and 'intimidation.'" Graham gave the committee copies of emails supporting his claims that his superiors at the FDA suggested watering down his conclusions.
Despite all their efforts to deny the risks associated with Vioxx, Merck withdrew the drug from the market in September 2004. In October 2004, the FDA approved a replacement drug for Vioxx from Merck, called Arcoxia.
Because of the extensive litigation associated with Vioxx, many questionable practices in the pharmaceutical industry have come to light. Merck has been accused of several unsafe "control actions" in this sequence of events, including not accurately reporting trial results to the FDA, not having a proper control board (DSMB) overseeing the safety of the patients in at least one of the trials, misleading marketing efforts, ghostwriting journal articles about Rofecoxib studies, and paying publishers to create fake medical journals to publish favorable articles. Postmarket safety studies recommended by the FDA were never done; the only studies conducted were directed at increasing the market.

section 8.7.2. Analysis of the Vioxx Case.
The hazards, system safety requirements and constraints, and documentation of the safety control structure for pharmaceutical safety were shown in chapter 7. Using these, Couturier performed several types of analysis.
He first traced the system requirements to the responsibilities assigned to each of the components in the safety control structure; that is, he performed a gap analysis as described above for the NASA ITA risk analysis. The goal was to check that at least one controller was responsible for enforcing each of the safety requirements, to identify when multiple controllers had the same responsibility, and to study each of the controllers independently to determine whether they are capable of carrying out their assigned responsibilities.
In the gap analysis, no obvious gaps or missing responsibilities were found, but multiple controllers are in charge of enforcing some of the same safety requirements. For example, the FDA, the pharmaceutical companies, and physicians are all responsible for monitoring drugs for adverse events.
This redundancy is helpful if the controllers work together and share the information they have. Problems can arise, however, if efforts are not coordinated and gaps result.
The assignment of responsibilities does not necessarily mean they are carried out effectively. As in the NASA ITA analysis, potentially inadequate control actions can be identified using STPA step 1, potential causes identified using step 2, and controls to protect against these causes designed and implemented. Contextual factors must be considered, such as external or internal pressures militating against effective implementation or application of the controls. For example, given the financial incentives involved in marketing a blockbuster drug (Vioxx provided $2.5 billion, or 11 percent of Merck's revenue, in 2003), it may be unreasonable to expect pharmaceutical companies to be responsible for drug safety without strong external oversight and controls, or even to be responsible at all. Suggestions have been made that responsibility for drug development and testing be taken away from the pharmaceutical manufacturers.
Controllers must also have the resources and information necessary to enforce the safety constraints they have been assigned. Physicians need information about drug safety and efficacy that is independent of the pharmaceutical company representatives in order to adequately protect their patients. One of the first steps in performing an analysis of the drug safety control structure is to identify the contextual factors that can influence whether each component's responsibilities are carried out, along with the information each component requires to maintain an accurate process model and to make informed decisions when exercising the controls available to it.
Couturier also used the drug safety control structure, the system safety requirements and constraints, the events in the Vioxx losses, and STPA and system dynamics models (see appendix D) to investigate the potential effectiveness of the changes implemented after the Vioxx events to control the marketing of unsafe drugs, as well as the impact of the changes on the system as a whole. For example, the FDA Amendments Act of 2007 (FDAAA) increased the responsibilities of the FDA and provided it with new authority. Couturier examined the recommendations from the FDAAA, the IOM report, and those generated from his STAMP causal analysis of the Vioxx events.
System dynamics modeling was used to show the relationships among the contextual factors and unsafe control actions and the reasons why the safety control structure migrated toward ineffectiveness over time. Most modeling techniques capture only direct relationships (arrows), which are inadequate for understanding the indirect relationships between causal factors. System dynamics provides a way to show such indirect and nonlinear relationships. Appendix D explains this modeling technique.
First, system dynamics models were created to model the contextual influences on the behavior of each component (patients, pharmaceutical companies, the FDA, and so on) in the pharmaceutical safety control structure. Then the models were combined to assist in understanding the behavior of the system as a whole and the interactions among the components. The complete analysis can be found in Couturier's work, along with a shorter paper on some of the results. An overview and some examples are provided here.
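Before turning to the system dynamics models, note that the gap analysis described above is, at its core, a mechanical traceability check from system-level safety requirements to the component responsibilities in the control structure. As a minimal sketch (the requirement identifiers and the exact responsibility assignments below are hypothetical illustrations, not taken from Couturier's analysis), the check for gaps and shared responsibilities can be written down directly:

# Minimal sketch of a gap/overlap check for a safety control structure.
# Requirement IDs and responsibility assignments are hypothetical placeholders.
from collections import defaultdict

# Map each component to the system safety requirements it is responsible for enforcing.
responsibilities = {
    "FDA":                    {"SR1-monitor-adverse-events", "SR2-approve-only-safe-drugs"},
    "Pharmaceutical company": {"SR1-monitor-adverse-events", "SR3-report-trial-results"},
    "Physicians":             {"SR1-monitor-adverse-events"},
}

system_requirements = {
    "SR1-monitor-adverse-events",
    "SR2-approve-only-safe-drugs",
    "SR3-report-trial-results",
    "SR4-provide-independent-drug-information",
}

# Invert the mapping: which controllers enforce each requirement?
enforcers = defaultdict(set)
for component, reqs in responsibilities.items():
    for req in reqs:
        enforcers[req].add(component)

gaps = [r for r in system_requirements if not enforcers[r]]
overlaps = {r: sorted(c) for r, c in enforcers.items() if len(c) > 1}

print("Unassigned requirements (gaps):", gaps)
print("Shared requirements (potential coordination risks):", overlaps)

Requirements with no enforcer are gaps; requirements with several enforcers flag exactly the kind of coordination risks discussed earlier for the ITA example.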
Figure 8.10 shows a simple model of two types of pressures in this system that militate against drugs being recalled. The loop on the left describes pressures within the pharmaceutical company related to drug recalls, while the loop on the right describes pressures on the FDA related to drug recalls.
Once a drug has been approved, the pharmaceutical company, which invested large resources in developing, testing, and marketing the drug, has incentives to maximize profits from the drug and keep it on the market. Those pressures are accentuated in the case of expected blockbuster drugs, where the company's financial well-being potentially depends on the success of the product. This goal creates a reinforcing loop within the company to try to keep the drug on the market. The company also has incentives to pressure the FDA to increase the number of approved indications (and thus purchasers), to resist label changes, and to prevent drug recalls. If the company is successful at preventing recalls, the expectations for the drug increase, creating another reinforcing loop. External pressures to recall the drug limit the reinforcing dynamics, but they have a lot of inertia to overcome.
Figure 8.11 includes more details, more complex feedback loops, and more outside pressures, such as the availability of a replacement drug, the time left on the drug's patent, and the amount of time spent on drug development. Pressures on the FDA from the pharmaceutical companies are elaborated, including the pressures on the Office of New Drugs (OND) through PDUFA fees,5 pressures from advisory boards to keep the drug (which are, in turn, subject to pressures from patient advocacy groups and lucrative consulting contracts with the pharmaceutical companies), and pressures from the FDA Office of Surveillance and Epidemiology (OSE) to recall the drug.
Figures 8.12 and 8.13 show the pressures leading to overprescribing drugs. The overview in figure 8.12 has two primary feedback loops. The loop on the left describes pressures to lower the number of prescriptions based on the number of adverse events and negative studies. The loop on the right shows the pressures within the pharmaceutical company to increase the number of prescriptions based on company earnings and marketing efforts.
For a typical pharmaceutical product, more drug prescriptions lead to higher earnings for the drug manufacturer, part of which can be used to pay for more advertising to get doctors to continue to prescribe the drug. This reinforcing loop is usually balanced by the adverse effects of the drug. The more the drug is prescribed, the more likely negative side effects are to be observed, which serves to balance the pressures from the pharmaceutical companies. The two loops then theoretically reach a dynamic equilibrium in which drugs are prescribed only when their benefits outweigh the risks.
As demonstrated in the Vioxx case, delays within a loop can significantly alter the behavior of the system. By the time the first severe side effects were discovered, millions of prescriptions had been given out. The balancing influences of the side-effects loop were delayed so long that they could not effectively control the reinforcing pressures coming from the pharmaceutical companies. Figure 8.13 shows how additional factors can be incorporated, including the quality of collected data, the market size, and patient drug requests.
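The qualitative effect of such a delay can be seen even in a toy stock-and-flow simulation of the two prescription loops. The equations and parameters below are purely illustrative and are not taken from Couturier's models; the point is only that lengthening the detection delay lets the reinforcing loop run away before the balancing loop can act:

# Toy stock-and-flow sketch of the prescription loops; all parameters are illustrative.
def simulate(months=60, detection_delay=18):
    prescriptions = 1.0              # relative prescription rate (arbitrary units)
    adverse_history = []             # adverse events, observed only after a delay
    for t in range(months):
        adverse_history.append(0.05 * prescriptions)   # more use -> more side effects
        # Balancing loop: prescribers react only to adverse events already observed.
        observed = adverse_history[t - detection_delay] if t >= detection_delay else 0.0
        marketing_push = 0.08 * prescriptions           # reinforcing loop from earnings
        prescriptions = max(prescriptions + marketing_push - 1.5 * observed, 0.0)
    return prescriptions

for delay in (3, 18):
    print(f"detection delay {delay:2d} months -> final level {simulate(detection_delay=delay):.2f}")

With a short delay the two loops roughly balance; with an eighteen-month delay the reinforcing loop dominates and the final prescription level ends up more than an order of magnitude higher, mirroring the Vioxx dynamics described above.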
Couturier incorporated into the system dynamics models the changes that were proposed by the IOM after the Vioxx events, the changes actually implemented in the FDAAA, and the recommendations coming out of the STAMP-based causal analysis. One major difference was that the STAMP-based recommendations had a broader scope. While the IOM and FDAAA changes focused on the FDA, the STAMP analysis considered the contributions of all the components of the pharmaceutical safety control structure to the Vioxx events, and the STAMP causal analysis led to recommendations for changes in nearly all of them.
Couturier concluded, not surprisingly, that most of the FDAAA changes are useful and will have the intended effects. He also determined that a few may be counterproductive and that others need to be added. The additional recommendations stem from the fact that the IOM recommendations and the FDAAA focus on a single component of the system (the FDA). The FDA does not operate in a vacuum, and the proposed changes do not take into account the safety role played by other components in the system, particularly physicians. As a result, the pressures that led to the erosion of the overall system safety controls were left unaddressed and are likely to lead to changes in the system's static and dynamic safety controls that will undermine the improvements implemented by the FDAAA. See Couturier for the complete results.

A potential contribution of such an analysis is the ability to consider the impact of multiple changes within the entire safety control structure. Less-than-effective controls may be implemented when they are created piecemeal to fix a current set of adverse events. Existing pressures and influences, not changed by the new procedures, can defeat the intent of the changes by leading to unintended and counterbalancing actions in the components of the safety control structure. A STAMP-based analysis suggests how to reengineer the safety control structure as a whole to achieve the system goals, including both enhancing the safety of current drugs and encouraging the development of new drugs.

footnote. The Prescription Drug User Fee Act (PDUFA) was first passed by Congress in 1992. It allows the FDA to collect fees from the pharmaceutical companies to pay the expenses of approving new drugs. In return, the FDA agrees to meet drug review performance goals. The main goal of PDUFA is to accelerate the drug review process. Between 1993 and 2002, user fees allowed the FDA to increase by 77 percent the number of personnel assigned to review applications. In 2004, more than half the funding for the CDER was coming from user fees. A growing group of scientists and regulators have expressed fears that, in allowing the FDA to be sponsored by the pharmaceutical companies, the agency has shifted its priorities to satisfying the companies, its "client," instead of protecting the public.

section 8.8.
Comparison of STPA with Traditional Hazard Analysis Techniques.
Few formal comparisons have yet been made between STPA and traditional techniques such as fault tree analysis and HAZOP. Theoretically, because STAMP extends the causality model underlying the hazard analysis, non-failures and additional causes should be identifiable, as well as the failure-related causes found by the traditional techniques. The few comparisons that have been made, both informal and formal, have confirmed this hypothesis.
In the use of STPA on the U.S. missile defense system, potential paths to inadvertent launch were identified that had not been identified by previous analyses or in the extensive hazard analyses on the individual components of the system. Each element of the system had an active safety program, but the complexity and coupling introduced by their integration into a single system created new, subtle, and complex hazard scenarios. While the scenarios identified using STPA included those caused by potential component failures, as expected, scenarios were also identified that involved unsafe interactions among the components without any components actually failing: each operated according to its specified requirements, but the interactions could lead to hazardous system states. In the evaluation of this effort, two other advantages were noted:
1. The effort was bounded and predictable and assisted the engineers in scoping their efforts. Once all the control actions have been examined, the assessment is complete.
2. As the control structure was developed and the potentially inadequate control actions were identified, the engineers were able to prioritize required changes according to which control actions have the greatest role in keeping the system from transitioning to a hazardous state.
A paper published on this effort concluded:
The STPA safety assessment methodology . . . provided an orderly, organized fashion in which to conduct the analysis. The effort successfully assessed safety risks arising from the integration of the Elements. The assessment provided the information necessary to characterize the residual safety risk of hazards associated with the system. The analysis and supporting data provided management a sound basis on which to make risk acceptance decisions. Lastly, the assessment results were also used to plan mitigations for open safety risks. As changes are made to the system, the differences are assessed by updating the control structure diagrams and assessment analysis templates.
Another informal comparison was made in the ITA (Independent Technical Authority) analysis described in section 8.6. An informal review of the risks identified by using STPA showed that they included all the risks identified by the informal NASA risk analysis process, which used the traditional method common to such analyses. The additional risks identified by STPA appeared on the surface to be as important as those identified by the NASA analysis. As noted, there is no way to determine whether the less formal NASA process identified the additional risks and discarded them for some reason or simply missed them.
A more careful comparison has also been made. JAXA (the Japanese Space Agency) and MIT engineers compared the use of STPA on a JAXA unmanned spacecraft (HTV) used to transfer cargo to the International Space Station (ISS). Because human life is potentially involved (one hazard is collision with the International Space Station), rigorous NASA hazard analysis standards using fault trees and other analyses had been employed and reviewed by NASA. In an STPA analysis of the HTV, performed as part of an evaluation of the new technique for potential use at JAXA, all of the hazard causal factors identified by the fault tree analysis were also identified by STPA. As with the BMDS comparison, additional causal factors were identified by STPA alone.
These additional causal factors again involved those related to more +sophisticated types of errors beyond simple component failures and those related +to software and human errors. +Additional independent comparisons .(not done by the author or her students) +have been made between accident causal analysis methods comparing STAMP and +more traditional methods. The results are described in chapter 11 on accident analysis based on STAMP. + + +section 8.9. +Summary. +Some new approaches to hazard and risk analysis based on STAMP and systems +theory have been suggested in this chapter. We are only beginning to develop such +techniques and hopefully others will work on alternatives and improvements. The +only thing for sure is that applying the techniques developed for simple electromechanical systems to complex, human and software-intensive systems without fundamentally changing the foundations of the techniques is futile. New ideas are +desperately needed if we are going to solve the problems and respond to the changes +in the world of engineering described in chapter 1. \ No newline at end of file diff --git a/chapter09.raw b/chapter09.raw new file mode 100644 index 0000000..da82237 --- /dev/null +++ b/chapter09.raw @@ -0,0 +1,1864 @@ +chapter 9. +Safety-Guided Design. +In the examples of STPA in the last chapter, the development of the design was +assumed to occur independently. Most of the time, hazard analysis is done after the +major design decisions have been made. But STPA can be used in a proactive way +to help guide the design and system development, rather than as simply a hazard +analysis technique on an existing design. This integrated design and analysis process +is called safety-guided design (figure 9.1). +As the systems we build and operate increase in size and complexity, the use of +sophisticated system engineering approaches becomes more critical. Important +system-level (emergent) properties, such as safety, must be built into the design of +these systems; they cannot be effectively added on or simply measured afterward. +Adding barriers or protection devices after the fact is not only enormously more +expensive, it is also much less effective than designing safety in from the beginning +(see Safeware, chapter 16). This chapter describes the process of safety-guided +design, which is enhanced by defining accident prevention as a control problem +rather than a “prevent failures” problem. The next chapter shows how safety engi- +neering and safety-guided design can be integrated into basic system engineering +processes. +section 9.1. +The Safety-Guided Design Process. +One key to having a cost-effective safety effort is to embed it into a system engi- +neering process from the very beginning and to design safety into the system as the +design decisions are made. Once again, the process starts with the fundamental +activities in chapter 7. After the hazards and system-level safety requirements and +constraints have been identified; the design process starts: +1. Try to eliminate the hazards from the conceptual design. +2. If any of the hazards cannot be eliminated, then identify the potential for their +control at the system level. + + +3. Create a system control structure and assign responsibilities for enforcing +safety constraints. Some guidance for this process is provided in the operations +and management chapters. +4. Refine the constraints and design in parallel. +a. 
Identify potentially hazardous control actions by each of system com- +ponents that would violate system design constraints using STPA step 1. +Restate the identified hazard control actions as component design +constraints. +b. Using STPA Step 2, determine what factors could lead to a violation of the +safety constraints. +c. Augment the basic design to eliminate or control potentially unsafe control +actions and behaviors. +d. Iterate over the process, that is, perform STPA steps 1 and 2 on the new +augmented design and continue to refine the design until all hazardous +scenarios are eliminated, mitigated, or controlled. +The next section provides an example of the process. The rest of the chapter dis- +cusses safe design principles for physical processes, automated controllers, and +human controllers. + +section 9.2. +An Example of Safety-Guided Design for an Industrial Robot. +The process of safety-guided design and the use of STPA to support it is illustrated +here with the design of an experimental Space Shuttle robotic Thermal Tile +Processing System (TTPS) based on a design created for a research project at +CMU [57]. +The goal of the TTPS system is to inspect and waterproof the thermal protection +tiles on the belly of the Space Shuttle, thus saving humans from a laborious task, +typically lasting three to four months, that begins within minutes after the Shuttle + + +lands and ends just prior to launch. Upon landing at either the Dryden facility in +California or Kennedy Space Center in Florida, the orbiter is brought to either the +Mate-Demate Device (MDD) or the Orbiter Processing Facility (OPF). These large +structures provide access to all areas of the orbiters. +The Space Shuttle is covered with several types of heat-resistant tiles that protect +the orbiter’s aluminum skin during the heat of reentry. While the majority of the +upper surfaces are covered with flexible insulation blankets, the lower surfaces are +covered with silica tiles. These tiles have a glazed coating over soft and highly porous +silica fibers. The tiles are 95 percent air by volume, which makes them extremely +light but also makes them capable of absorbing a tremendous amount of water. +Water in the tiles causes a substantial weight problem that can adversely affect +launch and orbit capabilities for the shuttles. Because the orbiters may be exposed +to rain during transport and on the launch pad, the tiles must be waterproofed. This +task is accomplished through the use of a specialized hydrophobic chemical, DMES, +which is injected into each tile. There are approximately 17,000 lower surface tiles +covering an area that is roughly 25m × 40m. +In the standard process, DMES is injected into a small hole in each tile by a +handheld tool that pumps a small quantity of chemical into the nozzle. The nozzle +is held against the tile and the chemical is forced through the tile by a pressurized +nitrogen purge for several seconds. It takes about 240 hours to waterproof the tiles +on an orbiter. Because the chemical is toxic, human workers have to wear heavy +suits and respirators while injecting the chemical and, at the same time, maneuvering +in a crowded work area. One goal for using a robot to perform this task was to +eliminate a very tedious, uncomfortable, and potentially hazardous human activity. +The tiles must also be inspected. A goal for the TTPS was to inspect the tiles +more accurately than the human eye and therefore reduce the need for multiple +inspections. 
During launch, reentry, and transport, a number of defects can occur on +the tiles in the form of scratches, cracks, gouges, discoloring, and erosion of surfaces. +The examination of the tiles determines if they need to be replaced or repaired. The +typical procedures involve visual inspection of each tile to see if there is any damage +and then assessment and categorization of the defects according to detailed check- +lists. Later, work orders are issued for repair of individual tiles. +Like any design process, safety-guided design starts with identifying the goals for +the system and the constraints under which the system must operate. The high-level +goals for the TTPS are to: +1. Inspect the thermal tiles for damage caused during launch, reentry, and +transport +2. Apply waterproofing chemicals to the thermal tiles + + +Environmental constraints delimit how these goals can be achieved and identifying +those constraints, particularly the safety constraints, is an early goal in safety- +guided design. +The environmental constraints on the system design stem from physical proper- +ties of the Orbital Processing Facility (OPF) at KSC, such as size constraints on the +physical system components and the necessity of any mobile robotic components +to deal with crowded work areas and for humans to be in the area. Example work +area environmental constraints for the TTPS are: +EA1: The work areas of the Orbiter Processing Facility (OPF) can be very +crowded. The facilities provide access to all areas of the orbiters through the +use of intricate platforms that are laced with plumbing, wiring, corridors, lifting +devices, and so on. After entering the facility, the orbiters are jacked up and +leveled. Substantial structure then swings around and surrounds the orbiter on +all sides and at all levels. With the exception of the jack stands that support +the orbiters, the floor space directly beneath the orbiter is initially clear but +the surrounding structure can be very crowded. +EA2: The mobile robot must enter the facility through personnel access doors 1.1 +meters (42″) wide. The layout within the OPF allows a length of 2.5 meters +(100″) for the robot. There are some structural beams whose heights are as +low as 1.75 meters (70″), but once under the orbiter the tile heights range from +about 2.9 meters to 4 meters. The compact roll-in form of the mobile system +must maneuver these spaces and also raise its inspection and injection equip- +ment up to heights of 4 meters to reach individual tiles while still meeting a 1 +millimeter accuracy requirement. +EA3: Additional constraints involve moving around the crowded workspace. The +robot must negotiate jack stands, columns, work stands, cables, and hoses. In +addition, there are hanging cords, clamps, and hoses. Because the robot might +cause damage to the ground obstacles, cable covers will be used for protection +and the robot system must traverse these covers. +Other design constraints on the TTPS include: +1.•Use of the TTPS must not negatively impact the flight schedules of the orbiters +more than that of the manual system being replaced. +2.•Maintenance costs of the TTPS must not exceed x dollars per year. +3.•Use of the TTPS must not cause or contribute to an unacceptable loss (acci- +dent) as defined by Shuttle management. +As with many systems, prioritizing the hazards by severity is enough in this case to +assist the engineers in making decisions during design. 
Sometimes a preliminary hazard analysis is performed using a risk matrix to determine how much effort will be put into eliminating or controlling the hazards and how to make tradeoffs in the design. Likelihood, at this point, is unknowable, but some type of surrogate, like mitigatibility, as demonstrated in section 10.3.4, could be used. In the TTPS example, severity plus the NASA policy described earlier is adequate. To decide not to consider some of the hazards at all would be pointless and dangerous at this stage of development, as likelihood is not determinable. As the design proceeds and decisions must be made, specific additional information may be found to be useful and can be acquired at that time. After the system design is completed, if it is determined that some hazards cannot be adequately handled or that the compromises required to handle them are too great, then the limitations would be documented (as described in chapter 10) and decisions would have to be made at that point about the risks of using the system. At that time, however, the information necessary to make those decisions is more likely to be available than it was before the development process began.
After the hazards are identified, system-level safety-related requirements and design constraints are derived from them. As an example, for hazard H7 (inadequate thermal protection), a system-level safety design constraint is that the mobile robot processing must not result in any tiles being missed in the inspection or waterproofing process. More detailed design constraints will be generated during the safety-guided design process.
To get started, a general system architecture must be selected (figure 9.2). Let's assume that the initial TTPS architecture consists of a mobile base on which tools will be mounted, including a manipulator arm that performs the processing and contains the vision and waterproofing tools. This very early decision may be changed after the safety-guided design process starts, but some very basic initial assumptions are necessary to get going. As the concept development and detailed design process proceeds, information generated about hazards and design tradeoffs may lead to changes in the initial configuration. Alternatively, multiple design configurations may be considered in parallel.
In the initial candidate architecture (control structure), a decision is made to introduce a human operator to supervise robot movement, as so many of the hazards are related to movement. At the same time, it may be impractical for an operator to monitor all the activities, so the first version of the system architecture has the TTPS control system in charge of the non-movement activities and has both the TTPS and the control room operator share control of movement. The safety-guided design process, including STPA, will identify the implications of this decision and will assist in analyzing the allocation of tasks to the various components to determine the safety tradeoffs involved.
In the candidate starting architecture (control structure), there is an automated robot work planner to provide the overall processing goals and tasks for the TTPS. A location system is needed to provide information to the movement controller about the current location of the robot. A camera is used to provide information to the human controller, as the control room will be located at a distance from the orbiter. The role of the other components should be obvious.
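One convenient way to keep such a candidate control structure explicit while it evolves is to record, for each controller, the control actions it can issue and the feedback it relies on; STPA step 1 then systematically asks, for every control action, the four general ways it could be hazardous. The sketch below uses the component names from the TTPS example, but the particular lists of actions and feedback are simplified assumptions for illustration only:

# Sketch: candidate TTPS control structure captured as data for STPA step 1.
# The component names follow the example in the text; the details are illustrative.
control_structure = {
    "Control room operator": {
        "controls": ["TTPS control system"],
        "control_actions": ["start movement", "stop movement"],
        "feedback": ["camera view", "robot status display"],
    },
    "TTPS control system": {
        "controls": ["mobile base", "manipulator arm"],
        "control_actions": ["move base", "extend arm", "stow arm"],
        "feedback": ["location system position", "arm position sensors"],
    },
}

# The four general ways a control action can be unsafe (used in STPA step 1).
GUIDEWORDS = [
    "not provided when required",
    "provided when unsafe",
    "provided too early, too late, or out of sequence",
    "stopped too soon or applied too long",
]

for controller, spec in control_structure.items():
    for action in spec["control_actions"]:
        for guide in GUIDEWORDS:
            # Each combination is a prompt: could this lead to one of the hazards?
            print(f"{controller}: '{action}' {guide}?")

Each printed combination is simply a question for the design team to answer against the identified hazards.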
+The proposed design has two potential movement controllers, so coordination +problems will have to be eliminated. The operator could control all movement, but +that may be considered impractical given the processing requirements. To assist with +this decision process, engineers may create a concept of operations and perform a +human task analysis [48, 122]. +The safety-guided design process, including STPA, will identify the implications +of the basic decisions in the candidate tasks and will assist in analyzing the +allocation of tasks to the various components to determine the safety tradeoffs +involved. +The design process is now ready to start. Using the information already specified, +particularly the general functional responsibilities assigned to each component, + + +designers will identify potentially hazardous control actions by each of the system +components that could violate the safety constraints, determine the causal factors +that could lead to these hazardous control actions, and prevent or control them in +the system design. The process thus involves a top-down identification of scenarios +in which the safety constraints could be violated. The scenarios can then be used to +guide more detailed design decisions. +In general, safety-guided design involves first attempting to eliminate the +hazard from the design and, if that is not possible or requires unacceptable +tradeoffs, reducing the likelihood the hazard will occur, reducing the negative +consequences of the hazard if it does occur, and implementing contingency plans +for limiting damage. More about design procedures is presented in the next +section. +As design decisions are made, an STPA-based hazard analysis is used to +inform these decisions. Early in the system design process, little information is +available, so the hazard analysis will be very general at first and will be refined +and augmented as additional information emerges through the system design +activities. +For the example, let’s focus on the robot instability hazard. The first goal should +be to eliminate the hazard in the system design. One way to eliminate potential +instability is to make the robot base so heavy that it cannot become unstable, no +matter how the manipulator arm is positioned. A heavy base, however, could increase +the damage caused by the base coming into contact with a human or object or make +it difficult for workers to manually move the robot out of the way in an emergency +situation. An alternative solution is to make the base long and wide so the moment +created by the operation of the manipulator arm is compensated by the moments +created by base supports that are far from the robot’s center of mass. A long and +wide base could remove the hazard but may violate the environmental constraints +in the facility layout, such as the need to maneuver through doors and in the +crowded OPF. +The environmental constraint EA2 above implies a maximum length for the +robot of 2.5 meters and a width no larger than 1.1 meter. Given the required +maximum extension length of the manipulator arm and the estimated weight of +the equipment that will need to be carried on the mobile base, a calculation might +show that the length of the robot base is sufficient to prevent any longitudinal +instability, but that the width of the base is not sufficient to prevent lateral +instability. +If eliminating the hazard is determined to be impractical (as in this case) or not +desirable for some reason, the alternative is to identify ways to control it. 
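As a rough illustration of the kind of static check described above, the lateral tipping condition can be approximated by comparing the overturning moment of the extended arm with the restoring moment of the base about its tipping edge. All of the masses and dimensions below are made-up numbers, not actual TTPS values:

# Rough static stability check about one edge of the base; numbers are illustrative only.
G = 9.81  # m/s^2

def tips_over(base_half_width_m, base_mass_kg, arm_reach_m, arm_payload_kg):
    """True if the overturning moment from the extended arm exceeds the restoring
    moment of the base about the tipping edge (worst case, arm fully extended sideways)."""
    restoring = base_mass_kg * G * base_half_width_m
    overturning = arm_payload_kg * G * (arm_reach_m - base_half_width_m)
    return overturning > restoring

# Hypothetical values: 1.1 m wide base (0.55 m half-width), 150 kg base,
# arm and tooling equivalent to 80 kg acting 2.0 m from the base centerline.
print(tips_over(0.55, 150, 2.0, 80))   # True: lateral stability is not assured
print(tips_over(1.25, 150, 2.0, 80))   # False: a 2.5 m wide footprint would be stable

In this toy calculation the 1.1-meter-wide base tips, while a base wide enough to be stable would no longer fit through the 1.1-meter access doors, which is what pushes the design toward controlling the hazard rather than eliminating it.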
The deci- +sion to try to control it may turn out not to be practical or later may seem less +satisfactory than increasing the weight (the solution earlier discarded). All decisions + + +should remain open as more information is obtained about alternatives and back- +tracking is an option. +At the initial stages in design, we identified only the general hazards—for +example, instability of the robot base and the related system design constraint that +the mobile base must not be capable of falling over under worst-case operational +conditions. As design decisions are proposed and analyzed, they will lead to addi- +tional refinements in the hazards and the design constraints. +For example, a potential solution to the stability problem is to use lateral stabi- +lizer legs that are deployed when the manipulator arm is extended but must be +retracted when the robot base moves. Let’s assume that a decision is made to at +least consider this solution. That potential design decision generates a new refined +hazard from the high-level stability hazard (H2): +H2.1: The manipulator arm is extended while the stabilizer legs are not fully +extended. +Damage to the mobile base or other equipment around the OPF is another potential +hazard introduced by the addition of the legs if the mobile base moves while the +stability legs are extended. Again, engineers would consider whether this hazard +could be eliminated by appropriate design of the stability legs. If it cannot, then that +is a second additional hazard that must be controlled in the design with a corre- +sponding design constraint that the mobile base must not move with the stability +legs extended. +There are now two new refined hazards that must be translated into design +constraints: +1. The manipulator arm must never be extended if the stabilizer legs are not +extended. +2. The mobile base must not move with the stability legs extended. +STPA can be used to further refine these constraints and to evaluate the resulting +designs. In the process, the safety control structure will be refined and perhaps +changed. In this case, a controller must be identified for the stabilizer legs, which +were previously not in the design. Let’s assume that the legs are controlled by the +TTPS movement controller (figure 9.3). +Using the augmented control structure, the remaining activities in STPA are to +identify potentially hazardous control actions by each of the system components +that could violate the safety constraints, determine the causal factors that could lead +to these hazardous control actions, and prevent or control them in the system design. +The process thus involves a top-down identification of scenarios in which the safety + + +constraints could be violated so that they can be used to guide more detailed design +decisions. +The unsafe control actions associated with the stability hazard are shown in +figure 9.4. Movement and thermal tile processing hazards are also identified in the +table. Combining similar entries for H1 in the table leads to the following unsafe +control actions by the leg controller with respect to the instability hazard: +1. The leg controller does not command a deployment of the stabilizer legs before +the arm is extended. +2. The leg controller commands a retraction of the stabilizer legs before the +manipulator arm is fully stowed. +3. The leg controller commands a retraction of the stabilizer legs after the arm +has been extended or commands a retraction of the stabilizer legs before the +manipulator arm is stowed. 
+ + +4. The leg controller stops extension of the stabilizer legs before they are fully +extended. +and by the arm controller: +1. The arm controller extends the manipulator arm when the stabilizer legs are +not extended or before they are fully extended. +The inadequate control actions can be restated as system safety constraints on the +controller behavior (whether the controller is automated or human): +1. The leg controller must ensure the stabilizer legs are fully extended before arm +movements are enabled. +2. The leg controller must not command a retraction of the stabilizer legs when +the manipulator arm is not in a fully stowed position. +3. The leg controller must command a deployment of the stabilizer legs before +arm movements are enabled; the leg controller must not command a retraction +of the stabilizer legs before the manipulator arm is stowed. +4. The leg controller must not stop the leg extension until the legs are fully +extended. +Similar constraints will be identified for all hazardous commands: for example, the +arm controller must not extend the manipulator arm before the stabilizer legs are +fully extended. +These system safety constraints might be enforced through physical interlocks, +human procedures, and so on. Performing STPA step 2 will provide information +during detailed design (1) to evaluate and compare the different design choices, +(2) to design the controllers and design fault tolerance features for the system, and +(3) to guide the test and verification procedures (or training for humans). As design +decisions and safety constraints are identified, the functional specifications for the +controllers can be created. +To produce detailed scenarios for the violation of safety constraints, the control +structure is augmented with process models. The preliminary design of the process +models comes from the information necessary to ensure the system safety con- +straints hold. For example, the constraint that the arm controller must not enable +manipulator movement before the stabilizer legs are completely extended implies +there must be some type of feedback to the arm controller to determine when the +leg extension has been completed. +While a preliminary functional decomposition of the system components is +created to start the process, as more information is obtained from the hazard analy- +sis and the system design continues, this decomposition may be altered to optimize +fault tolerance and communication requirements. For example, at this point the need + + +for the process models of the leg and arm controllers to be consistent and the com- +munication required to achieve this goal may lead the designers to decide to combine +the leg and arm controllers (figure 9.5). +Causal factors for the stability hazard being violated can be determined using +STPA step 2. Feedback about the position of the legs is clearly critical to ensure +that the process model of the state of the stabilizer legs is consistent with the actual +state. The movement and arm controller cannot assume the legs are extended simply +because a command was issued to extend them. The command may not be executed +or may only be executed partly. One possible scenario, for example, involves an +external object preventing the complete extension of the stabilizer legs. In that case, +the robot controller (either human or automated) may assume the stabilizer legs +are extended because the extension motors have been powered up (a common type +of design error). 
Subsequent movement of the manipulator arm would then violate +the identified safety constraints. Just as the analysis assists in refining the component +safety constraints (functional requirements), the causal analysis can be used to + + +further refine those requirements and to design the control algorithm, the control +loop components, and the feedback necessary to implement them. +Many of the causes of inadequate control actions are so common that they can +be restated as general design principles for safety-critical control loops. The require- +ment for feedback about whether a command has been executed in the previous +paragraph is one of these. The rest of this chapter presents those general design +principles. + +section 9.3. +Designing for Safety. +Hazard analysis using STPA will identify application-specific safety design con- +straints that must be enforced by the control algorithm. For the thermal-tile process- +ing robot, a safety constraint identified above is that the manipulator arm must +never be extended if the stabilizer legs are not fully extended. Causal analysis (step +2 of STPA) can identify specific causes for the constraint to be violated and design +features can be created to eliminate or control them. +More general principles of safe control algorithm functional design can also be +identified by using the general causes of accidents as defined in STAMP (and used +in STPA step 2), general engineering principles, and common design flaws that have +led to accidents in the past. +Accidents related to software or system logic design often result from incom- +pleteness and unhandled cases in the functional design of the controller. This incom- +pleteness can be considered a requirements or functional design problem. Some +requirements completeness criteria were identified in Safeware and specified using +a state machine model. Here those criteria plus additional design criteria are trans- +lated into functional design principles for the components of the control loop. +In STAMP, accidents are caused by inadequate control. The controllers can be +human or physical. This section focuses on design principles for the components of +the control loop that are important whether a human is in the loop or not. Section +9.4 describes extra safety-related design principles that apply for systems that +include human controllers. We cannot “design” human controllers, but we can design +the environment or context in which they operate, and we can design the procedures +they use, the control loops in which they operate, the processes they control, and +the training they receive. + +section 9.3.1. Controlled Process and Physical Component Design. +Protection against component failure accidents is well understood in engineering. +Principles for safe design of common hardware systems (including sensors and +actuators) with standard safety constraints are often systematized and encoded in +checklists for an industry, such as mechanical design or electrical design. In addition, + + +most engineers have learned about the use of redundancy and overdesign (safety +margins) to protect against component failures. +These standard design techniques are still relevant today but provide little or no +protection against component interaction accidents. The added complexity of redun- +dant designs may even increase the occurrence of these accidents. Figure 9.6 shows +the design precedence described in Safeware. The highest precedence is to eliminate +the hazard. 
If the hazard cannot be eliminated, then its likelihood of occurrence +should be reduced, the likelihood of it leading to an accident should be reduced +and, at the lowest precedence, the design should reduce the potential damage +incurred. Clearly, the higher the precedence level, the more effective and less costly +will be the safety design effort. As there is little that is new here that derives from +using the STAMP causality model, the reader is referred to Safeware and standard +engineering references for more information. + + +section 9.3.2. Functional Design of the Control Algorithm. +Design for safety includes more than simply the physical components but also the +control components. We start by considering the design of the control algorithm. +The controller algorithm is responsible for processing inputs and feedback, initial- +izing and updating the process model, and using the process model plus other knowl- +edge and inputs to produce control outputs. Each of these is considered in turn. +Designing and Processing Inputs and Feedback +The basic function of the algorithm is to implement a feedback control loop, as +defined by the controller responsibilities, along with appropriate checks to detect +internal or external failures or errors. +Feedback is critical for safe control. Without feedback, controllers do not know +whether their control actions were received and performed properly or whether + + +The controller must be designed to respond appropriately to the arrival of any +possible (i.e., detectable by the sensors) input at any time as well as the lack of an +expected input over a given time period. Humans are better (and more flexible) +than automated controllers at this task. Often automation is not designed to handle +input arriving unexpectedly, for example, a target detection report from a radar that +was previously sent a message to shut down. +All inputs should be checked for out-of-range or unexpected values and a +response designed into the control algorithm. A surprising number of losses still +occur due to software not being programmed to handle unexpected inputs. +In addition, the time bounds (minimum and maximum) for every input should +be checked and appropriate behavior provided in case the input does not arrive +within these bounds. There should also be a response for the non-arrival of an input +within a given amount of time (a timeout) for every variable in the process model. +The controller must also be designed to respond to excessive inputs (overload condi- +tions) in a safe way. +Because sensors and input channels can fail, there should be a minimum-arrival- +rate check for each physically distinct communication path, and the controller +should have the ability to query its environment with respect to inactivity over a +given communication path. Traditionally these queries are called sanity or health +checks. Care needs to be taken, however, to ensure that the design of the response +to a health check is distinct from the normal inputs and that potential hardware +failures cannot impact the sanity checks. As an example of the latter, in June 1980 +warnings were received at the U.S. command and control headquarters that a major +nuclear attack had been launched against the United States [180]. The military +prepared for retaliation, but the officers at command headquarters were able to +ascertain from direct contact with warning sensors that no incoming missile had +been detected and the alert was canceled. 
Three days later, the same thing happened again. The false alerts were caused by the failure of a computer chip in a multiplexor system that formats messages sent out continuously to command posts indicating that communication circuits are operating properly. This health check message was designed to report that there were 000 ICBMs and 000 SLBMs detected. Instead, the integrated circuit failure caused some of the zeros to be replaced with twos. After the problem was diagnosed, the message formats were changed to report only the status of the communication system and nothing about detecting ballistic missiles. Most likely, the developers thought it would be easier to have one common message format but did not consider the impact of erroneous hardware behavior.
STAMP identifies inconsistency between the process model and the actual system state as a common cause of accidents. Besides incorrect feedback, as in the early warning system example, a common way for the process model to become inconsistent with the state of the actual process is for the controller to assume that an output command has been executed when it has not. The TTPS controller, for example, assumes that because it has sent a command to extend the stabilizer legs, the legs will, after a suitable amount of time, be extended. If commands cannot be executed for any reason, including timeouts, controllers have to know about it. To detect errors and failures in the actuators or controlled process, there should be an input (feedback) that the controller can use to detect the effect of any output on the process.
This feedback, however, should not simply be an indication that the command arrived at the controlled process (for example, that the command to open a valve was received by the valve); it should indicate that the valve actually opened. An explosion occurred in a U.S. Air Force system due to overpressurization when a relief valve failed to open after the operator sent a command to open it. Both the position indicator light and the open indicator light were illuminated on the control board. Believing the primary valve had opened, the operator did not open the secondary valve, which was to be used if the primary valve failed. A post-accident examination discovered that the indicator light circuit was wired to indicate the presence of a signal at the valve, but it did not indicate valve position. The indicator therefore showed only that the activation button had been pushed, not that the valve had opened. An extensive quantitative safety analysis of this design had assumed a low probability of simultaneous failure of the two relief valves, but it ignored the possibility of a design error in the electrical wiring; the probability of the design error was not quantifiable. Many other accidents have involved a similar design flaw, including Three Mile Island.
When the feedback associated with an output is received, the controller must be able to handle the normal response as well as deal with feedback that is missing, too late, too early, or that has an unexpected value.

Initializing and Updating the Process Model.
Because the process model is used by the controller to determine what control commands to issue and when, the accuracy of the process model with respect to the controlled process is critical. As noted earlier, many software-related losses have resulted from such inconsistencies.
STPA will identify which process model variables +are critical to safety; the controller design must ensure that the controller receives +and processes updates for these variables in a timely manner. +Sometimes normal updating of the process model is done correctly by the con- +troller, but problems arise in initialization at startup and after a temporary shut- +down. The process model must reflect the actual process state at initial startup and +after a restart. It seems to be common, judging from the number of incidents +and accidents that have resulted, for software designers to forget that the world + + +continues to change even though the software may not be operating. When the +computer controlling a process is temporarily shut down, perhaps for maintenance +or updating of the software, it may restart with the assumption that the controlled +process is still in the state it was when the software was last operating. In addition, +assumptions may be made about when the operation of the controller will be started, +which may be violated. For example, an assumption may be made that a particular +aircraft system will be powered up and initialized before takeoff and appropriate +default values used in the process model for that case. In the event it was not started +at that time or was shut down and then restarted after takeoff, the default startup +values in the process model may not apply and may be hazardous. +Consider the mobile tile-processing robot at the beginning of this chapter. The +mobile base may be designed to allow manually retracting the stabilizer legs if an +emergency occurs while the robot is servicing the tiles and the robot must be physi- +cally moved out of the way. When the robot is restarted, the controller may assume + + +that the stabilizer legs are still extended and arm movements may be commanded +that would violate the safety constraints. +The use of an unknown value can assist in protecting against this type of design +flaw. At startup and after temporary shutdown, process variables that reflect the +state of the controlled process should be initialized with the value unknown and +updated when new feedback arrives. This procedure will result in resynchronizing +the process model and the controlled process state. The control algorithm must also +account, of course, for the proper behavior in case it needs to use a process model +variable that has the unknown value. +Just as timeouts must be specified and handled for basic input processing as +described earlier, the maximum time the controller waits until the first input after +startup needs to be determined and what to do if this time limit is violated. Once +again, while human controllers will likely detect such a problem eventually, such as +a failed input channel or one that was not restarted on system startup, computers +will patiently wait forever if they are not given instructions to detect such a timeout +and to respond to it. +In general, the system and control loop should start in a safe state. Interlocks may +need to be initialized or checked to be operational at system startup, including +startup after temporarily overriding the interlocks. +Finally the behavior of the controller with respect to input received before +startup, after shutdown, or while the controller is temporarily disconnected from the +process (offline) must be considered and it must be determined if this information +can be safely ignored or how it will be stored and later processed if it cannot. 
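A minimal sketch of the unknown-value technique and the startup timeout described above follows. It is illustrative only, written in Python, and the variable names (stabilizer_legs, arm_position) are hypothetical labels borrowed from the tile-servicing robot discussion, not part of any actual design.

    import time

    UNKNOWN = object()   # sentinel meaning "no trustworthy value yet"

    class ProcessModel:
        # Illustrative only: every variable starts as UNKNOWN at startup or restart,
        # and variables never refreshed within the startup window are reported.

        def __init__(self, variables, startup_timeout):
            self.deadline = time.monotonic() + startup_timeout
            self.values = {name: UNKNOWN for name in variables}

        def update(self, name, value):
            self.values[name] = value      # called when fresh feedback arrives

        def get(self, name):
            return self.values[name]

        def startup_check(self):
            # After the startup window, any variable still UNKNOWN needs explicit
            # handling: query the sensors, alert the operator, or refuse commands
            # that depend on it.
            if time.monotonic() > self.deadline:
                return [n for n, v in self.values.items() if v is UNKNOWN]
            return []

    # Usage sketch: do not command arm movement until the leg state is known.
    model = ProcessModel(["stabilizer_legs", "arm_position"], startup_timeout=5.0)
    if model.get("stabilizer_legs") is UNKNOWN:
        pass   # request fresh feedback instead of assuming the legs are extended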
One factor in the loss of an aircraft that took off from the wrong runway at Lexington Airport, for example, is that information about temporary changes in the airport taxiways was not reflected in the airport maps provided to the crew. The information about the changes, which was sent by the National Flight Data Center, was received by the map-provider computers at a time when they were not online, leading to airport charts that did not match the actual state of the airport. The document control system software used by the map provider was designed to make reports only of information received during business hours, Monday through Friday [142].

Producing Outputs.
The primary responsibility of the process controller is to produce commands to fulfill its control responsibilities. Again, the STPA hazard analysis and safety-guided design process will produce the application-specific behavioral safety requirements and constraints on controller behavior to ensure safety. But some general guidelines are also useful.

One general safety constraint is that the behavior of an automated controller should be deterministic: it should exhibit only one behavior for arrival of any input in a particular state. While it is easy to design software with nondeterministic behavior, and such behavior may in some cases have advantages from a software point of view, nondeterministic behavior makes testing more difficult and, more important, makes it much more difficult for humans to learn how an automated system works and to monitor it. If humans are expected to control or monitor an automated system or an automated controller, then the behavior of the automation should be deterministic.

Just as inputs can arrive faster than they can be processed by the controller, the absorption rate of the actuators and recipients of output from the controller must be considered. Again, the problem usually arises when a fast output device (such as a computer) is providing input to a slower device, such as a human. Contingency action must be designed for the case in which the output absorption rate limit is exceeded.

Three additional general considerations in the safe design of controllers are data age, latency, and fault handling.

Data age: No inputs or output commands are valid forever. The control loop design must account for inputs that are no longer valid and should not be used by the controller and for outputs that cannot be executed immediately. All inputs used in the generation of output commands must be properly limited in the time they are used and marked as obsolete once that time limit has been exceeded. At the same time, the design of the control loop must account for outputs that are not executed within a given amount of time. As an example of what can happen when data age is not properly handled in the design, an engineer working in the cockpit of a B-1A aircraft issued a close weapons bay door command during a test. At the time, a mechanic working on the door had activated a mechanical inhibit on it. The close door command was not executed, but it remained active. Several hours later, when the door maintenance was completed, the mechanical inhibit was removed. The door closed unexpectedly, killing the worker [64].

Latency: Latency is the time interval during which receipt of new information cannot change an output even though it arrives prior to the output. While latency time can be reduced by using various types of design techniques, it cannot be eliminated completely.
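The data-age constraint can be sketched using the weapons bay door scenario. The fragment below is not the actual system design; it is a minimal Python illustration with hypothetical names (door_actuator, alarm) of an output command that expires instead of remaining active indefinitely.

    import time

    class TimedCommand:
        # Illustrative only: every output command carries an expiration time.

        def __init__(self, name, max_age):
            self.name = name
            self.issued = time.monotonic()
            self.max_age = max_age          # seconds the command remains valid

        def is_stale(self):
            return time.monotonic() - self.issued > self.max_age

    def try_execute(command, inhibit_active, door_actuator, alarm):
        # Called when the actuator becomes available (e.g., an inhibit is removed).
        if command.is_stale():
            # A blocked command does not execute hours later; it must be reissued.
            alarm.notify(f"'{command.name}' expired before it could be executed")
        elif not inhibit_active:
            door_actuator.send(command.name)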
Controllers need to be informed about the arrival of feedback +affecting previously issued commands and, if possible, provided with the ability to +undo or to mitigate the effects of the now unwanted command. +Fault-handling: Most accidents involve off-nominal processing modes, including +startup and shutdown and fault handling. The design of the control loop should assist +the controller in handling these modes and the designers need to focus particular +attention on them. +The system design may allow for performance degradation and may be designed +to fail into safe states or to allow partial shutdown and restart. Any fail-safe behavior +that occurs in the process should be reported to the controller. In some cases, auto- +mated systems have been designed to fail so gracefully that human controllers +are not aware of what is going on until they need to take control and may not be +prepared to do so. Also, hysteresis needs to be provided in the control algorithm +for transitions between off-nominal and nominal processing modes to avoid ping- +ponging when the conditions that caused the controlled process to leave the normal +state still exist or recur. +Hazardous functions have special requirements. Clearly, interlock failures should +result in the halting of the functions they are protecting. In addition, the control +algorithm design may differ after failures are detected, depending on whether the +controller outputs are hazard-reducing or hazard-increasing. A hazard-increasing +output is one that moves the controlled process to a more hazardous state, for +example, arming a weapon. A hazard-reducing output is a command that leads to a + + +reduced risk state, for example, safing a weapon or any other command whose +purpose is to maintain safety. +If a failure in the control loop, such as a sensor or actuator, could inhibit the +production of a hazard-reducing command, there should be multiple ways to trigger +such commands. On the other hand, multiple inputs should be required to trigger +commands that can lead to hazardous states so they are not inadvertently issued. +Any failure should inhibit the production of a hazard-increasing command. As an +example of the latter condition, loss of the ability of the controller to receive input, +such as failure of a sensor, that might inhibit the production of a hazardous output +should prevent such an output from being issued. +section 9.4. +Special Considerations in Designing for Human Controllers. +The design principles in section 9.3 apply when the controller is automated or +human, particularly when designing procedures for human controllers to follow. But +humans do not always follow procedures, nor should they. We use humans to control +systems because of their flexibility and adaptability to changing conditions and to +the incorrect assumptions made by the designers. Human error is an inevitable and +unavoidable consequence. But appropriate design can assist in reducing human +error and increasing safety in human-controlled systems. +Human error is not random. It results from basic human mental abilities and +physical skills combined with the features of the tools being used, the tasks assigned, +and the operating environment. We can use what is known about human mental +abilities and design the other aspects of the system—the tools, the tasks, and the +operating environment—to reduce and control human error to a significant degree. +The previous section described general principles for safe design. 
This section +focuses on additional design principles that apply when humans control, either +directly or indirectly, safety-critical systems. + +section 9.4.1. Easy but Ineffective Approaches. +One simple solution for engineers is to simply use human factors checklists. While +many such checklists exist, they often do not distinguish among the qualities they +enhance, which may not be related to safety and may even conflict with safety. The +only way such universal guidelines could be useful is if all design qualities were +complementary and achieved in exactly the same way, which is not the case. Quali- +ties are conflicting and require design tradeoffs and decisions about priorities. +Usability and safety, in particular, are often conflicting; an interface that is easy +to use may not necessarily be safe. As an example, a common guideline is to ensure +that a user must enter data only once and that the computer can access that data if + + +needed later for the same task or for different tasks [192]. Duplicate entry, however, +is required for the computer to detect entry errors unless the errors are so extreme +that they violate reasonableness criteria. A small slip usually cannot be detected +and such entry errors have led to many accidents. Multiple entry of critical data can +prevent such losses. +As another example, a design that involves displaying data or instructions on a +screen for an operator to check and verify by pressing the enter button minimizes +the typing an operator must do. Over time, however, and after few errors are +detected, operators will get in the habit of pressing the enter key multiple times in +rapid succession. This design feature has been implicated in many losses. For example, +the Therac-25 was a linear accelerator that overdosed multiple patients during radia- +tion therapy. In the original Therac-25 design, operators were required to enter the +treatment parameters at the treatment site as well as on the computer console. After +the operators complained about the duplication, the parameters entered at the +treatment site were instead displayed on the console and the operator needed only +to press the return key if they were correct. Operators soon became accustomed to +pushing the return key quickly the required number of times without checking the +parameters carefully. +The second easy but not very effective solution is to write procedures for human +operators to follow and then assume the engineering job is done. Enforcing the +following of procedures is unlikely, however, to lead to a high level of safety. +Dekker notes what he called the “Following Procedures Dilemma” [50]. Opera- +tors must balance between adapting procedures in the face of unanticipated con- +ditions versus sticking to procedures rigidly when cues suggest they should be +adapted. If human controllers choose the former, that is, they adapt procedures +when it appears the procedures are wrong, a loss may result when the human con- +troller does not have complete knowledge of the circumstances or system state. In +this case, the humans will be blamed for deviations and nonadherence to the pro- +cedures. On the other hand, if they stick to procedures (the control algorithm pro- +vided) rigidly when the procedures turn out to be wrong, they will be blamed for +their inflexibility and the application of the rules in the wrong context. Hindsight +bias is often involved in identifying what the operator should have known and done. 
+Insisting that operators always follow procedures does not guarantee safety +although it does usually guarantee that there is someone to blame—either for fol- +lowing the procedures or for not following them—when things go wrong. Safety +comes from controllers being skillful in judging when and how procedures apply. As +discussed in chapter 12, organizations need to monitor adherence to procedures not +simply to enforce compliance but to understand how and why the gap between +procedures and practice grows and to use that information to redesign both the +system and the procedures [50]. + + +Section 8.5 of chapter 8 describes important differences between human and +automated controllers. One of these differences is that the control algorithm used +by humans is dynamic. This dynamic aspect of human control is why humans are +kept in systems. They provide the flexibility to deviate from procedures when it turns +out the assumptions underlying the engineering design are wrong. But with this +flexibility comes the possibility of unsafe changes in the dynamic control algorithm +and raises new design requirements for engineers and system designers to under- +stand the reason for such unsafe changes and prevent them through appropriate +system design. +Just as engineers have the responsibility to understand the hazards in the physical +systems they are designing and to control and mitigate them, engineers also must +understand how their system designs can lead to human error and how they can +design to reduce errors. +Designing to prevent human error requires some basic understanding about the +role humans play in systems and about human error. + +section 9.4.2. The Role of Humans in Control Systems. +Humans can play a variety of roles in a control system. In the simplest cases, they +create the control commands and apply them directly to the controlled process. For +a variety of reasons, particularly speed and efficiency, the system may be designed +with a computer between the human controller and the system. The computer may +exist only in the feedback loop to process and present data to the human operator. +In other systems, the computer actually issues the control instructions with the +human operator either providing high-level supervision of the computer or simply +monitoring the computer to detect errors or problems. +An unanswered question is what is the best role for humans in safety-critical +process control. There are three choices beyond direct control: the human can +monitor an automated control system, the human can act as a backup to the auto- +mation, or the human and automation can both participate in the control through +some type of partnership. These choices are discussed in depth in Safeware and are +only summarized here. +Unfortunately for the first option, humans make very poor monitors. They cannot +sit and watch something without active control duties for any length of time and +maintain vigilance. Tasks that require little active operator behavior may result in +lowered alertness and can lead to complacency and overreliance on the automation. +Complacency and lowered vigilance are exacerbated by the high reliability and low +failure rate of automated systems. 
+But even if humans could remain vigilant while simply sitting and monitoring a +computer that is performing the control tasks (and usually doing the right thing), +Bainbridge has noted the irony that automatic control systems are installed because + + +they can do the job better than humans, but then humans are assigned the task of +monitoring the automated system [14]. Two questions arise: +1. The human monitor needs to know what the correct behavior of the controlled +or monitored process should be; however, in complex modes of operation—for +example, where the variables in the process have to follow a particular trajec- +tory over time—evaluating whether the automated control system is perform- +ing correctly requires special displays and information that may only be +available from the automated system being monitored. How will human moni- +tors know when the computer is wrong if the only information they have comes +from that computer? In addition, the information provided by an automated +controller is more indirect, which may make it harder for humans to get a clear +picture of the system: Failures may be silent or masked by the automation. +2. If the decisions can be specified fully, then a computer can make them more +quickly and accurately than a human. How can humans monitor such a system? +Whitfield and Ord found that, for example, air traffic controllers’ appreciation +of the traffic situation was reduced at the high traffic levels made feasible by +using computers [198]. In such circumstances, humans must monitor the auto- +mated controller at some metalevel, deciding whether the computer’s deci- +sions are acceptable rather than completely correct. In case of a disagreement, +should the human or the computer be the final arbiter? +Employing humans as backups is equally ineffective. Controllers need to have accu- +rate process models to control effectively, but not being in active control leads to a +degradation of their process models. At the time they need to intervene, it may take +a while to “get their bearings”—in other words, to update their process models so +that effective and safe control commands can be given. In addition, controllers need +both manual and cognitive skills, but both of these decline in the absence of practice. +If human backups need to take over control from automated systems, they may be +unable to do so effectively and safely. Computers are often introduced into safety- +critical control loops because they increase system reliability, but at the same time, +that high reliability can provide little opportunity for human controllers to practice +and maintain the skills and knowledge required to intervene when problems +do occur. +It appears, at least for now, that humans will have to provide direct control or +will have to share control with automation unless adequate confidence can be estab- +lished in the automation to justify eliminating monitors completely. Few systems +exist today where such confidence can be achieved when safety is at stake. The +problem then becomes one of finding the correct partnership and allocation of tasks +between humans and computers. Unfortunately, this problem has not been solved, +although some guidelines are presented later. + + +One of the things that make the problem difficult is that it is not just a matter of +splitting responsibilities. Computer control is changing the cognitive demands on +human controllers. 
Humans are increasingly supervising a computer rather than +directly monitoring the process, leading to more cognitively complex decision +making. Automation logic complexity and the proliferation of control modes are +confusing humans. In addition, whenever there are multiple controllers, the require- +ments for cooperation and communication are increased, not only between the +human and the computer but also between humans interacting with the same com- +puter, for example, the need for coordination among multiple people making entries +to the computer. The consequences can be increased memory demands, new skill +and knowledge requirements, and new difficulties in the updating of the human’s +process models. +A basic question that must be answered and implemented in the design is who +will have the final authority if the human and computers disagree about the proper +control actions. In the loss of an Airbus 320 while landing at Warsaw in 1993, one +of the factors was that the automated system prevented the pilots from activating +the braking system until it was too late to prevent crashing into a bank built at the +end of the runway. This automation feature was a protection device included to +prevent the reverse thrusters accidentally being deployed in flight, a presumed cause +of a previous accident. For a variety of reasons, including water on the runway +causing the aircraft wheels to hydroplane, the criteria used by the software logic to +determine that the aircraft had landed were not satisfied by the feedback received +by the automation [133]. Other incidents have occurred where the pilots have been +confused about who is in control, the pilot or the automation, and found themselves +fighting the automation [181]. +One common design mistake is to set a goal of automating everything and then +leaving some miscellaneous tasks that are difficult to automate for the human con- +trollers to perform. The result is that the operator is left with an arbitrary collection +of tasks for which little thought was given to providing support, particularly support +for maintaining accurate process models. The remaining tasks may, as a consequence, +be significantly more complex and error-prone. New tasks may be added, such as +maintenance and monitoring, that introduce new types of errors. Partial automation, +in fact, may not reduce operator workload but merely change the type of demands +on the operator, leading to potentially increased workload. For example, cockpit +automation may increase the demands on the pilots by creating a lot of data entry +tasks during approach when there is already a lot to do. These automation interac- +tion tasks also create “heads down” work at a time when increased monitoring of +nearby traffic is necessary. +By taking away the easy parts of the operator’s job, automation may make the +more difficult ones even harder [14]. One causal factor here is that taking away or + + +changing some operator tasks may make it difficult or even impossible for the opera- +tors to receive the feedback necessary to maintain accurate process models. +When designing the automation, these factors need to be considered. A basic +design principle is that automation should be designed to augment human abilities, +not replace them, that is, to aid the operator, not to take over. +To design safe automated controllers with humans in the loop, designers need +some basic knowledge about human error related to control tasks. 
In fact, Rasmus- +sen has suggested that the term human error be replaced by considering such events +as human–task mismatches. + +section 9.4.3. Human Error Fundamentals. +Human error can be divided into the general categories of slips and mistakes [143, +144]. Basic to the difference is the concept of intention or desired action. A mistake +is an error in the intention, that is, an error that occurs during the planning of an +action. A slip, on the other hand, is an error in carrying out the intention. As an +example, suppose an operator decides to push button A. If the operator instead +pushes button B, then it would be called a slip because the action did not match the +intention. If the operator pushed A (carries out the intention correctly), but it turns +out that the intention was wrong, that is, button A should not have been pushed, +then this is called a mistake. +Designing to prevent slips involves applying different principles than designing +to prevent mistakes. For example, making controls look very different or placing +them far apart from each other may reduce slips, but not mistakes. In general, design- +ing to reduce mistakes is more difficult than reducing slips, which is relatively +straightforward. +One of the difficulties in eliminating planning errors or mistakes is that such +errors are often only visible in hindsight. With the information available at the +time, the decisions may seem reasonable. In addition, planning errors are a neces- +sary side effect of human problem-solving ability. Completely eliminating mistakes +or planning errors (if possible) would also eliminate the need for humans as +controllers. +Planning errors arise from the basic human cognitive ability to solve problems. +Human error in one situation is human ingenuity in another. Human problem +solving rests on several unique human capabilities, one of which is the ability to +create hypotheses and to test them and thus create new solutions to problems not +previously considered. These hypotheses, however, may be wrong. Rasmussen has +suggested that human error is often simply unsuccessful experiments in an unkind +environment, where an unkind environment is defined as one in which it is not pos- +sible for the human to correct the effects of inappropriate variations in performance + + + +before they lead to unacceptable consequences [166]. He concludes that human +performance is a balance between a desire to optimize skills and a willingness to +accept the risk of exploratory acts. +A second basic human approach to problem solving is to try solutions that +worked in other circumstances for similar problems. Once again, this approach is +not always successful but the inapplicability of old solutions or plans (learned pro- +cedures) may not be determinable without the benefit of hindsight. +The ability to use these problem-solving methods provides the advantages of +human controllers over automated controllers, but success is not assured. Designers, +if they understand the limitations of human problem solving, can provide assistance +in the design to avoid common pitfalls and enhance human problem solving. For +example, they may provide ways for operators to obtain extra information or to +test hypotheses safely. At the same time, there are some additional basic human +cognitive characteristics that must be considered. +Hypothesis testing can be described in terms of basic feedback control concepts. 
+Using the information in the process model, the controller generates a hypothesis +about the controlled process. A test composed of control actions is created to gener- +ate feedback useful in evaluating the hypothesis, which in turn is used to update the +process model and the hypothesis. +When controllers have no accurate diagnosis of a problem, they must make pro- +visional assessments of what is going on based on uncertain, incomplete, and often +contradictory information [50]. That provisional assessment will guide their infor- +mation gathering, but it may also lead to over attention to confirmatory evidence +when processing feedback and updating process models while, at the same time, +discounting information that contradicts their current diagnosis. Psychologists call +this phenomenon cognitive fixation. The alternative is called thematic vagabonding, +where the controller jumps around from explanation to explanation, driven by the +loudest or latest feedback or alarm and never develops a coherent assessment of +what is going on. Only hindsight can determine whether the controller should have +abandoned one explanation for another: Sticking to one assessment can lead to +more progress in many situations than jumping around and not pursuing a consistent +planning process. +Plan continuation is another characteristic of human problem solving related to +cognitive fixation. Commitment to a preliminary diagnosis can lead to sticking with +the original plan even though the situation has changed and calls for a different +plan. Orisanu [149] notes that early cues that suggest an initial plan is correct are +usually very strong and unambiguous, helping to convince people to continue +the plan. Later feedback that suggests the plan should be abandoned is typically +more ambiguous and weaker. Conditions may deteriorate gradually. Even when + + +controllers receive and acknowledge this feedback, the new information may not +change their plan, especially if abandoning the plan is costly in terms of organiza- +tional and economic consequences. In the latter case, it is not surprising that control- +lers will seek and focus on confirmatory evidence and will need a lot of contradictory +evidence to justify changing their plan. +Cognitive fixation and plan continuation are compounded by stress and fatigue. +These two factors make it more difficult for controllers to juggle multiple hypoth- +eses about a problem or to project a situation into the future by mentally simulating +the effects of alternative plans [50]. +Automated tools can be designed to assist the controller in planning and decision +making, but they must embody an understanding of these basic cognitive limitations +and assist human controllers in overcoming them. At the same time, care must be +taken that any simulation or other planning tools to assist human problem solving +do not rest on the same incorrect assumptions about the system that led to the +problems in the first place. +Another useful distinction is between errors of omission and errors of commis- +sion. Sarter and Woods [181] note that in older, less complex aircraft cockpits, most +pilot errors were errors of commission that occurred as a result of a pilot control +action. Because the controller, in this case the pilot, took a direct action, he or she +is likely to check that the intended effect of the action has actually occurred. The +short feedback loops allow the operators to repair most errors before serious +consequences result. 
This type of error is still the prevalent one for relatively simple devices.
In contrast, studies of more advanced automation in aircraft find that errors of omission are the dominant form of error [181]. Here the controller does not implement a control action that is required. The operator may not notice that the automation has done something because that automation behavior was not explicitly invoked by an operator action. Because the behavioral changes are not expected, the human controller is less likely to pay attention to relevant indications and feedback, particularly during periods of high workload.

Errors of omission are related to the change of human roles in systems from direct controllers to monitors, exception handlers, and supervisors of automated controllers. As their roles change, the cognitive demands may not be reduced but instead may change in their basic nature. The changes tend to be most evident at high-tempo and high-criticality periods. So while some types of human errors have declined, new types of errors have been introduced.

The difficulty and perhaps impossibility of eliminating human error does not mean that greatly improved system design in this respect is not possible. System design can be used to take advantage of human cognitive capabilities and to minimize the errors that may result from them. The rest of the chapter provides some principles to create designs that better support humans in controlling safety-critical processes and reduce human errors.

section 9.4.4. Providing Control Options.
If the system design goal is to make humans responsible for safety in control systems, then they must have adequate flexibility to cope with undesired and unsafe behavior and not be constrained by inadequate control options. Three general design principles apply: design for redundancy, design for incremental control, and design for error tolerance.

Design for redundant paths: One helpful design feature is to provide multiple physical devices and logical paths to ensure that a single hardware failure or software error cannot prevent the operator from taking action to maintain a safe system state and avoid hazards. There should also be multiple ways to change from an unsafe state to a safe state, but only one way to change from a safe state to an unsafe state.

Design for incremental control: Incremental control makes a system easier to control, both for humans and computers, by performing critical steps incrementally rather than in one control action. The common use of incremental arm, aim, fire sequences is an example. The controller should have the ability to observe the system and get feedback to test the validity of the assumptions and models upon which the decisions are made. The system design should also provide the controller with compensating control actions to allow modifying or aborting previous control actions before significant damage is done. An important consideration in designing for controllability in general is to lower the time pressures on the controllers, if possible.

The design of incremental control algorithms can become complex when a human controller is controlling a computer, which is controlling the actual physical process, in a stressful and busy environment, such as a military aircraft.
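A minimal sketch of an incremental control sequence of the arm, aim, fire kind described above is given below. It is illustrative only, written in Python, and the weapon interface (do(), undo()) is hypothetical; the point is that each hazardous step requires its own explicit action and that a compensating abort is available before the sequence completes.

    class FireSequence:
        # Illustrative only: each step needs a separate operator action, and the
        # sequence can be aborted (and reversed) at any point before completion.

        STEPS = ["arm", "aim", "fire"]

        def __init__(self, weapon):
            self.weapon = weapon            # hypothetical interface: do(), undo()
            self.completed = []

        def request(self, step):
            if len(self.completed) == len(self.STEPS):
                raise RuntimeError("sequence already complete")
            expected = self.STEPS[len(self.completed)]
            if step != expected:
                raise ValueError(f"out of sequence: expected '{expected}', got '{step}'")
            self.weapon.do(step)            # one incremental action; the controller
            self.completed.append(step)     # then waits for feedback and a new input

        def abort(self):
            # Compensating control action: reverse whatever has been done so far.
            for step in reversed(self.completed):
                self.weapon.undo(step)
            self.completed = []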
If one of the com- +mands in an incremental control sequence cannot be executed within a specified +period of time, the human operator needs to be informed about any delay or post- +ponement or the entire sequence should be canceled and the operator informed. At +the same time, interrupting the pilot with a lot of messages that may not be critical +at a busy time could also be dangerous. Careful analysis is required to determine +when multistep controller inputs can be preempted or interrupted before they are +complete and when feedback should occur that this happened [90]. +Design for error tolerance: Rasmussen notes that people make errors all the time, +but we are able to detect and correct them before adverse consequences occur [165]. +System design can limit people’s ability to detect and recover from their errors. He +defined a system design goal of error tolerant systems. In these systems, errors are +observable (within an appropriate time limit) and they are reversible before unac- +ceptable consequences occur. The same applies to computer errors: they should be +observable and reversible. +The general goal is to allow controllers to monitor their own performance. To +achieve this goal, the system design needs to: +1. Help operators monitor their actions and recover from errors. +2. Provide feedback about actions operators took and their effects, in case the +actions were inadvertent. Common examples are echoing back operator inputs +or requiring confirmation of intent. +3. Allow for recovery from erroneous actions. The system should provide control +options, such as compensating or reversing actions, and enough time for recov- +ery actions to be taken before adverse consequences result. + + +Incremental control, as described earlier, is a type of error-tolerant design +technique. +section 9.4.5. Matching Tasks to Human Characteristics. +In general, the designer should tailor systems to human requirements instead of +the opposite. Engineered systems are easier to change in their behavior than are +humans. +Because humans without direct control tasks will lose vigilance, the design +should combat lack of alertness by designing human tasks to be stimulating and +varied, to provide good feedback, and to require active involvement of the human +controllers in most operations. Maintaining manual involvement is important, not +just for alertness but also in getting the information needed to update process +models. + + +Maintaining active engagement in the tasks means that designers must distin- +guish between providing help to human controllers and taking over. The human +tasks should not be oversimplified and tasks involving passive or repetitive actions +should be minimized. Allowing latitude in how tasks are accomplished will not only +reduce monotony and error proneness, but can introduce flexibility to assist opera- +tors in improvising when a problem cannot be solved by only a limited set of behav- +iors. Many accidents have been avoided when operators jury-rigged devices or +improvised procedures to cope with unexpected events. Physical failures may cause +some paths to become nonfunctional and flexibility in achieving goals can provide +alternatives. +Designs should also be avoided that require or encourage management by excep- +tion, which occurs when controllers wait for alarm signals before taking action. +Management by exception does not allow controllers to prevent disturbances by +looking for early warnings and trends in the process state. 
For operators to anticipate +undesired events, they need to continuously update their process models. Experi- +ments by Swaanenburg and colleagues found that management by exception is not +the strategy adopted by human controllers as their normal supervisory mode [196]. +Avoiding management by exception requires active involvement in the control task +and adequate feedback to update process models. A display that provides only an +overview and no detailed information about the process state, for example, may not +provide the information necessary for detecting imminent alarm conditions. +Finally, if designers expect operators to react correctly to emergencies, they need +to design to support them in these tasks and to help fight some basic human tenden- +cies described previously such as cognitive fixation and plan continuation. The +system design should support human controllers in decision making and planning +activities during emergencies. + +section 9.4.6. Designing to Reduce Common Human Errors. +Some human errors are so common and unnecessary that there is little excuse for +not designing to prevent them. Care must be taken though that the attempt to +reduce erroneous actions does not prevent the human controller from intervening +in an emergency when the assumptions made during design about what should and +should not be done turn out to be incorrect. +One fundamental design goal is to make safety-enhancing actions easy, natural, +and difficult to omit or do wrong. In general, the design should make it more difficult +for the human controller to operate unsafely than safely. If safety-enhancing actions +are easy, they are less likely to be bypassed intentionally or accidentally. Stopping +an unsafe action or leaving an unsafe state should be possible with a single keystroke +that moves the system into a safe state. The design should make fail-safe actions +easy and natural, and difficult to avoid, omit, or do wrong. + + +In contrast, two or more unique operator actions should be required to start any +potentially hazardous function or sequence of functions. Hazardous actions should +be designed to minimize the potential for inadvertent activation; they should not, +for example, be initiated by pushing a single key or button (see the preceding dis- +cussion of incremental control). +The general design goal should be to enhance the ability of the human controller +to act safely while making it more difficult to behave unsafely. Initiating a potentially +unsafe process change, such as a spacecraft launch, should require multiple key- +strokes or actions while stopping a launch should require only one. +Safety may be enhanced by using procedural safeguards, where the operator is +instructed to take or avoid specific actions, or by designing safeguards into the +system. The latter is much more effective. For example, if the potential error involves +leaving out a critical action, either the operator can be instructed to always take +that action or the action can be made an integral part of the process. A typical error + + +during maintenance is not to return equipment (such as safety interlocks) to the +operational mode. The accident sequence at Three Mile Island was initiated by such +an error. An action that is isolated and has no immediate relation to the “gestalt” +of the repair or testing task is easily forgotten. 
Instead of stressing the need to be +careful (the usual approach), change the system by integrating the act physically +into the task, make detection a physical consequence of the tool design, or change +operations planning or review. That is, change design or management rather than +trying to change the human [162]. +To enhance decision making, references should be provided for making judg- +ments, such as marking meters with safe and unsafe limits. Because humans often +revert to stereotype and cultural norms, such norms should be followed in design. +Keeping things simple, natural, and similar to what has been done before (not +making gratuitous design changes) is a good way to avoid errors when humans are +working under stress, are distracted, or are performing tasks while thinking about +something else. +To assist in preventing sequencing errors, controls should be placed in the +sequence in which they are to be used. At the same time, similarity, proximity, inter- +ference, or awkward location of critical controls should be avoided. Where operators +have to perform different classes or types of control actions, sequences should be +made as dissimilar as possible. +Finally, one of the most effective design techniques for reducing human error is +to design so that the error is not physically possible or so that errors are obvious. +For example, valves can be designed so they cannot be interchanged by making the +connections different sizes or preventing assembly errors by using asymmetric or +male and female connections. Connection errors can also be made obvious by color +coding. Amazingly, in spite of hundreds of deaths due to misconnected tubes in +hospitals that have occurred over decades, such as a feeding tube inadvertently +connected to a tube that is inserted in a patient’s vein, regulators, hospitals, and +tube manufacturers have taken no action to implement this standard safety design +technique [80]. + +section 9.4.7. Support in Creating and Maintaining Accurate Process Models. +Human controllers who are supervising automation have two process models to +maintain: one for the process being controlled by the automation and one for the +automated controller itself. The design should support human controllers in main- +taining both of these models. An appropriate goal here is to provide humans with +the facilities to experiment and learn about the systems they are controlling, either +directly or indirectly. Operators should also be allowed to maintain manual involve- +ment to update process models, to maintain skills, and to preserve self-confidence. +Simply observing will degrade human supervisory skills and confidence. + + +When human controllers are supervising automated controllers, the automation +has extra design requirements. The control algorithm used by the automation must +be learnable and understandable. Two common design flaws in automated control- +lers are inconsistent behavior by the automation and unintended side effects. +Inconsistent Behavior. +Carroll and Olson define a consistent design as one where a similar task or goal is +associated with similar or identical actions [35]. Consistent behavior on the part of +the automated controller makes it easier for the human providing supervisory +control to learn how the automation works, to build an appropriate process model +for it, and to anticipate its behavior. +An example of inconsistency, detected in an A320 simulator study, involved an +aircraft go-around below 100 feet above ground level. 
Sarter and Woods found that +pilots failed to anticipate and realize that the autothrust system did not arm when + + +they selected takeoff/go-around (TOGA) power under these conditions because it +did so under all other circumstances where TOGA power is applied [181]. +Another example of inconsistent automation behavior, which was implicated in +an A320 accident, is a protection function that is provided in all automation configu- +rations except the specific mode (in this case altitude acquisition) in which the +autopilot was operating [181]. +Human factors for critical systems have most extensively been studied in aircraft +cockpit design. Studies have found that consistency is most important in high-tempo, +highly dynamic phases of flight where pilots have to rely on their automatic systems +to work as expected without constant monitoring. Even in more low-pressure +situations, consistency (or predictability) is important in light of the evidence from +pilot surveys that their normal monitoring behavior may change on high-tech flight +decks [181]. +Pilots on conventional aircraft use a highly trained instrument-scanning pattern +of recurrently sampling a given set of basic flight parameters. In contrast, some A320 +pilots report that they no longer scan anymore but allocate their attention within +and across cockpit displays on the basis of expected automation states and behav- +iors. Parameters that are not expected to change may be neglected for a long time +[181]. If the automation behavior is not consistent, errors of omission may occur +where the pilot does not intervene when necessary. +In section 9.3.2, determinism was identified as a safety design feature for auto- +mated controllers. Consistency, however, requires more than deterministic behavior. +If the operator provides the same inputs but different outputs (behaviors) result for +some reason other than what the operator has done (or may even know about), +then the behavior is inconsistent from the operator viewpoint even though it is +deterministic. While the designers may have good reasons for including inconsistent +behavior in the automated controller, there should be a careful tradeoff made with +the potential hazards that could result. +Unintended Side Effects. +Incorrect process models can result when an action intended to have one effect has +an additional side effect not easily anticipated by the human controller. An example +occurred in the Sarter and Woods A320 aircraft simulator study cited earlier. Because +the approach to the destination airport is such a busy time for the pilots and the +automation requires so much heads down work, pilots often program the automa- +tion as soon as the air traffic controllers assign them a runway. Sarter and Woods +found that the experienced pilots in their study were not aware that entering a +runway change after entering data for the assigned approach results in the deletion +by the automation of all the previously entered altitude and speed constraints, even +though they may still apply. + + +Once again, there may be good reason for the automation designers to include +such side effects, but they need to consider the potential for human error that +can result. +Mode Confusion. +Modes define mutually exclusive sets of automation behaviors. Modes can be used +to determine how to interpret inputs or to define required controller behavior. Four +general types of modes are common: controller operating modes, supervisory modes, +display modes, and controlled process modes. 
+Controller operating modes define sets of related behavior in the controller, such +as shutdown, nominal behavior, and fault-handling. +Supervisory modes determine who or what is controlling the component at any +time when multiple supervisors can assume control responsibilities. For example, a +flight guidance system in an aircraft may be issued direct commands by the pilot(s) +or by another computer that is itself being supervised by the pilot(s). The movement +controller in the thermal tile processing system might be designed to be in either +manual supervisory mode (by a human controller) or automated mode (by the +TTPS task controller). Coordination of control actions among multiple supervisors +can be defined in terms of these supervisory modes. Confusion about the current +supervisory mode can lead to hazardous system behavior. +A third type of common mode is a display mode. The display mode will +affect the information provided on the display and how the user interprets that +information. +A final type of mode is the operating mode of the controlled process. For example, +the mobile thermal tile processing robot may be in a moving mode (between work +areas) or in a work mode (in a work area and servicing tiles, during which time it +may be controlled by a different controller). The value of this mode may determine +whether various operations—for example, extending the stabilizer legs or the +manipulator arm—are safe. +Early automated systems had a fairly small number of independent modes. They +provided a passive background on which the operator would act by entering target +data and requesting system operations. They also had only one overall mode setting +for each function performed. Indications of currently active mode and of transitions +between modes could be dedicated to one location on the display. +The consequences of breakdown in mode awareness were fairly small in these +system designs. Operators seemed able to detect and recover from erroneous actions +relatively quickly before serious problems resulted. Sarter and Woods conclude that, +in most cases, mode confusion in these simpler systems are associated with errors +of commission, that is, with errors that require a controller action in order for the +problem to occur [181]. Because the human controller has taken an explicit action, + + +he or she is likely to check that the intended effect of the action has actually +occurred. The short feedback loops allow the controller to repair most errors quickly, +as noted earlier. +The flexibility of advanced automation allows designers to develop more com- +plicated, mode-rich systems. The result is numerous mode indications often spread +over multiple displays, each containing just that portion of mode status data cor- +responding to a particular system or subsystem. The designs also allow for interac- +tions across modes. The increased capabilities of automation can, in addition, lead +to increased delays between user input and feedback about system behavior. +These new mode-rich systems increase the need for and difficulty of maintaining +mode awareness, which can be defined in STAMP terms as keeping the controlled- +system operating mode in the controller’s process model consistent with the actual +controlled system mode. A large number of modes challenges human ability to +maintain awareness of active modes, armed modes, interactions between environ- +mental status and mode behavior, and interactions across modes. 
It also increases +the difficulty of error or failure detection and recovery. +Calling for systems with fewer or less complex modes is probably unrealistic. +Simplifying modes and automation behavior often requires tradeoffs with precision +or efficiency and with marketing demands from a diverse set of customers [181]. +Systems with accidental (unnecessary) complexity, however, can be redesigned to +reduce the potential for human error without sacrificing system capabilities. Where +tradeoffs with desired goals are required to eliminate potential mode confusion +errors, system and interface design, informed by hazard analysis, can help find solu- +tions that require the fewest tradeoffs. For example, accidents most often occur +during transitions between modes, particularly normal and nonnormal modes, so +they should have more stringent design constraints applied to them. +Understanding more about particular types of mode confusion errors can assist +with design. Two common types leading to problems are interface interpretation +modes and indirect mode changes. +Interface Interpretation Mode Confusion: Interface mode errors are the classic +form of mode confusion error: +1. Input-related errors: The software interprets user-entered values differently +than intended. +2. Output-related errors: The software maps multiple conditions onto the same +output, depending on the active controller mode, and the operator interprets +the interface incorrectly. +A common example of an input interface interpretation error occurs with many +word processors where the user may think they are in insert mode but instead they + + +are in insert and delete mode or in command mode and their input is interpreted +in a different way and results in different behavior than they intended. +A more complex example occurred in what is believed to be a cause of an A320 +aircraft accident. The crew directed the automated system to fly in the track/flight +path angle mode, which is a combined mode related to both lateral (track) and +vertical (flight path angle) navigation: +When they were given radar vectors by the air traffic controller, they may have switched +from the track to the hdg sel mode to be able to enter the heading requested by the +controller. However, pushing the button to change the lateral mode also automatically +changes the vertical mode from flight path angle to vertical speed—the mode switch +button affects both lateral and vertical navigation. When the pilots subsequently entered +“33” to select the desired flight path angle of 3.3 degrees, the automation interpreted their +input as a desired vertical speed of 3300 ft. This was not intended by the pilots who were +not aware of the active “interface mode” and failed to detect the problem. As a conse- +quence of the too steep descent, the airplane crashed into a mountain [181]. +An example of an output interface mode problem was identified by Cook et al. [41] +in a medical operating room device with two operating modes: warmup and normal. +The device starts in warmup mode when turned on and changes from normal mode +to warmup mode whenever either of two particular settings is adjusted by the opera- +tor. The meaning of alarm messages and the effect of controls are different in these +two modes, but neither the current device operating mode nor a change in mode is +indicated to the operator. 
In addition, four distinct alarm-triggering conditions are mapped onto two alarm messages so that the same message has different meanings depending on the operating mode. In order to understand what internal condition triggered the message, the operator must infer which malfunction is being indicated by the alarm.

Several design constraints can assist in reducing interface interpretation errors. At a minimum, any mode used to control interpretation of the supervisory interface should be annunciated to the supervisor. More generally, the current operating mode of the automation should be displayed at all times. In addition, any change of operating mode should trigger a change in the current operating mode reflected in the interface and thus displayed to the operator; that is, the annunciated mode must be consistent with the internal mode.

A stronger design choice, but perhaps less desirable for various reasons, might be not to condition the interpretation of the supervisory interface on modes at all. Another possibility is to simplify the relationships between modes; for example, in the A320 the lateral and vertical modes might be separated with respect to the heading select mode. Other alternatives are to make the required inputs distinct enough to lessen confusion (such as 3.3 and 3,300 rather than 33) or to make the mode indicator on the control panel clearer about the current mode. While simply annunciating the mode may be adequate in some cases, annunciations can easily be missed for a variety of reasons, and additional design features should be considered.

Mode Confusion Arising from Indirect Mode Changes: Indirect mode changes occur when the automation changes mode without an explicit instruction or direct command by the operator. Such transitions may be triggered by conditions in the automation, such as preprogrammed envelope protection. They may also result from sensor input to the computer about the state of the computer-controlled process, such as achievement of a preprogrammed target or an armed mode with a preselected mode transition. An example of the latter is a mode in which the autopilot might command leveling off of the plane once a particular altitude is reached: the operating mode of the aircraft (leveling off) is changed when the altitude is reached without a direct command to do so by the pilot. In general, the problem occurs when activating one mode can result in the activation of different modes depending on the system status at the time.

There are four ways to trigger a mode change:
1. The automation supervisor explicitly selects a new mode.
2. The automation supervisor enters data (such as a target altitude) or a command that leads to a mode change:
a. Under all conditions.
b. When the automation is in a particular state.
c. When the automation’s controlled system model or environment is in a particular state.
3. The automation supervisor does not do anything, but the automation logic changes mode as a result of a change in the system it is controlling.
4. The automation supervisor selects a mode change but the automation does something else, either because of the state of the automation at the time or the state of the controlled system.

Again, errors involving mode confusion stem from the difficulty human supervisors of automated controllers have in maintaining accurate process models.
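The design constraints above — always annunciate the active mode, keep the annunciated and internal modes consistent, and make indirect mode changes conspicuous — can be sketched as follows. This is illustrative only, in Python, with a hypothetical display interface (show_mode(), highlight()); it is not a description of any particular flight deck or device.

    class ModeManager:
        # Illustrative only: the annunciated mode always equals the internal mode,
        # and a mode change not commanded by the supervisor (an indirect change)
        # is highlighted rather than made silently.

        def __init__(self, initial_mode, display):
            self.mode = initial_mode
            self.display = display             # hypothetical: show_mode(), highlight()
            self.display.show_mode(initial_mode)

        def supervisor_selects(self, new_mode):
            self._change(new_mode, indirect=False, reason="operator selection")

        def automation_changes(self, new_mode, reason):
            # e.g., a preselected target altitude was reached or an envelope limit hit
            self._change(new_mode, indirect=True, reason=reason)

        def _change(self, new_mode, indirect, reason):
            self.mode = new_mode
            self.display.show_mode(new_mode)   # annunciated mode == internal mode
            if indirect:
                self.display.highlight(f"mode changed to {new_mode}: {reason}")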
+Changes in human controller behavior in highly automated systems, such as the +changes in pilot scanning behavior described earlier, are also related to these types +of mode confusion error. +Behavioral expectations about the automated controller behavior are formed +based on the human supervisors’ knowledge of the input to the automation and +on their process models of the automation. Gaps or misconceptions in this model + + +may interfere with predicting and tracking indirect mode transitions or with under- +standing the interactions among modes. +An example of an accident that has been attributed to an indirect mode change +occurred while an A320 was landing in Bangalore, India [182]. The pilot’s selection +of a lower altitude while the automation was in the altitude acquisition mode +resulted in the activation of the open descent mode, where speed is controlled only +by the pitch of the aircraft and the throttles go to idle. In that mode, the automation +ignores any preprogrammed altitude constraints. To maintain pilot-selected speed +without power, the automation had to use an excessive rate of descent, which led +to the aircraft crashing short of the runway. +Understanding how this could happen is instructive in understanding just how +complex mode logic can get. There are three different ways to activate open descent +mode on the A320: +1. Pull the altitude knob after selecting a lower altitude. +2. Pull the speed knob when the aircraft is in expedite mode. +3. Select a lower altitude while in altitude acquisition mode. +It was the third condition that is suspected to have occurred. The pilot must not +have been aware the aircraft was within 200 feet of the previously entered target +altitude, which triggers altitude acquisition mode. He therefore may not have +expected selection of a lower altitude at that time to result in a mode transition and +did not closely monitor his mode annunciations during this high workload time. He +discovered what happened ten seconds before impact, but that was too late to +recover with the engines at idle [182]. +Other factors contributed to his not discovering the problem until too late, one +of which is the problem in maintaining consistent process models when there are +multiple controllers as discussed in the next section. The pilot flying (PF) had dis- +engaged his flight director1 during approach and was assuming the pilot not flying +(PNF) would do the same. The result would have been a mode configuration in +which airspeed is automatically controlled by the autothrottle (the speed mode), +which is the recommended procedure for the approach phase of flight. The PNF +never turned off his flight director, however, and the open descent mode became +active when a lower altitude was selected. This indirect mode change led to the +hazardous state and eventually the accident, as noted earlier. But a complicating +factor was that each pilot only received an indication of the status of his own flight + +director and not all the information necessary to determine whether the desired +mode would be engaged. The lack of feedback and resulting incomplete knowledge +of the aircraft state (incorrect aircraft process model) contributed to the pilots not +detecting the unsafe state in time to correct it. +Indirect mode transitions can be identified in software designs. 
What to do in +response to identifying them or deciding not to include them in the first place is +more problematic and the tradeoffs and mitigating design features must be consid- +ered for each particular system. The decision is just one of the many involving the +benefits of complexity in system design versus the hazards that can result. + + +footnote. The flight director is automation that gives visual cues to the pilot via an easily interpreted display of +the aircraft’s flight path. The preprogrammed path, automatically computed, furnishes the steering com- +mands necessary to obtain and hold a desired path. + + +Coordination of Multiple Controller Process Models. +When multiple controllers are engaging in coordinated control of a process, incon- +sistency between their process models can lead to hazardous control actions. Careful +design of communication channels and coordinated activity is required. In aircraft, +this coordination, called crew resource management, is accomplished through careful +design of the roles of each controller to enhance communication and to ensure +consistency among their process models. +A special case of this problem occurs when one human controller takes over +for another. The handoff of information about both the state of the controlled +process and any automation being supervised by the human must be carefully +designed. +Thomas describes an incident involving loss of communication for an extended +time between ground air traffic control and an aircraft [199]. In this incident, a +ground controller had taken over after a controller shift change. Aircraft are passed +from one air traffic control sector to another through a carefully designed set of +exchanges, called a handoff, during which the aircraft is told to switch to the radio +frequency for the new sector. When, after a shift change the new controller gave an +instruction to a particular aircraft and received no acknowledgment, the controller +decided to take no further action; she assumed that the lack of acknowledgment +was an indication that the aircraft had already switched to the new sector and was +talking to the next controller. +Process model coordination during shift changes is partially controlled in a +position relief briefing. This briefing normally covers all aircraft that are currently +on the correct radio frequency or have not checked in yet. When the particular flight +in question was not mentioned in the briefing, the new controller interpreted that +as meaning that the aircraft was no longer being controlled by this station. She did +not call the next controller to verify this status because the aircraft had not been +mentioned in the briefing. +The design of the air traffic control system includes redundancy to try to avoid +errors—if the aircraft does not check in with the next controller, then that controller + + +would call her. When she saw the aircraft (on her display) leave her airspace and +no such call was received, she interpreted that as another indication that the aircraft +was indeed talking to the next controller. +A final factor implicated in the loss of communication was that when the new +controller took over, there was little traffic at the aircraft’s altitude and no danger +of collision. Common practice for controllers in this situation is to initiate an early +handoff to the next controller. So although the aircraft was only halfway through +her sector, the new controller assumed an early handoff had occurred. 
+An additional causal factor in this incident involves the way controllers track +which aircraft have checked in and which have already been handed off to the +next controller. The old system was based on printed flight progress strips and +included a requirement to mark the strip when an aircraft had checked in. The +new system uses electronic flight progress strips to display the same information, +but there is no standard method to indicate the check-in has occurred. Instead, +each individual controller develops his or her own personal method to keep track +of this status. In this particular loss of communication case, the controller involved +would type a symbol in a comment area to mark any aircraft that she had already +handed off to the next sector. The controller that was relieved reported that he +usually relied on his memory or checked a box to indicate which aircraft he was +communicating with. +That a carefully designed and coordinated process such as air traffic control can +suffer such problems with coordinating multiple controller process models (and +procedures) attests to the difficulty of this design problem and the necessity for +careful design and analysis. + +section 9.4.8. Providing Information and Feedback. +Designing feedback in general was covered in section 9.3.2. This section covers +feedback design principles specific to human controllers. Important problems in +designing feedback include what information should be provided, how to make the +feedback process more robust, and how the information should be presented to +human controllers. + +Types of Feedback. +Hazard analysis using STPA will provide information about the types of feedback +needed and when. Some additional guidance can be provided to the designer, once +again, using general safety design principles. +Two basic types of feedback are needed: +1. The state of the controlled process: This information is used to (1) update the +controllers’ process models and (2) to detect faults and failures in the other +parts of the control loop, system, and environment. + + +2. The effect of the controllers’ actions: This feedback is used to detect human +errors. As discussed in the section on design for error tolerance, the key to +making errors observable—and therefore remediable—is to provide feedback +about them. This feedback may be in the form of information about the effects +of controller actions, or it may simply be information about the action itself +on the chance that it was inadvertent. + +Updating Process Models. +Updating process models requires feedback about the current state of the system +and any changes that occur. In a system where rapid response by operators is neces- +sary, timing requirements must be placed on the feedback information that the +controller uses to make decisions. In addition, when task performance requires or +implies need for the controller to assess timeliness of information, the feedback +display should include time and date information associated with data. + + +When a human controller is supervising or monitoring automation, the automa- +tion should provide an indication to the controller and to bystanders that it is func- +tioning. The addition of a light to the power interlock example in chapter 8 is a simple +example of this type of feedback. For robot systems, bystanders should be signaled +when the machine is powered up or warning provided when a hazardous zone is +entered. An assumption should not be made that humans will not have to enter the +robot’s area. 
In one fully automated plant, an assumption was made that the robots would be so reliable that the human controllers would not have to enter the plant often and, therefore, the entire plant could be powered down when entry was required. The designers did not provide the usual safety features, such as elevated walkways for the humans and alerts, such as aural warnings, when a robot was moving or about to move. After plant startup, the robots turned out to be so unreliable that the controllers had to enter the plant and bail them out several times during a shift. Because powering down the entire plant had such a negative impact on productivity, the humans got into the habit of entering the automated area of the plant without powering everything down. The inevitable occurred and someone was killed [72].

The automation should provide information about its internal state (such as the state of sensors and actuators), its control actions, its assumptions about the state of the system, and any anomalies that might have occurred. Processing requiring several seconds should provide a status indicator so human controllers can distinguish automated system processing from failure. In one nuclear power plant, the analog component that provided alarm annunciation to the operators was replaced with a digital component performing the same function. An argument was made that a safety analysis was not required because the replacement was “like for like.” Nobody considered, however, that while the functional behavior might be the same, the failure behavior could be different. When the previous analog alarm annunciator failed, the screens went blank and the failure was immediately obvious to the human operators. When the new digital system failed, however, the screens froze, which was not immediately apparent to the operators, delaying critical feedback that the alarm system was not operating.

While the detection of nonevents is relatively simple for automated controllers—for instance, watchdog timers can be used—such detection is very difficult for humans. The absence of a signal, reading, or key piece of information is not usually immediately obvious to humans, and they may not be able to recognize that a missing signal can indicate a change in the process state. In the Turkish Airlines flight TK 1951 accident at Amsterdam’s Schiphol Airport in 2009, for example, the pilots did not notice the absence of a critical mode shift [52]. The design must ensure that the lack of important signals will be registered and noticed by humans.

While safety interlocks are being overridden for test or maintenance, their status should be displayed to the operators and testers. Before allowing resumption of normal operations, the design should require confirmation that the interlocks have been restored. In one launch control system being designed by NASA, the operator could turn off alarms temporarily. There was no indication on the display, however, that the alarms had been disabled. If a shift change occurred and another operator took over the position, the new operator would have no way of knowing that alarms were not being annunciated.

If the information an operator needs to efficiently and safely control the process is not readily available, controllers will use experimentation to test their hypotheses about the state of the controlled system.
If this kind of testing can be hazardous, +then a safe way for operators to test their hypotheses should be provided rather +than simply forbidding it. Such facilities will have additional benefits in handling +emergencies. +The problem of feedback in emergencies is complicated by the fact that distur- +bances may lead to failure of sensors. The information available to the controllers +(or to an automated system) becomes increasingly unreliable as the disturbance +progresses. Alternative means should be provided to check safety-critical informa- +tion as well as ways for human controllers to get additional information the designer +did not foresee would be needed in a particular situation. +Decision aids need to be designed carefully. With the goal of providing assistance +to the human controller, automated systems may provide feedforward (as well as +feedback) information. Predictor displays show the operator one or more future +states of the process parameters, as well as their present state or value, through a +fast-time simulation, a mathematical model, or other analytic method that projects +forward the effects of a particular control action or the progression of a disturbance +if nothing is done about it. +Incorrect feedforward information can lead to process upsets and accidents. +Humans can become dependent on automated assistance and stop checking +whether the advice is reasonable if few errors occur. At the same time, if the +process (control algorithm) truly can be accurately predetermined along with all +future states of the system, then it should be automated. Humans are usually kept +in systems when automation is introduced because they can vary their process +models and control algorithms when conditions change or errors are detected in +the original models and algorithms. Automated assistance such as predictor dis- +plays may lead to overconfidence and complacency and therefore overreliance by +the operator. Humans may stop performing their own mental predictions and +checks if few discrepancies are found over time. The operator then will begin to +rely on the decision aid. +If decision aids are used, they need to be designed to reduce overdependence +and to support operator skills and motivation rather than to take over functions in +the name of support. Decision aids should provide assistance only when requested + + +and their use should not become routine. People need to practice making decisions +if we expect them to do so in emergencies or to detect erroneous decisions by +automation. +Detecting Faults and Failures. +A second use of feedback is to detect faults and failures in the controlled system, +including the physical process and any computer controllers and displays. If +the operator is expected to monitor a computer or automated decision making, +then the computer must make decisions in a manner and at a rate that operators +can follow. Otherwise they will not be able to detect faults and failures reliably +in the system being supervised. In addition, the loss of confidence in the automa- +tion may lead the supervisor to disconnect it, perhaps under conditions where that +could be hazardous, such as during critical points in the automatic landing of an +airplane. When human supervisors can observe on the displays that proper cor- +rections are being made by the automated system, they are less likely to intervene +inappropriately, even in the presence of disturbances that cause large control +actions. 
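On the automated side of such monitoring, the earlier point about nonevents can be made concrete with a watchdog: if the component being monitored stops producing its periodic output, an independent indication is raised rather than relying on a human to notice a frozen or silent display. This is only an illustrative sketch with assumed names and timing, not a design from the text:

# Illustrative sketch (assumptions, not a specific plant design): a watchdog
# that detects the "nonevent" of an alarm annunciator that has stopped updating,
# converting a frozen display into an unmistakable failure indication.

class Watchdog:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_kick = 0.0

    def kick(self, now: float):
        # Called by the annunciator every time it refreshes the display.
        self.last_kick = now

    def expired(self, now: float) -> bool:
        return (now - self.last_kick) > self.timeout_s

watchdog = Watchdog(timeout_s=1.0)
watchdog.kick(now=0.0)
if watchdog.expired(now=3.5):
    # Drive an independent, hard-wired indicator rather than the same screen
    # that may have frozen.
    print("ALARM SYSTEM FAILURE: annunciator not updating")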
+For operators to anticipate or detect hazardous states, they need to be continu- +ously updated about the process state so that the system progress and dynamic state +can be monitored. Because of the poor ability of humans to perform monitoring +over extended periods of time, they will need to be involved in the task in some +way, as discussed earlier. If possible, the system should be designed to fail obviously +or to make graceful degradation obvious to the supervisor. +The status of safety-critical components or state variables should be highlighted +and presented unambiguously and completely to the controller. If an unsafe condi- +tion is detected by an automated system being supervised by a human controller, +then the human controller should be told what anomaly was detected, what action +was taken, and the current system configuration. Overrides of potentially hazardous +failures or any clearing of the status data should not be permitted until all of the +data has been displayed and probably not until the operator has acknowledged +seeing it. A system may have a series of faults that can be overridden safely if they +occur singly, but multiple faults could result in a hazard. In this case, the supervisor +should be made aware of all safety-critical faults prior to issuing an override +command or resetting a status display. +Alarms are used to alert controllers to events or conditions in the process that +they might not otherwise notice. They are particularly important for low-probability +events. The overuse of alarms, however, can lead to management by exception, +overload and the incredulity response. +Designing a system that encourages or forces an operator to adopt a manage- +ment-by-exception strategy, where the operator waits for alarm signals before taking + + +action, can be dangerous. This strategy does not allow operators to prevent distur- +bances by looking for early warning signals and trends in the process state. +The use of computers, which can check a large number of system variables in a +short amount of time, has made it easy to add alarms and to install large numbers +of them. In such plants, it is common for alarms to occur frequently, often five to +seven times an hour [196]. Having to acknowledge a large number of alarms may +leave operators with little time to do anything else, particularly in an emergency +[196]. A shift supervisor at the Three Mile Island (TMI) hearings testified that the +control room never had less than 52 alarms lit [98]. During the TMI incident, more +than a hundred alarm lights were lit on the control board, each signaling a different +malfunction, but providing little information about sequencing or timing. So many +alarms occurred at TMI that the computer printouts were running hours behind the +events and, at one point jammed, losing valuable information. Brooks claims that +operators commonly suppress alarms in order to destroy historical information +when they need real-time alarm information for current decisions [26]. Too many +alarms can cause confusion and a lack of confidence and can elicit exactly the wrong +response, interfering with the operator’s ability to rectify the problems causing +the alarms. +Another phenomenon associated with alarms is the incredulity response, which +leads to not believing and ignoring alarms after many false alarms have occurred. +The problem is that in order to issue alarms early enough to avoid drastic counter- +measures, the alarm limits must be set close to the desired operating point. 
This goal is difficult to achieve for some dynamic processes that have fairly wide operating ranges, leading to the problem of spurious alarms. Statistical and measurement errors may add to the problem.

A great deal has been written about alarm management, particularly in the nuclear power arena, and sophisticated disturbance and alarm analysis systems have been developed. Those designing alarm systems should be familiar with current knowledge about such systems. The following are just a few simple guidelines:

1. Keep spurious alarms to a minimum: This guideline will reduce overload and the incredulity response.

2. Provide checks to distinguish correct from faulty instruments: When response time is not critical, most operators will attempt to check the validity of the alarm [209]. Providing information in a form where this validity check can be made quickly and accurately, and does not become a source of distraction, increases the probability of the operator acting properly.

3. Provide checks on the alarm system itself: The operator has to know whether the problem is in the alarm or in the system. Analog devices can have simple checks such as “press to test” for smoke detectors or buttons to test the bulbs in a lighted gauge. Computer-displayed alarms are more difficult to check; checking usually requires some additional hardware or redundant information that does not come through the computer. One complication comes in the form of alarm analysis systems that check alarms and display a prime cause along with associated effects. Operators may not be able to perform validity checks on the complex logic necessarily involved in these systems, leading to overreliance [209]. Wiener and Curry also worry that the priorities might not always be appropriate in automated alarm analysis and that operators may not recognize this fact.

4. Distinguish between routine and safety-critical alarms: The form of the alarm, such as auditory cues or message highlighting, should indicate the degree of urgency. Alarms should be categorized as to which are the highest priority.

5. Provide temporal information about events and state changes: Proper decision making often requires knowledge about the timing and sequencing of events. Because of system complexity and built-in time delays due to sampling intervals, however, information about conditions or events is not always timely or even presented in the sequence in which the events actually occurred. Complex systems are often designed to sample monitored variables at different frequencies: some variables may be sampled every few seconds while, for others, the intervals may be measured in minutes. Changes that are negated within the sampling period may not be recorded at all. Events may become separated from their circumstances, both in sequence and time [26].

6. Require corrective action when necessary: When faced with a lot of undigested and sometimes conflicting information, humans will first try to figure out what is going wrong. They may become so involved in attempts to save the system that they wait too long to abandon the recovery efforts. Alternatively, they may ignore alarms they do not understand or think are not safety critical. The system design may need to ensure that the operator cannot clear a safety-critical alert without taking corrective action or without performing subsequent actions required to complete an interrupted operation.
The Therac-25, a linear accelerator that massively overdosed multiple patients, allowed operators to proceed with treatment five times after an error message appeared simply by pressing one key [115]. No distinction was made between errors that could be safety-critical and those that were not.

7. Indicate which condition is responsible for the alarm: System designs with more than one mode, or in which more than one condition can trigger the alarm for a mode, must clearly indicate which condition is responsible for the alarm. In the Therac-25, one message meant that the dosage given was either too low or too high, without providing information to the operator about which of these errors had occurred. In general, determining the cause of an alarm may be difficult. In complex, tightly coupled plants, the point where the alarm is first triggered may be far away from where the fault actually occurred.

8. Minimize the use of alarms when they may lead to management by exception: After studying thousands of near accidents reported voluntarily by aircraft crews and ground support personnel, one U.S. government report recommended that the altitude alert signal (an aural sound) be disabled for all but a few long-distance flights [141]. Investigators found that this signal had caused decreased altitude awareness in the flight crew, resulting in more frequent overshoots—instead of leveling off at 10,000 feet, for example, the aircraft continues to climb or descend until the alarm sounds. A study of such overshoots noted that they rarely occur in bad weather, when the crew is most attentive.

Robustness of the Feedback Process.
Because feedback is so important to safety, robustness must be designed into feedback channels. The problem of feedback in emergencies is complicated by the fact that disturbances may lead to failure of sensors. The information available to the controllers (or to an automated system) becomes increasingly unreliable as the disturbance progresses.

One way to prepare for failures is to provide alternative sources of information and alternative means to check safety-critical information. It is also useful for the operators to get additional information the designers did not foresee would be needed in a particular situation. The emergency may have occurred because the designers made incorrect assumptions about the operation of the controlled system, the environment in which it would operate, or the information needs of the controller.

If automated controllers provide the only information about the controlled system state, the human controller supervising the automation can provide little oversight. The human supervisor must have access to independent sources of information to detect faults and failures, except in the case of a few failure modes such as total inactivity. Several incidents involving the command and control warning system at NORAD headquarters in Cheyenne Mountain involved situations where the computer had bad information and thought the United States was under nuclear attack. Human supervisors were able to ascertain that the computer was incorrect through direct contact with the warning sensors (satellites and radars). This direct contact showed the sensors were operating and had received no evidence of incoming missiles [180]. The error detection would not have been possible if the humans could only get information about the sensors from the computer, which had the wrong information.
Many of these direct sensor inputs are being removed in the mistaken belief that only computer displays are required.

The main point is that human supervisors of automation cannot monitor its performance if the information used in monitoring is not independent from the thing being monitored. Provision needs to be made for failure of computer displays or for incorrect process models in the software by providing alternate sources of information. Of course, any instrumentation to deal with a malfunction must not be disabled by the malfunction; that is, common-cause failures must be eliminated or controlled. As an example of the latter, an engine and pylon came off the wing of a DC-10, severing the cables that controlled the leading edge flaps and also four hydraulic lines. These failures disabled several warning signals, including a flap mismatch signal and a stall warning light [155]. If the crew had known the slats were retracted and had been warned of a potential stall, they might have been able to save the plane.

Displaying Feedback to Human Controllers.
Computer displays are now ubiquitous in providing feedback information to human controllers, as are complaints about their design.

Many computer displays are criticized for providing too much data (data overload), forcing the human controller to sort through large amounts of data to find the pieces needed. Information located in different places may then need to be integrated. Bainbridge suggests that operators should not have to page between displays to obtain information about abnormal states in the parts of the process other than the one they are currently thinking about; neither should they have to page between displays that provide information needed for a single decision process.

These design problems are difficult to eliminate, but performing a task analysis coupled with a hazard analysis can assist in better design, as will making all the information needed for a single decision process visible at the same time, placing frequently used displays centrally, and grouping displays of information using the information obtained in the task analysis. It may also be helpful to provide alternative ways to display information or easy ways to request what is needed.

Much has been written about how to design computer displays, although a surprisingly large number of displays still seem to be poorly designed. The difficulty of such design is increased by the fact that, once again, conflicts can exist. For example, intuition seems to support providing information to users in a form that can be quickly and easily interpreted. This assumption is true if rapid reactions are required. Some psychological research, however, suggests that cognitive processing for meaning leads to better information retention: a display that requires little thought and work on the part of the operator may not support acquisition of the knowledge and thinking skills needed in abnormal conditions [168].

Once again, the designer needs to understand the tasks the user of the display is performing. To increase safety, the displays should reflect what is known about how the information is used and what kinds of displays are likely to cause human error. Even slight changes in the way information is presented can have dramatic effects on performance.

The rest of this section concentrates only on a few design guidelines that are especially important for safety.
The reader is referred to the standard literature on +display design for more information. +Safety-related information should be distinguished from non-safety-related +information and highlighted. In addition, when safety interlocks are being overrid- +den, their status should be displayed. Similarly, if safety-related alarms are tempo- +rarily inhibited, which may be reasonable to allow so that the operator can deal +with the problem without being continually interrupted by additional alarms, the +inhibit status should be shown on the display. Make warning displays brief and +simple. +A common mistake is to make all the information displays digital simply because +the computer is a digital device. Analog displays have tremendous advantages for +processing by humans. For example, humans are excellent at pattern recognition, +so providing scannable displays that allow operators to process feedback and diag- +nose problems using pattern recognition will enhance human performance. A great +deal of information can be absorbed relatively easily when it is presented in the +form of patterns. +Avoid displaying absolute values unless the human requires the absolute values. +It is hard to notice changes such as events and trends when digital values are going +up and down. A related guideline is to provide references for judgment. Often, for +example, the user of the display does not need the absolute value but only the fact +that it is over or under a limit. Showing the value on an analog dial with references +to show the limits will minimize the required amount of extra and error-prone pro- +cessing by the user. The overall goal is to minimize the need for extra mental pro- +cessing to get the information the users of the display need for decision making or +for updating their process models. +Another typical problem occurs when computer displays must be requested and +accessed sequentially by the user, which makes greater memory demands upon the +operator, negatively affecting difficult decision-making tasks [14]. With conventional +instrumentation, all process information is constantly available to the operator: an +overall view of the process state can be obtained by a glance at the console. Detailed +readings may be needed only if some deviation from normal conditions is detected. + + +The alternative, a process overview display on a computer console, is more time +consuming to process: To obtain additional information about a limited part of the +process, the operator has to select consciously among displays. +In a study of computer displays in the process industry, Swaanenburg and col- +leagues found that most operators considered a computer display more difficult to +work with than conventional parallel interfaces, especially with respect to getting +an overview of the process state. In addition, operators felt the computer overview +displays were of limited use in keeping them updated on task changes; instead, +operators tended to rely to a large extent on group displays for their supervisory +tasks. The researchers conclude that a group display, showing different process vari- +ables in reasonable detail (such as measured value, setpoint, and valve position), +clearly provided the type of data operators preferred. Keeping track of the progress +of a disturbance is very difficult with sequentially presented information [196]. 
One +general lesson to be learned here is that the operators of the system need to be +involved in display design decisions: The designers should not just do what is easiest +to implement or satisfies their aesthetic senses. +Whenever possible, software designers should try to copy the standard displays +with which operators have become familiar, and which were often developed for +good psychological reasons, instead of trying to be creative or unique. For example, +icons with a standard interpretation should be used. Researchers have found that +icons often pleased system designers but irritated users [92]. Air traffic controllers, +for example, found the arrow icons for directions on a new display useless and +preferred numbers. Once again, including experienced operators in the design +process and understanding why the current analog displays have developed as they +have will help to avoid these basic types of design errors. +An excellent way to enhance human interpretation and processing is to design +the control panel to mimic the physical layout of the plant or system. For example, +graphical displays allow the status of valves to be shown within the context of piping +diagrams and even the flow of materials. Plots of variables can be shown, highlight- +ing important relationships. +The graphical capabilities of computer displays provides exciting potential for +improving on traditional instrumentation, but the designs need to be based on psy- +chological principles and not just on what appeals to the designer, who may never +have operated a complex process. As Lees has suggested, the starting point should +be consideration of the operator’s tasks and problems; the display should evolve as +a solution to these [110]. +Operator inputs to the design process as well as extensive simulation and testing +will assist in designing usable computer displays. Remember that the overall goal is +to reduce the mental workload of the human in updating their process models and +to reduce human error in interpreting feedback. + + +section 9.5. +Summary. +A process for safety-guided design using STPA and some basic principles for safe +design have been described in this chapter. The topic is an important one and more +still needs to be learned, particularly with respect to safe system design for human +controllers. Including skilled and experienced operators in the design process from +the beginning will help as will performing sophisticated human task analyses rather +than relying primarily on operators interacting with computer simulations. +The next chapter describes how to integrate the disparate information and tech- +niques provided so far in part III into a system-engineering process that integrates +safety into the design process from the beginning, as suggested in chapter 6. + + diff --git a/chapter09.txt b/chapter09.txt new file mode 100644 index 0000000..df8293d --- /dev/null +++ b/chapter09.txt @@ -0,0 +1,1703 @@ +chapter 9. +Safety-Guided Design. +In the examples of STPA in the last chapter, the development of the design was +assumed to occur independently. Most of the time, hazard analysis is done after the +major design decisions have been made. But STPA can be used in a proactive way +to help guide the design and system development, rather than as simply a hazard +analysis technique on an existing design. This integrated design and analysis process +is called safety-guided design .(figure 9.1). 
As the systems we build and operate increase in size and complexity, the use of sophisticated system engineering approaches becomes more critical. Important system-level (emergent) properties, such as safety, must be built into the design of these systems; they cannot be effectively added on or simply measured afterward. Adding barriers or protection devices after the fact is not only enormously more expensive, it is also much less effective than designing safety in from the beginning (see Safeware, chapter 16). This chapter describes the process of safety-guided design, which is enhanced by defining accident prevention as a control problem rather than a “prevent failures” problem. The next chapter shows how safety engineering and safety-guided design can be integrated into basic system engineering processes.

section 9.1.
The Safety-Guided Design Process.
One key to having a cost-effective safety effort is to embed it into a system engineering process from the very beginning and to design safety into the system as the design decisions are made. Once again, the process starts with the fundamental activities in chapter 7. After the hazards and system-level safety requirements and constraints have been identified, the design process starts:
1. Try to eliminate the hazards from the conceptual design.
2. If any of the hazards cannot be eliminated, then identify the potential for their control at the system level.
3. Create a system control structure and assign responsibilities for enforcing safety constraints. Some guidance for this process is provided in the operations and management chapters.
4. Refine the constraints and design in parallel.
a. Identify potentially hazardous control actions by each of the system components that would violate the system design constraints, using STPA step 1. Restate the identified hazardous control actions as component design constraints.
b. Using STPA step 2, determine what factors could lead to a violation of the safety constraints.
c. Augment the basic design to eliminate or control potentially unsafe control actions and behaviors.
d. Iterate over the process, that is, perform STPA steps 1 and 2 on the new augmented design, and continue to refine the design until all hazardous scenarios are eliminated, mitigated, or controlled.
The next section provides an example of the process. The rest of the chapter discusses safe design principles for physical processes, automated controllers, and human controllers.

section 9.2.
An Example of Safety-Guided Design for an Industrial Robot.
The process of safety-guided design and the use of STPA to support it is illustrated here with the design of an experimental Space Shuttle robotic Thermal Tile Processing System (TTPS) based on a design created for a research project at CMU.

The goal of the TTPS is to inspect and waterproof the thermal protection tiles on the belly of the Space Shuttle, thus saving humans from a laborious task, typically lasting three to four months, that begins within minutes after the Shuttle lands and ends just prior to launch. Upon landing at either the Dryden facility in California or Kennedy Space Center in Florida, the orbiter is brought to either the Mate-Demate Device (MDD) or the Orbiter Processing Facility (OPF). These large structures provide access to all areas of the orbiters.

The Space Shuttle is covered with several types of heat-resistant tiles that protect the orbiter’s aluminum skin during the heat of reentry.
While the majority of the +upper surfaces are covered with flexible insulation blankets, the lower surfaces are +covered with silica tiles. These tiles have a glazed coating over soft and highly porous +silica fibers. The tiles are 95 percent air by volume, which makes them extremely +light but also makes them capable of absorbing a tremendous amount of water. +Water in the tiles causes a substantial weight problem that can adversely affect +launch and orbit capabilities for the shuttles. Because the orbiters may be exposed +to rain during transport and on the launch pad, the tiles must be waterproofed. This +task is accomplished through the use of a specialized hydrophobic chemical, DMES, +which is injected into each tile. There are approximately 17,000 lower surface tiles +covering an area that is roughly 25m × 40m. +In the standard process, DMES is injected into a small hole in each tile by a +handheld tool that pumps a small quantity of chemical into the nozzle. The nozzle +is held against the tile and the chemical is forced through the tile by a pressurized +nitrogen purge for several seconds. It takes about 240 hours to waterproof the tiles +on an orbiter. Because the chemical is toxic, human workers have to wear heavy +suits and respirators while injecting the chemical and, at the same time, maneuvering +in a crowded work area. One goal for using a robot to perform this task was to +eliminate a very tedious, uncomfortable, and potentially hazardous human activity. +The tiles must also be inspected. A goal for the TTPS was to inspect the tiles +more accurately than the human eye and therefore reduce the need for multiple +inspections. During launch, reentry, and transport, a number of defects can occur on +the tiles in the form of scratches, cracks, gouges, discoloring, and erosion of surfaces. +The examination of the tiles determines if they need to be replaced or repaired. The +typical procedures involve visual inspection of each tile to see if there is any damage +and then assessment and categorization of the defects according to detailed checklists. Later, work orders are issued for repair of individual tiles. +Like any design process, safety-guided design starts with identifying the goals for +the system and the constraints under which the system must operate. The high-level +goals for the TTPS are to. +1. Inspect the thermal tiles for damage caused during launch, reentry, and +transport +2. Apply waterproofing chemicals to the thermal tiles + + +Environmental constraints delimit how these goals can be achieved and identifying +those constraints, particularly the safety constraints, is an early goal in safetyguided design. +The environmental constraints on the system design stem from physical properties of the Orbital Processing Facility .(OPF). at KSC, such as size constraints on the +physical system components and the necessity of any mobile robotic components +to deal with crowded work areas and for humans to be in the area. Example work +area environmental constraints for the TTPS are. +EA1. The work areas of the Orbiter Processing Facility .(OPF). can be very +crowded. The facilities provide access to all areas of the orbiters through the +use of intricate platforms that are laced with plumbing, wiring, corridors, lifting +devices, and so on. After entering the facility, the orbiters are jacked up and +leveled. Substantial structure then swings around and surrounds the orbiter on +all sides and at all levels. 
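To give a rough sense of the workload behind the second goal, a back-of-the-envelope calculation using the figures above (the result is derived here, not stated in the text) shows that the 240-hour manual campaign leaves less than a minute per tile on average:

# Rough check derived from the numbers above (17,000 tiles, about 240 hours):
# average time available per tile during the manual waterproofing campaign.
tiles = 17_000
total_hours = 240
seconds_per_tile = total_hours * 3600 / tiles
tiles_per_hour = tiles / total_hours
print(f"{seconds_per_tile:.0f} s per tile, about {tiles_per_hour:.0f} tiles per hour")
# -> roughly 51 s per tile, about 71 tiles per hour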
With the exception of the jack stands that support the orbiters, the floor space directly beneath the orbiter is initially clear, but the surrounding structure can be very crowded.
EA2. The mobile robot must enter the facility through personnel access doors 1.1 meters (42″) wide. The layout within the OPF allows a length of 2.5 meters (100″) for the robot. There are some structural beams whose heights are as low as 1.75 meters (70″), but once under the orbiter the tile heights range from about 2.9 meters to 4 meters. The compact roll-in form of the mobile system must maneuver through these spaces and also raise its inspection and injection equipment up to heights of 4 meters to reach individual tiles while still meeting a 1 millimeter accuracy requirement.
EA3. Additional constraints involve moving around the crowded workspace. The robot must negotiate jack stands, columns, work stands, cables, and hoses. In addition, there are hanging cords, clamps, and hoses. Because the robot might cause damage to the ground obstacles, cable covers will be used for protection and the robot system must traverse these covers.

Other design constraints on the TTPS include:
1. Use of the TTPS must not negatively impact the flight schedules of the orbiters more than that of the manual system being replaced.
2. Maintenance costs of the TTPS must not exceed x dollars per year.
3. Use of the TTPS must not cause or contribute to an unacceptable loss (accident) as defined by Shuttle management.

As with many systems, prioritizing the hazards by severity is enough in this case to assist the engineers in making decisions during design. Sometimes a preliminary hazard analysis is performed using a risk matrix to determine how much effort will be put into eliminating or controlling the hazards and in making tradeoffs in design. Likelihood, at this point, is unknowable, but some type of surrogate, like mitigatibility, as demonstrated in section 10.3.4, could be used. In the TTPS example, severity plus the NASA policy described earlier is adequate. To decide not to consider some of the hazards at all would be pointless and dangerous at this stage of development, as likelihood is not determinable. As the design proceeds and decisions must be made, specific additional information may be found to be useful and acquired at that time. After the system design is completed, if it is determined that some hazards cannot be adequately handled or the compromises required to handle them are too great, then the limitations would be documented (as described in chapter 10) and decisions would have to be made at that point about the risks of using the system. At that time, however, the information necessary to make those decisions will more likely be available than before the development process begins.

After the hazards are identified, system-level safety-related requirements and design constraints are derived from them. As an example, for hazard H7 (inadequate thermal protection), a system-level safety design constraint is that the mobile robot processing must not result in any tiles being missed in the inspection or waterproofing process. More detailed design constraints will be generated during the safety-guided design process.

To get started, a general system architecture must be selected (figure 9.2).
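Before the architecture is elaborated, the hazard-to-constraint derivation just described can be recorded in a simple traceability structure so that later design decisions can be checked against it. The sketch below is hypothetical: only the H7 example comes from the text, while the identifier SC-7.1 and the data structure itself are assumptions made for illustration.

# Hypothetical sketch of recording the derivation of system-level safety
# constraints from identified hazards. The H7 wording follows the text; the
# constraint identifier and representation are illustrative assumptions.

hazard_constraints = {
    "H7": {
        "hazard": "Inadequate thermal protection (tile missed or not waterproofed)",
        "constraints": [
            "SC-7.1: Robot processing must not result in any tile being missed "
            "in the inspection or waterproofing process.",
        ],
    },
}

def constraints_for(hazard_id: str):
    """Trace from a hazard to the design constraints that must control it."""
    return hazard_constraints[hazard_id]["constraints"]

print(constraints_for("H7"))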
Let’s +assume that the initial TTPS architecture consists of a mobile base on which tools +will be mounted, including a manipulator arm that performs the processing and +contains the vision and waterproofing tools. This very early decision may be changed +after the safety-guided design process starts, but some very basic initial assumptions +are necessary to get going. As the concept development and detailed design process +proceeds, information generated about hazards and design tradeoffs may lead to +changes in the initial configuration. Alternatively, multiple design configurations +may be considered in parallel. +In the initial candidate architecture .(control structure), a decision is made to +introduce a human operator in order to supervise robot movement as so many of +the hazards are related to movement. At the same time, it may be impractical for +an operator to monitor all the activities so the first version of the system architecture +is to have the TTPS control system in charge of the non-movement activities and +to have both the TTPS and the control room operator share control of movement. +The safety-guided design process, including STPA, will identify the implications of +this decision and will assist in analyzing the allocation of tasks to the various components to determine the safety tradeoffs involved. +In the candidate starting architecture .(control structure), there is an automated +robot work planner to provide the overall processing goals and tasks for the + + +TTPS. A location system is needed to provide information to the movement controller about the current location of the robot. A camera is used to provide information to the human controller, as the control room will be located at a distance +from the orbiter. The role of the other components should be obvious. +The proposed design has two potential movement controllers, so coordination +problems will have to be eliminated. The operator could control all movement, but +that may be considered impractical given the processing requirements. To assist with +this decision process, engineers may create a concept of operations and perform a +human task analysis . +The safety-guided design process, including STPA, will identify the implications +of the basic decisions in the candidate tasks and will assist in analyzing the +allocation of tasks to the various components to determine the safety tradeoffs +involved. +The design process is now ready to start. Using the information already specified, +particularly the general functional responsibilities assigned to each component, + + +designers will identify potentially hazardous control actions by each of the system +components that could violate the safety constraints, determine the causal factors +that could lead to these hazardous control actions, and prevent or control them in +the system design. The process thus involves a top-down identification of scenarios +in which the safety constraints could be violated. The scenarios can then be used to +guide more detailed design decisions. +In general, safety-guided design involves first attempting to eliminate the +hazard from the design and, if that is not possible or requires unacceptable +tradeoffs, reducing the likelihood the hazard will occur, reducing the negative +consequences of the hazard if it does occur, and implementing contingency plans +for limiting damage. More about design procedures is presented in the next +section. 
As design decisions are made, an STPA-based hazard analysis is used to inform these decisions. Early in the system design process, little information is available, so the hazard analysis will be very general at first and will be refined and augmented as additional information emerges through the system design activities.

For the example, let’s focus on the robot instability hazard. The first goal should be to eliminate the hazard in the system design. One way to eliminate potential instability is to make the robot base so heavy that it cannot become unstable, no matter how the manipulator arm is positioned. A heavy base, however, could increase the damage caused by the base coming into contact with a human or object, or make it difficult for workers to manually move the robot out of the way in an emergency. An alternative solution is to make the base long and wide so that the moment created by the operation of the manipulator arm is compensated by the moments created by base supports that are far from the robot’s center of mass. A long and wide base could remove the hazard but may violate the environmental constraints in the facility layout, such as the need to maneuver through doors and in the crowded OPF.

The environmental constraint EA2 above implies a maximum length for the robot of 2.5 meters and a width no larger than 1.1 meters. Given the required maximum extension length of the manipulator arm and the estimated weight of the equipment that will need to be carried on the mobile base, a calculation might show that the length of the robot base is sufficient to prevent any longitudinal instability, but that the width of the base is not sufficient to prevent lateral instability.

If eliminating the hazard is determined to be impractical (as in this case) or not desirable for some reason, the alternative is to identify ways to control it. The decision to try to control the hazard may turn out not to be practical, or later may seem less satisfactory than increasing the weight (the solution discarded earlier). All decisions should remain open as more information is obtained about alternatives, and backtracking is an option.

At the initial stages in design, we identified only the general hazards, for example, instability of the robot base, and the related system design constraint that the mobile base must not be capable of falling over under worst-case operational conditions. As design decisions are proposed and analyzed, they will lead to additional refinements in the hazards and the design constraints.

For example, a potential solution to the stability problem is to use lateral stabilizer legs that are deployed when the manipulator arm is extended but must be retracted when the robot base moves. Let’s assume that a decision is made to at least consider this solution. That potential design decision generates a new refined hazard from the high-level stability hazard (H2):
H2.1. The manipulator arm is extended while the stabilizer legs are not fully extended.

Damage to the mobile base or other equipment around the OPF is another potential hazard introduced by the addition of the legs if the mobile base moves while the stability legs are extended. Again, engineers would consider whether this hazard could be eliminated by appropriate design of the stability legs.
If it cannot, then that is a second additional hazard that must be controlled in the design, with a corresponding design constraint that the mobile base must not move with the stability legs extended.

There are now two new refined hazards that must be translated into design constraints:
1. The manipulator arm must never be extended if the stabilizer legs are not extended.
2. The mobile base must not move with the stability legs extended.

STPA can be used to further refine these constraints and to evaluate the resulting designs. In the process, the safety control structure will be refined and perhaps changed. In this case, a controller must be identified for the stabilizer legs, which were previously not in the design. Let’s assume that the legs are controlled by the TTPS movement controller (figure 9.3).

Using the augmented control structure, the remaining activities in STPA are to identify potentially hazardous control actions by each of the system components that could violate the safety constraints, determine the causal factors that could lead to these hazardous control actions, and prevent or control them in the system design. The process thus involves a top-down identification of scenarios in which the safety constraints could be violated so that they can be used to guide more detailed design decisions.

The unsafe control actions associated with the stability hazard are shown in figure 9.4. Movement and thermal tile processing hazards are also identified in the table. Combining similar entries for H1 in the table leads to the following unsafe control actions by the leg controller with respect to the instability hazard:
1. The leg controller does not command a deployment of the stabilizer legs before the arm is extended.
2. The leg controller commands a retraction of the stabilizer legs before the manipulator arm is fully stowed.
3. The leg controller commands a retraction of the stabilizer legs after the arm has been extended or commands a retraction of the stabilizer legs before the manipulator arm is stowed.
4. The leg controller stops extension of the stabilizer legs before they are fully extended.
and by the arm controller:
1. The arm controller extends the manipulator arm when the stabilizer legs are not extended or before they are fully extended.

The inadequate control actions can be restated as system safety constraints on the controller behavior (whether the controller is automated or human):
1. The leg controller must ensure the stabilizer legs are fully extended before arm movements are enabled.
2. The leg controller must not command a retraction of the stabilizer legs when the manipulator arm is not in a fully stowed position.
3. The leg controller must command a deployment of the stabilizer legs before arm movements are enabled; the leg controller must not command a retraction of the stabilizer legs before the manipulator arm is stowed.
4. The leg controller must not stop the leg extension until the legs are fully extended.
Similar constraints will be identified for all hazardous commands; for example, the arm controller must not extend the manipulator arm before the stabilizer legs are fully extended.

These system safety constraints might be enforced through physical interlocks, human procedures, and so on. Performing STPA step 2 will provide information during detailed design (1) to evaluate and compare the different design choices,
to design the controllers and fault tolerance features for the system, and (3) to guide the test and verification procedures (or training for humans). As design decisions and safety constraints are identified, the functional specifications for the controllers can be created.

To produce detailed scenarios for the violation of safety constraints, the control structure is augmented with process models. The preliminary design of the process models comes from the information necessary to ensure that the system safety constraints hold. For example, the constraint that the arm controller must not enable manipulator movement before the stabilizer legs are completely extended implies that there must be some type of feedback to the arm controller to determine when the leg extension has been completed.

While a preliminary functional decomposition of the system components is created to start the process, this decomposition may be altered as more information is obtained from the hazard analysis and as the system design continues, in order to optimize fault tolerance and communication requirements. For example, at this point the need for the process models of the leg and arm controllers to be consistent, and the communication required to achieve this goal, may lead the designers to decide to combine the leg and arm controllers (figure 9.5).

Causal factors leading to the stability hazard can be determined using STPA step 2. Feedback about the position of the legs is clearly critical to ensure that the process model of the state of the stabilizer legs is consistent with the actual state. The movement and arm controller cannot assume the legs are extended simply because a command was issued to extend them. The command may not be executed or may only be executed partly. One possible scenario, for example, involves an external object preventing the complete extension of the stabilizer legs. In that case, the robot controller (either human or automated) may assume the stabilizer legs are extended because the extension motors have been powered up (a common type of design error). Subsequent movement of the manipulator arm would then violate the identified safety constraints. Just as the analysis assists in refining the component safety constraints (functional requirements), the causal analysis can be used to further refine those requirements and to design the control algorithm, the control loop components, and the feedback necessary to implement them.

Many of the causes of inadequate control actions are so common that they can be restated as general design principles for safety-critical control loops. The requirement, noted in the previous paragraph, for feedback about whether a command has actually been executed is one of these. The rest of this chapter presents those general design principles.

section 9.3.
Designing for Safety.
Hazard analysis using STPA will identify application-specific safety design constraints that must be enforced by the control algorithm. For the thermal-tile processing robot, a safety constraint identified above is that the manipulator arm must never be extended if the stabilizer legs are not fully extended. Causal analysis (step 2 of STPA) can identify specific causes for the constraint to be violated, and design features can be created to eliminate or control them.
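Constraints like these can be read almost directly as guard conditions in the controller logic. The following minimal Python sketch shows one way a combined leg-and-arm controller could enforce them; the class and state names (LegState, TileRobotController, and the command strings) are hypothetical illustrations, not taken from the TTPS design, and the process model is updated only from sensed feedback, never from the fact that a command was issued.

from enum import Enum

class LegState(Enum):
    RETRACTED = "retracted"
    MOVING = "moving"
    EXTENDED = "extended"
    UNKNOWN = "unknown"

class TileRobotController:
    """Hypothetical combined leg/arm/movement controller sketch.

    Enforces two constraints from the analysis above:
      1. The manipulator arm must never be extended unless the
         stabilizer legs are confirmed fully extended.
      2. The mobile base must not move while the legs are extended.
    """

    def __init__(self):
        # Process model variables start as unknown until feedback arrives.
        self.leg_state = LegState.UNKNOWN
        self.arm_stowed = None   # True / False / None (unknown)

    def update_feedback(self, leg_state: LegState, arm_stowed: bool) -> None:
        # The model is updated only from sensor feedback.
        self.leg_state = leg_state
        self.arm_stowed = arm_stowed

    def extend_arm(self) -> bool:
        # Arm movement is enabled only with legs confirmed extended.
        if self.leg_state is not LegState.EXTENDED:
            return False   # reject; caller must report the refusal
        return self._send("EXTEND_ARM")

    def retract_legs(self) -> bool:
        # Legs may be retracted only when the arm is confirmed stowed.
        if self.arm_stowed is not True:
            return False
        return self._send("RETRACT_LEGS")

    def move_base(self) -> bool:
        # Base movement is blocked unless the legs are confirmed retracted.
        if self.leg_state is not LegState.RETRACTED:
            return False
        return self._send("MOVE_BASE")

    def _send(self, command: str) -> bool:
        print(f"issuing {command}")
        return True

Note that the checks treat an unknown leg or arm state as unsafe, so a lost or stale feedback channel fails toward the safe side rather than permitting movement.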
More general principles of safe control algorithm functional design can also be identified by using the general causes of accidents as defined in STAMP (and used in STPA step 2), general engineering principles, and common design flaws that have led to accidents in the past.

Accidents related to software or system logic design often result from incompleteness and unhandled cases in the functional design of the controller. This incompleteness can be considered a requirements or functional design problem. Some requirements completeness criteria were identified in Safeware and specified using a state machine model. Here those criteria, plus additional design criteria, are translated into functional design principles for the components of the control loop.

In STAMP, accidents are caused by inadequate control. The controllers can be human or physical. This section focuses on design principles for the components of the control loop that are important whether a human is in the loop or not. Section 9.4 describes additional safety-related design principles that apply to systems that include human controllers. We cannot "design" human controllers, but we can design the environment or context in which they operate, the procedures they use, the control loops in which they operate, the processes they control, and the training they receive.

section 9.3.1. Controlled Process and Physical Component Design.
Protection against component failure accidents is well understood in engineering. Principles for safe design of common hardware systems (including sensors and actuators) with standard safety constraints are often systematized and encoded in checklists for an industry, such as mechanical or electrical design. In addition, most engineers have learned about the use of redundancy and overdesign (safety margins) to protect against component failures.

These standard design techniques are still relevant today but provide little or no protection against component interaction accidents. The added complexity of redundant designs may even increase the occurrence of these accidents. Figure 9.6 shows the design precedence described in Safeware. The highest precedence is to eliminate the hazard. If the hazard cannot be eliminated, then its likelihood of occurrence should be reduced, the likelihood of it leading to an accident should be reduced, and, at the lowest precedence, the design should reduce the potential damage incurred. Clearly, the higher the precedence level, the more effective and less costly the safety design effort will be. Because little here is new or derives from using the STAMP causality model, the reader is referred to Safeware and standard engineering references for more information.

section 9.3.2. Functional Design of the Control Algorithm.
Design for safety includes more than the physical components; it includes the control components as well. We start by considering the design of the control algorithm. The control algorithm is responsible for processing inputs and feedback, initializing and updating the process model, and using the process model plus other knowledge and inputs to produce control outputs. Each of these is considered in turn.

Designing and Processing Inputs and Feedback.
The basic function of the algorithm is to implement a feedback control loop, as defined by the controller responsibilities, along with appropriate checks to detect internal or external failures or errors.
Feedback is critical for safe control. Without feedback, controllers do not know whether their control actions were received and performed properly or whether they had the intended effect on the controlled process.

The controller must be designed to respond appropriately to the arrival of any possible input (that is, any input detectable by the sensors) at any time, as well as to the lack of an expected input over a given time period. Humans are better (and more flexible) at this task than automated controllers. Often automation is not designed to handle input arriving unexpectedly, for example, a target detection report from a radar that was previously sent a message to shut down.

All inputs should be checked for out-of-range or unexpected values, and a response to such values designed into the control algorithm. A surprising number of losses still occur because software was not programmed to handle unexpected inputs.

In addition, the time bounds (minimum and maximum) for every input should be checked and appropriate behavior provided in case the input does not arrive within these bounds. There should also be a response to the non-arrival of an input within a given amount of time (a timeout) for every variable in the process model. The controller must also be designed to respond to excessive inputs (overload conditions) in a safe way.

Because sensors and input channels can fail, there should be a minimum-arrival-rate check for each physically distinct communication path, and the controller should have the ability to query its environment with respect to inactivity over a given communication path. Traditionally these queries are called sanity or health checks. Care needs to be taken, however, to ensure that the response to a health check is distinct from the normal inputs and that potential hardware failures cannot affect the sanity checks. As an example of the latter, in June 1980 warnings were received at the U.S. command and control headquarters that a major nuclear attack had been launched against the United States. The military prepared for retaliation, but the officers at command headquarters were able to ascertain from direct contact with the warning sensors that no incoming missile had been detected, and the alert was canceled. Three days later, the same thing happened again. The false alerts were caused by the failure of a computer chip in a multiplexer system that formats messages sent out continuously to command posts to indicate that the communication circuits are operating properly. This health-check message was designed to report that 000 ICBMs and 000 SLBMs had been detected. Instead, the integrated circuit failure caused some of the zeros to be replaced with twos. After the problem was diagnosed, the message formats were changed to report only the status of the communication system and nothing about detecting ballistic missiles. Most likely, the developers thought it would be easier to have one common message format, but they did not consider the impact of erroneous hardware behavior.
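The checks just described (range checks, input time bounds, minimum-arrival-rate monitoring, and status-only health messages) can be sketched as follows. This is a minimal illustration only; the class name, numeric bounds, and the query message are hypothetical, and real values would come from the system requirements.

import time

VALID_RANGE = (0.0, 150.0)        # acceptable engineering units for this input
MAX_INPUT_AGE_S = 2.0             # an input older than this is stale
MIN_ARRIVAL_INTERVAL_S = 5.0      # expect at least one input this often

class InputChannel:
    """Sketch of defensive input processing for one sensor channel."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self.last_arrival = None

    def accept(self, value: float, timestamp: float):
        """Validate one input; return (ok, reason).
        Assumes the timestamp comes from the same monotonic clock."""
        now = self._now()
        self.last_arrival = now
        low, high = VALID_RANGE
        if not (low <= value <= high):
            return False, "out of range"      # a designed response is still required
        if now - timestamp > MAX_INPUT_AGE_S:
            return False, "stale input"       # violates the input time bounds
        return True, "ok"

    def check_liveness(self):
        """Detect a silent channel and request a health check."""
        now = self._now()
        if self.last_arrival is None or now - self.last_arrival > MIN_ARRIVAL_INTERVAL_S:
            # The query carries only channel status, never process data,
            # echoing the lesson of the early warning false alerts.
            return "QUERY_CHANNEL_STATUS"
        return None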
STAMP identifies inconsistency between the process model and the actual system state as a common cause of accidents. Besides incorrect feedback, as in the early warning system example, a common way for the process model to become inconsistent with the state of the actual process is for the controller to assume that an output command has been executed when it has not. The TTPS controller, for example, assumes that because it has sent a command to extend the stabilizer legs, the legs will, after a suitable amount of time, be extended. If commands cannot be executed for any reason, including timeouts, controllers have to know about it. To detect errors and failures in the actuators or the controlled process, there should be an input (feedback) that the controller can use to detect the effect of any output on the process.

This feedback, however, should not simply indicate that the command arrived at the controlled process, for example, that the command to open a valve was received by the valve; it should indicate that the valve actually opened. An explosion occurred in a U.S. Air Force system due to overpressurization when a relief valve failed to open after the operator sent a command to open it. Both the position indicator light and the open indicator light were illuminated on the control board. Believing the primary valve had opened, the operator did not open the secondary valve, which was to be used if the primary valve failed. A post-accident examination discovered that the indicator light circuit was wired to indicate the presence of a signal at the valve, not the valve position. The indicator therefore showed only that the activation button had been pushed, not that the valve had opened. An extensive quantitative safety analysis of this design had assumed a low probability of simultaneous failure of the two relief valves, but it ignored the possibility of a design error in the electrical wiring; the probability of that design error was not quantifiable. Many other accidents have involved a similar design flaw, including Three Mile Island.

When the feedback associated with an output is received, the controller must be able to handle the normal response as well as deal with feedback that is missing, too late, too early, or has an unexpected value.
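The valve example suggests a simple pattern: confirm the effect of a command from an independent measurement of the controlled variable, not from the command path itself. A minimal Python sketch follows, with hypothetical callables standing in for the actuation and sensing channels; it is not a description of the system in the accident, only an illustration of the principle.

import time

class ValveController:
    """Sketch: verify the effect of a command from position feedback,
    not from the fact that the command was sent or acknowledged."""

    VERIFY_TIMEOUT_S = 3.0   # hypothetical confirmation window

    def __init__(self, send_command, read_position_sensor, now=time.monotonic):
        self._send = send_command               # e.g., energize the valve actuator
        self._position = read_position_sensor   # independent position measurement
        self._now = now

    def open_valve(self) -> bool:
        self._send("OPEN")
        deadline = self._now() + self.VERIFY_TIMEOUT_S
        while self._now() < deadline:
            if self._position() == "OPEN":      # feedback from the valve itself
                return True
            time.sleep(0.1)
        # Command not confirmed: the controller must know and act, for
        # example by opening a secondary path or alerting the operator.
        return False

The check is only as good as its sensor: if read_position_sensor merely reports that the actuation signal was present, as in the Air Force wiring error, the confirmation gives false assurance.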
Initializing and Updating the Process Model.
Because the process model is used by the controller to determine what control commands to issue and when, the accuracy of the process model with respect to the controlled process is critical. As noted earlier, many software-related losses have resulted from such inconsistencies. STPA will identify which process model variables are critical to safety; the controller design must ensure that the controller receives and processes updates for these variables in a timely manner.

Sometimes normal updating of the process model is done correctly by the controller, but problems arise during initialization at startup and after a temporary shutdown. The process model must reflect the actual process state at initial startup and after a restart. It seems to be common, judging from the number of incidents and accidents that have resulted, for software designers to forget that the world continues to change even though the software may not be operating. When the computer controlling a process is temporarily shut down, perhaps for maintenance or updating of the software, it may restart with the assumption that the controlled process is still in the state it was in when the software was last operating. In addition, assumptions may be made about when the operation of the controller will be started, and these assumptions may be violated. For example, an assumption may be made that a particular aircraft system will be powered up and initialized before takeoff, and appropriate default values used in the process model for that case. In the event the system is not started at that time, or is shut down and then restarted after takeoff, the default startup values in the process model may not apply and may be hazardous.

Consider the mobile tile-processing robot introduced at the beginning of this chapter. The mobile base may be designed to allow manual retraction of the stabilizer legs if an emergency occurs while the robot is servicing the tiles and the robot must be physically moved out of the way. When the robot is restarted, the controller may assume that the stabilizer legs are still extended, and arm movements may be commanded that would violate the safety constraints.

The use of an unknown value can assist in protecting against this type of design flaw. At startup and after a temporary shutdown, process variables that reflect the state of the controlled process should be initialized with the value unknown and updated when new feedback arrives. This procedure resynchronizes the process model with the controlled process state. The control algorithm must also account, of course, for the proper behavior in case it needs to use a process model variable whose value is unknown.

Just as timeouts must be specified and handled for basic input processing, as described earlier, the maximum time the controller waits for the first input after startup must be determined, along with what to do if this time limit is violated. Once again, while human controllers will likely detect such a problem eventually, such as a failed input channel or one that was not restarted on system startup, computers will patiently wait forever if they are not given instructions to detect such a timeout and respond to it.

In general, the system and control loop should start in a safe state. Interlocks may need to be initialized or checked to be operational at system startup, including startup after the interlocks have been temporarily overridden.

Finally, the behavior of the controller with respect to input received before startup, after shutdown, or while the controller is temporarily disconnected from the process (offline) must be considered, and it must be determined whether this information can be safely ignored or, if it cannot, how it will be stored and later processed. One factor in the loss of an aircraft that took off from the wrong runway at Lexington Airport, for example, is that information about temporary changes in the airport taxiways was not reflected in the airport maps provided to the crew. The information about the changes, which was sent by the National Flight Data Center, was received by the map provider's computers at a time when they were not online, leading to airport charts that did not match the actual state of the airport. The document control system software used by the map provider was designed to make reports only of information received during business hours, Monday through Friday.
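The unknown-value principle and the startup timeout described in this subsection can be sketched as follows. The variable names and the ten-second window are hypothetical; the point is that no process model variable supplies a value until feedback has resynchronized it, and silence after startup is treated as a detectable failure rather than something to wait out forever.

import time

UNKNOWN = object()   # sentinel meaning "no feedback received yet"

class ProcessModel:
    """Sketch of a process model whose variables start as UNKNOWN."""

    STARTUP_FEEDBACK_TIMEOUT_S = 10.0   # hypothetical limit

    def __init__(self, variables, now=time.monotonic):
        self._values = {name: UNKNOWN for name in variables}
        self._now = now
        self._started_at = now()

    def update(self, name, value):
        # Called only when feedback from the controlled process arrives.
        self._values[name] = value

    def get(self, name):
        value = self._values[name]
        if value is UNKNOWN:
            # The control algorithm must have a defined response here,
            # such as refusing hazardous commands until resynchronized.
            raise LookupError(f"{name} not yet synchronized with the process")
        return value

    def startup_timed_out(self):
        # If no feedback arrives within the startup window, report it
        # instead of patiently waiting forever.
        waiting = any(v is UNKNOWN for v in self._values.values())
        return waiting and (self._now() - self._started_at
                            > self.STARTUP_FEEDBACK_TIMEOUT_S)

model = ProcessModel(["stabilizer_legs", "manipulator_arm"])
# After a restart the controller cannot assume the legs are still extended:
# model.get("stabilizer_legs") raises until leg-position feedback arrives.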
Producing Outputs.
The primary responsibility of the process controller is to produce commands that fulfill its control responsibilities. Again, the STPA hazard analysis and safety-guided design process will produce the application-specific behavioral safety requirements and constraints on controller behavior, but some general guidelines are also useful.

One general safety constraint is that the behavior of an automated controller should be deterministic: it should exhibit only one behavior for the arrival of any input in a particular state. While it is easy to design software with nondeterministic behavior, and doing so in some cases has advantages from a software point of view, nondeterministic behavior makes testing more difficult and, more important, makes it much harder for humans to learn how an automated system works and to monitor it. If humans are expected to control or monitor an automated system or an automated controller, then the behavior of the automation should be deterministic.

Just as inputs can arrive faster than the controller can process them, the absorption rate of the actuators and other recipients of output from the controller must be considered. Here the problem usually arises when a fast output device (such as a computer) is providing input to a slower device, such as a human. A contingency response must be designed for the case in which the output absorption rate limit is exceeded.

Three additional general considerations in the safe design of controllers are data age, latency, and fault handling.

Data age. No inputs or output commands are valid forever. The control loop design must account for inputs that are no longer valid and should not be used by the controller, and for outputs that cannot be executed immediately. All inputs used in the generation of output commands must be limited in the time they are used and marked as obsolete once that time limit has been exceeded. At the same time, the design of the control loop must account for outputs that are not executed within a given amount of time. As an example of what can happen when data age is not properly handled in the design, an engineer working in the cockpit of a B-1A aircraft issued a close weapons bay door command during a test. At the time, a mechanic working on the door had activated a mechanical inhibit on it. The close door command was not executed, but it remained active. Several hours later, when the door maintenance was completed, the mechanical inhibit was removed. The door closed unexpectedly, killing the worker.

Latency. Latency is the time interval during which receipt of new information cannot change an output even though it arrives prior to the output. While latency can be reduced by various design techniques, it cannot be eliminated completely. Controllers need to be informed about the arrival of feedback affecting previously issued commands and, if possible, provided with the ability to undo or to mitigate the effects of the now unwanted command.

Fault handling. Most accidents involve off-nominal processing modes, including startup, shutdown, and fault handling. The design of the control loop should assist the controller in handling these modes, and the designers need to focus particular attention on them.

The system design may allow for performance degradation and may be designed to fail into safe states or to allow partial shutdown and restart. Any fail-safe behavior that occurs in the process should be reported to the controller. In some cases, automated systems have been designed to fail so gracefully that human controllers are not aware of what is going on until they need to take control and may not be prepared to do so. Also, hysteresis needs to be provided in the control algorithm for transitions between off-nominal and nominal processing modes to avoid ping-ponging when the conditions that caused the controlled process to leave the normal state still exist or recur.
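The data-age principle above can be made concrete by attaching an explicit validity window to every output command, so that a command which cannot be executed immediately expires rather than remaining active indefinitely, as the weapons bay door command did. The five-second limit and the names below are hypothetical; this is a sketch of the idea, not of any particular system.

import time
from dataclasses import dataclass, field

@dataclass
class TimedCommand:
    """A command that carries its own validity window."""
    name: str
    max_age_s: float
    issued_at: float = field(default_factory=time.monotonic)

    def is_valid(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        return (now - self.issued_at) <= self.max_age_s

def execute_if_fresh(command: TimedCommand, actuator) -> bool:
    """Execute only commands still within their validity window; an
    expired command must be re-issued deliberately by the controller,
    and the refusal is reported so the process model stays accurate."""
    if not command.is_valid():
        return False
    actuator(command.name)
    return True

# Example: a door-close command that is valid for at most 5 seconds.
close_door = TimedCommand("CLOSE_DOOR", max_age_s=5.0)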
Hazardous functions have special requirements. Clearly, interlock failures should result in the halting of the functions they are protecting. In addition, the control algorithm design may differ after failures are detected, depending on whether the controller outputs are hazard-reducing or hazard-increasing. A hazard-increasing output is one that moves the controlled process to a more hazardous state, for example, arming a weapon. A hazard-reducing output is a command that leads to a reduced-risk state, for example, safing a weapon or any other command whose purpose is to maintain safety.

If a failure in the control loop, such as a sensor or actuator failure, could inhibit the production of a hazard-reducing command, there should be multiple ways to trigger such commands. On the other hand, multiple inputs should be required to trigger commands that can lead to hazardous states, so that they are not issued inadvertently. Any failure should inhibit the production of a hazard-increasing command. As an example of the latter condition, loss of the controller's ability to receive input, such as the failure of a sensor, that might inhibit the production of a hazardous output should prevent that output from being issued.

section 9.4.
Special Considerations in Designing for Human Controllers.
The design principles in section 9.3 apply whether the controller is automated or human, particularly when designing procedures for human controllers to follow. But humans do not always follow procedures, nor should they. We use humans to control systems because of their flexibility and adaptability to changing conditions and to incorrect assumptions made by the designers. Human error is an inevitable and unavoidable consequence, but appropriate design can assist in reducing human error and increasing safety in human-controlled systems.

Human error is not random. It results from basic human mental abilities and physical skills combined with the features of the tools being used, the tasks assigned, and the operating environment. We can use what is known about human mental abilities and design the other aspects of the system (the tools, the tasks, and the operating environment) to reduce and control human error to a significant degree.

The previous section described general principles for safe design. This section focuses on additional design principles that apply when humans control, either directly or indirectly, safety-critical systems.

section 9.4.1. Easy but Ineffective Approaches.
One tempting solution for engineers is simply to use human factors checklists. While many such checklists exist, they often do not distinguish among the qualities they enhance, which may not be related to safety and may even conflict with it. The only way such universal guidelines could be useful is if all design qualities were complementary and achieved in exactly the same way, which is not the case. Qualities conflict and require design tradeoffs and decisions about priorities.

Usability and safety, in particular, are often conflicting; an interface that is easy to use is not necessarily safe. As an example, a common guideline is to ensure that a user must enter data only once and that the computer can access that data if needed later for the same task or for different tasks. Duplicate entry, however, is required for the computer to detect entry errors unless the errors are so extreme that they violate reasonableness criteria. A small slip usually cannot be detected, and such entry errors have led to many accidents. Multiple entry of critical data can prevent such losses.
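A minimal sketch of duplicate entry for one critical value follows; the prompt, the reasonableness bound, and the function name are hypothetical, and a real interface would present this far less crudely. The point is only that an independent re-entry makes a single slip detectable, with a reasonableness check as a second line of defense.

def read_critical_value(prompt: str, read=input) -> float:
    """Require independent re-entry of a safety-critical value and
    compare the two entries before accepting either."""
    first = float(read(f"{prompt}: "))
    second = float(read(f"re-enter {prompt}: "))
    if first != second:
        raise ValueError("entries disagree; please start again")
    if not (0.0 < first <= 500.0):   # hypothetical reasonableness bound
        raise ValueError("value outside the plausible range")
    return first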
As another example, a design that displays data or instructions on a screen for an operator to check and verify by pressing the enter button minimizes the typing an operator must do. Over time, however, and after few errors are detected, operators get into the habit of pressing the enter key multiple times in rapid succession. This design feature has been implicated in many losses. For example, the Therac-25 was a linear accelerator that overdosed multiple patients during radiation therapy. In the original Therac-25 design, operators were required to enter the treatment parameters at the treatment site as well as on the computer console. After the operators complained about the duplication, the parameters entered at the treatment site were instead displayed on the console and the operator needed only to press the return key if they were correct. Operators soon became accustomed to pushing the return key quickly the required number of times without checking the parameters carefully.

The second easy but not very effective solution is to write procedures for human operators to follow and then assume the engineering job is done. Enforcing the following of procedures is unlikely, however, to lead to a high level of safety.

Dekker describes what he calls the "Following Procedures Dilemma". Operators must balance adapting procedures in the face of unanticipated conditions against sticking to procedures rigidly when cues suggest they should be adapted. If human controllers choose the former, that is, they adapt procedures when it appears the procedures are wrong, a loss may result when the human controller does not have complete knowledge of the circumstances or system state. In this case, the humans will be blamed for deviations and nonadherence to the procedures. If, on the other hand, they stick rigidly to the procedures (the control algorithm provided) when the procedures turn out to be wrong, they will be blamed for their inflexibility and for applying the rules in the wrong context. Hindsight bias is often involved in identifying what the operator should have known and done.

Insisting that operators always follow procedures does not guarantee safety, although it does usually guarantee that there is someone to blame, either for following the procedures or for not following them, when things go wrong. Safety comes from controllers being skillful in judging when and how procedures apply. As discussed in chapter 12, organizations need to monitor adherence to procedures not simply to enforce compliance but to understand how and why the gap between procedures and practice grows, and to use that information to redesign both the system and the procedures.

Section 8.5 of chapter 8 describes important differences between human and automated controllers. One of these differences is that the control algorithm used by humans is dynamic. This dynamic aspect of human control is why humans are kept in systems: they provide the flexibility to deviate from procedures when it turns out that the assumptions underlying the engineering design are wrong. But with this flexibility comes the possibility of unsafe changes in the dynamic control algorithm, which raises new design requirements: engineers and system designers must understand the reasons for such unsafe changes and prevent them through appropriate system design.
Just as engineers have the responsibility to understand the hazards in the physical systems they are designing and to control and mitigate them, engineers must also understand how their system designs can lead to human error and how they can design to reduce those errors.

Designing to prevent human error requires some basic understanding of the role humans play in systems and of human error itself.

section 9.4.2. The Role of Humans in Control Systems.
Humans can play a variety of roles in a control system. In the simplest cases, they create the control commands and apply them directly to the controlled process. For a variety of reasons, particularly speed and efficiency, the system may be designed with a computer between the human controller and the system. The computer may exist only in the feedback loop, processing and presenting data to the human operator. In other systems, the computer actually issues the control instructions, with the human operator either providing high-level supervision of the computer or simply monitoring the computer to detect errors or problems.

An unanswered question is what the best role is for humans in safety-critical process control. There are three choices beyond direct control: the human can monitor an automated control system, the human can act as a backup to the automation, or the human and the automation can both participate in control through some type of partnership. These choices are discussed in depth in Safeware and are only summarized here.

Unfortunately for the first option, humans make very poor monitors. They cannot sit and watch something without active control duties for any length of time and maintain vigilance. Tasks that require little active operator behavior may result in lowered alertness and can lead to complacency and overreliance on the automation. Complacency and lowered vigilance are exacerbated by the high reliability and low failure rate of automated systems.

But even if humans could remain vigilant while simply sitting and monitoring a computer that is performing the control tasks (and usually doing the right thing), Bainbridge has noted the irony that automatic control systems are installed because they can do the job better than humans, yet humans are then assigned the task of monitoring the automated system. Two questions arise:

1. The human monitor needs to know what the correct behavior of the controlled or monitored process should be; however, in complex modes of operation, for example, where the variables in the process have to follow a particular trajectory over time, evaluating whether the automated control system is performing correctly requires special displays and information that may be available only from the automated system being monitored. How will human monitors know when the computer is wrong if the only information they have comes from that computer? In addition, the information provided by an automated controller is more indirect, which may make it harder for humans to get a clear picture of the system. Failures may be silent or masked by the automation.

2. If the decisions can be specified fully, then a computer can make them more quickly and accurately than a human. How can humans monitor such a system? Whitfield and Ord found, for example, that air traffic controllers' appreciation of the traffic situation was reduced at the high traffic levels made feasible by using computers.
In such circumstances, humans must monitor the automated controller at some metalevel, deciding whether the computer's decisions are acceptable rather than completely correct. In case of a disagreement, should the human or the computer be the final arbiter?

Employing humans as backups is equally ineffective. Controllers need accurate process models to control effectively, but not being in active control leads to a degradation of their process models. At the time they need to intervene, it may take a while to "get their bearings", in other words, to update their process models so that effective and safe control commands can be given. In addition, controllers need both manual and cognitive skills, and both decline in the absence of practice. If human backups need to take over control from automated systems, they may be unable to do so effectively and safely. Computers are often introduced into safety-critical control loops because they increase system reliability, but at the same time, that high reliability provides little opportunity for human controllers to practice and maintain the skills and knowledge required to intervene when problems do occur.

It appears, at least for now, that humans will have to provide direct control or share control with automation, unless adequate confidence can be established in the automation to justify eliminating human monitors completely. Few systems exist today where such confidence can be achieved when safety is at stake. The problem then becomes one of finding the correct partnership and allocation of tasks between humans and computers. Unfortunately, this problem has not been solved, although some guidelines are presented later.

One of the things that make the problem difficult is that it is not just a matter of splitting responsibilities. Computer control is changing the cognitive demands on human controllers. Humans are increasingly supervising a computer rather than directly monitoring the process, leading to more cognitively complex decision making. Automation logic complexity and the proliferation of control modes are confusing humans. In addition, whenever there are multiple controllers, the requirements for cooperation and communication increase, not only between the human and the computer but also between humans interacting with the same computer, for example, the need for coordination among multiple people making entries to the computer. The consequences can be increased memory demands, new skill and knowledge requirements, and new difficulties in updating the human's process models.

A basic question that must be answered and implemented in the design is who will have the final authority if the human and the computer disagree about the proper control actions. In the loss of an Airbus 320 while landing at Warsaw in 1993, one of the factors was that the automated system prevented the pilots from activating the braking system until it was too late to prevent crashing into a bank built at the end of the runway. This automation feature was a protection device included to prevent the reverse thrusters from being accidentally deployed in flight, a presumed cause of a previous accident. For a variety of reasons, including water on the runway causing the aircraft wheels to hydroplane, the criteria used by the software logic to determine that the aircraft had landed were not satisfied by the feedback received by the automation.
Other incidents have occurred in which the pilots were confused about who was in control, the pilot or the automation, and found themselves fighting the automation.

One common design mistake is to set a goal of automating everything and then leaving for the human controllers whatever miscellaneous tasks are difficult to automate. The result is that the operator is left with an arbitrary collection of tasks for which little thought was given to providing support, particularly support for maintaining accurate process models. The remaining tasks may, as a consequence, be significantly more complex and error-prone. New tasks may be added, such as maintenance and monitoring, that introduce new types of errors. Partial automation, in fact, may not reduce operator workload but merely change the type of demands on the operator, potentially increasing workload. For example, cockpit automation may increase the demands on the pilots by creating a large number of data entry tasks during approach, when there is already a lot to do. These automation interaction tasks also create "heads down" work at a time when increased monitoring of nearby traffic is necessary.

By taking away the easy parts of the operator's job, automation may make the more difficult ones even harder. One causal factor here is that taking away or changing some operator tasks may make it difficult or even impossible for the operators to receive the feedback necessary to maintain accurate process models.

These factors need to be considered when designing the automation. A basic design principle is that automation should be designed to augment human abilities, not replace them, that is, to aid the operator, not to take over.

To design safe automated controllers with humans in the loop, designers need some basic knowledge about human error related to control tasks. In fact, Rasmussen has suggested that the term human error be replaced by considering such events as human-task mismatches.

section 9.4.3. Human Error Fundamentals.
Human error can be divided into the general categories of slips and mistakes [143, 144]. Basic to the difference is the concept of intention or desired action. A mistake is an error in the intention, that is, an error that occurs during the planning of an action. A slip, on the other hand, is an error in carrying out the intention. As an example, suppose an operator decides to push button A. If the operator instead pushes button B, it is a slip because the action did not match the intention. If the operator pushes A (carrying out the intention correctly), but it turns out that the intention was wrong, that is, button A should not have been pushed, then it is a mistake.

Designing to prevent slips involves different principles than designing to prevent mistakes. For example, making controls look very different or placing them far apart may reduce slips, but not mistakes. In general, designing to reduce mistakes is more difficult than reducing slips, which is relatively straightforward.

One of the difficulties in eliminating planning errors or mistakes is that such errors are often visible only in hindsight. With the information available at the time, the decisions may have seemed reasonable. In addition, planning errors are a necessary side effect of human problem-solving ability. Completely eliminating mistakes or planning errors, if that were possible, would also eliminate the need for humans as controllers.
Planning errors arise from the basic human cognitive ability to solve problems. Human error in one situation is human ingenuity in another. Human problem solving rests on several unique human capabilities, one of which is the ability to create hypotheses and test them, and thus to create new solutions to problems not previously considered. These hypotheses, however, may be wrong. Rasmussen has suggested that human error is often simply an unsuccessful experiment in an unkind environment, where an unkind environment is defined as one in which it is not possible for the human to correct the effects of inappropriate variations in performance before they lead to unacceptable consequences. He concludes that human performance is a balance between a desire to optimize skills and a willingness to accept the risk of exploratory acts.

A second basic human approach to problem solving is to try solutions that worked in other circumstances for similar problems. Once again, this approach is not always successful, and the inapplicability of old solutions or plans (learned procedures) may not be determinable without the benefit of hindsight.

The ability to use these problem-solving methods provides the advantages of human controllers over automated controllers, but success is not assured. Designers who understand the limitations of human problem solving can provide assistance in the design to avoid common pitfalls and to enhance human problem solving. For example, they may provide ways for operators to obtain extra information or to test hypotheses safely. At the same time, there are additional basic human cognitive characteristics that must be considered.

Hypothesis testing can be described in terms of basic feedback control concepts. Using the information in the process model, the controller generates a hypothesis about the controlled process. A test composed of control actions is created to generate feedback useful in evaluating the hypothesis, which in turn is used to update the process model and the hypothesis.

When controllers have no accurate diagnosis of a problem, they must make provisional assessments of what is going on based on uncertain, incomplete, and often contradictory information. That provisional assessment will guide their information gathering, but it may also lead to overattention to confirmatory evidence when processing feedback and updating process models while, at the same time, discounting information that contradicts their current diagnosis. Psychologists call this phenomenon cognitive fixation. The alternative is called thematic vagabonding, where the controller jumps from explanation to explanation, driven by the loudest or latest feedback or alarm, and never develops a coherent assessment of what is going on. Only hindsight can determine whether the controller should have abandoned one explanation for another. Sticking to one assessment can lead to more progress in many situations than jumping around and never pursuing a consistent planning process.

Plan continuation is another characteristic of human problem solving related to cognitive fixation. Commitment to a preliminary diagnosis can lead to sticking with the original plan even though the situation has changed and calls for a different plan. Orasanu notes that early cues suggesting an initial plan is correct are usually very strong and unambiguous, helping to convince people to continue the plan.
Later feedback suggesting the plan should be abandoned is typically weaker and more ambiguous. Conditions may deteriorate gradually. Even when controllers receive and acknowledge this feedback, the new information may not change their plan, especially if abandoning the plan is costly in terms of organizational and economic consequences. In the latter case, it is not surprising that controllers will seek and focus on confirmatory evidence and will need a lot of contradictory evidence to justify changing their plan.

Cognitive fixation and plan continuation are compounded by stress and fatigue. These two factors make it more difficult for controllers to juggle multiple hypotheses about a problem or to project a situation into the future by mentally simulating the effects of alternative plans.

Automated tools can be designed to assist the controller in planning and decision making, but they must embody an understanding of these basic cognitive limitations and assist human controllers in overcoming them. At the same time, care must be taken that any simulation or other planning tools intended to assist human problem solving do not rest on the same incorrect assumptions about the system that led to the problems in the first place.

Another useful distinction is between errors of omission and errors of commission. Sarter and Woods note that in older, less complex aircraft cockpits, most pilot errors were errors of commission that occurred as a result of a pilot control action. Because the controller, in this case the pilot, took a direct action, he or she is likely to check that the intended effect of the action has actually occurred. The short feedback loops allow the operators to repair most errors before serious consequences result. This type of error is still the prevalent one for relatively simple devices.

In contrast, studies of more advanced automation in aircraft find that errors of omission are the dominant form of error. Here the controller does not implement a control action that is required. The operator may not notice that the automation has done something because that automation behavior was not explicitly invoked by an operator action. Because the behavioral changes are not expected, the human controller is less likely to pay attention to relevant indications and feedback, particularly during periods of high workload.

Errors of omission are related to the change of human roles in systems from direct controllers to monitors, exception handlers, and supervisors of automated controllers. As their roles change, the cognitive demands on humans may not be reduced but may instead change in their basic nature. These changes tend to be more pronounced at high-tempo and high-criticality periods. So while some types of human errors have declined, new types of errors have been introduced.

The difficulty, and perhaps impossibility, of eliminating human error does not mean that greatly improved system design in this respect is not possible. System design can take advantage of human cognitive capabilities and minimize the errors that may result from their limitations. The rest of the chapter provides some principles for creating designs that better support humans in controlling safety-critical processes and reduce human errors.
+9.4.4 Providing Control Options +If the system design goal is to make humans responsible for safety in control systems, +then they must have adequate flexibility to cope with undesired and unsafe behavior +and not be constrained by inadequate control options. Three general design principles apply. design for redundancy, design for incremental control, and design for +error tolerance. +Design for redundant paths. One helpful design feature is to provide multiple +physical devices and logical paths to ensure that a single hardware failure or +software error cannot prevent the operator from taking action to maintain a +safe system state and avoid hazards. There should also be multiple ways to change + + +from an unsafe to a safe state, but only one way to change from an unsafe to a +safe state. +Design for incremental control. Incremental control makes a system easier to +control, both for humans and computers, by performing critical steps incrementally +rather than in one control action. The common use of incremental arm, aim, fire +sequences is an example. The controller should have the ability to observe the +system and get feedback to test the validity of the assumptions and models upon +which the decisions are made. The system design should also provide the controller +with compensating control actions to allow modifying or aborting previous control +actions before significant damage is done. An important consideration in designing +for controllability in general is to lower the time pressures on the controllers, if +possible. +The design of incremental control algorithms can become complex when a human +controller is controlling a computer, which is controlling the actual physical process, +in a stressful and busy environment, such as a military aircraft. If one of the commands in an incremental control sequence cannot be executed within a specified +period of time, the human operator needs to be informed about any delay or postponement or the entire sequence should be canceled and the operator informed. At +the same time, interrupting the pilot with a lot of messages that may not be critical +at a busy time could also be dangerous. Careful analysis is required to determine +when multistep controller inputs can be preempted or interrupted before they are +complete and when feedback should occur that this happened . +Design for error tolerance. Rasmussen notes that people make errors all the time, +but we are able to detect and correct them before adverse consequences occur . +System design can limit people’s ability to detect and recover from their errors. He +defined a system design goal of error tolerant systems. In these systems, errors are +observable .(within an appropriate time limit). and they are reversible before unacceptable consequences occur. The same applies to computer errors. they should be +observable and reversible. +The general goal is to allow controllers to monitor their own performance. To +achieve this goal, the system design needs to. +1. Help operators monitor their actions and recover from errors. +2. Provide feedback about actions operators took and their effects, in case the +actions were inadvertent. Common examples are echoing back operator inputs +or requiring confirmation of intent. +3. Allow for recovery from erroneous actions. The system should provide control +options, such as compensating or reversing actions, and enough time for recovery actions to be taken before adverse consequences result. 
section 9.4.5. Matching Tasks to Human Characteristics.
In general, the designer should tailor systems to human requirements rather than the other way around. Engineered systems are easier to change than human behavior.

Because humans without direct control tasks lose vigilance, the design should combat lack of alertness by making human tasks stimulating and varied, providing good feedback, and requiring active involvement of the human controllers in most operations. Maintaining manual involvement is important not just for alertness but also for getting the information needed to update process models.

Maintaining active engagement in the tasks means that designers must distinguish between providing help to human controllers and taking over. Human tasks should not be oversimplified, and tasks involving passive or repetitive actions should be minimized. Allowing latitude in how tasks are accomplished not only reduces monotony and error proneness but also introduces flexibility that helps operators improvise when a problem cannot be solved by a limited set of behaviors. Many accidents have been avoided when operators jury-rigged devices or improvised procedures to cope with unexpected events. Physical failures may cause some paths to become nonfunctional, and flexibility in achieving goals can provide alternatives.

Designs should also be avoided that require or encourage management by exception, in which controllers wait for alarm signals before taking action. Management by exception does not allow controllers to prevent disturbances by looking for early warnings and trends in the process state. For operators to anticipate undesired events, they need to continuously update their process models. Experiments by Swaanenburg and colleagues found that management by exception is not the strategy human controllers adopt as their normal supervisory mode. Avoiding management by exception requires active involvement in the control task and adequate feedback to update process models. A display that provides only an overview and no detailed information about the process state, for example, may not provide the information necessary for detecting imminent alarm conditions.

Finally, if designers expect operators to react correctly to emergencies, they need to design to support them in these tasks and to help them fight some basic human tendencies described previously, such as cognitive fixation and plan continuation. The system design should support human controllers in decision making and planning activities during emergencies.

section 9.4.6. Designing to Reduce Common Human Errors.
Some human errors are so common and unnecessary that there is little excuse for not designing to prevent them. Care must be taken, though, that the attempt to reduce erroneous actions does not prevent the human controller from intervening in an emergency when the assumptions made during design about what should and should not be done turn out to be incorrect.

One fundamental design goal is to make safety-enhancing actions easy, natural, and difficult to omit or do wrong. In general, the design should make it more difficult for the human controller to operate unsafely than safely. If safety-enhancing actions are easy, they are less likely to be bypassed intentionally or accidentally.
Stopping an unsafe action or leaving an unsafe state should be possible with a single keystroke that moves the system into a safe state. The design should make fail-safe actions easy and natural, and difficult to avoid, omit, or do wrong.

In contrast, two or more unique operator actions should be required to start any potentially hazardous function or sequence of functions. Hazardous actions should be designed to minimize the potential for inadvertent activation; they should not, for example, be initiated by pushing a single key or button (see the preceding discussion of incremental control). A sketch of this pattern appears at the end of this subsection.

The general design goal should be to enhance the ability of the human controller to act safely while making it more difficult to behave unsafely. Initiating a potentially unsafe process change, such as a spacecraft launch, should require multiple keystrokes or actions, while stopping a launch should require only one.

Safety may be enhanced by using procedural safeguards, where the operator is instructed to take or avoid specific actions, or by designing safeguards into the system. The latter is much more effective. For example, if the potential error involves leaving out a critical action, either the operator can be instructed to always take that action or the action can be made an integral part of the process. A typical error during maintenance is failing to return equipment (such as safety interlocks) to the operational mode. The accident sequence at Three Mile Island was initiated by such an error. An action that is isolated and has no immediate relation to the "gestalt" of the repair or testing task is easily forgotten. Instead of stressing the need to be careful (the usual approach), change the system: integrate the act physically into the task, make detection a physical consequence of the tool design, or change operations planning or review. That is, change the design or the management rather than trying to change the human.

To enhance decision making, references should be provided for making judgments, such as marking meters with safe and unsafe limits. Because humans often revert to stereotype and cultural norms, such norms should be followed in the design. Keeping things simple, natural, and similar to what has been done before (avoiding gratuitous design changes) is a good way to avoid errors when humans are working under stress, are distracted, or are performing tasks while thinking about something else.

To help prevent sequencing errors, controls should be placed in the sequence in which they are to be used. At the same time, similarity, proximity, interference, or awkward location of critical controls should be avoided. Where operators have to perform different classes or types of control actions, the sequences should be made as dissimilar as possible.

Finally, one of the most effective design techniques for reducing human error is to design so that the error is not physically possible or so that errors are obvious. For example, valves can be designed so they cannot be interchanged by making the connections different sizes, or assembly errors can be prevented by using asymmetric or male and female connections. Connection errors can also be made obvious by color coding. Amazingly, in spite of hundreds of deaths over decades due to misconnected tubes in hospitals, such as a feeding tube inadvertently connected to a tube inserted in a patient's vein, regulators, hospitals, and tube manufacturers have taken no action to implement this standard safety design technique.
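The two-action pattern mentioned above might be sketched as follows: starting the hazardous operation requires a distinct arm action followed shortly by a start command, while a single abort action always returns the system to a safe state. The class name and the confirmation window are hypothetical illustrations only.

import time

class LaunchControl:
    """Sketch: two deliberate actions to start, one action to stop."""

    CONFIRM_WINDOW_S = 5.0   # hypothetical window between arm and start

    def __init__(self, now=time.monotonic):
        self._now = now
        self._armed_at = None
        self.running = False

    def arm(self):
        # First, distinct action; by itself it starts nothing.
        self._armed_at = self._now()

    def start(self):
        armed_recently = (self._armed_at is not None and
                          self._now() - self._armed_at <= self.CONFIRM_WINDOW_S)
        if not armed_recently:
            return False    # a single keypress cannot start the operation
        self.running = True
        self._armed_at = None
        return True

    def abort(self):
        # A single action always suffices to reach the safe state.
        self.running = False
        self._armed_at = None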
section 9.4.7. Support in Creating and Maintaining Accurate Process Models.
Human controllers who are supervising automation have two process models to maintain: one for the process being controlled by the automation and one for the automated controller itself. The design should support human controllers in maintaining both of these models. An appropriate goal here is to provide humans with the facilities to experiment and learn about the systems they are controlling, either directly or indirectly. Operators should also be allowed to maintain manual involvement in order to update their process models, maintain their skills, and preserve their self-confidence. Simply observing will degrade human supervisory skills and confidence.

When human controllers are supervising automated controllers, the automation has extra design requirements: the control algorithm used by the automation must be learnable and understandable. Two common design flaws in automated controllers are inconsistent behavior and unintended side effects.

Inconsistent Behavior.
Carroll and Olson define a consistent design as one where a similar task or goal is associated with similar or identical actions. Consistent behavior on the part of the automated controller makes it easier for the human providing supervisory control to learn how the automation works, to build an appropriate process model for it, and to anticipate its behavior.

An example of inconsistency, detected in an A320 simulator study, involved an aircraft go-around below 100 feet above ground level. Sarter and Woods found that pilots failed to anticipate and realize that the autothrust system did not arm when they selected takeoff/go-around (TOGA) power under these conditions, because it did so under all other circumstances in which TOGA power is applied.

Another example of inconsistent automation behavior, implicated in an A320 accident, is a protection function that is provided in all automation configurations except the specific mode (in this case altitude acquisition) in which the autopilot was operating.

Human factors for critical systems have been studied most extensively in aircraft cockpit design. Studies have found that consistency is most important in high-tempo, highly dynamic phases of flight, where pilots have to rely on their automatic systems to work as expected without constant monitoring. Even in lower-pressure situations, consistency (or predictability) is important in light of evidence from pilot surveys that their normal monitoring behavior may change on high-tech flight decks.

Pilots on conventional aircraft use a highly trained instrument-scanning pattern of recurrently sampling a given set of basic flight parameters. In contrast, some A320 pilots report that they no longer scan but instead allocate their attention within and across cockpit displays on the basis of expected automation states and behaviors. Parameters that are not expected to change may be neglected for a long time. If the automation behavior is not consistent, errors of omission may occur in which the pilot does not intervene when necessary.
+In section 9.3.2, determinism was identified as a safety design feature for automated controllers. Consistency, however, requires more than deterministic behavior. +If the operator provides the same inputs but different outputs .(behaviors). result for +some reason other than what the operator has done .(or may even know about), +then the behavior is inconsistent from the operator viewpoint even though it is +deterministic. While the designers may have good reasons for including inconsistent +behavior in the automated controller, there should be a careful tradeoff made with +the potential hazards that could result. +Unintended Side Effects. +Incorrect process models can result when an action intended to have one effect has +an additional side effect not easily anticipated by the human controller. An example +occurred in the Sarter and Woods A320 aircraft simulator study cited earlier. Because +the approach to the destination airport is such a busy time for the pilots and the +automation requires so much heads down work, pilots often program the automation as soon as the air traffic controllers assign them a runway. Sarter and Woods +found that the experienced pilots in their study were not aware that entering a +runway change after entering data for the assigned approach results in the deletion +by the automation of all the previously entered altitude and speed constraints, even +though they may still apply. + + +Once again, there may be good reason for the automation designers to include +such side effects, but they need to consider the potential for human error that +can result. +Mode Confusion. +Modes define mutually exclusive sets of automation behaviors. Modes can be used +to determine how to interpret inputs or to define required controller behavior. Four +general types of modes are common. controller operating modes, supervisory modes, +display modes, and controlled process modes. +Controller operating modes define sets of related behavior in the controller, such +as shutdown, nominal behavior, and fault-handling. +Supervisory modes determine who or what is controlling the component at any +time when multiple supervisors can assume control responsibilities. For example, a +flight guidance system in an aircraft may be issued direct commands by the pilot(s) +or by another computer that is itself being supervised by the pilot(s). The movement +controller in the thermal tile processing system might be designed to be in either +manual supervisory mode .(by a human controller). or automated mode .(by the +TTPS task controller). Coordination of control actions among multiple supervisors +can be defined in terms of these supervisory modes. Confusion about the current +supervisory mode can lead to hazardous system behavior. +A third type of common mode is a display mode. The display mode will +affect the information provided on the display and how the user interprets that +information. +A final type of mode is the operating mode of the controlled process. For example, +the mobile thermal tile processing robot may be in a moving mode .(between work +areas). or in a work mode .(in a work area and servicing tiles, during which time it +may be controlled by a different controller). The value of this mode may determine +whether various operations.for example, extending the stabilizer legs or the +manipulator arm.are safe. +Early automated systems had a fairly small number of independent modes. 
+They provided a passive background on which the operator would act by entering
+target data and requesting system operations. They also had only one overall mode
+setting for each function performed. Indications of currently active mode and of
+transitions between modes could be dedicated to one location on the display.
+The consequences of breakdown in mode awareness were fairly small in these
+system designs. Operators seemed able to detect and recover from erroneous actions
+relatively quickly before serious problems resulted. Sarter and Woods conclude that,
+in most cases, mode confusion in these simpler systems is associated with errors
+of commission, that is, with errors that require a controller action in order for the
+problem to occur . Because the human controller has taken an explicit action,
+he or she is likely to check that the intended effect of the action has actually
+occurred. The short feedback loops allow the controller to repair most errors quickly,
+as noted earlier.
+The flexibility of advanced automation allows designers to develop more complicated,
+mode-rich systems. The result is numerous mode indications often spread
+over multiple displays, each containing just that portion of mode status data
+corresponding to a particular system or subsystem. The designs also allow for
+interactions across modes. The increased capabilities of automation can, in addition, lead
+to increased delays between user input and feedback about system behavior.
+These new mode-rich systems increase the need for and difficulty of maintaining
+mode awareness, which can be defined in STAMP terms as keeping the controlled-system
+operating mode in the controller’s process model consistent with the actual
+controlled-system mode. A large number of modes challenges human ability to
+maintain awareness of active modes, armed modes, interactions between environmental
+status and mode behavior, and interactions across modes. It also increases
+the difficulty of error or failure detection and recovery.
+Calling for systems with fewer or less complex modes is probably unrealistic.
+Simplifying modes and automation behavior often requires tradeoffs with precision
+or efficiency and with marketing demands from a diverse set of customers .
+Systems with accidental .(unnecessary). complexity, however, can be redesigned to
+reduce the potential for human error without sacrificing system capabilities. Where
+tradeoffs with desired goals are required to eliminate potential mode confusion
+errors, system and interface design, informed by hazard analysis, can help find
+solutions that require the fewest tradeoffs. For example, accidents most often occur
+during transitions between modes, particularly normal and nonnormal modes, so
+these transitions should have more stringent design constraints applied to them.
+Understanding more about particular types of mode confusion errors can assist
+with design. Two common types leading to problems are interface interpretation
+modes and indirect mode changes.
+Interface Interpretation Mode Confusion. Interface mode errors are the classic
+form of mode confusion error.
+1. Input-related errors. The software interprets user-entered values differently
+than intended.
+2. Output-related errors. The software maps multiple conditions onto the same
+output, depending on the active controller mode, and the operator interprets
+the interface incorrectly.
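+Both classes share one structure: a single symbol at the interface carries more than one
+meaning, and the active mode silently selects among them. The fragment below illustrates
+that structure schematically; the modes, values, and messages are hypothetical and are not
+drawn from any particular device.
+
+# Schematic illustration of interface interpretation modes (hypothetical).
+def interpret_entry(interface_mode, text):
+    # Input-related: identical keystrokes, different meanings per mode.
+    value = float(text)
+    if interface_mode == "RATE":
+        return ("units_per_minute", value)
+    return ("absolute_setpoint", value)
+
+# Output-related: distinct internal conditions share one message, and the
+# meaning of that message depends on whichever operating mode is active.
+MESSAGES = {
+    ("STARTUP", "sensor_fault"): "CHECK",
+    ("STARTUP", "limit_exceeded"): "CHECK",
+    ("RUN", "sensor_fault"): "CHECK",
+    ("RUN", "limit_exceeded"): "LIMIT",
+}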
+A common example of an input interface interpretation error occurs with many
+word processors where the user may think they are in insert mode but instead they
+are in insert and delete mode or in command mode, so their input is interpreted
+in a different way and results in different behavior than they intended.
+A more complex example occurred in what is believed to be a cause of an A320
+aircraft accident. The crew directed the automated system to fly in the track/flight
+path angle mode, which is a combined mode related to both lateral .(track). and
+vertical .(flight path angle). navigation.
+When they were given radar vectors by the air traffic controller, they may have switched
+from the track to the hdg sel mode to be able to enter the heading requested by the
+controller. However, pushing the button to change the lateral mode also automatically
+changes the vertical mode from flight path angle to vertical speed.the mode switch
+button affects both lateral and vertical navigation. When the pilots subsequently entered
+“33” to select the desired flight path angle of 3.3 degrees, the automation interpreted
+their input as a desired vertical speed of 3,300 feet per minute. This was not intended
+by the pilots, who were not aware of the active “interface mode” and failed to detect the
+problem. As a consequence of the too steep descent, the airplane crashed into a mountain .
+An example of an output interface mode problem was identified by Cook et al.
+in a medical operating room device with two operating modes. warmup and normal.
+The device starts in warmup mode when turned on and changes from normal mode
+to warmup mode whenever either of two particular settings is adjusted by the operator.
+The meaning of alarm messages and the effect of controls are different in these
+two modes, but neither the current device operating mode nor a change in mode is
+indicated to the operator. In addition, four distinct alarm-triggering conditions are
+mapped onto two alarm messages so that the same message has different meanings
+depending on the operating mode. In order to understand what internal condition
+triggered the message, the operator must infer which malfunction is being indicated
+by the alarm.
+Several design constraints can assist in reducing interface interpretation errors.
+At a minimum, any mode used to control interpretation of the supervisory interface
+should be annunciated to the supervisor. More generally, the current operating
+mode of the automation should be displayed at all times. In addition, any change of
+operating mode should trigger a change in the current operating mode reflected in
+the interface and thus displayed to the operator, that is, the annunciated mode must
+be consistent with the internal mode.
+A stronger design choice, but perhaps less desirable for various reasons, might
+be not to condition the interpretation of the supervisory interface on modes at all.
+Another possibility is to simplify the relationships between modes, for example, in
+the A320, the lateral and vertical modes might be separated with respect to the
+heading select mode. Other alternatives are to make the required inputs different
+to lessen confusion .(such as 3.3 and 3,300 rather than 33), or the mode indicator
+on the control panel could be made clearer as to the current mode. While simply
+annunciating the mode may be adequate in some cases, annunciations can easily
+be missed for a variety of reasons, and additional design features should be
+considered.
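+Several of these constraints can be enforced directly in the interface software. The sketch
+below shows one simplified possibility, with hypothetical names and limits: the annunciated
+mode is derived from the internal mode rather than maintained separately, and an entry whose
+magnitude is implausible for the active interpretation is rejected instead of being silently
+accepted.
+
+# Simplified sketch of the constraints above (hypothetical names and limits).
+class VerticalTargetEntry:
+    def __init__(self):
+        self.mode = "FLIGHT_PATH_ANGLE"    # internal interpretation mode
+
+    def annunciated_mode(self):
+        # The displayed mode is computed from the internal mode, so the two
+        # cannot disagree.
+        return self.mode
+
+    def parse(self, text):
+        value = float(text)
+        if self.mode == "FLIGHT_PATH_ANGLE":
+            if value >= 10.0:              # "33" is not a plausible angle
+                raise ValueError("enter an angle such as 3.3, not " + text)
+            return ("degrees", value)
+        if value < 100.0:                  # "3.3" is not a plausible rate
+            raise ValueError("enter a rate such as 3300, not " + text)
+        return ("feet_per_minute", value)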
+Mode Confusion Arising from Indirect Mode Changes. Indirect mode changes +occur when the automation changes mode without an explicit instruction or direct +command by the operator. Such transitions may be triggered on conditions in the +automation, such as preprogrammed envelope protection. They may also result from +sensor input to the computer about the state of the computer-controlled process, +such as achievement of a preprogrammed target or an armed mode with a preselected mode transition. An example of the latter is a mode in which the autopilot +might command leveling off of the plane once a particular altitude is reached. the +operating mode of the aircraft .(leveling off). is changed when the altitude is reached +without a direct command to do so by the pilot. In general, the problem occurs when +activating one mode can result in the activation of different modes depending on +the system status at the time. +There are four ways to trigger a mode change. +1. The automation supervisor explicitly selects a new mode. +2. The automation supervisor enters data .(such as a target altitude). or a command +that leads to a mode change. +a. Under all conditions. +b. When the automation is in a particular state +c. When the automation’s controlled system model or environment is in a +particular state. +3. The automation supervisor does not do anything, but the automation logic +changes mode as a result of a change in the system it is controlling. +4. The automation supervisor selects a mode change but the automation does +something else, either because of the state of the automation at the time or +the state of the controlled system. +Again, errors related to mode confusion are related to problems that human supervisors of automated controllers have in maintaining accurate process models. +Changes in human controller behavior in highly automated systems, such as the +changes in pilot scanning behavior described earlier, are also related to these types +of mode confusion error. +Behavioral expectations about the automated controller behavior are formed +based on the human supervisors’ knowledge of the input to the automation and +on their process models of the automation. Gaps or misconceptions in this model + + +may interfere with predicting and tracking indirect mode transitions or with understanding the interactions among modes. +An example of an accident that has been attributed to an indirect mode change +occurred while an A320 was landing in Bangalore, India . The pilot’s selection +of a lower altitude while the automation was in the altitude acquisition mode +resulted in the activation of the open descent mode, where speed is controlled only +by the pitch of the aircraft and the throttles go to idle. In that mode, the automation +ignores any preprogrammed altitude constraints. To maintain pilot-selected speed +without power, the automation had to use an excessive rate of descent, which led +to the aircraft crashing short of the runway. +Understanding how this could happen is instructive in understanding just how +complex mode logic can get. There are three different ways to activate open descent +mode on the A320. +1. Pull the altitude knob after selecting a lower altitude. +2. Pull the speed knob when the aircraft is in expedite mode. +3. Select a lower altitude while in altitude acquisition mode. +It was the third condition that is suspected to have occurred. 
The pilot must not +have been aware the aircraft was within 200 feet of the previously entered target +altitude, which triggers altitude acquisition mode. He therefore may not have +expected selection of a lower altitude at that time to result in a mode transition and +did not closely monitor his mode annunciations during this high workload time. He +discovered what happened ten seconds before impact, but that was too late to +recover with the engines at idle . +Other factors contributed to his not discovering the problem until too late, one +of which is the problem in maintaining consistent process models when there are +multiple controllers as discussed in the next section. The pilot flying .(PF). had disengaged his flight director1 during approach and was assuming the pilot not flying +(PNF). would do the same. The result would have been a mode configuration in +which airspeed is automatically controlled by the autothrottle .(the speed mode), +which is the recommended procedure for the approach phase of flight. The PNF +never turned off his flight director, however, and the open descent mode became +active when a lower altitude was selected. This indirect mode change led to the +hazardous state and eventually the accident, as noted earlier. But a complicating +factor was that each pilot only received an indication of the status of his own flight + +director and not all the information necessary to determine whether the desired +mode would be engaged. The lack of feedback and resulting incomplete knowledge +of the aircraft state .(incorrect aircraft process model). contributed to the pilots not +detecting the unsafe state in time to correct it. +Indirect mode transitions can be identified in software designs. What to do in +response to identifying them or deciding not to include them in the first place is +more problematic and the tradeoffs and mitigating design features must be considered for each particular system. The decision is just one of the many involving the +benefits of complexity in system design versus the hazards that can result. + + +footnote. The flight director is automation that gives visual cues to the pilot via an easily interpreted display of +the aircraft’s flight path. The preprogrammed path, automatically computed, furnishes the steering commands necessary to obtain and hold a desired path. + + +Coordination of Multiple Controller Process Models. +When multiple controllers are engaging in coordinated control of a process, inconsistency between their process models can lead to hazardous control actions. Careful +design of communication channels and coordinated activity is required. In aircraft, +this coordination, called crew resource management, is accomplished through careful +design of the roles of each controller to enhance communication and to ensure +consistency among their process models. +A special case of this problem occurs when one human controller takes over +for another. The handoff of information about both the state of the controlled +process and any automation being supervised by the human must be carefully +designed. +Thomas describes an incident involving loss of communication for an extended +time between ground air traffic control and an aircraft . In this incident, a +ground controller had taken over after a controller shift change. 
Aircraft are passed +from one air traffic control sector to another through a carefully designed set of +exchanges, called a handoff, during which the aircraft is told to switch to the radio +frequency for the new sector. When, after a shift change the new controller gave an +instruction to a particular aircraft and received no acknowledgment, the controller +decided to take no further action; she assumed that the lack of acknowledgment +was an indication that the aircraft had already switched to the new sector and was +talking to the next controller. +Process model coordination during shift changes is partially controlled in a +position relief briefing. This briefing normally covers all aircraft that are currently +on the correct radio frequency or have not checked in yet. When the particular flight +in question was not mentioned in the briefing, the new controller interpreted that +as meaning that the aircraft was no longer being controlled by this station. She did +not call the next controller to verify this status because the aircraft had not been +mentioned in the briefing. +The design of the air traffic control system includes redundancy to try to avoid +errors.if the aircraft does not check in with the next controller, then that controller + + +would call her. When she saw the aircraft .(on her display). leave her airspace and +no such call was received, she interpreted that as another indication that the aircraft +was indeed talking to the next controller. +A final factor implicated in the loss of communication was that when the new +controller took over, there was little traffic at the aircraft’s altitude and no danger +of collision. Common practice for controllers in this situation is to initiate an early +handoff to the next controller. So although the aircraft was only halfway through +her sector, the new controller assumed an early handoff had occurred. +An additional causal factor in this incident involves the way controllers track +which aircraft have checked in and which have already been handed off to the +next controller. The old system was based on printed flight progress strips and +included a requirement to mark the strip when an aircraft had checked in. The +new system uses electronic flight progress strips to display the same information, +but there is no standard method to indicate the check-in has occurred. Instead, +each individual controller develops his or her own personal method to keep track +of this status. In this particular loss of communication case, the controller involved +would type a symbol in a comment area to mark any aircraft that she had already +handed off to the next sector. The controller that was relieved reported that he +usually relied on his memory or checked a box to indicate which aircraft he was +communicating with. +That a carefully designed and coordinated process such as air traffic control can +suffer such problems with coordinating multiple controller process models .(and +procedures). attests to the difficulty of this design problem and the necessity for +careful design and analysis. + +section 9.4.8. Providing Information and Feedback. +Designing feedback in general was covered in section 9.3.2. This section covers +feedback design principles specific to human controllers. Important problems in +designing feedback include what information should be provided, how to make the +feedback process more robust, and how the information should be presented to +human controllers. + +Types of Feedback. 
+Hazard analysis using STPA will provide information about the types of feedback
+needed and when. Some additional guidance can be provided to the designer, once
+again, using general safety design principles.
+Two basic types of feedback are needed.
+1. The state of the controlled process. This information is used to .(1). update the
+controllers’ process models and .(2). detect faults and failures in the other
+parts of the control loop, system, and environment.
+2. The effect of the controllers’ actions. This feedback is used to detect human
+errors. As discussed in the section on design for error tolerance, the key to
+making errors observable.and therefore remediable.is to provide feedback
+about them. This feedback may be in the form of information about the effects
+of controller actions, or it may simply be information about the action itself
+on the chance that it was inadvertent.
+
+Updating Process Models.
+Updating process models requires feedback about the current state of the system
+and any changes that occur. In a system where rapid response by operators is
+necessary, timing requirements must be placed on the feedback information that the
+controller uses to make decisions. In addition, when task performance requires or
+implies a need for the controller to assess the timeliness of information, the feedback
+display should include time and date information associated with the data.
+When a human controller is supervising or monitoring automation, the automation
+should provide an indication to the controller and to bystanders that it is functioning.
+The addition of a light to the power interlock example in chapter 8 is a simple
+example of this type of feedback. For robot systems, bystanders should be signaled
+when the machine is powered up or a warning provided when a hazardous zone is
+entered. An assumption should not be made that humans will not have to enter the
+robot’s area. In one fully automated plant, an assumption was made that the robots
+would be so reliable that the human controllers would not have to enter the plant
+often and, therefore, the entire plant could be powered down when entry was
+required. The designers did not provide the usual safety features such as elevated
+walkways for the humans and alerts, such as aural warnings, when a robot was moving
+or about to move. After plant startup, the robots turned out to be so unreliable that
+the controllers had to enter the plant and bail them out several times during a shift.
+Because powering down the entire plant had such a negative impact on productivity,
+the humans got into the habit of entering the automated area of the plant without
+powering everything down. The inevitable occurred and someone was killed .
+The automation should provide information about its internal state .(such as the
+state of sensors and actuators), its control actions, its assumptions about the state
+of the system, and any anomalies that might have occurred. Processing requiring
+several seconds should provide a status indicator so human controllers can distinguish
+automated system processing from failure. In one nuclear power plant, the
+analog component that provided alarm annunciation to the operators was replaced
+with a digital component performing the same function. An argument was made
+that a safety analysis was not required because the replacement was “like for like.”
+Nobody considered, however, that while the functional behavior might be the same,
+the failure behavior could be different.
+When the previous analog alarm annunciator failed, the screens went blank and
+the failure was immediately obvious to the human operators. When the new digital
+system failed, however, the screens froze, which was not immediately apparent to
+the operators, delaying critical feedback that the alarm system was not operating.
+While the detection of nonevents is relatively simple for automated controllers.
+for instance, watchdog timers can be used.such detection is very difficult for
+humans. The absence of a signal, reading, or key piece of information is not usually
+immediately obvious to humans and they may not be able to recognize that a missing
+signal can indicate a change in the process state. In the Turkish Airlines flight TK
+1951 accident at Amsterdam’s Schiphol Airport in 20 09 , for example, the pilots did
+not notice the absence of a critical mode shift . The design must ensure that the lack
+of important signals will be registered and noticed by humans.
+While safety interlocks are being overridden for test or maintenance, their status
+should be displayed to the operators and testers. Before allowing resumption of
+normal operations, the design should require confirmation that the interlocks have
+been restored. In one launch control system being designed by NASA, the operator
+could turn off alarms temporarily. There was no indication on the display, however,
+that the alarms had been disabled. If a shift change occurred and another operator
+took over the position, the new operator would have no way of knowing that alarms
+were not being annunciated.
+If the information an operator needs to efficiently and safely control the process
+is not readily available, controllers will use experimentation to test their hypotheses
+about the state of the controlled system. If this kind of testing can be hazardous,
+then a safe way for operators to test their hypotheses should be provided rather
+than simply forbidding it. Such facilities will have additional benefits in handling
+emergencies.
+The problem of feedback in emergencies is complicated by the fact that disturbances
+may lead to failure of sensors. The information available to the controllers
+.(or to an automated system). becomes increasingly unreliable as the disturbance
+progresses. Alternative means should be provided to check safety-critical information
+as well as ways for human controllers to get additional information the designer
+did not foresee would be needed in a particular situation.
+Decision aids need to be designed carefully. With the goal of providing assistance
+to the human controller, automated systems may provide feedforward .(as well as
+feedback). information. Predictor displays show the operator one or more future
+states of the process parameters, as well as their present state or value, through a
+fast-time simulation, a mathematical model, or other analytic method that projects
+forward the effects of a particular control action or the progression of a disturbance
+if nothing is done about it.
+Incorrect feedforward information can lead to process upsets and accidents.
+Humans can become dependent on automated assistance and stop checking
+whether the advice is reasonable if few errors occur. At the same time, if the
+process .(control algorithm). truly can be accurately predetermined along with all
+future states of the system, then it should be automated.
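+Returning to the frozen-display and missing-signal problems described earlier in this
+section, one common remedy is to make staleness itself visible: every safety-critical value
+carries the time it was last refreshed, and the display marks it conspicuously when no
+update has arrived within its expected interval. The following is a minimal sketch of that
+idea; the names and intervals are hypothetical.
+
+import time
+
+# Minimal staleness check (hypothetical names and intervals): the absence of
+# updates becomes a visible event rather than a silent nonevent.
+class MonitoredValue:
+    def __init__(self, name, max_age_seconds):
+        self.name = name
+        self.max_age = max_age_seconds
+        self.value = None
+        self.last_update = None
+
+    def update(self, value):
+        self.value = value
+        self.last_update = time.monotonic()
+
+    def is_stale(self):
+        if self.last_update is None:
+            return True
+        return time.monotonic() - self.last_update > self.max_age
+
+def render(values):
+    for v in values:
+        marker = "*** STALE ***" if v.is_stale() else ""
+        print(v.name + ":", v.value, marker)
+
+An automated watchdog can use the same is_stale check to raise an alarm, so that the
+nonevent is detected even when no one happens to be looking at the display.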
Humans are usually kept +in systems when automation is introduced because they can vary their process +models and control algorithms when conditions change or errors are detected in +the original models and algorithms. Automated assistance such as predictor displays may lead to overconfidence and complacency and therefore overreliance by +the operator. Humans may stop performing their own mental predictions and +checks if few discrepancies are found over time. The operator then will begin to +rely on the decision aid. +If decision aids are used, they need to be designed to reduce overdependence +and to support operator skills and motivation rather than to take over functions in +the name of support. Decision aids should provide assistance only when requested + + +and their use should not become routine. People need to practice making decisions +if we expect them to do so in emergencies or to detect erroneous decisions by +automation. +Detecting Faults and Failures. +A second use of feedback is to detect faults and failures in the controlled system, +including the physical process and any computer controllers and displays. If +the operator is expected to monitor a computer or automated decision making, +then the computer must make decisions in a manner and at a rate that operators +can follow. Otherwise they will not be able to detect faults and failures reliably +in the system being supervised. In addition, the loss of confidence in the automation may lead the supervisor to disconnect it, perhaps under conditions where that +could be hazardous, such as during critical points in the automatic landing of an +airplane. When human supervisors can observe on the displays that proper corrections are being made by the automated system, they are less likely to intervene +inappropriately, even in the presence of disturbances that cause large control +actions. +For operators to anticipate or detect hazardous states, they need to be continuously updated about the process state so that the system progress and dynamic state +can be monitored. Because of the poor ability of humans to perform monitoring +over extended periods of time, they will need to be involved in the task in some +way, as discussed earlier. If possible, the system should be designed to fail obviously +or to make graceful degradation obvious to the supervisor. +The status of safety-critical components or state variables should be highlighted +and presented unambiguously and completely to the controller. If an unsafe condition is detected by an automated system being supervised by a human controller, +then the human controller should be told what anomaly was detected, what action +was taken, and the current system configuration. Overrides of potentially hazardous +failures or any clearing of the status data should not be permitted until all of the +data has been displayed and probably not until the operator has acknowledged +seeing it. A system may have a series of faults that can be overridden safely if they +occur singly, but multiple faults could result in a hazard. In this case, the supervisor +should be made aware of all safety-critical faults prior to issuing an override +command or resetting a status display. +Alarms are used to alert controllers to events or conditions in the process that +they might not otherwise notice. They are particularly important for low-probability +events. The overuse of alarms, however, can lead to management by exception, +overload and the incredulity response. 
+Designing a system that encourages or forces an operator to adopt a management-by-exception strategy, where the operator waits for alarm signals before taking + + +action, can be dangerous. This strategy does not allow operators to prevent disturbances by looking for early warning signals and trends in the process state. +The use of computers, which can check a large number of system variables in a +short amount of time, has made it easy to add alarms and to install large numbers +of them. In such plants, it is common for alarms to occur frequently, often five to +seven times an hour . Having to acknowledge a large number of alarms may +leave operators with little time to do anything else, particularly in an emergency + . A shift supervisor at the Three Mile Island .(TMI). hearings testified that the +control room never had less than 52 alarms lit . During the TMI incident, more +than a hundred alarm lights were lit on the control board, each signaling a different +malfunction, but providing little information about sequencing or timing. So many +alarms occurred at TMI that the computer printouts were running hours behind the +events and, at one point jammed, losing valuable information. Brooks claims that +operators commonly suppress alarms in order to destroy historical information +when they need real-time alarm information for current decisions . Too many +alarms can cause confusion and a lack of confidence and can elicit exactly the wrong +response, interfering with the operator’s ability to rectify the problems causing +the alarms. +Another phenomenon associated with alarms is the incredulity response, which +leads to not believing and ignoring alarms after many false alarms have occurred. +The problem is that in order to issue alarms early enough to avoid drastic countermeasures, the alarm limits must be set close to the desired operating point. This goal +is difficult to achieve for some dynamic processes that have fairly wide operating +ranges, leading to the problem of spurious alarms. Statistical and measurement +errors may add to the problem. +A great deal has been written about alarm management, particularly in the +nuclear power arena, and sophisticated disturbance and alarm analysis systems have +been developed. Those designing alarm systems should be familiar with current +knowledge about such systems. The following are just a few simple guidelines. +1.•Keep spurious alarms to a minimum. This guideline will reduce overload and +the incredulity response. +2.•Provide checks to distinguish correct from faulty instruments. When response +time is not critical, most operators will attempt to check the validity of the alarm + . Providing information in a form where this validity check can be made +quickly and accurately, and not become a source of distraction, increases the +probability of the operator acting properly. +3.•Provide checks on alarm system itself. The operator has to know whether the +problem is in the alarm or in the system. Analog devices can have simple checks +such as “press to test” for smoke detectors or buttons to test the bulbs in a + + +lighted gauge. Computer-displayed alarms are more difficult to check; checking +usually requires some additional hardware or redundant information that +does not come through the computer. One complication comes in the form +of alarm analysis systems that check alarms and display a prime cause along +with associated effects. 
Operators may not be able to perform validity checks +on the complex logic necessarily involved in these systems, leading to overreliance . Weiner and Curry also worry that the priorities might not always +be appropriate in automated alarm analysis and that operators may not recognize this fact. +4.•Distinguish between routine and safety-critical alarms. The form of the alarm, +such as auditory cues or message highlighting, should indicate degree or urgency. +Alarms should be categorized as to which are the highest priority. +5.•Provide temporal information about events and state changes. Proper decision +making often requires knowledge about the timing and sequencing of events. +Because of system complexity and built-in time delays due to sampling intervals, however, information about conditions or events is not always timely or +even presented in the sequence in which the events actually occurred. Complex +systems are often designed to sample monitored variables at different frequencies. some variables may be sampled every few seconds while, for others, the +intervals may be measured in minutes. Changes that are negated within the +sampling period may not be recorded at all. Events may become separated from +their circumstances, both in sequence and time . +6.•Require corrective action when necessary. When faced with a lot of undigested +and sometimes conflicting information, humans will first try to figure out what +is going wrong. They may become so involved in attempts to save the system +that they wait too long to abandon the recovery efforts. Alternatively, they may +ignore alarms they do not understand or they think are not safety critical. The +system design may need to ensure that the operator cannot clear a safetycritical alert without taking corrective action or without performing subsequent +actions required to complete an interrupted operation. The Therac-25, a linear +accelerator that massively overdosed multiple patients, allowed operators to +proceed with treatment five times after an error message appeared simply by +pressing one key . No distinction was made between errors that could be +safety-critical and those that were not. +7.•Indicate which condition is responsible for the alarm. System designs with +more than one mode or where more than one condition can trigger the +alarm for a mode, must clearly indicate which condition is responsible for +the alarm. In the Therac-25, one message meant that the dosage given was +either too low or too high, without providing information to the operator + + +about which of these errors had occurred. In general, determining the cause of +an alarm may be difficult. In complex, tightly coupled plants, the point where +the alarm is first triggered may be far away from where the fault actually +occurred. +8.• +Minimize the use of alarms when they may lead to management by exception. After studying thousands of near accidents reported voluntarily by aircraft crews and ground support personnel, one U.S. government report +recommended that the altitude alert signal .(an aural sound). be disabled for all +but a few long-distance flights . Investigators found that this signal had +caused decreased altitude awareness in the flight crew, resulting in more frequent overshoots.instead of leveling off at 10,000 feet, for example, the aircraft continues to climb or descend until the alarm sounds. A study of such +overshoots noted that they rarely occur in bad weather, when the crew is most +attentive. + +Robustness of the Feedback Process. 
+Because feedback is so important to safety, robustness must be designed into feedback channels. The problem of feedback in emergencies is complicated by the fact +that disturbances may lead to failure of sensors. The information available to the +controllers .(or to an automated system). becomes increasingly unreliable as the +disturbance progresses. +One way to prepare for failures is to provide alternative sources of information +and alternative means to check safety-critical information. It is also useful for the +operators to get additional information the designers did not foresee would be +needed in a particular situation. The emergency may have occurred because the +designers made incorrect assumptions about the operation of the controlled +system, the environment in which it would operate, or the information needs of the +controller. +If automated controllers provide the only information about the controlled +system state, the human controller supervising the automation can provide little +oversight. The human supervisor must have access to independent sources of information to detect faults and failures, except in the case of a few failure modes such +as total inactivity. Several incidents involving the command and control warning +system at NORAD headquarters in Cheyenne Mountain involved situations where +the computer had bad information and thought the United States was under nuclear +attack. Human supervisors were able to ascertain that the computer was incorrect +through direct contact with the warning sensors .(satellites and radars). This direct +contact showed the sensors were operating and had received no evidence of incoming missiles . The error detection would not have been possible if the humans + + +could only get information about the sensors from the computer, which had the +wrong information. Many of these direct sensor inputs are being removed in the +mistaken belief that only computer displays are required. +The main point is that human supervisors of automation cannot monitor its performance if the information used in monitoring is not independent from the thing +being monitored. There needs to be provision made for failure of computer displays +or incorrect process models in the software by providing alternate sources of information. Of course, any instrumentation to deal with a malfunction must not be +disabled by the malfunction, that is, common-cause failures must be eliminated or +controlled. As an example of the latter, an engine and pylon came off the wing of +a D C 10 , severing the cables that controlled the leading edge flaps and also four +hydraulic lines. These failures disabled several warning signals, including a flap mismatch signal and a stall warning light . If the crew had known the slats were +retracted and had been warned of a potential stall, they might have been able to +save the plane. + +Displaying Feedback to Human Controllers. +Computer displays are now ubiquitous in providing feedback information to human +controllers, as are complaints about their design. +Many computer displays are criticized for providing too much data .(data overload). where the human controller has to sort through large amounts of data to find +the pieces needed. Then the information located in different locations may need to +be integrated. 
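+The independence requirement discussed above can be supported without adding to the data
+overload problem by computing a single disagreement indication from two sources: the value
+reported through the automation and a value read over a separate path. The sketch below is
+one possible form of such a check; the names and the tolerance are hypothetical, and real
+limits would come from the hazard analysis.
+
+# Hypothetical sketch: cross-check an automation-reported value against an
+# independently sensed one and surface only the disagreement.
+def disagree(reported, independent, tolerance):
+    return abs(reported - independent) > tolerance
+
+def altitude_disagree_flag(fms_altitude_ft, standby_altimeter_ft):
+    # A single derived flag gives the human supervisor an independent check
+    # without requiring constant comparison of raw numbers on two displays.
+    return disagree(fms_altitude_ft, standby_altimeter_ft, tolerance=200.0)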
Bainbridge suggests that operators should not have to page between +displays to obtain information about abnormal states in the parts of the process +other than the one they are currently thinking about; neither should they have to +page between displays that provide information needed for a single decision +process. +These design problems are difficult to eliminate, but performing a task analysis +coupled with a hazard analysis can assist in better design as will making all the +information needed for a single decision process visible at the same time, placing +frequently used displays centrally, and grouping displays of information using the +information obtained in the task analysis. It may also be helpful to provide alternative ways to display information or easy ways to request what is needed. +Much has been written about how to design computer displays, although a surprisingly large number of displays still seem to be poorly designed. The difficulty of +such design is increased by the problem that, once again, conflicts can exist. For +example, intuition seems to support providing information to users in a form that +can be quickly and easily interpreted. This assumption is true if rapid reactions are +required. Some psychological research, however, suggests that cognitive processing + + +for meaning leads to better information retention. A display that requires little +thought and work on the part of the operator may not support acquisition of the +knowledge and thinking skills needed in abnormal conditions . +Once again, the designer needs to understand the tasks the user of the display is +performing. To increase safety, the displays should reflect what is known about how +the information is used and what kinds of displays are likely to cause human error. +Even slight changes in the way information is presented can have dramatic effects +on performance. +This rest of this section concentrates only on a few design guidelines that are +especially important for safety. The reader is referred to the standard literature on +display design for more information. +Safety-related information should be distinguished from non-safety-related +information and highlighted. In addition, when safety interlocks are being overridden, their status should be displayed. Similarly, if safety-related alarms are temporarily inhibited, which may be reasonable to allow so that the operator can deal +with the problem without being continually interrupted by additional alarms, the +inhibit status should be shown on the display. Make warning displays brief and +simple. +A common mistake is to make all the information displays digital simply because +the computer is a digital device. Analog displays have tremendous advantages for +processing by humans. For example, humans are excellent at pattern recognition, +so providing scannable displays that allow operators to process feedback and diagnose problems using pattern recognition will enhance human performance. A great +deal of information can be absorbed relatively easily when it is presented in the +form of patterns. +Avoid displaying absolute values unless the human requires the absolute values. +It is hard to notice changes such as events and trends when digital values are going +up and down. A related guideline is to provide references for judgment. Often, for +example, the user of the display does not need the absolute value but only the fact +that it is over or under a limit. 
Showing the value on an analog dial with references +to show the limits will minimize the required amount of extra and error-prone processing by the user. The overall goal is to minimize the need for extra mental processing to get the information the users of the display need for decision making or +for updating their process models. +Another typical problem occurs when computer displays must be requested and +accessed sequentially by the user, which makes greater memory demands upon the +operator, negatively affecting difficult decision-making tasks . With conventional +instrumentation, all process information is constantly available to the operator. an +overall view of the process state can be obtained by a glance at the console. Detailed +readings may be needed only if some deviation from normal conditions is detected. + + +The alternative, a process overview display on a computer console, is more time +consuming to process. To obtain additional information about a limited part of the +process, the operator has to select consciously among displays. +In a study of computer displays in the process industry, Swaanenburg and colleagues found that most operators considered a computer display more difficult to +work with than conventional parallel interfaces, especially with respect to getting +an overview of the process state. In addition, operators felt the computer overview +displays were of limited use in keeping them updated on task changes; instead, +operators tended to rely to a large extent on group displays for their supervisory +tasks. The researchers conclude that a group display, showing different process variables in reasonable detail .(such as measured value, setpoint, and valve position), +clearly provided the type of data operators preferred. Keeping track of the progress +of a disturbance is very difficult with sequentially presented information . One +general lesson to be learned here is that the operators of the system need to be +involved in display design decisions. The designers should not just do what is easiest +to implement or satisfies their aesthetic senses. +Whenever possible, software designers should try to copy the standard displays +with which operators have become familiar, and which were often developed for +good psychological reasons, instead of trying to be creative or unique. For example, +icons with a standard interpretation should be used. Researchers have found that +icons often pleased system designers but irritated users . Air traffic controllers, +for example, found the arrow icons for directions on a new display useless and +preferred numbers. Once again, including experienced operators in the design +process and understanding why the current analog displays have developed as they +have will help to avoid these basic types of design errors. +An excellent way to enhance human interpretation and processing is to design +the control panel to mimic the physical layout of the plant or system. For example, +graphical displays allow the status of valves to be shown within the context of piping +diagrams and even the flow of materials. Plots of variables can be shown, highlighting important relationships. +The graphical capabilities of computer displays provides exciting potential for +improving on traditional instrumentation, but the designs need to be based on psychological principles and not just on what appeals to the designer, who may never +have operated a complex process. 
As Lees has suggested, the starting point should +be consideration of the operator’s tasks and problems; the display should evolve as +a solution to these . +Operator inputs to the design process as well as extensive simulation and testing +will assist in designing usable computer displays. Remember that the overall goal is +to reduce the mental workload of the human in updating their process models and +to reduce human error in interpreting feedback. + + +section 9.5. +Summary. +A process for safety-guided design using STPA and some basic principles for safe +design have been described in this chapter. The topic is an important one and more +still needs to be learned, particularly with respect to safe system design for human +controllers. Including skilled and experienced operators in the design process from +the beginning will help as will performing sophisticated human task analyses rather +than relying primarily on operators interacting with computer simulations. +The next chapter describes how to integrate the disparate information and techniques provided so far in part 3 into a system-engineering process that integrates +safety into the design process from the beginning, as suggested in chapter 6. + + diff --git a/replacements b/replacements index 7e97d8a..6a22b12 100644 --- a/replacements +++ b/replacements @@ -1,48 +1,70 @@ -: . -— . -\[.\\+\] ( .( + 19\\([[:digit:]][[:digit:]]\\) 19 \\1 + 20\\([[:digit:]][[:digit:]]\\) 20 \1 + 200\\([[:digit:]]\\) 2 thousand \1 +— . +: . ) ). +\[.\\+\] +AAI A A I +ACO A C O +AFB A F B +AI A I +ASO A S O +ATC A T C +ATO A T O +AWACS A Wacks +B757 B 7 57 +BH B H +BMDS B M D S +BSD B S D +CFAC C FACK +CFIT C Fit +CTF C T F +DC-10 D C 10 +DMES D Mez +DO D O +FDAAA F D A A A +FDA F D A +FMEA F M E A +FMIS F Miss +GAO GAOW +HAZOP Haz Op +HMO H M O HQ-II H Q-2 +HTV H T V +IFF I F F III 3 II 2 +IOM I O M +IRB I R B +ITA I T A IV 4 -AWACS A Wacks -ASO A S O -PRA P R A -HMO H M O -MIC M I C -DC-10 D C 10 -OPC O P C -TAOR T A O R -AAI A A I -ACO A C O -AFB A F B -AI A I -ATO A T O -BH B H -BSD B S D -CTF C T F -CFAC C FACK -DO D O -GAO GAOW -IFF I F F -JOIC J O I C -JSOC J SOCK -JTIDS J tides -MCC M C C -MD M D -NCA N C A -NFZ N F Z -OPC O P C -ROE R O E -SD S D +JAXA Jax ah +JOIC J O I C +JSOC J SOCK +JTIDS J tides +MCC M C C +MD M D +MIC M I C +NCA N C A +NFZ N F Z +NMAC N Mack +OND O N D +OPC O P C +OPF O P F +OSMA O S M A +PDUFA P D U F A +PRA P R A +ROE R O E +SD S D SITREP SIT Rep +STPA S T P A TACSAT Tack sat -TAOR T A O R -USCINCEUR U S C in E U R -WD W D - 19\\([[:digit:]][[:digit:]]\\) 19 \\1 - 200\\([[:digit:]]\\) 2 thousand \1 - 20\\([[:digit:]][[:digit:]]\\) 20 \1 -B757 B 7 57 \ No newline at end of file +TAOR T A O R +TAOR T A O R +TCAS T Cass +TMI T M I +TTPS T T P S +USCINCEUR U S C in E U R +WD W D \ No newline at end of file