ASEN 5158 Space Habitat Design

10/7/2008


Chapter 8 Safety of Crewed Spaceflight

 

Learning Objectives

  1. Identify main factors and issues of designing for safety
  2. Describe the variables used for statistical estimation of probability and uncertainty
  3. Explain the method used by NASA to identify hazards and analyze failure modes and effects

 


Some terminology & working definitions…

 

•         Mission Goals  =  political agenda, top level decisions

•         Ground Rules = stated approaches/constraints usually from the program that comply with top level architecture decisions and that generally simplify the range of design choices at the functional and/or technology selection level

•         Assumptions = stated from the engineers to simplify the range of design choices

•         Functional Objectives (FO’s) = derived from top level Mission Goals and compliant with the Ground Rules & Assumptions (GR&A) – what has to be done to enable the mission

•         Functional Requirements = high level FO’s

•         Functional Decomposition = FO’s decomposed down to lowest level functionality

•         Solutions = HW/SW needed to achieve the lowest level functional requirements

•         Safety = design implications relevant to potential crew injury or death

•         Risk = potential of failure to meet mission goals

•         Probability = likelihood of failure (or success)

•         Reliability = likelihood of a given device functioning as planned (determined from MTBF), extrapolated up to the vehicle level

•         Redundancy = similar or dissimilar means of achieving a given functional requirement

•         Factor of Safety = additional performance capacity added beyond calculated baseline

•         Test = additional evaluations intended to improve reliability (also can include actual use)

•         Nominal Ops – normal operations

•         Malfunction Ops – problem encountered, sufficient time to implement corrective action

•         Alternative Ops – pre-planned corrective procedures for probable failure mode

•         Trouble-shooting – fault isolation

•         Contingency = scenario where off-nominal operations are required, usually due to a failure of some type; can include degraded performance capacity; dealt with by operational workaround, redundant systems, or (unplanned) in-flight repair or maintenance (IFM)

•         Emergency – imminent loss of mission or vehicle/crew is possible

•         Uncertainty – unknown parameter (quantified or not) on the front end of an analysis

•         Error Propagation – end result of uncertainty or inaccurate data on design parameters

•         Sensitivity Analysis – variable isolation process, single delta / multiple(?) outcomes

•         MTBF = Mean Time Between Failures

•         FMEA = Failure Mode and Effects Analysis

•         Probabilistic Risk Assessment:  PRA = f (MTBF, FMEA)
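The link between MTBF and reliability can be sketched numerically. This is a minimal sketch that assumes a constant failure rate (random failures, no wear-out), so that R(t) = exp(-t / MTBF). The function name and the example numbers are illustrative assumptions, not values from the course text.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a unit runs for t_hours without failing.

    Assumes a constant failure rate of 1/MTBF (no wear-out).
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-t_hours / mtbf_hours)


# Hypothetical unit: MTBF of 10,000 h, mission duration of 1,000 h
print(f"R = {reliability(1_000, 10_000):.3f}")  # R = 0.905
```

Note that operating a unit for a time equal to its MTBF leaves only about a 37% chance of no failure under this assumption, so MTBF is not a guaranteed lifetime.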

 


Safety Engineering

 

•         Assure that life-critical systems maintain necessary functionality even when parts fail

•         A probabilistically safe system has no single point failures and adequate sensors

•         Most aircraft are certified to ‘less than one life lost’ in 30 years (about 10^9 sec) of operation due to mechanical failure
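As a quick arithmetic check, 30 years of continuous operation is roughly 10^9 seconds. The sketch below assumes an average year of 365.25 days; the names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # average year length, leap years included


def years_to_seconds(years: float) -> float:
    """Convert a duration in years to seconds."""
    return years * DAYS_PER_YEAR * SECONDS_PER_DAY


thirty_years_s = years_to_seconds(30)
print(f"{thirty_years_s:.3e} s")  # 9.467e+08 s, i.e. about 10^9 s
```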


Designing for Safety

 

•         Crewed spacecraft can use the humans onboard to repair failures

–        If the design is flexible enough and adequate time is available, and tools/spares are onboard

•         A hazard is any event that can jeopardize the crew’s safety

–        Hazard identification is an integral part of design and largely depends on experience

–        High energy systems, moving parts, toxic or corrosive materials, flammability, etc.


Fault Tree Analysis

 

•         Hazard analysis is a deductive process

–        Ask a lot of ‘What if…’ questions

–        Start with the parts list

•         What technologies meet the functional requirements?

–        Identify the failure modes

•         How can each component break?

•         Can also be asked at functional level

–        Determine effects of failure

•         What happens if this component breaks?

–        How can I tell if it is broken or about to break?

•         Sensors, other feedback, performance trends

–        What can I do about it?

•         Training and procedures

•         Spare parts and tools


Dealing with Failures

 

•         Redundancy / Factor of Safety

–        Similar vs. Dissimilar means of meeting requirement

–        Weigh redundancy against additional parts / complexity & cost

–        Biological systems are good examples

 

•         Inherent fail-safe (fail-operational) design

–        Overflow drain in the sink

–        Spring-loaded elevator brake system

–        Bimetallic switch for furnace gas cutoff

–        Consider the ‘warning (idiot) lights’ in your car

–        What if the light is burned out?

 

•         Additional Testing


For each candidate solution, consider

 

–        Failure modes

–        Environmental stressors

–        Probability of any given failure occurrence (and how you would determine this)

–        Options for dealing with unit failure


Fault Tolerance

 

•         Crewed spacecraft are usually designed such that failure of any single part will not result in loss of vehicle/life

•         For non-single fault tolerant parts, factor of safety can be increased to compensate

•         Additional testing can also be used to increase reliability

•         Single fault tolerant = redundant

•         Two-fault tolerant = dual redundancy
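The relation between fault tolerance and the number of units can be sketched as follows. This assumes identical, independent units, each with the same failure probability over the mission; real redundant units can share common-cause failures, which this sketch ignores. The function names and the 1% figure are hypothetical.

```python
def units_required(faults_tolerated: int) -> int:
    """An N-fault-tolerant function needs N + 1 units that can each do the job."""
    if faults_tolerated < 0:
        raise ValueError("faults_tolerated must be non-negative")
    return faults_tolerated + 1


def loss_of_function_probability(p_unit_failure: float, faults_tolerated: int) -> float:
    """Probability that every unit fails, assuming independent identical units."""
    return p_unit_failure ** units_required(faults_tolerated)


# Hypothetical unit with a 1% chance of failing during the mission
for n in (0, 1, 2):
    p = loss_of_function_probability(0.01, n)
    print(f"{n}-fault tolerant: {units_required(n)} unit(s), P(loss) = {p:.0e}")
```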


Safety and Reliability

 

Failure consequence and odds of it happening

 

•         Addressed at the lowest level of hardware to which a failure can be traced – a unit

 


Safety Analysis

 

•         Probabilistic Risk Assessment (PRA)

–        Top-down approach

–        Start with major failure event and trace to unit level failure causes

–        Consider stress-causing event / environment

•         Vibration, acceleration, acoustics, structural or electrical overload, chemical reaction, delta-P, thermal shock, radiation, MMOD, EMF, mechanical shock, temperature gradients, toxic materials

•         Reliability factors

–        How likely is any given failure to occur

•         What are probabilities of crew survival?

•         Focus on most critical causes and failure modes first in preliminary analysis

•         Derive reliability from historical data or analogous units in similarly stressful environments

 

•         FMEA / CIL

–        Failure Mode and Effects Analysis / Critical Items List

•         MTBF

–        Mean Time Between Failures

 

•         Crit 1 = loss of crew (emergency)

•         Crit 2 = loss of mission (action)

•         Crit 3 = no impact (monitor)
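The three criticality levels and their associated responses can be captured in a small lookup. This is an illustrative sketch only; the enum, the mapping, and the `classify` helper are assumed names, and the two boolean inputs are a simplification of a real hazard assessment.

```python
from enum import IntEnum


class Criticality(IntEnum):
    LOSS_OF_CREW = 1      # Crit 1
    LOSS_OF_MISSION = 2   # Crit 2
    NO_IMPACT = 3         # Crit 3


# Response associated with each level in the list above
RESPONSE = {
    Criticality.LOSS_OF_CREW: "emergency",
    Criticality.LOSS_OF_MISSION: "action",
    Criticality.NO_IMPACT: "monitor",
}


def classify(crew_at_risk: bool, mission_at_risk: bool) -> Criticality:
    """Pick the most severe applicable level; crew risk takes precedence."""
    if crew_at_risk:
        return Criticality.LOSS_OF_CREW
    if mission_at_risk:
        return Criticality.LOSS_OF_MISSION
    return Criticality.NO_IMPACT


level = classify(crew_at_risk=False, mission_at_risk=True)
print(f"Crit {level.value}: {RESPONSE[level]}")  # Crit 2: action
```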


FMEA Template

 

•         Function – what the component does

•         Failure Mode – how it fails

•         Cause(s) – conditions leading to failure

•         Effect(s) – how failure affects the system

•         Disposition & Rationale – what is done

 

Ref. Table 8-4 Typical Failure Modes and Affected Equipment, Table 8-5 Typical Causes of Failures and Table 8-6 – Circuit Breaker example
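The template fields map naturally onto a simple record type. The sketch below is one possible layout; the class name and the cabin fan entry are hypothetical illustrations and are not taken from Tables 8-4 to 8-6.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FmeaEntry:
    unit: str               # lowest-level hardware item being analyzed
    function: str           # what the component does
    failure_mode: str       # how it fails
    causes: List[str]       # conditions leading to failure
    effects: List[str]      # how the failure affects the system
    disposition: str        # what is done, and why

    def summary(self) -> str:
        """One-line summary suitable for a worksheet row."""
        return f"{self.unit}: {self.failure_mode} -> {'; '.join(self.effects)}"


# Hypothetical entry, for illustration only
fan = FmeaEntry(
    unit="cabin fan",
    function="circulate cabin air",
    failure_mode="fan stops rotating",
    causes=["bearing seizure", "loss of electrical power"],
    effects=["loss of air circulation in the module"],
    disposition="switch to redundant fan; replace failed unit from onboard spares",
)
print(fan.summary())
```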


Estimating Reliability

 

•         Empirical vs. deterministic

–        Track record or predictive

•         COTS history or ‘never been used before’ one-off design?

–        Random (no wear out) failures

•         Bayesian Estimation

–        Used to determine uncertainty bounds on reliability either with or without failure data on the unit or an analogous unit under similar conditions

–        If failure rate is known, this approach is not needed

–        Used when actual failure data do not exist

•         Failure rate estimate

–        Number of failures per unit time

•         2 shuttles lost in 113 flights (up to Columbia)

•         Reliability estimate

–        Statistical probability of failure

•         Does not simply equal the observed loss rate of 2 in 113 (about 1:57)

•         Increasing data = decreasing uncertainty
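A minimal sketch of Bayesian estimation applied to the shuttle numbers above. It treats each flight as an independent trial with the same failure probability and starts from a uniform Beta(1, 1) prior; both are simplifying assumptions, and the function names are illustrative. The credible interval is estimated by sampling with the standard library rather than by an exact quantile formula.

```python
import random


def beta_posterior(failures: int, trials: int, prior_a: float = 1.0, prior_b: float = 1.0):
    """Posterior Beta(a, b) for the per-flight failure probability."""
    return prior_a + failures, prior_b + (trials - failures)


def posterior_mean(a: float, b: float) -> float:
    return a / (a + b)


def credible_interval(a: float, b: float, level: float = 0.90,
                      draws: int = 20_000, seed: int = 0):
    """Approximate central credible interval from sorted posterior samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = min(int((1 + level) / 2 * draws), draws - 1)
    return samples[lo_idx], samples[hi_idx]


a, b = beta_posterior(failures=2, trials=113)
low, high = credible_interval(a, b)
print(f"observed rate   = {2 / 113:.4f}")
print(f"posterior mean  = {posterior_mean(a, b):.4f}")
print(f"90% interval    = ({low:.4f}, {high:.4f})")
```

The posterior mean, (2 + 1) / (113 + 2), differs from the raw 2/113, and the interval around it is wide. Rerunning with ten times the data at the same observed rate (20 failures in 1130 trials) gives a much narrower interval, which is the "increasing data = decreasing uncertainty" point.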


Fault Tree

 

•         Begin breakdown by mission phase

–        Countdown, launch, orbit, reentry

•         Consider functions critical to given phase

•         What hazards pose a threat to a given function?

•         Assign probabilities to affected systems

•         Define ‘credible’ failures
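Once probabilities are assigned, a fault tree combines them through AND and OR gates. The sketch below assumes independent events; the phase names follow the list above, but every probability is a made-up placeholder, not data from any vehicle.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all input events occur (e.g. both redundant units fail)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical hazard probabilities per mission phase
phase_hazards = {
    "countdown": [1e-5],
    "launch": [1e-3, and_gate([1e-2, 1e-2])],  # second hazard needs two redundant units to fail
    "orbit": [2e-4],
    "reentry": [5e-4],
}

phase_loss = {phase: or_gate(hazards) for phase, hazards in phase_hazards.items()}
mission_loss = or_gate(phase_loss.values())

for phase, p in phase_loss.items():
    print(f"{phase:10s} {p:.2e}")
print(f"{'mission':10s} {mission_loss:.2e}")
```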


Reliability Constructs

 

•         Series – both A and B have to function

–        Structural components

•         Parallel – either A or B

–        Freon loops

•         Standby – if A fails, B is automatic backup

–        SOP

•         Cross linked – B can take place of D

–        Waste/supply water dump valve config

•         k out of n – at least k of n units must function (e.g., any 1 of 3 is sufficient)

–        APU’s, Fuel Cells
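Standard textbook formulas for several of these constructs can be sketched as below, assuming independent units. The standby case additionally assumes two identical units, a constant failure rate, perfect switching, and no failures while dormant. The cross-linked case depends on the specific topology and is not covered. Function names are illustrative.

```python
from math import comb, exp, prod


def series(reliabilities):
    """All units must function."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one unit must function."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def k_out_of_n(k: int, n: int, r: float) -> float:
    """At least k of n identical units, each with reliability r, must function."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


def standby(failure_rate: float, t: float) -> float:
    """Primary unit with one identical cold spare that takes over on failure."""
    return exp(-failure_rate * t) * (1.0 + failure_rate * t)


r = 0.9  # hypothetical unit reliability
print(f"series of 2:   {series([r, r]):.3f}")       # 0.810
print(f"parallel of 2: {parallel([r, r]):.3f}")     # 0.990
print(f"1 out of 3:    {k_out_of_n(1, 3, r):.3f}")  # 0.999
```

Under these assumptions, 1-out-of-n reduces to the parallel case and n-out-of-n reduces to the series case.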


Action / Recovery

 

•         Can failure be detected?

–        Sensors

•         Is there enough time to react?

–        Action or emergency

•         Can crew (or automated activity) repair failed unit?

–        Training and real-time ops implications


Probability of Survival

 

•         Data on entire system of units that contributes to hazard

–        Use reliability equations to predict probability

–        See gyro example worked in section 8.3.1

 

•         If insufficient data exist

–        use a block diagram to relate unit-failures to hazards

•         Uncertainty bounds propagated from bottom to top

–        Monte Carlo methods

•         Compute failure or survival probabilities

–        Equations 8-1, 8-2 and 8-3
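Propagating unit-level uncertainty bounds up to the system level with a Monte Carlo method can be sketched as follows. The sketch assumes the units are in series and independent, and that each unit's reliability is uniformly distributed between its lower and upper bound; the bounds themselves are hypothetical. It does not reproduce Equations 8-1 to 8-3 from the text.

```python
import random
from math import prod


def simulate_series_reliability(unit_bounds, trials: int = 10_000, seed: int = 1):
    """Sample each unit's reliability within its bounds and multiply (series system).

    Returns the 5th percentile, median, and 95th percentile of the system reliability.
    """
    rng = random.Random(seed)
    samples = sorted(
        prod(rng.uniform(low, high) for low, high in unit_bounds)
        for _ in range(trials)
    )
    return {
        "p05": samples[int(0.05 * trials)],
        "median": samples[trials // 2],
        "p95": samples[int(0.95 * trials)],
    }


# Hypothetical (lower, upper) reliability bounds for three units
unit_bounds = [(0.990, 0.999), (0.950, 0.990), (0.980, 0.995)]
result = simulate_series_reliability(unit_bounds)
print({name: round(value, 4) for name, value in result.items()})
```

The spread between the 5th and 95th percentiles is the system-level uncertainty bound that results from the unit-level bounds.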


Outcomes

 

•         R = reliability

•         F = failure

•         P = probability of crew survival (0.05 - 0.95)

 

How safe is safe enough?

 

Failure is not an option!


 
