10/7/2008
Learning Objectives
Some terminology & working definitions
Mission Goals = political agenda, top level decisions
Ground Rules = stated approaches/constraints, usually from the program, that comply with top level architecture decisions and that generally simplify the range of design choices at the functional and/or technology selection level
Assumptions = stated by the engineers to simplify the range of design choices
Functional Objectives (FOs) = derived from top level Mission Goals that also comply with GR&A (Ground Rules and Assumptions); what has to be done to enable the mission
Functional Requirements = high level FOs
Functional Decomposition = FOs decomposed down to lowest level functionality
Solutions = HW/SW needed to achieve the lowest level functional requirements
Safety = design implications relevant to potential crew injury or death
Risk = potential of failure to meet mission goals
Probability = likelihood of failure (or success)
Reliability = likelihood of given device functioning as planned (determined from MTBF), extrapolated up to vehicle
Redundancy = similar or dissimilar means of achieving a given functional requirement
Factor of Safety = additional performance capacity added beyond calculated baseline
Test = additional evaluations intended to improve reliability (can also include actual use)
Nominal Ops = normal operations
Malfunction Ops = problem encountered, sufficient time to implement corrective action
Alternative Ops = pre-planned corrective procedures for probable failure mode
Trouble-shooting = fault isolation
Contingency = scenario where off-nominal operations are required, usually due to a failure of some type; can include degraded performance capacity; dealt with by operational workaround, redundant systems, or (unplanned) in-flight repair (or maintenance, IFM)
Emergency = imminent loss of mission or vehicle/crew possible
Uncertainty = unknown parameter (quantified or not) on the front end of an analysis
Error Propagation = end result of uncertainty or inaccurate data on design parameters
Sensitivity Analysis = variable isolation process, single delta / multiple(?) outcomes
MTBF = Mean Time Between Failure
FMEA = Failure Mode Effects Analysis
PRA = Probabilistic Risk Assessment = f(MTBF + FMEA)
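The definition of Reliability above says it is determined from MTBF and extrapolated up to the vehicle. A minimal sketch of one common way to do that follows. It assumes a constant failure rate (random failures, no wear-out), so that R(t) = exp(-t/MTBF), and assumes every unit is needed (series). The function names and MTBF values are hypothetical and not from the slides.

```python
import math

def unit_reliability(mission_hours, mtbf_hours):
    """Probability a unit survives the mission, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

def vehicle_reliability(mission_hours, unit_mtbfs):
    """Extrapolate to vehicle level, assuming every unit must work (series)."""
    r = 1.0
    for mtbf in unit_mtbfs:
        r *= unit_reliability(mission_hours, mtbf)
    return r

# Hypothetical MTBF values in hours, for illustration only
example_mtbfs = [10_000.0, 25_000.0, 50_000.0]
print(vehicle_reliability(240.0, example_mtbfs))
```

Under these assumptions the vehicle figure is always lower than that of its weakest unit, which is why redundancy or added margin gets applied at the unit level.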
Safety Engineering
Assure that life-critical systems maintain necessary functionality even when parts fail
A probabilistically safe system has no single point failures and adequate sensors
Most aircraft are certified to less than one life lost in 30 years (about 10^9 sec) of operation due to mechanical failure
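As a quick check on the seconds figure, 30 years converts as follows (taking 365.25 days per year, an assumption made here for the arithmetic):

```latex
30\ \text{yr} \times 365.25\ \tfrac{\text{d}}{\text{yr}} \times 86\,400\ \tfrac{\text{s}}{\text{d}}
  \approx 9.47 \times 10^{8}\ \text{s} \approx 10^{9}\ \text{s}
```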
Designing for Safety
Crewed spacecraft can use the humans onboard to repair failures
If the design is flexible enough and adequate time is available, and tools/spares are onboard
A hazard is any event that can jeopardize the crew's safety
Hazard identification is an integral part of design and largely depends on experience
High energy systems, moving parts, toxic or corrosive materials, flammability, etc.
Fault Tree Analysis
Hazard analysis is a deductive process
Ask a lot of "What if?" questions
Start with the parts list
What technologies meet the functional requirements?
Identify the failure modes
How can each component break?
Can also be asked at functional level
Determine effects of failure
What happens if this component breaks?
How can I tell if it is broken or about to break?
Sensors, other feedback, performance trends
What can I do about it?
Training and procedures
Spare parts and tools
Dealing with Failures
Redundancy / Factor of Safety
Similar vs. Dissimilar means of meeting requirement
Weigh redundancy against additional parts / complexity & cost
Biological systems are good examples
Inherent fail-safe (fail-operational) design
Overflow drain in the sink
Spring-loaded elevator brake system
Bimetallic switch for furnace gas cutoff
Consider the warning (idiot) lights in your car
What if the light is burned out?
Additional Testing
Consider for optional solutions
Failure modes
Environmental stressors
Probability of any given failure occurrence (and how you would determine this)
Options for dealing with unit failure
Fault Tolerance
Crewed spacecraft usually designed such that failure of any single part will not result in loss of vehicle/life
For non-single fault tolerant parts, factor of safety can be increased to compensate
Additional test can also be used to increase reliability
Single fault tolerant = redundant
Two-fault tolerant = dual redundancy
Safety and Reliability
Failure consequence and odds of it happening
Addressed at the lowest level of hardware to which a failure can be traced: a unit
Safety Analysis
Probabilistic Risk Assessment (PRA)
Top-down approach
Start with major failure event and trace to unit level failure causes
Consider stress-causing event / environment
Vibration, acceleration, acoustics, structural or electrical overload, chemical reaction, delta-P, thermal shock, radiation, MMOD, EMF, mechanical shock, temperature gradients, toxic materials
Reliability factors
How likely is any given failure to occur?
What are probabilities of crew survival?
Focus on most critical causes and failure modes first in preliminary analysis
Derive reliability from historical data or analogous units in similarly stressful environments
FMEA / CIL
Failure Mode Effects Analysis / Critical Item List
MTBF
Mean Time Between Failure
Crit 1 = loss of crew (emergency)
Crit 2 = loss of mission (action)
Crit 3 = no impact (monitor)
FMEA Template
Function = what the component does
Failure Mode = how it fails
Cause(s) = conditions leading to failure
Effect(s) = how failure affects the system
Disposition & Rationale = what is done
Ref. Table 8-4 Typical Failure Modes and Affected Equipment, Table 8-5 Typical Causes of Failures and Table 8-6 Circuit Breaker example
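The template fields and the Crit 1/2/3 categories can be sketched as a small record type. This is only an illustration: the class names, the two example entries, and the rule that the Critical Item List holds Crit 1 and Crit 2 entries are assumptions made here, not taken from the referenced tables.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List

class Criticality(IntEnum):
    LOSS_OF_CREW = 1      # Crit 1: emergency
    LOSS_OF_MISSION = 2   # Crit 2: action
    NO_IMPACT = 3         # Crit 3: monitor

@dataclass
class FmeaEntry:
    function: str          # what the component does
    failure_mode: str      # how it fails
    causes: List[str]      # conditions leading to failure
    effects: List[str]     # how failure affects the system
    disposition: str       # what is done, and why
    criticality: Criticality

def critical_items(entries):
    """Assumed CIL rule: keep Crit 1 and Crit 2 entries, most severe first."""
    cil = [e for e in entries if e.criticality <= Criticality.LOSS_OF_MISSION]
    return sorted(cil, key=lambda e: e.criticality)

# Hypothetical entries for illustration only
entries = [
    FmeaEntry("Cabin light", "Lamp burns out", ["filament wear"],
              ["reduced lighting"], "Monitor; replace in flight",
              Criticality.NO_IMPACT),
    FmeaEntry("Cabin fan", "Motor seizes", ["bearing wear", "debris"],
              ["loss of air circulation"], "Redundant fan; in-flight spare",
              Criticality.LOSS_OF_CREW),
]
for item in critical_items(entries):
    print(item.criticality.name, "-", item.function, "-", item.failure_mode)
```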
Estimating Reliability
Empirical vs. deterministic
Track record or predictive
COTS history, or a never-before-used one-off design?
Random (no wear out) failures
Bayesian Estimation
Used to determine uncertainty bounds on reliability either with or without failure data on the unit or an analogous unit under similar conditions
If failure rate is known, this approach is not needed
Used when actual failure data do not exist
Failure rate estimate
Number of failures per unit time
2 shuttles lost in 113 flights (up to
Reliability estimate
Statistical probability of failure
Does not equal 1:57
Increasing data = decreasing uncertainty
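One way to see why the reliability estimate need not equal the raw failure ratio, and why more data narrows the uncertainty, is a Beta-Binomial update. The sketch below is a hedged illustration, not necessarily the method intended in the slides: it assumes each flight is an independent trial with the same failure probability and starts from a uniform Beta(1, 1) prior. The function name and the "ten times the data" case are hypothetical.

```python
import math

def beta_posterior(failures, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of the per-trial failure probability."""
    a = prior_a + failures
    b = prior_b + (trials - failures)
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

# Counts quoted in the slide: 2 losses in 113 flights
mean, std = beta_posterior(2, 113)
print(f"raw ratio      = {2 / 113:.4f}")
print(f"posterior mean = {mean:.4f}, std = {std:.4f}")

# Hypothetical: ten times the data at the same observed ratio
mean10, std10 = beta_posterior(20, 1130)
print(f"posterior mean = {mean10:.4f}, std = {std10:.4f}")
```

With the assumed prior, the posterior mean differs from the raw ratio, and the spread shrinks as the number of trials grows.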
Fault Tree
Begin breakdown by mission phase
Countdown, launch, orbit, reentry
Consider functions critical to given phase
What hazards pose threat to given function
Assign probabilities to affected systems
Define credible failures
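The steps above (break down by mission phase, identify hazards to critical functions, assign probabilities) can be sketched with AND/OR gates. This minimal example assumes independent failure events. The tree shape, the event names, and every probability are hypothetical placeholders.

```python
from math import prod

def or_gate(probs):
    """Output event occurs if any input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probs)

def and_gate(probs):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probs)

# Hypothetical orbit-phase tree: a function is lost only when all of its
# redundant units fail (AND); the phase fails if any critical function is lost (OR).
loss_of_cooling = and_gate([0.01, 0.01])        # two redundant loops
loss_of_power = and_gate([0.02, 0.02, 0.02])    # three redundant units
orbit_failure = or_gate([loss_of_cooling, loss_of_power])

# Hypothetical per-phase failure probabilities, combined across the mission
phases = {"countdown": 1e-5, "launch": 5e-3, "orbit": orbit_failure, "reentry": 4e-3}
mission_failure = or_gate(phases.values())
print(f"orbit: {orbit_failure:.6f}  mission: {mission_failure:.6f}")
```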
Reliability Constructs
Series = both A and B have to function
Structural components
Parallel = either A or B
Freon loops
Standby = if A fails, B is automatic backup
SOP
Cross linked = B can take place of D
Waste/supply water dump valve config
k out of n = any 1 of 3 is sufficient
APUs, Fuel Cells
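The constructs above map onto standard reliability formulas. The sketch below assumes independent units. The standby case further assumes two identical units with a constant failure rate, perfect switching, and no failures while dormant. The cross-linked case depends on the specific configuration and is omitted. Function names are illustrative.

```python
from math import comb, exp, prod

def series(rs):
    """All units must function."""
    return prod(rs)

def parallel(rs):
    """At least one unit must function."""
    return 1.0 - prod(1.0 - r for r in rs)

def k_out_of_n(k, n, r):
    """At least k of n identical units, each with reliability r, must function."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))

def standby(failure_rate, t):
    """Two identical units, one in cold standby, perfect switchover."""
    return exp(-failure_rate * t) * (1.0 + failure_rate * t)

r = 0.9  # hypothetical unit reliability
print("series      ", series([r, r]))
print("parallel    ", parallel([r, r]))
print("1 out of 3  ", k_out_of_n(1, 3, r))
print("standby     ", standby(1e-3, 100.0))
```

Note that `k_out_of_n(1, n, r)` reduces to the parallel case and `k_out_of_n(n, n, r)` to the series case.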
Action / Recovery
Can failure be detected?
Sensors
Is there enough time to react?
Action or emergency
Can crew (or automated activity) repair failed unit?
Training and real-time ops implications
Probability of Survival
Data on entire system of units that contributes to hazard
Use reliability equations to predict probability
See gyro example worked in section 8.3.1
If insufficient data exist
use a block diagram to relate unit-failures to hazards
Uncertainty bounds propagated from bottom to top
Compute failure or survival probabilities
Equations 8-1, 8-2 and 8-3
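Propagating uncertainty bounds from bottom to top can be sketched as follows. This is not the referenced gyro example or the numbered equations. It is a hypothetical block diagram (two gyros in parallel, in series with one computer) with made-up bounds. Because system reliability never decreases when a unit's reliability increases, evaluating the diagram at all lower bounds and at all upper bounds brackets the result.

```python
from math import prod

def series(rs):
    """All blocks must function."""
    return prod(rs)

def parallel(rs):
    """At least one block must function."""
    return 1.0 - prod(1.0 - r for r in rs)

def system_reliability(unit_r):
    """Hypothetical diagram: (gyro_a OR gyro_b) AND computer."""
    gyros = parallel([unit_r["gyro_a"], unit_r["gyro_b"]])
    return series([gyros, unit_r["computer"]])

# Hypothetical (lower, upper) bounds on each unit's reliability
bounds = {
    "gyro_a": (0.90, 0.99),
    "gyro_b": (0.90, 0.99),
    "computer": (0.95, 0.999),
}

r_low = system_reliability({name: lo for name, (lo, hi) in bounds.items()})
r_high = system_reliability({name: hi for name, (lo, hi) in bounds.items()})
print(f"system reliability between {r_low:.4f} and {r_high:.4f}")
print(f"failure probability between {1 - r_high:.4f} and {1 - r_low:.4f}")
```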
Outcomes
R = reliability
F = failure
P = probability of crew survival (0.05 - 0.95)
How safe is safe enough?
Failure is not an option!