Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
At its heart, FMEA asks three deceptively simple questions: “What could go wrong? What would happen if it does? How can we prevent or detect it?” This bottom-up, inductive approach begins at the component or process-step level and propagates upward through system hierarchies, ultimately producing a comprehensive risk landscape that drives engineering decisions.
IEC 60812:2018 defines seven core steps for a properly executed FMEA:
(1) Scope Definition — Define system boundaries, operating assumptions, environmental conditions, and the analysis granularity. This step is critical: poorly defined scope is the leading cause of FMEA sessions that spiral out of control.
(2) Structure Decomposition — Break the system into elements (components, subassemblies, or process steps). Design FMEA uses functional block diagrams and BOMs; Process FMEA uses process flow charts.
(3) Function Description — For each element, articulate its intended function and performance requirements under all operating conditions.
(4) Failure Mode Identification — Enumerate all plausible failure modes for each element, considering the full lifecycle (manufacturing, transport, operation, maintenance, disposal).
(5) Failure Effect Assessment — Evaluate the consequences of each failure mode at the local, subsystem, and end-user levels.
(6) Control Identification — Document existing preventive (design features) and detective (tests, monitors) controls.
(7) Risk Prioritization — Rank failures using RPN or alternative risk matrices to determine which items demand corrective action.
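The bottom-up character of steps (2) through (5) can be sketched as a small tree walk. This is a minimal illustration, not something prescribed by IEC 60812: the `Element` class, the `effect_path` helper, and the three-level vehicle hierarchy are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """One node in the system structure (hypothetical model for illustration)."""
    name: str
    parent: Optional["Element"] = field(default=None, repr=False, compare=False)
    children: list["Element"] = field(default_factory=list, repr=False, compare=False)

    def add(self, name: str) -> "Element":
        """Create a child element and link it to this node."""
        child = Element(name, parent=self)
        self.children.append(child)
        return child


def effect_path(element: Element) -> list[str]:
    """Return the levels a failure effect passes through, from local to top level."""
    path = []
    node: Optional[Element] = element
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


# Assumed example hierarchy: end-user level -> subsystem -> component
vehicle = Element("Vehicle")
cooling = vehicle.add("Battery cooling system")
pump = cooling.add("Coolant pump")

print(" -> ".join(effect_path(pump)))
```

Each name on the returned path is a level at which the team would record an effect for a failure mode of the starting element.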
The table below shows a representative FMEA worksheet structure—the core working document every reliability engineer must be fluent in:
| Item / Function | Failure Mode | Failure Effect | S | Failure Cause | O | Current Controls | D | RPN | Recommended Action |
|---|---|---|---|---|---|---|---|---|---|
| Coolant Pump | Bearing seizure | Battery overheat, vehicle derating | 8 | Lubricant degradation / contamination | 3 | Pump speed sensor feedback | 4 | 96 | Dual-pump redundancy; add oil condition monitoring |
| IGBT Power Module | Short-circuit (collector-emitter) | Cooling system failure, vehicle shutdown | 9 | Overvoltage / thermal runaway | 2 | Bus voltage monitor + over-temp protection | 3 | 54 | Add DESAT protection circuit; conformal coating |
| Radiator Core | Progressive clogging | Gradual efficiency loss, eventual overheat | 6 | Sediment / scale buildup on tube walls | 5 | Delta-temperature sensor alarm | 7 | 210 | Add inline filter; scheduled cleaning maintenance |
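As a rough sketch, the worksheet rows above can be held in a small data structure that recomputes each RPN from its ratings and ranks the rows. The `WorksheetRow` class and its field names are assumptions made for this example; only the ratings and RPN values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorksheetRow:
    """Subset of the worksheet columns needed for ranking (hypothetical type)."""
    item: str
    failure_mode: str
    severity: int
    occurrence: int
    detection: int
    recorded_rpn: int  # value written in the worksheet's RPN column

    @property
    def rpn(self) -> int:
        """Recompute RPN = S x O x D from the individual ratings."""
        return self.severity * self.occurrence * self.detection


rows = [
    WorksheetRow("Coolant Pump", "Bearing seizure", 8, 3, 4, 96),
    WorksheetRow("IGBT Power Module", "Short-circuit (collector-emitter)", 9, 2, 3, 54),
    WorksheetRow("Radiator Core", "Progressive clogging", 6, 5, 7, 210),
]

# Flag rows whose recorded RPN disagrees with the product of their ratings.
mismatches = [r.item for r in rows if r.rpn != r.recorded_rpn]

# Rank by RPN, highest first.
ranked = sorted(rows, key=lambda r: r.rpn, reverse=True)

for r in ranked:
    print(f"{r.rpn:>4}  S={r.severity}  {r.item}: {r.failure_mode}")
```

With these three rows there are no mismatches, and the row with the highest severity (S = 9) lands last when ranked by RPN alone.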
IEC 60812:2018 distinguishes two primary FMEA categories. Though sharing the same analytical DNA, they differ fundamentally in focus and analytical unit:
Design FMEA (DFMEA)
Focus: The physical design of the product, including components, materials, software architecture, interfaces, and tolerances.
Core question: “In what ways could this design fail to meet its intended function?”
Typical scenario: An EV battery pack team conducts DFMEA on cell-level, module-level, BDU, BMS master, and cooling subassemblies before prototype release. Each failure mode is evaluated for its effect on safety (thermal runaway), performance (range reduction), and regulatory compliance.
Process FMEA (PFMEA)
Focus: Manufacturing, assembly, testing, and maintenance process steps.
Core question: “In what ways could this process step fail to produce conforming output?”
Typical scenario: An SMT assembly line analyzes the reflow soldering step, considering failure modes such as solder paste misalignment, incorrect thermal profile, PCB warpage, and their effects on solder joint reliability (voids, cold joints, insufficient wetting).
The Risk Priority Number is the most commonly used risk ranking metric in FMEA, calculated as:
RPN = S × O × D
where S = Severity, O = Occurrence, and D = Detection, each typically rated on a 1–10 scale. The resulting product ranges from 1 to 1000, with higher values indicating higher-priority risks.
| Rating | Severity (S) | Occurrence (O) | Detection (D) |
|---|---|---|---|
| 1–2 | Negligible; user will not notice | Extremely remote (<1 ppm) | Almost certain to detect |
| 3–4 | Minor; slight user inconvenience | Very low (10–100 ppm) | High probability of detection |
| 5–6 | Moderate; performance degraded but functional | Moderate (0.1%–1%) | Moderate probability of detection |
| 7–8 | Severe; primary function loss, safety concern | High (1%–5%) | Low probability of detection |
| 9–10 | Catastrophic; personnel safety or regulatory violation | Very high (>5%) | Virtually undetectable |
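One way to keep rating sessions anchored to the scale above is a small lookup that maps a numeric rating to its band descriptor. The sketch below assumes the two-ratings-per-band layout of the table; the `BANDS` dictionary and `describe` function are hypothetical names, and the descriptors are shortened from the table text.

```python
# Band descriptors taken from the rating table: index 0 covers ratings 1-2,
# index 1 covers 3-4, and so on up to index 4 for ratings 9-10.
BANDS = {
    "S": ["Negligible", "Minor", "Moderate", "Severe", "Catastrophic"],
    "O": ["Extremely remote", "Very low", "Moderate", "High", "Very high"],
    "D": [
        "Almost certain to detect",
        "High probability of detection",
        "Moderate probability of detection",
        "Low probability of detection",
        "Virtually undetectable",
    ],
}


def describe(dimension: str, rating: int) -> str:
    """Return the band descriptor for a 1-10 rating in dimension S, O, or D."""
    if dimension not in BANDS:
        raise ValueError(f"unknown dimension: {dimension!r}")
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    return BANDS[dimension][(rating - 1) // 2]


print(describe("S", 8))  # Severe
print(describe("D", 7))  # Low probability of detection
```

Rejecting out-of-range ratings at entry time is a design choice for the example; a real worksheet tool might instead warn and let the facilitator correct the value.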
The 2018 edition of IEC 60812 dedicates an informative annex to RPN limitations, a significant upgrade from the 2006 edition. Engineers must understand these four critical weaknesses:
(1) Product Sensitivity: RPN is the product of three ordinal numbers. The combinations 10×2×5 = 100 and 5×5×4 = 100 produce identical RPN values, yet their engineering meaning is radically different—the former is a catastrophic-but-rare failure, the latter a moderate mid-range issue. Treating them as equivalent risk is mathematically indefensible.
(2) Rating Subjectivity: S/O/D ratings depend heavily on team experience and domain knowledge. Studies have shown that different teams evaluating the same failure mode can produce ratings diverging by 30% or more. Without a well-calibrated rating scale anchored to organizational data, RPN becomes a measure of team confidence rather than actual risk.
(3) Non-Uniform Distribution: Of the theoretical 1–1000 range, only 120 distinct values are achievable with integer ratings from 1 to 10. Many RPN values (e.g., 11, 13, 17, 19, 22… and all prime numbers above 10) can never appear, creating artificial “gaps” in the risk scale.
(4) Threshold Trap: Setting a fixed “RPN threshold” (e.g., RPN > 100 requires mandatory action) is dangerous. Teams under schedule pressure may unconsciously suppress ratings to stay below the threshold, defeating the purpose of the analysis. IEC 60812:2018 strongly recommends against rigid RPN thresholds.
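Weaknesses (1) and (3) can be checked by brute force. The short sketch below enumerates every integer (S, O, D) combination on a 1–10 scale; the `triples_for` helper is a hypothetical name introduced for the example.

```python
from itertools import product

SCALE = range(1, 11)  # integer ratings 1..10 for S, O, and D


def triples_for(rpn: int) -> list[tuple[int, int, int]]:
    """All (S, O, D) triples on the 1-10 scale whose product equals rpn."""
    return [(s, o, d) for s, o, d in product(SCALE, repeat=3) if s * o * d == rpn]


# Weakness (3): which RPN values can occur at all?
achievable = {s * o * d for s, o, d in product(SCALE, repeat=3)}
never = [n for n in (11, 13, 17, 19, 22) if n not in achievable]

# Weakness (1): how many different rating profiles hide behind RPN = 100?
combos = triples_for(100)
severities = sorted({s for s, _, _ in combos})

print(f"distinct RPN values: {len(achievable)} of 1000")
print(f"never achievable:    {never}")
print(f"triples with RPN 100: {len(combos)}, severities: {severities}")
```

The enumeration finds 120 distinct values, and 12 different triples that all yield RPN = 100 with severities ranging from 1 to 10, which is why an RPN should be read together with its individual ratings rather than on its own.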
FMEA and FMECA form one of the most commonly confused concept pairs in reliability engineering. The distinction is straightforward but consequential:
FMEA = Failure Mode + Failure Effect analysis
FMECA = FMEA + Criticality Analysis
Criticality analysis introduces two additional dimensions: a severity classification (categorical levels I–IV) and a failure probability level, which together are mapped onto a criticality matrix. The approach originated in US military standard MIL-STD-1629A and remains widely required in aerospace, defense, and nuclear power applications.
| Dimension | FMEA | FMECA |
|---|---|---|
| Analysis Depth | Failure modes + effects + RPN ranking | Full FMEA content + criticality matrix |
| Risk Representation | Semi-quantitative (ordinal RPN) | Quantitative risk + qualitative criticality categories |
| Typical Application | Automotive (AIAG-VDA), general industrial | Aerospace, military, nuclear power |
| Core Output | Prioritized improvement action list | Critical items list + risk acceptability decision |
| Governing Standards | IEC 60812, AIAG-VDA FMEA Handbook | MIL-STD-1629A, IEC 60812 (Annex) |
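The criticality matrix described above can be sketched as a grid keyed by severity category and probability level. The category labels I–IV and levels A–E follow MIL-STD-1629A; the item assignments below are hypothetical, and the severity-first ordering in `rank` is a simplification chosen for this example rather than the standard's own ranking procedure.

```python
from collections import defaultdict

SEVERITY = ["I", "II", "III", "IV"]      # I = catastrophic ... IV = minor
PROBABILITY = ["A", "B", "C", "D", "E"]  # A = frequent ... E = extremely unlikely

Item = tuple[str, str, str]  # (failure mode, severity category, probability level)


def build_matrix(items: list[Item]) -> dict[tuple[str, str], list[str]]:
    """Place each failure mode in its (severity, probability) cell."""
    matrix: dict[tuple[str, str], list[str]] = defaultdict(list)
    for name, sev, prob in items:
        if sev not in SEVERITY or prob not in PROBABILITY:
            raise ValueError(f"invalid classification for {name!r}: {sev}, {prob}")
        matrix[(sev, prob)].append(name)
    return matrix


def rank(items: list[Item]) -> list[Item]:
    """Order items by severity first, then by probability (assumed rule)."""
    return sorted(items, key=lambda it: (SEVERITY.index(it[1]), PROBABILITY.index(it[2])))


# Hypothetical classifications, for illustration only.
items = [
    ("Radiator clogging", "III", "B"),
    ("IGBT short-circuit", "I", "D"),
    ("Pump bearing seizure", "II", "C"),
]

matrix = build_matrix(items)
for name, sev, prob in rank(items):
    print(f"Category {sev}, Level {prob}: {name}")
```

Unlike an RPN ranking, this ordering never lets a frequent but minor failure outrank a rare catastrophic one, which is the behavior a critical items list is meant to guarantee.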
IEC 60812 explicitly mandates a cross-functional FMEA team. At minimum, the team must include: a design engineer (system/component knowledge), a manufacturing engineer (process feasibility), a quality/reliability engineer (methodology and data), and a test engineer (detection capability). Ideally, also include a field service representative (real-world failure data from the installed base) and a supplier representative (component-level failure mode data from the supply chain).
A skilled FMEA facilitator is not merely a meeting scheduler. They must:
One of the most common conversations in FMEA sessions goes: “This failure mode won’t happen because we designed XXX to prevent it.” This “design assumption immunity” is the number one reason FMEAs miss critical failure modes. A skilled facilitator will probe: “Please demonstrate that your protection is independent of this failure mode.” Independence means the protection mechanism does not share a common cause with the failure it is meant to guard against—a concept that directly links FMEA to functional safety (ISO 26262, IEC 61508).
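A facilitator's independence question can be made concrete with a simple set comparison. The sketch below assumes that the team has listed, for each failure mode and each protection mechanism, the resources it depends on; the `shared_dependencies` function and the dependency names are hypothetical and serve only to illustrate the idea of a common cause.

```python
def shared_dependencies(failure_deps: set[str], protection_deps: set[str]) -> set[str]:
    """Return the dependencies common to a failure mode and its protection.

    A non-empty result means the protection may fail for the same reason as
    the failure it guards against, so it cannot be claimed as independent.
    """
    return failure_deps & protection_deps


# Hypothetical dependency lists gathered during an FMEA session.
overheat_failure = {"12 V supply rail", "coolant pump", "ECU firmware"}
overtemp_protection = {"12 V supply rail", "NTC temperature sensor"}

common = shared_dependencies(overheat_failure, overtemp_protection)
if common:
    print(f"Protection is NOT independent; shared: {sorted(common)}")
else:
    print("No shared dependencies found in the listed items.")
```

An empty result only shows that no common cause appears in the listed dependencies; it does not prove independence, since unlisted couplings such as shared wiring, location, or software may still exist.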