Root Cause Analysis For Reliability: A Case Study
Wikipedia defines Root Cause Analysis (RCA) as “a method of problem-solving used for identifying the root causes of faults or problems.”
Essentially, root cause analysis means to dive deeper into an issue to find what caused a non-conformance. What’s important to understand here is that Root Cause Analysis does not mean just looking at superficial causes of a problem. Rather, it means finding the highest-level cause- the thing that started a chain of cause-effect reactions and ultimately led to the issue at hand.
Appears in lists (1)
More like this (1)
How Facebook keeps its large-scale infrastructure hardware up and running embrace failure. automation. machine learning to...How Facebook keeps its large-scale infrastructure hardware up and running embrace failure. automation. machine learning to automate the automation. build system that can detect and remediate issues.monitor and remediate hardware events without adversely impacting application performance. use prediction methodology for remediations. automate root cause analysis for hardware and system failures at scale to get to the bottom of issues quickly.