The Root Cause of a Failure is Always a Decision


We often get sucked into drawn-out conversations (or heated debates) about the ‘true’ meaning of words, especially when it comes to sports. Was James Harden (a basketball player) in the ‘act of shooting’ when he was fouled? It matters – because if the answer is ‘yes’ he gets up to three free throws. So what does the ‘act of shooting’ mean, and who decides? There will be endless debate over beers about this, with opinions perhaps largely dependent on which team you support.


At the end of the day, it usually doesn’t matter. You can debate it as much as you want, but the referees have already decided what happened on the court. It is done. It is over. You can disagree with them. But nothing changes the score.


Words are important.


There is a difference between ‘taxonomy’ and ‘semantics.’ ‘Taxonomy’ is all about naming things and phenomena to make sure there is a common understanding about what those things are. We engineers do this a lot. Or think we do. ‘Semantics’ is a branch of linguistics that focuses on the meaning of words and phrases. If we have got our ‘taxonomy’ down … there should be no room for ‘semantics.’


In practice, this is not the case. One of my favorite bits of nonsense that routinely appears in the world of reliability engineering is …


… a ‘failure free’ period – which is a period in which the probability of failure is some small value.


I have wasted many hours of my life listening to people debate the meaning of words and terms that should not be debatable. What is a ‘fault?’ What is a ‘failure mode?’ Is this different to ‘functional failure mode?’ … or ‘physical failure mode?’ Too often these debates and arguments aren’t about progressing the conversation or solving a problem. A good number of them degenerate into petty contests of ego about who can be ‘right.’


But perhaps my favorite term that spawns many frivolous arguments is ‘root cause.’ If an aircraft turbine failed due to manufacturing-related ladder cracking that initiated fatigue failure later on … what is the ‘root cause?’ Is it ‘ladder cracking?’ NO. Here is why.


We control BEHAVIORS. Not much more. And what are ‘behaviors?’ Decisions. Everything we do, say, write, acknowledge or otherwise respond to is a decision we make. We can’t change physical phenomena. We can’t ever stop fatigue cracking from being a failure mechanism.


The only things we can change are the decisions that create the conditions for failure to occur. In the case of our aircraft turbine? The ladder cracking (at the time) was a well-known issue associated with manufacturing these turbines. Coolant is applied to drill bit tips to reduce the risk of these cracks occurring. But the risk is real.


Which is why (in this case) these turbines are supposed to be subjected to routine Non-destructive Inspection (NDI). Now because the maintenance crew in this (real life) scenario did not identify the fatigue crack, the aircraft was allowed to continue flying until this failure occurred.


So again … what is the root cause? Perhaps you might be thinking ‘poorly executed inspection activity’ or something similar. Still not there yet. We need to investigate further. Was the maintenance crew properly trained? If not, the decision to not properly train them is a (potential) root cause. Was the maintenance crew properly supervised? If not, we have another (potential) root cause.


But here is the most common ‘root cause’ I have come across throughout my reliability engineering career:


management teams DECIDING to avoid accountability.


Most maintenance crews are trained. A lot of them are well supervised. But a significant fraction of them are over-tasked, under-paid, or time-poor. The management team’s solution is to simply delegate an ever-increasing swag of tasks to the worker bees of the organization, to the point that the workload becomes infeasible.


Have you ever been put in a position where your boss gives you (for example) ten tasks to complete in the next three months, and you only have the time or resources to do half of them? When you raise this issue with your boss and essentially try to force them to choose what your prioritized tasks need to be, do they reflect it back to you with ‘motherhood’ statements that absolve them of any responsibility?


I have. I bet you have too. And that is the root cause of many failures that cost you, your team, and broader society very dearly.


So when it comes to ‘root causes’ of failure, don’t stop until you identify an unambiguous decision (or failure to make a decision) by someone with the authority to control the ‘context.’


Depending on where you stand with semantics … this is often ‘culture.’
