The Five Whys of a Root Cause Analysis

Hari Prasad Nagesha
Posted on November 09, 2022

For the want of a nail the shoe was lost,
For the want of a shoe the horse was lost,
For the want of a horse the rider was lost,
For the want of a rider the troop was lost,
For the want of a troop the kingdom was lost.

This classic proverb, For Want of a Nail, dates back centuries and remains relevant even today, especially when writing a Root Cause Analysis (RCA).

The rhyme captures the true essence of an RCA: the five whys, asked in a blameless way.

Why did we lose the kingdom?

We lost it because we lost the troop.

Okay, but why did we lose the troop?

We lost it because we lost the rider.

Every answer prompts a new question until we reach the final root cause. Here, the loss of a kingdom ultimately traces back to the loss of a nail.

This simple rhyme holds a deep meaning: a small linchpin can be pivotal to a large failure. At the outset it looked like a lost kingdom, but the true cause was just a missing nail. Observe that the rhyme never names the rider, the horse, or the kingdom. It defines the problem in terms of what really happened and why, rather than who is to blame, much like an RCA should.
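The chain of questions in the rhyme can be sketched in a few lines of Python. This is only an illustration: the `CAUSES` mapping and the `ask_whys` helper are hypothetical names, not part of any RCA tool.

```python
# Each entry maps an effect to its immediate cause, taken from the rhyme.
CAUSES = {
    "kingdom": "troop",
    "troop": "rider",
    "rider": "horse",
    "horse": "shoe",
    "shoe": "nail",
}


def ask_whys(problem, causes):
    """Follow the chain of causes from a problem down to its root cause."""
    chain = [problem]
    while chain[-1] in causes:
        next_cause = causes[chain[-1]]
        if next_cause in chain:  # guard against circular reasoning
            raise ValueError(f"circular cause: {next_cause}")
        chain.append(next_cause)
    return chain


chain = ask_whys("kingdom", CAUSES)
for effect, cause in zip(chain, chain[1:]):
    print(f"Why did we lose the {effect}? Because we lost the {cause}.")
print(f"Root cause: the {chain[-1]}")
```

Note that the chain holds only things and events, never names of people, which mirrors the blameless framing of the rhyme.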


The Recipe for a Perfect RCA

  1. Being objective and analyzing critically

  2. Finding the five why(s)

  3. Having clear action items

  4. A proper feedback mechanism

  5. Following a blameless approach


Be Objective and Analyze Critically

Everyone is passionate about the code they write, the build they release, and the process they follow. This creates a strong bias that clouds judgement while creating an RCA.

To achieve the desired results:

  • Ideally, an RCA should be driven by someone who isn’t deeply involved in the release process or services.

  • Observe your own actions from an outsider’s perspective.

  • Be passionate about change. This helps identify the actual reason for the issue, rather than defending something that was done right.

  • Be critical of all decisions taken. Did they influence the outcome? Could they have been different? Given a similar scenario again, would you do it differently?


Finding the Five Why(s)

For every item identified as a point of failure, question the individual (or group) on every answer provided. The curiosity and tenacity with which we attack the answers help improve the entire process and provide a clear outcome. Bring a childlike enthusiasm to asking “what do you do?” for every situation listed.

  • The idea is that once we have five levels of answers to a question, we have reached a deep enough level of enquiry.

  • Documenting the five whys helps reviewers give their input on what could and should have been thought through differently.

  • It also helps detect flaws in the process when a similar item starts showing up across multiple RCAs.

  • Create an RCA as early as possible so that the whys can be captured accurately while the issue and the incident are still fresh in memory, rather than having to scramble back later to figure out when an issue occurred.
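One way to spot an item that keeps showing up across RCAs is to count how often each documented answer recurs. The sketch below assumes the whys are stored as plain lists of strings. The RCA identifiers, the sample answers, and the `recurring_causes` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical records: each RCA lists the answers to its whys, in order.
rcas = {
    "RCA-101": ["deploy failed", "config missing", "no config validation in pipeline"],
    "RCA-102": ["alerts fired late", "threshold too high", "no alert review process"],
    "RCA-103": ["service crashed", "bad config value", "no config validation in pipeline"],
}


def recurring_causes(rcas, min_count=2):
    """Return the answers that appear in at least `min_count` different RCAs."""
    counts = Counter()
    for whys in rcas.values():
        # Normalize lightly and count each answer once per RCA.
        counts.update({why.strip().lower() for why in whys})
    return {cause: n for cause, n in counts.items() if n >= min_count}


for cause, n in recurring_causes(rcas).items():
    print(f"'{cause}' appears in {n} RCAs")
```

Exact string matching is a deliberate simplification. In practice the same cause is often worded differently from one RCA to the next, so some tagging or manual review would still be needed.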


Clear Action Items

An RCA is only as effective as the action items it defines. If an RCA concludes with a statement like “yes it happened but we can’t really do anything about it”, then it adds hardly any value beyond cautioning users that something like this could happen again at any time.

  • Ideally, all action items should be realistic and well documented. Documenting your thoughts helps identify and address issues better. Revisit the action items not only immediately but also at a later point in time, which helps refine those goals.

  • Each action item must name an owner who can be notified to fix the flaws and issues that have been identified.

  • The action items should articulate the processes to be followed to fix the identified issues.

  • Segregate your action items into short, medium, and long-term goals to set your priorities right.

  • Identify areas where AI/ML could be leveraged to implement change in processes and procedures. Track process fixes (metrics, alerts, and monitoring) separately from actionable procedures (code fixes and configuration changes).
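The points above can be captured in a small data structure. This is a minimal sketch rather than a prescribed format: the `ActionItem` fields, the `Horizon` and `Kind` enums, the helper functions, and the sample items are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Horizon(Enum):
    SHORT = "short-term"
    MEDIUM = "medium-term"
    LONG = "long-term"


class Kind(Enum):
    PROCESS = "process"      # metrics, alerts, monitoring
    PROCEDURE = "procedure"  # code fixes, configuration changes


@dataclass
class ActionItem:
    description: str
    owner: str
    horizon: Horizon
    kind: Kind


def group_by(items, key):
    """Group action items by the value that `key` returns for each item."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


def unowned(items):
    """Return the action items that do not name an owner yet."""
    return [item for item in items if not item.owner.strip()]


# Hypothetical action items from an imaginary RCA.
items = [
    ActionItem("Add alert on queue depth", "observability team", Horizon.SHORT, Kind.PROCESS),
    ActionItem("Fix retry logic in the uploader", "", Horizon.SHORT, Kind.PROCEDURE),
    ActionItem("Review the release checklist", "release manager", Horizon.LONG, Kind.PROCESS),
]

for horizon, group in group_by(items, lambda item: item.horizon).items():
    print(horizon.value, [item.description for item in group])
for item in unowned(items):
    print("Needs an owner:", item.description)
```

Grouping by `kind` instead of `horizon` gives the separate tracking of process fixes and actionable procedures.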


Welcoming Feedback

It’s quite ironic that no matter how important an RCA is, everyone hopes that they don't have to write one. However, an RCA helps identify our blind spots and fix the loopholes that caused the issue in the first place.

An RCA needs a lot of feedback from multiple stakeholders, because everyone perceives a problem and its outcome differently. For someone fixing a reporting problem, it could be an Azure issue causing the VMSS to not scale, but for someone from the business side, it's their inability to deliver status to external and internal parties.

  • Feedback helps identify the stakeholders who might have to be informed in case of outages.

  • Every document or algorithm is written with a particular train of thought. A different outlook provides a view that's entirely different from the documented flow.

  • The whole RCA can be pivoted differently based on feedback.

  • Architects have a horizontal view of systems, and their feedback is invaluable in such cases.

  • While opinions are personal, feedback isn’t. Keep your feedback objective so that any questions or comments are addressed in an impersonal manner.


Being Blameless

Being critical and objective truly works only when one is impartial and blameless. A true RCA captures the essence of the problem along with the shortcomings of the processes and services, without shifting the blame onto anyone. For an RCA, who is immaterial; what and why are the most important factors.

  • A blameless RCA focuses on the problem rather than the players involved.

  • This garners better feedback and reviews on the document and promotes more discussion.

  • Understand the difference between symptoms and causes. Identify key metrics and alerts that help narrow down the issue, without any bias regarding why it wasn’t done for the current outage.