A Root Cause Analysis Framework for Product Managers

Isolating and analyzing the root cause of problems at different stages of a product's lifecycle is crucial for its long-term success. It is this iterative approach of identifying problems, then analyzing and resolving their root causes, that paves the way for future innovation.

As a former technical consultant, I often faced situations that required isolating issues and providing a Root Cause Analysis (RCA) to prevent similar situations in the future. This skill has translated well to my current role as a project manager and product owner for a cross-functional product at Cisco. Here's my framework for dealing with RCA problems as a Product Manager, in line with the insights I gained as a consultant.

1. Make sure everyone is on the same page

Before jumping in to solve a problem, make sure that all the key terms used to define the problem are clearly understood by all stakeholders. It may well be that the problem the customer is describing is very different from the problem you perceived upon first hearing about it. For instance, the customer might voice her concern as being unable to view a particular feature in your product. This open-ended statement could mean many things unless it is qualified with the right constraints. It could mean that the feature is not intuitively placed and thus couldn't be located by the user. It could mean that the option to use that feature is not present on the interface the user is talking about. As you have probably guessed, it could also mean a whole host of other things, which is why it is very important to get the problem properly clarified and to define the key terms unequivocally before proceeding with the RCA process.

2. Verify the metrics defined in the problem

When identifying potential root causes for an issue, it is imperative to make well-informed, data-driven choices so that you can back up any solution hypothesis you want to test. At the heart of any such decision lie the metrics that substantiate the claims in a problem. It therefore makes sense to thoroughly vet any metrics presented in a problem before trying to find the root cause. For instance, let's say that a user analytics report tells you that 35% of users are seeing a spike in resource consumption, with CPU utilization showing 90% used. How do you interpret the metrics presented in this statement? How did we arrive at these values? What tools were used to draw up these analytics reports? What criteria or parameters were defined to show that 35% of users were impacted?

It is quite possible that a wrongly defined threshold value is causing these spiked values to show up on the user analytics report. It is also possible that the analytics are polling the wrong set of devices. A quick double-check on the veracity of the metrics presented in the problem statement allows you to see whether a problem actually exists, and to dispel any false positives right at the start of the RCA process without wasting further time or resources.
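To make the threshold point concrete, here is a minimal sketch of how the "share of impacted users" can swing depending on where the cutoff is set. The sample readings, the user IDs and the `share_above` helper are all hypothetical and exist only for illustration; they are not taken from any real analytics tool.

```python
def share_above(cpu_by_user, threshold):
    """Return the fraction of users whose CPU utilization exceeds the threshold."""
    if not cpu_by_user:
        return 0.0
    impacted = [user for user, cpu in cpu_by_user.items() if cpu > threshold]
    return len(impacted) / len(cpu_by_user)


# Hypothetical CPU utilization readings (percent), one per user.
cpu_by_user = {
    "u1": 95, "u2": 72, "u3": 91, "u4": 40, "u5": 65,
    "u6": 88, "u7": 30, "u8": 55, "u9": 20, "u10": 78,
}

# Assumption: the report was built with a cutoff of 70 instead of the intended 90.
reported = share_above(cpu_by_user, threshold=70)
intended = share_above(cpu_by_user, threshold=90)

print(f"share flagged at 70%: {reported:.0%}")
print(f"share flagged at 90%: {intended:.0%}")
```

With the same underlying data, the two cutoffs tell very different stories, which is why the definition behind a metric is worth confirming before anything else.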

3. List out any recent changes

A good place to start is to check when the problem was first spotted. Doing so gives you insight into the problem's origin, along with a rough timeline in which to look for patterns that hint at what might have caused it. Most importantly, it allows you to list all the changes made around that time that could have impacted the product feature. Was there a recent update to the permissions on a view for a certain group of users? Was a change to the feature recently pushed to production? Asking these questions allows you to form a cause-and-effect relationship between the changes made and the problem being faced.
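As a rough sketch of this step, the snippet below filters a change log down to the entries that landed in a window just before the problem was first seen. The change log entries, the dates and the `changes_before` helper are made up for illustration; in practice this information might come from release notes or a deployment history.

```python
from datetime import date, timedelta


def changes_before(change_log, first_seen, window_days=7):
    """Return changes made within window_days before first_seen, newest first."""
    start = first_seen - timedelta(days=window_days)
    hits = [entry for entry in change_log if start <= entry[0] <= first_seen]
    return sorted(hits, key=lambda entry: entry[0], reverse=True)


# Hypothetical change log: (date, description).
change_log = [
    (date(2024, 3, 1), "permissions update for reporting view"),
    (date(2024, 3, 9), "feature release pushed to production"),
    (date(2024, 3, 12), "dashboard copy tweak"),
    (date(2024, 2, 10), "database migration"),
]

# Assumption: the problem was first reported on 10 March 2024.
candidates = changes_before(change_log, first_seen=date(2024, 3, 10), window_days=10)

for changed_on, description in candidates:
    print(changed_on.isoformat(), description)
```

Changes made after the problem first appeared, or long before it, drop out of the list, leaving a short set of candidates to investigate.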

4. Cover all major endpoints

A common technique when analyzing problems related to a firewall dropping packets is to do a packet capture. A packet capture basically checks where the packet is coming from, which access control rules apply to it and where the packet is exiting from. The logic is that a packet entering the firewall also has to exit the firewall in some way - either through an exit interface or by getting dropped after hitting an access control statement that blocks it. Tracing the flow of the packet thus allows us to understand where the problem lies.

The same concept can be extended to product RCA problems as well. Any user interaction with a product can be broken down into the buckets of input, process and output. Analyzing the user flow through each of these segments allows you to derive insights on the problem. For example, a user complains that he is unable to see the changes he makes to a certain segment of the web portal. Using the segmentation above, we can divide the flow into three buckets. The problem could lie where the information is being entered (input), in the way that input is processed by the back-end (process), or in the way the stored information is displayed to the user on the front-end (output). Following this flow helps eliminate problem domains step by step.
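The input, process and output walk-through can be sketched as an ordered list of checks, where the first failing check points at the bucket to investigate. Everything here is hypothetical: the `observed` values stand in for whatever evidence you actually gather (logs, database lookups, screenshots), and `first_failing_stage` is just an illustrative helper.

```python
def first_failing_stage(checks):
    """Run ordered (stage_name, check_fn) pairs; return the first stage whose check fails."""
    for name, check in checks:
        if not check():
            return name
    return None  # every stage passed


# Hypothetical findings for the "user cannot see his changes" example.
observed = {
    "request_received": True,   # input: the form submission reached the server
    "record_saved": True,       # process: the back-end stored the change
    "rendered_on_page": False,  # output: the front-end did not show the stored value
}

checks = [
    ("input", lambda: observed["request_received"]),
    ("process", lambda: observed["record_saved"]),
    ("output", lambda: observed["rendered_on_page"]),
]

stage = first_failing_stage(checks)
print("first failing stage:", stage)
```

Checking the stages in order mirrors the packet trace: once a stage is confirmed healthy, it can be eliminated and attention moves downstream.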

5. Check for the demographic being impacted

When a problem arises, it is generally good practice to understand which group of end users it really impacts. Is this a problem faced by all users alike, or only by a specific subset of people? Is there a common demographic (based on age, sex, gender, access or location) in which this issue is observed? Is there any other broad customer segmentation that can be done to identify the exact user base being affected?
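One simple way to approach these questions, sketched below under made-up data, is to compute the share of affected users per segment and compare segments side by side. The user records, the attribute names and the `issue_rate_by` helper are assumptions for illustration only.

```python
from collections import defaultdict


def issue_rate_by(users, attribute):
    """Return {segment value: fraction of users in that segment who are affected}."""
    totals = defaultdict(int)
    affected = defaultdict(int)
    for user in users:
        key = user[attribute]
        totals[key] += 1
        if user["affected"]:
            affected[key] += 1
    return {key: affected[key] / totals[key] for key in totals}


# Hypothetical user records.
users = [
    {"location": "EU", "access": "admin", "affected": True},
    {"location": "EU", "access": "viewer", "affected": True},
    {"location": "US", "access": "admin", "affected": False},
    {"location": "US", "access": "viewer", "affected": False},
]

by_location = issue_rate_by(users, "location")
by_access = issue_rate_by(users, "access")

print("by location:", by_location)
print("by access:", by_access)
```

In this toy data the issue rate differs sharply by location but not by access, which suggests that location is the segmentation worth digging into first.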

6. Check for external factors

Some problems identified by your product team might not have anything to do with the product itself. These problems could be caused by changes made by external agents, new campaigns or announcements by competitors, or larger macroeconomic factors. For instance, if your video streaming product's analytics team identifies that the number of monthly subscriptions to your service has gone down by 30% in the last few weeks, it may well be that the drop had nothing to do with your product. It may be that your competition has cut its subscription fee, causing more people to move to its product, or that the government has announced a new tax on video streaming services, influencing people's decision on whether to subscribe to such a service in the first place. These factors are often outside your scope of influence.

7. Check for internal factors

After going through the above steps, it is time to revisit each of the factors that you CAN influence as a PM. Check exactly where the issue is occurring. Is it specific to users on a particular device, or to users running a certain version of the software? If so, what were the last few changes made to that particular feature or workflow? Was there any change in the user flow that could have led to the problem? What alternative hypotheses can you test to see whether the problem can be resolved? These problems typically lie within your sphere of influence, and making the right changes to these internal factors should be effective in resolving them.

8. Follow a top down approach

When isolating an issue, explore all potential options and then dive into the most likely ones after ascertaining each level of information pertaining to the problem. One way of imagining this is as a tree structure with the problem statement at the root. Segment the problem into sub-problems and explore each of those sub-problems in further detail. As you do so, determine which of these sub-branches can be ruled out completely as you gather more data. Explore the branches that are not ruled out and apply the process iteratively until the root cause is identified.
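The tree idea can be sketched in a few lines: represent each hypothesis as a node with children, prune any branch that the data has ruled out, and list the paths still worth exploring. The tree contents below reuse the subscription example from earlier in this post; the branch names, the `ruled_out` set and the `open_leaves` helper are hypothetical.

```python
def open_leaves(node, ruled_out, path=()):
    """Return the root-to-leaf paths that do not pass through a ruled-out branch."""
    name, children = node
    if name in ruled_out:
        return []  # prune this branch and everything beneath it
    here = path + (name,)
    if not children:
        return [" > ".join(here)]
    leaves = []
    for child in children:
        leaves.extend(open_leaves(child, ruled_out, here))
    return leaves


# Hypothetical problem tree: (name, [children]).
tree = ("subscriptions dropped", [
    ("external", [("competitor pricing", []), ("new tax", [])]),
    ("internal", [("recent release", []), ("checkout flow change", [])]),
])

# Assumption: the data gathered so far rules out these two branches.
remaining = open_leaves(tree, ruled_out={"external", "recent release"})

for path in remaining:
    print(path)
```

Ruling out a branch near the top removes all of its sub-branches at once, which is what makes working from the top down efficient.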

