Data Lie Detectors - 7 Ways to Spot and Stop
Companies collect, store, process, and analyze data to make decisions. But if the data is lying, the decision may not be the best one, despite great care and effort.
Why would data lie to you?
There are many reasons why data may lie. It could be the result of a mistake at any stage of the process. It could be conscious or unconscious bias from one of the actors in the processing chain. There could be an agenda behind the data manipulation, or it could be an unfortunate coincidence. As companies become more and more data-driven, there is a tension between automating data management and processing as much as possible and the need for a bit of caution. There is a strong incentive to trust the data, but a healthy suspicion is warranted: data must be treated as questionable until proven trustworthy. The issues with implementing this strategy are:
- dealing with third-party data, without access to the underlying data sources,
- long processing chains,
- lack of time to review and assess every single piece of information.
Here are 7 quick ways to determine data trustworthiness without having to study the entire data processing.
1. Change of Metrics
If you are looking at different metrics from period to period, you won’t develop trust in the data. Changing metrics increases the chances of mistakes while removing the opportunity to run simple sanity checks on variations. Whenever you look at an important performance or business indicator, it is critical to look at the same numbers regularly. If the computation basis changes, if you move from averages to incremental values, to absolute values, to percentages, the data won’t speak to you. As a consumer of the data, you have to demand a consistent basis of measurement. In the rare cases when there is a valid reason to change a metric, keep the old metric beside the new one to:
- get used to the new metric, and how it relates to the old,
- keep an eye on the old metric for comparison,
- remove the old metric only when the new metric speaks to you.
2. Numbers Don’t Add Up
When you are used to your metrics, you know how they relate to each other. Some metrics move in opposite directions, i.e., one will often increase when the other decreases, and vice versa. Others are complementary, meaning they will, by nature, add up to a whole. A rapid test, whenever you are presented with percentages, is to add them up. More often than you would expect, the total differs from 100%. There may be a legitimate explanation for that, such as rounding or overlapping categories, and you will learn a bit about your data by understanding what it is. A quick check of the metrics’ consistency is only possible when looking at the complete dashboard of your business activities. It is critical to look at it as a whole to have a reasonable level of control.
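The add-them-up test is easy to automate. The following is a minimal Python sketch, not a definitive implementation: the function name `percentages_add_up`, the half-point tolerance, and the channel figures are all hypothetical and only there to illustrate the check.

```python
import math

def percentages_add_up(shares, total=100.0, tolerance=0.5):
    """Return (ok, gap): whether the shares sum to `total` within `tolerance`."""
    gap = math.fsum(shares.values()) - total
    return abs(gap) <= tolerance, gap

# Hypothetical dashboard figures: share of sales by channel, in percent.
channel_share = {"online": 41.2, "retail": 35.5, "partners": 18.1}

ok, gap = percentages_add_up(channel_share)
if not ok:
    print(f"Shares are off by {gap:+.1f} points; ask what is missing or double-counted.")
```

A small tolerance is kept on purpose, so that ordinary rounding of each share does not raise a false alarm.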
3. Useless Metrics
When you are presented with too many metrics, with various analyses of the same data across multiple dimensions, or with many graphic representations, you will miss the obvious. A dashboard or a report should only contain useful metrics. As the data consumer, you define what you need. A simple question to ask about each metric, analysis, or graph is: what decision can I make based on this piece of information? If there is no decision to make, that is a clue that it may not be critical. Once you get rid of the clutter, you won’t miss the relevant information.
4. Round Numbers
Round numbers are rare in business and in natural events. Real numbers tend to look random and don’t end in a neat series of zeros. The most telling sign is when a number hits the target exactly. Business results may be near, above, or below the target, but they are never consistently on target to the second decimal. If a number was rounded to make the presentation cleaner, ask for the actual number; it is effortless to round numbers mentally when you need to. As we will see later, having the actual numbers is critical for applying other data lie detectors.
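As a rough illustration, the two warning signs described here (a run of trailing zeros, or a figure that matches its target exactly) can be sketched in Python. The function name `looks_too_round` and the threshold of three trailing zeros are assumptions made for this example, not an established rule.

```python
from decimal import Decimal

def looks_too_round(value, target=None):
    """Flag a reported figure that ends in a run of zeros or equals its target exactly.

    `value` and `target` are passed as strings so that no decimals are lost.
    """
    number = Decimal(value)
    on_target = target is not None and number == Decimal(target)
    # Assumed threshold: a whole number of at least 1,000 ending in three zeros.
    trailing_zeros = (
        number == number.to_integral_value()
        and abs(number) >= 1000
        and number % 1000 == 0
    )
    return on_target or trailing_zeros

print(looks_too_round("25000"))                 # ends in three zeros
print(looks_too_round("24873.16"))              # looks like a real figure
print(looks_too_round("12.50", target="12.5"))  # exactly on target
```

A flag is only a prompt to ask for the unrounded figure; some numbers are legitimately round.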
5. Adjustment, Correction, Outliers
Adjustments and corrections are valid operations to perform on large data sets. But they are also very convenient ways of moving the data in the desired direction. It could be a conscious operation, aiming at getting rid of valid data that would look bad in the reporting. It could be an involuntary bias introduced into the process. It may also be an honest mistake. In all cases, the words adjustment and correction are red flags that should lead you to dig deeper into the processing used to manipulate the data. Another noteworthy term is outliers. It is a widespread practice in statistics and data science to get rid of outliers. There are methodologies and rationales to identify and eliminate them. But without a clear understanding of the phenomenon, removing outliers could very well destroy the clues of a nascent business opportunity or the early manifestations of a creeping catastrophe. Look into them separately with a keen business eye, not the cold analytic glare of a statistician.
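One way to follow this advice is to set outliers aside for review instead of deleting them. The sketch below uses the common interquartile-range rule, which is one convention among several and not necessarily the one your data team applies; the function name `split_outliers` and the order counts are hypothetical.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (regular, outliers) using the interquartile-range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    regular = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return regular, outliers

# Hypothetical daily order counts; two of the days are unusual.
daily_orders = [102, 98, 105, 110, 97, 101, 99, 104, 103, 100, 240, 12]
regular, outliers = split_outliers(daily_orders)
print("Keep for the trend:", regular)
print("Review by hand before discarding:", outliers)
```

Nothing is thrown away: the second list is exactly what deserves a look with a business eye.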
6. Non-normal Distribution
A normal distribution, also called Gaussian distribution or bell curve due to its shape, is characteristic of many natural events. Mathematically, it arises as the limiting distribution of the sum of many independent random variables. In other words, it happens when many independent factors contribute to a single outcome. In the business world, that could be, at least approximately, a purchase amount or the balance of a bank account. It is also, approximately, the distribution of the height of individuals in a population or of their I.Q. scores. Whenever you look at large data sets, some of the variables will follow a normal distribution. Without resorting to mathematical tooling, the easiest way to detect any bias is to ask for a visual representation of the data. If the data is sound, it should display a regular bell-shaped curve. If it comes out as two bell curves or only part of a bell curve, investigate further why the data looks this way.
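If no charting tool is at hand, a crude text histogram is enough to eyeball the shape. This is a minimal Python sketch under assumptions: `text_histogram` is a hypothetical helper, and the sample is simulated with `random.gauss` purely to have something to plot.

```python
import random

def text_histogram(values, bins=10, width=40):
    """Return one text bar per bin so the shape of the data can be eyeballed."""
    low, high = min(values), max(values)
    step = (high - low) / bins or 1  # avoid a zero step when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / step), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    peak = max(counts)
    return [
        f"{low + i * step:10.1f} | {'#' * round(width * c / peak)} ({c})"
        for i, c in enumerate(counts)
    ]

# Simulated data standing in for a real column, e.g. purchase amounts.
random.seed(7)
sample = [random.gauss(100, 15) for _ in range(2000)]
print("\n".join(text_histogram(sample)))
```

Two humps, or a curve that stops abruptly on one side, are the shapes worth questioning.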
7. Breaking Benford’s Law
Benford’s Law, also called the first-digit law, also applies to data generated by human activities, especially when the values span several orders of magnitude. It is an empirical law: it was not derived from complex mathematical theory but from observing many large data sets and consistently finding the same distribution. Benford’s Law notes that in large listings of data, the numbers starting with 1 are over-represented compared with the numbers beginning with 2, which in turn are more frequent than those beginning with 3, and so on. In these listings, about 30% of the numbers start with a 1. We are only considering the first digit here, so 11, 124, and 1.579 all start with the digit 1. If your data does not follow Benford’s Law, it may be lying to you. This is why fraud detection and forensic accounting use this tool.
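Benford’s Law gives the expected share of each first digit d as log10(1 + 1/d), which is about 30.1% for the digit 1. The Python sketch below compares observed first-digit shares with that expectation. The helper names are hypothetical, and the powers of 2 are used only as a convenient data set known to follow the law closely; a real check would use your own listing and a proper statistical test.

```python
import math
from collections import Counter
from decimal import Decimal

def benford_expected(digit):
    """Expected share of numbers whose first significant digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def first_digit(number):
    """First significant digit of a non-zero number, e.g. 1 for 11, 124 and 1.579."""
    return Decimal(str(abs(number))).as_tuple().digits[0]

def first_digit_shares(numbers):
    """Observed share of each first digit among the non-zero numbers."""
    counts = Counter(first_digit(n) for n in numbers if n != 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

# Stand-in data set: the first 200 powers of 2.
shares = first_digit_shares([2 ** k for k in range(1, 201)])
for d in range(1, 10):
    print(f"{d}: observed {shares[d]:.1%}, expected {benford_expected(d):.1%}")
```

Large gaps between the two columns do not prove manipulation, but they are a reason to ask how the numbers were produced.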
Don’t Take Data at Face Value!
Data and technology are only as effective as their users allow them to be. Keeping an open mind about the data you handle, regardless of its origin, is a sure way to avoid critical decision-making mistakes.