There Is No Bad Data
Assuming your operations generate it and you aren't just making it up, there is just data. If your data isn't suited to a particular purpose, that doesn't make it "bad." The value of information is intrinsically linked to its intended use.
Data Collection
The act of doing business generates most of a business's data. There is often an inherent tension between the need to complete a transaction and the opportunity for data capture at that point. The data captured may not be optimized for later analytical purposes, especially when legacy systems cannot easily accommodate new dimensions.
Employees are often driven by quotas and immediate results, not by any incentive to meticulously record every dimension or crucial nuance available during an interaction. If you reward claims processors for how quickly they process claims, defaulted fields and open-text values will likely yield misleading information. If you pay your sales team to close deals, closing the deal is what matters, not capturing every possible detail about the sale.
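A minimal sketch of how this plays out, using entirely hypothetical claims records: when speed is rewarded, the form's default value piles up and quietly dominates any later summary.

```python
from collections import Counter

# Hypothetical claims records; "unknown" is the form's default cause,
# left in place by processors racing to meet a throughput quota.
claims = [
    {"claim_id": 1, "cause": "water"},
    {"claim_id": 2, "cause": "unknown"},
    {"claim_id": 3, "cause": "unknown"},
    {"claim_id": 4, "cause": "unknown"},
]

# A downstream analyst tallies loss causes from what was captured.
counts = Counter(c["cause"] for c in claims)
print(counts.most_common(1))  # [('unknown', 3)]

# The most common "cause" is the default, which says more about the
# incentive structure than about what actually happened to the claims.
```

The data here isn't wrong, exactly; it faithfully records what the process rewarded.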
As a result, we often fall into the trap of assuming data has inherent value and that more data is always better. Collect it all; we'll sort it out later. But value is contextual and purpose-driven. A massive amount of irrelevant data is not only useless; it is a significant drain on storage, processing power, and the time spent sifting through it.
Data Aggregation
The technical teams responsible for extracting, transforming, and loading data play a significant role in shaping the data available for analysis. They operate based on their understanding of schematic data models and IT domains, and they are driven by processing and execution requirements that often do not track directly to business value. They focus on the limitations of the systems involved. During these processes, decisions and interpretations are made about how data is processed, aggregated, summarized, and merged, often across different data sources. While necessary for efficiency and usability, these decisions can also introduce biases or lose the granularity present in the original data.
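To make the granularity point concrete, here is a small hypothetical example: a pipeline that stores only daily totals makes two very different days indistinguishable downstream.

```python
# Hypothetical point-of-sale transactions. Monday's total is driven by
# one large outlier; Tuesday's is two ordinary sales.
transactions = [
    {"day": "Mon", "amount": 10},
    {"day": "Mon", "amount": 990},
    {"day": "Tue", "amount": 500},
    {"day": "Tue", "amount": 500},
]

# The ETL step keeps only a daily total, a sensible efficiency choice.
daily_totals = {}
for t in transactions:
    daily_totals[t["day"]] = daily_totals.get(t["day"], 0) + t["amount"]

print(daily_totals)  # {'Mon': 1000, 'Tue': 1000}

# Both days now look identical. The distinction between an outlier-driven
# day and a steady one is gone, and no downstream query can recover it.
```

The summary isn't wrong for the purpose it was built for; it simply can no longer answer questions it wasn't designed to serve.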
Underlying processes, organizational structures, infrastructure, and cost-accounting measures add further friction, limiting these teams' ability to provide tailored, customized data sets.
Data Analysis
A business user, analyst, or scientist often looks for data for a particular purpose. However, the initial request frequently lacks clarity about the purpose for which the data will be used (if that purpose is even well understood). The mismatch between underlying operational data structures and business domains complicates data exploration, as do the silos in which both employees and data exist. This often leads to misinterpretations and the extraction of data that, while technically correct, doesn't truly answer the intended question. The "opaque task" can also contribute to the problem: if the requester isn't clear about their goals or how they plan to use the data, they might inadvertently pull in irrelevant information or miss crucial details. This lack of an explicit objective acts as a filter, potentially shaping the data in a way that obscures its true potential.
Data Confusion
This disconnect becomes consequential in a world increasingly driven by data-informed decisions. Frequently, the data presented as available is a copy of a data set built for a previous request or, worse, a view built on top of other views. The cumulative effect of these layers of interpretation and transformation can be significant. The final dataset presented to an analyst might be a highly processed and potentially distorted version of the raw data generated by the initial interaction. This isn't necessarily intentional obfuscation but rather a consequence of time, availability, and funding. It is also a result of different individuals with different priorities and levels of understanding interacting with the data at various stages.
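The layering effect can be sketched with a few hypothetical transformations: each "view" was reasonable for its original request, but the analyst working from the last one can no longer see the raw values.

```python
# Hypothetical raw measurements captured at the source.
raw = [3.4, 7.6, 12.2, 18.9]

# View 1, built for a request that wanted whole units.
view_1 = [round(x) for x in raw]               # [3, 8, 12, 19]

# View 2, built on view 1 for a request about "large" items only.
view_2 = [x for x in view_1 if x >= 5]         # [8, 12, 19]

# View 3, built on view 2, relabels values into coarse buckets.
view_3 = ["high" if x > 10 else "low" for x in view_2]

print(view_3)  # ['low', 'high', 'high']

# From view_3 alone there is no way to recover the original values,
# the rounding that was applied, or the record dropped by view 2.
```

None of these steps is a mistake in isolation; the distortion comes from stacking them without carrying the context along.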
So WHAT!
The journey of data from its generation to its analysis is often complex, involving multiple stakeholders with different priorities and levels of understanding. Recognizing this complexity and focusing on clarity of objectives, data literacy, and effective communication are essential for unlocking the true potential of the data that surrounds us. Instead of labeling data as inherently "good" or "bad," we should focus on understanding its context, limitations, and suitability for the specific questions we are trying to answer. This shift in perspective can lead to more insightful analyses, better decision-making, and a more nuanced understanding of the world around us.
The pressure of the AI/ML revolution is making businesses reexamine existing models. The challenge lies not in eliminating “bad” data but in:
Asking whether systems still fit the needs of the business (and, if not, whether it is cost-effective to change them)
Examining processes to ensure they align with current business objectives
Rethinking the way data is delivered to the business
Learning how to effectively harness the data that exists for meaningful purposes