The Clean Data Trap (Part 3): The Inverse Problem of Traceability
In my last two posts, I argued that data cleaning is Inventory Fraud and that AI models, inference engines that they are, can see through our masks.
But when I talk about "fixing the factory" instead of "cleaning the data," I’m met with a fair question: "Is it even possible to trace a defect back to its source in a system this complex?"
The answer is: only if you change how you architect the system.
The Inverse Problem Trap
As neuroscientist Konrad Kording points out in his work on Forward vs. Inverse Problems, there is a fundamental mathematical asymmetry in how we view systems.
The Forward Problem is easy: A supplier changes a part number (the Cause). Your warehouse data breaks (the Effect). The path is a straight line.
The Inverse Problem is hard: You see a "messy" data point in your dashboard (the Effect). Now, try to work backward to find the specific Cause.
In a high-dimensional data supply chain, the inverse problem is often "ill-posed." A thousand different upstream failures can result in the same "null" value. If you don't have the right architecture, working backward from a defect is like trying to reconstruct a shattered vase from the dust on the floor. It’s not just difficult; it’s mathematically impossible.
Sidebar: The Mystery of an Opaque Sponge
Let's take a trip down memory lane: the hostels of IITB, more than 30 years ago. One of the professors there, Suresh K. Bhatia, was working on inverse problems. One of my hostelmates got me a photocopy of his paper. I'll try to paraphrase it here.
Imagine a giant, pitch-black sponge. You can't see its internal structure or the size of its hidden pores. You pour a liter of water onto the top and start a stopwatch to measure exactly how long it takes for the surface to dry.
This is what scientists call an Inverse Problem. You have the final result (the water disappeared) and you must work backward to map the "how." You want to know if the water raced through large tunnels or squeezed through tiny cracks as the sponge became more saturated.
The data you collect with your stopwatch is rarely perfect. You might blink, or perhaps a few drops splash onto the floor. In the 1980s, the common instinct was to "clean" this data by smoothing out the jagged lines on the graph to make it look professional.
In his 1990 paper on stochastic modeling of transport in porous media, Suresh K. Bhatia (SKB, as we called him) challenged that approach. He argued that those jagged lines weren't just mistakes; they were the fingerprints of the internal structure. If you "clean" the data to make it look smooth, you aren't just removing noise; you're erasing the evidence of how the sponge actually works. SKB developed a mathematical way to keep the messiness while still finding the truth, showing that the most accurate view of reality often comes from the rawest information.
The Inverse Problem is a trap because it is one-to-many. A missing price in your dashboard could be a timed-out API, a deleted record in a CRM, or a human error in a warehouse 3,000 miles away. Without metadata, you aren't solving a puzzle; you're just guessing.
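The one-to-many trap can be shown in a few lines. Here is a minimal sketch (the failure names and the $19.99 price are illustrative): the forward map from cause to effect is trivial, but three distinct causes collapse into the same observed effect, so the effect alone cannot be inverted.

```python
def forward(cause: str):
    """Forward problem: map an upstream failure to what the dashboard sees."""
    effects = {
        "api_timeout": None,         # pricing API timed out
        "crm_record_deleted": None,  # record deleted in the CRM
        "warehouse_typo": None,      # human error 3,000 miles away
        "healthy": 19.99,            # the one happy path
    }
    return effects[cause]

# Forward is easy: each cause yields exactly one effect.
assert forward("api_timeout") is None

# Inverse is ill-posed: one effect maps back to many causes.
causes_of_null = [
    c for c in ["api_timeout", "crm_record_deleted", "warehouse_typo", "healthy"]
    if forward(c) is None
]
print(causes_of_null)  # three distinct causes, one indistinguishable effect
```

Without metadata riding along with the record, nothing distinguishes those three causes after the fact.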
From Maps to Radiation Badges
In the deterministic era, we tried to solve the inverse problem with Lineage Maps: drawing a line from every cell back to its source. But in the probabilistic era of AI, those maps break.
Instead of trying to solve the "Inverse Problem" after the fact, we have to build Forward Constraints. We need a Radiation Badge mindset. A radiation badge doesn't "map" the room; it records your exposure in real-time.
Here is how you architect for truth in a world of inverse problems:
1. Shadow Imputation (Preserve the Evidence)
The biggest obstacle to solving an inverse problem is destructive cleaning. When you overwrite a NULL with an average, you delete the only evidence of the failure.
The Fix: Treat raw signals as immutable. If you must impute a value to keep a model running, store it in a Shadow Column.
The Result: You keep the trucks moving, but you preserve the "Effect" so the "Cause" can actually be found.
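A minimal sketch of shadow imputation, assuming a simple record-per-dict pipeline (the column names `price_imputed` and `price_imputation_method`, and the fallback value, are illustrative): the raw field is never overwritten, and the model reads the shadow column instead.

```python
def shadow_impute(record: dict, field: str, fallback) -> dict:
    """Impute into a shadow column; the raw (possibly NULL) value stays immutable."""
    out = dict(record)  # copy: the raw record is never mutated
    if record.get(field) is None:
        out[f"{field}_imputed"] = fallback  # what the model consumes
        out[f"{field}_imputation_method"] = "fallback"
    else:
        out[f"{field}_imputed"] = record[field]
        out[f"{field}_imputation_method"] = "none"
    return out

row = {"sku": "A-123", "price": None}
patched = shadow_impute(row, "price", fallback=19.99)

# Models read `price_imputed`; auditors can still see that `price` was NULL.
assert patched["price"] is None
assert patched["price_imputed"] == 19.99
```

The design choice is the point: imputation becomes an annotation, not a destructive overwrite, so the "Effect" survives long enough for the "Cause" to be traced.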
2. Implement a "Defect Budget" (Triage by Dose)
In a complex system, you can't investigate every anomaly. You have to triage by "dose."
The Fix: Every Tier-1 field gets an allowable "messiness" threshold.
The Result: When the "Radiation Badge" on a data feed crosses the budget, it triggers an Operational Incident owned by the source system. You aren't guessing the cause; you are alerting the "supplier" the moment the defect occurs.
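A hedged sketch of what a defect budget check might look like, assuming null rate is the messiness metric (the field names, the 2% budget, and the exception standing in for an incident are all illustrative):

```python
# Max tolerated null rate per Tier-1 field (illustrative thresholds).
DEFECT_BUDGETS = {"price": 0.02, "supplier_id": 0.01}

def check_budget(field: str, values: list, source: str) -> float:
    """Return the null rate, or raise an 'incident' owned by the source system."""
    null_rate = sum(v is None for v in values) / len(values)
    if null_rate > DEFECT_BUDGETS[field]:
        # In production this would page the owning team, not raise an exception.
        raise RuntimeError(
            f"Defect budget exceeded on '{field}' from {source}: "
            f"{null_rate:.1%} > {DEFECT_BUDGETS[field]:.0%}"
        )
    return null_rate

feed = [19.99, None, 21.50, None, 18.00]  # 40% nulls: well over budget
try:
    check_budget("price", feed, source="warehouse-feed-eu")
except RuntimeError as incident:
    print(incident)
```

Note that the incident names the source system, not the dashboard: the alert fires at the point of emission, while the cause is still in view.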
3. Trace the "Diet," Not the "Mind"
Don't waste time asking a 70-billion-parameter model "why" it made a decision. That's an unsolvable inverse problem. Instead, audit the AI's diet.
The Fix: Use Row-Level Provenance. Every record should carry a "Birth Certificate" (Source ID, Timestamp, Version).
The Result: When the model drifts, you don't audit the weights; you audit the raw materials. You look for the "Information Stockouts" in the source systems that provided the inputs.
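One way to sketch a "Birth Certificate" is an immutable provenance stamp attached at ingestion time (the dataclass fields and the `_provenance` key are assumptions, not a standard schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # provenance is immutable once stamped
class BirthCertificate:
    source_id: str       # which upstream system emitted this record
    ingested_at: str     # when it entered the pipeline (ISO 8601, UTC)
    schema_version: str  # version of the contract it was written under

def stamp(record: dict, source_id: str, schema_version: str) -> dict:
    """Attach a birth certificate to a record at the ingestion boundary."""
    cert = BirthCertificate(
        source_id=source_id,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        schema_version=schema_version,
    )
    return {**record, "_provenance": cert}

row = stamp({"sku": "A-123", "price": None}, source_id="crm-east", schema_version="v4")

# When the model drifts, group defects by _provenance.source_id to find
# the "Information Stockout" upstream, instead of interrogating the weights.
assert row["_provenance"].source_id == "crm-east"
```

With this in place, the drift investigation becomes a `GROUP BY source_id` over defective rows, which is a forward query, not an inverse reconstruction.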
Truth over Tidiness
Tracing back to the root cause will always be an inverse problem. If you spend your time "cleaning" the data, you are making that problem unsolvable. You are effectively erasing the crime scene.
Success in the "Great Reset" belongs to the architects who stop building Refineries and start building Diagnostic Labs. We don't need AI to give us a cleaner version of the truth. We need an architecture that lets us fix the world as it is.
Series Wrap-up:
Part 1: Data cleaning is Inventory Fraud
Part 2: The Inference Trap & The False Green
Part 3: The Inverse Problem of Traceability
Sources & Reading
Industry Context: "Google Cloud exec: Software's 'Great Reset' from predictability to uncertainty" – Fortune (Jan 21, 2026)
The Mathematics of Causality: "Forward vs. Inverse Problems: Why High-Dimensional Systems Are Ill-Posed" – Konrad Kording, Kording Lab (2024)
Foundation Research: "The Inverse Problem of Pore Structure Characterization" – Suresh K. Bhatia, Chemical Engineering Science (1990)
AI Technical Risk: "A False Sense of Privacy: Evaluating Textual Data Sanitization" – Niloofar Mireshghallah, et al. (2025)
Market Outlook: "Gartner: Lack of AI-Ready Data Puts AI Projects at Risk" – Gartner Newsroom (Feb 26, 2025)