The Inference Trap (and why clean data feels so good)

Mihir Wagle · 3 min read
Tags: inference, clean data

This article was originally published on LinkedIn on January 31, 2026.

In my last post, I compared data cleaning to inventory fraud: both mask operational defects, so the dashboard stays green while the business stays broken.

A very reasonable counterpoint is: "Mihir, we can't stop the trucks just because one shelf is empty. We have to fulfill the order with available inventory (impute) while we investigate the source (trace)."

That's the gold standard in ops. But in most AI-based insight projects, the "investigation" never gets funded or staffed. The imputation becomes the default state: once the dashboard turns green, urgency drops, and no one wants to do a Correction of Errors.

1. The Metric is the Mission

In a warehouse, an empty shelf is a physical reality. In a data pipeline, once you "impute" a value, the dashboard turns green. The "pain" of the failure disappears for leadership. When the metric looks healthy, the incentive to do the hard, expensive work of tracing a defect back to a legacy upstream system evaporates. We trade long-term systemic truth for short-term reporting optics.

2. The Inference Trap

Cleaning doesn't just remove noise. It changes what the model can infer.

Data isn't just sitting in a relational database where redacting a field creates a dead end. Modern models don't work that way. They infer missing pieces from context. Niloofar Mireshghallah's work on data sanitization ("A False Sense of Privacy") shows that surface-level removal often doesn't remove the underlying signal.

So if you clamp a margin outlier to "reasonable" values, but leave the surrounding features (e.g., destination and shipment weight), the model learns the pattern anyway. It won't use your average. It will re-encode the defect as a combination of other signals.

And because you've smoothed over the original symptom, you've made the failure harder to diagnose later. The model internalizes a distorted pattern, confidence stays high, and you lose the breadcrumb trail that would have pointed you back to the upstream issue.
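
Here's a toy version of that failure, with made-up column names and synthetic data. Clamp the corrupted margin, train on the features you left behind, and the model singles out the defective rows anyway:

```python
# Toy demo: clamp a defective margin, keep the surrounding features,
# and watch a model re-encode the defect anyway. Column names and data
# are made up for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
destination = rng.integers(0, 5, n)   # encoded destination
weight = rng.normal(50, 10, n)        # shipment weight

# Upstream defect: one destination + heavy shipments gets a broken margin.
defect = (destination == 3) & (weight > 55)
margin = rng.normal(0.20, 0.02, n)
margin[defect] = -0.80                # the outlier a cleaner would clamp

# "Cleaning": clamp margins to a reasonable-looking range before training.
margin_clean = np.clip(margin, 0.05, 0.35)

X = np.column_stack([destination, weight])
model = GradientBoostingRegressor().fit(X, margin_clean)

# The outlier values are gone, but the model still separates defective rows:
# it has rebuilt the defect from destination and weight alone.
preds = model.predict(X)
print(f"defective rows: {preds[defect].mean():.3f}")   # ~0.05 (the clamp floor)
print(f"healthy rows:   {preds[~defect].mean():.3f}")  # ~0.20
```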

3. The "Bounded Defect" Illusion

Often it's reasonable to assume defects are local: one missing field, one contained impact. That holds in a lot of deterministic pipelines.

In probabilistic systems it's closer to a butterfly effect. A defect in a field that seems irrelevant can reshape feature relationships, shift what the model treats as "normal," and cascade into decisions far downstream, even if the model never uses that field directly.

Fraud example: dispute_status is late and messy, so teams default nulls to no_dispute (or drop the field) to keep training data "complete." That quietly rewrites your label. "Not yet disputed" becomes "clean." A week of dispute-feed lag turns into a fake low-fraud period, and the model learns that those patterns are safe.
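
A minimal sketch of that label rewrite (column names are hypothetical):

```python
# Hypothetical sketch: defaulting null dispute_status quietly rewrites the label.
import pandas as pd

tx = pd.DataFrame({
    "txn_id": [1, 2, 3, 4],
    "dispute_status": ["no_dispute", None, "disputed", None],  # feed is lagging
})

# The "cleaning" step: nulls become no_dispute so the training set is "complete".
tx["is_fraud"] = tx["dispute_status"].fillna("no_dispute") == "disputed"

# Rows 2 and 4 are really "not yet disputed", but they now train the model
# that their patterns are safe.
print(tx)

# Safer: keep missingness explicit so lagging rows can be excluded or down-weighted.
tx["dispute_known"] = tx["dispute_status"].notna()
```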

Without provenance, you never tie the drift back to "the dispute feed was delayed after release X." You just get a confident model trained on missing bad news.
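
Provenance doesn't have to be heavy. A deliberately minimal sketch, with a hypothetical schema: every imputation writes down what was filled, with what, and from which feed version, so a drift investigation has something to join against.

```python
# Minimal provenance sketch (hypothetical schema): every imputation leaves a
# record, so later drift can be joined back to a concrete upstream event.
import json
from datetime import datetime, timezone

def impute_with_provenance(row, field, default, feed_version, log):
    """Fill a missing field, but record the act instead of hiding it."""
    if row.get(field) is None:
        row[field] = default
        log.append({
            "txn_id": row["txn_id"],
            "field": field,
            "imputed_value": default,
            "feed_version": feed_version,  # e.g. "dispute-feed@release-X"
            "imputed_at": datetime.now(timezone.utc).isoformat(),
        })
    return row

provenance_log = []
impute_with_provenance(
    {"txn_id": 2, "dispute_status": None},
    "dispute_status", "no_dispute", "dispute-feed@release-X", provenance_log,
)
print(json.dumps(provenance_log, indent=2))
```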

Truth over Tidiness

AI shouldn't be the thing that makes bad inputs look good. Imputation is fine as a stopgap, but only if it's tracked and paid down.

A defect budget is the difference: when cleaning crosses the line, it triggers an incident, an owner, and a fix upstream. Without that, you'll keep shipping models built on patched-over data.
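
A sketch of what that gate can look like in a training pipeline; the threshold and names are illustrative:

```python
# Illustrative defect budget: imputation is allowed up to an agreed rate,
# and crossing it blocks the pipeline instead of silently shipping a model.
def check_defect_budget(n_imputed: int, n_total: int, budget: float = 0.02) -> float:
    """Raise when the imputed share of records exceeds the budget."""
    rate = n_imputed / n_total
    if rate > budget:
        # In a real pipeline this would page an owner and open an incident;
        # the threshold and message here are placeholders.
        raise RuntimeError(
            f"Defect budget exceeded: {rate:.1%} imputed (budget {budget:.1%}). "
            "Trace the upstream source before training."
        )
    return rate

check_defect_budget(n_imputed=37, n_total=10_000)  # 0.4%: within budget, proceed
```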

Part 3: designing for traceability.
