Book 0 — Chapter 0.2

Data, Reality, and the Distribution Gap

Concept page: models train on data but deploy into reality. This chapter names the mismatch and gives a practical taxonomy for diagnosing distribution shift.

0.2.1 The uncomfortable truth: data is a proxy for reality

Machine learning systems are trained on data, but they are deployed into reality. These two are never identical.

Formally, training assumes examples are drawn from a fixed distribution:

(x, y) ~ D

But in practice:

  • data is collected by imperfect processes
  • labels are delayed, noisy, or biased
  • user behavior changes over time
  • systems interact with the environment they predict

Engineering truth

The central problem of ML engineering is not learning from data — it is learning from a distorted, delayed, and drifting shadow of reality.

0.2.2 The IID assumption (and why it almost always fails)

Most ML theory assumes data is IID:

  • Independent: samples do not influence each other
  • Identically Distributed: all samples come from the same distribution

This assumption is mathematically convenient — and operationally fragile.

Examples of violations:

  • Time series: yesterday affects today
  • User data: users generate multiple correlated samples
  • Recommendation systems: predictions affect future data
  • Logging changes: distribution changes without warning

Engineering takeaway

IID is a modeling assumption, not a fact. Systems must be designed expecting it to fail.
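
As a concrete look at the correlated-samples violation above, here is a minimal sketch, assuming scikit-learn and a hypothetical dataset where each user contributes many rows: a random split lets the same user land on both sides, so the test set is silently correlated with the training set, while a group-aware split keeps each user entirely on one side.

    import numpy as np
    from sklearn.model_selection import train_test_split, GroupShuffleSplit

    rng = np.random.default_rng(0)
    n = 1000
    user_id = rng.integers(0, 100, size=n)     # 100 users, ~10 rows each
    X = rng.normal(size=(n, 5))                # hypothetical features
    y = rng.integers(0, 2, size=n)             # hypothetical labels

    # Random split: the same users appear on both sides, so "held-out" rows
    # are correlated with training rows (an IID violation in disguise).
    X_tr, X_te, u_tr, u_te = train_test_split(X, user_id, test_size=0.2, random_state=0)
    print("users shared by random split:", len(set(u_tr) & set(u_te)))

    # Group-aware split: every user is entirely in train or entirely in test.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(gss.split(X, y, groups=user_id))
    print("users shared by group split:", len(set(user_id[train_idx]) & set(user_id[test_idx])))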

0.2.3 Training distribution ≠ serving distribution

Let:

  • D_train: distribution of training data
  • D_serve: distribution seen in production

Generalization quietly assumes:

D_train ≈ D_serve

In real systems:

D_train ≠ D_serve

This mismatch is called the distribution gap.

Everything from sudden metric drops to silent model decay traces back to this inequality.
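
One common way to put a number on the gap for a single feature (a widely used heuristic, not something this chapter prescribes) is the Population Stability Index between its training and serving distributions. A minimal sketch, assuming NumPy and synthetic data:

    import numpy as np

    def psi(train_values, serve_values, bins=10, eps=1e-6):
        """Population Stability Index between two samples of one feature.
        Common industry heuristic: < 0.1 stable, 0.1-0.25 moderate shift,
        > 0.25 significant shift. Treat these cutoffs as rules of thumb."""
        # Bin edges come from training data; the same edges score serving data.
        # Serving values outside the training range fall out of these bins;
        # a production version would add open-ended edge bins.
        edges = np.histogram_bin_edges(train_values, bins=bins)
        p = np.histogram(train_values, bins=edges)[0] / len(train_values)
        q = np.histogram(serve_values, bins=edges)[0] / len(serve_values)
        p, q = p + eps, q + eps          # avoid log(0) for empty bins
        return float(np.sum((p - q) * np.log(p / q)))

    rng = np.random.default_rng(0)
    train_x = rng.normal(0.0, 1.0, 10_000)   # one feature under D_train
    serve_x = rng.normal(0.5, 1.2, 10_000)   # the same feature under D_serve, drifted
    print(f"PSI = {psi(train_x, serve_x):.3f}")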

0.2.4 Types of distribution shift (critical taxonomy)

Not all shifts are the same. Diagnosing the type matters.

a) Covariate shift

P_train(x) ≠ P_serve(x), while P(y | x) is unchanged.

Examples:

  • new user demographics
  • seasonal effects
  • product UI changes

Often mitigated by:

  • reweighting
  • retraining
  • feature normalization
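
A minimal sketch of the reweighting option, assuming scikit-learn and synthetic features: train a classifier to tell training rows from serving rows, then turn its probabilities into importance weights w(x) ≈ P_serve(x) / P_train(x) that can be passed as sample weights when retraining the task model.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(0.0, 1.0, size=(5000, 3))   # hypothetical training features
    X_serve = rng.normal(0.4, 1.0, size=(5000, 3))   # serving features have shifted

    # Domain classifier: label 0 = training row, 1 = serving row.
    X_dom = np.vstack([X_train, X_serve])
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_serve))])
    clf = LogisticRegression(max_iter=1000).fit(X_dom, d)

    # Density-ratio estimate: P(serve | x) / P(train | x) tracks P_serve(x) / P_train(x)
    # up to a constant (the sample-size ratio), which normalization absorbs.
    p_serve = clf.predict_proba(X_train)[:, 1]
    w = p_serve / (1.0 - p_serve)
    w *= len(w) / w.sum()                    # make the weights average to 1
    print("importance weight range:", w.min().round(3), "to", w.max().round(3))
    # Pass w as sample_weight when refitting the task model on X_train.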

b) Label shift

P_train(y) ≠ P_serve(y), while P(x | y) is unchanged.

Examples:

  • fraud rates increase during holidays
  • rare events become more common

This breaks:

  • calibrated probabilities
  • threshold-based decisions
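
A worked sketch of the breakage and the standard fix when only P(y) changes: rescale each calibrated class probability by the ratio of new to old priors, then renormalize. The priors below are invented for illustration.

    import numpy as np

    def adjust_priors(probs, train_prior, serve_prior):
        """Re-calibrate class probabilities under label shift.
        probs: (n, k) outputs calibrated against the training prior."""
        ratio = np.asarray(serve_prior) / np.asarray(train_prior)
        adjusted = probs * ratio                         # p(y|x) * pi_serve(y) / pi_train(y)
        return adjusted / adjusted.sum(axis=1, keepdims=True)

    # Hypothetical fraud model: trained at a 1% fraud rate, holidays push it to 5%.
    probs = np.array([[0.90, 0.10]])                     # [not fraud, fraud] for one case
    print(adjust_priors(probs, train_prior=[0.99, 0.01], serve_prior=[0.95, 0.05]))
    # The fraud probability rises from 0.10 to roughly 0.37, so a threshold
    # tuned on training data now fires very differently.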

c) Concept drift

P_train(y | x) ≠ P_serve(y | x)

Examples:

  • user intent changes
  • adversaries adapt
  • market conditions change

Hardest form of drift

No static model can survive concept drift indefinitely.
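
Because concept drift changes P(y | x) itself, input monitoring alone cannot see it; the reliable signal is the model's own error once labels arrive. A minimal sketch of a rolling-window monitor (the window size and thresholds are placeholders, not recommendations):

    from collections import deque

    class RollingErrorMonitor:
        """Tracks the recent error rate and alarms when it degrades past a baseline."""

        def __init__(self, window=500, baseline_error=0.10, tolerance=0.05):
            self.errors = deque(maxlen=window)   # 1 = wrong, 0 = right
            self.baseline = baseline_error
            self.tolerance = tolerance

        def update(self, prediction, label):
            self.errors.append(int(prediction != label))
            return self.alarm()

        def alarm(self):
            if len(self.errors) < self.errors.maxlen:
                return False                     # not enough labeled traffic yet
            current = sum(self.errors) / len(self.errors)
            return current > self.baseline + self.tolerance

    # Hypothetical usage on a stream of delayed labels:
    # monitor = RollingErrorMonitor()
    # for prediction, label in labeled_stream:
    #     if monitor.update(prediction, label):
    #         ...  # page someone, roll back, or trigger retraining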

0.2.5 Why offline evaluation lies

Offline evaluation assumes:

  • test data ≈ future data
  • labels are correct and timely
  • no feedback loops exist

In production:

  • labels arrive late or never
  • predictions influence behavior
  • metrics are delayed proxies

This is why:

  • models pass validation and fail in prod
  • A/B tests disagree with offline metrics
  • “accuracy” improves while business metrics decline

Offline metrics estimate performance under assumptions that reality violates.

0.2.6 Temporal structure: why random splits fail

Random train-test splits implicitly assume:

  • time does not matter
  • future resembles the past

For temporal data:

  • random splits leak future information
  • validation becomes optimistic
  • deployment performance collapses

Correct approach:

  • time-based splits
  • rolling windows
  • backtesting

Exam signal

If time is involved, random splitting is suspicious.
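
A minimal sketch of the time-based approach, assuming scikit-learn: TimeSeriesSplit always validates on rows that come strictly after the training fold, which is the property backtesting formalizes.

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Hypothetical feature matrix whose rows are already sorted by time (oldest first).
    X = np.arange(1000).reshape(-1, 1).astype(float)

    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        # Every validation index is later than every training index: no future leakage.
        assert train_idx.max() < val_idx.min()
        print(f"fold {fold}: train through t={train_idx.max()}, "
              f"validate t={val_idx.min()}..{val_idx.max()}")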

0.2.7 Sampling bias: who gets into the dataset?

Datasets are rarely random samples of reality.

Common sources:

  • logging bias (only some events recorded)
  • survivorship bias (failures disappear)
  • self-selection (only certain users participate)
  • human labeling bias

Once bias enters the dataset, training amplifies it.

Not just ethics

This is not an ethical issue alone — it is a statistical inevitability.
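
A tiny simulation of the logging-bias case, with invented numbers: if only sessions that produced a click are ever logged, any rate estimated from the log is wrong, and collecting more logged data only makes the wrong estimate more precise.

    import numpy as np

    rng = np.random.default_rng(0)
    n_sessions = 1_000_000
    clicked = rng.random(n_sessions) < 0.03                 # true click rate: 3%
    satisfied = np.where(clicked,
                         rng.random(n_sessions) < 0.60,     # clickers: 60% satisfied
                         rng.random(n_sessions) < 0.20)     # non-clickers: 20% satisfied

    print("true satisfaction rate:  ", satisfied.mean().round(3))
    # Logging bias: only sessions with a click ever reach the dataset.
    print("rate in the logged data: ", satisfied[clicked].mean().round(3))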

0.2.8 Feedback loops: models change the data they learn from

When a model’s predictions affect the environment:

  • recommendations influence clicks
  • risk scores influence approvals
  • alerts influence investigation rates

The data-generating process becomes:

D_(t+1) = f(D_t, model_t)

This violates stationarity and IID assumptions completely.

Result: models can reinforce their own mistakes.
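
A toy simulation of D_(t+1) = f(D_t, model_t), with invented parameters: a two-item recommender that always shows its higher-scoring item, and whose next round of training data contains only feedback on whatever it chose to show.

    import numpy as np

    rng = np.random.default_rng(0)
    true_ctr = np.array([0.10, 0.12])      # item 1 is genuinely better
    clicks = np.array([1.0, 1.0])          # optimistic starting counts
    shows = np.array([1.0, 1.0])

    for t in range(2000):
        scores = clicks / shows            # model_t: empirical CTR estimate
        chosen = int(np.argmax(scores))    # policy: always show the top-scoring item
        shows[chosen] += 1
        clicks[chosen] += rng.random() < true_ctr[chosen]   # D_(t+1) depends on model_t

    print("estimated CTRs:      ", (clicks / shows).round(3))
    print("impressions per item:", shows.astype(int))
    # Whichever item looks better early receives nearly all impressions, so the
    # other item's estimate barely updates: the model's own choices decide what
    # data exists for the next round, and early mistakes can become self-fulfilling.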

0.2.9 Dataset shift vs model failure (diagnostic thinking)

When performance drops, ask:

  • Did the data distribution change?
  • Did the feature pipeline change?
  • Did labeling change?
  • Did thresholds change?
  • Did the business objective change?

Do not immediately:

  • blame the algorithm
  • tune hyperparameters
  • add model complexity

Rule of thumb

Most failures are data failures, not model failures.
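
A minimal first step in that spirit, assuming SciPy: before touching the model, compare each feature's recent serving values against a training-time reference with a two-sample test. The Kolmogorov-Smirnov test below is one common choice among several; the feature names and data are hypothetical.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    reference = {"age": rng.normal(40, 10, 5000), "session_len": rng.exponential(3.0, 5000)}
    recent    = {"age": rng.normal(46, 10, 5000), "session_len": rng.exponential(3.1, 5000)}

    for name in reference:
        stat, p_value = ks_2samp(reference[name], recent[name])
        flag = "SHIFT?" if p_value < 0.01 else "ok"
        print(f"{name:12s} KS={stat:.3f} p={p_value:.2e} {flag}")
    # A flagged feature points at a data or pipeline change to investigate
    # before reaching for hyperparameters or a bigger model.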

0.2.10 Why “retraining” is not a silver bullet

Retraining assumes:

  • fresh data reflects reality
  • labels are correct
  • drift is slow

But:

  • retraining on biased data reinforces bias
  • retraining too often increases variance
  • retraining too late misses change

Frame it correctly

Retraining is a control mechanism, not a fix.
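
A sketch of what treating retraining as a control mechanism can look like, with invented thresholds: retraining is triggered by evidence (degraded error or drifted inputs), gated on label quality, and skipped otherwise, rather than run blindly on a schedule.

    from dataclasses import dataclass

    @dataclass
    class RetrainSignals:
        drift_score: float        # e.g. PSI or a KS statistic on key features
        recent_error: float       # error on recently labeled traffic
        baseline_error: float     # error at the last deployment
        label_coverage: float     # fraction of recent traffic with trusted labels

    def retrain_decision(s: RetrainSignals) -> str:
        """Illustrative policy; every threshold here is a placeholder."""
        if s.label_coverage < 0.5:
            return "hold: not enough trusted labels to retrain safely"
        if s.recent_error > s.baseline_error * 1.2:
            return "retrain: performance has degraded"
        if s.drift_score > 0.25:
            return "retrain: inputs have shifted; refresh before errors surface"
        return "skip: no evidence retraining would help right now"

    print(retrain_decision(RetrainSignals(0.31, 0.11, 0.10, 0.90)))
    print(retrain_decision(RetrainSignals(0.05, 0.18, 0.10, 0.30)))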

0.2.11 Engineering consequences of the distribution gap

This chapter directly motivates:

  • validation strategies
  • monitoring and drift detection
  • shadow deployments
  • canary releases
  • human-in-the-loop systems

You cannot eliminate distribution shift. You can only detect it early and respond safely.
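
As one example from the list above, a minimal sketch of the shadow-deployment idea: the candidate model scores every request and its outputs are logged for offline comparison, but only the production model's answer is ever returned. The model objects and their predict method are placeholders.

    import logging

    logger = logging.getLogger("shadow")

    def serve(request_features, prod_model, shadow_model):
        """Return the production prediction; log the shadow prediction for comparison."""
        prod_pred = prod_model.predict(request_features)
        try:
            shadow_pred = shadow_model.predict(request_features)
            logger.info("shadow_compare prod=%s shadow=%s features=%s",
                        prod_pred, shadow_pred, request_features)
        except Exception:
            # The shadow path must never be allowed to break serving.
            logger.exception("shadow model failed")
        return prod_pred   # callers only ever see the production model's output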

0.2.12 Chapter takeaway

If Chapter 0.1 says:

“We optimize the wrong objective on purpose.”

Then Chapter 0.2 adds:

“And we optimize it on the wrong data, from the wrong time, about the wrong world.”

Machine learning works anyway — not because these problems don’t exist, but because systems are engineered with them in mind.

Readiness Check

You should now be able to:

  • Explain why IID is an assumption, not reality
  • Identify covariate shift vs label shift vs concept drift
  • Explain why offline metrics fail in production
  • Design a correct train/validation split for time-based data
  • Reason about feedback loops in deployed ML systems