Book 0 — Chapter 0.2

Data, Reality, and the Distribution Gap

Concept page: models train on data but deploy into reality. This chapter names the mismatch and gives a practical taxonomy for diagnosing distribution shift.

0.2.1 The uncomfortable truth: data is a proxy for reality

Machine learning systems are trained on data, but they are deployed into reality. These two are never identical.

Formally, training assumes examples are drawn from a fixed distribution:

(x, y) ~ D

But in practice:

  • data is collected by imperfect processes
  • labels are delayed, noisy, or biased
  • user behavior changes over time
  • systems interact with the environment they predict

Engineering truth

The central problem of ML engineering is not learning from data — it is learning from a distorted, delayed, and drifting shadow of reality.

0.2.2 The IID assumption (and why it almost always fails)

Most ML theory assumes data is IID:

  • Independent: samples do not influence each other
  • Identically Distributed: all samples come from the same distribution

This assumption is mathematically convenient — and operationally fragile.

Examples of violations:

  • Time series: yesterday affects today
  • User data: users generate multiple correlated samples
  • Recommendation systems: predictions affect future data
  • Logging changes: distribution changes without warning

Engineering takeaway

IID is a modeling assumption, not a fact. Systems must be designed expecting it to fail.
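
As a concrete look at the correlated-samples violation above, here is a minimal sketch, assuming scikit-learn and a hypothetical dataset where each user contributes many rows: a random split lets the same user land on both sides, so the test set is silently correlated with the training set, while a group-aware split keeps each user entirely on one side.

    import numpy as np
    from sklearn.model_selection import train_test_split, GroupShuffleSplit

    rng = np.random.default_rng(0)
    n = 1000
    user_id = rng.integers(0, 100, size=n)     # 100 users, ~10 rows each
    X = rng.normal(size=(n, 5))                # hypothetical features
    y = rng.integers(0, 2, size=n)             # hypothetical labels

    # Random split: the same users appear on both sides, so "held-out" rows
    # are correlated with training rows (an IID violation in disguise).
    X_tr, X_te, u_tr, u_te = train_test_split(X, user_id, test_size=0.2, random_state=0)
    print("users shared by random split:", len(set(u_tr) & set(u_te)))

    # Group-aware split: every user is entirely in train or entirely in test.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(gss.split(X, y, groups=user_id))
    print("users shared by group split:", len(set(user_id[train_idx]) & set(user_id[test_idx])))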

0.2.3 Training distribution ≠ serving distribution

Let:

  • D_train: distribution of training data
  • D_serve: distribution seen in production

Generalization quietly assumes:

D_train ≈ D_serve

In real systems:

D_train ≠ D_serve

This mismatch is called the distribution gap.

Everything from sudden metric drops to silent model decay traces back to this inequality.
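
One common way to put a number on the gap for a single feature (a widely used heuristic, not something this chapter prescribes) is the Population Stability Index between its training and serving distributions. A minimal sketch, assuming NumPy and synthetic data:

    import numpy as np

    def psi(train_values, serve_values, bins=10, eps=1e-6):
        """Population Stability Index between two samples of one feature.
        Common industry heuristic: < 0.1 stable, 0.1-0.25 moderate shift,
        > 0.25 significant shift. Treat these cutoffs as rules of thumb."""
        # Bin edges come from training data; the same edges score serving data.
        # Serving values outside the training range fall out of these bins;
        # a production version would add open-ended edge bins.
        edges = np.histogram_bin_edges(train_values, bins=bins)
        p = np.histogram(train_values, bins=edges)[0] / len(train_values)
        q = np.histogram(serve_values, bins=edges)[0] / len(serve_values)
        p, q = p + eps, q + eps          # avoid log(0) for empty bins
        return float(np.sum((p - q) * np.log(p / q)))

    rng = np.random.default_rng(0)
    train_x = rng.normal(0.0, 1.0, 10_000)   # one feature under D_train
    serve_x = rng.normal(0.5, 1.2, 10_000)   # the same feature under D_serve, drifted
    print(f"PSI = {psi(train_x, serve_x):.3f}")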

0.2.4 Types of distribution shift (critical taxonomy)

Not all shifts are the same. Diagnosing the type matters.

a) Covariate shift

P_train(x) ≠ P_serve(x), while P(y | x) is unchanged.

Examples:

  • new user demographics
  • seasonal effects
  • product UI changes

Often mitigated by:

  • reweighting
  • retraining
  • feature normalization
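
A minimal sketch of the reweighting option, assuming scikit-learn and synthetic features: train a classifier to tell training rows from serving rows, then turn its probabilities into importance weights w(x) ≈ P_serve(x) / P_train(x) that can be passed as sample weights when retraining the task model.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(0.0, 1.0, size=(5000, 3))   # hypothetical training features
    X_serve = rng.normal(0.4, 1.0, size=(5000, 3))   # serving features have shifted

    # Domain classifier: label 0 = training row, 1 = serving row.
    X_dom = np.vstack([X_train, X_serve])
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_serve))])
    clf = LogisticRegression(max_iter=1000).fit(X_dom, d)

    # Density-ratio estimate: P(serve | x) / P(train | x) tracks P_serve(x) / P_train(x)
    # up to a constant (the sample-size ratio), which normalization absorbs.
    p_serve = clf.predict_proba(X_train)[:, 1]
    w = p_serve / (1.0 - p_serve)
    w *= len(w) / w.sum()                    # make the weights average to 1
    print("importance weight range:", w.min().round(3), "to", w.max().round(3))
    # Pass w as sample_weight when refitting the task model on X_train.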

b) Label shift

P_train(y) ≠ P_serve(y), while P(x | y) is unchanged.

Examples:

  • fraud rates increase during holidays
  • rare events become more common

This breaks:

  • calibrated probabilities
  • threshold-based decisions
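
A worked sketch of the breakage and the standard fix when only P(y) changes: rescale each calibrated class probability by the ratio of new to old priors, then renormalize. The priors below are invented for illustration.

    import numpy as np

    def adjust_priors(probs, train_prior, serve_prior):
        """Re-calibrate class probabilities under label shift.
        probs: (n, k) outputs calibrated against the training prior."""
        ratio = np.asarray(serve_prior) / np.asarray(train_prior)
        adjusted = probs * ratio                         # p(y|x) * pi_serve(y) / pi_train(y)
        return adjusted / adjusted.sum(axis=1, keepdims=True)

    # Hypothetical fraud model: trained at a 1% fraud rate, holidays push it to 5%.
    probs = np.array([[0.90, 0.10]])                     # [not fraud, fraud] for one case
    print(adjust_priors(probs, train_prior=[0.99, 0.01], serve_prior=[0.95, 0.05]))
    # The fraud probability rises from 0.10 to roughly 0.37, so a threshold
    # tuned on training data now fires very differently.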

c) Concept drift

P_train(y | x) ≠ P_serve(y | x)

Examples:

  • user intent changes
  • adversaries adapt
  • market conditions change

Hardest form of drift

No static model can survive concept drift indefinitely.
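
Because concept drift changes P(y | x) itself, input monitoring alone cannot see it; the reliable signal is the model's own error once labels arrive. A minimal sketch of a rolling-window monitor (the window size and thresholds are placeholders, not recommendations):

    from collections import deque

    class RollingErrorMonitor:
        """Tracks the recent error rate and alarms when it degrades past a baseline."""

        def __init__(self, window=500, baseline_error=0.10, tolerance=0.05):
            self.errors = deque(maxlen=window)   # 1 = wrong, 0 = right
            self.baseline = baseline_error
            self.tolerance = tolerance

        def update(self, prediction, label):
            self.errors.append(int(prediction != label))
            return self.alarm()

        def alarm(self):
            if len(self.errors) < self.errors.maxlen:
                return False                     # not enough labeled traffic yet
            current = sum(self.errors) / len(self.errors)
            return current > self.baseline + self.tolerance

    # Hypothetical usage on a stream of delayed labels:
    # monitor = RollingErrorMonitor()
    # for prediction, label in labeled_stream:
    #     if monitor.update(prediction, label):
    #         ...  # page someone, roll back, or trigger retraining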

0.2.5 Why offline evaluation lies

Offline evaluation assumes:

  • test data ≈ future data
  • labels are correct and timely
  • no feedback loops exist

In production:

  • labels arrive late or never
  • predictions influence behavior
  • metrics are delayed proxies

This is why:

  • models pass validation and fail in prod
  • A/B tests disagree with offline metrics
  • “accuracy” improves while business metrics decline

Offline metrics estimate performance under assumptions that reality violates.

0.2.6 Temporal structure: why random splits fail

Random train-test splits implicitly assume:

  • time does not matter
  • future resembles the past

For temporal data:

  • random splits leak future information
  • validation becomes optimistic
  • deployment performance collapses

Correct approach:

  • time-based splits
  • rolling windows
  • backtesting

Exam signal

If time is involved, random splitting is suspicious.
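
A minimal sketch of the time-based approach, assuming scikit-learn: TimeSeriesSplit always validates on rows that come strictly after the training fold, which is the property backtesting formalizes.

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Hypothetical feature matrix whose rows are already sorted by time (oldest first).
    X = np.arange(1000).reshape(-1, 1).astype(float)

    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        # Every validation index is later than every training index: no future leakage.
        assert train_idx.max() < val_idx.min()
        print(f"fold {fold}: train through t={train_idx.max()}, "
              f"validate t={val_idx.min()}..{val_idx.max()}")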

0.2.7 Sampling bias: who gets into the dataset?

Datasets are rarely random samples of reality.

Common sources:

  • logging bias (only some events recorded)
  • survivorship bias (failures disappear)
  • self-selection (only certain users participate)
  • human labeling bias

Once bias enters the dataset, training amplifies it.

Not just ethics

This is not an ethical issue alone — it is a statistical inevitability.
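
A tiny simulation of the logging-bias case, with invented numbers: if only sessions that produced a click are ever logged, any rate estimated from the log is wrong, and collecting more logged data only makes the wrong estimate more precise.

    import numpy as np

    rng = np.random.default_rng(0)
    n_sessions = 1_000_000
    clicked = rng.random(n_sessions) < 0.03                 # true click rate: 3%
    satisfied = np.where(clicked,
                         rng.random(n_sessions) < 0.60,     # clickers: 60% satisfied
                         rng.random(n_sessions) < 0.20)     # non-clickers: 20% satisfied

    print("true satisfaction rate:  ", satisfied.mean().round(3))
    # Logging bias: only sessions with a click ever reach the dataset.
    print("rate in the logged data: ", satisfied[clicked].mean().round(3))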

0.2.8 Feedback loops: models change the data they learn from

When a model’s predictions affect the environment:

  • recommendations influence clicks
  • risk scores influence approvals
  • alerts influence investigation rates

The data-generating process becomes:

D_(t+1) = f(D_t, model_t)

This violates stationarity and IID assumptions completely.

Result: models can reinforce their own mistakes.
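
A toy simulation of D_(t+1) = f(D_t, model_t), with invented parameters: a two-item recommender that always shows its higher-scoring item, and whose next round of training data contains only feedback on whatever it chose to show.

    import numpy as np

    rng = np.random.default_rng(0)
    true_ctr = np.array([0.10, 0.12])      # item 1 is genuinely better
    clicks = np.array([1.0, 1.0])          # optimistic starting counts
    shows = np.array([1.0, 1.0])

    for t in range(2000):
        scores = clicks / shows            # model_t: empirical CTR estimate
        chosen = int(np.argmax(scores))    # policy: always show the top-scoring item
        shows[chosen] += 1
        clicks[chosen] += rng.random() < true_ctr[chosen]   # D_(t+1) depends on model_t

    print("estimated CTRs:      ", (clicks / shows).round(3))
    print("impressions per item:", shows.astype(int))
    # Whichever item looks better early receives nearly all impressions, so the
    # other item's estimate barely updates: the model's own choices decide what
    # data exists for the next round, and early mistakes can become self-fulfilling.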

0.2.9 Dataset shift vs model failure (diagnostic thinking)

When performance drops, ask:

  • Did the data distribution change?
  • Did the feature pipeline change?
  • Did labeling change?
  • Did thresholds change?
  • Did the business objective change?

Do not immediately:

  • blame the algorithm
  • tune hyperparameters
  • add model complexity

Rule of thumb

Most failures are data failures, not model failures.
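
A minimal first step in that spirit, assuming SciPy: before touching the model, compare each feature's recent serving values against a training-time reference with a two-sample test. The Kolmogorov-Smirnov test below is one common choice among several; the feature names and data are hypothetical.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    reference = {"age": rng.normal(40, 10, 5000), "session_len": rng.exponential(3.0, 5000)}
    recent    = {"age": rng.normal(46, 10, 5000), "session_len": rng.exponential(3.1, 5000)}

    for name in reference:
        stat, p_value = ks_2samp(reference[name], recent[name])
        flag = "SHIFT?" if p_value < 0.01 else "ok"
        print(f"{name:12s} KS={stat:.3f} p={p_value:.2e} {flag}")
    # A flagged feature points at a data or pipeline change to investigate
    # before reaching for hyperparameters or a bigger model.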

0.2.10 Why “retraining” is not a silver bullet

Retraining assumes:

  • fresh data reflects reality
  • labels are correct
  • drift is slow

But:

  • retraining on biased data reinforces bias
  • retraining too often increases variance
  • retraining too late misses change

Frame it correctly

Retraining is a control mechanism, not a fix.
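
A sketch of what treating retraining as a control mechanism can look like, with invented thresholds: retraining is triggered by evidence (degraded error or drifted inputs), gated on label quality, and skipped otherwise, rather than run blindly on a schedule.

    from dataclasses import dataclass

    @dataclass
    class RetrainSignals:
        drift_score: float        # e.g. PSI or a KS statistic on key features
        recent_error: float       # error on recently labeled traffic
        baseline_error: float     # error at the last deployment
        label_coverage: float     # fraction of recent traffic with trusted labels

    def retrain_decision(s: RetrainSignals) -> str:
        """Illustrative policy; every threshold here is a placeholder."""
        if s.label_coverage < 0.5:
            return "hold: not enough trusted labels to retrain safely"
        if s.recent_error > s.baseline_error * 1.2:
            return "retrain: performance has degraded"
        if s.drift_score > 0.25:
            return "retrain: inputs have shifted; refresh before errors surface"
        return "skip: no evidence retraining would help right now"

    print(retrain_decision(RetrainSignals(0.31, 0.11, 0.10, 0.90)))
    print(retrain_decision(RetrainSignals(0.05, 0.18, 0.10, 0.30)))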

0.2.11 Engineering consequences of the distribution gap

This chapter directly motivates:

  • validation strategies
  • monitoring and drift detection
  • shadow deployments
  • canary releases
  • human-in-the-loop systems

You cannot eliminate distribution shift. You can only detect it early and respond safely.
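
As one example from the list above, a minimal sketch of the shadow-deployment idea: the candidate model scores every request and its outputs are logged for offline comparison, but only the production model's answer is ever returned. The model objects and their predict method are placeholders.

    import logging

    logger = logging.getLogger("shadow")

    def serve(request_features, prod_model, shadow_model):
        """Return the production prediction; log the shadow prediction for comparison."""
        prod_pred = prod_model.predict(request_features)
        try:
            shadow_pred = shadow_model.predict(request_features)
            logger.info("shadow_compare prod=%s shadow=%s features=%s",
                        prod_pred, shadow_pred, request_features)
        except Exception:
            # The shadow path must never be allowed to break serving.
            logger.exception("shadow model failed")
        return prod_pred   # callers only ever see the production model's output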

0.2.12 Chapter takeaway

If Chapter 0.1 says:

“We optimize the wrong objective on purpose.”

Then Chapter 0.2 adds:

“And we optimize it on the wrong data, from the wrong time, about the wrong world.”

Machine learning works anyway — not because these problems don’t exist, but because systems are engineered with them in mind.

Readiness Check

You should now be able to:

  • Explain why IID is an assumption, not reality
  • Identify covariate shift vs label shift vs concept drift
  • Explain why offline metrics fail in production
  • Design a correct train/validation split for time-based data
  • Reason about feedback loops in deployed ML systems