Machine learning systems are trained on data, but they are deployed into reality. These two are never identical.
Formally, training assumes examples are drawn from a fixed distribution:
(x, y) ~ D
But in practice:
- data is collected by imperfect processes
- labels are delayed, noisy, or biased
- user behavior changes over time
- systems interact with the environment they predict
The central problem of ML engineering is not learning from data — it is learning from a distorted, delayed, and drifting shadow of reality.
Most ML theory assumes data is IID:
- Independent: samples do not influence each other
- Identically Distributed: all samples come from the same distribution
This assumption is mathematically convenient — and operationally fragile.
Examples of violations:
- Time series: yesterday affects today
- User data: users generate multiple correlated samples (see the sketch below)
- Recommendation systems: predictions affect future data
- Logging changes: distribution changes without warning
IID is a modeling assumption, not a fact. Systems must be designed expecting it to fail.
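As a concrete illustration, here is a minimal sketch on synthetic data (all names and numbers are invented): rows that share a user carry a common user-level effect, so a random split evaluates on users the model has already seen and looks optimistic compared with a split that keeps each user's rows on one side.

```python
# Minimal sketch: correlated samples from the same user break independence.
# Synthetic data; the user-id column acts as a feature the model can memorize.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_users, per_user = 300, 10
users = np.repeat(np.arange(n_users), per_user)
user_effect = rng.normal(size=n_users)                      # shared within each user
X = np.column_stack([rng.normal(size=len(users)), users])   # column 1 is a user-id proxy
y = 0.5 * X[:, 0] + user_effect[users] + 0.3 * rng.normal(size=len(users))

def fit_score(train_idx, test_idx):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return r2_score(y[test_idx], model.predict(X[test_idx]))

rand_tr, rand_te = train_test_split(np.arange(len(y)), test_size=0.3, random_state=0)
grp_tr, grp_te = next(GroupShuffleSplit(test_size=0.3, random_state=0).split(X, y, users))

print("random split R^2:", round(fit_score(rand_tr, rand_te), 3))  # optimistic
print("group split  R^2:", round(fit_score(grp_tr, grp_te), 3))    # closer to reality
```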
Let:
- D_train: distribution of training data
- D_serve: distribution seen in production
Generalization quietly assumes:
D_train ≈ D_serve
In real systems:
D_train ≠ D_serve
This mismatch is called the distribution gap.
Everything from sudden metric drops to silent model decay traces back to this inequality.
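One way to make the gap measurable is to compare the same feature under D_train and D_serve. The sketch below is a minimal example using a two-sample Kolmogorov-Smirnov test on synthetic stand-ins for the two distributions; the arrays and parameters are assumptions, not from any particular system.

```python
# Minimal sketch: quantify the distribution gap for one numeric feature.
# train_values and serve_values are hypothetical samples from D_train and D_serve.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_values = rng.normal(loc=0.0, scale=1.0, size=10_000)  # stand-in for D_train
serve_values = rng.normal(loc=0.3, scale=1.2, size=10_000)  # stand-in for D_serve

stat, p_value = ks_2samp(train_values, serve_values)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.2e}")
# A large statistic (tiny p-value) signals D_train != D_serve for this feature.
```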
Not all shifts are the same. Diagnosing the type matters.
a) Covariate shift
P_train(x) ≠ P_serve(x), while P(y | x) is unchanged.
Examples:
- new user demographics
- seasonal effects
- product UI changes
Often mitigated by:
- reweighting (see the sketch below)
- retraining
- feature normalization
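One common way to reweight under covariate shift is importance weighting via a domain classifier: train a model to distinguish training rows from unlabeled production rows, then weight each training example by the estimated ratio P_serve(x) / P_train(x). The sketch below assumes such unlabeled production features are available; all names and numbers are illustrative.

```python
# Minimal sketch: importance weights for covariate shift via a domain classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(5_000, 3))  # stand-in for training features
X_serve = rng.normal(0.4, 1.0, size=(5_000, 3))  # stand-in for production features

# Label each row by its origin and fit a classifier to tell the domains apart.
X_domain = np.vstack([X_train, X_serve])
y_domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_serve))])
domain_clf = LogisticRegression().fit(X_domain, y_domain)

p_serve = domain_clf.predict_proba(X_train)[:, 1]
weights = p_serve / (1.0 - p_serve)        # proportional to P_serve(x) / P_train(x)
weights *= len(weights) / weights.sum()    # normalize to mean 1

# weights can then be passed as sample_weight when retraining the task model.
```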
b) Label shift
P_train(y) ≠ P_serve(y), while P(x | y) is unchanged.
Examples:
- fraud rates increase during holidays
- rare events become more common
This breaks:
- calibrated probabilities
- threshold-based decisions
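If only the base rate changes and the model was reasonably calibrated on training data, predicted probabilities can be corrected on the odds scale before any threshold is applied. The sketch below shows this standard prior correction; the training and production positive rates are invented for illustration.

```python
# Minimal sketch: re-calibrate probabilities when the class prior changes.
import numpy as np

def adjust_for_label_shift(p, pi_train, pi_serve):
    """Rescale calibrated probabilities from prior pi_train to prior pi_serve."""
    odds = p / (1.0 - p)
    odds *= (pi_serve / (1.0 - pi_serve)) / (pi_train / (1.0 - pi_train))
    return odds / (1.0 + odds)

p_model = np.array([0.05, 0.20, 0.50, 0.90])
print(adjust_for_label_shift(p_model, pi_train=0.02, pi_serve=0.06))
# A threshold tuned at the old prior is no longer the right threshold at the new one.
```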
c) Concept drift
P_train(y | x) ≠ P_serve(y | x)
Examples:
- user intent changes
- adversaries adapt
- market conditions change
No static model can survive concept drift indefinitely.
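Because concept drift changes P(y | x) itself, it usually surfaces as a rising error rate once delayed labels arrive. The sketch below is a minimal rolling-error monitor; the window size, tolerance, and stream/hook names are assumptions, not part of any specific system.

```python
# Minimal sketch: flag likely concept drift from a rolling error rate.
from collections import deque

class RollingErrorMonitor:
    def __init__(self, baseline_error, window=500, tolerance=2.0):
        self.baseline = baseline_error      # error rate observed at training time
        self.tolerance = tolerance          # how far above baseline counts as drift
        self.errors = deque(maxlen=window)  # most recent labeled outcomes

    def update(self, y_true, y_pred):
        self.errors.append(int(y_true != y_pred))
        rate = sum(self.errors) / len(self.errors)
        return rate > self.tolerance * self.baseline  # True => likely drift

monitor = RollingErrorMonitor(baseline_error=0.08)
# for y_true, y_pred in labeled_production_stream:   # hypothetical stream
#     if monitor.update(y_true, y_pred):
#         trigger_retraining_review()                # hypothetical hook
```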
Offline evaluation assumes:
- test data ≈ future data
- labels are correct and timely
- no feedback loops exist
In production:
- labels arrive late or never
- predictions influence behavior
- metrics are delayed proxies
This is why:
- models pass validation and fail in prod
- A/B tests disagree with offline metrics
- “accuracy” improves while business metrics decline
Offline metrics estimate performance under assumptions that reality violates.
Random train-test splits implicitly assume:
- time does not matter
- future resembles the past
For temporal data:
- random splits leak future information
- validation becomes optimistic
- deployment performance collapses
Correct approach:
- time-based splits
- rolling windows
- backtesting
If time is involved, random splitting is suspicious.
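A minimal sketch of such a scheme uses scikit-learn's TimeSeriesSplit on time-ordered data: every fold trains only on the past and evaluates only on the future. The data and model below are synthetic placeholders.

```python
# Minimal sketch: rolling-window validation that respects time order.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))                                 # rows sorted by event time
y = (X[:, 0] + 0.5 * np.arange(2_000) / 2_000 > 0).astype(int)  # boundary drifts over time

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: train on rows 0..{train_idx[-1]}, future accuracy {acc:.3f}")
```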
Datasets are rarely random samples of reality.
Common sources:
- logging bias (only some events recorded)
- survivorship bias (failures disappear)
- self-selection (only certain users participate)
- human labeling bias
Once bias enters the dataset, training amplifies it.
This is not an ethical issue alone — it is a statistical inevitability.
When a model’s predictions affect the environment:
- recommendations influence clicks
- risk scores influence approvals
- alerts influence investigation rates
The data-generating process becomes:
D_{t+1} = f(D_t, model_t)
This violates stationarity and IID assumptions completely.
Result: models can reinforce their own mistakes.
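The toy simulation below makes this concrete: a greedy policy only logs outcomes for the items it chooses to show, so the training signal for everything else goes stale and the best item may never be discovered. All click rates and counts are invented for illustration.

```python
# Toy sketch: a model that shapes its own training data can lock in its mistakes.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.10, 0.12, 0.08, 0.15])  # ground-truth click rates (item 3 is best)
est_ctr = np.full(4, 0.10)                     # model's current estimates
counts = np.ones(4)

for _ in range(5_000):
    shown = int(np.argmax(est_ctr))            # greedy policy: D_{t+1} depends on model_t
    click = rng.random() < true_ctr[shown]
    counts[shown] += 1
    est_ctr[shown] += (click - est_ctr[shown]) / counts[shown]  # incremental mean update

print("estimated:", est_ctr.round(3))
print("true:     ", true_ctr)
# Items the model never shows keep stale estimates, so the best item stays hidden.
```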
When performance drops, ask:
- Did the data distribution change?
- Did the feature pipeline change?
- Did labeling change?
- Did thresholds change?
- Did the business objective change?
Do not immediately:
- blame the algorithm
- tune hyperparameters
- add model complexity
Most failures are data failures, not model failures.
Retraining assumes:
- fresh data reflects reality
- labels are correct
- drift is slow
But:
- retraining on biased data reinforces bias
- retraining too often increases variance
- retraining too late misses change
Retraining is a control mechanism, not a fix.
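One way to treat retraining as a control mechanism is to make the trigger explicit: retrain only when drift has been detected, enough fresh labels have accumulated, and a cooldown has passed. The sketch below is one such policy with illustrative thresholds, not a prescription.

```python
# Minimal sketch: retraining as an explicit control policy, not a reflex.
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    min_new_labels: int = 10_000   # don't retrain on too little fresh data
    cooldown_days: float = 7.0     # don't retrain so often that variance dominates

    def should_retrain(self, drift_detected: bool, new_labels: int,
                       days_since_last_retrain: float) -> bool:
        return (drift_detected
                and new_labels >= self.min_new_labels
                and days_since_last_retrain >= self.cooldown_days)

policy = RetrainPolicy()
print(policy.should_retrain(drift_detected=True, new_labels=25_000,
                            days_since_last_retrain=10.0))  # True
```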
This chapter directly motivates:
- validation strategies
- monitoring and drift detection
- shadow deployments
- canary releases
- human-in-the-loop systems
You cannot eliminate distribution shift. You can only detect it early and respond safely.
If Chapter 0.1 says:
“We optimize the wrong objective on purpose.”
Then Chapter 0.2 adds:
“And we optimize it on the wrong data, from the wrong time, about the wrong world.”
Machine learning works anyway — not because these problems don’t exist, but because systems are engineered with them in mind.
You should now be able to:
- Explain why IID is an assumption, not reality
- Identify covariate shift vs label shift vs concept drift
- Explain why offline metrics fail in production
- Design a correct train/validation split for time-based data
- Reason about feedback loops in deployed ML systems