Book 0 · Chapter 0.1

Learning as Expected Loss Minimization

Concept page: understand the expected-vs-empirical risk gap and why nearly every ML failure mode is a consequence of it.

Intuition

At its core, machine learning is the problem of choosing a function that minimizes expected error under uncertainty.

Strip away libraries, cloud services, and model names, and every supervised ML system can be described by four ingredients:

  1. Data-generating process (the real world)
  2. Model family (the functions you are allowed to choose from)
  3. Loss function (what “wrong” means)
  4. Optimization procedure (how you search for a good function)

If you understand how these four interact, you understand why models behave the way they do.
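
The decomposition maps directly onto code. Below is a minimal sketch in Python, assuming a synthetic data-generating process and a linear model family (every name here is illustrative, not a library API):

  import numpy as np

  rng = np.random.default_rng(0)

  # 1. Data-generating process: unknown in practice, synthetic here.
  def sample_data(n):
      x = rng.uniform(-1, 1, size=n)
      y = 2.0 * x + rng.normal(0, 0.3, size=n)  # true signal plus noise
      return x, y

  # 2. Model family: linear functions f_theta(x) = w*x + b.
  def predict(theta, x):
      w, b = theta
      return w * x + b

  # 3. Loss function: squared error is one possible definition of "wrong".
  def loss(y_hat, y):
      return np.mean((y_hat - y) ** 2)

  # 4. Optimization procedure: gradient descent on the observed sample.
  x, y = sample_data(100)
  theta = np.array([0.0, 0.0])
  for _ in range(500):
      err = predict(theta, x) - y
      grad = np.array([np.mean(2 * err * x), np.mean(2 * err)])
      theta = theta - 0.1 * grad

  print(theta)  # approaches (2.0, 0.0), the parameters of the true signal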

Formal Setup

0.1.2 Data comes from an unknown distribution

Assume there exists an unknown joint distribution:

(x, y) ~ D
  • x: input features (what you observe)
  • y: target/label (what you want to predict)
  • D: the real-world process that generates them

You never know D. You only ever see a finite sample drawn from it.

This single fact explains:

  • generalization error
  • overfitting
  • data drift
  • evaluation failure

ML engineering is largely about managing the consequences of not knowing D.
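
A quick way to internalize this: two datasets drawn from the same distribution disagree with each other and with the truth. A hypothetical sketch, where a Gaussian stands in for the unknown D:

  import numpy as np

  rng = np.random.default_rng(1)

  # Stand-in for the unknown D; in real problems you cannot write this down.
  def sample_D(n):
      return rng.normal(loc=5.0, scale=2.0, size=n)

  # Two finite samples from the SAME distribution disagree with each other
  # and with the true mean (5.0). Every training set is one such draw.
  print(sample_D(30).mean())  # e.g. ~4.8
  print(sample_D(30).mean())  # e.g. ~5.3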

0.1.3 Models are parameterized functions

You choose a family of functions:

f_θ: X → Y

Examples:

  • Linear regression: f_θ(x) = w^T x + b
  • Logistic regression: f_θ(x) = σ(w^T x + b)
  • Neural network: a composition of linear maps and nonlinearities

θ represents all trainable parameters.

Important

Training does not invent intelligence — it selects parameters. Your model family defines what patterns are even possible to learn.
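
In code, a model family is nothing more than a function with free parameters; training moves θ, never the shape of the family. A minimal sketch (names and numbers are illustrative):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # Linear regression family: theta = (w, b), outputs are real numbers.
  def linear(theta, x):
      w, b = theta
      return x @ w + b

  # Logistic regression family: same parameters, squashed to probabilities.
  def logistic(theta, x):
      return sigmoid(linear(theta, x))

  theta = (np.array([0.5, -0.2]), 0.1)  # one member of the family
  x = np.array([[1.0, 2.0], [0.0, 1.0]])
  print(linear(theta, x))    # [ 0.2 -0.1]
  print(logistic(theta, x))  # probabilities strictly inside (0, 1)

  # No choice of theta makes `linear` reproduce XOR: the family itself
  # rules some patterns out before training even starts.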

0.1.4 Loss functions define “error”

A loss function maps predictions and true labels to a real number:

L(ŷ, y) ∈ ℝ

Loss functions are not neutral. They encode assumptions.

  • Mean Squared Error → large errors dominate (penalized quadratically)
  • Absolute Error → robust to outliers (penalized linearly)
  • Cross-Entropy → confident wrong predictions are catastrophic
  • Hinge Loss → margin matters more than probability

Key insight

The loss function is your definition of success.

If your loss does not align with the real objective, training will optimize the wrong thing perfectly.
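
To make the contrast concrete, here is a hedged sketch that scores the same predictions under different losses (values are illustrative):

  import numpy as np

  def mse(y_hat, y):
      return np.mean((y_hat - y) ** 2)

  def mae(y_hat, y):
      return np.mean(np.abs(y_hat - y))

  y     = np.array([1.0, 1.0, 1.0, 1.0])
  y_hat = np.array([1.1, 0.9, 1.0, 5.0])  # one large outlier prediction

  print(mse(y_hat, y))  # 4.005: dominated by the single outlier
  print(mae(y_hat, y))  # 1.05:  the outlier counts only linearly

  # Cross-entropy: a confident wrong probability is near-catastrophic.
  def xent(p, y):
      return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

  print(xent(np.array([0.999]), np.array([0.0])))  # ~6.9 for one prediction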

0.1.5 Expected risk (the ideal objective)

If you knew the true distribution D, the ideal learning objective would be:

R(θ) = E_(x,y)~D [ L(f_θ(x), y) ]

This is called expected risk.

  • It measures average error in the real world
  • It is what you actually care about
  • It is uncomputable (because D is unknown)

This gap between what you want and what you can compute is the root of ML engineering.

0.1.6 Empirical risk (what training actually does)

Instead, you observe a dataset:

{ (x_1, y_1), …, (x_n, y_n) }

and minimize:

R̂(θ) = (1/n) Σ_(i=1..n) L(f_θ(x_i), y_i)

This is empirical risk minimization (ERM). Training = choosing parameters that minimize loss on observed samples.

Critical implication

You are optimizing performance on the past, hoping it generalizes to the future. Everything else in ML exists to control the damage caused by this hope.
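
The gap can be made visible with a synthetic D, where expected risk can be approximated by a huge fresh sample, which is exactly what real problems forbid. A sketch under that invented setup:

  import numpy as np

  rng = np.random.default_rng(2)

  def sample_D(n):
      x = rng.uniform(-1, 1, size=n)
      return x, np.sin(3 * x) + rng.normal(0, 0.2, size=n)

  # ERM: fit a degree-9 polynomial to a small observed sample.
  x_train, y_train = sample_D(15)
  coeffs = np.polyfit(x_train, y_train, deg=9)

  def risk(x, y):
      return np.mean((np.polyval(coeffs, x) - y) ** 2)

  # Empirical risk: what training sees and minimizes.
  print(risk(x_train, y_train))  # small: the fit explains the sample

  # Expected risk, approximated by a million fresh draws. Possible only
  # because D is synthetic here; with real data this number is unknowable.
  x_big, y_big = sample_D(1_000_000)
  print(risk(x_big, y_big))      # larger: the real-world average error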

0.1.7 Generalization: the central problem

The core ML question is not “How low can I make training loss?”

It is: “How close is empirical risk to expected risk?”

The difference between them is generalization error.

Generalization fails when:

  • the sample is not representative
  • the model is too flexible
  • the loss does not reflect reality
  • the data distribution changes

This is why validation sets, regularization, early stopping, and monitoring exist.
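
Validation sets are the practical workaround: points the optimizer never saw give an approximately unbiased estimate of expected risk for the fitted model. A minimal sketch (same invented setup as above):

  import numpy as np

  rng = np.random.default_rng(5)
  x = rng.uniform(-1, 1, 200)
  y = np.sin(3 * x) + rng.normal(0, 0.2, 200)

  # Hold out the last 50 points; the optimizer never touches them.
  c = np.polyfit(x[:150], y[:150], deg=9)

  val_risk = np.mean((np.polyval(c, x[150:]) - y[150:]) ** 2)
  # Because these points played no role in the fit, val_risk estimates
  # expected risk for this model without peeking at the future.
  print(val_risk)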

0.1.8 Why overfitting happens (without buzzwords)

Overfitting is not mysterious. It happens when:

  • your model family is rich enough to memorize noise
  • empirical risk keeps decreasing
  • expected risk starts increasing

ERM keeps finding parameters that explain quirks of the sample. Those quirks do not exist in D.

Overfitting is a statistical consequence, not a moral failure.
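
The mechanics are easy to reproduce. In the sketch below (same invented setup), raising the polynomial degree makes training loss fall monotonically while test loss eventually rises:

  import numpy as np

  rng = np.random.default_rng(3)
  x_train = rng.uniform(-1, 1, 20)
  y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 20)
  x_test = rng.uniform(-1, 1, 10_000)
  y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 10_000)

  for deg in (1, 3, 9, 15):
      c = np.polyfit(x_train, y_train, deg)
      train = np.mean((np.polyval(c, x_train) - y_train) ** 2)
      test = np.mean((np.polyval(c, x_test) - y_test) ** 2)
      # Train loss is non-increasing in degree (nested families); test
      # loss eventually rises as the extra flexibility fits sample noise.
      print(deg, round(train, 4), round(test, 4))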

0.1.9 Why “more data” often beats “better models”

Increasing data size n:

  • improves estimation of D
  • reduces variance of empirical risk
  • narrows the gap between empirical and expected loss

This is why simple models with massive data often outperform complex models with little data, and why production ML emphasizes data pipelines as much as modeling.
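
The same kind of sketch makes the point about n: hold the model family fixed and the empirical/expected gap shrinks as the sample grows (illustrative, not a benchmark):

  import numpy as np

  rng = np.random.default_rng(4)

  def sample_D(n):
      x = rng.uniform(-1, 1, size=n)
      return x, np.sin(3 * x) + rng.normal(0, 0.2, size=n)

  x_test, y_test = sample_D(100_000)

  for n in (20, 200, 2_000, 20_000):
      x_tr, y_tr = sample_D(n)
      c = np.polyfit(x_tr, y_tr, deg=9)  # same model family every time
      train = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
      test = np.mean((np.polyval(c, x_test) - y_test) ** 2)
      print(n, round(test - train, 4))   # the gap narrows as n grows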

0.1.10 The engineering interpretation

  • Training optimizes a proxy, not reality
  • Loss functions are design decisions
  • Metrics are estimates with uncertainty
  • Models fail when assumptions break
  • Distribution shift is inevitable
  • You do not “train once and deploy forever”

You continuously manage approximation error.

0.1.11 Common failure patterns (exam-grade intuition)

  • Training loss ↓, validation loss ↑ → overfitting
  • Offline metric ↑, business metric ↓ → loss/metric mismatch
  • Validation metric stable, production metric ↓ → distribution shift
  • Retraining produces unstable models → variance-dominated regime

All of these are consequences of expected vs empirical risk.

0.1.12 Chapter takeaway

If you remember only one thing

Machine learning is the art of minimizing the wrong objective in a controlled way.

Everything in Layer 0 exists to help you: understand how wrong it is, predict when it will break, and design systems that survive anyway.

Readiness Check

You should now be able to:

  • Explain the difference between expected and empirical risk
  • Explain why generalization is not guaranteed
  • Reason about overfitting without vague language
  • Justify why validation, regularization, and monitoring exist
  • Explain why “accuracy improving” can still mean “system getting worse”