Book 0 · Chapter 0.1

Learning as Expected Loss Minimization

Concept page: understand the expected-vs-empirical risk gap and why nearly every ML failure mode is a consequence of it.

Intuition

At its core, machine learning is the problem of choosing a function that minimizes expected error under uncertainty.

Strip away libraries, cloud services, and model names, and every supervised ML system can be described by four ingredients:

  1. Data-generating process (the real world)
  2. Model family (the functions you are allowed to choose from)
  3. Loss function (what “wrong” means)
  4. Optimization procedure (how you search for a good function)

If you understand how these four interact, you understand why models behave the way they do.
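
The decomposition maps directly onto code. Below is a minimal sketch in Python, assuming a synthetic data-generating process and a linear model family (every name here is illustrative, not a library API):

  import numpy as np

  rng = np.random.default_rng(0)

  # 1. Data-generating process: unknown in practice, synthetic here.
  def sample_data(n):
      x = rng.uniform(-1, 1, size=n)
      y = 2.0 * x + rng.normal(0, 0.3, size=n)  # true signal plus noise
      return x, y

  # 2. Model family: linear functions f_theta(x) = w*x + b.
  def predict(theta, x):
      w, b = theta
      return w * x + b

  # 3. Loss function: squared error is one possible definition of "wrong".
  def loss(y_hat, y):
      return np.mean((y_hat - y) ** 2)

  # 4. Optimization procedure: gradient descent on the observed sample.
  x, y = sample_data(100)
  theta = np.array([0.0, 0.0])
  for _ in range(500):
      err = predict(theta, x) - y
      grad = np.array([np.mean(2 * err * x), np.mean(2 * err)])
      theta = theta - 0.1 * grad

  print(theta)  # approaches (2.0, 0.0), the parameters of the true signal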

Formal Setup

0.1.2 Data comes from an unknown distribution

Assume there exists an unknown joint distribution:

(x, y) ~ D
  • x: input features (what you observe)
  • y: target/label (what you want to predict)
  • D: the real-world process that generates them

You never know D. You only ever see a finite sample drawn from it.

This single fact explains:

  • generalization error
  • overfitting
  • data drift
  • evaluation failure

ML engineering is largely about managing the consequences of not knowing D.
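
A quick way to internalize this: two datasets drawn from the same distribution disagree with each other and with the truth. A hypothetical sketch, where a Gaussian stands in for the unknown D:

  import numpy as np

  rng = np.random.default_rng(1)

  # Stand-in for the unknown D; in real problems you cannot write this down.
  def sample_D(n):
      return rng.normal(loc=5.0, scale=2.0, size=n)

  # Two finite samples from the SAME distribution disagree with each other
  # and with the true mean (5.0). Every training set is one such draw.
  print(sample_D(30).mean())  # e.g. ~4.8
  print(sample_D(30).mean())  # e.g. ~5.3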

0.1.3 Models are parameterized functions

You choose a family of functions:

f_θ: X → Y

Examples:

  • Linear regression: f_θ(x) = w^T x + b
  • Logistic regression: f_θ(x) = σ(w^T x + b)
  • Neural network: a composition of linear maps and nonlinearities

θ represents all trainable parameters.

Important

Training does not invent intelligence — it selects parameters. Your model family defines what patterns are even possible to learn.
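
In code, a model family is nothing more than a function with free parameters; training moves θ, never the shape of the family. A minimal sketch (names and numbers are illustrative):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # Linear regression family: theta = (w, b), outputs are real numbers.
  def linear(theta, x):
      w, b = theta
      return x @ w + b

  # Logistic regression family: same parameters, squashed to probabilities.
  def logistic(theta, x):
      return sigmoid(linear(theta, x))

  theta = (np.array([0.5, -0.2]), 0.1)  # one member of the family
  x = np.array([[1.0, 2.0], [0.0, 1.0]])
  print(linear(theta, x))    # [ 0.2 -0.1]
  print(logistic(theta, x))  # probabilities strictly inside (0, 1)

  # No choice of theta makes `linear` reproduce XOR: the family itself
  # rules some patterns out before training even starts.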

0.1.4 Loss functions define “error”

A loss function maps predictions and true labels to a real number:

L(ŷ, y) ∈ ℝ

Loss functions are not neutral. They encode assumptions.

  • Mean Squared Error → large errors dominate (penalized quadratically)
  • Absolute Error → robust to outliers (penalized linearly)
  • Cross-Entropy → confident wrong predictions are catastrophic
  • Hinge Loss → margin matters more than probability

Key insight

The loss function is your definition of success.

If your loss does not align with the real objective, training will optimize the wrong thing perfectly.
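
To make the contrast concrete, here is a hedged sketch that scores the same predictions under different losses (values are illustrative):

  import numpy as np

  def mse(y_hat, y):
      return np.mean((y_hat - y) ** 2)

  def mae(y_hat, y):
      return np.mean(np.abs(y_hat - y))

  y     = np.array([1.0, 1.0, 1.0, 1.0])
  y_hat = np.array([1.1, 0.9, 1.0, 5.0])  # one large outlier prediction

  print(mse(y_hat, y))  # 4.005: dominated by the single outlier
  print(mae(y_hat, y))  # 1.05:  the outlier counts only linearly

  # Cross-entropy: a confident wrong probability is near-catastrophic.
  def xent(p, y):
      return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

  print(xent(np.array([0.999]), np.array([0.0])))  # ~6.9 for one prediction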

0.1.5 Expected risk (the ideal objective)

If you knew the true distribution D, the ideal learning objective would be:

R(θ) = E_(x,y)~D [ L(f_θ(x), y) ]

This is called expected risk.

  • It measures average error in the real world
  • It is what you actually care about
  • It is uncomputable (because D is unknown)

This gap between what you want and what you can compute is the root of ML engineering.

0.1.6 Empirical risk (what training actually does)

Instead, you observe a dataset:

{ (x_1, y_1), …, (x_n, y_n) }

and minimize:

R̂(θ) = (1/n) Σ_(i=1..n) L(f_θ(x_i), y_i)

This is empirical risk minimization (ERM). Training = choosing parameters that minimize loss on observed samples.

Critical implication

You are optimizing performance on the past, hoping it generalizes to the future. Everything else in ML exists to control the damage caused by this hope.
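
The gap can be made visible with a synthetic D, where expected risk can be approximated by a huge fresh sample, which is exactly what real problems forbid. A sketch under that invented setup:

  import numpy as np

  rng = np.random.default_rng(2)

  def sample_D(n):
      x = rng.uniform(-1, 1, size=n)
      return x, np.sin(3 * x) + rng.normal(0, 0.2, size=n)

  # ERM: fit a degree-9 polynomial to a small observed sample.
  x_train, y_train = sample_D(15)
  coeffs = np.polyfit(x_train, y_train, deg=9)

  def risk(x, y):
      return np.mean((np.polyval(coeffs, x) - y) ** 2)

  # Empirical risk: what training sees and minimizes.
  print(risk(x_train, y_train))  # small: the fit explains the sample

  # Expected risk, approximated by a million fresh draws. Possible only
  # because D is synthetic here; with real data this number is unknowable.
  x_big, y_big = sample_D(1_000_000)
  print(risk(x_big, y_big))      # larger: the real-world average error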

0.1.7 Generalization: the central problem

The core ML question is not “How low can I make training loss?”

It is: “How close is empirical risk to expected risk?”

The difference between them is generalization error.

Generalization fails when:

  • the sample is not representative
  • the model is too flexible
  • the loss does not reflect reality
  • the data distribution changes

This is why validation sets, regularization, early stopping, and monitoring exist.
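
Validation sets are the practical workaround: points the optimizer never saw give an approximately unbiased estimate of expected risk for the fitted model. A minimal sketch (same invented setup as above):

  import numpy as np

  rng = np.random.default_rng(5)
  x = rng.uniform(-1, 1, 200)
  y = np.sin(3 * x) + rng.normal(0, 0.2, 200)

  # Hold out the last 50 points; the optimizer never touches them.
  c = np.polyfit(x[:150], y[:150], deg=9)

  val_risk = np.mean((np.polyval(c, x[150:]) - y[150:]) ** 2)
  # Because these points played no role in the fit, val_risk estimates
  # expected risk for this model without peeking at the future.
  print(val_risk)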

0.1.8 Why overfitting happens (without buzzwords)

Overfitting is not mysterious. It happens when:

  • your model family is rich enough to memorize noise
  • empirical risk keeps decreasing
  • expected risk starts increasing

ERM keeps finding parameters that explain quirks of the sample. Those quirks do not exist in D.

Overfitting is a statistical consequence, not a moral failure.
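
The mechanics are easy to reproduce. In the sketch below (same invented setup), raising the polynomial degree makes training loss fall monotonically while test loss eventually rises:

  import numpy as np

  rng = np.random.default_rng(3)
  x_train = rng.uniform(-1, 1, 20)
  y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 20)
  x_test = rng.uniform(-1, 1, 10_000)
  y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 10_000)

  for deg in (1, 3, 9, 15):
      c = np.polyfit(x_train, y_train, deg)
      train = np.mean((np.polyval(c, x_train) - y_train) ** 2)
      test = np.mean((np.polyval(c, x_test) - y_test) ** 2)
      # Train loss is non-increasing in degree (nested families); test
      # loss eventually rises as the extra flexibility fits sample noise.
      print(deg, round(train, 4), round(test, 4))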

0.1.9 Why “more data” often beats “better models”

Increasing data size n:

  • improves estimation of D
  • reduces variance of empirical risk
  • narrows the gap between empirical and expected loss

This is why simple models with massive data often outperform complex models with little data, and why production ML emphasizes data pipelines as much as modeling.
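
The same kind of sketch makes the point about n: hold the model family fixed and the empirical/expected gap shrinks as the sample grows (illustrative, not a benchmark):

  import numpy as np

  rng = np.random.default_rng(4)

  def sample_D(n):
      x = rng.uniform(-1, 1, size=n)
      return x, np.sin(3 * x) + rng.normal(0, 0.2, size=n)

  x_test, y_test = sample_D(100_000)

  for n in (20, 200, 2_000, 20_000):
      x_tr, y_tr = sample_D(n)
      c = np.polyfit(x_tr, y_tr, deg=9)  # same model family every time
      train = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
      test = np.mean((np.polyval(c, x_test) - y_test) ** 2)
      print(n, round(test - train, 4))   # the gap narrows as n grows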

0.1.10 The engineering interpretation

  • Training optimizes a proxy, not reality
  • Loss functions are design decisions
  • Metrics are estimates with uncertainty
  • Models fail when assumptions break
  • Distribution shift is inevitable
  • You do not “train once and deploy forever”

You continuously manage approximation error.

0.1.11 Common failure patterns (exam-grade intuition)

  • Training loss ↓, validation loss ↑ → overfitting
  • Offline metric ↑, business metric ↓ → loss/metric mismatch
  • Validation metric stable, production metric ↓ → distribution shift
  • Retraining produces unstable models → variance-dominated regime

All of these are consequences of expected vs empirical risk.

0.1.12 Chapter takeaway

If you remember only one thing

Machine learning is the art of minimizing the wrong objective in a controlled way.

Everything in Layer 0 exists to help you: understand how wrong it is, predict when it will break, and design systems that survive anyway.

Readiness Check

You should now be able to:

  • Explain the difference between expected and empirical risk
  • Explain why generalization is not guaranteed
  • Reason about overfitting without vague language
  • Justify why validation, regularization, and monitoring exist
  • Explain why “accuracy improving” can still mean “system getting worse”