A common but dangerous belief is: “The model understands the data.”
It does not.
A trained model is the result of an optimization process, not a thinking entity. It does not reason, infer intent, or grasp meaning. It adjusts parameters to reduce numerical error on observed examples.
Understanding this distinction is essential for:
- diagnosing failures
- designing safe systems
- resisting anthropomorphic explanations
Training is the process of solving an optimization problem:
θ_(t+1) = θ_t − η ∇_θ L(θ_t)
Where:
- θ = model parameters
- L = loss function
- ∇_θ L = gradient of the loss (direction of steepest increase; the update steps in the opposite direction)
- η = learning rate
At each step, the model:
- Computes predictions
- Computes loss
- Computes gradients
- Updates parameters slightly
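A minimal sketch of this loop, assuming a toy linear regression on synthetic data (all values here are illustrative, not part of any real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 plus noise (purely illustrative).
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=100)

theta = np.zeros(2)   # parameters: (weight, bias)
eta = 0.1             # learning rate

for step in range(200):
    # 1. Compute predictions.
    preds = X[:, 0] * theta[0] + theta[1]
    # 2. Compute loss (mean squared error).
    errors = preds - y
    loss = np.mean(errors ** 2)
    # 3. Compute gradients of the loss with respect to theta.
    grad_w = 2 * np.mean(errors * X[:, 0])
    grad_b = 2 * np.mean(errors)
    # 4. Update parameters slightly: theta <- theta - eta * grad.
    theta -= eta * np.array([grad_w, grad_b])

print("theta:", theta)  # approaches [3.0, 1.0]; at no point did anything "understand" the data
```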
Nothing in this loop involves:
- semantic understanding
- causal reasoning
- goal awareness
Training is numerical hill-descending on a proxy objective.
The optimizer does not know:
- what a “user” is
- what “fraud” means
- what a “cat” looks like
It only knows:
- numbers
- gradients
- how changing parameters affects loss
If two very different parameter settings produce similar loss, the optimizer is indifferent between them.
Multiple internal representations can yield identical performance — interpretability is not guaranteed.
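One concrete illustration, using a toy ReLU network with arbitrary random weights (an assumed example, not a trained model): rescaling one layer up and the next layer down produces a very different point in parameter space that computes exactly the same function, so the loss cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = rng.normal(size=64)

# A tiny one-hidden-layer ReLU network with illustrative random weights.
W1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=4)

def loss(W1, w2):
    hidden = np.maximum(X @ W1, 0.0)   # ReLU activation
    preds = hidden @ w2
    return np.mean((preds - y) ** 2)

# (W1 * 10, w2 / 10) is a very different parameter vector that computes
# exactly the same function, hence exactly the same loss.
print(loss(W1, w2), loss(10.0 * W1, w2 / 10.0))
```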
Because training minimizes loss on available data, models exploit any statistical shortcut that reduces loss, even if it is:
- spurious
- non-causal
- unethical
- unstable in production
Examples:
- background pixels instead of objects
- proxy variables for protected attributes
- artifacts introduced by data collection
The optimizer is doing its job correctly. The failure is in problem formulation, not training.
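A hedged synthetic sketch of shortcut learning: a "background" feature is perfectly correlated with the label during training, so a simple logistic model leans on it; when that correlation breaks at test time, accuracy typically collapses. Feature names and values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# "Object" feature: noisy but causal. "Background" feature: a perfect shortcut in training.
labels = rng.integers(0, 2, size=n)
object_feat = labels + 0.8 * rng.normal(size=n)
background_feat = labels * 2.0 - 1.0
X_train = np.column_stack([object_feat, background_feat])

# Train a logistic model with plain gradient descent.
w = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-X_train @ w))
    w -= 0.5 * X_train.T @ (p - labels) / n

print("weights:", w)  # most of the weight lands on the shortcut feature

# At test time the background no longer correlates with the label.
test_labels = rng.integers(0, 2, size=n)
test_object = test_labels + 0.8 * rng.normal(size=n)
test_background = rng.choice([-1.0, 1.0], size=n)   # correlation broken
X_test = np.column_stack([test_object, test_background])

preds = (X_test @ w) > 0
print("test accuracy:", np.mean(preds == test_labels))  # typically far below training accuracy
```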
The loss function defines a surface over parameter space:
- each point = one model
- height = loss value
- training follows gradients downhill
Key properties:
- Linear models → convex landscapes (a single global minimum, or a connected flat set of them)
- Neural networks → non-convex landscapes (many minima, saddle points)
Training instability is expected, not exceptional.
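A small numerical check of the non-convex case, using a hand-picked two-unit tanh network as an assumed example: two permuted copies of the same solution both reach zero loss, but the midpoint between them does not, which a convex landscape would forbid.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))

# "Teacher" parameters, hand-picked for illustration.
W_teacher = np.array([[3.0, 0.0],
                      [0.0, 3.0]])
y = np.tanh(X @ W_teacher).sum(axis=1)

def loss(W):
    preds = np.tanh(X @ W).sum(axis=1)
    return np.mean((preds - y) ** 2)

W_a = W_teacher            # one zero-loss solution
W_b = W_teacher[:, ::-1]   # its hidden-unit permutation: also zero loss
W_mid = 0.5 * (W_a + W_b)  # midpoint in parameter space

# A convex loss would satisfy loss(midpoint) <= average of the endpoint losses.
print(loss(W_a), loss(W_b), loss(W_mid))  # 0.0, 0.0, clearly positive -> non-convex
```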
In high-dimensional spaces:
- most “bad” points are saddle points, not local minima
- flat regions slow training
- sharp minima can generalize poorly
Modern training succeeds because:
- stochastic gradients add noise
- large models create many “good enough” minima
- exact optimality is unnecessary
This explains why:
- retraining produces different models
- identical pipelines yield different weights
- ensemble methods improve stability
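A sketch of the retraining effect under stated assumptions (same synthetic data, same tiny tanh network, different random initializations): the two runs end up with clearly different weights yet comparably small losses.

```python
import numpy as np

def train(seed, steps=2000, lr=0.1):
    data_rng = np.random.default_rng(0)        # identical data on every run
    X = data_rng.normal(size=(128, 2))
    y = np.tanh(X @ np.array([1.5, -2.0]))     # fixed target function

    rng = np.random.default_rng(seed)          # different seed -> different starting point
    W1 = 0.5 * rng.normal(size=(2, 4))
    w2 = 0.5 * rng.normal(size=4)

    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ w2 - y
        # Gradients of 0.5 * mean squared error, backpropagated by hand.
        grad_w2 = h.T @ err / len(X)
        grad_W1 = X.T @ (np.outer(err, w2) * (1 - h ** 2)) / len(X)
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1

    final_loss = np.mean((np.tanh(X @ W1) @ w2 - y) ** 2)
    return W1, w2, final_loss

W1_a, w2_a, loss_a = train(seed=1)
W1_b, w2_b, loss_b = train(seed=2)

print(loss_a, loss_b)            # comparably small losses...
print(np.allclose(W1_a, W1_b))   # ...reached by clearly different weights (False)
```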
Hyperparameters do not change what the model can represent — they change how optimization behaves.
Examples:
- learning rate → step size
- batch size → gradient noise
- weight decay → smoothness preference
- initialization → starting point in parameter space
Tuning hyperparameters is shaping the optimization process, not “making the model smarter.”
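To make that concrete, here is a minimal SGD sketch in which each hyperparameter appears only inside the optimization loop, never in what the linear model can represent (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

# Hyperparameters: they shape the search through parameter space,
# not the set of functions a linear model can express.
lr = 0.05            # learning rate   -> step size
batch_size = 32      # batch size      -> gradient noise
weight_decay = 1e-3  # weight decay    -> preference for small weights
theta = np.zeros(5)  # initialization  -> starting point in parameter space

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # mini-batch sampling
    err = X[idx] @ theta - y[idx]
    grad = 2 * X[idx].T @ err / batch_size + 2 * weight_decay * theta
    theta -= lr * grad

print(theta)
```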
More compute means:
- more steps
- larger models
- faster convergence
It does not mean:
- better alignment with reality
- better causal reasoning
- immunity to bias
If the loss is misaligned, more compute optimizes the wrong objective faster.
Optimization only cares about training loss. Generalization happens when:
- the data is representative
- the model is appropriately constrained
- the loss encodes useful inductive bias
This is why:
- regularization works
- simpler models sometimes outperform complex ones
- training accuracy is a poor success signal
Generalization is an emergent property, not a goal of training.
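A hedged illustration of the regularization point, with invented data and an illustrative polynomial degree and penalty: the unconstrained fit drives training loss lower, but the constrained (ridge) fit usually does better on held-out data.

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)

def features(x, degree=10):
    return np.vander(x, degree + 1, increasing=True)   # polynomial features

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)
Phi_train, Phi_test = features(x_train), features(x_test)

for ridge in [0.0, 1e-2]:
    # Ridge regression solved as an augmented least-squares problem.
    A = np.vstack([Phi_train, np.sqrt(ridge) * np.eye(Phi_train.shape[1])])
    b = np.concatenate([y_train, np.zeros(Phi_train.shape[1])])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    train_mse = np.mean((Phi_train @ w - y_train) ** 2)
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    # The regularized fit typically has higher training error but lower test error.
    print(f"ridge={ridge}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```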
When people say: “The model learned feature X”
What they really mean is:
Some internal parameters correlate with X under the training distribution.
This distinction matters when:
- distributions shift
- proxies break
- feedback loops emerge
Models do not “know” rules — they encode statistical regularities.
Seen through the optimization lens, common failures become obvious:
- Overfitting → optimizer fits noise because it reduces loss
- Shortcut learning → shortcuts reduce loss faster
- Adversarial examples → optimizer never saw those regions
- Mode collapse → loss tolerates limited diversity
- Training instability → poorly conditioned optimization
These are not bugs — they are consequences.
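A tiny deterministic sketch of the last point, poorly conditioned optimization, on an assumed toy quadratic: when one direction is far steeper than another, a step size that is fine for the gentle direction can push the steep one past its stability limit.

```python
import numpy as np

# A 2-D quadratic loss with very different curvature per direction
# (condition number 100): L(theta) = 0.5 * (1 * x^2 + 100 * y^2).
curvatures = np.array([1.0, 100.0])

def run_gd(lr, steps=100):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvatures * theta          # gradient of the quadratic
        theta = theta - lr * grad
    return 0.5 * np.sum(curvatures * theta ** 2)

# Stability requires lr < 2 / (max curvature) = 0.02 for this problem.
print(run_gd(lr=0.019))   # shrinks toward 0
print(run_gd(lr=0.021))   # oscillates and blows up along the stiff direction
```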
A mature ML engineer does not ask:
“Why didn’t the model understand?”
They ask:
“What objective did we optimize, on what data, under what constraints?”
This mindset leads to:
- better problem formulation
- safer deployment
- faster debugging
- fewer surprises in production
If Chapter 0.1 taught:
“We minimize expected loss.”
And Chapter 0.2 taught:
“On data that never matches reality.”
Then Chapter 0.3 completes the picture:
“Using blind numerical optimization.”
Machine learning works not because models are intelligent, but because optimization plus data plus constraints can approximate useful functions.
You should now be able to:
- Explain why training is optimization, not reasoning
- Explain why shortcut learning is expected
- Reason about training instability without mysticism
- Explain why hyperparameters affect optimization dynamics
- Resist anthropomorphic explanations of model behavior