In Chapter 0.4, we established that one data point is a vector — a point in feature space. A dataset is simply many such points stacked together.
Formally, given:
- n examples
- d features
we represent the dataset as a matrix:
X ∈ R^{n×d}
- each row = one example
- each column = one feature
This matrix representation is called the design matrix.
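For concreteness, here is a minimal NumPy sketch of a design matrix; the values are arbitrary toy numbers, not from any real dataset:

```python
import numpy as np

# A toy design matrix: n = 4 examples, d = 3 features (values are arbitrary).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

n, d = X.shape
print(n, d)        # 4 3
print(X[0])        # one example: a row, a vector in R^d
print(X[:, 1])     # one feature: a column, one value per example
```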
Once you see datasets as matrices, most of machine learning becomes linear algebra plus statistics.
The matrix view is not just notation. It determines:
- how training scales
- how memory is used
- how fast models run
- which algorithms are feasible
Modern ML systems are essentially pipelines for building, transforming, and multiplying very large matrices.
If you misunderstand this layer, scaling ML systems becomes guesswork.
Each row of X is assumed to be:
- a draw from some underlying distribution
- comparable to other rows
- interchangeable (under IID assumptions)
Statistical concepts apply across rows:
- mean feature values
- variance estimates
- sampling error
- train/test splits
When rows are not interchangeable (time series, user sessions), naive matrix assumptions break — leading to evaluation errors.
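A small sketch of row-wise operations on synthetic data; the random split below is only valid because these rows really are generated IID:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # synthetic dataset: 100 rows, 5 features

# Statistics computed *across rows* (axis=0): one value per feature.
feature_means = X.mean(axis=0)
feature_vars = X.var(axis=0)

# A random train/test split only makes sense if rows are interchangeable.
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:80], perm[80:]
X_train, X_test = X[train_idx], X[test_idx]

# For time series or user sessions, split by time or by user instead,
# because adjacent rows are correlated and not IID.
```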
Each column represents:
- one measurable property
- one axis in feature space
- one degree of freedom for the model
Operations across columns include:
- normalization
- feature selection
- dimensionality reduction
- regularization
Models learn weights per column. Bad columns create bad models.
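A short column-wise sketch with synthetic data; the variance threshold is just an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[:, 3] = 0.0                       # a useless, constant column

# Column-wise standardization: each feature gets its own mean and std.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
sigma[sigma == 0] = 1.0             # avoid division by zero for constant columns
X_std = (X - mu) / sigma

# Crude feature selection: drop columns with (near-)zero variance,
# since the model cannot learn anything from them.
keep = X.var(axis=0) > 1e-8
X_reduced = X[:, keep]
print(X.shape, "->", X_reduced.shape)   # (50, 4) -> (50, 3)
```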
Targets are also represented algebraically.
For regression:
y ∈ R^n
For classification:
- binary: y ∈ {0,1}^n
- multi-class: Y ∈ R^{n×k} (one-hot or probabilities)
This alignment matters:
- every row of X must align with exactly one label
- misalignment silently corrupts training
Many production bugs are indexing bugs, not ML bugs.
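One cheap defense is to check shapes before training ever starts. A minimal sketch, using a hypothetical `check_alignment` helper:

```python
import numpy as np

def check_alignment(X, y):
    """Cheap shape check that catches many silent indexing bugs early."""
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{X.shape[0]} rows in X but {y.shape[0]} labels in y")

X = np.zeros((100, 8))
y = np.zeros(99)              # off-by-one: one label was dropped upstream

try:
    check_alignment(X, y)     # fail loudly here, instead of training on shifted labels
except ValueError as e:
    print("alignment bug:", e)
```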
Linear regression can be written compactly as:
ŷ = Xw + b
or, with bias absorbed:
ŷ = Xθ
Loss becomes:
L = ||Xθ − y||^2
This formulation explains:
- why training is matrix multiplication
- why GPUs help
- why batch size matters
Neural networks generalize this idea: repeated matrix multiplications plus nonlinearities.
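As an illustration of the matrix view, here is a minimal prediction, loss, and gradient step for linear regression in NumPy; the data, true coefficients, and learning rate are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=n)

# Absorb the bias by appending a constant column of ones.
X_aug = np.hstack([X, np.ones((n, 1))])

# Prediction and squared-error loss are pure matrix/vector operations.
theta = np.zeros(X_aug.shape[1])
y_hat = X_aug @ theta
loss = np.sum((y_hat - y) ** 2)

# One gradient step: the gradient of ||X theta - y||^2 is 2 X^T (X theta - y).
grad = 2 * X_aug.T @ (X_aug @ theta - y)
theta -= 1e-4 * grad
```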
Not all datasets are dense.
Examples of sparse data:
- one-hot encoded categorical features
- text bag-of-words
- recommender interaction matrices
Sparse matrices:
- store only non-zero values
- enable massive dimensionality
- change algorithm choice
Some algorithms scale with:
- number of rows × columns (dense)
- number of non-zero entries (sparse)
Confusing these leads to catastrophic performance decisions.
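A sketch using `scipy.sparse` with made-up dimensions, to show the gap between dense and sparse storage:

```python
import numpy as np
from scipy import sparse

# A hypothetical one-hot-style matrix: 10,000 rows, 50,000 columns, ~5 non-zeros per row.
# Dense float64 storage would need 10_000 * 50_000 * 8 bytes ≈ 4 GB;
# sparse formats store only the non-zero entries.
X_sparse = sparse.random(10_000, 50_000, density=1e-4, format="csr", random_state=3)

print(X_sparse.nnz)                      # number of stored non-zero entries (~50,000)
print(X_sparse.data.nbytes / 1e6, "MB")  # memory for the stored values alone

# Algorithms that scale with nnz (e.g. sparse mat-vec) stay cheap;
# densifying with X_sparse.toarray() would blow up memory.
w = np.zeros(X_sparse.shape[1])
y_hat = X_sparse @ w                     # sparse matrix-vector product
```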
Matrices are stored in memory in specific layouts:
- row-major
- column-major
This affects:
- cache efficiency
- training speed
- data pipeline design
Batching, shuffling, and streaming are matrix operations, not conceptual conveniences.
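A small NumPy sketch of layout; `order="C"` is row-major and `np.asfortranarray` produces a column-major copy:

```python
import numpy as np

X_rows = np.zeros((10_000, 1_000), order="C")  # row-major: each row is contiguous
X_cols = np.asfortranarray(X_rows)             # column-major: each column is contiguous

# Iterating over rows is cache-friendly on a row-major array,
# and iterating over columns is cache-friendly on a column-major one.
row_sums = X_rows.sum(axis=1)
col_means = X_cols.mean(axis=0)

# Mini-batching is just row slicing of the design matrix.
batch = X_rows[0:256]            # a view, no copy, because rows are contiguous in C order
```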
Operations like:
- standardization
- normalization
- log transforms
are applied column-wise:
X_{:,j}' = f(X_{:,j})
This reinforces:
- features are global axes
- transformations must be consistent between training and serving
Mismatch here causes training–serving skew.
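A minimal sketch of keeping the transform consistent: fit the column-wise statistics on training data, store them, and reapply exactly the same values at serving time (numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(loc=10.0, scale=2.0, size=(1_000, 3))

# Fit the column-wise transform on training data only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

def standardize(X, mu, sigma):
    # Apply the *stored* training statistics; never refit on serving data.
    return (X - mu) / sigma

X_train_std = standardize(X_train, mu, sigma)

# ...and reuse the exact same mu/sigma for every request at serving time.
x_request = rng.normal(loc=10.0, scale=2.0, size=(1, 3))
x_request_std = standardize(x_request, mu, sigma)
```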
Seeing datasets as matrices explains:
- why adding features increases compute
- why wide datasets overfit easily
- why dimensionality reduction helps
- why shuffling rows matters
- why distributed training partitions by rows
Almost every scalability trade-off reduces to matrix size and structure.
The matrix abstraction assumes:
- fixed feature set
- aligned rows
- homogeneous examples
It struggles with:
- variable-length sequences
- graphs
- ragged data
- multimodal inputs
Modern systems handle this by:
- padding
- masking
- multiple matrices
- specialized data structures
Understanding the limits of the abstraction is as important as using it.
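A padding-plus-masking sketch for ragged sequences; the sequence values are arbitrary:

```python
import numpy as np

# Three variable-length sequences (ragged data) forced into one matrix by padding.
sequences = [
    [3.0, 1.0, 4.0],
    [1.0, 5.0],
    [9.0, 2.0, 6.0, 5.0],
]

max_len = max(len(s) for s in sequences)
X = np.zeros((len(sequences), max_len))                 # padded matrix, shape (3, 4)
mask = np.zeros((len(sequences), max_len), dtype=bool)  # True where a real value exists

for i, seq in enumerate(sequences):
    X[i, :len(seq)] = seq
    mask[i, :len(seq)] = True

# Downstream code must use the mask so padding does not distort statistics.
true_means = (X * mask).sum(axis=1) / mask.sum(axis=1)
```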
A professional ML engineer always asks:
- How many rows?
- How many columns?
- How sparse?
- How fast does it grow?
- How often does it change?
These questions predict:
- cost
- latency
- failure modes
Before choosing a model, understand the matrix.
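A back-of-envelope sizing helper; the row and column counts below are hypothetical:

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value=4):
    """Rough memory footprint of a dense float32 design matrix, in GB."""
    return n_rows * n_cols * bytes_per_value / 1e9

# Hypothetical numbers, purely for illustration.
print(dense_matrix_gb(10_000_000, 100))     # ~4 GB: fits on one machine
print(dense_matrix_gb(10_000_000, 50_000))  # ~2000 GB: needs sparsity or sharding
```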
A dataset is not “a bunch of examples.” It is a structured mathematical object with shape, sparsity, and semantics.
Most ML engineering decisions are consequences of that structure.
You should now be able to:
- represent datasets as matrices formally
- explain rows vs columns statistically and semantically
- reason about dense vs sparse data
- understand why matrix size drives scalability
- diagnose data alignment bugs conceptually