In Chapter 0.4, we established that one data point is a vector — a point in feature space. A dataset is simply many such points stacked together.
Formally, given:
- n examples
- d features
we represent the dataset as a matrix:
X ∈ R^{n×d}
- each row = one example
- each column = one feature
This matrix representation is called the design matrix.
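For concreteness, here is a minimal NumPy sketch of a design matrix; the values are arbitrary toy numbers, not from any real dataset:

```python
import numpy as np

# A toy design matrix: n = 4 examples, d = 3 features (values are arbitrary).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

n, d = X.shape
print(n, d)        # 4 3
print(X[0])        # one example: a row, a vector in R^d
print(X[:, 1])     # one feature: a column, one value per example
```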
Once you see datasets as matrices, most of machine learning becomes linear algebra plus statistics.
The matrix view is not just notation. It determines:
- how training scales
- how memory is used
- how fast models run
- which algorithms are feasible
Modern ML systems are essentially pipelines for building, transforming, and multiplying very large matrices.
If you misunderstand this layer, scaling ML systems becomes guesswork.
Each row of X is assumed to be:
- a draw from some underlying distribution
- comparable to other rows
- interchangeable (under IID assumptions)
Statistical concepts apply across rows:
- mean feature values
- variance estimates
- sampling error
- train/test splits
When rows are not interchangeable (time series, user sessions), naive matrix assumptions break — leading to evaluation errors.
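A small sketch of row-wise operations on synthetic data; the random split below is only valid because these rows really are generated IID:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # synthetic dataset: 100 rows, 5 features

# Statistics computed *across rows* (axis=0): one value per feature.
feature_means = X.mean(axis=0)
feature_vars = X.var(axis=0)

# A random train/test split only makes sense if rows are interchangeable.
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:80], perm[80:]
X_train, X_test = X[train_idx], X[test_idx]

# For time series or user sessions, split by time or by user instead,
# because adjacent rows are correlated and not IID.
```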
Each column represents:
- one measurable property
- one axis in feature space
- one degree of freedom for the model
Operations across columns include:
- normalization
- feature selection
- dimensionality reduction
- regularization
Models learn weights per column. Bad columns create bad models.
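A short column-wise sketch with synthetic data; the variance threshold is just an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[:, 3] = 0.0                       # a useless, constant column

# Column-wise standardization: each feature gets its own mean and std.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
sigma[sigma == 0] = 1.0             # avoid division by zero for constant columns
X_std = (X - mu) / sigma

# Crude feature selection: drop columns with (near-)zero variance,
# since the model cannot learn anything from them.
keep = X.var(axis=0) > 1e-8
X_reduced = X[:, keep]
print(X.shape, "->", X_reduced.shape)   # (50, 4) -> (50, 3)
```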
Targets are also represented algebraically.
For regression:
y ∈ R^n
For classification:
- binary: y ∈ {0,1}^n
- multi-class: Y ∈ R^{n×k} (one-hot or probabilities)
This alignment matters:
- every row of X must align with exactly one label
- misalignment silently corrupts training
Many production bugs are indexing bugs, not ML bugs.
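One cheap defense is to check shapes before training ever starts. A minimal sketch, using a hypothetical `check_alignment` helper:

```python
import numpy as np

def check_alignment(X, y):
    """Cheap shape check that catches many silent indexing bugs early."""
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{X.shape[0]} rows in X but {y.shape[0]} labels in y")

X = np.zeros((100, 8))
y = np.zeros(99)              # off-by-one: one label was dropped upstream

try:
    check_alignment(X, y)     # fail loudly here, instead of training on shifted labels
except ValueError as e:
    print("alignment bug:", e)
```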
Linear regression can be written compactly as:
ŷ = Xw + b
or, with bias absorbed:
ŷ = Xθ
Loss becomes:
L = ||Xθ − y||^2
This formulation explains:
- why training is matrix multiplication
- why GPUs help
- why batch size matters
Neural networks generalize this idea: repeated matrix multiplications plus nonlinearities.
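As an illustration of the matrix view, here is a minimal prediction, loss, and gradient step for linear regression in NumPy; the data, true coefficients, and learning rate are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=n)

# Absorb the bias by appending a constant column of ones.
X_aug = np.hstack([X, np.ones((n, 1))])

# Prediction and squared-error loss are pure matrix/vector operations.
theta = np.zeros(X_aug.shape[1])
y_hat = X_aug @ theta
loss = np.sum((y_hat - y) ** 2)

# One gradient step: the gradient of ||X theta - y||^2 is 2 X^T (X theta - y).
grad = 2 * X_aug.T @ (X_aug @ theta - y)
theta -= 1e-4 * grad
```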
Not all datasets are dense.
Examples of sparse data:
- one-hot encoded categorical features
- text bag-of-words
- recommender interaction matrices
Sparse matrices:
- store only non-zero values
- enable massive dimensionality
- change algorithm choice
Some algorithms scale with:
- number of rows × columns (dense)
- number of non-zero entries (sparse)
Confusing these leads to catastrophic performance decisions.
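A sketch using `scipy.sparse` with made-up dimensions, to show the gap between dense and sparse storage:

```python
import numpy as np
from scipy import sparse

# A hypothetical one-hot-style matrix: 10,000 rows, 50,000 columns, ~5 non-zeros per row.
# Dense float64 storage would need 10_000 * 50_000 * 8 bytes ≈ 4 GB;
# sparse formats store only the non-zero entries.
X_sparse = sparse.random(10_000, 50_000, density=1e-4, format="csr", random_state=3)

print(X_sparse.nnz)                      # number of stored non-zero entries (~50,000)
print(X_sparse.data.nbytes / 1e6, "MB")  # memory for the stored values alone

# Algorithms that scale with nnz (e.g. sparse mat-vec) stay cheap;
# densifying with X_sparse.toarray() would blow up memory.
w = np.zeros(X_sparse.shape[1])
y_hat = X_sparse @ w                     # sparse matrix-vector product
```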
Matrices are stored in memory in specific layouts:
- row-major
- column-major
This affects:
- cache efficiency
- training speed
- data pipeline design
Batching, shuffling, and streaming are matrix operations, not conceptual conveniences.
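A small NumPy sketch of layout; `order="C"` is row-major and `np.asfortranarray` produces a column-major copy:

```python
import numpy as np

X_rows = np.zeros((10_000, 1_000), order="C")  # row-major: each row is contiguous
X_cols = np.asfortranarray(X_rows)             # column-major: each column is contiguous

# Iterating over rows is cache-friendly on a row-major array,
# and iterating over columns is cache-friendly on a column-major one.
row_sums = X_rows.sum(axis=1)
col_means = X_cols.mean(axis=0)

# Mini-batching is just row slicing of the design matrix.
batch = X_rows[0:256]            # a view, no copy, because rows are contiguous in C order
```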
Operations like:
- standardization
- normalization
- log transforms
are applied column-wise:
X_{:,j}' = f(X_{:,j})
This reinforces:
- features are global axes
- transformations must be consistent between training and serving
Mismatch here causes training–serving skew.
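A minimal sketch of keeping the transform consistent: fit the column-wise statistics on training data, store them, and reapply exactly the same values at serving time (numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(loc=10.0, scale=2.0, size=(1_000, 3))

# Fit the column-wise transform on training data only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

def standardize(X, mu, sigma):
    # Apply the *stored* training statistics; never refit on serving data.
    return (X - mu) / sigma

X_train_std = standardize(X_train, mu, sigma)

# ...and reuse the exact same mu/sigma for every request at serving time.
x_request = rng.normal(loc=10.0, scale=2.0, size=(1, 3))
x_request_std = standardize(x_request, mu, sigma)
```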
Seeing datasets as matrices explains:
- why adding features increases compute
- why wide datasets overfit easily
- why dimensionality reduction helps
- why shuffling rows matters
- why distributed training partitions by rows
Almost every scalability trade-off reduces to matrix size and structure.
The matrix abstraction assumes:
- fixed feature set
- aligned rows
- homogeneous examples
It struggles with:
- variable-length sequences
- graphs
- ragged data
- multimodal inputs
Modern systems handle this by:
- padding
- masking
- multiple matrices
- specialized data structures
Understanding the limits of the abstraction is as important as using it.
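A padding-plus-masking sketch for ragged sequences; the sequence values are arbitrary:

```python
import numpy as np

# Three variable-length sequences (ragged data) forced into one matrix by padding.
sequences = [
    [3.0, 1.0, 4.0],
    [1.0, 5.0],
    [9.0, 2.0, 6.0, 5.0],
]

max_len = max(len(s) for s in sequences)
X = np.zeros((len(sequences), max_len))                 # padded matrix, shape (3, 4)
mask = np.zeros((len(sequences), max_len), dtype=bool)  # True where a real value exists

for i, seq in enumerate(sequences):
    X[i, :len(seq)] = seq
    mask[i, :len(seq)] = True

# Downstream code must use the mask so padding does not distort statistics.
true_means = (X * mask).sum(axis=1) / mask.sum(axis=1)
```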
A professional ML engineer always asks:
- How many rows?
- How many columns?
- How sparse?
- How fast does it grow?
- How often does it change?
These questions predict:
- cost
- latency
- failure modes
Before choosing a model, understand the matrix.
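A back-of-envelope sizing helper; the row and column counts below are hypothetical:

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value=4):
    """Rough memory footprint of a dense float32 design matrix, in GB."""
    return n_rows * n_cols * bytes_per_value / 1e9

# Hypothetical numbers, purely for illustration.
print(dense_matrix_gb(10_000_000, 100))     # ~4 GB: fits on one machine
print(dense_matrix_gb(10_000_000, 50_000))  # ~2000 GB: needs sparsity or sharding
```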
A dataset is not “a bunch of examples.” It is a structured mathematical object with shape, sparsity, and semantics.
Most ML engineering decisions are consequences of that structure.
You should now be able to:
- represent datasets as matrices formally
- explain rows vs columns statistically and semantically
- reason about dense vs sparse data
- understand why matrix size drives scalability
- diagnose data alignment bugs conceptually