Machine learning models cannot operate on things — only on numbers. Users, images, sentences, transactions, sensors, documents — all must be converted into numerical form before any learning can occur.
This conversion is not incidental. It is the first irreversible design decision in any ML system.
Once reality is mapped into numbers, the model can only reason within that numerical representation.
A scalar is a single real number:
x ∈ R

Examples:
- age = 42
- temperature = 18.7
- account_balance = 1520.35
Scalars are the simplest features and often the most dangerous:
- they imply linear ordering
- they imply magnitude comparisons
- they invite extrapolation
If a number's meaning is not linear in its value (e.g., risk scores, IDs), treating it as a raw magnitude can mislead the model.
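As a minimal sketch of the problem (the product IDs below are made up), a model given raw IDs will compute distances between them that are pure artifacts of the encoding:

```python
import numpy as np

# Hypothetical product IDs fed to a model as plain scalars.
ids = np.array([104, 105, 9000])

# The encoding implies ID 104 is "similar" to 105 and "far" from 9000,
# even though IDs carry no ordering or magnitude at all.
print(abs(ids[0] - ids[1]))   # 1
print(abs(ids[0] - ids[2]))   # 8896
```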
A vector is an ordered collection of scalars:
x = (x_1, x_2, …, x_d) ∈ R^d

This is the canonical representation of a data point in ML.
Examples:
- user = [age, country_code, avg_session_time, purchase_count]
- house = [square_feet, bedrooms, distance_to_city, year_built]
Each component defines one axis in feature space.
A data point is not “an object” — it is a location in space.
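As an illustration (the values are invented, using NumPy), the house example above becomes a point in R^4, and the model can only relate it to other houses through geometry:

```python
import numpy as np

# A house as a point in 4D feature space:
# [square_feet, bedrooms, distance_to_city, year_built]
house_a = np.array([1500.0, 3.0, 12.5, 1998.0])
house_b = np.array([2200.0, 4.0, 3.0, 2015.0])

print(house_a.shape)                      # (4,): a location in R^4
print(np.linalg.norm(house_a - house_b))  # Euclidean distance between the two "houses"
```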
The feature space is the set of all possible vectors your model might see.
- 3 numeric features → 3D space
- 100 features → 100D space
- 10,000 features → 10,000D space
Models do not see:
- semantics
- causality
- meaning
They see:
- distances
- angles
- projections
- regions
Once you define the feature space, you define what the model can learn.
Choosing features is choosing a coordinate system for reality.
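A minimal sketch of what that means in practice: for two made-up 3-feature points, the quantities below are all the model can actually compute.

```python
import numpy as np

a = np.array([1.0, 0.0, 2.0])   # two made-up points in a 3D feature space
b = np.array([0.5, 1.0, 1.5])

distance   = np.linalg.norm(a - b)                            # how far apart they sit
cosine     = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle between them
projection = (a @ b) / (b @ b) * b                            # component of a along b

print(distance, cosine, projection)
```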
Two representations of the same object can lead to radically different learning outcomes.
Example:
- raw timestamps vs cyclical encoding (sin/cos of hour)
- zip code as number vs one-hot encoding
- text as word counts vs embeddings
Mathematically, these are different spaces — even if they describe the same thing.
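For example, a sketch of the cyclical encoding mentioned above (the hours are made up): in the raw scalar space, 23:00 and 00:00 are maximally far apart; on the sin/cos circle they are neighbors.

```python
import numpy as np

hour = np.array([0, 6, 12, 23])   # hour of day, 0-23

# Map each hour onto a circle so that 23 and 0 end up adjacent.
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
cyclical = np.stack([hour_sin, hour_cos], axis=1)

print(abs(hour[3] - hour[0]))                      # 23 apart in raw space
print(np.linalg.norm(cyclical[3] - cyclical[0]))   # ~0.26 apart in cyclical space
```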
Many real-world attributes are categorical:
- country
- device type
- product ID
They have no natural ordering.
If you encode them as integers:
country = 1, country = 2, country = 3

you accidentally introduce:
- distance
- magnitude
- ordering
One-hot encoding fixes this by expanding into a higher-dimensional space:
- each category becomes its own axis
- distance reflects equality, not magnitude
Trade-off:
- correctness vs dimensionality explosion
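A minimal hand-rolled sketch (a library encoder such as scikit-learn's OneHotEncoder does the same job; the country list is hypothetical):

```python
import numpy as np

countries = ["DE", "FR", "JP"]                    # hypothetical category set
index = {c: i for i, c in enumerate(countries)}

def one_hot(country: str) -> np.ndarray:
    """Each category gets its own axis; no spurious ordering or magnitude."""
    vec = np.zeros(len(countries))
    vec[index[country]] = 1.0
    return vec

print(one_hot("DE"))   # [1. 0. 0.]
print(one_hot("JP"))   # [0. 0. 1.]

# Every pair of distinct categories is now equidistant (sqrt(2)),
# unlike the integer encoding where |1 - 3| > |1 - 2|.
print(np.linalg.norm(one_hot("DE") - one_hot("JP")))
```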
Binary features (0/1) are everywhere:
- clicked / not clicked
- fraud / not fraud
- active / inactive
They:
- carve space into regions
- create sharp decision boundaries
- interact strongly with linear models
In many systems, most signal comes from binary indicators, not continuous values.
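A sketch of that interaction with a linear model (the flags and weights below are made up): each active indicator simply adds its weight to the score, so the space is carved into the corners of a hypercube.

```python
import numpy as np

# Hypothetical binary indicators: [clicked_ad, is_new_device, used_vpn]
x = np.array([1, 0, 1])
w = np.array([0.8, -0.3, 1.5])   # made-up weights
b = -1.0

# The score of this point is just the sum of the weights of its active flags.
score = w @ x + b
print(score)   # 0.8 + 1.5 - 1.0 = 1.3
```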
Continuous features introduce geometry:
- distance
- direction
- scale
If one feature ranges from 0–1 and another from 0–1,000,000:
- the larger dominates dot products
- gradients become ill-conditioned
- training becomes unstable
This is why:
- normalization
- standardization
- log transforms
exist. They are geometric corrections, not cosmetic preprocessing.
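A minimal sketch of standardization on made-up data, showing how one large-scale feature otherwise dominates every distance:

```python
import numpy as np

# Two made-up features on wildly different scales.
X = np.array([
    [0.2, 150_000.0],
    [0.9, 320_000.0],
    [0.5,  80_000.0],
])

# Standardization: zero mean, unit variance per feature (a geometric correction).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.linalg.norm(X[0] - X[1]))         # ~170000, driven almost entirely by column 2
print(np.linalg.norm(X_std[0] - X_std[1])) # both features now contribute comparably
```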
As dimensionality increases:
- distances concentrate
- nearest neighbors become less meaningful
- volume grows exponentially
This is the curse of dimensionality.
Engineering consequences:
- simple distance-based methods degrade
- feature selection becomes critical
- regularization becomes mandatory
High-dimensional space behaves nothing like 2D or 3D intuition.
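A small experiment (uniform random points; dimensions chosen to mirror the list above) makes distance concentration visible: the gap between the nearest and farthest neighbor shrinks relative to the average distance.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (3, 100, 10_000):
    X = rng.random((200, d))                       # 200 random points in [0, 1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(d, round(contrast, 3))                   # relative contrast shrinks as d grows
```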
Many feature spaces are sparse:
- text (bag-of-words)
- categorical one-hot encodings
- recommender systems
Sparsity is not a bug — it is structure.
Benefits:
- efficient storage
- fast dot products
- interpretable signals
But it also:
- increases dimensionality
- complicates similarity
- demands specialized algorithms
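A minimal sketch using scipy.sparse (the toy bag-of-words counts are made up), showing that only the non-zero entries are stored and used:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny bag-of-words matrix: 3 documents over a 6-word vocabulary.
dense = np.array([
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
])
X = csr_matrix(dense)

print(X.nnz, "non-zeros out of", dense.size)   # 5 non-zeros out of 18
# Dot products skip the zeros entirely, which is what keeps sparse linear models fast.
print((X @ X.T).toarray())                     # document-by-document similarity counts
```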
Embeddings map discrete objects into continuous vector spaces:
object → R^k

Properties:
- similar objects are close
- distances become meaningful
- dimensionality is controlled
Embeddings learn the geometry instead of you defining it manually.
This bridges:
- feature engineering
- deep learning
- retrieval systems
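A sketch of the mechanics only (the embedding values below are random placeholders rather than trained, so the similarities mean nothing; the point is the lookup plus the geometry):

```python
import numpy as np

rng = np.random.default_rng(0)

# An embedding table is just a matrix: one k-dimensional row per object.
vocab = ["laptop", "notebook", "banana"]
E = rng.normal(size=(len(vocab), 8))      # 3 objects embedded in R^8 (k = 8)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# After training, similar objects end up close in this space.
print(cosine(E[0], E[1]))   # laptop vs notebook
print(cosine(E[0], E[2]))   # laptop vs banana
```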
In feature space, many models are simply: “Can I draw a surface that separates these points?”
Linear models:
- draw flat surfaces (hyperplanes)
- rely heavily on feature design
If data is not linearly separable in your space:
- no amount of training will fix it
- you must change representation or model family
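The classic XOR pattern is a concrete sketch of this (assuming scikit-learn is available): in the raw 2D space no hyperplane separates the classes, and training cannot change that.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR: the two classes cannot be split by any flat surface in this 2D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))   # stays at chance level (~0.5); more training will not help
```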
Interactions (e.g., age × income) correspond to:
- bending space
- introducing new dimensions
- changing separability
Polynomial features and neural networks both:
- enrich feature space
- increase expressiveness
- increase overfitting risk
Representation power always trades off with stability.
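Continuing the XOR sketch, adding the interaction term x1*x2 (here via scikit-learn's PolynomialFeatures; any equivalent feature map would do) enriches the space enough to make the same points linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Expand (x1, x2) into (x1, x2, x1*x2): the XOR classes are separable in this 3D space.
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = expand.fit_transform(X)

clf = LogisticRegression(C=1000).fit(X_poly, y)
print(X_poly.shape)          # (4, 3)
print(clf.score(X_poly, y))  # 1.0 on this toy data
```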
A model does not fail because it “missed something.” It fails because that information was never represented — or was represented in a misleading way.
Feature design is ontological:
- it defines what exists in the model’s universe
Machine learning models do not learn about the world. They learn about points in a feature space you designed.
If the geometry is wrong:
- optimization will still succeed
- metrics may look good
- deployment will fail
You should now be able to:
- Explain why all ML inputs are vectors
- Reason about feature spaces geometrically
- Identify fake structure introduced by bad encoding
- Explain why scaling affects training
- Understand why representation often matters more than the choice of algorithm