October 26, 2024 · 7 min read

Representation is All You Need

If you can represent your problem in a differentiable way, a machine can learn it.

I recently started my master's degree, and as part of it, I work as a Teaching Assistant. Currently, I am working with students in the Neural Networks course. I see the students go through the same types of questions and issues that I went through. There is so much ceremony involved in training a model: the network, layers, backpropagation, activation functions, optimizers, architectures, and more. All of these are foreign concepts to a Computer Science student compared to the programming or computer architecture courses they have previously taken. How can I simplify the essence of neural networks for them? How do we think about it from first principles?

The Idea

I am going to try very hard to do that here. There is one idea underneath all of machine learning. Not matrices. Not activation functions. Not attention heads. Those are tools. The idea is this.

If you can represent your problem in a differentiable way, a machine can learn it.*

Everything else is engineering built on top of that. (Also, the representation must be optimizable, hence the asterisk.)

The Primitive

Fundamentally, learning means minimizing a cost function $J$ with respect to parameters $\theta$. If the mapping $\hat{y} = f(x; \theta)$ is differentiable, we can use the chain rule to find

$$\frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \theta}$$

As long as that gradient exists and is nonzero, learning (stepping in the direction of $-\nabla_{\theta} J$) can happen. Backpropagation is simply the repeated application of this chain rule. You make a prediction, measure how wrong it is, and compute how much each parameter contributed to that error. Then you nudge each parameter slightly in the direction that reduces the error. Repeat a million times.
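The loop described above (predict, measure, attribute, nudge) fits in a few lines. What follows is a minimal sketch in plain Python, not a reference implementation. The toy data, the linear model `w * x + b`, the learning rate, and the step count are all assumptions chosen for illustration.

```python
# Fit y_hat = w * x + b with hand-written gradient descent.
# Data, learning rate, and step count are illustrative assumptions.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # targets generated from w = 2, b = 1


def cost(w, b):
    """Mean squared error J between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Chain rule: dJ/dw = dJ/dy_hat * dy_hat/dw, and likewise for b."""
    dw = db = 0.0
    for x, y in zip(xs, ys):
        y_hat = w * x + b                        # make a prediction
        dJ_dyhat = 2.0 * (y_hat - y) / len(xs)   # how wrong is it?
        dw += dJ_dyhat * x                       # dy_hat/dw = x
        db += dJ_dyhat * 1.0                     # dy_hat/db = 1
    return dw, db


def train(steps=5000, lr=0.05):
    """Repeatedly step in the direction of the negative gradient."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradients(w, b)
        w -= lr * dw
        b -= lr * db
    return w, b


w, b = train()
print(w, b, cost(w, b))
```

Nothing in `train` knows what the numbers mean. It only sees a cost and the slope of that cost with respect to each parameter.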

The algorithm doesn't know it's looking at a face, a sentence, or a heartbeat. It knows one thing. Here is a number; make it smaller.

Which means the real question is never "can a neural network solve this?" The real question is: can you represent your problem so that there is a differentiable path from input to error? If yes, the problem is learnable in principle, and the gradient will find its way. The backprop engine just sits there, waiting for you to give it something to flow through. So the questions worth asking are:

  • How do you represent an image?
  • How do you represent a molecule for drug discovery?
  • How do you represent a proof step?
  • How do you represent a graph structure?
  • How do you represent a time series?

The history of deep learning is largely a history of people finding clever differentiable representations: CNNs for spatial locality, Transformers for attention over sequences, and GNNs for graphs.

Three Problems, One Idea

Let's look at three different kinds of problems and how we can solve them.

Images

An image as a pixel grid is just a matrix of numbers, which is already a representation. But a raw pixel matrix doesn't inherently contain semantics. CNNs impose a structured representation. This representation forces the network to learn local patterns first (edges, textures), then compose them into shapes, and then into objects through convolutions. Finally, it can decide if the image is a hotdog or not. A CNN is a carefully designed representation of what "looking at an image" should mean computationally.
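To make "local patterns first" concrete, here is a small sketch of a single convolution in plain Python. The 4x6 toy image and the hand-picked vertical-edge kernel are assumptions for illustration; in a real CNN the kernel weights are learned by gradient descent rather than written by hand.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Each output value only looks at a small local patch.
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Toy image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked vertical edge detector (an assumption, not a learned filter).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is nonzero only where the patch straddles the dark-to-bright boundary. Because every step is a sum of products, the output is differentiable with respect to the kernel weights, which is what lets a network learn its own edge detectors.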

Time Series

An ECG waveform is a sequence of floats, a squiggle. A dangerous arrhythmia is not a random squiggle; it's a specific pattern within that sequence. But to a network, these raw floats carry no more meaning than arbitrary numbers. A simple network has no idea that a spike 10 timesteps ago might be related to what's happening right now. RNNs and LSTMs solve this by maintaining a hidden state that carries temporal context. They can encode when something happened, not just what the value was. Once the waveform is represented in a richer, temporal space, "normal rhythm" and "atrial fibrillation" become separated by distance in that space. A loss function can then distinguish between "normal" and "dangerous". Differentiation was always possible, but the representation is what made it meaningful.

[Interactive figure: RNN Temporal Context. Stepping through the input time series shows how the new hidden state h_t is computed by combining the previous state h_{t-1} with the current input x_t, so h_t reflects the context carried over from earlier steps and not just the current value.]
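The hidden-state update described above can be sketched with a single scalar recurrence. This is a deliberately tiny model, assuming the common form h_t = tanh(w_h * h_{t-1} + w_x * x_t). The weights `w_h` and `w_x` and the two input sequences are made-up values, not trained parameters or real ECG data.

```python
import math


def rnn_step(h_prev, x_t, w_h=0.9, w_x=1.0):
    """One scalar RNN update: h_t = tanh(w_h * h_prev + w_x * x_t)."""
    return math.tanh(w_h * h_prev + w_x * x_t)


def run(sequence):
    """Feed a sequence through the recurrence and return every hidden state."""
    h = 0.0
    states = []
    for x_t in sequence:
        h = rnn_step(h, x_t)
        states.append(h)
    return states


# Both sequences end with the same input value, 0.1.
spike_early = run([1.0, 0.0, 0.0, 0.1])
no_spike = run([0.0, 0.0, 0.0, 0.1])

print(spike_early[-1], no_spike[-1])
```

The final input is identical in both runs, yet the final hidden states differ, because the state still remembers the spike from three steps earlier. That memory is the temporal context a plain feed-forward network lacks.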

Language

Words are discrete. There's no derivative of "king". How do you encode language for a model? If the word "king" can be mapped to a continuous high-dimensional vector space, you can take gradients with respect to its representation. Suddenly, the model can learn to distinguish "king" from "queen". Then, the question "what is king minus man plus woman?" becomes geometry. The result is a representation space where king - man + woman ≈ queen falls out naturally. Not because anyone explicitly told the model about it, but because a good representation makes semantic structure visible.
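The "king minus man plus woman" arithmetic can be tried on a toy scale. The sketch below uses hand-made two-dimensional vectors, where one axis loosely stands for royalty and the other for gender. These values are invented for illustration; real embeddings are learned from text, have hundreds of dimensions, and only approximately satisfy the analogy.

```python
import math

# Hand-made toy embeddings (assumption): axis 0 ~ royalty, axis 1 ~ gender.
embeddings = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the query words."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


result = analogy("king", "man", "woman")
print(result)
```

Once words live in a continuous space, "similar meaning" becomes "nearby vector", and every coordinate is something a gradient can adjust.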

The Essence

The backprop algorithm is not the invention. Leibniz was using the chain rule by 1676. The invention is always the representation, the decision of how to lay your problem onto a differentiable surface in the first place.

ResNets did not change backprop. They changed the computation graph. Transformers did not change backprop. They changed how tokens attend to each other before the gradient flows. AlphaFold did not change backprop. It found a way to encode protein geometry and evolutionary information so that structural accuracy became a differentiable target.
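To make "changed the computation graph" concrete, here is a small worked derivation for the ResNet case. It assumes the standard residual form, a block whose output adds its input back to a learned function $F$. The scalar notation is a simplification chosen for this sketch.

$$y = x + F(x; \theta) \quad\Rightarrow\quad \frac{\partial J}{\partial x} = \frac{\partial J}{\partial y} \cdot \frac{\partial y}{\partial x} = \frac{\partial J}{\partial y}\left(1 + \frac{\partial F}{\partial x}\right)$$

The chain rule is untouched. The extra $1$ comes purely from the skip connection, and it gives the gradient a direct route backward even when $\partial F / \partial x$ is small.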

Every major breakthrough in deep learning, when you strip it down, is someone finding a better coordinate system for a problem, a differentiable encoding, a better representation.

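The Universal Approximation Theorem provides part of the theoretical backing. Once you have encoded your problem correctly, any continuous function on a compact (closed and bounded) input domain can be approximated to arbitrary accuracy by a neural network of sufficient capacity. The theorem says such a network exists, not that gradient descent will find it, which is one more reason the representation matters. The network is not the intelligence. It is the clay. You, the designer, are the one who decides what shape the clay should take.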