CNNs don’t “see” images.
They destroy them.
They tear the original image into pieces,
extract numbers from each piece,
and recombine those numbers.
Completely different from human vision.
Yet it works.
On some benchmarks, more accurately than humans.
Why does destruction become recognition?
Pixels Are Meaningless
Consider a 224 × 224 color image.
That’s 150,528 numbers.
(224 × 224 × 3 channels)
Each of these numbers is meaningless on its own.
The fact that pixel (112, 87) has a red value of 203
has nothing to do with “cat.”
Meaning lies in the relationships between pixels.
How this pixel connects to that pixel.
Where brightness changes abruptly.
What patterns repeat.
Traditional neural networks (MLPs) treat every pixel as a separate input,
with its own weight to every hidden neuron.
150,528 input neurons.
Even with just 1,000 neurons in the first hidden layer,
you need about 150 million weights.
Slow to train.
Hungry for memory.
Prone to overfitting.
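The arithmetic behind that number is easy to check. A minimal sketch, assuming 32-bit floats (4 bytes per weight) and ignoring biases; the function name `dense_weights` is made up for illustration:

```python
def dense_weights(n_inputs, n_hidden):
    """Weights in one fully connected layer (biases ignored)."""
    return n_inputs * n_hidden

n_inputs = 224 * 224 * 3                 # one number per pixel per color channel
weights = dense_weights(n_inputs, 1000)  # first hidden layer with 1,000 neurons

# Assumption: each weight is stored as a 32-bit float (4 bytes).
megabytes = weights * 4 / 1e6

print(n_inputs)          # 150528
print(weights)           # 150528000
print(round(megabytes))  # 602
```

Roughly 600 MB for the weights of a single layer, before gradients or optimizer state are counted.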
CNNs took a different approach.
Don’t look at the whole—look at pieces.
Convolution: Scanning the World Through a Small Window
The core of convolution is the filter (or kernel).
A small 3×3 matrix.
This filter slides across the image,
performing an operation at each position.
The operation is simple.
Multiply each of the filter’s 9 values by the pixel beneath it,
then sum them all.
The result is a single number.
This number represents “how strongly this pattern appears at this location.”
When the filter scans the entire image,
a feature map is created.
A new image, slightly smaller than the original unless the border is padded.
A map showing where specific patterns exist.
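The sliding-window operation fits in a few lines. This is a minimal sketch in plain Python, assuming a single-channel image, stride 1, and no padding. The tiny image and the hand-written vertical-edge filter are made up for illustration; a real CNN would learn its filter values.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; multiply and sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy 5x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Hand-written filter that responds where brightness rises left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row: [0, 27, 27, 0]
```

The output is 3×4 instead of 5×6, and it is nonzero only where the window straddles the dark-to-bright boundary.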
Here’s the key insight:
The filter’s values are learned.
Humans don’t specify “find vertical edges.”
Through backpropagation,
the network discovers optimal filters on its own.
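To make "learned" concrete, here is a deliberately tiny sketch: a 3-value filter on a 1D signal, fitted by plain gradient descent. This is the single-layer special case of what backpropagation does in a deep network, not a full CNN. The signal, the hidden target filter, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def correlate1d(signal, kernel):
    """Slide a 1D kernel over a signal; multiply and sum at each position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def learn_kernel(signal, target, size=3, lr=0.1, steps=1000):
    """Fit kernel values by gradient descent on mean squared error."""
    kernel = [0.0] * size  # start with no knowledge at all
    n = len(target)
    for _ in range(steps):
        output = correlate1d(signal, kernel)
        errors = [o - t for o, t in zip(output, target)]
        # Gradient of the mean squared error with respect to each kernel value.
        grads = [2.0 / n * sum(errors[i] * signal[i + j] for i in range(n))
                 for j in range(size)]
        kernel = [w - lr * g for w, g in zip(kernel, grads)]
    return kernel

signal = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
hidden_kernel = [-1.0, 0.0, 1.0]              # an edge detector nobody shows the learner
target = correlate1d(signal, hidden_kernel)   # the responses we want reproduced

learned = learn_kernel(signal, target)
print([round(w, 3) for w in learned])
```

The learner only ever sees inputs and desired outputs, yet its values converge toward the hidden edge detector.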
The First Layer: A World of Edges
Visualize the first layer of a trained CNN,
and you see a surprising pattern.
Most filters are edge detectors.
Filters that find vertical lines.
Filters that find horizontal lines.
Filters that find diagonal lines.
Filters that find brightness changes.
This isn’t coincidence.
The most fundamental information in images is boundaries.
Where objects separate from backgrounds.
Where colors change.
Where shapes begin.
CNNs discover this without being taught.
Training on millions of images,
they conclude that “edges are the most useful low-level features.”
The early visual cortex works similarly.
Hubel and Wiesel’s experiments around 1960 showed this.
Visual neurons in cat brains
respond to lines of specific orientations.
The architecture borrowed from that biology, but nobody hand-coded the edge filters.
Solving the same problem, training arrived at a similar solution.
As Layers Deepen: From Parts to Wholes
The second layer takes the first layer’s output as input.
It learns combinations of edges.
Vertical + horizontal lines = corners.
Combinations of curves = circles.
Patterns of multiple edges = textures.
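A rough sketch of "vertical + horizontal = corner": two hand-written edge filters form a first layer, and a second layer adds their responses, subtracts a threshold, and applies ReLU. All weights and the threshold of 3 are hand-picked assumptions for this toy image; a trained network would learn its own.

```python
def conv2d(image, kernel):
    """Valid, stride-1 sliding window: multiply and sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]

def relu(x):
    return max(0, x)

# Toy 6x6 image: a bright square in the lower-right, its corner at (3, 3).
image = [[1 if r >= 3 and c >= 3 else 0 for c in range(6)] for r in range(6)]

vertical = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# Layer 1: two edge feature maps.
v_map = [[relu(x) for x in row] for row in conv2d(image, vertical)]
h_map = [[relu(x) for x in row] for row in conv2d(image, horizontal)]

# Layer 2: a 1x1 combination of both maps. The bias of -3 equals the
# strongest response a straight edge alone produces here, so only
# positions where both edge types are present survive the ReLU.
corner_map = [[relu(v + h - 3) for v, h in zip(v_row, h_row)]
              for v_row, h_row in zip(v_map, h_map)]

for row in corner_map:
    print(row)
```

Only one position fires: the window centered on the square's corner. A straight edge alone, however strong, stays at zero.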
The third layer recognizes more complex structures.
Partial forms like eyes, noses, mouths.
Object components like wheels, windows.
As we go to the fourth and fifth layers,
abstraction levels rise.
The network begins recognizing conceptual patterns
like “cat face” or “car front.”
This is hierarchical feature learning in CNNs.
| Layer | What It Recognizes | Abstraction Level |
|---|---|---|
| Layer 1 | Edges, color changes | Very low |
| Layer 2 | Corners, curves, textures | Low |
| Layer 3 | Object parts (eyes, wheels) | Medium |
| Layers 4-5 | Larger configurations (cat face, car front) | High |
| Final layer | Whole-object categories | Very high |
This is exactly the “automated feature extraction”
discussed as the difference between ML and DL.
Features humans never defined
are hierarchically constructed by CNNs.
Pooling: Compressing Information
Convolution alone has problems.
Feature maps are too large.
Computation explodes.
Pooling solves this.
The most common method is max pooling.
From a 2×2 region, keep only the largest value.
Discard the other three.
Looks like information loss.
It is information loss.
But this loss actually helps.
Whether a cat is on the left or right of an image,
the judgment “cat” should be the same.
Pooling provides a degree of this location invariance.
It also builds robustness to small variations.
Slightly shifted images.
To a lesser extent, slightly rotated or zoomed images.
Thanks to pooling, they often produce the same features.
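A small sketch of 2×2 max pooling, assuming non-overlapping windows (stride 2) and a feature map with even dimensions. The three toy feature maps are made up to show both the robustness and its limit.

```python
def max_pool2x2(feature_map):
    """Keep the largest value from each non-overlapping 2x2 region."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]

# A strong response at row 1, column 1.
original = [[0, 0, 0, 0],
            [0, 7, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

# The same response shifted by one pixel, still inside the same 2x2 window.
small_shift = [[7, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]

# Shifted one pixel the other way, across a window boundary.
crossing_shift = [[0, 0, 0, 0],
                  [0, 0, 7, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]

print(max_pool2x2(original))        # [[7, 0], [0, 0]]
print(max_pool2x2(small_shift))     # [[7, 0], [0, 0]]
print(max_pool2x2(crossing_shift))  # [[0, 7], [0, 0]]
```

Sixteen values become four. A shift within a window disappears entirely; a shift across a window boundary still moves the output, just on a coarser grid.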
Why “Convolution”
Mathematically, convolution is an operation from signal processing.
A way to combine two functions.
One function “slides” over another,
multiplying and summing overlapping parts at each position.
In CNNs, the image is the first function,
the filter is the second.
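Written out, with I for the image, K for a 3×3 filter, and S for the feature map (symbols chosen here for illustration), the two closely related operations look like this. Textbook convolution flips the filter. CNN layers in practice typically skip the flip, which is technically cross-correlation. Because the filter values are learned, the flip makes no practical difference.

```latex
% Convolution as defined in signal processing: the kernel is flipped.
(I * K)(r, c) = \sum_{i} \sum_{j} I(r - i,\; c - j)\, K(i, j)

% What a CNN layer typically computes (cross-correlation): no flip.
S(r, c) = \sum_{i=0}^{2} \sum_{j=0}^{2} I(r + i,\; c + j)\, K(i, j)
```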
The key property of this operation is weight sharing.
The same filter applies to every position in the image.
A filter that finds vertical lines in the upper left
finds the same vertical lines in the lower right.
This dramatically reduces parameters.
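A single 3×3 filter needs only 9 weights per input channel.
Just 9, no matter how large the image.
Even a layer of 64 such filters over 3 color channels stays under 2,000 parameters.
150 million vs. a few thousand.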
This difference made CNNs practical.
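The parameter counts can be checked directly. A minimal sketch: the helper `conv_params` and the choice of 64 filters are assumptions for illustration, not figures from any particular network.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable parameters in one convolutional layer."""
    per_filter = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_filter * out_channels

# One 3x3 filter on a single-channel image, no bias.
single = conv_params(3, 3, 1, 1, bias=False)

# A hypothetical first layer: 64 filters over an RGB image, with biases.
layer = conv_params(3, 3, 3, 64)

# The fully connected alternative: 224 x 224 x 3 inputs to 1,000 neurons.
dense = 224 * 224 * 3 * 1000

print(single)          # 9
print(layer)           # 1792
print(dense // layer)  # 84000
```

Neither convolutional count depends on the image size, while the fully connected count grows with every added pixel.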
Why Destruction Becomes Understanding
Back to the original question.
Why does destroying an image lead to understanding?
The answer is abstraction.
The original image is too concrete.
150,528 numbers.
Most are irrelevant to the label.
Meaningful information is a tiny fraction.
CNNs compress information at each layer.
Keep only edges, discard the rest.
Keep only shapes, discard details.
Keep only concepts, discard shapes.
What remains at the end
is a highly compressed judgment:
“What category does this image belong to?”
Destruction isn’t the goal.
It’s a byproduct of compression.
And this compression forms the foundation
for how Physical AI perceives the real world.
The first step in converting what a robot’s camera sees
into actionable information.
Conclusion
CNNs don’t see—they search.
Search for patterns,
search for patterns of patterns,
search for patterns of patterns of patterns.
Each layer uses what the previous layer found
to build something more abstract.
In this process, the original image disappears.
Only meaning remains.
Perhaps human vision works similarly.
We don’t see the world as it is.
We see only the interpretation our brains construct.
CNNs may have revealed
the structure of that interpretation
through mathematics.

