2012. The ImageNet competition.
A CNN, AlexNet, demonstrated overwhelming performance in image classification.
After that, AI diverged.
Images became CNN’s domain.
Text became RNN’s domain.
Then Transformer emerged, and language models exploded.
Each field developed independently.
Computer vision, natural language processing, speech recognition.
Different conferences, different datasets, different evaluation metrics.
But humans don’t perceive the world that way.
We see.
We hear at the same time.
We touch, smell, and read.
All these senses integrate into a single understanding.
Why couldn’t AI do the same?
Separate Languages
CNN converts images to numbers.
Filters extract features,
and through multiple layers, they become high-dimensional vectors.
A “cat photo” becomes a 512-dimensional vector.
Transformer converts text to numbers.
Tokens are embedded,
pass through attention layers, and become high-dimensional vectors.
The word “cat” also becomes a 512-dimensional vector.
Both are vectors.
The same mathematical object.
But these two vectors live in different spaces.
The 512 dimensions from an image model
and the 512 dimensions from a language model
only share the same dimension count—their meanings are entirely different.
You cannot compare the vector of a cat photo
with the vector of the word “cat.”
Because they speak different languages.
What Is an Embedding?
Let’s return to basics.
Embedding means
representing high-dimensional data in a lower-dimensional space.
More precisely,
it means arranging data in vector space
so that semantically similar things are located close together.
Consider word embeddings.
Since Word2Vec,
“king” and “queen” have sat close together in vector space.
“King” and “apple” are far apart.
Even more remarkable:
“king” – “man” + “woman” ≈ “queen”
Vector operations become semantic operations.
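The analogy can be checked with toy vectors. These four-dimensional embeddings are invented for illustration (real Word2Vec vectors are learned from large corpora and have hundreds of dimensions), but they show how nearest-neighbor search under cosine similarity turns vector arithmetic into semantic arithmetic:

```python
import numpy as np

# Toy 4-dimensional word vectors, hand-made for illustration only.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.2, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.0, 0.1]),
    "woman": np.array([0.1, 0.3, 0.0, 0.7]),
    "apple": np.array([0.0, 0.1, 0.9, 0.1]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land nearest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = max(vecs, key=lambda w: cosine(vecs[w], target))
print(nearest)  # → queen
```

With these particular toy values the arithmetic lands exactly on “queen”; with real learned embeddings it lands *near* it, which is why the relation is written with ≈ rather than =.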
Images work the same way.
Vectors extracted from CNN’s final layer
cluster similar images together.
Cat photos gather in one region,
car photos in another.
This is the power of embedding space.
Distance becomes meaning.
The Idea of Shared Space
Here’s the problem.
Image embedding space and text embedding space
are separated from each other.
Cat photos are close in image space.
“Cat,” “고양이” (Korean), and “猫” (Chinese) are close in text space.
But a cat photo and the word “cat”?
No way to compare them.
What if we could merge the two spaces?
What if a cat photo and the word “cat”
existed at the same location in the same space?
This is the core idea of shared embedding space.
Representing different modalities—images, text, audio—
in a single vector space.
Things with the same meaning are close, regardless of modality.
Things with different meanings are far apart.
Why We Need a Single Space
Consider what shared space enables.
Search:
Type “a puppy running on the beach,”
convert the text to a vector,
find the images closest to that vector.
Search images with text.
Search text with images.
Search videos with audio.
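The retrieval loop above is simple once both encoders map into the same space. Here is a minimal sketch; the image embeddings and the `encode_text` function are random stand-ins for real encoders, so only the search mechanics are meaningful:

```python
import numpy as np

# Stand-in for a database of image embeddings from an image encoder.
# Unit-normalized so a dot product equals cosine similarity.
rng = np.random.default_rng(0)
image_vecs = rng.normal(size=(1000, 512))
image_vecs /= np.linalg.norm(image_vecs, axis=1, keepdims=True)

def encode_text(query: str) -> np.ndarray:
    """Placeholder for a real text encoder; returns a unit vector."""
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

query_vec = encode_text("a puppy running on the beach")
scores = image_vecs @ query_vec      # cosine similarity to every image
top5 = np.argsort(-scores)[:5]       # indices of the 5 closest images
```

Swapping which side is the query and which side is the database gives image-to-text search for free; the same dot product works in both directions.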
Generation:
Text vectors guide image generation.
The vector for “a cat flying through space”
steers the diffusion model’s direction.
Understanding:
Look at an image and answer questions about it.
This is possible because Vision-Language Models
process images and text in the same space.
Shared space becomes a translator between modalities.
How to Align
We start with separate encoders:
an image encoder and a text encoder.
Each converts its input into a vector.
The challenge is aligning these two vector spaces.
The method is simple.
Pairs with the same meaning should be close; different pairs should be far apart.
A cat photo and the caption “a photo of a cat” should be close.
The same cat photo and “a photo of a car” should be far apart.
This is the essence of contrastive learning.
Collect millions of image-text pairs.
Train so that vectors from each pair become close,
while vectors from different pairs become distant.
After training,
outputs from both encoders are aligned in the same space.
In 2021, OpenAI’s CLIP
was trained on 400 million image-text pairs using exactly this approach.
The results were revolutionary.
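The objective can be sketched as a CLIP-style symmetric contrastive loss. This NumPy version is illustrative, not OpenAI’s implementation; it assumes row i of the image batch and row i of the text batch are a matched pair, and every other row is a mismatch:

```python
import numpy as np

def clip_style_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: row i of each batch is a true pair."""
    # Normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (batch, batch) similarities

    def xent(l: np.ndarray) -> float:
        # Cross-entropy where the correct "class" for row i is column i.
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return float(-log_probs[idx, idx].mean())

    # Average the image→text and text→image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this pulls each matched pair together on the diagonal of the similarity matrix while pushing every mismatched pair apart, which is exactly the “close for same meaning, far for different meaning” rule stated above.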
Images as the Anchor
CLIP aligned images and text.
What about audio?
Depth information?
Thermal imagery?
Must we directly align every modality pair?
Text-audio, image-audio, text-depth, image-depth…
The combinations explode.
Meta AI’s ImageBind (2023) took a different approach.
Use images as the anchor.
Align with image-text pairs.
Align with image-audio pairs.
Align with image-depth pairs.
Align with image-thermal pairs.
Align with image-IMU (inertial measurement unit) pairs.
All modalities connect through images.
The remarkable part:
Text and audio were never trained as direct pairs,
yet they naturally align through images as an intermediary.
“A dog barking” audio
and “a dog barking” text
end up close together in the same space.
Emergent alignment.
Six modalities unified in one space.
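The intuition behind emergent alignment can be sketched numerically. If text and audio embeddings are each pulled close to the same image anchor during training, the triangle inequality bounds how far apart they can be from each other, even though they were never compared directly. The vectors below are toy values, not the actual ImageBind model:

```python
import numpy as np

# Toy image anchor for the concept "a dog barking".
anchor = np.array([1.0, 0.0, 0.0])

# Text and audio encoders were each trained only against images,
# so each lands near the anchor with its own small residual error.
text_vec = anchor + np.array([0.0, 0.05, 0.0])
audio_vec = anchor + np.array([0.0, 0.0, 0.08])

# Triangle inequality: text-audio distance is bounded by the sum of
# each modality's distance to the shared anchor.
d_ta = np.linalg.norm(text_vec - audio_vec)
bound = (np.linalg.norm(text_vec - anchor)
         + np.linalg.norm(audio_vec - anchor))
assert d_ta <= bound  # ≈ 0.094 <= 0.13
```

The smaller each modality’s alignment error against images, the tighter the bound, and the closer text and audio end up to each other for free.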
The Limits of Single Modality
Why does this matter?
A single modality cannot express
every concept in the world.
The concept of “a beautiful painting”
is grounded in visual representation.
Difficult to fully convey through text alone.
“The peacefulness of rain sounds”
is grounded in auditory experience.
Difficult to capture through images alone.
Human concepts are inherently multimodal.
For AI to understand the world like humans,
it must integrate multiple modalities.
Shared embedding space is the first step.
Summary
For a long time, AI developed separately, sense by sense.
CNN handled images; Transformer handled text.
But each modality lived in its own vector space.
They couldn’t be compared or translated.
Shared embedding space breaks down these walls.
Different modalities represented in one space.
Same meaning close together, different meaning far apart.
CLIP aligned images and text.
ImageBind unified six modalities into one.
Being in the same space means we can compare.
It means we can translate.
It means we can understand.
How do we translate different worlds into the same language?
The key lies in contrastive learning.