
Contrast Creates Meaning

Labels aren’t necessary.

ImageNet required more than 25,000 workers
to label 14 million images
across some 22,000 categories.
“Cat,” “car,” “airplane.”

This process was the standard for computer vision.
Humans classify, models learn.

But the internet already has the answers.

Photos have captions beside them.
Products have descriptions beside them.
News articles have images beside them.

People are already describing images.
Every day, hundreds of millions of times.

Could we leverage these natural pairs?

In 2021, OpenAI’s CLIP provided the answer.
400 million image-text pairs.
Natural data collected from the internet.

Learn without labels,
classify things never seen before.


The Principle of Contrastive Learning

The idea is simple.

Matching pairs close together, mismatched pairs far apart.

A cat photo and “a photo of a cat” are a matching pair.
Train so their vectors become close.

A cat photo and “a photo of a car” are a mismatched pair.
Train so their vectors become distant.

If a batch contains 32,768 image-text pairs,
each image must distinguish 1 correct text
from 32,767 incorrect texts.

This is contrastive learning.

Pull matching pairs (positive pairs) closer,
push mismatched pairs (negative pairs) farther.
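
One way to picture a batch is as a grid of similarity scores. The sketch below is a toy illustration, not CLIP's actual code: the 2-D vectors are made-up numbers standing in for encoder outputs, and the helper names (normalize, similarity_matrix) are hypothetical. The diagonal of the grid holds the matching pairs; everything off the diagonal is a mismatched pair.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs):
    """Return S where S[i][j] is the cosine similarity of image i and text j."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]

# Made-up 2-D embeddings: pair 0 is "cat", pair 1 is "car", pair 2 is "airplane".
image_vecs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
text_vecs = [[0.9, 0.2], [0.2, 0.9], [-0.8, 0.1]]

S = similarity_matrix(image_vecs, text_vecs)
n = len(S)

# Diagonal entries are positive pairs; off-diagonal entries are negative pairs.
positives = [S[i][i] for i in range(n)]
negatives = [S[i][j] for i in range(n) for j in range(n) if i != j]

# The same grid at the batch size mentioned above.
batch = 32_768
total_scores = batch * batch        # every image scored against every text
negatives_per_image = batch - 1     # incorrect texts for each image
```

In this toy batch, training would aim to keep every value in positives above every value in negatives. At a batch size of 32,768 the grid has over a billion entries, and only the 32,768 on the diagonal are positives.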

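Mathematically, it uses the InfoNCE loss function.

L = -log(exp(sim(I, T+) / τ) / Σ_i exp(sim(I, T_i) / τ))

Here sim is cosine similarity, T+ is the matching text,
the sum runs over every text in the batch,
and τ is a temperature that CLIP learns during training.
CLIP averages this loss over both directions:
image to text, and text to image.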

The higher the similarity of matching pairs
and the lower the similarity of mismatched pairs,
the lower the loss.
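
To see the loss behave this way, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not CLIP's training code: the similarity grids are made-up numbers, the function names (info_nce, symmetric_loss) are hypothetical, and the temperature, which divides each similarity before the softmax, is fixed at an illustrative 0.07 instead of being learned.

```python
import math

def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE for one query: -log of the softmax weight on the matching item."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]

def symmetric_loss(S, temperature=0.07):
    """Average the image-to-text and text-to-image losses over a square grid S."""
    n = len(S)
    # Rows: each image picks its text among all texts in the batch.
    image_to_text = sum(info_nce(S[i], i, temperature) for i in range(n)) / n
    # Columns: each text picks its image among all images in the batch.
    columns = [[S[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(info_nce(columns[j], j, temperature) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Made-up similarity grids; entry [i][j] compares image i with text j.
aligned = [[0.9, 0.1, 0.0],
           [0.1, 0.9, 0.1],
           [0.0, 0.1, 0.9]]
confused = [[0.3, 0.3, 0.3],
            [0.3, 0.3, 0.3],
            [0.3, 0.3, 0.3]]

loss_aligned = symmetric_loss(aligned)    # close to zero
loss_confused = symmetric_loss(confused)  # log(3): no better than guessing
```

When every pair looks equally similar, the loss equals the log of the batch size, which is what random guessing would score. As matching pairs pull ahead of mismatched ones, it falls toward zero.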


Two Encoders

CLIP’s structure is simple.

Image encoder: Converts images to vectors.
Uses Vision Transformer (ViT) or ResNet.

Text encoder: Converts text to vectors.
A Transformer structure similar to GPT.

The two encoders receive different inputs
but output vectors of the same dimension:
for example 512 or 768, depending on the model variant.

After training,
images and text exist in the same space.

The vector of a cat photo
and the vector of the word “cat”
are placed at nearby positions.
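
The shared space can be sketched with two small projections. This is a toy under loud assumptions: the feature vectors and projection weights below are invented numbers, the names (project, l2_normalize) are hypothetical, and a real model learns its weights during training. What the sketch shows is only the shape of the idea: features of different widths are mapped to vectors of one common width and normalized, so they can be compared directly.

```python
import math

def project(features, weights):
    """Apply a linear projection; each row of weights yields one output value."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Invented encoder outputs with different widths (4 for image, 3 for text).
image_features = [0.5, -1.2, 0.3, 2.0]
text_features = [1.5, 0.4, -0.7]

# Invented projection matrices; both map into a shared 2-D space.
image_proj = [[0.1, 0.0, 0.2, 0.3],
              [0.0, -0.1, 0.4, 0.1]]
text_proj = [[0.2, 0.5, 0.0],
             [0.1, 0.0, -0.3]]

image_embedding = l2_normalize(project(image_features, image_proj))
text_embedding = l2_normalize(project(text_features, text_proj))

# Same dimension and unit length, so the dot product is a cosine similarity.
cosine = sum(a * b for a, b in zip(image_embedding, text_embedding))
```

In the real model the shared width is in the hundreds, not 2, but the comparison step is the same dot product.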


The Meaning of Zero-Shot

CLIP’s true power lies here.

Zero-shot classification.
Classify into classes it was never explicitly trained on.

Traditional classifiers work like this:

1. Train on 1,000 “cat” images.
2. Determine if a new image is a “cat.”

CLIP is different:

1. Convert a new image to a vector.
2. Convert “a photo of a cat” to a vector.
3. Convert “a photo of a dog” to a vector.
4. Find the text whose vector is closest to the image vector.

It was never trained on a labeled “cat” class.
Just define the class with text.

If any class can be expressed in text,
CLIP can classify it.

“A 1960s-style car”
“A puppy with a sad expression”
“A rainy city at night”

Classify with language, not labels.
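
The zero-shot steps above can be sketched in a few lines. This is a minimal illustration, not the CLIP library's API: the vectors are made-up stand-ins for what the image and text encoders would produce, zero_shot_classify is a hypothetical helper, and the scale of 100 applied before the softmax is an assumed value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Turn scores into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_vec, prompt_vecs, scale=100.0):
    """Pick the prompt whose vector is closest to the image vector."""
    prompts = list(prompt_vecs)
    sims = [cosine(image_vec, prompt_vecs[p]) for p in prompts]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(prompts)), key=lambda i: probs[i])
    return prompts[best], dict(zip(prompts, probs))

# Made-up text embeddings; in practice these come from the text encoder.
prompt_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.2],
    "a photo of a dog": [0.2, 0.9, 0.1],
    "a photo of a car": [0.1, 0.2, 0.9],
}
# Made-up image embedding; in practice this comes from the image encoder.
image_vec = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_vec, prompt_vecs)
```

Adding a new class means adding one more entry to prompt_vecs. Nothing is retrained.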


The Power of 400 Million Pairs

CLIP was trained on 400 million image-text pairs.

OpenAI called this dataset “WebImageText.”
Images and captions collected from the internet.

The key is diversity.

The standard ImageNet benchmark covers 1,000 classes.
Animals, vehicles, furniture, food.

WebImageText was built from 500,000 search queries.
The base list: every word appearing 100+ times in English Wikipedia.

This diversity enables generalization.

CLIP wasn’t trained for any specific task.
Yet it showed competitive performance
across 30+ datasets.

On ImageNet, zero-shot performance
matched the original ResNet-50,
without using any of ImageNet’s labeled training examples.


The Doors CLIP Opened

CLIP itself is powerful,
but the real revolution began when CLIP became a component for other models.

DALL-E 2:
Converts text to CLIP embeddings,
generates images conditioned on these embeddings.

Stable Diffusion:
CLIP’s text encoder
guides the diffusion model’s direction.
“A cat flying through space” as text
guides image generation.

SAM (Segment Anything):
Combined with CLIP
to perform text-based image segmentation.

OWL-ViT:
Performs text-based object detection.
Find “red shoes.”

CLIP became the bridge connecting images and text.
New applications blossomed on this bridge.


Limitations

CLIP isn’t perfect.

Concept frequency dependence:
It understands concepts that appear frequently in training data well.
It struggles with concepts that appear rarely.
Performance scales log-linearly with concept frequency:
a linear gain requires exponentially more examples.

Difficulty with spatial relationships:
“Above” and “below,” “left” and “right.”
It doesn’t distinguish such relations well.

Cannot count precisely:
“Three cats” versus “five cats”
are difficult to distinguish accurately.

Bias:
Training data biases are reflected in the model.
Internet data biases transfer directly.

Still, CLIP has become
the foundational technology of multimodal AI.


Summary

Contrastive learning is not classification.
It’s alignment.

Things with the same meaning close together,
things with different meanings far apart.

CLIP aligned images and text in the same space
using 400 million image-text pairs.

The result:
Learn without labels,
classify things never seen before.
Text guides image generation.
Language becomes the key to understanding vision.

Contrast creates meaning.
And this meaning becomes the foundation of Vision-Language Models.

