Right now, someone is showing GPT-4V a photo.
“What’s the highest value in this chart?”
“Organize the amounts on this receipt.”
“Describe the mood of this painting.”
The model answers.
As if it ‘sees’ the image.
But language models were designed to process text.
They predict words, generate sentences.
Images are pixels.
Text is tokens.
Completely different formats.
How can a language model ‘understand’ images?
The answer lies in architecture.
Three Components
A Vision-Language Model (VLM) divides into three parts.
1. Vision Encoder
Converts images to vectors.
Most use a Vision Transformer (ViT).
It divides the image into 16×16 pixel patches.
Each patch is treated like a ‘token.’
A 224×224 image becomes a 14×14 grid of patches: 196 patch tokens.
ViT learns relationships between these patches.
Which patch is the cat’s ear,
which patch is the background.
The output is a sequence of vectors representing each patch.
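The patch split can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a hypothetical helper named `patchify`. A real ViT would then pass each flattened patch through a learned linear embedding and add position information, which is omitted here.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "sides must divide evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch * patch * C values.
    return grid.reshape(rows * cols, patch_size * patch_size * c)

image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder pixels
patches = patchify(image)
print(patches.shape)  # (196, 768)
```

Each of the 196 rows holds the 16 × 16 × 3 = 768 raw values of one patch.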
2. Projection Layer
The bridge between vision encoder and language model.
There’s a problem.
The vision encoder’s output dimension and
the language model’s input dimension differ.
ViT might output 1408-dimensional vectors.
The LLM might expect 4096-dimensional inputs.
The projection layer bridges this gap.
Often a 2-layer MLP (Multi-Layer Perceptron), sometimes just a single linear layer.
It transforms image vectors into a form the language model can understand.
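A rough sketch of a 2-layer MLP projector, using the example sizes above (1408 in, 4096 out). The weights are random stand-ins for trained ones. The hidden width and the GELU activation are assumptions for illustration, not details taken from any specific model.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
VISION_DIM, LLM_DIM = 1408, 4096  # example sizes from the text

# Randomly initialised weights stand in for trained ones.
w1 = 0.02 * rng.standard_normal((VISION_DIM, LLM_DIM), dtype=np.float32)
b1 = np.zeros(LLM_DIM, dtype=np.float32)
w2 = 0.02 * rng.standard_normal((LLM_DIM, LLM_DIM), dtype=np.float32)
b2 = np.zeros(LLM_DIM, dtype=np.float32)

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU (assumed activation)
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def project(patch_vectors: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) vectors to (num_patches, LLM_DIM)."""
    hidden = gelu(patch_vectors @ w1 + b1)
    return hidden @ w2 + b2

vision_out = rng.standard_normal((196, VISION_DIM), dtype=np.float32)  # fake ViT output
image_tokens = project(vision_out)
print(image_tokens.shape)  # (196, 4096)
```

The number of tokens stays at 196. Only the width of each vector changes.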
3. Language Model (LLM)
The core engine.
It receives projected image tokens and text tokens together.
196 image tokens followed by “What is the cat doing in this photo?”
From the language model’s perspective, image tokens are
just another type of token.
Processed alongside text tokens.
The output is text.
“The cat is napping in front of the window.”
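How the two token types end up in one sequence can be sketched as below. The whitespace tokenizer and the random embedding table are hypothetical toys. A real LLM uses a subword tokenizer and trained embeddings, so the text token count would differ.

```python
import numpy as np

LLM_DIM = 4096
rng = np.random.default_rng(0)

# Hypothetical toy tokenizer: split on whitespace, assign ids on first sight.
vocab: dict[str, int] = {}

def tokenize(text: str) -> list[int]:
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

prompt_ids = tokenize("What is the cat doing in this photo?")

# Random stand-ins for the LLM's embedding table and the projector output.
embedding_table = rng.standard_normal((len(vocab), LLM_DIM), dtype=np.float32)
image_tokens = rng.standard_normal((196, LLM_DIM), dtype=np.float32)

text_tokens = embedding_table[prompt_ids]  # (8, 4096)

# Image tokens first, then the question: one sequence for the LLM.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (204, 4096)
```

Once concatenated, nothing in the array marks which rows came from pixels and which from words. They share the same width, so the model can attend over all of them.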
CLIP Returns
What should we use as the vision encoder?
Many VLMs use CLIP’s vision encoder.
There’s a reason.
CLIP has already aligned images and text.
With 400 million image-text pairs.
CLIP’s vision encoder creates
‘semantically rich’ image representations.
Not just “there’s an edge here”
but closer to “this is a cat.”
LLaVA uses CLIP’s ViT directly as its vision encoder.
MiniGPT-4 and InternVL use CLIP-style encoders (EVA-CLIP, InternViT)
trained with the same contrastive idea.
Leveraging pre-aligned representations
is more efficient than training from scratch.
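What ‘aligned’ means can be made concrete with a small sketch of a CLIP-style contrastive loss. This is a simplified illustration with NumPy, not CLIP’s actual training code. The function name is hypothetical, and the temperature is fixed here, whereas CLIP learns it.

```python
import numpy as np

def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) scaled cosine similarities

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))  # correct pair sits on the diagonal

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
aligned = clip_style_loss(emb, emb)                       # each image matches its caption
shuffled = clip_style_loss(emb, np.roll(emb, 1, axis=0))  # captions paired with wrong images
print(aligned < shuffled)  # True
```

Training pushes the loss down, which pulls each image embedding toward its own caption and away from the others.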
How Training Works
VLM training usually divides into two stages.
Stage 1: Alignment Pre-training
Freeze the vision encoder and LLM.
Train only the projection layer.
Goal:
Map image representations to a space the LLM can understand.
Train on image-caption pairs, from hundreds of thousands to millions.
“Describe this image” → Generate caption.
Stage 2: Instruction Fine-tuning
Now train the projection layer and the LLM together.
(The vision encoder usually stays frozen.)
Train on diverse question-answer data.
“What was the 2020 revenue in this chart?” → “15 million dollars.”
“What style is this painting?” → “Impressionism.”
“What’s unusual in this photo?” → “The shadow directions don’t match.”
This stage gives the model ‘instruction-following ability.’
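The two stages come down to which components receive gradient updates. A minimal sketch of that schedule, with hypothetical component names and no real training loop:

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    trainable: bool = False  # frozen unless a stage unfreezes it

def configure_stage(stage: int) -> list[Component]:
    """Return the three components with trainable flags set for the given stage."""
    vision = Component("vision_encoder")
    projector = Component("projection_layer")
    llm = Component("language_model")
    if stage == 1:
        # Alignment pre-training: only the bridge learns.
        projector.trainable = True
    elif stage == 2:
        # Instruction fine-tuning: bridge and LLM learn, vision stays frozen.
        projector.trainable = True
        llm.trainable = True
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [vision, projector, llm]

for stage in (1, 2):
    trainable = [c.name for c in configure_stage(stage) if c.trainable]
    print(f"stage {stage}: train {trainable}")
```

In a framework like PyTorch, the same idea is usually expressed by turning gradients off for the frozen parameters.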
Why It Works
The key insight:
Convert images to tokens, and language models can process them.
Language models are sequence-processing machines.
Whether the input is text tokens or image tokens,
if it’s a sequence, it can be processed.
The vision encoder translates images into ‘a form language models can read.’
The projection layer matches dimensions.
The language model reasons and generates answers.
This is the essence of VLMs.
Converting images into language model input format.
Limitations: The Unseen
VLMs are powerful but not perfect.
Hallucination
They describe things that aren’t in the image.
Two cats in the photo,
but it answers “I see three cats.”
The probabilistic nature of language models is one cause.
Following patterns frequent in training data,
they generate content that doesn’t match the actual image.
Difficulty with Spatial Reasoning
“What’s on the left?”
“Is A above or below B?”
They often get these wrong.
CLIP-style vision encoders capture semantic information well
but tend to miss precise spatial relationships.
Fine-grained Information Loss
In the 224×224 example, the whole image compresses into 196 tokens.
High-resolution image details can disappear.
Small text, complex diagrams, subtle differences.
These get lost in compression.
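The arithmetic behind this loss is easy to check. A small sketch, assuming the image is resized to a fixed square before patching. The helper names and the 1920×1080 photo are illustrative only.

```python
def patch_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches, which is the image token count."""
    return (height // patch_size) * (width // patch_size)

def source_pixels_per_token(orig_h: int, orig_w: int, side: int = 224, patch_size: int = 16) -> float:
    """Original pixels each token must summarise after resizing to side x side."""
    return (orig_h * orig_w) / patch_tokens(side, side, patch_size)

print(patch_tokens(224, 224))                      # 196
print(patch_tokens(448, 448))                      # 784
print(round(source_pixels_per_token(1080, 1920)))  # 10580
```

A 1920×1080 photo squeezed into 196 tokens leaves each token standing in for roughly ten thousand original pixels. Doubling the input side quadruples the token count, which is the trade-off between detail and sequence length.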
Cannot Count Accurately
“How many people are in this photo?”
VLMs frequently fail at this.
They approximate,
but exact numbers are difficult.
Summary
Vision-Language Models are a combination of three components.
The vision encoder converts images to vector sequences.
The projection layer matches dimensions.
The language model processes these tokens alongside text.
The key is ‘conversion.’
Transform images into tokens the language model can understand,
and language model reasoning can be applied to images.
But information is lost in this conversion process.
Spatial relationships, fine details, precise quantities.
VLM hallucinations often stem from this loss.
There’s still a gap between ‘reading’ and ‘understanding’ images.
The next question reverses direction.
How does text guide image generation?
The structure behind that is cross-attention.

