Tag: CLIP

  • Text Guides Image

    Noise has no direction. Without text, it stays noise. A prompt such as “A cat flying through space” guides the generation. The image asks: what should I become? Text answers through cross-attention, where the image features supply the Query and the text embeddings supply the Key and Value. How Stable Diffusion uses Query, Key, Value to turn prompts into pixels.

  • Contrast Creates Meaning

    Hand-made labels aren’t necessary. ImageNet needed 25,000 workers to label 14 million images. But the internet already has the answers: 400 million image-text pairs. CLIP learned from those pairs instead of manual labels, and it classifies categories it was never explicitly trained on. How contrastive learning aligned images and text into one space.

  • Into a Shared Space

    In 2012, CNNs conquered images. In 2017, Transformers conquered text. But each lived in a separate world, producing vectors that couldn’t be compared. What if a cat photo and the word “cat” existed at the same location? A shared embedding space makes this possible. How CLIP and ImageBind unified different senses into one language.
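
The idea behind the last two posts, pulling matching image-text pairs together in one space and pushing mismatched pairs apart, can be sketched in a few lines. This is a minimal NumPy illustration of a CLIP-style symmetric contrastive loss, not CLIP's actual training code. The function names (`l2_normalize`, `clip_style_loss`), the toy embeddings, and the temperature value of 0.07 are assumptions chosen for the example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    Row i of image_emb is assumed to match row i of text_emb;
    every other row in the batch acts as a negative example.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])        # the correct match sits on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: 4 "images" and 4 "texts" in an 8-dimensional shared space.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
aligned_text = image_emb.copy()               # each text sits at its image's location
shuffled_text = np.roll(image_emb, 1, axis=0) # each text sits at the wrong image

aligned_loss = clip_style_loss(image_emb, aligned_text)
shuffled_loss = clip_style_loss(image_emb, shuffled_text)
print(aligned_loss, shuffled_loss)
```

When each text vector sits at the same location as its image, the loss is small. When the pairs are shuffled, it grows. Training an encoder pair to minimize this quantity is what moves a cat photo and the word “cat” toward the same point.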