Tag: Vision Transformer
-

Contrast Creates Meaning
Labels aren’t necessary. ImageNet needed 25,000 workers to label 14 million images. But the internet already has the answers—400 million image-text pairs. CLIP learned without labels and classifies things it’s never seen. How contrastive learning aligned images and text into one space.
