i just... i thought they would've used something curated.
something structured. like ImageNet. OpenImages.
literally anything with human labels. but no.
they used LAION-5B.
LAION-5B.
do you know what that is?
it’s five billion images.
scraped from the internet.
Pinterest. WordPress. DeviantArt. WeHeartIt.
a blogspot post from 2007 titled “my silly dog :)”
everything.
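and the "caption" for each image? whatever was sitting in the alt attribute of the <img> tag. here's a toy sketch with python's stdlib parser, not their pipeline (the real thing chewed through Common Crawl metadata at scale), but the idea is exactly this:

```python
# toy version of the candidate-pair extraction: pair every <img> src with its
# alt text. the "caption" is just whatever the page author typed into alt="..."
# (or, very often, the filename the camera gave the photo).
from html.parser import HTMLParser

class ImgAltPairs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []  # (image url, alt-text "caption") candidates

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src, alt = a.get("src"), (a.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))

p = ImgAltPairs()
p.feed('<h1>my silly dog :)</h1><img src="dog_final_FINAL2.jpg" alt="DSC_0042">')
print(p.pairs)  # [('dog_final_FINAL2.jpg', 'DSC_0042')]  yes. that's the caption.
```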
and then—
this is the part that keeps me up at night—
they didn’t clean it. they didn’t filter it by hand.
they didn’t even look at it.
they just ran CLIP on it.
CLIP.
the model that says how well an image matches a sentence.
they took every image-caption pair and said:
“does this picture of a blurry jello mold look like ‘a medieval tapestry of a lion’?”
and CLIP was like: “yeah, 0.34 cosine similarity.”
and they went:
“perfect. keep it.”
that was the bar.
that was the filter.
0.28 cosine for the English subset. 0.26 for everything else. the older 400M release used 0.3, back when they were feeling fancy.
billions of images, captioned by alt-text, file names, and HTML trash,
and they said: “this is our high-quality subset.”
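and if you want to see how low that bar actually is, here is the whole filter as a minimal sketch, not their code: CLIP ViT-B/32 via huggingface transformers (which is, as far as i can tell, what the LAION-5B paper says it used for filtering), one cosine similarity, one threshold.

```python
# sketch of a LAION-style filter: score an image-caption pair with CLIP and
# keep it if the cosine similarity clears the bar. 0.28 is the reported
# threshold for the English subset.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

THRESHOLD = 0.28

def keep_pair(image: Image.Image, caption: str) -> bool:
    """the entire quality bar: does CLIP think the alt-text fits the picture?"""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (img @ txt.T).item()
    return cosine >= THRESHOLD  # "perfect. keep it."

# keep_pair(Image.open("blurry_jello_mold.jpg"), "a medieval tapestry of a lion")
# returns whatever CLIP feels like. nobody is looking at the image.
```

that's it. no human in the loop. one dot product per image.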
and now?
now every major diffusion model is trained on this.
Stable Diffusion?
Midjourney? (probably. they won't say.)
every DreamBooth fine-tune people run on top of it?
probably your weird anime upscaler too?
they all learned to dream from a soup of miscaptioned memes, Pinterest reposts,
and someone’s aunt’s photo album that got indexed by mistake.
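and you don't have to take my word for it. the metadata is public. a quick peek at one shard, assuming pandas and the column names i remember from the laion2B-en release (URL, TEXT, similarity); the filename below is made up, so check the schema of whatever shard you actually download:

```python
# peek at one released metadata shard: no pixels, just URLs, alt-text
# "captions", and the CLIP score that let each row in. the filename is
# hypothetical and the column names are recalled, not guaranteed.
import pandas as pd

df = pd.read_parquet("laion2B-en-part-00000.parquet")
print(df[["URL", "TEXT", "similarity"]].head())

# how much of the "high-quality subset" only just scraped past the 0.28 bar?
print((df["similarity"] < 0.30).mean())
```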
you prompt “elon musk riding a horse” and it gives you:
- a centaur in a spacesuit
- 3 fingers
- a jpeg artifact shaped like an NFT
and you wonder why.
it’s because of LAION-5B.
because someone looked at 5 billion unlabeled web images and thought:
“what if we just… ran CLIP on it?”
and everyone else said:
“hell yeah.”