

So you’re saying that if I decoded these intermediate tokens, I wouldn’t get coherent sentences, but rather something completely random that is just a convenient representation of the image, or perhaps some words that relate to the image (something like “woman”, “man”, “marriage”, “blonde”, “dress”, etc.)?
pro tip: try cooking for 2 days or more at once