• 𝕯𝖎𝖕𝖘𝖍𝖎𝖙@lemmy.world
    10 months ago

    There would still need to be a corpus of text and some supervised training of a model on that text in order to “recognize” with some level of confidence what the text represents, right?

    I understand that image generation works differently. As far as I gather, it starts with noise and a random seed, and the learnt network then follows pathways (“automagic” goes here) guided by what NLP has recognized in the text, ending up with something like “elephant (subject) 100% confidence, big room (background) 75% confidence, windows (background) 75% confidence”. I assume it then “merges” whatever it thinks makes up those tokens with the noise and (more “automagic” goes here) puts them where they need to go.

    • Turun@feddit.de
      10 months ago

      There would still need to be a corpus of text and some supervised training of a model on that text in order to “recognize” with some level of confidence what the text represents, right?

      Correct. The CLIP encoder is trained on images paired with their descriptions, so it learns the names for things that appear in images.
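To make the "confidence" idea a bit more concrete, here is a rough sketch of how an image can be scored against several captions. It assumes made-up 3-dimensional embeddings in place of real encoder outputs (real CLIP embeddings are much larger and come from trained networks), and the names `rank_captions` and `temperature` are mine, not part of any library:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings, invented for illustration only.
image_embedding = [0.9, 0.1, 0.2]  # pretend this is a photo of an elephant
caption_embeddings = {
    "an elephant": [1.0, 0.0, 0.1],
    "an empty room": [0.0, 1.0, 0.3],
    "a window": [0.1, 0.4, 1.0],
}

def rank_captions(image, captions, temperature=0.1):
    """Score one image against every caption (temperature is an assumed constant)."""
    names = list(captions)
    sims = [cosine(image, captions[n]) / temperature for n in names]
    return dict(zip(names, softmax(sims)))

probs = rank_captions(image_embedding, caption_embeddings)
best = max(probs, key=probs.get)
print(best, probs)
```

Training pushes matching image/caption pairs toward high similarity and mismatched pairs toward low similarity, which is how the encoder ends up knowing the names of things.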

      And now it is obvious why this prompt fails: there are no images of empty rooms tagged as “no elephants”, so the word “elephants” in the prompt still pulls the image toward elephants. This can be fixed by putting “elephants” in a negative prompt instead, which steers the image away from that concept in one of the automagical steps.
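One of those automagical steps can be sketched roughly like this. In common Stable Diffusion implementations the negative prompt takes the place of the empty prompt in classifier-free guidance, so each denoising step moves away from what the negative prompt predicts and toward what the normal prompt predicts. The numbers and the function name `guided_noise` below are hypothetical; a real model predicts a whole noise tensor per step, not three values:

```python
def guided_noise(cond, negative, guidance_scale=7.5):
    """Classifier-free-guidance style mix of two noise predictions.

    Starts at the negative-prompt prediction and moves past the
    prompt prediction by a factor of guidance_scale.
    """
    return [n + guidance_scale * (c - n) for c, n in zip(cond, negative)]

# Hypothetical noise predictions for a single denoising step.
pred_prompt = [0.2, 0.5, 0.1]    # model conditioned on "empty room"
pred_negative = [0.2, 0.1, 0.4]  # model conditioned on "elephants"

step = guided_noise(pred_prompt, pred_negative, guidance_scale=2.0)
print(step)
```

Where the two predictions agree, nothing changes. Where they differ, the result is pushed further from the "elephants" prediction, which is the "subtracting the concept" part.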