• simple@lemmy.mywire.xyz
    1 year ago

    I tried with WizardLM uncensored, but 8K seems to be too much for a 4090: it runs out of VRAM and dies.

    I also tried with just 4K, but that doesn’t seem to work either.

    When I run it with 2K, it doesn’t crash but the output is garbage.
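A rough way to see why longer contexts eat VRAM is to estimate the key/value cache, which grows linearly with context length. This is only a sketch: the model shape below (60 layers, hidden size 6656, fp16 cache) is an assumption matching a LLaMA-33B-class model, which the thread does not actually specify, and `kv_cache_bytes` is a made-up helper name. It ignores the weights, activations, and any framework overhead.

```python
def kv_cache_bytes(n_layers, hidden_size, context_len, bytes_per_value=2):
    """Estimate KV cache size: one key and one value tensor per layer,
    each of shape (context_len, hidden_size)."""
    return 2 * n_layers * hidden_size * context_len * bytes_per_value


# Assumed shape (not stated in the thread): 60 layers, hidden size 6656, fp16.
N_LAYERS = 60
HIDDEN_SIZE = 6656

for ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(N_LAYERS, HIDDEN_SIZE, ctx) / 2**30
    print(f"{ctx:>5} tokens: about {gib:.1f} GiB of KV cache")
```

Under those assumptions the cache alone doubles each time the context doubles, on top of whatever the quantized weights already occupy, so a 24 GB card can run out well before 8K.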

    • notfromhereOP
      1 year ago

      I hope llama.cpp supports SuperHOT at some point. I never use GPTQ but may need to make an exception to try out the larger context sizes. Are you using exllama? Curious why you’re getting garbage output.

      • simple@lemmy.mywire.xyz
        1 year ago

        Yeah, llama.cpp with SuperHOT support would be great, and yeah, I’m using exllama with the oobabooga UI. I found out why I’m getting garbage output with 2K. It seems like SuperHOT 8K models, when run with 2K context, have a massive increase in perplexity.

        (The higher the perplexity, the worse the output quality.)
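For anyone unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch, where `perplexity` is just an illustrative helper and the log-probabilities are made-up inputs rather than measurements from any SuperHOT model:

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned
    to each actual next token in the evaluation text.
    """
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative inputs only: a model that gives every token probability 0.5
# is as uncertain as a fair coin flip per token.
confident = [math.log(0.5)] * 4
unsure = [math.log(0.01)] * 4
print(perplexity(confident), perplexity(unsure))
```

A model that assigns lower probability to the right tokens ends up with a higher perplexity, which is why a big jump usually shows up as garbage text.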

        So I’ll need to figure out if I can get at least 4K running without running out of VRAM.

        Also, there is a new PR for exllama that uses a different method of getting higher context (not SuperHOT) and also has less perplexity loss. So that might be a better alternative.

        • notfromhereOP
          1 year ago

          I read the guy’s blog post on SuperHOT and it sounded like it didn’t increase perplexity and kept perplexity super low with large contexts. I could have read it wrong but I thought it wasn’t supposed to increase perplexity.

          • simple@lemmy.mywire.xyz
            1 year ago

            The increase in perplexity is very small, but there is still some with 8K context. But it seems like with 2K it’s much larger. I could be misunderstanding something myself. But my little test with 2K context does suggest there’s something going on with 2K contexts on SuperHOT models.
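Some background that may be relevant here, though nothing in this thread confirms it as the cause: as far as I understand it, SuperHOT extends context by interpolating rotary position embeddings, i.e. multiplying each position index by a scale factor (0.25 for the 8K variants) so that 8K positions land inside the 2K range the base model was trained on. The sketch below only illustrates that idea. The function names, `head_dim=128`, and `base=10000.0` are assumptions chosen for illustration, not code from exllama or SuperHOT.

```python
def scaled_positions(context_len, scale):
    """Position indices after interpolation: p -> p * scale."""
    return [p * scale for p in range(context_len)]


def rope_angles(position, head_dim=128, base=10000.0):
    """Rotary angles for one (possibly fractional) position.

    Each pair of dimensions i rotates by position * base^(-2i / head_dim).
    """
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# With scale 0.25, an 8K window maps into the original 0..2048 range.
positions_8k = scaled_positions(8192, 0.25)
print(max(positions_8k))

# The same token index gets different angles with and without scaling,
# so a model tuned for one setting sees unfamiliar positions under the other.
print(rope_angles(100 * 0.25)[0], rope_angles(100)[0])
```

If the loader's scale setting does not match what the model was fine-tuned with, the model would see position encodings it was never trained on, which is one conceivable source of degraded output.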