With Llama kicking things off, development has been ridiculously fast in the self-hosted text model space. The hardware requirements are coming down, but they're still fairly steep: you either settle for painfully slow CPU generation, or, with 24 GB+ of VRAM, you really open up the GPU options.
7B models can sorta run in 12 GB, but they're not great. You really want at least 13B, which needs 24 GB of VRAM… or you run it on the CPU. Some of these models are getting close to ChatGPT quality, so this is definitely not a space to sleep on, and I feel the fediverse would appreciate the idea of self-hosting its own chat bots. Some of these models have huge context windows, so they actually remember what you're talking about remarkably well.
A good starting point is this rentry: https://rentry.org/local_LLM_guide
I'm admittedly not great with these yet (and my GPU only has 12 GB), but I'm fascinated and hope there can be some good discussions around this tech.
"B" refers to the number of parameters in the model, in billions, but people have started referring to them as "bits".
The bigger the number, the smarter the model tends to be, but the size and the RAM/VRAM requirements rise accordingly.
This is not entirely correct. B does stand for "billion" parameters, but bits are a different thing: the bit count is the quantization precision of each weight. You can have, for example, a 4-bit 13B model or an 8-bit 3B model. They don't correlate at all.
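Since the two numbers are independent, memory use is roughly their product. A minimal sketch of the rule of thumb (weights ≈ parameters × bits / 8, plus some runtime overhead for activations and the KV cache — the 20% overhead factor here is my assumption, not a measured figure):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weight bytes plus an assumed overhead fraction."""
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal gigabytes

# The two examples from the comment above:
print(round(estimate_vram_gb(13, 4), 1))   # 4-bit 13B → ~7.8 GB
print(round(estimate_vram_gb(3, 8), 1))    # 8-bit 3B  → ~3.6 GB
print(round(estimate_vram_gb(13, 16), 1))  # fp16 13B  → ~31.2 GB
```

This also shows why unquantized 13B is painful on a 24 GB card while a 4-bit quant of the same model fits comfortably.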