Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 2 years ago

Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 2 years ago

Absolutely.

Only aphrodite (and other enterprise backends like vllm/sglang) can make use of NVLink, but even exllama or mlc-llm split across GPUs nicely over PCIe, no NVLink needed.

2x 3090s or P40s is indeed a popular config among local runners, and is the perfect size for a 70B model. Some try to squeeze Mistral-Large in, but IMO its too tight a fit.

sntx@lemm.ee · 2 years ago

Is there an inherent benefit for using NVLINK? Should I specifically try out Aprodite over the other recommendations when having 2x 3090 with NVLINK available?

brucethemoose@lemmy.world · 2 years ago

So there are multiple ways to split models across GPUs, (layer splitting, which uses one GPU then another, expert parallelism, which puts different experts on different GPUs), but the way you’re interested in is “tensor parallelism”

This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.

It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

But if you simply want to stuff the best/highest quality model you can at VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.