I see a lot of talk of Ollama here, which I personally don’t like because:

  • The quantizations they use tend to be suboptimal

  • It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

  • It abstracts away things that you should really know for hosting LLMs.

  • I don’t like some things about the devs. I won’t rant, but I especially don’t like the hint they’re cooking up something commercial.

So, here’s a quick guide to get away from Ollama.

  • First step is to pick your OS. Windows is fine, but if you are setting up something new, Linux is best. I favor CachyOS in particular, for its great Python performance. If you use Windows, be sure to enable hardware-accelerated GPU scheduling and disable shared memory (the driver’s spill-over into system RAM when VRAM fills up).

  • Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you.

  • Install Python 3.11.x, 3.12.x, or at least whatever your distro supports, and git. If on Linux, also install your distro’s “build tools” package.
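A quick way to confirm the prerequisites from the list above is a small check script. This is only a sketch: `check_prerequisites` is a made-up helper name, and it assumes the GPU stack exposes either `nvidia-smi` (Nvidia) or `rocm-smi` (AMD) on your PATH.

```python
import shutil
import sys

# Tools named in the steps above. nvidia-smi and rocm-smi come with the
# NVIDIA and AMD stacks respectively, so only one of them is expected.
REQUIRED_TOOLS = ["git"]
GPU_TOOLS = ["nvidia-smi", "rocm-smi"]


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    version = version or sys.version_info[:2]
    problems = []
    if version < (3, 11):
        problems.append(
            f"Python {version[0]}.{version[1]} is older than the 3.11 suggested above"
        )
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not any(which(tool) for tool in GPU_TOOLS):
        problems.append("neither nvidia-smi nor rocm-smi found on PATH")
    return problems


if __name__ == "__main__":
    for line in check_prerequisites() or ["prerequisites look fine"]:
        print(line)
```

An older Python is not automatically a blocker (the post allows "whatever your distro supports"), so treat the first message as a warning rather than an error.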

Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations. Forgive the Reddit link, but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

As far as I am concerned, 3 matter to “home” hosters on consumer GPUs:

  • Exllama (and by extension TabbyAPI), a very fast, very memory-efficient “GPU only” runtime that supports AMD via ROCm and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI

  • Aphrodite Engine. While not strictly as VRAM efficient, it’s much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun and more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine

  • This fork of kobold.cpp, which supports more fine-grained KV cache quantization (we will get to that). It supports CPU offloading and, I think, Apple Metal: https://github.com/Nexesenex/croco.cpp

Now, there are also reasons I don’t like llama.cpp, but one of the big ones is that sometimes its model implementations have… quality degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough vram to avoid offloading to CPU, and can figure out how to set it up. So:

This can go wrong; if anyone gets stuck, I can help with that.

  • Next, figure out how much VRAM you have.

  • Figure out how much “context” you want, aka how much text the LLM can ingest. If a model has a context length of, say, “8K”, that means it can support 8K tokens as input, or less than 8K words. Not all tokenizers are the same: some, like Qwen 2.5’s, can fit nearly a word per token, while others are more in the ballpark of half a word per token or less.

  • Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER

  • Exllama has a feature called “kv cache quantization” that can dramatically shrink the VRAM the “context” of an LLM takes up. Unlike llama.cpp’s, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It’s essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.

  • With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for “exl2” quantizations: https://huggingface.co/models?sort=modified&search=exl2

  • There are all sorts of finetunes… and a lot of straight-up garbage. But I will post some general recommendations based on total vram:

  • 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.

  • 6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2

  • 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this VRAM pool (albeit with very short context/tight quantization for the 8GB cards). I would recommend trying Arcee’s new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2

  • 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 32B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.

  • 20GB-24GB: Command-R 2024 35B is excellent for “in context” work, like asking questions about long documents, continuing long stories, anything involving working “with” the text you feed to an LLM rather than pulling from its internal knowledge pool. It’s also quite good at longer contexts, out to 64K-80K more-or-less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 32B, which still has a very respectable 32K native context, and a rather mediocre 64K “extended” context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

  • 32GB: same as 24GB, just with a higher bpw quantization. But this is also the threshold where lower bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.

  • 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)

Again, browse huggingface and pick an exl2 quantization that will cleanly fill your vram pool + the amount of context you want to specify in TabbyAPI. Many quantizers such as bartowski will list how much space they take up, but you can also just look at the available filesize.
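To make "fill your VRAM pool plus context" concrete, here is a back-of-the-envelope budget. It is a rough sketch, not what TabbyAPI computes: the layer count, KV head count and head dimension are assumed values for a Qwen 2.5 32B class model (read the real ones from the model's `config.json`), the parameter count is approximate, the 1.5 GiB overhead is a guess, and a real Q4 cache carries some extra scale data on top of 4 bits per element.

```python
GIB = 1024 ** 3


def weights_bytes(params_billion, bits_per_weight):
    """Approximate size of the quantized weights."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bits_per_element):
    """Keys and values: 2 tensors per layer, one vector per KV head per token."""
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8


def fits(vram_gib, weights, cache, overhead_gib=1.5):
    """overhead_gib is a guessed allowance for activations and the desktop."""
    return (weights + cache) / GIB + overhead_gib <= vram_gib


# Assumed architecture numbers (64 layers, 8 KV heads, head dimension 128)
# and an approximate parameter count for a 32B model at 4.25 bpw.
w = weights_bytes(32.5, 4.25)
cache_q4 = kv_cache_bytes(64, 8, 128, 32768, bits_per_element=4)
cache_fp16 = kv_cache_bytes(64, 8, 128, 32768, bits_per_element=16)

print(f"weights    ~{w / GIB:.1f} GiB")
print(f"Q4 cache   ~{cache_q4 / GIB:.1f} GiB")
print(f"FP16 cache ~{cache_fp16 / GIB:.1f} GiB")
print("fits in 24 GiB with Q4 cache:  ", fits(24, w, cache_q4))
print("fits in 24 GiB with FP16 cache:", fits(24, w, cache_fp16))
```

With these assumed numbers the same 32K context costs about a quarter of the VRAM at Q4 compared with FP16, which is the difference between fitting on a 24GB card and not.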

  • Now… you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader

  • Put it in your TabbyAPI models folder, and follow the documentation on the wiki.

  • There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of VRAM, less will save a little VRAM), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the tabbyAPI server, but more VRAM usage).

  • Now… pick your frontend. The tabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui

  • And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using “old” sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and rep penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without “dumbing down” the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out “wrong language” answers from the response.

  • Now, once this is all set up and running, I’d recommend throttling your GPU, as it simply doesn’t need its full core speed to maximize its inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.
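The TabbyAPI option rules of thumb from the list above can be written down as a tiny helper. This is a sketch only: `suggest_settings` is a hypothetical function, the option names are the ones mentioned in the post, and the 16384-token cutoff for "long context", the 4096 chunk size and the value formats are my assumptions. Check the sample config in the wiki before copying anything.

```python
def suggest_settings(context_tokens, identical_gpus=1, concurrent_users=1,
                     spare_vram=False):
    """Turn the rules of thumb above into a dict of option values.

    The long-context cutoff and the Q6-versus-Q8 choice are illustrative
    guesses, not TabbyAPI defaults.
    """
    long_context = context_tokens > 16384
    return {
        # Your context length.
        "max_seq_len": context_tokens,
        # Q4 for long context, Q6/Q8 for short context if there is room.
        "cache_mode": "Q4" if long_context else ("Q8" if spare_vram else "Q6"),
        # Above 2048: faster long-prompt processing, but more VRAM.
        "chunk_size": 4096 if spare_vram else 2048,
        # Only worthwhile with two identical GPUs.
        "tensor_parallel": identical_gpus >= 2,
        # One slot per simultaneous user; more slots cost more VRAM.
        "max_batch_size": max(1, concurrent_users),
    }


for name, value in suggest_settings(65536, identical_gpus=2).items():
    print(f"{name}: {value}")
```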

Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.

  • AliasAKA@lemmy.world
    3 months ago

    Bookmarked and will come back to this. One thing that may be of interest to add is for AMD cards with 20GB of VRAM. I’d suppose that it would be Qwen 2.5 32B with maybe a less strict quant or something.

    Also, it may be interesting to look at the AllenAI molmo related models. I’m kind of planning to do this myself but haven’t had time as yet.

    • brucethemoose@lemmy.worldOP
      3 months ago

      Yep. 20GB is basically 24GB, though it’s too tight for 70B models.

      One quirk for 7900 owners is that installing flash attention for long context usage can be a pain. Apparently it is doable now, I need to dig up the link, but it might just be easier to use kobold.cpp rocm with its native flash attention.

      As for vision models, that is a whole different can of worms. Exllama does not support this, so you’d need a framework that does.

      If you are looking for niche models, check out MiniG (which is a continued pretrain of the already very excellent GLM4-9B): https://huggingface.co/bartowski/miniG-GGUF

      Llama.cpp support is recent, though I’m not 100% sure it’s completely fixed. It should work in Aphrodite as well.

  • kitnaht@lemmy.world
    3 months ago

    If your “FIRST STEP” is to choose an OS: Fuck that.

    You should never have to change your OS just to use this crap. It’s all written in Python. It should work on every OS available. Your first step is installing the prerequisites.

    If you’re using something like Continue for local coding tasks, CodeQwen is awesome, and you’ll generally want a context window of 120k or so because for coding, you want all the code context - or else the LLM starts spitting out repetitious stuff, or can’t ingest all of your context so it’ll rewrite stuff that’s already there.

      • brucethemoose@lemmy.worldOP
        3 months ago

        I would not recommend that for performance reasons, AFAIK.

        Windows is fine, I should make that more clear.

        • gravitas_deficiency@sh.itjust.works
          3 months ago

          Huh, really? Is there that much of a perf hit using passthrough? I’d have assumed that the bottleneck isn’t actually the PCIE, so much as it is the beefiness of the GPU crunching the model.

          • brucethemoose@lemmy.worldOP
            3 months ago

            I have not tested WSL or VMs in Windows in a while, but my impression is that “it depends” and you should use the native Windows version unless you are having some major installation issues.

      • kitnaht@lemmy.world
        3 months ago

        Why would you even bother trying to run this all through a VM when you can just run it directly? If you’re to the point of using VMs, you don’t need this tutorial anyways.

        Are you seriously telling me you’re jumping through all the hoops to spin up a VM on Linux, and then doing all the configuration for GPU passthrough, because you can’t just figure out how to run it locally?

        • gravitas_deficiency@sh.itjust.works
          3 months ago

          Bro this is a community for sharing knowledge and increasing the technical aptitude of fellow users by doing said sharing. Maybe instead of shitting on a pretty solid digest of the fundamentals of setting up something like this, try adding to the body of knowledge instead.

    • brucethemoose@lemmy.worldOP
      3 months ago

      CodeQwen 1.5 is pretty old at this point, afaik made obsolete by their latest release.

      The Qwen models (at least 2.5) are really only good to like 32K, which is still a ton of context. But I’ve been testing Qwen 32B at 64K-90K and even that larger model is… not great.

      32K is generally enough to get the gist of whatever you’re trying to fill in.
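To sanity-check whether a set of files is likely to fit in a 32K window, a word-count estimate is usually enough. This is a rough sketch with hypothetical helper names; the words-per-token ratio is an assumption (the post puts tokenizers somewhere between about half a word and nearly a word per token), and source code usually tokenizes worse than prose, so a real tokenizer is the only exact answer.

```python
def estimate_tokens(text, words_per_token=0.5):
    """Rough token count from a whitespace word count.

    words_per_token=0.5 is a deliberately pessimistic assumption.
    """
    words = len(text.split())
    return round(words / words_per_token)


def fits_context(texts, context_tokens=32768, reserve_for_reply=2048,
                 words_per_token=0.5):
    """True if all texts plus a reply budget fit in the context window."""
    total = sum(estimate_tokens(t, words_per_token) for t in texts)
    return total + reserve_for_reply <= context_tokens


print(estimate_tokens("def add(a, b): return a + b"))
print(fits_context(["word " * 10000]))
print(fits_context(["word " * 20000]))
```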

    • L_Acacia
      3 months ago

      llama.cpp works on Windows too (or any OS for that matter), though Linux will give you better performance.

  • Possibly linux@lemmy.zip
    3 months ago

    Or we could all just use ollama. It is way simpler and works fine without a GPU even. I don’t really understand the problem with it.

    • brucethemoose@lemmy.worldOP
      3 months ago

      It’s less optimal.

      On a 3090, I simply can’t run Command-R or Qwen 2.5 32B well at 64K-80K context with ollama. It’s slow even at lower context, and the lack of DRY sampling and some other things majorly hit quality.

      Ollama is meant to be turnkey, and that’s fine, but LLMs are extremely resource intensive. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

      Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

      And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.

      • Possibly linux@lemmy.zip
        3 months ago

        I’m not going to lie, I don’t really see evidence supporting your claims. What evidence do you have?

        Ollama is llama.cpp with a web wrapper and some configs to make sure it works.

        • brucethemoose@lemmy.worldOP
          3 months ago

          To go into more detail:

          • Exllama is faster than llama.cpp with all other things being equal.

          • exllama’s quantized KV cache implementation is also far superior, and nearly lossless at Q4 while llama.cpp is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality)

          • With ollama specifically, you get locked out of a lot of knobs like this enhanced llama.cpp KV cache quantization, more advanced quantization (like iMatrix IQ quantizations or the ARM/AVX optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference and such.
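To show why bit width and implementation details matter for a quantized cache, here is a generic group-wise absmax quantizer applied to random data. It is explicitly not exllama's or llama.cpp's scheme (both are more involved); the group size of 32 and the Gaussian "fake keys" are assumptions chosen only for illustration.

```python
import random


def quantize_group(values, bits=4):
    """Symmetric absmax quantization of one group to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0   # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def roundtrip_error(vector, bits, group_size=32):
    """Mean absolute error after quantizing and dequantizing in groups."""
    total = 0.0
    for start in range(0, len(vector), group_size):
        group = vector[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored = dequantize_group(codes, scale)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(vector)


random.seed(0)
fake_keys = [random.gauss(0.0, 1.0) for _ in range(4096)]
for bits in (8, 4):
    print(f"{bits}-bit mean abs error: {roundtrip_error(fake_keys, bits):.4f}")
```

A naive scheme like this loses noticeably more precision at 4 bits than at 8, which is why how a backend handles outliers and scales decides whether its Q4 cache is usable.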

          It’s not evidence or options… it’s missing features, that’s my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than tabbyAPI/EXUI on the same hardware, and there’s no way around it.

          Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how llama 3.1 (for instance) was bugged past 8K at launch because it doesn’t properly support its rope scaling. Ollama inherits all these quirks.

          I don’t want to go into the issues I have with the ollama devs’ behavior though, as that’s way more subjective.

  • Konraddo@lemmy.world
    3 months ago

    I know this is not the theme of this post, but I wonder if there’s an LLM that doesn’t hallucinate when asked to summarize information of a group of documents. I tried Gpt4all for simple queries like finding out which documents mentioned a certain phrase. It often gave me filenames that didn’t actually exist. Hallucinating contents is one thing but making up data source is just horrible.

    • brucethemoose@lemmy.worldOP
      3 months ago

      That’s absolutely on topic, check out https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard

      Command R is built for this if you have the VRAM to swing it, otherwise GLM4 (or MiniG as linked below) is great. The latter, unfortunately, doesn’t work with TabbyAPI, so you have to use something like Kobold.cpp.

      You also have to use very low (basically zero) temperature and be careful with other sampling settings, and watch your context length.

      There are more sophisticated RAG setups some of these UIs (like open Web UI) integrate, and sometimes you’ll need to host an embeddings model alongside the llm for that to work.
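For the specific problem raised above (made-up filenames), a cheap guard is to do the phrase lookup deterministically and to check any filenames the model cites against the real folder. This sketch uses hypothetical helper names and assumes plain `.txt` documents in a single folder.

```python
from pathlib import Path


def documents_containing(folder, phrase):
    """Exact, case-insensitive phrase search over .txt files; no LLM involved."""
    phrase = phrase.lower()
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if phrase in text.lower():
            hits.append(path.name)
    return hits


def split_real_and_invented(folder, cited_names):
    """Separate filenames an LLM cited into ones that exist and ones that do not."""
    existing = {p.name for p in Path(folder).iterdir() if p.is_file()}
    real = [n for n in cited_names if n in existing]
    invented = [n for n in cited_names if n not in existing]
    return real, invented
```

Feeding only the files returned by `documents_containing` into the prompt, and rejecting answers where `invented` is non-empty, keeps the model from being the source of truth about which documents exist.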

  • vividspecter@lemm.ee
    3 months ago

    Do you have any recommendations for a Perplexity.ai type setup? It’s one of the few recent innovations I’ve found useful. I’ve heard of Perplexica and a few others, but not sure what is the best approach.

    • LiveLM@lemmy.zip
      3 months ago

      What does Perplexity do different than other AI solutions?
      Heard about it but haven’t tried yet

      • Caboose12000@lemmy.world
        3 months ago

        I haven’t heard about it before today, but I tried asking it what separates it from other LLMs and apparently the answer is just that it does a Google search and shows you the source it’s summarizing. If true, that is not very compelling, and if it’s a hallucination or missing details, then it’s at least not very compelling as a search replacement.

  • sntx@lemm.ee
    3 months ago

    Thanks for the writeup! So far I’ve been using ollama, but I’m always open for trying out alternatives. To be honest, it seems I was oblivious to the existence of alternatives.

    Your post is suggesting that the same models with the same parameters generate different results when run on different backends?

    I can see how the backend would have an influence on handling concurrent API calls, RAM/VRAM efficiency, supported hardware/drivers and general speed.

    But going as far as having different context windows and quality degrading issues is news to me.

    • brucethemoose@lemmy.worldOP
      3 months ago

      “Your post is suggesting that the same models with the same parameters generate different results when run on different backends?”

      Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

      There are even more exotic weight quantization schemes (AQLM, VPTQ) that are much more VRAM efficient than llama.cpp or exllama, but I skipped mentioning them (unless someone asked) because they’re so clunky to set up.

      Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation or grammar.
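As one concrete sampler, here is a minimal sketch of the MinP filtering recommended in the post: tokens whose probability falls below `min_p` times the top token's probability are dropped before sampling. The order of temperature and MinP differs between backends and is often configurable; applying temperature first here is an assumption, and a temperature of zero (greedy decoding) is not handled.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.05):
    """Zero out tokens below min_p * top probability, then renormalize."""
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample(logits, temperature=1.0, min_p=0.05, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = min_p_filter(softmax(logits, temperature), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [5.0, 4.0, 0.0, -2.0]   # two plausible tokens, two unlikely ones
print(min_p_filter(softmax(logits)))
print(sample(logits))
```

Because the cutoff scales with the top probability, the filter is strict when the model is confident and permissive when it is not, which is how it trims stray "wrong language" tokens without flattening creative output.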

  • WolfLink@sh.itjust.works
    3 months ago

    Could I run larger LLMs with multiple GPUs? E.g. would 2x3090 be able to run the 48GB models? Would I need NVLink to make it work?

    • brucethemoose@lemmy.worldOP
      3 months ago

      Absolutely.

      Only aphrodite (and other enterprise backends like vllm/sglang) can make use of NVLink, but even exllama or mlc-llm split across GPUs nicely over PCIe, no NVLink needed.

      2x 3090s or P40s is indeed a popular config among local runners, and is the perfect size for a 70B model. Some try to squeeze Mistral-Large in, but IMO it’s too tight a fit.

      • sntx@lemm.ee
        3 months ago

        Is there an inherent benefit to using NVLink? Should I specifically try out Aphrodite over the other recommendations when I have 2x 3090 with NVLink available?

        • brucethemoose@lemmy.worldOP
          3 months ago

          So there are multiple ways to split models across GPUs (layer splitting, which uses one GPU then another; expert parallelism, which puts different experts on different GPUs), but the way you’re interested in is “tensor parallelism”.

          This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.

          It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

          But if you simply want to stuff the best/highest quality model you can into VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.
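The communication cost of tensor parallelism is easier to see in a toy example. The sketch below splits one weight matrix by output rows across two pretend "GPUs" using plain Python lists; it illustrates the idea only and is not how Aphrodite or exllama implement it.

```python
def matvec(weight_rows, x):
    """y = W x for W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight_rows]


def split_rows(weight_rows, parts):
    """Give each 'GPU' a contiguous slice of the output rows."""
    size = (len(weight_rows) + parts - 1) // parts
    return [weight_rows[i:i + size] for i in range(0, len(weight_rows), size)]


def tensor_parallel_matvec(weight_rows, x, parts=2):
    """Compute W x shard by shard, then stitch the partial outputs together."""
    # Every device needs the full input x (a broadcast) ...
    partials = [matvec(shard, x) for shard in split_rows(weight_rows, parts)]
    # ... and the partial outputs must be gathered again afterwards.
    return [value for partial in partials for value in partial]


W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [10, 1]
print(tensor_parallel_matvec(W, x))
print(matvec(W, x))
```

Both GPUs work on every layer at once, but the broadcast and gather happen at each layer for every token, which is the traffic a faster link speeds up. Layer splitting only hands activations across once per token, so it needs far less bandwidth but leaves one GPU idle while the other works.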

    • brucethemoose@lemmy.worldOP
      3 months ago

      Also, AMD is not off the table for multi-gpu. I know some LLM runners are buying used 32GB MI100s.

  • BaroqueInMind
    3 months ago

    “It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.”

    OP, do you have any telemetry you can show us comparing the performance of what you set up in this guide against an Ollama setup? Otherwise, at face value, I’m going to have to assume this is more uncorroborated bullshit on the internet. Apologies for sounding rude.

    “I don’t like some things about the devs. I won’t rant, but I especially don’t like the hint they’re cooking up something commercial.”

    This concerns me. Please provide links for us to read here about this. I would like any excuse to uninstall Ollama. Thank you!

  • sleep_deprived@lemmy.world
    3 months ago

    I’d be interested in setting up the highest quality models to run locally, and I don’t have the budget for a GPU with anywhere near enough VRAM, but my main server PC has a 7900x and I could afford to upgrade its RAM - is it possible, and if so how difficult, to get this stuff running on CPU? Inference speed isn’t a sticking point as long as it’s not unusably slow, but I do have access to an OpenAI subscription so there just wouldn’t be much point with lower quality models except as a toy.

    • brucethemoose@lemmy.worldOP
      3 months ago

      CPU inference is, unfortunately, slow, even on my 7800X3D.

      The one that might be interesting is DeepSeek Coder V2 Lite, as it’s a very fast MoE model. IIRC Microsoft also released a Phi MoE that’s good for CPU.

      Keep an eye out for upcoming bitnet models.

      Don’t bother upgrading RAM though. You will be bandwidth limited anyway, and it doesn’t make a huge difference.
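The bandwidth limit can be put into numbers with a simple upper bound: each generated token has to read every active weight from memory once. The figures below are assumptions for illustration (dual-channel DDR5-6000, about 4.5 bits per weight, and roughly 2.4B active parameters for a small MoE model); real speeds land below these ceilings.

```python
def peak_bandwidth_gb_s(channels, megatransfers_per_s, bus_bytes=8):
    """Theoretical DRAM bandwidth: channels x 8 bytes x transfer rate."""
    return channels * bus_bytes * megatransfers_per_s / 1000


def max_tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound: every active weight is read once per generated token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed desktop memory setup: dual-channel DDR5-6000.
ddr5 = peak_bandwidth_gb_s(channels=2, megatransfers_per_s=6000)
# A dense 70B model versus a MoE with an assumed ~2.4B active parameters.
dense_70b = max_tokens_per_second(ddr5, 70, 4.5)
moe_lite = max_tokens_per_second(ddr5, 2.4, 4.5)

print(f"memory bandwidth: {ddr5:.0f} GB/s")
print(f"dense 70B ceiling: {dense_70b:.1f} tokens/s")
print(f"small MoE ceiling: {moe_lite:.1f} tokens/s")
```

Adding more RAM raises capacity but not bandwidth, so the ceiling for a dense model stays where it is. A MoE model only reads its active experts per token, which is why it is the interesting case on CPU.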

  • morrowind@lemmy.ml
    3 months ago

    Honestly, I’m just gonna stick to llamafile. I really don’t want to mess around with python. It also causes way more trouble than I anticipate

    • brucethemoose@lemmy.worldOP
      3 months ago

      Llamafile is fine, but it still leaves a lot of performance on the table.

      You can set up kobold.cpp with Q8 flash attention without ever having to install pytorch, which is the real headache. It does have a little Python launch script, but it’s super minimal.

      You can use the native llama.cpp server for absolutely zero python usage.

  • Eskuero@lemmy.fromshado.ws
    3 months ago

    Ollama has had an issue open about the Vulkan backend for a while, but sadly it doesn’t seem to be going anywhere.

  • shaserlark@sh.itjust.works
    3 months ago

    I run a Mac Mini as a home server because it’s great for hardware transcoding, I was wondering if I could host an LLM locally. I work with python so that wouldn’t be an issue but I have no idea how to do CUDA or work on low level code. Is there anything I need to consider? Would probably start with a really small model.

      • shaserlark@sh.itjust.works
        3 months ago

        Yeah it’s an M1 16GB, sounds awesome I’ll try, thanks a lot for the guide it’s super helpful. I just got the Mac Mini for jellyfin but this is an unexpected use case where the server comes in very handy.

        • brucethemoose@lemmy.worldOP
          3 months ago

          For that you probably want the llama.cpp server and a Qwen2 14B IQ3 quantization.

          16GB is kinda tight though, especially if you’re running other stuff in the background.

  • Grimy@lemmy.world
    3 months ago

    vLLM can only run on Linux, but it’s my personal favorite because of the speed gain when doing batch inference.

    • brucethemoose@lemmy.worldOP
      3 months ago

      Aphrodite is a fork of vllm. You should check it out!

      If you are looking for raw batched speed, especially with some redundant context, I would actually recommend sglang instead. Check out its experimental flags too.