I’d like to self-host a large language model (LLM).

I don’t mind if I need a GPU and all that, at least it will be running on my own hardware, and probably even cheaper than the $20 everyone is charging per month.

Which LLMs are you self-hosting, and what are you using to run them?

  • The Hobbyist@lemmy.zip
    2 months ago

    I run Mistral-Nemo (12B) and Mistral-Small (22B) on my GPU and they are pretty good. As others have said, GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good, and 70B+ models are excellent (based solely on my own experience). Go for q4_K models, as they will run many times faster than higher-precision quantizations with little degradation in output quality. They typically come in S (Small), M (Medium) and L (Large) variants; take the largest one that fits in your GPU memory. If you go below q4, you may see more severe and noticeable degradation in output quality.
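To make the "largest variant that fits" rule concrete, here is a minimal Python sketch. It is only a back-of-the-envelope aid: the bits-per-weight figure, the variant file sizes, and the 1.5 GB headroom for context and runtime overhead are all assumptions for illustration, not measured values, and the helper names are hypothetical.

```python
from typing import Dict, Optional


def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 weights) * (bits_per_weight / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def largest_fitting(sizes_gb: Dict[str, float], vram_gb: float,
                    headroom_gb: float = 1.5) -> Optional[str]:
    """Return the name of the largest variant that fits in VRAM, or None."""
    # Keep some memory free for the context cache and runtime (assumed amount).
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in sizes_gb.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None


# Hypothetical file sizes for a 12B model; read the real ones off the download page.
variants = {"q4_K_S": 7.0, "q4_K_M": 7.5, "q8_0": 13.0}

print(estimate_weights_gb(12, 4.5))            # weights alone at an assumed 4.5 bits per weight
print(largest_fitting(variants, vram_gb=12))   # largest variant within the 12 GB budget
```

With a 12 GB card and these made-up sizes, the q8_0 file is ruled out and the larger of the two q4_K files is chosen. Actual memory use also grows with context length, so treat the headroom as a starting guess.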

    If you only need to serve one user at a time, ollama + Open WebUI works great. If you need to serve multiple users at the same time, check out vLLM.
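For the single-user case, a web UI is optional: ollama also exposes a local HTTP API. The sketch below assumes ollama is running on its default port (11434) and that a model has already been pulled; the model tag `mistral-nemo` in the usage comment is an assumption, so substitute whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running ollama server and a pulled model):
# print(generate("mistral-nemo", "Explain q4_K quantization in one sentence."))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the example short.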

    Edit: I’m simplifying a lot, but hopefully this is simple and actionable as a starting point. I’ve also seen great results from Gemma2-27B.

    • Avid Amoeba@lemmy.ca
      2 months ago

      If you only need to serve one user at a time, ollama + Open WebUI works great. If you need to serve multiple users at the same time, check out vLLM.

      Why can’t it serve multiple users? Open Web UI seems to support multiple users.

      • The Hobbyist@lemmy.zip
        2 months ago

        I didn’t say it can’t, but I’m not sure how well it is optimized for that. In my initial testing it queued queries and submitted them to the model one after another; I have not seen it batch-compute the queries, but maybe that is a setup issue on my side. vLLM, on the other hand, is designed specifically for the concurrent multi-user use case and has multiple optimizations for it.
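One way to check the queueing-versus-batching behaviour yourself is to fire several requests at once and compare the total time with sending them one by one. The sketch below assumes a vLLM server started with its OpenAI-compatible API on the default port 8000; the model name is a placeholder that must match whatever the server was launched with, and `fan_out` is a hypothetical helper, not part of vLLM.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

VLLM_URL = "http://localhost:8000/v1/completions"  # vLLM's OpenAI-compatible endpoint (default port)
MODEL = "mistralai/Mistral-Nemo-Instruct-2407"     # placeholder; must match the served model


def complete(prompt: str) -> str:
    """Send one completion request and return the generated text."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 64}
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


def fan_out(prompts: List[str], send: Callable[[str], str] = complete,
            max_workers: int = 8) -> List[str]:
    """Send all prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


# Usage (requires a running vLLM server):
# answers = fan_out(["Hello", "What is ROCm?", "Summarize GGUF in one line."])
```

If the backend batches requests, the concurrent run should finish in much less than the sum of the individual request times; if it queues them, the two timings will be close. Passing a different `send` function lets you point the same test at another server.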

    • dukatos@lemm.ee
      2 months ago

      I run ollama:rocm with the deepseek-coder model on a Radeon 6700XT. The only extra step was setting the GPU via environment variables, because the card is not officially supported by ROCm, but it works.
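The comment does not say which environment variable was used. A commonly cited workaround for RDNA2 cards that are missing from ROCm's support list is `HSA_OVERRIDE_GFX_VERSION=10.3.0`, so the sketch below assumes that value; treat it as a guess at the setup, not a description of it. It builds a `docker run` command for the `ollama/ollama:rocm` image, and `ollama_rocm_command` is a hypothetical helper name.

```python
import subprocess
from typing import List


def ollama_rocm_command(gfx_override: str = "10.3.0") -> List[str]:
    """Build a docker command for the ROCm image with a GFX version override."""
    return [
        "docker", "run", "-d",
        "--device", "/dev/kfd", "--device", "/dev/dri",   # expose the AMD GPU to the container
        "-v", "ollama:/root/.ollama",                      # persist downloaded models
        "-p", "11434:11434",                               # ollama's default API port
        "-e", f"HSA_OVERRIDE_GFX_VERSION={gfx_override}",  # assumed override for this card
        "--name", "ollama",
        "ollama/ollama:rocm",
    ]


def start_ollama_rocm() -> None:
    """Start the container; raises CalledProcessError if docker fails."""
    subprocess.run(ollama_rocm_command(), check=True)


# Usage (requires docker and the ROCm kernel driver on the host):
# start_ollama_rocm()
```

The override tells the ROCm runtime to treat the card as a supported GPU target, so the right value depends on the card.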