Stack Series · 4 of 9 Practical · May 2026

llama.cpp: higher performance, more control

The engine underneath Ollama, run directly. More flags than you'll want at first; the right home for any model you actually care about tuning. A llama-swap Compose builder, a tour of the YAML that defines your models, the six flags that do most of the work, and a power-user expander for fitting a 35B MoE on 8 GB of VRAM.

Curtis Smith · OptiMoss.ai · part of the Stack Series

§ 0

What this is

llama.cpp is a C/C++ inference engine for transformer models in the GGUF format. It is the engine Ollama wraps, the engine LM Studio wraps, and the engine most of the local-LLM tooling layer ends up calling sooner or later. Running it directly skips a layer of abstraction and gets you to the flags.

The piece you actually run is llama-server — a small HTTP server bundled with llama.cpp that speaks an OpenAI-compatible chat API. One process, one model, one set of flags. To run multiple models without keeping them all in VRAM at once, you put a small proxy in front of llama-server that loads and unloads it on demand. The standard tool for that is llama-swap.

The bargain here: you give up Ollama's library and its sensible defaults. You take on choosing your own model files, writing the flags for each one, and updating the engine yourself. In exchange you get every knob llama.cpp exposes, day-one support for new model architectures, and a setup small enough to read in a single screen of YAML.

§ 1

When this is worth the switch

If Ollama is working for you and nothing about it gets in your way, stop here. The rest of this article is the long way around to a similar chat window. Specific reasons to come over:

A new architecture landed in llama.cpp but not in Ollama yet. Ollama tracks upstream, often within a week or two, but "a week or two" is exactly the window where new releases are most interesting. Going direct removes the lag.
You want to run the same model at different context lengths. A 9B at 8K for snappy chat and the same 9B at 128K for a long document — two entries, one model file, no juggling.
You're trying to fit a model that's just barely too big. KV cache quantization, flash attention, picking a specific quant by hand rather than accepting Ollama's default — these are the levers that turn "doesn't fit" into "fits with room to spare."
You need a vision model that Ollama hasn't packaged yet. Pulling the GGUF weights plus the matching mmproj projector file from Hugging Face and wiring them together yourself is a five-minute job in llama.cpp.
You want a single OpenAI-compatible endpoint that exposes a fleet of models. llama-swap is happy to advertise twenty model entries and load whichever one a request asks for, swapping out the previous one if VRAM is tight.

None of these are blockers for casual use of Ollama. They are reasons I personally crossed over and didn't go back.

§ 2

The shape of the install

Three moving parts.

llama-server — the HTTP server. Built from llama.cpp source, or pulled as a pre-built container image. One process per active model.
llama-swap — a Go proxy that listens on one port, watches incoming requests, and starts or stops llama-server instances behind the scenes based on which model was asked for.
Your GGUF files — sitting in a directory the server can read. Unlike Ollama, there is no registry layer. You download a file; you point a config at it.

The convenient way to run this in Docker is the mostlygeek/llama-swap image with the unified-cuda tag, which ships both llama-swap and a matching llama-server build in one image. One container, one config file, one models directory. That's what I run.

$ cd ~/openwebui
$ mkdir -p llama-models
# drop GGUF files in here as you collect them

§ 3

The Compose service

Append this to the same docker-compose.yml that already has Open WebUI and (optionally) Ollama. llama-swap and Ollama can coexist — they're separate endpoints and Open WebUI can talk to both at once, so you can A/B a model between them while you migrate.

llama-swap · Compose service builder

GPU access

Which devices llama-server can see. A single GPU is the simple case. On a multi-GPU host, pin to one device and leave the others free for ComfyUI or a second engine.

All NVIDIA GPUsRecommended for single-GPU machines. One specific GPU (device 0)For multi-GPU hosts. Edit the device_ids to pin a different card.

Host port

Where the OpenAI-compatible endpoint answers on the host. I use 8007 because 11434 is Ollama's territory and overloading the same port creates needless confusion. Bound to 127.0.0.1 by default; Open WebUI reaches it over the Docker network and doesn't need a public port.

Models directory

Bind-mounted read-only into the container. Bind mount (not a named volume) so you can drop a new GGUF in from the host and llama-swap picks it up on next request.

Append to docker-compose.yml

The config.yaml the service references doesn't exist yet — the next section is what goes in it. Start the service after that file is in place:

$ docker compose up -d llama-swap
$ docker compose logs -f llama-swap   # first request will trigger the first model load

§ 4

Describing your models

llama-swap's config file is a flat list of model entries. Each entry has a name (the string that shows up in Open WebUI's picker), a TTL (how long the process stays loaded after the last request), and a command — the exact invocation of llama-server for that model.

Here is an entry from my actual config, lightly trimmed, running Qwen 3.5 9B with a 32K context window:

# llama-swap-config.yaml
ttl: 300                          # global default: unload after 5 min idle

models:
  "cpp-qwen3.5-9b-32k":
    ttl: 300
    cmd: >
      llama-server --port ${PORT}
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --mmproj /models/Qwen3.5-9B-mmproj-F16.gguf
      --ctx-size 32768

That is the whole shape. ${PORT} is filled in by llama-swap at launch — you don't pick it. The cmd: uses YAML's folded scalar (>) so the multi-line value joins back into a single command. The --mmproj line is what makes this a vision-capable model; omit it for text-only models.

The trick that earned its keep on my hardware: the same GGUF can appear under multiple names with different flags. I have cpp-qwen3.5-9b-32k and cpp-qwen3.5-9b-128k — identical weights, different context budgets, picked from Open WebUI's model dropdown depending on what I'm doing. 32K loads faster and leaves room for image inputs; 128K is for the occasional time I'm chewing on a long document.

A fuller config example

Four models, one of them in two flavors, with a vision projector on the 9B:

ttl: 300

models:
  "cpp-qwen3.5-9b-32k":
    ttl: 300
    cmd: >
      llama-server --port ${PORT}
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --mmproj /models/Qwen3.5-9B-mmproj-F16.gguf
      --ctx-size 32768

  "cpp-qwen3.5-9b-128k":
    ttl: 300
    cmd: >
      llama-server --port ${PORT}
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --mmproj /models/Qwen3.5-9B-mmproj-F16.gguf
      --ctx-size 131072

  "cpp-gemma4-26b-a4b":
    ttl: 300
    cmd: >
      llama-server --port ${PORT}
      --model /models/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
      --mmproj /models/gemma-4-26B-A4B-it-mmproj-F16.gguf
      --ctx-size 16384

  "cpp-qwen3.6-35b-32k":
    ttl: 300
    cmd: >
      llama-server --port ${PORT}
      --model /models/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
      --mmproj /models/Qwen3.6-35B-A3B-mmproj-F16.gguf
      --ctx-size 32768

Restart the container after edits — docker compose restart llama-swap — no rebuild needed; the config is bind-mounted.

Where do the GGUF files come from? Hugging Face, mostly. Search for a model you want — Qwen 3.5 9B, Gemma 4 26B — and look for community quants (the unsloth, bartowski, and TheBloke-successor accounts publish good ones). Pick a Q4_K_M file in the size you can afford, download it into llama-models/, add an entry, restart, done.

§ 5

Wiring to Open WebUI

llama-swap speaks the same OpenAI-compatible wire format as OpenRouter, so Open WebUI sees it as another OpenAI connection — no separate plugin, no special config.

In Open WebUI: Admin Panel → Settings → Connections.
Under OpenAI API, click + to add a connection.
URL: http://llama-swap:8080/v1 — that's the in-Docker hostname (the service name), not localhost.
API key: anything non-empty. llama-swap doesn't check it; Open WebUI insists on a value.
Save. The model list will populate with whatever names you put in the config file.

The picker now shows both Ollama models and llama.cpp models side by side. Pick one, send a message — llama-swap notices, starts llama-server in the background, the first response is delayed a few seconds while weights load, and then it's the usual streaming response. Switch to a different model and llama-swap unloads the previous one (subject to the TTL) and starts the next.

If Open WebUI runs outside Docker, or on a different host

Use the host port instead: http://localhost:8007/v1 if Open WebUI is on the same machine but outside Docker, or http://<host-ip>:8007/v1 from another machine on the LAN. In that second case, change the Compose binding from 127.0.0.1:8007 to 0.0.0.0:8007 — and only do that on a network you trust. llama-swap has no authentication of its own; treat the port like a database port.

§ 6

The flags worth knowing

llama-server has on the order of a hundred command-line flags. Six of them do most of the work.

Flag	What it does
--ctx-size	Context window in tokens. Bigger = larger KV cache = more VRAM. Match this to the longest input you actually use, not the theoretical maximum.
--n-gpu-layers	How many transformer layers live on the GPU. `99` means "all of them" (in practice, capped at the model's layer count). Lower numbers offload the remainder to CPU — slower, but lets larger models load on tight VRAM. Default is 0; you almost always want to set this.
--mmproj	Path to the vision projector file for multimodal models. Without it, the same weights run text-only.
-ctk / -ctv	KV cache quantization for keys and values. `q8_0` halves the cache size from FP16 with a small quality cost; `q4_0` halves it again. The fastest way to fit a larger context.
-fa on	Flash attention. Faster and uses less memory on supported architectures. Try it; turn it off if you see anything weird.
--parallel	How many concurrent requests the server will handle in one batch. `1` is the default and the right answer for personal use; raise it for serving multiple users.

The full reference lives in the llama-server README. Plenty of it you'll never need.

Power user: making a 35B model fit on 8 GB of VRAM

An 8 GB Quadro RTX 4000 (suboptimal, yet viable) cannot hold a 35B dense model. It can hold a 35B mixture-of-experts model with a small active parameter count — Qwen 3.6 35B-A3B routes through only ~3B active parameters per token — if you bring the right toolkit. Even with the Q2_K_P quant (14 GB of weights), the whole model still doesn't fit; the trick is partial offload, where the first ~14 of the 64 layers live on the GPU and the rest run on CPU. The KV cache also gets quantized to halve its footprint:

cmd: >
  llama-server --port ${PORT}
  --model /models/Qwen3.6-35B-A3B-Q2_K_P.gguf
  --ctx-size 32768
  --n-gpu-layers 14
  -fa on
  -ctk q8_0 -ctv q8_0

The exact --n-gpu-layers figure is a measurement, not a derivation — see the next section for the math, and the calculator that does it for you. It comes out to ~14 on my card with this quant at 32K context; you adjust until the model loads cleanly without OOM and without leaving meaningful headroom unused. Sub-Q4 quantization loses fidelity, especially for non-English languages, and CPU-resident layers drop tokens-per-second noticeably. It is, though, the kind of thing you can actually try when you control the flags — and on a card that "shouldn't" run a 35B model at all, that's not nothing.

§ 7

Tuning for your VRAM

The flags above are the levers. The math that decides which way to pull them is one inequality, applied per model:

weights × (n_gpu_layers / total_layers)
  + mmproj  (sits on GPU whenever it's specified)
  + ctx × kv_bytes_per_token × (½ with -ctk/-ctv q8_0)
  + ~1.5 GB CUDA context overhead
  + ~0.5 GB vision activation headroom  (only if mmproj is loaded)
≤ available VRAM

Three of the lines are knobs you can turn. Lower --n-gpu-layers and the first term shrinks (and so does inference speed). Shorten --ctx-size and the third term shrinks (and you handle smaller inputs). Toggle -ctk q8_0 -ctv q8_0 and the third term halves (with a small quality cost that's hard to notice in practice). The two overhead lines aren't really negotiable; the vision term shows up because image encoding generates a transient activation tensor and image tokens consume KV slots, so a model that fits text-only at --ctx-size 32768 may still OOM the moment you actually attach an image.

For a custom model, four spec numbers feed the calculator. The weights and mmproj sizes are just file sizes from the GGUF — easy. The other two — total layer count and KV bytes per token — come from the model's HuggingFace page, in config.json under the Files tab:

Total layers = num_hidden_layers directly.
KV bytes per token = 2 × num_hidden_layers × num_key_value_heads × (hidden_size / num_attention_heads) × 2. The first 2 is K plus V; the trailing 2 is bytes per FP16 element. Models with grouped-query attention — most modern ones — have num_key_value_heads well below num_attention_heads, which is why same-parameter-count models can have very different KV footprints.

This produces an architectural upper bound, not what llama.cpp will actually allocate at runtime — the real number is usually lower thanks to KV cache pooling and quantization. That's fine: the calculator will recommend fewer GPU layers than you could strictly get away with, which leaves headroom rather than risking OOM on load. Treat the output as a conservative starting point. Load it, check nvidia-smi, and if you see meaningful free VRAM, bump --n-gpu-layers up a few at a time until you don't.

You don't actually need to do this math by hand. The calculator below takes the model specs, your VRAM budget, and a context-versus-speed preference, and prints a config entry you can paste straight into llama-swap-config.yaml:

llama-swap · config entry generator

Model preset

Pick a model to auto-fill the spec fields, or choose Custom to enter your own.

Weights size (GB)

mmproj size (GB, 0 = none)

Total layers

KV bytes/token (FP16)

VRAM budget (GB)

Target context (tokens)

Use -fa on and -ctk/-ctv q8_0 — halves KV cache, small quality cost

Priority

Left: keep the target context, offload more layers to CPU. Right: keep all layers on GPU, shrink context to fit. Middle: meet in the middle.

more context higher speed

Entry name

Model filename

mmproj filename (blank for none)

n-gpu-layers99

ctx-size32768

VRAM used— GB

llama-swap-config.yaml entry

What this is approximating. The math treats a GGUF file as if its bytes map one-to-one onto GPU memory, which isn't quite true — embedding and output tensors load alongside the first layers and skew the per-layer cost, and quantized weights sometimes use a touch more VRAM than disk. The CUDA-context overhead is empirical (~1.5 GB on my Quadro RTX 4000; it'll vary by driver and card). Treat the calculator's output as a starting point: load the model, watch nvidia-smi, adjust by a few layers if you see headroom or hit OOM. The whole point of writing the entries by hand is that this loop is cheap.

§ 8

Where this lives in my stack

This is where my chat traffic actually goes. Ollama is still installed — useful for testing a model's library default before deciding whether to write a custom entry — but llama-swap is the endpoint Open WebUI hits day to day. The reason is single-purpose: I want to try new models the week they land, and llama.cpp gets architecture support first. The price is that I update the engine myself (a docker compose pull on a schedule I choose) and write the entries by hand. Both jobs take less time than I expected when I committed to the switch.

The other thing that earned its keep is the multi-config trick. Having the same 9B model available at three different context budgets, or three different system prompts baked in via --chat-template, or with and without the vision projector, costs nothing extra at rest — llama-swap only loads one at a time — and means I don't have to remember which Modelfile I wrote three months ago. The config file is the documentation.

For a team or organization

llama-swap inherits llama-server's "no authentication" posture: any client that can reach the port can call any model. The deployment shape that fits an org is the same as Ollama's — keep the port off the public network, put Open WebUI (with its account model and per-user access controls) in front of it, and treat the engine as a private backend.

The honest tradeoff for an organization considering this over Ollama:

You take on engine version management. Pinned image tags, a tested upgrade cadence, and a rollback plan if a new build regresses on a model you depend on. Ollama smooths most of this; here it's yours.
You take on model curation. Which quant from which uploader is "approved." Hugging Face hosts community-uploaded GGUF files; the provenance story is the same one you'd have with any open-weights distribution, and the same review you'd apply to a model directly applies to its quant.
In return, you get reproducible deployments — the config file fully describes the running configuration — and a service surface small enough to audit in an afternoon.

§ 9

Where this fits

With llama.cpp running, the local-inference half of the stack is as deep as it usefully gets. The next layer doesn't change the model — it changes what the model can see. Article 5 is SearXNG: a self-hosted search engine that gives the chat interface a way to look things up on the web without rate limits or query tracking.

1 Why self-host? Read

2 Open WebUI: your AI interface Read

3 Ollama: local models, easy setup Read

4 llama.cpp: higher performance, more control Here

5 SearXNG: web-aware conversations Read

6 Remote access: Cloudflare Tunnel Read

7 Security and privacy Read

8 Image generation Read

9 Open WebUI in practice Read