Ollama: local models, easy setup
A single binary that pulls models, runs them on your hardware, and exposes a stable API. The most forgiving on-ramp to local inference. A Compose service builder with GPU options, a Hugging Face GGUF pulling trick, a VRAM tier table for picking a model that fits your card, and an honest read on where Ollama is the right call versus llama.cpp.
What this is
Ollama is a model server. It downloads model files from a registry the way Docker downloads images, manages GPU memory for you, and exposes an HTTP API that other software, such as Open WebUI from article 2, can talk to. It also includes a CLI for testing models from a terminal without any other tooling in the loop.
The trade-off Ollama makes is opinionated defaults. It picks quantization for you, sets context length conservatively, and abstracts away most of the knobs that llama.cpp (the inference engine it wraps) exposes. For most people, most of the time, that's the right trade. When it stops being the right trade, article 4 walks through the alternative.
This article continues the Compose file from article 2. If you skipped article 2 you can still follow along — Ollama is useful on its own from the command line — but the wiring to a chat interface picks up where article 2 left off.
Before you start
Local inference is the first part of the stack that actually cares about your hardware. A few things to know up front:
- An NVIDIA GPU is the most-supported path on Linux. With current drivers and the NVIDIA Container Toolkit installed, and the container granted GPU access in Compose, Ollama in Docker picks up the GPU without further configuration. AMD GPUs work with ROCm but the path is rougher; I haven't run it personally.
- Apple Silicon is the cleanest non-NVIDIA path. Ollama on macOS uses Metal directly via a native install. Don't run it in Docker on a Mac — you lose GPU access. The native installer is what you want.
- CPU-only works. Slowly, but it works. A small model — 3B parameters at Q4 — will give you a few tokens per second on a modern CPU. Useful for trying things; not the configuration you want for daily use.
- VRAM is the binding constraint. The model has to fit in GPU memory along with its KV cache. The tier table under "Picking a model for your hardware" gives rough numbers; the short version is that 8 GB of VRAM gets you small-to-mid models comfortably, and 24 GB opens the door to most of what's interesting right now.
Installing
Two routes. Docker is the one that fits this series — same Compose file, same lifecycle as Open WebUI. Native install is what I'd suggest on macOS, and what some Linux readers will prefer for the simpler GPU story.
The builder asks three things:
- GPU: which devices Ollama can see. If you don't have a GPU you can still run small models on CPU, just much more slowly.
- Port: Open WebUI talks to Ollama internally over the Docker network, so no host port is required. Expose the port only if you want to reach the API from outside Docker (other apps on the host, scripts, the CLI from a remote machine).
- Model storage: where downloaded models live. A named volume is the simpler default; a bind mount is worth it if you already have a model directory you want to share with other tools, or if you keep models on a separate disk.
Drop the generated service into the same docker-compose.yml from article 2, alongside the open-webui service, under the same services: block. If you used a named volume, add ollama: {} under the volumes: section at the bottom of the file (the builder includes it for you when relevant).
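For readers without the builder to hand, here is a minimal sketch of what such a service could look like. It assumes an NVIDIA GPU, a named volume, and the container name ollama that the docker exec commands in this article rely on. The image name, API port, and /root/.ollama model path are the ones Ollama's official Docker image uses; the restart policy and the localhost-only port binding are assumptions of mine, not something the builder is known to emit.

```yaml
services:
  ollama:
    image: ollama/ollama
    container_name: ollama            # matches the `docker exec -it ollama ...` commands
    restart: unless-stopped           # assumption: same lifecycle as the open-webui service
    volumes:
      - ollama:/root/.ollama          # named volume holding downloaded models
    # ports:
    #   - "127.0.0.1:11434:11434"     # optional: only if something outside Docker needs the API
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia          # needs the NVIDIA Container Toolkit on the host
              count: all
              capabilities: [gpu]

volumes:
  ollama: {}
```

On a CPU-only machine the whole deploy: block can be dropped. The ports: entry stays commented out unless something outside the Docker network needs the API.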
    $ cd ~/openwebui
    $ docker compose up -d ollama
    $ docker compose logs -f ollama   # watch it come up
Native install (macOS, or Linux without Docker)
The Ollama installer drops a binary and (on macOS) a menu-bar app. On Linux, a one-line script installs the binary and a systemd service. Either way, the API answers on http://localhost:11434 by default.
If you go this route and still want to use Open WebUI in Docker, point it at the host: set OLLAMA_BASE_URL=http://host.docker.internal:11434 in the Open WebUI service (and add extra_hosts: ["host.docker.internal:host-gateway"] on Linux). The wiring is one extra line; the trade-off is that two install styles are now in play, which I find harder to reason about over time. I'd rather have everything in Compose.
Your first model
Pulling a model is one command. The full library lives at ollama.com/library; start with something small to confirm the install works before downloading 20 GB of weights.
    $ docker exec -it ollama ollama pull qwen3:8b
    pulling manifest...
    pulling 4f5b1e... 100% ▕████████████████▏ 4.9 GB
    verifying sha256 digest
    writing manifest
    success
A few minutes later — depending on your connection and the model size — it's ready. Test from the same terminal:
    $ docker exec -it ollama ollama run qwen3:8b
    >>> explain quantization in one sentence
Load Open WebUI in a browser and the model is already there. Open WebUI polls Ollama's API on startup and again whenever you refresh, so any model you pull shows up in the picker without further configuration. If you started Ollama after Open WebUI, restart Open WebUI once so it discovers the new connection.
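To see the list of pulled models the way an API client would, you can ask Ollama directly. This is a small sketch using the /api/tags endpoint, which returns the locally available models. The base URL is an assumption: it only works from the host if you exposed port 11434, or if you run the script somewhere that can reach the Ollama container. The helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: the API is reachable at this address


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [entry["name"] for entry in tags_response.get("models", [])]


def list_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

Calling list_models() after the pull above should include qwen3:8b in the result. If the call fails with a connection error, the port is most likely not published outside the Docker network, which is the default this article recommends.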
Pulling any GGUF from Hugging Face
The Ollama library has a few thousand models, which is a lot, but Hugging Face has every quant of every model anyone's ever published. The two are connected. On any GGUF model's Hugging Face page, the Use this model dropdown has an Ollama option that gives you a copyable command with a quant selector — something like ollama run hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M.
The hf.co/… form works anywhere an Ollama model name does, including inside Open WebUI. In Admin Panel → Settings → Connections, click the download icon next to your Ollama connection to open the Manage Ollama dialog. Paste hf.co/unsloth/Qwen3.5-9B-GGUF:Q4_K_M (or any HF GGUF path) into the Pull a model field, hit pull, and the model downloads directly through the UI.
I use this from my phone all the time. Open WebUI is reachable from anywhere via Cloudflare Tunnel (article 6); the HF browser is reachable from anywhere by virtue of being the web. Between them, I can add a new model to the server while I'm away from it — admin work over a hotel-room connection that used to require SSH.
What the tag after the colon means
Ollama model names look like qwen3:8b or gemma3:27b-instruct-q4_K_M. The bit after the colon is a tag — usually a combination of parameter count, instruct/base, and quantization. qwen3:8b with no further suffix gets you Ollama's default quantization for that model, which is typically Q4_K_M (a 4-bit quantization that's the size/quality sweet spot for most use).
You can pin more precisely — qwen3:8b-instruct-q5_K_M for a slightly higher-quality 5-bit variant — but the defaults are reasonable and the library page lists what's available for each model. Sub-Q4 quants (Q3, Q2, IQ2_XS) have their place: they hold up better than you'd expect in a model's primary training languages — usually English, often Mandarin — and can be useful for creative writing in those languages, where some looseness helps more than it hurts. They get noticeably less reliable for translation and for work in less-represented languages. On 8 GB of VRAM I sometimes run lower quants than would be optimal — that's the trade I make for being able to load a bigger model at all.
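If you script against model names, the name and tag are easy to pull apart. The helper below is a hypothetical illustration, not part of Ollama: it splits on the last colon and falls back to latest when no tag is given, which is the tag Ollama resolves a bare name to. It does not try to decode the tag further, since the size, variant, and quantization parts are a naming convention rather than a fixed grammar.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split an Ollama-style model reference into (name, tag)."""
    name, sep, tag = ref.rpartition(":")
    # No colon at all, or the last colon belongs to a host:port prefix
    # (the text after it contains a slash): the whole string is the name.
    if not sep or "/" in tag:
        return ref, "latest"
    return name, tag
```

The same function handles the hf.co/… form from the previous section, because the quant selector there also sits after the final colon.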
Picking a model for your hardware
The honest answer is that this changes every few months. The shape of the answer doesn't. Match the model size to your VRAM, leave a few gigabytes for the KV cache (context window), and don't try to be a hero with quantization below Q4.
| VRAM | What runs comfortably | Reasonable picks (mid-2026) |
|---|---|---|
| CPU only | Small models, 2–4B, at slow speeds (~3–8 tok/s) | Gemma 4 small variants, Qwen 3.6 4B |
| 8 GB (my card) | 4–9B dense models at Q4, MoE models with a small active parameter count | Qwen 3.5 9B, smaller Gemma 4 variants |
| 12–16 GB | Up to ~13B dense at Q4, larger MoE models with on-CPU offload | |
| 24 GB | Up to ~30B dense at Q4, or larger MoE comfortably | Gemma 4 26B A4B, Qwen 3.6 35B A3B, Qwen 3.6 27B |
| 48 GB+ | ~70B dense at Q4 with room for context | |
I run an 8 GB Quadro RTX 4000 (a decent high-end card around 2020). Qwen 3.5 9B at Q4 is the model I default to for general chat; it handles the same draft-and-summarize workload I'd previously sent to a cloud API well enough that the difference doesn't bother me. When I need more capability than that, I use OpenRouter rather than try to squeeze a larger model onto this card. The Qwen 3.6 35B A3B that fits with aggressive quantization is interesting and capable in English, but I run it through llama.cpp rather than Ollama, for reasons that come up under "Where Ollama fits in my stack now."
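The sizing rule in this section (weights must fit, with a few gigabytes left over for context) reduces to back-of-envelope arithmetic. The sketch below is a rough estimate only. The 4.5 bits per weight figure for a Q4-class quant and the 2 GB headroom are assumptions chosen for illustration; real GGUF files vary by quant type and model, and dense and MoE models behave differently in practice.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(vram_gb: float, params_billion: float,
         bits_per_weight: float = 4.5, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus reserved KV-cache headroom fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb
```

With these assumptions a 9B model comes out at about 5 GB of weights and fits an 8 GB card, while a 27B model does not fit until you reach the 24 GB tier. That matches the shape of the table, which is all an estimate like this is good for.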
Power user: KV cache, context length, and the second slot of VRAM you forgot about
Model weights are only part of what sits in VRAM. The KV cache — the running state of attention over your context window — also lives there, and it scales with context length and batch size. A 9B model at Q4 takes around 5 GB for the weights; a 32K context window can add another 2–3 GB on top, depending on the model's attention configuration.
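For a standard transformer the KV cache size can be estimated from the attention configuration: one key and one value vector per layer, per KV head, per token. The sketch below assumes a 16-bit cache, batch size 1, and full attention in every layer; models with sliding-window or hybrid attention, or a quantized cache, will come in lower. The layer and head counts in the example are hypothetical, not the configuration of any model named in this article.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers and positions (batch size 1)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_tokens


# Hypothetical config: 32 layers, 4 KV heads, head dimension 128.
gib_at_32k = kv_cache_bytes(32, 4, 128, 32 * 1024) / 1024**3
```

For that hypothetical configuration a 32K context works out to 2 GiB, and the size grows linearly with context length, so 64K would need twice as much.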
Ollama's default context window is conservative: 2048 tokens for a long time, somewhat larger in recent releases, so check what your version ships with. That's fine for short chats, painful for anything document-shaped. You can raise it per-request with num_ctx in the API, or set it in a custom Modelfile, but every doubling of context roughly doubles the KV cache. On tight VRAM, a 16K context with a 9B model can be the tipping point that forces partial CPU offload, which slows generation dramatically.
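As a sketch of the per-request route: num_ctx goes inside the options object of a request to /api/generate. The function names, the base URL you pass in, and the timeout are my own choices for illustration; the endpoint, the options.num_ctx field, and the response field of the non-streaming reply are part of Ollama's API.

```python
import json
import urllib.request


def generate_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Body for POST /api/generate with a per-request context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of a stream of chunks
        "options": {"num_ctx": num_ctx},
    }


def generate(base_url: str, body: dict) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]
```

For example, generate("http://localhost:11434", generate_request("qwen3:8b", "summarize this document: ...", 8192)) asks for an 8K window on that one call. The Modelfile equivalent is a two-line file, FROM qwen3:8b followed by PARAMETER num_ctx 8192, built into a new local model with ollama create.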
The pragmatic move: pick the model that fits at the context length you actually use, not the largest model that technically loads with a 2K context.
Where Ollama fits in my stack now
I keep Ollama installed and running, but it's not where most of my chat traffic goes. Day-to-day inference on this machine runs through llama.cpp with llama-swap in front of it — a setup with finer control over per-model configuration and on-demand model swapping. That's article 4.
The specific thing that pushed me there: Ollama tends to pick up new model architectures more slowly than llama.cpp, the inference engine it wraps. When a Qwen or Gemma generation lands and I want to try it the same week, llama.cpp usually supports it before Ollama does. Going direct cuts out the lag. For models that have been around a while and aren't going anywhere, Ollama is fine; for the bleeding edge, it isn't the right tool.
That said, the order I'd suggest for most readers is still: start with Ollama. Use it for a few weeks. If you find yourself wanting things it doesn't expose — specific KV cache quantization, weird batch sizes, multiple models loaded simultaneously without restart-and-reload, or a model that hasn't landed in the Ollama library yet — that's the signal to read article 4. Until then, you have what you need.
For a team or organization
Ollama's HTTP API has no authentication. None. The assumption is that you're running it on a trusted network — usually localhost or a private Docker network — and any access control happens upstream, in whatever's calling it.
For an internal multi-user deployment, that means: never expose Ollama's port outside the host or the Docker network. Put Open WebUI (with its account model) in front of it. If you need API access for scripts or notebooks, route those through a reverse proxy that adds an auth header, or via Open WebUI's own API. Treating Ollama as a private backend is the only safe posture.
The flip side is that Ollama is small and easy to audit. It's a single Go binary, the model files are GGUF (an open format), and there's no telemetry that I've found in the default build. For organizations that need to justify what's running, that surface area is a feature.
Where this fits
With Ollama running and a model pulled, the local-only path of the stack is complete. Open WebUI is the interface; Ollama is the engine; the model is yours. Article 4 covers llama.cpp for readers who eventually want more control. Article 5 adds search — the next layer that makes conversations meaningfully smarter without changing the model.