Stack Series · 1 of 9 · No commands · May 2026

Why self-host?

You're already using AI tools every day. This is about where that work goes — your prompts, your data, your search history — and why more people are keeping it on hardware they own. Privacy, cost, control, and an honest read on when cloud services are still the right call.

Curtis Smith · OptiMoss.ai · part of the Stack Series

§ 0

Where does it go?

Every time you send a message to a cloud AI service, that message leaves your device. The destination is someone else's server, subject to someone else's data policy — the one that you agreed to when you signed up and probably haven't re-read since.

Cloud AI: you, on your device, send a prompt to the vendor box, where the model runs and logs are kept, and the reply comes back. Logged: your prompt and the reply, retained per current data policy.

Self-hosted: you and the model on one machine, your box. Logged: nothing by default. Nothing crosses the network.

§ 1

Privacy, for real

Some conversations are awkward to have on record: medical questions, a client strategy you haven't made public, legal research. With a cloud service, those go to someone else's infrastructure under a policy that can change. Many major providers have reasonable policies today. What they'll do with that data in five years is a different question.

With a cloud service, you're trusting their current data policy and every revision that follows. A local model has no policy to revise — the conversation ends when you close the window.

What do cloud providers actually do with your data?

Policies vary by provider and tier. The broad pattern: free tiers often permit data use for model improvement; paid API tiers generally don't train on your inputs but do retain logs for safety review and abuse prevention. Enterprise agreements typically offer stronger protections and shorter retention windows. Default retention periods are often 30–90 days. Human review of flagged content is standard practice across all major providers.

None of it is hidden — it's in the terms — but terms change. What's true today isn't guaranteed to survive an acquisition, a jurisdiction-specific legal order, or a policy update three years from now that you won't notice because you agreed to the original one in 2024.

Framing this for your organization

In a business context, privacy is a compliance question. The relevant artifacts are data processing agreements, jurisdiction documentation, and a clean audit trail.

GDPR treats sending employee or customer data to a US-based AI service as a cross-border transfer that may require specific safeguards.
HIPAA-regulated data cannot go to a general-purpose cloud AI service without a signed business associate agreement (BAA). Most such services don't offer one.
A self-hosted model removes one more data processing relationship that would otherwise need vendor vetting and periodic review.

Self-hosting doesn't eliminate compliance work — access controls, log handling, and incident response are still up to you. It keeps the surface area inside infrastructure you already manage.

§ 2

The cost math

Cloud AI feels cheap, until it doesn't. If you're using a chat interface occasionally, the cost is negligible. Once you start integrating it into workflows — pipelines, automations, anything that runs queries programmatically — the numbers can add up rapidly. Future content will explore the relevant math in more detail.

Power user: where the hardware investment pays off

The upfront hardware argument against self-hosting is real. A used RTX 3090 (24GB VRAM) runs around $400–600 and handles Gemma 4 26B A4B and Qwen 3.6's larger variants comfortably on Q4 quants. At a typical mid-tier cloud-LLM cost of roughly half a cent per query (the Gemini 3.1 Pro / Sonnet 4.6 band, as of May 2026), it takes on the order of 100,000 queries to cover that hardware — years at fifty queries a day, months at five hundred. The crossover is real but slow at human chat volumes.

The math shifts further when you account for automated pipelines. A system that queries an LLM on each incoming event can rack up thousands of API calls per day before anyone notices the bill. Programmatic use is where the crossover actually lives.
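The break-even arithmetic above can be sketched in a few lines. This is a rough illustration, not a pricing model: the $500 hardware figure is an assumed midpoint of the $400–600 range, the 5,000-queries-a-day pipeline volume is a hypothetical stand-in for "thousands of API calls per day", and electricity and your own setup time are left out.

```python
def break_even_queries(hardware_cost_usd: float, cloud_cost_per_query_usd: float) -> float:
    """Queries at which cumulative cloud spend equals the hardware price."""
    if cloud_cost_per_query_usd <= 0:
        raise ValueError("cloud cost per query must be positive")
    return hardware_cost_usd / cloud_cost_per_query_usd


def days_to_break_even(
    hardware_cost_usd: float,
    cloud_cost_per_query_usd: float,
    queries_per_day: float,
) -> float:
    """Days of steady usage needed to reach the break-even query count."""
    if queries_per_day <= 0:
        raise ValueError("queries per day must be positive")
    queries = break_even_queries(hardware_cost_usd, cloud_cost_per_query_usd)
    return queries / queries_per_day


# Assumed inputs: $500 for a used GPU (midpoint of the quoted range)
# and half a cent per cloud query (the figure quoted in the text).
HARDWARE_USD = 500.0
CLOUD_USD_PER_QUERY = 0.005

if __name__ == "__main__":
    # 5,000/day is a hypothetical pipeline volume, not a measured one.
    for label, per_day in [("light chat", 50), ("heavy chat", 500), ("pipeline", 5000)]:
        days = days_to_break_even(HARDWARE_USD, CLOUD_USD_PER_QUERY, per_day)
        print(f"{label}: {per_day} queries/day, about {days:.0f} days to break even")
```

With these inputs the break-even point is about 100,000 queries: roughly 2,000 days at fifty queries a day, 200 days at five hundred, and 20 days at five thousand. Swap in your own hardware price and per-query cost to see where you land.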

If you're running inference on a machine you already own — a workstation with a capable GPU — the crossover is immediate. You're not buying hardware; you're using capacity that's already paid for and sitting idle most of the day.

§ 3

What you actually own

Ownership is the hardest part to quantify. With most software, accepting a dependency on someone else's infrastructure is a reasonable tradeoff: they handle the reliability and you handle the work. AI has a few properties that complicate this more than usual.

Self-hosted · Cloud service

✓ Works offline · ✗ Requires internet
✓ No rate limits · ✗ Rate-limited by tier
✓ No silent model updates · ✗ Models updated silently
✓ No vendor outages · ✗ Vendor outages happen
✓ Full system prompt control · – Limited behavior control
✓ Predictable cost · ✗ Pricing subject to change

Each of these has happened. When DeepSeek's R1 matched OpenAI's o1 on benchmarks in January 2025 at a fraction of the price, OpenAI, Anthropic, and Google all cut API pricing in the weeks that followed — useful unless you're mid-quarter on a budget set at the previous rates. OpenAI has retired GPT-3.5, GPT-4, GPT-4 Turbo, and most recently GPT-4o in sequence; notice periods have ranged from several months down to two weeks. Silent in-place updates are standard practice: the model behind a version string in January isn't necessarily the one running in July. If your workflow depends on AI behaving predictably at a known cost, that's the baseline to plan around.

Power user: silent model updates and pipeline reproducibility

One underappreciated issue with cloud APIs: the model behind a version string isn't frozen. Providers update models in place — sometimes improving quality, sometimes introducing regressions in specific areas. If you're building something that depends on consistent output format or behavior, this matters. A prompt that produced structured JSON last month might not today, and there's no diff to look at.

With a local model, you control when you update. You can pin to a specific quantization, test a new version in staging, and roll back if something breaks. For an automated pipeline, that reproducibility matters.
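One way to make "test a new version in staging" concrete is a small shape check you run against a model's reply before promoting an update. The sketch below is an assumption about how such a check might look, not a description of any particular pipeline: the function name, the expected keys, and the sample replies are all hypothetical. It only inspects the reply text, so it works the same whether that text came from a local model or a cloud API.

```python
import json


def check_reply_shape(reply_text: str, required_keys: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the reply still fits."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, expected_type in required_keys.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}: expected {expected_type.__name__}")
    return problems


# Hypothetical contract for a ticket-triage prompt.
EXPECTED = {"title": str, "tags": list, "priority": int}

if __name__ == "__main__":
    good = '{"title": "Disk full", "tags": ["ops"], "priority": 2}'
    chatty = 'Sure! Here is the JSON: {"title": "Disk full"}'
    drifted = '{"title": "Disk full", "tags": "ops"}'
    for name, reply in [("good", good), ("chatty", chatty), ("drifted", drifted)]:
        print(name, check_reply_shape(reply, EXPECTED))
```

Run a handful of saved prompts through the candidate model, feed each reply to the check, and only switch the pipeline over when every list comes back empty. If something fails, the pinned version is still there to roll back to.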

§ 4

Where the cloud still wins

Frontier models don't run locally. What does run — Gemma 4's 26B MoE, Qwen 3.5's 9B — is stronger than anything available a year and a half ago for the day-to-day work most people do: drafting, coding, summarization, Q&A. The cloud still has the edge on multi-step reasoning across long documents and on tasks that need fine distinctions held over a lot of context. Whether you'll feel that gap depends on what you're asking the model to do.

Setup costs time: Docker, GPU drivers, the occasional dependency that breaks on update. AI assistance can make all of this easier to manage. A handful of workloads also don't have great local equivalents yet: certain fine-tuned specialist models, some multimodal pipelines, real-time image generation at scale.

I run a hybrid stack. Ollama + llama.cpp cover daily chat and the automation pipelines on a Dell Precision with an 8GB Quadro RTX 4000 (suboptimal, yet viable). OpenRouter handles the work where I'd notice the quality difference, routed through Zero Data Retention endpoints. The split works because each side does what it's good at.
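A hybrid split like this comes down to a routing decision per request. The sketch below shows one possible shape for that decision. It is not the author's actual configuration: the task names and the rule that sensitive data always stays local are hypothetical, and the base URLs assume a default Ollama install (its OpenAI-compatible endpoint on port 11434) and OpenRouter's public API. Zero Data Retention settings are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str
    needs_api_key: bool


# Assumed defaults: Ollama's OpenAI-compatible endpoint on its standard port,
# and OpenRouter's public API base URL.
LOCAL = Backend("local", "http://localhost:11434/v1", needs_api_key=False)
CLOUD = Backend("cloud", "https://openrouter.ai/api/v1", needs_api_key=True)

# Hypothetical task labels where a frontier model's quality gap is noticeable.
QUALITY_SENSITIVE_TASKS = {"long_document_reasoning", "multi_step_analysis"}


def pick_backend(task: str, contains_sensitive_data: bool = False) -> Backend:
    """Choose where a request runs; sensitive data never leaves the machine."""
    if contains_sensitive_data:
        return LOCAL
    if task in QUALITY_SENSITIVE_TASKS:
        return CLOUD
    return LOCAL


if __name__ == "__main__":
    for task, sensitive in [
        ("chat", False),
        ("long_document_reasoning", False),
        ("long_document_reasoning", True),
    ]:
        backend = pick_backend(task, sensitive)
        print(f"{task} (sensitive={sensitive}) -> {backend.name} at {backend.base_url}")
```

Because both endpoints speak the same OpenAI-style API, the calling code only needs the chosen base URL and, for the cloud side, an API key. The routing rule itself is the part worth tuning to your own workloads.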

§ 5

Where do you land?

A few questions to find the setup that fits.

How much does it matter to you where your data goes?

Is that a hard requirement — compliance, contract, or organizational policy?

Do you prefer to own your tools rather than depend on a platform?

Compliance-grade

Own your environment

Build around a local AI interface — software that sits between you and your models, on hardware you control. Run inference locally wherever possible to keep data inside your environment. When a task genuinely needs cloud inference, require an enterprise agreement with a data processing addendum before connecting anything.

The rest of this series walks through building this stack from scratch.

Privacy by choice

Local-first, cloud-fallback

A local AI interface gives you a foundation you own. Run inference locally when your hardware supports it. When it doesn't, OpenRouter's Zero Data Retention endpoints are a meaningful middle ground — cloud convenience without your prompts being stored or used for training.

Local inference doesn't always need a big hardware investment. The series covers reasonable starting points at most budget levels.

Flexibility by design

Own your interface layer

A local AI interface gives you portability and independence from any single provider's product decisions. Pair it with whatever inference backend fits — local hardware, a self-hosted model server, or a cloud aggregator like OpenRouter for multi-model access through one endpoint.

Even without strong privacy requirements, owning your interface layer means your workflow isn't subject to someone else's roadmap.

Managed platform

Convenience, with trade-offs

Proprietary platforms like ChatGPT, Claude.ai, or Gemini are polished and low-friction. The trade-offs: data handling lives in their terms, your workflow lives inside their product, and your access depends on their pricing and availability.

A reasonable place to start. Worth revisiting as your needs grow.

§ 6

What this series builds

The rest of this series walks through setting up the stack I actually run — a local/cloud hybrid that handles chat, search, automation, and remote access. Each article ends with something you can use.

1 Why self-host? (this article)

2 Open WebUI: your AI interface

3 Ollama: local models, easy setup

4 llama.cpp: higher performance, more control

5 SearXNG: web-aware conversations

6 Remote access: Cloudflare Tunnel

7 Security and privacy

8 Image generation

9 Open WebUI in practice

Article 2, on Open WebUI, is the practical starting point. It gets a working interface running and then forks into three paths: cloud API, local via Ollama, or local via llama.cpp. Pick the one that fits your hardware and go from there.