Running AI on your own infrastructure: on-premise, OSS, GPU and APU

The question regularly comes up in discussions with technical leaders: can you run a high-performance AI model at home, without sending data to OpenAI or Anthropic?

The short answer: yes, and it’s more affordable than you might think. The long answer: it depends on model size, performance requirements and hardware budget.

What an on-premise deployment is

On-premise means that the model runs on your infrastructure, not on third-party servers. Your data never leaves your perimeter. You control versions. You have no marginal cost per request (only the infrastructure cost).

This is not necessarily synonymous with inferior performance. In 2025, the best open-weights (Llama 3 70B, Qwen 2.5 72B, Mistral Large 2) achieve comparable performance to proprietary models on many specialized tasks. The gap remains visible on very general tasks and complex reasoning tasks, but has narrowed considerably.

APU: the accessible surprise

The quiet revolution of the last two years has been in APUs - processors that integrate CPU and GPU on the same die, with high-bandwidth shared memory.

The Apple M4 Max (available late 2024) features 128 GB of unified memory at 500 GB/s bandwidth. A Llama 3 70B model quantised in Q4 (which fits in ~40 GB) runs on this machine at 15-20 tokens per second. This is sufficient for interactive use, well below the speed of a cloud API, but within the limits of a chat interface.

A Mac Studio M4 Ultra with 192 GB costs around 4,000 to 5,000 euros. It can run a 70B model locally, without a separate GPU, with a power consumption of 80-100W. For an SME or consultancy handling sensitive data and making 50-200 requests a day, the economic calculation can hold up against API costs over 2-3 years.

GPUs: the classic volume route

For larger volumes or larger models, the GPU remains the benchmark.

A server with 2 NVIDIA RTX 4090 GPUs (24 GB VRAM each, ~1,500 euros each in 2025) can run a 7-13B model at comfortable speeds. For a 70B model, you need 4 to 8 GPUs or professional GPUs (H100: 30,000+ euros, A100: 15,000+).

The surrounding infrastructure is also a cost: adapted server, cooling, UPS, system management. Expect to pay 30-50% of the GPU cost for the associated infrastructure.

The tools that make it possible

In 2023, running a local LLM required specialized skills. By 2025, tools had radically simplified the experience:

Ollama: one-command local LLM installation (ollama run llama3.2). OpenAI API-compatible interface, so your applications can point to your local instance.

llama.cpp: optimized inference engine for CPU and GPU, the basis for many tools. Supports quantization and runs on Windows, Linux and macOS.

Open WebUI: Ollama-compatible local web interface. Faithful reproduction of the ChatGPT experience, but on your own infrastructure.

LM Studio: desktop tool (Windows/Mac) for downloading and running local models, with graphical interface.

What it doesn’t solve

On-premise deployment solves the problem of data confidentiality. It doesn’t solve the problem of response quality on complex tasks (proprietary frontier models remain superior), nor model updating (you manage the lifecycle), nor infrastructure security (you bear the responsibility for securing your server).

It’s a compromise. Sovereignty has a cost in terms of internal resources. The choice must be a conscious one, not a default one.

What an on-premise deployment is#

APU: the accessible surprise#

GPUs: the classic volume route#

The tools that make it possible#

What it doesn’t solve#

Related

Open source vs proprietary: control, dependency, and the real trade-off

What an on-premise deployment is

APU: the accessible surprise

GPUs: the classic volume route

The tools that make it possible

What it doesn’t solve