Perf Trade-offs: ONNX vs Ollama

Key Points

ONNX Runtime GenAI: native in-process; lower latency; better for shipping in apps.
Ollama: HTTP API; easier setup; better for dev/server.
Quality: comparable for same model + quant.
Hardware: ONNX has more execution providers (DirectML, CoreML, CUDA).
For mobile / edge embedded: ONNX. For server-side / dev: Ollama.

Latency

Scenario	ONNX	Ollama
Cold start	~1-2s (load model)	~3-5s (load + warm-up)
Warm inference	~30-50ms first token	~50-100ms first token
Throughput (tok/s)	Comparable	Comparable

ONNX edges out on cold start because no HTTP / process boundary.

Memory

Same — model weights dominate. Ollama adds small daemon overhead.

Setup complexity

Ollama:    `ollama pull X` → done.
ONNX:      Download ONNX files → reference in code → handle paths.

Ollama wins for prototyping.

Production fit

Scenario	Pick
Mobile app	ONNX (no daemon)
Desktop app	ONNX or Ollama (Ollama spawns server)
Edge IoT	ONNX
Server-side	Ollama (or vLLM for higher perf)
Dev / CI	Ollama
Aspire local dev	Ollama (Aspire integration)

Hardware execution providers

ONNX: - CPU - CUDA (NVIDIA) - DirectML (Windows AMD/Intel/NVIDIA) - CoreML (Apple Silicon) - ROCm (AMD) - TensorRT (NVIDIA optimized)

Ollama: - CPU - Metal (Apple) - CUDA - ROCm

For broad device support: ONNX wins.

Model formats

ONNX uses ONNX format. Microsoft publishes Phi as ONNX. Ollama uses GGUF (llama.cpp). Most models on HuggingFace as GGUF.

For shipping Phi: ONNX. For Llama / Mistral: Ollama easier.

Function calling

Both: limited compared to frontier models. Some quants of Llama 3.x support tool calls; Phi has experimental support.

For agentic workloads with tools: prefer frontier models via API.

Streaming

Both stream tokens. ONNX gives finer control (manual token loop).

Updates

ONNX: re-download model bundle. Ollama: ollama pull X:tag.

Concurrency

ONNX: single-process; concurrent generators (tunable). Memory scales. Ollama: HTTP API; queues requests; one model loaded at a time (configurable).

Cost

Both free. Hardware cost dominates.

Scenario	Cost
Run on existing user device	$0
Run on server	GPU $$/hr
Run on phone	battery / thermal cost

Senior decision matrix

Shipping in customer app (mobile/desktop): ONNX
Server-side production: vLLM > Ollama (vLLM is faster for multi-user)
Dev/staging/CI: Ollama
Prototype: Ollama
Aspire dev orchestration: Ollama
Edge IoT: ONNX

vLLM as production alternative

For server-side multi-tenant local inference: - vLLM (Python): production-grade; PagedAttention; better throughput. - TensorRT-LLM (NVIDIA): fastest on NVIDIA hardware. - Triton Inference Server: NVIDIA's serving layer.

For .NET: call via HTTP (OpenAI-compatible endpoint).

Pitfalls

Quantization quality drift: Q2 / Q3 noticeably worse; stick Q4+ for quality.
Mixing model versions: clients depending on specific behavior break.
Memory pressure on shared hardware: another process steals RAM.
Thermal throttling on mobile: long generations slow down.