Skip to content

Perf Trade-offs: ONNX vs Ollama

Key Points

  • ONNX Runtime GenAI: native in-process; lower latency; better for shipping in apps.
  • Ollama: HTTP API; easier setup; better for dev/server.
  • Quality: comparable for same model + quant.
  • Hardware: ONNX has more execution providers (DirectML, CoreML, CUDA).
  • For mobile / edge embedded: ONNX. For server-side / dev: Ollama.

Latency

Scenario ONNX Ollama
Cold start ~1-2s (load model) ~3-5s (load + warm-up)
Warm inference ~30-50ms first token ~50-100ms first token
Throughput (tok/s) Comparable Comparable

ONNX edges out on cold start because no HTTP / process boundary.

Memory

Same — model weights dominate. Ollama adds small daemon overhead.

Setup complexity

Ollama:    `ollama pull X` → done.
ONNX:      Download ONNX files → reference in code → handle paths.

Ollama wins for prototyping.

Production fit

Scenario Pick
Mobile app ONNX (no daemon)
Desktop app ONNX or Ollama (Ollama spawns server)
Edge IoT ONNX
Server-side Ollama (or vLLM for higher perf)
Dev / CI Ollama
Aspire local dev Ollama (Aspire integration)

Hardware execution providers

ONNX: - CPU - CUDA (NVIDIA) - DirectML (Windows AMD/Intel/NVIDIA) - CoreML (Apple Silicon) - ROCm (AMD) - TensorRT (NVIDIA optimized)

Ollama: - CPU - Metal (Apple) - CUDA - ROCm

For broad device support: ONNX wins.

Model formats

ONNX uses ONNX format. Microsoft publishes Phi as ONNX. Ollama uses GGUF (llama.cpp). Most models on HuggingFace as GGUF.

For shipping Phi: ONNX. For Llama / Mistral: Ollama easier.

Function calling

Both: limited compared to frontier models. Some quants of Llama 3.x support tool calls; Phi has experimental support.

For agentic workloads with tools: prefer frontier models via API.

Streaming

Both stream tokens. ONNX gives finer control (manual token loop).

Updates

ONNX: re-download model bundle. Ollama: ollama pull X:tag.

Concurrency

ONNX: single-process; concurrent generators (tunable). Memory scales. Ollama: HTTP API; queues requests; one model loaded at a time (configurable).

Cost

Both free. Hardware cost dominates.

Scenario Cost
Run on existing user device $0
Run on server GPU $$/hr
Run on phone battery / thermal cost

Senior decision matrix

Shipping in customer app (mobile/desktop): ONNX
Server-side production: vLLM > Ollama (vLLM is faster for multi-user)
Dev/staging/CI: Ollama
Prototype: Ollama
Aspire dev orchestration: Ollama
Edge IoT: ONNX

vLLM as production alternative

For server-side multi-tenant local inference: - vLLM (Python): production-grade; PagedAttention; better throughput. - TensorRT-LLM (NVIDIA): fastest on NVIDIA hardware. - Triton Inference Server: NVIDIA's serving layer.

For .NET: call via HTTP (OpenAI-compatible endpoint).

Pitfalls

  • Quantization quality drift: Q2 / Q3 noticeably worse; stick Q4+ for quality.
  • Mixing model versions: clients depending on specific behavior break.
  • Memory pressure on shared hardware: another process steals RAM.
  • Thermal throttling on mobile: long generations slow down.

Cross-references