Perf Trade-offs: ONNX vs Ollama
Key Points
- ONNX Runtime GenAI: native in-process; lower latency; better for shipping in apps.
- Ollama: HTTP API; easier setup; better for dev/server.
- Quality: comparable for same model + quant.
- Hardware: ONNX has more execution providers (DirectML, CoreML, CUDA).
- For mobile / edge embedded: ONNX. For server-side / dev: Ollama.
Latency
| Scenario | ONNX | Ollama |
|---|---|---|
| Cold start | ~1-2s (load model) | ~3-5s (load + warm-up) |
| Warm inference | ~30-50ms first token | ~50-100ms first token |
| Throughput (tok/s) | Comparable | Comparable |
ONNX edges out on cold start because no HTTP / process boundary.
Memory
Same — model weights dominate. Ollama adds small daemon overhead.
Setup complexity
Ollama wins for prototyping.
Production fit
| Scenario | Pick |
|---|---|
| Mobile app | ONNX (no daemon) |
| Desktop app | ONNX or Ollama (Ollama spawns server) |
| Edge IoT | ONNX |
| Server-side | Ollama (or vLLM for higher perf) |
| Dev / CI | Ollama |
| Aspire local dev | Ollama (Aspire integration) |
Hardware execution providers
ONNX: - CPU - CUDA (NVIDIA) - DirectML (Windows AMD/Intel/NVIDIA) - CoreML (Apple Silicon) - ROCm (AMD) - TensorRT (NVIDIA optimized)
Ollama: - CPU - Metal (Apple) - CUDA - ROCm
For broad device support: ONNX wins.
Model formats
ONNX uses ONNX format. Microsoft publishes Phi as ONNX. Ollama uses GGUF (llama.cpp). Most models on HuggingFace as GGUF.
For shipping Phi: ONNX. For Llama / Mistral: Ollama easier.
Function calling
Both: limited compared to frontier models. Some quants of Llama 3.x support tool calls; Phi has experimental support.
For agentic workloads with tools: prefer frontier models via API.
Streaming
Both stream tokens. ONNX gives finer control (manual token loop).
Updates
ONNX: re-download model bundle. Ollama: ollama pull X:tag.
Concurrency
ONNX: single-process; concurrent generators (tunable). Memory scales. Ollama: HTTP API; queues requests; one model loaded at a time (configurable).
Cost
Both free. Hardware cost dominates.
| Scenario | Cost |
|---|---|
| Run on existing user device | $0 |
| Run on server | GPU $$/hr |
| Run on phone | battery / thermal cost |
Senior decision matrix
Shipping in customer app (mobile/desktop): ONNX
Server-side production: vLLM > Ollama (vLLM is faster for multi-user)
Dev/staging/CI: Ollama
Prototype: Ollama
Aspire dev orchestration: Ollama
Edge IoT: ONNX
vLLM as production alternative
For server-side multi-tenant local inference: - vLLM (Python): production-grade; PagedAttention; better throughput. - TensorRT-LLM (NVIDIA): fastest on NVIDIA hardware. - Triton Inference Server: NVIDIA's serving layer.
For .NET: call via HTTP (OpenAI-compatible endpoint).
Pitfalls
- Quantization quality drift: Q2 / Q3 noticeably worse; stick Q4+ for quality.
- Mixing model versions: clients depending on specific behavior break.
- Memory pressure on shared hardware: another process steals RAM.
- Thermal throttling on mobile: long generations slow down.