Ollama & Aspire
Key Points
- Ollama: easy local LLM runner.
ollama pull llama3and you have a local API. Microsoft.Extensions.AI.Ollama: native .NET integration viaOllamaChatClient.- .NET Aspire has Ollama integration:
builder.AddOllama(...). Local dev experience trivial. - Best for prototyping, dev/test, when you want any model locally without setup pain.
Ollama
# Install
brew install ollama # macOS
# or download from ollama.com
# Pull model
ollama pull llama3.3
ollama pull phi4
ollama pull nomic-embed-text # for embeddings
# Run interactively
ollama run llama3.3
Ollama serves on http://localhost:11434. OpenAI-compatible API.
.NET integration
IChatClient chat = new OllamaChatClient(
new Uri("http://localhost:11434"),
"llama3.3");
var resp = await chat.GetResponseAsync("Hello!");
Embeddings
IEmbeddingGenerator<string, Embedding<float>> embed =
new OllamaEmbeddingGenerator(new Uri("http://localhost:11434"), "nomic-embed-text");
var emb = await embed.GenerateAsync("Hello");
Aspire integration
// AppHost
var ollama = builder.AddOllama("ollama")
.AddModel("chat", "llama3.3:latest")
.AddModel("embed", "nomic-embed-text:latest")
.WithDataVolume();
builder.AddProject<Projects.MyApi>("api")
.WithReference(ollama.GetResource("chat"));
aspire run → Ollama container started; models pulled; .NET API connects.
API project
// Aspire wires this up automatically
builder.AddOllamaApiClient("chat");
// As IChatClient
public class C(IChatClient chat) { /* ... */ }
OpenAI-compatible mode
Ollama exposes OpenAI-compatible endpoint:
new OpenAIClient(
new ApiKeyCredential("ollama"),
new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") })
.AsChatClient("llama3.3");
Lets existing OpenAI code work against local model.
Model formats
Ollama uses GGUF (llama.cpp format). Quantizations: - Q4_K_M: balanced quality/size (default). - Q5_K_M: higher quality. - Q8_0: near-fp16.
Custom models
# Modelfile
FROM llama3.3
SYSTEM "You are a helpful assistant for .NET developers."
PARAMETER temperature 0.3
ollama create dev-assistant -f ./Modelfile
ollama run dev-assistant
Use cases
- Local dev: no API key; fast iteration.
- Privacy demos: data never leaves machine.
- Offline development: train without connectivity.
- CI: deterministic; no API rate limits.
- Edge servers: small fleet runs Ollama.
Performance
Ollama uses your CPU + GPU (if Metal / CUDA / ROCm available). Llama 3 8B Q4 on M3 Max: ~50 tok/s.
Compared to ONNX Runtime GenAI
| Aspect | Ollama | ONNX Runtime GenAI |
|---|---|---|
| Setup | ollama pull X | Download ONNX |
| Integration | HTTP | Native .NET |
| Latency | Slightly higher | Lower |
| Models | Llama, Phi, Mistral, ... | Microsoft-supported |
| Mobile | No | Yes (.NET MAUI) |
| Production | Servers | Anywhere |
For dev: Ollama. For shipping in app: ONNX. For server fleet: either.
Aspire dev experience
Local-to-prod consistency.
Senior considerations
- Pin model versions:
llama3.3:latestfloats; pin to specific. - Monitor disk: models are big;
ollama listshows usage. - Gather hardware: GPU acceleration matters.
- For prod: containerize; pre-load models; resource limits.