Ollama & Aspire

Key Points

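Ollama: an easy local LLM runner. Run ollama pull llama3 and you have a model behind a local HTTP API.
Microsoft.Extensions.AI.Ollama: .NET integration via OllamaChatClient, which implements IChatClient. The package shipped only as a preview and has since been deprecated in favor of OllamaSharp.
.NET Aspire has an Ollama integration (from the Aspire Community Toolkit): builder.AddOllama(...) reduces local setup to a few lines of AppHost code.
Best for prototyping and dev/test, when you want to try a model locally without setup pain.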

Ollama

# Install
brew install ollama   # macOS
# or download from ollama.com

# Pull model
ollama pull llama3.3            # 70B model, roughly a 43 GB download
ollama pull phi4
ollama pull nomic-embed-text   # for embeddings

# Run interactively
ollama run llama3.3

Ollama serves its native REST API on http://localhost:11434 and an OpenAI-compatible API under /v1.
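To make the native API concrete, here is a minimal sketch of one non-streaming call to POST /api/chat using only HttpClient and System.Text.Json. The helper name BuildChatRequest and the --send switch are illustrative choices, not part of Ollama; the request is sent only when --send is passed, so the sketch also runs without a server.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Build the JSON body for Ollama's native chat endpoint.
// "stream": false asks for one JSON object instead of a stream of chunks.
string BuildChatRequest(string model, string userMessage) =>
    JsonSerializer.Serialize(new
    {
        model,
        messages = new[] { new { role = "user", content = userMessage } },
        stream = false
    });

string body = BuildChatRequest("llama3.3", "Hello!");
Console.WriteLine(body);

// Only talk to the server when asked to (hypothetical switch for this sketch).
if (args.Contains("--send"))
{
    using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
    using var content = new StringContent(body, Encoding.UTF8, "application/json");
    using var response = await http.PostAsync("/api/chat", content);
    response.EnsureSuccessStatusCode();

    // The non-streaming reply carries the answer in message.content.
    using var reply = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
    Console.WriteLine(reply.RootElement.GetProperty("message").GetProperty("content").GetString());
}
```

In application code you would normally go through IChatClient instead; the raw call is mainly useful for checking that the server and model respond.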

.NET integration

<PackageReference Include="Microsoft.Extensions.AI.Ollama" />

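using Microsoft.Extensions.AI;

// OllamaChatClient implements IChatClient, so calling code stays provider-agnostic.
IChatClient chat = new OllamaChatClient(
    new Uri("http://localhost:11434"),
    "llama3.3");

var resp = await chat.GetResponseAsync("Hello!");
Console.WriteLine(resp.Text);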

Embeddings

IEmbeddingGenerator<string, Embedding<float>> embed =
    new OllamaEmbeddingGenerator(new Uri("http://localhost:11434"), "nomic-embed-text");

var emb = await embed.GenerateAsync("Hello");
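Embedding vectors are usually compared with cosine similarity. The following is a minimal sketch in plain C#; CosineSimilarity is a hypothetical helper, and the short vectors are made-up stand-ins for real embeddings (with a generator like the one above you would pass emb.Vector.Span).

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1; higher means more similar.
double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Note: yields NaN if either vector is all zeros.
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy vectors standing in for real embeddings.
float[] first = { 3f, 4f };
float[] same = { 3f, 4f };
float[] orthogonal = { -4f, 3f };

Console.WriteLine(CosineSimilarity(first, same));       // identical direction: prints 1
Console.WriteLine(CosineSimilarity(first, orthogonal)); // orthogonal: prints 0
```

If you would rather use a library routine, the System.Numerics.Tensors package provides TensorPrimitives.CosineSimilarity.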

Aspire integration

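// AppHost (Program.cs); uses the CommunityToolkit.Aspire.Hosting.Ollama package
var builder = DistributedApplication.CreateBuilder(args);

var ollama = builder.AddOllama("ollama")
    .WithDataVolume();   // keep pulled models across container restarts

// AddModel returns a model resource, so call it on the server resource each time
var chat = ollama.AddModel("chat", "llama3.3:latest");
var embed = ollama.AddModel("embed", "nomic-embed-text:latest");

builder.AddProject<Projects.MyApi>("api")
    .WithReference(chat);

builder.Build().Run();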

aspire run → Ollama container started; models pulled; .NET API connects.

API project

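// Client integration (CommunityToolkit.Aspire.OllamaSharp); "chat" matches the AppHost resource name,
// so Aspire supplies the endpoint and model through configuration
builder.AddOllamaApiClient("chat")
    .AddChatClient();   // in recent toolkit versions, this registers IChatClient

// Inject IChatClient wherever it is needed
public class C(IChatClient chat) { /* ... */ }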

OpenAI-compatible mode

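Ollama also exposes an OpenAI-compatible endpoint under /v1:

// Packages: OpenAI, Microsoft.Extensions.AI.OpenAI
IChatClient chat = new OpenAIClient(
    new ApiKeyCredential("ollama"),   // Ollama ignores the key, but the client requires one
    new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") })
    .GetChatClient("llama3.3")
    .AsIChatClient();   // older previews used .AsChatClient("llama3.3") on OpenAIClient

This lets existing OpenAI client code run against a local model by changing only the endpoint and model name.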

Model formats

Ollama uses GGUF (the llama.cpp format). Common quantizations:

Q4_K_M: balanced quality/size (default).
Q5_K_M: higher quality.
Q8_0: near-fp16.

ollama pull llama3.3:70b-instruct-q4_K_M   # Llama 3.3 ships only as a 70B model
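As a rough worked example of what these quantization levels mean for disk and memory, file size is approximately parameter count times bits per weight, divided by 8. The bits-per-weight figures below are approximate values commonly quoted for llama.cpp quantizations and are assumptions here, as is the helper name EstimateGigabytes; real GGUF files differ somewhat because some tensors are stored at other precisions.

```csharp
using System;
using System.Collections.Generic;

// Approximate bits per weight (assumed round figures, not exact for every model).
var bitsPerWeight = new Dictionary<string, double>
{
    ["Q4_K_M"] = 4.85,
    ["Q5_K_M"] = 5.7,
    ["Q8_0"] = 8.5,
    ["F16"] = 16.0,
};

// Size in GB (10^9 bytes) = billions of parameters * bits per weight / 8 bits per byte.
double EstimateGigabytes(double billionsOfParameters, double bits) =>
    billionsOfParameters * bits / 8.0;

foreach (var (quant, bpw) in bitsPerWeight)
{
    Console.WriteLine(
        $"{quant,-7} 8B: {EstimateGigabytes(8, bpw),5:F1} GB   70B: {EstimateGigabytes(70, bpw),6:F1} GB");
}
```

Under these assumptions the default Q4_K_M comes to roughly 5 GB for an 8B model and roughly 42 GB for a 70B model, which is why the quantization level largely decides whether a model fits in GPU memory.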

Custom models

# Modelfile
FROM llama3.3
SYSTEM "You are a helpful assistant for .NET developers."
PARAMETER temperature 0.3

ollama create dev-assistant -f ./Modelfile
ollama run dev-assistant

Use cases

Local dev: no API key; fast iteration.
Privacy demos: data never leaves machine.
Offline development: work without connectivity once the models are pulled.
CI: no API keys or rate limits; set a fixed seed and temperature 0 for repeatable output.
Edge servers: small fleet runs Ollama.
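For CI runs, output is only repeatable if sampling is pinned down. Below is a minimal sketch of a request body for Ollama's native POST /api/generate endpoint with a fixed seed and temperature 0. The helper name BuildReproducibleRequest, the prompt, and the seed value are illustrative, and identical output across different hardware or Ollama versions is not guaranteed.

```csharp
using System;
using System.Text.Json;

// Request body for /api/generate with sampling pinned via "options".
string BuildReproducibleRequest(string model, string prompt, int seed) =>
    JsonSerializer.Serialize(new
    {
        model,
        prompt,
        stream = false,
        options = new { seed, temperature = 0 }
    });

Console.WriteLine(BuildReproducibleRequest("llama3.3", "Name three .NET collection types.", 42));
```

Tests built on this are usually more robust when they assert on structure or keywords in the reply rather than on exact text.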

Performance

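Ollama uses the CPU and, where available, the GPU (Metal, CUDA, or ROCm). As a rough reference point, Llama 3 8B at Q4 on an M3 Max generates about 50 tok/s; actual throughput depends on hardware, quantization, and context length.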
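To measure throughput on your own hardware rather than rely on a quoted figure, note that non-streaming responses from Ollama's native API include eval_count (generated tokens) and eval_duration (nanoseconds). A minimal sketch of the arithmetic follows; TokensPerSecond is a hypothetical helper and the sample JSON holds made-up numbers, not a measurement.

```csharp
using System;
using System.Text.Json;

// tokens per second = generated tokens / generation time in seconds
double TokensPerSecond(JsonElement response)
{
    long evalCount = response.GetProperty("eval_count").GetInt64();
    long evalDurationNs = response.GetProperty("eval_duration").GetInt64();
    return evalCount / (evalDurationNs / 1e9);
}

// Illustrative values only: 200 tokens generated in 4 seconds, i.e. 50 tokens per second.
const string sample = "{\"eval_count\": 200, \"eval_duration\": 4000000000}";
using var parsed = JsonDocument.Parse(sample);
Console.WriteLine($"{TokensPerSecond(parsed.RootElement):F1} tok/s");
```

The same response also reports prompt_eval_count and prompt_eval_duration, which give prompt-processing speed separately from generation speed.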

Compared to ONNX Runtime GenAI

Aspect	Ollama	ONNX Runtime GenAI
Setup	`ollama pull X`	Download ONNX
Integration	HTTP	Native .NET
Latency	Slightly higher	Lower
Models	Llama, Phi, Mistral, ...	Microsoft-supported
Mobile	No	Yes (.NET MAUI)
Production	Servers	Anywhere

For dev: Ollama. For shipping in app: ONNX. For server fleet: either.

Aspire dev experience

aspire run
# → ollama starts in container
# → models pulled
# → API connects via service discovery

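Because the AppHost describes Ollama, its models, and the API in one place, the local setup stays close to how the same pieces are wired together in deployment.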

Senior considerations

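Pin model versions: llama3.3:latest floats; pin to a specific tag such as llama3.3:70b-instruct-q4_K_M.
Monitor disk: models are big; ollama list shows each model's size.
Check hardware: GPU acceleration matters; ollama ps shows whether a loaded model is running on GPU or CPU.
For prod: containerize, pre-load models, and set resource limits.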
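The pinning advice can be enforced with a small check, for example in CI over the model names used in the AppHost or in configuration. A minimal sketch, assuming model references of the form name:tag with an optional registry or namespace prefix; IsPinned is a hypothetical helper, not part of Ollama or Aspire.

```csharp
using System;

// A reference counts as pinned when it carries an explicit tag other than "latest".
bool IsPinned(string modelRef)
{
    // Look for the tag separator only after the last '/', so a registry port
    // such as "localhost:5000/..." is not mistaken for a tag.
    int lastSlash = modelRef.LastIndexOf('/');
    int colon = modelRef.IndexOf(':', lastSlash + 1);
    if (colon < 0) return false; // no tag resolves to "latest"

    string tag = modelRef[(colon + 1)..];
    return tag.Length > 0 && !tag.Equals("latest", StringComparison.OrdinalIgnoreCase);
}

foreach (var model in new[] { "llama3.3", "llama3.3:latest", "llama3.3:70b-instruct-q4_K_M" })
{
    // Prints "floating" for the first two and "pinned" for the last.
    Console.WriteLine($"{model}: {(IsPinned(model) ? "pinned" : "floating")}");
}
```

A named tag can still be republished upstream, so treat this as a guard against accidental floating references rather than a guarantee of byte-identical models.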