Skip to content

Ollama & Aspire

Key Points

  • Ollama: easy local LLM runner. ollama pull llama3 and you have a local API.
  • Microsoft.Extensions.AI.Ollama: native .NET integration via OllamaChatClient.
  • .NET Aspire has Ollama integration: builder.AddOllama(...). Local dev experience trivial.
  • Best for prototyping, dev/test, when you want any model locally without setup pain.

Ollama

# Install
brew install ollama   # macOS
# or download from ollama.com

# Pull model
ollama pull llama3.3
ollama pull phi4
ollama pull nomic-embed-text   # for embeddings

# Run interactively
ollama run llama3.3

Ollama serves on http://localhost:11434. OpenAI-compatible API.

.NET integration

<PackageReference Include="Microsoft.Extensions.AI.Ollama" />
IChatClient chat = new OllamaChatClient(
    new Uri("http://localhost:11434"),
    "llama3.3");

var resp = await chat.GetResponseAsync("Hello!");

Embeddings

IEmbeddingGenerator<string, Embedding<float>> embed =
    new OllamaEmbeddingGenerator(new Uri("http://localhost:11434"), "nomic-embed-text");

var emb = await embed.GenerateAsync("Hello");

Aspire integration

// AppHost
var ollama = builder.AddOllama("ollama")
    .AddModel("chat", "llama3.3:latest")
    .AddModel("embed", "nomic-embed-text:latest")
    .WithDataVolume();

builder.AddProject<Projects.MyApi>("api")
    .WithReference(ollama.GetResource("chat"));

aspire run → Ollama container started; models pulled; .NET API connects.

API project

// Aspire wires this up automatically
builder.AddOllamaApiClient("chat");

// As IChatClient
public class C(IChatClient chat) { /* ... */ }

OpenAI-compatible mode

Ollama exposes OpenAI-compatible endpoint:

new OpenAIClient(
    new ApiKeyCredential("ollama"),
    new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") })
    .AsChatClient("llama3.3");

Lets existing OpenAI code work against local model.

Model formats

Ollama uses GGUF (llama.cpp format). Quantizations: - Q4_K_M: balanced quality/size (default). - Q5_K_M: higher quality. - Q8_0: near-fp16.

ollama pull llama3.3:8b-instruct-q4_K_M

Custom models

# Modelfile
FROM llama3.3
SYSTEM "You are a helpful assistant for .NET developers."
PARAMETER temperature 0.3

ollama create dev-assistant -f ./Modelfile
ollama run dev-assistant

Use cases

  • Local dev: no API key; fast iteration.
  • Privacy demos: data never leaves machine.
  • Offline development: train without connectivity.
  • CI: deterministic; no API rate limits.
  • Edge servers: small fleet runs Ollama.

Performance

Ollama uses your CPU + GPU (if Metal / CUDA / ROCm available). Llama 3 8B Q4 on M3 Max: ~50 tok/s.

Compared to ONNX Runtime GenAI

Aspect Ollama ONNX Runtime GenAI
Setup ollama pull X Download ONNX
Integration HTTP Native .NET
Latency Slightly higher Lower
Models Llama, Phi, Mistral, ... Microsoft-supported
Mobile No Yes (.NET MAUI)
Production Servers Anywhere

For dev: Ollama. For shipping in app: ONNX. For server fleet: either.

Aspire dev experience

aspire run
# → ollama starts in container
# → models pulled
# → API connects via service discovery

Local-to-prod consistency.

Senior considerations

  • Pin model versions: llama3.3:latest floats; pin to specific.
  • Monitor disk: models are big; ollama list shows usage.
  • Gather hardware: GPU acceleration matters.
  • For prod: containerize; pre-load models; resource limits.

Cross-references