ONNX Runtime GenAI
Key Points
- ONNX Runtime GenAI runs LLMs locally in your .NET process — no API calls.
- Best for Phi-3 / Phi-4 SLMs and small Llama variants.
- Hardware: CPU works; CUDA/DirectML/CoreML accelerate.
- Use cases: privacy, offline, low-latency, mobile (.NET MAUI).
- Trade-offs: output quality is below frontier models, and deployment is more complex (model files, native runtimes, per-platform packages).
Setup
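<!-- Reference exactly one of these packages; each bundles the managed API with one native runtime. -->
<!-- CPU. The token-append generator API used below is assumed to need 0.6.0 or later; check the release notes. -->
<PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI" Version="0.6.0" />
<!-- NVIDIA GPU (CUDA) -->
<PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.Cuda" Version="0.6.0" />
<!-- DirectML (Windows GPU) -->
<PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.DirectML" Version="0.6.0" />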
Download model
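# Phi-4 mini ONNX (Hugging Face); the weights are stored with Git LFS
git lfs install
git clone https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
Models are distributed as a folder containing the ONNX weights, a genai_config.json file, and the tokenizer files. Pass the path of the folder that holds genai_config.json to Model.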
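A cloned repository may hold several model variants in subfolders, so it can help to locate the right folder before constructing a Model. The sketch below is an assumption about folder layout, not part of the library: `ModelFolder.Find` is a hypothetical helper that looks for a directory containing `genai_config.json` and at least one `.onnx` file.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ModelFolder
{
    // Returns the shortest-path directory under root (root included) that holds
    // genai_config.json plus at least one .onnx file, or null if none does.
    public static string? Find(string root)
    {
        if (!Directory.Exists(root)) return null;
        return Directory
            .EnumerateFiles(root, "genai_config.json", SearchOption.AllDirectories)
            .Select(file => Path.GetDirectoryName(file))
            .OfType<string>()                 // drops null directory names
            .OrderBy(dir => dir.Length)       // prefer the shallowest match
            .FirstOrDefault(dir => Directory.EnumerateFiles(dir, "*.onnx").Any());
    }
}
```

If `Find` returns null, the clone most likely fetched Git LFS pointer files only, or the path points at the wrong directory.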
Basic generation
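using Microsoft.ML.OnnxRuntimeGenAI;

// Point at the folder that contains genai_config.json.
using var model = new Model("/path/to/phi-4-onnx");
using var tokenizer = new Tokenizer(model);
using var stream = tokenizer.CreateStream(); // incremental decoder for streamed output

// Phi-3-style chat template; check the model card for your model's exact format.
var prompt = "<|user|>\nHello<|end|>\n<|assistant|>\n";
using var tokens = tokenizer.Encode(prompt);

// Search options are set through method calls, not an object initializer.
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 200);
generatorParams.SetSearchOption("do_sample", true); // top_p and temperature only apply when sampling
generatorParams.SetSearchOption("top_p", 0.9);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(tokens); // Encode returns Sequences, not a token span
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var newToken = generator.GetSequence(0)[^1];
    // Decode through the stream so characters spanning several tokens are not split.
    Console.Write(stream.Decode(newToken));
}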
Streaming wrapper
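// Needs: using System.Runtime.CompilerServices; (for [EnumeratorCancellation])
// Assumes model and tokenizer are fields of the containing class.
public async IAsyncEnumerable<string> GenerateStreaming(string prompt, [EnumeratorCancellation] CancellationToken ct = default)
{
    using var generatorParams = new GeneratorParams(model); // set search options as in "Basic generation"
    using var generator = new Generator(model, generatorParams);
    using var stream = tokenizer.CreateStream();
    using var tokens = tokenizer.Encode(prompt);
    generator.AppendTokenSequences(tokens);
    while (!generator.IsDone())
    {
        ct.ThrowIfCancellationRequested();
        // GenerateNextToken blocks while it computes, so run it on the thread pool.
        await Task.Run(() => generator.GenerateNextToken(), ct);
        var newToken = generator.GetSequence(0)[^1];
        yield return stream.Decode(newToken);
    }
}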
IChatClient adapter
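// Requires the Microsoft.Extensions.AI.Abstractions package.
public sealed class OnnxChatClient(Model model, Tokenizer tokenizer) : IChatClient
{
    public async Task<ChatResponse> GetResponseAsync(IEnumerable<ChatMessage> msgs, ChatOptions? opts = null, CancellationToken ct = default)
    {
        var prompt = FormatMessages(msgs); // Phi-formatted; helper not shown
        var sb = new StringBuilder();
        // GenerateStreaming is the streaming wrapper above, added as a member of this class.
        await foreach (var token in GenerateStreaming(prompt, ct))
            sb.Append(token);
        return new ChatResponse(new ChatMessage(ChatRole.Assistant, sb.ToString()));
    }
    public IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(/* ... */) { /* ... */ }
    // Recent Microsoft.Extensions.AI versions expose metadata through GetService, not a Metadata property.
    public object? GetService(Type serviceType, object? serviceKey = null) =>
        serviceKey is not null ? null
        : serviceType == typeof(ChatClientMetadata) ? new ChatClientMetadata("onnxruntime-genai", null, "phi-4")
        : serviceType.IsInstanceOfType(this) ? this : null;
    public void Dispose() { } // model and tokenizer are owned by the caller
}
(Recent Microsoft.ML.OnnxRuntimeGenAI releases appear to ship their own IChatClient implementation, OnnxRuntimeGenAIChatClient; check your package version before writing an adapter. Community packages may also exist on NuGet.)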
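The adapter relies on a FormatMessages helper that is not shown. A minimal sketch of the idea, assuming the Phi-3-style template (`<|role|>`, newline, content, `<|end|>`, newline) and using plain tuples instead of ChatMessage so it stays self-contained; `PhiPrompt` is a hypothetical name, and other Phi versions may expect different whitespace, so verify against the model card.

```csharp
using System.Collections.Generic;
using System.Text;

public static class PhiPrompt
{
    // Emits one "<|role|>\ncontent<|end|>\n" block per message, then an
    // open assistant tag so the model writes the reply.
    public static string Format(IEnumerable<(string Role, string Content)> messages)
    {
        var sb = new StringBuilder();
        foreach (var (role, content) in messages)
            sb.Append("<|").Append(role).Append("|>\n").Append(content).Append("<|end|>\n");
        return sb.Append("<|assistant|>\n").ToString();
    }
}
```

Inside the adapter, each ChatMessage would presumably be mapped to a tuple from its role value and text before calling `PhiPrompt.Format`.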
Hardware acceleration
// Pick one provider and install the matching NuGet package.
// "cuda" = NVIDIA GPUs; "dml" = DirectML (Windows; AMD/Intel/NVIDIA).
using var config = new Config("/path/to/onnx");
config.ClearProviders();
config.AppendProvider("cuda"); // or "dml"
using var gpuModel = new Model(config);
// CoreML (Apple Silicon): not every GenAI release exposes it; check the release notes before relying on it.
// CPU fallback: no Config needed.
using var cpuModel = new Model("/path/to/onnx");
Model sizes
INT4-quantized models are small enough to run on phones and lightweight laptops.
Performance
| Hardware | Phi-3 mini INT4 throughput (approx.) |
|---|---|
| CPU (modern) | ~10 tok/s |
| Apple M3 Max | ~50 tok/s |
| RTX 4090 | ~150+ tok/s |
| Phone (modern) | ~5-10 tok/s |
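These figures are rough, so it is worth measuring on the hardware you ship to. The sketch below is one way to do that; `Throughput.Measure` is a hypothetical helper that times any token-producing step, which keeps it independent of the GenAI API.

```csharp
using System;
using System.Diagnostics;

public static class Throughput
{
    // Calls generateNextToken until it returns false or maxTokens steps have
    // returned true, then reports counted steps per second of wall-clock time.
    public static double Measure(Func<bool> generateNextToken, int maxTokens)
    {
        var sw = Stopwatch.StartNew();
        int produced = 0;
        while (produced < maxTokens && generateNextToken())
            produced++;
        sw.Stop();
        return produced == 0 ? 0 : produced / sw.Elapsed.TotalSeconds;
    }
}
```

With a real generator the step would be something like `() => { generator.GenerateNextToken(); return !generator.IsDone(); }`. The first step also includes prompt processing, so short runs understate steady-state speed.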
.NET MAUI mobile app
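// In MAUI app
// Needs: using System.Runtime.CompilerServices; using Microsoft.ML.OnnxRuntimeGenAI;
public sealed class LocalAi : IDisposable
{
    private Model? _model;
    private Tokenizer? _tok;

    // Loading reads the weights from disk; keep it off the UI thread.
    public Task LoadAsync(string modelPath) => Task.Run(() =>
    {
        _model = new Model(modelPath);
        _tok = new Tokenizer(_model);
    });

    public async IAsyncEnumerable<string> GenerateAsync(string prompt, [EnumeratorCancellation] CancellationToken ct = default)
    {
        if (_model is null || _tok is null) throw new InvalidOperationException("Call LoadAsync first.");
        using var generatorParams = new GeneratorParams(_model);
        using var generator = new Generator(_model, generatorParams);
        using var stream = _tok.CreateStream();
        using var tokens = _tok.Encode(prompt);
        generator.AppendTokenSequences(tokens);
        while (!generator.IsDone())
        {
            ct.ThrowIfCancellationRequested();
            await Task.Run(() => generator.GenerateNextToken(), ct); // blocking call, run off the UI thread
            yield return stream.Decode(generator.GetSequence(0)[^1]);
        }
    }

    public void Dispose() { _tok?.Dispose(); _model?.Dispose(); }
}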
Bundle model with app or download on first run.
Model bundling
For desktop apps: add the model folder to the project as content so that it is copied to the build and publish output.
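The original project-file snippet for this step is missing here. A typical MSBuild fragment looks like the sketch below; the `models/phi-4-onnx` path is a hypothetical location inside the project directory, not something the source specifies.

```xml
<ItemGroup>
  <!-- Copies every file under the (hypothetical) model folder next to the app binaries. -->
  <Content Include="models/phi-4-onnx/**" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>
```

At run time the model path can then be resolved relative to `AppContext.BaseDirectory`.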
For mobile: download on first run; cache in app data.
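The download-on-first-run pattern can be sketched as follows. `ModelCache` and the `.complete` marker file are illustrative assumptions, and the actual transfer is injected as a delegate so the caching logic stays independent of any particular HTTP client or model host.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ModelCache
{
    // Returns a local model directory, invoking download only when the
    // completion marker is missing (first run or an interrupted download).
    public static async Task<string> EnsureAsync(
        string cacheRoot, string modelName,
        Func<string, CancellationToken, Task> download,
        CancellationToken ct = default)
    {
        var modelDir = Path.Combine(cacheRoot, modelName);
        var marker = Path.Combine(modelDir, ".complete");
        if (File.Exists(marker)) return modelDir;

        // Start clean so a partial download is never mistaken for a full one.
        if (Directory.Exists(modelDir)) Directory.Delete(modelDir, recursive: true);
        Directory.CreateDirectory(modelDir);
        await download(modelDir, ct);
        await File.WriteAllTextAsync(marker, DateTime.UtcNow.ToString("O"), ct);
        return modelDir;
    }
}
```

In a MAUI app, `FileSystem.AppDataDirectory` is a natural choice for `cacheRoot`. The marker is written last so that a crash in the middle of a download triggers a fresh attempt on the next launch.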
Memory
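Phi-4 (14B) at INT4 needs about 7 GB of RAM for the weights alone; Phi-3 mini (3.8B) at INT4 needs about 2 GB. Budget extra for the KV cache and runtime overhead, and plan accordingly.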
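These figures follow from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. A small sketch (`ModelMemory` is a hypothetical helper; it counts weights only and ignores the KV cache, activations, and quantization metadata):

```csharp
public static class ModelMemory
{
    // Weight footprint in GB (10^9 bytes): parameters x bits per weight / 8.
    public static double WeightGigabytes(double billionParameters, int bitsPerWeight)
        => billionParameters * bitsPerWeight / 8.0;
}
```

For example, `ModelMemory.WeightGigabytes(14, 4)` gives 7 and `ModelMemory.WeightGigabytes(3.8, 4)` gives 1.9, matching the estimates above.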
Use cases
- Privacy-sensitive (medical, financial): no API call.
- Offline (mobile, edge IoT).
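- Real-time: no network round trip, so time to first token can be low once the model is loaded. The initial model load is slow, so load once and reuse.
- Zero marginal cost per call once the model is shipped.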
When NOT to use
- Need frontier quality.
- Long context.
- Function calling (Phi has limited support; quality varies).
- Large team / many users — managed API simpler.
Senior considerations
- Quantize for size — INT4 usually fine.
- Test on target hardware — perf varies.
- Measure quality on YOUR domain — Phi great for narrow tasks.
- Fallback strategy: if local inference fails, fall back to a hosted API.
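The fallback strategy can be sketched as a thin wrapper over two generation delegates. `FallbackGenerator` is a hypothetical type; the local and remote delegates stand in for an ONNX-backed generator and a hosted API client.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class FallbackGenerator
{
    private readonly Func<string, CancellationToken, Task<string>> _local;
    private readonly Func<string, CancellationToken, Task<string>> _remote;

    public FallbackGenerator(
        Func<string, CancellationToken, Task<string>> local,
        Func<string, CancellationToken, Task<string>> remote)
    {
        _local = local;
        _remote = remote;
    }

    // Tries the local model first; any failure that is not caused by the
    // caller cancelling falls through to the remote API.
    public async Task<string> GenerateAsync(string prompt, CancellationToken ct = default)
    {
        try
        {
            return await _local(prompt, ct);
        }
        catch (Exception) when (!ct.IsCancellationRequested)
        {
            return await _remote(prompt, ct);
        }
    }
}
```

Falling back sends the prompt off the device, so for the privacy-sensitive use cases above the fallback should be opt-in or disabled.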