ONNX Runtime GenAI

Key Points

ONNX Runtime GenAI runs LLMs locally in your .NET process — no API calls.
Best for Phi-3 / Phi-4 SLMs and small Llama variants.
Hardware: CPU works; CUDA/DirectML/CoreML accelerate.
Use cases: privacy, offline, low-latency, mobile (.NET MAUI).
Trade-offs: output quality is below frontier models, and deployment is more complex than calling an API.

Setup

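<!-- Reference ONE of these packages, depending on the target hardware. -->
<!-- "0.5+" is not valid NuGet syntax; a plain version means "this version or later". -->
<!-- 0.6.0 is used here because the AppendTokens-style API in these notes appears to need it. -->
<!-- CPU -->
<PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI" Version="0.6.0" />
<!-- NVIDIA GPU (CUDA) -->
<PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.Cuda" Version="0.6.0" />
<!-- DirectML (Windows) -->
<PackageReference Include="Microsoft.ML.OnnxRuntimeGenAI.DirectML" Version="0.6.0" />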

Download model

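# Phi-4 mini ONNX (Hugging Face); the weights are stored with Git LFS
git lfs install
git clone https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx

Models are distributed in ONNX format together with a genai_config.json and tokenizer files. Point Model at the folder that contains genai_config.json; a repository may hold several variants in subfolders.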

Basic generation

using Microsoft.ML.OnnxRuntimeGenAI;

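// Folder that contains genai_config.json
using var model = new Model("/path/to/phi-4-onnx");
using var tokenizer = new Tokenizer(model);
using var tokenizerStream = tokenizer.CreateStream();

// Phi-3 style chat format; check the model card for your model's exact template
var prompt = "<|user|>\nHello<|end|>\n<|assistant|>\n";
using var sequences = tokenizer.Encode(prompt);

// Search options are set through method calls, not an object initializer
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 200);
generatorParams.SetSearchOption("do_sample", true);   // without sampling, top_p/temperature have no effect
generatorParams.SetSearchOption("top_p", 0.9);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);   // Encode returns Sequences, not a token span

while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var newToken = generator.GetSequence(0)[^1];
    // The stream decoder buffers partial multi-byte characters across tokens
    Console.Write(tokenizerStream.Decode(newToken));
}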

Streaming wrapper

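// Assumes `model` and `tokenizer` are fields of the enclosing class.
// Generation is synchronous, so each step blocks the caller; consume from a background thread in UI apps.
public async IAsyncEnumerable<string> GenerateStreaming(string prompt, [EnumeratorCancellation] CancellationToken ct = default)
{
    using var generator = new Generator(model, /* ... */);
    using var tokenizerStream = tokenizer.CreateStream();
    using var sequences = tokenizer.Encode(prompt);
    generator.AppendTokenSequences(sequences);

    while (!generator.IsDone())
    {
        ct.ThrowIfCancellationRequested();
        generator.GenerateNextToken();
        // Stream decoding keeps multi-byte characters intact across token boundaries
        yield return tokenizerStream.Decode(generator.GetSequence(0)[^1]);
    }
}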

IChatClient adapter

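// FormatMessages and GenerateStreaming are assumed to be members of this class (not shown here).
public class OnnxChatClient(Model model, Tokenizer tokenizer) : IChatClient
{
    public async Task<ChatResponse> GetResponseAsync(IEnumerable<ChatMessage> msgs, ChatOptions? opts = null, CancellationToken ct = default)
    {
        var prompt = FormatMessages(msgs);   // Phi-formatted
        var sb = new StringBuilder();
        await foreach (var token in GenerateStreaming(prompt, ct))
            sb.Append(token);
        return new ChatResponse(new ChatMessage(ChatRole.Assistant, sb.ToString()));
    }

    public IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(/* ... */) { /* ... */ }

    // Current Microsoft.Extensions.AI versions expose metadata via GetService instead of a Metadata property
    public object? GetService(Type serviceType, object? serviceKey = null) =>
        serviceKey is null && serviceType == typeof(ChatClientMetadata)
            ? new ChatClientMetadata("onnxruntime-genai", null, "phi-4")
            : null;

    // Model and Tokenizer are owned by the caller, so nothing is disposed here
    public void Dispose() { }
}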

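(Recent versions of Microsoft.ML.OnnxRuntimeGenAI ship their own IChatClient implementation, OnnxRuntimeGenAIChatClient; check whether your version includes it before writing an adapter. Community packages may also exist on NuGet.)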
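
The adapter above calls a FormatMessages helper that these notes do not show. A minimal sketch of what it could look like, assuming the Phi-3 style chat template (each turn written as `<|role|>`, newline, text, `<|end|>`, newline). `PhiPrompt` and `Format` are hypothetical names, and plain (role, text) tuples stand in for ChatMessage to keep the example self-contained. Other Phi releases use slightly different templates, so check the model card.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: builds a Phi-3 style prompt from (role, text) pairs.
public static class PhiPrompt
{
    public static string Format(IEnumerable<(string Role, string Text)> messages)
    {
        var sb = new StringBuilder();
        foreach (var (role, text) in messages)
        {
            // One turn: <|role|>\n text <|end|>\n
            sb.Append("<|").Append(role).Append("|>\n")
              .Append(text).Append("<|end|>\n");
        }
        // Open the assistant turn so the model continues from here.
        sb.Append("<|assistant|>\n");
        return sb.ToString();
    }
}
```

Inside the adapter, each ChatMessage would be mapped to its role name ("system", "user", "assistant") and text before being passed to a helper like this.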

Hardware acceleration

// Providers are selected through Config, not a Model constructor option.
// Pick ONE; availability depends on the installed package and platform.
using var config = new Config("/path/to/onnx");
config.ClearProviders();
config.AppendProvider("cuda");        // CUDA (NVIDIA)
// config.AppendProvider("dml");      // DirectML (Windows; AMD/Intel/NVIDIA)
// config.AppendProvider("coreml");   // CoreML (Mac Apple Silicon)
using var model = new Model(config);

// CPU fallback: no provider override
// using var model = new Model("/path/to/onnx");

Model sizes

Phi-3 mini fp16:   ~7 GB
Phi-3 mini int4:   ~2 GB
Phi-4 14B fp16:    ~28 GB
Phi-4 14B int4:    ~7 GB

INT4-quantized Phi-3 mini (~2 GB) fits on phones and lightweight laptops; Phi-4 14B at ~7 GB needs a machine with correspondingly more memory.
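
The sizes above follow from parameter count times bits per weight. A rough sketch of that arithmetic, assuming about 3.8 billion parameters for Phi-3 mini and 14 billion for Phi-4; `WeightSize.EstimateGb` is a hypothetical helper. Real files tend to be somewhat larger, since quantized formats store scale metadata and may keep some tensors at higher precision.

```csharp
// Hypothetical helper for back-of-the-envelope model size estimates.
public static class WeightSize
{
    // Approximate weight size in GB (10^9 bytes): parameters x bits per weight / 8.
    public static double EstimateGb(double parametersInBillions, int bitsPerWeight)
        => parametersInBillions * bitsPerWeight / 8.0;
}
```

For example, EstimateGb(3.8, 16) comes to about 7.6 and EstimateGb(14, 4) to 7, in line with the figures listed above.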

Performance

Hardware	Phi-3 mini int4
CPU (modern)	~10 tok/s
Apple M3 Max	~50 tok/s
RTX 4090	~150+ tok/s
Phone (modern)	~5-10 tok/s
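
These figures are rough and vary with device, quantization, and prompt length, so it is worth measuring on the target hardware. A small timing sketch; `Throughput.Measure` is a hypothetical helper, and the `step` delegate stands in for one generation step.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: times a token-generation loop.
public static class Throughput
{
    // Calls `step` until it returns false or maxTokens is reached.
    // `step` should return true when it produced a token.
    public static (int Tokens, double TokensPerSecond) Measure(Func<bool> step, int maxTokens)
    {
        var sw = Stopwatch.StartNew();
        int tokens = 0;
        while (tokens < maxTokens && step())
            tokens++;
        sw.Stop();
        double seconds = Math.Max(sw.Elapsed.TotalSeconds, 1e-9); // avoid division by zero
        return (tokens, tokens / seconds);
    }
}

// Usage with a Generator (sketch):
// var (n, tps) = Throughput.Measure(() =>
// {
//     if (generator.IsDone()) return false;
//     generator.GenerateNextToken();
//     return true;
// }, 200);
```

Running it once before measuring helps, because the first tokens include prompt processing and warm-up.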

.NET MAUI mobile app

// In MAUI app
public class LocalAi
{
    private Model? _model;
    private Tokenizer? _tok;

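    // Model loading is blocking and can take seconds; keep it off the UI thread
    public Task LoadAsync(string modelPath) => Task.Run(() =>
    {
        _model = new Model(modelPath);
        _tok = new Tokenizer(_model);
    });

    public async IAsyncEnumerable<string> GenerateAsync(string prompt, [EnumeratorCancellation] CancellationToken ct = default)
    {
        // Generation loop (same shape as the streaming wrapper)
        yield break;   // placeholder so the iterator compiles
    }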
}

Bundle model with app or download on first run.

Model bundling

For desktop apps:

<ItemGroup>
  <None Update="phi-4-onnx\**" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>

This copies the model folder into the build and publish output.

For mobile: download on first run; cache in app data.
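
A minimal sketch of the download-and-cache step, assuming the model files are hosted at a base URL you control (hypothetical here) and that `cacheDir` is an app data folder such as FileSystem.AppDataDirectory in MAUI. `ModelCache.EnsureFileAsync` is a hypothetical helper; a model consists of several files (ONNX weights, genai_config.json, tokenizer files), so it would be called once per file.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: downloads a model file once and reuses the cached copy afterwards.
public static class ModelCache
{
    // `baseUri` must end with '/' so that fileName is resolved relative to it.
    public static async Task<string> EnsureFileAsync(
        HttpClient http, Uri baseUri, string cacheDir, string fileName, CancellationToken ct = default)
    {
        Directory.CreateDirectory(cacheDir);
        var target = Path.Combine(cacheDir, fileName);
        if (File.Exists(target))
            return target; // cache hit: no network access

        // Write to a temporary file first so an interrupted download never looks complete.
        var temp = target + ".part";
        using (var response = await http.GetAsync(
                   new Uri(baseUri, fileName), HttpCompletionOption.ResponseHeadersRead, ct))
        {
            response.EnsureSuccessStatusCode();
            await using var source = await response.Content.ReadAsStreamAsync(ct);
            await using var dest = File.Create(temp);
            await source.CopyToAsync(dest, ct);
        }
        File.Move(temp, target, overwrite: true);
        return target;
    }
}
```

For multi-gigabyte weights, a production version would likely add progress reporting, resume support, and a checksum, all omitted here.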

Memory

Phi-4 14B INT4 needs ~7 GB of RAM and Phi-3 mini INT4 ~2 GB for the weights alone; the KV cache and runtime buffers come on top. Plan accordingly.

Use cases

Privacy-sensitive (medical, financial): no API call.
Offline (mobile, edge IoT).
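Real-time: no network round trip. A sub-100 ms figure can only apply to first-token latency with the model already loaded; loading a multi-gigabyte model takes far longer.
Zero marginal cost per call once the model ships with the app.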

When NOT to use it

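You need frontier quality.
You need long context (the KV cache grows with context length, and small models handle long inputs less well).
You need function calling (Phi has limited support; quality varies).
Large team / many users: a managed API is simpler.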

Senior considerations

Quantize for size: INT4 is usually fine.
Test on target hardware: performance varies.
Measure quality on YOUR domain: Phi models do well on narrow tasks.
Fallback strategy: if local inference fails, fall back to an API.
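
The fallback strategy can be sketched as a thin wrapper around two generation delegates. `Fallback.GenerateAsync` is a hypothetical helper; `local` would wrap the ONNX generation loop and `remote` whatever hosted API the app uses.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: try local inference first, then a remote API.
public static class Fallback
{
    public static async Task<string> GenerateAsync(
        Func<string, CancellationToken, Task<string>> local,
        Func<string, CancellationToken, Task<string>> remote,
        string prompt,
        CancellationToken ct = default)
    {
        try
        {
            return await local(prompt, ct);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Local inference failed (model missing, out of memory, unsupported device, ...).
            // Cancellation is not treated as a failure and propagates to the caller.
            return await remote(prompt, ct);
        }
    }
}
```

For the privacy-sensitive use cases listed earlier, a silent fallback to a remote API may not be acceptable; in that case surface the error instead.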