Microsoft Phi & Small Models

Key Points

Phi-3 / Phi-4: Microsoft's SLMs (Small Language Models). 3-14B params.
Surprisingly capable for size — Phi-4 14B competitive with much larger models on focused tasks.
Edge / on-device via ONNX Runtime GenAI — Windows, Mac, Linux, mobile.
Use cases: privacy-preserving, offline, real-time on edge, low-latency, cost-zero inference.
Limitations: less capable than frontier; no native multimodal in most variants.

Lineup

Model	Size	Notes
Phi-3 mini	3.8B	Mobile / edge
Phi-3 small	7B	Balanced
Phi-3 medium	14B	Strongest 2024
Phi-3 vision	4.2B	Multimodal
Phi-4	14B	2024-25; competitive
Phi-4 mini	3.8B	2025; smaller

Why small models matter

On-device: phones, laptops; no API call.
Offline / privacy: data stays local.
Cost: zero per-call after model download.
Latency: <100ms cold start typical.
GPU not required (CPU inference works).

ONNX Runtime GenAI

using Microsoft.ML.OnnxRuntimeGenAI;

var model = new Model("/path/to/phi-4-onnx");
var tokenizer = new Tokenizer(model);

var tokens = tokenizer.Encode("Hello, world!");
var generator = new Generator(model, new GeneratorParams(model)
{
    Sequences = tokens,
    MaxLength = 200
});

while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    var next = generator.GetSequence(0)[^1];
    Console.Write(tokenizer.Decode(new[] { next }));
}

Runs in your .NET process. CPU or GPU (CUDA, DirectML, CoreML).

Microsoft.Extensions.AI integration

// OnnxRuntimeGenAI as IChatClient (community / preview)
IChatClient chat = new OnnxGenAIChatClient(modelPath);

// Then standard pipeline
chat = chat.AsBuilder()
    .UseFunctionInvocation()
    .UseLogging(loggerFactory)
    .Build();

Use cases

Mobile (.NET MAUI app with Phi-3 mini).
Desktop AI (note-taking, search, classification offline).
Edge IoT (industrial, on-prem).
Latency-critical (real-time response).
Privacy (medical, financial — data never leaves device).

Phi-4 quality

Comparable to Llama 3 70B on focused benchmarks. NOT as capable as GPT-4 / Claude 4 on open-ended tasks. Excellent for narrow domains.

Where it falls short

Long context: typically 4-16K, not 1M.
Multimodal: Phi-3 vision exists; quality below GPT-4o.
Tool use: less mature.
Edge cases / general world knowledge: smaller training data effect.

Quantization

INT4 / INT8 reduces footprint: - Phi-3 mini fp16: ~7 GB. - Phi-3 mini INT4: ~2 GB. Runs on phone.

ONNX Runtime supports both.

Vs Ollama (Llama small)

Aspect	Phi-4 + ONNX	Llama small + Ollama
Setup	Download ONNX	`ollama pull llama3`
Integration	Native .NET	HTTP API
Latency	Lower	Slightly higher
Quality	Comparable in size	Comparable
Ecosystem	MS-aligned	Wider

Both viable. Ollama easier for prototyping; ONNX better for shipping in .NET app.

Hardware acceleration

DirectML on Windows.
CUDA on NVIDIA.
CoreML on Mac (Apple Silicon).
CPU as fallback.

ONNX Runtime auto-picks.

.NET MAUI integration

Phi-3 mini in a mobile app: download model on first run; all inference local.

Senior considerations

Pick small models for narrow tasks: classification, formatting, simple Q&A. Don't expect frontier-level reasoning.
Measure on YOUR data: benchmarks lie; domain quality varies.
Battery / thermal on mobile: inference is CPU/GPU heavy.
Updates: model bundling vs OTA download. Plan rollback.