Microsoft Phi & Small Models
Key Points
- Phi-3 / Phi-4: Microsoft's SLMs (Small Language Models). 3-14B params.
- Surprisingly capable for size — Phi-4 14B competitive with much larger models on focused tasks.
- Edge / on-device via ONNX Runtime GenAI — Windows, Mac, Linux, mobile.
- Use cases: privacy-preserving, offline, real-time on edge, low-latency, cost-zero inference.
- Limitations: less capable than frontier; no native multimodal in most variants.
Lineup
| Model | Size | Notes |
|---|---|---|
| Phi-3 mini | 3.8B | Mobile / edge |
| Phi-3 small | 7B | Balanced |
| Phi-3 medium | 14B | Strongest 2024 |
| Phi-3 vision | 4.2B | Multimodal |
| Phi-4 | 14B | 2024-25; competitive |
| Phi-4 mini | 3.8B | 2025; smaller |
Why small models matter
- On-device: phones, laptops; no API call.
- Offline / privacy: data stays local.
- Cost: zero per-call after model download.
- Latency: <100ms cold start typical.
- GPU not required (CPU inference works).
ONNX Runtime GenAI
using Microsoft.ML.OnnxRuntimeGenAI;
var model = new Model("/path/to/phi-4-onnx");
var tokenizer = new Tokenizer(model);
var tokens = tokenizer.Encode("Hello, world!");
var generator = new Generator(model, new GeneratorParams(model)
{
Sequences = tokens,
MaxLength = 200
});
while (!generator.IsDone())
{
generator.ComputeLogits();
generator.GenerateNextToken();
var next = generator.GetSequence(0)[^1];
Console.Write(tokenizer.Decode(new[] { next }));
}
Runs in your .NET process. CPU or GPU (CUDA, DirectML, CoreML).
Microsoft.Extensions.AI integration
// OnnxRuntimeGenAI as IChatClient (community / preview)
IChatClient chat = new OnnxGenAIChatClient(modelPath);
// Then standard pipeline
chat = chat.AsBuilder()
.UseFunctionInvocation()
.UseLogging(loggerFactory)
.Build();
Use cases
- Mobile (.NET MAUI app with Phi-3 mini).
- Desktop AI (note-taking, search, classification offline).
- Edge IoT (industrial, on-prem).
- Latency-critical (real-time response).
- Privacy (medical, financial — data never leaves device).
Phi-4 quality
Comparable to Llama 3 70B on focused benchmarks. NOT as capable as GPT-4 / Claude 4 on open-ended tasks. Excellent for narrow domains.
Where it falls short
- Long context: typically 4-16K, not 1M.
- Multimodal: Phi-3 vision exists; quality below GPT-4o.
- Tool use: less mature.
- Edge cases / general world knowledge: smaller training data effect.
Quantization
INT4 / INT8 reduces footprint: - Phi-3 mini fp16: ~7 GB. - Phi-3 mini INT4: ~2 GB. Runs on phone.
ONNX Runtime supports both.
Vs Ollama (Llama small)
| Aspect | Phi-4 + ONNX | Llama small + Ollama |
|---|---|---|
| Setup | Download ONNX | ollama pull llama3 |
| Integration | Native .NET | HTTP API |
| Latency | Lower | Slightly higher |
| Quality | Comparable in size | Comparable |
| Ecosystem | MS-aligned | Wider |
Both viable. Ollama easier for prototyping; ONNX better for shipping in .NET app.
Hardware acceleration
- DirectML on Windows.
- CUDA on NVIDIA.
- CoreML on Mac (Apple Silicon).
- CPU as fallback.
ONNX Runtime auto-picks.
.NET MAUI integration
Phi-3 mini in a mobile app: download model on first run; all inference local.
Senior considerations
- Pick small models for narrow tasks: classification, formatting, simple Q&A. Don't expect frontier-level reasoning.
- Measure on YOUR data: benchmarks lie; domain quality varies.
- Battery / thermal on mobile: inference is CPU/GPU heavy.
- Updates: model bundling vs OTA download. Plan rollback.