Skip to content

Microsoft Phi & Small Models

Key Points

  • Phi-3 / Phi-4: Microsoft's SLMs (Small Language Models). 3-14B params.
  • Surprisingly capable for size — Phi-4 14B competitive with much larger models on focused tasks.
  • Edge / on-device via ONNX Runtime GenAI — Windows, Mac, Linux, mobile.
  • Use cases: privacy-preserving, offline, real-time on edge, low-latency, cost-zero inference.
  • Limitations: less capable than frontier; no native multimodal in most variants.

Lineup

Model Size Notes
Phi-3 mini 3.8B Mobile / edge
Phi-3 small 7B Balanced
Phi-3 medium 14B Strongest 2024
Phi-3 vision 4.2B Multimodal
Phi-4 14B 2024-25; competitive
Phi-4 mini 3.8B 2025; smaller

Why small models matter

  • On-device: phones, laptops; no API call.
  • Offline / privacy: data stays local.
  • Cost: zero per-call after model download.
  • Latency: <100ms cold start typical.
  • GPU not required (CPU inference works).

ONNX Runtime GenAI

using Microsoft.ML.OnnxRuntimeGenAI;

var model = new Model("/path/to/phi-4-onnx");
var tokenizer = new Tokenizer(model);

var tokens = tokenizer.Encode("Hello, world!");
var generator = new Generator(model, new GeneratorParams(model)
{
    Sequences = tokens,
    MaxLength = 200
});

while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    var next = generator.GetSequence(0)[^1];
    Console.Write(tokenizer.Decode(new[] { next }));
}

Runs in your .NET process. CPU or GPU (CUDA, DirectML, CoreML).

Microsoft.Extensions.AI integration

// OnnxRuntimeGenAI as IChatClient (community / preview)
IChatClient chat = new OnnxGenAIChatClient(modelPath);

// Then standard pipeline
chat = chat.AsBuilder()
    .UseFunctionInvocation()
    .UseLogging(loggerFactory)
    .Build();

Use cases

  • Mobile (.NET MAUI app with Phi-3 mini).
  • Desktop AI (note-taking, search, classification offline).
  • Edge IoT (industrial, on-prem).
  • Latency-critical (real-time response).
  • Privacy (medical, financial — data never leaves device).

Phi-4 quality

Comparable to Llama 3 70B on focused benchmarks. NOT as capable as GPT-4 / Claude 4 on open-ended tasks. Excellent for narrow domains.

Where it falls short

  • Long context: typically 4-16K, not 1M.
  • Multimodal: Phi-3 vision exists; quality below GPT-4o.
  • Tool use: less mature.
  • Edge cases / general world knowledge: smaller training data effect.

Quantization

INT4 / INT8 reduces footprint: - Phi-3 mini fp16: ~7 GB. - Phi-3 mini INT4: ~2 GB. Runs on phone.

ONNX Runtime supports both.

Vs Ollama (Llama small)

Aspect Phi-4 + ONNX Llama small + Ollama
Setup Download ONNX ollama pull llama3
Integration Native .NET HTTP API
Latency Lower Slightly higher
Quality Comparable in size Comparable
Ecosystem MS-aligned Wider

Both viable. Ollama easier for prototyping; ONNX better for shipping in .NET app.

Hardware acceleration

  • DirectML on Windows.
  • CUDA on NVIDIA.
  • CoreML on Mac (Apple Silicon).
  • CPU as fallback.

ONNX Runtime auto-picks.

.NET MAUI integration

Phi-3 mini in a mobile app: download model on first run; all inference local.

Senior considerations

  • Pick small models for narrow tasks: classification, formatting, simple Q&A. Don't expect frontier-level reasoning.
  • Measure on YOUR data: benchmarks lie; domain quality varies.
  • Battery / thermal on mobile: inference is CPU/GPU heavy.
  • Updates: model bundling vs OTA download. Plan rollback.

Cross-references