Meta Llama & Open Source

Key Points

Llama 3.x / 4 (Meta): leading open-weights family; competitive with frontier closed models on many tasks.
Mistral (Mistral AI): French; Mistral Large; Mixtral MoE; Codestral for code.
DeepSeek R1: Chinese; reasoning-focused; open weights.
Qwen (Alibaba): strong multilingual; Qwen 2.5/3.
Self-hosted via Ollama, vLLM, LM Studio, or hosted via Together, Groq, Replicate, Fireworks.
When to use: cost at high volume, data sovereignty, fine-tuning, research.

Major open-weights families

Meta Llama

Llama 3.3 70B: state-of-the-art open model at release; competitive with Claude Sonnet on many tasks.
Llama 3.x 8B / smaller: edge-friendly.
Llama 4 (2025): MoE architecture; competitive with frontier.
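License: Llama Community License. Broadly permissive, but not OSI open source: it carries an acceptable-use policy, attribution requirements, and a separate-license requirement above 700M monthly active users. Check the version that ships with your model.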

Mistral

Mistral Large 2: flagship dense.
Mixtral 8x22B: MoE; high quality at competitive cost.
Codestral: code-specialized.
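License: Apache 2.0 for Mixtral and Mistral 7B; Mistral Large 2 and Codestral shipped under Mistral's own research / non-production licenses, so check before commercial use.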

DeepSeek

DeepSeek R1: reasoning model that emits explicit thinking before its answer. Among the strongest open-weights models for math, code, and logic.
DeepSeek Coder: code-focused.
License: MIT for R1; other releases use different terms, so check per-release.

Qwen (Alibaba)

Qwen 2.5 / 3: strong multilingual; competitive English.
Open weights; good fit for Chinese / multilingual workloads.

Smaller players

Phi (Microsoft) — see Phi & Small Models.
Gemma (Google; open weights).
Falcon (TII).
Yi (01.AI).

Hosting options

Path	Type	Notes (setup, indicative cost)
Ollama	Local; free	`ollama pull llama3.3; ollama run llama3.3`
vLLM	Self-host on GPU	Production-grade; high throughput
LM Studio	Desktop GUI	Dev-friendly
Together AI	API	$0.20-1/M tokens
Groq	API; ultra-fast	$0.05-2/M tokens; LPU hardware
Replicate	API; flexible	Priced per model
Fireworks	API	$0.20-2/M tokens
AWS Bedrock	Managed API	Llama, Mistral
Azure AI Foundry	Managed API; Azure-integrated	Llama, Mistral, others

For production .NET: Together, Groq, or Bedrock/Foundry.

Why open-weights

Cost at high volume (self-host).
Data sovereignty: model + data stay on your infra.
Fine-tuning to your domain.
Avoid vendor lock-in.
Research / customization.

Why not

Quality gap with frontier (closing fast).
Ops cost for self-hosting.
Some models lack native multimodal support.
Dependent on the publisher continuing to release weights, and on the community for tooling and fixes.

.NET integration

Most hosted providers and local runtimes expose OpenAI-compatible endpoints:

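// Together AI (NuGet: OpenAI, Microsoft.Extensions.AI.OpenAI)
using System.ClientModel;
using Microsoft.Extensions.AI;
using OpenAI;

// togetherApiKey: your API key, loaded from configuration or an environment variable.
var client = new OpenAIClient(
    new ApiKeyCredential(togetherApiKey),
    new OpenAIClientOptions { Endpoint = new Uri("https://api.together.xyz/v1") });

// AsChatClient is the preview-era extension method; newer Microsoft.Extensions.AI.OpenAI
// releases use client.GetChatClient(model).AsIChatClient() instead.
IChatClient chat = client.AsChatClient("meta-llama/Llama-3.3-70B-Instruct-Turbo");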

The same SDK works against any OpenAI-compatible endpoint, so switching providers is mostly a base-URL and model-name change.
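The portability point can also be seen at the HTTP level. The sketch below is illustrative only: it uses the Together and Groq base URLs from this section, assumes Ollama's OpenAI-compatible `/v1` path on the default local port, and introduces a hypothetical helper `BuildChatRequest` that only builds the request (it sends nothing). Model ids are the ones quoted in this section and may change.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Provider table: only the base URL and model id differ between providers.
// The Ollama entry assumes its OpenAI-compatible /v1 path on the default port.
var providers = new Dictionary<string, (string BaseUrl, string Model)>
{
    ["together"] = ("https://api.together.xyz/v1", "meta-llama/Llama-3.3-70B-Instruct-Turbo"),
    ["groq"] = ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    ["ollama"] = ("http://localhost:11434/v1", "llama3.3"),
};

// Hypothetical helper: returns the URL and JSON body for POST {base}/chat/completions.
static (string Url, string Body) BuildChatRequest(
    (string BaseUrl, string Model) provider, string userMessage)
{
    var payload = new
    {
        model = provider.Model,
        messages = new[] { new { role = "user", content = userMessage } },
    };
    return ($"{provider.BaseUrl}/chat/completions", JsonSerializer.Serialize(payload));
}

foreach (var (name, provider) in providers)
{
    var (url, body) = BuildChatRequest(provider, "Say hello.");
    Console.WriteLine($"{name}: POST {url}");
    Console.WriteLine(body);
}
```

The request shape is identical for all three entries, which is why one client library can serve every provider. A real call would also need an `Authorization: Bearer <key>` header for the hosted APIs.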

Ollama (local)

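// Microsoft.Extensions.AI.Ollama (preview package; later guidance points to OllamaSharp,
// whose OllamaApiClient also implements IChatClient)
using Microsoft.Extensions.AI;
IChatClient chat = new OllamaChatClient(new Uri("http://localhost:11434"), "llama3.3");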

For local dev / edge / privacy. See Ollama & Aspire.

Groq

Specialized hardware (LPU) → ultra-fast inference. Llama 3 70B at 500+ tokens/sec (vendor-reported; varies with model version, prompt, and load).

// Same OpenAI SDK as for Together; only the endpoint and model id change.
IChatClient chat = new OpenAIClient(apiKey, new() { Endpoint = new("https://api.groq.com/openai/v1") })
    .AsChatClient("llama-3.3-70b-versatile");
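Throughput figures like the one above are worth checking against your own prompts. A minimal sketch of a timing harness follows; `TokensPerSecondAsync` and `FakeCallAsync` are hypothetical names, and the fake call stands in for a real request so the example runs offline. In practice the delegate would call your `IChatClient` and return the completion-token count the provider reports.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper: times one call and returns completion tokens per second.
// `call` must return the number of completion tokens produced by the request.
static async Task<double> TokensPerSecondAsync(Func<Task<int>> call)
{
    var stopwatch = Stopwatch.StartNew();
    int completionTokens = await call();
    stopwatch.Stop();
    return completionTokens / stopwatch.Elapsed.TotalSeconds;
}

// Stand-in for a real request: pretends 100 tokens took roughly 200 ms.
static async Task<int> FakeCallAsync()
{
    await Task.Delay(200);
    return 100;
}

double rate = await TokensPerSecondAsync(FakeCallAsync);
Console.WriteLine($"{rate:F0} tokens/sec");
```

This measures end-to-end time, so it includes network latency and time to first token. For a fair comparison across providers, run several requests with the same prompt and compare medians rather than a single sample.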

Fine-tuning

For domain adaptation:

LoRA / QLoRA: parameter-efficient. Train on small data.
Full fine-tune: GPU-heavy.
Tools: HuggingFace TRL, Axolotl, Together fine-tuning API.
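To make "parameter-efficient" concrete, here is a worked example under stated assumptions. LoRA freezes a weight matrix W of shape d × k and trains only a low-rank update BA. The dimensions below (d = k = 8192, rank r = 16) are illustrative choices, not a recommendation, and the usual scaling factor on BA is omitted for brevity.

```latex
\[
W' = W + BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
\]
\[
\frac{\text{trainable parameters}}{\text{full parameters}}
  = \frac{r\,(d + k)}{d\,k}
  = \frac{16 \cdot (8192 + 8192)}{8192 \cdot 8192}
  = \frac{262{,}144}{67{,}108{,}864}
  \approx 0.39\%
\]
```

Under these assumptions each adapted matrix trains well under 1% of its original parameters, which is why LoRA fits on far smaller GPUs than a full fine-tune.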

For most teams: RAG > fine-tuning. Fine-tune only after hitting RAG ceiling.

Quantization

fp16 (default): 70B Llama → ~140 GB of weights (2 bytes per parameter); KV cache and activations are extra.
int8: ~70 GB; minor quality loss.
int4 (GPTQ, AWQ): ~35 GB; runs on smaller GPUs; some quality loss.

GGUF (the llama.cpp file format, also used by Ollama) ships models at a range of quantization levels. Q4_K_M is the typical pick for good-enough quality on a laptop.
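The sizes above follow from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. A minimal sketch, assuming a nominal 70 billion parameters and decimal gigabytes; `WeightGigabytes` is a hypothetical helper and covers weights only.

```csharp
using System;

// Weight memory in decimal gigabytes: parameters * bits per weight / 8 bits per byte.
// Weights only; KV cache and activations are not modeled here.
static double WeightGigabytes(double parameterCount, double bitsPerWeight) =>
    parameterCount * bitsPerWeight / 8.0 / 1e9;

const double Llama70B = 70e9; // nominal parameter count (assumption)

foreach (var (name, bits) in new[] { ("fp16", 16.0), ("int8", 8.0), ("int4", 4.0) })
{
    // Prints ~140 GB, ~70 GB and ~35 GB respectively.
    Console.WriteLine($"{name}: ~{WeightGigabytes(Llama70B, bits):F0} GB");
}
```

Real quantized files are usually somewhat larger than the int4 estimate, because schemes such as Q4_K_M keep some tensors at higher precision and store per-block scales.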

Running on Azure

Azure AI Foundry hosts Llama, Mistral, others.
Custom container in AKS / VM with vLLM.

Senior considerations

Quality benchmarks change monthly — measure for YOUR workload, not just MMLU.
Latency: Groq leads on speed; Together / Fireworks competitive; AWS Bedrock OK.
Privacy: self-host only if data really must stay on-prem; managed providers usually offer strong contractual data-handling terms.
Total cost: include GPU + ops; managed often cheaper than self-host until very high volume.
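The total-cost point can be put into rough numbers. The sketch below computes the monthly volume at which a fixed self-hosting bill equals managed per-token spend. Both inputs are hypothetical placeholders rather than quotes, `BreakEvenMillionTokens` is an illustrative helper, and the model ignores the throughput ceiling of the self-hosted GPUs.

```csharp
using System;

// Break-even volume in millions of tokens per month: the point where a fixed
// self-host cost equals managed per-token spend. Ignores GPU capacity limits.
static double BreakEvenMillionTokens(double selfHostMonthlyUsd, double managedUsdPerMillion) =>
    selfHostMonthlyUsd / managedUsdPerMillion;

// Hypothetical inputs: $10,000/month for GPUs plus ops time,
// and $0.50 per million tokens on a managed API.
double breakEven = BreakEvenMillionTokens(10_000, 0.50);
Console.WriteLine($"Break-even: {breakEven:F0}M tokens/month");
```

With these placeholder inputs the break-even is 20000M (20 billion) tokens per month. Below that volume the managed API is cheaper. Substitute your own GPU, staffing, and per-token prices before drawing conclusions.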