Meta Llama & Open Source
Key Points
- Llama 3.x / 4 (Meta): leading open-weights family; competitive with closed frontier models on many tasks.
- Mistral (Mistral AI, France): Mistral Large; Mixtral MoE; Codestral for code.
- DeepSeek R1 (DeepSeek, China): reasoning-focused; open weights.
- Qwen (Alibaba): strong multilingual; Qwen 2.5/3.
- Self-hosted via Ollama, vLLM, LM Studio, or hosted via Together, Groq, Replicate, Fireworks.
- When to use: cost at high volume, data sovereignty, fine-tuning, research.
Major open-weights families
Meta Llama
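- Llama 3.3 70B: state-of-the-art among open-weights models at release (December 2024); competitive with Claude Sonnet on many tasks.
- Llama 3.x 8B / smaller: edge-friendly.
- Llama 4 (2025): MoE architecture; competitive with frontier.
- License: Llama Community License. Broadly permissive but with caveats (for example, a separate license is needed above 700M monthly active users); check before shipping.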
Mistral
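- Mistral Large 2: flagship dense model.
- Mixtral 8x22B: MoE; high quality at competitive cost.
- Codestral: code-specialized.
- License: Apache 2.0 for Mixtral and several smaller models; Mistral Large 2 and Codestral were released under Mistral's own research / non-production licenses. Check per model.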
DeepSeek
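- DeepSeek R1: reasoning model with explicit thinking. Among the strongest open-weights models for math, code, and logic.
- DeepSeek Coder: code-focused.
- License: R1 weights are MIT-licensed; earlier releases such as DeepSeek Coder use a separate DeepSeek model license. Check per release.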
Qwen (Alibaba)
- Qwen 2.5 / 3: strong multilingual; competitive English.
- Open weights; good fit for Chinese / multilingual workloads.
Smaller players
- Phi (Microsoft): see Phi & Small Models.
- Gemma (Google; open weights).
- Falcon (TII).
- Yi (01.AI).
Hosting options
| Path | Type | Cost / notes |
|---|---|---|
| Ollama | Local runtime | Free; ollama pull llama3.3, then ollama run llama3.3 |
| vLLM | Self-hosted GPU server | Production-grade; high throughput |
| LM Studio | Desktop GUI | Dev-friendly |
| Together AI | Hosted API | ~$0.20-1 per 1M tokens (varies by model) |
| Groq | Hosted API | ~$0.05-2 per 1M tokens; LPU hardware, ultra-fast |
| Replicate | Hosted API | Flexible; priced per model |
| Fireworks | Hosted API | ~$0.20-2 per 1M tokens (varies by model) |
| AWS Bedrock | Managed cloud | Llama, Mistral |
| Azure AI Foundry | Managed cloud | Llama, Mistral, others; Azure-integrated |
For production .NET workloads, start with a managed endpoint: Together, Groq, or Bedrock / Azure AI Foundry.
Why open-weights
- Cost at high volume (self-host).
- Data sovereignty: model + data stay on your infra.
- Fine-tuning to your domain.
- Avoid vendor lock-in.
- Research / customization.
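The "cost at high volume" argument can be made concrete with a rough break-even calculation. The sketch below is illustrative only: the API price, GPU rental rate, and ops overhead are hypothetical placeholders, and it assumes the self-hosted node can actually serve the break-even volume.

```csharp
using System;

// All inputs are hypothetical placeholders; substitute your own quotes.
double apiPricePerMillionTokens = 0.90;   // assumed managed-API price (USD per 1M tokens)
double gpuNodePerHour = 12.0;             // assumed GPU node rental (USD/hour)
double opsPerMonth = 3000.0;              // assumed engineering/ops overhead (USD/month)

double selfHostMonthly = CostModel.SelfHostMonthly(gpuNodePerHour, opsPerMonth);
double breakEven = CostModel.BreakEvenMillionTokens(selfHostMonthly, apiPricePerMillionTokens);
Console.WriteLine($"self-host: ${selfHostMonthly:F0}/month; break-even at {breakEven:F0}M tokens/month");

static class CostModel
{
    const double HoursPerMonth = 730.0; // average hours in a month

    // Fixed monthly cost of keeping one node running around the clock, plus ops.
    public static double SelfHostMonthly(double gpuPerHour, double opsPerMonth)
        => gpuPerHour * HoursPerMonth + opsPerMonth;

    // Monthly token volume (in millions) at which the API bill equals the self-host bill.
    public static double BreakEvenMillionTokens(double selfHostMonthly, double apiPricePerMillion)
        => selfHostMonthly / apiPricePerMillion;
}
```

Below the break-even volume the managed API is cheaper; above it, self-hosting starts to pay off, provided throughput and availability requirements are met by the hardware you priced.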
Why not
- Quality gap with frontier models (closing fast).
- Ops cost for self-hosting.
- Some models lack native multimodal support.
- Dependent on vendors and the community to keep releasing weights.
.NET integration
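Most hosted providers expose OpenAI-compatible endpoints, so the OpenAI .NET SDK works once it is pointed at a different base URL:
```csharp
using System.ClientModel;          // ApiKeyCredential
using Microsoft.Extensions.AI;     // IChatClient, AsChatClient
using OpenAI;                      // OpenAIClient, OpenAIClientOptions

// Together AI: same SDK, different endpoint.
var client = new OpenAIClient(
    new ApiKeyCredential(togetherApiKey),
    new OpenAIClientOptions { Endpoint = new Uri("https://api.together.xyz/v1") });
// Newer Microsoft.Extensions.AI.OpenAI releases use client.GetChatClient(model).AsIChatClient().
IChatClient chat = client.AsChatClient("meta-llama/Llama-3.3-70B-Instruct-Turbo");
```
The same SDK works against any OpenAI-compatible endpoint, so switching providers means changing only the base URL, API key, and model ID.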
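To illustrate what "OpenAI-compatible" means at the wire level, here is a minimal sketch that builds the same chat-completions request for three backends using only HttpClient types. The Together and Groq base URLs and model IDs are the ones used in this document; the Ollama /v1 path is its OpenAI-compatible endpoint; the ChatRequests helper and the "YOUR_API_KEY" placeholder are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Only the base URL and model ID differ between providers.
var providers = new (string Name, string BaseUrl, string Model)[]
{
    ("together", "https://api.together.xyz/v1", "meta-llama/Llama-3.3-70B-Instruct-Turbo"),
    ("groq", "https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    ("ollama", "http://localhost:11434/v1", "llama3.3"),
};

foreach (var p in providers)
{
    using var request = ChatRequests.Build(p.BaseUrl, "YOUR_API_KEY", p.Model, "Say hello.");
    Console.WriteLine($"{p.Name}: {request.Method} {request.RequestUri}");
}

static class ChatRequests
{
    // Builds a POST {baseUrl}/chat/completions request with a single user message.
    public static HttpRequestMessage Build(string baseUrl, string apiKey, string model, string prompt)
    {
        var body = JsonSerializer.Serialize(new
        {
            model,
            messages = new[] { new { role = "user", content = prompt } }
        });
        var request = new HttpRequestMessage(HttpMethod.Post, $"{baseUrl.TrimEnd('/')}/chat/completions")
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
        return request;
    }
}
```

The sketch only builds and prints the requests. Sending one with HttpClient.SendAsync should return JSON whose reply text sits at choices[0].message.content; in application code the IChatClient abstraction is usually the more convenient route.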
Ollama (local)
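```csharp
// Microsoft.Extensions.AI.Ollama (preview package)
using Microsoft.Extensions.AI;

// 11434 is Ollama's default local port.
IChatClient chat = new OllamaChatClient(new Uri("http://localhost:11434"), "llama3.3");
```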
For local dev / edge / privacy. See Ollama & Aspire.
Groq
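Specialized hardware (LPU) gives ultra-fast inference: throughput for 70B-class Llama models is in the hundreds of tokens/sec (vendor-reported; varies by model and load).
```csharp
// Groq's OpenAI-compatible endpoint; same SDK as the Together example.
IChatClient chat = new OpenAIClient(
        new ApiKeyCredential(apiKey),
        new OpenAIClientOptions { Endpoint = new Uri("https://api.groq.com/openai/v1") })
    .AsChatClient("llama-3.3-70b-versatile");
```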
Fine-tuning
For domain adaptation:
- LoRA / QLoRA: parameter-efficient. Train on small data.
- Full fine-tune: GPU-heavy.
- Tools: HuggingFace TRL, Axolotl, Together fine-tuning API.
For most teams: RAG > fine-tuning. Fine-tune only after hitting RAG ceiling.
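To see why LoRA counts as parameter-efficient, it helps to compare trainable parameter counts for a single weight matrix. The sketch below assumes an 8192 x 8192 projection and a rank of 16; both values are illustrative, not taken from any specific training recipe, and LoraMath is a hypothetical helper.

```csharp
using System;

long d = 8192, k = 8192;   // assumed projection shape (illustrative)
int rank = 16;             // assumed LoRA rank (illustrative)

long full = LoraMath.FullParameters(d, k);
long lora = LoraMath.AdapterParameters(d, k, rank);
Console.WriteLine($"full: {full:N0}, LoRA: {lora:N0}, ratio: {(double)lora / full:P2}");

static class LoraMath
{
    // A dense d x k weight matrix has d*k trainable parameters.
    public static long FullParameters(long d, long k) => d * k;

    // LoRA freezes W and trains two low-rank factors, B (d x r) and A (r x k),
    // for a total of r*(d+k) trainable parameters.
    public static long AdapterParameters(long d, long k, int rank) => rank * (d + k);
}
```

With these numbers the adapter trains 262,144 parameters against 67,108,864 for the full matrix, roughly 0.4%. QLoRA applies the same idea on top of a quantized base model, which further reduces the memory needed to hold the frozen weights.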
Quantization
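- fp16 (default): 70B Llama → ~140 GB of weights (2 bytes per parameter).
- int8: ~70 GB; minor quality loss.
- int4 (GPTQ, AWQ): ~35 GB; runs on smaller GPUs; some quality loss.
These figures cover weights only; the KV cache and activations need additional memory.
GGUF (the file format used by Ollama and llama.cpp) packages quantized weights in a single file. Q4_K_M is the typical choice for good-enough quality on laptop-class hardware.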
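The sizes above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. A minimal sketch, assuming a 70B-parameter model and decimal gigabytes (WeightMemory is a hypothetical helper):

```csharp
using System;

foreach (var (name, bits) in new[] { ("fp16", 16.0), ("int8", 8.0), ("int4", 4.0) })
{
    Console.WriteLine($"{name}: ~{WeightMemory.GigabytesFor(70e9, bits):F0} GB");
}

static class WeightMemory
{
    // Weights only: parameters * bits / 8 bytes, reported in decimal GB (1e9 bytes).
    public static double GigabytesFor(double parameterCount, double bitsPerWeight)
        => parameterCount * bitsPerWeight / 8.0 / 1e9;
}
```

This reproduces the 140 / 70 / 35 GB figures. Mixed-precision schemes such as Q4_K_M average somewhat more than 4 bits per weight, so real files tend to be larger than the plain int4 estimate.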
Running on Azure
- Azure AI Foundry hosts Llama, Mistral, others.
- Custom container in AKS / VM with vLLM.
Senior considerations
- Quality benchmarks change monthly — measure for YOUR workload, not just MMLU.
- Latency: Groq typically leads on raw speed; Together and Fireworks are competitive; AWS Bedrock is adequate for most workloads.
- Privacy: self-host only if data really must stay on-prem; managed providers usually offer strong contractual protections.
- Total cost: include GPU + ops; managed is often cheaper than self-hosting until volume is very high.
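"Measure for YOUR workload" can start as a very small harness: a list of representative prompts with expected answers, run against each candidate model, recording accuracy and latency. The sketch below is one possible shape, not a full evaluation framework; WorkloadEval, the substring-match scoring rule, and the fake model are all hypothetical stand-ins. In practice the delegate would wrap a real model call such as an IChatClient request.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

var cases = new List<(string Prompt, string Expected)>
{
    ("What is 2 + 2? Answer with a number.", "4"),
    ("Capital of France? One word.", "Paris"),
};

// Stand-in for a real model call; deliberately gets the second case wrong.
Func<string, Task<string>> fakeModel = prompt =>
    Task.FromResult(prompt.Contains("2 + 2") ? "4" : "Lyon");

var (accuracy, meanMs) = await WorkloadEval.RunAsync(fakeModel, cases);
Console.WriteLine($"accuracy: {accuracy:P0}, mean latency: {meanMs:F1} ms");

static class WorkloadEval
{
    // Runs every case once; a case passes if the answer contains the expected text.
    // Assumes at least one case (an empty list would yield NaN).
    public static async Task<(double Accuracy, double MeanLatencyMs)> RunAsync(
        Func<string, Task<string>> model,
        IReadOnlyList<(string Prompt, string Expected)> cases)
    {
        int correct = 0;
        double totalMs = 0;
        foreach (var (prompt, expected) in cases)
        {
            var sw = Stopwatch.StartNew();
            string answer = await model(prompt);
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
            if (answer.Contains(expected, StringComparison.OrdinalIgnoreCase)) correct++;
        }
        return ((double)correct / cases.Count, totalMs / cases.Count);
    }
}
```

Because every provider in this document can sit behind the same delegate, the same case list can be replayed against Together, Groq, or a local Ollama model to compare quality and latency on your own prompts.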