Meta Llama & Open Source

Key Points

  • Llama 3.x / 4 (Meta): the leading open-weights family; competitive with closed frontier models on many tasks.
  • Mistral (Mistral AI): French; Mistral Large; Mixtral MoE; Codestral for code.
  • DeepSeek R1: Chinese; reasoning-focused; open weights.
  • Qwen (Alibaba): strong multilingual; Qwen 2.5/3.
  • Self-host via Ollama, vLLM, or LM Studio; or use a hosted API from Together, Groq, Replicate, or Fireworks.
  • When to use: cost at high volume, data sovereignty, fine-tuning, research.

Major open-weights families

Meta Llama

  • Llama 3.3 70B: among the strongest open-weights models at its release; competitive with Claude Sonnet on many tasks.
  • Llama 3.x 8B / smaller: edge-friendly.
  • Llama 4 (2025): MoE architecture; competitive with frontier.
  • License: Llama Community License (broadly permissive, with caveats such as an acceptable-use policy and extra terms for very large services); check before shipping.

Mistral

  • Mistral Large 2: flagship dense.
  • Mixtral 8x22B: MoE; high quality at competitive cost.
  • Codestral: code-specialized.
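  • License: Apache 2.0 for Mixtral and several smaller models; Mistral Large 2 and Codestral ship under Mistral's own research / non-production licenses. Check per model.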

DeepSeek

  • DeepSeek R1: reasoning model with explicit thinking. Very strong on math, code, and logic among open-weights models.
  • DeepSeek Coder: code-focused.
  • License: MIT for R1; some earlier releases use a custom DeepSeek model license. Check per-release.

Qwen (Alibaba)

  • Qwen 2.5 / 3: strong multilingual; competitive English.
  • Open weights; good fit for Chinese / multilingual workloads.

Smaller players

  • Phi (Microsoft) — see Phi & Small Models.
  • Gemma (Google open).
  • Falcon (TII).
  • Yi (01.AI).

Hosting options

Path | Type / models | Notes (prices approximate; check current rates)
Ollama | Local; free | ollama pull llama3.3; ollama run llama3.3
vLLM | Self-host on GPU | Production-grade; high-throughput serving
LM Studio | Desktop GUI | Dev-friendly
Together AI | API | $0.20-1/M tokens
Groq | API; ultra-fast | $0.05-2/M tokens; LPU hardware
Replicate | API; flexible | Per-model pricing
Fireworks | API | $0.20-2/M tokens
AWS Bedrock | Llama, Mistral | Managed
Azure AI Foundry | Llama, Mistral, others | Azure-integrated

For production .NET: Together, Groq, or Bedrock/Foundry.

Why open-weights

  • Cost at high volume (self-host).
  • Data sovereignty: model + data stay on your infra.
  • Fine-tuning to your domain.
  • Avoid vendor lock-in.
  • Research / customization.

Why not

  • Quality gap with frontier (closing fast).
  • Ops cost for self-hosting.
  • Some models lack native multimodal support.
  • Dependent on the publisher (and community) for new weights, fixes, and support.

.NET integration

Most expose OpenAI-compatible endpoints:

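// Together AI (NuGet: OpenAI, Microsoft.Extensions.AI.OpenAI)
using System.ClientModel;
using Microsoft.Extensions.AI;
using OpenAI;

var client = new OpenAIClient(
    new ApiKeyCredential(togetherApiKey),
    new OpenAIClientOptions { Endpoint = new Uri("https://api.together.xyz/v1") });

// Older previews of the adapter exposed this as client.AsChatClient(model).
IChatClient chat = client.GetChatClient("meta-llama/Llama-3.3-70B-Instruct-Turbo").AsIChatClient();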

The same SDK works against any OpenAI-compatible endpoint, so switching providers is mostly a matter of changing the base URL, API key, and model id.
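To make the portability point concrete, here is a minimal sketch that builds the same chat-completions request for different providers using only HttpClient types, with no SDK. The Together and Groq base URLs are the ones used on this page; the Ollama /v1 path is its OpenAI-compatible endpoint. The OpenAiCompatible class, the Providers table, and BuildChatRequest are hypothetical names for illustration, and model ids may change over time.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

public static class OpenAiCompatible
{
    // Hypothetical provider table: base URL plus a default model id.
    public static readonly IReadOnlyDictionary<string, (string BaseUrl, string Model)> Providers =
        new Dictionary<string, (string BaseUrl, string Model)>
        {
            ["together"] = ("https://api.together.xyz/v1", "meta-llama/Llama-3.3-70B-Instruct-Turbo"),
            ["groq"] = ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
            ["ollama"] = ("http://localhost:11434/v1", "llama3.3"),
        };

    // Builds a POST {baseUrl}/chat/completions request with a single user message.
    public static HttpRequestMessage BuildChatRequest(string provider, string apiKey, string userPrompt)
    {
        var (baseUrl, model) = Providers[provider];
        var payload = new
        {
            model,
            messages = new[] { new { role = "user", content = userPrompt } }
        };

        var request = new HttpRequestMessage(HttpMethod.Post, $"{baseUrl}/chat/completions")
        {
            Content = new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
        return request;
    }
}
```

Sending it is the same for every provider: pass the result to HttpClient.SendAsync and read choices[0].message.content from the JSON response. Only the provider key changes between calls, which is the whole point of the OpenAI-compatible surface.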

Ollama (local)

// NuGet: Microsoft.Extensions.AI.Ollama (preview package; OllamaSharp is a common alternative that also implements IChatClient)
IChatClient chat = new OllamaChatClient(new Uri("http://localhost:11434"), "llama3.3");

For local dev / edge / privacy. See Ollama & Aspire.

Groq

Custom inference hardware (LPU) → very fast generation. Llama 3 70B-class models at roughly 500+ tokens/sec; actual throughput varies by model and load.

IChatClient chat = new OpenAIClient(
        new ApiKeyCredential(apiKey),
        new OpenAIClientOptions { Endpoint = new Uri("https://api.groq.com/openai/v1") })
    .GetChatClient("llama-3.3-70b-versatile").AsIChatClient();
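A rough way to see what that throughput means for users is to estimate response time as time-to-first-token plus output tokens divided by tokens per second. The sketch below is back-of-envelope arithmetic only; LatencyEstimate is a hypothetical helper, and the 50 tokens/sec comparison point is an assumed figure for a slower endpoint, not a measurement.

```csharp
using System;

public static class LatencyEstimate
{
    // Approximate wall-clock seconds for one response:
    // time to first token plus steady-state decoding time.
    public static double Seconds(int outputTokens, double tokensPerSecond, double timeToFirstTokenSeconds = 0.0)
    {
        if (tokensPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(tokensPerSecond));
        return timeToFirstTokenSeconds + outputTokens / tokensPerSecond;
    }
}
```

For a 1,000-token answer, LatencyEstimate.Seconds(1000, 500) gives 2 seconds, while the assumed 50 tokens/sec endpoint gives 20 seconds. Real numbers depend on prompt length, queueing, and network, so measure before committing.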

Fine-tuning

For domain adaptation:

  • LoRA / QLoRA: parameter-efficient. Train on small data.
  • Full fine-tune: GPU-heavy.
  • Tools: HuggingFace TRL, Axolotl, Together fine-tuning API.

For most teams: RAG > fine-tuning. Fine-tune only after hitting RAG ceiling.
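To make "parameter-efficient" concrete: LoRA freezes a weight matrix of shape d_in x d_out and trains two low-rank factors of rank r instead, so the trainable count per adapted matrix drops from d_in * d_out to r * (d_in + d_out). The sketch below just does that arithmetic; LoraMath is a hypothetical helper, and the 8192 width and rank 16 are illustrative values, not a recommended configuration.

```csharp
public static class LoraMath
{
    // Parameters in the full weight matrix (what a full fine-tune updates).
    public static long FullParams(long dIn, long dOut) => dIn * dOut;

    // Trainable parameters LoRA adds for the same matrix:
    // A is (dIn x rank), B is (rank x dOut).
    public static long LoraParams(long dIn, long dOut, long rank) => rank * (dIn + dOut);
}
```

With those illustrative values, FullParams(8192, 8192) is 67,108,864 while LoraParams(8192, 8192, 16) is 262,144, about 0.39% of the full matrix. That ratio is why LoRA fits on far smaller GPUs than a full fine-tune.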

Quantization

  • fp16 (default): 70B Llama → ~140 GB.
  • int8: ~70 GB; minor quality loss.
  • int4 (GPTQ, AWQ): ~35 GB; runs on smaller GPUs; some quality loss.

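GGUF (the file format used by Ollama and llama.cpp) stores quantized weights, and models are usually published at several quantization levels. Q4_K_M is a typical choice when you want good-enough quality on a laptop.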
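The sizes above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. A minimal sketch, assuming decimal gigabytes and counting weights only (KV cache, activations, and runtime overhead come on top); WeightMemory is a hypothetical helper.

```csharp
public static class WeightMemory
{
    // Approximate weight storage in decimal GB: parameters * bits / 8 bytes.
    // Weights only; KV cache and activations are not included.
    public static double Gigabytes(double parameters, double bitsPerWeight)
        => parameters * bitsPerWeight / 8.0 / 1e9;
}
```

WeightMemory.Gigabytes(70e9, 16) gives 140, 8 bits gives 70, and 4 bits gives 35, matching the list. Mixed schemes such as Q4_K_M average somewhat more than 4 bits per weight, so their files come out larger than the 4-bit figure.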

Running on Azure

  • Azure AI Foundry hosts Llama, Mistral, others.
  • Custom container in AKS / VM with vLLM.

Senior considerations

  • Quality benchmarks change monthly; measure on YOUR workload, not just MMLU.
  • Latency: Groq leads on speed; Together / Fireworks are competitive; AWS Bedrock is adequate for most workloads.
  • Privacy: self-host only if data truly must stay on-prem; managed providers usually offer strong contractual data-protection terms.
  • Total cost: include GPU + ops; managed is often cheaper than self-hosting until very high volume.
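The total-cost point can be turned into a quick break-even check: divide the fixed monthly cost of self-hosting (GPU plus ops) by the managed price per million tokens to get the monthly volume at which the two cost the same. This is a simplified sketch that treats self-hosting as a flat monthly cost; BreakEven is a hypothetical helper and the dollar figures in the usage note are made-up inputs.

```csharp
using System;

public static class BreakEven
{
    // Monthly volume (in millions of tokens) at which a flat self-hosting
    // cost equals the managed per-token bill.
    public static double MillionTokensPerMonth(double selfHostMonthlyUsd, double managedUsdPerMillionTokens)
    {
        if (managedUsdPerMillionTokens <= 0)
            throw new ArgumentOutOfRangeException(nameof(managedUsdPerMillionTokens));
        return selfHostMonthlyUsd / managedUsdPerMillionTokens;
    }
}
```

With an assumed $6,000/month for GPU and ops and an assumed managed price of $0.90 per million tokens, the break-even is roughly 6,667 million tokens per month. Below that volume the managed API is cheaper under these assumptions.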

Cross-references