Specialty Models
Key Points
Beyond chat: embeddings, rerankers, vision, audio, image generation, video generation. Each has dominant providers and trade-offs.
Embeddings
Convert text → fixed-length vector; texts with similar meaning get nearby vectors. Used for similarity search (RAG retrieval).
| Model | Dimensions | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Cheap; good default |
| OpenAI text-embedding-3-large | 3072 | Higher quality, higher cost |
| Voyage AI voyage-3 | 1024 | General-purpose; domain variants such as voyage-code-3 (code) and voyage-law-2 (legal) |
| Cohere embed-v3 | 1024 | English and multilingual variants |
| Google text-embedding-004 | 768 | Solid general-purpose option |
| BGE / Nomic / other open models | Varies | Open weights; self-hostable |
Senior choice: text-embedding-3-small for general-purpose RAG, voyage-code-3 for code RAG.
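using Microsoft.Extensions.AI;
using OpenAI;
// `key` is your API key; the adapter method's name varies across Microsoft.Extensions.AI.OpenAI versions.
IEmbeddingGenerator<string, Embedding<float>> gen =
    new OpenAIClient(key).AsEmbeddingGenerator("text-embedding-3-small");
var embeddings = await gen.GenerateAsync(["text 1", "text 2"]); // one Embedding<float> per input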
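Once you have vectors, similarity search is a nearest-neighbour lookup. The sketch below shows the brute-force version with cosine similarity; the toy 3-dimensional vectors are made up so it runs offline, and the helper names (Cosine, TopK) are illustrative rather than part of any library.

```csharp
using System;
using System.Linq;

// Cosine similarity: dot(a, b) / (|a| * |b|). OpenAI embeddings are already
// normalized to length 1, so for them this equals the plain dot product.
static float Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Dimension mismatch");
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// Brute-force top-k: score every stored vector against the query, keep the k best.
// Vector databases replace this linear scan with an approximate index.
static (int Index, float Score)[] TopK(float[] query, float[][] vectors, int k) =>
    vectors.Select((v, i) => (Index: i, Score: Cosine(query, v)))
           .OrderByDescending(hit => hit.Score)
           .Take(k)
           .ToArray();

// Toy 3-dimensional vectors standing in for real 1536-dimensional embeddings.
float[][] corpus =
{
    new[] { 1f, 0f, 0f },     // doc 0
    new[] { 0.7f, 0.7f, 0f }, // doc 1
    new[] { 0f, 0f, 1f },     // doc 2
};
var hits = TopK(new[] { 1f, 0.1f, 0f }, corpus, k: 2);
foreach (var (index, score) in hits)
    Console.WriteLine($"doc {index}: {score:F3}"); // doc 0 ranks first, then doc 1
```

A linear scan is fine for a few thousand chunks; beyond that, the same scoring runs inside a vector store's index.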
Dimension reduction
The text-embedding-3 models accept a dimensions parameter, so you can request shorter embeddings (e.g., 256 or 512) and trade some retrieval quality for less storage and faster search.
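If vectors were already stored at full size, OpenAI's documentation describes shortening them manually: keep the leading components and re-normalize. A minimal sketch, assuming text-embedding-3 vectors (the trick relies on how those models were trained and should not be assumed for other models); Shorten and BytesPerVector are illustrative names, and the storage arithmetic assumes float32 storage.

```csharp
using System;
using System.Linq;

// Keep the first `dims` components of a text-embedding-3 vector and rescale to unit length.
// OpenAI's docs describe this as the manual alternative to the `dimensions` request parameter;
// don't assume it works for other embedding models.
static float[] Shorten(float[] embedding, int dims)
{
    if (dims <= 0 || dims > embedding.Length)
        throw new ArgumentOutOfRangeException(nameof(dims));
    float[] head = embedding[..dims];               // copy of the first `dims` values
    float norm = MathF.Sqrt(head.Sum(x => x * x));
    return norm == 0 ? head : head.Select(x => x / norm).ToArray();
}

// Storage per vector at float32 precision (4 bytes per component).
static int BytesPerVector(int dims) => dims * sizeof(float);

Console.WriteLine(BytesPerVector(1536)); // 6144
Console.WriteLine(BytesPerVector(256));  // 1024, i.e. 6x less storage per vector

var shortVec = Shorten(new[] { 3f, 4f, 12f }, 2); // keeps (3, 4), rescales to length 1
```

Requesting the smaller size through the API parameter avoids the extra step; the manual route matters only when re-sizing an existing index.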
Rerankers
Re-score the top-K retrieval results and pass only the best few to the LLM.
| Provider | Notes |
|---|---|
| Cohere Rerank | Best quality |
| Jina Reranker | Strong |
| BGE-Reranker (open) | Self-host |
| Cross-encoder models (sentence-transformers) | Self-host |
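using System.Linq;

// Cohere reranker (sketch): `cohere` is assumed to be a client for Cohere's rerank endpoint,
// and `candidates` the first-stage search results; adapt names to the SDK you actually use.
var reranked = await cohere.Rerank(new RerankRequest
{
    Model = "rerank-v3.5",                               // the rerank API requires a model name
    Query = userQuery,
    Documents = candidates.Select(c => c.Text).ToArray(),
    TopN = 5                                             // return only the 5 highest-scoring documents
});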
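Reranking usually improves the relevance of the passages the LLM sees, and therefore citation quality, because a reranker scores the query and each document together instead of comparing two independently computed embeddings.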
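The two-stage flow (vector search, then rerank) can be sketched without any provider. Here TermOverlap is a hypothetical toy scorer that only keeps the example runnable offline; in a real system the score function would be a cross-encoder or a hosted reranker call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Second stage of retrieval: score each (query, document) pair and keep the topN best.
// `score` stands in for a cross-encoder or hosted reranker call.
static List<string> Rerank(string query, IEnumerable<string> candidates,
                           Func<string, string, double> score, int topN) =>
    candidates.Select(doc => (Doc: doc, Score: score(query, doc)))
              .OrderByDescending(x => x.Score)
              .Take(topN)
              .Select(x => x.Doc)
              .ToList();

// Hypothetical toy scorer: fraction of query terms found in the document.
// It only keeps the sketch runnable offline; a real reranker is a trained model.
static double TermOverlap(string query, string doc)
{
    var docTerms = new HashSet<string>(
        doc.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries));
    string[] queryTerms = query.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    return queryTerms.Length == 0 ? 0 : queryTerms.Count(docTerms.Contains) / (double)queryTerms.Length;
}

// Candidates as they might arrive from the first-stage vector search.
string[] retrieved =
{
    "Invoices are archived after 90 days",
    "How to reset a forgotten password",
    "Password reset links expire after one hour",
};
List<string> top = Rerank("password reset expire", retrieved, TermOverlap, topN: 2);
foreach (string doc in top)
    Console.WriteLine(doc); // the "expire" document first, then the "forgotten password" one
```

Keeping the scorer as a delegate lets you swap the toy function for a real reranker without touching the pipeline.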
Vision
| Model | Strength |
|---|---|
| GPT-4o vision | Best general |
| Claude 4 vision | OCR/document |
| Gemini 2.5 Pro | Multimodal native |
| Phi-3 vision | Edge |
For document parsing: Claude. For real-time: GPT-4o-mini or Gemini Flash.
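Whichever model you pick, images are usually sent inline with the prompt as base64. A minimal sketch of building an RFC 2397 data URL, the form OpenAI's chat API accepts in an image part; other providers (Anthropic, for one) take the media type and base64 payload as separate fields, so check the schema of the API you call. ToDataUrl and the file name are illustrative.

```csharp
using System;

// Build an RFC 2397 data URL: "data:<media type>;base64,<payload>".
// OpenAI's chat API accepts this in an image_url content part; Anthropic's Messages API
// instead takes the media type and base64 data as separate fields.
static string ToDataUrl(byte[] imageBytes, string mediaType) =>
    $"data:{mediaType};base64,{Convert.ToBase64String(imageBytes)}";

// Real code would read a file, e.g. File.ReadAllBytes("invoice.png"); three bytes keep this runnable.
string url = ToDataUrl(new byte[] { 1, 2, 3 }, "image/png");
Console.WriteLine(url); // data:image/png;base64,AQID
```

Base64 inflates the payload by about a third, so large scans are often better uploaded or referenced by URL where the provider supports it.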
Audio (speech-to-text)
| Model | Notes |
|---|---|
| OpenAI Whisper (whisper-large-v3) | Open weights; great quality |
| OpenAI gpt-4o-audio | Real-time |
| Azure Speech | Enterprise; SDKs |
| AssemblyAI | Specialized; speaker diarization |
| Deepgram | Streaming-first |
Run Whisper locally via Faster-Whisper (Python) or Whisper.NET (.NET); both work fully offline.
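Whisper models work on 30-second windows of 16 kHz mono audio. Faster-Whisper and Whisper.NET segment long recordings for you; if you manage raw PCM buffers yourself, the arithmetic looks like the sketch below. ChunkPcm is an illustrative name, and the fixed, non-overlapping split is a simplification.

```csharp
using System;
using System.Collections.Generic;

// Whisper models consume 30-second windows of 16 kHz mono audio.
const int SampleRate = 16_000;
const int WindowSeconds = 30;

// Split mono PCM samples into consecutive 30 s chunks; the last chunk may be shorter.
// No overlap: a word straddling a boundary gets cut, which is why real pipelines
// usually overlap chunks or cut on silence.
static List<float[]> ChunkPcm(float[] samples, int sampleRate, int windowSeconds)
{
    int window = sampleRate * windowSeconds; // 480,000 samples at 16 kHz
    var result = new List<float[]>();
    for (int start = 0; start < samples.Length; start += window)
        result.Add(samples[start..Math.Min(start + window, samples.Length)]);
    return result;
}

var seventySeconds = new float[70 * SampleRate]; // silent placeholder audio
List<float[]> chunks = ChunkPcm(seventySeconds, SampleRate, WindowSeconds);
Console.WriteLine(string.Join(", ", chunks.ConvertAll(c => c.Length))); // 480000, 480000, 160000
```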
TTS (text-to-speech)
| Model | Notes |
|---|---|
| OpenAI TTS | Multiple voices; good quality |
| ElevenLabs | Best voice cloning |
| Azure Neural TTS | Enterprise |
| Google Cloud TTS (WaveNet voices) | Many languages |
Image generation
| Model | Notes |
|---|---|
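| GPT-Image (gpt-image-1; successor to DALL-E 3) | High quality; available through the OpenAI API |
| Stable Diffusion (SDXL, SD3) | Open weights; flexible |
| Imagen (Google) | High quality |
| Midjourney | No official public API |
| FLUX (Black Forest Labs) | Strong recent model; some variants have open weights |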
Self-hosting: Stable Diffusion or FLUX via Diffusers / ComfyUI.
Video generation
- Sora (OpenAI; limited availability).
- Runway Gen-3.
- Veo (Google).
- Open weights: HunyuanVideo (Tencent), Mochi (Genmo).
Still emerging. Costly. Not production-stable for most apps.
Code-specialized LLMs
- Claude 4 Sonnet/Opus for code (current top).
- GPT-5 / o3-mini for code.
- Codestral (Mistral; code-specialized).
- DeepSeek Coder.
Math / reasoning
- o3 (OpenAI).
- DeepSeek R1 (open).
- Claude with extended thinking.
Domain-specialized
- BioGPT, Med-PaLM (medical).
- BloombergGPT (finance).
- Falcon Arabic, Jais (Arabic).
Reach for these only if frontier models measurably underperform on your domain; often they don't.
.NET integration
Microsoft.Extensions.AI provides:
- IChatClient (chat)
- IEmbeddingGenerator<string, Embedding<float>> (embeddings)
- IImageGenerator (image generation)
For audio: vendor SDKs (Azure Speech, OpenAI audio client).
Senior considerations
- Embedding quality matters: text-embedding-3-small is a good default; upgrade to text-embedding-3-large or voyage-3 if retrieval quality is lacking.
- Rerankers ALMOST always help RAG quality.
- Vision tasks: try Claude for OCR, Gemini for video, GPT-4o for general.
- Audio: streaming providers (Deepgram) for real-time; Whisper for batch.
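These rules of thumb fit naturally into a small routing table in code, so a benchmark result can change one line instead of scattered call sites. The task labels and PickModel are hypothetical, and the returned strings simply restate the recommendations in this section.

```csharp
using System;

// Hypothetical task labels mapped to the model families suggested above.
// Treat the mapping as a starting point to benchmark, not a fixed rule.
static string PickModel(string task, bool realTime) => (task, realTime) switch
{
    ("ocr", _) => "Claude",
    ("video", _) => "Gemini",
    ("image", true) => "GPT-4o-mini or Gemini Flash",
    ("image", false) => "GPT-4o",
    ("speech", true) => "Deepgram (streaming)",
    ("speech", false) => "Whisper (batch)",
    _ => throw new ArgumentException($"Unknown task: {task}"),
};

Console.WriteLine(PickModel("speech", realTime: false)); // Whisper (batch)
Console.WriteLine(PickModel("ocr", realTime: true));     // Claude
```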