Chunking & Data Ingestion
Key Points
- Chunking = splitting documents into retrieval units. Quality of RAG depends heavily on chunk strategy.
- Common strategies: fixed-size (token), recursive (by separators), semantic similarity, sentence-window.
- Chunk size: typically 300-700 tokens. Smaller = precise; larger = more context.
- Overlap: 10-20% reduces boundary effects.
- Pipeline: parse → clean → chunk → embed → store metadata + vector.
Why chunking matters
LLM context windows are large but not infinite, and RAG retrieves at chunk granularity. Chunks that are too big dilute the relevant passage with unrelated text; chunks that are too small lose the context needed to interpret them.
Strategies
Fixed-size by tokens
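public IEnumerable<string> ChunkByTokens(string text, int maxTokens, int overlap)
{
    // overlap >= maxTokens would make the step zero or negative and the loop would never end.
    if (overlap < 0 || overlap >= maxTokens)
        throw new ArgumentOutOfRangeException(nameof(overlap), "Overlap must be in [0, maxTokens).");

    var tokens = _tokenizer.Encode(text);
    int step = maxTokens - overlap;
    for (int i = 0; i < tokens.Count; i += step)
    {
        var chunk = tokens.Skip(i).Take(maxTokens);
        yield return _tokenizer.Decode(chunk);
        // Stop once a window reaches the end; otherwise a final overlap-only chunk is emitted.
        if (i + maxTokens >= tokens.Count) yield break;
    }
}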
Simple. Risk: cuts mid-sentence.
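To see how maxTokens and overlap interact, the sketch below computes the token ranges a fixed-size chunker would produce. It works on token counts only, so no tokenizer is needed; WindowRanges is an illustrative name, not a library function.

```csharp
using System;
using System.Collections.Generic;

// Returns the [Start, End) token ranges for fixed-size chunks with overlap.
static List<(int Start, int End)> WindowRanges(int tokenCount, int maxTokens, int overlap)
{
    if (maxTokens <= 0) throw new ArgumentOutOfRangeException(nameof(maxTokens));
    if (overlap < 0 || overlap >= maxTokens) throw new ArgumentOutOfRangeException(nameof(overlap));

    var ranges = new List<(int Start, int End)>();
    int step = maxTokens - overlap;               // how far each window advances
    for (int start = 0; start < tokenCount; start += step)
    {
        int end = Math.Min(start + maxTokens, tokenCount);
        ranges.Add((start, end));
        if (end == tokenCount) break;             // the last window already covers the tail
    }
    return ranges;
}

// 1,200 tokens, 500-token chunks, 50-token (10%) overlap.
foreach (var (s, e) in WindowRanges(1200, 500, 50))
    Console.WriteLine($"[{s}, {e})");
// [0, 500)
// [450, 950)
// [900, 1200)
```

Each window shares its first 50 tokens with the previous one, so a sentence cut at one boundary is still whole in the neighboring chunk.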
Recursive by separators
Try to split at natural boundaries:
1. Try double newline (paragraphs).
2. If too large: single newline (lines).
3. If too large: sentence (.).
4. If too large: word (space).
5. Fall back: char.
Library: LangChain's RecursiveCharacterTextSplitter is the canonical implementation. .NET equivalents are available.
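The separator cascade above can be sketched in a few lines. This is a minimal illustration, not a port of any library: RecursiveSplit is a hypothetical name, length is measured in characters rather than tokens (an assumption made to keep the example dependency-free), and separators that fall between two chunks are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Separators from coarsest to finest; the empty string means "cut by character".
string[] separators = { "\n\n", "\n", ". ", " ", "" };

// Recursively splits text so that every piece is at most maxChars long.
static List<string> RecursiveSplit(string text, int maxChars, string[] seps, int level = 0)
{
    if (text.Length <= maxChars) return new List<string> { text };

    // Last resort: hard cut by character.
    if (level >= seps.Length || seps[level] == "")
        return text.Chunk(maxChars).Select(c => new string(c)).ToList();

    var result = new List<string>();
    var current = "";
    foreach (var part in text.Split(seps[level]))
    {
        // Greedily merge neighbouring parts while they still fit.
        var candidate = current.Length == 0 ? part : current + seps[level] + part;
        if (candidate.Length <= maxChars) { current = candidate; continue; }

        if (current.Length > 0) result.Add(current);   // flush what fits
        if (part.Length <= maxChars)
        {
            current = part;
        }
        else
        {
            // A single part is still too large: retry with the next, finer separator.
            result.AddRange(RecursiveSplit(part, maxChars, seps, level + 1));
            current = "";
        }
    }
    if (current.Length > 0) result.Add(current);
    return result;
}

var sample = "First paragraph, short.\n\nSecond paragraph is a bit longer. It has two sentences.\n\nThird.";
foreach (var piece in RecursiveSplit(sample, 40, separators))
    Console.WriteLine($"[{piece.Length}] {piece}");
```

With a 40-character limit, the first and third paragraphs survive intact, while the second is too long and falls through to the sentence separator.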
Semantic similarity chunking
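Group sentences by embedding similarity. The Microsoft.Extensions.AI family of libraries offers a semantic chunker (treat the type and option names below as illustrative and check them against the package version you use):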
var chunker = new SemanticSimilarityChunker(_embed, options: new()
{
    BreakpointThreshold = 0.5,
    MaxTokensPerChunk = 700
});
var chunks = await chunker.ChunkAsync(document);
Chunks naturally form around topic shifts.
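To make the breakpoint idea concrete without depending on any particular library, here is a minimal sketch that starts a new chunk whenever the cosine similarity between adjacent sentence embeddings drops below a threshold. The function names, the toy 2-D vectors, and the interpretation of the threshold are all assumptions for illustration; a real chunker would also enforce a maximum chunk size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Groups consecutive sentences; a new group starts when similarity
// to the previous sentence falls below the threshold (a topic shift).
static List<List<string>> GroupBySimilarity(
    IReadOnlyList<string> sentences, IReadOnlyList<float[]> vectors, double threshold)
{
    var groups = new List<List<string>>();
    for (int i = 0; i < sentences.Count; i++)
    {
        bool topicShift = i == 0 || Cosine(vectors[i - 1], vectors[i]) < threshold;
        if (topicShift) groups.Add(new List<string>());
        groups[^1].Add(sentences[i]);
    }
    return groups;
}

// Toy 2-D "embeddings" standing in for real model output.
var sampleSentences = new[]
{
    "Cats purr.", "Cats nap a lot.", "Invoices are due monthly.", "Late invoices incur fees."
};
var sampleVectors = new[]
{
    new[] { 1f, 0.1f }, new[] { 0.9f, 0.2f }, new[] { 0.1f, 1f }, new[] { 0.2f, 0.9f }
};

foreach (var group in GroupBySimilarity(sampleSentences, sampleVectors, threshold: 0.5))
    Console.WriteLine(string.Join(" ", group));
// Cats purr. Cats nap a lot.
// Invoices are due monthly. Late invoices incur fees.
```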
Sentence-window (recursive variant)
Break the text into sentences, then emit a sliding window of N consecutive sentences so each sentence keeps its neighbors as context. Good for FAQs / Q&A.
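A minimal sketch of the sentence-window idea follows. The regex-based sentence splitter is deliberately naive (it breaks on abbreviations such as "e.g."), and SplitSentences and SentenceWindows are illustrative names rather than library functions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace.
static List<string> SplitSentences(string text) =>
    Regex.Split(text.Trim(), @"(?<=[.!?])\s+").Where(s => s.Length > 0).ToList();

// Emits windows of up to windowSize consecutive sentences, advancing by stride each time.
static List<string> SentenceWindows(string text, int windowSize, int stride = 1)
{
    if (windowSize <= 0 || stride <= 0) throw new ArgumentOutOfRangeException();

    var sentences = SplitSentences(text);
    var windows = new List<string>();
    for (int i = 0; i < sentences.Count; i += stride)
    {
        int count = Math.Min(windowSize, sentences.Count - i);
        windows.Add(string.Join(" ", sentences.GetRange(i, count)));
        if (i + count == sentences.Count) break;   // this window already reached the end
    }
    return windows;
}

var faq = "Reset your password in Settings. Click Security. Choose Reset. A link arrives by email.";
foreach (var window in SentenceWindows(faq, windowSize: 2))
    Console.WriteLine(window);
// Reset your password in Settings. Click Security.
// Click Security. Choose Reset.
// Choose Reset. A link arrives by email.
```

A stride of 1 maximizes overlap; a stride equal to windowSize yields non-overlapping windows.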
Document-aware
For Markdown / HTML / PDF: use the structure.
Chunk by heading; keep heading hierarchy as metadata for context.
For PDFs: use layout-aware parsing (libraries such as MarkItDown, Unstructured.io, or Apache Tika).
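For Markdown, heading-based chunking with the hierarchy kept as metadata might look like the sketch below. ChunkByHeading is a hypothetical helper; it only recognizes ATX headings (#, ##, ...) and would misread lines starting with # inside fenced code blocks, so treat it as an illustration rather than a parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Splits Markdown into sections at headings and records each section's heading path.
static List<(string HeadingPath, string Body)> ChunkByHeading(string markdown)
{
    var chunks = new List<(string HeadingPath, string Body)>();
    var path = new List<string>();   // current heading hierarchy, outermost first
    var body = new List<string>();   // lines collected for the current section

    void Flush()
    {
        var text = string.Join("\n", body).Trim();
        if (text.Length > 0) chunks.Add((string.Join(" > ", path), text));
        body.Clear();
    }

    foreach (var line in markdown.Split('\n'))
    {
        var m = Regex.Match(line, @"^(#{1,6})\s+(.*)$");
        if (!m.Success) { body.Add(line); continue; }

        Flush();                                   // close the previous section
        int level = m.Groups[1].Length;
        // Drop headings at the same or a deeper level before adding the new one.
        if (path.Count >= level) path.RemoveRange(level - 1, path.Count - (level - 1));
        path.Add(m.Groups[2].Value.Trim());
    }
    Flush();
    return chunks;
}

var doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.";
foreach (var (headingPath, sectionBody) in ChunkByHeading(doc))
    Console.WriteLine($"{headingPath}: {sectionBody}");
// Guide: Intro text.
// Guide > Install: Run the installer.
// Guide > Usage: Call the API.
```

Prepending the heading path to the chunk text before embedding is one way to give short sections the context they would otherwise lack.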
Ingestion pipeline
1. Discover documents (file system, S3, URL).
2. Parse (extract text from PDF/HTML/DOCX).
3. Clean (remove boilerplate, normalize).
4. Chunk.
5. Embed (batched).
6. Upsert with metadata.
7. Track ingestion state.
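public async Task IngestAsync(Document doc, CancellationToken ct)
{
    var text = await _parser.ExtractAsync(doc, ct);

    // Materialize once: the chunks are enumerated for both embedding and record building.
    var chunks = _chunker.Chunk(text).ToList();
    if (chunks.Count == 0) return; // nothing to embed

    // Assumes _embed is an IEmbeddingGenerator<string, Embedding<float>>; there the second
    // positional parameter is the options object, so the token is passed by name.
    var embeddings = await _embed.GenerateAsync(chunks.Select(c => c.Text).ToArray(), cancellationToken: ct);

    var ingestedAt = DateTimeOffset.UtcNow; // one timestamp for the whole document
    var records = chunks.Zip(embeddings, (c, e) => new VectorRecord
    {
        Id = $"{doc.Id}:{c.Index}",
        DocumentId = doc.Id,
        ChunkIndex = c.Index,
        Text = c.Text,
        Embedding = e.Vector,
        Source = doc.Source,
        Title = doc.Title,
        IngestedAt = ingestedAt
    }).ToList();

    await _vectorStore.UpsertBatchAsync(records, ct);
}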
Metadata for filtering
Store with vectors:
- tenantId (multi-tenant filter).
- documentId, title, url, source.
- chunkIndex, chunkText.
- category, tags.
- createdAt, updatedAt.
- embeddingModel (for re-indexing).
Search filters use these.
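As an illustration of how such metadata is used at query time, the sketch below pre-filters an in-memory list by tenant and category before ranking by similarity. The tuple shape, the Search function, and the dot-product scoring are assumptions for the example; a real vector store would apply the filter server-side.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

static float Dot(float[] a, float[] b) => a.Zip(b, (x, y) => x * y).Sum();

// Pre-filters by metadata, then ranks the survivors by similarity to the query vector.
static List<string> Search(
    IEnumerable<(string TenantId, string Category, string Text, float[] Embedding)> records,
    float[] query, string tenantId, string? category, int topK)
{
    return records
        .Where(r => r.TenantId == tenantId)                       // hard isolation between tenants
        .Where(r => category is null || r.Category == category)   // optional narrowing filter
        .OrderByDescending(r => Dot(r.Embedding, query))
        .Take(topK)
        .Select(r => r.Text)
        .ToList();
}

var store = new List<(string TenantId, string Category, string Text, float[] Embedding)>
{
    ("acme", "billing", "Invoices are due monthly.", new[] { 0.9f, 0.1f }),
    ("acme", "support", "Reset your password in Settings.", new[] { 0.2f, 0.8f }),
    ("globex", "billing", "Globex invoices are due weekly.", new[] { 1.0f, 0.0f }),
};

var hits = Search(store, query: new[] { 1f, 0f }, tenantId: "acme", category: null, topK: 2);
Console.WriteLine(string.Join(" | ", hits));
// Invoices are due monthly. | Reset your password in Settings.
```

The Globex chunk scores highest on similarity alone, but the tenant filter removes it before ranking, which is the point of storing tenantId next to the vector.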
Idempotency
Hash of doc content → if unchanged, skip re-embedding.
var hash = Hash(text);
var existing = await _store.GetMetadataAsync(doc.Id);
if (existing?.Hash == hash) return;
Saves embedding cost on no-op updates.
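The Hash helper is not defined above; one plausible implementation, assuming SHA-256 over the UTF-8 bytes is acceptable as a content fingerprint, is:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Stable content fingerprint: SHA-256 over the UTF-8 bytes, as lowercase hex.
static string Hash(string text)
{
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(text));
    return Convert.ToHexString(digest).ToLowerInvariant();
}

Console.WriteLine(Hash("hello").Length);             // 64
Console.WriteLine(Hash("hello") == Hash("hello "));  // False
```

It may be worth hashing the cleaned text together with the embedding model name and chunking settings, so that a configuration change also triggers re-embedding; that is a design suggestion, not something the pipeline above already does.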
Deletion
When doc deleted/superseded: delete all chunks where documentId == X.
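The sketch below shows why deletion by documentId matters when a document shrinks: upserting alone would leave the old trailing chunk behind. The dictionary is an in-memory stand-in for a vector store, and DeleteDocument is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for a vector store, keyed by chunk id ("{documentId}:{chunkIndex}").
var store = new Dictionary<string, (string DocumentId, string Text)>
{
    ["doc-1:0"] = ("doc-1", "old intro"),
    ["doc-1:1"] = ("doc-1", "old details"),
    ["doc-2:0"] = ("doc-2", "unrelated"),
};

// Removes every chunk that belongs to the document and returns how many were deleted.
static int DeleteDocument(Dictionary<string, (string DocumentId, string Text)> chunks, string documentId)
{
    var ids = chunks.Where(kv => kv.Value.DocumentId == documentId).Select(kv => kv.Key).ToList();
    foreach (var id in ids) chunks.Remove(id);
    return ids.Count;
}

// doc-1 is superseded by a shorter version: delete first, then upsert,
// so the stale "doc-1:1" chunk cannot survive.
int removed = DeleteDocument(store, "doc-1");
store["doc-1:0"] = ("doc-1", "new, shorter text");
Console.WriteLine($"removed {removed}, store now holds {store.Count} chunks");
// removed 2, store now holds 2 chunks
```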
Pre-processing
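public string Clean(string text)
{
    text = Regex.Replace(text, @"https?://\S+", "");          // strip URLs first, so no gaps remain afterwards
    text = Regex.Replace(text, @"[^\w\s.,!?;:'""()-]", "");   // strip non-text symbols
    text = Regex.Replace(text, @"[^\S\n]+", " ");             // collapse spaces and tabs, keep newlines
    text = Regex.Replace(text, @" ?\n ?", "\n");              // trim spaces around line breaks
    text = Regex.Replace(text, @"\n{3,}", "\n\n");            // keep paragraph breaks for the chunker
    return text.Trim();
}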
Aggressive cleaning may hurt quality; tune for your data.
Tokenization
Match the tokenizer to the embedding model. OpenAI's text-embedding-3 models use cl100k_base.
// TiktokenTokenizer is in the Microsoft.ML.Tokenizers package.
var tokenizer = TiktokenTokenizer.CreateForModel("text-embedding-3-small");
var count = tokenizer.CountTokens(text);
Senior considerations
- Quality > size: smart chunking > big chunks.
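- Test your chunks: pull a random sample and check that each chunk makes sense on its own.
- Re-embed on model upgrade: vectors from different embedding models are not comparable.
- Cost: embedding 1M docs * 1K tokens = 1B tokens, or ~$20-130 at roughly $0.02-0.13 per 1M tokens depending on model.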
- Throughput: batch + retry; respect rate limits.
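The "batch + retry" point can be sketched as follows. EmbedAllAsync and the embedBatch delegate are placeholders for whatever embedding client is in use; the sketch retries on any exception for brevity, whereas production code would usually retry only transient failures such as rate-limit or server errors.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Embeds texts in batches, retrying each failed batch with exponential backoff.
static async Task<List<float[]>> EmbedAllAsync(
    IEnumerable<string> texts,
    Func<IReadOnlyList<string>, CancellationToken, Task<IReadOnlyList<float[]>>> embedBatch,
    int batchSize = 64, int maxAttempts = 4, int baseDelayMs = 500, CancellationToken ct = default)
{
    var all = new List<float[]>();
    foreach (var batch in texts.Chunk(batchSize))
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                all.AddRange(await embedBatch(batch, ct));
                break;
            }
            catch (Exception ex) when (attempt < maxAttempts && ex is not OperationCanceledException)
            {
                // Wait 1x, 2x, 4x ... the base delay before retrying this batch.
                await Task.Delay(baseDelayMs * (1 << (attempt - 1)), ct);
            }
        }
    }
    return all;
}

// Fake embedder that fails on its first call to simulate a rate limit.
int calls = 0;
Task<IReadOnlyList<float[]>> FakeEmbed(IReadOnlyList<string> batch, CancellationToken _)
{
    if (++calls == 1) throw new InvalidOperationException("simulated rate limit");
    return Task.FromResult<IReadOnlyList<float[]>>(batch.Select(t => new[] { (float)t.Length }).ToList());
}

var vectors = await EmbedAllAsync(new[] { "a", "bb", "ccc" }, FakeEmbed, batchSize: 2, baseDelayMs: 10);
Console.WriteLine($"{vectors.Count} vectors after {calls} calls");
// 3 vectors after 3 calls
```

Once the last attempt fails, the exception filter no longer matches and the error propagates to the caller, which is usually the right place to record the failure in the ingestion state.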
Common pitfalls
- ❌ Splitting mid-sentence.
- ❌ No overlap.
- ❌ No metadata; can't filter.
- ❌ No idempotency; re-embed all on every run.
- ❌ Embedding model mismatch with stored vectors.