Chunking & Data Ingestion

Key Points

Chunking = splitting documents into retrieval units. Quality of RAG depends heavily on chunk strategy.
Common strategies: fixed-size (token), recursive (by separators), semantic similarity, sentence-window.
Chunk size: typically 300-700 tokens. Smaller = precise; larger = more context.
Overlap: 10-20% reduces boundary effects.
Pipeline: parse → clean → chunk → embed → store metadata + vector.

Why chunking matters

LLM context windows are large but not infinite. RAG retrieval works on chunks. Chunk too big → irrelevant context dilutes; too small → missing context.

Strategies

Fixed-size by tokens

public IEnumerable<string> ChunkByTokens(string text, int maxTokens, int overlap)
{
    var tokens = _tokenizer.Encode(text);
    for (int i = 0; i < tokens.Count; i += (maxTokens - overlap))
    {
        var chunk = tokens.Skip(i).Take(maxTokens);
        yield return _tokenizer.Decode(chunk);
    }
}

Simple. Risk: cuts mid-sentence.

Recursive by separators

Try to split at natural boundaries:

1. Try double newline (paragraphs).
2. If too large: single newline (lines).
3. If too large: sentence (.).
4. If too large: word (space).
5. Fall back: char.

Library: LangChain's RecursiveCharacterTextSplitter is the canonical. .NET equivalents available.

Semantic similarity chunking

Group sentences by embedding similarity. Microsoft.Extensions.AI offers:

var chunker = new SemanticSimilarityChunker(_embed, options: new()
{
    BreakpointThreshold = 0.5,
    MaxTokensPerChunk = 700
});

var chunks = await chunker.ChunkAsync(document);

Chunks naturally form around topic shifts.

Sentence-window (recursive variant)

Break by sentence; emit window of N sentences. Good for FAQs / Q&A.

Document-aware

For Markdown / HTML / PDF: use the structure.

# Heading 1
## Heading 2
content...

Chunk by heading; keep heading hierarchy as metadata for context.

For PDFs: layout-aware (use libraries like MarkItDown, Unstructured.io, or tika).

Ingestion pipeline

1. Discover documents (file system, S3, URL).
2. Parse (extract text from PDF/HTML/DOCX).
3. Clean (remove boilerplate, normalize).
4. Chunk.
5. Embed (batched).
6. Upsert with metadata.
7. Track ingestion state.

public async Task IngestAsync(Document doc, CancellationToken ct)
{
    var text = await _parser.ExtractAsync(doc, ct);
    var chunks = _chunker.Chunk(text);
    var embeddings = await _embed.GenerateAsync(chunks.Select(c => c.Text).ToArray(), ct);
    var records = chunks.Zip(embeddings, (c, e) => new VectorRecord
    {
        Id = $"{doc.Id}:{c.Index}",
        DocumentId = doc.Id,
        ChunkIndex = c.Index,
        Text = c.Text,
        Embedding = e.Vector,
        Source = doc.Source,
        Title = doc.Title,
        IngestedAt = DateTimeOffset.UtcNow
    });
    await _vectorStore.UpsertBatchAsync(records, ct);
}

Metadata for filtering

Store with vectors:

tenantId (multi-tenant filter).
documentId, title, url, source.
chunkIndex, chunkText.
category, tags.
createdAt, updatedAt.
embeddingModel (for re-indexing).

Search filters use these.

Idempotency

Hash of doc content → if unchanged, skip re-embedding.

var hash = Hash(text);
var existing = await _store.GetMetadataAsync(doc.Id);
if (existing?.Hash == hash) return;

Saves embedding cost on no-op updates.

Deletion

When doc deleted/superseded: delete all chunks where documentId == X.

Pre-processing

public string Clean(string text)
{
    text = Regex.Replace(text, @"\s+", " ");                    // normalize whitespace
    text = Regex.Replace(text, @"http\S+", "");                 // strip URLs
    text = Regex.Replace(text, @"[^\w\s.,!?;:'""()-]", "");     // strip non-text
    return text.Trim();
}

Aggressive cleaning may hurt quality; tune for your data.

Tokenization

Match tokenizer to embedding model. OpenAI uses cl100k_base for text-embedding-3.

using var tokenizer = TiktokenTokenizer.CreateForModel("text-embedding-3-small");
var count = tokenizer.CountTokens(text);

Senior considerations

Quality > size: smart chunking > big chunks.
Test your chunks: pull random; check they make sense standalone.
Re-embed on model upgrade: incompatible vectors.
Cost: embedding 1M docs * 1K tokens = ~$20-130 depending on model.
Throughput: batch + retry; respect rate limits.

Common pitfalls

❌ Splitting mid-sentence.
❌ No overlap.
❌ No metadata; can't filter.
❌ No idempotency; re-embed all on every run.
❌ Embedding model mismatch with stored vectors.