Retrieval-Augmented Generation (RAG) sounds simple in demos: embed your documents, retrieve relevant chunks, pass them to an LLM, get an answer. In production, each of those steps can go wrong. These are the five mistakes we made.
Mistake 1: Fixed-Size Chunking
Our first RAG pipeline split documents into fixed 512-token chunks. The results were mediocre because important context was routinely split across chunk boundaries. We switched to semantic chunking: splitting on paragraph boundaries and sentence endings, with some overlap between adjacent chunks so that context near a boundary appears in both.
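To make the idea concrete, here is a minimal sketch of paragraph- and sentence-aware chunking with overlap. It is not our production code: the function names, the regex-based sentence splitter, the word-count budget (a stand-in for real token counts), and the default sizes are all assumptions chosen for illustration.

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    # A real pipeline would likely use a proper sentence segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def chunk_document(text: str, max_words: int = 120, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks that never cross a paragraph boundary.

    Each new chunk inside a paragraph starts with the last `overlap_sentences`
    sentences of the previous chunk. Words are used as a rough proxy for tokens.
    """
    chunks: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text):
        current: list[str] = []
        new_in_current = 0  # sentences in `current` that are not carried-over overlap
        for sentence in split_sentences(paragraph):
            words = sum(len(s.split()) for s in current) + len(sentence.split())
            if new_in_current > 0 and words > max_words:
                chunks.append(" ".join(current))
                # Carry the tail of the finished chunk into the next one.
                current = current[-overlap_sentences:] if overlap_sentences > 0 else []
                new_in_current = 0
            current.append(sentence)
            new_in_current += 1
        if new_in_current > 0:
            chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than the budget still becomes its own (oversized) chunk here; whether to split it further is a design choice this sketch leaves open.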
Mistake 2: Using the Wrong Embedding Model
We started with text-embedding-ada-002 because it was the obvious default. For our domain, it underperformed. After benchmarking against our specific test set, we switched to a domain-fine-tuned model and saw a 23% improvement in retrieval precision.
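A benchmark of this kind can be quite small. The sketch below computes mean precision@k for any embedding function over a labelled query set, so two models can be compared on the same data. Everything in it is illustrative: `precision_at_k`, `toy_embed`, and the tiny vocabulary are hypothetical stand-ins, and a real comparison would pass in functions that call the actual embedding models.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def precision_at_k(
    embed: Callable[[str], Vector],
    corpus: dict[str, str],        # doc_id -> text
    queries: dict[str, set[str]],  # query -> ids of relevant docs
    k: int = 5,
) -> float:
    """Mean fraction of the top-k retrieved documents that are labelled relevant."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}
    scores = []
    for query, relevant in queries.items():
        q = embed(query)
        ranked = sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]), reverse=True)
        scores.append(sum(1 for d in ranked[:k] if d in relevant) / k)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical stand-in for a real embedding model: counts of a few known words.
VOCAB = ["refund", "invoice", "login", "password"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]
```

Running `precision_at_k` once per candidate model on the same corpus and query set gives directly comparable numbers.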
Mistake 3: No Reranking
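Embedding similarity is a coarse retrieval signal: the query and each chunk are encoded independently, so fine-grained relevance is easy to miss. Adding a cross-encoder reranker, which scores each query-chunk pair jointly, as a second pass over the retrieved candidates significantly improved answer relevance.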
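The shape of that second pass is simple, as the sketch below shows: take the candidates from the first-stage retriever, score each (query, passage) pair, and keep the best few. The `rerank` function and its parameters are hypothetical, and `overlap_score` is only a toy stand-in so the example runs on its own; in a real pipeline it would be replaced by a call to a cross-encoder model.

```python
from typing import Callable


def rerank(
    query: str,
    passages: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order first-stage candidates by a pairwise relevance score."""
    scored = [(score_pair(query, passage), passage) for passage in passages]
    # Sort on the score only; ties keep their first-stage order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def overlap_score(query: str, passage: str) -> float:
    # Toy stand-in for a cross-encoder (assumption): fraction of query terms
    # that also appear in the passage.
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0
```

Because the pairwise scorer is much slower than a vector lookup, it is usually applied only to a few dozen first-stage candidates rather than the whole corpus.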
Mistake 4: Ignoring Context Window Management
As document volume grew, we hit context window limits. The fix was a budget-aware context assembly function that prioritises the highest-scoring chunks and truncates gracefully when the limit is approached.
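A budget-aware assembly function along these lines might look like the following sketch. The name `assemble_context`, the whitespace-based token count, and the `min_tokens` threshold are assumptions for illustration; a production version would count tokens with the model's own tokenizer.

```python
def assemble_context(
    scored_chunks: list[tuple[float, str]],  # (relevance score, chunk text)
    token_budget: int,
    min_tokens: int = 20,
) -> str:
    """Fill the budget with the highest-scoring chunks first.

    The first chunk that does not fit is cut at a word boundary to use the
    remaining budget, unless fewer than `min_tokens` remain, in which case
    it is dropped. Tokens are approximated by whitespace-separated words.
    """
    selected: list[str] = []
    remaining = token_budget
    for _, chunk in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        words = chunk.split()
        if len(words) <= remaining:
            selected.append(chunk)
            remaining -= len(words)
        else:
            if remaining >= min_tokens:
                selected.append(" ".join(words[:remaining]))
            break  # budget exhausted; lower-scoring chunks are skipped
    return "\n\n".join(selected)
```

Stopping at the first chunk that does not fit keeps the context strictly in score order; continuing the loop to squeeze in smaller, lower-scoring chunks is an alternative with a different trade-off.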
Mistake 5: No Evaluation Framework
We shipped without a systematic way to evaluate pipeline quality, so we had no reliable way to tell whether a change helped or hurt. The fix was a small test set of 50 question-answer pairs and an automated evaluation run after every pipeline change.
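An evaluation harness for a test set like this does not need to be elaborate. The sketch below runs a pipeline over question-answer pairs and reports accuracy plus the failing questions. The `EvalCase` and `evaluate` names are hypothetical, and the scoring rule (the expected answer must appear in the generated answer after normalisation) is an assumption; other scoring methods would slot into the same loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not count.
    return " ".join(text.lower().split())


def evaluate(pipeline: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the pipeline and collect the failures."""
    failures = []
    for case in cases:
        answer = pipeline(case.question)
        # Assumed scoring rule: the expected answer must be contained in the output.
        if normalize(case.expected) not in normalize(answer):
            failures.append(case.question)
    passed = len(cases) - len(failures)
    accuracy = passed / len(cases) if cases else 0.0
    return {"accuracy": accuracy, "failures": failures}
```

Comparing the reported accuracy before and after each pipeline change, and reading the list of failing questions, makes regressions visible early.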