RAG · LLM · Vector Search · AI · Machine Learning · LangChain

RAG Done Right — Retrieval-Augmented Generation Beyond the Basics

A deep-dive into production-grade RAG: chunking strategies, hybrid search, HyDE query transformation, cross-encoder reranking, context assembly, and evaluation with RAGAS. Go beyond naive vector lookup and build retrieval pipelines that actually work.

2026-04-15

Why Naive RAG Fails in Production

The typical RAG proof-of-concept takes an afternoon to build: split documents into fixed-size chunks, embed them with OpenAI, store in a vector database, retrieve top-k at query time, and stuff them into a prompt. The demo looks impressive. Then it hits production, and retrieval quality collapses.

The original RAG paper from Meta AI (2020) established the core idea: augment a language model with a non-parametric retrieval component to ground answers in external knowledge. In the years since, the pattern has become ubiquitous. But the gap between a working RAG prototype and a reliable production system is substantial, and most teams underestimate it.

Retrieval failure is the leading cause of RAG hallucinations. The model does not fabricate answers because it is broken — it fabricates because the retriever handed it irrelevant or incomplete chunks. Fix the retriever, and the generator gets dramatically better without any model fine-tuning. This guide covers the techniques that separate robust RAG systems from weekend demos: chunking strategies, hybrid search, query transformation, cross-encoder reranking, context assembly, and systematic evaluation.

The RAG Stack — What Each Layer Actually Does

A production RAG system has two distinct pipelines: indexing (offline) and retrieval + generation (online). Keeping them conceptually separate prevents the mistake of optimizing the wrong bottleneck.

Indexing Pipeline

Load raw documents → clean and normalize → split into chunks → embed each chunk → store (vector + metadata) in the vector store. This runs once and whenever documents change. Chunk quality is fixed here — there is no online recovery from bad chunking.

Retrieval Pipeline

Receive user query → transform query (optional) → retrieve candidate chunks → rerank candidates → assemble context window → generate answer. Each stage is independently tunable and independently measurable.
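Seen as code, the online pipeline is a chain of small, swappable stages. A minimal sketch — the stage functions (`transform`, `retrieve`, `rerank`, `assemble`, `generate`) are placeholders for whatever implementations you choose:

```python
def rag_answer(query, *, transform, retrieve, rerank, assemble, generate):
    """Run the online RAG pipeline as explicit, independently tunable stages."""
    q = transform(query)                 # optional: HyDE, decomposition
    candidates = retrieve(q, k=50)       # hybrid: BM25 + dense vectors
    top = rerank(query, candidates)[:8]  # reranker keeps the best few
    return generate(assemble(top), query)
```

Because each stage is an explicit function boundary, you can A/B-test a new reranker or query transform without touching the rest of the pipeline.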

Evaluation Loop

Measure retrieval recall, precision, and faithfulness continuously. Without a quantitative feedback loop, you are flying blind — changes may feel better in demos but regress real-world performance.

Chunking: Where Most Teams Get It Wrong

Chunk size is a retrieval-generation trade-off. Small chunks improve retrieval precision (the retrieved passage is more likely to be exactly relevant) but risk losing context that the generator needs. Large chunks preserve context but reduce retrieval precision and burn context-window tokens fast. The right size depends on your document structure — there is no universal answer.

Recursive character splitting is the safest default for unstructured prose. It tries to split at paragraph boundaries first, then sentences, then words, falling back to characters only when necessary. LangChain's RecursiveCharacterTextSplitter implements this well.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # characters, not tokens
    chunk_overlap=64,      # overlap to preserve sentence continuity
    separators=[
        "\n\n",          # paragraph break first
        "\n",             # line break
        ". ",              # sentence boundary
        " ",               # word boundary
        "",                # character fallback
    ],
    length_function=len,
)

chunks = splitter.split_documents(docs)

# Each chunk carries metadata from the source document
# Preserve: source URL, section heading, page number, date
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = f"{chunk.metadata['source']}:{i}"
    chunk.metadata["heading"] = extract_nearest_heading(chunk)

Note

Chunk overlap (the chunk_overlap parameter) prevents information from being split across chunk boundaries. A 64-character overlap means the first 64 characters of chunk N+1 repeat the last 64 characters of chunk N. Set it to roughly 10-15% of chunk size. Zero overlap causes retrieval gaps at boundaries.

For structured documents (code, tables, Markdown with headers), semantic splitting respects the document's own structure. Split on Markdown headings for wikis and docs. Split on function/class boundaries for code. The structure itself signals where complete thoughts begin and end.

An underused pattern is parent-child chunking: index small child chunks for precise retrieval, but return the larger parent chunk to the generator for context. LlamaIndex's AutoMergingRetriever implements this. Retrieve at 128-character granularity, return the 512-character parent.
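The mechanics are simple enough to sketch in pure Python — the sizes and id scheme here are illustrative, not a fixed convention:

```python
def split_parent_child(doc_id, text, parent_size=512, child_size=128):
    """Split text into parent windows and small child chunks.

    Children are what you embed and search; each child carries its
    parent's id so retrieval can hand the larger parent to the generator.
    """
    parents, children = {}, []
    for p_start in range(0, len(text), parent_size):
        pid = f"{doc_id}:{p_start}"
        parent = text[p_start:p_start + parent_size]
        parents[pid] = parent
        for c_start in range(0, len(parent), child_size):
            children.append({"parent_id": pid,
                             "text": parent[c_start:c_start + child_size]})
    return parents, children
```

At query time, search over `children`, collect the distinct `parent_id`s of the hits, and pass `parents[pid]` to the generator.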

Embedding Models That Actually Matter

Not all embedding models are created equal for retrieval. OpenAI's text-embedding-3-large performs well for general English text, but BAAI/bge-large-en-v1.5 and E5-large-v2 are competitive open-source alternatives with much lower latency when self-hosted.

For domain-specific corpora (legal, medical, code), general-purpose embeddings fall short. Fine-tuning an embedding model on your domain using contrastive learning — even with a small labeled dataset of (query, relevant chunk) pairs — produces substantial retrieval improvements. The Sentence Transformers training pipeline makes this tractable.

Note

Asymmetric retrieval models embed queries and documents differently. Models like E5 expect queries prefixed with "query: " and documents prefixed with "passage: ". Skipping these prefixes degrades retrieval quality by 10-20% on MTEB benchmarks. Always check the model card for the correct prompt template.

Hybrid Search: BM25 + Dense Vectors

Pure vector search has a well-known weakness: exact keyword matching. If a user asks about “CVE-2024-38856” or a product model number like “RTX 5090”, dense embeddings often fail because these tokens have no semantic neighborhood to search. BM25 — the lexical retrieval algorithm underlying Elasticsearch and Solr — handles exact matches perfectly.

Hybrid search combines both: retrieve candidates from BM25 and from dense vectors separately, then merge the ranked lists using Reciprocal Rank Fusion (RRF). This consistently outperforms either method alone on heterogeneous queries. Qdrant, Weaviate, and Elasticsearch all support hybrid search natively.
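RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k ≈ 60 by convention, so documents that rank decently in both legs beat documents that top only one. A minimal sketch:

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists with Reciprocal Rank Fusion.

    score(d) = sum over lists of 1 / (k + rank_in_list), so documents
    ranking well in several lists float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["d3", "d1", "d7"]   # lexical leg
dense_hits = ["d1", "d2", "d3"]   # vector leg
fused = rrf_fuse([bm25_hits, dense_hits])   # "d1" wins: strong in both
```

In practice you let the vector database do this fusion server-side, as below, but knowing the formula helps when debugging why a hit did or did not surface.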

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, SparseVectorParams,
    Prefetch, FusionQuery, Fusion,
)

client = QdrantClient(url="http://localhost:6333")

# Collection with both dense and sparse vectors
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams(),
    },
)

# Hybrid query: dense + BM25 sparse, fused with RRF
results = client.query_points(
    collection_name="documents",
    prefetch=[
        # Dense vector leg
        Prefetch(query=dense_embedding, using="dense", limit=20),
        # Sparse BM25 leg
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),   # Reciprocal Rank Fusion
    limit=10,
    with_payload=True,
)

Query Transformation — HyDE, Step-Back, and Decomposition

The user's raw query is often a poor retrieval signal. Short questions lack the vocabulary of the documents they are meant to retrieve. Complex questions retrieve partial answers that fail to compose into a complete response. Query transformation addresses both problems.

HyDE — Hypothetical Document Embeddings

HyDE (from Gao et al., 2022) generates a hypothetical answer to the query using the LLM, then embeds that hypothetical answer for retrieval. The hypothesis lives in the same embedding space as real documents — it is dense with relevant vocabulary — so similarity search finds better matches than the sparse, ambiguous original question.

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Generate a hypothetical answer
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that would answer the following question. "
    "Do not say you don't know — write what such a passage might say.\n\n"
    "Question: {question}\n\nPassage:"
)

chain = hyde_prompt | llm

hypothetical_doc = chain.invoke({"question": user_query}).content

# Embed the hypothesis, not the original query
retrieval_embedding = embed_model.embed_query(hypothetical_doc)

# Use retrieval_embedding to search the vector store
candidates = vector_store.similarity_search_by_vector(
    retrieval_embedding, k=10
)

Step-Back Prompting

Specific questions often need broader context to answer well. “Why did the deployment fail on Tuesday?” requires understanding the deployment architecture. Step-back prompting generates a more abstract version of the query (“What are common causes of deployment failures in this system?”), retrieves against both the specific and abstract queries, and unions the results.
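A sketch of the pattern — here `llm` is assumed to be a callable that returns the step-back rewrite as a string, and `retriever` is assumed to return chunk dicts carrying a `chunk_id`:

```python
def step_back_retrieve(question, llm, retriever, k=5):
    """Retrieve against both the specific question and its step-back
    abstraction, then union the results (dedup by chunk_id)."""
    abstract = llm(
        "Rewrite this question at a more general level, so that "
        f"background documentation would match it:\n\n{question}"
    )
    seen, merged = set(), []
    for q in (question, abstract):
        for doc in retriever(q, k=k):
            if doc["chunk_id"] not in seen:
                seen.add(doc["chunk_id"])
                merged.append(doc)
    return merged
```

The union is ordered specific-first, so chunks answering the literal question still lead the context.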

Multi-Query Decomposition

For complex questions that span multiple topics, generate several sub-queries, retrieve against each independently, and deduplicate the combined candidate set. This prevents single-query retrieval from missing half the relevant documents.

from langchain.retrievers.multi_query import MultiQueryRetriever

# LangChain automatically generates query variants and merges results
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    llm=llm,
)

# The retriever internally generates ~3 query variants,
# retrieves for each, deduplicates, and returns the union
docs = retriever.invoke(
    "How does the rate limiter interact with the auth middleware?"
)

Reranking with Cross-Encoders

Bi-encoder retrieval (embedding query and document separately, comparing with cosine similarity) is fast but imprecise. It cannot model fine-grained token-level interactions between query and document. Cross-encoders solve this: they take the (query, document) pair as joint input, attend across both, and produce a relevance score. They are 100× slower but dramatically more accurate.

The standard pattern is a two-stage pipeline: fast bi-encoder retrieval fetches 50-100 candidates; the cross-encoder reranks and returns the top 5-10. The generator only sees the reranked top-k, which is far more relevant than the raw retrieval output. SBERT cross-encoders and Cohere Rerank are both production-proven options.

from sentence_transformers import CrossEncoder
import numpy as np

# Load a cross-encoder model (runs locally)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Stage 1: Fast bi-encoder retrieval — get 50 candidates
candidates = vector_store.similarity_search(user_query, k=50)

# Stage 2: Cross-encoder reranking
pairs = [(user_query, doc.page_content) for doc in candidates]
scores = reranker.predict(pairs)          # shape: (50,)

# Sort candidates by cross-encoder score
ranked_indices = np.argsort(scores)[::-1]
top_chunks = [candidates[i] for i in ranked_indices[:6]]

# top_chunks is now much more precisely relevant
context = "\n\n---\n\n".join(c.page_content for c in top_chunks)

Note

Reranking adds 200-800ms latency per query depending on model size and candidate count. For latency-sensitive applications, use Cohere Rerank or a distilled cross-encoder like ms-marco-MiniLM-L-6-v2 (6 layers, fast on CPU) rather than a full BERT-large cross-encoder.

Assembling the Context Window

How you present retrieved chunks to the generator matters. Raw concatenation works for simple cases but produces poor results when chunks are contradictory, repetitive, or lack source attribution.

Lost in the middle is an LLM attention bias: models attend well to context at the beginning and end of the window, but struggle with information buried in the middle. Place the most relevant chunk first or last, not in the middle of a long context block. Liu et al. (2023) quantified this effect across multiple models.
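One mitigation is to reorder the reranked chunks so the strongest ones land at the edges of the context. A small sketch, assuming the input list is already sorted best-first:

```python
def order_for_attention(chunks):
    """Reorder relevance-sorted chunks so the best sit at the start and
    end of the context window, pushing the weakest toward the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks):           # chunks sorted best-first
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]                    # best first, 2nd-best last
```

With five chunks ranked 1–5, this yields the order 1, 3, 5, 4, 2: the top two occupy the positions the model attends to most, and the weakest sinks to the middle.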

Include source metadata in the context block. Telling the model where each chunk came from enables citation in the answer and helps the model weight conflicting chunks by source authority.

def assemble_context(chunks: list, query: str) -> str:
    """
    Assemble retrieved chunks into a structured context block.
    Most relevant chunk goes first (cross-encoder top-1).
    Metadata included for source attribution.
    """
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "unknown")
        section = chunk.metadata.get("heading", "")
        date = chunk.metadata.get("date", "")

        header = f"[Source {i+1}: {source}"
        if section:
            header += f" / {section}"
        if date:
            header += f" ({date})"
        header += "]"

        context_parts.append(f"{header}\n{chunk.page_content}")

    return "\n\n".join(context_parts)

SYSTEM_PROMPT = """You are a helpful assistant. Answer based only on the
provided context. If the context does not contain enough information to
answer confidently, say so. Cite the source numbers in your answer."""

context = assemble_context(top_chunks, user_query)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
]

Evaluation — Measuring What Actually Matters

Qualitative vibe-checks do not scale. Production RAG systems need automated evaluation pipelines that run on every change. Two frameworks have become standard: RAGAS and ARES. Both evaluate retrieval and generation components independently.

Context Recall

What fraction of the information needed to answer the question is actually present in the retrieved context? Measures retrieval completeness. Low recall = retriever is missing relevant chunks.

Context Precision

What fraction of the retrieved chunks are actually relevant? Low precision = context window is polluted with noise, causing hallucination or distraction.

Answer Faithfulness

Does every claim in the generated answer appear in the retrieved context? The most important metric for hallucination detection. Low faithfulness = model is making things up beyond the provided context.

Answer Relevance

Does the generated answer actually address the user's question? A faithful answer can still be off-topic if the retriever surfaced adjacent but non-responsive documents.

from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Build an evaluation dataset
# Each row: question, ground_truth_answer, contexts, generated_answer
eval_data = Dataset.from_list([
    {
        "question": "What is the retry policy for the payment service?",
        "ground_truth": "The payment service retries up to 3 times...",
        "contexts": [chunk.page_content for chunk in top_chunks],
        "answer": generated_answer,
    },
    # ... more examples
])

results = evaluate(
    eval_data,
    metrics=[
        context_recall,
        context_precision,
        faithfulness,
        answer_relevancy,
    ],
)

print(results)
# {'context_recall': 0.87, 'context_precision': 0.74,
#  'faithfulness': 0.91, 'answer_relevancy': 0.88}

Note

Build your evaluation dataset from real user queries and their correct answers. Start with 50-100 curated (question, ground-truth) pairs. As you collect production queries, label the ones that went wrong — these become your regression tests. A dataset of 200 labeled examples is enough to detect most regressions reliably.

Production Hardening

Several patterns separate reliable production RAG from a working prototype.

Metadata Filtering

Pure semantic search retrieves based on meaning alone and ignores document boundaries like access control, time ranges, or tenant scope. Add structured metadata filters to every retrieval query: only return chunks from documents the user is authorized to see, within the relevant date range, belonging to the correct tenant. Vector databases support this as pre-filtering (applied before ANN search) or post-filtering.
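In pure-Python terms, the filter is just a predicate over chunk metadata evaluated before similarity scoring — the field names here (`tenant`, `doc_date`) are illustrative, not a standard schema:

```python
from datetime import date

def allowed(chunk_meta, *, user_tenants, newer_than):
    """Structured pre-filter of the kind a vector DB applies before
    ANN search: tenant scope plus a date cutoff."""
    return (
        chunk_meta["tenant"] in user_tenants
        and chunk_meta["doc_date"] >= newer_than
    )

# Only chunks passing the filter become candidates for similarity search
meta = {"tenant": "acme", "doc_date": date(2026, 2, 10)}
ok = allowed(meta, user_tenants={"acme"}, newer_than=date(2026, 1, 1))
```

Real vector databases express the same predicate declaratively (e.g. Qdrant's `Filter`/`FieldCondition` objects) so it can be pushed into the ANN index rather than applied after the fact.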

Incremental Indexing

Documents change. Your index must reflect that. Track a content_hash per document. On each ingestion run, compare hashes and re-embed only changed documents. Delete chunks whose source documents were removed. Full re-indexing on every run does not scale past a few thousand documents.
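The diffing logic is straightforward. A sketch of the hash comparison step — the surrounding embed/delete calls against your vector store are omitted:

```python
import hashlib

def plan_reindex(current_docs, index_state):
    """Diff document content hashes against the stored index state.

    current_docs: {doc_id: text}; index_state: {doc_id: content_hash},
    mutated in place. Returns (doc_ids to re-embed, doc_ids to delete).
    """
    to_embed = []
    for doc_id, text in current_docs.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != h:
            to_embed.append(doc_id)     # new or changed document
            index_state[doc_id] = h
    to_delete = [d for d in index_state if d not in current_docs]
    for d in to_delete:
        del index_state[d]              # source document was removed
    return to_embed, to_delete
```

Persist `index_state` alongside the vector store so ingestion runs stay cheap: unchanged documents cost one hash, not one embedding call.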

Fallback and Graceful Degradation

When retrieval returns low-confidence results (all similarity scores below a threshold), it is better to surface a “I couldn't find relevant information” response than to hallucinate from poor context. Set a minimum similarity threshold and treat sub-threshold results as a retrieval miss. Log these misses — they reveal gaps in your document coverage.
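A minimal sketch of the threshold gate — the cutoff value is illustrative and should be tuned on your own eval set, and `search` is assumed to return `(score, chunk)` pairs:

```python
MIN_SIMILARITY = 0.35   # illustrative — calibrate against labeled queries

def retrieve_or_miss(query, search, threshold=MIN_SIMILARITY):
    """Return chunks above the similarity threshold, or None on a miss.

    A None result should map to an honest "couldn't find relevant
    information" response (and a logged miss for coverage analysis),
    never to generation from weak context.
    """
    hits = [(score, chunk) for score, chunk in search(query)
            if score >= threshold]
    return [chunk for _, chunk in hits] or None
```

The miss log is as valuable as the answers: clusters of sub-threshold queries point directly at documents you have not ingested yet.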

Caching

Semantic caching caches LLM responses for queries that are semantically similar to previously answered questions. GPTCache and Redis with vector search both support this. For a customer support RAG system, 40-60% of queries are paraphrases of previously answered questions — semantic caching eliminates both retrieval and LLM latency for these.
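The core mechanism fits in a few lines: embed each answered query, and on a new query return the cached answer if its embedding lands close enough to a previous one. A toy sketch with linear scan (production systems use a vector index instead; the 0.92 threshold is illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """Return a cached answer when a new query embeds close enough to
    one answered before."""
    def __init__(self, embed, threshold=0.92):
        self.embed, self.threshold = embed, threshold
        self.entries = []            # (query_embedding, answer)

    def get(self, query):
        qv = self.embed(query)
        for ev, answer in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return answer        # paraphrase hit: skip retrieval + LLM
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Tune the threshold carefully: too loose and users get answers to the wrong question, too tight and the hit rate collapses.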

Beyond Naive RAG — Agentic Patterns

The comprehensive RAG survey (Gao et al., 2023) identifies three generations of RAG: naive RAG (what most teams ship), advanced RAG (the techniques covered above), and modular RAG (composable retrieval-augmentation as agent tools).

In the agentic pattern, the LLM decides whether to retrieve, what to retrieve, and how many times to retrieve in a loop. Self-RAG trains the model to generate reflection tokens that control retrieval. FLARE proactively retrieves when the model's next-token confidence falls below a threshold. These patterns require more infrastructure but handle complex multi-hop reasoning that a single-pass retrieve-then-generate pipeline cannot.
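Stripped of framework machinery, the control flow is a bounded loop. A sketch where `decide` stands in for the LLM — given the question and the context gathered so far, it returns either `("search", sub_query)` or `("answer", text)`:

```python
def agentic_answer(question, decide, retrieve, max_steps=3):
    """Minimal retrieve-in-a-loop sketch: the model chooses between
    searching again and answering, within a fixed step budget."""
    context = []
    for _ in range(max_steps):
        kind, payload = decide(question, context)
        if kind == "answer":
            return payload
        context.extend(retrieve(payload))       # multi-hop: gather more
    # Step budget exhausted — answer from whatever was gathered
    return decide(question, context)[1]
```

The `max_steps` budget is what keeps latency and token spend bounded; without it, a confused model can loop on searches indefinitely.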

For most teams, the path is: ship advanced RAG first (this guide), measure, understand where it fails, then selectively add agentic patterns for the failure modes that matter most. Agentic complexity has real costs in latency, token spend, and debuggability. Earn the complexity by proving it solves a real measured gap.

Building a RAG pipeline or improving retrieval quality for your AI product?

We help teams design and implement production-grade RAG systems — from chunking strategies and hybrid search to reranking, evaluation, and agentic retrieval patterns. Let’s talk.

Get in Touch
