Retrieval-Augmented Generation (RAG): Complete Implementation Guide 2025

Master RAG architecture for accurate, grounded AI responses. Step-by-step implementation with vector databases, embedding strategies, and real-world examples.

Introduction: Solving the Knowledge Problem

Large Language Models possess impressive capabilities, but they suffer from a fundamental limitation: their knowledge freezes at training time. Ask GPT-4 about events from last week, your company’s internal documents, or specialized domain knowledge not in its training data, and you’ll get hallucinations, outdated information, or generic responses that miss crucial context.

Retrieval-Augmented Generation (RAG) solves this problem elegantly. Instead of relying solely on the model’s parametric memory, RAG retrieves relevant information from external knowledge bases and provides it as context for generation. This architecture grounds model outputs in actual documents, dramatically reducing hallucinations while enabling AI systems to work with proprietary data, recent information, and specialized knowledge.

RAG has become the de facto standard for production LLM applications requiring factual accuracy. From customer support chatbots accessing internal documentation to research assistants synthesizing across papers to coding assistants referencing company codebases, RAG powers applications where reliability matters more than creativity.

This comprehensive guide walks through RAG implementation from first principles to production deployment. We’ll cover vector databases, embedding strategies, retrieval algorithms, context integration, and optimization techniques that separate proof-of-concept demos from robust production systems.

Understanding RAG Architecture

Before implementation, understanding the architecture clarifies design decisions.

The RAG Pipeline

User Query
    ↓
[1. Query Embedding]
    ↓
[2. Vector Search] ← Vector Database
    ↓
Retrieved Documents
    ↓
[3. Context Assembly]
    ↓
Prompt = System + Context + Query
    ↓
[4. LLM Generation]
    ↓
Response

Each stage critically affects system performance.

Component Breakdown

1. Document Processing Pipeline:

Raw Documents
    ↓
[Chunking] → Split into manageable pieces
    ↓
[Embedding] → Convert to vectors
    ↓
[Storage] → Save in vector database

2. Query Processing Pipeline:

User Query
    ↓
[Query Enhancement] → Expand, clarify, or rewrite
    ↓
[Embedding] → Convert to vector
    ↓
[Retrieval] → Find similar documents
    ↓
[Ranking] → Order by relevance
    ↓
[Context Assembly] → Format for LLM

3. Generation Pipeline:

Retrieved Context + Query
    ↓
[Prompt Construction] → Build comprehensive prompt
    ↓
[LLM Generation] → Generate response
    ↓
[Citation Addition] → Add source references
    ↓
[Validation] → Verify against sources

Why RAG Works

Grounding: Responses grounded in retrieved documents dramatically reduce hallucination rates (commonly cited figures drop from roughly 15% to 2-3%, though the exact numbers vary by domain and setup).

Freshness: External knowledge base updates immediately without model retraining.

Attribution: Citations enable verification and trust.

Domain Adaptation: Works with specialized knowledge without fine-tuning.

Cost Efficiency: Cheaper than fine-tuning for knowledge updates.

Implementation Step 1: Document Processing

Document processing determines retrieval quality.

Chunking Strategies

Fixed-Size Chunking (Simplest):

def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    import tiktoken
    
    encoder = tiktoken.get_encoding("cl100k_base")
    tokens = encoder.encode(text)
    chunks = []
    
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunk_text = encoder.decode(chunk_tokens)
        chunks.append(chunk_text)
    
    return chunks

# Usage
text = load_document("company_handbook.pdf")
chunks = chunk_by_tokens(text, chunk_size=512, overlap=50)

Semantic Chunking (Better):

def chunk_by_semantics(text: str, max_chunk_size: int = 512) -> list[str]:
    """Split at natural boundaries (paragraphs, sections)."""
    # Split into paragraphs
    paragraphs = text.split('\n\n')
    
    chunks = []
    current_chunk = []
    current_length = 0
    
    for para in paragraphs:
        para_length = len(para.split())  # rough length in words, not tokens
        
        if current_length + para_length > max_chunk_size and current_chunk:
            # Save current chunk
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = []
            current_length = 0
        
        current_chunk.append(para)
        current_length += para_length
    
    # Add remaining
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    
    return chunks

Hierarchical Chunking (Most sophisticated):

class HierarchicalChunker:
    def __init__(self):
        # Target chunk sizes (tokens) per hierarchy level; use these to further
        # split oversized sections or subsections (that splitting is not shown here)
        self.chunk_sizes = {
            'section': 2048,
            'subsection': 512,
            'paragraph': 128
        }
    
    def chunk_document(self, document: dict) -> list[dict]:
        """Create hierarchical chunks with metadata."""
        chunks = []
        
        for section in document['sections']:
            # Section-level chunk
            section_chunk = {
                'text': section['content'],
                'metadata': {
                    'type': 'section',
                    'title': section['title'],
                    'document': document['id']
                }
            }
            chunks.append(section_chunk)
            
            # Subsection chunks
            for subsection in section.get('subsections', []):
                subsection_chunk = {
                    'text': subsection['content'],
                    'metadata': {
                        'type': 'subsection',
                        'title': subsection['title'],
                        'parent': section['title'],
                        'document': document['id']
                    }
                }
                chunks.append(subsection_chunk)
        
        return chunks

Chunk Size Considerations

Small Chunks (128-256 tokens):

  • ✅ Precise retrieval
  • ✅ Lower noise in context
  • ❌ May lack sufficient context
  • ❌ More chunks to manage

Medium Chunks (512-1024 tokens):

  • ✅ Balanced precision and context
  • ✅ Good for most use cases
  • ❌ May include some irrelevant content

Large Chunks (2048+ tokens):

  • ✅ Maximum context preservation
  • ✅ Fewer database entries
  • ❌ Lower retrieval precision
  • ❌ More token cost

Recommendation: Start with 512 tokens, adjust based on your domain.

Metadata Enrichment

class MetadataEnricher:
    def __init__(self):
        # DateParser and EntityExtractor stand in for whatever date-parsing and
        # NER components you use (e.g., dateutil, spaCy); they are not defined here
        self.date_parser = DateParser()
        self.entity_extractor = EntityExtractor()
    
    def enrich_chunk(self, chunk: str, source_doc: dict) -> dict:
        """Add valuable metadata to chunks."""
        metadata = {
            # Source information
            'document_id': source_doc['id'],
            'document_title': source_doc['title'],
            'document_type': source_doc['type'],
            'source_url': source_doc.get('url'),
            
            # Temporal information
            'created_at': source_doc['created_at'],
            'updated_at': source_doc.get('updated_at'),
            'published_at': source_doc.get('published_at'),
            
            # Content characteristics
            'chunk_length': len(chunk),
            'language': self.detect_language(chunk),
            
            # Extracted entities
            'entities': self.entity_extractor.extract(chunk),
            'keywords': self.extract_keywords(chunk),
            
            # Hierarchical context
            'section': source_doc.get('section'),
            'subsection': source_doc.get('subsection'),
        }
        
        return {
            'text': chunk,
            'metadata': metadata
        }

Implementation Step 2: Embedding and Vector Storage

Embeddings convert text to numerical vectors that capture semantic meaning.

Embedding Models Comparison

OpenAI text-embedding-3-large:

  • Dimensions: 3072 (configurable down to 256)
  • Cost: $0.13 per 1M tokens
  • Quality: Excellent for general use
  • Best for: Most production applications

OpenAI text-embedding-3-small:

  • Dimensions: 1536
  • Cost: $0.02 per 1M tokens
  • Quality: Good, cost-effective
  • Best for: Budget-conscious applications

Cohere embed-english-v3.0:

  • Dimensions: 1024
  • Cost: $0.10 per 1M tokens
  • Quality: Strong multilingual support
  • Best for: International applications

Open-source alternatives:

  • sentence-transformers (free, self-hosted; see the sketch below)
  • Best for: Privacy-sensitive or offline applications
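
For the self-hosted route, a minimal sentence-transformers sketch might look like this (all-MiniLM-L6-v2 is just one common choice; pick whatever model fits your domain and languages):

from sentence_transformers import SentenceTransformer

# Load a local embedding model once and reuse it; this one produces 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(
    ["chunk one text", "chunk two text"],
    normalize_embeddings=True  # unit-length vectors, so dot product equals cosine similarity
)
print(embeddings.shape)  # (2, 384)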

Embedding Implementation

from openai import OpenAI
import numpy as np

class EmbeddingService:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = OpenAI()
        self.model = model
        self.dimension = 1536  # for text-embedding-3-small
    
    def embed_text(self, text: str) -> np.ndarray:
        """Generate embedding for text."""
        response = self.client.embeddings.create(
            model=self.model,
            input=text
        )
        return np.array(response.data[0].embedding)
    
    def embed_batch(self, texts: list[str], batch_size: int = 100) -> list[np.ndarray]:
        """Generate embeddings for multiple texts efficiently."""
        embeddings = []
        
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            
            response = self.client.embeddings.create(
                model=self.model,
                input=batch
            )
            
            batch_embeddings = [
                np.array(item.embedding) 
                for item in response.data
            ]
            embeddings.extend(batch_embeddings)
        
        return embeddings

Vector Database Selection

Pinecone (Managed):

import pinecone

# Initialize
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")

# Create index
pinecone.create_index(
    name="knowledge-base",
    dimension=1536,
    metric="cosine"
)

# Get index
index = pinecone.Index("knowledge-base")

# Upsert vectors
index.upsert(vectors=[
    ("id1", embedding1, {"text": "chunk1", "source": "doc1"}),
    ("id2", embedding2, {"text": "chunk2", "source": "doc1"}),
])

# Query
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
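
The snippet above uses the legacy pinecone-client (v2) module-level API. With the current Pinecone Python SDK, setup goes through a client object instead; roughly like this (a sketch, check the SDK docs for your version):

from pinecone import Pinecone, ServerlessSpec

# Client object replaces the old module-level pinecone.init()
pc = Pinecone(api_key="your-api-key")

pc.create_index(
    name="knowledge-base",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")  # serverless index placement
)

index = pc.Index("knowledge-base")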

Weaviate (Open-source, self-hosted):

import weaviate

# Initialize (weaviate-client v3 API; the newer v4 client uses a collection-based interface)
client = weaviate.Client("http://localhost:8080")

# Create schema
schema = {
    "class": "Document",
    "vectorizer": "none",  # We'll provide vectors
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]},
        {"name": "created_at", "dataType": ["date"]}
    ]
}
client.schema.create_class(schema)

# Add data
client.data_object.create(
    data_object={
        "text": "chunk text",
        "source": "document.pdf"
    },
    class_name="Document",
    vector=embedding
)

# Query
results = client.query.get("Document", ["text", "source"]) \
    .with_near_vector({"vector": query_embedding}) \
    .with_limit(5) \
    .do()

ChromaDB (Simple, embedded):

import chromadb

# Initialize
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="knowledge_base")

# Add documents
collection.add(
    embeddings=[embedding1, embedding2],
    documents=["chunk1 text", "chunk2 text"],
    metadatas=[{"source": "doc1"}, {"source": "doc1"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)

Database Comparison

Database | Hosted | Scale | Features | Best For
---------|--------|-------|----------|---------
Pinecone | Yes | Massive | Managed, reliable | Production at scale
Weaviate | Self/Cloud | Large | GraphQL, hybrid search | Flexibility
ChromaDB | No | Medium | Simple API, embedded | Prototypes, local dev
Qdrant | Self/Cloud | Large | Rust, fast | Performance critical
Milvus | Self/Cloud | Massive | Distributed, scalable | Enterprise scale

Implementation Step 3: Retrieval Strategies

Retrieval quality determines RAG performance.

Basic Vector Search

class BasicRetriever:
    def __init__(self, vector_db, embedding_service):
        self.vector_db = vector_db
        self.embedding_service = embedding_service
    
    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve most relevant documents."""
        # Embed query
        query_embedding = self.embedding_service.embed_text(query)
        
        # Search vector database
        results = self.vector_db.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
        
        return [
            {
                'text': match['metadata']['text'],
                'score': match['score'],
                'metadata': match['metadata']
            }
            for match in results['matches']
        ]

Hybrid Search (Vector + Keyword)

class HybridRetriever:
    def __init__(self, vector_db, elasticsearch_client, embedding_service):
        self.vector_db = vector_db
        self.es = elasticsearch_client
        self.embedding_service = embedding_service
    
    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        vector_weight: float = 0.7
    ) -> list[dict]:
        """Combine vector similarity and keyword matching."""
        # Vector search
        query_embedding = self.embedding_service.embed_text(query)
        vector_results = self.vector_db.query(
            vector=query_embedding,
            top_k=top_k * 2  # Get more candidates
        )
        
        # Keyword search
        keyword_results = self.es.search(
            index="documents",
            body={
                "query": {
                    "multi_match": {
                        "query": query,
                        "fields": ["text", "title"]
                    }
                }
            },
            size=top_k * 2
        )
        
        # Combine and re-rank
        combined = self.merge_results(
            vector_results,
            keyword_results,
            vector_weight
        )
        
        return combined[:top_k]
    
    def merge_results(
        self,
        vector_results,
        keyword_results,
        vector_weight
    ) -> list[dict]:
        """Merge and score results from both methods."""
        scores = {}
        
        # Score vector results
        for result in vector_results:
            doc_id = result['id']
            scores[doc_id] = {
                'vector_score': result['score'],
                'keyword_score': 0,
                'metadata': result['metadata']
            }
        
        # Add keyword scores
        for result in keyword_results:
            doc_id = result['_id']
            if doc_id in scores:
                scores[doc_id]['keyword_score'] = result['_score']
            else:
                scores[doc_id] = {
                    'vector_score': 0,
                    'keyword_score': result['_score'],
                    'metadata': result['_source']
                }
        
        # Calculate combined score
        # (Elasticsearch BM25 scores are unbounded while cosine similarity is
        #  bounded; normalize both, e.g. min-max, before weighting in production)
        for doc_id in scores:
            scores[doc_id]['final_score'] = (
                vector_weight * scores[doc_id]['vector_score'] +
                (1 - vector_weight) * scores[doc_id]['keyword_score']
            )
        
        # Sort by final score
        sorted_results = sorted(
            scores.items(),
            key=lambda x: x[1]['final_score'],
            reverse=True
        )
        
        return [
            {
                'id': doc_id,
                'score': result['final_score'],
                'metadata': result['metadata']
            }
            for doc_id, result in sorted_results
        ]

Query Enhancement

import json

class QueryEnhancer:
    def __init__(self, llm_client):
        self.llm = llm_client  # any client exposing a simple generate(prompt) -> str helper
    
    def enhance_query(self, query: str) -> list[str]:
        """Generate multiple query variations."""
        prompt = f"""Given this user query: "{query}"

Generate 3 alternative phrasings that might retrieve relevant documents:
1. A more specific version
2. A more general version  
3. A version using different terminology

Format as JSON array of strings."""
        
        response = self.llm.generate(prompt)
        variations = json.loads(response)
        
        return [query] + variations  # Include original
    
    def retrieve_with_enhancement(
        self,
        query: str,
        retriever,
        top_k: int = 5
    ) -> list[dict]:
        """Retrieve using multiple query variations."""
        query_variations = self.enhance_query(query)
        
        all_results = []
        for variation in query_variations:
            results = retriever.retrieve(variation, top_k=top_k)
            all_results.extend(results)
        
        # Deduplicate and re-rank
        return self.deduplicate_and_rank(all_results, top_k)
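
    def deduplicate_and_rank(self, results: list[dict], top_k: int) -> list[dict]:
        """Collapse duplicate chunks retrieved by different query variations."""
        # Minimal sketch (not the article's original helper): assumes each result
        # carries 'text' and 'score' keys, as BasicRetriever returns. Keep the
        # highest-scoring copy of each unique chunk, then return the top_k overall.
        best = {}
        for result in results:
            key = result['text']
            if key not in best or result['score'] > best[key]['score']:
                best[key] = result
        ranked = sorted(best.values(), key=lambda r: r['score'], reverse=True)
        return ranked[:top_k]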

Metadata Filtering

class FilteredRetriever:
    def __init__(self, vector_db, embedding_service):
        self.vector_db = vector_db
        self.embedding_service = embedding_service
    
    def retrieve_with_filters(
        self,
        query: str,
        filters: dict,
        top_k: int = 5
    ) -> list[dict]:
        """Retrieve with metadata filtering."""
        query_embedding = self.embedding_service.embed_text(query)
        
        # Build filter expression
        filter_expr = self.build_filter_expression(filters)
        
        results = self.vector_db.query(
            vector=query_embedding,
            top_k=top_k,
            filter=filter_expr,
            include_metadata=True
        )
        
        return results
    
    def build_filter_expression(self, filters: dict) -> dict:
        """Build database-specific filter expression."""
        # Example for Pinecone
        expressions = []
        
        if 'document_type' in filters:
            expressions.append({
                "document_type": {"$eq": filters['document_type']}
            })
        
        if 'date_range' in filters:
            expressions.append({
                "created_at": {
                    "$gte": filters['date_range']['start'],
                    "$lte": filters['date_range']['end']
                }
            })
        
        if 'source' in filters:
            expressions.append({
                "source": {"$in": filters['source']}
            })
        
        if len(expressions) == 1:
            return expressions[0]
        else:
            return {"$and": expressions}

# Usage
retriever = FilteredRetriever(vector_db, embedding_service)

results = retriever.retrieve_with_filters(
    query="How do we handle refunds?",
    filters={
        'document_type': 'policy',
        'date_range': {
            'start': '2024-01-01',
            'end': '2024-12-31'
        }
    },
    top_k=5
)

Implementation Step 4: Context Assembly and Prompting

Retrieved documents must be formatted effectively for the LLM.

Context Formatting

class ContextAssembler:
    def __init__(self, max_context_length: int = 4000):
        self.max_context_length = max_context_length
    
    def assemble_context(
        self,
        query: str,
        retrieved_docs: list[dict],
        include_metadata: bool = True
    ) -> str:
        """Assemble retrieved documents into coherent context."""
        context_parts = []
        current_length = 0
        
        for i, doc in enumerate(retrieved_docs, 1):
            # Format document
            doc_text = self.format_document(doc, i, include_metadata)
            doc_length = len(doc_text.split())
            
            # Check if we have room
            if current_length + doc_length > self.max_context_length:
                break
            
            context_parts.append(doc_text)
            current_length += doc_length
        
        return "\n\n---\n\n".join(context_parts)
    
    def format_document(
        self,
        doc: dict,
        index: int,
        include_metadata: bool
    ) -> str:
        """Format single document with metadata."""
        parts = [f"[Document {index}]"]
        
        if include_metadata and 'metadata' in doc:
            metadata = doc['metadata']
            if 'title' in metadata:
                parts.append(f"Title: {metadata['title']}")
            if 'source' in metadata:
                parts.append(f"Source: {metadata['source']}")
            if 'created_at' in metadata:
                parts.append(f"Date: {metadata['created_at']}")
        
        parts.append(f"\n{doc['text']}")
        
        return "\n".join(parts)

RAG Prompt Templates

class RAGPromptBuilder:
    def __init__(self):
        self.system_template = """You are a helpful assistant that answers questions based on provided context.

CRITICAL RULES:
1. Answer ONLY using information from the provided documents
2. If the documents don't contain relevant information, say so clearly
3. Cite document numbers when making claims [Document X]
4. Do not make assumptions or add information not in the documents
5. If documents contradict each other, acknowledge this

Context documents:
{context}"""
        
        self.user_template = """Based on the documents provided, please answer this question:

{query}

Remember to cite which documents support your answer."""
    
    def build_messages(self, query: str, context: str) -> list[dict]:
        """Build messages for chat completion."""
        return [
            {
                "role": "system",
                "content": self.system_template.format(context=context)
            },
            {
                "role": "user",
                "content": self.user_template.format(query=query)
            }
        ]

Complete RAG Pipeline

class RAGSystem:
    def __init__(
        self,
        vector_db,
        embedding_service,
        llm_client,
        retriever_type: str = 'hybrid',
        es_client=None  # Elasticsearch client, required only for hybrid retrieval
    ):
        self.embedding_service = embedding_service
        self.llm = llm_client
        
        if retriever_type == 'hybrid':
            self.retriever = HybridRetriever(vector_db, es_client, embedding_service)
        else:
            self.retriever = BasicRetriever(vector_db, embedding_service)
        
        self.context_assembler = ContextAssembler()
        self.prompt_builder = RAGPromptBuilder()
    
    def query(
        self,
        question: str,
        top_k: int = 5,
        filters: dict = None
    ) -> dict:
        """Complete RAG query pipeline."""
        # Step 1: Retrieve relevant documents
        retrieved_docs = self.retriever.retrieve(
            query=question,
            top_k=top_k,
            filters=filters
        )
        
        if not retrieved_docs:
            return {
                'answer': "I don't have any relevant information to answer this question.",
                'sources': [],
                'confidence': 0.0
            }
        
        # Step 2: Assemble context
        context = self.context_assembler.assemble_context(
            query=question,
            retrieved_docs=retrieved_docs
        )
        
        # Step 3: Build prompt
        messages = self.prompt_builder.build_messages(question, context)
        
        # Step 4: Generate response
        response = self.llm.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=messages,
            temperature=0.1  # Low temperature for factual accuracy
        )
        
        answer = response.choices[0].message.content
        
        # Step 5: Extract citations and validate
        citations = self.extract_citations(answer)
        validated_answer = self.validate_answer(answer, retrieved_docs)
        
        return {
            'answer': answer,
            'sources': [
                {
                    'text': doc['text'][:200] + '...',
                    'metadata': doc['metadata'],
                    'score': doc['score']
                }
                for doc in retrieved_docs
            ],
            'citations': citations,
            'validated': validated_answer,
            'retrieval_count': len(retrieved_docs)
        }
    
    def extract_citations(self, answer: str) -> list[int]:
        """Extract document citations from answer."""
        import re
        citations = re.findall(r'\[Document (\d+)\]', answer)
        return [int(c) for c in citations]
    
    def validate_answer(
        self,
        answer: str,
        retrieved_docs: list[dict]
    ) -> bool:
        """Check if answer is grounded in retrieved docs."""
        # Simple validation: check if key phrases from answer appear in docs
        answer_phrases = self.extract_key_phrases(answer)
        
        found_phrases = 0
        for phrase in answer_phrases:
            for doc in retrieved_docs:
                if phrase.lower() in doc['text'].lower():
                    found_phrases += 1
                    break
        
        # Consider valid if >50% of key phrases found in docs
        return found_phrases / len(answer_phrases) > 0.5 if answer_phrases else False
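
    def extract_key_phrases(self, answer: str) -> list[str]:
        """Pull simple candidate phrases from the answer for the grounding check."""
        # Minimal sketch (a placeholder, not the original helper): strip citation
        # markers and use word 3-grams, so the substring check above has a
        # realistic chance of matching verbatim text from the source chunks.
        import re
        text = re.sub(r'\[Document \d+\]', '', answer)
        words = re.findall(r'\w+', text)
        return [' '.join(words[i:i + 3]) for i in range(0, max(len(words) - 2, 0), 3)]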

# Usage
rag_system = RAGSystem(
    vector_db=pinecone_index,
    embedding_service=embedding_service,
    llm_client=openai_client,
    retriever_type='hybrid',
    es_client=es_client  # Elasticsearch client used by the hybrid retriever
)

result = rag_system.query(
    question="What is our company's remote work policy?",
    top_k=5,
    filters={'document_type': 'policy'}
)

print(result['answer'])
print(f"\nSources: {len(result['sources'])}")
print(f"Validated: {result['validated']}")

Advanced RAG Techniques

Iterative Retrieval

class IterativeRAG:
    def __init__(self, rag_system, max_iterations: int = 3):
        self.rag = rag_system
        self.max_iterations = max_iterations
    
    def query_iterative(self, question: str) -> dict:
        """Retrieve additional context if initial answer is insufficient."""
        all_sources = []
        
        for iteration in range(self.max_iterations):
            result = self.rag.query(question, top_k=5)
            
            # Check if answer is sufficient
            if self.is_sufficient(result['answer']):
                return result
            
            # Generate follow-up query for additional context
            follow_up = self.generate_followup_query(
                question,
                result['answer'],
                iteration
            )
            
            # Retrieve additional documents
            additional_docs = self.rag.retriever.retrieve(follow_up, top_k=3)
            all_sources.extend(additional_docs)
        
        # Final generation with all accumulated sources
        return self.rag.query(question, top_k=len(all_sources))
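
    def is_sufficient(self, answer: str) -> bool:
        """Heuristic check for whether the answer actually addresses the question."""
        # Minimal sketch (placeholder logic, not the article's original helper):
        # treat explicit refusals and very short responses as insufficient.
        refusals = ["don't have", "no relevant information", "cannot answer"]
        return len(answer.split()) > 20 and not any(r in answer.lower() for r in refusals)

    def generate_followup_query(self, question: str, partial_answer: str, iteration: int) -> str:
        """Ask the LLM which information is still missing and get a new search query."""
        # Sketch; assumes self.rag.llm is an OpenAI-style chat client, as in RAGSystem.
        prompt = (
            f"Original question: {question}\n"
            f"Partial answer so far: {partial_answer}\n\n"
            "Write one short search query for the information that is still missing."
        )
        response = self.rag.llm.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        return response.choices[0].message.content.strip()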

Parent Document Retrieval

class ParentDocumentRetriever:
    """Retrieve small chunks but provide larger parent context to LLM."""
    
    def __init__(self, vector_db, document_store):
        self.vector_db = vector_db  # Stores small chunks
        self.document_store = document_store  # Stores full documents
    
    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve chunks but return parent documents."""
        # Retrieve small, precise chunks (assumes the vector store accepts raw
        # query text; otherwise embed the query first, as BasicRetriever does)
        chunk_results = self.vector_db.query(query, top_k=top_k)
        
        # Get parent documents
        parent_docs = []
        seen_parents = set()
        
        for chunk in chunk_results:
            parent_id = chunk['metadata']['parent_document_id']
            
            if parent_id not in seen_parents:
                parent_doc = self.document_store.get(parent_id)
                parent_docs.append(parent_doc)
                seen_parents.add(parent_id)
        
        return parent_docs

Re-ranking

from sentence_transformers import CrossEncoder

class ReRanker:
    def __init__(self):
        self.model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    def rerank(
        self,
        query: str,
        documents: list[dict],
        top_k: int = 5
    ) -> list[dict]:
        """Re-rank retrieved documents using cross-encoder."""
        # Prepare query-document pairs
        pairs = [(query, doc['text']) for doc in documents]
        
        # Score pairs
        scores = self.model.predict(pairs)
        
        # Attach scores and sort
        for doc, score in zip(documents, scores):
            doc['rerank_score'] = float(score)
        
        reranked = sorted(
            documents,
            key=lambda x: x['rerank_score'],
            reverse=True
        )
        
        return reranked[:top_k]
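
In practice the cross-encoder sits between first-stage retrieval and context assembly: over-retrieve cheap vector-search candidates, then keep only the best few. A usage sketch, reusing the retriever from the earlier sections:

# Cast a wide net first, then let the cross-encoder pick what actually answers the question
candidates = retriever.retrieve("How do we handle refunds?", top_k=20)

reranker = ReRanker()
top_docs = reranker.rerank("How do we handle refunds?", candidates, top_k=5)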

Evaluation and Optimization

Retrieval Quality Metrics

import numpy as np

class RAGEvaluator:
    def __init__(self):
        # load_test_queries() stands in for loading labeled query -> relevant-doc-ID
        # pairs for your own corpus; it is not defined in this guide
        self.test_queries = load_test_queries()
    
    def evaluate_retrieval(self, retriever) -> dict:
        """Evaluate retrieval quality."""
        metrics = {
            'precision_at_k': [],
            'recall_at_k': [],
            'mrr': []  # Mean Reciprocal Rank
        }
        
        for test_case in self.test_queries:
            query = test_case['query']
            relevant_docs = set(test_case['relevant_doc_ids'])
            
            # Retrieve
            results = retriever.retrieve(query, top_k=10)
            retrieved_ids = [r['id'] for r in results]
            
            # Calculate metrics
            metrics['precision_at_k'].append(
                self.precision_at_k(retrieved_ids, relevant_docs, k=5)
            )
            metrics['recall_at_k'].append(
                self.recall_at_k(retrieved_ids, relevant_docs, k=10)
            )
            metrics['mrr'].append(
                self.mean_reciprocal_rank(retrieved_ids, relevant_docs)
            )
        
        return {
            'precision@5': np.mean(metrics['precision_at_k']),
            'recall@10': np.mean(metrics['recall_at_k']),
            'mrr': np.mean(metrics['mrr'])
        }
    
    def precision_at_k(self, retrieved: list, relevant: set, k: int) -> float:
        """Precision at K."""
        retrieved_k = retrieved[:k]
        relevant_retrieved = sum(1 for doc in retrieved_k if doc in relevant)
        return relevant_retrieved / k
    
    def recall_at_k(self, retrieved: list, relevant: set, k: int) -> float:
        """Recall at K."""
        retrieved_k = set(retrieved[:k])
        relevant_retrieved = len(retrieved_k & relevant)
        return relevant_retrieved / len(relevant) if relevant else 0
    
    def mean_reciprocal_rank(self, retrieved: list, relevant: set) -> float:
        """Mean Reciprocal Rank."""
        for i, doc_id in enumerate(retrieved, 1):
            if doc_id in relevant:
                return 1.0 / i
        return 0.0

End-to-End Quality

from openai import OpenAI

class EndToEndEvaluator:
    def __init__(self, rag_system):
        self.rag = rag_system
        self.llm_judge = OpenAI()  # LLM-as-judge client for automated evaluation
    
    def evaluate_answer_quality(
        self,
        question: str,
        generated_answer: str,
        ground_truth: str
    ) -> dict:
        """Evaluate answer quality using LLM-as-judge."""
        judge_prompt = f"""Evaluate this Q&A pair:

Question: {question}

Ground Truth Answer: {ground_truth}

Generated Answer: {generated_answer}

Rate on these dimensions (1-5 scale):
1. Accuracy: How factually correct is the answer?
2. Completeness: Does it fully answer the question?
3. Relevance: Is the answer on-topic?
4. Citation: Are sources properly cited?

Provide scores and brief justification."""
        
        evaluation = self.llm_judge.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": judge_prompt}]
        )
        
        # Parse evaluation
        scores = self.parse_scores(evaluation.choices[0].message.content)
        
        return scores
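
    def parse_scores(self, evaluation_text: str) -> dict:
        """Extract the 1-5 ratings from the judge's free-text response."""
        # Minimal sketch (a placeholder): regex out "<dimension> ... <digit>" pairs.
        # A production setup would ask the judge model for structured JSON instead.
        import re
        scores = {}
        for dimension in ['Accuracy', 'Completeness', 'Relevance', 'Citation']:
            match = re.search(rf'{dimension}\D*([1-5])', evaluation_text)
            if match:
                scores[dimension.lower()] = int(match.group(1))
        return scores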

Production Best Practices

Document Update Strategy

import logging

logger = logging.getLogger(__name__)

class DocumentUpdater:
    def __init__(self, vector_db, embedding_service):
        self.vector_db = vector_db
        self.embedding_service = embedding_service
    
    def update_document(self, document_id: str, new_content: str):
        """Update document in knowledge base."""
        # Delete old chunks
        self.vector_db.delete(filter={'document_id': document_id})
        
        # Chunk new content
        chunks = chunk_by_semantics(new_content)
        
        # Generate embeddings
        embeddings = self.embedding_service.embed_batch(chunks)
        
        # Insert new chunks
        vectors = [
            (
                f"{document_id}_chunk_{i}",
                embedding,
                {'text': chunk, 'document_id': document_id}
            )
            for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
        ]
        
        self.vector_db.upsert(vectors)
        
        logger.info(f"Updated document {document_id} with {len(chunks)} chunks")

Monitoring

from prometheus_client import Counter, Histogram

class RAGMonitor:
    def __init__(self):
        # Prometheus metrics (assumes prometheus_client; metric names are illustrative)
        self.metrics = {
            'queries_total': Counter('rag_queries_total', 'Total RAG queries'),
            'retrieval_latency': Histogram('rag_retrieval_latency_seconds', 'Retrieval latency'),
            'generation_latency': Histogram('rag_generation_latency_seconds', 'Generation latency'),
            'sources_retrieved': Histogram('rag_sources_retrieved', 'Documents retrieved per query'),
            'user_feedback': Counter('rag_user_feedback_total', 'User feedback', ['feedback'])
        }
    
    def track_query(self, query_data: dict):
        """Track query metrics."""
        self.metrics['queries_total'].inc()
        self.metrics['retrieval_latency'].observe(
            query_data['retrieval_time']
        )
        self.metrics['generation_latency'].observe(
            query_data['generation_time']
        )
        self.metrics['sources_retrieved'].observe(
            query_data['sources_count']
        )
    
    def track_feedback(self, helpful: bool):
        """Track user feedback."""
        label = 'helpful' if helpful else 'not_helpful'
        self.metrics['user_feedback'].labels(feedback=label).inc()

Conclusion: RAG as Production Standard

Retrieval-Augmented Generation transforms LLMs from impressive but unreliable systems into trustworthy production tools. By grounding responses in retrieved documents, RAG dramatically reduces hallucinations, enables work with proprietary data, and provides attribution that builds user trust.

Key implementation principles:

  1. Chunk strategically: Balance precision and context
  2. Choose embeddings wisely: Match model to use case and budget
  3. Implement hybrid search: Combine vector and keyword approaches
  4. Format context carefully: Help LLM extract relevant information
  5. Validate outputs: Verify answers against sources
  6. Monitor continuously: Track quality and iterate

RAG is not a silver bullet—it adds complexity, latency, and infrastructure requirements. But for applications requiring factual accuracy and specialized knowledge, it’s become the essential architecture.


Last Updated: December 2024
