Large Language Models possess impressive capabilities, but they suffer from a fundamental limitation: their knowledge freezes at training time. Ask GPT-4 about events from last week, your company’s internal documents, or specialized domain knowledge not in its training data, and you’ll get hallucinations, outdated information, or generic responses that miss crucial context.
Retrieval-Augmented Generation (RAG) solves this problem elegantly. Instead of relying solely on the model’s parametric memory, RAG retrieves relevant information from external knowledge bases and provides it as context for generation. This architecture grounds model outputs in actual documents, dramatically reducing hallucinations while enabling AI systems to work with proprietary data, recent information, and specialized knowledge.
RAG has become the de facto standard for production LLM applications that require factual accuracy. From customer support chatbots accessing internal documentation, to research assistants synthesizing findings across papers, to coding assistants referencing company codebases, RAG powers applications where reliability matters more than creativity.
This comprehensive guide walks through RAG implementation from first principles to production deployment. We’ll cover vector databases, embedding strategies, retrieval algorithms, context integration, and optimization techniques that separate proof-of-concept demos from robust production systems.
Before writing any code, it's worth understanding the architecture; it clarifies every design decision that follows.
```
User Query
    ↓
[1. Query Embedding]
    ↓
[2. Vector Search]  ←  Vector Database
    ↓
Retrieved Documents
    ↓
[3. Context Assembly]
    ↓
Prompt = System + Context + Query
    ↓
[4. LLM Generation]
    ↓
Response
```
Each stage critically affects system performance. Under the hood, the system decomposes into three pipelines.
1. Document Processing Pipeline:

```
Raw Documents
    ↓
[Chunking]  → Split into manageable pieces
    ↓
[Embedding] → Convert to vectors
    ↓
[Storage]   → Save in vector database
```
2. Query Processing Pipeline:

```
User Query
    ↓
[Query Enhancement] → Expand, clarify, or rewrite
    ↓
[Embedding]         → Convert to vector
    ↓
[Retrieval]         → Find similar documents
    ↓
[Ranking]           → Order by relevance
    ↓
[Context Assembly]  → Format for LLM
```
3. Generation Pipeline:

```
Retrieved Context + Query
    ↓
[Prompt Construction] → Build comprehensive prompt
    ↓
[LLM Generation]      → Generate response
    ↓
[Citation Addition]   → Add source references
    ↓
[Validation]          → Verify against sources
```
This architecture delivers concrete benefits:

- **Grounding:** Responses tied to retrieved documents sharply reduce hallucinations; reported rates drop from roughly 15% to 2-3%, though exact figures vary by domain and evaluation method.
- **Freshness:** Updating the external knowledge base takes effect immediately, with no model retraining.
- **Attribution:** Citations let users verify claims, which builds trust.
- **Domain Adaptation:** Works with specialized knowledge without fine-tuning.
- **Cost Efficiency:** Updating knowledge this way is far cheaper than fine-tuning.
Document processing determines retrieval quality.
Fixed-Size Chunking (Simplest):
```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token chunks with overlap."""
    encoder = tiktoken.get_encoding("cl100k_base")
    tokens = encoder.encode(text)
    chunks = []
    # Step by (chunk_size - overlap) so consecutive chunks share `overlap` tokens
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(encoder.decode(chunk_tokens))
    return chunks

# Usage (load_document is a stand-in for your own document loader)
text = load_document("company_handbook.pdf")
chunks = chunk_by_tokens(text, chunk_size=512, overlap=50)
```
Semantic Chunking (Better):
```python
def chunk_by_semantics(text: str, max_chunk_size: int = 512) -> list[str]:
    """Split at natural boundaries (paragraphs), packing up to max_chunk_size words."""
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_length = 0
    for para in paragraphs:
        # Word count as a cheap proxy for token count
        para_length = len(para.split())
        if current_length + para_length > max_chunk_size and current_chunk:
            # Save the current chunk and start a new one
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = []
            current_length = 0
        # Note: a single paragraph longer than max_chunk_size becomes its own oversized chunk
        current_chunk.append(para)
        current_length += para_length
    # Add whatever remains
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks
```
Hierarchical Chunking (Most sophisticated):
```python
class HierarchicalChunker:
    def __init__(self):
        self.chunk_sizes = {
            'section': 2048,
            'subsection': 512,
            'paragraph': 128
        }

    def chunk_document(self, document: dict) -> list[dict]:
        """Create hierarchical chunks with metadata."""
        chunks = []
        for section in document['sections']:
            # Section-level chunk
            section_chunk = {
                'text': section['content'],
                'metadata': {
                    'type': 'section',
                    'title': section['title'],
                    'document': document['id']
                }
            }
            chunks.append(section_chunk)
            # Subsection chunks
            for subsection in section.get('subsections', []):
                subsection_chunk = {
                    'text': subsection['content'],
                    'metadata': {
                        'type': 'subsection',
                        'title': subsection['title'],
                        'parent': section['title'],
                        'document': document['id']
                    }
                }
                chunks.append(subsection_chunk)
        return chunks
```
- **Small chunks (128-256 tokens):** precise retrieval with little noise, but each hit carries minimal surrounding context.
- **Medium chunks (512-1024 tokens):** a solid balance of retrieval precision and usable context for most corpora.
- **Large chunks (2048+ tokens):** preserve full context, but dilute similarity scores and consume prompt budget quickly.

Recommendation: Start with 512 tokens and adjust based on your domain; a quick empirical sweep (sketched below) beats guessing.
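Rather than guessing, you can sweep chunk sizes against a small labeled query set. A minimal sketch; `corpus`, `labeled_queries`, `build_index`, and `evaluate_recall` are hypothetical placeholders for your data, your indexing pipeline, and the evaluation code shown later in this guide:

```python
# Hypothetical chunk-size sweep. build_index() and evaluate_recall() are
# placeholders for your own indexing pipeline and evaluation harness.
for size in (128, 256, 512, 1024):
    chunks = chunk_by_tokens(corpus, chunk_size=size, overlap=size // 10)
    index = build_index(chunks)                       # embed + upsert (assumed)
    recall = evaluate_recall(index, labeled_queries)  # e.g., recall@5 (assumed)
    print(f"chunk_size={size}: recall@5={recall:.2f}")
```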
Metadata enrichment pays off at query time, enabling filtering and better citations:

```python
class MetadataEnricher:
    def __init__(self):
        # DateParser and EntityExtractor are placeholders for your own
        # implementations (e.g., dateutil for dates, spaCy for NER)
        self.date_parser = DateParser()
        self.entity_extractor = EntityExtractor()

    def enrich_chunk(self, chunk: str, source_doc: dict) -> dict:
        """Add valuable metadata to chunks."""
        metadata = {
            # Source information
            'document_id': source_doc['id'],
            'document_title': source_doc['title'],
            'document_type': source_doc['type'],
            'source_url': source_doc.get('url'),
            # Temporal information
            'created_at': source_doc['created_at'],
            'updated_at': source_doc.get('updated_at'),
            'published_at': source_doc.get('published_at'),
            # Content characteristics
            'chunk_length': len(chunk),
            'language': self.detect_language(chunk),          # helper not shown
            # Extracted entities
            'entities': self.entity_extractor.extract(chunk),
            'keywords': self.extract_keywords(chunk),         # helper not shown
            # Hierarchical context
            'section': source_doc.get('section'),
            'subsection': source_doc.get('subsection'),
        }
        return {
            'text': chunk,
            'metadata': metadata
        }
```
Embeddings convert text to numerical vectors that capture semantic meaning.
- **OpenAI text-embedding-3-large:** 3072 dimensions; the strongest retrieval quality in OpenAI's lineup, at a higher per-token price.
- **OpenAI text-embedding-3-small:** 1536 dimensions; nearly as capable at a fraction of the cost, making it a sensible default.
- **Cohere embed-english-v3.0:** 1024 dimensions; competitive retrieval quality, with an input-type hint that distinguishes queries from documents.
- **Open-source alternatives:** sentence-transformers models (e.g., all-MiniLM-L6-v2) and the BGE/E5 families; free to self-host, with quality that scales with model size.
A thin service wrapper keeps embedding calls consistent and batched:

```python
from openai import OpenAI
import numpy as np

class EmbeddingService:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = OpenAI()
        self.model = model
        self.dimension = 1536  # for text-embedding-3-small

    def embed_text(self, text: str) -> np.ndarray:
        """Generate an embedding for a single text."""
        response = self.client.embeddings.create(
            model=self.model,
            input=text
        )
        return np.array(response.data[0].embedding)

    def embed_batch(self, texts: list[str], batch_size: int = 100) -> list[np.ndarray]:
        """Generate embeddings for multiple texts efficiently."""
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = self.client.embeddings.create(
                model=self.model,
                input=batch
            )
            embeddings.extend(np.array(item.embedding) for item in response.data)
        return embeddings
```
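Retrieval compares these vectors with cosine similarity. OpenAI embeddings come back unit-normalized, so the dot product alone gives the same ranking, but the explicit form is a useful sanity check:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

service = EmbeddingService()
query_vec = service.embed_text("How do I reset my password?")
doc_vec = service.embed_text("To reset your password, click 'Forgot password' on the login page.")
print(cosine_similarity(query_vec, doc_vec))  # closer to 1.0 = more semantically similar
```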
Pinecone (Managed):
```python
from pinecone import Pinecone, ServerlessSpec

# Initialize the client (the old pinecone.init(...) API is deprecated)
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="knowledge-base",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Get an index handle
index = pc.Index("knowledge-base")

# Upsert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    ("id1", embedding1, {"text": "chunk1", "source": "doc1"}),
    ("id2", embedding2, {"text": "chunk2", "source": "doc1"}),
])

# Query
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
```
Weaviate (Open-source, self-hosted):
```python
import weaviate

# Uses the Weaviate Python client v3 API (v4 exposes a different interface)
client = weaviate.Client("http://localhost:8080")

# Create schema
schema = {
    "class": "Document",
    "vectorizer": "none",  # We'll provide vectors ourselves
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]},
        {"name": "created_at", "dataType": ["date"]}
    ]
}
client.schema.create_class(schema)

# Add data
client.data_object.create(
    data_object={
        "text": "chunk text",
        "source": "document.pdf"
    },
    class_name="Document",
    vector=embedding
)

# Query
results = client.query.get("Document", ["text", "source"]) \
    .with_near_vector({"vector": query_embedding}) \
    .with_limit(5) \
    .do()
```
ChromaDB (Simple, embedded):
```python
import chromadb

# Initialize
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="knowledge_base")

# Add documents
collection.add(
    embeddings=[embedding1, embedding2],
    documents=["chunk1 text", "chunk2 text"],
    metadatas=[{"source": "doc1"}, {"source": "doc1"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
```
| Database | Hosted | Scale | Features | Best For |
|---|---|---|---|---|
| Pinecone | Yes | Massive | Managed, reliable | Production at scale |
| Weaviate | Self/Cloud | Large | GraphQL, hybrid search | Flexibility |
| ChromaDB | No | Medium | Simple API, embedded | Prototypes, local dev |
| Qdrant | Self/Cloud | Large | Rust, fast | Performance critical |
| Milvus | Self/Cloud | Massive | Distributed, scalable | Enterprise scale |
Retrieval quality determines overall RAG performance. The baseline approach is pure vector similarity:
```python
class BasicRetriever:
    def __init__(self, vector_db, embedding_service):
        self.vector_db = vector_db
        self.embedding_service = embedding_service

    def retrieve(self, query: str, top_k: int = 5, filters: dict | None = None) -> list[dict]:
        """Retrieve the most relevant documents for a query."""
        # Embed the query
        query_embedding = self.embedding_service.embed_text(query)
        # Search the vector database (an optional metadata filter is passed through)
        results = self.vector_db.query(
            vector=query_embedding,
            top_k=top_k,
            filter=filters,
            include_metadata=True
        )
        return [
            {
                'text': match['metadata']['text'],
                'score': match['score'],
                'metadata': match['metadata']
            }
            for match in results['matches']
        ]
```
Hybrid search combines vector similarity with keyword (BM25) matching:

```python
class HybridRetriever:
    def __init__(self, vector_db, elasticsearch_client, embedding_service):
        self.vector_db = vector_db
        self.es = elasticsearch_client
        self.embedding_service = embedding_service

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        vector_weight: float = 0.7,
        filters: dict | None = None
    ) -> list[dict]:
        """Combine vector similarity and keyword matching."""
        # Vector search
        query_embedding = self.embedding_service.embed_text(query)
        vector_results = self.vector_db.query(
            vector=query_embedding,
            top_k=top_k * 2,  # Over-retrieve to give the merge more candidates
            filter=filters,   # metadata filter applied to the vector side only
            include_metadata=True
        )
        # Keyword search (BM25 via Elasticsearch)
        keyword_results = self.es.search(
            index="documents",
            body={
                "query": {
                    "multi_match": {
                        "query": query,
                        "fields": ["text", "title"]
                    }
                }
            },
            size=top_k * 2
        )
        # Combine and re-rank
        combined = self.merge_results(
            vector_results['matches'],
            keyword_results['hits']['hits'],
            vector_weight
        )
        return combined[:top_k]

    def merge_results(self, vector_results, keyword_results, vector_weight) -> list[dict]:
        """Merge and score results from both methods.

        Note: cosine and BM25 scores live on different scales; in practice,
        normalize each list (e.g., min-max) before the weighted sum, or use
        rank-based fusion (see the RRF sketch below).
        """
        scores = {}
        # Score vector results
        for result in vector_results:
            doc_id = result['id']
            scores[doc_id] = {
                'vector_score': result['score'],
                'keyword_score': 0,
                'metadata': result['metadata']
            }
        # Add keyword scores
        for result in keyword_results:
            doc_id = result['_id']
            if doc_id in scores:
                scores[doc_id]['keyword_score'] = result['_score']
            else:
                scores[doc_id] = {
                    'vector_score': 0,
                    'keyword_score': result['_score'],
                    'metadata': result['_source']
                }
        # Weighted combination
        for doc_id in scores:
            scores[doc_id]['final_score'] = (
                vector_weight * scores[doc_id]['vector_score'] +
                (1 - vector_weight) * scores[doc_id]['keyword_score']
            )
        # Sort by combined score
        sorted_results = sorted(
            scores.items(),
            key=lambda x: x[1]['final_score'],
            reverse=True
        )
        return [
            {
                'id': doc_id,
                'score': result['final_score'],
                'metadata': result['metadata']
            }
            for doc_id, result in sorted_results
        ]
```
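Weighted score merging assumes the two score scales are comparable, which cosine similarity and raw BM25 scores generally are not. Reciprocal Rank Fusion sidesteps the issue by combining ranks instead of scores. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs by rank, not raw score.
    k=60 is the conventional damping constant from the original RRF paper."""
    fused: dict[str, float] = {}
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Usage: fuse the ID lists from vector and keyword search
# top_ids = reciprocal_rank_fusion([vector_ids, keyword_ids])[:top_k]
```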
Query enhancement improves recall by searching with several phrasings of the same question:

```python
import json

class QueryEnhancer:
    def __init__(self, llm_client):
        self.llm = llm_client  # generic client exposing a .generate(prompt) method

    def enhance_query(self, query: str) -> list[str]:
        """Generate multiple query variations."""
        prompt = f"""Given this user query: "{query}"

Generate 3 alternative phrasings that might retrieve relevant documents:
1. A more specific version
2. A more general version
3. A version using different terminology

Format as JSON array of strings."""
        response = self.llm.generate(prompt)
        variations = json.loads(response)
        return [query] + variations  # Include the original

    def retrieve_with_enhancement(self, query: str, retriever, top_k: int = 5) -> list[dict]:
        """Retrieve using multiple query variations."""
        query_variations = self.enhance_query(query)
        all_results = []
        for variation in query_variations:
            results = retriever.retrieve(variation, top_k=top_k)
            all_results.extend(results)
        # Deduplicate and re-rank (helper sketched below)
        return self.deduplicate_and_rank(all_results, top_k)
```
Metadata filtering narrows retrieval to the right slice of the knowledge base:

```python
class FilteredRetriever:
    def __init__(self, vector_db, embedding_service):
        self.vector_db = vector_db
        self.embedding_service = embedding_service

    def retrieve_with_filters(self, query: str, filters: dict, top_k: int = 5) -> list[dict]:
        """Retrieve with metadata filtering."""
        query_embedding = self.embedding_service.embed_text(query)
        # Build a database-specific filter expression
        filter_expr = self.build_filter_expression(filters)
        results = self.vector_db.query(
            vector=query_embedding,
            top_k=top_k,
            filter=filter_expr,
            include_metadata=True
        )
        return results

    def build_filter_expression(self, filters: dict) -> dict:
        """Build a filter expression (Pinecone syntax shown)."""
        expressions = []
        if 'document_type' in filters:
            expressions.append({
                "document_type": {"$eq": filters['document_type']}
            })
        if 'date_range' in filters:
            expressions.append({
                "created_at": {
                    "$gte": filters['date_range']['start'],
                    "$lte": filters['date_range']['end']
                }
            })
        if 'source' in filters:
            expressions.append({
                "source": {"$in": filters['source']}
            })
        if len(expressions) == 1:
            return expressions[0]
        return {"$and": expressions}

# Usage
retriever = FilteredRetriever(vector_db, embedding_service)
results = retriever.retrieve_with_filters(
    query="How do we handle refunds?",
    filters={
        'document_type': 'policy',
        'date_range': {
            'start': '2024-01-01',
            'end': '2024-12-31'
        }
    },
    top_k=5
)
```
Retrieved documents must be formatted effectively for the LLM.
```python
class ContextAssembler:
    def __init__(self, max_context_length: int = 4000):
        # Budget measured in words here; swap in a tokenizer for exact token counts
        self.max_context_length = max_context_length

    def assemble_context(
        self,
        query: str,
        retrieved_docs: list[dict],
        include_metadata: bool = True
    ) -> str:
        """Assemble retrieved documents into coherent context."""
        context_parts = []
        current_length = 0
        for i, doc in enumerate(retrieved_docs, 1):
            # Format document
            doc_text = self.format_document(doc, i, include_metadata)
            doc_length = len(doc_text.split())
            # Stop once the context budget is exhausted
            if current_length + doc_length > self.max_context_length:
                break
            context_parts.append(doc_text)
            current_length += doc_length
        return "\n\n---\n\n".join(context_parts)

    def format_document(self, doc: dict, index: int, include_metadata: bool) -> str:
        """Format a single document with its metadata."""
        parts = [f"[Document {index}]"]
        if include_metadata and 'metadata' in doc:
            metadata = doc['metadata']
            if 'title' in metadata:
                parts.append(f"Title: {metadata['title']}")
            if 'source' in metadata:
                parts.append(f"Source: {metadata['source']}")
            if 'created_at' in metadata:
                parts.append(f"Date: {metadata['created_at']}")
        parts.append(f"\n{doc['text']}")
        return "\n".join(parts)
```
```python
class RAGPromptBuilder:
    def __init__(self):
        self.system_template = """You are a helpful assistant that answers questions based on provided context.

CRITICAL RULES:
1. Answer ONLY using information from the provided documents
2. If the documents don't contain relevant information, say so clearly
3. Cite document numbers when making claims [Document X]
4. Do not make assumptions or add information not in the documents
5. If documents contradict each other, acknowledge this

Context documents:
{context}"""

        self.user_template = """Based on the documents provided, please answer this question:

{query}

Remember to cite which documents support your answer."""

    def build_messages(self, query: str, context: str) -> list[dict]:
        """Build messages for chat completion."""
        return [
            {
                "role": "system",
                "content": self.system_template.format(context=context)
            },
            {
                "role": "user",
                "content": self.user_template.format(query=query)
            }
        ]
```
The complete system ties retrieval, context assembly, and generation together:

```python
class RAGSystem:
    def __init__(
        self,
        vector_db,
        embedding_service,
        llm_client,
        retriever_type: str = 'hybrid',
        es_client=None  # required when using the hybrid retriever
    ):
        self.embedding_service = embedding_service
        self.llm = llm_client
        if retriever_type == 'hybrid':
            self.retriever = HybridRetriever(vector_db, es_client, embedding_service)
        else:
            self.retriever = BasicRetriever(vector_db, embedding_service)
        self.context_assembler = ContextAssembler()
        self.prompt_builder = RAGPromptBuilder()

    def query(self, question: str, top_k: int = 5, filters: dict = None) -> dict:
        """Complete RAG query pipeline."""
        # Step 1: Retrieve relevant documents
        # (filters are forwarded only when set; the retriever must support them)
        kwargs = {'filters': filters} if filters else {}
        retrieved_docs = self.retriever.retrieve(query=question, top_k=top_k, **kwargs)
        if not retrieved_docs:
            return {
                'answer': "I don't have any relevant information to answer this question.",
                'sources': [],
                'confidence': 0.0
            }
        # Step 2: Assemble context
        context = self.context_assembler.assemble_context(
            query=question,
            retrieved_docs=retrieved_docs
        )
        # Step 3: Build prompt
        messages = self.prompt_builder.build_messages(question, context)
        # Step 4: Generate response
        response = self.llm.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=messages,
            temperature=0.1  # Low temperature for factual accuracy
        )
        answer = response.choices[0].message.content
        # Step 5: Extract citations and validate grounding
        citations = self.extract_citations(answer)
        validated_answer = self.validate_answer(answer, retrieved_docs)
        return {
            'answer': answer,
            'sources': [
                {
                    'text': doc['text'][:200] + '...',
                    'metadata': doc['metadata'],
                    'score': doc['score']
                }
                for doc in retrieved_docs
            ],
            'citations': citations,
            'validated': validated_answer,
            'retrieval_count': len(retrieved_docs)
        }

    def extract_citations(self, answer: str) -> list[int]:
        """Extract [Document N] citations from the answer."""
        import re
        citations = re.findall(r'\[Document (\d+)\]', answer)
        return [int(c) for c in citations]

    def validate_answer(self, answer: str, retrieved_docs: list[dict]) -> bool:
        """Check whether the answer is grounded in the retrieved docs."""
        # Simple validation: check if key phrases from the answer appear in the docs
        # (extract_key_phrases is sketched after this block)
        answer_phrases = self.extract_key_phrases(answer)
        found_phrases = 0
        for phrase in answer_phrases:
            for doc in retrieved_docs:
                if phrase.lower() in doc['text'].lower():
                    found_phrases += 1
                    break
        # Consider valid if more than half of the key phrases are found
        return found_phrases / len(answer_phrases) > 0.5 if answer_phrases else False

# Usage
rag_system = RAGSystem(
    vector_db=pinecone_index,
    embedding_service=embedding_service,
    llm_client=openai_client,
    retriever_type='hybrid',
    es_client=es_client
)

result = rag_system.query(
    question="What is our company's remote work policy?",
    top_k=5,
    filters={'document_type': 'policy'}
)

print(result['answer'])
print(f"\nSources: {len(result['sources'])}")
print(f"Validated: {result['validated']}")
```
Iterative retrieval fetches more context when the first pass falls short:

```python
class IterativeRAG:
    def __init__(self, rag_system, max_iterations: int = 3):
        self.rag = rag_system
        self.max_iterations = max_iterations

    def query_iterative(self, question: str) -> dict:
        """Retrieve additional context if the initial answer is insufficient."""
        all_sources = []
        for iteration in range(self.max_iterations):
            result = self.rag.query(question, top_k=5)
            # Check if the answer is sufficient (helper sketched below)
            if self.is_sufficient(result['answer']):
                return result
            # Generate a follow-up query targeting the gap (helper sketched below)
            follow_up = self.generate_followup_query(
                question,
                result['answer'],
                iteration
            )
            # Retrieve additional documents
            additional_docs = self.rag.retriever.retrieve(follow_up, top_k=3)
            all_sources.extend(additional_docs)
        # Final pass with a larger top_k to cover the accumulated evidence
        # (a fuller implementation would inject all_sources into the context directly)
        return self.rag.query(question, top_k=len(all_sources))
```
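IterativeRAG leans on two helpers not defined above, is_sufficient and generate_followup_query. Minimal sketches, assuming the same generic `.generate(prompt)` LLM interface used by QueryEnhancer earlier:

```python
# Methods to add to IterativeRAG; heuristics are illustrative assumptions
def is_sufficient(self, answer: str) -> bool:
    """Heuristic: treat explicit 'can't answer' phrasings as insufficient."""
    refusals = ("don't have", "no relevant information", "cannot answer")
    return not any(phrase in answer.lower() for phrase in refusals)

def generate_followup_query(self, question: str, answer: str, iteration: int) -> str:
    """Ask the LLM for one search query that targets the missing information."""
    prompt = (
        f"Original question: {question}\n"
        f"Partial answer so far: {answer}\n"
        "Write one search query that would retrieve the missing information."
    )
    return self.rag.llm.generate(prompt)  # assumes a generic .generate() client
```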
```python
class ParentDocumentRetriever:
    """Retrieve small chunks but provide larger parent context to the LLM."""

    def __init__(self, vector_db, document_store):
        self.vector_db = vector_db             # Stores small chunks
        self.document_store = document_store   # Stores full documents

    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve chunks but return their parent documents."""
        # Retrieve small, precise chunks (assumes the store embeds the raw query)
        chunk_results = self.vector_db.query(query, top_k=top_k)
        # Fetch each parent document once
        parent_docs = []
        seen_parents = set()
        for chunk in chunk_results:
            parent_id = chunk['metadata']['parent_document_id']
            if parent_id not in seen_parents:
                parent_docs.append(self.document_store.get(parent_id))
                seen_parents.add(parent_id)
        return parent_docs
```
```python
from sentence_transformers import CrossEncoder

class ReRanker:
    def __init__(self):
        self.model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def rerank(self, query: str, documents: list[dict], top_k: int = 5) -> list[dict]:
        """Re-rank retrieved documents using a cross-encoder."""
        # Prepare query-document pairs
        pairs = [(query, doc['text']) for doc in documents]
        # Score each pair jointly (more accurate than bi-encoder similarity)
        scores = self.model.predict(pairs)
        # Attach scores and sort
        for doc, score in zip(documents, scores):
            doc['rerank_score'] = float(score)
        reranked = sorted(
            documents,
            key=lambda x: x['rerank_score'],
            reverse=True
        )
        return reranked[:top_k]
```
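Re-ranking works best when you over-retrieve first and let the cross-encoder make the final cut; a usage sketch with a generic retriever:

```python
reranker = ReRanker()
candidates = retriever.retrieve(query, top_k=20)          # cast a wide net first
final_docs = reranker.rerank(query, candidates, top_k=5)  # precise final selection
```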
Evaluate retrieval quality and end-to-end answer quality separately. First, standard retrieval metrics:

```python
import numpy as np

class RAGEvaluator:
    def __init__(self):
        # load_test_queries is an assumed helper; the expected record
        # format is shown after this block
        self.test_queries = load_test_queries()

    def evaluate_retrieval(self, retriever) -> dict:
        """Evaluate retrieval quality."""
        metrics = {
            'precision_at_k': [],
            'recall_at_k': [],
            'mrr': []  # Mean Reciprocal Rank
        }
        for test_case in self.test_queries:
            query = test_case['query']
            relevant_docs = set(test_case['relevant_doc_ids'])
            # Retrieve
            results = retriever.retrieve(query, top_k=10)
            retrieved_ids = [r['id'] for r in results]
            # Calculate metrics
            metrics['precision_at_k'].append(
                self.precision_at_k(retrieved_ids, relevant_docs, k=5)
            )
            metrics['recall_at_k'].append(
                self.recall_at_k(retrieved_ids, relevant_docs, k=10)
            )
            metrics['mrr'].append(
                self.mean_reciprocal_rank(retrieved_ids, relevant_docs)
            )
        return {
            'precision@5': np.mean(metrics['precision_at_k']),
            'recall@10': np.mean(metrics['recall_at_k']),
            'mrr': np.mean(metrics['mrr'])
        }

    def precision_at_k(self, retrieved: list, relevant: set, k: int) -> float:
        """Fraction of the top-k results that are relevant."""
        retrieved_k = retrieved[:k]
        relevant_retrieved = sum(1 for doc in retrieved_k if doc in relevant)
        return relevant_retrieved / k

    def recall_at_k(self, retrieved: list, relevant: set, k: int) -> float:
        """Fraction of all relevant documents found in the top k."""
        retrieved_k = set(retrieved[:k])
        relevant_retrieved = len(retrieved_k & relevant)
        return relevant_retrieved / len(relevant) if relevant else 0

    def mean_reciprocal_rank(self, retrieved: list, relevant: set) -> float:
        """Reciprocal rank of the first relevant result."""
        for i, doc_id in enumerate(retrieved, 1):
            if doc_id in relevant:
                return 1.0 / i
        return 0.0
```
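load_test_queries is assumed above; what matters is the record shape it returns. An illustrative example:

```python
# Illustrative format for the evaluation set load_test_queries() returns
test_queries = [
    {
        "query": "What is the refund window for annual plans?",
        "relevant_doc_ids": ["policy_refunds_chunk_3", "policy_refunds_chunk_4"],
    },
    # ... more labeled examples
]
```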
Then, end-to-end answer quality with an LLM-as-judge:

```python
from openai import OpenAI

class EndToEndEvaluator:
    def __init__(self, rag_system):
        self.rag = rag_system
        self.llm_judge = OpenAI()  # For automated evaluation

    def evaluate_answer_quality(
        self,
        question: str,
        generated_answer: str,
        ground_truth: str
    ) -> dict:
        """Evaluate answer quality using LLM-as-judge."""
        judge_prompt = f"""Evaluate this Q&A pair:

Question: {question}

Ground Truth Answer: {ground_truth}

Generated Answer: {generated_answer}

Rate on these dimensions (1-5 scale):
1. Accuracy: How factually correct is the answer?
2. Completeness: Does it fully answer the question?
3. Relevance: Is the answer on-topic?
4. Citation: Are sources properly cited?

Provide scores and brief justification."""
        evaluation = self.llm_judge.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": judge_prompt}]
        )
        # Parse the judge's free-text reply (helper sketched below)
        scores = self.parse_scores(evaluation.choices[0].message.content)
        return scores
```
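parse_scores isn't defined above. A regex-based sketch; in practice, asking the judge to return JSON and parsing with json.loads is usually more robust:

```python
# Method to add to EndToEndEvaluator; regex parsing is an illustrative assumption
def parse_scores(self, judge_output: str) -> dict:
    """Pull 1-5 scores for each dimension out of the judge's free-text reply."""
    import re
    scores = {}
    for dimension in ("Accuracy", "Completeness", "Relevance", "Citation"):
        match = re.search(rf"{dimension}\D{{0,20}}([1-5])", judge_output)
        if match:
            scores[dimension.lower()] = int(match.group(1))
    return scores
```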
Knowledge bases drift, so production systems need in-place document updates:

```python
import logging

logger = logging.getLogger(__name__)

class DocumentUpdater:
    def __init__(self, vector_db, embedding_service):
        self.vector_db = vector_db
        self.embedding_service = embedding_service

    def update_document(self, document_id: str, new_content: str):
        """Replace a document's chunks in the knowledge base."""
        # Delete the old chunks
        self.vector_db.delete(filter={'document_id': document_id})
        # Chunk the new content
        chunks = chunk_by_semantics(new_content)
        # Generate embeddings
        embeddings = self.embedding_service.embed_batch(chunks)
        # Insert the new chunks
        vectors = [
            (
                f"{document_id}_chunk_{i}",
                embedding,
                {'text': chunk, 'document_id': document_id}
            )
            for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
        ]
        self.vector_db.upsert(vectors)
        logger.info(f"Updated document {document_id} with {len(chunks)} chunks")
```
Finally, monitor the system in production:

```python
from prometheus_client import Counter, Histogram

class RAGMonitor:
    def __init__(self):
        # Metric names are illustrative; adjust to your naming conventions
        self.metrics = {
            'queries_total': Counter('rag_queries_total', 'Total RAG queries'),
            'retrieval_latency': Histogram('rag_retrieval_latency_seconds', 'Retrieval latency'),
            'generation_latency': Histogram('rag_generation_latency_seconds', 'Generation latency'),
            'sources_retrieved': Histogram('rag_sources_retrieved', 'Documents retrieved per query'),
            'user_feedback': Counter('rag_user_feedback_total', 'User feedback', ['feedback'])
        }

    def track_query(self, query_data: dict):
        """Track per-query metrics."""
        self.metrics['queries_total'].inc()
        self.metrics['retrieval_latency'].observe(query_data['retrieval_time'])
        self.metrics['generation_latency'].observe(query_data['generation_time'])
        self.metrics['sources_retrieved'].observe(query_data['sources_count'])

    def track_feedback(self, helpful: bool):
        """Track user thumbs-up/down feedback."""
        label = 'helpful' if helpful else 'not_helpful'
        self.metrics['user_feedback'].labels(feedback=label).inc()
```
Retrieval-Augmented Generation transforms LLMs from impressive but unreliable systems into trustworthy production tools. By grounding responses in retrieved documents, RAG dramatically reduces hallucinations, enables work with proprietary data, and provides attribution that builds user trust.
Key implementation principles:

- Chunk semantically, starting around 512 tokens, and enrich every chunk with metadata.
- Combine vector and keyword retrieval, then re-rank with a cross-encoder.
- Constrain generation to the retrieved context and require citations.
- Evaluate retrieval (precision@k, recall@k, MRR) and answer quality separately.
- Keep the knowledge base fresh, and monitor latency, retrieval counts, and user feedback in production.
RAG is not a silver bullet—it adds complexity, latency, and infrastructure requirements. But for applications requiring factual accuracy and specialized knowledge, it’s become the essential architecture.
Last Updated: December 2024