GenAI in Production: RAG Systems That Work

Why RAG Demos Fail in Production

The demo works perfectly:

User asks: "What's our refund policy?"
RAG retrieves relevant docs
LLM generates accurate answer
Stakeholders are impressed

Then production happens:

Latency: 8 seconds (users expect <2s)
Cost: $0.50 per query (unsustainable at scale)
Accuracy: 70% (not good enough for customer-facing)
Hallucinations: LLM invents policies that don't exist
No observability: Can't debug wrong answers

The gap between demo and production is where most RAG projects die.

Production Requirements for RAG

1. Latency: <2s end-to-end (retrieval + generation)

2. Cost: <$0.05 per query at scale

3. Accuracy: >90% for enterprise use cases

4. Observability: Track retrieval quality, LLM performance, user satisfaction

5. Safety: Prevent hallucinations, data leaks, prompt injection

RAG Architecture for Production

Component 1: Document Processing Pipeline

Chunking Strategy: Break documents into retrievable units

Fixed-size chunks: 512-1024 tokens with 10-20% overlap
Semantic chunks: Split on paragraphs, sections, or topics
Hybrid: Combine both for better retrieval

Metadata Enrichment: Add context to chunks

Document title, author, date
Section headers, page numbers
Access control labels
Document type (policy, FAQ, technical doc)

Embedding Generation: Convert text to vectors

Models: OpenAI text-embedding-3, Cohere embed-v3, or open-source (BGE, E5)
Dimension: 768-1536 (trade-off between quality and cost)
Batch processing: Process 1000s of documents efficiently

Component 2: Vector Database

Database Selection:

Pinecone: Managed, easy to use, expensive at scale
Weaviate: Open-source, hybrid search, self-hosted
Qdrant: Fast, Rust-based, good for high-throughput
pgvector: PostgreSQL extension, good for existing Postgres users

Indexing Strategy:

HNSW (Hierarchical Navigable Small World) for fast approximate search
IVF (Inverted File Index) for large-scale datasets
Trade-off: Speed vs. accuracy

Component 3: Retrieval Strategy

Hybrid Search: Combine vector and keyword search

Vector search: Semantic similarity
Keyword search (BM25): Exact term matching
Combine scores: 0.7 * vector + 0.3 * keyword

Reranking: Improve retrieval quality

Retrieve top 20 candidates
Rerank with cross-encoder model (Cohere rerank, BGE reranker)
Return top 5 for LLM context

Query Expansion: Improve recall

Generate multiple query variations
Use LLM to rephrase user question
Retrieve for all variations, deduplicate results

Component 4: LLM Generation

Model Selection:

GPT-4: Best quality, expensive ($0.03/1K tokens)
GPT-3.5-turbo: Good quality, cheaper ($0.002/1K tokens)
Claude: Strong reasoning, good for complex queries
Open-source: Llama 3, Mixtral (self-hosted, lower cost)

Prompt Engineering:

```

You are a helpful assistant answering questions based on provided context.

Context:

{retrieved_chunks}

User Question: {user_query}

Instructions:

Answer based ONLY on the provided context
If the context doesn't contain the answer, say "I don't have enough information"
Cite sources using [Source: document_name]
Be concise and accurate

Answer:

```

Component 5: Observability

Metrics to Track:

Retrieval Quality: Precision@K, Recall@K, MRR (Mean Reciprocal Rank)
Generation Quality: BLEU, ROUGE, human eval scores
Latency: p50, p95, p99 for retrieval and generation
Cost: Per-query cost (embedding + retrieval + LLM)
User Satisfaction: Thumbs up/down, follow-up questions

Logging:

User query
Retrieved chunks (with scores)
LLM prompt and response
Latency breakdown
User feedback

Optimization Strategies

Latency Optimization:

1. Caching: Cache embeddings, frequent queries, LLM responses

2. Parallel Retrieval: Query multiple indexes simultaneously

3. Streaming: Stream LLM response as it generates

4. Smaller Models: Use GPT-3.5 instead of GPT-4 where acceptable

Cost Optimization:

1. Batch Processing: Process documents in batches

2. Cheaper Embeddings: Use smaller embedding models

3. Prompt Compression: Reduce context size sent to LLM

4. Caching: Avoid redundant LLM calls

Accuracy Optimization:

1. Better Chunking: Experiment with chunk size and overlap

2. Hybrid Search: Combine vector and keyword search

3. Reranking: Use cross-encoder for better retrieval

4. Prompt Tuning: Iterate on prompt engineering

5. Fine-tuning: Fine-tune embedding or LLM on domain data

Safety & Compliance

Prevent Hallucinations:

Instruct LLM to answer only from context
Add confidence scores to responses
Human-in-the-loop for high-stakes decisions

Access Control:

Filter retrieved chunks based on user permissions
Implement row-level security in vector DB
Audit all queries and responses

Prompt Injection Prevention:

Sanitize user input
Separate user query from system instructions
Monitor for suspicious patterns

Production Checklist

GenAI in Production: RAG Systems That Actually Work

The RAG Challenge

Published

Need Help Building Production RAG Systems?

RELATED_INSIGHTS

Decision Intelligence: From Data to Action

From Dashboards to Decisions: Observability That Matters

Enterprise MVP Failure: Operability Is the Missing Feature