Evaluating RAG Pipelines with MLflow: A Practical Framework
By Gennoor Tech · February 24, 2026
A systematic RAG evaluation pipeline using MLflow tracks retrieval quality, answer faithfulness, and relevance across versions — catching quality regressions before users experience them.
Most teams building retrieval-augmented generation (RAG) systems start with vibes-based evaluation: "Does this answer look good?" While intuition helps during initial development, it fails catastrophically at scale. Production RAG systems answer thousands of queries daily across diverse topics—manual review is impossible, and subtle quality degradations go unnoticed until users complain. This guide presents a practical framework for rigorous RAG evaluation using MLflow.
Why Vibes-Based Evaluation Fails at Scale
The vibes approach has fatal flaws:
Cognitive Biases
Humans fall victim to confirmation bias (seeing what they expect), recency bias (over-weighting recent examples), and anchoring (being influenced by initial impressions). These biases make subjective evaluation unreliable, especially when comparing similar system versions.
Limited Coverage
Manual review typically covers 10-50 examples. Production systems handle millions of queries spanning edge cases you never imagined. Your sample is almost certainly not representative.
Inconsistency
Different reviewers apply different standards. The same reviewer applies different standards on different days. Inter-rater reliability studies consistently show poor agreement on subjective quality assessments.
Inability to Detect Regressions
When you modify chunk sizes, change embedding models, or update prompts, vibes evaluation can't reliably detect whether performance improved or degraded. You're flying blind.
No Cost Visibility
Manual review doesn't capture token usage, API costs, or latency. You might achieve slightly better quality at 10x the cost—a terrible trade-off that vibes evaluation won't reveal.
Systematic evaluation with MLflow solves these problems by combining automated metrics, large test sets, and reproducible experiments.
Comprehensive RAG Evaluation Taxonomy
RAG evaluation breaks down into three categories, each measuring different aspects of system quality:
Retrieval Quality
Did the system find the right documents? Metrics include precision (what percentage of retrieved docs are relevant), recall (what percentage of relevant docs were retrieved), and ranking quality (are the best docs at the top).
Generation Quality
Given the retrieved context, did the LLM produce a good answer? Metrics include faithfulness (does the answer follow from the context), relevance (does it address the question), coherence (is it well-written), and safety (is it non-toxic and appropriate).
End-to-End Quality
From the user's perspective, was the overall experience successful? Metrics include answer correctness, user satisfaction, task completion rate, and whether the system knew when to say "I don't know."
Comprehensive evaluation requires metrics from all three categories. Optimizing only retrieval or only generation leads to sub-optimal systems.
Retrieval Metrics Deep Dive
Precision and Recall
These fundamental metrics require ground truth labels indicating which documents are relevant to each query:
- Precision@K: Of the K documents retrieved, what percentage are relevant?
- Recall@K: Of all relevant documents in the corpus, what percentage appear in the top K results?
Example: For query "What is the loan approval process?", if the system retrieves 5 documents and 3 are relevant, and there are 4 relevant documents total in the corpus:
Precision@5 = 3/5 = 0.60
Recall@5 = 3/4 = 0.75
High precision means users aren't overwhelmed with irrelevant results. High recall means the system finds all the information it needs to answer comprehensively.
Mean Reciprocal Rank (MRR)
MRR measures how quickly users find relevant results:
MRR = average(1 / rank_of_first_relevant_document)
If the first relevant document appears in position 1, the reciprocal rank is 1.0. Position 2 yields 0.5, position 3 yields 0.33, and so on. Higher MRR means users find what they need faster.
Normalized Discounted Cumulative Gain (NDCG)
NDCG is a sophisticated ranking metric that accounts for graded relevance (some documents are more relevant than others) and position (earlier documents matter more):
DCG@K = sum(relevance_score[i] / log2(i + 1) for i in 1 to K)
NDCG@K = DCG@K / ideal_DCG@K
NDCG ranges from 0 to 1, with 1 indicating perfect ranking. It's particularly useful when you have nuanced relevance labels (e.g., "highly relevant", "somewhat relevant", "not relevant") rather than binary labels.
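The two formulas above translate directly into a small helper. This is a minimal sketch, assuming you have graded relevance scores for the retrieved documents in retrieval order:

```python
import math

def dcg_at_k(relevance_scores, k):
    """Discounted cumulative gain over the top-k results (positions are 1-indexed)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance_scores[:k], start=1))

def ndcg_at_k(relevance_scores, k):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return dcg_at_k(relevance_scores, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered result list scores exactly 1.0 because its DCG equals the ideal DCG; any relevant document pushed down the ranking lowers the score.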
Hit Rate
Simple but effective: what percentage of queries have at least one relevant document in the top K results?
Hit_Rate@K = (number of queries with ≥1 relevant doc in top K) / (total queries)
Hit rate is easier to calculate than precision/recall because you only need to identify if any relevant documents were retrieved, not label every document.
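The hit-rate formula is a one-liner in practice. A quick sketch (the document IDs and relevance sets here are illustrative):

```python
def hit_rate_at_k(retrieved_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_lists, relevant_sets)
        if any(doc in relevant for doc in retrieved[:k])
    )
    return hits / len(retrieved_lists) if retrieved_lists else 0.0
```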
Implementation in Code
from mlflow.metrics import make_metric

def calculate_retrieval_metrics(predictions, targets, metrics):
    """Calculate precision, recall, and MRR for retrieved documents."""
    results = {
        "precision_at_5": [],
        "recall_at_5": [],
        "mrr": [],
    }
    for pred, target in zip(predictions, targets):
        retrieved_doc_ids = pred["retrieved_docs"][:5]
        relevant_doc_ids = target["relevant_docs"]

        # Precision@5
        relevant_in_retrieved = sum(1 for doc in retrieved_doc_ids if doc in relevant_doc_ids)
        precision = relevant_in_retrieved / len(retrieved_doc_ids) if retrieved_doc_ids else 0
        results["precision_at_5"].append(precision)

        # Recall@5
        recall = relevant_in_retrieved / len(relevant_doc_ids) if relevant_doc_ids else 0
        results["recall_at_5"].append(recall)

        # MRR
        reciprocal_rank = 0
        for i, doc in enumerate(retrieved_doc_ids, 1):
            if doc in relevant_doc_ids:
                reciprocal_rank = 1 / i
                break
        results["mrr"].append(reciprocal_rank)

    # Note: recent MLflow versions expect eval_fn to return mlflow.metrics.MetricValue;
    # wrap this dict in MetricValue(aggregate_results=...) if your version requires it.
    return {
        "precision_at_5": sum(results["precision_at_5"]) / len(results["precision_at_5"]),
        "recall_at_5": sum(results["recall_at_5"]) / len(results["recall_at_5"]),
        "mrr": sum(results["mrr"]) / len(results["mrr"]),
    }

retrieval_metric = make_metric(
    eval_fn=calculate_retrieval_metrics,
    greater_is_better=True,
    name="retrieval_quality",
)
Generation Metrics Deep Dive
Faithfulness
Faithfulness measures whether the generated answer is supported by the retrieved context. This is critical for preventing hallucinations—the LLM should only make claims that can be verified in the source documents.
Implement faithfulness evaluation using an LLM-as-judge approach:
from mlflow.metrics.genai import faithfulness

faithfulness_metric = faithfulness(model="openai:/gpt-4")
The judge model receives the retrieved context, the generated answer, and a prompt asking "Is every claim in the answer supported by the context?" It returns a score from 0 to 1.
Relevance
Relevance measures whether the answer actually addresses the user's question. An answer can be faithful to the context but still irrelevant if the wrong documents were retrieved.
from mlflow.metrics.genai import answer_relevance

relevance_metric = answer_relevance(model="openai:/gpt-4")
Coherence
Coherence assesses whether the answer is well-written, logically structured, and easy to understand. Poor coherence manifests as awkward phrasing, contradictions, or disjointed flow.
Build a custom coherence metric:
def evaluate_coherence(predictions, targets, metrics):
    """Evaluate answer coherence using GPT-4 as judge."""
    from openai import OpenAI  # openai>=1.0 client interface

    client = OpenAI()
    scores = []
    for pred in predictions:
        judge_prompt = f"""Rate the coherence of this answer on a scale of 1-5:

Answer: {pred}

Consider: Is it well-structured? Does it flow logically? Is it easy to understand?
Respond with only a number from 1 to 5."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": judge_prompt}],
        )
        score = int(response.choices[0].message.content.strip())
        scores.append(score / 5)  # Normalize to 0-1
    return {"coherence": sum(scores) / len(scores)}
Toxicity
Toxicity detection identifies harmful, offensive, or inappropriate content. Use specialized models like Perspective API or Azure Content Safety:
from mlflow.metrics import toxicity

# MLflow's built-in toxicity metric scores outputs with a pretrained
# classifier; it does not take a judge-model argument.
toxicity_metric = toxicity()
Set strict thresholds—even a small percentage of toxic responses can cause significant reputational damage.
Conciseness
Some domains prioritize brief answers, while others require comprehensive explanations. Measure answer length and penalize verbosity when appropriate:
def evaluate_conciseness(predictions, targets, metrics):
    """Penalize unnecessarily long answers."""
    scores = []
    target_length = 150  # words
    for pred in predictions:
        word_count = len(pred.split())
        if word_count <= target_length:
            score = 1.0
        else:
            # Penalty proportional to the excess words
            score = max(0, 1 - (word_count - target_length) / target_length)
        scores.append(score)
    return {"conciseness": sum(scores) / len(scores)}
Building Golden Test Sets
Automated metrics are only as good as the test data they evaluate against. Golden test sets are curated collections of query-answer pairs with ground truth labels.
Strategies for Test Set Creation
1. Mine Production Logs
Identify frequently asked questions from actual users. This ensures your test set reflects real usage patterns rather than what developers imagine users will ask.
2. Synthetic Data Generation
Use LLMs to generate diverse questions from your document corpus:
for document in corpus:
    prompt = f"""Generate 5 questions that could be answered using this document:

{document}

Questions should vary in complexity and specificity."""
    questions = llm.generate(prompt)
    # Add to test set with document as ground truth context
3. Subject Matter Expert (SME) Curation
Have domain experts write questions they expect users to ask, along with correct answers and relevant documents. This is time-consuming but produces high-quality test data for critical scenarios.
4. Edge Case Collection
Actively seek out challenging scenarios:
- Ambiguous questions that could have multiple interpretations
- Questions requiring information from multiple documents
- Questions with no good answer in the corpus
- Questions with contradictory information across documents
- Domain-specific jargon or technical terminology
Test Set Size Recommendations
- Minimum viable: 50-100 examples for initial development
- Development set: 200-500 examples for hyperparameter tuning and prompt engineering
- Test set: 500-1000 examples for final evaluation and confidence before production
- Continuous evaluation: 50-100 new examples per month from production traffic
Test Set Maintenance
Test sets decay over time as your document corpus evolves, user needs change, and edge cases emerge. Schedule quarterly reviews:
- Remove outdated questions whose answers have changed
- Add new questions covering recent edge cases from production
- Update ground truth labels based on document updates
- Rebalance categories to ensure representative coverage
Store test sets in version control (Git) with clear change logs. Tag test set versions with the document corpus versions they correspond to.
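One convenient on-disk format is JSONL, one record per question, committed to Git next to a change log. The field names below are illustrative assumptions, not a standard schema:

```python
import json

# One example record; field names are illustrative, not a fixed schema.
record = {
    "id": "q-0001",
    "question": "What is the loan approval process?",
    "expected_answer": "Applications are reviewed by the credit team within 5 business days.",
    "relevant_docs": ["policy-loans-v3#sec2", "policy-loans-v3#sec5"],
    "should_refuse": False,
    "corpus_version": "2024-06-docs",  # ties the labels to a corpus snapshot
    "category": "lending",
}

# Append to the JSONL file that lives in version control.
with open("golden_test_set.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

The `corpus_version` field makes the tagging advice above concrete: each record declares which document snapshot its labels were written against.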
For guidance on building test sets tailored to your domain, our MLOps training programs include hands-on workshops.
MLflow Evaluate Configuration and Code Patterns
Complete Evaluation Pipeline
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_relevance, faithfulness

# Load test set
test_data = pd.read_csv("test_set.csv")

# Define evaluation data format
eval_data = pd.DataFrame({
    "inputs": test_data["question"],
    "ground_truth": test_data["expected_answer"],
    "context": test_data["relevant_documents"],
})

# Define your RAG model as a function
def rag_model(inputs):
    results = []
    for question in inputs["inputs"]:
        # Retrieve documents
        docs = retriever.retrieve(question)
        # Generate answer
        answer = generator.generate(question, docs)
        results.append(answer)
    return results

# Run evaluation
with mlflow.start_run():
    # Log configuration
    mlflow.log_param("chunk_size", 512)
    mlflow.log_param("embedding_model", "text-embedding-ada-002")
    mlflow.log_param("llm_model", "gpt-4")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("top_k_docs", 5)

    # Evaluate (the custom-metrics argument is named extra_metrics
    # in recent MLflow releases)
    results = mlflow.evaluate(
        model=rag_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[
            faithfulness(model="openai:/gpt-4"),
            answer_relevance(model="openai:/gpt-4"),
            retrieval_metric,
            toxicity_metric,
        ],
    )

    # Log aggregate metrics
    mlflow.log_metrics(results.metrics)

    # Save detailed results
    results.tables["eval_results_table"].to_csv("detailed_results.csv")
    mlflow.log_artifact("detailed_results.csv")
Comparing Multiple Configurations
configurations = [
    {"chunk_size": 256, "top_k": 3},
    {"chunk_size": 512, "top_k": 5},
    {"chunk_size": 1024, "top_k": 7},
]

for config in configurations:
    with mlflow.start_run(run_name=f"chunk_{config['chunk_size']}_k_{config['top_k']}"):
        # Configure system
        rag_system.set_chunk_size(config["chunk_size"])
        rag_system.set_top_k(config["top_k"])
        # Log parameters
        mlflow.log_params(config)
        # Evaluate
        results = mlflow.evaluate(...)
        mlflow.log_metrics(results.metrics)
Use the MLflow UI to compare runs side-by-side and identify the best configuration.
Automated Evaluation Pipelines with CI/CD
Integrate evaluation into your development workflow to catch regressions before production deployment.
GitHub Actions Example
name: RAG Evaluation

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py
      - name: Check metrics threshold
        run: python check_thresholds.py
Quality Gates
Define minimum acceptable metrics and fail the CI pipeline if they're not met:
import sys

import mlflow

client = mlflow.tracking.MlflowClient()
run = client.get_run(run_id)  # run_id of the evaluation run produced by evaluate_rag.py

thresholds = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "precision_at_5": 0.70,
    "toxicity": 0.05,  # Lower is better
}

failures = []
for metric, threshold in thresholds.items():
    value = run.data.metrics.get(metric, 0)
    if metric == "toxicity":
        if value > threshold:
            failures.append(f"{metric}: {value} > {threshold}")
    else:
        if value < threshold:
            failures.append(f"{metric}: {value} < {threshold}")

if failures:
    print("Quality gates failed:")
    for failure in failures:
        print(f"  - {failure}")
    sys.exit(1)
A/B Testing RAG Configurations
Once in production, use A/B testing to validate improvements with real users:
- Deploy challenger variant: Deploy new RAG configuration alongside existing "champion" version
- Route traffic: Send 10% of traffic to challenger, 90% to champion
- Collect metrics: Track quality metrics, user feedback, cost, and latency for both variants
- Statistical analysis: Determine if observed differences are statistically significant
- Promote or rollback: If challenger performs better, gradually increase traffic; if worse, remove it
Use feature flags (Azure App Configuration, LaunchDarkly) to control routing without code deployments.
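If a feature-flag service isn't available, the routing step can be sketched with deterministic hash-based bucketing. The function name and percentage are illustrative; the key property is that each user stays on one variant across requests:

```python
import hashlib

def assign_variant(user_id: str, challenger_pct: int = 10) -> str:
    """Deterministically route a user to the champion or challenger variant.

    Hashing the user ID into 100 buckets keeps assignments sticky per user,
    matching the 10%/90% split described above.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"
```

Sticky assignment matters for the statistical-analysis step: if a user bounced between variants, per-user metrics would mix both configurations.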
Chunking Strategy Evaluation
Chunk size dramatically impacts RAG performance. Evaluate multiple strategies:
Fixed-Size Chunking
chunk_sizes = [128, 256, 512, 1024]

for size in chunk_sizes:
    with mlflow.start_run(run_name=f"fixed_chunk_{size}"):
        chunker = FixedSizeChunker(size=size, overlap=50)
        # Rebuild index and evaluate
Semantic Chunking
Chunk at natural boundaries (sentences, paragraphs, sections) rather than fixed character counts:
strategies = ["sentence", "paragraph", "section"]

for strategy in strategies:
    with mlflow.start_run(run_name=f"semantic_chunk_{strategy}"):
        chunker = SemanticChunker(strategy=strategy)
        # Rebuild index and evaluate
Hybrid Approaches
Combine semantic boundaries with maximum size constraints to prevent overly large chunks.
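A hybrid chunker can be sketched as follows: split on paragraph boundaries, pack paragraphs greedily up to a size cap, and hard-split any single paragraph that exceeds the cap. The character limit and splitting rules are illustrative assumptions:

```python
def hybrid_chunks(text: str, max_chars: int = 1000):
    """Chunk on paragraph boundaries while enforcing a maximum chunk size."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(para) > max_chars:
            # Fallback: hard-split an oversized paragraph
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
        elif len(current) + len(para) + 2 <= max_chars:  # +2 for the "\n\n" joiner
            current = f"{current}\n\n{para}" if current else para
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk respects semantic boundaries where possible but never exceeds `max_chars`, so it can be evaluated in the same MLflow loop as the fixed-size and semantic strategies above.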
Embedding Model Comparison Framework
Different embedding models trade off quality, speed, and cost:
embedding_models = [
    {"name": "text-embedding-ada-002", "dim": 1536, "cost": "low"},
    {"name": "text-embedding-3-small", "dim": 1536, "cost": "low"},
    {"name": "text-embedding-3-large", "dim": 3072, "cost": "medium"},
    {"name": "sentence-transformers/all-mpnet-base-v2", "dim": 768, "cost": "zero"},
]

for model in embedding_models:
    with mlflow.start_run(run_name=f"embedding_{model['name']}"):
        # Rebuild vector store with new embeddings
        retriever = rebuild_index(embedding_model=model["name"])
        # Evaluate retrieval quality
        results = mlflow.evaluate(...)
        # Log model characteristics
        mlflow.log_param("embedding_dim", model["dim"])
        mlflow.log_param("cost_tier", model["cost"])
Often, cheaper embedding models perform nearly as well as expensive ones for domain-specific corpora. Always benchmark before choosing.
Prompt Template Evaluation
Systematically test prompt variations:
prompt_templates = [
    """Answer this question based on the context:

Question: {question}
Context: {context}""",
    """You are a helpful assistant. Use only the provided context to answer the question. If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context: {context}
Question: {question}""",
    """Based on the following context, provide a concise answer to the question. Cite specific facts from the context.

Context: {context}
Question: {question}
Answer:""",
]

for i, template in enumerate(prompt_templates):
    with mlflow.start_run(run_name=f"prompt_variant_{i}"):
        rag_system.set_prompt_template(template)
        results = mlflow.evaluate(...)
        mlflow.log_text(template, "prompt_template.txt")
Small prompt changes can yield significant quality differences. Test extensively.
Handling Edge Cases
No-Answer Scenarios
When the corpus doesn't contain relevant information, the system should say "I don't know" rather than hallucinate:
def evaluate_no_answer_handling(predictions, targets, metrics):
    """Check whether the system refuses to answer exactly when it should."""
    scores = []
    # Lowercased phrases, since predictions are lowercased before matching
    no_answer_phrases = ["i don't have", "i don't know", "insufficient information", "i cannot answer"]
    for pred, target in zip(predictions, targets):
        if target["should_refuse"]:
            # Should say "I don't know"
            refused = any(phrase in pred.lower() for phrase in no_answer_phrases)
            scores.append(1.0 if refused else 0.0)
        else:
            # Should provide an answer
            provided_answer = not any(phrase in pred.lower() for phrase in no_answer_phrases)
            scores.append(1.0 if provided_answer else 0.0)
    return {"no_answer_accuracy": sum(scores) / len(scores)}
Multi-Document Synthesis
Some questions require synthesizing information from multiple documents. Evaluate whether the system successfully combines evidence:
def evaluate_multi_doc_synthesis(predictions, targets, metrics):
    """Check if the answer incorporates information from multiple source documents."""
    scores = []
    for pred, target in zip(predictions, targets):
        if target["requires_multi_doc"]:
            # Count ground truth docs whose key facts appear in the answer
            docs_referenced = sum(
                1 for doc in target["source_docs"]
                if any(fact in pred for fact in doc["key_facts"])
            )
            score = min(1.0, docs_referenced / len(target["source_docs"]))
            scores.append(score)
    return {"multi_doc_synthesis": sum(scores) / len(scores) if scores else 1.0}
Contradictory Sources
When documents contradict each other, the system should acknowledge the contradiction rather than picking one arbitrarily:
- Test set includes queries where documents provide conflicting information
- Ground truth answer acknowledges both perspectives
- Evaluate whether the generated answer identifies the contradiction
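A simple heuristic check for this behavior can be sketched as below. The phrase list is an illustrative assumption, and an LLM-as-judge evaluation is more robust in practice, but a heuristic like this is cheap enough to run on every evaluation:

```python
def evaluate_contradiction_handling(predictions, targets):
    """Heuristic check: does the answer acknowledge conflicting sources?

    The hedge-phrase list is an illustrative assumption; tune it to your
    domain or replace with an LLM-as-judge metric for production use.
    """
    hedge_phrases = ["conflict", "contradict", "differ", "however",
                     "on the other hand", "sources disagree"]
    scores = []
    for pred, target in zip(predictions, targets):
        if target.get("has_contradictory_sources"):
            acknowledged = any(p in pred.lower() for p in hedge_phrases)
            scores.append(1.0 if acknowledged else 0.0)
    return {"contradiction_handling": sum(scores) / len(scores) if scores else 1.0}
```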
Real-World Case Study: RAG Evaluation at Scale
A financial services company built a RAG system for internal policy Q&A, serving 5,000 employees across legal, compliance, HR, and operations departments.
Initial Challenges
- Manual review of 20 questions couldn't catch regressions
- Employees reported inconsistent answer quality
- No visibility into which departments or query types had poor experiences
- Prompt engineering changes caused unexpected quality degradation
Evaluation Framework Implementation
They built a comprehensive framework using MLflow:
- Golden Test Set: 800 questions curated from 6 months of production logs, labeled by SMEs with ground truth answers and relevant policies
- Automated Pipeline: GitHub Actions running evaluation on every pull request, blocking merges if quality gates fail
- Segmented Metrics: Tracked performance separately for each department, query complexity level, and document type
- Cost Tracking: Logged token usage and costs per evaluation run to prevent ballooning expenses
- Human-in-the-Loop: Weekly review of low-scoring examples to identify systematic issues
Results
- Faithfulness score improved from 0.72 to 0.91 over 3 months
- Caught 12 regressions before production deployment
- Identified that legal queries needed different chunk sizes than HR queries
- Reduced LLM costs by 35% by using cheaper models for simple queries
- Employee satisfaction (measured by thumbs up/down) improved from 68% to 87%
The evaluation framework paid for itself within 6 weeks by preventing production issues and optimizing infrastructure costs. For similar success in your organization, explore our MLOps consulting and training services.
Common Pitfalls and How to Avoid Them
Pitfall 1: Test Set Contamination
Accidentally including test data in training or optimization leads to overfitting. Keep test sets completely separate and only evaluate against them for final validation, not during development.
Pitfall 2: Metric Selection Bias
Optimizing for easy-to-measure metrics (like retrieval precision) while ignoring harder-to-measure ones (like user satisfaction) produces systems that perform well on paper but poorly in practice. Use a balanced scorecard of metrics.
Pitfall 3: Ignoring Latency
Achieving 98% quality with 10-second latency is usually worse than 92% quality with 2-second latency. Always measure and log latency alongside quality metrics.
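Capturing latency can be as simple as wrapping the RAG call with a timer and logging the result next to the quality metrics. A minimal sketch (the wrapper name and logged metric names are illustrative):

```python
import time

def timed_answer(rag_model, question):
    """Wrap a RAG call to capture wall-clock latency for logging."""
    start = time.perf_counter()
    answer = rag_model(question)
    latency_s = time.perf_counter() - start
    return answer, latency_s

# In the same MLflow run as the quality metrics, log latency percentiles, e.g.:
#   mlflow.log_metric("latency_p50_s", p50)
#   mlflow.log_metric("latency_p95_s", p95)
# so trade-offs like "98% quality at 10 s" are visible when comparing runs.
```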
Pitfall 4: Static Test Sets
As your system improves, existing test sets become too easy and stop providing signal. Continuously add new edge cases from production.
Pitfall 5: Lack of Baseline
Establishing a baseline before optimization attempts is critical. Without it, you can't tell if changes improved performance or just shifted it around.
Pitfall 6: Over-Reliance on LLM Judges
LLM-as-judge metrics are powerful but not perfect. They can miss subtle issues, have biases, and cost money to run. Complement them with heuristic metrics and human review of edge cases.
Conclusion
RAG evaluation separates production-grade systems from prototypes. Vibes-based assessment works for demos but fails at scale—systematic evaluation with MLflow provides the rigor necessary for reliable, high-quality RAG deployments.
Start with basic metrics (precision, faithfulness, relevance), build a modest golden test set, and automate evaluation in CI/CD. As your system matures, expand metrics to cover edge cases, segment performance by user groups, and implement continuous evaluation of production traffic.
RAG systems are complex with many configuration options—chunk size, embedding models, retrieval strategies, prompt templates, and reranking approaches. Only systematic evaluation reveals which combinations work best for your specific use case and data. The investment in evaluation infrastructure pays dividends through fewer production issues, lower costs, and higher user satisfaction.
For teams serious about production RAG systems, evaluation isn't optional—it's the foundation of quality. Build it early, iterate often, and let data guide your optimization decisions. For more insights on MLOps and RAG best practices, visit our blog.
Jalal Ahmed Khan
Microsoft Certified Trainer (MCT) · Founder, Gennoor Tech
14+ years in enterprise AI and cloud technologies. Delivered AI transformation programs for Fortune 500 companies across 6 countries including Boeing, Aramco, HDFC Bank, and Siemens. Holds 16 active Microsoft certifications including Azure AI Engineer and Power BI Analyst.