
Evaluating RAG Pipelines with MLflow: A Practical Framework


By Gennoor Tech · February 24, 2026

Key Takeaway

A systematic RAG evaluation pipeline using MLflow tracks retrieval quality, answer faithfulness, and relevance across versions — catching quality regressions before users experience them.

Most teams building retrieval-augmented generation (RAG) systems start with vibes-based evaluation: "Does this answer look good?" While intuition helps during initial development, it fails catastrophically at scale. Production RAG systems answer thousands of queries daily across diverse topics—manual review is impossible, and subtle quality degradations go unnoticed until users complain. This guide presents a practical framework for rigorous RAG evaluation using MLflow.

Why Vibes-Based Evaluation Fails at Scale

The vibes approach has fatal flaws:

Cognitive Biases

Humans fall victim to confirmation bias (seeing what they expect), recency bias (over-weighting recent examples), and anchoring (being influenced by initial impressions). These biases make subjective evaluation unreliable, especially when comparing similar system versions.

Limited Coverage

Manual review typically covers 10-50 examples. Production systems handle millions of queries spanning edge cases you never imagined. Your sample is almost certainly not representative.

Inconsistency

Different reviewers apply different standards. The same reviewer applies different standards on different days. Inter-rater reliability studies consistently show poor agreement on subjective quality assessments.

Inability to Detect Regressions

When you modify chunk sizes, change embedding models, or update prompts, vibes evaluation can't reliably detect whether performance improved or degraded. You're flying blind.

No Cost Visibility

Manual review doesn't capture token usage, API costs, or latency. You might achieve slightly better quality at 10x the cost—a terrible trade-off that vibes evaluation won't reveal.

Systematic evaluation with MLflow solves these problems by combining automated metrics, large test sets, and reproducible experiments.

Comprehensive RAG Evaluation Taxonomy

RAG evaluation breaks down into three categories, each measuring different aspects of system quality:

Retrieval Quality

Did the system find the right documents? Metrics include precision (what percentage of retrieved docs are relevant), recall (what percentage of relevant docs were retrieved), and ranking quality (are the best docs at the top).

Generation Quality

Given the retrieved context, did the LLM produce a good answer? Metrics include faithfulness (does the answer follow from the context), relevance (does it address the question), coherence (is it well-written), and safety (is it non-toxic and appropriate).

End-to-End Quality

From the user's perspective, was the overall experience successful? Metrics include answer correctness, user satisfaction, task completion rate, and whether the system knew when to say "I don't know."

Comprehensive evaluation requires metrics from all three categories. Optimizing only retrieval or only generation leads to sub-optimal systems.

[Figure: RAG evaluation pipeline. Metrics at every stage (retrieval precision/recall, generation faithfulness, answer correctness) feed into the MLflow dashboard for tracking.]

Retrieval Metrics Deep Dive

Precision and Recall

These fundamental metrics require ground truth labels indicating which documents are relevant to each query:

  • Precision@K: Of the K documents retrieved, what percentage are relevant?
  • Recall@K: Of all relevant documents in the corpus, what percentage appear in the top K results?

Example: For query "What is the loan approval process?", if the system retrieves 5 documents and 3 are relevant, and there are 4 relevant documents total in the corpus:

Precision@5 = 3/5 = 0.60
Recall@5 = 3/4 = 0.75

High precision means users aren't overwhelmed with irrelevant results. High recall means the system finds all the information it needs to answer comprehensively.

Mean Reciprocal Rank (MRR)

MRR measures how quickly users find relevant results:

MRR = average(1 / rank_of_first_relevant_document)

If the first relevant document appears in position 1, the reciprocal rank is 1.0. Position 2 yields 0.5, position 3 yields 0.33, and so on. Higher MRR means users find what they need faster.

Normalized Discounted Cumulative Gain (NDCG)

NDCG is a sophisticated ranking metric that accounts for graded relevance (some documents are more relevant than others) and position (earlier documents matter more):

DCG@K = sum(relevance_score[i] / log2(i + 1) for i in 1 to K)
NDCG@K = DCG@K / ideal_DCG@K

NDCG ranges from 0 to 1, with 1 indicating perfect ranking. It's particularly useful when you have nuanced relevance labels (e.g., "highly relevant", "somewhat relevant", "not relevant") rather than binary labels.
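The formulas above translate to a few lines of plain Python. This is a minimal sketch for one query: for simplicity the ideal DCG is computed over the relevance labels of the retrieved list only, whereas a full implementation would rank all relevant documents in the corpus:

```python
import math

def ndcg_at_k(relevance_scores, k):
    """NDCG@K for one query, given graded relevance labels of the
    retrieved documents in ranked order (e.g. 2=highly, 1=somewhat, 0=not)."""
    def dcg(scores):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(scores, start=1))

    actual = dcg(relevance_scores[:k])
    ideal = dcg(sorted(relevance_scores, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; the best doc buried at rank 3 scores lower
print(ndcg_at_k([2, 1, 0], k=3))  # 1.0
print(ndcg_at_k([0, 1, 2], k=3))
```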

Hit Rate

Simple but effective: what percentage of queries have at least one relevant document in the top K results?

Hit_Rate@K = (number of queries with ≥1 relevant doc in top K) / (total queries)

Hit rate is easier to calculate than precision/recall because you only need to identify if any relevant documents were retrieved, not label every document.
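The formula translates directly; a minimal sketch, assuming each query comes with its ranked list of retrieved document IDs and a set of relevant IDs:

```python
def hit_rate_at_k(retrieved_per_query, relevant_per_query, k=5):
    """Fraction of queries with at least one relevant doc in the top K."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_per_query, relevant_per_query)
        if any(doc in relevant for doc in retrieved[:k])
    )
    return hits / len(retrieved_per_query)

retrieved = [["d1", "d2", "d3"], ["d9", "d8", "d7"]]
relevant = [{"d2"}, {"d4"}]
print(hit_rate_at_k(retrieved, relevant, k=3))  # 0.5: only the first query hits
```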

Implementation in Code

from mlflow.metrics import MetricValue, make_metric

def calculate_retrieval_metrics(predictions, targets):
    """Calculate precision@5, recall@5, and MRR for retrieved documents."""
    results = {"precision_at_5": [], "recall_at_5": [], "mrr": []}

    for pred, target in zip(predictions, targets):
        retrieved_doc_ids = pred["retrieved_docs"][:5]
        relevant_doc_ids = target["relevant_docs"]

        # Precision@5: fraction of retrieved docs that are relevant
        relevant_in_retrieved = sum(1 for doc in retrieved_doc_ids if doc in relevant_doc_ids)
        precision = relevant_in_retrieved / len(retrieved_doc_ids) if retrieved_doc_ids else 0
        results["precision_at_5"].append(precision)

        # Recall@5: fraction of all relevant docs that were retrieved
        recall = relevant_in_retrieved / len(relevant_doc_ids) if relevant_doc_ids else 0
        results["recall_at_5"].append(recall)

        # MRR: reciprocal rank of the first relevant document
        reciprocal_rank = 0
        for i, doc in enumerate(retrieved_doc_ids, 1):
            if doc in relevant_doc_ids:
                reciprocal_rank = 1 / i
                break
        results["mrr"].append(reciprocal_rank)

    # Average each metric across the evaluation set
    return {name: sum(vals) / len(vals) for name, vals in results.items()}

# make_metric expects an eval_fn that returns an mlflow.metrics.MetricValue,
# so wrap the aggregate dict accordingly
def retrieval_eval_fn(predictions, targets, metrics):
    aggregates = calculate_retrieval_metrics(predictions, targets)
    return MetricValue(aggregate_results=aggregates)

retrieval_metric = make_metric(
    eval_fn=retrieval_eval_fn,
    greater_is_better=True,
    name="retrieval_quality"
)

Generation Metrics Deep Dive

Faithfulness

Faithfulness measures whether the generated answer is supported by the retrieved context. This is critical for preventing hallucinations—the LLM should only make claims that can be verified in the source documents.

Implement faithfulness evaluation using an LLM-as-judge approach:

from mlflow.metrics.genai import faithfulness

faithfulness_metric = faithfulness(model="openai:/gpt-4")

The judge model receives the retrieved context, the generated answer, and a prompt asking "Is every claim in the answer supported by the context?" It returns a score from 0 to 1.

Relevance

Relevance measures whether the answer actually addresses the user's question. An answer can be faithful to the context but still irrelevant if the wrong documents were retrieved.

from mlflow.metrics.genai import answer_relevance

relevance_metric = answer_relevance(model="openai:/gpt-4")

Coherence

Coherence assesses whether the answer is well-written, logically structured, and easy to understand. Poor coherence manifests as awkward phrasing, contradictions, or disjointed flow.

Build a custom coherence metric:

from openai import OpenAI

def evaluate_coherence(predictions, targets, metrics):
    """Evaluate answer coherence using GPT-4 as judge"""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    scores = []
    for pred in predictions:
        judge_prompt = f"""Rate the coherence of this answer on a scale of 1-5:

Answer: {pred}

Consider: Is it well-structured? Does it flow logically? Is it easy to understand?
Respond with only a number from 1 to 5."""

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": judge_prompt}]
        )
        score = int(response.choices[0].message.content.strip())
        scores.append(score / 5)  # Normalize to 0-1

    return {"coherence": sum(scores) / len(scores)}

Toxicity

Toxicity detection identifies harmful, offensive, or inappropriate content. MLflow ships a classifier-based toxicity metric; services like Perspective API or Azure Content Safety are alternatives:

from mlflow.metrics import toxicity

# The built-in metric scores text with a pretrained toxicity
# classifier; it does not take a judge-model argument
toxicity_metric = toxicity()

Set strict thresholds—even a small percentage of toxic responses can cause significant reputational damage.

Conciseness

Some domains prioritize brief answers, while others require comprehensive explanations. Measure answer length and penalize verbosity when appropriate:

def evaluate_conciseness(predictions, targets, metrics):
    """Penalize unnecessarily long answers"""
    scores = []
    target_length = 150 # words
    
    for pred in predictions:
        word_count = len(pred.split())
        if word_count <= target_length:
            score = 1.0
        else:
            # Penalty for excess words
            score = max(0, 1 - (word_count - target_length) / target_length)
        scores.append(score)
    
    return {"conciseness": sum(scores) / len(scores)}

Building Golden Test Sets

Automated metrics are only as good as the test data they evaluate against. Golden test sets are curated collections of query-answer pairs with ground truth labels.

Strategies for Test Set Creation

1. Mine Production Logs

Identify frequently asked questions from actual users. This ensures your test set reflects real usage patterns rather than what developers imagine users will ask.

2. Synthetic Data Generation

Use LLMs to generate diverse questions from your document corpus:

# Sketch assuming an OpenAI client; swap in your own LLM call
from openai import OpenAI

client = OpenAI()
test_set = []
for document in corpus:
    prompt = f"""Generate 5 questions that could be answered using this document:

{document}

Questions should vary in complexity and specificity."""
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    questions = response.choices[0].message.content.splitlines()
    # Add to test set with the document as ground-truth context
    test_set.extend({"question": q, "context": document} for q in questions if q.strip())

3. Subject Matter Expert (SME) Curation

Have domain experts write questions they expect users to ask, along with correct answers and relevant documents. This is time-consuming but produces high-quality test data for critical scenarios.

4. Edge Case Collection

Actively seek out challenging scenarios:

  • Ambiguous questions that could have multiple interpretations
  • Questions requiring information from multiple documents
  • Questions with no good answer in the corpus
  • Questions with contradictory information across documents
  • Domain-specific jargon or technical terminology

Test Set Size Recommendations

  • Minimum viable: 50-100 examples for initial development
  • Development set: 200-500 examples for hyperparameter tuning and prompt engineering
  • Test set: 500-1000 examples for final evaluation and confidence before production
  • Continuous evaluation: 50-100 new examples per month from production traffic

Test Set Maintenance

Test sets decay over time as your document corpus evolves, user needs change, and edge cases emerge. Schedule quarterly reviews:

  1. Remove outdated questions whose answers have changed
  2. Add new questions covering recent edge cases from production
  3. Update ground truth labels based on document updates
  4. Rebalance categories to ensure representative coverage

Store test sets in version control (Git) with clear change logs. Tag test set versions with the document corpus versions they correspond to.
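One lightweight way to make that pairing auditable is to log a content hash of the test set on every evaluation run. The helper below is a minimal sketch; the file path and tag names are illustrative:

```python
import hashlib

def file_sha256(path):
    """Content hash that pins down the exact test-set version used in a run."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Inside an MLflow run (hypothetical file and tag names):
# mlflow.set_tag("test_set_sha256", file_sha256("test_set.csv"))
# mlflow.set_tag("corpus_version", "2026-02-docs")
```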


For guidance on building test sets tailored to your domain, our MLOps training programs include hands-on workshops.

MLflow Evaluate Configuration and Code Patterns

Complete Evaluation Pipeline

import mlflow
import pandas as pd
from mlflow.metrics.genai import faithfulness, answer_relevance

# Load test set
test_data = pd.read_csv("test_set.csv")

# Define evaluation data format
eval_data = pd.DataFrame({
    "inputs": test_data["question"],
    "ground_truth": test_data["expected_answer"],
    "context": test_data["relevant_documents"]
})

# Define your RAG model as a function (retriever and generator
# stand in for your own retrieval and generation components)
def rag_model(inputs):
    results = []
    for question in inputs["inputs"]:
        # Retrieve documents
        docs = retriever.retrieve(question)
        # Generate answer
        answer = generator.generate(question, docs)
        results.append(answer)
    return results

# Run evaluation
with mlflow.start_run():
    # Log configuration
    mlflow.log_param("chunk_size", 512)
    mlflow.log_param("embedding_model", "text-embedding-ada-002")
    mlflow.log_param("llm_model", "gpt-4")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("top_k_docs", 5)
    
    # Evaluate
    results = mlflow.evaluate(
        model=rag_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[
            faithfulness(model="openai:/gpt-4"),
            answer_relevance(model="openai:/gpt-4"),
            retrieval_metric,
            toxicity_metric
        ]
    )
    
    # Log aggregate metrics
    mlflow.log_metrics(results.metrics)
    
    # Save detailed results
    results.tables["eval_results_table"].to_csv("detailed_results.csv")
    mlflow.log_artifact("detailed_results.csv")

Comparing Multiple Configurations

configurations = [
    {"chunk_size": 256, "top_k": 3},
    {"chunk_size": 512, "top_k": 5},
    {"chunk_size": 1024, "top_k": 7}
]

for config in configurations:
    with mlflow.start_run(run_name=f"chunk_{config['chunk_size']}_k_{config['top_k']}"):
        # Configure system
        rag_system.set_chunk_size(config["chunk_size"])
        rag_system.set_top_k(config["top_k"])
        
        # Log parameters
        mlflow.log_params(config)
        
        # Evaluate
        results = mlflow.evaluate(...)
        mlflow.log_metrics(results.metrics)

Use the MLflow UI to compare runs side-by-side and identify the best configuration.
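Run comparison can also be scripted. The helper below picks the best run from the DataFrame that `mlflow.search_runs` returns; the sample frame stands in for real results, following MLflow's `metrics.<name>` / `params.<name>` column convention:

```python
import pandas as pd

def pick_best_run(runs: pd.DataFrame, metric: str = "metrics.faithfulness"):
    """Return the row of the run with the highest value of `metric`."""
    if runs.empty:
        return None
    return runs.sort_values(metric, ascending=False).iloc[0]

# In a real project: runs = mlflow.search_runs(experiment_names=["rag-eval"])
runs = pd.DataFrame({
    "run_id": ["a", "b"],
    "metrics.faithfulness": [0.81, 0.92],
    "params.chunk_size": ["256", "512"],
})
best = pick_best_run(runs)
print(best["run_id"], best["params.chunk_size"])  # b 512
```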

Automated Evaluation Pipelines with CI/CD

Integrate evaluation into your development workflow to catch regressions before production deployment.

GitHub Actions Example

name: RAG Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py
      - name: Check metrics threshold
        run: python check_thresholds.py

Quality Gates

Define minimum acceptable metrics and fail the CI pipeline if they're not met:

import sys
import mlflow

client = mlflow.tracking.MlflowClient()
run_id = sys.argv[1]  # run ID produced by the evaluation step
run = client.get_run(run_id)

thresholds = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "precision_at_5": 0.70,
    "toxicity": 0.05 # Lower is better
}

failures = []
for metric, threshold in thresholds.items():
    value = run.data.metrics.get(metric, 0)
    if metric == "toxicity":
        if value > threshold:
            failures.append(f"{metric}: {value} > {threshold}")
    else:
        if value < threshold:
            failures.append(f"{metric}: {value} < {threshold}")

if failures:
    print("Quality gates failed:")
    for failure in failures:
        print(f" - {failure}")
    exit(1)

A/B Testing RAG Configurations

Once in production, use A/B testing to validate improvements with real users:

  1. Deploy challenger variant: Deploy new RAG configuration alongside existing "champion" version
  2. Route traffic: Send 10% of traffic to challenger, 90% to champion
  3. Collect metrics: Track quality metrics, user feedback, cost, and latency for both variants
  4. Statistical analysis: Determine if observed differences are statistically significant
  5. Promote or rollback: If challenger performs better, gradually increase traffic; if worse, remove it

Use feature flags (Azure App Configuration, LaunchDarkly) to control routing without code deployments.
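The routing and significance steps above can be sketched in plain Python. This is a minimal illustration, not a substitute for a feature-flag service, and the thumbs-up counts below are made up:

```python
import hashlib
import math

def assign_variant(user_id, challenger_pct=10):
    """Deterministic traffic split: the same user always sees the same variant."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic comparing e.g. thumbs-up rates of champion vs challenger."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 680/1000 thumbs-up for champion vs 87/100 for challenger (made-up numbers)
z = two_proportion_z(680, 1000, 87, 100)
print(abs(z) > 1.96)  # True: significant at the 5% level
```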

Chunking Strategy Evaluation

Chunk size dramatically impacts RAG performance. Evaluate multiple strategies:

Fixed-Size Chunking

chunk_sizes = [128, 256, 512, 1024]
for size in chunk_sizes:
    with mlflow.start_run(run_name=f"fixed_chunk_{size}"):
        chunker = FixedSizeChunker(size=size, overlap=50)
        # Rebuild index and evaluate

Semantic Chunking

Chunk at natural boundaries (sentences, paragraphs, sections) rather than fixed character counts:

strategies = ["sentence", "paragraph", "section"]
for strategy in strategies:
    with mlflow.start_run(run_name=f"semantic_chunk_{strategy}"):
        chunker = SemanticChunker(strategy=strategy)
        # Rebuild index and evaluate

Hybrid Approaches

Combine semantic boundaries with maximum size constraints to prevent overly large chunks.
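A minimal sketch of this hybrid idea, assuming paragraph-delimited text and a character-based size cap (a production chunker would also handle overlap and sentence boundaries):

```python
def hybrid_chunks(text, max_chars=1000):
    """Split on paragraph boundaries, falling back to fixed-size
    splits only for paragraphs that exceed the size cap."""
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
        else:
            chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
    return chunks

doc = "Intro paragraph.\n\n" + "x" * 2500
print([len(c) for c in hybrid_chunks(doc)])  # [16, 1000, 1000, 500]
```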

Embedding Model Comparison Framework

Different embedding models trade off quality, speed, and cost:

embedding_models = [
    {"name": "text-embedding-ada-002", "dim": 1536, "cost": "low"},
    {"name": "text-embedding-3-small", "dim": 1536, "cost": "low"},
    {"name": "text-embedding-3-large", "dim": 3072, "cost": "medium"},
    {"name": "sentence-transformers/all-mpnet-base-v2", "dim": 768, "cost": "zero"}
]

for model in embedding_models:
    with mlflow.start_run(run_name=f"embedding_{model['name']}"):
        # Rebuild vector store with new embeddings
        retriever = rebuild_index(embedding_model=model["name"])
        
        # Evaluate retrieval quality
        results = mlflow.evaluate(...)
        
        # Log model characteristics
        mlflow.log_param("embedding_dim", model["dim"])
        mlflow.log_param("cost_tier", model["cost"])

Often, cheaper embedding models perform nearly as well as expensive ones for domain-specific corpora. Always benchmark before choosing.

Prompt Template Evaluation

Systematically test prompt variations:

prompt_templates = [
    """Answer this question based on the context:
Question: {question}
Context: {context}""",

    """You are a helpful assistant. Use only the provided context to answer the question. If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context: {context}

Question: {question}""",

    """Based on the following context, provide a concise answer to the question. Cite specific facts from the context.

Context: {context}

Question: {question}

Answer:"""
]

for i, template in enumerate(prompt_templates):
    with mlflow.start_run(run_name=f"prompt_variant_{i}"):
        rag_system.set_prompt_template(template)
        results = mlflow.evaluate(...)
        mlflow.log_text(template, "prompt_template.txt")

Small prompt changes can yield significant quality differences. Test extensively.

Handling Edge Cases

No-Answer Scenarios

When the corpus doesn't contain relevant information, the system should say "I don't know" rather than hallucinate:

def evaluate_no_answer_handling(predictions, targets, metrics):
    """Check if system appropriately refuses to answer when it shouldn't"""
    scores = []
    no_answer_phrases = ["I don't have", "I don't know", "insufficient information", "I cannot answer"]
    
    for pred, target in zip(predictions, targets):
        if target["should_refuse"]:
            # Should say "I don't know"
            refused = any(phrase in pred.lower() for phrase in no_answer_phrases)
            scores.append(1.0 if refused else 0.0)
        else:
            # Should provide answer
            provided_answer = not any(phrase in pred.lower() for phrase in no_answer_phrases)
            scores.append(1.0 if provided_answer else 0.0)
    
    return {"no_answer_accuracy": sum(scores) / len(scores)}

Multi-Document Synthesis

Some questions require synthesizing information from multiple documents. Evaluate whether the system successfully combines evidence:

def evaluate_multi_doc_synthesis(predictions, targets, metrics):
    """Check if answer incorporates information from multiple source documents"""
    scores = []
    
    for pred, target in zip(predictions, targets):
        if target["requires_multi_doc"]:
            # Check if answer mentions facts from multiple ground truth docs
            docs_referenced = sum(
                1 for doc in target["source_docs"]
                if any(fact in pred for fact in doc["key_facts"])
            )
            score = min(1.0, docs_referenced / len(target["source_docs"]))
            scores.append(score)
    
    return {"multi_doc_synthesis": sum(scores) / len(scores) if scores else 1.0}

Contradictory Sources

When documents contradict each other, the system should acknowledge the contradiction rather than picking one arbitrarily:

  • Test set includes queries where documents provide conflicting information
  • Ground truth answer acknowledges both perspectives
  • Evaluate whether the generated answer identifies the contradiction
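A simple heuristic evaluator for this scenario might look for hedging language in the answer. The phrase list and the `has_contradiction` field are illustrative; an LLM judge would be more robust:

```python
def evaluate_contradiction_handling(predictions, targets):
    """Heuristic: on queries with conflicting sources, did the answer
    acknowledge the disagreement?"""
    hedging_phrases = ["conflict", "contradict", "disagree", "differ",
                       "on the other hand", "however, another"]
    scores = []
    for pred, target in zip(predictions, targets):
        if target.get("has_contradiction"):
            acknowledged = any(p in pred.lower() for p in hedging_phrases)
            scores.append(1.0 if acknowledged else 0.0)
    return {"contradiction_acknowledged": sum(scores) / len(scores) if scores else 1.0}

preds = ["The two policies contradict each other on remote work.", "Yes, it is allowed."]
targets = [{"has_contradiction": True}, {"has_contradiction": True}]
print(evaluate_contradiction_handling(preds, targets))  # 0.5: one of two acknowledged
```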

Real-World Case Study: RAG Evaluation at Scale

A financial services company built a RAG system for internal policy Q&A, serving 5,000 employees across legal, compliance, HR, and operations departments.

Initial Challenges

  • Manual review of 20 questions couldn't catch regressions
  • Employees reported inconsistent answer quality
  • No visibility into which departments or query types had poor experiences
  • Prompt engineering changes caused unexpected quality degradation

Evaluation Framework Implementation

They built a comprehensive framework using MLflow:

  1. Golden Test Set: 800 questions curated from 6 months of production logs, labeled by SMEs with ground truth answers and relevant policies
  2. Automated Pipeline: GitHub Actions running evaluation on every pull request, blocking merges if quality gates fail
  3. Segmented Metrics: Tracked performance separately for each department, query complexity level, and document type
  4. Cost Tracking: Logged token usage and costs per evaluation run to prevent ballooning expenses
  5. Human-in-the-Loop: Weekly review of low-scoring examples to identify systematic issues

Results

  • Faithfulness score improved from 0.72 to 0.91 over 3 months
  • Caught 12 regressions before production deployment
  • Identified that legal queries needed different chunk sizes than HR queries
  • Reduced LLM costs by 35% by using cheaper models for simple queries
  • Employee satisfaction (measured by thumbs up/down) improved from 68% to 87%

The evaluation framework paid for itself within 6 weeks by preventing production issues and optimizing infrastructure costs. For similar success in your organization, explore our MLOps consulting and training services.

Common Pitfalls and How to Avoid Them

Pitfall 1: Test Set Contamination

Accidentally including test data in training or optimization leads to overfitting. Keep test sets completely separate and only evaluate against them for final validation, not during development.

Pitfall 2: Metric Selection Bias

Optimizing for easy-to-measure metrics (like retrieval precision) while ignoring harder-to-measure ones (like user satisfaction) produces systems that perform well on paper but poorly in practice. Use a balanced scorecard of metrics.

Pitfall 3: Ignoring Latency

Achieving 98% quality with 10-second latency is usually worse than 92% quality with 2-second latency. Always measure and log latency alongside quality metrics.
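Latency tracking during evaluation can be a few lines. In this sketch, `answer_query` is a hypothetical stand-in for a real RAG call; in practice the percentiles would be logged with `mlflow.log_metric` alongside the quality metrics:

```python
import statistics
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def percentile(values, pct):
    """Nearest-rank percentile of a list of latencies."""
    ranked = sorted(values)
    idx = min(len(ranked) - 1, int(pct / 100 * len(ranked)))
    return ranked[idx]

# Hypothetical stand-in for a real RAG pipeline call
def answer_query(q):
    time.sleep(0.01)
    return f"answer to {q}"

latencies = [timed(answer_query, q)[1] for q in ["q1", "q2", "q3"]]
print(percentile(latencies, 95) >= percentile(latencies, 50))  # True
# In a real run: mlflow.log_metric("latency_p95", percentile(latencies, 95))
```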

Pitfall 4: Static Test Sets

As your system improves, existing test sets become too easy and stop providing signal. Continuously add new edge cases from production.

Pitfall 5: Lack of Baseline

Establishing a baseline before optimization attempts is critical. Without it, you can't tell if changes improved performance or just shifted it around.

Pitfall 6: Over-Reliance on LLM Judges

LLM-as-judge metrics are powerful but not perfect. They can miss subtle issues, have biases, and cost money to run. Complement them with heuristic metrics and human review of edge cases.

Conclusion

RAG evaluation separates production-grade systems from prototypes. Vibes-based assessment works for demos but fails at scale—systematic evaluation with MLflow provides the rigor necessary for reliable, high-quality RAG deployments.

Start with basic metrics (precision, faithfulness, relevance), build a modest golden test set, and automate evaluation in CI/CD. As your system matures, expand metrics to cover edge cases, segment performance by user groups, and implement continuous evaluation of production traffic.

RAG systems are complex with many configuration options—chunk size, embedding models, retrieval strategies, prompt templates, and reranking approaches. Only systematic evaluation reveals which combinations work best for your specific use case and data. The investment in evaluation infrastructure pays dividends through fewer production issues, lower costs, and higher user satisfaction.

For teams serious about production RAG systems, evaluation isn't optional—it's the foundation of quality. Build it early, iterate often, and let data guide your optimization decisions. For more insights on MLOps and RAG best practices, visit our blog.

Tags: MLflow · RAG · AI Evaluation · MLOps · Data Quality

Jalal Ahmed Khan

Microsoft Certified Trainer (MCT) · Founder, Gennoor Tech

14+ years in enterprise AI and cloud technologies. Delivered AI transformation programs for Fortune 500 companies across 6 countries including Boeing, Aramco, HDFC Bank, and Siemens. Holds 16 active Microsoft certifications including Azure AI Engineer and Power BI Analyst.
