Evaluating RAG Pipelines with MLflow: A Practical Framework
By Gennoor Tech · February 24, 2026
A systematic RAG evaluation pipeline using MLflow tracks retrieval quality, answer faithfulness, and relevance across versions — catching quality regressions before users experience them.
Most teams building retrieval-augmented generation (RAG) systems start with vibes-based evaluation: "Does this answer look good?" While intuition helps during initial development, it fails catastrophically at scale. Production RAG systems answer thousands of queries daily across diverse topics—manual review is impossible, and subtle quality degradations go unnoticed until users complain. This guide presents a practical framework for rigorous RAG evaluation using MLflow.
Why Vibes-Based Evaluation Fails at Scale
The vibes approach has fatal flaws:
Cognitive Biases
Humans fall victim to confirmation bias (seeing what they expect), recency bias (over-weighting recent examples), and anchoring (being influenced by initial impressions). These biases make subjective evaluation unreliable, especially when comparing similar system versions.
Limited Coverage
Manual review typically covers 10-50 examples. Production systems handle millions of queries spanning edge cases you never imagined. Your sample is almost certainly not representative.
Inconsistency
Different reviewers apply different standards. The same reviewer applies different standards on different days. Inter-rater reliability studies consistently show poor agreement on subjective quality assessments.
Inability to Detect Regressions
When you modify chunk sizes, change embedding models, or update prompts, vibes evaluation can't reliably detect whether performance improved or degraded. You're flying blind.
No Cost Visibility
Manual review doesn't capture token usage, API costs, or latency. You might achieve slightly better quality at 10x the cost—a terrible trade-off that vibes evaluation won't reveal.
Systematic evaluation with MLflow solves these problems by combining automated metrics, large test sets, and reproducible experiments.
Comprehensive RAG Evaluation Taxonomy
RAG evaluation breaks down into three categories, each measuring different aspects of system quality:
Retrieval Quality
Did the system find the right documents? Metrics include precision (what percentage of retrieved docs are relevant), recall (what percentage of relevant docs were retrieved), and ranking quality (are the best docs at the top).
Generation Quality
Given the retrieved context, did the LLM produce a good answer? Metrics include faithfulness (does the answer follow from the context), relevance (does it address the question), coherence (is it well-written), and safety (is it non-toxic and appropriate).
End-to-End Quality
From the user's perspective, was the overall experience successful? Metrics include answer correctness, user satisfaction, task completion rate, and whether the system knew when to say "I don't know."
Comprehensive evaluation requires metrics from all three categories. Optimizing only retrieval or only generation leads to sub-optimal systems.
Retrieval Metrics Deep Dive
Precision and Recall
These fundamental metrics require ground truth labels indicating which documents are relevant to each query:
- Precision@K: Of the K documents retrieved, what percentage are relevant?
- Recall@K: Of all relevant documents in the corpus, what percentage appear in the top K results?
Example: For query "What is the loan approval process?", if the system retrieves 5 documents and 3 are relevant, and there are 4 relevant documents total in the corpus:
Precision@5 = 3/5 = 0.60
Recall@5 = 3/4 = 0.75
High precision means users aren't overwhelmed with irrelevant results. High recall means the system finds all the information it needs to answer comprehensively.
Mean Reciprocal Rank (MRR)
MRR measures how quickly users find relevant results:
MRR = average(1 / rank_of_first_relevant_document)
If the first relevant document appears in position 1, the reciprocal rank is 1.0. Position 2 yields 0.5, position 3 yields 0.33, and so on. Higher MRR means users find what they need faster.
Normalized Discounted Cumulative Gain (NDCG)
NDCG is a sophisticated ranking metric that accounts for graded relevance (some documents are more relevant than others) and position (earlier documents matter more):
DCG@K = sum(relevance_score[i] / log2(i + 1) for i in 1 to K)
NDCG@K = DCG@K / ideal_DCG@K
NDCG ranges from 0 to 1, with 1 indicating perfect ranking. It's particularly useful when you have nuanced relevance labels (e.g., "highly relevant", "somewhat relevant", "not relevant") rather than binary labels.
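The two formulas above translate directly into a small helper. This is a minimal sketch, assuming you have graded relevance scores for the retrieved documents in retrieval order:

```python
import math

def dcg_at_k(relevance_scores, k):
    """Discounted cumulative gain over the top-k results (positions are 1-indexed)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance_scores[:k], start=1))

def ndcg_at_k(relevance_scores, k):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return dcg_at_k(relevance_scores, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered result list scores exactly 1.0 because its DCG equals the ideal DCG; any relevant document pushed down the ranking lowers the score.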
Hit Rate
Simple but effective: what percentage of queries have at least one relevant document in the top K results?
Hit_Rate@K = (number of queries with ≥1 relevant doc in top K) / (total queries)
Hit rate is easier to calculate than precision/recall because you only need to identify if any relevant documents were retrieved, not label every document.
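The hit-rate formula is a one-liner in practice. A quick sketch (the document IDs and relevance sets here are illustrative):

```python
def hit_rate_at_k(retrieved_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_lists, relevant_sets)
        if any(doc in relevant for doc in retrieved[:k])
    )
    return hits / len(retrieved_lists) if retrieved_lists else 0.0
```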
Implementation in Code
from mlflow.metrics import make_metric

def calculate_retrieval_metrics(predictions, targets, metrics):
    """Calculate precision, recall, and MRR for retrieved documents."""
    results = {
        "precision_at_5": [],
        "recall_at_5": [],
        "mrr": [],
    }
    for pred, target in zip(predictions, targets):
        retrieved_doc_ids = pred["retrieved_docs"][:5]
        relevant_doc_ids = target["relevant_docs"]

        # Precision@5
        relevant_in_retrieved = sum(1 for doc in retrieved_doc_ids if doc in relevant_doc_ids)
        precision = relevant_in_retrieved / len(retrieved_doc_ids) if retrieved_doc_ids else 0
        results["precision_at_5"].append(precision)

        # Recall@5
        recall = relevant_in_retrieved / len(relevant_doc_ids) if relevant_doc_ids else 0
        results["recall_at_5"].append(recall)

        # MRR
        reciprocal_rank = 0
        for i, doc in enumerate(retrieved_doc_ids, 1):
            if doc in relevant_doc_ids:
                reciprocal_rank = 1 / i
                break
        results["mrr"].append(reciprocal_rank)

    # Note: recent MLflow versions expect eval_fn to return mlflow.metrics.MetricValue;
    # wrap this dict in MetricValue(aggregate_results=...) if your version requires it.
    return {
        "precision_at_5": sum(results["precision_at_5"]) / len(results["precision_at_5"]),
        "recall_at_5": sum(results["recall_at_5"]) / len(results["recall_at_5"]),
        "mrr": sum(results["mrr"]) / len(results["mrr"]),
    }

retrieval_metric = make_metric(
    eval_fn=calculate_retrieval_metrics,
    greater_is_better=True,
    name="retrieval_quality",
)
Generation Metrics Deep Dive
Faithfulness
Faithfulness measures whether the generated answer is supported by the retrieved context. This is critical for preventing hallucinations—the LLM should only make claims that can be verified in the source documents.
Implement faithfulness evaluation using an LLM-as-judge approach:
from mlflow.metrics.genai import faithfulness

faithfulness_metric = faithfulness(model="openai:/gpt-4")
The judge model receives the retrieved context, the generated answer, and a prompt asking "Is every claim in the answer supported by the context?" It returns a score from 0 to 1.
Relevance
Relevance measures whether the answer actually addresses the user's question. An answer can be faithful to the context but still irrelevant if the wrong documents were retrieved.
from mlflow.metrics.genai import answer_relevance

relevance_metric = answer_relevance(model="openai:/gpt-4")
Coherence
Coherence assesses whether the answer is well-written, logically structured, and easy to understand. Poor coherence manifests as awkward phrasing, contradictions, or disjointed flow.
Build a custom coherence metric:
def evaluate_coherence(predictions, targets, metrics):
    """Evaluate answer coherence using GPT-4 as judge."""
    from openai import OpenAI  # openai>=1.0 client interface

    client = OpenAI()
    scores = []
    for pred in predictions:
        judge_prompt = f"""Rate the coherence of this answer on a scale of 1-5:

Answer: {pred}

Consider: Is it well-structured? Does it flow logically? Is it easy to understand?
Respond with only a number from 1 to 5."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": judge_prompt}],
        )
        score = int(response.choices[0].message.content.strip())
        scores.append(score / 5)  # Normalize to 0-1
    return {"coherence": sum(scores) / len(scores)}
Toxicity
Toxicity detection identifies harmful, offensive, or inappropriate content. Use specialized models like Perspective API or Azure Content Safety:
from mlflow.metrics import toxicity

# MLflow's built-in toxicity metric scores outputs with a pretrained
# classifier; it does not take a judge-model argument.
toxicity_metric = toxicity()
Set strict thresholds—even a small percentage of toxic responses can cause significant reputational damage.
Conciseness
Some domains prioritize brief answers, while others require comprehensive explanations. Measure answer length and penalize verbosity when appropriate:
def evaluate_conciseness(predictions, targets, metrics):
    """Penalize unnecessarily long answers."""
    scores = []
    target_length = 150  # words
    for pred in predictions:
        word_count = len(pred.split())
        if word_count <= target_length:
            score = 1.0
        else:
            # Penalty proportional to the excess words
            score = max(0, 1 - (word_count - target_length) / target_length)
        scores.append(score)
    return {"conciseness": sum(scores) / len(scores)}
Building Golden Test Sets
Automated metrics are only as good as the test data they evaluate against. Golden test sets are curated collections of query-answer pairs with ground truth labels.
Strategies for Test Set Creation
1. Mine Production Logs
Identify frequently asked questions from actual users. This ensures your test set reflects real usage patterns rather than what developers imagine users will ask.
2. Synthetic Data Generation
Use LLMs to generate diverse questions from your document corpus:
for document in corpus:
    prompt = f"""Generate 5 questions that could be answered using this document:

{document}

Questions should vary in complexity and specificity."""
    questions = llm.generate(prompt)
    # Add to test set with document as ground truth context
3. Subject Matter Expert (SME) Curation
Have domain experts write questions they expect users to ask, along with correct answers and relevant documents. This is time-consuming but produces high-quality test data for critical scenarios.
4. Edge Case Collection
Actively seek out challenging scenarios:
- Ambiguous questions that could have multiple interpretations
- Questions requiring information from multiple documents
- Questions with no good answer in the corpus
- Questions with contradictory information across documents
- Domain-specific jargon or technical terminology
Test Set Size Recommendations
- Minimum viable: 50-100 examples for initial development
- Development set: 200-500 examples for hyperparameter tuning and prompt engineering
- Test set: 500-1000 examples for final evaluation and confidence before production
- Continuous evaluation: 50-100 new examples per month from production traffic
Test Set Maintenance
Test sets decay over time as your document corpus evolves, user needs change, and edge cases emerge. Schedule quarterly reviews:
- Remove outdated questions whose answers have changed
- Add new questions covering recent edge cases from production
- Update ground truth labels based on document updates
- Rebalance categories to ensure representative coverage
Store test sets in version control (Git) with clear change logs. Tag test set versions with the document corpus versions they correspond to.
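One convenient on-disk format is JSONL, one record per question, committed to Git next to a change log. The field names below are illustrative assumptions, not a standard schema:

```python
import json

# One example record; field names are illustrative, not a fixed schema.
record = {
    "id": "q-0001",
    "question": "What is the loan approval process?",
    "expected_answer": "Applications are reviewed by the credit team within 5 business days.",
    "relevant_docs": ["policy-loans-v3#sec2", "policy-loans-v3#sec5"],
    "should_refuse": False,
    "corpus_version": "2024-06-docs",  # ties the labels to a corpus snapshot
    "category": "lending",
}

# Append to the JSONL file that lives in version control.
with open("golden_test_set.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

The `corpus_version` field makes the tagging advice above concrete: each record declares which document snapshot its labels were written against.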
For guidance on building test sets tailored to your domain, our MLOps training programs include hands-on workshops.
MLflow Evaluate Configuration and Code Patterns
Complete Evaluation Pipeline
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_relevance, faithfulness

# Load test set
test_data = pd.read_csv("test_set.csv")

# Define evaluation data format
eval_data = pd.DataFrame({
    "inputs": test_data["question"],
    "ground_truth": test_data["expected_answer"],
    "context": test_data["relevant_documents"],
})

# Define your RAG model as a function
def rag_model(inputs):
    results = []
    for question in inputs["inputs"]:
        # Retrieve documents
        docs = retriever.retrieve(question)
        # Generate answer
        answer = generator.generate(question, docs)
        results.append(answer)
    return results

# Run evaluation
with mlflow.start_run():
    # Log configuration
    mlflow.log_param("chunk_size", 512)
    mlflow.log_param("embedding_model", "text-embedding-ada-002")
    mlflow.log_param("llm_model", "gpt-4")
    mlflow.log_param("temperature", 0.7)
    mlflow.log_param("top_k_docs", 5)

    # Evaluate (the custom-metrics argument is named extra_metrics
    # in recent MLflow releases)
    results = mlflow.evaluate(
        model=rag_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[
            faithfulness(model="openai:/gpt-4"),
            answer_relevance(model="openai:/gpt-4"),
            retrieval_metric,
            toxicity_metric,
        ],
    )

    # Log aggregate metrics
    mlflow.log_metrics(results.metrics)

    # Save detailed results
    results.tables["eval_results_table"].to_csv("detailed_results.csv")
    mlflow.log_artifact("detailed_results.csv")
Comparing Multiple Configurations
configurations = [
    {"chunk_size": 256, "top_k": 3},
    {"chunk_size": 512, "top_k": 5},
    {"chunk_size": 1024, "top_k": 7},
]

for config in configurations:
    with mlflow.start_run(run_name=f"chunk_{config['chunk_size']}_k_{config['top_k']}"):
        # Configure system
        rag_system.set_chunk_size(config["chunk_size"])
        rag_system.set_top_k(config["top_k"])
        # Log parameters
        mlflow.log_params(config)
        # Evaluate
        results = mlflow.evaluate(...)
        mlflow.log_metrics(results.metrics)
Use the MLflow UI to compare runs side-by-side and identify the best configuration.
Automated Evaluation Pipelines with CI/CD
Integrate evaluation into your development workflow to catch regressions before production deployment.
GitHub Actions Example
name: RAG Evaluation

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py
      - name: Check metrics threshold
        run: python check_thresholds.py
Quality Gates
Define minimum acceptable metrics and fail the CI pipeline if they're not met:
import sys

import mlflow

client = mlflow.tracking.MlflowClient()
run = client.get_run(run_id)  # run_id of the evaluation run produced by evaluate_rag.py

thresholds = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "precision_at_5": 0.70,
    "toxicity": 0.05,  # Lower is better
}

failures = []
for metric, threshold in thresholds.items():
    value = run.data.metrics.get(metric, 0)
    if metric == "toxicity":
        if value > threshold:
            failures.append(f"{metric}: {value} > {threshold}")
    else:
        if value < threshold:
            failures.append(f"{metric}: {value} < {threshold}")

if failures:
    print("Quality gates failed:")
    for failure in failures:
        print(f"  - {failure}")
    sys.exit(1)
A/B Testing RAG Configurations
Once in production, use A/B testing to validate improvements with real users:
- Deploy challenger variant: Deploy new RAG configuration alongside existing "champion" version
- Route traffic: Send 10% of traffic to challenger, 90% to champion
- Collect metrics: Track quality metrics, user feedback, cost, and latency for both variants
- Statistical analysis: Determine if observed differences are statistically significant
- Promote or rollback: If challenger performs better, gradually increase traffic; if worse, remove it
Use feature flags (Azure App Configuration, LaunchDarkly) to control routing without code deployments.
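If a feature-flag service isn't available, the routing step can be sketched with deterministic hash-based bucketing. The function name and percentage are illustrative; the key property is that each user stays on one variant across requests:

```python
import hashlib

def assign_variant(user_id: str, challenger_pct: int = 10) -> str:
    """Deterministically route a user to the champion or challenger variant.

    Hashing the user ID into 100 buckets keeps assignments sticky per user,
    matching the 10%/90% split described above.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"
```

Sticky assignment matters for the statistical-analysis step: if a user bounced between variants, per-user metrics would mix both configurations.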
Chunking Strategy Evaluation
Chunk size dramatically impacts RAG performance. Evaluate multiple strategies:
Fixed-Size Chunking
chunk_sizes = [128, 256, 512, 1024]

for size in chunk_sizes:
    with mlflow.start_run(run_name=f"fixed_chunk_{size}"):
        chunker = FixedSizeChunker(size=size, overlap=50)
        # Rebuild index and evaluate
Semantic Chunking
Chunk at natural boundaries (sentences, paragraphs, sections) rather than fixed character counts:
strategies = ["sentence", "paragraph", "section"]

for strategy in strategies:
    with mlflow.start_run(run_name=f"semantic_chunk_{strategy}"):
        chunker = SemanticChunker(strategy=strategy)
        # Rebuild index and evaluate
Hybrid Approaches
Combine semantic boundaries with maximum size constraints to prevent overly large chunks.
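A hybrid chunker can be sketched as follows: split on paragraph boundaries, pack paragraphs greedily up to a size cap, and hard-split any single paragraph that exceeds the cap. The character limit and splitting rules are illustrative assumptions:

```python
def hybrid_chunks(text: str, max_chars: int = 1000):
    """Chunk on paragraph boundaries while enforcing a maximum chunk size."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(para) > max_chars:
            # Fallback: hard-split an oversized paragraph
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
        elif len(current) + len(para) + 2 <= max_chars:  # +2 for the "\n\n" joiner
            current = f"{current}\n\n{para}" if current else para
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk respects semantic boundaries where possible but never exceeds `max_chars`, so it can be evaluated in the same MLflow loop as the fixed-size and semantic strategies above.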
Embedding Model Comparison Framework
Different embedding models trade off quality, speed, and cost:
embedding_models = [
    {"name": "text-embedding-ada-002", "dim": 1536, "cost": "low"},
    {"name": "text-embedding-3-small", "dim": 1536, "cost": "low"},
    {"name": "text-embedding-3-large", "dim": 3072, "cost": "medium"},
    {"name": "sentence-transformers/all-mpnet-base-v2", "dim": 768, "cost": "zero"},
]

for model in embedding_models:
    with mlflow.start_run(run_name=f"embedding_{model['name']}"):
        # Rebuild vector store with new embeddings
        retriever = rebuild_index(embedding_model=model["name"])
        # Evaluate retrieval quality
        results = mlflow.evaluate(...)
        # Log model characteristics
        mlflow.log_param("embedding_dim", model["dim"])
        mlflow.log_param("cost_tier", model["cost"])
Often, cheaper embedding models perform nearly as well as expensive ones for domain-specific corpora. Always benchmark before choosing.
Prompt Template Evaluation
Systematically test prompt variations:
prompt_templates = [
    """Answer this question based on the context:

Question: {question}
Context: {context}""",
    """You are a helpful assistant. Use only the provided context to answer the question. If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context: {context}
Question: {question}""",
    """Based on the following context, provide a concise answer to the question. Cite specific facts from the context.

Context: {context}
Question: {question}
Answer:""",
]

for i, template in enumerate(prompt_templates):
    with mlflow.start_run(run_name=f"prompt_variant_{i}"):
        rag_system.set_prompt_template(template)
        results = mlflow.evaluate(...)
        mlflow.log_text(template, "prompt_template.txt")
Small prompt changes can yield significant quality differences. Test extensively.
Handling Edge Cases
No-Answer Scenarios
When the corpus doesn't contain relevant information, the system should say "I don't know" rather than hallucinate:
def evaluate_no_answer_handling(predictions, targets, metrics):
    """Check whether the system refuses to answer exactly when it should."""
    scores = []
    # Lowercased phrases, since predictions are lowercased before matching
    no_answer_phrases = ["i don't have", "i don't know", "insufficient information", "i cannot answer"]
    for pred, target in zip(predictions, targets):
        if target["should_refuse"]:
            # Should say "I don't know"
            refused = any(phrase in pred.lower() for phrase in no_answer_phrases)
            scores.append(1.0 if refused else 0.0)
        else:
            # Should provide an answer
            provided_answer = not any(phrase in pred.lower() for phrase in no_answer_phrases)
            scores.append(1.0 if provided_answer else 0.0)
    return {"no_answer_accuracy": sum(scores) / len(scores)}
Multi-Document Synthesis
Some questions require synthesizing information from multiple documents. Evaluate whether the system successfully combines evidence:
def evaluate_multi_doc_synthesis(predictions, targets, metrics):
    """Check if the answer incorporates information from multiple source documents."""
    scores = []
    for pred, target in zip(predictions, targets):
        if target["requires_multi_doc"]:
            # Count ground truth docs whose key facts appear in the answer
            docs_referenced = sum(
                1 for doc in target["source_docs"]
                if any(fact in pred for fact in doc["key_facts"])
            )
            score = min(1.0, docs_referenced / len(target["source_docs"]))
            scores.append(score)
    return {"multi_doc_synthesis": sum(scores) / len(scores) if scores else 1.0}
Contradictory Sources
When documents contradict each other, the system should acknowledge the contradiction rather than picking one arbitrarily:
- Test set includes queries where documents provide conflicting information
- Ground truth answer acknowledges both perspectives
- Evaluate whether the generated answer identifies the contradiction
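A simple heuristic check for this behavior can be sketched as below. The phrase list is an illustrative assumption, and an LLM-as-judge evaluation is more robust in practice, but a heuristic like this is cheap enough to run on every evaluation:

```python
def evaluate_contradiction_handling(predictions, targets):
    """Heuristic check: does the answer acknowledge conflicting sources?

    The hedge-phrase list is an illustrative assumption; tune it to your
    domain or replace with an LLM-as-judge metric for production use.
    """
    hedge_phrases = ["conflict", "contradict", "differ", "however",
                     "on the other hand", "sources disagree"]
    scores = []
    for pred, target in zip(predictions, targets):
        if target.get("has_contradictory_sources"):
            acknowledged = any(p in pred.lower() for p in hedge_phrases)
            scores.append(1.0 if acknowledged else 0.0)
    return {"contradiction_handling": sum(scores) / len(scores) if scores else 1.0}
```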
Real-World Case Study: RAG Evaluation at Scale
A financial services company built a RAG system for internal policy Q&A, serving 5,000 employees across legal, compliance, HR, and operations departments.
Initial Challenges
- Manual review of 20 questions couldn't catch regressions
- Employees reported inconsistent answer quality
- No visibility into which departments or query types had poor experiences
- Prompt engineering changes caused unexpected quality degradation
Evaluation Framework Implementation
They built a comprehensive framework using MLflow:
- Golden Test Set: 800 questions curated from 6 months of production logs, labeled by SMEs with ground truth answers and relevant policies
- Automated Pipeline: GitHub Actions running evaluation on every pull request, blocking merges if quality gates fail
- Segmented Metrics: Tracked performance separately for each department, query complexity level, and document type
- Cost Tracking: Logged token usage and costs per evaluation run to prevent ballooning expenses
- Human-in-the-Loop: Weekly review of low-scoring examples to identify systematic issues
Results
- Faithfulness score improved from 0.72 to 0.91 over 3 months
- Caught 12 regressions before production deployment
- Identified that legal queries needed different chunk sizes than HR queries
- Reduced LLM costs by 35% by using cheaper models for simple queries
- Employee satisfaction (measured by thumbs up/down) improved from 68% to 87%
The evaluation framework paid for itself within 6 weeks by preventing production issues and optimizing infrastructure costs. For similar success in your organization, explore our MLOps consulting and training services.
Common Pitfalls and How to Avoid Them
Pitfall 1: Test Set Contamination
Accidentally including test data in training or optimization leads to overfitting. Keep test sets completely separate and only evaluate against them for final validation, not during development.
Pitfall 2: Metric Selection Bias
Optimizing for easy-to-measure metrics (like retrieval precision) while ignoring harder-to-measure ones (like user satisfaction) produces systems that perform well on paper but poorly in practice. Use a balanced scorecard of metrics.
Pitfall 3: Ignoring Latency
Achieving 98% quality with 10-second latency is usually worse than 92% quality with 2-second latency. Always measure and log latency alongside quality metrics.
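Capturing latency can be as simple as wrapping the RAG call with a timer and logging the result next to the quality metrics. A minimal sketch (the wrapper name and logged metric names are illustrative):

```python
import time

def timed_answer(rag_model, question):
    """Wrap a RAG call to capture wall-clock latency for logging."""
    start = time.perf_counter()
    answer = rag_model(question)
    latency_s = time.perf_counter() - start
    return answer, latency_s

# In the same MLflow run as the quality metrics, log latency percentiles, e.g.:
#   mlflow.log_metric("latency_p50_s", p50)
#   mlflow.log_metric("latency_p95_s", p95)
# so trade-offs like "98% quality at 10 s" are visible when comparing runs.
```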
Pitfall 4: Static Test Sets
As your system improves, existing test sets become too easy and stop providing signal. Continuously add new edge cases from production.
Pitfall 5: Lack of Baseline
Establishing a baseline before optimization attempts is critical. Without it, you can't tell if changes improved performance or just shifted it around.
Pitfall 6: Over-Reliance on LLM Judges
LLM-as-judge metrics are powerful but not perfect. They can miss subtle issues, have biases, and cost money to run. Complement them with heuristic metrics and human review of edge cases.
Conclusion
RAG evaluation separates production-grade systems from prototypes. Vibes-based assessment works for demos but fails at scale—systematic evaluation with MLflow provides the rigor necessary for reliable, high-quality RAG deployments.
Start with basic metrics (precision, faithfulness, relevance), build a modest golden test set, and automate evaluation in CI/CD. As your system matures, expand metrics to cover edge cases, segment performance by user groups, and implement continuous evaluation of production traffic.
RAG systems are complex with many configuration options—chunk size, embedding models, retrieval strategies, prompt templates, and reranking approaches. Only systematic evaluation reveals which combinations work best for your specific use case and data. The investment in evaluation infrastructure pays dividends through fewer production issues, lower costs, and higher user satisfaction.
For teams serious about production RAG systems, evaluation isn't optional—it's the foundation of quality. Build it early, iterate often, and let data guide your optimization decisions. For more insights on MLOps and RAG best practices, visit our blog.
Jalal Ahmed Khan
Microsoft Certified Trainer (MCT) · Founder, Gennoor Tech
14+ years in enterprise AI and cloud technologies. Delivered AI transformation programs for Fortune 500 companies across 6 countries including Boeing, Aramco, HDFC Bank, and Siemens. Holds 16 active Microsoft certifications including Azure AI Engineer and Power BI Analyst.