Back to Blog
LLMAIEvalsTestingPythonMLOps

LLM Evaluation in Production — Evals Frameworks, Golden Datasets, and Regression Testing

A practical guide to LLM evaluation in production: offline and online eval taxonomy, automated vs human evaluation, golden dataset construction, versioning, and maintenance, DeepEval unit testing with 20+ built-in metrics, RAGAS reference-free RAG evaluation (faithfulness, answer relevancy, context precision), LLM-as-judge with G-Eval and custom rubrics, judge calibration against human annotations with Spearman correlation, CI/CD regression testing on every PR, and online production monitoring with Langfuse traces and automated scoring.

2026-05-16

Why LLM Evaluation Is Fundamentally Different

Traditional ML evaluation is comparatively simple: you have ground truth labels, you run inference on a held-out test set, you compute accuracy, precision, recall, or MSE. The answer is either right or wrong. LLMs break this model entirely. A summarisation model may produce three valid summaries that differ from the reference but are equally correct. A customer support bot may refuse an inappropriate request in ten different ways, all acceptable. A code assistant might solve the problem with a different algorithm than the one in your golden dataset — and produce better code.

This means LLM evaluation requires a multi-layered strategy: offline evals on curated datasets before deployment, regression testing in CI/CD to catch prompt or model changes, and online monitoringin production where real traffic reveals failure modes that no dataset anticipated. The discipline of building and maintaining this infrastructure — often called “evals engineering” — is now a distinct specialty on teams that run LLMs in production.

This article covers the full evaluation stack: frameworks like DeepEval and RAGAS for automated scoring, golden dataset construction and maintenance, LLM-as-judge patterns, CI/CD integration, and production monitoring with Langfuse.

The Eval Taxonomy — What You Are Actually Measuring

Before reaching for a framework, establish what questions your evals need to answer. Conflating different eval types leads to systems that score well on paper but fail silently in production. There are four distinct dimensions:

Offline vs Online

Offline evals run against a fixed, curated dataset — your golden dataset — before deployment. They are fast, deterministic (given a fixed model), and gate releases. Online evals process real production traffic, either sampling and scoring in near-real-time or batching overnight. Both are necessary: offline evals protect against regressions; online evals reveal distribution shift and failure modes that the golden dataset did not anticipate.

Automated vs Human

Automated evals use heuristic metrics (string match, JSON schema validation, regex), ML-based metrics (BERTScore, ROUGE), or LLM-as-judge patterns (GPT-4o or Claude scoring responses on rubrics). Human evals involve annotators rating outputs on quality dimensions. Human evals are the ground truth but expensive and slow — use them to calibrate your automated metrics, not to gate every deployment.

Unit vs Integration

Unit evals test a single prompt/model/chain step in isolation — the summarisation step, the tool-calling step, the classification step. Integration evals test a complete pipeline end-to-end: input enters the orchestration layer, retrieval happens, generation runs, post-processing applies, and the final output is scored. Integration evals catch inter-component failures but are slower and harder to debug.

Functional vs Non-Functional

Functional evals test correctness: does the model answer the question accurately, follow the output format, extract the right entities, generate valid code? Non-functional evals test safety, tone, brand compliance, and latency: does the model refuse jailbreaks, maintain the correct persona, stay within character limits, respond within the SLA? Both categories must be covered in production systems.

Golden Datasets — Construction, Curation, and Maintenance

A golden dataset is the curated collection of inputs, expected outputs, and evaluation criteria that your offline evals run against. It is the single most important investment in your eval infrastructure — a poorly constructed golden dataset gives you false confidence while genuine regressions slip through.

Dataset Construction Strategies

Start with real production examples, not synthetic ones. The first 500 rows of your golden dataset should be sampled from actual user requests, annotated with correct outputs by domain experts. Synthetic generation with LLMs is useful for coverage of rare edge cases, failure modes, and adversarial inputs — but synthetic examples should supplement, not replace, real-world data.

# golden_dataset_builder.py — Curating a golden dataset from production traffic
# Uses Langfuse to sample production traces and export for annotation

import os
from langfuse import Langfuse
from datetime import datetime, timedelta
import json
import random

client = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ["LANGFUSE_HOST"],
)


def sample_production_traces(
    days_back: int = 30,
    sample_size: int = 500,
    min_score: float | None = None,
) -> list[dict]:
    """
    Sample production traces from Langfuse for golden dataset curation.
    Stratifies by: (1) high-confidence correct outputs, (2) known failures,
    (3) edge cases flagged by support.
    """
    cutoff = datetime.utcnow() - timedelta(days=days_back)

    traces = client.fetch_traces(
        from_timestamp=cutoff,
        limit=5000,
    ).data

    # Filter to traces with human scores (annotated by support or QA)
    annotated = [
        t for t in traces
        if t.scores and any(s.name == "correctness" for s in t.scores)
    ]

    if min_score is not None:
        annotated = [
            t for t in annotated
            if next(s.value for s in t.scores if s.name == "correctness") >= min_score
        ]

    sampled = random.sample(annotated, min(sample_size, len(annotated)))

    return [
        {
            "id": t.id,
            "input": t.input,
            "expected_output": t.output,
            "metadata": {
                "session_id": t.session_id,
                "tags": t.tags,
                "human_score": next(
                    (s.value for s in t.scores if s.name == "correctness"), None
                ),
            },
        }
        for t in sampled
    ]


def export_for_annotation(traces: list[dict], output_path: str) -> None:
    """Export traces in JSONL format for annotation tooling (Label Studio, Argilla)."""
    with open(output_path, "w") as f:
        for trace in traces:
            f.write(json.dumps(trace) + "\n")
    print(f"Exported {len(traces)} traces to {output_path}")


if __name__ == "__main__":
    traces = sample_production_traces(days_back=30, sample_size=500)
    export_for_annotation(traces, "golden_dataset_candidates.jsonl")

Note

Golden datasets rot. As your application evolves — new features, updated prompts, expanded user personas — the distribution of real traffic drifts away from your fixed dataset. Schedule a quarterly golden dataset refresh: sample 100-200 new production examples, annotate them, merge them into the dataset, and remove examples that no longer represent live traffic patterns.

Versioning Your Golden Dataset

Treat your golden dataset as code. Store it in a Git repository, version it with semantic versioning, and track which version was used for each eval run. This lets you distinguish regressions caused by model changes from regressions caused by dataset changes — a crucial distinction when debugging a failing CI check.

# dataset_schema.py — Pydantic schema for typed golden dataset entries

from pydantic import BaseModel, Field
from typing import Literal
from datetime import datetime


class GoldenEntry(BaseModel):
    id: str
    version: str = Field(description="Dataset version this entry belongs to, e.g. v2.3.0")
    category: Literal[
        "factual_qa",
        "summarisation",
        "code_generation",
        "entity_extraction",
        "classification",
        "adversarial",
        "edge_case",
    ]
    input: str | dict
    expected_output: str | dict | None = None
    evaluation_criteria: list[str] = Field(
        description="Rubric dimensions for LLM-as-judge scoring"
    )
    tags: list[str] = []
    source: Literal["production", "synthetic", "adversarial", "support_ticket"]
    annotated_by: str | None = None
    annotated_at: datetime | None = None
    notes: str | None = None


# Load and validate the dataset at eval time
import json
from pathlib import Path

def load_golden_dataset(version: str = "latest") -> list[GoldenEntry]:
    path = Path(f"datasets/golden/{version}/entries.jsonl")
    entries = []
    with open(path) as f:
        for line in f:
            entries.append(GoldenEntry.model_validate_json(line))
    return entries

DeepEval — Unit Testing for LLMs

DeepEval is an open-source LLM evaluation framework that brings pytest-style unit testing to LLM outputs. It ships with 20+ built-in metrics covering correctness, faithfulness, contextual relevancy, hallucination, toxicity, and more. Each metric uses either a heuristic or an LLM judge internally, and all metrics produce a score between 0 and 1 with a human-readable reason for the score.

# test_llm_evals.py — DeepEval unit tests for a RAG-based Q&A system
# pip install deepeval

import pytest
import deepeval
from deepeval import assert_test
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    HallucinationMetric,
    ToxicityMetric,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Configure the LLM judge (uses GPT-4o by default, or set OPENAI_API_KEY)
# To use a custom judge: deepeval.login_with_confident_ai(api_key="...")


@pytest.mark.parametrize("test_case", [
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        expected_output="Paris",
        retrieval_context=[
            "France is a country in Western Europe. Its capital city is Paris, "
            "which is also the country's largest city with a population of over 2 million."
        ],
    ),
    LLMTestCase(
        input="How do I reset my password?",
        actual_output=(
            "To reset your password, click 'Forgot Password' on the login page, "
            "enter your email address, and follow the instructions in the reset email."
        ),
        expected_output=(
            "Click 'Forgot Password', enter your email, then follow the emailed link."
        ),
        retrieval_context=[
            "Password reset: Navigate to the login page and click 'Forgot Password'. "
            "Enter the email address associated with your account. You will receive "
            "a password reset link valid for 24 hours."
        ],
    ),
])
def test_rag_answer_quality(test_case: LLMTestCase):
    answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    hallucination = HallucinationMetric(threshold=0.2)  # lower is better
    toxicity = ToxicityMetric(threshold=0.1)

    assert_test(
        test_case,
        metrics=[answer_relevancy, faithfulness, hallucination, toxicity],
    )


def test_contextual_relevancy():
    """Test that the retriever is fetching relevant context for the query."""
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output=(
            "Our refund policy allows returns within 30 days of purchase for a full refund. "
            "Items must be in original condition."
        ),
        expected_output="30-day full refund policy, items in original condition.",
        retrieval_context=[
            "Refund Policy: Customers may return items within 30 days of the original "
            "purchase date for a full refund. Items must be unused and in their original "
            "packaging. Digital products are non-refundable.",
            "Shipping Policy: Standard shipping takes 5-7 business days.",  # irrelevant chunk
        ],
    )

    precision = ContextualPrecisionMetric(threshold=0.6)
    recall = ContextualRecallMetric(threshold=0.7)

    assert_test(test_case, metrics=[precision, recall])
# run_evals_ci.sh — Running DeepEval in CI with structured output
#!/bin/bash
set -euo pipefail

echo "Running LLM evaluation suite..."

deepeval test run test_llm_evals.py \
  --output-path reports/eval_results.json \
  --confident-api-key "${DEEPEVAL_API_KEY}" \
  --model gpt-4o \
  --verbose

# Parse results and fail if pass rate < 90%
PASS_RATE=$(python3 -c "
import json
with open('reports/eval_results.json') as f:
    data = json.load(f)
total = data['total']
passed = data['passed']
rate = passed / total if total > 0 else 0
print(f'{rate:.3f}')
")

echo "Eval pass rate: ${PASS_RATE}"

python3 -c "
rate = float('${PASS_RATE}')
if rate < 0.90:
    print(f'FAIL: eval pass rate {rate:.1%} is below 90% threshold')
    exit(1)
else:
    print(f'PASS: eval pass rate {rate:.1%}')
"

RAGAS — Evaluating RAG Pipelines Without Ground Truth

RAGAS (Retrieval Augmented Generation Assessment) provides a suite of metrics specifically designed for RAG systems. Its key innovation is that several metrics — faithfulness, answer relevancy, context utilisation — do not require reference answers, making them usable for production traffic where ground truth does not exist. This enables continuous automated scoring of live traffic rather than limiting evaluation to curated datasets.

# ragas_evaluation.py — Evaluating a RAG pipeline with RAGAS
# pip install ragas langchain-openai

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_utilization,
    answer_correctness,
)
from ragas.metrics.critique import harmfulness
from datasets import Dataset
import pandas as pd


# The evaluation dataset
# context_recall and answer_correctness require 'ground_truth'
# faithfulness, answer_relevancy, and context_utilization do NOT —
# they can score production traffic without annotations.
eval_data = [
    {
        "question": "What is the maximum file upload size?",
        "answer": "The maximum file upload size is 100MB for standard accounts and 1GB for enterprise accounts.",
        "contexts": [
            "File Upload Limits: Standard accounts support file uploads up to 100MB. "
            "Enterprise accounts have a limit of 1GB per file.",
            "Storage quotas are separate from upload limits and depend on your subscription tier.",
        ],
        "ground_truth": "100MB for standard, 1GB for enterprise accounts.",
    },
    {
        "question": "Does the API support webhooks?",
        "answer": "Yes, the API supports webhooks for real-time event notifications. You can configure webhook endpoints in the dashboard.",
        "contexts": [
            "Webhooks: The platform supports webhook integrations for events such as "
            "payment.completed, user.created, and order.shipped. Configure endpoints via "
            "Settings > Integrations > Webhooks.",
        ],
        "ground_truth": "Yes, webhooks are supported and configured via Settings > Integrations.",
    },
]

dataset = Dataset.from_list(eval_data)

# Run RAGAS evaluation — uses GPT-4o as judge internally
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,           # Is the answer grounded in the retrieved context?
        answer_relevancy,       # Is the answer relevant to the question?
        context_precision,      # Are the retrieved chunks actually useful?
        context_recall,         # Did retrieval find all necessary information?
        context_utilization,    # Is the answer using the context efficiently?
        answer_correctness,     # How close is the answer to ground truth? (requires ground_truth)
    ],
)

df: pd.DataFrame = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_precision",
          "context_recall", "answer_correctness"]].to_string())

# Aggregate scores
print("\nMean scores:")
for metric in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
    print(f"  {metric}: {df[metric].mean():.3f}")

Note

RAGAS's reference-free metrics (faithfulness, answer relevancy) make it practical to score a sample of production traffic daily. Run this as a cron job against 1-5% of live requests, push scores to Prometheus or Langfuse, and alert when a 7-day rolling average drops below your thresholds. This is your earliest warning system for prompt regression or model drift.

LLM-as-Judge — G-Eval and Custom Rubrics

LLM-as-judge is the pattern of using a capable LLM (GPT-4o, Claude Sonnet, or Gemini Pro) to score another LLM's outputs against a rubric. It outperforms heuristic metrics like BLEU and ROUGE on open-ended tasks because it can reason about semantic equivalence, tone, and task completion rather than surface-level string similarity. The tradeoff is cost, latency, and the possibility of judge bias.

# llm_judge.py — Custom LLM-as-judge implementation with structured output
# Compatible with any OpenAI-compatible API

import os
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


class EvalScore(BaseModel):
    score: float  # 0.0 to 1.0
    grade: Literal["excellent", "good", "acceptable", "poor", "fail"]
    reasoning: str
    specific_issues: list[str]


JUDGE_SYSTEM_PROMPT = """You are an expert evaluator assessing the quality of AI assistant responses.
Evaluate the response on the following dimensions:
1. Accuracy — Is the response factually correct given the context?
2. Completeness — Does the response fully address the user's question?
3. Conciseness — Is the response appropriately concise without omitting key details?
4. Tone — Is the response professional, helpful, and appropriate?
5. Format — Is the response well-structured and easy to read?

Provide a holistic score from 0.0 to 1.0 and identify any specific issues.
Score interpretation: 0.9-1.0 excellent, 0.7-0.9 good, 0.5-0.7 acceptable, 0.3-0.5 poor, 0-0.3 fail."""


def evaluate_response(
    question: str,
    response: str,
    context: str | None = None,
    reference: str | None = None,
    model: str = "gpt-4o",
) -> EvalScore:
    """Score an LLM response using an LLM judge with structured output."""

    user_prompt = f"Question: {question}\n\nResponse to evaluate:\n{response}"

    if context:
        user_prompt = f"Context provided to the assistant:\n{context}\n\n{user_prompt}"

    if reference:
        user_prompt += f"\n\nReference answer (for comparison):\n{reference}"

    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        response_format=EvalScore,
        temperature=0,
    )

    return completion.choices[0].message.parsed


# Batch evaluation against golden dataset
def batch_evaluate(
    entries: list[dict],
    model: str = "gpt-4o-mini",  # cheaper judge for bulk evals
) -> list[dict]:
    results = []
    for entry in entries:
        score = evaluate_response(
            question=entry["input"],
            response=entry["actual_output"],
            context=entry.get("context"),
            reference=entry.get("expected_output"),
            model=model,
        )
        results.append({
            **entry,
            "judge_score": score.score,
            "judge_grade": score.grade,
            "judge_reasoning": score.reasoning,
            "judge_issues": score.specific_issues,
        })
    return results

Calibrating Your Judge

An LLM judge is only as good as its calibration against human annotations. Before trusting automated scores to gate deployments, measure inter-rater agreement between your judge and human annotators on 200-300 examples. Aim for Spearman correlation ≥ 0.75. If agreement is lower, refine the rubric, switch models, or add few-shot examples to the judge prompt. Document calibration results in your eval runbook — they decay over time as models change.

# calibration.py — Measure judge-human agreement with Spearman correlation

from scipy.stats import spearmanr
import numpy as np


def measure_judge_calibration(
    human_scores: list[float],
    judge_scores: list[float],
) -> dict:
    """
    Compute Spearman rank correlation between human and LLM judge scores.
    Returns correlation coefficient and interpretation.
    """
    correlation, p_value = spearmanr(human_scores, judge_scores)

    if correlation >= 0.85:
        interpretation = "Excellent — judge is well-calibrated, safe to gate deployments"
    elif correlation >= 0.75:
        interpretation = "Good — judge is usable, monitor closely for edge cases"
    elif correlation >= 0.60:
        interpretation = "Moderate — refine judge prompt or switch to a stronger model"
    else:
        interpretation = "Poor — do not use this judge for automated gating"

    mae = np.mean(np.abs(np.array(human_scores) - np.array(judge_scores)))

    return {
        "spearman_correlation": round(correlation, 4),
        "p_value": round(p_value, 6),
        "mean_absolute_error": round(mae, 4),
        "sample_size": len(human_scores),
        "interpretation": interpretation,
    }


# Example usage:
# human = [0.9, 0.4, 0.7, 0.8, 0.3, 0.95, 0.6, 0.2, 0.85, 0.5]
# judge = [0.88, 0.42, 0.65, 0.75, 0.35, 0.92, 0.58, 0.25, 0.80, 0.52]
# print(measure_judge_calibration(human, judge))

CI/CD Integration — Regression Testing on Every PR

LLM regressions are insidious: a prompt wording change, a system message update, or a model version bump can silently degrade performance on a subset of use cases while appearing fine on most inputs. Running evals in CI/CD catches these regressions before they reach production.

# .github/workflows/llm-evals.yml — GitHub Actions eval pipeline

name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/chains/**'
      - 'src/retrievers/**'
      - 'datasets/golden/**'
  schedule:
    # Run nightly against latest model version
    - cron: '0 2 * * *'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Install dependencies
        run: pip install deepeval ragas openai langfuse scipy

      - name: Run offline evals — unit tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: |
          deepeval test run tests/evals/ \
            --output-path reports/eval_results.json \
            --model gpt-4o-mini \
            --confident-api-key "${DEEPEVAL_API_KEY}"

      - name: Run RAGAS scoring on golden dataset
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/ragas_score_golden.py --output reports/ragas_scores.json

      - name: Compute regression delta vs baseline
        run: |
          python scripts/compare_scores.py \
            --current reports/ragas_scores.json \
            --baseline baselines/main_ragas_scores.json \
            --threshold 0.05

      - name: Upload eval reports
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-reports-${{ github.sha }}
          path: reports/
          retention-days: 90
# compare_scores.py — Regression delta checker
# Fails CI if any metric drops more than the threshold vs baseline

import json
import sys
import argparse


def compare_scores(current_path: str, baseline_path: str, threshold: float) -> bool:
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    regressions = []
    improvements = []

    for metric in baseline:
        if metric not in current:
            print(f"WARNING: metric {metric!r} missing from current results")
            continue

        delta = current[metric] - baseline[metric]

        if delta < -threshold:
            regressions.append((metric, baseline[metric], current[metric], delta))
        elif delta > threshold:
            improvements.append((metric, baseline[metric], current[metric], delta))

    if improvements:
        print("\n=== Improvements ===")
        for metric, base, curr, delta in improvements:
            print(f"  {metric}: {base:.3f} → {curr:.3f} (+{delta:.3f})")

    if regressions:
        print("\n=== REGRESSIONS (> threshold) ===")
        for metric, base, curr, delta in regressions:
            print(f"  {metric}: {base:.3f} → {curr:.3f} ({delta:.3f})")
        print(f"\nFAIL: {len(regressions)} metric(s) regressed beyond threshold {threshold}")
        return False

    print(f"\nPASS: No regressions beyond threshold {threshold}")
    return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--current", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args()

    passed = compare_scores(args.current, args.baseline, args.threshold)
    sys.exit(0 if passed else 1)

Online Evaluation with Langfuse — Scoring Production Traffic

Langfuse is an open-source LLM observability platform that captures traces from your production system and enables both automated and human scoring of live traffic. It is the operational layer that connects your offline eval infrastructure to your production deployment — bridging the gap between what your evals predict and what users actually experience.

# app/llm_chain.py — Instrumenting an LLM chain with Langfuse tracing

import os
from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe
from openai import OpenAI

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


@observe(name="rag-qa-chain")
def answer_question(question: str, user_id: str) -> str:
    """Full RAG chain — retrieval + generation — fully traced in Langfuse."""
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["production", "rag-qa"],
    )

    # Retrieval step — traced as a span
    with langfuse_context.observe_span(name="vector-retrieval") as span:
        chunks = retrieve_relevant_chunks(question)
        span.update(
            input=question,
            output=chunks,
            metadata={"num_chunks": len(chunks)},
        )

    # Generation step — traced as a generation
    with langfuse_context.observe_generation(
        name="answer-generation",
        model="gpt-4o",
        input={"question": question, "context": chunks},
    ) as generation:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "Answer the question based solely on the provided context. "
                               "If the answer is not in the context, say so explicitly.",
                },
                {
                    "role": "user",
                    "content": f"Context:\n{chr(10).join(chunks)}\n\nQuestion: {question}",
                },
            ],
            temperature=0.1,
        )

        answer = response.choices[0].message.content
        generation.update(output=answer, usage=response.usage)

    return answer


def retrieve_relevant_chunks(query: str) -> list[str]:
    # Your vector search implementation here
    return ["Placeholder context chunk 1", "Placeholder context chunk 2"]
# online_scorer.py — Automated scoring of sampled production traces
# Run this as a cron job every hour to score recent production traffic

import os
import random
from datetime import datetime, timedelta
from langfuse import Langfuse
from llm_judge import evaluate_response

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

SAMPLE_RATE = 0.05  # Score 5% of production traffic


def score_recent_traces(hours_back: int = 1, sample_rate: float = SAMPLE_RATE) -> dict:
    cutoff = datetime.utcnow() - timedelta(hours=hours_back)

    traces = langfuse.fetch_traces(
        from_timestamp=cutoff,
        tags=["production", "rag-qa"],
        limit=1000,
    ).data

    # Deterministic sampling: score every Nth trace
    sample = [t for i, t in enumerate(traces) if random.random() < sample_rate]

    scores_submitted = 0
    total_score = 0.0

    for trace in sample:
        if not trace.input or not trace.output:
            continue

        # Run automated LLM judge
        score = evaluate_response(
            question=str(trace.input),
            response=str(trace.output),
            model="gpt-4o-mini",
        )

        # Write score back to Langfuse for dashboarding
        langfuse.score(
            trace_id=trace.id,
            name="auto-judge-quality",
            value=score.score,
            comment=f"{score.grade}: {score.reasoning[:200]}",
        )

        total_score += score.score
        scores_submitted += 1

    return {
        "traces_processed": len(traces),
        "traces_scored": scores_submitted,
        "mean_quality_score": total_score / max(scores_submitted, 1),
    }


if __name__ == "__main__":
    result = score_recent_traces(hours_back=1)
    print(f"Scored {result['traces_scored']}/{result['traces_processed']} traces")
    print(f"Mean quality score: {result['mean_quality_score']:.3f}")

Metrics Reference — When to Use What

Choosing the wrong metric gives you a false signal. BLEU was designed for machine translation and is a poor proxy for RAG quality. BERTScore handles paraphrasing better but still misses logical correctness. LLM-as-judge is the most flexible but most expensive. Here is a decision matrix for common tasks:

TaskRecommended MetricsAvoid
RAG Q&AFaithfulness, Answer Relevancy (RAGAS), LLM-judgeBLEU, ROUGE
SummarisationBERTScore, LLM-judge with rubric, ROUGE-LExact match, BLEU
Code GenerationExecution-based (unit tests pass/fail), CodeBLEUROUGE, LLM-judge alone
ClassificationAccuracy, F1, Confusion matrixBERTScore
Entity ExtractionPrecision, Recall, F1 on extracted spansBLEU, LLM-judge alone
Open-ended chatLLM-judge (helpfulness, safety, tone), user thumbs up/downBLEU, BERTScore, exact match
Safety / GuardrailsRegex pattern detection, classifier, refusal rateQuality-focused metrics

Note

For code generation tasks, execution-based evaluation — running the generated code against a test suite and checking pass/fail — is far more reliable than any text similarity metric. If you cannot run generated code, use CodeBLEU which accounts for code-specific structure (AST overlap, data flow) rather than raw token matching. LLM-judge alone is insufficient for code correctness.

Further Reading

Work with us

Building LLM-powered products and struggling to catch regressions before they reach users?

We design and implement end-to-end LLM evaluation infrastructure — from golden dataset curation and DeepEval/RAGAS offline eval suites to LLM-as-judge calibration, CI/CD regression gates, Langfuse production trace scoring, and quality dashboards that give you confidence to ship prompt and model changes safely. Let’s talk.

Get in touch

Related Articles