Part of the CQO Guide to AI Evals Return to the main pillar page: The Chief Quality Officer’s Guide to Testing AI Agents.

LLM-as-a-Judge: How to Automate Your QA Pipeline (Python Tutorial)

Robot holding a scorecard grading another robot

You cannot manually inspect 10,000 interactions a day. If you are building AI Agents in 2026, you are likely facing the "Evaluation Bottleneck." Your developers are pushing code faster than your QA team can verify it.

The solution is LLM-as-a-Judge: a design pattern where a stronger model (like GPT-4o) acts as an impartial grader for your production model. This guide provides the exact Python code to build your own automated grading harness.

Prerequisite: Do you have a Golden Dataset? You need test data before you can build a judge. Read our guide on generating synthetic test cases.

1. The Concept: Agents Grading Agents

The workflow is simple:

  1. You feed a question from your Golden Dataset to your Agent.
  2. The Agent generates a response.
  3. You pass both the Agent's Response and the Ground Truth to the "Judge Model".
  4. The Judge returns a Score (1-5) and a Reasoning string.

2. The Rubric (The Constitution)

An LLM is only as good as its instructions. We don't just ask "Is this good?"; we define specific metrics. Here is a standard Faithfulness Rubric.

```python
# rubric_prompts.py
FAITHFULNESS_RUBRIC = """
You are an expert grading an AI assistant.
Score the response on a scale of 1 to 5 based on FAITHFULNESS.

Criteria:
1 - Hallucination: The response makes claims not supported by the context.
3 - Partial: The response is mostly supported but adds minor unverified details.
5 - Faithful: Every statement in the response is directly supported by the retrieved context.

Return your answer in JSON format:
{ "score": int, "reasoning": "string" }
"""
```

3. The Implementation (Python)

We will use the OpenAI SDK with Structured Outputs (Pydantic) to ensure our judge always returns valid JSON. This is critical for CI/CD pipelines.

```python
from pydantic import BaseModel
from openai import OpenAI

from rubric_prompts import FAITHFULNESS_RUBRIC

client = OpenAI()

# 1. Define the output structure
class EvaluationResult(BaseModel):
    score: int
    reasoning: str

def run_judge(question, context, agent_response):
    # 2. Construct the prompt
    prompt = f"""
Question: {question}
Retrieved Context: {context}
Agent Response: {agent_response}
---
Evaluate the Faithfulness of the Agent Response based on the Context.
"""
    # 3. Call GPT-4o with structured output
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": FAITHFULNESS_RUBRIC},
            {"role": "user", "content": prompt},
        ],
        response_format=EvaluationResult,
    )
    return completion.choices[0].message.parsed

# Example Usage
result = run_judge(
    question="What is the refund policy?",
    context="Refunds are processed within 14 days.",
    agent_response="You can get a refund in 30 days."
)
print(result.score)      # Output: 1
print(result.reasoning)  # Output: "Agent stated 30 days, but context says 14 days."
```

4. Scaling to Production

Once you have this function, you simply loop it over your CSV file of test cases. In a typical CI/CD pipeline, you would fail the build if the average_score drops below 4.5/5.
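The loop itself is short. Here is a minimal sketch, assuming a judge function with the signature from section 3 and a CSV whose columns are `question`, `context`, and `agent_response` (adjust the column names and threshold to your own dataset and quality bar):

```python
import csv
import sys

THRESHOLD = 4.5  # fail the build if the average score drops below this

def evaluate_dataset(path, judge_fn):
    """Run the judge over every row of a golden-dataset CSV and return the average score."""
    scores = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            result = judge_fn(row["question"], row["context"], row["agent_response"])
            scores.append(result.score)
    return sum(scores) / len(scores)

# Usage in CI (assumes run_judge from section 3 is importable):
# average_score = evaluate_dataset("golden_dataset.csv", run_judge)
# sys.exit(0 if average_score >= THRESHOLD else 1)
```

A non-zero exit code is what most CI systems (GitHub Actions, GitLab CI, Jenkins) treat as a failed step, so the quality gate needs no extra plumbing.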

Note on Cost: Running GPT-4o on 500 test cases can cost $10-$20 per run. If this is too high for every commit, check our Comparison of RAGAS vs DeepEval to see tools that optimize these costs using "Cascading Eval" strategies.


Frequently Asked Questions

Q: Can I use open-source models like Llama 3 as a judge?

A: Yes, but Llama 3 70B is recommended over the 8B version for grading. Smaller models struggle with the nuanced reasoning required to distinguish between 'Plausible' and 'Correct' answers.

Q: How do I handle bias in the Judge model?

A: LLMs often favor verbose answers (Length Bias). To counter this, force the judge to output a score first, or use a 'Reference-Based' rubric where the judge compares the output strictly to a Golden Answer.
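A reference-based rubric can look like the following sketch (the constant and helper names are illustrative, not from a library). The key idea is that the judge scores factual agreement with a golden answer, so extra verbosity earns nothing:

```python
# Illustrative reference-based rubric: the judge compares the agent's answer
# to a golden answer instead of judging it in isolation.
REFERENCE_RUBRIC = """
You are grading an AI assistant against a reference answer.
Score 1 to 5 for CORRECTNESS:
1 - Contradicts the reference answer.
3 - Matches the reference on the main claim but omits or adds details.
5 - Fully consistent with the reference answer; no contradictions, no extra claims.
Judge factual agreement only; ignore length and writing style.
Return your answer in JSON format:
{ "score": int, "reasoning": "string" }
"""

def build_reference_prompt(question: str, golden_answer: str, agent_response: str) -> str:
    """Assemble the user prompt pairing the golden answer with the agent's output."""
    return (
        f"Question: {question}\n"
        f"Reference Answer: {golden_answer}\n"
        f"Agent Response: {agent_response}\n"
        "---\n"
        "Score the Agent Response against the Reference Answer."
    )
```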

Q: What is the cost of using GPT-4o as a judge?

A: It costs approximately $0.01 per evaluation if you use optimized prompts. For large regression suites (1000+ tests), we recommend using GPT-4o-mini for the first pass and GPT-4o only for ambiguous cases.
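That two-tier pattern can be sketched as a small wrapper. This is a minimal illustration, not a library API; the "accept extremes, escalate the middle" band (1 and 5 here) is an assumption you should tune on your own data:

```python
def cascading_judge(cheap_judge, strong_judge, *args):
    """Run the cheap judge first; escalate ambiguous scores to the strong judge.

    cheap_judge / strong_judge are callables with the run_judge signature,
    each returning an object with a .score attribute.
    """
    first = cheap_judge(*args)
    if first.score in (1, 5):  # confident extremes: accept the cheap verdict
        return first
    return strong_judge(*args)  # ambiguous middle band: pay for a second opinion
```

Since most responses in a healthy suite score at the extremes, the expensive model only runs on the small ambiguous slice, which is where the cost savings come from.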

