AI LLM as Judge

AI License Required
Related Documentation
Made by
Kong Inc.
Supported Gateway Topologies
hybrid db-less traditional
Supported Konnect Deployments
hybrid cloud-gateways serverless
Compatible Protocols
grpc grpcs http https
Minimum Version
Kong Gateway - 3.12
Tags
#ai
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI LLM as Judge plugin enables automated evaluation of prompt-response pairs using a dedicated LLM. The plugin assigns a numerical score to LLM responses from 1 to 100, where:

  • 1: Perfect or ideal response
  • 100: Completely incorrect or irrelevant response

This plugin is part of the AI plugin suite, making it easy to integrate LLM-based evaluation workflows into your API pipelines.

Features

The AI LLM as Judge plugin offers several configurable features that control how the LLM evaluates prompts and responses:

Feature

Description

Configurable system prompt Instructs the LLM to act as a strict evaluator.
Numerical scoring Assigns a score from 1–100 to assess response quality.
History depth Includes previous chat messages for context when scoring.
Ignore prompts Options to ignore system, assistant, or tool prompts.
Sampling rate Controls probabilistic request volume for judging.
Native LLM schema Leverages Kong’s LLM schema for seamless integration.

How it works

  1. The plugin sends the user prompt and response to the configured LLM as a judge.
  2. The LLM evaluates the response and returns a numeric score between 1 (ideal) and 100 (wrong or irrelevant).
  3. This score can be used in downstream workflows, such as automated grading, feedback systems, or learning pipelines.

The following sequence diagram illustrates this simplified flow:

 
sequenceDiagram
    actor User as User
    participant Plugin as AI LLM as Judge Plugin
    participant LLM as Configured LLM

    User->>Plugin: Sends prompt and response
    Plugin->>LLM: Forward data for evaluation
    LLM-->>Plugin: Returns numeric score (1 to 100)
    Plugin->>User: Score available for downstream workflows
  

To ensure concise, consistent scoring, configure the LLM that acts as the judge with these values:

Setting

Recommended value

Description

temperature 2 Controls randomness. A lower value leads to a more deterministic output.
max_tokens 5 Maximum tokens for the LLM response.
top_p 1 Nucleus sampling probability; limits token selection.

These settings produce short, precise numeric scores without extra text or verbosity.

Known issues

  • The LLM as judge approach can lead to preference leakage issues when the same family of models is used as both the judge and the source.
  • The score generated by the LLM needs human preference alignment and should not be over-trusted.
Something wrong?

Help us make these docs great!

Kong Developer docs are open source. If you find these useful and want to make them better, contribute today!