AI LLM as Judge

AI License Required
Made by: Kong Inc.
Supported Gateway Topologies: hybrid, db-less, traditional
Supported Konnect Deployments: hybrid, cloud-gateways, serverless
Compatible Protocols: grpc, grpcs, http, https
Minimum Version: Kong Gateway 3.12
Tags: #ai
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI LLM as Judge plugin enables automated evaluation of prompt-response pairs using a dedicated LLM. The plugin assigns a numerical score to LLM responses from 1 to 100, where:

  • 1: Completely incorrect or irrelevant response
  • 100: Perfect or ideal response

This plugin is part of the AI plugin suite, making it easy to integrate LLM-based evaluation workflows into your API pipelines.
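
For example, a downstream consumer might bucket the score before acting on it. The following is a minimal sketch that assumes the 1-to-100 scale described above; the thresholds and function name are illustrative and are not part of the plugin:

```python
# Illustrative only: one way a downstream workflow could act on the 1-100
# judge score. The thresholds and function name are hypothetical and are not
# part of the plugin's configuration or output.

def classify_judge_score(score: int, passing_threshold: int = 70) -> str:
    """Map a 1-100 judge score (1 = worst, 100 = best) to a coarse label."""
    if not 1 <= score <= 100:
        raise ValueError(f"judge score out of range: {score}")
    if score >= passing_threshold:
        return "acceptable"
    if score >= 40:
        return "needs_review"
    return "rejected"

print(classify_judge_score(92))  # acceptable
print(classify_judge_score(15))  # rejected
```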

Features

The AI LLM as Judge plugin offers several configurable features that control how the LLM evaluates prompts and responses:

| Feature | Description |
|---------|-------------|
| Configurable system prompt | Instructs the LLM to act as a strict evaluator. |
| Numerical scoring | Assigns a score from 1–100 to assess response quality. |
| History depth | Includes previous chat messages for context when scoring. |
| Ignore prompts | Options to ignore system, assistant, or tool prompts. |
| Sampling rate | Controls the fraction of requests that are sent to the judge (sketched below). |
| Native LLM schema | Leverages Kong’s LLM schema for seamless integration. |
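
Of these, the sampling rate determines what fraction of traffic is actually sent to the judge. Conceptually it behaves like the sketch below; the `sampling_rate` name is used here only for illustration, so check the plugin's configuration reference for the exact field:

```python
# Illustrative sketch of probabilistic sampling: only a fraction of requests
# is forwarded to the judge. The sampling_rate name is used for illustration;
# the plugin's configuration reference has the authoritative field name.
import random

def should_judge(sampling_rate: float) -> bool:
    """Return True for roughly `sampling_rate` of calls (0.0 = never, 1.0 = always)."""
    return random.random() < sampling_rate

judged = sum(should_judge(0.2) for _ in range(10_000))
print(f"Judged roughly {judged} of 10000 requests")  # ~2000
```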

How it works

  1. The plugin sends the user prompt and its response to the configured judge LLM (see the conceptual sketch after this list).
  2. The judge LLM evaluates the response and returns a numeric score between 1 (completely incorrect or irrelevant) and 100 (ideal).
  3. This score can be used in downstream workflows, such as automated grading, feedback systems, or learning pipelines.
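
Conceptually, the exchange described in steps 1 and 2 resembles the following Python sketch. This is an illustration of the LLM-as-judge pattern, not the plugin's internal code: the endpoint, model name, and prompt wording are assumptions made to keep the example self-contained.

```python
# Conceptual illustration of the LLM-as-judge pattern that the plugin automates.
# Not the plugin's internal code: the endpoint, model name, and prompt wording
# are assumptions made for the sake of a runnable, self-contained example.
import requests

JUDGE_SYSTEM_PROMPT = (
    "You are a strict evaluator. You will be given a prompt and a response. "
    "Reply with a single integer from 1 (completely incorrect or irrelevant) "
    "to 100 (perfect or ideal). Output only the number."
)

def judge(prompt: str, response: str, api_base: str, api_key: str) -> int:
    """Score a prompt/response pair with an OpenAI-compatible chat model."""
    r = requests.post(
        f"{api_base}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gpt-4o-mini",  # assumed judge model
            "messages": [
                {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                {"role": "user", "content": f"Prompt:\n{prompt}\n\nResponse:\n{response}"},
            ],
            # Generation settings (temperature, max_tokens, top_p) are covered
            # in the recommended values later on this page.
        },
        timeout=30,
    )
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"].strip()
    return max(1, min(100, int(text)))  # clamp to the documented 1-100 range
```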

The following sequence diagram illustrates this simplified flow:

 
```mermaid
sequenceDiagram
    actor User as User
    participant Plugin as AI LLM as Judge Plugin
    participant LLM as Configured LLM

    User->>Plugin: Sends prompt and response
    Plugin->>LLM: Forward data for evaluation
    LLM-->>Plugin: Returns numeric score (1 to 100)
    Plugin->>User: Score available for downstream workflows
```

To ensure concise, consistent scoring, configure the LLM that acts as the judge with these values:

| Setting | Recommended value | Description |
|---------|-------------------|-------------|
| `temperature` | 2 | Controls randomness. A lower value leads to a more deterministic output. |
| `max_tokens` | 5 | Maximum tokens for the LLM response. |
| `top_p` | 1 | Nucleus sampling probability; limits token selection. |

These settings keep the judge's output to a short, precise numeric score without extra text.
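
For instance, applied to an OpenAI-style chat request, the recommended values map onto the request body as shown below. Only the three parameter names come from the table above; the surrounding payload structure and model name are assumptions:

```python
# Minimal sketch of how the recommended values fit into an OpenAI-style
# request body for the judge model. Only temperature, max_tokens, and top_p
# come from the table above; everything else is assumed for illustration.
judge_generation_settings = {
    "temperature": 2,  # randomness of the judge's output
    "max_tokens": 5,   # just enough room for a bare numeric score
    "top_p": 1,        # nucleus sampling cutoff
}

payload = {
    "model": "gpt-4o-mini",  # assumed judge model
    "messages": [{"role": "user", "content": "..."}],
    **judge_generation_settings,
}
```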

Known issues

  • The LLM-as-judge approach can suffer from preference leakage when the same model family is used as both the judge and the model being evaluated, which can bias the resulting scores.
  • Scores generated by the judge LLM are not guaranteed to align with human preferences and should not be over-trusted.