Control the accuracy of LLM models using the AI LLM as Judge plugin
Use AI Proxy Advanced to manage multiple LLM models, AI LLM as Judge to score responses, and HTTP Log to monitor LLM accuracy.
Prerequisites
Kong Konnect
This is a Konnect tutorial and requires a Konnect personal access token.
- Create a new personal access token by opening the Konnect PAT page and selecting Generate Token.
- Export your token to an environment variable:

  export KONNECT_TOKEN='YOUR_KONNECT_PAT'

- Run the quickstart script to automatically provision a Control Plane and Data Plane, and configure your environment:

  curl -Ls https://get.konghq.com/quickstart | bash -s -- -k $KONNECT_TOKEN --deck-output

  This sets up a Konnect Control Plane named quickstart, provisions a local Data Plane, and prints out the following environment variable exports:

  export DECK_KONNECT_TOKEN=$KONNECT_TOKEN
  export DECK_KONNECT_CONTROL_PLANE_NAME=quickstart
  export KONNECT_CONTROL_PLANE_URL=https://us.api.konghq.com
  export KONNECT_PROXY_URL='http://localhost:8000'

  Copy and paste these into your terminal to configure your session.
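Before moving on, you can optionally confirm that decK can reach your new Control Plane. This is only a sanity check and assumes the environment variables printed by the quickstart script are exported in your current shell:

deck gateway ping

If the connection works, decK prints a confirmation that it successfully connected to your Gateway.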
Kong Gateway running
This tutorial requires Kong Gateway Enterprise. If you don’t have Kong Gateway set up yet, you can use the quickstart script with an enterprise license to get an instance of Kong Gateway running almost instantly.
- Export your license to an environment variable:

  export KONG_LICENSE_DATA='LICENSE-CONTENTS-GO-HERE'

- Run the quickstart script:

  curl -Ls https://get.konghq.com/quickstart | bash -s -- -e KONG_LICENSE_DATA

  Once Kong Gateway is ready, you will see the following message:

  Kong Gateway Ready
decK v1.43+
decK is a CLI tool for managing Kong Gateway declaratively with state files. To complete this tutorial, install decK version 1.43 or later.
This guide uses deck gateway apply, which directly applies entity configuration to your Gateway instance. We recommend upgrading your decK installation to take advantage of this tool.
You can check your current decK version with deck version.
Required entities
For this tutorial, you’ll need Kong Gateway entities, like Gateway Services and Routes, pre-configured. These entities are essential for Kong Gateway to function, but configuring them isn’t the focus of this guide. Follow these steps to pre-configure them:
- Run the following command:

echo '
_format_version: "3.0"
services:
  - name: example-service
    url: http://httpbin.konghq.com/anything
routes:
  - name: example-route
    paths:
    - "/anything"
    service:
      name: example-service
' | deck gateway apply -

To learn more about entities, you can read our entities documentation.
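To confirm the Service and Route were created, you can send a quick request to the new path. This check assumes the proxy is listening on the quickstart default of localhost:8000; httpbin should echo the request back as JSON:

curl -i "http://localhost:8000/anything"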
OpenAI
This tutorial uses OpenAI:
- Create an OpenAI account.
- Get an API key.
- Create a decK variable with the API key:
  export DECK_OPENAI_API_KEY="YOUR OPENAI API KEY"
Ollama
To complete this tutorial, make sure you have Ollama installed and running locally.
- Visit the Ollama download page and download the installer for your operating system. Follow the installation instructions for your platform.
- Start Ollama:

  ollama start

- After installation, open a new terminal window and run the following command to pull the orca-mini model we will be using in this tutorial:

  ollama run orca-mini

- To set up the AI Proxy Advanced plugin, you’ll need the upstream URL of your local Ollama instance. By default, Ollama runs at localhost:11434. You can verify this by running:

  lsof -i :11434

  You should see output similar to:

  COMMAND   PID    USER            FD   TYPE  DEVICE  SIZE/OFF  NODE  NAME
  ollama    23909  your_user_name  4u   IPv4  0x...   0t0       TCP   localhost:11434 (LISTEN)

  If Ollama is running on a different port, run:

  lsof -iTCP -sTCP:LISTEN -n -P | grep 'ollama'

  Then look for the ollama process in the output and note the port number it’s listening on.

  In this example, we’re running Kong Gateway locally in a Docker container, so the host is host.docker.internal:

  export DECK_OLLAMA_UPSTREAM_URL='http://host.docker.internal:11434/api/chat'
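As an optional sanity check, you can also call Ollama’s chat endpoint directly, outside of Kong, to confirm the orca-mini model responds. This assumes the default localhost:11434 address:

curl -s http://localhost:11434/api/chat -d '{
  "model": "orca-mini",
  "messages": [{ "role": "user", "content": "Say hello in one short sentence." }],
  "stream": false
}'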
Configure the AI Proxy Advanced plugin
The AI Proxy Advanced plugin allows you to route requests to multiple LLM models and define load balancing, retries, timeouts, and token counting strategies. In this tutorial, we configure it to send requests to both OpenAI and Ollama models, using the lowest-usage balancer to direct traffic to the model currently handling the fewest tokens or requests.
For testing purposes only, we include a less reliable Ollama model in the configuration.
This makes it easier to demonstrate the evaluation differences when responses are judged by the LLM as Judge plugin.
echo '
_format_version: "3.0"
plugins:
- name: ai-proxy-advanced
config:
balancer:
algorithm: lowest-usage
connect_timeout: 60000
failover_criteria:
- error
- timeout
hash_on_header: X-Kong-LLM-Request-ID
latency_strategy: tpot
read_timeout: 60000
retries: 5
slots: 10000
tokens_count_strategy: llm-accuracy
write_timeout: 60000
genai_category: text/generation
llm_format: openai
max_request_body_size: 8192
model_name_header: true
response_streaming: allow
targets:
- model:
name: gpt-4.1-mini
provider: openai
options:
cohere:
embedding_input_type: classification
route_type: llm/v1/chat
auth:
allow_override: false
header_name: Authorization
header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
logging:
log_payloads: true
log_statistics: true
weight: 100
- model:
name: orca-mini
options:
llama2_format: ollama
upstream_url: "${{ env "DECK_OLLAMA_UPSTREAM_URL" }}"
provider: llama2
route_type: llm/v1/chat
logging:
log_payloads: true
log_statistics: true
weight: 100
' | deck gateway apply -
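After applying the plugin, you can send a single request through the Route to confirm traffic reaches one of the configured models. Because model_name_header is enabled, the response headers should indicate which model served the request; the X-Kong-LLM-Model header name used in the filter below is what the AI Proxy plugins typically set, so adjust it if your response headers differ:

curl -si -X POST "http://localhost:8000/anything" \
  -H "Content-Type: application/json" \
  --json '{
    "messages": [
      { "role": "user", "content": "Say hello in one sentence." }
    ]
  }' | grep -i '^x-kong-llm'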
Configure the AI LLM as Judge plugin
The AI LLM as Judge plugin evaluates responses returned by your models and assigns an accuracy score between 1 and 100. These scores can be used for model ranking, learning, or automated evaluation. In this tutorial, we use GPT-4o as the judge model—a higher-capacity model we recommend for this plugin to ensure consistent and reliable scoring.
echo '
_format_version: "3.0"
plugins:
- name: ai-llm-as-judge
config:
prompt: |
You are a strict evaluator. You will be given a request and a response.
Your task is to judge whether the response is correct or incorrect. You must
assign a score between 1 and 100, where: 100 represents a completely correct
and ideal response, 1 represents a completely incorrect or irrelevant response.
Your score must be a single number only — no text, labels, or explanations.
Use the full range of values (e.g., 13, 47, 86), not just round numbers like
10, 50, or 100. Be accurate and consistent, as this score will be used by another
model for learning and evaluation.
http_timeout: 60000
https_verify: true
ignore_assistant_prompts: true
ignore_system_prompts: true
ignore_tool_prompts: true
sampling_rate: 1
llm:
auth:
allow_override: false
header_name: Authorization
header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
logging:
log_payloads: true
log_statistics: true
model:
name: gpt-4o
provider: openai
options:
temperature: 2
max_tokens: 5
top_p: 1
route_type: llm/v1/chat
message_countback: 3
' | deck gateway apply -
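If you want to double-check what has been applied so far, decK can dump the current Gateway configuration to a file so you can verify that both AI plugins are present. This uses the same connection settings as the apply commands above; the filename and grep pattern are just examples:

deck gateway dump -o current-config.yaml
grep 'name: ai-' current-config.yaml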
Log model accuracy
The HTTP Log plugin allows you to capture plugin events and responses. We’ll use it to collect the LLM accuracy scores produced by AI LLM as Judge.
echo '
_format_version: "3.0"
plugins:
- name: http-log
service: example-service
config:
http_endpoint: http://host.docker.internal:9999/
headers:
Authorization: Bearer some-token
method: POST
timeout: 3000
' | deck gateway apply -
Let’s run a simple log collector script that listens on port 9999. Copy and run this snippet in your terminal:
cat <<EOF > log_server.py
from http.server import BaseHTTPRequestHandler, HTTPServer
import datetime

LOG_FILE = "kong_logs.txt"

class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        timestamp = datetime.datetime.now().isoformat()
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length).decode('utf-8')

        log_entry = f"{timestamp} - {post_data}\n"
        with open(LOG_FILE, "a") as f:
            f.write(log_entry)

        print("="*60)
        print(f"Received POST request at {timestamp}")
        print(f"Path: {self.path}")
        print("Headers:")
        for header, value in self.headers.items():
            print(f" {header}: {value}")
        print("Body:")
        print(post_data)
        print("="*60)

        # Send OK response
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

if __name__ == '__main__':
    server_address = ('', 9999)
    httpd = HTTPServer(server_address, LogHandler)
    print("Starting log server on http://0.0.0.0:9999")
    httpd.serve_forever()
EOF
Now, run this script with Python:
python3 log_server.py
If the script starts successfully, you’ll see the following message in your terminal:
Starting log server on http://0.0.0.0:9999
Validate your configuration
Send test requests to the example-route Route to see model responses scored:
for i in {1..5}; do
curl -s -X POST "http://localhost:8000/anything" \
-H "Content-Type: application/json" \
--json '{
"messages": [
{
"role": "user",
"content": "Who was Jozef Mackiewicz?"
}
]
}'
sleep 3
done
You should see JSON logs from your HTTP Log plugin endpoint in kong_logs.txt. The llm_accuracy field reflects how well the model’s response aligns with the judge model’s evaluation.
When comparing the two models, notice how gpt-4.1-mini produces a much higher llm_accuracy score than orca-mini, showing that its judged responses are significantly more accurate.
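Each HTTP Log entry is a JSON document, so once a few requests have been scored you can pull just the accuracy values out of kong_logs.txt with a rough filter like the one below. The exact field layout can vary by plugin version, so adjust the pattern if your logs look different:

grep -o '"llm_accuracy":[0-9.]*' kong_logs.txt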
Cleanup
Clean up Konnect environment
If you created a new control plane and want to conserve your free trial credits or avoid unnecessary charges, delete the new control plane used in this tutorial.
Destroy the Kong Gateway container
curl -Ls https://get.konghq.com/quickstart | bash -s -- -d