Control the accuracy of LLM models using the AI LLM as Judge plugin
Use AI Proxy Advanced to manage multiple LLM models, AI LLM as Judge to score responses, and HTTP Log to monitor LLM accuracy.
Prerequisites
Kong Konnect
This is a Konnect tutorial and requires a Konnect personal access token.
- Create a new personal access token by opening the Konnect PAT page and selecting Generate Token.
- Export your token to an environment variable:

  export KONNECT_TOKEN='YOUR_KONNECT_PAT'

- Run the quickstart script to automatically provision a Control Plane and Data Plane, and configure your environment:

  curl -Ls https://get.konghq.com/quickstart | bash -s -- -k $KONNECT_TOKEN --deck-output

  This sets up a Konnect Control Plane named quickstart, provisions a local Data Plane, and prints out the following environment variable exports:

  export DECK_KONNECT_TOKEN=$KONNECT_TOKEN
  export DECK_KONNECT_CONTROL_PLANE_NAME=quickstart
  export KONNECT_CONTROL_PLANE_URL=https://us.api.konghq.com
  export KONNECT_PROXY_URL='http://localhost:8000'

  Copy and paste these into your terminal to configure your session.
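Before moving on, you can optionally confirm that decK can reach your new Control Plane. This is only a sanity check and assumes the environment variables printed by the quickstart script are exported in your current shell:

deck gateway ping

If the connection works, decK prints a confirmation that it successfully connected to your Gateway.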
Kong Gateway running
This tutorial requires Kong Gateway Enterprise. If you don’t have Kong Gateway set up yet, you can use the quickstart script with an enterprise license to get an instance of Kong Gateway running almost instantly.
- Export your license to an environment variable:

  export KONG_LICENSE_DATA='LICENSE-CONTENTS-GO-HERE'

- Run the quickstart script:

  curl -Ls https://get.konghq.com/quickstart | bash -s -- -e KONG_LICENSE_DATA

  Once Kong Gateway is ready, you will see the following message:

  Kong Gateway Ready
decK v1.43+
decK is a CLI tool for managing Kong Gateway declaratively with state files. To complete this tutorial, install decK version 1.43 or later.
This guide uses deck gateway apply, which directly applies entity configuration to your Gateway instance. We recommend upgrading your decK installation to take advantage of this tool.
You can check your current decK version with deck version.
Required entities
For this tutorial, you’ll need Kong Gateway entities, like Gateway Services and Routes, pre-configured. These entities are essential for Kong Gateway to function, but configuring them isn’t the focus of this guide. Follow these steps to pre-configure them:
- Run the following command:

echo '
_format_version: "3.0"
services:
  - name: example-service
    url: http://httpbin.konghq.com/anything
routes:
  - name: example-route
    paths:
    - "/anything"
    service:
      name: example-service
' | deck gateway apply -

To learn more about entities, you can read our entities documentation.
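To confirm the Service and Route were created, you can send a quick request to the new path. This check assumes the proxy is listening on the quickstart default of localhost:8000; httpbin should echo the request back as JSON:

curl -i "http://localhost:8000/anything"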
OpenAI
This tutorial uses OpenAI:
- Create an OpenAI account.
- Get an API key.
- Create a decK variable with the API key:
  export DECK_OPENAI_API_KEY="YOUR OPENAI API KEY"
Ollama
To complete this tutorial, make sure you have Ollama installed and running locally.
- Visit the Ollama download page and download the installer for your operating system. Follow the installation instructions for your platform.
- Start Ollama:

  ollama start

- After installation, open a new terminal window and run the following command to pull the orca-mini model we will be using in this tutorial:

  ollama run orca-mini

- To set up the AI Proxy Advanced plugin, you’ll need the upstream URL of your local Ollama instance. By default, Ollama runs at localhost:11434. You can verify this by running:

  lsof -i :11434

  You should see output similar to:

  COMMAND   PID    USER            FD   TYPE  DEVICE  SIZE/OFF  NODE  NAME
  ollama    23909  your_user_name  4u   IPv4  0x...   0t0       TCP   localhost:11434 (LISTEN)

  If Ollama is running on a different port, run:

  lsof -iTCP -sTCP:LISTEN -n -P | grep 'ollama'

  Then look for the ollama process in the output and note the port number it’s listening on.

  In this example, we’re running Kong Gateway locally in a Docker container, so the host is host.docker.internal:

  export DECK_OLLAMA_UPSTREAM_URL='http://host.docker.internal:11434/api/chat'
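As an optional sanity check, you can also call Ollama’s chat endpoint directly, outside of Kong, to confirm the orca-mini model responds. This assumes the default localhost:11434 address:

curl -s http://localhost:11434/api/chat -d '{
  "model": "orca-mini",
  "messages": [{ "role": "user", "content": "Say hello in one short sentence." }],
  "stream": false
}'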
Configure the AI Proxy Advanced plugin
The AI Proxy Advanced plugin allows you to route requests to multiple LLM models and define load balancing, retries, timeouts, and token counting strategies. In this tutorial, we configure it to send requests to both OpenAI and Ollama models, using the lowest-usage balancer to direct traffic to the model currently handling the fewest tokens or requests.
For testing purposes only, we include a less reliable Ollama model in the configuration.
This makes it easier to demonstrate the evaluation differences when responses are judged by the LLM as Judge plugin.
echo '
_format_version: "3.0"
plugins:
- name: ai-proxy-advanced
config:
balancer:
algorithm: lowest-usage
connect_timeout: 60000
failover_criteria:
- error
- timeout
hash_on_header: X-Kong-LLM-Request-ID
latency_strategy: tpot
read_timeout: 60000
retries: 5
slots: 10000
tokens_count_strategy: llm-accuracy
write_timeout: 60000
genai_category: text/generation
llm_format: openai
max_request_body_size: 8192
model_name_header: true
response_streaming: allow
targets:
- model:
name: gpt-4.1-mini
provider: openai
options:
cohere:
embedding_input_type: classification
route_type: llm/v1/chat
auth:
allow_override: false
header_name: Authorization
header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
logging:
log_payloads: true
log_statistics: true
weight: 100
- model:
name: orca-mini
options:
llama2_format: ollama
upstream_url: "${{ env "DECK_OLLAMA_UPSTREAM_URL" }}"
provider: llama2
route_type: llm/v1/chat
logging:
log_payloads: true
log_statistics: true
weight: 100
' | deck gateway apply -
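After applying the plugin, you can send a single request through the Route to confirm traffic reaches one of the configured models. Because model_name_header is enabled, the response headers should indicate which model served the request; the X-Kong-LLM-Model header name used in the filter below is what the AI Proxy plugins typically set, so adjust it if your response headers differ:

curl -si -X POST "http://localhost:8000/anything" \
  -H "Content-Type: application/json" \
  --json '{
    "messages": [
      { "role": "user", "content": "Say hello in one sentence." }
    ]
  }' | grep -i '^x-kong-llm'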
Configure the AI LLM as Judge plugin
The AI LLM as Judge plugin evaluates responses returned by your models and assigns an accuracy score between 1 and 100. These scores can be used for model ranking, learning, or automated evaluation. In this tutorial, we use GPT-4o as the judge model—a higher-capacity model we recommend for this plugin to ensure consistent and reliable scoring.
echo '
_format_version: "3.0"
plugins:
- name: ai-llm-as-judge
config:
prompt: |
You are a strict evaluator. You will be given a request and a response.
Your task is to judge whether the response is correct or incorrect. You must
assign a score between 1 and 100, where: 100 represents a completely correct
and ideal response, 1 represents a completely incorrect or irrelevant response.
Your score must be a single number only — no text, labels, or explanations.
Use the full range of values (e.g., 13, 47, 86), not just round numbers like
10, 50, or 100. Be accurate and consistent, as this score will be used by another
model for learning and evaluation.
http_timeout: 60000
https_verify: true
ignore_assistant_prompts: true
ignore_system_prompts: true
ignore_tool_prompts: true
sampling_rate: 1
llm:
auth:
allow_override: false
header_name: Authorization
header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
logging:
log_payloads: true
log_statistics: true
model:
name: gpt-4o
provider: openai
options:
temperature: 2
max_tokens: 5
top_p: 1
route_type: llm/v1/chat
message_countback: 3
' | deck gateway apply -
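If you want to double-check what has been applied so far, decK can dump the current Gateway configuration to a file so you can verify that both AI plugins are present. This uses the same connection settings as the apply commands above; the filename and grep pattern are just examples:

deck gateway dump -o current-config.yaml
grep 'name: ai-' current-config.yaml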
Log model accuracy
The HTTP Log plugin allows you to capture plugin events and responses. We’ll use it to collect the LLM accuracy scores produced by AI LLM as Judge.
echo '
_format_version: "3.0"
plugins:
- name: http-log
service: example-service
config:
http_endpoint: http://host.docker.internal:9999/
headers:
Authorization: Bearer some-token
method: POST
timeout: 3000
' | deck gateway apply -
Let’s run a simple log collector script that listens on port 9999. Copy and run this snippet in your terminal:
cat <<EOF > log_server.py
from http.server import BaseHTTPRequestHandler, HTTPServer
import datetime

LOG_FILE = "kong_logs.txt"

class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        timestamp = datetime.datetime.now().isoformat()
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length).decode('utf-8')

        log_entry = f"{timestamp} - {post_data}\n"
        with open(LOG_FILE, "a") as f:
            f.write(log_entry)

        print("="*60)
        print(f"Received POST request at {timestamp}")
        print(f"Path: {self.path}")
        print("Headers:")
        for header, value in self.headers.items():
            print(f" {header}: {value}")
        print("Body:")
        print(post_data)
        print("="*60)

        # Send OK response
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

if __name__ == '__main__':
    server_address = ('', 9999)
    httpd = HTTPServer(server_address, LogHandler)
    print("Starting log server on http://0.0.0.0:9999")
    httpd.serve_forever()
EOF
Now, run this script with Python:
python3 log_server.py
If the script starts successfully, you’ll see the following message in your terminal:
Starting log server on http://0.0.0.0:9999
Validate your configuration
Send test requests to the example-route Route to see model responses scored:
for i in {1..5}; do
curl -s -X POST "http://localhost:8000/anything" \
-H "Content-Type: application/json" \
--json '{
"messages": [
{
"role": "user",
"content": "Who was Jozef Mackiewicz?"
}
]
}'
sleep 3
done
You should see JSON logs from your HTTP Log plugin endpoint in kong_logs.txt. The llm_accuracy field reflects how well the model’s response aligns with the judge model’s evaluation.
When comparing the two models, notice how gpt-4.1-mini produces a much higher llm_accuracy score than orca-mini, showing that its judged responses are significantly more accurate.
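Each HTTP Log entry is a JSON document, so once a few requests have been scored you can pull just the accuracy values out of kong_logs.txt with a rough filter like the one below. The exact field layout can vary by plugin version, so adjust the pattern if your logs look different:

grep -o '"llm_accuracy":[0-9.]*' kong_logs.txt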
Cleanup
Clean up Konnect environment
If you created a new control plane and want to conserve your free trial credits or avoid unnecessary charges, delete the new control plane used in this tutorial.
Destroy the Kong Gateway container
curl -Ls https://get.konghq.com/quickstart | bash -s -- -d