Control prompt size with the AI Prompt Compressor plugin
Use the AI RAG Injector plugin in combination with the AI Prompt Compressor and AI Prompt Decorator plugins to retrieve relevant chunks and keep the final prompt within reasonable limits, preventing increased latency, token limit errors, and unexpected bills from LLM providers.
Prerequisites
Kong Konnect
This is a Konnect tutorial and requires a Konnect personal access token.
- Create a new personal access token by opening the Konnect PAT page and selecting Generate Token.
- Export your token to an environment variable:
export KONNECT_TOKEN='YOUR_KONNECT_PAT'
- Run the quickstart script to automatically provision a Control Plane and Data Plane, and configure your environment:
curl -Ls https://get.konghq.com/quickstart | bash -s -- -k $KONNECT_TOKEN --deck-output
This sets up a Konnect Control Plane named quickstart, provisions a local Data Plane, and prints out the following environment variable exports:
export DECK_KONNECT_TOKEN=$KONNECT_TOKEN
export DECK_KONNECT_CONTROL_PLANE_NAME=quickstart
export KONNECT_CONTROL_PLANE_URL=https://us.api.konghq.com
export KONNECT_PROXY_URL='http://localhost:8000'
Copy and paste these into your terminal to configure your session.
Kong Gateway running
This tutorial requires Kong Gateway Enterprise. If you don’t have Kong Gateway set up yet, you can use the quickstart script with an enterprise license to get an instance of Kong Gateway running almost instantly.
- Export your license to an environment variable:
export KONG_LICENSE_DATA='LICENSE-CONTENTS-GO-HERE'
- Run the quickstart script:
curl -Ls https://get.konghq.com/quickstart | bash -s -- -e KONG_LICENSE_DATA
Once Kong Gateway is ready, you will see the following message:
Kong Gateway Ready
decK
decK is a CLI tool for managing Kong Gateway declaratively with state files. To complete this tutorial you will first need to install decK.
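For example, on macOS you can install decK with Homebrew (other platforms and installation methods are covered in the decK documentation):
brew tap kong/deck
brew install deck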
Required entities
For this tutorial, you’ll need Kong Gateway entities, like Gateway Services and Routes, pre-configured. These entities are essential for Kong Gateway to function but installing them isn’t the focus of this guide. Follow these steps to pre-configure them:
- Run the following command:
echo '
_format_version: "3.0"
services:
  - name: example-service
    url: http://httpbin.konghq.com/anything
routes:
  - name: example-route
    paths:
      - "/anything"
    service:
      name: example-service
' | deck gateway apply -
To learn more about entities, you can read our entities documentation.
OpenAI
This tutorial uses OpenAI:
- Create an OpenAI account.
- Get an API key.
- Create a decK variable with the API key:
export DECK_OPENAI_API_KEY='YOUR OPENAI API KEY'
Redis stack
To complete this tutorial, make sure you have the following:
- A Redis Stack running and accessible from the environment where Kong is deployed.
- Port 6379, or your custom Redis port, is open and reachable from Kong.
- Redis host set as an environment variable so the plugin can connect:
export DECK_REDIS_HOST='YOUR-REDIS-HOST'
If you’re testing locally with Docker, use host.docker.internal as the host value.
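If you don’t already have a Redis Stack instance, a quick way to get one for local testing is Docker. This is only a sketch; the container name redis-stack is an arbitrary example:
docker run -d --name redis-stack -p 6379:6379 redis/redis-stack-server:latest
With this local setup, set DECK_REDIS_HOST to host.docker.internal as noted above so Kong running in Docker can reach it.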
Kong Prompt Compressor service via Cloudsmith
Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access to it.
Once you’ve received your Cloudsmith access token, run the following commands to pull the image:
- To pull images, you must first authenticate with the token provided by Support:
docker login docker.cloudsmith.io
- Docker will then prompt you to enter a username and password:
Username: kong/ai-compress
Password: <YOUR_TOKEN>
This is a token-based login with read-only access. You can pull images but not push them. Contact support for your token.
- To pull an image, replace <image-name> and <tag> with the appropriate image and version, such as:
docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
- You can now run the image with the following command:
docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
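The AI Prompt Compressor plugin configured later in this tutorial points at http://compress-service:8080, so the Kong Gateway container must be able to resolve that hostname. One way to achieve this, assuming the quickstart created a Docker network for Kong (run docker network ls to find its actual name; kong-quickstart-net below is only an example), is to name the container and attach it to that network:
docker run --rm --name compress-service --network kong-quickstart-net -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2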
Langchain splitters
To complete this tutorial, you’ll need Python (version 3.7 or later) and pip installed on your machine. You can verify this by running:
python3 --version
python3 -m pip --version
Once that’s set up, install the required packages by running the following command in your terminal:
python3 -m pip install langchain langchain_text_splitters requests
Configure the AI Proxy Advanced plugin
First, you’ll need to configure the AI Proxy Advanced plugin to proxy prompt requests to your model provider, and handle authentication:
echo '
_format_version: "3.0"
plugins:
- name: ai-proxy-advanced
config:
targets:
- route_type: llm/v1/chat
auth:
header_name: Authorization
header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
model:
provider: openai
name: gpt-4o
options:
max_tokens: 512
temperature: 1.0
' | deck gateway apply -
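If you want to confirm that requests are proxied to OpenAI before adding RAG, you can send a quick test request. This step is optional and uses the same route and request shape as the validation step later in this tutorial; the prompt text is just an arbitrary example:
curl "$KONNECT_PROXY_URL/anything" \
  -H "Content-Type: application/json" \
  --json '{
    "messages": [
      {
        "role": "user",
        "content": "Say hello in one short sentence."
      }
    ]
  }'
A successful response is a standard OpenAI chat completion JSON object.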
Configure the AI RAG Injector plugin
Next, configure the AI RAG Injector plugin to insert the RAG context into the user message only, and wrap it with <LLMLINGUA> tags so the AI Prompt Compressor plugin can compress it effectively.
echo '
_format_version: "3.0"
plugins:
- name: ai-rag-injector
config:
fetch_chunks_count: 5
inject_as_role: user
inject_template: "<LLMLINGUA><CONTEXT></LLMLINGUA> | <PROMPT>"
embeddings:
auth:
header_name: Authorization
header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
model:
provider: openai
name: text-embedding-3-large
vectordb:
strategy: redis
redis:
host: "${{ env "DECK_REDIS_HOST" }}"
port: 6379
distance_metric: cosine
dimensions: 3072
' | deck gateway apply -
If your Redis instance runs in a separate Docker container from Kong, use host.docker.internal for vectordb.redis.host.
If you’re using a model other than text-embedding-3-large, be sure to update the vectordb.dimensions value to match the model’s embedding size.
Once the plugin is created, copy its id from the decK response. Then, export it so the ingestion script can reference it later:
export PLUGIN_ID=<YOUR_PLUGIN_ID>
Replace <YOUR_PLUGIN_ID> with the actual id returned from the plugin creation response. You’ll need this environment variable when generating the ingestion script that sends chunked content to the plugin.
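If you didn’t capture the id from the decK output, you can also look it up through the Kong Admin API. A minimal sketch, assuming the Admin API is reachable at localhost:8001 (as in the ingestion script below) and that jq is installed:
export PLUGIN_ID=$(curl -s http://localhost:8001/plugins | jq -r '.data[] | select(.name == "ai-rag-injector") | .id')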
Ingest data to Redis
Create an inject_template.py file by pasting the following into your terminal. This script fetches a Wikipedia article, splits the content into chunks, and sends each chunk to a local RAG ingestion endpoint.
cat <<EOF > inject_template.py
import requests
from langchain_text_splitters import RecursiveCharacterTextSplitter
plugin_id = "${PLUGIN_ID}"
def get_wikipedia_extract(title):
url = "https://en.wikipedia.org/w/api.php"
params = {
"format": "json",
"action": "query",
"prop": "extracts",
"exlimit": "max",
"explaintext": True,
"titles": title,
"redirects": 1
}
response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()
pages = data.get("query", {}).get("pages", {})
for page_id, page in pages.items():
if "extract" in page:
return page["extract"]
return None
title = "Shark"
text = get_wikipedia_extract(title)
if not text:
print(f"Failed to retrieve Wikipedia content for: {title}")
exit()
text = f"# {title}\\n\\n{text}"
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.create_documents([text])
print(f"Injecting {len(docs)} chunks...")
for doc in docs:
response = requests.post(
f"http://localhost:8001/ai-rag-injector/{plugin_id}/ingest_chunk",
data={"content": doc.page_content}
)
print(response.status_code, response.text)
EOF
Now, run this script with Python:
python3 inject_template.py
If successful, your terminal will print the following:
Injecting 91 chunks...
200 {"metadata":{"chunk_id":"c55d8869-6858-496f-83d2-abcdefghij12","ingest_duration":615,"embeddings_tokens_count":2}}
200 {"metadata":{"chunk_id":"fc7d4fd7-21e0-443e-9504-abcdefghij13","ingest_duration":779,"embeddings_tokens_count":231}}
200 {"metadata":{"chunk_id":"8d2aebe1-04e4-40c7-b16f-abcdefghij14","ingest_duration":569,"embeddings_tokens_count":184}}
Wait until all 91 chunks have been injected before moving on to the next step.
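If you want to sanity-check that the chunks landed in Redis, you can count the keys directly. This is a rough check, assuming Redis Stack runs in a Docker container named redis-stack (adjust to your setup); the key count should roughly correspond to the number of ingested chunks:
docker exec redis-stack redis-cli DBSIZE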
Configure the AI Prompt Compressor plugin
Now, you can configure the AI Prompt Compressor plugin to apply compression to the wrapped RAG context using defined token ranges and compression settings.
echo '
_format_version: "3.0"
plugins:
- name: ai-prompt-compressor
config:
compression_ranges:
- max_tokens: 100
min_tokens: 20
value: 0.8
- max_tokens: 1000000
min_tokens: 100
value: 0.3
compressor_type: rate
compressor_url: http://compress-service:8080
keepalive_timeout: 60000
log_text_data: false
stop_on_error: true
timeout: 10000
' | deck gateway apply -
Log prompt compression
Before we send requests to our LLM, we need to set up the HTTP Log plugin to check how many tokens we’ve managed to save with our configuration. First, create an HTTP Log plugin:
echo '
_format_version: "3.0"
plugins:
  - name: http-log
    route: example-route
    config:
      http_endpoint: http://host.docker.internal:9999/
      headers:
        Authorization: Bearer some-token
      method: POST
      timeout: 3000
' | deck gateway apply -
Let’s run a simple log collector script that collects logs on port 9999. Copy and run this snippet in your terminal:
cat <<EOF > log_server.py
from http.server import BaseHTTPRequestHandler, HTTPServer
import datetime
LOG_FILE = "kong_logs.txt"
class LogHandler(BaseHTTPRequestHandler):
def do_POST(self):
timestamp = datetime.datetime.now().isoformat()
content_length = int(self.headers['Content-Length'])
post_data = self.rfile.read(content_length).decode('utf-8')
log_entry = f"{timestamp} - {post_data}\n"
with open(LOG_FILE, "a") as f:
f.write(log_entry)
print("="*60)
print(f"Received POST request at {timestamp}")
print(f"Path: {self.path}")
print("Headers:")
for header, value in self.headers.items():
print(f" {header}: {value}")
print("Body:")
print(post_data)
print("="*60)
# Send OK response
self.send_response(200)
self.end_headers()
self.wfile.write(b"OK")
if __name__ == '__main__':
server_address = ('', 9999)
httpd = HTTPServer(server_address, LogHandler)
print("Starting log server on http://0.0.0.0:9999")
httpd.serve_forever()
EOF
Now, run this script with Python:
python3 log_server.py
If the script starts successfully, you’ll see the following message in your terminal:
Starting log server on http://0.0.0.0:9999
Validate your configuration
When sending the following request:
curl "$KONNECT_PROXY_URL/anything" \
-H "Content-Type: application/json"\
-H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
--json '{
"messages": [
{
"role": "user",
"content": "How many species of sharks are there in the world?"
}
]
}'
curl "http://localhost:8000/anything" \
-H "Content-Type: application/json"\
-H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
--json '{
"messages": [
{
"role": "user",
"content": "How many species of sharks are there in the world?"
}
]
}'
You should see output like this in your HTTP log plugin endpoint, showing how many tokens were saved through compression:
"compressor": {
"compress_items": [
{
"compress_token_count": 244,
"original_token_count": 700,
"compress_value": 0.3,
"information": "Compression was performed and saved 456 tokens",
"compressor_model": "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
"msg_id": 1,
"compress_type": "rate",
"save_token_count": 456
}
],
"duration": 1092
}
Govern your LLM pipeline
You can use the AI Prompt Decorator plugin to make sure that the LLM responds only to questions related to the injected RAG context. Let’s apply the following configuration:
echo '
_format_version: "3.0"
plugins:
  - name: ai-prompt-decorator
    config:
      prompts:
        append:
          - role: system
            content: Use only the information passed before the question in the user message. If no data is provided with the question, respond with "no internal data available".
' | deck gateway apply -
Validate final configuration
Now, send a request that isn’t related to the ingested content, for example:
curl "$KONNECT_PROXY_URL/anything" \
-H "Content-Type: application/json"\
-H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
--json '{
"messages": [
{
"role": "user",
"content": "Who founded the city of Ravenna?"
}
]
}'
curl "http://localhost:8000/anything" \
-H "Content-Type: application/json"\
-H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
--json '{
"messages": [
{
"role": "user",
"content": "Who founded the city of Ravenna?"
}
]
}'
You will receive the following response:
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "no internal data available",
...
}
}
]
With the following compression applied:
"compress_items": [
{
"compress_token_count": 301,
"original_token_count": 957,
"compress_value": 0.3,
"information": "Compression was performed and saved 656 tokens",
"compressor_model": "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
"msg_id": 1,
"compress_type": "rate",
"save_token_count": 656
}
]
Cleanup
Clean up Konnect environment
If you created a new control plane and want to conserve your free trial credits or avoid unnecessary charges, delete the new control plane used in this tutorial.
Destroy the Kong Gateway container
curl -Ls https://get.konghq.com/quickstart | bash -s -- -d