Control prompt size with the AI Prompt Compressor plugin

Uses: Kong Gateway, AI Gateway, decK

Minimum version: Kong Gateway 3.11

TL;DR

Use the AI RAG Injector plugin in combination with the AI Prompt Compressor and AI Prompt Decorator plugins to retrieve relevant chunks and keep the final prompt within reasonable limits, preventing increased latency, token limit errors, and unexpected bills from LLM providers.

Prerequisites

This is a Konnect tutorial and requires a Konnect personal access token.

  1. Create a new personal access token by opening the Konnect PAT page and selecting Generate Token.

  2. Export your token to an environment variable:

     export KONNECT_TOKEN='YOUR_KONNECT_PAT'
    
  3. Run the quickstart script to automatically provision a Control Plane and Data Plane, and configure your environment:

     curl -Ls https://get.konghq.com/quickstart | bash -s -- -k $KONNECT_TOKEN --deck-output
    

    This sets up a Konnect Control Plane named quickstart, provisions a local Data Plane, and prints out the following environment variable exports:

     export DECK_KONNECT_TOKEN=$KONNECT_TOKEN
     export DECK_KONNECT_CONTROL_PLANE_NAME=quickstart
     export KONNECT_CONTROL_PLANE_URL=https://us.api.konghq.com
     export KONNECT_PROXY_URL='http://localhost:8000'
    

    Copy and paste these into your terminal to configure your session.

This tutorial requires Kong Gateway Enterprise. If you don’t have Kong Gateway set up yet, you can use the quickstart script with an enterprise license to get an instance of Kong Gateway running almost instantly.

  1. Export your license to an environment variable:

     export KONG_LICENSE_DATA='LICENSE-CONTENTS-GO-HERE'
    
  2. Run the quickstart script:

     curl -Ls https://get.konghq.com/quickstart | bash -s -- -e KONG_LICENSE_DATA 
    

    Once Kong Gateway is ready, you will see the following message:

     Kong Gateway Ready
    

decK is a CLI tool for managing Kong Gateway declaratively with state files. To complete this tutorial you will first need to install decK.

For this tutorial, you’ll need Kong Gateway entities, like Gateway Services and Routes, pre-configured. These entities are essential for Kong Gateway to function but installing them isn’t the focus of this guide. Follow these steps to pre-configure them:

  1. Run the following command:

    echo '
    _format_version: "3.0"
    services:
      - name: example-service
        url: http://httpbin.konghq.com/anything
    routes:
      - name: example-route
        paths:
        - "/anything"
        service:
          name: example-service
    ' | deck gateway apply -
    

To learn more about entities, you can read our entities documentation.

This tutorial uses OpenAI:

  1. Create an OpenAI account.
  2. Get an API key.
  3. Create a decK variable with the API key:
     export DECK_OPENAI_API_KEY='YOUR OPENAI API KEY'

To complete this tutorial, make sure you have the following:

  • A Redis Stack running and accessible from the environment where Kong is deployed.
  • Port 6379, or your custom Redis port, open and reachable from Kong.
  • Redis host set as an environment variable so the plugin can connect:

    export DECK_REDIS_HOST='YOUR-REDIS-HOST'
    

If you’re testing locally with Docker, use host.docker.internal as the host value.
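
If you don’t have a Redis Stack instance yet, one way to start one locally is with the official Redis Stack server image (a minimal sketch, assuming Docker is available; adjust the port mapping if you use a custom Redis port):

docker run -d --name redis-stack -p 6379:6379 redis/redis-stack-server:latest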

Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access to it.

Once you’ve received your Cloudsmith access token, run the following commands to pull the image:

  1. To pull images, you must first authenticate with the token provided by Support:

     docker login docker.cloudsmith.io
    
  2. Docker will then prompt you for a username and password:

     Username: kong/ai-compress
     Password: <YOUR_TOKEN>
    

    This is a token-based login with read-only access. You can pull images but not push them.

  3. Pull the image, replacing the image name and tag with the appropriate image and version if needed:

     docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
    
  4. You can now run the image with the following command:

     docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
    

To complete this tutorial, you’ll need Python (version 3.7 or later) and pip installed on your machine. You can verify both by running:

python3 --version
python3 -m pip --version

Once that’s set up, install the required packages by running the following command in your terminal:

python3 -m pip install langchain langchain_text_splitters requests
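
If you prefer to keep these packages isolated from your system Python, a minimal sketch using a virtual environment (the .venv directory name is just an example):

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install langchain langchain_text_splitters requests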

Configure the AI Proxy Advanced plugin

First, you’ll need to configure the AI Proxy Advanced plugin to proxy prompt requests to your model provider, and handle authentication:

echo '
_format_version: "3.0"
plugins:
  - name: ai-proxy-advanced
    config:
      targets:
      - route_type: llm/v1/chat
        auth:
          header_name: Authorization
          header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
        model:
          provider: openai
          name: gpt-4o
          options:
            max_tokens: 512
            temperature: 1.0
' | deck gateway apply -
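
Optionally, you can send a quick test request through the Route from the prerequisites to confirm that chat traffic is proxied to OpenAI. This is just a sanity check; the plugin injects the Authorization header for you, so the request doesn’t need your API key:

curl "$KONNECT_PROXY_URL/anything" \
     --json '{
       "messages": [
         {
           "role": "user",
           "content": "Say hello in one short sentence."
         }
       ]
     }'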

Configure the AI RAG Injector plugin

Next, configure the AI RAG Injector plugin to insert the RAG context into the user message only, and wrap it with <LLMLINGUA> tags so the AI Prompt Compressor plugin can compress it effectively.

echo '
_format_version: "3.0"
plugins:
  - name: ai-rag-injector
    config:
      fetch_chunks_count: 5
      inject_as_role: user
      inject_template: "<LLMLINGUA><CONTEXT></LLMLINGUA> | <PROMPT>"
      embeddings:
        auth:
          header_name: Authorization
          header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
        model:
          provider: openai
          name: text-embedding-3-large
      vectordb:
        strategy: redis
        redis:
          host: "${{ env "DECK_REDIS_HOST" }}"
          port: 6379
        distance_metric: cosine
        dimensions: 3072
' | deck gateway apply -

If your Redis instance runs in a separate Docker container from Kong, use host.docker.internal for vectordb.redis.host.

If you’re using a model other than text-embedding-3-large, be sure to update the vectordb.dimensions value to match the model’s embedding size.
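
To illustrate what this template does, the user message that ultimately reaches the model looks roughly like the sketch below; the exact chunk text depends on what the vector search returns for a given prompt:

<LLMLINGUA>...retrieved chunks about sharks...</LLMLINGUA> | How many species of sharks are there in the world?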

Once the plugin is created, copy its id from the decK response. Then, export it so the ingestion script can reference it later:

export PLUGIN_ID=<YOUR_PLUGIN_ID>

Replace <YOUR_PLUGIN_ID> with the actual id returned from the plugin creation API response. You’ll need this environment variable when generating the ingestion script that sends chunked content to the plugin.
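
If you no longer have the decK output at hand, one way to look the id up is to query the plugin list on the same local Admin API (port 8001) that the ingestion script below uses. This is a sketch and assumes that API is reachable from your machine:

export PLUGIN_ID=$(curl -s http://localhost:8001/plugins \
  | python3 -c 'import json,sys; print(next(p["id"] for p in json.load(sys.stdin)["data"] if p["name"] == "ai-rag-injector"))')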

Ingest data to Redis

Create an inject_template.py file by pasting the following into your terminal. This script fetches a Wikipedia article, splits the content into chunks, and sends each chunk to a local RAG ingestion endpoint.

cat <<EOF > inject_template.py
import requests
from langchain_text_splitters import RecursiveCharacterTextSplitter

plugin_id = "${PLUGIN_ID}"

def get_wikipedia_extract(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": True,
        "titles": title,
        "redirects": 1
    }

    response = requests.get(url, params=params)
    response.raise_for_status()
    data = response.json()
    pages = data.get("query", {}).get("pages", {})

    for page_id, page in pages.items():
        if "extract" in page:
            return page["extract"]
    return None

title = "Shark"
text = get_wikipedia_extract(title)

if not text:
    print(f"Failed to retrieve Wikipedia content for: {title}")
    exit()

text = f"# {title}\\n\\n{text}"

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.create_documents([text])

print(f"Injecting {len(docs)} chunks...")

for doc in docs:
    response = requests.post(
        f"http://localhost:8001/ai-rag-injector/{plugin_id}/ingest_chunk",
        data={"content": doc.page_content}
    )
    print(response.status_code, response.text)
EOF

Now, run this script with Python:

python3 inject_template.py

If successful, your terminal will print the following:

Injecting 91 chunks...
200 {"metadata":{"chunk_id":"c55d8869-6858-496f-83d2-abcdefghij12","ingest_duration":615,"embeddings_tokens_count":2}}
200 {"metadata":{"chunk_id":"fc7d4fd7-21e0-443e-9504-abcdefghij13","ingest_duration":779,"embeddings_tokens_count":231}}
200 {"metadata":{"chunk_id":"8d2aebe1-04e4-40c7-b16f-abcdefghij14","ingest_duration":569,"embeddings_tokens_count":184}}

Wait until all 91 chunks have been injected before moving on to the next step.

Configure the AI Prompt Compressor plugin

Now, you can configure the AI Prompt Compressor plugin to apply compression to the wrapped RAG context using defined token ranges and compression settings.

echo '
_format_version: "3.0"
plugins:
  - name: ai-prompt-compressor
    config:
      compression_ranges:
      - max_tokens: 100
        min_tokens: 20
        value: 0.8
      - max_tokens: 1000000
        min_tokens: 100
        value: 0.3
      compressor_type: rate
      compressor_url: http://compress-service:8080
      keepalive_timeout: 60000
      log_text_data: false
      stop_on_error: true
      timeout: 10000
' | deck gateway apply -
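
With compressor_type set to rate, each value is roughly the fraction of the original tokens to keep: wrapped contexts of 20 to 100 tokens are compressed at a rate of 0.8, and contexts of 100 tokens or more at 0.3. This matches the validation output later in this tutorial, where a 700-token context compressed at 0.3 comes out at 244 tokens.

Also note that config.compressor_url points at http://compress-service:8080, so the Compressor service must be resolvable under that hostname from the Kong Gateway container. A minimal sketch of one way to achieve this, assuming Kong runs in Docker and <your-kong-network> is a placeholder for the Docker network your Data Plane container uses (alternatively, set compressor_url to http://host.docker.internal:8080):

docker run --rm --name compress-service --network <your-kong-network> \
  -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2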

Log prompt compression

Before we send requests to our LLM, we need to set up the HTTP Log plugin to check how many tokens we’ve saved with our configuration. First, create an HTTP Log plugin:

echo '
_format_version: "3.0"
plugins:
  - name: http-log
    service: example-service
    config:
      http_endpoint: http://host.docker.internal:9999/
      headers:
        Authorization: Bearer some-token
      method: POST
      timeout: 3000
' | deck gateway apply -

Let’s run a simple log collector script that collects logs on port 9999. Copy and run this snippet in your terminal:

cat <<EOF > log_server.py
from http.server import BaseHTTPRequestHandler, HTTPServer
import datetime

LOG_FILE = "kong_logs.txt"

class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        timestamp = datetime.datetime.now().isoformat()

        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length).decode('utf-8')

        log_entry = f"{timestamp} - {post_data}\n"
        with open(LOG_FILE, "a") as f:
            f.write(log_entry)

        print("="*60)
        print(f"Received POST request at {timestamp}")
        print(f"Path: {self.path}")
        print("Headers:")
        for header, value in self.headers.items():
            print(f"  {header}: {value}")
        print("Body:")
        print(post_data)
        print("="*60)

        # Send OK response
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

if __name__ == '__main__':
    server_address = ('', 9999)
    httpd = HTTPServer(server_address, LogHandler)
    print("Starting log server on http://0.0.0.0:9999")
    httpd.serve_forever()
EOF

Now, run this script with Python in a separate terminal, since it keeps running to collect incoming logs:

python3 log_server.py

If the script starts successfully, you’ll see the following message in your terminal:

Starting log server on http://0.0.0.0:9999
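
The handler also appends every log entry to kong_logs.txt, so you can review incoming logs from another terminal, for example:

tail -f kong_logs.txt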

Validate your configuration

When sending the following request:

curl "$KONNECT_PROXY_URL/anything" \
     -H "Content-Type: application/json"\
     -H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
     --json '{
       "messages": [
         {
           "role": "user",
           "content": "How many species of sharks are there in the world?"
         }
       ]
     }'
curl "http://localhost:8000/anything" \
     -H "Content-Type: application/json"\
     -H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
     --json '{
       "messages": [
         {
           "role": "user",
           "content": "How many species of sharks are there in the world?"
         }
       ]
     }'

You should see output like this at your HTTP Log plugin endpoint (the log server), showing how many tokens were saved through compression:

"compressor": {
  "compress_items": [
    {
      "compress_token_count": 244,
      "original_token_count": 700,
      "compress_value": 0.3,
      "information": "Compression was performed and saved 456 tokens",
      "compressor_model": "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
      "msg_id": 1,
      "compress_type": "rate",
      "save_token_count": 456
    }
  ],
  "duration": 1092
}
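
Here, save_token_count is simply original_token_count minus compress_token_count (700 - 244 = 456 in this example), compress_value echoes the rate that was applied, and duration reports how long the compression call took.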

Govern your LLM pipeline

You can use the AI Prompt Decorator plugin to make sure that the LLM responds only to questions related to the injected RAG context. Let’s apply the following configuration:

echo '
_format_version: "3.0"
plugins:
  - name: ai-prompt-decorator
    config:
      prompts:
        append:
        - role: system
          content: Use only the information passed before the question in the user message.
            If no data is provided with the question, respond with "no internal data
            available"
' | deck gateway apply -

Validate final configuration

Now, on any request not related to the ingested content, for example:

curl "$KONNECT_PROXY_URL/anything" \
     -H "Content-Type: application/json"\
     -H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
     --json '{
       "messages": [
         {
           "role": "user",
           "content": "Who founded the city of Ravenna?"
         }
       ]
     }'
curl "http://localhost:8000/anything" \
     -H "Content-Type: application/json"\
     -H "Authorization: Bearer $DECK_OPENAI_API_KEY" \
     --json '{
       "messages": [
         {
           "role": "user",
           "content": "Who founded the city of Ravenna?"
         }
       ]
     }'

You will receive a response like the following:

"choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "no internal data available",
        ...
      }
    }
]

With the following compression applied:

"compress_items": [
  {
    "compress_token_count": 301,
    "original_token_count": 957,
    "compress_value": 0.3,
    "information": "Compression was performed and saved 656 tokens",
    "compressor_model": "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    "msg_id": 1,
    "compress_type": "rate",
    "save_token_count": 656
  }
]

Cleanup

If you created a new control plane and want to conserve your free trial credits or avoid unnecessary charges, delete the new control plane used in this tutorial.

curl -Ls https://get.konghq.com/quickstart | bash -s -- -d