Kong provides a Docker image for the AI Prompt Compressor service, which compresses LLM prompts before sending them upstream. It uses LLMLingua 2 to reduce prompt size, helping you manage token limits while maintaining context fidelity. The service exposes both HTTP and JSON-RPC APIs and is designed to work with the AI Prompt Compressor plugin in AI Gateway.
Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access to it.

Once you've received your Cloudsmith access token, use the following commands to authenticate and pull the image:
-   To pull images, you must first authenticate with the token provided by Kong Support:

    ```sh
    docker login docker.cloudsmith.io
    ```

    Docker then prompts you to enter a username and password:

    ```
    Username: kong/ai-compress
    Password: <YOUR_TOKEN>
    ```

    This is a token-based login with read-only access: you can pull images, but not push them.

-   To pull an image, replace `<image-name>` and `<tag>` with the appropriate image and version, for example:

    ```sh
    docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
    ```
You can now run the image:

```sh
docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
```
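
With the container running, you can confirm the service is up by calling its status endpoint (described below). For example:

```sh
# The service listens on port 8080, as mapped in the run command above.
curl http://localhost:8080/llm/v1/status
```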
You can configure the Kong Compressor Service using environment variables. These affect model selection, hardware usage, logging, and worker behavior.
| Configuration option | Description |
|---|---|
| `LLMLINGUA_MODEL_NAME` | Specifies the LLMLingua 2 model to use for compression. Defaults to `microsoft/llmlingua-2-xlm-roberta-large-meetingbank`. |
| `LLMLINGUA_DEVICE_MAP` | Device on which to run the model. Supported values include `cpu`, `cuda`, `auto`, and `mps`. |
| `LLMLINGUA_LOG_LEVEL` | Log level for the LLMLingua compression logic. Set to `info`, `debug`, or `warning` based on your needs. |
| `GUNICORN_WORKERS` | Number of Gunicorn worker processes (for Docker deployments only). Defaults to `2`. |
| `GUNICORN_LOG_LEVEL` | Log level for Gunicorn server output (for Docker deployments only). Defaults to `info`. |
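
For example, to run the service on CPU with more verbose compression logging and four workers (the specific values here are illustrative, not recommendations):

```sh
docker run --rm -p 8080:8080 \
  -e LLMLINGUA_DEVICE_MAP=cpu \
  -e LLMLINGUA_LOG_LEVEL=debug \
  -e GUNICORN_WORKERS=4 \
  docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
```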
The compressor service exposes both REST and JSON-RPC endpoints. You can use these interfaces to compress prompts, check the current status, or integrate with upstream services and plugins.

-   `POST /llm/v1/compressPrompt`: Compresses a prompt using either a compression ratio or a target token count. Supports selective compression via `<LLMLINGUA>` tags. See the example requests after this list.
-   `GET /llm/v1/status`: Returns information about the currently loaded LLMLingua model and device settings (for example, CPU or GPU).
-   `POST /`: JSON-RPC endpoint that supports the `llm.v1.compressPrompt` method. Use this to invoke compression programmatically over JSON-RPC.
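
As an illustration, a REST request to `POST /llm/v1/compressPrompt` might look like the following. The body field names (`prompt`, `compression_ratio`) are assumptions for this sketch, not the confirmed schema; check the service's API reference for the exact request format:

```sh
# Hypothetical request body: field names are assumptions, not the confirmed schema.
curl -X POST http://localhost:8080/llm/v1/compressPrompt \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<LLMLINGUA>Long meeting transcript to compress...</LLMLINGUA> Question: What were the action items?",
    "compression_ratio": 0.5
  }'
```

The JSON-RPC endpoint uses the method name documented above inside a standard JSON-RPC 2.0 envelope; the `params` shape is likewise an assumption:

```sh
# JSON-RPC 2.0 call; the "params" shape is an assumption.
curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "llm.v1.compressPrompt",
    "params": {
      "prompt": "Long meeting transcript to compress...",
      "target_token": 512
    }
  }'
```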