Made by
Kong Inc.
Supported Gateway Topologies
hybrid, db-less, traditional
Supported Konnect Deployments
hybrid, cloud-gateways, serverless
Compatible Protocols
grpc, grpcs, http, https
Minimum Version
Kong Gateway - 3.11
Tags
#ai
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The Kong AI Prompt Compressor plugin compresses retrieved chunks before sending them to a Large Language Model (LLM), reducing text length while preserving meaning. It uses the LLMLingua 2 library for fast, high-quality compression. The plugin supports:

  • Ratio-based or target token compression — for example, reduce a message to 80% of the original length or compress to 150 tokens.
  • Configurable compression ranges — for example, compress prompts under 100 tokens with a 0.8 ratio or compress them to exactly 100 tokens.
  • Selective compression using <LLMLINGUA>...</LLMLINGUA> tags to target specific sections of the prompt. These tags work only in the inject_template field of the AI RAG Injector plugin and must be used in combination with the AI Prompt Compressor.

Why use prompt compression

Efficient prompt compression helps you manage token limits, cut costs, and speed up LLM requests — all while keeping sensitive data safe and your prompts focused.

The table below outlines common use cases for the plugin and how compression helps in each.

Use case | Description
---------|------------
Token limit management | Compress verbose inputs like chat history or documents to stay within the LLM’s context window. Prevents truncation of important content.
Cost reduction | Reducing token count in prompts decreases API costs when calling large language models, especially for high-volume use cases.
Latency reduction | Smaller prompts result in faster request/response cycles, improving performance for real-time applications like voice assistants.
Data privacy | Compress or abstract sensitive or personally identifiable information to maintain privacy and comply with data protection standards.
Dynamic prompt optimization | Automatically strip verbose or low-value content before sending to the LLM, keeping the focus on what’s most relevant.

AI Prompt Compressor service

Kong provides a Docker image for the AI Prompt Compressor service, which compresses LLM prompts before sending them upstream. It uses LLMLingua 2 to reduce prompt size, which helps you manage token limits and maintain context fidelity. The service supports both HTTP and JSON-RPC APIs and is designed to work with the AI Prompt Compressor plugin in AI Gateway.

Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access.

Once you’ve received your Cloudsmith access token, run the following commands to pull and run the image:

  1. To pull images, you must first authenticate with the token provided by Kong Support:

     docker login docker.cloudsmith.io
    
  2. Docker will then prompt you for a username and password:

     Username: kong/ai-compress
     Password: <YOUR_TOKEN>
    

    This is a token-based login with read-only access: you can pull images but not push them.

  3. To pull an image, replace <image-name> and <tag> with the appropriate image and version, for example:

     docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
    
  4. You can now run the image with the following command:

     docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
    
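Once the container is running, you can check that the service is up by querying its status endpoint (described under Compression endpoints below). This assumes the default port mapping from the command above:

     curl http://localhost:8080/llm/v1/status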

Image configuration options

You can configure the AI Prompt Compressor service using environment variables. These affect model selection, hardware usage, logging, and worker behavior.

Configuration option | Description
---------------------|------------
LLMLINGUA_MODEL_NAME | Specifies the LLMLingua 2 model to use for compression. Defaults to microsoft/llmlingua-2-xlm-roberta-large-meetingbank.
LLMLINGUA_DEVICE_MAP | Device on which to run the model. Supported values include cpu, cuda, auto, and mps.
LLMLINGUA_LOG_LEVEL | Log level for the LLMLingua compression logic. Set to info, debug, or warning based on your needs.
GUNICORN_WORKERS | Number of Gunicorn worker processes (Docker deployments only). Defaults to 2.
GUNICORN_LOG_LEVEL | Log level for Gunicorn server output (Docker deployments only). Defaults to info.
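
For example, to run the image on CPU with debug logging and four workers, you can pass these variables on the docker run command line (the specific values shown are illustrative):

     docker run --rm -p 8080:8080 \
       -e LLMLINGUA_DEVICE_MAP=cpu \
       -e LLMLINGUA_LOG_LEVEL=debug \
       -e GUNICORN_WORKERS=4 \
       docker.cloudsmith.io/kong/ai-compress/service:v0.0.2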

Compression endpoints

The compressor service exposes both REST and JSON-RPC endpoints. You can use these interfaces to compress prompts, check the current status, or integrate with upstream services and plugins.

  • POST /llm/v1/compressPrompt: Compresses a prompt using either a compression ratio or a target token count. Supports selective compression via <LLMLINGUA> tags.

  • GET /llm/v1/status: Returns information about the currently loaded LLMLingua model and device settings (for example, CPU or GPU).

  • POST /: JSON-RPC endpoint that supports the llm.v1.compressPrompt method. Use this to invoke compression programmatically over JSON-RPC.
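As an illustration, a JSON-RPC call wraps the documented method name in a standard JSON-RPC 2.0 envelope. The params fields below (prompt, ratio) are assumptions for this sketch, not the documented schema; consult the service’s API reference for the exact field names:

     curl -X POST http://localhost:8080/ \
       -H "Content-Type: application/json" \
       -d '{
             "jsonrpc": "2.0",
             "id": 1,
             "method": "llm.v1.compressPrompt",
             "params": {
               "prompt": "<LLMLINGUA>...retrieved context...</LLMLINGUA>",
               "ratio": 0.8
             }
           }'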

Prompt compression options

The AI Prompt Compressor plugin offers flexible compression controls to fit different use cases. You can choose full-prompt compression, conditional strategies, or selective compression of specific parts of the prompt:

Configuration option | Description
---------------------|------------
Compression by ratio | Compress the prompt to a percentage of its original length (for example, reduce to 80%). This allows consistent shrinkage regardless of the initial size.
Compression by token count | Compress the prompt to a specific token target (for example, 150 tokens). Useful when working close to LLM context window limits.
Conditional rules | Apply different compression strategies based on prompt length. For example, compress prompts under 100 tokens using a 0.8 ratio, and compress longer prompts to a fixed token count.
Selective compression with tags | Wrap sections of the prompt in <LLMLINGUA>...</LLMLINGUA> to compress only those parts, preserving untagged content as-is.
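
For instance, an assembled prompt in which the AI RAG Injector plugin’s inject_template has wrapped the retrieved chunks in tags might look like the sketch below (the surrounding text is illustrative). Only the tagged block is compressed; the instruction and question pass through unchanged:

     Answer the question using the context below.
     <LLMLINGUA>
     ...retrieved document chunks...
     </LLMLINGUA>
     Question: What is our refund policy?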

How it works

  1. The user sends the final prompt to the AI Prompt Compressor plugin.
  2. The plugin checks the prompt for <LLMLINGUA></LLMLINGUA> tags.
    • If tags are found, only the tagged sections are sent to LLMLingua 2 for compression.
    • If no tags are found, the entire prompt is sent to LLMLingua 2 for compression.
  3. Compression is applied based on configured rules—by ratio, target token count, or conditional length-based rules.
  4. The compressed prompt is returned to the plugin.
  5. The plugin sends the compressed prompt to the Large Language Model (LLM).
  6. The LLM processes the prompt and returns the response to the user.

The diagram below illustrates how the AI Prompt Compressor plugin processes and compresses incoming prompts based on tagging and configured rules.

 
sequenceDiagram
    actor User
    participant KongAICompressor as AI Prompt Compressor Plugin
    participant LLMLingua2 as LLMLingua 2 Compressor
    participant LLM as Large Language Model

    User->>KongAICompressor: Sends final prompt
    activate KongAICompressor
    KongAICompressor->>KongAICompressor: Check for LLMLINGUA tags

    alt Tagged content found
        KongAICompressor->>LLMLingua2: Compress tagged sections
        activate LLMLingua2
        LLMLingua2-->>KongAICompressor: Return compressed sections
        deactivate LLMLingua2
    else No LLMLINGUA tags
        KongAICompressor->>LLMLingua2: Compress entire prompt
        activate LLMLingua2
        LLMLingua2-->>KongAICompressor: Return compressed prompt
        deactivate LLMLingua2
    end

    KongAICompressor->>LLM: Send compressed prompt
    deactivate KongAICompressor
    activate LLM
    LLM-->>User: Return response
    deactivate LLM
  

Rather than trimming prompts arbitrarily or risking token overflows, the AI Prompt Compressor plugin applies structured compression that preserves the essential context of user prompts. This ensures the LLM receives a well-formed, focused prompt while keeping token usage under control.
