Kong provides a Docker image for the AI Prompt Compressor service, which compresses LLM prompts before sending them upstream. It uses LLMLingua 2 to reduce prompt size, helping you manage token limits while maintaining context fidelity. The service exposes both HTTP and JSON-RPC APIs and is designed to work with the AI Prompt Compressor plugin in AI Gateway.
Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access to it.

Once you've received your Cloudsmith access token, use the following commands to authenticate and pull the image:
-   To pull images, you must first authenticate with the token provided by Kong Support:

    ```sh
    docker login docker.cloudsmith.io
    ```

    Docker then prompts you to enter a username and password:

    ```
    Username: kong/ai-compress
    Password: <YOUR_TOKEN>
    ```

    This is a token-based login with read-only access: you can pull images, but not push them.

-   To pull an image, replace `<image-name>` and `<tag>` with the appropriate image and version, for example:

    ```sh
    docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
    ```
You can now run the image:

```sh
docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
```
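
With the container running, you can confirm the service is up by calling its status endpoint (described below). For example:

```sh
# The service listens on port 8080, as mapped in the run command above.
curl http://localhost:8080/llm/v1/status
```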
You can configure the Kong Compressor Service using environment variables. These affect model selection, hardware usage, logging, and worker behavior.
| Configuration option | Description |
|---|---|
| `LLMLINGUA_MODEL_NAME` | Specifies the LLMLingua 2 model to use for compression. Defaults to `microsoft/llmlingua-2-xlm-roberta-large-meetingbank`. |
| `LLMLINGUA_DEVICE_MAP` | Device on which to run the model. Supported values include `cpu`, `cuda`, `auto`, and `mps`. |
| `LLMLINGUA_LOG_LEVEL` | Log level for the LLMLingua compression logic. Set to `info`, `debug`, or `warning` based on your needs. |
| `GUNICORN_WORKERS` | Number of Gunicorn worker processes (for Docker deployments only). Defaults to `2`. |
| `GUNICORN_LOG_LEVEL` | Log level for Gunicorn server output (for Docker deployments only). Defaults to `info`. |
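
For example, to run the service on CPU with more verbose compression logging and four workers (the specific values here are illustrative, not recommendations):

```sh
docker run --rm -p 8080:8080 \
  -e LLMLINGUA_DEVICE_MAP=cpu \
  -e LLMLINGUA_LOG_LEVEL=debug \
  -e GUNICORN_WORKERS=4 \
  docker.cloudsmith.io/kong/ai-compress/service:v0.0.2
```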
The compressor service exposes both REST and JSON-RPC endpoints. You can use these interfaces to compress prompts, check the current status, or integrate with upstream services and plugins.

-   `POST /llm/v1/compressPrompt`: Compresses a prompt using either a compression ratio or a target token count. Supports selective compression via `<LLMLINGUA>` tags. See the example requests after this list.
-   `GET /llm/v1/status`: Returns information about the currently loaded LLMLingua model and device settings (for example, CPU or GPU).
-   `POST /`: JSON-RPC endpoint that supports the `llm.v1.compressPrompt` method. Use this to invoke compression programmatically over JSON-RPC.
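
As an illustration, a REST request to `POST /llm/v1/compressPrompt` might look like the following. The body field names (`prompt`, `compression_ratio`) are assumptions for this sketch, not the confirmed schema; check the service's API reference for the exact request format:

```sh
# Hypothetical request body: field names are assumptions, not the confirmed schema.
curl -X POST http://localhost:8080/llm/v1/compressPrompt \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<LLMLINGUA>Long meeting transcript to compress...</LLMLINGUA> Question: What were the action items?",
    "compression_ratio": 0.5
  }'
```

The JSON-RPC endpoint uses the method name documented above inside a standard JSON-RPC 2.0 envelope; the `params` shape is likewise an assumption:

```sh
# JSON-RPC 2.0 call; the "params" shape is an assumption.
curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "llm.v1.compressPrompt",
    "params": {
      "prompt": "Long meeting transcript to compress...",
      "target_token": 512
    }
  }'
```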