Release date 2025/04/15
Bugfix
- Fixed an issue where AI Proxy and AI Proxy Advanced would use corrupted plugin config.
Release date 2025/03/27
Changed the serialized log key of AI metrics from ai.ai-proxy to ai.proxy to avoid conflicts with metrics generated from plugins other than AI Proxy and AI Proxy Advanced. If you are using logging plugins (for example, File Log or HTTP Log), you will have to update your metrics pipeline configurations to reflect this change.
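As an illustrative sketch only (the fields under usage and the token counts are placeholders, not taken from these release notes), a serialized log entry now nests AI metrics like this:

```yaml
# Hypothetical serialized log entry after this change:
ai:
  proxy:            # previously keyed as "ai-proxy"; update pipeline queries accordingly
    usage:          # assumed field names, shown only to illustrate the key move
      prompt_tokens: 28
      completion_tokens: 14
      total_tokens: 42
```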
Deprecated preserve mode in config.route_type. Use config.llm_format instead. The preserve mode setting will be removed in a future release.
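A minimal migration sketch, assuming a declarative config layout and that openai is an accepted config.llm_format value (check the plugin reference for your version before relying on these exact values):

```yaml
plugins:
- name: ai-proxy
  config:
    # Before (deprecated): route_type: preserve
    route_type: llm/v1/chat   # assumed replacement route_type for this example
    llm_format: openai        # assumed value; replaces the preserve behavior
```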
Added support for the boto3 SDK for the Bedrock provider, and for the Google GenAI SDK for the Gemini provider.
Added a new priority balancer algorithm, which allows setting a priority group for each upstream model.
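A hedged sketch of what a priority-balanced config could look like; the exact name and placement of the priority-group field are assumptions based on this entry, not confirmed schema:

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    balancer:
      algorithm: priority
    targets:
    - model:
        provider: openai
        name: gpt-4o
      priority: 1     # assumed field name for the target's priority group
    - model:
        provider: openai
        name: gpt-4o-mini
      priority: 2     # assumed field name for the target's priority group
```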
Added the failover_criteria configuration option, which allows retrying requests to the next upstream server in case of failure.
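A sketch of failover_criteria in use; its placement under balancer and the accepted values (error, timeout) are assumptions to be checked against the plugin schema:

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    balancer:
      failover_criteria:   # assumed placement; retry the next target on these conditions
      - error
      - timeout
```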
Added cost to tokens_count_strategy when using the lowest-usage load balancing strategy.
Added the huggingface, azure, vertex, and bedrock providers to embeddings. They can be used by the ai-proxy-advanced, ai-semantic-cache, ai-semantic-prompt-guard, and ai-rag-injector plugins.
Added support for authenticating to Bedrock services with AWS assume roles.
Added the ability to set a catch-all target in semantic routing.
Fixed an issue where the trailing path of the AI upstream URL could be empty.
Fixed an issue where the ai-proxy-advanced plugin failed to failover between providers of different formats.
Fixed an issue where identity authentication in the ai-proxy-advanced plugin failed in retry scenarios.
Release date 2024/12/12
Added support for streaming responses to the AI Proxy Advanced plugin.
Made the embeddings.model.name config field a free text entry, enabling use of a self-hosted (or otherwise compatible) model.
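Since embeddings.model.name is now free text, a self-hosted model id can be supplied directly. A sketch, assuming an OpenAI-compatible endpoint; both names below are placeholders:

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    embeddings:
      model:
        provider: openai                   # assumed OpenAI-compatible provider
        name: my-custom-embedding-model    # free text: any self-hosted model id
```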
Fixed an issue where stale plugin config was not updated in DB-less and hybrid modes.
Fixed an issue where the lowest-usage and lowest-latency strategies did not update data points correctly.