Basic LLM Routing
Overview
Route chat requests to any supported LLM provider through Kong AI Gateway, with per-app Consumer authentication and per-request model selection. By the end of this recipe, you will have a single /basic-llm-routing endpoint that accepts OpenAI-format requests carrying a per-Consumer API key, validates the key against a Consumer credential local to Kong, and routes the request to one of two upstream models based on the model field in the request body.
Prerequisites
Kong Konnect
This tutorial uses Kong Konnect. You will provision a recipe-scoped Control Plane and local Data Plane via the quickstart script.
- Create a new personal access token by opening the Konnect PAT page and selecting Generate Token.
- Export your token. The same token is reused later for kongctl commands:

  export KONNECT_TOKEN='YOUR_KONNECT_PAT'

- Set the recipe-scoped Control Plane name and run the quickstart script:

  export KONNECT_CONTROL_PLANE_NAME='basic-llm-routing-recipe'
  curl -Ls https://get.konghq.com/quickstart | bash -s -- -k $KONNECT_TOKEN --deck-output

  This provisions a Konnect Control Plane named basic-llm-routing-recipe, a local Data Plane connected to it, and prints export lines for the rest of the session vars. Paste those into your shell when prompted.
kongctl + decK
This tutorial uses kongctl and decK to manage Kong configuration.
- Install kongctl from developer.konghq.com/kongctl.
- Install decK version 1.43 or later from docs.konghq.com/deck.
-
Verify both are installed:
kongctl version
deck version
AI Credentials
Pick the provider you want to route to and export its credentials. The same credentials are reused by every apply tab below.
Python 3.11+
The demo script requires Python 3.11 or later. Set up an isolated environment:
python3 -m venv .venv
source .venv/bin/activate
pip install 'openai>=1.0.0'
The problem
Most teams start by integrating LLM providers directly: import the provider’s SDK, embed API keys in environment variables, and call the provider’s API from application code. This works for a single service talking to a single provider, but breaks down as usage grows.
- Provider credential blast radius. Every service that makes LLM calls needs its own copy of the provider’s API key. A leaked key affects every team using the provider, and rotating a key requires coordinated redeploys across every service that holds it. There is no per-app or per-team credential to revoke independently.
- No client identity at the edge. Without an authentication layer between the client and the provider, the gateway cannot attribute usage to a tenant, enforce per-tenant quotas, or revoke access for a single misbehaving app.
- Provider-specific auth and request formats. Each provider uses a different authentication mechanism: OpenAI expects Authorization: Bearer sk-..., Anthropic uses x-api-key: ..., AWS Bedrock requires SigV4 request signing, Azure uses api-key: ... with instance-specific endpoints. Beyond auth, each provider has its own request and response body shape. Switching providers means rewriting auth and translation logic, not swapping a key.
- Coarse model selection. Most production workloads need to route different requests to different models: a cheap model for simple completions and a stronger model for hard ones. But provider SDKs expose this only as a per-request model parameter pointing at a provider-specific identifier. Hardcoding model IDs across application code makes it hard to swap models or absorb model version changes without coordinated redeploys.
The root issue is coupling. Application code is bound to provider auth, provider request format, and provider model identifiers, and there is no shared layer where a platform team can enforce identity, quotas, or routing policy.
The solution
Kong AI Gateway inserts a single Service and Route between clients and providers. The Route authenticates each request against a per-app credential, picks the upstream model based on a model field in the request body so clients can choose between tiers without changing endpoints, and injects the provider’s credentials and translates request/response formats so client apps never hold a provider key. The result is one endpoint that gives the platform team a place to enforce identity, routing, and credential policy without coupling client code to any provider.
sequenceDiagram
participant C as Client
participant K as Kong AI Gateway
participant L as LLM Provider
C->>K: POST /basic-llm-routing (apikey, model: fast or smart)
activate K
K->>K: key-auth — validate apikey, attach Consumer
K->>K: ai-proxy-advanced — match model_alias, inject provider auth
K->>L: Forwarded request (translated to provider native format)
activate L
L-->>K: Native response
deactivate L
K-->>C: OpenAI-format response (+ X-Kong-LLM-Model)
deactivate K
| Component | Responsibility |
|---|---|
| Client application | Sends OpenAI-format requests with an apikey header that identifies the Consumer, plus model: fast or model: smart |
| Key Auth Plugin | Looks up the API key against registered Consumer credentials, attaches the matching Kong Consumer to the request |
| AI Proxy Advanced Plugin | Matches the request’s model field to a target’s model_alias, injects provider credentials, translates the request body, and routes to the upstream provider |
| Kong Consumer | Identity attached to authenticated requests, used for rate limiting, ACLs, and analytics attribution |
| LLM provider | Processes the prompt and returns a completion |
How it works
A request flowing through Kong is processed in three stages: authentication, routing, and proxying.
- A client sends a chat completion request (OpenAI format) to /basic-llm-routing with an apikey header and a model field set to either fast or smart.
- The Key Auth Plugin reads the apikey header and looks the key up in Kong’s Consumer credential store. If the key is missing or unknown, Kong short-circuits with 401 before any upstream call. On a match, the Plugin attaches the matching Consumer to the request, which is what downstream Plugins and analytics use to attribute usage.
- The AI Proxy Advanced Plugin reads the model field from the request body, finds the target whose model_alias matches, and selects that target.
- The Plugin strips the client’s apikey header, injects the upstream provider’s credentials from its configuration, and (if the upstream uses a different format) translates the OpenAI-format body to the provider’s native format.
- Kong forwards the request to the LLM provider’s API endpoint, normalizes the response back to OpenAI format, and returns it to the client with X-Kong-LLM-Model (the upstream model that served the request) and latency headers attached.
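The three stages can be sketched in plain Python. This is illustrative only: CREDENTIALS and TARGETS mirror the values this recipe configures, not Kong's internal implementation.

```python
# Conceptual sketch of the authenticate/route/proxy pipeline (not Kong internals).
CREDENTIALS = {"demo-api-key": "demo-app"}   # apikey -> Consumer username
TARGETS = {                                   # model_alias -> upstream model
    "fast": "openai/gpt-4o-mini",
    "smart": "openai/gpt-4o",
}

def handle(headers: dict, body: dict) -> tuple[int, dict, dict]:
    """Return (status, response_headers, upstream_headers) for one request."""
    # Stage 1: authentication. Unknown keys short-circuit with 401
    # before any upstream call is made.
    consumer = CREDENTIALS.get(headers.get("apikey", ""))
    if consumer is None:
        return 401, {}, {}
    # Stage 2: routing. Match the body's model field against target aliases;
    # an unmatched alias falls back to the load-balancing algorithm
    # (modeled here as "first target").
    alias = body.get("model", "")
    upstream_model = TARGETS.get(alias, next(iter(TARGETS.values())))
    # Stage 3: proxying. Strip the client's key, inject provider credentials,
    # and attach the Consumer identity for the upstream.
    upstream_headers = {k: v for k, v in headers.items() if k != "apikey"}
    upstream_headers["Authorization"] = "Bearer <provider-key>"  # injected by Kong
    upstream_headers["X-Consumer-Username"] = consumer
    return 200, {"X-Kong-LLM-Model": upstream_model}, upstream_headers

status, resp, _ = handle({"apikey": "demo-api-key"}, {"model": "fast"})
print(status, resp["X-Kong-LLM-Model"])  # 200 openai/gpt-4o-mini
```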
Key Auth: API key authentication and Consumer mapping
The Key Auth Plugin sits in front of the AI Proxy Advanced Plugin and gates every request. Each Consumer is registered with one or more API keys in the keyauth_credentials block. When a request arrives, the Plugin reads the configured header (apikey), looks the key up in Kong’s Consumer credential store, and attaches the matching Consumer identity to the request. That identity is what downstream Plugins like rate limiters and analytics use to attribute usage. The Plugin scales naturally to multi-tenant scenarios. Add a Consumer per app or per team, each with its own key.
Configuration details
- name: key-auth
config:
key_names:
- apikey
hide_credentials: true
key_names: [apikey]. The headers (or query parameters) the Plugin looks in for the API key. The recipe uses apikey because the Key Auth Plugin performs an exact string match on the header value and does not inspect Authorization for Bearer tokens. The OpenAI SDK’s api_key field always serializes as Authorization: Bearer <key>, which Kong would read as the literal string Bearer <key> and fail to match against any stored credential. The “Try it out” section below points at a pre-function pattern that bridges the SDK’s Bearer token to the apikey header server-side; the Authenticate OpenAI SDK clients with Key Auth guide has the full pattern.
hide_credentials: true. Strips the API key from the request before forwarding upstream. The provider never sees the Consumer’s API key. This is a 3.14 default but the recipe sets it explicitly for clarity and to remain portable to older Gateway versions.
Anonymous fallback. Set anonymous: <consumer-id> to let unauthenticated requests fall through to a designated “anonymous” Consumer with their own restricted policies, instead of returning 401. Useful for public/free-tier endpoints. See the key-auth reference for the full set of options.
Scaling to a real IdP. When the platform is ready for end-user identity instead of static API keys, swap key-auth for openid-connect and map JWT claims to Consumers. Application code only changes the auth header it sends; the rest of this recipe (model aliases, ai-proxy-advanced targets, Consumer mappings) stays put. See the claude-code-sso recipe for an end-to-end example with Okta.
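In decK terms, registering a Consumer with a key is a short sketch like the one below. The demo-app username and demo-api-key value match the ones used by the demo script and Konnect sections of this recipe; the exact tag layout of the applied config may differ.

```yaml
consumers:
  - username: demo-app
    tags:
      - basic-llm-routing-recipe
    keyauth_credentials:
      - key: demo-api-key
```

Adding a second app is a second consumers entry with its own key; no Plugin changes are needed.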
AI Proxy Advanced: model alias routing and provider translation
The AI Proxy Advanced Plugin sits behind the Key Auth Plugin and handles everything from the model-selection decision through the upstream call. The recipe configures two targets, each tagged with a model_alias. When a request arrives, the Plugin reads the model field from the request body, finds the target whose alias matches (fast or smart), and uses that target’s model.name and auth configuration. This single Plugin replaces what would otherwise require per-provider SDKs, hand-rolled credential management, and per-model client logic.
Configuration details
- name: ai-proxy-advanced
config:
max_request_body_size: 10485760
response_streaming: allow
targets:
- route_type: llm/v1/chat
auth:
header_name: Authorization
header_value: ${{ env "DECK_OPENAI_TOKEN" }}
logging:
log_statistics: true
log_payloads: true
model:
model_alias: fast
provider: openai
name: ${{ env "DECK_CHAT_MODEL_1" }}
- route_type: llm/v1/chat
auth:
header_name: Authorization
header_value: ${{ env "DECK_OPENAI_TOKEN" }}
logging:
log_statistics: true
log_payloads: true
model:
model_alias: smart
provider: openai
name: ${{ env "DECK_CHAT_MODEL_2" }}
model.model_alias. The client-facing name for the target. When the request body’s model field equals an alias, the Plugin routes to that target. With aliases configured, Kong uses alias matching as the primary routing decision. If no alias matches, the Plugin falls back to the configured load-balancing algorithm.
route_type: llm/v1/chat. Selects the chat-completions translation path. Kong accepts an OpenAI-format chat-completion body and converts it to whatever the upstream provider expects (Anthropic’s messages API, Bedrock’s invoke-model body, etc.). The response is normalized back to OpenAI format.
auth. Kong holds provider credentials in the Plugin config and injects them into every upstream request. Set auth.allow_override: true if you want client-provided credentials to pass through to the provider instead, useful when clients manage their own provider keys and Kong is purely a routing layer.
logging.log_statistics. When enabled, Kong appends token usage data (prompt_tokens, completion_tokens, total_tokens) to any attached logging Plugin’s output. Useful for cost attribution.
logging.log_payloads. When enabled, the full request and response bodies are included in the output of any attached logging Plugin. Whether to enable this depends on your organization’s observability and compliance requirements.
model.name. The upstream model identifier. With aliases in play, this is the actual provider model that serves the request when its alias is selected. Change it and re-apply to swap models without changing client code.
max_request_body_size and response_streaming. The recipe sets a 10 MB request limit (large enough for typical conversation contexts and modest RAG injections) and allows streaming responses (the natural choice for interactive chat). Tighten or relax both based on the workload you expect.
Alternative configurations worth knowing about:
- llm_format. The recipe uses the default (openai), which accepts OpenAI-format requests and normalizes all provider responses back to OpenAI format. Set llm_format to a provider’s native format to pass requests through without transformation. Useful when you already have code using a provider’s SDK or need provider-specific features that do not map to the OpenAI format. Native format only supports the matching provider; you cannot route across providers with different native formats on a single Plugin. See the ai-proxy-advanced reference for the supported native formats.
- Routing strategies beyond aliases. This recipe routes by the model field in the body. The same Plugin also supports routing by request header (via Route or Service-level routing in front of the Plugin), by path (separate Routes per model), and by load-balancing algorithm across targets that share an alias. See the ai-proxy-advanced reference for the full set of balancer algorithms and routing strategies.
- Additional route types. A single Plugin instance can have multiple targets for different route types, each with its own model and auth configuration. Beyond llm/v1/chat, the Plugin supports additional route types for embeddings, completions, responses, realtime, and multimodal traffic. See the ai-proxy-advanced reference for the current list.
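As a sketch, an embeddings target added alongside the chat targets could look like this. The embed alias and the text-embedding-3-small model name are illustrative assumptions, not part of the recipe; check the ai-proxy-advanced reference for the exact fields your Gateway version supports.

```yaml
targets:
  - route_type: llm/v1/embeddings
    auth:
      header_name: Authorization
      header_value: ${{ env "DECK_OPENAI_TOKEN" }}
    model:
      model_alias: embed              # hypothetical alias for this sketch
      provider: openai
      name: text-embedding-3-small   # assumed embedding model
```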
Production credentials. This recipe stores the Consumer API key directly in the declarative configuration and the LLM provider credentials in environment variables for simplicity. In production, use Kong Vaults to reference both from your preferred secret manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Azure Key Vault) instead.
Example response
The same OpenAI-format request goes through Kong. The header that proves alias routing happened is X-Kong-LLM-Model, which echoes the upstream model the request was routed to. Two requests with the same body but different model values land on different upstream models:
Request body (identical for both calls, only the model field changes):
{
"model": "fast",
"messages": [
{ "role": "user", "content": "What is the capital of France?" }
]
}
Response headers from the model: "fast" call:
HTTP/1.1 200 OK
X-Kong-LLM-Model: openai/gpt-4o-mini
X-Kong-Upstream-Latency: 312
X-Kong-Proxy-Latency: 6
Response headers from the model: "smart" call:
HTTP/1.1 200 OK
X-Kong-LLM-Model: openai/gpt-4o
X-Kong-Upstream-Latency: 891
X-Kong-Proxy-Latency: 6
Kong adds the following response headers:
| Header | Description |
|---|---|
| X-Kong-LLM-Model | Upstream model that served the request, prefixed with the provider name and resolved from the matched model_alias |
| X-Kong-Upstream-Latency | Time (ms) Kong spent waiting for the provider to respond |
| X-Kong-Proxy-Latency | Time (ms) Kong spent processing the request (excluding upstream) |
Kong attaches X-Consumer-Username and related headers to the upstream request (so the LLM provider sees who’s calling) but does not echo them back to the downstream client. Per-Consumer attribution shows up in Konnect’s analytics views. See “Explore in Konnect” below.
Apply the Kong configuration
The following configuration creates a Kong Gateway Service and Route at /basic-llm-routing, attaches the key-auth Plugin to identify Consumers via the apikey header, and attaches the ai-proxy-advanced Plugin with two targets to handle alias routing, credential injection, and format translation. All resources are scoped using select_tags and a kongctl namespace so they can be cleanly torn down without affecting other configurations on the same Control Plane. See the kongctl documentation for more on federated configuration management.
First, adopt the quickstart Control Plane into a kongctl namespace so the apply commands below can manage it.
kongctl adopt control-plane "${KONNECT_CONTROL_PLANE_NAME}" \
--namespace "${KONNECT_CONTROL_PLANE_NAME}" \
--pat "${KONNECT_TOKEN}"
Adoption stamps the KONGCTL-namespace label on the Control Plane.
Provider credentials are exported once during Prerequisites. Each tab below only sets the model env vars (which are recipe-specific) and runs the apply.
Try it out
The demo script makes three calls. The first two send the same prompt with different model values (fast then smart) and print the X-Kong-LLM-Model header so you can confirm Kong routed each request to a different upstream model. The third call presents an invalid API key and shows Kong rejecting it with 401 before any upstream call.
The demo passes the API key via default_headers because the OpenAI SDK reserves api_key for the Authorization: Bearer header. To let clients pass the key through api_key directly, attach a pre-function Plugin that copies the Bearer token to the apikey header server-side. See Authenticate OpenAI SDK clients with Key Auth for the pattern.
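One possible shape of that pre-function is sketched below. This is an assumption, not the canonical version from the linked guide: the Route name basic-llm-routing is assumed, and the Lua uses standard Kong PDK calls (kong.request.get_header, kong.service.request.set_header).

```yaml
plugins:
  - name: pre-function
    route: basic-llm-routing   # assumed Route name
    config:
      access:
        - |
          -- Copy "Authorization: Bearer <key>" into the apikey header
          -- so key-auth (which runs later in the access phase) can match it.
          local auth = kong.request.get_header("Authorization")
          if auth and not kong.request.get_header("apikey") then
            local token = auth:match("^[Bb]earer%s+(.+)$")
            if token then
              kong.service.request.set_header("apikey", token)
            end
          end
```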
Create the demo script:
cat <<'EOF' > demo.py
"""Basic LLM routing demo. See README for context."""
import os
import sys
import time
from openai import APIStatusError, OpenAI
PROXY_URL = os.getenv("PROXY_URL", "http://localhost:8000")
API_KEY = "demo-api-key"
PROMPT = "What is the capital of France?"
# ANSI color codes. Disabled when stdout isn't a TTY or NO_COLOR is set.
_USE_COLOR = sys.stdout.isatty() and "NO_COLOR" not in os.environ
def _c(code: str, s: str) -> str:
return f"\033[{code}m{s}\033[0m" if _USE_COLOR else s
BOLD = lambda s: _c("1", s)
DIM = lambda s: _c("2", s)
GREEN = lambda s: _c("32", s)
CYAN = lambda s: _c("36", s)
RED = lambda s: _c("31", s)
def make_client(api_key: str) -> OpenAI:
"""Construct an OpenAI client that sends the given API key in the apikey header."""
return OpenAI(
base_url=f"{PROXY_URL}/basic-llm-routing",
api_key="unused", # required by the SDK; Kong reads the apikey header instead
default_headers={"apikey": api_key},
)
def call(client: OpenAI, model_alias: str) -> None:
"""Send one chat request and print the model Kong routed it to."""
print(f"\n{BOLD('[REQUEST]')} model={model_alias!r} prompt={PROMPT!r}")
start_ms = round(time.time() * 1000)
try:
raw = client.chat.completions.with_raw_response.create(
model=model_alias,
messages=[{"role": "user", "content": PROMPT}],
)
except APIStatusError as e:
elapsed_ms = round(time.time() * 1000) - start_ms
print(f"{RED(BOLD('[BLOCKED]'))} {RED(BOLD(str(e.status_code)))} {e.message} ({elapsed_ms}ms)")
return
elapsed_ms = round(time.time() * 1000) - start_ms
completion = raw.parse()
upstream_latency = raw.headers.get("x-kong-upstream-latency", ".")
proxy_latency = raw.headers.get("x-kong-proxy-latency", ".")
upstream_model = raw.headers.get("x-kong-llm-model", ".")
answer = completion.choices[0].message.content
print(f"[RESPONSE] {DIM(answer)}")
# Routed-to model is the headline of this demo — make it pop.
print(f"{GREEN(BOLD('[ROUTED TO]'))} alias={model_alias!r} -> upstream model={CYAN(BOLD(upstream_model))}")
print(f"[LATENCY] {DIM(f'upstream={upstream_latency}ms proxy={proxy_latency}ms total={elapsed_ms}ms')}")
def section(title: str) -> None:
bar = "=" * 70
print(f"\n{bar}\n{BOLD(title)}\n{bar}")
def main() -> None:
section("1. Same client, same prompt, two model aliases")
client = make_client(API_KEY)
call(client, "fast")
call(client, "smart")
section("2. Invalid API key. Kong rejects before reaching the upstream provider")
bad_client = make_client("not-a-real-key")
call(bad_client, "fast")
section("Done.")
if __name__ == "__main__":
try:
main()
except KeyboardInterrupt:
sys.exit(130)
EOF
Run it:
python demo.py
Example output (using the OpenAI tab’s model env vars):
======================================================================
1. Same client, same prompt, two model aliases
======================================================================
[REQUEST] model='fast' prompt='What is the capital of France?'
[RESPONSE] The capital of France is Paris.
[ROUTED TO] alias='fast' -> upstream model='openai/gpt-4o-mini'
[LATENCY] upstream=1311ms proxy=12ms total=1488ms
[REQUEST] model='smart' prompt='What is the capital of France?'
[RESPONSE] The capital of France is Paris.
[ROUTED TO] alias='smart' -> upstream model='openai/gpt-4o'
[LATENCY] upstream=1071ms proxy=4ms total=1083ms
======================================================================
2. Invalid API key. Kong rejects before reaching the upstream provider
======================================================================
[REQUEST] model='fast' prompt='What is the capital of France?'
[BLOCKED] 401 Error code: 401 - {'message': 'No credentials found for given apikey'} (14ms)
======================================================================
Done.
======================================================================
What happened
- Both aliases resolved on the same Route. The OpenAI SDK sent two POST /basic-llm-routing requests with identical bodies except for the model field (fast vs smart). The X-Kong-LLM-Model header on each response shows the actual upstream model Kong routed to: openai/gpt-4o-mini for fast and openai/gpt-4o for smart. No client code change, no separate endpoint, just a different value in the request body.
- Consumer identity attached to every successful request. Kong matched the apikey header to the demo-app Consumer’s credential and made that identity available to downstream Plugins. (X-Consumer-Username is added to the upstream request, not the downstream response. See the Konnect analytics views in the next subsection for per-Consumer attribution.) The Consumer is the unit you would attach rate limits, ACLs, and quota policies to.
- Kong enforced auth before any upstream call. The third request used a key that no Consumer holds. Kong’s Key Auth Plugin rejected it in about 14 ms, well below normal upstream latency. The provider was never contacted, no provider quota was consumed, and the failure surfaced as a clean 401 to the client.
- Provider credentials never left Kong. The OpenAI SDK only ever held the Consumer’s API key. The Bearer sk-... provider credential lived in DECK_OPENAI_TOKEN on the Kong side and was injected into the upstream call by the AI Proxy Advanced Plugin.
Explore in Konnect
Open Konnect and navigate to API Gateway → Gateways → basic-llm-routing-recipe. The recipe created the following resources on this Control Plane:
- Gateway services → basic-llm-routing: the Service the recipe registered. Its detail page has tabs for Configuration, Routes, Plugins, and Analytics.
- Routes tab: the /basic-llm-routing Route, scoped by the basic-llm-routing-recipe select_tags you used at apply time.
- Plugins tab: two Plugin instances, basic-llm-routing-auth (key-auth) and basic-llm-routing-proxy (ai-proxy-advanced). Open the AI Proxy Advanced Plugin to see the two targets and their model_alias values.
- Consumers → demo-app: the Consumer the API key maps to.
The Analytics tab on the Gateway service shows analytics tied to this recipe, including request counts, error rates, average latency, and a request-over-time chart. For a deeper dive into these analytics, plus platform-wide analytics across every Control Plane, head to the Observability L1 menu in Konnect.
Variations and next steps
Swap models by changing one env var. Update DECK_CHAT_MODEL_1 or DECK_CHAT_MODEL_2 to a different model on the same provider and re-apply. Client code stays the same: model="fast" and model="smart" keep working; they just resolve to different upstream models. This is the most common production use of model_alias: an ops team can move the claude-sonnet alias from a version-pinned model ID like claude-sonnet-4-5-20250929 to a newer pinned version once it’s vetted, without coordinating with every team that consumes the alias. Stay within the same model class, though: swapping fast from a small model to a large one is a behavioral change clients should be aware of.
Add per-Consumer rate limits. With key-auth mapping requests to Consumers, attach the ai-rate-limiting-advanced Plugin to apply token quotas per Consumer or per Consumer Group. Each app holds its own API key, so each app gets its own budget. See the llm-cost-optimization recipe for a worked example of cost-based tiered rate limiting.
Switch to OpenID Connect for production identity. Static API keys are simple but they are not user identities. When the platform integrates Okta, Keycloak, Auth0, or another OIDC provider, swap the key-auth Plugin for openid-connect and let JWT claims map users to Consumers automatically based on roles or team membership. The rest of the recipe (model aliases, ai-proxy-advanced targets, Consumer mappings) stays put. See the claude-code-sso recipe for an end-to-end example with Okta.
Switch providers. Select a different provider tab above and re-apply. The client interface, including the apikey header auth and the model: "fast" / model: "smart" aliases, does not change. For setups that route to multiple providers from the same Plugin instance, see the AI Proxy Advanced load balancing documentation.
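As a rough sketch, an Anthropic variant of the fast target might look like the snippet below. The DECK_ANTHROPIC_TOKEN env var, the model name, and the anthropic_version value are assumptions made for illustration; use the provider tab above and the ai-proxy-advanced reference for the exact fields.

```yaml
- route_type: llm/v1/chat
  auth:
    header_name: x-api-key            # Anthropic's auth header
    header_value: ${{ env "DECK_ANTHROPIC_TOKEN" }}
  logging:
    log_statistics: true
    log_payloads: true
  model:
    model_alias: fast                 # client-facing alias is unchanged
    provider: anthropic
    name: claude-3-5-haiku-20241022   # assumed model ID
    options:
      anthropic_version: "2023-06-01"
```

The client keeps sending model: "fast" with its apikey header; only Kong's target config changes.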
Cleanup
The recipe scoped all resources with select_tags and a kongctl namespace, so this teardown removes only this recipe’s configuration. Tear down the local Data Plane and delete the Control Plane from Konnect:
export KONNECT_CONTROL_PLANE_NAME='basic-llm-routing-recipe' && curl -Ls https://get.konghq.com/quickstart | bash -s -- -d -k $KONNECT_TOKEN