Kong AI Gateway resource sizing guidelines
The Kong AI Gateway is designed to handle high-volume inference workloads and forward requests to large language model (LLM) providers with predictable latency. This guide explains performance dimensions, capacity planning methodology, and baseline sizing guidance for AI inference traffic.
Scaling dimensions
AI inference performance depends on both token streaming latency and sustained token throughput. Unlike traditional API traffic, most latency comes from upstream models, so the gateway must be evaluated on its ability to pass through tokens efficiently.
| Performance dimension | Measured in | Performance limited by… | Description |
|---|---|---|---|
| Latency | Milliseconds | LLM TTFT and token streaming; gateway overhead is typically low relative to model time | Time to first token (TTFT) and per-token streaming latency (TPOT) dominate end-to-end latency. Gateway overhead typically adds < 10 ms. |
| Throughput | Input/output tokens per second | CPU-bound; scale workers horizontally for higher sustained token throughput | Maximum sustained input and output tokens per second processed across all requests. |
Models stream output tokens as server-sent events (SSE). Processing streamed output is more expensive per token than processing input, so capacity planning must treat input and output tokens differently.
Deployment guidance
AI Gateway scales primarily through horizontal worker expansion, not vertical tuning. Treat token throughput as the core capacity metric, and validate performance against real LLM latency profiles. Synthetic or low-latency backends will overstate capacity.
Scale horizontally for token throughput
Kong AI Gateway performance is CPU-bound on token processing. Adding workers increases sustained throughput only when concurrency and streaming behavior scale correctly.
- Add workers and nodes to increase throughput
- Validate scaling efficiency as concurrency grows
- Benchmark against real model latency and token cadence
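Token cadence is easiest to validate with a streaming probe against a real route. The following is a minimal sketch, assuming a hypothetical gateway route at http://localhost:8000/ai/chat that returns OpenAI-style SSE chunks; the chunk count is only an approximation of output tokens, but it is good enough to compare scaling runs.

```python
"""Rough token-throughput probe against an AI Gateway streaming route."""
import json
import time

import requests  # pip install requests

GATEWAY_URL = "http://localhost:8000/ai/chat"   # hypothetical route; adjust to your deployment
PROMPT = "Summarize the benefits of horizontal scaling in two sentences."

def stream_once() -> tuple[int, float, float]:
    """Return (chunks_received, time_to_first_chunk_s, total_time_s)."""
    start = time.perf_counter()
    first = None
    chunks = 0
    with requests.post(
        GATEWAY_URL,
        json={"messages": [{"role": "user", "content": PROMPT}], "stream": True},
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # SSE chunks arrive as "data: {...}" lines; the stream ends with "data: [DONE]".
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            json.loads(payload)          # ensure the chunk is well-formed
            chunks += 1
            if first is None:
                first = time.perf_counter() - start
    return chunks, first or 0.0, time.perf_counter() - start

if __name__ == "__main__":
    chunks, ttfc, total = stream_once()
    print(f"chunks={chunks} time_to_first_chunk={ttfc * 1000:.0f}ms "
          f"approx output tokens/s={chunks / max(total, 1e-9):.1f}")
```

Run the probe at increasing concurrency (for example, with a process pool or a load tool) and check that aggregate tokens/s grows roughly linearly as you add workers and nodes.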
Allocate CPU and memory for LLM workloads
Compute sizing is dictated by token processing, not request count. Memory supports configuration and streaming buffers. Persistent storage demand is minimal.
- CPU determines maximum tokens per second
- Memory must support configuration and in-memory stream buffers
Use dedicated compute instance classes
Consistent CPU performance is critical for LLM token streaming. Burstable or credit-based instances can introduce token delay spikes and unstable throughput.
- Prefer dedicated compute families (for example, AWS c5, c6g)
- Avoid burstable instances (for example, AWS t series, GCP e2, Azure B series)
Operational best practices
Effective scaling requires testing with realistic model behavior, applying safety margins, and accommodating upstream model differences.
- Benchmark with your model mix and prompt sizes
- Size for token/s, not just RPS
- Apply redundancy factor 2×–4×
- Consider provider differences (OpenAI vs Gemini)
- Test multi‑node scaling before production
Baseline benchmark results
These baseline throughput numbers reflect typical single-worker token processing under streaming LLM workloads. Use these numbers as general guidance only. Benchmark performance in your own environment and with your specific model mix.
| Benchmark dimension | Result |
|---|---|
| Output tokens/s | OpenAI path: ~1.05M tokens/s; Gemini path: ~0.78M tokens/s |
| Input tokens/s | ~4.4M tokens/s (similar for both OpenAI and Gemini) |
| Input:output ratio | ~4.2:1 – 5.6:1 |
Throughput depends on the provider, the model, and the size and structure of your prompts and responses. Benchmark with your real workload rather than relying on synthetic or idealized figures.
Capacity planning formula
With I_peak as peak input tokens/s, O_peak as peak output tokens/s, R as the input:output processing ratio (roughly 4:1 based on the baseline benchmarks), and O_w as the sustained output tokens/s a single worker can handle:
equivalent_output_load = I_peak / R + O_peak
required_workers ≈ equivalent_output_load / O_w
Apply a redundancy factor of 2×–4× to handle bursts, tokenization, and provider variability.
Quick estimate rule of thumb
- Assume a 4:1 input:output processing ratio
- Assume ~1M output tokens/s per vCPU worker
For example, at a peak of 80M input and 10M output tokens/s:
(80M / 4 + 10M) / 1M = 30 workers
→ 60–120 workers with 2×–4× redundancy
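The sketch below turns the formula and the rule of thumb into a quick calculation. The default ratio and per-worker throughput are the rule-of-thumb values above; substitute your own benchmark results.

```python
import math

def required_workers(
    peak_input_tps: float,                 # I_peak: peak input tokens/s
    peak_output_tps: float,                # O_peak: peak output tokens/s
    io_ratio: float = 4.0,                 # R: input:output processing ratio
    worker_output_tps: float = 1_000_000,  # O_w: output tokens/s per worker
    redundancy: float = 2.0,               # 2x-4x safety margin
) -> int:
    """Workers needed for a peak token load, including the redundancy factor."""
    equivalent_output_load = peak_input_tps / io_ratio + peak_output_tps
    return math.ceil(equivalent_output_load / worker_output_tps * redundancy)

# Worked example from the rule of thumb: 80M input / 10M output tokens/s at peak.
print(required_workers(80e6, 10e6, redundancy=2.0))  # 60
print(required_workers(80e6, 10e6, redundancy=4.0))  # 120
```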
Buffer and memory guidance
Inference requests often include large prompts and streamed output. Buffer sizing determines whether payloads are processed in memory or spill to disk, so tune memory settings based on prompt size and workload profile.
| Traffic profile | Typical prompt size | max_request_body_size | client_body_buffer_size |
|---|---|---|---|
| Chat apps | < 512 KiB | 2–4 MiB | 256–512 KiB |
| RAG with embeddings | 1–4 MiB | 8–16 MiB | 1–2 MiB |
| Batch / large JSON | 4–16 MiB | 16–64 MiB | 2–4 MiB |
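The table maps onto standard NGINX client-body buffering behavior, which Kong inherits: a body larger than the buffer is spooled to a temporary file, and a body above the configured maximum is rejected. The sketch below is a back-of-the-envelope check of that decision, not Kong code; the thresholds in the example are taken from the table above.

```python
# Which path a request body takes, given the two buffer settings from the table.
KIB = 1024
MIB = 1024 * KIB

def body_handling(prompt_bytes: int, buffer_size: int, max_body_size: int) -> str:
    if prompt_bytes > max_body_size:
        return "rejected (exceeds max_request_body_size)"
    if prompt_bytes > buffer_size:
        return "buffered to disk (exceeds client_body_buffer_size)"
    return "processed in memory"

# A 1.5 MiB RAG prompt with chat-app sizing spills to disk...
print(body_handling(int(1.5 * MIB), buffer_size=512 * KIB, max_body_size=4 * MIB))
# ...while RAG-profile sizing keeps it in memory.
print(body_handling(int(1.5 * MIB), buffer_size=2 * MIB, max_body_size=16 * MIB))
```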
Instance recommendations
AI Gateway benefits from high clock speed, dedicated CPU, and non-burstable compute classes. Select instance families optimized for consistent CPU throughput and avoid throttled instance types.
| Cloud | Architecture | Instance family | Notes |
|---|---|---|---|
| AWS | x86_64 | c5, c6i | Non-burstable, compute optimized |
| AWS | ARM | c6g, c7g | Graviton; cost-efficient scaling |
| GCP | x86_64 | c2-standard, c3-standard | High clock performance |
| Azure | x86_64 | Fsv2, Dasv5 | CPU-optimized dedicated compute |
Deployment sizing tiers
Cluster size depends on configured entities and sustained token throughput. Smaller environments serve team-level workloads; larger footprints handle multi-tenant platforms and enterprise AI adoption at scale.
| Size | Number of configured entities | Token throughput guidance (input / output) | Use cases |
|---|---|---|---|
| Small | < 100 services/routes | < 10M input / < 2M output tokens/s | Team workloads, prototypes, low-volume inference |
| Medium | 100–500 services/routes | 10M–60M input / 2M–10M output tokens/s | Production traffic for a single business unit |
| Large | 500–2,000 services/routes | 60M–200M input / 10M–40M output tokens/s | Central platform, multi-team AI adoption |
| XL | > 2,000 services/routes | > 200M input / > 40M output tokens/s | Enterprise AI platform, multi-tenant environments |
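To sanity-check which tier a measured workload falls into, a small lookup like the one below can help. The thresholds are copied from the table above; the function name and example numbers are purely illustrative.

```python
# Pick the smallest tier that covers all three dimensions of the workload.
TIERS = [
    # (name, max entities, max input tokens/s, max output tokens/s)
    ("Small",  100,          10e6,         2e6),
    ("Medium", 500,          60e6,         10e6),
    ("Large",  2_000,        200e6,        40e6),
    ("XL",     float("inf"), float("inf"), float("inf")),
]

def sizing_tier(entities: int, input_tps: float, output_tps: float) -> str:
    for name, max_entities, max_in, max_out in TIERS:
        if entities <= max_entities and input_tps <= max_in and output_tps <= max_out:
            return name
    return "XL"

print(sizing_tier(entities=350, input_tps=25e6, output_tps=4e6))  # Medium
```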