Kong AI Gateway resource sizing guidelines

Minimum version: Kong Gateway 3.12
Incompatible with: Konnect

The Kong AI Gateway is designed to handle high‑volume inference workloads and forward requests to large language model (LLM) providers with predictable latency. This guide explains performance dimensions, capacity planning methodology, and baseline sizing guidance for AI inference traffic.

Scaling dimensions

AI inference performance depends on both token streaming latency and sustained token throughput. Unlike traditional API traffic, most end-to-end latency comes from the upstream model, so the gateway must be evaluated on its ability to pass tokens through efficiently.

| Performance dimension | Measured in | Performance limited by | Description |
|---|---|---|---|
| Latency | Milliseconds | LLM TTFT and token streaming; gateway overhead is typically low relative to model time | Time to first token (TTFT) and per-token streaming latency (TPOT) dominate end-to-end latency. Gateway overhead typically adds < 10 ms. |
| Throughput | Input/output tokens per second | CPU; scale workers horizontally for higher sustained token throughput | Maximum sustained input and output tokens per second processed across all requests. |

Models stream output tokens as server-sent events (SSE). Processing streamed output is more expensive per token than processing input, so capacity planning must treat input and output tokens differently.

Deployment guidance

AI Gateway scales primarily through horizontal worker expansion, not vertical tuning. Treat token throughput as the core capacity metric, and validate performance against real LLM latency profiles. Synthetic or low-latency backends will overstate capacity.

Scale horizontally for token throughput

Kong AI Gateway performance is CPU-bound on token processing. Adding workers increases sustained throughput only when concurrency and streaming behavior scale correctly.

  • Add workers and nodes to increase throughput
  • Validate scaling efficiency as concurrency grows (see the sketch after this list)
  • Benchmark against real model latency and token cadence
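
To make the scaling-efficiency check concrete, the sketch below compares measured multi-worker throughput against a linear extrapolation from a single-worker baseline. The numbers and the rough threshold in the comments are illustrative assumptions, not Kong-published figures.

```python
# Sketch: check how close multi-worker throughput gets to linear scaling.
# All figures are illustrative; substitute your own benchmark measurements.

def scaling_efficiency(single_worker_tps: float, workers: int, measured_tps: float) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    ideal_tps = single_worker_tps * workers
    return measured_tps / ideal_tps

# Example: a 1-worker baseline of 1.0M output tokens/s, 8 workers measured at 7.2M tokens/s.
efficiency = scaling_efficiency(single_worker_tps=1.0e6, workers=8, measured_tps=7.2e6)
print(f"Scaling efficiency: {efficiency:.0%}")  # 90%

# If efficiency drops well below this as concurrency grows, investigate connection
# reuse, stream buffering, and upstream rate limits before adding more nodes.
```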

Allocate CPU and memory for LLM workloads

Compute sizing is dictated by token processing, not request count. Memory supports configuration and streaming buffers. Persistent storage demand is minimal.

  • CPU determines maximum tokens per second
  • Memory must support configuration and in-memory stream buffers

Use dedicated compute instance classes

Consistent CPU performance is critical for LLM token streaming. Burstable or credit-based instances can introduce token delay spikes and unstable throughput.

  • Prefer dedicated compute families (for example, AWS c5, c6g)
  • Avoid burstable instances (for example, AWS t, GCP e2, Azure B series)

Operational best practices

Effective scaling requires testing with realistic model behavior, applying safety margins, and accommodating upstream model differences.

  • Benchmark with your own model mix and prompt sizes
  • Size for tokens/s, not just requests per second
  • Apply a 2×–4× redundancy factor
  • Consider provider differences (for example, OpenAI vs. Gemini)
  • Test multi-node scaling before going to production

Baseline benchmark results

These baseline throughput numbers reflect typical single-worker token processing under streaming LLM workloads. Use these numbers as general guidance only. Benchmark performance in your own environment and with your specific model mix.

| Benchmark dimension | Result |
|---|---|
| Output tokens/s | OpenAI path: ~1.05M tokens/s; Gemini path: ~0.78M tokens/s |
| Input tokens/s | ~4.4M tokens/s (similar for both OpenAI and Gemini) |
| Input:output ratio | ~4.2:1 – 5.6:1 |

Throughput depends on the provider, the model, and the size and structure of your prompts and responses. Benchmark with your real workload to measure accurate throughput and avoid relying on synthetic or idealized figures.
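
One way to capture real-workload numbers is to measure streamed output tokens per second directly through the gateway. The sketch below is minimal and assumes an OpenAI-compatible chat completions route at a hypothetical URL and the common "data: ..." SSE event format; adapt the URL, authentication headers, and event parsing to your own deployment and provider.

```python
# Sketch: measure streamed output throughput for a single request through an
# AI Gateway route. URL, payload shape, and SSE format are assumptions.
import time

import requests

GATEWAY_URL = "http://localhost:8000/ai-proxy/chat/completions"  # hypothetical route

payload = {
    "messages": [{"role": "user", "content": "Summarize the benefits of API gateways."}],
    "stream": True,
}

start = time.monotonic()
events = 0
with requests.post(GATEWAY_URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        if line[len(b"data: "):] == b"[DONE]":
            break
        events += 1  # roughly one output token per SSE event (provider-dependent)

elapsed = time.monotonic() - start
print(f"~{events} streamed events in {elapsed:.1f}s "
      f"(~{events / elapsed:.0f} output tokens/s for this request)")
```

Run many such requests concurrently (for example, from a load-testing harness) to approximate sustained gateway-level tokens/s rather than per-request throughput.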

Capacity planning formula

equivalent_output_load = I_peak / R + O_peak
required_workers ≈ equivalent_output_load / O_w

where:

  • I_peak: peak input tokens/s
  • O_peak: peak output tokens/s
  • R: input:output processing-cost ratio (≈4:1 based on the benchmarks above)
  • O_w: sustained output tokens/s per worker

Apply a redundancy factor of 2×–4× to the result to handle bursts, tokenization overhead, and provider variability.

Quick estimate rule of thumb

  • Assume a 4:1 input:output processing-cost ratio
  • Assume ~1M output tokens/s per vCPU worker

For example, at a peak of 80M input tokens/s and 10M output tokens/s:

(80M / 4 + 10M) / 1M = 30 base workers
→ 60–120 workers with a 2×–4× redundancy factor
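
The sketch below applies the capacity planning formula to the quick-estimate numbers above (80M peak input tokens/s, 10M peak output tokens/s, a 4:1 ratio, and ~1M output tokens/s per worker). The helper function and its defaults are illustrative, not a Kong-provided tool.

```python
# Sketch: base worker count from the capacity planning formula, before redundancy.
import math

def required_workers(
    peak_input_tps: float,              # I_peak: peak input tokens/s
    peak_output_tps: float,             # O_peak: peak output tokens/s
    input_output_ratio: float = 4.0,    # R: input:output processing-cost ratio
    worker_output_tps: float = 1.0e6,   # O_w: sustained output tokens/s per worker
) -> int:
    equivalent_output_load = peak_input_tps / input_output_ratio + peak_output_tps
    return math.ceil(equivalent_output_load / worker_output_tps)

base = required_workers(peak_input_tps=80e6, peak_output_tps=10e6)
print(f"Base workers: {base}")                           # 30
print(f"With 2x-4x redundancy: {base * 2}-{base * 4}")   # 60-120
```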

Buffer and memory guidance

Inference requests often include large prompts and streamed output. Buffer sizing determines whether payloads are processed in memory or spill to disk, so tune memory settings based on prompt size and workload profile.

| Traffic profile | Typical prompt size | max_request_body_size | client_body_buffer_size |
|---|---|---|---|
| Chat apps | < 512 KiB | 2–4 MiB | 256–512 KiB |
| RAG with embeddings | 1–4 MiB | 8–16 MiB | 1–2 MiB |
| Batch / large JSON | 4–16 MiB | 16–64 MiB | 2–4 MiB |
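
As a concrete example, the sketch below applies the RAG-with-embeddings row to a running deployment. It assumes max_request_body_size is a config field on the AI Proxy plugin that can be patched through the Admin API in bytes, and that client_body_buffer_size is injected as an Nginx directive through kong.conf or an environment variable; verify the exact property names, units, and plugin ID against your Kong Gateway version.

```python
# Sketch: raise request body limits for a RAG-style workload via the Admin API.
# Admin API address, plugin ID, field names, and units are assumptions to verify.
import requests

ADMIN_API = "http://localhost:8001"     # default Admin API address (assumed)
PLUGIN_ID = "<ai-proxy-plugin-id>"      # placeholder: ID of your AI Proxy plugin

resp = requests.patch(
    f"{ADMIN_API}/plugins/{PLUGIN_ID}",
    json={"config": {"max_request_body_size": 16 * 1024 * 1024}},  # 16 MiB, in bytes
    timeout=10,
)
resp.raise_for_status()
print("max_request_body_size:", resp.json().get("config", {}).get("max_request_body_size"))

# client_body_buffer_size is an Nginx setting rather than a plugin field; inject it
# through kong.conf or an environment variable, for example:
#   KONG_NGINX_HTTP_CLIENT_BODY_BUFFER_SIZE=2m
```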

Instance recommendations

AI Gateway benefits from high clock speed, dedicated CPU, and non-burstable compute classes. Select instance families optimized for consistent CPU throughput and avoid throttled instance types.

| Cloud | Architecture | Instance family | Notes |
|---|---|---|---|
| AWS | x86_64 | c5, c6i | Non-burstable, compute optimized |
| AWS | ARM | c6g, c7g | Graviton, cost-efficient scaling |
| GCP | x86_64 | c2-standard, c3-standard | High clock performance |
| Azure | x86_64 | Fsv2, Dasv5 | CPU-optimized dedicated compute |

Deployment sizing tiers

Cluster size depends on configured entities and sustained token throughput. Smaller environments serve team-level workloads; larger footprints handle multi-tenant platforms and enterprise AI adoption at scale.

| Size | Number of configured entities | Token throughput guidance (input / output) | Use cases |
|---|---|---|---|
| Small | < 100 services/routes | < 10M input / < 2M output tokens/s | Team workloads, prototypes, low-volume inference |
| Medium | 100–500 services/routes | 10M–60M input / 2M–10M output tokens/s | Production traffic for a single business unit |
| Large | 500–2,000 services/routes | 60M–200M input / 10M–40M output tokens/s | Central platform, multi-team AI adoption |
| XL | > 2,000 services/routes | > 200M input / > 40M output tokens/s | Enterprise AI platform, multi-tenant environments |