Kong AI Gateway resource sizing guidelines
The Kong AI Gateway is designed to handle high-volume inference workloads and forward requests to large language model (LLM) providers with predictable latency. This guide explains performance dimensions, capacity planning methodology, and baseline sizing guidance for AI inference traffic.
Scaling dimensions
AI inference performance depends on both token streaming latency and sustained token throughput. Unlike traditional API traffic, most latency comes from upstream models, so the gateway must be evaluated on its ability to pass through tokens efficiently.
| Performance dimension | Measured in | Performance limited by… | Description |
|---|---|---|---|
| Latency | Milliseconds | LLM TTFT and token streaming; gateway overhead is typically low relative to model time | Time to first token (TTFT) and per-token streaming latency (TPOT) dominate end-to-end latency. Gateway overhead typically adds < 10 ms. |
| Throughput | Input/output tokens per second | CPU-bound; scale workers horizontally for higher sustained token throughput | Maximum sustained input and output tokens per second processed across all requests. |
Models stream output tokens as server-sent events (SSE). Processing streamed output is more expensive per token than processing input, so capacity planning must treat input and output tokens differently.
Deployment guidance
AI Gateway scales primarily through horizontal worker expansion, not vertical tuning. Treat token throughput as the core capacity metric, and validate performance against real LLM latency profiles. Synthetic or low-latency backends will overstate capacity.
Scale horizontally for token throughput
Kong AI Gateway performance is CPU-bound on token processing. Adding workers increases sustained throughput only when concurrency and streaming behavior scale correctly.
- Add workers and nodes to increase throughput
- Validate scaling efficiency as concurrency grows
- Benchmark against real model latency and token cadence
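Token cadence is easiest to validate with a streaming probe against a real route. The following is a minimal sketch, assuming a hypothetical gateway route at http://localhost:8000/ai/chat that returns OpenAI-style SSE chunks; the chunk count is only an approximation of output tokens, but it is good enough to compare scaling runs.

```python
"""Rough token-throughput probe against an AI Gateway streaming route."""
import json
import time

import requests  # pip install requests

GATEWAY_URL = "http://localhost:8000/ai/chat"   # hypothetical route; adjust to your deployment
PROMPT = "Summarize the benefits of horizontal scaling in two sentences."

def stream_once() -> tuple[int, float, float]:
    """Return (chunks_received, time_to_first_chunk_s, total_time_s)."""
    start = time.perf_counter()
    first = None
    chunks = 0
    with requests.post(
        GATEWAY_URL,
        json={"messages": [{"role": "user", "content": PROMPT}], "stream": True},
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # SSE chunks arrive as "data: {...}" lines; the stream ends with "data: [DONE]".
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            json.loads(payload)          # ensure the chunk is well-formed
            chunks += 1
            if first is None:
                first = time.perf_counter() - start
    return chunks, first or 0.0, time.perf_counter() - start

if __name__ == "__main__":
    chunks, ttfc, total = stream_once()
    print(f"chunks={chunks} time_to_first_chunk={ttfc * 1000:.0f}ms "
          f"approx output tokens/s={chunks / max(total, 1e-9):.1f}")
```

Run the probe at increasing concurrency (for example, with a process pool or a load tool) and check that aggregate tokens/s grows roughly linearly as you add workers and nodes.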
Allocate CPU and memory for LLM workloads
Compute sizing is dictated by token processing, not request count. Memory supports configuration and streaming buffers. Persistent storage demand is minimal.
- CPU determines maximum tokens per second
- Memory must support configuration and in-memory stream buffers
Use dedicated compute instance classes
Consistent CPU performance is critical for LLM token streaming. Burstable or credit-based instances can introduce token delay spikes and unstable throughput.
- Prefer dedicated compute families (for example, AWS c5, c6g)
- Avoid burstable instances (for example, AWS t series, GCP e2, Azure B series)
Operational best practices
Effective scaling requires testing with realistic model behavior, applying safety margins, and accommodating upstream model differences.
- Benchmark with your model mix and prompt sizes
- Size for token/s, not just RPS
- Apply redundancy factor 2×–4×
- Consider provider differences (OpenAI vs Gemini)
- Test multi‑node scaling before production
Baseline benchmark results
These baseline throughput numbers reflect typical single-worker token processing under streaming LLM workloads. Use these numbers as general guidance only. Benchmark performance in your own environment and with your specific model mix.
| Benchmark dimension | Result |
|---|---|
| Output tokens/s | OpenAI path: ~1.05M tokens/s; Gemini path: ~0.78M tokens/s |
| Input tokens/s | ~4.4M tokens/s (similar for both OpenAI and Gemini) |
| Input:output ratio | ~4.2:1 – 5.6:1 |
Throughput depends on the provider, the model, and the size and structure of your prompts and responses. Benchmark with your real workload rather than relying on synthetic or idealized figures.
Capacity planning formula
With I_peak as peak input tokens/s, O_peak as peak output tokens/s, R as the input:output processing ratio (roughly 4:1 based on the baseline benchmarks), and O_w as the sustained output tokens/s a single worker can handle:
equivalent_output_load = I_peak / R + O_peak
required_workers ≈ equivalent_output_load / O_w
Apply a redundancy factor of 2×–4× to handle bursts, tokenization, and provider variability.
Quick estimate rule of thumb
- Assume a 4:1 input:output processing ratio
- Assume ~1M output tokens/s per vCPU worker
For example, at a peak of 80M input and 10M output tokens/s:
(80M / 4 + 10M) / 1M = 30 workers
→ 60–120 workers with 2×–4× redundancy
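The sketch below turns the formula and the rule of thumb into a quick calculation. The default ratio and per-worker throughput are the rule-of-thumb values above; substitute your own benchmark results.

```python
import math

def required_workers(
    peak_input_tps: float,                 # I_peak: peak input tokens/s
    peak_output_tps: float,                # O_peak: peak output tokens/s
    io_ratio: float = 4.0,                 # R: input:output processing ratio
    worker_output_tps: float = 1_000_000,  # O_w: output tokens/s per worker
    redundancy: float = 2.0,               # 2x-4x safety margin
) -> int:
    """Workers needed for a peak token load, including the redundancy factor."""
    equivalent_output_load = peak_input_tps / io_ratio + peak_output_tps
    return math.ceil(equivalent_output_load / worker_output_tps * redundancy)

# Worked example from the rule of thumb: 80M input / 10M output tokens/s at peak.
print(required_workers(80e6, 10e6, redundancy=2.0))  # 60
print(required_workers(80e6, 10e6, redundancy=4.0))  # 120
```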
Buffer and memory guidance
Inference requests often include large prompts and streamed output. Buffer sizing determines whether payloads are processed in memory or spill to disk, so tune memory settings based on prompt size and workload profile.
| Traffic profile | Typical prompt size | max_request_body_size | client_body_buffer_size |
|---|---|---|---|
| Chat apps | < 512 KiB | 2–4 MiB | 256–512 KiB |
| RAG with embeddings | 1–4 MiB | 8–16 MiB | 1–2 MiB |
| Batch / large JSON | 4–16 MiB | 16–64 MiB | 2–4 MiB |
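The table maps onto standard NGINX client-body buffering behavior, which Kong inherits: a body larger than the buffer is spooled to a temporary file, and a body above the configured maximum is rejected. The sketch below is a back-of-the-envelope check of that decision, not Kong code; the thresholds in the example are taken from the table above.

```python
# Which path a request body takes, given the two buffer settings from the table.
KIB = 1024
MIB = 1024 * KIB

def body_handling(prompt_bytes: int, buffer_size: int, max_body_size: int) -> str:
    if prompt_bytes > max_body_size:
        return "rejected (exceeds max_request_body_size)"
    if prompt_bytes > buffer_size:
        return "buffered to disk (exceeds client_body_buffer_size)"
    return "processed in memory"

# A 1.5 MiB RAG prompt with chat-app sizing spills to disk...
print(body_handling(int(1.5 * MIB), buffer_size=512 * KIB, max_body_size=4 * MIB))
# ...while RAG-profile sizing keeps it in memory.
print(body_handling(int(1.5 * MIB), buffer_size=2 * MIB, max_body_size=16 * MIB))
```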
Instance recommendations
AI Gateway benefits from high clock speed, dedicated CPU, and non-burstable compute classes. Select instance families optimized for consistent CPU throughput and avoid throttled instance types.
| Cloud | Architecture | Instance family | Notes |
|---|---|---|---|
| AWS | x86_64 | c5, c6i | Non-burstable, compute optimized |
| AWS | ARM | c6g, c7g | Graviton; cost-efficient scaling |
| GCP | x86_64 | c2-standard, c3-standard | High clock performance |
| Azure | x86_64 | Fsv2, Dasv5 | CPU-optimized dedicated compute |
Deployment sizing tiers
Cluster size depends on configured entities and sustained token throughput. Smaller environments serve team-level workloads; larger footprints handle multi-tenant platforms and enterprise AI adoption at scale.
| Size | Number of configured entities | Token throughput guidance (input / output) | Use cases |
|---|---|---|---|
| Small | < 100 services/routes | < 10M input / < 2M output tokens/s | Team workloads, prototypes, low-volume inference |
| Medium | 100–500 services/routes | 10M–60M input / 2M–10M output tokens/s | Production traffic for a single business unit |
| Large | 500–2,000 services/routes | 60M–200M input / 10M–40M output tokens/s | Central platform, multi-team AI adoption |
| XL | > 2,000 services/routes | > 200M input / > 40M output tokens/s | Enterprise AI platform, multi-tenant environments |
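To sanity-check which tier a measured workload falls into, a small lookup like the one below can help. The thresholds are copied from the table above; the function name and example numbers are purely illustrative.

```python
# Pick the smallest tier that covers all three dimensions of the workload.
TIERS = [
    # (name, max entities, max input tokens/s, max output tokens/s)
    ("Small",  100,          10e6,         2e6),
    ("Medium", 500,          60e6,         10e6),
    ("Large",  2_000,        200e6,        40e6),
    ("XL",     float("inf"), float("inf"), float("inf")),
]

def sizing_tier(entities: int, input_tps: float, output_tps: float) -> str:
    for name, max_entities, max_in, max_out in TIERS:
        if entities <= max_entities and input_tps <= max_in and output_tps <= max_out:
            return name
    return "XL"

print(sizing_tier(entities=350, input_tps=25e6, output_tps=4e6))  # Medium
```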