Gen AI OpenTelemetry metrics reference

Starting in version 3.14, AI Gateway can export OpenTelemetry (OTLP) metrics for generative AI, MCP, and A2A traffic through the OpenTelemetry plugin. These metrics are aggregated time-series data points (counters and histograms) pushed to a configured OTLP metrics endpoint at a regular interval. They are separate from the per-request Gen AI span attributes emitted on traces.

For a step-by-step setup using an OpenTelemetry Collector, see Collect metrics, logs, and traces with the OpenTelemetry plugin. To visualize Gen AI traces in Jaeger, see Set up Jaeger with Gen AI OpenTelemetry.

Use these metrics to:

  • Track LLM request latency and upstream provider processing time
  • Monitor token consumption across providers, models, and consumers
  • Measure time to first token (TTFT) and time per output token (TPOT, inter-token latency) for streaming responses
  • Calculate AI request costs
  • Observe MCP tool-call latency, error rates, and ACL decisions
  • Monitor A2A agent request volume, duration, and task state transitions

Prerequisites

To collect AI OTel metrics, enable the following settings:

Setting                                  Plugin                         Required for
config.metrics.enable_ai_metrics: true   OpenTelemetry                  All AI metrics
config.metrics.endpoint                  OpenTelemetry                  All AI metrics (set to a valid OTLP-compatible metrics endpoint)
config.logging.log_statistics: true      AI Proxy or AI Proxy Advanced  Gen AI metrics
config.logging.log_statistics: true      AI MCP Proxy                   MCP metrics
config.logging.log_statistics: true      AI A2A Proxy                   A2A metrics

Some metrics have additional requirements:

  • gen_ai.server.request.duration and mcp.client.operation.duration require config.metrics.enable_latency_metrics set to true in the OpenTelemetry plugin.
  • The error.type attribute on duration metrics requires config.metrics.enable_request_metrics set to true in the OpenTelemetry plugin.
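
Taken together, these settings correspond to plugin configuration along the lines of the following declarative sketch. The Collector endpoint URL is an assumption; substitute your own OTLP-compatible metrics endpoint:

```yaml
# Sketch of the prerequisite settings listed above; endpoint is an assumption.
plugins:
  - name: opentelemetry
    config:
      metrics:
        endpoint: http://otel-collector:4318/v1/metrics  # assumed Collector address
        enable_ai_metrics: true        # all AI metrics
        enable_latency_metrics: true   # gen_ai.server.request.duration, mcp.client.operation.duration
        enable_request_metrics: true   # error.type attribute on duration metrics
  - name: ai-proxy                     # or ai-proxy-advanced
    config:
      logging:
        log_statistics: true           # Gen AI metrics; AI MCP Proxy and AI A2A Proxy
                                       # use the same flag for MCP and A2A metrics
```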

Gen AI metrics (OTel semantic conventions)

These metrics follow the OpenTelemetry Gen AI semantic conventions. They capture request duration, upstream latency, token usage, and streaming performance.

gen_ai.client.operation.duration

Total time Kong Gateway spends processing a Gen AI operation, such as an LLM request. Kong Gateway acts as the client calling the Gen AI provider.

  • Type: Histogram
  • Unit: s (seconds)

Attributes:

Attribute                 Description
gen_ai.provider.name      Name of the Gen AI provider.
gen_ai.request.model      Model name targeted by the request.
gen_ai.response.model     Model name reported by the provider in the response.
gen_ai.operation.name     Operation requested, such as chat or embeddings.
kong.workspace.name       Name of the Workspace.
kong.auth.consumer.name   Name of the authenticated Consumer.
kong.gen_ai.request.mode  Request mode: oneshot, stream, or realtime.
error.type                Error type, if the request failed. Requires enable_request_metrics.

gen_ai.server.request.duration

Time the LLM provider spends processing the request (upstream latency). Requires enable_latency_metrics set to true.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as gen_ai.client.operation.duration.

gen_ai.client.token.usage

Number of tokens consumed by the Gen AI operation. Each data point is labeled with a gen_ai.token.type attribute that identifies the token category.

  • Type: Counter
  • Unit: {token}

Attribute                 Description
gen_ai.provider.name      Name of the Gen AI provider.
gen_ai.request.model      Model name targeted by the request.
gen_ai.response.model     Model name reported by the provider in the response.
gen_ai.token.type         Token category: input, output, or total.
gen_ai.operation.name     Operation requested, such as chat or embeddings.
kong.workspace.name       Name of the Workspace.
kong.auth.consumer.name   Name of the authenticated Consumer.
kong.gen_ai.request.mode  Request mode: oneshot, stream, or realtime.

gen_ai.server.time_to_first_token

Time from when the model server receives the request until the first output token is generated. Relevant for streaming responses.

  • Type: Histogram
  • Unit: s (seconds)

Attribute                 Description
gen_ai.provider.name      Name of the Gen AI provider.
gen_ai.request.model      Model name targeted by the request.
gen_ai.response.model     Model name reported by the provider in the response.
gen_ai.operation.name     Operation requested, such as chat or embeddings.
kong.workspace.name       Name of the Workspace.
kong.auth.consumer.name   Name of the authenticated Consumer.
kong.gen_ai.request.mode  Request mode: oneshot, stream, or realtime.

gen_ai.server.time_per_output_token

Time between successive output tokens generated by the model server after the first token. Measures inter-token latency for streaming responses.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as gen_ai.server.time_to_first_token.

Kong Gen AI metrics

These metrics use the kong.gen_ai.* namespace and capture Kong-specific AI observability data, including cost tracking, cache and RAG latency, and AWS Guardrails processing time.

kong.gen_ai.llm.cost

Cost of AI requests. To populate this metric, define model.options.input_cost and model.options.output_cost in the AI Proxy or AI Proxy Advanced plugin configuration.

  • Type: Counter
  • Unit: {cost}

Attribute                        Description
gen_ai.provider.name             Name of the Gen AI provider.
gen_ai.request.model             Model name targeted by the request.
gen_ai.response.model            Model name reported by the provider in the response.
gen_ai.operation.name            Operation requested, such as chat or embeddings.
kong.gen_ai.cache.status         Cache status: hit, or empty if not cached.
kong.gen_ai.vector_db            Vector database used for caching, such as redis.
kong.gen_ai.embeddings.provider  Embeddings provider used for caching.
kong.gen_ai.embeddings.model     Embeddings model used for caching.
kong.workspace.name              Name of the Workspace.
kong.auth.consumer.name          Name of the authenticated Consumer.
kong.gen_ai.request.mode         Request mode: oneshot, stream, or realtime.
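
For this metric to emit data, the per-token prices must be configured on the model. A minimal sketch of the relevant AI Proxy fragment follows; the price values and their units are illustrative assumptions, so check your provider's current pricing:

```yaml
# Illustrative values only; verify units and pricing for your provider.
plugins:
  - name: ai-proxy
    config:
      model:
        options:
          input_cost: 2.50    # assumed: cost per 1M input tokens
          output_cost: 10.00  # assumed: cost per 1M output tokens
```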

kong.gen_ai.cache.fetch.latency

Time to fetch a response from the semantic cache.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as kong.gen_ai.llm.cost.

kong.gen_ai.cache.embeddings.latency

Time to generate embeddings during cache operations.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as kong.gen_ai.llm.cost.

kong.gen_ai.rag.fetch.latency

Time to fetch data from a RAG (Retrieval-Augmented Generation) source.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as kong.gen_ai.llm.cost.

kong.gen_ai.rag.embeddings.latency

Time to generate embeddings for RAG operations.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as kong.gen_ai.llm.cost.

kong.gen_ai.aws.guardrails.latency

Time for AWS Guardrails to process a request.

  • Type: Histogram
  • Unit: s (seconds)

Attribute                           Description
kong.gen_ai.aws.guardrails.id       ID of the AWS Guardrails configuration.
kong.gen_ai.aws.guardrails.version  Version of the AWS Guardrails configuration.
kong.gen_ai.aws.guardrails.mode     Mode of the AWS Guardrails evaluation.
kong.gen_ai.aws.guardrails.region   AWS region of the Guardrails service.
kong.workspace.name                 Name of the Workspace.
kong.auth.consumer.name             Name of the authenticated Consumer.

MCP metrics

These metrics provide observability into MCP (Model Context Protocol) server interactions, including latency, response sizes, errors, and ACL decisions.

mcp.client.operation.duration

Duration of the MCP request as observed by the sender. Only available when the AI MCP Proxy plugin is in passthrough-listener mode (the upstream is an MCP server). Requires enable_latency_metrics set to true.

  • Type: Histogram
  • Unit: s (seconds)

Attribute              Description
kong.service.name      Name of the Gateway Service.
kong.route.name        Name of the Route.
kong.workspace.name    Name of the Workspace.
mcp.method.name        MCP method name, such as tools/call.
gen_ai.tool.name       Name of the tool invoked.
error.type             JSON-RPC error code, if the request failed.
gen_ai.operation.name  Operation name, such as execute_tool for tools/call.

mcp.server.operation.duration

Duration of the MCP request as observed by the receiver.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as mcp.client.operation.duration.

kong.gen_ai.mcp.response.size

Size of the MCP response body.

  • Type: Histogram
  • Unit: By (bytes)

Attribute            Description
kong.service.name    Name of the Gateway Service.
kong.route.name      Name of the Route.
kong.workspace.name  Name of the Workspace.
mcp.method.name      MCP method name, such as tools/call.
gen_ai.tool.name     Name of the tool invoked.

kong.gen_ai.mcp.request.error.count

Number of MCP request errors.

  • Type: Counter
  • Unit: {error}

Attribute            Description
kong.service.name    Name of the Gateway Service.
kong.route.name      Name of the Route.
kong.workspace.name  Name of the Workspace.
mcp.method.name      MCP method name, such as tools/call.
gen_ai.tool.name     Name of the tool invoked.
error.type           JSON-RPC error code.

kong.gen_ai.mcp.acl.allowed

Number of MCP requests allowed by ACL rules.

  • Type: Counter
  • Unit: {request}

Attribute                       Description
kong.service.name               Name of the Gateway Service.
kong.route.name                 Name of the Route.
kong.workspace.name             Name of the Workspace.
kong.gen_ai.mcp.primitive       MCP primitive type, such as tool.
kong.gen_ai.mcp.primitive_name  Name of the MCP primitive.

kong.gen_ai.mcp.acl.denied

Number of MCP requests denied by ACL rules.

  • Type: Counter
  • Unit: {request}

Attributes: Same as kong.gen_ai.mcp.acl.allowed.

A2A metrics

These metrics provide observability into A2A (Agent-to-Agent) traffic, including request volume, latency, response sizes, and task state transitions.

kong.gen_ai.a2a.request.count

Total number of A2A requests.

  • Type: Counter
  • Unit: {request}

Attribute                Description
kong.service.name        Name of the Gateway Service.
kong.route.name          Name of the Route.
kong.workspace.name      Name of the Workspace.
kong.gen_ai.a2a.method   A2A method name.
kong.gen_ai.a2a.binding  A2A binding type.

kong.gen_ai.a2a.request.duration

Duration of an A2A request.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as kong.gen_ai.a2a.request.count.

kong.gen_ai.a2a.response.size

Size of the A2A response body.

  • Type: Histogram
  • Unit: By (bytes)

Attributes: Same as kong.gen_ai.a2a.request.count.

kong.gen_ai.a2a.ttfb

Time to first byte for A2A streaming responses.

  • Type: Histogram
  • Unit: s (seconds)

Attributes: Same as kong.gen_ai.a2a.request.count.

kong.gen_ai.a2a.request.error.count

Number of A2A request errors.

  • Type: Counter
  • Unit: {error}

Attribute                   Description
kong.service.name           Name of the Gateway Service.
kong.route.name             Name of the Route.
kong.workspace.name         Name of the Workspace.
kong.gen_ai.a2a.method      A2A method name.
kong.gen_ai.a2a.binding     A2A binding type.
kong.gen_ai.a2a.error.type  Type of the A2A error.

kong.gen_ai.a2a.task.state.count

Number of A2A task state transitions.

  • Type: Counter
  • Unit: {state}

Attribute                   Description
kong.service.name           Name of the Gateway Service.
kong.route.name             Name of the Route.
kong.workspace.name         Name of the Workspace.
kong.gen_ai.a2a.task.state  Task state, such as completed, failed, or in_progress.
