Stream responses from Vertex AI through Kong AI Gateway using Google Generative AI SDK

Deployment platform: Kong Gateway, minimum version 3.10
TL;DR

Configure the AI Proxy Advanced plugin with llm_format set to gemini, then send requests to the :streamGenerateContent endpoint. The response returns as a JSON array containing incremental text chunks.

Prerequisites

This is a Konnect tutorial and requires a Konnect personal access token.

  1. Create a new personal access token by opening the Konnect PAT page and selecting Generate Token.

  2. Export your token to an environment variable:

     export KONNECT_TOKEN='YOUR_KONNECT_PAT'
    
  3. Run the quickstart script to automatically provision a Control Plane and Data Plane, and configure your environment:

     curl -Ls https://get.konghq.com/quickstart | bash -s -- -k $KONNECT_TOKEN --deck-output
    

    This sets up a Konnect Control Plane named quickstart, provisions a local Data Plane, and prints out the following environment variable exports:

     export DECK_KONNECT_TOKEN=$KONNECT_TOKEN
     export DECK_KONNECT_CONTROL_PLANE_NAME=quickstart
     export KONNECT_CONTROL_PLANE_URL=https://us.api.konghq.com
     export KONNECT_PROXY_URL='http://localhost:8000'
    

    Copy and paste these into your terminal to configure your session.

If you'd rather run this tutorial against a self-managed instance instead of Konnect, note that it requires Kong Gateway Enterprise. If you don't have Kong Gateway set up yet, you can use the quickstart script with an enterprise license to get an instance of Kong Gateway running almost instantly.

  1. Export your license to an environment variable:

     export KONG_LICENSE_DATA='LICENSE-CONTENTS-GO-HERE'
    
  2. Run the quickstart script:

    curl -Ls https://get.konghq.com/quickstart | bash -s -- -e KONG_LICENSE_DATA 
    

    Once Kong Gateway is ready, you will see the following message:

     Kong Gateway Ready
    

decK is a CLI tool for managing Kong Gateway declaratively with state files. To complete this tutorial, install decK version 1.43 or later.

This guide uses deck gateway apply, which directly applies entity configuration to your Gateway instance. We recommend upgrading your decK installation to take advantage of this tool.

You can check your current decK version with deck version.

For this tutorial, you’ll need Kong Gateway entities, like Gateway Services and Routes, pre-configured. These entities are essential for Kong Gateway to function but installing them isn’t the focus of this guide. Follow these steps to pre-configure them:

  1. Run the following command:

    echo '
    _format_version: "3.0"
    services:
      - name: gemini-service
        url: http://httpbin.konghq.com/
    routes:
      - name: gemini-route
        paths:
        - "/gemini"
        service:
          name: gemini-service
    ' | deck gateway apply -
    

To learn more about entities, you can read our entities documentation.

Before you begin, you must get the following credentials from Google Cloud:

  • Service Account Key: A JSON key file for a service account with Vertex AI permissions
  • Project ID: Your Google Cloud project identifier
  • Location ID: Your Google Cloud project location identifier
  • API Endpoint: The global Vertex AI API endpoint https://aiplatform.googleapis.com

After creating the key, convert the contents of your service account key file into a single-line JSON string. Escape all necessary characters: quotes (") become \" and newlines become \n. The result must be a valid one-line JSON string.
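One way to produce this string is to JSON-encode the file twice and strip the outer quotes: the inner encoding collapses the object onto one line, and the outer encoding escapes quotes and backslashes so the value survives decK's double-quoted YAML substitution. This is a sketch, assuming your key file is named vertex-sa-key.json:

python3 - <<'PY'
import json

# Hypothetical file name; replace with the path to your downloaded key file.
with open("vertex-sa-key.json") as f:
    key = json.load(f)

# First dumps: compact, one-line JSON. Second dumps: escapes quotes (\")
# and backslashes so the string can sit inside a double-quoted YAML value.
# [1:-1] strips the outer quotes added by the second encoding.
print(json.dumps(json.dumps(key))[1:-1])
PY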

Then export your credentials as environment variables:

export DECK_GCP_SERVICE_ACCOUNT_JSON="<single-line-escaped-json>"
export DECK_GCP_LOCATION_ID="<your_location_id>"
export DECK_GCP_API_ENDPOINT="<your_gcp_api_endpoint>"
export DECK_GCP_PROJECT_ID="<your-gcp-project-id>"

Set up GCP Application Default Credentials (ADC) with your quota project:

gcloud auth application-default set-quota-project <your_gcp_project_id>

Replace <your_gcp_project_id> with your actual project ID. This configures ADC to use your project for API quota and billing.

To complete this tutorial, you’ll need Python (version 3.7 or later) and pip installed on your machine. You can verify both by running:

python3 --version
python3 -m pip --version

  1. Create a virtual environment:

    python3 -m venv myenv
    
  2. Activate it:

    source myenv/bin/activate
    

Install the Google Generative AI SDK:

python3 -m pip install google-genai
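To confirm the SDK installed into the active virtual environment, a quick import check works; treating the __version__ attribute as present is an assumption about current google-genai releases:

python3 -c 'from google import genai; print("google-genai", genai.__version__)'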

Configure the AI Proxy Advanced plugin

First, let’s configure the AI Proxy Advanced plugin to support streaming responses from Vertex AI models. When proxied through this configuration, the Vertex AI model returns response tokens incrementally as the model generates them, reducing perceived latency for longer outputs. The plugin proxies requests to Vertex AI’s :streamGenerateContent endpoint without modifying the response format.

Apply the plugin configuration with your GCP service account credentials:

echo '
_format_version: "3.0"
plugins:
  - name: ai-proxy-advanced
    service: gemini-service
    config:
      llm_format: gemini
      genai_category: text/generation
      targets:
      - route_type: llm/v1/chat
        logging:
          log_payloads: false
          log_statistics: true
        model:
          provider: gemini
          name: gemini-2.0-flash-exp
          options:
            gemini:
              api_endpoint: "${{ env "DECK_GCP_API_ENDPOINT" }}"
              project_id: "${{ env "DECK_GCP_PROJECT_ID" }}"
              location_id: "${{ env "DECK_GCP_LOCATION_ID" }}"
        auth:
          allow_override: false
          gcp_use_service_account: true
          gcp_service_account_json: "${{ env "DECK_GCP_SERVICE_ACCOUNT_JSON" }}"
' | deck gateway apply -

Create Python streaming script

Create a script that sends requests to Vertex AI’s streaming endpoint. The :streamGenerateContent suffix signals that the response should return as incremental chunks rather than a single complete generation.

Vertex AI’s streaming format returns a JSON array where each element contains a chunk of the generated response. The entire array arrives in a single HTTP response body, not as server-sent events or newline-delimited JSON.

The script includes two optional flags for debugging and inspection:

  • --raw displays the complete JSON structure returned by Vertex AI before extracting text
  • --chunks shows metadata for each chunk, including finish reasons and token counts
cat << 'EOF' > vertex_stream.py
#!/usr/bin/env python3
from google import genai
from google.genai.types import HttpOptions
import os
import sys

PROJECT_ID = os.getenv("DECK_GCP_PROJECT_ID")
LOCATION = os.getenv("DECK_GCP_LOCATION_ID")

if not PROJECT_ID or not LOCATION:
    print("Error: DECK_GCP_PROJECT_ID and DECK_GCP_LOCATION_ID environment variables must be set")
    sys.exit(1)

BASE_URL = "http://localhost:8000/gemini"
MODEL = "gemini-2.0-flash-exp"

def vertex_stream(show_raw=False, show_chunks=False):
    """Stream responses from Vertex AI through Kong Gateway"""

    # Configure client to route through Kong Gateway
    client = genai.Client(
        vertexai=True,
        project=PROJECT_ID,
        location=LOCATION,
        http_options=HttpOptions(
            base_url=BASE_URL,
            api_version="v1"
        )
    )

    # Show the full URL the SDK will target through Kong Gateway
    print(f"Connecting to: {BASE_URL}/v1/projects/{PROJECT_ID}"
          f"/locations/{LOCATION}/publishers/google/models/"
          f"{MODEL}:streamGenerateContent\n")

    try:
        if show_raw:
            print("Streaming with raw output...\n")

        chunk_num = 0
        for chunk in client.models.generate_content_stream(
            model=MODEL,
            contents="Explain quantum entanglement in one paragraph"
        ):
            chunk_num += 1

            if show_chunks:
                print(f"\n--- Chunk {chunk_num} ---")
                if hasattr(chunk, 'candidates') and chunk.candidates:
                    candidate = chunk.candidates[0]
                    if hasattr(candidate, 'finish_reason') and candidate.finish_reason:
                        print(f"Finish reason: {candidate.finish_reason}")
                    if hasattr(chunk, 'usage_metadata') and chunk.usage_metadata:
                        if hasattr(chunk.usage_metadata, 'total_token_count'):
                            print(f"Total tokens: {chunk.usage_metadata.total_token_count}")
                print("Text: ", end="")

            if show_raw:
                print(f"\nChunk {chunk_num}:", chunk)
                print("-" * 80)

            if chunk.text:
                print(chunk.text, end="", flush=True)

            if show_chunks:
                print()

        if not show_chunks:
            print()

    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    show_raw = "--raw" in sys.argv
    show_chunks = "--chunks" in sys.argv
    vertex_stream(show_raw, show_chunks)
EOF

The streaming endpoint returns a JSON array. Each element contains a chunk with this structure:

[
  {
    "candidates": [{
      "content": {
        "role": "model",
        "parts": [{"text": "1"}]
      }
    }],
    "usageMetadata": {
      "trafficType": "ON_DEMAND"
    },
    "modelVersion": "gemini-2.0-flash-exp"
  },
  {
    "candidates": [{
      "content": {
        "role": "model",
        "parts": [{"text": ", 2, 3, 4, 5\n"}]
      },
      "finishReason": "STOP"
    }],
    "usageMetadata": {
      "promptTokenCount": 4,
      "candidatesTokenCount": 14,
      "totalTokenCount": 18
    }
  }
]

The script extracts the text field from each parts array and prints it incrementally. The final element includes finishReason and complete token usage statistics.
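To see this behavior without the SDK, a minimal standard-library sketch can POST directly to the streaming endpoint through Kong and decode the whole array at once. The URL mirrors the path shown in the validation output below; the prompt is illustrative. Because the plugin authenticates upstream with the service account (allow_override: false), the client doesn't need to attach its own Authorization header:

python3 - <<'PY'
# Sketch: call the streaming endpoint through Kong without the SDK and
# extract text from each chunk of the returned JSON array.
import json
import os
import urllib.request

project = os.environ["DECK_GCP_PROJECT_ID"]
location = os.environ["DECK_GCP_LOCATION_ID"]
url = (
    f"http://localhost:8000/gemini/v1/projects/{project}"
    f"/locations/{location}/publishers/google/models/"
    "gemini-2.0-flash-exp:streamGenerateContent"
)
body = json.dumps(
    {"contents": [{"role": "user", "parts": [{"text": "Count to five"}]}]}
).encode()

req = urllib.request.Request(
    url, data=body, headers={"Content-Type": "application/json"}
)
with urllib.request.urlopen(req) as resp:
    chunks = json.loads(resp.read())  # the whole array arrives in one body

for chunk in chunks:
    for candidate in chunk.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            print(part.get("text", ""), end="")
print()
PY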

Validate the configuration

Run the script to verify streaming responses:

python3 vertex_stream.py

Expected output shows text appearing as the model generates it:

Connecting to: http://localhost:8000/gemini/v1/projects/your-project/locations/us-central1/publishers/google/models/gemini-2.0-flash-exp:streamGenerateContent

Quantum entanglement is a bizarre phenomenon where two or more particles become linked together in such a way that they share the same fate, no matter how far apart they are. Measuring the state of one entangled particle instantly influences the state of the other, even across vast distances, seemingly violating the classical concept of locality. This "spooky action at a distance" means knowing the property of one particle immediately reveals the corresponding property of its entangled partner, even before any measurement is made on it directly.

Display chunk metadata

You can use the --chunks flag to inspect individual chunks with their metadata:

python3 vertex_stream.py --chunks

Expected output:

Connecting to: http://localhost:8000/gemini/v1/projects/your-project/locations/us-central1/publishers/google/models/gemini-2.0-flash-exp:streamGenerateContent

--- Chunk 1 ---
Total tokens: None
Text: Quantum

--- Chunk 2 ---
Total tokens: None
Text:  entanglement is a

--- Chunk 3 ---
Total tokens: None
Text:  bizarre phenomenon where two or more particles become linked together in such a way that they

--- Chunk 4 ---
Total tokens: None
Text:  share the same fate, no matter how far apart they are. Measuring the properties

--- Chunk 5 ---
Total tokens: None
Text:  of one entangled particle instantaneously determines the corresponding properties of the other, even if they're separated by vast distances. This correlation isn't due to some pre-existing hidden

--- Chunk 6 ---
Finish reason: STOP
Total tokens: 100
Text:  information but is instead a fundamental connection arising from their shared quantum state, defying classical intuition about locality and causality.

Inspect raw JSON response

You can also use the --raw flag to view the complete JSON structure before parsing:

python3 vertex_stream.py --raw

This displays the full JSON array returned by Vertex AI, then continues with normal text output. Combine flags to see both raw structure and chunk metadata:

python3 vertex_stream.py --raw --chunks

Cleanup

If you created a new control plane and want to conserve your free trial credits or avoid unnecessary charges, delete the new control plane used in this tutorial.

curl -Ls https://get.konghq.com/quickstart | bash -s -- -d