This configuration uses the AI Proxy Advanced plugin’s semantic load balancing to route requests. Each query is compared against the configured model descriptions using vector embeddings, so every request is sent to the model best suited to its content. This distribution improves response relevance while optimizing resource use and cost, and can also reduce response latency.
The plugin also uses the `temperature` parameter to control how creative the model’s responses are. Higher temperature values (closer to 1) increase randomness and creativity; lower values (closer to 0) make outputs more focused and predictable.
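To build intuition for what temperature does, here is a small standalone Python sketch (illustrative math only, not plugin code) showing how dividing a model’s raw scores by a temperature sharpens or flattens the resulting probability distribution. The logit values are invented for the example:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

low = softmax_with_temperature(logits, 0.1)   # near-deterministic
high = softmax_with_temperature(logits, 1.0)  # more varied

# At low temperature, probability concentrates on the top-scoring token.
print(max(low) > max(high))  # → True
```

A temperature of exactly 0, as used for the coding route below, is conventionally treated as greedy selection of the top token rather than a literal division by zero.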
The table below outlines how different types of queries are semantically routed to specific models in this configuration:
| Route | Routed to model | Description |
|-------|-----------------|-------------|
| Queries about Python or technical coding | gpt-3.5-turbo | Requests semantically matched to the “Expert in Python programming” category. Handles complex coding or technical questions with deterministic output (temperature 0). |
| IT support related questions | gpt-4o | Requests related to IT support topics are routed here. Uses moderate creativity (temperature 0.3) and a mid-sized token limit. |
| General or catchall queries | gpt-4o-mini | Catchall for all other queries not strongly matched to other categories. Prioritizes cost efficiency and creative responses (temperature 1.0). |
Configure the AI Proxy Advanced plugin to route requests to specific models:
```sh
echo '
_format_version: "3.0"
plugins:
  - name: ai-proxy-advanced
    config:
      embeddings:
        auth:
          header_name: Authorization
          header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
        model:
          provider: openai
          name: text-embedding-3-small
      vectordb:
        dimensions: 1024
        distance_metric: cosine
        strategy: redis
        threshold: 0.75
        redis:
          host: "${{ env "DECK_REDIS_HOST" }}"
          port: 6379
      balancer:
        algorithm: semantic
      targets:
        - route_type: llm/v1/chat
          auth:
            header_name: Authorization
            header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
          model:
            provider: openai
            name: gpt-3.5-turbo
            options:
              max_tokens: 826
              temperature: 0
          description: Expert in Python programming.
        - route_type: llm/v1/chat
          auth:
            header_name: Authorization
            header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
          model:
            provider: openai
            name: gpt-4o
            options:
              max_tokens: 512
              temperature: 0.3
          description: All IT support questions.
        - route_type: llm/v1/chat
          auth:
            header_name: Authorization
            header_value: Bearer ${{ env "DECK_OPENAI_API_KEY" }}
          model:
            provider: openai
            name: gpt-4o-mini
            options:
              max_tokens: 256
              temperature: 1.0
          description: CATCHALL
' | deck gateway apply -
```
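To see how the semantic balancer behaves conceptually, the following self-contained Python sketch (an illustration of the algorithm, not the plugin’s actual code) picks a target by cosine similarity between a query embedding and each target description’s embedding, falling back to the catchall model when no score clears the 0.75 threshold. The 3-dimensional embeddings are toy values invented for this example; the real configuration uses 1024-dimensional vectors from text-embedding-3-small:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for embedded target descriptions.
targets = {
    "gpt-3.5-turbo": [0.9, 0.1, 0.0],  # "Expert in Python programming."
    "gpt-4o":        [0.1, 0.9, 0.0],  # "All IT support questions."
}
CATCHALL = "gpt-4o-mini"
THRESHOLD = 0.75  # mirrors config.vectordb.threshold above

def route(query_embedding):
    """Return the best-matching model, or the catchall if nothing clears the threshold."""
    best_model, best_score = max(
        ((model, cosine_similarity(query_embedding, emb)) for model, emb in targets.items()),
        key=lambda pair: pair[1],
    )
    return best_model if best_score >= THRESHOLD else CATCHALL

print(route([0.8, 0.2, 0.1]))  # close to the Python description → gpt-3.5-turbo
print(route([0.0, 0.1, 1.0]))  # matches nothing strongly → gpt-4o-mini
```

Raising the threshold sends more borderline queries to the catchall; lowering it routes more aggressively to the specialized targets.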
You can also consider alternative models and temperature settings to better suit your workload: for example, specialized code models for coding tasks, full GPT-4 for nuanced IT support, and lighter models with a higher temperature for general or creative queries.
- **Technical coding (precision-focused):** `code-davinci-002` with `temperature: 0`. Ensures consistent, deterministic code completions.
- **IT support (balanced creativity):** `gpt-4o` with `temperature: 0.3`. Allows helpful, slightly creative answers without being too loose.
- **Catchall/general queries (more creative):** `gpt-3.5-turbo` or `gpt-4o-mini` with `temperature: 0.7–1.0`. Encourages creative, varied responses for open-ended questions.