Embedding-based similarity matching in Kong AI gateway plugins

Uses: Kong Gateway, AI Gateway

Applications that interact with large language models (LLMs) rely on semantic search, which matches text not by exact words but by similarity in meaning. This is achieved using vector embeddings, which represent pieces of text as points in a high-dimensional space.

These embeddings enable the concept of semantic similarity, where the “distance” between vectors reflects how closely related two pieces of text are. Similarity can be measured using techniques like cosine similarity or Euclidean distance, forming the quantitative basis for comparing meaning.

Vector embeddings example

Figure 1: A simplified representation of vector text embeddings in a three-dimensional space.

For example, in the image above, “king” and “emperor” are semantically more similar than “king” and “otter”.
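
To make this concrete, the following sketch computes cosine similarity over three made-up, three-dimensional vectors. Real embedding models produce far higher-dimensional output, so the values here are purely illustrative.

import numpy as np

# Hypothetical three-dimensional "embeddings"; real models produce hundreds
# or thousands of dimensions, so these values are purely illustrative.
vectors = {
    "king":    np.array([0.90, 0.80, 0.10]),
    "emperor": np.array([0.85, 0.75, 0.20]),
    "otter":   np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    # Values near 1.0 mean same direction (semantically close); near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["emperor"]))  # high: related concepts
print(cosine_similarity(vectors["king"], vectors["otter"]))    # low: unrelated concepts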

Vector embeddings power a range of LLM workflows, including semantic search, document clustering, recommendation systems, anomaly detection, content similarity analysis, and classification via auto-labeling.

Semantic similarity in Kong AI Gateway

In Kong’s AI Gateway, several plugins leverage embedding-based similarity:

  • AI Proxy Advanced: Performs semantic routing by embedding each upstream’s description at config time and storing the results in a selected vector database. At runtime, it embeds the prompt and queries the vector database to route requests to the most semantically appropriate upstream.
  • AI Semantic Cache: Indexes previous prompts and responses as embeddings. On each request, it searches for semantically similar inputs and serves cached responses when possible to reduce redundant LLM calls.
  • AI RAG Injector: Retrieves semantically relevant chunks from a vector database. It embeds the prompt, performs a similarity search, and injects the results into the prompt to enable retrieval-augmented generation.
  • AI Semantic Prompt Guard: Compares incoming prompts against allow/deny lists using embedding similarity to detect and block misuse patterns.

Vector databases

To compare embeddings efficiently, Kong’s AI Gateway semantic plugins rely on vector databases. These specialized data stores index high-dimensional embeddings and enable fast similarity search based on distance metrics like cosine similarity or Euclidean distance.

When a plugin needs to find semantically similar content—whether it’s a past prompt, an upstream description, or a document chunk—it sends a query to a vector database. The database returns the closest matches, allowing the plugin to make decisions like caching, routing, injecting, or blocking.

Currently, Kong’s AI Gateway supports Redis and PGVector as vector database backends.

The selected database stores the embeddings generated by the plugin (either at config time or runtime), and determines the accuracy and performance of semantic operations.
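
Conceptually, the lookup a vector database performs resembles the brute-force search sketched below. The entry names and embedding values are made up, and a real backend such as Redis or PGVector uses an optimized index rather than scanning every entry.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up stored embeddings, standing in for upstream descriptions,
# cached prompts, or document chunks held in the vector database.
index = {
    "billing-faq":  np.array([0.10, 0.90, 0.30]),
    "code-helper":  np.array([0.80, 0.20, 0.50]),
    "travel-guide": np.array([0.40, 0.40, 0.80]),
}

def top_k(query_embedding, k=2):
    # Score every stored vector and return the k closest matches.
    scored = [(name, cosine_similarity(query_embedding, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

print(top_k(np.array([0.75, 0.25, 0.45])))  # "code-helper" ranks first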

What is compared for similarity?

Each plugin applies similarity search slightly differently depending on its goal. These comparisons determine whether the plugin routes, blocks, reuses, or enriches a prompt based on meaning rather than syntax.

The following list describes which embeddings each AI plugin compares:

  • AI Proxy Advanced: the prompt vs. the description field of each upstream target
  • AI Semantic Prompt Guard: the prompt vs. allowlist and denylist prompts
  • AI Semantic Cache: the prompt vs. cached prompt keys
  • AI RAG Injector: the prompt vs. vectorized document chunks

Dimensionality

Embedding models convert text into high-dimensional floating-point arrays in which mathematical distance reflects semantic relationships. In other words, ingested text becomes a set of points in a vector space, which is what makes similarity search in vector databases possible, and the dimensionality of the embeddings plays a critical role in how well this works.

Dimensionality determines how many numerical features represent each piece of content—similar to how a detailed profile might have dimensions for age, interests, location, and preferences. Higher dimensions create more detailed “fingerprints” that capture nuanced relationships, with smaller distances between vectors indicating stronger conceptual similarity and larger distances showing weaker associations.

For example, this request to the OpenAI /embeddings API via Kong AI Gateway:

{
    "input": "Tell me, Muse, of the man of many ways, who was driven far journeys, after he had sacked Troy’s sacred citadel.",
    "model": "text-embedding-3-large",
    "dimensions": 20
}

Creates the following embedding:

{
	"object": "list",
	"data": [
		{
			"object": "embedding",
			"index": 0,
			"embedding": [
				0.26458353,
				-0.062855035,
				-0.14282244,
				0.18218088,
				-0.41043353,
				0.3704169,
				0.1712553,
				-0.10945333,
				-0.00060006406,
				0.10076551,
				-0.0697658,
				0.1779686,
				-0.3464596,
				0.028745485,
				0.3017042,
				0.2543161,
				-0.20916577,
				-0.06255886,
				-0.21469438,
				0.32934725
			]
		}
	],
	"model": "text-embedding-3-large",
	"usage": {
		"prompt_tokens": 28,
		"total_tokens": 28
	}
}

The embedding array contains 20 floating-point numbers—each one representing a dimension in the vector space.

For simplicity, this example uses a reduced dimensionality of 20, though production models typically use 1536 or more.
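
As an illustration, the same request can be issued from Python through the gateway. The proxy address and the /embeddings route below are placeholders for whatever your Kong deployment exposes.

import requests

KONG_PROXY = "http://localhost:8000"  # placeholder proxy address for your gateway

response = requests.post(
    f"{KONG_PROXY}/embeddings",  # placeholder route configured in your deployment
    json={
        "input": "Tell me, Muse, of the man of many ways...",
        "model": "text-embedding-3-large",
        "dimensions": 20,  # reduced dimensionality, as in the example above
    },
)

embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # 20 floating-point values, one per dimension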

Accuracy and performance considerations

If you use embedding models that support defining the dimensionality of the embedding output, you should consider how to balance accuracy and performance based on your use case.

Dimensionality extremes at either end of the spectrum involve significant trade-offs:

Lower dimensionality (2–10 dimensions)

  Benefits:
  • Improves speed and performance
  • Works well for simpler tasks like basic keyword matching or simple images, where relatively few dimensions may suffice

  Drawbacks:
  • Can be too simplistic, like calling a movie simply “good” or “bad”
  • Might miss important nuance and lead to less accurate matches

Higher dimensionality (10,000+ dimensions)

  Benefits:
  • Improves the granularity and nuance of similarity searches
  • Useful for complex tasks like semantic text understanding or detailed images, where thousands of dimensions are often required

  Drawbacks:
  • Increases storage and computation costs
  • Can suffer from the “curse of dimensionality”, where distances between vectors become less meaningful

Use moderate dimensionality when possible, and tune it based on both the complexity of your data and the responsiveness required by your application.

Cosine and Euclidean similarity

Kong AI Gateway supports both cosine similarity and Euclidean distance for vector comparisons, allowing you to choose the method best suited for your use case. You can configure the method using the config.vectordb.distance_metric setting in the respective plugin.

  • Use cosine for nuanced semantic similarity (for example, document comparison, text clustering), especially when content length varies or dataset diversity is high.
  • Use euclidean when magnitude matters (for example, images, sensor data) or when you’re working with dense, well-aligned feature sets.

Cosine similarity

Cosine similarity measures the angle between vectors, ignoring their magnitude. It is well-suited for semantic matching, particularly in text-based scenarios. OpenAI recommends cosine similarity for use with the text-embedding-3-large model.

Cosine similarity example

Figure 2: Visualization of cosine similarity as the angle between vector directions.

Cosine similarity tends to perform well in both low- and high-dimensional spaces, especially on high-diversity datasets, because it captures vector orientation rather than size. This can be useful, for example, when comparing texts about Microsoft, Apple, and Google.

Euclidean distance

Euclidean distance measures the straight-line (L2) distance between vectors and is sensitive to magnitude. It works better when comparing objects across broad thematic categories, such as Technology, Fruit, or Musical Instruments, and in domains where absolute distance is important.

Euclidean similarity example

Figure 3: Visualization of Euclidean distance between vector points.

Differences between cosine similarity and Euclidean distance

The two graphs below illustrate a key difference between cosine similarity and Euclidean distance: two vectors can have the same angle (and thus the same cosine similarity, represented as γ below) while their Euclidean distances may differ significantly. This happens because cosine similarity measures only the direction of vectors, ignoring their length or magnitude, whereas Euclidean distance reflects the actual straight-line distance between points in space.

Comparing cosine and Euclidean similarity

Figure 4: Two vectors with equal cosine similarity (γ) but different Euclidean distances.
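
This effect is easy to reproduce numerically. The sketch below uses two made-up vectors that point in the same direction but differ in magnitude: their cosine similarity is maximal, while their Euclidean distance is far from zero.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 3 * a  # same direction as a, three times the magnitude

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
euclidean = float(np.linalg.norm(a - b))

print(cosine)     # ~1.0: identical direction, so maximal cosine similarity
print(euclidean)  # > 0: the points are still far apart in absolute terms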

The following recommendations can help you determine which similarity metric to use based on your use cases:

Cosine similarity:
  • Find semantically similar news articles regardless of length
  • Recommend products to users with similar taste profiles
  • Identify documents with overlapping topics in large corpora
  • Compare diverse text embeddings (for example, Microsoft vs. Apple)

Euclidean distance:
  • Find images with similar color distributions and intensity
  • Detect anomalies in sensor readings where magnitude matters
  • Compare aligned image patches using raw pixel embeddings

Similarity threshold

The config.vectordb.threshold parameter controls how strictly the vector database evaluates similarity during a query. It is passed directly to the vector engine—such as Redis or PGVector—and defines which results qualify as matches. In Redis, for example, this maps to the distance_threshold query parameter. By default, Redis sets this to 0.2, but you can override it to suit your use case.

The threshold can vary depending on which embedding similarity metric you’re using:

  • With cosine similarity, the threshold defines the minimum similarity score (between 0 and 1) required for a match. A value of 1 means only exact matches qualify, while lowering the threshold (for example, to 0.6) allows for looser, less similar matches. Higher values mean stricter matching, and lower values mean broader matching. Cosine similarity measures the angle between two embedding vectors—scores near 1 indicate strong alignment (semantic closeness or zero angle), while scores near 0 indicate orthogonality, meaning the vectors are unrelated in direction and therefore semantically dissimilar. Users often configure thresholds above 0.5 for strong matches and 0.8–0.9 for near-exact similarity.

  • For Euclidean distance, the threshold defines the minimum required similarity as well, normalized to follow the same logic: 1 represents an exact match (zero distance), while 0 allows the broadest match range. Just like with cosine similarity, higher values enforce tighter similarity, while lower values allow looser matches.

The optimal threshold depends on the selected distance metric, the embedding model’s dimensionality, and the variation in your data. Tuning may be required for best results.

In Kong’s AI semantic plugins, this threshold is not post-processed or filtered by the plugin itself. The plugin sends it directly to the vector database, which uses it to determine matching documents based on the configured distance metric.
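
Conceptually, the database applies the threshold as a cutoff on the similarity score. The sketch below mimics that filtering in Python for cosine similarity, using made-up stored embeddings.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for previously stored entries.
stored = {
    "reset my password":   np.array([0.82, 0.30, 0.10]),
    "summarize this text": np.array([0.15, 0.85, 0.40]),
}

def matches(query_embedding, threshold):
    # Keep only entries whose similarity meets or exceeds the threshold,
    # mimicking what the vector database does with the configured value.
    return [name for name, vec in stored.items()
            if cosine_similarity(query_embedding, vec) >= threshold]

query = np.array([0.80, 0.35, 0.12])
print(matches(query, threshold=0.8))  # strict: only near-duplicates qualify
print(matches(query, threshold=0.5))  # permissive: looser matches also qualify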

Threshold sensitivity and cache hit effectiveness

The closer your similarity threshold is to 1, the more likely you are to get cache misses when using plugins like AI Semantic Cache. This is because a higher threshold makes the similarity filter more strict, so only embeddings that are nearly identical to the query will qualify as a match. In practice, this means even small variations in phrasing, structure, or context can cause the system to miss otherwise semantically similar entries and fall back to calling the LLM again.

This happens because vector embeddings are not perfectly robust to minor semantic shifts, especially for short or ambiguous prompts. Raising the threshold narrows the match window, so you’re effectively demanding a near-exact match in a complex vector space, which is rare unless the input is repeated verbatim.

The chart below illustrates this effect: as the similarity threshold increases (that is, becomes more strict), the cache hit rate typically falls. Conversely, lowering the threshold broadens the acceptance of matches in the embedding space, which helps reduce redundant LLM calls at the cost of some semantic looseness.

Similarity threshold and cache hit rate

Figure 5: As the similarity threshold decreases (becomes more permissive), cache hit rate increases—illustrating the trade-off between strict semantic matching and LLM efficiency.

This is generally true but not absolute. If you’re working in a very narrow domain where inputs are highly repetitive or templated (for example, support FAQs), even a strict threshold can still yield good cache hit rates. Conversely, in open-ended chat or creative domains, a stricter threshold will almost always increase cache misses due to natural language variability.
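
To get an intuition for the trade-off, the toy sweep below counts how many of a handful of made-up paraphrase embeddings would qualify as cache hits at different thresholds. The numbers are invented purely to illustrate the shape of the curve.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cached = np.array([0.80, 0.55, 0.25])   # embedding of a cached prompt
rephrasings = [                          # embeddings of paraphrased queries
    np.array([0.79, 0.56, 0.27]),
    np.array([0.75, 0.50, 0.35]),
    np.array([0.60, 0.65, 0.45]),
]

for threshold in (0.99, 0.95, 0.90, 0.80):
    hits = sum(cosine_similarity(cached, q) >= threshold for q in rephrasings)
    print(threshold, hits)  # stricter thresholds admit fewer cache hits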

Limitations

While embedding-based similarity is efficient and effective for many use cases, it has important limitations. Embeddings typically do not capture subtle semantic changes or handle long context as well as LLMs.

For example, the following prompts may be considered semantically equivalent by a vector similarity search, even though the latter asks for additional detail:

  • Summarize this article.
  • Summarize this article. Tell me more.

To address these edge cases, you can use a smaller LLM to compare two texts side by side, enabling deeper semantic comparison.
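
One way to do this is sketched below: both prompts are sent to a smaller chat model behind the gateway, and the model is asked for a yes/no judgment. The /chat route, the model name, and the prompt wording are assumptions for illustration, not Kong-provided APIs.

import requests

KONG_PROXY = "http://localhost:8000"  # placeholder proxy address for your gateway

def semantically_equivalent(text_a, text_b):
    # Send both texts to a small chat model behind the gateway (OpenAI-compatible
    # payload); the /chat route and the model name are placeholders.
    response = requests.post(
        f"{KONG_PROXY}/chat",
        json={
            "model": "gpt-4o-mini",
            "messages": [{
                "role": "user",
                "content": (
                    "Do these two requests ask for exactly the same thing? "
                    "Answer only yes or no.\n"
                    f"1: {text_a}\n2: {text_b}"
                ),
            }],
        },
    )
    answer = response.json()["choices"][0]["message"]["content"]
    return answer.strip().lower().startswith("yes")

print(semantically_equivalent("Summarize this article.",
                              "Summarize this article. Tell me more."))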
