Embedding-based similarity matching in Kong AI gateway plugins

Uses: Kong Gateway, AI Gateway

Applications that interact with large language models (LLMs) rely on semantic search, which matches text not by exact words but by similarity in meaning. This is achieved using vector embeddings, which represent pieces of text as points in a high-dimensional space.

These embeddings enable the concept of semantic similarity, where the “distance” between vectors reflects how closely related two pieces of text are. Similarity can be measured using techniques like cosine similarity or Euclidean distance, forming the quantitative basis for comparing meaning.

Vector embeddings example

Figure 1: A simplified representation of vector text embeddings in a three-dimensional space.

For example, in the image above, “king” and “emperor” are semantically more similar than “king” and “otter”.
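
To make this concrete, the following sketch computes cosine similarity over three made-up, three-dimensional vectors. Real embedding models produce far higher-dimensional output, so the values here are purely illustrative.

import numpy as np

# Hypothetical three-dimensional "embeddings"; real models produce hundreds
# or thousands of dimensions, so these values are purely illustrative.
vectors = {
    "king":    np.array([0.90, 0.80, 0.10]),
    "emperor": np.array([0.85, 0.75, 0.20]),
    "otter":   np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    # Values near 1.0 mean same direction (semantically close); near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["emperor"]))  # high: related concepts
print(cosine_similarity(vectors["king"], vectors["otter"]))    # low: unrelated concepts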

Vector embeddings power a range of LLM workflows, including semantic search, document clustering, recommendation systems, anomaly detection, content similarity analysis, and classification via auto-labeling.

Semantic similarity in Kong AI Gateway

In Kong’s AI Gateway, several plugins leverage embedding-based similarity:

  • AI Proxy Advanced: Performs semantic routing by embedding each upstream’s description at config time and storing the results in a selected vector database. At runtime, it embeds the prompt and queries the vector database to route requests to the most semantically appropriate upstream.
  • AI Semantic Cache: Indexes previous prompts and responses as embeddings. On each request, it searches for semantically similar inputs and serves cached responses when possible to reduce redundant LLM calls.
  • AI RAG Injector: Retrieves semantically relevant chunks from a vector database. It embeds the prompt, performs a similarity search, and injects the results into the prompt to enable retrieval-augmented generation.
  • AI Semantic Prompt Guard: Compares incoming prompts against allow/deny lists using embedding similarity to detect and block misuse patterns.

Vector databases

To compare embeddings efficiently, Kong’s AI Gateway semantic plugins rely on vector databases. These specialized data stores index high-dimensional embeddings and enable fast similarity search based on distance metrics like cosine similarity or Euclidean distance.

When a plugin needs to find semantically similar content—whether it’s a past prompt, an upstream description, or a document chunk—it sends a query to a vector database. The database returns the closest matches, allowing the plugin to make decisions like caching, routing, injecting, or blocking.

Currently, Kong’s AI Gateway supports Redis and PGVector as vector database backends.

The selected database stores the embeddings generated by the plugin (either at config time or runtime), and determines the accuracy and performance of semantic operations.
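
Conceptually, the lookup a vector database performs resembles the brute-force search sketched below. The entry names and embedding values are made up, and a real backend such as Redis or PGVector uses an optimized index rather than scanning every entry.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up stored embeddings, standing in for upstream descriptions,
# cached prompts, or document chunks held in the vector database.
index = {
    "billing-faq":  np.array([0.10, 0.90, 0.30]),
    "code-helper":  np.array([0.80, 0.20, 0.50]),
    "travel-guide": np.array([0.40, 0.40, 0.80]),
}

def top_k(query_embedding, k=2):
    # Score every stored vector and return the k closest matches.
    scored = [(name, cosine_similarity(query_embedding, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

print(top_k(np.array([0.75, 0.25, 0.45])))  # "code-helper" ranks first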

What is compared for similarity?

Each plugin applies similarity search slightly differently depending on its goal. These comparisons determine whether the plugin routes, blocks, reuses, or enriches a prompt based on meaning rather than syntax.

The following list describes which embeddings each AI plugin compares:

  • AI Proxy Advanced: the prompt vs. the description field of each upstream target
  • AI Semantic Prompt Guard: the prompt vs. allowlist and denylist prompts
  • AI Semantic Cache: the prompt vs. cached prompt keys
  • AI RAG Injector: the prompt vs. vectorized document chunks

Dimensionality

Embedding models convert text into high-dimensional floating-point arrays in which mathematical distance reflects semantic relationships. In other words, ingested text becomes a set of points in a vector space, which is what makes similarity search in vector databases possible, and the dimensionality of the embeddings plays a critical role in how well this works.

Dimensionality determines how many numerical features represent each piece of content—similar to how a detailed profile might have dimensions for age, interests, location, and preferences. Higher dimensions create more detailed “fingerprints” that capture nuanced relationships, with smaller distances between vectors indicating stronger conceptual similarity and larger distances showing weaker associations.

For example, this request to the OpenAI /embeddings API via Kong AI Gateway:

{
    "input": "Tell me, Muse, of the man of many ways, who was driven far journeys, after he had sacked Troy’s sacred citadel.",
    "model": "text-embedding-3-large",
    "dimensions": 20
}

Creates the following embedding:

{
	"object": "list",
	"data": [
		{
			"object": "embedding",
			"index": 0,
			"embedding": [
				0.26458353,
				-0.062855035,
				-0.14282244,
				0.18218088,
				-0.41043353,
				0.3704169,
				0.1712553,
				-0.10945333,
				-0.00060006406,
				0.10076551,
				-0.0697658,
				0.1779686,
				-0.3464596,
				0.028745485,
				0.3017042,
				0.2543161,
				-0.20916577,
				-0.06255886,
				-0.21469438,
				0.32934725
			]
		}
	],
	"model": "text-embedding-3-large",
	"usage": {
		"prompt_tokens": 28,
		"total_tokens": 28
	}
}

The embedding array contains 20 floating-point numbers—each one representing a dimension in the vector space.

For simplicity, this example uses a reduced dimensionality of 20, though production models typically use 1536 or more.
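
As an illustration, the same request can be issued from Python through the gateway. The proxy address and the /embeddings route below are placeholders for whatever your Kong deployment exposes.

import requests

KONG_PROXY = "http://localhost:8000"  # placeholder proxy address for your gateway

response = requests.post(
    f"{KONG_PROXY}/embeddings",  # placeholder route configured in your deployment
    json={
        "input": "Tell me, Muse, of the man of many ways...",
        "model": "text-embedding-3-large",
        "dimensions": 20,  # reduced dimensionality, as in the example above
    },
)

embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # 20 floating-point values, one per dimension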

Accuracy and performance considerations

If you use embedding models that support defining the dimensionality of the embedding output, you should consider how to balance accuracy and performance based on your use case.

Dimensionality extremes at either end of the spectrum involve significant trade-offs:

Lower dimensionality (2–10 dimensions)

  Benefits:
  • Improves speed and performance
  • Works well for simpler tasks like basic keyword matching or simple images, where relatively few dimensions may suffice

  Drawbacks:
  • Can be too simplistic, like calling a movie simply “good” or “bad”
  • Might miss important nuance and lead to less accurate matches

Higher dimensionality (10,000+ dimensions)

  Benefits:
  • Improves the granularity and nuance of similarity searches
  • Useful for complex tasks like semantic text understanding or detailed images, where thousands of dimensions are often required

  Drawbacks:
  • Increases storage and computation costs
  • Can suffer from the “curse of dimensionality”, where distances between vectors become less meaningful

Use moderate dimensionality when possible, and tune it based on both the complexity of your data and the responsiveness required by your application.

Cosine and Euclidean similarity

Kong AI Gateway supports both cosine similarity and Euclidean distance for vector comparisons, allowing you to choose the method best suited for your use case. You can configure the method using the config.vectordb.distance_metric setting in the respective plugin.

  • Use cosine for nuanced semantic similarity (for example, document comparison, text clustering), especially when content length varies or dataset diversity is high.
  • Use euclidean when magnitude matters (for example, images, sensor data) or when you’re working with dense, well-aligned feature sets.

Cosine similarity

Cosine similarity measures the angle between vectors, ignoring their magnitude. It is well-suited for semantic matching, particularly in text-based scenarios. OpenAI recommends cosine similarity for use with the text-embedding-3-large model.

Cosine similarity example

Figure 2: Visualization of cosine similarity as the angle between vector directions.

Cosine similarity tends to perform well in both low- and high-dimensional spaces, especially on high-diversity datasets, because it captures vector orientation rather than size. This can be useful, for example, when comparing texts about Microsoft, Apple, and Google.

Euclidean distance

Euclidean distance measures the straight-line (L2) distance between vectors and is sensitive to magnitude. It works better when comparing objects across broad thematic categories, such as Technology, Fruit, or Musical Instruments, and in domains where absolute distance is important.

Euclidean similarity example

Figure 3: Visualization of Euclidean distance between vector points.

Differences between cosine similarity and Euclidean distance

The two graphs below illustrate a key difference between cosine similarity and Euclidean distance: two vectors can have the same angle (and thus the same cosine similarity, represented as γ below) while their Euclidean distances may differ significantly. This happens because cosine similarity measures only the direction of vectors, ignoring their length or magnitude, whereas Euclidean distance reflects the actual straight-line distance between points in space.

Comparing cosine and Euclidean similarity

Figure 4: Two vectors with equal cosine similarity (γ) but different Euclidean distances.
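
This effect is easy to reproduce numerically. The sketch below uses two made-up vectors that point in the same direction but differ in magnitude: their cosine similarity is maximal, while their Euclidean distance is far from zero.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 3 * a  # same direction as a, three times the magnitude

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
euclidean = float(np.linalg.norm(a - b))

print(cosine)     # ~1.0: identical direction, so maximal cosine similarity
print(euclidean)  # > 0: the points are still far apart in absolute terms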

The following recommendations can help you determine which similarity metric to use based on your use cases:

Cosine similarity:
  • Find semantically similar news articles regardless of length
  • Recommend products to users with similar taste profiles
  • Identify documents with overlapping topics in large corpora
  • Compare diverse text embeddings (for example, Microsoft vs. Apple)

Euclidean distance:
  • Find images with similar color distributions and intensity
  • Detect anomalies in sensor readings where magnitude matters
  • Compare aligned image patches using raw pixel embeddings

Similarity threshold

The config.vectordb.threshold parameter controls how strictly the vector database evaluates similarity during a query. It is passed directly to the vector engine—such as Redis or PGVector—and defines which results qualify as matches. In Redis, for example, this maps to the distance_threshold query parameter. By default, Redis sets this to 0.2, but you can override it to suit your use case.

The threshold can vary depending on which embedding similarity metric you’re using:

  • With cosine similarity, the threshold defines the minimum similarity score (between 0 and 1) required for a match. A value of 1 means only exact matches qualify, while lowering the threshold (for example, to 0.6) allows for looser, less similar matches. Higher values mean stricter matching, and lower values mean broader matching. Cosine similarity measures the angle between two embedding vectors—scores near 1 indicate strong alignment (semantic closeness or zero angle), while scores near 0 indicate orthogonality, meaning the vectors are unrelated in direction and therefore semantically dissimilar. Users often configure thresholds above 0.5 for strong matches and 0.8–0.9 for near-exact similarity.

  • For Euclidean distance, the threshold defines the minimum required similarity as well, normalized to follow the same logic: 1 represents an exact match (zero distance), while 0 allows the broadest match range. Just like with cosine similarity, higher values enforce tighter similarity, while lower values allow looser matches.

The optimal threshold depends on the selected distance metric, the embedding model’s dimensionality, and the variation in your data. Tuning may be required for best results.

In Kong’s AI semantic plugins, this threshold is not post-processed or filtered by the plugin itself. The plugin sends it directly to the vector database, which uses it to determine matching documents based on the configured distance metric.
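
Conceptually, the database applies the threshold as a cutoff on the similarity score. The sketch below mimics that filtering in Python for cosine similarity, using made-up stored embeddings.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for previously stored entries.
stored = {
    "reset my password":   np.array([0.82, 0.30, 0.10]),
    "summarize this text": np.array([0.15, 0.85, 0.40]),
}

def matches(query_embedding, threshold):
    # Keep only entries whose similarity meets or exceeds the threshold,
    # mimicking what the vector database does with the configured value.
    return [name for name, vec in stored.items()
            if cosine_similarity(query_embedding, vec) >= threshold]

query = np.array([0.80, 0.35, 0.12])
print(matches(query, threshold=0.8))  # strict: only near-duplicates qualify
print(matches(query, threshold=0.5))  # permissive: looser matches also qualify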

Threshold sensitivity and cache hit effectiveness

The closer your similarity threshold is to 1, the more likely you are to get cache misses when using plugins like AI Semantic Cache. This is because a higher threshold makes the similarity filter more strict, so only embeddings that are nearly identical to the query will qualify as a match. In practice, this means even small variations in phrasing, structure, or context can cause the system to miss otherwise semantically similar entries and fall back to calling the LLM again.

This happens because vector embeddings are not perfectly robust to minor semantic shifts, especially for short or ambiguous prompts. Raising the threshold narrows the match window, so you’re effectively demanding a near-exact match in a complex vector space, which is rare unless the input is repeated verbatim.

The chart below illustrates this effect: as the similarity threshold increases (that is, becomes more strict), the cache hit rate typically falls. Conversely, lowering the threshold broadens the acceptance of matches in the embedding space, which helps reduce redundant LLM calls at the cost of some semantic looseness.

Similarity threshold and cache hit rate

Figure 5: As the similarity threshold decreases (becomes more permissive), cache hit rate increases—illustrating the trade-off between strict semantic matching and LLM efficiency.

This is generally true but not absolute. If you’re working in a very narrow domain where inputs are highly repetitive or templated (for example, support FAQs), even a strict threshold can still yield good cache hit rates. Conversely, in open-ended chat or creative domains, a stricter threshold will almost always increase cache misses due to natural language variability.
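
To get an intuition for the trade-off, the toy sweep below counts how many of a handful of made-up paraphrase embeddings would qualify as cache hits at different thresholds. The numbers are invented purely to illustrate the shape of the curve.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cached = np.array([0.80, 0.55, 0.25])   # embedding of a cached prompt
rephrasings = [                          # embeddings of paraphrased queries
    np.array([0.79, 0.56, 0.27]),
    np.array([0.75, 0.50, 0.35]),
    np.array([0.60, 0.65, 0.45]),
]

for threshold in (0.99, 0.95, 0.90, 0.80):
    hits = sum(cosine_similarity(cached, q) >= threshold for q in rephrasings)
    print(threshold, hits)  # stricter thresholds admit fewer cache hits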

Limitations

While embedding-based similarity is efficient and effective for many use cases, it has important limitations. Embeddings typically do not capture subtle semantic changes or handle long context as well as LLMs.

For example, the following prompts may be considered semantically equivalent by a vector similarity search, even though the latter asks for additional detail:

  • Summarize this article.
  • Summarize this article. Tell me more.

To address these edge cases, you can use a smaller LLM to compare two texts side by side, enabling deeper semantic comparison.
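
One way to do this is sketched below: both prompts are sent to a smaller chat model behind the gateway, and the model is asked for a yes/no judgment. The /chat route, the model name, and the prompt wording are assumptions for illustration, not Kong-provided APIs.

import requests

KONG_PROXY = "http://localhost:8000"  # placeholder proxy address for your gateway

def semantically_equivalent(text_a, text_b):
    # Send both texts to a small chat model behind the gateway (OpenAI-compatible
    # payload); the /chat route and the model name are placeholders.
    response = requests.post(
        f"{KONG_PROXY}/chat",
        json={
            "model": "gpt-4o-mini",
            "messages": [{
                "role": "user",
                "content": (
                    "Do these two requests ask for exactly the same thing? "
                    "Answer only yes or no.\n"
                    f"1: {text_a}\n2: {text_b}"
                ),
            }],
        },
    )
    answer = response.json()["choices"][0]["message"]["content"]
    return answer.strip().lower().startswith("yes")

print(semantically_equivalent("Summarize this article.",
                              "Summarize this article. Tell me more."))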
