An Exhaustive List Of Distance Metrics For Vector Similarity Search

Haziqa · Published in DataDrivenInvestor · 10 min read · Dec 26, 2023


A comprehensive compilation of various distance metrics and their contribution to vector similarity search.

Featured image credit: Quarta_ (iStock).

Distance metrics are versatile analytical techniques that measure the proximity between data points within a mathematical space. These data points are represented numerically as vectors, which capture the patterns present in the input data, and distance metrics quantify how similar or dissimilar those vectors are.

Different distance metrics are used depending on the input data type and ML application. This article highlights their distinct roles and significance in ML.

Types of Distance Metrics covered in this article:

  • Dot Product
  • Cosine Similarity
  • Euclidean Distance
  • Manhattan Distance
  • Hamming Distance
  • Minkowski Distance

Distance Metrics Explained

Distance metrics are a means to evaluate the similarity of real-world objects represented as vector embeddings. These embeddings capture object features as vectors, often stored in databases (such as Qdrant, Pinecone, Weaviate) for efficient retrieval and analysis.

Computing the distances between these vectors with techniques such as Euclidean or Cosine distance yields precise measurements of their similarity, offering insights into relationships among real-world entities across diverse domains. These distance metrics are foundational in vector search, revealing detailed similarities among datasets or objects and powering use cases such as:

  • Image or text similarity search
  • Recommendation systems
  • Information retrieval
  • Clustering
  • Anomaly detection

6 Prominent Distance Metrics Used in Vector Similarity Search

Dot Product

The Dot Product distance metric measures the similarity between two vectors by evaluating the sum of their coordinate products.

Figure: Dot product between two sample vectors A and B. Image by the author.

Formula: The formula for calculating the Dot Product between vectors A and B in n-dimensional space is given by:

$$ A \cdot B = \sum_{i=1}^{n} A_i B_i $$

Explanation: The Dot Product calculates the sum of the products of corresponding elements in both vectors. It equals the product of the vectors’ magnitudes and the cosine of the angle between them, so it reflects both how strongly the vectors align in direction and how large they are. In embedding spaces, similar meanings are often conveyed through vector direction, so this measure of alignment matters, not magnitude alone.

A positive dot product indicates vectors pointing in broadly the same direction, a negative value indicates opposing directions, and a value of zero indicates perpendicular (orthogonal) vectors.
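
To make this concrete, here is a minimal NumPy sketch (the vectors are arbitrary demo values, not taken from any particular dataset):

```python
import numpy as np

# Two example vectors (arbitrary demo values)
A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, -5.0, 6.0])

# Dot product: sum of the element-wise products
dot = np.dot(A, B)   # 1*4 + 2*(-5) + 3*6 = 12.0
print(dot)           # positive value -> the vectors point in broadly the same direction
```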

Significance in vector similarity search: The Dot Product’s proficiency in measuring alignment and correlation between vectors makes it a critical distance metric in vector similarity search as it can identify related entities or features from large datasets. For instance, the dot product metric can enhance recommendation systems by identifying items with comparable characteristics or user preferences.

Cosine Similarity

Cosine similarity evaluates the similarity between two vectors by assessing the cosine of the angle between them, regardless of their magnitudes.

Figure: Cosine similarity between two sample vectors A and B. Image by the author.

Formula: The formula for Cosine similarity between vectors A and B is:

$$ \text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}} $$

Explanation: Cosine similarity assesses similarity based on the direction rather than the magnitude of vectors. It quantifies how closely vectors align in multi-dimensional space, producing values between -1 (opposite directions) and 1 (same direction). Cosine distance is typically defined as 1 minus cosine similarity, so as the distance between data points grows, the similarity decreases, and vice versa.
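
As a minimal sketch, cosine similarity can be computed with NumPy as the dot product divided by the product of the vector norms (assuming neither vector is all zeros; the example values are arbitrary):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])   # same direction as A, twice the magnitude

# Cosine similarity: dot product divided by the product of the L2 norms
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)   # 1.0 (up to floating-point rounding), since the vectors share a direction
```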

Significance in vector similarity search: Because Cosine similarity emphasizes direction rather than magnitude, it can compare vectors irrespective of their scales and accurately captures similarity of orientation in high-dimensional spaces. As a prevalent text comparison metric, it is widely employed in tasks like document retrieval, recommendation systems, and clustering across diverse analytical domains.

Euclidean Distance

Euclidean distance measures the straight-line distance separating two points in an n-dimensional space. It represents the length of the line segment connecting the points.

Figure: Euclidean distance between two sample vectors A and B. Image by the author.

Formula: The formula for Euclidean distance between vectors x and y in an n-dimensional space is:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

Explanation: Euclidean distance is computed as the square root of the sum of the squared differences between the corresponding components of two vectors. It measures the overall distance between the vectors in a multi-dimensional space, considering their differences in each dimension. This measurement represents the shortest, straight-line path between the two points.
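
A minimal NumPy sketch with arbitrary example vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: L2 norm of the difference vector
dist = np.linalg.norm(x - y)   # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0
print(dist)
```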

Significance in vector similarity search: Euclidean distance is the most widely used and computationally straightforward distance metric for assessing the dissimilarity among vectors in diverse high-dimensional spaces. It can account for direction as well as magnitude where both factors are important. These capabilities make Euclidean distance vital in clustering, image processing, and anomaly detection tasks.

Manhattan Distance

Manhattan distance, also known as Taxicab or City Block distance, measures the distance between two points in a grid-based system.

Figure: Manhattan distance between two sample vectors A and B. Image by the author.

Formula: The formula for Manhattan distance between vectors x and y in an n-dimensional space is:

$$ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| $$

where x_i and y_i are the corresponding components of vectors x and y in each dimension.

Explanation: Manhattan distance computes the distance as the sum of the absolute differences between the corresponding components of vectors along each dimension. It represents the distance a taxi would travel in a city when the streets’ layout forms a grid.
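
A minimal NumPy sketch (using the same arbitrary example vectors as the Euclidean sketch above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Manhattan distance: sum of the absolute component-wise differences
dist = np.sum(np.abs(x - y))   # |1-4| + |2-6| + |3-3| = 7.0
print(dist)
```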

Significance in vector similarity search: Manhattan distance offers a grid-based distance measurement for vector similarity search. It is ideal for constrained scenarios where movement follows gridlines, making it effective in tasks like spatial navigation, route planning, and structured data analysis in grid-aligned or constrained-pathway contexts.

Hamming Distance

Hamming distance measures the dissimilarity between two binary vectors of equal length. It computes the number of positions at which the corresponding bits are different.

Figure: Hamming distance between two sample bit strings A and B. Image by the author.

Formula: The formula for Hamming distance between bit strings x and y of length N is

$$ d(x, y) = \sum_{i=1}^{N} (x_i \oplus y_i) $$

where x_i and y_i are the bits at position i in the bit strings, and ⊕ denotes XOR (1 when the bits differ, 0 when they match).

Explanation: Hamming distance quantifies the number of differing bits between two binary vectors: it counts the positions where the bits are not the same. (A normalized variant divides this count by the total number of bits.) When the vectors are identical, the Hamming distance equals 0.
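
A minimal NumPy sketch with two arbitrary example bit strings:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])

# Hamming distance: count of positions where the bits differ
dist = int(np.sum(x != y))   # positions 1 and 3 differ -> 2
print(dist)
```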

Significance in vector similarity search: Hamming distance is crucial in vector similarity search for comparing binary vectors, particularly in scenarios like error detection in digital communication or cryptography. For instance, detecting one-bit or two-bit errors is essential to recover the most likely original vector from a corrupted binary vector, and Hamming distance provides exactly the measure of difference needed to evaluate and correct such vectors.

Minkowski Distance

The Minkowski distance is a generalized form of the Euclidean and Manhattan distances that calculates the distance between two points in an n-dimensional space. It is defined only in a normed vector space, that is, a space in which vectors have a measurable length (a norm).

Figure: Minkowski distance between two sample vectors A and B. Image by the author.

Formula: The formula for Minkowski distance between vectors x and y in an n-dimensional space is

$$ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p} $$

where x_i and y_i are the components of vectors x and y in each dimension, and p is a parameter specifying the order of the norm for Minkowski distance.

Explanation: Minkowski distance measures the distance between two vectors as the p-th root of the sum of the absolute differences between their components, each raised to the p-th power, across all dimensions. By adjusting the value of p in the formula, we can compute the distance between two data points in various manners (see the sketch after this list). For example:

  • If p=2, then the distance becomes Euclidean
  • If p=1, then the distance becomes Manhattan
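
The sketch below (with arbitrary example vectors) shows how one Minkowski function recovers both special cases:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: p-th root of the sum of |x_i - y_i|^p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski(x, y, 1))   # 7.0 -> matches the Manhattan distance
print(minkowski(x, y, 2))   # 5.0 -> matches the Euclidean distance
```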

Significance in vector similarity search: Minkowski distance is a versatile metric in vector similarity search that adapts to different scenarios based on specific distance measurement requirements. This adaptability enables reliable implementation across diverse applications, including clustering, classification, and spatial analytical modeling.

Why does vector similarity search require different distance metrics?

ML models produce embeddings with diverse vector sizes and computational requirements, so every application must balance search speed against accuracy. Data traits such as sparsity, density, and dimensionality also change which notion of "closeness" is meaningful, which is why distance measures tailored to the data are pivotal for precise similarity assessments.

How to scale vector search to billions of vectors?

While understanding distance metrics lays the groundwork, scaling vector search to billions of vectors requires more than choosing the right metric. Vector search optimization becomes paramount to address the challenges of handling large-scale datasets with high-dimensional vector embeddings.

This is where we can employ quantization to fine-tune vector search according to specific memory, speed, and precision requirements. Quantization is a method to condense vectors into more compact representations, optimizing storage efficiency and search speed.

Here are some of the quantization techniques for scalable vector search:

  • Scalar Quantization: The individual conversion of each vector component from a floating-point representation to a low-precision integer (typically int8). This significantly reduces memory usage with minimal impact on search quality; a minimal sketch follows this list.
  • Product Quantization: A technique for quantizing high-dimensional vectors by splitting them into smaller, lower-dimensional subvectors and approximating each with an entry from a learned codebook. This compresses vectors aggressively to optimize memory usage, but with potential trade-offs in accuracy and speed.
  • Binary Quantization: A simplified single-bit representation of vector components, best suited to long vectors and large numbers of points. It partitions the vector space into binary-coded regions and assigns binary values to the vectors falling in those regions. This technique drastically reduces the memory footprint and considerably accelerates vector search operations.
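
To make the idea of scalar quantization concrete, here is a minimal, illustrative sketch (not the implementation used by any particular database) that maps float32 components to int8, cutting memory per component from 4 bytes to 1 at the cost of a small approximation error:

```python
import numpy as np

# Example float32 embedding (arbitrary demo values)
vec = np.array([0.12, -0.87, 0.45, 0.99, -0.33], dtype=np.float32)

# Scalar quantization sketch: map each component to an int8 in [-127, 127]
scale = np.abs(vec).max() / 127.0
quantized = np.round(vec / scale).astype(np.int8)   # 1 byte per component instead of 4

# Approximate reconstruction used when computing distances
restored = quantized.astype(np.float32) * scale
print(quantized)
print(np.abs(vec - restored).max())   # small quantization error
```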

Here are some vector databases that employ the quantization technique for faster vector search:

  • Qdrant is an AI-powered vector database that offers a production-ready vector similarity search service. It supports the most popular distance metrics, i.e., Cosine similarity, Dot product, and Euclidean distance. For flexible and scalable vector search, Qdrant employs three quantization methods catering to varying search use cases: scalar, product, and binary quantization.
  • Weaviate is an open-source vector database for storing, searching, and retrieving vectors as well as objects. It supports a variety of distance metrics, including Cosine, Dot product, squared Euclidean, Hamming, and Manhattan distance. It also offers product quantization for memory-efficient vector storage.
  • Pinecone is a fully managed vector database offering storage and querying capabilities for vector embeddings. It stores vectors in indexes and uses either Cosine, Dot product, or Euclidean distance to calculate the similarities between these vectors. To optimize its vector index search, Pinecone employs product quantization.

How to choose the ideal distance metric?

Choosing the ideal distance metric depends on specific problems, model requirements, and computational resources. Consider the problem type — whether clustering, classification, or similarity assessment — and ensure the metric suits your data’s dimensionality, sparsity, and distribution.

Factor in computational efficiency to match available resources for seamless implementation. Remember, no one distance metric fits all criteria; the best metric depends on the context and objectives of your analysis.

Concluding thoughts

Having explored the six most commonly used distance metrics, it’s evident that these metrics are the unsung heroes of similarity search. From Euclidean distance to Cosine similarity, each is critical in precisely measuring the similarity between vectors, whether for comparing images or crafting personalized recommendations. These metrics quietly work behind the scenes, refining searches and bolstering accuracy.

Vector databases are integral to the similarity search process. They serve as the canvas on which distance metrics work their magic to execute complex algorithms for various similarity search use cases seamlessly.

Looking ahead, as distance metrics continue to advance hand in hand with machine learning, we can expect increasingly refined metrics along with scalable and efficient similarity searches. This innovation will surely pave the way for insightful data discovery and exploration. Here’s to a future brimming with possibilities for uncovering deeper insights and making meaningful discoveries!

As much as I enjoyed writing this article, I hope it was an insightful reading experience for you. Let me know your feedback in the comments, including any mistakes you spotted or suggestions you have to improve this article. Also, feel free to contact me via LinkedIn with any further suggestions or inquiries.

