
[Question] Using HNSW index requires using Euclidean Distance Operator? #152

Closed
john0isaac opened this issue Dec 18, 2024 · 4 comments · Fixed by #155
Comments

@john0isaac
Contributor

Description

You have this comment in the pgvector playground repo:

# Define HNSW index to support vector similarity search through the vector_l2_ops access method (Euclidean distance). The SQL operator for Euclidean distance is written as <->.

Does it mean that the HNSW index will only work with the <-> operator?

I ask because here in the repo you are querying with the cosine distance operator <=>, not <->:

SELECT id, RANK () OVER (ORDER BY {self.embedding_column} <=> :embedding) AS rank

@pamelafox
Contributor

The pgvector extension will let you use any of the distance operators:
https://github.com/pgvector/pgvector?tab=readme-ov-file#querying

But for optimal performance, you should create an HNSW index for each operator you expect to use. This repo uses cosine similarity since it's designed for compatibility with multiple embedding models: cosine is the most accurate while also being flexible, in that it also works for non-unit vectors, unlike inner product.
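To illustrate (a sketch with hypothetical names; "items" and "embedding" stand in for the repo's actual table and column), the operator class chosen at index creation time determines which distance operator the HNSW index accelerates:

```sql
-- Hypothetical table/column names. Each pgvector operator class serves
-- exactly one distance operator, so the index and the query must agree.
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);  -- serves <=>
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);      -- serves <->

-- This ORDER BY uses <=>, so it can be served by the cosine index above:
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, 0.3]' LIMIT 5;
```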

@pamelafox
Contributor

Actually, I have a bug in this repo: I defined the indexes using inner product. I'll fix that now!

@pamelafox pamelafox self-assigned this Jan 7, 2025
@john0isaac
Contributor Author

Got it, each index has its own operator, which you define while creating the index. Thanks!
Yeah, that's why I got confused. I think you should use vector_cosine_ops if you want to use cosine similarity.
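The mismatch being fixed here can be sketched like this (hypothetical table/column names, not the repo's actual schema):

```sql
-- Mismatched: an inner-product index does not serve cosine-distance queries,
-- so a <=> ORDER BY falls back to an exact (sequential) scan.
CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);      -- serves <#>
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, 0.3]';   -- index unused

-- Matched: operator class and query operator agree.
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);  -- serves <=>
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, 0.3]';   -- index used
```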

@pamelafox
Contributor

I think the inconsistency happened because I originally used inner product: I was only using OpenAI embedding models, and those are normalized, so inner product works just as well and is faster than cosine distance. But then I added nomic, which I think may not be normalized, so I moved to cosine. I've added a comment about all that in the PR.
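The normalization point can be checked with plain arithmetic: for unit-length vectors, inner product and cosine similarity are equal, so ranking by either gives the same order. A minimal sketch using toy 3-d vectors (not real embeddings):

```python
# Sketch: for unit-length (normalized) vectors, inner product equals
# cosine similarity, so the cheaper inner-product operator ranks identically.
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return inner(a, b) / (norm(a) * norm(b))

a = [0.6, 0.8, 0.0]   # unit length: 0.36 + 0.64 = 1
b = [0.0, 0.6, 0.8]   # unit length

# For normalized vectors the two similarities agree:
assert abs(norm(a) - 1.0) < 1e-9
assert abs(inner(a, b) - cosine(a, b)) < 1e-9

c = [1.0, 2.0, 2.0]   # norm = 3, NOT normalized
# For non-unit vectors they diverge, so cosine is the safer default:
assert abs(inner(a, c) - cosine(a, c)) > 1e-9
```

This is why cosine is the more robust choice when you can't guarantee every embedding model returns unit vectors.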
