Duplicate Items in Shared Dataset MSSPACE1B - Bug or Intended Design? #421

Open
juelinl opened this issue Jan 15, 2025 · 0 comments
juelinl commented Jan 15, 2025

Description: I noticed that the dataset shared with the SPTAG repository contains many duplicate items, that is, vectors whose values are identical in every column. It is unclear whether this is a bug or intended. The duplicates could affect performance or results when SPTAG is used for nearest neighbor search tasks.

One such example is row 258036 (0-indexed), which is replicated 33 times within the first 1M rows. See the attached images for details.

Image
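
For anyone who wants to reproduce the check, here is a minimal sketch of how such duplicates can be counted. It assumes the vectors have already been loaded into a 2-D NumPy array (one row per vector); the loading step is omitted because it depends on the file layout. The function names `count_copies_of_row` and `duplicate_groups` are hypothetical helpers, not part of SPTAG.

```python
import numpy as np


def count_copies_of_row(vectors: np.ndarray, row: int) -> int:
    """Count rows whose values equal vectors[row] in every column.

    The count includes the reference row itself.
    """
    matches = np.all(vectors == vectors[row], axis=1)
    return int(matches.sum())


def duplicate_groups(vectors: np.ndarray) -> dict:
    """Map the first index of each duplicated vector to its number of copies.

    Vectors that occur only once are left out of the result.
    """
    _, first_idx, counts = np.unique(
        vectors, axis=0, return_index=True, return_counts=True
    )
    duplicated = counts > 1
    return {int(i): int(c) for i, c in zip(first_idx[duplicated], counts[duplicated])}


if __name__ == "__main__":
    # Tiny synthetic stand-in for the real data (assumption: int8 values).
    sample = np.array(
        [[1, 2], [3, 4], [1, 2], [5, 6], [1, 2], [3, 4]], dtype=np.int8
    )
    print(count_copies_of_row(sample, 0))
    print(duplicate_groups(sample))
```

On the real data, `vectors` would be the first 1M rows and `row` would be 258036. Note that `np.unique` with `axis=0` sorts the rows, so memory use and run time grow with the size of the slice.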

Expected Behavior: Datasets used for nearest neighbor search typically do not include duplicates unless explicitly stated, as duplicates can skew performance and evaluation metrics. If the duplicates are included by design, could you please provide documentation or comments clarifying their purpose?

Thank you for taking the time to consider this!
