Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update benchmarks with PDQ Faiss results #1755

Merged
merged 3 commits into from
Feb 11, 2025

Conversation

b8zhong
Copy link
Contributor

@b8zhong b8zhong commented Feb 6, 2025

Summary

Update the benchmarks, plus benchmark_pdq_faiss_matchers.py didn't have any results in the markdown file, so I added that as well.

Test Plan

Yes

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Threshold of 255 is very unrealistic, and is voiding your benchmark! We could be even faster by just return not index.is_empty(), since 255 is the maximum possible distance.

faiss_threads : 1
dataset_size : 10000
num_queries : 1000
thresholds : [255]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very unrealistic threshold! We should probably only be testing at 31 - this is why your multihash results are so poor, since you are forcing it search every subtree instead of the ones nearby.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol yep; totally right there. I thought a high threshold enforces strict matching. Changed to 31

@@ -1,45 +1,81 @@
# pytx-vpdq
Benchmark vPDQ implementation in threatexchange library
# pytx-vPDQ
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking q: I am confused by why this is here instead of python-threatexchange/vpdq/README.md

Any ideas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure either... maybe not to clutter the docs? I can move it -- up to you

#1122

Don't think there was a specific reasoning originally here either tho

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably makes more sense in vpdq, if you want to move it in a followup will leave for you.

Comment on lines 74 to 75
else:
raise ValueError("Invalid test type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems useful to keep in, since the error will otherwise be on L76 with variable undefined. This is else is also to defend against future developers adding a new test_type and forgetting to update it here.

Copy link
Contributor Author

@b8zhong b8zhong Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kk; I thought argparse would stop before we reached here, so that's why I did it

Reverted it back tho

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would in normal operation - this else is to defend against a future developer adding it to the argparse and then forgetting to add a case here. This error is meant to save them time debugging.

@@ -1,45 +1,81 @@
# pytx-vpdq
Benchmark vPDQ implementation in threatexchange library
# pytx-vPDQ
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably makes more sense in vpdq, if you want to move it in a followup will leave for you.

Comment on lines +77 to +78
PDQFlatHashIndex - Total Time to search (s): 0.012083053588867188
PDQMultiHashIndex - Total Time to search (s): 0.01529383659362793
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is closer to what I would expect, but this might be worth digging more into later. I think we are choosing the wrong thresholds for these.

@Dcallies Dcallies merged commit 092e081 into facebook:main Feb 11, 2025
6 checks passed
@b8zhong b8zhong deleted the py-tx-benchmarks branch February 11, 2025 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants