Update benchmarks with PDQ Faiss results #1755

Merged 3 commits on Feb 11, 2025
82 changes: 59 additions & 23 deletions python-threatexchange/benchmarks/README.MD
@@ -1,45 +1,81 @@
# pytx-vpdq
Benchmark vPDQ implementation in threatexchange library
# pytx-vPDQ
Contributor:

Blocking question: I am confused about why this is here instead of python-threatexchange/vpdq/README.md.

Any ideas?

Contributor Author:

Not sure either... maybe to avoid cluttering the docs? I can move it, up to you.

#1122

I don't think there was a specific reason for it originally either, though.

Contributor:

I think it probably makes more sense in vpdq. If you want to move it in a follow-up, I will leave that for you.

Benchmark vPDQ implementation & PDQ Faiss matchers in the threatexchange library.


# Observed Performance
MacBook Pro (16-inch, 2021), Apple M1 Pro, 32 GB Ram
- Model: MacBook Air
- Memory: 16 GB
- Operating System: macOS 15.2
- Chip: Apple M2
- Core Configuration: 8 cores total

Results:

Results (vPDQ):
-------
```
python3 benchmark_vpdq_index.py brute_force -f 500 -v 20 -q 1000
% python3 benchmark_vpdq_index.py brute_force -f 500 -v 20 -q 1000
build: 0.0000s
query: 7.1112s
Per query: 7.1112ms
query: 684.5324s
Per query: 684.5324ms


% python3 benchmark_vpdq_index.py flat -f 500 -v 20 -q 1000
build: 0.0130s
query: 0.0157s
Per query: 0.0157ms
build: 0.0048s
query: 0.0051s
Per query: 0.0051ms


% python3 benchmark_vpdq_index.py signal_type -f 500 -v 2000 -q 10000
Generating data...
Generating data: 8.7703s
build: 16.8385s
query: 5.8271s
Per query: 0.5827ms
Generating data: 1.2398s
build: 3.2970s
query: 2.8439s
Per query: 0.2844ms


% python3 benchmark_vpdq_index.py flat -f 500 -v 2000 -q 10000
Generating data...
Generating data: 9.0299s
build: 1.1329s
query: 5.4447s
Per query: 0.5445ms
Generating data: 1.2237s
build: 0.4786s
query: 2.5248s
Per query: 0.2525ms


% python3 benchmark_vpdq_index.py flat -f 500 -v 2000 -q 100000
Generating data...
Generating data: 8.9390s
build: 1.1269s
Generating data: 1.2195s
build: 0.4800s
Generating queries...
Generating queries: 1.2121s
query: 56.0504s
Per query: 0.5605ms
Generating queries: 0.1017s
query: 26.0294s
Per query: 0.2603ms
```
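
A note for reading the vPDQ numbers: the `Per query` figure appears to be the total `query` time divided by the query count, converted to milliseconds. The sketch below checks that reading against two of the logged results. It assumes `-q` is the number of queries, which is inferred from the numbers rather than confirmed from the script, and `per_query_ms` is a hypothetical helper, not a function from `benchmark_vpdq_index.py`.

```python
def per_query_ms(total_query_seconds: float, num_queries: int) -> float:
    """Average latency per query in milliseconds (hypothetical helper)."""
    return total_query_seconds / num_queries * 1000.0


# Figures copied from the log above; -q is assumed to be the query count.
print(f"Per query: {per_query_ms(2.8439, 10_000):.4f}ms")    # logged as 0.2844ms
print(f"Per query: {per_query_ms(26.0294, 100_000):.4f}ms")  # logged as 0.2603ms
```

Under the same reading, the two `flat -f 500 -v 2000` runs report nearly the same per-query cost (0.2525ms and 0.2603ms) even though the query count grows tenfold, so total query time scales roughly linearly with the number of queries.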

Results (PDQ Faiss):
-------
```
% python3 benchmarks/benchmark_pdq_faiss_matchers.py --dataset-size 10000 --num-queries 1000 --thresholds 31
Benchmark: PDQ Faiss Matcher Comparison

Options:
faiss_threads : 1
dataset_size : 10000
num_queries : 1000
thresholds : [31]
seed : None

using random seed of 1739236966565067000
use --seed 1739236966565067000 to rerun with same random values

Building Stats:
PDQFlatHashIndex: time to build (s): 0.015399932861328125
PDQFlatHashIndex: approximate size: 390KB
PDQMultiHashIndex: time to build (s): 0.030457258224487305
PDQMultiHashIndex: approximate size: 1,207KB

Benchmarks for threshold: 31
PDQFlatHashIndex - Total Time to search (s): 0.012083053588867188
PDQMultiHashIndex - Total Time to search (s): 0.01529383659362793
Comment on lines +77 to +78
Contributor:

Hmm, this is closer to what I would expect, but it might be worth digging into more later. I think we are choosing the wrong thresholds for these.

PDQFlatHashIndex - Precent of targets found: 100.0
PDQMultiHashIndex - Precent of targets found: 100.0
```
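
For context on the `--thresholds 31` option and the "targets found" lines, here is a minimal sketch of threshold matching over 256-bit PDQ hashes by exhaustive Hamming-distance comparison. It is not the threatexchange or Faiss implementation: `hamming`, `flat_search`, and `flip_bits` are hypothetical helpers, and the dataset size and number of flipped bits are arbitrary choices for illustration.

```python
import random

HASH_BITS = 256  # PDQ hashes are 256 bits long


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes stored as ints."""
    return bin(a ^ b).count("1")


def flat_search(index: list, query: int, threshold: int) -> list:
    """Exhaustive scan: positions of all hashes within `threshold` bits of `query`."""
    return [i for i, h in enumerate(index) if hamming(h, query) <= threshold]


def flip_bits(h: int, n: int, rng: random.Random) -> int:
    """Return `h` with exactly `n` distinct, randomly chosen bits flipped."""
    for pos in rng.sample(range(HASH_BITS), n):
        h ^= 1 << pos
    return h


rng = random.Random(0)  # fixed seed so the sketch is repeatable
index = [rng.getrandbits(HASH_BITS) for _ in range(1000)]

target = 123
query = flip_bits(index[target], 16, rng)  # a near-duplicate, 16 bits away

matches = flat_search(index, query, threshold=31)
print(target in matches)  # True: distance 16 is within the threshold of 31
```

A query built this way always lies within the threshold of its target, so an exhaustive scan finds it every time. Whether 31 is the right threshold for comparing the two index types is the open question raised in the review comment above.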
@@ -2,7 +2,6 @@

import argparse
import binascii
import os
import time
import pickle
