feat: deduplicate shared URL downloads across test suites #338

Open
dabrain34 wants to merge 1 commit into fluendo:master from dabrain34:dab_duplication_download

Conversation

@dabrain34

@dabrain34 dabrain34 commented Feb 26, 2026

Contributor

Introduce a centralized DownloadManager that ensures each URL is downloaded at most once, eliminating duplicate downloads both across test suites and within a single test suite.

  • Add DownloadManager class in utils.py with download-once caching and centralized archive cleanup
  • Refactor TestSuite.download() to use pre-downloaded archives from the manager across all three download paths
  • Use a thread pool to download concurrently and make DownloadManager thread-safe so duplicate URLs are still fetched only once.
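
The download-once behavior described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `DownloadOnceSketch` and `fake_fetch` are hypothetical names, and the real network transfer is replaced by an injected callable so the example runs offline.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class DownloadOnceSketch:
    """Hypothetical stand-in for DownloadManager (not the PR's implementation)."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # assumed callable: URL -> local path
        self._lock = threading.Lock()
        self._futures: Dict[str, Future] = {}

    def get(self, url: str) -> str:
        # The first caller for a URL becomes the owner and performs the fetch.
        with self._lock:
            future = self._futures.get(url)
            owner = future is None
            if owner:
                future = Future()
                self._futures[url] = future
        if owner:
            try:
                future.set_result(self._fetch(url))
            except Exception as exc:
                future.set_exception(exc)
        # Every other caller blocks here until the owner has finished.
        return future.result()


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Records each real "download" so duplicates would be visible.
    calls.append(url)
    return "/tmp/" + url.rsplit("/", 1)[-1]


manager = DownloadOnceSketch(fake_fetch)
urls = ["https://example.com/a.zip"] * 4 + ["https://example.com/b.zip"]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(manager.get, urls))
```

Five requests are issued, but `fake_fetch` runs only once per distinct URL; the remaining callers wait on the in-flight result.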

This considerably speeds up downloading the AV1-ARGON* suites, which previously re-downloaded the 6GB archive for every test vector.

Fix #309

@dabrain34

Contributor Author

@ylatuya ping

@dabrain34

dabrain34 commented Apr 21, 2026

Contributor Author

@rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

Comment thread fluster/test_suite.py Outdated
return (url, local_path)

max_workers = max(1, min(jobs, len(unique_source_list)))
with ThreadPoolExecutor(max_workers=max_workers) as dl_pool:

Contributor

We use Pool from multiprocessing

Contributor Author

The idea is to move all the download-pool logic inside the download manager, which handles the deduplication, the thread lock, the extraction, and everything else needed to download through ThreadPoolExecutor.

Is there a reason you prefer multiprocessing?

Comment thread fluster/test_suite.py
@rsanchez87

Contributor

> @rsanchez87 can you have a look at this PR as well? The idea is to speed up the build of Docker images containing all the test suites.

@dabrain34, tested with `python3 fluster.py download AV1-ARGON-PROFILE0-CORE-ANNEX-B AV1-ARGON-PROFILE1-CORE-ANNEX-B AV1-ARGON-PROFILE2-CORE-ANNEX-B`:
master: 49m 40s
PR: 16m 2s (~3x faster, ZIP downloaded once instead of 3 times) ✅

Regression tests also pass ✔️

I’ll test again once the requested changes by @ylatuya are implemented. Thanks!

@dabrain34

Contributor Author

Thanks for the test. Indeed, this is even better on low-speed connections, as we don't re-download the AV1 zip file every time.

I'm currently addressing the comments from ylatuya. I will get back to you when this is ready.

@dabrain34 dabrain34 force-pushed the dab_duplication_download branch from dc31a68 to a3b3d5f Compare May 22, 2026 13:27
@dabrain34 dabrain34 marked this pull request as draft May 22, 2026 13:46
Introduce a centralized DownloadManager so each URL is downloaded at
most once, both within and across selected suites. Saves re-fetching
multi-GB archives like AV1-ARGON shared by 12 suites.

DownloadManager (fluster/utils.py):
- Thread-safe per-URL caching at resources/.cache/; concurrent get()
  calls on the same URL block on the in-flight download.
- BoundedSemaphore caps HTTP concurrency at 8.
- Per-URL retry budget; ChecksumMismatchError poisons immediately.
- invalidate(url) lets consumers drop a corrupt cached archive.
- Context manager: cleanup() runs via __exit__, honoring keep_file.
- filename_from_url() strips query strings for safe on-disk names.
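
As a rough illustration of three of the points above (query-string stripping, the concurrency cap of 8, and the retry budget with an immediate failure on checksum mismatch), here is a hedged sketch. The function bodies, the `_sketch` suffix, `fetch_with_budget`, the `retries=3` default, and the choice of `OSError` as the retryable error are all assumptions for the example; only the names `ChecksumMismatchError` and `filename_from_url` and the cap of 8 come from the commit message.

```python
import os
import threading
from typing import Callable, Optional
from urllib.parse import unquote, urlparse


class ChecksumMismatchError(Exception):
    """Name taken from the commit message; this definition is an assumption."""


def filename_from_url_sketch(url: str) -> str:
    # urlparse() separates the path from the query and fragment,
    # so only the last path component ends up in the file name.
    name = os.path.basename(unquote(urlparse(url).path))
    if not name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    return name


# At most 8 fetches may run at once, mirroring the cap in the commit message.
HTTP_SLOTS = threading.BoundedSemaphore(8)


def fetch_with_budget(url: str, fetch: Callable[[str], str], retries: int = 3) -> str:
    last_error: Optional[Exception] = None
    for _ in range(retries):
        try:
            with HTTP_SLOTS:
                return fetch(url)
        except ChecksumMismatchError:
            raise  # a corrupt archive is not retried
        except OSError as exc:
            last_error = exc  # transient error: consume one unit of the budget
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


attempts = {"count": 0}


def flaky_fetch(url: str) -> str:
    # Simulated transfer: fails twice, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("simulated network error")
    return filename_from_url_sketch(url)


result = fetch_with_budget("https://example.com/dir/argon.zip?token=abc", flaky_fetch)
```

Here `result` is the on-disk name with the `?token=abc` query removed, obtained on the third attempt.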

TestSuite.download() (fluster/test_suite.py):
- Requires a DownloadManager (keyword-only). All three download paths
  consume pre-downloaded archives.
- Multi-TV branch pre-downloads unique URLs in parallel before the
  multiprocessing extraction pool.
- Raw source files are moved out of the cache (no double storage).
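
The multi-TV pre-download step can be pictured with the sketch below, which reuses the `max(1, min(jobs, len(...)))` worker sizing from the review snippet. `predownload_unique` and `fake_fetch` are hypothetical names, and the fetch callable stands in for the real manager lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def predownload_unique(
    urls: Iterable[str], fetch: Callable[[str], str], jobs: int
) -> Dict[str, str]:
    # dict.fromkeys() removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(urls))
    if not unique:
        return {}
    # Never start more workers than there are URLs, and always at least one.
    max_workers = max(1, min(jobs, len(unique)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(unique, pool.map(fetch, unique)))


fetched: List[str] = []


def fake_fetch(url: str) -> str:
    # Stand-in for the real download; records which URLs were requested.
    fetched.append(url)
    return "cache/" + url.rsplit("/", 1)[-1]


# Three test vectors share one archive; a fourth has its own source.
vector_urls = [
    "https://example.com/argon.zip",
    "https://example.com/argon.zip",
    "https://example.com/argon.zip",
    "https://example.com/other.zip",
]
local_paths = predownload_unique(vector_urls, fake_fetch, jobs=4)
```

The extraction pool would then look up each test vector's archive in `local_paths` rather than downloading it again.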

CLI (fluster/fluster.py):
- Three-phase: collect URLs across selected suites, parallel
  pre-download, per-suite extraction. Cross-suite parallelism is the
  main user-visible win.
- All callers (CLI + 7 scripts/gen_*.py) use the with-statement form.
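
A simplified picture of the three-phase flow and the with-statement form follows. `ManagerSketch` and `download_suites` are hypothetical, the "download" just writes a placeholder file, and phase 2 is sequential here for brevity, whereas the commit message describes a parallel pre-download.

```python
import os
import shutil
import tempfile
from typing import Dict, List, Tuple


class ManagerSketch:
    """Hypothetical context manager standing in for DownloadManager."""

    def __init__(self, cache_dir: str, keep_file: bool = False) -> None:
        self.cache_dir = cache_dir
        self.keep_file = keep_file
        self.fetched: List[str] = []  # URLs that triggered a real "download"
        self._paths: Dict[str, str] = {}

    def __enter__(self) -> "ManagerSketch":
        os.makedirs(self.cache_dir, exist_ok=True)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.cleanup()

    def cleanup(self) -> None:
        # Cached archives are removed unless the caller asked to keep them.
        if not self.keep_file:
            shutil.rmtree(self.cache_dir, ignore_errors=True)

    def get(self, url: str) -> str:
        if url not in self._paths:
            path = os.path.join(self.cache_dir, url.rsplit("/", 1)[-1])
            with open(path, "wb") as handle:
                handle.write(b"placeholder")  # stands in for the archive bytes
            self.fetched.append(url)
            self._paths[url] = path
        return self._paths[url]


def download_suites(
    suites: Dict[str, List[str]], cache_dir: str
) -> Tuple[Dict[str, List[str]], List[str]]:
    with ManagerSketch(cache_dir) as manager:
        # Phase 1: collect the unique URLs across all selected suites.
        all_urls = list(dict.fromkeys(u for urls in suites.values() for u in urls))
        # Phase 2: pre-download each URL once.
        for url in all_urls:
            manager.get(url)
        # Phase 3: each suite consumes the already-cached archives.
        per_suite = {
            name: [os.path.basename(manager.get(u)) for u in urls]
            for name, urls in suites.items()
        }
        return per_suite, list(manager.fetched)


ARGON = "https://example.com/argon.zip"
EXTRA = "https://example.com/extra.zip"
suites = {"SUITE-A": [ARGON], "SUITE-B": [ARGON], "SUITE-C": [ARGON, EXTRA]}
cache_dir = os.path.join(tempfile.mkdtemp(), ".cache")
per_suite, fetched = download_suites(suites, cache_dir)
```

The shared archive is fetched once even though three suites reference it, and the cache directory is gone once the `with` block exits because `keep_file` defaults to `False` in this sketch.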
@dabrain34 dabrain34 force-pushed the dab_duplication_download branch from a3b3d5f to 2bfdac4 Compare May 25, 2026 11:58
@dabrain34 dabrain34 marked this pull request as ready for review May 27, 2026 04:17

Development

Successfully merging this pull request may close these issues.

Downloading the AV1 test suites results in downloading multiple times a 6GB archive

3 participants