Improve registry-based archive generation reliability and performance#240
Improve registry-based archive generation reliability and performance#240nutjob4life merged 6 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR enhances the reliability and performance of registry-based deep archive generation by addressing three critical issues: unreliable collection discovery, transient API failures, and excessive API payload sizes. The changes implement a more robust collection discovery mechanism using bundle properties, add automatic retry logic with exponential backoff for transient errors, and optimize API requests to fetch only required fields.
Key changes:
- Replace
/membersendpoint withref_lidvid_collectionproperty parsing for collection discovery - Add retry mechanism with exponential backoff for HTTP 500/502/503/504 errors
- Optimize API payload size by requesting only necessary fields (99% reduction)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| url = f"{server_url}/products/{lidvid}/members" | ||
| params = {"sort": _searchkey, "limit": _apiquerylimit} | ||
| # Request only the fields we need to minimize payload size | ||
| params = {"sort": _searchkey, "limit": _apiquerylimit, "fields": ",".join(_fields)} |
There was a problem hiding this comment.
The fields parameter is added to limit the response size, but there's no validation that the API actually supports this parameter or handles it correctly. If the API ignores unrecognized parameters, the optimization would silently fail. Consider adding validation or documentation noting the API version requirement for the fields parameter.
There was a problem hiding this comment.
@copilot I think hardcoding in API parameters is fine. we own that API
Workaround for #236 where the /products/{bundle-lidvid}/members endpoint has proven brittle. Instead, parse ref_lidvid_collection directly from the bundle properties to discover collections. This approach is equally accurate since it uses the exact collection LIDVIDs specified in the bundle metadata, but avoids the reliability issues with the /members endpoint at the bundle level. The /members endpoint is still used for discovering products within each collection.
Fixes #237 by implementing automatic retry with exponential backoff for all PDS API requests. This addresses transient API/database performance issues that result in 500 errors and JSON decode failures. Changes: - Added _get_session_with_retry() to create sessions with retry logic - Configured to retry on HTTP 500, 502, 503, 504 status codes - Uses exponential backoff (2s, 4s, 8s, 16s, 32s) over 5 attempts - Enhanced error handling to catch JSONDecodeError exceptions - Added detailed logging for debugging API failures The retry mechanism uses urllib3.Retry with requests.HTTPAdapter to automatically handle transient failures without code changes in calling functions.
Fixes #238 by adding fields parameter to /members endpoint requests to minimize payload size and improve performance. Deep archive only needs 5 specific fields from product responses: - ops:Data_File_Info.ops:file_ref (data file URLs) - ops:Data_File_Info.ops:md5_checksum (data file checksums) - ops:Label_File_Info.ops:file_ref (label file URL) - ops:Label_File_Info.ops:md5_checksum (label file checksum) - ops:Harvest_Info.ops:harvest_date_time (for pagination) Changes: - Updated _fields constant to include harvest_date_time for pagination - Modified _getproducts() to pass fields parameter to API requests - Reduced API response payload by filtering unnecessary metadata Benefits: - Smaller JSON payloads reduce network bandwidth - Faster API responses with less data serialization - Better performance for collections with many products - Reduced load on API server and database
c52b51d to
0b025ef
Compare
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Updated retry status codes to use HTTPStatus enum constants for better code readability and maintainability: - 500 → HTTPStatus.INTERNAL_SERVER_ERROR - 502 → HTTPStatus.BAD_GATEWAY - 503 → HTTPStatus.SERVICE_UNAVAILABLE - 504 → HTTPStatus.GATEWAY_TIMEOUT Addresses code review feedback.
Addresses PR feedback with two improvements: 1. Clarified _searchkey field requirement in _fields constant - Added comment explaining it must be included for pagination - Required by line 244 for search-after parameter access 2. Implemented batch collection fetching to reduce API calls - New _getcollections_batch() function fetches collections in batches - Uses query syntax: (lidvid eq "..." or lidvid eq "..." or ...) - Batches collections into groups of 50 (API limit) - Reduces API calls dramatically (e.g., 8 collections: 8→1 requests) - Updated _comprehendregistry() to use batch fetching Benefits: - Fewer API requests reduces server load - Better performance for bundles with many collections - Stays within API query limits - Maintains same functionality with better efficiency
|
@jordanpadams I've opened a new pull request, #242, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@jordanpadams I've opened a new pull request, #243, to work on those changes. Once the pull request is ready, I'll request review from you. |
Updated with Code Review FeedbackAdditional ImprovementsCommit fce720b - Use HTTPStatus constants:
Commit 4b55407 - Batch collection fetching:
Performance ImpactBefore batch fetching:
After batch fetching:
Combined with the field optimization (11MB → 110KB payload), these changes significantly improve both performance and reliability of registry-based archive generation. |
🗒️ Summary
This PR significantly improves the reliability and performance of registry-based deep archive generation through three key enhancements:
ref_lidvid_collectionfrom bundle properties instead of the brittle/membersendpointChanges Made
Fix #236 - Use ref_lidvid_collection for collection discovery:
ref_lidvid_collectiondirectly from bundle properties to discover collections/products/{bundle-lidvid}/membersendpointFix #237 - Add retry logic with exponential backoff:
Fix #238 - Optimize API payload size:
fieldsparameter to API requests to fetch only required fields⚙️ Test Data and/or Report
Automated Tests
✅ All tests pass:
✅ Pre-commit hooks: All passed during push
Manual Testing
Successfully tested with the New Horizons SWAP bundle referenced in #236:
Results:
ref_lidvid_collectionDuring testing, encountered the transient registry errors that led to the retry implementation (#237). The retry logic successfully handled these failures with exponential backoff.
Performance Metrics
/membersendpointref_lidvid_collection♻️ Related Issues
🤓 Reviewer Checklist
Reviewers: Please verify the following before approving this pull request.
Security & Quality
Testing & Validation
Documentation
Maintenance