feat: add fast retry mechanism and parallelize container management #48
Merged
- Added functions to run worker operations in parallel, improving efficiency for starting, stopping, cleaning up, and verifying worker containers.
- Updated existing worker management functions to utilize parallel execution, enhancing performance during bulk operations.
- Introduced internal wrappers for parallel execution of various worker tasks, including deployment, environment variable updates, and log saving.
- Introduced a new `fast-retry` feature that enables quicker retries without log saving or changes to `CHUNK_SIZE`, allowing for faster recovery from proving failures.
- Added a `proving_timeout_seconds` configuration option to set custom timeout durations for proving operations, defaulting to 30 seconds for fast-retry and 120 seconds otherwise.
- Updated the proving client logic to handle the new timeout settings and retry mechanisms, improving overall efficiency and error handling during proving operations.
- Enhanced Docker control scripts to support the new fast-retry functionality.
- Improved the `run_parallel_workers` and `run_parallel_all` functions to include better error handling and logging for worker operations.
- Added checks for successful creation of temporary directories and files, ensuring robust execution.
- Updated output handling to include messages for workers with no output, enhancing clarity in logs.
- Exported additional internal worker functions for better modularity and reusability across scripts.
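The parallel helper described in these commits can be pictured roughly as follows. This is a minimal sketch and not the code from `docker-common.sh`: the function name matches the PR, but the signature (an operation name followed by a list of workers), the `[worker]` output prefix, and the failure message are assumptions made for illustration. It requires bash, since it uses arrays.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the real helper in docker-common.sh may differ.

# run_parallel_workers <function> <worker>...
# Runs "<function> <worker>" in the background for every worker, captures each
# worker's output in its own temp file, then prints the outputs in worker order.
# Returns 0 only if every worker call succeeded.
run_parallel_workers() {
    local fn="$1"; shift
    local tmpdir worker line i status=0
    local -a pids=() workers=("$@")

    # Check that the temporary directory was actually created.
    tmpdir="$(mktemp -d)" || { echo "ERROR: could not create temp dir" >&2; return 1; }

    # Launch one background job per worker.
    for i in "${!workers[@]}"; do
        "$fn" "${workers[$i]}" >"$tmpdir/$i.out" 2>&1 &
        pids[$i]=$!
    done

    # Collect results in launch order so the log stays readable.
    for i in "${!workers[@]}"; do
        worker="${workers[$i]}"
        if ! wait "${pids[$i]}"; then
            echo "[$worker] FAILED" >&2
            status=1
        fi
        if [ -s "$tmpdir/$i.out" ]; then
            while IFS= read -r line || [ -n "$line" ]; do
                printf '[%s] %s\n' "$worker" "$line"
            done <"$tmpdir/$i.out"
        else
            # Make silent workers visible in the log.
            echo "[$worker] (no output)"
        fi
    done

    rm -rf "$tmpdir"
    return "$status"
}
```

With a hypothetical operation such as `stop_one() { docker stop "prover-$1"; }`, a call like `run_parallel_workers stop_one w1 w2 w3` would stop all three containers concurrently and report per-worker results afterwards. Writing each worker's output to a separate file avoids interleaved lines from concurrent jobs.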
eason1981 approved these changes on Dec 11, 2025
Overview
This PR adds a faster retry mechanism for running ethproofs and improves Docker container management performance through parallelization.
Changes
- Fast retry mechanism: Added `docker-multi-fast-retry.sh` script that skips log saving and preserves `CHUNK_SIZE` for faster recovery. Integrated into the Rust proving client via the `fast-retry` feature flag with a reduced default timeout (30s vs 120s).
- Concurrent container management: Modified `docker-common.sh` to execute container operations (stop, start, cleanup, status, logs, env updates) in parallel across multiple machines using the `run_parallel_workers()` and `run_parallel_all()` helpers.
- Configurable retry timeout: Added `PROVING_TIMEOUT_SECONDS` environment variable support (default: 30s with fast-retry, 120s otherwise).
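The timeout default rule can be illustrated in shell as below. In the PR this logic lives in the Rust proving client behind the `fast-retry` feature flag, so this is only a sketch of the rule. The function name `resolve_proving_timeout`, its `true`/`false` argument, and the fallback to the default on a non-numeric value are assumptions, not behavior confirmed by the PR.

```bash
# Hypothetical sketch of the timeout rule; the real logic is in the Rust client.

# resolve_proving_timeout <fast_retry: true|false>
# Prints the timeout in seconds: PROVING_TIMEOUT_SECONDS if set, otherwise
# 30 when fast retry is enabled and 120 when it is not.
resolve_proving_timeout() {
    local fast_retry="${1:-false}"
    local default=120
    if [ "$fast_retry" = "true" ]; then
        default=30
    fi

    local value="${PROVING_TIMEOUT_SECONDS:-$default}"

    # Assumption: a value that is not a plain non-negative integer
    # falls back to the default.
    case "$value" in
        ''|*[!0-9]*) value="$default" ;;
    esac

    echo "$value"
}
```

For example, `PROVING_TIMEOUT_SECONDS=45 resolve_proving_timeout true` prints `45`, since an explicit setting overrides either default.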