Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: fixed broken swe-bench docker image generation. #1262

Merged
merged 4 commits into from
Feb 11, 2025

Conversation

alt-glitch
Copy link
Collaborator

@alt-glitch alt-glitch commented Feb 7, 2025

  • created new images with updated composio sdk
  • fixes: 1155

Important

Refactor Docker image generation for swe-bench with updated Composio SDK, improved logging, and error handling.

  • Docker Image Generation:
    • Refactor build.py to add logging for successful builds and handle stop signals for graceful termination.
    • Add functions load_successful_builds() and record_success() to track successful builds.
    • Modify _base() and _swes() to skip already built images.
  • Error Handling:
    • Add error handling in get_llm_response() in benchmark.py for Claude model.
    • Improve error logging in create_index.py.
  • Miscellaneous:
    • Rename BedrockChat to ChatBedrock in benchmark.py.
    • Update tox.ini to exclude certain directories from linting and type checking.

This description was created by Ellipsis for c2faf5d. It will automatically update as commits are pushed.

- created new images with updated composio sdk
- fixes: 1155
Copy link

vercel bot commented Feb 7, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
composio ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 7, 2025 2:31pm

Comment on lines +67 to +68
futures.append(executor.submit(_build, file, tag_part, multi))
[fut.result() for fut in futures]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The futures.append() and fut.result() operations are not protected against stop_requested, allowing in-progress builds to continue after stop signal. Should check stop_requested before fut.result().

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
futures.append(executor.submit(_build, file, tag_part, multi))
[fut.result() for fut in futures]
futures.append(executor.submit(_build, file, tag_part, multi))
[fut.result() for fut in futures if not stop_requested]

Comment on lines 135 to 140

if process.returncode == 0:
print(f"Finished build for {tag}")
record_success(tag)
else:
print(f"Error building {tag} - {logs / log}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling for failed builds - successful builds are tracked but failures are silently ignored. Should maintain error state and handle failed builds appropriately.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
if process.returncode == 0:
print(f"Finished build for {tag}")
record_success(tag)
else:
print(f"Error building {tag} - {logs / log}")
if process.returncode == 0:
print(f"Finished build for {tag}")
record_success(tag)
else:
error_msg = f"Error building {tag} - {logs / log}"
print(error_msg)
record_failure(tag, error_msg)
raise BuildError(error_msg)

Comment on lines 122 to 124
if not DEEPLAKE_PATH.exists():
shutil.copytree(
f"{Path.home()}/.composio/tmp/{outname}/deeplake",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition when checking and creating directories. Use mkdir(parents=True, exist_ok=True) instead of separate check and create.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
if not DEEPLAKE_PATH.exists():
shutil.copytree(
f"{Path.home()}/.composio/tmp/{outname}/deeplake",
DEEPLAKE_PATH.mkdir(parents=True, exist_ok=True)
shutil.copytree(
f"{Path.home()}/.composio/tmp/{outname}/deeplake",

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to c2faf5d in 2 minutes and 8 seconds

More details
  • Looked at 387 lines of code in 4 files
  • Skipped 0 files when reviewing.
  • Skipped posting 11 drafted comments based on config settings.
1. python/swe/agent/benchmark.py:10
  • Draft comment:
    Updated import to ChatBedrock looks correct; ensure SDK change is fully compatible with client.invoke usage.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the PR author to ensure compatibility, which violates the rule against asking for confirmation or testing. It doesn't provide a specific suggestion or point out a specific issue.
2. python/swe/dockerfiles/build.py:40
  • Draft comment:
    Graceful shutdown support with signal handlers is added; review that stop_requested flag usage is thread‐safe as needed.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    In Python, reading/writing a boolean is atomic. The flag is only ever set from False to True once, and only read after that. Even without explicit synchronization, this pattern is actually thread-safe. The worst case is a thread might see the old value (False) for a short time, which just means it would do one more iteration - not a correctness issue.
    I could be wrong about Python's guarantees around boolean operations being atomic. There could be subtle race conditions I'm not considering.
    Even if there are subtle race conditions, they would only result in one extra iteration of work, not corruption or crashes. The graceful shutdown is still graceful enough for this use case.
    The comment raises a valid concern but the current implementation is actually good enough for this use case. The comment should be removed as it would lead to unnecessary work.
3. python/swe/dockerfiles/build.py:59
  • Draft comment:
    Filename splitting may fail on unexpected formats; good that ValueError is caught, but consider logging full error context.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
    The code already handles the error case appropriately by catching ValueError and printing the problematic filename. The error message is clear and provides the key information needed to debug the issue. Additional error context would be redundant since the ValueError would only occur if the filename doesn't contain a period, which is already clear from the error message.
    Perhaps there could be value in logging the actual ValueError message in addition to the filename. The current error handling doesn't preserve the original exception details.
    The original exception would just indicate that the string couldn't be split, which adds no meaningful information beyond what we already communicate by showing the invalid filename.
    The comment should be deleted since the existing error handling is already appropriate and provides sufficient context through the filename.
4. python/swe/dockerfiles/build.py:127
  • Draft comment:
    Ensure the _build function correctly chooses between multi and single build args; consider capturing stdout for debugging errors.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. python/swe/dockerfiles/create_index.py:105
  • Draft comment:
    Opening fqdn_cache.json assumes the file exists; consider adding error handling or existence check.
  • Reason this comment was not posted:
    Comment was on unchanged code.
6. python/swe/dockerfiles/create_index.py:115
  • Draft comment:
    The 'docker_outdir' directory is used for writing files but not ensured to exist; call mkdir to avoid potential errors.
  • Reason this comment was not posted:
    Comment was on unchanged code.
7. python/tox.ini:10
  • Draft comment:
    Tox configuration updates for isort and black exclude paths look correct; ensure these patterns match the intended directories.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    This comment is asking the PR author to ensure that the exclude patterns match the intended directories, which is against the rules of asking the author to confirm or ensure something. It doesn't provide a specific suggestion or point out a specific issue.
8. python/swe/agent/benchmark.py:10
  • Draft comment:
    Refactored import: Ensure that switching from BedrockChat to ChatBedrock preserves API compatibility.
  • Reason this comment was not posted:
    Marked as duplicate.
9. python/swe/dockerfiles/create_index.py:116
  • Draft comment:
    Ensure the output directory exists before writing files; add mkdir with parents=True for 'docker_outdir'.
  • Reason this comment was not posted:
    Marked as duplicate.
10. python/swe/dockerfiles/create_index.py:53
  • Draft comment:
    Review directory usage: 'docker_outdir' is hard-coded to 'generated', which differs from the provided 'outdir' in IndexGenerator. Consider unifying the output directories.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    The code appears to intentionally use two directories for different purposes - outdir for git/temporary operations and docker_outdir for final outputs. This separation of concerns seems deliberate. The comment suggests unifying them but doesn't make a strong case for why this would be better. Having separate directories for intermediate vs final outputs is a common and valid pattern.
    I could be wrong about the intentionality - maybe this is actually technical debt that should be cleaned up. The separate directories could cause confusion.
    While separate directories could potentially cause confusion, the code is clear about the purpose of each directory through its usage patterns. The separation provides useful isolation between temporary and final outputs.
    This appears to be an intentional design choice rather than a problem that needs fixing. The comment should be removed as it's suggesting a change without clear benefits.
11. python/tox.ini:10
  • Draft comment:
    Verify that the updated exclude/skip patterns for isort, black, flake8, and mypy correctly match your directory structure.
  • Reason this comment was not posted:
    Marked as duplicate.

Workflow ID: wflow_vuBKP3T6X7N9VMH8


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

print("\n".join(errors))
return
print("==== Successful Builds (after pyenv) ====")
print("\n".join(successful_builds))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider sorting the 'successful_builds' set before printing for reproducible output.



# Load successful builds from logs/success.log into the global set
def load_successful_builds() -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding error handling for file operations in load_successful_builds() to handle potential IO errors gracefully:

def load_successful_builds() -> None:
    try:
        success_file = logs / "success.log"
        if success_file.exists():
            with open(success_file, "r", encoding="utf-8") as f:
                for line in f:
                    tag = line.strip()
                    if tag:
                        successful_builds.add(tag)
    except Exception as e:
        print(f"Error loading successful builds: {e}")


# Load successful builds from logs/success.log into the global set
def load_successful_builds() -> None:
success_file = logs / "success.log"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using constants for frequently used strings and paths:

SUCCESS_LOG_FILE = "success.log"
DOCKER_IMAGE_PREFIX = "composio/swe"

# Then use them like:
success_file = logs / SUCCESS_LOG_FILE
full_tag = f"{DOCKER_IMAGE_PREFIX}:{tag_part}"

This makes the code more maintainable and reduces the risk of typos.

stop_requested = False


def handle_stop(signum, frame):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding proper logging instead of print statements for better observability and debugging:

def handle_stop(signum, frame):
    global stop_requested
    logger.info("Received stop signal. Gracefully stopping new builds...")
    stop_requested = True



# Record a successful build by writing to the log file and updating the global set
def record_success(tag: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding docstrings to the new functions for better code documentation:

def record_success(tag: str) -> None:
    """Record a successful build by writing to the success log file and updating the global set.
    
    Args:
        tag: The docker image tag that was successfully built
        
    Thread-safe: Uses success_lock for synchronization
    """
    with success_lock:
        successful_builds.add(tag)
        with open(logs / "success.log", "a", encoding="utf-8") as f:
            f.write(tag + "\n")


# Record a successful build by writing to the log file and updating the global set
def record_success(tag: str) -> None:
with success_lock:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider validating the docker tag format before recording success to prevent invalid entries:

def is_valid_docker_tag(tag: str) -> bool:
    """Validate docker tag format according to Docker's rules."""
    # Basic validation - can be expanded based on needs
    return bool(re.match(r'^[a-z0-9][a-z0-9._-]*$', tag))

def record_success(tag: str) -> None:
    if not is_valid_docker_tag(tag):
        logger.warning(f"Invalid docker tag format: {tag}")
        return
    with success_lock:
        successful_builds.add(tag)
        with open(logs / "success.log", "a", encoding="utf-8") as f:
            f.write(tag + "\n")

@shreysingla11
Copy link
Collaborator

Code Review Summary

Overall, the changes look good and significantly improve the docker build system's reliability and maintainability. Here's a breakdown:

Strengths:

✅ Added graceful shutdown handling with signal handlers
✅ Implemented build tracking with success logging
✅ Added skip mechanism for already built images
✅ Improved error handling and logging
✅ Better synchronization with threading locks
✅ Updated API imports and model IDs

Suggestions for Improvement:

  1. Add error handling for file operations
  2. Use constants for frequently used strings
  3. Use proper logging instead of print statements
  4. Add docstrings for new functions
  5. Add validation for docker tags

Code Quality: 8/10

The code is well-structured and introduces important improvements to the build system. The suggestions above would make it even better, but they're not critical for functionality.

Testing:

Consider adding:

  • Unit tests for load_successful_builds()
  • Tests for graceful shutdown
  • Tests for concurrent builds
  • Tests for error scenarios

The PR successfully addresses the broken swe-bench docker image generation issue while adding several quality-of-life improvements to the build system.

@angrybayblade angrybayblade merged commit 43bc5b0 into master Feb 11, 2025
23 checks passed
@angrybayblade angrybayblade deleted the fix/swe-benchmark branch February 11, 2025 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants