refactor: fixed broken swe-bench docker image generation. #1262

alt-glitch · 2025-02-07T13:01:51Z

created new images with updated composio sdk
fixes: 1155

Important

Refactor Docker image generation for swe-bench with updated Composio SDK, improved logging, and error handling.

Docker Image Generation:
- Refactor build.py to add logging for successful builds and handle stop signals for graceful termination.
- Add functions load_successful_builds() and record_success() to track successful builds.
- Modify _base() and _swes() to skip already built images.
Error Handling:
- Add error handling in get_llm_response() in benchmark.py for Claude model.
- Improve error logging in create_index.py.
Miscellaneous:
- Rename BedrockChat to ChatBedrock in benchmark.py.
- Update tox.ini to exclude certain directories from linting and type checking.

^{This description was created by}^{for c2faf5d. It will automatically update as commits are pushed.}

- created new images with updated composio sdk - fixes: 1155

vercel · 2025-02-07T13:01:57Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
composio	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Feb 7, 2025 2:31pm

entelligence-ai-pr-reviews · 2025-02-07T13:02:39Z

python/swe/dockerfiles/build.py

+            futures.append(executor.submit(_build, file, tag_part, multi))
+        [fut.result() for fut in futures]


The futures.append() and fut.result() operations are not protected against stop_requested, allowing in-progress builds to continue after stop signal. Should check stop_requested before fut.result().

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

futures.append(executor.submit(_build, file, tag_part, multi))

[fut.result() for fut in futures]

futures.append(executor.submit(_build, file, tag_part, multi))

[fut.result() for fut in futures if not stop_requested]

entelligence-ai-pr-reviews · 2025-02-07T13:02:40Z

python/swe/dockerfiles/build.py


    if process.returncode == 0:
        print(f"Finished build for {tag}")
+        record_success(tag)
    else:
        print(f"Error building {tag} - {logs / log}")


Missing error handling for failed builds - successful builds are tracked but failures are silently ignored. Should maintain error state and handle failed builds appropriately.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

if process.returncode == 0:

print(f"Finished build for {tag}")

record_success(tag)

else:

print(f"Error building {tag} - {logs / log}")

if process.returncode == 0:

print(f"Finished build for {tag}")

record_success(tag)

else:

error_msg = f"Error building {tag} - {logs / log}"

print(error_msg)

record_failure(tag, error_msg)

raise BuildError(error_msg)

entelligence-ai-pr-reviews · 2025-02-07T13:02:41Z

python/swe/dockerfiles/create_index.py

        if not DEEPLAKE_PATH.exists():
            shutil.copytree(
                f"{Path.home()}/.composio/tmp/{outname}/deeplake",


Race condition when checking and creating directories. Use mkdir(parents=True, exist_ok=True) instead of separate check and create.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

if not DEEPLAKE_PATH.exists():

shutil.copytree(

f"{Path.home()}/.composio/tmp/{outname}/deeplake",

DEEPLAKE_PATH.mkdir(parents=True, exist_ok=True)

shutil.copytree(

f"{Path.home()}/.composio/tmp/{outname}/deeplake",

ellipsis-dev

❌ Changes requested. Reviewed everything up to c2faf5d in 2 minutes and 8 seconds

More details

Looked at 387 lines of code in 4 files
Skipped 0 files when reviewing.
Skipped posting 11 drafted comments based on config settings.

1. python/swe/agent/benchmark.py:10

Draft comment:
Updated import to ChatBedrock looks correct; ensure SDK change is fully compatible with client.invoke usage.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
The comment is asking the PR author to ensure compatibility, which violates the rule against asking for confirmation or testing. It doesn't provide a specific suggestion or point out a specific issue.

2. python/swe/dockerfiles/build.py:40

Draft comment:
Graceful shutdown support with signal handlers is added; review that stop_requested flag usage is thread‐safe as needed.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
In Python, reading/writing a boolean is atomic. The flag is only ever set from False to True once, and only read after that. Even without explicit synchronization, this pattern is actually thread-safe. The worst case is a thread might see the old value (False) for a short time, which just means it would do one more iteration - not a correctness issue.
I could be wrong about Python's guarantees around boolean operations being atomic. There could be subtle race conditions I'm not considering.
Even if there are subtle race conditions, they would only result in one extra iteration of work, not corruption or crashes. The graceful shutdown is still graceful enough for this use case.
The comment raises a valid concern but the current implementation is actually good enough for this use case. The comment should be removed as it would lead to unnecessary work.

3. python/swe/dockerfiles/build.py:59

Draft comment:
Filename splitting may fail on unexpected formats; good that ValueError is caught, but consider logging full error context.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
The code already handles the error case appropriately by catching ValueError and printing the problematic filename. The error message is clear and provides the key information needed to debug the issue. Additional error context would be redundant since the ValueError would only occur if the filename doesn't contain a period, which is already clear from the error message.
Perhaps there could be value in logging the actual ValueError message in addition to the filename. The current error handling doesn't preserve the original exception details.
The original exception would just indicate that the string couldn't be split, which adds no meaningful information beyond what we already communicate by showing the invalid filename.
The comment should be deleted since the existing error handling is already appropriate and provides sufficient context through the filename.

4. python/swe/dockerfiles/build.py:127

Draft comment:
Ensure the _build function correctly chooses between multi and single build args; consider capturing stdout for debugging errors.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

5. python/swe/dockerfiles/create_index.py:105

Draft comment:
Opening fqdn_cache.json assumes the file exists; consider adding error handling or existence check.
Reason this comment was not posted:
Comment was on unchanged code.

6. python/swe/dockerfiles/create_index.py:115

Draft comment:
The 'docker_outdir' directory is used for writing files but not ensured to exist; call mkdir to avoid potential errors.
Reason this comment was not posted:
Comment was on unchanged code.

7. python/tox.ini:10

Draft comment:
Tox configuration updates for isort and black exclude paths look correct; ensure these patterns match the intended directories.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
This comment is asking the PR author to ensure that the exclude patterns match the intended directories, which is against the rules of asking the author to confirm or ensure something. It doesn't provide a specific suggestion or point out a specific issue.

8. python/swe/agent/benchmark.py:10

Draft comment:
Refactored import: Ensure that switching from BedrockChat to ChatBedrock preserves API compatibility.
Reason this comment was not posted:
Marked as duplicate.

9. python/swe/dockerfiles/create_index.py:116

Draft comment:
Ensure the output directory exists before writing files; add mkdir with parents=True for 'docker_outdir'.
Reason this comment was not posted:
Marked as duplicate.

10. python/swe/dockerfiles/create_index.py:53

Draft comment:
Review directory usage: 'docker_outdir' is hard-coded to 'generated', which differs from the provided 'outdir' in IndexGenerator. Consider unifying the output directories.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The code appears to intentionally use two directories for different purposes - outdir for git/temporary operations and docker_outdir for final outputs. This separation of concerns seems deliberate. The comment suggests unifying them but doesn't make a strong case for why this would be better. Having separate directories for intermediate vs final outputs is a common and valid pattern.
I could be wrong about the intentionality - maybe this is actually technical debt that should be cleaned up. The separate directories could cause confusion.
While separate directories could potentially cause confusion, the code is clear about the purpose of each directory through its usage patterns. The separation provides useful isolation between temporary and final outputs.
This appears to be an intentional design choice rather than a problem that needs fixing. The comment should be removed as it's suggesting a change without clear benefits.

11. python/tox.ini:10

Draft comment:
Verify that the updated exclude/skip patterns for isort, black, flake8, and mypy correctly match your directory structure.
Reason this comment was not posted:
Marked as duplicate.

Workflow ID: wflow_vuBKP3T6X7N9VMH8

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2025-02-07T13:04:11Z

python/swe/dockerfiles/build.py

-        print("\n".join(errors))
-        return
+    print("==== Successful Builds (after pyenv) ====")
+    print("\n".join(successful_builds))


Consider sorting the 'successful_builds' set before printing for reproducible output.

shreysingla11 · 2025-02-07T13:05:12Z

python/swe/dockerfiles/build.py

+
+
+# Load successful builds from logs/success.log into the global set
+def load_successful_builds() -> None:


Consider adding error handling for file operations in load_successful_builds() to handle potential IO errors gracefully:

def load_successful_builds() -> None: try: success_file = logs / "success.log" if success_file.exists(): with open(success_file, "r", encoding="utf-8") as f: for line in f: tag = line.strip() if tag: successful_builds.add(tag) except Exception as e: print(f"Error loading successful builds: {e}")

shreysingla11 · 2025-02-07T13:05:29Z

python/swe/dockerfiles/build.py

+
+# Load successful builds from logs/success.log into the global set
+def load_successful_builds() -> None:
+    success_file = logs / "success.log"


Consider using constants for frequently used strings and paths:

SUCCESS_LOG_FILE = "success.log" DOCKER_IMAGE_PREFIX = "composio/swe" # Then use them like: success_file = logs / SUCCESS_LOG_FILE full_tag = f"{DOCKER_IMAGE_PREFIX}:{tag_part}"

This makes the code more maintainable and reduces the risk of typos.

shreysingla11 · 2025-02-07T13:05:45Z

python/swe/dockerfiles/build.py

+stop_requested = False
+
+
+def handle_stop(signum, frame):


Consider adding proper logging instead of print statements for better observability and debugging:

def handle_stop(signum, frame): global stop_requested logger.info("Received stop signal. Gracefully stopping new builds...") stop_requested = True

shreysingla11 · 2025-02-07T13:06:02Z

python/swe/dockerfiles/build.py

+
+
+# Record a successful build by writing to the log file and updating the global set
+def record_success(tag: str) -> None:


Consider adding docstrings to the new functions for better code documentation:

def record_success(tag: str) -> None: """Record a successful build by writing to the success log file and updating the global set. Args: tag: The docker image tag that was successfully built Thread-safe: Uses success_lock for synchronization """ with success_lock: successful_builds.add(tag) with open(logs / "success.log", "a", encoding="utf-8") as f: f.write(tag + "\n")

shreysingla11 · 2025-02-07T13:06:21Z

python/swe/dockerfiles/build.py

+
+# Record a successful build by writing to the log file and updating the global set
+def record_success(tag: str) -> None:
+    with success_lock:


Consider validating the docker tag format before recording success to prevent invalid entries:

def is_valid_docker_tag(tag: str) -> bool: """Validate docker tag format according to Docker's rules.""" # Basic validation - can be expanded based on needs return bool(re.match(r'^[a-z0-9][a-z0-9._-]*$', tag)) def record_success(tag: str) -> None: if not is_valid_docker_tag(tag): logger.warning(f"Invalid docker tag format: {tag}") return with success_lock: successful_builds.add(tag) with open(logs / "success.log", "a", encoding="utf-8") as f: f.write(tag + "\n")

shreysingla11 · 2025-02-07T13:06:42Z

Code Review Summary

Overall, the changes look good and significantly improve the docker build system's reliability and maintainability. Here's a breakdown:

Strengths:

✅ Added graceful shutdown handling with signal handlers
✅ Implemented build tracking with success logging
✅ Added skip mechanism for already built images
✅ Improved error handling and logging
✅ Better synchronization with threading locks
✅ Updated API imports and model IDs

Suggestions for Improvement:

Add error handling for file operations
Use constants for frequently used strings
Use proper logging instead of print statements
Add docstrings for new functions
Add validation for docker tags

Code Quality: 8/10

The code is well-structured and introduces important improvements to the build system. The suggestions above would make it even better, but they're not critical for functionality.

Testing:

Consider adding:

Unit tests for load_successful_builds()
Tests for graceful shutdown
Tests for concurrent builds
Tests for error scenarios

The PR successfully addresses the broken swe-bench docker image generation issue while adding several quality-of-life improvements to the build system.

…to fix/swe-benchmark

refactor: fixed broken swe-bench docker image generation.

c2faf5d

- created new images with updated composio sdk - fixes: 1155

alt-glitch requested a review from angrybayblade as a code owner February 7, 2025 13:01

Merge branch 'master' into fix/swe-benchmark

26e0075

vercel bot deployed to Preview February 7, 2025 13:02 View deployment

entelligence-ai-pr-reviews bot reviewed Feb 7, 2025

View reviewed changes

ellipsis-dev bot reviewed Feb 7, 2025

View reviewed changes

shreysingla11 reviewed Feb 7, 2025

View reviewed changes

alt-glitch added 2 commits February 7, 2025 20:00

remove unused module

7480c26

Merge branch 'fix/swe-benchmark' of github.com:ComposioHQ/composio in…

edd51bc

…to fix/swe-benchmark

vercel bot deployed to Preview February 7, 2025 14:31 View deployment

angrybayblade approved these changes Feb 11, 2025

View reviewed changes

angrybayblade merged commit 43bc5b0 into master Feb 11, 2025
23 checks passed

angrybayblade deleted the fix/swe-benchmark branch February 11, 2025 06:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: fixed broken swe-bench docker image generation. #1262

refactor: fixed broken swe-bench docker image generation. #1262

alt-glitch commented Feb 7, 2025 •

edited by ellipsis-dev bot

Loading

vercel bot commented Feb 7, 2025 •

edited

Loading

entelligence-ai-pr-reviews bot Feb 7, 2025

entelligence-ai-pr-reviews bot Feb 7, 2025

entelligence-ai-pr-reviews bot Feb 7, 2025

ellipsis-dev bot left a comment

ellipsis-dev bot Feb 7, 2025

shreysingla11 Feb 7, 2025

shreysingla11 Feb 7, 2025

shreysingla11 Feb 7, 2025

shreysingla11 Feb 7, 2025

shreysingla11 Feb 7, 2025

shreysingla11 commented Feb 7, 2025

		futures.append(executor.submit(_build, file, tag_part, multi))
		[fut.result() for fut in futures]



		# Load successful builds from logs/success.log into the global set
		def load_successful_builds() -> None:



		# Record a successful build by writing to the log file and updating the global set
		def record_success(tag: str) -> None:

refactor: fixed broken swe-bench docker image generation. #1262

refactor: fixed broken swe-bench docker image generation. #1262

Conversation

alt-glitch commented Feb 7, 2025 • edited by ellipsis-dev bot Loading

vercel bot commented Feb 7, 2025 • edited Loading

entelligence-ai-pr-reviews bot Feb 7, 2025

Choose a reason for hiding this comment

entelligence-ai-pr-reviews bot Feb 7, 2025

Choose a reason for hiding this comment

entelligence-ai-pr-reviews bot Feb 7, 2025

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot Feb 7, 2025

Choose a reason for hiding this comment

shreysingla11 Feb 7, 2025

Choose a reason for hiding this comment

shreysingla11 Feb 7, 2025

Choose a reason for hiding this comment

shreysingla11 Feb 7, 2025

Choose a reason for hiding this comment

shreysingla11 Feb 7, 2025

Choose a reason for hiding this comment

shreysingla11 Feb 7, 2025

Choose a reason for hiding this comment

shreysingla11 commented Feb 7, 2025

Code Review Summary

Strengths:

Suggestions for Improvement:

Code Quality: 8/10

Testing:

alt-glitch commented Feb 7, 2025 •

edited by ellipsis-dev bot

Loading

vercel bot commented Feb 7, 2025 •

edited

Loading