openai · jmanhype · Mar 11, 2025 · Mar 12, 2025 · Mar 12, 2025 · Mar 12, 2025
diff --git a/README.md b/README.md
@@ -34,6 +34,25 @@ The computer use tool and model are available via the [Responses API](https://pl
 
 You can learn more about this tool in the [Computer use guide](https://platform.openai.com/docs/guides/tools-computer-use).
 
+## Feature Highlights
+
+- **Multiple Computer Environments**: Support for various environments including local browsers, Docker containers, and remote services
+- **Safety Measures**: URL blocklisting and safety check acknowledgments
+- **Function Calling**: Define and use custom functions in your agent
+- **Extensible Design**: Easily add new Computer implementations
+- **Octotools Integration**: Enhanced reasoning and specialized tools through the [Octotools](https://github.com/OctoTools/OctoTools) framework
+
+### Octotools Integration
+
+The CUA Sample App includes integration with the Octotools framework for enhanced reasoning and specialized tool access:
+
+```shell
+# Run with Octotools integration
+python main.py --use-octotools
+```
+
+For more details, see the [Octotools Integration Guide](docs/octotools_integration_guide.md) and [README_OCTOTOOLS.md](README_OCTOTOOLS.md).
+
 ## Abstractions
 
 This repository defines two lightweight abstractions to make interacting with CUA agents more ergonomic. Everything works without them, but they provide a convenient separation of concerns.

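A sketch for the `--use-octotools` flag documented in the README.md hunk above: one way `main.py` might map command-line flags onto the new `Agent` keyword arguments (`use_octotools`, `octotools_engine`, `octotools_tools`). The diff does not show `main.py`, so the parser, the `--octotools-engine` flag, and the helper names `build_parser` and `agent_kwargs` are assumptions for illustration. `--tools` is borrowed from the `run_octotools_agent.py` usage in README_OCTOTOOLS.md.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI; only --use-octotools, --debug and --tools appear in the READMEs."""
    parser = argparse.ArgumentParser(description="CUA sample app (sketch)")
    parser.add_argument("--use-octotools", action="store_true",
                        help="route complex queries through Octotools")
    # Assumed flag: the diff only shows the Agent default of "gpt-4o".
    parser.add_argument("--octotools-engine", default="gpt-4o")
    parser.add_argument("--tools", default="",
                        help="comma-separated Octotools tool names")
    parser.add_argument("--debug", action="store_true")
    return parser


def agent_kwargs(argv: list[str]) -> dict:
    """Translate parsed flags into the keyword arguments added to Agent.__init__."""
    args = build_parser().parse_args(argv)
    tools = [name.strip() for name in args.tools.split(",") if name.strip()]
    return {
        "use_octotools": args.use_octotools,
        "octotools_engine": args.octotools_engine,
        # None mirrors the Agent default when no tools are listed.
        "octotools_tools": tools or None,
    }
```

The resulting dictionary could then be passed as `Agent(..., **agent_kwargs(sys.argv[1:]))`, assuming the other constructor arguments are supplied as before.
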
diff --git a/README_OCTOTOOLS.md b/README_OCTOTOOLS.md
@@ -0,0 +1,182 @@
+# 🛠️ Octotools Integration for CUA Sample App
+
+This integration enhances the CUA Sample App with [Octotools](https://github.com/OctoTools/OctoTools) capabilities, providing advanced reasoning, problem-solving, and specialized tool access for AI agents.
+
+## 📋 Overview
+
+The Octotools integration enables CUA Sample App to:
+- Perform complex multi-step reasoning
+- Access specialized tools for different tasks
+- Enhance browser automation with content analysis
+- Generate code and analyze data
+- Search for and extract information from the web
+
+## 🧩 Components
+
+The integration consists of the following key components:
+
+1. **OctotoolsWrapper** (`octotools_wrapper.py`) - Core wrapper for Octotools functionality.
+
+2. **OctotoolsAgent** (`octotools_agent.py`) - Enhanced agent extending the base CUA Agent with Octotools capabilities.
+
+3. **SimpleOctotoolsWrapper** (`simple_octotools_wrapper.py`) - Lightweight wrapper using direct API calls for environments without a full Octotools installation.
+
+4. **CompleteOctotoolsWrapper** (`complete_octotools_wrapper.py`) - Full-featured wrapper with all Octotools capabilities.
+
+5. **Integration Scripts** - Various scripts to demonstrate different integration patterns.
+
+## ⚙️ Setup
+
+### Prerequisites
+
+- Python 3.10 or higher
+- CUA Sample App installed and working
+- An OpenAI API key with access to GPT-4o or a similar model
+
+### Quick Installation
+
+1. **Clone the repository with submodules**:
+   ```bash
+   git clone --recurse-submodules https://github.com/jmanhype/openai-cua-sample-app.git
+   cd openai-cua-sample-app
+   ```
+
+2. **Set up environment**:
+   ```bash
+   python setup_octotools.py
+   ```
+
+3. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+### Manual Setup
+
+If you prefer manual setup:
+
+1. **Create `.env` file**:
+   ```bash
+   echo "OPENAI_API_KEY=your-api-key-here" > .env
+   echo "OCTOTOOLS_MODEL=gpt-4o" >> .env
+   ```
+
+2. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+## 🚀 Usage
+
+### Basic Integration
+
+Run the CUA Sample App with Octotools enabled:
+
+```bash
+python main.py --use-octotools --debug
+```
+
+### Advanced Usage
+
+Use the dedicated OctotoolsAgent with specific tools:
+
+```bash
+python run_octotools_agent.py --tools "Python_Code_Generator_Tool,Text_Detector_Tool,URL_Text_Extractor_Tool,Nature_News_Fetcher_Tool"
+```
+
+### Available Tools
+
+The integration supports multiple tools:
+
+| Tool | Description | Usage Example |
+|------|-------------|---------------|
+| `Generalist_Solution_Generator_Tool` | General problem-solving | Complex reasoning tasks |
+| `Python_Code_Generator_Tool` | Generates Python code | "Write a script to parse CSV files" |
+| `Text_Detector_Tool` | Analyzes text for key information | Extract entities from documents |
+| `URL_Text_Extractor_Tool` | Extracts text from webpages | "Summarize this webpage" |
+| `Nature_News_Fetcher_Tool` | Fetches news from Nature | "What's new in quantum computing?" |
+
+## 🧪 Testing
+
+Run tests to verify the integration:
+
+```bash
+# Test basic integration
+python test_octotools.py
+
+# Test simple wrapper
+python test_simple_octotools.py
+
+# Test full integration
+python test_full_octotools.py
+```
+
+## 🔍 Architecture
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                  CUA Sample App                         │
+│                                                         │
+│  ┌───────────────┐           ┌───────────────────────┐  │
+│  │ Regular Agent │           │ OctotoolsAgent        │  │
+│  └───────┬───────┘           └───────────┬───────────┘  │
+│          │                               │              │
+│          │                   ┌───────────┴───────────┐  │
+│          │                   │  OctotoolsWrapper     │  │
+│          │                   └───────────┬───────────┘  │
+│          │                               │              │
+│   ┌──────┴───────────────────────────────┴─────────┐    │
+│   │                  Computer                      │    │
+│   └────────────────────────────────────────────────┘    │
+└─────────────────────────────┬───────────────────────────┘
+                              │
+                              ▼
+                     ┌─────────────────┐
+                     │  Octotools      │
+                     │  Framework      │
+                     └─────────────────┘
+```
+
+## 📚 Documentation
+
+For more detailed documentation:
+
+- **Integration Guide**: See `docs/octotools_integration_guide.md` for a comprehensive guide.
+- **API Reference**: Check `octotools_wrapper.py` and `octotools_agent.py` for inline documentation.
+- **Examples**: The `examples/` directory contains example usage patterns.
+
+## ❓ Troubleshooting
+
+### API Key Problems
+
+If you see errors related to the API key:
+- Ensure that the `.env` file contains `OPENAI_API_KEY=your-api-key`
+- Verify your API key has access to the required models
+
+### Import Errors
+
+If you encounter import errors:
+- Ensure all dependencies are properly installed
+- Run from the project root directory
+- Check that the octotools directory is correctly placed
+
+### Performance Issues
+
+If reasoning tasks are slow:
+- Use a more capable model such as GPT-4o, which may need fewer reasoning iterations
+- Reduce the number of enabled tools
+- Set a lower `max_steps` value to limit iteration
+
+## 👥 Contributing
+
+Contributions are welcome! To contribute to this integration:
+
+1. Fork the repository
+2. Create a feature branch
+3. Implement your changes
+4. Add tests
+5. Submit a pull request
+
+## 📄 License
+
+This integration is subject to the same license as the CUA Sample App. 
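
A sketch for the `.env` setup and the "API Key Problems" section of README_OCTOTOOLS.md above: a standard-library-only way to read `OPENAI_API_KEY` and `OCTOTOOLS_MODEL` and to reject the placeholder key early. The diff does not show how the wrappers load these values, so `parse_env_text`, `octotools_settings`, and the precedence rule (real environment variables override the file) are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments and malformed lines."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def octotools_settings(path: str = ".env",
                       environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Merge the .env file with the environment and validate the API key."""
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    file_values = parse_env_text(env_path.read_text()) if env_path.is_file() else {}
    merged = {**file_values, **environ}  # assumed precedence: environment wins
    api_key = merged.get("OPENAI_API_KEY", "")
    if not api_key or api_key == "your-api-key-here":
        raise RuntimeError("OPENAI_API_KEY is missing or still the placeholder value")
    # "gpt-4o" matches the default octotools_engine in agent/agent.py.
    return {"api_key": api_key, "model": merged.get("OCTOTOOLS_MODEL", "gpt-4o")}
```

Failing fast here gives a clearer message than an authentication error raised later from inside a tool call.
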
diff --git a/agent/agent.py b/agent/agent.py
@@ -7,12 +7,13 @@
     check_blocklisted_url,
 )
 import json
-from typing import Callable
+from typing import Callable, List, Dict, Any, Optional
 
 
 class Agent:
     """
     A sample agent class that can be used to interact with a computer.
+    Optionally enhanced with Octotools for complex reasoning (see use_octotools).
 
     (See simple_cua_loop.py for a simple example without an agent.)
     """
@@ -23,6 +24,9 @@ def __init__(
         computer: Computer = None,
         tools: list[dict] = [],
         acknowledge_safety_check_callback: Callable = lambda: False,
+        use_octotools: bool = False,
+        octotools_engine: str = "gpt-4o",
+        octotools_tools: Optional[List[str]] = None,
     ):
         self.model = model
         self.computer = computer
@@ -41,6 +45,23 @@ def __init__(
                     "environment": computer.environment,
                 },
             ]
+
+        # Octotools integration
+        self.use_octotools = use_octotools
+        if use_octotools:
+            try:
+                from octotools_wrapper import OctotoolsWrapper
+                self.octotools = OctotoolsWrapper(
+                    llm_engine=octotools_engine,
+                    enabled_tools=octotools_tools
+                )
+                print("Octotools initialized successfully!")
+            except ImportError as e:
+                print(f"Warning: Could not initialize Octotools: {e}")
+                self.use_octotools = False
+                self.octotools = None
+        else:
+            self.octotools = None
 
     def debug_print(self, *args):
         if self.debug:
@@ -113,9 +134,16 @@ def handle_item(self, item):
     def run_full_turn(
         self, input_items, print_steps=True, debug=False, show_images=False
     ):
+        """Enhanced run_full_turn with Octotools integration for complex reasoning."""
         self.print_steps = print_steps
         self.debug = debug
         self.show_images = show_images
+
+        # Check if we should use Octotools for complex reasoning
+        if self.use_octotools and self.octotools and self._needs_complex_reasoning(input_items):
+            return self._handle_with_octotools(input_items)
+
+        # Original CUA logic
         new_items = []
 
         # keep looping until we get a final response
@@ -139,3 +167,102 @@ def run_full_turn(
                     new_items += self.handle_item(item)
 
         return new_items
+
+    def _needs_complex_reasoning(self, input_items: List[Dict[str, Any]]) -> bool:
+        """
+        Determine if the query needs complex reasoning that would benefit from Octotools.
+        This is a basic heuristic and can be enhanced based on specific requirements.
+
+        Args:
+            input_items: The list of input items
+
+        Returns:
+            bool: True if complex reasoning is needed, False otherwise
+        """
+        # Extract the latest user message
+        latest_user_message = None
+        for item in reversed(input_items):
+            if item.get("role") == "user":
+                latest_user_message = item.get("content", "")
+                break
+
+        if not latest_user_message:
+            return False
+
+        # Simple heuristic: check for keywords that might suggest complex reasoning
+        complex_keywords = [
+            "analyze", "compare", "calculate", "extract data", "search for", 
+            "find information", "summarize", "visual analysis", 
+            "collect data", "research", "solve"
+        ]
+
+        return any(keyword in latest_user_message.lower() for keyword in complex_keywords)
+
+    def _handle_with_octotools(self, input_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+        """
+        Handle a query using Octotools for complex reasoning.
+
+        Args:
+            input_items: The list of input items
+
+        Returns:
+            List[Dict[str, Any]]: The result items
+        """
+        # Extract the latest user message and any screenshots
+        latest_user_message = None
+        latest_screenshot = None
+
+        for item in reversed(input_items):
+            if item.get("role") == "user" and not latest_user_message:
+                latest_user_message = item.get("content", "")
+
+            # Look for the most recent screenshot
+            if not latest_screenshot and item.get("type") == "computer_call_output":
+                output = item.get("output", {})
+                if output.get("type") == "input_image":
+                    image_url = output.get("image_url", "")
+                    if image_url.startswith("data:image/png;base64,"):
+                        latest_screenshot = image_url
+
+        if not latest_user_message:
+            return []
+
+        # Get the current URL for context if in browser environment
+        current_url = None
+        if self.computer and self.computer.environment == "browser":
+            try:
+                current_url = self.computer.get_current_url()
+            except Exception:
+                pass  # The URL is optional context; ignore lookup failures
+
+        # Build context
+        context = f"Current URL: {current_url}" if current_url else ""
+
+        # Solve using Octotools
+        if self.print_steps:
+            print("Using Octotools for complex reasoning...")
+
+        result = self.octotools.solve(
+            query=latest_user_message,
+            image_data=latest_screenshot.split("base64,")[1] if latest_screenshot else None,
+            context=context
+        )
+
+        # Format the result for CUA
+        answer = result.get("answer", "I couldn't find a solution using the available tools.")
+        steps = result.get("steps", [])
+
+        if self.print_steps:
+            print(f"Octotools result: {answer[:100]}...")
+
+        # Build a detailed response that includes steps taken
+        detailed_response = answer + "\n\n"
+        if steps:
+            detailed_response += "I took the following steps to solve this:\n"
+            for i, step in enumerate(steps, 1):
+                tool_used = step.get("tool_used", "Unknown tool")
+                reasoning = step.get("reasoning", "No reasoning provided")
+                detailed_response += f"\n{i}. Used {tool_used}: {reasoning}"
+
+        # Return as a message from the assistant
+        return [{"role": "assistant", "content": detailed_response}]
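
A standalone sketch of the routing heuristic in `_needs_complex_reasoning` above, written so it can be exercised without an `Agent` instance. It deliberately differs from the diff in two ways, both assumptions rather than the author's method: it matches keywords on word boundaries instead of as substrings (so "resolved" no longer triggers "solve", though "analyzed" no longer triggers "analyze" either), and it tolerates `content` given as a list of text parts as well as a plain string. The function names are hypothetical.

```python
import re
from typing import Any

# Same keyword list as the diff.
COMPLEX_KEYWORDS = [
    "analyze", "compare", "calculate", "extract data", "search for",
    "find information", "summarize", "visual analysis",
    "collect data", "research", "solve",
]

# Whole-word, case-insensitive match on any keyword.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in COMPLEX_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def message_text(content: Any) -> str:
    """Flatten message content to text; non-text parts are ignored."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return " ".join(
            part.get("text", "") for part in content if isinstance(part, dict)
        )
    return ""


def latest_user_text(input_items: list[dict[str, Any]]) -> str:
    """Return the text of the most recent user message, or an empty string."""
    for item in reversed(input_items):
        if item.get("role") == "user":
            return message_text(item.get("content", ""))
    return ""


def needs_complex_reasoning(input_items: list[dict[str, Any]]) -> bool:
    """True when the latest user message contains a routing keyword."""
    return bool(_KEYWORD_RE.search(latest_user_text(input_items)))
```

Because only the latest user message is inspected, an earlier "analyze" request does not keep routing later plain navigation commands through Octotools.
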