Add Octotools integration with the CUA Sample App #14

Open
wants to merge 6 commits into
base: main
19 changes: 19 additions & 0 deletions README.md
@@ -34,6 +34,25 @@ The computer use tool and model are available via the [Responses API](https://pl

You can learn more about this tool in the [Computer use guide](https://platform.openai.com/docs/guides/tools-computer-use).

## Feature Highlights

- **Multiple Computer Environments**: Support for various environments including local browsers, Docker containers, and remote services
- **Safety Measures**: URL blocklisting and safety check acknowledgments
- **Function Calling**: Define and use custom functions in your agent
- **Extensible Design**: Easily add new Computer implementations
- **Octotools Integration**: Enhanced reasoning and specialized tools through the [Octotools](https://github.com/OctoTools/OctoTools) framework

### Octotools Integration

The CUA Sample App includes integration with the Octotools framework for enhanced reasoning and specialized tool access:

```shell
# Run with Octotools integration
python main.py --use-octotools
```

For more details, see the [Octotools Integration Guide](docs/octotools_integration_guide.md) and [README_OCTOTOOLS.md](README_OCTOTOOLS.md).
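
As a rough sketch, a `--use-octotools` flag could be translated into the `Agent` constructor parameters added in this PR (`use_octotools`, `octotools_engine`). The `build_agent_kwargs` helper and the `--octotools-engine` option are hypothetical and are not taken from `main.py`:

```python
import argparse


def build_agent_kwargs(argv: list[str]) -> dict:
    """Translate CLI flags into keyword arguments for Agent(...).

    Hypothetical helper: only --use-octotools is documented above;
    --octotools-engine is an assumed option added for illustration.
    """
    parser = argparse.ArgumentParser(description="CUA Sample App (sketch)")
    parser.add_argument("--use-octotools", action="store_true",
                        help="route complex queries through Octotools")
    parser.add_argument("--octotools-engine", default="gpt-4o",
                        help="assumed flag: model used by Octotools")
    args = parser.parse_args(argv)
    # Keys mirror the parameter names added to Agent.__init__ in this PR.
    return {
        "use_octotools": args.use_octotools,
        "octotools_engine": args.octotools_engine,
    }
```

Calling `build_agent_kwargs(["--use-octotools"])` yields keyword arguments that could be passed on as `Agent(..., **kwargs)`.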

## Abstractions

This repository defines two lightweight abstractions to make interacting with CUA agents more ergonomic. Everything works without them, but they provide a convenient separation of concerns.
182 changes: 182 additions & 0 deletions README_OCTOTOOLS.md
@@ -0,0 +1,182 @@
# 🛠️ Octotools Integration for CUA Sample App

This integration enhances the CUA Sample App with [Octotools](https://github.com/OctoTools/OctoTools) capabilities, providing advanced reasoning, problem-solving, and specialized tool access for AI agents.

## 📋 Overview

The Octotools integration enables CUA Sample App to:
- Perform complex multi-step reasoning
- Access specialized tools for different tasks
- Enhance browser automation with content analysis
- Generate code and analyze data
- Search for and extract information from the web

## 🧩 Components

The integration consists of the following key components:

1. **OctotoolsWrapper** (`octotools_wrapper.py`) - Core wrapper for Octotools functionality.

2. **OctotoolsAgent** (`octotools_agent.py`) - Enhanced agent extending the base CUA Agent with Octotools capabilities.

3. **SimpleOctotoolsWrapper** (`simple_octotools_wrapper.py`) - Lightweight wrapper using direct API calls for environments without full Octotools.

4. **CompleteOctotoolsWrapper** (`complete_octotools_wrapper.py`) - Full-featured wrapper with all Octotools capabilities.

5. **Integration Scripts** - Various scripts to demonstrate different integration patterns.

## ⚙️ Setup

### Prerequisites

- Python 3.10 or higher
- CUA Sample App installed and working
- An OpenAI API key with access to GPT-4o or a similar model

### Quick Installation

1. **Clone the repository with submodules**:
```bash
git clone --recurse-submodules https://github.com/jmanhype/openai-cua-sample-app.git
cd openai-cua-sample-app
```

2. **Set up environment**:
```bash
python setup_octotools.py
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

### Manual Setup

If you prefer manual setup:

1. **Create `.env` file**:
```bash
echo "OPENAI_API_KEY=your-api-key-here" > .env
echo "OCTOTOOLS_MODEL=gpt-4o" >> .env
```

2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
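
To illustrate how the two `.env` entries might be read at startup, here is a minimal sketch that uses only the standard library. The `load_env_file` and `resolve_octotools_settings` helpers are hypothetical; the integration itself may load the file differently (for example with a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file.

    Minimal parser: strips surrounding quotes, but does not handle
    escapes, multi-line values, or `export` prefixes.
    """
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_octotools_settings(path: str = ".env") -> tuple[str, str]:
    """Return (api_key, model); real environment variables win over the file."""
    file_values = load_env_file(path)
    api_key = os.environ.get("OPENAI_API_KEY") or file_values.get("OPENAI_API_KEY", "")
    # Falling back to "gpt-4o" mirrors the default used in the .env example above.
    model = os.environ.get("OCTOTOOLS_MODEL") or file_values.get("OCTOTOOLS_MODEL", "gpt-4o")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set in the environment or .env")
    return api_key, model
```

Failing early when the key is missing gives a clearer error than a rejected API request later in the run.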

## 🚀 Usage

### Basic Integration

Run the CUA Sample App with Octotools enabled:

```bash
python main.py --use-octotools --debug
```

### Advanced Usage

Use the dedicated OctotoolsAgent with specific tools:

```bash
python run_octotools_agent.py --tools "Python_Code_Generator_Tool,Text_Detector_Tool,URL_Text_Extractor_Tool,Nature_News_Fetcher_Tool"
```

### Available Tools

The integration supports multiple tools:

| Tool | Description | Usage Example |
|------|-------------|---------------|
| `Generalist_Solution_Generator_Tool` | General problem-solving | Complex reasoning tasks |
| `Python_Code_Generator_Tool` | Generates Python code | "Write a script to parse CSV files" |
| `Text_Detector_Tool` | Analyzes text for key information | Extract entities from documents |
| `URL_Text_Extractor_Tool` | Extracts text from webpages | "Summarize this webpage" |
| `Nature_News_Fetcher_Tool` | Fetches news from Nature | "What's new in quantum computing?" |
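
The `--tools` value shown under Advanced Usage is a comma-separated list of the tool names in the table above. A small sketch of how such a value could be split and validated follows; `parse_tools` and `KNOWN_TOOLS` are illustrative names and are not taken from `run_octotools_agent.py`.

```python
# Tool names copied from the table above.
KNOWN_TOOLS = {
    "Generalist_Solution_Generator_Tool",
    "Python_Code_Generator_Tool",
    "Text_Detector_Tool",
    "URL_Text_Extractor_Tool",
    "Nature_News_Fetcher_Tool",
}


def parse_tools(arg: str) -> list[str]:
    """Split a comma-separated --tools value and reject unknown names."""
    # Tolerate spaces around commas and ignore empty entries.
    names = [name.strip() for name in arg.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_TOOLS]
    if unknown:
        raise ValueError(f"Unknown tool(s): {', '.join(unknown)}")
    return names
```

Rejecting unknown names up front surfaces a typo in the tool list before any model call is made.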

## 🧪 Testing

Run tests to verify the integration:

```bash
# Test basic integration
python test_octotools.py

# Test simple wrapper
python test_simple_octotools.py

# Test full integration
python test_full_octotools.py
```

## 🔍 Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     CUA Sample App                      │
│                                                         │
│  ┌───────────────┐          ┌───────────────────────┐   │
│  │ Regular Agent │          │    OctotoolsAgent     │   │
│  └───────┬───────┘          └───────────┬───────────┘   │
│          │                              │               │
│          │                  ┌───────────┴───────────┐   │
│          │                  │   OctotoolsWrapper    │   │
│          │                  └───────────┬───────────┘   │
│          │                              │               │
│   ┌──────┴──────────────────────────────┴─────────┐     │
│   │                   Computer                    │     │
│   └───────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │    Octotools    │
                    │    Framework    │
                    └─────────────────┘
```

## 📚 Documentation

For more detailed documentation:

- **Integration Guide**: See `docs/octotools_integration_guide.md` for a comprehensive guide.
- **API Reference**: Check `octotools_wrapper.py` and `octotools_agent.py` for inline documentation.
- **Examples**: The `examples/` directory contains example usage patterns.

## ❓ Troubleshooting

### API Key Problems

If you see errors related to the API key:
- Ensure that the `.env` file contains `OPENAI_API_KEY=your-api-key`
- Verify your API key has access to the required models

### Import Errors

If you encounter import errors:
- Ensure all dependencies are properly installed
- Run from the project root directory
- Check that the octotools directory is correctly placed
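
As a quick diagnostic for the checks above, the following sketch reports which modules Python can locate from the project root. The `check_imports` helper is hypothetical; the module names passed in the usage note are the file names listed under Components.

```python
import importlib.util
import sys
from pathlib import Path


def check_imports(modules: list[str], project_root: str = ".") -> dict[str, bool]:
    """Report which top-level modules can be located from project_root.

    Note: this prepends project_root to sys.path as a side effect, and it
    only supports top-level names (no dotted paths).
    """
    root = str(Path(project_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    # find_spec returns None when a top-level module cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in modules}
```

For example, `check_imports(["octotools_wrapper", "octotools_agent"])` run from the repository root shows at a glance which of the two modules is missing.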

### Performance Issues

If reasoning tasks are slow:
- Use a more powerful model like GPT-4o
- Reduce the number of enabled tools
- Set a lower `max_steps` value to limit the number of iterations

## 👥 Contributing

Contributions are welcome! To contribute to this integration:

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Add tests
5. Submit a pull request

## 📄 License

This integration is subject to the same license as the CUA Sample App.
129 changes: 128 additions & 1 deletion agent/agent.py
@@ -7,12 +7,13 @@
    check_blocklisted_url,
)
import json
from typing import Callable
from typing import Callable, List, Dict, Any, Optional


class Agent:
    """
    A sample agent class that can be used to interact with a computer.
    Enhanced with Octotools for complex reasoning.

    (See simple_cua_loop.py for a simple example without an agent.)
    """
@@ -23,6 +24,9 @@ def __init__(
        computer: Computer = None,
        tools: list[dict] = [],
        acknowledge_safety_check_callback: Callable = lambda: False,
        use_octotools: bool = False,
        octotools_engine: str = "gpt-4o",
        octotools_tools: Optional[List[str]] = None,
    ):
        self.model = model
        self.computer = computer
@@ -41,6 +45,23 @@ def __init__(
                    "environment": computer.environment,
                },
            ]

        # Octotools integration
        self.use_octotools = use_octotools
        if use_octotools:
            try:
                from octotools_wrapper import OctotoolsWrapper
                self.octotools = OctotoolsWrapper(
                    llm_engine=octotools_engine,
                    enabled_tools=octotools_tools
                )
                print("Octotools initialized successfully!")
            except ImportError as e:
                print(f"Warning: Could not initialize Octotools: {str(e)}")
                self.use_octotools = False
                self.octotools = None
        else:
            self.octotools = None

    def debug_print(self, *args):
        if self.debug:
@@ -113,9 +134,16 @@ def handle_item(self, item):
    def run_full_turn(
        self, input_items, print_steps=True, debug=False, show_images=False
    ):
        """Enhanced run_full_turn with Octotools integration for complex reasoning."""
        self.print_steps = print_steps
        self.debug = debug
        self.show_images = show_images

        # Check if we should use Octotools for complex reasoning
        if self.use_octotools and self.octotools and self._needs_complex_reasoning(input_items):
            return self._handle_with_octotools(input_items)

        # Original CUA logic
        new_items = []

        # keep looping until we get a final response
@@ -139,3 +167,102 @@ def run_full_turn(
                    new_items += self.handle_item(item)

        return new_items

    def _needs_complex_reasoning(self, input_items: List[Dict[str, Any]]) -> bool:
        """
        Determine if the query needs complex reasoning that would benefit from Octotools.
        This is a basic heuristic and can be enhanced based on specific requirements.

        Args:
            input_items: The list of input items

        Returns:
            bool: True if complex reasoning is needed, False otherwise
        """
        # Extract the latest user message
        latest_user_message = None
        for item in reversed(input_items):
            if item.get("role") == "user":
                latest_user_message = item.get("content", "")
                break

        if not latest_user_message:
            return False

        # Simple heuristic: check for keywords that might suggest complex reasoning
        complex_keywords = [
            "analyze", "compare", "calculate", "extract data", "search for",
            "find information", "summarize", "visual analysis",
            "collect data", "research", "solve"
        ]

        return any(keyword in latest_user_message.lower() for keyword in complex_keywords)

    def _handle_with_octotools(self, input_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Handle a query using Octotools for complex reasoning.

        Args:
            input_items: The list of input items

        Returns:
            List[Dict[str, Any]]: The result items
        """
        # Extract the latest user message and any screenshots
        latest_user_message = None
        latest_screenshot = None

        for item in reversed(input_items):
            if item.get("role") == "user" and not latest_user_message:
                latest_user_message = item.get("content", "")

            # Look for the most recent screenshot
            if not latest_screenshot and item.get("type") == "computer_call_output":
                output = item.get("output", {})
                if output.get("type") == "input_image":
                    image_url = output.get("image_url", "")
                    if image_url.startswith("data:image/png;base64,"):
                        latest_screenshot = image_url

        if not latest_user_message:
            return []

        # Get the current URL for context if in browser environment
        current_url = None
        if self.computer and self.computer.environment == "browser":
            try:
                current_url = self.computer.get_current_url()
            except Exception:
                # Best effort: the URL is optional context, so ignore lookup failures
                pass

        # Build context
        context = f"Current URL: {current_url}" if current_url else ""

        # Solve using Octotools
        if self.print_steps:
            print("Using Octotools for complex reasoning...")

        result = self.octotools.solve(
            query=latest_user_message,
            image_data=latest_screenshot.split("base64,")[1] if latest_screenshot else None,
            context=context
        )

        # Format the result for CUA
        answer = result.get("answer", "I couldn't find a solution using the available tools.")
        steps = result.get("steps", [])

        if self.print_steps:
            print(f"Octotools result: {answer[:100]}...")

        # Build a detailed response that includes steps taken
        detailed_response = answer + "\n\n"
        if steps:
            detailed_response += "I took the following steps to solve this:\n"
            for i, step in enumerate(steps, 1):
                tool_used = step.get("tool_used", "Unknown tool")
                reasoning = step.get("reasoning", "No reasoning provided")
                detailed_response += f"\n{i}. Used {tool_used}: {reasoning}"

        # Return as a message from the assistant
        return [{"role": "assistant", "content": detailed_response}]
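
The keyword heuristic in `_needs_complex_reasoning` uses plain substring matching, so a keyword such as "solve" also fires on "resolve". The standalone sketch below reproduces that check next to a hypothetical word-boundary variant for comparison; `substring_match` and `word_boundary_match` are illustrative names and are not part of this PR.

```python
import re

# Keyword list copied from _needs_complex_reasoning in the diff above.
COMPLEX_KEYWORDS = [
    "analyze", "compare", "calculate", "extract data", "search for",
    "find information", "summarize", "visual analysis",
    "collect data", "research", "solve",
]

# One pattern with word boundaries so "solve" does not match "resolve".
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in COMPLEX_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def substring_match(message: str) -> bool:
    """The check used in the diff: plain, case-insensitive substring search."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in COMPLEX_KEYWORDS)


def word_boundary_match(message: str) -> bool:
    """Hypothetical stricter variant: keywords must match as whole words."""
    return _KEYWORD_RE.search(message) is not None
```

The stricter variant trades false positives for false negatives: it no longer routes "resolve the popup" to Octotools, but it also stops matching inflected forms such as "summarized".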