38 commits
be04aff
Add paddleocr option
aliencaocao Oct 29, 2024
fa9c69f
Default use paddle ocr to false to not break jupyter notebook example
aliencaocao Oct 29, 2024
31c6289
remove autogenerated .github/workflows/docker-build-ec2.yml
abrichr Oct 29, 2024
88f7246
Add functionality to save bounding boxes
abrichr Oct 29, 2024
754d0b7
update README
abrichr Oct 29, 2024
af8c9da
improve README
abrichr Oct 29, 2024
4fdb813
add deploy section to README
abrichr Oct 29, 2024
3520928
improve documentation
abrichr Oct 30, 2024
64bdbaa
add usage to Dockerfile documentation
abrichr Oct 30, 2024
9cce7d7
Improve deploy.py documentation
abrichr Oct 30, 2024
54b8b47
add client.predict and documentation
abrichr Oct 30, 2024
612785d
Merge pull request #53 from aliencaocao/paddle-ocr
yadong-lu Oct 31, 2024
b094079
fixes for paddle ocr
yadong-lu Oct 31, 2024
1db311f
update readme
yadong-lu Oct 31, 2024
d1b39a2
update readme
yadong-lu Oct 31, 2024
169dd20
Merge branch 'master' into feat/deploy
abrichr Nov 1, 2024
b8b952c
undo changes to gradio_demo.py
abrichr Nov 1, 2024
201af0f
Add JSON output formatting to process function; return label_coordinates
abrichr Nov 1, 2024
4ae782f
update readme
yadong-lu Nov 1, 2024
a411848
Update Dockerfile documentation
abrichr Nov 1, 2024
b706744
Merge branch 'master' into feat/deploy
abrichr Nov 1, 2024
9ad451a
remove superfluous print
abrichr Nov 1, 2024
9f2dc91
more terse
abrichr Nov 1, 2024
76d6110
parsed_content
abrichr Nov 1, 2024
aa87102
comment out apt-get install git, opengl, python3, opencv
abrichr Nov 12, 2024
cedbc86
get_latest_ami
abrichr Nov 12, 2024
e2e92da
add workflow file
abrichr Nov 12, 2024
4d8e9c1
add workflow file
abrichr Nov 12, 2024
926a2a5
add workflow file
abrichr Nov 12, 2024
509da0a
set AMI to 06835d15c4de57810
abrichr Nov 12, 2024
31041aa
replace nvidia-docker with docker
abrichr Nov 12, 2024
0a94a1f
add workflow file
abrichr Nov 12, 2024
549b7ff
ssh in start
abrichr Nov 12, 2024
645b0b0
add workflow file
abrichr Nov 12, 2024
1d7c860
add workflow file
abrichr Nov 12, 2024
8a562b4
add workflow file
abrichr Nov 12, 2024
8b1681e
add workflow file
abrichr Nov 12, 2024
a605290
ssh non_interactive
abrichr Nov 12, 2024
8 changes: 4 additions & 4 deletions .github/workflows/docker-build-ec2.yml
@@ -5,7 +5,7 @@ name: Docker Build on EC2 Instance for OmniParser
on:
push:
branches:
- feat/deploy2
- feat/deploy-deps

jobs:
build:
@@ -18,7 +18,7 @@ jobs:
uses: appleboy/ssh-action@master
with:
command_timeout: "60m"
host: 44.198.58.162
host: 18.209.211.183
username: ubuntu

key: ${{ secrets.SSH_PRIVATE_KEY }}
@@ -27,15 +27,15 @@
rm -rf OmniParser || true
git clone https://github.com/OpenAdaptAI/OmniParser
cd OmniParser
git checkout feat/deploy2
git checkout feat/deploy-deps
git pull

# Stop and remove any existing containers
sudo docker stop omniparser-container || true
sudo docker rm omniparser-container || true

# Build the Docker image
sudo nvidia-docker build -t omniparser .
sudo docker build -t omniparser .

# Run the Docker container on the specified port
sudo docker run -d -p 7861:7861 --gpus all --name omniparser-container omniparser
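
The same stop/remove/build/run sequence that the workflow executes over SSH can be reproduced locally for testing. A minimal sketch using Python's standard library (container and image names are taken from the workflow above; everything else is assumed):

```python
import subprocess

# Mirror the workflow's remote script: stop and remove any existing
# container, rebuild the image, then start it with GPU access on port 7861.
# Assumes Docker and the NVIDIA Container Toolkit are installed locally.
commands = [
    ["sudo", "docker", "stop", "omniparser-container"],
    ["sudo", "docker", "rm", "omniparser-container"],
    ["sudo", "docker", "build", "-t", "omniparser", "."],
    ["sudo", "docker", "run", "-d", "-p", "7861:7861", "--gpus", "all",
     "--name", "omniparser-container", "omniparser"],
]
for cmd in commands:
    # stop/rm are allowed to fail when no container exists yet,
    # matching the "|| true" in the workflow script.
    subprocess.run(cmd, check=(cmd[2] not in ("stop", "rm")))
```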
2 changes: 1 addition & 1 deletion .gitignore
@@ -7,4 +7,4 @@ __pycache__
.env
.env.*
venv/
*.pem
*.pem
44 changes: 17 additions & 27 deletions Dockerfile
@@ -1,42 +1,32 @@
# Dockerfile for OmniParser with GPU support and OpenGL libraries
# Dockerfile for OmniParser with GPU and OpenGL support.
#
# This Dockerfile is intended to create an environment with NVIDIA CUDA
# support and the necessary dependencies to run the OmniParser project.
# The configuration is designed to support applications that rely on
# Python 3.12, OpenCV, Hugging Face transformers, and Gradio. Additionally,
# it includes steps to pull large files from Git LFS and a script to
# convert model weights from .safetensor to .pt format. The container
# runs a Gradio server by default, exposed on port 7861.
# Base: nvidia/cuda:12.3.1-devel-ubuntu22.04
# Features:
# - Python 3.12 with Miniconda environment.
# - Git LFS for large file support.
# - Required libraries: OpenCV, Hugging Face, Gradio, OpenGL.
# - Gradio server on port 7861.
#
# Base image: nvidia/cuda:12.3.1-devel-ubuntu22.04
# 1. Build the image with CUDA support.
# ```
# sudo nvidia-docker build -t omniparser .
# ```
#
# Key features:
# - System dependencies for OpenGL to support graphical libraries.
# - Miniconda for Python 3.12, allowing for environment management.
# - Git Large File Storage (LFS) setup for handling large model files.
# - Requirement file installation, including specific versions of
# OpenCV and Hugging Face Hub.
# - Entrypoint script execution with Gradio server configuration for
# external access.
# 2. Run the Docker container with GPU access and port mapping for Gradio.
# ```bash
# sudo docker run -d -p 7861:7861 --gpus all --name omniparser-container omniparser
# ```
#
# Author: Richard Abrich ([email protected])

FROM nvidia/cuda:12.3.1-devel-ubuntu22.04

# Install system dependencies with explicit OpenGL libraries
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
git \
git-lfs \
wget \
libgl1 \
libglib2.0-0 \
libsm6 \
libxext6 \
libxrender1 \
libglu1-mesa \
libglib2.0-0 \
libsm6 \
libxrender1 \
libxext6 \
python3-opencv \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& git lfs install
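The earlier header comments mention a script that converts model weights from .safetensor to .pt format. That script is not part of this diff; a minimal sketch of such a conversion, assuming the safetensors and torch packages and placeholder file paths, could look like:

```python
import torch
from safetensors.torch import load_file

# Hypothetical conversion sketch: load a .safetensors checkpoint and
# re-save its state dict in PyTorch's native .pt format. The paths are
# placeholders, not the repository's actual weight locations.
state_dict = load_file("weights/icon_detect/model.safetensors")
torch.save(state_dict, "weights/icon_detect/model.pt")
print(f"Converted {len(state_dict)} tensors to .pt format")
```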
30 changes: 29 additions & 1 deletion README.md
@@ -12,9 +12,24 @@
**OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

## News
- [2024/10] OmniParser is the #1 trending model on huggingface model hub (starting 10/29/2024).
- [2024/10] Feel free to check out our demo on [huggingface space](https://huggingface.co/spaces/microsoft/OmniParser)! (stay tuned for OmniParser + Claude Computer Use)
- [2024/10] Both the Interactive Region Detection Model and the Icon functional description model are released! [Huggingface models](https://huggingface.co/microsoft/OmniParser)
- [2024/09] OmniParser achieves the best performance on [Windows Agent Arena](https://microsoft.github.io/WindowsAgentArena/)!

### :rocket: Docker Quick Start

Prerequisites:
- CUDA-enabled GPU
- NVIDIA Container Toolkit installed (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
```
# Build the image (requires CUDA)
sudo nvidia-docker build -t omniparser .

# Run the image
sudo docker run -d -p 7861:7861 --gpus all --name omniparser-container omniparser
```
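
Once the container is up, a quick way to confirm the Gradio server is reachable is a minimal sketch like the following, assuming the default port mapping above and the requests package:

```python
import requests

# Poll the Gradio server exposed by the container on port 7861.
# Host and port are taken from the docker run command above.
response = requests.get("http://localhost:7861", timeout=10)
print(response.status_code)  # expect 200 once the server has finished loading
```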

## Install
Install environment:
```python
@@ -23,8 +38,12 @@ conda activate omni
pip install -r requirements.txt
```

Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.
Download and convert the model ckpt files from https://huggingface.co/microsoft/OmniParser:
```python
python download.py
```
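
download.py itself is not shown in this diff; a rough sketch of the kind of download it might perform, using huggingface_hub (an assumption, not the actual script), is:

```python
from huggingface_hub import snapshot_download

# Hypothetical sketch: fetch the OmniParser checkpoints from the
# Hugging Face Hub into the weights/ folder the README expects.
snapshot_download(repo_id="microsoft/OmniParser", local_dir="weights")
```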

Or, download the model checkpoint files from https://huggingface.co/microsoft/OmniParser and put them under weights/; the default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.
Finally, convert the safetensors files to .pt format.
```python
python weights/convert_safetensor_to_pt.py
@@ -39,6 +58,15 @@ To run gradio demo, simply run:
python gradio_demo.py
```

## Deploy to AWS

To deploy OmniParser to EC2 on AWS via GitHub Actions:

1. Fork this repository and clone your fork to your local machine.
2. Follow the instructions at the top of [`deploy.py`](https://github.com/microsoft/OmniParser/blob/main/deploy.py).
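
The commit history references a get_latest_ami helper used by the deployment tooling; a minimal sketch of how such a lookup could work with boto3 (the image filters and owner are assumptions, not the actual deploy.py logic):

```python
import boto3

def get_latest_ami(name_pattern: str = "Deep Learning*Ubuntu 22.04*") -> str:
    """Return the most recently created AMI matching the name pattern.

    The pattern and owner are illustrative assumptions; the real
    deploy.py pins a specific AMI (see the commit history above).
    """
    ec2 = boto3.client("ec2")
    images = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name", "Values": [name_pattern]}],
    )["Images"]
    latest = max(images, key=lambda img: img["CreationDate"])
    return latest["ImageId"]

print(get_latest_ami())
```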

## Model Weights License
For the model checkpoints on the Hugging Face model hub, please note that the icon_detect model is under the AGPL license, inherited from the original YOLO model, while icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in each model's folder: https://huggingface.co/microsoft/OmniParser.

## 📚 Citation
Our technical report can be found [here](https://arxiv.org/abs/2408.00203).
Binary file modified __pycache__/utils.cpython-312.pyc
Binary file removed __pycache__/utils.cpython-39.pyc
76 changes: 45 additions & 31 deletions client.py
@@ -1,7 +1,7 @@
"""
This module provides a command-line interface to interact with the OmniParser Gradio server.
This module provides a command-line interface and programmatic API to interact with the OmniParser Gradio server.

Usage:
Command-line usage:
python client.py "http://<server_ip>:7861" "path/to/image.jpg"

View results:
@@ -11,6 +11,10 @@
Windows: start output_image_<timestamp>.png
Linux: xdg-open output_image_<timestamp>.png

Programmatic usage:
from omniparse.client import predict
result = predict("http://<server_ip>:7861", "path/to/image.jpg")

Result data format:
{
"label_coordinates": {
@@ -33,30 +37,31 @@
import fire
from gradio_client import Client
from loguru import logger
from PIL import Image
import base64
from io import BytesIO
import os
import shutil
import json
from datetime import datetime

def predict(server_url: str, image_path: str, box_threshold: float = 0.05, iou_threshold: float = 0.1):
# Define constants for default thresholds
DEFAULT_BOX_THRESHOLD = 0.05
DEFAULT_IOU_THRESHOLD = 0.1

def predict(server_url: str, image_path: str, box_threshold: float = DEFAULT_BOX_THRESHOLD, iou_threshold: float = DEFAULT_IOU_THRESHOLD):
"""
Makes a prediction using the OmniParser Gradio client with the provided server URL and image.

Args:
server_url (str): The URL of the OmniParser Gradio server.
image_path (str): Path to the image file to be processed.
box_threshold (float): Box threshold value (default: 0.05).
iou_threshold (float): IOU threshold value (default: 0.1).
Returns:
dict: Parsed result data containing label coordinates and parsed content list.
"""
client = Client(server_url)

# Generate a timestamp for unique file naming
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Load and encode the image
image_path = os.path.expanduser(image_path)
with open(image_path, "rb") as image_file:
encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

@@ -72,47 +77,56 @@ def predict(server_url: str, image_path: str, box_threshold: float = 0.05, iou_t
}

# Make the prediction
try:
result = client.predict(
image_input, # image input as dictionary
box_threshold, # box_threshold
iou_threshold, # iou_threshold
api_name="/process"
)
result = client.predict(
image_input,
box_threshold,
iou_threshold,
api_name="/process"
)

# Process and log the results
output_image, result_json = result

logger.info("Prediction completed successfully")
# Process and return the result
output_image, result_json = result
result_data = json.loads(result_json)

# Parse the JSON string into a Python object
result_data = json.loads(result_json)
return {"output_image": output_image, "result_data": result_data}

# Extract label_coordinates and parsed_content_list
label_coordinates = result_data['label_coordinates']
parsed_content_list = result_data['parsed_content_list']

logger.info(f"{label_coordinates=}")
logger.info(f"{parsed_content_list=}")
def predict_and_save(server_url: str, image_path: str, box_threshold: float = DEFAULT_BOX_THRESHOLD, iou_threshold: float = DEFAULT_IOU_THRESHOLD):
"""
Makes a prediction and saves the results to files, including logs and image outputs.
Args:
server_url (str): The URL of the OmniParser Gradio server.
image_path (str): Path to the image file to be processed.
box_threshold (float): Box threshold value (default: 0.05).
iou_threshold (float): IOU threshold value (default: 0.1).
"""
# Generate a timestamp for unique file naming
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Call the predict function to get prediction data
try:
result = predict(server_url, image_path, box_threshold, iou_threshold)
output_image = result["output_image"]
result_data = result["result_data"]

# Save result data to JSON file
result_data_path = f"result_data_{timestamp}.json"
with open(result_data_path, "w") as json_file:
json.dump(result_data, json_file, indent=4)
logger.info(f"Parsed content saved to: {result_data_path}")

# Save the output image
output_image_path = f"output_image_{timestamp}.png"
if isinstance(output_image, str) and os.path.exists(output_image):
shutil.copy(output_image, output_image_path)
logger.info(f"Output image saved to: {output_image_path}")
else:
logger.warning(f"Unexpected output_image format or file not found: {output_image}")

except Exception as e:
logger.error(f"An error occurred: {str(e)}")
logger.exception("Traceback:")

if __name__ == "__main__":
fire.Fire(predict)

if __name__ == "__main__":
fire.Fire(predict_and_save)
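
Building on the programmatic API described in the client.py docstring above, a short usage sketch follows; the server URL, image path, and exact result fields are assumptions based on that docstring and on the dict returned by predict():

```python
from client import predict  # the docstring also suggests: from omniparse.client import predict

# Hypothetical call against a running OmniParser server; the URL and
# image path are placeholders.
result = predict("http://localhost:7861", "screenshot.png")
data = result["result_data"]

# Per the docstring, the result carries per-label coordinates and a list
# of parsed content descriptions.
for label, coords in data["label_coordinates"].items():
    print(label, coords)
for item in data["parsed_content_list"]:
    print(item)
```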
249 changes: 114 additions & 135 deletions demo.ipynb

Large diffs are not rendered by default.
