38 commits
be04aff
Add paddleocr option
aliencaocao Oct 29, 2024
fa9c69f
Default use paddle ocr to false to not break jupyter notebook example
aliencaocao Oct 29, 2024
31c6289
remove autogenerated .github/workflows/docker-build-ec2.yml
abrichr Oct 29, 2024
88f7246
Add functionality to save bounding boxes
abrichr Oct 29, 2024
754d0b7
update README
abrichr Oct 29, 2024
af8c9da
improve README
abrichr Oct 29, 2024
4fdb813
add deploy section to README
abrichr Oct 29, 2024
3520928
improve documentation
abrichr Oct 30, 2024
64bdbaa
add usage to Dockerfile documentation
abrichr Oct 30, 2024
9cce7d7
Improve deploy.py documentation
abrichr Oct 30, 2024
54b8b47
add client.predict and documentation
abrichr Oct 30, 2024
612785d
Merge pull request #53 from aliencaocao/paddle-ocr
yadong-lu Oct 31, 2024
b094079
fixes for paddle ocr
yadong-lu Oct 31, 2024
1db311f
update readme
yadong-lu Oct 31, 2024
d1b39a2
update readme
yadong-lu Oct 31, 2024
169dd20
Merge branch 'master' into feat/deploy
abrichr Nov 1, 2024
b8b952c
undo changes to gradio_demo.py
abrichr Nov 1, 2024
201af0f
Add JSON output formatting to process function; return label_coordinates
abrichr Nov 1, 2024
4ae782f
update readme
yadong-lu Nov 1, 2024
a411848
Update Dockerfile documentation
abrichr Nov 1, 2024
b706744
Merge branch 'master' into feat/deploy
abrichr Nov 1, 2024
9ad451a
remove superfluous print
abrichr Nov 1, 2024
9f2dc91
more terse
abrichr Nov 1, 2024
76d6110
parsed_content
abrichr Nov 1, 2024
aa87102
comment out apt-get install git, opengl, python3, opencv
abrichr Nov 12, 2024
cedbc86
get_latest_ami
abrichr Nov 12, 2024
e2e92da
add workflow file
abrichr Nov 12, 2024
4d8e9c1
add workflow file
abrichr Nov 12, 2024
926a2a5
add workflow file
abrichr Nov 12, 2024
509da0a
set AMI to 06835d15c4de57810
abrichr Nov 12, 2024
31041aa
replace nvidia-docker with docker
abrichr Nov 12, 2024
0a94a1f
add workflow file
abrichr Nov 12, 2024
549b7ff
ssh in start
abrichr Nov 12, 2024
645b0b0
add workflow file
abrichr Nov 12, 2024
1d7c860
add workflow file
abrichr Nov 12, 2024
8a562b4
add workflow file
abrichr Nov 12, 2024
8b1681e
add workflow file
abrichr Nov 12, 2024
a605290
ssh non_interactive
abrichr Nov 12, 2024
8 changes: 4 additions & 4 deletions .github/workflows/docker-build-ec2.yml
@@ -5,7 +5,7 @@ name: Docker Build on EC2 Instance for OmniParser
on:
push:
branches:
- feat/deploy2
- feat/deploy-deps

jobs:
build:
@@ -18,7 +18,7 @@ jobs:
uses: appleboy/ssh-action@master
with:
command_timeout: "60m"
host: 44.198.58.162
host: 18.209.211.183
username: ubuntu

key: ${{ secrets.SSH_PRIVATE_KEY }}
@@ -27,15 +27,15 @@
rm -rf OmniParser || true
git clone https://github.com/OpenAdaptAI/OmniParser
cd OmniParser
git checkout feat/deploy2
git checkout feat/deploy-deps
git pull

# Stop and remove any existing containers
sudo docker stop omniparser-container || true
sudo docker rm omniparser-container || true

# Build the Docker image
sudo nvidia-docker build -t omniparser .
sudo docker build -t omniparser .

# Run the Docker container on the specified port
sudo docker run -d -p 7861:7861 --gpus all --name omniparser-container omniparser
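
The same stop/remove/build/run sequence that the workflow executes over SSH can be reproduced locally for testing. A minimal sketch using Python's standard library (container and image names are taken from the workflow above; everything else is assumed):

```python
import subprocess

# Mirror the workflow's remote script: stop and remove any existing
# container, rebuild the image, then start it with GPU access on port 7861.
# Assumes Docker and the NVIDIA Container Toolkit are installed locally.
commands = [
    ["sudo", "docker", "stop", "omniparser-container"],
    ["sudo", "docker", "rm", "omniparser-container"],
    ["sudo", "docker", "build", "-t", "omniparser", "."],
    ["sudo", "docker", "run", "-d", "-p", "7861:7861", "--gpus", "all",
     "--name", "omniparser-container", "omniparser"],
]
for cmd in commands:
    # stop/rm are allowed to fail when no container exists yet,
    # matching the "|| true" in the workflow script.
    subprocess.run(cmd, check=(cmd[2] not in ("stop", "rm")))
```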
2 changes: 1 addition & 1 deletion .gitignore
@@ -7,4 +7,4 @@ __pycache__
.env
.env.*
venv/
*.pem
*.pem
44 changes: 17 additions & 27 deletions Dockerfile
@@ -1,42 +1,32 @@
# Dockerfile for OmniParser with GPU support and OpenGL libraries
# Dockerfile for OmniParser with GPU and OpenGL support.
#
# This Dockerfile is intended to create an environment with NVIDIA CUDA
# support and the necessary dependencies to run the OmniParser project.
# The configuration is designed to support applications that rely on
# Python 3.12, OpenCV, Hugging Face transformers, and Gradio. Additionally,
# it includes steps to pull large files from Git LFS and a script to
# convert model weights from .safetensor to .pt format. The container
# runs a Gradio server by default, exposed on port 7861.
# Base: nvidia/cuda:12.3.1-devel-ubuntu22.04
# Features:
# - Python 3.12 with Miniconda environment.
# - Git LFS for large file support.
# - Required libraries: OpenCV, Hugging Face, Gradio, OpenGL.
# - Gradio server on port 7861.
#
# Base image: nvidia/cuda:12.3.1-devel-ubuntu22.04
# 1. Build the image with CUDA support.
# ```
# sudo nvidia-docker build -t omniparser .
# ```
#
# Key features:
# - System dependencies for OpenGL to support graphical libraries.
# - Miniconda for Python 3.12, allowing for environment management.
# - Git Large File Storage (LFS) setup for handling large model files.
# - Requirement file installation, including specific versions of
# OpenCV and Hugging Face Hub.
# - Entrypoint script execution with Gradio server configuration for
# external access.
# 2. Run the Docker container with GPU access and port mapping for Gradio.
# ```bash
# sudo docker run -d -p 7861:7861 --gpus all --name omniparser-container omniparser
# ```
#
# Author: Richard Abrich ([email protected])

FROM nvidia/cuda:12.3.1-devel-ubuntu22.04

# Install system dependencies with explicit OpenGL libraries
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
git \
git-lfs \
wget \
libgl1 \
libglib2.0-0 \
libsm6 \
libxext6 \
libxrender1 \
libglu1-mesa \
libglib2.0-0 \
libsm6 \
libxrender1 \
libxext6 \
python3-opencv \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& git lfs install
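The earlier header comments mention a script that converts model weights from .safetensor to .pt format. That script is not part of this diff; a minimal sketch of such a conversion, assuming the safetensors and torch packages and placeholder file paths, could look like:

```python
import torch
from safetensors.torch import load_file

# Hypothetical conversion sketch: load a .safetensors checkpoint and
# re-save its state dict in PyTorch's native .pt format. The paths are
# placeholders, not the repository's actual weight locations.
state_dict = load_file("weights/icon_detect/model.safetensors")
torch.save(state_dict, "weights/icon_detect/model.pt")
print(f"Converted {len(state_dict)} tensors to .pt format")
```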
30 changes: 29 additions & 1 deletion README.md
@@ -12,9 +12,24 @@
**OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

## News
- [2024/10] OmniParser is the #1 trending model on huggingface model hub (starting 10/29/2024).
- [2024/10] Feel free to check out our demo on [huggingface space](https://huggingface.co/spaces/microsoft/OmniParser)! (stay tuned for OmniParser + Claude Computer Use)
- [2024/10] Both the Interactive Region Detection Model and the Icon functional description model are released! [Huggingface models](https://huggingface.co/microsoft/OmniParser)
- [2024/09] OmniParser achieves the best performance on [Windows Agent Arena](https://microsoft.github.io/WindowsAgentArena/)!

### :rocket: Docker Quick Start

Prerequisites:
- CUDA-enabled GPU
- NVIDIA Container Toolkit installed (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
```
# Build the image (requires CUDA)
sudo nvidia-docker build -t omniparser .

# Run the image
sudo docker run -d -p 7861:7861 --gpus all --name omniparser-container omniparser
```
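
Once the container is up, a quick way to confirm the Gradio server is reachable is a minimal sketch like the following, assuming the default port mapping above and the requests package:

```python
import requests

# Poll the Gradio server exposed by the container on port 7861.
# Host and port are taken from the docker run command above.
response = requests.get("http://localhost:7861", timeout=10)
print(response.status_code)  # expect 200 once the server has finished loading
```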

## Install
Install environment:
```python
@@ -23,8 +38,12 @@ conda activate omni
pip install -r requirements.txt
```

Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.
Download and convert the model ckpt files from https://huggingface.co/microsoft/OmniParser:
```python
python download.py
```
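
download.py itself is not shown in this diff; a rough sketch of the kind of download it might perform, using huggingface_hub (an assumption, not the actual script), is:

```python
from huggingface_hub import snapshot_download

# Hypothetical sketch: fetch the OmniParser checkpoints from the
# Hugging Face Hub into the weights/ folder the README expects.
snapshot_download(repo_id="microsoft/OmniParser", local_dir="weights")
```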

Or, download the model checkpoint files from https://huggingface.co/microsoft/OmniParser and put them under weights/; the default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.
Finally, convert the safetensors files to .pt format.
```python
python weights/convert_safetensor_to_pt.py
@@ -39,6 +58,15 @@ To run gradio demo, simply run:
python gradio_demo.py
```

## Deploy to AWS

To deploy OmniParser to EC2 on AWS via GitHub Actions:

1. Fork this repository and clone your fork to your local machine.
2. Follow the instructions at the top of [`deploy.py`](https://github.com/microsoft/OmniParser/blob/main/deploy.py).
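
The commit history references a get_latest_ami helper used by the deployment tooling; a minimal sketch of how such a lookup could work with boto3 (the image filters and owner are assumptions, not the actual deploy.py logic):

```python
import boto3

def get_latest_ami(name_pattern: str = "Deep Learning*Ubuntu 22.04*") -> str:
    """Return the most recently created AMI matching the name pattern.

    The pattern and owner are illustrative assumptions; the real
    deploy.py pins a specific AMI (see the commit history above).
    """
    ec2 = boto3.client("ec2")
    images = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name", "Values": [name_pattern]}],
    )["Images"]
    latest = max(images, key=lambda img: img["CreationDate"])
    return latest["ImageId"]

print(get_latest_ami())
```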

## Model Weights License
For the model checkpoints on the Hugging Face model hub, please note that the icon_detect model is under the AGPL license, inherited from the original YOLO model, while icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in each model's folder: https://huggingface.co/microsoft/OmniParser.

## 📚 Citation
Our technical report can be found [here](https://arxiv.org/abs/2408.00203).
Binary file modified __pycache__/utils.cpython-312.pyc
Binary file removed __pycache__/utils.cpython-39.pyc
76 changes: 45 additions & 31 deletions client.py
@@ -1,7 +1,7 @@
"""
This module provides a command-line interface to interact with the OmniParser Gradio server.
This module provides a command-line interface and programmatic API to interact with the OmniParser Gradio server.

Usage:
Command-line usage:
python client.py "http://<server_ip>:7861" "path/to/image.jpg"

View results:
@@ -11,6 +11,10 @@
Windows: start output_image_<timestamp>.png
Linux: xdg-open output_image_<timestamp>.png

Programmatic usage:
from omniparse.client import predict
result = predict("http://<server_ip>:7861", "path/to/image.jpg")

Result data format:
{
"label_coordinates": {
@@ -33,30 +37,31 @@
import fire
from gradio_client import Client
from loguru import logger
from PIL import Image
import base64
from io import BytesIO
import os
import shutil
import json
from datetime import datetime

def predict(server_url: str, image_path: str, box_threshold: float = 0.05, iou_threshold: float = 0.1):
# Define constants for default thresholds
DEFAULT_BOX_THRESHOLD = 0.05
DEFAULT_IOU_THRESHOLD = 0.1

def predict(server_url: str, image_path: str, box_threshold: float = DEFAULT_BOX_THRESHOLD, iou_threshold: float = DEFAULT_IOU_THRESHOLD):
"""
Makes a prediction using the OmniParser Gradio client with the provided server URL and image.

Args:
server_url (str): The URL of the OmniParser Gradio server.
image_path (str): Path to the image file to be processed.
box_threshold (float): Box threshold value (default: 0.05).
iou_threshold (float): IOU threshold value (default: 0.1).
Returns:
dict: Parsed result data containing label coordinates and parsed content list.
"""
client = Client(server_url)

# Generate a timestamp for unique file naming
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Load and encode the image
image_path = os.path.expanduser(image_path)
with open(image_path, "rb") as image_file:
encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

@@ -72,47 +77,56 @@ def predict(server_url: str, image_path: str, box_threshold: float = 0.05, iou_t
}

# Make the prediction
try:
result = client.predict(
image_input, # image input as dictionary
box_threshold, # box_threshold
iou_threshold, # iou_threshold
api_name="/process"
)
result = client.predict(
image_input,
box_threshold,
iou_threshold,
api_name="/process"
)

# Process and log the results
output_image, result_json = result

logger.info("Prediction completed successfully")
# Process and return the result
output_image, result_json = result
result_data = json.loads(result_json)

# Parse the JSON string into a Python object
result_data = json.loads(result_json)
return {"output_image": output_image, "result_data": result_data}

# Extract label_coordinates and parsed_content_list
label_coordinates = result_data['label_coordinates']
parsed_content_list = result_data['parsed_content_list']

logger.info(f"{label_coordinates=}")
logger.info(f"{parsed_content_list=}")
def predict_and_save(server_url: str, image_path: str, box_threshold: float = DEFAULT_BOX_THRESHOLD, iou_threshold: float = DEFAULT_IOU_THRESHOLD):
"""
Makes a prediction and saves the results to files, including logs and image outputs.
Args:
server_url (str): The URL of the OmniParser Gradio server.
image_path (str): Path to the image file to be processed.
box_threshold (float): Box threshold value (default: 0.05).
iou_threshold (float): IOU threshold value (default: 0.1).
"""
# Generate a timestamp for unique file naming
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Call the predict function to get prediction data
try:
result = predict(server_url, image_path, box_threshold, iou_threshold)
output_image = result["output_image"]
result_data = result["result_data"]

# Save result data to JSON file
result_data_path = f"result_data_{timestamp}.json"
with open(result_data_path, "w") as json_file:
json.dump(result_data, json_file, indent=4)
logger.info(f"Parsed content saved to: {result_data_path}")

# Save the output image
output_image_path = f"output_image_{timestamp}.png"
if isinstance(output_image, str) and os.path.exists(output_image):
shutil.copy(output_image, output_image_path)
logger.info(f"Output image saved to: {output_image_path}")
else:
logger.warning(f"Unexpected output_image format or file not found: {output_image}")

except Exception as e:
logger.error(f"An error occurred: {str(e)}")
logger.exception("Traceback:")

if __name__ == "__main__":
fire.Fire(predict)

if __name__ == "__main__":
fire.Fire(predict_and_save)
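
Building on the programmatic API described in the client.py docstring above, a short usage sketch follows; the server URL, image path, and exact result fields are assumptions based on that docstring and on the dict returned by predict():

```python
from client import predict  # the docstring also suggests: from omniparse.client import predict

# Hypothetical call against a running OmniParser server; the URL and
# image path are placeholders.
result = predict("http://localhost:7861", "screenshot.png")
data = result["result_data"]

# Per the docstring, the result carries per-label coordinates and a list
# of parsed content descriptions.
for label, coords in data["label_coordinates"].items():
    print(label, coords)
for item in data["parsed_content_list"]:
    print(item)
```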
249 changes: 114 additions & 135 deletions demo.ipynb

Large diffs are not rendered by default.
