[GNN] Adds example building dockerfile for H100s. (#737)
* adds updated Dockerfile for building

* renames Dockerfile to Dockerfile.h100, and restore old Dockerfile

* updates README

* adds a small commit to retrigger check
Elnifio authored May 16, 2024
1 parent db0558a commit 87405ce
Showing 2 changed files with 52 additions and 1 deletion.
23 changes: 23 additions & 0 deletions graph_neural_network/Dockerfile.h100
@@ -0,0 +1,23 @@
FROM nvcr.io/nvidia/pytorch:22.12-py3

WORKDIR /workspace/repository

RUN pip install scikit-learn==0.24.2
RUN pip install torch_geometric==2.4.0
RUN pip install torch_scatter==2.1.1 torch_sparse==0.6.17
RUN pip install graphlearn-torch==0.2.2

RUN apt update
RUN apt install -y git
RUN pip install git+https://github.com/mlcommons/logging.git

# TF32 instead of FP32 for faster compute
ENV NVIDIA_TF32_OVERRIDE=1

COPY . .
WORKDIR /workspace/repository

RUN git clone https://github.com/alibaba/graphlearn-for-pytorch.git
WORKDIR /workspace/repository/graphlearn-for-pytorch
RUN git checkout 910cb55
RUN git submodule update --init
30 changes: 29 additions & 1 deletion graph_neural_network/README.md
@@ -31,6 +31,34 @@ cd training/gnn_node_classification/
docker build -f Dockerfile -t training_gnn:latest .
```

##### 2.1 Building on NVIDIA H100

The official `Dockerfile` supports only NVIDIA A100 GPUs. `Dockerfile.h100` builds an image for running the GNN reference on NVIDIA H100 machines. To build the image:

```bash
cd training/graph_neural_network
docker build -f Dockerfile.h100 -t training_gnn:h100 .
```

Once the image is built, we need to run it **on an H100 machine with at least one GPU mounted in the container**:

```bash
docker run -it --rm --network=host --ipc=host --gpus all training_gnn:h100
```
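Before building GLT, it can help to confirm that the container actually sees an H100. The following is a minimal sketch, not part of the reference implementation. It assumes `nvidia-smi` is available on the container's `PATH` and that H100 devices report a product name containing the substring `H100`; the helper names are hypothetical.

```python
import subprocess
from typing import List


def gpu_names_from_csv(output: str) -> List[str]:
    """Split `nvidia-smi` CSV output (one GPU name per line) into a list."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def has_h100(names: List[str]) -> bool:
    """Assumption: an H100 reports a product name containing 'H100'."""
    return any("H100" in name for name in names)


def query_gpu_names() -> List[str]:
    """Ask nvidia-smi for the names of the GPUs visible in this container."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return gpu_names_from_csv(result.stdout)


# Inside the container: has_h100(query_gpu_names())
```

If `query_gpu_names()` raises or returns an empty list, the container was most likely started without `--gpus all` or on a host without a working NVIDIA driver.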

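The container starts in `/workspace/repository/graphlearn-for-pytorch`, the last `WORKDIR` set in `Dockerfile.h100`. From there, we follow the same build process detailed in [GraphLearn-Torch's README](https://github.com/alibaba/graphlearn-for-pytorch):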

```bash
# inside the current container image with H100 mounted:
bash install_dependencies.sh

python3 setup.py bdist_wheel
pip install dist/* --force-reinstall
```

Once the installation steps above are done, the container can be used on H100 machines. To verify, we can run `import graphlearn_torch as glt` in a Python REPL: GLT is installed correctly for H100 if the import completes without raising an error. We can then save the container for future use with `docker commit`.
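The import check can also be scripted, which is convenient when the container is set up non-interactively. This is a small sketch rather than part of the reference; `can_import` is a hypothetical helper name, and `graphlearn_torch` is the module name taken from the import statement above.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if `module_name` imports without raising, else False."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        # Covers ImportError as well as load failures in compiled extensions.
        print(f"FAILED to import {module_name}: {exc!r}")
        return False
    print(f"OK: imported {module_name}")
    return True


# Inside the container: can_import("graphlearn_torch")
```

A `False` result here usually means the wheel build or the `pip install dist/*` step did not complete, so it is worth rerunning those steps before committing the container.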

Once this is done, we should `cd /workspace/repository` and follow the same training workflow from there.

### Steps to download and verify data
Download the dataset:
@@ -167,4 +195,4 @@ This benchmark is a collaborative effort with contributions from Alibaba, Intel,

- Alibaba: Li Su, Baole Ai, Wenting Shen, Shuxian Hu, Wenyuan Yu, Yong Li
- Nvidia: Yunzhou (David) Liu, Kyle Kranen, Shriya Palasamudram
- Intel: Kaixuan Liu, Hesham Mostafa, Sasikanth Avancha, Keith Achorn, Radha Giduthuri, Deepak Canchi
