Add some doc (armandmcqueen#63)
* add some doc

* add some doc

* minor fix

* minor fix

* change yaml_overlay name

* minor change
YangFei1990 authored Jun 12, 2019
1 parent 5fb96b1 commit d226d74
Showing 7 changed files with 92 additions and 28 deletions.
6 changes: 1 addition & 5 deletions Dockerfile
@@ -9,6 +9,7 @@ RUN pip uninstall -y tensorflow tensorboard tensorflow-estimator keras h5py horo
# Download and install custom Tensorflow binary
RUN wget https://github.com/armandmcqueen/tensorpack-mask-rcnn/releases/download/v0.0.0-WIP/tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl && \
pip install tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl && \
pip install tensorflow-estimator==1.13.0 && \
rm tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl

RUN pip install keras h5py
@@ -39,8 +40,3 @@ RUN git clone https://github.com/armandmcqueen/tensorpack-mask-rcnn -b $BRANCH_N

RUN chmod -R +w /tensorpack-mask-rcnn
RUN pip install --ignore-installed -e /tensorpack-mask-rcnn/





25 changes: 23 additions & 2 deletions README.md
@@ -1,10 +1,10 @@
# Mask RCNN

Performance focused implementation of Mask RCNN based on the [Tensorpack implementation](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN).

The original paper: [Mask R-CNN](https://arxiv.org/abs/1703.06870)
### Overview

This implementation of Mask RCNN is focused on increasing training throughput without sacrificing any accuracy. We do this by training with a batch size > 1 per GPU using FP16 and two custom TF ops.

### Status

@@ -19,7 +19,28 @@ A pre-built dockerfile is available in DockerHub under `armandmcqueen/tensorpack
- Running this codebase requires a custom TF binary - available under GitHub releases (custom ops and a fix for a bug introduced in TF 1.13)
- We give some details on the codebase and its optimizations in `CODEBASE.md`

### To launch training
Training inside a container is recommended:
- To train with docker, refer to [Docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/tree/master/infra/docker)
- To train with Amazon EKS, refer to [EKS](https://github.com/armandmcqueen/tensorpack-mask-rcnn/tree/master/infra/eks)

### Training results
These results were obtained on P3dn.24xl instances using EKS.
12-epoch training:

| Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP |
| ------------- | ------------- | ------------- | ------------- |
| 8x4 | 5.09h | 37.47% | 34.45% |
| 16x4 | 3.11h | 37.41% | 34.47% |
| 32x4 | 1.94h | 37.20% | 34.25% |

24-epoch training:

| Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP |
| ------------- | ------------- | ------------- | ------------- |
| 8x4 | 9.78h | 38.25% | 35.08% |
| 16x4 | 5.60h | 38.44% | 35.18% |
| 32x4 | 3.33h | 38.33% | 35.12% |
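As a rough sanity check on the tables above, scaling efficiency (the fraction of ideal linear speedup achieved) can be computed from the wall-clock times. This is illustrative arithmetic on the published numbers, not code from the repo:

```python
# Scaling-efficiency arithmetic for the 24-epoch runs in the table above.
# GPU counts and hours come from the table; the formula (ideal time /
# actual time) is a standard definition, not part of this codebase.

def scaling_efficiency(base_gpus, base_hours, gpus, hours):
    """Fraction of ideal linear speedup achieved relative to the base run."""
    ideal_hours = base_hours * base_gpus / gpus
    return ideal_hours / hours

eff_16 = scaling_efficiency(8, 9.78, 16, 5.60)  # ~0.873
eff_32 = scaling_efficiency(8, 9.78, 32, 3.33)  # ~0.734

print(f"16 GPUs: {eff_16:.1%} of linear scaling")
print(f"32 GPUs: {eff_32:.1%} of linear scaling")
```

So going from 8 to 32 GPUs at 24 epochs retains roughly 73% of ideal linear scaling while leaving mAP essentially unchanged.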

### Tensorpack fork point

48 changes: 48 additions & 0 deletions infra/docker/README.md
@@ -0,0 +1,48 @@
# To train with docker

## To run on a single node
Refer to [Run with docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/blob/master/infra/docker/docker.md#using-docker "Run with docker")

## To run on multiple nodes
Make sure you have your data ready as described in [Run with docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/blob/master/infra/docker/docker.md#using-docker "Run with docker").
### SSH settings
Modify (or create) the file `~/.ssh/config` on all instances, add the lines below, and change its permissions to 400:
```
Host *
StrictHostKeyChecking no
```
```
chmod 400 ~/.ssh/config
```
Pick one instance as the primary node and run the command below to generate an SSH key pair:
```
ssh-keygen -t rsa
```
Copy the contents of `id_rsa.pub` into `~/.ssh/authorized_keys` on every machine, including the primary node itself. This enables [passwordless SSH connections](http://www.linuxproblem.org/art_9.html) to all hosts.
Next, set up the SSH keys for the containers. These commands copy your key pair to a shared location (owned by root) so that the containers can talk to each other. Run on each host:
```
sudo mkdir -p /mnt/share/ssh
sudo cp -r ~/.ssh/* /mnt/share/ssh
```
### Build docker image and run container
On each instance:
- `cd tensorpack-mask-rcnn`
- build the image by running `infra/docker/build.sh`
- run the container by running `infra/docker/run_multinode.sh`

### Launch training
Inside the container:
- On each host *apart from the primary*, run the following in the container you started:
```
/usr/sbin/sshd -p 1234; sleep infinity
```
This makes those containers listen for SSH connections on port 1234.
- On the primary host, `cd tensorpack-mask-rcnn/infra/docker` and create your hosts file, which contains the IPs of all your nodes (including the primary host). The format looks like:
```
127.0.0.1 slots=8
127.0.0.2 slots=8
127.0.0.3 slots=8
127.0.0.4 slots=8
```
This example describes 4 nodes with 8 GPUs per node.
Launch training by running `infra/docker/run_multinode.sh 32 4` for 32 GPUs and 4 images per GPU.
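The GPU count passed on the command line should match the total slots in the hosts file. A small hypothetical helper (not part of the repo) that sanity-checks this, assuming the `ip slots=N` format shown above:

```python
# Hypothetical helper: sum the slots in an MPI-style hosts file so the
# total can be checked against the GPU count passed to run_multinode.sh.
# The "ip slots=N" line format matches the README example; the helper
# itself is illustrative, not part of this codebase.

def total_slots(hostfile_text):
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # each line looks like "127.0.0.1 slots=8"
        _, slots_field = line.split()
        total += int(slots_field.split("=")[1])
    return total

hosts = """\
127.0.0.1 slots=8
127.0.0.2 slots=8
127.0.0.3 slots=8
127.0.0.4 slots=8
"""
print(total_slots(hosts))  # 32
```

With this hosts file, `32` is the correct first argument to the launch script.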
6 changes: 3 additions & 3 deletions infra/docker/docker.md
@@ -24,11 +24,11 @@ cd docker

```
cd tensorpack-mask-rcnn
docker/train.sh 8 250
infra/docker/train.sh 8 1 250
```


This is 8 GPUs, 1 img per GPU, summary writer logs every 250 steps.

Logs will be exposed to the ec2 instance at ~/logs.

@@ -39,4 +39,4 @@ Logs will be exposed to the ec2 instance at ~/logs.

## Notes

The current Dockerfile uses the wheel built for p3.16xl. The wheel built for p3dn.24xl might have a performance improvement, but it does not run on 16xl due to different available instruction sets.
11 changes: 5 additions & 6 deletions infra/eks/README.md
@@ -82,7 +82,7 @@ Scale the nodegroup to the desired number of nodes. We do not have an autoscalin
- or by creating a new nodegroup based on `eksctl/additional_nodegroup.yaml`
- `eksctl create nodegroup -f eks/eksctl/p3_additional_nodegroup.yaml`

`maskrcnn/values.yaml` holds the default training params for 1 node, 8 GPU training. To launch a training job with a different configuration, we suggest you create a new yaml file with the desired params.

To make that easier, we use the `overyaml.py` utility, which takes in a base yaml, applies a list of changes (overlays) to it, and prints the new yaml to stdout. See `overyaml.md` for details.

@@ -105,16 +105,16 @@ If you need to run multiple identical jobs without naming conflict, we have the

```
export OVERLAY_DIR=maskrcnn/overlays
./overyaml.py maskrcnn/values.yaml 32x4 24epoch run1 > maskrcnn/values/determinism-32x4-24epoch-run1.yaml
./overyaml.py maskrcnn/values.yaml 32x4 24epoch run2 > maskrcnn/values/determinism-32x4-24epoch-run2.yaml
./yaml_overlay maskrcnn/values.yaml 32x4 24epoch run1 > maskrcnn/values/determinism-32x4-24epoch-run1.yaml
./yaml_overlay maskrcnn/values.yaml 32x4 24epoch run2 > maskrcnn/values/determinism-32x4-24epoch-run2.yaml
helm install --name maskrcnn-determinism-32x4-24epoch-run1 ./maskrcnn/ -f maskrcnn/values/determinism-32x4-24epoch-run1.yaml
helm install --name maskrcnn-determinism-32x4-24epoch-run2 ./maskrcnn/ -f maskrcnn/values/determinism-32x4-24epoch-run2.yaml
```


### Tensorboard

`kubectl apply -f eks/tensorboard/tensorboard.yaml`

@@ -129,4 +129,3 @@ Shortcut is `./tboard.sh`
`./ssh.sh`

We use `apply-pvc-2` because it uses the tensorboard-mask-rcnn image, which has useful tools like the AWS CLI.

22 changes: 11 additions & 11 deletions infra/eks/YAML_OVERLAY.md
@@ -1,32 +1,32 @@
# Overyaml

Take a base yaml file, apply a series of changes (overlays), and print out the new yaml.

e.g. take the base maskrcnn params and change them to run 5 experiments of 24 epochs, predefined_padding=True, and a 32x4 GPU configuration without helm naming conflicts. Then run 5 more experiments with a 32x2 GPU configuration.

* Be able to make changes to the base yaml and have it impact all other configurations.
* Add a new experiment without having an exploding number of yaml files to maintain and update.

## CLI Syntax

`./overyaml.py $BASE $OVERLAY1 $OVERLAY2 $OVERLAY3 ...`
`./yaml_overlay $BASE $OVERLAY1 $OVERLAY2 $OVERLAY3 ...`

Takes a base yaml and applies overlays sequentially. At the end, prints the new yaml to stdout. Overlay names should be the path to the overlay file minus `.yaml`.

`./overyaml.py maskrcnn/values.yaml maskrcnn/overlays/24epoch maskrcnn/overlays/32x4`
`./yaml_overlay maskrcnn/values.yaml maskrcnn/overlays/24epoch maskrcnn/overlays/32x4`

## Overlay folder

You can keep all your overlays in a single folder and then pass in an `overlay_dir` either through the `--overlay_dir` flag or through the `OVERLAY_DIR` environment variable.

```
export OVERLAY_DIR=maskrcnn/overlays
./overyaml.py maskrcnn/values.yaml 24epoch 32x4
./yaml_overlay maskrcnn/values.yaml 24epoch 32x4
```

## Overlay syntax

An overlay is a yaml file containing two sets of changes - changes where you want to `set` a new value for a field and changes where you want to `append` a postfix to the existing value.

```
set:
@@ -39,10 +39,10 @@ append:

Both `set` and `append` are optional.

Changes are represented as a copy of the original object with unchanged fields omitted and each changed field holding the new value or the postfix as the field's value. See the example below.


## Example

**base.yaml**

@@ -65,12 +65,12 @@



### `$ ./overyaml.py base.yaml overlay > output.yaml`
### `$ ./yaml_overlay base.yaml overlay > output.yaml`


**output.yaml**
```
someScope:
someField: "new_value"
someOtherField: "my_name_new_postfix"
```
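The example above can be sketched as a recursive merge over plain dicts. This is a hedged illustration of the documented `set`/`append` semantics, not the actual yaml_overlay implementation (which works on parsed yaml files):

```python
# Sketch of the set/append overlay semantics described above, applied to
# plain dicts instead of parsed yaml files. Illustrative only - not the
# real yaml_overlay code.

def apply_overlay(base, overlay):
    result = dict(base)
    _merge(result, overlay.get("set", {}), "set")
    _merge(result, overlay.get("append", {}), "append")
    return result

def _merge(node, changes, op):
    for key, value in changes.items():
        if isinstance(value, dict):
            # copy the nested scope so the base dict is left untouched
            node[key] = dict(node.get(key, {}))
            _merge(node[key], value, op)
        elif op == "set":
            node[key] = value  # replace the field with the new value
        else:
            node[key] = node.get(key, "") + value  # append the postfix

base = {"someScope": {"someField": "old_value",
                      "someOtherField": "my_name"}}
overlay = {"set": {"someScope": {"someField": "new_value"}},
           "append": {"someScope": {"someOtherField": "_new_postfix"}}}

print(apply_overlay(base, overlay))
# {'someScope': {'someField': 'new_value',
#                'someOtherField': 'my_name_new_postfix'}}
```

The merged result matches **output.yaml** above: `someField` is replaced, while `someOtherField` keeps its original value with the postfix appended.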
2 changes: 1 addition & 1 deletion infra/eks/maskrcnn/overlays/64x4.yaml
@@ -2,8 +2,8 @@ set:
maskrcnn:
gpus: 64
batch_size_per_gpu: 4
gradient_clip: 1.5

append:
global:
name: -64x4
