From d226d74e2a9e38ae5703d9921f6985fcc020f8dd Mon Sep 17 00:00:00 2001 From: Fei <33940270+YangFei1990@users.noreply.github.com> Date: Wed, 12 Jun 2019 13:38:18 -0700 Subject: [PATCH] Add some doc (#63) * add some doc * add some doc * minor fix * minor fix * change yaml_overlay name * minor change --- Dockerfile | 6 +--- README.md | 25 ++++++++++++-- infra/docker/README.md | 48 +++++++++++++++++++++++++++ infra/docker/docker.md | 6 ++-- infra/eks/README.md | 11 +++--- infra/eks/YAML_OVERLAY.md | 22 ++++++------ infra/eks/maskrcnn/overlays/64x4.yaml | 2 +- 7 files changed, 92 insertions(+), 28 deletions(-) create mode 100644 infra/docker/README.md diff --git a/Dockerfile b/Dockerfile index e2b1c299..dde38653 100644 --- a/Dockerfile +++ b/Dockerfile @@ -9,6 +9,7 @@ RUN pip uninstall -y tensorflow tensorboard tensorflow-estimator keras h5py horo # Download and install custom Tensorflow binary RUN wget https://github.com/armandmcqueen/tensorpack-mask-rcnn/releases/download/v0.0.0-WIP/tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl && \ pip install tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl && \ + pip install tensorflow-estimator==1.13.0 && \ rm tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl RUN pip install keras h5py @@ -39,8 +40,3 @@ RUN git clone https://github.com/armandmcqueen/tensorpack-mask-rcnn -b $BRANCH_N RUN chmod -R +w /tensorpack-mask-rcnn RUN pip install --ignore-installed -e /tensorpack-mask-rcnn/ - - - - - diff --git a/README.md b/README.md index 14797e81..fc2d8f46 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,10 @@ # Mask RCNN Performance focused implementation of Mask RCNN based on the [Tensorpack implementation](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN). - +The original paper: [Mask R-CNN](https://arxiv.org/abs/1703.06870) ### Overview -This implementation of Mask RCNN is focused on increasing training throughput without sacrificing any accuracy. 
We do this by training with a batch size > 1 per GPU using FP16 and two custom TF ops.
+This implementation of Mask RCNN is focused on increasing training throughput without sacrificing any accuracy. We do this by training with a batch size > 1 per GPU using FP16 and two custom TF ops.

### Status

@@ -19,7 +19,28 @@ A pre-built dockerfile is available in DockerHub under `armandmcqueen/tensorpack
- Running this codebase requires a custom TF binary - available under GitHub releases (custom ops and a fix for a bug introduced in TF 1.13)
- We give some details on the codebase and optimizations in `CODEBASE.md`

+### To launch training
+Training inside a container is recommended:
+- To train with docker, refer to [Docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/tree/master/infra/docker)
+- To train with Amazon EKS, refer to [EKS](https://github.com/armandmcqueen/tensorpack-mask-rcnn/tree/master/infra/eks)
+
+### Training results
+These results were obtained on P3dn.24xl instances using EKS.
+Training for 12 epochs:
+
+| Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP |
+| ------------- | ------------- | ------------- | ------------- |
+| 8x4 | 5.09h | 37.47% | 34.45% |
+| 16x4 | 3.11h | 37.41% | 34.47% |
+| 32x4 | 1.94h | 37.20% | 34.25% |
+
+Training for 24 epochs:
+
+| Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP |
+| ------------- | ------------- | ------------- | ------------- |
+| 8x4 | 9.78h | 38.25% | 35.08% |
+| 16x4 | 5.60h | 38.44% | 35.18% |
+| 32x4 | 3.33h | 38.33% | 35.12% |

### Tensorpack fork point

diff --git a/infra/docker/README.md b/infra/docker/README.md
new file mode 100644
index 00000000..fcbaf1fb
--- /dev/null
+++ b/infra/docker/README.md
@@ -0,0 +1,48 @@
+# To train with docker
+
+## To run on a single node
+Refer to [Run with docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/blob/master/infra/docker/docker.md#using-docker "Run with docker")
+
+## To run on multiple nodes
+Make sure you have your data ready as in [Run with 
docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/blob/master/infra/docker/docker.md#using-docker "Run with docker").
+### SSH settings
+On all instances, modify (or create) the file ~/.ssh/config, add the lines below, and change its permissions to 400.
+```
+Host *
+    StrictHostKeyChecking no
+```
+```
+chmod 400 ~/.ssh/config
+```
+Pick one instance as the primary node and run the command below to generate an ssh key pair:
+```
+ssh-keygen -t rsa
+```
+Copy the contents of id_rsa.pub into ~/.ssh/authorized_keys on every machine, including the primary itself. This enables [passwordless ssh connections](http://www.linuxproblem.org/art_9.html) between all hosts.
+Next, make the keys available to the containers. The commands below copy your ssh setup to a shared location with root:root ownership so that the containers can talk to each other. Run on each host:
+```
+sudo mkdir -p /mnt/share/ssh
+sudo cp -r ~/.ssh/* /mnt/share/ssh
+```
+### Build the docker image and run the container
+On each of the instances:
+- `cd tensorpack-mask-rcnn`
+- build the image by running `infra/docker/build.sh`
+- run the container by running `infra/docker/run_multinode.sh`
+
+### Launch training
+Inside the container:
+- On each host *apart from the primary*, run the following in the container you started:
+```
+/usr/sbin/sshd -p 1234; sleep infinity
+```
+This makes those containers listen for ssh connections on port 1234.
+- On the primary host, `cd tensorpack-mask-rcnn/infra/docker` and create your hosts file, which contains the IPs of all your nodes (including the primary host). The format should look like:
+```
+127.0.0.1 slots=8
+127.0.0.2 slots=8
+127.0.0.3 slots=8
+127.0.0.4 slots=8
+```
+This is 4 nodes, 8 GPUs per node. 
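With many nodes, the hosts file described above can also be generated by a short script rather than written by hand. A minimal sketch (the IPs below are placeholders; substitute your instances' private IPs, and `slots` is the number of GPUs per node):

```python
# Sketch: generate an MPI-style hosts file, one "<ip> slots=<n>" line per node.
# The IPs here are placeholders -- replace with your instances' private IPs.
ips = ["127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"]
slots = 8  # GPUs per node

with open("hosts", "w") as f:
    for ip in ips:
        f.write(f"{ip} slots={slots}\n")
```

This produces the same 4-node, 8-GPUs-per-node layout shown above.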
+Launch training by running `infra/docker/run_multinode.sh 32 4` for 32 GPUs and 4 images per GPU.
diff --git a/infra/docker/docker.md b/infra/docker/docker.md
index d31a125c..5a52a5ef 100644
--- a/infra/docker/docker.md
+++ b/infra/docker/docker.md
@@ -24,11 +24,11 @@ cd docker

```
cd tensorpack-mask-rcnn
-docker/train.sh 8 250
+infra/docker/train.sh 8 1 250
```

-This is 8 GPUs, 1 img per GPU, summary writer logs every 250 steps. 
+This is 8 GPUs, 1 img per GPU, summary writer logs every 250 steps.

Logs will be exposed to the ec2 instance at ~/logs.

@@ -39,4 +39,4 @@ Logs will be exposed to the ec2 instance at ~/logs.

## Notes

-The current Dockerfile uses the wheel built for p3.16xl. The wheel built for p3dn.24xl might have a performance improvement, but it does not run on 16xl due to different available instruction sets.
\ No newline at end of file
+The current Dockerfile uses the wheel built for p3.16xl. The wheel built for p3dn.24xl might have a performance improvement, but it does not run on 16xl due to different available instruction sets.
diff --git a/infra/eks/README.md b/infra/eks/README.md
index bf7bde04..cb076482 100644
--- a/infra/eks/README.md
+++ b/infra/eks/README.md
@@ -82,7 +82,7 @@ Scale the nodegroup to the desired number of nodes. We do not have an autoscalin
- or by creating a new nodegroup based on `eksctl/additional_nodegroup.yaml`
- `eksctl create nodegroup -f eks/eksctl/p3_additional_nodegroup.yaml`

-`maskrcnn/values.yaml` holds the default training params for 1 node, 8 GPU training. To launch a training job with a different configuration, we suggest you create a new yaml file with the desired params.
+`maskrcnn/values.yaml` holds the default training params for 1 node, 8 GPU training. To launch a training job with a different configuration, we suggest you create a new yaml file with the desired params. 
To make that easier, we use the `yaml_overlay` utility, which takes in a base yaml, applies a list of changes (overlays) to it, and prints the new yaml to stdout. See `YAML_OVERLAY.md` for details.

@@ -105,16 +105,16 @@ If you need to run multiple identical jobs without naming conflict, we have the 

```
export OVERLAY_DIR=maskrcnn/overlays

-./overyaml.py maskrcnn/values.yaml 32x4 24epoch run1 > maskrcnn/values/determinism-32x4-24epoch-run1.yaml
-./overyaml.py maskrcnn/values.yaml 32x4 24epoch run2 > maskrcnn/values/determinism-32x4-24epoch-run2.yaml
+./yaml_overlay maskrcnn/values.yaml 32x4 24epoch run1 > maskrcnn/values/determinism-32x4-24epoch-run1.yaml
+./yaml_overlay maskrcnn/values.yaml 32x4 24epoch run2 > maskrcnn/values/determinism-32x4-24epoch-run2.yaml

helm install --name maskrcnn-determinism-32x4-24epoch-run1 ./maskrcnn/ -f maskrcnn/values/determinism-32x4-24epoch-run1.yaml
helm install --name maskrcnn-determinism-32x4-24epoch-run2 ./maskrcnn/ -f maskrcnn/values/determinism-32x4-24epoch-run2.yaml
```

-
-### Tensorboard
+
+### Tensorboard

`kubectl apply -f eks/tensorboard/tensorboard.yaml`

@@ -129,4 +129,3 @@ Shortcut is `./tboard.sh`

`./ssh.sh`

We use `apply-pvc-2` because it uses the tensorboard-mask-rcnn image, which has useful tools like the AWS CLI
-
diff --git a/infra/eks/YAML_OVERLAY.md b/infra/eks/YAML_OVERLAY.md
index 584ff652..67dfcfae 100644
--- a/infra/eks/YAML_OVERLAY.md
+++ b/infra/eks/YAML_OVERLAY.md
@@ -1,19 +1,19 @@
# Overyaml

-Take a base yaml file, apply a series of changes (overlays) and print out new yaml. 
+Take a base yaml file, apply a series of changes (overlays) and print out new yaml.

e.g. take base maskrcnn params and change to run 5 experiments of 24 epochs, predefined_padding=True, 32x4 GPU configuration without helm naming conflicts. Then run 5 more experiments with 32x2 GPU configuration.

-* Be able to make changes to the base yaml and have it impact all other configurations. 
-* Add a new experiment without having an exploding number of yaml files to maintain and update.
+* Be able to make changes to the base yaml and have it impact all other configurations.
+* Add a new experiment without having an exploding number of yaml files to maintain and update.

## CLI Syntax

-`./overyaml.py $BASE $OVERLAY1 $OVERLAY2 $OVERLAY3 ...`
+`./yaml_overlay $BASE $OVERLAY1 $OVERLAY2 $OVERLAY3 ...`

Takes a base yaml and applies overlays sequentially. At the end, prints new yaml out to stdout.

Overlay names should be the path to the overlay file minus '.yaml'.

-`./overyaml.py maskrcnn/values.yaml maskrcnn/overlays/24epoch maskrcnn/overlays/32x4`
+`./yaml_overlay maskrcnn/values.yaml maskrcnn/overlays/24epoch maskrcnn/overlays/32x4`

## Overlay folder

@@ -21,12 +21,12 @@ You can keep all your overlays in a single folder and then pass in an `overlay_d

```
export OVERLAY_DIR=maskrcnn/overlays
-./overyaml.py maskrcnn/values.yaml 24epoch 32x4
+./yaml_overlay maskrcnn/values.yaml 24epoch 32x4
```

## Overlay syntax

-An overlay is a yaml file containing two sets of changes - changes where you want to `set` a new value for a field and changes where you want to `append` a postfix to the existing value. 
+An overlay is a yaml file containing two sets of changes: changes where you want to `set` a new value for a field, and changes where you want to `append` a postfix to the existing value.

```
set:

append:

```

Both `set` and `append` are optional.

-Changes are represented as a copy of the original object with unchanged fields ommitted and each changed field holding the new value or the postfix as the field's value. See example below. 
+Changes are represented as a copy of the original object with unchanged fields omitted and each changed field holding the new value or the postfix as the field's value. See example below.

-## Example
+## Example

**base.yaml**

@@ -65,7 +65,7 @@ append:

-###`$ ./overyaml.py base.yaml overlay > output.yaml`
+### `$ ./yaml_overlay base.yaml overlay > output.yaml`

**output.yaml**

@@ -73,4 +73,4 @@ append:
 someScope:
   someField: "new_value"
   someOtherField: "my_name_new_postfix"
-```
\ No newline at end of file
+```
diff --git a/infra/eks/maskrcnn/overlays/64x4.yaml b/infra/eks/maskrcnn/overlays/64x4.yaml
index a28d51f4..9854b09b 100644
--- a/infra/eks/maskrcnn/overlays/64x4.yaml
+++ b/infra/eks/maskrcnn/overlays/64x4.yaml
@@ -2,8 +2,8 @@ set:
   maskrcnn:
     gpus: 64
     batch_size_per_gpu: 4
+    gradient_clip: 1.5

 append:
   global:
     name: -64x4
-
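The `set`/`append` overlay semantics described in YAML_OVERLAY.md can be sketched in plain Python. This is a minimal illustration operating on already-parsed dicts, not the actual `yaml_overlay` implementation, which may differ:

```python
import copy

def apply_overlay(base, overlay):
    """Apply one overlay (parsed into dicts) to a base config.

    "set" replaces a field's value; "append" concatenates a postfix
    onto the field's existing string value. The base is not mutated.
    """
    out = copy.deepcopy(base)

    def walk(target, changes, mode):
        for key, val in changes.items():
            if isinstance(val, dict):
                # Recurse into nested scopes, creating them if absent.
                walk(target.setdefault(key, {}), val, mode)
            elif mode == "set":
                target[key] = val
            else:  # mode == "append"
                target[key] = str(target.get(key, "")) + str(val)

    walk(out, overlay.get("set", {}), "set")
    walk(out, overlay.get("append", {}), "append")
    return out

# Mirrors the base.yaml / overlay example in YAML_OVERLAY.md.
base = {"someScope": {"someField": "value", "someOtherField": "my_name"}}
overlay = {
    "set": {"someScope": {"someField": "new_value"}},
    "append": {"someScope": {"someOtherField": "_new_postfix"}},
}
print(apply_overlay(base, overlay))
```

Applying overlays sequentially, as the CLI does, is then just folding `apply_overlay` over the overlay list.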