From d226d74e2a9e38ae5703d9921f6985fcc020f8dd Mon Sep 17 00:00:00 2001 From: Fei <33940270+YangFei1990@users.noreply.github.com> Date: Wed, 12 Jun 2019 13:38:18 -0700 Subject: [PATCH] Add some doc (#63) * add some doc * add some doc * minor fix * minor fix * change yaml_overlay name * minor change --- Dockerfile | 6 +--- README.md | 25 ++++++++++++-- infra/docker/README.md | 48 +++++++++++++++++++++++++++ infra/docker/docker.md | 6 ++-- infra/eks/README.md | 11 +++--- infra/eks/YAML_OVERLAY.md | 22 ++++++------ infra/eks/maskrcnn/overlays/64x4.yaml | 2 +- 7 files changed, 92 insertions(+), 28 deletions(-) create mode 100644 infra/docker/README.md diff --git a/Dockerfile b/Dockerfile index e2b1c299..dde38653 100644 --- a/Dockerfile +++ b/Dockerfile @@ -9,6 +9,7 @@ RUN pip uninstall -y tensorflow tensorboard tensorflow-estimator keras h5py horo # Download and install custom Tensorflow binary RUN wget https://github.com/armandmcqueen/tensorpack-mask-rcnn/releases/download/v0.0.0-WIP/tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl && \ pip install tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl && \ + pip install tensorflow-estimator==1.13.0 && \ rm tensorflow-1.13.0-cp36-cp36m-linux_x86_64.whl RUN pip install keras h5py @@ -39,8 +40,3 @@ RUN git clone https://github.com/armandmcqueen/tensorpack-mask-rcnn -b $BRANCH_N RUN chmod -R +w /tensorpack-mask-rcnn RUN pip install --ignore-installed -e /tensorpack-mask-rcnn/ - - - - - diff --git a/README.md b/README.md index 14797e81..fc2d8f46 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,10 @@ # Mask RCNN Performance focused implementation of Mask RCNN based on the [Tensorpack implementation](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN). - +The original paper: [Mask R-CNN](https://arxiv.org/abs/1703.06870) ### Overview -This implementation of Mask RCNN is focused on increasing training throughput without sacrificing any accuracy. 
We do this by training with a batch size > 1 per GPU using FP16 and two custom TF ops.
+This implementation of Mask RCNN is focused on increasing training throughput without sacrificing any accuracy. We do this by training with a batch size > 1 per GPU using FP16 and two custom TF ops.

### Status

@@ -19,7 +19,28 @@ A pre-built dockerfile is available in DockerHub under `armandmcqueen/tensorpack
- Running this codebase requires a custom TF binary - available under GitHub releases (custom ops and a fix for a bug introduced in TF 1.13)
- We give some details on the codebase and optimizations in `CODEBASE.md`

+### To launch training
+Training inside a container is recommended:
+- To train with docker, refer to [Docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/tree/master/infra/docker)
+- To train with Amazon EKS, refer to [EKS](https://github.com/armandmcqueen/tensorpack-mask-rcnn/tree/master/infra/eks)
+
+### Training results
+These results were obtained on P3dn.24xl instances using EKS.
+Training for 12 epochs:
+
+| Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP |
+| ------------- | ------------- | ------------- | ------------- |
+| 8x4 | 5.09h | 37.47% | 34.45% |
+| 16x4 | 3.11h | 37.41% | 34.47% |
+| 32x4 | 1.94h | 37.20% | 34.25% |
+
+Training for 24 epochs:
+
+| Num_GPUs x Images_Per_GPU | Training time | Box mAP | Mask mAP |
+| ------------- | ------------- | ------------- | ------------- |
+| 8x4 | 9.78h | 38.25% | 35.08% |
+| 16x4 | 5.60h | 38.44% | 35.18% |
+| 32x4 | 3.33h | 38.33% | 35.12% |

### Tensorpack fork point

diff --git a/infra/docker/README.md b/infra/docker/README.md
new file mode 100644
index 00000000..fcbaf1fb
--- /dev/null
+++ b/infra/docker/README.md
@@ -0,0 +1,48 @@
+# To train with docker
+
+## To run on a single node
+Refer to [Run with docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/blob/master/infra/docker/docker.md#using-docker "Run with docker")
+
+## To run on multiple nodes
+Make sure you have your data ready as in [Run with 
docker](https://github.com/armandmcqueen/tensorpack-mask-rcnn/blob/master/infra/docker/docker.md#using-docker "Run with docker").
+### SSH settings
+On all instances, modify (or create) the file ~/.ssh/config, add the lines below, and change its permissions to 400.
+```
+Host *
+    StrictHostKeyChecking no
+```
+```
+chmod 400 ~/.ssh/config
+```
+Pick one instance as the primary node and run the command below to generate an ssh key pair:
+```
+ssh-keygen -t rsa
+```
+Copy the contents of id_rsa.pub into ~/.ssh/authorized_keys on every machine, including the primary itself. This enables [passwordless ssh connections](http://www.linuxproblem.org/art_9.html) between all hosts.
+Next, make the keys available to the containers. The commands below copy your ssh setup to a shared location with root:root ownership so that the containers can talk to each other. Run on each host:
+```
+sudo mkdir -p /mnt/share/ssh
+sudo cp -r ~/.ssh/* /mnt/share/ssh
+```
+### Build the docker image and run the container
+On each of the instances:
+- `cd tensorpack-mask-rcnn`
+- build the image by running `infra/docker/build.sh`
+- run the container by running `infra/docker/run_multinode.sh`
+
+### Launch training
+Inside the container:
+- On each host *apart from the primary*, run the following in the container you started:
+```
+/usr/sbin/sshd -p 1234; sleep infinity
+```
+This makes those containers listen for ssh connections on port 1234.
+- On the primary host, `cd tensorpack-mask-rcnn/infra/docker` and create your hosts file, which contains the IPs of all your nodes (including the primary host). The format should look like:
+```
+127.0.0.1 slots=8
+127.0.0.2 slots=8
+127.0.0.3 slots=8
+127.0.0.4 slots=8
+```
+This is 4 nodes, 8 GPUs per node. 
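With many nodes, the hosts file described above can also be generated by a short script rather than written by hand. A minimal sketch (the IPs below are placeholders; substitute your instances' private IPs, and `slots` is the number of GPUs per node):

```python
# Sketch: generate an MPI-style hosts file, one "<ip> slots=<n>" line per node.
# The IPs here are placeholders -- replace with your instances' private IPs.
ips = ["127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"]
slots = 8  # GPUs per node

with open("hosts", "w") as f:
    for ip in ips:
        f.write(f"{ip} slots={slots}\n")
```

This produces the same 4-node, 8-GPUs-per-node layout shown above.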
+Launch training by running `infra/docker/run_multinode.sh 32 4` for 32 GPUs and 4 images per GPU.
diff --git a/infra/docker/docker.md b/infra/docker/docker.md
index d31a125c..5a52a5ef 100644
--- a/infra/docker/docker.md
+++ b/infra/docker/docker.md
@@ -24,11 +24,11 @@ cd docker

```
cd tensorpack-mask-rcnn
-docker/train.sh 8 250
+infra/docker/train.sh 8 1 250
```

-This is 8 GPUs, 1 img per GPU, summary writer logs every 250 steps. 
+This is 8 GPUs, 1 img per GPU, summary writer logs every 250 steps.

Logs will be exposed to the ec2 instance at ~/logs.

@@ -39,4 +39,4 @@ Logs will be exposed to the ec2 instance at ~/logs.

## Notes

-The current Dockerfile uses the wheel built for p3.16xl. The wheel built for p3dn.24xl might have a performance improvement, but it does not run on 16xl due to different available instruction sets.
\ No newline at end of file
+The current Dockerfile uses the wheel built for p3.16xl. The wheel built for p3dn.24xl might have a performance improvement, but it does not run on 16xl due to different available instruction sets.
diff --git a/infra/eks/README.md b/infra/eks/README.md
index bf7bde04..cb076482 100644
--- a/infra/eks/README.md
+++ b/infra/eks/README.md
@@ -82,7 +82,7 @@ Scale the nodegroup to the desired number of nodes. We do not have an autoscalin
- or by creating a new nodegroup based on `eksctl/additional_nodegroup.yaml`
- `eksctl create nodegroup -f eks/eksctl/p3_additional_nodegroup.yaml`

-`maskrcnn/values.yaml` holds the default training params for 1 node, 8 GPU training. To launch a training job with a different configuration, we suggest you create a new yaml file with the desired params.
+`maskrcnn/values.yaml` holds the default training params for 1 node, 8 GPU training. To launch a training job with a different configuration, we suggest you create a new yaml file with the desired params. 
To make that easier, we use the `yaml_overlay` utility, which takes in a base yaml, applies a list of changes (overlays) to it, and prints the new yaml to stdout. See `YAML_OVERLAY.md` for details.

@@ -105,16 +105,16 @@ If you need to run multiple identical jobs without naming conflict, we have the 

```
export OVERLAY_DIR=maskrcnn/overlays

-./overyaml.py maskrcnn/values.yaml 32x4 24epoch run1 > maskrcnn/values/determinism-32x4-24epoch-run1.yaml
-./overyaml.py maskrcnn/values.yaml 32x4 24epoch run2 > maskrcnn/values/determinism-32x4-24epoch-run2.yaml
+./yaml_overlay maskrcnn/values.yaml 32x4 24epoch run1 > maskrcnn/values/determinism-32x4-24epoch-run1.yaml
+./yaml_overlay maskrcnn/values.yaml 32x4 24epoch run2 > maskrcnn/values/determinism-32x4-24epoch-run2.yaml

helm install --name maskrcnn-determinism-32x4-24epoch-run1 ./maskrcnn/ -f maskrcnn/values/determinism-32x4-24epoch-run1.yaml
helm install --name maskrcnn-determinism-32x4-24epoch-run2 ./maskrcnn/ -f maskrcnn/values/determinism-32x4-24epoch-run2.yaml
```

-
-### Tensorboard
+
+### Tensorboard

`kubectl apply -f eks/tensorboard/tensorboard.yaml`

@@ -129,4 +129,3 @@ Shortcut is `./tboard.sh`

`./ssh.sh`

We use `apply-pvc-2` because it uses the tensorboard-mask-rcnn image, which has useful tools like the AWS CLI
-
diff --git a/infra/eks/YAML_OVERLAY.md b/infra/eks/YAML_OVERLAY.md
index 584ff652..67dfcfae 100644
--- a/infra/eks/YAML_OVERLAY.md
+++ b/infra/eks/YAML_OVERLAY.md
@@ -1,19 +1,19 @@
# Overyaml

-Take a base yaml file, apply a series of changes (overlays) and print out new yaml. 
+Take a base yaml file, apply a series of changes (overlays) and print out new yaml.

e.g. take base maskrcnn params and change to run 5 experiments of 24 epochs, predefined_padding=True, 32x4 GPU configuration without helm naming conflicts. Then run 5 more experiments with 32x2 GPU configuration.

-* Be able to make changes to the base yaml and have it impact all other configurations. 
-* Add a new experiment without having an exploding number of yaml files to maintain and update.
+* Be able to make changes to the base yaml and have it impact all other configurations.
+* Add a new experiment without having an exploding number of yaml files to maintain and update.

## CLI Syntax

-`./overyaml.py $BASE $OVERLAY1 $OVERLAY2 $OVERLAY3 ...`
+`./yaml_overlay $BASE $OVERLAY1 $OVERLAY2 $OVERLAY3 ...`

Takes a base yaml and applies overlays sequentially. At the end, prints new yaml out to stdout.

Overlay names should be the path to the overlay file minus '.yaml'.

-`./overyaml.py maskrcnn/values.yaml maskrcnn/overlays/24epoch maskrcnn/overlays/32x4`
+`./yaml_overlay maskrcnn/values.yaml maskrcnn/overlays/24epoch maskrcnn/overlays/32x4`

## Overlay folder

@@ -21,12 +21,12 @@ You can keep all your overlays in a single folder and then pass in an `overlay_d

```
export OVERLAY_DIR=maskrcnn/overlays
-./overyaml.py maskrcnn/values.yaml 24epoch 32x4
+./yaml_overlay maskrcnn/values.yaml 24epoch 32x4
```

## Overlay syntax

-An overlay is a yaml file containing two sets of changes - changes where you want to `set` a new value for a field and changes where you want to `append` a postfix to the existing value. 
+An overlay is a yaml file containing two sets of changes: changes where you want to `set` a new value for a field, and changes where you want to `append` a postfix to the existing value.

```
set:

append:

```

Both `set` and `append` are optional.

-Changes are represented as a copy of the original object with unchanged fields ommitted and each changed field holding the new value or the postfix as the field's value. See example below. 
+Changes are represented as a copy of the original object with unchanged fields omitted and each changed field holding the new value or the postfix as the field's value. See example below.

-## Example
+## Example

**base.yaml**

@@ -65,7 +65,7 @@ append:

-###`$ ./overyaml.py base.yaml overlay > output.yaml`
+### `$ ./yaml_overlay base.yaml overlay > output.yaml`

**output.yaml**

@@ -73,4 +73,4 @@ append:
 someScope:
   someField: "new_value"
   someOtherField: "my_name_new_postfix"
-```
\ No newline at end of file
+```
diff --git a/infra/eks/maskrcnn/overlays/64x4.yaml b/infra/eks/maskrcnn/overlays/64x4.yaml
index a28d51f4..9854b09b 100644
--- a/infra/eks/maskrcnn/overlays/64x4.yaml
+++ b/infra/eks/maskrcnn/overlays/64x4.yaml
@@ -2,8 +2,8 @@ set:
   maskrcnn:
     gpus: 64
     batch_size_per_gpu: 4
+    gradient_clip: 1.5

 append:
   global:
     name: -64x4
-
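The `set`/`append` overlay semantics described in YAML_OVERLAY.md can be sketched in plain Python. This is a minimal illustration operating on already-parsed dicts, not the actual `yaml_overlay` implementation, which may differ:

```python
import copy

def apply_overlay(base, overlay):
    """Apply one overlay (parsed into dicts) to a base config.

    "set" replaces a field's value; "append" concatenates a postfix
    onto the field's existing string value. The base is not mutated.
    """
    out = copy.deepcopy(base)

    def walk(target, changes, mode):
        for key, val in changes.items():
            if isinstance(val, dict):
                # Recurse into nested scopes, creating them if absent.
                walk(target.setdefault(key, {}), val, mode)
            elif mode == "set":
                target[key] = val
            else:  # mode == "append"
                target[key] = str(target.get(key, "")) + str(val)

    walk(out, overlay.get("set", {}), "set")
    walk(out, overlay.get("append", {}), "append")
    return out

# Mirrors the base.yaml / overlay example in YAML_OVERLAY.md.
base = {"someScope": {"someField": "value", "someOtherField": "my_name"}}
overlay = {
    "set": {"someScope": {"someField": "new_value"}},
    "append": {"someScope": {"someOtherField": "_new_postfix"}},
}
print(apply_overlay(base, overlay))
```

Applying overlays sequentially, as the CLI does, is then just folding `apply_overlay` over the overlay list.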