
Project 1: Kaixiang Miao #16


Open · wants to merge 12 commits into `master`
85 changes: 80 additions & 5 deletions README.md
@@ -1,10 +1,85 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Kaixiang Miao
* Tested on: Windows 7, i7-3630QM @ 2.40GHz 8GB, GTX 660M 2GB (Lenovo Y580 laptop, personal computer)

### (TODO: Your README)
## Screenshot

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
___


![](./images/CoherentBoidsFlocking.gif)

>*Because I recorded this with the GIF tool [LICEcap](http://www.cockos.com/licecap/), the FPS shown here is much lower than in the actual run.*

The `.gif` above shows my implementation of **Coherent Boids Flocking**. The parameters are listed below; a sketch of the corresponding constants follows the list:

* *N\_FOR_VIS* : `10000`
* *DT* : `0.2f`
* *blockSize* : `128`
* *rule1Distance* : `5.0f`
* *rule2Distance* : `3.0f`
* *rule3Distance* : `5.0f`
* *rule1Scale* : `0.01f`
* *rule2Scale* : `0.1f`
* *rule3Scale* : `0.1f`
* *maxSpeed* : `1.0f`
* *scene_scale* : `100.0f`
* *width * height* : `1280 * 720`
* *pointSize* : `2.0f`
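
For reference, these values map onto compile-time constants in the project sources; the following is only a minimal sketch of how they might be defined (names copied from the list above, so treat the exact definitions and their locations as assumptions):

```cpp
// Flocking parameters used for the capture above (sketch; the actual
// definitions live in the project's kernel/main sources and may differ in form).
#define blockSize      128
#define rule1Distance  5.0f   // cohesion radius
#define rule2Distance  3.0f   // separation radius
#define rule3Distance  5.0f   // alignment radius
#define rule1Scale     0.01f
#define rule2Scale     0.1f
#define rule3Scale     0.1f
#define maxSpeed       1.0f

const int   N_FOR_VIS   = 10000;
const float DT          = 0.2f;
const float scene_scale = 100.0f;
const float pointSize   = 2.0f;
const int   width = 1280, height = 720;
```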

## Performance Analysis

___

### Basic Analysis

The performance of **Naive Boids**, **Uniform Boids**, and **Coherent Uniform Boids** is measured in FPS for several different boid counts. The results are shown below.
![](./images/table1.jpg)

![](./images/table2.jpg)

![](./images/table3.jpg)

In conclusion, the performance ranking is **Coherent Uniform Boids > Uniform Boids > Naive Boids**. In addition, as the number of boids increases, the performance of all three implementations degrades.

### Questions

* *For each implementation, how does changing the number of boids affect performance? Why do you think this is?*

As the number of boids increases, performance drops in every implementation. A minor contributing factor is that each implementation launches one GPU thread per boid, so the number of threads scales linearly with the boid count; this adds some execution time, but it should not be the major cause of the slowdown.
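
To illustrate that point, here is a minimal sketch of the launch arithmetic (variable names are illustrative, not necessarily the exact ones in my code): one thread is launched per boid, so the block count grows linearly with the boid count.

```cpp
// One thread per boid: with blockSize = 128, the block count scales linearly with N.
dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
//   numObjects =  10000  ->   79 blocks (  79 * 128 =  10112 threads)
//   numObjects = 100000  ->  782 blocks ( 782 * 128 = 100096 threads)
```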

For **Naive Boids**, let *n* be the number of boids. Without parallelization, the velocity-update algorithm is *O(n^2)*: we iterate over every boid [**step 1**], and for each boid we iterate over all the other boids to compute distances [**step 2**]. Fortunately, CUDA parallelizes the first part down to *O(1)* by assigning one thread per boid, so the remaining complexity is *O(n)* per thread.
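
A minimal sketch of the naive kernel, showing only rule 1 for brevity (the kernel name and exact signature are illustrative, not the base code's): **step 1** becomes one thread per boid, while **step 2** remains an *O(n)* loop inside each thread.

```cpp
#include <glm/glm.hpp>

// Naive velocity update: each thread handles one boid and scans all N boids.
__global__ void kernUpdateVelocityNaive(int N, const glm::vec3 *pos,
                                        const glm::vec3 *vel, glm::vec3 *velNew) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;                          // step 1: parallelized, one thread per boid

  glm::vec3 perceivedCenter(0.0f);
  int neighborCount = 0;
  for (int j = 0; j < N; ++j) {                // step 2: still O(n) per thread
    if (j == i) continue;
    if (glm::distance(pos[j], pos[i]) < rule1Distance) {
      perceivedCenter += pos[j];
      ++neighborCount;
    }
  }

  glm::vec3 dv(0.0f);
  if (neighborCount > 0) {
    dv = (perceivedCenter / (float)neighborCount - pos[i]) * rule1Scale;
  }
  velNew[i] = vel[i] + dv;                     // rules 2 and 3 omitted for brevity
}
```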

For **Uniform Grid Boids**, the optimization targets the distance computation. With a uniform grid and a few additional index buffers, the best case for this part is *O(1)*, when no boid influences any other. The worst case, however, is still *O(n)*, for example when all the boids are packed into a single cell and all influence each other.
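
As a sketch of those "additional index buffers" (buffer and kernel names here are assumptions based on the description above): each boid is labeled with the index of the grid cell it falls into, plus its own array index, and these two buffers are what gets sorted in the next step.

```cpp
#include <glm/glm.hpp>

// Label each boid with its grid-cell index and remember its original array index.
__global__ void kernComputeGridIndices(int N, int gridResolution, glm::vec3 gridMin,
                                       float inverseCellWidth, const glm::vec3 *pos,
                                       int *particleGridIndices, int *particleArrayIndices) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  glm::ivec3 cell = glm::ivec3((pos[i] - gridMin) * inverseCellWidth);
  particleGridIndices[i] = cell.x
                         + cell.y * gridResolution
                         + cell.z * gridResolution * gridResolution;
  particleArrayIndices[i] = i;
}
```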

For **Coherent Uniform Grid Boids**, the optimization targets the memory layout rather than the search algorithm, so the asymptotic behavior is the same as for **Uniform Grid Boids**. In addition, the cost of `thrust::sort_by_key` has to be taken into account in both of these grid-based methods.
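
For context, the sort is a single `thrust::sort_by_key` over those two buffers; a minimal sketch (the device pointer names are assumptions):

```cpp
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort boid indices by grid-cell index so boids in the same cell become adjacent.
thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
```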

* *For each implementation, how does changing the block count and block size affect performance? Why do you think this is?*

The result is quite interesting. I ran the following comparison:

![](./images/table4.jpg)

![](./images/table5.jpg)

![](./images/table6.jpg)

To make the comparison clear, I push the number of boids high enough to keep the FPS below 60: `boids = 10000` for **Naive Boids** and `boids = 100000` for both **Uniform Boids** and **Coherent Uniform Boids** work well.

As the charts show, changing the block count and block size barely affects **Naive Boids**, while increasing the block size lowers the performance of **Uniform Boids** but improves the performance of **Coherent Uniform Boids**. To explain this, recall the steps mentioned earlier.

> *... we have to iterate over each boid [step 1] and for each boid, we have to iterate over other boids to calculate out the distance [step 2]*

The unaffected performance of **Naive Boids** shows that [**step 1**] is not the crucial factor in this result. Hence the only difference between **Uniform Boids** and **Coherent Uniform Boids**, namely the memory layout, must be the determining factor. The pictures below illustrate this.

![](./images/pic1.jpg)

![](./images/pic2.jpg)

If the data in global memory is scattered, it is hard for the many threads of a large block to access it efficiently, because threads are grouped into warps during execution and a warp's scattered loads cannot be coalesced, as explained in [https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/](https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/). Conversely, when the data is laid out contiguously, spreading the threads across many small blocks does not exploit that layout as well, which is why larger blocks benefit the coherent version.
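
To make the two access patterns concrete, here is a hedged sketch (buffer and kernel names are illustrative): in the scattered case each read goes through an extra indirection and neighboring threads of a warp touch far-apart addresses, while in the coherent case neighboring threads touch consecutive addresses and the warp's loads can be coalesced.

```cpp
#include <glm/glm.hpp>

// Illustration only: the same neighbor read, done the scattered way and the coherent way.
__global__ void kernCompareAccessPatterns(int N, const int *particleArrayIndices,
                                          const glm::vec3 *pos, const glm::vec3 *posCoherent,
                                          glm::vec3 *out) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k >= N) return;
  // Scattered (uniform grid): the indirection means adjacent threads read non-adjacent memory.
  glm::vec3 scattered = pos[particleArrayIndices[k]];
  // Coherent: the reshuffled buffer lets adjacent threads read adjacent memory (coalesced).
  glm::vec3 coherent = posCoherent[k];
  out[k] = scattered + coherent;  // dummy use so both loads are kept
}
```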

* *For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?*

Yes. At first I did not believe it would improve performance, since rearranging the buffers seemed more costly than the benefit gained from the contiguous memory layout. I later found that the rearrangement can be fully parallelized and is quite cheap (see the sketch below). The size of the improvement from the contiguous layout also exceeded my expectations. The reason the more coherent uniform grid is faster is that threads in the same block are much better served by data stored in contiguous memory, where their reads can be coalesced.
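
For reference, the rearrangement itself is just a parallel gather, one thread per boid, which is why it turns out to be cheap; a minimal sketch (buffer and kernel names are assumptions):

```cpp
#include <glm/glm.hpp>

// Gather positions and velocities into cell-sorted order so later kernels can
// read neighbor data from contiguous memory. One thread per boid.
__global__ void kernReshuffleBoidData(int N, const int *particleArrayIndices,
                                      const glm::vec3 *pos, const glm::vec3 *vel,
                                      glm::vec3 *posCoherent, glm::vec3 *velCoherent) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  int src = particleArrayIndices[i];   // original index of the boid that is i-th in sorted order
  posCoherent[i] = pos[src];
  velCoherent[i] = vel[src];
}
```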
Binary file added images/CoherentBoidsFlocking.gif
Binary file added images/pic1.jpg
Binary file added images/pic2.jpg
Binary file added images/table1.jpg
Binary file added images/table2.jpg
Binary file added images/table3.jpg
Binary file added images/table4.jpg
Binary file added images/table5.jpg
Binary file added images/table6.jpg
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)