
Project 1: Kaixiang Miao #16


Open · wants to merge 12 commits into `master`
85 changes: 80 additions & 5 deletions README.md
@@ -1,10 +1,85 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Kaixiang Miao
* Tested on: Windows 7, i7-3630QM @ 2.40GHz 8GB, GTX 660M 2GB (Lenovo Y580 laptop, personal computer)

### (TODO: Your README)
## Screenshot

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
___


![](./images/CoherentBoidsFlocking.gif)

>*Because I recorded this with the GIF tool [LICEcap](http://www.cockos.com/licecap/), the FPS shown here is much lower than in the actual run.*

The `.gif` above shows my implementation of **Coherent Boids Flocking**. The parameters are listed below; a sketch of the corresponding constants follows the list:

* *N\_FOR_VIS* : `10000`
* *DT* : `0.2f`
* *blockSize* : `128`
* *rule1Distance* : `5.0f`
* *rule2Distance* : `3.0f`
* *rule3Distance* : `5.0f`
* *rule1Scale* : `0.01f`
* *rule2Scale* : `0.1f`
* *rule3Scale* : `0.1f`
* *maxSpeed* : `1.0f`
* *scene_scale* : `100.0f`
* *width * height* : `1280 * 720`
* *pointSize* : `2.0f`
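
For reference, these values map onto compile-time constants in the project sources; the following is only a minimal sketch of how they might be defined (names copied from the list above, so treat the exact definitions and their locations as assumptions):

```cpp
// Flocking parameters used for the capture above (sketch; the actual
// definitions live in the project's kernel/main sources and may differ in form).
#define blockSize      128
#define rule1Distance  5.0f   // cohesion radius
#define rule2Distance  3.0f   // separation radius
#define rule3Distance  5.0f   // alignment radius
#define rule1Scale     0.01f
#define rule2Scale     0.1f
#define rule3Scale     0.1f
#define maxSpeed       1.0f

const int   N_FOR_VIS   = 10000;
const float DT          = 0.2f;
const float scene_scale = 100.0f;
const float pointSize   = 2.0f;
const int   width = 1280, height = 720;
```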

## Performance Analysis

___

### Basic Analysis

The performance of **Naive Boids**, **Uniform Boids**, and **Coherent Uniform Boids** is measured in FPS for several different boid counts. The results are shown below.
![](./images/table1.jpg)

![](./images/table2.jpg)

![](./images/table3.jpg)

In conclusion, the performance ranking is **Coherent Uniform Boids > Uniform Boids > Naive Boids**. In addition, as the number of boids increases, the performance of all three implementations degrades.

### Questions

* *For each implementation, how does changing the number of boids affect performance? Why do you think this is?*

As the number of boids increases, performance drops in every implementation. A minor contributing factor is that each implementation launches one GPU thread per boid, so the number of threads scales linearly with the boid count; this adds some execution time, but it should not be the major cause of the slowdown.
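
To illustrate that point, here is a minimal sketch of the launch arithmetic (variable names are illustrative, not necessarily the exact ones in my code): one thread is launched per boid, so the block count grows linearly with the boid count.

```cpp
// One thread per boid: with blockSize = 128, the block count scales linearly with N.
dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
//   numObjects =  10000  ->   79 blocks (  79 * 128 =  10112 threads)
//   numObjects = 100000  ->  782 blocks ( 782 * 128 = 100096 threads)
```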

For **Naive Boids**, let *n* be the number of boids. Without parallelization, the velocity-update algorithm is *O(n^2)*: we iterate over every boid [**step 1**], and for each boid we iterate over all the other boids to compute distances [**step 2**]. Fortunately, CUDA parallelizes the first part down to *O(1)* by assigning one thread per boid, so the remaining complexity is *O(n)* per thread.
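
A minimal sketch of the naive kernel, showing only rule 1 for brevity (the kernel name and exact signature are illustrative, not the base code's): **step 1** becomes one thread per boid, while **step 2** remains an *O(n)* loop inside each thread.

```cpp
#include <glm/glm.hpp>

// Naive velocity update: each thread handles one boid and scans all N boids.
__global__ void kernUpdateVelocityNaive(int N, const glm::vec3 *pos,
                                        const glm::vec3 *vel, glm::vec3 *velNew) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;                          // step 1: parallelized, one thread per boid

  glm::vec3 perceivedCenter(0.0f);
  int neighborCount = 0;
  for (int j = 0; j < N; ++j) {                // step 2: still O(n) per thread
    if (j == i) continue;
    if (glm::distance(pos[j], pos[i]) < rule1Distance) {
      perceivedCenter += pos[j];
      ++neighborCount;
    }
  }

  glm::vec3 dv(0.0f);
  if (neighborCount > 0) {
    dv = (perceivedCenter / (float)neighborCount - pos[i]) * rule1Scale;
  }
  velNew[i] = vel[i] + dv;                     // rules 2 and 3 omitted for brevity
}
```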

For **Uniform Grid Boids**, the optimization targets the distance computation. With a uniform grid and a few additional index buffers, the best case for this part is *O(1)*, when no boid influences any other. The worst case, however, is still *O(n)*, for example when all the boids are packed into a single cell and all influence each other.
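
As a sketch of those "additional index buffers" (buffer and kernel names here are assumptions based on the description above): each boid is labeled with the index of the grid cell it falls into, plus its own array index, and these two buffers are what gets sorted in the next step.

```cpp
#include <glm/glm.hpp>

// Label each boid with its grid-cell index and remember its original array index.
__global__ void kernComputeGridIndices(int N, int gridResolution, glm::vec3 gridMin,
                                       float inverseCellWidth, const glm::vec3 *pos,
                                       int *particleGridIndices, int *particleArrayIndices) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  glm::ivec3 cell = glm::ivec3((pos[i] - gridMin) * inverseCellWidth);
  particleGridIndices[i] = cell.x
                         + cell.y * gridResolution
                         + cell.z * gridResolution * gridResolution;
  particleArrayIndices[i] = i;
}
```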

For **Coherent Uniform Grid Boids**, the optimization targets the memory layout rather than the search algorithm, so the asymptotic behavior is the same as for **Uniform Grid Boids**. In addition, the cost of `thrust::sort_by_key` has to be taken into account in both of these grid-based methods.
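
For context, the sort is a single `thrust::sort_by_key` over those two buffers; a minimal sketch (the device pointer names are assumptions):

```cpp
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort boid indices by grid-cell index so boids in the same cell become adjacent.
thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
```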

* *For each implementation, how does changing the block count and block size affect performance? Why do you think this is?*

The result is quite interesting. I ran the following comparison:

![](./images/table4.jpg)

![](./images/table5.jpg)

![](./images/table6.jpg)

To make the comparison clear, I push the number of boids high enough to keep the FPS below 60: `boids = 10000` for **Naive Boids** and `boids = 100000` for both **Uniform Boids** and **Coherent Uniform Boids** work well.

As the charts show, changing the block count and block size barely affects **Naive Boids**, while increasing the block size lowers the performance of **Uniform Boids** but improves the performance of **Coherent Uniform Boids**. To explain this, recall the steps mentioned earlier.

> *... we have to iterate over each boid [step 1] and for each boid, we have to iterate over other boids to calculate out the distance [step 2]*

The unaffected performance of **Naive Boids** shows that [**step 1**] is not the crucial factor in this result. Hence the only difference between **Uniform Boids** and **Coherent Uniform Boids**, namely the memory layout, must be the determining factor. The pictures below illustrate this.

![](./images/pic1.jpg)

![](./images/pic2.jpg)

If the data in global memory is scattered, it is hard for the many threads of a large block to access it efficiently, because threads are grouped into warps during execution and a warp's scattered loads cannot be coalesced, as explained in [https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/](https://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/). Conversely, when the data is laid out contiguously, spreading the threads across many small blocks does not exploit that layout as well, which is why larger blocks benefit the coherent version.
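
To make the two access patterns concrete, here is a hedged sketch (buffer and kernel names are illustrative): in the scattered case each read goes through an extra indirection and neighboring threads of a warp touch far-apart addresses, while in the coherent case neighboring threads touch consecutive addresses and the warp's loads can be coalesced.

```cpp
#include <glm/glm.hpp>

// Illustration only: the same neighbor read, done the scattered way and the coherent way.
__global__ void kernCompareAccessPatterns(int N, const int *particleArrayIndices,
                                          const glm::vec3 *pos, const glm::vec3 *posCoherent,
                                          glm::vec3 *out) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k >= N) return;
  // Scattered (uniform grid): the indirection means adjacent threads read non-adjacent memory.
  glm::vec3 scattered = pos[particleArrayIndices[k]];
  // Coherent: the reshuffled buffer lets adjacent threads read adjacent memory (coalesced).
  glm::vec3 coherent = posCoherent[k];
  out[k] = scattered + coherent;  // dummy use so both loads are kept
}
```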

* *For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?*

Yes. At first I did not believe it would improve performance, since rearranging the buffers seemed more costly than the benefit gained from the contiguous memory layout. I later found that the rearrangement can be fully parallelized and is quite cheap (see the sketch below). The size of the improvement from the contiguous layout also exceeded my expectations. The reason the more coherent uniform grid is faster is that threads in the same block are much better served by data stored in contiguous memory, where their reads can be coalesced.
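
For reference, the rearrangement itself is just a parallel gather, one thread per boid, which is why it turns out to be cheap; a minimal sketch (buffer and kernel names are assumptions):

```cpp
#include <glm/glm.hpp>

// Gather positions and velocities into cell-sorted order so later kernels can
// read neighbor data from contiguous memory. One thread per boid.
__global__ void kernReshuffleBoidData(int N, const int *particleArrayIndices,
                                      const glm::vec3 *pos, const glm::vec3 *vel,
                                      glm::vec3 *posCoherent, glm::vec3 *velCoherent) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  int src = particleArrayIndices[i];   // original index of the boid that is i-th in sorted order
  posCoherent[i] = pos[src];
  velCoherent[i] = vel[src];
}
```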
Binary file added images/CoherentBoidsFlocking.gif
Binary file added images/pic1.jpg
Binary file added images/pic2.jpg
Binary file added images/table1.jpg
Binary file added images/table2.jpg
Binary file added images/table3.jpg
Binary file added images/table4.jpg
Binary file added images/table5.jpg
Binary file added images/table6.jpg
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)