72 changes: 67 additions & 5 deletions README.md
@@ -1,10 +1,72 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Daniel Krupka
* Tested on: Debian testing (stretch), Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz 8GB, GTX 850M

### (TODO: Your README)
# Project 1 - Boids
This project implements [Boids](https://en.wikipedia.org/wiki/Boids) in CUDA and
explores a few optimizations of the naive algorithm. In brief, the Boids algorithm
produces flocking behavior, as seen in birds, as an emergent result of a few simple
rules. The algorithm is embarrassingly parallel, since each Boid's next state depends
only on the previous state of the system. Details of the assignment can be found
[here](INSTRUCTION.md).

![Boids Demo](images/boid_demo.gif "Boids Demo")

# Optimization

The naive implementation checks each Boid against every other Boid. However, when the
search space is large, most Boids are too far away to have any effect. An effective
solution is to divide the search space with a grid whose cells are about the size of
a Boid's range of influence, and to search only the neighboring grid cells. On a GPU,
where data is most effectively represented by arrays of values, this is done by giving
each Boid and each grid cell an integer index, then sorting the Boid indices by their
corresponding grid cell index. It is then simple to find the slice of the Boid index
array corresponding to a given cell.
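
The sort-and-slice idea can be sketched on the CPU in a few lines of Python. This is only an illustration, not the CUDA code from this project: the function names, the 2D grid, and the assumption that positions are non-negative are all hypothetical choices made for brevity.

```python
# Illustrative CPU sketch of the uniform-grid bookkeeping (hypothetical names).
# Assumes non-negative 2D positions and a square grid of grid_res x grid_res cells.

def cell_index(pos, cell_width, grid_res):
    """Flatten a 2D position into a single integer grid-cell index."""
    cx = min(int(pos[0] // cell_width), grid_res - 1)
    cy = min(int(pos[1] // cell_width), grid_res - 1)
    return cy * grid_res + cx

def build_grid(positions, cell_width, grid_res):
    """Return (boid indices sorted by cell, per-cell start, per-cell end)."""
    cells = [cell_index(p, cell_width, grid_res) for p in positions]
    # Sort Boid indices by their cell index; a GPU version would use a
    # parallel key/value sort instead of Python's sorted().
    order = sorted(range(len(positions)), key=lambda i: cells[i])
    n_cells = grid_res * grid_res
    start = [-1] * n_cells  # -1 marks an empty cell
    end = [-1] * n_cells
    for slot, boid in enumerate(order):
        c = cells[boid]
        if slot == 0 or cells[order[slot - 1]] != c:
            start[c] = slot  # first slot holding a Boid from cell c
        end[c] = slot + 1    # one past the last such slot
    return order, start, end

positions = [(0.5, 0.5), (3.2, 0.1), (0.9, 0.2), (3.9, 3.9)]
order, start, end = build_grid(positions, cell_width=1.0, grid_res=4)
boids_in_cell_0 = order[start[0]:end[0]]
```

Once `start` and `end` are known, the Boids in any cell are a contiguous slice of `order`, so a neighbor search only needs to visit the slices of the adjacent cells.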

However, the position and velocity arrays are *not* sorted, so each access to a Boid's
spatial data requires an extra index lookup. This is solved by shuffling the position
and velocity arrays into the same order as the Boid index array, making the data
coherent and saving lookup time.
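
As a rough illustration of that shuffle, the following Python sketch gathers position and velocity into cell-sorted order. The names are hypothetical and the data is made up; the project itself does this in CUDA.

```python
# Illustrative sketch (hypothetical names): reorder position and velocity so
# that slot k of each array belongs to the k-th Boid in cell-sorted order.

def make_coherent(order, pos, vel):
    """Gather pos/vel through the sorted Boid index array."""
    pos_sorted = [pos[i] for i in order]
    vel_sorted = [vel[i] for i in order]
    return pos_sorted, vel_sorted

order = [0, 2, 1, 3]  # Boid indices already sorted by grid cell
pos = [(0.5, 0.5), (3.2, 0.1), (0.9, 0.2), (3.9, 3.9)]
vel = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
pos_sorted, vel_sorted = make_coherent(order, pos, vel)

# Non-coherent access needs the extra hop pos[order[k]];
# coherent access reads pos_sorted[k] directly.
```

After the gather, Boids that share a cell sit next to each other in memory, which is what lets neighboring threads read neighboring addresses.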

# Profiling

Optimizations were tested by running each implementation, referred to as 'brute force',
'uniform', and 'coherent', on Boid counts ranging from 5,000 to 1,000,000. Implementations
were also tested on CUDA block sizes ranging from 128 threads to 1,024 threads. Single-step
execution time was smoothed by an infinite horizon filter with alpha=0.95.
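
A minimal sketch of such a filter is shown below. It assumes the common convention in which alpha weights the previous estimate and (1 - alpha) weights the new sample; the README does not say which convention the profiling code uses, so treat the exact form as an assumption.

```python
# Sketch of an infinite-horizon (exponential) smoothing filter.
# Assumption: alpha weights the running estimate, (1 - alpha) the new sample.

def smooth(samples, alpha=0.95):
    """Return the running smoothed value after each sample."""
    avg = None
    out = []
    for t in samples:
        # Seed the filter with the first sample, then blend in later ones.
        avg = t if avg is None else alpha * avg + (1.0 - alpha) * t
        out.append(avg)
    return out

step_times_us = [100.0, 200.0, 200.0, 200.0]  # made-up step times
smoothed = smooth(step_times_us)
```

With alpha=0.95 the estimate moves only 5% of the way toward each new sample, so a single slow step barely registers while a sustained change shows up over a few dozen steps.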

## Peak and Steady-State Rates

Preliminary probe runs showed that, in all implementations, simulation speed peaked early,
then settled into a steady state. This can be explained by noting that, as the simulation
progresses, the Boids 'clump', resulting in more Boids in a given Boid's neighborhood than
there are in the dispersed initial state.

![Single Time Step Plot](images/single_iter.png "Single Time Step")

The plot of step time versus Boid count shows a weak influence of Boid count on peak rate, but
a strong influence on the steady-state, supporting the notion that cluster size is responsible
for speed degradation.

## Block Size

As the naive implementation failed at fairly low Boid counts, further testing
was performed only on the Uniform and Coherent methods.

![Block Size Comparison](images/blk_compare.png "Block Size Comparison")

The plot shows the single-step execution time for each block size, relative to a base
size of 128 threads. In both cases, larger block sizes were no better, with substantial
degradation in the Uniform case. This is likely due to scattered memory accesses
rendering the GPU's limited caching ineffective. Additionally, both implementations
were substantially slowed by a 768-thread block (roughly 10-30% slower in steady state),
possibly due to the non-power-of-two number of threads not mapping well to the
available hardware.

## Coherence

![Coherence Comparison](images/coherent_uniform_compare.png "Coherence Comparison")

Direct comparison between the non-coherent and coherent implementations further
supports the importance of memory coherence, with the coherent version showing a
roughly 36-42% lower steady-state step time at Boid counts of 250,000 and above.
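
The size of the improvement can be checked directly from the measurements. The sketch below uses the steady-state times recorded for a 128-thread block in `prof/runData.py`; the helper name is hypothetical, and the values appear to be in microseconds, since the plotting script divides them by 1000 before labeling the axis in ms.

```python
# Steady-state step times copied from prof/runData.py (block size 128).
uniform_steady = {250000: 30302.529010, 500000: 98495.401126, 1000000: 332754.849440}
coherent_steady = {250000: 18207.942247, 500000: 57475.470061, 1000000: 212690.564596}

def improvement(n):
    """Fractional reduction in step time of 'coherent' relative to 'uniform'."""
    return 1.0 - coherent_steady[n] / uniform_steady[n]

for n in sorted(uniform_steady):
    print("%d boids: %.1f%% lower step time" % (n, 100.0 * improvement(n)))
```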

Include screenshots, analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Binary file added images/blk_compare.png
Binary file added images/boid_demo.gif
Binary file added images/coherent_uniform_compare.png
Binary file added images/single_iter.png
7 changes: 7 additions & 0 deletions prof/profRun.sh
@@ -0,0 +1,7 @@
#!/bin/bash

echo "$1 = ["
for n in 5000 10000 15000 20000 25000 50000 75000 100000 250000 500000 1000000; do
    build/cis565_boids $n 2> /dev/null
done
echo "]"
271 changes: 271 additions & 0 deletions prof/runData.py
@@ -0,0 +1,271 @@
from matplotlib import pyplot as plt
import matplotlib.lines as mlines

uniform = {}
bruteForce = {}
coherent = {}

uniform[128] = [
[5000,383.656957,624.735748],
[10000,419.100000,903.498103],
[15000,452.350000,1254.570835],
[20000,509.000000,1659.217137],
[25000,584.000000,1991.820693],
[50000,853.000000,4176.786314],
[75000,1164.000000,6912.989198],
[100000,860.000000,9940.314287],
[250000,2847.000000,30302.529010],
[500000,3981.000000,98495.401126],
[1000000,6780.000000,332754.849440],
]
uniform[256] = [
[5000,383.335023,632.444092],
[10000,412.500000,915.473426],
[15000,454.950000,1255.105734],
[20000,518.000000,1679.717623],
[25000,581.000000,2039.227647],
[50000,848.000000,4179.722180],
[75000,1176.000000,6934.372327],
[100000,869.000000,9952.324545],
[250000,2816.000000,30327.787594],
[500000,3965.000000,98674.105457],
[1000000,6795.000000,332020.133592],
]
uniform[512] = [
[5000,385.891676,639.369672],
[10000,417.250000,920.228329],
[15000,456.650000,1266.591706],
[20000,511.000000,1696.459238],
[25000,593.000000,2050.577559],
[50000,843.000000,4250.495830],
[75000,1164.000000,6984.192784],
[100000,865.000000,9969.864876],
[250000,2808.000000,30414.321448],
[500000,3962.000000,99746.753330],
[1000000,6783.000000,331621.236897],
]
uniform[768] = [
[5000,415.881824,833.147383],
[10000,414.250000,1186.176995],
[15000,455.050000,1612.495266],
[20000,506.000000,2198.307857],
[25000,585.000000,2527.516943],
[50000,835.000000,5176.156233],
[75000,1162.000000,8199.426148],
[100000,903.000000,11660.588738],
[250000,2842.000000,32719.868674],
[500000,3957.000000,114974.955343],
[1000000,6794.000000,374045.896418],
]
uniform[1024] = [
[5000,383.236620,636.124498],
[10000,416.300000,898.264101],
[15000,453.450000,1268.548150],
[20000,511.000000,1717.535533],
[25000,582.000000,2082.039087],
[50000,841.000000,4266.817845],
[75000,1164.000000,7033.642933],
[100000,1120.000000,10079.411187],
[250000,2853.000000,30541.178775],
[500000,4011.000000,100375.362867],
[1000000,6792.000000,332025.727319],
]
coherent[128] = [
[5000,386.640039,579.540401],
[10000,432.467500,830.854868],
[15000,457.700000,1096.861468],
[20000,511.650000,1454.444801],
[25000,588.000000,1918.377965],
[50000,848.000000,3487.051416],
[75000,1160.000000,5368.146619],
[100000,912.000000,7227.005568],
[250000,2869.000000,18207.942247],
[500000,3996.000000,57475.470061],
[1000000,6860.000000,212690.564596],
]
coherent[256] = [
[5000,385.174786,581.196897],
[10000,417.550000,831.696793],
[15000,457.850000,1108.747156],
[20000,529.150000,1440.820549],
[25000,588.000000,1908.937970],
[50000,859.000000,3486.965248],
[75000,1160.000000,5394.183790],
[100000,873.000000,7270.980267],
[250000,2857.000000,18222.656499],
[500000,3972.000000,57827.861638],
[1000000,6855.000000,212859.542813],
]
coherent[512] = [
[5000,393.391149,585.138622],
[10000,419.450000,829.729229],
[15000,458.950000,1098.056921],
[20000,515.250000,1473.669385],
[25000,589.000000,1788.153907],
[50000,918.000000,3516.217751],
[75000,1168.000000,5432.469845],
[100000,868.000000,7296.062158],
[250000,3984.000000,18211.412190],
[500000,4439.000000,58252.685316],
[1000000,7245.000000,213454.071366],
]
coherent[768] = [
[5000,412.840833,764.167344],
[10000,440.400000,1009.650819],
[15000,459.850000,1288.873454],
[20000,521.000000,1869.933998],
[25000,590.000000,2161.145613],
[50000,847.000000,4139.408600],
[75000,1191.000000,6331.838109],
[100000,879.000000,8479.993987],
[250000,3222.000000,21434.089841],
[500000,4101.000000,69428.011576],
[1000000,6780.000000,249272.889109],
]
coherent[1024] = [
[5000,387.201838,584.761264],
[10000,422.250000,838.071770],
[15000,454.350000,1114.856206],
[20000,513.000000,1473.060754],
[25000,590.000000,1779.910442],
[50000,851.000000,3549.825770],
[75000,1303.000000,5450.188405],
[100000,869.000000,7247.422401],
[250000,2905.000000,18090.378486],
[500000,3962.000000,58350.514909],
[1000000,6860.000000,213256.116817],
]
bruteForce[128] =[
[5000, 1000000.0/178, 1000000.0/147],
[10000, 1000000.0/151, 1000000.0/36.2],
[15000, 1000000.0/145, 1000000.0/16.2],
[20000, 1000000.0/138, 1000000.0/9.0],
]

bruteForce[256] =[
[5000, 1000000.0/166.8, 1000000.0/149],
[10000, 1000000.0/151.3, 1000000.0/38.2],
[15000, 1000000.0/143.6, 1000000.0/17.0],
[20000, 1000000.0/139.2, 1000000.0/9.6],
]

bruteForce[512] =[
[5000, 1000000.0/168.5, 1000000.0/147.5],
[10000, 1000000.0/149.8, 1000000.0/38.2],
[15000, 1000000.0/145.5, 1000000.0/17.0],
[20000, 1000000.0/140.4, 1000000.0/9.6],
]

bruteForce[768] =[
[5000, 1000000.0/160.8, 1000000.0/95.1],
[10000, 1000000.0/150.8, 1000000.0/31.2],
[15000, 1000000.0/142.8, 1000000.0/15.4],
[20000, 1000000.0/133.2, 1000000.0/7.8],
]

bruteForce[1024] =[
[5000, 1000000.0/168.3, 1000000.0/147.3],
[10000, 1000000.0/151.4, 1000000.0/37.8],
[15000, 1000000.0/144.1, 1000000.0/17.0],
[20000, 1000000.0/139.7, 1000000.0/9.6],
]

dsets = [
(bruteForce, 'bruteForce', 'r'),
(uniform, 'uniform', 'g'),
(coherent, 'coherent', (1,0,1)),
]

### OVERALL COMPARISON
plt.figure()
legend = []
for dset,name,c in dsets:
    for sz in dset:
        x,yPk,ySt = zip(*dset[sz])
        plt.loglog(x,[t/1000 for t in ySt],color=c)
        plt.loglog(x,[t/1000 for t in yPk],color=c,ls='dashed')
    line = mlines.Line2D([], [], color=c, label=name+' (peak)', ls='dashed')
    legend.append(line)
    line = mlines.Line2D([], [], color=c, label=name+' (steady)')
    legend.append(line)

plt.legend(handles=legend, loc=4)
plt.xlabel('# of boids')
plt.ylabel('t (ms)')
#plt.axvline(x=5000.,color='k',ls='dashed')
plt.title('Single iteration time')
plt.xlim([5000,1000000])

### BLOCK SIZE COMPARISON
plt.figure()
szList = sorted(uniform.keys())
sz0 = szList[0]
szList = szList[1:]
# Baseline (smallest block size) for each implementation. Indexing the leftover
# loop variable `dset` here would normalize both implementations by 'coherent'.
x0,yPkU0,yStU0 = zip(*uniform[sz0])
x0,yPkC0,yStC0 = zip(*coherent[sz0])

plt.subplot(221)
hPkList = []
for sz in szList:
    x,yPk,ySt = zip(*uniform[sz])
    rPk = [float(a)/a0 for a,a0 in zip(yPk,yPkU0)]
    hPk, = plt.semilogx(x,rPk,label='blk=%d'%sz)
    hPkList.append(hPk)
plt.legend(handles=hPkList, loc=2)
plt.xlim([5000,1000000])
plt.title("'Uniform' peak performance")
plt.xlabel('# of boids')
plt.ylabel('t(blk)/t(128)')

plt.subplot(222)
hStList = []
for sz in szList:
    x,yPk,ySt = zip(*uniform[sz])
    rSt = [float(a)/a0 for a,a0 in zip(ySt,yStU0)]
    hSt, = plt.semilogx(x,rSt,label='blk=%d'%sz)
    hStList.append(hSt)
plt.legend(handles=hStList, loc=2)
plt.xlim([5000,1000000])
plt.title("'Uniform' steady state performance")
plt.xlabel('# of boids')
plt.ylabel('t(blk)/t(128)')

plt.subplot(223)
hPkList = []
for sz in szList:
    x,yPk,ySt = zip(*coherent[sz])
    rPk = [float(a)/a0 for a,a0 in zip(yPk,yPkC0)]
    hPk, = plt.semilogx(x,rPk,label='blk=%d'%sz)
    hPkList.append(hPk)
plt.legend(handles=hPkList, loc=2)
plt.xlim([5000,1000000])
plt.title("'Coherent' peak performance")
plt.xlabel('# of boids')
plt.ylabel('t(blk)/t(128)')

plt.subplot(224)
hStList = []
for sz in szList:
    x,yPk,ySt = zip(*coherent[sz])
    rSt = [float(a)/a0 for a,a0 in zip(ySt,yStC0)]
    hSt, = plt.semilogx(x,rSt,label='blk=%d'%sz)
    hStList.append(hSt)
plt.legend(handles=hStList, loc=1)
plt.xlim([5000,1000000])
plt.title("'Coherent' steady state performance")
plt.xlabel('# of boids')
plt.ylabel('t(blk)/t(128)')

plt.figure()
hList = []
for sz in szList:
    x,yPk1,ySt1 = zip(*uniform[sz])
    x,yPk2,ySt2 = zip(*coherent[sz])
    rPk = [float(a)/a0 for a,a0 in zip(ySt2,ySt1)]
    hPk, = plt.semilogx(x,rPk,label='blk=%d'%sz)
    hList.append(hPk)
plt.legend(handles=hList, loc=1)
plt.xlim([5000,1000000])
plt.title("'Coherent' vs 'Uniform' steady state performance")
plt.xlabel('# of boids')
plt.ylabel('t(coherent)/t(uniform)')

plt.show()
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES

cuda_add_library(src
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_50
)