CIS565-Fall-2016 · krupkad · Sep 6, 2016 · Sep 6, 2016 · Sep 7, 2016 · Sep 7, 2016
diff --git a/README.md b/README.md
@@ -1,10 +1,72 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Daniel Krupka
+* Tested on: Debian testing (stretch), Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz 8GB, GTX 850M
 
-### (TODO: Your README)
+# Project 1 - Boids
+This project's goal is to implement [Boids](https://en.wikipedia.org/wiki/Boids) using CUDA,
+and to explore a few optimizations of the naive algorithm. To summarize,
+the Boids algorithm produces bird-like flocking as emergent behavior
+from a few simple rules. The algorithm is embarrassingly parallel, as the behavior of each
+Boid depends only on the previous state of the system. Details of the assignment can be found
+[here](INSTRUCTION.md).
+
+![Boids Demo](images/boid_demo.gif "Boids Demo")
+
+# Optimization
+
+For each Boid, the naive implementation checks every other Boid. However, when the search
+space is large, most Boids will not be nearby and have no effect. An effective solution is
+to divide the search space with a grid whose cells are roughly the size of a Boid's range
+of influence, and to search only the neighboring grid cells. On a GPU, where data is most
+effectively represented by arrays of values, this is done by giving each Boid and each grid
+cell an integer index, and sorting the Boid indices by their corresponding grid cell index.
+It is then simple to find the slice of the Boid index array corresponding to a given cell.
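The sort-by-cell bookkeeping described above can be sketched on the CPU in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's CUDA code: the names `cell_index` and `build_grid`, the parameters `cell_width` and `grid_res`, and the row-major cell numbering are all hypothetical, and positions are assumed to lie inside the grid.

```python
# Illustrative sketch only: function and parameter names are assumptions,
# not taken from the project's CUDA source.

def cell_index(pos, cell_width, grid_res):
    """Flatten a 3D position inside the grid into a single integer cell index."""
    ix, iy, iz = (min(int(c // cell_width), grid_res - 1) for c in pos)
    return ix + iy * grid_res + iz * grid_res * grid_res

def build_grid(positions, cell_width, grid_res):
    """Return (Boid indices sorted by cell, {cell: (start, end)} slices)."""
    cells = [cell_index(p, cell_width, grid_res) for p in positions]
    # Sort Boid indices by the cell each Boid occupies.
    order = sorted(range(len(positions)), key=lambda i: cells[i])
    # Each occupied cell owns one contiguous slice [start, end) of the sorted array.
    ranges = {}
    for slot, boid in enumerate(order):
        start, _ = ranges.get(cells[boid], (slot, slot))
        ranges[cells[boid]] = (start, slot + 1)
    return order, ranges

positions = [(0.5, 0.5, 0.5), (3.5, 0.5, 0.5), (0.2, 0.1, 0.9), (1.5, 0.5, 0.5)]
order, ranges = build_grid(positions, cell_width=1.0, grid_res=4)
start, end = ranges[0]
neighbors_in_cell_0 = order[start:end]  # Boids 0 and 2 share cell 0
```

On the GPU the same idea would typically use a parallel key-value sort and one thread per sorted slot to detect where the cell index changes; the dictionary here only stands in for per-cell start and end arrays.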
+
+However, position and velocity are *not* sorted, so finding the spatial data for each Boid
+requires an extra index lookup. This is solved by shuffling the position and velocity arrays
+into the same order as the Boid index array, making the data coherent and saving lookup time.
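As a rough illustration of the reshuffling step, the sketch below gathers per-Boid arrays into the sorted order. The helper name `make_coherent` and the string stand-ins for vectors are assumptions made for the example, not names from the project.

```python
# Illustrative sketch only: names are assumptions, not the project's CUDA code.

def make_coherent(order, values):
    """Gather a per-Boid array into grid-sorted order."""
    return [values[i] for i in order]

order = [2, 0, 3, 1]                      # Boid indices sorted by grid cell
positions = ["p0", "p1", "p2", "p3"]      # stand-ins for 3D position vectors
velocities = ["v0", "v1", "v2", "v3"]     # stand-ins for 3D velocity vectors

sorted_pos = make_coherent(order, positions)
sorted_vel = make_coherent(order, velocities)

# Before: slot k needs two reads, order[k] and then positions[order[k]].
# After: slot k reads sorted_pos[k] directly, and Boids in one cell are adjacent.
```

The gather costs one extra pass per step, which is traded for contiguous reads during the much more expensive neighbor search.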
+
+# Profiling
+
+Optimizations were tested by running each implementation, referred to as 'brute force',
+'uniform', and 'coherent', on Boid counts ranging from 5,000 to 1,000,000. Implementations
+were also tested on CUDA block sizes ranging from 128 threads to 1,024 threads. Single-step
+execution time was smoothed by an infinite-horizon filter (an exponentially weighted moving
+average) with alpha=0.95.
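A filter of this kind can be sketched as follows. The README does not say which term alpha multiplies, so this sketch assumes alpha weights the running average and (1 - alpha) weights the newest sample; the function name `smooth` is hypothetical.

```python
def smooth(samples, alpha=0.95):
    """Exponentially weighted running average of a sequence of step times."""
    avg = None
    out = []
    for x in samples:
        # Assumption: alpha weights the history, (1 - alpha) the new sample.
        avg = x if avg is None else alpha * avg + (1.0 - alpha) * x
        out.append(avg)
    return out

# A single slow step barely moves the smoothed value.
smoothed = smooth([100.0, 100.0, 200.0, 100.0])
```

With alpha=0.95 each new sample contributes only 5% to the reported value, so the output lags sudden changes but hides per-step jitter.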
+
+## Peak and Steady-State Rates
+
+Preliminary probe runs showed that, in all implementations, simulation speed peaked early,
+then settled into a steady state. This can be explained by noting that, as the simulation
+progresses, the Boids 'clump', resulting in more Boids in a given Boid's neighborhood than
+there are in the dispersed initial state.
+
+![Single Time Step Plot](images/single_iter.png "Single Time Step")
+
+The plot of step time versus Boid count shows a weak influence of Boid count on peak rate, but
+a strong influence on the steady-state, supporting the notion that cluster size is responsible
+for speed degradation.
+
+## Block Size
+
+As the naive implementation failed at fairly low Boid counts, further testing
+was performed only on the Uniform and Coherent methods.
+
+![Block Size Comparison](images/blk_compare.png "Block Size Comparison")
+
+The plot shows single-step execution time for each block size, relative to a base
+size of 128 threads. In both cases, larger block sizes were no better, with substantial
+degradation in the Uniform case. This is likely due to scattered memory accesses
+rendering the GPU's limited caching ineffective. Additionally, both implementations
+were substantially slowed by a 768-thread block, possibly because the non-power-of-two
+thread count maps poorly to the hardware: 768 does not evenly divide the 2,048 resident
+threads that an sm_50 multiprocessor supports, so at most two blocks (1,536 threads) fit at once.
+
+## Coherence
+
+![Coherence Comparison](images/coherent_uniform_compare.png "Coherence Comparison")
+
+Direct comparison between the non-coherent and coherent implementations further
+supports the importance of memory coherence, with the coherent version showing a
+roughly 40% reduction in steady-state step time at large Boid counts (250,000 and above).
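The size of the improvement can be checked directly from the steady-state values tabulated in `prof/runData.py` for a 128-thread block. The snippet below copies only those six numbers; the dictionary and function names are made up for the example.

```python
# Steady-state step times for block size 128, copied from prof/runData.py.
uniform_128 = {250000: 30302.529010, 500000: 98495.401126, 1000000: 332754.849440}
coherent_128 = {250000: 18207.942247, 500000: 57475.470061, 1000000: 212690.564596}

def reduction(n):
    """Fractional reduction in step time of 'coherent' relative to 'uniform'."""
    return 1.0 - coherent_128[n] / uniform_128[n]

for n in sorted(uniform_128):
    print("%7d boids: %.1f%% shorter step" % (n, 100.0 * reduction(n)))
```

For these three Boid counts the reduction falls between roughly 36% and 42%.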
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/images/blk_compare.png b/images/blk_compare.png
diff --git a/images/boid_demo.gif b/images/boid_demo.gif
diff --git a/images/coherent_uniform_compare.png b/images/coherent_uniform_compare.png
diff --git a/images/single_iter.png b/images/single_iter.png
diff --git a/prof/profRun.sh b/prof/profRun.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+echo "$1 = ["
+for n in 5000 10000 15000 20000 25000 50000 75000 100000 250000 500000 1000000; do
+  build/cis565_boids $n 2> /dev/null
+done
+echo "]"
diff --git a/prof/runData.py b/prof/runData.py
@@ -0,0 +1,271 @@
+from matplotlib import pyplot as plt
+import matplotlib.lines as mlines
+
+uniform = {}
+bruteForce = {}
+coherent = {}
+
+uniform[128] = [
+[5000,383.656957,624.735748],
+[10000,419.100000,903.498103],
+[15000,452.350000,1254.570835],
+[20000,509.000000,1659.217137],
+[25000,584.000000,1991.820693],
+[50000,853.000000,4176.786314],
+[75000,1164.000000,6912.989198],
+[100000,860.000000,9940.314287],
+[250000,2847.000000,30302.529010],
+[500000,3981.000000,98495.401126],
+[1000000,6780.000000,332754.849440],
+]
+uniform[256] = [
+[5000,383.335023,632.444092],
+[10000,412.500000,915.473426],
+[15000,454.950000,1255.105734],
+[20000,518.000000,1679.717623],
+[25000,581.000000,2039.227647],
+[50000,848.000000,4179.722180],
+[75000,1176.000000,6934.372327],
+[100000,869.000000,9952.324545],
+[250000,2816.000000,30327.787594],
+[500000,3965.000000,98674.105457],
+[1000000,6795.000000,332020.133592],
+]
+uniform[512] = [
+[5000,385.891676,639.369672],
+[10000,417.250000,920.228329],
+[15000,456.650000,1266.591706],
+[20000,511.000000,1696.459238],
+[25000,593.000000,2050.577559],
+[50000,843.000000,4250.495830],
+[75000,1164.000000,6984.192784],
+[100000,865.000000,9969.864876],
+[250000,2808.000000,30414.321448],
+[500000,3962.000000,99746.753330],
+[1000000,6783.000000,331621.236897],
+]
+uniform[768] = [
+[5000,415.881824,833.147383],
+[10000,414.250000,1186.176995],
+[15000,455.050000,1612.495266],
+[20000,506.000000,2198.307857],
+[25000,585.000000,2527.516943],
+[50000,835.000000,5176.156233],
+[75000,1162.000000,8199.426148],
+[100000,903.000000,11660.588738],
+[250000,2842.000000,32719.868674],
+[500000,3957.000000,114974.955343],
+[1000000,6794.000000,374045.896418],
+]
+uniform[1024] = [
+[5000,383.236620,636.124498],
+[10000,416.300000,898.264101],
+[15000,453.450000,1268.548150],
+[20000,511.000000,1717.535533],
+[25000,582.000000,2082.039087],
+[50000,841.000000,4266.817845],
+[75000,1164.000000,7033.642933],
+[100000,1120.000000,10079.411187],
+[250000,2853.000000,30541.178775],
+[500000,4011.000000,100375.362867],
+[1000000,6792.000000,332025.727319],
+]
+coherent[128] = [
+[5000,386.640039,579.540401],
+[10000,432.467500,830.854868],
+[15000,457.700000,1096.861468],
+[20000,511.650000,1454.444801],
+[25000,588.000000,1918.377965],
+[50000,848.000000,3487.051416],
+[75000,1160.000000,5368.146619],
+[100000,912.000000,7227.005568],
+[250000,2869.000000,18207.942247],
+[500000,3996.000000,57475.470061],
+[1000000,6860.000000,212690.564596],
+]
+coherent[256] = [
+[5000,385.174786,581.196897],
+[10000,417.550000,831.696793],
+[15000,457.850000,1108.747156],
+[20000,529.150000,1440.820549],
+[25000,588.000000,1908.937970],
+[50000,859.000000,3486.965248],
+[75000,1160.000000,5394.183790],
+[100000,873.000000,7270.980267],
+[250000,2857.000000,18222.656499],
+[500000,3972.000000,57827.861638],
+[1000000,6855.000000,212859.542813],
+]
+coherent[512] = [
+[5000,393.391149,585.138622],
+[10000,419.450000,829.729229],
+[15000,458.950000,1098.056921],
+[20000,515.250000,1473.669385],
+[25000,589.000000,1788.153907],
+[50000,918.000000,3516.217751],
+[75000,1168.000000,5432.469845],
+[100000,868.000000,7296.062158],
+[250000,3984.000000,18211.412190],
+[500000,4439.000000,58252.685316],
+[1000000,7245.000000,213454.071366],
+]
+coherent[768] = [
+[5000,412.840833,764.167344],
+[10000,440.400000,1009.650819],
+[15000,459.850000,1288.873454],
+[20000,521.000000,1869.933998],
+[25000,590.000000,2161.145613],
+[50000,847.000000,4139.408600],
+[75000,1191.000000,6331.838109],
+[100000,879.000000,8479.993987],
+[250000,3222.000000,21434.089841],
+[500000,4101.000000,69428.011576],
+[1000000,6780.000000,249272.889109],
+]
+coherent[1024] = [
+[5000,387.201838,584.761264],
+[10000,422.250000,838.071770],
+[15000,454.350000,1114.856206],
+[20000,513.000000,1473.060754],
+[25000,590.000000,1779.910442],
+[50000,851.000000,3549.825770],
+[75000,1303.000000,5450.188405],
+[100000,869.000000,7247.422401],
+[250000,2905.000000,18090.378486],
+[500000,3962.000000,58350.514909],
+[1000000,6860.000000,213256.116817],
+]
+bruteForce[128] =[
+  [5000, 1000000.0/178, 1000000.0/147],
+  [10000, 1000000.0/151, 1000000.0/36.2],
+  [15000, 1000000.0/145, 1000000.0/16.2],
+  [20000, 1000000.0/138, 1000000.0/9.0],
+]
+
+bruteForce[256] =[
+  [5000, 1000000.0/166.8, 1000000.0/149],
+  [10000, 1000000.0/151.3, 1000000.0/38.2],
+  [15000, 1000000.0/143.6, 1000000.0/17.0],
+  [20000, 1000000.0/139.2, 1000000.0/9.6],
+]
+
+bruteForce[512] =[
+  [5000, 1000000.0/168.5, 1000000.0/147.5],
+  [10000, 1000000.0/149.8, 1000000.0/38.2],
+  [15000, 1000000.0/145.5, 1000000.0/17.0],
+  [20000, 1000000.0/140.4, 1000000.0/9.6],
+]
+
+bruteForce[768] =[
+  [5000, 1000000.0/160.8, 1000000.0/95.1],
+  [10000, 1000000.0/150.8, 1000000.0/31.2],
+  [15000, 1000000.0/142.8, 1000000.0/15.4],
+  [20000, 1000000.0/133.2, 1000000.0/7.8],
+]
+
+bruteForce[1024] =[
+  [5000, 1000000.0/168.3, 1000000.0/147.3],
+  [10000, 1000000.0/151.4, 1000000.0/37.8],
+  [15000, 1000000.0/144.1, 1000000.0/17.0],
+  [20000, 1000000.0/139.7, 1000000.0/9.6],
+]
+
+dsets = [
+  (bruteForce, 'bruteForce', 'r'),
+  (uniform, 'uniform', 'g'),
+  (coherent, 'coherent', (1,0,1)),
+]
+
+### OVERALL COMPARISON
+plt.figure()
+legend = []
+for dset,name,c in dsets:
+  for sz in dset:
+    x,yPk,ySt = zip(*dset[sz])
+    plt.loglog(x,[t/1000 for t in ySt],color=c)
+    plt.loglog(x,[t/1000 for t in yPk],color=c,ls='dashed')
+  line = mlines.Line2D([], [], color=c, label=name+' (peak)', ls='dashed')
+  legend.append(line)
+  line = mlines.Line2D([], [], color=c, label=name+' (steady)')
+  legend.append(line)
+
+plt.legend(handles=legend, loc=4)
+plt.xlabel('# of boids')
+plt.ylabel('t (ms)')
+#plt.axvline(x=5000.,color='k',ls='dashed')
+plt.title('Single iteration time')
+plt.xlim([5000,1000000])
+
+plt.figure()
+szList = sorted(uniform.keys())
+sz0 = szList[0]
+szList = szList[1:]
+# Baseline: 'uniform' at the smallest block size (128 threads).
+x0,yPk0,ySt0 = zip(*uniform[sz0])
+
+plt.subplot(221)
+hPkList = []
+for sz in szList:
+  x,yPk,ySt = zip(*uniform[sz])
+  rPk = [float(a)/a0 for a,a0 in zip(yPk,yPk0)]
+  hPk, = plt.semilogx(x,rPk,label='blk=%d'%sz)
+  hPkList.append(hPk)
+plt.legend(handles=hPkList, loc=2)
+plt.xlim([5000,1000000])
+plt.title("'Uniform' peak performance")
+plt.xlabel('# of boids')
+plt.ylabel('t(blk)/t(128)')
+
+plt.subplot(222)
+hStList = []
+for sz in szList:
+  x,yPk,ySt = zip(*uniform[sz])
+  rSt = [float(a)/a0 for a,a0 in zip(ySt,ySt0)]
+  hSt, = plt.semilogx(x,rSt,label='blk=%d'%sz)
+  hStList.append(hSt)
+plt.legend(handles=hStList, loc=2)
+plt.xlim([5000,1000000])
+plt.title("'Uniform' steady state performance")
+plt.xlabel('# of boids')
+plt.ylabel('t(blk)/t(128)')
+
+# Baseline: 'coherent' at the smallest block size (128 threads).
+x0,yPk0,ySt0 = zip(*coherent[sz0])
+plt.subplot(223)
+hPkList = []
+for sz in szList:
+  x,yPk,ySt = zip(*coherent[sz])
+  rPk = [float(a)/a0 for a,a0 in zip(yPk,yPk0)]
+  hPk, = plt.semilogx(x,rPk,label='blk=%d'%sz)
+  hPkList.append(hPk)
+plt.legend(handles=hPkList, loc=2)
+plt.xlim([5000,1000000])
+plt.title("'Coherent' peak performance")
+plt.xlabel('# of boids')
+plt.ylabel('t(blk)/t(128)')
+
+plt.subplot(224)
+hStList = []
+for sz in szList:
+  x,yPk,ySt = zip(*coherent[sz])
+  rSt = [float(a)/a0 for a,a0 in zip(ySt,ySt0)]
+  hSt, = plt.semilogx(x,rSt,label='blk=%d'%sz)
+  hStList.append(hSt)
+plt.legend(handles=hStList, loc=1)
+plt.xlim([5000,1000000])
+plt.title("'Coherent' steady state performance")
+plt.xlabel('# of boids')
+plt.ylabel('t(blk)/t(128)')
+
+plt.figure()
+hList = []
+for sz in szList:
+  x,yPk1,ySt1 = zip(*uniform[sz])
+  x,yPk2,ySt2 = zip(*coherent[sz])
+  rPk = [float(a)/a0 for a,a0 in zip(ySt2,ySt1)]
+  hPk, = plt.semilogx(x,rPk,label='blk=%d'%sz)
+  hList.append(hPk)
+plt.legend(handles=hList, loc=1)
+plt.xlim([5000,1000000])
+plt.title("'Coherent' vs 'Uniform' steady state performance")
+plt.xlabel('# of boids')
+plt.ylabel('t(coherent)/t(uniform)')
+
+plt.show()
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )