diff --git a/README.md b/README.md index d63a6a1..d05e697 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,112 @@ +Project 1 Flocking +==================== + **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Raymond Yang + * [LinkedIn](https://www.linkedin.com/in/raymond-yang-b85b19168) + * Tested on: + * 09/13/2021 + * Windows 10 + * NVIDIA GeForce GTX 1080 Ti + * Submitted on: 09/13/2021 + +## Introduction +The objective of this project was to simulate basic flocking behavior among boids. Boids are simple agents represented as points in 3D space. Each boid follows three rules: +* [Rule 1](https://github.com/CIS565-Fall-2021/Project1-CUDA-Flocking/blob/main/INSTRUCTION.md#rule-1-boids-try-to-fly-towards-the-centre-of-mass-of-neighbouring-boids): Boids try to fly towards the centre of mass of neighbouring boids. +* [Rule 2](https://github.com/CIS565-Fall-2021/Project1-CUDA-Flocking/blob/main/INSTRUCTION.md#rule-2-boids-try-to-keep-a-small-distance-away-from-other-objects-including-other-boids): Boids try to keep a small distance away from other objects (including other boids). +* [Rule 3](https://github.com/CIS565-Fall-2021/Project1-CUDA-Flocking/blob/main/INSTRUCTION.md#rule-3-boids-try-to-match-velocity-with-near-boids): Boids try to match velocity with nearby boids. + +This project does not implement obstacles within the 3D space, so Rule 2 only accounts for neighboring boids. The project is intended as an introduction to CUDA programming and to data access and layout in GPU memory.
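As a concrete illustration of the three rules, here is a minimal serial C++ sketch of the per-boid accumulation. The `Vec3` helpers and the `ruleDeltas` function are hypothetical stand-ins for this README (the project itself implements this over `glm::vec3` data inside CUDA kernels), and the distances and scales passed in are arbitrary example parameters:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal stand-in for glm::vec3 so the sketch is self-contained.
struct Vec3 { float x, y, z; };

static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dist(Vec3 a, Vec3 b) {
  Vec3 d = sub(a, b);
  return std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
}

// Returns the velocity change for boid `self` from rules 1-3 (hypothetical
// serial helper; each rule only counts boids inside its own radius).
Vec3 ruleDeltas(int self, const std::vector<Vec3>& pos, const std::vector<Vec3>& vel,
                float r1Dist, float r2Dist, float r3Dist,
                float r1Scale, float r2Scale, float r3Scale) {
  Vec3 center{0, 0, 0}, separate{0, 0, 0}, match{0, 0, 0};
  int n1 = 0, n3 = 0;
  for (std::size_t i = 0; i < pos.size(); ++i) {
    if ((int)i == self) continue;
    float d = dist(pos[i], pos[self]);
    if (d < r1Dist) { center = add(center, pos[i]); ++n1; }                    // Rule 1: cohesion
    if (d < r2Dist) { separate = sub(separate, sub(pos[i], pos[self])); }      // Rule 2: separation
    if (d < r3Dist) { match = add(match, vel[i]); ++n3; }                      // Rule 3: alignment
  }
  Vec3 delta{0, 0, 0};
  if (n1 > 0) delta = add(delta, mul(sub(mul(center, 1.0f / n1), pos[self]), r1Scale));
  delta = add(delta, mul(separate, r2Scale));
  if (n3 > 0) delta = add(delta, mul(mul(match, 1.0f / n3), r3Scale));
  return delta;
}
```

`ruleDeltas` returns only the velocity change; the caller would add it to the boid's current velocity and clamp the result to a maximum speed.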
+ +## Demo + +### Naive Demo +20,000 boids screenshot: +![ssa](images/a.PNG) + +20,000 boids GIF: +![ga](images/a.gif) + +60,000 boids screenshot: +![ssb](images/b.PNG) + +60,000 boids GIF: +![gb](images/b.gif) + +120,000 boids screenshot: +![ssc](images/c.PNG) + +120,000 boids GIF: +![gc](images/c.gif) + +### Scattered Grid Demo +20,000 boids screenshot: +![ssd](images/d.PNG) + +20,000 boids GIF: +![gd](images/d.gif) + +60,000 boids screenshot: +![sse](images/e.PNG) + +60,000 boids GIF: +![ge](images/e.gif) + +120,000 boids screenshot: +![ssf](images/f.PNG) + +120,000 boids GIF: +![gf](images/f.gif) + +### Coherent Grid Demo +20,000 boids screenshot: +![ssg](images/g.PNG) + +20,000 boids GIF: +![gg](images/g.gif) + +60,000 boids screenshot: +![ssh](images/h.PNG) + +60,000 boids GIF: +![gh](images/h.gif) + +120,000 boids screenshot: +![ssi](images/i.PNG) + +120,000 boids GIF: +![gi](images/i.gif) + +## Implementation + + +### Naive Implementation +The naive implementation takes a brute-force approach to calculating each rule: each boid examines every other boid in the 3D space and accumulates a cumulative velocity. Another boid's position and velocity are factored into the cumulative velocity only if it satisfies the distance thresholds `rule1Distance`, `rule2Distance`, and `rule3Distance`. + +### Scattered Grid Implementation +The scattered grid implementation uses a uniform grid that subdivides the 3D space into cells. It relies on `dev_particleArrayIndices`, `dev_particleGridIndices`, `dev_gridCellStartIndices`, and `dev_gridCellEndIndices`, which allow boid position and velocity vectors to be looked up by grid cell.
Each boid then only examines the position and velocity vectors of boids within a 2x2x2 block of neighboring cells (cell side length defined by `gridCellWidth`); those that satisfy `rule1Distance`, `rule2Distance`, and `rule3Distance` are incorporated into the cumulative velocity. + +### Coherent Grid Implementation +The coherent grid implementation is similar to the scattered one, but the position and velocity buffers are reshuffled into cell order so that neighbor data is read contiguously, removing one level of indirection and improving performance. + +## Analysis + +### Data Comparison +| Boids | Naive FPS | Scattered FPS | Coherent FPS | +|---------|-----------|---------------|--------------| +| 20,000 | 25.8 | 674.9 | 743.4 | +| 60,000 | 3.0 | 276.6 | 399.6 | +| 120,000 | 0.8 | 124.4 | 198.0 | + +![ssaa](images/aa.png) -### (TODO: Your README) +### Questions +* Framerate decreases as the number of boids increases. Because the cells are a fixed size, boid density grows with the boid count, so the average number of boids per cell rises and more of the per-boid work is serialized. +* Decreasing the block count negatively affected performance. With fewer blocks, more warps are packed into each block; blocks can execute in parallel across SMs, while warps sharing a single SM's resources may be serialized. +* The coherent grid structure increased performance. Reshuffling the boid data removes the dependent double dereference of the scattered implementation (e.g. `pos[particleArrayIndices[i]]`), so position and velocity are read directly in cell order. +* Checking more cells in my implementation seemed to have a negligible effect. Although each boid iterates through more cells, other factors such as synchronization and stalling are also affected. -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +## Notes +* One late day used.
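The grid bookkeeping behind the scattered and coherent implementations can be sketched serially as follows. This is an illustrative CPU version, not the project's CUDA code: `cellOf` and `identifyCellStartEnd` are hypothetical helper names, a cubic grid is assumed, and -1 marks an empty cell (the project performs the same steps in parallel in `kernComputeIndices` and `kernIdentifyCellStartEnd`, with a Thrust key sort in between):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Flatten a 3D cell coordinate into a 1D cell index (x-major, mirroring
// gridIndex3Dto1D in kernel.cu).
int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
  return x + y * gridResolution + z * gridResolution * gridResolution;
}

// Hypothetical helper: which flattened cell a position falls in, given the
// grid's minimum corner and the inverse cell width.
int cellOf(float px, float py, float pz, float gridMin, float invCellWidth,
           int gridResolution) {
  int x = (int)std::floor((px - gridMin) * invCellWidth);
  int y = (int)std::floor((py - gridMin) * invCellWidth);
  int z = (int)std::floor((pz - gridMin) * invCellWidth);
  return gridIndex3Dto1D(x, y, z, gridResolution);
}

// Hypothetical serial version of the start/end scan: given the boids' cell
// keys sorted ascending, record each cell's inclusive [start, end] range in
// sorted order; -1 marks an empty cell.
void identifyCellStartEnd(const std::vector<int>& sortedKeys, int cellCount,
                          std::vector<int>& start, std::vector<int>& end) {
  start.assign(cellCount, -1);
  end.assign(cellCount, -1);
  int n = (int)sortedKeys.size();
  for (int i = 0; i < n; ++i) {
    int k = sortedKeys[i];
    if (i == 0 || sortedKeys[i - 1] != k) start[k] = i;   // first boid of cell k
    if (i == n - 1 || sortedKeys[i + 1] != k) end[k] = i; // last boid of cell k
  }
}
```

Note that the recorded end index is inclusive, so a cell's boids occupy `[start[k], end[k]]` in the sorted arrays.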
\ No newline at end of file diff --git a/images/a.PNG b/images/a.PNG new file mode 100644 index 0000000..3869470 Binary files /dev/null and b/images/a.PNG differ diff --git a/images/a.gif b/images/a.gif new file mode 100644 index 0000000..97516d9 Binary files /dev/null and b/images/a.gif differ diff --git a/images/aa.png b/images/aa.png new file mode 100644 index 0000000..559a3d0 Binary files /dev/null and b/images/aa.png differ diff --git a/images/b.PNG b/images/b.PNG new file mode 100644 index 0000000..7659e55 Binary files /dev/null and b/images/b.PNG differ diff --git a/images/b.gif b/images/b.gif new file mode 100644 index 0000000..448240f Binary files /dev/null and b/images/b.gif differ diff --git a/images/c.PNG b/images/c.PNG new file mode 100644 index 0000000..d78c8b8 Binary files /dev/null and b/images/c.PNG differ diff --git a/images/c.gif b/images/c.gif new file mode 100644 index 0000000..39108e4 Binary files /dev/null and b/images/c.gif differ diff --git a/images/d.PNG b/images/d.PNG new file mode 100644 index 0000000..9bae296 Binary files /dev/null and b/images/d.PNG differ diff --git a/images/d.gif b/images/d.gif new file mode 100644 index 0000000..ec74ed5 Binary files /dev/null and b/images/d.gif differ diff --git a/images/e.PNG b/images/e.PNG new file mode 100644 index 0000000..73214a6 Binary files /dev/null and b/images/e.PNG differ diff --git a/images/e.gif b/images/e.gif new file mode 100644 index 0000000..46b6784 Binary files /dev/null and b/images/e.gif differ diff --git a/images/f.PNG b/images/f.PNG new file mode 100644 index 0000000..863641b Binary files /dev/null and b/images/f.PNG differ diff --git a/images/f.gif b/images/f.gif new file mode 100644 index 0000000..70eb27c Binary files /dev/null and b/images/f.gif differ diff --git a/images/g.PNG b/images/g.PNG new file mode 100644 index 0000000..c0bd876 Binary files /dev/null and b/images/g.PNG differ diff --git a/images/g.gif b/images/g.gif new file mode 100644 index 0000000..7441760 Binary 
files /dev/null and b/images/g.gif differ diff --git a/images/h.PNG b/images/h.PNG new file mode 100644 index 0000000..c132747 Binary files /dev/null and b/images/h.PNG differ diff --git a/images/h.gif b/images/h.gif new file mode 100644 index 0000000..ae3bfaf Binary files /dev/null and b/images/h.gif differ diff --git a/images/i.PNG b/images/i.PNG new file mode 100644 index 0000000..fa83d12 Binary files /dev/null and b/images/i.PNG differ diff --git a/images/i.gif b/images/i.gif new file mode 100644 index 0000000..efeaa73 Binary files /dev/null and b/images/i.gif differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..f4ca498 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -86,6 +86,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3* dev_newPos; + // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation int gridCellCount; @@ -137,39 +139,61 @@ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, flo * Initialize memory, update some globals */ void Boids::initSimulation(int N) { - numObjects = N; - dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + numObjects = N; + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); - // LOOK-1.2 - This is basic CUDA memory management and error checking. - // Don't forget to cudaFree in Boids::endSimulation. - cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); + // LOOK-1.2 - This is basic CUDA memory management and error checking. + // Don't forget to cudaFree in Boids::endSimulation. 
+ cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); - cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); + cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); + cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - // LOOK-1.2 - This is a typical CUDA kernel invocation. - kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects, + // LOOK-1.2 - This is a typical CUDA kernel invocation. + kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects, dev_pos, scene_scale); - checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); + checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); - // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); - int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; - gridSideCount = 2 * halfSideCount; + // LOOK-2.1 computing grid params + gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; + gridSideCount = 2 * halfSideCount; - gridCellCount = gridSideCount * gridSideCount * gridSideCount; - gridInverseCellWidth = 1.0f / gridCellWidth; - float halfGridWidth = gridCellWidth * halfSideCount; - gridMinimum.x -= halfGridWidth; - gridMinimum.y -= halfGridWidth; - gridMinimum.z -= halfGridWidth; + gridCellCount = gridSideCount * gridSideCount * gridSideCount; + gridInverseCellWidth = 1.0f / gridCellWidth; + float halfGridWidth = gridCellWidth * halfSideCount; + gridMinimum.x -= halfGridWidth; + gridMinimum.y -= halfGridWidth; + gridMinimum.z -= halfGridWidth; - // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
- cudaDeviceSynchronize(); + // TODO-2.1 - Allocate additional buffers here. + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + // TODO-2.3 - Allocate additional buffers here. + + cudaMalloc((void**)&dev_newPos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_newPos failed!"); + + dev_thrust_particleArrayIndices = + thrust::device_pointer_cast(dev_particleArrayIndices); + dev_thrust_particleGridIndices = + thrust::device_pointer_cast(dev_particleGridIndices); + + cudaDeviceSynchronize(); } @@ -230,10 +254,63 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays. 
*/ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + // Rule 2: boids try to stay a distance d away from each other + // Rule 3: boids try to match the speed of surrounding boids + + float rule1NumNeighbor = 0.f; + float rule3NumNeighbor = 0.f; + + glm::vec3 rule1Vel(0.f); + glm::vec3 rule2Vel(0.f); + glm::vec3 rule3Vel(0.f); + + const glm::vec3 myPos(pos[iSelf]); + const glm::vec3 myVel(vel[iSelf]); + + for (int i = 0; i < N; i++) { + + if (i != iSelf) { + + const glm::vec3 otherPos(pos[i]); + + const float dis = glm::distance(otherPos, myPos); + + // Compute Rule 1 + if (dis < rule1Distance) { + rule1Vel += otherPos; + rule1NumNeighbor++; + } + + // Compute Rule 2 + if (dis < rule2Distance) { + rule2Vel -= (otherPos - myPos); + } + + // Compute Rule 3 + if (dis < rule3Distance) { + rule3Vel += vel[i]; + rule3NumNeighbor++; + } + } + } + + glm::vec3 newVel = myVel; + + if (rule1NumNeighbor > 0) { + newVel += (rule1Vel / rule1NumNeighbor - myPos) * rule1Scale; + } + + newVel += rule2Vel * rule2Scale; + + if (rule3NumNeighbor > 0) { + newVel += (rule3Vel / rule3NumNeighbor) * rule3Scale; + } + + // Clamp + if (glm::length(newVel) > maxSpeed) { newVel = glm::normalize(newVel) * maxSpeed; } + + return newVel; } /** @@ -241,10 +318,18 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po * For each of the `N` bodies, update its position based on its current velocity. 
*/ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, - glm::vec3 *vel1, glm::vec3 *vel2) { - // Compute a new velocity based on pos and vel1 - // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? + glm::vec3 *vel1, glm::vec3 *vel2) { + + // Compute a new velocity based on pos and vel1 + // Clamp the speed + // Record the new velocity into vel2. Question: why NOT vel1? + + // Compute thread index + int iSelf = threadIdx.x + blockIdx.x * blockDim.x; + if (iSelf >= N) { return; } + + // Compute new Vel and update Vel2 + vel2[iSelf] = computeVelocityChange(N, iSelf, pos, vel1); } /** @@ -283,12 +368,26 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { } __global__ void kernComputeIndices(int N, int gridResolution, - glm::vec3 gridMin, float inverseCellWidth, - glm::vec3 *pos, int *indices, int *gridIndices) { + glm::vec3 gridMin, float inverseCellWidth, + glm::vec3 *pos, int *indices, int *gridIndices) { // TODO-2.1 // - Label each boid with the index of its grid cell. // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Find 3D position of grid containing boid + const glm::vec3 boidPos(pos[index]); + const glm::vec3 gridPos(glm::floor((boidPos - gridMin) * inverseCellWidth)); + + // Map boid index to grid index + gridIndices[index] = gridIndex3Dto1D(gridPos.x, gridPos.y, gridPos.z, gridResolution); + // Parallel array + indices[index] = index; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -301,95 +400,350 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { } __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, - int *gridCellStartIndices, int *gridCellEndIndices) { - // TODO-2.1 - // Identify the start point of each cell in the gridIndices array. 
- This is basically a parallel unrolling of a loop that goes - "this index doesn't match the one before it, must be a new cell!" + int *gridCellStartIndices, int *gridCellEndIndices) { + // TODO-2.1 + // Identify the start point of each cell in the gridIndices array. + // This is basically a parallel unrolling of a loop that goes + // "this index doesn't match the one before it, must be a new cell!" + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Find neighboring grid indices + const int gridIdx = particleGridIndices[index]; + int prevGridIdx = (index > 0) ? particleGridIndices[index - 1] : -1; + int nextGridIdx = (index < N - 1) ? particleGridIndices[index + 1] : -1; + + // If this boid starts a new cell, set the start index + if (index == 0 || (prevGridIdx != gridIdx)) { + gridCellStartIndices[gridIdx] = index; + } + + // If this boid ends a cell, set the end index + if (index == N - 1 || (gridIdx != nextGridIdx)) { + gridCellEndIndices[gridIdx] = index; + } } __global__ void kernUpdateVelNeighborSearchScattered( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - int *particleArrayIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance.
- // - Clamp the speed change before putting the new speed in vel2 + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + int *particleArrayIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. + // - Identify the grid cell that this particle is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Instantiate Variables + glm::vec3 newVel(0.f); + glm::vec3 rule1Vel(0.f); + glm::vec3 rule2Vel(0.f); + glm::vec3 rule3Vel(0.f); + + float rule1NumNeighbors = 0.f; + float rule3NumNeighbors = 0.f; + + // Find grid boid position, grid position, and grid index + int boidPtr = particleArrayIndices[index]; + glm::vec3 boidPos = pos[boidPtr]; + glm::vec3 gridPos = glm::floor((boidPos - gridMin) * inverseCellWidth); + glm::vec3 gridPosRounded = glm::round((boidPos - gridMin) * inverseCellWidth); + + // Save a little memory access time + const int refX = gridPos.x; const int refY = gridPos.y; const int refZ = gridPos.z; + + // Find the coordinates of the nearest 2x2x2 block of cells + int minX = (gridPos.x == gridPosRounded.x) ? imax(0, refX - 1) : refX; + int maxX = (gridPos.x == gridPosRounded.x) ? refX : imin(gridResolution - 1, refX + 1); + int minY = (gridPos.y == gridPosRounded.y) ? imax(0, refY - 1) : refY; + int maxY = (gridPos.y == gridPosRounded.y) ? refY : imin(gridResolution - 1, refY + 1); + int minZ = (gridPos.z == gridPosRounded.z) ? imax(0, refZ - 1) : refZ; + int maxZ = (gridPos.z == gridPosRounded.z) ? refZ : imin(gridResolution - 1, refZ + 1); + + for (int z = minZ; z <= maxZ; z++) { + for (int y = minY; y <= maxY; y++) { + for (int x = minX; x <= maxX; x++) { + + // Convert grid to 1D index value + int gridIdx = gridIndex3Dto1D(x, y, z, gridResolution); + + // Find cell of boids in grid + int gridCellStartIndices_Idx = gridCellStartIndices[gridIdx]; + int gridCellEndIndices_Idx = gridCellEndIndices[gridIdx]; + if (gridCellStartIndices_Idx == -1 || gridCellEndIndices_Idx == -1) { + continue; + } + + // Implement rules to mimic flocking (the end index is inclusive) + for (int i = gridCellStartIndices_Idx; i <= gridCellEndIndices_Idx; i++) { + int locBoidPtr = particleArrayIndices[i]; + // Compare boid indices, not the sorted-array position + if (locBoidPtr != boidPtr) { + + glm::vec3 otherPos = pos[locBoidPtr]; + float dis = glm::distance(otherPos, boidPos); + + if (dis < rule1Distance) { + rule1Vel += otherPos; + rule1NumNeighbors++; + } + + if (dis < rule2Distance) { + rule2Vel -= (otherPos - boidPos); + } + + if (dis < rule3Distance) { + rule3Vel += vel1[locBoidPtr]; + rule3NumNeighbors++; + } + } + } + } + } + } + + newVel += vel1[boidPtr]; + + if (rule1NumNeighbors > 0) { + newVel += (rule1Vel / rule1NumNeighbors - boidPos) * rule1Scale; + } + + newVel += rule2Vel * rule2Scale; + + if (rule3NumNeighbors > 0) { + newVel += (rule3Vel / rule3NumNeighbors) * rule3Scale; + } + + // Clamp new velocity and save it in Vel2 + if (glm::length(newVel) > maxSpeed) { newVel = glm::normalize(newVel) * maxSpeed; } + vel2[boidPtr] = newVel; } __global__ void kernUpdateVelNeighborSearchCoherent( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, - except with one less level of indirection.
- // This should expect gridCellStartIndices and gridCellEndIndices to refer - // directly to pos and vel1. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, + // except with one less level of indirection. + // This should expect gridCellStartIndices and gridCellEndIndices to refer + // directly to pos and vel1. + // - Identify the grid cell that this particle is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. 
+ // - Clamp the speed change before putting the new speed in vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Instantiate Variables + glm::vec3 newVel(0.f); + glm::vec3 rule1Vel(0.f); + glm::vec3 rule2Vel(0.f); + glm::vec3 rule3Vel(0.f); + + float rule1NumNeighbors = 0.f; + float rule3NumNeighbors = 0.f; + + // Find grid boid position, grid position, and grid index + glm::vec3 boidPos = pos[index]; + glm::vec3 gridPos = glm::floor((boidPos - gridMin) * inverseCellWidth); + glm::vec3 gridPosRounded = glm::round((boidPos - gridMin) * inverseCellWidth); + + // Save a little memory access time + const int refX = gridPos.x; const int refY = gridPos.y; const int refZ = gridPos.z; + + // Find the coordinates of the nearest 2x2x2 block of cells + int minX = (gridPos.x == gridPosRounded.x) ? imax(0, refX - 1) : refX; + int maxX = (gridPos.x == gridPosRounded.x) ? refX : imin(gridResolution - 1, refX + 1); + int minY = (gridPos.y == gridPosRounded.y) ? imax(0, refY - 1) : refY; + int maxY = (gridPos.y == gridPosRounded.y) ? refY : imin(gridResolution - 1, refY + 1); + int minZ = (gridPos.z == gridPosRounded.z) ? imax(0, refZ - 1) : refZ; + int maxZ = (gridPos.z == gridPosRounded.z) ? refZ : imin(gridResolution - 1, refZ + 1); + + for (int z = minZ; z <= maxZ; z++) { + for (int y = minY; y <= maxY; y++) { + for (int x = minX; x <= maxX; x++) { + + // Convert grid to 1D index value + int gridIdx = gridIndex3Dto1D(x, y, z, gridResolution); + + // Find cell of boids in grid + int gridCellStartIndices_Idx = gridCellStartIndices[gridIdx]; + int gridCellEndIndices_Idx = gridCellEndIndices[gridIdx]; + if (gridCellStartIndices_Idx == -1 || gridCellEndIndices_Idx == -1) { + continue; + } + + // Implement rules to mimic flocking (the end index is inclusive) + for (int i = gridCellStartIndices_Idx; i <= gridCellEndIndices_Idx; i++) { + int locBoidPtr = i; + if (locBoidPtr != index) { + + glm::vec3 otherPos = pos[locBoidPtr]; + float dis = glm::distance(otherPos, boidPos); + + if (dis < rule1Distance) { + rule1Vel += otherPos; + rule1NumNeighbors++; + } + + if (dis < rule2Distance) { + rule2Vel -= (otherPos - boidPos); + } + + if (dis < rule3Distance) { + rule3Vel += vel1[locBoidPtr]; + rule3NumNeighbors++; + } + } + } + } + } + } + + newVel += vel1[index]; + + if (rule1NumNeighbors > 0) { + newVel += (rule1Vel / rule1NumNeighbors - boidPos) * rule1Scale; + } + + newVel += rule2Vel * rule2Scale; + + if (rule3NumNeighbors > 0) { + newVel += (rule3Vel / rule3NumNeighbors) * rule3Scale; + } + + // Clamp new velocity and save it in Vel2 + if (glm::length(newVel) > maxSpeed) { newVel = glm::normalize(newVel) * maxSpeed; } + vel2[index] = newVel; } /** * Step the entire N-body simulation by `dt` seconds. */ void Boids::stepSimulationNaive(float dt) { - // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+ // TODO-1.2 ping-pong the velocity buffers + + // Compute Vel, then Sync + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2); + cudaDeviceSynchronize(); + + // Swap Vel pointers, compute new Pos + std::swap(dev_vel1, dev_vel2); + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1); } void Boids::stepSimulationScatteredGrid(float dt) { - // TODO-2.1 - // Uniform Grid Neighbor search using Thrust sort. - // In Parallel: - // - label each particle with its array index as well as its grid index. - // Use 2x width grids. - // - Unstable key sort using Thrust. A stable sort isn't necessary, but you - // are welcome to do a performance comparison.
+ // - Naively unroll the loop for finding the start and end indices of each + // cell's data pointers in the array of boid indices + // - Perform velocity updates using neighbor search + // - Update positions + // - Ping-pong buffers as needed + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize); + + kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + // Reset cell markers so cells left empty this frame are skipped + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1); + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); + cudaDeviceSynchronize(); + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); + + std::swap(dev_vel1, dev_vel2); +} + +__global__ void kernRearrange(int N, glm::vec3* currPos, glm::vec3* prevPos, int* particleArrayIndices) { + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + currPos[index] = prevPos[particleArrayIndices[index]]; }
- // - Naively unroll the loop for finding the start and end indices of each - // cell's data pointers in the array of boid indices - // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all - // the particle data in the simulation array. - // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED - // - Perform velocity updates using neighbor search - // - Update positions - // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid + // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. + // In Parallel: + // - Label each particle with its array index as well as its grid index. + // Use 2x width grids + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you + // are welcome to do a performance comparison. + // - Naively unroll the loop for finding the start and end indices of each + // cell's data pointers in the array of boid indices + // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all + // the particle data in the simulation array. + // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED + // - Perform velocity updates using neighbor search + // - Update positions + // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. 
+ + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize); + + kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + kernRearrange<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_newPos, dev_pos, dev_particleArrayIndices); + kernRearrange<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel2, dev_vel1, dev_particleArrayIndices); + cudaDeviceSynchronize(); + + // Reset cell markers so cells left empty this frame are skipped + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1); + kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_newPos, dev_vel2, dev_vel1); + cudaDeviceSynchronize(); + + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_newPos, dev_vel1); + std::swap(dev_newPos, dev_pos); + } void Boids::endSimulation() { - cudaFree(dev_vel1); - cudaFree(dev_vel2); - cudaFree(dev_pos); - - // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_vel1); + cudaFree(dev_vel2); + cudaFree(dev_pos); + + // TODO-2.1 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + + // TODO-2.3 - Free any additional buffers here.
+ cudaFree(dev_newPos); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..4acfa11 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,12 +13,12 @@ // ================ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID -#define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define VISUALIZE 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; +const int N_FOR_VIS = 120000; const float DT = 0.2f; /**