diff --git a/.gitignore b/.gitignore
index 2e7a3c7..ab7bf3c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,7 +6,8 @@ cis565_getting_started_generated_kernel*
 *.vcxproj
 *.xcodeproj
 build
-
+*.exe
+*.py
 # Created by https://www.gitignore.io/api/linux,osx,sublimetext,windows,jetbrains,vim,emacs,cmake,c++,cuda,visualstudio,webstorm,eclipse,xcode
 ### Linux ###
diff --git a/README.md b/README.md
index d63a6a1..8a5855c 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,79 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
-Project 1 - Flocking**
+Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Nuofan Xu
+* Tested on: Windows 10, AMD Ryzen 3800X @ 3.9GHz, 2x16GB RAM, RTX 2080 Super 8GB
-### (TODO: Your README)
+### Overview
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+[![](/images/boids.gif)](https://vimeo.com/233558094)
+
+###### (Run on RTX 2080 Super with 10,000 boids; recorded as a GIF at low fps due to GitHub's file size limits)
+
+This project served as an introduction to CUDA kernels and performance analysis. It implements the Reynolds Boids flocking algorithm in parallel on the GPU.
+
+Boids is a crowd simulation algorithm developed by Craig Reynolds in 1986, modeled after the flocking behavior exhibited by birds and fish. The most basic version of the algorithm operates on three rules:
+* Cohesion: Boids try to move towards the perceived center of mass of other boids around them.
+* Alignment: Boids tend to steer in the direction of the perceived average movement of other boids around them.
+* Separation: Boids try to keep some amount of distance between themselves and other boids.
+
+### Three implementations:
+#### 1. Naive Approach
+
+The naive approach computes each boid's new velocity by checking it against every other boid (excluding itself) and applying the three rules described above. This is extremely slow, with O(N^2) time complexity, since every boid must scan all other boids to update its velocity and position.
+
+A simple way to implement this in CUDA is to have one thread for each boid. Each thread loops over the entire position and velocity buffers (skipping itself) and computes a new velocity for that boid. A second CUDA kernel then applies the position update using the new velocity data.
+
+#### 2. Uniform Scattered Grid
+
+A spatial data structure can greatly improve the performance of the algorithm: if we reduce the number of boids each boid has to check against, we reduce the amount of work each thread does. We set a search radius and only apply the three rules within it. We impose a uniformly spaced grid on the boids, which allows a very efficient neighbor search that examines only a small number of boids. We choose the uniform grid cell width to be twice the maximum search radius, so that a boid can only be influenced by boids in the 8 cells closest to it.
+
+To build the "grid", we add a CUDA kernel that fills a buffer with each boid's grid cell index. A parallel sorting algorithm, thrust::sort_by_key, is then used to sort boid indices by their corresponding grid cell indices (see the sketch below).
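+A condensed sketch of this binning-and-sort step is shown below. It is illustrative only: names such as kernLabelBoids, dev_gridIndices, and dev_boidIndices are placeholders, and it assumes the project's glm/Thrust headers. The full versions (kernComputeIndices and the Thrust calls in Boids::stepSimulationScatteredGrid) appear in src/kernel.cu further down in this diff.
+
+```cuda
+// Label each boid with the 1D index of the grid cell containing it, and
+// record a parallel array of boid indices to be sorted alongside the labels.
+__global__ void kernLabelBoids(int N, int gridResolution, glm::vec3 gridMin,
+                               float inverseCellWidth, const glm::vec3 *pos,
+                               int *gridIndices, int *boidIndices) {
+  int i = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (i >= N) return;
+  // Which cell does this boid fall into?
+  glm::ivec3 cell = glm::ivec3((pos[i] - gridMin) * inverseCellWidth);
+  gridIndices[i] = cell.x + cell.y * gridResolution
+                 + cell.z * gridResolution * gridResolution; // 3D cell -> 1D index
+  boidIndices[i] = i;  // parallel array of pointers into pos/vel
+}
+
+// Host side: sort the boid indices by grid cell index (an unstable sort is fine).
+// thrust::device_ptr<int> keys(dev_gridIndices), values(dev_boidIndices);
+// thrust::sort_by_key(keys, keys + N, values);
+```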
+After sorting, we store the "start" and "end" indices for each grid cell in two new buffers by checking for transitions between cell values in the sorted grid cell buffer.
+
+Now, instead of checking against all other boids, each thread in our velocity update can determine its boid's grid cell index from its position; this lets us determine the 8 neighboring cells in 3D space and only apply the rules to boids whose index falls between the "start" and "end" of a neighboring cell.
+
+#### 3. Uniform Coherent Grid
+
+Pointers to boid data are contiguous in memory, but the boid data they point to are not. We can further speed up the algorithm by making memory accesses to boid data more coherent. The uniform scattered grid implementation does unnecessary hopping and pointer-chasing. Instead of using an additional sorted buffer to find the index into our particle data, we can shuffle the particle data itself so that it, too, is sorted by grid cell index. This lets the GPU load and cache the boids' position and velocity data, resulting in fewer global memory accesses. This small change ends up making a significant performance improvement.
+
+### Performance Analysis
+
+#### Effect of boid count on framerate
+
+Tests were run with a block size of 128 on all three implementations with the visualization window on. Vertical sync is turned off to avoid capping the framerate at 60 FPS.
+
+![](images/vis_on.PNG)
+
+Tests were also run with a block size of 128 on all three implementations with the visualization window off.
+
+![](images/vis_off.PNG)
+
+##### The Naive implementation stops working for >= 480,000 boids, as the CUDA kernels refuse to launch with such high boid counts.
+
+As expected, all three implementations slow down as the number of boids increases, with or without the visualization window, because of the increasing amount of calculation that has to be done. There are some interesting observations in these graphs. The Uniform Scattered Grid implementation greatly improves performance over the naive approach, and the Uniform Coherent Grid implementation surpasses even that. Intuitively, this makes sense: the Uniform Scattered Grid reduces the number of boids that have to be checked against each boid by restricting the search to a small neighborhood, and the Uniform Coherent Grid makes the boid data memory-coherent, so accesses are faster thanks to better caching and fewer calls to global memory. In general, the framerate without visualization is higher, but the trend is the same, and the results converge as the boid count increases (at lower FPS).
+
+#### Effect of block size on framerate
+
+Tests were run with 5,000 boids on all three implementations with the visualization window off.
+
+![](images/block_size.PNG)
+
+From the graph we can see that block size barely affects the Uniform Coherent Grid implementation. Generally speaking, increasing the block size decreases the performance of the boid simulation. This might be because, as each block holds more warps, we gradually lose any performance benefit of sharing memory within a block while still paying for the larger per-block resource allocation.
+
+Block size does not affect the Uniform Scattered and Uniform Coherent Grid implementations as much as it affects the Naive implementation.
+My guess is that the Naive implementation issues many more memory accesses spread across different blocks, so the performance benefit of reusing memory within a block carries less weight.
+
+#### Effect of changing search radius (26/8 neighboring cells) and grid cell size
+
+![](images/neighboring.PNG)
+
+As can be seen in the table above, an increase in the number of neighboring grid cells we check results in a lower framerate, as does an increase in cell width. The former is easy to understand: the more boids we have to check against the current boid, the more calculation and memory accesses need to be done. The latter is because bigger cells are likely to contain more boids. By cutting the cell width in half and checking 26 neighboring cells, we essentially just increase the overhead of maintaining the grid without a significant reduction in the number of boids we check. It is possible to achieve a better framerate by increasing the cell width and decreasing the number of neighboring grid cells that have to be checked, although the gain comes mainly from reducing the number of boids to check (fewer memory accesses and calculations).
+
+### Feedback
+
+Any feedback on errors in the above analysis or anywhere else is appreciated.
\ No newline at end of file
diff --git a/images/block_size.PNG b/images/block_size.PNG
new file mode 100644
index 0000000..8e74cc0
Binary files /dev/null and b/images/block_size.PNG differ
diff --git a/images/block_size_plt.PNG b/images/block_size_plt.PNG
new file mode 100644
index 0000000..2759e83
Binary files /dev/null and b/images/block_size_plt.PNG differ
diff --git a/images/boids.gif b/images/boids.gif
new file mode 100644
index 0000000..dbe07d7
Binary files /dev/null and b/images/boids.gif differ
diff --git a/images/neighboring.PNG b/images/neighboring.PNG
new file mode 100644
index 0000000..de65ea3
Binary files /dev/null and b/images/neighboring.PNG differ
diff --git a/images/vis_off.PNG b/images/vis_off.PNG
new file mode 100644
index 0000000..f6e38f7
Binary files /dev/null and b/images/vis_off.PNG differ
diff --git a/images/vis_off_plt.PNG b/images/vis_off_plt.PNG
new file mode 100644
index 0000000..cba5087
Binary files /dev/null and b/images/vis_off_plt.PNG differ
diff --git a/images/vis_on.PNG b/images/vis_on.PNG
new file mode 100644
index 0000000..ba811dc
Binary files /dev/null and b/images/vis_on.PNG differ
diff --git a/images/vis_on_plt.PNG b/images/vis_on_plt.PNG
new file mode 100644
index 0000000..df50106
Binary files /dev/null and b/images/vis_on_plt.PNG differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..e86dbaf 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -37,7 +37,7 @@ void checkCUDAError(const char *msg, int line = -1) {
 *****************/
 /*! Block size used for CUDA kernel launch. */
-#define blockSize 128
+#define blockSize 32
 // LOOK-1.2 Parameters for the boids algorithm.
 // These worked well in our reference implementation.
@@ -85,6 +85,8 @@ int *dev_gridCellEndIndices;   // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_coherentVel;
+glm::vec3 *dev_coherentPos;
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -169,6 +171,23 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
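+  // Additional buffers for the uniform-grid paths: per-boid grid-cell and
+  // array indices, per-cell start/end markers, and coherent (reshuffled)
+  // copies of the position and velocity data.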
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_coherentPos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_coherentPos failed!");
+
+  cudaMalloc((void**)&dev_coherentVel, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_coherentVel failed!");
   cudaDeviceSynchronize();
 }
@@ -233,7 +252,62 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v1 = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v2 = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v3 = glm::vec3(0.0f, 0.0f, 0.0f);
+
+  glm::vec3 percieved_center_of_mass = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 perceived_velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 separate_vector = glm::vec3(0.0f, 0.0f, 0.0f);
+
+  int neighborCount1 = 0;
+  int neighborCount3 = 0;
+
+  float distance = 0.0f;
+
+  for (int i = 0; i < N; i++)
+  {
+    if (i != iSelf)
+    {
+      // 3 rules for the basic boids algorithm
+      distance = glm::distance(pos[i], pos[iSelf]);
+      if (distance < rule1Distance)
+      {
+        percieved_center_of_mass += pos[i];
+        neighborCount1++;
+      }
+
+      if (distance < rule2Distance)
+      {
+        separate_vector -= (pos[i] - pos[iSelf]);
+      }
+
+      if (distance < rule3Distance)
+      {
+        perceived_velocity += vel[i];
+        neighborCount3++;
+      }
+    }
+  }
+
+  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+  if (neighborCount1 != 0)
+  {
+    percieved_center_of_mass /= neighborCount1;
+    v1 = (percieved_center_of_mass - pos[iSelf])*rule1Scale;
+  }
+
+  // Rule 2: boids try to stay a distance d away from each other
+  v2 = separate_vector*rule2Scale;
+
+  // Rule 3: boids try to match the speed of surrounding boids
+  if (neighborCount3 != 0)
+  {
+    perceived_velocity /= neighborCount3;
+    v3 = perceived_velocity*rule3Scale;
+  }
+
+  return v1 + v2 + v3;
 }
 
 /**
@@ -242,9 +316,19 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
   // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+  glm::vec3 newVel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+  // Clamp the speed
+  newVel = (glm::length(newVel) > maxSpeed) ? glm::normalize(newVel) * maxSpeed : newVel;
+  // Record the new velocity into vel2.
+  vel2[index] = newVel;
+  // Question: why NOT vel1? vel1 is still being read by other threads in this kernel;
+  // writing to it without synchronization would cause races and inconsistent data.
+  // This is why we ping-pong the velocity buffers.
 }
 
 /**
@@ -289,10 +373,25 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   // - Label each boid with the index of its grid cell.
   // - Set up a parallel array of integer indices as pointers to the actual
   //   boid data in pos and vel1/vel2
+
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
+
+  // Go through the boids and determine which grid cell to bin them into
+  glm::ivec3 boidPos = (pos[index] - gridMin) * inverseCellWidth;
+  gridIndices[index] = gridIndex3Dto1D(boidPos.x, boidPos.y, boidPos.z, gridResolution);
+  // Set up a parallel array of integer indices as pointers to the actual
+  // boid data in pos and vel1/vel2
+  indices[index] = index;
+
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
 // does not enclose any boids
+// Set the start and end indices to -1 when resetting
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if (index < N) {
@@ -306,6 +405,49 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  // Go through particleGridIndices and identify where the value changes,
+  // which signifies a transition to a new grid cell.
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) // N is numObjects
+  {
+    return;
+  }
+
+  if (index == 0) // leftmost case: the first boid always starts its cell
+  {
+    gridCellStartIndices[particleGridIndices[index]] = 0;
+  }
+  if (index == N - 1) // rightmost case: the last boid always ends its cell
+  {
+    gridCellEndIndices[particleGridIndices[index]] = N - 1;
+  }
+  if (index < N - 1 && particleGridIndices[index] != particleGridIndices[index + 1])
+  {
+    // Grid cells with no boids keep the -1 written when both arrays were reset.
+
+    // Transition between two grid cells
+    gridCellEndIndices[particleGridIndices[index]] = index;
+    gridCellStartIndices[particleGridIndices[index + 1]] = index + 1;
+  }
+
+}
+
+// Store the reshuffled position and velocity buffers that are more memory coherent in new coherentPos and coherentVel buffers
+__global__ void kernSetCoherentPosVel(int N, int *particleArrayIndices,
+  int *gridCellStartIndices, int *gridCellEndIndices,
+  const glm::vec3 *pos, const glm::vec3 *vel,
+  glm::vec3 *coherentPos, glm::vec3 *coherentVel)
+{
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
+
+  int coherentindex = particleArrayIndices[index];
+
+  coherentPos[index] = pos[coherentindex];
+  coherentVel[index] = vel[coherentindex];
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -317,11 +459,110 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
   // the number of boids that need to be checked.
   // - Identify the grid cell that this particle is in
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
   // - Identify which cells may contain neighbors. This isn't always 8.
+  glm::ivec3 boidPos = (pos[index] - gridMin) * inverseCellWidth;
+  int x = boidPos.x;
+  int y = boidPos.y;
+  int z = boidPos.z;
+
+  int neighborCount1 = 0;
+  int neighborCount3 = 0;
+  float distance = 0.0f;
+
+  glm::vec3 v1 = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v2 = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v3 = glm::vec3(0.0f, 0.0f, 0.0f);
+
+  glm::vec3 percieved_center_of_mass = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 perceived_velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 separate_vector = glm::vec3(0.0f, 0.0f, 0.0f);
   // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+
+  for (int i = -1; i <= 1; i++)
+  {
+    for (int j = -1; j <= 1; j++)
+    {
+      for (int k = -1; k <= 1; k++)
+      {
+        int _x = x + i;
+        int _y = y + j;
+        int _z = z + k;
+        // prevent negative cell coordinates
+        _x = imax(_x, 0);
+        _y = imax(_y, 0);
+        _z = imax(_z, 0);
+        // prevent indexing past the last cell of the grid
+        _x = imin(_x, gridResolution - 1);
+        _y = imin(_y, gridResolution - 1);
+        _z = imin(_z, gridResolution - 1);
+        // find the 1D cell index of each surrounding cell
+        int boidGridCellIndex = gridIndex3Dto1D(_x, _y, _z, gridResolution);
+        if (gridCellStartIndices[boidGridCellIndex] != -1)
+        {
+          // Check whether the grid cell is empty (its start and end indices are still -1).
+
+          // For each cell that contains boids and needs to be checked,
+          // read the start/end indices in the boid pointer array.
+
+          // Now go through the boids in that grid cell and apply the rules
+          // to them if they fall within the neighborhood distance
+          for (int l = gridCellStartIndices[boidGridCellIndex]; l <= gridCellEndIndices[boidGridCellIndex]; l++)
+          {
+            // Access each boid in the cell and compute velocity change from
+            int bindex = particleArrayIndices[l];
+            if (bindex != index) // skip the boid itself
+            {
+              // Compute velocity change based on rules
+              distance = glm::distance(pos[bindex], pos[index]);
+              if (distance < rule1Distance)
+              {
+                percieved_center_of_mass += pos[bindex];
+                neighborCount1++;
+              }
+
+              if (distance < rule2Distance)
+              {
+                separate_vector -= (pos[bindex] - pos[index]);
+              }
+
+              if (distance < rule3Distance)
+              {
+                perceived_velocity += vel1[bindex];
+                neighborCount3++;
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  // - Access each boid in the cell and compute velocity change from the boids rules, if this boid is within the neighborhood distance.
+  if (neighborCount1 != 0)
+  {
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    percieved_center_of_mass /= neighborCount1;
+    v1 = (percieved_center_of_mass - pos[index])*rule1Scale;
+  }
+
+  // Rule 2: boids try to stay a distance d away from each other
+  v2 = separate_vector*rule2Scale;
+
+  // Rule 3: boids try to match the speed of surrounding boids
+  if (neighborCount3 != 0)
+  {
+    perceived_velocity /= neighborCount3;
+    v3 = perceived_velocity*rule3Scale;
+  }
+
+  glm::vec3 newVel = vel1[index] + v1 + v2 + v3;
+
+  // Clamp the speed change before putting the new speed in vel2
+  vel2[index] = (glm::length(newVel) > maxSpeed) ? glm::normalize(newVel) * maxSpeed : newVel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +582,115 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N)
+  {
+    return;
+  }
+
+  // Find the boid's position,
+  // then use that position to determine the grid cell it belongs to,
+  // and use that information to find the surrounding cells that have to be checked
+  glm::ivec3 boidPos = (pos[index] - gridMin) * inverseCellWidth;
+  int x = boidPos.x;
+  int y = boidPos.y;
+  int z = boidPos.z;
+
+  glm::vec3 v1 = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v2 = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 v3 = glm::vec3(0.0f, 0.0f, 0.0f);
+
+  glm::vec3 percieved_center_of_mass = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 perceived_velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 separate_vector = glm::vec3(0.0f, 0.0f, 0.0f);
+
+  int neighborCount1 = 0;
+  int neighborCount3 = 0;
+
+  float distance = 0.0f;
+
+  // DIFFERENCE: For best results, consider what order the cells should be
+  // checked in to maximize the memory benefits of reordering the boids data.
+  for (int k = -1; k <= 1; k++) // z axis
+  {
+    for (int j = -1; j <= 1; j++) // y axis
+    {
+      for (int i = -1; i <= 1; i++) // x axis
+      {
+        int _x = x + i;
+        int _y = y + j;
+        int _z = z + k;
+
+        _x = imax(_x, 0);
+        _y = imax(_y, 0);
+        _z = imax(_z, 0);
+
+        _x = imin(_x, gridResolution - 1);
+        _y = imin(_y, gridResolution - 1);
+        _z = imin(_z, gridResolution - 1);
+
+        int boidGridCellindex = gridIndex3Dto1D(_x, _y, _z, gridResolution);
+
+        // Identify which cells may contain neighbors. This isn't always 8.
+        // For each cell, read the start/end indices in the boid pointer array.
+
+        if (gridCellStartIndices[boidGridCellindex] != -1)
+        {
+          // We know the grid cell is empty if its start and end indices are still -1.
+
+          // Now go through the boids in that grid cell and apply the rules
+          // to them if they fall within the neighborhood distance
+          for (int l = gridCellStartIndices[boidGridCellindex]; l <= gridCellEndIndices[boidGridCellindex]; l++)
+          {
+            if (l != index) // pos/vel are already in sorted order here, so l is the boid's own index
+            {
+              // Access each boid in the cell and compute velocity change from
+              // the boids rules, if this boid is within the neighborhood distance.
+              distance = glm::distance(pos[l], pos[index]);
+              if (distance < rule1Distance)
+              {
+                percieved_center_of_mass += pos[l];
+                neighborCount1++;
+              }
+
+              if (distance < rule2Distance)
+              {
+                separate_vector -= (pos[l] - pos[index]);
+              }
+
+              if (distance < rule3Distance)
+              {
+                perceived_velocity += vel1[l];
+                neighborCount3++;
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+  if (neighborCount1 != 0)
+  {
+    percieved_center_of_mass /= neighborCount1;
+    v1 = (percieved_center_of_mass - pos[index])*rule1Scale;
+  }
+
+  // Rule 2: boids try to stay a distance d away from each other
+  v2 = separate_vector*rule2Scale;
+
+  // Rule 3: boids try to match the speed of surrounding boids
+  if (neighborCount3 != 0)
+  {
+    perceived_velocity /= neighborCount3;
+    v3 = perceived_velocity*rule3Scale;
+  }
+
+  glm::vec3 newVel = vel1[index] + v1 + v2 + v3;
+
+  // Clamp the speed change before putting the new speed in vel2
+  vel2[index] = (glm::length(newVel) > maxSpeed) ? glm::normalize(newVel) * maxSpeed : newVel;
 }
 
 /**
@@ -349,21 +699,55 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  // Ping-pong the velocity buffers
+  std::swap(dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
   // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid_gridsize((gridCellCount + blockSize - 1) / blockSize); // only dim1 given; dim3 defaults the other dimensions to 1
+  dim3 fullBlocksPerGrid_boids((numObjects + blockSize - 1) / blockSize);
+
+  // Reset the cell start and end index buffers
+  kernResetIntBuffer<<<fullBlocksPerGrid_gridsize, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<fullBlocksPerGrid_gridsize, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+
+  // Label each particle with its array index as well as its grid index.
+  // Use 2x width grids.
+  // Recompute grid cell indices and particle array indices every timestep
+  kernComputeIndices<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, gridSideCount,
+    gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  // Now sort dev_particleGridIndices so that boids belonging to the same grid cell
+  // are next to each other in the gridIndices array --> use Thrust to sort the array
+
+  // Wrap the device pointers in thrust iterators for use with Thrust.
+  thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+  thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+  // Unstable key sort using Thrust. A stable sort isn't necessary
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+  checkCUDAErrorWithLine("thrust sorting failed!");
+
+  // Assuming the boid grid indices are sorted, assign values to the arrays keeping
+  // track of the data in dev_particleArrayIndices for each cell.
+
+  // Unroll the loop for finding the start and end indices of each cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  // Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+  // Ping-pong the velocity buffers
+  std::swap(dev_vel1, dev_vel2);
+  // Update positions
+  kernUpdatePos<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,14 +766,59 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  dim3 fullBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGrid_boids((numObjects + blockSize - 1) / blockSize);
+  kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  kernComputeIndices<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, gridSideCount,
+    gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices,
+    dev_particleGridIndices);
+  // Now sort dev_particleGridIndices so that boids belonging to the same grid cell
+  // are next to each other in the gridIndices array --> use Thrust to sort the array
+  // Wrap the device pointers in thrust iterators for use with Thrust.
+  thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+  thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+  // Unstable key sort using Thrust. A stable sort isn't necessary
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+  checkCUDAErrorWithLine("thrust sorting failed!");
+  // Assuming the boid grid indices are sorted, assign values to the arrays keeping
+  // track of the data in dev_particleArrayIndices for each cell.
+  // Unroll the loop for finding the start and end indices of each
+  // cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+  // BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all the boid data (position and velocity)
+  // in the simulation arrays, so that it is memory-coherent, arranged in grid-cell order
+  kernSetCoherentPosVel<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, dev_particleArrayIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_pos, dev_vel1, dev_coherentPos, dev_coherentVel);
+  // Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_coherentPos, dev_coherentVel, dev_vel2);
+
+  // Ping-pong the velocity buffers
+  std::swap(dev_vel1, dev_vel2);
+  // Update positions
+  kernUpdatePos<<<fullBlocksPerGrid_boids, blockSize>>>(numObjects, dt, dev_coherentPos, dev_vel1);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+  // Ping-pong the coherent and regular pos buffers
+  std::swap(dev_coherentPos, dev_pos);
 }
 
 void Boids::endSimulation() {
+  // Free buffers here.
   cudaFree(dev_vel1);
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
-
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_coherentPos);
+  cudaFree(dev_coherentVel);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..015b0ae 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,8 +13,8 @@
 // ================
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
-#define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define VISUALIZE 0
+#define UNIFORM_GRID 1
 #define COHERENT_GRID 0
 // LOOK-1.2 - change this to adjust particle count in the simulation
@@ -25,7 +25,13 @@ const float DT = 0.2f;
 * C main function.
 */
 int main(int argc, char* argv[]) {
+  mode = " - NAIVE";
+  if (UNIFORM_GRID == 1) mode = " - UNIFORM_GRID";
+  if (COHERENT_GRID == 1) mode = " - COHERENT_GRID";
   projectName = "565 CUDA Intro: Boids";
+  std::cout << mode << std::endl;
+  // strcpy(projName, projectName);
+  // strcat(projName, mode);
 
   if (init(argc, argv)) {
     mainLoop();
diff --git a/src/main.hpp b/src/main.hpp
index 88e9df7..8fda2d9 100644
--- a/src/main.hpp
+++ b/src/main.hpp
@@ -58,6 +58,8 @@ glm::mat4 projection;
 //====================================
 const char *projectName;
+const char *mode;   // assigned from string literals, so it must be const
+char *projName;
 int main(int argc, char* argv[]);