diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md index b253a7e6b3..3ad0e9c953 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md @@ -6,27 +6,26 @@ - # Accelerating Video Convolution Filtering Application ***Version: Vitis 2023.1*** -This tutorial introduces you to a compute-intensive application that is accelerated using the Xilinx Alveo Data Center accelerator card. It goes through the design of a specific kernel that runs on the FPGA and briefly discusses optimization of the host-side application for performance. The kernel is designed to maximize throughput, and the host application is optimized to transfer data in an effective manner that moves in-between the host and FPGA card. The host application essentially eliminates the data movement latency by overlapping data transfers for multiple kernel calls. Another essential purpose of this tutorial is to show **_how one can easily estimate the performance of hardware kernels that can be built using Vitis HLS and how accurate and close these estimates are to actual hardware performance_** +This tutorial introduces you to a compute-intensive application that is accelerated using the AMD Alveo™ Data Center accelerator card. It goes through the design of a specific kernel that runs on the field programmable gate array (FPGA) and briefly discusses optimization of the host-side application for performance. The kernel is designed to maximize throughput, and the host application is optimized to transfer data in an effective manner that moves in-between the host and FPGA card. The host application essentially eliminates the data movement latency by overlapping data transfers for multiple kernel calls. Another essential purpose of this tutorial is to show **how one can easily estimate the performance of hardware kernels that can be built using Vitis HLS and how accurate and close these estimates are to actual hardware performance**. ## Introduction to Acceleration -The first lab is designed to let you quickly experience the acceleration performance that can be achieved by porting the video filter to Xilinx's Alveo accelerator card. The Alveo series cards are designed for accelerating data center applications. However, this tutorial can be adapted to other accelerator cards with some simple changes. +The first lab is designed to let you quickly experience the acceleration performance that can be achieved by porting the video filter to AMD's Alveo accelerator card. The Alveo series cards are designed for accelerating data center applications. However, this tutorial can be adapted to other accelerator cards with some simple changes. The steps to be carried out for this first lab include: - Setting up the Vitis application acceleration development flow - Running the hardware optimized accelerator and comparing its performance with a baseline of the application -This lab demonstrates the significant performance gain that can be achieved as compared to CPU performance. Whereas the next labs in this tutorial will illustrate and guide how such performance can be achieved using different optimizations and design techniques for 2D convolution kernels and the host side application. +This lab demonstrates the significant performance gain that can be achieved as compared to the processor performance. 
The next labs in this tutorial will illustrate and guide how such performance can be achieved using different optimizations and design techniques for 2D convolution kernels and the host side application. ### Cloning the GitHub Repository and Setting Up the Vitis Tool -To run this tutorial you will need to clone a git repo and also download and extract some compressed files, please follow the instruction given below: +To run this tutorial, you will need to clone a git repo and also download and extract some compressed files. Use the following instructions: #### Clone Git Repo @@ -38,7 +37,7 @@ git clone https://github.com/Xilinx/Vitis-Tutorials.git #### Copy and Extract Large Files -Copy and extract large files in convolution tutorial directory as follows: +Copy and extract large files in the convolution tutorial directory as follows: ```bash cd /VITIS_TUTORIAL_REPO_PATH/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial @@ -46,11 +45,11 @@ wget https://www.xilinx.com/bin/public/openDownload?filename=conv_tutorial_files tar -xvzf conv_tutorial_files.tar.gz ``` -**NOTE** : VITIS_TUTORIAL_REPO_PATH is the local directory path where git repo is cloned. +>**NOTE:** VITIS_TUTORIAL_REPO_PATH is the local directory path where the git repo is cloned. #### Setting Up the Vitis Tool -Setup the application build and runtime environment using the following commands as per your local installation: +Set up the application build and runtime environment using the following commands as per your local installation: ```bash source /settings64.sh @@ -59,15 +58,15 @@ source /setup.sh ### Baseline the Application Performance -The software application processes High Definition(HD) video frames/images with 1920x1080 resolution. It performs convolution on a set of images and prints the summary of performance results. It is used for measuring baseline software performance. Please the set the environment variable that points to tutorial direction relative to repo path as follow: +The software application processes high definition (HD) video frames/images with 1920x1080 resolution. It performs convolution on a set of images and prints the summary of performance results. It is used for measuring baseline software performance. Set the environment variable that points to tutorial direction relative to the repo path as follows: ```bash export CONV_TUTORIAL_DIR=/VITIS_TUTORIAL_REPO_PATH/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial ``` -where **VITIS_TUTORIAL_REPO_PATH** is the local path where git repo is placed by the user after cloning. +where **VITIS_TUTORIAL_REPO_PATH** is the local path where the git repo is placed by the user after cloning. - **NOTE**: Make sure during all of the labs in this tutorial you have set `CONV_TUTORIAL_DIR` variable appropriately + >**NOTE:** Make sure during all of the labs in this tutorial you have set the `CONV_TUTORIAL_DIR` variable appropriately. Run the application to measure performance as follows: @@ -76,7 +75,7 @@ cd $CONV_TUTORIAL_DIR/sw_run ./run.sh ``` -Results similar to the ones shown below will be printed. Note down the CPU throughput. +Results similar to the following will be printed. Note down the CPU throughput. ```bash ---------------------------------------------------------------------------- @@ -97,14 +96,14 @@ CPU Throughput : 12.7112 MB/s ### Launching the Host Application -Now launch the application, which uses FPGA accelerated video convolution filter. The application will be run on an actual FPGA card, also called System Run. 
+Now launch the application, which uses a FPGA accelerated video convolution filter. The application will be run on an actual FPGA card, also called System Run. ```bash cd $CONV_TUTORIAL_DIR make run ``` -The result summary will be similar to the one given below: +The result summary will be similar to the following: ```bash ---------------------------------------------------------------------------- @@ -136,7 +135,7 @@ FPGA Speedup : 68.1764 x ### Results - From the host application console output, it is clear that the FPGA accelerated kernel can outperform CPU-only implementation by a factor of 68x. It is a large gain in terms of performance over CPU. The following labs will illustrate how this performance allows processing more than 3 HD video channels with 1080p resolution in parallel. The tutorial describes how to achieve such performance gains by building a kernel and host application written in C++. The host application uses OpenCL APIs and Xilinx Runtime (XRT) underneath it, demonstrating how to unleash this custom-built hardware kernel's computing power effectively. + From the host application console output, it is clear that the FPGA accelerated kernel can outperform CPU-only implementation by a factor of 68x. It is a large gain in terms of performance over CPU. The following labs will illustrate how this performance allows processing more than three HD video channels with 1080p resolution in parallel. The tutorial describes how to achieve such performance gains by building a kernel and host application written in C++. The host application uses OpenCL™ APIs and Xilinx Runtime (XRT) underneath it, demonstrating how to unleash this custom-built hardware kernel's computing power effectively. --------------------------------------- @@ -144,7 +143,6 @@ FPGA Speedup : 68.1764 x Next Lab Module: Video Convolution Filter : Introduction and Performance Estimation

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md index 15c7fefde4..e47be42de6 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md @@ -8,29 +8,29 @@ # Video Convolution Filter: Introduction and Performance Estimation -This lab will explore a 2D video convolution filter and measure its performance on the host machine. These measurements will establish a performance baseline. The amount of acceleration that should be provided by hardware implementation is calculated based on the required performance constraints. In the next lab, we will estimate the performance of the FPGA accelerator. In a nutshell, during this lab, you will: +This lab will explore a 2D video convolution filter, and measure its performance on the host machine. These measurements will establish a performance baseline. The amount of acceleration that should be provided by hardware implementation is calculated based on the required performance constraints. In the next lab, you will estimate the performance of the FPGA accelerator. In a nutshell, during this lab, you will: - Learn about video convolution filters -- Measure the performance of software implemented convolution filter -- Calculate required acceleration vs. software implementation for given performance constraints -- Estimate the performance of hardware accelerator before implementation +- Measure the performance of a software implemented convolution filter +- Calculate the required acceleration versus the software implementation for given performance constraints +- Estimate the performance of the hardware accelerator before implementation ## Video Filtering Applications and 2-D Convolution Filters Video applications use different types of filters extensively for multiple reasons: filter noise, manipulate motion blur, enhance color and contrast, edge detection, creative effects, etc. At its core, a convolution video filter carries out some form of data average around a pixel, which redefines the amount and type of correlation any pixel has to its surrounding area. Such filtering is carried out for all the pixels in a video frame. -A matrix of coefficients defines a convolution filter. The convolution operation is essentially a sum of products performed on a pixel set (a frame/image sub-matrix centered around a given pixel) and a coefficients matrix. The figure below illustrates how convolution is calculated for a pixel; it is highlighted in yellow. Here the filter has a coefficient matrix that is 3x3 in size. The figure also displays how the whole output image is generated during the filtering process. The index of the output pixel being generated is the index of the input pixel highlighted in yellow that is being filtered. In algorithmic terms, the process of filtering consists of: +A matrix of coefficients defines a convolution filter. The convolution operation is essentially a sum of products performed on a pixel set (a frame/image sub-matrix centered around a given pixel) and a coefficients matrix. The following figure illustrates how convolution is calculated for a pixel; it is highlighted in yellow. Here the filter has a coefficient matrix that is 3x3 in size. 
The figure also displays how the whole output image is generated during the filtering process. The index of the output pixel being generated is the index of the input pixel highlighted in yellow that is being filtered. In algorithmic terms, the process of filtering consists of: -- Selecting an input pixel as highlighted in yellow in the figure below +- Selecting an input pixel as highlighted in yellow in the following figure - Extracting a sub-matrix whose size is the same as filter coefficients -- Calculating element-wise sum-of-product of extracted sub-matrix and coefficients matrix -- Placing the sum-of-product as output pixel in output image/frame on the same index as the input pixel +- Calculating an element-wise sum-of-product of the extracted sub-matrix and coefficients matrix +- Placing the sum-of-product as an output pixel in an output image/frame on the same index as the input pixel ![Convolution Filter](./images/convolution.jpg) ### Performance Requirements for 1080p HD Video -You can easily calculate the application performance requirements given the standard performance specs for 1080p High Definition (HD) video. Then these top-level requirements can be translated into constraints for hardware implementation or software throughput requirements. For 1080p HD video at 60 frames per seconds(FPS) the specs are listed below as well as required throughput in terms of pixels per second is calculated: +You can easily calculate the application performance requirements given the standard performance specs for 1080p HD video. Then these top-level requirements can be translated into constraints for hardware implementation or software throughput requirements. For 1080p HD video at 60 frames per seconds(FPS), the following specs are listed as well as required throughput in terms of pixels per second is calculated: ```bash Video Resolution = 1920 x 1080 @@ -42,17 +42,17 @@ Color Channels(YUV) = 3 Throughput(Pixel/s) = Frame Width * Frame Height * Channels * FPS Throughput(Pixel/s) = 1920*1080*3*60 Throughput (MB/s) = 373 MB/s -``` +``` -The required throughput to meet 60 FPS performance turns out to be 373 MB/s ( since each pixel is 8-bits). +The required throughput to meet 60 FPS performance turns out to be 373 MB/s (since each pixel is 8-bits). ## Software Implementation This section will discuss the baseline software implementation and performance measurements, which will be used to gauge the acceleration requirements given the performance constraints. -The convolution filter is implemented in software using a typical multi-level nested loop structure. Outer two loops define the pixel to be processed(iterating over each pixel). The inner two loops perform the sum-of-product (SOP) operation, actual convolution filtering between the coefficient matrix and the selected sub-matrix from the image centered around the processed pixel. +The convolution filter is implemented in the software using a typical multi-level nested loop structure. The outer two loops define the pixel to be processed (iterating over each pixel). The inner two loops perform the sum-of-product (SOP) operation, actual convolution filtering between the coefficient matrix and the selected sub-matrix from the image centered around the processed pixel. -**TIP:** Boundary conditions where it is not possible to center sub-matrix around a given pixel require special processing. This algorithm assumes all pixels beyond the boundary of the image have zero values. 
+>**TIP:** Boundary conditions where it is not possible to center sub-matrix around a given pixel require special processing. This algorithm assumes all pixels beyond the boundary of the image have zero values. ```cpp void Filter2D( @@ -98,7 +98,7 @@ void Filter2D( } ``` -The following snapshot shows how the top-level function calls the convolution filter function for an image with three components or channels. Here OpenMP pragma is used to parallelize software execution using multiple threads. You can open **_src/host_randomized.cpp_** and **_src/filter2d_sw.cpp_** from tutorial directory to examine all implementation details. +The following snapshot shows how the top-level function calls the convolution filter function for an image with three components or channels. Here the OpenMP pragma is used to parallelize software execution using multiple threads. You can open `src/host_randomized.cpp` and `src/filter2d_sw.cpp` from the tutorial directory to examine all the implementation details. ```cpp #pragma omp parallel for num_threads(3) @@ -113,14 +113,14 @@ The following snapshot shows how the top-level function calls the convolution fi ### Running the Software Application -To run the software application, go to the directory called "sw_run" and launch the application as follows: +To run the software application, go to the directory called `sw_run`, and launch the application as follows: ```bash cd $CONV_TUTORIAL_DIR/sw_run ./run.sh ``` -Once the application is launched, it should produce an output similar to the one shown below. The software application will process a randomly generated set of images and report performance. Here you have used randomly generated images to avoid any extra library dependencies such as OpenCV. But in the next labs, while working with the hardware implementation, the option to use the OpenCV library for loading images or use randomly generated images will be provided given the user has OpenCV 2.4 installed on the machine. If another version of OpenCV is needed, the user can modify the host application to use different APIs to load and store images from the disk. +When the application is launched, it should produce an output similar to the following. The software application will process a randomly generated set of images and report performance. Here you have used randomly generated images to avoid any extra library dependencies, such as OpenCV. But in the next labs, while working with the hardware implementation, the option to use the OpenCV library for loading images or use randomly generated images will be provided given you have OpenCV 2.4 installed on the machine. If another version of OpenCV is needed, you can modify the host application to use different APIs to load and store images from the disk. ```bash ---------------------------------------------------------------------------- @@ -137,7 +137,7 @@ CPU Throughput : 14.5617 MB/s ---------------------------------------------------------------------------- ``` -The application run measures performance using high precision timers and reports it as throughput. The machine used for experiments produced a throughput of 14.51 MB/s. The machine details are listed below: +The application run measures performance using high precision timers and reports it as throughput. The machine used for experiments produced a throughput of 14.51 MB/s. 
The following machine details are listed: ```bash CPU Model : Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz @@ -157,14 +157,14 @@ So to meet the required performance of processing 60 frames per second, the soft To understand what kind of hardware implementation is needed given the performance constraints, you can examine the convolution kernel in some detail: -- The core compute is done in a 4-level nested loop, but you can break it to the compute per output pixel produced. +- The core compute is done in a four-level nested loop, but you can break it to the compute per output pixel produced. - In terms of the output-pixels produced, it is clear from the filter source code that a single output pixel is produced when the inner two loops finish execution once. - These two loops are essentially doing the sum-of-product on a coefficient matrix and image sub-matrix. The matrix sizes are defined by the coefficient matrix, which is 15x15. - The inner two loops are performing a dot product of size 225(15x15). In other words, the two inner loops perform 225 multiply-accumulate (MAC) operations for every output pixel produced. ### Baseline Hardware Implementation Performance -The simplest and most straightforward hardware implementation can be achieved by passing this current kernel source code through the Vitis HLS tool. It will pipeline the innermost loop with II=1, performing only one multiply-accumulate(MAC) per cycle. The performance can be estimated based on the MACs as follows: +The simplest and most straightforward hardware implementation can be achieved by passing this current kernel source code through the Vitis HLS tool. It will pipeline the innermost loop with II=1, performing only one MAC per cycle. The performance can be estimated based on the MACs as follows: ```bash MACs per Cycle = 1 @@ -172,15 +172,14 @@ The simplest and most straightforward hardware implementation can be achieved by Throughput = 300/225 = 1.33 (MPixels/s) = 1.33 MB/s ``` -Here the hardware clock frequency is assumed to be 300MHz because, in general, for the U200 Xilinx Alveo Data Center card, this is the maximum supported clock frequency when using Vitis HLS based design flow. The performance turns out to be 1.33 MB/s with baseline hardware implementation. From the convolution filter source code, it can also be estimated how much memory bandwidth is needed at the input and output for achieved throughput. From the convolution filter source code also shown above, it is clear that the inner two loops, while calculating a single output pixel, performs 225(15*15) reads at the input so: +Here the hardware clock frequency is assumed to be 300 MHz because, in general, for the U200 AMD Alveo™ Data Center card, this is the maximum supported clock frequency when using the Vitis HLS-based design flow. The performance turns out to be 1.33 MB/s with baseline hardware implementation. From the convolution filter source code, it can also be estimated how much memory bandwidth is needed at the input and output for achieved throughput. From the convolution filter source code also shown above, it is clear that the inner two loops, while calculating a single output pixel, performs 225(15*15) reads at the input so: ```bash Output Memory Bandwidth = Throughput = 1.33 MB/s Input Memory Bandwidth = Throughput * 225 = 300 MB/s -``` +``` -For the baseline implementation, the memory bandwidth requirements are very trivial, assuming that PCIe and device DDR memory bandwidths on Xilinx Acceleration Cards/Boards are of the order of 10s of GB/s. 
-As you have seen in previous sections, the throughput required for 60FPS 1080p HD video is 373 MB/s. So it clear that to meet the performance requirement: +For the baseline implementation, the memory bandwidth requirements are very trivial, assuming that PCIe and device DDR memory bandwidths on AMD Acceleration Cards/Boards are of the order of 10s of GB/s. As you have seen in previous sections, the throughput required for 60 FPS 1080p HD video is 373 MB/s. So it is clear that to meet the performance requirement: ```bash Acceleration Factor to Meet 60FPS Performance = 373/1.33 = 280x @@ -189,7 +188,7 @@ Acceleration Factor to Meet SW Performance = 14.5/1.33 = 10.9x ### Performance Estimation for Optimized Hardware Implementations -From the above calculations, it is clear that you need to improve the performance of a baseline hardware implementation by 280x to process 60 FPS. One of the paths you can take is to start unrolling the inner loops and pipeline. For example, by unrolling the innermost loop, which iterates 15 times, you can improve the performance by 15x. With that one change, the hardware performance will already be better than software-only implementation, but not yet good enough to meet the required video performance. Another approach you can follow is to unroll the inner two loops and gain in performance by 15*15=225, which means a throughput of 1-output pixel per cycle. The performance and memory bandwidth requirements will be as follows: +From the above calculations, it is clear that you need to improve the performance of a baseline hardware implementation by 280x to process 60 FPS. One of the paths you can take is to start unrolling the inner loops and pipeline. For example, by unrolling the innermost loop, which iterates 15 times, you can improve the performance by 15x. With that one change, the hardware performance will already be better than software-only implementation but not yet good enough to meet the required video performance. Another approach you can follow is to unroll the inner two loops and gain in performance by 15*15=225, which means a throughput of one-output pixel per cycle. The performance and memory bandwidth requirements will be as follows: ```bash Throughput = Fmax * Pixels produced per cycle = 300 * 1 = 300 MB/s @@ -197,9 +196,9 @@ Output Memory Bandwidth = Fmax * Pixels produced per cycle = 300 MB/s Input Memory Bandwidth = Fmax * Input pixels read per output pixel = 300 * 225 = 67.5 GB/s ``` -The required output memory bandwidth scales linearly with throughput, but input memory bandwidth has gone up enormously and might not be sustainable. A closer look at the convolution filter will reveal that it is not required to read all 225(15x15) pixels from the input memory for processing. An innovative caching scheme can be built to avoid such extensive use of input memory bandwidth. +The required output memory bandwidth scales linearly with throughput, but input memory bandwidth has gone up enormously and might not be sustainable. A closer look at the convolution filter will reveal that it is not required to read all 225(15x15) pixels from the input memory for processing. An innovative caching scheme can be built to avoid such extensive use of input memory bandwidth. -The convolution filter belongs to a class of kernels known as stencil kernels, which can be optimized to increase input data reuse extensively. Which can result in substantially reduced memory bandwidth requirements. 
With a caching scheme, you can bring the input bandwidth required to be the same as output, which is around 300 MB/s. With the optimized data reuse scheme, when both inner loops are unrolled, it will require that only 1-input pixel is read for producing one output pixel on average and hence input memory bandwidth of 300 MB/s. +The convolution filter belongs to a class of kernels known as stencil kernels, which can be optimized to increase input data reuse extensively, which can result in substantially reduced memory bandwidth requirements. With a caching scheme, you can bring the input bandwidth required to be the same as the output, which is around 300 MB/s. With the optimized data reuse scheme, when both inner loops are unrolled, it will require that only one-input pixel is read for producing one output pixel on average and hence input memory bandwidth of 300 MB/s. Although you can reduce the input bandwidth, the achieved performance will still only be 300 MB/s, which is less than the required 373 MB/s. To deal with this, you can look for other ways to increase the throughput of hardware. One approach is to duplicate kernel instances, also called compute units. In terms of heterogeneous computing, you can increase the number of compute units so that you can process data in parallel. In the convolution filter case, you can process all color channels (YUV) on separate compute units. When using three compute units, one for each color channel, the expected performance summary will be as follows: @@ -225,8 +224,6 @@ Next Lab Module: Design and Analys
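
As a quick cross-check of the estimates used in this lab, the short standalone C++ snippet below recomputes the required 1080p throughput, the baseline hardware throughput, and the resulting acceleration factors. It is not part of the tutorial sources; every constant in it is simply one of the numbers quoted above.

```cpp
#include <cstdio>

int main() {
    // 1080p HD video parameters quoted in this lab
    const double width = 1920, height = 1080, channels = 3, fps = 60;
    const double required_mbps = width * height * channels * fps / 1e6; // ~373 MB/s (1 byte per pixel)

    // Baseline hardware estimate: one MAC per cycle at 300 MHz, 225 MACs per output pixel
    const double fmax_mhz = 300, macs_per_pixel = 15 * 15;
    const double baseline_mbps = fmax_mhz / macs_per_pixel;             // ~1.33 MB/s

    const double sw_mbps = 14.5;                                        // measured software throughput

    std::printf("Required throughput      : %.0f MB/s\n", required_mbps);
    std::printf("Baseline HW throughput   : %.2f MB/s\n", baseline_mbps);
    std::printf("Acceleration vs baseline : %.0fx\n", required_mbps / baseline_mbps);
    std::printf("Acceleration vs software : %.1fx\n", sw_mbps / baseline_mbps);
    return 0;
}
```

Compiling and running this snippet reproduces the roughly 373 MB/s requirement, the 1.33 MB/s baseline estimate, and the 280x and 10.9x acceleration factors quoted above.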

- -

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md index 6e86564537..17195c4759 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md @@ -8,7 +8,7 @@ # Design and Analysis of Hardware Kernel Module for 2-D Video Convolution Filter -This lab is designed to demonstrate the design of a convolution filter module, do performance analysis, and analyze hardware resource utilization. A bottom-up approach is followed by first developing the hardware kernel and analyzing its performance before integrating it with the host application. You will use Vitis HLS to build and estimate the performance of the kernel. +This lab is designed to demonstrate the design of a convolution filter module, do performance analysis, and analyze the hardware resource utilization. A bottom-up approach is followed by first developing the hardware kernel and analyzing its performance before integrating it with the host application. You will use Vitis HLS to build and estimate the performance of the kernel. ## 2-D Convolution Filter Implementation @@ -16,7 +16,7 @@ This section discusses the design of a convolution filter in detail. It goes thr ### Top Level Structure of Kernel -The top-level of the convolution filter is modeled using a dataflow process. The dataflow consists of four different functions as given below. For full implementation details please refer to source file `src/filter2d_hw.cpp` in convolutional tutorial directory. +The top-level of the convolution filter is modeled using a dataflow process. The dataflow consists of four different functions as follows. For full implementation details, refer to the source file `src/filter2d_hw.cpp` in the convolutional tutorial directory. ```cpp void Filter2DKernel( @@ -53,20 +53,20 @@ void Filter2DKernel( The dataflow chain consists of four different functions as follows: -- **ReadFromMem**: reads pixel data or video input from main memory -- **Window2D**: local cache with wide(15x15 pixels) access on the output side -- **Filter2D**: core kennel filtering algorithm -- **WriteToMem**: writes output data to main memory +- **ReadFromMem**: Reads pixel data or video input from main memory +- **Window2D**: Local cache with wide (15x15 pixels) access on the output side +- **Filter2D**: Core kennel filtering algorithm +- **WriteToMem**: Writes output data to the main memory -Two functions at the input and output read and write data from the device's global memory. The `ReadFromMem` function reads data and streams it for filtering. The `WriteToMem` function at the end of the chain writes processed pixel data to the device memory. The input data(pixels) read from the main memory is passed to the `Window2D` function, which creates a local cache and, on every cycle, provides a 15x15 pixel sample to the filter function/block. The `Filter2D` function can consume the 15x15 pixel sample in a single cycle to perform 225(15x15) MACs per cycle. +Two functions at the input and output read and write data from the device's global memory. The `ReadFromMem` function reads data and streams it for filtering. The `WriteToMem` function at the end of the chain writes processed pixel data to the device memory. 
The input data (pixels) read from the main memory is passed to the `Window2D` function, which creates a local cache and, on every cycle, provides a 15x15 pixel sample to the filter function/block. The `Filter2D` function can consume the 15x15 pixel sample in a single cycle to perform 225(15x15) MACs per cycle. -Please open the **"src/filter2d_hw.cpp"** source file from convolutioanl tutorial directory and examine the implementation details of these individual functions. In the next section, you will elaborate on the implementation details of Window2D and Filter2D functions. The following figure shows how data flows between different functions (dataflow modules). +Open the `src/filter2d_hw.cpp` source file from the convolutioanl tutorial directory, and examine the implementation details of these individual functions. In the next section, you will elaborate on the implementation details of Window2D and Filter2D functions. The following figure shows how data flows between different functions (dataflow modules). ![Datflow](images/filterBlkDia.jpg) ### Data Mover -One of the key advantages of custom design hardware accelerators, for which FPGAs are well suited, is the choice and architecture of custom data movers. These customized data movers facilitate efficient access to global device memory and optimize bandwidth utilization by reusing data. Specialized data movers at the interface with main memory can be built at the input and output of the data processing engine or processing elements. The convolution filter is an excellent example of this. Looking from a pure software implementation point of view, it seems that to produce a single sample at the output side requires 450 memory accesses at the input side and 1 write access to the output. +One of the key advantages of custom design hardware accelerators, for which FPGAs are well suited, is the choice and architecture of custom data movers. These customized data movers facilitate efficient access to global device memory and optimize bandwidth utilization by reusing data. Specialized data movers at the interface with the main memory can be built at the input and output of the data processing engine or processing elements. The convolution filter is an excellent example of this. Looking from a pure software implementation point of view, it seems that to produce a single sample at the output side requires 450 memory accesses at the input side and one write access to the output. ```bash Memory Accesses to Read filter Co-efficients = 15x15 = 225 @@ -79,39 +79,39 @@ For a pure software implementation, even though many of these accesses can becom #### Window2D: Line and Window Buffers -The `Window2D` block is essentially built from two basic blocks: the first is called a "line buffer", and the second is called a "Window". +The `Window2D` block is essentially built from two basic blocks: the first is called a "line buffer", and the second is called a "Window". -- The line buffer is used to buffer multiple lines of a full image, and specifically, here it is designed to buffer **_FILTER_V_SIZE - 1_** image lines. Where FILTER_V_SIZE is the height of the convolution filter. The total number of pixels held by the line buffer is **(FILTER_V_SIZE-1) * MAX_IMAGE_WIDTH**. -- The "Window" block holds **FILTER_V_SIZE * FILTER_H_SIZE** pixels. The 2-D convolution filtering operation consists of centering the filtering mask (filter coefficients) on the index of output pixel and calculating the sum-of-product(SOP) as described in the previous lab. 
The following figure shows how these centering and SOP operations are carried. +- The line buffer is used to buffer multiple lines of a full image, and specifically, here it is designed to buffer **_FILTER_V_SIZE - 1_** image lines, where FILTER_V_SIZE is the height of the convolution filter. The total number of pixels held by the line buffer is **(FILTER_V_SIZE-1) * MAX_IMAGE_WIDTH**. +- The "Window" block holds **FILTER_V_SIZE * FILTER_H_SIZE** pixels. The 2-D convolution filtering operation consists of centering the filtering mask (filter coefficients) on the index of output pixel and calculating the SOP as described in the previous lab. The following figure shows how these centering and SOP operations are carried. ![Convolution Filter](images/convolution.jpg) -The figure above shows SOP carried out for a full image being processed. If you look carefully when output pixels are produced line by line, it is not required to have all the image pixels in memory. Only the lines where the filtering mask overlaps are required which is essentially FILTER_V_SIZE lines, which can even be reduced to FILTER_V_SIZE-1. Essentially that is the amount of data that needs to be on-chip or housed by a data mover at any given time. +The figure above shows SOP carried out for a full image being processed. If you look carefully when output pixels are produced line by line, it is not required to have all the image pixels in memory. Only the lines where the filtering mask overlaps are required which is essentially FILTER_V_SIZE lines, which can even be reduced to FILTER_V_SIZE-1. Essentially, that is the amount of data that needs to be on-chip or housed by a data mover at any given time. ![Matrix Movement](images/Window2D.jpg) -The figure above illustrates the operation and requirements for a line and Window buffer. The image size is assumed 8x8, and the filter size is 3x3. For this example, you are generating the filtered output of pixel number 10. In this case, you need a 3x3 block of input pixels centered around pixel 10, as shown by step `A`. +The figure above illustrates the operation and requirements for a line and Window buffer. The image size is assumed 8x8, and the filter size is 3x3. For this example, you are generating the filtered output of pixel number 10. In this case, you need a 3x3 block of input pixels centered around pixel 10, as shown in step `A`. -Step `B` in the figure highlights what is required for producing pixel number 11. Another 3x3 block, but it has a significant overlap with the previous input block. Essentially a column moves out from the left, and a column moves in from the right. One important thing to notice in steps A, B, and C, is that from the input side, it only needs one new pixel to produce one output pixel(ignoring the initial latency of filling the line buffer with multiple pixels, which is one time only). +Step `B` in the figure highlights what is required for producing pixel number 11. Another 3x3 block, but it has a significant overlap with the previous input block. Essentially a column moves out from the left, and a column moves in from the right. One important thing to notice in steps A, B, and C, is that from the input side; it only needs one new pixel to produce one output pixel (ignoring the initial latency of filling the line buffer with multiple pixels, which is one-time only). The line buffer holds FILTER_V_SIZE-1 lines. 
In general, it requires FILTER_V_SIZE lines, but a line is reduced by using the line buffer in a circular fashion and by exploiting the fact that pixels at the start of the first line buffer can be used to write new incoming pixels since they are no longer needed. The window buffer is implemented as FILTER_V_SIZE * FILTER_H_SIZE storage fully partitioned, giving parallel access to all elements inside the window. The data moves as a column vector of size FILTER_V_SIZE from line buffer to window buffer, and then this whole widow is passed through a stream to the `Filter2D` function for processing. -The overall scheme (data mover) is built to maximize the data reuse providing maximum parallel data to the processing element. For a deeper understanding of the modeling style and minute details of the data mover examine the `Window2D` function details in the source code. The function can be found in the **"src/filter2d_hw.cpp"** source file in convolutioanl tutorial directory . +The overall scheme (data mover) is built to maximize the data reuse providing maximum parallel data to the processing element. For a deeper understanding of the modeling style and minute details of the data mover, examine the `Window2D` function details in the source code. The function can be found in the `src/filter2d_hw.cpp` source file in the convolutioanl tutorial directory. ## Building and Simulating the Kernel using Vitis HLS -In this section, you will build and simulate the 2D convolution filter using Vitis HLS. You will also look at the performance estimates and measured results after co-simulation for comparison with target performance settings. +In this section, you will build and simulate the 2D convolution filter using Vitis HLS. You will also look at the performance estimates and measured results after co-simulation for comparison with the target performance settings. ### Building the Kernel Module -Now you will build the kernel module as a standalone kernel with AXI interfaces to memory, which are also used for simulation. To do this, please follow the steps listed below: +Now you will build the kernel module as a standalone kernel with AXI interfaces to memory, which are also used for simulation. To do this, use the following steps: ```bash cd $CONV_TUTORIAL_DIR/hls_build vitis_hls -f build.tcl ``` -**TIP:** This step can take some time to complete. +>**TIP:** This step can take some time to complete. An output similar to the following will be printed: @@ -132,17 +132,17 @@ An output similar to the following will be printed: This shows that an image with Width=1000 and height=30 is simulated. There are default parameters for image dimensions, and these are kept as small values to make co-simulation run more quickly. The synthesis is done for a maximum image size of 1920x1080. 
-Once the build and simulation are finished, launch Vitis HLS GUI to analyze the performance estimate reports and implementation QoR as follows: +When the build and simulation are finished, launch the Vitis HLS GUI to analyze the performance estimate reports and implementation QoR as follows: ```bash vitis_hls -p conv_filter_prj ``` -After the GUI opens, the first thing to notice is the Synthesis Summary report, which shows the Performance and Resource Estimates as shown below: +After the GUI opens, the first thing to notice is the Synthesis Summary report, which shows the Performance and Resource Estimates as the following: ![Resource Report](images/vitisHlsResourceReport2.jpg) -It shows the use of 139 DSP essentially for the SOP operations by the top-level module, and the use of 14 BRAMs by the `Window2D` data mover block. +It shows the use of 139 DSP essentially for the SOP operations by the top-level module, and the use of 14 block RAMs by the `Window2D` data mover block. One important thing to notice is the static performance estimate for the kernel, 7.3 ms, which is very close to the estimated target latency of 6.9 ms for the kernel as calculated in the previous lab. @@ -152,11 +152,11 @@ You can also get an accurate measurement of latency for kernel from the Co-simul Since you are simulating a 1000x30 image, the expected latency should be 30,000 + fixed latency of some blocks (assuming one clock cycle per output pixel). The number shown in the report is 38,520. Here 8,520 is the fixed latency, and when the actual image size is 1920x1080, the fixed latency will get amortized across more image lines. The fact that a large image will amortize the latency can be verified by simulating with a larger image. -Another thing that verifies that the kernel can achieve one output sample per cycle throughput is the loop initiation intervals (II). The synthesis report expanded view shows that all loops have II=1, as shown below: +Another thing that verifies that the kernel can achieve one output sample per cycle throughput is the loop initiation intervals (II). The synthesis report expanded view shows that all loops have II=1, as follows: ![II Report](images/vitisHLSIIReport.jpg) -Once you have verified that the throughput requirements are met, and the resource consumption is acceptable, you can move forward and start integrating the full application. Which consists of creating and compiling the host application to drive the kernel, and building the kernel using one of the Xilinx platforms for Alveo Data Center accelerator cards. For this lab, the Alveo U200 card is used. +When you have verified that the throughput requirements are met, and the resource consumption is acceptable, you can move forward and start integrating the full application, which consists of creating and compiling the host application to drive the kernel, and building the kernel using one of the AMD platforms for Alveo Data Center accelerator cards. For this lab, the Alveo U200 card is used. 
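
For readers who want to see the caching idea in plain C++ before opening the HLS sources, the following is a deliberately simplified, software-only sketch of the line-buffer scheme described earlier in this lab. It is not the tutorial's `src/filter2d_hw.cpp` code: the function name, coefficient type, and saturation step are illustrative assumptions, and the real `Window2D`/`Filter2D` pair additionally uses a fully partitioned 15x15 window buffer, streaming interfaces, and HLS pragmas.

```cpp
#include <cstdint>
#include <cstring>

constexpr int FILTER_V_SIZE   = 15;
constexpr int FILTER_H_SIZE   = 15;
constexpr int MAX_IMAGE_WIDTH = 1920;

// Illustrative line-buffered 2-D filter: each input pixel is read from 'src' exactly
// once, and all 15x15 samples needed per output pixel come from the on-chip style
// line buffer. Pixels beyond the image boundary are treated as zero, as in the lab.
void Filter2DLineBuffered(const short coeffs[FILTER_V_SIZE][FILTER_H_SIZE],
                          const uint8_t* src, uint8_t* dst,
                          int width, int height)
{
    static uint8_t lineBuf[FILTER_V_SIZE][MAX_IMAGE_WIDTH];   // models on-chip storage
    std::memset(lineBuf, 0, sizeof(lineBuf));                  // zero padding above the image
    const int HALF_V = FILTER_V_SIZE / 2;
    const int HALF_H = FILTER_H_SIZE / 2;

    // Stream input rows in; once row (oy + HALF_V) has arrived, output row oy is complete.
    for (int iy = 0; iy < height + HALF_V; ++iy) {
        // Shift the line buffer up by one row ...
        for (int r = 0; r < FILTER_V_SIZE - 1; ++r)
            std::memcpy(lineBuf[r], lineBuf[r + 1], width);
        // ... and load the next input row (zero padding below the image).
        if (iy < height) std::memcpy(lineBuf[FILTER_V_SIZE - 1], src + iy * width, width);
        else             std::memset(lineBuf[FILTER_V_SIZE - 1], 0, width);

        int oy = iy - HALF_V;          // output row now fully covered by the buffered lines
        if (oy < 0) continue;

        for (int ox = 0; ox < width; ++ox) {
            int sum = 0;
            for (int r = 0; r < FILTER_V_SIZE; ++r) {
                for (int c = 0; c < FILTER_H_SIZE; ++c) {
                    int ix = ox - HALF_H + c;                            // zero padding left/right
                    uint8_t px = (ix >= 0 && ix < width) ? lineBuf[r][ix] : 0;
                    sum += coeffs[r][c] * px;
                }
            }
            if (sum < 0) sum = 0; else if (sum > 255) sum = 255;         // saturate (illustrative)
            dst[oy * width + ox] = static_cast<uint8_t>(sum);
        }
    }
}
```

The key property of this structure is that every input pixel enters the line buffer exactly once, which is what lets the external input bandwidth drop to roughly the same order as the output bandwidth, as estimated in the previous lab.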
In this lab, you learned about: diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md index 5d84028f50..660f31dc8f 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md @@ -8,28 +8,28 @@ # Building the 2-D Convolution Kernel and Host Application -This lab will focus on building a hardware kernel using the Vitis application acceleration development flow, targeting the Xilinx Alveo U200 accelerator card. A host-side application will be implemented to coordinate all the data movements and execution triggers for invoking the kernel. During this lab, real performance measurements will be taken and compared to estimated performance and the CPU-only performance. +This lab will focus on building a hardware kernel using the Vitis application acceleration development flow, targeting the AMD Alveo™ U200 accelerator card. A host-side application will be implemented to coordinate all the data movements and execution triggers for invoking the kernel. During this lab, real performance measurements will be taken and compared to estimated performance and the CPU-only performance. ## Host Application -This section briefly discusses how the host application is written to orchestrate the execution of a convolution kernel. It was estimated in the previous lab that multiple compute units will be needed to meet the 60 FPS performance target for processing 1080p HD Video. The host application is designed to be agnostic to the number of compute units. More specifically, if the compute units are symmetric ( instance of the same kernel and memory connectivity to device DDR banks is identical), the host application can deal with any number of compute units. +This section briefly discusses how the host application is written to orchestrate the execution of a convolution kernel. It was estimated in the previous lab that multiple compute units will be needed to meet the 60 FPS performance target for processing 1080p HD Video. The host application is designed to be agnostic to the number of compute units. More specifically, if the compute units are symmetric (the instance of the same kernel and memory connectivity to the device DDR banks is identical), the host application can deal with any number of compute units. -**TIP:** Additional [tutorials](https://github.com/Xilinx/Vitis-Tutorials) and [examples](https://github.com/Xilinx/Vitis_Accel_Examples) about host programming and Vitis tools are available. +>**TIP:** Additional [tutorials](https://github.com/Xilinx/Vitis-Tutorials) and [examples](https://github.com/Xilinx/Vitis_Accel_Examples) about host programming and Vitis tools are available. ### Host Application Variants -Please go to the top-level folder for the convolution tutorial and change the directory to `src`, and list the files: +Go to the top-level folder for the convolution tutorial, change the directory to `src`, and list the files: ```bash cd $CONV_TUTORIAL_DIR/src ls ``` -There are two files namely "**host.cpp**" and "**host_randomized.cpp**". They can be used to build two different versions of the host application. The way they interact with the kernel compute unit is exactly the same except that one uses the **pgm** image file as input. This file is repeated multiple times to emulate an image sequence(video). 
The randomized host uses a randomly generated image sequence. The host with random input image generation has no dependencies. In contrast, the host code in "**host.cpp**" uses OpenCV libraries, specifically using **OpenCV 2.4** libraries to load, unload and convert between raw image formats. +There are two files namely `host.cpp` and `host_randomized.cpp`. They can be used to build two different versions of the host application. The way they interact with the kernel compute unit is exactly the same except that one uses the **pgm** image file as the input. This file is repeated multiple times to emulate an image sequence (video). The randomized host uses a randomly generated image sequence. The host with the random input image generation has no dependencies. In contrast, the host code in `host.cpp` uses OpenCV libraries, specifically using **OpenCV 2.4** libraries to load, unload, and convert between raw image formats. ### Host Application Details -The host application starts by parsing command line arguments. Following are the command-line options provided by the host that takes the input image and uses OpenCV (in source file `src/host.cpp`): +The host application starts by parsing command line arguments. The following are the command line options provided by the host that takes the input image and uses OpenCV (in the source file `src/host.cpp`): ```cpp CmdLineParser parser; @@ -48,106 +48,106 @@ The `host_randomized.cpp` file has all those options, and adds `width` and `heig parser.addSwitch("--height", "-h", "Image height", "1080"); ``` -Different options can be used to launch the application and for performance measurements. In this lab, you will set most of these command-line inputs to the application using a makefile, `make_options.mk` in the top-level directory. This file lets you set most of these options. +Different options can be used to launch the application and for performance measurements. In this lab, you will set most of these command line inputs to the application using a makefile, `make_options.mk` in the top-level directory. This file lets you set most of these options. -After parsing the command-line options, the host application creates an OpenCL context, reads and loads the `.xclbin`, and creates a command queue with out-of-order execution and profiling enabled. After that, memory allocation is done, and the input image is read (or randomly generated). +After parsing the command line options, the host application creates an OpenCL™ context, reads and loads the `.xclbin`, and creates a command queue with out-of-order execution and profiling enabled. After that, memory allocation is done, and the input image is read (or randomly generated). After the setup is complete, the application creates a `Filter2DDispatcher` object and uses it to dispatch filtering requests on several images. Timers are used to capture execution time measurements for both software and hardware execution. Finally, the host application prints the summary of performance results. Most of the heavy lifting is done by `Filter2DDispatcher` and `Filter2DRequest`. These classes manage and coordinate the execution of filtering operations on multiple compute units. Both versions of the host application are based on these classes. ### 2D Filtering Requests -The `Filter2DRequest` class is used by the filtering request dispatcher class. An object of this class encapsulates a single request to process a single color channel (YUV) for a given image. 
It essentially allocates and holds handles to OpenCL resources needed for enqueueing 2-D convolution filtering requests. These resources include OpenCL buffers, event lists, and handles to kernel and command queue. The application creates a single command queue that is passed down to enqueue every kernel enqueue command. +The `Filter2DRequest` class is used by the filtering request dispatcher class. An object of this class encapsulates a single request to process a single color channel (YUV) for a given image. It essentially allocates and holds handles to OpenCL resources needed for enqueueing 2-D convolution filtering requests. These resources include OpenCL buffers, event lists, and handles to the kernel and command queue. The application creates a single command queue that is passed down to enqueue every kernel enqueue command. After an object of the `Filter2DRequest` class is created, it can be used to make a call to the `Filter2D` method. This call will enqueue all the operations, moving input data or filter coefficients, kernel calls, and reading of output data back to the host. The same API call will create a list of dependencies between these transfers and also creates an output event that signals the completion of output data transfer to the host. ### 2D Filter Dispatcher -The `Filter2DDispatcher` class is the top-level class that provides an end-user API to schedule kernel calls. Every call schedules a kernel enqueue and related data transfers using the `Filter2DRequest` object, as explained previously. The `Filter2DDispatcher` is a container class that essentially holds a vector of request objects. The number of `Filter2DRequest` objects that are instantiated is defined as the **max** parameter for the dispatcher class at construction time. This parameter's minimum value can be as small as the number of compute units to allow at least one kernel enqueue call per compute unit to happen in parallel. But a larger value is desired since it will allow overlap between input and output data transfers happening between host and device. +The `Filter2DDispatcher` class is the top-level class that provides an end-user API to schedule kernel calls. Every call schedules a kernel enqueue and related data transfers using the `Filter2DRequest` object, as explained previously. The `Filter2DDispatcher` is a container class that essentially holds a vector of request objects. The number of `Filter2DRequest` objects that are instantiated is defined as the **max** parameter for the dispatcher class at construction time. This parameter's minimum value can be as small as the number of compute units to allow at least one kernel enqueue call per compute unit to happen in parallel. But a larger value is desired, since it will allow overlap between input and output data transfers happening between the host and device. ## Building the Application -The host application can be built using the `Makefile` that is provided with the tutorial. As mentioned earlier, the host application has two versions: the first version takes input images to process, the second can generate random data that will be processed as images. +The host application can be built using the `Makefile` that is provided with the tutorial. As mentioned earlier, the host application has two versions: the first version takes input images to process, and the second can generate random data that will be processed as images. The top-level `Makefile` includes a file called `make_options.mk`. 
This file provides most of the options that can be used to generate different host builds and kernel versions for emulation modes. It also provides a way to launch emulation with a specific number of test images. The details of the options provided by this file are as follows: ### Kernel Build Options -- TARGET: selects build target; the choices are `hw`, `sw_emu`, `hw_emu`. -- PLATFORM: target Xilinx platform used for the build -- ENABLE_STALL_TRACE : enables the kernel to generate stall data. Choices are: `yes`, `no`. -- TRACE_DDR: select the memory bank to store trace data. Choices are DDR[0]-DDR[3] for u200 card. -- KERNEL_CONFIG_FILE: kernel configuration file -- VPP_TEMP_DIRS: temporary log directory for the Vitis kernel compiler (`v++`) -- VPP_LOG_DIRS: log directory for `v++`. -- USE_PRE_BUILT_XCLBIN: enables the use of pre-built FPGA binary file to speed the use of this tutorial +- TARGET: Selects build target; the choices are `hw`, `sw_emu`, and `hw_emu`. +- PLATFORM: Target AMD platform used for the build . +- ENABLE_STALL_TRACE: Enables the kernel to generate stall data. Choices are: `yes` or `no`. +- TRACE_DDR: Select the memory bank to store trace data. Choices are DDR[0]-DDR[3] for the U200 card. +- KERNEL_CONFIG_FILE: Kernel configuration file. +- VPP_TEMP_DIRS: Temporary log directory for the Vitis kernel compiler (`v++`). +- VPP_LOG_DIRS: Log directory for `v++`. +- USE_PRE_BUILT_XCLBIN: Enables the use of pre-built FPGA binary file to speed the use of this tutorial. ### Host Build Options -- ENABLE_PROF: Enables OpenCL profiling for the host application -- OPENCV_INCLUDE: OpenCV include directory path -- OPENCV_LIB: OpenCV lib directory path +- ENABLE_PROF: Enables OpenCL profiling for the host application. +- OPENCV_INCLUDE: OpenCV include directory path. +- OPENCV_LIB: OpenCV lib directory path. ### Application Runtime Options -- FILTER_TYPE: selects between 6 different filter types: choices are 0-6(Identity, Blur, Motion Blur, Edges, Sharpen, Gaussian, Emboss) -- PARALLEL_ENQ_REQS: application command-line argument for parallel enqueued requests -- NUM_IMAGES: number of images to process -- IMAGE_WIDTH: image width to use -- IMAGE_HEIGHT: image height to use -- INPUT_TYPE: selects between host versions -- INPUT_IMAGE: path and name of image file -- PROFILE_ALL_IMAGES: while comparing CPU vs. FPGA, use all images or not -- NUM_IMAGES_SW_EMU: sets no. of images to use for sw_emu -- NUM_IMAGES_HW_EMU: sets no. of images to use for hw_emu +- FILTER_TYPE: Selects between six different filter types: choices are 0-6 (Identity, Blur, Motion Blur, Edges, Sharpen, Gaussian, and Emboss). +- PARALLEL_ENQ_REQS: Application command line argument for parallel enqueued requests. +- NUM_IMAGES: Number of images to process. +- IMAGE_WIDTH: Image width to use. +- IMAGE_HEIGHT: Image height to use. +- INPUT_TYPE: Selects between host versions. +- INPUT_IMAGE: Path and name of image file. +- PROFILE_ALL_IMAGES: While comparing the CPU versus FPGA, use all images or not. +- NUM_IMAGES_SW_EMU: Sets the number of images to use for sw_emu. +- NUM_IMAGES_HW_EMU: Sets the number of images to use for hw_emu. -To build the host application with randomized data please follow these steps: +To build the host application with randomized data, follow these steps: 1. Edit the Makefile options: - - ```bash - cd $CONV_TUTORIAL_DIR/ - vim make_options.mk - ``` -2. Make sure **INPUT_TYPE** option is set to `random`. 
This will build the `host_randomized.cpp` application: + ```bash + cd $CONV_TUTORIAL_DIR/ + vim make_options.mk + ``` - ```makefile - ############## Host Application Options - INPUT_TYPE :=random - ``` +2. Make sure the **INPUT_TYPE** option is set to `random`. This will build the `host_randomized.cpp` application: - **TIP:** To build the `host.cpp` you must include the following two variables that point to **OpenCV 2.4** install path in `make_options.mk` file: + ```makefile + ############## Host Application Options + INPUT_TYPE :=random + ``` - ```makefile - ############## OpenCV Installation Paths - OPENCV_INCLUDE :=/**OpenCV2.4 User Install Path**/include - OPENCV_LIB :=/**OpenCV2.4 User Install Path**/lib - ``` + >**TIP:** To build the `host.cpp` you must include the following two variables that point to **OpenCV 2.4** install path in `make_options.mk` file: -3. Source the install specific scripts for setting up the Vitis application acceleration development flow and the Xilinx RunTime Library: + ```makefile + ############## OpenCV Installation Paths + OPENCV_INCLUDE :=/**OpenCV2.4 User Install Path**/include + OPENCV_LIB :=/**OpenCV2.4 User Install Path**/lib + ``` - ```bash - source /**User XRT Install Path**/setup.sh - source /**User Vitis Install Path**/settings64.sh - ``` +3. Source the install specific scripts for setting up the Vitis application acceleration development flow and the XRT: + + ```bash + source /**User XRT Install Path**/setup.sh + source /**User Vitis Install Path**/settings64.sh + ``` 4. After setting the appropriate paths, build the host application using the makefile command as follows: - ```bash - make host - ``` + ```bash + make host + ``` -It will build `host.exe` inside a `build` folder. By building the host application separate from the kernel code, you can make sure the host code compiles correctly and all library paths have been set. + It will build `host.exe` inside a `build` folder. By building the host application separate from the kernel code, you can make sure the host code compiles correctly and all library paths have been set. ### Running Software Emulation -To build and run the kernel in software emulation mode, please run the following bash command: +To build and run the kernel in software emulation mode, run the following bash command: ```bash make run TARGET=sw_emu ``` -It will build an xclbin file to be used in software emulation mode only and launch an emulation run. Once the emulation finishes, you should get a console output similar to the one below. The output given below is for random input image case: +It will build an `xclbin` file to be used in software emulation mode only, and launch an emulation run. When the emulation finishes, you should get a console output similar to the following. The following output is for a random input image case: ```bash ---------------------------------------------------------------------------- @@ -175,7 +175,7 @@ Test PASSED: Output matches reference ---------------------------------------------------------------------------- ``` -If the input image is to be used for processing, please set OpenCV paths and INPUT_TYPE as an empty string in `make_options.mk` file and run it again. Following is the expected console output: +If the input image is to be used for processing, set the OpenCV paths and INPUT_TYPE as an empty string in `make_options.mk` file, and run it again. 
The following is the expected console output: ```bash ---------------------------------------------------------------------------- @@ -201,85 +201,85 @@ Test PASSED: Output matches reference ---------------------------------------------------------------------------- ``` -The input and the output images are shown below for filter type selection set to 3 which performs edge detection: +The input and the output images are shown in the following images for filter type selection set to 3 which performs edge detection: ![missing image](./images/inputImage50.jpg) ![missing image](./images/outputImage.jpg) ### Running Hardware Emulation -The application can be run in hardware emulation mode in a similar way as software emulation. The only change needed is TARGET, which should be set to `hw_emu`. +The application can be run in hardware emulation mode in a similar way as software emulation. The only change needed is TARGET, which should be set to `hw_emu`. - **NOTE**: Hardware Emulation may take a long time. The `Makefile` default setting will make sure it simulates only a single image, but it is recommended in case of the random input image that image size be set smaller by keeping image height in the range of 30-100 pixels. The height and width of the image can be specified in the `make_options.mk` file. + >**NOTE:** Hardware emulation can take a long time. The `Makefile` default setting will make sure it simulates only a single image, but it is recommended in case of the random input image, that image size be set smaller by keeping image height in the range of 30-100 pixels. The height and width of the image can be specified in the `make_options.mk` file. 1. Launch hardware emulation using the following command: - ```bash - make run TARGET=hw_emu - ``` + ```bash + make run TARGET=hw_emu + ``` -It will build the hardware kernel in emulation mode and then launch the host application. The output printed in the console window will be similar to the `sw_emu` case. But after hardware emulation, you can analyze different synthesis reports and view different waveforms using Vitis Analyzer. + It will build the hardware kernel in emulation mode, and then launch the host application. The output printed in the console window will be similar to the `sw_emu` case. But after hardware emulation, you can analyze different synthesis reports and view different waveforms using the Vitis Analyzer. ## System Run -In this section, you will run the host application using FPGA hardware and analyze the overall system's performance using Vitis Analyzer and host application console output. +In this section, you will run the host application using FPGA hardware and analyze the overall system's performance using the Vitis Analyzer and host application console output. ### Building the Hardware xclbin -Once the kernel functionality is verified, and its resource usage is satisfactory, the hardware kernel build process can be started. The kernel build process will create an xclbin file targetting the actual accelerator card. It is an FPGA executable file that can be read and loaded by the host onto the FPGA card. Building xclbin takes a few hours, and is built as shown below: +When the kernel functionality is verified, and its resource usage is satisfactory, the hardware kernel build process can be started. The kernel build process will create an `xclbin` file targetting the actual accelerator card. It is an FPGA executable file that can be read and loaded by the host onto the FPGA card. 
Building `xclbin` takes a few hours, and is built as follows: -1. You can enable the performance profiling by setting "ENABLE_PROF?=yes" in `make_options.mk` file as shown below: +1. You can enable the performance profiling by setting "ENABLE_PROF?=yes" in `make_options.mk` file as follows: - ```bash - ENABLE_PROF?=yes - ``` + ```bash + ENABLE_PROF?=yes + ``` -2. Launch the hardware run using the following comand: +2. Launch the hardware run using the following comand: ```bash - make build TARGET=hw + make build TARGET=hw ``` - >**TIP**: You can use a prebuilt xclbin file if one is available by setting **USE_PRE_BUILT_XCLBIN := 1** in the `make_options.mk` file. + >**TIP**: You can use a prebuilt xclbin file if one is available by setting **USE_PRE_BUILT_XCLBIN := 1** in the `make_options.mk` file. -### Application Run Using FPGA Kernel +### Application Run Using the FPGA Kernel -1. To run the application, please proceed as follows: +1. To run the application, proceed as follows: - ```bash - make run TARGET=hw - ``` + ```bash + make run TARGET=hw + ``` -It should produce a console log similar to the one shown below: + It should produce a console log similar to the following : -```bash ----------------------------------------------------------------------------- + ```bash + ---------------------------------------------------------------------------- -Xilinx 2D Filter Example Application (Randomized Input Version) + Xilinx 2D Filter Example Application (Randomized Input Version) -FPGA binary : ../xclbin/fpgabinary.hw.xclbin -Number of runs : 60 -Image width : 1920 -Image height : 1080 -Filter type : 3 -Max requests : 12 -Compare perf. : 1 + FPGA binary : ../xclbin/fpgabinary.hw.xclbin + Number of runs : 60 + Image width : 1920 + Image height : 1080 + Filter type : 3 + Max requests : 12 + Compare perf. : 1 -Programming FPGA device -Generating a random 1920x1080 input image -Running FPGA accelerator on 60 images -Running Software version -Comparing results + Programming FPGA device + Generating a random 1920x1080 input image + Running FPGA accelerator on 60 images + Running Software version + Comparing results -Test PASSED: Output matches reference + Test PASSED: Output matches reference -FPGA Time : 0.4240 s -FPGA Throughput : 839.4765 MB/s -CPU Time : 28.9083 s -CPU Throughput : 12.3133 MB/s -FPGA Speedup : 68.1764 x ----------------------------------------------------------------------------- -``` + FPGA Time : 0.4240 s + FPGA Throughput : 839.4765 MB/s + CPU Time : 28.9083 s + CPU Throughput : 12.3133 MB/s + FPGA Speedup : 68.1764 x + ---------------------------------------------------------------------------- + ``` From the console output, it is clear that acceleration achieved when compared to CPU is 68x. The achieved throughput is 839 MB/s, which is close to the estimated throughput of 900 MB/s; it only differs by 6.66 percent. @@ -293,39 +293,39 @@ The trace information generated during the application run can be controlled by 1. After the design has been run; you can open the run time profile summary report using the following steps: - ```bash - vitis_analyzer ./build/fpgabinary.xclbin.run_summary - ``` + ```bash + vitis_analyzer ./build/fpgabinary.xclbin.run_summary + ``` + + >**NOTE:** In the 2023.1 release, this command opens the Analysis view of the new Vitis Unified IDE and loads the run summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). 
You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report. + +2. After the Vitis Analyzer tool opens, select **Profile Summary** from the left-side menu, and then select **Compute Unit Utilization** from the window displayed on the right-hand side. ->**NOTE:** In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the run summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report. -> -2. After the Vitis Analyzer tool opens, select **Profile Summary** from the left-side menu, and then select **Compute Unit Utilization** from the window displayed on the right-hand side. - The report will display stats about the measured performance of the compute units. You have built the `.xclbin` with three compute units, so the display will appear as shown below: - ![Compute Unit Utilization](images/cuPerf.jpg) + ![Compute Unit Utilization](images/cuPerf.jpg) - From this table, it can be seen that the kernel compute time as displayed in the **Avg Time** column is about 7 ms, almost equal to the estimated kernel latency in the previous lab. + From this table, it can be seen that the kernel compute time as displayed in the **Avg Time** column is about 7 ms, almost equal to the estimated kernel latency in the previous lab. - Another important measurement is the **CU Utilization** column, which is very close to 100 percent. This means the host was able to feed data to compute units through PCIe continuously. In other words, the host PICe bandwidth was sufficient, and compute units never saturated it. This can also be observed by examining the host bandwidth utilization. To see this, select **Host Data Transfers** in the report, and a table similar to the figure below will be displayed. From this table, it is clear that the host bandwidth is not fully utilized. + Another important measurement is the **CU Utilization** column, which is very close to 100 percent. This means the host was able to feed data to compute units through PCIe® continuously. In other words, the host PICe bandwidth was sufficient, and compute units never saturated it. This can also be observed by examining the host bandwidth utilization. To see this, select **Host Data Transfers** in the report, and a table similar to the following figure will be displayed. From this table, it is clear that the host bandwidth is not fully utilized. ![missing image](images/bwUtil.jpg) - Similarly, by selecting **Kernel Data Transfers** in the report, you can see how much bandwidth is utilized between the kernel and the device DDR memory. You have used a single memory bank (DDR[1]) for all the compute units, as shown below. + Similarly, by selecting **Kernel Data Transfers** in the report, you can see how much bandwidth is utilized between the kernel and the device double-data rate (DDR) memory. You have used a single memory bank (DDR[1]) for all the compute units, as shown below. -![missing image](images/bwKernel.jpg) + ![missing image](images/bwKernel.jpg) ### Application Timeline -The Application Timeline can also be used to examine performance parameters like CU latency per invocation and bandwidth utilization. 
+The Application Timeline can also be used to examine performance parameters like compute unit latency per invocation and bandwidth utilization. -1. Select **Application Timeline** from the left-side menu. +1. Select **Application Timeline** from the left-side menu. - This will display the Application Timeline in the right side window, as shown below. + This will display the Application Timeline in the right side window, as shown below. ![missing image](images/cuTime.jpg) - Zoom appropriately and go to device-side trace. For any CU, hover your mouse on any transaction in "Row 0" and a tooltip will show compute start and end times and also the latency. This should be similar to what you saw in the last section. + Zoom appropriately, and go to device-side trace. For any compute unit, hover your mouse on any transaction in "Row 0", and a tooltip will show the compute start and end times and also the latency. This should be similar to what you saw in the last section. Another important thing to observe is the host data transfer trace as shown below. From this report, it can be seen that the host read and write bandwidth is not fully utilized as there are gaps, showing times when there are no read/write transactions occurring. You can see that these gaps are significant, highlighting the fact that only a fraction of host PCIe bandwidth is utilized. @@ -335,13 +335,13 @@ From this discussion, and knowing that the Alveo U200 accelerator card has multi In this lab, you have learned: -- How to build, run and analyze the performance of a video filter -- How to write an optimized host-side application for multi CU designs +- How to build, run, and analyze the performance of a video filter +- How to write an optimized host-side application for multi compute unit designs - How to estimate kernel performance and compare it with measured performance ## Conclusion -Congratulations! You have successfully completed the tutorial. In this tutorial, you have learned how to estimate the performance and acceleration requirements for the hardware implementation of a Vitis Kernel. You have analyzed kernel performance using different tools and compared measured and estimated performance, to see how close both performance number are. You have also seen that you can create an optimized memory cache or hierarchy for FPGA based implementation easily that significantly boost application performance. Finally you have learnt how to write optimized host code to get best performance out of multiple CUs for a given kernel. +Congratulations! You have successfully completed the tutorial. In this tutorial, you have learned how to estimate the performance and acceleration requirements for the hardware implementation of a Vitis kernel. You have analyzed kernel performance using different tools and compared measured and estimated performance, to see how close both performance numbers are. You have also seen that you can easily create an optimized memory cache or hierarchy for an FPGA-based implementation that significantly boosts application performance. Finally, you have learned how to write optimized host code to get the best performance out of multiple compute units for a given kernel.
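As a closing illustration of the host-side pattern this tutorial relies on, where each `Filter2DRequest` chains its input transfers, kernel call, and output read-back through OpenCL events so that many requests stay in flight, the following is a minimal sketch of one such request. The names, argument list, and error handling are simplified and assumed for illustration only; this is not the tutorial's actual `Filter2DRequest` source, and it presumes the OpenCL context, buffers, kernel, and an out-of-order command queue already exist.

```cpp
#include <CL/cl.h>

// Enqueue copy-in -> kernel -> copy-out for one filtering request and return
// the event that signals completion of the output read-back. The caller can
// keep several such requests outstanding and only wait on the returned events.
cl_event enqueue_filter_request(cl_command_queue q, cl_kernel krnl,
                                cl_mem d_coeffs, cl_mem d_in, cl_mem d_out,
                                const void* h_coeffs, size_t coeffs_bytes,
                                const void* h_in, void* h_out, size_t img_bytes)
{
    cl_event wr[2], run, rd;

    // Non-blocking host-to-device transfers for coefficients and input image.
    clEnqueueWriteBuffer(q, d_coeffs, CL_FALSE, 0, coeffs_bytes, h_coeffs, 0, NULL, &wr[0]);
    clEnqueueWriteBuffer(q, d_in,     CL_FALSE, 0, img_bytes,    h_in,     0, NULL, &wr[1]);

    // The kernel run waits only on its own two input transfers.
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_coeffs);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_in);
    clSetKernelArg(krnl, 2, sizeof(cl_mem), &d_out);
    clEnqueueTask(q, krnl, 2, wr, &run);

    // The output read-back waits only on the kernel run; its completion event
    // is what the dispatcher hands back to the caller.
    clEnqueueReadBuffer(q, d_out, CL_FALSE, 0, img_bytes, h_out, 1, &run, &rd);
    return rd;
}
```

Because only the events order the commands, requests enqueued this way for different compute units can overlap their data transfers and kernel runs, which is what hides the data movement latency measured in this lab.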

Copyright © 2020–2023 Advanced Micro Devices, Inc

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md index 7005f7b24a..ee9dd10c5b 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md @@ -29,7 +29,7 @@ The following figure shows a Bloom filter example representing the set `{x, y, z In this tutorial, each document consists of an array of words where: each word is a 32-bit unsigned integer comprised of a 24-bit word ID and an 8-bit integer representing the frequency. The search array consists of words of interest to the user, and represents a smaller set of 24-bit word IDs, where each word ID has a weight associated with it, determining the importance of the word. 1. Navigate to `Hardware_Acceleration/Design_Tutorials/02-bloom` directory. -2. Go to the `cpu_src` directory, open the `main.cpp` file, and look at line 63. +2. Go to the `cpu_src` directory, open the `main.cpp` file, and look at line 63. The Bloom filter application is 64 KB, which is implemented as `1L<Return to Start of Tutorial
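For reference, the packed word layout described in this overview (a 24-bit word ID plus an 8-bit frequency in one 32-bit value) can be unpacked with a shift and a mask before the ID is hashed and the frequency is weighted. A minimal sketch, assuming the frequency occupies the low byte and the word ID the upper 24 bits:

```cpp
#include <cstdint>

// Split one packed document word into its two fields (assumed layout:
// word ID in the upper 24 bits, frequency in the low 8 bits).
inline void unpack_word(uint32_t word, uint32_t& word_id, uint32_t& frequency)
{
    word_id   = word >> 8;      // 24-bit ID used for the Bloom filter lookup
    frequency = word & 0xFFu;   // 8-bit occurrence count used for the score
}
```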

Copyright © 2020–2023 Advanced Micro Devices, Inc

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md index 6fa213a399..a97c0bb61b 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md @@ -41,14 +41,13 @@ In this lab, you will experience the acceleration potential by running the appli Execution COMPLETE ``` -3. Run the application on the FPGA. - For the purposes of this lab, the FPGA accelerator is implemented with an 8x parallelization factor. +3. Run the application on the FPGA. For the purposes of this lab, the FPGA accelerator is implemented with an 8x parallelization factor. - * Eight input words are processed in parallel, producing eight output flags in parallel during each clock cycle. + * Eight input words are processed in parallel, producing eight output flags in parallel during each clock cycle. To run the optimized application on the FPGA, run the following `make` command. - ``` bash + ``` bash make run_fpga SOLUTION=1 ``` @@ -68,7 +67,7 @@ In this lab, you will experience the acceleration potential by running the appli Throughput = Total data/Total time = 1.39 GB/427.1341ms = 3.25 GB/s - By efficiently leveraging FPGA acceleration, the throughput of the application increases by a factor of 7. + By efficiently leveraging FPGA acceleration, the throughput of the application increases by a factor of 7. ## Next Steps @@ -77,7 +76,6 @@ In this step, you observed the acceleration that can be achieved using an FPGA.

Return to Start of Tutorial

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md index 57c2e91f99..8ecdadbbd6 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md @@ -67,7 +67,7 @@ You will evaluate which of these sections are a good fit for the FPGA. * On the FPGA, you can create custom architectures, and therefore, create an accelerator that will shift the data by an arbitrary number of bits in a single clock cycle. - * The FPGA also has dedicated DSP units that perform multiplications faster than the CPU. Even though the CPU runs at a frequency eight times higher than the FPGA, the arithmetic shift and multiplication operations can perform faster on the FPGA because of the customizable hardware architecture. + * The FPGA also has dedicated digital signal processing (DSP) units that perform multiplications faster than the CPU. Even though the CPU runs at a frequency eight times higher than the FPGA, the arithmetic shift and multiplication operations can perform faster on the FPGA because of the customizable hardware architecture. Therefore, this function is a good candidate for FPGA acceleration. @@ -108,13 +108,13 @@ You will evaluate which of these sections are a good fit for the FPGA. From this code, you can see: - * You are computing two hash outputs for each word in all the documents and creating output flags accordingly. + * You are computing two hash outputs for each word in all the documents and creating output flags accordingly. - * You already determined that the hash function(`MurmurHash2()`) is a good candidate for acceleration on the FPGA. + * You already determined that the hash function(`MurmurHash2()`) is a good candidate for acceleration on the FPGA. * The hash (`MurmurHash2()`) function with one word is independent of other words and can be done in parallel which improves the execution time. - * The algorithm sequentially accesses to the `input_doc_words` array. This is an important property because when implemented in the FPGA, it allows for very efficient accesses to the DDR. + * The algorithm sequentially accesses to the `input_doc_words` array. This is an important property because when implemented in the FPGA, it allows for very efficient accesses to the DDR. This code section is a good candidate for FPGA acceleration because the hash function can run faster on the FPGA, and you can compute hashes for multiple words in parallel by reading multiple words from the DDR in burst mode. @@ -148,7 +148,7 @@ You will evaluate which of these sections are a good fit for the FPGA. * The compute score requires one memory access to `profile_weights`, one accumulation, and one multiplication operation. * The memory accesses are random because they depend on the `word_id` and therefore, the content of each document. - * The size of the `profile_weights` array is 128 MB and must be stored in the DDR memory connected to the FPGA. Non-sequential accesses to DDR are big performance bottlenecks. Because accesses to the `profile_weights` array are random, implementing this function on the FPGA would not provide much performance benefit, and because the function takes only about 11% of the total running time, you can keep this function on the host CPU. + * The size of the `profile_weights` array is 128 MB and must be stored in the DDR memory connected to the FPGA. 
Non-sequential accesses to DDR are big performance bottlenecks.Because accesses to the `profile_weights` array are random, implementing this function on the FPGA would not provide much performance benefit, and because the function takes only about 11% of the total running time, you can keep this function on the host CPU. Based on this analysis, it is only beneficial to accelerate the Compute Output Flags from the Hash section on the FPGA. The execution of the Compute Document Score section can be kept on the host CPU. @@ -156,7 +156,7 @@ You will evaluate which of these sections are a good fit for the FPGA. ## Establish the Realistic Goal for the Overall Application -The Compute Score function is calculated on the CPU. Based on your calculations in the previous lab, it takes about 380 ms. You cannot accelerate the function further for a given CPU. Even if the FPGA can compute the hash function in zero time, the application will still take at the minimum of 380 ms, but running the FPGA in no time is also not realistic. You also need to account for sending the data from the CPU to the FPGA and retrieving it back to the CPU from the FPGA which will also add a delay. +The Compute Score function is calculated on the CPU. Based on your calculations in the previous lab, it takes about 380 ms. You cannot accelerate the function further for a given CPU. Even if the FPGA can compute the hash function in zero time, the application will still take at the minimum of 380 ms, but running the FPGA in no time is also not realistic. You also need to account for sending the data from the CPU to the FPGA and retrieving it back to the CPU from the FPGA which will also add a delay. 1. Set the goal for the application such that the Compute Hash function (in Hardware) should run as fast as the Compute Score on the CPU so that the hash function does not become the bottleneck. @@ -164,11 +164,11 @@ The Compute Score function is calculated on the CPU. Based on your calculations 1. Keep only the compute hash function in the FPGA. In software, this function takes about 2569 ms. - The goal of application has been established to compute the hashes and overall score of the 10,000 documents in about 380 ms. + The goal of application has been established to compute the hashes and overall score of the 10,000 documents in about 380 ms. ### Determine the Maximum Achievable Throughput -In most FPGA-accelerated systems, the maximum achievable throughput is limited by the PCIe® bus. The PCIe bus performance is influenced by many different aspects, such as the motherboard, drivers, targeted shell, and transfer sizes. The Vitis core development kit provides a utility, `xbutil`. +In most FPGA-accelerated systems, the maximum achievable throughput is limited by the PCIe® bus. The PCIe® bus performance is influenced by many different aspects, such as the motherboard, drivers, targeted shell, and transfer sizes. The Vitis core development kit provides a utility, `xbutil`. Run the `xbutil validate` command to measure the maximum PCIe bandwidth that can be achieved. The throughput on your design target cannot exceed this upper limit. @@ -185,15 +185,15 @@ The `xbutil validate` command produces the following output. Host <- PCIe <- FPGA read bandwidth = 12154.5 MB/s ``` -The PCIe FPGA write bandwidth is about 9 GB/sec and the FPGA read bandwidth is about 12 GB/sec. The PCIe bandwidth is 3.1 GB/sec above your established goal. +The PCIe FPGA write bandwidth is about 9 GB/s and the FPGA read bandwidth is about 12 GB/s. 
The PCIe bandwidth is 3.1 GB/s above your established goal. ### Identifying Parallelization for an FPGA Application In Software, the flow will look similar to the following figure. -![missing image](./images/Architect1.PNG) +![missing image](./images/Architect1.PNG) -- `Murmurhash2` functions are calculated for all the words up front, and output flags are set in the local memory. Each of the loops in the hash compute functions are run sequentially. -- After all hashes have computed, only then can another loop be called for all the documents to calculate the compute score. +* `Murmurhash2` functions are calculated for all the words up front, and output flags are set in the local memory. Each of the loops in the hash compute functions are run sequentially. +* After all hashes have computed, only then can another loop be called for all the documents to calculate the compute score. When you run the application on the FPGA, an additional delay is added when the data is transferred from the host to device memory and read back. You can split the entire application time and create a budgeted-based run based on following requirements: @@ -207,14 +207,14 @@ For achieving application target of 380 ms, adding 1470 ms + 30 ms + 380 ms is c If steps 1 through 4 are carried out sequentially like the CPU, you cannot achieve your performance goal. You will need to take advantage of concurrent processing and overlapping of the FPGA. 1. Parallelism between the host to device data transfer and Compute on FPGA. Split the 100,000 documents into multiple buffers, and send the buffers to the device so the kernel does not need to wait until the whole buffer is transferred. -2. Parallelism between Compute on FPGA and Compute Profile score on CPU -3. Increase the number of words to be processed in parallel to increase the concurrent hash compute processing. The hash compute in the CPU is performed on a 32-bit word. Because the hash compute can be completed independently on several words, you can explore computing 4, 8, or even 16 words in parallel. +2. Parallelism between Compute on FPGA and Compute Profile score on CPU. +3. Increase the number of words to be processed in parallel to increase the concurrent hash compute processing. The hash compute in the CPU is performed on a 32-bit word. Because the hash compute can be completed independently on several words, you can explore computing four, eight, or even 16 words in parallel. Based on your earlier analysis, the `Murmurhash2` function is a good candidate for computation on the FPGA, and the Compute Profile Score calculation can be carried out on the host. For a hardware function on the FPGA, based on the established goal in previous lab, you should process hashes as fast as possible by processing multiples of these words every clock cycle. The application conceptually can similar to the following figure. -![missing image](./images/Architect2.PNG) +![missing image](./images/Architect2.PNG) Transferring words from the host to CPU, Compute on FPGA, transferring words from the FPGA to CPU all can be executed in parallel. The CPU starts to calculate the profile score as soon as the flags are received;essentially, the profile score on the CPU can also start calculations in parallel. With the pipelining as shown above, the latency of the Compute on FPGA will become invisible. 
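A rough host-side sketch of this split-buffer pipelining is shown below. The function and variable names are illustrative and error handling is omitted; it assumes one device buffer pair per chunk and an out-of-order command queue, so each chunk's write, kernel run, and read-back are ordered only by their events. The CPU then scores a chunk as soon as that chunk's flags arrive, while later chunks are still being transferred and hashed.

```cpp
#include <CL/cl.h>
#include <vector>

void run_chunked(cl_command_queue q, cl_kernel krnl,
                 cl_mem* d_words, cl_mem* d_flags,        // one device buffer per chunk
                 const unsigned* h_words, unsigned char* h_flags,
                 size_t total_words, int num_chunks)
{
    size_t chunk_words = total_words / num_chunks;         // assume an even split
    std::vector<cl_event> flags_ready(num_chunks);

    for (int c = 0; c < num_chunks; ++c) {
        const unsigned* src = h_words + c * chunk_words;
        unsigned char*  dst = h_flags + c * chunk_words;   // one flag byte per word
        unsigned size = (unsigned)chunk_words;
        cl_event wr, run;

        // 1. Send this chunk of document words to device DDR (non-blocking).
        clEnqueueWriteBuffer(q, d_words[c], CL_FALSE, 0,
                             chunk_words * sizeof(unsigned), src, 0, NULL, &wr);

        // 2. Hash this chunk on the FPGA as soon as its transfer completes.
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_words[c]);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_flags[c]);
        clSetKernelArg(krnl, 2, sizeof(unsigned), &size);
        clEnqueueTask(q, krnl, 1, &wr, &run);

        // 3. Read this chunk's flags back as soon as the kernel run finishes.
        clEnqueueReadBuffer(q, d_flags[c], CL_FALSE, 0, chunk_words, dst,
                            1, &run, &flags_ready[c]);
    }

    // 4. Score each chunk on the CPU as its flags arrive, overlapping the CPU
    //    work with the chunks that are still in flight on the FPGA.
    for (int c = 0; c < num_chunks; ++c) {
        clWaitForEvents(1, &flags_ready[c]);
        // compute_score(h_flags + c * chunk_words, chunk_words, ...);  // CPU part
    }
}
```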
@@ -222,7 +222,7 @@ Transferring words from the host to CPU, Compute on FPGA, transferring words fro In this lab, you profiled an application and determined which parts were best-suited for FPGA acceleration. You also created the setup to create an optimized kernel to achieve your acceleration goal. You will explore these steps in the following lab sections: -* [Implementing the Kernel](./4_implement-kernel.md): Create an optimized kernel with 4, 8, and 16 words to be processed in parallel and all 100,000 documents worth 1.4 GB are sent from host to kernel in single batch. +* [Implementing the Kernel](./4_implement-kernel.md): Create an optimized kernel with four, eight, and 16 words to be processed in parallel and all 100,000 documents worth 1.4 GB are sent from host to kernel in single batch. * [Analyze Data Movement Between Host and Kernel](./5_data-movement.md): Explore sending 100,000 documents in multiple batches, so that the kernel compute can be overlapped with the host data transfer for optimized application performance. You will analyze the results by keeping the `Murmurhash2` functions on the FPGA and Compute Score functions on the CPU to process in sequential mode. Next, flags created by the accelerator will also be sent over to the host and overlapped with the host to data transfer and FPGA compute. diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md index 39e095379e..410c9e555c 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md @@ -8,18 +8,18 @@ # Implementing the Kernel -In this lab, you will create an optimized kernel with 4, 8, and 16 words to be processed in parallel. All 100,000 documents worth 1.4 GB are sent from the host CPU to the kernel using a single buffer from the host CPU. +In this lab, you will create an optimized kernel with four, 8eight, and 16 words to be processed in parallel. All 100,000 documents worth 1.4 GB are sent from the host CPU to the kernel using a single buffer from the host CPU. -## Bloom4x: Kernel Implementation Using 4 Words in Parallel +## Bloom4x: Kernel Implementation Using Four Words in Parallel -Processing 4 words in parallel will require 32-bits*4 = 128-bits in parallel, but you should access the DDR with 512-bits because the data is contiguous. This will require smaller number of memory accesses. +Processing four words in parallel will require 32-bits*4 = 128-bits in parallel, but you should access the DDR with 512-bits because the data is contiguous. This will require smaller number of memory accesses. Use the following interface requirements to create kernel: -- Read multiple words stored in the DDR as a 512-bit DDR access, equivalent of reading 16 words per DDR access. +- Read multiple words stored in the DDR as a 512-bit DDR access, equivalent of reading 16 words per DDR access. - Write multiple flags to the DDR as a 512-bit DDR access, equivalent of writing 32 flags per DDR access. -- Compute 4 words to be computed in parallel with each word requiring two `MurmurHash2` functions -- Compute the hash (two `MurmurHash2` functions) functions for 4 words every cycle. +- Compute four words to be computed in parallel with each word requiring two `MurmurHash2` functions. +- Compute the hash (two `MurmurHash2` functions) functions for four words every cycle. 
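Taken together, the requirements above suggest a kernel interface along the following lines. This is only a structural sketch, not the tutorial's actual source: the argument and bundle names follow the pragma discussion later in this lab, and the body is omitted.

```cpp
#include <ap_int.h>

extern "C" void runOnfpga(
    const ap_uint<512>* input_words,   // 16 packed 32-bit words per AXI beat
    const unsigned int* bloom_filter,  // filter coefficients, loaded once
    ap_uint<512>*       output_flags,  // packed flags, written in 512-bit bursts
    unsigned int        total_size,    // number of words to process
    bool                load_filter)   // load coefficients on the first call only
{
#pragma HLS INTERFACE m_axi port=input_words  offset=slave bundle=maxiport0
#pragma HLS INTERFACE m_axi port=output_flags offset=slave bundle=maxiport0
#pragma HLS INTERFACE m_axi port=bloom_filter offset=slave bundle=maxiport1
    // Load (512-bit bursts) -> Compute (hash four words per cycle) ->
    // Store (512-bit bursts), built as a dataflow pipeline in the steps below.
}
```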
Refer to [Methodology for Accelerating Applications with the Vitis Software](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Methodology-for-Accelerating-Data-Center-Applications-with-the-Vitis-Software-Platform) in the in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416). @@ -37,7 +37,7 @@ The algorithm has been updated to receive 512-bits of words from the DDR with th - `load_filter`: Enable or disable of loading coefficients. This only needs to be loaded one time. 1. The first step of the kernel development methodology requires structuring the kernel code into the Load-Compute-Store pattern. This means creating a top-level function, `runOnfpga` with: - - Added sub-functions in the `compute_hash_flags_dataflow` for Load, Compute and Store. + - Added sub-functions in the `compute_hash_flags_dataflow` for Load, Compute, and Store. - Local arrays or `hls::stream` variables to pass data between these functions. 2. The source code has the following INTERFACE pragmas for `input_words`, `output_flags` and `bloom_filter`. @@ -51,23 +51,23 @@ The algorithm has been updated to receive 512-bits of words from the DDR with th - `m_axi`: Interface pragmas are used to characterize the AXI Master ports. - `port`: Specifies the name of the argument to be mapped to the AXI4 interface. - `offset=slave`: Indicates that the base address of the pointer is made available through the AXI4-Lite slave interface of the kernel. - - `bundle`: Specifies the name of the `m_axi` interface. In this example, the `input_words` and `output_flags` are mapped to a `maxiport0` and `bloom_filter` argument is mapped to `maxiport1`. + - `bundle`: Specifies the name of the `m_axi` interface. In this example, the `input_words` and `output_flags` are mapped to a `maxiport0`, and the `bloom_filter` argument is mapped to `maxiport1`. - The function `runOnfpga` loads the Bloom filter coefficients and calls the `compute_hash_flags_dataflow` function which has the main functionality of the Load, Compute and Store functions. + The function, `runOnfpga`, loads the Bloom filter coefficients and calls the `compute_hash_flags_dataflow` function which has the main functionality of the Load, Compute, and Store functions. - Refer to the function `compute_hash_flags_dataflow` in the `02-bloom/cpu_src/compute_score_fpga_kernel.cpp` file. The following block diagram shows how the compute kernel connects to the device DDR memories and how it feeds the compute hash block processing unit. + Refer to the function `compute_hash_flags_dataflow` in the `02-bloom/cpu_src/compute_score_fpga_kernel.cpp` file. The following block diagram shows how the compute kernel connects to the device DDR memories and how it feeds the compute hash block processing unit. ![missing image](./images/Kernel_block_diagram.PNG) The kernel interface to the DDR memories is an AXI interface that is kept at its maximum width of 512 at the input and output. The `compute_hash_flags` function input can have a width different than 512, managed through “PARALLELIZATION”. To deal with these variations on the processing element boundaries, "Resize" blocks are inserted that adapt between the memory interface width and the processing unit interface width. Essentially, blocks named "Buffer" are memory adapters that convert between streams, and the AXI and “Resize” blocks adapt to interface widths as it depends on PARALLELIZATION factor chosen for the given configuration. -3. 
The input of the `compute_hash_flags_dataflow` function, `input_words` are read as 512-bit burst reads from the global memory over an AXI interface and `data_from_gmem`, the stream of 512-bit values are created. +3. The input of the `compute_hash_flags_dataflow` function, `input_words` are read as 512-bit burst reads from the global memory over an AXI interface and `data_from_gmem`, the stream of 512-bit values are created. ``` hls_stream::buffer(data_from_gmem, input_words, total_size/(512/32)); ``` -4. The stream of parallel words, `word_stream` (equals PARALLELIZATION words) are created from `data_from_gmem` as `compute_hash_flags` requires 128-bit for 4 words to process in parallel. +4. The stream of parallel words, `word_stream` (equals PARALLELIZATION words) are created from `data_from_gmem` as `compute_hash_flags` requires 128-bit for four words to process in parallel. ``` hls_stream::resize(word_stream, data_from_gmem, total_size/(512/32)); @@ -75,7 +75,7 @@ The algorithm has been updated to receive 512-bits of words from the DDR with th 5. The function `compute_hash_flags_dataflow` calls the `compute_hash_flags` function for computing hash of parallel words. -6. With `PARALLELIZATION=4`, the output of the `compute_hash_flags`, `flag_stream` is 4*8-bit = 32-bit parallel words, which will be used to create the 512-bit values of stream as `data_to_mem`. +6. With `PARALLELIZATION=4`, the output of the `compute_hash_flags`, `flag_stream` is 4*8-bit = 32-bit parallel words, which will be used to create the 512-bit values of stream as `data_to_mem`. ``` hls_stream::resize(data_to_gmem, flag_stream, total_size/(512/8)); @@ -126,7 +126,7 @@ Now that you have the top-level function, `runOnfpga` updated with the proper da - `#pragma HLS PIPELINE II=1` is added to initiate the burst DDR accesses and read the Bloom filter coefficients every cycle. - The expected latency is about 16,000 cycles because the `bloom_filter_size` is fixed to 16,000. You should confirm this after you run HLS Synthesis. -2. Within the `compute_hash_flags` function, the `for` loop is rearchitected as nested for the loop to compute 4 words in parallel. +2. Within the `compute_hash_flags` function, the `for` loop is rearchitected as nested for the loop to compute four words in parallel. ```cpp void compute_hash_flags ( @@ -165,7 +165,7 @@ Now that you have the top-level function, `runOnfpga` updated with the proper da - Added `#pragma HLS UNROLL` - Unrolls internal loop to make four copies of the Hash functionality. - - Vitis HLS will try to pipeline the outer loop with `II=1`. With the inside loop unrolled, you can initiate the outer loop every clock cycle, and compute 4 words in parallel. + - Vitis HLS will try to pipeline the outer loop with `II=1`. With the inside loop unrolled, you can initiate the outer loop every clock cycle, and compute four words in parallel. - Added `#pragma HLS LOOP_TRIPCOUNT` min=1 max=3500000` - Reports the latency of the function after HLS Synthesis. @@ -181,7 +181,7 @@ Now, build the kernel using the Vitis compiler. The Vitis compiler will call the This command will call the `v++` compiler which then calls the Vitis HLS tool to translate the C++ code into RTL code that can be used to run Hardware Emulation. - >**NOTE**: For purposes of this tutorial, the number of input words used is only 100 because it will take a longer time to run the Hardware Emulation. 
+ >**NOTE**: For purposes of this tutorial, the number of input words used is only 100 because it will take a longer time to run the Hardware Emulation. 2. Then, use the following commands to visualize the HLS Synthesis Report in the Vitis analyzer. @@ -189,13 +189,13 @@ Now, build the kernel using the Vitis compiler. The Vitis compiler will call the vitis_analyzer ../build/single_buffer/kernel_4/hw_emu/runOnfpga_hw_emu.xclbin.link_summary ``` - >**NOTE:** In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the link summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report. + >**NOTE:** In the 2023.1 release, this command opens the Analysis view of the new Vitis Unified IDE and loads the link summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). You can navigate to the various reports using the left pane of the Analysis view or by clicking the links provided in the summary report. -Select the System Estimate report to open it. + Select the System Estimate report to open it. - ![missing image](./images/4_Kernel_4_link.PNG) + ![missing image](./images/4_Kernel_4_link.PNG) - - The `compute_hash_flags` latency reported is 875,011 cycles. This is based on total of 35,000,000 words, computed with 4 words in parallel. This loop has 875,000 iterations and including the `MurmurHash2` latency, the total latency of 875,011 cycles is optimal. + - The `compute_hash_flags` latency reported is 875,011 cycles. This is based on total of 35,000,000 words, computed with four words in parallel. This loop has 875,000 iterations and including the `MurmurHash2` latency, the total latency of 875,011 cycles is optimal. - The `compute_hash_flags_dataflow` function has `dataflow` enabled in the Pipeline column. This function is important to review and indicates that the task-level parallelism is enabled and expected to have overlap across the sub-functions in the `compute_hash_flags_dataflow` function. - The latency reported for `read_bloom_filter` function is 16,385 for reading the Bloom filter coefficients from the DDR using the `bloom_filter maxi` port. This loop is iterated over 16,000 cycles reading 32-bits data of from the Bloom filter coefficients. @@ -205,7 +205,7 @@ The reports confirm that the latency of the function meets your target. You stil The initial version of the accelerated application code structure follows the structure of the original software version. The entire input buffer is transferred from the host to the FPGA in a single transaction. Then, the FPGA accelerator performs the computation. Finally, the results are read back from the FPGA to the host before being post-processed. -The following figure shows the sequential process of the host writing data on the device, compute by the accelerator on the FPGA, and read flags back to host, implemented in this first step. The Profile score is calculated sequentially on CPU after all the flags are received by the host. +The following figure shows the sequential process of the host writing data on the device, compute by the accelerator on the FPGA, and read flags back to host, implemented in this first step. 
The Profile score is calculated sequentially on CPU after all the flags are received by the host. ![missing image](./images/overlap_single_buffer.PNG) @@ -213,18 +213,18 @@ The FPGA accelerator computes the hash values and flags for the provided input w The functionality of the different inputs passed to the accelerator kernel is as follows: -* `input_doc_words`: Input array that contains the 32-bit words for all the documents. -* `bloom_filter`: Bloom filter array that contains the inserted search array hash values. -* `total_size`: Unsigned `int` that represents the total size processed by the FPGA when called. -* `load_weights`: Boolean that allows the `bloom_filter` array to load only once to the FPGA in the case of multiple kernel invocations. +- `input_doc_words`: Input array that contains the 32-bit words for all the documents. +- `bloom_filter`: Bloom filter array that contains the inserted search array hash values. +- `total_size`: Unsigned `int` that represents the total size processed by the FPGA when called. +- `load_weights`: Boolean that allows the `bloom_filter` array to load only once to the FPGA in the case of multiple kernel invocations. The output of the accelerator is as follows: -* `output_inh_flags`: Output array of 8-bit outputs where each bit in the 8-bit output indicates whether a word is present in the Bloom filter, that is then used for computing score in the CPU. +- `output_inh_flags`: Output array of 8-bit outputs where each bit in the 8-bit output indicates whether a word is present in the Bloom filter, that is then used for computing score in the CPU. -### Run Software Emulation, Hardware Emulation and Hw +### Run Software Emulation, Hardware Emulation and Hardware -1. To ensure the application passes Software Emulation with your changes, run the following command. +1. To ensure the application passes software emulation with your changes, run the following command. ``` cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer TARGET=sw_emu @@ -232,7 +232,7 @@ The output of the accelerator is as follows: Make sure that the Software Emulation is passing. -2. Next, to verify the functionality is intact, use the following command to run Hardware Emulation. +2. Next, to verify the functionality is intact, use the following command to run hardware emulation. ``` cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer TARGET=hw_emu @@ -240,10 +240,9 @@ The output of the accelerator is as follows: - The command should conclude with 'Verification: Pass'. This ensures that the generated hardware is functionally correct. However, you have not yet run the hardware on the FPGA. - - > **NOTE**: This tutorial is provided with `xclbin` files in the `$LAB_WORK_DIR/xclbin_save` directory. The `SOLUTION=1` option can be added to the make target for using these `xclbin` files for `hw` runs. These `xclbin` files were generated for Alveo U200 cards only. You must generate new `xclbin` files for every platform used in this tutorial. + > **NOTE**: This tutorial is provided with `xclbin` files in the `$LAB_WORK_DIR/xclbin_save` directory. The `SOLUTION=1` option can be added to the make target for using these `xclbin` files for `hw` runs. These `xclbin` files were generated for AMD Alveo™ U200 cards only. You must generate new `xclbin` files for every platform used in this tutorial. -4. Run the following steps to execute the application on hardware. +3. Run the following steps to execute the application on hardware. You are using 100,000 documents compute on the hardware. 
@@ -251,14 +250,14 @@ The output of the accelerator is as follows: cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=4 TARGET=hw ``` - * If you are using an `xclbin` provided as part of solution in this tutorial, then use the following command. + - If you are using an `xclbin` provided as part of solution in this tutorial, then use the following command. ``` cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=4 TARGET=hw SOLUTION=1 ``` - * To use four words in parallel, `PF=4` will set the PARALLELIZATION macro to 4 in `$LAB_WORK_DIR/reference_files/compute_score_fpga_kernel.cpp`. - * `ITER=1` indicates buffer sent using single iteration (using a single buffer). + - To use four words in parallel, `PF=4` will set the PARALLELIZATION macro to 4 in `$LAB_WORK_DIR/reference_files/compute_score_fpga_kernel.cpp`. + - `ITER=1` indicates buffer sent using single iteration (using a single buffer). The following output displays. @@ -273,14 +272,14 @@ The output of the accelerator is as follows: Verification: PASS ``` - - Total FPGA time is 447 ms. This includes the host to DDR transfer, Total Compute on FPGA and DDR to host transfer. - - Total time of computing 100,000 documents is about 838 ms. + - The total FPGA time is 447 ms. This includes the host to DDR transfer, Total Compute on FPGA, and DDR to host transfer. + - The total time of computing 100,000 documents is about 838 ms. At this point, review the Profile reports and Timeline Trace to extract information, such as how much time it takes to transfer the data between host and kernel and how much time it takes to compute on the FPGA. ### Visualize the Resources Utilized -Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to `generate link_summary` as this is not provided as part of tutorial. You can skip this step as well. +Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to `generate link_summary` as this is not provided as part of tutorial. You can skip this step as well. ``` vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_4/hw/runOnfpga_hw.xclbin.link_summary @@ -288,7 +287,7 @@ vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_4/hw/runOnfpga_hw.xclbin The HLS Synthesis Report shows the number of LUTs, REG, and BRAM utilized for the Bloom4x kernel implementation. -![missing image](./images/kernel_4_util.PNG) +![missing image](./images/kernel_4_util.PNG) ### Review Profile Reports and Timeline Trace @@ -302,32 +301,32 @@ The Profile Summary and Timeline Trace reports are useful tools to analyze the p ### Review Profile Summary Report -* *Kernels & Compute Unit: Kernel Execution* indicates that the Total Time by kernel enqueue is about 292 ms. +- *Kernels & Compute Unit: Kernel Execution* indicates that the total time by kernel enqueue is about 292 ms. - ![missing image](./images/kernel_4_profile_1.PNG) + ![missing image](./images/kernel_4_profile_1.PNG) - - 4 words in parallel are computed. The accelerator is architected at 300 MHz. In total, you are computing 350,000,000 words (3,500 words/document * 100,000 documents). - - Number of words/(Clock Freq * Parallelization factor in kernel) = 350M/(300M*4) = 291.6 ms. The actual FPGA compute time is almost same as your theoretical calculations. + - Four words in parallel are computed. The accelerator is architected at 300 MHz. In total, you are computing 350,000,000 words (3,500 words/document * 100,000 documents). 
+ - Number of words/(Clock Freq * Parallelization factor in kernel) = 350M/(300M*4) = 291.6 ms. The actual FPGA compute time is almost the same as your theoretical calculations.

-* *Host Data Transfer: Host Transfer* shows that the Host Write Transfer to DDR is 145 ms and the Host Read Transfer to DDR is 36 ms.
-  ![missing image](./images/kernel_4_profile_2.PNG)
+- *Host Data Transfer: Host Transfer* shows that the Host Write Transfer to DDR is 145 ms, and the Host Read Transfer to DDR is 36 ms.
+  ![missing image](./images/kernel_4_profile_2.PNG)

- - Host Write transfer using a theoretical PCIe bandwidth of 9GB should be 1399 MB/9GBps = 154 ms
- - Host Read transfer using a theoretical PCIe bandwidth of 12GB should be 350 MB/12 GBps = 30 ms
- - Reported number indicates that the PCIe transfers are occurring at the maximum bandwidth
+ - The Host Write transfer, at a theoretical PCIe bandwidth of 9 GB/s, should take 1399 MB/9 GB/s = 154 ms.
+ - The Host Read transfer, at a theoretical PCIe bandwidth of 12 GB/s, should take 350 MB/12 GB/s = 30 ms.
+ - The reported numbers indicate that the PCIe transfers are occurring at the maximum bandwidth.

-* *Kernels & Compute Unit: Compute Unit Stalls* confirms that there are almost no "External Memory Stalls"
-  ![missing image](./images/Kernel_4_Guidance_Stall.PNG)
+- *Kernels & Compute Unit: Compute Unit Stalls* confirms that there are almost no "External Memory Stalls".
+  ![missing image](./images/Kernel_4_Guidance_Stall.PNG)

### Review the Timeline Trace

The Timeline Trace shows the data transfer from the host to the FPGA and back to the host as they appear. The Timeline Trace can be visualized so that the transfer from the host to the FPGA and the FPGA compute and transfer from the FPGA to host occur sequentially.

-  ![missing image](./images/Kernel_4_time_1.PNG)
+  ![missing image](./images/Kernel_4_time_1.PNG)

- - There is a sequential execution of operations starting from the data transferred from the host to the FPGA, followed by compute in the FPGA and transferring back the results from the FPGA to the host.
- - At any given time, either the host or FPGA has access to the DDR. In other words, there is no memory contention between the host and kernel accessing the same DDR.
- - Using a single buffer will create a kernel itself with lowest latency and most optimized performance.
+- The operations execute sequentially: the data is transferred from the host to the FPGA, the FPGA computes, and the results are then transferred from the FPGA back to the host.
+- At any given time, either the host or the FPGA has access to the DDR. In other words, there is no memory contention between the host and the kernel accessing the same DDR.
+- With a single buffer, the kernel itself runs with the lowest latency and the most optimized performance.

### Throughput Achieved

@@ -337,18 +336,18 @@ Based on the results, the throughput of the application is 1399 MB/838 ms = appr

Because there is no external memory access contention for accessing memory, this is the best possible performance for the Kernel computing four words in parallel.

-* The kernel is reading 512-bits of data, but only 128-bits are used to compute 4 words in parallel. The kernel has the potential to compute more words because you have resources available on the FPGA. You can increase the number of words to be processed in parallel and can experiment with 8 words and 16 words in parallel.
-* Also note, even though the kernel is operating at its best performance, the kernel has to wait until the complete transfer is done by the host. Usually, it is recommended that you send a larger buffer from the host to DDR, but a very large buffer will add a delay before the kernel can start, impacting overall performance.
+- The kernel is reading 512 bits of data, but only 128 bits are used to compute four words in parallel. The kernel has the potential to compute more words because you have resources available on the FPGA. You can increase the number of words to be processed in parallel and can experiment with eight words and 16 words in parallel.
+- Also note that even though the kernel is operating at its best performance, it has to wait until the complete transfer is done by the host. It is usually recommended to send a larger buffer from the host to the DDR, but a very large buffer adds a delay before the kernel can start, impacting overall performance.

You will continue this lab and create a kernel with 8 words to be computed in parallel. In the next lab, [Analyze Data Movement Between Host and Kernel](./5_data-movement.md), you will experiment with splitting the buffer into multiple chunks, and observe how this affects the performance later in this tutorial.

-## Bloom8x: Kernel Implementation Using 8 Words in Parallel
+## Bloom8x: Kernel Implementation Using Eight Words in Parallel

-For config Bloom4x, you read 512-bit input values from the DDR and computed 4 words in parallel which uses only 128-bit input values. This steps enables you to run 8 words in parallel.
+For config Bloom4x, you read 512-bit input values from the DDR and computed four words in parallel, which used only 128 bits of each input value. This step enables you to run eight words in parallel.

-You can achieve this by using `PF=8` on the command line. Use the following steps to compute 8 words in parallel.
+You can achieve this by using `PF=8` on the command line. Use the following steps to compute eight words in parallel.

-### Run Hardware on the FPGA
+### Run Hardware on the FPGA

Use the following command to run the hardware on the FPGA.

@@ -367,22 +366,22 @@ Single_Buffer: Running with a single buffer of 1398.903 MBytes for FPGA processi
Verification: PASS
```

-- The Total FPGA time is 315 ms. This includes the Host to DDR transfer, Total Compute on FPGA and DDR to Host transfer.
+- The Total FPGA time is 315 ms. This includes the Host to DDR transfer, the Total Compute on FPGA, and the DDR to Host transfer.
- As expected, when computing 8 words in parallel, the total FPGA Time that includes the data transfer between the host and device, has reduced from 447 ms to 315 ms.

### Visualize the Resources Utilized

-Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to generate link_summary; this is not provided as part of tutorial. You can skip this step.
+Use the Vitis analyzer to visualize the HLS Synthesis Report. You would need to build the kernel without `SOLUTION=1` to generate the link_summary; because this is not provided as part of the tutorial, you can skip this step.

```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/runOnfpga_hw.xclbin.link_summary
```

-From the HLS Synthesis Report, you can see that the number of resources increased for LUTs, REG, and BRAM compared to the Bloom4x kernel implementation.
+From the HLS Synthesis Report, you can see that the resource usage for LUTs, REG, and block RAM has increased compared to the Bloom4x kernel implementation.

-![missing image](./images/kernel_8_util.PNG)
+![missing image](./images/kernel_8_util.PNG)

-### Review Profile Summary Report and Timeline Trace
+### Review the Profile Summary Report and Timeline Trace

Use the Vitis analyzer to visualize the run_summary report.

@@ -390,13 +389,13 @@ Use the Vitis analyzer to visualize the run_summary report.
```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/runOnfpga_hw.xclbin.run_summary
```

-Review the profile reports and compare the metrics with configuration Bloom4x
+Review the profile reports, and compare the metrics with the Bloom4x configuration.

-1. *Kernels & Compute Unit:Kernel Execution* reports 146 ms compared to the 292 ms. This is exactly half the time as now 8 words are computed in parallel instead of 4 words.
+1. *Kernels & Compute Unit: Kernel Execution* reports 146 ms compared to 292 ms. This is exactly half the time, because eight words are now computed in parallel instead of four.
2. *Host Data Transfer: Host Transfer* section reports the same delays.
-3. The overall gain in the application overall gain is only because that kernel now is processing 8 words in parallel compared to 4 words in parallel.
+3. The overall gain in the application comes only from the kernel now processing eight words in parallel instead of four.

-   ![missing image](./images/Kernel_8_time_1.PNG)
+   ![missing image](./images/Kernel_8_time_1.PNG)

### Throughput Achieved

@@ -404,7 +403,7 @@
## Bloom16x : Kernel Implementation Using 16 Words in Parallel

-In the previous step, you read 512-bit input values from the DDR and computed 8 words in parallel. You can compute 16 words in parallel by setting `PF=16` on the command line. Use the following steps to compute 16 words in parallel.
+In the previous step, you read 512-bit input values from the DDR and computed eight words in parallel. You can compute 16 words in parallel by setting `PF=16` on the command line. Use the following steps to compute 16 words in parallel.

### Run Hardware on the FPGA

@@ -424,11 +423,11 @@ Processing 1398.903 MBytes of data
Verification: PASS
```

-As previous steps, you can also review profile and timeline trace reports. This is left out as an homework exercise.
+As in the previous steps, you can also review the profile and timeline trace reports. This is left as a homework exercise.

### Opportunities for Performance Improvements

-In this lab, you focused on building on an optimal kernel with 4, 8 and 16 words in parallel using a single buffer and reviewed the reports. As you observed from the Timeline Report that the kernel cannot start until the whole buffer is transferred to the DDR, which is about 145 ms when the single buffer is used. This is a substantial time delay considering the `Kernel execution time` is even smaller than the total host transfer time.
+In this lab, you focused on building an optimal kernel with four, eight, and 16 words in parallel using a single buffer and reviewed the reports. As you observed from the Timeline Report, the kernel cannot start until the whole buffer is transferred to the DDR, which takes about 145 ms when a single buffer is used. This is a substantial delay, considering that the `Kernel execution time` is even smaller than the total host transfer time.
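All of the timing estimates quoted in this lab come from two simple formulas: kernel compute time = number of words / (clock frequency × parallelization factor), and transfer time = bytes moved / PCIe bandwidth. If you want to sanity-check them yourself, the small standalone program below reproduces the arithmetic for PF = 4, 8, and 16. It is not part of the tutorial sources; the constants are simply the figures quoted above, and the PCIe bandwidths are the assumed effective values.

```cpp
// estimate.cpp -- back-of-the-envelope estimates used in this lab.
// Compile with: g++ -std=c++17 estimate.cpp -o estimate
#include <cstdio>

int main() {
    const double words        = 350e6;   // ~350M words processed per run
    const double clock_hz     = 300e6;   // kernel clock, 300 MHz
    const double write_mb     = 1399.0;  // host -> DDR payload (MB)
    const double read_mb      = 350.0;   // DDR -> host payload (MB)
    const double pcie_wr_gbps = 9.0;     // assumed effective PCIe write bandwidth (GB/s)
    const double pcie_rd_gbps = 12.0;    // assumed effective PCIe read bandwidth (GB/s)

    // Kernel compute time = words / (clock * parallelization factor).
    for (int pf : {4, 8, 16}) {
        double compute_ms = words / (clock_hz * pf) * 1e3;
        std::printf("PF=%2d : kernel compute ~ %.1f ms\n", pf, compute_ms);
    }

    // MB divided by GB/s gives milliseconds directly (1 GB = 1000 MB).
    std::printf("Host write ~ %.0f ms, host read ~ %.0f ms\n",
                write_mb / pcie_wr_gbps, read_mb / pcie_rd_gbps);
    return 0;
}
```

For PF = 4 and PF = 8 this prints roughly 292 ms and 146 ms of kernel compute time, matching the profile figures above, and the transfer estimates land near the 154 ms write and 30 ms read quoted for the host transfers.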
## Next Steps @@ -437,8 +436,6 @@ In the next lab, you [explore sending documents in multiple buffers](./5_data-mo

Return to Start of Tutorial

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

- diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md index d7a8717564..094426ee43 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md @@ -6,7 +6,6 @@ - # Data Movement Between the Host and Kernel In the previous step, you implemented a sequential execution of the written words from the host, computing hash functions on the FPGA, and reading flags by the host. @@ -16,12 +15,12 @@ The compute does not start until the entire input is read into the FPGA, and sim In this lab, you will work with an: - Overlap of host data transfer and compute on the FPGA with split buffers (two buffers) - - Split the documents and send them to the FPGA in two iterations. + - Split the documents and send them to the FPGA in two iterations. - The kernel can start the compute as soon as the data for the corresponding iteration is transferred to the FPGA. - Overlap of host data transfer and compute with multiple buffers - Explore how the application performance is affected based on splitting the documents and into 2, 4, 8, 16, 32, 64, and 128 chunks. -- Overlap data transfer from host, compute on FPGA and profile score on the CPU - - Enables the host to start profile scores as soon as the flags are received. +- Overlap data transfer from host, compute on FPGA and profile score on the CPU. + - Enables the host to start profile scores as soon as the flags are received. ## Overlap of Host Data Transfer and Compute with Split Buffers @@ -117,9 +116,9 @@ Navigate to `$LAB_WORK_DIR/reference_files`, and with a file editor, open `run_s krnlWait.push_back(krnlDone); q.enqueueMigrateMemObjects({subbuf_inh_flags[0]}, CL_MIGRATE_MEM_OBJECT_HOST, &krnlWait, &flagDone); flagWait.push_back(flagDone); - ``` + ``` -5. During the second iteration, the kernel arguments are set, the commands to write the input buffer with second set of words to the FPGA, execute the kernel, and read the results back to the host, are enqueued. +5. During the second iteration, the kernel arguments are set, the commands to write the input buffer with second set of words to the FPGA, execute the kernel, and read the results back to the host, are enqueued. ```cpp // Set Kernel Arguments, Read, Enqueue Kernel and Write for second iteration @@ -167,7 +166,7 @@ Verification: PASS ## Review Profile Report and Timeline Trace for the Bloom8x Kernel -1. Run the following commands to view the Timeline Trace report with Bloom8x kernel. +1. Run the following commands to view the Timeline Trace report with the Bloom8x kernel. ``` vitis_analyzer $LAB_WORK_DIR/build/split_buffer/kernel_8/hw/runOnfpga_hw.xclbin.run_summary @@ -178,8 +177,8 @@ Verification: PASS ![missing image](./images/double_buffer_timeline_trace.PNG) - The Timeline Trace confirms that you achieved the execution schedule you expected. - * There is an overlap of the read and compute with write operations between the first and second iterations. - * The execution time of the first kernel run and the first data read are effectively "hidden" behind the write data transfer from host. This results in a faster overall run. + - There is an overlap of the read and compute with write operations between the first and second iterations. + - The execution time of the first kernel run and the first data read are effectively "hidden" behind the write data transfer from host. This results in a faster overall run. 3. 
From the Profile Report, *Host Data Transfer: Host Transfer* shows that the "data transfer" from the host CPU consumes more than the "Kernel Compute Time". - The Host to Global Memory WRITE Transfer takes about 178 ms, which is higher compared to using a single buffer. @@ -209,7 +208,7 @@ Executed Software-Only version | 3133.5186 ms Verification: PASS ``` -You can see that if the documents are split into two buffers, the overall application execution time using the Bloom8x kernel and Bloom16x kernel are very close. As expected, using the Bloom16x kernel rather than the Bloom8x kernel has no benefit. +You can see that if the documents are split into two buffers, and the overall application execution time using the Bloom8x kernel and Bloom16x kernel are very close. As expected, using the Bloom16x kernel rather than the Bloom8x kernel has no benefit. While developing your own application, these attributes can be explored to make trade-offs and pick the optimal kernel implementation optimized for resources/performance. @@ -276,7 +275,7 @@ Open `run_generic_buffer.cpp` in `$LAB_WORK_DIR/reference_files` with a file edi ``` 1. The kernel arguments are set, and the kernel is enqueued to load the Bloom filter coefficients. - + ```cpp // Set Kernel arguments and load the Bloom filter coefficients in the kernel cl::Event buffDone, krnlDone; @@ -291,7 +290,7 @@ Open `run_generic_buffer.cpp` in `$LAB_WORK_DIR/reference_files` with a file edi ``` 1. For each iteration, kernel arguments are set, and the commands to write the input buffer to the FPGA, execute the kernel, and read the results back to the host, are enqueued. - + ```cpp // Set Kernel arguments. Read, Enqueue Kernel and Write for each iteration for (int i=0; i Write* of the Host section seems to have no gap. Kernel compute time of each invocation is smaller than the Host transfer. + - *Data Transfer -> Write* of the Host section seems to have no gap. Kernel compute time of each invocation is smaller than the Host transfer. - Each Kernel compute and writing flags to DDR are overlapped with the next Host-> Device transfer. @@ -495,11 +494,11 @@ The following output displays. ![missing image](./images/sw_overlap_stalls.PNG) -3. *Host Data Transfer: Host Transfer* Host to Global Memory WRITE Transfer takes about 207.5 ms and Host to Global Memory READ Transfer takes about 36.4 ms +3. *Host Data Transfer: Host Transfer* Host to Global Memory WRITE Transfer takes about 207.5 ms and Host to Global Memory READ Transfer takes about 36.4 ms. ![missing image](./images/sw_overlap_profile_host.PNG) - * *Kernels & Compute Unit: Compute Unit Utilization* section shows that CU Utilization is about 71%. This is an important measure representing how much time CU was active over the Device execution time. +- *Kernels & Compute Unit: Compute Unit Utilization* section shows that CU Utilization is about 71%. This is an important measure representing how much time CU was active over the device execution time. ![missing image](./images/sw_overlap_profile_CU_util.PNG) @@ -517,8 +516,6 @@ The host and kernel are trying to access the same DDR bank at the same time whic
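To make the execution schedule discussed in this lab concrete, the following is a minimal sketch of the event wiring that produces the split-buffer overlap. It is illustrative only: the function name, variable names, and kernel argument indices are hypothetical rather than taken from the lab's host code, and it assumes the command queue was created with out-of-order execution enabled so that the write of one chunk can proceed while the previous chunk is still being computed or read back.

```cpp
// split_buffer_sketch.cpp -- an illustrative sketch, not part of the tutorial
// sources. Names and argument indices are placeholders.
#define CL_HPP_CL_1_2_DEFAULT_BUILD
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl2.hpp>
#include <vector>

// Enqueue two document chunks so that the kernel run and the flag read-back
// of chunk 0 can overlap with the host write of chunk 1. Assumes 'q' was
// created with out-of-order execution enabled.
void run_two_chunks(cl::CommandQueue &q, cl::Kernel &kernel,
                    std::vector<cl::Buffer> &sub_words,  // per-chunk input sub-buffers
                    std::vector<cl::Buffer> &sub_flags,  // per-chunk output sub-buffers
                    unsigned words_per_chunk) {
    std::vector<cl::Event> wordsDone(2), krnlDone(2), flagsDone(2);

    for (int i = 0; i < 2; ++i) {
        // Per-chunk kernel arguments (indices are placeholders).
        kernel.setArg(0, sub_words[i]);
        kernel.setArg(1, sub_flags[i]);
        kernel.setArg(2, words_per_chunk);

        // 1. Host -> device DDR transfer of this chunk only.
        q.enqueueMigrateMemObjects({sub_words[i]}, 0 /* to device */, nullptr,
                                   &wordsDone[i]);

        // 2. The kernel run waits only on its own chunk's write transfer.
        std::vector<cl::Event> krnlWait{wordsDone[i]};
        q.enqueueTask(kernel, &krnlWait, &krnlDone[i]);

        // 3. The flag read-back waits only on that kernel run.
        std::vector<cl::Event> flagWait{krnlDone[i]};
        q.enqueueMigrateMemObjects({sub_flags[i]}, CL_MIGRATE_MEM_OBJECT_HOST,
                                   &flagWait, &flagsDone[i]);
    }
    q.finish();  // wait until both chunks have been processed and read back
}
```

With an in-order queue the same calls still execute correctly, but they serialize, which is essentially the single-buffer behavior of the previous lab; the overlap comes from each command waiting only on the events of its own chunk.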

Return to Start of Tutorial

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

-

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md
index a058a0aebd..005ec302ac 100644
--- a/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md
+++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md
@@ -6,34 +6,32 @@
-

# Using Multiple DDR Banks

-In the previous step, you noticed the overlap of the host data transfer documents sent to FPGA were also split into multiple buffers. Flags from the FPGA were also sent to the host immediately; this overlaps the compute profile score on the CPU with the "Compute FPGA" which further improves the application execution time.
+In the previous step, you overlapped the host data transfer with the FPGA compute by splitting the documents sent to the FPGA into multiple buffers. The flags were also sent back to the host immediately, which overlaps the profile score computation on the CPU with the "Compute FPGA" time and further improves the application execution time.

-You also observed memory contention because the host and kernel both accessed the same bank at the same time.
-In this section, you configure multiple DDR banks to improve the kernel performance.
+You also observed memory contention because the host and kernel both accessed the same bank at the same time. In this section, you configure multiple DDR banks to improve the kernel performance.

-Alveo cards have multiple DDR banks, and you can use multiple banks in ping-pong fashion to minimize the contention.
+AMD Alveo™ cards have multiple DDR banks, and you can use multiple banks in a ping-pong fashion to minimize the contention.

-* The host is writing words to DDR bank 1 and DDR bank 2 alternatively.
-* When the host is writing words to DDR bank1, the kernel is reading flags from DDR bank2.
-* When host is writing documents to DDR bank2, the kernel is reading flags from DDR bank1.
+* The host writes words to DDR bank 1 and DDR bank 2 alternately.
+* When the host is writing words to DDR bank 1, the kernel is reading flags from DDR bank 2.
+* When the host is writing documents to DDR bank 2, the kernel is reading flags from DDR bank 1.

-The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port is connected to both DDR banks. You must establish the connectivity of kernel arguments to DDR banks in the `v++ --link` command as described in [Mapping Kernel Ports to Memory](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Mapping-Kernel-Ports-to-Memory). In this case the `$LAB_WORK_DIR/makefile/connectivity.cfg` configuration file specifies the connectivity.
+The kernel will read from DDR bank 1 and bank 2 alternately, and its `maxi` port is connected to both DDR banks. You must establish the connectivity of kernel arguments to DDR banks in the `v++ --link` command as described in [Mapping Kernel Ports to Memory](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Mapping-Kernel-Ports-to-Memory). In this case, the `$LAB_WORK_DIR/makefile/connectivity.cfg` configuration file specifies the connectivity.

```
[connectivity]
sp=runOnfpga_1.input_words:DDR[1:2]
```

-  - The `-sp` option instructs the `v++` linker that `input_words` is connected to both DDR banks 1 and 2. You will need to rebuild the kernel because connectivity is now changed.
+* The `sp` option instructs the `v++` linker that `input_words` is connected to both DDR banks 1 and 2. You will need to rebuild the kernel because the connectivity has changed.

## Code Modifications

1. Navigate to `$LAB_WORK_DIR/reference_files`, and with a file editor, open `run_sw_overlap_multiDDR.cpp`.

-3. From the host code, you will need to send the words to both DDR banks alternatively. The DDR bank assignment in the host code is supported by a Xilinx vendor extension to the OpenCL API. Two Xilinx extension pointer objects (`cl_mem_ext_ptr_t`) are created, `buffer_words_ext[0]` and `buffer_words_ext[1]`. The`flags` will determine which DDR bank the buffer will be send to, so that kernel can access it.
+2. From the host code, you will need to send the words to both DDR banks alternately. The DDR bank assignment in the host code is supported by an AMD vendor extension to the OpenCL™ API. Two extension pointer objects (`cl_mem_ext_ptr_t`) are created, `buffer_words_ext[0]` and `buffer_words_ext[1]`. Their `flags` members determine which DDR bank each buffer is sent to, so that the kernel can access it.

   ```cpp
   cl_mem_ext_ptr_t buffer_words_ext[2];
@@ -46,9 +44,9 @@ The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port
   buffer_words_ext[1].obj = input_doc_words;
   ```

-4. Next two buffers, `buffer_doc_words[0]` and `buffer_doc_words[1]` are created in `DDR[1]` and `DDR[2]` as follows.
+3. The next two buffers, `buffer_doc_words[0]` and `buffer_doc_words[1]`, are created in `DDR[1]` and `DDR[2]` as follows.

-   ```cpp
+   ```cpp
   buffer_doc_words[0] = cl::Buffer(context, CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, total_size*sizeof(uint), &buffer_words_ext[0]);
   buffer_doc_words[1] = cl::Buffer(context, CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, total_size*sizeof(uint), &buffer_words_ext[1]);
   buffer_inh_flags = cl::Buffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, total_size*sizeof(char),output_inh_flags);
@@ -80,11 +78,9 @@ The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port
   }
   ```

-5. The kernel argument, input word is set to array of sub-buffers created from `buffer_doc_words[0]` and `buffer_doc_words[1]` alternatively; hence, data is sent to DDR bank 1 and 2 alternatively in each kernel execution.
-
+4. The kernel argument for the input words is set to the array of sub-buffers created from `buffer_doc_words[0]` and `buffer_doc_words[1]` alternately; hence, data is sent to DDR banks 1 and 2 alternately in each kernel execution.
-
-   ```cpp
+   ```cpp
   for (int i=0; i Write transactions and observe that the host is writing to bank1, bank2, bank1, bank2, alternatively. The kernel is always writing to same DDR bank1 as flags size is relatively small.
-  - In the previous lab, without usage of multiple banks the kernel cannot read the next set of words from the DDR until the host has read flags written by the kernel in the previous enqueue. In this lab, you can observe that both of these accesses can be carried out in parallel because these accesses are for different DDR banks.
+  * In the previous lab, without usage of multiple banks, the kernel cannot read the next set of words from the DDR until the host has read flags written by the kernel in the previous enqueue.
In this lab, you can observe that both of these accesses can be carried out in parallel because these accesses are for different DDR banks. - This results in an improved FPGA compute that includes the transfer from the host, device compute and sending flag data back to the host. + This results in an improved FPGA compute that includes the transfer from the host, device compute, and sending flag data back to the host. 3. Review the Profile report and note the following observations: - * *Data Transfer: Host to Global Memory* section indicates: - - Host to Global Memory WRITE Transfer takes about 145.7 ms which is less than 207 ms. - - Host to Global Memory READ Transfer takes about 37.9 ms. + * *Data Transfer: Host to Global Memory* section indicates: + * Host to Global Memory WRITE Transfer takes about 145.7 ms which is less than 207 ms. + * Host to Global Memory READ Transfer takes about 37.9 ms. ![missing image](./images/multiDDR_profile_host.PNG) - * *Kernels & Compute Unit: Compute Unit Utilization* section shows that the CU Utilization has also increased to 89.5% from 71% in previous lab. + * The *Kernels & Compute Unit: Compute Unit Utilization* section shows that the CU Utilization has also increased to 89.5% from 71% in previous lab. ![missing image](./images/multiDDR_profile_CU_util.PNG) @@ -159,7 +155,7 @@ The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port ![missing image](./images/multiDDR_stalls.PNG) -Compared to the previous step using only one DDR, there is no overall application gain. The FPGA compute performance improves, but the bottleneck is processing the "Compute Score", which is limited by CPU Performance. If the CPU can process faster, you can get the better performance. +Compared to the previous step using only one DDR, there is no overall application gain. The FPGA compute performance improves, but the bottleneck is processing the "Compute Score", which is limited by the CPU Performance. If the CPU can process faster, you can get the better performance. Based on the results, the throughput of the application is 1399MB/426 ms = approximately 3.27 GBs. You now have approximately 7.2x (=3058 ms/426 ms) the performance results compared to the software-only version. @@ -169,13 +165,9 @@ Congratulations! You have successfully completed the tutorial. In this tutorial, you learned that optimizing how the host application interacts with the accelerator makes a significant difference. A native initial implementation delivered a 4x performance improvement over the reference software implementation. By leveraging data-parallelism, the overlapping data transfers, compute, and the overlapping CPU processing with FPGA processing, using multiple DDR banks, the application performance was increased by another 1.8x, achieving a total of 7.2x acceleration. - ---------------------------------------
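As a reference for readers who rebuild the `xclbin` instead of using the pregenerated one, the connectivity file is consumed at the `v++` link step. The tutorial's makefile already drives this, so the command below is only a sketch: the `.xo` file name is assumed from the kernel name `runOnfpga` used in this tutorial, and the platform is the one listed in the README.

```bash
# Sketch only -- the provided makefile encodes the real build command.
# The --config file maps input_words to both DDR[1] and DDR[2]; the same mapping
# could also be given directly as: --connectivity.sp runOnfpga_1.input_words:DDR[1:2]
v++ --link --target hw \
    --platform xilinx_u200_gen3x16_xdma_2_202110_1 \
    --config $LAB_WORK_DIR/makefile/connectivity.cfg \
    -o runOnfpga_hw.xclbin \
    runOnfpga.xo
```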

Return to Start of Tutorial

- -

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

-

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md
index d6baf1275c..4fde4fbcd5 100644
--- a/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md
+++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md
@@ -6,18 +6,17 @@
-

# Optimizing Accelerated FPGA Applications: Bloom Filter Example

***Version: Vitis 2023.1***

## Introduction

-The methodology for accelerating applications on an FPGA is comprised of multiple phases:
+The methodology for accelerating applications on a field programmable gate array (FPGA) comprises multiple phases:

-  - **Architecting the application**: Make key decisions about the architecture of the application and decide some important factors, such as what software functions should be mapped to FPGA kernels, how much parallelism is needed, and how it should be delivered.
-  - **Developing the accelerator to meet your desired performance goals**: Implement the kernel by modifying the source code and applying pragmas to create a kernel architecture that can achieve the desired performance goals.
-  - **Optimize the host code**: Review the application's access patterns, data movements, CPU and FPGA idle time, and update the host code to meet your performance goals.
+- **Architecting the Application**: Make key decisions about the architecture of the application and decide some important factors, such as what software functions should be mapped to the FPGA kernels, how much parallelism is needed, and how it should be delivered.
+- **Developing the Accelerator to Meet Your Desired Performance Goals**: Implement the kernel by modifying the source code and applying pragmas to create a kernel architecture that can achieve the desired performance goals.
+- **Optimizing the Host Code**: Review the application's access patterns, data movements, CPU and FPGA idle time, and update the host code to meet your performance goals.

You begin this tutorial with a baseline application, and you profile the application to examine the potential for hardware acceleration. The tutorial application involves searching through an incoming stream of documents to find the documents that closely match a user’s interest based on a search profile.

@@ -29,24 +28,24 @@ In general, a Bloom filter application has use cases in data analytics, such as

The labs in this tutorial use:

-* BASH Linux shell commands.
-* 2023.1 Vitis core development kit release and the *xilinx_u200_gen3x16_xdma_2_202110_1* platform. If necessary, it can be easily ported to other versions and platforms.
+- BASH Linux shell commands.
+- 2023.1 Vitis core development kit release and the *xilinx_u200_gen3x16_xdma_2_202110_1* platform. If necessary, it can be easily ported to other versions and platforms.

-This tutorial guides you to run the designed accelerator on the FPGA; therefore, the expectation is that you have an Xilinx® Alveo™ U200 Data Center accelerator card set up to run this tutorial. Because it can take several (six or seven) hours to generate the multiple `xclbin` files needed to run the accelerator, pregenerated `xclbin` files are provided for the U200 card. To use these pregenerated files, when building the hardware kernel or running the accelerator on hardware, you need to add the `SOLUTION=1` argument.
+This tutorial guides you to run the designed accelerator on the FPGA; therefore, the expectation is that you have an AMD Alveo™ U200 Data Center accelerator card set up to run this tutorial.
Because it can take several (six or seven) hours to generate the multiple `xclbin` files needed to run the accelerator, pregenerated `xclbin` files are provided for the U200 card. To use these pregenerated files, when building the hardware kernel or running the accelerator on hardware, you need to add the `SOLUTION=1` argument. >**IMPORTANT:** > -> * Before running any of the examples, make sure you have installed the Vitis core development kit as described in [Installation](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Installation-Requirements) in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416). ->* If you run applications on Alveo cards, ensure the card and software drivers have been correctly installed by following the instructions on the [Alveo Portfolio page](https://www.xilinx.com/products/boards-and-kits/alveo.html). +> - Before running any of the examples, make sure you have installed the Vitis core development kit as described in [Installation](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Installation-Requirements) in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416). +>- If you run applications on Alveo cards, ensure the card and software drivers have been correctly installed by following the instructions on the [Alveo Portfolio page](https://www.xilinx.com/products/boards-and-kits/alveo.html). ### Accessing the Tutorial Reference Files 1. To access the tutorial content, enter the following in a terminal: `git clone http://github.com/Xilinx/Vitis-Tutorials`. 2. Navigate to the `Hardware_Acceleration/Design_Tutorials/02-bloom` directory. - * `cpu_src` contains all the original source code before modification. - * `images` contains the figures in this tutorial. - * `Makefile` in the `makefile` directory explains the commands used in this lab. Use the `PLATFORM` variable if targeting different platforms. - * `reference_file` contains the modified kernel and host-related files for achieving higher performance. + - `cpu_src` contains all the original source code before modification. + - `images` contains the figures in this tutorial. + - `Makefile` in the `makefile` directory explains the commands used in this lab. Use the `PLATFORM` variable if targeting different platforms. + - `reference_file` contains the modified kernel and host-related files for achieving higher performance. 3. Copy and extract large files in as follows: ``` @@ -54,16 +53,16 @@ This tutorial guides you to run the designed accelerator on the FPGA; therefore, tar -xvzf xclbin_save.tar.gz ``` - **TIP:** The `xclbin_save` contains the saved `xclbin` files that can be used directly for running on hardware by setting `SOLUTION=1` for the `make run` commands. - + >**TIP:** The `xclbin_save` contains the saved `xclbin` files that can be used directly for running on hardware by setting `SOLUTION=1` for the `make run` commands. 
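For example, a run that uses the pregenerated binaries might look like the following. The `PF`, `SOLUTION`, and `PLATFORM` variables are the ones described in this tutorial, and `LAB_WORK_DIR` is assumed to point at the 02-bloom tutorial directory as in the labs; the `run` target and `TARGET=hw` are assumptions, so check the provided `Makefile` for the exact targets and variables it defines.

```bash
# Hypothetical invocation -- verify the targets and variables against the
# Makefile in the makefile directory before running.
cd $LAB_WORK_DIR/makefile
make run TARGET=hw PF=8 SOLUTION=1 PLATFORM=xilinx_u200_gen3x16_xdma_2_202110_1
```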
+

### Tutorial Outline

-* [Overview of the Original Application](1_overview.md)
-* [Experience Acceleration Performance](2_experience-acceleration.md)
-* [Architecting the Application](3_architect-the-application.md)
-* [Implementing the Kernel](4_implement-kernel.md)
-* [Analyze Data Movement Between Host and Kernel](5_data-movement.md)
-* [Using Multiple DDR Banks](6_using-multiple-ddr)
+- [Overview of the Original Application](1_overview.md)
+- [Experience Acceleration Performance](2_experience-acceleration.md)
+- [Architecting the Application](3_architect-the-application.md)
+- [Implementing the Kernel](4_implement-kernel.md)
+- [Analyze Data Movement Between Host and Kernel](5_data-movement.md)
+- [Using Multiple DDR Banks](6_using-multiple-ddr.md)