diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md index b253a7e6b3..3ad0e9c953 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/README.md @@ -6,27 +6,26 @@ - # Accelerating Video Convolution Filtering Application ***Version: Vitis 2023.1*** -This tutorial introduces you to a compute-intensive application that is accelerated using the Xilinx Alveo Data Center accelerator card. It goes through the design of a specific kernel that runs on the FPGA and briefly discusses optimization of the host-side application for performance. The kernel is designed to maximize throughput, and the host application is optimized to transfer data in an effective manner that moves in-between the host and FPGA card. The host application essentially eliminates the data movement latency by overlapping data transfers for multiple kernel calls. Another essential purpose of this tutorial is to show **_how one can easily estimate the performance of hardware kernels that can be built using Vitis HLS and how accurate and close these estimates are to actual hardware performance_** +This tutorial introduces you to a compute-intensive application that is accelerated using the AMD Alveo™ Data Center accelerator card. It goes through the design of a specific kernel that runs on the field programmable gate array (FPGA) and briefly discusses optimization of the host-side application for performance. The kernel is designed to maximize throughput, and the host application is optimized to transfer data in an effective manner that moves in-between the host and FPGA card. The host application essentially eliminates the data movement latency by overlapping data transfers for multiple kernel calls. Another essential purpose of this tutorial is to show **how one can easily estimate the performance of hardware kernels that can be built using Vitis HLS and how accurate and close these estimates are to actual hardware performance**. ## Introduction to Acceleration -The first lab is designed to let you quickly experience the acceleration performance that can be achieved by porting the video filter to Xilinx's Alveo accelerator card. The Alveo series cards are designed for accelerating data center applications. However, this tutorial can be adapted to other accelerator cards with some simple changes. +The first lab is designed to let you quickly experience the acceleration performance that can be achieved by porting the video filter to AMD's Alveo accelerator card. The Alveo series cards are designed for accelerating data center applications. However, this tutorial can be adapted to other accelerator cards with some simple changes. The steps to be carried out for this first lab include: - Setting up the Vitis application acceleration development flow - Running the hardware optimized accelerator and comparing its performance with a baseline of the application -This lab demonstrates the significant performance gain that can be achieved as compared to CPU performance. Whereas the next labs in this tutorial will illustrate and guide how such performance can be achieved using different optimizations and design techniques for 2D convolution kernels and the host side application. +This lab demonstrates the significant performance gain that can be achieved as compared to the processor performance. 
The next labs in this tutorial will illustrate and guide how such performance can be achieved using different optimizations and design techniques for 2D convolution kernels and the host side application. ### Cloning the GitHub Repository and Setting Up the Vitis Tool -To run this tutorial you will need to clone a git repo and also download and extract some compressed files, please follow the instruction given below: +To run this tutorial, you will need to clone a git repo and also download and extract some compressed files. Use the following instructions: #### Clone Git Repo @@ -38,7 +37,7 @@ git clone https://github.com/Xilinx/Vitis-Tutorials.git #### Copy and Extract Large Files -Copy and extract large files in convolution tutorial directory as follows: +Copy and extract large files in the convolution tutorial directory as follows: ```bash cd /VITIS_TUTORIAL_REPO_PATH/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial @@ -46,11 +45,11 @@ wget https://www.xilinx.com/bin/public/openDownload?filename=conv_tutorial_files tar -xvzf conv_tutorial_files.tar.gz ``` -**NOTE** : VITIS_TUTORIAL_REPO_PATH is the local directory path where git repo is cloned. +>**NOTE:** VITIS_TUTORIAL_REPO_PATH is the local directory path where the git repo is cloned. #### Setting Up the Vitis Tool -Setup the application build and runtime environment using the following commands as per your local installation: +Set up the application build and runtime environment using the following commands as per your local installation: ```bash source /settings64.sh @@ -59,15 +58,15 @@ source /setup.sh ### Baseline the Application Performance -The software application processes High Definition(HD) video frames/images with 1920x1080 resolution. It performs convolution on a set of images and prints the summary of performance results. It is used for measuring baseline software performance. Please the set the environment variable that points to tutorial direction relative to repo path as follow: +The software application processes high definition (HD) video frames/images with 1920x1080 resolution. It performs convolution on a set of images and prints the summary of performance results. It is used for measuring baseline software performance. Set the environment variable that points to tutorial direction relative to the repo path as follows: ```bash export CONV_TUTORIAL_DIR=/VITIS_TUTORIAL_REPO_PATH/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial ``` -where **VITIS_TUTORIAL_REPO_PATH** is the local path where git repo is placed by the user after cloning. +where **VITIS_TUTORIAL_REPO_PATH** is the local path where the git repo is placed by the user after cloning. - **NOTE**: Make sure during all of the labs in this tutorial you have set `CONV_TUTORIAL_DIR` variable appropriately + >**NOTE:** Make sure during all of the labs in this tutorial you have set the `CONV_TUTORIAL_DIR` variable appropriately. Run the application to measure performance as follows: @@ -76,7 +75,7 @@ cd $CONV_TUTORIAL_DIR/sw_run ./run.sh ``` -Results similar to the ones shown below will be printed. Note down the CPU throughput. +Results similar to the following will be printed. Note down the CPU throughput. ```bash ---------------------------------------------------------------------------- @@ -97,14 +96,14 @@ CPU Throughput : 12.7112 MB/s ### Launching the Host Application -Now launch the application, which uses FPGA accelerated video convolution filter. The application will be run on an actual FPGA card, also called System Run. 
+Now launch the application, which uses a FPGA accelerated video convolution filter. The application will be run on an actual FPGA card, also called System Run. ```bash cd $CONV_TUTORIAL_DIR make run ``` -The result summary will be similar to the one given below: +The result summary will be similar to the following: ```bash ---------------------------------------------------------------------------- @@ -136,7 +135,7 @@ FPGA Speedup : 68.1764 x ### Results - From the host application console output, it is clear that the FPGA accelerated kernel can outperform CPU-only implementation by a factor of 68x. It is a large gain in terms of performance over CPU. The following labs will illustrate how this performance allows processing more than 3 HD video channels with 1080p resolution in parallel. The tutorial describes how to achieve such performance gains by building a kernel and host application written in C++. The host application uses OpenCL APIs and Xilinx Runtime (XRT) underneath it, demonstrating how to unleash this custom-built hardware kernel's computing power effectively. + From the host application console output, it is clear that the FPGA accelerated kernel can outperform CPU-only implementation by a factor of 68x. It is a large gain in terms of performance over CPU. The following labs will illustrate how this performance allows processing more than three HD video channels with 1080p resolution in parallel. The tutorial describes how to achieve such performance gains by building a kernel and host application written in C++. The host application uses OpenCL™ APIs and Xilinx Runtime (XRT) underneath it, demonstrating how to unleash this custom-built hardware kernel's computing power effectively. --------------------------------------- @@ -144,7 +143,6 @@ FPGA Speedup : 68.1764 x Next Lab Module: Video Convolution Filter : Introduction and Performance Estimation

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md index 15c7fefde4..e47be42de6 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab1_app_introduction_performance_estimation.md @@ -8,29 +8,29 @@ # Video Convolution Filter: Introduction and Performance Estimation -This lab will explore a 2D video convolution filter and measure its performance on the host machine. These measurements will establish a performance baseline. The amount of acceleration that should be provided by hardware implementation is calculated based on the required performance constraints. In the next lab, we will estimate the performance of the FPGA accelerator. In a nutshell, during this lab, you will: +This lab will explore a 2D video convolution filter, and measure its performance on the host machine. These measurements will establish a performance baseline. The amount of acceleration that should be provided by hardware implementation is calculated based on the required performance constraints. In the next lab, you will estimate the performance of the FPGA accelerator. In a nutshell, during this lab, you will: - Learn about video convolution filters -- Measure the performance of software implemented convolution filter -- Calculate required acceleration vs. software implementation for given performance constraints -- Estimate the performance of hardware accelerator before implementation +- Measure the performance of a software implemented convolution filter +- Calculate the required acceleration versus the software implementation for given performance constraints +- Estimate the performance of the hardware accelerator before implementation ## Video Filtering Applications and 2-D Convolution Filters Video applications use different types of filters extensively for multiple reasons: filter noise, manipulate motion blur, enhance color and contrast, edge detection, creative effects, etc. At its core, a convolution video filter carries out some form of data average around a pixel, which redefines the amount and type of correlation any pixel has to its surrounding area. Such filtering is carried out for all the pixels in a video frame. -A matrix of coefficients defines a convolution filter. The convolution operation is essentially a sum of products performed on a pixel set (a frame/image sub-matrix centered around a given pixel) and a coefficients matrix. The figure below illustrates how convolution is calculated for a pixel; it is highlighted in yellow. Here the filter has a coefficient matrix that is 3x3 in size. The figure also displays how the whole output image is generated during the filtering process. The index of the output pixel being generated is the index of the input pixel highlighted in yellow that is being filtered. In algorithmic terms, the process of filtering consists of: +A matrix of coefficients defines a convolution filter. The convolution operation is essentially a sum of products performed on a pixel set (a frame/image sub-matrix centered around a given pixel) and a coefficients matrix. The following figure illustrates how convolution is calculated for a pixel; it is highlighted in yellow. Here the filter has a coefficient matrix that is 3x3 in size. 
The figure also displays how the whole output image is generated during the filtering process. The index of the output pixel being generated is the index of the input pixel highlighted in yellow that is being filtered. In algorithmic terms, the process of filtering consists of: -- Selecting an input pixel as highlighted in yellow in the figure below +- Selecting an input pixel as highlighted in yellow in the following figure - Extracting a sub-matrix whose size is the same as filter coefficients -- Calculating element-wise sum-of-product of extracted sub-matrix and coefficients matrix -- Placing the sum-of-product as output pixel in output image/frame on the same index as the input pixel +- Calculating an element-wise sum-of-product of the extracted sub-matrix and coefficients matrix +- Placing the sum-of-product as an output pixel in an output image/frame on the same index as the input pixel ![Convolution Filter](./images/convolution.jpg) ### Performance Requirements for 1080p HD Video -You can easily calculate the application performance requirements given the standard performance specs for 1080p High Definition (HD) video. Then these top-level requirements can be translated into constraints for hardware implementation or software throughput requirements. For 1080p HD video at 60 frames per seconds(FPS) the specs are listed below as well as required throughput in terms of pixels per second is calculated: +You can easily calculate the application performance requirements given the standard performance specs for 1080p HD video. Then these top-level requirements can be translated into constraints for hardware implementation or software throughput requirements. For 1080p HD video at 60 frames per seconds(FPS), the following specs are listed as well as required throughput in terms of pixels per second is calculated: ```bash Video Resolution = 1920 x 1080 @@ -42,17 +42,17 @@ Color Channels(YUV) = 3 Throughput(Pixel/s) = Frame Width * Frame Height * Channels * FPS Throughput(Pixel/s) = 1920*1080*3*60 Throughput (MB/s) = 373 MB/s -``` +``` -The required throughput to meet 60 FPS performance turns out to be 373 MB/s ( since each pixel is 8-bits). +The required throughput to meet 60 FPS performance turns out to be 373 MB/s (since each pixel is 8-bits). ## Software Implementation This section will discuss the baseline software implementation and performance measurements, which will be used to gauge the acceleration requirements given the performance constraints. -The convolution filter is implemented in software using a typical multi-level nested loop structure. Outer two loops define the pixel to be processed(iterating over each pixel). The inner two loops perform the sum-of-product (SOP) operation, actual convolution filtering between the coefficient matrix and the selected sub-matrix from the image centered around the processed pixel. +The convolution filter is implemented in the software using a typical multi-level nested loop structure. The outer two loops define the pixel to be processed (iterating over each pixel). The inner two loops perform the sum-of-product (SOP) operation, actual convolution filtering between the coefficient matrix and the selected sub-matrix from the image centered around the processed pixel. -**TIP:** Boundary conditions where it is not possible to center sub-matrix around a given pixel require special processing. This algorithm assumes all pixels beyond the boundary of the image have zero values. 
+>**TIP:** Boundary conditions where it is not possible to center sub-matrix around a given pixel require special processing. This algorithm assumes all pixels beyond the boundary of the image have zero values. ```cpp void Filter2D( @@ -98,7 +98,7 @@ void Filter2D( } ``` -The following snapshot shows how the top-level function calls the convolution filter function for an image with three components or channels. Here OpenMP pragma is used to parallelize software execution using multiple threads. You can open **_src/host_randomized.cpp_** and **_src/filter2d_sw.cpp_** from tutorial directory to examine all implementation details. +The following snapshot shows how the top-level function calls the convolution filter function for an image with three components or channels. Here the OpenMP pragma is used to parallelize software execution using multiple threads. You can open `src/host_randomized.cpp` and `src/filter2d_sw.cpp` from the tutorial directory to examine all the implementation details. ```cpp #pragma omp parallel for num_threads(3) @@ -113,14 +113,14 @@ The following snapshot shows how the top-level function calls the convolution fi ### Running the Software Application -To run the software application, go to the directory called "sw_run" and launch the application as follows: +To run the software application, go to the directory called `sw_run`, and launch the application as follows: ```bash cd $CONV_TUTORIAL_DIR/sw_run ./run.sh ``` -Once the application is launched, it should produce an output similar to the one shown below. The software application will process a randomly generated set of images and report performance. Here you have used randomly generated images to avoid any extra library dependencies such as OpenCV. But in the next labs, while working with the hardware implementation, the option to use the OpenCV library for loading images or use randomly generated images will be provided given the user has OpenCV 2.4 installed on the machine. If another version of OpenCV is needed, the user can modify the host application to use different APIs to load and store images from the disk. +When the application is launched, it should produce an output similar to the following. The software application will process a randomly generated set of images and report performance. Here you have used randomly generated images to avoid any extra library dependencies, such as OpenCV. But in the next labs, while working with the hardware implementation, the option to use the OpenCV library for loading images or use randomly generated images will be provided given you have OpenCV 2.4 installed on the machine. If another version of OpenCV is needed, you can modify the host application to use different APIs to load and store images from the disk. ```bash ---------------------------------------------------------------------------- @@ -137,7 +137,7 @@ CPU Throughput : 14.5617 MB/s ---------------------------------------------------------------------------- ``` -The application run measures performance using high precision timers and reports it as throughput. The machine used for experiments produced a throughput of 14.51 MB/s. The machine details are listed below: +The application run measures performance using high precision timers and reports it as throughput. The machine used for experiments produced a throughput of 14.51 MB/s. 
The following machine details are listed: ```bash CPU Model : Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz @@ -157,14 +157,14 @@ So to meet the required performance of processing 60 frames per second, the soft To understand what kind of hardware implementation is needed given the performance constraints, you can examine the convolution kernel in some detail: -- The core compute is done in a 4-level nested loop, but you can break it to the compute per output pixel produced. +- The core compute is done in a four-level nested loop, but you can break it to the compute per output pixel produced. - In terms of the output-pixels produced, it is clear from the filter source code that a single output pixel is produced when the inner two loops finish execution once. - These two loops are essentially doing the sum-of-product on a coefficient matrix and image sub-matrix. The matrix sizes are defined by the coefficient matrix, which is 15x15. - The inner two loops are performing a dot product of size 225(15x15). In other words, the two inner loops perform 225 multiply-accumulate (MAC) operations for every output pixel produced. ### Baseline Hardware Implementation Performance -The simplest and most straightforward hardware implementation can be achieved by passing this current kernel source code through the Vitis HLS tool. It will pipeline the innermost loop with II=1, performing only one multiply-accumulate(MAC) per cycle. The performance can be estimated based on the MACs as follows: +The simplest and most straightforward hardware implementation can be achieved by passing this current kernel source code through the Vitis HLS tool. It will pipeline the innermost loop with II=1, performing only one MAC per cycle. The performance can be estimated based on the MACs as follows: ```bash MACs per Cycle = 1 @@ -172,15 +172,14 @@ The simplest and most straightforward hardware implementation can be achieved by Throughput = 300/225 = 1.33 (MPixels/s) = 1.33 MB/s ``` -Here the hardware clock frequency is assumed to be 300MHz because, in general, for the U200 Xilinx Alveo Data Center card, this is the maximum supported clock frequency when using Vitis HLS based design flow. The performance turns out to be 1.33 MB/s with baseline hardware implementation. From the convolution filter source code, it can also be estimated how much memory bandwidth is needed at the input and output for achieved throughput. From the convolution filter source code also shown above, it is clear that the inner two loops, while calculating a single output pixel, performs 225(15*15) reads at the input so: +Here the hardware clock frequency is assumed to be 300 MHz because, in general, for the U200 AMD Alveo™ Data Center card, this is the maximum supported clock frequency when using the Vitis HLS-based design flow. The performance turns out to be 1.33 MB/s with baseline hardware implementation. From the convolution filter source code, it can also be estimated how much memory bandwidth is needed at the input and output for achieved throughput. From the convolution filter source code also shown above, it is clear that the inner two loops, while calculating a single output pixel, performs 225(15*15) reads at the input so: ```bash Output Memory Bandwidth = Throughput = 1.33 MB/s Input Memory Bandwidth = Throughput * 225 = 300 MB/s -``` +``` -For the baseline implementation, the memory bandwidth requirements are very trivial, assuming that PCIe and device DDR memory bandwidths on Xilinx Acceleration Cards/Boards are of the order of 10s of GB/s. 
-As you have seen in previous sections, the throughput required for 60FPS 1080p HD video is 373 MB/s. So it clear that to meet the performance requirement: +For the baseline implementation, the memory bandwidth requirements are very trivial, assuming that PCIe and device DDR memory bandwidths on AMD Acceleration Cards/Boards are of the order of 10s of GB/s. As you have seen in previous sections, the throughput required for 60 FPS 1080p HD video is 373 MB/s. So it is clear that to meet the performance requirement: ```bash Acceleration Factor to Meet 60FPS Performance = 373/1.33 = 280x @@ -189,7 +188,7 @@ Acceleration Factor to Meet SW Performance = 14.5/1.33 = 10.9x ### Performance Estimation for Optimized Hardware Implementations -From the above calculations, it is clear that you need to improve the performance of a baseline hardware implementation by 280x to process 60 FPS. One of the paths you can take is to start unrolling the inner loops and pipeline. For example, by unrolling the innermost loop, which iterates 15 times, you can improve the performance by 15x. With that one change, the hardware performance will already be better than software-only implementation, but not yet good enough to meet the required video performance. Another approach you can follow is to unroll the inner two loops and gain in performance by 15*15=225, which means a throughput of 1-output pixel per cycle. The performance and memory bandwidth requirements will be as follows: +From the above calculations, it is clear that you need to improve the performance of a baseline hardware implementation by 280x to process 60 FPS. One of the paths you can take is to start unrolling the inner loops and pipeline. For example, by unrolling the innermost loop, which iterates 15 times, you can improve the performance by 15x. With that one change, the hardware performance will already be better than software-only implementation but not yet good enough to meet the required video performance. Another approach you can follow is to unroll the inner two loops and gain in performance by 15*15=225, which means a throughput of one-output pixel per cycle. The performance and memory bandwidth requirements will be as follows: ```bash Throughput = Fmax * Pixels produced per cycle = 300 * 1 = 300 MB/s @@ -197,9 +196,9 @@ Output Memory Bandwidth = Fmax * Pixels produced per cycle = 300 MB/s Input Memory Bandwidth = Fmax * Input pixels read per output pixel = 300 * 225 = 67.5 GB/s ``` -The required output memory bandwidth scales linearly with throughput, but input memory bandwidth has gone up enormously and might not be sustainable. A closer look at the convolution filter will reveal that it is not required to read all 225(15x15) pixels from the input memory for processing. An innovative caching scheme can be built to avoid such extensive use of input memory bandwidth. +The required output memory bandwidth scales linearly with throughput, but input memory bandwidth has gone up enormously and might not be sustainable. A closer look at the convolution filter will reveal that it is not required to read all 225(15x15) pixels from the input memory for processing. An innovative caching scheme can be built to avoid such extensive use of input memory bandwidth. -The convolution filter belongs to a class of kernels known as stencil kernels, which can be optimized to increase input data reuse extensively. Which can result in substantially reduced memory bandwidth requirements. 
With a caching scheme, you can bring the input bandwidth required to be the same as output, which is around 300 MB/s. With the optimized data reuse scheme, when both inner loops are unrolled, it will require that only 1-input pixel is read for producing one output pixel on average and hence input memory bandwidth of 300 MB/s. +The convolution filter belongs to a class of kernels known as stencil kernels, which can be optimized to increase input data reuse extensively, which can result in substantially reduced memory bandwidth requirements. With a caching scheme, you can bring the input bandwidth required to be the same as the output, which is around 300 MB/s. With the optimized data reuse scheme, when both inner loops are unrolled, it will require that only one-input pixel is read for producing one output pixel on average and hence input memory bandwidth of 300 MB/s. Although you can reduce the input bandwidth, the achieved performance will still only be 300 MB/s, which is less than the required 373 MB/s. To deal with this, you can look for other ways to increase the throughput of hardware. One approach is to duplicate kernel instances, also called compute units. In terms of heterogeneous computing, you can increase the number of compute units so that you can process data in parallel. In the convolution filter case, you can process all color channels (YUV) on separate compute units. When using three compute units, one for each color channel, the expected performance summary will be as follows: @@ -225,8 +224,6 @@ Next Lab Module: Design and Analys
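
As a quick cross-check of the estimates used in this lab, the short standalone C++ snippet below recomputes the required 1080p throughput, the baseline hardware throughput, and the resulting acceleration factors. It is not part of the tutorial sources; every constant in it is simply one of the numbers quoted above.

```cpp
#include <cstdio>

int main() {
    // 1080p HD video parameters quoted in this lab
    const double width = 1920, height = 1080, channels = 3, fps = 60;
    const double required_mbps = width * height * channels * fps / 1e6; // ~373 MB/s (1 byte per pixel)

    // Baseline hardware estimate: one MAC per cycle at 300 MHz, 225 MACs per output pixel
    const double fmax_mhz = 300, macs_per_pixel = 15 * 15;
    const double baseline_mbps = fmax_mhz / macs_per_pixel;             // ~1.33 MB/s

    const double sw_mbps = 14.5;                                        // measured software throughput

    std::printf("Required throughput      : %.0f MB/s\n", required_mbps);
    std::printf("Baseline HW throughput   : %.2f MB/s\n", baseline_mbps);
    std::printf("Acceleration vs baseline : %.0fx\n", required_mbps / baseline_mbps);
    std::printf("Acceleration vs software : %.1fx\n", sw_mbps / baseline_mbps);
    return 0;
}
```

Compiling and running this snippet reproduces the roughly 373 MB/s requirement, the 1.33 MB/s baseline estimate, and the 280x and 10.9x acceleration factors quoted above.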

- -

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md index 6e86564537..17195c4759 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab2_conv_filter_kernel_design.md @@ -8,7 +8,7 @@ # Design and Analysis of Hardware Kernel Module for 2-D Video Convolution Filter -This lab is designed to demonstrate the design of a convolution filter module, do performance analysis, and analyze hardware resource utilization. A bottom-up approach is followed by first developing the hardware kernel and analyzing its performance before integrating it with the host application. You will use Vitis HLS to build and estimate the performance of the kernel. +This lab is designed to demonstrate the design of a convolution filter module, do performance analysis, and analyze the hardware resource utilization. A bottom-up approach is followed by first developing the hardware kernel and analyzing its performance before integrating it with the host application. You will use Vitis HLS to build and estimate the performance of the kernel. ## 2-D Convolution Filter Implementation @@ -16,7 +16,7 @@ This section discusses the design of a convolution filter in detail. It goes thr ### Top Level Structure of Kernel -The top-level of the convolution filter is modeled using a dataflow process. The dataflow consists of four different functions as given below. For full implementation details please refer to source file `src/filter2d_hw.cpp` in convolutional tutorial directory. +The top-level of the convolution filter is modeled using a dataflow process. The dataflow consists of four different functions as follows. For full implementation details, refer to the source file `src/filter2d_hw.cpp` in the convolutional tutorial directory. ```cpp void Filter2DKernel( @@ -53,20 +53,20 @@ void Filter2DKernel( The dataflow chain consists of four different functions as follows: -- **ReadFromMem**: reads pixel data or video input from main memory -- **Window2D**: local cache with wide(15x15 pixels) access on the output side -- **Filter2D**: core kennel filtering algorithm -- **WriteToMem**: writes output data to main memory +- **ReadFromMem**: Reads pixel data or video input from main memory +- **Window2D**: Local cache with wide (15x15 pixels) access on the output side +- **Filter2D**: Core kennel filtering algorithm +- **WriteToMem**: Writes output data to the main memory -Two functions at the input and output read and write data from the device's global memory. The `ReadFromMem` function reads data and streams it for filtering. The `WriteToMem` function at the end of the chain writes processed pixel data to the device memory. The input data(pixels) read from the main memory is passed to the `Window2D` function, which creates a local cache and, on every cycle, provides a 15x15 pixel sample to the filter function/block. The `Filter2D` function can consume the 15x15 pixel sample in a single cycle to perform 225(15x15) MACs per cycle. +Two functions at the input and output read and write data from the device's global memory. The `ReadFromMem` function reads data and streams it for filtering. The `WriteToMem` function at the end of the chain writes processed pixel data to the device memory. 
The input data (pixels) read from the main memory is passed to the `Window2D` function, which creates a local cache and, on every cycle, provides a 15x15 pixel sample to the filter function/block. The `Filter2D` function can consume the 15x15 pixel sample in a single cycle to perform 225(15x15) MACs per cycle. -Please open the **"src/filter2d_hw.cpp"** source file from convolutioanl tutorial directory and examine the implementation details of these individual functions. In the next section, you will elaborate on the implementation details of Window2D and Filter2D functions. The following figure shows how data flows between different functions (dataflow modules). +Open the `src/filter2d_hw.cpp` source file from the convolutioanl tutorial directory, and examine the implementation details of these individual functions. In the next section, you will elaborate on the implementation details of Window2D and Filter2D functions. The following figure shows how data flows between different functions (dataflow modules). ![Datflow](images/filterBlkDia.jpg) ### Data Mover -One of the key advantages of custom design hardware accelerators, for which FPGAs are well suited, is the choice and architecture of custom data movers. These customized data movers facilitate efficient access to global device memory and optimize bandwidth utilization by reusing data. Specialized data movers at the interface with main memory can be built at the input and output of the data processing engine or processing elements. The convolution filter is an excellent example of this. Looking from a pure software implementation point of view, it seems that to produce a single sample at the output side requires 450 memory accesses at the input side and 1 write access to the output. +One of the key advantages of custom design hardware accelerators, for which FPGAs are well suited, is the choice and architecture of custom data movers. These customized data movers facilitate efficient access to global device memory and optimize bandwidth utilization by reusing data. Specialized data movers at the interface with the main memory can be built at the input and output of the data processing engine or processing elements. The convolution filter is an excellent example of this. Looking from a pure software implementation point of view, it seems that to produce a single sample at the output side requires 450 memory accesses at the input side and one write access to the output. ```bash Memory Accesses to Read filter Co-efficients = 15x15 = 225 @@ -79,39 +79,39 @@ For a pure software implementation, even though many of these accesses can becom #### Window2D: Line and Window Buffers -The `Window2D` block is essentially built from two basic blocks: the first is called a "line buffer", and the second is called a "Window". +The `Window2D` block is essentially built from two basic blocks: the first is called a "line buffer", and the second is called a "Window". -- The line buffer is used to buffer multiple lines of a full image, and specifically, here it is designed to buffer **_FILTER_V_SIZE - 1_** image lines. Where FILTER_V_SIZE is the height of the convolution filter. The total number of pixels held by the line buffer is **(FILTER_V_SIZE-1) * MAX_IMAGE_WIDTH**. -- The "Window" block holds **FILTER_V_SIZE * FILTER_H_SIZE** pixels. The 2-D convolution filtering operation consists of centering the filtering mask (filter coefficients) on the index of output pixel and calculating the sum-of-product(SOP) as described in the previous lab. 
The following figure shows how these centering and SOP operations are carried. +- The line buffer is used to buffer multiple lines of a full image, and specifically, here it is designed to buffer **_FILTER_V_SIZE - 1_** image lines, where FILTER_V_SIZE is the height of the convolution filter. The total number of pixels held by the line buffer is **(FILTER_V_SIZE-1) * MAX_IMAGE_WIDTH**. +- The "Window" block holds **FILTER_V_SIZE * FILTER_H_SIZE** pixels. The 2-D convolution filtering operation consists of centering the filtering mask (filter coefficients) on the index of output pixel and calculating the SOP as described in the previous lab. The following figure shows how these centering and SOP operations are carried. ![Convolution Filter](images/convolution.jpg) -The figure above shows SOP carried out for a full image being processed. If you look carefully when output pixels are produced line by line, it is not required to have all the image pixels in memory. Only the lines where the filtering mask overlaps are required which is essentially FILTER_V_SIZE lines, which can even be reduced to FILTER_V_SIZE-1. Essentially that is the amount of data that needs to be on-chip or housed by a data mover at any given time. +The figure above shows SOP carried out for a full image being processed. If you look carefully when output pixels are produced line by line, it is not required to have all the image pixels in memory. Only the lines where the filtering mask overlaps are required which is essentially FILTER_V_SIZE lines, which can even be reduced to FILTER_V_SIZE-1. Essentially, that is the amount of data that needs to be on-chip or housed by a data mover at any given time. ![Matrix Movement](images/Window2D.jpg) -The figure above illustrates the operation and requirements for a line and Window buffer. The image size is assumed 8x8, and the filter size is 3x3. For this example, you are generating the filtered output of pixel number 10. In this case, you need a 3x3 block of input pixels centered around pixel 10, as shown by step `A`. +The figure above illustrates the operation and requirements for a line and Window buffer. The image size is assumed 8x8, and the filter size is 3x3. For this example, you are generating the filtered output of pixel number 10. In this case, you need a 3x3 block of input pixels centered around pixel 10, as shown in step `A`. -Step `B` in the figure highlights what is required for producing pixel number 11. Another 3x3 block, but it has a significant overlap with the previous input block. Essentially a column moves out from the left, and a column moves in from the right. One important thing to notice in steps A, B, and C, is that from the input side, it only needs one new pixel to produce one output pixel(ignoring the initial latency of filling the line buffer with multiple pixels, which is one time only). +Step `B` in the figure highlights what is required for producing pixel number 11. Another 3x3 block, but it has a significant overlap with the previous input block. Essentially a column moves out from the left, and a column moves in from the right. One important thing to notice in steps A, B, and C, is that from the input side; it only needs one new pixel to produce one output pixel (ignoring the initial latency of filling the line buffer with multiple pixels, which is one-time only). The line buffer holds FILTER_V_SIZE-1 lines. 
In general, it requires FILTER_V_SIZE lines, but a line is reduced by using the line buffer in a circular fashion and by exploiting the fact that pixels at the start of the first line buffer can be used to write new incoming pixels since they are no longer needed. The window buffer is implemented as FILTER_V_SIZE * FILTER_H_SIZE storage fully partitioned, giving parallel access to all elements inside the window. The data moves as a column vector of size FILTER_V_SIZE from line buffer to window buffer, and then this whole widow is passed through a stream to the `Filter2D` function for processing. -The overall scheme (data mover) is built to maximize the data reuse providing maximum parallel data to the processing element. For a deeper understanding of the modeling style and minute details of the data mover examine the `Window2D` function details in the source code. The function can be found in the **"src/filter2d_hw.cpp"** source file in convolutioanl tutorial directory . +The overall scheme (data mover) is built to maximize the data reuse providing maximum parallel data to the processing element. For a deeper understanding of the modeling style and minute details of the data mover, examine the `Window2D` function details in the source code. The function can be found in the `src/filter2d_hw.cpp` source file in the convolutioanl tutorial directory. ## Building and Simulating the Kernel using Vitis HLS -In this section, you will build and simulate the 2D convolution filter using Vitis HLS. You will also look at the performance estimates and measured results after co-simulation for comparison with target performance settings. +In this section, you will build and simulate the 2D convolution filter using Vitis HLS. You will also look at the performance estimates and measured results after co-simulation for comparison with the target performance settings. ### Building the Kernel Module -Now you will build the kernel module as a standalone kernel with AXI interfaces to memory, which are also used for simulation. To do this, please follow the steps listed below: +Now you will build the kernel module as a standalone kernel with AXI interfaces to memory, which are also used for simulation. To do this, use the following steps: ```bash cd $CONV_TUTORIAL_DIR/hls_build vitis_hls -f build.tcl ``` -**TIP:** This step can take some time to complete. +>**TIP:** This step can take some time to complete. An output similar to the following will be printed: @@ -132,17 +132,17 @@ An output similar to the following will be printed: This shows that an image with Width=1000 and height=30 is simulated. There are default parameters for image dimensions, and these are kept as small values to make co-simulation run more quickly. The synthesis is done for a maximum image size of 1920x1080. 
-Once the build and simulation are finished, launch Vitis HLS GUI to analyze the performance estimate reports and implementation QoR as follows: +When the build and simulation are finished, launch the Vitis HLS GUI to analyze the performance estimate reports and implementation QoR as follows: ```bash vitis_hls -p conv_filter_prj ``` -After the GUI opens, the first thing to notice is the Synthesis Summary report, which shows the Performance and Resource Estimates as shown below: +After the GUI opens, the first thing to notice is the Synthesis Summary report, which shows the Performance and Resource Estimates as the following: ![Resource Report](images/vitisHlsResourceReport2.jpg) -It shows the use of 139 DSP essentially for the SOP operations by the top-level module, and the use of 14 BRAMs by the `Window2D` data mover block. +It shows the use of 139 DSP essentially for the SOP operations by the top-level module, and the use of 14 block RAMs by the `Window2D` data mover block. One important thing to notice is the static performance estimate for the kernel, 7.3 ms, which is very close to the estimated target latency of 6.9 ms for the kernel as calculated in the previous lab. @@ -152,11 +152,11 @@ You can also get an accurate measurement of latency for kernel from the Co-simul Since you are simulating a 1000x30 image, the expected latency should be 30,000 + fixed latency of some blocks (assuming one clock cycle per output pixel). The number shown in the report is 38,520. Here 8,520 is the fixed latency, and when the actual image size is 1920x1080, the fixed latency will get amortized across more image lines. The fact that a large image will amortize the latency can be verified by simulating with a larger image. -Another thing that verifies that the kernel can achieve one output sample per cycle throughput is the loop initiation intervals (II). The synthesis report expanded view shows that all loops have II=1, as shown below: +Another thing that verifies that the kernel can achieve one output sample per cycle throughput is the loop initiation intervals (II). The synthesis report expanded view shows that all loops have II=1, as follows: ![II Report](images/vitisHLSIIReport.jpg) -Once you have verified that the throughput requirements are met, and the resource consumption is acceptable, you can move forward and start integrating the full application. Which consists of creating and compiling the host application to drive the kernel, and building the kernel using one of the Xilinx platforms for Alveo Data Center accelerator cards. For this lab, the Alveo U200 card is used. +When you have verified that the throughput requirements are met, and the resource consumption is acceptable, you can move forward and start integrating the full application, which consists of creating and compiling the host application to drive the kernel, and building the kernel using one of the AMD platforms for Alveo Data Center accelerator cards. For this lab, the Alveo U200 card is used. 
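
For readers who want to see the caching idea in plain C++ before opening the HLS sources, the following is a deliberately simplified, software-only sketch of the line-buffer scheme described earlier in this lab. It is not the tutorial's `src/filter2d_hw.cpp` code: the function name, coefficient type, and saturation step are illustrative assumptions, and the real `Window2D`/`Filter2D` pair additionally uses a fully partitioned 15x15 window buffer, streaming interfaces, and HLS pragmas.

```cpp
#include <cstdint>
#include <cstring>

constexpr int FILTER_V_SIZE   = 15;
constexpr int FILTER_H_SIZE   = 15;
constexpr int MAX_IMAGE_WIDTH = 1920;

// Illustrative line-buffered 2-D filter: each input pixel is read from 'src' exactly
// once, and all 15x15 samples needed per output pixel come from the on-chip style
// line buffer. Pixels beyond the image boundary are treated as zero, as in the lab.
void Filter2DLineBuffered(const short coeffs[FILTER_V_SIZE][FILTER_H_SIZE],
                          const uint8_t* src, uint8_t* dst,
                          int width, int height)
{
    static uint8_t lineBuf[FILTER_V_SIZE][MAX_IMAGE_WIDTH];   // models on-chip storage
    std::memset(lineBuf, 0, sizeof(lineBuf));                  // zero padding above the image
    const int HALF_V = FILTER_V_SIZE / 2;
    const int HALF_H = FILTER_H_SIZE / 2;

    // Stream input rows in; once row (oy + HALF_V) has arrived, output row oy is complete.
    for (int iy = 0; iy < height + HALF_V; ++iy) {
        // Shift the line buffer up by one row ...
        for (int r = 0; r < FILTER_V_SIZE - 1; ++r)
            std::memcpy(lineBuf[r], lineBuf[r + 1], width);
        // ... and load the next input row (zero padding below the image).
        if (iy < height) std::memcpy(lineBuf[FILTER_V_SIZE - 1], src + iy * width, width);
        else             std::memset(lineBuf[FILTER_V_SIZE - 1], 0, width);

        int oy = iy - HALF_V;          // output row now fully covered by the buffered lines
        if (oy < 0) continue;

        for (int ox = 0; ox < width; ++ox) {
            int sum = 0;
            for (int r = 0; r < FILTER_V_SIZE; ++r) {
                for (int c = 0; c < FILTER_H_SIZE; ++c) {
                    int ix = ox - HALF_H + c;                            // zero padding left/right
                    uint8_t px = (ix >= 0 && ix < width) ? lineBuf[r][ix] : 0;
                    sum += coeffs[r][c] * px;
                }
            }
            if (sum < 0) sum = 0; else if (sum > 255) sum = 255;         // saturate (illustrative)
            dst[oy * width + ox] = static_cast<uint8_t>(sum);
        }
    }
}
```

The key property of this structure is that every input pixel enters the line buffer exactly once, which is what lets the external input bandwidth drop to roughly the same order as the output bandwidth, as estimated in the previous lab.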
In this lab, you learned about: diff --git a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md index 5d84028f50..660f31dc8f 100755 --- a/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md +++ b/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/lab3_build_app_kernel.md @@ -8,28 +8,28 @@ # Building the 2-D Convolution Kernel and Host Application -This lab will focus on building a hardware kernel using the Vitis application acceleration development flow, targeting the Xilinx Alveo U200 accelerator card. A host-side application will be implemented to coordinate all the data movements and execution triggers for invoking the kernel. During this lab, real performance measurements will be taken and compared to estimated performance and the CPU-only performance. +This lab will focus on building a hardware kernel using the Vitis application acceleration development flow, targeting the AMD Alveo™ U200 accelerator card. A host-side application will be implemented to coordinate all the data movements and execution triggers for invoking the kernel. During this lab, real performance measurements will be taken and compared to estimated performance and the CPU-only performance. ## Host Application -This section briefly discusses how the host application is written to orchestrate the execution of a convolution kernel. It was estimated in the previous lab that multiple compute units will be needed to meet the 60 FPS performance target for processing 1080p HD Video. The host application is designed to be agnostic to the number of compute units. More specifically, if the compute units are symmetric ( instance of the same kernel and memory connectivity to device DDR banks is identical), the host application can deal with any number of compute units. +This section briefly discusses how the host application is written to orchestrate the execution of a convolution kernel. It was estimated in the previous lab that multiple compute units will be needed to meet the 60 FPS performance target for processing 1080p HD Video. The host application is designed to be agnostic to the number of compute units. More specifically, if the compute units are symmetric (the instance of the same kernel and memory connectivity to the device DDR banks is identical), the host application can deal with any number of compute units. -**TIP:** Additional [tutorials](https://github.com/Xilinx/Vitis-Tutorials) and [examples](https://github.com/Xilinx/Vitis_Accel_Examples) about host programming and Vitis tools are available. +>**TIP:** Additional [tutorials](https://github.com/Xilinx/Vitis-Tutorials) and [examples](https://github.com/Xilinx/Vitis_Accel_Examples) about host programming and Vitis tools are available. ### Host Application Variants -Please go to the top-level folder for the convolution tutorial and change the directory to `src`, and list the files: +Go to the top-level folder for the convolution tutorial, change the directory to `src`, and list the files: ```bash cd $CONV_TUTORIAL_DIR/src ls ``` -There are two files namely "**host.cpp**" and "**host_randomized.cpp**". They can be used to build two different versions of the host application. The way they interact with the kernel compute unit is exactly the same except that one uses the **pgm** image file as input. This file is repeated multiple times to emulate an image sequence(video). 
The randomized host uses a randomly generated image sequence. The host with random input image generation has no dependencies. In contrast, the host code in "**host.cpp**" uses OpenCV libraries, specifically using **OpenCV 2.4** libraries to load, unload and convert between raw image formats. +There are two files namely `host.cpp` and `host_randomized.cpp`. They can be used to build two different versions of the host application. The way they interact with the kernel compute unit is exactly the same except that one uses the **pgm** image file as the input. This file is repeated multiple times to emulate an image sequence (video). The randomized host uses a randomly generated image sequence. The host with the random input image generation has no dependencies. In contrast, the host code in `host.cpp` uses OpenCV libraries, specifically using **OpenCV 2.4** libraries to load, unload, and convert between raw image formats. ### Host Application Details -The host application starts by parsing command line arguments. Following are the command-line options provided by the host that takes the input image and uses OpenCV (in source file `src/host.cpp`): +The host application starts by parsing command line arguments. The following are the command line options provided by the host that takes the input image and uses OpenCV (in the source file `src/host.cpp`): ```cpp CmdLineParser parser; @@ -48,106 +48,106 @@ The `host_randomized.cpp` file has all those options, and adds `width` and `heig parser.addSwitch("--height", "-h", "Image height", "1080"); ``` -Different options can be used to launch the application and for performance measurements. In this lab, you will set most of these command-line inputs to the application using a makefile, `make_options.mk` in the top-level directory. This file lets you set most of these options. +Different options can be used to launch the application and for performance measurements. In this lab, you will set most of these command line inputs to the application using a makefile, `make_options.mk` in the top-level directory. This file lets you set most of these options. -After parsing the command-line options, the host application creates an OpenCL context, reads and loads the `.xclbin`, and creates a command queue with out-of-order execution and profiling enabled. After that, memory allocation is done, and the input image is read (or randomly generated). +After parsing the command line options, the host application creates an OpenCL™ context, reads and loads the `.xclbin`, and creates a command queue with out-of-order execution and profiling enabled. After that, memory allocation is done, and the input image is read (or randomly generated). After the setup is complete, the application creates a `Filter2DDispatcher` object and uses it to dispatch filtering requests on several images. Timers are used to capture execution time measurements for both software and hardware execution. Finally, the host application prints the summary of performance results. Most of the heavy lifting is done by `Filter2DDispatcher` and `Filter2DRequest`. These classes manage and coordinate the execution of filtering operations on multiple compute units. Both versions of the host application are based on these classes. ### 2D Filtering Requests -The `Filter2DRequest` class is used by the filtering request dispatcher class. An object of this class encapsulates a single request to process a single color channel (YUV) for a given image. 
It essentially allocates and holds handles to OpenCL resources needed for enqueueing 2-D convolution filtering requests. These resources include OpenCL buffers, event lists, and handles to kernel and command queue. The application creates a single command queue that is passed down to enqueue every kernel enqueue command. +The `Filter2DRequest` class is used by the filtering request dispatcher class. An object of this class encapsulates a single request to process a single color channel (YUV) for a given image. It essentially allocates and holds handles to OpenCL resources needed for enqueueing 2-D convolution filtering requests. These resources include OpenCL buffers, event lists, and handles to the kernel and command queue. The application creates a single command queue that is passed down to enqueue every kernel enqueue command. After an object of the `Filter2DRequest` class is created, it can be used to make a call to the `Filter2D` method. This call will enqueue all the operations, moving input data or filter coefficients, kernel calls, and reading of output data back to the host. The same API call will create a list of dependencies between these transfers and also creates an output event that signals the completion of output data transfer to the host. ### 2D Filter Dispatcher -The `Filter2DDispatcher` class is the top-level class that provides an end-user API to schedule kernel calls. Every call schedules a kernel enqueue and related data transfers using the `Filter2DRequest` object, as explained previously. The `Filter2DDispatcher` is a container class that essentially holds a vector of request objects. The number of `Filter2DRequest` objects that are instantiated is defined as the **max** parameter for the dispatcher class at construction time. This parameter's minimum value can be as small as the number of compute units to allow at least one kernel enqueue call per compute unit to happen in parallel. But a larger value is desired since it will allow overlap between input and output data transfers happening between host and device. +The `Filter2DDispatcher` class is the top-level class that provides an end-user API to schedule kernel calls. Every call schedules a kernel enqueue and related data transfers using the `Filter2DRequest` object, as explained previously. The `Filter2DDispatcher` is a container class that essentially holds a vector of request objects. The number of `Filter2DRequest` objects that are instantiated is defined as the **max** parameter for the dispatcher class at construction time. This parameter's minimum value can be as small as the number of compute units to allow at least one kernel enqueue call per compute unit to happen in parallel. But a larger value is desired, since it will allow overlap between input and output data transfers happening between the host and device. ## Building the Application -The host application can be built using the `Makefile` that is provided with the tutorial. As mentioned earlier, the host application has two versions: the first version takes input images to process, the second can generate random data that will be processed as images. +The host application can be built using the `Makefile` that is provided with the tutorial. As mentioned earlier, the host application has two versions: the first version takes input images to process, and the second can generate random data that will be processed as images. The top-level `Makefile` includes a file called `make_options.mk`. 
This file provides most of the options that can be used to generate different host builds and kernel versions for emulation modes. It also provides a way to launch emulation with a specific number of test images. The details of the options provided by this file are as follows: ### Kernel Build Options -- TARGET: selects build target; the choices are `hw`, `sw_emu`, `hw_emu`. -- PLATFORM: target Xilinx platform used for the build -- ENABLE_STALL_TRACE : enables the kernel to generate stall data. Choices are: `yes`, `no`. -- TRACE_DDR: select the memory bank to store trace data. Choices are DDR[0]-DDR[3] for u200 card. -- KERNEL_CONFIG_FILE: kernel configuration file -- VPP_TEMP_DIRS: temporary log directory for the Vitis kernel compiler (`v++`) -- VPP_LOG_DIRS: log directory for `v++`. -- USE_PRE_BUILT_XCLBIN: enables the use of pre-built FPGA binary file to speed the use of this tutorial +- TARGET: Selects build target; the choices are `hw`, `sw_emu`, and `hw_emu`. +- PLATFORM: Target AMD platform used for the build . +- ENABLE_STALL_TRACE: Enables the kernel to generate stall data. Choices are: `yes` or `no`. +- TRACE_DDR: Select the memory bank to store trace data. Choices are DDR[0]-DDR[3] for the U200 card. +- KERNEL_CONFIG_FILE: Kernel configuration file. +- VPP_TEMP_DIRS: Temporary log directory for the Vitis kernel compiler (`v++`). +- VPP_LOG_DIRS: Log directory for `v++`. +- USE_PRE_BUILT_XCLBIN: Enables the use of pre-built FPGA binary file to speed the use of this tutorial. ### Host Build Options -- ENABLE_PROF: Enables OpenCL profiling for the host application -- OPENCV_INCLUDE: OpenCV include directory path -- OPENCV_LIB: OpenCV lib directory path +- ENABLE_PROF: Enables OpenCL profiling for the host application. +- OPENCV_INCLUDE: OpenCV include directory path. +- OPENCV_LIB: OpenCV lib directory path. ### Application Runtime Options -- FILTER_TYPE: selects between 6 different filter types: choices are 0-6(Identity, Blur, Motion Blur, Edges, Sharpen, Gaussian, Emboss) -- PARALLEL_ENQ_REQS: application command-line argument for parallel enqueued requests -- NUM_IMAGES: number of images to process -- IMAGE_WIDTH: image width to use -- IMAGE_HEIGHT: image height to use -- INPUT_TYPE: selects between host versions -- INPUT_IMAGE: path and name of image file -- PROFILE_ALL_IMAGES: while comparing CPU vs. FPGA, use all images or not -- NUM_IMAGES_SW_EMU: sets no. of images to use for sw_emu -- NUM_IMAGES_HW_EMU: sets no. of images to use for hw_emu +- FILTER_TYPE: Selects between six different filter types: choices are 0-6 (Identity, Blur, Motion Blur, Edges, Sharpen, Gaussian, and Emboss). +- PARALLEL_ENQ_REQS: Application command line argument for parallel enqueued requests. +- NUM_IMAGES: Number of images to process. +- IMAGE_WIDTH: Image width to use. +- IMAGE_HEIGHT: Image height to use. +- INPUT_TYPE: Selects between host versions. +- INPUT_IMAGE: Path and name of image file. +- PROFILE_ALL_IMAGES: While comparing the CPU versus FPGA, use all images or not. +- NUM_IMAGES_SW_EMU: Sets the number of images to use for sw_emu. +- NUM_IMAGES_HW_EMU: Sets the number of images to use for hw_emu. -To build the host application with randomized data please follow these steps: +To build the host application with randomized data, follow these steps: 1. Edit the Makefile options: - - ```bash - cd $CONV_TUTORIAL_DIR/ - vim make_options.mk - ``` -2. Make sure **INPUT_TYPE** option is set to `random`. 
This will build the `host_randomized.cpp` application: + ```bash + cd $CONV_TUTORIAL_DIR/ + vim make_options.mk + ``` - ```makefile - ############## Host Application Options - INPUT_TYPE :=random - ``` +2. Make sure the **INPUT_TYPE** option is set to `random`. This will build the `host_randomized.cpp` application: - **TIP:** To build the `host.cpp` you must include the following two variables that point to **OpenCV 2.4** install path in `make_options.mk` file: + ```makefile + ############## Host Application Options + INPUT_TYPE :=random + ``` - ```makefile - ############## OpenCV Installation Paths - OPENCV_INCLUDE :=/**OpenCV2.4 User Install Path**/include - OPENCV_LIB :=/**OpenCV2.4 User Install Path**/lib - ``` + >**TIP:** To build the `host.cpp` you must include the following two variables that point to **OpenCV 2.4** install path in `make_options.mk` file: -3. Source the install specific scripts for setting up the Vitis application acceleration development flow and the Xilinx RunTime Library: + ```makefile + ############## OpenCV Installation Paths + OPENCV_INCLUDE :=/**OpenCV2.4 User Install Path**/include + OPENCV_LIB :=/**OpenCV2.4 User Install Path**/lib + ``` - ```bash - source /**User XRT Install Path**/setup.sh - source /**User Vitis Install Path**/settings64.sh - ``` +3. Source the install specific scripts for setting up the Vitis application acceleration development flow and the XRT: + + ```bash + source /**User XRT Install Path**/setup.sh + source /**User Vitis Install Path**/settings64.sh + ``` 4. After setting the appropriate paths, build the host application using the makefile command as follows: - ```bash - make host - ``` + ```bash + make host + ``` -It will build `host.exe` inside a `build` folder. By building the host application separate from the kernel code, you can make sure the host code compiles correctly and all library paths have been set. + It will build `host.exe` inside a `build` folder. By building the host application separate from the kernel code, you can make sure the host code compiles correctly and all library paths have been set. ### Running Software Emulation -To build and run the kernel in software emulation mode, please run the following bash command: +To build and run the kernel in software emulation mode, run the following bash command: ```bash make run TARGET=sw_emu ``` -It will build an xclbin file to be used in software emulation mode only and launch an emulation run. Once the emulation finishes, you should get a console output similar to the one below. The output given below is for random input image case: +It will build an `xclbin` file to be used in software emulation mode only, and launch an emulation run. When the emulation finishes, you should get a console output similar to the following. The following output is for a random input image case: ```bash ---------------------------------------------------------------------------- @@ -175,7 +175,7 @@ Test PASSED: Output matches reference ---------------------------------------------------------------------------- ``` -If the input image is to be used for processing, please set OpenCV paths and INPUT_TYPE as an empty string in `make_options.mk` file and run it again. Following is the expected console output: +If the input image is to be used for processing, set the OpenCV paths and INPUT_TYPE as an empty string in `make_options.mk` file, and run it again. 
The following is the expected console output: ```bash ---------------------------------------------------------------------------- @@ -201,85 +201,85 @@ Test PASSED: Output matches reference ---------------------------------------------------------------------------- ``` -The input and the output images are shown below for filter type selection set to 3 which performs edge detection: +The input and the output images are shown in the following images for filter type selection set to 3 which performs edge detection: ![missing image](./images/inputImage50.jpg) ![missing image](./images/outputImage.jpg) ### Running Hardware Emulation -The application can be run in hardware emulation mode in a similar way as software emulation. The only change needed is TARGET, which should be set to `hw_emu`. +The application can be run in hardware emulation mode in a similar way as software emulation. The only change needed is TARGET, which should be set to `hw_emu`. - **NOTE**: Hardware Emulation may take a long time. The `Makefile` default setting will make sure it simulates only a single image, but it is recommended in case of the random input image that image size be set smaller by keeping image height in the range of 30-100 pixels. The height and width of the image can be specified in the `make_options.mk` file. + >**NOTE:** Hardware emulation can take a long time. The `Makefile` default setting will make sure it simulates only a single image, but it is recommended in case of the random input image, that image size be set smaller by keeping image height in the range of 30-100 pixels. The height and width of the image can be specified in the `make_options.mk` file. 1. Launch hardware emulation using the following command: - ```bash - make run TARGET=hw_emu - ``` + ```bash + make run TARGET=hw_emu + ``` -It will build the hardware kernel in emulation mode and then launch the host application. The output printed in the console window will be similar to the `sw_emu` case. But after hardware emulation, you can analyze different synthesis reports and view different waveforms using Vitis Analyzer. + It will build the hardware kernel in emulation mode, and then launch the host application. The output printed in the console window will be similar to the `sw_emu` case. But after hardware emulation, you can analyze different synthesis reports and view different waveforms using the Vitis Analyzer. ## System Run -In this section, you will run the host application using FPGA hardware and analyze the overall system's performance using Vitis Analyzer and host application console output. +In this section, you will run the host application using FPGA hardware and analyze the overall system's performance using the Vitis Analyzer and host application console output. ### Building the Hardware xclbin -Once the kernel functionality is verified, and its resource usage is satisfactory, the hardware kernel build process can be started. The kernel build process will create an xclbin file targetting the actual accelerator card. It is an FPGA executable file that can be read and loaded by the host onto the FPGA card. Building xclbin takes a few hours, and is built as shown below: +When the kernel functionality is verified, and its resource usage is satisfactory, the hardware kernel build process can be started. The kernel build process will create an `xclbin` file targetting the actual accelerator card. It is an FPGA executable file that can be read and loaded by the host onto the FPGA card. 
Building `xclbin` takes a few hours, and is built as follows: -1. You can enable the performance profiling by setting "ENABLE_PROF?=yes" in `make_options.mk` file as shown below: +1. You can enable the performance profiling by setting "ENABLE_PROF?=yes" in `make_options.mk` file as follows: - ```bash - ENABLE_PROF?=yes - ``` + ```bash + ENABLE_PROF?=yes + ``` -2. Launch the hardware run using the following comand: +2. Launch the hardware run using the following comand: ```bash - make build TARGET=hw + make build TARGET=hw ``` - >**TIP**: You can use a prebuilt xclbin file if one is available by setting **USE_PRE_BUILT_XCLBIN := 1** in the `make_options.mk` file. + >**TIP**: You can use a prebuilt xclbin file if one is available by setting **USE_PRE_BUILT_XCLBIN := 1** in the `make_options.mk` file. -### Application Run Using FPGA Kernel +### Application Run Using the FPGA Kernel -1. To run the application, please proceed as follows: +1. To run the application, proceed as follows: - ```bash - make run TARGET=hw - ``` + ```bash + make run TARGET=hw + ``` -It should produce a console log similar to the one shown below: + It should produce a console log similar to the following : -```bash ----------------------------------------------------------------------------- + ```bash + ---------------------------------------------------------------------------- -Xilinx 2D Filter Example Application (Randomized Input Version) + Xilinx 2D Filter Example Application (Randomized Input Version) -FPGA binary : ../xclbin/fpgabinary.hw.xclbin -Number of runs : 60 -Image width : 1920 -Image height : 1080 -Filter type : 3 -Max requests : 12 -Compare perf. : 1 + FPGA binary : ../xclbin/fpgabinary.hw.xclbin + Number of runs : 60 + Image width : 1920 + Image height : 1080 + Filter type : 3 + Max requests : 12 + Compare perf. : 1 -Programming FPGA device -Generating a random 1920x1080 input image -Running FPGA accelerator on 60 images -Running Software version -Comparing results + Programming FPGA device + Generating a random 1920x1080 input image + Running FPGA accelerator on 60 images + Running Software version + Comparing results -Test PASSED: Output matches reference + Test PASSED: Output matches reference -FPGA Time : 0.4240 s -FPGA Throughput : 839.4765 MB/s -CPU Time : 28.9083 s -CPU Throughput : 12.3133 MB/s -FPGA Speedup : 68.1764 x ----------------------------------------------------------------------------- -``` + FPGA Time : 0.4240 s + FPGA Throughput : 839.4765 MB/s + CPU Time : 28.9083 s + CPU Throughput : 12.3133 MB/s + FPGA Speedup : 68.1764 x + ---------------------------------------------------------------------------- + ``` From the console output, it is clear that acceleration achieved when compared to CPU is 68x. The achieved throughput is 839 MB/s, which is close to the estimated throughput of 900 MB/s; it only differs by 6.66 percent. @@ -293,39 +293,39 @@ The trace information generated during the application run can be controlled by 1. After the design has been run; you can open the run time profile summary report using the following steps: - ```bash - vitis_analyzer ./build/fpgabinary.xclbin.run_summary - ``` + ```bash + vitis_analyzer ./build/fpgabinary.xclbin.run_summary + ``` + + >**NOTE:** In the 2023.1 release, this command opens the Analysis view of the new Vitis Unified IDE and loads the run summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). 
You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report. + +2. After the Vitis Analyzer tool opens, select **Profile Summary** from the left-side menu, and then select **Compute Unit Utilization** from the window displayed on the right-hand side. ->**NOTE:** In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the run summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report. -> -2. After the Vitis Analyzer tool opens, select **Profile Summary** from the left-side menu, and then select **Compute Unit Utilization** from the window displayed on the right-hand side. - The report will display stats about the measured performance of the compute units. You have built the `.xclbin` with three compute units, so the display will appear as shown below: - ![Compute Unit Utilization](images/cuPerf.jpg) + ![Compute Unit Utilization](images/cuPerf.jpg) - From this table, it can be seen that the kernel compute time as displayed in the **Avg Time** column is about 7 ms, almost equal to the estimated kernel latency in the previous lab. + From this table, it can be seen that the kernel compute time as displayed in the **Avg Time** column is about 7 ms, almost equal to the estimated kernel latency in the previous lab. - Another important measurement is the **CU Utilization** column, which is very close to 100 percent. This means the host was able to feed data to compute units through PCIe continuously. In other words, the host PICe bandwidth was sufficient, and compute units never saturated it. This can also be observed by examining the host bandwidth utilization. To see this, select **Host Data Transfers** in the report, and a table similar to the figure below will be displayed. From this table, it is clear that the host bandwidth is not fully utilized. + Another important measurement is the **CU Utilization** column, which is very close to 100 percent. This means the host was able to feed data to compute units through PCIe® continuously. In other words, the host PICe bandwidth was sufficient, and compute units never saturated it. This can also be observed by examining the host bandwidth utilization. To see this, select **Host Data Transfers** in the report, and a table similar to the following figure will be displayed. From this table, it is clear that the host bandwidth is not fully utilized. ![missing image](images/bwUtil.jpg) - Similarly, by selecting **Kernel Data Transfers** in the report, you can see how much bandwidth is utilized between the kernel and the device DDR memory. You have used a single memory bank (DDR[1]) for all the compute units, as shown below. + Similarly, by selecting **Kernel Data Transfers** in the report, you can see how much bandwidth is utilized between the kernel and the device double-data rate (DDR) memory. You have used a single memory bank (DDR[1]) for all the compute units, as shown below. -![missing image](images/bwKernel.jpg) + ![missing image](images/bwKernel.jpg) ### Application Timeline -The Application Timeline can also be used to examine performance parameters like CU latency per invocation and bandwidth utilization. 
+The Application Timeline can also be used to examine performance parameters like compute unit latency per invocation and bandwidth utilization. -1. Select **Application Timeline** from the left-side menu. +1. Select **Application Timeline** from the left-side menu. - This will display the Application Timeline in the right side window, as shown below. + This will display the Application Timeline in the right side window, as shown below. ![missing image](images/cuTime.jpg) - Zoom appropriately and go to device-side trace. For any CU, hover your mouse on any transaction in "Row 0" and a tooltip will show compute start and end times and also the latency. This should be similar to what you saw in the last section. + Zoom appropriately, and go to device-side trace. For any compute unit, hover your mouse on any transaction in "Row 0", and a tooltip will show the compute start and end times and also the latency. This should be similar to what you saw in the last section. Another important thing to observe is the host data transfer trace as shown below. From this report, it can be seen that the host read and write bandwidth is not fully utilized as there are gaps, showing times when there are no read/write transactions occurring. You can see that these gaps are significant, highlighting the fact that only a fraction of host PCIe bandwidth is utilized. @@ -335,13 +335,13 @@ From this discussion, and knowing that the Alveo U200 accelerator card has multi In this lab, you have learned: -- How to build, run and analyze the performance of a video filter -- How to write an optimized host-side application for multi CU designs +- How to build, run, and analyze the performance of a video filter +- How to write an optimized host-side application for multi compute unit designs - How to estimate kernel performance and compare it with measured performance ## Conclusion -Congratulations! You have successfully completed the tutorial. In this tutorial, you have learned how to estimate the performance and acceleration requirements for the hardware implementation of a Vitis Kernel. You have analyzed kernel performance using different tools and compared measured and estimated performance, to see how close both performance number are. You have also seen that you can create an optimized memory cache or hierarchy for FPGA based implementation easily that significantly boost application performance. Finally you have learnt how to write optimized host code to get best performance out of multiple CUs for a given kernel. +Congratulations! You have successfully completed the tutorial. In this tutorial, you have learned how to estimate the performance and acceleration requirements for the hardware implementation of a Vitis kernel. You have analyzed kernel performance using different tools and compared measured and estimated performance, to see how close both performance numbers are. You have also seen that you can easily create an optimized memory cache or hierarchy for an FPGA-based implementation that significantly boosts application performance. Finally, you have learned how to write optimized host code to get the best performance out of multiple compute units for a given kernel.
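As a closing illustration of the host-side pattern this tutorial relies on, where each `Filter2DRequest` chains its input transfers, kernel call, and output read-back through OpenCL events so that many requests stay in flight, the following is a minimal sketch of one such request. The names, argument list, and error handling are simplified and assumed for illustration only; this is not the tutorial's actual `Filter2DRequest` source, and it presumes the OpenCL context, buffers, kernel, and an out-of-order command queue already exist.

```cpp
#include <CL/cl.h>

// Enqueue copy-in -> kernel -> copy-out for one filtering request and return
// the event that signals completion of the output read-back. The caller can
// keep several such requests outstanding and only wait on the returned events.
cl_event enqueue_filter_request(cl_command_queue q, cl_kernel krnl,
                                cl_mem d_coeffs, cl_mem d_in, cl_mem d_out,
                                const void* h_coeffs, size_t coeffs_bytes,
                                const void* h_in, void* h_out, size_t img_bytes)
{
    cl_event wr[2], run, rd;

    // Non-blocking host-to-device transfers for coefficients and input image.
    clEnqueueWriteBuffer(q, d_coeffs, CL_FALSE, 0, coeffs_bytes, h_coeffs, 0, NULL, &wr[0]);
    clEnqueueWriteBuffer(q, d_in,     CL_FALSE, 0, img_bytes,    h_in,     0, NULL, &wr[1]);

    // The kernel run waits only on its own two input transfers.
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_coeffs);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_in);
    clSetKernelArg(krnl, 2, sizeof(cl_mem), &d_out);
    clEnqueueTask(q, krnl, 2, wr, &run);

    // The output read-back waits only on the kernel run; its completion event
    // is what the dispatcher hands back to the caller.
    clEnqueueReadBuffer(q, d_out, CL_FALSE, 0, img_bytes, h_out, 1, &run, &rd);
    return rd;
}
```

Because only the events order the commands, requests enqueued this way for different compute units can overlap their data transfers and kernel runs, which is what hides the data movement latency measured in this lab.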

Copyright © 2020–2023 Advanced Micro Devices, Inc

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md index 7005f7b24a..ee9dd10c5b 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/1_overview.md @@ -29,7 +29,7 @@ The following figure shows a Bloom filter example representing the set `{x, y, z In this tutorial, each document consists of an array of words where: each word is a 32-bit unsigned integer comprised of a 24-bit word ID and an 8-bit integer representing the frequency. The search array consists of words of interest to the user, and represents a smaller set of 24-bit word IDs, where each word ID has a weight associated with it, determining the importance of the word. 1. Navigate to `Hardware_Acceleration/Design_Tutorials/02-bloom` directory. -2. Go to the `cpu_src` directory, open the `main.cpp` file, and look at line 63. +2. Go to the `cpu_src` directory, open the `main.cpp` file, and look at line 63. The Bloom filter application is 64 KB, which is implemented as `1L<Return to Start of Tutorial
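For reference, the packed word layout described in this overview (a 24-bit word ID plus an 8-bit frequency in one 32-bit value) can be unpacked with a shift and a mask before the ID is hashed and the frequency is weighted. A minimal sketch, assuming the frequency occupies the low byte and the word ID the upper 24 bits:

```cpp
#include <cstdint>

// Split one packed document word into its two fields (assumed layout:
// word ID in the upper 24 bits, frequency in the low 8 bits).
inline void unpack_word(uint32_t word, uint32_t& word_id, uint32_t& frequency)
{
    word_id   = word >> 8;      // 24-bit ID used for the Bloom filter lookup
    frequency = word & 0xFFu;   // 8-bit occurrence count used for the score
}
```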

Copyright © 2020–2023 Advanced Micro Devices, Inc

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md index 6fa213a399..a97c0bb61b 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/2_experience-acceleration.md @@ -41,14 +41,13 @@ In this lab, you will experience the acceleration potential by running the appli Execution COMPLETE ``` -3. Run the application on the FPGA. - For the purposes of this lab, the FPGA accelerator is implemented with an 8x parallelization factor. +3. Run the application on the FPGA. For the purposes of this lab, the FPGA accelerator is implemented with an 8x parallelization factor. - * Eight input words are processed in parallel, producing eight output flags in parallel during each clock cycle. + * Eight input words are processed in parallel, producing eight output flags in parallel during each clock cycle. To run the optimized application on the FPGA, run the following `make` command. - ``` bash + ``` bash make run_fpga SOLUTION=1 ``` @@ -68,7 +67,7 @@ In this lab, you will experience the acceleration potential by running the appli Throughput = Total data/Total time = 1.39 GB/427.1341ms = 3.25 GB/s - By efficiently leveraging FPGA acceleration, the throughput of the application increases by a factor of 7. + By efficiently leveraging FPGA acceleration, the throughput of the application increases by a factor of 7. ## Next Steps @@ -77,7 +76,6 @@ In this step, you observed the acceleration that can be achieved using an FPGA.

Return to Start of Tutorial

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md index 57c2e91f99..8ecdadbbd6 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/3_architect-the-application.md @@ -67,7 +67,7 @@ You will evaluate which of these sections are a good fit for the FPGA. * On the FPGA, you can create custom architectures, and therefore, create an accelerator that will shift the data by an arbitrary number of bits in a single clock cycle. - * The FPGA also has dedicated DSP units that perform multiplications faster than the CPU. Even though the CPU runs at a frequency eight times higher than the FPGA, the arithmetic shift and multiplication operations can perform faster on the FPGA because of the customizable hardware architecture. + * The FPGA also has dedicated digital signal processing (DSP) units that perform multiplications faster than the CPU. Even though the CPU runs at a frequency eight times higher than the FPGA, the arithmetic shift and multiplication operations can perform faster on the FPGA because of the customizable hardware architecture. Therefore, this function is a good candidate for FPGA acceleration. @@ -108,13 +108,13 @@ You will evaluate which of these sections are a good fit for the FPGA. From this code, you can see: - * You are computing two hash outputs for each word in all the documents and creating output flags accordingly. + * You are computing two hash outputs for each word in all the documents and creating output flags accordingly. - * You already determined that the hash function(`MurmurHash2()`) is a good candidate for acceleration on the FPGA. + * You already determined that the hash function(`MurmurHash2()`) is a good candidate for acceleration on the FPGA. * The hash (`MurmurHash2()`) function with one word is independent of other words and can be done in parallel which improves the execution time. - * The algorithm sequentially accesses to the `input_doc_words` array. This is an important property because when implemented in the FPGA, it allows for very efficient accesses to the DDR. + * The algorithm sequentially accesses to the `input_doc_words` array. This is an important property because when implemented in the FPGA, it allows for very efficient accesses to the DDR. This code section is a good candidate for FPGA acceleration because the hash function can run faster on the FPGA, and you can compute hashes for multiple words in parallel by reading multiple words from the DDR in burst mode. @@ -148,7 +148,7 @@ You will evaluate which of these sections are a good fit for the FPGA. * The compute score requires one memory access to `profile_weights`, one accumulation, and one multiplication operation. * The memory accesses are random because they depend on the `word_id` and therefore, the content of each document. - * The size of the `profile_weights` array is 128 MB and must be stored in the DDR memory connected to the FPGA. Non-sequential accesses to DDR are big performance bottlenecks. Because accesses to the `profile_weights` array are random, implementing this function on the FPGA would not provide much performance benefit, and because the function takes only about 11% of the total running time, you can keep this function on the host CPU. + * The size of the `profile_weights` array is 128 MB and must be stored in the DDR memory connected to the FPGA. 
Non-sequential accesses to DDR are big performance bottlenecks.Because accesses to the `profile_weights` array are random, implementing this function on the FPGA would not provide much performance benefit, and because the function takes only about 11% of the total running time, you can keep this function on the host CPU. Based on this analysis, it is only beneficial to accelerate the Compute Output Flags from the Hash section on the FPGA. The execution of the Compute Document Score section can be kept on the host CPU. @@ -156,7 +156,7 @@ You will evaluate which of these sections are a good fit for the FPGA. ## Establish the Realistic Goal for the Overall Application -The Compute Score function is calculated on the CPU. Based on your calculations in the previous lab, it takes about 380 ms. You cannot accelerate the function further for a given CPU. Even if the FPGA can compute the hash function in zero time, the application will still take at the minimum of 380 ms, but running the FPGA in no time is also not realistic. You also need to account for sending the data from the CPU to the FPGA and retrieving it back to the CPU from the FPGA which will also add a delay. +The Compute Score function is calculated on the CPU. Based on your calculations in the previous lab, it takes about 380 ms. You cannot accelerate the function further for a given CPU. Even if the FPGA can compute the hash function in zero time, the application will still take at the minimum of 380 ms, but running the FPGA in no time is also not realistic. You also need to account for sending the data from the CPU to the FPGA and retrieving it back to the CPU from the FPGA which will also add a delay. 1. Set the goal for the application such that the Compute Hash function (in Hardware) should run as fast as the Compute Score on the CPU so that the hash function does not become the bottleneck. @@ -164,11 +164,11 @@ The Compute Score function is calculated on the CPU. Based on your calculations 1. Keep only the compute hash function in the FPGA. In software, this function takes about 2569 ms. - The goal of application has been established to compute the hashes and overall score of the 10,000 documents in about 380 ms. + The goal of application has been established to compute the hashes and overall score of the 10,000 documents in about 380 ms. ### Determine the Maximum Achievable Throughput -In most FPGA-accelerated systems, the maximum achievable throughput is limited by the PCIe® bus. The PCIe bus performance is influenced by many different aspects, such as the motherboard, drivers, targeted shell, and transfer sizes. The Vitis core development kit provides a utility, `xbutil`. +In most FPGA-accelerated systems, the maximum achievable throughput is limited by the PCIe® bus. The PCIe® bus performance is influenced by many different aspects, such as the motherboard, drivers, targeted shell, and transfer sizes. The Vitis core development kit provides a utility, `xbutil`. Run the `xbutil validate` command to measure the maximum PCIe bandwidth that can be achieved. The throughput on your design target cannot exceed this upper limit. @@ -185,15 +185,15 @@ The `xbutil validate` command produces the following output. Host <- PCIe <- FPGA read bandwidth = 12154.5 MB/s ``` -The PCIe FPGA write bandwidth is about 9 GB/sec and the FPGA read bandwidth is about 12 GB/sec. The PCIe bandwidth is 3.1 GB/sec above your established goal. +The PCIe FPGA write bandwidth is about 9 GB/s and the FPGA read bandwidth is about 12 GB/s. 
The PCIe bandwidth is 3.1 GB/s above your established goal. ### Identifying Parallelization for an FPGA Application In Software, the flow will look similar to the following figure. -![missing image](./images/Architect1.PNG) +![missing image](./images/Architect1.PNG) -- `Murmurhash2` functions are calculated for all the words up front, and output flags are set in the local memory. Each of the loops in the hash compute functions are run sequentially. -- After all hashes have computed, only then can another loop be called for all the documents to calculate the compute score. +* `Murmurhash2` functions are calculated for all the words up front, and output flags are set in the local memory. Each of the loops in the hash compute functions are run sequentially. +* After all hashes have computed, only then can another loop be called for all the documents to calculate the compute score. When you run the application on the FPGA, an additional delay is added when the data is transferred from the host to device memory and read back. You can split the entire application time and create a budgeted-based run based on following requirements: @@ -207,14 +207,14 @@ For achieving application target of 380 ms, adding 1470 ms + 30 ms + 380 ms is c If steps 1 through 4 are carried out sequentially like the CPU, you cannot achieve your performance goal. You will need to take advantage of concurrent processing and overlapping of the FPGA. 1. Parallelism between the host to device data transfer and Compute on FPGA. Split the 100,000 documents into multiple buffers, and send the buffers to the device so the kernel does not need to wait until the whole buffer is transferred. -2. Parallelism between Compute on FPGA and Compute Profile score on CPU -3. Increase the number of words to be processed in parallel to increase the concurrent hash compute processing. The hash compute in the CPU is performed on a 32-bit word. Because the hash compute can be completed independently on several words, you can explore computing 4, 8, or even 16 words in parallel. +2. Parallelism between Compute on FPGA and Compute Profile score on CPU. +3. Increase the number of words to be processed in parallel to increase the concurrent hash compute processing. The hash compute in the CPU is performed on a 32-bit word. Because the hash compute can be completed independently on several words, you can explore computing four, eight, or even 16 words in parallel. Based on your earlier analysis, the `Murmurhash2` function is a good candidate for computation on the FPGA, and the Compute Profile Score calculation can be carried out on the host. For a hardware function on the FPGA, based on the established goal in previous lab, you should process hashes as fast as possible by processing multiples of these words every clock cycle. The application conceptually can similar to the following figure. -![missing image](./images/Architect2.PNG) +![missing image](./images/Architect2.PNG) Transferring words from the host to CPU, Compute on FPGA, transferring words from the FPGA to CPU all can be executed in parallel. The CPU starts to calculate the profile score as soon as the flags are received;essentially, the profile score on the CPU can also start calculations in parallel. With the pipelining as shown above, the latency of the Compute on FPGA will become invisible. 
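A rough host-side sketch of this split-buffer pipelining is shown below. The function and variable names are illustrative and error handling is omitted; it assumes one device buffer pair per chunk and an out-of-order command queue, so each chunk's write, kernel run, and read-back are ordered only by their events. The CPU then scores a chunk as soon as that chunk's flags arrive, while later chunks are still being transferred and hashed.

```cpp
#include <CL/cl.h>
#include <vector>

void run_chunked(cl_command_queue q, cl_kernel krnl,
                 cl_mem* d_words, cl_mem* d_flags,        // one device buffer per chunk
                 const unsigned* h_words, unsigned char* h_flags,
                 size_t total_words, int num_chunks)
{
    size_t chunk_words = total_words / num_chunks;         // assume an even split
    std::vector<cl_event> flags_ready(num_chunks);

    for (int c = 0; c < num_chunks; ++c) {
        const unsigned* src = h_words + c * chunk_words;
        unsigned char*  dst = h_flags + c * chunk_words;   // one flag byte per word
        unsigned size = (unsigned)chunk_words;
        cl_event wr, run;

        // 1. Send this chunk of document words to device DDR (non-blocking).
        clEnqueueWriteBuffer(q, d_words[c], CL_FALSE, 0,
                             chunk_words * sizeof(unsigned), src, 0, NULL, &wr);

        // 2. Hash this chunk on the FPGA as soon as its transfer completes.
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_words[c]);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_flags[c]);
        clSetKernelArg(krnl, 2, sizeof(unsigned), &size);
        clEnqueueTask(q, krnl, 1, &wr, &run);

        // 3. Read this chunk's flags back as soon as the kernel run finishes.
        clEnqueueReadBuffer(q, d_flags[c], CL_FALSE, 0, chunk_words, dst,
                            1, &run, &flags_ready[c]);
    }

    // 4. Score each chunk on the CPU as its flags arrive, overlapping the CPU
    //    work with the chunks that are still in flight on the FPGA.
    for (int c = 0; c < num_chunks; ++c) {
        clWaitForEvents(1, &flags_ready[c]);
        // compute_score(h_flags + c * chunk_words, chunk_words, ...);  // CPU part
    }
}
```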
@@ -222,7 +222,7 @@ Transferring words from the host to CPU, Compute on FPGA, transferring words fro In this lab, you profiled an application and determined which parts were best-suited for FPGA acceleration. You also created the setup to create an optimized kernel to achieve your acceleration goal. You will explore these steps in the following lab sections: -* [Implementing the Kernel](./4_implement-kernel.md): Create an optimized kernel with 4, 8, and 16 words to be processed in parallel and all 100,000 documents worth 1.4 GB are sent from host to kernel in single batch. +* [Implementing the Kernel](./4_implement-kernel.md): Create an optimized kernel with four, eight, and 16 words to be processed in parallel and all 100,000 documents worth 1.4 GB are sent from host to kernel in single batch. * [Analyze Data Movement Between Host and Kernel](./5_data-movement.md): Explore sending 100,000 documents in multiple batches, so that the kernel compute can be overlapped with the host data transfer for optimized application performance. You will analyze the results by keeping the `Murmurhash2` functions on the FPGA and Compute Score functions on the CPU to process in sequential mode. Next, flags created by the accelerator will also be sent over to the host and overlapped with the host to data transfer and FPGA compute. diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md index 39e095379e..410c9e555c 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/4_implement-kernel.md @@ -8,18 +8,18 @@ # Implementing the Kernel -In this lab, you will create an optimized kernel with 4, 8, and 16 words to be processed in parallel. All 100,000 documents worth 1.4 GB are sent from the host CPU to the kernel using a single buffer from the host CPU. +In this lab, you will create an optimized kernel with four, 8eight, and 16 words to be processed in parallel. All 100,000 documents worth 1.4 GB are sent from the host CPU to the kernel using a single buffer from the host CPU. -## Bloom4x: Kernel Implementation Using 4 Words in Parallel +## Bloom4x: Kernel Implementation Using Four Words in Parallel -Processing 4 words in parallel will require 32-bits*4 = 128-bits in parallel, but you should access the DDR with 512-bits because the data is contiguous. This will require smaller number of memory accesses. +Processing four words in parallel will require 32-bits*4 = 128-bits in parallel, but you should access the DDR with 512-bits because the data is contiguous. This will require smaller number of memory accesses. Use the following interface requirements to create kernel: -- Read multiple words stored in the DDR as a 512-bit DDR access, equivalent of reading 16 words per DDR access. +- Read multiple words stored in the DDR as a 512-bit DDR access, equivalent of reading 16 words per DDR access. - Write multiple flags to the DDR as a 512-bit DDR access, equivalent of writing 32 flags per DDR access. -- Compute 4 words to be computed in parallel with each word requiring two `MurmurHash2` functions -- Compute the hash (two `MurmurHash2` functions) functions for 4 words every cycle. +- Compute four words to be computed in parallel with each word requiring two `MurmurHash2` functions. +- Compute the hash (two `MurmurHash2` functions) functions for four words every cycle. 
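Taken together, the requirements above suggest a kernel interface along the following lines. This is only a structural sketch, not the tutorial's actual source: the argument and bundle names follow the pragma discussion later in this lab, and the body is omitted.

```cpp
#include <ap_int.h>

extern "C" void runOnfpga(
    const ap_uint<512>* input_words,   // 16 packed 32-bit words per AXI beat
    const unsigned int* bloom_filter,  // filter coefficients, loaded once
    ap_uint<512>*       output_flags,  // packed flags, written in 512-bit bursts
    unsigned int        total_size,    // number of words to process
    bool                load_filter)   // load coefficients on the first call only
{
#pragma HLS INTERFACE m_axi port=input_words  offset=slave bundle=maxiport0
#pragma HLS INTERFACE m_axi port=output_flags offset=slave bundle=maxiport0
#pragma HLS INTERFACE m_axi port=bloom_filter offset=slave bundle=maxiport1
    // Load (512-bit bursts) -> Compute (hash four words per cycle) ->
    // Store (512-bit bursts), built as a dataflow pipeline in the steps below.
}
```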
Refer to [Methodology for Accelerating Applications with the Vitis Software](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Methodology-for-Accelerating-Data-Center-Applications-with-the-Vitis-Software-Platform) in the in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416). @@ -37,7 +37,7 @@ The algorithm has been updated to receive 512-bits of words from the DDR with th - `load_filter`: Enable or disable of loading coefficients. This only needs to be loaded one time. 1. The first step of the kernel development methodology requires structuring the kernel code into the Load-Compute-Store pattern. This means creating a top-level function, `runOnfpga` with: - - Added sub-functions in the `compute_hash_flags_dataflow` for Load, Compute and Store. + - Added sub-functions in the `compute_hash_flags_dataflow` for Load, Compute, and Store. - Local arrays or `hls::stream` variables to pass data between these functions. 2. The source code has the following INTERFACE pragmas for `input_words`, `output_flags` and `bloom_filter`. @@ -51,23 +51,23 @@ The algorithm has been updated to receive 512-bits of words from the DDR with th - `m_axi`: Interface pragmas are used to characterize the AXI Master ports. - `port`: Specifies the name of the argument to be mapped to the AXI4 interface. - `offset=slave`: Indicates that the base address of the pointer is made available through the AXI4-Lite slave interface of the kernel. - - `bundle`: Specifies the name of the `m_axi` interface. In this example, the `input_words` and `output_flags` are mapped to a `maxiport0` and `bloom_filter` argument is mapped to `maxiport1`. + - `bundle`: Specifies the name of the `m_axi` interface. In this example, the `input_words` and `output_flags` are mapped to a `maxiport0`, and the `bloom_filter` argument is mapped to `maxiport1`. - The function `runOnfpga` loads the Bloom filter coefficients and calls the `compute_hash_flags_dataflow` function which has the main functionality of the Load, Compute and Store functions. + The function, `runOnfpga`, loads the Bloom filter coefficients and calls the `compute_hash_flags_dataflow` function which has the main functionality of the Load, Compute, and Store functions. - Refer to the function `compute_hash_flags_dataflow` in the `02-bloom/cpu_src/compute_score_fpga_kernel.cpp` file. The following block diagram shows how the compute kernel connects to the device DDR memories and how it feeds the compute hash block processing unit. + Refer to the function `compute_hash_flags_dataflow` in the `02-bloom/cpu_src/compute_score_fpga_kernel.cpp` file. The following block diagram shows how the compute kernel connects to the device DDR memories and how it feeds the compute hash block processing unit. ![missing image](./images/Kernel_block_diagram.PNG) The kernel interface to the DDR memories is an AXI interface that is kept at its maximum width of 512 at the input and output. The `compute_hash_flags` function input can have a width different than 512, managed through “PARALLELIZATION”. To deal with these variations on the processing element boundaries, "Resize" blocks are inserted that adapt between the memory interface width and the processing unit interface width. Essentially, blocks named "Buffer" are memory adapters that convert between streams, and the AXI and “Resize” blocks adapt to interface widths as it depends on PARALLELIZATION factor chosen for the given configuration. -3. 
The input of the `compute_hash_flags_dataflow` function, `input_words` are read as 512-bit burst reads from the global memory over an AXI interface and `data_from_gmem`, the stream of 512-bit values are created. +3. The input of the `compute_hash_flags_dataflow` function, `input_words` are read as 512-bit burst reads from the global memory over an AXI interface and `data_from_gmem`, the stream of 512-bit values are created. ``` hls_stream::buffer(data_from_gmem, input_words, total_size/(512/32)); ``` -4. The stream of parallel words, `word_stream` (equals PARALLELIZATION words) are created from `data_from_gmem` as `compute_hash_flags` requires 128-bit for 4 words to process in parallel. +4. The stream of parallel words, `word_stream` (equals PARALLELIZATION words) are created from `data_from_gmem` as `compute_hash_flags` requires 128-bit for four words to process in parallel. ``` hls_stream::resize(word_stream, data_from_gmem, total_size/(512/32)); @@ -75,7 +75,7 @@ The algorithm has been updated to receive 512-bits of words from the DDR with th 5. The function `compute_hash_flags_dataflow` calls the `compute_hash_flags` function for computing hash of parallel words. -6. With `PARALLELIZATION=4`, the output of the `compute_hash_flags`, `flag_stream` is 4*8-bit = 32-bit parallel words, which will be used to create the 512-bit values of stream as `data_to_mem`. +6. With `PARALLELIZATION=4`, the output of the `compute_hash_flags`, `flag_stream` is 4*8-bit = 32-bit parallel words, which will be used to create the 512-bit values of stream as `data_to_mem`. ``` hls_stream::resize(data_to_gmem, flag_stream, total_size/(512/8)); @@ -126,7 +126,7 @@ Now that you have the top-level function, `runOnfpga` updated with the proper da - `#pragma HLS PIPELINE II=1` is added to initiate the burst DDR accesses and read the Bloom filter coefficients every cycle. - The expected latency is about 16,000 cycles because the `bloom_filter_size` is fixed to 16,000. You should confirm this after you run HLS Synthesis. -2. Within the `compute_hash_flags` function, the `for` loop is rearchitected as nested for the loop to compute 4 words in parallel. +2. Within the `compute_hash_flags` function, the `for` loop is rearchitected as nested for the loop to compute four words in parallel. ```cpp void compute_hash_flags ( @@ -165,7 +165,7 @@ Now that you have the top-level function, `runOnfpga` updated with the proper da - Added `#pragma HLS UNROLL` - Unrolls internal loop to make four copies of the Hash functionality. - - Vitis HLS will try to pipeline the outer loop with `II=1`. With the inside loop unrolled, you can initiate the outer loop every clock cycle, and compute 4 words in parallel. + - Vitis HLS will try to pipeline the outer loop with `II=1`. With the inside loop unrolled, you can initiate the outer loop every clock cycle, and compute four words in parallel. - Added `#pragma HLS LOOP_TRIPCOUNT` min=1 max=3500000` - Reports the latency of the function after HLS Synthesis. @@ -181,7 +181,7 @@ Now, build the kernel using the Vitis compiler. The Vitis compiler will call the This command will call the `v++` compiler which then calls the Vitis HLS tool to translate the C++ code into RTL code that can be used to run Hardware Emulation. - >**NOTE**: For purposes of this tutorial, the number of input words used is only 100 because it will take a longer time to run the Hardware Emulation. 
+ >**NOTE**: For purposes of this tutorial, the number of input words used is only 100 because it will take a longer time to run the Hardware Emulation. 2. Then, use the following commands to visualize the HLS Synthesis Report in the Vitis analyzer. @@ -189,13 +189,13 @@ Now, build the kernel using the Vitis compiler. The Vitis compiler will call the vitis_analyzer ../build/single_buffer/kernel_4/hw_emu/runOnfpga_hw_emu.xclbin.link_summary ``` - >**NOTE:** In the 2023.1 release this command opens the Analysis view of the new Vitis Unified IDE and loads the link summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). You can navigate to the various reports using the left pane of the Analysis view or by clicking on the links provided in the summary report. + >**NOTE:** In the 2023.1 release, this command opens the Analysis view of the new Vitis Unified IDE and loads the link summary as described in [Working with the Analysis View](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Working-with-the-Analysis-View). You can navigate to the various reports using the left pane of the Analysis view or by clicking the links provided in the summary report. -Select the System Estimate report to open it. + Select the System Estimate report to open it. - ![missing image](./images/4_Kernel_4_link.PNG) + ![missing image](./images/4_Kernel_4_link.PNG) - - The `compute_hash_flags` latency reported is 875,011 cycles. This is based on total of 35,000,000 words, computed with 4 words in parallel. This loop has 875,000 iterations and including the `MurmurHash2` latency, the total latency of 875,011 cycles is optimal. + - The `compute_hash_flags` latency reported is 875,011 cycles. This is based on total of 35,000,000 words, computed with four words in parallel. This loop has 875,000 iterations and including the `MurmurHash2` latency, the total latency of 875,011 cycles is optimal. - The `compute_hash_flags_dataflow` function has `dataflow` enabled in the Pipeline column. This function is important to review and indicates that the task-level parallelism is enabled and expected to have overlap across the sub-functions in the `compute_hash_flags_dataflow` function. - The latency reported for `read_bloom_filter` function is 16,385 for reading the Bloom filter coefficients from the DDR using the `bloom_filter maxi` port. This loop is iterated over 16,000 cycles reading 32-bits data of from the Bloom filter coefficients. @@ -205,7 +205,7 @@ The reports confirm that the latency of the function meets your target. You stil The initial version of the accelerated application code structure follows the structure of the original software version. The entire input buffer is transferred from the host to the FPGA in a single transaction. Then, the FPGA accelerator performs the computation. Finally, the results are read back from the FPGA to the host before being post-processed. -The following figure shows the sequential process of the host writing data on the device, compute by the accelerator on the FPGA, and read flags back to host, implemented in this first step. The Profile score is calculated sequentially on CPU after all the flags are received by the host. +The following figure shows the sequential process of the host writing data on the device, compute by the accelerator on the FPGA, and read flags back to host, implemented in this first step. 
The Profile score is calculated sequentially on CPU after all the flags are received by the host. ![missing image](./images/overlap_single_buffer.PNG) @@ -213,18 +213,18 @@ The FPGA accelerator computes the hash values and flags for the provided input w The functionality of the different inputs passed to the accelerator kernel is as follows: -* `input_doc_words`: Input array that contains the 32-bit words for all the documents. -* `bloom_filter`: Bloom filter array that contains the inserted search array hash values. -* `total_size`: Unsigned `int` that represents the total size processed by the FPGA when called. -* `load_weights`: Boolean that allows the `bloom_filter` array to load only once to the FPGA in the case of multiple kernel invocations. +- `input_doc_words`: Input array that contains the 32-bit words for all the documents. +- `bloom_filter`: Bloom filter array that contains the inserted search array hash values. +- `total_size`: Unsigned `int` that represents the total size processed by the FPGA when called. +- `load_weights`: Boolean that allows the `bloom_filter` array to load only once to the FPGA in the case of multiple kernel invocations. The output of the accelerator is as follows: -* `output_inh_flags`: Output array of 8-bit outputs where each bit in the 8-bit output indicates whether a word is present in the Bloom filter, that is then used for computing score in the CPU. +- `output_inh_flags`: Output array of 8-bit outputs where each bit in the 8-bit output indicates whether a word is present in the Bloom filter, that is then used for computing score in the CPU. -### Run Software Emulation, Hardware Emulation and Hw +### Run Software Emulation, Hardware Emulation and Hardware -1. To ensure the application passes Software Emulation with your changes, run the following command. +1. To ensure the application passes software emulation with your changes, run the following command. ``` cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer TARGET=sw_emu @@ -232,7 +232,7 @@ The output of the accelerator is as follows: Make sure that the Software Emulation is passing. -2. Next, to verify the functionality is intact, use the following command to run Hardware Emulation. +2. Next, to verify the functionality is intact, use the following command to run hardware emulation. ``` cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer TARGET=hw_emu @@ -240,10 +240,9 @@ The output of the accelerator is as follows: - The command should conclude with 'Verification: Pass'. This ensures that the generated hardware is functionally correct. However, you have not yet run the hardware on the FPGA. - - > **NOTE**: This tutorial is provided with `xclbin` files in the `$LAB_WORK_DIR/xclbin_save` directory. The `SOLUTION=1` option can be added to the make target for using these `xclbin` files for `hw` runs. These `xclbin` files were generated for Alveo U200 cards only. You must generate new `xclbin` files for every platform used in this tutorial. + > **NOTE**: This tutorial is provided with `xclbin` files in the `$LAB_WORK_DIR/xclbin_save` directory. The `SOLUTION=1` option can be added to the make target for using these `xclbin` files for `hw` runs. These `xclbin` files were generated for AMD Alveo™ U200 cards only. You must generate new `xclbin` files for every platform used in this tutorial. -4. Run the following steps to execute the application on hardware. +3. Run the following steps to execute the application on hardware. You are using 100,000 documents compute on the hardware. 
@@ -251,14 +250,14 @@ The output of the accelerator is as follows: cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=4 TARGET=hw ``` - * If you are using an `xclbin` provided as part of solution in this tutorial, then use the following command. + - If you are using an `xclbin` provided as part of solution in this tutorial, then use the following command. ``` cd $LAB_WORK_DIR/makefile; make run STEP=single_buffer ITER=1 PF=4 TARGET=hw SOLUTION=1 ``` - * To use four words in parallel, `PF=4` will set the PARALLELIZATION macro to 4 in `$LAB_WORK_DIR/reference_files/compute_score_fpga_kernel.cpp`. - * `ITER=1` indicates buffer sent using single iteration (using a single buffer). + - To use four words in parallel, `PF=4` will set the PARALLELIZATION macro to 4 in `$LAB_WORK_DIR/reference_files/compute_score_fpga_kernel.cpp`. + - `ITER=1` indicates buffer sent using single iteration (using a single buffer). The following output displays. @@ -273,14 +272,14 @@ The output of the accelerator is as follows: Verification: PASS ``` - - Total FPGA time is 447 ms. This includes the host to DDR transfer, Total Compute on FPGA and DDR to host transfer. - - Total time of computing 100,000 documents is about 838 ms. + - The total FPGA time is 447 ms. This includes the host to DDR transfer, Total Compute on FPGA, and DDR to host transfer. + - The total time of computing 100,000 documents is about 838 ms. At this point, review the Profile reports and Timeline Trace to extract information, such as how much time it takes to transfer the data between host and kernel and how much time it takes to compute on the FPGA. ### Visualize the Resources Utilized -Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to `generate link_summary` as this is not provided as part of tutorial. You can skip this step as well. +Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to `generate link_summary` as this is not provided as part of tutorial. You can skip this step as well. ``` vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_4/hw/runOnfpga_hw.xclbin.link_summary @@ -288,7 +287,7 @@ vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_4/hw/runOnfpga_hw.xclbin The HLS Synthesis Report shows the number of LUTs, REG, and BRAM utilized for the Bloom4x kernel implementation. -![missing image](./images/kernel_4_util.PNG) +![missing image](./images/kernel_4_util.PNG) ### Review Profile Reports and Timeline Trace @@ -302,32 +301,32 @@ The Profile Summary and Timeline Trace reports are useful tools to analyze the p ### Review Profile Summary Report -* *Kernels & Compute Unit: Kernel Execution* indicates that the Total Time by kernel enqueue is about 292 ms. +- *Kernels & Compute Unit: Kernel Execution* indicates that the total time by kernel enqueue is about 292 ms. - ![missing image](./images/kernel_4_profile_1.PNG) + ![missing image](./images/kernel_4_profile_1.PNG) - - 4 words in parallel are computed. The accelerator is architected at 300 MHz. In total, you are computing 350,000,000 words (3,500 words/document * 100,000 documents). - - Number of words/(Clock Freq * Parallelization factor in kernel) = 350M/(300M*4) = 291.6 ms. The actual FPGA compute time is almost same as your theoretical calculations. + - Four words in parallel are computed. The accelerator is architected at 300 MHz. In total, you are computing 350,000,000 words (3,500 words/document * 100,000 documents). 
+ - Number of words/(Clock Freq * Parallelization factor in kernel) = 350M/(300M*4) = 291.6 ms. The actual FPGA compute time is almost the same as your theoretical calculations.

-* *Host Data Transfer: Host Transfer* shows that the Host Write Transfer to DDR is 145 ms and the Host Read Transfer to DDR is 36 ms.
-  ![missing image](./images/kernel_4_profile_2.PNG)
+- *Host Data Transfer: Host Transfer* shows that the Host Write Transfer to DDR is 145 ms, and the Host Read Transfer to DDR is 36 ms.
+  ![missing image](./images/kernel_4_profile_2.PNG)

- - Host Write transfer using a theoretical PCIe bandwidth of 9GB should be 1399 MB/9GBps = 154 ms
- - Host Read transfer using a theoretical PCIe bandwidth of 12GB should be 350 MB/12 GBps = 30 ms
- - Reported number indicates that the PCIe transfers are occurring at the maximum bandwidth
+ - The Host Write transfer, at a theoretical PCIe bandwidth of 9 GB/s, should take 1399 MB/9 GB/s = 154 ms.
+ - The Host Read transfer, at a theoretical PCIe bandwidth of 12 GB/s, should take 350 MB/12 GB/s = 30 ms.
+ - The reported numbers indicate that the PCIe transfers are occurring at the maximum bandwidth.

-* *Kernels & Compute Unit: Compute Unit Stalls* confirms that there are almost no "External Memory Stalls"
-  ![missing image](./images/Kernel_4_Guidance_Stall.PNG)
+- *Kernels & Compute Unit: Compute Unit Stalls* confirms that there are almost no "External Memory Stalls".
+  ![missing image](./images/Kernel_4_Guidance_Stall.PNG)

### Review the Timeline Trace

The Timeline Trace shows the data transfer from the host to the FPGA and back to the host as they appear. The Timeline Trace can be visualized so that the transfer from the host to the FPGA and the FPGA compute and transfer from the FPGA to host occur sequentially.

-  ![missing image](./images/Kernel_4_time_1.PNG)
+  ![missing image](./images/Kernel_4_time_1.PNG)

- - There is a sequential execution of operations starting from the data transferred from the host to the FPGA, followed by compute in the FPGA and transferring back the results from the FPGA to the host.
- - At any given time, either the host or FPGA has access to the DDR. In other words, there is no memory contention between the host and kernel accessing the same DDR.
- - Using a single buffer will create a kernel itself with lowest latency and most optimized performance.
+- The operations execute sequentially: the data is transferred from the host to the FPGA, the FPGA computes, and the results are then transferred from the FPGA back to the host.
+- At any given time, either the host or the FPGA has access to the DDR. In other words, there is no memory contention between the host and the kernel accessing the same DDR.
+- With a single buffer, the kernel itself runs with the lowest latency and the most optimized performance.

### Throughput Achieved

@@ -337,18 +336,18 @@ Based on the results, the throughput of the application is 1399 MB/838 ms = appr

Because there is no external memory access contention for accessing memory, this is the best possible performance for the Kernel computing four words in parallel.

-* The kernel is reading 512-bits of data, but only 128-bits are used to compute 4 words in parallel. The kernel has the potential to compute more words because you have resources available on the FPGA. You can increase the number of words to be processed in parallel and can experiment with 8 words and 16 words in parallel.
-* Also note, even though the kernel is operating at its best performance, the kernel has to wait until the complete transfer is done by the host. Usually, it is recommended that you send a larger buffer from the host to DDR, but a very large buffer will add a delay before the kernel can start, impacting overall performance.
+- The kernel is reading 512 bits of data, but only 128 bits are used to compute four words in parallel. The kernel has the potential to compute more words because you have resources available on the FPGA. You can increase the number of words to be processed in parallel and can experiment with eight words and 16 words in parallel.
+- Also note that even though the kernel is operating at its best performance, it has to wait until the complete transfer is done by the host. It is usually recommended to send a larger buffer from the host to the DDR, but a very large buffer adds a delay before the kernel can start, impacting overall performance.

You will continue this lab and create a kernel with 8 words to be computed in parallel. In the next lab, [Analyze Data Movement Between Host and Kernel](./5_data-movement.md), you will experiment with splitting the buffer into multiple chunks, and observe how this affects the performance later in this tutorial.

-## Bloom8x: Kernel Implementation Using 8 Words in Parallel
+## Bloom8x: Kernel Implementation Using Eight Words in Parallel

-For config Bloom4x, you read 512-bit input values from the DDR and computed 4 words in parallel which uses only 128-bit input values. This steps enables you to run 8 words in parallel.
+For config Bloom4x, you read 512-bit input values from the DDR and computed four words in parallel, which used only 128 bits of each input value. This step enables you to run eight words in parallel.

-You can achieve this by using `PF=8` on the command line. Use the following steps to compute 8 words in parallel.
+You can achieve this by using `PF=8` on the command line. Use the following steps to compute eight words in parallel.

-### Run Hardware on the FPGA
+### Run Hardware on the FPGA

Use the following command to run the hardware on the FPGA.

@@ -367,22 +366,22 @@ Single_Buffer: Running with a single buffer of 1398.903 MBytes for FPGA processi
Verification: PASS
```

-- The Total FPGA time is 315 ms. This includes the Host to DDR transfer, Total Compute on FPGA and DDR to Host transfer.
+- The Total FPGA time is 315 ms. This includes the Host to DDR transfer, the Total Compute on FPGA, and the DDR to Host transfer.
- As expected, when computing 8 words in parallel, the total FPGA Time that includes the data transfer between the host and device, has reduced from 447 ms to 315 ms.

### Visualize the Resources Utilized

-Use the Vitis analyzer to visualize the HLS Synthesis Report. You will need to build the kernel without `SOLUTION=1` to generate link_summary; this is not provided as part of tutorial. You can skip this step.
+Use the Vitis analyzer to visualize the HLS Synthesis Report. You would need to build the kernel without `SOLUTION=1` to generate the link_summary; because this is not provided as part of the tutorial, you can skip this step.

```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/runOnfpga_hw.xclbin.link_summary
```

-From the HLS Synthesis Report, you can see that the number of resources increased for LUTs, REG, and BRAM compared to the Bloom4x kernel implementation.
+From the HLS Synthesis Report, you can see that the resource usage for LUTs, REG, and block RAM has increased compared to the Bloom4x kernel implementation.

-![missing image](./images/kernel_8_util.PNG)
+![missing image](./images/kernel_8_util.PNG)

-### Review Profile Summary Report and Timeline Trace
+### Review the Profile Summary Report and Timeline Trace

Use the Vitis analyzer to visualize the run_summary report.

@@ -390,13 +389,13 @@ Use the Vitis analyzer to visualize the run_summary report.
```
vitis_analyzer $LAB_WORK_DIR/build/single_buffer/kernel_8/hw/runOnfpga_hw.xclbin.run_summary
```

-Review the profile reports and compare the metrics with configuration Bloom4x
+Review the profile reports, and compare the metrics with the Bloom4x configuration.

-1. *Kernels & Compute Unit:Kernel Execution* reports 146 ms compared to the 292 ms. This is exactly half the time as now 8 words are computed in parallel instead of 4 words.
+1. *Kernels & Compute Unit: Kernel Execution* reports 146 ms compared to 292 ms. This is exactly half the time, because eight words are now computed in parallel instead of four.
2. *Host Data Transfer: Host Transfer* section reports the same delays.
-3. The overall gain in the application overall gain is only because that kernel now is processing 8 words in parallel compared to 4 words in parallel.
+3. The overall gain in the application comes only from the kernel now processing eight words in parallel instead of four.

-   ![missing image](./images/Kernel_8_time_1.PNG)
+   ![missing image](./images/Kernel_8_time_1.PNG)

### Throughput Achieved

@@ -404,7 +403,7 @@
## Bloom16x : Kernel Implementation Using 16 Words in Parallel

-In the previous step, you read 512-bit input values from the DDR and computed 8 words in parallel. You can compute 16 words in parallel by setting `PF=16` on the command line. Use the following steps to compute 16 words in parallel.
+In the previous step, you read 512-bit input values from the DDR and computed eight words in parallel. You can compute 16 words in parallel by setting `PF=16` on the command line. Use the following steps to compute 16 words in parallel.

### Run Hardware on the FPGA

@@ -424,11 +423,11 @@ Processing 1398.903 MBytes of data
Verification: PASS
```

-As previous steps, you can also review profile and timeline trace reports. This is left out as an homework exercise.
+As in the previous steps, you can also review the profile and timeline trace reports. This is left as a homework exercise.

### Opportunities for Performance Improvements

-In this lab, you focused on building on an optimal kernel with 4, 8 and 16 words in parallel using a single buffer and reviewed the reports. As you observed from the Timeline Report that the kernel cannot start until the whole buffer is transferred to the DDR, which is about 145 ms when the single buffer is used. This is a substantial time delay considering the `Kernel execution time` is even smaller than the total host transfer time.
+In this lab, you focused on building an optimal kernel with four, eight, and 16 words in parallel using a single buffer and reviewed the reports. As you observed from the Timeline Report, the kernel cannot start until the whole buffer is transferred to the DDR, which takes about 145 ms when a single buffer is used. This is a substantial delay, considering that the `Kernel execution time` is even smaller than the total host transfer time.
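All of the timing estimates quoted in this lab come from two simple formulas: kernel compute time = number of words / (clock frequency × parallelization factor), and transfer time = bytes moved / PCIe bandwidth. If you want to sanity-check them yourself, the small standalone program below reproduces the arithmetic for PF = 4, 8, and 16. It is not part of the tutorial sources; the constants are simply the figures quoted above, and the PCIe bandwidths are the assumed effective values.

```cpp
// estimate.cpp -- back-of-the-envelope estimates used in this lab.
// Compile with: g++ -std=c++17 estimate.cpp -o estimate
#include <cstdio>

int main() {
    const double words        = 350e6;   // ~350M words processed per run
    const double clock_hz     = 300e6;   // kernel clock, 300 MHz
    const double write_mb     = 1399.0;  // host -> DDR payload (MB)
    const double read_mb      = 350.0;   // DDR -> host payload (MB)
    const double pcie_wr_gbps = 9.0;     // assumed effective PCIe write bandwidth (GB/s)
    const double pcie_rd_gbps = 12.0;    // assumed effective PCIe read bandwidth (GB/s)

    // Kernel compute time = words / (clock * parallelization factor).
    for (int pf : {4, 8, 16}) {
        double compute_ms = words / (clock_hz * pf) * 1e3;
        std::printf("PF=%2d : kernel compute ~ %.1f ms\n", pf, compute_ms);
    }

    // MB divided by GB/s gives milliseconds directly (1 GB = 1000 MB).
    std::printf("Host write ~ %.0f ms, host read ~ %.0f ms\n",
                write_mb / pcie_wr_gbps, read_mb / pcie_rd_gbps);
    return 0;
}
```

For PF = 4 and PF = 8 this prints roughly 292 ms and 146 ms of kernel compute time, matching the profile figures above, and the transfer estimates land near the 154 ms write and 30 ms read quoted for the host transfers.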
## Next Steps @@ -437,8 +436,6 @@ In the next lab, you [explore sending documents in multiple buffers](./5_data-mo

Return to Start of Tutorial

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

- diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md index d7a8717564..094426ee43 100644 --- a/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md +++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/5_data-movement.md @@ -6,7 +6,6 @@ - # Data Movement Between the Host and Kernel In the previous step, you implemented a sequential execution of the written words from the host, computing hash functions on the FPGA, and reading flags by the host. @@ -16,12 +15,12 @@ The compute does not start until the entire input is read into the FPGA, and sim In this lab, you will work with an: - Overlap of host data transfer and compute on the FPGA with split buffers (two buffers) - - Split the documents and send them to the FPGA in two iterations. + - Split the documents and send them to the FPGA in two iterations. - The kernel can start the compute as soon as the data for the corresponding iteration is transferred to the FPGA. - Overlap of host data transfer and compute with multiple buffers - Explore how the application performance is affected based on splitting the documents and into 2, 4, 8, 16, 32, 64, and 128 chunks. -- Overlap data transfer from host, compute on FPGA and profile score on the CPU - - Enables the host to start profile scores as soon as the flags are received. +- Overlap data transfer from host, compute on FPGA and profile score on the CPU. + - Enables the host to start profile scores as soon as the flags are received. ## Overlap of Host Data Transfer and Compute with Split Buffers @@ -117,9 +116,9 @@ Navigate to `$LAB_WORK_DIR/reference_files`, and with a file editor, open `run_s krnlWait.push_back(krnlDone); q.enqueueMigrateMemObjects({subbuf_inh_flags[0]}, CL_MIGRATE_MEM_OBJECT_HOST, &krnlWait, &flagDone); flagWait.push_back(flagDone); - ``` + ``` -5. During the second iteration, the kernel arguments are set, the commands to write the input buffer with second set of words to the FPGA, execute the kernel, and read the results back to the host, are enqueued. +5. During the second iteration, the kernel arguments are set, the commands to write the input buffer with second set of words to the FPGA, execute the kernel, and read the results back to the host, are enqueued. ```cpp // Set Kernel Arguments, Read, Enqueue Kernel and Write for second iteration @@ -167,7 +166,7 @@ Verification: PASS ## Review Profile Report and Timeline Trace for the Bloom8x Kernel -1. Run the following commands to view the Timeline Trace report with Bloom8x kernel. +1. Run the following commands to view the Timeline Trace report with the Bloom8x kernel. ``` vitis_analyzer $LAB_WORK_DIR/build/split_buffer/kernel_8/hw/runOnfpga_hw.xclbin.run_summary @@ -178,8 +177,8 @@ Verification: PASS ![missing image](./images/double_buffer_timeline_trace.PNG) - The Timeline Trace confirms that you achieved the execution schedule you expected. - * There is an overlap of the read and compute with write operations between the first and second iterations. - * The execution time of the first kernel run and the first data read are effectively "hidden" behind the write data transfer from host. This results in a faster overall run. + - There is an overlap of the read and compute with write operations between the first and second iterations. + - The execution time of the first kernel run and the first data read are effectively "hidden" behind the write data transfer from host. This results in a faster overall run. 3. 
From the Profile Report, *Host Data Transfer: Host Transfer* shows that the "data transfer" from the host CPU consumes more than the "Kernel Compute Time". - The Host to Global Memory WRITE Transfer takes about 178 ms, which is higher compared to using a single buffer. @@ -209,7 +208,7 @@ Executed Software-Only version | 3133.5186 ms Verification: PASS ``` -You can see that if the documents are split into two buffers, the overall application execution time using the Bloom8x kernel and Bloom16x kernel are very close. As expected, using the Bloom16x kernel rather than the Bloom8x kernel has no benefit. +You can see that if the documents are split into two buffers, and the overall application execution time using the Bloom8x kernel and Bloom16x kernel are very close. As expected, using the Bloom16x kernel rather than the Bloom8x kernel has no benefit. While developing your own application, these attributes can be explored to make trade-offs and pick the optimal kernel implementation optimized for resources/performance. @@ -276,7 +275,7 @@ Open `run_generic_buffer.cpp` in `$LAB_WORK_DIR/reference_files` with a file edi ``` 1. The kernel arguments are set, and the kernel is enqueued to load the Bloom filter coefficients. - + ```cpp // Set Kernel arguments and load the Bloom filter coefficients in the kernel cl::Event buffDone, krnlDone; @@ -291,7 +290,7 @@ Open `run_generic_buffer.cpp` in `$LAB_WORK_DIR/reference_files` with a file edi ``` 1. For each iteration, kernel arguments are set, and the commands to write the input buffer to the FPGA, execute the kernel, and read the results back to the host, are enqueued. - + ```cpp // Set Kernel arguments. Read, Enqueue Kernel and Write for each iteration for (int i=0; i Write* of the Host section seems to have no gap. Kernel compute time of each invocation is smaller than the Host transfer. + - *Data Transfer -> Write* of the Host section seems to have no gap. Kernel compute time of each invocation is smaller than the Host transfer. - Each Kernel compute and writing flags to DDR are overlapped with the next Host-> Device transfer. @@ -495,11 +494,11 @@ The following output displays. ![missing image](./images/sw_overlap_stalls.PNG) -3. *Host Data Transfer: Host Transfer* Host to Global Memory WRITE Transfer takes about 207.5 ms and Host to Global Memory READ Transfer takes about 36.4 ms +3. *Host Data Transfer: Host Transfer* Host to Global Memory WRITE Transfer takes about 207.5 ms and Host to Global Memory READ Transfer takes about 36.4 ms. ![missing image](./images/sw_overlap_profile_host.PNG) - * *Kernels & Compute Unit: Compute Unit Utilization* section shows that CU Utilization is about 71%. This is an important measure representing how much time CU was active over the Device execution time. +- *Kernels & Compute Unit: Compute Unit Utilization* section shows that CU Utilization is about 71%. This is an important measure representing how much time CU was active over the device execution time. ![missing image](./images/sw_overlap_profile_CU_util.PNG) @@ -517,8 +516,6 @@ The host and kernel are trying to access the same DDR bank at the same time whic
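To make the execution schedule discussed in this lab concrete, the following is a minimal sketch of the event wiring that produces the split-buffer overlap. It is illustrative only: the function name, variable names, and kernel argument indices are hypothetical rather than taken from the lab's host code, and it assumes the command queue was created with out-of-order execution enabled so that the write of one chunk can proceed while the previous chunk is still being computed or read back.

```cpp
// split_buffer_sketch.cpp -- an illustrative sketch, not part of the tutorial
// sources. Names and argument indices are placeholders.
#define CL_HPP_CL_1_2_DEFAULT_BUILD
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl2.hpp>
#include <vector>

// Enqueue two document chunks so that the kernel run and the flag read-back
// of chunk 0 can overlap with the host write of chunk 1. Assumes 'q' was
// created with out-of-order execution enabled.
void run_two_chunks(cl::CommandQueue &q, cl::Kernel &kernel,
                    std::vector<cl::Buffer> &sub_words,  // per-chunk input sub-buffers
                    std::vector<cl::Buffer> &sub_flags,  // per-chunk output sub-buffers
                    unsigned words_per_chunk) {
    std::vector<cl::Event> wordsDone(2), krnlDone(2), flagsDone(2);

    for (int i = 0; i < 2; ++i) {
        // Per-chunk kernel arguments (indices are placeholders).
        kernel.setArg(0, sub_words[i]);
        kernel.setArg(1, sub_flags[i]);
        kernel.setArg(2, words_per_chunk);

        // 1. Host -> device DDR transfer of this chunk only.
        q.enqueueMigrateMemObjects({sub_words[i]}, 0 /* to device */, nullptr,
                                   &wordsDone[i]);

        // 2. The kernel run waits only on its own chunk's write transfer.
        std::vector<cl::Event> krnlWait{wordsDone[i]};
        q.enqueueTask(kernel, &krnlWait, &krnlDone[i]);

        // 3. The flag read-back waits only on that kernel run.
        std::vector<cl::Event> flagWait{krnlDone[i]};
        q.enqueueMigrateMemObjects({sub_flags[i]}, CL_MIGRATE_MEM_OBJECT_HOST,
                                   &flagWait, &flagsDone[i]);
    }
    q.finish();  // wait until both chunks have been processed and read back
}
```

With an in-order queue the same calls still execute correctly, but they serialize, which is essentially the single-buffer behavior of the previous lab; the overlap comes from each command waiting only on the events of its own chunk.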

Return to Start of Tutorial

-

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

-

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md
index a058a0aebd..005ec302ac 100644
--- a/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md
+++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/6_using-multiple-ddr.md
@@ -6,34 +6,32 @@
-

# Using Multiple DDR Banks

-In the previous step, you noticed the overlap of the host data transfer documents sent to FPGA were also split into multiple buffers. Flags from the FPGA were also sent to the host immediately; this overlaps the compute profile score on the CPU with the "Compute FPGA" which further improves the application execution time.
+In the previous step, you overlapped the host data transfer with the FPGA compute by splitting the documents sent to the FPGA into multiple buffers. The flags were also sent back to the host immediately, which overlaps the profile score computation on the CPU with the "Compute FPGA" time and further improves the application execution time.

-You also observed memory contention because the host and kernel both accessed the same bank at the same time.
-In this section, you configure multiple DDR banks to improve the kernel performance.
+You also observed memory contention because the host and kernel both accessed the same bank at the same time. In this section, you configure multiple DDR banks to improve the kernel performance.

-Alveo cards have multiple DDR banks, and you can use multiple banks in ping-pong fashion to minimize the contention.
+AMD Alveo™ cards have multiple DDR banks, and you can use multiple banks in a ping-pong fashion to minimize the contention.

-* The host is writing words to DDR bank 1 and DDR bank 2 alternatively.
-* When the host is writing words to DDR bank1, the kernel is reading flags from DDR bank2.
-* When host is writing documents to DDR bank2, the kernel is reading flags from DDR bank1.
+* The host writes words to DDR bank 1 and DDR bank 2 alternately.
+* When the host is writing words to DDR bank 1, the kernel is reading flags from DDR bank 2.
+* When the host is writing documents to DDR bank 2, the kernel is reading flags from DDR bank 1.

-The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port is connected to both DDR banks. You must establish the connectivity of kernel arguments to DDR banks in the `v++ --link` command as described in [Mapping Kernel Ports to Memory](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Mapping-Kernel-Ports-to-Memory). In this case the `$LAB_WORK_DIR/makefile/connectivity.cfg` configuration file specifies the connectivity.
+The kernel will read from DDR bank 1 and bank 2 alternately, and its `maxi` port is connected to both DDR banks. You must establish the connectivity of kernel arguments to DDR banks in the `v++ --link` command as described in [Mapping Kernel Ports to Memory](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Mapping-Kernel-Ports-to-Memory). In this case, the `$LAB_WORK_DIR/makefile/connectivity.cfg` configuration file specifies the connectivity.

```
[connectivity]
sp=runOnfpga_1.input_words:DDR[1:2]
```

-  - The `-sp` option instructs the `v++` linker that `input_words` is connected to both DDR banks 1 and 2. You will need to rebuild the kernel because connectivity is now changed.
+* The `sp` option instructs the `v++` linker that `input_words` is connected to both DDR banks 1 and 2. You will need to rebuild the kernel because the connectivity has changed.

## Code Modifications

1. Navigate to `$LAB_WORK_DIR/reference_files`, and with a file editor, open `run_sw_overlap_multiDDR.cpp`.

-3. From the host code, you will need to send the words to both DDR banks alternatively. The DDR bank assignment in the host code is supported by a Xilinx vendor extension to the OpenCL API. Two Xilinx extension pointer objects (`cl_mem_ext_ptr_t`) are created, `buffer_words_ext[0]` and `buffer_words_ext[1]`. The`flags` will determine which DDR bank the buffer will be send to, so that kernel can access it.
+2. From the host code, you will need to send the words to both DDR banks alternately. The DDR bank assignment in the host code is supported by an AMD vendor extension to the OpenCL™ API. Two extension pointer objects (`cl_mem_ext_ptr_t`) are created, `buffer_words_ext[0]` and `buffer_words_ext[1]`. Their `flags` members determine which DDR bank each buffer is sent to, so that the kernel can access it.

   ```cpp
   cl_mem_ext_ptr_t buffer_words_ext[2];
@@ -46,9 +44,9 @@ The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port
   buffer_words_ext[1].obj = input_doc_words;
   ```

-4. Next two buffers, `buffer_doc_words[0]` and `buffer_doc_words[1]` are created in `DDR[1]` and `DDR[2]` as follows.
+3. The next two buffers, `buffer_doc_words[0]` and `buffer_doc_words[1]`, are created in `DDR[1]` and `DDR[2]` as follows.

-   ```cpp
+   ```cpp
   buffer_doc_words[0] = cl::Buffer(context, CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, total_size*sizeof(uint), &buffer_words_ext[0]);
   buffer_doc_words[1] = cl::Buffer(context, CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, total_size*sizeof(uint), &buffer_words_ext[1]);
   buffer_inh_flags = cl::Buffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, total_size*sizeof(char),output_inh_flags);
@@ -80,11 +78,9 @@ The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port
   }
   ```

-5. The kernel argument, input word is set to array of sub-buffers created from `buffer_doc_words[0]` and `buffer_doc_words[1]` alternatively; hence, data is sent to DDR bank 1 and 2 alternatively in each kernel execution.
-
+4. The kernel argument for the input words is set to the array of sub-buffers created from `buffer_doc_words[0]` and `buffer_doc_words[1]` alternately; hence, data is sent to DDR banks 1 and 2 alternately in each kernel execution.
-
-   ```cpp
+   ```cpp
   for (int i=0; i Write transactions and observe that the host is writing to bank1, bank2, bank1, bank2, alternatively. The kernel is always writing to same DDR bank1 as flags size is relatively small.
-  - In the previous lab, without usage of multiple banks the kernel cannot read the next set of words from the DDR until the host has read flags written by the kernel in the previous enqueue. In this lab, you can observe that both of these accesses can be carried out in parallel because these accesses are for different DDR banks.
+  * In the previous lab, without usage of multiple banks, the kernel cannot read the next set of words from the DDR until the host has read flags written by the kernel in the previous enqueue.
In this lab, you can observe that both of these accesses can be carried out in parallel because these accesses are for different DDR banks. - This results in an improved FPGA compute that includes the transfer from the host, device compute and sending flag data back to the host. + This results in an improved FPGA compute that includes the transfer from the host, device compute, and sending flag data back to the host. 3. Review the Profile report and note the following observations: - * *Data Transfer: Host to Global Memory* section indicates: - - Host to Global Memory WRITE Transfer takes about 145.7 ms which is less than 207 ms. - - Host to Global Memory READ Transfer takes about 37.9 ms. + * *Data Transfer: Host to Global Memory* section indicates: + * Host to Global Memory WRITE Transfer takes about 145.7 ms which is less than 207 ms. + * Host to Global Memory READ Transfer takes about 37.9 ms. ![missing image](./images/multiDDR_profile_host.PNG) - * *Kernels & Compute Unit: Compute Unit Utilization* section shows that the CU Utilization has also increased to 89.5% from 71% in previous lab. + * The *Kernels & Compute Unit: Compute Unit Utilization* section shows that the CU Utilization has also increased to 89.5% from 71% in previous lab. ![missing image](./images/multiDDR_profile_CU_util.PNG) @@ -159,7 +155,7 @@ The kernel will read from DDR bank1 and bank2 alternatively and its `maxi` port ![missing image](./images/multiDDR_stalls.PNG) -Compared to the previous step using only one DDR, there is no overall application gain. The FPGA compute performance improves, but the bottleneck is processing the "Compute Score", which is limited by CPU Performance. If the CPU can process faster, you can get the better performance. +Compared to the previous step using only one DDR, there is no overall application gain. The FPGA compute performance improves, but the bottleneck is processing the "Compute Score", which is limited by the CPU Performance. If the CPU can process faster, you can get the better performance. Based on the results, the throughput of the application is 1399MB/426 ms = approximately 3.27 GBs. You now have approximately 7.2x (=3058 ms/426 ms) the performance results compared to the software-only version. @@ -169,13 +165,9 @@ Congratulations! You have successfully completed the tutorial. In this tutorial, you learned that optimizing how the host application interacts with the accelerator makes a significant difference. A native initial implementation delivered a 4x performance improvement over the reference software implementation. By leveraging data-parallelism, the overlapping data transfers, compute, and the overlapping CPU processing with FPGA processing, using multiple DDR banks, the application performance was increased by another 1.8x, achieving a total of 7.2x acceleration. - ---------------------------------------
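As a reference for readers who rebuild the `xclbin` instead of using the pregenerated one, the connectivity file is consumed at the `v++` link step. The tutorial's makefile already drives this, so the command below is only a sketch: the `.xo` file name is assumed from the kernel name `runOnfpga` used in this tutorial, and the platform is the one listed in the README.

```bash
# Sketch only -- the provided makefile encodes the real build command.
# The --config file maps input_words to both DDR[1] and DDR[2]; the same mapping
# could also be given directly as: --connectivity.sp runOnfpga_1.input_words:DDR[1:2]
v++ --link --target hw \
    --platform xilinx_u200_gen3x16_xdma_2_202110_1 \
    --config $LAB_WORK_DIR/makefile/connectivity.cfg \
    -o runOnfpga_hw.xclbin \
    runOnfpga.xo
```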

Return to Start of Tutorial

- -

Copyright © 2020–2023 Advanced Micro Devices, Inc

Terms and Conditions

-

diff --git a/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md b/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md
index d6baf1275c..4fde4fbcd5 100644
--- a/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md
+++ b/Hardware_Acceleration/Design_Tutorials/02-bloom/README.md
@@ -6,18 +6,17 @@
-

# Optimizing Accelerated FPGA Applications: Bloom Filter Example

***Version: Vitis 2023.1***

## Introduction

-The methodology for accelerating applications on an FPGA is comprised of multiple phases:
+The methodology for accelerating applications on a field programmable gate array (FPGA) comprises multiple phases:

-  - **Architecting the application**: Make key decisions about the architecture of the application and decide some important factors, such as what software functions should be mapped to FPGA kernels, how much parallelism is needed, and how it should be delivered.
-  - **Developing the accelerator to meet your desired performance goals**: Implement the kernel by modifying the source code and applying pragmas to create a kernel architecture that can achieve the desired performance goals.
-  - **Optimize the host code**: Review the application's access patterns, data movements, CPU and FPGA idle time, and update the host code to meet your performance goals.
+- **Architecting the Application**: Make key decisions about the architecture of the application and decide some important factors, such as what software functions should be mapped to the FPGA kernels, how much parallelism is needed, and how it should be delivered.
+- **Developing the Accelerator to Meet Your Desired Performance Goals**: Implement the kernel by modifying the source code and applying pragmas to create a kernel architecture that can achieve the desired performance goals.
+- **Optimizing the Host Code**: Review the application's access patterns, data movements, CPU and FPGA idle time, and update the host code to meet your performance goals.

You begin this tutorial with a baseline application, and you profile the application to examine the potential for hardware acceleration. The tutorial application involves searching through an incoming stream of documents to find the documents that closely match a user’s interest based on a search profile.

@@ -29,24 +28,24 @@ In general, a Bloom filter application has use cases in data analytics, such as

The labs in this tutorial use:

-* BASH Linux shell commands.
-* 2023.1 Vitis core development kit release and the *xilinx_u200_gen3x16_xdma_2_202110_1* platform. If necessary, it can be easily ported to other versions and platforms.
+- BASH Linux shell commands.
+- 2023.1 Vitis core development kit release and the *xilinx_u200_gen3x16_xdma_2_202110_1* platform. If necessary, it can be easily ported to other versions and platforms.

-This tutorial guides you to run the designed accelerator on the FPGA; therefore, the expectation is that you have an Xilinx® Alveo™ U200 Data Center accelerator card set up to run this tutorial. Because it can take several (six or seven) hours to generate the multiple `xclbin` files needed to run the accelerator, pregenerated `xclbin` files are provided for the U200 card. To use these pregenerated files, when building the hardware kernel or running the accelerator on hardware, you need to add the `SOLUTION=1` argument.
+This tutorial guides you to run the designed accelerator on the FPGA; therefore, the expectation is that you have an AMD Alveo™ U200 Data Center accelerator card set up to run this tutorial.
Because it can take several (six or seven) hours to generate the multiple `xclbin` files needed to run the accelerator, pregenerated `xclbin` files are provided for the U200 card. To use these pregenerated files, when building the hardware kernel or running the accelerator on hardware, you need to add the `SOLUTION=1` argument. >**IMPORTANT:** > -> * Before running any of the examples, make sure you have installed the Vitis core development kit as described in [Installation](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Installation-Requirements) in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416). ->* If you run applications on Alveo cards, ensure the card and software drivers have been correctly installed by following the instructions on the [Alveo Portfolio page](https://www.xilinx.com/products/boards-and-kits/alveo.html). +> - Before running any of the examples, make sure you have installed the Vitis core development kit as described in [Installation](https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Installation-Requirements) in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416). +>- If you run applications on Alveo cards, ensure the card and software drivers have been correctly installed by following the instructions on the [Alveo Portfolio page](https://www.xilinx.com/products/boards-and-kits/alveo.html). ### Accessing the Tutorial Reference Files 1. To access the tutorial content, enter the following in a terminal: `git clone http://github.com/Xilinx/Vitis-Tutorials`. 2. Navigate to the `Hardware_Acceleration/Design_Tutorials/02-bloom` directory. - * `cpu_src` contains all the original source code before modification. - * `images` contains the figures in this tutorial. - * `Makefile` in the `makefile` directory explains the commands used in this lab. Use the `PLATFORM` variable if targeting different platforms. - * `reference_file` contains the modified kernel and host-related files for achieving higher performance. + - `cpu_src` contains all the original source code before modification. + - `images` contains the figures in this tutorial. + - `Makefile` in the `makefile` directory explains the commands used in this lab. Use the `PLATFORM` variable if targeting different platforms. + - `reference_file` contains the modified kernel and host-related files for achieving higher performance. 3. Copy and extract large files in as follows: ``` @@ -54,16 +53,16 @@ This tutorial guides you to run the designed accelerator on the FPGA; therefore, tar -xvzf xclbin_save.tar.gz ``` - **TIP:** The `xclbin_save` contains the saved `xclbin` files that can be used directly for running on hardware by setting `SOLUTION=1` for the `make run` commands. - + >**TIP:** The `xclbin_save` contains the saved `xclbin` files that can be used directly for running on hardware by setting `SOLUTION=1` for the `make run` commands. 
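For example, a run that uses the pregenerated binaries might look like the following. The `PF`, `SOLUTION`, and `PLATFORM` variables are the ones described in this tutorial, and `LAB_WORK_DIR` is assumed to point at the 02-bloom tutorial directory as in the labs; the `run` target and `TARGET=hw` are assumptions, so check the provided `Makefile` for the exact targets and variables it defines.

```bash
# Hypothetical invocation -- verify the targets and variables against the
# Makefile in the makefile directory before running.
cd $LAB_WORK_DIR/makefile
make run TARGET=hw PF=8 SOLUTION=1 PLATFORM=xilinx_u200_gen3x16_xdma_2_202110_1
```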
+

### Tutorial Outline

-* [Overview of the Original Application](1_overview.md)
-* [Experience Acceleration Performance](2_experience-acceleration.md)
-* [Architecting the Application](3_architect-the-application.md)
-* [Implementing the Kernel](4_implement-kernel.md)
-* [Analyze Data Movement Between Host and Kernel](5_data-movement.md)
-* [Using Multiple DDR Banks](6_using-multiple-ddr)
+- [Overview of the Original Application](1_overview.md)
+- [Experience Acceleration Performance](2_experience-acceleration.md)
+- [Architecting the Application](3_architect-the-application.md)
+- [Implementing the Kernel](4_implement-kernel.md)
+- [Analyze Data Movement Between Host and Kernel](5_data-movement.md)
+- [Using Multiple DDR Banks](6_using-multiple-ddr.md)