Software Only
In this section we will learn how to:
- Translate a software-only test program into a test bench and accelerator.
- Refactor the test program so that all accelerator memory accesses are performed using address information provided in a special configuration object.
- Refactor the test program so that the test bench and accelerator communicate through full cacheline accesses (64 bytes).
The first thing to do is develop a complete implementation that includes both a test bench and accelerator code. This can be completely in software to begin with. Through a series of steps, we will move the accelerator portion into hardware. Here is a very simple program whose accelerator functionality is merely copying data. Most of the program allocates memory, initializes it, performs the copy operation, and then checks that the copy was correct. From the beginning we will wrap the test bench in a GoogleTest TEST macro.
#include "gtest/gtest.h"
#include <iostream>
#include <string.h>
#include <stdlib.h>
TEST(AccelTest, SimpleTest) {
unsigned int n_ulls = 8*1024;
unsigned long long *inp_ptr = (unsigned long long *) malloc(8*n_ulls);
unsigned long long *out_ptr = (unsigned long long *) malloc(8*n_ulls);
for( unsigned int ip=0; ip<n_ulls; ++ip) {
inp_ptr[ip] = ip;
}
for( unsigned int ip=0; ip<n_ulls; ++ip) {
out_ptr[ip] = 0xdeadbeefdeadbeefULL;
}
memcpy( out_ptr, inp_ptr, 8*n_ulls);
// check
unsigned int correct = 0;
for (unsigned int ip=0; ip<n_ulls; ++ip) {
unsigned long long cand = out_ptr[ip];
unsigned long long ref = ip;
if ( cand == 0xdeadbeefdeadbeefULL) {
std::cout << "Uninitialized result at " << ip << std::endl;
}
if ( cand == ref) {
++correct;
} else {
std::cout << ip << " != " << ref << " " << cand << std::endl;
}
}
std::cout << "Results checked: " << correct
<< " of " << n_ulls << " correct." << std::endl;
EXPECT_EQ( correct, n_ulls);
free( inp_ptr);
free( out_ptr);
}
int main (int argc, char *argv[]) {
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}
This can be compiled with the following command line on our systems, where $GTEST_DIR specifies the GoogleTest directory:
g++ -O3 -Wall -I${GTEST_DIR}/include memcpy.cpp ${GTEST_DIR}/make/gtest_main.a -pthread
Running the executable then produces:
$ ./a.out
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from AccelTest
[ RUN ] AccelTest.SimpleTest
Results checked: 65536 of 65536 correct.
[ OK ] AccelTest.SimpleTest (1 ms)
[----------] 1 test from AccelTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2 ms total)
[ PASSED ] 1 test.
We now have a simple test program that we can continue to use throughout hardware/software partitioning and, later, the hardware design process. Our next step is to split this program into the software and the accelerator. In this simple example, the accelerator will be the memcpy function call, and the software will be everything else. Since the FPGA hardware can only access certain portions of the Xeon memory, we need to allocate this memory in a special way and pass the accelerator the information it needs to access it. Later, when using the SDK, we will allocate this shared memory through calls to the AALSDK library. To ease development, we will now transform our code to allocate memory using a software-only class with the same interface that we will use later.
This interface (really a virtual class), FpgaAppSwAlloc, is defined in $HLD_ROOT/common/fpga_app_sw.h. Two methods are already defined: void *alloc(unsigned long long) and void free(). We use the alloc method to allocate a workspace shared between the Xeon and the FPGA. Its only parameter is the requested size of the workspace in bytes. The allocation starts on a cacheline boundary (64 bytes), since this is also what the Xeon-FPGA SDK does. We then use the class hld_alloc (defined in $HLD_ROOT/common/hld_alloc.h) to allocate our program objects from this workspace. This allocator returns pointers to areas within the WORKSPACE provided to its constructor. An internal pointer inside altor keeps track of the storage previously allocated. Here is some code replacing the two malloc operations with this new memory allocation scheme:
#include "gtest/gtest.h"
#include "hld_alloc.h"
#include "AcclApp.h"
TEST(AccelTest, SimpleTest) {
AcclApp theApp;
unsigned int n_cls = 64*1024;
unsigned long long sz = 2ULL*n_cls*64;
if ( theApp.alloc( sz)) {
unsigned char *WORKSPACE = theApp.m_JointVirt;
size_t WORKSPACE_SIZE = theApp.m_JointSize;
Config config;
hld_alloc altor((char *)WORKSPACE, WORKSPACE_SIZE);
CacheLine *inp_ptr = altor.allocate<CacheLine>( n_cls);
CacheLine *out_ptr = altor.allocate<CacheLine>( n_cls);
...
theApp.free();
} else {
EXPECT_TRUE( 0);
}
}
...
We are allocating objects of type CacheLine. Here is an initial definition of the class:
class CacheLine {
public:
unsigned long long words[8];
};
Eight 64-bit words correspond to a 512-bit cacheline, the unit of memory transfer (per 400MHz clock cycle) between the Xeon and the FPGA. Batching memory transfers into cachelines improves performance, so we will also perform that transformation now.
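To make the workspace allocation scheme used in the test bench above more concrete, here is a minimal sketch of a bump allocator that hands out consecutive regions of a fixed workspace. It is not the actual hld_alloc implementation: the class name BumpAlloc, its members, and the choice to round every request up to whole cachelines are assumptions made for illustration. The sketch assumes C++11 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same layout as the CacheLine class above: 8 x 64-bit words = 64 bytes.
struct CacheLine {
  unsigned long long words[8];
};
static_assert(sizeof(CacheLine) == 64, "a cacheline is 512 bits");

// Hypothetical stand-in for hld_alloc (NOT the real implementation):
// hands out consecutive, 64-byte-aligned regions of a fixed workspace.
class BumpAlloc {
  char *base_;        // start of the shared workspace
  std::size_t size_;  // total workspace size in bytes
  std::size_t used_;  // bytes handed out so far
public:
  BumpAlloc(char *base, std::size_t size)
      : base_(base), size_(size), used_(0) {
    // The workspace itself is assumed to start on a cacheline boundary.
    assert(reinterpret_cast<std::uintptr_t>(base) % 64 == 0);
  }

  // Returns space for n objects of type T, or a null pointer if the
  // workspace is exhausted. Nothing is ever returned to the workspace.
  template <typename T>
  T *allocate(std::size_t n) {
    std::size_t bytes = n * sizeof(T);
    bytes = (bytes + 63) / 64 * 64;  // assumption: round up to whole cachelines
    if (bytes > size_ - used_) return nullptr;
    T *p = reinterpret_cast<T *>(base_ + used_);
    used_ += bytes;
    return p;
  }

  std::size_t used() const { return used_; }
};
```

With a workspace of four cachelines, allocate<CacheLine>(2) followed by allocate<CacheLine>(1) would return pointers two cachelines apart, which is the behavior the test bench relies on when it places the input and output buffers back to back.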
Also, in preparation for moving the memcpy function to the accelerator, we must give the accelerator the information it needs to access the shared workspace, locate the input and output buffers, and determine the number of cachelines to copy. We do this through the Config structure, which is passed from the test bench to the accelerator. For now, this structure needs the following information.
#include "CacheLine.h"
typedef unsigned long long AddrType;
struct Config {
private:
AddrType aInp : 64;
AddrType aOut : 64;
unsigned int nCLs : 32;
public:
AddrType get_aInp() const { return aInp; }
AddrType get_aOut() const { return aOut; }
unsigned int get_nCLs() const { return nCLs; }
void set_aInp( AddrType val) { aInp = val; }
void set_aOut( AddrType val) { aOut = val; }
void set_nCLs( unsigned int val) { nCLs = val; }
CacheLine *getInpPtr() const {
return reinterpret_cast<CacheLine*>( get_aInp());
}
CacheLine *getOutPtr() const {
return reinterpret_cast<CacheLine*>( get_aOut());
}
Config() {
aInp = aOut = nCLs = 0;
}
};
Here we declare the data fields using basic data types (char, short, int, long long, and their unsigned versions), possibly narrowed with bitfield widths. We've made these fields private and included getter and setter methods, although this is not necessary. We've also included an initializing constructor (also not necessary) and helper methods that simplify computing pointers to the input and output data buffers.
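As a small illustration of how the address fields are meant to be used, the sketch below stores a buffer pointer in a 64-bit field on the test bench side and recovers a usable pointer on the accelerator side. MiniConfig, describe, and input_of are hypothetical names that only mirror the Config structure above, and the alignment check is an assumption based on the buffers being allocated on cacheline boundaries. C++11 or later is assumed.

```cpp
#include <cassert>
#include <cstdint>

struct CacheLine {
  unsigned long long words[8];
};

typedef unsigned long long AddrType;

// The address fields are 64 bits wide, so a pointer must fit in AddrType.
static_assert(sizeof(void *) <= sizeof(AddrType),
              "pointer must fit in a 64-bit address field");

// Hypothetical cut-down version of Config with a single address field.
struct MiniConfig {
  AddrType aInp : 64;
  unsigned int nCLs : 32;
};

// Test bench side: record where the buffer lives and how long it is.
inline MiniConfig describe(CacheLine *buf, unsigned int n_cls) {
  MiniConfig c;
  c.aInp = reinterpret_cast<std::uintptr_t>(buf);
  c.nCLs = n_cls;
  return c;
}

// Accelerator side: turn the stored address back into a pointer.
inline CacheLine *input_of(const MiniConfig &c) {
  // Assumption: buffers handed to the accelerator are cacheline aligned.
  assert(c.aInp % sizeof(CacheLine) == 0);
  return reinterpret_cast<CacheLine *>(static_cast<std::uintptr_t>(c.aInp));
}
```

Because only plain integers cross the boundary, the same structure can later be handed to hardware, which has no notion of a C++ pointer type.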
We modify our original program to use this Config structure. It will look like this.
#include "gtest/gtest.h"
#include "hld_alloc.h"
#include "AcclApp.h"
#include "Config.h"
#include <iostream>
TEST(AccelTest, SimpleTest) {
AcclApp theApp;
unsigned int n_cls = 64*1024;
unsigned long long sz = 2ULL*n_cls*64;
if ( theApp.alloc( sz)) {
unsigned char *WORKSPACE = theApp.m_JointVirt;
size_t WORKSPACE_SIZE = theApp.m_JointSize;
Config config;
hld_alloc altor((char *)WORKSPACE, WORKSPACE_SIZE);
CacheLine *inp_ptr = altor.allocate<CacheLine>( n_cls);
CacheLine *out_ptr = altor.allocate<CacheLine>( n_cls);
config.set_aInp( (AddrType) inp_ptr);
config.set_aOut( (AddrType) out_ptr);
config.set_nCLs( n_cls);
{
for( unsigned int ip=0; ip<n_cls; ++ip) {
for ( unsigned int j=0; j<8; ++j) {
inp_ptr[ip].words[j] = 8*ip+j;
}
}
}
{
for( unsigned int ip=0; ip<n_cls; ++ip) {
for ( unsigned int j=0; j<8; ++j) {
out_ptr[ip].words[j] = 0xdeadbeefdeadbeefULL;
}
}
}
theApp.compute( &config, sizeof(config));
theApp.join();
// check
unsigned int correct = 0;
for (unsigned int ip=0; ip<n_cls; ++ip) {
for (unsigned int j=0; j<8; ++j) {
unsigned long long cand = out_ptr[ip].words[j];
unsigned long long ref = 8*ip+j;
if ( cand == 0xdeadbeefdeadbeefULL) {
std::cout << "Uninitialized result at " << ip << "," << j << std::endl;
}
if ( cand == ref) {
++correct;
} else {
std::cout << ip << "," << j << " != " << ref << " " << cand << std::endl;
}
}
}
std::cout << "Results checked. " << correct
<< " of " << 8*n_cls << " correct." << std::endl;
EXPECT_EQ( correct, 8*n_cls);
theApp.free();
} else {
EXPECT_TRUE( 0);
}
}
int main (int argc, char *argv[]) {
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}
Here are the declarations for the AcclApp class:
#include "fpga_app_sw.h"
struct AcclApp : public FpgaAppSwAlloc {
void compute (const void * config_void_ptr, const unsigned int config_size);
void join() {}
};
The code for the compute method in the AcclApp class contains the kernel code we need to move to the accelerator, in this case the call to the system's memcpy function. All memory accesses use information obtained from the Config class, and all of them fall within the workspace allocated earlier.
#include "AcclApp.h"
typedef unsigned long long UInt64;
#include "Config.h"
#include <string.h>
void AcclApp::compute( const void *config_void_ptr, const unsigned int config_size) {
const Config &config = *static_cast<const Config *>(config_void_ptr);
const CacheLine* inp_ptr = config.getInpPtr();
CacheLine* out_ptr = config.getOutPtr();
memcpy( out_ptr, inp_ptr, sizeof(CacheLine)*config.get_nCLs());
}
Compiling with
g++ -O3 -Wall -I${GTEST_DIR}/include -I${HLD_ROOT}/common -I${HLD_ROOT}/acctempl tb.cpp AcclApp.cpp ${GTEST_DIR}/make/gtest_main.a -pthread
and then running produces:
$ ./a.out
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from AccelTest
[ RUN ] AccelTest.SimpleTest
Results checked. 8192 of 8192 correct.
[ OK ] AccelTest.SimpleTest (0 ms)
[----------] 1 test from AccelTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
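Since the goal is for the test bench and accelerator to communicate through full cacheline accesses, the memcpy call in compute could equally be written as an explicit loop that reads and writes one cacheline per iteration. The following is only a sketch of that idea, not code from the project: copy_cachelines is a hypothetical helper, and the null-pointer check is an added assumption.

```cpp
#include <cassert>

struct CacheLine {
  unsigned long long words[8];
};

// Hypothetical replacement for the memcpy call: each iteration performs
// one 64-byte read followed by one 64-byte write.
inline void copy_cachelines(const CacheLine *inp, CacheLine *out,
                            unsigned int n_cls) {
  assert(inp != nullptr && out != nullptr);
  for (unsigned int i = 0; i < n_cls; ++i) {
    CacheLine cl = inp[i];  // read one full cacheline
    out[i] = cl;            // write one full cacheline
  }
}
```

Inside compute, this would be called as copy_cachelines(config.getInpPtr(), config.getOutPtr(), config.get_nCLs()), leaving the test bench unchanged.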