Memory Wrapper Integration

Memory Wrapper Integration And Testing

In the collateral generation stage, two files were generated (the "memcpy" prefix comes from the design-under-test name):

memcpy_acc.h - the kernel-level SystemC module (memcpy_hls) wrapped with memory subsystem components, i.e. load-store units and arbiters. The module's interface conforms to the interface required for CCIP-level RTL integration.
memcpy_acc_tb.h - a testbench for memcpy_acc that can be driven by the refactored software we created in stage 1.

To simulate this testbench, we need to include a third implementation of class AcclApp, this time from memcpy_acc_tb.h. There are a variety of ways to do this, but since we will be working mostly at the simulation level, we'll make this mode the default. In tb.cpp, change the header include to:

#ifdef USE_SOFTWARE
#include "AcclApp.h"
#else
#ifdef KERNEL_TEST
#include "memcpy_hls_tb.h"
#else
#include "memcpy_acc_tb.h"
#endif
#endif

and modify the Makefile to:

DEBUG_FLAGS=-O2 -g

ifdef KERNEL_TEST
CFLAGS += -DKERNEL_TEST
endif

HLD_ROOT = ../..
SOURCES=tb.cpp
TARGET=accel_test

CXX=g++

include $(HLD_ROOT)/common/Makefile.inc

Now make with no options will generate a simulator that includes the memory system. Compilation produces:

g++ -O2 -g -std=c++11 -Wall -I../../common -I../../accio -I../../acctempl -Wno-virtual-move-assign -I/p/hdk/rtl/cad/x86-64_linux26/accellera/systemc/systemc-2.3.0/include -Wno-unused-label -I/nfs/site/disks/scl.work.58/ppt/aayupov/gtest/googletest/googletest/include  -o tb.o -c tb.cpp
g++ -MM -std=c++11 -Wall -I../../common -I../../accio -I../../acctempl -Wno-virtual-move-assign -I/p/hdk/rtl/cad/x86-64_linux26/accellera/systemc/systemc-2.3.0/include -Wno-unused-label -I/nfs/site/disks/scl.work.58/ppt/aayupov/gtest/googletest/googletest/include  tb.cpp > tb.d
g++ -O2 -g -pthread -o accel_test tb.o /p/hdk/rtl/cad/x86-64_linux26/accellera/systemc/systemc-2.3.0/lib-linux64/libsystemc.a /nfs/site/disks/scl.work.58/ppt/aayupov/gtest/googletest/googletest/make/gtest_main.a

Running the executable produces:

[COG_ENV_DIR] dlxc1340> ./accel_test
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from AccelTest
[ RUN      ] AccelTest.SimpleTest

Warning: (W505) object already exists: acc_top_0.mem.memReadReqIn. Latter declaration will be renamed to acc_top_0.mem.memReadReqIn_1
In file: ../../../../src/sysc/kernel/sc_object_manager.cpp:148

Warning: (W505) object already exists: acc_top_0.mem.memReadRespOut. Latter declaration will be renamed to acc_top_0.mem.memReadRespOut_1
In file: ../../../../src/sysc/kernel/sc_object_manager.cpp:148

Info: (I804) /IEEE_Std_1666/deprecated: interface and/or port binding in port constructors is deprecated
MEM NUM OF READ PORTS is 1
MEM NUM OF WRITE PORTS is 1
MEM LATENCY is 100 cycles
MEM NUM OF READ PORTS is 1
MEM NUM OF WRITE PORTS is 1
MEM LATENCY is 100 cycles
656815 nsTB: DONE received
Results checked. 524288 of 524288 correct.
Arbiter ARBITER_1 0 was idle (no requestors) for 644 cycles, consumer was not ready for 1 cycles
Arbiter ARBITER_0 0 was idle (no requestors) for 644 cycles, consumer was not ready for 1 cycles
ACCIn inp_mem_in_0  was full (all request slots taken) for 1 cycles
AccIn inp_mem_in_0 stats (idle now):  no outstanding requests = 519 acc not ready to receive 0 reorder waste 0
 MOCK_MEMORY: Average cycle period = 10ns
 MOCK_MEMORY bandwidth stats:
 MOCK_MEMORY latency 1000ns, or ~100cycles @100Mhz
 MOCK_MEMORY latency spread 0ns, or ~0cycles @100Mhz
    rd channel 0 bandwidth = 0.99
    wr channel 0 bandwidth = 0.99
    rd channel average bandwidth = 0.99, which is ~5.90GB/s @(1*100)Mhz 
    wr channel average bandwidth = 0.99, which is ~5.90GB/s @(1*100)Mhz 
[       OK ] AccelTest.SimpleTest (5327 ms)
[----------] 1 test from AccelTest (5327 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5327 ms total)
[  PASSED  ] 1 test.

We can also compile with an option that includes HLS-related collateral. This can expose synthesizability issues with your kernel.

make clean; make USE_HLS=1

produces:

g++ -O2 -g -std=c++11 -Wall -I../../common -I../../accio -I../../acctempl -Wno-virtual-move-assign -I/p/hdk/rtl/cad/x86-64_linux26/accellera/systemc/systemc-2.3.0/include -I/nfs/site/disks/ccdo.soc.cad_root.0/cad/x86-64_linux26/cadence/ctos/14.2/share/ctos/include/ctos_tlm -I/nfs/site/disks/ccdo.soc.cad_root.0/cad/x86-64_linux26/cadence/ctos/14.2/share/ctos/include/ctos_flex_channels -I/nfs/site/disks/ccdo.soc.cad_root.0/cad/x86-64_linux26/cadence/ctos/14.2/share/ctos/include/ctos_fx -DUSE_HLS -DSC_INCLUDE_DYNAMIC_PROCESSES -Wall -Wunused-label -Wno-unused-label -I/nfs/site/disks/scl.work.58/ppt/aayupov/gtest/googletest/googletest/include  -o tb.o -c tb.cpp
g++ -MM -std=c++11 -Wall -I../../common -I../../accio -I../../acctempl -Wno-virtual-move-assign -I/p/hdk/rtl/cad/x86-64_linux26/accellera/systemc/systemc-2.3.0/include -I/nfs/site/disks/ccdo.soc.cad_root.0/cad/x86-64_linux26/cadence/ctos/14.2/share/ctos/include/ctos_tlm -I/nfs/site/disks/ccdo.soc.cad_root.0/cad/x86-64_linux26/cadence/ctos/14.2/share/ctos/include/ctos_flex_channels -I/nfs/site/disks/ccdo.soc.cad_root.0/cad/x86-64_linux26/cadence/ctos/14.2/share/ctos/include/ctos_fx -DUSE_HLS -DSC_INCLUDE_DYNAMIC_PROCESSES -Wall -Wunused-label -Wno-unused-label -I/nfs/site/disks/scl.work.58/ppt/aayupov/gtest/googletest/googletest/include  tb.cpp > tb.d
g++ -O2 -g -pthread -o accel_test tb.o /p/hdk/rtl/cad/x86-64_linux26/accellera/systemc/systemc-2.3.0/lib-linux64/libsystemc.a /nfs/site/disks/scl.work.58/ppt/aayupov/gtest/googletest/googletest/make/gtest_main.a

and runs as:

[COG_ENV_DIR] dlxc1340> ./accel_test 
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from AccelTest
[ RUN      ] AccelTest.SimpleTest

Warning: (W505) object already exists: acc_top_0.mem.memReadReqIn. Latter declaration will be renamed to acc_top_0.mem.memReadReqIn_1
In file: ../../../../src/sysc/kernel/sc_object_manager.cpp:148

Warning: (W505) object already exists: acc_top_0.mem.memReadRespOut. Latter declaration will be renamed to acc_top_0.mem.memReadRespOut_1
In file: ../../../../src/sysc/kernel/sc_object_manager.cpp:148
MEM NUM OF READ PORTS is 1
MEM NUM OF WRITE PORTS is 1
MEM LATENCY is 100 cycles
MEM NUM OF READ PORTS is 1
MEM NUM OF WRITE PORTS is 1
MEM LATENCY is 100 cycles
656815 nsTB: DONE received
Results checked. 524288 of 524288 correct.
Arbiter ARBITER_1 0 was idle (no requestors) for 644 cycles, consumer was not ready for 1 cycles
Arbiter ARBITER_0 0 was idle (no requestors) for 644 cycles, consumer was not ready for 1 cycles
ACCIn inp_mem_in_0  was full (all request slots taken) for 1 cycles
AccIn inp_mem_in_0 stats (idle now):  no outstanding requests = 519 acc not ready to receive 0 reorder waste 0
 MOCK_MEMORY: Average cycle period = 10ns
 MOCK_MEMORY bandwidth stats:
 MOCK_MEMORY latency 1000ns, or ~100cycles @100Mhz
 MOCK_MEMORY latency spread 0ns, or ~0cycles @100Mhz
    rd channel 0 bandwidth = 0.99
    wr channel 0 bandwidth = 0.99
    rd channel average bandwidth = 0.99, which is ~5.90GB/s @(1*100)Mhz 
    wr channel average bandwidth = 0.99, which is ~5.90GB/s @(1*100)Mhz 
[       OK ] AccelTest.SimpleTest (6170 ms)
[----------] 1 test from AccelTest (6170 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (6170 ms total)
[  PASSED  ] 1 test.

Testbench parameters

memcpy_acc_tb.h instantiates a testbench template that takes several parameters. Most of the parameters are for the mock memory simulator that is embedded in the testbench.

typedef acc_top <memcpy_acc, 100/*Mhz*/, 1000/*mem latency in ns*/, 4/*number of mem read ports*/, 1/*number of mem write ports*/> AcclApp;

100Mhz - frequency of the accelerator (this is driven by what clock is used in the accelerator)
1000ns - latency to the mock memory
4 read ports - number of ports between the mock memory and the accelerator used for data traffic from memory to the accelerator. This is effectively a knob that raises the memory frequency to a multiple of the accelerator frequency (4x100Mhz in the example). It requires the accelerator to have 4 read ports as well; one way to achieve this is to replicate the accelerator using multi_acc_template (see below).
1 write port - number of write ports between the mock memory and the accelerator for data traffic from the accelerator to memory.
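
To relate these parameters to the numbers printed in the simulation logs, here is a minimal Python sketch of the arithmetic. It assumes that each port moves one 64-byte CacheLine per cycle at full utilization and that "GB/s" in the log means 2^30 bytes per second. Both assumptions are inferred from the logged values (for example, a bandwidth fraction of 0.99 on one port at 100Mhz reported as ~5.90GB/s), not taken from the testbench source, and the function names are illustrative only.

```python
# CacheLine holds 8 unsigned long long words (see dut_params.py), i.e. 64 bytes.
CACHE_LINE_BYTES = 8 * 8


def latency_cycles(latency_ns, freq_mhz):
    """Memory latency expressed in accelerator clock cycles."""
    period_ns = 1000.0 / freq_mhz
    return latency_ns / period_ns


def bandwidth_gib_per_s(fraction, ports, freq_mhz):
    """Bandwidth implied by a utilization fraction.

    Assumption: one cache line per port per cycle at fraction == 1.0,
    reported in units of 2**30 bytes per second.
    """
    bytes_per_s = fraction * CACHE_LINE_BYTES * ports * freq_mhz * 1e6
    return bytes_per_s / 2**30


if __name__ == "__main__":
    print("latency in cycles:", latency_cycles(1000, 100))
    print("1 port  @ 0.99:", round(bandwidth_gib_per_s(0.99, 1, 100), 2))
    print("4 ports @ 0.96:", round(bandwidth_gib_per_s(0.96, 4, 100), 2))
```

Because the log rounds the bandwidth fraction to two digits, recomputing the GB/s figure from the printed fraction can differ slightly from the printed GB/s value.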

Multiple Accelerator Unit Template

One simple way to convert a compute-bound accelerator into a memory-bound one is to replicate the accelerator units (given that there is enough parallelism between them). We provide a template that instantiates multiple accelerator units together and generates a multi-port interface to memory.

typedef multi_acc_template_Np<NUM_AUS/*number of AUs*/, memcpy_sched<NUM_AUS>/*scheduler module*/, memcpy_acc/*accelerator unit*/, Config, RD_CHANNELS/*# of read ports*/, WR_CHANNELS/*# of write ports*/> dut_t;

We can produce multi-AU solutions by setting the NUM_AUS parameter to a value other than 1 (the default). Here we can change the Makefile again:

DEBUG_FLAGS=-O2 -g

ifdef KERNEL_TEST
CFLAGS += -DKERNEL_TEST
endif

ifdef NUM_AUS
CFLAGS += -DNUM_AUS=$(NUM_AUS)
endif

HLD_ROOT = ../..
SOURCES=tb.cpp
TARGET=accel_test

CXX=g++

include $(HLD_ROOT)/common/Makefile.inc

We also need to modify the automatically generated code for the scheduler that splits the original task across multiple AUs. Here is the generated code for the schedule_proc thread in memcpy_sched.h:

  void schedule_proc() {
    bool config_copied = false;

    {
      SCHEDULE_RESET_UNROLL: for (unsigned int i = 0; i < N; ++i) {
        this->acc_start[i] = false;
      }
    }

    this->done = false;
    wait();
    while (1) {
      {
        if (this->start.read()) {
          if ( !config_copied) {
            Config conf = this->config.read();
            SCHEDULE_START_UNROLL: for (unsigned i = 0; i < N; ++i) {
              Config acc_conf;
              acc_conf.copy(conf);

              // Overwrite acc_conf with parameters for the ith AUs computation

              this->acc_config[i].write(acc_conf);
              this->acc_start[i] = true;
            }
            config_copied = true;
          } else {
            bool all_done = true;
            SCHEDULE_DONE_UNROLL: for (unsigned i = 0; i < N; ++i) {
              bool accdone = this->acc_done[i] && this->acc_start[i];
              all_done = all_done && accdone;
            }
            if (all_done) {
              this->done = true;
            }
          }
        }
      }
      wait();
    }
  }

The schedule_proc thread produces new copies of the config signals for each AU. The following code splits the work into N consecutive chunks. The base pointers for the inp and out memory locations are changed, as well as the number of items to copy. The last AU also takes any remainder when the number of cache lines is not divisible by N.

            Config conf = this->config.read();
            SCHEDULE_START_UNROLL: for (unsigned i = 0; i < N; ++i) {
              Config acc_conf;
              acc_conf.copy(conf);

              // Each AU gets nCLs/N cache lines; the last AU also takes the remainder.
              unsigned long long chunk = acc_conf.get_nCLs() / N;
              if ( i == N-1) {
                acc_conf.set_nCLs( acc_conf.get_nCLs()-chunk*(N-1));
              } else {
                acc_conf.set_nCLs( chunk);
              }
              // Advance both base pointers to the start of the ith chunk.
              acc_conf.set_aInp( acc_conf.get_aInp() + i*chunk*sizeof(CacheLine));
              acc_conf.set_aOut( acc_conf.get_aOut() + i*chunk*sizeof(CacheLine));
              // Print the per-AU parameters.
              std::cout << "i,nCLs,aInp,aOut:"
                << " " << i
                << "," << acc_conf.get_nCLs()
                << "," << acc_conf.get_aInp()
                << "," << acc_conf.get_aOut()
                << std::endl;

              this->acc_config[i].write(acc_conf);
              this->acc_start[i] = true;
            }
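
The chunking arithmetic can be checked outside the simulator. The following Python sketch mirrors the split above under the assumption that sizeof(CacheLine) is 64 bytes (8 unsigned long long words); the function name split_work is hypothetical and is not part of the generated code.

```python
CACHE_LINE_BYTES = 64  # assumption: sizeof(CacheLine) == 8 * sizeof(unsigned long long)


def split_work(n_cls, a_inp, a_out, n_aus):
    """Return one (i, nCLs, aInp, aOut) tuple per AU.

    Every AU gets n_cls // n_aus cache lines; the last AU also takes
    the remainder, as in the schedule_proc code.
    """
    chunk = n_cls // n_aus
    parts = []
    for i in range(n_aus):
        if i == n_aus - 1:
            count = n_cls - chunk * (n_aus - 1)
        else:
            count = chunk
        offset = i * chunk * CACHE_LINE_BYTES
        parts.append((i, count, a_inp + offset, a_out + offset))
    return parts


if __name__ == "__main__":
    # 65536 cache lines split across 4 AUs; the base addresses are arbitrary here.
    for part in split_work(65536, 0x1000000, 0x2000000, 4):
        print("i,nCLs,aInp,aOut: %d,%d,%d,%d" % part)
```

With an uneven count, for example 10 cache lines across 4 AUs, the first three AUs get 2 cache lines each and the last one gets 4, so the load is only balanced when the count is a multiple of N.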

Compiling using make clean; make NUM_AUS=4 produces a four AU model that should have close to four times the bandwidth of the single AU model.

[COG_ENV_DIR] dlxc1340> ./accel_test 
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from AccelTest
[ RUN      ] AccelTest.SimpleTest

Warning: (W505) object already exists: acc_top_0.mem.memReadReqIn. Latter declaration will be renamed to acc_top_0.mem.memReadReqIn_4
In file: ../../../../src/sysc/kernel/sc_object_manager.cpp:148

Warning: (W505) object already exists: acc_top_0.mem.memReadRespOut. Latter declaration will be renamed to acc_top_0.mem.memReadRespOut_4
In file: ../../../../src/sysc/kernel/sc_object_manager.cpp:148

Info: (I804) /IEEE_Std_1666/deprecated: interface and/or port binding in port constructors is deprecated
MEM NUM OF READ PORTS is 4
MEM NUM OF WRITE PORTS is 4
MEM LATENCY is 100 cycles
MEM LATENCY is 100 cycles
MEM LATENCY is 100 cycles
MEM LATENCY is 100 cycles
MEM NUM OF READ PORTS is 4
MEM NUM OF WRITE PORTS is 4
MEM LATENCY is 100 cycles
MEM LATENCY is 100 cycles
MEM LATENCY is 100 cycles
MEM LATENCY is 100 cycles
i,nCLs,aInp,aOut: 0,16384,140737329516608,140737333710912
i,nCLs,aInp,aOut: 1,16384,140737330565184,140737334759488
i,nCLs,aInp,aOut: 2,16384,140737331613760,140737335808064
i,nCLs,aInp,aOut: 3,16384,140737332662336,140737336856640
165385 nsTB: DONE received
Results checked. 524288 of 524288 correct.
ArbiterN au_rd_arbiter 1 was idle (no requestors) for 653 cycles, consumer was not ready for 0 cycles
ArbiterN au_wr_arbiter 1 was idle (no requestors) for 653 cycles, consumer was not ready for 0 cycles
Arbiter ARBITER_1 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
Arbiter ARBITER_0 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
ACCIn inp_mem_in_0  was full (all request slots taken) for 1 cycles
AccIn inp_mem_in_0 stats (idle now):  no outstanding requests = 529 acc not ready to receive 0 reorder waste 0
Arbiter ARBITER_1 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
Arbiter ARBITER_0 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
ACCIn inp_mem_in_0  was full (all request slots taken) for 1 cycles
AccIn inp_mem_in_0 stats (idle now):  no outstanding requests = 529 acc not ready to receive 0 reorder waste 0
Arbiter ARBITER_1 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
Arbiter ARBITER_0 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
ACCIn inp_mem_in_0  was full (all request slots taken) for 1 cycles
AccIn inp_mem_in_0 stats (idle now):  no outstanding requests = 529 acc not ready to receive 0 reorder waste 0
Arbiter ARBITER_1 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
Arbiter ARBITER_0 0 was idle (no requestors) for 653 cycles, consumer was not ready for 1 cycles
ACCIn inp_mem_in_0  was full (all request slots taken) for 1 cycles
AccIn inp_mem_in_0 stats (idle now):  no outstanding requests = 529 acc not ready to receive 0 reorder waste 0
 MOCK_MEMORY: Average cycle period = 10ns
 MOCK_MEMORY bandwidth stats:
 MOCK_MEMORY latency 1000ns, or ~100cycles @100Mhz
 MOCK_MEMORY latency spread 0ns, or ~0cycles @100Mhz
    rd channel 0 bandwidth = 0.96
    rd channel 1 bandwidth = 0.96
    rd channel 2 bandwidth = 0.96
    rd channel 3 bandwidth = 0.96
    wr channel 0 bandwidth = 0.96
    wr channel 1 bandwidth = 0.96
    wr channel 2 bandwidth = 0.96
    wr channel 3 bandwidth = 0.96
    rd channel average bandwidth = 0.96, which is ~22.93GB/s @(4*100)Mhz 
    wr channel average bandwidth = 0.96, which is ~22.93GB/s @(4*100)Mhz 
[       OK ] AccelTest.SimpleTest (5430 ms)
[----------] 1 test from AccelTest (5430 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5430 ms total)
[  PASSED  ] 1 test.

The simulator reports that the 4 AU version finished in 165385 ns, while the 1 AU version finished in 656815 ns, a speedup of ~3.97x.
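
The ratio comes straight from the two "DONE received" timestamps in the logs above. As a small worked example in Python (the helper names are illustrative):

```python
def speedup(t_single_ns, t_multi_ns):
    """Speedup of the multi-AU run relative to the single-AU run."""
    return t_single_ns / t_multi_ns


def parallel_efficiency(t_single_ns, t_multi_ns, n_aus):
    """Fraction of the ideal n_aus-fold speedup that was achieved."""
    return speedup(t_single_ns, t_multi_ns) / n_aus


if __name__ == "__main__":
    s = speedup(656815, 165385)
    e = parallel_efficiency(656815, 165385, 4)
    print("speedup: %.2fx, efficiency: %.1f%%" % (s, 100 * e))
```

The shortfall from an ideal 4x is small, which is consistent with the per-channel bandwidth fraction dropping only from 0.99 to 0.96 in the four-AU run.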

Experimentally Determine Response Queue Size

Queue sizing can be performed experimentally by running SystemC simulations on models with different parameters. By passing parameters from the make command line to the SystemC code, we can conveniently compile and run different scenarios. Let's add this code to the Makefile:

ifdef __inp_Slots__
CFLAGS += -D__inp_Slots__=$(__inp_Slots__)
endif

and this code to memcpy_hls.h:

#ifndef __inp_Slots__
#define __inp_Slots__ 128
#endif

and this code to memcpy_acc.h:

  typedef LoadUnitParams< CacheLine, __inp_Slots__, 1 << 30, 1> inpLoadParams;

Better yet, you could change the dut_params.py file and regenerate memcpy_acc.h. Here is what that file should look like:

#!/usr/bin/env python3

from cog_acctempl import *

dut = DUT("memcpy")

dut.add_rds( [TypedRead("CacheLine","inp","__inp_Slots__","1 << 30","1")])
dut.add_wrs( [TypedWrite("CacheLine","out")])

dut.add_ut( UserType("CacheLine",[ArrayField(UnsignedLongLongField("words"),8)]))

dut.add_extra_config_fields( [BitReducedField(UnsignedIntField("nCLs"),32)])

dut.module.add_cthreads( [CThread("inp_fetcher",writes_to_done=True),
                          CThread("inp_addr_gen"),
                          CThread("out_addr_gen")])

dut.get_cthread( "inp_fetcher").add_ports( [RdRespPort("inp"),
                                            WrDataPort("out")])

dut.get_cthread( "inp_addr_gen").add_ports( [RdReqPort("inp")])
dut.get_cthread( "out_addr_gen").add_ports( [WrReqPort("out")])

if __name__ == "__main__":
    dut.dump_dot( dut.nm + ".dot")

Now we can recompile and run multiple configurations with a simple script, here varying the number of response buffer slots for the inp read port:

#!/bin/bash

for slots in 16 32 64 96 128 160 192 224 256
do
    echo "Compiling ${slots}..."
    make clean
    make __inp_Slots__=${slots}
    echo "Running ${slots}..."
    ./accel_test > LOG-${slots}
done

Extract the bandwidth from the log files using something like this:

grep -E 'rd channel average bandwidth' LOG-* | \
tr ':' ' ' | tr '-' ' ' | tr , ' ' | \
awk '{print $2, $8;}' | sort -k1n,1 > data.txt

This produces a data file (slot count in the first column, read bandwidth fraction in the second) that can be easily plotted.
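
If the shell pipeline is awkward to maintain, the same extraction can be sketched in Python. This is an illustrative alternative, not part of the original flow; it assumes the log files are named LOG-<slots> as in the script above and that each contains one "rd channel average bandwidth = X" line. The function names are hypothetical.

```python
import re
from pathlib import Path

BW_RE = re.compile(r"rd channel average bandwidth = ([0-9.]+)")


def extract_bandwidth(log_text):
    """Return the average read bandwidth fraction found in a log, or None."""
    match = BW_RE.search(log_text)
    return float(match.group(1)) if match else None


def collect(logs):
    """logs maps slot count -> log text; returns (slots, bandwidth) rows sorted by slots."""
    rows = []
    for slots, text in logs.items():
        bandwidth = extract_bandwidth(text)
        if bandwidth is not None:
            rows.append((slots, bandwidth))
    return sorted(rows)


def load_logs(directory="."):
    """Read every LOG-<slots> file in a directory into a dict keyed by slot count."""
    logs = {}
    for path in Path(directory).glob("LOG-*"):
        logs[int(path.name.split("-", 1)[1])] = path.read_text()
    return logs


if __name__ == "__main__":
    # Write the rows in the same two-column layout as data.txt.
    for slots, bandwidth in collect(load_logs(".")):
        print(slots, bandwidth)
```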

This gnuplot script (named experiment.gp) will fit a simple (non-linear) model (see y(x) with parameters x0, y0, and m below):

set term png
set output dir . "/data.png"

set key top left
set ylabel "Read Bandwidth Fraction"
set xlabel "Read Response Buffer Positions"
set xtics 32,32,256

y0 = 1.0
x0 = 100
m = 0.01

y(x) = x < x0 ? y0 + m*(x-x0) : y0

data_fn = dir . "/data.txt"

fit y(x) data_fn via x0, y0, m

set yrange [0:1]

set arrow from x0+32,y0-0.2 to x0,y0 lw 0 lc 4 
set label at x0+32,y0-0.2 sprintf("%.1f,%.3f", x0, y0 )

plot [0:256] data_fn with points pointtype 4, y(x) with lines

that will generate this plot:

[Figure: read bandwidth fraction versus read response buffer positions, with the fitted model y(x)]

(You can set the variable dir using the command line syntax: gnuplot -e "dir=\".\"" experiment.gp. Replace . with the directory containing the data.txt file.)

The parameter x0 represents the smallest number of buffers that generates the maximum bandwidth (assuming this ramp-then-flat-top model is applicable). The parameter y0 corresponds to the read bandwidth fraction for the flat section of the curve.
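
As a rough sanity check on where x0 should land, a back-of-the-envelope estimate based on Little's law (requests in flight = latency x request rate) can be sketched in Python. This assumes the load unit issues one read request per cycle and that each outstanding read occupies one response slot; neither assumption is confirmed by the testbench source, and the function name is illustrative.

```python
import math


def min_outstanding_requests(latency_ns, freq_mhz, requests_per_cycle=1.0):
    """Slots needed to keep the read channel busy for a full memory round trip."""
    latency_in_cycles = latency_ns * freq_mhz / 1000.0
    return math.ceil(latency_in_cycles * requests_per_cycle)


if __name__ == "__main__":
    # Parameters used in the runs above: 1000 ns latency at 100 Mhz.
    print(min_outstanding_requests(1000, 100))
```

Under these assumptions the estimate is on the order of the memory latency in cycles, so a fitted x0 somewhat above that value would be plausible once pipeline overheads are included.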

The plot says the inflection point is where the slot count is around 107, which is not different enough from the default value of 128 to consider changing it.
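
For readers without gnuplot, the same ramp-then-flat model can be fitted with a few lines of dependency-free Python. Note that this is not the algorithm gnuplot's fit command uses: the sketch below swaps in a simple grid search over integer x0 values, and for each candidate solves for y0 and m in closed form by ordinary least squares (the model is linear in y0 and m once x0 is fixed). The function names are hypothetical, and the data in the example is synthetic, generated from the model itself rather than taken from a simulation.

```python
def hinge_model(x, x0, y0, m):
    """Same shape as y(x) in experiment.gp: a ramp up to x0, flat at y0 afterwards."""
    return y0 + m * (x - x0) if x < x0 else y0


def fit_hinge(xs, ys):
    """Grid-search integer x0; fit y0 and m by linear least squares for each candidate.

    xs must be integers with at least two distinct values.
    Returns (x0, y0, m) minimizing the sum of squared errors.
    """
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for x0 in range(min(xs) + 1, max(xs) + 1):
        # With x0 fixed, the model is y = y0 + m * u where u = min(x - x0, 0).
        us = [min(x - x0, 0) for x in xs]
        u_mean = sum(us) / n
        var_u = sum((u - u_mean) ** 2 for u in us)
        cov = sum((u - u_mean) * (y - y_mean) for u, y in zip(us, ys))
        m = cov / var_u
        y0 = y_mean - m * u_mean
        sse = sum((y - (y0 + m * u)) ** 2 for u, y in zip(us, ys))
        if best is None or sse < best[0]:
            best = (sse, x0, y0, m)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Synthetic data only: the slot counts from the sweep script, with
    # bandwidths generated from the model at x0=100, y0=1.0, m=0.01.
    xs = [16, 32, 64, 96, 128, 160, 192, 224, 256]
    ys = [hinge_model(x, 100, 1.0, 0.01) for x in xs]
    print(fit_hinge(xs, ys))
```

To use it on real measurements, load the two columns of data.txt into xs and ys. Since the grid only visits integer x0 values, the result can differ slightly from gnuplot's continuous fit.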

The type AcclApp (typedef above) can be used as is in the refactored software, since it implements the IFpgaApp interface. The output of the run will include memory statistics of the mock memory based on the parameters above. For example:

 MOCK_MEMORY: Average cycle period = 10ns
 MOCK_MEMORY bandwidth stats:
 MOCK_MEMORY latency 1000ns, or ~150cycles @100Mhz
    rd channel 0 bandwidth = 1.00
    rd channel 1 bandwidth = 0.99
    rd channel 2 bandwidth = 0.89
    rd channel 3 bandwidth = 0.52
    wr channel 0 bandwidth = 0.01
    rd channel average bandwidth = 0.85, which is ~20.72GB/s @(4*100)Mhz 
    wr channel average bandwidth = 0.01, which is ~0.08GB/s @(1*100)Mhz

The statistics include the parameters that were set at instantiation of acc_top<...> and the maximum achievable memory bandwidth per channel. In other words, this metric is useful to show whether the accelerator unit is memory bound or compute bound, and what memory bandwidth it can reach. For example, 20.72GB/s may not be achievable on the real platform, but it is an indicator that there is enough computation going on in the accelerator to saturate memory bandwidths up to 20GB/s, given a certain memory frequency and latency.
