This repository was archived by the owner on Jan 7, 2023. It is now read-only.

Code Generation

Andrey Ayupov edited this page Mar 23, 2017 · 4 revisions

Accelerator Structure and SystemC Code Generation

The proposed accelerator structure is shown in the figure below.

[Figure: Accelerator Structure]

The compute kernel (suffix *_hls in the proposed naming convention) is where the main computation of the accelerator is coded. The accelerator unit (*_acc), which includes the compute kernel and memory-related components (load/store units and arbiters), is generated automatically using the code-generation scripts shown below and in the following sections. The compute kernel has high-level memory ports to read/write data from main memory; the interface to these ports is covered later. In addition to the compute kernel and accelerator unit, various testbenches are also generated to test the design at different levels. The next section covers the specification API for the compute kernel interface (including the memory ports).

Python Spec

The next step is to fully specify the accelerator interface. This can be a complex and error-prone task, so we use code-generation scripts to do most of the grunt work. The code generation is done using the Python-based cog system. Provided templates and a few customization parameters are all that are needed to create starting code (and in some cases the final code) for the next steps in the process.

For our memcpy example, we can generate much of the interface we need from a few important pieces of information described using Python syntax in the file dut_params.py.

from cog_acctempl import *

dut = DUT("memcpy")

dut.add_rds( [TypedRead("CacheLine","inp")])
dut.add_wrs( [TypedWrite("CacheLine","out")])

dut.add_ut( UserType("CacheLine",[ArrayField(UnsignedLongLongField("words"),8)]))

dut.add_extra_config_fields( [BitReducedField(UnsignedIntField("nCLs"),32)])

We'll go line by line through this file. We first create a singleton dut from the class DUT. The parameter to the constructor is the name of our design under test, in this case "memcpy". This name is used in all DUT-specific entities in our generated code; for instance, the name of the hardware module will be "memcpy_hls".
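As a sketch of this naming convention (the helper functions below are illustrative only, not part of the actual scripts), the DUT name drives the derived entity names:

```python
# Illustrative sketch of the naming convention described above;
# these helpers are hypothetical, not part of cog_acctempl.
def hls_module_name(dut_name):
    """Name of the compute kernel module (suffix *_hls)."""
    return dut_name + "_hls"

def acc_unit_name(dut_name):
    """Name of the generated accelerator unit (suffix *_acc)."""
    return dut_name + "_acc"

print(hls_module_name("memcpy"))  # memcpy_hls
print(acc_unit_name("memcpy"))    # memcpy_acc
```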

We now add read and write memory ports to the accelerator. There are multiple types of memory interfaces; for this accelerator we want streaming accesses, and the appropriate interface types are TypedRead and TypedWrite. In this configuration file, the syntax:

dut.add_rds( [TypedRead("CacheLine","inp")])
dut.add_wrs( [TypedWrite("CacheLine","out")])

extends the list of read ports with a list containing an object of the Python class TypedRead, constructed with two arguments: "CacheLine" (the name of the type to be transferred on the read port) and "inp" (the name stem used for references to this port in the generated code). The second line adds the write port with the name out. The add_rds method extends the member list variable inps, which can then be used in the generator code (written in Python) to produce code associated with the read channels. As an example, the code snippet:

// memory ports
  /*[[[cog
       for p in dut.inps:
         cog.outl("ga::tlm_fifo_out<%s > %s;" % (p.reqTy(),p.reqNmK()))
         cog.outl("ga::tlm_fifo_in<%s > %s;" % (p.respTy(),p.respNmK()))
         cog.outl("")
    ]]]*/
  //[[[end]]]

when run through the cog system will generate this code:

  // memory ports
  ga::tlm_fifo_out<MemTypedReadReqType<CacheLine> > inpReqOut;
  ga::tlm_fifo_in<MemTypedReadRespType<CacheLine> > inpRespIn;

The Python class TypedRead has methods reqTy, reqNmK, respTy, and respNmK that generate names according to our standardized naming convention. The template uses this abstraction so that this aspect of the generated code can be changed simply by changing the class descriptions in ${HLD_ROOT}/scripts/systemc-gen/cog_acctempl.py.
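To illustrate how such naming methods might look, here is a hypothetical, simplified stand-in for the real class (the actual implementation lives in cog_acctempl.py). Its output matches the generated port declarations shown above:

```python
# Hypothetical, simplified model of TypedRead's naming-convention methods;
# the real class is defined in ${HLD_ROOT}/scripts/systemc-gen/cog_acctempl.py.
class TypedRead:
    def __init__(self, ty, nm):
        self.ty = ty  # element type, e.g. "CacheLine"
        self.nm = nm  # port name stem, e.g. "inp"

    def reqTy(self):
        return "MemTypedReadReqType<%s>" % self.ty

    def respTy(self):
        return "MemTypedReadRespType<%s>" % self.ty

    def reqNmK(self):
        return self.nm + "ReqOut"   # kernel-side request port name

    def respNmK(self):
        return self.nm + "RespIn"   # kernel-side response port name

p = TypedRead("CacheLine", "inp")
print("ga::tlm_fifo_out<%s > %s;" % (p.reqTy(), p.reqNmK()))
print("ga::tlm_fifo_in<%s > %s;" % (p.respTy(), p.respNmK()))
```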

The remainder of the description file specifies the single user-defined class (CacheLine), and the single extra parameter (nCLs) needed in the Config class. We want our unit of data transfer to be a cacheline (8 64-bit words). We specify this user type again using simple Python objects, in this case an instance of UserType which contains instances of ArrayField and UnsignedLongLongField. The parameters to ArrayField are the class that we are repeating in the array, and the number of times we do this.
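As a sketch of how these field objects could carry the layout information the generator needs (hypothetical, simplified classes; the real ones are in cog_acctempl.py), the bit width and field count of the user type follow directly from its composition:

```python
# Hypothetical, simplified field model mirroring the spec above;
# the real classes live in cog_acctempl.py.
class UnsignedLongLongField:
    bitwidth = 64
    numberOfFields = 1
    def __init__(self, nm):
        self.nm = nm

class ArrayField:
    def __init__(self, field, cnt):
        self.field, self.cnt = field, cnt
        self.bitwidth = field.bitwidth * cnt              # 64 * 8 = 512
        self.numberOfFields = field.numberOfFields * cnt  # 1 * 8 = 8

class UserType:
    def __init__(self, nm, fields):
        self.nm = nm
        self.fields = fields
        self.bitwidth = sum(f.bitwidth for f in fields)
        self.numberOfFields = sum(f.numberOfFields for f in fields)

ut = UserType("CacheLine", [ArrayField(UnsignedLongLongField("words"), 8)])
print(ut.bitwidth)        # 512, matches the generated BitCnt
print(ut.numberOfFields)  # 8, matches the generated numberOfFields()
```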

This user class specification results in this data layout:

class CacheLine {
public:
  unsigned long long words[8];
};

If this were all that was needed, we would not need a code generator for such user types. However, to use this type effectively within the SystemC environment, we need extra methods for tracing, checking for equality, and writing to a stream. Furthermore, our memory subsystem requires additional methods for correctly marshalling data, methods that work in both the software (fast simulation) and hardware settings. The complete generated code is as follows:

/*[[[cog
     import cog
     from cog_acctempl import *
     from dut_params import *
     if "ty" not in globals():
       lst = cog.previous.lstrip('/').rstrip('\n').split('=')
       assert( lst[0]=="ty")
       assert( len(lst)==2)
       global ty
       ty = lst[1]
     cog.outl( "//ty=" + ty)
     ut = dut.usertypes[ty]
  ]]]*/
//ty=CacheLine
//[[[end]]] (checksum: 83c2abd91c3bc28abca443638a359778)
/*[[[cog
     cog.outl("#ifndef %s_H_" % ty)
     cog.outl("#define %s_H_" % ty)
  ]]]*/
#ifndef CacheLine_H_
#define CacheLine_H_
//[[[end]]] (checksum: ff5a1bc69d990f6d90fbb9f266b06299)

#ifndef __SYNTHESIS__
#include <cstddef>
#include <cassert>
#endif

/*[[[cog
     cog.outl("class %s {" % ty)
  ]]]*/
class CacheLine {
//[[[end]]] (checksum: a5bfab3cfa50fb1bb039b116ec8db585)
public:
  /*[[[cog
       for field in ut.fields:
         cog.outl( field.declaration)
    ]]]*/
  unsigned long long words[8];
  //[[[end]]] (checksum: d2798c94961ed332cb0e18b62e1815cc)

  /*[[[cog
       cog.outl("enum { BitCnt = %d };" % ut.bitwidth)
    ]]]*/
  enum { BitCnt = 512 };
  //[[[end]]] (checksum: 37159abeda6c5f75899fb314e1ed078c)

  static size_t getBitCnt() {
    /*[[[cog
         cog.outl("assert(sizeof(%s) == (size_t) BitCnt/8);" % ty)
      ]]]*/
    assert(sizeof(CacheLine) == (size_t) BitCnt/8);
    //[[[end]]] (checksum: 876cb2e24897c7e2cfec92b111262852)
    assert( 0 == (size_t) BitCnt%8);
    return BitCnt;
  }
  static size_t numberOfFields() {
    /*[[[cog
         cog.outl("return %d;" % ut.numberOfFields)
      ]]]*/
    return 8;
    //[[[end]]] (checksum: 314c347cd02990f9b8506ab115315c18)
  }
  static size_t fieldWidth( size_t index) {
    /*[[[cog
         sum = 0
         for field in ut.fields:
            cog.outl("if ( %d <= index && index < %d) {" % (sum,sum+field.numberOfFields))
            cog.outl( field.fieldWidth( sum))
            cog.outl("}")
            sum += field.numberOfFields
      ]]]*/
    if ( 0 <= index && index < 8) {
      return 64;
    }
    //[[[end]]] (checksum: 4b3342ea6b4acaa9728726bd517fca83)
    return 0;
  }
  void putField(size_t index, UInt64 d) {
    /*[[[cog
         sum = 0
         for field in ut.fields:
            cog.outl("if ( %d <= index && index < %d) {" % (sum,sum+field.numberOfFields))
            cog.outl( field.putField( sum))
            cog.outl("}")
            sum += field.numberOfFields
      ]]]*/
    if ( 0 <= index && index < 8) {
      words[index-0] = d;
    }
    //[[[end]]] (checksum: ab11b9b32dcfc398fc04749e7e1360e1)
  }
  UInt64 getField(size_t index) const {
    /*[[[cog
         sum = 0
         for field in ut.fields:
            cog.outl("if ( %d <= index && index < %d) {" % (sum,sum+field.numberOfFields))
            cog.outl( field.getField( sum))
            cog.outl("}")
            sum += field.numberOfFields
      ]]]*/
    if ( 0 <= index && index < 8) {
      return words[index-0];
    }
    //[[[end]]] (checksum: 16a9cfe200e7780c0afe0add24e543bf)
    return 0;
  }

#if !defined(__AAL_USER__) && !defined(USE_SOFTWARE)
  /*[[[cog
       cog.outl("inline friend void sc_trace(sc_trace_file* tf, const %s& d, const std::string& name) {" % ty)
    ]]]*/
  inline friend void sc_trace(sc_trace_file* tf, const CacheLine& d, const std::string& name) {
  //[[[end]]] (checksum: 831cd4fced8f6f9d1af3f7b67184262e)
  }
#endif

  /*[[[cog
       cog.outl("inline friend std::ostream& operator<<(std::ostream& os, const %s& d) {" % ty)
       cog.outl("  os << \"<%s>\";" % ty)
    ]]]*/
  inline friend std::ostream& operator<<(std::ostream& os, const CacheLine& d) {
    os << "<CacheLine>";
  //[[[end]]] (checksum: 720cf409d0cc7b5eb8f2c444297d0e84)
    return os;
  }

  /*[[[cog
       cog.outl("inline bool operator==(const %s& rhs) const {" % ty)
    ]]]*/
  inline bool operator==(const CacheLine& rhs) const {
  //[[[end]]] (checksum: cf4783549073f665518590a4ec88dbc4)
    bool result = true;
    for( unsigned int i=0; i<numberOfFields(); ++i) {
      result = result && (getField(i) == rhs.getField(i));
    }
    return result;
  }

};

#endif

(We include the generator code here as well. You have the option of removing this code, instead of just commenting it out as we do here, but if you do, you cannot regenerate the code when dut_params.py changes later in the development cycle. If you do not modify the generated code, you can, of course, regenerate without losing any changes. If you do want to change the generated code, for example to include a custom stream writer, then leaving the generator code in the source allows you to regenerate after a dut_params.py change; in that case, use the previously generated code, not the template in the source tree, as your template when running the cog command.)

The Config.h file can also be generated. For this, we use the memory port descriptions mentioned previously, plus a list of extra fields we would like to add to this class. Getter and setter methods are generated for these fields, as well as a 64-bit address pointer for each memory port.

The resulting code is this:

#ifndef __CONFIG_H__
#define __CONFIG_H__

#ifndef __SYNTHESIS__
#include <iomanip>
#include <iostream>
#include <string>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#endif
using std::endl ;
using std::cout ;
using std::string ;

#include "CacheLine.h"

typedef unsigned long long AddrType;

struct Config {
private:
  AddrType aInp : 64;
  AddrType aOut : 64;
  unsigned int nCLs : 32;

public:
  AddrType get_aInp() const {
    return aInp & 0x0000ffffffffffc0ULL;
  }
  AddrType get_aOut() const {
    return aOut & 0x0000ffffffffffc0ULL;
  }
  unsigned int get_nCLs() const {
    return nCLs;
  }

  void set_aVD( const AddrType& val) {
  }
  void set_aInp( const AddrType& val) {
    assert( !(val & ~0x0000ffffffffffc0ULL));
    aInp = val;
  }
  void set_aOut( const AddrType& val) {
    assert( !(val & ~0x0000ffffffffffc0ULL));
    aOut = val;
  }
  void set_nCLs( const unsigned int& val) {
    nCLs = val;
  }

  AddrType getInpAddr( size_t idx) const {
    return 0x0000ffffffffffffULL & (get_aInp() + (CacheLine::getBitCnt()/8)*idx);
  }
  AddrType getOutAddr( size_t idx) const {
    return 0x0000ffffffffffffULL & (get_aOut() + (CacheLine::getBitCnt()/8)*idx);
  }

  CacheLine* getInpPtr() const {
    return reinterpret_cast<CacheLine*>( get_aInp());
  }
  CacheLine* getOutPtr() const {
    return reinterpret_cast<CacheLine*>( get_aOut());
  }

  size_t addr2IdxInp( AddrType addr) const {
    return (addr - get_aInp()) / ((CacheLine::getBitCnt()/8));
  }
  size_t addr2IdxOut( AddrType addr) const {
    return (addr - get_aOut()) / ((CacheLine::getBitCnt()/8));
  }

  Config() {
    aInp = 0;
    aOut = 0;
    nCLs = 0;
  }
  void copy(Config &from) {
    aInp = from.aInp;
    aOut = from.aOut;
    nCLs = from.nCLs;
  }

#if !defined(__AAL_USER__) && !defined(USE_SOFTWARE)
  inline friend void sc_trace(sc_trace_file* tf, const Config& d,
      const std::string& name) {
  }
#endif

  inline friend std::ostream& operator<<(std::ostream& os, const Config& d) {
    os << "aInp: " << d.aInp << std::endl;
    os << "aOut: " << d.aOut << std::endl;
    os << "nCLs: " << d.nCLs << std::endl;
    return os;
  }

  inline bool operator==(const Config& rhs) const {
    bool result = true;
    result = result && (aInp == rhs.aInp);
    result = result && (aOut == rhs.aOut);
    result = result && (nCLs == rhs.nCLs);
    return result;
  }
};

#endif

(Here the generator code is not included. See the documentation for the -x and -d command-line options to cog.)

Address calculation methods are provided as well. These should be compatible with the basic version of this class that we generated in the software-only section.
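The address arithmetic can be checked with a small Python model of these methods (an illustrative re-implementation of the generated C++, not part of the actual scripts):

```python
# Illustrative Python model of the generated Config address arithmetic.
CL_BYTES = 512 // 8              # CacheLine::getBitCnt()/8 = 64 bytes
ADDR_MASK = 0x0000ffffffffffc0   # 48-bit physical address, 64-byte aligned

def get_inp_addr(a_inp, idx):
    # mirrors Config::getInpAddr: base address plus idx cachelines
    return 0x0000ffffffffffff & ((a_inp & ADDR_MASK) + CL_BYTES * idx)

def addr2idx_inp(a_inp, addr):
    # mirrors Config::addr2IdxInp: inverse of get_inp_addr
    return (addr - (a_inp & ADDR_MASK)) // CL_BYTES

base = 0x1000                    # a 64-byte-aligned base address
assert base & ~ADDR_MASK == 0    # same check as set_aInp's assert
for i in range(4):
    # index -> address -> index must round-trip
    assert addr2idx_inp(base, get_inp_addr(base, i)) == i
print("round trip ok")
```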

Generating the Code

In addition to the user type and Config type descriptions, more code can be generated from this simple information, including:

  • a starting template for the accelerator's SystemC kernel complete with memory interface declarations and SC_MODULE boilerplate,
  • a SystemC test bench for simulating your kernel,
  • SystemC code to wrap the kernel with a custom memory system,
  • a second SystemC test bench for simulating your kernel and the memory system together, and
  • a starting template for how to transform the Config information for dividing up work for multiple parallel accelerators.

We can generate (and safely regenerate) all these files using the following scripts.

export PYTHONPATH=.:$HLD_ROOT/scripts/systemc-gen

$HLD_ROOT/scripts/systemc-gen/gen-all-usertypes.py
$HLD_ROOT/scripts/systemc-gen/gen-all-other.py

We can also generate good starting-point code for your CTHREADs by adding more information to dut_params.py and running two more scripts.

$HLD_ROOT/scripts/systemc-gen/gen-all-cthreads.py
$HLD_ROOT/scripts/systemc-gen/gen-all-modules.py

(This last script invokes more advanced features used to generate the internal module hierarchy; see the Python API page for a more complete description.) The complete dut_params.py file is as follows:

#!/usr/bin/env python3

from cog_acctempl import *

dut = DUT("memcpy")

dut.add_rds( [TypedRead("CacheLine","inp")])
dut.add_wrs( [TypedWrite("CacheLine","out")])

dut.add_ut( UserType("CacheLine",[ArrayField(UnsignedLongLongField("words"),8)]))

dut.add_extra_config_fields( [BitReducedField(UnsignedIntField("nCLs"),32)])

dut.module.add_cthreads( [CThread("inp_fetcher",writes_to_done=True),
                          CThread("inp_addr_gen"),
                          CThread("out_addr_gen")])

dut.get_cthread( "inp_fetcher").add_ports( [RdRespPort("inp"),
                                            WrDataPort("out")])

dut.get_cthread( "inp_addr_gen").add_ports( [RdReqPort("inp")])
dut.get_cthread( "out_addr_gen").add_ports( [WrReqPort("out")])

if __name__ == "__main__":
    dut.dump_dot( dut.nm + ".dot")

The new lines describe the thread structure of our implementation. We create three threads:

  • the main routine inp_fetcher that transfers read responses from the "inp" channel to write data on the "out" channel,
  • an address generator inp_addr_gen for requests on the "inp" channel, and
  • an address generator out_addr_gen for requests on the "out" channel.

We also specify which threads control which top level memory ports. The syntax specifies this connectivity diagram.

[Figure: thread-to-port connectivity diagram]

In fact, if you execute dut_params.py, it will generate the file memcpy.dot, which can be converted to a diagram using the dot program:

python3 dut_params.py
dot -Tpng -o memcpy.png memcpy.dot

Executing dut_params.py also performs some basic checks to ensure that each memory port is accessed by exactly one thread. For example, if we neglected to connect the request ports to the address generation threads (i.e., omitted the following lines from dut_params.py):

dut.get_cthread( "inp_addr_gen").add_ports( [RdReqPort("inp")])
dut.get_cthread( "out_addr_gen").add_ports( [WrReqPort("out")])

we would get this error message:

$ python3 dut_params.py
Traceback (most recent call last):
  File "./dut_params.py", line 25, in <module>
    dut.dump_dot( dut.nm + ".dot")
  File "/nfs/site/disks/scl.work.59/ppt/smburns/shadow3/hld_fpga-sysc/scripts/systemc-gen/cog_acctempl.py", line 598, in dump_dot
    for f in self.tlm_fifos:
  File "/nfs/site/disks/scl.work.59/ppt/smburns/shadow3/hld_fpga-sysc/scripts/systemc-gen/cog_acctempl.py", line 451, in tlm_fifos
    self.semantic()
  File "/nfs/site/disks/scl.work.59/ppt/smburns/shadow3/hld_fpga-sysc/scripts/systemc-gen/cog_acctempl.py", line 552, in semantic
    assert p.nm in self.rd_req_tbl and p.nm in self.rd_resp_tbl, "No req or resp port on rd channel " + p.nm
AssertionError: No req or resp port on rd channel inp

This allows early debugging of connectivity between threads and is recommended for even the simplest of designs.
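A simplified version of such a connectivity check could look like this (a hypothetical sketch; the real check lives in the semantic method of cog_acctempl.py):

```python
# Hypothetical sketch of the connectivity check performed during dump_dot;
# the real implementation is in cog_acctempl.py's semantic() method.
def check_rd_channels(rd_channels, rd_req_tbl, rd_resp_tbl):
    """Ensure every read channel has both a request port and a response
    port claimed by some CTHREAD."""
    for nm in rd_channels:
        assert nm in rd_req_tbl and nm in rd_resp_tbl, \
            "No req or resp port on rd channel " + nm

# With both ports connected, the check passes silently:
check_rd_channels(["inp"],
                  rd_req_tbl={"inp": "inp_addr_gen"},
                  rd_resp_tbl={"inp": "inp_fetcher"})

# Omitting the request port trips the assertion:
try:
    check_rd_channels(["inp"],
                      rd_req_tbl={},
                      rd_resp_tbl={"inp": "inp_fetcher"})
except AssertionError as e:
    print(e)  # No req or resp port on rd channel inp
```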
