This repository was archived by the owner on Jan 7, 2023. It is now read-only.

HLS Flow

Andrey Ayupov edited this page Aug 14, 2017 · 16 revisions

How to get the tool

You will need the Cadence® C-to-Silicon Compiler tool, version 14.2, and a Cadence Stratus™ HLS license. Academic users should start with the Cadence University software program for license requests. Any Cadence University program member who has updated their licenses this year should automatically have Stratus™ HLS licenses; others can request that Stratus™ HLS be added to their licenses. To download the tool, go to the Cadence Downloads page.

High-Level Synthesis Flow

Please refer to the C-to-Silicon Compiler User Guide for how to use the tool. Here we provide instructions specific to the design components and examples that come with this package, along with some tips.

We synthesize the *_acc module, e.g. memcpy_acc. This module has an interface that lets us plug the generated RTL into the rest of the RTL as is. In addition, the RTL integration flow supports the multi_acc_template with several combinations of read/write ports (RD-WR: 2-1, 2-2, 4-1, 4-4).

For the HLS tool to identify the top module to synthesize, the following is needed in the source code (e.g. at the end of memcpy_hls.h):

#if defined (USE_CTOS) && (__SYNTHESIS__)
SC_MODULE_EXPORT(memcpy_acc);
#endif

Below is an example of the HLS script to build the design.

close_design
new_design memcpy_acc 
set_attr design_dir "" /designs/memcpy_acc 
set_attr auto_write_models true /designs/memcpy_acc 
define_sim_config -model_dir ./model  /designs/memcpy_acc 
set_attr source_files ../memcpy_hls.h /designs/memcpy_acc 
set_attr compile_flags "-w -I../../../../common -I../../../../accio -I../../../../acctempl -DUSE_HLS -D__SYNTHESIS__ -DUSE_CTOS" /designs/memcpy_acc 
set_attr top_module_path memcpy_acc /designs/memcpy_acc 
set_attr build_flat true /designs/memcpy_acc 
set_attr prototype_memory_launch_delay 200 /designs/memcpy_acc 
set_attr prototype_memory_setup_delay 100 /designs/memcpy_acc 

define_clock -name clk  -period 3000  -rise 0  -fall 1500  
set_attr tech_lib_names NangateOpenCellLibrary_PDKv1_3_v2010_12/Front_End/Liberty/NLDM/NangateOpenCellLibrary_typical.lib /designs/memcpy_acc 
set_attr enable_multiple_pipeline_stalls true /designs/memcpy_acc
build -verbose

We use the Nangate ASIC library in the HLS script. The HLS flow uses the technology library to characterize the timing of the arithmetic and logic operators in the design. Even though the tool supports FPGA-based designs, a particular FPGA part may not be available. Because the memory subsystem is modeled at a cycle-accurate level, timing estimation accuracy is not important for the memory components. It may still matter for kernel synthesis, especially if the kernel logic needs pipelining. We suggest a simple tuning of the cycle delay: compare the HLS timing estimates against the timing report from FPGA synthesis and place-and-route, and adjust accordingly.

The HLS tool generates RTL right after the build stage, and we strongly encourage validating this RTL through the ASE flows as early as possible. Various mismatches between SystemC and post-build RTL may occur, usually for the following reasons:

  • By default, the SystemC design is compiled with the TLM version of the channels (rather than the synthesizable one) because TLM communication is faster to simulate. However, that may hide hardware-related issues in your design, such as channel reset. Thus, in addition to testing your SystemC with the TLM version of the channels, we strongly recommend compiling your design with the USE_HLS parameter set (e.g. make USE_HLS=1) and rerunning the tests.
  • All ga::tlm_fifo_in/ga::tlm_fifo_out ports have to be connected to clk/rst in constructors (see examples) and reset in CTHREADs (note the different reset for put and get operations). Problems of this kind can be caught by compiling with USE_HLS=1.
  • Reset problems may also occur for module fields. In SystemC simulation, a variable has no explicit uninitialized state, and module field variables are often initialized with default values. Moreover, HLS tools do not honor initialization of variables outside the reset sections of threads (e.g. in a constructor). In RTL, this may produce an X-state that propagates incorrectly through the logic. All control-related state variables should be initialized in the reset section of the threads where they are used.

We recommend the following micro-architectural commands after the design is built and tested. The commands below are for the memory subsystem.

inline [find -behavior *]
foreach arr [find -array *response_data_array_data_words] {restructure_array -3 $arr}
foreach loop [find -node "*UNROLL*_for_begin*"] {unroll_loop $loop}
foreach loop [find -node "*RESET*_for_begin*"] {unroll_loop $loop}

#flatten all arrays in memory subsystem that can be flattened

allocate_builtin_ram /designs/memcpy_acc/modules/memcpy_acc/arrays/inp_mem_in_acc_mem_in_available_slots_m_buf_tag
allocate_builtin_ram /designs/memcpy_acc/modules/memcpy_acc/arrays/inp_mem_in_acc_mem_in_request_queue_m_buf_tag
allocate_builtin_ram /designs/memcpy_acc/modules/memcpy_acc/arrays/inp_mem_in_acc_typed_in_out_requests_m_buf_addr_offset
allocate_builtin_ram /designs/memcpy_acc/modules/memcpy_acc/arrays/inp_mem_in_acc_typed_in_out_requests_m_buf_size
allocate_prototype_memory /designs/memcpy_acc/modules/memcpy_acc/arrays/inp_mem_in_acc_mem_in_response_data_array_data_words

Recommended SystemC Coding Style For Cycle-Accurate Design

HLS tools provide automatic pipelining capabilities. However, not all code can be pipelined automatically. Sometimes the user has to partition the design into pipeline stages explicitly in the code. Here we present a coding style for doing so that is compact and easy to read, generally called the anti-dependency style. We will use the simple code below as an example and rewrite it in the anti-dependency style. Note that this particular code has no issues being pipelined by the HLS tool.

while (1) {
  a = a_in.get();
  b = b_in.get();
  out.put((a+b)*C); /*C is a config input for example*/
  wait();
}

Suppose we would like to split the computation into two stages, where 'add' (a+b) is done in the first stage and 'multiply' (sum*C) in the second. The idea of the anti-dependency style is that the stages are coded in reverse order, which allows the use of regular C variables instead of sc_signals (which would represent the pipeline registers explicitly). 'valid_sum' and 'sum' are the two variables used to communicate between the pipeline stages. The back-pressure control will be handled by the HLS tool automatically.

bool valid_sum = false;
bool got_a, got_b;
int sum;
while (1) {
  // stage 2
  if (valid_sum && out.nb_can_put()) {
    out.nb_put(sum*C);
    valid_sum = false; // invalidate once used
  }

  // stage 1
  if (!valid_sum) {
    got_a = a_in.nb_get(a);
    got_b = b_in.nb_get(b);
    valid_sum = got_a && got_b;
    if (valid_sum)   // this condition is not strictly necessary, but it may be better power-wise
      sum = a+b;
  }
  wait();
}

Floating Point IP (or other FPGA IP) Support

To use floating-point or any other FPGA IP provided by QSys in the HLS flow, one has to use the RTL IP feature of the C-to-Silicon tool. Here we provide some details on how to integrate a QSys-generated floating-point arithmetic IP into a loop that is automatically pipelined by the HLS tool. The pipeline and its operations often have to support stalling, usually when there is back-pressure from the output. Back-pressure in SystemC often comes from a blocking put call on an output of a process. Here's an example of a loop that we would like to pipeline automatically in HLS while using a QSys IP:

float A, B;
float O;
while(1) {
  data_read = false;
  if (a_in.nb_can_get() && b_in.nb_can_get()) {
    data_read = true;
    a_in.nb_get(A);
    b_in.nb_get(B);
    O = A*B;
  }
  // pipe stages will be inserted here
HLS_EXPAND_HER: wait();
  if (data_read) {
    out.put(O);// blocking put - has a stalling loop
  }
}

The goal is to use a QSys floating-point IP in place of the A*B operation. For that, a few things have to be done:

  • Generate an IP in QSys for the floating-point multiplication operation
  • Modify the SystemC code to abstract away floating-point types and operations in a separate class/function
  • Apply the RTL IP flow and map the new function to the RTL IP generated by QSys

As mentioned previously, in the example above, the tool's pipelining feature requires that an RTL IP support back-pressure. In particular, the IP has to provide an additional input called 'stall'; when it is asserted, the pipeline has to stop executing (propagating data). In addition, the IP has to have two extra ports, valid_in and valid_out, to signify which tokens in the pipeline are functionally valid and which are "bubbles" that can be introduced by the HLS pipeline control logic (for example, during pipeline flushing).

FPGA IPs generated by QSys often have an "enable" input port that can logically be used as the "stall" port required by the HLS tool. To model valid token propagation, we propose using a shift register that also has to be stallable. A simple way to do so is to create another QSys IP called "shift register" with a depth equal to the latency of the main functional IP (in our case, the floating-point multiplier). The shift register also requires an enable port that is driven by the stall signal from HLS. Here's the wrapper that we will use in the HLS RTL IP flow.

[Figure: RTL IP integration in HLS]

Next, we show how the SystemC source code can be modified to get rid of floating-point objects. First, create a new class that carries 32-bit values and hides the details of the floating-point arithmetic. For example:

#include <cstring> // memcpy

struct MyFloatIP {
private:
  unsigned val : 32;
public:
  MyFloatIP() : val(0) {}
  MyFloatIP(unsigned val) : val(val) {}

#ifdef __SYNTHESIS__
  #pragma ctos dont_touch
  static MyFloatIP mul(MyFloatIP a, MyFloatIP b) {
    // for synthesis we will use the floating-point IP
  }
#else
  #pragma ctos dont_touch
  static MyFloatIP mul(MyFloatIP a, MyFloatIP b) {
    // for simulation we reinterpret the bits as float, multiply, and reinterpret back
    // (val is a bit-field, so its address cannot be taken; copy through temporaries)
    unsigned abits = a.val, bbits = b.val, obits;
    float af, bf, of;
    memcpy(&af, &abits, sizeof(af));
    memcpy(&bf, &bbits, sizeof(bf));
    of = af * bf;
    memcpy(&obits, &of, sizeof(of));
    return MyFloatIP(obits);
  }
#endif
};

Then, modify your original code to use the MyFloatIP class instead of float. That may in turn require changing the data types of other objects, such as ports carrying floating-point data.

MyFloatIP A, B;
MyFloatIP O;
while(1) {
  data_read = false;
  if (a_in.nb_can_get() && b_in.nb_can_get()) {
    data_read = true;
    a_in.nb_get(A);
    b_in.nb_get(B);
    O = MyFloatIP::mul(A,B);
  }
  // pipe stages will be inserted here
HLS_EXPAND_HER: wait();
  if (data_read) {
    out.put(O);// blocking put - has a stalling loop
  }
}

Finally, to use the RTL IP flow in the HLS tool, we need, in addition to the Verilog wrapper created in step 1, an XML file describing the IP. Here's an example of such an XML file for our example:

<?xml version="1.0"?>
<ctos_ip_definitions>

<rtl_ip_def>
<name>alt_fp_mult_ctos</name>
<rtl_filename>ip_rtl_wrapper.v</rtl_filename>
<rtl_language>verilog</rtl_language>
<rtl_type>pipelined</rtl_type>
<pipeline_depth>3</pipeline_depth>
<ports>
  <clock_port>
    <name>clk</name>
    <active_edge>rise</active_edge>
  </clock_port>

  <reset_port>
    <name>areset</name>
    <type>asynch</type>
    <active_level>low</active_level>
  </reset_port>

  <input_port>
    <name>a_val</name>
    <width>32</width>
  </input_port>

  <input_port>
    <name>b_val</name>
    <width>32</width>
  </input_port>

  <output_port>
    <name>mul_val_out</name>
    <width>32</width>
  </output_port>

  <valid_in_port>
    <name>valid_in</name>
  </valid_in_port>
  
  <valid_out_port>
    <name>valid_out</name>
  </valid_out_port>

  <stall_port>
    <name>stall</name>
  </stall_port>

</ports>
</rtl_ip_def>

</ctos_ip_definitions>