Beyond Flatland: A Placement Flow for 3D FPGAs

This repository extends the Verilog-to-Routing (VTR) framework (release v9.0.0) with new 3D-aware placement capabilities. These modifications are detailed in the paper "Beyond Flatland: A Placement Flow for 3D FPGAs" submitted to DAC'26.

Overview

This work introduces a novel placement flow specifically designed for 3D FPGAs, extending the standard 2D VTR placement algorithm with:

  • 3D-aware partitioning using TritonPart for initial layer assignment
  • Novel move generators optimized for 3D FPGA architectures
  • Timing-aware layer optimization with dynamic parameter adjustment
  • Support for multiple 3D interconnect strategies (connection-box, switch-box, and hybrid)

The implementation demonstrates significant improvements in critical path delay and wirelength compared to baseline 2D and naive 3D approaches.

Key Modifications to VTR v9.0.0

This repository includes the following major enhancements to the base VTR framework:

1. TritonPart Integration for Layer Assignment

Files: vpr/src/base/partition_creator.cpp, partitioning_engine.cpp, hyper_graph.cpp

  • Post-packing hypergraph partitioning to assign clustered blocks to layers
  • Integration with OpenROAD's TritonPart tool for timing-aware partitioning
  • Criticality-driven partitioning to minimize inter-layer connections
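Hypergraph partitioners commonly read the hMETIS text format, in which each net becomes one hyperedge line of 1-indexed vertex IDs, optionally preceded by an integer weight. The sketch below shows how a clustered netlist could be serialized that way. It is an illustration only: the function name `to_hmetis`, the input layout, and the use of integer criticality weights are assumptions, and the repository's own serialization in hyper_graph.cpp may differ.

```python
def to_hmetis(nets, weights=None):
    """Serialize nets (lists of block names) as hMETIS hypergraph text.

    Hypothetical helper: returns (text, index), where index maps each
    block name to its 1-indexed vertex ID.
    """
    index = {}
    for net in nets:
        for block in net:
            # hMETIS vertex IDs start at 1; assign them in first-seen order.
            index.setdefault(block, len(index) + 1)

    # Header: <num_hyperedges> <num_vertices> [fmt]; fmt=1 marks weighted hyperedges.
    header = "{} {}".format(len(nets), len(index))
    if weights:
        header += " 1"

    lines = [header]
    for i, net in enumerate(nets):
        fields = [str(index[block]) for block in net]
        if weights:
            # With fmt=1, the first integer on each line is the hyperedge weight.
            fields.insert(0, str(weights[i]))
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n", index
```

Giving timing-critical nets a larger weight makes cutting them more expensive, which is one way to discourage critical connections from crossing layers.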

2. 3D-Aware Placement Move Generators

Directory: vpr/src/place/move_generators/

New move generator implementations designed for 3D placement:

  • Layer Swap Moves (layer_swap_move_generator.cpp, layer_swap_ranged_move_generator.cpp): Probabilistically swap blocks between layers
  • Probabilistic Layer Assignment (*_probabilistic.cpp variants): Centroid, median, and weighted variants that consider layer assignments
  • No Layer Change Variants (*_no_layer_change.cpp): Traditional moves that respect initial layer constraints
  • Critical Layer Moves (critical_layer_move_generator.cpp): Timing-driven layer reassignment

3. Enhanced Timing Optimization

Command-line parameters:

  • --timing_tradeoff_adjustor: Controls timing vs wirelength tradeoff dynamically during annealing (linear, exponential, etc.)
  • --timing_layer_weight_adjustor: Adjusts the weight given to inter-layer connections during placement
  • --rl_agent_move_set: Selects adaptive move generation strategies (e.g., prob_layer_swap)
  • --timing_tradeoff_start/end: Start and end values for timing tradeoff
  • --timing_layer_weight_start/end: Start and end values for layer weight penalties
  • --timing_layer_weight_start_sr/end_sr: Move acceptance rates at which the layer weight penalty adjustment starts and ends
  • --timing_tradeoff_start_sr/end_sr: Move acceptance rates at which the timing tradeoff variation starts and ends
  • --partition_post_pack: Partition the packed netlist using TritonPart
  • --soft_partitioning init_place: Uses the partitioning results only for initial placement and relaxes the layer assignment constraints afterward.
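To make the start/end and start_sr/end_sr parameters concrete, here is one plausible reading of a linear adjustor. It is a sketch, not the repository's implementation. It assumes that "sr" is the annealer's move acceptance rate, that the rate falls from start_sr to end_sr as annealing proceeds, and that the parameter is held constant outside that window. The function name `linear_adjust` is hypothetical.

```python
def linear_adjust(acceptance_rate, start_value, end_value, start_sr, end_sr):
    """Interpolate a parameter linearly as the acceptance rate moves from start_sr to end_sr.

    Assumed semantics only; check the placer source for the real behavior.
    """
    if start_sr == end_sr:
        return end_value
    # Fraction of the [start_sr, end_sr] window already traversed, clamped to [0, 1].
    progress = (start_sr - acceptance_rate) / (start_sr - end_sr)
    progress = min(1.0, max(0.0, progress))
    return start_value + progress * (end_value - start_value)


# Example with the timing tradeoff values used later in this README:
# start 0.03, end 0.51, start_sr 1.0, end_sr 0.15.
for rate in (1.0, 0.575, 0.15, 0.05):
    print(rate, round(linear_adjust(rate, 0.03, 0.51, 1.0, 0.15), 3))
```

Under these assumptions the timing tradeoff stays at 0.03 while every move is accepted, reaches the midpoint 0.27 at an acceptance rate of 0.575, and settles at 0.51 once the rate drops to 0.15 or below.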

4. 3D Interconnect Architecture Support

Seven architecture configurations supporting different via placement strategies (see Architecture Files section below).

Base Framework

This work builds on the official VTR release v9.0.0:
🔗 VTR v9.0.0 Source
📘 VTR Documentation

Prerequisites

System Requirements

  • Operating System: Linux (64-bit) - tested on Ubuntu 20.04/22.04
  • Compiler: GCC 9.0+ or Clang 10.0+ with C++17 support
  • Memory: Minimum 8GB RAM (16GB+ recommended for larger benchmarks)
  • Disk Space: ~10GB for full build and benchmarks

Required Dependencies

  1. Standard VTR Dependencies:

    • CMake 3.16+
    • Python 3.6+
    • Flex and Bison
    • Cairo graphics library
    • Eigen3
  2. OpenROAD (for TritonPart):

    This flow integrates with TritonPart for partitioning, which requires the OpenROAD toolchain.
    Please install OpenROAD following the instructions at:
    🔗 https://theopenroadproject.org/

Configuration

After installing OpenROAD, you must update two file paths:

  1. Update the path to tritonpart_run_script.sh:

    Edit vpr/src/base/partition_creator.cpp at line 100:

    std::string triton_path = "/path/to/tritonpart_run_script.sh";

    Change this to the absolute path of tritonpart_run_script.sh in your repository.

  2. Update the path to the openroad executable:

    Edit tritonpart_run_script.sh at line 4:

    OR_EXEC="/path/to/OpenROAD/build/bin/openroad"

    Change this to point to your local OpenROAD installation.
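A wrong OR_EXEC path may only surface once partitioning runs, so it can help to check it up front. The sketch below reads the OR_EXEC assignment from tritonpart_run_script.sh and verifies that it points to an executable file. The helper names `read_or_exec` and `check_or_exec` are hypothetical, and the script is assumed to contain a plain `OR_EXEC=...` assignment as shown above. It does not check the path hard-coded in partition_creator.cpp.

```python
import os
import re


def read_or_exec(script_text):
    """Return the value assigned to OR_EXEC in the script text, or None if absent."""
    match = re.search(
        r'^\s*OR_EXEC=(["\']?)(.*?)\1\s*$', script_text, flags=re.MULTILINE
    )
    return match.group(2) if match else None


def check_or_exec(script_path="tritonpart_run_script.sh"):
    """Report whether OR_EXEC in the given script points to an executable file."""
    with open(script_path, encoding="utf-8") as handle:
        or_exec = read_or_exec(handle.read())
    if or_exec is None:
        return "OR_EXEC is not set in " + script_path
    if not (os.path.isfile(or_exec) and os.access(or_exec, os.X_OK)):
        return "OR_EXEC does not point to an executable: " + or_exec
    return "OK: " + or_exec
```

Running `print(check_or_exec())` from the repository root reports the result for the default script location.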

Building the Tool

Step 1: Clone and Initialize

git clone <repository-url> vtr-3d
cd vtr-3d
git submodule init
git submodule update

Step 2: Install System Dependencies

For Ubuntu/Debian systems:

./install_apt_packages.sh

For other Linux distributions, please refer to the VTR Building Guide.

Step 3: Setup Python Environment

Create and activate a Python virtual environment:

make env
source .venv/bin/activate
pip install -r requirements.txt

Note: You will need to activate the virtual environment (source .venv/bin/activate) in each new terminal session before running VTR tools.

Step 4: Build VTR

make -j$(nproc)

This will build all required tools including VPR, ODIN II, and ABC. The build process may take 10-30 minutes depending on your system.

Step 5: Verify Installation

Run a basic regression test to verify the build:

./vtr_flow/scripts/run_vtr_task.py ./vtr_flow/tasks/regression_tests/vtr_reg_basic/basic_timing

The output should show every test passing with an "OK" status.

Architecture Files

The Architecture_Files/ directory contains seven 3D FPGA architecture configurations used in the paper experiments:

| Architecture File | Description | Via Strategy | Use Case |
|---|---|---|---|
| cb_architecture.xml | Connection-box based vias | Vias placed at connection boxes | Balanced approach for general circuits |
| cb_i_architecture.xml | CB-based with input optimization | Input-optimized via placement | Circuits with high fanin |
| cb_o_architecture.xml | CB-based with output optimization | Output-optimized via placement | Circuits with high fanout |
| sb_architecture.xml | Switch-box based vias | Vias placed at switch boxes | Maximum routing flexibility |
| hybrid_architecture.xml | Hybrid CB+SB approach | Mixed CB and SB via placement | Combined benefits of both strategies |
| hybrid_i_architecture.xml | Hybrid with input optimization | Input-optimized hybrid | High fanin with routing flexibility |
| hybrid_o_architecture.xml | Hybrid with output optimization | Output-optimized hybrid | High fanout with routing flexibility |

All architectures are based on the VTR k6_N10 architecture (K=6 LUTs, N=10 BLEs per CLB) with 40nm technology parameters, extended with 3D capabilities and modified DSP/BRAM blocks as described in the paper.

Benchmarks

The Benchmarks/ directory contains 20 machine learning and deep neural network accelerator benchmark circuits:

| Benchmark | Description | Type |
|---|---|---|
| attention_layer.blif.tar.gz | Transformer attention mechanism | ML Layer |
| bnn.blif.tar.gz | Binary Neural Network | Full Network |
| clstm_like.small/medium/large.blif.tar.gz | Convolutional LSTM variants | Sequence Processing |
| conv_layer.blif.tar.gz | Convolution layer | ML Layer |
| conv_layer_hls.blif.tar.gz | HLS-generated convolution | ML Layer |
| dla_like.small/medium.blif.tar.gz | Deep Learning Accelerator variants | DLA Core |
| eltwise_layer.blif.tar.gz | Element-wise operations | ML Layer |
| gemm_layer.blif.tar.gz | General Matrix Multiply | ML Layer |
| lstm.blif.tar.gz | Long Short-Term Memory | Sequence Processing |
| reduction_layer.blif.tar.gz | Reduction operations | ML Layer |
| robot_rl.blif.tar.gz | Reinforcement learning for robotics | RL Application |
| softmax.blif.tar.gz | Softmax activation | ML Layer |
| spmv.blif.tar.gz | Sparse Matrix-Vector Multiply | Linear Algebra |
| tpu_like.small/large.os/ws.blif.tar.gz | TPU-like systolic array variants | Systolic Array |

Note: All benchmarks are compressed as .tar.gz files and must be extracted before use (see Running Experiments section).

Running Experiments

Extracting Benchmarks

Before running experiments, extract the benchmark files. To extract a single benchmark:

cd Benchmarks
tar -xzf lstm.blif.tar.gz
cd ..

To extract all benchmarks at once:

cd Benchmarks
for f in *.tar.gz; do tar -xzf "$f"; done
cd ..

Basic 3D Placement Run

Run VPR with a 3D architecture:

./build/vpr/vpr Architecture_Files/cb_architecture.xml Benchmarks/lstm.blif \
    --timing_tradeoff_adjustor linear \
    --timing_layer_weight_adjustor linear \
    --rl_agent_move_set prob_layer_swap \
    --seed 1

Reproducing Paper Results

To reproduce the results from Table 3 and Figure 7, use the optimized parameters found in the DAC26_results/ CSV files. For example, to reproduce the LSTM result with CB architecture (seed 5):

./build/vpr/vpr Architecture_Files/cb_architecture.xml Benchmarks/lstm.blif \
    --timing_tradeoff_adjustor linear \
    --timing_layer_weight_adjustor linear \
    --rl_agent_move_set prob_layer_swap \
    --timing_tradeoff_start 0.03 \
    --timing_tradeoff_end 0.51 \
    --timing_tradeoff_start_sr 1.0 \
    --timing_tradeoff_end_sr 0.15 \
    --timing_layer_weight_start 1.6 \
    --timing_layer_weight_end 1.0 \
    --timing_layer_weight_start_sr 0.41 \
    --timing_layer_weight_end_sr 0.32 \
    --partition_post_pack \
    --soft_partitioning init_place \
    --route_chan_width 300 \
    --seed 5
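A command line like the one above can be assembled from a stored parameter dictionary instead of being typed by hand. The sketch below is a minimal illustration. It assumes the dictionary is available as a Python-literal string whose keys are the flag names without the leading dashes, which should be verified against the actual CSV contents. The function name `build_vpr_command` is hypothetical.

```python
import ast
import shlex


def build_vpr_command(arch_file, blif_file, config_text, seed, vpr="./build/vpr/vpr"):
    """Build a VPR argument list from a dictionary literal of placement parameters.

    Assumption: keys are flag names without the leading "--".
    """
    config = ast.literal_eval(config_text)  # parses literals only, never executes code
    command = [vpr, arch_file, blif_file]
    for name, value in config.items():
        flag = "--" + name
        if value is True:
            command.append(flag)  # switch without a value, e.g. --partition_post_pack
        elif value is False or value is None:
            continue  # disabled or unset options are omitted
        else:
            command.extend([flag, str(value)])
    command.extend(["--seed", str(seed)])
    return command


example = "{'timing_tradeoff_adjustor': 'linear', 'timing_tradeoff_start': 0.03, 'partition_post_pack': True}"
args = build_vpr_command("Architecture_Files/cb_architecture.xml", "Benchmarks/lstm.blif", example, 5)
print(" ".join(shlex.quote(part) for part in args))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.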

Running with Different Architectures

Test different 3D interconnect strategies:

# Switch-box based
./build/vpr/vpr Architecture_Files/sb_architecture.xml Benchmarks/bnn.blif --seed 1

# Hybrid approach
./build/vpr/vpr Architecture_Files/hybrid_architecture.xml Benchmarks/attention_layer.blif --seed 1

# Input-optimized hybrid
./build/vpr/vpr Architecture_Files/hybrid_i_architecture.xml Benchmarks/gemm_layer.blif --seed 1

Results Files

The DAC26_results/ directory contains experimental results presented in the paper:

Baseline Results

Files prefixed with vtr9_ contain baseline results using vanilla VTR v9.0.0:

  • vtr9_2d_results.csv: 2D FPGA baseline (no 3D capabilities)
  • vtr9_3d_cb_results.csv: VTR v9.0.0 with 3D CB architecture
  • vtr9_3d_cb_i_results.csv: VTR v9.0.0 with 3D CB input-optimized architecture
  • vtr9_3d_cb_o_results.csv: VTR v9.0.0 with 3D CB output-optimized architecture
  • vtr9_3d_sb_results.csv: VTR v9.0.0 with 3D SB architecture
  • vtr9_hybrid_results.csv: VTR v9.0.0 with 3D hybrid architecture
  • vtr9_hybrid_i_results.csv: VTR v9.0.0 with 3D hybrid input-optimized architecture
  • vtr9_hybrid_o_results.csv: VTR v9.0.0 with 3D hybrid output-optimized architecture
  • vtr9_cb_runtime_results.csv: VTR v9.0.0 with 3D CB architecture, with all runtime values

Our Results

Files prefixed with our_ contain results using the proposed 3D-aware placement flow:

  • our_3d_cb_results.csv: Proposed flow with CB architecture
  • our_3d_cb_i_results.csv: Proposed flow with CB input-optimized architecture
  • our_3d_cb_o_results.csv: Proposed flow with CB output-optimized architecture
  • our_3d_sb_results.csv: Proposed flow with SB architecture
  • our_hybrid_results.csv: Proposed flow with hybrid architecture
  • our_hybrid_i_results.csv: Proposed flow with hybrid input-optimized architecture
  • our_hybrid_o_results.csv: Proposed flow with hybrid output-optimized architecture
  • our_cb_runtime_results.csv: Proposed flow with CB architecture, with all runtime values

Note: The non-runtime result files were generated on a busy shared server running multiple jobs simultaneously, so the runtime values in those files may be unreliable due to varying CPU contention. The runtime files were produced on an otherwise idle server running only the experiment workloads, so their runtime measurements are consistent and unaffected by external system load.

CSV File Format

Each CSV file contains the following columns:

  • success: Boolean indicating successful completion
  • return_code: Exit code (0 = success)
  • seed: Random seed used for the run
  • blif_file: Benchmark circuit name
  • cpd: Critical path delay in nanoseconds
  • wl: Total wirelength
  • runtime: Total runtime in seconds
  • placement_runtime: Placement-only runtime in seconds
  • config: Dictionary of placement parameters used
  • error: Error message (if any)

Runtime files have the following additional columns:

  • packing_runtime: Packing time in seconds
  • load_packing_runtime: Packing loading time in seconds
  • partitioning_runtime: Time spent by TritonPart in seconds
  • create_device_runtime: Time spent creating FPGA grid data structure in seconds
  • router_lookahead_runtime: Time spent to create lookahead delay table in seconds
  • routing_runtime: Routing time in seconds
  • analysis_runtime: Analysis time in seconds
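The columns above are enough to compute per-benchmark averages with the Python standard library. The sketch below reads one results file and averages cpd and wl over successful runs only. It assumes the success column is stored as the text True or False. The function name `summarize_results` is hypothetical, and the sample rows are made-up values used only to exercise the code.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_results(csv_file):
    """Return {benchmark: (mean_cpd, mean_wl, runs)} over successful runs."""
    cpd = defaultdict(list)
    wl = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Assumption: success is serialized as "True"/"False".
        if row["success"].strip().lower() != "true":
            continue
        cpd[row["blif_file"]].append(float(row["cpd"]))
        wl[row["blif_file"]].append(float(row["wl"]))
    return {
        name: (statistics.mean(cpd[name]), statistics.mean(wl[name]), len(cpd[name]))
        for name in cpd
    }


# Made-up sample rows, not taken from the paper's results.
sample = io.StringIO(
    "success,return_code,seed,blif_file,cpd,wl\n"
    "True,0,1,lstm.blif,10.0,1000\n"
    "True,0,2,lstm.blif,12.0,1200\n"
    "False,1,3,lstm.blif,,\n"
)
print(summarize_results(sample))
```

To process a real file, open it with `open(path, newline="")` and pass the file object to `summarize_results`.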

Paper Correspondence

  • Table 3 presents aggregate statistics (mean/median CPD and wirelength) across architectures and benchmarks
  • Figure 7 shows placement quality convergence during the annealing process for selected benchmarks

Additional Notes for Reviewers

Important Configuration Notes

  1. TritonPart Dependency: While TritonPart integration is included in the code, the partitioning step can be disabled by commenting out the partitioning calls in partition_creator.cpp. The flow will still benefit from the 3D-aware move generators.

  2. Runtime: Large benchmarks (e.g., clstm_like.large, tpu_like.large) may require several hours for full place-and-route. For quick testing, use smaller benchmarks like softmax or attention_layer.

  3. Determinism: Results are deterministic given the same seed value. Multiple seeds were used in experiments to account for placement algorithm randomness.

  4. Memory Usage: Large benchmarks may require 16GB+ RAM. If experiencing out-of-memory errors, try smaller benchmark variants or increase system swap space.

Reproducing Results

For full reproducibility of paper results:

  1. Use the exact parameter configurations from the config column in the results CSVs
  2. Match the seed values (seeds 1-10 were used for most experiments)
  3. Use the same architecture file as indicated by the results filename
  4. Extract timing and wirelength from VPR output or use VTR flow scripts for automated result collection
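For step 4, critical path delay and wirelength can be pulled from a VPR log with two regular expressions. The sketch below assumes the log contains lines of the form "Final critical path delay (least slack): <value> ns" and "Total wirelength: <value>", as commonly printed by VPR. Check these patterns against your own log before relying on them. The function name `extract_qor` is hypothetical.

```python
import re

# Assumed log line formats; adjust if your VPR output differs.
CPD_PATTERN = re.compile(
    r"Final critical path delay \(least slack\):\s*([0-9.eE+-]+)\s*ns"
)
WL_PATTERN = re.compile(r"Total wirelength:\s*(\d+)")


def extract_qor(log_text):
    """Return (cpd_ns, wirelength) from VPR log text; a missing metric is None."""
    cpd_matches = CPD_PATTERN.findall(log_text)
    wl_matches = WL_PATTERN.findall(log_text)
    # Use the last occurrence so the final reported value wins.
    cpd = float(cpd_matches[-1]) if cpd_matches else None
    wl = int(wl_matches[-1]) if wl_matches else None
    return cpd, wl
```

Typical usage is `extract_qor(open("vpr_stdout.log").read())`, where the log file name depends on how VPR was invoked.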

Tips for Evaluation

  • Start with smaller benchmarks (softmax, attention_layer) to verify the setup
  • Compare vtr9_*.csv vs our_*.csv files to see improvements
  • Use the placement_runtime values in the *_runtime_results.csv files to compare placement efficiency, since runtimes in the other files were measured under server load
  • Critical path delay (cpd) improvements of 10-30% are typical for ML workloads

Citation and License

Base VTR Framework

This work builds upon VTR v9.0.0. For more information, see the VTR project documentation.

TritonPart

The partitioning functionality integrates OpenROAD's TritonPart.

License

This project maintains the VTR MIT License. See LICENSE.md for full details.

The software is provided "as is" without warranty of any kind. All modifications and extensions to VTR are provided under the same MIT License terms.


For questions or issues related to this artifact, please open an issue in the repository.
