This repository extends the Verilog-to-Routing (VTR) framework (release v9.0.0) with new 3D-aware placement capabilities. These modifications are detailed in the paper "Beyond Flatland: A Placement Flow for 3D FPGAs" submitted to DAC'26.
This work introduces a novel placement flow specifically designed for 3D FPGAs, extending the standard 2D VTR placement algorithm with:
- 3D-aware partitioning using TritonPart for initial layer assignment
- Novel move generators optimized for 3D FPGA architectures
- Timing-aware layer optimization with dynamic parameter adjustment
- Support for multiple 3D interconnect strategies (connection-box, switch-box, and hybrid)
The flow improves critical path delay and wirelength over both the baseline 2D flow and a naive 3D flow; the DAC26_results/ directory holds the per-benchmark numbers.
This repository includes the following major enhancements to the base VTR framework:
Files: vpr/src/base/partition_creator.cpp, partitioning_engine.cpp, hyper_graph.cpp
- Post-packing hypergraph partitioning to assign clustered blocks to layers
- Integration with OpenROAD's TritonPart tool for timing-aware partitioning
- Criticality-driven partitioning to minimize inter-layer connections
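As a rough illustration of the objective behind criticality-driven partitioning, the sketch below scores a layer assignment by summing a per-net weight over every net whose blocks span more than one layer. The data layout, the function name `weighted_layer_cut`, and the weighting `1 + alpha * criticality` are assumptions made for this example; they are not taken from `partition_creator.cpp` or from TritonPart.

```python
from typing import Dict, List, Tuple

# Hypothetical net record: (criticality in [0, 1], names of the clustered blocks on the net).
Net = Tuple[float, List[str]]


def weighted_layer_cut(nets: List[Net], layer_of: Dict[str, int], alpha: float = 4.0) -> float:
    """Sum the weights of all nets whose blocks sit on more than one layer.

    The weight 1 + alpha * criticality is an illustrative choice: it makes
    cutting a timing-critical net cost more than cutting a slack-rich one.
    """
    total = 0.0
    for criticality, blocks in nets:
        layers = {layer_of[b] for b in blocks}
        if len(layers) > 1:  # the net needs at least one inter-layer connection
            total += 1.0 + alpha * criticality
    return total


nets = [
    (0.9, ["clb0", "clb1"]),           # timing-critical two-pin net
    (0.1, ["clb1", "dsp0", "bram0"]),  # non-critical three-pin net
]
# Assignment A cuts the critical net; assignment B cuts only the non-critical one.
assign_a = {"clb0": 0, "clb1": 1, "dsp0": 1, "bram0": 1}
assign_b = {"clb0": 0, "clb1": 0, "dsp0": 1, "bram0": 1}
cut_a = weighted_layer_cut(nets, assign_a)
cut_b = weighted_layer_cut(nets, assign_b)
print(cut_a, cut_b)
```

Under this toy cost, assignment B is preferred because it keeps the critical net on a single layer, which is the intuition behind weighting the cut by criticality.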
Directory: vpr/src/place/move_generators/
New move generator implementations designed for 3D placement:
- Layer Swap Moves (`layer_swap_move_generator.cpp`, `layer_swap_ranged_move_generator.cpp`): Probabilistically swap blocks between layers
- Probabilistic Layer Assignment (`*_probabilistic.cpp` variants): Centroid, median, and weighted variants that consider layer assignments
- No Layer Change Variants (`*_no_layer_change.cpp`): Traditional moves that respect initial layer constraints
- Critical Layer Moves (`critical_layer_move_generator.cpp`): Timing-driven layer reassignment
Command-line parameters:
- `--timing_tradeoff_adjustor`: Controls the timing vs. wirelength tradeoff dynamically during annealing (`linear`, `exponential`, etc.)
- `--timing_layer_weight_adjustor`: Adjusts the weight given to inter-layer connections during placement
- `--rl_agent_move_set`: Selects adaptive move generation strategies (e.g., `prob_layer_swap`)
- `--timing_tradeoff_start` / `--timing_tradeoff_end`: Start and end values for the timing tradeoff
- `--timing_layer_weight_start` / `--timing_layer_weight_end`: Start and end values for layer weight penalties
- `--timing_layer_weight_start_sr` / `--timing_layer_weight_end_sr`: Start and end acceptance rates for applying layer weight penalties
- `--timing_tradeoff_start_sr` / `--timing_tradeoff_end_sr`: Start and end acceptance rates for timing tradeoff variation
- `--partition_post_pack`: Partition the packed netlist using TritonPart
- `--soft_partitioning init_place`: Use the partitioning results only for initial placement, and relax the layer assignment constraints afterwards
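One plausible reading of the start/end and `_sr` parameters is that a value moves from its start to its end setting while the annealer's acceptance rate falls from the start rate to the end rate. The sketch below illustrates that reading only. The function name `scheduled_value`, the clamping outside the window, and the geometric form used for `exponential` are all assumptions of this example, not a description of the code in this repository.

```python
def scheduled_value(success_rate, start, end, start_sr, end_sr, mode="linear"):
    """Interpolate from `start` to `end` as the acceptance rate drops
    from `start_sr` to `end_sr` (assumed behaviour, for illustration)."""
    if start_sr <= end_sr:
        # Degenerate window: jump from start to end at the threshold.
        return end if success_rate <= end_sr else start
    # Progress through the window, clamped to [0, 1].
    t = (start_sr - success_rate) / (start_sr - end_sr)
    t = min(1.0, max(0.0, t))
    if mode == "linear":
        return start + t * (end - start)
    if mode == "exponential":
        # Geometric interpolation; only meaningful for positive start and end.
        return start * (end / start) ** t
    raise ValueError("unknown mode: " + mode)


# Timing tradeoff values from the LSTM reproduction command in this README.
for rate in (1.0, 0.575, 0.15):
    print(rate, scheduled_value(rate, 0.03, 0.51, 1.0, 0.15))
```

With these settings the tradeoff stays near 0.03 early in the anneal, when almost every move is accepted, and approaches 0.51 as the acceptance rate nears 0.15.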
Seven architecture configurations supporting different via placement strategies (see Architecture Files section below).
This work builds on the official VTR release v9.0.0:
🔗 VTR v9.0.0 Source
📘 VTR Documentation
- Operating System: Linux (64-bit) - tested on Ubuntu 20.04/22.04
- Compiler: GCC 9.0+ or Clang 10.0+ with C++17 support
- Memory: Minimum 8GB RAM (16GB+ recommended for larger benchmarks)
- Disk Space: ~10GB for full build and benchmarks
- Standard VTR Dependencies:
  - CMake 3.16+
  - Python 3.6+
  - Flex and Bison
  - Cairo graphics library
  - Eigen3
- OpenROAD (for TritonPart): This flow integrates with TritonPart for partitioning, which requires the OpenROAD toolchain. Please install OpenROAD following the instructions at:
  🔗 https://theopenroadproject.org/
After installing OpenROAD, you must update two file paths:
- Update the path to `tritonpart_run_script.sh`: edit `vpr/src/base/partition_creator.cpp` at line 100, `std::string triton_path = "/path/to/tritonpart_run_script.sh";`, and change the string to the absolute path of `tritonpart_run_script.sh` in your repository.
- Update the path to the `openroad` executable: edit `tritonpart_run_script.sh` at line 4, `OR_EXEC="/path/to/OpenROAD/build/bin/openroad"`, and change the value to point to your local OpenROAD installation.
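Before rebuilding, it can help to confirm that the two paths you are about to hard-code actually resolve. The helper below is a small sketch for that purpose; `check_tritonpart_paths` is a hypothetical name and is not part of this repository, and the paths passed to it are the same placeholders used above.

```python
import os
from typing import List


def check_tritonpart_paths(run_script: str, openroad_exec: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means both paths look usable."""
    problems = []
    if not os.path.isabs(run_script):
        problems.append("run script path is not absolute: " + run_script)
    if not os.path.isfile(run_script):
        problems.append("run script not found: " + run_script)
    if not (os.path.isfile(openroad_exec) and os.access(openroad_exec, os.X_OK)):
        problems.append("openroad is missing or not executable: " + openroad_exec)
    return problems


# Placeholder paths; substitute the values you intend to write into the two files.
for problem in check_tritonpart_paths(
    "/path/to/tritonpart_run_script.sh",
    "/path/to/OpenROAD/build/bin/openroad",
):
    print(problem)
```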
git clone <repository-url>
cd vtr-3d
git submodule init
git submodule update

For Ubuntu/Debian systems:
./install_apt_packages.sh

For other Linux distributions, please refer to the VTR Building Guide.
Create and activate a Python virtual environment:
make env
source .venv/bin/activate
pip install -r requirements.txt

Note: You will need to activate the virtual environment (`source .venv/bin/activate`) in each new terminal session before running VTR tools.

make -j$(nproc)

This will build all required tools including VPR, ODIN II, and ABC. The build process may take 10-30 minutes depending on your system.
Run a basic regression test to verify the build:
./vtr_flow/scripts/run_vtr_task.py ./vtr_flow/tasks/regression_tests/vtr_reg_basic/basic_timing

Expected output should show all tests passing with "OK" status.
The Architecture_Files/ directory contains seven 3D FPGA architecture configurations used in the paper experiments:
| Architecture File | Description | Via Strategy | Use Case |
|---|---|---|---|
| `cb_architecture.xml` | Connection-box based vias | Vias placed at connection boxes | Balanced approach for general circuits |
| `cb_i_architecture.xml` | CB-based with input optimization | Input-optimized via placement | Circuits with high fanin |
| `cb_o_architecture.xml` | CB-based with output optimization | Output-optimized via placement | Circuits with high fanout |
| `sb_architecture.xml` | Switch-box based vias | Vias placed at switch boxes | Maximum routing flexibility |
| `hybrid_architecture.xml` | Hybrid CB+SB approach | Mixed CB and SB via placement | Combined benefits of both strategies |
| `hybrid_i_architecture.xml` | Hybrid with input optimization | Input-optimized hybrid | High fanin with routing flexibility |
| `hybrid_o_architecture.xml` | Hybrid with output optimization | Output-optimized hybrid | High fanout with routing flexibility |
All architectures are based on the VTR k6_N10 architecture (K=6 LUTs, N=10 BLEs per CLB) with 40nm technology parameters, extended with 3D capabilities and modified DSP/BRAM blocks as described in the paper.
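Because the seven architecture files share one naming pattern, a sweep over architectures, benchmarks, and seeds can be generated rather than typed by hand. The sketch below only builds the command lines; the helper name `sweep_commands` is hypothetical, and the VPR binary location `./build/vpr/vpr` is assumed to match the commands used elsewhere in this README.

```python
import itertools
import shlex

# Prefixes of the seven files in Architecture_Files/ (<prefix>_architecture.xml).
ARCHITECTURES = ["cb", "cb_i", "cb_o", "sb", "hybrid", "hybrid_i", "hybrid_o"]


def sweep_commands(benchmarks, seeds, vpr="./build/vpr/vpr"):
    """Build one VPR argument list per (architecture, benchmark, seed) combination."""
    commands = []
    for arch, bench, seed in itertools.product(ARCHITECTURES, benchmarks, seeds):
        commands.append([
            vpr,
            "Architecture_Files/{}_architecture.xml".format(arch),
            "Benchmarks/{}.blif".format(bench),
            "--seed", str(seed),
        ])
    return commands


# Print shell-ready lines; pass each list to subprocess.run() to execute instead.
for cmd in sweep_commands(["lstm"], [1, 2]):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Placement flags such as `--rl_agent_move_set prob_layer_swap` can be appended to each list before running.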
The Benchmarks/ directory contains 20 machine learning and deep neural network accelerator benchmark circuits:
| Benchmark | Description | Type |
|---|---|---|
| `attention_layer.blif.tar.gz` | Transformer attention mechanism | ML Layer |
| `bnn.blif.tar.gz` | Binary Neural Network | Full Network |
| `clstm_like.small/medium/large.blif.tar.gz` | Convolutional LSTM variants | Sequence Processing |
| `conv_layer.blif.tar.gz` | Convolution layer | ML Layer |
| `conv_layer_hls.blif.tar.gz` | HLS-generated convolution | ML Layer |
| `dla_like.small/medium.blif.tar.gz` | Deep Learning Accelerator variants | DLA Core |
| `eltwise_layer.blif.tar.gz` | Element-wise operations | ML Layer |
| `gemm_layer.blif.tar.gz` | General Matrix Multiply | ML Layer |
| `lstm.blif.tar.gz` | Long Short-Term Memory | Sequence Processing |
| `reduction_layer.blif.tar.gz` | Reduction operations | ML Layer |
| `robot_rl.blif.tar.gz` | Reinforcement learning for robotics | RL Application |
| `softmax.blif.tar.gz` | Softmax activation | ML Layer |
| `spmv.blif.tar.gz` | Sparse Matrix-Vector Multiply | Linear Algebra |
| `tpu_like.small/large.os/ws.blif.tar.gz` | TPU-like systolic array variants | Systolic Array |
Note: All benchmarks are compressed as .tar.gz files and must be extracted before use (see Running Experiments section).
Before running experiments, extract the benchmark files:
cd Benchmarks
tar -xzf lstm.blif.tar.gz
cd ..

Extract all benchmarks at once:
cd Benchmarks
for f in *.tar.gz; do tar -xzf "$f"; done
cd ..

Run VPR with a 3D architecture:
./build/vpr/vpr Architecture_Files/cb_architecture.xml Benchmarks/lstm.blif \
--timing_tradeoff_adjustor linear \
--timing_layer_weight_adjustor linear \
--rl_agent_move_set prob_layer_swap \
--seed 1

To reproduce the results from Table 3 and Figure 7, use the optimized parameters found in the DAC26_results/ CSV files. For example, to reproduce the LSTM result with the CB architecture (seed 5):
./build/vpr/vpr Architecture_Files/cb_architecture.xml Benchmarks/lstm.blif \
--timing_tradeoff_adjustor linear \
--timing_layer_weight_adjustor linear \
--rl_agent_move_set prob_layer_swap \
--timing_tradeoff_start 0.03 \
--timing_tradeoff_end 0.51 \
--timing_tradeoff_start_sr 1.0 \
--timing_tradeoff_end_sr 0.15 \
--timing_layer_weight_start 1.6 \
--timing_layer_weight_end 1.0 \
--timing_layer_weight_start_sr 0.41 \
--timing_layer_weight_end_sr 0.32 \
--partition_post_pack \
--soft_partitioning init_place \
--route_chan_width 300 \
--seed 5

Test different 3D interconnect strategies:
# Switch-box based
./build/vpr/vpr Architecture_Files/sb_architecture.xml Benchmarks/bnn.blif --seed 1
# Hybrid approach
./build/vpr/vpr Architecture_Files/hybrid_architecture.xml Benchmarks/attention_layer.blif --seed 1
# Input-optimized hybrid
./build/vpr/vpr Architecture_Files/hybrid_i_architecture.xml Benchmarks/gemm_layer.blif --seed 1

The DAC26_results/ directory contains experimental results presented in the paper:
Files prefixed with vtr9_ contain baseline results using vanilla VTR v9.0.0:
- `vtr9_2d_results.csv`: 2D FPGA baseline (no 3D capabilities)
- `vtr9_3d_cb_results.csv`: VTR v9.0.0 with 3D CB architecture
- `vtr9_3d_cb_i_results.csv`: VTR v9.0.0 with 3D CB input-optimized architecture
- `vtr9_3d_cb_o_results.csv`: VTR v9.0.0 with 3D CB output-optimized architecture
- `vtr9_3d_sb_results.csv`: VTR v9.0.0 with 3D SB architecture
- `vtr9_hybrid_results.csv`: VTR v9.0.0 with 3D hybrid architecture
- `vtr9_hybrid_i_results.csv`: VTR v9.0.0 with 3D hybrid input-optimized architecture
- `vtr9_hybrid_o_results.csv`: VTR v9.0.0 with 3D hybrid output-optimized architecture
- `vtr9_cb_runtime_results.csv`: VTR v9.0.0 with 3D CB architecture, with all runtime values
Files prefixed with our_ contain results using the proposed 3D-aware placement flow:
- `our_3d_cb_results.csv`: Proposed flow with CB architecture
- `our_3d_cb_i_results.csv`: Proposed flow with CB input-optimized architecture
- `our_3d_cb_o_results.csv`: Proposed flow with CB output-optimized architecture
- `our_3d_sb_results.csv`: Proposed flow with SB architecture
- `our_hybrid_results.csv`: Proposed flow with hybrid architecture
- `our_hybrid_i_results.csv`: Proposed flow with hybrid input-optimized architecture
- `our_hybrid_o_results.csv`: Proposed flow with hybrid output-optimized architecture
- `our_cb_runtime_results.csv`: Proposed flow with CB architecture, with all runtime values
Note: The non-'runtime' result files were generated on a busy shared server where multiple jobs were running simultaneously. As a result, the runtime values in those files may be unreliable due to varying levels of CPU contention. In contrast, the 'runtime' files were produced on an otherwise idle server, with only the experiment workloads running, ensuring that their runtime measurements are consistent and not affected by external system load.
Each CSV file contains the following columns:
- `success`: Boolean indicating successful completion
- `return_code`: Exit code (0 = success)
- `seed`: Random seed used for the run
- `blif_file`: Benchmark circuit name
- `cpd`: Critical path delay in nanoseconds
- `wl`: Total wirelength
- `runtime`: Total runtime in seconds
- `placement_runtime`: Placement-only runtime in seconds
- `config`: Dictionary of placement parameters used
- `error`: Error message (if any)
Runtime files have the following additional columns:
- `packing_runtime`: Packing time in seconds
- `load_packing_runtime`: Time spent loading the packing results, in seconds
- `partitioning_runtime`: Time spent by TritonPart in seconds
- `create_device_runtime`: Time spent creating the FPGA grid data structure in seconds
- `router_lookahead_runtime`: Time spent creating the lookahead delay table in seconds
- `routing_runtime`: Routing time in seconds
- `analysis_runtime`: Analysis time in seconds
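A short script can turn two of these CSV files into a per-benchmark comparison. The sketch below averages `cpd` over successful runs and reports the relative change. It assumes that `success` is stored as the text `True` or `False`, which should be checked against the real files, and the rows shown are synthetic values for illustration, not results from the paper.

```python
import csv
import io
import statistics
from collections import defaultdict


def mean_cpd_by_benchmark(csv_file):
    """Return {blif_file: mean cpd in ns} over the successful runs in an open CSV file."""
    samples = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Assumption: successful runs are marked with the text "True".
        if row["success"].strip().lower() != "true":
            continue
        samples[row["blif_file"]].append(float(row["cpd"]))
    return {name: statistics.mean(values) for name, values in samples.items()}


# Synthetic rows for illustration only; these are NOT numbers from the paper.
baseline = io.StringIO(
    "success,seed,blif_file,cpd,wl\n"
    "True,1,lstm.blif,10.0,1000\n"
    "True,2,lstm.blif,12.0,1100\n"
)
proposed = io.StringIO(
    "success,seed,blif_file,cpd,wl\n"
    "True,1,lstm.blif,8.0,900\n"
    "False,2,lstm.blif,0.0,0\n"
    "True,3,lstm.blif,9.6,950\n"
)

base = mean_cpd_by_benchmark(baseline)
ours = mean_cpd_by_benchmark(proposed)
for name in sorted(base.keys() & ours.keys()):
    change = 100.0 * (ours[name] - base[name]) / base[name]
    print("{}: {:.2f} ns -> {:.2f} ns ({:+.1f}%)".format(name, base[name], ours[name], change))
```

To use real data, replace the `io.StringIO` objects with files opened via `open(path, newline="")`, for example a `vtr9_*.csv` file and the matching `our_*.csv` file.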
- Table 3 presents aggregate statistics (mean/median CPD and wirelength) across architectures and benchmarks
- Figure 7 shows placement quality convergence during the annealing process for selected benchmarks
- TritonPart Dependency: While TritonPart integration is included in the code, the partitioning step can be disabled by commenting out the partitioning calls in `partition_creator.cpp`. The flow will still benefit from the 3D-aware move generators.
- Runtime: Large benchmarks (e.g., `clstm_like.large`, `tpu_like.large`) may require several hours for full place-and-route. For quick testing, use smaller benchmarks like `softmax` or `attention_layer`.
- Determinism: Results are deterministic given the same seed value. Multiple seeds were used in experiments to account for placement algorithm randomness.
- Memory Usage: Large benchmarks may require 16GB+ RAM. If experiencing out-of-memory errors, try smaller benchmark variants or increase system swap space.
For full reproducibility of paper results:
- Use the exact parameter configurations from the `config` column in the results CSVs
- Match the seed values (seeds 1-10 were used for most experiments)
- Use the same architecture file as indicated by the results filename
- Extract timing and wirelength from VPR output or use VTR flow scripts for automated result collection
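Re-running a row from the results means turning its `config` entry back into command-line flags. The sketch below shows one way to do that, assuming the cell holds a Python-style dictionary literal keyed by flag name without the leading dashes. Both the cell format and the helper name `config_to_args` are assumptions of this example, so inspect the CSVs before relying on it.

```python
import ast


def config_to_args(config_text):
    """Turn a stringified dict of placement parameters into a list of VPR arguments."""
    config = ast.literal_eval(config_text)  # safe parsing of a literal, no code execution
    args = []
    for key, value in config.items():
        flag = "--" + key.lstrip("-")
        if value is True:
            args.append(flag)  # bare switch such as --partition_post_pack
        elif value is False or value is None:
            continue  # option not set for this run
        else:
            args.extend([flag, str(value)])
    return args


# Hypothetical cell contents; the real CSVs may use different keys or formatting.
cell = "{'timing_tradeoff_adjustor': 'linear', 'timing_tradeoff_start': 0.03, 'partition_post_pack': True}"
print(config_to_args(cell))
```

The resulting list can be appended to the VPR binary, architecture file, benchmark, and `--seed` arguments and passed to `subprocess.run`.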
- Start with smaller benchmarks (`softmax`, `attention_layer`) to verify the setup
- Compare `vtr9_*.csv` vs `our_*.csv` files to see improvements
- The `placement_runtime` values show the efficiency of the placement algorithm
- Critical path delay (`cpd`) improvements of 10-30% are typical for ML workloads
This work builds upon VTR v9.0.0. For more information about the VTR project:
- VTR Project: https://verilogtorouting.org/
- VTR v9.0.0 Release: https://github.com/verilog-to-routing/vtr-verilog-to-routing/releases/tag/v9.0.0
The partitioning functionality integrates OpenROAD's TritonPart:
- OpenROAD Project: https://theopenroadproject.org/
- TritonPart: Part of the OpenROAD physical design toolchain
This project maintains the VTR MIT License. See LICENSE.md for full details.
The software is provided "as is" without warranty of any kind. All modifications and extensions to VTR are provided under the same MIT License terms.
For questions or issues related to this artifact, please open an issue in the repository.