
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Weaving


wu-kan/GoPTX


GoPTX

GoPTX fuses two concurrent GPU kernels at the PTX level to improve instruction-level parallelism (ILP). For details, see our DAC'25 paper "GoPTX: Fine-grained GPU Kernel Fusion by PTX Level Instruction Flow Weaving". The source code and Docker image will come soon.

Quick start

Docker

flowchart LR
  subgraph GoPTX_link
    c.ptx
  end
  subgraph GoPTX_merge
    d.ptx
  end
  a.ptx--suffix_a-->c.ptx
  b.ptx--suffix_b-->c.ptx
  c.ptx--regex_merged-->d.ptx
  c.ptx--regex_a-->d.ptx
  c.ptx--regex_b-->d.ptx

Workflow:

  1. If the two kernels are not in the same PTX file, combine the files with GoPTX_link. It can append a suffix to the names from each file to avoid name conflicts.
  2. GoPTX_merge uses two regexes to select the input kernels and a third pattern to generate the name of the fused kernel.
docker run \
  --runtime=nvidia \
  --gpus "device=0" \
  --name goptx \
  -it \
  wukan0621/goptx:v0.0.1-dac25 sh
cd /GoPTX_nvcc/utils
GoPTX_link a.ptx b.ptx _sa _sb > c.ptx
GoPTX_merge c.ptx \$1_\$2_merged namea nameb > d.ptx
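
To make the renaming in the GoPTX_link step concrete, here is a toy Python sketch of suffixing. It is an illustration only, not GoPTX's implementation: the helper `add_suffix`, its two regexes, and the un-suffixed sample input are all assumptions, reconstructed by stripping `_sa` from the names that appear in the linked output.

```python
import re

def add_suffix(ptx: str, suffix: str) -> str:
    """Toy renamer (hypothetical): append `suffix` to mangled names and registers."""
    # Mangled C++ identifiers (kernel and parameter names) start with _Z.
    ptx = re.sub(r"\b(_Z\w+)", lambda m: m.group(1) + suffix, ptx)
    # Registers look like %rd1 in instructions or %rd<3> in declarations.
    # The lookahead keeps special registers such as %tid.x untouched.
    ptx = re.sub(r"%([a-z]+)(?=\d|<)", lambda m: "%" + m.group(1) + suffix, ptx)
    return ptx

# Assumed shape of a kernel before linking (not taken from the repository).
SAMPLE = """.visible .entry _Z5nameaPi(.param .u64 _Z5nameaPi_param_0){
.reg .b64 %rd<3>;
ld.param.u64 %rd1, [_Z5nameaPi_param_0];
ret;
}"""

print(add_suffix(SAMPLE, "_sa"))
```

A real tool would need a proper PTX parser rather than regexes. This sketch only renames identifiers that start with `_Z` and registers whose names are lowercase letters followed by a number.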

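The output below shows how GoPTX weaves _Z5nameaPi_sa and _Z5namebPf_sb: the fused kernel _Z5nameaPi_sa__Z5namebPf_sb_merged alternates instructions from the two input kernels.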

.version 8.5
.target sm_52
.address_size 64
.visible .entry  _Z5nameaPi_sa( .param .u64 _Z5nameaPi_param_0_sa){
 .reg .b32  %r_sa<3>;
 .reg .b64  %rd_sa<3>;
ld.param.u64 %rd_sa1,[ _Z5nameaPi_param_0_sa];
cvta.to.global.u64 %rd_sa2, %rd_sa1;
ld.global.u32 %r_sa1,[ %rd_sa2];
add.s32 %r_sa2, %r_sa1, 1;
st.global.u32[ %rd_sa2], %r_sa2;
ret;
}
.visible .entry  _Z5namebPf_sb( .param .u64 _Z5namebPf_param_0_sb){
 .reg .f32  %f_sb<3>;
 .reg .b64  %rd_sb<3>;
ld.param.u64 %rd_sb1,[ _Z5namebPf_param_0_sb];
cvta.to.global.u64 %rd_sb2, %rd_sb1;
ld.global.f32 %f_sb1,[ %rd_sb2];
add.f32 %f_sb2, %f_sb1, 0f3F800000;
st.global.f32[ %rd_sb2], %f_sb2;
ret;
}
.visible .entry  _Z5nameaPi_sa__Z5namebPf_sb_merged( .param .u64 _Z5nameaPi_param_0_saf, .param .u64 _Z5namebPf_param_0_sbs){
 .reg .b32  %r_saf<3>;
 .reg .b64  %rd_saf<3>;
 .reg .f32  %f_sbs<3>;
 .reg .b64  %rd_sbs<3>;
 $0__GoPTX:
ld.param.u64 %rd_saf1,[ _Z5nameaPi_param_0_saf];
ld.param.u64 %rd_sbs1,[ _Z5namebPf_param_0_sbs];
cvta.to.global.u64 %rd_saf2, %rd_saf1;
cvta.to.global.u64 %rd_sbs2, %rd_sbs1;
ld.global.u32 %r_saf1,[ %rd_saf2];
ld.global.f32 %f_sbs1,[ %rd_sbs2];
add.s32 %r_saf2, %r_saf1, 1;
add.f32 %f_sbs2, %f_sbs1, 0f3F800000;
st.global.u32[ %rd_saf2], %r_saf2;
st.global.f32[ %rd_sbs2], %f_sbs2;
 ret;
}
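
The fused kernel in the listing above takes one instruction from each input kernel in turn and ends with a single `ret`. The sketch below reproduces just that alternation in Python. It is a simplified illustration under stated assumptions: `weave` is a hypothetical helper, the bodies are straight-line code with the trailing `ret` removed, and GoPTX's actual weaving strategy may differ.

```python
from itertools import zip_longest

def weave(body_a, body_b):
    """Interleave two instruction lists one instruction at a time (toy model)."""
    woven = []
    for ins_a, ins_b in zip_longest(body_a, body_b):
        # zip_longest pads the shorter body with None; skip the padding.
        if ins_a is not None:
            woven.append(ins_a)
        if ins_b is not None:
            woven.append(ins_b)
    return woven

# Instruction bodies copied from the fused kernel above, without `ret`.
BODY_A = [
    "ld.param.u64 %rd_saf1,[ _Z5nameaPi_param_0_saf];",
    "cvta.to.global.u64 %rd_saf2, %rd_saf1;",
    "ld.global.u32 %r_saf1,[ %rd_saf2];",
    "add.s32 %r_saf2, %r_saf1, 1;",
    "st.global.u32[ %rd_saf2], %r_saf2;",
]
BODY_B = [
    "ld.param.u64 %rd_sbs1,[ _Z5namebPf_param_0_sbs];",
    "cvta.to.global.u64 %rd_sbs2, %rd_sbs1;",
    "ld.global.f32 %f_sbs1,[ %rd_sbs2];",
    "add.f32 %f_sbs2, %f_sbs1, 0f3F800000;",
    "st.global.f32[ %rd_sbs2], %f_sbs2;",
]

# One shared `ret` closes the fused body.
for line in weave(BODY_A, BODY_B) + ["ret;"]:
    print(line)
```

Because the two instruction streams use disjoint registers, adjacent instructions in the woven body have no data dependence on each other, which is the source of the extra ILP.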

To reproduce the results of the DAC'25 paper "GoPTX: Fine-grained GPU Kernel Fusion by PTX Level Instruction Flow Weaving", run the commands below. They require a GPU of architecture sm_80 or sm_90 (we used an A100-PCIE-40GB and have not tuned for other devices). For other architectures, rebuild from source and set $CUDAARCHS accordingly. Our results are available at /root/GoPTX_nvcc/utils/results/results.adaptive.html.

docker run \
  --runtime=nvidia \
  --gpus "device=0" \
  --name goptx \
  wukan0621/goptx:v0.0.1-dac25
docker cp goptx:/root/GoPTX_nvcc/utils .

Build from source

See Dockerfile.

Command line usage

GoPTX_link <a.ptx> <b.ptx> [suffix_a=""] [suffix_b=""] > c.ptx
GoPTX_merge <c.ptx> <regex_merged> <regex_a> <regex_b> [strategy=0] > d.ptx
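
Judging from the Quick start example, `$1` and `$2` in `<regex_merged>` stand for the names of the kernels matched by `<regex_a>` and `<regex_b>`. The Python sketch below mimics that naming step only. The helper `fused_name`, the use of substring search for matching, and the entry-name regex are assumptions for illustration, not GoPTX's documented behavior.

```python
import re

def fused_name(ptx: str, template: str, regex_a: str, regex_b: str) -> str:
    """Build a fused kernel name from a $1/$2 template (hypothetical helper)."""
    # Collect every kernel name that follows a `.entry` directive.
    entries = re.findall(r"\.entry\s+([\w$]+)", ptx)
    # Assumption: a kernel is selected when the regex matches anywhere in its name.
    name_a = next(n for n in entries if re.search(regex_a, n))
    name_b = next(n for n in entries if re.search(regex_b, n))
    names = (name_a, name_b)
    # Substitute $1 and $2 in a single pass over the template.
    return re.sub(r"\$([12])", lambda m: names[int(m.group(1)) - 1], template)

# Kernel headers as they appear in the Quick start output, bodies omitted.
C_PTX = """.visible .entry  _Z5nameaPi_sa( .param .u64 _Z5nameaPi_param_0_sa){
}
.visible .entry  _Z5namebPf_sb( .param .u64 _Z5namebPf_param_0_sb){
}"""

print(fused_name(C_PTX, "$1_$2_merged", "namea", "nameb"))
```

On the command line the template is written as `\$1_\$2_merged` so that the shell passes the literal `$1` and `$2` through instead of expanding them.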
