@fzi-peccia fzi-peccia force-pushed the riscv_rvv_tensor_intrinsic branch 2 times, most recently from 997c57a to d150dd9 on August 1, 2025 15:49

tqchen commented Aug 1, 2025

cc @cbalint13, can you help take a look?

@cbalint13 cbalint13 self-assigned this Aug 1, 2025
@cbalint13 cbalint13 left a comment

@fzi-peccia ,

Thank you very much for this work!
I found it good; nice to see RVV enhancements.

One note: in the aprofile arch parser, could we reuse this global func here?

cpp: llvm_get_vector_width(target)
py: _ffi_api.llvm_get_vector_width(target)

It was introduced:
https://github.com/apache/tvm/pull/17641/files
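
For illustration, this is roughly how the parser could query it (a sketch under the assumption that the function is reachable via tvm.target._ffi_api, per #17641; untested here):

```python
# Minimal sketch, assuming the entry point from apache/tvm#17641:
# query the LLVM-inferred vector width (VLEN, in bits) for an RVV target
# instead of re-deriving it in the arch parser.
import tvm
from tvm.target import _ffi_api

target = tvm.target.Target(
    "llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mattr=+v,+zvl256b"
)
vlen = _ffi_api.llvm_get_vector_width(target)
print(vlen)  # expected 256 for +zvl256b
```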

If not, it is fine for now; we can improve/simplify the parser later.
I have some work that can be added on top of this for spacemit's IME.

cbalint13 commented Aug 10, 2025

@fzi-peccia , can you look at the i386 CI failure?

@fzi-peccia ,

Permit me a change proposal on how to avoid aprofile (which serves ARM only); I don't know if it will be kept in the future.
Instead, let's use info from the LLVM side and reuse the existing VLEN inference (via target.llvm_get_vector_width).

  • Here is how it would look: tvm-rvv-noaprofile.diff.txt, applicable on top of your current branch.
  • This will also fix the i386 CI failure caused by the alteration of aprofile (currently ARM-only stuff).

I am all-in to see this merged; it is a very good start for future IME tensorization, beyond what LLVM supports (or will support).

LATER UPDATE

@cbalint13 cbalint13 left a comment

This is nice work, but it requires adaptation to the current TVM infra.
There is a review regarding the integer variants and their use-case coverage.

@fzi-peccia

Sorry all, I was on vacation; I will tackle these comments this week.

@fzi-peccia fzi-peccia force-pushed the riscv_rvv_tensor_intrinsic branch from 6f5aec2 to f9b2667 on August 18, 2025 08:18
@fzi-peccia

Hi @cbalint13. Thank you very much for the feedback and the diff. I implemented the changes you suggested and also rebased on main.

Regarding the mixed-dtype cases, the original idea was to support them, and this kernel_dtype is a leftover from those days. I have now replaced it with the input_dtype (roughly sketched below); maybe for this version we could merge without mixed cases and then add that feature in the future. What do you think?
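
Roughly, the fix looks like this (a hypothetical sketch with invented names, not the actual PR code):

```python
# Hypothetical sketch (invented names): derive lane counts from the single
# input_dtype instead of the stale kernel_dtype, and reject mixed-dtype
# combinations until that feature is added.
def rvv_lanes(input_dtype: str, output_dtype: str, vlen_bits: int) -> int:
    if input_dtype != output_dtype:
        raise NotImplementedError("mixed-dtype intrinsics not supported yet")
    elem_bits = int("".join(ch for ch in input_dtype if ch.isdigit()))
    return vlen_bits // elem_bits

assert rvv_lanes("float32", "float32", 256) == 8  # 8 fp32 lanes at VLEN=256
```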

@cbalint13 cbalint13 left a comment

I think this code should be fine now; I started to run some real tests on my side.

  • Proposed to merge this as experimental for a while
  • Align intrinsic initialization to happen only once, just like for other arches (see the sketch below)
  • Fix the inconsistency between generated and consumed intrinsic variants

Let's have some rounds of tests with real networks in order to elevate this to non-experimental. Meanwhile, in subsequent PRs we can add IME XSMTVDot (this promises up to 2 TOPS on spacemit-x60) and/or RVV 0.7.1 backward compatibility for the THead boards; as a separate PR we can also give the int8 case mixed-dtype combinations.
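
For the one-time initialization (point 2), a minimal sketch, assuming a hypothetical make_rvv_intrin_table helper that yields (desc, impl) PrimFunc pairs:

```python
# Minimal sketch of one-time registration; make_rvv_intrin_table is a
# hypothetical helper returning {name: (desc, impl)} PrimFunc pairs.
from tvm.tir import TensorIntrin

_RVV_INTRIN_REGISTERED = False

def register_rvv_intrinsics(vlen_bits: int) -> None:
    global _RVV_INTRIN_REGISTERED
    if _RVV_INTRIN_REGISTERED:
        return
    for name, (desc, impl) in make_rvv_intrin_table(vlen_bits).items():
        # TensorIntrin.register raises on duplicate names unless
        # override=True, hence the module-level guard.
        TensorIntrin.register(name, desc, impl)
    _RVV_INTRIN_REGISTERED = True
```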


cbalint13 commented Aug 22, 2025

@fzi-peccia ,

Tests were done by tuning a resnet18 model.
Here are the TVM program and the results after 5000 trials: rvv-resnet18-mstune-rpc-2025Aug22.tar.gz


Tests

In an RPC setup, I used the provided tvm-rvv-tune.py script (rough outline below).
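
For reference, roughly the outline of that script (the actual tvm-rvv-tune.py in the attached archive may differ; host/port/key values are placeholders):

```python
# Rough outline of the RPC tuning run; mod/params are the imported
# resnet18 Relax module and its weights (not shown here).
import tvm
from tvm import meta_schedule as ms

target = tvm.target.Target(
    "llvm -mtriple=riscv64-linux-gnu -mattr=+v,+zvl256b -num-cores 4"
)
runner = ms.runner.RPCRunner(
    rpc_config=ms.runner.RPCConfig(
        tracker_host="127.0.0.1",
        tracker_port=9190,
        tracker_key="riscv64",
        session_timeout_sec=120,
    ),
)
database = ms.relax_integration.tune_relax(
    mod=mod,
    params=params,
    target=target,
    work_dir="workdir",
    max_trials_global=5000,
    runner=runner,
)
```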

  • There were trial proposals for tensorization:
$ cat workdir/logs/*.log | grep Tensorizing | awk '{print $NF}' | sort -u
rvv_float32_multivmul_8_16_m8
rvv_float32_multivmul_8_32_m8
rvv_float32_multivmul_8_4_m8
rvv_float32_multivmul_8_64_m8
rvv_float32_multivmul_8_8_m8
rvv_float32_vmacc_1_16_m8
rvv_float32_vmacc_1_32_m8
rvv_float32_vmacc_1_4_m8
rvv_float32_vmacc_1_64_m8
rvv_float32_vmacc_1_8_m8
rvv_float32_vmul_1_16_m8
rvv_float32_vmul_1_32_m8
rvv_float32_vmul_1_4_m8
rvv_float32_vmul_1_64_m8
rvv_float32_vmul_1_8_m8
  • Post-analysis of all entries at the IR level:
$ ./msch-database-tir-parse.py
Parsed #5000 records
No tensorized schedules found.

This needs investigation.

@cbalint13

@fzi-peccia ,

> $ ./msch-database-tir-parse.py
> Parsed #5000 records
> No tensorized schedules found.
>
> This needs investigation.

Based on the #18224 investigation, it seems the RVV intrinsic templates need a double check (see the example fix in that issue).
The code posted here looked from the beginning like an oldish TVM, using (I am guessing) relay as the graph import.

@cbalint13 cbalint13 left a comment

> $ ./msch-database-tir-parse.py
> Parsed #5000 records
> No tensorized schedules found.
>
> This needs investigation.

> Based on the #18224 investigation, it seems the RVV intrinsic templates need a double check (see the example fix in that issue).

Based on the latest real tests and investigations, this still needs changes as shown.
To maintain it long term, ideally the tensorization templates would have some testcases (a sketch below).
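
Something along these lines (helper names assumed): build one intrinsic impl PrimFunc standalone and compare it against numpy, so a template regression is caught without a full tuning run.

```python
# Sketch of a template testcase: run one intrinsic impl PrimFunc against
# a numpy reference; impl_mod/func_name come from the template under test.
import numpy as np
import tvm
import tvm.testing

def check_rvv_template(impl_mod, func_name, target, dev, m=8, k=16):
    a_np = np.random.rand(k).astype("float32")
    b_np = np.random.rand(m, k).astype("float32")
    c_ref = b_np @ a_np  # C[m] = dot(A, B[m, :])
    f = tvm.build(impl_mod, target=target)[func_name]
    a, b = tvm.nd.array(a_np, dev), tvm.nd.array(b_np, dev)
    c = tvm.nd.array(np.zeros(m, "float32"), dev)
    f(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), c_ref, rtol=1e-5)
```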


cbalint13 commented Aug 25, 2025

Further, I investigated the correctness of the proposed tensorization kernels.
The proposed multivmul does multiple dot products, which would yield the highest benefits inside RVV.
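
As a reference for what the kernel should compute (my reading of the intended semantics, matching the working fp32 testcase further below):

```python
# Numpy reference for rvv_float32_multivmul_8_64_m8 semantics:
# eight independent dot products sharing the vector A, so all eight
# output lanes should be populated, not just lane 0.
import numpy as np

M, K = 8, 64
A = np.random.randint(0, 10, size=K).astype("float32")
B = np.random.randint(0, 10, size=(M, K)).astype("float32")
C_ref = B @ A  # C_ref[m] = dot(A, B[m, :])
```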

All tests here need #18232


$ ./riscv64-rvv-kernels-pr18182.py 64
Testing rvv_float32_multivmul_8_64_m8
C (output): (8,) [float32]
[1363.    0.    0.    0.    0.    0.    0.    0.]
Output (kernel) [1363.    0.    0.    0.    0.    0.    0.    0.]
Output (numpy) [1363. 1407. 1460. 1388. 1504. 1373. 1268. 1270.]

$ ./riscv64-rvv-kernels-pr18182.py 32
Testing rvv_float32_multivmul_8_32_m8
C (output): (8,) [float32]
[699.   0.   0.   0.   0.   0.   0.   0.]
Output (kernel) [699.   0.   0.   0.   0.   0.   0.   0.]
Output (numpy) [699. 493. 671. 707. 635. 639. 764. 611.]

$ ./riscv64-rvv-kernels-pr18182.py 16
Testing rvv_float32_multivmul_8_16_m8
C (output): (8,) [float32]
[425.   0.   0.   0.   0.   0.   0.   0.]
Output (kernel) [425.   0.   0.   0.   0.   0.   0.   0.]
Output (numpy) [425. 192. 382. 464. 465. 382. 438. 202.]
{...}

$ ./riscv64-rvv-full-fp32_kern.py
DEBUG:pydot:pydot initializing
DEBUG:pydot:pydot 3.0.1
DEBUG:pydot.core:pydot core module initializing
DEBUG:pydot.dot_parser:pydot dot_parser module initializing
# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A_handle: T.handle, B_handle: T.handle, C_handle: T.handle):
        T.func_attr({"global_symbol": "rvv_dot_4f32_4x4f32_2f32"})
        A = T.match_buffer(A_handle, (4,), align=4, offset_factor=1)
        B = T.match_buffer(B_handle, (4, 4), strides=(4, 1), align=4, offset_factor=1)
        C = T.match_buffer(C_handle, (4,), align=4, offset_factor=1)
        with T.block("root"):
            T.reads(A[0:4], B[0:4, 0:4])
            T.writes(C[0:4])
            zero: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfmv.v.f", T.Broadcast(T.float32(0.0), T.vscale() * 2), C[0], T.uint64(1))
            vec_A: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), A.data, 0, 4, 1), T.int64(4))
            for i in range(4):
                with T.block("reduction"):
                    vi = T.axis.spatial(4, i)
                    T.reads(B[0:4, 0:4])
                    T.writes(C[vi])
                    vec_B: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), B.data, vi * 4, 4, 1), T.int64(4))
                    product: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vfmul", T.Broadcast(T.float32(0.0), T.vscale() * 4), vec_A, vec_B, T.uint64(7), T.uint64(4))
                    reduction_result_vec: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfredusum", T.Broadcast(T.float32(0.0), T.vscale() * 2), product, zero, T.uint64(7), T.uint64(4))
                    C[vi] = T.call_llvm_intrin("float32", "llvm.riscv.vfmv.f.s", reduction_result_vec)

[6. 6. 9. 3.]
[[3. 7. 7. 7.]
 [0. 2. 5. 7.]
 [3. 9. 5. 7.]
 [9. 3. 6. 1.]]
Output (kernel) [144.  78. 138. 129.]
Output (numpy) [144.  78. 138. 129.]

For this working sample, 4 x (4x4) -> 4 lanes for the VLEN=256 @ fp32 case is the maximum for a fully occupied RVV machine.


Now,

besides the template-matching issues due to the relax flow (exemplified with a working dense/matmul testcase), the numerical implementation of the kernels themselves is also wrong, and personally I don't see how they fully exploit the RVV machine (a working testcase is also provided above).


cbalint13 commented Aug 29, 2025

@fzi-peccia ,

I don't know how else to help move this forward; feel free to reuse the working draft above.
Thank you 🙏

@cbalint13 cbalint13 removed their assignment Aug 29, 2025