@fzi-peccia fzi-peccia force-pushed the riscv_rvv_tensor_intrinsic branch 2 times, most recently from 997c57a to d150dd9 on August 1, 2025 15:49

tqchen commented Aug 1, 2025

cc @cbalint13, can you help take a look?

@cbalint13 cbalint13 self-assigned this Aug 1, 2025
@cbalint13 cbalint13 left a comment

@fzi-peccia ,

Thank you very much for this work!
I found it good; nice to see RVV enhancements.

One note: in the aprofile arch parser, could we reuse this global func here?

cpp: llvm_get_vector_width(target)
py: _ffi_api.llvm_get_vector_width(target)

It was introduced:
https://github.com/apache/tvm/pull/17641/files
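
For illustration, this is roughly how the parser could query it (a sketch under the assumption that the function is reachable via tvm.target._ffi_api, per #17641; untested here):

```python
# Minimal sketch, assuming the entry point from apache/tvm#17641:
# query the LLVM-inferred vector width (VLEN, in bits) for an RVV target
# instead of re-deriving it in the arch parser.
import tvm
from tvm.target import _ffi_api

target = tvm.target.Target(
    "llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mattr=+v,+zvl256b"
)
vlen = _ffi_api.llvm_get_vector_width(target)
print(vlen)  # expected 256 for +zvl256b
```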

If not, it is fine for now; we can improve/simplify the parser later.
I have some work that can be added on top of this for spacemit's IME.

cbalint13 commented Aug 10, 2025

@fzi-peccia , can you look at the i386 CI failure?

@fzi-peccia ,

Permit me a change proposal on how to avoid aprofile (which serves ARM only); I don't know if it will be kept in the future.
Instead, let's use info from the LLVM side and reuse the existing VLEN inference (via target.llvm_get_vector_width).

  • Here is how it would look: tvm-rvv-noaprofile.diff.txt, applicable on top of your current branch.
  • This will also fix the i386 CI failure caused by the alteration of aprofile (currently ARM-only stuff).

I am all-in to see this merged; it is a very good start for future IME tensorization, beyond what LLVM supports (or will support).

LATER UPDATE

@cbalint13 cbalint13 left a comment

This is nice work, but it requires adaptation to the current TVM infra.
There is a review regarding the integer variants and their use-case coverage.

@fzi-peccia

Sorry all, I was on vacation; I will tackle these comments this week.

@fzi-peccia fzi-peccia force-pushed the riscv_rvv_tensor_intrinsic branch from 6f5aec2 to f9b2667 on August 18, 2025 08:18
@fzi-peccia

Hi @cbalint13. Thank you very much for the feedback and the diff. I implemented the changes you suggested and also rebased on main.

Regarding the mixed-dtype cases, the original idea was to support them, and this kernel_dtype is a leftover from those days. I have now replaced it with the input_dtype (roughly sketched below); maybe for this version we could merge without mixed cases and then add that feature in the future. What do you think?
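
Roughly, the fix looks like this (a hypothetical sketch with invented names, not the actual PR code):

```python
# Hypothetical sketch (invented names): derive lane counts from the single
# input_dtype instead of the stale kernel_dtype, and reject mixed-dtype
# combinations until that feature is added.
def rvv_lanes(input_dtype: str, output_dtype: str, vlen_bits: int) -> int:
    if input_dtype != output_dtype:
        raise NotImplementedError("mixed-dtype intrinsics not supported yet")
    elem_bits = int("".join(ch for ch in input_dtype if ch.isdigit()))
    return vlen_bits // elem_bits

assert rvv_lanes("float32", "float32", 256) == 8  # 8 fp32 lanes at VLEN=256
```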

@cbalint13 cbalint13 left a comment

I think this code should be fine now; I started to run some real tests on my side.

  • Proposed to merge this as experimental for a while
  • Align intrinsic initialization to happen only once, just like for other arches (see the sketch below)
  • Fix the inconsistency between generated and consumed intrinsic variants

Let's have some rounds of tests with real networks in order to elevate this to non-experimental. Meanwhile, in subsequent PRs we can add IME XSMTVDot (this promises up to 2 TOPS on spacemit-x60) and/or RVV 0.7.1 backward compatibility for the THead boards; as a separate PR we can also give the int8 case mixed-dtype combinations.
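
For the one-time initialization (point 2), a minimal sketch, assuming a hypothetical make_rvv_intrin_table helper that yields (desc, impl) PrimFunc pairs:

```python
# Minimal sketch of one-time registration; make_rvv_intrin_table is a
# hypothetical helper returning {name: (desc, impl)} PrimFunc pairs.
from tvm.tir import TensorIntrin

_RVV_INTRIN_REGISTERED = False

def register_rvv_intrinsics(vlen_bits: int) -> None:
    global _RVV_INTRIN_REGISTERED
    if _RVV_INTRIN_REGISTERED:
        return
    for name, (desc, impl) in make_rvv_intrin_table(vlen_bits).items():
        # TensorIntrin.register raises on duplicate names unless
        # override=True, hence the module-level guard.
        TensorIntrin.register(name, desc, impl)
    _RVV_INTRIN_REGISTERED = True
```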


cbalint13 commented Aug 22, 2025

@fzi-peccia ,

Tests were done by tuning a resnet18 model.
Here are the TVM program and the results after 5000 trials: rvv-resnet18-mstune-rpc-2025Aug22.tar.gz


Tests

In an RPC setup, I used the provided tvm-rvv-tune.py script (rough outline below).
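
For reference, roughly the outline of that script (the actual tvm-rvv-tune.py in the attached archive may differ; host/port/key values are placeholders):

```python
# Rough outline of the RPC tuning run; mod/params are the imported
# resnet18 Relax module and its weights (not shown here).
import tvm
from tvm import meta_schedule as ms

target = tvm.target.Target(
    "llvm -mtriple=riscv64-linux-gnu -mattr=+v,+zvl256b -num-cores 4"
)
runner = ms.runner.RPCRunner(
    rpc_config=ms.runner.RPCConfig(
        tracker_host="127.0.0.1",
        tracker_port=9190,
        tracker_key="riscv64",
        session_timeout_sec=120,
    ),
)
database = ms.relax_integration.tune_relax(
    mod=mod,
    params=params,
    target=target,
    work_dir="workdir",
    max_trials_global=5000,
    runner=runner,
)
```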

  • There were trial proposals for tensorization:
$ cat workdir/logs/*.log | grep Tensorizing | awk '{print $NF}' | sort -u
rvv_float32_multivmul_8_16_m8
rvv_float32_multivmul_8_32_m8
rvv_float32_multivmul_8_4_m8
rvv_float32_multivmul_8_64_m8
rvv_float32_multivmul_8_8_m8
rvv_float32_vmacc_1_16_m8
rvv_float32_vmacc_1_32_m8
rvv_float32_vmacc_1_4_m8
rvv_float32_vmacc_1_64_m8
rvv_float32_vmacc_1_8_m8
rvv_float32_vmul_1_16_m8
rvv_float32_vmul_1_32_m8
rvv_float32_vmul_1_4_m8
rvv_float32_vmul_1_64_m8
rvv_float32_vmul_1_8_m8
  • Post-analysis of all entries at the IR level:
$ ./msch-database-tir-parse.py
Parsed #5000 records
No tensorized schedules found.

This needs investigation.

@cbalint13

@fzi-peccia ,

> $ ./msch-database-tir-parse.py
> Parsed #5000 records
> No tensorized schedules found.
>
> This needs investigation.

Based on the #18224 investigation, it seems the RVV intrinsic templates need a double check (see the example fix in that issue).
The code posted here looked from the beginning like an oldish TVM, using (I am guessing) relay as the graph import.

@cbalint13 cbalint13 left a comment

> $ ./msch-database-tir-parse.py
> Parsed #5000 records
> No tensorized schedules found.
>
> This needs investigation.

> Based on the #18224 investigation, it seems the RVV intrinsic templates need a double check (see the example fix in that issue).

Based on the latest real tests and investigations, this still needs changes as shown.
To maintain it long term, ideally the tensorization templates would have some testcases (a sketch below).
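
Something along these lines (helper names assumed): build one intrinsic impl PrimFunc standalone and compare it against numpy, so a template regression is caught without a full tuning run.

```python
# Sketch of a template testcase: run one intrinsic impl PrimFunc against
# a numpy reference; impl_mod/func_name come from the template under test.
import numpy as np
import tvm
import tvm.testing

def check_rvv_template(impl_mod, func_name, target, dev, m=8, k=16):
    a_np = np.random.rand(k).astype("float32")
    b_np = np.random.rand(m, k).astype("float32")
    c_ref = b_np @ a_np  # C[m] = dot(A, B[m, :])
    f = tvm.build(impl_mod, target=target)[func_name]
    a, b = tvm.nd.array(a_np, dev), tvm.nd.array(b_np, dev)
    c = tvm.nd.array(np.zeros(m, "float32"), dev)
    f(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), c_ref, rtol=1e-5)
```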


cbalint13 commented Aug 25, 2025

Further, I investigated the correctness of the proposed tensorization kernels.
The proposed multivmul does multiple dot products, which would yield the highest benefits inside RVV.
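
As a reference for what the kernel should compute (my reading of the intended semantics, matching the working fp32 testcase further below):

```python
# Numpy reference for rvv_float32_multivmul_8_64_m8 semantics:
# eight independent dot products sharing the vector A, so all eight
# output lanes should be populated, not just lane 0.
import numpy as np

M, K = 8, 64
A = np.random.randint(0, 10, size=K).astype("float32")
B = np.random.randint(0, 10, size=(M, K)).astype("float32")
C_ref = B @ A  # C_ref[m] = dot(A, B[m, :])
```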

All tests here need #18232


$ ./riscv64-rvv-kernels-pr18182.py 64
Testing rvv_float32_multivmul_8_64_m8
C (output): (8,) [float32]
[1363.    0.    0.    0.    0.    0.    0.    0.]
Output (kernel) [1363.    0.    0.    0.    0.    0.    0.    0.]
Output (numpy) [1363. 1407. 1460. 1388. 1504. 1373. 1268. 1270.]

$ ./riscv64-rvv-kernels-pr18182.py 32
Testing rvv_float32_multivmul_8_32_m8
C (output): (8,) [float32]
[699.   0.   0.   0.   0.   0.   0.   0.]
Output (kernel) [699.   0.   0.   0.   0.   0.   0.   0.]
Output (numpy) [699. 493. 671. 707. 635. 639. 764. 611.]

$ ./riscv64-rvv-kernels-pr18182.py 16
Testing rvv_float32_multivmul_8_16_m8
C (output): (8,) [float32]
[425.   0.   0.   0.   0.   0.   0.   0.]
Output (kernel) [425.   0.   0.   0.   0.   0.   0.   0.]
Output (numpy) [425. 192. 382. 464. 465. 382. 438. 202.]
{...}

$ ./riscv64-rvv-full-fp32_kern.py
DEBUG:pydot:pydot initializing
DEBUG:pydot:pydot 3.0.1
DEBUG:pydot.core:pydot core module initializing
DEBUG:pydot.dot_parser:pydot dot_parser module initializing
# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A_handle: T.handle, B_handle: T.handle, C_handle: T.handle):
        T.func_attr({"global_symbol": "rvv_dot_4f32_4x4f32_2f32"})
        A = T.match_buffer(A_handle, (4,), align=4, offset_factor=1)
        B = T.match_buffer(B_handle, (4, 4), strides=(4, 1), align=4, offset_factor=1)
        C = T.match_buffer(C_handle, (4,), align=4, offset_factor=1)
        with T.block("root"):
            T.reads(A[0:4], B[0:4, 0:4])
            T.writes(C[0:4])
            zero: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfmv.v.f", T.Broadcast(T.float32(0.0), T.vscale() * 2), C[0], T.uint64(1))
            vec_A: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), A.data, 0, 4, 1), T.int64(4))
            for i in range(4):
                with T.block("reduction"):
                    vi = T.axis.spatial(4, i)
                    T.reads(B[0:4, 0:4])
                    T.writes(C[vi])
                    vec_B: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), B.data, vi * 4, 4, 1), T.int64(4))
                    product: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vfmul", T.Broadcast(T.float32(0.0), T.vscale() * 4), vec_A, vec_B, T.uint64(7), T.uint64(4))
                    reduction_result_vec: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfredusum", T.Broadcast(T.float32(0.0), T.vscale() * 2), product, zero, T.uint64(7), T.uint64(4))
                    C[vi] = T.call_llvm_intrin("float32", "llvm.riscv.vfmv.f.s", reduction_result_vec)

[6. 6. 9. 3.]
[[3. 7. 7. 7.]
 [0. 2. 5. 7.]
 [3. 9. 5. 7.]
 [9. 3. 6. 1.]]
Output (kernel) [144.  78. 138. 129.]
Output (numpy) [144.  78. 138. 129.]

For this working sample, 4 x (4x4) -> 4 lanes for the VLEN=256 @ fp32 case is the maximum for a fully occupied RVV machine.


Now,

besides the template-matching issues due to the relax flow (exemplified with a working dense/matmul testcase), the numerical implementation of the kernels themselves is also wrong, and personally I don't see how they fully exploit the RVV machine (a working testcase is also provided above).


cbalint13 commented Aug 29, 2025

@fzi-peccia ,

I don't know how else to help move this forward; feel free to reuse the working draft above.
Thank you 🙏

@cbalint13 cbalint13 removed their assignment Aug 29, 2025