Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] TVM/LLVM sets RISC-V VLEN to 128 bits instead of 256 on Banana Pi K1 #17625

Open
JieGH opened this issue Feb 5, 2025 · 3 comments
Open
Assignees
Labels
needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it type: bug

Comments

@JieGH
Copy link

JieGH commented Feb 5, 2025

TVM/LLVM sets RISC-V VLEN to 128 bits instead of 256 on Banana Pi K1

Expected Behavior

I have a vecter-based TVM design that shows poor performance against C implementation, 1/4 of performance approx.
The LLVM warns VLEN to be set as default 128 bits, but the Banana Pi should uses 256 bits instead.
TVM and LLVM should correctly detect and utilize a vector length (VLEN) of 256 bits on the Banana Pi K1 board.

Actual Behavior

  • The board reports VLEN=256
#include <stdio.h>

int main() {
    unsigned long vlenb;
    __asm__ volatile ("csrr %0, vlenb" : "=r"(vlenb));
    printf("Vector length in bytes (vlenb): %lu\n", vlenb);
    printf("Vector length in bits (vlenb): %lu\n", vlenb*8);
    return 0;
}

when execute

Vector length in bytes (vlenb): 32
Vector length in bits (vlenb): 256

-- When checking the asm code generated from TVM flow, when querying via csrr a4, vlenb in assembly. Does the ASM code show e32, m2, indicating the vector length is 128 bit.

vector_test_compute_:
.Lfunc_begin1:
	.loc	1 0 0
	.cfi_startproc
	csrr	a4, vlenb
	srli	t0, a4, 1
	li	a2, 1024
	bltu	a2, t0, .LBB1_7
.Ltmp107:
	addi	a2, t0, -1
	andi	a7, a2, 1024
	xori	a6, a7, 1024
	slli	t1, a4, 1
	mv	a5, a1
	mv	a3, a0
	mv	a2, a6
	vsetvli	a4, zero, e32, m2, ta, ma
  • And, LLVM outputs a warning: src/target/llvm/codegen_llvm.cc:185: Warning: Set native vector bits to be 128 for riscv64
  • Attempts to override with flags like -mllvm -riscv-v-vector-bits-min=256 -mllvm -riscv-v-vector-bits-max=256 do not take effect, llvm says it illegal flags, then it recommend
  • ValueError: Error when parsing target["mllvm"]: Cannot recognize 'mllvm'. Candidates are: cl-opt, opt-level, fast-math-ninf, fast-math-arcp, fast-math-nnan, fast-math, fast-math-contract, num-cores, device, libs, tag, mtriple, host, from_device, target_device_type, fast-math-reassoc, keys, mattr, fast-math-nsz, model, mabi, mcpu, jit, mfloat-abi. Target creation from string failed: llvm -jit=orcjit -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mattr=+m,+a,+f,+d,+zfh,+v,+c -mllvm -riscv-v-vector-bits-min=256 -mllvm -riscv-v-vector-bits-max=256

Could anyone advice me how can I set the correct vector length to 256 bits? And from your @cbalint13 knowledge of RISCV, could you provide some advice? Thanks all.

Environment

  LLVM version 19.1.3
  Optimized build with assertions.
  Default target: riscv64-linux-gnu
  Host CPU: generic-rv64

  Registered Targets:
    riscv32 - 32-bit RISC-V
    riscv64 - 64-bit RISC-V

TVM: 0.18.0
Linux k1 6.1.15 #1.0.15.1 SMP PREEMPT Sat Sep  7 07:20:18 UTC 2024 riscv64 riscv64 riscv64 GNU/Linux

riscv64-linux-gnu-gcc --version
riscv64-linux-gnu-gcc (Ubuntu 13.2.0-4ubuntu3-bb2) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


llvm -jit=orcjit -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mattr=+m,+a,+f,+d,+zfh,+v,+c

@JieGH JieGH added needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it type: bug labels Feb 5, 2025
@cbalint13
Copy link
Contributor

Hi @JieGH

Regarding arch specific technical details {CPU: type, variations, features} TVM relay on LLVM itself (it queries the very library). AFAIK, when its about RISCV land, there are limited infos within LLVM available, limited to interest, maintainance, with things mostly coming from SiFive folks.

Let me look and tackle this problem and come up with best way for TVM, here are some possibilities on my mind:

  1. Let me look once again in LLVM what info regarding VLEN can be queried, and implement such query for the VLEN.

If the LLVM query don't work (cpu variation is not enlisted in LLVM) let's further override within TVM:

  1. Let's also make the flag for TVM like "-riscv-v-vector-bits-min=256" to work properly (currently TVM hardcode it to 256).
  2. Let's also declare the arches in TVM way (spacemit, th1520, C906, etc) see here, but this requires maintenance on long term.

I believe {0 + 1, 2} can be implemented all together to offer users multiple ways of control.

@JieGH
Copy link
Author

JieGH commented Feb 10, 2025

Hi @cbalint13, Thanks for the advice. I now have a method for choosing VLEN, which is using an additional flag:
tvm.target.Target("llvm -jit=orcjit -mtriple=riscv64 -mcpu=spacemit-x60 - mattr=+64bit,+m,+a,+f,+d,+c,+zfh,+v,+zvl256b")

By specifying zvl256b flags, it means enable 'Zvl' (Minimum Vector Length) 256. This indeed has an impact on the execution's performance.
However,

  1. the TVM still warns the 128bit sets as default bit length despite the zvl flags having been enabled and having an impact on performance. I have not yet run the latest update you posted at Handle vector width (VLEN) for RISCV arches Handle vector width (VLEN) for RISCV arches #17631

  2. I searched zvl flags for a given matrix mul problem; I changed the zvl and measured the performance. The best performance appeared at the vector length that the chip should not support. For example if I set zvl256b, the execution takes 491ms to complete, if I set zvl8192b, the execution takes 384 ms to finish, which has over 20% speed up. There are something wrong here.

Any comments on this? Thanks.

@cbalint13
Copy link
Contributor

cbalint13 commented Feb 10, 2025

Hi @JieGH ,

Hi @cbalint13, Thanks for the advice. I now have a method for choosing VLEN, which is using an additional flag: tvm.target.Target("llvm -jit=orcjit -mtriple=riscv64 -mcpu=spacemit-x60 - mattr=+64bit,+m,+a,+f,+d,+c,+zfh,+v,+zvl256b")

By specifying zvl256b flags, it means enable 'Zvl' (Minimum Vector Length) 256. This indeed has an impact on the execution's performance. However,

Yes, another way to tell LLVM the VLEN is via the canonical flags, but we also need TVM itself to be aware of this.

  1. the TVM still warns the 128bit sets as default bit length despite the zvl flags having been enabled and having an impact on performance. I have not yet run the latest update you posted at Handle vector width (VLEN) for RISCV arches Handle vector width (VLEN) for RISCV arches #17631

You have an older LLVM, and it does not know about -mcpu=spacemit-x60, so it will fall as a generic.

  • Can check llvm version from tvm side:
    $ python3 -c "import tvm; print(tvm.target.codegen.llvm_version_major())"
    20
    
  • Can also look inside riscv64 what -mcpu options are there for your installed LLVM:
    $ python -c "import tvm; print(tvm.target.codegen.llvm_get_cpu_archlist(tvm.target.Target('llvm -mtriple=riscv64--')))"
    ["generic", "generic-rv32", "generic-rv64", "mips-p8700", "rocket",
     "rocket-rv32", "rocket-rv64", "rp2350-hazard3", "sifive-7-series", 
     "sifive-e20", "sifive-e21", "sifive-e24", "sifive-e31", "sifive-e34", 
     "sifive-e76", "sifive-p450", "sifive-p470", "sifive-p670", "sifive-s21", 
     "sifive-s51", "sifive-s54", "sifive-s76", "sifive-u54", "sifive-u74", 
     "sifive-x280", "spacemit-x60", "syntacore-scr1-base", "syntacore-scr1-max", 
     "syntacore-scr3-rv32", "syntacore-scr3-rv64", "syntacore-scr4-rv32", 
     "syntacore-scr4-rv64", "syntacore-scr5-rv32", "syntacore-scr5-rv64",
     "syntacore-scr7", "tt-ascalon-d8", "veyron-v1", "xiangshan-nanhu"]
    

The flags (older LLVM) would be: llvm -device=riscv_cpu -vector-width=256 -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mattr=+64bit,+a,+c,+d,+f,+m,+v (orcjit is already default, vector-with informs booth TVM and LLVM).

  1. I searched zvl flags for a given matrix mul problem; I changed the zvl and measured the performance. The best performance appeared at the vector length that the chip should not support. For example if I set zvl256b, the execution takes 491ms to complete, if I set zvl8192b, the execution takes 384 ms to finish, which has over 20% speed up. There are something wrong here.>
    Any comments on this? Thanks.

Performance also depends on how LLVM optimizes things out, TVM have no highly-specialized optimizations for RISCV.

TVM emmits candidates/iterations as intermediate proposals (in auto-tunnig flow) and forwards to LLVM, while electing only the best performing ones. Not sure if you are also tring to tune your model/function, but without a tuning process TVM likely emits a subperforming variant, even for a simple matmul operation, there should be a warn on this:

WARNING:autotvm:One or more operators have not been tuned. 
Please tune your model for better performance. Use DEBUG logging level to see more details.

The work done in #17631 only informs TVM about VLEN intentions from LLVM side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it type: bug
Projects
None yet
Development

No branches or pull requests

2 participants