[AIE2P] Legalize and select VMUL.f from G_FMUL #360

Open
wants to merge 2 commits into
base: aie-public
46 changes: 46 additions & 0 deletions llvm/lib/Target/AIE/AIELegalizerHelper.cpp
@@ -27,6 +27,7 @@
#include "llvm/IR/IntrinsicsAIE2.h"
#include "llvm/IR/IntrinsicsAIE2P.h"
#include "llvm/Support/ErrorHandling.h"
#include <cassert>

namespace llvm {

@@ -1157,6 +1158,51 @@ bool AIELegalizerHelper::legalizeG_FABS(LegalizerHelper &Helper,
return true;
}

bool AIELegalizerHelper::legalizeG_FMUL(LegalizerHelper &Helper,
                                        MachineInstr &MI) const {
  assert(ST.isAIE2P() && "Custom legalization supported for AIE2P only");

  MachineIRBuilder &MIRBuilder = Helper.MIRBuilder;
  MachineRegisterInfo &MRI = *MIRBuilder.getMRI();

  const Register DstReg = MI.getOperand(0).getReg();
  assert(MRI.getType(DstReg) == LLT::scalar(16) &&
         "Expected bfloat16 type in custom legalization.");

  Register SrcLHS = MI.getOperand(1).getReg();
  Register SrcRHS = MI.getOperand(2).getReg();

  const LLT InsertVecLLT = V32BF16;
  const unsigned InsertEltOpc =
      ST.getInstrInfo()->getGenericInsertVectorEltOpcode();

  const Register IdxReg = MIRBuilder.buildConstant(S32, 0).getReg(0);
Collaborator

@martien-de-jong martien-de-jong Feb 21, 2025

Wouldn't it be cheaper to broadcast? Or is this picked up by a push.lo?

  const Register UndefVec512 = MIRBuilder.buildUndef(InsertVecLLT).getReg(0);

  SrcLHS = MIRBuilder
               .buildInstr(InsertEltOpc, {InsertVecLLT},
                           {UndefVec512, SrcLHS, IdxReg})
               .getReg(0);
  SrcRHS = MIRBuilder
               .buildInstr(InsertEltOpc, {InsertVecLLT},
                           {UndefVec512, SrcRHS, IdxReg})
               .getReg(0);

  Register Res =
      MIRBuilder.buildInstr(MI.getOpcode(), {V32BF16}, {SrcLHS, SrcRHS})
          .getReg(0);

  const unsigned ExtractEltOpc =
      ST.getInstrInfo()->getGenericExtractVectorEltOpcode(/*SignExt*/ true);
  Res = MIRBuilder.buildInstr(ExtractEltOpc, {S32}, {Res, IdxReg}).getReg(0);
  Res = MIRBuilder.buildAssertInstr(TargetOpcode::G_ASSERT_SEXT, {S32}, Res, 16)
            .getReg(0);
  MIRBuilder.buildTrunc(DstReg, Res);

  MI.eraseFromParent();
  return true;
}
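
Editorial illustration (not part of the patch): the tail of the function above extracts lane 0 with sign extension to s32, records that fact with G_ASSERT_SEXT, and truncates back to s16. A minimal standalone sketch of that bit-level round trip follows. The helper names `sextExtract16`, `isSignExtendedFrom16` and `truncTo16` are hypothetical and only model the semantics; they are not LLVM or AIE APIs.

```cpp
#include <cassert>
#include <cstdint>

// Models a sign-extending extract of a 16-bit lane into a 32-bit scalar.
// The arithmetic form avoids relying on implementation-defined narrowing.
std::int32_t sextExtract16(std::uint32_t Lane) {
  const std::int32_t Low = static_cast<std::int32_t>(Lane & 0xFFFFu);
  return (Low ^ 0x8000) - 0x8000;
}

// Models the property that a "sign-extended from 16 bits" hint asserts:
// the value equals the sign extension of its own low 16 bits.
bool isSignExtendedFrom16(std::int32_t V) {
  return V == sextExtract16(static_cast<std::uint32_t>(V));
}

// Models the final truncation: keep only the low 16 bits.
std::uint16_t truncTo16(std::int32_t V) {
  return static_cast<std::uint16_t>(static_cast<std::uint32_t>(V) & 0xFFFFu);
}
```

For the bf16 bit pattern 0xC040 (-3.0), `sextExtract16` yields -16320 (0xFFFFC040 as a 32-bit pattern), and `truncTo16` of that value returns 0xC040 again, so the extract/truncate pair does not change the lane's bits.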

bool AIELegalizerHelper::legalizeG_FADD_G_FSUB(LegalizerHelper &Helper,
MachineInstr &MI) const {
MachineIRBuilder &MIRBuilder = Helper.MIRBuilder;
1 change: 1 addition & 0 deletions llvm/lib/Target/AIE/AIELegalizerHelper.h
@@ -70,6 +70,7 @@ class AIELegalizerHelper {
bool legalizeG_FPEXT(LegalizerHelper &Helper, MachineInstr &MI) const;
bool legalizeG_FABS(LegalizerHelper &Helper, MachineInstr &MI) const;
bool legalizeG_FADD_G_FSUB(LegalizerHelper &Helper, MachineInstr &MI) const;
bool legalizeG_FMUL(LegalizerHelper &Helper, MachineInstr &MI) const;
bool legalizeG_SELECT(LegalizerHelper &Helper, MachineInstr &MI,
const unsigned MaxBitSize = 512) const;
bool legalizeG_BITCAST(LegalizerHelper &Helper, MachineInstr &MI) const;
23 changes: 23 additions & 0 deletions llvm/lib/Target/AIE/aie2p/AIE2PInstrPatterns.td
@@ -40,6 +40,8 @@ class VecConf {
int BMODE_16x16_b = 1;
int BMODE_32x16 = 0;
Collaborator

Funny to have aliases here.


int VARIANT_BF16xBF16_1_elem_1 = 1;
Collaborator

Sounds as if there are more variants. List them all in one go?

Collaborator Author

I could, but I'm not sure we will ever be able to use all of them in any patterns.

Collaborator

It looks like the translation of a hardware enumeration into tablegen speak. I'm hoping that one day we'll have a single point of definition for these, and the full list would make them more recognisable.


bits<1> dynZeroAccum = 0; // 0 – Use default first accumulator input to the post-adder. 1 – Replace default first accumulator with zeros.
bits<2> amode = 0; // Accumulator width (see above)
bits<2> bmode = 0; // Multiplication precision (see above)
@@ -59,6 +61,7 @@ class VecConf {
}

def accfp32_vecconf : VecConf { let amode = AMODE_FP32; let bmode = BMODE_16x16; }
def mulbf16_vecconf : VecConf { let amode = AMODE_FP32; let bmode = BMODE_16x16; let cmode = VARIANT_BF16xBF16_1_elem_1; }
Collaborator

Since this is a local definition, I wouldn't mind using CMODE as prefix.


/// Generic pattern classes
class PatGpr<SDPatternOperator OpNode, AIE2PInst Inst, ValueType type>
@@ -222,6 +225,26 @@ def : Pat<(fadd ACC2048:$acc1, ACC2048:$acc2),
def : Pat<(fsub ACC2048:$acc1, ACC2048:$acc2),
(VSUB_f_vmac_cm2_add_reg ACC2048:$acc1, ACC2048:$acc2, (i32 accfp32_vecconf.ConfBits))>;

// MUL
def : Pat<(v64bf16 (fmul v64bf16:$vec1, v64bf16:$vec2)),
Collaborator

Check: we perform the same multiplication twice, once to extract the lo half and once to extract the hi half. I guess we cannot express an optimized reuse of the same VMUL here, right?

(v64bf16 (REG_SEQUENCE VEC1024,
(VCONV_bf16_fp32_mv_x_srs_bf
(EXTRACT_SUBREG
(VMUL_f_vmul_bf_vmul_bf_core_Y_Y VEC1024:$vec1, VEC1024:$vec2, (i32 mulbf16_vecconf.ConfBits)),
sub_1024_acc_lo)),
sub_512_lo,
(VCONV_bf16_fp32_mv_x_srs_bf
(EXTRACT_SUBREG
(VMUL_f_vmul_bf_vmul_bf_core_Y_Y VEC1024:$vec1, VEC1024:$vec2, (i32 mulbf16_vecconf.ConfBits)),
sub_1024_acc_hi)),
sub_512_hi))>;
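
Editorial illustration (not part of the patch): the dataflow of the v64bf16 pattern above, and the reuse the reviewer asks about, can be modeled in standalone C++. This is a sketch under stated assumptions: the "lo" accumulator half is taken to hold lanes 0..31 and the "hi" half lanes 32..63, and the fp32-to-bf16 conversion is assumed to round to nearest even (the real rounding is controlled by hardware state). All function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bf16 = std::uint16_t;           // raw bfloat16 bit pattern
using Vec32 = std::array<Bf16, 32>;   // stand-in for a 512-bit vector
using Vec64 = std::array<Bf16, 64>;   // stand-in for a 1024-bit vector
using Acc64 = std::array<float, 64>;  // stand-in for the fp32 accumulator

// bfloat16 is the upper 16 bits of an IEEE-754 binary32 value.
float bf16ToFloat(Bf16 B) {
  const std::uint32_t Bits = static_cast<std::uint32_t>(B) << 16;
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Assumption: round to nearest even; NaNs are returned as quiet NaNs.
Bf16 floatToBf16(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  if ((Bits & 0x7FFFFFFFu) > 0x7F800000u)
    return static_cast<Bf16>((Bits >> 16) | 0x0040u);
  Bits += 0x7FFFu + ((Bits >> 16) & 1u);
  return static_cast<Bf16>(Bits >> 16);
}

// Models the VMUL: element-wise bf16 multiply into fp32 accumulator lanes.
Acc64 vmul(const Vec64 &A, const Vec64 &B) {
  Acc64 Acc{};
  for (std::size_t I = 0; I < Acc.size(); ++I)
    Acc[I] = bf16ToFloat(A[I]) * bf16ToFloat(B[I]);
  return Acc;
}

// Models EXTRACT_SUBREG followed by the conversion of one half to bf16.
Vec32 convertHalf(const Acc64 &Acc, std::size_t Half) {
  assert(Half < 2 && "only a lo (0) and a hi (1) half exist");
  Vec32 Out{};
  for (std::size_t I = 0; I < Out.size(); ++I)
    Out[I] = floatToBf16(Acc[Half * Out.size() + I]);
  return Out;
}

// Models REG_SEQUENCE: lo half first, hi half second.
Vec64 concat(const Vec32 &Lo, const Vec32 &Hi) {
  Vec64 Out{};
  for (std::size_t I = 0; I < Lo.size(); ++I) {
    Out[I] = Lo[I];
    Out[Lo.size() + I] = Hi[I];
  }
  return Out;
}

// Shape of the pattern as written: the multiply appears once per half.
Vec64 fmulTwice(const Vec64 &A, const Vec64 &B) {
  return concat(convertHalf(vmul(A, B), 0), convertHalf(vmul(A, B), 1));
}

// Shape the reviewer asks about: one multiply feeding both halves.
Vec64 fmulReuse(const Vec64 &A, const Vec64 &B) {
  const Acc64 Acc = vmul(A, B);
  return concat(convertHalf(Acc, 0), convertHalf(Acc, 1));
}
```

Both functions compute the same result; the only difference is how many times the multiply is evaluated. Whether a later pass merges the two VMULs in the real backend is not something this sketch can show.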

def : Pat<(v32bf16 (fmul v32bf16:$vec1, v32bf16:$vec2)),
Collaborator

This isn't a standard legalization?

Collaborator Author

For this case, I don't know of any, but for the wider v64bf16 case above we could possibly use .fewerElements to keep only one pattern. I will try it.

(VCONV_bf16_fp32_mv_x_srs_bf
(EXTRACT_SUBREG
(VMUL_f_vmul_bf_vmul_bf_core_X_X VEC512:$vec1, VEC512:$vec2, (i32 mulbf16_vecconf.ConfBits)),
sub_1024_acc_lo))>;

// VMUL/VMAC Intrinsics

def : Pat<(int_aie2p_I1024_I1024_ACC2048_addmac_conf VEC1024:$s1, VEC1024:$s2, ACC2048:$acc1, ACC2048:$acc2, eR:$acc),
9 changes: 8 additions & 1 deletion llvm/lib/Target/AIE/aie2p/AIE2PLegalizerInfo.cpp
@@ -225,12 +225,17 @@ AIE2PLegalizerInfo::AIE2PLegalizerInfo(const AIE2PSubtarget &ST)

getActionDefinitionsBuilder(G_FABS).customFor({S16, S32, S64}).scalarize(0);

getActionDefinitionsBuilder(G_FMUL)
.legalFor({V64S16, V32S16})
.customFor({S16})
Collaborator

Do we need to retain .clampScalar?

Collaborator Author

We have custom legalization for S16 now, so there is no need to clamp it to S32/S64. Any other scalar should be illegal.

Collaborator

I mean .clampScalar(0, S16, S64)

Collaborator Author

But why? The only float type we have at 16 bits or below is bfloat (aka S16).

Collaborator

True. Just pointing out that we deviate from the old behavior, which clamped s128 to s64 or s8 to s32. But you are right, it does not make sense for these types.
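
Editorial illustration (not part of the patch): the behavior change discussed in this thread can be modeled as two small decision functions. This is a simplified sketch, assuming that rules are checked in order, that `clampScalar(0, S32, S64)` widens scalars narrower than 32 bits and narrows scalars wider than 64 bits, and that a type matching no rule is unsupported. The `Type` struct and both functions are hypothetical stand-ins, not the LLVM `LegalizerInfo` API.

```cpp
#include <cassert>

enum class Action {
  Legal, Custom, Libcall, WidenScalar, NarrowScalar, Unsupported
};

// Simplified type: NumElts == 1 stands for a scalar.
struct Type {
  unsigned NumElts;
  unsigned ScalarBits;
  bool isScalar() const { return NumElts == 1; }
};

// Old G_FMUL rules: .clampScalar(0, S32, S64).libcallFor({S32, S64})
Action oldFMulAction(Type T) {
  assert(T.NumElts > 0 && T.ScalarBits > 0 && "malformed type");
  if (T.isScalar() && T.ScalarBits < 32)
    return Action::WidenScalar;  // e.g. s8 or s16 is widened towards s32
  if (T.isScalar() && T.ScalarBits > 64)
    return Action::NarrowScalar; // e.g. s128 is narrowed towards s64
  if (T.isScalar() && (T.ScalarBits == 32 || T.ScalarBits == 64))
    return Action::Libcall;
  return Action::Unsupported;
}

// New G_FMUL rules:
//   .legalFor({V64S16, V32S16}).customFor({S16}).libcallFor({S32, S64})
Action newFMulAction(Type T) {
  assert(T.NumElts > 0 && T.ScalarBits > 0 && "malformed type");
  if (T.ScalarBits == 16 && (T.NumElts == 64 || T.NumElts == 32))
    return Action::Legal;
  if (T.isScalar() && T.ScalarBits == 16)
    return Action::Custom;
  if (T.isScalar() && (T.ScalarBits == 32 || T.ScalarBits == 64))
    return Action::Libcall;
  return Action::Unsupported; // s8, s128, ... no longer clamped
}
```

In this model s16 moves from "widen to s32, then libcall" to "custom", while s8 and s128 move from being clamped to being unsupported, which is the deviation the reviewer points out.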

Collaborator

It would be nice to have a comment explaining why we customize this for s16. I don't really get the context here.

Collaborator Author

We don't have an instruction to multiply bf16 scalars, so instead of using an inefficient and potentially unsafe libcall (e.g. in the case of hardware loops), we need custom legalization: insert the bf16 scalars into vectors, perform the element-wise multiplication with VMUL.f, and extract the bf16 scalar again. I can add this explanation as a comment.

Collaborator

It's the same as for FADD / FSUB. We implement a scalar multiplication by a full element by element vector mul.
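
Editorial illustration (not part of the patch): the insert, multiply, extract scheme described in this thread can be modeled in standalone C++. This is a minimal sketch under stated assumptions: 32 lanes mirror the <32 x s16> type, the `Filler` value stands in for the undefined lanes, and the fp32-to-bf16 conversion is assumed to round to nearest even. The function names are hypothetical and do not correspond to LLVM or AIE APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bf16 = std::uint16_t;           // raw bfloat16 bit pattern
constexpr std::size_t NumLanes = 32;  // mirrors the <32 x s16> vector type
using Vec = std::array<Bf16, NumLanes>;

// bfloat16 is the upper 16 bits of an IEEE-754 binary32 value.
float bf16ToFloat(Bf16 B) {
  const std::uint32_t Bits = static_cast<std::uint32_t>(B) << 16;
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Assumption: round to nearest even; NaNs are returned as quiet NaNs.
Bf16 floatToBf16(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  if ((Bits & 0x7FFFFFFFu) > 0x7F800000u)
    return static_cast<Bf16>((Bits >> 16) | 0x0040u);
  Bits += 0x7FFFu + ((Bits >> 16) & 1u);
  return static_cast<Bf16>(Bits >> 16);
}

// Models the vector-element insert: returns V with lane Idx replaced.
Vec insertLane(Vec V, Bf16 Elt, std::size_t Idx) {
  assert(Idx < NumLanes && "lane index out of range");
  V[Idx] = Elt;
  return V;
}

// Models the element-wise vector multiply on all lanes.
Vec mulVec(const Vec &A, const Vec &B) {
  Vec R{};
  for (std::size_t I = 0; I < NumLanes; ++I)
    R[I] = floatToBf16(bf16ToFloat(A[I]) * bf16ToFloat(B[I]));
  return R;
}

// Scalar bf16 multiply expressed through a full vector multiply.
// Filler plays the role of the undefined lanes; only lane 0 is read back.
Bf16 mulBf16ViaVector(Bf16 LHS, Bf16 RHS, Bf16 Filler = 0) {
  Vec Undef;
  Undef.fill(Filler);
  const Vec A = insertLane(Undef, LHS, 0);
  const Vec B = insertLane(Undef, RHS, 0);
  return mulVec(A, B)[0];
}
```

Calling `mulBf16ViaVector(0x3FC0, 0x4000)` (1.5 times 2.0) returns `0x4040` (3.0), and the result is the same for any `Filler`, since the other 31 lanes are computed but never read.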

.libcallFor({S32, S64});

getActionDefinitionsBuilder({G_FADD, G_FSUB})
.legalFor({AccV64S32})
.customFor({S16})
.libcallFor({S32, S64});

- getActionDefinitionsBuilder({G_FMUL, G_FDIV, G_FREM})
+ getActionDefinitionsBuilder({G_FDIV, G_FREM})
.clampScalar(0, S32, S64)
.libcallFor({S32, S64});

@@ -723,6 +728,8 @@ bool AIE2PLegalizerInfo::legalizeCustom(
case TargetOpcode::G_FADD:
case TargetOpcode::G_FSUB:
return AIEHelper.legalizeG_FADD_G_FSUB(Helper, MI);
case TargetOpcode::G_FMUL:
return AIEHelper.legalizeG_FMUL(Helper, MI);
case TargetOpcode::G_BUILD_VECTOR:
return AIEHelper.legalizeG_BUILD_VECTOR(Helper, MI);
case TargetOpcode::G_UNMERGE_VALUES:
1 change: 0 additions & 1 deletion llvm/test/CodeGen/AIE/GlobalISel/legalize-fmul.mir
@@ -5,7 +5,6 @@
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates

# RUN: llc -mtriple aie2 -run-pass=legalizer %s -verify-machineinstrs -o - | FileCheck -DVER=2 --check-prefix=COMMON --check-prefix=AIE2 %s
- # RUN: llc -mtriple aie2p -run-pass=legalizer %s -verify-machineinstrs -o - | FileCheck -DVER=2p --check-prefix=COMMON --check-prefix=AIE2P %s
Collaborator

We still have AIE2P check lines in the test. You could also remove -DVER=2 --check-prefix=COMMON.


---
name: test_fmul_bfloat16
63 changes: 63 additions & 0 deletions llvm/test/CodeGen/AIE/aie2p/GlobalIsel/inst-select-fmul.mir
@@ -0,0 +1,63 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
#
# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2025 Advanced Micro Devices, Inc. or its affiliates

# RUN: llc -mtriple aie2p -run-pass=instruction-select %s -o - | FileCheck %s


---
name: test_fmul_1024
legalized: true
regBankSelected: true
tracksRegLiveness: true
body: |
bb.1.entry:
liveins: $y0, $y1
; CHECK-LABEL: name: test_fmul_1024
; CHECK: liveins: $y0, $y1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:vec1024 = COPY $y0
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vec1024 = COPY $y1
; CHECK-NEXT: [[MOV_RLC_imm11_pseudo:%[0-9]+]]:er = MOV_RLC_imm11_pseudo 60
; CHECK-NEXT: [[VMUL_f_vmul_bf_vmul_bf_core_Y_Y:%[0-9]+]]:edm = VMUL_f_vmul_bf_vmul_bf_core_Y_Y [[COPY]], [[COPY1]], [[MOV_RLC_imm11_pseudo]], implicit-def dead $srfpflags, implicit $crfpmask
; CHECK-NEXT: [[MOV_RLC_imm11_pseudo1:%[0-9]+]]:er = MOV_RLC_imm11_pseudo 60
; CHECK-NEXT: [[VMUL_f_vmul_bf_vmul_bf_core_Y_Y1:%[0-9]+]]:edm = VMUL_f_vmul_bf_vmul_bf_core_Y_Y [[COPY]], [[COPY1]], [[MOV_RLC_imm11_pseudo1]], implicit-def dead $srfpflags, implicit $crfpmask
; CHECK-NEXT: [[COPY2:%[0-9]+]]:ecmh = COPY [[VMUL_f_vmul_bf_vmul_bf_core_Y_Y1]].sub_1024_acc_hi
; CHECK-NEXT: [[VCONV_bf16_fp32_mv_x_srs_bf:%[0-9]+]]:exo = VCONV_bf16_fp32_mv_x_srs_bf [[COPY2]], implicit-def dead $srf2fflags, implicit $crf2fmask, implicit $crrnd
; CHECK-NEXT: [[COPY3:%[0-9]+]]:ecml = COPY [[VMUL_f_vmul_bf_vmul_bf_core_Y_Y]].sub_1024_acc_lo
; CHECK-NEXT: [[VCONV_bf16_fp32_mv_x_srs_bf1:%[0-9]+]]:exe = VCONV_bf16_fp32_mv_x_srs_bf [[COPY3]], implicit-def dead $srf2fflags, implicit $crf2fmask, implicit $crrnd
; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vec1024 = REG_SEQUENCE [[VCONV_bf16_fp32_mv_x_srs_bf1]], %subreg.sub_512_lo, [[VCONV_bf16_fp32_mv_x_srs_bf]], %subreg.sub_512_hi
; CHECK-NEXT: PseudoRET implicit $lr, implicit [[REG_SEQUENCE]]
%0:vregbank(<64 x s16>) = COPY $y0
%1:vregbank(<64 x s16>) = COPY $y1
%2:vregbank(<64 x s16>) = G_FMUL %0, %1
PseudoRET implicit $lr, implicit %2
...

---
name: test_fmul_512
legalized: true
regBankSelected: true
tracksRegLiveness: true
body: |
bb.1.entry:
liveins: $x0, $x1
; CHECK-LABEL: name: test_fmul_512
; CHECK: liveins: $x0, $x1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:vec512 = COPY $x0
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vec512 = COPY $x1
; CHECK-NEXT: [[MOV_RLC_imm11_pseudo:%[0-9]+]]:er = MOV_RLC_imm11_pseudo 60
; CHECK-NEXT: [[VMUL_f_vmul_bf_vmul_bf_core_X_X:%[0-9]+]]:edm = VMUL_f_vmul_bf_vmul_bf_core_X_X [[COPY]], [[COPY1]], [[MOV_RLC_imm11_pseudo]], implicit-def dead $srfpflags, implicit $crfpmask
; CHECK-NEXT: [[COPY2:%[0-9]+]]:ecml = COPY [[VMUL_f_vmul_bf_vmul_bf_core_X_X]].sub_1024_acc_lo
; CHECK-NEXT: [[VCONV_bf16_fp32_mv_x_srs_bf:%[0-9]+]]:vec512 = VCONV_bf16_fp32_mv_x_srs_bf [[COPY2]], implicit-def dead $srf2fflags, implicit $crf2fmask, implicit $crrnd
; CHECK-NEXT: PseudoRET implicit $lr, implicit [[VCONV_bf16_fp32_mv_x_srs_bf]]
%0:vregbank(<32 x s16>) = COPY $x0
%1:vregbank(<32 x s16>) = COPY $x1
%2:vregbank(<32 x s16>) = G_FMUL %0, %1
PseudoRET implicit $lr, implicit %2
...
80 changes: 80 additions & 0 deletions llvm/test/CodeGen/AIE/aie2p/GlobalIsel/legalize-fmul.mir
@@ -0,0 +1,80 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# (c) Copyright 2024 Advanced Micro Devices, Inc. or its affiliates

# RUN: llc -mtriple aie2p -run-pass=legalizer %s -verify-machineinstrs -o - | FileCheck %s
Collaborator

Would be nice to include the libcall tests as well.

Collaborator Author

@khallouh khallouh Feb 21, 2025

We didn't have them before, but I will add them while I'm at it.


---
name: test_fmul_s16
body: |
bb.0:
liveins: $r1, $r2
; CHECK-LABEL: name: test_fmul_s16
; CHECK: liveins: $r1, $r2
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $r1
; CHECK-NEXT: [[TRUNC:%[0-9]+]]:_(s16) = G_TRUNC [[COPY]](s32)
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY $r2
; CHECK-NEXT: [[TRUNC1:%[0-9]+]]:_(s16) = G_TRUNC [[COPY1]](s32)
; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
; CHECK-NEXT: [[DEF:%[0-9]+]]:_(<32 x s16>) = G_IMPLICIT_DEF
; CHECK-NEXT: [[AIE_INSERT_VECTOR_ELT:%[0-9]+]]:_(<32 x s16>) = G_AIE_INSERT_VECTOR_ELT [[DEF]], [[TRUNC]](s16), [[C]](s32)
; CHECK-NEXT: [[AIE_INSERT_VECTOR_ELT1:%[0-9]+]]:_(<32 x s16>) = G_AIE_INSERT_VECTOR_ELT [[DEF]], [[TRUNC1]](s16), [[C]](s32)
; CHECK-NEXT: [[FMUL:%[0-9]+]]:_(<32 x s16>) = G_FMUL [[AIE_INSERT_VECTOR_ELT]], [[AIE_INSERT_VECTOR_ELT1]]
; CHECK-NEXT: [[AIE_SEXT_EXTRACT_VECTOR_ELT:%[0-9]+]]:_(s32) = G_AIE_SEXT_EXTRACT_VECTOR_ELT [[FMUL]](<32 x s16>), [[C]](s32)
; CHECK-NEXT: [[ASSERT_SEXT:%[0-9]+]]:_(s32) = G_ASSERT_SEXT [[AIE_SEXT_EXTRACT_VECTOR_ELT]], 16
; CHECK-NEXT: $r0 = COPY [[ASSERT_SEXT]](s32)
; CHECK-NEXT: PseudoRET implicit $lr, implicit $r0
%0:_(s32) = COPY $r1
%1:_(s16) = G_TRUNC %0(s32)
%2:_(s32) = COPY $r2
%3:_(s16) = G_TRUNC %2(s32)
%4:_(s16) = G_FMUL %1, %3
%5:_(s32) = G_ANYEXT %4(s16)
$r0 = COPY %5(s32)
PseudoRET implicit $lr, implicit $r0
...

---
name: test_fmul_vec_1024
body: |
bb.0:
liveins: $dm0, $dm1
; CHECK-LABEL: name: test_fmul_vec_1024
; CHECK: liveins: $dm0, $dm1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<64 x s16>) = COPY $cml0
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<64 x s16>) = COPY $cml1
; CHECK-NEXT: [[FMUL:%[0-9]+]]:_(<64 x s16>) = G_FMUL [[COPY]], [[COPY1]]
; CHECK-NEXT: $cml0 = COPY [[FMUL]](<64 x s16>)
; CHECK-NEXT: PseudoRET implicit $lr, implicit $cml0
%0:_(<64 x s16>) = COPY $cml0
%1:_(<64 x s16>) = COPY $cml1
%2:_(<64 x s16>) = G_FMUL %0, %1
$cml0 = COPY %2(<64 x s16>)
PseudoRET implicit $lr, implicit $cml0
...

---
name: test_fmul_vec_512
body: |
bb.0:
liveins: $dm0, $dm1
; CHECK-LABEL: name: test_fmul_vec_512
; CHECK: liveins: $dm0, $dm1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<32 x s16>) = COPY $bmll0
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<32 x s16>) = COPY $bmll1
; CHECK-NEXT: [[FMUL:%[0-9]+]]:_(<32 x s16>) = G_FMUL [[COPY]], [[COPY1]]
; CHECK-NEXT: $bmll0 = COPY [[FMUL]](<32 x s16>)
; CHECK-NEXT: PseudoRET implicit $lr, implicit $bmll0
%0:_(<32 x s16>) = COPY $bmll0
%1:_(<32 x s16>) = COPY $bmll1
%2:_(<32 x s16>) = G_FMUL %0, %1
$bmll0 = COPY %2(<32 x s16>)
PseudoRET implicit $lr, implicit $bmll0
...