
[BugFix][Ansor] Fixing BroadcastShape function #17627

Open
wants to merge 1 commit into main from broadcast-error

Conversation

@thaisacs (Contributor) commented Feb 6, 2025

Behavior before correction

When Ansor doesn't find a schedule for a layer during tuning, like this:

tvmgen_default_fused_nn_conv2d_28
Cannot find tuned schedules for target=llvm -keys=cpu -mtriple=x86_64-pc-linux-gnu, workload_key=["2d10de6646307f0e3e5cf4b31c20e69b", [1, 7, 7, 960], [1, 1, 960, 320], [1, 7, 7, 320]]. A fallback TOPI schedule is used, which may bring great performance regression or even compilation failure. Compute DAG info:
p0 = PLACEHOLDER [1, 7, 7, 960]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 960, 320]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])

I was getting model compilation failures like this:

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

Behavior after correction

Now, when Ansor doesn't find a schedule for a layer during tuning, like this:

fused_nn_dense_add
Cannot find tuned schedules for target=llvm -keys=cpu -mtriple=x86_64-pc-linux-gnu, workload_key=["08f7449d79e570b7274174709e5e5e01", [1, 1280], [1000, 1280], [1, 1000], [1, 1000]]. A fallback TOPI schedule is used, which may bring great performance regression or even compilation failure. Compute DAG info:
p0 = PLACEHOLDER [1, 1280]
p1 = PLACEHOLDER [1000, 1280]
T_matmul_NT(i0, i1) += (p0[i0, k]*p1[i1, k])
p2 = PLACEHOLDER [1, 1000]
T_add(ax0, ax1) = (T_matmul_NT[ax0, ax1] + p2[ax0, ax1])

the code reports the following warning:

[00:50:32] /home/thais/Dev/tvm/include/tvm/topi/detail/broadcast.h:86: Warning: Incompatible broadcast dims: 256 and 96. Automatically cutting the larger dimension.
[00:50:33] /home/thais/Dev/tvm/include/tvm/topi/detail/broadcast.h:86: Warning: Incompatible broadcast dims: 256 and 160. Automatically cutting the larger dimension.

And I can compile the models without compilation or execution errors, and the accuracy of the models is maintained.
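
To make the proposed behavior concrete, the following is a minimal Python sketch of the per-dimension rule involved (it mirrors, but is not, the actual C++ in include/tvm/topi/detail/broadcast.h, and treating "trimming" as taking the smaller of the two mismatched extents is an assumption based on the warning text):

def broadcast_shape_sketch(shape1, shape2, relaxed=False):
    # Align the shapes from the right and resolve each dimension pair.
    out = []
    for i in range(1, max(len(shape1), len(shape2)) + 1):
        d1 = shape1[-i] if i <= len(shape1) else 1
        d2 = shape2[-i] if i <= len(shape2) else 1
        if d1 == d2 or d1 == 1 or d2 == 1:
            out.append(max(d1, d2))  # the usual broadcasting rule
        elif relaxed:
            # Proposed behavior: warn and trim to the smaller extent.
            print(f"Warning: Incompatible broadcast dims: {d1} and {d2}. "
                  "Automatically trimming the larger dimension.")
            out.append(min(d1, d2))
        else:
            # Previous behavior: the check fails and compilation aborts.
            raise ValueError(f"Incompatible broadcast dims: {d1} and {d2}")
    return list(reversed(out))

# With the shapes from the error above, strict mode reproduces the failure,
# while relaxed mode yields [1, 7, 7, 128].
print(broadcast_shape_sketch([1, 7, 7, 144], [1, 1, 1, 128], relaxed=True))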

@thaisacs thaisacs force-pushed the broadcast-error branch 2 times, most recently from 234b87c to d2facb3 Compare February 6, 2025 13:13
@thaisacs thaisacs changed the title [FixBug][Ansor] Fixing BroadcastShape function [BugFix][Ansor] Fixing BroadcastShape function Feb 6, 2025
@thaisacs thaisacs force-pushed the broadcast-error branch 5 times, most recently from 2fa4299 to 93feceb Compare February 7, 2025 01:29
<< " in: " << tvm::Array<tvm::PrimExpr>(shape1.begin(), shape1.end()) << " and "
<< tvm::Array<tvm::PrimExpr>(shape2.begin(), shape2.end());
LOG(WARNING) << "Incompatible broadcast dims: " << shape1[s1_size - i] << " and "
<< shape2[s2_size - i] << ". Automatically cutting the larger dimension.";
A Contributor commented on the lines above:

You relax the constraint here, leaving an explicit warning about it; maybe that's enough for the broadcasting case.

Not an English native here, but maybe s/cutting/trimming/ would sound better? I leave it up to your consideration.

Contributor Author replied:

Trimming sounds better to me (I'm also not a native English speaker). I've made the correction.

@cbalint13 (Contributor) commented:

Hi @thaisacs ,

Thanks for your contribution !

  • Not sure what model you used; is it a well-known public model or a custom one?
  • You basically relax/bypass a sanity check on the broadcast operation here in TVM.

Shouldn't this fix instead add a trim operation, or something similar, to your model to deal with that shape mismatch?

@cbalint13 (Contributor) commented Feb 7, 2025

Updating, more reasoning:

If metaschedule iterates through and bails out/aborts the whole program, then maybe we should relax the check as you proposed.

But let's see others' opinions on this statement. Cc @Hzfengsy @MasterJH5574

Also, for acceptance, a small test case for such a broadcast scenario would be nice in test_topi_broadcast.py.
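
For illustration, a minimal sketch of such a test case (the shapes come from the error reported in the PR description; the expected output shape [1, 7, 7, 128] assumes the "trim the larger dimension" behavior proposed here):

from tvm import te, topi

def test_broadcast_incompatible_dims_are_trimmed():
    A = te.placeholder((1, 7, 7, 144), name="A")
    B = te.placeholder((1, 1, 1, 128), name="B")
    # Before this change, topi.add on these shapes failed the check in
    # BroadcastShape; with the proposed change it should only warn.
    C = topi.add(A, B)
    assert [int(d) for d in C.shape] == [1, 7, 7, 128]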

@thaisacs (Contributor Author) commented Feb 7, 2025

@cbalint13

Hello, I'm conducting experiments with the following models:

  • resnet_18
  • resnet_50
  • resnext_50
  • wide_resnet_50
  • mobilenet_v2
  • mobilenet_v3
  • resnet_152
  • inception_v3
  • alexnet
  • densenet_121
  • vgg_16
  • googlenet

Imported directly from PyTorch to Relay.
This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

@cbalint13 (Contributor) commented Feb 7, 2025

@cbalint13

Hello, I'm conducting experiments with the following models:

  • resnet_18
  • resnet_50
    {...}
  • googlenet

Imported directly from PyTorch to Relay. This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

Thank you, I see now. But from your description the extent of the issue is still not clear:

Does metaschedule (a) bail out/abort the whole program, or (b) only skip some specific layers?

  • In case (a): yes, if the whole program is aborted, it is clear we should fix this issue.
  • In case (b): iterated variations fail (which is normal; maybe all variations fail too, so the layer is skipped). Relaxing this can produce "working schedules" but with no guarantees on correctness. If Ansor is unable to find a single valid schedule within a layer, we must check the cause of the inflexibility; just relaxing some rules to get some results is not the best idea.

Let's again see others' opinions on it.

Could you attach outputs (as text.gz) here with the complete schedule process from your side?

@thaisacs (Contributor Author) commented Feb 7, 2025

@cbalint13
Hello, I'm conducting experiments with the following models:

  • resnet_18
  • resnet_50
    {...}
  • googlenet

Imported directly from PyTorch to Relay. This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

Thank you, I see now. But from your description the extent of the issue is still not clear:

Does metaschedule (a) bail out/abort the whole program, or (b) only skip some specific layers?

  • In case (a): yes, if the whole program is aborted, it is clear we should fix this issue.
  • In case (b): iterated variations fail (which is normal; maybe all variations fail too, so the layer is skipped). Relaxing this can produce "working schedules" but with no guarantees on correctness. If Ansor is unable to find a single valid schedule within a layer, we must check the cause of the inflexibility; just relaxing some rules to get some results is not the best idea.

Let's again see others' opinions on it.

Could you attach outputs (as text.gz) here with the complete schedule process from your side?

When I execute the following function with the GoogLeNet network, for example, with log_file being the file produced during tuning,

import numpy as np
import tvm
import tvm.testing
from tvm import auto_scheduler, relay
from tvm.contrib import graph_executor

# get_network_with_key is a local helper that returns (mod, params, inputs).

def model_run(network_arg, dtype, target, log_file):
    mod, params, inputs = get_network_with_key(network_arg, dtype)

    print("Compile...")
    input_shape = inputs[0][1]

    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
            lib = relay.build(mod, target=target, params=params)
            
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=0, config={"relay.backend.use_auto_scheduler": True}):
            ref_lib = relay.build(mod, target=target, params=params)

    # Check the correctness
    def get_output(input_data, data, lib):
        dev = tvm.device(str(target), 0)
        module = graph_executor.GraphModule(lib["default"](dev))
        module.set_input(input_data, data)
        module.run()
        return module.get_output(0).numpy()

    def run_bench(input_data, data, lib):
        dev = tvm.device(str(target), 0)
        # Create graph executor
        module = graph_executor.GraphModule(lib["default"](dev))
        module.set_input(input_data, data)
        # Evaluate
        print("Evaluate inference time cost...")
        for x in range(0, 1):
            print(module.benchmark(dev, repeat=10, number=10, min_repeat_ms=500, end_to_end=True))

    np.random.seed(0)
    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
    run_bench(inputs[0][0], data_tvm, lib)

    np.random.seed(0)
    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
    actual_output1 = get_output(inputs[0][0], data_tvm, lib)
    expected_output = get_output(inputs[0][0], data_tvm, ref_lib)

    tvm.testing.assert_allclose(actual_output1, expected_output, rtol=1e-4, atol=1e-4)

The program is aborted during compilation, specifically at the following lines:

    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
            lib = relay.build(mod, target=target, params=params)

So we are unable to generate code and run model inference. Note that there were no issues during the tuning that produced the log_file; the .json file with the schedules for the layers that were tuned is generated normally.

@cbalint13 (Contributor) commented Feb 7, 2025

@thaisacs ,

Thanks for the details, I see now what you are trying to do.

  • You are using the relay flow (not maintained), which is scheduled to be phased out with the next release.
  • You have to use the relax flow; I ran an experiment here with resnet18 and cannot reproduce the bug.

See my attached test script tvm-metasched-relax-resnet18.py.gz, more online here.
Also see the console log relax-metasched-resnet18.log.gz (truncated, stopped a bit earlier).

Notes:

  • If you want to manipulate the history (load/store) states, see the docs on how to do this via relax.
  • I was using the versions below, but because my pytorch is too new, I used the relax ONNX backend.
    $ rpm -q tvm onnx pytorch
    tvm-0.20-20250204.0.git9404fb5a.cu12_6.fc42.x86_64
    onnx-1.18.0-20250203.0.git0277a1f6.fc42.x86_64
    pytorch-2.7.0-20250117.0.git42c64bd3.cu12_6.fc42.x86_64
    

Please let me know if the issue still persists for you via relax.

Later update:


Back to your proposed fix,

  • I cannot confirm this issue on my side; please also check using relax for your case.
  • The fix proposed here just shunts a constraint on broadcast, which IMHO is a bad idea.

@thaisacs (Contributor Author) commented Feb 8, 2025

@cbalint13

Thank you for your response and help.

I am using Relay and the auto-scheduler (a.k.a. Ansor), but I don't think the issue is in Relay.
For some tasks, the auto-scheduler fails to generate code. In these cases, it calls TOPI and ends up in the BroadcastShape function. For some schedules found during tuning, it is not possible to use the TOPI code for the layers where code generation failed. As a result, the tuning solution cannot be used, and we need to perform a new tuning.

I believe that addressing this issue is crucial for maintaining support for the auto-scheduler. Will support for the auto-scheduler be discontinued in the next release?

@cbalint13 (Contributor) commented Feb 8, 2025

@cbalint13

Thank you for your response and help.

I am using Relay and the auto-scheduler (a.k.a. Ansor), but I don't think the issue is in Relay. For some tasks, the auto-scheduler fails to generate code. In these cases, it calls TOPI and ends up in the BroadcastShape function. For some schedules found during tuning, it is not possible to use the TOPI code for the layers where code generation failed. As a result, the tuning solution cannot be used, and we need to perform a new tuning.

A pinpointed example would help here to understand the root cause; the relay and relax flows differ in the way passes are applied.
I am personally against shunting the broadcast check; the very failure must signal the scheduler to skip that iteration variant (as not legit).

Cc @Hzfengsy

I believe that addressing this issue is crucial for maintaining support for the auto-scheduler. Did auto-scheduler support stop in v0.19.0?

The auto-scheduler (metaschedule/ansor) stays; the older autotune with the relay flow goes (it has been unmaintained for a while).
The auto-scheduler is commonly used by both relax and the old relay, but the way rules are applied in each flow may differ.

I suggest adapting your code to an equivalent modern relax flow; if this issue still pops up there, we must look at it.
Do you also experience this issue only with a more recent tvm version (assuming with the unmaintained relay)?
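
For reference, a rough sketch of such a relax-flow compile-and-run path, assuming the model is first exported to ONNX; the file name, input name, and input shape are placeholders, and exact pipeline steps may differ between TVM versions:

import numpy as np
import onnx
import tvm
from tvm import relax
from tvm.relax.frontend.onnx import from_onnx

# Hypothetical ONNX export of one of the models listed above.
onnx_model = onnx.load("googlenet.onnx")
mod = from_onnx(onnx_model, shape_dict={"input": (1, 3, 224, 224)})

# Lower high-level ops to TIR; newer relax.build pipelines may do this implicitly.
mod = relax.transform.LegalizeOps()(mod)

target = tvm.target.Target("llvm -mtriple=x86_64-pc-linux-gnu")
ex = relax.build(mod, target)

dev = tvm.cpu(0)
vm = relax.VirtualMachine(ex, dev)
data = tvm.nd.array(np.random.uniform(size=(1, 3, 224, 224)).astype("float32"), dev)
out = vm["main"](data)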

@thaisacs (Contributor Author) commented Feb 8, 2025

@cbalint13

Thanks. I'll test it on the relax flow.

@Hzfengsy (Member) commented:

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.

So I wonder where the root of the issue is. To be clear: TOPI or Ansor?
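
For reference, standard NumPy broadcasting rejects the same shape pair, which is the rule the TOPI check mirrors (a quick check, assuming NumPy >= 1.20 for broadcast_shapes):

import numpy as np

try:
    np.broadcast_shapes((1, 7, 7, 144), (1, 1, 1, 128))
except ValueError as err:
    # The trailing dims 144 and 128 differ and neither is 1.
    print(err)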

@thaisacs (Contributor Author) commented:

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.

So I wonder where the root of the issue is. To be clear: TOPI or Ansor?

@Hzfengsy

I think it's in the interaction of the two: auto-scheduler and TOPI. The auto-scheduler can find schedules for some layers, but not all. When attempting to compile the entire model, TOPI needs to deal with the case that the internal tensors have different dimensions. Currently, TOPI handles this by stopping the compilation process.

Note that, when a tensor's dimension is dynamic and cannot be determined at compile time, TOPI prematurely considers the shape of the output tensor.

@cbalint13 (Contributor) commented Feb 12, 2025

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.
So I wonder where the root of the issue is. To be clear: TOPI or Ansor?

@Hzfengsy

I think it's in the interaction of the two: auto-scheduler and TOPI. The auto-scheduler can find schedules for some layers, but not all. When attempting to compile the entire model, TOPI needs to deal with the case that the internal tensors have different dimensions. Currently, TOPI handles this by stopping the compilation process.

Note that, when a tensor's dimension is dynamic and cannot be determined at compile time, TOPI prematurely considers the shape of the output tensor.

@thaisacs ,

As a practical note, if the sample limit per layer is too low, metaschedule will fail to propose any valid sketches for some layers.
This can be seen in the sample provided here using the relax flow (with the attached logs as proof) if samples are lowered from 8000 to 1000.

If we "shunt" things like this boradcast, there will be more apperantly "valid" proposals so searching converges faster, but is not we want, we also want a "legit/valid" final form of the tuned model.

@thaisacs (Contributor Author) commented:

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.
So I wonder where the root of the issue is. To be clear: TOPI or Ansor?

@Hzfengsy
I think it's in the interaction of the two: auto-scheduler and TOPI. The auto-scheduler can find schedules for some layers, but not all. When attempting to compile the entire model, TOPI needs to deal with the case that the internal tensors have different dimensions. Currently, TOPI handles this by stopping the compilation process.
Note that, when a tensor's dimension is dynamic and cannot be determined at compile time, TOPI prematurely considers the shape of the output tensor.

@thaisacs ,

As a practical note, if the sample limit per layer is too low, metaschedule will fail to propose any valid sketches for some layers. This can be seen in the sample provided here using the relax flow (with the attached logs as proof) if samples are lowered from 8000 to 1000.

If we "shunt" things like this boradcast, there will be more apperantly "valid" proposals so searching converges faster, but is not we want, we also want a "legit/valid" final form of the tuned model.

@cbalint13

I think that the broadcast shape function has no impact on the search. It is only used for the final compilation of the model.

Aren't invalid schedules removed by the evolutionary search? The searches I performed considered 1000 points per layer of the model. For example, in resnet_152 with 27 layers, the evolutionary search explored 26968 schedules.
In my tests, the broadcast shape function did not change the accuracy of the model.
The only model that had not-so-good accuracy was inception_v3, but that happens even without this change, when the broadcast function still raises an error.
