
[BugFix][Ansor] Fixing BroadcastShape function #17627

Open
wants to merge 1 commit into main from broadcast-error

Conversation

@thaisacs (Contributor) commented Feb 6, 2025

Behavior before correction

When Ansor doesn't find a schedule for a layer during tuning, like this:

tvmgen_default_fused_nn_conv2d_28
Cannot find tuned schedules for target=llvm -keys=cpu -mtriple=x86_64-pc-linux-gnu, workload_key=["2d10de6646307f0e3e5cf4b31c20e69b", [1, 7, 7, 960], [1, 1, 960, 320], [1, 7, 7, 320]]. A fallback TOPI schedule is used, which may bring great performance regression or even compilation failure. Compute DAG info:
p0 = PLACEHOLDER [1, 7, 7, 960]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 960, 320]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])

I was getting model compilation failures like this:

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

Behavior after correction

Now, when Ansor doesn't find a schedule for a layer during tuning, like this:

fused_nn_dense_add
Cannot find tuned schedules for target=llvm -keys=cpu -mtriple=x86_64-pc-linux-gnu, workload_key=["08f7449d79e570b7274174709e5e5e01", [1, 1280], [1000, 1280], [1, 1000], [1, 1000]]. A fallback TOPI schedule is used, which may bring great performance regression or even compilation failure. Compute DAG info:
p0 = PLACEHOLDER [1, 1280]
p1 = PLACEHOLDER [1000, 1280]
T_matmul_NT(i0, i1) += (p0[i0, k]*p1[i1, k])
p2 = PLACEHOLDER [1, 1000]
T_add(ax0, ax1) = (T_matmul_NT[ax0, ax1] + p2[ax0, ax1])

the code reports the following warning:

[00:50:32] /home/thais/Dev/tvm/include/tvm/topi/detail/broadcast.h:86: Warning: Incompatible broadcast dims: 256 and 96. Automatically cutting the larger dimension.
[00:50:33] /home/thais/Dev/tvm/include/tvm/topi/detail/broadcast.h:86: Warning: Incompatible broadcast dims: 256 and 160. Automatically cutting the larger dimension.

And I can compile the models without compilation or execution errors, and the accuracy of the models is maintained.
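
To make the proposed behavior concrete, the following is a minimal Python sketch of the per-dimension rule involved (it mirrors, but is not, the actual C++ in include/tvm/topi/detail/broadcast.h, and treating "trimming" as taking the smaller of the two mismatched extents is an assumption based on the warning text):

def broadcast_shape_sketch(shape1, shape2, relaxed=False):
    # Align the shapes from the right and resolve each dimension pair.
    out = []
    for i in range(1, max(len(shape1), len(shape2)) + 1):
        d1 = shape1[-i] if i <= len(shape1) else 1
        d2 = shape2[-i] if i <= len(shape2) else 1
        if d1 == d2 or d1 == 1 or d2 == 1:
            out.append(max(d1, d2))  # the usual broadcasting rule
        elif relaxed:
            # Proposed behavior: warn and trim to the smaller extent.
            print(f"Warning: Incompatible broadcast dims: {d1} and {d2}. "
                  "Automatically trimming the larger dimension.")
            out.append(min(d1, d2))
        else:
            # Previous behavior: the check fails and compilation aborts.
            raise ValueError(f"Incompatible broadcast dims: {d1} and {d2}")
    return list(reversed(out))

# With the shapes from the error above, strict mode reproduces the failure,
# while relaxed mode yields [1, 7, 7, 128].
print(broadcast_shape_sketch([1, 7, 7, 144], [1, 1, 1, 128], relaxed=True))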

@thaisacs thaisacs force-pushed the broadcast-error branch 2 times, most recently from 234b87c to d2facb3 Compare February 6, 2025 13:13
@thaisacs thaisacs changed the title [FixBug][Ansor] Fixing BroadcastShape function [BugFix][Ansor] Fixing BroadcastShape function Feb 6, 2025
@thaisacs thaisacs force-pushed the broadcast-error branch 5 times, most recently from 2fa4299 to 93feceb Compare February 7, 2025 01:29
<< " in: " << tvm::Array<tvm::PrimExpr>(shape1.begin(), shape1.end()) << " and "
<< tvm::Array<tvm::PrimExpr>(shape2.begin(), shape2.end());
LOG(WARNING) << "Incompatible broadcast dims: " << shape1[s1_size - i] << " and "
<< shape2[s2_size - i] << ". Automatically cutting the larger dimension.";
A Contributor commented on the lines above:

You relax the constraint here, leaving an explicit warning about it; maybe that's enough for the broadcasting case.

Not an English native here, but maybe s/cutting/trimming/ would sound better? I leave it up to your consideration.

Contributor Author replied:

Trimming sounds better to me (I'm also not a native English speaker). I've made the correction.

@cbalint13 (Contributor) commented:

Hi @thaisacs ,

Thanks for your contribution !

  • Not sure what model you used; is it a well-known public model or a custom one?
  • You basically relax/bypass a sanity check on the broadcast operation here in TVM.

Shouldn't this fix instead add a trim operation, or something similar, to your model to deal with that shape mismatch?

@cbalint13 (Contributor) commented Feb 7, 2025

Updating, more reasoning:

If metaschedule iterates through and bails out/aborts the whole program, then maybe we should relax the check as you proposed.

But let's see others' opinions on this statement. Cc @Hzfengsy @MasterJH5574

Also, for acceptance, a small test case for such a broadcast scenario would be nice in test_topi_broadcast.py.
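
For illustration, a minimal sketch of such a test case (the shapes come from the error reported in the PR description; the expected output shape [1, 7, 7, 128] assumes the "trim the larger dimension" behavior proposed here):

from tvm import te, topi

def test_broadcast_incompatible_dims_are_trimmed():
    A = te.placeholder((1, 7, 7, 144), name="A")
    B = te.placeholder((1, 1, 1, 128), name="B")
    # Before this change, topi.add on these shapes failed the check in
    # BroadcastShape; with the proposed change it should only warn.
    C = topi.add(A, B)
    assert [int(d) for d in C.shape] == [1, 7, 7, 128]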

@thaisacs (Contributor Author) commented Feb 7, 2025

@cbalint13

Hello, I'm conducting experiments with the following models:

  • resnet_18
  • resnet_50
  • resnext_50
  • wide_resnet_50
  • mobilenet_v2
  • mobilenet_v3
  • resnet_152
  • inception_v3
  • alexnet
  • densenet_121
  • vgg_16
  • googlenet

Imported directly from PyTorch to Relay.
This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

@cbalint13 (Contributor) commented Feb 7, 2025

@cbalint13

Hello, I'm conducting experiments with the following models:

  • resnet_18
  • resnet_50
    {...}
  • googlenet

Imported directly from PyTorch to Relay. This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

Thank you, I see now. But from your description the extent of the issue is still not clear:

Does metaschedule (a) bail out/abort the whole program, or (b) only skip some specific layers?

  • In case (a): yes, if the whole program is aborted, it is clear we should fix this issue.
  • In case (b): iterated variations fail (which is normal; maybe all variations fail too, so the layer is skipped). Relaxing this can produce "working schedules" but with no guarantees on correctness. If Ansor is unable to find a single valid schedule within a layer, we must check the cause of the inflexibility; just relaxing some rules to get some results is not the best idea.

Let's again see others' opinions on it.

Could you attach outputs (as text.gz) here with the complete schedule process from your side?

@thaisacs (Contributor Author) commented Feb 7, 2025

@cbalint13
Hello, I'm conducting experiments with the following models:

  • resnet_18
  • resnet_50
    {...}
  • googlenet

Imported directly from PyTorch to Relay. This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

Thank you, I see now. But from your description the extent of the issue is still not clear:

Does metaschedule (a) bail out/abort the whole program, or (b) only skip some specific layers?

  • In case (a): yes, if the whole program is aborted, it is clear we should fix this issue.
  • In case (b): iterated variations fail (which is normal; maybe all variations fail too, so the layer is skipped). Relaxing this can produce "working schedules" but with no guarantees on correctness. If Ansor is unable to find a single valid schedule within a layer, we must check the cause of the inflexibility; just relaxing some rules to get some results is not the best idea.

Let's again see others' opinions on it.

Could you attach outputs (as text.gz) here with the complete schedule process from your side?

When I execute the following function with the GoogLeNet network, for example, with log_file being the file produced during tuning,

import numpy as np
import tvm
import tvm.testing
from tvm import auto_scheduler, relay
from tvm.contrib import graph_executor

# get_network_with_key is a local helper that returns (mod, params, inputs).

def model_run(network_arg, dtype, target, log_file):
    mod, params, inputs = get_network_with_key(network_arg, dtype)

    print("Compile...")
    input_shape = inputs[0][1]

    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
            lib = relay.build(mod, target=target, params=params)
            
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=0, config={"relay.backend.use_auto_scheduler": True}):
            ref_lib = relay.build(mod, target=target, params=params)

    # Check the correctness
    def get_output(input_data, data, lib):
        dev = tvm.device(str(target), 0)
        module = graph_executor.GraphModule(lib["default"](dev))
        module.set_input(input_data, data)
        module.run()
        return module.get_output(0).numpy()

    def run_bench(input_data, data, lib):
        dev = tvm.device(str(target), 0)
        # Create graph executor
        module = graph_executor.GraphModule(lib["default"](dev))
        module.set_input(input_data, data)
        # Evaluate
        print("Evaluate inference time cost...")
        for x in range(0, 1):
            print(module.benchmark(dev, repeat=10, number=10, min_repeat_ms=500, end_to_end=True))

    np.random.seed(0)
    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
    run_bench(inputs[0][0], data_tvm, lib)

    np.random.seed(0)
    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
    actual_output1 = get_output(inputs[0][0], data_tvm, lib)
    expected_output = get_output(inputs[0][0], data_tvm, ref_lib)

    tvm.testing.assert_allclose(actual_output1, expected_output, rtol=1e-4, atol=1e-4)

The program is aborted during compilation, specifically at the following lines:

    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
            lib = relay.build(mod, target=target, params=params)

So we are unable to generate code and run model inference. Note that there were no issues during the tuning that produced the log_file; the .json file with the schedules for the layers that were tuned is generated normally.

@cbalint13 (Contributor) commented Feb 7, 2025

@thaisacs ,

Thanks for the details, I see now what you are trying to do.

  • You are using the relay flow (not maintained), which is scheduled to be phased out with the next release.
  • You have to use the relax flow; I ran an experiment here with resnet18 and cannot reproduce the bug.

See my attached test script tvm-metasched-relax-resnet18.py.gz, more online here.
Also see the console log relax-metasched-resnet18.log.gz (truncated, stopped a bit earlier).

Notes:

  • If you want to manipulate the history (load/store) states, see the docs on how to do this via relax.
  • I was using the versions below, but because my pytorch is too new, I used the relax ONNX backend.
    $ rpm -q tvm onnx pytorch
    tvm-0.20-20250204.0.git9404fb5a.cu12_6.fc42.x86_64
    onnx-1.18.0-20250203.0.git0277a1f6.fc42.x86_64
    pytorch-2.7.0-20250117.0.git42c64bd3.cu12_6.fc42.x86_64
    

Please let me know if the issue still persists for you via relax.

Later update:


Back to your proposed fix,

  • I cannot confirm this issue on my side; please also check using relax for your case.
  • The fix proposed here just shunts a constraint on broadcast, which IMHO is a bad idea.

@thaisacs (Contributor Author) commented Feb 8, 2025

@cbalint13

Thank you for your response and help.

I am using Relay and the auto-scheduler (a.k.a. Ansor), but I don't think the issue is in Relay.
For some tasks, the auto-scheduler fails to generate code. In these cases, it calls TOPI and ends up in the BroadcastShape function. For some schedules found during tuning, it is not possible to use the TOPI code for the layers where code generation failed. As a result, the tuning solution cannot be used, and we need to perform a new tuning.

I believe that addressing this issue is crucial for maintaining support for the auto-scheduler. Will support for the auto-scheduler be discontinued in the next release?

@cbalint13 (Contributor) commented Feb 8, 2025

@cbalint13

Thank you for your response and help.

I am using Relay and the auto-scheduler (a.k.a. Ansor), but I don't think the issue is in Relay. For some tasks, the auto-scheduler fails to generate code. In these cases, it calls TOPI and ends up in the BroadcastShape function. For some schedules found during tuning, it is not possible to use the TOPI code for the layers where code generation failed. As a result, the tuning solution cannot be used, and we need to perform a new tuning.

A pinpointed example would help here to understand the root cause; the relay and relax flows differ in the way passes are applied.
I am personally against shunting the broadcast check; the very failure must signal the scheduler to skip that iteration variant (as not legit).

Cc @Hzfengsy

I believe that addressing this issue is crucial for maintaining support for the auto-scheduler. Did auto-scheduler support stop in v0.19.0?

The auto-scheduler (metaschedule/ansor) stays; the older autotune with the relay flow goes (it has been unmaintained for a while).
The auto-scheduler is commonly used by both relax and the old relay, but the way rules are applied in each flow may differ.

I suggest adapting your code to an equivalent modern relax flow; if this issue still pops up there, we must look at it.
Do you also experience this issue only with a more recent tvm version (assuming with the unmaintained relay)?
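
For reference, a rough sketch of such a relax-flow compile-and-run path, assuming the model is first exported to ONNX; the file name, input name, and input shape are placeholders, and exact pipeline steps may differ between TVM versions:

import numpy as np
import onnx
import tvm
from tvm import relax
from tvm.relax.frontend.onnx import from_onnx

# Hypothetical ONNX export of one of the models listed above.
onnx_model = onnx.load("googlenet.onnx")
mod = from_onnx(onnx_model, shape_dict={"input": (1, 3, 224, 224)})

# Lower high-level ops to TIR; newer relax.build pipelines may do this implicitly.
mod = relax.transform.LegalizeOps()(mod)

target = tvm.target.Target("llvm -mtriple=x86_64-pc-linux-gnu")
ex = relax.build(mod, target)

dev = tvm.cpu(0)
vm = relax.VirtualMachine(ex, dev)
data = tvm.nd.array(np.random.uniform(size=(1, 3, 224, 224)).astype("float32"), dev)
out = vm["main"](data)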

@thaisacs (Contributor Author) commented Feb 8, 2025

@cbalint13

Thanks. I'll test it on the relax flow.

@Hzfengsy (Member) commented:

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.

So I wonder where the root of the issue is. To be clear: TOPI or Ansor?
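
For reference, standard NumPy broadcasting rejects the same shape pair, which is the rule the TOPI check mirrors (a quick check, assuming NumPy >= 1.20 for broadcast_shapes):

import numpy as np

try:
    np.broadcast_shapes((1, 7, 7, 144), (1, 1, 1, 128))
except ValueError as err:
    # The trailing dims 144 and 128 differ and neither is 1.
    print(err)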

@thaisacs (Contributor Author) commented:

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.

So I wonder where the root of the issue is. To be clear: TOPI or Ansor?

@Hzfengsy

I think it's in the interaction of the two: auto-scheduler and TOPI. The auto-scheduler can find schedules for some layers, but not all. When attempting to compile the entire model, TOPI needs to deal with the case that the internal tensors have different dimensions. Currently, TOPI handles this by stopping the compilation process.

Note that, when a tensor's dimension is dynamic and cannot be determined at compile time, TOPI prematurely considers the shape of the output tensor.

@cbalint13 (Contributor) commented Feb 12, 2025

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.
So I wonder where the root of the issue is. To be clear: TOPI or Ansor?

@Hzfengsy

I think it's in the interaction of the two: auto-scheduler and TOPI. The auto-scheduler can find schedules for some layers, but not all. When attempting to compile the entire model, TOPI needs to deal with the case that the internal tensors have different dimensions. Currently, TOPI handles this by stopping the compilation process.

Note that, when a tensor's dimension is dynamic and cannot be determined at compile time, TOPI prematurely considers the shape of the output tensor.

@thaisacs ,

As a practical note, if the sample limit per layer is too low, metaschedule will fail to propose any valid sketches for some layers.
This can be seen in the sample provided here using the relax flow (with the attached logs as proof) if samples are lowered from 8000 to 1000.

If we "shunt" things like this boradcast, there will be more apperantly "valid" proposals so searching converges faster, but is not we want, we also want a "legit/valid" final form of the tuned model.

@thaisacs (Contributor Author) commented:

Thanks for the great discussion!

This issue occurred more frequently with GoogLeNet, MobileNetV2, MobileNetV3, ResNet-152, and InceptionV3 models.

What do you mean by "more frequently"? IIUC, if it's a TOPI issue, it will either always or never happen; it is not a probabilistic event.

InternalError: Check failed: (false) is false: Incompatible broadcast dims: 144 and 128 in: [1, 7, 7, 144] and [1, 1, 1, 128]

The error seems reasonable.
So I wonder where the root of the issue is. To be clear: TOPI or Ansor?

@Hzfengsy
I think it's in the interaction of the two: auto-scheduler and TOPI. The auto-scheduler can find schedules for some layers, but not all. When attempting to compile the entire model, TOPI needs to deal with the case that the internal tensors have different dimensions. Currently, TOPI handles this by stopping the compilation process.
Note that, when a tensor's dimension is dynamic and cannot be determined at compile time, TOPI prematurely considers the shape of the output tensor.

@thaisacs ,

As a practical note, if the sample limit per layer is too low, metaschedule will fail to propose any valid sketches for some layers. This can be seen in the sample provided here using the relax flow (with the attached logs as proof) if samples are lowered from 8000 to 1000.

If we "shunt" things like this boradcast, there will be more apperantly "valid" proposals so searching converges faster, but is not we want, we also want a "legit/valid" final form of the tuned model.

@cbalint13

I think that the broadcast shape function has no impact on the search. It is only used for the final compilation of the model.

Aren't invalid schedules removed by the evolutionary search? The searches I performed considered 1000 points per layer of the model. For example, in resnet_152 with 27 layers, the evolutionary search explored 26968 schedules.
In my tests, the broadcast shape function did not change the accuracy of the model.
The only model that had not-so-good accuracy was inception_v3, but that happens even without this change, when the broadcast function still raises an error.
