Add a validation pass #1758

tfogal · 2024-02-13T21:35:22Z

We should have a validation pass to ensure that the input program makes sense. For example, the following program:

from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1,-1], contiguity=[True,True], dtype=DataType.Float, is_cpu=False, stride_order=[1,0])
    T1 = fd.define_tensor(shape=[-1,-1], contiguity=[True,True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[1,0])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id1(fd)

import torch
inputs = [
    torch.randn((4,), dtype=torch.float32, device='cuda:0').as_strided((2, 2), (2, 1)),
    torch.randn((4,), dtype=torch.float32, device='cuda:0').as_strided((2, 2), (2, 1)),
]
fd.execute(inputs)

crashes when run with:

RuntimeError: Expected T1_g[ iS2{i3}, iS3{i4} ] to be bound to a tensor of dtype __bfloat, but got a tensor of dtype float

We should fail more gracefully:

We should not tear down the whole process when we fail; we should report an error back up instead.
The user can't do anything with references like iS2{i3}; they would want error messages at the level of their program.

Incomplete list of what we should consider in a validation pass:

Check operand types
Ensure support matches the GPU targeted: we can't create bf16 kernels on Volta hardware
Ensure support matches the GPU targeted: we can't create fp8 kernels on Ampere hardware
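
As a rough illustration of the checks listed above, here is a minimal sketch in plain Python. It does not use the nvFuser API: `TensorSpec`, `validate_inputs`, and the `MIN_COMPUTE_CAPABILITY` table are hypothetical names, dtypes are plain strings, and the capability thresholds (bf16 from 8.0, fp8 from 8.9) are assumptions made for the example. The point is only that errors are collected and phrased in terms of the user's program (input index and name) rather than internal identifiers.

```python
from dataclasses import dataclass

# Assumed minimum compute capability per dtype (illustrative values only).
MIN_COMPUTE_CAPABILITY = {
    "float32": (0, 0),
    "bfloat16": (8, 0),
    "float8_e4m3": (8, 9),
}


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical record of what the fusion definition declared for an input."""
    name: str
    dtype: str


def validate_inputs(declared, actual_dtypes, compute_capability):
    """Return a list of program-level error strings; an empty list means valid."""
    errors = []
    if len(declared) != len(actual_dtypes):
        errors.append(
            f"expected {len(declared)} inputs, got {len(actual_dtypes)}"
        )
        return errors
    for index, (spec, actual) in enumerate(zip(declared, actual_dtypes)):
        # Operand type check: the declared dtype must match the runtime tensor.
        if spec.dtype != actual:
            errors.append(
                f"input {index} ({spec.name}): declared dtype {spec.dtype}, "
                f"but got a tensor of dtype {actual}"
            )
        # Hardware check: the declared dtype must be supported by the target GPU.
        required = MIN_COMPUTE_CAPABILITY.get(spec.dtype)
        if required is not None and compute_capability < required:
            errors.append(
                f"input {index} ({spec.name}): dtype {spec.dtype} requires "
                f"compute capability {required[0]}.{required[1]} or newer"
            )
    return errors


if __name__ == "__main__":
    declared = [TensorSpec("T0", "float32"), TensorSpec("T1", "bfloat16")]
    for message in validate_inputs(declared, ["float32", "float32"], (7, 0)):
        print(message)
```

Returning a list instead of raising on the first problem lets the caller report every mismatch at once and decide how to surface them.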


naoyam · 2024-02-13T22:04:30Z

I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.

For the fp type support, that should be relatively trivial.

More generally, though, it's a hard problem to report a meaningful error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, would raising more specific exceptions help? What should the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?

kevinstephano · 2024-02-13T22:23:01Z

These are cases that are only identifiable at runtime. We tried to catch definition-time issues in Python, to make the errors better for the ones we could catch easily. We probably need to investigate how exceptions get translated to Python. Maybe this means looking into how people handle exceptions through bindings, or changing the exception type thrown from C++.

Another issue is that the Tensor identifiers between C++ and Python do not match up.

tfogal · 2024-02-13T22:44:47Z

> I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.

oops, yep! Thanks Naoya. Edited to fix.

tfogal · 2024-02-13T22:54:58Z

> More generally, though, it's a hard problem to report a meaningful error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, would raising more specific exceptions help? What should the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?

Yeah, it's definitely hard!

I think new types are useful insofar as a user wants to handle them differently. For example, in POSIX, EAGAIN and EACCES make sense as distinct errors because the user may reasonably want to retry on the first but give up on the second. In our case, it might then depend on whether the client has any recourse for a given error.
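
To make the POSIX analogy concrete, here is a small Python sketch. Python already maps `errno.EAGAIN` to `BlockingIOError` and `errno.EACCES` to `PermissionError` when an `OSError` is constructed, so a caller can treat the two differently. The helper name `call_with_retry` and its `max_attempts` parameter are made up for this example.

```python
import errno


def call_with_retry(operation, max_attempts=3):
    """Retry on EAGAIN (transient); give up immediately on EACCES (permanent)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except BlockingIOError:
            # EAGAIN: the caller has recourse, so try again unless out of attempts.
            if attempt == max_attempts:
                raise
        except PermissionError:
            # EACCES: retrying will not help, so propagate right away.
            raise


if __name__ == "__main__":
    def denied():
        raise OSError(errno.EACCES, "permission denied")

    try:
        call_with_retry(denied)
    except PermissionError as exc:
        print(f"gave up without retrying: {exc}")
```

A distinct exception type only pays off here because the two handlers do different things; if every error led to the same action, a single type with a good message would serve equally well.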

Tagging @nouiz as I know he's been having discussions on better error handling APIs as of late. What are your thoughts, Frédéric?

naoyam · 2024-02-14T01:37:26Z

I separated out the validation issue as #1760.

nouiz · 2024-02-14T16:22:29Z

I think the C/C++ code shouldn't use error codes anywhere. The important thing is to have a detailed error string and pass it around.
XLA uses absl::Status instead of exceptions. I like that, but the important thing is that the error string comes from where the error happens and that it is reported back to the user. Anything that does this is in a pretty good state.
This should trigger a Python exception in the end with that string.

It is hard for higher levels to add the correct detail when the error happens at a low level. And when they try to do this, it is a maintenance nightmare, as this can change with every release (and isn't always well documented).

So I would make sure the lowest place that detects the error creates a detailed error string and passes it up. An error code can be used to select the right Python exception type, but not more than that.
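
A minimal Python sketch of that idea follows, assuming a status object loosely modeled on the concept behind absl::Status. Everything here is hypothetical: `ErrorCode`, `Status`, and `raise_for_status` are illustrative names, not part of nvFuser or XLA. The message is written once at the point of failure and passed through unchanged; the code is used only to pick the Python exception type.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCode(Enum):
    OK = 0
    INVALID_ARGUMENT = 1
    UNSUPPORTED = 2
    INTERNAL = 3


@dataclass(frozen=True)
class Status:
    """Hypothetical result object: a code plus the detailed low-level message."""
    code: ErrorCode
    message: str = ""


# The code only selects the exception type; it never replaces the message.
_EXCEPTION_FOR_CODE = {
    ErrorCode.INVALID_ARGUMENT: ValueError,
    ErrorCode.UNSUPPORTED: NotImplementedError,
    ErrorCode.INTERNAL: RuntimeError,
}


def raise_for_status(status):
    """Raise a Python exception carrying the original message, or return None."""
    if status.code is ErrorCode.OK:
        return None
    exc_type = _EXCEPTION_FOR_CODE.get(status.code, RuntimeError)
    raise exc_type(status.message)


if __name__ == "__main__":
    failure = Status(
        ErrorCode.INVALID_ARGUMENT,
        "input 1 (T1): declared dtype bfloat16, but got a tensor of dtype float32",
    )
    try:
        raise_for_status(failure)
    except ValueError as exc:
        print(exc)
```

Because the higher level never rewrites the text, the message stays accurate even when the low-level code that produced it changes between releases.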

naoyam · 2024-02-21T20:25:52Z

As for the original issue, @jacobhinkle improved the error message in #1784. This is where we validate required device properties:

https://github.com/NVIDIA/Fuser/blob/main/csrc/executor.cpp#L394-L405

The more general issue of better reporting from nvfuser to a higher level framework remains.

kevinstephano · 2024-03-01T04:43:44Z

I will note that I opened an issue on the Python API: we might want to look into validating some of these attributes at the definition level so the error messages are nicer. See #1856.

tfogal assigned naoyam Feb 13, 2024

naoyam mentioned this issue Feb 14, 2024

Validate use of bf16 (and other relatively new hardware features) #1760

Closed

naoyam added the ux Improving user experience label Feb 14, 2024

kevinstephano mentioned this issue Feb 29, 2024

Explore whether the FusionDefinition's information can be leveraged to error check Tensor attribute mismatches during FusionDefinition.execute() #1856

Closed

tfogal mentioned this issue Aug 5, 2024

Improve error message with mismatched valtype/dtype #2739

Merged
