Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a validation pass #1758

Open
3 tasks
tfogal opened this issue Feb 13, 2024 · 8 comments
Open
3 tasks

Add a validation pass #1758

tfogal opened this issue Feb 13, 2024 · 8 comments
Assignees
Labels
ux Improving user experience

Comments

@tfogal
Copy link
Collaborator

tfogal commented Feb 13, 2024

We should have a validation pass to ensure that the input program makes sense. For example, the following program:

from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1,-1], contiguity=[True,True], dtype=DataType.Float, is_cpu=False, stride_order=[1,0])
    T1 = fd.define_tensor(shape=[-1,-1], contiguity=[True,True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[1,0])
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id1(fd)

import torch
inputs = [
    torch.randn((4,), dtype=torch.float32, device='cuda:0').as_strided((2, 2), (2, 1)),
    torch.randn((4,), dtype=torch.float32, device='cuda:0').as_strided((2, 2), (2, 1)),
]   
fd.execute(inputs)

crashes when run with:

RuntimeError: Expected T1_g[ iS2{i3}, iS3{i4} ] to be bound to a tensor of dtype __bfloat, but got a tensor of dtype float

We should fail more gracefully:

  • we should not tear down the whole process when we fail; report an error back up instead
  • the user can't do anything with references like iS2{i3}; error messages more at the level of the program are what they would want

Incomplete list of what we should consider in a validation pass:

  • Check operand types
  • Ensure support matches the GPU targeted: we can't create bf16 kernels on volta hardware
  • Ensure support matches the GPU targeted: we can't create fp8 kernels on ampere hardware
@naoyam
Copy link
Collaborator

naoyam commented Feb 13, 2024

I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.

For the fp type support, that should be relatively trivial.

More generally, though, it's a hard problem to report a meaning error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, should raising more specific exceptions help? What should be the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?

@kevinstephano
Copy link
Collaborator

These are cases that are only identifiable at runtime as we tried to catch definition time issues in Python to make the errors better for the ones we could, easily. We probably need to investigate potentially how exceptions get translated to python. Maybe this is looking into how people do exceptions through bindings or changing the exception error type from C++.

Another issue is that the Tensor identifiers between C++ and Python do not match up.

@tfogal
Copy link
Collaborator Author

tfogal commented Feb 13, 2024

I think we automatically insert a necessary cast operation. The error in this case is about the T1 input being float32.

oops, yep! Thanks Naoya. Edited to fix.

@tfogal
Copy link
Collaborator Author

tfogal commented Feb 13, 2024

More generally, though, it's a hard problem to report a meaning error message in Python from a C++ error. Is anybody familiar with general design guidelines? For example, should raising more specific exceptions help? What should be the protocol of error reporting look like between C++ and Python? Currently, the C++ side just uses NVF_ERROR, which throws an nvfuser::nvfError exception. Maybe we should have some more subclasses of this exception, like nvfSchedulingError, nvfLoweringError?

Yeah, it's definitely hard!

I think new types are useful insofar as a user wants to handle them differently. For example in POSIX, EAGAIN vs. EACCES make sense as distinct errors because the user may reasonably want to retry in the first error, vs. give up in the second. In our case, it might then depend on whether the client has any recourse as to what to do about it.

Tagging @nouiz as I know he's been having discussions on better error handling APIs as of late. What are your thoughts, Frédéric?

@naoyam
Copy link
Collaborator

naoyam commented Feb 14, 2024

I separated out the validation issue as #1760.

@naoyam naoyam added the ux Improving user experience label Feb 14, 2024
@nouiz
Copy link

nouiz commented Feb 14, 2024

I think the C/C++ code shouldn't use error code anywhere. The important is to have a detailed error string and pass it around.
XLA use absl::Status instead of exception. I like that, but the important is error string from where the error happen and making sure it is reported back to the user. Anything that does this is a pretty good state.
This should trigger a Python exception in the end with that string.

It is hard for higher level to add the correct detail when the error happen at a low level. And when they try to do this, it is a maintenance night-mare as this can change at all release (and isn't always well documented).

So I would go to make sure the lowest place that detect the error create detailed error string and make sure it is passed above. Error code can be used to select the right python exception type. But not more then that.

@naoyam
Copy link
Collaborator

naoyam commented Feb 21, 2024

As for the original issue, @jacobhinkle improved the error message in #1784. This is where we validate required device properties:

https://github.com/NVIDIA/Fuser/blob/main/csrc/executor.cpp#L394-L405

The more general issue of better reporting from nvfuser to a higher level framework remains.

@kevinstephano
Copy link
Collaborator

I will note I opened an issue on the Python API, we might want to look into validating some of these attributes at the definition level so the error messages are nicer. #1856

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ux Improving user experience
Projects
None yet
Development

No branches or pull requests

4 participants