Port CUDA Kernels #8
Hi! I was thinking of porting Substation after reading the paper, but I have a question about how you want the CUDA kernels integrated. Substation optimizes by generating CUDA files specific to the dimensions of each kernel (from what I understand from looking at https://github.com/spcl/substation/blob/master/pytorch_module/test_softmax.py), so it basically produces CUDA code specialized for each function, which is faster but can get messy. Bitsandbytes instead loads prebuilt .so files from a given location, and DeepSpeed, I think, builds the C++/CUDA extensions during pip install, which is a bit slower but more general. There might be more ways to do it, but which way do you think will work best for you?
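For illustration, the bitsandbytes-style route is roughly just loading a precompiled shared library at runtime; a minimal sketch, assuming a hypothetical library path and exported symbol (neither exists in this repo):

```python
# Sketch of loading a prebuilt CUDA kernel library (bitsandbytes-style).
# The library name and the exported symbol are hypothetical placeholders.
import ctypes
from pathlib import Path

LIB_PATH = Path(__file__).parent / "libcustom_kernels.so"  # hypothetical path

def load_kernel_lib(path: Path = LIB_PATH) -> ctypes.CDLL:
    """Load a precompiled CUDA kernel library and return a handle to it."""
    if not path.exists():
        raise FileNotFoundError(f"Expected a prebuilt kernel library at {path}")
    return ctypes.CDLL(str(path))

# lib = load_kernel_lib()
# lib.fused_softmax_fp32(...)  # hypothetical exported C symbol
```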
Just checked ColossalAI; it seems like they have op_builders that they use to build certain CUDA libraries.
@isamu-isozaki This is a good idea. What are the pros and cons of Substation? What do you think we should use? If there are some operations that Substation generates better kernels for, we could use it just for those. Also, for manually porting kernels, I think we should do something like the sketch below.
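A minimal sketch of one possible structure, assuming the kernel is JIT-compiled with torch.utils.cpp_extension.load and wrapped in a torch.autograd.Function (the source files and bound functions below are illustrative names, not files in this repo):

```python
# Illustrative sketch: wrap a JIT-compiled CUDA kernel in an autograd.Function.
import torch
from torch.utils.cpp_extension import load

# "fused_softmax.cpp" / "fused_softmax_kernel.cu" are hypothetical sources.
_ext = load(
    name="fused_softmax",
    sources=["fused_softmax.cpp", "fused_softmax_kernel.cu"],
    verbose=True,
)

class FusedSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor) -> torch.Tensor:
        out = _ext.forward(x)              # hypothetical C++ binding
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        (out,) = ctx.saved_tensors
        return _ext.backward(grad_output, out)  # hypothetical C++ binding
```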
@xrsrke I think Substation's method is generally faster, but it requires generating a new CUDA file for each possible tensor shape. So the main disadvantage is that it isn't clean (my guess is that even just changing the batch size would need a new CUDA script if we copied that approach directly). The way you're doing it is similar to ColossalAI's and DeepSpeed's versions, which we can definitely do. I do remember that setting up ColossalAI is pretty troublesome compared to, say, DeepSpeed; I'm not sure why, but we can probably cross that bridge when we get there. This approach is more general but might be slightly slower than Substation. I think we can start with this approach and, if we want, extend to Substation later and build kernels specific to particular input dimensions.
Do you think this makes sense? I can check out Megatron-LM's way etc. if you want.
@isamu-isozaki Could you try to benchmark the two approaches? Try fusing a softmax using Substation, then compare it against a manually written kernel (a rough timing sketch follows below). (Also, it could be that some operations are performed better by Substation, while others are more efficient when written manually; we should take this into account while benchmarking.) Or maybe we should leave this as an experiment for later and, for now, just port these kernels. Also, I've just added GPT-NeoX's kernel to the issue above.
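One way the benchmark could look, a minimal sketch using CUDA events; `fused_softmax` is a placeholder for whichever implementation is being compared (Substation-generated or manually ported), not an existing function:

```python
# Rough benchmarking sketch: average per-call latency via CUDA events.
import torch
import torch.nn.functional as F

def benchmark(fn, x, warmup=10, iters=100):
    """Return the average runtime of fn(x) in milliseconds."""
    for _ in range(warmup):
        fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# x = torch.randn(32, 16, 1024, 1024, device="cuda", dtype=torch.float16)
# print("eager:", benchmark(lambda t: F.softmax(t, dim=-1), x))
# print("fused:", benchmark(fused_softmax, x))  # hypothetical fused kernel
```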
Check out the following open source projects and propose which CUDA kernels we should port. Then write a kernel builder that takes a kernel name and loads it (see the sketch after the kernel list below).
Implementation
APIs
TODOs
FusedScaleMaskSoftmax [link]
MixedFusedLayerNorm [link]
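One possible shape for the kernel builder mentioned above, in the spirit of DeepSpeed's and ColossalAI's op builders: a registry keyed by kernel name, JIT-compiled on first use. The registry entries and source paths are hypothetical; only torch.utils.cpp_extension.load is a real API.

```python
# Sketch of a name-keyed kernel builder with JIT compilation and caching.
from functools import lru_cache
from torch.utils.cpp_extension import load

# Hypothetical mapping from kernel name to its C++/CUDA sources.
_KERNEL_REGISTRY = {
    "fused_scale_mask_softmax": [
        "kernels/fused_scale_mask_softmax.cpp",
        "kernels/fused_scale_mask_softmax_kernel.cu",
    ],
    "mixed_fused_layer_norm": [
        "kernels/mixed_fused_layer_norm.cpp",
        "kernels/mixed_fused_layer_norm_kernel.cu",
    ],
}

@lru_cache(maxsize=None)
def build_kernel(name: str):
    """JIT-compile (or reuse from the build cache) the kernel with the given name."""
    if name not in _KERNEL_REGISTRY:
        raise KeyError(f"Unknown kernel: {name}. Available: {list(_KERNEL_REGISTRY)}")
    return load(name=name, sources=_KERNEL_REGISTRY[name], verbose=False)

# softmax_ext = build_kernel("fused_scale_mask_softmax")
```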