Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
General cuda-python
Is your feature request related to a problem? Please describe.
I would like to be able to gather ncu metrics, such as shared-memory bank conflicts, from within cuda-python. It is now really easy to do a grid search, e.g. over CTA sizes in different dimensions, JIT compile each variant, and benchmark it. It would be amazing if it were also possible to launch each kernel and gather bank conflict metrics without leaving the script.
Describe the solution you'd like
Maybe something configured similarly to cuda.core.experimental.Program and cuda.core.experimental.ProgramOptions, where you could specify metrics and regex patterns for launches within a context, perhaps using `with ...` syntax in Python or explicit start() / stop() calls.
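A hypothetical sketch of what that could look like; `Profiler`, `ProfilerOptions`, `results()`, and the metric name are assumptions for illustration, not existing cuda-python APIs:

```python
# Hypothetical API sketch -- Profiler / ProfilerOptions do not exist in cuda-python today.
# The metric name is one example; the exact ncu metric name can vary by architecture.
options = ProfilerOptions(
    metrics=["l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum"],
    kernel_regex=".*gemm.*",  # only collect metrics for launches matching this pattern
)

# Context-manager flavour: every launch inside the block is profiled.
with Profiler(options) as prof:
    launch(stream, config, kernel, *kernel_args)
results = prof.results()  # e.g. {kernel_name: {metric_name: value}}

# Or an explicit start()/stop() flavour.
prof = Profiler(options)
prof.start()
launch(stream, config, kernel, *kernel_args)
prof.stop()
```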
Describe alternatives you've considered
For the time being I'll probably just write a script with a bench mode: it JIT compiles the Cartesian product of a bunch of params, benchmarks them, and then launches itself as a subprocess under ncu, with iteration counts and problem sizes small enough to be compatible with ncu (see the sketch below). I'll write the ncu output to CSV and parse it back in the original script.
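A minimal sketch of that workaround, assuming the script accepts a hypothetical --bench flag and using one example bank-conflict metric name (the exact name may differ by GPU architecture):

```python
import csv
import subprocess
import sys

# Example metric name; check `ncu --query-metrics` for your architecture.
METRIC = "l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum"

def run_under_ncu(out_csv="ncu_results.csv"):
    """Re-launch this script in bench mode under ncu and return the parsed CSV rows."""
    cmd = [
        "ncu", "--csv", "--metrics", METRIC, "--log-file", out_csv,
        sys.executable, __file__, "--bench",
    ]
    subprocess.run(cmd, check=True)
    with open(out_csv, newline="") as f:
        # ncu emits one CSV row per kernel launch per metric; skip any non-CSV
        # preamble lines that may precede the quoted header row.
        lines = [ln for ln in f if ln.startswith('"')]
    return list(csv.DictReader(lines))
```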
Additional context
I've been working with CUTLASS/CuTe, using a lightweight Python database to cache JIT-compiled kernels with cuda-python, and it's been a really nice dev process. I'd prefer this ecosystem even if I were developing the kernels for native C/C++. I can shmoo over CTA sizes, watch for register spills, and benchmark; if ncu could be integrated it would be just as easy to shmoo over MMA tiling permutations, swizzle patterns, etc.