
[FEA]: ncu metrics within cuda-python #681

Open
@capybara-club


Is this a duplicate?

Area

General cuda-python

Is your feature request related to a problem? Please describe.

I would like to be able to gather ncu (Nsight Compute) metrics, such as bank conflicts, from within cuda-python. It's now really easy to do a grid search, e.g. over CTA sizes in different dimensions, and to JIT-compile and benchmark each variant. It would be amazing if it were possible to launch each kernel and gather bank-conflict metrics without leaving the script.

Describe the solution you'd like

Maybe something configured similarly to cuda.core.experimental.Program and cuda.core.experimental.ProgramOptions, where you could specify metrics and regex patterns for the launches made within a context, perhaps using Python's with ... syntax or explicit start()/stop() calls.

Describe alternatives you've considered

For the time being, I'll probably end up writing a script with a bench mode that JIT-compiles the Cartesian product of a set of parameters and benchmarks each variant. The script would then launch itself as a subprocess in an ncu mode, using iteration counts and problem sizes that are practical under ncu. I'll probably have ncu write its results to CSV and then parse that within the parent script.
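A minimal sketch of that workaround, under stated assumptions: the `--ncu-child` flag, the helper names, and the way parameters are passed as JSON are all hypothetical, and the metric name is one example that should be checked against `ncu --query-metrics` for the target GPU. It relies on the Nsight Compute CLI options `--csv`, `--metrics`, and `--kernel-name regex:...`, and assumes the CSV header starts with an `"ID"` column and includes `Kernel Name`, `Metric Name`, and `Metric Value`.

```python
import csv
import io
import itertools
import json
import os
import subprocess
import sys

# Assumed example metric (shared-memory load bank conflicts); verify on your GPU.
METRICS = ["l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum"]


def param_grid(**axes):
    """Yield one dict per point in the Cartesian product of the given axes."""
    names = list(axes)
    for values in itertools.product(*axes.values()):
        yield dict(zip(names, values))


def build_ncu_command(script_argv, metrics, kernel_regex=None):
    """Build an ncu command line that re-runs a Python script under the profiler."""
    cmd = ["ncu", "--csv", "--metrics", ",".join(metrics)]
    if kernel_regex:
        cmd += ["--kernel-name", f"regex:{kernel_regex}"]
    return cmd + [sys.executable, *script_argv]


def parse_ncu_csv(text):
    """Extract (launch id, kernel, metric, value) rows from ncu's CSV output.

    ncu status lines (==PROF== ...) and anything the child prints before the
    CSV header are skipped. Values are kept as strings because ncu may format
    them with thousands separators.
    """
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if line.startswith('"ID"')), None)
    if start is None:
        return []
    rows = []
    for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
        if row.get("Metric Name") is None:
            continue  # not a complete CSV record
        rows.append((row["ID"], row["Kernel Name"], row["Metric Name"], row["Metric Value"]))
    return rows


def profile_config(params, kernel_regex=None):
    """Re-launch this script under ncu for one parameter set and parse the result."""
    # "--ncu-child" is a hypothetical flag the script itself would have to handle.
    child_argv = [os.path.abspath(sys.argv[0]), "--ncu-child", json.dumps(params)]
    proc = subprocess.run(
        build_ncu_command(child_argv, METRICS, kernel_regex),
        capture_output=True, text=True, check=True,
    )
    return parse_ncu_csv(proc.stdout)
```

In bench mode the script would loop over something like `param_grid(cta_m=[64, 128], cta_n=[32, 64])` and call `profile_config(params)` for the configurations worth profiling. The child process should avoid printing to stdout after the CSV header appears, since that output would be interleaved with the CSV being parsed.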

Additional context

I've been working with CUTLASS/CuTe, using a lightweight Python database to cache kernels JIT-compiled with cuda-python, and it's been a really nice dev process. I'd prefer this ecosystem even if I were developing the kernels for native C/C++. I can already shmoo over CTA sizes, watch for register spills, and benchmark. If ncu could be integrated, it would be easy to shmoo over MMA tiling permutations, swizzle patterns, etc.


Labels: triage (Needs the team's attention)
