GandalfTea/gperf
gpu performance monitoring

Launching analysis:
  * daemon on the instance catching all PyTorch threads and deploying metrics to a kernel module
    -- tracks core metrics with near-zero overhead, informs the user if big problems are found
  * 'apx analyze <file.py> [argv]' attaching to the pid and dumping a 'data.[apx]' file, followed by 'apx report' to launch the CLI report
    -- more detailed, might run kernels multiple times to benchmark, ends with 'apx report' for the full report


daemon nvperfd:
  * spawned either by the kernel or by '<cmd> init', with comm 'nvperfd'
  * 2 open sockets: the netlink connector API and the log socket
  * logs to /var/log/nvperfd/nvperfd.log under LOCAL3.*
  * reads the tracked nvidia metrics from /home/<user>/.config/nvperf/nvperfd.conf
    -- users modify the tracked metrics through the repl and the daemon reads the file when launching the trackers
  * uses netlink sockets to monitor PROC_EVENT_* when new processes are launched (see the sketch after this list):
    -- track every python3 process (pytorch renames the main process to pt_main_thread after ~0.5s)
    -- track its threads and check for nvidia controllers
    -- if found, deploy trackers
    -- keep tracking the pid until PROC_EVENT_EXIT is received, then disable the trackers and read the data
    -- dump the data to the state file folder
    -- display simple tracking data, tell the user to review it with 'apx report' or similar
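
A minimal sketch of the PROC_EVENT_* monitoring loop described above, assuming a plain proc-connector netlink socket and syslog routed to LOCAL3. The comm filtering for python3/pt_main_thread and the actual tracker deployment are left as placeholder comments, and the loop needs root or CAP_NET_ADMIN:

```c
/* Subscribe to kernel process events over the netlink connector and log exec/exit. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <syslog.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

int main(void)
{
    openlog("nvperfd", LOG_PID, LOG_LOCAL3);   /* routed via LOCAL3.* to the nvperfd log */

    int sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    struct sockaddr_nl sa = { .nl_family = AF_NETLINK,
                              .nl_groups = CN_IDX_PROC, .nl_pid = getpid() };
    bind(sk, (struct sockaddr *)&sa, sizeof(sa));

    /* Ask the kernel to start multicasting process events to this socket. */
    struct {
        struct nlmsghdr nl;
        struct cn_msg cn;
        enum proc_cn_mcast_op op;
    } req;
    memset(&req, 0, sizeof(req));
    req.nl.nlmsg_len  = sizeof(req);
    req.nl.nlmsg_type = NLMSG_DONE;
    req.nl.nlmsg_pid  = getpid();
    req.cn.id.idx     = CN_IDX_PROC;
    req.cn.id.val     = CN_VAL_PROC;
    req.cn.len        = sizeof(req.op);
    req.op            = PROC_CN_MCAST_LISTEN;
    send(sk, &req, sizeof(req), 0);

    char buf[4096];
    for (;;) {
        ssize_t n = recv(sk, buf, sizeof(buf), 0);
        if (n <= 0) break;
        struct nlmsghdr *nl   = (struct nlmsghdr *)buf;
        struct cn_msg *cn     = (struct cn_msg *)NLMSG_DATA(nl);
        struct proc_event *ev = (struct proc_event *)cn->data;

        switch (ev->what) {
        case PROC_EVENT_EXEC:   /* candidate python3 / pt_main_thread: deploy trackers here */
            syslog(LOG_INFO, "exec pid=%d", ev->event_data.exec.process_pid);
            break;
        case PROC_EVENT_EXIT:   /* disable trackers, read out and dump the data here */
            syslog(LOG_INFO, "exit pid=%d", ev->event_data.exit.process_pid);
            break;
        default:
            break;
        }
    }
    closelog();
    return 0;
}
```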

nvctl: nvidia CUPTI API interface
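
A sketch of what nvctl could look like on top of the CUPTI Activity API: enable memcpy and kernel activity records and drain them from buffer callbacks. The CUPTI calls are real; nvctl_start, the buffer size, and the printing are placeholder choices:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

static void CUPTIAPI buf_requested(uint8_t **buf, size_t *size, size_t *maxRecords)
{
    *buf = malloc(BUF_SIZE);
    *size = BUF_SIZE;
    *maxRecords = 0;                  /* 0 = fill the buffer with as many records as fit */
}

static void CUPTIAPI buf_completed(CUcontext ctx, uint32_t streamId,
                                   uint8_t *buf, size_t size, size_t validSize)
{
    CUpti_Activity *rec = NULL;
    while (cuptiActivityGetNextRecord(buf, validSize, &rec) == CUPTI_SUCCESS) {
        if (rec->kind == CUPTI_ACTIVITY_KIND_MEMCPY) {
            CUpti_ActivityMemcpy *m = (CUpti_ActivityMemcpy *)rec;
            printf("memcpy %llu bytes, %llu ns\n",
                   (unsigned long long)m->bytes,
                   (unsigned long long)(m->end - m->start));
        }
    }
    free(buf);
}

void nvctl_start(void)                /* hypothetical entry point */
{
    cuptiActivityRegisterCallbacks(buf_requested, buf_completed);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);            /* DRAM<->VRAM copies */
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL); /* kernel launches */
}
```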


Things to track:
  i. Memory movement from DRAM to VRAM (see the sketch after this item):
    * check whether host memory is page-locked (presence of cudaMallocHost)
      -- prompt users to use torch.Tensor.pin_memory, pin_memory=True (DataLoader) or the mlock syscall
      -- https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
    * measure the extra latency of staging through a pinned bounce buffer when host memory is pageable
    * check data spills and the subsequent movements into VRAM

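A small CUDA runtime sketch of the pageable vs pinned difference item i wants to surface, along the lines of the linked NVIDIA post; the 64 MiB size and the event-based timing are arbitrary choices and error checking is omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 64 << 20;                 /* 64 MiB transfer */
    float ms_pageable, ms_pinned;
    void *dev, *pageable = malloc(n), *pinned = NULL;

    cudaMalloc(&dev, n);
    cudaMallocHost(&pinned, n);                /* the page-locked allocation gperf looks for */

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    cudaMemcpy(dev, pageable, n, cudaMemcpyHostToDevice);  /* staged through a pinned bounce buffer */
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_pageable, t0, t1);

    cudaEventRecord(t0);
    cudaMemcpy(dev, pinned, n, cudaMemcpyHostToDevice);    /* direct DMA from page-locked memory */
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_pinned, t0, t1);

    printf("pageable: %.2f ms, pinned: %.2f ms\n", ms_pageable, ms_pinned);

    cudaFreeHost(pinned); cudaFree(dev); free(pageable);
    return 0;
}
```
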
 ii. VRAM and Cache Management (see the sketch after this item):
    (
      * the L2 cache can be segmented to set aside space for persistent data. Verify whether this is used and what the set-aside size is (disabled in Multi-Instance GPU mode).
        -- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#l2-cache-set-aside-for-persisting-accesses
      * users can set up a custom L2 data persistence policy per stream.
        -- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#l2-policy-for-persisting-accesses
    )
    * VRAM to shared cache movements, evictions and stalls
    * track cache misses and update latency

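A sketch of the two L2 persistence knobs item ii points at, using the CUDA runtime calls from the linked programming-guide sections; configure_l2_persistence, the 50% set-aside, and the 0.6 hit ratio are arbitrary placeholders:

```c
#include <string.h>
#include <cuda_runtime.h>

void configure_l2_persistence(cudaStream_t stream, void *data, size_t num_bytes)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    /* Carve out a portion of L2 for persisting accesses. */
    size_t set_aside = prop.persistingL2CacheMaxSize / 2;
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside);

    /* Mark [data, data + num_bytes) as persisting for work launched into this stream. */
    cudaStreamAttrValue attr;
    memset(&attr, 0, sizeof(attr));
    attr.accessPolicyWindow.base_ptr  = data;
    attr.accessPolicyWindow.num_bytes = num_bytes;   /* must not exceed accessPolicyMaxWindowSize */
    attr.accessPolicyWindow.hitRatio  = 0.6f;        /* fraction of the window treated as persisting */
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```
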
iii. Core utilization and idle time
 iv. Application-level parallelisation
  v. etc.
