GandalfTea/gperf
gpu performance monitoring

Launching analysis:
  * daemon on the instance catching all PyTorch threads and deploying metrics to a kernel module
    -- tracks core metrics with near-zero overhead, informs the user if big problems are found
  * 'apx analyze <file.py> [argv]' attaching to the pid and dumping a 'data.[apx]' file, followed by 'apx report' to launch the CLI report
    -- more detailed, might run kernels multiple times to benchmark, ends with 'apx report' for the full report


daemon nvperfd:
  * spawned either by the kernel or by '<cmd> init', with comm 'nvperfd'
  * 2 open sockets: the netlink connector API and the log socket
  * logs to /var/log/nvperfd/nvperfd.log under LOCAL3.*
  * reads the tracked nvidia metrics from /home/<user>/.config/nvperf/nvperfd.conf
    -- users modify the tracked metrics through the repl and the daemon reads the file when launching the trackers
  * uses netlink sockets to monitor PROC_EVENT_* when new processes are launched (see the sketch after this list):
    -- track every python3 process (pytorch renames the main process to pt_main_thread after ~0.5s)
    -- track its threads and check for nvidia controllers
    -- if found, deploy trackers
    -- keep tracking the pid until PROC_EVENT_EXIT is received, then disable the trackers and read the data
    -- dump the data to the state file folder
    -- display simple tracking data, tell the user to review it with 'apx report' or similar
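
A minimal sketch of the PROC_EVENT_* monitoring loop described above, assuming a plain proc-connector netlink socket and syslog routed to LOCAL3. The comm filtering for python3/pt_main_thread and the actual tracker deployment are left as placeholder comments, and the loop needs root or CAP_NET_ADMIN:

```c
/* Subscribe to kernel process events over the netlink connector and log exec/exit. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <syslog.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

int main(void)
{
    openlog("nvperfd", LOG_PID, LOG_LOCAL3);   /* routed via LOCAL3.* to the nvperfd log */

    int sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    struct sockaddr_nl sa = { .nl_family = AF_NETLINK,
                              .nl_groups = CN_IDX_PROC, .nl_pid = getpid() };
    bind(sk, (struct sockaddr *)&sa, sizeof(sa));

    /* Ask the kernel to start multicasting process events to this socket. */
    struct {
        struct nlmsghdr nl;
        struct cn_msg cn;
        enum proc_cn_mcast_op op;
    } req;
    memset(&req, 0, sizeof(req));
    req.nl.nlmsg_len  = sizeof(req);
    req.nl.nlmsg_type = NLMSG_DONE;
    req.nl.nlmsg_pid  = getpid();
    req.cn.id.idx     = CN_IDX_PROC;
    req.cn.id.val     = CN_VAL_PROC;
    req.cn.len        = sizeof(req.op);
    req.op            = PROC_CN_MCAST_LISTEN;
    send(sk, &req, sizeof(req), 0);

    char buf[4096];
    for (;;) {
        ssize_t n = recv(sk, buf, sizeof(buf), 0);
        if (n <= 0) break;
        struct nlmsghdr *nl   = (struct nlmsghdr *)buf;
        struct cn_msg *cn     = (struct cn_msg *)NLMSG_DATA(nl);
        struct proc_event *ev = (struct proc_event *)cn->data;

        switch (ev->what) {
        case PROC_EVENT_EXEC:   /* candidate python3 / pt_main_thread: deploy trackers here */
            syslog(LOG_INFO, "exec pid=%d", ev->event_data.exec.process_pid);
            break;
        case PROC_EVENT_EXIT:   /* disable trackers, read out and dump the data here */
            syslog(LOG_INFO, "exit pid=%d", ev->event_data.exit.process_pid);
            break;
        default:
            break;
        }
    }
    closelog();
    return 0;
}
```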

nvctl: nvidia CUPTI API interface
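
A sketch of what nvctl could look like on top of the CUPTI Activity API: enable memcpy and kernel activity records and drain them from buffer callbacks. The CUPTI calls are real; nvctl_start, the buffer size, and the printing are placeholder choices:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

static void CUPTIAPI buf_requested(uint8_t **buf, size_t *size, size_t *maxRecords)
{
    *buf = malloc(BUF_SIZE);
    *size = BUF_SIZE;
    *maxRecords = 0;                  /* 0 = fill the buffer with as many records as fit */
}

static void CUPTIAPI buf_completed(CUcontext ctx, uint32_t streamId,
                                   uint8_t *buf, size_t size, size_t validSize)
{
    CUpti_Activity *rec = NULL;
    while (cuptiActivityGetNextRecord(buf, validSize, &rec) == CUPTI_SUCCESS) {
        if (rec->kind == CUPTI_ACTIVITY_KIND_MEMCPY) {
            CUpti_ActivityMemcpy *m = (CUpti_ActivityMemcpy *)rec;
            printf("memcpy %llu bytes, %llu ns\n",
                   (unsigned long long)m->bytes,
                   (unsigned long long)(m->end - m->start));
        }
    }
    free(buf);
}

void nvctl_start(void)                /* hypothetical entry point */
{
    cuptiActivityRegisterCallbacks(buf_requested, buf_completed);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);            /* DRAM<->VRAM copies */
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL); /* kernel launches */
}
```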


Things to track:
  i. Memory movement from DRAM to VRAM (see the sketch after this item):
    * check whether host memory is page-locked (presence of cudaMallocHost)
      -- prompt users to use torch.Tensor.pin_memory, pin_memory=True (DataLoader) or the mlock syscall
      -- https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
    * measure the extra latency of staging through a pinned bounce buffer when host memory is pageable
    * check data spills and the subsequent movements into VRAM

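A small CUDA runtime sketch of the pageable vs pinned difference item i wants to surface, along the lines of the linked NVIDIA post; the 64 MiB size and the event-based timing are arbitrary choices and error checking is omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 64 << 20;                 /* 64 MiB transfer */
    float ms_pageable, ms_pinned;
    void *dev, *pageable = malloc(n), *pinned = NULL;

    cudaMalloc(&dev, n);
    cudaMallocHost(&pinned, n);                /* the page-locked allocation gperf looks for */

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    cudaMemcpy(dev, pageable, n, cudaMemcpyHostToDevice);  /* staged through a pinned bounce buffer */
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_pageable, t0, t1);

    cudaEventRecord(t0);
    cudaMemcpy(dev, pinned, n, cudaMemcpyHostToDevice);    /* direct DMA from page-locked memory */
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_pinned, t0, t1);

    printf("pageable: %.2f ms, pinned: %.2f ms\n", ms_pageable, ms_pinned);

    cudaFreeHost(pinned); cudaFree(dev); free(pageable);
    return 0;
}
```
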
 ii. VRAM and Cache Management (see the sketch after this item):
    (
      * the L2 cache can be segmented to set aside space for persistent data. Verify whether this is used and what the set-aside size is (disabled in Multi-Instance GPU mode).
        -- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#l2-cache-set-aside-for-persisting-accesses
      * users can set up a custom L2 data persistence policy per stream.
        -- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#l2-policy-for-persisting-accesses
    )
    * VRAM to shared cache movements, evictions and stalls
    * track cache misses and update latency

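A sketch of the two L2 persistence knobs item ii points at, using the CUDA runtime calls from the linked programming-guide sections; configure_l2_persistence, the 50% set-aside, and the 0.6 hit ratio are arbitrary placeholders:

```c
#include <string.h>
#include <cuda_runtime.h>

void configure_l2_persistence(cudaStream_t stream, void *data, size_t num_bytes)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    /* Carve out a portion of L2 for persisting accesses. */
    size_t set_aside = prop.persistingL2CacheMaxSize / 2;
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside);

    /* Mark [data, data + num_bytes) as persisting for work launched into this stream. */
    cudaStreamAttrValue attr;
    memset(&attr, 0, sizeof(attr));
    attr.accessPolicyWindow.base_ptr  = data;
    attr.accessPolicyWindow.num_bytes = num_bytes;   /* must not exceed accessPolicyMaxWindowSize */
    attr.accessPolicyWindow.hitRatio  = 0.6f;        /* fraction of the window treated as persisting */
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```
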
iii. Core utilization and idle time
 iv. Application-level parallelisation
  v. etc.
