Task03 Denis Sokolov ITMO #1062

Closed
DenChika wants to merge 3 commits into GPGPUCourse:task03 from DenChika:task03

DenChika commented Mar 3, 2026

Local output

$ ./main_matrix_transpose
Found 2 GPUs in 0.216854 sec (OpenCL: 0.145116 sec, Vulkan: 0.0713244 sec)
Available devices:
  Device #0: API: OpenCL. GPU. AMD Radeon(TM) Graphics (gfx902). Free memory: 3069/3137 Mb.
  Device #1: API: OpenCL. CPU. AMD Ryzen 5 5500U with Radeon Graphics         . Intel(R) Corporation. Total memory: 7514 Mb.
Using device #0: API: OpenCL. GPU. AMD Radeon(TM) Graphics (gfx902). Free memory: 3069/3137 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 0.272752 seconds
algorithm times (in seconds) - 10 values (min=0.170015 10%=0.170539 median=0.175083 90%=0.593364 max=0.593364)
median effective algorithm bandwidth: 5.71158 GB/s                                                            
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.438939 seconds
algorithm times (in seconds) - 10 values (min=0.155474 10%=0.157204 median=0.162531 90%=0.74197 max=0.74197)
median effective algorithm bandwidth: 6.15268 GB/s

$ ./main_matrix_multiply
Found 2 GPUs in 0.212095 sec (OpenCL: 0.151997 sec, Vulkan: 0.0597343 sec)    
Available devices:                                                            
  Device #0: API: OpenCL. GPU. AMD Radeon(TM) Graphics (gfx902). Free memory: 3069/3137 Mb.                                 
  Device #1: API: OpenCL. CPU. AMD Ryzen 5 5500U with Radeon Graphics         . Intel(R) Corporation. Total memory: 7514 Mb.
Using device #0: API: OpenCL. GPU. AMD Radeon(TM) Graphics (gfx902). Free memory: 3069/3137 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=16.8321 10%=16.8321 median=16.8321 90%=16.8321 max=16.8321)
algorithm GFlops: 1.02016 GFlops                                                                        
algorithm effective memory bandwidth: 0.003249 GB/s                                                     
______________________________________________________                                                  
Evaluating algorithm #2/3: 01 naive                                                                     
Kernels compilation done in 0.0863511 seconds
algorithm times (in seconds) - 10 values (min=0.563561 10%=0.564931 median=0.573185 90%=0.712614 max=0.712614)
algorithm GFlops: 29.958 GFlops                                                                               
algorithm effective memory bandwidth: 0.0954098 GB/s                                                          
relative differences with CPU: 8388608 values (min=0 10%=0 median=0 90%=0 max=0)
median relative difference with CPU: 0                
99% percentile relative difference with CPU: 0        
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory      
Kernels compilation done in 0.0590079 seconds
algorithm times (in seconds) - 10 values (min=0.211308 10%=0.213502 median=0.224496 90%=0.303139 max=0.303139)
algorithm GFlops: 76.4889 GFlops                                                                              
algorithm effective memory bandwidth: 0.243601 GB/s                                                           
relative differences with CPU: 8388608 values (min=0 10%=0 median=0 90%=0 max=0)
median relative difference with CPU: 0
99% percentile relative difference with CPU: 0

GitHub CI output

$ ./main_matrix_transpose
Found 2 GPUs in 0.0515368 sec (CUDA: 8.0741e-05 sec, OpenCL: 0.0241771 sec, Vulkan: 0.0272261 sec)
Available devices:
  Device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
  Device #1: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15990/15990 Mb.
Using device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 0.119772 seconds
algorithm times (in seconds) - 10 values (min=0.1644 10%=0.174158 median=0.20944 90%=0.296053 max=0.296053)
median effective algorithm bandwidth: 4.77464 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.0419922 seconds
algorithm times (in seconds) - 10 values (min=0.169447 10%=0.16961 median=0.169875 90%=0.214047 max=0.214047)
median effective algorithm bandwidth: 5.8867 GB/s

$ ./main_matrix_multiply
Found 2 GPUs in 0.0511717 sec (CUDA: 8.1863e-05 sec, OpenCL: 0.0236903 sec, Vulkan: 0.0273561 sec)
Available devices:
  Device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
  Device #1: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15990/15990 Mb.
Using device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=14.8991 10%=14.8991 median=14.8991 90%=14.8991 max=14.8991)
algorithm GFlops: 1.15252 GFlops
algorithm effective memory bandwidth: 0.00367052 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
Kernels compilation done in 0.127516 seconds
algorithm times (in seconds) - 10 values (min=1.42624 10%=1.43796 median=1.45924 90%=1.664 max=1.664)
algorithm GFlops: 11.7674 GFlops
algorithm effective memory bandwidth: 0.0374766 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
Kernels compilation done in 0.0652832 seconds
algorithm times (in seconds) - 10 values (min=1.41456 10%=1.41573 median=1.41781 90%=1.48529 max=1.48529)
algorithm GFlops: 12.1113 GFlops
algorithm effective memory bandwidth: 0.0385719 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
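
The throughput figures in the logs above can be reproduced from the matrix sizes and the median times. The sketch below is an assumption inferred from the printed numbers, not taken from the benchmark source: it supposes 4-byte elements, a transpose bandwidth that counts one read and one write of the whole matrix in binary gigabytes, and a GFlops figure that counts K multiplications and K-1 additions per output element. The function names are hypothetical.

```python
GIB = 1024 ** 3  # assumption: the log's "GB/s" is computed with binary gigabytes


def transpose_bandwidth(rows, cols, seconds, bytes_per_element=4):
    """Effective bandwidth, assuming the matrix is read once and written once."""
    bytes_moved = 2 * rows * cols * bytes_per_element
    return bytes_moved / GIB / seconds


def matmul_gflops(h, w, k, seconds):
    """GFlops, assuming k multiplications and k - 1 additions per output element."""
    operations = h * w * (2 * k - 1)
    return operations / 1e9 / seconds


# Median times taken from the local run above.
naive_transpose = transpose_bandwidth(8192, 16384, 0.175083)  # logged: 5.71158 GB/s
naive_multiply = matmul_gflops(2048, 4096, 1024, 0.573185)    # logged: 29.958 GFlops
print(f"{naive_transpose:.3f} GB/s, {naive_multiply:.2f} GFlops")
```

Under these assumptions both values agree with the log to the printed precision, which suggests (but does not prove) that the benchmark uses the same formulas.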

DenChika commented Mar 3, 2026

Funny: on GitHub CI, the naive matrix_multiply implementation reported more GFlops than the local_memory implementation.

DenChika changed the title from "Task03 Denis Sokolov" to "Task03 Denis Sokolov ITMO" on Mar 3, 2026

@GPUcourseBOT (Collaborator)

Test results for PR #1062

Test logs
=== STATUS: Programs completed successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.45184 sec (CUDA: 0.115671 sec, OpenCL: 0.841202 sec, Vulkan: 7.49491 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 2.79815 seconds
algorithm times (in seconds) - 10 values (min=0.012165 10%=0.0121714 median=0.01218 90%=2.8104 max=2.8104)
median effective algorithm bandwidth: 82.1019 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.230498 seconds
algorithm times (in seconds) - 10 values (min=0.00837265 10%=0.00837601 median=0.00838911 90%=0.238957 max=0.238957)
median effective algorithm bandwidth: 119.202 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.327966 sec (CUDA: 0.127407 sec, OpenCL: 0.0381992 sec, Vulkan: 0.1623 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.6236 10%=11.6236 median=11.6236 90%=11.6236 max=11.6236)
algorithm GFlops: 1.47729 GFlops
algorithm effective memory bandwidth: 0.00470486 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
Kernels compilation done in 0.069236 seconds
algorithm times (in seconds) - 10 values (min=0.038186 10%=0.0382057 median=0.0391316 90%=0.143556 max=0.143556)
algorithm GFlops: 438.814 GFlops
algorithm effective memory bandwidth: 1.39753 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
Kernels compilation done in 0.0838083 seconds
algorithm times (in seconds) - 10 values (min=0.0598847 10%=0.0600926 median=0.0602318 90%=0.143199 max=0.143199)
algorithm GFlops: 285.09 GFlops
algorithm effective memory bandwidth: 0.90795 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05

@PolarNick239 (Member)

GitHub CI runs on the CPU, so that result is not very important.

But you get the same slowdown on the GPU.

Look at your local_memory implementation: what does the optimization consist of? Why should it run faster than the naive implementation?
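
One way to approach this question is to count global-memory reads. The sketch below is a CPU-side simulation in Python, not the kernel from this PR (which is not shown in the conversation). It assumes the usual tiling scheme: a work group stages a `tile` x `tile` block of A and of B in local memory, and every work item then reuses those staged values. The names `matmul_naive`, `matmul_tiled`, and `tile` are hypothetical, and all sizes are assumed to be multiples of `tile`.

```python
def matmul_naive(a, b, h, w, k):
    """One simulated work item per output element; every operand is a global read."""
    c = [[0.0] * w for _ in range(h)]
    global_reads = 0
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for t in range(k):
                acc += a[i][t] * b[t][j]
                global_reads += 2  # one element of A, one element of B
            c[i][j] = acc
    return c, global_reads


def matmul_tiled(a, b, h, w, k, tile):
    """One simulated work group per tile x tile block of C, operands staged in tiles."""
    c = [[0.0] * w for _ in range(h)]
    global_reads = 0
    for i0 in range(0, h, tile):
        for j0 in range(0, w, tile):
            acc = [[0.0] * tile for _ in range(tile)]
            for t0 in range(0, k, tile):
                # Cooperative load: each work item copies one element of A and one of B.
                tile_a = [[a[i0 + li][t0 + lt] for lt in range(tile)] for li in range(tile)]
                tile_b = [[b[t0 + lt][j0 + lj] for lj in range(tile)] for lt in range(tile)]
                global_reads += 2 * tile * tile
                # On a GPU a work-group barrier would separate the load from the compute.
                for li in range(tile):
                    for lj in range(tile):
                        for lt in range(tile):
                            acc[li][lj] += tile_a[li][lt] * tile_b[lt][lj]
            for li in range(tile):
                for lj in range(tile):
                    c[i0 + li][j0 + lj] = acc[li][lj]
    return c, global_reads


h, w, k, tile = 8, 12, 4, 4
a = [[float(i + t) for t in range(k)] for i in range(h)]
b = [[float(t - j) for j in range(w)] for t in range(k)]
c_naive, reads_naive = matmul_naive(a, b, h, w, k)
c_tiled, reads_tiled = matmul_tiled(a, b, h, w, k, tile)
print(c_naive == c_tiled, reads_naive, reads_tiled)  # True 768 192
```

In this model each value staged in local memory is reused `tile` times, so global reads drop by a factor of `tile`. The count ignores caches, coalescing, bank conflicts, and barrier cost, any of which can erase the gain in a real kernel.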

@PolarNick239 (Member)

9/10 points 👍
