Hi @rusty1s,
Thanks for the awesome work of putting together and maintaining pytorch_scatter.
I'm facing an issue with scatter when I run the following code:
from torch_scatter import scatter
import torch
x_j = torch.randn((12143200, 192), dtype=torch.float32).to('cuda:0')
edge_index = torch.randint(low=0, high=73727, size=(12143200,)).to('cuda:0')
out = scatter(src=x_j.to(torch.float32), index=edge_index, dim=0, dim_size=73728, reduce='max')
print(out)
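For reference, here is a small sketch of the sizes involved, in plain Python so it runs without a GPU. It only restates the shapes from the snippet above. The comparison against the signed 32-bit limit is my own addition, and whether torch-scatter's CUDA kernel actually uses 32-bit indexing is an assumption I have not verified.

```python
# Shapes copied from the reproduction snippet above.
num_rows, num_cols = 12143200, 192
dim_size = 73728
index_high = 73727  # exclusive upper bound passed to torch.randint

INT32_MAX = 2**31 - 1
BYTES_PER_FLOAT32 = 4

src_numel = num_rows * num_cols
src_bytes = src_numel * BYTES_PER_FLOAT32

# randint draws from [0, index_high), so the largest possible index is index_high - 1.
indices_in_bounds = (index_high - 1) < dim_size
# Checks whether src has more elements than a signed 32-bit integer can count.
exceeds_int32 = src_numel > INT32_MAX

print(f"src elements: {src_numel:,} (int32 max: {INT32_MAX:,})")
print(f"src size: {src_bytes / 2**30:.2f} GiB")
print(f"indices in bounds: {indices_in_bounds}, exceeds int32: {exceeds_int32}")
```

So the indices should all be valid for dim_size, and src itself is well under 32 GB. The element count of src, which is above 2^31 - 1, is the only unusual number I can see.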
I run export CUDA_LAUNCH_BLOCKING=1 in the shell before running this code.
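In case it matters, the same setting can also be applied from inside the script. This is a minimal sketch. The only requirement I am aware of is that the variable is set before CUDA is initialized, so it goes ahead of the torch import.

```python
import os

# Same effect as `export CUDA_LAUNCH_BLOCKING=1` in the shell. It has to be set
# before the first CUDA call, so place it ahead of `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and torch_scatter) only after this point
```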
I'm using one V100 GPU with 32 GB of memory to run this code. Here's my nvidia-smi output:
Sat Aug 10 13:21:38 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   34C    P0    46W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Here's my conda environment:
name: MyEnv
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- blas=1.0=mkl
- brotli-python=1.0.9=py37hd23a5d3_7
- bzip2=1.0.8=h7f98852_4
- ca-certificates=2023.08.22=h06a4308_0
- certifi=2022.12.7=py37h06a4308_0
- charset-normalizer=3.3.0=pyhd8ed1ab_0
- cudatoolkit=10.2.89=hfd86e86_1
- ffmpeg=4.3.2=hca11adc_0
- flit-core=3.6.0=pyhd3eb1b0_0
- freetype=2.12.1=h4a9f257_0
- giflib=5.2.1=h5eee18b_3
- gmp=6.2.1=h58526e2_0
- gnutls=3.6.13=h85f3911_1
- idna=3.4=pyhd8ed1ab_0
- intel-openmp=2023.1.0=hdb19cb5_46305
- jpeg=9b=h024ee3a_2
- lame=3.100=h7f98852_1001
- lcms2=2.12=h3be6417_0
- ld_impl_linux-64=2.38=h1181459_1
- libffi=3.4.4=h6a678d5_0
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libpng=1.6.39=h5eee18b_0
- libstdcxx-ng=11.2.0=h1234567_1
- libtiff=4.2.0=h85742a9_0
- libuv=1.44.2=h5eee18b_0
- libwebp=1.2.0=h89dd481_0
- libwebp-base=1.2.0=h27cfd23_0
- lz4-c=1.9.4=h6a678d5_0
- mkl=2020.2=256
- mkl-service=2.3.0=py37he8ac12f_0
- mkl_fft=1.3.0=py37h54f3939_0
- mkl_random=1.1.1=py37h0573a6f_0
- ncurses=6.4=h6a678d5_0
- nettle=3.6=he412f7d_0
- ninja=1.10.2=h06a4308_5
- ninja-base=1.10.2=hd09550d_5
- openh264=2.1.1=h780b84a_0
- openssl=1.1.1w=h7f8727e_0
- pillow=9.3.0=py37hace64e9_1
- pip=22.3.1=py37h06a4308_0
- pysocks=1.7.1=py37h89c1867_5
- python=3.7.16=h7a1cb2a_0
- python_abi=3.7=2_cp37m
- pytorch-mutex=1.0=cuda
- pyyaml=6.0=py37h5eee18b_1
- readline=8.2=h5eee18b_0
- requests=2.31.0=pyhd8ed1ab_0
- setuptools=65.6.3=py37h06a4308_0
- six=1.16.0=pyhd3eb1b0_1
- sqlite=3.41.2=h5eee18b_0
- tbb=2021.8.0=hdb19cb5_0
- timm=0.3.2=pyhd8ed1ab_0
- tk=8.6.12=h1ccaba5_0
- typing_extensions=4.4.0=py37h06a4308_0
- urllib3=2.0.6=pyhd8ed1ab_0
- wheel=0.38.4=py37h06a4308_0
- x264=1!161.3030=h7f98852_1
- xz=5.4.2=h5eee18b_0
- yaml=0.2.5=h7b6447c_0
- zlib=1.2.13=h5eee18b_0
- zstd=1.4.9=haebb681_0
- pip:
  - cffi==1.15.1
  - cryptography==42.0.5
  - cupy-cuda102==11.6.0
  - cycler==0.11.0
  - fastrlock==0.8.2
  - fonttools==4.38.0
  - jinja2==3.1.3
  - joblib==1.3.2
  - kiwisolver==1.4.5
  - markupsafe==2.1.5
  - matplotlib==3.5.3
  - numpy==1.21.6
  - nvidia-cublas-cu11==11.10.3.66
  - nvidia-cuda-nvrtc-cu11==11.7.99
  - nvidia-cuda-runtime-cu11==11.7.99
  - nvidia-cudnn-cu11==8.5.0.96
  - packaging==23.2
  - pandas==1.3.5
  - psutil==5.9.8
  - pycparser==2.21
  - pydeprecate==0.3.2
  - pyopenssl==24.1.0
  - pyparsing==3.1.1
  - python-dateutil==2.8.2
  - pytz==2023.3.post1
  - scikit-learn==1.0.2
  - scipy==1.7.3
  - threadpoolctl==3.1.0
  - torch==1.7.1+cu110
  - torch-geometric==2.3.1
  - torch-scatter==2.0.7
  - torchaudio==0.7.2
  - torcheval==0.0.7
  - torchmetrics==0.7.2
  - torchprofile==0.0.4
  - torchvision==0.8.2+cu110
  - tqdm==4.66.2
This is the error I face:
Traceback (most recent call last):
  File "playground.py", line 5, in <module>
    out = scatter(src=x_j.to(torch.float32), index=edge_index, dim=0, dim_size=73728, reduce='max')
  File "/raid/ismail2/miniconda3/envs/MyEnv/lib/python3.7/site-packages/torch_scatter/scatter.py", line 161, in scatter
    return scatter_max(src, index, dim, out, dim_size)[0]
  File "/raid/ismail2/miniconda3/envs/MyEnv/lib/python3.7/site-packages/torch_scatter/scatter.py", line 73, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
RuntimeError: CUDA error: an illegal memory access was encountered
I've been stuck here for a while and would really appreciate any help on this. Thanks.
PS: AFAIU, the illegal memory error is different from the out-of-memory error.
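One workaround I could try, sketched below, is to split the feature dimension so that each scatter call sees fewer than 2^31 elements. This assumes the failure is related to the total element count of src, which is not confirmed. column_chunks and chunked_scatter_max are hypothetical helper names, not part of torch_scatter; only scatter itself is the library's function.

```python
INT32_MAX = 2**31 - 1


def column_chunks(num_rows, num_cols, max_elems=INT32_MAX):
    """Yield (start, stop) column ranges with num_rows * (stop - start) <= max_elems.

    Assumes 0 < num_rows <= max_elems, so that at least one column fits per chunk.
    """
    cols_per_chunk = max(1, max_elems // num_rows)
    for start in range(0, num_cols, cols_per_chunk):
        yield start, min(start + cols_per_chunk, num_cols)


def chunked_scatter_max(src, index, dim_size):
    """Hypothetical helper: scatter-max over dim 0, one column slice at a time."""
    # Imported here so column_chunks stays usable without a GPU stack.
    import torch
    from torch_scatter import scatter

    parts = []
    for start, stop in column_chunks(src.size(0), src.size(1)):
        # The reduction along dim 0 is independent per column, so slices can be
        # reduced separately and concatenated back along dim 1.
        chunk = src[:, start:stop].contiguous()
        parts.append(scatter(src=chunk, index=index, dim=0, dim_size=dim_size, reduce='max'))
    return torch.cat(parts, dim=1)


# For the shapes in this report: two slices, 176 and 16 columns wide.
print(list(column_chunks(12143200, 192)))  # [(0, 176), (176, 192)]
```

With the tensors from my snippet this would be called as chunked_scatter_max(x_j, edge_index, 73728). Note that the .contiguous() call copies each slice, so peak GPU memory goes up while a chunk is being processed.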