GPU bugs: failed to output some results of wfc on GPU #6365

Open
@Cstandardlib

Description

Describe the bug

Testing on CPU and GPU devices shows that the following settings

out_wfc_re_im      1
out_pchg           1
out_wfc_norm       1

fail when run on a GPU device.

Output is as follows:

Unexpected Device Error /abacus-develop/source/source_base/module_device/cuda/memory_op.cu:184: cudaErrorIllegalAddress, an illegal memory access was encountered

The output code lies in source/source_esolver/esolver_ks_pw.cpp:

    const std::vector<int> out_wfc_norm = PARAM.inp.out_wfc_norm;
    const std::vector<int> out_wfc_re_im = PARAM.inp.out_wfc_re_im;
    if (out_wfc_norm.size() > 0 || out_wfc_re_im.size() > 0)
    {
        ModuleIO::get_wf_pw(out_wfc_norm,
                            out_wfc_re_im,
                            this->kspw_psi->get_nbands(),
                            PARAM.inp.nspin,
                            this->pw_rhod->nx,
                            this->pw_rhod->ny,
                            this->pw_rhod->nz,
                            this->pw_rhod->nxyz,
                            &ucell,
                            this->psi,
                            this->pw_wfc,
                            this->ctx,
                            this->Pgrid,
                            PARAM.globalv.global_out_dir,
                            this->kv,
                            GlobalV::KPAR,
                            GlobalV::MY_POOL);
    }

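Investigation shows that the following call in get_wf_pw (source/source_io/get_wf_pw.h):

    pw_wfc->recip_to_real(ctx, &psi[0](ib, 0), wfcr_norm.data(), ik);

attempts a reciprocal-to-real-space transform on the GPU, although both the input psi and the output wfcr_norm are host (CPU) arrays. That would explain the cudaErrorIllegalAddress above, since device code ends up dereferencing host pointers.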
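To make the data flow concrete, here is a minimal host-only sketch of what a reciprocal-to-real transform computes on CPU arrays. It is not the ABACUS implementation: `host_recip_to_real` and `modulus` are hypothetical names, the transform is a naive 1D inverse DFT rather than an FFT, and the lack of a normalization factor is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive inverse DFT on host memory (illustration only, O(n^2)):
//   out[x] = sum_g in[g] * exp(+2*pi*i*g*x/n)
// No normalization factor is applied; that convention is an assumption.
// `in` and `out` must both be host pointers and must not alias.
inline void host_recip_to_real(const std::complex<double>* in,
                               std::complex<double>* out,
                               std::size_t n)
{
    assert(in != nullptr && out != nullptr);
    const double two_pi = 2.0 * std::acos(-1.0);
    for (std::size_t x = 0; x < n; ++x)
    {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t g = 0; g < n; ++g)
        {
            const double phase = two_pi * static_cast<double>(g * x) / static_cast<double>(n);
            sum += in[g] * std::polar(1.0, phase);
        }
        out[x] = sum;
    }
}

// Modulus per grid point, as illustrative post-processing of the real-space data.
inline std::vector<double> modulus(const std::vector<std::complex<double>>& real_space)
{
    std::vector<double> result(real_space.size());
    for (std::size_t i = 0; i < real_space.size(); ++i)
    {
        result[i] = std::abs(real_space[i]);
    }
    return result;
}
```

Both buffers here are ordinary `std::vector` storage, which is the situation the issue describes for psi and wfcr_norm. A transform that runs entirely on the host can work on such pointers directly, whereas a device kernel cannot.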

// declared in source/source_esolver/esolver_ks_pw.h
psi::Psi<std::complex<double>, base_device::DEVICE_CPU>* psi = nullptr;
// declared in source/source_io/get_wf_pw.h
std::vector<std::complex<double>> wfcr_norm(nxyz);

  • Note that out_wfc_pw works correctly on GPU.
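One possible direction, sketched under assumptions: derive the transform context from the device tag of the data instead of reusing the solver's context. The types below (`DeviceCpu`, `DeviceGpu`, `TaggedPsi`) are hypothetical stand-ins for base_device::DEVICE_CPU, base_device::DEVICE_GPU and psi::Psi, and the sketch only models backend selection. It does not call any ABACUS or CUDA function, and whether recip_to_real accepts a CPU context in this code path has not been verified here.

```cpp
#include <cassert>
#include <complex>
#include <type_traits>
#include <vector>

enum class Backend { Cpu, Gpu };

// Hypothetical device tags, standing in for the real device types.
struct DeviceCpu { static constexpr Backend value = Backend::Cpu; };
struct DeviceGpu { static constexpr Backend value = Backend::Gpu; };

// Hypothetical stand-in for a wavefunction container whose template
// parameter records where its data lives.
template <typename T, typename Device>
struct TaggedPsi
{
    using device_type = Device;
    std::vector<T> data;
};

// Backend that a call taking this context pointer would select.
template <typename Device>
inline Backend backend_of(const Device*) { return Device::value; }

// True only when the context type equals the device tag of the data.
template <typename Ctx, typename T, typename Device>
inline bool context_matches(const Ctx*, const TaggedPsi<T, Device>&)
{
    return std::is_same<Ctx, Device>::value;
}

// Derive a context from the buffer itself rather than from the solver.
// The pointer is used only as a type tag, so nullptr is sufficient.
template <typename T, typename Device>
inline const Device* context_for(const TaggedPsi<T, Device>&) { return nullptr; }

// Guarded selection: refuse a context that disagrees with the data location.
template <typename Ctx, typename T, typename Device>
inline Backend checked_backend(const Ctx* ctx, const TaggedPsi<T, Device>& psi)
{
    assert(context_matches(ctx, psi) && "context device differs from data device");
    return backend_of(ctx);
}
```

With a CPU-tagged psi, `context_for(psi)` yields a CPU context, while a GPU solver context fails the `context_matches` check. The alternative would be to copy psi to device memory and the result back to the host around the GPU transform.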

Expected behavior

Wavefunction (wfc) output on GPU should work as it does on CPU.

To Reproduce

Build with CUDA and testing enabled:

cmake -B build -DBUILD_TESTING=ON -DUSE_CUDA=ON

Running abacus-develop/tests/11_PW_GPU/BUG_PW_outwfcr_GPU with

out_wfc_norm       1

reproduces the error shown above.

Environment

Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

Labels

  • Bugs: Bugs that only solvable with sufficient knowledge of DFT
  • GPU & DCU & HPC: GPU and DCU and HPC related any issues
  • Input&Output: Suitable for coders without knowing too many DFT details
  • Unit Tests/Integreate Tests: Issues/PR related to unit tests and integrate tests
