Bug: GPU false results of cell-relax #6141

Open
@Cstandardlib

Describe the bug

When running a cell-relax calculation on simple FCC Al, the GPU output diverges from the CPU results.

The CPU calculation converges within 3 steps, with FINAL_ETOT_IS -1883.2252729313104282 eV, while the GPU calculation does not report convergence until 22 steps, giving FINAL_ETOT_IS -1526.2800901207624520 eV.

During the GPU run, the stress increases gradually to a very large value and then drops rapidly to near zero. Only the first-step stress, computed from the initial structure, is nearly the same as the CPU value.

Further experiments show that, starting from the initial STRU, the GPU cell-relax already produces the wrong structure after step 1.

Calculations by:

  • PW
  • GPU / NVIDIA GeForce RTX 3090
  • For details see the following parts.

Expected behavior

The CPU and GPU cell-relax should give comparable results.
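
One way to make "comparable" concrete is to compare the final total energies reported by the two runs. The sketch below is only an illustration: the regular expression assumes the energy appears in the log as `FINAL_ETOT_IS <value> eV`, as quoted in this issue, and the helper names (`final_etot`, `comparable`) and the tolerance of 1e-4 eV are hypothetical choices, not part of ABACUS.

```python
import re

# Assumed line shape, taken from the values quoted in this issue:
#   FINAL_ETOT_IS -1883.2252729313104282 eV
ETOT_PATTERN = re.compile(r"FINAL_ETOT_IS\s+(-?\d+(?:\.\d+)?)\s*eV")


def final_etot(log_text: str) -> float:
    """Return the last FINAL_ETOT_IS value (in eV) found in the log text."""
    matches = ETOT_PATTERN.findall(log_text)
    if not matches:
        raise ValueError("no FINAL_ETOT_IS line found")
    return float(matches[-1])


def comparable(cpu_log: str, gpu_log: str, tol_ev: float = 1e-4) -> bool:
    """True if the two final energies agree within tol_ev (hypothetical tolerance)."""
    return abs(final_etot(cpu_log) - final_etot(gpu_log)) <= tol_ev


# The two energies reported above, wrapped as minimal log excerpts.
cpu_log = "FINAL_ETOT_IS -1883.2252729313104282 eV"
gpu_log = "FINAL_ETOT_IS -1526.2800901207624520 eV"

diff = final_etot(gpu_log) - final_etot(cpu_log)
print(f"GPU - CPU = {diff:.4f} eV")
print("comparable:", comparable(cpu_log, gpu_log))
```

With the energies from this report, the two runs differ by roughly 357 eV, far outside any reasonable numerical tolerance between devices.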

To Reproduce

The problem reproduces with a simple case that can be downloaded from https://github.com/mcresearch/abacus-user-guide/tree/master/examples/surface_energy/Al_fcc100/0_bulk.

After setting symmetry=-1 and increasing KPT to 14 14 14, the GPU calculation shows behavior similar to the original case.

Environment

  • OS: Ubuntu 22.04.4 LTS
  • Compiler:
    • gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
    • nvcc Build cuda_12.4.r12.4/compiler.33961263_0
  • ABACUS v3.9.0.2 Commit: 35448cb (Mon Mar 31 09:24:22 2025 +0800)
  • Built with
cmake -B build -DUSE_CUDA=ON
cmake --build build -j`nproc`
  • This problem was encountered in both single- and multi-core calculations with different OMP and MPI configurations.

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

Labels

  • Bugs: Bugs that only solvable with sufficient knowledge of DFT
  • GPU & DCU & HPC: GPU and DCU and HPC related any issues
  • GeometryRelaxation: Issues related to geometry relaxation
