
Misc. bug: Sporadic MUL_MAT Failures in test-backend-ops for Nvidia backend #11972

Open
ShanoToni opened this issue Feb 20, 2025 · 1 comment

@ShanoToni

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A100-PCIE-40GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz)
version: 4667 (d2fe216)
built with gcc (GCC) 12.2.0 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Test code

Command line

`./bin/test-backend-ops`
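If supported by this build, the run can reportedly be narrowed to the affected operator, e.g. `./bin/test-backend-ops test -o MUL_MAT` (the `test` mode and the `-o` op filter are assumptions about the tool's CLI options for this version; the usage message printed by test-backend-ops lists what is actually accepted).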

Problem description & steps to reproduce

A test failure was encountered while running MUL_MAT through test-backend-ops.

  • The failing mul_mat configuration was identified as MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]); a test case was created here.
  • Failures seemed random: consecutive runs of test-backend-ops did not reproduce the error. Modifying test-backend-ops.cpp to add the mul_mat test case 1000 times reproduced the failure consistently (at least a few of the 1000 cases would fail), as in the snippet below.
    // Example of adding the failing mul_mat case 1000 times
    for (int i = 0; i < 1000; i++) {
        test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q5_1, GGML_TYPE_F32, 16, 1, 256, {1, 1}, {1, 1}));
    }
  • The test fails because the NMSE exceeds the maximum error threshold.
  • Example error output:
  MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.000508874 > 0.000500000     
    0  0.948417  1.035245, diff = -0.086828
    1 -2.924956 -2.844111, diff = -0.080845
    2 -1.777758 -1.695090, diff = -0.082667
    3  0.450649  0.537106, diff = -0.086457
    4 -4.114096 -4.030904, diff = -0.083191
    5 -0.682358 -0.596930, diff = -0.085428
    6 -8.252451 -8.167437, diff = -0.085014
    7 -0.692235 -0.606851, diff = -0.085384
    8 -5.382234 -5.304606, diff = -0.077628
    9  3.467584  3.552903, diff = -0.085320
   10 -7.941753 -7.861615, diff = -0.080138
   11  3.101702  3.186424, diff = -0.084722
   12  0.954475  1.037351, diff = -0.082876
   13  2.353770  2.437956, diff = -0.084186
   14 -1.223359 -1.139174, diff = -0.084185
   15  0.853322  0.939753, diff = -0.086431
  • The NVIDIA backend seems to convert src1 to Q8_1 and then run mul_mat with Q5_1 and Q8_1 inputs. Could this be causing the precision issue?

  • The largest NMSE encountered across 20000 runs was 0.001409.

  • Is this degree of precision loss expected? The maximum error for the mul_mat tests is set to 5e-4; should it be adjusted? (A sketch recomputing the reported NMSE from the printed values follows this list.)
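For reference, the reported NMSE can be recomputed directly from the sixteen printed value pairs. The sketch below is a minimal standalone check, assuming the metric is the sum of squared differences normalized by the sum of squares of the first printed column (an assumption about the exact convention in test-backend-ops, but it reproduces the reported 0.000508874 to within the precision of the printed values):

// Minimal standalone sketch (not part of test-backend-ops): recompute the
// NMSE from the two value columns printed in the failing log above.
#include <cstdio>
#include <cstddef>

int main() {
    // value pairs copied from the failing MUL_MAT log
    static const double v[16][2] = {
        { 0.948417,  1.035245}, {-2.924956, -2.844111},
        {-1.777758, -1.695090}, { 0.450649,  0.537106},
        {-4.114096, -4.030904}, {-0.682358, -0.596930},
        {-8.252451, -8.167437}, {-0.692235, -0.606851},
        {-5.382234, -5.304606}, { 3.467584,  3.552903},
        {-7.941753, -7.861615}, { 3.101702,  3.186424},
        { 0.954475,  1.037351}, { 2.353770,  2.437956},
        {-1.223359, -1.139174}, { 0.853322,  0.939753},
    };

    double sum_sq_diff = 0.0; // sum of squared differences
    double sum_sq_ref  = 0.0; // sum of squares of the first column
    for (size_t i = 0; i < 16; ++i) {
        const double d = v[i][0] - v[i][1];
        sum_sq_diff += d * d;
        sum_sq_ref  += v[i][0] * v[i][0];
    }

    printf("NMSE = %.9f\n", sum_sq_diff / sum_sq_ref); // ~0.000509 for these values
    return 0;
}

This prints approximately 0.000509, i.e. only marginally above the 5e-4 mul_mat threshold.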

First Bad Commit

Due to the sporadic nature of the failure, the first bad commit has not been identified; d2fe216 is simply the first commit on which the failure was encountered. The latest commit on which the error was reproduced is 4806498.

Relevant log output

See the example error output above.
@JohannesGaessler
Collaborator

test-backend-ops uses a random seed for generating the test data. It is expected that the NMSE will sometimes exceed the threshold.
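The two observations above (the CUDA backend requantizing the f32 src1 to Q8_1 before the dot product, and the test data being regenerated from a random seed on each run) can be illustrated with a small standalone sketch. This is not the actual ggml kernel: the per-32-block, amax/127 rounding below is only a simplified stand-in for the Q8_1 conversion. The point is that the rounding error, and hence the measured NMSE, depends on the particular random draw, which is why the result hovers around the threshold from run to run.

// Minimal sketch (not the actual ggml/CUDA code): simulate the extra rounding
// introduced when an f32 activation vector is requantized to an 8-bit block
// format before the dot product, using a simplified Q8_1-like scheme.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(std::random_device{}());            // fresh seed each run, like the test data
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    const int k = 256;                                    // same k as the failing MUL_MAT case
    std::vector<float> x(k), x_q(k);
    for (float & v : x) {
        v = dist(rng);
    }

    // Quantize each block of 32 values to 8 bits and dequantize again
    // (simplified stand-in for the f32 -> Q8_1 conversion of src1).
    for (int b = 0; b < k; b += 32) {
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) {
            amax = std::max(amax, std::fabs(x[b + i]));
        }
        const float d = amax > 0.0f ? amax / 127.0f : 1.0f;
        for (int i = 0; i < 32; ++i) {
            const int q = (int) std::lround(x[b + i] / d); // integer in [-127, 127]
            x_q[b + i]  = q * d;
        }
    }

    // Round-trip NMSE of the requantized vector for this particular random draw.
    double sq_diff = 0.0, sq_ref = 0.0;
    for (int i = 0; i < k; ++i) {
        const double e = x[i] - x_q[i];
        sq_diff += e * e;
        sq_ref  += (double) x[i] * x[i];
    }
    printf("round-trip NMSE for this draw: %g\n", sq_diff / sq_ref);
    return 0;
}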
