
Misc. bug: Sporadic MUL_MAT Failures in test-backend-ops for Nvidia backend #11972

Open
ShanoToni opened this issue Feb 20, 2025 · 1 comment

@ShanoToni

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A100-PCIE-40GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz)
version: 4667 (d2fe216)
built with gcc (GCC) 12.2.0 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Test code

Command line

`./bin/test-backend-ops`
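If supported by this build, the run can reportedly be narrowed to the affected operator, e.g. `./bin/test-backend-ops test -o MUL_MAT` (the `test` mode and the `-o` op filter are assumptions about the tool's CLI options for this version; the usage message printed by test-backend-ops lists what is actually accepted).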

Problem description & steps to reproduce

A test failure was encountered while running MUL_MAT through test-backend-ops.

  • The failing mul_mat configuration was identified as MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]); a test case was created here.
  • Failures seemed random: consecutive runs of test-backend-ops did not reproduce the error. Modifying test-backend-ops.cpp to add the mul_mat test case 1000 times reproduced the failure consistently (at least a few of the 1000 cases would fail), as in the snippet below.
    // Example of adding the failing mul_mat case 1000 times
    for (int i = 0; i < 1000; i++) {
        test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q5_1, GGML_TYPE_F32, 16, 1, 256, {1, 1}, {1, 1}));
    }
  • The test fails because the NMSE exceeds the maximum error threshold.
  • Example error output:
  MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.000508874 > 0.000500000     
    0  0.948417  1.035245, diff = -0.086828
    1 -2.924956 -2.844111, diff = -0.080845
    2 -1.777758 -1.695090, diff = -0.082667
    3  0.450649  0.537106, diff = -0.086457
    4 -4.114096 -4.030904, diff = -0.083191
    5 -0.682358 -0.596930, diff = -0.085428
    6 -8.252451 -8.167437, diff = -0.085014
    7 -0.692235 -0.606851, diff = -0.085384
    8 -5.382234 -5.304606, diff = -0.077628
    9  3.467584  3.552903, diff = -0.085320
   10 -7.941753 -7.861615, diff = -0.080138
   11  3.101702  3.186424, diff = -0.084722
   12  0.954475  1.037351, diff = -0.082876
   13  2.353770  2.437956, diff = -0.084186
   14 -1.223359 -1.139174, diff = -0.084185
   15  0.853322  0.939753, diff = -0.086431
  • The NVIDIA backend seems to convert src1 to Q8_1 and then run mul_mat with Q5_1 and Q8_1 inputs. Could this be causing the precision issue?

  • The largest NMSE encountered across 20000 runs was 0.001409.

  • Is this degree of precision loss expected? The maximum error for the mul_mat tests is set to 5e-4; should it be adjusted? (A sketch recomputing the reported NMSE from the printed values follows this list.)
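For reference, the reported NMSE can be recomputed directly from the sixteen printed value pairs. The sketch below is a minimal standalone check, assuming the metric is the sum of squared differences normalized by the sum of squares of the first printed column (an assumption about the exact convention in test-backend-ops, but it reproduces the reported 0.000508874 to within the precision of the printed values):

// Minimal standalone sketch (not part of test-backend-ops): recompute the
// NMSE from the two value columns printed in the failing log above.
#include <cstdio>
#include <cstddef>

int main() {
    // value pairs copied from the failing MUL_MAT log
    static const double v[16][2] = {
        { 0.948417,  1.035245}, {-2.924956, -2.844111},
        {-1.777758, -1.695090}, { 0.450649,  0.537106},
        {-4.114096, -4.030904}, {-0.682358, -0.596930},
        {-8.252451, -8.167437}, {-0.692235, -0.606851},
        {-5.382234, -5.304606}, { 3.467584,  3.552903},
        {-7.941753, -7.861615}, { 3.101702,  3.186424},
        { 0.954475,  1.037351}, { 2.353770,  2.437956},
        {-1.223359, -1.139174}, { 0.853322,  0.939753},
    };

    double sum_sq_diff = 0.0; // sum of squared differences
    double sum_sq_ref  = 0.0; // sum of squares of the first column
    for (size_t i = 0; i < 16; ++i) {
        const double d = v[i][0] - v[i][1];
        sum_sq_diff += d * d;
        sum_sq_ref  += v[i][0] * v[i][0];
    }

    printf("NMSE = %.9f\n", sum_sq_diff / sum_sq_ref); // ~0.000509 for these values
    return 0;
}

This prints approximately 0.000509, i.e. only marginally above the 5e-4 mul_mat threshold.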

First Bad Commit

Due to the sporadic nature of the failure, the first bad commit has not been identified; d2fe216 is simply the first commit on which the failure was encountered. The latest commit on which the error was reproduced is 4806498.

Relevant log output

See the example error output above.
@JohannesGaessler
Collaborator

test-backend-ops uses a random seed for generating the test data. It is expected that the NMSE will sometimes exceed the threshold.
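The two observations above (the CUDA backend requantizing the f32 src1 to Q8_1 before the dot product, and the test data being regenerated from a random seed on each run) can be illustrated with a small standalone sketch. This is not the actual ggml kernel: the per-32-block, amax/127 rounding below is only a simplified stand-in for the Q8_1 conversion. The point is that the rounding error, and hence the measured NMSE, depends on the particular random draw, which is why the result hovers around the threshold from run to run.

// Minimal sketch (not the actual ggml/CUDA code): simulate the extra rounding
// introduced when an f32 activation vector is requantized to an 8-bit block
// format before the dot product, using a simplified Q8_1-like scheme.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(std::random_device{}());            // fresh seed each run, like the test data
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    const int k = 256;                                    // same k as the failing MUL_MAT case
    std::vector<float> x(k), x_q(k);
    for (float & v : x) {
        v = dist(rng);
    }

    // Quantize each block of 32 values to 8 bits and dequantize again
    // (simplified stand-in for the f32 -> Q8_1 conversion of src1).
    for (int b = 0; b < k; b += 32) {
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) {
            amax = std::max(amax, std::fabs(x[b + i]));
        }
        const float d = amax > 0.0f ? amax / 127.0f : 1.0f;
        for (int i = 0; i < 32; ++i) {
            const int q = (int) std::lround(x[b + i] / d); // integer in [-127, 127]
            x_q[b + i]  = q * d;
        }
    }

    // Round-trip NMSE of the requantized vector for this particular random draw.
    double sq_diff = 0.0, sq_ref = 0.0;
    for (int i = 0; i < k; ++i) {
        const double e = x[i] - x_q[i];
        sq_diff += e * e;
        sq_ref  += (double) x[i] * x[i];
    }
    printf("round-trip NMSE for this draw: %g\n", sq_diff / sq_ref);
    return 0;
}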
