Produce clash on CI in GNU/Linux with case-folding ext4

EliahKagan · EliahKagan · commit f48a2bf93799 · 2025-05-18T21:36:24.000-04:00
This (from EliahKagan#36) adds a `test-ext4-casefold` matrix job definition to `ci.yml`, which is probably intended to be temporary, to demonstrate for #2008 that #2006 affects GNU/Linux (in addition to macOS and Windows) when the GNU/Linux system is using an ext4 volume with a case-insensitive directory tree, i.e., where the ext4 filesystem is created with `-O casefold` and operations are performed in a directory within that filesystem where `chattr +F` has been used to enable case-folding path equivalence.

The bug is nondeterministic and challenging to produce on GNU/Linux, even when the case-folding precondition is satisfied (as all the `test-ext4-casefold` jobs achieve). It is much harder to produce on GNU/Linux than on macOS or Windows. Accordingly, the number of duplicate test cases is increased 25-fold from 1000 to 25000 in those tests, and multiple jobs are created, some of them equivalent.

As expected, the failures do not occur when the runner does not run tests in parallel, and they do occur most of the time when the runner runs tests in parallel with a maximum number of parallel tests set to 2 or to 4. However, unexpectedly, the failures almost never occur when the tests are run in parallel with a maximum number of parallel tests set to 3.

It is unclear to what extent, if any, this generalizes, but I have not observed these patterns when testing locally, including when modifying the local procedure to be more similar to the behavior here (such as by suppressing reporting of non-failing tests and suppressing the progress bar). In local testing, I am unable to produce failures on some GNU/Linux systems, but on those on which I am able to produce them, they are easier to produce with all values of the `--test-threads` operand, including 3.

Not all operating systems, versions, and kernel builds and options support case-folding on ext4. But support is fairly common on Linux-based systems, and the CI runners have no problem with that.
In contrast, although case-folding tmpfs also exists on Linux, it is newer and not (yet) supported as widely, and the GitHub-hosted runners do not support case-folding tmpfs. It is possible to mount a tmpfs filesystem and create an ext4 image in it so that, in principle, fewer operations need to access the disk. That is one of the approaches that was tried, but it does not appear to make it any easier to surface these failures.

This squashes multiple commits. (Mistakes, and less illuminating experiments, have already been omitted; the commits being squashed here are those that are expected to be of possible interest, but whose details are almost but not quite important enough to justify the cumbersome effect of having them individually in the history.)

The squashed commits' individual messages are shown below. For full changes and CI results from each commit, see: EliahKagan#36

---

* Add a case-folding GNU/Linux CI test job

  This can likely be removed later. But it might be kept if it finds anything interesting, especially if it helps test a bugfix. In that case, it may make sense to have it run only tests that are of special interest in connection with case-sensitivity.

* Case-fold `$TMPDIR` as well in CI casefold test

* Run only 2 parallel test processes in test-ext4-casefold

  This seemed to make the bug *easier* to reproduce (compared to larger values) locally, so maybe it will surface it on CI even in GNU/Linux.

* Increase symlink experiment reps 10x in test-ext4-casefold

  From 1000 (0..=999) to 10000 (0..=9999).

* Try even harder to reproduce the failure on GNU/Linux

  By repeating `cargo nextest` on the affected test group, up to 20 times.

* Add `ubuntu-24.04-arm` to `test-ext4-casefold` job

  In case the failure might somehow be easier to reproduce there.

* Use tmpfs casefold instead of ext4 casefold

* Use tmpfs-backed ext4 as a workaround

  For when tmpfs does not support `-o casefold` in mounting.
  This does require that ext4 support `-O casefold`, but that was already working before. Even though that worked before, this could fail when using an ext4 image on tmpfs, because the ext4 image is smaller than it was when tmpfs was not used, in order to allow it to be created in memory. While this allows it to be created, the tests may fail if build artifacts and other files need more space than is available.

* Run only the test(s) of interest

  In `test-linux-casefold`, this runs only the tests that are duplicates of each other, where the failure has been observed, rather than first running the whole test suite. The reason is to try to build fewer crates and test executables, since this is currently failing because it runs out of space on the ext4 image, whose size is itself constrained by the amount of memory available for the tmpfs filesystem on which the image is created.

* Don't attempt to use case-folding tmpfs

  But check if it is reported as available and, if so, issue a GHA warning that it should be used instead of the current way.

* Try to fail at least as often, with less overhead

  This combines a few changes:

  - Only report the status of tests that actually fail.
  - Run the `cargo nextest` run command once, not 20 times.
  - Multiply the number of test cases by another factor of 10.
  - Stop after the first failing case.

* Move the ext4 image off tmpfs

  This tries going back to not using tmpfs for I/O speedup, with the idea that the increased complexity and decreased flexibility might be possible to avoid now that other adjustments are made to surface the failure more reliably.

* Try to fail more often, with less delay

  The `test-ext4-casefold` jobs were still not surfacing the error quite reliably enough. The wait was also longer than ideal. This commit makes some changes that are hoped to improve on both.

  - Decrease the number of duplicate test cases by a factor of 4.
    This has the effect that they are increased by 25x compared to the multiplier in the committed code, instead of 100x.
  - Create 10 times as many CI jobs in the matrix, by adding a `num` variable whose values are those in the range 0..=9. This variable is not actually used; it just serves to produce more CI jobs.
  - Don't use `Swatinem/rust-cache` in these jobs anymore. It is not obvious that it will make things work better overall, in terms of speedup, cache usage, and cache access, now that we are increasing the number of matrix jobs by a factor of 10.

  Moving some of the repetition into separate jobs, which may run in parallel, is not only, nor even primarily, to leverage runner parallelism to do more test runs faster. Instead, it is because it looks like some chaotic initial state may contribute to how likely the failure is to occur. No definitive experiment on CI shows this to be the case -- but in local testing, on some systems, I often have runs that fail early over and over again, and then wait a while, come back, and have many runs that don't surface the bug, then come back later and it is easy to produce again, and so on.

* Further diversify runners

* Undiversify runners but vary test parallelism

  Since varying `runs-on` didn't surface more errors, nor help smooth out inconsistencies across runs. The `*-arm` runners seem to have fewer errors with the particular way the tests are running now, so this just uses `ubuntu-latest`.

  This also adjusts `rep`, having it take on more values, so as to more than make up for what would otherwise be a smaller number of jobs.

* Test more values for test parallelism

  This creates additional CI `test-ext4-casefold` jobs, for more values of the `--test-threads` operand passed to `cargo nextest`.

  The motivation is that the value of 3 curiously seems never to surface the failure on CI. Unlike 1, 3 is expected to surface the failure comparably to 2 and 4, both of which do fail more often than not.
  Yet the jobs with `--test-threads=3` keep passing. (This is unlikely to be due to case-sensitivity misdetection, since the assertions for when the filesystem is case-sensitive would also fail, if they fire in a job where the filesystem is case-folding.)

  If these jobs are kept, then `rep` should probably be adjusted to have fewer, perhaps 7, values. However, this adjusts `rep` to have 16 values (i.e. twice as many as before), since the higher operands to `--test-threads` are expected to produce the failure less often.

* Test fewer test parallelism values and reps

  - Test `--test-threads` operands of 1 to 4, rather than 1 to 7.
  - Do 7 reps instead of 16.

  In addition to the 1000x repetition in `checkout.rs` of the writes_through_symlinks_are_prevented_even_if_overwriting_is_allowed test case -- which almost always surfaces #2006 on CI on macOS and Windows without further modifications -- the `test-ext4-casefold` jobs as they currently stand seem like they would be sufficient to test whether a possible future solution would fix the bug on GNU/Linux.

  This looks like about the minimum amount we should reasonably verify still fails (in some jobs) immediately before a #2006 fix, while no longer failing *any* jobs afterwards.

  But the `test-ext4-casefold` jobs are still much more than would be wanted on the main branch. Assuming the cause of the failures is the same on all systems, and the solution neither explicitly nor implicitly operates differently across systems (when comparing how it works when the filesystem is case-insensitive, that is), it may be enough on CI to regression test only macOS and Windows for #2006.

  Thus, this also adds a TODO comment atop the `test-ext4-casefold` definition reminding that it should not be kept, or not in full.
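
For reference, the 25-fold increase from 1000 to 25000 test cases is produced in the workflow by a `sed` substitution that prepends the digits `24` to the upper bound of the `test_matrix` range. The following is a minimal sketch of that substitution applied to a single line on standard input rather than in place. It assumes the attribute line in `checkout.rs` has the form `#[test_matrix(0..=999)]`, which is consistent with the range mentioned above but is not shown in this diff; the `line`, `bump`, and `result` names are hypothetical and exist only for illustration.

```shell
# Assumed form of the attribute line in checkout.rs (a stand-in, not copied from the file).
line='#[test_matrix(0..=999)]'

# Same substitution as in the workflow step, but reading stdin instead of editing in place.
# Group 1 is everything up to and including `0..=`; group 2 is the digits plus `)]`.
bump() {
  sed -E 's/^(#\[test_matrix\(0\.\.=)([[:digit:]]+\)])$/\124\2/'
}

result=$(printf '%s\n' "$line" | bump)
printf '%s\n' "$result"  # prints: #[test_matrix(0..=24999)]
```

Since `0..=24999` is an inclusive range, this gives 25000 cases. Lines that do not match the anchored pattern pass through unchanged, so if the attribute were ever written differently, the workflow step would silently leave the repetition count as it was.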
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -185,6 +185,74 @@ jobs:
       - name: Check that tracked archives are up to date
         run: git diff --exit-code  # If this fails, the fix is usually to commit a regenerated archive.
 
+  # TODO(integration): Remove all or at least most test-ext4-casefold jobs prior to merging #2008.
+  test-ext4-casefold:
+    strategy:
+      matrix:
+        # `--test-threads` operand
+        parallel-tests:
+          - 1  # Never fails.
+          - 2  # Usually fails.
+          - 3  # Almost never fails, somehow!
+          - 4  # Usually fails.
+        # Duplicate jobs
+        rep: [A, B, C, D, E, F, G]
+      fail-fast: false
+
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Warn if we could use tmpfs mounted with `-o casefold` instead
+        run: |
+          case "$(cat </sys/fs/tmpfs/features/casefold || :)" in
+          supported)
+            echo '::warning:: The runner OS supports case-folding tmpfs, which may be preferable.'
+            ;;
+          esac
+      - name: Set up casefold directory
+        run: |
+          set -ux
+          dd if=/dev/zero of=~/ext4-casefold.img bs=100M count=150
+          mkfs.ext4 -O casefold -- ~/ext4-casefold.img
+          mkdir -- ~/ext4-casefold
+          sudo mount -- ~/ext4-casefold.img ~/ext4-casefold
+          sudo chown -- "$USER" ~/ext4-casefold
+          mkdir -- ~/ext4-casefold/icase
+          chattr +F -- ~/ext4-casefold/icase
+      - name: Arrange to map workspace and TMPDIR in all subsequent steps
+        run: |
+          set -ux
+          mkdir -- ~/ext4-casefold/icase/{workspace,tmp}
+          sudo mount --rbind -- ~/ext4-casefold/icase/workspace .
+          printf 'TMPDIR=%s\n' ~/ext4-casefold/icase/tmp >> "$GITHUB_ENV"
+      - name: Verify case folding in workspace and TMPDIR
+        run: |
+          set -ux
+          shopt -s nullglob
+          verify() {
+            touch a A
+            files=(?)
+            test "${#files[@]}" -eq 1
+            rm a
+          }
+          verify
+          (cd -- "$TMPDIR"; verify)
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: taiki-e/install-action@v2
+        with:
+          tool: nextest
+      - name: More writes_through_symlinks_are_prevented_even_if_overwriting_is_allowed reps
+        run: |
+          # Prepend leading digits "24" to the upper bound of the label range.
+          sed -Ei 's/^(#\[test_matrix\(0\.\.=)([[:digit:]]+\)])$/\124\2/' \
+            gix-worktree-state/tests/state/checkout.rs
+      - name: Test writes_through_symlinks_are_prevented_even_if_overwriting_is_allowed (nextest)
+        run: |
+          cargo nextest run -p gix-worktree-state-tests \
+            writes_through_symlinks_are_prevented_even_if_overwriting_is_allowed \
+            --status-level=fail --test-threads=${{ matrix.parallel-tests }}
+
   test-fixtures-windows:
     runs-on: windows-latest
 
@@ -497,6 +565,7 @@ jobs:
       - test
       - test-journey
       - test-fast
+      - test-ext4-casefold
       - test-fixtures-windows
       - test-32bit
       - test-32bit-windows-size-doc