mirror_ocp_release: fixes for concurrent jobs #626

nsilla · 2025-04-01T10:07:56Z

SUMMARY

Fixes CILAB-2034: when multiple jobs mirror the same version on the same host at the same time, the mirroring directory may end up in an inconsistent state, depending on the order in which each job runs its tasks.

This change implements temporary directories per role call so the artifacts are only moved to the final location at the end of the execution.

ISSUE TYPE

Bug

Tests

TestBos2Sno: sno - https://www.distributed-ci.io/jobs/95df9c8c-513b-446b-b864-23374790bc2a/jobStates

TestBos2Sno: sno sno:components=ocp=4.18.5,

dcibot · 2025-04-01T10:08:01Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:08:01Z

from change #626:

no check (not a code change)

nsilla · 2025-04-01T10:08:41Z

roles/mirror_ocp_release/tasks/artifacts.yml

@@ -11,20 +11,26 @@
  when:
    - mor_force or not _mor_target.stat.exists
  block:
-    - name: "Extract installer and metadata from release image"
-      ansible.builtin.shell: >
-        flock -x {{ mor_cache_dir }}/{{ mor_version }}/release_extract.lock -c '


With this new approach, use of filesystem locks is not needed anymore.

Have you considered scenarios where two jobs run concurrently, both extracting to a temporary location and writing to mor_cache_dir? How do you prevent conflicts in such cases? Implementing a mechanism like a lock might help avoid these issues.

That's a fair point. I was accepting the risk of such scenarios, given that moving files around the filesystem is much faster than running the operations directly on mor_cache_dir.

But at this point, it's true that we could just limit the protected zone to the task where we copy the files to the cache directory once they have been processed in the temporary directory.

Here comes another thought. The way this implementation works, the problematic task would be the one where we run the copy module to move the files from the temporary directory to the cache directory. Alternatively, or in combination with the lock, we can add the parameter "force: false" to the module call, so Ansible won't replace a file that already exists, even if the contents are different.
The question here is whether it'd be safe to assume the artifact won't change between jobs deploying the same OCP release.
My guess is the files won't change if the jobs are running concurrently, but would those artifacts change between jobs running with, say, days of difference?

Unfortunately, I haven't found a way of implementing locks that would allow us to run Ansible tasks in the locked zone.

In other words, when using locks the lock code must be part of the same shell script.
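
As a rough sketch of what limiting the locked zone to the final copy could look like, the copy would have to be done by a shell command wrapped in `flock`, the same way the removed extraction task used it. The lock file name `release_copy.lock` and the use of `cp -a` are assumptions for illustration only, and the target directory is assumed to exist already so the lock file can be created in it:

```yaml
# Sketch only: replaces the copy module with a shell command so the
# copy itself runs inside the flock-protected zone.
- name: Copy artifacts to release directory under an exclusive lock
  ansible.builtin.shell: >
    flock -x "{{ mor_cache_dir }}/{{ mor_version }}/release_copy.lock" -c '
    cp -a "{{ _mor_tmp.path }}/." "{{ mor_cache_dir }}/{{ mor_version }}/"
    '
  changed_when: true  # the command cannot report whether files actually changed
```

The trade-off is that this gives up the copy module's own features (such as "force: false" and change reporting) in exchange for the lock.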

nsilla · 2025-04-01T10:09:16Z

roles/mirror_ocp_release/tasks/artifacts.yml

        {{ mor_oc }} adm release extract
        --registry-config {{ mor_auths_file }}
        --command={{ mor_installer }}
        --from {{ mor_pull_url }}
-        --to "{{ mor_cache_dir }}/{{ mor_version }}";
+        --to "{{ _mor_tmp.path }}";


Artifacts are first extracted into the temporary directory for the job.

nsilla · 2025-04-01T10:10:32Z

roles/mirror_ocp_release/tasks/fetch.yml

-    get_checksum: false
-  register: target
-  when:
-    - not mor_force


Since we're extracting the artifacts into a temporary directory, the file won't exist in advance.

softwarefactory-project-zuul · 2025-04-01T10:10:55Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/d7a4402f81474e608fb4f42729bae5af

✔️ dci-rpm-build-el8 SUCCESS in 2m 43s
✔️ dci-rpm-build-el9 SUCCESS in 2m 17s

nsilla · 2025-04-01T10:12:24Z

roles/mirror_ocp_release/tasks/main.yml

+- name: Copy artifacts to release directory
+  ansible.builtin.copy:
+    src: "{{ _mor_tmp.path }}/"
+    dest: "{{ mor_cache_dir }}/{{ mor_version }}"


At the end of the job execution the artifacts are copied to the job's target directory. For this, we use a regular task at the end of the role instead of handlers, so we don't wait for the end of the play to copy the artifacts.

dcibot · 2025-04-01T10:15:10Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:15:11Z

from change #626:

no check (not a code change)

nsilla · 2025-04-01T10:18:14Z

roles/mirror_ocp_release/tasks/main.yml

+    state: directory
+    prefix: mor-
+  register: _mor_tmp
+  notify: "Remove temporary directory"


We use a handler to remove the temporary directory so we make sure it's run at the end of the play.

The goal is to have it running on successful and failed jobs. For this to work, the play calling this role must activate the flag "force_handlers: true", otherwise the handler will only be run on success.

We chose this approach over a block with an "always" section, so we don't need to add extra tasks to track the failed step properly.
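
For reference, a minimal sketch of how a calling play and the handler might fit together. The play name, host group, role reference, handler file location, and handler body are assumptions for illustration; only the handler name and the `_mor_tmp` variable come from this change:

```yaml
---
# Calling playbook (hypothetical names)
- name: Mirror an OCP release
  hosts: jumphost
  force_handlers: true  # run notified handlers even when a later task fails
  roles:
    - role: mirror_ocp_release

---
# roles/mirror_ocp_release/handlers/main.yml (assumed location and body)
- name: "Remove temporary directory"
  ansible.builtin.file:
    path: "{{ _mor_tmp.path }}"
    state: absent
```

Without `force_handlers: true` in the play, the temporary directory would be left behind whenever a task fails after the directory has been created.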

softwarefactory-project-zuul · 2025-04-01T10:18:17Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/858b7680c5534cb8a37bcbe10ae686da

✔️ dci-rpm-build-el8 SUCCESS in 2m 53s
✔️ dci-rpm-build-el9 SUCCESS in 2m 32s

dcibot · 2025-04-01T10:21:08Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:21:10Z

from change #626:

no check (not a code change)

nsilla · 2025-04-01T10:22:35Z

roles/mirror_ocp_release/tasks/fetch.yml

@@ -35,4 +27,5 @@
  ansible.builtin.command: /usr/sbin/restorecon -R "{{ mor_dir }}/{{ mor_uri | basename }}"
  become: true
  when: _mor_selinux.rc == 0
+  # we may need to run this task over the target directory rather than mor_dir (= _mor_tmp.path)


In the original scenario, the artifacts are extracted into an httpd-served directory, so restoring the contexts is needed for the files to be properly served. Restoring the contexts on the temporary directory may not have any effect.

Actually, the next tasks in the artifacts.yml file after including fetch.yml set the new context container_file_t on the extracted artifacts. I assume this is done there so that it applies to both OCP versions greater than or equal to 4.8 and versions lower than 4.8. But that means the tasks in fetch.yml are not needed.

softwarefactory-project-zuul · 2025-04-01T10:24:04Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/11ebf1837b13479eaaa373ac7c7939f7

✔️ dci-rpm-build-el8 SUCCESS in 2m 48s
✔️ dci-rpm-build-el9 SUCCESS in 2m 07s

dcibot · 2025-04-01T10:29:36Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:29:36Z

from change #626:

no check (not a code change)

nsilla · 2025-04-01T10:32:17Z

roles/mirror_ocp_release/tasks/fetch.yml

- name: "Apply new SELinux file context to file"
-  ansible.builtin.command: /usr/sbin/restorecon -R "{{ mor_dir }}/{{ mor_uri | basename }}"
-  become: true
-  when: _mor_selinux.rc == 0


Restoring the SELinux context does not make sense when extracting the artifacts into a temporary directory.
Also, the first tasks in artifacts.yml after including fetch.yml override the context and set it to container_file_t, which should be valid even after moving the artifacts to the target directory served from the cache container.
A different discussion is whether these tasks should be run before or after copying the artifacts to the target directory.

We have to restore this block of code, since fetch.yml is also included from images.yml to pull the disk image directly into the cache store (version directory ignored), so that it's served directly by the cache container.

softwarefactory-project-zuul · 2025-04-01T10:32:42Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/6facf0e9a4d343e6a136352ee7118466

✔️ dci-rpm-build-el8 SUCCESS in 2m 46s
✔️ dci-rpm-build-el9 SUCCESS in 2m 31s

dcibot · 2025-04-01T10:35:32Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:35:32Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:36:43Z

from change #626:

no check (not a code change)

dcibot · 2025-04-01T10:36:43Z

from change #626:

no check (not a code change)

softwarefactory-project-zuul · 2025-04-01T10:39:46Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/66b208cf4a9e4d4faa8e883b851c661b

✔️ dci-rpm-build-el8 SUCCESS in 2m 49s
✔️ dci-rpm-build-el9 SUCCESS in 2m 30s

softwarefactory-project-zuul · 2025-04-03T13:47:24Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/99e6a2cba3c241f9ba74e35a011caaf0

✔️ dci-rpm-build-el8 SUCCESS in 2m 45s
✔️ dci-rpm-build-el9 SUCCESS in 2m 27s

dcibot · 2025-04-03T14:26:31Z

from change #626:

SUCCESS https://www.distributed-ci.io/jobs/25b4c432-b217-4920-9cea-c0f10bafbc2f/jobStates

betoredhat · 2025-04-03T20:34:26Z

roles/mirror_ocp_release/tasks/artifacts.yml

+    - name: Copy artifacts to release directory
+      ansible.builtin.copy:
+        src: "{{ _mor_tmp.path }}/"
+        dest: "{{ mor_cache_dir }}/{{ mor_version }}/"


Please apply the SELinux context to the file in the release directory.
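
For illustration, one way to do this could be a task like the following, placed right after the copy. This is only a sketch: it assumes container_file_t (the type mentioned earlier in this thread) is the desired context for the release directory, and that privilege escalation is needed as in the existing restorecon task.

```yaml
# Sketch only: the task name and placement are assumptions.
- name: Set SELinux context on the release directory
  ansible.builtin.file:
    path: "{{ mor_cache_dir }}/{{ mor_version }}"
    state: directory
    setype: container_file_t
    recurse: true
  become: true
```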

betoredhat · 2025-04-03T20:59:01Z

Hey @nsilla, I have some concerns that this change might undermine the caching mechanism, leading to unnecessary downloads of releases/ISOs on every job. I understand the complexity here, and I also recognize that some installers already perform similar tasks during installation.

Just my 2 cents. Let's hear what the rest of the team thinks.

dcibot · 2025-04-04T08:37:19Z

from change #626:

FAILURE https://www.distributed-ci.io/jobs/60865157-72a7-4f18-bdea-ff289da78262/jobStates

softwarefactory-project-zuul · 2025-04-04T08:37:57Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/41b5c0f55ae3428d804f8522c3065593

✔️ dci-rpm-build-el8 SUCCESS in 2m 50s
✔️ dci-rpm-build-el9 SUCCESS in 2m 23s

dcibot · 2025-04-07T10:16:43Z

from change #626:

FAILURE https://www.distributed-ci.io/jobs/1fb33319-e462-4b8c-9c79-1ff2c17087c8/jobStates

softwarefactory-project-zuul · 2025-04-07T10:17:50Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/a732517c060c4335b427123ae198ce7e

✔️ dci-rpm-build-el8 SUCCESS in 2m 51s
✔️ dci-rpm-build-el9 SUCCESS in 2m 13s

nsilla · 2025-04-07T10:24:33Z

roles/mirror_ocp_release/tasks/artifacts.yml

+        src: "{{ _mor_tmp.path }}/"
+        dest: "{{ mor_cache_dir }}/{{ mor_version }}/"
+        mode: preserve
+        force: false


By setting the "force" parameter to "false", we prevent the module from leaving a file in an inconsistent state if the file already exists.
This is useful, for instance, to prevent binary execution errors if a modification of the binary file is detected during execution.
There are some concerns regarding this approach, though:

if force is set to true, existing files are only replaced if changed, which should not happen between files belonging to the same release number.

since the Ansible copy module uses atomic moves, the target file path should not suffer hash changes during the copy process, so it shouldn't be possible for Ansible to mark a file as replaceable just because its content is still in an unstable state.

thus, the force: false option is only an extra precaution we take.

this option would prevent files from being updated if they are modified, even within the same OCP release number.

softwarefactory-project-zuul · 2025-04-07T13:24:41Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/31cee11e07a64d2da0ef4a8ea9fdd606

✔️ dci-rpm-build-el8 SUCCESS in 2m 49s
✔️ dci-rpm-build-el9 SUCCESS in 2m 28s

dcibot · 2025-04-07T13:25:16Z

from change #626:

ERROR https://www.distributed-ci.io/jobs/fb404c29-e044-4da2-8539-95b1c7d72332/jobStates

softwarefactory-project-zuul · 2025-04-07T13:30:05Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/67fb040f52164f3f9fdd4a30bcefb45a

✔️ dci-rpm-build-el8 SUCCESS in 2m 46s
✔️ dci-rpm-build-el9 SUCCESS in 2m 19s

dcibot · 2025-04-07T13:30:41Z

from change #626:

ERROR https://www.distributed-ci.io/jobs/09629558-5d6a-4b62-936b-f0958fc84b61/jobStates

dcibot · 2025-04-07T13:33:31Z

from change #626:

FAILURE https://www.distributed-ci.io/jobs/d252607d-bd53-4c3b-8555-72dffeb3e0e2/jobStates

softwarefactory-project-zuul · 2025-04-07T13:34:52Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/3b48532fc1e24f2eac65d26564ff7657

✔️ dci-rpm-build-el8 SUCCESS in 2m 48s
✔️ dci-rpm-build-el9 SUCCESS in 2m 27s

softwarefactory-project-zuul · 2025-04-07T13:38:28Z

Build succeeded.
https://softwarefactory-project.io/zuul/t/local/buildset/162543ad44d44b1eb6eb5aa761fa4411

✔️ dci-rpm-build-el8 SUCCESS in 2m 46s
✔️ dci-rpm-build-el9 SUCCESS in 2m 30s

dcibot · 2025-04-07T14:31:56Z

from change #626:

SUCCESS https://www.distributed-ci.io/jobs/95df9c8c-513b-446b-b864-23374790bc2a/jobStates

nsilla force-pushed the concurrent_release_mirror branch from 7e5d9f5 to e7675e7 Compare April 1, 2025 10:15

nsilla force-pushed the concurrent_release_mirror branch from e7675e7 to 903f001 Compare April 1, 2025 10:21

nsilla force-pushed the concurrent_release_mirror branch from 903f001 to 11767d3 Compare April 1, 2025 10:29

nsilla force-pushed the concurrent_release_mirror branch from 11767d3 to 4995c13 Compare April 1, 2025 10:35

nsilla force-pushed the concurrent_release_mirror branch from 4995c13 to 13e17c5 Compare April 1, 2025 10:36

nsilla force-pushed the concurrent_release_mirror branch from 13e17c5 to bd32593 Compare April 1, 2025 13:24

nsilla force-pushed the concurrent_release_mirror branch from 5c7fcac to c53b9cc Compare April 3, 2025 13:43

betoredhat reviewed Apr 3, 2025

nsilla force-pushed the concurrent_release_mirror branch 2 times, most recently from 19e8084 to b3dc5c9 Compare April 4, 2025 08:34

nsilla requested a review from betoredhat April 4, 2025 08:35

nsilla marked this pull request as ready for review April 4, 2025 08:35

nsilla requested a review from a team as a code owner April 4, 2025 08:35

nsilla force-pushed the concurrent_release_mirror branch from b3dc5c9 to 89d08c8 Compare April 7, 2025 10:14

nsilla force-pushed the concurrent_release_mirror branch from 89d08c8 to 355027e Compare April 7, 2025 13:21

nsilla force-pushed the concurrent_release_mirror branch from 355027e to ceeec63 Compare April 7, 2025 13:27

nsilla force-pushed the concurrent_release_mirror branch from ceeec63 to 33a30f4 Compare April 7, 2025 13:31

mirror_ocp_release: fixes for concurrent jobs

eb3c103

nsilla force-pushed the concurrent_release_mirror branch from 33a30f4 to eb3c103 Compare April 7, 2025 13:35

nsilla requested a review from tonyskapunk April 8, 2025 08:57
