Skip to content

mirror_ocp_release: fixes for concurrent jobs #626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nsilla
Copy link
Contributor

@nsilla nsilla commented Apr 1, 2025

SUMMARY

Fixes CILAB-2034: when multiple jobs run to mirror the same version on the same host, the status of the mirroring directory may be unstable depending on the order the tasks are run on each job.

This change implements temporary directories per role call so the artifacts are only moved to the final location at the end of the execution.

ISSUE TYPE
  • Bug
Tests

TestBos2Sno: sno sno:components=ocp=4.18.5,

@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

1 similar comment
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

@@ -11,20 +11,26 @@
when:
- mor_force or not _mor_target.stat.exists
block:
- name: "Extract installer and metadata from release image"
ansible.builtin.shell: >
flock -x {{ mor_cache_dir }}/{{ mor_version }}/release_extract.lock -c '
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this new approach, use of filesystem locks is not needed anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered scenarios where two jobs run concurrently, both extracting to a temporary location and writing to mor_cache_dir? How do you prevent conflicts in such cases? Implementing a mechanism like a lock might help avoid these issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a fair point. I was accepting the risk of facing such scenarios provided moving files around the filesystem is must faster than running the operations directly on the mor_cache_dir.

But at this point it's right what we could just limit the protected zone to the task were we copy the files to the cache directory once they have been processed in the temporary directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here comes another thought. The way this implementation works, the problematic task would be when we run the copy module to move the files from the temporary directory to the cache directory. Alternatively or in combination of the lock usage, we can add the parameter "force: false" to the module call, so ansible won't replace a file that already exists, even if the contents are different.
The question here is whether it'd be safe to assume the artifact won't change between jobs deploying the same OCP release.
My guess is the files won't change if the jobs are running concurrently, but would those artifacts change between jobs running with, say, days of difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I don't find a way of implementing locks that would allows us to run ansible tasks in the locked zone.

In other words, when using locks the lock code must be part of the same shell script.

{{ mor_oc }} adm release extract
--registry-config {{ mor_auths_file }}
--command={{ mor_installer }}
--from {{ mor_pull_url }}
--to "{{ mor_cache_dir }}/{{ mor_version }}";
--to "{{ _mor_tmp.path }}";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Artifacts are first extracted into the temporary directory for the job.

get_checksum: false
register: target
when:
- not mor_force
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're extracting the artifacts on a temporary directory, the file won't exist in advance.

Copy link

- name: Copy artifacts to release directory
ansible.builtin.copy:
src: "{{ _mor_tmp.path }}/"
dest: "{{ mor_cache_dir }}/{{ mor_version }}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of the job execution the artifacts are copied to the job's target directory. For this, we use a regular task at the end of the role instead of handlers, so we don't wait for the end of the play to copy the artifacts.

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 7e5d9f5 to e7675e7 Compare April 1, 2025 10:15
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

1 similar comment
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

state: directory
prefix: mor-
register: _mor_tmp
notify: "Remove temporary directory"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use a handler to remove the temporary directory so we make sure it's run at the end of the play.

The goal is to have it running on successful and failed jobs. For this to work, the play calling this role must activate the flag "force_handlers: true", otherwise the handler will only be run on success.

We chose this approach to a block with an "always" section, so we don't need to add extra tasks to track the failed step properly.

Copy link

@nsilla nsilla force-pushed the concurrent_release_mirror branch from e7675e7 to 903f001 Compare April 1, 2025 10:21
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

1 similar comment
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

@@ -35,4 +27,5 @@
ansible.builtin.command: /usr/sbin/restorecon -R "{{ mor_dir }}/{{ mor_uri | basename }}"
become: true
when: _mor_selinux.rc == 0
# we may need to run this task over the target directory rather than mor_dir (= _mor_tmp.path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original scenario, the artifacts are extracted into an httpd served directory, so restoring the contexts is needed for the files to be properly served. Restoring the contexts on the temporary directory may not have any effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the next tasks in the artifacts.yml file after including the fetch.yml set new the new context container_file_t on the extracted artifacts. This I assume is done from here so it'll be applied to both, OCP versions greater or equal than 4.8 and lower than 4.8. But that means the tasks in fetch.yml are not needed.

Copy link

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 903f001 to 11767d3 Compare April 1, 2025 10:29
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

1 similar comment
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

- name: "Apply new SELinux file context to file"
ansible.builtin.command: /usr/sbin/restorecon -R "{{ mor_dir }}/{{ mor_uri | basename }}"
become: true
when: _mor_selinux.rc == 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restoring the selinux context does not make sense when extracting the artifacts on a temporary directory.
Also, the first tasks in artifacts.yml after including fetch.yml override the context and set it to container_file_t, which should be valid even after moving the artifacts to the target directory served from the cache container.
A different discussion is whether these tasks should be run before or after copying the artifacts to the target directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to restore this block of code, since fetch.yml is also included from images.yml to pull the disk image directly into the cache store (version directory ignored) so then it's directly served by the cache container.

Copy link

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 11767d3 to 4995c13 Compare April 1, 2025 10:35
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

1 similar comment
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 4995c13 to 13e17c5 Compare April 1, 2025 10:36
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

1 similar comment
@dcibot
Copy link
Collaborator

dcibot commented Apr 1, 2025

from change #626:

  • no check (not a code change)

Copy link

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 13e17c5 to bd32593 Compare April 1, 2025 13:24
@nsilla nsilla force-pushed the concurrent_release_mirror branch from 5c7fcac to c53b9cc Compare April 3, 2025 13:43
Copy link

@dcibot
Copy link
Collaborator

dcibot commented Apr 3, 2025

- name: Copy artifacts to release directory
ansible.builtin.copy:
src: "{{ _mor_tmp.path }}/"
dest: "{{ mor_cache_dir }}/{{ mor_version }}/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please apply the SELinux context to the file in the release directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@betoredhat
Copy link
Contributor

Hey @nsilla, I have some concerns that this change might undermine the caching mechanism, leading to unnecessary downloads of releases/ISOs on every job. I understand the complexity here, and I also recognize that some installers already perform similar tasks during installation.

Just my 2 cents- let's hear what the rest of the team thinks.

@nsilla nsilla force-pushed the concurrent_release_mirror branch 2 times, most recently from 19e8084 to b3dc5c9 Compare April 4, 2025 08:34
@nsilla nsilla requested a review from betoredhat April 4, 2025 08:35
@nsilla nsilla marked this pull request as ready for review April 4, 2025 08:35
@nsilla nsilla requested a review from a team as a code owner April 4, 2025 08:35
@dcibot
Copy link
Collaborator

dcibot commented Apr 4, 2025

Copy link

@nsilla nsilla force-pushed the concurrent_release_mirror branch from b3dc5c9 to 89d08c8 Compare April 7, 2025 10:14
@dcibot
Copy link
Collaborator

dcibot commented Apr 7, 2025

Copy link

src: "{{ _mor_tmp.path }}/"
dest: "{{ mor_cache_dir }}/{{ mor_version }}/"
mode: preserve
force: false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By setting the "force" parameter to "false" we prevent the module to get a file into an inconsistent state if the file already exists.
This is interesting, for instance, to prevent binary execution exceptions if modification of the binary file are detected during the execution.
There are some concerns regarding this approach, though:

  • if force is set to true, existing files are only replaced if changed, which should not happen between files belonging in the same release number.
  • since Ansible copy module uses atomic moves, the target file path should not suffer hash code changes during the copy process, so it shouldn't be possible for Ansible to mark a file as replaceable just because it's content is in an unstable state yet.
  • thus, the force: false option is only an extra precaution we take.
  • this feature would prevent files to be updated if they suffer any modification even within the same ocp release number.

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 89d08c8 to 355027e Compare April 7, 2025 13:21
Copy link

@dcibot
Copy link
Collaborator

dcibot commented Apr 7, 2025

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 355027e to ceeec63 Compare April 7, 2025 13:27
Copy link

@dcibot
Copy link
Collaborator

dcibot commented Apr 7, 2025

@nsilla nsilla force-pushed the concurrent_release_mirror branch from ceeec63 to 33a30f4 Compare April 7, 2025 13:31
@dcibot
Copy link
Collaborator

dcibot commented Apr 7, 2025

Copy link

@nsilla nsilla force-pushed the concurrent_release_mirror branch from 33a30f4 to eb3c103 Compare April 7, 2025 13:35
Copy link

@dcibot
Copy link
Collaborator

dcibot commented Apr 7, 2025

@nsilla nsilla requested a review from tonyskapunk April 8, 2025 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants