Moved all singularity-related download functions into a new file #3509
base: dev
Conversation
Force-pushed 5961e7b to 952bc92
Force-pushed 952bc92 to a9b2b00
That looks already very promising. Thank you very much! I need to take another look tomorrow...
else:
    raise FileNotFoundError("Singularity cache is required but no '$NXF_SINGULARITY_CACHEDIR' set!")

assert self.outdir
Sorry, but I dislike the `assert` here. `self.outdir` should be set, as it is already handled by the `__init__` function of `DownloadWorkflow`. But if it were `None`, you would be producing an `AssertionError` that is not handled?
The `assert` is just there to please pylance/mypy, or whatever tool I've got running in my IDE.
We do it like this in other places in the code, with a comment, which makes the purpose clearer:

- assert self.outdir
+ assert self.outdir is not None  # mypy
My IDE was saying that `outdir` was `str | None | Unknown`. I hope further refactoring of the class will lead to this function being in a scope where it's clear that `outdir` is defined and a string.
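The pattern discussed above can be sketched as follows: `assert x is not None` narrows an `Optional` type for static checkers like mypy, while still raising an (unhandled) `AssertionError` at runtime if the invariant is ever broken. The class and method names here are illustrative, not the actual PR code:

```python
from typing import Optional

class Downloader:
    """Hypothetical sketch of the attribute pattern under discussion."""

    def __init__(self, outdir: Optional[str] = None) -> None:
        self.outdir = outdir  # "str | None" as far as mypy is concerned

    def output_path(self, filename: str) -> str:
        # Narrows self.outdir from "str | None" to "str" for the type checker;
        # at runtime it raises AssertionError if outdir was never set.
        assert self.outdir is not None  # mypy
        return f"{self.outdir}/{filename}"

print(Downloader("results").output_path("image.sif"))  # results/image.sif
```

This is exactly the trade-off raised above: the checker is satisfied, but a `None` value surfaces as an `AssertionError` rather than a handled error.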
# Check that the directories exist
if not os.path.isdir(out_path_dir):
    log.debug(f"Output directory not found, creating: {out_path_dir}")
Minor remark, but "Output directory" is generic, so it is unclear which directory exactly was missing. If you state in the `log.debug()` message that the Singularity images directory was missing, it is probably more meaningful.
""" | ||
|
||
def __init__( | ||
self, container_library: Iterable[str], registry_set: Iterable[str], progress: rich.progress.Progress |
Ideally, the `__init__()` function should have a brief summary of the arguments, maybe something along these lines:

container_library (List[str]): The container libraries (registries) to pull containers from. Specified manually by the user.
registry_set (Set[str]): A set of container libraries for creating symlinks. Consists of all container_libraries, registries defined in the config of a workflow and a core set of hardcoded registries occurring in various modules.

Might also be an opportunity to replace the generic name `registry_set` with something outlining its purpose? (Blaming myself here, actually, for picking that in the first place...)
I want to work more on this class and will definitely review the naming of the parameters. That said, encouraged by your comment, I've moved a couple of methods to functions and renamed their `registry_set` parameter to `registries` (I realised that `_set` was probably an indicator that it was a `set` object?)
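Putting the two suggestions together, an `__init__` with the proposed docstring and the renamed `registries` parameter could look roughly like this. The class name and attribute handling are taken from the diff above, but the body is a sketch, not the final implementation (the `rich.progress.Progress` parameter is omitted to keep it self-contained):

```python
from typing import Iterable, List, Set

class SingularityFetcher:
    """Illustrative sketch only; names are taken from the discussion above."""

    def __init__(self, container_library: Iterable[str], registries: Iterable[str]) -> None:
        """
        Args:
            container_library (List[str]): The container libraries (registries) to pull
                containers from. Specified manually by the user.
            registries (Set[str]): Registries used for creating symlinks. Consists of all
                container libraries, registries defined in the config of a workflow, and
                a core set of hardcoded registries occurring in various modules.
        """
        self.container_library: List[str] = list(container_library)
        self.registries: Set[str] = set(registries)

fetcher = SingularityFetcher(["quay.io"], ["quay.io", "depot.galaxyproject.org"])
print(sorted(fetcher.registries))
```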
self.kill_with_fire = False

def get_container_filename(self, container: str) -> str:
    """Check Singularity cache for image, copy to destination folder if found.
The help text still needs to be updated. But I like your approach of breaking it up into smaller, logically separate functions!
I've updated the text. Hope that's clearer.
""" | ||
|
||
|
||
@contextlib.contextmanager |
I do not understand this part of the code, but it looks fancy! Great that you found a way to deal with interrupted downloads and prevent persistent faulty images.
I've added some comments and made the code more explicit. I hope it's clearer.
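For reference, the technique being discussed (a context manager that stages a download into a temporary file and only moves it into place on success, so an interrupted download never leaves a partial image behind) could look roughly like this. `intermediate_file` is named in the PR description, but this body is a guess at the approach, not the actual implementation:

```python
import contextlib
import os
import tempfile
from pathlib import Path

@contextlib.contextmanager
def intermediate_file(final_path: Path):
    """Yield a temporary path; move it to final_path only if the block succeeds.

    If the block raises (e.g. a KeyboardInterrupt mid-download), the partial
    file is deleted instead of polluting the cache/output directory.
    """
    tmp_fd, tmp_name = tempfile.mkstemp(dir=final_path.parent)
    os.close(tmp_fd)
    tmp_path = Path(tmp_name)
    try:
        yield tmp_path
        tmp_path.replace(final_path)  # atomic rename on the same filesystem
    except BaseException:
        tmp_path.unlink(missing_ok=True)
        raise

# Usage sketch: the file only appears at its final path after a clean exit.
target = Path(tempfile.mkdtemp()) / "image.sif"
with intermediate_file(target) as tmp:
    tmp.write_text("fake image data")
print(target.exists())  # True
```

Catching `BaseException` (not just `Exception`) is what makes this robust against `KeyboardInterrupt`.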
# download into the cache
fetch_list.append((container, cache_path))
# and copy from the cache to the output
containers_copy.append((container, cache_path, output_path))
Why a copy to output_path? But it is already late, so it is probably me...
This resonates with a comment you made in the previous PR. I need to double-check all those copy operations, make sure that the "amend" / "copy" settings are respected, and avoid copying files if not needed.
…environment variables
…have to be aware of the TaskID
Thanks! This is looking really promising already!
- self.progress.update(task, description="Downloading singularity images")
- self.download_images(containers_download, task, parallel_downloads=4)
+ self.progress.update_main_task(description="Downloading singularity images")
+ self.download_images(containers_download, parallel_downloads=4)
`parallel_downloads` is a CLI argument, currently stored in `self.parallel_downloads` of a `DownloadWorkflow` instance. I have not heard of anyone ever using it (except for debugging), but it is probably reasonable to keep it customizable? If not, we should remove the CLI argument.
I didn't know there was a CLI argument! I'll connect them back.
…uarantee the task gets removed
- Replaced the generator with a callback + a list return
- Fixed the handling of exceptions so that KeyboardInterrupt is always captured
- More tests
Attempt to use a local installation of singularity to pull the image.
# Find out what the library directory is
library_dir = os.environ.get("NXF_SINGULARITY_LIBRARYDIR")
if library_dir and not os.path.isdir(library_dir):
We are trying to use the library `pathlib` for files instead of `os`. Please try to use `pathlib` everywhere else, thanks!

- if library_dir and not os.path.isdir(library_dir):
+ if library_dir and not Path(library_dir).is_dir():
cache_dir = os.environ.get("NXF_SINGULARITY_CACHEDIR")
if self.container_cache_utilisation in ["amend", "copy"]:
    if cache_dir:
        if not os.path.isdir(cache_dir):
- if not os.path.isdir(cache_dir):
+ if not Path(cache_dir).is_dir():
for container, output_path in containers_pull:
    # it is possible to try multiple registries / mirrors if multiple were specified.
    # Iteration happens over a copy of self.container_library[:], as I want to be able to remove failing registries for subsequent images.
    for library in self.container_library[:]:
Makes it more readable:

- for library in self.container_library[:]:
+ for library in self.container_library.copy():
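The reason a copy is needed at all, whichever spelling is used: removing items from a list while iterating over that same list silently skips elements. A minimal illustration (the registry names are made up):

```python
libraries = ["quay.io", "broken.example.org", "ghcr.io"]

# Iterating over a copy makes it safe to drop a failing registry for
# subsequent images without disturbing the loop currently in flight.
for library in libraries.copy():
    if library == "broken.example.org":  # pretend this pull failed
        libraries.remove(library)

print(libraries)  # ['quay.io', 'ghcr.io']
```

`libraries[:]` and `libraries.copy()` produce the same shallow copy; the method call just states the intent explicitly.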
log.debug(f"Singularity command: {' '.join(singularity_command)}")

# Run the singularity pull command
with subprocess.Popen(
We have the function `run_cmd()` in `nf_core/utils.py`; I think it would be useful to use it to reduce the code here.
If you have time, it would be great to change indeed. But since it is our current codebase, I do not want to be too demanding either.
There is a big difference with `nf_core.utils.run_cmd`. Here we stream the output to update the progress bar. The motivation is that singularity pulls can take a while to run, and it's better for users to get visual feedback that something is really happening. `nf_core.utils.run_cmd` runs the command and only returns the output/error when the command ends. I'm not sure that's the right thing to do here. I would rather propose to introduce `nf_core.utils.stream_cmd`.
That sounds convincing to me, although I have admittedly not looked into the specific implementation of `nf_core.utils.run_cmd`.
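A `stream_cmd` helper along the lines proposed above could be sketched like this. The name comes from the comment; the signature and behaviour here are assumptions, not the final API:

```python
import subprocess
import sys
from typing import Callable, Iterable, List

def stream_cmd(command: Iterable[str], on_line: Callable[[str], None]) -> int:
    """Run a command, invoking on_line for each line of output as it arrives.

    Unlike a run-and-wait helper, this lets the caller update a progress bar
    while a long-running pull is still in flight. Returns the exit code.
    """
    with subprocess.Popen(
        list(command), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        assert proc.stdout is not None  # mypy: stdout=PIPE guarantees a stream
        for line in proc.stdout:
            on_line(line.rstrip("\n"))
    return proc.returncode

# Usage sketch: collect the streamed lines (a real caller would update rich progress).
lines: List[str] = []
code = stream_cmd([sys.executable, "-c", "print('pulling...'); print('done')"], lines.append)
print(code, lines)
```

Merging stderr into stdout keeps the single line-by-line stream simple; whether that is the right choice for the real helper is an open design question.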
if not (shutil.which("singularity") or shutil.which("apptainer")):
    raise OSError("Singularity/Apptainer is needed to pull images, but it is not installed or not in $PATH")
I think this is not needed, as we already have the same check in `pull_images()`.
In all fairness, this check is simply our current codebase. Coincidentally, I want to get rid of those checks entirely, as they prevent running the tool on macOS with recent pipelines that use Seqera Containers, or at least disable the check with `-f`/`--force`. Instead, the tool should just output a list of commands that it could not run because `singularity`/`apptainer` was missing.
Can you explain your use case a bit more, @MatthiasZepper? If someone runs `nf-core pipelines download` with `--container-system singularity`, it seems fair to complain if singularity/apptainer is not available.
Sure. I had in mind the case where you run `nf-core download` on a different computer than the one you will later use for executing the workflows. Specifically, I was thinking of macOS, which does not support `singularity`/`apptainer`.

We have an HPC that runs Linux, but we download the pipelines with our developers' MacBooks. If we need to pull a Docker container and convert it, or have a mandatory `oras://` download, I need to start my VM, but most nf-core pipelines actually have `http://` downloads for all required modules, so it is possible to obtain a perfectly functional download without `singularity`/`apptainer`. Due to this check, `nf-core download` will nonetheless not run, which is why I have symlinked `~/bin/apptainer -> /usr/bin/true` on my macOS machine.

For all modules that use Seqera Containers, it is no longer possible to skip the Docker container even if an `http://` download is available. So for Seqera Containers modules, you will always download both: the native Singularity container, plus the converted Docker image. This makes it even more attractive to dummy-replace apptainer while downloading, to save some bandwidth and find out which containers are actually missing.
OK. In that case, I would propose adding a command-line option to allow missing containers, whether as a result of a failure of the copy/download/pull command or because the command can't be run (i.e. your case). It would essentially turn all the errors into warnings. In your case, it'll still do all the downloads but skip the pulls and report the list of missing containers.
Sounds good to me, although applying `--force` instead of a dedicated CLI argument would probably suffice as well. Either way, this is out of the scope of this PR, which is, I believe, already comprehensive enough and shouldn't get bogged down by too many small additions.
As I mentioned in #3506 (comment), the directory name should be `download`.
This was my decision, to avoid import errors for now. Running `git mv` later is pretty quick, I guess.
successful_downloads.append(future_downloads[future])

with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
    # The entire block needs to be monitored for KeyboardInterrupt so that ntermediate files
- # The entire block needs to be monitored for KeyboardInterrupt so that ntermediate files
+ # The entire block needs to be monitored for KeyboardInterrupt so that intermediate files
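The overall shape of that executor block (submit downloads to a pool, collect successes, and monitor the whole thing for `KeyboardInterrupt` so pending work can be cancelled) can be sketched as below. The `fetch` stand-in and `download_all` wrapper are illustrative names, not the PR's actual functions:

```python
import concurrent.futures
from typing import Iterable, List

def fetch(url: str) -> str:
    # Stand-in for a real download; returns the "downloaded" filename.
    return url.rsplit("/", 1)[-1]

def download_all(urls: Iterable[str], parallel_downloads: int = 4) -> List[str]:
    successful_downloads: List[str] = []
    future_downloads: dict = {}
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
            # Map each future back to its URL so failures can be reported.
            future_downloads = {pool.submit(fetch, url): url for url in urls}
            for future in concurrent.futures.as_completed(future_downloads):
                successful_downloads.append(future.result())
    except KeyboardInterrupt:
        # Cancel anything not yet started; intermediate-file cleanup would
        # take care of any partially written images.
        for future in future_downloads:
            future.cancel()
        raise
    return successful_downloads

print(sorted(download_all(["https://example.org/a.sif", "https://example.org/b.sif"])))
```

Note that `as_completed` yields futures in completion order, so the result list is unordered; sort or re-map via `future_downloads` if order matters.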
Closes #3019
Supersedes #3163
Ongoing work part of the "Workflows to go" hackathon project. I've started moving all singularity functions into a new file.
Overview of the moves from `nf_core/pipelines/download.py`:
In this move, I wanted to have clean copy/download/pull_image functions that know nothing about a "cache" or a "library". They're just given a path they need to put their file into.
I was also able to decouple things and move code from methods to functions when access to class variables etc wasn't really needed. Or in the case of the file downloads, I've introduced a dedicated class where only the necessary state is tracked.
So far, we've got this functionality in `downloads.utils`:

- `intermediate_file` is a context manager for getting download/copy operations done in a temporary file that is deleted if there is a problem. This is to avoid polluting the cache/output directories with partial files (the question was raised in Container cache download #3163 (comment)) and I've rolled it out across all three source methods.
- `DownloadProgress` is an extension of `rich.progress`. I've added helper methods to deal with the "main task" and "sub tasks".
- `FileDownloader` can download files in parallel. I've reviewed the handling of errors and exceptions to make sure that no state is left behind.

and in `downloads.singularity`:

- `get_container_filename` returns the name a container will have on disk, taking into consideration a set of registry prefixes to ignore.
- `symlink_registries` creates symlinks for a container so that the same image file can be used across registries.

That's all completely covered in unit tests.
At this minute, I'm not 100% settled on this SingularityFetcher class. I want to look into decoupling it a bit more.
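As an illustration of the kind of name derivation described for `get_container_filename` (derive an on-disk name from a container URI while ignoring a set of registry prefixes), here is a hedged sketch; the actual naming rules in the PR may well differ:

```python
from typing import Iterable

def get_container_filename(container: str, registries: Iterable[str]) -> str:
    """Derive an on-disk filename for a container URI, stripping known registry
    prefixes so the same image gets the same name regardless of which registry
    it came from. Illustrative only; not the PR's actual rules."""
    name = container
    for scheme in ("https://", "http://", "oras://", "docker://"):
        if name.startswith(scheme):
            name = name[len(scheme):]
    for registry in registries:
        if name.startswith(registry + "/"):
            name = name[len(registry) + 1:]
    # Replace characters that are awkward in filenames.
    return name.replace("/", "-").replace(":", "-") + ".img"

print(get_container_filename("docker://quay.io/biocontainers/samtools:1.9", {"quay.io"}))
```

Stripping the registry prefix before building the filename is what allows `symlink_registries` to point several registry-specific names at the same image file.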
PR checklist

- `CHANGELOG.md` is updated
- `docs` is updated