
Moved all singularity-related download functions into a new file #3509

Draft · wants to merge 31 commits into base: dev from singularity_downloads
Conversation


@muffato (Member) commented Mar 25, 2025

Closes #3019
Supersedes #3163

Ongoing work, part of the "Workflows to go" hackathon project. I've started moving all the Singularity-related functions into a new file.

Overview of the moves from nf_core/pipelines/download.py:

| Object or function | New file | New name |
| --- | --- | --- |
| DownloadError | downloads/utils | (same name) |
| DownloadProgress | downloads/utils | (same name) |
| ContainerError | downloads/singularity | (same name) |
| symlink_singularity_images | downloads/singularity | symlink_registries |
| singularity_image_filenames | downloads/singularity | get_container_filename |
| singularity_copy_cache_image | downloads/singularity | SingularityFetcher.copy_image |
| singularity_download_image | downloads/utils | FileDownloader.download_file |
| singularity_pull_image | downloads/singularity | SingularityFetcher.pull_image |
| get_singularity_images (part of) | downloads/singularity | SingularityFetcher.fetch_containers, SingularityFetcher.pull_images, SingularityFetcher.download_images |
| get_singularity_images (part of) | downloads/utils | FileDownloader.download_files_in_parallel |

In this move, I wanted clean copy/download/pull_image functions that know nothing about a "cache" or a "library": they're just given a path to put their file into.
I was also able to decouple things and move code from methods to plain functions when access to class variables etc. wasn't really needed. In the case of the file downloads, I've introduced a dedicated class where only the necessary state is tracked.

So far, we've got this functionality in downloads.utils:

  • intermediate_file is a context manager that runs download/copy operations through a temporary file, which is deleted if there is a problem. This avoids polluting the cache/output directories with partial files (the question was raised in Container cache download #3163 (comment)) and I've rolled it out across all three source methods.
  • DownloadProgress is an extension of rich.progress.Progress. I've added helper methods to deal with the "main task" and "sub-tasks".
  • FileDownloader can download files in parallel. I've reviewed the handling of errors and exceptions to make sure that no state is left behind.
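For illustration, the intermediate_file behaviour could be sketched roughly like this (an illustrative sketch only, not the actual downloads/utils code; the `.partial` suffix and exact cleanup details are assumptions):

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Iterator


@contextlib.contextmanager
def intermediate_file(target: Path) -> Iterator[Path]:
    """Yield a temporary path next to `target`; move it into place on
    success, delete it if the block raises, so no partial files remain."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".partial")
    os.close(fd)  # the caller writes through the path, not the descriptor
    tmp_path = Path(tmp_name)
    try:
        yield tmp_path
        tmp_path.replace(target)  # atomic rename on the same filesystem
    except BaseException:  # includes KeyboardInterrupt
        tmp_path.unlink(missing_ok=True)  # drop the partial file
        raise
```

A caller would then write into the yielded path; only a fully written file ever appears under the target name.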

and in downloads.singularity:

  • get_container_filename returns the name a container will have on disk, taking into consideration a set of registry prefixes to ignore.
  • symlink_registries creates symlinks for a container so that the same image file can be used across registries.
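As an illustration of the get_container_filename contract, a function along these lines would derive the same on-disk name regardless of registry (a hypothetical sketch, not the nf-core implementation; the prefix-stripping and separator rules here are assumptions):

```python
import re


def get_container_filename(container: str, registries: list[str]) -> str:
    """Derive an on-disk filename for a container URI, ignoring known
    registry prefixes so the same image gets the same name everywhere."""
    # Drop any URI scheme
    name = re.sub(r"^(?:https?|oras|docker)://", "", container)
    # Strip the first matching registry prefix
    for registry in registries:
        prefix = registry.rstrip("/") + "/"
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    # Flatten path and tag separators into dashes
    name = name.replace("/", "-").replace(":", "-")
    return f"{name}.img"
```

With this, `quay.io/biocontainers/fastqc:0.11.9--0` and `docker.io/biocontainers/fastqc:0.11.9--0` map to the same filename, which is what makes the symlink_registries cross-registry symlinks possible.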

That's all completely covered in unit tests.

At this point, I'm not 100% settled on the SingularityFetcher class. I want to look into decoupling it a bit more.

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated


@muffato self-assigned this Mar 25, 2025
@muffato added the download label Mar 25, 2025
@muffato force-pushed the singularity_downloads branch from 5961e7b to 952bc92 on March 25, 2025 12:23
@muffato force-pushed the singularity_downloads branch from 952bc92 to a9b2b00 on March 25, 2025 17:09
@MatthiasZepper (Member) left a comment

That already looks very promising. Thank you very much! I need to take another look tomorrow...

else:
raise FileNotFoundError("Singularity cache is required but no '$NXF_SINGULARITY_CACHEDIR' set!")

assert self.outdir
Member

Sorry, but I dislike the assert here.

self.outdir should already be set, as it is handled by the __init__ function of DownloadWorkflow. But if it were None, you would be producing an AssertionError that is not handled?

Member Author

The assert is just there to please pylance/mypy, or whatever tool I've got running in my IDE.

Member

We do it like this in other places in the code, with a comment, which makes the purpose clearer:

Suggested change
assert self.outdir
assert self.outdir is not None # mypy

Member Author

My IDE was saying that outdir was str | None | Unknown. I hope further refactoring of the class will lead to this function being in a scope where it's clear that outdir is defined and a string.


# Check that the directories exist
if not os.path.isdir(out_path_dir):
log.debug(f"Output directory not found, creating: {out_path_dir}")
Member

Minor remark, but Output directory is generic, so it is unclear which exactly was missing. If you print in the log.debug() that the Singularity images directory was missing, it is probably more meaningful.

"""

def __init__(
self, container_library: Iterable[str], registry_set: Iterable[str], progress: rich.progress.Progress
Member

Ideally, the __init__() function should have a brief summary of the arguments, maybe something along these lines.

container_library (List[str]): The container libraries (registries) to pull containers from. Specified manually by the user.

registry_set (Set[str]): A set of container libraries for creating symlinks. Consists of all container_libraries, registries defined in the config of a workflow and a core set of hardcoded registries occurring in various modules.

Might also be an opportunity to replace the generic name registry_set with something outlining its purpose? (Blaming myself here actually for picking that in the first place...)
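A docstring along the suggested lines might look like this (a sketch only; the progress type is left loose here, and the attribute assignments are assumptions about the class, not the actual code):

```python
from collections.abc import Iterable
from typing import Any


class SingularityFetcher:
    """Fetches Singularity images by copying, downloading, or pulling them."""

    def __init__(
        self,
        container_library: Iterable[str],
        registry_set: Iterable[str],
        progress: Any,  # rich.progress.Progress in the real class
    ) -> None:
        """Set up a fetcher for a given set of registries.

        Args:
            container_library: The container libraries (registries) to pull
                containers from, as specified manually by the user.
            registry_set: Registries considered when creating symlinks: all
                of ``container_library``, plus registries defined in the
                workflow config, plus a core set of hardcoded registries.
            progress: Shared progress display used by all fetch operations.
        """
        self.container_library = list(container_library)
        self.registry_set = set(registry_set)
        self.progress = progress
```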

Member Author

I want to work more on this class and will definitely review the naming of the parameters.

That said, encouraged by your comment, I've moved a couple of methods to functions and renamed their registry_set parameter to registries (I realised that _set was probably meant to indicate it was a set object?)

self.kill_with_fire = False

def get_container_filename(self, container: str) -> str:
"""Check Singularity cache for image, copy to destination folder if found.
Member

The help text still needs to be updated. But I like your approach to break it up to smaller, logically separate functions!

Member Author

I've updated the text. Hope that's clearer.

"""


@contextlib.contextmanager
Member

I do not understand this part of the code, but it looks fancy! Great that you found a way to deal with interrupted downloads and prevent persistent faulty images.

Member Author

I've added some comments and made the code more explicit. I hope it's clearer now.

# download into the cache
fetch_list.append((container, cache_path))
# and copy from the cache to the output
containers_copy.append((container, cache_path, output_path))
Member

Why a copy to output_path? But it is already late, so it is probably me...

Member Author

This resonates with a comment you made in the previous PR. I need to double-check all those copy operations: make sure that "amend" / "copy" are respected, and avoid copying files when not needed.


codecov bot commented Mar 26, 2025

Codecov Report

Attention: Patch coverage is 63.17204% with 137 lines in your changes missing coverage. Please review.

Project coverage is 77.45%. Comparing base (33830f9) to head (ea4bc89).
Report is 25 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| nf_core/pipelines/downloads/singularity.py | 44.11% | 133 Missing ⚠️ |
| nf_core/pipelines/download.py | 90.90% | 2 Missing ⚠️ |
| nf_core/pipelines/downloads/utils.py | 98.21% | 2 Missing ⚠️ |

@MatthiasZepper (Member) left a comment

Thanks! This is looking really promising already!

self.progress.update(task, description="Downloading singularity images")
self.download_images(containers_download, task, parallel_downloads=4)
self.progress.update_main_task(description="Downloading singularity images")
self.download_images(containers_download, parallel_downloads=4)
Member

parallel_downloads is a CLI argument, currently stored in self.parallel_downloads of a DownloadWorkflow instance. I have not heard of anyone using it ever (except for debugging), but it is probably reasonable to keep it customizable?

If not, we should remove the CLI argument.

Member Author

I didn't know there was a CLI argument! I'll connect them back.

@nikhil mentioned this pull request Mar 26, 2025
Attempt to use a local installation of singularity to pull the image.
# Find out what the library directory is
library_dir = os.environ.get("NXF_SINGULARITY_LIBRARYDIR")
if library_dir and not os.path.isdir(library_dir):
Member

We are trying to use pathlib for file handling instead of os.path. Please try to use pathlib everywhere else, thanks!

Suggested change
if library_dir and not os.path.isdir(library_dir):
if library_dir and not Path(library_dir).is_dir():
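For reference, the os.path calls touched by this kind of refactor map directly onto pathlib (a quick equivalence demo, not code from the PR):

```python
import os
from pathlib import Path

# Directory check: os.path.isdir(p)  ->  Path(p).is_dir()
assert os.path.isdir("work") == Path("work").is_dir()
# Path joining:    os.path.join(a, b)  ->  Path(a) / b
assert os.path.join("work", "img.sif") == str(Path("work") / "img.sif")
# Basename:        os.path.basename(p)  ->  Path(p).name
assert os.path.basename("cache/img.sif") == Path("cache/img.sif").name
```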

cache_dir = os.environ.get("NXF_SINGULARITY_CACHEDIR")
if self.container_cache_utilisation in ["amend", "copy"]:
if cache_dir:
if not os.path.isdir(cache_dir):
Member

Suggested change
if not os.path.isdir(cache_dir):
if not Path(cache_dir).is_dir():

else:
raise FileNotFoundError("Singularity cache is required but no '$NXF_SINGULARITY_CACHEDIR' set!")

assert self.outdir
Member

We do it like this in other places in the code, with a comment, which makes the purpose clearer:

Suggested change
assert self.outdir
assert self.outdir is not None # mypy

for container, output_path in containers_pull:
# it is possible to try multiple registries / mirrors if multiple were specified.
# Iteration happens over a copy of self.container_library[:], as I want to be able to remove failing registries for subsequent images.
for library in self.container_library[:]:
Member

Makes it more readable

Suggested change
for library in self.container_library[:]:
for library in self.container_library.copy():

log.debug(f"Singularity command: {' '.join(singularity_command)}")

# Run the singularity pull command
with subprocess.Popen(
Member

We have the function run_cmd() in nf_core/utils.py; I think it would be useful for reducing the code here.

Member

If you have time, it would indeed be great to change. But since it is our current codebase, I do not want to be too demanding either.

Member Author

There is a big difference with nf_core.utils.run_cmd. Here we stream the output to update the progress bar. The motivation is that singularity pulls can take a while to run, and it's better for users to get visual feedback that something is really happening.
nf_core.utils.run_cmd runs the command and only returns the output/error when the command ends. I'm not sure that's the right thing to do here.

I would rather propose to introduce nf_core.utils.stream_cmd
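A stream_cmd along the proposed lines could look like this (a sketch of the idea only; nf_core.utils.stream_cmd does not exist yet, and the name and signature are assumptions):

```python
import subprocess
from collections.abc import Iterator


def stream_cmd(cmd: list[str]) -> Iterator[str]:
    """Run a command and yield its output line by line as it is produced,
    so the caller can update a progress bar while e.g. 'singularity pull'
    is still running. Raises CalledProcessError on a non-zero exit."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Popen.__exit__ has waited for the process by this point
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

A caller would then do something like `for line in stream_cmd(singularity_command): progress.update(task, current_log=line)`.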

Member

That sounds convincing to me, although I have admittedly not looked into the specific implementation of nf_core.utils.run_cmd.

Comment on lines +358 to +359
if not (shutil.which("singularity") or shutil.which("apptainer")):
raise OSError("Singularity/Apptainer is needed to pull images, but it is not installed or not in $PATH")
Member

I think this is not needed as we already have the same check in pull_images()

Member

In all fairness, this check is simply our current codebase.

Coincidentally, I want to get rid of those checks entirely (or at least make -f/--force disable them), as they prevent running the tool on macOS with recent pipelines that use Seqera Containers. Instead, the tool should just output a list of the commands that it could not run because singularity/apptainer was missing.

Member Author

Can you explain your use case a bit more, @MatthiasZepper? If someone runs nf-core pipelines download with --container-system singularity, it seems fair to complain if singularity/apptainer is not available.

Member

Sure. I had in mind the case where you run nf-core download on a different computer from the one that will execute the workflows later. Specifically, I was thinking of macOS, which does not support singularity/apptainer.

We have an HPC that runs Linux, but we download the pipelines with our developers' MacBooks. If we need to pull a Docker container and convert it, or have a mandatory oras:// download, I need to start my VM; but most nf-core pipelines actually have http:// downloads for all required modules, so it is possible to obtain a perfectly functional download without singularity/apptainer. Due to this check, nf-core download will nonetheless not run, which is why I have symlinked ~/bin/apptainer -> /usr/bin/true on my Mac.

For all modules that use Seqera Containers, it is no longer possible to skip the Docker container when an http:// download is available. So for Seqera Containers modules, you will always download both: the native Singularity container, and the converted Docker image. This makes it even more attractive to dummy-replace apptainer while downloading, to save some bandwidth and find out which containers are actually missing.

Member Author

OK. In that case, I would propose adding a command-line option to allow missing containers, whether due to a failure of the copy/download/pull command or because the command can't be run (i.e. your case). It would essentially turn all those errors into warnings. In your case, it would still do all the downloads but skip the pulls and report the list of missing containers.

Member

Sounds good to me, although applying --force instead of a dedicated CLI argument would probably suffice as well. Either way, this is out of scope for this PR, which is, I believe, already comprehensive enough and shouldn't get bogged down by too many small additions.

Member

As I mentioned in #3506 (comment), the directory name should be download

Member

This was my decision, to avoid import errors for now. Running git mv later is pretty quick, I guess.

successful_downloads.append(future_downloads[future])

with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
# The entire block needs to be monitored for KeyboardInterrupt so that ntermediate files
Member

Suggested change
# The entire block needs to be monitored for KeyboardInterrupt so that ntermediate files
# The entire block needs to be monitored for KeyboardInterrupt so that intermediate files
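The pattern under review here, parallel downloads with clean KeyboardInterrupt handling, can be sketched like this (illustrative only; the real FileDownloader.download_files_in_parallel differs in detail, and the signature is an assumption):

```python
import concurrent.futures


def download_files_in_parallel(downloads, download_file, parallel_downloads=4):
    """Run download_file(container, out_path) for each (container, out_path)
    pair in a thread pool; on Ctrl-C, cancel queued work so no partial
    state is left behind. Returns the successfully downloaded pairs."""
    successful = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
        futures = {pool.submit(download_file, c, p): (c, p) for c, p in downloads}
        try:
            for future in concurrent.futures.as_completed(futures):
                future.result()  # re-raise any download error
                successful.append(futures[future])
        except KeyboardInterrupt:
            # Drop queued downloads; already-running ones finish on pool exit
            for future in futures:
                future.cancel()
            raise
    return successful
```

Combined with something like the intermediate_file context manager inside download_file, an interrupt leaves neither queued work nor partial files behind.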
