
[WIP] Guidelines for pipelines#58

Open
AlexVCaron wants to merge 4 commits into nf-neuro:main from
AlexVCaron:feat/pipeline_install

Conversation

@AlexVCaron (Contributor) commented Feb 6, 2026

New pages for basic pipeline usage and common configuration, plus guidelines for download using both nf-core and our script.

Build the documentation and run the development server to get a view!

@AlexVCaron force-pushed the feat/pipeline_install branch from 32f3340 to 4fe7c2b on February 6, 2026 at 20:04
@gagnonanthony (Member) left a comment

Great job! It's clear and easy to read. I've made a few comments, but before going further, I think we might need a discussion as a team. Here are my current thoughts:

While the doc is clear and super nice, your angle is tailored to a user's perspective (someone who wants to run a pipeline). You made it super general, which is perfect, but I have a hard time finding arguments as to why we should put this doc on the nf-neuro website and not in the pipelines' documentation (each would be more tailored to its own parameters). Otherwise, users would need to go back and forth between two docs.

What I think would be cool is to change the angle of this doc to help pipeline developers create their own pipeline documentation. This way we would get interaction on the nf-neuro website from developers, which means the doc can be more complex and "geeky". Developers will digest this information into a more user-friendly version tailored to their own pipeline, which users will refer to when launching their stuff.

Otherwise, I feel we duplicate the information a lot, which might confuse users as to which doc to refer to. Also, the rest of the nf-neuro website is mostly designed for pipeline developers, not pipeline users, so having a section that targets users is a bit strange imo.

Anyway, those are just thoughts, feel free to jump in @arnaudbore @levje @AlexVCaron

data and the `input` directory.

:::caution
**Symbolic links must be carefully verified on HPC**, as so they are accessible by computing nodes
Member

Suggested change
**Symbolic links must be carefully verified on HPC**, as so they are accessible by computing nodes
**Symbolic links must be carefully verified on HPC** to validate that they will be accessible by computing nodes.

Comment on lines +63 to +78
4. **Create or edit `nextflow.config`** in the directory created at **step 1**. In it, set or replace :

```groovy
params.input = 'input'
params.outdir = 'results'
```

Refer to the documentation of the pipeline you are running for any other **input parameters** needing to be set, and for **execution
parameters** that might be of interest to set given your data, project or research question.

:::tip[Centralize your configuration !]
You can specify configuration for the pipeline in many different ways. **We cannot recommend enough that you centralize everything in the
`nextflow.config` created above, for debugging purposes, but also for reuse, sharing and safekeeping**. A rule of thumb is to
compile all **static** configuration in that file, and supply parameters at the command line only to slightly adapt execution
to specific use-cases. **Overriding parameters using the `-c` nextflow argument should be avoided at all costs !**
:::
Member

I am not sure I agree with this method for setting pipeline parameters. Asking users to create nextflow.config files is a bit too much for me, especially since they can conflict with each pipeline's config files.

I think we should promote the use of params-file.yaml rather than config files (or simply ask them to add parameters via the CLI). The usage is a bit more intuitive, and we make a clear separation of what is config/parameters.

In any case, the specific parameters are shown in the QC report, so they can refer to that for future analyses.

Contributor Author

I'm not sure I completely agree here. I do think input via the CLI is preferable, but I don't understand using YAML for input configuration. It makes sense from a developer's standpoint, as they are using the language to implement components, but we don't assume users know YAML. We assume they read documentation and know Nextflow. Nowhere, except maybe deep in some nf-core guidelines, will they be asked to provide configuration as YAML. Can you explain why the popularity of YAML would make it preferable here?

Contributor

Personally, I don't even know what the params-file.yaml is. I see the argument for having a file that can't interfere with some essential parameters of the pipeline. However, I think I'm more in favor of using the "imperfect" nextflow.config file (even though, let's be honest, YAML is much cleaner than the nextflow config).

However, I am not sure I agree with the line saying that "Overriding parameters using the -c nextflow argument should be avoided at all costs!". Does that mean I have to create a nextflow.config for each dataset I'm processing? Since I'm running the same pipeline with the same custom parameters on several datasets, can't I just have a single configuration file that's common to all my runs? With that phrasing, I feel like I'd be committing a crime by using something that's clearly well supported by Nextflow. I would suggest to either:

  1. Remove this if we decide it's ok to use the -c option.
  2. Rephrase this so we don't feel terrible using that option and/or a small explanation on why this would be bad practice.

@AlexVCaron (Contributor Author) commented Feb 9, 2026

From what I know, all you need to override configuration is a single nextflow.config file in the working directory of the pipeline. It takes precedence over everything else no matter what. Only the -c option overwrites it... @gagnonanthony is it different when nextflow downloads the pipeline? Do you absolutely need the -c option?

Contributor

I meant that if there are multiple working directories (I would probably have one working directory per dataset I'm running), I'd have to create a nextflow.config file in each working directory?

Contributor Author

You should always keep a trace of the parameters you use, in a script with CLI arguments (with --) or in a nextflow.config where you launch your study. I think it's better to keep those local to each run, for history and replicability. So I think it's better to avoid -c if you can.

Member

Regarding the override configuration, @AlexVCaron and @levje, I don't think it's different when nextflow downloads the pipeline. The pipeline is downloaded to $HOME/.nextflow/..., so it's very unlikely this will also be your working directory. I prefer suggesting to users that they specify additional configs with -c (using a config file named something meaningful, not just nextflow.config). In my mind, if they do so, they need to understand that they are changing the pipeline's default configs (or supplying configs for their cluster).

Regarding the params.yml file, it's documented in the Nextflow docs and is recommended for providing parameters when not using the CLI (JSON is also supported, but I personally prefer YAML). It's also how nf-core recommends doing this (it's part of their boilerplate documentation for pipelines). I see two advantages of using the params.yml file:

  1. It's cleaner for users who do not have a lot of programming experience (and more intuitive). Here's an example:

```yaml
input: '/path/to/BIDS/'
outdir: '/path/to/results/'
participant_label:
    - sub-01
    - sub-02
dti_shells: "0 1500"
fodf_shells: "0 1500 3000"
```

  2. It creates a clear separation between what is configuration (like resources, executors, etc.) and what are pipeline parameters. I feel that if inexperienced users start playing in a nextflow.config-like file (most certainly with the help of ChatGPT or something), we could end up with hard-to-track issues because they rewrote something that was not meant to be changed in the pipeline. Having them create a YAML file for parameters limits the possible problems.

Also, it gets supplied differently, using the -params-file option in Nextflow. Let me know what you think!
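As a minimal sketch of what this would look like for users (the pipeline name below is hypothetical; `-params-file` is the Nextflow option discussed above), supplying the YAML file could be as simple as:

```shell
# Illustrative only: replace nf-neuro/example-pipeline with the actual pipeline name.
# -params-file loads pipeline parameters (params.*) from a YAML or JSON file,
# keeping them separate from execution configuration.
nextflow run nf-neuro/example-pipeline -params-file params.yml
```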

Contributor

The location of the downloaded pipeline ($HOME/.nextflow/...) doesn't matter for configuration overriding. The location of the downloaded pipeline is labeled as the "project directory", but if we have a nextflow.config in the launch directory (from where we launch the pipeline), it will override the configuration without having to specify the -c option.

Personally, I prefer to manually input the custom configuration in the CLI with the -c option. I see two good points in doing that:

  1. It's transparent to the user that we are using a custom configuration file (or not).
  2. The configuration file name appears in the QC report. So if I want to reproduce results I ran 2 years ago and I forgot that I inputted a custom configuration, I can just have a look at the command I ran in the QC report and see the exact input I used for my pipeline.
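Concretely, that `-c` workflow might look like the following sketch (the pipeline and config file names are illustrative; the meaningful file name is the point, as suggested above):

```shell
# Illustrative only: a meaningfully-named custom config supplied explicitly with -c,
# so the launch command itself records that a custom configuration was used.
nextflow run nf-neuro/example-pipeline -c hpc_cluster.config --input input --outdir results
```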

On the -params-file, you're slowly convincing me (with good arguments) to prefer this way of specifying parameters. I think the idea is great. The only problem I have is that the YAML format feels disconnected from the other technologies/languages used in the pipelines (and using a JSON file would be a nightmare). Other than that, I'm all for it.

Member

I would personally put the custom download script as the first proposed solution, then the nf-core one.

Contributor Author

The custom script was and will always remain a bugfix to me. So in my book, placing it first doesn't make sense. Happy to be proven wrong. But feel free to advertise otherwise in your pipeline documentation!

Contributor

I agree with @gagnonanthony. Even though it's more an ease-of-use script to circumvent some issues, I think the simpler solution should be the first one given to the user here. If I'm a user who has little knowledge about Nextflow and its associated libraries, I don't really care about installing nf-core and all that stuff if there's a script that takes care of everything with a single command.

@levje (Contributor) left a comment

Great job. I have the same opinion as @gagnonanthony with regard to where this type of documentation belongs.

Comment on lines +54 to +55
be exact, when in doubt ensure you have **a lot** of it). The pipeline's execution should not be affected if no more
space is available to write results, but you won't have access to them easily. In which case, you might need to re-execute
Contributor

Not so sure this sentence makes sense: "The pipeline's execution should not be affected if no more space is available to write results." If there's no more space left on my device, the pipeline does crash/stop.

Contributor Author

I will clarify. I had an HPC use-case in mind when I wrote this. You could write results to the project space and process in scratch, and run out of space in the results space while still processing correctly... my bad 😅

some of the steps or the whole pipeline a second time, wasting time and computing resources.

:::tip
Pipeline's usually work in **overwrite mode**, meaning **subsequent pipeline runs will write over previous ones for the same
Contributor

Pipeline's -> Pipelines


:::tip
Pipeline's usually work in **overwrite mode**, meaning **subsequent pipeline runs will write over previous ones for the same
input subjects**. If unsure, consult the documentation for the specific pipeline you want to use
Contributor

Add period at the end of sentence.

This is your **main task**. You need to prepare the spaces where your input data lives, according to
your pipeline's input specification. You also need to allocate a space for the pipeline's outputs. You'll
need to refer to its own documentation to get everything in order, as each pipeline has its own specificities.
Nomatterwhat, here is a quick checklist to get everything in good shape, ready to address any pipeline's I/O
Contributor

Nomatterwhat -> No matter what



3 participants
