Commit 60028cb

deploy: dfb5754
1 parent eed65a5 commit 60028cb

1,196 files changed: 47,096 additions, 20,557 deletions


.git-blame-ignore-revs

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Apply ruff as linter for python code
+9e87b864f699c371b444b592a19e610a3c9d3286
+
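
For reference, git only honours this file once it is wired up; a minimal sketch using standard git options (the blamed file path is a placeholder):

# configure once per clone so the formatting-only commit above is skipped by blame
git config blame.ignoreRevsFile .git-blame-ignore-revs

# or pass the file for a single invocation (src/example.py is a placeholder)
git blame --ignore-revs-file .git-blame-ignore-revs src/example.py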

.github/workflows/integration-test.yml

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ jobs:

       - uses: viash-io/viash-actions/setup@v6

-      - uses: nf-core/setup-nextflow@v2.0.0
+      - uses: nf-core/setup-nextflow@v2.1.4

       # use cache
       - name: Cache resources data

.github/workflows/release-build.yml

Lines changed: 3 additions & 3 deletions
@@ -60,7 +60,7 @@ jobs:

       - uses: viash-io/viash-actions/setup@v6

-      - uses: nf-core/setup-nextflow@v2.0.0
+      - uses: nf-core/setup-nextflow@v2.1.4

       # use cache
       - name: Cache resources data
@@ -186,13 +186,13 @@ jobs:
           password: ${{ secrets.GTHB_PAT }}

       - name: Test component
-        timeout-minutes: 30
+        timeout-minutes: 40
         run: |
           viash test \
             "${{ matrix.component.config }}" \
             --config_mod ".engines[.type == 'docker'].image := 'ghcr.io/openpipelines-bio/openpipeline/${{ matrix.component.namespace }}${{matrix.component.namespace_separator}}${{ matrix.component.name }}:${{ github.event.inputs.version_tag }}'" \
             --config_mod ".engines[.type == 'docker'].setup := []" \
             --cpus 4 \
-            --memory "12gb" \
+            --memory "14gb" \
             --engine docker \
             --runner executable

.github/workflows/viash-test.yml

Lines changed: 38 additions & 0 deletions
@@ -10,6 +10,44 @@ concurrency:
   cancel-in-progress: ${{ !contains(github.ref, 'main')}}

 jobs:
+  linting:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install ruff
+      - name: Run Ruff
+        run: ruff check --output-format=github .
+
+      - uses: r-lib/actions/setup-r@v2
+        with:
+          use-public-rspm: true
+
+      - uses: r-lib/actions/setup-r-dependencies@v2
+        with:
+          packages: any::lintr, any::styler, any::roxygen2
+          needs: lint, styler
+
+      - name: Lint
+        run: lintr::lint_dir(path = ".")
+        shell: Rscript {0}
+        env:
+          LINTR_ERROR_ON_LINT: true
+
+      - name: Style
+        run: styler::style_dir(dry = "off")
+        shell: Rscript {0}
+
+
   # phase 1
   list:
     env:
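
The new linting job can be reproduced locally from the repository root; a minimal sketch, assuming ruff, R, and the lintr/styler packages are already installed:

# Python linting, same invocation as the workflow's Ruff step
ruff check --output-format=github .

# R linting and styling, mirroring the workflow's Rscript steps
Rscript -e 'lintr::lint_dir(path = ".")'
Rscript -e 'styler::style_dir(dry = "off")'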

.pre-commit-config.yaml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.8.1
+    hooks:
+      - id: ruff
+      - id: ruff-format
+  - repo: local
+    hooks:
+      - id: run_styler
+        name: run_styler
+        language: r
+        description: style files with {styler}
+        entry: "Rscript -e 'styler::style_file(commandArgs(TRUE))'"
+        files: '(\.[rR]profile|\.[rR]|\.[rR]md|\.[rR]nw|\.[qQ]md)$'
+        additional_dependencies:
+          - styler
+          - knitr
+  - repo: https://github.com/lorenzwalthert/precommit
+    rev: v0.4.3
+    hooks:
+      - id: lintr
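
With this config committed, the hooks run on staged files via the standard pre-commit CLI; a minimal sketch:

pip install pre-commit
pre-commit install            # register the git hook in this clone
pre-commit run --all-files    # one-off run across the whole repository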

CHANGELOG.md

Lines changed: 59 additions & 2 deletions
@@ -1,12 +1,69 @@
+# openpipelines 2.0.0
+
+## BREAKING CHANGES
+
+* `velocity/scvelo`: update `scvelo` to `0.3.3`, which also removes support for using `loom` input files. The component now uses a `MuData` object as input. Several arguments were added to support selecting different inputs from the MuData file: `counts_layer`, `modality`, `layer_spliced`, `layer_unspliced`, `layer_ambiguous`. An `output_h5mu` argument has been added (PR #932).
+
+* `src/annotate/onclass` and `src/annotate/celltypist`: Input parameters for gene name layers of the input datasets have been updated to `--input_var_gene_names` and `--reference_var_gene_names` (PR #919).
+
+* Several components under `src/scgpt` (`cross_check_genes`, `tokenize_pad`, `binning`) now process the input (query) datasets differently. Instead of subsetting datasets based on genes in the model vocabulary and/or highly variable genes, these components require an input .var column with a boolean mask specifying this information. The results are written back to the original input data, preserving the dataset structure (PR #832).
+
+* `query/cellxgene_census`: The default output layer has been changed from `.layers["counts"]` to `.X` to be more aligned with the standard OpenPipelines format (PR #933).
+  Use argument `--output_layer_counts counts` to revert the behaviour to the previous default.
+
+## NEW FUNCTIONALITY
+
+* `velocyto_to_h5mu`: now writes counts to `.X` (PR #932).
+
+* `qc/calculate_atac_qc_metrics`: new component for calculating ATAC QC metrics (PR #868).
+
+* `workflows/annotation/scgpt_annotation` workflow: Added a scGPT transformer-based cell type annotation workflow (PR #832).
+
+* `workflows/annotation/scgpt_integration_knn` workflow: Cell-type annotation based on scGPT integration with KNN label transfer (PR #875).
+
+* CI: Use `params.resources_test` in test workflows in order to point to an alternative location (e.g. a cache) (PR #889).
+
+## MINOR CHANGES
+
+* Pin `scikit-learn` for `labels_transfer/xgboost` to `<1.6` (PR #931).
+
+* `filter/filter_with_scrublet`: provide a cleaner error message when running scrublet on an empty modality (PR #929).
+
+* Several components (cleanup): remove workaround for being able to use shared utility functions with Nextflow Fusion (PR #920).
+
+* `scgpt/cell_type_annotation` component update: Added support for multi-processing (PR #832).
+
+* Several annotation (`src/annotate/`) components (`onclass`, `celltypist`, `random_forest_annotation`, `scanvi`, `svm_annotation`): Updated input parameters to ensure uniformity across components, implemented functionality to cross-check the overlap of genes between query and reference (model) datasets, and implemented logic to allow for subsetting of genes (PR #919).
+
+* `workflows/annotation/scgpt_annotation` workflow: Added a scGPT transformer-based cell type annotation workflow (PR #832).
+
+* `scgpt/cross_check_genes` component update: Highly variable genes are now cross-checked based on the boolean mask in `var_input`. The filtering information is stored in the `--output_var_filter` .var field instead of subsetting the dataset (PR #832).
+
+* `scgpt/binning` component update: This component now requires the `--var_input` parameter to provide gene filtering information. Binned data is written to the `--output_obsm_binned_counts` .obsm field in the original input data (PR #832).
+
+* `scgpt/pad_tokenize` component update: Genes are padded and tokenized based on filtering information in `--var_input` and `--input_obsm_binned_counts` (PR #832).
+
+* `resources_test_scripts/scgpt.sh`: Update scGPT test resources to avoid subsetting of datasets (PR #926).
+
+* `workflows/integration/scgpt_leiden` workflow update: Update the workflow such that the input dataset is not subsetted for HVG but uses boolean masks in the .var field instead (PR #875).
+
+## BUG FIXES
+
+* `scvi_leiden` workflow: fix the input layer argument of the workflow not being passed to the scVI component (PR #936 and PR #938).
+
+* `scgpt/embedding`: remove unused argument `dbsn` (PR #875).
+
+* `scgpt/binning`: update handling of empty rows in sparse matrices (PR #875).
+
 # openpipelines 2.0.0-rc.2

 ## BUG FIXES

-* `annotate/popv`: fix popv raising `ValueError` when an accelerator (e.g. GPU) is unavailable (PR #918, backported from PR #915).
+* `annotate/popv`: fix popv raising `ValueError` when an accelerator (e.g. GPU) is unavailable (PR #915).

 ## MINOR CHANGES

-* `dataflow/split_h5mu`: Optimize resource usage of the component (PR #917, backported from PR #913).
+* `dataflow/split_h5mu`: Optimize resource usage of the component (PR #913).

 # openpipelines 2.0.0-rc.1
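
To illustrate the `query/cellxgene_census` breaking change above, reverting to the old default only requires the documented flag. A hedged sketch: the config path follows the repo's src/<namespace>/<name>/config.vsh.yaml convention and the --output argument is a placeholder; only --output_layer_counts comes from the changelog:

# restore counts to .layers["counts"] (config path and --output are assumptions)
viash run src/query/cellxgene_census/config.vsh.yaml -- \
  --output_layer_counts counts \
  --output census_query.h5mu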

_viash.yaml

Lines changed: 0 additions & 2 deletions
@@ -1,7 +1,5 @@
 viash_version: 0.9.0

-version: dev
-
 source: src
 target: target

resources_test_scripts/rna_velocity.sh

Lines changed: 9 additions & 3 deletions
@@ -1,7 +1,5 @@
 #!/bin/bash

-set -eo pipefail
-
 # ensure that the command below is run from the root of the repository
 REPO_ROOT=$(git rev-parse --show-toplevel)
 cd "$REPO_ROOT"
@@ -19,7 +17,7 @@ mkdir -p "$velocyto_dir"
 # Create a compatible BAM file from BD Rhapsody Output #
 ########################################################

-bd_rhap_wta_bam="resources_test/bdrhap_5kjrt/processed/WTA.bd_rhapsody.output_raw/sample_final.BAM"
+bd_rhap_wta_bam="resources_test/bdrhap_5kjrt/processed/output_raw/Combined_sample_Bioproduct.bam"

 if [[ ! -f "$bd_rhap_wta_bam" ]]; then
   echo "$bd_rhap_wta_bam does not exist. Please generate BD Rhapsody test data first."
@@ -52,3 +50,11 @@ viash run src/velocity/velocyto/config.vsh.yaml -- \
   -i "$bam" \
   -o "$OUT/velocyto_processed/cellranger_tiny.loom" \
   --transcriptome "$gtf"
+
+echo "> Converting loom file to MuData object"
+viash run src/velocity/velocyto_to_h5mu/config.vsh.yaml -- \
+  --input_loom "$OUT/velocyto_processed/cellranger_tiny.loom" \
+  --input_h5mu "resources_test/cellranger_tiny_fastq/raw_dataset.h5mu" \
+  --modality velocyto \
+  --output_compression "gzip" \
+  --output "$OUT/velocyto_processed/velocyto.h5mu"

resources_test_scripts/scgpt.sh

File mode changed: 100644 → 100755
Lines changed: 50 additions & 25 deletions
@@ -11,6 +11,12 @@ OUT=resources_test/$ID
 # create foundational model directory
 foundation_model_dir="$OUT/source"
 mkdir -p "$foundation_model_dir"
+export foundation_model_dir
+
+# create finetuned model directory
+finetuned_model_dir="$OUT/finetuned_model"
+mkdir -p "$finetuned_model_dir"
+export finetuned_model_dir

 # install gdown if necessary
 # Check whether gdown is available
@@ -19,13 +25,39 @@ if ! command -v gdown &> /dev/null; then
   exit 1
 fi

+# install torch if necessary
+# Check whether torch is available
+if ! python -c "import torch"; then
+  echo "This script requires torch. Please make sure it is available in your python environment."
+  exit 1
+fi
+
 echo "> Downloading scGPT foundation model (full_human)"
 # download foundational model files (full_human)
 # https://drive.google.com/drive/folders/1oWh_-ZRdhtoGQ2Fw24HP41FgLoomVo-y
 gdown '1H3E_MJ-Dl36AQV6jLbna2EdvgPaqvqcC' -O "${foundation_model_dir}/vocab.json"
 gdown '1hh2zGKyWAx3DyovD30GStZ3QlzmSqdk1' -O "${foundation_model_dir}/args.json"
 gdown '14AebJfGOUF047Eg40hk57HCtrb0fyDTm' -O "${foundation_model_dir}/best_model.pt"

+echo "> Converting to finetuned model format"
+python <<HEREDOC
+import torch
+import mudata
+import os
+
+foundation_model_dir = os.environ.get('foundation_model_dir')
+finetuned_model_dir = os.environ.get('finetuned_model_dir')
+
+found_model_path = f"{foundation_model_dir}/best_model.pt"
+ft_model_path = f"{finetuned_model_dir}/best_model.pt"
+
+f_model_dict = torch.load(found_model_path, map_location="cpu")
+model_dict = {}
+model_dict["model_state_dict"] = f_model_dict
+model_dict["id_to_class"] = {k: str(k) for k in range(15)}
+torch.save(model_dict, ft_model_path)
+HEREDOC
+
 # create test data dir
 test_resources_dir="$OUT/test_resources"
 mkdir -p "$test_resources_dir"
@@ -45,12 +77,13 @@ input_mdata.write_h5mu("${test_resources_dir}/Kim2020_Lung.h5mu")
 HEREDOC

 echo "> Subsetting datasets"
-viash run src/filter/subset_h5mu/config.vsh.yaml -p docker -- \
+viash run src/filter/subset_h5mu/config.vsh.yaml --engine docker -- \
   --input "${test_resources_dir}/Kim2020_Lung.h5mu" \
   --output "${test_resources_dir}/Kim2020_Lung_subset.h5mu" \
   --number_of_observations 4000

 rm "${test_resources_dir}/Kim2020_Lung.h5ad"
+rm "${test_resources_dir}/Kim2020_Lung.h5mu"

 echo "> Preprocessing datasets"
 nextflow \
@@ -63,46 +96,38 @@ nextflow \
   --publish_dir "${test_resources_dir}"

 echo "> Filtering highly variable features"
-viash run src/feature_annotation/highly_variable_features_scanpy/config.vsh.yaml -p docker -- \
-  --input "${test_resources_dir}/iKim2020_Lung_subset_preprocessed.h5mu" \
+viash run src/feature_annotation/highly_variable_features_scanpy/config.vsh.yaml --engine docker -- \
+  --input "${test_resources_dir}/Kim2020_Lung_subset_preprocessed.h5mu" \
   --output "${test_resources_dir}/Kim2020_Lung_subset_hvg.h5mu" \
   --layer "log_normalized" \
-  --var_name_filter "filter_with_hvg" \
+  --var_name_filter "scgpt_filter_with_hvg" \
   --n_top_features 1200 \
   --flavor "seurat_v3"
-
-viash run src/filter/do_filter/config.vsh.yaml -p docker -- \
-  --input "${test_resources_dir}/Kim2020_Lung_subset_hvg.h5mu" \
-  --output "${test_resources_dir}/Kim2020_Lung_subset_hvg_filtered.h5mu" \
-  --var_filter "filter_with_hvg"

 echo "> Running scGPT cross check genes"
-viash run src/scgpt/cross_check_genes/config.vsh.yaml -p docker -- \
-  --input "${test_resources_dir}/Kim2020_Lung_subset_hvg_filtered.h5mu" \
+viash run src/scgpt/cross_check_genes/config.vsh.yaml --engine docker -- \
+  --input "${test_resources_dir}/Kim2020_Lung_subset_hvg.h5mu" \
   --output "${test_resources_dir}/Kim2020_Lung_subset_genes_cross_checked.h5mu" \
-  --vocab_file "${foundation_model_dir}/vocab.json"
+  --vocab_file "${foundation_model_dir}/vocab.json" \
+  --var_input "scgpt_filter_with_hvg" \
+  --output_var_filter "scgpt_cross_checked_genes"

 echo "> Running scGPT binning"
-viash run src/scgpt/binning/config.vsh.yaml -p docker -- \
+viash run src/scgpt/binning/config.vsh.yaml --engine docker -- \
   --input "${test_resources_dir}/Kim2020_Lung_subset_genes_cross_checked.h5mu" \
   --input_layer "log_normalized" \
-  --output "${test_resources_dir}/Kim2020_Lung_subset_binned.h5mu"
+  --output "${test_resources_dir}/Kim2020_Lung_subset_binned.h5mu" \
+  --output_obsm_binned_counts "binned_counts" \
+  --var_input "scgpt_cross_checked_genes"

 echo "> Running scGPT tokenizing"
-viash run src/scgpt/pad_tokenize/config.vsh.yaml -p docker -- \
+viash run src/scgpt/pad_tokenize/config.vsh.yaml --engine docker -- \
   --input "${test_resources_dir}/Kim2020_Lung_subset_binned.h5mu" \
-  --input_layer "binned" \
+  --input_obsm_binned_counts "binned_counts" \
   --output "${test_resources_dir}/Kim2020_Lung_subset_tokenized.h5mu" \
-  --model_vocab "${foundation_model_dir}/vocab.json"
-
-echo "> Running scGPT integration"
-viash run src/scgpt/embedding/config.vsh.yaml -p docker -- \
-  --input "${test_resources_dir}/Kim2020_Lung_subset_tokenized.h5mu" \
-  --output "${test_resources_dir}/Kim2020_Lung_subset_scgpt_integrated.h5mu" \
-  --model "${foundation_model_dir}/best_model.pt" \
   --model_vocab "${foundation_model_dir}/vocab.json" \
-  --model_config "${foundation_model_dir}/args.json" \
-  --obs_batch_label "sample"
+  --var_input "scgpt_cross_checked_genes" \
+

 echo "> Removing unnecessary files in test resources dir"
 find "${test_resources_dir}" -type f \( ! -name "Kim2020_*" -o ! -name "*.h5mu" \) -delete
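
Given the mode change above (100644 → 100755), the script can be invoked directly; per the new checks in the diff it expects gdown and torch in the active Python environment:

# from the repository root; gdown and torch are prerequisites checked by the script
./resources_test_scripts/scgpt.sh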

ruff.toml

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Exclude a variety of commonly ignored directories.
+exclude = [
+    ".git",
+    ".pyenv",
+    ".pytest_cache",
+    ".ruff_cache",
+    ".venv",
+    ".vscode",
+    "__pypackages__",
+    "_build",
+    "build",
+    "dist",
+    "node_modules",
+    "site-packages",
+]
+
+builtins = ["meta"]
+
+
+
+
+[format]
+# Like Black, use double quotes for strings.
+quote-style = "double"
+
+# Like Black, indent with spaces, rather than tabs.
+indent-style = "space"
+
+# Like Black, respect magic trailing commas.
+skip-magic-trailing-comma = false
+
+# Like Black, automatically detect the appropriate line ending.
+line-ending = "auto"
+
+[lint.flake8-pytest-style]
+fixture-parentheses = false
+mark-parentheses = false
+
+[lint]
+ignore = [
+    # module level import not at top of file
+    "E402"
+]
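
With this file at the repository root, ruff picks it up automatically; a minimal sketch of the two entry points the new CI job and pre-commit hooks rely on (the builtins = ["meta"] entry presumably stops ruff flagging the meta variable that viash injects into component scripts as an undefined name):

ruff check .    # lint, honouring the [lint] ignore list above
ruff format .   # apply the Black-style [format] settings above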
