Skip to content

Latest commit

 

History

History
147 lines (108 loc) · 4.27 KB

README.md

File metadata and controls

147 lines (108 loc) · 4.27 KB

fastQpick

Fast and memory-efficient sampling of DNA-seq or RNA-seq FASTQ data with or without replacement.


Installation

Install via PyPI

pip install fastQpick

Install from Source Code

Using pip:

pip install git+https://github.com/pachterlab/fastQpick.git

Or clone the repository and build manually:

git clone https://github.com/pachterlab/fastQpick.git
cd fastQpick
python -m build
python -m pip install dist/fastQpick-x.x.x-py3-none-any.whl

Usage

Command-line Interface

Run fastQpick with a specified fraction and options:

fastQpick --fraction FRACTION [OPTIONS] FASTQ_FILE1 FASTQ_FILE2 ...

Python API

Use fastQpick in your Python code:

from fastQpick import fastQpick

fastQpick(
    input_file_list=['FASTQ_FILE1', 'FASTQ_FILE2', ...],
    fraction=FRACTION,
    ...
)

Documentation

  • Command-line Help: Use the following command to see all available options:

    fastQpick --help
  • Python API Help: Use the help function to explore the API:

    help(fastQpick)

Options

  • input_files (str, list, or tuple) List of input FASTQ files or directories containing FASTQ files. Required. Positional argument on command line.
  • fraction (int or float) The fraction of reads to sample, as a float greater than 0. Any value equal to or greater than 1 will turn on the -r flag automatically.
  • seed (int or str) Random seed(s). Can provide multiple seeds separated by commas. Default: 42
  • output_dir (str) Output directory. Default: ./fastQpick_output
  • gzip_output (bool) Whether or not to gzip the output. Default: False (uncompressed)
  • group_size (int) The size of grouped files. Provide each pair of files sequentially, separated by a space. E.g., I1, R1, R2 - would have group_size=3. Default: 1 (unpaired)
  • replacement (bool) Sample with replacement. Default: False (without replacement).
  • overwrite (bool) Overwrite existing output files. Default: False
  • low_memory (bool) Whether to use low memory mode (uses ~5.5x less memory than default, but adds marginal time to the data - structure generation preprocessing). Default: False
  • verbose (bool) Whether to print progress information. Default: True

Features

  • Efficient sampling of large FASTQ files.
  • Works with both single and paired-end sequencing data.
  • Supports sampling with or without replacement.
  • Command-line interface and Python API for seamless integration.
  • Memory efficient - in low-memory mode, only uses as much memory as a list of (small) integers the length of the number of reads in the fastq file for each file.
  • Time efficient - only passes through the fastq once and writes to output in batches - can process 600M reads in 10-15 minutes

Low memory mode vs. standard

Low memory mode vs. standard, when fraction=1 (i.e., number of reads to sample is the same as the number of reads in the fastq):

  • Adds an extra ~1-3 seconds per million reads per group_size (i.e., 500M reads would take 30 minutes instead of 20-25 minutes)
  • Saves an extra ~40MB RAM per million reads (i.e., 500M reads would take 3.75GB RAM vs 20.6GB RAM)

Examples

1. Sample 10% of reads with replacement from a FASTQ file:

Command-line

fastQpick --fraction 0.1 -r input.fastq

Python

from fastQpick import fastQpick

fastQpick(
    input_files='input.fastq',
    fraction=0.1,
    replacement=True
)

2. Sample 100% of reads with replacement from multiple paired FASTQ files (R1, R2) across three seeds (i.e., bootstrapping):

Command-line

fastQpick --fraction 1 -s 42,43,44 -r -g 2 input1_R1.fastq input1_R2.fastq

Python

from fastQpick import fastQpick

fastQpick(
    input_files='input.fastq',
    fraction=1,
    seed="42,43,44",
    replacement=True,
    group_size=2,
)

License

fastQpick is licensed under the 2-clause BSD license. See the LICENSE file for details.


Contributing

We welcome contributions! Please see the CONTRIBUTING.md file for guidelines on how to get involved.