Skip to content

BarcodeSeqKit is a library designed for extracting and processing barcoded sequences from next-generation sequencing data.

License

Notifications You must be signed in to change notification settings

mtinti/BarcodeSeqKit

Repository files navigation

BarcodeSeqKit

Key Features

  • Simple yet powerful: Extract barcodes from BAM or FASTQ files with minimal code
  • Support for both file types: Process BAM files (including softclipped regions) and FASTQ files (including paired-end data)
  • Flexible barcode options: Use single barcodes or specific 5’/3’ combinations
  • Orientation detection: Identify barcodes in both forward and reverse complement orientations
  • Fuzzy matching: Configure allowable mismatches for barcode detection
  • Specialized functions: Search in softclipped regions of BAM alignments or both reads in paired FASTQ data
  • Detailed statistics: Get comprehensive reports on barcode matches and distribution

Regular Expression Support in Barcode Matching

BarcodeSeqKit uses Python’s regular expression engine (re module) for exact barcode matching, which means you can leverage the full power of regular expressions in your barcode patterns.

Installation

You can install BarcodeSeqKit using pip:

pip install barcodeseqkit

Or directly from the GitHub repository:

pip install git+https://github.com/username/BarcodeSeqKit.git

Quick Start

Let’s dive right in with some common use cases!

Command-line Usage

BarcodeSeqKit provides a simple command-line interface that makes it easy to extract barcodes without writing any code.

Extract a Single Barcode

barcodeseqkit --bam tests/test.bam \
              --barcode CTGACTCCTTAAGGGCC \
              --output-prefix single_barcode \
              --output-dir results

This command extracts reads containing the specified barcode (in either forward or reverse complement orientation) and creates: - results/single_barcode_barcode_orientFR.bam: Forward orientation matches - results/single_barcode_barcode_orientRC.bam: Reverse complement matches - results/single_barcode_extraction_stats.json: Detailed statistics in JSON format - results/single_barcode_extraction_stats.tsv: Detailed statistics in TSV format

Extract 5’ and 3’ Barcodes

barcodeseqkit --bam tests/test.bam \
              --barcode5 CTGACTCCTTAAGGGCC \
              --barcode3 TAACTGAGGCCGGC \
              --output-prefix dual_barcode \
              --output-dir results

This creates separate files for each barcode and orientation combination: - results/dual_barcode_barcode5_orientFR.bam - results/dual_barcode_barcode5_orientRC.bam - results/dual_barcode_barcode3_orientFR.bam - results/dual_barcode_barcode3_orientRC.bam

Process Paired FASTQ Files

barcodeseqkit --fastq1 tests/test.1.fastq.gz \
              --fastq2 tests/test.2.fastq.gz \
              --barcode CTGACTCCTTAAGGGCC \
              --output-prefix fastq_extraction \
              --search-both-reads

This processes paired FASTQ files and creates output FASTQ files for each barcode category.

Using Regular Expression Patterns

Instead of specifying a fixed barcode sequence, you can provide a regular expression pattern:

# Define a barcode with ambiguous positions
barcode_config = BarcodeConfig(
    sequence="ACGT[AT]GC[GC].TT",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5prime_variable"
)

Python API Usage

BarcodeSeqKit’s Python API is designed to be intuitive and straightforward:

from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeExtractorConfig
from BarcodeSeqKit.bam_processing import process_bam_file

# Define barcodes
barcodes = [
    BarcodeConfig(
        sequence="TAACTGAGGCCGGC",
        location=BarcodeLocationType.THREE_PRIME,
        name="3prime"
    ),
    BarcodeConfig(
        sequence="CTGACTCCTTAAGGGCC",
        location=BarcodeLocationType.FIVE_PRIME,
        name="5prime"
    )
]

# Create configuration
config = BarcodeExtractorConfig(
    barcodes=barcodes,
    output_prefix="my_extraction",
    output_dir="./results",
    max_mismatches=0,
    search_softclipped=True,
    verbose=True
)

# Process BAM file
stats = process_bam_file(config, "tests/test.bam")

# Report results
print(f"Processed {stats.total_reads} reads")
print(f"Found {stats.total_barcode_matches} barcode matches")

or FASTQ files:

from BarcodeSeqKit.fastq_processing import process_fastq_files

# Use the same config as above
fastq_files = ["tests/test.1.fastq.gz", "tests/test.2.fastq.gz"]
stats = process_fastq_files(
    config=config,
    fastq_files=fastq_files,
    compress_output=True,
    search_both_reads=True
)

Key Concepts

Barcode Types

BarcodeSeqKit supports three types of barcode configurations:

  1. Generic barcodes: Use these when you just want to find a specific sequence regardless of location
  2. 5’ barcodes: Use these when the barcode is expected at the 5’ end of the sequence
  3. 3’ barcodes: Use these when the barcode is expected at the 3’ end of the sequence

Barcode Orientations

For each barcode, BarcodeSeqKit tracks two possible orientations:

  • Forward (FR): The barcode appears in its specified sequence
  • Reverse Complement (RC): The barcode appears as its reverse complement

Softclipped Regions

When working with BAM files, the --search-softclipped option examines only the softclipped portions of reads:

  • For forward strand reads (+): Examines the 5’ softclipped region
  • For reverse strand reads (-): Examines the 3’ softclipped region

This is especially useful for splice leader sequences in trypanosomatids or where barcodes are clipped during alignment.

Advanced Options

Command-Line Arguments

BarcodeSeqKit offers a range of options to customize your extraction:

Option Description
--max-mismatches N Allow up to N mismatches in barcode detection
--search-softclipped Search in softclipped regions (BAM only)
--search-both-reads Look for barcodes in both reads of paired FASTQ files
--no-compress Disable compression for FASTQ output files
--verbose Enable detailed logging

For a complete list, run barcodeseqkit --help.

Barcode Configuration Files

For complex projects with multiple barcodes, you can use a YAML configuration file:

barcodes:
  - sequence: CTGACTCCTTAAGGGCC
    location: 5
    name: 5prime
    description: 5' barcode for experiment X
  - sequence: TAACTGAGGCCGGC
    location: 3
    name: 3prime
    description: 3' barcode for experiment X

Then use it with:

barcodeseqkit --bam test.bam --barcode-config my_barcodes.yaml --output-prefix config_extraction

In BarcodeSeqKit, when multiple barcodes are provided, the program uses an efficient approach: it parses the input file(s) only once while searching for all barcodes simultaneously during that single pass.

Output Files and Statistics

BarcodeSeqKit generates:

  1. Categorized output files: BAM or FASTQ files containing reads matching specific barcode/orientation combinations
  2. Statistics in JSON format: Detailed machine-readable statistics
  3. Statistics in TSV format: Human-readable tabular statistics

The statistics include: - Total number of reads processed - Total barcode matches found - Match counts by barcode type - Match counts by orientation - Match counts by category - Overall match rate

Statistics-Only Mode

BarcodeSeqKit now supports a “statistics-only” mode that processes files and generates detailed statistics without writing output sequence files. This feature is particularly useful for:

  • Quickly analyzing barcode distributions in large datasets
  • Performing QC checks before committing to full processing
  • Estimating barcode frequencies without using additional disk space
  • Benchmarking and optimization tasks

Command-line Usage

To use statistics-only mode from the command line, add the --only-stats flag:

barcodeseqkit --bam sample.bam \
              --barcode5 CTGACTCCTTAAGGGCC \
              --output-prefix quick_stats \
              --output-dir results \
              --only-stats

BarcodeSeqKit has a clean, modular design:

  1. Core (00_core.ipynb): Data structures and configuration
  2. Sequence Utilities (01_sequence_utils.ipynb): Barcode detection algorithms
  3. BAM Processing (02_bam_processing.ipynb): BAM file handling
  4. FASTQ Processing (03_fastq_processing.ipynb): FASTQ file handling
  5. Command-Line Interface (04_cli.ipynb): Argument parsing and execution

Real-World Example

Let’s walk through a practical example of extracting barcodes from the test data included with BarcodeSeqKit:

from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeExtractorConfig
from BarcodeSeqKit.bam_processing import process_bam_file
import os

# Define barcodes that we know are in the test data
barcodes = [
    BarcodeConfig(
        sequence="TAACTGAGGCCGGC",
        location=BarcodeLocationType.THREE_PRIME,
        name="3prime"
    ),
    BarcodeConfig(
        sequence="CTGACTCCTTAAGGGCC",
        location=BarcodeLocationType.FIVE_PRIME,
        name="5prime"
    )
]

# Create configuration
output_dir = "../tests/index_api"
os.makedirs(output_dir, exist_ok=True)

config = BarcodeExtractorConfig(
    barcodes=barcodes,
    output_prefix="example_run",
    output_dir=output_dir,
    max_mismatches=0,
    search_softclipped=True,
    verbose=True
)

# Process the BAM file (if it exists)
test_bam = "../tests/test.bam"
if os.path.exists(test_bam):
    print(f"Processing {test_bam}")
    stats = process_bam_file(config, test_bam)
    
    # Print summary statistics
    print(f"\nResults summary:")
    print(f"Total reads: {stats.total_reads}")
    print(f"Total barcode matches: {stats.total_barcode_matches}")
    
    if stats.total_reads > 0:
        match_rate = (stats.total_barcode_matches / stats.total_reads) * 100
        print(f"Match rate: {match_rate:.2f}%")
    
    # Print barcode-specific statistics
    for barcode_name, count in stats.matches_by_barcode.items():
        print(f"  {barcode_name}: {count} matches")
    
    # Print orientation-specific statistics
    for orientation, count in stats.matches_by_orientation.items():
        print(f"  Orientation {orientation}: {count} matches")
else:
    print(f"Test file not found: {test_bam}")
2025-03-24 14:06:24,292 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-24 14:06:24,293 - BarcodeSeqKit - INFO - Output categories: ['barcode3_orientFR', 'barcode3_orientRC', 'barcode5_orientFR', 'barcode5_orientRC', 'noBarcode']

Processing ../tests/test.bam

Classifying reads:   0%|          | 0/498 [00:00<?, ?it/s]

2025-03-24 14:06:24,340 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads

Writing reads:   0%|          | 0/498 [00:00<?, ?it/s]

2025-03-24 14:06:24,411 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode3_orientFR.bam
2025-03-24 14:06:24,424 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode3_orientRC.bam
2025-03-24 14:06:24,433 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode5_orientFR.bam
2025-03-24 14:06:24,442 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode5_orientRC.bam
2025-03-24 14:06:24,450 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_noBarcode.bam


Results summary:
Total reads: 498
Total barcode matches: 18
Match rate: 3.61%
  5prime: 10 matches
  3prime: 8 matches
  Orientation FR: 10 matches
  Orientation RC: 8 matches
config = BarcodeExtractorConfig(
    barcodes=barcodes,
    output_prefix="quick_stats",
    output_dir="../tests/quick_stats",
    max_mismatches=0,
    verbose=True,
    write_output_files=False  # Skip writing sequence files
)

# Process BAM file and get statistics only
stats = process_bam_file(config, "../tests/test.bam")

# Print summary
print(f"Total reads: {stats.total_reads}")
print(f"Total barcode matches: {stats.total_barcode_matches}")
print(f"Match rate: {stats.total_barcode_matches / stats.total_reads * 100:.2f}%")
2025-03-24 14:07:19,200 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-24 14:07:19,202 - BarcodeSeqKit - INFO - Output categories: ['barcode3_orientFR', 'barcode3_orientRC', 'barcode5_orientFR', 'barcode5_orientRC', 'noBarcode']

Classifying reads:   0%|          | 0/498 [00:00<?, ?it/s]

2025-03-24 14:07:19,259 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads

Total reads: 498
Total barcode matches: 18
Match rate: 3.61%
!barcodeseqkit --bam ../tests/test.bam \
              --barcode5 CTGACTCCTTAAGGGCC \
              --barcode3 TAACTGAGGCCGGC \
              --output-prefix dual_barcode \
              --output-dir ../tests/index_cli
Input BAM file: ../tests/test.bam
Using 5' barcode with sequence: CTGACTCCTTAAGGGCC
Using 3' barcode with sequence: TAACTGAGGCCGGC
Saved configuration to ../tests/index_cli/dual_barcode_config.yaml
2025-03-17 12:18:29,756 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-17 12:18:29,756 - BarcodeSeqKit - INFO - Output categories: ['barcode5_orientFR', 'barcode5_orientRC', 'barcode3_orientFR', 'barcode3_orientRC', 'noBarcode']
Classifying reads: 100%|███████████████████| 498/498 [00:00<00:00, 58040.55it/s]
2025-03-17 12:18:29,810 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
Writing reads: 100%|██████████████████████| 498/498 [00:00<00:00, 230344.44it/s]
2025-03-17 12:18:29,816 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode5_orientFR.bam
2025-03-17 12:18:29,844 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode5_orientRC.bam
2025-03-17 12:18:29,856 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode3_orientFR.bam
2025-03-17 12:18:29,882 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode3_orientRC.bam
2025-03-17 12:18:29,894 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_noBarcode.bam
Extraction complete

Using BarcodeSeqKit with Containers

BarcodeSeqKit is available as a containerized application, which allows you to run it without installing any dependencies on your system. This guide explains how to use BarcodeSeqKit with Docker and Singularity containers.

Docker Usage

Prerequisites

  • Docker installed on your system

Pulling the Docker Image

docker pull mtinti/barcodeseqkit:0.0.4

Basic Usage with the Included Test File

The container includes a test BAM file at /app/tests/test.bam. You can run BarcodeSeqKit on this test file and save the results to your local machine:

# Create a directory for the results
mkdir -p results

# Run the container with the included test file
docker run --rm \
  -v $(pwd)/results:/output \
  mtinti/barcodeseqkit:0.0.4 \
  --bam /app/tests/test.bam \
  --barcode5 CTGACTCCTTAAGGGCC \
  --barcode3 TAACTGAGGCCGGC \
  --output-prefix test_extraction \
  --output-dir /output \
  --search-softclipped \
  --verbose

This command: - Uses the test BAM file already included in the container - Mounts your local results directory to /output inside the container - Extracts reads matching the specified 5’ and 3’ barcodes - Saves the results in your local results directory

Processing Your Own Data

To process your own data files:

docker run --rm \
  -v /path/to/your/data:/data \
  mtinti/barcodeseqkit:0.0.4 \
  --bam /data/your_sample.bam \
  --barcode ACGTACGT \
  --output-prefix extraction \
  --output-dir /data/results

Conclusion

BarcodeSeqKit provides a streamlined, user-friendly approach to barcode extraction from sequencing data. With its intuitive interface and flexible options, it’s suitable for a wide range of applications, from simple barcode detection to complex multi-barcode analyses.

Whether you’re working with BAM files, FASTQ files, single barcodes, or multiple barcodes with specific locations, BarcodeSeqKit offers a straightforward solution for your barcode extraction needs.

For detailed API documentation, check out the module-specific notebooks: - Core Data Structures - Sequence Utilities - BAM Processing - FASTQ Processing - Command-Line Interface

About

BarcodeSeqKit is a library designed for extracting and processing barcoded sequences from next-generation sequencing data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •