- Simple yet powerful: Extract barcodes from BAM or FASTQ files with minimal code
- Support for both file types: Process BAM files (including softclipped regions) and FASTQ files (including paired-end data)
- Flexible barcode options: Use single barcodes or specific 5’/3’ combinations
- Orientation detection: Identify barcodes in both forward and reverse complement orientations
- Fuzzy matching: Configure allowable mismatches for barcode detection
- Specialized functions: Search in softclipped regions of BAM alignments or both reads in paired FASTQ data
- Detailed statistics: Get comprehensive reports on barcode matches and distribution
BarcodeSeqKit uses Python’s regular expression engine (
re
module) for exact barcode matching, which means you can leverage the full power of regular expressions in your barcode patterns.
You can install BarcodeSeqKit using pip:
pip install barcodeseqkit
Or directly from the GitHub repository:
pip install git+https://github.com/username/BarcodeSeqKit.git
Let’s dive right in with some common use cases!
BarcodeSeqKit provides a simple command-line interface that makes it easy to extract barcodes without writing any code.
barcodeseqkit --bam tests/test.bam \
--barcode CTGACTCCTTAAGGGCC \
--output-prefix single_barcode \
--output-dir results
This command extracts reads containing the specified barcode (in either
forward or reverse complement orientation) and creates: -
results/single_barcode_barcode_orientFR.bam
: Forward orientation
matches - results/single_barcode_barcode_orientRC.bam
: Reverse
complement matches - results/single_barcode_extraction_stats.json
:
Detailed statistics in JSON format -
results/single_barcode_extraction_stats.tsv
: Detailed statistics in
TSV format
barcodeseqkit --bam tests/test.bam \
--barcode5 CTGACTCCTTAAGGGCC \
--barcode3 TAACTGAGGCCGGC \
--output-prefix dual_barcode \
--output-dir results
This creates separate files for each barcode and orientation
combination: - results/dual_barcode_barcode5_orientFR.bam
-
results/dual_barcode_barcode5_orientRC.bam
-
results/dual_barcode_barcode3_orientFR.bam
-
results/dual_barcode_barcode3_orientRC.bam
barcodeseqkit --fastq1 tests/test.1.fastq.gz \
--fastq2 tests/test.2.fastq.gz \
--barcode CTGACTCCTTAAGGGCC \
--output-prefix fastq_extraction \
--search-both-reads
This processes paired FASTQ files and creates output FASTQ files for each barcode category.
Instead of specifying a fixed barcode sequence, you can provide a regular expression pattern:
# Define a barcode with ambiguous positions
barcode_config = BarcodeConfig(
sequence="ACGT[AT]GC[GC].TT",
location=BarcodeLocationType.FIVE_PRIME,
name="5prime_variable"
)
BarcodeSeqKit’s Python API is designed to be intuitive and straightforward:
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeExtractorConfig
from BarcodeSeqKit.bam_processing import process_bam_file
# Define barcodes
barcodes = [
BarcodeConfig(
sequence="TAACTGAGGCCGGC",
location=BarcodeLocationType.THREE_PRIME,
name="3prime"
),
BarcodeConfig(
sequence="CTGACTCCTTAAGGGCC",
location=BarcodeLocationType.FIVE_PRIME,
name="5prime"
)
]
# Create configuration
config = BarcodeExtractorConfig(
barcodes=barcodes,
output_prefix="my_extraction",
output_dir="./results",
max_mismatches=0,
search_softclipped=True,
verbose=True
)
# Process BAM file
stats = process_bam_file(config, "tests/test.bam")
# Report results
print(f"Processed {stats.total_reads} reads")
print(f"Found {stats.total_barcode_matches} barcode matches")
or FASTQ files:
from BarcodeSeqKit.fastq_processing import process_fastq_files
# Use the same config as above
fastq_files = ["tests/test.1.fastq.gz", "tests/test.2.fastq.gz"]
stats = process_fastq_files(
config=config,
fastq_files=fastq_files,
compress_output=True,
search_both_reads=True
)
BarcodeSeqKit supports three types of barcode configurations:
- Generic barcodes: Use these when you just want to find a specific sequence regardless of location
- 5’ barcodes: Use these when the barcode is expected at the 5’ end of the sequence
- 3’ barcodes: Use these when the barcode is expected at the 3’ end of the sequence
For each barcode, BarcodeSeqKit tracks two possible orientations:
- Forward (FR): The barcode appears in its specified sequence
- Reverse Complement (RC): The barcode appears as its reverse complement
When working with BAM files, the --search-softclipped
option examines
only the softclipped portions of reads:
- For forward strand reads (+): Examines the 5’ softclipped region
- For reverse strand reads (-): Examines the 3’ softclipped region
This is especially useful for splice leader sequences in trypanosomatids or where barcodes are clipped during alignment.
BarcodeSeqKit offers a range of options to customize your extraction:
Option | Description |
---|---|
--max-mismatches N |
Allow up to N mismatches in barcode detection |
--search-softclipped |
Search in softclipped regions (BAM only) |
--search-both-reads |
Look for barcodes in both reads of paired FASTQ files |
--no-compress |
Disable compression for FASTQ output files |
--verbose |
Enable detailed logging |
For a complete list, run barcodeseqkit --help
.
For complex projects with multiple barcodes, you can use a YAML configuration file:
barcodes:
- sequence: CTGACTCCTTAAGGGCC
location: 5
name: 5prime
description: 5' barcode for experiment X
- sequence: TAACTGAGGCCGGC
location: 3
name: 3prime
description: 3' barcode for experiment X
Then use it with:
barcodeseqkit --bam test.bam --barcode-config my_barcodes.yaml --output-prefix config_extraction
In BarcodeSeqKit, when multiple barcodes are provided, the program uses an efficient approach: it parses the input file(s) only once while searching for all barcodes simultaneously during that single pass.
BarcodeSeqKit generates:
- Categorized output files: BAM or FASTQ files containing reads matching specific barcode/orientation combinations
- Statistics in JSON format: Detailed machine-readable statistics
- Statistics in TSV format: Human-readable tabular statistics
The statistics include: - Total number of reads processed - Total barcode matches found - Match counts by barcode type - Match counts by orientation - Match counts by category - Overall match rate
BarcodeSeqKit now supports a “statistics-only” mode that processes files and generates detailed statistics without writing output sequence files. This feature is particularly useful for:
- Quickly analyzing barcode distributions in large datasets
- Performing QC checks before committing to full processing
- Estimating barcode frequencies without using additional disk space
- Benchmarking and optimization tasks
To use statistics-only mode from the command line, add the
--only-stats
flag:
barcodeseqkit --bam sample.bam \
--barcode5 CTGACTCCTTAAGGGCC \
--output-prefix quick_stats \
--output-dir results \
--only-stats
BarcodeSeqKit has a clean, modular design:
- Core (
00_core.ipynb
): Data structures and configuration - Sequence Utilities (
01_sequence_utils.ipynb
): Barcode detection algorithms - BAM Processing (
02_bam_processing.ipynb
): BAM file handling - FASTQ Processing (
03_fastq_processing.ipynb
): FASTQ file handling - Command-Line Interface (
04_cli.ipynb
): Argument parsing and execution
Let’s walk through a practical example of extracting barcodes from the test data included with BarcodeSeqKit:
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeExtractorConfig
from BarcodeSeqKit.bam_processing import process_bam_file
import os
# Define barcodes that we know are in the test data
barcodes = [
BarcodeConfig(
sequence="TAACTGAGGCCGGC",
location=BarcodeLocationType.THREE_PRIME,
name="3prime"
),
BarcodeConfig(
sequence="CTGACTCCTTAAGGGCC",
location=BarcodeLocationType.FIVE_PRIME,
name="5prime"
)
]
# Create configuration
output_dir = "../tests/index_api"
os.makedirs(output_dir, exist_ok=True)
config = BarcodeExtractorConfig(
barcodes=barcodes,
output_prefix="example_run",
output_dir=output_dir,
max_mismatches=0,
search_softclipped=True,
verbose=True
)
# Process the BAM file (if it exists)
test_bam = "../tests/test.bam"
if os.path.exists(test_bam):
print(f"Processing {test_bam}")
stats = process_bam_file(config, test_bam)
# Print summary statistics
print(f"\nResults summary:")
print(f"Total reads: {stats.total_reads}")
print(f"Total barcode matches: {stats.total_barcode_matches}")
if stats.total_reads > 0:
match_rate = (stats.total_barcode_matches / stats.total_reads) * 100
print(f"Match rate: {match_rate:.2f}%")
# Print barcode-specific statistics
for barcode_name, count in stats.matches_by_barcode.items():
print(f" {barcode_name}: {count} matches")
# Print orientation-specific statistics
for orientation, count in stats.matches_by_orientation.items():
print(f" Orientation {orientation}: {count} matches")
else:
print(f"Test file not found: {test_bam}")
2025-03-24 14:06:24,292 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-24 14:06:24,293 - BarcodeSeqKit - INFO - Output categories: ['barcode3_orientFR', 'barcode3_orientRC', 'barcode5_orientFR', 'barcode5_orientRC', 'noBarcode']
Processing ../tests/test.bam
Classifying reads: 0%| | 0/498 [00:00<?, ?it/s]
2025-03-24 14:06:24,340 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
Writing reads: 0%| | 0/498 [00:00<?, ?it/s]
2025-03-24 14:06:24,411 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode3_orientFR.bam
2025-03-24 14:06:24,424 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode3_orientRC.bam
2025-03-24 14:06:24,433 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode5_orientFR.bam
2025-03-24 14:06:24,442 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_barcode5_orientRC.bam
2025-03-24 14:06:24,450 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_api/example_run_noBarcode.bam
Results summary:
Total reads: 498
Total barcode matches: 18
Match rate: 3.61%
5prime: 10 matches
3prime: 8 matches
Orientation FR: 10 matches
Orientation RC: 8 matches
config = BarcodeExtractorConfig(
barcodes=barcodes,
output_prefix="quick_stats",
output_dir="../tests/quick_stats",
max_mismatches=0,
verbose=True,
write_output_files=False # Skip writing sequence files
)
# Process BAM file and get statistics only
stats = process_bam_file(config, "../tests/test.bam")
# Print summary
print(f"Total reads: {stats.total_reads}")
print(f"Total barcode matches: {stats.total_barcode_matches}")
print(f"Match rate: {stats.total_barcode_matches / stats.total_reads * 100:.2f}%")
2025-03-24 14:07:19,200 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-24 14:07:19,202 - BarcodeSeqKit - INFO - Output categories: ['barcode3_orientFR', 'barcode3_orientRC', 'barcode5_orientFR', 'barcode5_orientRC', 'noBarcode']
Classifying reads: 0%| | 0/498 [00:00<?, ?it/s]
2025-03-24 14:07:19,259 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
Total reads: 498
Total barcode matches: 18
Match rate: 3.61%
!barcodeseqkit --bam ../tests/test.bam \
--barcode5 CTGACTCCTTAAGGGCC \
--barcode3 TAACTGAGGCCGGC \
--output-prefix dual_barcode \
--output-dir ../tests/index_cli
Input BAM file: ../tests/test.bam
Using 5' barcode with sequence: CTGACTCCTTAAGGGCC
Using 3' barcode with sequence: TAACTGAGGCCGGC
Saved configuration to ../tests/index_cli/dual_barcode_config.yaml
2025-03-17 12:18:29,756 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-17 12:18:29,756 - BarcodeSeqKit - INFO - Output categories: ['barcode5_orientFR', 'barcode5_orientRC', 'barcode3_orientFR', 'barcode3_orientRC', 'noBarcode']
Classifying reads: 100%|███████████████████| 498/498 [00:00<00:00, 58040.55it/s]
2025-03-17 12:18:29,810 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
Writing reads: 100%|██████████████████████| 498/498 [00:00<00:00, 230344.44it/s]
2025-03-17 12:18:29,816 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode5_orientFR.bam
2025-03-17 12:18:29,844 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode5_orientRC.bam
2025-03-17 12:18:29,856 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode3_orientFR.bam
2025-03-17 12:18:29,882 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode3_orientRC.bam
2025-03-17 12:18:29,894 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_noBarcode.bam
Extraction complete
BarcodeSeqKit is available as a containerized application, which allows you to run it without installing any dependencies on your system. This guide explains how to use BarcodeSeqKit with Docker and Singularity containers.
- Docker installed on your system
docker pull mtinti/barcodeseqkit:0.0.4
The container includes a test BAM file at /app/tests/test.bam
. You can
run BarcodeSeqKit on this test file and save the results to your local
machine:
# Create a directory for the results
mkdir -p results
# Run the container with the included test file
docker run --rm \
-v $(pwd)/results:/output \
mtinti/barcodeseqkit:0.0.4 \
--bam /app/tests/test.bam \
--barcode5 CTGACTCCTTAAGGGCC \
--barcode3 TAACTGAGGCCGGC \
--output-prefix test_extraction \
--output-dir /output \
--search-softclipped \
--verbose
This command: - Uses the test BAM file already included in the
container - Mounts your local results
directory to /output
inside
the container - Extracts reads matching the specified 5’ and 3’
barcodes - Saves the results in your local results
directory
To process your own data files:
docker run --rm \
-v /path/to/your/data:/data \
mtinti/barcodeseqkit:0.0.4 \
--bam /data/your_sample.bam \
--barcode ACGTACGT \
--output-prefix extraction \
--output-dir /data/results
BarcodeSeqKit provides a streamlined, user-friendly approach to barcode extraction from sequencing data. With its intuitive interface and flexible options, it’s suitable for a wide range of applications, from simple barcode detection to complex multi-barcode analyses.
Whether you’re working with BAM files, FASTQ files, single barcodes, or multiple barcodes with specific locations, BarcodeSeqKit offers a straightforward solution for your barcode extraction needs.
For detailed API documentation, check out the module-specific notebooks: - Core Data Structures - Sequence Utilities - BAM Processing - FASTQ Processing - Command-Line Interface