This is a new version of the bioinformatics pipeline for analyzing the impact of genetic variants on upstream Open Reading Frames (uORFs).
This tool identifies and annotates variants that affect uORFs, which are small open reading frames located in the 5' untranslated regions (5' UTRs) of mRNAs. Variants in uORFs can impact translation regulation of the main protein-coding sequence, potentially leading to biological effects. The pipeline predicts consequences of variants on both the uORF itself and the downstream main CDS.
- The code was fully rewritten to improve readability and make further editing easier;
- Stability, handling of errors, and logging have been substantially improved;
- Added correct handling of deletions spanning start/end of the reading frame, as well as splice sites;
- Correct processing of transcripts with uORF starts upstream of annotated TSS was enabled;
- More types of main CDS impact are now annotated, especially for overlapping ORFs;
- More relevant information about each variant is now available in the TSV output.
- Python 3.7+
- Conda or Miniconda
- Clone the repository:
git clone https://github.com/yourusername/uorf-variant-analysis.git
cd uorf-variant-analysis
- Create and activate conda environment:
conda env create -f environment.yml
conda activate uorf-variant-analysis
python uorf_annotator.py --bed path/to/uorfs.bed --vcf path/to/variants.vcf \
--gtf path/to/annotation.gtf --fasta path/to/genome.fa \
--output-prefix results/output
Option | Description |
---|---|
--bed | Path to BED file with uORF coordinates (for the default, use data/sorted.v4.bed) |
--vcf | Path to VCF file with variants |
--gtf | Path to GTF annotation file (for the default, use data/combined_uorf.v4.gtf). Important: decompress the file before use |
--fasta | Path to reference genome FASTA file (GRCh38 is expected with the default files) |
--output-prefix | Prefix for output files (.tsv and .bed will be appended) |
--uorf-type | Filter uORFs by start codon type (ALL, ATG, or NON-ATG). Default: ALL |
--exclude-maincds-variants | Exclude variants that are located within the main CDS region |
--debug | Enable detailed debugging logs |
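As an illustration only, the options above could be declared with Python's standard argparse module roughly as follows. This is a sketch, not the actual contents of uorf_annotator.py: the function name build_parser and the choice to mark the five path options as required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(prog="uorf_annotator.py")
    # Assumption: all input paths and the output prefix are mandatory.
    parser.add_argument("--bed", required=True, help="BED file with uORF coordinates")
    parser.add_argument("--vcf", required=True, help="VCF file with variants")
    parser.add_argument("--gtf", required=True, help="Decompressed GTF annotation file")
    parser.add_argument("--fasta", required=True, help="Reference genome FASTA file")
    parser.add_argument("--output-prefix", required=True, help="Prefix for .tsv and .bed outputs")
    parser.add_argument(
        "--uorf-type",
        choices=["ALL", "ATG", "NON-ATG"],
        default="ALL",
        help="Filter uORFs by start codon type",
    )
    parser.add_argument("--exclude-maincds-variants", action="store_true")
    parser.add_argument("--debug", action="store_true")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--bed", "path/to/uorfs.bed",
    "--vcf", "path/to/variants.vcf",
    "--gtf", "path/to/annotation.gtf",
    "--fasta", "path/to/genome.fa",
    "--output-prefix", "results/output",
])
print(args.uorf_type, args.exclude_maincds_variants, args.debug)
```

Note that argparse converts dashes in option names to underscores, so --output-prefix is read back as args.output_prefix.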
The BED file should contain uORF coordinates with the following columns:
- Chromosome
- Start position (0-based)
- End position (exclusive)
- Name field (format: transcript_id|additional_info|start_codon_type)
- Score (not used)
- Strand
Example:
chr1 12345 12400 ENST00000123456|uORF1|ATG 0 +
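To make the column layout concrete, here is a minimal sketch of how one such line could be parsed. UorfRecord and parse_bed_line are hypothetical names used for illustration and are not taken from scripts/parsers.py; the sketch assumes the name field always has at least three pipe-separated parts.

```python
from typing import NamedTuple


class UorfRecord(NamedTuple):
    chrom: str
    start: int             # 0-based, inclusive
    end: int               # exclusive
    transcript_id: str
    info: str
    start_codon_type: str
    strand: str


def parse_bed_line(line: str) -> UorfRecord:
    """Parse one six-column uORF BED line (tab- or space-separated)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    chrom, start, end, name, _score, strand = fields[:6]
    parts = name.split("|")
    if len(parts) < 3:
        raise ValueError(f"unexpected name field: {name!r}")
    # First part is the transcript ID, last part the start codon type;
    # anything in between is kept as free-form additional info.
    return UorfRecord(
        chrom, int(start), int(end),
        parts[0], "|".join(parts[1:-1]), parts[-1], strand,
    )


record = parse_bed_line("chr1 12345 12400 ENST00000123456|uORF1|ATG 0 +")
print(record.transcript_id, record.start_codon_type, record.end - record.start)
```

Because the interval is half-open, the length of the region is simply end minus start.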
Standard VCF format with variants to be analyzed.
Standard GTF format with gene annotations. The file must include CDS and exon features.
- TSV output (output_prefix.tsv): Contains detailed annotation of each variant including:
  - Variant information (chromosome, position, rsID, alleles)
  - uORF information (coordinates, transcript ID)
  - Consequence on uORF (e.g., frameshift, start_lost)
  - Impact on main CDS (e.g., n_terminal_extension)
  - Codon changes
- BED output (output_prefix.bed): Contains visualizable genomic regions affected by variants, which can be loaded into genome browsers.
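For a quick summary of the TSV output, the values of any column can be tallied with the standard library. The sketch below assumes the file has a header row; the exact column names are not listed here, so the column to count is passed in as a parameter, and count_by_column is a hypothetical helper rather than part of the pipeline.

```python
import csv
from collections import Counter


def count_by_column(tsv_path: str, column: str) -> Counter:
    """Count how often each value occurs in one column of a tab-separated file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {tsv_path}")
        # Counter consumes the reader here, while the file is still open.
        return Counter(row[column] for row in reader)
```

For example, calling count_by_column("results/output.tsv", name_of_consequence_column) would return a mapping from each consequence label to the number of variants carrying it, where name_of_consequence_column must be taken from the header of the actual output file.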
- uorf_annotator.py: Main entry point for the pipeline
- scripts/parsers.py: Parsers for GTF and other genomic file formats
- scripts/converters.py: Handles conversion between genomic and transcript coordinates
- scripts/processors.py: Processes variants to determine their effects
- scripts/annotator.py: Annotates variants with biological consequences
- scripts/models.py: Data structures for genomic and transcript features
- scripts/transcript_sequence.py: Handles transcript sequences and uORF extraction
- Pipeline: Main controller that orchestrates the analysis workflow
- CoordinateConverter: Converts between genomic and transcript coordinates
- VariantProcessor: Processes variants and determines their effects
- VariantAnnotator: Annotates variants with biological consequences
- Transcript: Represents transcript data with coordinate mappings
- TranscriptSequence: Handles sequence extraction and manipulation
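The genomic-to-transcript conversion mentioned above can be pictured with a small standalone function. This is a simplified sketch and not the CoordinateConverter implementation: the function name, the use of 0-based half-open exon intervals, and the return value of None for positions outside exons are all assumptions.

```python
from typing import List, Optional, Tuple


def genomic_to_transcript(
    pos: int, exons: List[Tuple[int, int]], strand: str
) -> Optional[int]:
    """Map a 0-based genomic position to a 0-based transcript offset.

    Exons are (start, end) pairs, 0-based and half-open. Returns None when
    the position falls in an intron or outside the transcript.
    """
    # Walk exons in transcription order: ascending for "+", descending for "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    return None


exons = [(100, 110), (200, 210)]
print(genomic_to_transcript(205, exons, "+"))  # second exon, forward strand
print(genomic_to_transcript(205, exons, "-"))  # first transcribed exon, reverse strand
print(genomic_to_transcript(150, exons, "+"))  # intronic position
```

On the minus strand the transcript starts at the highest genomic coordinate, which is why the exon order and the within-exon offset are both reversed.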
The pipeline classifies variants into the following consequence types:
- START_LOST: Loss of uORF start codon
- STOP_LOST: Loss of uORF stop codon
- STOP_GAINED: Creation of a premature stop codon
- FRAMESHIFT: Indel causing a shift in reading frame
- DELETION_AND_STOP_LOST: Complex cases where both deletion and stop loss occur
- MISSENSE: Nonsynonymous variants changing amino acid
- SYNONYMOUS: Synonymous variants preserving amino acid
- SPLICE_SITE: Variants affecting splice sites
- INFRAME_DELETION: Deletions that maintain the reading frame
- INFRAME_INSERTION: Insertions that maintain the reading frame
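The distinction between frameshift and in-frame indels comes down to whether the length change is a multiple of three. The following sketch shows only that rule, using the labels from the list above; classify_indel is a hypothetical helper, and it deliberately ignores the cases the pipeline handles separately (indels touching start or stop codons, or splice sites).

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify an indel by its net length change relative to the reading frame."""
    delta = len(alt) - len(ref)
    if delta == 0:
        # Same length: a substitution, not an indel; not covered by this sketch.
        return "NOT_INDEL"
    if delta % 3 != 0:
        return "FRAMESHIFT"
    return "INFRAME_INSERTION" if delta > 0 else "INFRAME_DELETION"


print(classify_indel("A", "ATTT"))   # three bases inserted
print(classify_indel("ACGT", "A"))   # three bases deleted
print(classify_indel("A", "AT"))     # one base inserted
```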
The pipeline also predicts how each uORF variant affects the main CDS:
- N_TERMINAL_EXTENSION: Extension of protein N-terminus
- OUT_OF_FRAME_OVERLAP: Out-of-frame overlap with main CDS
- UORF_PRODUCT_TRUNCATION: Truncation of uORF product
- UORF_PRODUCT_EXTENSION: Extension of uORF product
- STOP_GAINED: Introduction of premature stop codon
- OVERLAP_EXTENSION: Extension of uORF-CDS overlap
- OVERLAP_TRUNCATION: Truncation of uORF-CDS overlap
- OVERLAP_ELIMINATION: Elimination of uORF-CDS overlap
- MAIN_CDS_UNAFFECTED: No effect on main CDS
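To give some intuition for the overlap-related categories, the sketch below looks at a transcript sequence after a variant has been applied and asks where the uORF now terminates relative to the main CDS start. It is a simplified illustration under stated assumptions, not the pipeline's algorithm: the function names are hypothetical, "NO_OVERLAP" is a placeholder label that does not appear in the list above, and the mapping of the other two cases to N_TERMINAL_EXTENSION and OUT_OF_FRAME_OVERLAP is an assumption.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(seq: str, start: int) -> Optional[int]:
    """Return the index of the first stop codon in frame with `start`, if any."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def overlap_relation(seq: str, uorf_start: int, cds_start: int) -> str:
    """Describe how a uORF relates to the main CDS in transcript coordinates."""
    stop = first_inframe_stop(seq, uorf_start)
    if stop is not None and stop + 3 <= cds_start:
        return "NO_OVERLAP"  # uORF terminates before the main CDS begins
    # The uORF reads into the CDS; the outcome depends on the relative frame.
    if (cds_start - uorf_start) % 3 == 0:
        return "N_TERMINAL_EXTENSION"
    return "OUT_OF_FRAME_OVERLAP"


# Toy transcripts: the uORF starts at 0; the main CDS start is given explicitly.
print(overlap_relation("ATGAAATAGCCATGGGG", 0, 11))  # intact uORF stop codon
print(overlap_relation("ATGAAATGGCCATGGGG", 0, 11))  # stop lost, CDS in another frame
print(overlap_relation("ATGAAATGGATGGGG", 0, 9))     # stop lost, CDS in the same frame
```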
python uorf_annotator.py --bed data/uorfs.bed --vcf data/variants.vcf \
--gtf data/gencode.v38.annotation.gtf --fasta data/GRCh38.p13.genome.fa \
--output-prefix results/all_variants
python uorf_annotator.py --bed data/uorfs.bed --vcf data/variants.vcf \
--gtf data/gencode.v38.annotation.gtf --fasta data/GRCh38.p13.genome.fa \
--output-prefix results/atg_uorf_variants --uorf-type ATG
python uorf_annotator.py --bed data/uorfs.bed --vcf data/variants.vcf \
--gtf data/gencode.v38.annotation.gtf --fasta data/GRCh38.p13.genome.fa \
--output-prefix results/non_cds_variants --exclude-maincds-variants
If you use this pipeline in your research, please cite: