"Splitting" the data #273

JesperHJespersen · 2025-02-28T11:10:34Z

Hi :)

I am annotating VCF files from SAVANA CNA caller using annotSV and am a bit curious as to how the split function works.

It seems very useful to have one row in the df corresponding to each gene effect, instead of each SV, however I seem to have som problem getting AnnotSV to split the data like that:

With each SV having numerous gene names listed.

Is this simply because I run larger CNAs through the algorithm, and so it cannot separate them into singular genes? Or am I not setting it up correctly?

Here is the script I have run:

#!/bin/bash
#SBATCH --job-name=annotsv_81_split
#SBATCH --output=annotsv_81_split.out
#SBATCH --error=annotsv_81_split.err
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --account=Renal_long_read

####################################### File Paths and Directories #######################################
INPUT_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number.vcf"
FIXED_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number_fixed.vcf"
BASE_OUTPUT_DIR="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/annotation/Patient_81_split"
OUTPUT_DIR="$BASE_OUTPUT_DIR"
OUTPUT_TSV="$OUTPUT_DIR/81_CNA_annotated.tsv"
OUTPUT_VCF="$OUTPUT_DIR/81_CNA_annotated.vcf"
GENOME_BUILD="GRCh38"

Directory for annotation files

ANNOTSV_ANNOTATIONS="/home/jesperjespersen/AnnotSV_annotations"

####################################### Create Output Directory #######################################
mkdir -p "$OUTPUT_DIR"

####################################### Input File Check and VCF Header Fix #######################################
if [[ ! -f "$INPUT_VCF" ]]; then
echo "Error: Input VCF file not found!"
exit 1
fi

Fix VCF header if the FORMAT field is missing

grep "^#CHROM" "$INPUT_VCF" | grep -q "FORMAT" ||
awk 'BEGIN {OFS="\t"}
/^#CHROM/ { print $0, "FORMAT", "SAMPLE"; next }
!/^#/ { print $0, "GT", "0/1" }' "$INPUT_VCF" > "$FIXED_VCF"

If no change was made, copy the original file to FIXED_VCF

if [[ ! -s "$FIXED_VCF" ]]; then
cp "$INPUT_VCF" "$FIXED_VCF"
fi

####################################### Run AnnotSV with Split Analysis #######################################
AnnotSV
-SVinputFile "$FIXED_VCF"
-genomeBuild "$GENOME_BUILD"
-annotationsDir "$ANNOTSV_ANNOTATIONS"
-outputFile "$OUTPUT_TSV"
-outputDir "$OUTPUT_DIR"
-annotationMode split
-vcf 1

Move the generated VCF to the final location (if created)

GENERATED_VCF="${OUTPUT_DIR}/$(basename "$FIXED_VCF" .vcf).annotated.vcf"

if [[ -f "$GENERATED_VCF" ]]; then
mv "$GENERATED_VCF" "$OUTPUT_VCF"
else
echo "Warning: AnnotSV did not generate a VCF file as expected."
fi

Best regards
Jesper :)

JesperHJespersen · 2025-02-28T11:26:01Z

This is ofcourse not an issue to do in R, but in that case, is it correctly understood that all relevant genes are listed in the INFO section under Gene_names?

As I am interested in all genes affected by the CNAs, i also suppose it is beneficial to set Reseslect 1 to 0?.

Thank you for your time :)

lgmgeo · 2025-02-28T12:06:12Z

I'm not sure I understand your request...
The way you run AnnotSV is an internal protocol of your lab, I can't help at that level.

This is ofcourse not an issue to do in R

What's the issue? AnnotSV is coded in Tcl, I don't understand the link with the R language.

is it correctly understood that all relevant genes are listed in the INFO section under Gene_names?

To convert the output format from tsv to VCF, AnnotSV relies on the variantconvert tool.
Please, see https://github.com/SamuelNicaise/variantconvert for more information

As I am interested in all genes affected by the CNAs, i also suppose it is beneficial to set Reseslect 1 to 0?

REselect1 is defined to work with Regulatory Element.
cf README

JesperHJespersen · 2025-02-28T13:44:19Z

Yes sorry, my questions were probably badly expressed.

I am unsure if the AnnotSv command I run (ignoring the filtering and all that), is correct if I would like an output like showed as an example in the READ.ME FAQ:

Where each gene has its own row, instead of each SV having its own row. I have read through the guide and from what I understand it should be done by setting -annotationMode to split. However after annotation my data doesn't have any repeats of chromosomal regions and a lot of genes for each SV, so I am unsure if I have implemented it correctly.

I have also read your description of the REselect1 argument but I am unsure of what it does and thus the default output of AnnotSV. Does AnnotSV by default ignore all genes that have not been previously categorized into one of the 5 groups?

Thank you for your time and sorry for my confusion.

Best
Jesper

lgmgeo · 2025-02-28T14:08:01Z

I am unsure if the AnnotSv command I run (ignoring the filtering and all that), is correct if I would like an output like showed as an example in the READ.ME FAQ

=> you need to analyse the tsv output file (not the VCF).

TSV output file: each gene has its own row (-annotationMode both)
VCF output file: each SV have its own row

I have also read your description of the REselect1 argument but I am unsure of what it does

This option only impacts the "RE_gene" column.

Does AnnotSV by default ignore all genes that have not been previously categorized into one of the 5 groups?

No

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Splitting" the data #273

"Splitting" the data #273

JesperHJespersen commented Feb 28, 2025

JesperHJespersen commented Feb 28, 2025

lgmgeo commented Feb 28, 2025

JesperHJespersen commented Feb 28, 2025

lgmgeo commented Feb 28, 2025 •

edited

Loading

"Splitting" the data #273

"Splitting" the data #273

Comments

JesperHJespersen commented Feb 28, 2025

Directory for annotation files

Fix VCF header if the FORMAT field is missing

If no change was made, copy the original file to FIXED_VCF

Move the generated VCF to the final location (if created)

JesperHJespersen commented Feb 28, 2025

lgmgeo commented Feb 28, 2025

JesperHJespersen commented Feb 28, 2025

lgmgeo commented Feb 28, 2025 • edited Loading

lgmgeo commented Feb 28, 2025 •

edited

Loading