Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Splitting" the data #273

Open
JesperHJespersen opened this issue Feb 28, 2025 · 4 comments
Open

"Splitting" the data #273

JesperHJespersen opened this issue Feb 28, 2025 · 4 comments

Comments

@JesperHJespersen
Copy link

Hi :)

I am annotating VCF files from SAVANA CNA caller using annotSV and am a bit curious as to how the split function works.

It seems very useful to have one row in the df corresponding to each gene effect, instead of each SV, however I seem to have som problem getting AnnotSV to split the data like that:

Image

With each SV having numerous gene names listed.

Is this simply because I run larger CNAs through the algorithm, and so it cannot separate them into singular genes? Or am I not setting it up correctly?

Here is the script I have run:

#!/bin/bash
#SBATCH --job-name=annotsv_81_split
#SBATCH --output=annotsv_81_split.out
#SBATCH --error=annotsv_81_split.err
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --account=Renal_long_read

####################################### File Paths and Directories #######################################
INPUT_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number.vcf"
FIXED_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number_fixed.vcf"
BASE_OUTPUT_DIR="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/annotation/Patient_81_split"
OUTPUT_DIR="$BASE_OUTPUT_DIR"
OUTPUT_TSV="$OUTPUT_DIR/81_CNA_annotated.tsv"
OUTPUT_VCF="$OUTPUT_DIR/81_CNA_annotated.vcf"
GENOME_BUILD="GRCh38"

Directory for annotation files

ANNOTSV_ANNOTATIONS="/home/jesperjespersen/AnnotSV_annotations"

####################################### Create Output Directory #######################################
mkdir -p "$OUTPUT_DIR"

####################################### Input File Check and VCF Header Fix #######################################
if [[ ! -f "$INPUT_VCF" ]]; then
echo "Error: Input VCF file not found!"
exit 1
fi

Fix VCF header if the FORMAT field is missing

grep "^#CHROM" "$INPUT_VCF" | grep -q "FORMAT" ||
awk 'BEGIN {OFS="\t"}
/^#CHROM/ { print $0, "FORMAT", "SAMPLE"; next }
!/^#/ { print $0, "GT", "0/1" }' "$INPUT_VCF" > "$FIXED_VCF"

If no change was made, copy the original file to FIXED_VCF

if [[ ! -s "$FIXED_VCF" ]]; then
cp "$INPUT_VCF" "$FIXED_VCF"
fi

####################################### Run AnnotSV with Split Analysis #######################################
AnnotSV
-SVinputFile "$FIXED_VCF"
-genomeBuild "$GENOME_BUILD"
-annotationsDir "$ANNOTSV_ANNOTATIONS"
-outputFile "$OUTPUT_TSV"
-outputDir "$OUTPUT_DIR"
-annotationMode split
-vcf 1

Move the generated VCF to the final location (if created)

GENERATED_VCF="${OUTPUT_DIR}/$(basename "$FIXED_VCF" .vcf).annotated.vcf"

if [[ -f "$GENERATED_VCF" ]]; then
mv "$GENERATED_VCF" "$OUTPUT_VCF"
else
echo "Warning: AnnotSV did not generate a VCF file as expected."
fi

Best regards
Jesper :)

@JesperHJespersen
Copy link
Author

This is ofcourse not an issue to do in R, but in that case, is it correctly understood that all relevant genes are listed in the INFO section under Gene_names?

As I am interested in all genes affected by the CNAs, i also suppose it is beneficial to set Reseslect 1 to 0?.

Thank you for your time :)

@lgmgeo
Copy link
Owner

lgmgeo commented Feb 28, 2025

I'm not sure I understand your request...
The way you run AnnotSV is an internal protocol of your lab, I can't help at that level.

This is ofcourse not an issue to do in R

What's the issue? AnnotSV is coded in Tcl, I don't understand the link with the R language.

is it correctly understood that all relevant genes are listed in the INFO section under Gene_names?

To convert the output format from tsv to VCF, AnnotSV relies on the variantconvert tool.
Please, see https://github.com/SamuelNicaise/variantconvert for more information

As I am interested in all genes affected by the CNAs, i also suppose it is beneficial to set Reseslect 1 to 0?

REselect1 is defined to work with Regulatory Element.
cf README

Image

@JesperHJespersen
Copy link
Author

Yes sorry, my questions were probably badly expressed.

  1. I am unsure if the AnnotSv command I run (ignoring the filtering and all that), is correct if I would like an output like showed as an example in the READ.ME FAQ:

Image

Where each gene has its own row, instead of each SV having its own row. I have read through the guide and from what I understand it should be done by setting -annotationMode to split. However after annotation my data doesn't have any repeats of chromosomal regions and a lot of genes for each SV, so I am unsure if I have implemented it correctly.

  1. I have also read your description of the REselect1 argument but I am unsure of what it does and thus the default output of AnnotSV. Does AnnotSV by default ignore all genes that have not been previously categorized into one of the 5 groups?

Thank you for your time and sorry for my confusion.

Best
Jesper

@lgmgeo
Copy link
Owner

lgmgeo commented Feb 28, 2025

I am unsure if the AnnotSv command I run (ignoring the filtering and all that), is correct if I would like an output like showed as an example in the READ.ME FAQ

=> you need to analyse the tsv output file (not the VCF).

TSV output file: each gene has its own row (-annotationMode both)
VCF output file: each SV have its own row

I have also read your description of the REselect1 argument but I am unsure of what it does

This option only impacts the "RE_gene" column.

Does AnnotSV by default ignore all genes that have not been previously categorized into one of the 5 groups?

No

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants