-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Splitting" the data #273
Comments
This is ofcourse not an issue to do in R, but in that case, is it correctly understood that all relevant genes are listed in the INFO section under Gene_names? As I am interested in all genes affected by the CNAs, i also suppose it is beneficial to set Reseslect 1 to 0?. Thank you for your time :) |
I'm not sure I understand your request...
What's the issue? AnnotSV is coded in Tcl, I don't understand the link with the R language.
To convert the output format from tsv to VCF, AnnotSV relies on the variantconvert tool.
REselect1 is defined to work with Regulatory Element. |
Yes sorry, my questions were probably badly expressed.
Where each gene has its own row, instead of each SV having its own row. I have read through the guide and from what I understand it should be done by setting -annotationMode to split. However after annotation my data doesn't have any repeats of chromosomal regions and a lot of genes for each SV, so I am unsure if I have implemented it correctly.
Thank you for your time and sorry for my confusion. Best |
=> you need to analyse the tsv output file (not the VCF). TSV output file: each gene has its own row (-annotationMode both)
This option only impacts the "RE_gene" column.
No |
Hi :)
I am annotating VCF files from SAVANA CNA caller using annotSV and am a bit curious as to how the split function works.
It seems very useful to have one row in the df corresponding to each gene effect, instead of each SV, however I seem to have som problem getting AnnotSV to split the data like that:
With each SV having numerous gene names listed.
Is this simply because I run larger CNAs through the algorithm, and so it cannot separate them into singular genes? Or am I not setting it up correctly?
Here is the script I have run:
#!/bin/bash
#SBATCH --job-name=annotsv_81_split
#SBATCH --output=annotsv_81_split.out
#SBATCH --error=annotsv_81_split.err
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --account=Renal_long_read
####################################### File Paths and Directories #######################################
INPUT_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number.vcf"
FIXED_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number_fixed.vcf"
BASE_OUTPUT_DIR="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/annotation/Patient_81_split"
OUTPUT_DIR="$BASE_OUTPUT_DIR"
OUTPUT_TSV="$OUTPUT_DIR/81_CNA_annotated.tsv"
OUTPUT_VCF="$OUTPUT_DIR/81_CNA_annotated.vcf"
GENOME_BUILD="GRCh38"
Directory for annotation files
ANNOTSV_ANNOTATIONS="/home/jesperjespersen/AnnotSV_annotations"
####################################### Create Output Directory #######################################
mkdir -p "$OUTPUT_DIR"
####################################### Input File Check and VCF Header Fix #######################################
if [[ ! -f "$INPUT_VCF" ]]; then
echo "Error: Input VCF file not found!"
exit 1
fi
Fix VCF header if the FORMAT field is missing
grep "^#CHROM" "$INPUT_VCF" | grep -q "FORMAT" ||
awk 'BEGIN {OFS="\t"}
/^#CHROM/ { print $0, "FORMAT", "SAMPLE"; next }
!/^#/ { print $0, "GT", "0/1" }' "$INPUT_VCF" > "$FIXED_VCF"
If no change was made, copy the original file to FIXED_VCF
if [[ ! -s "$FIXED_VCF" ]]; then
cp "$INPUT_VCF" "$FIXED_VCF"
fi
####################################### Run AnnotSV with Split Analysis #######################################
AnnotSV
-SVinputFile "$FIXED_VCF"
-genomeBuild "$GENOME_BUILD"
-annotationsDir "$ANNOTSV_ANNOTATIONS"
-outputFile "$OUTPUT_TSV"
-outputDir "$OUTPUT_DIR"
-annotationMode split
-vcf 1
Move the generated VCF to the final location (if created)
GENERATED_VCF="${OUTPUT_DIR}/$(basename "$FIXED_VCF" .vcf).annotated.vcf"
if [[ -f "$GENERATED_VCF" ]]; then
mv "$GENERATED_VCF" "$OUTPUT_VCF"
else
echo "Warning: AnnotSV did not generate a VCF file as expected."
fi
Best regards
Jesper :)
The text was updated successfully, but these errors were encountered: