-
Notifications
You must be signed in to change notification settings - Fork 30
Description
Hello!
I am working with low-coverage (~4x) Whole Genome Resequencing data (sequenced on a NovaSeq) for 24 individuals.
After using bamUtils' clipOverlap tool, the output of ValidataSamFile looks like this:
Error Type Count
ERROR:CIGAR_MAPS_OFF_REFERENCE 332
ERROR:INVALID_CIGAR 2067551
ERROR:MATE_NOT_FOUND 3382975
ERROR:MISMATCH_MATE_CIGAR_STRING 13874713
The same table for the bam file used as input for clipOverlap looks like this:
Error Type Count
ERROR:MATE_NOT_FOUND 3382975
Thus, my understanding is that that for some reason the clipOverlap tool produces reads that:
Have a CIGAR string mapping the read beyond the reference (332)
Have an invalid CIGAR string (2 millions)
Have a mismatched string between mates (13 million)
This is an example INVALID_CIGAR:
ERROR::INVALID_CIGAR:Record 47
Read name A00152:841:H7J22DRX3:1:2216:20247:12978
No real operator (M|I|D|N) in CIGAR
And another one for good measure:
ERROR::INVALID_CIGAR:Record 107
Read name A00152:841:H7J22DRX3:2:1112:5267:11397
No real operator (M|I|D|N) in CIGAR
My workflow from FASTQ files to using clipOverlap looks like this (for Individual 1, starting with lane 1 only but merging the two lanes at step 6):
fastp -i ${Ind1_L1}_R1.fastq.gz -I ${Ind1_L1}_R2.fastq.gz -o ... -O ... --unpaired1 ... --unpaired2 ... --low_complexity_filter -l 50
bwa mem -M -R '@RG\tID:H7J22DRX3.${lane}\tPU:H7J22DRX3.${lane}.${sample}\tSM:${sample}\tPL:ILLUMINA' ${REF} ${Ind1_L1}_R1_trimmed.fastq.gz ${Ind1_L1}_R2_trimmed.fastq.gz > ${Ind1_L1}.sam
samtools view ${I1}.sam -b -o ${Ind1_L1}.bam
samtools view -h -q 20 ${Ind1_L1}.bam | samtools view -buS | samtools sort ${Ind1_L1}_sorted.bam
samtools index ${Ind1_L1}_sorted.bam
samtools merge ${Ind1}_sorted.bam ${Ind1_L1}_sorted.bam ${Ind1_L2}_sorted.bam
picard MarkDuplicates I=${Ind1}_sorted.bam O=${Ind1}_sorted_dup.bam VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
bam clipOverlap --in ${Ind1}_sorted_dup.bam --out ${Ind1}_sorted_dup_clipped.bam
I am unsure as to why clipOverlap is causing this issue. Any pointers would be appreciated
Cheers, LVB