Skip to content

Erroneous CIGAR string after clipOverlap #72

@MboiTui

Description

@MboiTui

Hello!

I am working with low-coverage (~4x) Whole Genome Resequencing data (sequenced on a NovaSeq) for 24 individuals.

After using bamUtils' clipOverlap tool, the output of ValidataSamFile looks like this:

Error Type  Count
ERROR:CIGAR_MAPS_OFF_REFERENCE  332
ERROR:INVALID_CIGAR 2067551
ERROR:MATE_NOT_FOUND    3382975
ERROR:MISMATCH_MATE_CIGAR_STRING    13874713

The same table for the bam file used as input for clipOverlap looks like this:

Error Type  Count
ERROR:MATE_NOT_FOUND    3382975 

Thus, my understanding is that that for some reason the clipOverlap tool produces reads that:

Have a CIGAR string mapping the read beyond the reference (332)
Have an invalid CIGAR string (2 millions)
Have a mismatched string between mates (13 million)
This is an example INVALID_CIGAR:

ERROR::INVALID_CIGAR:Record 47
Read name A00152:841:H7J22DRX3:1:2216:20247:12978
No real operator (M|I|D|N) in CIGAR

And another one for good measure:

ERROR::INVALID_CIGAR:Record 107
Read name A00152:841:H7J22DRX3:2:1112:5267:11397
No real operator (M|I|D|N) in CIGAR

My workflow from FASTQ files to using clipOverlap looks like this (for Individual 1, starting with lane 1 only but merging the two lanes at step 6):

fastp -i ${Ind1_L1}_R1.fastq.gz -I ${Ind1_L1}_R2.fastq.gz -o ... -O ... --unpaired1 ... --unpaired2 ... --low_complexity_filter -l 50

bwa mem -M -R '@RG\tID:H7J22DRX3.${lane}\tPU:H7J22DRX3.${lane}.${sample}\tSM:${sample}\tPL:ILLUMINA' ${REF} ${Ind1_L1}_R1_trimmed.fastq.gz ${Ind1_L1}_R2_trimmed.fastq.gz > ${Ind1_L1}.sam

samtools view ${I1}.sam -b -o ${Ind1_L1}.bam

samtools view -h -q 20 ${Ind1_L1}.bam | samtools view -buS | samtools sort ${Ind1_L1}_sorted.bam

samtools index ${Ind1_L1}_sorted.bam

samtools merge ${Ind1}_sorted.bam ${Ind1_L1}_sorted.bam ${Ind1_L2}_sorted.bam

picard MarkDuplicates I=${Ind1}_sorted.bam O=${Ind1}_sorted_dup.bam VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true

bam clipOverlap --in ${Ind1}_sorted_dup.bam --out ${Ind1}_sorted_dup_clipped.bam

I am unsure as to why clipOverlap is causing this issue. Any pointers would be appreciated

Cheers, LVB

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions