Hello!
I am working with low-coverage (~4x) Whole Genome Resequencing data (sequenced on a NovaSeq) for 24 individuals.
After using bamUtils' clipOverlap
tool, the output of ValidataSamFile looks like this:
Error Type Count
ERROR:CIGAR_MAPS_OFF_REFERENCE 332
ERROR:INVALID_CIGAR 2067551
ERROR:MATE_NOT_FOUND 3382975
ERROR:MISMATCH_MATE_CIGAR_STRING 13874713
The same table for the bam file used as input for clipOverlap looks like this:
Error Type Count
ERROR:MATE_NOT_FOUND 3382975
Thus, my understanding is that that for some reason the clipOverlap tool produces reads that:
- Have a CIGAR string mapping the read beyond the reference (332)
- Have an invalid CIGAR string (2 millions)
- Have a mismatched string between mates (13 million)
This is an example INVALID_CIGAR:
ERROR::INVALID_CIGAR:Record 47
Read name A00152:841:H7J22DRX3:1:2216:20247:12978
No real operator (M|I|D|N) in CIGAR
And another one for good measure:
ERROR::INVALID_CIGAR:Record 107
Read name A00152:841:H7J22DRX3:2:1112:5267:11397
No real operator (M|I|D|N) in CIGAR
I am pretty new to using gatk, picard and working with CIGAR string, so I am not sure where the issue lays and what might be originating it.
My workflow from FASTQ files to using clipOverlap looks like this (for Individual 1, starting with lane 1 only but merging the two lanes at step 6):
fastp -i ${Ind1_L1}_R1.fastq.gz -I ${Ind1_L1}_R2.fastq.gz -o ... -O ... --unpaired1 ... --unpaired2 ... --low_complexity_filter -l 50
bwa mem -M -R '@RG\tID:H7J22DRX3.${lane}\tPU:H7J22DRX3.${lane}.${sample}\tSM:${sample}\tPL:ILLUMINA' ${REF} ${Ind1_L1}_R1_trimmed.fastq.gz ${Ind1_L1}_R2_trimmed.fastq.gz > ${Ind1_L1}.sam
samtools view ${I1}.sam -b -o ${Ind1_L1}.bam
samtools view -h -q 20 ${Ind1_L1}.bam | samtools view -buS | samtools sort ${Ind1_L1}_sorted.bam
samtools index ${Ind1_L1}_sorted.bam
samtools merge ${Ind1}_sorted.bam ${Ind1_L1}_sorted.bam ${Ind1_L2}_sorted.bam
picard MarkDuplicates I=${Ind1}_sorted.bam O=${Ind1}_sorted_dup.bam VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
bam clipOverlap --in ${Ind1}_sorted_dup.bam --out ${Ind1}_sorted_dup_clipped.bam
I am unsure as to why clipOverlap (or maybe an earlier step in my workflow?) is causing this issue. Any pointers would be appreciated
Cheers, LVB
please, show us the following output:
Furthermore Bwa mem and samtools sort can be combined.
don't use -q 20 to avoid errors like "MATE_NOT_FOUND".
Hello Pierre,
Thanks for your reply and insight.
with samtools view | grep using the erroneous CIGAR string I get:
Make sense I will remove -Q 20, thank you
may be not the problem but why this
RG:Z:H7J22DRX3.${lane}-1C23074F
in the RG.you're using single quotes with bwa-mem and it won't expand '${lane}' : https://stackoverflow.com/questions/6697753/what-are-the-differences-between-single-and-double-quotes-in-bash
ooh good pickup. Might not be the problem but definitely not helping. Will re-run with streamlined workflow and with double quotes and see if the problem persists.
Thank you Pierre.
update. The error persists. With ValidateSamFile I get no error until i use MarkDuplicates.
Then for the MarkDuplicates output I get:
I guess the 117k reads with a mate not found are those where the mate was removed in its entirety because it was deemed a duplicate (checking whether this is the expected behaviour).
When I then run clipOverlaps, I get a similar result as before:
The mate not found reads are less and match those that originated from markDuplicates.
All other errors are more abundant than before, but in similar proportions. I am guessing those additional errors are from the ~3 million Q < 20 reads that were previously being filtered.
Some example errors:
1
2
Output of samtools view for #1
Output of samtools view for #2