Invalid CIGAR after using bam clipOverlap
0
1
Entering edit mode
6 months ago
MboiTui ▴ 20

Hello!

I am working with low-coverage (~4x) Whole Genome Resequencing data (sequenced on a NovaSeq) for 24 individuals.

After using bamUtils' clipOverlap tool, the output of ValidataSamFile looks like this:

Error Type  Count
ERROR:CIGAR_MAPS_OFF_REFERENCE  332
ERROR:INVALID_CIGAR 2067551
ERROR:MATE_NOT_FOUND    3382975
ERROR:MISMATCH_MATE_CIGAR_STRING    13874713

The same table for the bam file used as input for clipOverlap looks like this:

Error Type  Count
ERROR:MATE_NOT_FOUND    3382975 

Thus, my understanding is that that for some reason the clipOverlap tool produces reads that:

  • Have a CIGAR string mapping the read beyond the reference (332)
  • Have an invalid CIGAR string (2 millions)
  • Have a mismatched string between mates (13 million)

This is an example INVALID_CIGAR:

ERROR::INVALID_CIGAR:Record 47
Read name A00152:841:H7J22DRX3:1:2216:20247:12978
No real operator (M|I|D|N) in CIGAR

And another one for good measure:

ERROR::INVALID_CIGAR:Record 107
Read name A00152:841:H7J22DRX3:2:1112:5267:11397
No real operator (M|I|D|N) in CIGAR

I am pretty new to using gatk, picard and working with CIGAR string, so I am not sure where the issue lays and what might be originating it.

My workflow from FASTQ files to using clipOverlap looks like this (for Individual 1, starting with lane 1 only but merging the two lanes at step 6):

  1. fastp -i ${Ind1_L1}_R1.fastq.gz -I ${Ind1_L1}_R2.fastq.gz -o ... -O ... --unpaired1 ... --unpaired2 ... --low_complexity_filter -l 50

  2. bwa mem -M -R '@RG\tID:H7J22DRX3.${lane}\tPU:H7J22DRX3.${lane}.${sample}\tSM:${sample}\tPL:ILLUMINA' ${REF} ${Ind1_L1}_R1_trimmed.fastq.gz ${Ind1_L1}_R2_trimmed.fastq.gz > ${Ind1_L1}.sam

  1. samtools view ${I1}.sam -b -o ${Ind1_L1}.bam

  2. samtools view -h -q 20 ${Ind1_L1}.bam | samtools view -buS | samtools sort ${Ind1_L1}_sorted.bam

  3. samtools index ${Ind1_L1}_sorted.bam

  4. samtools merge ${Ind1}_sorted.bam ${Ind1_L1}_sorted.bam ${Ind1_L2}_sorted.bam

  5. picard MarkDuplicates I=${Ind1}_sorted.bam O=${Ind1}_sorted_dup.bam VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true

  6. bam clipOverlap --in ${Ind1}_sorted_dup.bam --out ${Ind1}_sorted_dup_clipped.bam

I am unsure as to why clipOverlap (or maybe an earlier step in my workflow?) is causing this issue. Any pointers would be appreciated

Cheers, LVB

validatesamfile cigar bam clipoverlap • 687 views
ADD COMMENT
1
Entering edit mode

please, show us the following output:

samtools view "${Ind1}_sorted_dup.bam" | grep -F "A00152:841:H7J22DRX3:2:1112:5267:11397"

Furthermore Bwa mem and samtools sort can be combined.

bwa mem -M -R '@RG\tID:H7J22DRX3.${lane}\tPU:H7J22DRX3.${lane}.${sample}\tSM:${sample}\tPL:ILLUMINA' ${REF} ${Ind1_L1}_R1_trimmed.fastq.gz ${Ind1_L1}_R2_trimmed.fastq.gz  |  samtools sort ${Ind1_L1}_sorted.bam

don't use -q 20 to avoid errors like "MATE_NOT_FOUND".

ADD REPLY
0
Entering edit mode

Hello Pierre,

Thanks for your reply and insight.

with samtools view | grep using the erroneous CIGAR string I get:

A00152:841:H7J22DRX3:2:1112:5267:11397  163 scaffold_1  3342    60  6S36M56S    =   3342    36  GAGCTACCCCAAGGCCAGTAGACATAACGAGGATGGAGCTACCCCAAGGCCAGTAGACATAACGAGGATGGAGCTACCCCAAGGCCAGTAGACATAAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  SA:Z:scaffold_1,3342,+,41S36M21S,60,1;  MC:Z:6S36M56S   MD:Z:28C7   PG:Z:MarkDuplicates RG:Z:H7J22DRX3.${lane}-1C23074F NM:i:1  AS:i:31 XS:i:0
A00152:841:H7J22DRX3:2:1112:5267:11397  419 scaffold_1  3342    60  41H36M21H   =   3342    36  CCCCAAGGCCAGTAGACATAACGAGGATGGAGCTAC    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    SA:Z:scaffold_1,3342,+,6S36M56S,60,1;   MC:Z:6H36M56H   MD:Z:28C7   PG:Z:MarkDuplicates RG:Z:H7J22DRX3.${lane}-1C23074F NM:i:1  AS:i:31 XS:i:0
A00152:841:H7J22DRX3:2:1112:5267:11397  83  scaffold_1  3342    60  6S36M56S    =   3342    -36 GAGCTACCCCAAGGCCAGTAGACATAACGAGGATGGAGCTACCCCAAGGCCAGTAGACATAACGAGGATGGAGCTACCCCAAGGCCAGTAGACATAAC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF  SA:Z:scaffold_1,3342,-,41S36M21S,60,1;  MC:Z:6S36M56S   MD:Z:28C7   PG:Z:MarkDuplicates RG:Z:H7J22DRX3.${lane}-1C23074F NM:i:1  AS:i:31 XS:i:0
A00152:841:H7J22DRX3:2:1112:5267:11397  339 scaffold_1  3342    60  41H36M21H   =   3342    -36 CCCCAAGGCCAGTAGACATAACGAGGATGGAGCTAC    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    SA:Z:scaffold_1,3342,-,6S36M56S,60,1;   MC:Z:6H36M56H   MD:Z:28C7   PG:Z:MarkDuplicates RG:Z:H7J22DRX3.${lane}-1C23074F NM:i:1  AS:i:31 XS:i:0

Make sense I will remove -Q 20, thank you

ADD REPLY
1
Entering edit mode

may be not the problem but why this RG:Z:H7J22DRX3.${lane}-1C23074F in the RG.

you're using single quotes with bwa-mem and it won't expand '${lane}' : https://stackoverflow.com/questions/6697753/what-are-the-differences-between-single-and-double-quotes-in-bash

ADD REPLY
0
Entering edit mode

ooh good pickup. Might not be the problem but definitely not helping. Will re-run with streamlined workflow and with double quotes and see if the problem persists.

Thank you Pierre.

ADD REPLY
0
Entering edit mode

update. The error persists. With ValidateSamFile I get no error until i use MarkDuplicates.

Then for the MarkDuplicates output I get:

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:MATE_NOT_FOUND    117529

I guess the 117k reads with a mate not found are those where the mate was removed in its entirety because it was deemed a duplicate (checking whether this is the expected behaviour).

When I then run clipOverlaps, I get a similar result as before:

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:CIGAR_MAPS_OFF_REFERENCE  673
ERROR:INVALID_CIGAR 3369673
ERROR:MATE_NOT_FOUND    117529
ERROR:MISMATCH_MATE_CIGAR_STRING    18341785

The mate not found reads are less and match those that originated from markDuplicates.

All other errors are more abundant than before, but in similar proportions. I am guessing those additional errors are from the ~3 million Q < 20 reads that were previously being filtered.

Some example errors:

1

ERROR::MISMATCH_MATE_CIGAR_STRING:Record 11
Read name A00152:841:H7J22DRX3:1:2151:15338:24455
Mate CIGAR string does not match CIGAR string of mate

2

ERROR::INVALID_CIGAR:Record 48
Read name A00152:841:H7J22DRX3:1:2216:20247:12978
No real operator (M|I|D|N) in CIGAR

Output of samtools view for #1

A00152:841:H7J22DRX3:1:2151:15338:24455 163 scaffold_1  365 60  150M    =   515 255 GCTCTATATACTTCACTTATATACATCACTGCAACAGACTAATACATATAATGTTCTATAATGTACTGTATACACATCAATGTGCACACAGAGCAGCCCAGCCTTTGTATACAGCAGCATTAACCCTTTACTGGAAATAAGGGTCCAAAC  :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFF:FFFFFF:FFFFFFMC:Z:151M  MD:Z:28G2T118   PG:Z:MarkDuplicates RG:Z:H7J22DRX3.L001 NM:i:2  AS:i:140    XS:i:22
A00152:841:H7J22DRX3:1:2151:15338:24455 83  scaffold_1  515 60  46S105M =   365 -255    TTGTATACAGCAGCATTAACCCTTTACTGGAAATAAGGGTCCAAACCCTGAAAAACATTAAGAACGATCTTCAGCGTAAAGAAGAACAAGGAGTCCTGGAAGTGATGATGTGGCCCCCACAGAGGGGCCCCTGATCTCATCATCCTCCAGT F:FFFFFFFFFFFFFFFFFFFF:F,FFFFFFFF,FFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:141    MC:Z:150M   MD:Z:65T78A6    NM:i:2  PG:Z:MarkDuplicates RG:Z:H7J22DRX3.L001 XS:i:96

Output of samtools view for #2

A00152:841:H7J22DRX3:1:2216:20247:12978 163 scaffold_1  1291    60  70S =   1291    70  GTCCTAACCCTAACCCTCACTCCAACCCTAACCCTAGTCCTATCCCTAACCCTTACCCTCACTCCAACCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  MC:Z:70M    MD:Z:70 PG:Z:MarkDuplicates RG:Z:H7J22DRX3.L001 NM:i:0  AS:i:70 XS:i:36
A00152:841:H7J22DRX3:1:2216:20247:12978 83  scaffold_1  1291    60  70M =   1291    -70 GTCCTAACCCTAACCCTCACTCCAACCCTAACCCTAGTCCTATCCCTAACCCTTACCCTCACTCCAACCC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  MC:Z:70M    MD:Z:70 PG:Z:MarkDuplicates RG:Z:H7J22DRX3.L001 NM:i:0  AS:i:70 XS:i:36
ADD REPLY

Login before adding your answer.

Traffic: 2402 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6