Hi everyone,
I’m trying to troubleshoot a low duplex consensus read rate in duplex sequencing data, processed with GroupReadsByUmi and CallDuplexConsensusReads from the fgbio package.
For CallDuplexConsensusReads, I use:
java -jar fgbio-1.4.0.jar CallDuplexConsensusReads \
--input=grouped.bam --output=consensus.unmapped.bam \
--error-rate-pre-umi=15 --error-rate-post-umi=10 \
--min-input-base-quality=10
But I get:
[2021/12/06 17:33:10 | CallDuplexConsensusReads | Info] Raw Reads Filtered Due to Not Enough Reads (Either Total, AB, or BA): 20,787,008 (0.9964).
When looking at grouped.bam, I noticed that each MI value is shared by only two consecutive records, which I presume are the two mates of a read pair. Also, both reads in a pair carry the same strand suffix (MI = N/A & N/A, not N/A & N/B).
Example from grouped.bam:
A00133:442:HM5MFDSX2:3:2615:27823:2895 83 1 3050000 60 29S71M = 3050000 -71 AATGTTTTGCCAGTTCTTTTAAATTCTCAACAGTAAAAGAAACAAATTTTAGAGGAACTAAAATATTGCAATCTCCATGCCATGGGAGAATAGAAATACA FFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:CGG ZB:Z:AAT MC:Z:24S71M50S MD:Z:0N70 PG:Z:bwa RG:Z:A MI:Z:0/B NM:i:1 MQ:i:60 UQ:i:37 AS:i:71 RX:Z:CGG-AAT
A00133:442:HM5MFDSX2:3:2615:27823:2895 163 1 3050000 60 24S71M50S = 3050000 71 TTTGCCAGTTCTTTTAAATTCTCAACAGTAAAAGAAACAAATTTTAGAGGAACTAAAATATTGCAATCTCCATGCCATGGGAGAATAGAAATACATACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFF:,FFFFF:FFFFFFFF:FFFFFFFFF ZA:Z:CGG ZB:Z:AAT MC:Z:29S71M MD:Z:0N70 PG:Z:bwa RG:Z:A MI:Z:0/B NM:i:1 MQ:i:60 UQ:i:37 AS:i:72 RX:Z:CGG-AAT
A00133:442:HM5MFDSX2:3:1468:3821:33786 99 1 3050000 60 21S124M = 3050014 159 GCCAGTTCTTTTAAATTCTCAACAGTAAAAGAAACAAATTTTAGAGGAACTAAAATATTGCAATCTCCATGCCATGGGAGAATAGAAATACATCTATCAGAGATATTGTCAGGACCCGAAAGAAATGAAACCCAAAGTGGAGGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFF:,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:CGA ZB:Z:CTA MC:Z:145M MD:Z:0N123 PG:Z:bwa RG:Z:A MI:Z:1/A NM:i:1 MQ:i:60 UQ:i:37 AS:i:124 RX:Z:CGA-CTA
A00133:442:HM5MFDSX2:3:1468:3821:33786 147 1 3050014 60 145M = 3050000 -159 AAATTTTAGAGGAACTAAAATATTGCAATCTCCATGCCATGGGAGAATAGAAATACATCTATCAGAGATATTGTCAGGACCCGAAAGAAATGAAACCCAAAGTGGAGGTCCACTAGTGAGTATAATGATGGCAGGTATTTTTGCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ZA:Z:CGA ZB:Z:CTA MC:Z:21S124M MD:Z:145 PG:Z:bwa RG:Z:A MI:Z:1/A NM:i:0 MQ:i:60 UQ:i:0 AS:i:145 RX:Z:CGA-CTA
I think this is why the consensus rate is so low (a lack of reads sharing the same MI, with roughly 99.6% of raw reads filtered out), though I’m not sure.
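For reference, here is roughly the check I’m using to tally reads per molecule ID (assuming samtools and awk are available; the sub() strips the /A and /B strand suffixes so both strands of a molecule count toward the same family):

# count reads per molecule ID (MI tag with the /A or /B strand suffix removed)
samtools view grouped.bam \
  | awk '{ for (i = 12; i <= NF; i++) if ($i ~ /^MI:Z:/) { sub(/\/[AB]$/, "", $i); print $i } }' \
  | sort | uniq -c | sort -rn | head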
For reference, this is my GroupReadsByUmi command:
java -jar fgbio-1.4.0.jar GroupReadsByUmi \
--input=mapped.bam --output=grouped.bam \
--strategy=paired --edits=1 --min-map-q=1
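If it would help with diagnosis, I was planning to re-run grouping with the family-size histogram output and to collect the duplex family metrics (flag and tool names taken from the fgbio docs as I understand them, so please correct me if they’re off):

# re-run grouping with a family-size histogram to see how many reads each MI family actually has
java -jar fgbio-1.4.0.jar GroupReadsByUmi \
  --input=mapped.bam --output=grouped.bam \
  --strategy=paired --edits=1 --min-map-q=1 \
  --family-size-histogram=family_sizes.txt

# collect duplex-specific family metrics (AB/BA read counts per molecule) from the grouped BAM
java -jar fgbio-1.4.0.jar CollectDuplexSeqMetrics \
  --input=grouped.bam --output=duplex_metrics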
There was adequate amplification in the original sequencing experiment, so there should be an ample number of reads available for grouping by UMI.
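To back that up, I can post the output of a quick flagstat on the mapped input (again assuming samtools is available):

# total / mapped / properly paired read counts for the BAM fed into GroupReadsByUmi
samtools flagstat mapped.bam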
Thanks for your help!
Alex