I have an issue with some mapped sequencing with 9-base UMIs (the UMI is at the end of the read ID). A lot of the time, the individual molecules are (consensus) deduplicated correctly, but where there are very slight location differences between two or more sequences of the same molecule, such as a slightly shorter read,
NB500905:50:HVY23AFXY:2:21206:18601:14328:AAAAAAAAA 97 chr4 76426801 60 66M = 76426799 -68 TCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTTTTTTTTTTTTT AAAEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:66 YS:i:0 YT:Z:DP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21206:18601:14328:AAAAAAAAA 145 chr4 76426799 60 57M = 76426801 68 CCTCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTT AE/EAE/E/EEEE/EA/EE/EEEAE/EEAAE/EEEEE/E//EE/EAEEAEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:57 YS:i:0 YT:Z:DP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:4:21412:17448:14097:AAAAAAAAA 97 chr4 76426801 60 67M = 76426799 0 TCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTTTTTTTTTTTTTT AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA AS:i:-5 ZS:i:-10 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:66A0 YT:Z:UP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:4:21412:17448:14097:AAAAAAAAA 145 chr4 76426799 60 58M = 76426801 0 CCTCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTTT 6E/////A/6EE//////A//E////A///E//EEE/EE//E//6EEE/E//AEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:58 YT:Z:UP RG:Z:10-B NH:i:1
the molecule is being retained, duplicated in the deduplicated BAM files. So my first question is this, is there an algorithm to correctly deduplicate this molecule and return one consensus sequence rather than two?
Secondly, it appears that other sequencing issues with the first read in a pair has resulted in the first read being different, with the second being the same, such as in this example (12 pairs of what is presumably the same molecule - the 9-bp UMI and the locations of the second read are very similar):
NB500905:50:HVY23AFXY:1:11204:12926:7292:AAAAAAAAA 83 chr17 39699570 60 69M = 39699476 -163 GGTCCTGGAAGCCACAAGGTAAACACAACACATCCCCCTCCTTGACTATCAATTTTACTAGAGGATGTG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:69 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:1:11204:12926:7292:AAAAAAAAA 163 chr17 39699476 60 59M = 39699570 163 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEE/EE6EEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:59 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:1:11210:13179:6528:AAAAAAAAA 83 chr17 39699492 60 67M = 39699473 -86 GTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAGATTCCAGAAGA EEEEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:67 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:1:11210:13179:6528:AAAAAAAAA 163 chr17 39699473 60 60M = 39699492 86 AAGTCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTT E66/A/EE6//6E6E/A/EA///E//EE/EEAEEE/E///EA///EEEE/E//<A</EEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:60 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:11111:19485:18118:AAAAAAAAA 83 chr17 39699525 60 69M = 39699474 -120 CATGCTTTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAAC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:69 YS:i:-3 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:11111:19485:18118:AAAAAAAAA 163 chr17 39699474 60 60M = 39699525 120 AGTCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTT EE/EAEEEEEEEEAE6/EE/E/EE/EEEEEEEEEEEEEAEEE/EEEEEEEEEAAE/EEEE AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:0 MD:Z:38A21 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:11211:18464:14540:AAAAAAAAA 83 chr17 39699484 60 69M = 39699477 -76 ATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAGATTCC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:69 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:11211:18464:14540:AAAAAAAAA 163 chr17 39699477 60 58M = 39699484 76 CCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:58 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21111:8435:13671:AAAAAAAAA 83 chr17 39699534 60 68M = 39699476 -126 CAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAACACAACACA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21111:8435:13671:AAAAAAAAA 163 chr17 39699476 60 59M = 39699534 126 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEAEAEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEE/EEEEEEEEEA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:59 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21112:12742:5719:AAAAAAAAA 83 chr17 39699531 60 67M = 39699476 -122 TTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAACACAA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:67 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21112:12742:5719:AAAAAAAAA 163 chr17 39699476 60 59M = 39699531 122 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:59 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21308:6819:11408:AAAAAAAAA 83 chr17 39699519 60 67M = 39699477 -109 TAAGATCATGCTTTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:67 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21308:6819:11408:AAAAAAAAA 163 chr17 39699477 60 58M = 39699519 109 CCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EA/<EAAE//EE/EE/EEEAE<EAAE/EAE</EEEEE/EEEEEEEAEE/A<<EEEAEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:58 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21309:11545:15754:AAAAAAAAA 83 chr17 39699486 60 69M = 39699477 -78 GTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAGATTCCAG EEEAEE6/EE/EEA/A/EEE/EEEEEEAEEE/EAEEEEEAEE/EEEE6EEEE/EEEEEEEAEEE/AAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:69 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:2:21309:11545:15754:AAAAAAAAA 163 chr17 39699477 60 58M = 39699486 78 CCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC //E/EEAEEEEE6EE/EAE////EEEE/EAEEEAEA///<E/<EE/EEEEEE/A6AA6 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:58 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:3:11405:26640:4281:AAAAAAAAA 83 chr17 39699480 60 68M = 39699476 -72 TTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:3:11405:26640:4281:AAAAAAAAA 163 chr17 39699476 60 59M = 39699480 72 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEE//EEEEEEAEEEEEAEEEEEEEAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:59 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:3:11406:13328:7002:AAAAAAAAA 83 chr17 39699533 60 68M = 39699477 -124 TCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCAAAAGGTAAACACAACAC EEEEEEE/AE</EEEAAEEEAEEAAEE/EEE/EEEEEEEEEEEAEE//EEEEEEAEEEEEEE/AE6AA AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:51C16 YS:i:-4 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:3:11406:13328:7002:AAAAAAAAA 163 chr17 39699477 60 58M = 39699533 124 CCTTTCGATGTGACTGTCTCCTCCCAAAATTGTAGACCCTCTTAAGATCATGCTTTTC E/EE/EEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEE/A6EEEEEEEEEEEEAEEAEE AS:i:-4 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:28T29 YS:i:-5 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:3:21610:12320:10534:AAAAAAAAA 83 chr17 39699525 60 69M = 39699476 -118 CATGCTTTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAAC EEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:69 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:3:21610:12320:10534:AAAAAAAAA 163 chr17 39699476 60 59M = 39699525 118 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEAEEEEEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:59 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:4:21402:24172:19713:AAAAAAAAA 83 chr17 39699535 60 69M = 39699474 -130 AGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAACACAACACATC /EEEEEAA/AAEE/AEEEAEA//EEEE<AEEAEE</EEEEAE<E/EAEEAEE/E6E/EAEEEE/EEAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:69 YS:i:-6 YT:Z:CP RG:Z:10-B NH:i:1
NB500905:50:HVY23AFXY:4:21402:24172:19713:AAAAAAAAA 163 chr17 39699474 60 60M = 39699535 130 AGTCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTT E6/EAEE////EE///A///A/EE////EE/E6</<A/AE//!//<E66E/!E///6/E/ AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:0 MD:Z:42T8C8 YS:i:0 YT:Z:CP RG:Z:10-B NH:i:1
The obvious non-deduplication works to bias anything downstream, such as variant calling. This is a more difficult molecule to consensus deduplicate, but a method which would work with this case would work in the first case.
To check that this is not a mapping issue, I have blatted the two sets of sequences on the genome browser, and it is very clear that one of the two reads in each pair is effectively fixed, with the other being started at different positions on the molecule (my seuqencing collaborator suggests that the Illumina primer for read 1 did not bind effectively, and that it is 'slippy' in some sense):
So my question is, is there a robust consensus deduplication method that can deal with both cases, and derive a single pair of consensus reads from each set?
I've been using gencore here at https://github.com/OpenGene/gencore, which looks at the start points of both R1 and R2, and treats molecules (with the same UMI) as different if they don't match those locs or if the reads are different lengths. It doesn't consensus deduplicate the whole set, just a subset that match, leaving the remainder, which would bias everything.
I have used UMI-tools before with similar sequencing chemistries, and it works fine for RNA-seq, where the counts are important, but the consensus sequence is important here as I want to call variants on it.
It shouldn't be that difficult to take the output from
umi_tools group
and write some custom code that calculates a consensus sequence, depending on how sophisticated you want that consensus to be - if you just want a majority call, it should be fairly straight forward.Which version of UMI-tools is this? The one currently documented on readthedocs.io (1.0.0) does not have this option, and the one I currently have installed (0.5.5) does not have this either.
It was in 1.1.0. I don't know why the documentation is still showing 1.0.0.