Question

Getting different numbers of reads in R1 and R2 after using UMI-tools

5.6 years ago

Sara ▴ 280

I used STAR for the alignment and then ran UMI-tools dedup. Before deduplication I had equal numbers of reads in R1 and R2, but afterwards the numbers were no longer equal. I assumed UMI-tools would consider both reads of a pair when deduplicating. Do you know why I got different numbers of reads after using UMI-tools?

next-gen • 1.1k views

updated 5.6 years ago by i.sudbery 22k • written 5.6 years ago by Sara ▴ 280

score 1 · Answer 1 · 2020-01-14

There are two circumstances under which this can happen, and it depends on a combination of how BAM files encode pairing information and how mappers output read pairs.

Circumstance 1: Two alignments for read2

The SAM specification says that the pairing information is recorded by recording the chromosome and position of the primary alignment of the mate read. Now imagine you have a read pair where the read1 has a single alignment position and the read2 has two alignment positions. Some mappers will output the read1 only once, because it has only one alignment position, but other aligners will output two identical copies of read1, marking one as primary and one as secondary - this maintains an equal number of read1 and read2s. However, both read1s will point to the primary alignment of read2 - no read1 will point to the secondary alignment of read2 (some aligners break this rule, but by no means all). So you have a situation like:

read1.p ---> read2.p
         ____/
read1.s /    read2.s

Because UMI-tools makes its decisions on the basis of read1 (including the information stored in read1 on the position of read2), and then finds the read2 mates of the read1s it decides to keep, only one of the read2s will be output. Actually, I can't imagine any different way in which we might operate that would output both read1s and both read2s.
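To make the diagram concrete, here is a minimal sketch in plain Python. It is not UMI-tools code: the `Aln` record, the read names and the positions are all hypothetical, and mates are matched on name and position only. It just shows that when every read1 copy points at the primary read2, the secondary read2 cannot be reached from any read1.

```python
from collections import namedtuple

# Hypothetical minimal alignment record (not a pysam object).
Aln = namedtuple("Aln", "label name is_read1 pos mate_pos")

# One pair: read1 aligns once but is emitted twice by the mapper,
# read2 has a primary and a secondary alignment.
records = [
    Aln("read1.p", "pairA", True, 100, 500),
    Aln("read1.s", "pairA", True, 100, 500),   # copy of read1, still points at read2.p
    Aln("read2.p", "pairA", False, 500, 100),
    Aln("read2.s", "pairA", False, 900, 100),  # no read1 points here
]

# Every mate that some read1 refers to, keyed by (name, position).
wanted = {(r.name, r.mate_pos) for r in records if r.is_read1}

read2s = [r for r in records if not r.is_read1]
reachable = [r.label for r in read2s if (r.name, r.pos) in wanted]
unreachable = [r.label for r in read2s if (r.name, r.pos) not in wanted]

print("reachable read2:", reachable)
print("unreachable read2:", unreachable)
```

Whatever the deduplication step decides about the read1 copies, only `read2.p` can ever be retrieved as a mate in this model.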

Circumstance 2: Two alignments for read1

If there are two alignments for read1, then the situation is a little more complex. Again, both read1s (whether primary or secondary) will point to the same read2. When UMI-tools decides to keep a read1, it adds the details of the read2 it is looking for to a set. When it comes across a read2, it checks against its list of things it is looking for to see whether the read should be output; if it is there, it outputs the read and then deletes it from the list. This keeps memory usage reasonable. If, when a chromosome is finished, not all the read2s have been found, it scans the chromosome again from the start to find the read2s it is missing.
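The bookkeeping described above can be sketched roughly as follows. This is a simplified model, not the actual UMI-tools implementation: `Aln`, `output_with_mates` and `keep_read1` are hypothetical names, mates are matched on name and position only, and the input is assumed to be one chromosome in coordinate order. The example includes a pair whose read2 lies upstream of its read1, which is one case where the second sweep is needed.

```python
from collections import namedtuple

# Hypothetical minimal alignment record (not a pysam object).
Aln = namedtuple("Aln", "name is_read1 pos mate_pos")

def output_with_mates(records, keep_read1):
    """Return (output records, number of sweeps) for one chromosome.

    keep_read1 is a stand-in for the deduplication decision on read1.
    """
    looking_for = set()   # (name, position) of read2s still wanted
    out = []
    for rec in records:
        if rec.is_read1:
            if keep_read1(rec):
                out.append(rec)
                looking_for.add((rec.name, rec.mate_pos))
        elif (rec.name, rec.pos) in looking_for:
            out.append(rec)
            looking_for.discard((rec.name, rec.pos))  # found, stop tracking it
    sweeps = 1
    if looking_for:       # some mates were passed before being requested
        sweeps = 2
        for rec in records:
            key = (rec.name, rec.pos)
            if not rec.is_read1 and key in looking_for:
                out.append(rec)
                looking_for.discard(key)
    return out, sweeps

records = [
    Aln("fwd", True, 100, 300),
    Aln("rev", False, 150, 400),  # read2 seen before its read1
    Aln("fwd", False, 300, 100),
    Aln("rev", True, 400, 150),
]
out, sweeps = output_with_mates(records, keep_read1=lambda rec: True)
print(sweeps)
print([(r.name, "read1" if r.is_read1 else "read2") for r in out])
```

With one alignment per read, each kept read1 gets exactly one read2, so the counts stay equal; the imbalance only appears once several read1 alignments share a mate.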

Now imagine the read1 primary alignment is selected, and then the read2 primary alignment is found - at this point the read2 primary alignment is output and removed from the "looking for" list. Now the read1 secondary alignment is selected, and because it also points to the read2 primary alignment, that read2 is added to the list again. The read2 primary alignment will then be output a second time on the second sweep (a read2 secondary alignment would never be output, because no read1 points to it) - thus the read2 primary alignment is output twice.

Now imagine the read1 primary alignment is selected, and then the read1 secondary alignment is found and selected too. The read2 primary alignment is already on the list, so nothing new is added. Then the read2 primary alignment is found, output and removed from the list - both read1s have been output, but only one read2.
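The two orderings can be replayed with a toy version of the procedure. This is only a sketch under stated assumptions: `Aln` and `run` are hypothetical, every read1 alignment is assumed to survive deduplication, mates are matched on name and position only, and the positions are made up so that the read2 falls either between the two read1 alignments or after both.

```python
from collections import namedtuple

# Hypothetical minimal alignment record (not a pysam object).
Aln = namedtuple("Aln", "label name is_read1 pos mate_pos")

def run(records):
    """Replay the 'looking for' bookkeeping; returns labels in output order."""
    looking_for, out = set(), []
    for rec in records:
        if rec.is_read1:
            out.append(rec.label)          # assumption: every read1 is kept
            looking_for.add((rec.name, rec.mate_pos))
        elif (rec.name, rec.pos) in looking_for:
            out.append(rec.label)
            looking_for.discard((rec.name, rec.pos))
    if looking_for:                        # second sweep over the chromosome
        for rec in records:
            key = (rec.name, rec.pos)
            if not rec.is_read1 and key in looking_for:
                out.append(rec.label)
                looking_for.discard(key)
    return out

# read2 lies between the two read1 alignments.
mate_between = run([
    Aln("read1.p", "pairA", True, 100, 200),
    Aln("read2.p", "pairA", False, 200, 100),
    Aln("read1.s", "pairA", True, 300, 200),
])

# read2 lies after both read1 alignments.
mate_last = run([
    Aln("read1.p", "pairA", True, 100, 300),
    Aln("read1.s", "pairA", True, 200, 300),
    Aln("read2.p", "pairA", False, 300, 100),
])

print(mate_between)  # ['read1.p', 'read2.p', 'read1.s', 'read2.p']
print(mate_last)     # ['read1.p', 'read1.s', 'read2.p']
```

In the first ordering the same read2 record is written twice; in the second, two read1 records share a single read2, which gives the unequal read1 and read2 counts.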

This second problem may be theoretically soluble and we are working on it.