Hi everyone,
I have a RNA-seq library (paired-end) with a large amount of multimappers because it comes from a total RNA extract with partial ribodepletion (~50 % of reads are from rRNAs). I tried mapping those reads with tophat changing the --max-multihits parameter (1 and 10), and I was surprised to see that the number of reads mapped in a proper pair is massively - from 38 to 87% - affected by that option.
samtools flagstat accepted_hits_1.bam # tophat --max-multihits 1
111707163 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
111707163 + 0 mapped (100.00% : N/A)
111707163 + 0 paired in sequencing
54154286 + 0 read1
57552877 + 0 read2
42710174 + 0 properly paired (38.23% : N/A)
50502846 + 0 with itself and mate mapped
61204317 + 0 singletons (54.79% : N/A)
0 + 0 with mate mapped to a different chr
samtools flagstat accepted_hits_2.bam # tophat --max-multihits 10
379040545 + 0 in total (QC-passed reads + QC-failed reads)
267241809 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
379040545 + 0 mapped (100.00% : N/A)
111798736 + 0 paired in sequencing
54212619 + 0 read1
57586117 + 0 read2
96903430 + 0 properly paired (86.68% : N/A)
97210716 + 0 with itself and mate mapped
14588020 + 0 singletons (13.05% : N/A)
0 + 0 with mate mapped to a different chr
I first thought that this change was due the secondary alignment, but if I filter them out (samtools view -F 256
), I still get 87% reads mapped in a proper pair.
What exactly happened here ?
My intuition is that reads are mapped independantly from their mate up to 1 or 10 times. After that tophat looks if there is one combination of position from the paired reads that leads to proper pairing and reports that pair of position. Can anyone confirm this or provide additional insight ?
Thank you,
Carlo
Thank you for your input !