9.3 years ago
mbio.kyle
Hello,
I am using tophat to align 100bp single end RNAseq reads to the human transcriptome (using hg19). I have noticed a large difference between the number of reads reported in the prep_reads step and the align_summary step.
As an example, here is the prep_reads.info file from one of my samples:
min_read_len=101
max_read_len=101
reads_in =22536887
reads_out=22535224
And here is the align summary:
Reads:
Input : 2599620
Mapped : 1557662 (59.9% of input)
of these: 125871 ( 8.1%) have multiple alignments (80 have >20)
59.9% overall read mapping rate.
Why is reads_in so much higher than the number of reads listed as Input when calculating the alignment rate? My understanding is that the prep_reads step is the one that filters out reads.
Thanks,
Kyle
This is weird. reads_in should be the same as Input. Somewhere on this forum I read that multi-threading may cause this problem, and that running without the -p parameter should resolve it. But they couldn't figure out why it is happening.
I did some more investigating into this. All the samples that ran in my pipeline Python script (multi-threaded) showed this discrepancy. One sample failed for other reasons and I had to rerun it manually, and its reads_in matched Input. So this must be the issue.
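For anyone who wants to screen a batch of samples for the same discrepancy, the comparison described above can be scripted. This is only a rough sketch, not part of tophat: the function names are made up for illustration, it assumes prep_reads.info contains key=value lines and that align_summary.txt has the single-end layout shown in the question, and it treats a sample as consistent when Input equals either reads_in or reads_out.

```python
import re

def parse_prep_reads_info(text):
    """Parse key=value lines (e.g. 'reads_out=22535224') into a dict of ints."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and value.strip().isdigit():
            values[key.strip()] = int(value.strip())
    return values

def parse_align_summary(text):
    """Pull the 'Input' and 'Mapped' read counts out of a single-end summary."""
    counts = {}
    for label in ("Input", "Mapped"):
        match = re.search(rf"{label}\s*:\s*(\d+)", text)
        if match:
            counts[label] = int(match.group(1))
    return counts

def check_sample(prep_text, summary_text):
    """Compare the read counts reported by the two files for one sample."""
    prep = parse_prep_reads_info(prep_text)
    summary = parse_align_summary(summary_text)
    return {
        "reads_out": prep["reads_out"],
        "input": summary["Input"],
        # Assumption: Input should equal reads_in or reads_out.
        "consistent": summary["Input"] in (prep["reads_in"], prep["reads_out"]),
        # Share of the prepared reads that the summary counted as input.
        "input_fraction": summary["Input"] / prep["reads_out"],
    }

# The numbers posted in the question, pasted in as text.
prep_text = """min_read_len=101
max_read_len=101
reads_in =22536887
reads_out=22535224"""

summary_text = """Reads:
Input : 2599620
Mapped : 1557662 (59.9% of input)
of these: 125871 ( 8.1%) have multiple alignments (80 have >20)
59.9% overall read mapping rate."""

result = check_sample(prep_text, summary_text)
print(result)
```

In a real pipeline you would read the two files from each sample's output directory instead of pasting the text in; where exactly they live depends on your tophat version and output settings, so check your own run.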
Thanks!
Could you link me to the original thread by chance? I am quite interested in this now.
Found it, but not sure how much it will help: Tophat - Understated Number Of Reads In The "Align_Summary.Txt" File
Excellent, thank you very much. I have re-run my alignments without multi-threading and the results are quite shocking.
This is without the -p flag:
And this is with it:
I double-checked to see whether it was just a reporting issue, but it is not: the single-threaded BAM file is almost a GB in size, while the threaded one is 53M.
Thank you so much for clearing this up for me. I hope this gets fixed soon.
Here is a github issue which was opened a few days ago: https://github.com/infphilo/tophat/issues/18
The suggestion is that the issue should be fixed in the new tophat version (2.1.0). I am rerunning with the updated version.
Thanks for the follow-up. This is a pretty common issue with most bioinformatics tools. You have errors coming and going. Normally most of the problems can be resolved by using the latest version, or by going one version back if the error is in the latest version.
I tried searching for the post but couldn't find it. I don't think the post explained the reason behind the discrepancy. I will search again and post the link if I am successful.