I am confused about my RNA-seq mapping results from tophat. It's pair-end sequencing data, 101 bp length from Hiseq 2000.
Here is the command I used:
tophat -p 8 -g 1 -G $geneInfo -o $examp_thout $bowtie2in $fastqP1 $fastqP2
"-g 1" specified that only uniquely mapped reads are kept; "$geneInfo" is the gtf files I downloaded from tophat website; "$bowtie2in" is the index files from tophat website.
tophat used bowtie 2.0.0.7 and samtools 0.1.18.0.
As an example, I have 95604894 reads in total.
According to "the accepted_hits.bam", there are 88040486 uniquely reads mapped (all of them have tag : "NH:i:1"), and there are 7564408 reads on "unmapped.bam".
So, 92.1% of reads are uniquely mapped on the genome! I have similar results for other samples.
Is it too good to be true? How about your data?
And there are some log files under "log" folder, one of them is "bowtie.leftkeptreads.log"
47730953 reads; of these:
47730953 (100.00%) were unpaired; of these:
10052350 (21.06%) aligned 0 times
20708981 (43.39%) aligned exactly 1 time
16969622 (35.55%) aligned >1 times
78.94% overall alignment rate
It seems that lots of reads are mapped in multiple location, how could I get high uniquely mapped reads percentage?
Many thanks!
I want to get the read counts for each genes, then analyze them with linear model under poisson distribution. That is why I want the uniquely mapped reads. How do you work with tophat? Do you know any way to get the uniquely mapped reads?
There could be multiple way to do it. For example there is an optional field in the SAM file called:
You will need to allow reporting of more than one read and then you will want to select lines that contains
NH:i:1
Also look here Counting repeat and unique reads of tophat output
According to your explanation, I set "-g 1" to let tophat report only one mapped reads even if reads are mapped in multiple position. Then, for example, I set "-g 2" , and selecte reas with "NH i 1", which will give the uniquely mapped reads. Is it a paradox?
I also see many people in seqanswers said that "-g 1" will give the uniquly mapped reads. It is so confusing.
I would recommend to try to avoid setting too many limits during alignment phase, instead filter the resulting SAM file.
For example note what the manual says about
-g
:Thanks, I will try with default value, then see how it happens.