Hi everyone!
I am running an RNA-Seq experiment and was wondering if anyone could help me with a particular query i had.
So my RNA-Seq sample had 29,775,906 reads.
I ran this sample through tophat (no coverage search, unstranded, w/bowtie2). I checked the logs folder in the output and got the following stats :-
[2013-04-04 10:57:02] Preparing reads
left reads: min. length=50, max. length=50, 29767088 kept reads (8818 discarded)
Now, 29767088 + 8818 = 29,775,906 . So until now it makes sense.
From the bowtie files i got the following information :-
In,bowtie.left_kept_reads.log
29767088 (100.00%) were unpaired; of these:
4316521 (14.50%) aligned 0 times
17118264 (57.51%) aligned exactly 1 time
8332303 (27.99%) aligned >1 times
In,bowtie.left_kept_reads_seg1.log
4316521 (100.00%) were unpaired; of these:
3198237 (74.09%) aligned 0 times
1037571 (24.04%) aligned exactly 1 time
80713 (1.87%) aligned >1 times
In bowtie.left_kept_reads_seg2.log
4316521 (100.00%) were unpaired; of these:
3177510 (73.61%) aligned 0 times
1069186 (24.77%) aligned exactly 1 time
69825 (1.62%) aligned >1 times
So at this point , correct me if i am wrong, we can conclude :-
total reads aligned 1 time = 17118264 + 1069186 + 1037571 = 19225021 total reads aligned >1 time = 8332303 + 80713 + 69825 = 8482841
But then i converted the accepted_hits.bam file in a .sam file. On this .sam file I did the following two operations :-
awk '{print $1}' 11129_nc_u.sam | sort | uniq -u | wc -l
gave me 21638486 (i.e total number of uniques)
and
awk '{print $1}' 11129_nc_u.sam | sort | uniq -d | wc -l
gave me 5519946 (i.e. total duplicates.)
Now why is the number of uniques from the .sam folder (21638486 ) greater than 19225021 (i.e. obtained from the logs folder)? In the same breath, why is the number of duplicates from .sam folder (5519946) less than the number in the logs folder (8482841) ?
Can someone please clarify this point for me?
Thanks in Advance :)