Question

Query Regarding Rna-Seq Hit Count

3

Entering edit mode

12.6 years ago

yvs.0815 ▴ 30

Hi everyone!

I am running an RNA-Seq experiment and was wondering if anyone could help me with a particular query i had.

So my RNA-Seq sample had 29,775,906 reads.

I ran this sample through tophat (no coverage search, unstranded, w/bowtie2). I checked the logs folder in the output and got the following stats :-

[2013-04-04 10:57:02] Preparing reads

     left reads: min. length=50, max. length=50, 29767088 kept reads (8818 discarded)

Now, 29767088 + 8818 = 29,775,906 . So until now it makes sense.

From the bowtie files i got the following information :-

In,bowtie.left_kept_reads.log

29767088 (100.00%) were unpaired; of these:

4316521 (14.50%) aligned 0 times

17118264 (57.51%) aligned exactly 1 time

8332303 (27.99%) aligned >1 times

In,bowtie.left_kept_reads_seg1.log

4316521 (100.00%) were unpaired; of these:

3198237 (74.09%) aligned 0 times

1037571 (24.04%) aligned exactly 1 time

80713 (1.87%) aligned >1 times

In bowtie.left_kept_reads_seg2.log

4316521 (100.00%) were unpaired; of these:

3177510 (73.61%) aligned 0 times

1069186 (24.77%) aligned exactly 1 time

69825 (1.62%) aligned >1 times

So at this point , correct me if i am wrong, we can conclude :-

total reads aligned 1 time = 17118264 + 1069186 + 1037571 = 19225021 total reads aligned >1 time = 8332303 + 80713 + 69825 = 8482841

But then i converted the accepted_hits.bam file in a .sam file. On this .sam file I did the following two operations :-

awk '{print $1}' 11129_nc_u.sam | sort | uniq -u | wc -l

gave me 21638486 (i.e total number of uniques)

and

awk '{print $1}' 11129_nc_u.sam | sort | uniq -d | wc -l

gave me 5519946 (i.e. total duplicates.)

Now why is the number of uniques from the .sam folder (21638486 ) greater than 19225021 (i.e. obtained from the logs folder)? In the same breath, why is the number of duplicates from .sam folder (5519946) less than the number in the logs folder (8482841) ?

Can someone please clarify this point for me?

Thanks in Advance :)

rna-seq tophat bowtie reads • 2.8k views

ADD COMMENT • link updated 12.6 years ago by vodka ▴ 80 • written 12.6 years ago by yvs.0815 ▴ 30

score 0 · Answer 1 · 2013-04-09

0

Entering edit mode

12.6 years ago

vodka ▴ 80

Check here: http://seqanswers.com/forums/showpost.php?p=72778&postcount=4 As long as the two seg? logs derive from portions of you reads I believe that summing those numbers is wrong.

ADD COMMENT • link 12.6 years ago by vodka ▴ 80