Counting Repeat And Unique Reads Of Tophat Output
13.2 years ago
Stevelor ▴ 310

Hey,

I used TopHat for paired-end RNA-Seq mapping and converted "accepted_hits.bam" to a *.bed file with 82859900 entries (lines), i.e. hits on the reference genome. I wanted to know how many unique and repeat reads I have, and at how many locations on the reference genome the repeat reads hit.

So I wrote a few lines of code that compare and count the read IDs, with the following result:

hits: 82859900
unique hits: 75600252
repeat reads: 3217634 (hitting 7259648 locations)

Looks good! But is there another way to get these counts from the TopHat log files? What are the log files for? They give me strange counts.
Or is this the only way to get this information?
How do you count these reads?
I am not happy with samtools flagstat or Picard tools :(

Cheers, Steve

tophat rna parsing read • 7.3k views

It would be nice if you could share your code. I would really like to know how to do something like that.
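
The original code was not posted in this thread. As a rough idea of what such a count could look like, here is a minimal Python sketch. It assumes the BED file has the read ID in column 4 (with mates distinguished by a suffix such as /1 and /2), and the function name `count_hits` is made up for illustration. A read ID seen once is counted as unique; a read ID seen more than once is counted as a repeat read.

```python
from collections import Counter


def count_hits(bed_lines):
    """Count unique and repeat reads from BED lines (read ID assumed in column 4)."""
    per_read = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip empty lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        per_read[fields[3]] += 1

    total = sum(per_read.values())                           # all hits (BED entries)
    unique = sum(1 for n in per_read.values() if n == 1)     # read IDs with one hit
    repeat_reads = sum(1 for n in per_read.values() if n > 1)
    repeat_locations = total - unique                        # hits that belong to repeat reads
    return total, unique, repeat_reads, repeat_locations


# Tiny made-up example: read2/1 hits two locations, the others hit once.
example = [
    "chr1\t100\t150\tread1/1\t50\t+",
    "chr1\t300\t350\tread1/2\t50\t-",
    "chr2\t500\t550\tread2/1\t3\t+",
    "chr3\t700\t750\tread2/1\t3\t+",
]
total, unique, repeat_reads, repeat_locations = count_hits(example)
print("hits:", total)
print("unique hits:", unique)
print("repeat reads:", repeat_reads, "hitting", repeat_locations, "locations")
```

For a real file you could pass an open file handle instead of the list. Note that with tens of millions of entries the counter holds every read ID in memory, so this may need a few GB of RAM.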

13.2 years ago
Gww ★ 2.7k

In the BAM file created by TopHat there is an auxiliary tag (NH) that specifies the number of hits each read has. For example, NH:i:2 says that there are two hits for that read, so reads with NH:i:1 are the uniquely mapped ones.
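
To illustrate how the NH tag could be used for counting, here is a small Python sketch. It assumes SAM-formatted text lines as input (for example the output of `samtools view accepted_hits.bam`), and the function name `nh_summary` is made up for this example. Mates of a pair are treated as separate reads, using the first/second-in-pair FLAG bits.

```python
def nh_summary(sam_lines):
    """Return (unique reads, multi-mapped reads, alignments of multi-mapped reads)."""
    unique_reads = 0
    multi_reads = set()       # distinct (read name, mate) pairs with NH > 1
    multi_alignments = 0
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:                    # read is unmapped
            continue
        nh = None
        for tag in fields[11:]:           # optional fields start at column 12
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh is None:                    # no NH tag, cannot classify
            continue
        # 0x40 = first in pair, 0x80 = second in pair, 0 = unpaired
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        if nh == 1:
            unique_reads += 1
        else:
            multi_alignments += 1
            multi_reads.add((fields[0], mate))
    return unique_reads, len(multi_reads), multi_alignments


# Made-up records: r1 is a uniquely mapped pair, r2 maps to two places.
example_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t50\t50M\t=\t300\t250\t*\t*\tNM:i:0\tNH:i:1",
    "r1\t147\tchr1\t300\t50\t50M\t=\t100\t-250\t*\t*\tNM:i:0\tNH:i:1",
    "r2\t0\tchr2\t500\t3\t50M\t*\t0\t0\t*\t*\tNM:i:1\tNH:i:2",
    "r2\t256\tchr3\t700\t3\t50M\t*\t0\t0\t*\t*\tNM:i:1\tNH:i:2",
]
print(nh_summary(example_sam))
```

On a real file you could pass `sys.stdin` and pipe the `samtools view` output into the script. The set of multi-mapped read names grows with the data, so memory use depends on how many such reads there are.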


Do you know what the NM and XS tags specify?


NM is the edit distance between the read and the reference, i.e. the number of mismatches plus inserted and deleted bases. XS is the eXpected Strand of the transcript, based on transcript annotations and/or splice site motifs, e.g. GT:AG or AT:AC.


NM is the edit distance between the read and the reference, i.e. the number of mismatches plus inserted and deleted bases. XS is the eXpected Strand of the read, based on transcript annotations and/or splice site motifs, e.g. GT:AG or AT:AC.
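
If you want to look at these tags yourself, the optional fields of a SAM record can be read with a few lines of Python. This is only a sketch on a made-up record; the helper name `parse_tags` is hypothetical, and only the integer (`i`) and float (`f`) types are converted, everything else is kept as a string.

```python
def parse_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:                 # optional fields start at column 12
        name, type_code, value = item.split(":", 2)
        if type_code == "i":                 # integer, e.g. NM:i:1 or NH:i:2
            value = int(value)
        elif type_code == "f":               # float
            value = float(value)
        tags[name] = value                   # other types (A, Z, ...) stay strings
    return tags


# Made-up spliced alignment carrying NM, XS and NH tags.
record = "r2\t0\tchr2\t500\t3\t20M100N30M\t*\t0\t0\t*\t*\tNM:i:1\tXS:A:+\tNH:i:2"
tags = parse_tags(record)
print(tags["NM"], tags["XS"], tags["NH"])
```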
