Question

Count number of mapped raw reads for each transcript in SAM file

0

10.1 years ago

manekineko ▴ 150

Is there a better way than the following to count how many reads are mapped to each sequence/transcript and get a count table with names and counts?

perl -ne ' if (/^\@SQ/) { @F = split(/\t|:/, $_); print $F[2]."\n" } ' SAMFILE > ID.txt
perl -ne ' chomp($_); print $_."\t".`grep -c "\t$_" SAMFILE ` ' ID.txt > COUNTTABLE

The SAM file is 8 GB, even though the Bowtie index contained only the 70 transcripts I need, and it takes forever (a couple of hours) to produce this count table.

RNA-Seq • 4.3k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by manekineko ▴ 150

2

What about samtools idxstats?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by swbarnes2 15k

0

I think this should work for him.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by Ashutosh Pandey 12k

0

AFAIK the numbers reported by samtools idxstats are the number of alignments mapped to each sequence, not the (non-redundant) number of reads, which is what I need?

EDIT: that worked fine :)

ADD REPLY • link 10.1 years ago by manekineko ▴ 150
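The alignments-versus-reads distinction raised above can be sketched in a single pass over the SAM file, instead of one grep per transcript (70 passes over 8 GB in the original approach). The following is a minimal Python sketch, not part of the original thread: the function name `count_reads_per_reference` and the script name in the usage comment are hypothetical. It counts only primary alignments of mapped reads, using the FLAG bits defined in the SAM specification, so a read with several reported alignments is counted once. If the aligner reports at most one alignment per read, this equals the plain alignment count.

```python
import sys

# FLAG bits from the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_reads_per_reference(lines):
    """Count primary alignments of mapped reads per reference, in one pass."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # @SQ header lines name every reference; start each one at zero
            # so transcripts without any reads still appear in the table.
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        counts.setdefault(field[3:], 0)
            continue
        fields = line.split("\t")
        flag = int(fields[1])   # column 2: FLAG
        rname = fields[2]       # column 3: reference name
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY) or rname == "*":
            continue
        counts[rname] = counts.get(rname, 0) + 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_reads.py SAMFILE > COUNTTABLE
    with open(sys.argv[1]) as sam:
        for name, n in count_reads_per_reference(sam).items():
            print(f"{name}\t{n}")
```

For paired-end data each mate is a separate record, so this counts mates rather than fragments; that choice is an assumption of the sketch.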

0

I wrote a C++ implementation for my own purposes, using the SeqAn library, that does just that. It runs fairly fast, although I haven't tried it on files this big. You could try something similar; with SeqAn it should be relatively painless.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by darxsys ▴ 240

0

Why don't you use standard tools such as HTSeq, summarizeOverlaps in R, or RSeQC?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by Ashutosh Pandey 12k

Ram · Answer 1 · 2015-06-23

3

10.1 years ago

Brian Bushnell 20k

With the BBMap package:

pileup.sh in=mapped.sam covstats=stats.txt rpkm=rpkm.txt

It's fast.

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by Brian Bushnell 20k
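For context on the `rpkm=` output requested in the command above, here is a short Python sketch of the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The function name `rpkm` is hypothetical, and whether pileup.sh applies exactly this normalization (for example, which reads count toward the total) is an assumption, not something stated in this thread.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of transcript per million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        # Avoid division by zero for empty references or empty samples.
        return 0.0
    # reads / (length in kb) / (total mapped reads in millions)
    return read_count * 1e9 / (length_bp * total_mapped_reads)
```

With only 70 transcripts in the index, the raw counts are usually what a count table needs; RPKM is mainly useful when comparing transcripts of different lengths.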

Ram · Answer 2 · 2015-06-23

2

10.1 years ago

Antonio R. Franco ★ 5.2k

How about using a GFF or GTF file of your reference genome and a dedicated program such as:

summarizeOverlaps from the R package GenomicAlignments
htseq-count from the Python package HTSeq
featureCounts from the R package Rsubread
simpleRNASeq from the R package easyRNASeq

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by Antonio R. Franco ★ 5.2k

0

Another one: RSeQC (http://rseqc.sourceforge.net/)

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by Ashutosh Pandey 12k

0

The problem is that I don't want to complicate this by involving a whole-genome or whole-transcriptome analysis. I just have 70 sequences that I indexed for Bowtie, I mapped my reads to them to get a SAM file, and I want to count the reads mapped to each of those 70 sequences.