Tophat: # reads mapping to CDS, rRNA, intergenic?
1
1
Entering edit mode
9.6 years ago
Adrian Pelin ★ 2.6k

Hello,

How can I get # of reads mapping to CDS/rRNA features or intergenic? I did my alignment with tophat and I have .gtf .gff and .bed annotations of my features.

Thank you,

Adrian

RNA-Seq NGS Tophat • 4.4k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode
9.6 years ago

Use HTSeq tool.

python -m HTSeq.scripts.count [options] <alignment_file> <gff_file>

Example command line to count reads mapped to the CDS:

python -m HTSeq.scripts.count --type=CDS --idattr=gene_id --mode=union --stranded=yes <sam_file> <gff_file>
ADD COMMENT
0
Entering edit mode

Thanks for the useful tool, but it doesn't seem to work with tophat alignemnts:

7584 GFF lines processed.
Warning: No features of type 'exon' found.
Error occured when reading beginning of SAM/BAM file.
  ('SAM line does not contain at least 11 tab-delimited fields.', 'line 1 of file con1/accepted_hits.bam')
  [Exception type: ValueError, raised in _HTSeq.pyx:1276]

Hmm, looks like I need to specify -f bam and install a bam reader.

ADD REPLY
1
Entering edit mode

I can confirm that yes, all you need to do is specify "-f bam" and this error should go away. The error is misleading because it says "SAM/BAM". Many other tools perform this detection behind-the-scenes with no problems, so it surprising that HTSeq does not.

ADD REPLY
0
Entering edit mode

I am not sure (this is just a guess) but the problem may be because you are providing bam file to HTseq as an input. It requires a SAM file. You can make HTseq read from the standard input ("-") using the following command.

samtools view accepted_hits.bam | python -m HTSeq.scripts.count --type=CDS --idattr=gene_id --mode=union --stranded=yes - <gff_file>

Or you can convert your bam file to sam file

samtools view -h input.bam > output.sam and give it to HTSeq.

If this is not the case, would you mind pasting the first few lines from your bam/sam.

ADD REPLY
0
Entering edit mode

Though you can see this in the example command that is provided here, I'd like to point out that to get the full distribution of CDS vs UTR vs Introns vs Intergenic, you will have to run at least that many commands with htseq-count, changing the --type= argument for each feature type.

ADD REPLY

Login before adding your answer.

Traffic: 2241 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6