Question

Htseq-Count Not Working With Bam From Tophat

1

Entering edit mode

13.2 years ago

siddharth.sethi5 ▴ 300

Hi everyone
I have RNA seq data. I used Tophat2 for the alignment, got the junctions successfully.

When I use HTSeq-count (latest version) to count the reads, I get all counts as zero .The output below is for only chr10

nofeature 0 ambiguous 0 toolowaQual 0 notaligned 0 alignmentnotunique 1213

Here is my command
python -m HTSeq.scripts.count -s no -i genename chr10.sorted.sam Musmusculus.gtf > count_output

I have checked the SAM file, it do contains unique reads ("NH:1"). I sorted the BAM file using samtools (samtools sort -n ).

My SAM file looks like this:

WTCHG33331111:3:1101:2760:117234#TCGCATAA 161 chr10 85405316 255 28M610N23M = 85406036 771 CACCAAT AAGCACTGGTTGTCCTGTGAGTTGATAATCCACAGCAATGAGGG bbeeeeeggggghiibffdgghb[]gghiehdghghhihgghiiihihf AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 WTCHG33331111:3:1101:2760:117234#TCGCATAA 81 chr10 85406036 255 51M = 85405316 -771 TTCCCATCCACCACG TGCTGCTCCACTTTGGCGATATTGTGCGACACATCA fhiihgffc^geciihhihiihhffiihhfhfgefgiigeggecccce_ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:51 YT:Z:UU XS:A:- NH:i:1 WTCHG33331111:3:1101:4247:53268#TCGCATAC 97 chr10 84641325 255 51M = 88137631 3496357 ATTCAGCTTACACTT TCCACATTGCTGTTGATCACAAAAGGAAGTCAAGAC bbbeeeeegggggiiiiihiiiiiihiiiiiiihiidhiiifiihiihfbf AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:51 YT:Z:UU NH:i:1 WTCHG33331111:3:1101:4963:41163#TCGCATAC 97 chr10 85405305 255 39M610N12M = 85406105 1268 CCCCATG GTGCCACCAATAAGCACTGGTTGTCCTGTGAGTTGATAATCCAC Y\^cccWc^ca^aeefaaegfJee]cgggfghfX^YbY^caeeadh[cX AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 WTCHG33331111:3:1101:4963:41163#TCGCATAC 145 chr10 85406105 255 38M417N13M = 85405305 -1268 AGGTCAT CAGGGGTTGTGTTGAAGACTTTGGCAAAAGCCTGACGGGTTAAG bbeeeac^eaVfdfgfehgafabfdgfcaafecfbZecgegcca^^^_a AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1

My gtf file was downloaded from Ensemble and i replaced the contig names according to my BAM file. It looks like
chr1 snRNA exon 3092097 3092206 . + . geneid "ENSMUSG00000064842"; transcriptid "ENSMUST00000082908"; exonnumber "1"; genename "U6"; genebiotype "snRNA"; transcriptname "U6.149-201"; chr1 processedtranscript exon 3203690 3206425 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000016289 7"; exonnumber "1"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-003"; chr1 processedtranscript exon 3195982 3197398 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000016289 7"; exonnumber "2"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-003"; chr1 processedtranscript exon 3203520 3205713 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000015926 5"; exonnumber "1"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-002"; chr1 processedtranscript exon 3196604 3197398 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000015926 5"; exonnumber "2"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-002";

I tried HTSeq-count with one of my BAM (BWA align) from DNA-seq and it worked properly, so i think something's wrong with my tophat BAM file or I am doing something wrong?? I will appreciate any kind of help.

Thanks
Sid

tophat • 11k views

ADD COMMENT • link updated 13.2 years ago by shrirambhosle ▴ 30 • written 13.2 years ago by siddharth.sethi5 ▴ 300

0

Entering edit mode

can you provide minimal snippets that reproduce the problem? (e.g., your GTF example doesn't have annotations for chr10; in the sam file, seq and qualscores don't appear to be the same length).

ADD REPLY • link 13.2 years ago by Ryan Dale 5.0k

0

Entering edit mode

Hi Daler, i checked the seq and quality scores in the sam file, they appear to be of the same length(51) , atleast for the reads i checked. The gtf file do contains all chromosomes:

chr10 proteincoding exon 3134304 3134909 . + . geneid "ENSMUSG00000015202"; transcriptid "ENSMUST00000015346"; exon _number "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcriptname "Cnksr3-001";
chr10 proteincoding CDS 3134858 3134909 . + 0 geneid "ENSMUSG00000015202"; transcriptid "ENSMUST00000015346"; exon number "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcriptname "Cnksr3-001"; proteinid "ENSMUSP00000015346";
chr10 proteincoding startcodon 3134858 3134860 . + 0 geneid "ENSMUSG00000015202"; transcriptid "ENSMUST0000001534 6"; exonnumber "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcript_name "Cnksr3-001";

Actually i have tried using the complete BAM(all chr) also, but the result is the same. All my BAMs from tophat are not working. The BAM files looks fine, i used them with cufflinks and cuffdiff, it worked fine

ADD REPLY • link 13.2 years ago by siddharth.sethi5 ▴ 300

0

Entering edit mode

Maybe it's just a formatting issue on Biostar, but for the first sequence in your example, but I get

len('CACCAATAAGCACTGGTTGTCCTGTGAGTTGATAATCCACAGCAATGAGGG')= 51

and

len('bbeeeeeggggghiibffdgghb[]gghiehdghghhihgghiiihihf') = 49

Since my own files from TopHat work fine with htseq-count, I think it's necessary to troubleshoot with the exact input that gives you problems. To avoid Biostar formatting issues (like the loss of tabs and whatever causes the above length discrepancy), try pastebin or create a gist.

ADD REPLY • link 13.2 years ago by Ryan Dale 5.0k

score 1 · Answer 1 · 2012-06-29

1

Entering edit mode

13.1 years ago

shrirambhosle ▴ 30

Hi

Try using the latest version of HTSeq (HTSeq-0.5.3p9): http://pypi.python.org/pypi/HTSeq

Cheers, Shriram

ADD COMMENT • link 13.1 years ago by shrirambhosle ▴ 30

score 0 · Answer 2 · 2012-06-20

0

Entering edit mode

13.2 years ago

Mikael Huss 4.8k

I am not sure but it could be that you have to explicitly specify the feature name (e g "geneid"). If you look at the HTSeq-count manual (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html), it specifies that the default is to look for "gene_id" in the GTF (including the underscore). You may have to specify "-i geneid" for it to work.

There is nothing intrinsically incompatible with Tophat BAM/SAM files and HTSeq; I use this combo myself all the time.

ADD COMMENT • link 13.2 years ago by Mikael Huss 4.8k

0

Entering edit mode

Hi Mikael, actually somehow the underscores got omitted both from my command and the GTF file when i posted them here. I double checked it, they are present. Still i tried changing the "-i" parameter, used default and all possible things, but the result is the same. Can you think of any other problem that might be causing this? thanks

ADD REPLY • link 13.2 years ago by siddharth.sethi5 ▴ 300

0

Entering edit mode

Not really. Maybe something silly like the endline characters being wrong in the GTF file after you changed contig names. On Mac OS X you can accidentally get "\r"as newline characters instead of "\n" which isn't visible to you in e g TextEdit but which is shown as ^M in Emacs or the shell. But that is just a wild guess

ADD REPLY • link 13.2 years ago by Mikael Huss 4.8k

0

Entering edit mode

kindly help me out bro C: htseq count error

ADD REPLY • link 5.4 years ago by Balatheskulter • 0

0

Entering edit mode

Please don’t spam multiple posts asking people to help you.

ADD REPLY • link 5.4 years ago by Ram 45k

0

Entering edit mode

you should have been read my queries fully, the second time I used the star to map against a reference genome, and still, I didn't get the output

ADD REPLY • link 5.4 years ago by Balatheskulter • 0

0

Entering edit mode

Your exact “kindly help me out bro” post was added as a comment on two posts (Htseq-Count Not Working With Bam From Tophat and featurecounts and DESEQ ) and as an answer on another post (HTSeq-count in RNA-Seq ). That behavior is quite aptly described as spamming.

I deleted two out of those three, so regular users such as yourself can’t see them but moderators can.

ADD REPLY • link 5.4 years ago by Ram 45k