Hi everyone
I have RNA seq data. I used Tophat2 for the alignment, got the junctions successfully.
When I use HTSeq-count (latest version) to count the reads, I get all counts as zero .The output below is for only chr10
nofeature 0
ambiguous 0
toolowaQual 0
notaligned 0
alignmentnotunique 1213
Here is my command
python -m HTSeq.scripts.count -s no -i genename chr10.sorted.sam Musmusculus.gtf > count_output
I have checked the SAM file, it do contains unique reads ("NH:1"). I sorted the BAM file using samtools (samtools sort -n ).
My SAM file looks like this:
WTCHG33331111:3:1101:2760:117234#TCGCATAA 161 chr10 85405316 255 28M610N23M = 85406036 771 CACCAAT AAGCACTGGTTGTCCTGTGAGTTGATAATCCACAGCAATGAGGG bbeeeeeggggghiibffdgghb[]gghiehdghghhihgghiiihihf AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
WTCHG33331111:3:1101:2760:117234#TCGCATAA 81 chr10 85406036 255 51M = 85405316 -771 TTCCCATCCACCACG TGCTGCTCCACTTTGGCGATATTGTGCGACACATCA fhiihgffc^geciihhihiihhffiihhfhfgefgiigeggecccce_ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:51 YT:Z:UU XS:A:- NH:i:1
WTCHG33331111:3:1101:4247:53268#TCGCATAC 97 chr10 84641325 255 51M = 88137631 3496357 ATTCAGCTTACACTT TCCACATTGCTGTTGATCACAAAAGGAAGTCAAGAC bbbeeeeegggggiiiiihiiiiiihiiiiiiihiidhiiifiihiihfbf AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:51 YT:Z:UU NH:i:1
WTCHG33331111:3:1101:4963:41163#TCGCATAC 97 chr10 85405305 255 39M610N12M = 85406105 1268 CCCCATG GTGCCACCAATAAGCACTGGTTGTCCTGTGAGTTGATAATCCAC Y\^cccWc^ca^aeefaaegfJee]cgggfghfX^YbY^caeeadh[cX AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
WTCHG33331111:3:1101:4963:41163#TCGCATAC 145 chr10 85406105 255 38M417N13M = 85405305 -1268 AGGTCAT CAGGGGTTGTGTTGAAGACTTTGGCAAAAGCCTGACGGGTTAAG bbeeeac^eaVfdfgfehgafabfdgfcaafecfbZecgegcca^^^_a AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
My gtf file was downloaded from Ensemble and i replaced the contig names according to my BAM file. It looks like
chr1 snRNA exon 3092097 3092206 . + . geneid "ENSMUSG00000064842"; transcriptid "ENSMUST00000082908"; exonnumber "1"; genename "U6"; genebiotype "snRNA"; transcriptname "U6.149-201";
chr1 processedtranscript exon 3203690 3206425 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000016289 7"; exonnumber "1"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-003";
chr1 processedtranscript exon 3195982 3197398 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000016289 7"; exonnumber "2"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-003";
chr1 processedtranscript exon 3203520 3205713 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000015926 5"; exonnumber "1"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-002";
chr1 processedtranscript exon 3196604 3197398 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000015926 5"; exonnumber "2"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-002";
I tried HTSeq-count with one of my BAM (BWA align) from DNA-seq and it worked properly, so i think something's wrong with my tophat BAM file or I am doing something wrong?? I will appreciate any kind of help.
Thanks
Sid
can you provide minimal snippets that reproduce the problem? (e.g., your GTF example doesn't have annotations for chr10; in the sam file, seq and qualscores don't appear to be the same length).
Hi Daler, i checked the seq and quality scores in the sam file, they appear to be of the same length(51) , atleast for the reads i checked. The gtf file do contains all chromosomes:
chr10 proteincoding exon 3134304 3134909 . + . geneid "ENSMUSG00000015202"; transcriptid "ENSMUST00000015346"; exon _number "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcriptname "Cnksr3-001";
chr10 proteincoding CDS 3134858 3134909 . + 0 geneid "ENSMUSG00000015202"; transcriptid "ENSMUST00000015346"; exon number "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcriptname "Cnksr3-001"; proteinid "ENSMUSP00000015346";
chr10 proteincoding startcodon 3134858 3134860 . + 0 geneid "ENSMUSG00000015202"; transcriptid "ENSMUST0000001534 6"; exonnumber "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcript_name "Cnksr3-001";
Actually i have tried using the complete BAM(all chr) also, but the result is the same. All my BAMs from tophat are not working. The BAM files looks fine, i used them with cufflinks and cuffdiff, it worked fine
Maybe it's just a formatting issue on Biostar, but for the first sequence in your example, but I get
len('CACCAATAAGCACTGGTTGTCCTGTGAGTTGATAATCCACAGCAATGAGGG')
= 51and
len('bbeeeeeggggghiibffdgghb[]gghiehdghghhihgghiiihihf')
= 49Since my own files from TopHat work fine with htseq-count, I think it's necessary to troubleshoot with the exact input that gives you problems. To avoid Biostar formatting issues (like the loss of tabs and whatever causes the above length discrepancy), try pastebin or create a gist.