htseq count error
1
0
Entering edit mode
4.7 years ago

I used tophat to map paired-end reads against the reference genome and generated Acceptedhits.bam file

$ htseq-count --idattr locus_tag -o samout 119acceptedhits.bam Rsolani.gff

AG1IA_10588     0

AG1IA_10589     0

AG1IA_10590     0

AG1IA_10591     0

AG1IA_10592     0

AG1IA_10593     0

__no_feature    0

__ambiguous     0


__too_low_aQual 0

__not_aligned   0

__alignment_not_unique  0

why am I getting results like this?

RNA-Seq software error next-gen • 1.5k views
ADD COMMENT
1
Entering edit mode

Check if chromosome notation is the same between BAM and GTF. A common error is that in BAM it is like 1, 2, 3 and in GTF it is like chr1, chr2, chr3. As a remark, tophat is deprecated for years now. You might get better results if you use more recent algnment tools such as hisat2 or even tools like kallisto and salmon to irectly quantify your reads against a transcriptome. Based on recent literature these leightweight quantifiers (salmon/kallisto) seem to be superior for RNA-seq quantification compared to traditional aligners.

ADD REPLY
0
Entering edit mode

Now I have used STAR to map the reads against the reference genome. The output is Aligned.sortedbycoord.out.bam and Alignedout.bam

$ samtools view Aligned.sortedByCoord.out.bam | head
SRR5500529.2896506      419     KB317696.1      1658    3       7M230616N119M   =       232299  230767  GGCGGATGGCCTCCTGCTCCAGACCCTGCACCCATCAAGGAAGCATGCCCGTTCATAGGGCTAGTAGTGAACCCCATCACATGAGAACGAGGGACAAACTCGGTAGGCGACAGAGTACCATTGCCA       CCCCCDGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGEGGGGGGGCGGGGGGGGGGBGGGBGGGGGGGGGGGGGGGGG@GEBBB#  NH:i:2  HI:i:2       AS:i:244        nM:i:2
SRR5500529.8830704      355     KB317696.1      1853    1       11M268420N115M  =       270355  268628  CCCACAGCCAGCACCTCCTCCTCCCGTGCGCAACCGGACTGTGTCCGAGAATGTCGCTCGATCCATATCTCCTCCTGCAGCTCCTCCAGCAATCGCGACCCCGACCGCTACTCGTCTAGGTGCCCC       CCCCCGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGG/EGGG>FEGGGGGGGGGGGEGGGBGGGGGGGGDGEEGGGGGGGGGEGGGGGGGGGGGGGDDGGGGGGGGDGGGGBGGGGB.EEDGGGGG  NH:i:3  HI:i:3       AS:i:241        nM:i:3
chromosome name is same as gtf file "KB317696.1 "
GTF file - is like 
KB317696.1  Genbank exon    1594    1929    .   +   .   ID=exon-AG1IA_00001-2;Parent=rna-AG1IA_00001;gbkey=mRNA;locus_tag=AG1IA_00001;partial=true;product=hypothetical protein.
still I get the same error
$ htseq-count --idattr locus_tag -o samout Aligned.sortedByCoord.out.bam Rsolani.gff
AG1IA_10586     0
AG1IA_10587     0
AG1IA_10588     0
AG1IA_10589     0
AG1IA_10590     0
AG1IA_10591     0
AG1IA_10592     0
AG1IA_10593     0
__no_feature    0
__ambiguous     0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  0
ADD REPLY
0
Entering edit mode

needs more information for debug. If your bam file is OK, then it is probably something wrong with --idattr locus_tag and your off file

ADD REPLY
0
Entering edit mode

boaty, kindly look at the above stanza.

ADD REPLY
1
Entering edit mode
4.7 years ago
boaty ▴ 220

it seems that you aligned your reads on some reference other than genome, the RNAME is KB317696.1 not chromosome. And you want to count reads on each exon, your GTF coordinates are coordinate for a gene the pb is HTSeq-count does not work in this way, you need to align reads on genome then feed HTSeq the coordinate of each gene/transcript/exon in your GTF file, HTSeq need information of both gene and exon to deal with splicing event.

FYI: if you want to count reads on exon, just filter your sam line, group reads by : 1. same exon. 2. different exon. Then count these lines.

ADD COMMENT

Login before adding your answer.

Traffic: 1782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6