Question

Htseq Count Finding No Feature

0

Entering edit mode

11.6 years ago

joaslucas ▴ 90

Hi,

I am trying to get read counts from RNA-seq using HTSEq gene counter. I firstly aligned my reads against rRNA using BWA, converted the unaligned files to .fastq using Picard, aligned this fastq file against the MTB genome using BWA and used the .sam file for reads counts using HTSeq gene count. the GTF file was got from MTBdatabase. Unfortunatelly, I am not getting a single feature counted. I believe it has something to do with the sam files. Can anyone help me? it only around 500k genes per sample, just a trainning. I am using Gene counter because I want later user DESeq for diferentially gene expression. Thanks.

the last few lines of the HTseq count output is bellow: . . Rv3920c 0 Rv3921c 0 Rv3922c 0 Rv3923c 0 Rv3924c 0 no_feature 444038 ambiguous 0 too_low_aQual 0 not_aligned 1226003 alignment_not_unique 0 lc@lc:~$

htseq bacteria gene-expression • 9.0k views

ADD COMMENT • link 11.6 years ago by joaslucas ▴ 90

1

Entering edit mode

Which version of htseq-count are you using? Does the gtf have the same chromosomal ids that the generated sam? Does it have 'exon' features?

ADD REPLY • link 11.6 years ago by vodka ▴ 80

0

Entering edit mode

Hi Vodka, my version is HTSeq-0.5.4p1. I aligned my reads (fastq) against the M. tuberculosis genome in fasta format (using BWA), as bellow. So, it does not have the gene ID. I thought that after the alignment the HTSeq would be able to count genes according to GTF file provided. This is not working.

My Mtb in fasta format

gi|57116681|ref|NC_000962.2| Mycobacterium tuberculosis H37Rv chromosome, complete genome TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTTAACGGCGACC CTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTGACCCCTCAGCAAAGGGCTTG GCTCAATCTCGTCCAGCCATTGACCATCGTCGAGGGGTTTGCTCTGTTATCCGTGCCGAGCAGCTTTGTC CAAAACGAAATCGAGCGCCATCTGCGGGCCCCGATTACCGACGCTCTCAGCCGCCGACTCGGACATCAGA ...

My GTF file ... Chromosome_2.1 MT_H37RV_GB_ORF start_codon 1 3 . + 0 gene_id "Rv0001"; transcript_id "Rv0001.1"; Chromosome_2.1 MT_H37RV_GB_ORF stop_codon 1522 1524 . + 0 gene_id "Rv0001"; transcript_id "Rv0001.1"; Chromosome_2.1 MT_H37RV_GB_ORF exon 1 1524 . + . gene_id "Rv0001"; transcript_id "Rv0001.1"; Chromosome_2.1 MT_H37RV_GB_ORF CDS 1 1521 . + 0 gene_id "Rv0001"; transcript_id "Rv0001.1"; Chromosome_2.1 MT_H37RV_GB_ORF start_codon 2052 2054 . + 0 gene_id "Rv0002"; transcript_id "Rv0002.1"; Chromosome_2.1 MT_H37RV_GB_ORF stop_codon 3258 3260 . + 0 gene_id "Rv0002"; transcript_id "Rv0002.1"; Chromosome_2.1 MT_H37RV_GB_ORF exon 2052 3260 . + . gene_id "Rv0002"; transcript_id "Rv0002.1"; Chromosome_2.1 MT_H37RV_GB_ORF CDS 2052 3257 . + 0 gene_id "Rv0002"; transcript_id "Rv0002.1"; Chromosome_2.1 MT_H37RV_GB_ORF start_codon 3280 3282 . + 0 gene_id "Rv0003"; transcript_id "Rv0003.1"; Chromosome_2.1 MT_H37RV_GB_ORF stop_codon 4435 4437 . + 0 gene_id "Rv0003"; transcript_id "Rv0003.1"; ...

ADD REPLY • link 11.6 years ago by joaslucas ▴ 90

0

Entering edit mode

I've never worked with bacteria data but I know that chromosome ids have to be the same on the sam file and the gtf provided to htseq-count. Therefore if you have Chromosome_2.1 as the sole reported id on your gtf file and you are sure that the coords reported there corresponds to the coords mapped on your mtb reference by bwa you just should check the third field of the sam file and change Chromosome_2.1 to that value (or the other way around, but that's less efficient usually). But I'm really not expert on using htseq_count with this kind of data so you should check that your gtf file corresponds to the needed format (I see exon features but that's not the only requisite, check the manual for details...tomorrow I will look for other notes about the needed gtf format and post them here [I remember about requisites about transcript_id ad so on but I am not sure]).

ADD REPLY • link 11.6 years ago by vodka ▴ 80