Hi everyone,
I'm trying to run HTSeq on a group of BAM files generated by aligning Illumina RNA-seq reads to a reference genome. The reference genome is the highest-quality sequence available and was downloaded from NCBI RefSeq. I want to generate a count table with HTSeq for later differential expression (DE) analysis, as follows:
htseq-count -f bam -t gene -i gene alignment.bam reference.gtf
What confuses me is that, according to the manual, -t is "exon" by default, but there is no "exon" in the third column of my GTF file. On the other hand, I do have "exon" in my GFF file. I also see that GTF and GFF are sometimes used interchangeably even though they have different structures, which adds to my confusion.
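A minimal sketch of one way to list which feature types actually appear in the third column of either annotation file, assuming a plain-text, tab-separated GTF or GFF. The function name `count_feature_types` is hypothetical and not part of HTSeq.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in column 3 (feature type) of GTF/GFF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/header lines such as "#gtf-version".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A valid record has 9 columns; column 3 holds the feature type.
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Usage, with the file name taken from the command above:
# with open("reference.gtf") as handle:
#     for feature, n in count_feature_types(handle).most_common():
#         print(feature, n)
```

Any feature type printed here is a candidate value for -t; a type that is absent from the list will match no records.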
Another question: should I sort the BAM files? The manual discusses sorting, but then I read here https://www.echemi.com/community/running-htseq-count-over-bam-files_mjart2205182203_481.html that newer versions of HTSeq don't require sorted input.
Thanks for any help!
Hi Shred,
Thanks a lot for your answer, which was concise and clear. I tried using "CDS" but it somehow failed. I'm aware that "exon" doesn't really make much sense in the bacterial world. My confusion arose because the docs suggest that the annotation file should be a .gtf and talk about "exon", while "exon" actually appears in the .gff annotation but not in the .gtf. I wasn't sure whether I should use the recommended GTF with some other feature type such as "gene", or switch to the GFF and use "exon". I will now do as you suggested, replace "gene" with "exon" in the GTF file, and see what happens. The GitHub link was also useful.
Thanks.
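For anyone following along, here is a minimal sketch of that replacement under one reading of the suggestion: only the third column is rewritten from "gene" to "exon", so attribute values that happen to contain the word "gene" are left alone. The function name `relabel_feature` and the output file name are hypothetical.

```python
def relabel_feature(line, old="gene", new="exon"):
    """Return the line with column 3 changed from `old` to `new`.

    Comment lines and records with a different feature type pass through unchanged.
    """
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    # Only touch the feature-type column, never the attribute column.
    if len(fields) >= 3 and fields[2] == old:
        fields[2] = new
    return "\t".join(fields)


# Usage (the output file name is an assumption):
# with open("reference.gtf") as src, open("reference.exon.gtf", "w") as dst:
#     for line in src:
#         dst.write(relabel_feature(line))
```

Writing to a new file keeps the original annotation intact, so the two versions can be compared if the counts look odd.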