Hello, I've been using htseq-count (and the related python library to develop my own tools) to obtain read counts over known genes many times but only yesterday I decided (due to a modification in the default value in the library about the openness or closedness of the Ensembl gtf coords) to perform an exaustive check...I have to say that the results puzzled me. Following the suggestion given in the manual (http://www-huber.embl.de/users/anders/HTSeq/doc/features.html#HTSeq.GFF_Reader) I checked the divisibility by 3 of the end-begin+1 intervals reported for CDF and stop_codon feature but the results were not homogeneus as I expected:
$ wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
$ zcat Homo_sapiens.GRCh37.70.gtf.gz | awk '$3 == "CDS" {P=($5-$4+1)/3; print $1,$3,P==int(P)}' | grep -c "0$"
459850
$ zcat Homo_sapiens.GRCh37.70.gtf.gz | awk '$3 == "CDS" {P=($5-$4+1)/3; print $1,$3,P==int(P)}' | grep -c "1$"
335070
I checked by hand a few examples and there are protein coding genes in both categories. For stop-codons the results are a little more in agreement with the docs: 796 zeroes and 82902 ones. What is really buggering me is that I am not sure what CDS and stop_codon features are...reading the docs in a hurry some times ago I did not realize that the whole modulus 3 question is strange as referred to genomic coordinates. Reading here:http://mblab.wustl.edu/GTF22.html it seems that CDS refers to exons and thus I don't expect any divisibility by 3 a priori, while for stop_codon I do (or not, if they can be spliced as reported in the last link).
My final question is: which is the coordinates standard (1- or 0- based, closed or open intervals) for gtf found on Ensembl and on the illumina project gtf linked in the tophat page? I do realize that a single base should not be that much relevant but if there isn't any agreement on this basic things I suspect that bigger problems will arise.
Thank you very much! Everything is clear now, I think that htseq-counts docs should be changed because right now they are misleading as long as they declare that the "type" of a gtf could be derived analyzing %3 on CDS.