Question

Gtf Format End Coord Included Or Excluded

0

Entering edit mode

12.4 years ago

vodka ▴ 80

Hello, I've been using htseq-count (and the related python library to develop my own tools) to obtain read counts over known genes many times but only yesterday I decided (due to a modification in the default value in the library about the openness or closedness of the Ensembl gtf coords) to perform an exaustive check...I have to say that the results puzzled me. Following the suggestion given in the manual (http://www-huber.embl.de/users/anders/HTSeq/doc/features.html#HTSeq.GFF_Reader) I checked the divisibility by 3 of the end-begin+1 intervals reported for CDF and stop_codon feature but the results were not homogeneus as I expected:

$ wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
$ zcat Homo_sapiens.GRCh37.70.gtf.gz | awk '$3 == "CDS" {P=($5-$4+1)/3; print  $1,$3,P==int(P)}' | grep -c "0$"
459850
$ zcat Homo_sapiens.GRCh37.70.gtf.gz | awk '$3 == "CDS" {P=($5-$4+1)/3; print  $1,$3,P==int(P)}' | grep -c "1$"
335070

I checked by hand a few examples and there are protein coding genes in both categories. For stop-codons the results are a little more in agreement with the docs: 796 zeroes and 82902 ones. What is really buggering me is that I am not sure what CDS and stop_codon features are...reading the docs in a hurry some times ago I did not realize that the whole modulus 3 question is strange as referred to genomic coordinates. Reading here:http://mblab.wustl.edu/GTF22.html it seems that CDS refers to exons and thus I don't expect any divisibility by 3 a priori, while for stop_codon I do (or not, if they can be spliced as reported in the last link).

My final question is: which is the coordinates standard (1- or 0- based, closed or open intervals) for gtf found on Ensembl and on the illumina project gtf linked in the tophat page? I do realize that a single base should not be that much relevant but if there isn't any agreement on this basic things I suspect that bigger problems will arise.

gtf annotation coordinates • 3.4k views

ADD COMMENT • link updated 12.4 years ago by Devon Ryan 105k • written 12.4 years ago by vodka ▴ 80

score 3 · Answer 1 · 2013-03-22

You have a few questions, so I'll try to get to all of them in turn:

1) The heterogeneity you observed is due to splicing, which can happen in the middle of (not just between) codons. That is also the case for stop codons, though it seems to occur less frequently. That sort of thing is normal. In genomic, as opposed to transcript, coordinates, there's no guarantee that length%3==0.

2) CDS features just denote coding sequences (these may be entire exons or only a portion of them), i.e., the part that codes for the protein sequence. stop_codon denotes the TGA/TAG/TAA sequence that denotes the end-point of translation. Anything following that would be 3' UTR (likewise, anything in an exon before the start_codon feature is 5" UTR).

3) GTF files are 1-based with closed intervals. That is to say that if column 4 is A and column 5 is B, then the interval is [A,B].