8.5 years ago
dariober
There seem to be some inconsistencies between GTF records from the UCSC Table Browser and those in files generated with genePredToGtf (from the UCSC utilities).
For example, on the table browser I selected the vegaGene table to search for transcript OTTHUMT00000097860, and I get
chr1 hg19_vegaGene start_codon 865692 865694 0.000000 + . gene_id "OTTHUMT00000097860"; transcript_id "OTTHUMT00000097860";
chr1 hg19_vegaGene CDS 865692 865716 0.000000 + 2 gene_id "OTTHUMT00000097860"; transcript_id "OTTHUMT00000097860";
chr1 hg19_vegaGene exon 865692 865716 0.000000 + . gene_id "OTTHUMT00000097860"; transcript_id "OTTHUMT00000097860";
chr1 hg19_vegaGene CDS 866419 866469 0.000000 + 1 gene_id "OTTHUMT00000097860"; transcript_id "OTTHUMT00000097860";
...
If instead I use genePredToGtf the start_codon record seems to be missing:
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/vegaGene.txt.gz | gunzip -c \
| cut -f 2- \
| genePredToGtf file stdin stdout \
| grep 'OTTHUMT00000097860' \
| grep 'codon'
This returns the stop codon but not the start:
chr1 stdin stop_codon 879531 879533 . + 0 gene_id "SAMD11"; transcript_id "OTTHUMT00000097860"; exon_number "12"; exon_id "OTTHUMT00000097860.12"; gene_name "SAMD11";
Am I missing something?
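To see at a glance which record types each source emits for a transcript, something like the following might help. It is only a sketch: `count_features` is a helper name made up here, the file names in the usage comments are hypothetical, and it assumes tab-delimited GTF with the `transcript_id "..."` attribute in column 9.

```shell
# count_features FILE TRANSCRIPT_ID
# Hypothetical helper: tabulate GTF feature types (column 3) for one transcript.
# Assumes tab-delimited GTF with attributes in column 9.
count_features() {
    awk -F '\t' -v id="$2" '
        # keep only records whose attributes carry this exact transcript_id
        index($9, "transcript_id \"" id "\"") { n[$3]++ }
        END { for (f in n) print f "\t" n[f] }
    ' "$1" | LC_ALL=C sort
}

# Usage (file names are placeholders for the two GTF outputs being compared):
#   count_features table_browser.gtf  OTTHUMT00000097860 > a.txt
#   count_features genePredToGtf.gtf  OTTHUMT00000097860 > b.txt
#   diff a.txt b.txt
```

A feature type that appears in only one of the two tables (here, presumably `start_codon`) shows up directly in the diff.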
Thanks a lot for digging this information out, it makes sense. This seems to suggest that the output of genePredToGtf is more accurate than the Table Browser's (?)
Glad to help, and to learn too :) I'm not sure about accurate versus inaccurate, as I'm not familiar with the UCSC utilities or the Table Browser. But I'd expect that what I see in the Ensembl browser is what I get from the FTP site, BioMart, the REST API, or the Perl APIs. The underlying database is the same; only the mode of access differs.