My goal is to get a UCSC table in GTF format from the FTP database and convert it to GFF3 format. My strategy is to convert the UCSC table to GTF and then to GFF3 - unless there is an easier way?
Through UCSC's Tables website it's possible to obtain tables like the Ensembl table in GTF format. I'd like to get the same table via FTP "Annotation" download, but I do not see these tables there. For example for mm9: http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/
The table I'm interested in is the Ensembl gene table, ensGene.txt, which is not offered in GTF format there. How can it be converted to GTF format?
I'd like a GTF file where the root nodes are the Ensembl gene entries like ENSMUSG.... and the transcripts are child nodes. I think this might be possible with genePredToGtf, but I cannot get it to work. The following command fails:
cat ensGene.txt | cut -f2-11 | genePredToGtf file stdin foo.gtf
Anyone know how this can be corrected?
Also, the UCSC table format seems to use 0-based start coordinates. Does genePredToGtf take care of making the resulting GTF 1-based?
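To make the coordinate question concrete, here is a minimal Python sketch of the conversion I would expect. It assumes the UCSC table stores intervals as 0-based, half-open (start is 0-based, end is exclusive) and that GTF wants 1-based, closed intervals. The function names are made up for illustration and are not part of any UCSC tool.

```python
def genepred_to_gtf_interval(start0, end0):
    """Convert a 0-based, half-open interval to a 1-based, closed one.

    Only the start shifts by one; the exclusive 0-based end is numerically
    the same as the inclusive 1-based end.
    """
    return start0 + 1, end0


def exon_intervals(exon_starts, exon_ends):
    """Parse comma-separated exonStarts/exonEnds strings (UCSC style, with a
    trailing comma) and return 1-based, closed (start, end) pairs."""
    starts = [int(s) for s in exon_starts.strip(",").split(",")]
    ends = [int(e) for e in exon_ends.strip(",").split(",")]
    return [genepred_to_gtf_interval(s, e) for s, e in zip(starts, ends)]
```

So a table row with txStart 0 and txEnd 100 should come out as 1 to 100 in the GTF if the tool handles this.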
Once I have a GTF, I can convert it to GFF3 format. Is there a utility that goes directly from genePred to GFF3, which would save this headache? I tried GBrowse's ucsc_genes2gff.pl (available here: http://search.cpan.org/~lds/GBrowse-2.52/bin/bed2gff3.pl) but it does not generate gene entries, only mRNA/children entries, and it ignores the ENSMUSG identifier.
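For reference, this is roughly the output structure I am after, as a rough Python sketch rather than a tested converter. It assumes ensGene.txt is tab-separated with the columns bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, and so on, where name is the transcript ID and name2 is the gene ID; please correct me if that layout is wrong. The function names and the "ensGene" source label are my own placeholders, and CDS and frame information are ignored.

```python
from collections import defaultdict


def parse_row(line):
    """Parse one ensGene.txt row (assumed column layout, see above)."""
    f = line.rstrip("\n").split("\t")
    starts = [int(s) + 1 for s in f[9].strip(",").split(",")]  # 0-based -> 1-based
    ends = [int(e) for e in f[10].strip(",").split(",")]
    return {
        "transcript": f[1],
        "chrom": f[2],
        "strand": f[3],
        "start": int(f[4]) + 1,  # 0-based -> 1-based
        "end": int(f[5]),
        "exons": list(zip(starts, ends)),
        "gene": f[12],
    }


def gff3_line(chrom, source, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 record; score and phase are left as '.'."""
    return "\t".join([chrom, source, ftype, str(start), str(end),
                      ".", strand, ".", attrs])


def to_gff3(lines, source="ensGene"):
    """Group transcripts by gene and emit gene > transcript > exon records."""
    genes = defaultdict(list)
    for line in lines:
        if line.strip():
            tx = parse_row(line)
            genes[(tx["gene"], tx["chrom"], tx["strand"])].append(tx)

    out = ["##gff-version 3"]
    for (gene, chrom, strand), txs in genes.items():
        # The gene span is the union of its transcripts' spans.
        g_start = min(t["start"] for t in txs)
        g_end = max(t["end"] for t in txs)
        out.append(gff3_line(chrom, source, "gene", g_start, g_end,
                             strand, f"ID={gene}"))
        for t in txs:
            out.append(gff3_line(chrom, source, "transcript", t["start"],
                                 t["end"], strand,
                                 f"ID={t['transcript']};Parent={gene}"))
            for s, e in t["exons"]:
                out.append(gff3_line(chrom, source, "exon", s, e, strand,
                                     f"Parent={t['transcript']}"))
    return out
```

Calling `print("\n".join(to_gff3(open("ensGene.txt"))))` would then write the GFF3 to stdout. One caveat with this sketch: a gene ID that appears on more than one chromosome or strand would be emitted with the same ID twice, which GFF3 does not allow, so a real tool would need to handle that.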
Thanks