Where can I find a valid version of the UniProt database (all isoforms of all genes) in GFF3 format? I'm interested in this for hg18/hg19 and mm9. Thanks.
Building on Pierre's answer, you can then get each UniProt record in GFF using
http://www.uniprot.org/uniprot/THE_ID_YOU_FOUND.gff
one by one. Alternatively, use batch retrieve to get the entries in one go: look for the small link back to UniProt, then download the entries in GFF using the orange download button.
This is GFF but not 100% valid GFF3: the Sequence Ontology does not cover all UniProt feature types, so those features can't be described in strictly valid GFF3. That makes it rather hard to encode UniProt in GFF3.
The column proteinID should be the UniProt ID.
By taking advantage of Pierre's tip, you'll just need to get the ID list here.
With the list in hand, remove all header/RefSeq things and the second column with:
cat hgTables | grep -v "NP_" | awk '{print $1}' > hgTablesUniProt
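A slightly more defensive variant of that filter, as a rough sketch only. It assumes the export is whitespace-delimited with the UniProt accession in the first column and that header lines start with `#`; both are assumptions about the file, and the function name `extract_ids` is made up for illustration. It also drops duplicate IDs so each entry is requested once.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print unique first-column IDs from a table export.
# Assumes header lines begin with '#' and RefSeq rows contain "NP_".
extract_ids() {
    local table=$1
    # NF skips blank lines; the two negated patterns skip headers and RefSeq rows.
    awk 'NF && !/^#/ && !/NP_/ {print $1}' "$table" | sort -u
}
```

Something like `extract_ids hgTables > hgTablesUniProt` would then stand in for the one-liner above.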
Then, get your files (Beware! Loooong list!):
while read -r line; do wget "http://www.uniprot.org/uniprot/$line.gff"; done < hgTablesUniProt
As Pierre says: That's it!
Just to mention, I've assumed a bash shell. Also, a delay between wget requests would be polite to the server.
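For the delay, a minimal sketch could look like the following. The function name `fetch_gff` and the one-second default are my own choices, not anything UniProt prescribes, and the URL pattern is the one quoted in this thread, so check that it still resolves before running a long list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download one GFF file per ID, pausing between requests.
# $1 = file with one UniProt ID per line, $2 = pause in seconds (default 1).
fetch_gff() {
    local id_file=$1
    local delay=${2:-1}
    local id
    while read -r id; do
        # Skip empty lines so we never request ".gff" with no ID.
        if [ -z "$id" ]; then
            continue
        fi
        wget -q "http://www.uniprot.org/uniprot/${id}.gff"
        sleep "$delay"
    done < "$id_file"
}
```

Usage would be something like `fetch_gff hgTablesUniProt 1`. If you build a file of full URLs instead, wget's own `--wait` option together with `-i` should give a similar effect without the loop.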