Entering edit mode
11.5 years ago
Leszek
4.2k
I need to fetch positions (chr, start, end, strand) of some 7 million proteins we store in our db.
There is nice dump in GenBank ftp (gene2accesion.gz) providing these info, but unfortunately only for RefSeq entries...
Could you please recommend some better method than querying all proteins through Bio.Entrez or parsing all GenBank dumps (70+Gb zipped!)?
For 7 million proteins, you'll want to parse a database dump rather than use a web service. What kinds of identifiers (ID/accession) does your database use for the proteins? Most likely the proteins will map to multiple transcripts - do you want every transcript for the protein? And what's the issue with Refseq?
I mapped my proteins vs genbank, so for each protein I got protein GI. Unfortunately for many of these there is no RefSeq, just nr entry. I need just gene position in the chromosome, so I don't care about transcripts.
Perhaps you can download the genome annotation file (GTF/GFF), for the genome that you are interested in, and parse it against your list of proteins, for these attributes.
Maybe you can go http://genome.ucsc.edu/cgi-bin/hgTables, download known_gene table. Add it to a local mysql server and use bash for loop to query the table. Something like ... "for i in `cat my_list_of_proteinIDs`;do mysql -h ~localhost~ -u user -D hg19 -N -A -e 'select name,chrom,txStart, txEnd from knownGene where proteinID = "proteinID"' >> my_list_annot;done.