Gene Order From Genbank
0
1
Entering edit mode
11.4 years ago
Leszek 4.2k

I need to fetch positions (chr, start, end, strand) of some 7 million proteins we store in our db.
There is nice dump in GenBank ftp (gene2accesion.gz) providing these info, but unfortunately only for RefSeq entries...
Could you please recommend some better method than querying all proteins through Bio.Entrez or parsing all GenBank dumps (70+Gb zipped!)?

genbank entrez python • 2.8k views
ADD COMMENT
1
Entering edit mode

For 7 million proteins, you'll want to parse a database dump rather than use a web service. What kinds of identifiers (ID/accession) does your database use for the proteins? Most likely the proteins will map to multiple transcripts - do you want every transcript for the protein? And what's the issue with Refseq?

ADD REPLY
0
Entering edit mode

I mapped my proteins vs genbank, so for each protein I got protein GI. Unfortunately for many of these there is no RefSeq, just nr entry. I need just gene position in the chromosome, so I don't care about transcripts.

ADD REPLY
0
Entering edit mode

Perhaps you can download the genome annotation file (GTF/GFF), for the genome that you are interested in, and parse it against your list of proteins, for these attributes.

ADD REPLY
0
Entering edit mode

Maybe you can go http://genome.ucsc.edu/cgi-bin/hgTables, download known_gene table. Add it to a local mysql server and use bash for loop to query the table. Something like ... "for i in `cat my_list_of_proteinIDs`;do mysql -h ~localhost~ -u user -D hg19 -N -A -e 'select name,chrom,txStart, txEnd from knownGene where proteinID = "proteinID"' >> my_list_annot;done.

ADD REPLY

Login before adding your answer.

Traffic: 1608 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6