Gene Order From Genbank

11.5 years ago

Leszek 4.2k

I need to fetch the positions (chr, start, end, strand) of some 7 million proteins that we store in our db.
There is a nice dump on the GenBank FTP (gene2accession.gz) that provides this information, but unfortunately only for RefSeq entries...
Could you please recommend a better method than querying all proteins through Bio.Entrez or parsing all the GenBank dumps (70+ GB zipped!)?

genbank entrez python • 2.9k views

ADD COMMENT • link updated 5.5 years ago by Biostar 20 • written 11.5 years ago by Leszek 4.2k

For 7 million proteins, you'll want to parse a database dump rather than use a web service. What kinds of identifiers (ID/accession) does your database use for the proteins? Most likely the proteins will map to multiple transcripts; do you want every transcript for each protein? And what's the issue with RefSeq?

ADD REPLY • link 11.5 years ago by Neilfws 49k

I mapped my proteins against GenBank, so for each protein I have a protein GI. Unfortunately, for many of these there is no RefSeq entry, just an nr entry. I only need the gene's position on the chromosome, so I don't care about transcripts.

ADD REPLY • link 11.5 years ago by Leszek 4.2k

Perhaps you can download the genome annotation file (GTF/GFF) for each genome you are interested in and look up your list of proteins in it to extract these attributes (chromosome, start, end, strand).
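
A minimal Python sketch of that idea, not the poster's own code. It assumes a GFF3 file whose CDS lines carry the protein identifier in a `protein_id` attribute (common in NCBI-style annotation, but check your files; GTF uses a different attribute syntax). The function names `open_text`, `parse_attributes` and `protein_positions` are made up for illustration.

```python
import gzip
from urllib.parse import unquote


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def protein_positions(gff_lines, wanted, id_key="protein_id"):
    """Return {protein_id: (seqid, start, end, strand)} for CDS features.

    A CDS split over several exons appears on several lines, so the span
    is widened to cover all of them. id_key is an assumption about how
    the annotation names the protein identifier.
    """
    positions = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Skip anything that is not a well-formed 9-column CDS record.
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        pid = parse_attributes(cols[8]).get(id_key)
        if pid not in wanted:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        if pid in positions:
            _, old_start, old_end, _ = positions[pid]
            start, end = min(start, old_start), max(end, old_end)
        positions[pid] = (seqid, start, end, strand)
    return positions
```

It could be used as `with open_text("annotation.gff3.gz") as fh: positions = protein_positions(fh, my_ids)`, where `my_ids` is a set of identifiers and the file name is a placeholder. GFF coordinates are 1-based and inclusive. Keeping the wanted IDs in a set means each file is streamed once, whatever the size of the list.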

ADD REPLY • link 11.5 years ago by Bharat Iyengar ▴ 330

Maybe you can go to the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) and download the knownGene table. Load it into a local MySQL server and use a bash for loop to query the table once per protein ID. Something like:

for i in $(cat my_list_of_proteinIDs); do mysql -h localhost -u user -D hg19 -N -A -e "select name, chrom, txStart, txEnd from knownGene where proteinID = '$i'" >> my_list_annot; done
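
With millions of IDs, one mysql call per protein may be slow. A possible alternative, sketched here in Python as an illustration rather than the poster's method, is to stream the downloaded table once and keep the rows whose proteinID is in your list. The column order below is an assumption about the hg19 knownGene dump, so verify it against the table schema; `annotate_proteins` is a hypothetical helper name.

```python
import csv

# Assumed column order of the UCSC hg19 knownGene table dump.
KNOWNGENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID",
]


def annotate_proteins(table_lines, wanted):
    """Yield (proteinID, name, chrom, txStart, txEnd, strand) for wanted IDs."""
    reader = csv.DictReader(
        table_lines,
        fieldnames=KNOWNGENE_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    for row in reader:
        # Table Browser exports start with a '#'-prefixed header line.
        if row["name"].startswith("#"):
            continue
        if row["proteinID"] in wanted:
            yield (
                row["proteinID"], row["name"], row["chrom"],
                int(row["txStart"]), int(row["txEnd"]), row["strand"],
            )
```

Pass it an open file handle for the table and a set of protein IDs. UCSC tables store txStart as 0-based and txEnd as exclusive, so add 1 to txStart if you need 1-based coordinates.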

ADD REPLY • link 10.9 years ago by AndreiR ▴ 260
