Question

Bulk download of gene names NCBI

7.2 years ago

T_18 ▴ 50

Dear all,

I'm relatively new to harvesting data from NCBI databases, and I have been struggling for some time with the following task. I am trying to download gene names based on a list of protein accession IDs (in a text file). For example, I want to download the gene name for "AAR23114.1". On the NCBI page for this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1), I find the gene name under "CDS", on the second line: /gene="cyp6a2".

I have a list of >1000 accession IDs and I want to download the corresponding gene names for all of them. Of course I have tried to find the answer myself:

- BioMart does not work for 'regular' gene sequences from NCBI.
- I have tried to download gene information in bulk using Batch Entrez, but unfortunately the gene name is not included for every record in the files you can download (e.g. summary or feature table), although it is available on the individual pages. Furthermore, the layout of the information is not standardized across records.

I am trying to get this done with efetch, but without any success so far. Is there a way to retrieve these gene names based on (protein) accession IDs?

Thanks in advance!

E-utilities • 5.0k views

ADD COMMENT • link updated 7.2 years ago by Renesh ★ 2.2k • written 7.2 years ago by T_18 ▴ 50

although it is available at the individual pages

Example?

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

Yes: "For example: I want to download the gene name/identification of “AAR23114.1”, going to the NCBI page of this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1) I find the gene name below at “CDS” at the second line: “/gene=“cyp6a2”."

ADD REPLY • link 7.2 years ago by T_18 ▴ 50

Yes, that is your first example. I was looking for one where the gene name is only available in the download: "(e.g. summary or feature table -> although it is available at the individual pages!)".

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

I have not found a case where it is only available in the download; the problem is that it is often missing from the download. So the information is available on the gene page (see previous example) but not in the downloaded summary (Send to > File > summary / gene feature or any other format):

cytochrome P450 [Drosophila melanogaster] 506 aa protein AAR23114.1 GI:38505146

Ideally, I would download a list with all protein accessions linked to their gene names, e.g. through efetch?

ADD REPLY • link 7.2 years ago by T_18 ▴ 50

Pierre: you're a (bio)star! Thanks a lot..

ADD REPLY • link 7.2 years ago by T_18 ▴ 50

score 0 · Answer 1 · 2017-09-20

7.2 years ago

Pierre Lindenbaum 164k

My solution using XSLT:

example:

$ cat accessions.txt | xargs -n 100 echo | sed 's/ /\&id=/g' | while read S; do wget -O - -q "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=xml" | xsltproc --novalid biostar273687.xsl - ; done

CAA64262.1  NSP2
CAA46742.1  N/A
CAA68495.1  N/A
CAA46741.1  N/A
CAA88010.1  orf
CAA24511.1  N/A
CAA67568.1  VP7
CAA64568.1  9
CAA64658.1  9
CAA64657.1  9
CAA64659.1  9
CAA46743.1  9
CAA00124.1  N/A
5CB7_B  N/A
5CB7_A  N/A
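
For readers without the stylesheet at hand, here is a rough sketch of the same idea that avoids XSLT. It is not the stylesheet used above. It assumes efetch is asked for GenPept flat-file text (rettype=gp, retmode=text) and that each record has a VERSION line carrying the accession, qualifiers such as /gene="cyp6a2", and a closing "//" line. The function names extract_genes and fetch_genes are made up for this illustration, and records without a /gene qualifier are reported as N/A, as in the output above.

```shell
#!/bin/sh
# Sketch: print "accession<TAB>gene" for each GenPept record read from stdin.
# Assumes the usual flat-file layout: a VERSION line with the accession,
# feature qualifiers such as /gene="cyp6a2", and "//" closing each record.
extract_genes() {
  awk '
    /^VERSION/ { acc = $2; gene = "N/A" }      # new record: remember accession
    /^ +\/gene="/ && gene == "N/A" {           # keep only the first /gene qualifier
      g = $0
      sub(/^ +\/gene="/, "", g)                # strip everything up to the opening quote
      sub(/".*$/, "", g)                       # strip the closing quote and the rest
      gene = g
    }
    /^\/\/$/ { if (acc != "") print acc "\t" gene; acc = "" }   # end of record
  '
}

# Hypothetical wrapper: $1 is a text file with one accession per line.
# Accessions are sent to efetch in comma-separated batches of 100.
fetch_genes() {
  xargs -n 100 < "$1" | tr ' ' ',' | while read -r ids; do
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${ids}&rettype=gp&retmode=text" | extract_genes
    sleep 1   # pause between requests to stay under the NCBI rate limit
  done
}
```

Something like `fetch_genes accessions.txt > genes.tsv` would then write one tab-separated line per record. Only the first /gene qualifier in a record is kept, which is a simplification for records with several annotated features.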

ADD COMMENT • link 7.2 years ago by Pierre Lindenbaum 164k

I feel I'm almost there, but I bumped into this error: "biostar273687.xsl:73: parser error : Premature end of data in tag stylesheet line 3 cannot parse biostar273687.xsl"

Am I correct that there could be an end tag missing? Should there be a "</xsl:>" on line 5?

ADD REPLY • link 7.2 years ago by T_18 ▴ 50

You're right, I copied the code badly. I'm going to fix it!

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

OK, I've updated it: a closing xsl:stylesheet tag was missing at the end.

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

score 0 · Answer 2 · 2017-09-21

You can use Batch Entrez for a large number of records (https://www.ncbi.nlm.nih.gov/sites/batchentrez):

1. Save all IDs in a text file.
2. Browse to the text file and retrieve the proteins.
3. Click on the retrieved records; this takes you to the NCBI gene summary page.
4. Click the "Send to" button and select File -> Format (Feature Table).
5. The result downloads as a file. In the downloaded file, you can see the accessions and gene names.
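
Once the file is downloaded, the accession and gene columns could be pulled out with something like the sketch below. It assumes the download uses the tab-delimited feature table layout, with a ">Feature db|ACCESSION|" header per record and qualifier lines made of three empty columns followed by the qualifier name and its value. The function name feature_table_genes is invented for this example, and records without a gene qualifier are printed as N/A, since the gene name is reportedly missing for some records.

```shell
#!/bin/sh
# Sketch: print "accession<TAB>gene" from a feature-table file read on stdin.
# Assumes ">Feature db|ACCESSION|" header lines and qualifier lines of the form
# <TAB><TAB><TAB>gene<TAB>NAME; records without a gene qualifier print N/A.
feature_table_genes() {
  awk -F '\t' '
    function emit() { if (acc != "") print acc "\t" gene }
    /^>Feature/ {
      emit()                                   # finish the previous record
      n = split($0, parts, "|")                # accession sits between the pipes
      acc = (n >= 2) ? parts[2] : $0
      gene = "N/A"
      next
    }
    $4 == "gene" && gene == "N/A" { gene = $5 }   # keep the first gene qualifier
    END { emit() }                             # finish the last record
  '
}
```

For example, `feature_table_genes < downloaded_feature_table.txt > genes.tsv`, where the input file name is a placeholder for whatever the browser saved.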