Bulk download of gene names NCBI
7.2 years ago
T_18 ▴ 50

Dear all,

I’m relatively new to harvesting data from NCBI databases, and I have been struggling for some time with the following task. I am trying to download gene names based on a list of protein accession IDs (in a text file). For example, I want to download the gene name for “AAR23114.1”. On the NCBI page for this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1), I find the gene name under “CDS”, on the second line: /gene="cyp6a2".

I have a list of >1000 accession IDs and I want to download the corresponding gene names for all of them. Of course, I have tried to find the answer myself:

  • BioMart does not work for ‘regular’ gene sequences from NCBI.
  • I have tried to download gene information in bulk using Batch Entrez, but unfortunately the gene name is not included for every record in the files you can download (e.g. summary or feature table), although it is available on the individual pages. Furthermore, the layout of the information is not standardized across records.

I have been trying to get this done with efetch, but without success so far. Is there a way to retrieve these gene names based on (protein) accession IDs?

Thanks in advance!

E-utilities • 5.0k views

"although it is available at the individual pages"

Can you give an example?


Yes, the one from my question: for “AAR23114.1”, the NCBI page of this ID (https://www.ncbi.nlm.nih.gov/protein/AAR23114.1) shows the gene name under “CDS”, on the second line: /gene="cyp6a2".


Yes, that is your first example. I was looking for the one where the gene name is only available in the download: "(e.g. summary or feature table -> although it is available at the individual pages!)".


I have not found a case where it is only available in the download; the problem is that it is often missing from the download. The information is available on the individual page (see the previous example) but not in the downloaded summary (Send to > File > Summary, gene feature or any other format):

  1. cytochrome P450 [Drosophila melanogaster] 506 aa protein AAR23114.1 GI:38505146

Ideally, I would download a list of all protein accessions linked to their gene names. Is that possible, e.g. through efetch?


Pierre: you're a (bio)star! Thanks a lot..

7.2 years ago

My solution using XSLT (the stylesheet is saved as biostar273687.xsl).

Example:

$ cat accessions.txt | xargs -n 100 echo | sed 's/ /\&id=/g' | while read S; do wget -O - -q "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=xml" | xsltproc --novalid biostar273687.xsl - ; done

CAA64262.1  NSP2
CAA46742.1  N/A
CAA68495.1  N/A
CAA46741.1  N/A
CAA88010.1  orf
CAA24511.1  N/A
CAA67568.1  VP7
CAA64568.1  9
CAA64658.1  9
CAA64657.1  9
CAA64659.1  9
CAA46743.1  9
CAA00124.1  N/A
5CB7_B  N/A
5CB7_A  N/A
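
The batching step of the one-liner above (xargs -n 100 followed by sed) can be tried on its own, without any network access. What follows is only a minimal sketch: build_efetch_urls is a hypothetical helper name, not part of the original answer, and it joins each batch with commas instead of repeating "&id=", which efetch also accepts for its id parameter.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer): print one efetch URL
# per batch of accessions. Nothing is downloaded here.
# Usage: build_efetch_urls FILE [BATCH_SIZE]
build_efetch_urls() {
  local file="$1" size="${2:-100}"
  local base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
  # Drop blank lines, group the IDs into batches of $size,
  # then join each batch with commas.
  grep -v '^[[:space:]]*$' "$file" \
    | xargs -n "$size" echo \
    | tr ' ' ',' \
    | while read -r ids; do
        printf '%s?db=protein&id=%s&retmode=xml\n' "$base" "$ids"
      done
}
```

Each printed URL could then be passed to wget and xsltproc exactly as in the loop above, e.g. build_efetch_urls accessions.txt 100 | while read -r U; do wget -O - -q "$U" | xsltproc --novalid biostar273687.xsl - ; done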

I feel I'm almost there, but I ran into this error: "biostar273687.xsl:73: parser error : Premature end of data in tag stylesheet line 3 cannot parse biostar273687.xsl"

Am I correct that an end tag could be missing? Should there be a "</xsl:>" on line 5?


You're right, I copied the code badly. I'm going to fix it!


OK, I've updated it; the closing xsl:stylesheet tag was missing at the end.

7.2 years ago
Renesh ★ 2.2k

You can use Batch Entrez for a large number of records (https://www.ncbi.nlm.nih.gov/sites/batchentrez):

  • Save all IDs in a text file.
  • Browse to the text file and retrieve the proteins.
  • Click on the retrieved records; this takes you to the NCBI gene summary page.
  • Click the "Send to" button (select File -> Format: Feature Table).
  • The result downloads as a file, in which you can see the accessions and gene names.
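
Once the feature table has been downloaded, the accession/gene pairs can be pulled out with awk. This is only a sketch under stated assumptions: feature_table_genes is a hypothetical helper name, and the file is assumed to follow the tab-delimited 5-column feature table layout, with header lines such as ">Feature gb|AAR23114.1|" and qualifier lines whose fourth and fifth columns hold the qualifier name and its value. Records without a gene qualifier are reported as N/A, which matches the observation in the question that the gene name is missing for some records.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "accession<TAB>gene" for each record of a
# 5-column feature table. The input layout is an assumption:
#   >Feature gb|ACCESSION|
#   <TAB><TAB><TAB>gene<TAB>NAME
# Usage: feature_table_genes FILE
feature_table_genes() {
  awk -F'\t' '
    # Print the record collected so far, with N/A when no gene was seen.
    function emit() {
      if (acc != "") print acc "\t" (gene == "" ? "N/A" : gene)
    }
    /^>Feature/ {
      emit()
      n = split($0, parts, "|")      # accession is the 2nd "|"-separated field
      acc = (n >= 2) ? parts[2] : $0
      gene = ""
      next
    }
    # Keep the first gene qualifier found in the current record.
    $4 == "gene" && gene == "" { gene = $5 }
    END { emit() }
  ' "$1"
}
```

For example, feature_table_genes downloaded_table.txt > accession_to_gene.tsv would write one line per record, where downloaded_table.txt stands for whatever file name the download was saved under.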