Hello,
I have a list of protein accessions from NCBI, e.g. XP_003642916.1, and would like to automatically retrieve the corresponding nucleotide sequences. Is there any way to do it?
Thanks a lot,
D.
There are at least 2 ways to do it.
The first, which requires more programming, is to use EUtils. You would first use ESearch to retrieve the UID for the protein, then ELink to cross-reference to the nucleotide database and finally, EFetch to retrieve the nucleotide sequence. I'll provide some example code at the end of this post.
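To make the three steps a bit more concrete, here is a minimal Python sketch that only builds the request URLs and sends nothing to NCBI. The helper name `eutils_url` and the two placeholder UIDs are my own inventions for illustration; the endpoints and parameter names are the standard E-utilities ones as far as I know.

```python
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Build an E-utilities request URL (hypothetical helper; no network access)."""
    return BASE + tool + ".fcgi?" + urlencode(params)


# Step 1: ESearch, accession -> protein UID
esearch_url = eutils_url("esearch", db="protein", term="XP_003642916.1")

# Step 2: ELink, protein UID -> linked mRNA UID ("PROTEIN_UID" is a placeholder)
elink_url = eutils_url("elink", dbfrom="protein", db="nuccore",
                       linkname="protein_nuccore_mrna", id="PROTEIN_UID")

# Step 3: EFetch, mRNA UID -> FASTA text ("NUCLEOTIDE_UID" is a placeholder)
efetch_url = eutils_url("efetch", db="nuccore", id="NUCLEOTIDE_UID",
                        rettype="fasta", retmode="text")

for url in (esearch_url, elink_url, efetch_url):
    print(url)
```

Pasting the printed URLs into a browser (with real UIDs substituted) is a quick way to inspect what each step returns before writing any parsing code.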
The second way - which is almost always the answer to questions of the type "how do I convert from one ID to another?" - is to use BioMart.
You should search this website for examples of usage, which have been posted many times. Basically: you select Ensembl Genes for the database and Gallus gallus genes for the dataset. Under Filters, you choose Gene, ID List Limit and Refseq predicted protein ID, then upload or copy/paste your accession list. Under attributes you choose Sequences and the type of sequence to retrieve. Then you click "Results". You will probably need to omit the suffix of your accessions (i.e. XP_003642916, not XP_003642916.1).
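If you need to strip the version suffix from a long list before pasting it into BioMart, something like the following Python sketch should do it. The function name `strip_version` is just an illustrative choice, and it assumes the version is always a trailing dot followed by digits.

```python
def strip_version(accession):
    """Return the accession without a trailing '.N' version suffix, if present."""
    cleaned = accession.strip()
    base, dot, version = cleaned.rpartition(".")
    if dot and version.isdigit():
        return base
    return cleaned


accessions = ["XP_003642916.1", "XP_002864745.1", "XP_003642916"]
for acc in accessions:
    print(strip_version(acc))
```

Accessions that have no version suffix pass through unchanged, so the function is safe to run on a mixed list.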
OK: some EUtils code, using the BioRuby library (you could also use Perl, Python, whatever). It's just to demonstrate and would need some work to make it into a useful program.
require 'bio'      # bioruby gem
require 'crack'    # crack gem (XML parser)
require 'open-uri' # provides URI.open, used to fetch the ELink URL
Bio::NCBI.default_email = "me@me.com"
accn = "XP_003642916.1"
ncbi = Bio::NCBI::REST.new
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
# esearch: protein accession -> protein UID
search = ncbi.esearch(accn, {"db" => "protein"})
# => ["363743782"]
# elink: protein UID -> linked mRNA UID in nuccore
suff = "elink.fcgi?dbfrom=protein&db=nuccore&linkname=protein_nuccore_mrna&id=#{search.first}"
xml = URI.open(base + suff).read
xml = Crack::XML.parse(xml)
mrna_id = xml['eLinkResult']['LinkSet']['LinkSetDb']['Link']['Id']
puts mrna_id
# => "363743781"
# efetch: mRNA UID -> FASTA sequence
result = ncbi.efetch(mrna_id, {"db" => "nuccore", "retmode" => "text", "rettype" => "fasta"})
puts result
>gi|363743781|ref|XM_003642868.1| PREDICTED: Gallus gallus fibroblast growth factor 22-like (LOC100858338), mRNA
ATGAGGCGCGGGGGCCCCGCCGCTCTCGCCGCCTGCCTCGCTGGGGCGCTCGCCGTGCTGGCGGGGCCGG
GACCGGGCAGCTCCGTCTGGAGCGGCCGGCGACCCCCCCGCAGCTACGGGCATCTGGAAGGCGACGTGCG
CTGCCGGCGGCTCTTCTCCGCCACCCGCTTCTTCCTGAGCATCGACGGCGGCGGCGGAGTGGAGGGGACG
CGCTGGAGGGAGCGGCCGGGCAGCATCGTCGAGATCCGGTCGGTGCGTGTCGGAGTCGTGGCCATCCGAG
CGGTGCACACCGGCTTCTACCTGGCCATGAACAAGCAGGGGCAGCTCTACGGGTCGAAGGAGTTCAGCCC
CAACTGCAAGTTCACGGAGCGCATTGAGGAGAACGGCTACAACACCTACGCCTCGCTGCGCTGGCGGCAC
CGGGGCCGCCCCATGTTCCTCTCCCTCAATAGCAAAGGGAGGCCGCGGCGAGGGGGCAAGACGCGCCGGC
AGCACCTCTCCACCCACTTCCTCCCCATGCTCGTCAGCTGA
Thank you so much, this is really helpful. D.
I thought I'd convert Neilfws's first approach to Python for the benefit of those looking for more guidance on using Python with the Entrez Programming Utilities (eUtils). This uses the Biopython package.
Using Python to access the Entrez Programming Utilities (eUtils) presents a number of paths to the result. For example, for the first step of going from the accession.version to the GI number (see here or here about the forms of UIDs), the BioPerl EUtilities Cookbook suggests either EFetch or ESummary. Of course, Neilfws's answer did that conversion with ESearch. To make the steps easier to follow, I am following Neilfws's approach.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are. PUT YOUR EMAIL THERE.
protein_accn_numbers = ["ABR17211.1", "XP_002864745.1", "AAT45004.1", "XP_003642916.1" ]
protein_gi_numbers = []
#ESearch
# BE CAREFUL TO NOT ABUSE THE NCBI SYSTEM.
# see http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec119 for information.
# For example, if searching with more than 100 records, you'd need to do this ESearch step
# on weekends or outside USA peak times.
for accn in protein_accn_numbers:
    esearch_handle = Entrez.esearch(db="protein", term=accn)
    esearch_result = Entrez.read(esearch_handle)
    esearch_handle.close()
    #print(esearch_result)
    #print(esearch_result["IdList"][0])
    protein_gi_numbers.append(esearch_result["IdList"][0])
#print(protein_gi_numbers)
retrieved_mRNA_uids = []
#ELink
handle = Entrez.elink(dbfrom="protein", db="nuccore", linkname="protein_nuccore_mrna", id=protein_gi_numbers)
result = Entrez.read(handle)
handle.close()
#print result
for each_record in result:
    mrna_id = each_record["LinkSetDb"][0]["Link"][0]["Id"]
    retrieved_mRNA_uids.append(mrna_id)
#print(retrieved_mRNA_uids)
#EPost
epost_handle = Entrez.epost(db="nuccore", id=",".join(retrieved_mRNA_uids))
epost_result = Entrez.read(epost_handle)
epost_handle.close()
webenv = epost_result["WebEnv"]
query_key = epost_result["QueryKey"]
#EFetch
count = len(retrieved_mRNA_uids)
batch_size = 20
the_records = ""
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Fetching records %i thru %i..." % (start + 1, end))
    fetch_handle = Entrez.efetch(db="nuccore",
                                 rettype="fasta", retmode="text",
                                 retstart=start, retmax=batch_size,
                                 webenv=webenv,
                                 query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    the_records = the_records + data
print(the_records)
# To see how to save each block of records to a file as it is fetched, see the similar
# example at line 101, found under 'Update: Searching for citations using ELink,
# EPost and EFetch with history' of section '9.15.3 Searching for citations',
# at http://nbviewer.ipython.org/github/gumption/Using_Biopython_Entrez/blob/master/Biopython_Tutorial_and_Cookbook_Chapter_9.ipynb
You can see this script in action in an interactive IPython console here.
In case anyone finds it useful, I made a script version of this called GetmRNAforProtein.py that goes beyond the basic script here in that it has file handling, a usage message, and some feedback to the user as it works. You just point it at your file that has your FASTA records. It can be found here.
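One thing to watch for in the ELink step of the basic script: a protein with no linked mRNA comes back with an empty "LinkSetDb", so indexing it with [0] raises an IndexError. Here is a minimal sketch of a guard, assuming the parsed result has the list-of-dicts shape that Entrez.read returns for ELink. The function name `first_linked_ids` and the second example record are made up for illustration.

```python
def first_linked_ids(elink_records):
    """Map each source UID to its first linked UID, or None when ELink found no link."""
    mapping = {}
    for record in elink_records:
        source_id = record["IdList"][0]
        link_sets = record.get("LinkSetDb", [])
        if link_sets and link_sets[0]["Link"]:
            mapping[source_id] = link_sets[0]["Link"][0]["Id"]
        else:
            mapping[source_id] = None
    return mapping


# Hand-written stand-in for the parsed ELink result. The first UID pair is the
# one shown earlier in this thread; the second record is a fabricated "no link" case.
example = [
    {"IdList": ["363743782"],
     "LinkSetDb": [{"DbTo": "nuccore",
                    "LinkName": "protein_nuccore_mrna",
                    "Link": [{"Id": "363743781"}]}]},
    {"IdList": ["0000000"], "LinkSetDb": []},
]

print(first_linked_ids(example))
```

Keeping the None entries makes it easy to report which input proteins had no mRNA link instead of silently dropping them.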
Hi Neilfws,
For your second step (elink), I can't get anything other than the output below:
<eLinkResult>
<LinkSet>
<DbFrom>protein</DbFrom>
<IdList>
<Id>363743782</Id>
</IdList>
</LinkSet>
</eLinkResult>
I input this:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&%20&linkname=protein_nuccore_mrna&id=363743782
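A reply like the one above, with an IdList but no LinkSetDb element, contains no links at all, so any code that indexes straight into LinkSetDb will fail on it. Here is a minimal Python sketch that parses raw ELink XML and returns an empty list in that case. The function name `linked_ids` is hypothetical, and the two sample documents are trimmed by hand (no XML declaration or DOCTYPE), using the element names that real ELink replies use as far as I know.

```python
import xml.etree.ElementTree as ET


def linked_ids(elink_xml):
    """Return all linked UIDs in an ELink XML reply; empty list if there are none."""
    root = ET.fromstring(elink_xml)
    return [el.text for el in root.findall("./LinkSet/LinkSetDb/Link/Id")]


# Trimmed version of the reply quoted above: no LinkSetDb element.
EMPTY_REPLY = (
    "<eLinkResult><LinkSet><DbFrom>protein</DbFrom>"
    "<IdList><Id>363743782</Id></IdList>"
    "</LinkSet></eLinkResult>"
)

# Hand-written example of a reply that does carry a link (UIDs from this thread).
LINKED_REPLY = (
    "<eLinkResult><LinkSet><DbFrom>protein</DbFrom>"
    "<IdList><Id>363743782</Id></IdList>"
    "<LinkSetDb><DbTo>nuccore</DbTo><LinkName>protein_nuccore_mrna</LinkName>"
    "<Link><Id>363743781</Id></Link></LinkSetDb>"
    "</LinkSet></eLinkResult>"
)

print(linked_ids(EMPTY_REPLY))
print(linked_ids(LINKED_REPLY))
```

An empty list here means the request was understood but returned no links for that UID and link name. The sketch only detects that case and does not explain why the link is missing.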