Hello everyone,
I encounter the following problem: I have a list of chloroplast gene IDs and a list of taxonomic IDs, and I want to retrieve for each organism all the nucleotide sequences of the gene list. This seems pretty simple: I "just" want to retrieve all sequences in fasta format. It is not a problem to write a script to iterate over the different request, but even for a simple pair gene ID - organism ID I was not able to find the correct way to do this... I tried the following, for example for the pair ycf2[Gene Name] AND txid4097[Organism] :
- first, I try the NCBI website, but by searching in Gene database (query: ycf2[Gene Name] AND txid4097[Organism]) I did not obtain the sequences, in protein database I obtained only amino acid sequence and in nucleotide database I obtained only the whole chloroplast sequence...
- using Bioperl, I got a "WebEnv parameter is required" error (I don't copy-paste the code, it is similar as explained in http://bioperl.org/howtos/Beginners_HOWTO.html)
- then I tried with e-utilities : esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format fasta. This works +/- but I only get the names and descriptions, not the nucleotide sequences
- I also tried with esearch to extract the ID of the corresponding protein, but I did not succeed to retrieve the sequence (esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format xml | grep "<gene-commentary_accession>NP" | head -1 | sed -r 's/<.+>(.+)<\/.+>$/\1/')
- finally, I tried Biopython, but got the same error as already reported (Biopython: Entrez.efetch causes UnboundLocalError)
If anybody can help, I would be very grateful !
Thank you in advance,
E.F.
EDIT: my problem could be solved if someone could tell me how to emulate via a script or command line the "Send -> "Gene features" -> "file" of a Genbank record such as this one https://www.ncbi.nlm.nih.gov/nuccore /NC_001879.2?report=genbank ??? Thanks.
Response to your EDIT:
Would
be what you are looking for?efetch -db nuccore -query NC_001879.2 -format gb > name.gb
Sorry, my edit was not so clear... I would like to do this for all results returned by the query "ycf2[Gene Name] AND txid4097[Organism]", NC_001879.2 was just an example..
I tried also this command (ideally I would like gene sequences in fasta format) : esearch -db nuccore -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format fasta but the problem is that it returns the whole chloroplast genome sequence, not the gene sequence (I don't understand why...)
One last precision, it seems that the correct command line would be: esearch -db nuccore -query NC_001879.2 | efetch -format gb > name.gb
By the way, thank you for your answer.
I've updated my previous answer to work correctly.
The gene sequences are not stored by themselves at NCBI. The gene database only contains a form of record of the gene, whilst the sequence for the gene is stored within the genome in the database nuccore/nucleotide.
In order to get the gene fasta sequence, you will have to extract the coordinates for the genes.
would for example give you an xml which you can parse for the nucleotide accession and coordinates. Then, you could retrieve the fasta sequence using efetch .
This is the best I could come up with. I tried setting up an easy one-liner but I didn't manage. Anyway, try and set up a simple script for querying and parsing and you should be good to go.