Retrieve nt sequence of chloroplaste genes (NCBI query with gene name and taxon ID)
0
0
Entering edit mode
8.1 years ago
erica.fary ▴ 20

Hello everyone,

I encounter the following problem: I have a list of chloroplast gene IDs and a list of taxonomic IDs, and I want to retrieve for each organism all the nucleotide sequences of the gene list. This seems pretty simple: I "just" want to retrieve all sequences in fasta format. It is not a problem to write a script to iterate over the different request, but even for a simple pair gene ID - organism ID I was not able to find the correct way to do this... I tried the following, for example for the pair ycf2[Gene Name] AND txid4097[Organism] :

  • first, I try the NCBI website, but by searching in Gene database (query: ycf2[Gene Name] AND txid4097[Organism]) I did not obtain the sequences, in protein database I obtained only amino acid sequence and in nucleotide database I obtained only the whole chloroplast sequence...
  • using Bioperl, I got a "WebEnv parameter is required" error (I don't copy-paste the code, it is similar as explained in http://bioperl.org/howtos/Beginners_HOWTO.html)
  • then I tried with e-utilities : esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format fasta. This works +/- but I only get the names and descriptions, not the nucleotide sequences
  • I also tried with esearch to extract the ID of the corresponding protein, but I did not succeed to retrieve the sequence (esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format xml | grep "<gene-commentary_accession>NP" | head -1 | sed -r 's/<.+>(.+)<\/.+>$/\1/')
  • finally, I tried Biopython, but got the same error as already reported (Biopython: Entrez.efetch causes UnboundLocalError)

If anybody can help, I would be very grateful !

Thank you in advance,

E.F.

EDIT: my problem could be solved if someone could tell me how to emulate via a script or command line the "Send -> "Gene features" -> "file" of a Genbank record such as this one https://www.ncbi.nlm.nih.gov/nuccore /NC_001879.2?report=genbank ??? Thanks.

ncbi database gene sequence • 3.1k views
ADD COMMENT
0
Entering edit mode

Response to your EDIT:

Would efetch -db nuccore -query NC_001879.2 -format gb > name.gb be what you are looking for?

efetch -db nuccore -id NC_001879.2 -format gb > name.gb
ADD REPLY
0
Entering edit mode

Sorry, my edit was not so clear... I would like to do this for all results returned by the query "ycf2[Gene Name] AND txid4097[Organism]", NC_001879.2 was just an example..

I tried also this command (ideally I would like gene sequences in fasta format) : esearch -db nuccore -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format fasta but the problem is that it returns the whole chloroplast genome sequence, not the gene sequence (I don't understand why...)

One last precision, it seems that the correct command line would be: esearch -db nuccore -query NC_001879.2 | efetch -format gb > name.gb

By the way, thank you for your answer.

ADD REPLY
0
Entering edit mode

I've updated my previous answer to work correctly.

The gene sequences are not stored by themselves at NCBI. The gene database only contains a form of record of the gene, whilst the sequence for the gene is stored within the genome in the database nuccore/nucleotide.

In order to get the gene fasta sequence, you will have to extract the coordinates for the genes.

esearch -db gene -query "ycf2[Gene Name] AND txid4097[Organism]" | efetch -format docsum

would for example give you an xml which you can parse for the nucleotide accession and coordinates. Then, you could retrieve the fasta sequence using efetch .

efetch -db nuccore -id NC_001879.2 -seq_start 1 -seq_stop 100 -strand + -format fasta > test.fna

This is the best I could come up with. I tried setting up an easy one-liner but I didn't manage. Anyway, try and set up a simple script for querying and parsing and you should be good to go.

ADD REPLY

Login before adding your answer.

Traffic: 1859 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6