Question

How to retrieve nucleotide sequences from gene IDs in NCBI's "gene" database?

0

6.9 years ago

john ▴ 130

Hello people,

I would like to retrieve the sequences for a set of gene entries in the NCBI "gene" database.

As an example, I would like to retrieve all sequences for this query:

"txid511145[Organism:noexp] "

URL: https://www.ncbi.nlm.nih.gov/gene/?term=txid511145%5BOrganism%3Anoexp%5D

The only way I have found so far is to download the full genome that the genes refer to and extract each sequence locally using its start position and length. Is there a better way?
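For reference, the local approach described above could look roughly like the sketch below. It assumes the genome has already been downloaded as a single-record FASTA file and that coordinates are 1-based and inclusive, as in the Gene annotation lines. The function name `extract_region` and the file name in the usage note are made up for illustration.

```shell
# extract_region GENOME_FASTA START STOP [complement]
# Prints bases START..STOP (1-based, inclusive) of a single-record FASTA file.
# With "complement" as the fourth argument, the reverse complement is printed.
extract_region() {
    # Drop the header, join the sequence lines, and cut out the region.
    region=$(grep -v '^>' "$1" | tr -d '\n' | cut -c "$2-$3")
    if [ "${4:-}" = "complement" ]; then
        # Reverse the region with awk, then complement each base with tr.
        region=$(printf '%s\n' "$region" |
            awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
            tr 'ACGTacgt' 'TGCAtgca')
    fi
    printf '%s\n' "$region"
}
```

With a locally saved genome (hypothetical file name), a call such as `extract_region NC_000913.3.fasta 4600238 4600975 complement` would print the region for one gene. Ambiguity codes other than A, C, G and T are left uncomplemented in this sketch.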

Thanks

gene id ncbi • 3.9k views

ADD COMMENT • link updated 6.9 years ago by Joseph Hughes ★ 3.0k • written 6.9 years ago by john ▴ 130

0

You may try Eutils https://www.ncbi.nlm.nih.gov/books/NBK25500/

ADD REPLY • link 6.9 years ago by Santosh Anand 5.8k

0

I do use it, actually, but I can't figure out how to do this.

ADD REPLY • link 6.9 years ago by john ▴ 130

0

What query did you try?

ADD REPLY • link 6.9 years ago by Santosh Anand 5.8k

0

The query and the URL are both in my question.

ADD REPLY • link 6.9 years ago by john ▴ 130

0

I meant the esearch/E-utilities query. Check the link and try to build an E-utilities query.

ADD REPLY • link 6.9 years ago by Santosh Anand 5.8k

0

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=txid511145%5BOrganism%3Anoexp%5D but this is not much different from the one I showed before.

ADD REPLY • link 6.9 years ago by john ▴ 130

1

Sorry, my bad! What I meant was to use Entrez Direct command line tools: https://www.ncbi.nlm.nih.gov/books/NBK179288/

You can build complicated queries with it and chain them, so that the results of one query are fed to the next. See if that helps. There is also a YouTube video about it.

ADD REPLY • link 6.9 years ago by Santosh Anand 5.8k

0

Oh, that's a nice one. I hadn't found that. Until now I have used the R package rentrez for the calls, but this one seems more powerful. Unfortunately, I still don't get what I want. Here is my query:

esearch -db gene -query 'txid511145[Organism:noexp]' | efetch -format fasta

But this again just returns the entries of the gene database. Example:

ID: 945651
99. dnaC
DNA biosynthesis protein [Escherichia coli str. K-12 substr. MG1655]
Other Aliases: b4361, ECK4351, JW4325, dnaD
Annotation: NC_000913.3 (4600238..4600975, complement)

I could use the last line to extract the corresponding FASTA locally, but I would like to know whether the NCBI server can do this for me.
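One possible way to let the server do the slicing is to turn the Annotation line into an efetch call with a sequence range. The sketch below only builds the command string and does not contact NCBI. It assumes the location always has the form `ACCESSION (start..stop)` or `ACCESSION (start..stop, complement)`, and it relies on the `-seq_start`, `-seq_stop` and `-strand` options of Entrez Direct's efetch (strand 2 meaning the minus strand). The function name `annotation_to_efetch` is made up for illustration.

```shell
# annotation_to_efetch "NC_000913.3 (4600238..4600975, complement)"
# Prints an efetch command that requests only that region from nuccore.
annotation_to_efetch() {
    # Pull out "accession start stop"; prints nothing if the format differs.
    fields=$(printf '%s\n' "$1" |
        sed -n -E 's/^([^ (]+) *\(([0-9]+)\.\.([0-9]+).*\)$/\1 \2 \3/p')
    if [ -z "$fields" ]; then
        echo "cannot parse location: $1" >&2
        return 1
    fi
    # "complement" in the location means the gene lies on the minus strand.
    case $1 in
        *complement*) strand=2 ;;
        *) strand=1 ;;
    esac
    # Split the three fields into $1 $2 $3 (intentionally unquoted).
    set -- $fields
    echo "efetch -db nuccore -id $1 -seq_start $2 -seq_stop $3 -strand $strand -format fasta"
}

# Show the command for the dnaC example above without running it.
annotation_to_efetch "NC_000913.3 (4600238..4600975, complement)"
```

For the dnaC line this prints `efetch -db nuccore -id NC_000913.3 -seq_start 4600238 -seq_stop 4600975 -strand 2 -format fasta`. Running that command for each gene means one request per gene, so for a whole genome a single bulk download is likely to be kinder to the server.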

ADD REPLY • link 6.9 years ago by john ▴ 130

score 2 · Accepted Answer · 2018-01-16

2

6.9 years ago

Joseph Hughes ★ 3.0k

I think something like this, using the elink function of E-utilities, should work:

esearch -db gene -query 'txid511145[Organism:noexp]' | elink -target nuccore | efetch -format fasta

As you know the accession number of your genome, you are much better off starting from that. The following retrieves all coding sequences for the reference genome:

esearch -db nuccore -query 'NC_000913.3' | efetch -format fasta_cds_na

This gives a total of 4319 sequences. NC_000913.2 is an older version of the same accession.
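Once the coding sequences are saved to a file, a couple of small helpers can check the record count and pull out individual genes. This is only a sketch: it assumes the FASTA headers carry tags of the form `[locus_tag=b4361]`, which is how `fasta_cds_na` output is usually labelled, and the function names and the file name `cds.fasta` are made up for illustration.

```shell
# Count FASTA records read from standard input (one ">" header per record).
count_fasta_records() {
    grep -c '^>'
}

# select_by_locus_tag TAG
# Print only the records whose header contains [locus_tag=TAG].
select_by_locus_tag() {
    awk -v tag="[locus_tag=$1]" '
        /^>/ { keep = (index($0, tag) > 0) }   # decide once per header
        keep                                   # print header and its sequence lines
    '
}
```

For example, after saving the efetch output to `cds.fasta`, `count_fasta_records < cds.fasta` gives the number of records, and `select_by_locus_tag b4361 < cds.fasta` prints the record for that locus tag if the header format matches the assumption.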

ADD COMMENT • link 6.9 years ago by Joseph Hughes ★ 3.0k

0

Oh, that looked so good, but the result is not what I hoped for. The returned FASTA has just 38 entries, while this E. coli strain should have 4516 genes. One of the entries is also the whole genome. I don't really know what these results refer to anyway, as all the genes map to only two entries in the nuccore database, "NC_000913.3" and "NC_000913.2".

ADD REPLY • link 6.9 years ago by john ▴ 130

0

Does your starting point have to be the taxid? The problem with starting from a taxid is that it is not very precise. It sounds like you know the two full reference genomes that you want to extract genes from, so why not start from those accession numbers?

ADD REPLY • link 6.9 years ago by Joseph Hughes ★ 3.0k

0

That works for me. Yes, the starting point is the organism, so the txid, but that's okay. I will check for the best genome and work with that from here. Thanks!

ADD REPLY • link 6.9 years ago by john ▴ 130