Question

Retrieve genome in fasta format from ncbi

1

6.2 years ago

tim.ivanov.92 ▴ 40

What is the way to retrieve genomes from NCBI via Biopython? I can get a record for my genome of interest, and I can also download it manually from the search results.

But how do I download it from inside a script?

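from Bio import Entrez

Entrez.email = "my_email@email.ru"
# Search the Genome database for the organism of interest
handle = Entrez.esearch(db="genome", term="Drosophila eugracilis[Orgn]", idtype="acc")
record = Entrez.read(handle)
handle.close()
# Print every field of the search result
for key, value in record.items():
    print(key, value)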

biopython genome ncbi • 4.2k views

ADD COMMENT • link updated 5.4 years ago by atiestorage ▴ 10 • written 6.2 years ago by tim.ivanov.92 ▴ 40

0

You need efetch, see for example:

https://stackoverflow.com/a/26347810/3691040

ADD REPLY • link 6.2 years ago by Joe 21k

0

The problem with using the nucleotide db is that the resulting list of IDs is empty:

search_term = 'Drosophila eugracilis[orgn] AND complete genome[title]'
handle = Entrez.esearch(db='nucleotide', term=search_term)
genome_ids = Entrez.read(handle)['IdList']
print(genome_ids)
# prints: []

And when I call efetch with the ID from the genome db (6863), it returns something else:

records=Entrez.efetch(db="nucleotide", id=6863, rettype="gb", retmode="text")
print(records.read())

LOCUS       X51384                   564 bp    DNA     linear   INV 15-NOV-2007
DEFINITION  Caenorhabditis elegans DNA encoding U4-3 snRNA.

That is not a genome.

ADD REPLY • link 6.2 years ago by tim.ivanov.92 ▴ 40

0

Yeah that's weird, I see the same thing.

Have you tried your query directly on NCBI's website first to see what you get?

ADD REPLY • link 6.2 years ago by Joe 21k

0

It seems you need to use db="genome", not nucleotide, in your efetch.

ID 6863 in Nucleotide points to that snRNA, but the same ID number in Genome does correctly point to that Drosophila species (alternatively, find out which ID the Drosophila genome uses inside Nucleotide).

ADD REPLY • link 6.2 years ago by Joe 21k

0

That is true: I used the ID from db=genome, which I first found on the NCBI web server. But when I change db to genome in the last request, it says:

HTTPError Traceback (most recent call last)
<ipython-input-255-7978eb8b163d> in <module>()
----> 1 records=Entrez.efetch(db="genome", id=6863, rettype="gb", retmode="text")
      2 print(records.read())

/uge_mnt/home/tim_ivanov/pythonlibs/biopython-1.71/Bio/Entrez/__init__.pyc in efetch(db, **keywords)
    178             # more than about 200 IDs
    179             post = True
--> 180     return _open(cgi, variables, post=post)
    181 
    182 

/uge_mnt/home/tim_ivanov/pythonlibs/biopython-1.71/Bio/Entrez/__init__.pyc in _open(cgi, params, post, ecitmatch)
    528             handle = _urlopen(cgi)
    529     except _HTTPError as exception:
--> 530         raise exception
    531 
    532     return _binary_to_string_handle(handle)

HTTPError: HTTP Error 400: Bad Request

ADD REPLY • link 6.2 years ago by tim.ivanov.92 ▴ 40

0

Hi there,

It is not possible to download sequences directly from the Genome database; you need to use elink to reach the record that actually holds the sequence.
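
A minimal Biopython sketch of that elink step, under stated assumptions: the helper names `linked_ids` and `assembly_ids_for_genome` are made up for illustration, and whether the Genome database still exposes links to Assembly depends on the current state of NCBI's services, so check the result rather than rely on it.

```python
def linked_ids(link_sets):
    """Collect every linked ID from a parsed elink result.

    `link_sets` is the structure Entrez.read() returns for elink: a list of
    link sets, each with a "LinkSetDb" list whose entries hold a "Link" list.
    """
    ids = []
    for link_set in link_sets:
        for link_db in link_set.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in link_db.get("Link", []))
    return ids


def assembly_ids_for_genome(genome_id, email):
    """Ask NCBI which Assembly records are linked to a Genome record.

    Hypothetical helper: needs network access and Biopython.
    """
    # Imported here so linked_ids() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.elink(dbfrom="genome", db="assembly", id=str(genome_id))
    try:
        return linked_ids(Entrez.read(handle))
    finally:
        handle.close()
```

Calling `assembly_ids_for_genome(6863, "you@example.org")` with the Genome ID from this thread should return a list of Assembly IDs if the link exists; an empty list means NCBI reported no linked assemblies for that ID.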

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

0

Can you please give an example of using it in a pipeline with efetch to download a genome? Or point me to a tutorial page that covers it?

ADD REPLY • link 6.2 years ago by tim.ivanov.92 ▴ 40

1

5.4 years ago

atiestorage ▴ 10

Another approach, in shell:

for org in \
  "Agrobacterium tumefaciens" \
  "Bacillus anthracis" \
  "Escherichia coli" \
  "Neisseria gonorrhoeae" \
  "Pseudomonas aeruginosa" \
  "Shigella flexneri" \
  "Streptococcus pneumoniae"
do
  echo "Download URL for: $org"
  data=$(esearch -db genome -query "$org [ORGN]" | efetch -format docsum | xtract -pattern DocumentSummary -element Id)
  esearch -db genome -query "$data" | elink -target assembly | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank
  sleep 1
done

You can also adjust the query to search by taxid, among many other options. Everything is documented here:
https://www.ncbi.nlm.nih.gov/books/NBK25501/
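
For anyone who wants the last step of that loop (esummary, then pulling out `FtpPath_GenBank`) inside a Python script, here is a rough Biopython sketch. The function names are hypothetical, the Assembly IDs are assumed to come from an earlier esearch or elink call, and the `validate=False` setting is my assumption about what the assembly summaries need, not something confirmed in this thread.

```python
def ftp_paths_from_summary(summary, field="FtpPath_GenBank"):
    """Pull one FTP path per assembly out of a parsed esummary result.

    `summary` is what Entrez.read() returns for db="assembly": the records
    sit under summary["DocumentSummarySet"]["DocumentSummary"].
    """
    docs = summary["DocumentSummarySet"]["DocumentSummary"]
    # Some assemblies have no path (or an empty one) for a given field; skip those.
    return [doc[field] for doc in docs if doc.get(field)]


def assembly_ftp_paths(assembly_ids, email, field="FtpPath_GenBank"):
    """Fetch summaries for Assembly IDs and return their FTP paths.

    Hypothetical helper: needs network access and Biopython.
    """
    # Imported here so the parsing helper works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esummary(db="assembly", id=",".join(map(str, assembly_ids)))
    try:
        # Assumption: parse without DTD validation, which is commonly
        # done for assembly summaries.
        summary = Entrez.read(handle, validate=False)
    finally:
        handle.close()
    return ftp_paths_from_summary(summary, field)
```

Passing `field="FtpPath_RefSeq"` returns the RefSeq directories instead, for assemblies that have one.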

ADD COMMENT • link updated 5.4 years ago by ATpoint 85k • written 5.4 years ago by atiestorage ▴ 10

score 3 · Accepted Answer · 2018-09-06

3

6.2 years ago

Sej Modha 5.3k

The esearch/elink commands differ depending on the genome of interest.

For example, the following command, run with NCBI's Unix utilities (EDirect), can be used to locate the assembly so that it can be downloaded in FASTA format:

esearch -db genome -query "6863"|elink -target assembly|esummary|xtract -pattern FtpPath_RefSeq -element FtpPath_RefSeq

This should give you the URL where the assembly files are stored; the relevant files can be downloaded from that URL.
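
To go from that URL to the FASTA file inside a script, something like the following standard-library sketch may work. It assumes the usual NCBI layout, where an assembly directory named `<name>` contains `<name>_genomic.fna.gz`; the function names are made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def genomic_fasta_url(ftp_path):
    """Build the URL of the gzipped genomic FASTA inside an assembly directory.

    Assumes the directory <name> holds a file called <name>_genomic.fna.gz.
    """
    base = ftp_path.rstrip("/")
    name = base.rsplit("/", 1)[-1]
    # NCBI serves the same tree over HTTPS, so swap the scheme if needed.
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    return f"{base}/{name}_genomic.fna.gz"


def download(url, out_dir="."):
    """Stream a URL to a file in out_dir and return the local path."""
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Usage would be `download(genomic_fasta_url(path))`, where `path` is the FtpPath_RefSeq value printed by the command above; the result is a gzipped FASTA file in the current directory.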