Question

Downloading shotgun assembly sequences from NCBI using Biopython

9.5 years ago

Prasad ▴ 50

Hi All,

I need to download sequences for a genome using a link similar to http://www.ncbi.nlm.nih.gov/nuccore/ACHI00000000. Downloading involves a few steps:

When this page is opened in a browser, it shows a WGS link that, when clicked, leads to the following page: http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ACHI01#contigs

I would like to download all sequences from this page. Is there a nice way to download sequences for multiple genomes using Biopython or any other Python module?

Please advise. Thanks!

WGS biopython • 4.7k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Prasad ▴ 50

Ram · Answer 1 · 2015-06-11

9.5 years ago

steven ▴ 70

If you navigate to the "Download" link next to the Contigs tab, you can download a gzip archive of all of the contig sequences in GenBank or FASTA format. Then decompress the file and it will be usable; make sure to change the file extension, though.

You can then use a SeqIO iterator to easily parse the FASTA file: http://biopython.org/wiki/SeqIO#Sequence_Input
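
As a minimal sketch of that parsing step, assuming Biopython is installed and the archive has already been decompressed, the function below walks the file with `SeqIO.parse` and collects each contig's ID and length. The function name `summarize_fasta` is made up for illustration.

```python
from Bio import SeqIO


def summarize_fasta(path):
    """Return a list of (record id, sequence length) for each FASTA record."""
    # SeqIO.parse yields one SeqRecord at a time, so large contig sets
    # are read record by record rather than loaded in one go.
    return [(record.id, len(record.seq)) for record in SeqIO.parse(path, "fasta")]
```

Calling `summarize_fasta("ACHI01.fasta")` (a hypothetical name for the decompressed download) would give one tuple per contig.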

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by steven ▴ 70

I could download the sequences by navigating to the Download link, but I want to do that from a script, as downloading them manually for more than 500 bacteria is not feasible. So I was wondering if there is any way to use Entrez from Biopython to query this kind of page for a given bacterium:

http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ACHI01#contigs

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by Prasad ▴ 50

Oh! Sorry, I misunderstood your original question. You can use Biopython to implement the Entrez commands used in this guide and then save the sequences in FASTA/GenBank format with SeqIO. If you have any questions about using Biopython, let me know.

If you have a list of bacteria search terms/accession IDs in a text file, open the file for reading in Python and, for each line, perform the three Entrez commands in the guide and then write the WGS sequence to a file.
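
The loop described above might look roughly like the sketch below. The guide's three commands are not reproduced in this thread, so a single `Entrez.efetch` call against `nuccore` stands in for them here; that substitution, the helper names (`read_ids`, `fetch_fasta`, `download_all`), and the placeholder e-mail address are all assumptions rather than the guide's actual method.

```python
from pathlib import Path

from Bio import Entrez

# Placeholder address; NCBI asks that Entrez requests carry a real contact e-mail.
Entrez.email = "you@example.org"


def read_ids(id_file):
    """Return the non-empty, whitespace-stripped lines of a text file."""
    with open(id_file) as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_fasta(accession):
    """Fetch one nuccore record as FASTA text (assumed stand-in for the guide)."""
    handle = Entrez.efetch(db="nuccore", id=accession, rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def download_all(id_file, out_dir, fetch=fetch_fasta):
    """Write one <accession>.fasta file per ID and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for accession in read_ids(id_file):
        target = out_dir / f"{accession}.fasta"
        target.write_text(fetch(accession))
        written.append(target)
    return written
```

The `fetch` argument is there so a different retrieval step can be swapped in without touching the loop, which may matter if a plain `efetch` on a WGS master record does not return the contig sequences.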

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by steven ▴ 70

WGS has a CGI interface, where you can download complete sets of contigs:

wget 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?download=ACHI01.1.fsa_nt.gz' -O - | gunzip > ACHI01.fasta

With urllib or requests you can call this URL directly from Python.
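
A minimal standard-library sketch of the same download is below. The URL pattern is copied from the wget command above; the `.1.fsa_nt.gz` suffix is reused verbatim without knowing what the `.1` stands for, and whether the endpoint still answers in this form has not been checked. The function names are hypothetical.

```python
import gzip
import shutil
import urllib.request

# URL pattern taken from the wget example; treat it as an assumption.
WGS_URL = "http://www.ncbi.nlm.nih.gov/Traces/wgs/?download={prefix}.1.fsa_nt.gz"


def wgs_download_url(prefix):
    """Build the download URL for a WGS project prefix such as 'ACHI01'."""
    return WGS_URL.format(prefix=prefix)


def download_contigs(prefix, out_path):
    """Stream the gzipped FASTA for one project and write it decompressed."""
    with urllib.request.urlopen(wgs_download_url(prefix)) as response:
        # Decompress while streaming, like `wget ... -O - | gunzip > file`.
        with gzip.GzipFile(fileobj=response) as gz, open(out_path, "wb") as out:
            shutil.copyfileobj(gz, out)
    return out_path
```

For many genomes, `download_contigs(prefix, prefix + ".fasta")` could be called in a loop over a list of WGS prefixes, ideally with a short pause between requests.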

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by piet ★ 1.9k

Thanks piet, I will try it out.

ADD REPLY • link 9.5 years ago by Prasad ▴ 50

Ram · Answer 2 · 2015-06-11

9.5 years ago

Prasad ▴ 50

Hi Steven, thanks for replying. I did do that. I downloaded FASTA sequences from the nuccore database and parsed them using SeqIO, and that works fine for complete genomes. But for WGS assemblies, it just prints 'NNN'.

Please see this post: "Obtaining sequence from Bioproject IDs using biopython gives unknown sequence".

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Prasad ▴ 50