Downloading shotgun assembly sequences from NCBI using Biopython
9.5 years ago
Prasad ▴ 50

Hi All,

I need to download sequences for a genome using a link similar to http://www.ncbi.nlm.nih.gov/nuccore/ACHI00000000. Downloading involves a few steps:

When this page is opened in a browser, it shows a WGS link; clicking it leads to the following page: http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ACHI01#contigs

I would like to download all sequences from this page. Is there a nice way to download sequences for multiple genomes using Biopython or any other Python module?

Please advise. Thanks!

WGS biopython • 4.7k views
9.5 years ago
steven ▴ 70

If you navigate to the "Download" link next to the Contigs tab, you can download a gzip archive of all of the contig sequences in GenBank or FASTA format. Unzip the file and it is ready to use, but make sure to change the file extension.

You can then use a SeqIO iterator to parse the FASTA file: http://biopython.org/wiki/SeqIO#Sequence_Input
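
As a rough sketch of that parsing step (the function names `open_sequence_file` and `contig_lengths` are made up for illustration, and the code assumes the downloaded archive keeps a `.gz` suffix), the file can be read without unzipping it by hand:

```python
import gzip


def open_sequence_file(path):
    """Open a plain or gzip-compressed sequence file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def contig_lengths(path, fmt="fasta"):
    """Return {record id: sequence length} for every record in the file."""
    # Imported here so open_sequence_file works even without Biopython.
    from Bio import SeqIO

    with open_sequence_file(path) as handle:
        return {record.id: len(record.seq) for record in SeqIO.parse(handle, fmt)}
```

Passing `fmt="genbank"` should work the same way if you downloaded the GenBank version instead.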


I can download the sequences by navigating to the Download link, but I want to do this from a script, since downloading them manually for more than 500 bacteria is not feasible. So I was wondering whether there is a way to use Entrez from Biopython to query pages like http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ACHI01#contigs for a given bacterium.


Oh, sorry, I misunderstood your original question. You can use Biopython to implement the Entrez commands used in this guide and then save the sequences in FASTA/GenBank format with SeqIO. If you have any questions about using Biopython, let me know.

If you have a list of bacteria search terms/accession IDs in a text file, open the file for reading in Python and, for each line, perform the three Entrez commands in the guide and then write the WGS sequence to a file.
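
The loop described above might look roughly like this. It is only a sketch: `read_accessions` and `fetch_all` are hypothetical helper names, the `fetch` argument is a placeholder for whatever retrieval step you end up using (it is not the guide's Entrez code), and each line of the list is assumed to be safe to use as a file name.

```python
from pathlib import Path
from typing import Callable, Iterator, List


def read_accessions(list_path: str) -> Iterator[str]:
    """Yield one entry per non-empty line, skipping lines that start with '#'."""
    with open(list_path) as handle:
        for line in handle:
            term = line.strip()
            if term and not term.startswith("#"):
                yield term


def fetch_all(list_path: str, out_dir: str, fetch: Callable[[str], str]) -> List[str]:
    """Call fetch(term) for every entry and save each result as <term>.fasta.

    `fetch` is a placeholder supplied by the caller; it must return FASTA text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for term in read_accessions(list_path):
        target = out / f"{term}.fasta"
        target.write_text(fetch(term))
        written.append(str(target))
    return written
```

Keeping the retrieval step as a separate function means you can swap in Entrez calls or a direct download later without touching the loop.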


WGS has a CGI interface, where you can download complete sets of contigs:

wget 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?download=ACHI01.1.fsa_nt.gz' -O - | gunzip > ACHI01.fasta

With urllib or requests you can call this URL directly from Python.
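
A minimal sketch with urllib from the standard library, assuming the URL pattern in the wget line above still holds for other projects (the `<prefix>.1.fsa_nt.gz` form is inferred from that single example, and the function names are made up here):

```python
import gzip
import urllib.request

# URL pattern copied from the wget example; only the project prefix varies.
WGS_URL = "http://www.ncbi.nlm.nih.gov/Traces/wgs/?download={prefix}.1.fsa_nt.gz"


def wgs_download_url(prefix: str) -> str:
    """Build the download URL for a WGS project prefix such as 'ACHI01'."""
    return WGS_URL.format(prefix=prefix)


def download_wgs_fasta(prefix: str, out_path: str) -> None:
    """Fetch the gzipped contig FASTA and write it uncompressed to out_path."""
    with urllib.request.urlopen(wgs_download_url(prefix)) as response:
        # Reads the whole archive into memory before decompressing.
        data = gzip.decompress(response.read())
    with open(out_path, "wb") as out:
        out.write(data)
```

For example, `download_wgs_fasta("ACHI01", "ACHI01.fasta")` would do the same job as the wget line. For several hundred genomes it is probably wise to pause between requests.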


Thanks piet, I will try it out.

9.5 years ago
Prasad ▴ 50

Hi Steven, thanks for replying. I did do that. I downloaded FASTA sequences from the nuccore database and parsed them using SeqIO, which works fine for complete genomes. But for WGS assemblies, it just prints 'NNN'.

Please see this post: "Obtaining sequence from Bioproject IDs using biopython gives unknown sequence".
