Downloading all COI sequences from BOLD fails
4.6 years ago
rDNA ▴ 20

I have COI metabarcoding sequence data from bulk animal samples (including Arthropoda, Nematoda, Annelida and Mollusca) and I want to BLAST all of these sequences. I used the following command: blastn -remote -db nt -query COI_all.fasta -num_alignments 2 -out COI_blasted.txt. However, this results in errors similar to those described in this post: Problem when running remote Blast with a big file.

These errors probably appear because of the number of sequences in my file (around 700), which causes the remote connection to be interrupted.
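If the size of the query file is indeed the cause, one workaround would be to split it into smaller batches and submit each batch separately. Here is a minimal sketch of that idea; the batch size of 50, the `batches` output directory and the function names are arbitrary choices for illustration, not anything BLAST prescribes.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def _write_batch(batch, out_dir, index):
    """Write one batch of records and return the path of the new file."""
    target = out_dir / f"batch_{index:03d}.fasta"
    with open(target, "w") as handle:
        for header, seq in batch:
            handle.write(f">{header}\n{seq}\n")
    return target


def split_fasta(path, batch_size=50, out_dir="batches"):
    """Split a FASTA file into files of at most batch_size records each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, batch = [], []
    for record in read_fasta(path):
        batch.append(record)
        if len(batch) == batch_size:
            written.append(_write_batch(batch, out, len(written) + 1))
            batch = []
    if batch:  # leftover records that did not fill a whole batch
        written.append(_write_batch(batch, out, len(written) + 1))
    return written
```

Calling `split_fasta("COI_all.fasta")` on a file of around 700 sequences would give 14 files of 50, each of which could then be passed to the blastn command above as `-query`.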

One solution I found is to run blastn against a local database. Since the samples are so diverse, I would like to download ALL animal COI sequences from BOLD (or GenBank). It would not be a problem if non-animal (e.g. plant) sequences were also included.
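For reference, the local route would look roughly like this once a reference FASTA is available. This is only a sketch: it assumes BLAST+ is installed and on the PATH, and `bold_coi.fasta` and `coi_db` are hypothetical names for the downloaded reference file and the resulting database.

```python
import subprocess


def makeblastdb_cmd(reference_fasta, db_name):
    """Command line that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", reference_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query, db_name, out_file, num_alignments=2):
    """Same blastn call as the remote one, but pointed at a local database."""
    return [
        "blastn",
        "-db", db_name,
        "-query", query,
        "-num_alignments", str(num_alignments),
        "-out", out_file,
    ]


def run_local_blast(reference_fasta, query, out_file, db_name="coi_db"):
    """Build the database, then search it; raises if either tool fails."""
    subprocess.run(makeblastdb_cmd(reference_fasta, db_name), check=True)
    subprocess.run(blastn_cmd(query, db_name, out_file), check=True)
```

With those names, `run_local_blast("bold_coi.fasta", "COI_all.fasta", "COI_blasted.txt")` would replace the `-remote -db nt` call.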

I think the BOLD database would be a great reference to BLAST my sequences against. However, I'm currently struggling to find a good way to download all animal COI sequences from BOLD.

When I enter COI-5P as the search term on http://v4.boldsystems.org/index.php/Public_SearchTerms, I receive the error: "Your search terms resulted in too many matching terms. Please try again with more specific search criteria." I could probably download the sequences for each phylum separately and merge them, but I'd rather download just one file.
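For what it's worth, the merge step itself would be small. A sketch, assuming each per-phylum download is a plain FASTA file (the file names in the usage note are hypothetical):

```python
def merge_fasta(parts, merged_path):
    """Concatenate FASTA files into one file and return the number of records."""
    records = 0
    with open(merged_path, "w") as out:
        for part in parts:
            last = "\n"
            with open(part) as handle:
                for line in handle:
                    if line.startswith(">"):
                        records += 1
                    out.write(line)
                    last = line
            # Guard against a file that lacks a trailing newline, so the next
            # header does not end up glued to the previous sequence line.
            if not last.endswith("\n"):
                out.write("\n")
    return records
```

For example, `merge_fasta(["Annelida.fasta", "Mollusca.fasta"], "all_COI.fasta")` writes the combined file and returns how many sequences it contains.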

I also tried the API by running: wget http://v4.boldsystems.org/index.php/API_Public/sequence?marker=COI-5P. A download starts, but it stalls after around 3.7 MB, and the file I receive contains only ~5000 sequences.

Does anyone have a solution to download all COI sequences from BOLD in one file?

I could also download COI sequences from GenBank via ftp://ftp.ncbi.nlm.nih.gov/blast/db/, but I'm not sure exactly which files I need. For 16S, 18S, etc. it is obvious, but not for COI. Any suggestions?

Thanks for the help.

software error next-gen • 2.1k views

Have you tried the search interface BOLD provides here?

They used to provide a direct download of all data at one point (for reference only: A: Downloading all COI sequences from BOLD database), but they seem to have removed that option from their site. You could email them and ask.

Thank you for your reply. My wget command is based on the Biostars post you linked, but it no longer seems to work that way. I will get in contact with them.

I've contacted BOLD about the stalling behavior of the download, and this is their reply: "This issue is because of the large API request that retrieves millions of records, which our system does not handle. Please break up the search by smaller groups, such as classes."

There are some issues with BOLD and its API. I remember that at one point you could download the data for every phylum except Arthropoda with the API.

So you could write a script that downloads the data per phylum with the API, perhaps building the list of phyla in Python from this page: http://v3.boldsystems.org/index.php/TaxBrowser_Home. Run your script and, once it has finished, remove the Arthropoda files. Then download the Arthropoda data manually from here: http://v3.boldsystems.org/index.php/Public_SearchTerms?query=Arthropoda[tax].
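A rough sketch of such a script is below. It reuses the API endpoint from the question and assumes the endpoint also accepts a `taxon` parameter next to `marker` (that is how I remember the BOLD API, but please check it against their documentation). The phylum list is just an example subset taken from the question, not the full TaxBrowser list, and the timeout is an arbitrary choice.

```python
import shutil
import urllib.parse
import urllib.request

# Endpoint taken from the wget command in the question.
API = "http://v4.boldsystems.org/index.php/API_Public/sequence"

# Example subset only; the full list would come from the TaxBrowser page.
PHYLA = ["Annelida", "Mollusca", "Nematoda"]


def build_url(taxon, marker="COI-5P"):
    """Build the request URL for one taxon (the 'taxon' parameter is an assumption)."""
    query = urllib.parse.urlencode({"taxon": taxon, "marker": marker})
    return f"{API}?{query}"


def download(taxon, out_path, timeout=600):
    """Stream the response for one taxon to disk without loading it into memory."""
    with urllib.request.urlopen(build_url(taxon), timeout=timeout) as response:
        with open(out_path, "wb") as out:
            shutil.copyfileobj(response, out)


def download_all(phyla):
    """Download one FASTA file per phylum and return the file names."""
    paths = []
    for phylum in phyla:
        path = f"{phylum}_COI-5P.fasta"
        download(phylum, path)
        paths.append(path)
    return paths
```

Running `download_all(PHYLA)` would fetch one file per phylum. Arthropoda is left out on purpose; going by the reply from BOLD quoted above, it would probably have to be split into smaller groups such as classes, or downloaded manually.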
