Question

Blastx For A Million Metagenomic Sequences

3

14.7 years ago

Monzoor ▴ 310

I intend to use a similarity-based binning program such as MEGAN, SOrt-ITEMS, or CARMA to analyze the sequences in my metagenomic data set. For this, I first have to generate BLASTx output for my metagenomic sequences against a large database such as nr or Pfam. I do not have the computing resources to run standalone BLAST at this scale. Any suggestions on how to obtain BLASTx output for a million sequences?

blast sort online • 5.8k views

updated 11.0 years ago by Biostar 20 • written 14.7 years ago by Monzoor ▴ 310

2

I did much the same recently for several million metagenome sequences. Fortunately the researcher was patient, so we just waited until it finished (spread across as many CPUs as we could spare). You need to weigh the time taken against the cost. If this is a one-off computational step and you have no resources locally, EC2, as Brad suggests, is the way to go.

14.7 years ago by User 59 13k

score 7 · Answer 1 · 2010-12-19

14.7 years ago

Brad Chapman 9.7k

Amazon offers on-demand computing, which is very useful for big computational tasks when you don't have local resources:

http://aws.amazon.com/ec2/

You'll want to prepare your database on a shared Elastic Block Store (EBS):

http://aws.amazon.com/ebs/

and likely want to parallelize your task. For a large BLAST job you can split the FASTA file and run over multiple on-demand machines.
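The splitting step can be sketched in a few lines of Python. This is only an illustration, not a tested pipeline: the function names `read_fasta` and `split_fasta`, the `chunk_NNNN.fasta` naming scheme, and the default chunk size are all assumptions made for this example.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, out_dir, records_per_chunk=10000):
    """Write consecutive groups of records to numbered chunk files.

    Returns the list of chunk file paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    for i, (header, seq) in enumerate(read_fasta(path)):
        # Start a new chunk file every `records_per_chunk` records.
        if i % records_per_chunk == 0:
            if out is not None:
                out.close()
            chunk_path = out_dir / f"chunk_{len(chunk_paths):04d}.fasta"
            chunk_paths.append(chunk_path)
            out = open(chunk_path, "w")
        out.write(f"{header}\n{seq}\n")
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk can then go to its own machine or CPU, for example with something like `blastx -query chunk_0000.fasta -db nr -outfmt 6 -out chunk_0000.tsv` in BLAST+ (the file names here are placeholders), and the per-chunk outputs concatenated afterwards.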

Resources like CloudBioLinux can help, as they come with BLAST and other software pre-installed:

http://www.cloudbiolinux.com/

14.7 years ago by Brad Chapman 9.7k

0

Thanks, Brad. I've checked Amazon, and it looks good to me. Someone else in my group has now suggested checking out MEGAN-DB.

14.6 years ago by Monzoor ▴ 300

score 1 · Answer 2 · 2011-08-26

I would suggest that you choose your target database carefully. For example, running BLAST against the Swiss-Prot database would be faster than against nr. If some species in the database are irrelevant to your study, you can filter their sequences out and build a new, smaller database with the 'makeblastdb' program from the latest BLAST+ package. The searches would again be much faster.
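One way to do the filtering is sketched here in Python: drop every record whose header mentions an organism you do not need, then build the database from what remains. The function name `filter_fasta` and the simple substring match are assumptions for illustration. It relies on the organism name appearing in the FASTA header, as it does in the `OS=` field of Swiss-Prot headers.

```python
def filter_fasta(in_path, out_path, exclude_terms):
    """Copy FASTA records whose header contains none of `exclude_terms`.

    Matching is a case-insensitive substring test on the header line only.
    Returns the number of records kept.
    """
    terms = [term.lower() for term in exclude_terms]
    kept = 0
    keep = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Decide once per record, based on its header line.
                header = line.lower()
                keep = not any(term in header for term in terms)
                kept += keep
            if keep:
                dst.write(line)
    return kept
```

The filtered file can then be indexed with something like `makeblastdb -in filtered.fasta -dbtype prot -out filtered_db`, where the file and database names are placeholders.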

Cheers

Ram · Answer 3 · 2011-08-26

0

14.0 years ago

Larry_Parnell 16k

Take a look at my recent question, which is similar to your research project, and the answer supplied. Michael talks about environmental gene tags (EGTs) and gives several links to the relevant papers and the source code.

updated 5.9 years ago by Ram 45k • written 14.0 years ago by Larry_Parnell 16k