Blastx For A Million Metagenomic Sequences
13.9 years ago
Monzoor ▴ 310

I intend to use a similarity-based binning program such as MEGAN, SOrt-ITEMS or CARMA to analyze the sequences in my metagenomic data set. For this, I first have to generate BLASTx output for my metagenomic sequences against a large database such as nr or Pfam. I do not have the computing resources to run standalone BLAST locally at that scale. Any suggestions on how to obtain BLASTx output for a million sequences?

blast sort online • 5.3k views

I did much the same recently for several million metagenome sequences. Fortunately the researcher was patient, so we just waited until it finished (spread across as many CPUs as we could spare). You need to weigh the time taken against the cost. If this is a one-off computational step and you have no resources locally, EC2, as Brad suggests, is the way to go.
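To illustrate spreading the work across spare CPUs, here is a minimal sketch, assuming the query has already been split into chunk files and that the BLAST+ 'blastx' binary is on the PATH. The function names, the output naming scheme and the e-value cutoff are hypothetical choices made for illustration, not something used in the run described above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def blastx_command(chunk, db, out_dir, evalue="1e-5"):
    """Build the blastx command line for one chunk of query sequences."""
    # Hypothetical naming scheme: <chunk name>.blastx.tsv in out_dir.
    out_file = Path(out_dir) / (Path(chunk).stem + ".blastx.tsv")
    return [
        "blastx",
        "-query", str(chunk),
        "-db", str(db),
        "-out", str(out_file),
        "-outfmt", "6",      # tabular output
        "-evalue", evalue,   # assumed cutoff, adjust for your data
    ]


def run_all(chunks, db, out_dir, workers=4):
    """Run one blastx process per chunk, at most `workers` at a time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [blastx_command(chunk, db, out_dir) for chunk in chunks]
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

Set `workers` to the number of CPUs you can spare. Tabular output is used here only as an example, so check which BLAST output format your binning program expects before starting a long run.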

13.9 years ago

Amazon offers on-demand computing, which is very useful for big computational tasks when you don't have local resources:

http://aws.amazon.com/ec2/

You'll want to prepare your database on a shared Elastic Block Store (EBS):

http://aws.amazon.com/ebs/

and likely want to parallelize your task. For a large BLAST job you can split the FASTA file and run over multiple on-demand machines.
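Splitting the FASTA file can be sketched as below. This is a minimal example in plain Python with no external libraries; the function name, the chunk file naming and the default chunk size are hypothetical and only there for illustration.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, seqs_per_chunk=10000):
    """Split a FASTA file into files of at most `seqs_per_chunk` records."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out_handle = None
    n_in_chunk = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Start a new chunk file when the current one is full.
                if out_handle is None or n_in_chunk >= seqs_per_chunk:
                    if out_handle is not None:
                        out_handle.close()
                    path = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.fasta"
                    out_handle = path.open("w")
                    chunk_paths.append(path)
                    n_in_chunk = 0
                n_in_chunk += 1
            # Lines before the first header (if any) are skipped.
            if out_handle is not None:
                out_handle.write(line)
    if out_handle is not None:
        out_handle.close()
    return chunk_paths
```

Each returned chunk file can then be sent to a different machine. Smaller chunks spread the load more evenly across machines, at the price of more files to track and merge afterwards.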

Resources like CloudBioLinux can help, as they come pre-installed with BLAST and other software:

http://www.cloudbiolinux.com/


Thanks Brad. I've checked Amazon and it looks good to me. Someone else in my group has now suggested checking out MEGAN-DB.

13.3 years ago

I would suggest that you choose your target database carefully. For example, searching against the Swiss-Prot database would be faster than searching against nr. If some species in the database are irrelevant to your study, you can filter them out and build a new, smaller database with the 'makeblastdb' program from the latest BLAST+ package. The searches would again be much faster.
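A rough sketch of that filtering step follows. It assumes the species name appears as plain text in each FASTA header, which depends on the database and is not guaranteed; filtering by taxonomy identifiers would be more robust. The function names and file names are hypothetical.

```python
def filter_fasta(in_path, out_path, unwanted):
    """Copy FASTA records whose header mentions none of the `unwanted` terms."""
    unwanted = [term.lower() for term in unwanted]
    kept = 0
    keep_record = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Simple case-insensitive substring match on the header line.
                header = line.lower()
                keep_record = not any(term in header for term in unwanted)
                kept += keep_record
            if keep_record:
                dst.write(line)
    return kept


def makeblastdb_command(fasta_path, db_name):
    """Build the makeblastdb call that turns the filtered FASTA into a protein database."""
    return ["makeblastdb", "-in", str(fasta_path), "-dbtype", "prot", "-out", str(db_name)]
```

The command list returned by `makeblastdb_command` can be passed to `subprocess.run` once BLAST+ is installed. The resulting database name is then what you give to 'blastx' with `-db`.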

Cheers

13.3 years ago

Take a look at my recent question, which is similar to your research project, and the answer supplied. Michael talks about environmental gene tags (EGTs) and gives several links to the relevant papers and the source code.
