Question

What Are The Server Requirements For Analysis And Alignment Of 454 Environmental Metagenomics Data?

13.9 years ago

John ▴ 790

I'm doing shotgun environmental metagenomics of 454 data and investigating setting up my own local server and running MEGAN, CARMA or Galaxy Metagenomics for my analysis and workflow.

If I'm doing simple analysis of who's in the community and which genes are present, and I have a couple of samples to analyze, what kind of server will I need to purchase? This doesn't seem that computationally intensive to me. My guess is there are around 100 different kinds of bacteria living in my community, let's say the average genome size is 3Mb.

How might these server requirements change if I wanted to do genome alignments (e.g. Velvet)?

For budgeting purposes, what do you estimate the server requirements would be? I expect about one of these samples per week for about 6 months, 24 samples in total for this project.

Many thanks,

John

metagenomics alignment analysis server • 3.9k views

score 2 · Answer 1 · 2011-10-05

First things first: MEGAN and CARMA rely on BLAST to assign reads to genomes. Then you have the problems raised by "I think there are 100 different kinds of bacteria". Well, do you know that for sure? Couldn't it be 200? And even if you knew there were only 100 kinds of bacteria ... which ones? Do you know that "for sure" beforehand? The next question then would be how accurate your analysis should be, because, just as an example, if you took E. coli MG1655 as the representative for E. coli in your database set, you would not be able to distinguish between different kinds of E. coli (some of them harmless, some of them very pathogenic). Depending on the question you have, this can be quite important.

In the end, I suppose you will have to resort to BLASTing each 454 sequence against all GenBank microbial sequences. I'd recommend you also include the unfinished WGS data in your set. That is: your database for comparison will be the whole of ftp://ftp.ncbi.nih.gov/refseq/release/microbial

And that's quite a lot, a couple of thousand sequenced genomes at least. The last time I did that, 2 years ago, it was around 3 or 4k genomes; by now there will be many more.

Build a BLAST db from the above, then BLAST 10k 454 sequences against it, then extrapolate to the number of sequences you expect per 454 run (a couple of 100k?). This will give you an idea of how computationally intensive things will get.
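The extrapolation described above is plain linear scaling, and a small Python sketch makes the arithmetic explicit. The function name and every number in the usage note are made up for illustration; only your own timed pilot run gives real figures, and the sketch assumes BLAST time grows roughly in proportion to the number of reads.

```python
def extrapolate_blast_hours(pilot_seconds, pilot_reads, reads_per_run,
                            runs=1, parallel_jobs=1):
    """Scale a timed pilot BLAST linearly to full-size runs.

    Assumes runtime is proportional to the number of query reads and
    that parallel jobs divide the work evenly (both are approximations).
    """
    if pilot_reads <= 0 or parallel_jobs <= 0:
        raise ValueError("pilot_reads and parallel_jobs must be positive")
    total_seconds = (pilot_seconds * reads_per_run * runs
                     / (pilot_reads * parallel_jobs))
    return total_seconds / 3600.0


# Hypothetical pilot: 10,000 reads took one hour (3600 s).
per_run = extrapolate_blast_hours(3600, 10_000, 300_000)
project = extrapolate_blast_hours(3600, 10_000, 300_000,
                                  runs=24, parallel_jobs=8)
print(f"one run, one job: {per_run:.1f} h")
print(f"24 runs on 8 parallel jobs: {project:.1f} h")
```

With these invented inputs a single 300k-read run would take about 30 hours on one job, which is the kind of figure you would then hold against the one-sample-per-week schedule from the question.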

Last thing: for parallelising the BLAST task, count on running multiple BLASTs in parallel instead of using the multithreaded option of BLAST. More memory intensive, but faster.
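One way to run several BLASTs side by side, as suggested above, is to split the reads into chunks and start one single-threaded process per chunk. This is a minimal sketch, not a tested pipeline: it assumes the BLAST+ `blastn` binary is on the PATH, the database name you pass in is whatever you built yourself, and the helper names (`split_fasta`, `blast_command`, `run_parallel`) are invented for this example.

```python
import subprocess
from pathlib import Path


def split_fasta(text, n_chunks):
    """Distribute FASTA records round-robin over at most n_chunks strings."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])          # header starts a new record
        elif records and line.strip():
            records[-1].append(line)        # sequence line of current record
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append("\n".join(record))
    # Drop empty chunks when there are fewer records than chunks.
    return ["\n".join(chunk) + "\n" for chunk in chunks if chunk]


def blast_command(query_path, db, out_path):
    """Build a single-threaded blastn call with tabular (-outfmt 6) output."""
    return ["blastn", "-query", query_path, "-db", db,
            "-out", out_path, "-outfmt", "6", "-num_threads", "1"]


def run_parallel(fasta_path, db, n_jobs, workdir):
    """Write one chunk file per job, start all BLASTs, wait for all of them."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    chunks = split_fasta(Path(fasta_path).read_text(), n_jobs)
    procs = []
    for i, chunk in enumerate(chunks):
        query = workdir / f"chunk_{i}.fasta"
        query.write_text(chunk)
        out = workdir / f"chunk_{i}.blast.tsv"
        procs.append(subprocess.Popen(blast_command(str(query), db, str(out))))
    return [proc.wait() for proc in procs]  # exit code of each job
```

A call such as `run_parallel("reads.fasta", "my_microbial_db", 8, "blast_chunks")` (both names are placeholders) would start eight processes. Each process works against the database independently, which is where the extra memory use comes from, so the number of jobs is limited by RAM as well as by cores.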

Good luck.

score 1 · Answer 2 · 2011-09-30

I cannot tell you much about the market price for servers, but it's probably a good idea to start with the hardware requirements:

I don't think you need a really special server for storing the metagenomics data; a typical 2x500GB (or 4x for backup) storage setup should be enough, which gives you enough space to download bacterial reference genomes etc. If you plan to do de novo assembly (Velvet, MIRA, ABySS and co.), enough RAM is what you want. This again depends on the number of reads you get and the assembler you are going to use. I would suggest at least 24GB RAM, which is a really conservative estimate.