I would like some approximate estimates of the CPU, RAM, etc. that I would need in order to process Illumina reads. I suspect the biggest difference compared to 454 is the sheer volume of sequence generated per run (I think ~55 Gb) and the subsequent assembly of all those reads...
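For a rough sense of scale, here is a back-of-envelope sketch. All of the numbers in it are assumptions on my part (a ~55 Gb yield per run, 100 bp reads, a common ~2.5 bytes-per-base rule of thumb for uncompressed FASTQ, and a very crude per-k-mer memory guess for a de Bruijn graph assembler), so treat it as an order-of-magnitude illustration rather than a real requirement:

# Rough back-of-envelope for one Illumina run -- all figures below are
# illustrative assumptions, not vendor specifications.

gigabases_per_run = 55          # assumed total yield in gigabases
read_length_bp = 100            # assumed read length

total_bases = gigabases_per_run * 1e9
n_reads = total_bases / read_length_bp

# FASTQ stores one byte per base and one byte per quality score, plus
# header/separator lines; ~2.5 bytes per base is a common rule of thumb.
fastq_bytes = total_bases * 2.5
print(f"Reads per run:        {n_reads:,.0f}")
print(f"Uncompressed FASTQ:   {fastq_bytes / 1e9:,.0f} GB (roughly halves with gzip)")

# De novo assembly RAM is much harder to pin down; de Bruijn graph assemblers
# are often quoted at a few bytes per distinct k-mer, and diverse metagenomes
# push this up. Treat this as an order-of-magnitude guess only.
bytes_per_kmer = 8              # assumed, varies widely by assembler
distinct_kmer_fraction = 0.5    # assumed fraction of bases yielding distinct k-mers
ram_bytes = total_bases * distinct_kmer_fraction * bytes_per_kmer
print(f"Assembly RAM (guess): {ram_bytes / 1e9:,.0f} GB, order of magnitude")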
Can you provide more information, please? What throughput (number of lanes over time)? What genome(s), e.g. bacterial vs. vertebrate vs. metagenome? What type of experiment, e.g. RNA-seq vs. ChIP-seq vs. whole genome? Do you have a reference genome? Are you getting raw data, or just reads?
We want to do a metagenomic study of microbial populations in environmental samples. As a first stage I was thinking about de novo assembly, and if that doesn't work very well, maybe mapping to reference genome(s). And to tell you the truth, I don't yet know what the difference between raw data and reads is! What do they (the sequencing companies) usually give you back?