Question

Hardware requirements for bowtie2/STAR RNA-seq alignment

0

4.6 years ago

MatStat ▴ 160

Hi all,

I'm trying to understand the hardware requirements for aligning bulk RNA-seq data with bowtie2/STAR, in terms of:

Processor and cores

RAM

SSD/hard drive space

Computing clusters

Server

The data:

Seq method: Illumina HiSeq High Output V4

Single-end (i.e., single-read)

100 human tissue samples

Each sample yielded 21 million reads.

All the best.

RNA-Seq Illumina bowtie bowtie2 STAR • 9.4k views

score 0 · Answer 1 · 2021-01-06

0

4.6 years ago

GenoMax 152k

This question has been asked many times before: the thread "Hardware requirement for RNA-seq" has several links to older threads. Requirements have not changed much over the years. Human and mouse genomes are similar in size and thus have similar hardware requirements.

Prioritize (i.e., get the most you can of) memory first, then CPU, then storage, in that order.

0

Hi GenoMax,

Thank you for the prompt reply. I've read the answers in the link (and sub-links) you've added. But still didn't get an idea of a minimal to optimal settings using cluster computing servers for example.

Just as an example, I tried to run one FASTQ sample on my Mac (i5, 16 GB RAM, 500 GB SSD); it was extremely strenuous and took more than 20 hours.

Thanks.

Reply • 4.6 years ago by MatStat ▴ 160

1

There is no way around a lack of memory or compute power. With most aligners you will need 30+ GB of free RAM for human/mouse genomes, and that requirement starts going up once you use more than a few threads (say 6-8). Simply throwing tons of cores at the problem does not solve it either, since the efficiency of the software becomes important at that stage. Unless you are working with server hardware, I/O on a local machine (even with SSDs) is going to limit the speed at which data can be aligned. It is not uncommon for it to take a few hours to align 20-50M reads.
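As a rough illustration of that rule of thumb, the sketch below checks whether a host clears the ~30 GB bar and caps the thread count at 8. The function names, the 30 GB threshold, and the cap of 8 threads are illustrative defaults lifted from the numbers in this reply, not values from any aligner's documentation. It also looks at total physical RAM, which is only an upper bound on the free RAM that actually matters.

```python
import os

def check_alignment_host(ram_gb, cores, min_ram_gb=30, max_threads=8):
    """Return (ok, threads).

    ok      -- True if ram_gb meets the assumed rule-of-thumb minimum.
    threads -- core count capped at max_threads (assumed cap, see lead-in).
    """
    ok = ram_gb >= min_ram_gb
    threads = max(1, min(cores, max_threads))
    return ok, threads

def detect_host():
    """Best-effort total physical RAM (GB) and core count; tested idea on Linux."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return ram_bytes / 1024**3, os.cpu_count() or 1
```

Calling `check_alignment_host(*detect_host())` reports on the current machine. For a 16 GB laptop (assuming 4 cores), `check_alignment_host(16, 4)` returns `(False, 4)`, which is consistent with the 20+ hour run described above.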

"But still didn't get an idea of a minimal to optimal settings using cluster computing servers for example."

Any good two-socket server (not a desktop) will provide anywhere from 8 to 64+ cores, depending on the CPUs chosen. You would want at least 128 GB of RAM to have comfortable headroom for other tasks. Storage is really up to you. Ideally you will want performant network block storage mounted on this server via 10 Gb Ethernet, InfiniBand, etc., to provide the fastest possible read/write speeds. If that is not available, you will need to resort to local SSDs. Keep in mind that SSDs wear out and have a finite life if continuously written to.
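To turn those figures into a back-of-the-envelope estimate for the 100-sample dataset in the question, here is a minimal sketch. The function `plan_batch` and all of its parameters are hypothetical. It assumes each alignment job needs its own ~30 GB of RAM and a fixed number of threads, takes the per-sample time as a guess within the "few hours" range mentioned above, and ignores I/O contention, so treat the result as an optimistic lower bound.

```python
import math

def plan_batch(n_samples, hours_per_sample, total_ram_gb, ram_per_job_gb,
               cores, threads_per_job, reserve_ram_gb=0):
    """Return (concurrent_jobs, estimated_wall_clock_hours).

    Concurrency is limited by whichever runs out first: RAM or cores.
    Samples are processed in 'waves' of concurrent_jobs at a time.
    """
    by_ram = (total_ram_gb - reserve_ram_gb) // ram_per_job_gb
    by_cpu = cores // threads_per_job
    concurrent = int(min(by_ram, by_cpu))
    if concurrent < 1:
        raise ValueError("host cannot run even one job with these settings")
    waves = math.ceil(n_samples / concurrent)
    return concurrent, waves * hours_per_sample

# Assumed example: 128 GB RAM, 32 cores, 8 threads and 30 GB per job,
# 2 hours per sample (a placeholder, not a measured value).
print(plan_batch(100, 2, 128, 30, 32, 8))
```

With these assumed inputs the server runs 4 jobs at a time and needs 25 waves, so about 50 hours for all 100 samples. Passing `total_ram_gb=16` raises a `ValueError`, since not even one 30 GB job fits.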

Reply • 4.6 years ago by GenoMax 152k

0

OK, great, thanks a lot for the answer.

Reply • 4.6 years ago by MatStat ▴ 160

0

Do you really need to use alignment for bulk RNA-seq? Why not use pseudoalignment? It has lower memory and compute requirements.

Reply • 4.6 years ago by dsull ★ 7.6k

1

Hi dsull, I am reproducing results by following a workflow protocol from GitHub, which means I need to do what they did. In addition, I assume they don't use pseudoalignment because the alignment needs to be sensitive enough to yield the unmapped reads, which can then be used further.

Best

Reply • 4.6 years ago by MatStat ▴ 160
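For readers who also need to keep unmapped reads from single-end data, here is a minimal sketch of how the two aligners named in the question can be asked to write them out. It only builds the command lines and does not run them. The helper names and every path are hypothetical placeholders, and this is not the command line of the GitHub workflow mentioned above, which the thread does not show. The flags used (`--outReadsUnmapped Fastx` for STAR, `--un-gz` for bowtie2) are the standard options for this purpose.

```python
import shlex

def star_cmd(genome_dir, fastq, prefix, threads=8):
    """STAR command for single-end reads that also writes unmapped reads."""
    cmd = ["STAR", "--runThreadN", str(threads),
           "--genomeDir", genome_dir,
           "--readFilesIn", fastq,
           "--outFileNamePrefix", prefix,
           "--outSAMtype", "BAM", "Unsorted",
           "--outReadsUnmapped", "Fastx"]  # unmapped reads go to a separate file
    if fastq.endswith(".gz"):
        # STAR needs a decompression command for gzipped input
        cmd += ["--readFilesCommand", "zcat"]
    return cmd

def bowtie2_cmd(index_prefix, fastq, sam_out, unmapped_out, threads=8):
    """bowtie2 command for unpaired reads; --un-gz keeps reads that fail to align."""
    return ["bowtie2", "-p", str(threads),
            "-x", index_prefix,
            "-U", fastq,
            "-S", sam_out,
            "--un-gz", unmapped_out]

# Placeholder paths, for illustration only.
print(shlex.join(star_cmd("star_index", "sample1.fastq.gz", "sample1_")))
print(shlex.join(bowtie2_cmd("bt2_index/genome", "sample1.fastq.gz",
                             "sample1.sam", "sample1.unmapped.fastq.gz")))
```

Building the commands as argument lists keeps them easy to pass to `subprocess.run` once the index and file locations are known.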