debitboro · 8.5 years ago
Hi Everyone,
I have submitted a mapping script to a cluster with 896 cores (32 nodes x 16 cores + 32 nodes x 12 cores), core frequency = 2.66 GHz. For my job I requested 100 GB of RAM. I want to map paired-end RNA-seq reads (two files, R1 and R2, each containing 40M reads) to the hg_GRCh38 assembly from Ensembl using TopHat2. I launched the script five days ago and it is still running. Is this normal?
That's a rather long time for 40M reads. Check whether any of the output files are still being updated, and look in the log directory to see what it's actually doing now.
Hi Devon,
This is the content of the log file:
I suspect it's using a single thread. That'd explain why each step is taking forever.
I used the following command with 24 threads:
But I set #SBATCH --ntasks=1 (since I use SLURM to submit my jobs).
Right, so you told TopHat2 to use more threads than the cores you requested, and then told SLURM that the job only uses a single thread. You'd need #SBATCH -c 24 to match, though that won't work since you don't have nodes with that many cores. It's likely that either SLURM is only allowing a single thread to be used, or something else is also running on that node and consuming most of the resources.
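If you want to confirm what SLURM actually allocated and whether the node is shared, something like the following should help (the job ID and node name are placeholders, not taken from your post):

    # How many CPUs were actually allocated to the job
    scontrol show job 123456 | grep -i -E 'NumCPUs|CPUs/Task'
    # What else is currently running on that node
    squeue -w node032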
Did you add #SBATCH -N 1 (from what I can see on the web) to keep all threads on a single physical server for SLURM? Since your largest nodes have 16 cores, I would use at most 16 threads, assuming your cluster lets you reserve a whole node. Having these threads spread across physical nodes is going to lead to strangeness like this; in fact, having the threads split across nodes won't even work :) 2x40M reads should not take more than a day with up to 16 cores. A rough sketch of a matching submission script is below.
You should use STAR instead; you would be surprised, a job like this can finish in about 25 minutes.
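For what it's worth, a minimal STAR run looks roughly like this (the index directory, FASTA/GTF names, and output prefix are assumptions; the index only needs to be built once):

    # Build the genome index once (sjdbOverhang = read length - 1)
    STAR --runMode genomeGenerate --runThreadN 16 \
         --genomeDir GRCh38_STAR_index \
         --genomeFastaFiles GRCh38.primary_assembly.fa \
         --sjdbGTFfile GRCh38.gtf --sjdbOverhang 100

    # Align the paired-end reads
    STAR --runThreadN 16 --genomeDir GRCh38_STAR_index \
         --readFilesIn R1.fastq.gz R2.fastq.gz --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample_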