How to increase the speed of subread for DNA whole genome sequence
1
0
Entering edit mode
8.2 years ago
tarek.mohamed ▴ 370

Hi All,

I am trying to align one DNA whole genome sequence file using subread, the sequence reads are paired end, 150 bp reads. I used 'thread=64' and 216 core (9 nodes, 24 core each). but after two days it processed only 15% and I had to kill the job.

I am using the university quest, so basically I can get as many nodes/cores as I want. Is there any way by which I can accelerate the alignment.

>align(index="my_index",readfile1=c("141023_H0E72_113-JG4_L003_R1.fastq.gz"),readfile2=c("141023_H0E72_113-JG4_L003_R2.fastq.gz"), output_file=c("subread_jg4.sam"),nthreads=64,unique=TRUE,type=1)

  ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.22.3

//========================== subread-align setting===========================\\
||                                                                            ||
|| Function      : Read alignment (DNA-Seq)                                   ||
|| Input file 1  : 141023_H0E72_113-JG4_L003_R1.fastq.gz                      ||
|| Input file 2  : 141023_H0E72_113-JG4_L003_R2.fastq.gz                      ||
|| Output file   : subread_jg4.sam (BAM)                                      ||
|| Index name    : my_index                                                   ||
||                                                                            ||
||                       Threads : 40                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : yes                                        ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ==============

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

Thanks

Tarek

subread speed DNA sequence • 2.8k views
ADD COMMENT
0
Entering edit mode
1) How many reads do you have (or, how big are your input files in GB)?
2) What reference are you aligning them to?
ADD REPLY
0
Entering edit mode

Hi Brian,

Thanks for the reply,

I have two files for paired end sequence, each is 35 GB. I am aligning to the human ref genome NCBI.GRch38.

Tarek

ADD REPLY
0
Entering edit mode

OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.

You could try BBMap (which I wrote) like this:

bbmap.sh -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100

For your data I estimate it would finish in around 16 hours on one 24-core node . If you want to run on multiple nodes, you can first partition the data with partition.sh (included with BBMap), for example:

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8

...then run each pair on a different node and merge the bam files.

ADD REPLY
0
Entering edit mode

Hello tarek.mohamed!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/87097/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode

Hi wouterDeCoster

I did not know that I am not allowed to post same post in 2 different site.

Tarek

ADD REPLY
0
Entering edit mode

It is not that you are not allowed to cross-post but people who frequently contribute answers on Biostars frown upon this practice since it may unnecessarily lead to duplication of effort on part of those who provide answers.

ADD REPLY
0
Entering edit mode

I see.

Thanks

ADD REPLY
0
Entering edit mode
6.5 years ago
andysw90 ▴ 20

Outputting to bam rather than sam might help -- smaller files = less time writing.

ADD COMMENT
0
Entering edit mode

Doesn't writing BAM takes longer due to the compression time?

ADD REPLY

Login before adding your answer.

Traffic: 1815 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6