Entering edit mode
8.3 years ago
tarek.mohamed
▴
370
Hi All,
I am trying to align one DNA whole genome sequence file using subread, the sequence reads are paired end, 150 bp reads. I used 'thread=64' and 216 core (9 nodes, 24 core each). but after two days it processed only 15% and I had to kill the job.
I am using the university quest, so basically I can get as many nodes/cores as I want. Is there any way by which I can accelerate the alignment.
>align(index="my_index",readfile1=c("141023_H0E72_113-JG4_L003_R1.fastq.gz"),readfile2=c("141023_H0E72_113-JG4_L003_R2.fastq.gz"), output_file=c("subread_jg4.sam"),nthreads=64,unique=TRUE,type=1)
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 1.22.3
//========================== subread-align setting===========================\\
|| ||
|| Function : Read alignment (DNA-Seq) ||
|| Input file 1 : 141023_H0E72_113-JG4_L003_R1.fastq.gz ||
|| Input file 2 : 141023_H0E72_113-JG4_L003_R2.fastq.gz ||
|| Output file : subread_jg4.sam (BAM) ||
|| Index name : my_index ||
|| ||
|| Threads : 40 ||
|| Phred offset : 33 ||
|| # of extracted subreads : 10 ||
|| Min read1 vote : 3 ||
|| Min read2 vote : 1 ||
|| Max fragment size : 600 ||
|| Min fragment size : 50 ||
|| Maximum allowed mismatches : 3 ||
|| Maximum allowed indel bases : 5 ||
|| # of best alignments reported : 1 ||
|| Unique mapping : yes ||
|| ||
\\===================== http://subread.sourceforge.net/ ==============
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_1.22.3
Thanks
Tarek
Hi Brian,
Thanks for the reply,
I have two files for paired end sequence, each is 35 GB. I am aligning to the human ref genome NCBI.GRch38.
Tarek
OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.
You could try BBMap (which I wrote) like this:
bbmap.sh -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100
For your data I estimate it would finish in around 16 hours on one 24-core node . If you want to run on multiple nodes, you can first partition the data with partition.sh (included with BBMap), for example:
partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8
...then run each pair on a different node and merge the bam files.
Hello tarek.mohamed!
It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/87097/
This is typically not recommended as it runs the risk of annoying people in both communities.
Hi wouterDeCoster
I did not know that I am not allowed to post same post in 2 different site.
Tarek
It is not that you are not allowed to cross-post but people who frequently contribute answers on Biostars frown upon this practice since it may unnecessarily lead to duplication of effort on part of those who provide answers.
I see.
Thanks