Based on various posts at Biostars, I gather that error correction (EC) is important for accurate de Bruijn graph generation and de novo assembly. I am trying to stick with BBTools for all my pre-processing steps.
I've completed the following steps: Force Trim Modulo, Adapter Trimming, Quality Trimming, PhiX174 check & removal, Human contamination check & removal. At this EC step, however, I seem to be running out of memory with bbnorm.sh, and even with khist.sh.
My khist.sh syntax and STDOUT are shown below.
aksrao@farm:~/FUSARIUM/solexa/FilterbyTile$ srun --partition=high --time=5:00:00 khist.sh in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq khist=khist_preEC.txt peaks=peaks_preEC.txt bits=8
srun: job 13889377 queued and waiting for resources
srun: job 13889377 has been allocated resources
java -ea -Xmx49554m -Xms49554m -cp /share/apps/bbmap-36-67/current/ jgi.KmerNormalize bits=32 ecc=f passes=1 keepall dr=f prefilter hist=stdout minprob=0 minqual=0 mindepth=0 minkmers=1 hashes=3 in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq khist=khist_preEC.txt peaks=peaks_preEC.txt bits=8
Executing jgi.KmerNormalize [bits=32, ecc=f, passes=1, keepall, dr=f, prefilter, hist=stdout, minprob=0, minqual=0, mindepth=0, minkmers=1, hashes=3, in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, khist=khist_preEC.txt, peaks=peaks_preEC.txt, bits=8]
Settings:
threads: 24
k: 31
deterministic: false
toss error reads: false
passes: 1
bits per cell: 8
cells: 23.58B
hashes: 3
prefilter bits: 2
prefilter cells: 50.79B
prefilter hashes: 2
base min quality: 0
kmer min prob: 0.0
target depth: 100
min depth: 0
max depth: 100
min good kmers: 1
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false
histogram length: 255
print zero cov: false
slurmstepd: error: Step 13889377.0 exceeded memory limit (11998168 > 5324800), being killed
srun: Exceeded job memory limit
slurmstepd: error: *** STEP 13889377.0 ON c11-92 CANCELLED AT 2017-08-22T17:37:40 ***
srun: error: c11-92: task 0: Killed
srun: Force Terminated job step 13889377.0
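For reference, the Settings block above gives enough to estimate how much memory the k-mer tables alone were sized to use. This is only a rough sketch: it assumes "B" means 10^9 cells and that each cell occupies exactly its stated number of bits, neither of which the log confirms. `table_mb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough table footprint from the khist.sh Settings block.
# Assumptions: "B" = 10^9 cells; each cell takes exactly its stated bits.

# usage: table_mb <cells_in_millions> <bits_per_cell>
table_mb() {
  # millions of cells * bits per cell / 8 = megabytes (10^6 bytes), rounded down
  echo $(( $1 * $2 / 8 ))
}

main_mb=$(table_mb 23580 8)        # "cells: 23.58B", "bits per cell: 8"
prefilter_mb=$(table_mb 50790 2)   # "prefilter cells: 50.79B", "prefilter bits: 2"
total_mb=$(( main_mb + prefilter_mb ))

echo "main table     : ${main_mb} MB"
echo "prefilter table: ${prefilter_mb} MB"
echo "total          : ${total_mb} MB"
```

Under those assumptions the tables come to roughly 36 GB, which fits inside the 49554 MB Java heap but is far above the limit reported in the slurmstepd message.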
My bbnorm.sh syntax and STDOUT are shown below. I thought that by using the recommended flags, explained at http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/, I would never run into memory issues. Perhaps I am misunderstanding those instructions or the syntax? Because I never ran into memory issues with Jellyfish (for example), I suspect my syntax and/or understanding of EC is incomplete or incorrect.
Any advice and suggestions are most welcome. Thank you!
aksrao@farm:~/FUSARIUM/solexa/FilterbyTile$ srun --partition=high --time=5:00:00 --nodes=1 bbnorm.sh in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq ecc=t keepall passes=1 bits=16 prefilter
srun: job 13897223 queued and waiting for resources
srun: job 13897223 has been allocated resources
java -ea -Xmx48284m -Xms48284m -cp /share/apps/bbmap-36-67/current/ jgi.KmerNormalize bits=32 in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq ecc=t keepall passes=1 bits=16 prefilter
Executing jgi.KmerNormalize [bits=32, in1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, in2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont.fq, out1=EthFoc-11_1_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq, out2=EthFoc-11_2_TileFilt_AdTrimR1_phixclean_HumanDecont_EC1.fq, ecc=t, keepall, passes=1, bits=16, prefilter]
Settings:
threads: 24
k: 31
deterministic: true
toss error reads: false
passes: 1
bits per cell: 16
cells: 11.49B
hashes: 3
prefilter bits: 2
prefilter cells: 49.49B
prefilter hashes: 2
base min quality: 5
kmer min prob: 0.5
target depth: 100
min depth: 5
max depth: 100
min good kmers: 15
depth percentile: 54.0
ignore dupe kmers: true
fix spikes: false
Enabled overlap correction (79.2% percent overlap)
slurmstepd: error: Step 13897223.0 exceeded memory limit (16771856 > 5324800), being killed
srun: Exceeded job memory limit
slurmstepd: error: *** STEP 13897223.0 ON c9-75 CANCELLED AT 2017-08-22T22:51:57 ***
srun: error: c9-75: task 0: Killed
srun: Force Terminated job step 13897223.0
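A note on the numbers in the two logs above: in both cases Java was started with a heap of about 48 to 49 GB, while the job step was killed against a much smaller SLURM limit. The sketch below just does that arithmetic. It assumes the slurmstepd figures are in kilobytes (the log does not state the unit), and the `--mem=60G` and `-Xmx50g` values in the suggested command are placeholders, not measured requirements. File arguments are elided with `...`.

```shell
#!/usr/bin/env bash
# Numbers copied from the logs above. Reading the slurmstepd figures as
# kilobytes is an assumption about this SLURM installation.
slurm_limit_kb=5324800    # limit in both "exceeded memory limit" messages
khist_used_kb=11998168    # usage when the khist.sh step was killed
bbnorm_used_kb=16771856   # usage when the bbnorm.sh step was killed
khist_heap_mb=49554       # -Xmx49554m chosen by khist.sh
bbnorm_heap_mb=48284      # -Xmx48284m chosen by bbnorm.sh

# Convert kilobytes to whole megabytes (integer division).
slurm_limit_mb=$(( slurm_limit_kb / 1024 ))
khist_used_mb=$(( khist_used_kb / 1024 ))
bbnorm_used_mb=$(( bbnorm_used_kb / 1024 ))

echo "SLURM step limit: ${slurm_limit_mb} MB"
echo "khist.sh : heap ${khist_heap_mb} MB, killed at ${khist_used_mb} MB"
echo "bbnorm.sh: heap ${bbnorm_heap_mb} MB, killed at ${bbnorm_used_mb} MB"

# Hypothetical fix: request more memory from SLURM than the JVM heap needs.
# 60G and 50g are placeholder values only.
request_mem="60G"
java_heap="50g"
suggested_cmd="srun --partition=high --time=5:00:00 --mem=${request_mem} bbnorm.sh -Xmx${java_heap} in1=... in2=... out1=... out2=... ecc=t keepall passes=1 bits=16 prefilter"
echo "$suggested_cmd"
```

If the kilobyte reading is right, the limit was 5200 MB, so the step was bound to be killed as soon as the tables grew past it, regardless of the bbnorm.sh flags used.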
Per your advice, using tadpole.sh, I bumped up the requested RAM incrementally, and it took > 50GB for the command to execute. If you could please look at the info I've copied/pasted and linked to, and comment on anything out of the ordinary, that would be very helpful. Thanks, Brian!
Syntax:
Details in STDERR file = http://textuploader.com/d6j5i
And the STDOUT file = http://textuploader.com/d6j5b
INPUT FILE LISTINGS
OUTPUT FILE LISTINGS
Hi Anand,
Looks fine to me!
Excellent! Thanks a LOT for all your help Brian and also to genomax. Cheers!
Brian: For tadpole.sh, under command line help menu, it says:
Further down, under
I am not sure I see a flag named "countmin". Could you please clarify? Thanks!
In my syntax in previous reply, I used
Is that even correct? It seems like it should be an integer value.
Based on your post at SEQanswers (http://seqanswers.com/forums/showthread.php?t=61445), it seems I could use "minprob=0.8". Is that what you refer to by "countmin sketch"? I am a little confused here.
Also, I wonder if it is simpler to just not use these two flags, prefilter and minprob, and let the software do its thing... Your thoughts / advice?
"prefilter=t" works, and becomes "prefilter=2". You don't need to use prefilter unless you run out of memory (basically, for large genomes and large datasets). A Count-min sketch is a type of Bloom filter. It's unrelated to the "mincount" or "minprob" flag. "minprob" ignores individual kmers, as encountered in a read, with a probability of correctness below a certain value. "prefilter" first counts everything, then ignores kmers with a probable count below a certain value. "mincount" is different; it's mainly for assembly (don't assemble kmers with count less than X).
Just ignore the advanced flags for now; the defaults are fine, and you only need to adjust K as needed based on read length (typically 1/3rd of read length is good for error correction). You can adjust minprob and prefilter if you run out of memory.
If, after quality trimming, read length is not constant but a distribution of lengths, let's say 72-150bp, would it be k < 72/3 or k < 150/3?
I am thinking k < min/3, yes? So in my example here, I could go for k=23, right?
Or would it be k < max/3, in this example k < 50? I think not, but I want to be sure... Thx!
Interestingly, my runs completed without having specified a k value... strange! Right?
No, 23 is too short, and most of your reads should still be 150bp long. Longer kmers are generally better as long as you have sufficient depth. I'd recommend 50.
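The rule of thumb in this thread (k of about one third of the read length, taken from the length most reads have rather than from the shortest read) can be sketched as below. Using the most common read length is my reading of "most of your reads should still be 150bp long", not a stated BBTools rule, and `suggest_k` is a hypothetical helper name. It assumes a plain four-line-per-record FASTQ on standard input.

```shell
#!/usr/bin/env bash
# suggest_k: read a 4-line-per-record FASTQ on stdin, find the most common
# read length, and print one third of it (rounded down) as a starting k.
suggest_k() {
  awk 'NR % 4 == 2 { n[length($0)]++ }          # sequence lines only
       END {
         best = 0; bestn = 0
         for (len in n) {
           # pick the most frequent length; on ties prefer the longer one
           if (n[len] > bestn || (n[len] == bestn && len + 0 > best)) {
             best = len + 0; bestn = n[len]
           }
         }
         print int(best / 3)
       }'
}
```

It could be run as `suggest_k < reads.fq` (file name is a placeholder) and the result passed to tadpole.sh as `k=`. For a file where most reads are 150bp it prints 50, matching the recommendation above.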