Greetings,
I am using several steps of the assemblyPipeline.sh found under bbmap-38-60/pipelines/ for the purpose of pre-processing my SE Illumina 100nt raw reads, before mapping to a reference genome.
To insure my pre-processing steps are reproducible, I ran 1 library 4 times, through the exact same pre-processing steps, as shown below. After each step, however, starting with clumpify.sh, the line counts are slightly different for these replicated results, as shown below.
So my question to forum members is whether you've seen this behavior, and if yes, whether it's normal. Is there a reason why this is happening - for example, random seeding, or something else?
If not, do you have ideas why this might be happening on my SLURM based HPCC, and how to prevent this minor variability? Results vary between compute nodes, and back-to-back runs on the same compute node as well !
Thanks!
Step 1. Rename SRR data files for BBmap compatibility
BBMap_38.61/bbmap/rename.sh in=TEST.fastq out=TEST_rename.fastq fixsra=t -Xmx20g
71369268 TEST/TEST_rename.fastq
71369268 TEST_repeat/TEST_rename.fastq
71369268 TEST_repeat2/TEST_rename.fastq
71369268 TEST_repeat3/TEST_rename.fastq
==========================================================================================
Step 2. Remove optical duplicates
clumpify.sh -Xms20g in=TEST_rename.fastq out=TEST_rename_clumped.fq.gz dedupe optical
4640107 TEST/TEST_rename_clumped.fq.gz
4640730 TEST_repeat/TEST_rename_clumped.fq.gz
4639478 TEST_repeat2/TEST_rename_clumped.fq.gz
4638977 TEST_repeat3/TEST_rename_clumped.fq.gz
==========================================================================================
Step 3. Remove poor tiles in Illumina reads
filterbytile.sh -Xms20g in=TEST_rename_clumped.fq.gz out=TEST_rename_clumped_FbT.fq.gz
4319815 TEST/TEST_rename_clumped_FbT.fq.gz
4321528 TEST_repeat/TEST_rename_clumped_FbT.fq.gz
4317394 TEST_repeat2/TEST_rename_clumped_FbT.fq.gz
4320158 TEST_repeat3/TEST_rename_clumped_FbT.fq.gz
==========================================================================================
Step 4. Adapter and quality trimming
bbduk.sh in=TEST_rename_clumped_FbT.fq.gz out=TEST_rename_clumped_FbT_bbdukTrim.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered
4318431 TEST/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
4322073 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
4320215 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
4316559 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim.fq.gz
==========================================================================================
Step 5. Filter out artifacts and PhiX spike-ins
bbduk.sh in=TEST_rename_clumped_FbT_bbdukTrim.fq.gz out=TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz k=31 ref=artifacts,phix ordered cardinality
4319563 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
4317725 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
4325693 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
4316804 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz
==========================================================================================
Step 6. Filter out rRNA
bbmap.sh ref=MtrunA17r5.0-ANR-EGN-r1.6.rrna.fasta.shIDscleaned-up in=TEST_rename_clumped_FbT_bbdukTrim_Fltrd.fq.gz outu=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz outm=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz nodisk
4303501 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
4308744 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
4310331 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
4308370 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz
29032 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
29457 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
29577 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
29395 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAmatched_shIDScleaned.fq.gz
==========================================================================================
Step 7. Filter out human and common bacterial contaminants
cat hg19_masked.fa.gz fusedEPmasked2.fa.gz > hg19_mask
bbmap.sh ref=hg19_masked_fusedEPmasked2.fa.gz in=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_shIDScleaned.fq.gz outu=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz outm=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz nodisk
4313738 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
4324692 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
4322861 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
4319180 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz
1421 TEST/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
1317 TEST_repeat/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
1282 TEST_repeat2/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
1357 TEST_repeat3/TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_HGBact.fq.gz
==========================================================================================
Step 8. Split mapping to plant (PacBio v5) versus bacterial (strain 1021) genomes
bbsplit.sh in=TEST_rename_clumped_FbT_bbdukTrim_Fltrd_rRNAfltrd_NonHGBact.fq.gz ref=MtrunA17r5.0-20161119-ANR.fasta,Sm1021_Chr_pSymAB.fasta basename=TEST_BBsplit_%.fq outu=TEST_MtA17_Sm1021_UNmapped.fq
MAP_EUK=TEST*/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
MAP_PRO=TEST*/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
UNMAPPED=TEST*/TEST_MtA17_Sm1021_UNmapped.fq
61328164 TEST/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
61332572 TEST_repeat/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
61328536 TEST_repeat2/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
61311576 TEST_repeat3/TEST_BBsplit_MtrunA17r5.0-20161119-ANR.fq
224 TEST/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
216 TEST_repeat/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
224 TEST_repeat2/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
216 TEST_repeat3/TEST_BBsplit_Sm1021_Chr_pSymAB.fq
7563888 TEST/TEST_MtA17_Sm1021_UNmapped.fq
7563940 TEST_repeat/TEST_MtA17_Sm1021_UNmapped.fq
7563708 TEST_repeat2/TEST_MtA17_Sm1021_UNmapped.fq
7560668 TEST_repeat3/TEST_MtA17_Sm1021_UNmapped.fq
==========================================================================================
Hi genomax,
Thanks for your super quick post.
Yes, I am using as many threads as are available, which is usually 8 or 12 cores, each with 2 cpus (though my terminologies may be muddled, it is certainly > 1 core)
For my step 1,
clumpify.sh
run I will tryseed = 1
, thanks for that suggestion.For specifying the value of average insert size, I looked up how to specify it, and found this seqanswers link.
Since my reads are SE, not PE, and for RNA transcripts, I suppose, of the 3 options listed (mapping Vs overlap vs. assembly) , I only have 1) Via Mapping as my viable option, correct?
This Mapping strategy to evaluate average insert size seems a little circular in logic - In that I am executing my pre-processing steps PRIOR to the
bbmap.sh
mapping step, but here, I'd need to map usingbbmap.sh
but before pre-processing, in order to specify average insert size to later runbbmap.sh
in deterministic mode. Is this a problem, or is average insert size a best approximation that's good enough, and I do not have to run these steps iteratively for an exact average insert size value?In the deterministic flag explanation, it says
So my guess is that bbmap did run deterministic, since my data is SE not PE reads, but I look forward to your comments. Thanks again!
You can set the
seed
to any positive integer value. It does not need to be 1. But it needs to be identical for all repeat runs.Since you have single-end data there is no way to estimate/determine insert size. Sorry I missed that part. It also does not matter with mapping since you have single-end data. So in your case alignments should have been deterministic.
Using more than one core sometimes leads to slightly different results (though overall conclusion should not change). I will ping @Brian to see if he can add his final word.
Genomax: Here is my 3 part update
PART 1 No problem with
rename.sh
step - gave matching results even previouslyPART 2 Your suggestion for seed=1 allowed me toreproduce my results for clumpify.sh, using the syntax below:
PART 3 However, the next step - FILTER BY TILE using
filterbytile.sh
, gives results that differs between runsIs there a way to enforce reproducibility of
filterbytile.sh
runs?I have never found much use for
filterbytile.sh
. If you have some bad data then an aligner is generally going to be able to take care of those reads. As to why the results differ slightly I can't say. It must have something to do with algorithmic implementation for that tool. You can try dropping that step out.I agree, the aligner should be able to take care of any bad reads in bad tiles.
Pursuing my non-reproducibility of exact matched outputs for replicated runs, specifically for
filterbytile.sh
, I just found that using the following flag allows me to obtain reproducible results with: usekmers=fThis syntax idea arose from an old seqanswers post by BBMap author Brian Bushnell at this link
However, in the help menu for filterbytile.sh, version 38-60, it says -
Therefore, I wonder under what circumstances using only average quality (q), and probability of being error-free (e), and excluding kmer uniqueness (u) would be OK versus not OK.... Please share your thoughts / experience. Thanks!