Have any of you worked with the excellent BBMap Toolbox? I have, and I love it, but I am not at all sure why this is happening. I am trying to remove human contaminants from genomic samples of a large conifer species using the method described by Brian Bushnell (author) here:
http://seqanswers.com/forums/showthread.php?t=42552
Why even do this? Well, our sequencing center is quite busy, and definitely used primarily for human data. I started to run this on the files in our project and checked in on "human.fq" for those files (the mapped reads, that are thus contaminants) and discovered that it had indeed wrote high certainty matches against the human genome. Cool, so I go to apply this to all of the files (rather than the three used as experiment), and I ran into a persistent problem.
So, I run the wrapper script that calls the mapping command on each of ~400 files, only to discover that each one fails after ~40 seconds with a characteristic, vague stack trace. So, here are the commands...
(1) Indexing the Masked Human genome (hg19, UCSC, as recommended), exactly one time, before launching the scripts:
bbmap.sh ref=/path/to/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz path=/path/to/databases/Hosa/hg19_masked/
(2) Run the script, dispatching jobs -- which creates this script for each set of paired end reads:
echo '#!/usr/bin/env bash
bbmap.sh -Xms50G -Xmx54G minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/databases/Hosa/hg19_masked/ in=/path/to/clean/DDP2_S2_H2VF3_L003_R#_001.fastq outu=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.fastq outm=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.human_c.fastq statsfile=/path/to/clean_decontam/DDP2_S2_H2VF3_L003_R#_001.stats_c.txt usejni=t threads=11' | sbatch --job-name=BBMap_DDP2_S2_H2VF3_L003_Runs_001.fastq --time=0 --mem=64G --partition=bigmemm --ntasks=1 --cpus-per-task=12 --mail-type=END,FAIL --mail-user=[me]
I have indeed compiled jni, and didn't receive any warnings while doing so. This is the exact formatting -- so if whitespace is an issue, then I can replace all instances of whitespace characters (except the newline between bash header and command) with spaces. This doesn't seem like it would be a problem though. Sbatch is the dispatcher for slurm, which is our manager. I have opted to use 85% of 'physical memory' when setting the java heap size, where the physical memory is what is available to the job thanks to slurm. I am using one less thread than total to leave a little room I guess, since I don't really know the worker threads dispatched or the subprocesses under their control.
Okay, so I tried to change up the parameters a little, specifying the reference fasta (using ref=) and letting it write a new one for each file, and that didn't work. I tried specifying the reference fasta (using ref=) and letting it index the reference in memory (using nodisk), and that didn't work. Both of these specifications failed in the same way, and in the same way as the original method (as in the script above):
Here is one such failure, using the ref, nodisk way
java -Djava.library.path=/install/path/bbmap/jni/ -ea -Xmx54G -cp /install/path/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xms50G -Xmx54G minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 ref=/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz nodisk in=/path/to/clean/DDP2_S2_H2VC3_L007_R#_001.fastq outu=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.fastq outm=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.human_c.fastq statsfile=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.stats_c.txt usejni=t threads=11
Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, build=1, overwrite=true, fastareadlen=500, -Xms50G, -Xmx54G, minid=0.95, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, ref=/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz, nodisk, in=/path/to/clean/DDP2_S2_H2VC3_L007_R#_001.fastq, outu=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.fastq, outm=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.human_c.fastq, statsfile=/path/to/clean_decontam/DDP2_S2_H2VC3_L007_R#_001.stats_c.txt, usejni=t, threads=11]
BBMap version 35.82
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.650
Set threads to 11
Retaining first best site only for ambiguous mappings.
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.908
Executing dna.FastaToChromArrays2 [/path/to/databases/Hosa/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=50, midpad=150, startpad=8000, stoppad=8000, nodisk=true]
Set genScaffoldInfo=true
Exception in thread "main" java.lang.NullPointerException
at align2.AbstractMapper.checkFiles(AbstractMapper.java:709)
at align2.AbstractMapper.<init>(AbstractMapper.java:54)
at align2.BBMap.<init>(BBMap.java:41)
at align2.BBMap.main(BBMap.java:29)
So, what's going on? What am I doing wrong here? Thanks
Brian Bushnell, can you help, please? Thanks in advance
Ahh, looks like there is a bug there were it is doing the
#
symbol replacement. Thanks for finding that! I'll fix it in the next release.For now, please use 1 and 2 instead of the
#
symbol for the outm and outu (e.g.,outm1=matched1.fq out2=matched2.fq
). Sorry about that.-Brian
You're awesome thanks!