Entering edit mode
8.8 years ago
cacampbell
▴
60
Hi,
Not sure what to make of the following situation. I have been mapping paired-end reads of one Conifer to a reference of a different, but closely related Conifer. Not ideal, but all that we have. Mapping proceeds as normal, exits normally, but the output files don't seem to be useable downstream... and the output from BBMap is ... malformed?
Example Script:
bbmap.sh in1=[read]_1.fq.gz in2=[mate]_2.fq.gz outm=[out_mapped]_pe.sam outu=[out_unmapped].unmapped.sam ref=[reference].fa nodisk covstats=[covstats].txt covhist=[covhist]_covhist.txt threads=102 slow k=12 -Xmx300G basecov=[basecov]_basecov.txt usejni=t bincov=[bincov]_bincov.txt' | sbatch --job-name=[job_name] --time=0 --mem=300G --partition=bigmemm --ntasks=1 --cpus-per-task=26 --mail-user=[me] --mail-type=END,FAIL
What do I mean by "Malformed"?
java -Djava.library.path=/home/cacampbe/bbmap/jni/ -ea -Xmx300G -cp /home/cacampbe/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in1=[basename]_1.fq.gz in2=[basename]_2.fq.gz outm=[basename]_pe.sam outu=[basename]_pe.unmapped.sam ref=[reference].fa nodisk covstats=[basename]_2.txt covhist=[basename]_covhist.txt threads=102 slow k=12 -Xmx300G basecov=[basename]_basecov.txt usejni=t bincov=[basename]_bincov.txt
Executing align2.BBMap [tipsearch=150, minhits=1, minratio=0.45, build=1, overwrite=true, fastareadlen=500, in1=[basename]_1.fq.gz, in2=[basename]_2.fq.gz, outm=[basename]_pe.sam, outu=[basename]_pe.unmapped.sam, ref=[reference].fa, nodisk, covstats=[basename]_2.txt, covhist=[basename]_covhist.txt, threads=102, k=12, -Xmx300G, basecov=[basename]_basecov.txt, usejni=t, bincov=[basename]_bincov.txt]
BBMap version 35.82
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.450
Set threads to 102
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [reference].fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]
Set genScaffoldInfo=true
Set genome to 1
Loaded Reference: 120.072 seconds.
Loading index for chunk 1-97, build 1
Indexing threads started for block 8-11
Indexing threads started for block 12-15
Indexing threads started for block 44-47
Indexing threads started for block 4-7
Indexing threads started for block 56-59
Indexing threads started for block 92-95
Indexing threads started for block 32-35
Indexing threads started for block 0-3
Indexing threads started for block 68-71
Indexing threads started for block 72-75
Indexing threads started for block 40-43
Indexing threads started for block 28-31
Indexing threads started for block 36-39
Indexing threads started for block 16-19
Indexing threads started for block 24-27
Indexing threads started for block 76-79
Indexing threads started for block 96-97
Indexing threads started for block 84-87
Indexing threads started for block 64-67
Indexing threads started for block 20-23
Indexing threads started for block 80-83
Indexing threads started for block 88-91
Indexing threads started for block 48-51
Indexing threads started for block 52-55
Indexing threads started for block 60-63
Indexing threads finished for block 96-97
Indexing threads finished for block 80-83
Indexing threads finished for block 72-75
Indexing threads finished for block 68-71
Indexing threads finished for block 92-95
Indexing threads finished for block 60-63
Indexing threads finished for block 84-87
Indexing threads finished for block 56-59
Indexing threads finished for block 44-47
Indexing threads finished for block 4-7
Indexing threads finished for block 48-51
Indexing threads finished for block 20-23
Indexing threads finished for block 12-15
Indexing threads finished for block 16-19
Indexing threads finished for block 28-31
Indexing threads finished for block 52-55
Indexing threads finished for block 32-35
Indexing threads finished for block 40-43
Indexing threads finished for block 8-11
Indexing threads finished for block 0-3
Indexing threads finished for block 24-27
Indexing threads finished for block 88-91
Indexing threads finished for block 36-39
Indexing threads finished for block 64-67
Indexing threads finished for block 76-79
Generated Index: 1347.827 seconds.
Executing jgi.CoveragePileup [covhist=[basename]_covhist.txt, covstats=[basename]_2.txt, basecov=[basename]_basecov.txt, bincov=[basename]_bincov.txt, physcov=false, 32bit=false, nzo=false, twocolumn=false, secondary=false, covminscaf=0, ksb=true, binsize=1000, startcov=false, strandedcov=false, rpkm=null, normcov=null, normcovo=null, in1=[basename]_1.fq.gz, in2=[basename]_2.fq.gz]
Set NONZERO_ONLY to false
Set TWOCOLUMN to false
Set USE_SECONDARY_ALIGNMENTS to false
Set KEEP_SHORT_BINS to true
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Analyzed Index: 279.244 seconds.
Started output stream: 161.640 seconds.
Started output stream: 9.049 seconds.
Cleared Memory: 547.877 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 102 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101
------------------ Results ------------------
Genome: 1
Key Length: 12
Max Indel: 16000
Minimum Score Ratio: 0.45
Mapping Mode: normal
Reads Used: 0 (0 bases)
Mapping: 2.663 seconds.
Reads/sec: 0.00
kBases/sec: 0.00
Pairing data: pct reads num reads pct bases num bases
mated pairs: NaN% 0 NaN% 0
bad pairs: NaN% 0 NaN% 0
insert size avg: NaN
Read 1 data: pct reads num reads pct bases num bases
mapped: NaN% 0 NaN% 0
unambiguous: NaN% 0 NaN% 0
ambiguous: NaN% 0 NaN% 0
low-Q discards: NaN% 0 NaN% 0
perfect best site: NaN% 0 NaN% 0
semiperfect site: NaN% 0 NaN% 0
rescued: NaN% 0
Match Rate: NA NA NaN% 0
Error Rate: NaN% 0 NaN% 0
Sub Rate: NaN% 0 NaN% 0
Del Rate: NaN% 0 NaN% 0
Ins Rate: NaN% 0 NaN% 0
N Rate: NaN% 0 NaN% 0
Read 2 data: pct reads num reads pct bases num bases
mapped: NaN% 0 NaN% 0
unambiguous: NaN% 0 NaN% 0
ambiguous: NaN% 0 NaN% 0
low-Q discards: NaN% 0 NaN% 0
perfect best site: NaN% 0 NaN% 0
semiperfect site: NaN% 0 NaN% 0
rescued: NaN% 0
Match Rate: NA NA NaN% 0
Error Rate: NaN% 0 NaN% 0
Sub Rate: NaN% 0 NaN% 0
Del Rate: NaN% 0 NaN% 0
Ins Rate: NaN% 0 NaN% 0
N Rate: NaN% 0 NaN% 0
Average coverage: 0.00
Standard deviation: 0.00
Percent scaffolds with any coverage: 0.00
Percent of reference bases covered: 0.00
Total time: 33750.199 seconds.
So, that's weird. Maybe too stringent parameters for what I am trying to do?
Brian Bushnell
Why are you using 102 threads for this job??
Having threads for a job like this spread across several physical servers is probably causing this problem. If you have large data files you should start several BBMap jobs in parallel and then merge/deal with BAM files afterwards.
How big is the reference here?
This is not spread over many physical servers. This is running on one node of our supercomputer cluster. Each Node has 1 TB of RAM and 60 CPUs (that are each quad core). Several of these jobs can run in parallel on the same node.
Generally, I wouldn't use this many threads for a single job, but it doesn't appear that BBMap has any problem with this -- and Brian hasn't mentioned elsewhere (forums, documentation) that BBMap has an upper bound on useful number of threads (I would definitely like to know if this is the case, though). I have tried varying the number of threads given to it for different jobs, and it generally performs similarly from 30 - 100 threads, so maybe the limit is in the lower end of that range.
The reference is around 80G -- and each read is comparably sized (60G - 80G). Larger projects using a ~100G reference and lanes work just fine and have sensible output from BBMap. These projects use the same species reference, rather than a different one, as in this case.
I have been thinking about a possible configuration that can fit the hardware you were describing, where there would be 100+ cores on a physical server. (Disclaimer: If it is a NUMA/Cray type machine then the following would not apply).
Best Xeon's intel currently sells seem to be 18-core E7-8890 chips (cost for each CPU is an eye watering ~$6k+). Server motherboards can have quad-socket's (standard commercial ones). With such a combination we can have 72 real-cores (or 144 hyper-threads). So the config you describe would be ~$50K (with a 1TB RAM, without adding disks/networking/RAID controllers/warranty etc). If you have a cluster made from several of these, then I envy you :-)
You also appear to be using SLURM (correct me if I am wrong) for job scheduling. I don't understand
--ntasks=1 --cpus-per-task=26
directives (not yet a SLURM user). There also does not appear to be any directive (-N 1
is the right option) to keep all the threads from this job on a single node (your cluster may be configured to do this automatically).I have a feeling that in spite of such beefy hardware you are running into some hardware limit/process contention (you may have to work with your admins to trace something like this down).
Would be curious to see if you are able to find out specifics on your cluster and to see if they match my speculation.
This is yet another proof that @Brian's code is really well written/HPC grade. I don't know if he has had a chance to test it on beefy hardware like this before.
JGI has an 80-core system made from two 40-core motherboards (each 4x10-core, I think) linked together. The performance is pretty abysmal; it's strictly used for things that cannot complete with the memory limitations of a single motherboard (that one has 2 TB RAM and is the biggest on the Genepool cluster). As long as a process needs under 1 TB (Genepool's current single-machine maximum), it is far faster to run it on a system with a single physical motherboard. The divided memory, in practice, probably puts a well-threaded 80-thread application at roughly the same speed as it would run on an 10-core uniform-memory motherboard. Of course, it depends on how much random memory access is needed.
There are distributed algorithms for many roles, which scale well across multiple nodes, but using them greatly constrains one's choices of both algorithms and infrastructure.
Very interesting! Can I ask what brand of servers are these? Are you saying that there are 60 physical CPU's/sockets with each CPU being a quad-core? Actually a TB of RAM seems inadequate here compared to the CPU's.
You must also have a very high performance storage system (is that all SSD based) that must be needed with this kind of compute power. Your job must be becoming I/O bound at some stage (unless everything is read into memory first) and perhaps that explains the plateauing of performance after a certain # of threads.
What is interesting is that you have had similar jobs complete before so clearly the issue is with this set. Even if the reference was totally non-homologous the only thing that should happen is no alignment (wonder if that is what NaN represents but one would expect some alignment just by chance). Do you have hundreds (of thousands of contigs in reference) and BBMap is hitting some internal limit?
Are there no other error messages? And no actual output BAM?
I assume they are some model of Xeon or Supermicro Motherboard (but I wouldn't know), and the file servers that we use are SSDs (mounted on the cluster). I honestly have no idea how our cluster works beyond that, how it is put together, how nodes communicate... etc. (or even a cluster generally, for that matter). All I know is the amount of resources that are available for our jobs (for our job queue). You would probably have to buy one of the admins in our IT department lunch to get a better idea of how it is all put together.
Presumably, the performance ceiling comes from the merging of threads and I/O, as you suspect. As far as I know, BBMap makes usage of pigz to handle I/O, which is multithreaded, but only so fast. And at a certain point, any program that is multithreaded will run into a bottleneck as the threads complete and merge, I think. I have never seen a program that can take an arbitrary upper limit on threads and make good use of all of them, but that's to be expected. As you suspect, something not entirely related to the mapping is probably causing the holdup.
So The SAM file output from BBMap looks like a SAM file
but puzzlingly, the output mapped and unmapped SAM files are exactly the same size in each case of running this project...
If there were no alignment, we might expect that the "NaN"s are instead some low number like 0 - 2 %. So I think there is something broken in this case. Just no idea what it is. No error messages, no warnings, job completes in what seems to be an appropriate amount of time (comparable to other projects).
I'm just hoping that Brian has seen this kind of thing before.
The key is this:
Reads Used: 0 (0 bases)
(wow... all the formatting options in the forum have changed...)
So, BBMap did not find any reads in the input files. That's why the output files are the same size - they are both only headers, with no actual reads. And thus all the statistics are NaN because of 0/0 division.
Can you verify that the inputs are acceptable? You specified "in1=[basename]_1.fq.gz in2=[basename]_2.fq.gz"
Those were used literally, so unless you actually have a file called "[basename]_1.fq.gz", that would explain the zero output. However, if the file did not exist it SHOULD crash and state that the file does not exist. So I assume there is a file called "[basename]_1.fq.gz" of size zero bytes. Is that correct?
Fortunately, you won't have to re-index, which took the majority of the time. Loading an indexed reference is a lot faster. Just be sure to use the correct paths to the input files, and please tell me if those indeed were the correct paths.
Sorry for the confusion -- I replaced the filenames in just this post with [basename] where appropriate, since these are long filepaths that make the whole post much less readable. The real filenames are named in the fashion of an Illumina sequencing directory.
In this case, the actual filenames would be something like this:
and in this post, I replaced everything but "_1.fq.gz" or "_2.fq.gz" or other identifying suffixes with [basename]
So, yes, the files are correct paths and have a non-zero, very large size. These are the files that came directly from BBDuk runs that did adapter / quality / contamination trimming.
This is very mysterious. Sorry for assuming you messed up the syntax, but that is a common mistake and there are no other obvious reasons for this issue. I assume if you run "tail" on the sam files, the last lines are also headers, right?
Would you mind trying this:
reformat.sh in1=[basename]_1.fq.gz in2=[basename]_2.fq.gz out=out.fq reads=4
...to see what the output is? If it fails, please try this:
reformat.sh in1=[basename]_1.fq.gz in2=[basename]_2.fq.gz out=out.fq reads=4 unpigz=f gunzip=f bf2=f
Either way, I would love to see the first 8 lines of each fastq file (you can email them to me at bbushnell at lbl dot gov, if they are not confidential), as that may help me diagnose this.
And yes, BBTools use pigz for compression/decompression, when available. In my testing pigz scales perfectly for output, though it maxes at about 1.33 threads for input; and even when you cap it at 1 thread it's faster than gzip, for reasons I don't understand. But, since decompressing gzip is rate-limited by being impossible to fully parallelize, it's always fastest to keep paired reads in dual files as you are doing. It has, on occasion, caused trouble, though. You can disable it with "pigz=f unpigz=f", in which case the moderately slower Java internal decompressor will be used, which will make no difference in this case because with such a huge reference it is compute-limited, not I/O=limited. BBTools can also use gunzip (which is faster than Java mainly because decompression is done in a different thread than parsing) which is why I suggested the gunzip=f flag for diagnosis purposes. The "bf2=f" flag disables pipeline-multithreaded input, which could potentially have a race condition and has in a handful of circumstances caused odd behavior in strangely-configured systems. For example, it's never caused a problem on JGI's usual Genepool cluster, or on my PC... but NERSC is bringing up a new supercomputer (Cori) where it does seem to cause problems. The benefit of pipeline-parallel "bf2" mode over singlethreaded "bf1" mode is only ~50%, and only when I/O is limiting (primarily Reformat). I will examine it this weekend to see if I can find and fix a race condition.
I am also kind of hoping that this is a transient (hardware/OS) problem and re-running with the same command will work successfully. That's a lot of compute-time to load the index, though (probably a couple hours), which is why I suggest trying reformat - it will complete in a fraction of a second, and should yield the same statistics as BBMap since it uses the same read input routines.
@Genomax - BBMap has no limits on size or number of contigs of references, other than 2^63 sequences overall or 2^31 bases in a single sequence (which I hope is bigger than the largest possible stable chromosome); the limiting factors should always be time and memory. It's been successfully used with metagenomes containing hundreds of millions of contigs.
Hey Brian,
Thanks for the support -- I'll give this a try tomorrow at work and will update then
So, I was on vacation last week, but back at it today.
Things are even less clear at this point, as it turns out that BBDuk had output empty files with a similar error to the ones here. So, I am going to be trying reformat.sh on the original files. If that works, then I will continue through the steps until I hit a problem again.
Hi Brian,
After leaving this project for a few weeks and coming back to it, I ran BBDuk2 again -- using quality of 10 to trim, and discovered that 99.9% of bases were being trimmed. Would looking at the fastqc information verify this result? If this isn't some kind of bug, is... is this bad? Is this data garbage?
I am rerunning with quality = 1 now and will edit this post with the result.
Thanks
Something sounds odd. Is this illumina format data (phred+64)?
Yes, Illumina format data phred+64, according to the Report provided with the data.
So, that's weird right?
No it is not weird. You need to specify
qin=64
on your command line. That should allow BBDuk to recognize this data as phred+64. Are you doingqtrim=r trimq=10
?At some point you may want the data to be in phred+33 format so you could
reformat.sh
it before trimming or afterwards.Okay,
Will do thanks. I will reformat to phred+33, and try again. This seems like it should solve the issues I am having with this data.