Hi,
I'm trying to build a reference index using BBsplit. I plan to use this index later on to split xenograft reads into host and graft reads.
From bbsplit.sh --help
, the syntax is:
bbsplit.sh build=<num> ref_human=/path/to/human_ref.fa ref_mouse=/path/to/mouse_ref.fa path=/path/for/index
When I run this, the log and output seem to indicate that only chromosomes 1-13 are processed.
Log:
java -ea -Xmx32g -Xms32g -cp /path/to/utils/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 -Xmx32g build=2 ref_human=/dev/fd/63 ref_mouse=/path/to/analysis_resources/gencodevM25_GRCm38.p6/GRCm38.primary_assembly.genome.fa path=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, -Xmx32g, build=2, ref_human=/dev/fd/63, ref_mouse=/path/to/analysis_resources/gencodevM25_GRCm38.p6/GRCm38.primary_assembly.genome.fa, path=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6]
Creating merged reference file /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz
Ref merge time: 67.448 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, build=2, ref=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz]
Version 38.96
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
No output file.
NOTE: Deleting contents of /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz, 2, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5
Writing chunk 6
Writing chunk 7
Writing chunk 8
Writing chunk 9
Writing chunk 10
Writing chunk 11
Writing chunk 12
Writing chunk 13
Set genome to 2
Loaded Reference: 0.005 seconds.
Loading index for chunk 1-13, build 2
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr1-3_index_k13_c2_b2.block
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr4-7_index_k13_c2_b2.block
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr8-11_index_k13_c2_b2.block
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr12-13_index_k13_c2_b2.block
Indexing threads started for block 8-11
Indexing threads started for block 4-7
Indexing threads started for block 12-13
Indexing threads started for block 0-3
Indexing threads finished for block 12-13
Indexing threads finished for block 0-3
Indexing threads finished for block 8-11
Indexing threads finished for block 4-7
Generated Index: 289.727 seconds.
Finished Writing: 26.169 seconds.
No reads to process; quitting.
Total time: 363.183 seconds.
Assuming chunks are not chromosomes, but verifying the output index files just in case:
ls -lR /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6
GRCH38.p13_GRCm38.p6:
total 32
drwxr-sr-x 4 ram pdxcore ram 2048 2022-06-21 16:02 ref/
GRCH38.p13_GRCm38.p6/ref:
total 64
drwxr-sr-x 4 ram pdxcore ram 2048 2022-06-22 14:16 genome/
drwxr-sr-x 4 ram pdxcore ram 2048 2022-06-22 14:17 index/
GRCH38.p13_GRCm38.p6/ref/genome:
total 64
drwxr-sr-x 2 ram pdxcore ram 4096 2022-06-22 14:18 2/
GRCH38.p13_GRCm38.p6/ref/genome/2:
total 3369440
-rw-r--r-- 1 ram pdxcore ram 139332337 2022-06-22 14:18 chr1.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 114957859 2022-06-22 14:18 chr2.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 150494840 2022-06-22 14:18 chr3.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 117847786 2022-06-22 14:18 chr4.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 134074430 2022-06-22 14:18 chr5.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 142117515 2022-06-22 14:18 chr6.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 124216258 2022-06-22 14:18 chr7.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 144500979 2022-06-22 14:18 chr8.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 129348469 2022-06-22 14:18 chr9.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 146176565 2022-06-22 14:19 chr10.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 135428456 2022-06-22 14:21 chr11.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 147058547 2022-06-22 14:22 chr12.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 27575390 2022-06-22 14:19 chr13.chrom.gz
-rw-r--r-- 1 ram pdxcore ram 742 2022-06-22 14:18 info.txt
-rw-r--r-- 1 ram pdxcore ram 1796826715 2022-06-22 14:17 merged_ref_208959345802405.fa.gz
-rw-r--r-- 1 ram pdxcore ram 12 2022-06-22 14:16 namelist.txt
-rw-r--r-- 1 ram pdxcore ram 125 2022-06-22 14:16 reflist.txt
-rw-r--r-- 1 ram pdxcore ram 3927 2022-06-22 14:18 scaffolds.txt.gz
-rw-r--r-- 1 ram pdxcore ram 349 2022-06-22 14:18 summary.txt
GRCH38.p13_GRCm38.p6/ref/index:
total 64
drwxr-sr-x 2 ram pdxcore ram 2048 2022-06-22 14:23 2/
GRCH38.p13_GRCm38.p6/ref/index/2:
total 22075712
-rw-r--r-- 1 ram pdxcore ram 5465729975 2022-06-22 14:23 chr1-3_index_k13_c2_b2.block
-rw-r--r-- 1 ram pdxcore ram 71027559 2022-06-22 14:22 chr1-3_index_k13_c2_b2.block2.gz
-rw-r--r-- 1 ram pdxcore ram 7067656551 2022-06-22 14:23 chr4-7_index_k13_c2_b2.block
-rw-r--r-- 1 ram pdxcore ram 75651111 2022-06-22 14:23 chr4-7_index_k13_c2_b2.block2.gz
-rw-r--r-- 1 ram pdxcore ram 7446090599 2022-06-22 14:23 chr8-11_index_k13_c2_b2.block
-rw-r--r-- 1 ram pdxcore ram 76394851 2022-06-22 14:23 chr8-11_index_k13_c2_b2.block2.gz
-rw-r--r-- 1 ram pdxcore ram 2344942587 2022-06-22 14:21 chr12-13_index_k13_c2_b2.block
-rw-r--r-- 1 ram pdxcore ram 57934699 2022-06-22 14:22 chr12-13_index_k13_c2_b2.block2.gz
I looked that the txt files under index/2
genome/2
, and none of them mention anything above chr13. I've checked my references, and both contain >13 chromosomes. The /dev/fd/
input is from a <(zcat ref.fa.gz)
- build #1 was without this zcat and produced identical output, however, I don't have the logs from that run.
I see - thank you, Brian. I'll run bbsplit and check how the classification goes. A collaborator mentioned it needs a lot of RAM (~200G). Is that true or are we doing it wrong?
No, the amount of memory is dictated by the reference, generally around 7 bytes per bp (so probably under 50GB in this case). Although it's possible to reduce that by roughly a factor of 2 at the cost of some sensitivity with the 'usemodulo' flag applied (both when indexing and mapping).
Actually, if you concatenate the references (or stream from stdin) you can run "stats.sh in=ref.fa k=13" and it will give you an estimate of the memory BBMap needs for that reference and that kmer size.
That is great news. It did not run to completion in a 48gb node yesterday. I'll try on a 64gb node today.
EDIT
This is the error when running on a 64GB interactive node:
I'll run the
stats.sh
to estimate RAM requirement but it looks like 54G (as seen in the Xmx and Xms params) doesn't cut it.Lowering the number of threads should help. Stay at 12 or below, if 64G of RAM is all you have.
That works! I added
Xmx50g
(to be explicit about it) andthreads=12
and it seems to be running fine