BBSplit index creation seems to use a subset of the references provided
2
1
Entering edit mode
2.5 years ago
Ram 44k

Hi,

I'm trying to build a reference index using BBsplit. I plan to use this index later on to split xenograft reads into host and graft reads.

From bbsplit.sh --help, the syntax is:

bbsplit.sh build=<num> ref_human=/path/to/human_ref.fa ref_mouse=/path/to/mouse_ref.fa path=/path/for/index

When I run this, the log and output seem to indicate that only chromosomes 1-13 are processed.

Log:

java -ea -Xmx32g -Xms32g -cp /path/to/utils/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 -Xmx32g build=2 ref_human=/dev/fd/63 ref_mouse=/path/to/analysis_resources/gencodevM25_GRCm38.p6/GRCm38.primary_assembly.genome.fa path=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, -Xmx32g, build=2, ref_human=/dev/fd/63, ref_mouse=/path/to/analysis_resources/gencodevM25_GRCm38.p6/GRCm38.primary_assembly.genome.fa, path=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6]

Creating merged reference file /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz
Ref merge time:         67.448 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, build=2, ref=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz]
Version 38.96

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
No output file.
NOTE:   Deleting contents of /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz, 2, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5
Writing chunk 6
Writing chunk 7
Writing chunk 8
Writing chunk 9
Writing chunk 10
Writing chunk 11
Writing chunk 12
Writing chunk 13
Set genome to 2

Loaded Reference:       0.005 seconds.
Loading index for chunk 1-13, build 2
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr1-3_index_k13_c2_b2.block
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr4-7_index_k13_c2_b2.block
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr8-11_index_k13_c2_b2.block
No index available; generating from reference genome: /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/index/2/chr12-13_index_k13_c2_b2.block
Indexing threads started for block 8-11
Indexing threads started for block 4-7
Indexing threads started for block 12-13
Indexing threads started for block 0-3
Indexing threads finished for block 12-13
Indexing threads finished for block 0-3
Indexing threads finished for block 8-11
Indexing threads finished for block 4-7
Generated Index:        289.727 seconds.
Finished Writing:       26.169 seconds.
No reads to process; quitting.

Total time:             363.183 seconds.

Assuming chunks are not chromosomes, but verifying the output index files just in case:

ls -lR /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6

 GRCH38.p13_GRCm38.p6:
 total 32
 drwxr-sr-x 4 ram pdxcore ram 2048 2022-06-21 16:02 ref/

 GRCH38.p13_GRCm38.p6/ref:
 total 64
 drwxr-sr-x 4 ram pdxcore ram 2048 2022-06-22 14:16 genome/
 drwxr-sr-x 4 ram pdxcore ram 2048 2022-06-22 14:17 index/

 GRCH38.p13_GRCm38.p6/ref/genome:
 total 64
 drwxr-sr-x 2 ram pdxcore ram 4096 2022-06-22 14:18 2/

 GRCH38.p13_GRCm38.p6/ref/genome/2:
 total 3369440
 -rw-r--r-- 1 ram pdxcore ram  139332337 2022-06-22 14:18 chr1.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  114957859 2022-06-22 14:18 chr2.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  150494840 2022-06-22 14:18 chr3.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  117847786 2022-06-22 14:18 chr4.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  134074430 2022-06-22 14:18 chr5.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  142117515 2022-06-22 14:18 chr6.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  124216258 2022-06-22 14:18 chr7.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  144500979 2022-06-22 14:18 chr8.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  129348469 2022-06-22 14:18 chr9.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  146176565 2022-06-22 14:19 chr10.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  135428456 2022-06-22 14:21 chr11.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram  147058547 2022-06-22 14:22 chr12.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram   27575390 2022-06-22 14:19 chr13.chrom.gz
 -rw-r--r-- 1 ram pdxcore ram        742 2022-06-22 14:18 info.txt
 -rw-r--r-- 1 ram pdxcore ram 1796826715 2022-06-22 14:17 merged_ref_208959345802405.fa.gz
 -rw-r--r-- 1 ram pdxcore ram         12 2022-06-22 14:16 namelist.txt
 -rw-r--r-- 1 ram pdxcore ram        125 2022-06-22 14:16 reflist.txt
 -rw-r--r-- 1 ram pdxcore ram       3927 2022-06-22 14:18 scaffolds.txt.gz
 -rw-r--r-- 1 ram pdxcore ram        349 2022-06-22 14:18 summary.txt

 GRCH38.p13_GRCm38.p6/ref/index:
 total 64
 drwxr-sr-x 2 ram pdxcore ram 2048 2022-06-22 14:23 2/

 GRCH38.p13_GRCm38.p6/ref/index/2:
 total 22075712
 -rw-r--r-- 1 ram pdxcore ram 5465729975 2022-06-22 14:23 chr1-3_index_k13_c2_b2.block
 -rw-r--r-- 1 ram pdxcore ram   71027559 2022-06-22 14:22 chr1-3_index_k13_c2_b2.block2.gz
 -rw-r--r-- 1 ram pdxcore ram 7067656551 2022-06-22 14:23 chr4-7_index_k13_c2_b2.block
 -rw-r--r-- 1 ram pdxcore ram   75651111 2022-06-22 14:23 chr4-7_index_k13_c2_b2.block2.gz
 -rw-r--r-- 1 ram pdxcore ram 7446090599 2022-06-22 14:23 chr8-11_index_k13_c2_b2.block
 -rw-r--r-- 1 ram pdxcore ram   76394851 2022-06-22 14:23 chr8-11_index_k13_c2_b2.block2.gz
 -rw-r--r-- 1 ram pdxcore ram 2344942587 2022-06-22 14:21 chr12-13_index_k13_c2_b2.block
 -rw-r--r-- 1 ram pdxcore ram   57934699 2022-06-22 14:22 chr12-13_index_k13_c2_b2.block2.gz

I looked that the txt files under index/2 genome/2, and none of them mention anything above chr13. I've checked my references, and both contain >13 chromosomes. The /dev/fd/ input is from a <(zcat ref.fa.gz) - build #1 was without this zcat and produced identical output, however, I don't have the logs from that run.

bbsplit bbmap xenograft bbtools • 2.4k views
ADD COMMENT
2
Entering edit mode
2.5 years ago

That's a legacy naming scheme; the earliest iterations did indeed have 1 chromosome per file. However, that's not practical when dealing with assemblies that have millions of contigs, so, now they are packed together with many per file, but the files still retain the .chrom suffix and some of the internal text reference remain unchanged.

ADD COMMENT
0
Entering edit mode

I see - thank you, Brian. I'll run bbsplit and check how the classification goes. A collaborator mentioned it needs a lot of RAM (~200G). Is that true or are we doing it wrong?

ADD REPLY
2
Entering edit mode

No, the amount of memory is dictated by the reference, generally around 7 bytes per bp (so probably under 50GB in this case). Although it's possible to reduce that by roughly a factor of 2 at the cost of some sensitivity with the 'usemodulo' flag applied (both when indexing and mapping).

Actually, if you concatenate the references (or stream from stdin) you can run "stats.sh in=ref.fa k=13" and it will give you an estimate of the memory BBMap needs for that reference and that kmer size.

ADD REPLY
0
Entering edit mode

That is great news. It did not run to completion in a 48gb node yesterday. I'll try on a 64gb node today.

EDIT

This is the error when running on a 64GB interactive node:

java -ea -Xmx54640m -Xms54640m -cp /path/to/utils/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 build=2 in=ML_24702_S3_L002_R1_001.fastq.gz in2=ML_24702_S3_L002_R2_001.fastq.gz path=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ basename=out_%_#.fq.gz
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, build=2, in=ML_24702_S3_L002_R1_001.fastq.gz, in2=ML_24702_S3_L002_R2_001.fastq.gz, path=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/, basename=out_%_#.fq.gz]

Merged reference file /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz already exists; skipping merge.
Ref merge time:         0.669 seconds.
Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, build=2, in=ML_24702_S3_L002_R1_001.fastq.gz, in2=ML_24702_S3_L002_R2_001.fastq.gz, ref=/path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz, out_human=out_human_#.fq.gz, out_mouse=out_mouse_#.fq.gz]
Version 38.96

Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
Retaining first best site only for ambiguous mappings.
NOTE:   Ignoring reference file because it already appears to have been processed.
NOTE:   If you wish to regenerate the index, please manually delete /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/summary.txt
Set genome to 2

Loaded Reference:       6.976 seconds.
Loading index for chunk 1-13, build 2
Generated Index:        49.252 seconds.
Analyzed Index:         9.080 seconds.
Cleared Memory:         15.209 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 64 mapping threads.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2832 bytes for AllocateHeap
[thread 47330142209792 also had an error]
[thread 47330209318656 also had an error]
[thread 47330478806784 also had an error]
[thread 47330411702016 also had an error]
# An error report file with more information is saved as:
# /path/to/hs_err_pid51218.log

I'll run the stats.sh to estimate RAM requirement but it looks like 54G (as seen in the Xmx and Xms params) doesn't cut it.

ADD REPLY
1
Entering edit mode

Lowering the number of threads should help. Stay at 12 or below, if 64G of RAM is all you have.

ADD REPLY
0
Entering edit mode

That works! I added Xmx50g (to be explicit about it) and threads=12 and it seems to be running fine

ADD REPLY
1
Entering edit mode
2.5 years ago
GenoMax 148k

AFAIK

Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5

Is not referring to actual chromosomes. I think BBMap (like other aligners) simply concatenates chromsomes into a loooong string and then creates the index. chr* in file names you see is likely some random names/split @Brian uses when writing the index.

Do you have data that suggests absence of other chromosomes in the index? Don't go on what the names of the files say.

ADD COMMENT
0
Entering edit mode

I mentioned some other digging that I did: the text files under the genome dir (not index as I previously mentioned). They only mention chr 1-13.

➜ head -n 50 /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/*.txt
==> /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/info.txt <==
#Chromosome sizes
#Generated on   Wed Jun 22 14:17:54 CDT 2022
#Version        5
#chrom  scaffolds       contigs length  defined undefined       startPad        stopPad
1       2       97      491158252       471029240       20129012        8000    8000
2       2       36      388518415       387852802       665613  8000    8000
3       3       65      511698812       510314031       1384781 8000    8000
4       3       89      417339376       399821648       17517728        8000    8000
5       4       83      489778878       456222832       33556046        8000    8000
6       7       240     525741827       491930711       33811116        8000    8000
7       174     430     471099536       423365683       47733853        8000    8000
8       3       71      498669621       486781117       11888504        8000    8000
9       3       66      447021290       436111624       10909666        8000    8000
10      4       74      506782760       492582057       14200703        8000    8000
11      4       29      469505491       456139038       13366453        8000    8000
12      5       139     516369744       495873705       20496039        8000    8000
13      46      185     97124602        93386767        3737835 8000    8000

==> /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/namelist.txt <==
human
mouse

==> /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/reflist.txt <==
/proc/33022/fd/63
/path/to/mouse_ref.fa

==> /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/summary.txt <==
#Summary
#Generated on   Wed Jun 22 14:18:41 CDT 2022
#Version        5
chroms  13
bases   5830808604
defined 5601411255
undefined       229397349
contigs 1604
scaffolds       260
interpad        300
source  /path/to/resources/bbsplit/GRCH38.p13_GRCm38.p6/ref/genome/2/merged_ref_208959345802405.fa.gz
bytes   1796826715
last modified   1655925474000
scafprefixes    true
ADD REPLY

Login before adding your answer.

Traffic: 2971 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6