hi folks,
Recently, I've been working on developing molecular and computational methods to improve the accuracy of relative abundance estimation of metagenome assembled genomes (MAGs) from NGS datasets. Most of this work is focused on host-associated viral metagenomics, where quite a bit of nucleic acid manipulation and amplification is needed for sequencing. When I was evaluating the molecular side of the process, I used bowtie2 for read alignments against the handful of genomes in my mock community only. This worked fine for low throughput miseq runs for eval purposes, but is less ideal for hundreds of samples with orders of magnitude more throughput. Since those early days, I've also switched to BBmap as my primary short read aligner and I've been really impressed with its speed and accuracy. However, while assessing the performance of a few short read aligners on [a handful of mock community controls from] real datasets recently, I've noticed that BBmap aligns a very small fraction (<15%) of reads from a given sample. It does this seemingly quite accurately compared to other read aligners, but I'm wondering if theres a good tradeoff between speed and accuracy that yields higher alignable % of read pairs? For example, see:
------------------ Results ------------------
Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.9081
Mapping Mode: normal
Reads Used: 160842560 (23021057453 bases)
Mapping: 1812.129 seconds.
Reads/sec: 88758.88
kBases/sec: 12703.87
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 1.7664% 1420555 1.8571% 427531618
bad pairs: 0.0000% 0 0.0000% 0
insert size avg: 289.33
Read 1 data: pct reads num reads pct bases num bases
mapped: 1.7664% 1420555 1.9532% 213765809
unambiguous: 1.7660% 1420220 1.9527% 213716684
ambiguous: 0.0004% 335 0.0004% 49125
low-Q discards: 0.0013% 1076 0.0012% 133213
perfect best site: 1.3331% 1072070 1.4762% 161564894
semiperfect site: 1.3332% 1072178 1.4764% 161580586
rescued: 0.0236% 19017
Match Rate: NA NA 99.5051% 212908537
Error Rate: 21.7281% 347103 0.4916% 1051931
Sub Rate: 21.1298% 337546 0.3812% 815654
Del Rate: 0.7988% 12761 0.0942% 201564
Ins Rate: 1.0280% 16422 0.0162% 34713
N Rate: 0.1184% 1892 0.0032% 6905
Read 2 data: pct reads num reads pct bases num bases
mapped: 1.7664% 1420555 1.7748% 214330322
unambiguous: 1.7660% 1420219 1.7743% 214280573
ambiguous: 0.0004% 336 0.0004% 49749
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 1.1366% 914081 1.1422% 137943881
semiperfect site: 1.1367% 914150 1.1423% 137953988
rescued: 0.0471% 37897
Match Rate: NA NA 99.2782% 213000720
Error Rate: 34.3817% 504830 0.7202% 1545243
Sub Rate: 33.8403% 496880 0.6003% 1287846
Del Rate: 0.8624% 12663 0.1021% 219009
Ins Rate: 1.1514% 16906 0.0179% 38388
N Rate: 0.1747% 2565 0.0016% 3368
Total time: 1819.680 seconds.
This sample was at the lower end of the distribution, but for the handful of of mock samples that I'm benchmarking against, % pairs aligned didn't exceed 15%. To provide a little background, I'm aligning short PE NovaSeq (150bp) reads against a set of ~5200 high confidence viral OTU sequences (eg. MAGs) cross assembled from these mock community controls and several hundred clinical samples. The command that I use to get these results are:
bbmap.sh -Xmx250g threads=24 pairedonly=t minid=0.95 unpigz=T pigz=T in1=$F1 in2=$F2 outm=$BBM_output_dir/${F1%%_L001_R1_001.fastq.gz*}_mapped.bam bs=$BBM_output_dir/${F1%%_L001_R1_001.fastq.gz*}_bs.sh; sh $BBM_output_dir/${F1%%_L001_R1_001.fastq.gz*}_bs.sh
unsurprisingly, when I remove the minid=0.95
flag, I get a higher % of aligned pairs but the accuracy drastically decreases. Similarly, when using bwa-mem2 or minimap2, I get a much higher fraction of aligned pairs, but most of those reads aren't mapped accurately (eg. to the bugs in the mock community). I guess this isn't super surprising since they're local aligners.
So, now I turn to you good people of Biostars: do you have any recommendations for increasing the fraction of reads aligned by BBmap while still maintaining high accuracy? As BBmap is quite efficient, I wouldn't mind a reasonable increase in alignment time if it lead to more reads accurately aligned. Alternatively, if there is another aligner that's equally or better suited to this type of usage, feel free to suggest. I will also continue to eval and update with results.
thanks, y'all!
How did you assemble the MAG's and are you certain that they are of good quality. If you have chimera's in the assembly then those would throw off any aligners.
Have you looked to see what the other 85% reads that do not map are (blast?). What is the inset size of the library fragments? Do these reads overlap by any chance?
You can run
bbmap.sh
in local alignment mode by addinglocal=f
and see what that does.