There's no shortage of posts on Biostars regarding ...
- (1) which aligner is best for mapping short reads (~ 30 bp)
- (2) whether the task of finding all alignments of a short read can be achieved by BLAT/BLAST
- (3) additional recommendations for short read aligners outside bwa/bowtie/BLAT/BLAST
I read through all of them and learned a lot. But I'm struggling to find all alignment locations of ~ 1 000 000 x 32-mers, within a Hamming distance of 3, against a reference genome.
Wondering if folks can suggest either a parameter change, or a tool of choice! Here's what I've tried, and what I observed:
- With
BLAT
I find that the k-mer alignments are underreported.- Using a Bloom filter, I find the counts of k-mers exceed the alingment locations using
BLAT
. Furthermore, the aligned locations ofBLAT
never exceeds ~ 200 or so...!blat -t=dna -minMatch=1 -stepSize=1 -minScore=29 -tileSize=8 -repMatch=10000 -fine -out=pslx hg38.fa kmers.fa kmers.psl
- Using a Bloom filter, I find the counts of k-mers exceed the alingment locations using
- With
bowtie
(version 1) the resultant alignments are all perfect matches! The CIGAR scores are all32M
suggesting perfect alignement... despite indicating mismatches, why?bowtie --sam -v 3 -p 8 --all -f --norc hg38 kmers.fa kmers_aln.sam
- With
bbduk
I keep running out of memory (never even runs), despite allocating 500 GB (!) Is allowing a Hamming distance = 3 too computationally expensive?bbduk.sh -Xmx500g in=kmers.fa ref=hg38.fa outm=kmers_aln.fq k=32 mm=f hdist=3 threads=8
If you are mapping then why are you using
bbduk.sh
? This way it is running infilter
mode by default. You should usebbmap.sh
instead. Why are you exporting the result as a.fq
file? You also need to use a smallerk
to allow for initial matches.Try
You will need to play with
minid
parameter to get 3 mismatches.GenoMax Thank you so much for this response! I think I misspoke (and learned the difference btwn alignment vs. mapping Alignment and mapping). I would like to do alignment, and merely followed your suggestion to another user (kmer alignment with mismatch). The
.fq
was based on readingbbduk
documentation.Still a little confused as I took the past few days to try
BLAST
andbbmap
. I expect nearly uniform genome coverage but the kmers I have only map to half of the genomic regions? Not sure if it is by default masking some regions.Can you post an example of some of the kmers you are working with? Aligners will generally have an upper limit in terms of alignments they report. Have you tried to align with bbmap? You will need to add
ambig=all
option to report all alignments.