Hi all,
I have a dataset of V4 sequences from MiSeq, and a dataset of full-length Sanger 16S sequences.
I'm trying to match up each V4 sequence with its full-length sequence in the other dataset using BLAST. I'm using a pident > 99
threshold to assign a V4 sequence to a 16S sequence, but the assignments change depending on the BLAST parameters used.
In particular, the results are very sensitive to word_size. I get more hits when I switch to blastn (word_size=11) as opposed to using megablast (word_size=28), and these additional hits are also more often hitting pident=100.0.
I'm wondering what best practice is here when I'm expecting high similarity between sequences. Should I be using blastn or megablast? Is BLAST even the correct tool here?
Thanks in advance!
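To make the assignment rule in the question concrete, here is a minimal Python sketch that filters BLAST tabular output (-outfmt 6, default column order) by percent identity. It assumes pident is on BLAST's 0-100 scale. The query-coverage filter is an addition not mentioned in the question: pident is computed over the aligned segment only, so a hit covering part of a read can still report 100.0. The function name, the `query_lengths` mapping, and the 0.99 coverage cutoff are hypothetical choices for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def assign_hits(handle, query_lengths, min_pident=99.0, min_qcov=0.99):
    """Return {query_id: subject_id} for the best-bitscore hit passing both filters.

    handle        -- file-like object with tab-separated -outfmt 6 lines
    query_lengths -- hypothetical mapping of query id -> read length
    min_pident    -- percent identity cutoff on BLAST's 0-100 scale
    min_qcov      -- assumed cutoff: fraction of the query inside the alignment
    """
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        qid = row["qseqid"]
        pident = float(row["pident"])
        # Fraction of the query spanned by this local alignment.
        qcov = (int(row["qend"]) - int(row["qstart"]) + 1) / query_lengths[qid]
        if pident < min_pident or qcov < min_qcov:
            continue
        score = float(row["bitscore"])
        # Keep only the highest-scoring passing hit per query.
        if qid not in best or score > best[qid][1]:
            best[qid] = (row["sseqid"], score)
    return {qid: sid for qid, (sid, _) in best.items()}
```

Running the same filter on the blastn and megablast outputs and comparing the resulting dictionaries would show whether the extra blastn hits survive once partial alignments are excluded.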
If you have full-length sequences and the MiSeq sequences are expected to be near-identical to them, you could try global alignments (e.g. needle from EMBOSS) instead.
Wouldn't global alignments attempt to align the entire sequences against each other? With the V4 sequences being shorter, I'm not sure this would work. I could trim the 16S sequences to the predicted V4 region first.
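One way around the length mismatch without trimming is a semi-global comparison, where the short read must be aligned end to end but the unaligned flanks of the longer sequence cost nothing. This is a different technique from the trimming idea above, shown here only as a minimal Python sketch of approximate substring matching (Sellers-style edit distance). The function name is hypothetical, and the code assumes both sequences are on the same strand, in the same case, and free of ambiguity codes.

```python
def semiglobal_edit_distance(query, target):
    """Minimum edits to align all of `query` to some substring of `target`.

    Returns (edits, end), where `end` is the exclusive end index in `target`
    of the first best-matching substring found.
    """
    m = len(query)
    # Column for the empty target prefix: aligning query[:i] to nothing costs i.
    col = list(range(m + 1))
    best = (col[m], 0)
    for j, t in enumerate(target, 1):
        # Row 0 stays 0: the alignment may start anywhere in the target for free.
        new = [0]
        for i in range(1, m + 1):
            new.append(min(
                col[i - 1] + (query[i - 1] != t),  # match or substitution
                new[i - 1] + 1,                    # query base with no target base
                col[i] + 1,                        # target base with no query base
            ))
        col = new
        # The alignment may also end anywhere in the target for free.
        if col[m] < best[0]:
            best = (col[m], j)
    return best


print(semiglobal_edit_distance("ACGT", "TTTACGTTTT"))  # (0, 7)
```

The edit count is not the same quantity as BLAST's pident, and the pure-Python loop is quadratic in sequence length, so it is only practical for spot-checking a handful of read and reference pairs.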
It sounded like you wanted to align entire MiSeq reads to the reference, hence my suggestion.