Question

BLAST parameters for sequences with high similarity

8 months ago

behzad.karkaria • 0

Hi all,

I have a dataset of V4 sequences from MiSeq and a dataset of full-length Sanger 16S sequences.

I'm trying to match up each V4 sequence with its full-length sequence in the other dataset using BLAST. I'm using a pident > 99 threshold (i.e. more than 99% identity) to assign a V4 sequence to a 16S sequence, but the results depend on the BLAST parameters used.

In particular, the results are very sensitive to word_size. I get more hits when I switch to blastn (word_size=11) than with megablast (word_size=28), and these additional hits more often have pident=100.0.

I'm wondering what best practice is here when I'm expecting high similarity between sequences. Should I be using blastn or megablast? Is BLAST even the correct tool here?

Thanks in advance!

BLAST Sanger 16S v4 miseq • 512 views

ADD COMMENT • link updated 8 months ago by GenoMax 147k • written 8 months ago by behzad.karkaria • 0

If you have full-length sequences and the MiSeq sequences are similar to them, you should perhaps try global alignments (e.g. needle from EMBOSS) instead.

ADD REPLY • link 8 months ago by GenoMax 147k

Wouldn't global alignments attempt to align the entire sequences against each other? With the V4 sequences being shorter, I'm not sure this would work. I could trim the 16S sequences to the predicted V4 region first.
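
For the trimming idea, a rough Python sketch could look like the following. It assumes the V4 region is bounded by two known primer sites that match the 16S sequence exactly (no mismatches or indels), and that the forward primer and the reverse complement of the reverse primer both appear on the given strand. The helper names and the toy primers in the example are made up for illustration; they are not real V4 primers.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer: str) -> str:
    """Turn a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def revcomp(seq: str) -> str:
    """Reverse complement, ambiguity codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def trim_between(seq: str, fwd: str, rev: str):
    """Return the region between the forward primer and the reverse
    primer site (primers excluded), or None if either site is absent."""
    pattern = primer_regex(fwd) + "(.+?)" + primer_regex(revcomp(rev))
    match = re.search(pattern, seq.upper())
    return match.group(1) if match else None


# Toy example with hypothetical 3-base "primers".
region = trim_between("TTACAGGGTTTTCCAA", fwd="ACM", rev="GGW")
print(region)
```

Whether the primers themselves should be kept depends on whether the MiSeq reads still contain them, so that part would need adjusting to the data.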

ADD REPLY • link 8 months ago by behzad.karkaria • 0

It sounded like you wanted to align entire MiSeq reads to the reference, hence my suggestion.

ADD REPLY • link 8 months ago by GenoMax 147k

score 1 · Answer 1 · 2024-03-12

That sounds like a cool project.

When you expect high sequence similarity between queries and database, megablast is the way to go (it's faster too). Besides word_size, the two tasks may also differ in other defaults, such as limits on the number of reported sequences, match/mismatch scores and gap penalties. Some of that can explain the variation in your results.
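
To make the word_size effect concrete, here is a minimal Python sketch of why a larger word size can miss hits. It assumes a simplified model in which a hit can only be seeded by a run of at least word_size identical bases, and it compares two already aligned, equal-length, ungapped sequences. The function names are made up for illustration, and this is not how BLAST is implemented internally.

```python
def longest_exact_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive identical positions
    between two equal-length, ungapped, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    best = run = 0
    for x, y in zip(a.upper(), b.upper()):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def can_seed(a: str, b: str, word_size: int) -> bool:
    """True if an exact match of at least word_size bases exists
    (a simplified stand-in for the seeding step)."""
    return longest_exact_run(a, b) >= word_size


# Toy example: 40 bases with mismatches at 0-based positions 13 and 27,
# so the exact runs are 13, 13 and 12 bases long.
ref = "ACGT" * 10
read = ref[:13] + "A" + ref[14:27] + "A" + ref[28:]

print(can_seed(ref, read, word_size=11))
print(can_seed(ref, read, word_size=28))
```

With these toy sequences a word size of 11 finds a seed and a word size of 28 does not. By the same logic, a read of roughly 250 bases (an assumed length) with at most two mismatches and no gaps always contains an exact run of at least 83 bases, so word_size alone is unlikely to explain missing near-identical hits.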

You may want to consider limiting the maximum number of hits or HSPs reported per query (e.g. with -max_target_seqs or -max_hsps).

(Given that there can be sequencing errors and genetic diversity, you should not necessarily expect 100% identity.)
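
As a rough illustration of post-filtering, the Python sketch below keeps one best hit per query and requires both high identity and near-full-length coverage of the V4 read, since a short local match can reach 100% identity without covering the whole read. It assumes tabular output produced with a custom column list such as `-outfmt "6 qseqid sseqid pident length qlen bitscore"`. The thresholds, the function name and the example rows are made up for illustration, and coverage is approximated as alignment length over query length.

```python
import csv
import io

# Assumed column order; it must match whatever -outfmt string was used.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qlen", "bitscore"]


def best_hits(handle, min_pident=99.0, min_cov=0.99):
    """Return {qseqid: hit} keeping, per query, the highest-bitscore hit
    that passes both the identity and the coverage threshold."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        pident = float(hit["pident"])
        bitscore = float(hit["bitscore"])
        # Approximate coverage: alignment length (gaps included) / query length.
        cov = int(hit["length"]) / int(hit["qlen"])
        if pident < min_pident or cov < min_cov:
            continue
        prev = best.get(hit["qseqid"])
        if prev is None or bitscore > float(prev["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up example rows: a short perfect local match, a full-length
# near-perfect match, and a full-length perfect match for another read.
example = io.StringIO(
    "v4_1\tsanger_A\t100.0\t60\t253\t111\n"
    "v4_1\tsanger_B\t99.6\t253\t253\t460\n"
    "v4_2\tsanger_C\t100.0\t253\t253\t468\n"
)
hits = best_hits(example)
for qseqid, hit in hits.items():
    print(qseqid, hit["sseqid"], hit["pident"])
```

In this example the 60-base perfect match for v4_1 is discarded for low coverage, and the full-length 99.6% match is kept instead.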