I'm a little confused by the use of similarity and identity and was wondering if someone could help set me straight.
I recently read a paper in which the authors state that, following BLASTN, they extracted and used sequences with >90% similarity. However, I do not know how they calculated similarity using BLASTN. I am aware that you can set an identity threshold in BLASTN with the argument -perc_identity 90.00, but as far as I understand, similarity and identity are not the same thing. For instance here, sequences A and B have 100% identity but 60% similarity.
Do you think the authors actually mean similarity in this instance? If so, how do I calculate similarity using BLASTN?
Thanks
Ah, that makes much more sense. Thank you! So similarity will only ever be greater than identity for BLASTP, and the two are the same for BLASTN. Is there another measure that takes the length of the alignment into account? Say A is longer than B, but B is identical to A over some number of bases, e.g. A: AAGGCTT B: AAGGC
Yes, that's correct.
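Well, there is a parameter that reports the query coverage % (for example the qcovs and qcovhsp fields in tabular output). In your example it would be about 71% (5 of 7 bases) if A is the query.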
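To make the difference between identity and coverage concrete, here is a minimal Python sketch using the two sequences from the question. It is a toy calculation, not what BLAST runs internally, and it assumes B aligns without gaps to the first five bases of A. The function names are made up for illustration.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(query_length: int, aligned_query_bases: int) -> float:
    """Percent of the query sequence that is included in the alignment."""
    return 100.0 * aligned_query_bases / query_length


seq_a = "AAGGCTT"
seq_b = "AAGGC"

# Assumption: B matches the first len(B) bases of A, with no gaps.
aligned_a = seq_a[: len(seq_b)]

identity = percent_identity(aligned_a, seq_b)
coverage_a = query_coverage(len(seq_a), len(aligned_a))  # A used as query
coverage_b = query_coverage(len(seq_b), len(seq_b))      # B used as query

print(f"identity: {identity:.1f}%")
print(f"coverage with A as query: {coverage_a:.1f}%")
print(f"coverage with B as query: {coverage_b:.1f}%")
```

So identity alone says nothing about how much of the query was aligned, and the coverage value depends on which sequence you use as the query.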
Since this is BLASTN (and this holds only for BLASTN), you can also use the bit score as a rough proxy for alignment length, because the score of a nucleotide alignment grows roughly linearly with its length.
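A rough sketch of why the score tracks length: for an ungapped nucleotide alignment the raw score is just a sum of per-column match rewards and mismatch penalties, and the bit score is a linear rescaling of that raw score. The reward and penalty values below are illustrative assumptions, so check the ones your BLASTN run actually uses. The function name is hypothetical.

```python
def ungapped_raw_score(
    aligned_query: str,
    aligned_subject: str,
    reward: int = 1,    # assumed match reward, illustrative only
    penalty: int = -2,  # assumed mismatch penalty, illustrative only
) -> int:
    """Raw score of a gap-free alignment: sum of per-column rewards/penalties."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        reward if q == s else penalty
        for q, s in zip(aligned_query, aligned_subject)
    )


# For perfect matches, doubling the alignment length doubles the raw score.
for repeats in (10, 20, 40):
    seq = "ACGT" * repeats
    print(len(seq), ungapped_raw_score(seq, seq))
```

With mismatches or gaps the relationship is only approximate, which is why the bit score is a proxy for alignment length rather than a measure of it.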