I have short FASTA files (~400nt) of differing lengths. I would like to compare/align them and get a similarity metric. I am not interested in the exact alignment or alignment output. I just need one value that shows how similar they are. Perhaps a distance like hamming distance would work. I want to do this pairwise with loads of sequences. Can anyone suggest a bash command line tool?
I haven't tried all the suggestions here, but I came across this BLAST usage:
And this returns
of which percentage identity (74%) seems to be a usable measure of similarity. I am not sure if this is the best metric for global sequence-to-sequence similarity.
It’s definitively not a measure of global sequence similarity.
The
L
inBLAST
stands for “local”.This is broadly the approach a few of us have suggested (use alignment and take an alignment score of choice as your metric), but global alignments are probably better suited to your question (though it would be worth you showing us some examples of your sequences).
Ah yes! Good point. But, does local matter for short sequences of few hundred nucleotides?
It might matter, it might not - “short” is all relative. 400bps might seem short, but its more than long enough to carry an entire promoter sequence and various other untranslated features. How you align these would matter if you were interested in such structures.
It sort of depends what you’re trying to show. What your BLAST results are telling you is that, of the amount of sequence it could locally align (in your case 252 nucleotides of your 400ish), they are ~74% identical.
Now, what this tells me is that only a little over half of your sequence is similar to the other. It says almost nothing about the remaining 150 bases that haven’t been aligned.
If you were only interested in a domain or structure or something that fell within that 252, then you could carry on, but otherwise you are ignoring almost a third of the sequence if you take that metric.
I am comparing human X and Y chromosomes. Y is rearranged and jumbled up. But I have already identified short regions in X that correspond with Y. Now I just need to figure out how similar they are. Here is an example
blastn returns this as 88.3% percentage identity. The lengths are same in this particular example but lengths can vary a bit.
@rmf, BLAST isn’t telling you those 2 sequences are 88% identical.
As my comment above, its telling you how similar the maximally aligning subsequence it could find between the 2 sequences are.
Now, I don’t know if your BLAST results did manage to align the full length of those sequences, but I doubt it.
BLAST is not really the tool you want for this task.
blast is definitively not the right tool for pair wise alignment, it was design to quickly scan thru a db. For proper alignment you'll have to use Smith-Waterman for local alignment or Needleman-Wunsch for global. Both are in the EMBOSS linux tool.
In general this is indeed an OK approach.
You just need to keep in mind that it only reports something in relation to the part that can be aligned! so it might report 80% but only align 50% of the sequence (which is then does clearly something different than on 80% overall similarity)
Are the short regions you already identified of somewhat similar length? (taking your original question in mind it seems not)
They are comparable in lengths. The length difference range from 0 to 83 nt.
In that case I might consider using a global aligner , such as NW from EMBOSS . For a global aligner the rule is that they have to be of similar size not strictly of equal size
An alignment would give you similarity score which you may be looking for
It's a simple question/request but it might turn out more complex than you hoped for I'm afraid.
Eg. a 50bp seq is 100% part of a 100bp sequence : what would be the expected similarity score here? 1 or 0.5?
It's indeed the
that makes this a lot more complicated.
That's why most methods/approaches will make the 'equal length' assumption
Just to clarify, are you also interested in their global similarity, or regions of local similarity? (It sounds like the former at the moment)?