Question

Get Pairwise Sequence Identity

4

Entering edit mode

14.4 years ago

Chris ★ 1.6k

Hey,

for some project I need to run pairwise sequence alignments (local ones, i.e. Smith-Waterman) on protein sequences in order to produce a similarity matrix based on the percentage of identical aligned residues. Now, one approach could be to use blast for that: First create the blast database out of the fastafile that contains all sequences to be pairwisely compaired (using formatdb). Then, for each sequence in the file run blast against that database. But that wouldn't return me alignments against _all_ sequences but rather the ones below the given e-value. Setting the e-value to some huge number doesn't seem to be the proper solution to me (which e-value to use? Besides, when setting it too high, the returned alignments exceed the number of available sequences in the database.). Could there be another solution with blast? I'd rather stick with blast if possible before using some other not so popular aligner.

Thanks, Chris

blast alignment • 13k views

ADD COMMENT • link 13.9 years ago by Chris ★ 1.6k

score 7 · Answer 1 · 2010-06-15

I think you will have to use a global sequence alignment per protein pair. BLAST is local, it will find smaller chucks that match within a pair of proteins--that is why you get more alignments than sequences in the database. You could use BLAST and set the e-value high as you suggest, but then you'll need some heuristic to decide which HSP / chunk to take from each pair.

The EMBOSS package has an executable called needle which does Needleman-Wunsch global sequence alignment. BioPython has a wrapper for that and other aligners. This also provides a command-line interface and a python API to Needleman-Wunsch so you can get and score alignments.

score 5 · Answer 2 · 2010-06-15

Protein alignments of non-related / not similar proteins will give you most likely nonsensical values. Aligners almost never throw a towel, and will give you an output even if you use heamoglobine vs collagen. So if your set of proteins is not pre-screened, you should have some threshold for rejecting random noise comparisons. Blastp score is a good place to look first.

If you go with a non-blast aligners, then (according to my more phylogeny-oriented friend), the good rule is to perform two alignments, one regular (ABCDE vs PQXYZ) and one with reversed protein sequences (EDCBA vs ZYXQP). Only the residues aligning in both forward and reverse alignments are reliable.

EDIT: clarified wording

score 4 · Answer 3 · 2010-06-15

If I understood correctly, you want to make a matrix allProteins vs allProteins with "distances" in each cell. And you want to use identity %. Maybe you could use "blast two sequences" (individually aligning every pair of sequences) instead of aligning against a database.

Another suggestion. Why, instead, don't you produce a distance matrix using a phylogenetic program like Phylip? PROTDIST should do. With the advantage that instead of using similarity % you also take into account amminoacid substitution rates...

Ram · Answer 4 · 2010-06-15

Are you sure your problem is well defined? Local pairwise alignments with blast may be not the solution you are looking for.

When you calculate the local alignment between two sequences, it means that you are looking for small regions which are similar between the two, and that doesn't give you a measure of how much two sequences are distant. For example have a look at these two sequences:

>seq1
cacaTTTTTTTcgt
>seq2
cgacagTTTTTTTcgatcaggacTTTTTTTcagcta

What will you obtain if you calculate the pairwise local alignment between the two? is it a measure of the distance between the two sequences?

I would try with global alignment instead. I am not sure of which option you can pass to blast to force it to calculate the global alignment between all the sequeces, but maybe it will be enough to give a small cutoff.

score 1 · Answer 5 · 2010-06-16

1

Entering edit mode

14.4 years ago

Chris ★ 1.6k

Thanks all for your contributions. It's indeed a local alignment I want. Imagine for example two proteins, each containing two domains: Protein 1 contains domains A,B, protein 2 contains B,C. Now, when using a global alignment that would try to align over the whole sequences which doesn't make sense in my specific case. A local alignment would align only both regions B. As for the similarity of both, this would only be the first step. My goal is to calculate the similarity by means of hssp values which basically relates sequence identity with alignment length, as proposed in some papers (e.g. Sander, Schneider 1991, Proteins or Rost 1999 Protein Engineering). I'm now using the 'water' aligner from the emboss suite.

ADD COMMENT • link 14.4 years ago by Chris ★ 1.6k

0

Entering edit mode

as I was saying you in my answer, you may have the case of a protein in which one of the two has a duplicated domain; to continue your example, you could have A,B and A,A,B. In that case, the local alignment doesn't give you any useful information on the distance between proteins. In any case, I agree with darked89 in that you have to do a pre-filtering of the proteins before, otherwise your comparisons don't make sense. The distance you want to calculate is a measure of how many mutations are needed to one seq to anotehr, but if they don't have a common origin, you can't do that.

ADD REPLY • link 14.4 years ago by Giovanni M Dall'Olio 28k