Question

Score Protein Variants Based On Frequency Of Aa In Multiple Sequence Alignment

5

14.7 years ago

Tim ▴ 350

For reference, please read this excerpt from "Human non-synonymous SNPs: server and survey" by Vasily Ramensky, Peer Bork, and Shamil Sunyaev:

Profile analysis of homologous sequences. The amino acid replacement may be incompatible with the spectrum of substitutions observed at that position in a family of homologous proteins. PolyPhen identifies homologues of the input sequences via a BLAST (23) search of the NRDB database. The set of aligned sequences with sequence identity to the input sequence in the range 30–94% (inclusive) is used by the new version of the PSIC (position-specific independent counts) software (24) to calculate the so-called profile matrix (http://strand.imb.ac.ru/PSIC/). Elements of the matrix (profile scores) are logarithmic ratios of the likelihood of a given amino acid occurring at a particular site to the likelihood of this amino acid occurring at any site (background frequency). PolyPhen computes the absolute value of the difference between profile scores of both allelic variants in the polymorphic position. PolyPhen also shows the number of aligned sequences at the query position; this may be used to assess the reliability of profile score calculations.

I'd like to calculate something similar to what's described here programmatically (scoring variants based on the frequency of each amino acid at that position in the aligned sequences), but I can't find any implementation of this system.

Does anyone know of a working implementation of this or something similar, that's available either in code or as a web service?

Or is it easy enough to implement something like this ourselves?

multiple-sequence-alignment protein • 4.6k views

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Tim ▴ 350

score 3 · Answer 1 · 2010-04-14

I use such profile matrices, but I don't know of any public implementation; I wrote my own in C++. It does not take long to do.

I create a two-dimensional array "tab[L][20]", where L is the length of the alignment (one row per column of the alignment, one entry per amino acid).

Then I read the sequences of the alignment and count the occurrences of each amino acid in each column. I also count the number of gaps. From these counts I can calculate a log-odds score, as in Fano [1].
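The counting and log-odds steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not the PSIC algorithm and not my C++ code: the function names, the uniform background frequencies, and the pseudocount are all assumptions made for the example, and every sequence is counted with equal weight.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_counts(alignment):
    """Return one Counter per alignment column (residues and gap characters)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(seq[i].upper() for seq in alignment) for i in range(length)]

def log_odds_profile(alignment, background=None, pseudocount=1.0):
    """Build a list of {amino acid: log-odds score} dicts, one per column.

    Assumptions for this sketch: a uniform background unless one is given,
    gaps are ignored, and a pseudocount avoids log(0) for unseen residues.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    profile = []
    for col in column_counts(alignment):
        # Number of non-gap standard residues observed in this column.
        observed = sum(col[aa] for aa in AMINO_ACIDS)
        total = observed + pseudocount
        row = {}
        for aa in AMINO_ACIDS:
            freq = (col[aa] + pseudocount * background[aa]) / total
            row[aa] = math.log(freq / background[aa])
        profile.append(row)
    return profile

def variant_score(profile, position, ref_aa, alt_aa):
    """Absolute difference between the profile scores of two residues.

    `position` is a 0-based alignment column index.
    """
    row = profile[position]
    return abs(row[ref_aa] - row[alt_aa])

if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "A-D", "GCD"]  # made-up example data
    toy_profile = log_odds_profile(toy_alignment)
    print(variant_score(toy_profile, 0, "A", "G"))
```

Once the profile is built, scoring many variants is just repeated dictionary lookups, so thousands of variants against the same alignment are cheap.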

One thing to watch out for is the similarity between the sequences in the alignment. If the sequences are too similar, some amino acids may be over-represented at a position, which biases the statistics.
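One common way to reduce this kind of bias is to down-weight near-duplicate sequences before counting. The sketch below uses Henikoff-style position-based sequence weights as an example; it is a swapped-in technique chosen for illustration, not the weighting that PSIC uses, and the function name and gap handling are assumptions.

```python
from collections import Counter

def henikoff_weights(alignment, gap="-"):
    """Position-based sequence weights, normalized to sum to 1.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct residue types in the column and n is how many sequences
    share this sequence's residue. Gap characters are skipped (an
    assumption of this sketch).
    """
    n_seqs = len(alignment)
    weights = [0.0] * n_seqs
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        counts = Counter(res for res in column if res != gap)
        k = len(counts)
        if k == 0:
            continue  # column contains only gaps
        for s, res in enumerate(column):
            if res != gap:
                weights[s] += 1.0 / (k * counts[res])
    total = sum(weights)
    if total == 0:
        # Degenerate case (all gaps): fall back to equal weights.
        return [1.0 / n_seqs] * n_seqs
    return [w / total for w in weights]

if __name__ == "__main__":
    # Two identical sequences share the weight that one distinct sequence gets.
    print(henikoff_weights(["AAA", "AAA", "GGG"]))
```

When filling the count table, each sequence would then add its weight to the cell for its residue instead of adding 1, so a cluster of nearly identical homologues no longer dominates a column.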

If you want to see how I use profiles, you can read this: FROST: a filter-based fold recognition method

[1] Fano RM. Transmission of information: a statistical theory of communication. Cambridge, MA: MIT Press; 1961.

Ram · Answer 2 · 2010-03-06

2

14.7 years ago

Chris ★ 1.6k

I'm not sure I understand you correctly. If you are looking for a web service that returns the PSIC scoring matrix, why don't you just follow the URL mentioned in the paper's abstract, i.e. http://strand.imb.ac.ru/PSIC/? It leads to an HTML form where you can paste your multiple alignment, and it returns the PSIC matrix. Or did I misunderstand you?

ADD COMMENT • link 14.7 years ago by Chris ★ 1.6k

1

The form on the above page triggers http://strand.imb.ac.ru/PSIC-cgi/run.pl so that Perl script probably has the code you're looking for. Maybe mail the webmaster (vlasov@imb.imb.ac.ru) or the authors of the article for a copy of that code?

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 14.7 years ago by Jeroen Van Goey 2.3k

0

I want to do this programmatically, so that I can run this scoring thousands of times. Doing it manually won't do, and using curl for this seems hackish, unreliable, and sensitive to changes on the server side.

ADD REPLY • link 14.7 years ago by Tim ▴ 350