Protein Sequence Descriptors
13.0 years ago
Funuser ▴ 10

I am looking for a way to describe protein sequences for a neural network. However, I am still missing descriptors I can use. Do you know of free descriptors or, better, a descriptor package I could use?

Edit: My actual problem is this: different parts of the sequence have an influence on the protein. I want to go over the sequence and predict this influence. The influence can be an assay result or really anything. In the end, this should yield a model that can predict the influence for a new sequence. As I am only doing this to get acquainted with Weka and similar tools, I don't really have an idea what to use as an assay :).

sequence analysis protein • 5.4k views
13.0 years ago
Chris ★ 1.6k

I assume you're talking about feature extraction? I once wrote a Python tool that turns sequence-based features (predicted secondary structure, solvent accessibility, evolutionary information, predicted PPI interfaces, Pfam data, biochemical propensities...) into position-specific, normalized numerical features. Output formats are Weka ARFF and libsvm/liblinear-compliant datasets. However, this tool depends heavily on predictprotein [1], a command-line wrapper for all kinds of sequence-based predictors developed in our group. It is available as a machine image (complete Linux OS) or as Debian packages. Let me know if this sounds appealing to you.

[1] http://predictprotein.org/
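
To make "position-specific features in ARFF or libsvm format" a bit more concrete, here is a minimal sketch. It is not the tool described above and does not call predictprotein; it only one-hot encodes a sliding window around each residue, a deliberately simple stand-in for the richer predicted features. The function names (`window_features`, `to_libsvm_line`, `to_arff`) and the toy labels are made up for illustration.

```python
# Illustrative sketch only: NOT the tool from the answer, no predictprotein involved.
# All function names here are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_features(sequence, half_width=1):
    """One-hot encode a window around each residue.

    Window positions that fall outside the sequence (or hold a
    non-standard residue) are left as an all-zero block.
    """
    rows = []
    for center in range(len(sequence)):
        row = []
        for offset in range(-half_width, half_width + 1):
            pos = center + offset
            residue = sequence[pos] if 0 <= pos < len(sequence) else None
            row.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
        rows.append(row)
    return rows

def to_libsvm_line(label, features):
    """Sparse libsvm line: 'label index:value ...', 1-based indices, zeros omitted."""
    parts = [str(label)]
    parts.extend(f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0.0)
    return " ".join(parts)

def to_arff(relation, rows, labels):
    """Dense ARFF text with numeric attributes and a nominal class attribute."""
    lines = [f"@RELATION {relation}"]
    lines += [f"@ATTRIBUTE f{i} NUMERIC" for i in range(1, len(rows[0]) + 1)]
    classes = ",".join(sorted({str(label) for label in labels}))
    lines.append(f"@ATTRIBUTE class {{{classes}}}")
    lines.append("@DATA")
    for row, label in zip(rows, labels):
        lines.append(",".join(f"{v:g}" for v in row) + f",{label}")
    return "\n".join(lines)

if __name__ == "__main__":
    rows = window_features("ACD", half_width=1)
    labels = [0, 1, 0]  # toy per-residue labels, invented for the example
    for label, row in zip(labels, rows):
        print(to_libsvm_line(label, row))
    print(to_arff("demo", rows, labels))
```

Each residue becomes one training example, so per-residue labels (for instance an assay readout mapped to positions) are assumed to exist; where those come from is a separate problem.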


Sounds good, actually. I would love to play around with Weka. How could I get this tool chain running?


Try getting the predictprotein image running. Let me know when you have succeeded and contact me again (see my webpage).


Thanks a lot, will do once I've got it running :)


Hi Chris!

The Python tool you mention seems interesting, and I would like to explore it for my work on disease/druggability gene predictions. I would appreciate it if you could help me get started with using the tool in batch mode, because my current set consists of ~20k proteins with unique UniProt IDs.

13.0 years ago

You can use the AAindex database to derive descriptors from a protein sequence. AAindex provides amino acid indices, substitution matrices and pairwise contact potentials.

Background on the amino acid index, from AAindex:

AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. AAindex consists of three sections now: AAindex1 for the amino acid index of 20 numerical values, AAindex2 for the amino acid mutation matrix and AAindex3 for the statistical protein contact potentials. All data are derived from published literature.

The manuscript describing the current version of AAindex is available here.

The current version (ver. 9.1) includes 544 amino acid indices, 94 amino acid mutation matrices and 47 contact potential matrices.

You can use these data as a normalized score for the whole protein chain or use them to derive hybrid features. Please refer to the following papers, which used AAindex-derived features/descriptors to develop Support Vector Machine and Random Forest based machine learning models for the prediction of 3D domain swapping.
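
As a rough illustration of the "normalized score for the whole protein chain" idea, the sketch below averages a single amino acid index over a sequence. It uses the Kyte-Doolittle hydropathy scale (AAindex1 entry KYTJ820101) with values typed in by hand, so please check them against the database entry before relying on them. Parsing the AAindex flat files is not shown, and the helper names (`normalize_index`, `chain_score`) are hypothetical.

```python
# Sketch under assumptions: values hand-copied from the Kyte-Doolittle scale
# (AAindex1 KYTJ820101); verify against the database before real use.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def normalize_index(index):
    """Min-max scale an amino acid index so its values lie in [0, 1]."""
    lo, hi = min(index.values()), max(index.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in index.items()}

def chain_score(sequence, index):
    """Mean index value over the chain; residues missing from the index are skipped."""
    values = [index[aa] for aa in sequence.upper() if aa in index]
    if not values:
        raise ValueError("sequence contains no residues covered by the index")
    return sum(values) / len(values)

if __name__ == "__main__":
    normalized = normalize_index(KYTE_DOOLITTLE)
    # One number per protein per index; repeating this for many indices
    # gives a fixed-length descriptor vector for the whole chain.
    print(chain_score("MKTAYIAKQR", normalized))
```

Repeating this over several indices gives one fixed-length vector per protein, which is the form most Weka classifiers expect. Which indices are actually informative for a given task is something the sketch does not address.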


The question now is: how do I turn this into a model I can use?


Funuser: IMHO, that should be a separate question.
