Feature extraction from protein sequences for machine learning classification
10.4 years ago
insilico123 ▴ 10

How can I extract features from protein sequences so that they can be converted into vectors for training a machine learning model? In some papers I found methods such as AAindex and PSSM being used to build training data, but I was unable to find a detailed description of the methods behind them. Please suggest some papers or links that could be helpful.

machine-learning python feature-extraction • 8.4k views

In the literature I found the following article:

VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines

It uses auto cross covariance (ACC). I have written the following Python code to calculate it. Please let me know if it is working correctly.

http://biotoolsinsilico.blogspot.in/2014/07/auto-cross-covariance-python.html

import numpy as np

# The z1, z2 and z3 descriptors (z-scales) represent the protein sequence.
# j indexes the z-scales (j = 1, 2, 3),
# n is the number of amino acids in the sequence,
# i is the amino acid position (i = 1, 2, ..., n),
# l is the lag (l = 1, 2, ..., L); a short range of lags is used (L = 5).
Z = np.random.rand(3, 80)  # placeholder values; replace with real z-scale values
print(Z)
#Z = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
n = Z.shape[1]  # sequence length (no "- 1": range() below already stops in time)
print(n)
max_lag = 5

# Auto covariance: sum over i of Z[j,i] * Z[j,i+l], divided by (n - l)
column = []
for j in range(0, 3):
    row = []
    for l in range(1, max_lag + 1):  # lags 1..5, not 0..4
        summ = 0
        for i in range(0, n - l):
            # pair position i with position i + l (the lag), not i + 1
            rightsum = (Z[j, i] * Z[j, i + l]) / (n - l)
            summ = summ + rightsum
        row.append(summ)
    column.append(row)

R = np.array(column)
print(R)

# Cross covariance: same sum, but between two different z-scales j and k
ja = [0, 1, 2, 0, 1, 2]
ka = [1, 0, 0, 2, 2, 1]
column = []
for j, k in zip(ja, ka):
    row = []
    for l in range(1, max_lag + 1):
        summ = 0
        for i in range(0, n - l):
            rightsum = (Z[j, i] * Z[k, i + l]) / (n - l)
            summ = summ + rightsum
        row.append(summ)
    column.append(row)

C = np.array(column)
print(C)
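
For comparison, here is a minimal vectorized sketch of ACC as described in the comments above, where lag l pairs position i with position i + l and each sum is divided by (n - l). The function name `acc_features` and the random input are illustrative assumptions, not part of VaxiJen. Real z-scale values would have to be substituted for the placeholder matrix.

```python
import numpy as np

def acc_features(Z, max_lag=5):
    """Return auto and cross covariance terms for a (scales x positions) matrix.

    For each lag, M[j, k] = sum_i Z[j, i] * Z[k, i + lag] / (n - lag).
    Diagonal entries (j == k) are auto covariances, the rest cross covariances.
    """
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[1]
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    feats = []
    for lag in range(1, max_lag + 1):
        M = Z[:, :n - lag] @ Z[:, lag:].T / (n - lag)
        feats.append(M.ravel())
    return np.concatenate(feats)

rng = np.random.default_rng(0)
Z = rng.random((3, 80))      # placeholder values standing in for z-scales
x = acc_features(Z)
print(x.shape)               # (45,) = 3 scales x 3 scales x 5 lags
```

Because the output length depends only on the number of scales and lags, sequences of different lengths map to vectors of the same size.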
10.4 years ago
Quak ▴ 520

The features you want to extract fall into two groups: 1) sequence-based features, and 2) features extracted from the predicted structure.

Amino acid composition, amino acid properties, amino acid distribution, etc. are in group one. There are mainly two R packages for these: seqinr, and BioSeqClass from Bioconductor. I attach a table from my thesis which summarizes this topic.
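
Since the question is tagged python, here is a minimal Python sketch of two group-one features: amino acid composition and the mean of one amino acid property. It does not use the R packages named above. The function names are hypothetical, and the Kyte-Doolittle hydropathy scale is just one example property; any other per-residue scale (for example an AAindex entry) could be swapped in.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as an example property scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition(seq):
    """Fraction of each of the 20 standard amino acids (other symbols ignored)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

def mean_property(seq, scale):
    """Average of a per-residue property over the sequence."""
    values = [scale[a] for a in seq.upper() if a in scale]
    if not values:
        raise ValueError("sequence contains no residues covered by the scale")
    return sum(values) / len(values)

def featurize(seq):
    """Fixed-length feature vector: 20 composition values + 1 property mean."""
    return composition(seq) + [mean_property(seq, KD)]

vec = featurize("MKTAYIAKQRQISFVKSHFSRQ")
print(len(vec))   # 21
```

The vector has 21 entries regardless of sequence length, so it can be fed directly to most classifiers.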

I would recommend reading this paper; its method is implemented in the BioSeqClass package.

Prediction of protein folding class using global description of amino acid sequence. PNAS, 92(19):8700-8704, 1995 (BioSeqClass - Bioconductor package)
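
As a rough Python illustration of the "global description" idea, the sketch below recodes a sequence into three classes and computes composition and transition frequencies. The three-way hydrophobicity grouping is the one commonly quoted for this kind of descriptor and is an assumption here; check it against the paper. The distribution descriptor is left out, and the function names are hypothetical. This is not the BioSeqClass implementation.

```python
# Assumed grouping by hydrophobicity: 1 = polar, 2 = neutral, 3 = hydrophobic.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
LOOKUP = {aa: g for g, members in GROUPS.items() for aa in members}

def encode(seq):
    """Recode a sequence into class labels, skipping non-standard symbols."""
    return "".join(LOOKUP[a] for a in seq.upper() if a in LOOKUP)

def composition_transition(seq):
    """Return 3 composition values followed by 3 transition values."""
    code = encode(seq)
    n = len(code)
    if n < 2:
        raise ValueError("need at least two standard residues")
    # Composition: fraction of residues in each class.
    comp = [code.count(g) / n for g in "123"]
    # Transition: fraction of adjacent pairs that switch between two classes,
    # counted in either direction.
    pairs = list(zip(code, code[1:]))
    trans = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
        trans.append(hits / (n - 1))
    return comp + trans

print(composition_transition("RRGGCC"))
```

Repeating this for several properties (each with its own three-class grouping) and concatenating the results gives a fixed-length vector per sequence.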


Hi,

I am trying to get a secondary structure prediction in R. I ran predictPROTEUS from BioSeqClass:

PROTEUS = predictPROTEUS(proteinSeq[1:2],proteus2.organism="euk")
Error in file(file, "rt") : cannot open the connection
In addition: Warning messages:
1: running command 'perl C:\Users\D58B~1\AppData\Local\Temp\Rtmpu8B2vI\file1728128132c4.pl' had status 127
2: In file(file, "rt") :
  cannot open file 'C:\Users\D58B~1\AppData\Local\Temp\Rtmpu8B2vI\file1728128132c4.proteus2': No such file or directory

Any suggestions?


Hi Quak

I'm just starting on this topic and want to do something similar. I'm working in Python, writing descriptors for amino acid sequences. I saw the table from your thesis. My question is about your data, because my data consists of many antibody sequences. Was your data heterogeneous? In particular, how does the length of the sequences affect the computation?


Mine were enzyme families, so within families the sequences are homogeneous, but across families they are (relatively) heterogeneous.

If your sequences are homogeneous, that means the biological functions are hidden in subtle amino acid differences. In other words, most of the features would be redundant. But you might be able to align them all and see what those subtle differences are.

If the sequences are heterogeneous, you will have an easier life, since the features are not redundant.

I don't think the length of the sequence matters unless you want to predict the structure.
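
One way to look for those subtle differences in aligned, homogeneous sequences is to score each alignment column by its Shannon entropy and keep only the columns that vary. This is a minimal sketch under the assumption that the sequences are already aligned to equal length (gaps count as an ordinary symbol). The function names and the toy alignment are illustrative.

```python
import math
from collections import Counter

def column_entropy(aligned):
    """Shannon entropy (bits) of each column of an equal-length alignment."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for column in zip(*aligned):
        total = len(column)
        counts = Counter(column)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

def variable_positions(aligned, min_entropy=0.0):
    """0-based indices of columns whose entropy exceeds min_entropy."""
    return [i for i, h in enumerate(column_entropy(aligned)) if h > min_entropy]

alignment = ["ACDE", "ACDF", "AGDE", "ACDE"]   # toy example, not real data
print(variable_positions(alignment))           # [1, 3]
```

Features can then be built from the variable columns only, which avoids the redundancy of fully conserved positions.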


The "a table from my thesis" link in the first comment is not working. Please send me the table: allmotog@gmail.com. Thanks in advance.

9.1 years ago

The usefulness of my answer depends on how many different proteins you are interested in. For single proteins, you can generate circular graphs of your sequence using I-PV. You can then extract features either by chemical property or by directly choosing amino acids.

In the first example I extract the aromatic residues from the sequence, 50 amino acids per line. Watch it below:

http://i-pv.org/gifs/featureExtraction1.gif

In the second example I first select some amino acids to display on the text track, then make the font size a bit bigger. Then I show them on the scatter track underneath by clicking on "sequence display" in the drop-down menu. Finally I extract these features based on sequence, 100 amino acids per line. Here is how I did it:

http://i-pv.org/gifs/featureExtraction2.gif
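
For readers who prefer a script, a similar selection can be sketched in Python: keep a chosen set of residues, mask everything else, and wrap the result at a fixed width. This is not I-PV's code or output format. The function names, the mask character, the example sequence, and the choice of F, W and Y as the aromatic set are assumptions made for illustration.

```python
AROMATIC = set("FWY")   # assumed aromatic set; some definitions also include H

def extract_residues(seq, keep, width=50, mask="-"):
    """Mask residues not in `keep` and split the result into lines of `width`."""
    masked = "".join(a if a in keep else mask for a in seq.upper())
    return [masked[i:i + width] for i in range(0, len(masked), width)]

def residue_positions(seq, keep):
    """1-based positions of the residues in `keep`."""
    return [i for i, a in enumerate(seq.upper(), start=1) if a in keep]

seq = "MFWAYKLF"   # made-up example sequence
for line in extract_residues(seq, AROMATIC, width=50):
    print(line)
print(residue_positions(seq, AROMATIC))
```

The position list can be turned into simple features, such as the count or spacing of the selected residues.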

I hope this helps,

Good luck,
