Question

Machine learning features from nucleotide sequences

2

8.3 years ago

rajeshkumar_vinod ▴ 30

What features can be extracted from nucleotide sequences for machine learning? For example, GC content, dinucleotide frequency, etc.

Machine learning • 4.9k views

updated 8.3 years ago by Khader Shameer 18k • written 8.3 years ago by rajeshkumar_vinod ▴ 30

2

Good question, but remember that a tool looking for a job rarely ends up doing that job well.

8.3 years ago by John 13k

2

8.3 years ago

ebrahimiet ▴ 50

In the following papers, we used a range of nucleotide features:

Unravelling evolution of Nanog, the key transcription factor involved in self-renewal of undifferentiated embryonic stem cells, by pattern recognition in nucleotide and tandem repeats characteristics. Gene, Volume 578, Issue 2, 10 March 2016, Pages 194–204.

Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms. BMC Research Notes 2014, 7:565. DOI: 10.1186/1756-0500-7-565

1

8.3 years ago

shenwei356 8.7k

k-mer frequencies, a very important one.
Secondary structure.

0

It should not be related to structure, only to things we can get from the sequence itself. And for k-mers, how should I choose which k is best for me?

8.3 years ago by rajeshkumar_vinod ▴ 30

0

You may try different values of k. In some fields, secondary structure may help.

8.3 years ago by shenwei356 8.7k
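
A minimal sketch of what trying different values of k involves. The number of possible k-mers grows as 4^k, while a sequence of length L contains only L - k + 1 windows, so feature vectors become sparse as k increases. The function name `kmer_space_summary` and the example sequence are made up for illustration; in practice k would usually be chosen by comparing model performance on held-out data.

```python
def kmer_space_summary(seq, max_k=6):
    """For k = 1..max_k, return (k, possible k-mers, distinct k-mers seen in seq)."""
    seq = seq.upper()
    summary = []
    for k in range(1, max_k + 1):
        # Slide a window of width k over the sequence and collect distinct k-mers.
        observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        summary.append((k, 4 ** k, len(observed)))
    return summary


# Hypothetical toy sequence, only to show the shape of the output.
for k, possible, seen in kmer_space_summary("ACGTACGTTTGACCA", max_k=4):
    print(f"k={k}: {seen} of {possible} possible k-mers observed")
```

When the observed count is a small fraction of the possible count across most of your sequences, that k is probably too large for the amount of data you have.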

0

8.3 years ago

WouterDeCoster 47k

I would consider transcription factor binding consensus motifs an interesting feature, among others such as promoters, polyadenylation signals, conservation across species, etc. But these meta-features (I just made that term up) need an external annotation, so maybe that's not what you're looking for.

What is the purpose of your analysis?

0

I need to do predictions.

8.3 years ago by rajeshkumar_vinod ▴ 30

0

Aaaaah predictions. That's oddly specific.

8.3 years ago by WouterDeCoster 47k

0

8.3 years ago

O.rka ▴ 740

VizBin is probably my favorite. It uses k-mer counts with a t-SNE algorithm to cluster contigs into bins of organisms, and is used for binning organisms out of a metagenome: http://claczny.github.io/VizBin/

By machine learning, do you mean prediction or clustering?

0

Predictions. I have a very interesting problem that I can't discuss right now.

8.3 years ago by rajeshkumar_vinod ▴ 30

score 3 · Accepted Answer · 2016-08-14

There are multiple ways to compile your features:

1) Knowledge-based approach: here you would use only a limited set of features that have a direct influence on your prediction/classification/learning task. The feature set will be small, and you won't be able to add new knowledge to the field. See an example where we used a subset of features that we assumed to have a role in 3D domain swapping.

2) Data-driven approach: you can compile all the features that you can gather from your nucleotides (DNA or RNA?) and test them using a rigorous feature selection method. See an example where we used the entire set of features together with hybrid features (combining multiple features) to predict 3D domain swapping.

3) Feature engineering/representation learning: you can take either of the above feature sets and use deep neural encoding methods; here the algorithm engineers the features (NN, RBM, PCA, LSTM, etc.). This approach is more applicable when you have large dataset(s) and are not primarily interested in which features contribute to your predictive model, as you would be for feature selection or biological inference.
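
As a rough illustration of the filtering step in approach 2, the sketch below drops features that barely vary across samples. A variance filter is only one of the simplest possible filters and stands in here for the more rigorous selection methods mentioned above. The function name `drop_low_variance`, the threshold, and the toy feature table are all assumptions made for this example.

```python
from statistics import pvariance


def drop_low_variance(rows, names, threshold=1e-4):
    """Keep only the feature columns whose variance across samples exceeds threshold.

    rows  -- list of samples, each a list of feature values
    names -- feature names, in the same order as the columns
    """
    kept = [j for j in range(len(names))
            if pvariance([row[j] for row in rows]) > threshold]
    filtered_rows = [[row[j] for j in kept] for row in rows]
    return filtered_rows, [names[j] for j in kept]


# Made-up feature values for three sequences (illustration only).
names = ["gc_content", "AA", "AT"]
rows = [
    [0.5, 0.1, 0.25],
    [0.5, 0.3, 0.25],
    [0.5, 0.2, 0.26],
]
filtered, kept_names = drop_low_variance(rows, names)
print(kept_names)
```

A filter like this ignores the labels entirely, so it would normally be followed by a supervised selection step evaluated with cross-validation.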

Features you could extract include:

Counts of individual bases (A, T, G, C) and of di- and trinucleotides, etc.
k-mer counts (see previous answers)
Physicochemical properties of your sequences
Evolutionary scores (example here)
Mutation/substitution scores (GERP, PhyloP, etc.)
Annotation-based features (position within the gene structure (exon/intron), coding or non-coding, etc.)
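
The sequence-only features in this list (base counts, k-mer counts, and the GC content mentioned in the question) can be computed with a few lines of standard-library Python. This is a minimal sketch: the function names and the example sequence are made up for illustration, and windows containing ambiguous bases such as N are simply skipped, which is one possible convention among several.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def gc_content(seq):
    """Fraction of G and C among the unambiguous (A/C/G/T) bases in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in BASES)
    return (counts["G"] + counts["C"]) / total if total else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer, in a fixed alphabetical order."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Skip windows containing anything other than A, C, G or T.
    counts = Counter(w for w in windows if set(w) <= set(BASES))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product(BASES, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def sequence_features(seq, max_k=2):
    """GC content plus k-mer frequencies for k = 1..max_k, as one feature dict."""
    features = {"gc_content": gc_content(seq)}
    for k in range(1, max_k + 1):
        features.update(kmer_frequencies(seq, k))
    return features


# Hypothetical example sequence.
features = sequence_features("ATGCGCGATATTAGC", max_k=2)
print(len(features), "features")
```

Because every possible k-mer gets an entry even when its count is zero, all sequences produce vectors of the same length and order, which is what most machine learning libraries expect as input.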

PS: As one of the other answers points out, it all depends on your prediction problem.