Question

Feature/Motif Extraction from Sequences?


7.8 years ago

bn ▴ 30

Hello, I’m looking to do feature extraction from sets of sequences, with a minimum of assumptions, for downstream comparison with other sets. For example, given ATGAGGA and TTGGCGTA for category 1, and GGTTGGTT and CCTTAAT for category 2, determine which category AGGAAGEA belongs to.

What are the usual ways to go about this sort of thing?

How would you extract features that don’t necessarily conform to fixed-size k-mers? An n-mer at one location might be related to an m-mer at some distance, for example.

I’ve looked at strategies such as ‘bag of words’, but they seem unsuited to the problem because, among other things, you don’t know in advance the dictionary of words to break the string into.

sequence genome SNP alignment gene • 1.9k views

updated 7.8 years ago by simon.vanheeringen ▴ 280 • written 7.8 years ago by bn ▴ 30


I would say it wholly depends on the nature of your subsequent downstream comparison. What is the question and with what purpose do you want to do the analysis? Can you clarify? There's a whole body of work on k-mer/motif analysis, and you might not want/need to re-invent the wheel.

For instance, if you want to work with k-mers, there is the kmer-SVM software (http://www.beerlab.org/gkmsvm/), which works very well for classification. It is based on a gapped k-mer model. However, because an SVM is something of a black box, interpretability can be a problem. If you are interested in motif analysis (i.e. transcription factor binding sites), you can use de novo motif finders. Some are based on k-mer models, others use different approaches. My own software GimmeMotifs (http://gimmemotifs.readthedocs.org) is an example. Widely used programs are Homer and MEME.
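To give a rough idea of what a gapped k-mer feature looks like, here is a minimal sketch in plain Python. It is only an illustration, not the kmer-SVM implementation: the function name `gapped_kmer_counts` and the parameters `l` (window length) and `k` (number of informative positions) are made up for this example.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=3):
    """Count gapped k-mers in seq (illustrative helper, not a library API).

    Every window of length l is expanded into all patterns that keep
    k positions and mask the remaining l - k positions with '.'.
    """
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        # Choose which k of the l positions stay informative.
        for keep in combinations(range(l), k):
            kept = set(keep)
            feature = "".join(
                base if i in kept else "." for i, base in enumerate(window)
            )
            counts[feature] += 1
    return counts


# One of the toy sequences from the question.
features = gapped_kmer_counts("ATGAGGA", l=4, k=3)
```

Because some positions are masked, two sites that differ at one base still share features, which is one way to get around fixing an exact dictionary of words up front.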

7.8 years ago by simon.vanheeringen ▴ 280


I have a set of ChIP peaks. I want to extract the sequences and then use some of them to predict other sequences; the rest of the peaks will be the evaluation set. I'd like two solutions, or just one if it can meet both criteria: 1. Something that is at least partially interpretable. 2. The best performance (accuracy).

7.8 years ago by bn ▴ 30

score 0 · Answer 1 · 2017-09-21


7.8 years ago

simon.vanheeringen ▴ 280

Given your clarification, I think kmer-SVM does exactly what you want. You can train the model and evaluate its performance using cross-validation. The SVM model can then be used to predict new sequences. The k-mers will have associated weights, which you can cluster, match to known motifs, etc.
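As a toy illustration of the "train, then inspect k-mer weights" idea, here is a small sketch in plain Python. Note that it swaps in a simple per-k-mer log-odds score instead of an SVM, purely to keep the example self-contained; the function names, `k=3` and the pseudocount are assumptions made for the example, and the sequences are the toy ones from the question.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count contiguous k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_weights(pos_seqs, neg_seqs, k=3, pseudo=1.0):
    """Per-k-mer log-odds weight; positive means enriched in pos_seqs."""
    pos, neg = Counter(), Counter()
    for s in pos_seqs:
        pos.update(kmer_counts(s, k))
    for s in neg_seqs:
        neg.update(kmer_counts(s, k))
    vocab = set(pos) | set(neg)
    # The pseudocount keeps k-mers seen in only one class finite.
    pos_total = sum(pos.values()) + pseudo * len(vocab)
    neg_total = sum(neg.values()) + pseudo * len(vocab)
    return {
        km: math.log((pos[km] + pseudo) / pos_total)
        - math.log((neg[km] + pseudo) / neg_total)
        for km in vocab
    }


def score(seq, weights, k=3):
    """Sum of k-mer weights; k-mers unseen in training contribute 0."""
    return sum(weights.get(km, 0.0) * n for km, n in kmer_counts(seq, k).items())


# Toy training data from the question (category 1 vs category 2).
cat1 = ["ATGAGGA", "TTGGCGTA"]
cat2 = ["GGTTGGTT", "CCTTAAT"]
weights = train_weights(cat1, cat2)

# The most category-1-like k-mers, which could be matched to known motifs.
top_kmers = sorted(weights, key=weights.get, reverse=True)[:5]
```

With real peaks you would fit the weights on the training subset only and compute accuracy on the held-out peaks (a positive score predicting category 1), repeating over several splits for cross-validation.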
