Hello, I’m looking to do feature extraction from sets of sequences, with a minimum of assumptions, for downstream comparison with other sets. For example, given ATGAGGA and TTGGCGTA for category 1, and GGTTGGTT and CCTTAAT for category 2, determine which category AGGAAGEA is in.
What are the usual ways to go about this sort of thing?
How would you extract features that don’t necessarily conform to fixed-size k-mers? An n-mer at one location might be related to an m-mer at some distance, for example.
I’ve looked at strategies such as ‘bag of words’, but they seem unsuited to the problem because, among other things, you don’t even know the dictionary to break the string into in the first place.
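One way around the unknown dictionary, as a minimal sketch rather than a recommended method: treat every overlapping substring over a range of lengths as a "word", so the vocabulary is derived from the data itself. The function names and the length range 2 to 4 are illustrative assumptions; the sequences are the ones from the example above.

```python
from collections import Counter


def kmer_counts(seq, k_min=2, k_max=4):
    """Count every overlapping substring of length k_min..k_max in seq."""
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def pooled_counts(seqs, k_min=2, k_max=4):
    """Sum the k-mer counts over all sequences of one category."""
    total = Counter()
    for s in seqs:
        total.update(kmer_counts(s, k_min, k_max))
    return total


# Sequences taken from the question; the length range is an assumption.
category_1 = ["ATGAGGA", "TTGGCGTA"]
category_2 = ["GGTTGGTT", "CCTTAAT"]

c1 = pooled_counts(category_1)
c2 = pooled_counts(category_2)

# k-mers seen in one category but not the other are candidate features.
only_in_1 = sorted(set(c1) - set(c2))
only_in_2 = sorted(set(c2) - set(c1))
```

This still uses contiguous k-mers, so it does not capture an n-mer related to an m-mer at some distance; it only removes the need to fix a single k or a dictionary up front.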
I would say it wholly depends on the nature of your subsequent downstream comparison. What is the question and with what purpose do you want to do the analysis? Can you clarify? There's a whole body of work on k-mer/motif analysis, and you might not want/need to re-invent the wheel.
For instance, if you want to work with k-mers there is the gkm-SVM software (http://www.beerlab.org/gkmsvm/ ), which works very well in classification. It is based on a gapped k-mer model. However, due to the black-box nature of an SVM, interpretability can be a problem. If you are interested in motif analysis (i.e. transcription factor binding sites), you can use de novo motif finders. Some work on k-mer-based models, others use different approaches. My own software GimmeMotifs (http://gimmemotifs.readthedocs.org) is an example. Widely used programs are Homer and MEME.
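To illustrate what a gapped k-mer feature looks like, here is a minimal sketch. It is not the gkm-SVM implementation, just a simplified enumeration: slide a window over the sequence and keep only a subset of positions, masking the rest. The function name, the window length of 4, the 2 informative positions, and the "-" gap symbol are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, window=4, informative=2, gap="-"):
    """Count gapped k-mers: each window of length `window` contributes one
    feature per choice of `informative` kept positions; the other positions
    are replaced by the gap symbol."""
    counts = Counter()
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        for keep in combinations(range(window), informative):
            feature = "".join(
                w[j] if j in keep else gap for j in range(window)
            )
            counts[feature] += 1
    return counts


# Sequence taken from the question above.
features = gapped_kmer_counts("ATGAGGA")
```

A feature such as "A--G" matches an A followed three positions later by a G regardless of what lies in between, which is one way to relate bases at a distance without fixing a contiguous k-mer.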
I have a set of ChIP peaks. I want to extract the sequences and then use some of them to predict the others; the rest of the peaks will be the evaluation set. I'd like two solutions, or just one if it can meet both criteria: 1. Something that is at least partially interpretable. 2. The best performance (accuracy).
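A minimal sketch of that setup, under stated assumptions: sequences come as (sequence, label) pairs, where label 1 would be a peak and label 0 some background set that you would have to supply yourself. The classifier is a swapped-in simple technique (per-k-mer log-odds weights with a pseudocount, in the style of naive Bayes), chosen only because the weights are directly readable; it is not any of the tools mentioned above. All function names and parameters are hypothetical, and the four toy sequences are the ones from the original question, standing in for real peak data.

```python
import math
import random
from collections import Counter


def kmers(seq, k=3):
    """All overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def split(items, eval_fraction=0.3, seed=0):
    """Shuffle and split into (training set, held-out evaluation set)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_eval = max(1, int(len(items) * eval_fraction))
    return items[n_eval:], items[:n_eval]


def train_log_odds(labelled, k=3, pseudocount=1.0):
    """One weight per k-mer: log P(kmer | label 1) - log P(kmer | label 0)."""
    pos, neg = Counter(), Counter()
    for seq, label in labelled:
        (pos if label == 1 else neg).update(kmers(seq, k))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    return {
        km: math.log((pos[km] + pseudocount) / pos_total)
        - math.log((neg[km] + pseudocount) / neg_total)
        for km in vocab
    }


def predict(weights, seq, k=3):
    """Label 1 if the summed k-mer weights are positive, else 0."""
    score = sum(weights.get(km, 0.0) for km in kmers(seq, k))
    return 1 if score > 0 else 0


def accuracy(weights, labelled, k=3):
    """Fraction of (sequence, label) pairs predicted correctly."""
    correct = sum(predict(weights, s, k) == y for s, y in labelled)
    return correct / len(labelled)


# Toy stand-in data: category 1 as label 1, category 2 as label 0.
data = [("ATGAGGA", 1), ("TTGGCGTA", 1), ("GGTTGGTT", 0), ("CCTTAAT", 0)]
train, held_out = split(data)
weights = train_log_odds(train)
held_out_accuracy = accuracy(weights, held_out)
```

Sorting `weights` by value shows which k-mers push a sequence toward either label, which covers the "partially interpretable" criterion; for best accuracy you would likely compare this baseline against a dedicated tool on the same held-out set.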