Question

Overview questions about K-nearest neighbor classification methods

6.1 years ago

dllopezr ▴ 130

Hi everyone!

Suppose that you have a set of sequences that you know belong to a gene, and also a set of sequences of unknown function, and you want to know whether some of the latter belong to the gene of interest. Well, this is my problem.

I am reading about sequence classification methods, and I am interested in using K-nearest neighbor (KNN) for my work. But there are some concepts that I don't understand yet.

KNN requires a training set with a feature that is compared against the query sequence to compute a distance measure, but what is the feature in the case of sequences? I think the distance could be an alignment score between the query sequence and each of the training sequences. Is this correct?
Is there a program or tool that performs these alignments and assigns the query sequences to their respective groups?
I quote this fragment from a paper: "For each gene family, a self-vs.-self usearch (version 7.0) (Edgar, 2010) (30% global identity cutoff) was then performed to generate a distance matrix between different sequences. A nearest neighbor clustering procedure was then carried out to cluster sequences into different groups. Outlier groups were then inspected and removed for their not clustering with the largest group, even at 30% global identity"

It's not clear whether they used a training set, which tool they used to construct the matrix from the alignments, or which tool they used for the KNN computation.
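To make the KNN idea from my first question concrete, here is a minimal sketch of what I have in mind. It is not the method from the paper. It assumes the distance is 1 minus the fractional identity of a pairwise alignment, and the sequence names, identity values, labels, and the function `knn_classify` are all hypothetical, made up only for illustration.

```python
from collections import Counter


def knn_classify(distances, labels, k=3):
    """Assign a label to one query by majority vote among its k nearest
    training sequences.

    distances: dict mapping training sequence id -> distance to the query
    labels:    dict mapping training sequence id -> class label
    """
    # Sort training ids by distance to the query and keep the k closest.
    nearest = sorted(distances, key=distances.get)[:k]
    # Count the labels of those neighbours and return the most frequent one.
    votes = Counter(labels[seq_id] for seq_id in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical percent identities between one query and five training sequences.
identity = {"g1": 92.0, "g2": 88.5, "g3": 85.0, "o1": 41.0, "o2": 35.5}
labels = {"g1": "gene", "g2": "gene", "g3": "gene", "o1": "other", "o2": "other"}

# Assumed distance: 1 - fractional identity (an illustrative choice, not the only one).
distances = {seq_id: 1.0 - pid / 100.0 for seq_id, pid in identity.items()}

prediction = knn_classify(distances, labels, k=3)
print(prediction)
```

In this sketch the "feature" is never computed explicitly: KNN only needs the distances between the query and the training sequences, which is why an alignment-based distance seems like it could work.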

Is there a tool that constructs a matrix from a file of pairwise alignments?
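For reference, this is roughly the kind of conversion I mean, as a sketch only. It assumes the pairwise results are in a tab-separated file whose first three columns are query id, target id, and percent identity (the column order of BLAST tabular output). The function name `distance_matrix`, the example hits, and the choice to give pairs without a reported hit the maximum distance of 1.0 are my own assumptions.

```python
import csv
import io


def distance_matrix(handle, missing=1.0):
    """Build a symmetric distance matrix from tab-separated pairwise hits.

    Each row is expected to start with: query id, target id, percent identity.
    Pairs with no reported hit get the distance given by `missing`.
    """
    ids = []   # sequence ids in order of first appearance
    hits = {}  # frozenset({a, b}) -> smallest distance seen for that pair
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, target, pid = row[0], row[1], float(row[2])
        for seq_id in (query, target):
            if seq_id not in ids:
                ids.append(seq_id)
        dist = 1.0 - pid / 100.0
        pair = frozenset((query, target))
        hits[pair] = min(dist, hits.get(pair, missing))
    # Fill the full matrix; the diagonal is 0 by definition.
    matrix = [
        [0.0 if a == b else hits.get(frozenset((a, b)), missing) for b in ids]
        for a in ids
    ]
    return ids, matrix


# Hypothetical hits: s1 and s2 are similar, s3 only weakly matches s1.
example = "s1\ts2\t90.0\ns2\ts1\t90.0\ns1\ts3\t40.0\n"
ids, matrix = distance_matrix(io.StringIO(example))
for seq_id, row in zip(ids, matrix):
    print(seq_id, [round(d, 2) for d in row])
```

With a real file, the `io.StringIO(example)` part would be replaced by an open file handle. A matrix like this could then be fed to a clustering or KNN step.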

Thank you for your help!

k-nearest neighbor sequence classification • 717 views

updated 5.9 years ago by Biostar 20 • written 6.1 years ago by dllopezr ▴ 130