Hello everyone,
I have a project in which I need to build an algorithm that takes a nucleotide sequence and decides whether it is a splice site or not. It is a classification problem. Using n-grams, we compute the similarity of the test sequence to the positive class and to the negative class, so I have two features for the classifier.
I would like to add a new one. How could I calculate the evolutionary distance between some eukaryotic species (H. sapiens, D. rerio, D. melanogaster, C. elegans, A. thaliana)? I need this feature so that some species (sequences) get a higher weight than others when I calculate the centroids of the negative and positive classes.
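To make the two n-gram features concrete, here is a minimal sketch of one way they could be computed. It assumes overlapping 3-grams, class centroids built by averaging count profiles, and cosine similarity; all of these are illustrative choices rather than my final method, and the function names are made up for this example.

```python
from collections import Counter
import math

def ngram_profile(seq, n=3):
    """Count overlapping n-grams in a nucleotide sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def centroid(profiles):
    """Average a list of n-gram count profiles into one profile."""
    total = Counter()
    for p in profiles:
        total.update(p)
    return {k: v / len(profiles) for k, v in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def features(seq, pos_centroid, neg_centroid, n=3):
    """Return (similarity to positive class, similarity to negative class)."""
    p = ngram_profile(seq, n)
    return cosine(p, pos_centroid), cosine(p, neg_centroid)

# Toy sequences, invented only to show the call pattern.
pos_c = centroid([ngram_profile(s) for s in ["AAGGTAAGT", "CAGGTGAGT"]])
neg_c = centroid([ngram_profile(s) for s in ["CCCCCCCCC", "TTTTTTTTT"]])
f_pos, f_neg = features("AAGGTAAGT", pos_c, neg_c)
```

The pair `(f_pos, f_neg)` is what would be handed to the classifier as the two features.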
There is a source domain and a target domain. In my algorithm, the source domain is used for training and the target domain for testing. For instance, we could have H. sapiens, D. rerio and D. melanogaster as the source domain and only A. thaliana as the target domain.
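As a sketch of how such a distance could feed into the centroid, the snippet below turns each source species' distance to the target species into a weight and computes a weighted mean of the feature vectors. The mapping w = 1 / (1 + d), which gives closer species more weight, is only an assumption for illustration, and the species labels and numbers in the example are placeholders, not measured distances.

```python
def weights_from_distances(dist_to_target):
    """Turn distances into normalized weights; closer species weigh more."""
    raw = {sp: 1.0 / (1.0 + d) for sp, d in dist_to_target.items()}
    total = sum(raw.values())
    return {sp: w / total for sp, w in raw.items()}

def weighted_centroid(vectors, species, weights):
    """Weighted mean of feature vectors, with one species label per vector."""
    acc = [0.0] * len(vectors[0])
    wsum = 0.0
    for vec, sp in zip(vectors, species):
        w = weights[sp]
        wsum += w
        for i, x in enumerate(vec):
            acc[i] += w * x
    return [a / wsum for a in acc]

# Placeholder distances from two hypothetical source species to the target.
w = weights_from_distances({"species_A": 0.0, "species_B": 1.0})
c = weighted_centroid([[0.0], [3.0]], ["species_A", "species_B"], w)
```

With these placeholder numbers the closer species gets twice the weight of the other, so the centroid is pulled toward its vector.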
- Should I find a conserved gene or protein? In that case, which gene or protein is appropriate?
Reading this paper (they take a gene and do a phylogenetic analysis), I realized that I have to choose a gene that is present in all the species under study.
First approach (easy way)
After choosing a gene (FASTA file), we create the multiple sequence alignment with the Clustal program, which also has an option to output the evolutionary distances as a matrix.
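For reference, this is roughly what such a distance matrix contains. The sketch below computes pairwise distances directly from already aligned sequences, using the proportion of differing sites (p-distance) with the Jukes-Cantor correction. It is not Clustal's own calculation, columns with a gap in either sequence are simply skipped, and the sequence names and toy alignment are invented for the example.

```python
import math

def p_distance(a, b):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3); infinite for p >= 0.75."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(aligned):
    """Return (names, matrix) of pairwise Jukes-Cantor distances."""
    names = list(aligned)
    matrix = [[jukes_cantor(p_distance(aligned[i], aligned[j])) for j in names]
              for i in names]
    return names, matrix

# Toy alignment, invented for illustration only.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TACGA"}
names, dm = distance_matrix(toy)
```

Each row of `dm` gives the distances from one sequence to all the others, in the order listed in `names`.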
Second approach (hard way)
Starting from the multiple sequence alignment, we feed the result into the TREE-PUZZLE program, from which we take some values for the PHYLIP program. Where could I find a tutorial for this approach?
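One practical step in this pipeline is getting the alignment into PHYLIP format, which, as far as I understand, is the input format both programs expect. Below is a minimal sketch that writes aligned sequences as strict sequential PHYLIP text (a header with the number of sequences and the alignment length, then names padded or truncated to 10 characters). The function name and the toy alignment are my own; please check each program's documentation for the exact variant it accepts.

```python
def to_phylip(aligned):
    """Format aligned sequences as strict sequential PHYLIP text."""
    lengths = {len(s) for s in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(aligned)} {lengths.pop()}"]
    for name, seq in aligned.items():
        # Strict PHYLIP reserves exactly 10 characters for the name.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

# Toy alignment, invented for illustration only.
text = to_phylip({"H_sapiens": "ACGT", "A_thaliana": "AC-T"})
```

The resulting `text` can be written to a file and used as the input alignment.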
similar problem here
Would TimeTree be interesting to you?
Absolutely, TimeTree is a solution to my problem. There are some similar works that have already used TimeTree to solve a similar problem. However, I would like to solve this problem using phylogenetic analysis.