I am learning to script in Python and am using Python 3.6. I have been working with DNA/protein sequence files in three different formats: PHYLIP (.phy), Clustal (.aln) and FASTA (.fas). I want to use these files to cluster the sequences one at a time so that the number of changes is minimised: whichever sequence needs the fewest changes is clustered next, i.e. the most similar sequences are joined first. In the end I want a tree-like representation, or output in Newick format.
What I need to know is which strategy I should use to cluster the sequences. Should I use a similarity matrix? If so, what formula is it based on for nucleotide/protein sequences? And if I do not generate a distance matrix, as the neighbor-joining method does, what would be the simplest strategy for clustering two sequences together?
Really, I need to learn these stages: how an MSA is used to generate a tree-like structure. If no distance formula is used to join two sequences, is there some other basis for joining them? What would be the simplest approach?
Neighbour joining is pretty much the simplest; that's why I suggested it.
There are many algorithms for tree construction. You can find all of the algorithms themselves on Wikipedia etc. Just google them, or get yourself a good textbook on bioinformatics algorithms.
Many will use a distance matrix, which is exactly what it sounds like, but different software will implement different methods.
Would it be wise to use a distance matrix with a method that does not involve one? For example, if tree reconstruction is character-based, i.e. it uses the sequences directly to reconstruct a tree with evolutionary meaning, is a distance matrix still the step you would recommend?
I don't really know what you're asking. Distance matrices are a tried-and-tested method, and off the top of my head I can't think of how else you would be able to do this task?
You have to score the differences between all your sequences, and store those scores somewhere, so where else if not a distance matrix?
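As a minimal sketch of that scoring step, assuming the sequences are already aligned to equal length: the p-distance between two sequences is the fraction of compared columns at which they differ. The function names and the toy alignment below are made up for illustration, and skipping columns that contain a gap is just one possible convention.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of gap-free aligned columns at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: ignore columns where either sequence has a gap
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return mismatches / compared


def distance_matrix(alignment):
    """Return a symmetric nested dict: matrix[name_a][name_b] -> p-distance."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical toy alignment, not real data.
toy_alignment = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTACGA",
    "seqC": "ACGAACCA",
    "seqD": "TCGAACCA",
}
matrix = distance_matrix(toy_alignment)
for name in toy_alignment:
    print(name, [round(matrix[name][other], 3) for other in toy_alignment])
```

The resulting matrix is what NJ or UPGMA would take as input; other distance formulas (for example ones that correct for multiple substitutions) can be swapped in for p_distance without changing the rest.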
I don't know what you mean by this - but perhaps that's just me not being completely familiar with the area. Are you talking about algorithms like Bayesian Inference/molecular clocks etc, to reconstruct 'true' evolutionary relationships, rather than just what basically amounts to string differences in the methods above?
1) By 'character-based method' I meant a strategy that reconstructs a phylogenetic tree directly from the sequence data without an evolutionary model, as parsimony tree reconstruction does.
2) So my question is: if no evolutionary model is used, what strategy should be adopted to cluster the sequences so that the result has some evolutionary meaning? Should a matrix be involved? If so, which matrix, and what formula should it use?
3) By 'distance matrix' I meant specifically the matrix based on the p-distance formula, as used by the neighbor-joining method of tree reconstruction. I guess what you meant was just a simple matrix of values calculated as differences between sequences. Is this right?
4) You suggested using a distance matrix. What formula would you base it on?
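For comparison, the parsimony idea mentioned in point 1 needs no distance matrix at all: each candidate tree is scored by the minimum number of character changes it requires, and the tree with the lowest score is preferred. Below is a minimal sketch using Fitch's algorithm on fixed binary trees. The function names, the nested-tuple tree representation and the toy alignment are all hypothetical, and a real program would also have to search over tree topologies rather than compare a few by hand.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one alignment column.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree);
    states maps each leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for col in range(length):
        column = {name: seq[col] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical toy alignment, not real data.
toy_alignment = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTACGA",
    "seqC": "ACGAACCA",
    "seqD": "TCGAACCA",
}

# The three possible groupings of four sequences, written as nested tuples.
candidate_trees = {
    "((seqA,seqB),(seqC,seqD));": (("seqA", "seqB"), ("seqC", "seqD")),
    "((seqA,seqC),(seqB,seqD));": (("seqA", "seqC"), ("seqB", "seqD")),
    "((seqA,seqD),(seqB,seqC));": (("seqA", "seqD"), ("seqB", "seqC")),
}
scores = {
    newick: parsimony_score(tree, toy_alignment)
    for newick, tree in candidate_trees.items()
}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

Here the sequences are used directly, column by column, and the only thing compared between trees is the count of changes.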
I still don't understand why you need to make this distinction. Any alignment you make implicitly assumes some sort of relationship, though not necessarily ancestry: the sequences may have evolved convergently and share similar domains etc. There's no way to show this without ancient data.
You are always assuming relatedness when you pick sequences from extant organisms (i.e. the tips of the tree). It’s difficult, if not actually impossible, to prove common ancestry, so we operate on the assumption that it is the simplest explanation.
I'm not sure what the 'best' character-state only algorithms are. The most widely used tree construction approaches these days are Maximum Likelihood methods, but these can/do use evolutionary models, as they are essentially trying to reconstruct the ancestral (deeper) nodes of the tree.
I don't have a particular opinion on what formula to use. NJ trees and other distance-based methods are not routinely used in phylogenetics anymore because, while they do reflect dissimilarity, they do not necessarily capture evolution accurately. The only other simple clustering method I know of off the top of my head is UPGMA, but this is still based on a distance matrix.
I feel like we are going round in circles here somewhat. It would be best to edit your question to state the goal you are actually trying to achieve, rather than quibble over specific methods without enough information. These comments are no longer addressing the original question asked.
The short answer (to my mind, but I'm no expert) is that if you care about evolution, use an ML-based approach with an appropriate model so that you can do ancestor reconstruction. If you purely want to show which sequences are most/least similar using hierarchical clustering, choose NJ or UPGMA etc.
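To illustrate the purely hierarchical clustering end of that choice, here is a rough UPGMA sketch in plain Python that turns pairwise distances into a Newick string. The helper name upgma, the sequence names and the distance values are hypothetical, and branch lengths are left out to keep it short; it is not how any particular package implements the method.

```python
def upgma(names, pairwise):
    """Cluster by UPGMA and return the topology as a Newick string.

    names:    list of sequence labels
    pairwise: dict mapping (names[i], names[j]) with i < j to a distance
    """
    # Each active cluster: id -> (Newick label, number of sequences in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    dist = {
        (i, j): pairwise[(names[i], names[j])]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters (on a tie, the first pair found wins).
        i, j = min(dist, key=dist.get)
        label_i, size_i = clusters.pop(i)
        label_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        # Drop every distance that involves the two clusters just merged.
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        clusters[next_id] = ("({},{})".format(label_i, label_j), size_i + size_j)
        next_id += 1
    label, _size = next(iter(clusters.values()))
    return label + ";"


# Hypothetical p-distances between four made-up sequences.
names = ["seqA", "seqB", "seqC", "seqD"]
pairwise = {
    ("seqA", "seqB"): 0.125,
    ("seqA", "seqC"): 0.375,
    ("seqA", "seqD"): 0.5,
    ("seqB", "seqC"): 0.25,
    ("seqB", "seqD"): 0.375,
    ("seqC", "seqD"): 0.125,
}
newick = upgma(names, pairwise)
print(newick)  # ((seqA,seqB),(seqC,seqD));
```

The most similar pairs are joined first and the nesting of the parentheses records the order of joining, which is the 'cluster the most similar next' behaviour asked about in the question.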
Alright then, thank you for the help.