So I have generated a large FASTA file in which each entry joins two protein domains taken from a longer protein sequence. To be clear, I started with a protein sequence like AAAAAAABBBBBBCCCCCDDDDDDEEEEEE and created a new sequence, BBBBBBDDDDDD.
Now, with MEGA, I used MUSCLE to align all my sequences and then generated a maximum likelihood tree. Many of the sequences were redundant or very similar. What I would like to do is group these sequences into clusters based on how different they are in sequence. So if I have ~300 sequences, I would like to group them into ~30 clusters. How would I go about doing this? How can I get a measure of how different the sequences are in absolute terms, rather than just the binning these trees give me, where all I get is a value for branch length?
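For anyone reproducing the domain-joining step, here is a minimal sketch. The function name, the 1-based inclusive coordinates, and the region lengths of the toy sequence are assumptions for illustration; real domain boundaries would come from a domain annotation or a profile search.

```python
def join_domains(sequence, regions):
    """Concatenate sub-sequences given as 1-based inclusive (start, end) pairs."""
    return "".join(sequence[start - 1:end] for start, end in regions)


# Toy sequence laid out like the example above (region lengths are assumed).
toy = "A" * 7 + "B" * 6 + "C" * 5 + "D" * 6 + "E" * 6

# Hypothetical coordinates of the "B" and "D" domains in the toy sequence.
joined = join_domains(toy, [(8, 13), (19, 24)])
print(joined)  # BBBBBBDDDDDD
```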
Thanks so much!
You created artificial sequences and now want to further divide them into a random number of clusters? What valid scientific inference can you draw after these sorts of manipulations?
So we have two regions, A & B, in a protein that are responsible for binding to the promoter for RNA polymerization. For primary sigma factors, there are 3000 available sequences, and I would like to know how closely related the promoter-specificity regions are across these 3000 sequences. So, for example, if many of them are identical or very closely related, I would like to collapse them into a group and represent that group as a single branch on the tree. That way I could learn that the 3000 sequenced proteins fall into X groups that are actually distinct from one another.
What I am asking is, how is this broad binning done? Is there a value that gives the percentage of similarity across sequences that I could use?
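One common answer to this kind of question is pairwise percent identity computed straight from the alignment. Below is a minimal sketch, assuming the alignment has been exported as an aligned FASTA file (rows of equal length, gaps written as "-"). The function names and toy sequences are made up for illustration, and the gap convention used here (skip columns where both rows are gaps, count gap versus residue as a mismatch) is only one of several in use.

```python
from itertools import combinations


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def percent_identity(a, b):
    """Percent identity between two rows of the same alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # column is a gap in both rows: not informative
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identity(aligned):
    """Percent identity for every unordered pair of aligned sequences."""
    return {
        (x, y): percent_identity(aligned[x], aligned[y])
        for x, y in combinations(aligned, 2)
    }


# Toy alignment, invented for illustration.
toy = """>s1
ACDEFGHIKL
>s2
ACDEFGHIKM
>s3
AC--FGHIKL
"""
table = pairwise_identity(read_fasta(toy))
for (x, y), pid in table.items():
    print(f"{x}\t{y}\t{pid:.1f}")
```

Subtracting the value from 100 gives a simple distance (the proportion of differing sites, often called the p-distance) that does not depend on the tree at all.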
I am not a phylogenetics expert, and someone else may have more enlightening advice later, but for now here goes.
Since you have done an MSA of all the sequences, did you build a simple NJ tree? You could use that as a guide to select sequences to represent the "clusters" you are thinking of. This thread, "Reducing Number Of Sequences For Phylogentic Tree Construction", suggests using 90% identity. Still, by artificially joining those two domains you are likely ignoring important information (just a feeling).
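To make the 90% identity idea concrete, here is a rough sketch of greedy clustering around representative sequences. It is similar in spirit to what tools such as CD-HIT do, but it is not their algorithm: it works on rows of an existing alignment, and all names and toy sequences are invented for illustration.

```python
def identity(a, b):
    """Percent identity of two aligned rows, ignoring columns gapped in both."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def greedy_cluster(aligned, threshold=90.0):
    """Assign each sequence to the first representative it matches.

    Sequences are visited from longest to shortest (ungapped length).
    A sequence that matches no existing representative at or above the
    threshold becomes the representative of a new cluster.
    """
    order = sorted(
        aligned, key=lambda n: len(aligned[n].replace("-", "")), reverse=True
    )
    clusters = {}  # representative name -> list of member names
    for name in order:
        for rep in clusters:
            if identity(aligned[name], aligned[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Toy alignment, invented for illustration.
aligned = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKM",  # 90% identical to s1
    "s3": "WWWWWWWWWW",  # unrelated
}
clusters = greedy_cluster(aligned, threshold=90.0)
for rep, members in clusters.items():
    print(rep, members)
```

The representatives (the dictionary keys) are the sequences you would keep for the reduced tree. Note that the result depends on the visiting order, which is a known property of greedy schemes like this one.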
Thanks for the link, I'll use that.
I have created a maximum likelihood tree, which does help a bit, but I'll need something more programmatic to get down to a smaller number of clusters.
Regarding your point about artificially joining these domains, we discussed this on the other thread. I considered adding a series of dashes to make sure no overlaps bridge the two sequences, but you made the point that since there is homology, it will align that way anyway (which is what I think did happen).
I need to use both protein domains instead of using them individually because the first domain specifies the -35 region and the second domain specifies the -10 region. Therefore, if I split them up, I won't be able to tell which sigma factors as a whole are distinct (e.g. sigma factor 1 has regions A & B and sigma factor 2 has regions A & C).
I was referring to leaving the intervening sequence between the two domains intact. If the proteins are homologous, that part should align as well.
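One programmatic route to a target number of clusters is to scan identity thresholds and stop at the first one that gives few enough groups. Below is a sketch using single-linkage clustering (two sequences share a cluster if a chain of pairs at or above the threshold connects them), implemented with a small union-find. This is a swapped-in generic technique, not something MEGA does for you, and every name and the toy data are assumptions for illustration.

```python
from itertools import combinations


def identity(a, b):
    """Percent identity of two aligned rows, ignoring columns gapped in both."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def single_linkage(names, scored_pairs, threshold):
    """Merge every pair whose identity is at or above the threshold."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for pid, a, b in scored_pairs:
        if pid >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def clusters_for_target(aligned, target):
    """Return the highest integer threshold giving at most `target` clusters."""
    if target < 1:
        raise ValueError("target must be at least 1")
    names = list(aligned)
    # Compute each pairwise identity once, then reuse it for every threshold.
    scored = [
        (identity(aligned[a], aligned[b]), a, b)
        for a, b in combinations(names, 2)
    ]
    for threshold in range(100, -1, -1):
        clusters = single_linkage(names, scored, threshold)
        if len(clusters) <= target:
            return threshold, clusters


# Toy alignment, invented for illustration.
aligned = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKM",
    "s3": "ACDEFGHIMM",
    "s4": "WWWWWWWWWW",
}
threshold, clusters = clusters_for_target(aligned, target=2)
print(threshold, clusters)
```

With ~300 sequences you would call this with target=30. Single linkage can chain loosely related sequences into one group (s1 and s3 above are only 80% identical but end up together through s2), so it is worth checking a few clusters by eye before trusting the cut.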
Hi, I have a similar problem, and my research area matches yours. Would it be OK if we discussed it?