Hello, everyone.
I'm building a phylogenetic tree from sequenced transcriptomes of 100 species. I've calculated orthogroups with OrthoMCL and will build the tree with RAxML. In most articles I've seen, the authors use for tree building only those orthogroups that have exactly one gene from each species. Since I'm working with transcriptomes, the number of such orthogroups will decrease as the number of transcriptomes grows, for example because of misassemblies.
Instead, I want to take all orthogroups in which at least 50 species are represented by exactly one gene. If a species has two (or more) genes in an orthogroup, I'll drop all of them (because paralogs can hamper correct tree reconstruction). In the resulting concatenated alignment of orthogroups, which I'll give to RAxML, I simply fill with gaps (-------) the places where a species doesn't have an ortholog. RAxML can deal with such gaps - it just won't use information from those columns of the alignment for that species (https://goo.gl/GZ47bu).
So my method seems good to me, for example because it lets me use information from more genes for tree building. However, in all the articles I've seen, people build trees from complete orthogroups. Am I missing some drawback of this method?
I would be grateful for any help.
P.S. The lower limit of 50 species per orthogroup is arbitrary - I just don't want to take orthogroups that are too small, because they may originate from contamination.
P.P.S. To be specific, I have only 30 orthogroups with exactly one gene assembled from each species, but 4000 orthogroups with exactly one gene assembled from at least 50 species.
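To make the filtering rule concrete, here is a minimal sketch of it in Python. It assumes each orthogroup is given as one line of the form `group_id: taxon|gene taxon|gene ...` (my understanding of the OrthoMCL groups file; adjust the parser if yours differs). The function names `parse_group_line` and `filter_orthogroup` are made up for illustration.

```python
from collections import Counter

MIN_SPECIES = 50  # arbitrary lower limit, as mentioned in the P.S.


def parse_group_line(line):
    """Split one orthogroup line into (group_id, [(species, gene_id), ...]).

    Assumes the layout 'group_id: taxon|gene taxon|gene ...'.
    """
    group_id, _, rest = line.partition(":")
    members = [tuple(token.split("|", 1)) for token in rest.split()]
    return group_id.strip(), members


def filter_orthogroup(members, min_species=MIN_SPECIES):
    """Keep only species that have exactly one gene in this orthogroup.

    Species with two or more genes are dropped entirely (possible paralogs).
    Returns {species: gene_id}, or None if fewer than min_species remain.
    """
    counts = Counter(species for species, _ in members)
    single_copy = {sp: gene for sp, gene in members if counts[sp] == 1}
    return single_copy if len(single_copy) >= min_species else None
```

For example, with a toy threshold of 2, `filter_orthogroup([("A", "a1"), ("B", "b1"), ("B", "b2"), ("C", "c1")], min_species=2)` keeps species A and C and drops both genes of B.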
Thank you for your response.
The aim of the work is simply to build a correct species tree. The 'complete' orthogroups alone are insufficient for this, because there are so few of them.
For tree building I use concatenated alignments of orthogroups, so, since every species is represented in at least some orthogroups, there will be no error.
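The gap-filling concatenation could look roughly like the sketch below. It assumes each orthogroup alignment has already been read into a dict mapping species name to aligned sequence; `concatenate_alignments` is a hypothetical helper, not part of RAxML or OrthoMCL.

```python
def concatenate_alignments(alignments, species):
    """Build a supermatrix from per-orthogroup alignments.

    alignments: list of {species: aligned_sequence} dicts; within one dict
                all sequences must have the same length.
    species:    list of all species names to include in the output.
    Returns {species: concatenated_sequence}; a species missing from an
    orthogroup gets a run of '-' as long as that orthogroup's alignment.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        length = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}
```

Every species ends up with a sequence of the same total length, so the result can be written out as a single alignment file for the tree-building step.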