I have two phylogenies:
1) Species phylogeny (in black): Species B to D have published genomes, and I have assembled a genome for Species A. I constructed the phylogeny from a multiple sequence alignment of protein orthologs across Species A to D (OrthoMCL -> MUSCLE -> trimAl -> MrBayes).
2) Subspecies phylogeny (in red): I also have sequencing data for different subspecies and isolates of Species A. I mapped these onto the Species A genome, identified SNPs (using GATK) and built a SNP-based phylogeny.
My question now is "what is the best way to integrate both these phylogenies into one?".
I do not want to assemble genomes for all the subspecies (tedious for 20 isolates), and I do not want to map the Species B-D reads onto Species A (they are very divergent, and I think inferring their relationships through an MSA is best).
I can infer the nucleic acid/protein sequences of the subspecies' orthologs from the variant calls and add them to the multiple sequence alignment used for the species phylogeny. But I find the output of tools like vcf2fq and FastaAlternateReferenceMaker complicated (see "New Fasta Sequence From Reference Fasta And Variant Calls File?"). In this case, how should I deal with SNPs in repetitive regions, which we usually exclude from analysis?
Is there any other way to achieve this?
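To make the consensus-from-variants idea concrete, here is a minimal Python sketch of what such tools do conceptually for a single ortholog region. It is not how vcf2fq or FastaAlternateReferenceMaker are implemented; the function name `apply_snps`, the tuple layouts, and the 1-based inclusive coordinates are all assumptions made for illustration. It substitutes SNP alleles into the reference sequence (appropriate for a haploid sample with one allele per site), skips indels so the sequence length stays the same, and writes N over masked repeat intervals.

```python
def apply_snps(ref_seq, region_start, snps, masked=()):
    """Return ref_seq with SNP alleles substituted and masked bases set to 'N'.

    ref_seq      -- reference sequence of the region (forward strand)
    region_start -- 1-based genomic position of ref_seq[0] (assumed convention)
    snps         -- iterable of (pos, ref, alt), 1-based positions
    masked       -- iterable of (start, end), 1-based inclusive intervals
    """
    seq = list(ref_seq.upper())
    region_end = region_start + len(seq) - 1

    for pos, ref, alt in snps:
        if not region_start <= pos <= region_end:
            continue  # variant lies outside this region
        if len(ref) != 1 or len(alt) != 1:
            continue  # skip indels so alignment columns stay in register
        i = pos - region_start
        if seq[i] != ref.upper():
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq[i] = alt.upper()

    # Mask repeats after substitution so SNPs inside repeats are hidden too.
    for start, end in masked:
        lo = max(start, region_start)
        hi = min(end, region_end)
        for p in range(lo, hi + 1):
            seq[p - region_start] = "N"

    return "".join(seq)


# Toy example: a 10 bp region starting at genomic position 101.
consensus = apply_snps(
    "ACGTACGTAC",
    101,
    snps=[(103, "G", "T"), (108, "T", "C")],
    masked=[(105, 106)],
)
print(consensus)
```

Because only SNPs are applied, each isolate's sequence has the same length as the Species A reference and could in principle be placed in the same alignment columns. Genes on the reverse strand would still need reverse-complementing, and nucleotide sequences would need translating before they could join a protein alignment.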
Which approach is more efficient may depend on genome size and ploidy. For bacteria I would recommend assembling the reads de novo with SPAdes, which is fast and very easy to use. For bacteria, de novo assembly is not at all "tedious".
The genome size is 20 Mb and the organism is haploid, so de novo assembly is tedious (ordering contigs, filling gaps, annotating genes).
You should not do phylogeny on repetitive regions. Repeats are formed by recombination and recombination events will distort the phylogenetic signal.
In addition, highly repetitive regions are prone to sequencing and mapping errors, and thus to unreliable variant calls.
Thanks @piet @WouterDeCoster. Will keep that in mind! For now, I have discarded all SNPs in DustMasker-predicted regions.
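For anyone wanting to reproduce the filtering step, discarding SNPs that fall inside masked regions could look roughly like the Python sketch below. The function names and the data layout (SNPs as `(chrom, pos)` tuples, masked regions as 1-based inclusive intervals per chromosome) are assumptions for illustration only; the intervals would first have to be parsed from the masker's output and converted to whatever coordinate convention the SNP calls use.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Sort intervals and merge overlapping or adjacent ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def drop_masked_snps(snps, masked_by_chrom):
    """Return the SNPs that do not fall inside any masked interval.

    snps            -- list of (chrom, pos), 1-based positions
    masked_by_chrom -- dict: chrom -> list of (start, end), 1-based inclusive
    """
    # Pre-compute merged intervals and their start positions per chromosome.
    index = {}
    for chrom, intervals in masked_by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = (merged, [start for start, _ in merged])

    kept = []
    for chrom, pos in snps:
        merged, starts = index.get(chrom, ([], []))
        # Find the last interval starting at or before pos, if any.
        i = bisect_right(starts, pos) - 1
        inside = i >= 0 and pos <= merged[i][1]
        if not inside:
            kept.append((chrom, pos))
    return kept


# Toy example with made-up coordinates.
masked = {"chr1": [(10, 20), (15, 30)], "chr2": [(5, 5)]}
calls = [("chr1", 9), ("chr1", 10), ("chr1", 25),
         ("chr1", 31), ("chr2", 5), ("chr3", 1)]
print(drop_masked_snps(calls, masked))
```

Merging the intervals first allows a binary search per SNP, which stays fast even with many masked regions on a 20 Mb genome.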