Question

Adding an outgroup to a phylogeny: Steps and considerations

16 months ago

doggie • 0

I have successfully created a phylogeny by aligning my isolates (paired-end FASTQ files) against a reference genome. Now I would like to add an outgroup to the phylogeny for comparison. The outgroup is available as an assembled genome (in .fna format) on NCBI.

My question is: How do I add this outgroup to my existing phylogeny? Should I align it to the original reference genome or use a different approach?

Thank you in advance for your help!

outgroup alignment • 1.6k views

score 0 · Answer 1 · 2023-07-13

16 months ago

Michael 55k

A few more details would be helpful. As a matter of principle, I would say that all taxa in a phylogenetic analysis should be derived and processed in the same way as far as possible. Otherwise, you risk introducing heterogeneity that reflects differences in the methods rather than differences in sequence evolution. At the very least, you would have to calibrate your approach.

This might not seem that important for an outgroup, but I still suggest that if you used reference-based multiple sequence alignment for all other taxa, the outgroup should also be added by aligning it to the same reference.

Thank you for your reply. I am doing a WGS project, and I would like to align the genome assembly ASM15142v1 (the outgroup) against my reference genome assembly ASM2306590v1. I have a bash script that runs bwa mem, Picard, and the other alignment steps for my isolates. From the resulting VCF files, I generated FASTA sequences, concatenated them into an MSA, and submitted that to RAxML. What I am struggling with now is whether I should align the outgroup genome to the reference genome, build a new MSA, and submit it to RAxML again.
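The step of combining per-sample FASTA sequences into one MSA can be sketched in Python as below. This is only an illustration, not the script used in this thread: the function name `build_msa` and the sample names are hypothetical. It assumes each sample's sequence is already in the coordinates of the same reference, so stacking them only gives a valid alignment when all sequences have exactly the same length. If indels were applied when the FASTA files were generated, the lengths will differ and the check fails.

```python
def build_msa(per_sample):
    """Combine per-sample sequences into one multi-FASTA alignment.

    per_sample maps a sample name to its sequence (hypothetical input).
    Sequences must all have the same length, otherwise the columns
    no longer correspond to the same reference positions.
    """
    lengths = {name: len(seq) for name, seq in per_sample.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"sequences differ in length: {lengths}")
    # One FASTA record per sample, in insertion order.
    return "".join(f">{name}\n{seq}\n" for name, seq in per_sample.items())


if __name__ == "__main__":
    print(build_msa({"isolate1": "ACGTAC", "isolate2": "ACGTTC"}), end="")
```

An outgroup sequence added this way would have to pass the same length check, which is only possible if it has been projected onto the reference coordinates first.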

Reply • 16 months ago by doggie • 0

How many samples do you have? And how exactly did you make the MSA? In principle, I would say yes, but if the outgroup assembly is from a different species, the result might not be comparable to the other sequences. It is important that every column of the alignment represents "the same" conserved (homologous) position in the genome. When the genomes are not collinear and the variation does not consist mostly of simple SNPs and indels, this might be hard to achieve with a reference-based multiple sequence aligner or with your approach.

Reply • 16 months ago by Michael 55k

I have 100 samples, and I made the MSA by converting the VCFs into FASTA with GATK FastaAlternateReferenceMaker. I think they could share the same conserved positions, but if not, which method would you recommend for adding an outgroup?

Reply • 16 months ago by doggie • 0

score 0 · Answer 2 · 2023-07-14

OK, I see. So you are building ALT-genomes in which each polymorphic site is replaced by its ALT allele, right? You can achieve the same with bcftools consensus, which gives you more control. The problem is that the outgroup is a different genome and possibly cannot be converted into a VCF file and an ALT-genome in the coordinates of the reference species in the same way. So you would need a (reference-based) whole-genome alignment of all genomes. I think doing this de novo, e.g. with Progressive Cactus or Mauve, is not feasible. Instead, I have had good experience with the NASP pipeline, even though it is not that widely used. It is essentially a wrapper for MUMmer/NUCmer.
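To make the ALT-genome idea concrete, here is a minimal Python sketch of what the substitution amounts to for simple SNPs. It is not how GATK or bcftools implement it; the function `apply_snps` and the toy data are hypothetical, and indels, multi-allelic sites, and genotype handling are deliberately left out. The point is that every variant is addressed by a position on the reference, which is why a genome without calls in those coordinates cannot be added in the same way.

```python
def apply_snps(reference, snps):
    """Return `reference` with each SNP's ALT allele substituted.

    snps is an iterable of (pos, ref, alt) tuples with 1-based positions,
    as in a VCF record. Only single-base substitutions are handled.
    """
    seq = list(reference)
    for pos, ref, alt in snps:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError("only single-base substitutions are handled in this sketch")
        # Guard against variants called on a different reference.
        if reference[pos - 1].upper() != ref.upper():
            raise ValueError(f"REF allele does not match the reference at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


if __name__ == "__main__":
    # Toy reference and two SNPs; the output has the same length as the input.
    print(apply_snps("ACGTACGT", [(2, "C", "T"), (8, "T", "A")]))
```

Because only substitutions are applied, the result keeps the reference length, so ALT-genomes built this way stack directly into an alignment.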

Put all genomes in one folder, provide the reference genome, and NASP will output a SNP matrix. While this looks like adding another layer of variant detection, you can interpret the output as an MSA consisting only of variable sites and pass it to the phylogenetics software. If you get very short sequences, though, this may mean that the outgroup is too divergent, and you might have to look for a more closely related species.
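The idea of "an MSA consisting only of variable sites" can be illustrated with a short Python sketch. This is not NASP's code or output format; `variable_sites` and the toy alignment are hypothetical. The sketch assumes that `N`, `-`, and `?` mark missing data and should not count as alleles.

```python
def variable_sites(alignment, missing="N-?"):
    """Keep only the columns where at least two different bases are observed.

    alignment maps a taxon name to an aligned sequence; all sequences
    must have the same length. Returns (kept_column_indices, reduced_alignment).
    """
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    keep = []
    for i, column in enumerate(zip(*seqs)):
        called = {base for base in column if base not in missing}
        if len(called) > 1:  # more than one real allele: a variable site
            keep.append(i)
    reduced = {name: "".join(s[i] for i in keep) for name, s in zip(names, seqs)}
    return keep, reduced


if __name__ == "__main__":
    toy = {"ref": "ACGTAC", "iso1": "ACGTTC", "outgroup": "ATGTNC"}
    columns, snp_matrix = variable_sites(toy)
    print(columns, snp_matrix)
```

The number of kept columns is a quick sanity check: if it is very small compared to the genome size, few positions could be compared across all genomes.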

Because the output consists only of variable sites, your substitution model may need an ascertainment bias correction to obtain correct ML branch lengths; in RAxML, this means using a model with an ASC_ prefix (e.g. ASC_GTRGAMMA).
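As a rough sketch, a RAxML call with ascertainment bias correction could be assembled like this in Python. The file name `snps.phy`, the run name, the seed, and the binary name `raxmlHPC` are assumptions for illustration; the options are the ones I know from the RAxML version 8 manual (`-m ASC_GTRGAMMA`, `--asc-corr`, `-o` for the outgroup), so check them against your installed version. RAxML-NG uses a different syntax.

```python
import shlex


def raxml_asc_command(alignment, run_name, outgroup, seed=12345, binary="raxmlHPC"):
    """Build an argument list for an ML search with ascertainment bias correction.

    All names passed in are placeholders; `outgroup` must match a taxon
    label in the alignment file.
    """
    return [
        binary,
        "-s", alignment,         # input alignment of variable sites only
        "-n", run_name,          # suffix for the output files
        "-m", "ASC_GTRGAMMA",    # GTR+Gamma with ascertainment bias correction
        "--asc-corr=lewis",      # Lewis correction for SNP-only data
        "-p", str(seed),         # random seed for the starting trees
        "-o", outgroup,          # taxon used as outgroup
    ]


if __name__ == "__main__":
    cmd = raxml_asc_command("snps.phy", "with_outgroup", "ASM15142v1")
    print(shlex.join(cmd))  # inspect the command; run it with subprocess.run(cmd, check=True)
```

Building the command as a list and printing it first makes it easy to review before anything is executed.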

In general, though, I would not mix genome assemblies and ALT-genomes from variant calling in the same phylogenetic analysis. It is important to take into account that ALT-genomes have a very different error structure from genome assemblies. Without good variant filtering, their differences from the reference might consist mostly of sequencing errors and low-frequency alleles.