After reading some articles on using NGS to sequence whole bacterial genomes for characterizing outbreaks, I have become interested in building a pipeline to streamline this workflow as a pet project. At a very high level, I would like to take whole-genome sequence data for a collection of microbial samples from different sources in an outbreak and generate phylogenetic trees to visualize the relatedness between isolates, which could then be used to characterize outbreak patterns.
Elsewhere on Biostar (phylogenetic analysis of whole genomes) I read that this is best accomplished with multiple sequence alignments of 16S or other highly conserved regions. However, in some of the papers I've been reading, such as this one by Köser et al. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3715836/), it looks like they did WGS, aligned the reads against a reference, identified SNPs and indels, and then ran some kind of phylogenetic analysis on all the SNPs using the RAxML software. Without knowing much more about this latter approach, I think I prefer it to using only 16S, since 16S wouldn't work with viral sequences, but I don't know whether there are any serious drawbacks to following what was done in the Köser paper.
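To make the SNP-based idea concrete, here is a rough Python sketch of how I imagine the "collect all the SNPs" step. It is not the method from the Köser paper, only my assumption of the general shape: the function names, the toy reference and the per-sample SNP dictionaries (0-based position mapped to alternate base) are all made up for illustration, and indels are ignored.

```python
def build_snp_alignment(reference, sample_snps):
    """Build a concatenated alignment of variable sites only.

    reference   : reference sequence as a string (hypothetical toy input)
    sample_snps : dict of sample name -> {0-based position: alternate base}
    Returns (positions, alignment), where alignment maps each sample name
    to its bases at every position that varies in at least one sample.
    """
    # Every position where at least one sample differs from the reference.
    positions = sorted({pos for snps in sample_snps.values() for pos in snps})
    alignment = {}
    for name, snps in sample_snps.items():
        # Use the called alternate base, otherwise fall back to the reference base.
        alignment[name] = "".join(snps.get(pos, reference[pos]) for pos in positions)
    return positions, alignment


def to_fasta(alignment):
    """Serialize the alignment as FASTA text, one record per sample."""
    return "".join(f">{name}\n{seq}\n" for name, seq in alignment.items())


# Toy data: a 10 bp "genome" and three isolates (all invented).
reference = "ACGTACGTAC"
sample_snps = {
    "isolate_A": {2: "T"},
    "isolate_B": {2: "T", 7: "C"},
    "isolate_C": {},
}

positions, alignment = build_snp_alignment(reference, sample_snps)
print(positions)
print(to_fasta(alignment))
```

As I understand it, the resulting FASTA of variable sites is the kind of input that would then be handed to a tree-building program.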
Can anybody comment on what the best approach is to achieve my goals, or if this idea even makes sense?
Thanks,
-Paul
In general, if you construct phylogenetic trees from SNPs alone, you may end up with the right topology, but the edge lengths will not be correct. This is because you only look at specific differences relative to one reference. What does that mean for relatedness? A tree built from SNPs only gives you a hint as to which organisms are related to each other; you can't really state how close or far apart they are. Hope that helps.
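A toy illustration of the edge-length point, as a minimal Python sketch. The sequences and the genome length are invented, and the helper names are hypothetical; the only thing it shows is that the same number of differences gives a very different per-site distance depending on whether you divide by the number of SNP columns or by the number of genome sites.

```python
def hamming(a, b):
    """Count mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def p_distance(a, b, n_sites=None):
    """Proportion of differing sites.

    By default the denominator is the alignment length; pass n_sites to
    scale by a different number of sites (e.g. the full genome length).
    """
    n = len(a) if n_sites is None else n_sites
    return hamming(a, b) / n


# Two isolates, reduced to the variable (SNP) columns only (toy data).
snp_a = "TT"
snp_b = "TC"
genome_length = 10  # assumed length of the genome the SNPs came from

per_snp_site = p_distance(snp_a, snp_b)                             # 1 / 2
per_genome_site = p_distance(snp_a, snp_b, n_sites=genome_length)   # 1 / 10

print(per_snp_site, per_genome_site)
```

Because the invariant sites were thrown away, the SNP-only value overstates how different the two isolates are per site, which is why lengths from such a tree should not be read as real evolutionary distances without some correction.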