Hey guys!
I was wondering whether, as a rule of thumb, there is an expected range of SNPs within a group of isolates. I ran a core SNP analysis on 360 E. coli isolates (whole genome) and got a total of 149,408 SNPs. Individual SNP counts per isolate range from 500 to 9,000. I'm aware that the reference affects the total number of SNPs called, so I picked the best-fitting one by calculating Mash distances between my isolates and a large dataset of RefSeq E. coli genomes. The selected reference had an average Mash distance of 0.02 to the rest of the isolates, which I believe is a fair number. Do you think this many SNPs is plausible, or am I missing something?
Thanks!
Too many for what purpose?
I will also ask this: are your 360 isolates all of equally good genome quality? Are they all one contig? That is going to affect the number of SNPs you find.
E. coli genome size varies between ~4 and 5.5 Mb so the numbers may be reasonable, taking into consideration the question above.
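For a back-of-the-envelope check: Mash distance roughly approximates the per-site mutation rate between two genomes, so multiplying it by the number of compared positions gives an order-of-magnitude SNP expectation. The sketch below assumes that simple model and nothing more; the function name `expected_snps` is made up for illustration.

```python
def expected_snps(mash_distance, aligned_length_bp):
    """Rough SNP expectation: per-site divergence times number of compared sites."""
    return round(mash_distance * aligned_length_bp)

# Mash distance of 0.02 is taken from the question; the two lengths are the
# ends of the ~4 to 5.5 Mb genome size range mentioned above.
for length_bp in (4_000_000, 5_500_000):
    print(length_bp, expected_snps(0.02, length_bp))
```

This is a whole-genome, pairwise estimate. Core SNP counts are typically lower, since only positions called in every isolate are kept.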
Hey GenoMax,
Thanks for the reply! My purpose is to build a phylogenetic tree of these samples and check for clusters. The input for SNP calling is the reads, not assembled genomes, and their quality looks good (coverage is also good, 300X on average). There were indeed some ugly-looking sequences, but I left those out. That said, I did expect some variation, because these isolates come from very different sources and are not clonally related (I found a total of 50 different STs). I just don't have a clear number of SNPs that would distinguish a bad analysis from actual genetic variation.
By the way, I'm starting to think that if these values are biologically correct, I might be using a method with unnecessarily high resolution. I'm considering running Roary to get a rough overview of clusters and then, if I need more resolution on a specific set of isolates, running a SNP analysis on just that group.
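For checking clusters, pairwise SNP distances between isolates are often easier to interpret than per-isolate counts against a single reference. Here is a minimal sketch, assuming the core alignment is already loaded as a dict of isolate name to aligned sequence. The isolate names and sequences are toy values, and skipping `N` and `-` is just one possible convention.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ, skipping gaps and Ns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )

def pairwise_distances(alignment):
    """Return {(name_a, name_b): SNP distance} for every pair of isolates."""
    return {
        (name_a, name_b): snp_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }

# Toy alignment with hypothetical isolate names, not real data.
toy = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "TCGAACNTAT",
}
for pair, dist in pairwise_distances(toy).items():
    print(pair, dist)
```

Isolates from the same ST should end up much closer to each other than to isolates from other STs, which gives a quick sanity check on the calls.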
If you have 300x coverage you could also try to get consensus sequences for the strains (or even try to assemble them). Then you can follow that up with phylogenetic analysis of specific genes (instead of the entire genomes).
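As a rough illustration of what a consensus call means at high coverage: at each reference position, take the base that most reads agree on and mask the position when support is weak. The sketch below is a toy version of that idea only. The `min_depth` and `min_fraction` thresholds are arbitrary assumptions, and real pipelines usually do this with a variant caller rather than by hand.

```python
from collections import Counter

def consensus_base(bases, min_depth=10, min_fraction=0.9):
    """Return the majority base of one pileup column, or 'N' if support is weak."""
    counts = Counter(b.upper() for b in bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too few reads to trust the call
    base, count = counts.most_common(1)[0]
    return base if count / depth >= min_fraction else "N"

def consensus_sequence(columns, **thresholds):
    """Join per-position consensus calls into one sequence."""
    return "".join(consensus_base(col, **thresholds) for col in columns)

# Toy pileup columns (hypothetical): clean, slightly noisy, mixed, low depth.
columns = ["A" * 300, "A" * 290 + "G" * 10, "C" * 150 + "T" * 150, "G" * 5]
print(consensus_sequence(columns))
```

Once each strain has a consensus sequence, the genes of interest can be cut out, aligned, and used for the tree.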
That's an interesting approach, but I don't see how I would get a consensus sequence for each strain. Do you mean I could assemble against the closest reference and then run a phylogenetic analysis of some common genes? Isn't that similar to running Roary, where you generate a core genome alignment and then use it as input for a tree-inference tool like IQ-TREE? Maybe I'm mixing things up here.