Suppose I have n isolates, some from strain A and some from strain B. After performing deep sequencing and alignment against a reference genome (strain A), I have a list of SNPs for each isolate and the frequency of that SNP in the given isolate.
What would be the best way to determine whether an isolate population comes from a single strain or is a mixture of both strains?
Say isolate 1 is supposed to be from strain A. How can I determine whether it is composed of only strain A or contains both strains?
This is close, but not exactly what I want. There are different genotypes within a single isolate: we have mutations present at less than 100% frequency. What is not clear, however, is whether a given genotype is a mutation of strain A or a genotype of strain B.
The goal is to see if there is contamination. The possibilities are that the isolate is all strain A (or some mutant of it), all strain B (or some mutant of it), or some mixture of strain A and strain B.
Reference sequences for Strain A and Strain B are known.
Are you saying that you have something that, judging by its mutations, is halfway between A and B, and you are not sure whether it is a mutated A or B?
I presume the contamination, if it has occurred, is whole-strain contamination.
What do you mean when you say that the mutations are at less than 100% frequency? That not all the reads covering that position carry them?
Couldn't you align the two strains, see in which positions they differ and then align the reads to one of the strains and see whether the other strain pattern for that genomic position occurs in some percentage that would indicate it is not there by accident?
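A minimal sketch of that idea in Python, under some assumptions not stated in the thread: the two references are already aligned to the same length, and per-position base counts for the isolate are available as a plain dictionary. The function names, the `min_depth` cutoff and the `tol` threshold are all hypothetical and would need tuning. The reasoning is that a spontaneous mutation of strain A should match the strain B allele at only a few discriminating sites, whereas whole-strain contamination should show the strain B allele at a roughly constant fraction across nearly all of them.

```python
from statistics import median


def discriminating_sites(ref_a, ref_b):
    """Return {position: (allele_a, allele_b)} where two pre-aligned
    references differ. Gap columns ('-') are skipped."""
    if len(ref_a) != len(ref_b):
        raise ValueError("references must be aligned to the same length")
    return {
        i: (a, b)
        for i, (a, b) in enumerate(zip(ref_a, ref_b))
        if a != b and a != "-" and b != "-"
    }


def strain_b_fraction(sites, pileup, min_depth=20):
    """Median fraction of reads carrying the strain B allele over all
    discriminating sites with enough coverage.

    pileup is an assumed structure: {position: {base: read_count}}.
    Returns None if no site passes the (hypothetical) depth cutoff."""
    fractions = []
    for pos, (a, b) in sites.items():
        counts = pileup.get(pos, {})
        depth = counts.get(a, 0) + counts.get(b, 0)
        if depth >= min_depth:
            fractions.append(counts.get(b, 0) / depth)
    return median(fractions) if fractions else None


def classify(fraction, tol=0.02):
    """Crude three-way call; tol is an illustrative threshold only."""
    if fraction is None:
        return "undetermined"
    if fraction <= tol:
        return "A"
    if fraction >= 1 - tol:
        return "B"
    return "mixture"


# Toy example with made-up counts.
sites = discriminating_sites("ACGTACGT", "ACCTACGA")
pileup = {2: {"G": 70, "C": 30}, 7: {"T": 72, "A": 28}}
print(classify(strain_b_fraction(sites, pileup)))
```

The median is used so that a handful of sites where a strain A mutant happens to match strain B do not dominate the estimate.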
That's what I'm doing at the moment, but I was hoping there would be something already out there.
Not halfway, just that Mutant A != Strain B. Being in the set of strain A mutants, a subset of strain A, is mutually exclusive of being in the set of Strain B.
Yes, that is what I mean by <100%. A statistically significant number of the reads covering a position differ from the reference sequence at that position.
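One way to make "statistically significant" concrete is a one-sided binomial test against the sequencing error rate. This is only a sketch: it assumes errors are independent with a uniform per-base rate, and the `error_rate` and `alpha` values below are placeholders, not recommendations. It uses only the standard library.

```python
from math import exp, lgamma, log, log1p


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.
    Terms are computed in log space to avoid overflow at high depth."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_term)
    return total


def exceeds_error_rate(alt_reads, depth, error_rate=0.01, alpha=1e-3):
    """True if the number of non-reference reads is unlikely to be
    explained by sequencing error alone (assumed error model)."""
    return binom_sf(alt_reads, depth, error_rate) < alpha


print(exceeds_error_rate(30, 100))  # 30 of 100 reads differ
print(exceeds_error_rate(1, 100))   # 1 of 100 reads differs
```

When many positions are tested, the threshold would need a multiple-testing correction.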