Hi everyone!
For a project I'm working on, I need to determine which bacterial strains in my data set have the most de novo variants (SNVs and indels). Normally this requires laboratory experiments: you grow clones of a bacterium, apply some kind of stress, sequence their genomes, and compare them with the genome of the initial cell. Because the genome sequence of the mother cell is known, it is possible to identify the new variants that arose during the experiment.

I was wondering whether there is any way to carry out this analysis with public data only, for example with all E. coli genomes in RefSeq. The problem is that after I call variants on these genomes, there is no way to tell whether a variant is new or inherited. I was thinking of writing a script that records all variants found, filters out the repeated ones, and uses a statistical score based on phylogeny to give the probability that a variant is new or inherited. However, I did not find any papers that did something similar.

Is there any statistical method that can assess the probability that an observed variant is de novo rather than inherited?
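To make the script idea more concrete, here is a minimal sketch of one way the filtering step could look. It is not an established method: it swaps the "statistical score" for plain Fitch parsimony on a fixed binary tree, and counts a variant for a strain only when the variant appears to have been gained on that strain's terminal branch. The tree, the strain names, the variant keys, and the function names `terminal_gains` and `count_terminal_variants` are all hypothetical, and the sketch assumes that the variant is absent in the ancestor whenever the root state is ambiguous.

```python
from typing import Dict, List, Set

# A tree is either a strain name (leaf) or a pair (left_subtree, right_subtree).
# Strain names are assumed to be unique.


def _bottom_up(node, carriers: Set[str], sets: dict) -> set:
    """Fitch bottom-up pass: possible states (1 = variant present) per node."""
    if isinstance(node, str):
        states = {1} if node in carriers else {0}
    else:
        left = _bottom_up(node[0], carriers, sets)
        right = _bottom_up(node[1], carriers, sets)
        # Intersection if the children agree on something, otherwise union.
        states = (left & right) or (left | right)
    sets[node] = states
    return states


def _top_down(node, parent_state: int, sets: dict, gains: List[str]) -> None:
    """Fitch top-down pass: fix one state per node and record leaf gains."""
    states = sets[node]
    # Keep the parent's state when allowed; otherwise the set has one element.
    state = parent_state if parent_state in states else min(states)
    if isinstance(node, str):
        if state == 1 and parent_state == 0:
            gains.append(node)  # variant gained on this terminal branch
        return
    for child in node:
        _top_down(child, state, sets, gains)


def terminal_gains(tree, carriers: Set[str]) -> List[str]:
    """Strains where the variant is inferred to arise on the terminal branch."""
    sets: dict = {}
    root_states = _bottom_up(tree, carriers, sets)
    root_state = min(root_states)  # assumption: ambiguous root means absent
    gains: List[str] = []
    _top_down(tree, root_state, sets, gains)
    return gains


def count_terminal_variants(tree, calls: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, per strain, the variants gained on that strain's own branch."""
    counts = {strain: 0 for strain in calls}
    for variant in set().union(*calls.values()):
        carriers = {s for s, variants in calls.items() if variant in variants}
        for strain in terminal_gains(tree, carriers):
            counts[strain] += 1
    return counts


# Hypothetical toy data: five strains and three variants.
tree = ((("A", "B"), "C"), ("D", "E"))
calls = {
    "A": {"v1", "v2", "v3"},  # v1 private to A
    "B": {"v2"},              # v2 shared by the sister strains A and B
    "C": set(),
    "D": {"v3"},              # v3 also seen in the distant strain D
    "E": set(),
}
counts = count_terminal_variants(tree, calls)
print(counts)  # {'A': 2, 'B': 0, 'C': 0, 'D': 1, 'E': 0}
```

In this toy example `v2` is treated as inherited from the common ancestor of A and B, while `v3` is counted as two independent gains because A and D are far apart on the tree. A variant that passes this filter is only "recent relative to the sampled relatives", not proven de novo, and the raw counts would still need to be normalised (for example by terminal branch length) before comparing strains.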
Papers that do something similar are welcome :-) Thanks!