Dear BioStars,
I have a few Illumina datasets where I've seen a puzzling pattern in my mapped reads, and would appreciate some tips as to how to figure out what's going on. These datasets represent bacterial cultures. I have mapped the reads to genomes that are expected to be very closely related -- less than one SNP per kilobase. However, in some regions of the genomes, there are multiple reads that suggest the existence of multiple SNPs, while other reads perfectly match the reference sequence.
These appear to be true SNPs and not sequencing errors for a few reasons:
- There exists a distinct set of positions that are bi-allelic. In other words, the same putative SNP shows up in many reads. I have deep coverage (>100x) and these putative SNPs are seen in 20-80% of the reads.
- The SNPs appear to be linked -- the non-reference SNPs are found on the same DNA fragment (both within reads and in paired reads)
- The reads are not duplicates of each other (different starts and stops, both strands)
My working hypothesis is that my the sequencing library actually contained two different genotypes. In one of my studies, this would be interesting and I would like to get a better sense of what's going on in the population of bacteria. In the other study, I fear that there was cross-contamination between my DNA preps from different cultures, and I would like to understand this better for QC reasons.
So does anyone know of a analytic tool that would help me disentangle these sequences? If I were to make it myself, I would include the following features:
- A report of sequence diversity along the genome to identify the locations of these putative SNPs.
- A report of the linkage between the SNPs
- The sequence of the putative contaminant/invader so that I can figure out where it came from.
Any thoughts would be appreciated. I would also like to hear whether others have encountered these patterns and if there are other possible explanations.
Thanks
Adam
Have you checked the read depth? The strain could have an extra copy not present in the reference genome. Either way, I would try a de novo assembler that does not aggressively collapse bubbles. Things get easier when you have longer contigs (e.g. to identify long divergent haplotypes). Also, it is not uncommon for two closely related strains to have some regions with high divergence.
Thanks for the ideas. Based on read depth, it does not look like a duplication, but I'll probably incorporate that possibility into any systematic analysis that I do. And yes, a less aggressive assembler could allow me to recover both genotypes for the diverse regions.