I am performing amplicon sequencing, so I have captured regions of the human genome whose reads align to exactly the same loci. I am seeing a pattern I had not expected: within a given amplicon region I see a couple of mutations, then a stretch where few if any mutations are found, and then a lot of mutations, as expected. I have included a hypothetical example that illustrates the point. My question is why I would see a gap towards the left of the plot, which is earlier in the genome. Is this because early variants cause a read to not align to the reference, or something like this?
I am using bwa mem and freebayes for alignment and variant calling.
Another possibility would be that this region is extremely well conserved.
Without going into a lot of detail, this does not appear to be the case, especially because I see the same thing over multiple amplicons.
You should look at the alignment file as well and investigate those.
Sorry I'm not totally understanding you. Do you mean that within the alignment files I should check that the regions are well covered or something?
Yes, this relates to what WouterDeCoster said. If your region is well covered and all the reads align with no mismatches then there is the answer for the lack of mutations.
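To make the coverage check concrete, here is a minimal sketch that counts per-position depth directly from SAM text records using only the standard library. The function name `depth_from_sam` is hypothetical, and in practice a dedicated tool or library would normally do this; the sketch only shows what "well covered" means in terms of the CIGAR strings. It assumes standard tab-separated SAM records and does not count deleted or soft-clipped bases as coverage.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def depth_from_sam(lines):
    """Return a Counter mapping 1-based reference position -> number of
    reads with an aligned base there (hypothetical helper)."""
    depth = Counter()
    for line in lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
        if flag & 4 or cigar == "*":      # skip unmapped reads
            continue
        ref = pos
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":               # aligned bases: count them
                for p in range(ref, ref + length):
                    depth[p] += 1
                ref += length
            elif op in "DN":              # deletion/skip: advance, no count
                ref += length
            # I, S, H, P do not consume reference positions
    return depth
```

With the depth in hand, a region that has high depth and no mismatching bases simply has no variants to call; a region with low depth cannot support confident calls either way.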
Right. So the tricky thing here is that I am using amplicon sequencing. Unlike WGS with sheared DNA, this is targeted amplification, meaning that all reads should align perfectly. Therefore, if a variant is identified in later regions of the DNA, earlier regions of the DNA must also have been covered. So the coverage should be identical across a given locus.
Unless I'm missing something?
I don't get why you would expect perfect alignments.
The DNA that gets sequenced is between the probes and will contain variations. Perhaps you mean that the start of the read ought to be a perfect match - it is unclear.
In general, whether you are using WGS or a targeted approach is not relevant, in my opinion.
You are aligning against a reference genome that differs from the DNA you amplified. You clearly have variation there, no? And the goal is to find out what the differences are.
What I was suggesting above is that one needs to always assess the alignments as well. That's how we can tell what is going on with the data.
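As a rough sketch of what assessing the alignments could look like, the snippet below summarizes each mapped read in SAM text: where it starts, how many bases are soft-clipped at each end, and the edit distance from the optional NM tag. The function name `clip_and_mismatch_summary` is hypothetical, and it assumes the aligner wrote NM tags (it reports None otherwise). Whether clipping explains the gap in this particular data set is only a possibility to check, not a conclusion.

```python
import re

# Soft clips at either end of the CIGAR, allowing an outer hard clip.
LEFT_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")
RIGHT_CLIP = re.compile(r"(\d+)S(?:\d+H)?$")

def clip_and_mismatch_summary(lines):
    """Return one dict per mapped read with its start position,
    soft-clip lengths, and NM edit distance (hypothetical helper)."""
    rows = []
    for line in lines:
        if line.startswith("@"):          # skip header lines
            continue
        f = line.rstrip("\n").split("\t")
        flag, cigar = int(f[1]), f[5]
        if flag & 4 or cigar == "*":      # skip unmapped reads
            continue
        left = LEFT_CLIP.search(cigar)
        right = RIGHT_CLIP.search(cigar)
        nm = None
        for tag in f[11:]:                # optional fields start at column 12
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
        rows.append({
            "name": f[0],
            "pos": int(f[3]),
            "left_clip": int(left.group(1)) if left else 0,
            "right_clip": int(right.group(1)) if right else 0,
            "nm": nm,
        })
    return rows
```

If many reads show a nonzero left clip, the bases at the start of the amplicon are not being compared against the reference at all, so differences there would not show up as mismatches for the variant caller.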