In my WES data, I am trying to detect a single-nucleotide insertion in a tandem repeat region of variable length. NGS reads cannot be mapped within this repeat, so I cannot detect the insertion from the WES data. The coverage across the gene suggests that the region is captured in our WES data, which means the reads are there but cannot be analyzed because of the mapping issue. So the question is: what would be the best approach to detect this insertion using my WES data? I thought of a few points that might complicate the analysis:
1- The increase in the number of reads carrying the insertion might be hidden within the noise.
2- The insertion might create a sequence that can be mapped to an alternative location, making it impossible to determine whether the reads originate from the gene of interest or from the other region.
3- The repeats are GC-rich and the insertion converts a 7C stretch into an 8C stretch. Such insertions are common artifacts in NGS.
Does anyone have a solution for this?
If the reads cannot be mapped, then it doesn't matter whether you have 10 or 10^10 of them: they cannot be mapped, so they do not contribute, no?
A single-nucleotide insertion changing the mapping location sounds unlikely. Repetitive regions that cannot be mapped are never 100% sequence identical, but large stretches are, which is why they are difficult or impossible to map confidently.
I do not think this is something you are going to solve with short reads. A strategy with long high-fidelity reads might make sense. Or, if you can find flanks of your region of interest that are unique enough to PCR-amplify the stretch, you could enrich it from the genomic background and then sequence these amplicons. It depends, of course, on how long the amplicon would be, but maybe that could be an option. I am not a long-read person, so I am not sure whether long reads have the fidelity these days to call single-base insertions. Just thinking aloud.
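If you want to try the flank idea on the reads you already have, one alignment-free option is to scan the raw reads for a left flank, a run of Cs, and a right flank, and tally the run lengths. Below is only a rough sketch under assumptions: the function names and the flank sequences in the usage example are made up for illustration, it expects plain gzipped FASTQ with four lines per record, and it only helps if both flanks fit inside a single read and are specific to the repeat copy you care about. It also does nothing about point 3: an excess of 8C reads still has to be compared against control samples, since homopolymer length errors are common.

```python
import gzip
import re
from collections import Counter
from itertools import islice

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_sequences(path):
    """Yield the sequence line of each record in a gzipped 4-line FASTQ."""
    with gzip.open(path, "rt") as handle:
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def c_run_lengths(reads, left_flank, right_flank, min_len=1, max_len=30):
    """Tally the length of the C run found between the two flanks.

    Each read is checked on both strands and counted at most once.
    The flanks are hypothetical inputs: pick them so that the left one
    does not end in C and the right one does not start with C,
    otherwise the reported run length is shifted.
    """
    pattern = re.compile(
        re.escape(left_flank.upper())
        + "(C{%d,%d})" % (min_len, max_len)
        + re.escape(right_flank.upper())
    )
    counts = Counter()
    for read in reads:
        read = read.upper()
        for seq in (read, revcomp(read)):
            match = pattern.search(seq)
            if match:
                counts[len(match.group(1))] += 1
                break
    return counts
```

Usage would be something like `c_run_lengths(fastq_sequences("sample_R1.fastq.gz"), "GACGT", "AGGTC")`, where the file name and both flanks are placeholders. Running the same tally on a few unrelated samples gives a baseline for how many 8C reads appear from noise alone.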