We are currently extracting human dna regions in the lab from multiple samples, and we would like to sequence them all with NGS. since the dna extraction is a very direct process we know that the resulting reads should only map to particular genome regions that we know in advance, so we need to know which method would be the best one to achieve this forced mapping, as we would like to handle all the results as we normally do (visualizing the reads through IGV, annotating variants through ANNOVAR,...).
After some internal discussion we've come along with 3 possibilities:
- Create a custom reference sequence and map all reads to it (we would afterwards have to translate the custom positions to genomic ones in order to get annotations for instance, editing bam and variant files; something that we'd prefer to avoid)
- Use our usual hg19 reference but make the aligner to consider our regions of interest only (either using a masked hg19 reference or maybe making the aligner to accept a bed file and to focus on it)
- Align the reads as we normally would, then extract the reads from our regions of interest only, and then edit the mapping quality 0 values manually (which value could we use? would we have to assign a random value like 10, or would it be possible to get the original mapping quality before the read was found somewhere else in the genome, hence lowering the mapping quality to 0?)
So we would like to know if anyone has already come across this situation and which approach did they take, or if any of the 3 possibilities we've ended up with sounds any better than the others. thanks in advance.
yes, it makes sense. we were just unsure how the mapper would deal with a masked reference, but I guess we'll just have to try it out. thanks.
Most aligners that you're likely to use won't have an issue with this.