From SEQanswers: "The main consequence of removing information is increasing the risk of false alignments -- i.e. BWA may decide to make an imperfect & incorrect alignment which is the best it can make with your masked database, but with an unmasked one it would find a better (and correct) alignment. So, you may have false variant calls as a result."
In a word, no. It's not necessary for any of the BWT aligners. In fact it would cause you to miss any unannotated exons, or possibly entire novel genes. It would also hide other biological events, such as genomic sequence contamination in the preparation process.
If you want to speed up the alignment, split the reads into several chunks and map them in parallel. Afterwards, use a merging program such as samtools or Picard to create a single BAM file. We usually aim for ~1 Gbp per process for a vertebrate genome.
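To make the chunking step concrete, here is a minimal sketch in Python, assuming plain uncompressed single-end FASTQ with four lines per record. The function names (`read_fastq`, `chunk_by_bases`) and the base budget are illustrative only and not part of any aligner or toolkit.

```python
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, separator, quality)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group a stream of FASTQ lines into 4-line records.

    Assumes well-formed, unwrapped FASTQ; a trailing partial
    record is silently dropped.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself four times yields
    # consecutive groups of four lines.
    yield from zip(stripped, stripped, stripped, stripped)


def chunk_by_bases(records: Iterable[FastqRecord],
                   max_bases: int) -> Iterator[List[FastqRecord]]:
    """Yield lists of records, closing a chunk once it holds
    at least max_bases bases."""
    chunk: List[FastqRecord] = []
    bases = 0
    for rec in records:
        chunk.append(rec)
        bases += len(rec[1])
        if bases >= max_bases:
            yield chunk
            chunk, bases = [], 0
    if chunk:  # leftover reads form the last, smaller chunk
        yield chunk


if __name__ == "__main__":
    # Toy input: five 4-base reads, budget of 8 bases per chunk.
    demo = []
    for i in range(5):
        demo += [f"@read{i}", "ACGT", "+", "IIII"]
    for n, chunk in enumerate(chunk_by_bases(read_fastq(demo), 8)):
        print(n, [rec[0] for rec in chunk])
```

For real data you would set `max_bases` to roughly 1_000_000_000, write each chunk to its own file, align each file in a separate process, and merge the resulting BAM files. For paired-end reads, both mate files need to be cut at the same record boundaries so pairs stay together, which this sketch does not handle.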
"it would cause you to miss any unannotated exons": tell me if I'm wrong, but I'm using exome capture, so my reads should only map to the set of genes/probes defined by the manufacturer (NimbleGen exome).
In theory yes, but in our (limited) experience the targeted capture laboratory methods are not perfect, and you will get some percentage of your reads correctly mapping outside of the target region because that's what was in the lane. You will probably want to quantify this.
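One way to quantify this is to count how many aligned reads overlap none of the capture targets. The sketch below is only an illustration: it assumes you have already extracted target intervals and alignment positions as `(chrom, start, end)` tuples in 0-based, half-open coordinates (as in BED), and the helper names are made up for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(targets):
    """Merge overlapping target intervals per chromosome and
    return {chrom: (sorted_starts, merged_intervals)}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def on_target(index, chrom, start, end):
    """True if the read [start, end) overlaps any target interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Last merged interval that starts before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start


def off_target_fraction(index, alignments):
    """Fraction of alignments that overlap no target interval."""
    total = 0
    off = 0
    for chrom, start, end in alignments:
        total += 1
        if not on_target(index, chrom, start, end):
            off += 1
    return off / total if total else 0.0
```

Counting a read as on-target when it overlaps a target by even one base is a design choice; many people pad the targets by some flanking distance first, since capture also pulls down sequence next to the probes.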
I gave the answer to another question: "No, do not align to masked genome for any purpose." "Masking has never been perfect and probably will never be perfect. This will lead to wrongly mapped sequences, spurious SNPs/indels calls and all sorts of problems. I cannot think of a single use case when masking [before mapping] may lead to better outcomes."
In the case of exomes, when you map the reads you will find that a lot of them come from unique non-targeted regions.
updated 5.2 years ago by Ram • written 13.9 years ago by lh3
How would you identify the non-genic sequences before alignment?
You could use a reference consisting of the exonic regions or the sequences of the capture probes instead of the whole genome. That would speed up the alignment, but would make it harder to identify events like genomic rearrangements. Also, it is good to know whether the untargeted sequences are part of your target species or have another origin.
The point is: Before alignment you don't know anything about the sequences except things like nucleotide distribution.
By choosing your reference for alignment, you limit yourself to the sequences you're interested in.