Question

Exome Sequencing: Masking The Non-Genic Sequences ?

10

14.4 years ago

Pierre Lindenbaum 166k

Hi all,

I'm about to align a set of short reads on the genome after an exome capture.

I wonder whether it would be worth replacing all the non-genic genomic sequences with 'N'.

Wouldn't it speed up the alignment with BWA ? What would be the other possible consequences ? Has it ever been done before ?

Many thanks

Pierre

next-gen sequencing bwa exome fasta • 9.5k views

ADD COMMENT • link updated 14.4 years ago by lh3 33k • written 14.4 years ago by Pierre Lindenbaum 166k

1

From SEQanswers: "The main consequence of removing information is increasing the risk of false alignments -- i.e. BWA may decide to make an imperfect & incorrect alignment which is the best it can make with your masked database, but with an unmasked one it would find a better (and correct) alignment. So, you may have false variant calls as a result."

ADD REPLY • link 14.4 years ago by Pierre Lindenbaum 166k

9

14.4 years ago

lh3 33k

I gave the answer to another question: "No, do not align to masked genome for any purpose." "Masking has never been perfect and probably will never be perfect. This will lead to wrongly mapped sequences, spurious SNPs/indels calls and all sorts of problems. I cannot think of a single use case when masking [before mapping] may lead to better outcomes."

In case of exomes, when you map the reads you will find a lot of them coming from unique non-targeted regions.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.4 years ago by lh3 33k

3

Picard's CalculateHsMetrics produces graphs and stats to help assess how well your capture worked (http://picard.sourceforge.net/picard-metric-definitions.shtml#HsMetrics)

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.4 years ago by Brad Chapman 9.7k

0

I support this. I don't see an advantage either, and the minimal speed-up isn't worth the limitation, imho.

ADD REPLY • link 14.4 years ago by Michael 55k

3

14.4 years ago

Jan Oosting ▴ 920

How would you identify the non-genic sequences before alignment ?

You could use a reference consisting of the exonic regions or the sequences of the capture probes instead of the whole genome. That would speed up the alignment, but would make it harder to identify events like genomic rearrangements. Also, it is good to know whether the untargeted sequences are part of your target species or have another origin.

ADD COMMENT • link 14.4 years ago by Jan Oosting ▴ 920

0

I would use UCSC knownGene +/- 10 kb.

ADD REPLY • link 14.4 years ago by Pierre Lindenbaum 166k

0

The point is: before alignment you don't know anything about the sequences except things like nucleotide distribution. By choosing your reference for alignment, you limit the sequences you can detect to the ones you already decided you were interested in.

ADD REPLY • link 14.4 years ago by Jan Oosting ▴ 920

score 5 · Accepted Answer · 2010-12-17

5

14.4 years ago

biobot 0.0.77.a.1099 6.2k

In a word, no. It's not necessary for any of the BWT aligners. In fact it would cause you to miss any unannotated exons, or possibly entire novel genes. It would also hide other biological events, such as genomic sequence contamination in the preparation process.

If you want to speed up the alignment, split the reads into several chunks and map them in parallel. Afterwards, use a merging program such as samtools or Picard to create a single BAM file. We usually aim for ~1 Gbp per process for a vertebrate genome.
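A minimal sketch of the chunking step, assuming single-end reads in an unwrapped FASTQ file (exactly four lines per read). The function names `read_fastq` and `chunk_by_bases` are hypothetical helpers written for illustration, not part of BWA, samtools or Picard.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# header, sequence, separator, quality
FastqRecord = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records (assumes unwrapped FASTQ)."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def chunk_by_bases(
    records: Iterable[FastqRecord], max_bases: int
) -> Iterator[List[FastqRecord]]:
    """Group records into chunks of roughly max_bases bases each.

    A chunk is closed as soon as its total sequence length reaches
    max_bases, so the last read may push it slightly over the limit.
    """
    chunk: List[FastqRecord] = []
    bases = 0
    for rec in records:
        chunk.append(rec)
        bases += len(rec[1])  # rec[1] is the sequence line
        if bases >= max_bases:
            yield chunk
            chunk, bases = [], 0
    if chunk:  # leftover reads form the final, smaller chunk
        yield chunk
```

With `max_bases=1_000_000_000` this would correspond to the ~1 Gbp per process mentioned above. Each chunk would then be written to its own file, aligned by a separate process, and the resulting BAM files merged. For paired-end data, both mate files would have to be split at the same read boundaries so that the pairs stay in sync.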

ADD COMMENT • link 14.4 years ago by biobot 0.0.77.a.1099 6.2k

0

"it would cause you to miss any unannotated exons": tell me if I'm wrong but I'm using "Exome capture" so my reads should only be mapped on the set of genes/probes defined by the manufacturer (Exome Nimblegen).

ADD REPLY • link 14.4 years ago by Pierre Lindenbaum 166k

0

Yes, they should - if the annotation used to design the probes is identical to that used for analysis. Maybe I'm just paranoid about these things!

ADD REPLY • link 14.4 years ago by biobot 0.0.77.a.1099 6.2k

0

In theory yes, but in our (limited) experience the targeted capture laboratory methods are not perfect, and you will get some percentage of your reads correctly mapping outside of the target region because that's what was in the lane. You will probably want to quantify this.
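One way to quantify this is to count the fraction of aligned reads that do not overlap any target interval. The sketch below assumes the alignments have already been reduced to `(chrom, start, end)` tuples and that the targets come from the capture kit's interval list, both in 0-based half-open coordinates as in BED. The names `build_index`, `on_target` and `off_target_fraction` are hypothetical helpers for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(targets):
    """Merge target intervals per chromosome and return sorted start/end lists."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:  # overlapping or adjacent: extend
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def on_target(index, chrom, start, end):
    """True if the read [start, end) overlaps any merged target interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval that starts at or before the read's last base.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def off_target_fraction(index, reads):
    """Fraction of reads that overlap no target interval (0.0 if no reads)."""
    reads = list(reads)
    if not reads:
        return 0.0
    off = sum(1 for chrom, start, end in reads if not on_target(index, chrom, start, end))
    return off / len(reads)
```

Because the merged intervals are disjoint and sorted, a single binary search per read is enough. This counts a read as on-target if it overlaps a target by at least one base; a stricter definition (for example, padding or a minimum overlap) would change the number.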

ADD REPLY • link 14.4 years ago by David Quigley 11k