OK, I'm using a read-depth based algorithm for CNV (copy number variation) detection. My general question is: for read-depth based algorithms, should we mask out repeats in the reference genome?
I used to run CNVnator (a read-depth based algorithm) on a BAM file without any quality filtering (i.e., it contained many low-quality read mappings), and I got around 8,000 CNVs (deletions + duplications) for the NA12878 pilot data. Recently I changed my pre-processing pipeline to discard mappings with quality Q < 20 and to remove PCR duplicates (using Picard MarkDuplicates), and I then got 40,000 CNVs for NA12878!
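For concreteness, here is a minimal sketch of the MAPQ/duplicate filtering step I mean, assuming pysam is available and that Picard MarkDuplicates has already set the duplicate flag; the file names and threshold are placeholders:

```python
# Sketch of the filtering step: drop unmapped reads, low-MAPQ mappings,
# and reads flagged as PCR duplicates by Picard MarkDuplicates.
# "input.bam" / "filtered.bam" are placeholder names.
import pysam

MIN_MAPQ = 20

with pysam.AlignmentFile("input.bam", "rb") as bam_in, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.is_unmapped or read.is_duplicate:
            continue
        if read.mapping_quality < MIN_MAPQ:
            continue
        bam_out.write(read)
```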
I then compared the read depth on chr5 between the default and filtered BAM files using IGV (see the picture), and also looked up the newly-identified CNVs in the UCSC browser. Most of the newly-identified CNVs fall in repeats (LINEs, SINEs). Low-quality mappings tend to pile up in such regions (because multi-mapping reads are placed there more or less at random), so removing these reads makes such regions look like deletions. Am I correct?
I think I still need to remove the low-quality mappings, but I also need to deal with the repeats somehow. If so, for any read-depth CNV algorithm, should we first mask repeats in the reference genome, or should we instead discard predicted CNVs that fall in repeats because they are unreliable? (A sketch of what I mean by the latter is below.)
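If post-filtering the calls is the better option, something like this is what I have in mind: drop CNV calls whose overlap with a RepeatMasker track (e.g. exported from the UCSC Table Browser as BED) exceeds some threshold. The 50% cutoff and the file names are arbitrary placeholders, not recommendations:

```python
# Sketch of post-filtering CNV calls by repeat content. Assumes the CNV calls
# and a RepeatMasker track are both available as simple BED files
# (chrom, start, end). Overlapping repeat intervals are not merged here,
# so the repeat fraction can be slightly overestimated.
from collections import defaultdict

def read_bed(path):
    intervals = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if line.startswith(("#", "track")):
                continue
            chrom, start, end = line.split()[:3]
            intervals[chrom].append((int(start), int(end)))
    return intervals

def repeat_fraction(chrom, start, end, repeats):
    covered = 0
    for r_start, r_end in repeats.get(chrom, []):
        covered += max(0, min(end, r_end) - max(start, r_start))
    return covered / max(1, end - start)

repeats = read_bed("rmsk.bed")
with open("cnv_calls.bed") as calls, open("cnv_calls.filtered.bed", "w") as out:
    for line in calls:
        chrom, start, end = line.split()[:3]
        frac = repeat_fraction(chrom, int(start), int(end), repeats)
        if frac < 0.5:  # keep calls that are mostly non-repetitive
            out.write(line)
```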
Thanks
See the picture: http://www.freeimagehosting.net/d9bea
Sorry but what do you mean by "normal" here?
I would consider DNA from the blood of other unrelated healthy individuals. You could also use data from the 1000 Genomes Project. Just make sure you trim the reads to the same length and re-align them with the same aligner.
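For example, a minimal sketch of trimming reads to a uniform length before re-alignment (the target length and file names are placeholders):

```python
# Trim every read in a FASTQ file to a fixed length so control and test
# samples are comparable before re-alignment. Reads shorter than the
# target length are skipped.
TARGET_LEN = 36

with open("control.fastq") as fq_in, open("control.trimmed.fastq", "w") as fq_out:
    while True:
        header = fq_in.readline()
        if not header:
            break
        seq = fq_in.readline().rstrip("\n")
        plus = fq_in.readline()
        qual = fq_in.readline().rstrip("\n")
        if len(seq) < TARGET_LEN:
            continue
        fq_out.write(header)
        fq_out.write(seq[:TARGET_LEN] + "\n")
        fq_out.write(plus)
        fq_out.write(qual[:TARGET_LEN] + "\n")
```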