OK, I'm using a read-depth based algorithm for CNV (copy number variation) detection. My general question is: for read-depth based algorithms, should we mask out repeats in the reference genome?
I used to run CNVnator (a read-depth based algorithm) on a BAM file without any quality filtering (i.e., it contained many low-quality read mappings), and I got around 8,000 CNVs (deletions + duplications) for the NA12878 pilot data. Recently I changed my pre-processing pipeline to discard mappings with quality Q < 20 and to remove PCR duplicates (using Picard MarkDuplicates), and I then got 40,000 CNVs for NA12878!
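For concreteness, here is a minimal sketch of the MAPQ/duplicate filtering step I mean, assuming pysam is available and that Picard MarkDuplicates has already set the duplicate flag; the file names and threshold are placeholders:

```python
# Sketch of the filtering step: drop unmapped reads, low-MAPQ mappings,
# and reads flagged as PCR duplicates by Picard MarkDuplicates.
# "input.bam" / "filtered.bam" are placeholder names.
import pysam

MIN_MAPQ = 20

with pysam.AlignmentFile("input.bam", "rb") as bam_in, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.is_unmapped or read.is_duplicate:
            continue
        if read.mapping_quality < MIN_MAPQ:
            continue
        bam_out.write(read)
```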
I then compared the read depth on chr5 between the default and filtered BAM files using IGV (see the picture), and also looked up the newly-identified CNVs in the UCSC browser. Most of the newly-identified CNVs fall in repeats (LINEs, SINEs). Low-quality mappings tend to pile up in such regions (because multi-mapping reads are placed there more or less at random), so removing these reads makes such regions look like deletions. Am I correct?
I think I still need to remove the low-quality mappings, but I also need to deal with the repeats somehow. If so, for any read-depth CNV algorithm, should we first mask repeats in the reference genome, or should we instead discard predicted CNVs that fall in repeats because they are unreliable? (A sketch of what I mean by the latter is below.)
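If post-filtering the calls is the better option, something like this is what I have in mind: drop CNV calls whose overlap with a RepeatMasker track (e.g. exported from the UCSC Table Browser as BED) exceeds some threshold. The 50% cutoff and the file names are arbitrary placeholders, not recommendations:

```python
# Sketch of post-filtering CNV calls by repeat content. Assumes the CNV calls
# and a RepeatMasker track are both available as simple BED files
# (chrom, start, end). Overlapping repeat intervals are not merged here,
# so the repeat fraction can be slightly overestimated.
from collections import defaultdict

def read_bed(path):
    intervals = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if line.startswith(("#", "track")):
                continue
            chrom, start, end = line.split()[:3]
            intervals[chrom].append((int(start), int(end)))
    return intervals

def repeat_fraction(chrom, start, end, repeats):
    covered = 0
    for r_start, r_end in repeats.get(chrom, []):
        covered += max(0, min(end, r_end) - max(start, r_start))
    return covered / max(1, end - start)

repeats = read_bed("rmsk.bed")
with open("cnv_calls.bed") as calls, open("cnv_calls.filtered.bed", "w") as out:
    for line in calls:
        chrom, start, end = line.split()[:3]
        frac = repeat_fraction(chrom, int(start), int(end), repeats)
        if frac < 0.5:  # keep calls that are mostly non-repetitive
            out.write(line)
```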
Thanks
See the picture: http://www.freeimagehosting.net/d9bea
Sorry but what do you mean by "normal" here?
I would consider DNA from the blood of other unrelated healthy individuals. You could also use data from the 1000 Genomes Project. Just make sure you trim the reads to the same length and re-align them with the same aligner.
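For example, a minimal sketch of trimming reads to a uniform length before re-alignment (the target length and file names are placeholders):

```python
# Trim every read in a FASTQ file to a fixed length so control and test
# samples are comparable before re-alignment. Reads shorter than the
# target length are skipped.
TARGET_LEN = 36

with open("control.fastq") as fq_in, open("control.trimmed.fastq", "w") as fq_out:
    while True:
        header = fq_in.readline()
        if not header:
            break
        seq = fq_in.readline().rstrip("\n")
        plus = fq_in.readline()
        qual = fq_in.readline().rstrip("\n")
        if len(seq) < TARGET_LEN:
            continue
        fq_out.write(header)
        fq_out.write(seq[:TARGET_LEN] + "\n")
        fq_out.write(plus)
        fq_out.write(qual[:TARGET_LEN] + "\n")
```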