I'm currently running CNVkit on around 100 paired exome samples. I ran a GISTIC analysis and found some segments that are amplified or deleted in a significant number of samples in my cohort. These are almost certainly artifacts, and I've found two possible causes.
The first is that some regions show vertical lines on the VAF/BAF plot. I narrowed one of them down to a cluster of MUC genes on chr7 q22.1 (MUC3A, MUC12, MUC17, etc.). The corresponding log2 values at this position are also highly variable, and the segmentation algorithm regularly calls an amplification there.
The second is that some locations have many target regions in close proximity and generally uneven coverage. One example in my data is FLG2. CNVkit seems to exclude many of these targets for having a spread greater than 1; however, a few fall below the threshold and are kept. The coverage distribution here is uneven, so this region is almost always called in error.
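To illustrate the spread threshold mentioned above, here is a minimal sketch of flagging bins with a stricter cut-off than 1. It assumes the reference file is tab-separated with `chromosome`, `start`, `end`, `gene` and `spread` columns; the function names, the 0.5 cut-off and the example rows are hypothetical, not taken from CNVkit or from my data.

```python
import csv
import io


def high_spread_bins(cnn_text, max_spread=0.5):
    """Return (chrom, start, end, gene) for bins whose spread exceeds max_spread.

    Assumes a tab-separated table with a header row containing the columns
    chromosome, start, end, gene and spread (an assumption about the layout).
    """
    reader = csv.DictReader(io.StringIO(cnn_text), delimiter="\t")
    flagged = []
    for row in reader:
        if float(row["spread"]) > max_spread:
            flagged.append(
                (row["chromosome"], int(row["start"]), int(row["end"]), row["gene"])
            )
    return flagged


def to_bed(regions):
    """Format flagged regions as BED-style lines (chrom, start, end, name)."""
    return "".join(f"{c}\t{s}\t{e}\t{g}\n" for c, s, e, g in regions)


# Made-up rows for illustration only; coordinates and values are not real.
EXAMPLE_CNN = (
    "chromosome\tstart\tend\tgene\tlog2\tdepth\tgc\trmask\tspread\n"
    "chr1\t100\t200\tGENE_A\t0.01\t150.0\t0.45\t0.0\t0.12\n"
    "chr1\t300\t400\tGENE_B\t0.35\t90.0\t0.60\t0.1\t0.85\n"
)

if __name__ == "__main__":
    print(to_bed(high_spread_bins(EXAMPLE_CNN, max_spread=0.5)), end="")
```

A lower cut-off would catch bins like the FLG2 ones that sit just under 1, at the cost of dropping more targets overall, so the value would need tuning per capture kit.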
I have tried the suggestions for filtering false positives, such as segmetrics, but these don't remove the problem regions. I could manually find the locations and exclude them in the access command (as I have done for the HLA region), but I'd prefer a more automatic way of identifying low-confidence positions, in case I miss some that are less obvious. Does an access file exist specifically for whole-exome data with further low-confidence regions (including problematic genes) already identified?
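As a rough idea of what "automatic" could mean here, the sketch below counts how often each target bin is covered by a called segment across the cohort and reports bins altered in at least a given fraction of samples. This is not a CNVkit feature: the function name, the log2 cut-off of 0.3, the 30% recurrence fraction and the example coordinates are all assumptions chosen for illustration. Note that it cannot tell a genuinely recurrent event from an artifact, so the output would still need a manual look.

```python
from collections import defaultdict


def recurrent_bins(bins, cohort_segments, log2_cutoff=0.3, min_fraction=0.3):
    """Return bins altered in at least min_fraction of samples.

    bins: list of (chrom, start, end) target regions.
    cohort_segments: dict mapping sample name to a list of
        (chrom, start, end, log2) segments for that sample.
    A bin counts as altered in a sample if any overlapping segment has
    |log2| >= log2_cutoff (hypothetical threshold).
    """
    counts = defaultdict(int)
    for segments in cohort_segments.values():
        for chrom, start, end in bins:
            altered = any(
                seg_chrom == chrom
                and seg_start < end
                and seg_end > start
                and abs(seg_log2) >= log2_cutoff
                for seg_chrom, seg_start, seg_end, seg_log2 in segments
            )
            if altered:
                counts[(chrom, start, end)] += 1
    n_samples = len(cohort_segments)
    if n_samples == 0:
        return []
    return [b for b in bins if counts[b] / n_samples >= min_fraction]


# Made-up example: three samples, two of which call a gain over the first bin.
EXAMPLE_BINS = [("chr7", 1000, 2000), ("chr7", 5000, 6000)]
EXAMPLE_COHORT = {
    "sample1": [("chr7", 900, 2100, 0.8), ("chr7", 2100, 9000, 0.02)],
    "sample2": [("chr7", 0, 1500, 0.6), ("chr7", 1500, 9000, -0.05)],
    "sample3": [("chr7", 0, 9000, 0.01)],
}

if __name__ == "__main__":
    for chrom, start, end in recurrent_bins(EXAMPLE_BINS, EXAMPLE_COHORT, min_fraction=0.5):
        print(f"{chrom}\t{start}\t{end}")
```

The resulting regions could then be reviewed and added to the same exclusion list used for the HLA region.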
Best, Andy