Dear all,
I am preparing a list of regions of the genome that are likely to include CNVs. To do so, I am excluding assembly gaps, regions with poor mappability, and repeat regions as reported in UCSC. I know from the literature that regions near centromeres/telomeres, and those with low/high GC content, are also worth excluding. My questions are: a) what does "near" a centromere/telomere mean? b) what are meaningful thresholds for GC content? Finally, c) is there any other feature I should be aware of?
Thank you very much!
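To make the GC part of the question concrete, here is a minimal sketch of how windows with extreme GC content could be flagged once thresholds are chosen. The function names, the window size, and the `low`/`high` cutoffs (0.3 and 0.7) are placeholders for illustration only, not recommended values; picking the actual thresholds is exactly what question b) asks.

```python
def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windows_outside_gc_range(seq, window, low, high):
    """Return half-open (start, end) windows whose GC fraction is outside [low, high].

    Windows made only of ambiguous bases (e.g. N) are flagged as well.
    A trailing partial window is ignored.
    """
    flagged = []
    for start in range(0, len(seq) - window + 1, window):
        frac = gc_fraction(seq[start:start + window])
        if frac is None or frac < low or frac > high:
            flagged.append((start, start + window))
    return flagged


# Toy sequence: AT-only, GC-only, balanced, and N-only windows of 10 bp each.
toy = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT" + "NNNNNNNNNN"
# low=0.3 and high=0.7 are placeholder cutoffs, not recommendations.
flagged = windows_outside_gc_range(toy, window=10, low=0.3, high=0.7)
print(flagged)
```

On real data the window would be much larger and the sequence would come from the reference FASTA; the toy string only shows the mechanics.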
I am not selecting regions to run a CNV caller, but to create a null distribution for a set of statistical analyses. Therefore knowing how to deal with CpG content is important.
Thanks for the answer about the telomeres/centromeres. I am indeed removing all the gaps (http://genome.ucsc.edu/cgi-bin/hgTrackUi?&c=chr17&g=gap) as well as the repeat regions. Should I also remove "Regions of Exceptionally High Depth of Aligned Short Reads" (http://genome.ucsc.edu/cgi-bin/hgTrackUi?&c=chr17&g=hiSeqDepth)? What is a good threshold in this case?
Most of the regions with exceptionally high depth of aligned short reads should already be excluded by the RepeatMasker and gap filters. For those that are not, the coverage will depend on your library size. The best approach would be to plot several histograms of coverage from your data and choose a threshold based on them.
However, I would not worry much about those regions for the purpose of copy number variation calling, because they would show similarly high coverage in all your cases and controls.
If you insist on removing those regions, do not rely on the definitions in the UCSC track; apply a coverage filter to your own data instead.
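As a rough illustration of the histogram-then-filter idea, the sketch below bins per-region depths into a coarse histogram and flags regions whose depth exceeds a multiple of the median. The function names, the `factor=3.0` cutoff, and the toy depth values are assumptions made up for this example; the real cutoff should come from looking at the histograms of your own data.

```python
from statistics import median


def depth_histogram(depths, bin_width=10):
    """Count regions per depth bin; keys are the lower edge of each bin."""
    counts = {}
    for d in depths:
        lower_edge = int(d // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


def high_depth_regions(depths, factor=3.0):
    """Return indices of regions whose depth exceeds factor * median depth.

    factor is a placeholder; choose it after inspecting the histogram.
    """
    cutoff = factor * median(depths)
    return [i for i, d in enumerate(depths) if d > cutoff]


# Toy per-region mean depths (hypothetical values).
depths = [28, 31, 30, 29, 33, 250, 30, 27, 400, 32]
hist = depth_histogram(depths)
outliers = high_depth_regions(depths)
print(hist)
print(outliers)
```

Using a multiple of the median rather than a fixed depth keeps the cutoff tied to the library size, which is the point made above.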
I am not doing copy number variation calling: I am just selecting genome regions to create a null distribution for a set of statistical analyses. That is, I have the whole genome (as in UCSC), but I need to extract only those regions that may include a CNV, so as not to bias the analyses in my favour. But yes, I got your point: regions with exceptionally high depth of aligned short reads are specific to the UCSC track (or to a particular experiment), not a general problem!
Thanks.
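For the region selection discussed in this thread (whole genome minus gaps, repeats, and any other excluded tracks), the subtraction step could be sketched as below. This is a minimal pure-Python illustration with a hypothetical function name and toy coordinates; intervals are assumed to be half-open, as in BED files, and in practice a tool such as `bedtools subtract` does the same job on real BED files.

```python
def subtract_intervals(start, end, excluded):
    """Return the parts of [start, end) not covered by any excluded interval.

    excluded is a list of half-open (start, end) tuples on the same chromosome;
    they may overlap each other and need not be sorted.
    """
    kept = []
    pos = start  # leftmost position not yet assigned to kept or excluded
    for ex_start, ex_end in sorted(excluded):
        if ex_end <= pos:
            continue  # entirely behind the current position
        if ex_start >= end:
            break  # entirely past the region of interest
        if ex_start > pos:
            kept.append((pos, ex_start))
        pos = max(pos, ex_end)
    if pos < end:
        kept.append((pos, end))
    return kept


# Toy example: a 1000 bp "chromosome" with overlapping excluded tracks.
excluded = [(100, 200), (150, 300), (900, 1200)]
candidate_regions = subtract_intervals(0, 1000, excluded)
print(candidate_regions)
```

The remaining intervals are the candidate regions from which the null distribution would then be sampled.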