Hi,
I am trying to integrate CNVkit into our in-house clinical exome pipeline. We mostly analyse (different) monogenic or polygenic diseases, so no tumor/normal pairs or anything like that.
All data is generated on the same technology platform (NovaSeq 6000), with the same kit and the same bioinformatics pipeline.
Now, my thought was that I could use a few dozen of our already sequenced samples to generate a "normal" reference. I followed the README as closely as possible and filtered out any samples that clearly deviated in their metrics (as per "metrics"). This left me with 125 samples, from which I created a "normal" reference using the bait BED file for our exome kit of choice and the batch command (all other settings left at default).
My question is this: looking at the scatter plot (a test sample against my normal reference, after segmetrics and call), the data look very noisy. Is this expected? Or are there any knobs I should turn to clean this up (I am guessing that would mean going back to the reference step)?
Cheers, Marc
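For readers following along, the reference-building step described above can be sketched roughly as follows. This is only an illustration of the kind of invocation meant, not the exact command that was run: the helper name build_reference_command and all file names are hypothetical, and the sketch only assembles the argument list for "cnvkit.py batch" with normal samples and no tumor samples.

```python
def build_reference_command(normal_bams, targets_bed, fasta, output_reference):
    """Assemble the argument list for a normals-only 'cnvkit.py batch' run.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if not normal_bams:
        raise ValueError("at least one normal BAM file is required")
    command = ["cnvkit.py", "batch", "--normal"]
    command.extend(normal_bams)            # the pooled "normal" samples
    command += ["--targets", targets_bed]  # bait BED file of the exome kit
    command += ["--fasta", fasta]          # reference genome sequence
    command += ["--output-reference", output_reference]
    return command


# Hypothetical file names, for illustration only.
cmd = build_reference_command(
    ["sample001.bam", "sample002.bam"],
    "exome_baits.bed",
    "reference.fa",
    "pooled_normal_reference.cnn",
)
print(" ".join(cmd))
```

The list could be handed to subprocess.run(cmd, check=True) on a machine where CNVkit is installed; everything not listed is left at its default, as in the post above.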
Hi Marc, the level of noise depends on the enrichment platform, the number of reads in the library, and the overall coverage of the targeted regions. The CNVs you may want to check can be long or short. Based on a whole-genome plot alone, it is difficult to say whether the CNV caller is performing well or poorly on any particular data set.
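Since a whole-genome plot is hard to judge by eye, one option is to put a number on the noise. The following is a minimal sketch, assuming the bin-level log2 ratios of one sample have already been loaded into a plain Python list in genomic order; the function name spread_metrics is made up for this example and is not part of CNVkit. It computes robust spread estimates similar in spirit to what a spread summary would report.

```python
import statistics


def spread_metrics(log2_ratios):
    """Return robust spread estimates for bin-level log2 ratios.

    Assumes the values are ordered by genomic position and that there
    are at least three of them.
    """
    values = sorted(log2_ratios)
    med = statistics.median(values)
    # Median absolute deviation: typical distance of a bin from the median.
    mad = statistics.median([abs(v - med) for v in values])
    # Interquartile range from the 25th and 75th percentiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Spread of differences between neighbouring bins. Real copy number
    # changes affect few of these differences, so this mostly reflects noise.
    diffs = [b - a for a, b in zip(log2_ratios, log2_ratios[1:])]
    diff_med = statistics.median(diffs)
    diff_mad = statistics.median([abs(d - diff_med) for d in diffs])
    return {"mad": mad, "iqr": q3 - q1, "adjacent_diff_mad": diff_mad}


print(spread_metrics([0.0, 0.1, -0.1, 0.2, -0.2]))
```

Comparing these numbers between a test sample and the samples that went into the reference gives a rough idea of whether a given plot is unusually noisy for this platform and kit.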
Thanks for the clarification. There is one more thing I am unclear about - concerning repeat masking. We are not using a UCSC-based reference but an NCBI one, which does not include repeat masking. I could produce a masked version of that assembly, but the CNVkit documentation is silent about whether this should be soft- or hard-masked. Any advice?
Cheers, Marc
Simply exclude regions that are repetitive IF you don't really care about them. If you care about genes that contain repeats, your CNV calls there will be noisy, but it is sometimes possible to infer the presence of a CNV there.
It is all pretty specific to the task you are trying to solve: are you including off-target reads? What do you aim to detect (rare CNVs, polymorphisms, etc.)?
Well, I am basically just "fishing" - we do not necessarily have specific copy number variants that we are looking for. The range of diseases under investigation is quite broad and we would like to employ CNVkit as a complementary method to guide the search for underlying genetic causes.
So tuning it manually for each patient and for a particular CNV is not really an option. I need to set this up in a way that it performs reasonably across whatever range of coverages and CNV sizes is plausible for CNVkit. That said, our sequencing is always the same (~390,000 baits and a coverage of roughly 100X).
The documentation hasn't really helped me figure out how best to do that beyond the general recommendations to look at metrics and remove deviating samples. All the use-case examples and plots in the documentation seem to be quite "clean" (in terms of the spread of log2 values), which suggests to me that my noisy data may simply not be properly tuned or filtered. Or maybe this is simply what this kind of plot looks like for full exome data at high coverage. Again, it's unclear to me what to expect in my use case.
/M
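The general recommendation mentioned above (look at the metrics and remove deviating samples) can be made reproducible with a fixed rule instead of a per-sample judgment call. Below is a minimal sketch, assuming a per-sample spread value (for example a MAD of log2 ratios) has already been collected into a dictionary. The function name flag_deviating_samples, the sample names, and the cutoff of 3.0 are assumptions made for this example, not CNVkit defaults.

```python
import statistics


def flag_deviating_samples(spread_by_sample, threshold=3.0):
    """Return names of samples whose spread is unusually high for the cohort.

    Uses a one-sided robust z-score: (value - median) / (1.4826 * MAD).
    The threshold of 3.0 is an arbitrary assumption for this sketch.
    """
    values = list(spread_by_sample.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All samples look (nearly) identical; nothing can be flagged.
        return []
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [
        name
        for name, value in spread_by_sample.items()
        if (value - med) / scale > threshold
    ]


# Hypothetical per-sample spread values, for illustration only.
cohort = {"s1": 0.20, "s2": 0.21, "s3": 0.19, "s4": 0.22, "s5": 0.60}
print(flag_deviating_samples(cohort))
```

Only samples that are noisier than the cohort are flagged; unusually quiet samples are kept. Whether 3.0 is a sensible cutoff would need to be checked against data with known CNVs, such as array results.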
I will try to get hold of complementary CNV data from array analyses to dial this in some more... but still, it would be nice to have a step-by-step discussion in the documentation of a real-world exome analysis of germline data.
Just to resolve the repeat-masking question - the reference must be hard-masked for CNVkit to recognize the masked regions. Soft-masked repeats do nothing. https://github.com/etal/cnvkit/blob/master/cnvlib/access.py
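To illustrate the difference, here is a minimal sketch of why hard-masking is visible to a tool that looks for runs of N while soft-masking is not. This is not the code from access.py; the function name hard_masked_runs and the min_length parameter are made up for this example, and the sketch assumes only uppercase N counts as masked.

```python
def hard_masked_runs(sequence, min_length=1):
    """Return (start, end) half-open intervals of consecutive 'N' bases.

    Lowercase (soft-masked) bases are treated like any other base.
    """
    runs = []
    start = None
    for i, base in enumerate(sequence):
        if base == "N":
            if start is None:
                start = i  # a new run of N begins here
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_length:
        runs.append((start, len(sequence)))
    return runs


# The lowercase stretch is soft-masked and is not reported.
print(hard_masked_runs("ACGTNNNNacgtacgtNNAC"))
print(hard_masked_runs("ACGTNNNNacgtacgtNNAC", min_length=3))
```

In a soft-masked assembly the repeats are only lowercased, so a scan like this finds nothing to exclude in them.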