Context: I'm a new CNVkit user (using 0.9.6). I have 6 exome-seq samples (from saliva DNA) on which I want to do germline copy number calling. Two of the samples are from saliva of healthy individuals and the other four are from saliva of individuals with breast cancer, all in the same extended family pedigree.
What I did so far: I ran the standard CNVkit pipeline with a flat reference and used the --access option with the access-5kb-mappable.hg19.bed file mentioned in the docs to exclude poorly mappable regions. I ran segmetrics with the 'ci' option, then made a scatter plot (scatter plot output seen here). I also ran the pipeline again, this time using my 2 healthy samples to build a pooled copy number reference and comparing the individuals with breast cancer in the same family against it (scatter plot output seen here). The data look very noisy in both cases (see photos linked above).
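For readers who want to follow along, the flat-reference run described above corresponds roughly to the commands assembled below. This is only a sketch, not the exact commands that were run: the BAM names, the target BED file, the FASTA name, and the output directory are all hypothetical placeholders. The script just builds and prints the command lines; it does not execute CNVkit.

```python
import shlex

# All file and directory names here are hypothetical placeholders.
TARGETS = "exome_targets.bed"
FASTA = "hg19.fasta"
ACCESS = "access-5kb-mappable.hg19.bed"


def batch_flat(bams, outdir):
    """batch with a flat reference: -n is given without any normal BAMs."""
    return ["cnvkit.py", "batch", *bams, "-n",
            "--targets", TARGETS, "--fasta", FASTA,
            "--access", ACCESS, "--output-dir", outdir]


def segmetrics_ci(sample, outdir):
    """Add confidence intervals to each segment of one sample."""
    return ["cnvkit.py", "segmetrics", f"{outdir}/{sample}.cnr",
            "-s", f"{outdir}/{sample}.cns", "--ci",
            "-o", f"{outdir}/{sample}.segmetrics.cns"]


def scatter(sample, outdir):
    """Plot bin-level log2 ratios with the segments drawn on top."""
    return ["cnvkit.py", "scatter", f"{outdir}/{sample}.cnr",
            "-s", f"{outdir}/{sample}.segmetrics.cns",
            "-o", f"{outdir}/{sample}-scatter.pdf"]


samples = ["sample1", "sample2"]  # hypothetical sample names
print(shlex.join(batch_flat([f"{s}.bam" for s in samples], "results")))
for s in samples:
    print(shlex.join(segmetrics_ci(s, "results")))
    print(shlex.join(scatter(s, "results")))
```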
Questions:
Can anybody help me understand why my data is so noisy, and what the grey and orange bars/lines represent? I suspect it has to do with the reference (or rather the lack of a good one). I want to retry with some exome-seq samples that are unrelated to breast cancer and build a reference from those individuals, but I am not sure how much that would help, or whether the reference is the problem to begin with.
What are the grey and orange bars on the scatter plot? On https://cnvkit.readthedocs.io/en/stable/plots.html I only see red bars, which are described as the "segmentation line", but I am not entirely sure what this means, why my bars come in 2 colors, and why neither is red. I am using v0.9.6 of CNVkit.
Any help is greatly appreciated. Thank you
If you use a pooled reference, is the paired control sample no longer used in the analysis? Can you share your command? I ran mine as follows, but I could not find any criteria for identifying a noisy sample.
First, I used the batch command to get target.cnn and antitarget.cnn for all the control samples.
Second, I gathered all the control samples' target.cnn and antitarget.cnn files into an empty directory.
Third, I could not find a way for CNVkit to use both a matched control sample and a pooled reference, the way GATK Mutect2 does, so I just supply the pooled reference and ignore the matched normal sample.
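The pooling step above can be sketched as follows, assuming the collected coverage files are passed to the `reference` command. The `metrics` command is included as one possible way to compare noise between samples, since it reports the spread of bin-level ratios around the segments. The file names and the FASTA name are hypothetical, and the script only prints the command lines rather than running them.

```python
import shlex


def pooled_reference(coverage_files, fasta, out_cnn):
    """Combine per-sample target/antitarget coverage files into one reference."""
    return ["cnvkit.py", "reference", *coverage_files,
            "--fasta", fasta, "-o", out_cnn]


def noise_metrics(cnr_files, cns_files):
    """Report per-sample spread statistics for bin-level ratios vs. segments."""
    return ["cnvkit.py", "metrics", *cnr_files, "-s", *cns_files]


# Hypothetical control coverage files gathered into one directory.
controls = [
    "controls/ctrl1.targetcoverage.cnn",
    "controls/ctrl1.antitargetcoverage.cnn",
    "controls/ctrl2.targetcoverage.cnn",
    "controls/ctrl2.antitargetcoverage.cnn",
]
print(shlex.join(pooled_reference(controls, "hg19.fasta", "pooled_reference.cnn")))

# Hypothetical case-sample outputs to compare for noise.
print(shlex.join(noise_metrics(["results/case1.cnr", "results/case2.cnr"],
                               ["results/case1.cns", "results/case2.cns"])))
```

The resulting pooled_reference.cnn would then be passed to `batch` with `-r` for each sample to be analysed.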
Thanks a lot, and I look forward to hearing more about your experience with CNVkit.
So, I've made a small change to the way I work with CNVkit: for the baseline, use all samples in the same run. This works fine! As for my commands, first build your reference:
Then, run analysis using the created reference:
Last, call the copy numbers.
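The three steps above can be sketched as follows. These are not the poster's original commands: this is a minimal illustration that assumes every sample in the run is passed as a normal to build the baseline, and all BAM names, the target BED file, the FASTA name, and the output paths are hypothetical. The script only assembles and prints the command lines.

```python
import os
import shlex


def build_reference(bams, targets, fasta, access, out_cnn):
    """Step 1: build a pooled reference from all samples, with no analysis."""
    return ["cnvkit.py", "batch", "-n", *bams,
            "--targets", targets, "--fasta", fasta,
            "--access", access, "--output-reference", out_cnn]


def run_with_reference(bam, reference, outdir):
    """Step 2: analyse one sample against the existing reference."""
    return ["cnvkit.py", "batch", bam, "-r", reference, "-d", outdir]


def call_copy_numbers(bam, outdir):
    """Step 3: convert the sample's segments to integer copy numbers."""
    sample = os.path.splitext(os.path.basename(bam))[0]
    return ["cnvkit.py", "call", f"{outdir}/{sample}.cns",
            "-o", f"{outdir}/{sample}.call.cns"]


bams = ["s1.bam", "s2.bam", "s3.bam"]  # hypothetical: every sample in the run
print(shlex.join(build_reference(bams, "exome_targets.bed", "hg19.fasta",
                                 "access-5kb-mappable.hg19.bed",
                                 "all_samples_reference.cnn")))
for bam in bams:
    print(shlex.join(run_with_reference(bam, "all_samples_reference.cnn", "results")))
    print(shlex.join(call_copy_numbers(bam, "results")))
```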