Hi, I'm using WES data from tumor samples to call CNVs with CNVkit. My input was recalibrated BAM files (recal.bam). I used CNVkit's batch command (https://cnvkit.readthedocs.io/en/stable/pipeline.html) to generate the CNV profile. The output looks like this.
Then I applied a few stringent log2 copy ratio thresholds and ran several commands (segmetrics, genemetrics) to reduce the noise, and the output was like this. The data is still noisy. The CNVkit documentation recommends lowering the number of bins to reduce noise, but I couldn't find a command for that. The plot looks like this.
After getting this data, I tried to plot individual chromosomes. I used this command:

~/cnvkit$ cnvkit.py scatter Tumor.cnr -s Tumor.cns -c chr8:80000000-120000000 -g PDP1,POP1 -o chr8.jpg --segment-color red
Showing 2427 probes and 2 selected genes in region chr8:79999999-120000000
Wrote chr8.jpg

The plot looks like this.
~/cnvkit$ cnvkit.py scatter Tumor.cnr -s Tumor.cns -c chr8 -g PDP1,POP1 -o chr8_2.jpg --segment-color red
Showing 11090 probes and 2 selected genes in region chr8
Wrote chr8_2.jpg

The plot looks like this. I don't know how to fix this. I've been trying for 8 months, moving from one tool to another. I would be very grateful if anyone could help.
In my experience this looks like a sample that failed QC, not a CNVkit problem. It is not really possible to refine this: the coverage profiles of the two samples are too different. I might trust high-amplitude variants, but not the calling in general. Either the normal or the tumor library prep / sequencing failed (or was just very different, which is fine for SNV calling but not for CNA calling).
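A rough way to put a number on this kind of noise is to look at how much the log2 ratio jumps between neighbouring bins in the .cnr file. The sketch below is only an illustration and not part of CNVkit: it assumes a tab-separated .cnr with the usual `chromosome` and `log2` columns, and the function name `adjacent_bin_noise` is made up here. CNVkit's own `metrics` command reports comparable spread statistics and is probably the better starting point.

```python
import csv
import statistics


def adjacent_bin_noise(cnr_path):
    """Median absolute difference in log2 ratio between neighbouring bins.

    Hypothetical helper, not a CNVkit function. Differences are taken only
    within a chromosome, so jumps at chromosome boundaries are not counted.
    """
    diffs = []
    prev_chrom, prev_log2 = None, None
    with open(cnr_path, newline="") as handle:
        # Assumes the standard .cnr header with 'chromosome' and 'log2'.
        for row in csv.DictReader(handle, delimiter="\t"):
            chrom, log2 = row["chromosome"], float(row["log2"])
            if chrom == prev_chrom:
                diffs.append(abs(log2 - prev_log2))
            prev_chrom, prev_log2 = chrom, log2
    if not diffs:
        raise ValueError("need at least two bins on the same chromosome")
    return statistics.median(diffs)
```

Calling `adjacent_bin_noise("Tumor.cnr")` gives a single value per sample. There is no universal cutoff for it; comparing the value against other samples from the same capture kit is likely more informative than any fixed threshold.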
This is for only one tumor sample. After getting the sequencing data I checked its quality and it was very good. We also used this sample to identify indels and SNVs with the GATK pipeline, and we had no problems then. The data only looks noisy for CNV calling. I don't know where my fault lies.
It's unlikely that this is your "fault"; the data is just noisy. That happens, and you won't be able to use it for CNV calling. Move on.
I agree with Devon's comment in general. As additional info: sometimes you can work around the data by generating the normal reference from only those samples that are similar to your tumor samples, but you need 1) experience and 2) the motivation to do so (it can easily take a day of your time). Importantly, it may still not work out. I did this for a project on ultra-rare cancers where every sample was valuable, but I don't recommend it in general. For that project I even had to do FrankenTumors CNV calling, since the normal tissue was partially tumor tissue (FFPE samples; the seemingly normal tissue was actually affected by cancer). It takes days of manual work; don't recommend, 0 stars out of 5.
So, in conclusion: it is possible to manually "correct" the data, but if you can afford to lose one sample, just move on.
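The "reference from similar samples" idea mentioned above could be sketched roughly as follows. This is an assumption about one possible way to pick the samples, not the procedure used in that project: it ranks normals by the Pearson correlation of their per-bin log2 coverage with the tumor's, reading the `log2` column of CNVkit coverage files (for example `*.targetcoverage.cnn`). The function names and the `top_n` cutoff are hypothetical.

```python
import csv
import statistics


def read_log2(cnn_path):
    """Map each bin (chromosome, start, end) to its log2 coverage."""
    with open(cnn_path, newline="") as handle:
        return {
            (row["chromosome"], row["start"], row["end"]): float(row["log2"])
            for row in csv.DictReader(handle, delimiter="\t")
        }


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either profile is flat."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a flat profile as uninformative
    return cov / (var_x * var_y) ** 0.5


def rank_normals(tumor_cnn, normal_cnns, top_n=5):
    """Return up to top_n (path, correlation) pairs, most similar first."""
    tumor = read_log2(tumor_cnn)
    scored = []
    for path in normal_cnns:
        normal = read_log2(path)
        shared = sorted(tumor.keys() & normal.keys())
        if len(shared) < 2:
            continue  # too few shared bins to compare
        r = pearson([tumor[b] for b in shared], [normal[b] for b in shared])
        scored.append((path, r))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

The top-ranked coverage files (together with their antitarget counterparts) would then be passed to `cnvkit.py reference`, and the tumor re-normalised with `cnvkit.py fix` and re-segmented with `cnvkit.py segment`. Because the tumor's coverage also contains real copy number changes, the correlation is only a crude similarity measure, and the result still needs manual inspection.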
You can also try a different CNV caller. In my experience, the performance can vary substantially.
Could you suggest any?
There are some previous discussions on this topic, such as: Whole Exome CNV tools