Hi guys,
I'm using CNVkit to process my WES data and call copy number alterations. I have 110 tumor samples, each with a matched normal. CNVkit suggests using a pooled normal reference. Do I need to include all 110 normal .bam files when generating that reference, or is a subset sufficient? How does the choice of normal samples affect the result?
I'm asking for two reasons. First, using only several normals to generate the reference would save a lot of hard drive space in this step. Second, even if I use all 110 normal samples to construct the pooled reference, the reference will be different anyway when I move to another cohort with the same disease.
Hi Eric,
Thanks for the prompt reply. My normal samples are indeed blood cells from the patients. I had been concerned that using only 10-20 of the normal BAMs instead of all of them might not be sufficient to represent all the normal samples in the cohort. I understand that a pooled normal reference should only be used for a specific sample type and a specific experimental condition. I was just a little worried that there might be some heterogeneity among the normal samples: if you don't include all of them in the pooled reference, you might lose some information from the ones left out. Thanks again!
Yes, that's all true, but 10-20 samples is still usually good enough to capture most of the consistent characteristics and biases of your lab process, and it lets you quickly "peek" at the results with the batch command without wasting much computation. For a clinical pipeline, I would recommend you run coverage on each of the remaining normal samples to build a comprehensive pooled reference.
So how should we select the normal samples? Are there any suggestions? I noticed that you mentioned the BAM file sizes, so is there a concrete method? I also noticed that you mentioned metrics:
These statistics help quantify how “noisy” a sample is and help to decide which samples to exclude from an analysis, or to select normal samples for a reference copy number profile.
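One possible way to act on those statistics is sketched below in Python. It is only an illustration, not a method recommended in this thread: it assumes you already have a tab-separated per-sample metrics table with a "sample" column and a noise column such as "bivar" (these column names should be checked against your own metrics output), and it simply keeps the k samples with the lowest value. The function name pick_quietest_normals, the choice of metric, and the example numbers are all made up for the sketch.

```python
import csv
import io


def pick_quietest_normals(metrics_tsv, k, metric="bivar"):
    """Return the names of the k samples with the lowest `metric` value.

    metrics_tsv: text of a tab-separated table with a header row.
    The column names ("sample", "bivar", ...) are assumptions; adjust
    them to match the header of your own metrics table.
    """
    rows = list(csv.DictReader(io.StringIO(metrics_tsv), delimiter="\t"))
    # Lower spread is treated here as "less noisy" (an assumption of this sketch).
    rows.sort(key=lambda row: float(row[metric]))
    return [row["sample"] for row in rows[:k]]


# Made-up values, for illustration only.
EXAMPLE_METRICS = (
    "sample\tsegments\tstdev\tmad\tiqr\tbivar\n"
    "normal_A\t12\t0.21\t0.15\t0.20\t0.16\n"
    "normal_B\t45\t0.48\t0.33\t0.45\t0.37\n"
    "normal_C\t10\t0.18\t0.12\t0.17\t0.13\n"
    "normal_D\t20\t0.30\t0.22\t0.29\t0.24\n"
)

if __name__ == "__main__":
    # Keep the two least noisy normals from the example table.
    print(pick_quietest_normals(EXAMPLE_METRICS, k=2))
```

Ranking by a single metric is a deliberate simplification; in practice you might also want to look at the excluded samples before dropping them, since an unusually noisy normal can point to a sample-prep problem.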
Thanks a lot.