Question

CNVkit - how many normal samples do I need for the pooled normal reference?

7.5 years ago

ibphuangchen ▴ 10

Hi guys,

I'm using CNVkit to call copy number alterations from my WES data. I have 110 tumor samples, each with a matched normal. CNVkit recommends using a pooled normal reference. Do I need to supply all 110 normal .bam files to build that reference, or is a subset sufficient? How does the choice of normal samples affect the result?

I'm asking for two reasons. First, building the reference from only a few of the normals would save a lot of disk space at this step. Second, even if I use all 110 normal samples to construct the pooled normal reference, the reference would still be different when I move to another cohort with the same disease.

genome cnvkit • 4.7k views

score 2 · Answer 1 · 2018-02-01

7.5 years ago

Eric T. ★ 2.9k

You can use any number of normal samples with the reference command. You don't need access to all 110 normal samples at the same time; you can first run the coverage command twice on each BAM (once each for targets and antitargets), then collect the 'coverage' output .cnn files to use as input to the 'reference' command.
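As a rough illustration of the two-step workflow described above, the sketch below wraps the `coverage` and `reference` commands in two small shell functions. The file names (`targets.bed`, `antitargets.bed`, `genome.fa`, the `normals/` and `cov/` directories) and the `CNVKIT` override variable are assumptions for the example, not part of CNVkit itself; adjust them to your own setup.

```shell
# Sketch only: per-sample coverage, then a pooled reference from the .cnn files.
# CNVKIT can be overridden (e.g. CNVKIT=echo) to print the commands instead of running them.
CNVKIT="${CNVKIT:-cnvkit.py}"

normal_coverage() {
    # $1 = normal BAM, $2 = target BED, $3 = antitarget BED, $4 = output directory
    local bam="$1" targets="$2" antitargets="$3" outdir="$4"
    local name
    name="$(basename "$bam" .bam)"
    # Run coverage twice per BAM: once for targets, once for antitargets.
    $CNVKIT coverage "$bam" "$targets" -o "$outdir/$name.targetcoverage.cnn"
    $CNVKIT coverage "$bam" "$antitargets" -o "$outdir/$name.antitargetcoverage.cnn"
}

pooled_reference() {
    # $1 = genome FASTA, $2 = output reference file, remaining args = coverage .cnn files
    local fasta="$1" out="$2"
    shift 2
    $CNVKIT reference "$@" -f "$fasta" -o "$out"
}

# Example usage (hypothetical paths):
#   mkdir -p cov
#   for bam in normals/*.bam; do
#       normal_coverage "$bam" targets.bed antitargets.bed cov
#   done
#   pooled_reference genome.fa pooled_reference.cnn cov/*coverage.cnn
```

Because each BAM is only needed while its own coverage step runs, the BAMs can be processed one at a time and only the much smaller .cnn files need to be kept around for the `reference` step.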

If you're using the batch command to get quick initial results, you can just select 10-20 of your normal BAMs to use as a pooled reference. (List your BAM files by size with ls -Sl, then choose 10 to 20 samples from the middle of the list.) This pooled reference can be used for all of your tumor samples. If you decide later that you want a larger pool, you can run 'coverage' on additional samples and use those output .cnn files along with those from your existing pool to expand the reference.
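One way to automate the "middle of the size-sorted list" selection is sketched below. The function name `pick_middle_bams` and the directory layout are made up for this example, and it parses `ls` output, so it assumes BAM file names without spaces or newlines.

```shell
# Sketch only: pick N BAM files from the middle of a size-sorted listing.
pick_middle_bams() {
    # $1 = directory containing the normal BAMs, $2 = number of files to pick
    local dir="$1" n="$2" total start
    total=$(( $(ls -S "$dir"/*.bam | wc -l) ))
    # First line of the middle window (1-based), clamped to the start of the list.
    start=$(( (total - n) / 2 + 1 ))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    # ls -S lists the largest files first; keep N entries starting at the window.
    ls -S "$dir"/*.bam | tail -n "+$start" | head -n "$n"
}

# Example usage (hypothetical directory):
#   pick_middle_bams normals 15 > pool_normals.txt
```

The resulting list can then be passed to the `batch` command's normal-sample option, or fed through `coverage` and `reference` by hand.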

The coverage profile tends to be dependent on lab protocols and reagents, not disease -- anyway, the normal samples are from cells without disease, right? Separate references for fresh-frozen versus FFPE material would be worthwhile, and you should also build separate references per exome capture kit if the kit is not the same across your cohort.
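If the cohort does mix material types or capture kits, the normals can be grouped before building references. The sketch below assumes a hypothetical tab-separated sample sheet with a header row and the columns sample, material, and kit; neither the sheet layout nor the function name comes from CNVkit.

```shell
# Sketch only: assign each normal sample to a reference group by material and kit.
reference_groups() {
    # $1 = TSV sample sheet with header: sample<TAB>material<TAB>kit (hypothetical layout)
    # Prints "material_kit<TAB>sample", sorted so each group's samples are adjacent.
    awk -F'\t' 'NR > 1 { print $2 "_" $3 "\t" $1 }' "$1" | LC_ALL=C sort
}

# Example usage (hypothetical file):
#   reference_groups normals.tsv
# Each distinct group would then get its own pooled reference, built only from
# the coverage .cnn files of the samples listed under that group.
```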

Hi Eric,

Thanks for the prompt reply. My normal samples are indeed blood cells from the patients. I was concerned that using only 10-20 of the normal BAMs, instead of all of them, might not be sufficient to represent all the normal samples in the cohort. I understand that a pooled normal reference should only be used for a specific sample type and a specific experimental condition. I was just a little worried that there might be some heterogeneity among the normal samples: if you don't include all of them in the pooled reference, you might lose some information from the normals that were left out. Thanks again!

Reply • 7.5 years ago by ibphuangchen ▴ 10

1

Yes, that's all true, but 10-20 samples is still usually good enough to capture most of the consistent characteristics and biases of your lab process, and it lets you quickly "peek" at the results with the batch command without wasting much computation. For a clinical pipeline, I would recommend that you also run coverage on each of the remaining normal samples and build a comprehensive pooled reference from all of them.

Reply • 7.5 years ago by Eric T. ★ 2.9k

So how should we select the normal samples? Are there any suggestions? I noticed that you mentioned BAM file size; is there a more concrete method? I also noticed that you mentioned metrics:

These statistics help quantify how “noisy” a sample is and help to decide which samples to exclude from an analysis, or to select normal samples for a reference copy number profile.

Thanks a lot.

Reply • 5.0 years ago by linouhao ▴ 10