Question

cnvkit output understanding


8.2 years ago

Folder40g ▴ 190

Hi

I'm using CNVkit to check for possible CNVs in a type of amyloidosis (human). As this is the first time I'm doing a CNV analysis, I have some doubts.

I ran CNVkit as follows inside a bash loop, where f is the index of the "tumor" .bam and i the index of the "normal" .bam:

python cnvkit.py batch ${array[f]} --normal ${array[i]} --targets .../S07604624_Padded_versionCNVKIT.bed --fasta .../HG19/hg19.fa --access .../cnvkit-master/data/access-5k-mappable.hg19.bed --diagram --scatter --output-reference $outdirref --output-dir $outdir

Then I used a purity of 90%, as I was told by the people who did the exome sequencing:

python .../cnvkit.py call resultsRS_7_tumor_recalibrated.cns --purity 0.9 -o output_7.cns

I chose two of my 20 samples and checked the number of CNVs in each. The first thing I see is that sample 9 has 2403 rows and sample 7 has 705. I understand that this is due to the CBS algorithm, which, if I'm not wrong, tries to join contiguous bins with similar log2 ratios. This would mean that sample 7 has more homogeneous log2 ratios across contiguous bins. Right?

Then I plotted a histogram of the CN column. Maybe it's because this is the first time I'm doing this kind of analysis, but I don't trust the results.

Most of the segments in the .cns are duplicated. Does a CN of 2 mean normal levels in a human sample, or does it mean that the segment has twice as many reads as expected?

In either case, the distributions I show look too weird to me. So what might I be doing wrong?

[Figures: histograms of the CN column for Sample 9 and Sample 7]

Thanks for your time!

cnvkit cnv • 5.2k views


score 2 · Answer 1 · 2017-01-10

Yes, your understanding of segmentation is right. You can plot the outputs with the scatter command for a more intuitive view of each sample's copy number profile, or with heatmap to see all samples together. The integers in the CN column are the estimated absolute copy number of each segment, so 2 is neutral for diploid chromosomes (though it might not be for sex chromosomes).
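To make the link between a log2 ratio, the 90% purity, and a CN value concrete, here is a minimal Python sketch. It assumes a simple two-component mixture (tumor cells plus diploid normal cells); the function name absolute_copy_number is made up for illustration, and this is not necessarily the exact calculation that the call command performs internally.

```python
def absolute_copy_number(log2_ratio, purity, normal_cn=2):
    """Estimate tumor copy number from a log2 ratio under a simple mixture model.

    Assumption: the sample is a mix of tumor cells (fraction = purity) and
    normal cells that carry normal_cn copies of the region.
    """
    # Average number of copies per cell across the whole mixed sample.
    observed = normal_cn * 2 ** log2_ratio
    # Subtract the contribution of the normal cells, then rescale by purity.
    tumor_cn = (observed - (1 - purity) * normal_cn) / purity
    # A negative estimate is not meaningful, so clip at zero.
    return max(tumor_cn, 0.0)


for log2 in (-1.0, 0.0, 0.585):
    estimate = absolute_copy_number(log2, purity=0.9)
    print(log2, round(estimate, 2), "-> rounded CN", round(estimate))
```

Under this model a log2 ratio of 0 maps to CN 2 whatever the purity, which is why 2 reads as "neutral" rather than "twice the expected reads".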

Sample 7 looks about how I would expect, mostly neutral calls (CN=2), some single-copy losses (1) and gains (3), a few multi-copy gains (4) and homozygous deletions (0). Sample 9 is noisier, and shows a surprising number of higher-level amplifications that could be false positives. Other than that it's not wildly unexpected. Look at any quality control metrics you have to see if Sample 9 and any other samples in your cohort have low coverage or are otherwise expected to be problematic.
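When comparing samples like this, it can help to tally the CN column both by number of segments and by bases covered, since many short noisy segments can dominate a plain histogram. The sketch below uses only the standard library and assumes a tab-separated .cns with "start", "end" and "cn" columns; the function name and the example rows are invented for illustration and are not real results.

```python
import csv
import io
from collections import Counter


def cn_distribution(cns_text):
    """Return (segments per CN, bases per CN) from the text of a called .cns file."""
    by_count = Counter()
    by_bases = Counter()
    reader = csv.DictReader(io.StringIO(cns_text), delimiter="\t")
    for row in reader:
        cn = int(row["cn"])
        by_count[cn] += 1
        # Weight each segment by its length so tiny segments count for less.
        by_bases[cn] += int(row["end"]) - int(row["start"])
    return by_count, by_bases


# Made-up rows, only to show the expected layout.
example = (
    "chromosome\tstart\tend\tgene\tlog2\tcn\n"
    "chr1\t10000\t5010000\t-\t0.01\t2\n"
    "chr1\t5010000\t5020000\t-\t0.95\t4\n"
    "chr2\t20000\t3020000\t-\t-0.85\t1\n"
    "chr2\t3020000\t9020000\t-\t0.02\t2\n"
)

counts, bases = cn_distribution(example)
for cn in sorted(counts):
    print(f"CN={cn}: {counts[cn]} segments, {bases[cn]} bases")
```

For a real file, read it with open(path).read() and pass the text in. If the high-CN calls in a noisy sample cover very few bases, that supports treating them as likely false positives.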

You can make the segmentation more specific and less sensitive by passing -t .00001 to the segment command or, if that's not working out, -m haar. You can also reduce noise earlier in the process by using a pooled reference, provided the samples in your research cohort were prepared and sequenced with the same process (exome baits, capture kits, etc.); the batch command can do this in one shot. In my benchmarks, a pooled reference usually performed noticeably better than running tumor-normal pairs independently.
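As a rough sketch of the pooled-reference run, the Python snippet below only assembles the batch command line, passing all tumor BAMs as inputs and all normal BAMs together after --normal. The flags are the ones already used in the question; the helper name pooled_batch_command and every file name are placeholders, and nothing is executed here.

```python
import shlex


def pooled_batch_command(tumor_bams, normal_bams, targets, fasta, access,
                         reference_out, output_dir):
    """Build (but do not run) a CNVkit batch command that pools all normals."""
    return [
        "python", "cnvkit.py", "batch", *tumor_bams,
        "--normal", *normal_bams,          # all normals go into one reference
        "--targets", targets,
        "--fasta", fasta,
        "--access", access,
        "--output-reference", reference_out,
        "--output-dir", output_dir,
    ]


# Placeholder file names; substitute your own paths.
cmd = pooled_batch_command(
    tumor_bams=["tumor_7.bam", "tumor_9.bam"],
    normal_bams=["normal_7.bam", "normal_9.bam"],
    targets="targets.bed",
    fasta="hg19.fa",
    access="access.bed",
    reference_out="pooled_reference.cnn",
    output_dir="results",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run(cmd, check=True) once the paths are real.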

score 0 · Answer 2 · 2017-01-11


Folder40g ▴ 190

Thanks. I'll give the pooled reference a shot and see how it goes.
