Question

Using CNVkit for hybrid capture data -- the huge CNV regions in the results do not make any sense

7.2 years ago

wujunjames ▴ 10

I use CNVkit to call CNVs from hybrid capture sequencing data. The targeted regions total about 30 Mb, a little less than an exome. Below are the command and arguments I used:

path=/home/pub/output/suzhou/17AATH024/17AATH024_CNV_Capture/
cnvkit.py batch $path/hongse/hongse_sort_dup.bam $path/huangse/huangse_sort_dup.bam $path/k13/k13_sort_dup.bam $path/ZheJ-040/ZheJ-040_sort_dup.bam \
          -p 4 \
          --normal $path/1148/1148_sort_dup.bam $path/1150/1150_sort_dup.bam $path/1151/1151_sort_dup.bam $path/1152/1152_sort_dup.bam \
          --targets /home/ganb/work/BED/bj301_v5_3_1_Covered_target.bed \
          --fasta /home/wuj/tmp/SV-dev/CNV-kit/beds/hg19.fa \
          --access /home/wuj/tmp/SV-dev/CNV-kit/beds/access-5k-mappable.hg19.bed \
          --output-reference my_reference.cnn \
          --output-dir result2 \
          --antitarget-min-size 11000 \
          --target-avg-size 400 \
          --diagram --scatter

However, the results contain merged regions that are extremely large (on the order of a hundred million bases) and even include the antitargets. Together the regions span almost the whole length of each chromosome, which does not make any sense to me, and I'm totally confused. Here is part of my result:

chromosome  start      end        gene
1           10500      121484934  DVL1,HES5,rs3205087,ESPN,FBXO2,MTHFR,MFN2,CLCN
1           142535934  216166555  CERS2,CHRNB2
1           216172193  216538480  USH2A
1           216538980  249240121  USH2A,ESRRG,
2           10500      92267022   TPO,MYCN,rs3
2           95326671   169546977  rs2305150,PA
2           169547477  170100092  CERS6,LRP2
2           170101171  172334658  LRP2,rs21619
2           172336504  172341243  DCAF17
2           172341743  243188873  DLX1,DLX2,rs2258180,ATF2,PRKRA,DFNB59,NEUROD1,
3           60500      90311186   GRM7,SETD5,ATP2B2,MKRN2,XPC,rs6765537,THRB,RAR
3           93519633   197961930  PVRL3,ILDR1,

I tried running 'PSCBS' step by step, the way segmentation.cbs.py does, and found that the following 'segmentByCBS' call generated the result:

segmentByCBS(cna, alpha=0.05, undo=0, min.width=2,joinSegments=FALSE, knownSegments=knownsegs, seed=0xA5EED)

How can I get such large merged regions with 'joinSegments' set to FALSE? And how can I get reasonable results for my analysis? Can anybody help me?

CNV CNVkit hybrid capture • 2.5k views

Updated 7.2 years ago by Eric T. ★ 2.8k • written 7.2 years ago by wujunjames ▴ 10

score 1 · Answer 1 · 2017-08-29

CNVkit uses off-target reads from hybrid capture to get a coarse-grained estimate of copy number outside the targeted regions. The segmentation step then uses on- and off-target read depth patterns together to estimate copy number across the entirety of each chromosome (sometimes skipping centromeres depending on the specified targets and "access" BED file).
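As a rough check of that description against the numbers in the question, the sketch below sums the four chromosome 1 segments from the posted table and looks at the gaps between them. The chromosome length constant is an assumption that does not appear in the thread (the hg19 chr1 size from the UCSC chromosome sizes), and the variable names are only illustrative.

```python
# Chromosome 1 segments copied from the result table in the question (start, end).
chr1_segments = [
    (10500, 121484934),
    (142535934, 216166555),
    (216172193, 216538480),
    (216538980, 249240121),
]

# Assumption: hg19 chr1 length, taken from the UCSC hg19 chromosome sizes,
# not from the thread itself.
CHR1_LENGTH_HG19 = 249250621

# Total bases covered by the reported segments.
covered = sum(end - start for start, end in chr1_segments)
fraction = covered / CHR1_LENGTH_HG19

# Gaps between consecutive segments: (previous end, next start, gap size).
gaps = [
    (prev_end, next_start, next_start - prev_end)
    for (_, prev_end), (next_start, _) in zip(chr1_segments, chr1_segments[1:])
]
largest_gap = max(gaps, key=lambda gap: gap[2])

print(f"covered: {covered} bp ({fraction:.1%} of chr1)")
print(f"largest gap: {largest_gap[0]}-{largest_gap[1]} ({largest_gap[2]} bp)")
```

The segments cover most of the chromosome, and the one large gap (roughly 121.5 Mb to 142.5 Mb) is the kind of region that would be skipped if it is absent from the "access" BED file, which fits the behaviour described above.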

In cancer samples it's common to see large-scale somatic copy number alterations; these are what CNVkit is designed to detect best. The small-scale CNVs that are relevant to germline sequencing to study population-level variation (e.g. 1000 Genomes) are usually too small to be picked up from targeted sequencing read depth.

The joinSegments option in PSCBS and DNAcopy just stretches the endpoints of each segment so that segments are adjacent, with no gaps between them, in case the input probes have gaps between them. CNVkit post-processes the output segment coordinates on its own, and most of the genomic bins CNVkit generates do not have gaps between them.
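To make the point about joinSegments concrete, here is a toy sketch of "stretching endpoints so segments are adjacent". It is not the PSCBS or DNAcopy implementation. The function name `join_segments` is hypothetical, and splitting each gap at its midpoint is an assumption made for illustration. The input is two adjacent chromosome 1 segments from the question.

```python
def join_segments(segments):
    """Toy illustration: close the gaps between neighbouring segments.

    Assumption: each gap is split at its midpoint. The number of segments
    never changes; only the endpoints move.
    """
    joined = [list(seg) for seg in segments]
    for left, right in zip(joined, joined[1:]):
        midpoint = (left[1] + right[0]) // 2
        left[1] = midpoint   # stretch the left segment's end
        right[0] = midpoint  # stretch the right segment's start
    return [tuple(seg) for seg in joined]


# Two neighbouring chr1 segments from the question, separated by a 500 bp gap.
segments = [(216172193, 216538480), (216538980, 249240121)]
joined = join_segments(segments)
print(joined)
```

Under this sketch the option only moves boundaries by a few hundred bases and leaves the segment count unchanged, so switching it on or off would not be expected to explain segments that are tens of megabases long.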

You don't need to specify --target-avg-size and --antitarget-avg-size with the batch command; it will calculate reasonable values on the fly with CNVkit versions 0.8.5 and later.

score 0 · Answer 2 · 2017-08-25

Hi, I also used CNVkit recently. Technically, identifying CNVs from hybrid capture NGS data is difficult and somewhat inaccurate. It is also normal for a CNV to cover a large genomic region, even a whole chromosome, especially in tumors. Here is a hint: first, filter the results by coverage, read depth, and log2 value. Second, check the significant genes by plotting the coverage information, which can help you evaluate the calls. I think the CNVkit paper clearly describes its algorithms, so you can dig into it.
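A minimal sketch of the filtering idea, assuming a tab-separated segment table with `log2` and `probes` columns like those in a CNVkit .cns file. The coordinates and gene names come from the question, but the `log2` and `probes` values are invented placeholders, and the thresholds and the function name `filter_segments` are assumptions rather than recommended settings.

```python
import csv
import io

# Coordinates and genes are from the question; the log2 and probes values
# are INVENTED placeholders for illustration only.
CNS_TEXT = (
    "chromosome\tstart\tend\tgene\tlog2\tprobes\n"
    "1\t10500\t121484934\tDVL1,HES5\t0.02\t5200\n"
    "1\t216172193\t216538480\tUSH2A\t-0.95\t48\n"
    "2\t172336504\t172341243\tDCAF17\t0.61\t3\n"
)


def filter_segments(text, min_abs_log2=0.3, min_probes=10):
    """Keep segments with a clear log2 shift that are supported by enough bins.

    Both thresholds are arbitrary assumptions, not values from the thread.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        row
        for row in reader
        if abs(float(row["log2"])) >= min_abs_log2
        and int(row["probes"]) >= min_probes
    ]


kept = filter_segments(CNS_TEXT)
for row in kept:
    print(row["chromosome"], row["start"], row["end"], row["gene"], row["log2"])
```

With these placeholder values, the near-neutral arm-length segment and the segment backed by only three bins are dropped, leaving the one segment with a clear shift and reasonable support.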