I use CNVkit to call CNVs from hybrid-capture sequencing data. The targeted regions total about 30 Mb, a little less than an exome. Below are the commands and arguments I used:
path=/home/pub/output/suzhou/17AATH024/17AATH024_CNV_Capture/
cnvkit.py batch $path/hongse/hongse_sort_dup.bam $path/huangse/huangse_sort_dup.bam $path/k13/k13_sort_dup.bam $path/ZheJ-040/ZheJ-040_sort_dup.bam \
-p 4 \
--normal $path/1148/1148_sort_dup.bam $path/1150/1150_sort_dup.bam $path/1151/1151_sort_dup.bam $path/1152/1152_sort_dup.bam \
--targets /home/ganb/work/BED/bj301_v5_3_1_Covered_target.bed \
--fasta /home/wuj/tmp/SV-dev/CNV-kit/beds/hg19.fa \
--access /home/wuj/tmp/SV-dev/CNV-kit/beds/access-5k-mappable.hg19.bed \
--output-reference my_reference.cnn \
--output-dir result2 \
--antitarget-min-size 11000 \
--target-avg-size 400 \
--diagram --scatter
However, the results contain merged regions that are very large (on the order of a hundred million bases) and even include the antitargets. The sum of the regions covers almost the whole length of a chromosome, which does not make any sense, and I'm totally confused. Here is part of my result:
chromosome  start      end        gene
1           10500      121484934  DVL1,HES5,rs3205087,ESPN,FBXO2,MTHFR,MFN2,CLCN
1           142535934  216166555  CERS2,CHRNB2
1           216172193  216538480  USH2A
1           216538980  249240121  USH2A,ESRRG,
2           10500      92267022   TPO,MYCN,rs3
2           95326671   169546977  rs2305150,PA
2           169547477  170100092  CERS6,LRP2
2           170101171  172334658  LRP2,rs21619
2           172336504  172341243  DCAF17
2           172341743  243188873  DLX1,DLX2,rs2258180,ATF2,PRKRA,DFNB59,NEUROD1,
3           60500      90311186   GRM7,SETD5,ATP2B2,MKRN2,XPC,rs6765537,THRB,RAR
3           93519633   197961930  PVRL3,ILDR1,
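To make the scale concrete, here is a minimal Python sketch that computes the length of each segment in the table above. The coordinates are copied from my result. The function names and the 10 Mb cutoff are only illustrative choices of mine, not anything defined by CNVkit.

```python
# Segment coordinates copied from the result table: (chromosome, start, end).
SEGMENTS = [
    ("1", 10500, 121484934),
    ("1", 142535934, 216166555),
    ("1", 216172193, 216538480),
    ("1", 216538980, 249240121),
    ("2", 10500, 92267022),
    ("2", 95326671, 169546977),
    ("2", 169547477, 170100092),
    ("2", 170101171, 172334658),
    ("2", 172336504, 172341243),
    ("2", 172341743, 243188873),
    ("3", 60500, 90311186),
    ("3", 93519633, 197961930),
]


def segment_lengths(segments):
    """Return (chromosome, start, end, length_bp) for every segment."""
    return [(chrom, start, end, end - start) for chrom, start, end in segments]


def flag_large(segments, threshold_bp=10_000_000):
    """Keep segments at least threshold_bp long (hypothetical 10 Mb cutoff)."""
    return [row for row in segment_lengths(segments) if row[3] >= threshold_bp]


if __name__ == "__main__":
    for chrom, start, end, length in segment_lengths(SEGMENTS):
        print(f"chr{chrom}\t{start}\t{end}\t{length / 1e6:.2f} Mb")
    print(f"{len(flag_large(SEGMENTS))} of {len(SEGMENTS)} segments are >= 10 Mb")
```

Most of the listed segments span tens of megabases, far more than any single target or antitarget bin.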
I tried running 'PSCBS' step by step, as segmentation.cbs.py does, and found that this result is generated by the 'segmentByCBS' call:
segmentByCBS(cna, alpha=0.05, undo=0, min.width=2,joinSegments=FALSE, knownSegments=knownsegs, seed=0xA5EED)
How can I get such large merged regions when 'joinSegments' is set to FALSE? And how can I get reasonable results for my analysis? Can anybody help me?
Hi,
I want to use CNVkit to call CNVs from hybrid-capture sequencing data (20 samples) covering one target region (~8 Mb), but I have no idea whether it is suitable.
Thanks!