Hello,
I am currently trying to establish a pipeline to call CNVs from WES data using CNVkit to process the data into cnr format and then segment manually using segment function of DNAcopy package from Bioconductor.
However after I try to use segment.p to get p values and other characteristics what happens is I get mostly almost all NAs in the result file and even if there is some number, the pvalue is really really high...
> ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
> Sample.1 chr17 1964785 28548633 57 0.004 NA NA NA NA
> Sample.1 chr18 13387721 53254275 38 -0.0087 NA NA NA NA
> Sample.1 chr19 32836513 50370310 29 0.0653 NA NA NA NA
> Sample.1 chr2 32288900 200320591 220 0.0049 NA NA NA NA
> Sample.1 chr22 51113475 51154096 19 0.0217 5.13427042369266 5.31224265912416e-06 51143391 51158892
and it goes like this... As I already mentioned I am using CNVkit to calculate cnn and cnr files by normal recommended route, which I am analyzing by this sequence in DNAcopy
cn <- read.table(file = args[1],header = TRUE)
CNA.object <- CNA(genomdat = cn[,5], chrom = cn[,1], maploc = cn[,2], data.type = 'logratio')
CNA.smoothed <- smooth.CNA(CNA.object)
segs <- segment(CNA.smoothed, alpha = 0.01, nperm = 10000, p.method = c("hybrid"),
kmax=25, nmin=200, eta=0.05, trim= 0.025, min.width= 2, undo.splits = c("sdundo"),undo.SD=4, verbose=3)
segsout = segs$output
write.table(segsout[,2:6], file = cns_name, row.names = F, col.names = T, quote = F, sep="\t")
seg.pvalue <- segments.p(segs, ngrid=100, tol=1e-6, alpha=0.05, search.range=100, nperm=1000)
write.table(seg.pvalue, file= seg_name, row.names=F, col.names=T, quote=F, sep="\t")
What does this indicate? Where should I search for the problem? Any help would be appreciated.
Thank you very much for your answers
There is a parameter you can use in the
segment
command to control the degree of segmentation:-t
/--threshold
, which for CBS is used as thealpha
parameter, but defaults to 0.0001 instead of 0.01. So trycnvkit.py segment -t .01
to get the default CBS behavior.The undo.SD option in CBS is a post-processing step that merges adjacent segments with similar mean values. It's meant to address an unexplained phenomenon in microarrays where the mean log2 slowly drifts up or down over a long genomic region. When that happens, CBS will be confused by the slow drift and create a segmentation breakpoint in the middle of the drift region, so you get two adjacent segments, one with a mean value only slightly higher than the other. I haven't seen the same thing happen in DNA sequencing coverages, so CNVkit doesn't use it.