Troubleshooting NAs in segment.p DNAcopy
2
0
Entering edit mode
8.6 years ago

Hello,

I am currently trying to establish a pipeline to call CNVs from WES data using CNVkit to process the data into cnr format and then segment manually using segment function of DNAcopy package from Bioconductor.

However after I try to use segment.p to get p values and other characteristics what happens is I get mostly almost all NAs in the result file and even if there is some number, the pvalue is really really high...

> ID    chrom   loc.start   loc.end num.mark    seg.mean    bstat   pval    lcl ucl
> Sample.1  chr17   1964785 28548633    57  0.004   NA  NA  NA  NA
> Sample.1  chr18   13387721    53254275    38  -0.0087 NA  NA  NA  NA
> Sample.1  chr19   32836513    50370310    29  0.0653  NA  NA  NA  NA
> Sample.1  chr2    32288900    200320591   220 0.0049  NA  NA  NA  NA
> Sample.1  chr22   51113475    51154096    19  0.0217  5.13427042369266    5.31224265912416e-06    51143391    51158892

and it goes like this... As I already mentioned I am using CNVkit to calculate cnn and cnr files by normal recommended route, which I am analyzing by this sequence in DNAcopy

cn <- read.table(file = args[1],header = TRUE)
CNA.object <- CNA(genomdat = cn[,5], chrom = cn[,1], maploc = cn[,2], data.type = 'logratio')
CNA.smoothed <- smooth.CNA(CNA.object)
segs <- segment(CNA.smoothed, alpha = 0.01, nperm = 10000, p.method = c("hybrid"), 
                kmax=25, nmin=200, eta=0.05, trim= 0.025, min.width= 2, undo.splits = c("sdundo"),undo.SD=4, verbose=3)
segsout = segs$output

write.table(segsout[,2:6], file = cns_name, row.names = F, col.names = T, quote = F, sep="\t")

seg.pvalue <- segments.p(segs, ngrid=100, tol=1e-6, alpha=0.05, search.range=100, nperm=1000)
write.table(seg.pvalue, file= seg_name, row.names=F, col.names=T, quote=F, sep="\t")

What does this indicate? Where should I search for the problem? Any help would be appreciated.

Thank you very much for your answers

DNAcopy CNV Bioconductor • 3.3k views
ADD COMMENT
0
Entering edit mode
8.6 years ago
Eric T. ★ 2.8k

The default segmentation algorithm in CNVkit is CBS, run via the PSCBS package which itself calls DNAcopy's segment function internally. Did you try cnvkit.py segment already and see the same problem? Can you say a little more about the parameters you chose for DNAcopy's segment command and why you changed them from the defaults?

I notice in your example:

  • All of the "seg.mean" values are close to 0, unlikely to be significant results from CBS's statistical test.
  • The "num.mark" values are low for exomes, e.g. the first 28 Mbp of chr17 ought to cover more than 57 exons or off-target intergenic bins. I use an off-target bin size of around 50 kbp for exomes, so this would generate at least 560 off-target bins even before counting on-target bins at the exons in this region. So a lot of bins are being lost somehow.
  • The segments do not cover the whole chromosomes. My hg19 has at least 81 Mbp of chr17 and 243 Mb of chr2. The chromosomes are listed in lexical order in your example, so is it true that CBS only found one segment on each chromosome, and failed to cover large portions of each chromosome at either end?

Combining the above, CBS uses a sampling procedure to find breakpoints and estimate p-values. If the input values are weird in some way, that could cause CBS to run into some edge case or boundary condition that produces a p-value of NA, then ends the sampling procedure and returns the mean and coordinates of the genomic region where it ran into trouble.

All that being said, if you're using the development version of CNVkit from Github, last night I fixed a bug related to "NaN" values that was causing all on-target bins to be thrown out, so you could update to the latest and see if that fixes your problem.

Otherwise, could you post a chunk of your input .cnr file here, showing both on-target and "Background" bins if you have them?

ADD COMMENT
0
Entering edit mode
8.6 years ago

Thanks for the answer.

I tried CNVkit bash and also segment commands, which ran successfully, but the result segments which I got didn't correspond to the reality and made it unable to analyze the sample furthermore (we have a control file with some large deletions and amplifications, proven by array, which can be seen in .cnr, but not in .cns .. looks like segmentation step just puts some of them into one chunk, with approximately 0 log2 ratio). Which is correct for some, I guess it's a matter of compromise? But at the end it's a reason why I've chosen to try directly DNAcopy and subsequent mergeSegments.pl to have more control of the process (which can be although a bit dangerous)

As for the parameters, they should be the default ones, I had them there just to be able to play with them a bit. The only thing changed is undo.splits (SD=4) Documentation I used which I thought would deal with the chunking I mentioned above I got from CNVkit.

  • This example data is from the amplicon processing, could that be the source of the problems? When running on WES data, is seems more legit after the new update.

I updated the CNVkit and run some test analyses, and I believe it works fine on exon analysis now for me, thank you for your great work!

I still have a little confusion about the cns file produced by CNVkit, I believe I haven't noticed it in the documentation, is there a way for example to define more or less "strict" parameters for segmenting?

Thanks once more.

ADD COMMENT
0
Entering edit mode

There is a parameter you can use in the segment command to control the degree of segmentation: -t / --threshold, which for CBS is used as the alpha parameter, but defaults to 0.0001 instead of 0.01. So try cnvkit.py segment -t .01 to get the default CBS behavior.

The undo.SD option in CBS is a post-processing step that merges adjacent segments with similar mean values. It's meant to address an unexplained phenomenon in microarrays where the mean log2 slowly drifts up or down over a long genomic region. When that happens, CBS will be confused by the slow drift and create a segmentation breakpoint in the middle of the drift region, so you get two adjacent segments, one with a mean value only slightly higher than the other. I haven't seen the same thing happen in DNA sequencing coverages, so CNVkit doesn't use it.

ADD REPLY

Login before adding your answer.

Traffic: 2074 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6