I want to get genome wide copy number calls from a cohort (N=45) of human tumor-normal samples genotyped by Affymetrix GenomeWide SNP6.0 arrays. I have normalized the raw data and calculated Log2-R-Ratios for all probes (N=933,122) for all tumor-normal pairs, so far so good. However, when I perform CBS-segmentation with pruning implemented by DNAcopy it hangs (after 48h I killed it) on the 12the sample. However, when I randomly dilute the dataset to 100,000 probes, it takes about 10 seconds per sample to do the job. Is it something intrinsic from DNAcopy that it exponentially gets slower when the numbers of probes increase? What can I do to get good (i.e. not too much segments due to local trends) CBS-segmented copy number calls?
As a bonus question, in DNAcopy the min.width parameter only allows values between 2 and 5. What can I do when I want the segmentation algorithm to use at least 20 probes for segments?
library(DNAcopy)
samples <- colnames(lrr)[grep("MySampleIdentifier",colnames(lrr))]
number.probes <- 100e3
lrr <- lrr[sample(1:nrow(lrr),number.probes,replace=F),]
cna <- CNA(
genomdat=lrr[,samples],
chrom=lrr$chromosome,
maploc=lrr$physicalPosition,
data.type="logratio",
sampleid=samples
)
segments <- segment(
cna,
min.width=5,
undo.splits="prune",
undo.prune=0.05
)
Hi Arno, thanks for your suggestions, I'll try the MergeLevels script after default segmentation.
Hmm, the MergeLevels script seems very inconvenient as you have to transform the segmented data back to a segmented value for each probe. You can transform the CBS output this way:
But still, when MergeLevels is done you end up with a vector of corrected segment levels but not in the DNAcopy output format with 1 row per segment which for me is very convenient for downstream analyses. I will use CGHcall/CGHregions in stead.