One great piece of software which does principal components and determines the best fit of your data is CNVtools, by Chris Barnes and Vincent Plagnol. This was software developed to handle batch issues that frequently arise in the course of calling CNVs. The paper describing it is in Nature Genetics.
You can just follow the short vignette given on the above page. As long as you specify your batches, you will have a high chance of being able to call CNVs properly. The software runs you through a nice little pipeline of steps for CNV analysis including plotting the raw signal and 1st principal component of your CNVs, selecting the model with the correct number of principal components, improving the calling of CNVs using this information, and testing the association of your CNVs with qualitative or continuous traits. first principal component from the intensity data from different probes. This, by downweighting probe intensities uncorrelated with the remainder, generally gives a better separation of different copy numbers than the mean or median of all of the probe intensities.
The downside of the code is that it is written to do one CNV at a time. You must know the probes that comprise your CNV. We have modified the file to do batches of CNVs, but you still must know where your CNV starts and ends. This is different than CNV calling of course (which would use software like PennCNV or Birdsuite). It is much more rigorous and reviewers at top journals will ask you for this level of validation.
If you really want to plot other components, you can use this code together with CNVtools as I illustrate below:
library(CNVtools)
data= read.table("gwas_raw_data.txt", header=T, sep="\t")
dim(data); names(data)
raw.signal <- as.matrix(data[, -(1:7)])
dimnames(raw.signal)[[1]] <- data$Sample_ID
sumis.na(raw.signal))
mean.signal <- apply(raw.signal, MAR=1, FUN=mean)
pca.signal <- apply.pca(raw.signal)
par(mfrow=c(1,2))
hist(mean.signal, breaks=50, main='Mean signal', cex.lab=1.3)
hist(pca.signal, breaks=50, main='First PCA signal', cex.lab=1.3)
aa0 = prcomp(raw.signal)
tmp1 = (aa0$x[,1] - mean(aa0$x[,1])) / sd(aa0$x[,1])
plot(pca.signal, tmp1)
aa = prcomp(raw.signal, center=F, scale=T)
tmp2 = (aa$x[,1] - mean(aa$x[,1])) / sd(aa$x[,1])
plot(pca.signal, tmp2)
##
PC1 = aa$x[,1]; PC2 = aa$x[,2]; PC3 = aa$x[,3]
idx1 = data$DNA_source=="batch1"
idx2 = data$DNA_source=="batch2"
idx3 = data$DNA_source=="batch3"
pdf("PC1vsPC2.pdf")
par(mfrow=c(2,2))
plot(PC1, PC2, type='n', main="All sources")
points(PC1[idx1], PC2[idx1])
points(PC1[idx2], PC2[idx2], col=2)
points(PC1[idx3], PC2[idx3], col=3)
plot(PC1, PC2, type='n', main="batch1")
points(PC1[idx1], PC2[idx1])
plot(PC1, PC2, type='n', main="batch2")
points(PC1[idx2], PC2[idx2], col=2)
plot(PC1, PC2, type='n', main="batch3")
points(PC1[idx3], PC2[idx3], col=3)
dev.off()
EIGENSTRAT can do it.
Nowhere in the PennCNV documentation do I see a requirement to perform PCA before any other type of analysis. What makes you think this is necessary? Do you know exactly what it is that you want to do and which data are required as input?
CNV analysis itself does not require PCA. But it is a good idea, having different populations mixed in the same dataset, to check (e.g. for population structure) with PCA or other exploratory techniques.
Sure, PCA is a useful, general method for multivariate data. I just want to check that when the questioner says "I know I should do Principal Components analysis before start to analysis CNV", they really know what they are doing.
I agree with both your comments, my intention was just to clarify that even if not required by pennCNV, PCA is a useful step.
Its a minor issue. But in my opinion you should not describe your approach here as sequencing. You are genotyping by SNP array. Sequencing is another approach you could use to determine SNP genotypes. There is a nice explanation of sequencing versus genotyping at 23andMe.