How Can I Do Principal Components Analysis ?
4
2
Entering edit mode
12.2 years ago
mary ▴ 210

Dear friend I have SNP data for 15 small sheep populations (20 samples for each population) and they were sequenced by ovine SNP50K, each breed sequenced on one plate. Now, I want to do CNV analysis by PennCNV. I know I should do Principal Components analysis before starting CNV analysis. Does anybody know how i can do that with my data?

thanks

cnv • 8.8k views
ADD COMMENT
0
Entering edit mode

EIGENSTRAT can do it.

ADD REPLY
0
Entering edit mode

Nowhere in the PennCNV documentation do I see a requirement to perform PCA before any other type of analysis. What makes you think this is necessary? Do you know exactly what it is that you want to do and which data are required as input?

ADD REPLY
1
Entering edit mode

CNV analysis itself does not require PCA. But it is a good idea, having different populations mixed in the same dataset, to check (e.g. for population structure) with PCA or other exploratory techniques.

ADD REPLY
0
Entering edit mode

Sure, PCA is a useful, general method for multivariate data. I just want to check that when the questioner says "I know I should do Principal Components analysis before start to analysis CNV", they really know what they are doing.

ADD REPLY
0
Entering edit mode

I agree with both your comments, my intention was just to clarify that even if not required by pennCNV, PCA is a useful step.

ADD REPLY
0
Entering edit mode

Its a minor issue. But in my opinion you should not describe your approach here as sequencing. You are genotyping by SNP array. Sequencing is another approach you could use to determine SNP genotypes. There is a nice explanation of sequencing versus genotyping at 23andMe.

ADD REPLY
7
Entering edit mode
12.2 years ago

It is a good QC step to perform PCA before CNV calling. You'll want to read the LogR ratio values from your signal files into R:

LRR <- c()
for (file in file_list) {
    x <- read.table(file, header=TRUE, sep="\t")
    LRR <- cbind(LRR, x$Log.R.Ratio)
}

Then use the builtin princomp function and plot your first and second principle components:

PCA <- princomp(LRR)
biplot(PCA[,1], PCA[,2])

There are probably issues with the R code above, and if I have time I'll actually run it and fix any mistakes, but this should get you started. If you don't see any discreet groupings of subsets or individual samples of your data, then you may keep all of your arrays in the analysis. If you see any discreet groupings, you might inspect the arrays corresponding to those groupings and possibly remove them before CNV calling.

ADD COMMENT
4
Entering edit mode
12.2 years ago
Ryan D ★ 3.4k

One great piece of software which does principal components and determines the best fit of your data is CNVtools, by Chris Barnes and Vincent Plagnol. This was software developed to handle batch issues that frequently arise in the course of calling CNVs. The paper describing it is in Nature Genetics.

You can just follow the short vignette given on the above page. As long as you specify your batches, you will have a high chance of being able to call CNVs properly. The software runs you through a nice little pipeline of steps for CNV analysis including plotting the raw signal and 1st principal component of your CNVs, selecting the model with the correct number of principal components, improving the calling of CNVs using this information, and testing the association of your CNVs with qualitative or continuous traits. first principal component from the intensity data from different probes. This, by downweighting probe intensities uncorrelated with the remainder, generally gives a better separation of different copy numbers than the mean or median of all of the probe intensities.

The downside of the code is that it is written to do one CNV at a time. You must know the probes that comprise your CNV. We have modified the file to do batches of CNVs, but you still must know where your CNV starts and ends. This is different than CNV calling of course (which would use software like PennCNV or Birdsuite). It is much more rigorous and reviewers at top journals will ask you for this level of validation.

If you really want to plot other components, you can use this code together with CNVtools as I illustrate below:

library(CNVtools)

data= read.table("gwas_raw_data.txt", header=T, sep="\t")
dim(data); names(data)

raw.signal <- as.matrix(data[, -(1:7)])
dimnames(raw.signal)[[1]] <- data$Sample_ID
sumis.na(raw.signal))

mean.signal <- apply(raw.signal, MAR=1, FUN=mean)
pca.signal <- apply.pca(raw.signal)
par(mfrow=c(1,2))
hist(mean.signal, breaks=50, main='Mean signal', cex.lab=1.3)
hist(pca.signal, breaks=50, main='First PCA signal', cex.lab=1.3)

aa0 = prcomp(raw.signal)
tmp1 = (aa0$x[,1] - mean(aa0$x[,1])) / sd(aa0$x[,1])
plot(pca.signal, tmp1)

aa = prcomp(raw.signal, center=F, scale=T)
tmp2 = (aa$x[,1] - mean(aa$x[,1])) / sd(aa$x[,1])
plot(pca.signal, tmp2)

##
PC1 = aa$x[,1]; PC2 = aa$x[,2]; PC3 = aa$x[,3]
idx1 = data$DNA_source=="batch1"
idx2 = data$DNA_source=="batch2" 
idx3 = data$DNA_source=="batch3"

pdf("PC1vsPC2.pdf")
par(mfrow=c(2,2))
plot(PC1, PC2, type='n', main="All sources")
points(PC1[idx1], PC2[idx1])
points(PC1[idx2], PC2[idx2], col=2)
points(PC1[idx3], PC2[idx3], col=3)
plot(PC1, PC2, type='n', main="batch1")
points(PC1[idx1], PC2[idx1])
plot(PC1, PC2, type='n', main="batch2")
points(PC1[idx2], PC2[idx2], col=2)
plot(PC1, PC2, type='n', main="batch3")
points(PC1[idx3], PC2[idx3], col=3)
dev.off()
ADD COMMENT
3
Entering edit mode
12.2 years ago

I posted some code on how to do it with python and matplotlib a while back: http://blog.nextgenetics.net/?e=42

ADD COMMENT
3
Entering edit mode
6.6 years ago
thaihoabo ▴ 70

enter image description here You probably can work with R to do PCA. But be careful whether to use with/without centering, scaling, or log. It depends on the way you calculated CNV. Is it raw count or normalized? But anyway, maybe it'd be less painful if you use a tool, which has all necessary parameters: https://vinci.bioturing.com

ADD COMMENT

Login before adding your answer.

Traffic: 2000 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6