Variance explained per PC is too low when running PCA on SNP data
3.2 years ago
? ▴ 60

I ran PCA on 91 samples (23 breeds plus one outgroup, which is a different subspecies).

About 18,000,000 SNPs were used for the PCA, but the variance explained was very low: about 5% for PC1 and 4% for PC2.

I tried a few things to increase the variance explained:

  1. Since the number of samples per breed differed, I removed some breeds that had too few or too many samples in order to equalize the sample numbers.

  2. I removed the outgroup to see what happens, because it might affect the PCs too much (I'm not sure if this is OK, since I'm going to use the outgroup for drawing evolutionary trees).

  3. I raised the minor allele frequency threshold from 0.01 to 0.05 to keep only variants of more moderate frequency.

  4. I increased the --max-nocall-fraction option of SelectVariants from 0.05 to 0.09 to make the imputation process more linked (I'm not sure if this is right; maybe I should have decreased the value).

Each of these made a small difference, but none clearly increased the variance explained. Is there something else I could try, or should I just show a few more PCs together?

variation PCA SNP • 1.7k views

This is not surprising for a PCA with 18 million variables and only 91 samples. The % variance explained has to be looked at in the context of the size of the data. It is perfectly possible for the first few PCs to explain a small fraction of the variance and still reveal some structure in the data; conversely, PCs that explain a high fraction of the variance may fail to capture meaningful structure.

On the other hand, when the leading PCs explain only a small fraction of the variance, the cause is often that the data is too noisy. If only a few SNPs are relevant, then the data is dominated by the noise of the 18 million irrelevant SNPs. Also, the number of samples per breed (roughly four on average) is probably too low to reliably associate variance with breed. So you should try to eliminate as many irrelevant SNPs as possible and/or get more samples. I would be more confident in your analysis if you had thousands of samples with a hundred thousand SNPs.

You will find more info on using PCA with SNP data in this article.
Edit: Also of interest: Why most Principal Component Analyses (PCA) in population genetic studies are wrong
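
To illustrate the point about noise, here is a minimal sketch on simulated data (not your data, and not a recommended pipeline). It assumes a toy model with two populations, genotypes coded 0/1/2, and allele frequencies that differ between populations only at a small subset of "informative" SNPs. The function names, sample sizes, and frequency shift are all made up for the illustration. PCA is done with a plain SVD of the column-centered genotype matrix.

```python
import numpy as np


def simulate_genotypes(n_per_pop=30, n_snps=20000, n_informative=1000,
                       shift=0.2, seed=0):
    """Toy model (an assumption, not real data): two populations that
    differ in allele frequency only at the first n_informative SNPs."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.3, 0.7, size=n_snps)   # shared allele frequencies
    freq_a = base.copy()
    freq_b = base.copy()
    freq_a[:n_informative] += shift             # population A shifted up
    freq_b[:n_informative] -= shift             # population B shifted down
    # Genotype = number of copies of the allele (0, 1 or 2) per sample and SNP.
    geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    geno = np.vstack([geno_a, geno_b]).astype(float)
    labels = np.repeat([0, 1], n_per_pop)
    return geno, labels


def pca(geno, n_components=2):
    """PCA via SVD of the column-centered matrix.
    Returns sample scores and the fraction of variance explained per PC."""
    centered = geno - geno.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]


if __name__ == "__main__":
    geno, labels = simulate_genotypes()

    # All SNPs: most columns carry no population signal.
    scores_all, explained_all = pca(geno)
    # Only the SNPs that actually differ between populations.
    scores_sub, explained_sub = pca(geno[:, :1000])

    print("PC1 variance explained, all SNPs:        ", explained_all[0])
    print("PC1 variance explained, informative SNPs:", explained_sub[0])
    print("PC1 mean score per population (all SNPs):",
          scores_all[labels == 0, 0].mean(),
          scores_all[labels == 1, 0].mean())
```

In this toy setting PC1 explains only a few percent of the variance when all SNPs are used, yet it still separates the two populations, and the fraction rises once the uninformative SNPs are dropped. Real data has LD, relatedness and missingness that this sketch ignores, so treat it as intuition only.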
