SNP Pruning Through PCA (Edit: Feature Selection Through PCA)
3.3 years ago
ErickW • 0

Hello,

I have roughly 1 million SNPs from 700 individuals and I want to prune the SNPs down, potentially through PLINK's --pca command. However, I'm unsure how the eigenvalues/eigenvectors I get from --pca are meant to be used to prune my SNPs. Or am I completely misunderstanding? Could anyone clarify?

Below is a sample of the vectors:

[Image not preserved: sample of the eigenvectors]

Values:

[Image not preserved: eigenvalues]

Edit: I want to leave the original post up, but to clarify further: from my ML experience, PCA can perform feature selection, and I wish to do the same with the SNPs (apologies if 'pruning' means something different in bioinformatics).

Below is a sample of my variant weights:

[Image not preserved: sample of the variant weights]

In Python, the PCA does the feature selection automatically once you've fitted/transformed the data. So is there a way of performing feature selection on the SNPs? For example, by looking at each variant's first 3 weights and only taking SNPs that have a minimum weight of 'X'?

PLINK SNP PCA • 2.4k views

To what end? Why do you want to prune them back? You could take a random set of ten SNPs and have a pruned set. Is there some analysis you want to be able to do with it?


I was planning to perform a machine learning analysis and would prefer a smaller subset of SNPs to use in the ML techniques.

3.3 years ago
Lemire ▴ 940

I think you are misunderstanding what PCA can and can't do. If you want to prune, then use PLINK's variant pruning commands.

https://www.cog-genomics.org/plink/1.9/ld
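
To make the idea behind LD-based pruning concrete, here is a rough Python sketch. It is a simplified greedy illustration, not PLINK's actual algorithm: the function names, the toy genotypes, and the SNP IDs are all hypothetical. The `window` and `r2_threshold` parameters only loosely mirror the window size and r² threshold that PLINK's `--indep-pairwise` takes, and there is no step parameter here.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP has no defined correlation; treat as 0
    return (sxy * sxy) / (sxx * syy)


def greedy_ld_prune(snps, window=50, r2_threshold=0.2):
    """Keep a SNP only if its r^2 with every kept SNP in the window is <= threshold.

    `snps` is a list of (snp_id, genotypes) pairs in genomic order.
    The window is counted in SNPs, not base pairs (an assumption of this sketch).
    """
    kept_indices = []
    for i, (_, genotypes) in enumerate(snps):
        nearby = [j for j in kept_indices if i - j < window]
        if all(r_squared(genotypes, snps[j][1]) <= r2_threshold for j in nearby):
            kept_indices.append(i)
    return [snps[j][0] for j in kept_indices]


# Toy data: 3 hypothetical SNPs genotyped in 6 individuals.
toy_snps = [
    ("snp1", [0, 1, 2, 0, 1, 2]),
    ("snp2", [0, 1, 2, 0, 1, 2]),  # identical to snp1, so in perfect LD with it
    ("snp3", [1, 2, 0, 0, 1, 2]),  # only weakly correlated with snp1
]
pruned = greedy_ld_prune(toy_snps, window=50, r2_threshold=0.2)
print(pruned)
```

In this toy case the redundant SNP is dropped and the other two survive. For real data PLINK itself is the right tool; the sketch only shows what "pruning" means in this context.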


Thank you, will do. However, from my ML experience, I know PCA can perform feature selection (maybe I'm misusing the word 'pruning'), so I wanted to do the same thing with my SNPs. Can that not be done through PLINK?

For example, I searched through the PCA documentation and came across a modifier that outputs SNP variance, like the sample below:

[Image not preserved: sample of the output]

Most of my experience with PCA is in Python, and if I remember correctly, feature selection is done automatically. What would be the proper equivalent given my situation?

I'll edit my original question to further clarify/explain.


When you run --pca you get a rotation of the data such that the first PC displays the most variance between your samples. This is what you get in the matrix you pasted in your original post (note that the line you pasted is not one of the vectors; those are the values for sample D391243 at the first 10 PCs, and the first PC would be the 3rd column in your file). The loadings you get from the allele-wts modifier are merely the coefficients of the linear transformation needed to perform that rotation. If you take the (mean-centered) genotypes for D391243 and multiply them by the weights in the 5th column of your .allele file, then the sum of these products should give you -0.00579651.
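
The sum-of-products relationship described above can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers, not PLINK's implementation: the function name `pc_score` and the toy genotypes, means, and weights are hypothetical, and any variance standardization or other scaling PLINK applies before the projection is ignored here.

```python
def pc_score(genotypes, snp_means, weights):
    """Project one sample onto one PC: sum of (mean-centered genotype * weight)."""
    if not (len(genotypes) == len(snp_means) == len(weights)):
        raise ValueError("all inputs need exactly one entry per SNP")
    return sum((g - m) * w for g, m, w in zip(genotypes, snp_means, weights))


# Toy example: one sample, 4 SNPs (all values are invented for illustration).
genotypes = [0, 1, 2, 2]            # allele counts for the sample
snp_means = [0.5, 1.0, 1.5, 1.0]    # per-SNP mean across all samples
weights = [0.5, -0.25, 0.25, 0.5]   # loadings for one PC, one per SNP

score = pc_score(genotypes, snp_means, weights)
print(score)  # 0.375
```

Every SNP contributes to the score, which is why a PC is a combination of all features rather than a selection of some of them.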

Long story short, if you did feature selection based on the fifth column of your .allele file (by, say, picking the SNPs with the 10% largest weights in absolute value), then you would select features that explain (and only explain) the largest variability between the samples (i.e. those for which the samples show the most "dissimilarity", in a way). If you are not interested in explaining variability, then PCA is not the way to go. If you are interested in explaining the variability (because, say, you found some clusters and want to find what explains them), then you could select based on the weights.
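
As a rough sketch of that kind of weight-based selection in Python: the function name, the SNP IDs, and the weights below are hypothetical, and the fraction is just a parameter of the example, not a recommended cutoff.

```python
def top_fraction_by_abs_weight(weights_by_snp, fraction=0.10):
    """Return the SNP IDs with the largest absolute weights on one PC.

    `weights_by_snp` maps SNP ID -> weight (loading) for a single PC.
    The number kept is `fraction` of all SNPs, rounded to the nearest
    whole SNP, and never fewer than one.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(fraction * len(weights_by_snp)))
    ranked = sorted(weights_by_snp, key=lambda snp: abs(weights_by_snp[snp]), reverse=True)
    return ranked[:n_keep]


# Toy weights for 5 hypothetical SNPs on a single PC.
toy_weights = {
    "snp_a": 0.02,
    "snp_b": -0.31,
    "snp_c": 0.07,
    "snp_d": 0.25,
    "snp_e": -0.01,
}
selected = top_fraction_by_abs_weight(toy_weights, fraction=0.4)
print(selected)  # ['snp_b', 'snp_d']
```

Note that the sign of a weight is ignored: a strongly negative loading contributes as much to the PC as a strongly positive one.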

There are interesting comments in the following thread:

https://stats.stackexchange.com/questions/27300/using-principal-component-analysis-pca-for-feature-selection/27310


Great, thank you! I think that answered my question. And I'll look at that stackexchange thread.

Regarding the feature selection example you gave (picking SNPs with the largest 10% of weights in absolute value), is that 10% a sort of industry standard? Or would the trimming threshold be specific to each person's dataset/problem?
