Hello,
I have roughly 1 million SNPs from 700 individuals and I want to prune them down, potentially with PLINK's --pca option. However, I'm unsure how the eigenvalues/eigenvectors that --pca outputs are supposed to be used to prune my SNPs. Or am I completely misunderstanding? Could anyone clarify?
Below is a sample of the vectors:
Values:
Edit: I want to leave the original post up, but to clarify further: from my ML experience, PCA can be used for feature selection, and I wish to do the same with the SNPs (apologies if 'pruning' means something different in bioinformatics).
Below is a sample of my variant weights:
In Python, PCA does the feature selection automatically once you've fitted/transformed the data. So is there a way of performing feature selection on the SNPs? For example, looking at each variant's first 3 weights and keeping only the SNPs that have a minimum weight of 'X'?
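To make the thresholding idea concrete, here is a minimal sketch of what I have in mind. It assumes the variant weights are already loaded as a NumPy array with one row per SNP and one column per principal component. The function name `select_snps_by_loading`, the toy SNP IDs, the toy weights, and the threshold value are all hypothetical, and comparing absolute weights (so that strongly negative weights also count) is my own assumption. This only illustrates the filter; it is not an established pruning method.

```python
import numpy as np


def select_snps_by_loading(weights, snp_ids, n_components=3, min_abs_weight=0.05):
    """Keep SNPs whose largest absolute weight on the first
    n_components principal components is at least min_abs_weight."""
    weights = np.asarray(weights, dtype=float)
    # Only the first n_components columns take part in the filter.
    top = np.abs(weights[:, :n_components])
    # A SNP is kept if any of those weights reaches the threshold.
    keep = top.max(axis=1) >= min_abs_weight
    return [snp_id for snp_id, flag in zip(snp_ids, keep) if flag]


# Toy data (hypothetical): 4 SNPs, weights on 4 components.
snp_ids = ["rs1", "rs2", "rs3", "rs4"]
weights = np.array([
    [0.90, 0.01, 0.00, 0.50],
    [0.01, 0.02, 0.03, 0.99],   # large weight only on PC4, which is ignored
    [0.00, -0.40, 0.10, 0.00],  # negative weight counts via absolute value
    [0.02, 0.01, -0.01, 0.00],
])

selected = select_snps_by_loading(weights, snp_ids, n_components=3, min_abs_weight=0.3)
print(selected)
```

With real data, the weights and SNP IDs would come from the variant-weight output instead of the toy array, and 'X' would be whatever cutoff turns out to be sensible.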
To what end? Why do you want to prune them back? You could take a random set of ten SNPs and have a pruned set. Is there some analysis you want to be able to do with it?
I am planning to perform a machine learning analysis and would prefer a smaller subset of SNPs to use as input to the ML techniques.