3.1 years ago
I would like to perform principal component analysis on a pool-seq SNP dataset. I've been looking into methods for doing this, but have had trouble finding examples that apply to pooled data as opposed to individual genotypes. For example, I'm not sure whether PLINK can run PCA on pooled datasets. Does anyone know whether PLINK can be used for PCA on pooled SNP data and, if not, which toolkit or approach would be better suited to PCA on pooled data?
Thanks in advance!
Have you looked at this tutorial from Kevin Blighe?
Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
Thanks, this tutorial is really in depth and may be useful! Do you know if PLINK can be used for pooled SNP datasets? It looks like the tutorial is for a file with individual genotypes.
What do you mean by "pooled"? Do you mean merging different datasets? In that case, you will be dealing with potential batch and/or technical artefacts.
Individuals were pooled prior to sequencing, so each library contains DNA from multiple individuals. I'm still not sure about PLINK, but I did come across someone else who used the prcomp function in base R to run PCA on pool-seq allele frequencies.
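For anyone landing here later, the allele-frequency approach mentioned above can be sketched roughly as follows. This is a minimal Python/NumPy illustration, not the method used in that analysis: it centers each SNP and takes an SVD, which is what prcomp does with its default settings (center = TRUE, scale. = FALSE). The function name `pca_allele_freqs` and the toy frequency matrix are hypothetical, and the sketch assumes one row per pool and one column per SNP, with no weighting for pool size or read depth.

```python
import numpy as np


def pca_allele_freqs(freqs, n_components=2):
    """PCA on a pools x SNPs matrix of allele frequencies (illustrative sketch)."""
    X = np.asarray(freqs, dtype=float)

    # Drop SNPs whose frequency is identical in every pool; they carry no signal.
    keep = X.std(axis=0) > 0
    X = X[:, keep]

    # Center each SNP (column) on its mean across pools, without scaling.
    Xc = X - X.mean(axis=0)

    # SVD of the centered matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    scores = U * S                        # coordinates of each pool on each PC
    var_explained = S**2 / np.sum(S**2)   # proportion of variance per PC
    return scores[:, :n_components], var_explained[:n_components]


# Hypothetical toy data: 4 pools (rows) x 5 SNPs (columns).
freqs = [
    [0.10, 0.80, 0.50, 0.30, 0.95],
    [0.15, 0.75, 0.50, 0.35, 0.90],
    [0.70, 0.20, 0.50, 0.60, 0.40],
    [0.65, 0.25, 0.50, 0.55, 0.45],
]

scores, var_explained = pca_allele_freqs(freqs)
print(scores)          # one row per pool, columns are PC1 and PC2
print(var_explained)   # share of variance captured by PC1 and PC2
```

Plotting the first two columns of `scores` gives the usual PCA scatter, with one point per pool. Whether to filter SNPs by coverage or weight pools by size first is a separate decision that this sketch does not address.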
What did you end up doing for this?