I have a massive data table with dbSNP rs ids as rows and samples as columns in this kind of format
dbSNP Sample Sample Sample Sample Sample Sample
rs10000011 CC CC CC CC TC TC
rs1000002 TC TT CC TT TT TC
rs10000023 TG TG TT TG TG TG
rs1000003 AA AG AG AA AA AG
rs10000041 TT TG TT TT TG GG
rs10000046 GG GG AG GG GG GG
rs10000057 AA AG GG AA AA AA
rs10000073 TC TT TT TT TT TT
rs10000092 TC TC CC TC TT TT
There are over 1,000 samples and >547,000 loci in this table from an HGDP dataset (ftp://ftp.cephb.fr/hgdp_supp10/), and I would like to run a large Principal Component Analysis (with samples colored by population).
In order to do that, I first need to recode my genotypes numerically. How would I do this (preferably in R, since the file is probably too big for JMP Genomics)?
Also, some cells are missing data, indicated by --- or 00. I am going to standardize those to NA with a find-and-replace script, but how do I code the data so that R can still run the PCA? Thanks!
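For what it's worth, here is a minimal base-R sketch of one common approach: additive (0/1/2) coding, counting copies of an arbitrarily chosen reference allele per locus, with missing genotypes mean-imputed so prcomp() can run. The tiny inline data frame stands in for your full table, which you could read with something like read.table("hgdp.txt", header = TRUE, row.names = 1, na.strings = c("---", "00")); the filename and the population vector in the plotting line are placeholders.

```r
# Stand-in for the real table (rows = loci, columns = samples).
geno <- data.frame(S1 = c("CC", "TC", "TG"),
                   S2 = c("TC", "TT", "TG"),
                   S3 = c("TC", "CC", "TT"),
                   row.names = c("rs10000011", "rs1000002", "rs10000023"),
                   stringsAsFactors = FALSE)

# Count copies of an (arbitrary) reference allele per genotype -> 0/1/2.
code_row <- function(gt) {
  ref <- sort(unique(unlist(strsplit(gt[!is.na(gt)], ""))))[1]
  vapply(strsplit(gt, ""), function(a)
    if (anyNA(a)) NA_real_ else sum(a == ref), numeric(1))
}
num <- t(apply(as.matrix(geno), 1, code_row))   # loci x samples, 0/1/2

# Mean-impute missing genotypes so prcomp() can run, and drop
# monomorphic loci (zero variance breaks scale. = TRUE).
num[is.na(num)] <- rowMeans(num, na.rm = TRUE)[row(num)[is.na(num)]]
num <- num[apply(num, 1, var) > 0, , drop = FALSE]

pca <- prcomp(t(num), scale. = TRUE)            # samples as observations
# plot(pca$x[, 1:2], col = as.factor(population))  # population: your labels
```

Note that prcomp() on a 1,000 x 547,000 matrix will need a lot of memory; a cluster node, or a tool built for genotype data, is the realistic way to run it at full scale.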
I am not sure R can easily handle a dataset this big either. I would suggest using PLINK (which computes the PCA directly), though you will need to create extra files to describe your data. See https://www.cog-genomics.org/plink2/input and https://www.cog-genomics.org/plink2/strat#pca.
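As a rough sketch of that workflow (filenames here are placeholders, and you'd first have to reshape the genotype table into .ped/.map format as described on the input page linked above):

```shell
# Convert text .ped/.map files (prefix "hgdp") to PLINK's binary format:
plink --file hgdp --make-bed --out hgdp_bin
# Run PCA on the binary fileset; writes hgdp_pca.eigenvec / .eigenval:
plink --bfile hgdp_bin --pca 10 --out hgdp_pca
```

The .eigenvec file is small and easy to read back into R for plotting by population.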
I can run R on UF's HPC cluster, though; it should be able to handle it there.
Does anyone have any suggestions? I tried Stack Overflow, but they sent me back here.