I am bit of a totally lost situation, I get one thing right, but then two things go wrong (sorry if some of these comments are repeats of stuff from previous threads). I need some help/advice on how to proceed.
Just for context, this dataset is meant for two things (for now), which are to 1) see how these samples related to various global populations using analyses such as PCA and STRUCTURE (or similar) and 2) estimate archaic introgression (i.e., Neanderthals and Denisovans)
I have 92 samples that were genotyped on the Affy Human Origins array from about 570,000 SNPs. I also have comparative data from 934 HGDP samples that I have downloaded, but there is just so much of it that I need to cull it down (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz).
For the moment being, I've been trying to just PCA my own samples (n=92), but this isn't working either
I coded my genotypes to 0/1/2s using JMP Genomics and then did this using R on UF's computer cluster
read.table('92simpleb_recgeno.txt', sep='\t', header=TRUE, row.names=1)->table pcol=c(rep("green",3),rep("blue",89)) help(prcomp) table[ table == "." ] = NA t(table) -> trans pca<-prcomp(~ ., data=trans, na.action = na.omit) save(pca, file="1.RData")
and I got this back
Error: cannot allocate vector of size 1284.9 Gb Execution halted
Basically, I am lost (and a bit frustrated), and I need some suggestions on how to PCA this much data without metaphorically blowing something up. Currently, the samples are formatted with columns as samples and rows as loci, but I can re-export the data from Affy Genotyping Console in a different format. Any suggestions on what I should do.
R cannot handle a matrix of that size. If you are working on Snp data, maybe you can try EIGENSTRAT under EIGENSOFT.
You could also use PLINK: https://www.cog-genomics.org/plink2/strat#pca. PLINK's pca only uses a subset of samples making the computation feasible and (somehow) fast.
I am about to try an MDS on Plink 1.07. Do you happen to know if the application support multiple threads/processors?
Not sure whether PLINK 1.07 supports mutithreading, but I would say no. Why don't you use PLINK 1.9? It supports threads and it is more optimised (in general).
For PLINK 1.9: https://www.cog-genomics.org/plink2/, for threads: https://www.cog-genomics.org/plink2/other#threads
Currently, UF's cluster does not have Plink 1.9 installed (just 1.07). I am going to ask them to add it, but that will take a day or two (thus I am using 1.07 for now).
I think I am going to take alesssia's advise and finally give Plink a try for now as my gigantic comparative data is already in Plink format, but I had a question about EIGENSTRAT. Eventually, I am possibly going to need to use the EIGENSTRAT file format for Admixtools. Does the package have a method of converting Plink files to EIGENSTRAT?
See here: Conversion tools for GWAS