Hi everyone:
I am attempting to convert my SNP matrix into a .ped file for use with PLINK. I have successfully added all of the required information columns and just need to split each of my SNP columns, which are currently in the format "AA", into two separate columns per SNP ("A" "A"). My matrix is very large (~2.1 GB) with dimensions 109 x 2443180. I have tried multiple approaches and have had the most success with the following code, run in batches of around 200,000 columns, although R crashes while working on batch 8 of 13.
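library(parallel)  # provides mclapply
# 'cores' is the number of worker processes, defined earlier in the session
batch1 <- do.call(cbind,
  mclapply(snpmatrix_df[, 7:200000],
    # split each two-character genotype into a two-column matrix of alleles
    function(i) do.call(rbind, strsplit(as.character(i), split = '')),
    mc.cores = cores
  )
)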
save.image(file = "/N/u/bscomer/Karst/backup15.RData")
I do not think this is a memory issue because I am working on a computing cluster with 30 GB of available RAM.
Does anybody have any suggestions on the best way to approach this issue?
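One approach that may be worth trying is sketched below. It assumes every genotype is exactly a two-character string, and it replaces the strsplit/rbind step with vectorized substr() calls on a character matrix, processed in column chunks. The function names split_genotypes and alleles_as_lines, and the chunk_size argument, are hypothetical and only for illustration; missing genotypes would need separate handling.

```r
# Hypothetical helper: split two-character genotypes ("AG") into two
# allele columns ("A", "G") with vectorized substr() calls.
split_genotypes <- function(geno) {
  geno <- as.matrix(geno)                   # character matrix, samples x SNPs
  a1 <- substr(geno, 1, 1)                  # first allele of every cell
  a2 <- substr(geno, 2, 2)                  # second allele of every cell
  out <- matrix("", nrow = nrow(geno), ncol = 2L * ncol(geno))
  out[, seq(1L, ncol(out), by = 2L)] <- a1  # odd columns hold allele 1
  out[, seq(2L, ncol(out), by = 2L)] <- a2  # even columns hold allele 2
  out
}

# Hypothetical driver: process the SNP columns in chunks and build one
# space-separated allele string per sample (one string per .ped row).
alleles_as_lines <- function(snp_df, first_snp_col, chunk_size = 50000L) {
  lines <- character(nrow(snp_df))
  for (s in seq(first_snp_col, ncol(snp_df), by = chunk_size)) {
    e <- min(s + chunk_size - 1L, ncol(snp_df))
    alleles <- split_genotypes(snp_df[, s:e, drop = FALSE])
    chunk <- unname(apply(alleles, 1, paste, collapse = " "))
    lines <- if (s == first_snp_col) chunk else paste(lines, chunk)
  }
  lines
}

# Toy data (made up for illustration): one ID column, then two SNP columns.
toy <- data.frame(id = c("s1", "s2"),
                  snp1 = c("AA", "AG"),
                  snp2 = c("CT", "TT"),
                  stringsAsFactors = FALSE)
res <- alleles_as_lines(toy, first_snp_col = 2L, chunk_size = 1L)
print(res)  # "A A C T" "A G T T"
```

Each chunk only ever holds chunk_size SNP columns in split form, so the peak memory use is governed by the chunk size rather than by the full matrix. The resulting strings could then be pasted onto the information columns and written out with writeLines().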
Can you give an example of what the raw file looks like and what kind of output you want to generate?
My matrix snpmatrix_df[1:3, 1:20] (note: row names and column names are included in this output but will not be in final ped file):
Desired final plink format: