Split individual SNP genotypes into two columns per call in R when there are 2.44 million columns
1
0
Entering edit mode
8.6 years ago

Hi everyone:

I am attempting to convert my snp matrix into a .ped file for use with Plink. I have successfully added all of the required information columns and just need to split all of my SNP columns which are currently in the following format ("AA") into two separate columns per SNP ("A" "A"). My matrix is very large (~2.1 GB) and has the following dimensions (109x2443180). I have tried multiple approached and I have had the most success with the following code run in batches of around 200,000 columns although R crashes while working on batch 8 of 13 total batches.

batch1<-do.call(cbind, 
                mclapply(snpmatrix_df[,7:200000], 
                         function(i) do.call(rbind, strsplit(as.character(i), split=''), mc.cores=cores
                         )
                )
)  

save.image(file = "/N/u/bscomer/Karst/backup15.RData")

I do not think this is a memory issue because I am working on a computing cluster with 30 GB of available RAM.

Does anybody have any suggestions on the best way to approach this issue?

R SNP PLINK • 2.8k views
ADD COMMENT
0
Entering edit mode

can you give an example of what does the raw file look like and what kind of output you want to generate?

ADD REPLY
0
Entering edit mode

My matrix snpmatrix_df[1:3, 1:20] (note: row names and column names are included in this output but will not be in final ped file):

        family_id individual_id paternal_id maternal_id sex        phenotype kgp5400438 kgp5400470 kgp5400471
  xxxx_001 "IID"     "xxxx_001"    "0"         "0"         "'Female'" "2"       "BB"       "AB"       "AA"      
    xxxx_002 "IID"     "xxxx_002"    "0"         "0"         "'Female'" "2"       "BB"       "AB"       "AB"      
    xxxx_003 "IID"     "xxxx_003"    "0"         "0"         "AA"       "2"       "BB"       "AB"       "BB"      
             kgp5400475 kgp5400480 kgp5400494 kgp5400502 kgp5400517 kgp5400521 kgp5400529 kgp5400537 kgp5400545
    xxxx_001 "BB"       "BB"       "BB"       "BB"       "BB"       "AA"       "BB"       "BB"       "AB"      
    xxxx_002 "BB"       "BB"       "AB"       "AB"       "BB"       "AA"       "BB"       "BB"       "AB"      
    xxxx_003 "BB"       "BB"       "AA"       "BB"       "BB"       "AB"       "BB"       "BB"       "AB"      
             kgp5400584 kgp5400586
    xxxx_001 "AB"       "BB"      
    xxxx_002 "BB"       "BB"   
    xxxx_003 "BB"       "BB"

Desired final plink format:

FAM1    NA06985 0   0   1   1   A   T   T   T   G   G   C   C   A   T   T   T   G   G   C   C
FAM1    NA06991 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
ADD REPLY
0
Entering edit mode
8.6 years ago
Michael 55k

Note: your example is misleading, you have only AA, AB, BA, etc. in your input but A,C,G,T in your output, please check your input file, are you sure AB do not stand for genotypes instead of bases?

It is all about doing this efficiently handling the IO which is not shown here, but from the _df I assume that you keep the whole data in memory in a data frame which is not really necessary to do the split. Indeed I think you have a memory issue because a data frame can take about up to 10 times the size of the same data in a matrix and I guess the 2.1GB is the size on disk. So convert the data into a matrix of factors or a even a character vector. You can use sapply on the vector and then re-format the vector into a matrix afterwards.

If you loaded the data from a file into R solely for the processing, it would be more efficient to do the split in perl or using unix cut reading it line by line.

ADD COMMENT

Login before adding your answer.

Traffic: 1727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6