Hi all, I have a file of imputed dosage genotypes which looks like this:
rs2259866 14726071 C T 0.996 0.004 0.000 0.996
rs9712893 14727194 A G 0.996 0.004 0.000 0.996
rs2096664 14989899 A G 0.000 0.000 1.000 0.000
rs7339863 14996145 A T 0.000 0.014 0.986 0.000
rs8190183 14999031 A T 0.000 0.014 0.986 0.000
and corresponding fam and map files
fam file
SID1 SID1 0 0 1 2
SID2 SID2 0 0 1 1
SID3 SID3 0 0 2 1
SID4 SID4 0 0 1 2
SID5 SID5 0 0 1 1
DC35 DC35 0 0 2 1
map file
22 rs2259866 0 14726071
22 rs9712893 0 14727194
22 rs2096664 0 14989899
22 rs7339863 0 14996145
22 rs8190183 0 14999031
I am trying to analyse these files in plink. I have a dosage, fam and map file for each chromosome. The problem is that when I try to run the --dosage command, plink crashes immediately. Before crashing it reads some basic information from the fam and map files, like the number of markers and the number of individuals. These files were given to me by a colleague and no one knows how the imputation was done. I counted the number of columns in R and it turns out to be the number of individuals*3 (probability of each genotype) + 4 (leading columns), which is why I was using the format=3 option. The plink command I tried was
plink --dosage chr22.gen format=3 skip1=1 noheader --fam chr22.pfam --map chr22.pmap
with variations on skip0, skip1 and format={1,2,3}. Nothing worked and plink crashed every time. I have verified that my install of plink works OK on other "hard"-genotype files; those commands run fine. I then thought it might be a memory problem, but there are only 32,000 markers on chromosome 22, so I wasn't sure. Nonetheless, I used R to subset the genotype file down to ten SNPs for all individuals (~1700) and tried running that in plink with the above command, but to no avail.
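In case it helps anyone reproduce these checks outside R, here is a minimal Python sketch (not the code actually used). It confirms that every line of the dosage file has 4 + 3*N whitespace-separated fields, with N taken from the fam file, and copies the first ten SNPs to a smaller file. The function names, the lead_cols and probs_per_ind parameters, and the output filename in the usage note are hypothetical.

```python
def count_individuals(fam_path):
    """Count non-empty lines in a PLINK .fam file (one individual per line)."""
    with open(fam_path) as fam:
        return sum(1 for line in fam if line.strip())


def check_and_subset(dosage_path, fam_path, out_path,
                     n_snps=10, lead_cols=4, probs_per_ind=3):
    """Check field counts and copy the first n_snps lines to out_path.

    Returns a list of (line_number, field_count) for every line that does
    not have lead_cols + probs_per_ind * n_individuals fields.
    Assumes no header line and whitespace-separated fields.
    """
    expected = lead_cols + probs_per_ind * count_individuals(fam_path)
    bad_lines = []
    with open(dosage_path) as src, open(out_path, "w") as dst:
        for lineno, line in enumerate(src, start=1):
            n_fields = len(line.split())
            if n_fields != expected:
                bad_lines.append((lineno, n_fields))
            if lineno <= n_snps:
                dst.write(line)  # keep all individuals, first n_snps SNPs only
    return bad_lines
```

For the files in this post the call would be something like check_and_subset("chr22.gen", "chr22.pfam", "chr22.first10.gen"). An empty list back means every line has the expected width; anything else points at truncated or malformed lines.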
As a possible solution, if I can't get plink to work I will move to probABEL. Does anyone know if these file types are supported by probABEL? I get the impression it only likes MACH input files.
Any help is greatly appreciated, Cheers, Davy
EDITED: skip parameter actually set at 1, not 2. Sorry
BTW, straight from the plink FAQ on the subject of memory:
"There are no fixed limits to the size of the data file; it uses currently 1 byte for 4 SNP genotypes and some overhead per SNP and per individual. This means that you should be able to get datasets of, say, 1 million SNPs and up to 5000 individuals, in a machine with 2GB RAM without causing too much stress/swapping, etc. That is, 5000 * 1e6 / 4 = 1.25e9 bytes = ~1GB."
I have no experience with imputation software, but the SNPs in your example all seem to be monomorphic or nearly so (MAF<0.05); maybe this is causing some computational problem.
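One way to check this would be to estimate each SNP's allele frequency directly from the genotype probabilities. A rough Python sketch follows. It assumes that after the 4 leading columns each individual contributes three probabilities in the order homozygote, heterozygote, homozygote; the post does not confirm this ordering, and the function names are made up for illustration. Swapping the two homozygote columns does not change the MAF, but a different position for the heterozygote column would.

```python
def maf_from_probs(probs):
    """Estimate minor allele frequency from a flat list of genotype
    probabilities, three per individual (hom, het, hom; assumed order)."""
    if not probs or len(probs) % 3 != 0:
        raise ValueError("expected a non-empty multiple of 3 probabilities")
    n_individuals = len(probs) // 3
    # Expected count of the second allele: P(het) + 2 * P(second homozygote)
    dosage = sum(probs[i + 1] + 2.0 * probs[i + 2]
                 for i in range(0, len(probs), 3))
    freq = dosage / (2.0 * n_individuals)
    return min(freq, 1.0 - freq)


def maf_for_line(line, lead_cols=4):
    """Return (snp_id, maf) for one whitespace-separated dosage line."""
    fields = line.split()
    probs = [float(x) for x in fields[lead_cols:]]
    return fields[0], maf_from_probs(probs)
```

Running maf_for_line over every line of the ten-SNP subset and counting how many SNPs fall below 0.05 would show whether the whole test file is near-monomorphic or only the lines shown in the post.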