I have genotype data as a set of Birdseed files (one per sample), each of the form:
Composite Element REF   Call   Confidence
SNP_A-2131660           2      0.0060
SNP_A-1967418           2      0.0281
SNP_A-1969580           2      0.0074
SNP_A-4263484           2      0.0104
SNP_A-1978185           0      0.0034
I've loaded these into a matrix in R to run QC on them and to replace the probe IDs with rs IDs, giving something like this:
> geno[1:4,1:4]
           TCGA.A7.A0D9 TCGA.A7.A0DB TCGA.A7.A13G TCGA.AC.A2FB
rs2887286             2            2            2            2
rs1496555             2            2            2            1
rs3890745             2            2            1            2
rs10489588            1            0            0            0
> dim(geno)
[1] 693085     94
I want to use 1000 Genomes data to impute genome-wide SNP data, but I'm not sure how to get from this format to the one required by IMPUTE2. I've been reading through this wiki, and it seems it might be easiest to convert this to PED format and then use GTOOL to convert to IMPUTE format.
Are there any existing tools that can do this reformatting directly? Or will I have to write a script to do the matrix > PED reformatting and then use GTOOL for PED > IMPUTE?
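For reference, the matrix > PED step can be quite short if it does come down to a script. Below is a minimal sketch in Python rather than a tested pipeline. It assumes the usual Birdseed encoding (0 = AA, 1 = AB, 2 = BB, -1 = no call) and a SNPs x samples layout like geno above. The names ANNOT, call_to_alleles and matrix_to_ped_map are hypothetical, and the SNP IDs, positions and alleles in ANNOT are made up for illustration; real values would have to come from the array annotation file.

```python
# Hypothetical annotation: SNP id -> (chromosome, bp position, allele A, allele B).
# These entries are invented for illustration only.
ANNOT = {
    "snpX": ("1", 1000, "A", "G"),
    "snpY": ("1", 2000, "C", "T"),
}

MISSING = ("0", "0")  # PLINK's default missing-allele code


def call_to_alleles(call, allele_a, allele_b):
    """Map a Birdseed call (assumed 0=AA, 1=AB, 2=BB, else no call) to an allele pair."""
    if call == 0:
        return (allele_a, allele_a)
    if call == 1:
        return (allele_a, allele_b)
    if call == 2:
        return (allele_b, allele_b)
    return MISSING


def matrix_to_ped_map(snp_ids, sample_ids, calls, annot):
    """Build PED and MAP lines; calls[i][j] is the call for SNP i in sample j."""
    # MAP: chromosome, SNP id, genetic distance (0 = unknown), bp position.
    map_lines = []
    for snp in snp_ids:
        chrom, pos, _, _ = annot[snp]
        map_lines.append(f"{chrom} {snp} 0 {pos}")

    # PED: family id, individual id, father, mother, sex, phenotype, then alleles.
    ped_lines = []
    for j, sample in enumerate(sample_ids):
        fields = [sample, sample, "0", "0", "0", "-9"]
        for i, snp in enumerate(snp_ids):
            _, _, a, b = annot[snp]
            fields.extend(call_to_alleles(calls[i][j], a, b))
        ped_lines.append(" ".join(fields))
    return ped_lines, map_lines


# Toy input: two SNPs (rows) by two samples (columns).
ped, mp = matrix_to_ped_map(["snpX", "snpY"], ["S1", "S2"], [[2, 1], [0, -1]], ANNOT)
print("\n".join(ped))
print("\n".join(mp))
```

The sample ID is reused as the family ID, and sex and phenotype are written as unknown (0 and -9). Strand alignment against the reference panel is not handled here and would still need checking before imputation.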
Sorry for the slightly incoherent reply; I wrote it on my teensy-screen iPhone.
It's fine! I had seen the tool, but I was hoping to find an easier way... In the end I wrote a script to convert all the files using the tools in question, and then merged them all together using PLINK.
Hi, would you mind sharing your script or giving me an outline of what you did? I am stuck in a similar situation with TCGA data!
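This is not the script mentioned above, but one possible outline is to skip PED entirely and write IMPUTE2's genotype (.gen) lines straight from the hard calls. The sketch below is in Python and rests on several assumptions: Birdseed calls are coded 0 = AA, 1 = AB, 2 = BB with anything else treated as missing, the matrix is SNPs x samples, and a lookup of chromosome, position and A/B alleles per SNP is available. GEN_ANNOT, PROBS and matrix_to_gen are hypothetical names, and the annotation values are invented.

```python
# Hypothetical annotation: SNP id -> (chromosome, bp position, allele A, allele B).
# Entries are invented for illustration only.
GEN_ANNOT = {
    "snpX": ("1", 2000, "A", "G"),
    "snpY": ("1", 1000, "C", "T"),
    "snpZ": ("2", 500, "A", "C"),
}

# A hard call becomes a probability triple P(AA) P(AB) P(BB).
PROBS = {0: "1 0 0", 1: "0 1 0", 2: "0 0 1"}
NO_CALL = "0 0 0"  # all-zero triple marks a missing genotype


def matrix_to_gen(snp_ids, calls, annot, chrom):
    """Return .gen lines for one chromosome, sorted by position.

    calls[i][j] is the call for SNP i in sample j. Each output line is:
    SNP id, rs id, bp position, allele A, allele B, then three
    probabilities per sample.
    """
    rows = []
    for snp, row in zip(snp_ids, calls):
        snp_chrom, pos, a, b = annot[snp]
        if snp_chrom != chrom:
            continue  # one file per chromosome
        probs = " ".join(PROBS.get(c, NO_CALL) for c in row)
        rows.append((pos, f"{snp} {snp} {pos} {a} {b} {probs}"))
    rows.sort(key=lambda r: r[0])
    return [line for _, line in rows]


# Toy input: three SNPs (rows) by two samples (columns).
toy_calls = [[2, 1], [0, -1], [1, 1]]
for line in matrix_to_gen(["snpX", "snpY", "snpZ"], toy_calls, GEN_ANNOT, "1"):
    print(line)
```

Writing one file per chromosome matches how imputation is usually run in chunks. The alleles would still need to be on the same strand as the 1000 Genomes reference panel, which this sketch does not check.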