7.1 years ago
bha
I have a data file containing imputed genotypes, with the following columns:
FamilyID IndividualID DAD MUM SEX Phenotype snp1 snp2 ... snp60k
This file is in plink.raw format. Any idea how I can convert it to PLINK .map and .ped format?
From where did you get the raw file? These raw files are usually produced via the recode command-line parameter to plink, meaning that they must have been produced from an existing plink dataset. This question has been posted time and time again across the WWW but there's never a concrete solution, and I keep wondering where the person obtained the raw file in the first place. If you have those columns, it is already virtually a PED file, just one without the associated MAP. How are your alleles encoded? Are you sure that you don't already have a PED or BED file in your directory?
Kevin, many thanks for replying. Yes, I got the plink.raw file from recode in plink. I then used it (after removing the FID, dad, mum, sex and pheno columns) as input for imputing missing genotypes in another program called AlphaImpute. The output of the imputation is a text file containing the columns ID, snp1, snp2, ... snp60k. I merged that output (cbind in R) to get DAD, MUM, SEX and Pheno back. Now I want to convert this file back to plink. This is how it looks:
The SNPs are coded as 0, 1, and 2, which stand for homozygous aa, heterozygous aA or Aa, and homozygous AA, respectively, and 9 is a missing genotype. I also posted on Stack Overflow, but haven't got any answer so far.
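As a quick sanity check on that encoding, the sketch below tallies the codes found in each SNP column and flags anything other than 0, 1, 2 or 9. It assumes a whitespace-delimited text file with a header row and the ID in the first column; the function name and the example rows are made up for illustration and are not part of AlphaImpute or PLINK.

```python
from collections import Counter

# Codes described in the post: 0/1/2 genotypes, 9 = missing.
VALID_CODES = {"0", "1", "2", "9"}

def tally_codes(lines):
    """Count genotype codes per SNP column.

    lines[0] is assumed to be a header of the form: ID snp1 snp2 ...
    Returns (counts per SNP, list of unexpected (row, snp, code) entries).
    """
    header = lines[0].split()
    snps = header[1:]
    counts = {snp: Counter() for snp in snps}
    unexpected = []
    for row_number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        for snp, code in zip(snps, fields[1:]):
            counts[snp][code] += 1
            if code not in VALID_CODES:
                unexpected.append((row_number, snp, code))
    return counts, unexpected

# Made-up example rows, not the poster's data.
example = [
    "ID snp1 snp2 snp3",
    "ind1 0 1 2",
    "ind2 9 1 NA",
]
counts, unexpected = tally_codes(example)
print(counts["snp2"]["1"])  # number of heterozygous calls at snp2
print(unexpected)           # entries that are not 0, 1, 2 or 9
```

Any entry reported as unexpected (for example a stray NA) would need to be recoded before the file is converted further.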
In that case, you may still be able to coerce this back into a PLINK object provided the original map file is there (if you have it?). Take a look here: https://www.cog-genomics.org/plink/1.9/input#ped
Something like:
plink --ped MyImputedData.txt --map MyOriginalMap.map
(or possibly --tped)
I have also added the 'plink' tag to your original question in the hope that the developer will pick up on this issue.
Generally it looks like there's nothing specific to do this, and that you'll have to work with the data you've got in order to coerce it back into PLINK format. Also double-check your imputation program to see whether it already has a function to produce data in PLINK-ready format.
I do have the map file. I gave it a go with this
but it ends with the error "token does not match". As I mentioned earlier, I got plink.raw (MyImputedData.txt) from --recodeA (which converts alleles to 0, 1 and 2). So I think plink expects 2 columns for each SNP. Do you know if there is any way to convert the single column back into two columns? I am afraid the imputation program does not produce PLINK-ready format.
Can you paste a few lines from each here (the raw and map file), for matching SNPs? Then I will try to re-create the PLINK dataset here and update you afterwards.
First 10 rows and columns of the plink.raw file:
And the map file:
Many thanks, in advance!
I did some testing and it is possible to convert these back to PLINK, but you are still missing key information. Yes, PLINK expects two columns for each SNP genotype, and, if numbers are provided as genotypes in the PED/RAW file, it will assume that these are in the 1234 format for ACGT.
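To make the two-columns-per-SNP point concrete, here is a minimal sketch of how numeric alleles in a PED line would be read under the 1234 = ACGT convention mentioned above. The function name and the example line are hypothetical; this only illustrates the convention and is not PLINK's own parsing code.

```python
# 1234 = ACGT convention; 0 is the usual PED code for a missing allele.
NUMERIC_TO_BASE = {"1": "A", "2": "C", "3": "G", "4": "T", "0": "0"}

def interpret_ped_genotypes(ped_line, n_leading=6):
    """Return one two-letter genotype per SNP from a PED line.

    The first n_leading fields (FID IID PAT MAT SEX PHENO) are skipped;
    the remaining fields are read as consecutive allele pairs.
    """
    alleles = ped_line.split()[n_leading:]
    if len(alleles) % 2 != 0:
        raise ValueError("PED genotype columns must come in pairs (two alleles per SNP)")
    return [
        NUMERIC_TO_BASE[a] + NUMERIC_TO_BASE[b]
        for a, b in zip(alleles[0::2], alleles[1::2])
    ]

# Made-up line with 3 SNPs (6 allele columns).
print(interpret_ped_genotypes("fam1 ind1 0 0 1 -9 1 1 2 2 1 1"))
# ['AA', 'CC', 'AA']
```

A 012 file with one column per SNP fails this pairing check whenever the number of SNPs is odd, and is silently misread when it is even.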
I was able to coerce your data to PLINK with:
Then remove the header in test.ped and only include 4 columns for genotypes (for 2 SNPs). I also modified the missing genotypes that were NA, and other genotypes, just for testing.
The conclusion? We need another mapping in order to connect 012 back to bases, and convert these (using awk, possibly) to ACGT or 1234 in the PED file. Do you have such a mapping from the original data?
The order of SNPs in the MAP file has to agree with the order in the RAW/PED file too.
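The re-encoding described here could be sketched in Python rather than awk, as below. It assumes a lookup table giving, for each SNP, the allele whose copies are counted in the 012 code and the other allele; that table, the function names and the example values are all hypothetical and would have to come from the original data. Missing codes are written as 0 0, the usual PED missing genotype.

```python
def dosage_to_ped_alleles(code, counted, other, missing_codes=("9", "NA")):
    """Turn one 0/1/2 code into a pair of alleles for a PED file.

    counted: allele whose copy number the code represents (assumed known).
    other:   the second allele at this SNP (assumed known).
    """
    if code in missing_codes:
        return ("0", "0")
    if code == "0":
        return (other, other)
    if code == "1":
        return (counted, other)
    if code == "2":
        return (counted, counted)
    raise ValueError(f"unexpected genotype code: {code!r}")

def raw_row_to_ped(fields, snp_names, map_snp_order, allele_lookup, n_leading=6):
    """Convert one row of a 012 file (6 leading columns + one column per SNP)
    into a PED line, after checking that SNP order matches the MAP file."""
    if list(snp_names) != list(map_snp_order):
        raise ValueError("SNP order in the raw file does not match the MAP file")
    out = list(fields[:n_leading])
    for snp, code in zip(snp_names, fields[n_leading:]):
        counted, other = allele_lookup[snp]
        out.extend(dosage_to_ped_alleles(code, counted, other))
    return " ".join(out)

# Hypothetical lookup: snp -> (counted allele, other allele).
allele_lookup = {"snp1": ("G", "A"), "snp2": ("T", "C")}
snps = ["snp1", "snp2"]
row = "fam1 ind1 0 0 1 -9 1 9".split()
print(raw_row_to_ped(row, snps, snps, allele_lookup))
# fam1 ind1 0 0 1 -9 G A 0 0
```

The resulting lines, written out without a header, would then be paired with the original MAP file; whether the counted allele really is the one the 012 code refers to has to be confirmed against the source data.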
Almost there.
Thanks a lot for the great help. I do have the original data for mapping. Your method also works; however, I tried something else which works more quickly. As PLINK wants two columns for each SNP, I used this:
which converts each column into two alleles, and this should be the .ped file. The map is already there. Do you think it makes sense?
I have never used this program. Are you sure that its output will be accepted by PLINK?
[source: https://cran.r-project.org/web/packages/HapEstXXR/HapEstXXR.pdf]
If you have the original genotype at each SNP (ATGC), then I can probably convert your imputed data to PLINK format using awk and sed.
I tried it with my data (because my plink.raw is in 0,1,2):
and the output was accepted by PLINK. I have the original genotype at each SNP (ATGC). Just for learning, could you please write some example code showing how to convert imputed data to PLINK format using awk and sed?
I would need the lookup tables, i.e., for linking the minor allele numbers in 012 format to the actual alleles, in either 1234 or ACGT.
Can you paste the output of plink when you input this information? Are you sure that it has interpreted it correctly?
I got this ped file using the above R package (FID, IID, PAT, MAT, SEX, PHENO, and SNPs):
Well, to me it looks good, but I can't be 100% sure.
I think that you still need to be careful here because plink will still assume that every 2 columns relate to a single genotype.
For example, the first line:
Plink will assume that the genotypes are 11, 22, and 11 (AA, CC, and AA).
As mentioned, you will have to link the 012 imputed raw file back to your original MAP file, and then re-encode 012 as 1234 or ACGT.
If in doubt, I would contact the developers of plink (assuming their email is on the website). I believe that it's now under the license/maintenance of COG Genomics.
Do you mean to use this:
plink.raw is the file after converting 012 to 11, 22, and 11 (AA,CC,T)?
No, I'm talking about editing your raw file manually, using something like awk. Currently it's in 012 format (and also the format from allele1to2), which contains no information on genotype calls. Plink cannot read this format directly (it outputs to this format using --recode, as we know). We want 1234 or ACGT format. For that, we need to know the actual genotype at each SNP position in your data. We lost this connection by using the --recode command. As far as I'm aware, 11, 22, and 11 do not equal AA, CC, and T.
If you compare your original and current data:
Original (from imputation; 012 encoding):
Current (from allele1to2):
Looking at the original, I see 3 heterozygous SNPs. Looking at the current file, I see conflicting information, as the 11, 22, and 11, indicate homozygosity for AA, CC, AA.
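This kind of comparison can be automated with a small check that counts heterozygous calls on both sides. The sketch below uses made-up values chosen to mirror the situation described (three heterozygous 012 calls versus the allele pairs 11, 22, 11); they are not the poster's actual rows, and the function names are hypothetical.

```python
def count_het_dosage(codes):
    """Heterozygous calls in a 012-coded row: code 1 means one copy of each allele."""
    return sum(1 for c in codes if c == "1")

def count_het_ped(alleles):
    """Heterozygous calls in PED allele columns: the two alleles of a pair differ.
    Pairs containing the missing allele code 0 are not counted."""
    pairs = zip(alleles[0::2], alleles[1::2])
    return sum(1 for a, b in pairs if a != b and "0" not in (a, b))

# Illustrative values only.
dosage_row = ["1", "1", "1"]               # three heterozygous SNPs
ped_alleles = ["1", "1", "2", "2", "1", "1"]  # read as 11, 22, 11

print(count_het_dosage(dosage_row))  # 3
print(count_het_ped(ped_alleles))    # 0
```

If the two counts differ for the same individual, the conversion has changed the genotypes rather than just their representation.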
Just be careful about how you proceed with this. Genotype data can be very messy and good data management is therefore of paramount importance. I don't believe that allele1to2 is what you want to use. You need THIS. You may already have all required information.
Yes, you are absolutely right. In the original file I have 3 heterozygous SNPs, but the current file (plink.raw 012) conflicts with the original. I can't see 'G'.
I have two files: 1) plink.raw (imputed genotypes in 012 format), and 2) the original map file, say plink.map (which contains CHR, SNP identifier, genetic distance and bp position). This is how I am doing it:
geno.matrix <- plink.raw (after removing first column)????
But where can I get these 3 files: linkage.file, map.file, cov.file?
geno.matrix (n,m) genotype matrix (n=number of individuals, m=number of marker, 1-column for every marker, R-code: 1 = 1/1, 3 = 1/2, 2 = 2/2); All markers should be biallelic.
But mine is in 0,1,2. Should I change it to 1,2,3?
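Going by the coding quoted above (1 = 1/1, 3 = 1/2, 2 = 2/2), a heterozygote is 3 rather than 2, so the recoding is not a simple shift by one. A minimal sketch follows, assuming that 012 code 0 corresponds to the 1/1 homozygote and code 2 to the 2/2 homozygote; that correspondence is an assumption, and how missing values (9) should be passed on is not stated in the quoted text, so they are returned as None here.

```python
# Assumed correspondence: 0 -> 1/1 (code 1), 1 -> 1/2 (code 3), 2 -> 2/2 (code 2).
DOSAGE_TO_GENO_MATRIX = {0: 1, 1: 3, 2: 2}

def recode_for_geno_matrix(dosage_row, missing=9):
    """Recode one row of 0/1/2 values into the 1/3/2 scheme quoted above.
    Missing values are returned as None and need separate handling."""
    return [None if d == missing else DOSAGE_TO_GENO_MATRIX[d] for d in dosage_row]

print(recode_for_geno_matrix([0, 1, 2, 9]))
# [1, 3, 2, None]
```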
You may have them somewhere on your disk. Essentially, all you need is a list of SNPs that has, for each, the minor and major allele. Sometimes these are output as standard if your data has passed through a bioinformatics service. I output it too when analysing genotype data:
I've just read somewhere else that, yes, it's not possible to revert back to what we need:
[source: http://gengen.openbioinformatics.org/en/latest/tutorial/coding/]
As some final guidance to you: