I have a GenomeStudio genotype file with missing genotypes denoted by -
Using this file I generated the map, fam and lgen files for each chromosome, and converted them to ped format using the --recode option in plink. To overcome the plink Error: Locus has >2 alleles, I used the --missing-genotype option with - as the missing-genotype code.
After the ped files for each chromosome were successfully generated, there are a couple of issues I am facing:
My lgen file corresponds to the map file, but after the recode the ped file has way more columns than I expect from the number of rows in the map file. I expect the number of columns to be twice the number of map rows (two alleles per SNP).
When I try to merge all the chromosomes to evaluate summary statistics, the - in the data doesn't seem to be excluded and continues to give errors.
Would converting all the - to 0 be the solution here? I am trying to understand how to exclude such data and what the best practices are.
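For what it's worth, here is a minimal Python sketch of the "convert all the - to 0" idea, applied to the lgen file before recoding. It assumes the five-column LGEN layout (family ID, individual ID, SNP ID, allele 1, allele 2) and relies on 0 being PLINK's default missing-genotype code. The function names are made up for illustration and are not part of PLINK.

```python
def normalize_lgen_line(line, missing_in="-", missing_out="0"):
    """Replace the missing-allele code in one LGEN line.

    Assumes five whitespace-separated fields:
    family ID, individual ID, SNP ID, allele 1, allele 2.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    # Only the two allele columns are rewritten, so sample or SNP IDs
    # that happen to contain '-' are left untouched.
    alleles = [missing_out if a == missing_in else a for a in fields[3:]]
    return " ".join(fields[:3] + alleles)


def normalize_lgen(lines, missing_in="-", missing_out="0"):
    """Apply normalize_lgen_line to every non-blank line."""
    return [
        normalize_lgen_line(line, missing_in, missing_out)
        for line in lines
        if line.strip()
    ]
```

Restricting the replacement to the allele columns is a deliberate choice here: a blanket search-and-replace of - would also alter any identifiers that contain a hyphen.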
Thanks for any suggestions/feedback.
Thanks for your response chrchang523, I will give --output-missing-genotype 0 a try to get the format working.
The map files have varying numbers of rows, corresponding to the number of SNPs in each chromosome. For example, I have ~180000 for chr1, so I expect the ped file to have 180000 * 2 columns.
The only reason for .ped is to be able to see what data I am generating; the aim is to work with .bed/.bim format once the file formatting is taken care of.
How many columns does the .ped actually have?
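One way to answer that without opening a very wide file in an editor is to count fields programmatically. The sketch below assumes the standard layout where each .ped row has 6 leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two allele columns per SNP, and each non-blank .map row describes one SNP. The helper names are hypothetical.

```python
def expected_ped_columns(n_snps):
    """6 leading sample columns plus two allele columns per SNP."""
    return 6 + 2 * n_snps


def count_map_snps(map_lines):
    """Count non-blank lines, i.e. one SNP per .map row."""
    return sum(1 for line in map_lines if line.strip())


def check_ped_against_map(ped_lines, map_lines):
    """Return (line_number, n_columns) for every .ped row whose
    width differs from what the .map file implies."""
    expected = expected_ped_columns(count_map_snps(map_lines))
    mismatches = []
    for line_number, line in enumerate(ped_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        n_columns = len(line.split())
        if n_columns != expected:
            mismatches.append((line_number, n_columns))
    return mismatches
```

With real files, the two arguments can simply be open file handles, for example check_ped_against_map(open("chr1.ped"), open("chr1.map")), where the chr1 prefix is only an example. For a map file with 180000 rows, the expected width is 6 + 2 * 180000 = 360006 columns.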
You might want to try converting to .tped/.tfam (--recode --transpose) instead; that text format might be easier to read (and it's definitely more convenient for PLINK to work with).
The --output-missing-genotype 0 option has helped replace all - with 0. But in either case the --merge option (which I am using to merge data from all chromosomes) still reports an ERROR: Problem with MAP file line:
There doesn't seem to be a way for me to track down which SNP in particular is causing the issue, as it is reporting the first 6 columns for sample identifier and genotype info from the lgen file.
The .ped file now has ~180000 * 2 + 6 columns, so that seems to have been generated correctly. Thanks for the tip on transposing. Are there other advantages to transposing the data, or is this a preferred file format? I plan to impute this using 1000 Genomes; none of the info on Shapeit/Impute2 has suggested a .tped file yet, but please let me know if you have experience with that.
The "problematic MAP file line" is a properly formatted .ped file line. Try swapping the order of the arguments you're passing to --merge.
.tped files have fewer columns than .ped files, so I find them easier to work with in a text editor. If you're using --merge, though, .ped/.map lets you avoid an extra conversion step.
Thanks chrchang523! I am able to merge the files successfully; it seems the order of .map .ped in the file list was causing the issue. Take-home message: the order of the file list to be merged should be .ped .map / .bed .bim .fam
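To make the ordering harder to get wrong, the file list can be generated rather than typed by hand. This is only a sketch with made-up function names and example prefixes; it assumes each line of the list names one fileset, as .ped then .map for text filesets, or .bed then .bim then .fam for binary ones.

```python
def merge_list_lines(prefixes, binary=False):
    """Build one merge-list line per fileset prefix.

    Text filesets are written as '<prefix>.ped <prefix>.map';
    binary filesets as '<prefix>.bed <prefix>.bim <prefix>.fam'.
    """
    extensions = (".bed", ".bim", ".fam") if binary else (".ped", ".map")
    return [" ".join(prefix + ext for ext in extensions) for prefix in prefixes]


def write_merge_list(path, prefixes, binary=False):
    """Write the merge list to 'path', one fileset per line."""
    with open(path, "w") as handle:
        for line in merge_list_lines(prefixes, binary=binary):
            handle.write(line + "\n")
```

For example, write_merge_list("merge_list.txt", ["chr2", "chr3"]) would write the lines "chr2.ped chr2.map" and "chr3.ped chr3.map". In a typical run the first fileset (say chr1) is given directly on the PLINK command line and only the remaining ones go in the list.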