3.7 years ago
williamsbrian5064
Hi,
I am trying to merge some WGS data with some SNP data. The WGS file contains millions of variants and the SNP data contains about 150k SNPs. Both datasets started out as VCFs, but I converted them to .ped and .map files using the following commands:
plink --threads 4 --vcf start1.vcf --dog --out start1.output --maf 0.05 --mind 0.1 --geno 0.1 --recode --snps-only --biallelic-only strict
plink --threads 4 --vcf start2.vcf --dog --out start2.output --maf 0.05 --mind 0.1 --geno 0.1 --recode --snps-only --biallelic-only strict
I then tried to merge the files using the following command
plink --threads 4 --file start1.output --merge start2.output.ped start2.output.map --out merge.start1.start2 --maf 0.05 --mind 0.1 --geno 0.1 --recode --snps-only --dog --biallelic-only strict
When I do this I get the following error
Of these, 1 is new, while 8331241 are present in the base dataset.
405509 more multiple-position warnings: see log file.
Performing single-pass merge (14382 dogs, 124927 variants).
Pass 1: fileset #1 complete.
Error: Variant '.' is not biallelic. To obtain a full list of merge failures,
convert your data to binary format and retry the merge.
and before I get that error, I get tons of warnings that say
Warning: Multiple chromosomes seen for variant '.'.
I really don't know what I am doing wrong here. I have tried so many different ways to try and get this to work but end up getting the same error
Error: Variant '.' is not biallelic. To obtain a full list of merge failures,
convert your data to binary format and retry the merge.
Any help would be great
Usually this is caused by the following: plink identifies variants by their ID (typically an rsID), so when the IDs are missing, every such variant ends up with the name '.'. plink then treats all the variants named '.' as the same variant and gets confused when they appear on different chromosomes. You can use --set-all-var-ids from plink2 to assign IDs to all your SNPs, which should hopefully solve the problem.
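To check whether this is what is happening, looking at the second column of the .map file is usually enough. The sketch below assumes the standard four-column .map layout (chromosome, variant ID, genetic distance, bp position). The function names are made up for illustration, and the plink2 wrapper assumes a plink2 binary on the PATH and reuses the file names from the question.

```shell
# Count variants whose ID is the placeholder '.' in a .map file
# (columns: chromosome, variant ID, genetic distance, bp position).
count_missing_ids() {
    awk '$2 == "." { n++ } END { print n + 0 }' "$1"
}

# Hypothetical wrapper: let plink2 name every variant chr:pos
# ('@' = chromosome code, '#' = bp position) and write a binary fileset.
assign_ids_with_plink2() {
    plink2 --vcf "$1" --dog --set-all-var-ids '@:#' --make-bed --out "$2"
}

# Usage (not run here):
#   count_missing_ids start1.output.map
#   assign_ids_with_plink2 start1.vcf start1.ids
```

A large count from the first function would be consistent with the "Multiple chromosomes seen for variant '.'" warnings in the question.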
Thanks for responding! I actually just noticed that. The SNP IDs in the SNP data's map file are 'chr:pos', while the WGS map file just has '.' for every SNP ID, like you said. In the WGS map file, I combined the chromosome number and the position to match the format of the SNP map file. That seemed to solve the problem.
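For anyone wanting to reproduce that fix, here is a minimal sketch of the renaming step with awk. It assumes the four-column .map layout (chromosome, variant ID, genetic distance, bp position) and that the chromosome codes are written the same way in both datasets; the function name and the output file name are placeholders, not something from the original post.

```shell
# Rewrite column 2 of a .map file as chr:pos, leaving the other columns as-is.
# Output is tab-separated.
map_ids_to_chr_pos() {
    awk 'BEGIN { OFS = "\t" } { $2 = $1 ":" $4; print }' "$1"
}

# Usage (not run here): write to a new file, then use it in place of the old .map
#   map_ids_to_chr_pos start1.output.map > start1.renamed.map
```

If the two datasets spell chromosomes differently (for example "chr1" versus "1"), the generated IDs will not match and the merge will treat them as distinct variants.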
As an additional note, the .ped format has been obsolete for close to a decade. plink 1.9's native format is .bed (use --make-bed/--bfile instead of --recode/--file), which is far more efficient; in contrast, plink 1.9 has to inefficiently convert .ped to .bed before doing anything else every time you use --file (and this process is slower than with --vcf). No plink 2.0 build can even read or write .ped files right now.
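A rough sketch of the same workflow using binary filesets, assuming plink 1.9 on the PATH. The function names and output prefixes are hypothetical, and the --maf/--mind/--geno filters from the question are left out for brevity; add them back as needed.

```shell
# Convert a VCF straight to .bed/.bim/.fam instead of .ped/.map.
# $1 = input VCF, $2 = output prefix
convert_to_bed() {
    plink --threads 4 --vcf "$1" --dog --snps-only --biallelic-only strict \
          --make-bed --out "$2"
}

# Merge two binary filesets given by prefix.
# $1 = base prefix, $2 = prefix to merge in, $3 = output prefix
merge_bfiles() {
    plink --threads 4 --bfile "$1" --bmerge "$2" --dog --make-bed --out "$3"
}

# Usage (not run here):
#   convert_to_bed start1.vcf start1.output
#   convert_to_bed start2.vcf start2.output
#   merge_bfiles start1.output start2.output merge.start1.start2
```

Variants still need distinct IDs before merging, so this does not replace the ID fix discussed above.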
Thanks! I was trying to use bed files, but I couldn't compare them because they're in binary format. Is there a way to at least get the head of a bed file?
diff -q can be used to compare binary files for exact equality. If you want human-readable output every step of the way, --vcf/--recode vcf is not as bad as --file/--recode.
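As a small illustration of comparing and peeking at binary filesets, the sketch below uses cmp -s (which, like diff -q, only reports whether two files differ) and dumps the first three bytes of a .bed file in hex. The function names are made up for this example. A PLINK 1 .bed file in the default variant-major layout starts with the bytes 6c 1b 01; the accompanying .bim and .fam files are plain text, so head works on those directly.

```shell
# Exit status 0 only if the two files are byte-for-byte identical.
same_file() {
    cmp -s "$1" "$2"
}

# Print the first three bytes of a file as a hex string.
bed_magic() {
    head -c 3 "$1" | od -An -tx1 | tr -d ' \n'
}

# Usage (not run here):
#   same_file a.bed b.bed && echo identical || echo different
#   bed_magic a.bed
#   head a.bim
```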