Question

Error in merging data in plink format (after cleaning data)


8.2 years ago

fatima ▴ 20

Hi everyone,

I am working on a case-only design and want to impute untyped genotypes from Immunochip data. Before imputation, I tried to merge the 1000G reference panel with my cases in PLINK, and I got this error:

Warning: Multiple positions seen for variant 'rs200357792'.
Warning: Multiple positions seen for variant 'rs201556956'.
Warning: Multiple positions seen for variant 'rs200991502'.
17838 markers loaded from CD_GermanyKielchr2_mod.bim.
7047141 markers to be merged from ref_b37_ph3.bim.
Of these, 7029359 are new, while 17782 are present in the base dataset.
Error: 7932 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  merge-merge.missnp.
  (Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
  alleles probably remain in your data.  If LD between nearby SNPs is high,
  --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.

I removed the variants with multiple positions and the duplicate variants. In addition, I filtered out triallelic SNPs in the VCF file and ran --flip with the “prefix”.missnp file. Then I tried again to merge the cases and the reference panel.
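
A minimal sketch of how multi-position and duplicate variant IDs could be listed from a .bim file, assuming the standard six-column layout (chromosome, variant ID, cM position, bp position, allele 1, allele 2). The function name `find_problem_ids` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def find_problem_ids(bim_lines):
    """Return (multi_position_ids, duplicate_ids) from .bim-formatted lines."""
    positions = defaultdict(set)  # variant ID -> {(chrom, bp), ...}
    counts = defaultdict(int)     # variant ID -> number of rows with that ID
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, var_id, _cm, bp = fields[:4]
        positions[var_id].add((chrom, bp))
        counts[var_id] += 1
    multi_position = {v for v, p in positions.items() if len(p) > 1}
    duplicates = {v for v, n in counts.items() if n > 1}
    return multi_position, duplicates


# Toy rows with made-up IDs and coordinates, for illustration only
example = [
    "2 rsA 0 1000 A G",
    "2 rsA 0 2000 A G",  # same ID at a different position
    "2 rsB 0 3000 C T",
    "2 rsB 0 3000 C T",  # exact duplicate row
    "2 rsC 0 4000 G T",
]
multi, dup = find_problem_ids(example)
to_exclude = sorted(multi | dup)
print(to_exclude)
```

Written one ID per line, a list like `to_exclude` is the kind of file PLINK's --exclude accepts.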

I got the same error again:

Error: 7916 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  mergefiles2-merge.missnp.
  (Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
  alleles probably remain in your data.  If LD between nearby SNPs is high,
  --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.

I think PLINK may not be suitable for my data. Am I right? What is the reason for this error? Thank you in advance.

software error • 4.1k views

updated 6.8 years ago by Kevin Blighe 88k • written 8.2 years ago by fatima ▴ 20


I think that PLINK is suitable - what is the source of your non-1000G data, though?

Perhaps you could try to follow my tutorial, here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format

You may have to remove SNPs from your non-1000G data that are called on the non-coding strand.
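
To illustrate the strand point, here is a minimal sketch that compares one variant's allele pair between the two datasets. It assumes biallelic SNPs coded as A/C/G/T; anything else (indels, missing alleles coded as 0) is simply labelled "other". The function `classify` and its labels are hypothetical names for this example, not part of PLINK.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(base_alleles, ref_alleles):
    """Compare one variant's alleles in the base data against the reference."""
    base, ref = set(base_alleles), set(ref_alleles)
    if len(base) != 2 or len(ref) != 2 or not (base | ref) <= set(COMPLEMENT):
        return "other"      # not a biallelic A/C/G/T SNP in both datasets
    flipped = {COMPLEMENT[a] for a in base}
    if base == flipped:
        return "ambiguous"  # A/T or C/G: strand cannot be told from alleles alone
    if base == ref:
        return "match"      # same alleles on the same strand
    if flipped == ref:
        return "flip"       # complementing the base alleles resolves it
    return "mismatch"       # merging would give 3+ alleles even after flipping


print(classify(("A", "G"), ("A", "G")))
print(classify(("A", "G"), ("T", "C")))
print(classify(("A", "T"), ("A", "T")))
print(classify(("A", "G"), ("A", "C")))
```

Variants labelled "flip" are candidates for a --flip list, while "ambiguous" and "mismatch" variants are the ones you would probably want to remove before merging.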

6.8 years ago by Kevin Blighe 88k