Question

From vcf 1000 Genomes plink readable file: is there a way to convert it all (not only one region)?

1

Entering edit mode

10.0 years ago

User 7754 ▴ 270

Hi,

I would like to use the "clump" function for my association analysis results which use 10000 Genomes imputed data (and I have many files), so I guess I need to use the 10000 Genomes genotypes as a reference file, but I am having trouble with this. I have downloaded the vcf files from the MACH website here.

Then I thought I could use Plink 1.9 version that reads vcf and use directly the "clump" command here, but apparently there are issues with the vcf file for allele names (I would like to keep as many variants as possible), so I followed this link to convert from vcf to plink.

But I get a segfault error as soon as I type to this command:

bcftools annotate -Ob -x ID chr21.test.vcf

which says:

[W::vcf_parse] INFO 'NS' is not defined in the header, assuming Type=String
Encountered error, cannot proceed. Please check the error output above.

I have tried to use bcftools reheader -h but it doesn't work, can someone please help?

Thank you!!

vcf plink 1000Genomes • 4.9k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by User 7754 ▴ 270

Ram · Accepted Answer · 2014-12-10

3

Entering edit mode

10.0 years ago

chrchang523 11k

If you don't need to deal with split multi-allelic variants, Plink 1.9's --set-missing-var-ids flag should do a good enough job of assigning names. (You'll need a build from November 25th or later.)

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by chrchang523 11k

0

Entering edit mode

Hi chrchang, thank you for your reply. I would like to keep both SNPs and INDELs in the dataset. I also had seen some discussion on google group about this, and that is where I had found the link to follow to convert correctly from vcf to plink which does not work for me: https://groups.google.com/forum/#!msg/plink2-users/xDYgOnAofwo/GmGFXlE4YCYJ

As you suggested I have tried this:

./plink --vcf MACH_download/chr21.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf.gz --set-missing-var-ids @:#\$1_\$2 --make-bed --out chr21.test

I am trying to convert the INDELs as they are in MACH (i.e. chr:pos:A1_A2), and the output looks good for cases like this:

21:39632105:G_GA

But not for cases where the alleles should be flipped:

It should be like this: 21:34884877:TATTT while it comes out in the bim file like this: 21:34884877:T_TATTTG

Is there a way to fix these cases?

Thank you very much for your help!

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by User 7754 ▴ 270

1

Entering edit mode

This is outside --set-missing-var-ids's current scope, but can be done with the help of awk. One possible workflow:

1. Perform basic format conversion, no renaming yet. (If you include --make-bed in this command, you also need to include --keep-allele-order; otherwise you will probably get unwanted allele swaps.)

plink --vcf MACH_download/... --out chr21.test

2. You now want to replace all '.'s in the second column of the .bim file with [first column]:[fourth column]:[fifth column]_[sixth column]. This can be done as follows:

cat chr21.test.bim | awk '{ if ($2 == ".") print $1" "$1":"$4":"$5"_"$6" "$3" "$4" "$5" "$6; else print; }' > newbim.txt
cp newbim.txt chr21.test.bim

A similar command can be used to make this change directly to the VCF, if necessary.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by chrchang523 11k

0

Entering edit mode

I see. Thank you very much!

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by User 7754 ▴ 270