Question

How to preserve alleles across plink operations?

0

Entering edit mode

3.3 years ago

kynnjo ▴ 70

I have found that some sequences of plink operations can result in a corruption of allele information.

I came across this problem in the process of merging large numbers (20K+) of single-sample VCFs (each comprising ~1.5M sites). I describe my approach for doing this in more detail below.

The test below, however, is a radical simplification of what I am actually doing, shown here only to illustrate the problem as simply as possible. The idea behind this test is to perform the following sequence of conversions:

VCF1 -> PED -> BED -> VCF2

...and then to compare VCF1 with VCF2. Ideally, VCF1 and VCF2 should be identical; in practice, when I try the above sequence using a tiny (1 locus, 1 sample) test vcf, this is what I get:

### VCF1
#CHROM   POS        ID            REF   ALT   QUAL   FILTER   INFO   FORMAT   test
13       19020095   rs140871821   C     T     .      .        .      GT       0/0

### VCF2
#CHROM   POS        ID            REF   ALT   QUAL   FILTER   INFO   FORMAT   test
13       19020095   rs140871821   C     .     .      .        .      GT       0/0

(In this example, VCF2 differs from VCF1 only at the ALT field, but I have also run into cases where both the REF and ALT fields are different.)

Below are the operations I used to perform the various conversions:

### VCF1 -> PED
plink --recode --a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles --vcf <INPUTVCF> --out <PEDPREFIX>

### PED  -> BED
plink --make-bed --a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles --map <PEDPREFIX>.map --ped <PEDPREFIX>.ped --out <BEDPREFIX>

### BED  -> VCF2
plink --recode vcf --a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles --bfile <BEDPREFIX> --out <OUTPUTVCFPREFIX>

The version of plink I have been using is 1.90b3v.

Note, in particular, that I have added the flags

--a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles

...to all the commands above. Here, I have been following the documentation's recommendation: "When it is important for reference alleles to be correct, you'll usually also want to include --a2-allele and --real-ref-alleles in your command." (The parameters I have given for the --a2-allele flag are also based on the documentation's recommendations. I realize that not all the plink commands I am using above may require these flags; the fact that I have included them in all these commands can be seen as an act of desperation.)

As I mentioned above, I ran into this problem while attempting to merge a large number of single-sample VCFs.

I found that the only way I could perform this merge in a reasonable time (i.e. hours rather than days or weeks) was to do the following:

convert the single-sample VCFs PED files;
concatenate all the individual PED files into a merged PED file;
convert the merged PED file to a (merged) BED file;
convert the (merged) BED file to a multi-sample VCF.

This procedure indeed takes only a few hours, but unfortunately, the first 9 columngs of the final multi-sample VCF differ from the first 9 columns of the individual single-sample VCFs. (To be clear: the first 9 columns of the individual single-sample VCFs are identical across all the samples, down to the ordering of the rows.)

How can I perform the VCF0 -> PED -> PED -> ... -> VCF transformation in a way that ensures that the first 9 columns of the resulting merged VCFs are identical to the corresponding columns of the input VCFs?

vcf plink • 1.2k views

ADD COMMENT • link updated 3.2 years ago by chrchang523 11k • written 3.3 years ago by kynnjo ▴ 70

score 0 · Answer 1 · 2022-04-22

0

Entering edit mode

3.3 years ago

chrchang523 11k

In this workflow, you'll also need to run --a1-allele to restore lost ALT alleles during VCF -> PED conversion.

(I also recommend updating your plink 1.9 build.)

ADD COMMENT • link 3.3 years ago by chrchang523 11k

0

Entering edit mode

Thank you. I did try that idea already, but plink throws an error to the effect that the --a1-allele and --a2-allele flags are incompatible; i.e. they cannot both be specified in the same command.

ADD REPLY • link 3.2 years ago by kynnjo ▴ 70

0

Entering edit mode

You can specify it in its own command any time after PED conversion, but before the final VCF is exported.

ADD REPLY • link 3.2 years ago by chrchang523 11k