I have found that some sequences of plink operations can result in a corruption of allele information.
I came across this problem in the process of merging large numbers (20K+) of single-sample VCFs (each comprising ~1.5M sites). I describe my approach for doing this in more detail below.
The test below, however, is a radical simplification of what I am actually doing, shown here only to illustrate the problem as simply as possible. The idea behind this test is to perform the following sequence of conversions:
VCF1 -> PED -> BED -> VCF2
...and then to compare VCF1 with VCF2. Ideally, VCF1 and VCF2 should be identical; in practice, when I try the above sequence using a tiny (1 locus, 1 sample) test vcf, this is what I get:
### VCF1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test
13 19020095 rs140871821 C T . . . GT 0/0
### VCF2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test
13 19020095 rs140871821 C . . . . GT 0/0
(In this example, VCF2 differs from VCF1 only at the ALT field, but I have also run into cases where both the REF and ALT fields are different.)
Below are the operations I used to perform the various conversions:
### VCF1 -> PED
plink --recode --a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles --vcf <INPUTVCF> --out <PEDPREFIX>
### PED -> BED
plink --make-bed --a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles --map <PEDPREFIX>.map --ped <PEDPREFIX>.ped --out <BEDPREFIX>
### BED -> VCF2
plink --recode vcf --a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles --bfile <BEDPREFIX> --out <OUTPUTVCFPREFIX>
The version of plink I have been using is 1.90b3v.
Note, in particular, that I have added the flags
--a2-allele <INPUTVCF> 4 3 '#' --real-ref-alleles
...to all the commands above. Here, I have been following the documentation's recommendation: "When it is important for reference alleles to be correct, you'll usually also want to include --a2-allele and --real-ref-alleles in your command." (The parameters I have given for the --a2-allele
flag are also based on the documentation's recommendations. I realize that not all the plink commands I am using above may require these flags; the fact that I have included them in all these commands can be seen as an act of desperation.)
As I mentioned above, I ran into this problem while attempting to merge a large number of single-sample VCFs.
I found that the only way I could perform this merge in a reasonable time (i.e. hours rather than days or weeks) was to do the following:
- convert the single-sample VCFs PED files;
- concatenate all the individual PED files into a merged PED file;
- convert the merged PED file to a (merged) BED file;
- convert the (merged) BED file to a multi-sample VCF.
This procedure indeed takes only a few hours, but unfortunately, the first 9 columngs of the final multi-sample VCF differ from the first 9 columns of the individual single-sample VCFs. (To be clear: the first 9 columns of the individual single-sample VCFs are identical across all the samples, down to the ordering of the rows.)
How can I perform the VCF0 -> PED -> PED -> ... -> VCF transformation in a way that ensures that the first 9 columns of the resulting merged VCFs are identical to the corresponding columns of the input VCFs?
Thank you. I did try that idea already, but plink throws an error to the effect that the --a1-allele and --a2-allele flags are incompatible; i.e. they cannot both be specified in the same command.
You can specify it in its own command any time after PED conversion, but before the final VCF is exported.