Dear all,
Currently I have two large VCF files that include calls that are homozygous for the reference allele. For easier analysis, I would like to remove the sites that are homozygous-ref in both VCF files (in VCF-speak, "0/0" for both samples at the same locus). I can't be the first to want to do this, but I wasn't able to find anything of use.
INPUT
sample1.vcf
20 60055 . A . 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/0:25:25:99:PASS
20 60056 . G A 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/1:12,13:25:99:PASS,PASS
20 60057 . T . 35 PASS DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL 0/0:26:26:99:PASS
20 60058 . C T 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 1/1:25:25:99:PASS
sample2.vcf
20 60055 . A . 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/0:25:25:99:PASS
20 60056 . G . 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/0:25:25:99:PASS
20 60057 . T . 35 PASS DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL 0/0:26:26:99:PASS
20 60058 . C T 35 PASS DP=26;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 1/1:26:26:99:PASS
to:
OUTPUT
sample1.vcf
20 60056 . G A 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/1:12,13:25:99:PASS,PASS
20 60058 . C T 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 1/1:25:25:99:PASS
sample2.vcf
20 60056 . G . 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/0:25:25:99:PASS
20 60058 . C T 35 PASS DP=26;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 1/1:26:26:99:PASS
Some notes:
- The files are around 260 GB in size
- I would like to keep the files separate (not joined together)
- There is about 128 GB of memory available
- The files are sorted on position (fortunately)
Does anyone have experience with something like this, or could point me in a useful direction? Many thanks.
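Because both files are sorted on position, one possible approach is to walk them in lockstep and hold only one record per file in memory, so file size and RAM stop being a concern. Below is a minimal Python sketch of that idea, not a tested production tool. It assumes each input covers a single chromosome, has at most one record per position, is single-sample (genotype in the 10th column), and has GT as the first FORMAT subfield. The names `records` and `drop_shared_hom_ref` are made up for this illustration.

```python
from typing import IO, Iterable, Iterator, Tuple

# Genotypes treated as homozygous-reference (unphased and phased).
HOM_REF = {"0/0", "0|0"}

Record = Tuple[int, bool, str]  # (position, is_hom_ref, raw line)


def records(lines: Iterable[str], out: IO[str]) -> Iterator[Record]:
    """Copy header lines straight to `out`; yield one tuple per data line."""
    for line in lines:
        if line.startswith("#"):
            out.write(line)
            continue
        if not line.strip():
            continue
        # split() accepts tabs (real VCF) as well as the spaces shown above.
        fields = line.split()
        genotype = fields[9].split(":")[0]  # assumes GT is the first subfield
        yield int(fields[1]), genotype in HOM_REF, line


def drop_shared_hom_ref(in1: Iterable[str], in2: Iterable[str],
                        out1: IO[str], out2: IO[str]) -> None:
    """Write each input to its own output, skipping sites 0/0 in both."""
    it1, it2 = records(in1, out1), records(in2, out2)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None or r2 is not None:
        if r2 is None or (r1 is not None and r1[0] < r2[0]):
            # Position only present in file 1: keep it.
            out1.write(r1[2])
            r1 = next(it1, None)
        elif r1 is None or r2[0] < r1[0]:
            # Position only present in file 2: keep it.
            out2.write(r2[2])
            r2 = next(it2, None)
        else:
            # Same position in both files: drop only if both are hom-ref.
            if not (r1[1] and r2[1]):
                out1.write(r1[2])
                out2.write(r2[2])
            r1, r2 = next(it1, None), next(it2, None)
```

To run it on real data, open the two inputs and two outputs as text files and pass the four handles to `drop_shared_hom_ref`. Compressed input would need `gzip.open(path, "rt")` instead of `open`. Handling several chromosomes in one file would additionally require a known chromosome order, which this sketch deliberately leaves out.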
The reasons for not wanting to merge are that 1) bcftools merge seems to output a file that tabix cannot index anymore (maybe because of the size?), and 2) the script for the next analysis step is already written and takes single-sample VCFs as input.
I ended up splitting the files by chromosome with tabix (this turned out to be necessary anyway) and doing a temporary merge using GNU join.
Your answer does the job though (and is the most logical approach in almost all cases), so I have accepted it as the answer. Thanks!