I've written a script that extracts VCF IDs from a large VCF, in order to get a subset of the sample IDs (people).
This script copies the header from the original VCF and simply pastes it into the new one. I then extract only the sample IDs that I need for each line in the original VCF.
However, I want to validate that this new VCF is valid, using vcf-validator.
vcf-validator ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf 2>&1 >/dev/null | head
18:10083 .. AN is 5008, should be 124
18:10083 .. AC is 1, should be 0
...
These errors appear to be caused by the INFO field:
grep -A1 -m1 ^#CHROM ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf | cut -f 1-19
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 NA19625 NA19700 NA19701NA19703 NA19704 NA19707 NA19711 NA19712 NA19713
18 10083 esv3641458 A <CN0> 100 PASS AC=1;AF=0.000199681;AN=5008;CIEND=-500,1000;CIPOS=-1000,500;CS=DEL_union;END=63656;NS=2504;SVTYPE=DEL;DP=69995;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0.001;SAS_AF=0;VT=SV;EX_TARGET GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
so then I try to just replace the INFO field AC=1;AF=0.000199681;AN=5008;CIEND=-500,1000;CIPOS=-1000,500;CS=DEL_union;END=63656;NS=2504;SVTYPE=DEL;DP=69995;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0.001;SAS_AF=0;VT=SV;EX_TARGET GT
with just .
but then I get bizarre errors like
column NA19923 at 18:10083 .. FORMAT tag [.] not listed in the header
which I don't know how to fix.
Is there a way that I can easily extract given sample IDs from a VCF?
Indeed, this is the solution