Here is an awk script, add_reference.awk, that I wrote for this task. It adds a new sample called "Reference" that is homozygous 0/0 at every locus.
#!/usr/bin/awk -f
BEGIN {
    FS = "\t"
    OFS = FS
}
# print the header info as-is
/^##/ {
    print
    next
}
# add a sample named "Reference" to the list of samples
/^#CHROM/ {
    print $0"\tReference"
    next
}
# add a homozygous reference genotype at every locus
{
    print $0"\t0/0:0,0,255:100:0:100,0:1,0,0:150"
}
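For gzipped VCF files, I run it like this (the script has to be executable and on your PATH; otherwise replace add_reference.awk with awk -f add_reference.awk):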
zcat my_file.vcf.gz | add_reference.awk | bgzip > my_file_with_reference.vcf.gz
You will probably have to change the appended entry to match the FORMAT column of your VCF. My fields are "GT:PL:DP:SP:AD:GP:GQ", so I used "0/0:0,0,255:100:0:100,0:1,0,0:150" so that the new sample would pass my downstream filters. Adjust it to match whatever fields your VCF has.
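If you would rather not hard-code the entry, one option is to build it from the FORMAT column (column 9) of each record. The sketch below is a variant under my own assumptions, not the script above: it writes 0/0 for GT and the missing value "." for every other field, so unlike a hand-picked entry it may not pass depth or quality filters downstream. The function name add_reference_generic is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append a "Reference" sample whose entry is built from
# the FORMAT column (column 9) of each record. GT becomes 0/0; every other
# field becomes "." (missing). Assumes tab-separated, unphased diploid records.
add_reference_generic() {
    awk '
    BEGIN { FS = "\t"; OFS = FS }
    /^##/     { print; next }                  # meta-information lines as-is
    /^#CHROM/ { print $0, "Reference"; next }  # add the new sample name
    {
        n = split($9, keys, ":")               # FORMAT keys, e.g. GT:PL:DP
        entry = ""
        for (i = 1; i <= n; i++) {
            value = (keys[i] == "GT") ? "0/0" : "."
            entry = entry (i > 1 ? ":" : "") value
        }
        print $0, entry
    }'
}
```

It reads from standard input, so it can stand in for add_reference.awk in the same kind of pipeline: zcat my_file.vcf.gz | add_reference_generic | bgzip > my_file_with_reference.vcf.gz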
The script assumes that all loci are unphased diploid loci and that the reference sample is not heterozygous at any site. Caveat: treating the reference sample as homozygous everywhere may introduce inaccuracies in some downstream analyses, depending on how the result is used.
Please define "the retained SNP of my reference".
After SNP calling and filtering, I kept 2,000 SNPs. These are the retained SNPs, and I also want to keep them for the reference genome.
What should the genotype be for the new sample? Say there are samples with genotypes 1/0, 1/1 and ./. at a site.
For the new sample, all genotypes should be 0/0
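To make that concrete, here is a small made-up record (not from any real data set) with three samples genotyped 1/0, 1/1 and ./., run through a one-line version of the same idea. It assumes a GT-only FORMAT column. The appended entry does not depend on what the other samples carry.

```shell
# Toy VCF body line with FORMAT=GT and three samples (hypothetical data).
record=$(printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t1/0\t1/1\t./.')
# Append the Reference genotype; the existing samples are left untouched.
result=$(printf '%s\n' "$record" | awk 'BEGIN { FS = OFS = "\t" } { print $0, "0/0" }')
printf '%s\n' "$result"
```

The printed line is the original record with one extra tab-separated column, 0/0, at the end.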