How to add the reference as a new sample to a VCF?
2.1 years ago
沛煒 • 0

Hello,

Does anyone know how to create a VCF file that includes the reference genome as a new sample?

I have a VCF file with 200 samples and 2,000 SNPs.

My SNPs were called against a reference genome, and I want to add the reference's alleles at the retained SNPs as a new sample in my VCF file.

I know that I only have a haploid reference genome, so I would like to write all retained loci from the reference as homozygous in the new VCF file.

Thanks for any suggestions or solutions!

vcf • 1.6k views

Please define "the retained SNP of my reference".

After SNP calling and filtering, I kept 2,000 SNPs. These are the retained SNPs, and I want to keep the same sites for the reference genome.

"I want to add the retained SNP of my reference as a new sample to my vcf file"

What should the genotype be for the new sample? Say one sample is 1/0, another is 1/1, and another is ./.

For the new sample, all genotypes should be 0/0

16 months ago
Colaptes ▴ 100

Here is an awk script, add_reference.awk, that I wrote for this task. It adds a new sample called "Reference" that is homozygous 0/0 at every locus.

#! /usr/bin/awk -f 

BEGIN {
    FS = "\t"
    OFS = FS
}

#print the header info as-is
/^##/ {
    print
    next
}

#add sample named "Reference" to the list of samples
/^#CHROM/ {
    print $0"\tReference"
    next
}

#add homozygous reference allele to every locus.
{
    print $0"\t0/0:0,0,255:100:0:100,0:1,0,0:150"
}

For gzipped VCF files, I make the script executable (chmod +x add_reference.awk) and run it from the same directory like this:

zcat my_file.vcf.gz | ./add_reference.awk | bgzip > my_file_with_reference.vcf.gz

You will probably have to change it to match the format of your VCF. My FORMAT fields are "GT:PL:DP:SP:AD:GP:GQ", so I used the entry "0/0:0,0,255:100:0:100,0:1,0,0:150" so that it would pass my downstream filters. Change the entry to match whatever fields your VCF has.

The script assumes that all loci are unphased and diploid, and that the reference sample is not heterozygous at any site. Caveat: ignoring heterozygous sites in the reference sample may cause inaccuracies in some downstream analyses, depending on what the output is used for.
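
If you would rather not hard-code the sample entry, one possible variation is to build it from each record's FORMAT column (column 9). The sketch below is only an illustration under some assumptions: the function name add_reference_generic is made up here, GT is set to 0/0, and every other FORMAT key is written as "." (missing). Unlike the hard-coded entry above, missing DP or GQ values may not pass depth or quality filters, so check how your downstream tools treat them.

```shell
# Sketch: derive the "Reference" column from each record's FORMAT field.
# add_reference_generic is a hypothetical helper name, not part of any tool.
add_reference_generic() {
    awk 'BEGIN { FS = "\t"; OFS = FS }

    # print the meta-information lines as-is
    /^##/ { print; next }

    # add the new sample name to the column header line
    /^#CHROM/ { print $0, "Reference"; next }

    # data lines: GT becomes 0/0, every other FORMAT key becomes "."
    {
        n = split($9, keys, ":")
        entry = ""
        for (i = 1; i <= n; i++) {
            value = (keys[i] == "GT") ? "0/0" : "."
            sep = (i > 1) ? ":" : ""
            entry = entry sep value
        }
        print $0, entry
    }'
}
```

After defining the function in your shell, it can be used the same way as the script: zcat my_file.vcf.gz | add_reference_generic | bgzip > my_file_with_reference.vcf.gz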
