Your question is not completely clear, but since the most sensible ways to read it all lead to the same answer, I'll go with that.
> I have the exact reference fasta used for generating the VCFs
TL;DR: You don't have enough information to do this with just the VCFs and the reference fasta. You also need information about missing data, one way or another.
It doesn't matter whether (a) you have several single-sample VCF files and want variants not just against the reference genome but also each sample against the other samples, or (b) you want a VCF with all sites, regardless of whether they are variant or not.
In cases like these you want a GVCF (or, alternatively, the so-called "all-sites VCF" that also includes invariant/monomorphic sites; that is a much larger file, whereas a GVCF collapses invariant blocks into single lines, which you can expand later using a reference).
A GVCF basically contains information about 3 kinds of sites:
- variant sites (included in your VCF)
- invariant or monomorphic sites (can be filled in from your reference fasta)
- missing sites (unsequenced or low-quality sites)
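To make this concrete, a GATK-style GVCF looks roughly like the excerpt below (the coordinates, sample name and values are made up for illustration):

```
#CHROM  POS  ID  REF  ALT          QUAL  FILTER  INFO     FORMAT    sample1
chr1    100  .   A    <NON_REF>    .     .       END=199  GT:DP:GQ  0/0:31:99
chr1    200  .   G    T,<NON_REF>  50.0  .       .        GT:DP:GQ  0/1:27:50
chr1    201  .   T    <NON_REF>    .     .       END=300  GT:DP:GQ  0/0:0:0
```

The first line is an invariant block (positions 100-199 all match the reference), the second is a variant site, and the third is a block with no usable data (DP and GQ of 0), which is what tells you those sites are effectively missing.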
So you have the variant sites and the information about possible genotypes at the remaining sites (the reference fasta), but you don't know which of the sites absent from your VCF are truly missing and which simply match the reference. What you need is to distinguish between invariant and missing sites, so you know where to fill genotypes in from the reference and where to keep the data as missing.
The information about missing sites can be obtained in a few ways:
- You get a BED mask from your colleague that "masks" or "filters" out all unsequenced and low-quality sites.
- You make such a BED mask yourself from the BAM files by extracting the sites with coverage above some threshold (in general you don't want to include just anything). I've read (see section 2.5.1 here) that you can do this with bamtools; a rough sketch of the same idea is shown after this list.
- You call variants de novo with a caller that can emit either all sites or the GVCF format, rather than just the variant sites. This is a no-go in many situations; when working with published data, for example, re-calling will make your results incompatible with results from other studies. It should be a last resort, even though I often see it recommended as the first option.
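I haven't built such a mask with bamtools myself, so purely to illustrate the coverage-threshold idea from the second bullet, here is a minimal Python/pysam sketch; the file names and the 10x depth cutoff are placeholders:

```python
# Minimal sketch: write a BED mask of "callable" sites (depth >= MIN_DEPTH)
# from a coordinate-sorted, indexed BAM. It loads per-base coverage for a
# whole contig at once, so chunk the loop for large genomes.
import pysam

MIN_DEPTH = 10  # placeholder threshold

with pysam.AlignmentFile("sample.bam", "rb") as bam, open("callable.bed", "w") as bed:
    for contig, length in zip(bam.references, bam.lengths):
        # count_coverage() returns four arrays (A, C, G, T counts) over the region
        a, c, g, t = bam.count_coverage(contig, 0, length)
        start = None  # start of the current callable interval (0-based)
        for pos in range(length):
            depth = a[pos] + c[pos] + g[pos] + t[pos]
            if depth >= MIN_DEPTH and start is None:
                start = pos
            elif depth < MIN_DEPTH and start is not None:
                bed.write(f"{contig}\t{start}\t{pos}\n")  # BED is 0-based, half-open
                start = None
        if start is not None:
            bed.write(f"{contig}\t{start}\t{length}\n")
```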
AFAIK there is no off-the-shelf tool that does all of this for you. You need to combine a few tools or write something yourself. I'm currently preparing a script that takes a reference fasta and a mask and fills the invariant sites into a VCF (in my case I need an "all-sites" VCF rather than a GVCF, though). When the script is ready, I can share it here.
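In the meantime, here is a rough Python/pysam sketch of the logic I have in mind; the file names, the single-sample assumption and the simplified tab-separated output are placeholders (a real script would emit proper VCF records and would need to be much faster):

```python
# Rough sketch: combine a single-sample VCF, the reference fasta and a BED
# "callable" mask into all-sites output. The VCF is assumed bgzipped+indexed,
# the mask bgzipped+tabix-indexed; indel handling is ignored for simplicity.
# All variants are kept in memory and one tabix query is done per base, so
# this is an illustration, not a production script.
import pysam

vcf = pysam.VariantFile("sample.vcf.gz")    # variant sites
ref = pysam.FastaFile("reference.fa")       # invariant genotypes (faidx-indexed fasta)
mask = pysam.TabixFile("callable.bed.gz")   # sequenced-well-enough regions

sample = vcf.header.samples[0]              # assumes a single-sample VCF

def is_callable(chrom, pos0):
    """True if the 0-based position falls inside the callable mask."""
    try:
        return any(True for _ in mask.fetch(chrom, pos0, pos0 + 1))
    except ValueError:                      # contig absent from the mask
        return False

# index the variant records by (chrom, 0-based position)
variants = {(rec.chrom, rec.pos - 1): rec for rec in vcf}

for chrom in ref.references:
    seq = ref.fetch(chrom)
    for pos0, base in enumerate(seq):
        rec = variants.get((chrom, pos0))
        if rec is not None:                 # variant site: take it from the VCF
            gt = "/".join("." if a is None else str(a)
                          for a in rec.samples[sample]["GT"])
            alt = ",".join(rec.alts) if rec.alts else "."
        elif is_callable(chrom, pos0):      # invariant site: fill from the reference
            gt, alt = "0/0", "."
        else:                               # outside the mask: keep it missing
            gt, alt = "./.", "."
        print(chrom, pos0 + 1, base, alt, gt, sep="\t")
```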
What do you mean by that? How were the SNPs determined in the first place if not by comparing your sequencing results to the reference genome?
The SNPs were indeed determined the way you said, but the VCF files contain only the variants relative to the reference genome. I would like to generate a gVCF, which contains records for the entire genome.
Variants are, by definition, differences between your genome and the reference genome. That is what "variant" means.
Thank you! Could you please elaborate a bit on the "counts"? I understand that in practice not everything that isn't a variant is necessarily equal to the reference genome, but if I were to make this assumption, wouldn't it be (at least in theory) possible to recreate the entire genome from just the reference file and the VCFs?