convert VCF to gVCF
2
0
Entering edit mode
5.3 years ago
mateid • 0

Hello everybody,

I'm sorry if this has been answered before, but it's a simple question that I'm sure for someone more experienced would be very simple to address. I've recently received the results from my full genome sequencing in VCF format (SNP and INDEL). I would like to recreate a full genome VCF from these files (no BAM). I have the exact reference fasta used for generating the VCFs, but somehow (even after a lot of searching and experimentation) I'm still unable to get any closer to my goal. So far I've only managed to merge the two VCFs and apply the resulting file to the reference fasta, which gave me a new fasta file.

Is there any way of converting this file (or other that can be obtained from VCF) to gVCF? My goal is to extract all SNPs, not just the variations from reference.

Thank you!

gVCF VCF • 7.0k views
ADD COMMENT
1
Entering edit mode

My goal is to extract all SNPs, not just the variations from reference.

What do you mean by that? How were the SNPs determined in the first place if not by comparing your sequencing results to the reference genome?

ADD REPLY
0
Entering edit mode

The SNPs were indeed determined the way you said, but the VCF files contain only variations from the reference genome. I would like to generate a gVCF, which contains SNPs from the entire genome.

ADD REPLY
0
Entering edit mode

Variants are between your genome and the reference genome. That is what variant means.

ADD REPLY
0
Entering edit mode

Thank you! Could you please elaborate a bit on the "counts"? I understand that in practice not everything that isn't a variant is necessarily equal to the reference genome, but if I were to make this assumption wouldn't it be (at list in theory) possible to recreate the entire genome from just the reference file and the VCFs?

ADD REPLY
4
Entering edit mode
5.3 years ago

You cannot convert a VCF to gVCF. For generating a gVCF you need a BAM file.

The difference between a VCF and a gVCF is that on the gVCF you have the counts for each base that are equal to the reference genome for all the sites where you don't have a variant. You cannot get that without having the BAM file.

Also you don't want a SNP you want all the positions where there was no variant detected.

ADD COMMENT
0
Entering edit mode

ok, so if I have a .vcf and a .bam file, can I convert it to gvcf? A tool I use to generate vcf from bs-seq data doesn't have an option to generate gvcf. But maybe some external tool exists to make it?

ADD REPLY
0
Entering edit mode

for a one sample vcf and one bam, yes in theory you could do this but it would be easier to just improve your bs-seq tool

ADD REPLY
2
Entering edit mode
13 months ago
jena ▴ 320

Your question is not completely clear, but since the most sensible ways to understand it have the same answer, I'm gonna go with that.

I have the exact reference fasta used for generating the VCFs

TLDR: You don't have enough information to do this with just VCFs and reference fasta. You also need information about missing data, one way or another.

It doesn't matter if (a) you have several single-sample VCF files and you want variants not just against reference genome, but also each sample against other samples, or if (b) you want to have a VCF with all sites, regardless if there are variants or not.

In cases like these you want GVCF (or, alternatively, the so-called "all-sites VCF" that includes invariant/monomorphic sites - but this is a much larger file, whereas GVCF collapses invariant blocks into single lines, which you can expand later using a reference).

A GVCF basically contains information about 3 kinds of sites:

  • variant sites (included in your VCF)
  • invariant or monomorphic sites (included in your reference fasta)
  • missing sites (unsequenced or low-quality sites)

So, you have variant sites and information about possible genotypes at the remaining sites (the reference fasta), but you don't know which sites in your VCF are really missing or which sites just match a reference. What you need is to distinguish between invariant and missing sites, so you know where to fill genotypes from the reference and where to keep missing data.

The information about missing sites can be obtained in a few ways:

  • You got a BED mask from your colleague that "masks" or "filters" all unsequenced and low-quality sites.
  • You make such BED mask yourself from BAM files by extracting information about sites with sufficient coverage above some threshold (in general you don't want to include just anything). I've read (see section 2.5.1 here) that you can do this with bamtools.
  • You call variants denovo with a caller that supports emiting either all sites or the GVCF format, rather than just variant sites. This is a no-go for many situations - working with published data will make your results incompatible with results from other studies, for example. This should only be the last resort, even though I often see it as the first recomended option.

AFAIK there is no off-the-shelf tool to do all this for you. You need to combine some tools or write something yourself. I'm currently preparing a script that will use a reference fasta, a mask, and fill the invariant sites into a VCF (in my case I need "all-sites" VCF and not just GVCF though). When I have the script ready, I can share it here.

ADD COMMENT
0
Entering edit mode

Hello, jena - Did you end up writing this script? I was really surprised to find out that nothing like this already exists. The closest I could find was Pierre Lindenbaum 's fixvcfmissinggenotypes utility from jvarkit, but that only works for this purpose if you manually add no-call entries for every non-variant position before running the tool, and it also assumes the sample is diploid.

ADD REPLY
0
Entering edit mode

Hi, yes I did, see here for the scripts I wrote in the last year. You will probably need to adapt it to your needs though, as my needs were quite specific. For start, I don't have BAM files, but HETFA files and masks as numeric fasta.

Do you have BAM files for your samples? If so, you could

  • convert it to a BED mask (using the bamtools, or some other way - I can try later myself, as it sounds useful anyway)
  • convert your reference FASTA to a VCF, using bioawk and my href2vcf.awk script
  • bcftools merge your reference VCF with each of your samples to fill in the reference genotypes
    • EDIT: looking at my notes, I'm realizing why I wrote another script instead of using bcftools merge -0 - the option will also turn missing data at variant sites (present in your VCF) into REF, which is what nobody wants (at least with multisample VCF, which I had along with the HETFAs). It may be possible to write another script to overwrite these sites back, but I had HETFA so I went a different way.
  • mask the merged VCF files with the BED masks
  • finally merge all samples into one file

Let me know if you want to try this and I can try to help.

PS: What ploidy are your samples? How does it look like in a VCF? I've seen a VCF with triploid samples in the past, but years ago - a refresher would help.

ADD REPLY
0
Entering edit mode

see also here for extracting per-nucleotide depth using samtools depth

ADD REPLY
0
Entering edit mode

Thanks for linking them. I'm able to write something myself, but I wanted to avoid re-inventing anything that was already done and tested. I'll take a look once I finish with the current project I'm working on and get back to this project.

ADD REPLY
0
Entering edit mode

that only works for this purpose if you manually add no-call entries for every non-variant position before running the tool

Not sure if I understand this - do you mean you need to fill in the gaps between variants with the missing genotype? I think that's doable with a bit of scripting - you could probably adapt some of the scripts I linked below...

PS: actually, bcftools merge should do this for you if you merge your samples with a VCF converted from your reference FASTA (using bioawk and href2vcf.awk script in my repo).

ADD REPLY
0
Entering edit mode

do you mean you need to fill in the gaps between variants with the missing genotype?

No, that's not what I ultimately want. What I meant was, for this tool jvarkit to actually check the coverages of every uncalled position, I would need to add the missing genotype . to every missing position before passing it as input. Because it only checks the coverages of no-call positions, not unlisted positions.

ADD REPLY
0
Entering edit mode

That's basically what I meant - to fill uncalled (unlisted) positions with missing genotypes. I have done this by converting a reference FASTA into a VCF (with bioawk and href2vcf.awk) and then using bcftools merge to merge my samples with this reference VCF. This way it adds the missing positions to your samples as missing genotypes (./. by default, but I think it can be changed). Then you can run jvarkit to check which of these positions have coverage.

ADD REPLY

Login before adding your answer.

Traffic: 1892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6