I'm wondering how one might go about reconstituting a whole genome by combining VCF data with the GRCh37 reference. Are there any tools for this? Thank you in advance.
You can use GATK FastaAlternateReferenceMaker https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker
This only works for SNPs and simple indels. There are further limitations and options; see the GATK documentation.
It can't integrate CNVs or SVs into the existing reference genome.
A de novo assembly starting from the raw reads is needed to get a new reference genome with complex variation resolved.
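To make the SNP and simple-indel case concrete, here is a minimal Python sketch of the underlying idea: walk along the reference and swap each variant's REF allele for its ALT allele. This is only an illustration, not how GATK implements it. The function name `apply_variants` and the `(POS, REF, ALT)` tuple format are made up for this example, and it assumes a single contig, one ALT allele per site, non-overlapping variants, and that every variant should be applied (no genotype or heterozygosity handling).

```python
from typing import List, Tuple

# Hypothetical minimal variant record: (1-based POS, REF allele, ALT allele),
# mirroring the POS/REF/ALT columns of a VCF line.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Return the reference with each variant's REF replaced by its ALT.

    Assumes all variants are on one contig and do not overlap; a variant
    that overlaps an earlier one is skipped.
    """
    pieces = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in sorted(variants):
        start = pos - 1  # VCF positions are 1-based
        if start < cursor:
            continue  # overlaps a variant that was already applied
        observed = reference[start:start + len(ref)]
        if observed.upper() != ref.upper():
            raise ValueError(
                f"REF mismatch at position {pos}: VCF has {ref!r}, "
                f"reference has {observed!r}"
            )
        pieces.append(reference[cursor:start])  # unchanged stretch
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])  # remainder after the last variant
    return "".join(pieces)


if __name__ == "__main__":
    toy_reference = "ACGTACGTAC"
    toy_variants = [
        (2, "C", "T"),    # SNP
        (4, "TA", "T"),   # 1-bp deletion
        (8, "T", "TGG"),  # 2-bp insertion
    ]
    print(apply_variants(toy_reference, toy_variants))
```

The REF check is worth keeping in any real workflow: a mismatch usually means the VCF was called against a different reference build than the FASTA you are editing.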
For de novo assembly you need the raw reads in the FASTQ files. De novo assembly is a difficult process and makes the most sense if you have modern long, accurate reads, or if you don't yet have any reference genome for the species you are working on. Otherwise I would just stay with the reference-genome-based approach. See this paper for all the effort that went into the latest "telomere to telomere" human reference genome: https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1.full.pdf
What's your actual aim in doing this? What data do you have?
Trying to learn more about the data I have and what's possible. I have VCF, FASTA, and BAM data from a whole genome sequence.
You want to do whole genome assembly?
Yes, exactly. Thank you