I would like to take a VCF file and a reference genome from the 1000 Genomes Project and obtain a FASTA file containing the sequence of each individual in the VCF, with that individual's SNPs substituted into the reference. Is VCFtools able to do this? If not, what tools are available that can?
I have written a Python script that goes through the 84 million SNPs in the file and outputs a FASTA file. I tested it on 10,000 SNPs, and it took several hours to produce output. I then ran it on all 84 million SNPs, and it has been running for several days now. I'm looking for a more efficient way to obtain a FASTA file from a .vcf file.
I am looking to skip indels.
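To make the goal concrete, here is a minimal Python sketch of the substitution step I have in mind. It is only an illustration under several assumptions: the names `is_snp` and `apply_snps` are hypothetical, the reference for a single chromosome fits in memory as a string, GT is the first entry of each sample column (as the VCF specification requires when GT is present), and only one haplotype per individual is written. Each sequence is held in a `bytearray` so that every SNP is a constant-time in-place substitution, and any record whose REF or ALT is longer than one base is skipped as an indel.

```python
def is_snp(ref, alt):
    """True when REF and ALT are both single bases, so indels are skipped."""
    return len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"


def apply_snps(ref_seq, vcf_lines, samples, haplotype=0):
    """Return {sample: sequence} with each sample's SNP alleles placed into ref_seq.

    ref_seq   : reference sequence of ONE chromosome, as a str (assumption)
    vcf_lines : iterable of VCF text lines for that chromosome, header included
    samples   : sample IDs to build, e.g. ["HG00097", "HG00099"]
    haplotype : which allele of the GT field to use (0 = first, 1 = second)
    """
    # One mutable copy of the reference per requested sample.
    seqs = {s: bytearray(ref_seq, "ascii") for s in samples}
    columns = {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Map each column name (including sample IDs) to its index.
            columns = {name: i for i, name in enumerate(fields)}
            continue
        pos, ref, alts = int(fields[1]), fields[3], fields[4].split(",")
        for s in samples:
            gt = fields[columns[s]].split(":")[0].replace("|", "/").split("/")
            # Haploid calls have a single allele; fall back to it.
            allele = gt[min(haplotype, len(gt) - 1)]
            if allele in (".", "0"):
                continue  # missing call or reference allele: nothing to change
            alt = alts[int(allele) - 1]
            if is_snp(ref, alt):
                seqs[s][pos - 1] = ord(alt)  # VCF positions are 1-based
    return {s: seq.decode("ascii") for s, seq in seqs.items()}
```

Each returned sequence could then be written out under a `>HG00097`-style header. With many individuals, holding one copy of a chromosome per sample gets expensive, so in practice this would probably have to be run on a few samples at a time.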
EDIT: VCFtools' vcf-to-tab converts a .vcf file into a tab-delimited file, and there is a script that turns that tab file into a FASTA file: https://code.google.com/archive/p/vcf-tab-to-fasta/
I believe that's what I'm looking for; I'll look into it.
I looked into it, and it works well for obtaining the alternate genome, but I'm looking for the sequence of each individual in the VCF file. For example, the VCF file gives the SNPs for individuals HG00097 and HG00099, and I'd like to get a separate sequence for each of them. Additionally, I'd like to skip indels if possible. So far I've tried vcf-consensus, but it gave the error 'Broken VCF header', and I'm not entirely sure it would output what I need anyway. Is there a program that can do this?