8.2 years ago
Peter vH
Hi there
I have multiple VCF files generated from variant calling on sequenced bacteria (M. tuberculosis). I would like to create a multiple sequence alignment file (as a step towards computing a phylogeny of the samples) by combining the reference genome with the VCFs. Before I put time and effort into creating a script to do this, is there an existing solution? I see that workflows such as SNPhylo compute an alignment with MUSCLE before doing tree construction; I'm trying to avoid that step.
Thanks, Peter
Please check this post. The comment by natasha provides a good solution.
I'm not quite sure how. The tools suggested in those threads, vcf-consensus in one and FastaAlternateReferenceMaker in the other, produce a single FASTA output from a single VCF input and don't deal with the gaps that arise when aligning sequences that have insertions and deletions.
I have not used FastaAlternateReferenceMaker, but I ran vcf-consensus -s <sample_name> once per sample to generate a FASTA file for each sample and then did the alignment. The newer version also uses IUPAC codes, so heterozygous genotypes can be encoded. Gaps are usually ignored in alignment, so they should not matter, but I don't know exactly how vcf-consensus handles indels and rearrangements.
So you'd do iterative vcf-consensus followed by MUSCLE? SNPhylo seems to do something like that. I'll experiment and compare it with the script I've written.
Yes, but that was a chloroplast genome, and the results were good. The advantage was that there were no heterozygous calls, since heteroplasmy was not detected.
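For anyone comparing approaches, here is a minimal sketch of the "combine the reference with the VCFs" idea from the question. It is not the script mentioned above and not what vcf-consensus does internally; it is only an illustration under loud assumptions: a single-contig reference, one sample of interest per VCF, and SNPs only (indels and symbolic alleles are skipped, and only the first allele of the genotype is used). Because SNP substitution never changes sequence length, every output sequence stays in reference coordinates, so no MUSCLE step is needed. The function names (`parse_fasta`, `snps_from_vcf`, `pseudo_alignment`) and the toy data are hypothetical.

```python
def parse_fasta(text):
    """Return (name, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    name = lines[0][1:].split()[0]
    return name, "".join(lines[1:]).upper()


def snps_from_vcf(text, sample):
    """Yield (pos, alt_base) for SNPs where `sample` carries a non-reference allele.

    Assumption: GT is the first FORMAT key, and only the first allele of the
    genotype is considered (reasonable for haploid bacterial calls).
    """
    sample_col = None
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_col = fields.index(sample)
            continue
        pos, ref, alts = int(fields[1]), fields[3], fields[4].split(",")
        genotype = fields[sample_col].split(":")[0]
        allele = genotype.replace("|", "/").split("/")[0]
        if allele in (".", "0"):
            continue  # missing call or reference allele: keep the reference base
        alt = alts[int(allele) - 1]
        # Keep single-base substitutions only; indels would shift coordinates.
        if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
            yield pos, alt


def pseudo_alignment(ref_fasta, vcfs):
    """Map each sample name to the reference sequence with its SNPs applied.

    `vcfs` is a dict of sample name -> VCF text. The reference itself is
    included in the result, and all sequences have identical length.
    """
    ref_name, ref_seq = parse_fasta(ref_fasta)
    aligned = {ref_name: ref_seq}
    for sample, vcf_text in vcfs.items():
        seq = list(ref_seq)
        for pos, alt in snps_from_vcf(vcf_text, sample):
            seq[pos - 1] = alt  # VCF positions are 1-based
        aligned[sample] = "".join(seq)
    return aligned


# Toy example: one SNP (applied) and one insertion (skipped).
REF_FASTA = ">ref\nACGTACGT\n"
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "sampleA"]
VCF_A = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(HEADER),
    "\t".join(["ref", "2", ".", "C", "T", "50", "PASS", ".", "GT", "1/1"]),
    "\t".join(["ref", "5", ".", "A", "AGG", "50", "PASS", ".", "GT", "1/1"]),
])

alignment = pseudo_alignment(REF_FASTA, {"sampleA": VCF_A})
for name, seq in alignment.items():
    print(f">{name}\n{seq}")
```

The obvious limitation is the one raised earlier in the thread: insertions and deletions are ignored rather than turned into gaps, so this only gives a SNP-based pseudo-alignment. Whether that is acceptable for the phylogeny is a judgment call.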