Hello,
I have a conceptual question. I want to create FASTA files for a genomic region from a multisample VCF file. I use the FastaAlternateReferenceMaker from GATK to do this. I am dealing with a highly heterozygous species, so I have quite a few heterozygous SNPs in the VCF file. So, when I generate the FASTA files for each sample from the VCF file, I get a few letters like K, Y, etc. instead of nucleotide bases.
This means that the SNP is heterozygous at those positions, is that right?
Also, when making FASTA files from a multisample VCF, should I use the VCF file filtered with MAF, genotyping call rate, and other filtering criteria? Or should I use an unfiltered VCF file for such purposes?
Thank you.
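For reference, letters such as K and Y are standard IUPAC ambiguity codes, each standing for two possible bases (K = G or T, Y = C or T), which is consistent with a heterozygous genotype at that site. Here is a minimal Python sketch that lists such sites in a sequence. The function name and the example sequence are made up for illustration, and only the two-base codes are handled.

```python
# Standard IUPAC two-base ambiguity codes.
IUPAC_TWO_BASE = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
}


def heterozygous_sites(seq):
    """Return (0-based position, code, alleles) for each two-base code in seq."""
    return [
        (pos, base, IUPAC_TWO_BASE[base])
        for pos, base in enumerate(seq.upper())
        if base in IUPAC_TWO_BASE
    ]


# Hypothetical sequence, not taken from any real sample.
example = "ACGKTTYA"
for pos, code, alleles in heterozygous_sites(example):
    print(pos, code, "/".join(alleles))
```

Running this on a per-sample FASTA sequence gives a quick count of how many sites were written as ambiguity codes, which can be compared against the heterozygous calls for that sample in the VCF.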
Note that bcftools consensus has similar features too.
Ok, that makes sense. But then how do you deal with such heterozygous SNP sites if you want to translate them into protein sequences or do some kind of sequence entropy analysis? Do you have any suggestions?
Thank you.
This is a good question. I don't really have a particular answer other than to keep working with the consensus tools you are trying, or to code your own tools :) To me, getting accurate non-reference gene structure predictions is still in need of work! Most workflows just use "variant effect prediction".
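If you do end up coding your own tool, one possible starting point is the rough Python sketch below. It is only an illustration under stated assumptions, not an established method: it expands each two-base IUPAC code in a codon into every base combination to list the amino acids the codon could encode (standard genetic code), and it computes Shannon entropy for an alignment column by splitting each ambiguity code equally between its two bases. The function names are made up, N and three-base codes are not handled, and because the codes carry no phase information, a codon with two heterozygous sites will include combinations that may not exist on either haplotype.

```python
from collections import Counter
from itertools import product
from math import log2

# Plain bases plus the standard IUPAC two-base ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

# Standard genetic code, codons ordered over T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def possible_amino_acids(codon):
    """Set of amino acids a codon could encode once ambiguity codes are expanded."""
    options = [IUPAC[base] for base in codon.upper()]
    return {CODON_TABLE["".join(combo)] for combo in product(*options)}


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, one symbol per sample.

    Assumption: a two-base code contributes half a count to each of its bases.
    """
    counts = Counter()
    for code in column:
        alleles = IUPAC[code.upper()]
        for allele in alleles:
            counts[allele] += 1 / len(alleles)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

For example, `possible_amino_acids("GAY")` contains only aspartate (a synonymous site), while `possible_amino_acids("AAK")` contains both lysine and asparagine, so the heterozygous site changes the protein in that sample.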
Thank you very much for your feedback. That makes sense. I will try to find something.