Question

Which VCF file should I select to generate a multiple sequence alignment of a gene?

2.2 years ago

anikcropscience ▴ 260

Hello, I have a conceptual question. I want to create FASTA files for a genomic region from a multisample VCF file, and I use FastaAlternateReferenceMaker from GATK to do this. I am dealing with a highly heterozygous species, so I have quite a few heterozygous SNPs in the VCF file. As a result, when I generate the FASTA file for each sample from the VCF file, I get a few letters like K, Y, etc. instead of nucleotide bases.

This means that the SNP is heterozygous at those positions, is that right?

Also, when making fasta files from a multisample VCF, should I use the VCF file filtered with MAF, genotyping call rate, and other filtering criteria? Or should I use an unfiltered VCF file for such purposes?

Thank you.

VCF GATK sequence alignment • 1.3k views

ADD COMMENT • link 2.2 years ago by anikcropscience ▴ 260

score 1 · Answer 1 · 2022-09-09

2.2 years ago

cmdcolin ★ 4.0k

I think you are correct in your assessment. The tool documentation says:

"--use-iupac-sample null If specified, heterozygous SNP sites will be output using IUPAC ambiguity codes given the genotypes for this sample"

https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker
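To illustrate what those letters encode, here is a minimal Python sketch (not part of GATK) that lists the two-base IUPAC ambiguity codes found in a sequence together with the alleles they stand for. The function name `heterozygous_sites` and the example sequence are hypothetical, and the sketch assumes the ambiguity codes in the FASTA come only from biallelic heterozygous SNPs.

```python
# Standard IUPAC ambiguity codes that stand for exactly two bases.
# Assumption: in the generated FASTA these mark biallelic heterozygous SNPs.
IUPAC_TWO_BASE = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
}


def heterozygous_sites(seq):
    """Return (0-based position, code, alleles) for each two-base IUPAC code."""
    return [
        (pos, base, IUPAC_TWO_BASE[base])
        for pos, base in enumerate(seq.upper())
        if base in IUPAC_TWO_BASE
    ]


# Hypothetical sequence with a K (G/T) and a Y (C/T) site.
print(heterozygous_sites("ACGKTTYA"))
# → [(3, 'K', ('G', 'T')), (6, 'Y', ('C', 'T'))]
```

So a K in a sample's sequence would correspond to a G/T genotype at that position, and a Y to a C/T genotype.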

ADD COMMENT • link 2.2 years ago by cmdcolin ★ 4.0k

Note that bcftools consensus has similar features too.

ADD REPLY • link 2.2 years ago by cmdcolin ★ 4.0k

OK, that makes sense. But then how do you deal with such heterozygous SNP sites if you want to translate the sequences into protein or do some kind of sequence entropy analysis? Do you have any suggestions?

Thank you.

ADD REPLY • link 2.2 years ago by anikcropscience ▴ 260

This is a good question. I don't really have a particular answer other than to keep working with the consensus tools you are trying, or to code your own tools :) To me, getting accurate non-reference gene structure predictions still needs work! Most workflows just use "variant effect prediction".
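If you do end up coding your own tool for the entropy part, one possible approach (an assumption for illustration, not an established method) is to split each heterozygous sample's contribution evenly between its two alleles when counting bases in an alignment column. A rough Python sketch, where the function name `column_entropy` and the example alignment are hypothetical:

```python
import math
from collections import Counter

# Plain bases plus the two-base IUPAC ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.

    Assumption: a two-base ambiguity code adds half a count to each allele;
    gaps, N, and any other symbols are ignored.
    """
    counts = Counter()
    for code in column:
        bases = IUPAC.get(code.upper())
        if bases is None:
            continue  # skip gaps, N, and unsupported codes
        for base in bases:
            counts[base] += 1 / len(bases)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical aligned sequences from three samples; the last column is K, G, T.
aligned = ["ACGK", "ACGG", "ACGT"]
entropies = [column_entropy(col) for col in zip(*aligned)]
```

This ignores phasing, so it only fits per-site statistics. For translation into protein you would still need phased haplotypes, or you would have to pick one allele per site.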

ADD REPLY • link 2.2 years ago by cmdcolin ★ 4.0k

Thank you very much for your feedback. That makes sense. I will try to find something.

ADD REPLY • link 2.2 years ago by anikcropscience ▴ 260