4.2 years ago
evoecogen
Hello,
I have a population genomics dataset including hundreds of individuals of species A, mapped to the reference genome of species A. I also have several individuals from a few outgroup species. (This resembles a typical human population dataset with a few chimps, gorillas and orangutans.) Currently my reference allele comes from one population of species A. What tool can I use to determine the ancestral alleles for A and recode the VCF? The goal is to determine and compare the history of certain alleles across all populations of A. Thanks!
I don't think this is a valid operation. A VCF file should show differences from one specific reference genome. You're better off creating separate per-population VCF files. Also, remember that the ALT allele need not be the minor allele; it's just an allele seen in the samples that differs from the reference. For this reason, sub-groups in which the ALT allele is the more common one (>0.5 frequency within that sub-group) are definitely a known observation. Even if they're the same species, populations do not have to share a REF allele as their "normal".
For example, certain alleles in the human genome are seen more commonly in Asian or Norwegian or African populations, but they're still ALT because the reference genome was not constructed with them.
I have definitely seen this done in human popgen papers, except they do not describe the specifics! It should be possible to determine the ancestral allele for A from the outgroups. My problem right now is that the reference comes from a random population of A, at the edge of its distribution, so the populations that are geographically most distant from the reference appear most derived. BTW, I suspect that making individual VCFs per population would be much less accurate (I use GATK, which encourages calling genotypes of all samples together).
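To make the idea concrete, here is a minimal sketch of polarizing one biallelic site by outgroup consensus. It is only an illustration, not a description of what any particular paper or tool does: the function names `infer_ancestral` and `polarize` and the `min_agree` threshold are hypothetical, genotypes are assumed to be plain VCF GT strings such as `0/1` or `1|1`, and a site is called only when all called outgroup alleles agree (simple parsimony, no model of recurrent mutation). Note that the VCF specification reserves an INFO key `AA` for the ancestral allele, so annotating sites may be an alternative to swapping REF and ALT.

```python
def infer_ancestral(outgroup_alleles, min_agree=2):
    """Return the allele shared by all called outgroup alleles, else None.

    outgroup_alleles: one base per outgroup ("." or None = missing call).
    min_agree: hypothetical threshold, minimum number of called outgroups.
    """
    called = [a for a in outgroup_alleles if a not in (".", None)]
    if len(called) < min_agree:
        return None  # too little outgroup information
    if len(set(called)) != 1:
        return None  # outgroups disagree, leave the site unpolarized
    return called[0]


def polarize(ref, alt, genotypes, ancestral):
    """Recode a biallelic site so that 0 = ancestral and 1 = derived.

    Returns (ancestral, derived, genotypes), or None if the ancestral
    allele matches neither REF nor ALT.
    """
    if ancestral == ref:
        return ref, alt, list(genotypes)
    if ancestral == alt:
        swap = {"0": "1", "1": "0"}
        # Flip allele indices; separators ("/", "|") and "." pass through.
        flipped = ["".join(swap.get(ch, ch) for ch in gt) for gt in genotypes]
        return alt, ref, flipped
    return None


# Example: REF=A, ALT=G, and all three outgroups carry G.
anc = infer_ancestral(["G", "G", "G"])
print(polarize("A", "G", ["0/0", "0/1", "1|1", "./."], anc))
```

Sites where the outgroups disagree or are missing stay unpolarized in this sketch; in practice you would probably drop or flag them before computing derived allele frequencies.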
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.
GATK is best suited for human analyses. You seem to be working on a non-model organism. Following GATK's Best Practices is not the best course of action here. Remember, GATK assumes that your reference genome is as stable as human ref genomes. You cannot joint genotype samples with different ref genomes.