Hi. I am looking for reference human genome FASTA files (preferably hg38) in which the base at every SNP locus is the ancestral allele rather than the "reference" allele. Any insight into how reference alleles were assigned in the first place would also be appreciated.
My other option is to edit the SNP loci in the reference genome to the ancestral alleles fetched from the Ensembl variation database, but I thought I would check whether the FASTA files I am looking for already exist. I know an hg38 with SNPs coded by the IUPAC ambiguity codes exists. I also came across a "common ancestor" build (presumably between chimps and humans), but this is not exactly what I need.
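For reference, here is a minimal sketch of that fallback in Python, assuming the ancestral alleles have already been fetched into a dictionary of 1-based positions for one chromosome. The function name `apply_ancestral_alleles` and the dictionary layout are hypothetical, not part of any Ensembl tool, and only single-base substitutions are handled.

```python
def apply_ancestral_alleles(sequence, ancestral_by_pos):
    """Return a copy of `sequence` with SNP loci set to the ancestral allele.

    sequence         -- the bases of one chromosome as a string
    ancestral_by_pos -- hypothetical dict: 1-based position -> ancestral allele
    """
    bases = list(sequence)
    for pos, allele in ancestral_by_pos.items():
        # Skip anything that is not an unambiguous single base
        # (indels, missing calls such as "." or "N").
        if len(allele) != 1 or allele.upper() not in "ACGT":
            continue
        idx = pos - 1  # convert the 1-based coordinate to a 0-based index
        if not 0 <= idx < len(bases):
            raise IndexError(f"position {pos} is outside the sequence")
        # Keep the soft-masking (lower case) of the original base.
        bases[idx] = allele.lower() if bases[idx].islower() else allele.upper()
    return "".join(bases)


if __name__ == "__main__":
    print(apply_ancestral_alleles("ACGTacgt", {2: "T", 5: "G", 8: "N"}))
```

Loci with no unambiguous ancestral call are left as the reference base here; whether to mask them with N instead is a design choice that depends on the downstream analysis.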
Thanks!
Thanks, Emily!
This is roughly what I thought. I just wasn't sure if any work had gone into individual reference allele re-assignment based on SNP data in more recent builds such as hg38.
It has been improved in GRCh38, but it's still not perfect. At various points in GRCh38 you will see very small contigs (the bits of the BACs that were included in the genome) in the middle of a larger contig. This is where the GRC used 1000 Genomes data to identify that the reference allele was rare or private, so they replaced part of the old contig with a new one carrying the more common allele. At these loci the reference allele is flipped compared to GRCh37. This was all done manually and was never completed, so there are still many loci where the reference allele is rare or private.
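If you want to find the remaining loci yourself, a rough sketch in Python might look like this, assuming you already have per-allele population frequencies (for example from 1000 Genomes) for each SNP. The function name `is_reference_rare`, the frequency dictionary and the 5% cutoff are illustrative assumptions, not anything the GRC used.

```python
def is_reference_rare(ref_allele, allele_freqs, threshold=0.05):
    """Return True if the reference allele is rare or unobserved.

    ref_allele   -- the base in the reference assembly at this locus
    allele_freqs -- hypothetical dict: allele -> population frequency (0..1)
    threshold    -- assumed cutoff below which an allele counts as rare
    """
    # An allele missing from the frequency table was not observed in the
    # population sample, so it is treated as private (frequency 0).
    return allele_freqs.get(ref_allele, 0.0) < threshold


if __name__ == "__main__":
    print(is_reference_rare("A", {"A": 0.01, "G": 0.99}))
    print(is_reference_rare("A", {"A": 0.60, "G": 0.40}))
```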
Sounds good. Thanks for the details. Appreciate it!