I would like to get a list of amino acid mutations that occurred in the course of evolution from early primates to present-day humans.
I found the homo_sapiens_ancestor_GRCh38_e86.tar.gz file on the Ensembl FTP site (ftp://ftp.ensembl.org/pub/release-86/fasta/ancestral_alleles/), which, as I understand it, is the inferred genome of the primate ancestor. This file contains a FASTA sequence for every chromosome.
I can also download the FASTA sequence of the human reference genome: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/.
My question is, what tool should I use to align the reference and ancestral genomes and get a VCF file with all the SNPs? Sorry if this has been answered a million times already. Most of the information I found was concerning the mapping of fastq files to the reference genome.
Once I have a VCF file with a list of SNPs, I can run TransVar or SnpEff to convert SNPs to amino acid changes.
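For what it's worth, if the ancestral FASTA turns out to be in the same coordinate system as GRCh38 (an assumption I have not verified; the README shipped with the archive should say), no aligner may be needed and a position-by-position comparison could be sketched roughly as below. The function names `read_fasta` and `ancestral_differences` are made up for illustration, and the sketch ignores the confidence level of each ancestral call by upper-casing everything.

```python
def read_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ancestral_differences(chrom, reference, ancestral):
    """Yield (chrom, pos, ancestral_base, human_base) where the two differ.

    ASSUMPTION: both sequences share one coordinate system, so index i in
    one corresponds to index i in the other. Positions are 1-based.
    Anything that is not a plain A/C/G/T in either sequence is skipped.
    """
    if len(reference) != len(ancestral):
        raise ValueError("sequences differ in length; coordinates do not match")
    valid = set("ACGT")
    for i, (ref, anc) in enumerate(zip(reference.upper(), ancestral.upper())):
        if ref in valid and anc in valid and ref != anc:
            yield chrom, i + 1, anc, ref
```

Each yielded tuple could then be written out as a VCF record, with the ancestral base as REF or ALT depending on which direction the downstream annotation tool expects.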
Thanks!
Thanks very much for your input! I was following the methods for CADD, but I guess they had to do the alignments at the genome level in order to be able to analyse non-coding variants. I looked at your Ensembl link, but could not find any pairwise alignments that I could download. A quick Google search led me to TreeFam, which lets you download protein-to-protein mappings between two species (http://www.treefam.org/download#tabview=tab1). I guess I can use that mapping and perform the pairwise amino acid alignments myself?
You don't need pairwise alignments; what you want are the multiple sequence alignments used to build the trees. Ensembl adapted the TreeFam pipeline for its Compara database, so you should be able to get alignments, HMMs and trees from both resources. For Ensembl, it may be easier to use the Perl API.
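Once you have an alignment for a gene family, reading off the amino acid changes between two of its rows is straightforward. Here is a minimal sketch, assuming you have already extracted two gapped rows of equal length from the same alignment (for example the human protein and a primate ortholog, or an inferred ancestral sequence if you have one). The function name `amino_acid_substitutions` is hypothetical and not part of any Ensembl or TreeFam API.

```python
def amino_acid_substitutions(ancestral_aln, human_aln):
    """List substitutions between two rows of the same protein alignment.

    Both rows must be gapped to equal length ('-' marks a gap). Positions
    are numbered along the ungapped human sequence. Columns with a gap or
    an unknown residue ('X') in either row are skipped, so insertions and
    deletions are not reported.
    """
    if len(ancestral_aln) != len(human_aln):
        raise ValueError("alignment rows must have the same length")
    substitutions = []
    human_pos = 0
    for anc, hum in zip(ancestral_aln.upper(), human_aln.upper()):
        # Advance the human coordinate only on columns with a human residue.
        if hum != "-":
            human_pos += 1
        if "-" in (anc, hum) or "X" in (anc, hum):
            continue
        if anc != hum:
            substitutions.append(f"{anc}{human_pos}{hum}")
    return substitutions
```

For example, `amino_acid_substitutions("MKT-LV", "MRTALI")` returns `['K2R', 'V6I']`: the gap column still advances the human position, but is not reported as a substitution.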