Does anyone know of a tool for converting SNPs in VCF format to amino acid mutations in UniProt proteins?
I know snpEff can do this for Ensembl variants.
For example, for the VCF file with the line:
1 69538 COSM75742 G A . .
snpEff adds the following annotation:
1 69538 COSM75742 G A . . ANN=A|missense_variant|MODERATE|OR4F5|ENSG00000186092|transcript|ENST00000335137.3|protein_coding|1/1|c.448G>A|p.Val150Met|448/918|448/918|150/305||
I am looking for something that would give me the UniProt ID and the protein mutation mapped to the UniProt sequence.
The best tool that I could find for annotating VCF files with UniProt mutations is Oncotator. It explicitly provides "Site-specific protein annotations from UniProt".
Alternatively, you can annotate VCF files with Ensembl mutations, and then map Ensembl to UniProt using pairwise sequence alignments between proteins mapped to the same gene.
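To make the alignment step concrete, here is a minimal sketch of carrying a residue position from the Ensembl protein over to the UniProt protein. It assumes you already have the two gapped rows of a pairwise alignment (from whichever aligner you prefer); the function name `map_position` and the toy sequences are mine, for illustration only.

```python
def map_position(aligned_src, aligned_dst, src_pos):
    """Map a 1-based residue position from the source sequence to the target.

    aligned_src and aligned_dst are the two rows of one pairwise alignment:
    equal length, with '-' marking gaps. Returns the 1-based position in the
    target sequence, or None if the source residue is aligned to a gap or
    src_pos is outside the source sequence.
    """
    if len(aligned_src) != len(aligned_dst):
        raise ValueError("aligned sequences must have equal length")
    src_i = dst_i = 0  # residues consumed so far in each ungapped sequence
    for a, b in zip(aligned_src, aligned_dst):
        if a != "-":
            src_i += 1
        if b != "-":
            dst_i += 1
        if a != "-" and src_i == src_pos:
            return dst_i if b != "-" else None
    return None


# Toy alignment (not real protein sequences).
ensembl_row = "MKT-AVLG"
uniprot_row = "MKTQA-LG"
print(map_position(ensembl_row, uniprot_row, 4))  # residue A in both rows
```

A position that maps to None (or to a different amino acid in the target) is worth flagging rather than silently dropping, since those are exactly the cases where the Ensembl and UniProt sequences disagree.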
In particular, the human file is homo_sapiens_variation.txt.gz: The variants listed are the Ensembl Variation databases' set of 1000 Genomes project (http://www.1000genomes.org/) and Catalogue of Somatic Mutations In Cancer (COSMIC) v71, imported directly from COSMIC and via Ensembl Variation, protein altering variants (SO:0001583). COSMIC v71 variants are the last freely available somatic variants from COSMIC before their licence change; therefore the accuracy of the information provided for a COSMIC variant should be verified with COSMIC. (Text from README file in that directory)
These files should help you map from Ensembl to UniProt for these variants.
Please don't hesitate to contact the UniProt helpdesk in case of questions.
Thank you for your answer! As you point out, Ensembl and consequently UniProt only have access to COSMIC v71. One of the things I am trying to accomplish is to map variants in a more recent version of COSMIC to UniProt.
The Protein dataservices (http://www.ebi.ac.uk/uniprot/api/doc/swagger/#!/coordinates/search) may be able to provide a solution to your problem, though at this stage they will not return the protein sequence mapping when given a single-nucleotide genomic coordinate. If you have the ENSG/ENST/ENSP identifiers, you can get the genomic coordinates for each exon transcribed into the final protein sequence. The coordinate service also returns the protein sequence range within each exon. From there you will be able to calculate the protein sequence location and get the wild-type amino acid.
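The calculation described here could look roughly like the sketch below. It assumes you have already pulled the coding portion of each exon out of the service response into a simple list, ordered 5' to 3' along the transcript; the `CodingSegment` type and `genomic_to_protein` function are hypothetical names, not part of any API. The example coordinates are back-calculated from the snpEff annotation in the question (c.448 of 918 at position 69538 in a single-exon transcript), not looked up.

```python
from typing import List, NamedTuple, Optional, Tuple


class CodingSegment(NamedTuple):
    start: int  # 1-based genomic start, inclusive (start <= end)
    end: int    # 1-based genomic end, inclusive


def genomic_to_protein(
    segments: List[CodingSegment], strand: str, pos: int
) -> Optional[Tuple[int, int]]:
    """Return (1-based residue number, 0-based offset within the codon).

    segments must be listed in transcript order (5' to 3'), so for a
    reverse-strand gene the segment with the highest coordinates comes first.
    Returns None when pos is not inside any coding segment.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    cds_offset = 0  # coding bases contributed by the preceding segments
    for seg in segments:
        if seg.start <= pos <= seg.end:
            within = pos - seg.start if strand == "+" else seg.end - pos
            cds_index = cds_offset + within  # 0-based index into the CDS
            return cds_index // 3 + 1, cds_index % 3
        cds_offset += seg.end - seg.start + 1
    return None


# Assumed coding range for the single-exon transcript in the question:
# 69538 - (448 - 1) = 69091, and 69091 + 918 - 1 = 70008.
or4f5 = [CodingSegment(69091, 70008)]
print(genomic_to_protein(or4f5, "+", 69538))
```

With the residue number in hand, the wild-type amino acid is simply the character at that index of the UniProt sequence, which also gives you a check against the reference amino acid reported by your annotator.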
If the COSMIC variant existed in v71 of COSMIC, you can get all the annotation information UniProtKB holds concerning the variant using the variation dataservice, taking the UniProt accession as your starting point.
Both the coordinate and variation dataservices will return data for reviewed canonical sequences, isoforms and unreviewed TrEMBL entries.
I am hesitant to use VEP (or other web services) because to me they are black boxes (I am not a Perl expert) and they do not scale to millions of mutations. As far as I understand, VEP relies on the Ensembl Core and Variation databases. However, those databases map to UniProt through gene identifiers (ENSG) rather than transcript or protein identifiers (ENST / ENSP) and, therefore, carry no sequence information.
The reason I called VEP a black box is that it isn't clear where the data comes from (this is my gripe with BioMart as well). It is easy to tick the UniProt checkbox and assume that it does what you want it to do. But then, several steps down your pipeline, you realise that 30% of your mutations don't match the UniProt sequence that they are supposed to mutate (been there, done that).
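For what it's worth, this kind of mismatch is cheap to catch early. Below is a minimal sketch of the check I mean: parse a simple missense annotation such as the `p.Val150Met` from the snpEff output and confirm that the reference residue is actually present at that position in the UniProt sequence. The function names are made up for illustration, only plain three-letter substitutions are handled, and the sequence in the example is a toy string rather than a real UniProt entry.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches simple substitutions only, e.g. p.Val150Met.
HGVS_MISSENSE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_missense(hgvs_p):
    """Return (ref, position, alt) with one-letter residues and a 1-based position."""
    match = HGVS_MISSENSE.match(hgvs_p)
    if match is None:
        raise ValueError(f"not a simple missense annotation: {hgvs_p!r}")
    ref, pos, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError(f"non-standard residue in {hgvs_p!r}")
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


def matches_sequence(hgvs_p, sequence):
    """True if the reference residue of hgvs_p is found at that position."""
    ref, pos, _alt = parse_missense(hgvs_p)
    return 0 < pos <= len(sequence) and sequence[pos - 1] == ref


toy_sequence = "MKV"  # toy example, not a real protein
print(parse_missense("p.Val150Met"))
print(matches_sequence("p.Val3Met", toy_sequence))
```

Running this over every mapped mutation and counting the failures gives a quick estimate of how much of the mapping can be trusted before anything downstream depends on it.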