Question

What is the reference genome version of SNPs in the humsavar variant set (2019_10)?

5.4 years ago

Eric Wang ▴ 50

Hi, the humsavar dataset lists variants by protein position and amino acid change: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/variants/

Is there a way to convert these to genomic coordinates?

Which reference genome (or annotation) version do the amino acid substitutions refer to?

I would prefer genomic coordinates on hg19.

Thx.

HUMSAVAR SNP uniprot variants • 1.3k views

updated 5.3 years ago by Elisabeth Gasteiger ★ 2.4k • written 5.4 years ago by Eric Wang ▴ 50

score 1 · Answer 1 · 2020-01-07

The reference genome for the variants in humsavar.txt is GRCh38 (hg38), and the mapping is updated with every release. To get genomic coordinates for these variants, you can query the Proteins API (https://www.ebi.ac.uk/proteins/api/doc/#/variation) with the sourcetype filter set to ‘uniprot,mixed’; please also read our help pages on how to retrieve large datasets.

The variants are also available as genome annotation tracks (BED and bigBed formats) from our FTP site, under the human datasets: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/ Note that because these tracks represent protein information at the genomic level, each variant is given as the location of the three nucleotides that make up the codon.
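As a rough illustration of the Proteins API route, the sketch below builds a request URL for one accession with the sourcetype filter set. The per-accession path (`/variation/{accession}`) and the helper names `build_variation_url` and `fetch_variation` are assumptions made for this example; check the API documentation page linked above for the exact parameters and the layout of the JSON response.

```python
import json
import urllib.parse
import urllib.request

# Assumed base path of the variation service; verify against the API docs.
BASE = "https://www.ebi.ac.uk/proteins/api/variation"


def build_variation_url(accession, sourcetype="uniprot,mixed"):
    """Return a request URL for one UniProt accession (hypothetical helper)."""
    # safe="," keeps the comma readable instead of encoding it as %2C
    query = urllib.parse.urlencode({"sourcetype": sourcetype}, safe=",")
    return f"{BASE}/{accession}?{query}"


def fetch_variation(accession):
    """Download the JSON document for one accession (needs network access)."""
    request = urllib.request.Request(
        build_variation_url(accession),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_variation("P00751")` would request the variants for the CFB entry one accession at a time; for bulk retrieval the FTP files are likely the better choice.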

We do not currently supply GRCh37 (hg19) coordinates for the humsavar.txt variants; to place them on GRCh37 you will have to use the appropriate GENCODE set or Ensembl’s VEP.
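One possible way to try the VEP route is through the Ensembl REST service for GRCh37, which accepts HGVS notation. The sketch below only builds the request URL. The helper name `vep_hgvs_url` is made up for this example, and it is an assumption that a given protein ID exists in the GRCh37 annotation. A protein-level change can also correspond to more than one nucleotide change, so any result should be checked.

```python
import urllib.parse

# Ensembl REST server for the GRCh37 assembly
GRCH37_REST = "https://grch37.rest.ensembl.org"


def vep_hgvs_url(ensp_id, protein_change):
    """Build a VEP HGVS request URL, e.g. for ENSP... plus p.Arg32Trp (hypothetical helper)."""
    hgvs = f"{ensp_id}:{protein_change}"
    # keep the colon readable; other reserved characters are percent-encoded
    path = urllib.parse.quote(hgvs, safe=":")
    return f"{GRCH37_REST}/vep/human/hgvs/{path}?content-type=application/json"


# Assumption: this ID may not be present in the GRCh37 gene set.
print(vep_hgvs_url("ENSP00000382862", "p.Arg32Trp"))
```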

Also, did you notice the file homo_sapiens_variation.txt.gz in the FTP directory you used, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/variants/ ? It does contain genome coordinates.

CFB P00751  p.Arg32Trp  RCV000017455    missense variant    -   -   -   6p21.33 NC_000006.12:g.31946402C>T  ENSG00000241253 ENST00000399981   ENSP00000382862 ClinVar
CFB P00751  p.Arg32Gly  rs12614 missense variant    -   -   -       CHR_HSCHR6_MHC_QBL_CTG1:g.31936784C>G   ENSG00000241253 ENST00000399981   ENSP00000382862 1000Genomes,ESP,ExAC,TOPMed,gnomAD
CFB P00751  p.Arg32Trp  RCV000293644    missense variant    Benign  Atypical hemolytic uremic syndrome (AHUS)   20301541, 19846853, 19821824, RCV000293644    6p21.33 NC_000006.12:g.31946402C>T  ENSG00000241253 ENST00000399981 ENSP00000382862 ClinVar
CFB P00751  p.Arg32Trp  RCV000324934    missense variant    Benign  Complement component 2 deficiency (C2D) MIM:217000, RCV000324934    6p21.33 NC_000006.12:g.31946402C>T    ENSG00000241253 ENST00000399981 ENSP00000382862 ClinVar
CFB P00751  p.Pro33Leu  rs752699321 missense variant    -   -   -       CHR_HSCHR6_MHC_QBL_CTG1:g.31936788C>T   ENSG00000241253 ENST00000399981   ENSP00000382862 ExAC
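The genomic coordinates in these rows are HGVS strings such as NC_000006.12:g.31946402C>T. A minimal sketch of how they might be pulled out of a line with a regular expression follows. The function names are hypothetical, only simple substitutions are handled, and the RefSeq-to-chromosome mapping assumes the usual NC_0000NN numbering (23 for X, 24 for Y). Rows on alternative contigs such as CHR_HSCHR6_MHC_QBL_CTG1 are parsed but not mapped to a primary chromosome.

```python
import re

# sequence identifier, ":g.", 1-based position, reference base ">" alternate base
HGVS_G = re.compile(
    r"(?P<seq>[A-Za-z0-9_.]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)


def parse_genomic_hgvs(line):
    """Return (sequence_id, position, ref, alt) for the first g. substitution, or None."""
    match = HGVS_G.search(line)
    if match is None:
        return None
    return match["seq"], int(match["pos"]), match["ref"], match["alt"]


def refseq_to_chrom(seq_id):
    """Map e.g. NC_000006.12 to chr6; return None for anything else."""
    match = re.fullmatch(r"NC_0000(\d{2})\.\d+", seq_id)
    if match is None:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"chr{number}"
    return {23: "chrX", 24: "chrY"}.get(number)


row = "CFB P00751  p.Arg32Trp  RCV000017455    missense variant    -   -   -   6p21.33 NC_000006.12:g.31946402C>T  ENSG00000241253"
seq_id, pos, ref, alt = parse_genomic_hgvs(row)
print(refseq_to_chrom(seq_id), pos, ref, alt)
```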

score 0 · Answer 2 · 2019-12-18

The humsavar.txt file contains dbSNP accessions. I checked two of them manually, and both led to entries on GRCh38, the current reference genome (also called hg38), which should be what you are looking for. For genomic coordinates, download an annotation (GTF) file for the genome, e.g. from GENCODE, and filter it for the genes you want; GTF files contain the coordinates. GENCODE provides annotations for both hg19 and hg38, so choose whichever you prefer.
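Filtering a GTF for one gene could look roughly like the sketch below. The function name `gene_records` is hypothetical, and it assumes GENCODE-style attributes of the form gene_name "CFB". Note that this yields feature-level coordinates (gene, exon, CDS), so locating an individual amino acid still requires walking the CDS records of the transcript.

```python
def gene_records(gtf_lines, gene_name, feature="gene"):
    """Yield (chrom, start, end, strand) for GTF records of one gene.

    GTF coordinates are 1-based and inclusive.
    """
    wanted = f'gene_name "{gene_name}"'
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature:
            continue
        if wanted in fields[8]:
            yield fields[0], int(fields[3]), int(fields[4]), fields[6]
```

A typical use would be to open the decompressed annotation file and iterate over `gene_records(handle, "CFB", feature="CDS")`.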