Question

Mapping SNPs to Genes (and Pathways, GO, etc.)

3

13.9 years ago

Stephen 2.8k

I'm trying to build a database that will link SNPs (via rs-IDs) to genes with their respective Ensembl ID, Entrez/NCBI ID, and HUGO gene name based on their chromosomal position. I need these to do some pathway analysis as suggested by answers to this question.

I've seen Pierre's UCSC solution to Khader Shameer's question. I downloaded the snp132, knownGene, and refFlat tables from UCSC, but I can't figure out how to get back to Ensembl Gene IDs or NCBI gene IDs (needed for KEGG).

I've also seen Andrew Su's suggestion to use BioMart/MartView for a similar problem, but I couldn't figure out how to get from SNPs to genes using BioMart.

I simply need tables that would allow me to:

1. Join rs-IDs to genomic position.
2. Join rs-ID to a gene ID (Ensembl or NCBI) based on chromosome and position (+/- some distance).
3. Join one gene ID to another (Ensembl to NCBI to HUGO gene name, etc.).
4. Bonus: join rs-ID to variation/consequence information (e.g. intron, upstream, synonymous, etc.).

pathway biomart kegg ensembl ncbi • 7.6k views

ADD COMMENT • link updated 9.0 years ago by Biostar 20 • written 13.9 years ago by Stephen 2.8k

0

I think you may also add Ensembl transcripts and allele information to your databases.

ADD REPLY • link 13.9 years ago by Khader Shameer 18k

score 5 · Answer 1 · 2011-06-30

I've updated the query I sent to Khader:

select
    K.proteinID,
    E.*,
    S.name,
    S.avHet,
    S.chrom,
    S.chromStart,
    S.func,
    K.txStart,
    K.txEnd,
    X.*,
    R.*
from snp132 as S
-- knownGene transcripts on the same chromosome within 60 kb of the SNP
left join knownGene as K on
    (S.chrom=K.chrom and not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart))
-- UCSC known gene ID to Ensembl transcript, then to Ensembl gene and protein
left join knownToEnsembl as K2G on
    K.name=K2G.name
left join ensGtp as E on
    K2G.value=E.transcript
-- gene symbol, RefSeq accessions and Entrez Gene ID (locusLinkId)
left join kgXref as X on
    K.name=X.kgId
left join refLink as R on
    R.mrnaAcc=X.mRNA
where
    S.name in ("rs25","rs100")
;

execute:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -E < query.sql

result:

  *************************** 1. row ***************************
  proteinID: B2RTV4
       gene: ENSG00000169856
 transcript: ENST00000305901
    protein: ENSP00000302630
       name: rs100
      avHet: 0.5
      chrom: chr15
 chromStart: 53033239
       func: unknown
    txStart: 53049352
      txEnd: 53082209
       kgID: uc002aci.1
       mRNA: NM_004498
       spID: B2RTV4
spDisplayID: B2RTV4_HUMAN
 geneSymbol: ONECUT1
     refseq: NM_004498
    protAcc: NP_004489
description: one cut homeobox 1
       name: ONECUT1
    product: hepatocyte nuclear factor 6
    mrnaAcc: NM_004498
    protAcc: NP_004489
   geneName: 127912
   prodName: 188359
locusLinkId: 3175
     omimId: 604164
*************************** 2. row ***************************
  proteinID: NULL
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr18
 chromStart: 26276338
       func: unknown
    txStart: NULL
      txEnd: NULL
       kgID: NULL
       mRNA: NULL
       spID: NULL
spDisplayID: NULL
 geneSymbol: NULL
     refseq: NULL
    protAcc: NULL
description: NULL
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 3. row ***************************
  proteinID: NULL
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr2
 chromStart: 188538478
       func: unknown
    txStart: NULL
      txEnd: NULL
       kgID: NULL
       mRNA: NULL
       spID: NULL
spDisplayID: NULL
 geneSymbol: NULL
     refseq: NULL
    protAcc: NULL
description: NULL
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 4. row ***************************
  proteinID: 
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr21
 chromStart: 38962576
       func: unknown
    txStart: 38979765
      txEnd: 38982665
       kgID: uc002ywn.1
       mRNA: AL109792
       spID: 
spDisplayID: 
 geneSymbol: AL109792
     refseq: 
    protAcc: 
description: Homo sapiens cDNA FLJ41886 fis, clone OCBBF2023598.
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 5. row ***************************
  proteinID: P48051
       gene: ENSG00000157542
 transcript: ENST00000400482
    protein: ENSP00000383330
       name: rs100
      avHet: 0.5
      chrom: chr21
 chromStart: 38962576
       func: unknown
    txStart: 38996785
      txEnd: 39285557
       kgID: uc011aej.1
       mRNA: AK313997
       spID: P48051
spDisplayID: IRK6_HUMAN
 geneSymbol: KCNJ6
     refseq: NM_002240
    protAcc: NP_002231
description: potassium inwardly-rectifying channel J6
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 6. row ***************************
  proteinID: P48051
       gene: ENSG00000157542
 transcript: ENST00000400482
    protein: ENSP00000383330
       name: rs100
      avHet: 0.5
      chrom: chr21
 chromStart: 38962576
       func: unknown
    txStart: 38996785
      txEnd: 39288696
       kgID: uc002ywo.2
       mRNA: NM_002240
       spID: P48051
spDisplayID: IRK6_HUMAN
 geneSymbol: KCNJ6
     refseq: NM_002240
    protAcc: NP_002231
description: potassium inwardly-rectifying channel J6
       name: KCNJ6
    product: G protein-activated inward rectifier potassium
    mrnaAcc: NM_002240
    protAcc: NP_002231
   geneName: 126513
   prodName: 227808
locusLinkId: 3763
     omimId: 600877
*************************** 7. row ***************************
  proteinID: NULL
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr3
 chromStart: 137316782
       func: unknown
    txStart: NULL
      txEnd: NULL
       kgID: NULL
       mRNA: NULL
       spID: NULL
spDisplayID: NULL
 geneSymbol: NULL
     refseq: NULL
    protAcc: NULL
description: NULL
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 8. row ***************************
  proteinID: B7WPH7
       gene: ENSG00000153130
 transcript: ENST00000394205
    protein: ENSP00000377755
       name: rs100
      avHet: 0.5
      chrom: chr4
 chromStart: 141137111
       func: unknown
    txStart: 141178439
      txEnd: 141303710
       kgID: uc003iib.2
       mRNA: NM_032547
       spID: B7WPH7
spDisplayID: B7WPH7_HUMAN
 geneSymbol: SCOC
     refseq: NM_032547
    protAcc: NP_115936
description: short coiled-coil protein isoform 4
       name: SCOC
    product: short coiled-coil protein isoform 4
    mrnaAcc: NM_032547
    protAcc: NP_115936
   geneName: 55766
   prodName: 333859
locusLinkId: 60592
     omimId: 0
*************************** 9. row ***************************
  proteinID: NP_056019
       gene: ENSG00000005108
 transcript: ENST00000262042
    protein: ENSP00000262042
       name: rs25
      avHet: 0.499586
      chrom: chr7
 chromStart: 11584141
       func: intron
    txStart: 11414172
      txEnd: 11871824
       kgID: uc003ssf.3
       mRNA: NM_015204
       spID: B5MC03
spDisplayID: B5MC03_HUMAN
 geneSymbol: THSD7A
     refseq: NM_015204
    protAcc: NP_056019
description: thrombospondin, type I, domain containing 7A
       name: THSD7A
    product: thrombospondin type-1 domain-containing protein
    mrnaAcc: NM_015204
    protAcc: NP_056019
   geneName: 262732
   prodName: 313885
locusLinkId: 221981
     omimId: 612249
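
The join condition in the query is an interval-overlap test with a 60 kb flank on either side. As a minimal sketch (not part of the original answer), the same predicate can be written in Python and checked against row 1 of the result. The function name `within_flank` is hypothetical, and because `chromEnd` is not shown in the output it is assumed here to be `chromStart + 1`.

```python
FLANK = 60000  # same flank, in bases, as the SQL join condition


def within_flank(snp_start, snp_end, tx_start, tx_end, flank=FLANK):
    """Mirror of the SQL test:
    not (txEnd + flank < chromStart or chromEnd + flank < txStart)
    """
    return not (tx_end + flank < snp_start or snp_end + flank < tx_start)


# rs100 on chr15 against ONECUT1 (coordinates from row 1 above).
# ASSUMPTION: chromEnd = chromStart + 1; the real value is not in the output.
hit = within_flank(53033239, 53033240, 53049352, 53082209)

# A made-up transcript more than 60 kb downstream of the same SNP position.
far = within_flank(53033239, 53033240, 53200000, 53250000)

print(hit, far)
```

Because the query uses left joins, SNP mappings with no transcript inside the flank are kept with NULL gene columns, which is what rows 2, 3 and 7 show.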

score 2 · Answer 2 · 2011-06-30

Try using Galaxy for points 1-3 above. Note that this is not the only way to go about this process, but I point it out for the less command-line-inclined folks. For the more command-line inclined, downloading tab-delimited text files will suffice.

1. Use Get Data --> UCSC Main to pull in the snp132 data as a BED file.
2. Use Get Data to pull in Ensembl, RefSeq, UCSC known gene, etc. as BED files.
3. Use Operate on Genomic Intervals --> Join to join the datasets from step 2 with the dataset from step 1.
4. Download the HGNC data from here, which links MANY identifiers to each other. You can use Galaxy Join, Subtract, and Group --> join two datasets to merge the HGNC data with columns from the output of step 3.
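
For the command-line inclined, the same two steps (an interval join, then an identifier join) can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table and column names (`snp`, `gene`, `xref`) are hypothetical, `chromEnd` is assumed to be `chromStart + 1`, and the 60 kb flank is borrowed from Answer 1. The coordinates and identifiers are the ones reported in Answer 1's output.

```python
import sqlite3

FLANK = 60000  # bases on either side of a gene (borrowed from Answer 1)

con = sqlite3.connect(":memory:")
# Hypothetical toy schema standing in for the downloaded tab-delimited files.
con.executescript("""
CREATE TABLE snp  (name TEXT, chrom TEXT, chromStart INTEGER, chromEnd INTEGER);
CREATE TABLE gene (ensembl_id TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE xref (ensembl_id TEXT, entrez_id INTEGER, symbol TEXT);
""")
# ASSUMPTION: chromEnd = chromStart + 1 for every SNP.
con.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", [
    ("rs25",  "chr7",  11584141, 11584142),
    ("rs100", "chr15", 53033239, 53033240),
    ("rs100", "chr21", 38962576, 38962577),
    ("rs100", "chr18", 26276338, 26276339),  # no gene in the toy table nearby
])
con.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    ("ENSG00000005108", "chr7",  11414172, 11871824),
    ("ENSG00000169856", "chr15", 53049352, 53082209),
    ("ENSG00000157542", "chr21", 38996785, 39288696),
])
con.executemany("INSERT INTO xref VALUES (?, ?, ?)", [
    ("ENSG00000005108", 221981, "THSD7A"),
    ("ENSG00000169856", 3175,   "ONECUT1"),
    ("ENSG00000157542", 3763,   "KCNJ6"),
])

# Step 3 (interval join) followed by step 4 (identifier join).
rows = con.execute("""
    SELECT s.name, g.ensembl_id, x.entrez_id, x.symbol
    FROM snp AS s
    JOIN gene AS g
      ON s.chrom = g.chrom
     AND s.chromEnd   >= g.txStart - :flank
     AND s.chromStart <= g.txEnd   + :flank
    JOIN xref AS x ON x.ensembl_id = g.ensembl_id
    ORDER BY s.name, x.symbol
""", {"flank": FLANK}).fetchall()

for row in rows:
    print(row)
```

The inner joins drop SNP positions with no gene in range (the chr18 mapping here); switching to `LEFT JOIN` would keep them with empty gene columns.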

Point 4 in the original post has been answered a few times on this list. SIFT, annovar, snpEff, Ensembl Variant Annotator, or the UCSC snp132CodingDbSNP table (among others) can all provide some insight for rs numbers. With a little creative conversion of genomic regions, you can probably come up with a set of tables yourself.

score 2 · Answer 3 · 2011-07-13

13.8 years ago

Giulietta - Ensembl Helpdesk ★ 1.2k

To link SNPs to genes in BioMart, start with the Ensembl Variation database and choose the Homo sapiens variation dataset. Under Filters, open General variation filters and use Filter by Variation ID to supply the rs-IDs. Under Attributes, select Ensembl Gene ID.
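
The same selection can also be expressed as a BioMart XML query instead of clicking through the web form. The sketch below only builds the XML and does not contact any server. The internal names used (`hsapiens_snp` for the dataset, `snp_filter` for the variation ID filter, `refsnp_id` and `ensembl_gene_stable_id` for the attributes) are assumptions that are not stated in the answer and should be checked against the mart's current configuration; `build_biomart_query` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(rs_ids):
    """Build a BioMart XML query asking for the Ensembl gene ID of each rs-ID.

    ASSUMPTION: dataset, filter and attribute names below are the internal
    names for "Homo sapiens variation", "Filter by Variation ID" and
    "Ensembl Gene ID"; verify them before relying on this.
    """
    query = ET.Element(
        "Query",
        virtualSchemaName="default",
        formatter="TSV",
        header="1",
        uniqueRows="1",
    )
    dataset = ET.SubElement(query, "Dataset", name="hsapiens_snp", interface="default")
    # BioMart filters take a comma-separated list of values.
    ET.SubElement(dataset, "Filter", name="snp_filter", value=",".join(rs_ids))
    for attribute in ("refsnp_id", "ensembl_gene_stable_id"):
        ET.SubElement(dataset, "Attribute", name=attribute)
    return ET.tostring(query, encoding="unicode")


query_xml = build_biomart_query(["rs25", "rs100"])
print(query_xml)
```

The resulting document is what would be submitted to a BioMart web service as its query; the endpoint is not given in the answer, so it is left out here.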
