Mapping SNPs to Genes (and Pathways, GO, etc.)
13.4 years ago
Stephen 2.8k

I'm trying to build a database that will link SNPs (via rs-IDs) to genes with their respective Ensembl ID, Entrez/NCBI ID, and HUGO gene name based on their chromosomal position. I need these to do some pathway analysis as suggested by answers to this question.

I've seen Pierre's UCSC solution to Khader Shameer's question. I downloaded the snp132, knownGene, and refFlat tables from UCSC, but I can't figure out how to get back to Ensembl Gene IDs or NCBI gene IDs (needed for KEGG).

I've also seen Andrew Su's suggestion to use BioMart/MartView for a similar problem, but I couldn't figure out how to get from SNPs to genes using BioMart.

I simply need tables that would allow me to:

  1. Join rs-IDs to genomic position
  2. Join rs-ID to a gene ID (Ensembl or NCBI) based on chromosome and position (+/- some distance).
  3. Join one gene ID to another (Ensembl to NCBI to HUGO gene name, etc).
  4. Bonus - join rs-ID to variation/consequence information (e.g. intron, upstream, synonymous, etc).
pathway biomart kegg ensembl ncbi • 7.3k views

I think you may also add Ensembl transcripts and allele information to your databases.

13.4 years ago

I've updated the query I sent to Khader:

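-- Link SNPs to UCSC known genes within 60 kb, then to Ensembl and RefSeq/Entrez IDs.
select
    K.proteinID,
    E.*,            -- ensGtp: Ensembl gene, transcript, protein
    S.name,
    S.avHet,
    S.chrom,
    S.chromStart,
    S.func,
    K.txStart,
    K.txEnd,
    X.*,            -- kgXref: gene symbol, RefSeq accession, description
    R.*             -- refLink: locusLinkId is the Entrez Gene ID
from snp132 as S
-- same chromosome, and the intervals overlap once padded by 60,000 bases
left join knownGene as K on
    (S.chrom=K.chrom and not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart))
left join knownToEnsembl as K2G on
    K.name=K2G.name
left join ensGtp as E on
    K2G.value=E.transcript
left join kgXref as X on
    K.name=X.kgId
left join refLink as R on
    R.mrnaAcc=X.mRNA
where
    S.name in ("rs25","rs100")
;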

execute:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -E < query.sql

result:

  *************************** 1. row ***************************
  proteinID: B2RTV4
       gene: ENSG00000169856
 transcript: ENST00000305901
    protein: ENSP00000302630
       name: rs100
      avHet: 0.5
      chrom: chr15
 chromStart: 53033239
       func: unknown
    txStart: 53049352
      txEnd: 53082209
       kgID: uc002aci.1
       mRNA: NM_004498
       spID: B2RTV4
spDisplayID: B2RTV4_HUMAN
 geneSymbol: ONECUT1
     refseq: NM_004498
    protAcc: NP_004489
description: one cut homeobox 1
       name: ONECUT1
    product: hepatocyte nuclear factor 6
    mrnaAcc: NM_004498
    protAcc: NP_004489
   geneName: 127912
   prodName: 188359
locusLinkId: 3175
     omimId: 604164
*************************** 2. row ***************************
  proteinID: NULL
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr18
 chromStart: 26276338
       func: unknown
    txStart: NULL
      txEnd: NULL
       kgID: NULL
       mRNA: NULL
       spID: NULL
spDisplayID: NULL
 geneSymbol: NULL
     refseq: NULL
    protAcc: NULL
description: NULL
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 3. row ***************************
  proteinID: NULL
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr2
 chromStart: 188538478
       func: unknown
    txStart: NULL
      txEnd: NULL
       kgID: NULL
       mRNA: NULL
       spID: NULL
spDisplayID: NULL
 geneSymbol: NULL
     refseq: NULL
    protAcc: NULL
description: NULL
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 4. row ***************************
  proteinID: 
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr21
 chromStart: 38962576
       func: unknown
    txStart: 38979765
      txEnd: 38982665
       kgID: uc002ywn.1
       mRNA: AL109792
       spID: 
spDisplayID: 
 geneSymbol: AL109792
     refseq: 
    protAcc: 
description: Homo sapiens cDNA FLJ41886 fis, clone OCBBF2023598.
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 5. row ***************************
  proteinID: P48051
       gene: ENSG00000157542
 transcript: ENST00000400482
    protein: ENSP00000383330
       name: rs100
      avHet: 0.5
      chrom: chr21
 chromStart: 38962576
       func: unknown
    txStart: 38996785
      txEnd: 39285557
       kgID: uc011aej.1
       mRNA: AK313997
       spID: P48051
spDisplayID: IRK6_HUMAN
 geneSymbol: KCNJ6
     refseq: NM_002240
    protAcc: NP_002231
description: potassium inwardly-rectifying channel J6
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 6. row ***************************
  proteinID: P48051
       gene: ENSG00000157542
 transcript: ENST00000400482
    protein: ENSP00000383330
       name: rs100
      avHet: 0.5
      chrom: chr21
 chromStart: 38962576
       func: unknown
    txStart: 38996785
      txEnd: 39288696
       kgID: uc002ywo.2
       mRNA: NM_002240
       spID: P48051
spDisplayID: IRK6_HUMAN
 geneSymbol: KCNJ6
     refseq: NM_002240
    protAcc: NP_002231
description: potassium inwardly-rectifying channel J6
       name: KCNJ6
    product: G protein-activated inward rectifier potassium
    mrnaAcc: NM_002240
    protAcc: NP_002231
   geneName: 126513
   prodName: 227808
locusLinkId: 3763
     omimId: 600877
*************************** 7. row ***************************
  proteinID: NULL
       gene: NULL
 transcript: NULL
    protein: NULL
       name: rs100
      avHet: 0.5
      chrom: chr3
 chromStart: 137316782
       func: unknown
    txStart: NULL
      txEnd: NULL
       kgID: NULL
       mRNA: NULL
       spID: NULL
spDisplayID: NULL
 geneSymbol: NULL
     refseq: NULL
    protAcc: NULL
description: NULL
       name: NULL
    product: NULL
    mrnaAcc: NULL
    protAcc: NULL
   geneName: NULL
   prodName: NULL
locusLinkId: NULL
     omimId: NULL
*************************** 8. row ***************************
  proteinID: B7WPH7
       gene: ENSG00000153130
 transcript: ENST00000394205
    protein: ENSP00000377755
       name: rs100
      avHet: 0.5
      chrom: chr4
 chromStart: 141137111
       func: unknown
    txStart: 141178439
      txEnd: 141303710
       kgID: uc003iib.2
       mRNA: NM_032547
       spID: B7WPH7
spDisplayID: B7WPH7_HUMAN
 geneSymbol: SCOC
     refseq: NM_032547
    protAcc: NP_115936
description: short coiled-coil protein isoform 4
       name: SCOC
    product: short coiled-coil protein isoform 4
    mrnaAcc: NM_032547
    protAcc: NP_115936
   geneName: 55766
   prodName: 333859
locusLinkId: 60592
     omimId: 0
*************************** 9. row ***************************
  proteinID: NP_056019
       gene: ENSG00000005108
 transcript: ENST00000262042
    protein: ENSP00000262042
       name: rs25
      avHet: 0.499586
      chrom: chr7
 chromStart: 11584141
       func: intron
    txStart: 11414172
      txEnd: 11871824
       kgID: uc003ssf.3
       mRNA: NM_015204
       spID: B5MC03
spDisplayID: B5MC03_HUMAN
 geneSymbol: THSD7A
     refseq: NM_015204
    protAcc: NP_056019
description: thrombospondin, type I, domain containing 7A
       name: THSD7A
    product: thrombospondin type-1 domain-containing protein
    mrnaAcc: NM_015204
    protAcc: NP_056019
   geneName: 262732
   prodName: 313885
locusLinkId: 221981
     omimId: 612249
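
The join condition in the query keeps a gene when it lies on the same chromosome and within 60,000 bases of the SNP. Here is a minimal Python sketch of that interval test. The function name `within_window` and its argument names are illustrative and not part of the UCSC schema. The gene coordinates are copied from the result rows above. The SNP end positions are an assumption (start + 1 for a single-base SNP), since `chromEnd` is not shown in the output.

```python
def within_window(snp_start, snp_end, tx_start, tx_end, window=60000):
    """Mirror the SQL test:
    not (txEnd + window < chromStart or chromEnd + window < txStart)
    """
    return not (tx_end + window < snp_start or snp_end + window < tx_start)


# rs25 sits inside THSD7A (chr7), so it matches even with no padding.
rs25_in_gene = within_window(11584141, 11584142, 11414172, 11871824, window=0)

# rs100 on chr15 lies upstream of ONECUT1: it matches only with the 60 kb padding.
rs100_padded = within_window(53033239, 53033240, 53049352, 53082209)
rs100_strict = within_window(53033239, 53033240, 53049352, 53082209, window=0)

print(rs25_in_gene, rs100_padded, rs100_strict)
```

The chromosome equality check is left out here because it is a plain string comparison. Changing `window` corresponds to editing the two `60000` constants in the query.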
13.4 years ago

Try using Galaxy for points 1-3 above. Note that this is not the only way to go about this process, but I point it out for the less command-line-inclined folks. For the more command-line inclined, downloading tab-delimited text files will suffice.

  1. Use Get Data --> UCSC Main to pull in snp132 data as a BED file
  2. Use Get Data to pull in Ensembl, RefSeq, UCSC known gene, etc. as BED files.
  3. Use Operate on Genomic Intervals --> Join to join datasets from step 2 with datasets from step 1
  4. Download the HGNC data from here that links MANY identifiers to each other. You can use Galaxy Join, Subtract, and Group --> join two datasets to merge the HGNC data with columns from the output of step 3.
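
Outside Galaxy, the merge in step 4 amounts to a left join on gene symbol. The sketch below uses plain Python and makes several assumptions: `snp_hits` stands in for the output of step 3, `id_map` stands in for the HGNC download, and the function name `annotate` is hypothetical. The symbols and IDs are copied from the UCSC query result earlier in this thread. One gene is deliberately missing from `id_map` to show what an unmatched symbol looks like.

```python
# SNP-to-gene hits as they might come out of the interval join in step 3.
snp_hits = [("rs25", "THSD7A"), ("rs100", "ONECUT1"), ("rs100", "KCNJ6")]

# Stand-in for the HGNC table: gene symbol -> other identifiers.
# ONECUT1 is left out on purpose to show an unmatched symbol.
id_map = {
    "THSD7A": {"ensembl": "ENSG00000005108", "entrez": "221981"},
    "KCNJ6": {"ensembl": "ENSG00000157542", "entrez": "3763"},
}


def annotate(hits, mapping):
    """Left-join hits to the identifier table; unmatched genes get None."""
    rows = []
    for rs_id, symbol in hits:
        ids = mapping.get(symbol, {})
        rows.append((rs_id, symbol, ids.get("ensembl"), ids.get("entrez")))
    return rows


annotated = annotate(snp_hits, id_map)
for row in annotated:
    print(*row, sep="\t")
```

Keeping unmatched rows, rather than dropping them, makes it easy to see which symbols need a second lookup (for example, by a previous symbol or alias).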

Point 4 in the original post has been answered a few times on this list: SIFT, ANNOVAR, snpEff, Ensembl Variant Annotator, or the UCSC snp132CodingDbSNP table (and others) can all provide some insight for rs numbers. With a few creative genomic region conversions, you can probably come up with a set of tables yourself.


Thanks for the tip. Excuse my ignorance, but how do I pull Ensembl and RefSeq data into Galaxy via the Get Data link? I don't see a link to those sources (http://yfrog.com/keaeyp)


Check this link to a Galaxy educational screencast that is remarkably on-topic for the original post. http://screencast.g2.bx.psu.edu/galaxy/flash/Exons_SNP.html

13.4 years ago

To link SNPs to genes in BioMart, start from the Ensembl Variation database and choose the Homo sapiens variation dataset. Under Filters, open General variation filters and use Filter by Variation ID to supply your rs-IDs. Under Attributes, select Ensembl Gene ID.
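
The same selection can be expressed as a BioMart XML query. The sketch below only builds the query string and URL and does not send anything. It rests on assumptions to check against the live mart: the dataset name `hsapiens_snp`, the filter name `snp_filter`, the attribute names `refsnp_id` and `ensembl_gene_stable_id`, and the `martservice` endpoint. The function name `build_biomart_query` is hypothetical.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr


def build_biomart_query(rs_ids):
    """Return a BioMart XML query asking for rs-ID and Ensembl gene ID."""
    # quoteattr adds the surrounding quotes and escapes special characters.
    id_list = quoteattr(",".join(rs_ids))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">'
        '<Dataset name="hsapiens_snp" interface="default">'  # assumed dataset name
        f'<Filter name="snp_filter" value={id_list}/>'  # assumed filter name
        '<Attribute name="refsnp_id"/>'
        '<Attribute name="ensembl_gene_stable_id"/>'
        "</Dataset>"
        "</Query>"
    )


query_xml = build_biomart_query(["rs25", "rs100"])

# Assumed endpoint; fetch the URL with any HTTP client to get tab-separated rows.
url = "https://www.ensembl.org/biomart/martservice?" + urlencode({"query": query_xml})
print(url)
```

Building the XML separately from the request makes it easy to paste the query into the MartView XML box for a manual check before scripting it.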
