I'm trying to build a database that will link SNPs (via rs-IDs) to genes with their respective Ensembl ID, Entrez/NCBI ID, and HUGO gene name based on their chromosomal position. I need these to do some pathway analysis as suggested by answers to this question.
I've seen Pierre's UCSC solution to Khader Shameer's question. I downloaded the snp132, knownGene, and refFlat tables from UCSC, but I can't figure out how to get back to Ensembl Gene IDs or NCBI gene IDs (needed for KEGG).
I've also seen Andrew Su's suggestion to use BioMart/Martview for a similar problem, but I couldn't figure out how to get from SNPs to Genes using Biomart.
I simply need tables that would allow me to:
Join rs-IDs to genomic position
Join rs-ID to a gene ID (Ensembl or NCBI) based on chromosome and position (+/- some distance).
Join one gene ID to another (Ensembl to NCBI to HUGO gene name, etc).
Bonus - join rs-ID to variation/consequence information (e.g. intron, upstream, synonymous, etc).
select
K.proteinID,
E.*,
S.name,
S.avHet,
S.chrom,
S.chromStart,
S.func,
K.txStart,
K.txEnd,
X.*,
R.*
from snp132 as S
left join knownGene as K on
(S.chrom=K.chrom and not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart))
left join knownToEnsembl as K2G on
K.name=K2G.name
left join ensGtp as E on
K2G.value=E.transcript
left join kgXref as X on
K.name=X.kgId
left join refLink as R on
R.mrnaAcc=X.mRNA
where
S.name in ("rs25","rs100")
;
execute:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -E < query.sql
Try using Galaxy for points 1-3 above. Note that this is not the only way to go about this process, but I point it out for the less command-line-inclined folks. For the more command-line inclined, downloading tab-delimited text files will suffice.
Use Get Data --> UCSC Main to pull in snp132 data as a bed file
Use Get Data to pull in ensembl, refseq, uscs known gene, etc. as bed files.
Use Operate on Genomic Intervals --> Join to join datasets from step 2 with datasets from step 1
Download the HGNC data from here that links MANY identifiers to each other. You can use Galaxy Join, Subtract, and Group --> join two datasets to merge the HGNC data with columns from the output of step 3.
For point 4 in the original post, that one has been answered a few times on this list, but SIFT, annovar, snpEff, Ensembl Variant Annotator, or the UCSC snp132CodingDbSNP table (and others) can all provide some insight for rs numbers. With a little creative genomic region conversions, you can probably come up with a set of tables yourself.
Thanks for the tip. Excuse my ignorance, but how do I pull ensembl, refseq data into Galaxy via the Get Data link? I don't see a link to those sources (http://yfrog.com/keaeyp)
To use BioMart to link genes with SNPs, it's necessary to start with Database: Ensembl Variation. Dataset can be Homo sapiens variation. Filters: General variation filters: Filter by Variation ID. Attributes are Ensembl Gene ID.
I think you may also add Ensembl transcripts and allele information to your databases.