Question

How to retrieve Gene name from SNP ID using biomaRt

6.6 years ago

johnS ▴ 10

I am trying to get the gene name from a SNP ID. I came across some posts and came up with the code below, but I am getting an error.

Here is my code:

library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
getBM(attributes = c("external_gene_name"),
  filters    = "snp_filter", values = "rs3043732", mart = ensembl)

This is the error

Error in getBM(attributes = c("external_gene_name"), filters = "snp_filter",  : 
 Invalid filters(s): snp_filter 
Please use the function 'listFilters' to get valid filter names

I think "snp_filter" must have changed in the new version, but I cannot find the new filter that serves the same purpose. Also, is "external_gene_name" the gene name that corresponds to the SNP ID?

R biomart gene SNP rsid • 9.4k views

updated 6.6 years ago by Kevin Blighe 89k • written 6.6 years ago by johnS ▴ 10

score 5 · Accepted Answer · 2018-09-07

6.6 years ago

Kevin Blighe 89k

rs3043732 is no longer in dbSNP. Take a look HERE.

To look up rs IDs in biomaRt, you need to do this:

require(biomaRt)

ensembl <- useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")

getBM(attributes=c(
    "refsnp_id", "chr_name", "chrom_start", "chrom_end",
    "allele", "mapweight", "validated", "allele_1", "minor_allele",
    "minor_allele_freq", "minor_allele_count", "clinical_significance",
    "synonym_name", "ensembl_gene_stable_id"),
    filters="snp_filter", values="rs6025",
    mart=ensembl, uniqueRows=TRUE)

  refsnp_id chr_name chrom_start chrom_end allele mapweight
1    rs6025        1   169549811 169549811    C/T         1
                                                                       validated
1 1000Genomes,Cited,ESP,ExAC,Frequency,gnomAD,HapMap,Phenotype_or_Disease,TOPMed
  allele_1 minor_allele minor_allele_freq minor_allele_count
1        C         TRUE        0.00599042                 30
                        clinical_significance synonym_name
1 benign,pathogenic,drug response,risk factor        17284
  ensembl_gene_stable_id
1        ENSG00000198734

You can look up all available attributes with listAttributes(ensembl).
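The listing functions also help explain the error in the question: snp_filter is a filter of the SNP mart, and the error message shows that the gene mart does not accept it. Below is a minimal sketch for searching the filter and attribute tables by keyword. find_by_keyword is a hypothetical helper name (not part of biomaRt), and the lookups are wrapped in interactive() because they need network access to Ensembl.

```r
# Hypothetical helper: keep rows of a listFilters()/listAttributes() table
# whose "name" column matches a keyword (case-insensitive).
find_by_keyword <- function(tbl, keyword) {
  tbl[grepl(keyword, tbl$name, ignore.case = TRUE), , drop = FALSE]
}

if (interactive()) {  # needs the biomaRt package and network access
  gene_mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  snp_mart  <- biomaRt::useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")

  # Compare which SNP-related filters each mart offers
  print(find_by_keyword(biomaRt::listFilters(gene_mart), "snp"))
  print(find_by_keyword(biomaRt::listFilters(snp_mart), "snp"))

  # Gene-related attributes that can be requested from the SNP mart
  print(find_by_keyword(biomaRt::listAttributes(snp_mart), "gene"))
}
```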

Kevin

6.3 years ago by Kevin Blighe 89k

@kevin Blighe Thank you.

I tried it for SNP rs6544713:

getBM(attributes=c( "ensembl_gene_stable_id"), filters="snp_filter", values="rs6544713", mart=ensembl, uniqueRows=TRUE)

But this gives me only the Ensembl gene IDs. If I look up the same SNP at https://biit.cs.ut.ee/gsnpense, I get the gene ABCG8, which is what I am looking for.

Any thoughts on how to get this? I looked at the gprofiler R API, but I don't know how to get this through the API.

6.6 years ago by johnS ▴ 10

Sure thing, bro, just add "associated_gene" to the list of attributes:

getBM(attributes=c("ensembl_gene_stable_id", "associated_gene"), filters="snp_filter", values="rs6544713", mart=ensembl, uniqueRows=TRUE)
  ensembl_gene_stable_id associated_gene
1        ENSG00000143921                
2        ENSG00000143921           ABCG8
3        ENSG00000143921     ABCG5,ABCG8
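
As the output shows, associated_gene can come back empty or list several genes. If what you want is the symbol of the gene whose Ensembl ID is returned, a two-step lookup may be more predictable: take ensembl_gene_stable_id from the SNP mart, then translate it with external_gene_name in the gene mart. This is only a sketch. attach_symbols and snp_to_gene_symbol are hypothetical helper names, and the example call runs only in an interactive session with network access.

```r
# Hypothetical helper: join SNP hits to gene symbols on the Ensembl gene ID.
attach_symbols <- function(snp_hits, symbols) {
  merge(snp_hits, symbols,
        by.x = "ensembl_gene_stable_id", by.y = "ensembl_gene_id",
        all.x = TRUE)  # keep SNPs even when no symbol is found
}

# Hypothetical wrapper: rs IDs -> Ensembl gene IDs -> gene symbols.
snp_to_gene_symbol <- function(rs_ids, snp_mart, gene_mart) {
  snp_hits <- biomaRt::getBM(
    attributes = c("refsnp_id", "ensembl_gene_stable_id"),
    filters = "snp_filter", values = rs_ids, mart = snp_mart)

  gene_ids <- unique(snp_hits$ensembl_gene_stable_id)
  gene_ids <- gene_ids[!is.na(gene_ids) & nzchar(gene_ids)]
  if (length(gene_ids) == 0) {
    # No gene IDs returned: add an empty symbol column and stop here
    snp_hits$external_gene_name <- rep(NA_character_, nrow(snp_hits))
    return(snp_hits)
  }

  symbols <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "external_gene_name"),
    filters = "ensembl_gene_id", values = gene_ids, mart = gene_mart)

  attach_symbols(snp_hits, symbols)
}

if (interactive()) {  # needs the biomaRt package and network access
  snp_mart  <- biomaRt::useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
  gene_mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  print(snp_to_gene_symbol("rs6544713", snp_mart, gene_mart))
}
```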

6.6 years ago by Kevin Blighe 89k

@kevin Blighe Thank you :-) May I know why ENSG00000143921 got both ABCG8 and ABCG5,ABCG8? In gsnpense it's only ABCG8.

6.6 years ago by johnS ▴ 10

Well that's a good question and got me searching that region of the genome.

The canonical transcripts of the genes are in close proximity... there's, like, literally, just a few hundred bases between them if you go to the UCSC Genome Browser. As to why Ensembl / ENCODE has the SNP annotated for both ABCG5 and ABCG8, they likely previously identified a longer isoform of ABCG5 that extends across the ABCG8 genomic sequence, or vice versa. These types of regions are fairly common in the genome.

I tried to find their exact transcript isoform that behaves this way, but couldn't. The best I could do with limited time is save this, which shows just how crazy the region is (your SNP is on the bottom right):

Keep in mind that, given the very small gap that exists between the genes, it may also be an error on ENCODE's part that still has to be rooted out. The gap between these genes is more or less the insert size of standard NGS, so, mapping and correct isolation of transcript isoforms may have proven difficult.

6.6 years ago by Kevin Blighe 89k

I know it's been a long time since you answered this, but I find myself in a similar position (your comments and solution were amazingly helpful, thanks), with a twist: when I use the associated_gene attribute, I get only a few gene names scattered here and there. However, I have manually checked about 20 of the ensembl_gene_ids that I retrieved, and I can find virtually all of them on ensembl.org.

Do you see any issues with my code that may be preventing my query from working properly?

snp = useMart("ENSEMBL_MART_SNP", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")

results<-c() 
for (i in 1:dim(trim_SSNP_W)[1]){
  temp <- getBM(attributes = c('refsnp_id', 'ensembl_gene_stable_id', 'associated_gene'), 
                filters = c('snp_filter'), 
                values = list(trim_SSNP_W[i,1]), 
                mart = snp)  
  results <- rbind(results,temp)
}

4.7 years ago by aroso491 • 0
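
One thing that may help with the loop above: getBM() accepts a vector of values, so all rs IDs can go into a single query instead of one request per row plus a growing rbind(). This would not by itself fill in associated_gene, which, as far as I know, holds genes reported alongside a phenotype association rather than every gene the variant overlaps. A sketch follows. clean_rs_ids and batch_snp_lookup are hypothetical helper names, the small trim_SSNP_W data frame is a stand-in for the real table, and the https:// form of the GRCh37 host is an assumption about what current biomaRt versions expect.

```r
# Hypothetical helper: drop NAs, trim whitespace, and de-duplicate rs IDs.
clean_rs_ids <- function(x) {
  x <- as.character(x[!is.na(x)])
  unique(trimws(x))
}

# Hypothetical helper: one getBM() call for the whole vector of rs IDs.
batch_snp_lookup <- function(rs_ids, mart) {
  biomaRt::getBM(
    attributes = c("refsnp_id", "ensembl_gene_stable_id", "associated_gene"),
    filters = "snp_filter",
    values = rs_ids,
    mart = mart)
}

if (interactive()) {  # needs the biomaRt package and network access
  # Stand-in for the real table; rs IDs are assumed to be in the first column
  trim_SSNP_W <- data.frame(rsid = c("rs6025", "rs6544713"),
                            stringsAsFactors = FALSE)

  snp <- biomaRt::useMart("ENSEMBL_MART_SNP",
                          host = "https://grch37.ensembl.org",
                          dataset = "hsapiens_snp")

  results <- batch_snp_lookup(clean_rs_ids(trim_SSNP_W[[1]]), snp)
  print(results)
}
```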