Definition Of Gene Range In Biomart
13.9 years ago
Emma ▴ 140

Hello all, I am using biomaRt (from Bioconductor) to download SNPs that lie "within" gene regions. I use the ensembl_gene filter to specify the gene of interest, and I expect to get back the SNPs that lie in that gene. For example:

getBM(c("refsnp_id","chr_name","chrom_start","ensembl_gene_stable_id","validated"),
      filters=c("ensembl_gene","chr_name"),
      values=list(c(geneid=entreztoensembl$ensemblgeneid),1), mart=snp)

where entreztoensembl$ensemblgeneid is my column of Ensembl IDs.

The problem is that I can't find the definition they use to decide which SNPs to return in the near-5' and near-3' regions. How far beyond the gene do these near-gene regions extend? Any ideas?

thanks, Emma

biomart snp • 4.7k views

Just a correction: the definition of gene regions doesn't belong to BioMart but to the dataset used; I suppose it is the latest Ensembl release.


In my case the definition should come from dbSNP, but from BioMart I am getting SNPs more than 40kb from the genes that I'm interested in, and dbSNP doesn't associate them with the gene in GeneView. For example, look at rs77507878 in the comment below, where the gene of interest is C6orf10.


Sorry, the SNP I was talking about is rs76596671 NOT rs77507878.

13.9 years ago

You should add the consequence_type_tv attribute to your query, which provides the location category of a SNP relative to a gene. The variation documentation has the full list of possible values.

This gives you the location of the SNP relative to gene features and lets you filter out SNPs you might not be interested in, such as upstream or downstream ones. Upstream and downstream extend 5kb in either direction from the gene.

Here's the query at Ensembl BioMart, or with R biomaRt:

library("biomaRt")
# Connect to the Ensembl variation mart, human SNP dataset
snp <- useMart("snp", dataset="hsapiens_snp")
ensemblids <- c("ENSG00000204296")
# consequence_type_tv reports each SNP's location relative to gene features
out <- getBM(attributes=c("refsnp_id","chr_name","chrom_start",
                          "ensembl_gene_stable_id","validated",
                          "consequence_type_tv"),
             filters="ensembl_gene", values=ensemblids, mart=snp)
head(out)
  refsnp_id chr_name chrom_start ensembl_gene_stable_id               validated
1  rs517922        6    32258836        ENSG00000204296                  hapmap
2 rs3117133        6    32313653        ENSG00000204296 cluster,freq,1000Genome
3 rs6621681        6    32292217        ENSG00000204296                        
4 rs6621681        6    32292217        ENSG00000204296                        
5 rs6621682        6    32292221        ENSG00000204296                        
6 rs6621682        6    32292221        ENSG00000204296                        
    consequence_type_tv
1            DOWNSTREAM
2              INTRONIC
3              INTRONIC
4 NON_SYNONYMOUS_CODING
5              INTRONIC
6     SYNONYMOUS_CODING
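
Once the consequence type is in the result, dropping the flanking-region SNPs is a simple subset. The following is a minimal sketch, not part of the original query: it rebuilds by hand the six rows printed by head(out) above (leaving out the validated column) so that it runs without a BioMart connection, and the names `flanking` and `within_gene` are only illustrative.

```r
# Rebuild the six rows shown by head(out) above (no network access needed).
out <- data.frame(
  refsnp_id = c("rs517922", "rs3117133", "rs6621681",
                "rs6621681", "rs6621682", "rs6621682"),
  chr_name = "6",
  chrom_start = c(32258836, 32313653, 32292217,
                  32292217, 32292221, 32292221),
  ensembl_gene_stable_id = "ENSG00000204296",
  consequence_type_tv = c("DOWNSTREAM", "INTRONIC", "INTRONIC",
                          "NON_SYNONYMOUS_CODING", "INTRONIC",
                          "SYNONYMOUS_CODING"),
  stringsAsFactors = FALSE
)

# Consequence types that place a SNP in the flanking regions.
flanking <- c("UPSTREAM", "DOWNSTREAM")

# Keep only the rows whose consequence is not a flanking one.
within_gene <- out[!(out$consequence_type_tv %in% flanking), ]

# A SNP can appear once per consequence, so report unique IDs.
unique(within_gene$refsnp_id)
```

With the real query result, the same subset applies to the full `out` data frame rather than to these six rows.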

Edit for followup question:

The rule is that upstream is within 5kb of the transcript start, and downstream is within 5kb of the transcript end. Since genes can have multiple transcripts, you will want to look at the transcript in question to verify that a SNP is assigned within the documented distance.

For your example, rs76596671 is assigned to 7 alternative transcripts of ENSG00000204296. It is upstream of transcript ENST00000305725, which is located on the reverse strand of chromosome 6 at 32,260,758-32,338,274. Because the transcript is on the reverse strand, its start is the higher coordinate, 32,338,274. rs76596671 is located at 32,339,357, so it is 1083bp upstream of the transcript start.
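
The strand arithmetic can be checked with a few lines of R. This is a small sketch using the coordinates quoted above; the helper `upstream_distance` and the variable `window` are hypothetical names introduced here for illustration and are not part of biomaRt.

```r
# Distance from a SNP to the transcript start, measured in the upstream
# direction. Positive values are upstream of the start; values <= 0 mean
# the SNP is at or past the transcript start.
# strand: 1 for forward, -1 for reverse.
upstream_distance <- function(snp_pos, tx_low, tx_high, strand) {
  if (strand == 1) {
    tx_low - snp_pos    # forward strand: start is the lower coordinate
  } else {
    snp_pos - tx_high   # reverse strand: start is the higher coordinate
  }
}

window <- 5000  # the 5kb flank described above

# rs76596671 against ENST00000305725 (reverse strand)
d <- upstream_distance(snp_pos = 32339357,
                       tx_low = 32260758, tx_high = 32338274,
                       strand = -1)
d                     # 1083
d > 0 && d <= window  # TRUE: inside the 5kb upstream window
```

The same check would need to be repeated per transcript, since a gene with several transcripts has several possible start coordinates.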


My question is not about the SNPs IN the genes; it is about those upstream and downstream of the gene. Following your example, there are 4 UPSTREAM SNPs, one of which is this:

out[which(out$consequence_type_tv=="UPSTREAM"),][2,]
rs76596671 6 32339357 ENSG00000204296 UPSTREAM

This SNP happens to be about 45kb from the beginning of the gene; in fact, there is another gene between the SNP and the gene of interest. So my question is:

How far UPSTREAM/DOWNSTREAM of the gene of interest do I expect to see SNPs in these searches? i.e. is there a rule saying "only SNPs within 50kb of the gene will be included"?


Thanks! That makes lots of sense! I think the confusion came from the fact that the chromosome location is different in UCSC and dbSNP compared to Ensembl. They map it at 32391135. Am I missing something blatantly obvious here? The builds look the same, and so do the coordinates of the gene (for the longest transcript).


Awesome, glad that helped. As for the mapping differences, it might be that Ensembl is currently using dbSNP 131, while NCBI is at 132. I believe an Ensembl release with dbSNP 132 is planned for early next year. It's always a good idea to stick with consistent resources; otherwise it gets confusing quickly. Another issue is that this SNP is flanked by a repetitive region of A/Ts, so it could be hard to map: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=76596671#fasta

13.9 years ago
Mary 11k

This is a good question. I don't know how the specific gene set you are working with handles it. But I know that in UCSC the 5' and 3' UTRs may (or may not) be part of a gene region, depending on how you access the data and which set you use. UCSC doesn't decide whether there is a UTR; it relies on the original source record (such as RefSeq/GenBank).

So I'm saying you should be sure you understand the gene set issues.
