Find gene using nucleotide sequence position in R
2
1
Entering edit mode
7.3 years ago
swamyvinny ▴ 20

Hi, I am trying to find what gene, if any, a specific nucleotide positions belong to. For example I'd like to know what gene nucleotide 100004651 is on. I'm using R currently and would be grateful if anyone can point me towards any packages that could help.

I'm working with the human genome if that helps at all, and also have information on which chromsome the position belongs to and which strand(+/-)

Thanks in advance

genome R sequence • 5.2k views
ADD COMMENT
0
Entering edit mode

use bedtools to intersect coordinates with gene coordinates

ADD REPLY
5
Entering edit mode
7.3 years ago
Emily 24k

You could use biomaRt, filtering by your loci.

ADD COMMENT
0
Entering edit mode

Since I recently did this I can easily share you some code (random position in human genome)

library("biomaRt")
mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")

all.genes <- getBM(
    attributes=c("ensembl_gene_id","hgnc_symbol","chromosome_name","start_position","end_position"),
    filters=c("chromosome_name", "start", "end"),
    values=list(chromosome="5", start="1565669",end="1565670"),
    mart=mart)
ADD REPLY
0
Entering edit mode

This is exactly what I need! thank you so much

ADD REPLY
0
Entering edit mode

Glad that it was useful. I have moved my comment to an answer so you can accept it and mark this thread as closed.

Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode
7.3 years ago
Charles Plessy ★ 2.9k

Here is an internal function that I am about to add to Bioconductor's CAGEr package, to solve the same problem.

#' ranges2genes
#' 
#' Assign gene symbol(s) to Genomic Ranges.
#' 
#' @param ranges Genomics Ranges object, for example extracted from a
#'               RangedSummarizedExperiment object with the \code{rowRanges}
#'               command.
#' 
#' @param genes A \code{\link{GRanges}} object containing \code{gene_name} metadata.
#' 
#' @return A character vector of same length as the GRanges object, indicating
#'         one gene symbol or a comma-separated list of gene symbols for each
#'         range.
#' 
#' @importFrom GenomicRanges findOverlaps
#' @importFrom S4Vectors List Rle unstrsplit
#' @importFrom IRanges extractList
#' 
#' @examples
#' # Example for Biostars
#' # Imagine that nucleotides are represented by a GRanges object called "gr"
#' # Download GENCODE from Bioc's annotation hub:
#' ah <- AnnotationHub::AnnotationHub()
#' gff <- ah[["AH49556"]]
#' Instead one can also use rtracklayer::import.gff, etc...
#' 
#' Annotate the GRanges:
#' gr$geneSymbols <- ranges2genes(gr, gff)

ranges2genes <- function(ranges, genes) {
  if (is.null(genes$gene_name))
    stop("Annotation must contain ", dQuote("gene_name"), " metdata.")
  gnames <- findOverlaps(ranges, genes)
  gnames <- as(gnames, "List")
  gnames <- extractList(genes$gene_name, gnames)
  gnames <- unique(gnames)
  gnames <- unstrsplit(gnames, ";")
  Rle(gnames)
}
ADD COMMENT

Login before adding your answer.

Traffic: 1674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6