How to get all genes for a specific list of regions in R preferably using Biomart
4
6
Entering edit mode
9.0 years ago

I have a list of homo sapiens genomic loci. I would like to see what genes lie within the region of my loci, say give or take 1Mb.

I have written a script in R and am trying to use Biomart to do this.

So far I have my list called filterlist

list("1:4864876:5864876", "1:14283067:15283067", "1:21786817:22786817",
    "1:33465769:34465769", "1:45300539:46300539", "1:54333815:55333815",
    "1:65114236:66114236", "1:75194833:76194833", "1:86037468:87037468",
    "1:96462256:97462256", "1:105259436:106259436", "1:116234756:117234756",
    "1:120842170:121842170", "1:145808064:146808064", "1:155459582:156459582",
    "1:166112356:167112356", "1:174453227:175453227", "1:185347260:186347260",
    "1:194299241:195299241", "1:205731116:206731116")

and my code which admittedly is a bit of a cut and paste job but which doesn't seem to give me any output at all

library("biomaRt")
listMarts(host="www.ensembl.org")
ensembl = useMart(biomart = "ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl", host = "jul2015.archive.ensembl.org")
filters = listFilters(ensembl)
filterlist<-as.list(HMapSamp$MergeCol)
results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position"),filters = c("chromosomal_region","biotype"),values = filterlist, mart = ensembl)

I just want to get the names genes lying within (or intersecting with) the regions in my list

biomart R sequencing • 21k views
ADD COMMENT
17
Entering edit mode
9.0 years ago

It's very simple in R, but I would use the Annotation packages rather than biomaRt, so you don't have to send a query through the wires each time. First of all, install Homo sapiens from Bioconductor:

> source("https://bioconductor.org/biocLite.R")
> biocLite("Homo.sapiens")
> library(Homo.sapiens)

This will also install a TxDb object for homo sapiens, called TxDb.Hsapiens.UCSC.hg19.knownGene.

To get all the gene coordinates, use the genes() function:

> genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

To get the intersection with your coordinates, you must first convert them to a GenomicRanges list. The easiest way is to convert the list to a dataframe, and then use the makeGRangesFromDataFrame function, which should have already been loaded when with Homo.sapiens. Remember to append the prefix 'chr' to your chromosome names. Coordinates should be 1-based.

> library(dplyr)
> mycoords.gr = lapply(mycoords.list, function (x) {res=strsplit(x, ':')}) %>%
   unlist %>%
   as.numeric %>%
   matrix(ncol=3, byrow=T) %>%
   as.data.frame %>%
   select(chrom=V1, start=V2, end=V3) %>%
   mutate(chrom=paste0('chr', chrom)) %>%
   makeGRangesFromDataFrame

> mycoords.gr
GRanges object with 20 ranges and 0 metadata columns:
       seqnames                 ranges strand
          <Rle>              <IRanges>  <Rle>
   [1]     chr1   [ 4864876,  5864876]      *
   [2]     chr1   [14283067, 15283067]      *
   [3]     chr1   [21786817, 22786817]      *
   [4]     chr1   [33465769, 34465769]      *
   [5]     chr1   [45300539, 46300539]      *
   ...      ...                    ...    ...
  [16]     chr1 [166112356, 167112356]      *
  [17]     chr1 [174453227, 175453227]      *
  [18]     chr1 [185347260, 186347260]      *
  [19]     chr1 [194299241, 195299241]      *
  [20]     chr1 [205731116, 206731116]      *
  -------

At this point you can simply use subsetByOverlaps or mergeByOverlaps, to get the genes of interest:

> subsetByOverlaps(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), mycoords.gr)
GRanges object with 188 ranges and 1 metadata column:
            seqnames                 ranges strand   |     gene_id
               <Rle>              <IRanges>  <Rle>   | <character>
  100126349     chr1 [166123980, 166124035]      -   |   100126349
  100129405     chr1 [155715559, 155720673]      +   |   100129405
  100132406     chr1 [145209111, 146467744]      +   |   100132406
  100288142     chr1 [144146811, 146467744]      +   |   100288142
  100302117     chr1 [117214371, 117214449]      +   |   100302117
        ...      ...                    ...    ... ...         ...
       9674     chr1 [175126123, 175162229]      -   |        9674
       9829     chr1 [ 65720148,  65881552]      +   |        9829
       9910     chr1 [174128552, 174964445]      +   |        9910
       9923     chr1 [ 22778344,  22857650]      +   |        9923
        998     chr1 [ 22379120,  22419436]      +   |         998

The gene_id column in this dataframe contains the Entrez ID of the human gene overlapping the coordinates. If you also want the gene symbols, or any other ID, you can get it from org.Hs.db:

> as.data.frame(org.Hs.egSYMBOL) %>% head
  gene_id symbol
1       1   A1BG
2       2    A2M
3       3  A2MP1
4       9   NAT1
5      10   NAT2
6      11   NATP

Just type "org.Hs.eg" and hit tab to see all the possible Entrez Gene to ID conversion available.

ADD COMMENT
1
Entering edit mode

I think these libraries have changed how they output the genomic ranges objects.

> gene_list <- subsetByOverlaps(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), mycoords.gr)
> gene_list
GRangesList object of length 190:
$`100126349`
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr1 166123980-166124035      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$`100129405`
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr1 155715559-155720673      +
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

The first ENSEMBL ID is 100126349, but how can I extract those IDs into a new data.frame? I still struggle to understand genomic range lists. :(

ADD REPLY
0
Entering edit mode

After an hour I found out the function that retrieves the names lol....

names_of_genes <- names(gene_list)
ADD REPLY
4
Entering edit mode
9.0 years ago
Emily 24k

There's a slight error in your query as you include biotype as a filter, but don't specify any value for it. The following would work and get you only coding genes:

results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"),filters = c("chromosomal_region","biotype"),values = list(chromosomal_region=filterlist,biotype="protein_coding"), mart = ensembl)

Or, to get all genes, coding or non:

results=getBM(attributes = c("hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"),filters = c("chromosomal_region"),values = list(chromosomal_region=filterlist,), mart = ensembl)
ADD COMMENT
2
Entering edit mode
9.0 years ago
cyril-cros ▴ 950

Here is a solution which does not use R but BEDtools, which is great for manipulating sets of genomic intervals. Also, you may want to keep information about the strand of your genes of interest.

  1. Go the UCSC table browser, download the human genome annotation in BED format.
  2. Convert your genomic intervals to the bed format. "1:166112356:167112356" becomes "chr1 166112356 167112356" (tab separated).
  3. Use bedtools slop to extend these intervals. You can get your locii plus 500Kb upstream and downstream of each one for example. You need a file with the size of each chromosome.
  4. Use bedtools intersect on the human annotation and your extended intervals.

You now have all the human genes that fall in your intervals, plus their position. Use cut -f4 to keep only the column with their names.

Alternative, may be quicker:

Go to https://genome.ucsc.edu/cgi-bin/hgTables (UCSC table browser). There is a "Region" line, and a "define regions" button; click on it. You can use it with your positions, but you don't have a simple way to get 1Mb around your loci.

ADD COMMENT
2
Entering edit mode
9.0 years ago

A command-line approach with a toolkit like BEDOPS will likely run faster.

Within R, convert the list to a vector and write it to a text file:

> l <- list("1:4864876:5864876", "1:14283067:15283067", "1:21786817:22786817", "1:33465769:34465769", "1:45300539:46300539", "1:54333815:55333815", "1:65114236:66114236", "1:75194833:76194833", "1:86037468:87037468", "1:96462256:97462256", "1:105259436:106259436", "1:116234756:117234756", "1:120842170:121842170", "1:145808064:146808064", "1:155459582:156459582", "1:166112356:167112356", "1:174453227:175453227", "1:185347260:186347260", "1:194299241:195299241", "1:205731116:206731116")
> write(unlist(l), file="elements.txt")

In a command-line shell, replace colons with tabs and prepend with a common chromosome name prefix, writing a sorted BED file called elements.bed:

$ tr ':' '\t' < elements.txt | awk '{ print "chr"$0; }' - | sort-bed - > elements.bed

Grab a set of gene annotations, converted to BED with gtf2bed:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | awk '$8 == "gene"' - \
    > gencode.v19.genes.bed

This is just an example. Use whichever gene annotations you prefer, here.

If your elements are already padded to 1Mb intervals (which may be the case, on a second read of your question), then just use bedmap to map them to genes:

$ bedmap --echo --echo-map-id-uniq --delim '\t' elements.bed gencode.v19.genes.bed > elements_with_gene_IDs.bed

If you need to pad elements, you can use bedmap --range and map the padded elements to genes:

$ bedmap --range 500000 --echo --echo-map-id-uniq --delim '\t' elements.bed gencode.v19.genes.bed > elements_with_gene_IDs_within_1Mb.bed
ADD COMMENT

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6