Question

Query Genes within regions

0

Entering edit mode

7.3 years ago

Emilio Marmol ▴ 180

Hello,

I want to retrieve the genes that are present within a series of regions. Say, I have a bed file with query positions such like:

1     2665697     4665777      MIR201
1    10391435    12391516      MIR500
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488

I would like to get the genes that fall within those regions.

I have tried using biomaRt, and bedtools intersect, but the output I get, is a list of genes corresponding to all the regions, not one by one, as the desired output I would like to get would be the genes within each row, but in separate rows, a if I did one query region at a time. Basically I want to know what genes fall within each region, but still being able to identify which genes fall in which regions.

What I am doing is, from a region of detected miRNA, I am expanding the genome region upwards and downwards, so that I get the neighboring genes from this miRNA. I am using a 1 million bases windows up and down. This would work for just one query, but, how to do many queries with biomaRt or many intersections with bedtools, so that I get somewhat like:

1     2665697     4665777      MIR201      GENEX, GENEY, GENEZ...
1    10391435    12391516      MIR500      GENEA, GENEB, GENEC...
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488

Meaning that GENEX, GENEY and GENEZ fall within 1:2665697-4665777, with MIR201, placed in the middle, as this region is calculated subtracting 1 million bp to sart, and adding 1 million bp to end position.

I am somewhat determining the neighboring genes from each miRNA, to compare within species, but I do not get how to query multiple regions individually using biomaRt or bedtools

Any help?

biomaRt R bedtools query • 3.8k views

ADD COMMENT • link updated 7.3 years ago by Mike Smith ★ 2.1k • written 7.3 years ago by Emilio Marmol ▴ 180

1

Entering edit mode

There are various solutions here: How to get all genes for a specific list of regions in R preferably using Biomart

You can even follow my solution to annotate CN segment data (BED Format): A: How to extract the list of genes from TCGA CNV data

Yet another solution is to use bedtools intersect -a MyRegions -b Annotation.GTF -wao

ADD REPLY • link 7.3 years ago by Kevin Blighe 89k

0

Entering edit mode

Same as at StackOverflow? https://stackoverflow.com/questions/50136262/query-genes-within-regions

ADD REPLY • link 7.3 years ago by zx8754 12k

0

Entering edit mode

Hello v82masae!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/50136262/query-genes-within-regions

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 7.3 years ago by zx8754 12k

score 3 · Answer 1 · 2018-05-03

You're correct that a weakness of biomaRt is that you can't match the results with a specific part of the query, it just dumps back everything that matched at least once, and querying with each region separately isn't very quick if you have thousands of regions.

If you want to do this in R, then I would recommend the ensembldb package, which is a nice interface to query Ensembl data on your local machine without requiring web queries.

The example below should get what you want. It assumes your bed file is saved in /tmp/example.bed which you should change:

library(rtracklayer)
library(ensembldb)
library(EnsDb.Hsapiens.v86)
library(stringr)
library(dplyr)
library(magrittr)

## Making a short name for our geneome database
edb <- EnsDb.Hsapiens.v86

## import a bed file in a GRanges object,
## then split this into a list of indivdual ranges
regions_list <- import("/tmp/example.bed") %>%
    split(., factor(.))

## define a function that returns the names of any genes
## falling within a single GRange region
## Names are in comma separated string
getGeneNamesFromRegion <- function(x, edb) { 
    genes(edb, filter = GRangesFilter(x))$gene_name %>%
        paste(collapse = ",")
}

## apply the function above to each region in our list
genes_by_region <- lapply(regions_list, 
                          FUN = getGeneNamesFromRegion, edb = edb)

## break regions back into chrom, start, end
## and combine with the comma separated gene names
results <- str_split(names(genes_by_region), pattern = ":|-", 
                     simplify = TRUE, n = 3) %>% 
    set_colnames(c('chrom', 'start', 'end')) %>%
    as_tibble() %>%
    mutate(gene_names = unlist(genes_by_region))

And the results look like:

> results
# A tibble: 5 x 4
  chrom start    end      gene_names                                          
  <chr> <chr>    <chr>    <chr>                                               
1 1     2665698  4665777  TTC34,ABC14-1188822O7.1,RP11-740P5.2,RP11-740P5.3,A…
2 1     10391436 12391516 PGD,RP4-736L20.3,APITD1-CORT,APITD1,CORT,DFFA,RP5-1…
3 1     15106832 17106911 KAZN,TMEM51-AS1,TMEM51,C1orf195,RP1-21O18.3,RP5-903…
4 1     23436536 25436616 ASAP3,E2F2,RP1-150O5.3,ID3,MDS2,RP11-223J15.2,RPL11…
5 1     23436576 25436656 ASAP3,E2F2,RP1-150O5.3,ID3,MDS2,RP11-223J15.2,RPL11…

score 1 · Answer 2 · 2018-05-02

You could use BEDOPS bedmap to map gene annotations to regions.

First, preprocess your BED file of regions (e.g., regions.unsorted.bed) to make them UCSC-like and sorted:

$ awk '{ print "chr"$0; }' regions.unsorted.bed | sort-bed - > regions.bed

Adding chr is useful, because annotations from other groups use that chromosome naming convention. Having all the input files use the same naming scheme helps make set and map operations work correctly.

Next, grab annotations and turn them into a sorted BED file.

For example, if you are working with hg38, you could use Gencode v28 annotations and BEDOPS convert2bed:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > genes.bed

Finally, map genes to regions:

$ bedmap --echo --echo-map-id-uniq --delim '\t' regions.bed genes.bed > answer.bed

The file answer.bed will contain your regions (--echo). The final column of each line will have a list of unique gene IDs of genes that overlap that line's region (--echo-map-id-uniq).

If you want the full annotation, you can use --echo-map in place of --echo-map-id-uniq. There are other --echo-map-* options; see the documentation or run bedmap --help.