Question

Find genes overlapping intervals in R and keep the original Intervals

1

Entering edit mode

3.7 years ago

rrapopor ▴ 40

Hi,

I'm trying to annotate genomic intervals with their corresponding genes in R. My code:

  db = TxDb.Hsapiens.UCSC.hg38.knownGene
  bed = import.bed(bed_path)
  genes = transcriptsByOverlaps(ranges = bed, x = db, maxgap=0, columns='gene_id')

I get for each interval the overlapping part to each gene. I want to get the original interval and the corresponding genes.

For example if my BED file look like this:

chr1    152078793   156262976

I want to get:

chr1    152078793   156262976  gene1
chr1    152078793   156262976  gene2
chr1    152078793   156262976  gene3

and not:

chr1    152079557    152080903  gene1
chr1    152084141    152089064  gene2

Like bedtools intersect with the -wo parameter.

Thank you!

bedtools annotations R genes • 2.2k views

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 3.7 years ago by rrapopor ▴ 40

0

Entering edit mode

You can simply do left join the two table (input bed file and output bed file)

Can use left_join() function of dplyr package. e.g.

left_join(inputdf, outputdf, by=c(chr, st, end));

Make sure to join with chr,st,end as they three together form one information. Make column names same in both table.

ADD REPLY • link 3.7 years ago by pbpanigrahi ▴ 430

1

Entering edit mode

Thank you! But I'm not sure how it will solve the problem. I have my original bed file, and the output by the function transcriptByOverlap, but the columns are not identical, just overlapping. What I could Do is intersect between both tables and get the original intervals , which I'm not sure how to do..

ADD REPLY • link updated 3.7 years ago by Ram 44k • written 3.7 years ago by rrapopor ▴ 40

0

Entering edit mode

Right, left join does not work for this problem.

ADD REPLY • link 3.7 years ago by drew.b.ferrell ▴ 140

1

Entering edit mode

3.7 years ago

drew.b.ferrell ▴ 140

edit: having trouble with biostars formatting.

You need a few packages for this solution to work: purrr, dplyr, tibble, magrittr, tidyr.

I like to use pacman to load packages. Then you can use p_load packages in one line.

install.packages('pacman')
p_load(purrr, dplyr, tibble, magrittr, tidyr)

You can run tibble() on your two input files to get db and test_bed like I do here.

db <- tibble(chrom = 'chr1', start = 152078793, end = 156262976)
test_bed <- tibble(start = c(152079557, 152080903), end = c(152084141, 152089064), gene = c('gene1', 'gene2'))


f <- function(db_start, db_end, bed){
  bed %>% 
    mutate(
      start_in_range = map_lgl(start, is_greater_than, db_start),
      end_in_range = map_lgl(end, is_less_than, db_end)
    ) %>% 
    filter(start_in_range, end_in_range) %>% 
    select(gene)
}


result <- db %>%
  mutate(genes_in_interval = map2(start, end, f, test_bed)) %>%
  unnest(genes_in_interval)

ADD COMMENT • link updated 3.7 years ago by Ram 44k • written 3.7 years ago by drew.b.ferrell ▴ 140

2

Entering edit mode

The formatting bar (especially the code option) is useful to format your post for better presentation. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block.
code_formatting

ADD REPLY • link 3.7 years ago by Ram 44k

score 3 · Accepted Answer · 2021-06-08

3

Entering edit mode

3.7 years ago

pbpanigrahi ▴ 430

My bad. I misinterpreted. U can try below thing.

mergeByOverlaps(bed, genes)

ADD COMMENT • link 3.7 years ago by pbpanigrahi ▴ 430

0

Entering edit mode

Worked perfect! Thank you!

ADD REPLY • link 3.7 years ago by rrapopor ▴ 40

0

Entering edit mode

A small educational note: if an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work. If an answer was not really helpful or did not work, provide detailed feedback so others know not to use that answer.
upvote_bookmark_accept

ADD REPLY • link 3.7 years ago by Ram 44k