Find genes overlapping intervals in R and keep the original Intervals
2
1
Entering edit mode
3.6 years ago
rrapopor ▴ 40

Hi,

I'm trying to annotate genomic intervals with their corresponding genes in R. My code:

  db = TxDb.Hsapiens.UCSC.hg38.knownGene
  bed = import.bed(bed_path)
  genes = transcriptsByOverlaps(ranges = bed, x = db, maxgap=0, columns='gene_id')

I get for each interval the overlapping part to each gene. I want to get the original interval and the corresponding genes.

For example if my BED file look like this:

chr1    152078793   156262976

I want to get:

chr1    152078793   156262976  gene1
chr1    152078793   156262976  gene2
chr1    152078793   156262976  gene3

and not:

chr1    152079557    152080903  gene1
chr1    152084141    152089064  gene2

Like bedtools intersect with the -wo parameter.

Thank you!

bedtools annotations R genes • 2.1k views
ADD COMMENT
0
Entering edit mode

You can simply do left join the two table (input bed file and output bed file)

Can use left_join() function of dplyr package. e.g.

left_join(inputdf, outputdf, by=c(chr, st, end));

Make sure to join with chr,st,end as they three together form one information. Make column names same in both table.

ADD REPLY
1
Entering edit mode

Thank you! But I'm not sure how it will solve the problem. I have my original bed file, and the output by the function transcriptByOverlap, but the columns are not identical, just overlapping. What I could Do is intersect between both tables and get the original intervals , which I'm not sure how to do..

ADD REPLY
0
Entering edit mode

Right, left join does not work for this problem.

ADD REPLY
3
Entering edit mode
3.6 years ago
pbpanigrahi ▴ 430

My bad. I misinterpreted. U can try below thing.

mergeByOverlaps(bed, genes)
ADD COMMENT
0
Entering edit mode

Worked perfect! Thank you!

ADD REPLY
0
Entering edit mode

A small educational note: if an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work. If an answer was not really helpful or did not work, provide detailed feedback so others know not to use that answer.
upvote_bookmark_accept

ADD REPLY
1
Entering edit mode
3.6 years ago

edit: having trouble with biostars formatting.

You need a few packages for this solution to work: purrr, dplyr, tibble, magrittr, tidyr.

I like to use pacman to load packages. Then you can use p_load packages in one line.

install.packages('pacman')
p_load(purrr, dplyr, tibble, magrittr, tidyr)

You can run tibble() on your two input files to get db and test_bed like I do here.

db <- tibble(chrom = 'chr1', start = 152078793, end = 156262976)
test_bed <- tibble(start = c(152079557, 152080903), end = c(152084141, 152089064), gene = c('gene1', 'gene2'))


f <- function(db_start, db_end, bed){
  bed %>% 
    mutate(
      start_in_range = map_lgl(start, is_greater_than, db_start),
      end_in_range = map_lgl(end, is_less_than, db_end)
    ) %>% 
    filter(start_in_range, end_in_range) %>% 
    select(gene)
}


result <- db %>%
  mutate(genes_in_interval = map2(start, end, f, test_bed)) %>%
  unnest(genes_in_interval)
ADD COMMENT
2
Entering edit mode

The formatting bar (especially the code option) is useful to format your post for better presentation. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block.
code_formatting

ADD REPLY

Login before adding your answer.

Traffic: 1779 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6