Question

Filter for nonoverlapping genomic hits and keep the longest one

0

Entering edit mode

4.6 years ago

lessismore ★ 1.4k

Dear all,

i'm dealing with genomic coordinates of TF binding site mapping and i would like to filter the huge FIMO output for the non-overlapping sites. For each overlapping site i would like to keep the longest hit. Is there a way or a package to do this in R ? here i attach an example of my data:

family  start   end
ERF 1891    1911
ERF 1896    1915
ERF 1896    1914
ERF 1897    1911
ERF 1690    1704
ERF 937 957
ERF 1680    1700
ERF 1891    1905
ERF 2789    2809
ERF 642 661
ERF 2788    2806
ERF 1890    1908
ERF 1001    1015
ERF 2789    2803

to import it in R:

starting_data <- structure(list(family = c("ERF", "ERF", "ERF", "ERF", "ERF", 
"ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF"
), start = c(1891L, 1896L, 1896L, 1897L, 1690L, 937L, 1680L, 
1891L, 2789L, 642L, 2788L, 1890L, 1001L, 2789L), end = c(1911L, 
1915L, 1914L, 1911L, 1704L, 957L, 1700L, 1905L, 2809L, 661L, 
2806L, 1908L, 1015L, 2803L)), class = "data.frame", row.names = c(NA, 
-14L))

What i would like to have:

family  start   end
ERF 1891    1911
ERF 1680    1700
ERF 937 957
ERF 2789    2809
ERF 642 661
ERF 1001    1015

to import it in R

desired_output <- structure(list(family = c("ERF", "ERF", "ERF", "ERF", "ERF", 
"ERF"), start = c(1891L, 1680L, 937L, 2789L, 642L, 1001L), end = c(1911L, 
1700L, 957L, 2809L, 661L, 1015L)), class = "data.frame", row.names = c(NA, 
-6L))

To apply this solution to all my output, including the other TF families, i'll use the group_by(family) option on tidyverse.
Thanks in advance for any tip

genomic ranges genomics filtering • 1.0k views

ADD COMMENT • link updated 4.6 years ago by rpolicastro 13k • written 4.6 years ago by lessismore ★ 1.4k

score 1 · Accepted Answer · 2021-01-20

Convert into IRanges.

library("IRanges")
library("tidyverse")

df <- starting_data %>%
  split(.$family) %>%
  map(~IRanges(start=.x$start, end=.x$end))

Merge reduced set of ranges into the original ranges, and then select the longest range of overlapping ranges.

longest <- df %>%
  map(~findOverlapPairs(.x, reduce(.x)) %>% as_tibble) %>%
  bind_rows(.id="family") %>%
  group_by(across(c(starts_with("second"), "family"))) %>%
  filter(first.width == max(first.width)) %>%
  ungroup %>%
  rename_with(.cols=starts_with("first"), .fn=~str_replace(.x, "first.", ""))

> longest
# A tibble: 6 x 7
  family start   end width second.start second.end second.width
  <chr>  <int> <int> <int>        <int>      <int>        <int>
1 ERF     1891  1911    21         1890       1915           26
2 ERF      937   957    21          937        957           21
3 ERF     1680  1700    21         1680       1704           25
4 ERF     2789  2809    21         2788       2809           22
5 ERF      642   661    20          642        661           20
6 ERF     1001  1015    15         1001       1015           15