Dear all,
I'm working with genomic coordinates from TF binding site mapping and would like to filter the huge FIMO output down to non-overlapping sites. For each group of overlapping sites, I would like to keep only the longest hit. Is there a way, or a package, to do this in R? Here is an example of my data:
family start end
ERF 1891 1911
ERF 1896 1915
ERF 1896 1914
ERF 1897 1911
ERF 1690 1704
ERF 937 957
ERF 1680 1700
ERF 1891 1905
ERF 2789 2809
ERF 642 661
ERF 2788 2806
ERF 1890 1908
ERF 1001 1015
ERF 2789 2803
To import it into R:
starting_data <- structure(list(family = c("ERF", "ERF", "ERF", "ERF", "ERF",
"ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF"
), start = c(1891L, 1896L, 1896L, 1897L, 1690L, 937L, 1680L,
1891L, 2789L, 642L, 2788L, 1890L, 1001L, 2789L), end = c(1911L,
1915L, 1914L, 1911L, 1704L, 957L, 1700L, 1905L, 2809L, 661L,
2806L, 1908L, 1015L, 2803L)), class = "data.frame", row.names = c(NA,
-14L))
What I would like to have:
family start end
ERF 1891 1911
ERF 1680 1700
ERF 937 957
ERF 2789 2809
ERF 642 661
ERF 1001 1015
To import it into R:
desired_output <- structure(list(family = c("ERF", "ERF", "ERF", "ERF", "ERF",
"ERF"), start = c(1891L, 1680L, 937L, 2789L, 642L, 1001L), end = c(1911L,
1700L, 957L, 2809L, 661L, 1015L)), class = "data.frame", row.names = c(NA,
-6L))
To apply this solution to all my output, including the other TF families, I'll use group_by(family) from the tidyverse (dplyr).
Thanks in advance for any tip
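One possible approach, as a minimal base R sketch rather than a definitive solution: sort the hits by start, open a new cluster whenever a hit starts after every earlier hit has ended, then keep the widest hit per cluster. The helper name keep_longest_per_cluster is hypothetical. The sketch assumes closed coordinates (hits sharing a single position count as overlapping), treats chains of overlapping hits as one cluster, and keeps the first hit on ties. It also assumes each family sits on a single sequence; with several sequences you would split by sequence as well.

```r
# Example data from the question (same 14 hits).
starting_data <- data.frame(
  family = rep("ERF", 14),
  start = c(1891L, 1896L, 1896L, 1897L, 1690L, 937L, 1680L,
            1891L, 2789L, 642L, 2788L, 1890L, 1001L, 2789L),
  end = c(1911L, 1915L, 1914L, 1911L, 1704L, 957L, 1700L,
          1905L, 2809L, 661L, 2806L, 1908L, 1015L, 2803L)
)

# Hypothetical helper: keep the longest hit in each cluster of overlapping hits.
keep_longest_per_cluster <- function(df) {
  n <- nrow(df)
  if (n == 0L) return(df)
  # Sort by start so that overlapping hits become adjacent.
  df <- df[order(df$start, df$end), , drop = FALSE]
  # Largest end coordinate among all previous rows (-Inf for the first row).
  prev_max_end <- c(-Inf, cummax(df$end)[-n])
  # A new cluster begins when a hit starts after all earlier hits have ended.
  cluster <- cumsum(df$start > prev_max_end)
  width <- df$end - df$start + 1L
  # Row index of the widest hit in each cluster (first one on ties).
  keep <- vapply(
    split(seq_len(n), cluster),
    function(i) i[which.max(width[i])],
    integer(1)
  )
  df[unname(keep), , drop = FALSE]
}

# Apply the helper separately to each TF family, then recombine.
result <- do.call(
  rbind,
  lapply(split(starting_data, starting_data$family), keep_longest_per_cluster)
)
rownames(result) <- NULL
print(result)
```

On the example data this keeps the same six hits as the desired output, ordered by start instead of by first appearance. The same helper could probably be used with dplyr via group_by(family) and group_modify(), if you prefer to stay in the tidyverse.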
Congrats <3 That's a great solution!