This isn't a regex problem at all, but here's a solution in R.
library(stringr)
library(magrittr)
library(tidyr)
library(dplyr)
#Toy data.
df <- data.frame(
  genes = c("oneA,oneB,oneC", "oneA-oneB-oneC", "oneA,oneB,twoD",
            "someID1-someID2-someotherID", "someID-someotherID"),
  strand = c("+", "-", "+", "-", "+")
)
#Assigning a grouping identifier to each set of genes that constitute an operon.
#Also separating the genes into their own respective rows.
df %<>%
  mutate(grp = row_number()) %>%
  separate_rows(genes, sep = "[,\\-]")
#Extracting the operon component (e.g., "one") and gene component (e.g., "A")
#identifiers into separate columns.
df %<>%
  mutate(op1 = str_extract(genes, "^[a-z]+"),
         op2 = str_extract(genes, "[A-Z0-9]+$"))
#Grouping by grp and collapsing the genes together for later use.
df %<>%
  group_by(grp) %>%
  mutate(genes = paste0(genes, collapse = ",")) %>%
  ungroup()
#Grouping by the operon grouping and operon component to collapse the gene components
#into a single row each.
#Prior to collapsing, orienting the gene components correctly based on strand
#orientation.
#Then retaining only unique operon components (since the gene components are
#now duplicated across rows.)
df %<>%
  group_by(grp, op1) %>%
  mutate(op2 = ifelse(strand == "+", op2, sort(op2, decreasing = TRUE))) %>%
  mutate(op2 = paste0(op2, collapse = "")) %>%
  distinct(op1, .keep_all = TRUE) %>%
  ungroup()
#Putting the operon and gene components back together and collapsing
#the operon components by grp, and removing duplicates + columns.
df %<>%
  mutate(op = paste0(op1, op2)) %>%
  group_by(grp) %>%
  mutate(op = paste0(op, collapse = "-")) %>%
  distinct(op, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-c(grp, op1, op2))
#Final result.
df
# # A tibble: 5 × 3
#   genes                       strand op
#   <chr>                       <chr>  <chr>
# 1 oneA,oneB,oneC              +      oneABC
# 2 oneA,oneB,oneC              -      oneCBA
# 3 oneA,oneB,twoD              +      oneAB-twoD
# 4 someID1,someID2,someotherID -      someID2ID1-someotherID
# 5 someID,someotherID          +      someID-someotherID
See the comments in the code for explanations. I consider the solution incomplete.
The remark in the question that "Occasionally operons can also come out as "oneA-oneB-oneC" or "someID-someotherID""
doesn't help, because you haven't given us any idea of how those are supposed to be treated. For now the code simply treats a hyphen as another separator, exactly like a comma.
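
If you'd rather not reshape the whole data frame, the same steps can be sketched as a small per-row helper. This is only a rough sketch under the same assumptions as the code above: the operon part is the leading run of lowercase letters, the gene part is the trailing run of capitals/digits, and hyphens are treated like commas. The name collapse_operon is made up for illustration, and it expects every gene ID to match both patterns.

```r
library(stringr)

# Hypothetical helper (not from any package): collapse one operon string.
collapse_operon <- function(genes, strand) {
  # Split on either separator (comma or hyphen).
  parts <- str_split(genes, "[,\\-]")[[1]]
  # Operon component = leading lowercase letters;
  # gene component = trailing capitals/digits.
  prefix <- str_extract(parts, "^[a-z]+")
  suffix <- str_extract(parts, "[A-Z0-9]+$")
  # Group gene components by operon component, keeping first-appearance order.
  groups <- split(suffix, factor(prefix, levels = unique(prefix)))
  collapsed <- vapply(groups, function(s) {
    # On the minus strand, reverse the order by sorting descending.
    if (strand == "-") s <- sort(s, decreasing = TRUE)
    paste0(s, collapse = "")
  }, character(1))
  # Glue operon + gene components back together, hyphen-separated.
  paste0(names(collapsed), collapsed, collapse = "-")
}

# Usage on the toy data from above.
genes  <- c("oneA,oneB,oneC", "oneA-oneB-oneC", "oneA,oneB,twoD",
            "someID1-someID2-someotherID", "someID-someotherID")
strand <- c("+", "-", "+", "-", "+")
op <- mapply(collapse_operon, genes, strand, USE.NAMES = FALSE)
op
```

On the toy data this should give the same op column as the pipeline above. It inherits the same open question about what hyphenated input is really supposed to mean.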