I'm trying to extract the start and end positions of a specific gene. The table contains all the transcripts for each gene. With this i would like to get the earliest start position (min value in one column) and the furthest end position (max value in a different column).
I have for example this table:
Name chr strand tx.start tx.end
A2M chr12 - 9220303 9268515
A2M chr12 - 9220303 9268515
A2M chr12 - 9220303 9268825
A2ML1 chr12 + 8997599 9029377
A2ML1 chr12 + 8975216 9029383
A2MP1 chr12 - 9381128 9386803
A3GALT2 chr1 - 33772366 33786699
A4GALT chr22 - 43088117 43117307
A4GALT chr22 - 43088126 43116876
A4GALT chr22 - 43114254 43116876
I would like for each gene to extract the min
value from the tx.start
and the max
value from the tx.end
columns.
I know I can use summarise
twice for each column, but I was wondering if it is possible to do it in one go?
tx.start <- df %>% group_by(Name) %>%
summarise(tx.start = min(tx.start))
tx.end <- df %>% group_by(Name) %>%
summarise(tx.start = max(tx.end))
tf %>% left_join(tx.start, by = "Name") %>%
left_join(tx.end, by = "Name")
Is there a way to combine these three commands into one?
thanks
Assa
here is the example table:
structure(list(Nam = c("A2M", "A2M", "A2M", "A2ML1", "A2ML1",
"A2MP1", "A3GALT2", "A4GALT", "A4GALT", "A4GALT"), chr = c("chr12",
"chr12", "chr12", "chr12", "chr12", "chr12", "chr1", "chr22",
"chr22", "chr22"), strand = c("-", "-", "-", "+", "+", "-", "-",
"-", "-", "-"), tx.start = c(9220303L, 9220303L, 9220303L, 8997599L,
8975216L, 9381128L, 33772366L, 43088117L, 43088126L, 43114254L
), tx.end = c(9268515L, 9268515L, 9268825L, 9029377L, 9029383L,
9386803L, 33786699L, 43117307L, 43116876L, 43116876L)), class = "data.frame", row.names = c(NA,
-10L))