Question

How to extract min and max values of a gene from different columns

0

Entering edit mode

2.1 years ago

Assa Yeroslaviz ★ 1.9k

I'm trying to extract the start and end positions of a specific gene. The table contains all the transcripts for each gene. With this i would like to get the earliest start position (min value in one column) and the furthest end position (max value in a different column).

I have for example this table:

Name    chr strand  tx.start    tx.end
A2M chr12   -   9220303 9268515
A2M chr12   -   9220303 9268515
A2M chr12   -   9220303 9268825
A2ML1   chr12   +   8997599 9029377
A2ML1   chr12   +   8975216 9029383
A2MP1   chr12   -   9381128 9386803
A3GALT2 chr1    -   33772366    33786699
A4GALT  chr22   -   43088117    43117307
A4GALT  chr22   -   43088126    43116876
A4GALT  chr22   -   43114254    43116876

I would like for each gene to extract the min value from the tx.start and the max value from the tx.end columns.

I know I can use summarise twice for each column, but I was wondering if it is possible to do it in one go?

tx.start <- df %>% group_by(Name) %>%
  summarise(tx.start = min(tx.start))
tx.end <- df %>% group_by(Name) %>%
  summarise(tx.start = max(tx.end))
tf %>% left_join(tx.start, by = "Name") %>%
  left_join(tx.end, by = "Name")

Is there a way to combine these three commands into one?

thanks

Assa

here is the example table:

structure(list(Nam = c("A2M", "A2M", "A2M", "A2ML1", "A2ML1", 
"A2MP1", "A3GALT2", "A4GALT", "A4GALT", "A4GALT"), chr = c("chr12", 
"chr12", "chr12", "chr12", "chr12", "chr12", "chr1", "chr22", 
"chr22", "chr22"), strand = c("-", "-", "-", "+", "+", "-", "-", 
"-", "-", "-"), tx.start = c(9220303L, 9220303L, 9220303L, 8997599L, 
8975216L, 9381128L, 33772366L, 43088117L, 43088126L, 43114254L
), tx.end = c(9268515L, 9268515L, 9268825L, 9029377L, 9029383L, 
9386803L, 33786699L, 43117307L, 43116876L, 43116876L)), class = "data.frame", row.names = c(NA, 
-10L))

group_by tibble tidyverse • 738 views

ADD COMMENT • link updated 2.1 years ago by dariober 15k • written 2.1 years ago by Assa Yeroslaviz ★ 1.9k

score 2 · Answer 1 · 2022-11-02

2

Entering edit mode

2.1 years ago

iraun 6.2k

Hi! I think this awk command could do it?

awk '
    BEGIN {FS=OFS="\t"}
        {if (!($1 in min) || $4 < min[$1])
            min[$1] = $4
        if (!($1 in max) || $5 > max[$1])
            max[$1] = $5
    }END{
    for (i in min)print i,min[i],max[i]}' input.tsv

ADD COMMENT • link 2.1 years ago by iraun 6.2k

score 2 · Answer 2 · 2022-11-02

2

Entering edit mode

2.1 years ago

dariober 15k

I prefer data.table for this sort of things in R:

library(data.table)

df <- as.data.table(df)

df[, list(tx.start = min(tx.start), tx.end = max(tx.end)), by = Name]

ADD COMMENT • link 2.1 years ago by dariober 15k