How to extract min and max values of a gene from different columns
2
0
Entering edit mode
2.1 years ago
Assa Yeroslaviz ★ 1.9k

I'm trying to extract the start and end positions of a specific gene. The table contains all the transcripts for each gene. With this i would like to get the earliest start position (min value in one column) and the furthest end position (max value in a different column).

I have for example this table:

Name    chr strand  tx.start    tx.end
A2M chr12   -   9220303 9268515
A2M chr12   -   9220303 9268515
A2M chr12   -   9220303 9268825
A2ML1   chr12   +   8997599 9029377
A2ML1   chr12   +   8975216 9029383
A2MP1   chr12   -   9381128 9386803
A3GALT2 chr1    -   33772366    33786699
A4GALT  chr22   -   43088117    43117307
A4GALT  chr22   -   43088126    43116876
A4GALT  chr22   -   43114254    43116876

I would like for each gene to extract the min value from the tx.start and the max value from the tx.end columns.

I know I can use summarise twice for each column, but I was wondering if it is possible to do it in one go?

tx.start <- df %>% group_by(Name) %>%
  summarise(tx.start = min(tx.start))
tx.end <- df %>% group_by(Name) %>%
  summarise(tx.start = max(tx.end))
tf %>% left_join(tx.start, by = "Name") %>%
  left_join(tx.end, by = "Name")

Is there a way to combine these three commands into one?

thanks

Assa

here is the example table:

structure(list(Nam = c("A2M", "A2M", "A2M", "A2ML1", "A2ML1", 
"A2MP1", "A3GALT2", "A4GALT", "A4GALT", "A4GALT"), chr = c("chr12", 
"chr12", "chr12", "chr12", "chr12", "chr12", "chr1", "chr22", 
"chr22", "chr22"), strand = c("-", "-", "-", "+", "+", "-", "-", 
"-", "-", "-"), tx.start = c(9220303L, 9220303L, 9220303L, 8997599L, 
8975216L, 9381128L, 33772366L, 43088117L, 43088126L, 43114254L
), tx.end = c(9268515L, 9268515L, 9268825L, 9029377L, 9029383L, 
9386803L, 33786699L, 43117307L, 43116876L, 43116876L)), class = "data.frame", row.names = c(NA, 
-10L))
group_by tibble tidyverse • 738 views
ADD COMMENT
2
Entering edit mode
2.1 years ago
iraun 6.2k

Hi! I think this awk command could do it?

awk '
    BEGIN {FS=OFS="\t"}
        {if (!($1 in min) || $4 < min[$1])
            min[$1] = $4
        if (!($1 in max) || $5 > max[$1])
            max[$1] = $5
    }END{
    for (i in min)print i,min[i],max[i]}' input.tsv
ADD COMMENT
2
Entering edit mode
2.1 years ago

I prefer data.table for this sort of things in R:

library(data.table)

df <- as.data.table(df)

df[, list(tx.start = min(tx.start), tx.end = max(tx.end)), by = Name]
ADD COMMENT

Login before adding your answer.

Traffic: 2117 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6