Collapse rows of dataframe by AGI transcript numbers to AGI gene number
2
1
Entering edit mode
5.1 years ago
mccormack ▴ 90

I have a data frame in which the first column contains unique identifiers in the form of Arabidopsis AGI transcript numbers.

    target_id   est_counts  tpm
    AT1G01010.1 236     26.3 
    AT1G01020.2 55.0    12.2  
    AT1G01020.6 0         0 
    AT1G01020.1 25.2    4.8
    AT1G01020.4 5.8     0.2 
    AT1G01020.5 45.5    8.8 
    AT1G01020.3 3.5     0.5
    AT1G01030.2 13.25   1.3 
    AT1G01030.1 17.75   1.7

This table has data for three genes; AT1G01010, AT1G01020 and AT1G01030. However, two genes, AT1G01020 and AT1G01030 have multiple transcripts as indicated by the number to the right of the decimal. I would like to collapse the above table into three entries, one entry for each gene in which each column will contain the total for all transcripts of that gene. So, the resulting data frame will look like this:

    target_id   est_counts  tpm
    AT1G01010   236         26.3 
    AT1G01020   135.0       26.5
    AT1G01030   31           3.0

Here, the values for est_counts and tpm for all transcripts of a single gene have been summed. I have seen things like lapply(.sd), but I don't think that will work because the AGI numbers are not exactly equal and my real table has thousands of different AGI numbers.

R • 1.1k views
ADD COMMENT
3
Entering edit mode
5.1 years ago
zx8754 12k

Clean up target_id column, then aggregate other multiple columns at once, see:

# reproducible data
df1 <- read.table(text = "target_id   est_counts  tpm
AT1G01010.1 236     26.3 
AT1G01020.2 55.0    12.2  
AT1G01020.6 0         0 
AT1G01020.1 25.2    4.8
AT1G01020.4 5.8     0.2 
AT1G01020.5 45.5    8.8 
AT1G01020.3 3.5     0.5
AT1G01030.2 13.25   1.3 
AT1G01030.1 17.75   1.7", header = TRUE, stringsAsFactors = FALSE)

# we can imagine they are filenames and drop "extensions":
df1$target_id_clean <- tools::file_path_sans_ext(df1$target_id)

# then aggregate and sum on multiple columns
aggregate(.~target_id_clean, df1[-1], sum)
#   target_id_clean est_counts  tpm
# 1       AT1G01010        236 26.3
# 2       AT1G01020        135 26.5
# 3       AT1G01030         31  3.0
ADD COMMENT
0
Entering edit mode

Great ! Does the job very well. Thank you.

ADD REPLY
1
Entering edit mode
5.1 years ago
ATpoint 85k

Assume your data us called data1, use aggregate():

## remove everything after the "."
data1$target_id <- sapply(strsplit(data1$target_id, split="\\."), function(x)x[1])

## sum using the aggregate() function, which reads as:
## sum up est_counts (or tpm) as a function of duplicated target_id entries:
results <- data.frame(aggregate(est_counts ~ target_id, data1, sum),
                      aggregate(tpm ~ target_id, data1, sum)[,2])
ADD COMMENT
0
Entering edit mode

See answer of zx8754 apparently aggregate can operate on the entire data frame in one go, no need for two separate commands.

ADD REPLY

Login before adding your answer.

Traffic: 2768 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6