I have a data frame in which the first column contains unique identifiers in the form of Arabidopsis AGI transcript numbers.
target_id est_counts tpm
AT1G01010.1 236 26.3
AT1G01020.2 55.0 12.2
AT1G01020.6 0 0
AT1G01020.1 25.2 4.8
AT1G01020.4 5.8 0.2
AT1G01020.5 45.5 8.8
AT1G01020.3 3.5 0.5
AT1G01030.2 13.25 1.3
AT1G01030.1 17.75 1.7
This table has data for three genes: AT1G01010, AT1G01020 and AT1G01030. However, two of them, AT1G01020 and AT1G01030, have multiple transcripts, as indicated by the number to the right of the decimal point. I would like to collapse the table above into three entries, one per gene, in which each column contains the total across all transcripts of that gene. The resulting data frame would look like this:
target_id est_counts tpm
AT1G01010 236 26.3
AT1G01020 135.0 26.5
AT1G01030 31 3.0
Here, the values for est_counts and tpm for all transcripts of a single gene have been summed. I have seen approaches such as lapply(.SD), but I don't think that will work directly, because the transcript IDs belonging to one gene are not identical (they differ in the suffix after the decimal point), and my real table has thousands of different AGI numbers.
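One possible approach, as a minimal sketch in base R: strip the transcript suffix so that all transcripts of a gene share the same ID, then sum within each ID. The data frame name `df` and the result name `gene_totals` are assumptions made for illustration, and the regular expression assumes every ID ends in a dot followed by digits, as in the example above.

```r
# Example data copied from the table in the question; `df` is a placeholder name.
df <- data.frame(
  target_id = c("AT1G01010.1", "AT1G01020.2", "AT1G01020.6", "AT1G01020.1",
                "AT1G01020.4", "AT1G01020.5", "AT1G01020.3", "AT1G01030.2",
                "AT1G01030.1"),
  est_counts = c(236, 55.0, 0, 25.2, 5.8, 45.5, 3.5, 13.25, 17.75),
  tpm = c(26.3, 12.2, 0, 4.8, 0.2, 8.8, 0.5, 1.3, 1.7),
  stringsAsFactors = FALSE
)

# Drop the trailing ".<digits>" so each transcript ID becomes its gene ID.
df$target_id <- sub("\\.[0-9]+$", "", df$target_id)

# Sum both numeric columns within each gene ID.
gene_totals <- aggregate(cbind(est_counts, tpm) ~ target_id, data = df, FUN = sum)
print(gene_totals)
```

Because the grouping key is derived from the IDs themselves, this should not depend on how many distinct AGI numbers the real table contains. Note that the formula interface of `aggregate` drops rows containing NA by default, which may or may not be what you want for real data.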
Great! This does the job very well. Thank you.