Calculating average using information from three different columns of a file.
2
0
Entering edit mode
6.2 years ago
bk11 ★ 3.0k

Hi

I have several thousand lines sheet with columns like this:

Count   Cell_Types  FPKM    transcript_name
1   alternatively activated macrophage  0.02    AP002954.3
2   alternatively activated macrophage  0   AP002954.3
3   alternatively activated macrophage  0.02    AP002954.3
4   alternatively activated macrophage  0.03    AP002954.3
5   alternatively activated macrophage  0.01    AP002954.3
6   alternatively activated macrophage  0.04    AP002954.3
7   alternatively activated macrophage  0   AP002954.3
272 alternatively activated macrophage  62.54   ARCN1
273 alternatively activated macrophage  63.81   ARCN1
274 alternatively activated macrophage  59.45   ARCN1
275 alternatively activated macrophage  60.11   ARCN1
276 alternatively activated macrophage  65.07   ARCN1
277 alternatively activated macrophage  61.61   ARCN1
278 alternatively activated macrophage  46.62   ARCN1
542 alternatively activated macrophage  199.45  ATP5L
543 alternatively activated macrophage  212.03  ATP5L
544 alternatively activated macrophage  195.81  ATP5L
545 alternatively activated macrophage  266.32  ATP5L
546 alternatively activated macrophage  218.48  ATP5L
547 alternatively activated macrophage  216 ATP5L
548 alternatively activated macrophage  227.52  ATP5L

I want to take average of FPKM for individual transcript_names from individual Cell_types and get result like this:

Count   Cell_Types  Average_FPKM    transcript_name
1   alternatively activated macrophage  0.017   AP002954.3
2   alternatively activated macrophage  59.887  ARCN1
3   alternatively activated macrophage  219.372 ATP5L

I will appreciate any help. Thanks.

Unix R Python • 1.1k views
ADD COMMENT
0
Entering edit mode

with datamash:

$ datamash  -Hsg 2,4 mean 3 < test.txt 

GroupBy(Cell_Types) GroupBy(transcript_name)    mean(FPKM)
alternatively activated macrophage  AP002954.3  0.017142857142857
alternatively activated macrophage  ARCN1   59.887142857143
alternatively activated macrophage  ATP5L   219.37285714286

with tsv-utils:

$ tsv-summarize --header --group-by 2,4 --mean 3 test.txt  | nl -v 0 -w 1 | sed '1s/0/S.No/'

S.No    Cell_Types  transcript_name FPKM_mean
1   alternatively activated macrophage  AP002954.3  0.0171428571429
2   alternatively activated macrophage  ARCN1   59.8871428571
3   alternatively activated macrophage  ATP5L   219.372857143
ADD REPLY
1
Entering edit mode
6.2 years ago

Hello,

this can be done with datamash for grouping your data and awkfor formating the output:

$ datamash -H -f -R 3 -g4 mean 3 < input.txt|awk -v FS="\t" -v OFS="\t" 'NR==1 {print $1,$2,"Average_FPKM",$4; next} {print NR-1,$2,$5,$4}'

fin swimmer

ADD COMMENT
0
Entering edit mode
6.2 years ago
zx8754 12k

Using R, aggregate:

aggregate(.~ Cell_Types + transcript_name, df1[ , -1], mean)

#                           Cell_Types transcript_name         FPKM
# 1 alternatively activated macrophage      AP002954.3   0.01714286
# 2 alternatively activated macrophage           ARCN1  59.88714286
# 3 alternatively activated macrophage           ATP5L 219.37285714

See SO for other alternative solutions: Mean per group in a data.frame

ADD COMMENT

Login before adding your answer.

Traffic: 2231 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6