I have a big data matrix and each column has named with multiple information and separated by an underscore.
For an example; Genotype_Time_Replicate: X_T0_1, X_T0_2 etc
I want to average my data for downstream analysis.
How can I average replications (Case 1: in column data), and averaging repeated raw data and column replication data (Case 2: Column and raw)
Final Expectation looks like;
Sample data frame is given below,
Case 1 df:
structure(list(Gene = c("AA", "PQ", "XY", "UBQ"), X_T0_R1 = c(1.46559502, 0.220140568, 0.304127515, 1.098842127), X_T0_R2 = c(1.087642983, 0.237500819, 0.319844338, 1.256624804), X_T0_R3 = c(1.424945196, 0.21066267, 0.256496284, 1.467120048), X_T1_R1 = c(1.289943948, 0.207778662, 0.277942721, 1.238400358), X_T1_R2 = c(1.376535013, 0.488774258, 0.362562315, 0.671502431), X_T1_R3 = c(1.833390311, 0.182798731, 0.332856558, 1.448757569), X_T2_R1 = c(1.450753714, 0.247576125, 0.274415259, 1.035410946), X_T2_R2 = c(1.3094609, 0.390028842, 0.352460646, 0.946426593), X_T2_R3 = c(0.5953716, 1.007079177, 1.912258811, 0.827119776), X_T3_R1 = c(0.7906009, 0.730242116, 1.235644748, 0.832287694), X_T3_R2 = c(1.215333041, 1.012914813, 1.086362205, 1.00918082), X_T3_R3 = c(1.069312467, 0.780421013, 1.002313082, 1.031761442), Y_T0_R1 = c(0.053317766, 3.316414959, 3.617213894, 0.788193798), Y_T0_R2 = c(0.506623748, 3.599442788, 1.734075583, 1.179462912), Y_T0_R3 = c(0.713670106, 2.516735845, 1.236204882, 1.075393433), Y_T1_R1 = c(0.740998252, 1.444496448, 1.077023349, 0.869258744), Y_T1_R2 = c(0.648231834, 0.097957459, 0.791438659, 0.428805547), Y_T1_R3 = c(0.780499252, 0.187840968, 0.820430227, 0.51636582), Y_T2_R1 = c(0.35344654, 1.190274584, 0.401845911, 1.223534348), Y_T2_R2 = c(0.220223951, 1.367784148, 0.362815405, 1.102117612), Y_T2_R3 = c(0.432856978, 1.403057729, 0.10802472, 1.304233845), Y_T3_R1 = c(0.234963735, 1.232129062, 0.072433381, 1.203096462), Y_T3_R2 = c(0.353770497, 0.885122768, 0.011662112, 1.188149743), Y_T3_R3 = c(0.396091395, 1.333921747, 0.192594116, 1.838029829), Z_T0_R1 = c(0.398000559, 1.286528398, 0.129147097, 1.452769794), Z_T0_R2 = c(0.384759325, 1.122251177, 0.119475721, 1.385513609), Z_T0_R3 = c(1.582230097, 0.697419716, 2.406671502, 0.477415567), Z_T1_R1 = c(1.136843842, 0.804552001, 2.13213228, 0.989075996), Z_T1_R2 = c(1.275683837, 1.227821594, 0.31900326, 0.835941568), Z_T1_R3 = c(0.963349308, 0.968589683, 1.706670339, 0.807060135), Z_T2_R1 = c(3.765036263, 0.477443352, 1.712841882, 0.469173869), Z_T2_R2 = c(1.901023385, 0.832736132, 2.223429427, 0.593558769), Z_T2_R3 = c(1.407713024, 0.911920317, 2.011259223, 0.692553388), Z_T3_R1 = c(0.988333629, 1.095130142, 1.648598854, 0.629915612), Z_T3_R2 = c(0.618606729, 0.497458337, 0.549147265, 1.249492088), Z_T3_R3 = c(0.429823986, 0.471389536, 0.977124788, 1.136635484)), row.names = c(NA, -4L ), class = c("data.table", "data.frame"))
Case 2 df:
structure(list(Gene = c("mut", "ACTIN", "ACTIN", "Pq", "UBQ", "UBQ", "Xa"), X_T0_R1 = c(0.344814469, 1.209073623, 1.071457953, 0.362842359, 1.014392244, 1.571055788, 0.570729408), X_T0_R2 = c(0.449930853, 1.031557118, 1.054965621, 0.522831228, 0.83300542, 0.967355216, 0.501057748), X_T0_R3 = c(0.601209073, 1.695796471, 1.052815987, 0.571729222, 1.391288288, 1.773644641, 0.453820027), X_T1_R1 = c(0.427800244, 1.308884798, 0.991302515, 0.329510681, 0.773414746, 1.029619555, 0.362504535), X_T1_R2 = c(0.418589633, 1.811507215, 1.206305091, 0.29886302, 0.895616224, 1.196317937, 0.408657559), X_T1_R3 = c(0.468263467, 1.352236153, 1.444060418, 0.359970383, 0.942421479, 2.388771681, 0.145078696), X_T2_R1 = c(0.300362616, 1.654754505, 1.109259911, 0.306699247, 0.585608303, 1.945573895, 0.270237172), X_T2_R2 = c(0.27920993, 1.573822163, 1.152985196, 0.310218502, 0.493783209, 1.573792123, 0.36659012), X_T2_R3 = c(1.792971556, 0.665809249, 0.778594892, 2.161999623, 1.888984449, 0.456632731, 1.631251843), X_T3_R1 = c(1.118011513, 0.570411874, 1.044634812, 1.213092011, 1.817947271, 0.234950383, 1.384650094), X_T3_R2 = c(1.008515071, 0.916509523, 0.905764637, 1.244132809, 0.752181246, 0.797524026, 1.010615689), X_T3_R3 = c(0.816620011, 0.740345088, 1.106478019, 0.899414205, 0.909160589, 0.672469518, 0.594865366), Y_T0_R1 = c(3.307846716, 0.027550169, 0.645327389, 2.887386508, 1.042465604, 0.05047425, 4.318466199), Y_T0_R2 = c(2.035398381, 0.633422527, 0.888069994, 2.062827838, 1.82433679, 0.500792593, 1.182188977), Y_T0_R3 = c(1.500168876, 0.877196975, 1.088593542, 1.392198697, 1.162069878, 0.470956741, 1.511890878), Y_T1_R1 = c(1.095875029, 0.777981021, 1.050238479, 1.17216374, 0.945470429, 0.40568268, 0.872396888), Y_T1_R2 = c(0.452742932, 0.352610874, 0.787861253, 0.477126035, 0.320200734, 1.826032539, 0.332244865), Y_T1_R3 = c(0.45960558, 0.478390214, 0.645688363, 0.395673468, 0.215407604, 0.759507568, 0.700730905), Y_T2_R1 = c(1.559068766, 0.062252184, 0.937463531, 0.994007758, 0.482591298, 1.269828631, 0.237326878), Y_T2_R2 = c(1.390406257, 0.215685731, 1.087380361, 1.018431329, 0.585660661, 1.05095161, 0.173209498), Y_T2_R3 = c(1.00828232, 0.376013801, 0.782410602, 0.906376375, 0.572489629, 1.359345852, 0.302963483), Y_T3_R1 = c(1.182635592, 0.117426355, 1.013642281, 0.967559933, 0.306328031, 1.231521805, 0.257804624), Y_T3_R2 = c(1.366839578, 0.341411017, 1.337125947, 0.943784803, 0.721978298, 1.10875345, 0.189978177), Y_T3_R3 = c(1.594404053, 0.209740069, 0.92384942, 0.897659445, 0.457172538, 1.543831721, 0.272475233), Z_T0_R1 = c(1.237203711, 0.233057698, 1.077219174, 1.156260667, 0.264806683, 1.591044318, 0.255767162), Z_T0_R2 = c(1.211301515, 0.251870699, 1.141522554, 1.194071909, 0.20882802, 1.533752995, 0.278059859), Z_T0_R3 = c(0.645425334, 1.53688617, 0.439888106, 0.819063313, 1.769224478, 0.250876057, 1.998822839), Z_T1_R1 = c(0.971645792, 0.671074934, 0.469502588, 1.312821698, 1.306039773, 1.40561198, 1.704347344), Z_T1_R2 = c(0.859830596, 1.580097955, 1.366461274, 1.24037716, 0.80578233, 1.116605654, 1.211928025), Z_T1_R3 = c(0.785228306, 1.286123696, 1.10243547, 0.996917372, 1.215506569, 0.683697612, 1.000232952), Z_T2_R1 = c(0.475576762, 2.673806674, 0.732913032, 0.763693301, 3.091813549, 0.347384763, 3.16064337), Z_T2_R2 = c(0.810829692, 1.590506889, 1.162262268, 1.367255133, 1.378518959, 0.677096267, 2.006934309), Z_T2_R3 = c(1.02507371, 2.164918846, 1.440885034, 1.185511625, 1.934374556, 0.460659928, 1.277191061), Z_T3_R1 = c(0.834953495, 2.155130232, 1.209137833, 0.934189133, 1.048650427, 0.704562113, 1.145400709), Z_T3_R2 = c(0.886903303, 0.237343684, 0.921370232, 0.737206101, 0.318232441, 1.314051524, 0.9314835), Z_T3_R3 = c(0.748710472, 0.501419194, 0.914476206, 0.641169316, 0.119979817, 1.187578276, 0.918544916)), row.names = c(NA, -7L), class = c("data.table", "data.frame"))
if possible could you please help me with an easy approach using R programming
Another one in R assuming that sample replicates bear _R[0-9] pattern at the end:
Thank you as always cpad01 . May I know your real identity, please! You have helped me seral times.
BTW, Could you please help me with my case 2 as well, When I have repeated measurements of the same gene [duplicate / triplicate data for a Gene] (in raws) how can I take their average ( then basically df looks like case1) and proceed like above.
Because I see in my working df same gene is measure few times / using different peptides, so I want to average them first and proceed like above.
for df2 (please cross check the results and let me know if the results are off. I checked mean for one or two entries):
2nd method:
3rd solution:
Thank you very much cpad0112 , I like the first approach. I do not know when I'm learning R up to this level :)
Thanks again, I appreciate your kind help always! Well, I wish I know your real name, maybe next time!
Keep visiting biostars. I learnt a lot from here esp awk, sed and python. Hope this forum would help you become better in whatever you are pursuing.