In R, I am working with a huge file of DNA methylation data (beta values). I need to identify diseased individuals that have >0.33 methylation value against the average of our controls, per gene. The file (df.csv below) has individual methylation values per gene (A, B, C in the example below), and the average of the controls (AveControl.value in the example below) is in the last column of the file. For ease, I would like to print the individual(s) (column name(s)) that have >0.33 vs controls in a new column at the end of the file. If there is more than one individual to print, the names should be separated by commas. If no individual shows >0.33, the entry should remain blank. Below is an example of my data and the output I need:
Example of my file: (I have added a space between commas so it's easier to see the data values) In bold I have highlighted the values for the individuals I am looking to extract
df.csv
Gene, Indiv1.value, Indiv2.value, Indiv3.value, AveControl.value
A, 0.1, 0.2, 0.5, 0.1
B, 0.1, 0.2, 0.2, 0.2
C, 0.1, 0.9, 0.8, 0.4
Example of the output I require, with a new column RESULT containing the column name(s) of the individual(s) with >0.33 methylation vs controls. If no individuals meet the requirement, the entry should be empty.
Gene, Indiv1.value, Indiv2.value, Indiv3.value, AveControl.value, RESULT
A, 0.1, 0.2, 0.5, 0.1, Indiv3.value
B, 0.1, 0.2, 0.2, 0.2,
C, 0.1, 0.9, 0.8, 0.4, "Indiv2.value, Indiv3.value"
I have been trying to find a way to do this and have already lost hours. Any help will be greatly appreciated. By the way, in reality I have hundreds of diseased individuals, not just the 3 shown here. Many thanks in advance.
There is a code option that is recommended for highlighting code. You can edit your post with the edit button.
You can use the following solution if you are looking for columns with values above 0.33, including the average control:
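A minimal base R sketch of this first variant (not necessarily the code originally posted), assuming the column names from the example in the question. The data frame is built inline so the snippet runs on its own; for the real file you would presumably load it with `read.csv("df.csv")` instead.

```r
# Example data from the question, built inline so the sketch is self-contained
df <- data.frame(
  Gene             = c("A", "B", "C"),
  Indiv1.value     = c(0.1, 0.1, 0.1),
  Indiv2.value     = c(0.2, 0.2, 0.9),
  Indiv3.value     = c(0.5, 0.2, 0.8),
  AveControl.value = c(0.1, 0.2, 0.4)
)

# All numeric columns, i.e. everything except Gene (AveControl.value included)
value_cols <- setdiff(names(df), "Gene")

# Logical matrix: TRUE where the raw value itself exceeds 0.33
above <- as.matrix(df[value_cols]) > 0.33

# Per row, paste together the names of the columns that are TRUE;
# rows with no hit end up as an empty string
df$RESULT <- apply(above, 1, function(hit) paste(value_cols[hit], collapse = ", "))

print(df)
```

Note that this tests each value against 0.33 directly, so AveControl.value itself is listed for gene C (0.4).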
If you are looking for the names of columns whose values are higher than the last column (AveControl.value), comparing all columns against it:
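A minimal base R sketch of this second variant (again, not necessarily the code originally posted), assuming the same example columns. It reports every individual whose value is simply higher than AveControl.value in the same row, with no 0.33 margin.

```r
# Example data from the question, built inline so the sketch is self-contained
df <- data.frame(
  Gene             = c("A", "B", "C"),
  Indiv1.value     = c(0.1, 0.1, 0.1),
  Indiv2.value     = c(0.2, 0.2, 0.9),
  Indiv3.value     = c(0.5, 0.2, 0.8),
  AveControl.value = c(0.1, 0.2, 0.4)
)

# Individual columns only: drop the gene label and the control average
indiv_cols <- setdiff(names(df), c("Gene", "AveControl.value"))

# Matrix > vector recycles down the columns, so every column is compared
# row by row against the control average of the same gene
higher <- as.matrix(df[indiv_cols]) > df$AveControl.value

# Collapse the names of the TRUE columns into one comma-separated string per row
df$RESULT <- apply(higher, 1, function(hit) paste(indiv_cols[hit], collapse = ", "))

print(df)
```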
Dear cpad0112, Many many thanks for your reply, it's a huge help.
I require something similar to your second command line (comparing all columns against the last column, AveControl.value). I would like to compare all columns against the last column but only print those that are more than 0.33 above the last column.
The first solution does that. It prints, in the last column, the names of all individuals whose values are more than 0.33.
Sorry, I didn't explain properly. I only need those columns where the value is 0.33 above the AveControl.value column.
If AveControl.value is 0.5 and another column is 0.7, I do not need that column reported (as the difference is less than 0.33). There must be a difference equal to or more than 0.33 compared to the AveControl.value column. I hope I have explained myself better!
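The rule as clarified here could be sketched in base R as follows. This is only an illustration under stated assumptions: the function name `flag_above_control` and its arguments are hypothetical, the column names are those of the example, and the cut-off is taken as "difference equal to or more than 0.33" (swap `>=` for `>` if a strict cut-off is wanted).

```r
# Hypothetical helper: adds a RESULT column listing the individuals whose
# value is at least `threshold` above the control average of the same gene
flag_above_control <- function(df, control_col = "AveControl.value",
                               id_col = "Gene", threshold = 0.33) {
  # Individual columns: everything except the gene label and the control average
  indiv_cols <- setdiff(names(df), c(id_col, control_col))

  # Row-wise difference to the control (the vector recycles down each column)
  delta <- as.matrix(df[indiv_cols]) - df[[control_col]]

  # TRUE where the difference reaches the threshold; missing values never count
  hit <- !is.na(delta) & delta >= threshold

  # One comma-separated string per row, empty when no individual qualifies
  df$RESULT <- vapply(
    seq_len(nrow(df)),
    function(i) paste(indiv_cols[hit[i, ]], collapse = ", "),
    character(1)
  )
  df
}

# Example data from the question
df <- data.frame(
  Gene             = c("A", "B", "C"),
  Indiv1.value     = c(0.1, 0.1, 0.1),
  Indiv2.value     = c(0.2, 0.2, 0.9),
  Indiv3.value     = c(0.5, 0.2, 0.8),
  AveControl.value = c(0.1, 0.2, 0.4)
)

out <- flag_above_control(df)
print(out)

# The result can then be written back out, e.g.:
# write.csv(out, "df_with_result.csv", row.names = FALSE)
```

On the example data this gives Indiv3.value for gene A, an empty entry for gene B, and Indiv2.value, Indiv3.value for gene C. A value of 0.7 against a control of 0.5 is not reported, since the difference is only 0.2.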
Thanks so much.
Got it. I will update the solution soon, @rjobmc.
Thank you very much, @cpad0112.
My understanding is that only the names of columns whose values are more than 0.33 above the average control are reported in a new column.
Exactly. Thank you so much.