Hi all,
I am working with a large microarray expression data set. I have expression values for 27,000 probes representing 5,500 genes across 14 data points (variables D1 to D14). Some of these genes are represented by multiple probes (i.e., different probes for the same gene); the number of probes per gene ranges from 1 to 5. For every gene with multiple probes, I want to compute the pairwise Pearson correlation coefficient and its associated p-value for all possible combinations of that gene's probes across the 14 data points, and export the result in a flat (one-dimensional) format. A small portion of my input data table, which is stored in CSV format, is shown below.
ProbeName | Gene | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | D11 | D12 | D13 | D14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1 | A | 9.1 | 6.6 | 8.2 | 9.3 | 9.0 | 8.8 | 9.9 | 7.5 | 10.8 | 9.0 | 8.3 | 11.6 | 9.3 | 10.9 |
A2 | A | 3.9 | 3.7 | 5.8 | 2.2 | 2.9 | 2.8 | 2.9 | 3.8 | 3.3 | 1.7 | 3.2 | 3.5 | 5.9 | 3.7 |
A3 | A | 4.6 | 4.8 | 6.8 | 2.8 | 4.3 | 3.5 | 4.2 | 5.3 | 4.5 | 3.3 | 4.0 | 4.3 | 6.9 | 4.7 |
A4 | A | 3.8 | 3.9 | 5.8 | 3.2 | 4.0 | 2.8 | 3.7 | 4.6 | 3.6 | 2.2 | 3.8 | 4.3 | 5.6 | 3.9 |
A5 | A | 6.3 | 6.6 | 7.7 | 5.9 | 5.9 | 5.6 | 6.2 | 6.4 | 5.8 | 4.9 | 5.4 | 6.1 | 7.7 | 6.9 |
B1 | B | 7.5 | 5.5 | 7.1 | 10.2 | 7.2 | 8.6 | 8.3 | 7.1 | 6.1 | 7.0 | 9.2 | 6.4 | 6.4 | 9.4 |
B2 | B | 4.6 | 4.8 | 5.6 | 4.3 | 4.7 | 4.3 | 4.0 | 5.5 | 4.0 | 3.3 | 3.8 | 5.0 | 5.7 | 4.7 |
B3 | B | 5.1 | 3.9 | 5.1 | 6.5 | 5.0 | 5.4 | 4.9 | 5.3 | 4.5 | 4.5 | 5.9 | 5.0 | 4.6 | 5.6 |
B4 | B | 7.6 | 6.1 | 7.5 | 10.9 | 8.0 | 9.2 | 8.5 | 7.1 | 6.3 | 7.4 | 10.0 | 6.9 | 6.9 | 10.2 |
C1 | C | 3.1 | 6.1 | 3.4 | 2.5 | 3.7 | 3.3 | 2.7 | 5.0 | 2.3 | 3.1 | 2.0 | 3.8 | 2.6 | 3.3 |
C2 | C | 3.8 | 7.1 | 4.8 | 4.1 | 4.9 | 4.5 | 3.8 | 5.9 | 4.0 | 4.7 | 4.4 | 5.1 | 2.9 | 4.8 |
C3 | C | 3.8 | 6.1 | 5.5 | 5.4 | 6.3 | 3.9 | 3.4 | 7.8 | 5.3 | 5.7 | 4.8 | 4.0 | 3.5 | 4.3 |
D1 | D | 12.2 | 11.7 | 11.4 | 10.5 | 11.5 | 11.4 | 10.7 | 12.0 | 11.3 | 10.5 | 9.9 | 11.7 | 10.5 | 10.2 |
D2 | D | 12.0 | 11.5 | 11.3 | 10.4 | 11.4 | 11.4 | 10.7 | 11.9 | 11.2 | 10.6 | 9.9 | 11.7 | 10.3 | 10.2 |
E1 | E | 2.4 | 3.3 | 7.5 | 3.4 | 5.8 | 3.6 | 1.2 | 3.5 | 0.9 | 2.2 | 3.1 | 4.7 | 7.5 | 4.0 |
The ProbeName column holds the probe names (A1 to E1), the Gene column holds the gene names (A to E), and columns D1 to D14 hold the expression values at the 14 data points. Each row gives the expression values of one probe for a particular gene across the 14 data points (i.e., how strongly that gene is expressed at each data point, as measured by that probe). A1, A2, A3, A4, and A5 are multiple probes for the same gene A, and likewise for genes B, C, D, and E. In this table, I want to compute all possible pairwise Pearson correlations between probes of the same gene across the 14 data points (D1 to D14). For example, the possible probe combinations for gene C are:
1. C1 (D1:3.1, D2:6.1, D3:3.4, D4:2.5, D5:3.7, D6:3.3, D7:2.7, D8:5.0,
D9:2.3, D10:3.1, D11:2.0, D12:3.8, D13:2.6, D14:3.3) Vs C2 (D1:3.8,
D2:7.1, D3:4.8, D4:4.1, D5:4.9, D6:4.5, D7:3.8, D8:5.9, D9:4.0,
D10:4.7, D11:4.4, D12:5.1, D13:2.9, D14:4.8),
2. C1 (D1:3.1, D2:6.1, D3:3.4, D4:2.5, D5:3.7, D6:3.3, D7:2.7, D8:5.0,
D9:2.3, D10:3.1, D11:2.0, D12:3.8, D13:2.6, D14:3.3) Vs C3 (D1:3.8,
D2:6.1, D3:5.5, D4:5.4, D5:6.3, D6:3.9, D7:3.4, D8:7.8, D9:5.3,
D10:5.7, D11:4.8, D12:4.0, D13:3.5, D14:4.3),
3. C2 (D1:3.8, D2:7.1, D3:4.8, D4:4.1, D5:4.9, D6:4.5, D7:3.8, D8:5.9,
D9:4.0, D10:4.7, D11:4.4, D12:5.1, D13:2.9, D14:4.8) Vs C3 (D1:3.8,
D2:6.1, D3:5.5, D4:5.4, D5:6.3, D6:3.9, D7:3.4, D8:7.8, D9:5.3,
D10:5.7, D11:4.8, D12:4.0, D13:3.5, D14:4.3)
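The enumeration above is what base R's `combn()` produces for a set of probe names. The following is only a small sketch to make the pairing explicit; the variable names (`probes_c`, `probes_per_gene`) are made up for illustration, and the probe counts are those of the five example genes in the table.

```r
# Probe names for gene C, as listed in the example above.
probes_c <- c("C1", "C2", "C3")

# Each column of the result is one unordered pair; a probe is never
# paired with itself, and each pair appears only once.
pairs_c <- combn(probes_c, 2)
print(t(pairs_c))

# A gene with k probes yields choose(k, 2) pairs.
probes_per_gene <- c(A = 5, B = 4, C = 3, D = 2, E = 1)
n_pairs <- choose(probes_per_gene, 2)
print(n_pairs)       # gene E has a single probe, so it contributes no pairs
print(sum(n_pairs))  # total number of output rows for the example table
```

For the five example genes this gives 10 + 6 + 3 + 1 + 0 = 20 pairs in total.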
After computing the correlation matrix for the probes of each gene, I want to flatten only its upper (or lower) triangle and write the output in CSV format, as shown below.
ProbeName_1 | ProbeName_2 | Gene | PearsonCorrelationValue | Pvalue |
---|---|---|---|---|
A1 | A2 | A | -0.129 | 0.661 |
A1 | A3 | A | -0.176 | 0.547 |
A1 | A4 | A | -0.106 | 0.718 |
A1 | A5 | A | -0.084 | 0.776 |
A2 | A3 | A | 0.963 | 0.000 |
A2 | A4 | A | 0.932 | 0.000 |
A2 | A5 | A | 0.914 | 0.000 |
A3 | A4 | A | 0.922 | 0.000 |
A3 | A5 | A | 0.883 | 0.000 |
A4 | A5 | A | 0.882 | 0.000 |
B1 | B2 | B | -0.328 | 0.253 |
B1 | B3 | B | 0.900 | 0.000 |
B1 | B4 | B | 0.987 | 0.000 |
B2 | B3 | B | -0.084 | 0.774 |
B2 | B4 | B | -0.322 | 0.261 |
B3 | B4 | B | 0.882 | 0.000 |
C1 | C2 | C | 0.888 | 0.000 |
C1 | C3 | C | 0.542 | 0.045 |
C2 | C3 | C | 0.658 | 0.011 |
D1 | D2 | D | 0.993 | 0.000 |
I do not know how to handle this in R. I would be grateful for any help from the experts here.
Note: I do not want correlations of a probe with itself (e.g., A1 vs. A1 or A2 vs. A2), and I do not want to pair a probe of one gene with a probe of a different gene (e.g., A1 vs. B1, B2, B3, B4; A1 vs. C1, C2, C3; A1 vs. D1, D2; or A1 vs. E1).
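For reference, one possible way to set this up in base R is sketched below. It is a minimal sketch, not a definitive solution: it assumes the table is a data frame with the columns shown above, uses a toy subset (genes C, D, and E) typed in from the example table, and the helper name `pairs_for_gene` and the output file name are hypothetical. Splitting by gene keeps probes of different genes apart, and `combn()` yields each within-gene pair once without self-pairs, so only one triangle of each correlation matrix is produced.

```r
# Toy input: genes C, D and E copied from the example table.
# For the real data this would be something like read.csv("probes.csv")
# (file name hypothetical).
expr <- read.csv(text = "ProbeName,Gene,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14
C1,C,3.1,6.1,3.4,2.5,3.7,3.3,2.7,5.0,2.3,3.1,2.0,3.8,2.6,3.3
C2,C,3.8,7.1,4.8,4.1,4.9,4.5,3.8,5.9,4.0,4.7,4.4,5.1,2.9,4.8
C3,C,3.8,6.1,5.5,5.4,6.3,3.9,3.4,7.8,5.3,5.7,4.8,4.0,3.5,4.3
D1,D,12.2,11.7,11.4,10.5,11.5,11.4,10.7,12.0,11.3,10.5,9.9,11.7,10.5,10.2
D2,D,12.0,11.5,11.3,10.4,11.4,11.4,10.7,11.9,11.2,10.6,9.9,11.7,10.3,10.2
E1,E,2.4,3.3,7.5,3.4,5.8,3.6,1.2,3.5,0.9,2.2,3.1,4.7,7.5,4.0",
  stringsAsFactors = FALSE)

value_cols <- paste0("D", 1:14)

# Correlate every unordered pair of probes belonging to one gene.
pairs_for_gene <- function(df) {
  if (nrow(df) < 2) return(NULL)        # single-probe genes have no pairs
  m <- as.matrix(df[, value_cols])      # probes in rows, data points in columns
  idx <- combn(nrow(df), 2)             # each column is one pair (i < j)
  rows <- lapply(seq_len(ncol(idx)), function(k) {
    i <- idx[1, k]
    j <- idx[2, k]
    ct <- cor.test(m[i, ], m[j, ], method = "pearson")
    data.frame(ProbeName_1 = df$ProbeName[i],
               ProbeName_2 = df$ProbeName[j],
               Gene = df$Gene[1],
               PearsonCorrelationValue = unname(ct$estimate),
               Pvalue = ct$p.value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Apply per gene, drop genes without pairs, and stack into one long table.
per_gene <- lapply(split(expr, expr$Gene), pairs_for_gene)
per_gene <- Filter(Negate(is.null), per_gene)
result <- do.call(rbind, per_gene)
rownames(result) <- NULL

print(result)
write.csv(result, "probe_correlations.csv", row.names = FALSE)
```

With 27,000 probes and at most 5 probes per gene, each gene needs at most 10 `cor.test()` calls, so a plain per-gene loop like this should be manageable. Rounding (e.g., `round(result$Pvalue, 3)`) can be applied before writing if three decimals are wanted.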