Question

Calculate the Pearson correlation and associated p value for multiple variables

3.3 years ago

Cp.Recker • 0

Hi all,

I am working with a large microarray expression data set. I have expression values for 27000 probes representing 5500 genes across 14 different data points (variables D1 to D14). Some of these 5500 genes are represented by multiple probes (i.e., different probes for the same gene); the number of probes per gene ranges from 1 to 5. I want to compute the pairwise Pearson correlation coefficient and the associated p-value for all possible combinations of probes of the same gene across the 14 data points (14 variables), and export the result in a flattened (1-dimensional) format. A small portion of my input data table in CSV format is shown below.

ProbeName	Gene	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14
A1	A	9.1	6.6	8.2	9.3	9.0	8.8	9.9	7.5	10.8	9.0	8.3	11.6	9.3	10.9
A2	A	3.9	3.7	5.8	2.2	2.9	2.8	2.9	3.8	3.3	1.7	3.2	3.5	5.9	3.7
A3	A	4.6	4.8	6.8	2.8	4.3	3.5	4.2	5.3	4.5	3.3	4.0	4.3	6.9	4.7
A4	A	3.8	3.9	5.8	3.2	4.0	2.8	3.7	4.6	3.6	2.2	3.8	4.3	5.6	3.9
A5	A	6.3	6.6	7.7	5.9	5.9	5.6	6.2	6.4	5.8	4.9	5.4	6.1	7.7	6.9
B1	B	7.5	5.5	7.1	10.2	7.2	8.6	8.3	7.1	6.1	7.0	9.2	6.4	6.4	9.4
B2	B	4.6	4.8	5.6	4.3	4.7	4.3	4.0	5.5	4.0	3.3	3.8	5.0	5.7	4.7
B3	B	5.1	3.9	5.1	6.5	5.0	5.4	4.9	5.3	4.5	4.5	5.9	5.0	4.6	5.6
B4	B	7.6	6.1	7.5	10.9	8.0	9.2	8.5	7.1	6.3	7.4	10.0	6.9	6.9	10.2
C1	C	3.1	6.1	3.4	2.5	3.7	3.3	2.7	5.0	2.3	3.1	2.0	3.8	2.6	3.3
C2	C	3.8	7.1	4.8	4.1	4.9	4.5	3.8	5.9	4.0	4.7	4.4	5.1	2.9	4.8
C3	C	3.8	6.1	5.5	5.4	6.3	3.9	3.4	7.8	5.3	5.7	4.8	4.0	3.5	4.3
D1	D	12.2	11.7	11.4	10.5	11.5	11.4	10.7	12.0	11.3	10.5	9.9	11.7	10.5	10.2
D2	D	12.0	11.5	11.3	10.4	11.4	11.4	10.7	11.9	11.2	10.6	9.9	11.7	10.3	10.2
E1	E	2.4	3.3	7.5	3.4	5.8	3.6	1.2	3.5	0.9	2.2	3.1	4.7	7.5	4.0

The ProbeName column holds the probe names (A1 to E1), the Gene column holds the gene names (A to E), and columns D1 to D14 (the variables) hold the expression values at the different data points. Each row gives the expression values of one probe for a particular gene across the 14 data points (i.e., how strongly that gene is expressed at each data point according to that probe). A1, A2, A3, A4 and A5 are multiple probes for the same gene A, and likewise for the other genes B, C, D, and E. In this table, I want to compute every possible pairwise Pearson correlation between probes of the same gene across the 14 data points (D1 to D14). For example, the possible probe combinations for gene C are

 1. C1 (D1:3.1, D2:6.1, D3:3.4, D4:2.5, D5:3.7, D6:3.3, D7:2.7, D8:5.0,
    D9:2.3, D10:3.1, D11:2.0, D12:3.8, D13:2.6, D14:3.3) Vs C2 (D1:3.8,
    D2:7.1, D3:4.8, D4:4.1, D5:4.9, D6:4.5, D7:3.8, D8:5.9, D9:4.0,
    D10:4.7, D11:4.4, D12:5.1, D13:2.9, D14:4.8),
 2. C1 (D1:3.1, D2:6.1, D3:3.4, D4:2.5, D5:3.7, D6:3.3, D7:2.7, D8:5.0,
    D9:2.3, D10:3.1, D11:2.0, D12:3.8, D13:2.6, D14:3.3) Vs C3 (D1:3.8,
    D2:6.1, D3:5.5, D4:5.4, D5:6.3, D6:3.9, D7:3.4, D8:7.8, D9:5.3,
    D10:5.7, D11:4.8, D12:4.0, D13:3.5, D14:4.3),
 3. C2 (D1:3.8, D2:7.1, D3:4.8, D4:4.1, D5:4.9, D6:4.5, D7:3.8, D8:5.9,
    D9:4.0, D10:4.7, D11:4.4, D12:5.1, D13:2.9, D14:4.8) Vs C3 (D1:3.8,
    D2:6.1, D3:5.5, D4:5.4, D5:6.3, D6:3.9, D7:3.4, D8:7.8, D9:5.3,
    D10:5.7, D11:4.8, D12:4.0, D13:3.5, D14:4.3)

After generating the correlation matrix for the possible pairwise combinations of probes of the same gene across the 14 data points, I want to flatten only the upper (or lower) triangle of the correlation matrix and write the output in CSV format as shown below.

ProbeName_1	ProbeName_2	Gene	PearsonCorrelationValue	Pvalue
A1	A2	A	-0.129	0.661
A1	A3	A	-0.176	0.547
A1	A4	A	-0.106	0.718
A1	A5	A	-0.084	0.776
A2	A3	A	0.963	0.000
A2	A4	A	0.932	0.000
A2	A5	A	0.914	0.000
A3	A4	A	0.922	0.000
A3	A5	A	0.883	0.000
A4	A5	A	0.882	0.000
B1	B2	B	-0.328	0.253
B1	B3	B	0.900	0.000
B1	B4	B	0.987	0.000
B2	B3	B	-0.084	0.774
B2	B4	B	-0.322	0.261
B3	B4	B	0.882	0.000
C1	C2	C	0.888	0.000
C1	C3	C	0.542	0.045
C2	C3	C	0.658	0.011
D1	D2	D	0.993	0.000

I do not know how to handle this data in R. I would be grateful if the experts could help me with this problem.

Note: I do not want the correlation of a probe with itself, i.e., A1 vs A1, A2 vs A2, A3 vs A3, A4 vs A4, or A5 vs A5. I also do not want pairwise combinations of a probe of one gene with a probe of a different gene, i.e., A1 vs B1, B2, B3, B4, or A1 vs C1, C2, C3, or A1 vs D1, D2, or A1 vs E1.
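
To make the request concrete, here is a minimal base-R sketch of the computation described above. It is only an illustration under a few assumptions: the rows for genes C, D and E are copied from the input table, the helper name `pairwise_within_gene` and the output file name are hypothetical, and `cor.test()` is used with its default two-sided test. Splitting by gene keeps probes of different genes apart, `combn()` yields each unordered pair once (so there are no self-pairs and only one triangle of the matrix), and genes with a single probe are skipped.

```r
# Example rows copied from the input table (genes C, D and E only).
# For the full data set, something like read.csv("expression.csv") would
# replace this block (hypothetical file name).
expr <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ProbeName Gene D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14
C1 C 3.1 6.1 3.4 2.5 3.7 3.3 2.7 5.0 2.3 3.1 2.0 3.8 2.6 3.3
C2 C 3.8 7.1 4.8 4.1 4.9 4.5 3.8 5.9 4.0 4.7 4.4 5.1 2.9 4.8
C3 C 3.8 6.1 5.5 5.4 6.3 3.9 3.4 7.8 5.3 5.7 4.8 4.0 3.5 4.3
D1 D 12.2 11.7 11.4 10.5 11.5 11.4 10.7 12.0 11.3 10.5 9.9 11.7 10.5 10.2
D2 D 12.0 11.5 11.3 10.4 11.4 11.4 10.7 11.9 11.2 10.6 9.9 11.7 10.3 10.2
E1 E 2.4 3.3 7.5 3.4 5.8 3.6 1.2 3.5 0.9 2.2 3.1 4.7 7.5 4.0
")

# Every column except the two identifier columns holds expression values.
value_cols <- setdiff(names(expr), c("ProbeName", "Gene"))

# Hypothetical helper: all probe pairs within one gene's rows.
pairwise_within_gene <- function(gene_df) {
  if (nrow(gene_df) < 2) return(NULL)   # a single probe has no pairs
  pairs <- combn(nrow(gene_df), 2)      # each column is one pair (i, j), i < j
  rows <- lapply(seq_len(ncol(pairs)), function(k) {
    i <- pairs[1, k]
    j <- pairs[2, k]
    x <- unlist(gene_df[i, value_cols], use.names = FALSE)
    y <- unlist(gene_df[j, value_cols], use.names = FALSE)
    ct <- cor.test(x, y, method = "pearson")
    data.frame(ProbeName_1 = gene_df$ProbeName[i],
               ProbeName_2 = gene_df$ProbeName[j],
               Gene = gene_df$Gene[i],
               PearsonCorrelationValue = unname(ct$estimate),
               Pvalue = ct$p.value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Split by gene, compute the pairs per gene, and stack the results.
per_gene <- lapply(split(expr, expr$Gene), pairwise_within_gene)
result <- do.call(rbind, unname(per_gene))
rownames(result) <- NULL

print(result)
# write.csv(result, "probe_correlations.csv", row.names = FALSE)  # hypothetical output file
```

With 27000 probes this per-pair loop should still be manageable, since each gene has at most 5 probes (at most 10 pairs per gene), but I have not timed it on the full data.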

Pearson correlation R • 549 views

updated 3.3 years ago by rpolicastro 13k • written 3.3 years ago by Cp.Recker • 0