Question

combining z-scores into a single z-score value

6.3 years ago

Star ▴ 60

Hi all,

I have a list of z-scores (each computed as effect size / standard error of the effect size). I want to combine them with Stouffer's method, but without weights. So far I have found "Stouffer.test" in the metaSeq package, but it requires weights along with the z-scores. I have also tried the "sumz" function from the metap package, which, as I understand it, can work without weights:

sumz(p, weights = NULL)  #where p is the vector of values; in my case z-scores

My understanding of Stouffer's method is

sum of all the z-scores/sqrt of total number of samples

When I compared the result of "sumz" as called above with this formula, the two did not match. I calculated the formula in Excel using:

=SUM(C2:C8219)/SQRT(8218)

My question: is there an R function that works like the formula above (sum of all the z-scores / sqrt of the total number of samples) so that I can cross-check the results, or do I have to do it manually?

Thanks!

excel R statistics • 13k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 6.3 years ago by Star ▴ 60

Maybe sum(p)/sqrt(numberOfSamples) ?

ADD REPLY • link 6.3 years ago by zx8754 12k

What does p stand for here? The z-scores?

ADD REPLY • link 6.3 years ago by Star ▴ 60

You mentioned p is a z-scores vector:

where p is the vector of values; in my case z-scores

sum(p)/sqrt(8218) is exactly the same thing you are trying to do with Excel =SUM(C2:C8219)/SQRT(8218)
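
A minimal R sketch of that calculation, assuming the z-scores are held in a numeric vector. The function name stouffer_unweighted and the toy values are made up for illustration. Using length(z) rather than a hard-coded 8218 keeps the divisor in step with the number of values actually summed, which matters if some cells are empty (Excel's SUM skips blanks, but a typed-in count does not).

```r
# Unweighted Stouffer's method: sum of the z-scores divided by the
# square root of the number of z-scores combined.
stouffer_unweighted <- function(z) {
  z <- z[!is.na(z)]            # drop missing values so the count matches the sum
  sum(z) / sqrt(length(z))
}

# Toy z-scores (made up for illustration)
z <- c(1.2, -0.4, 2.1, 0.8)
combined_z <- stouffer_unweighted(z)
print(combined_z)
```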

ADD REPLY • link 6.3 years ago by zx8754 12k

Thank you for the clarification, zx8754. I mistook the p in "sum(p)" for p-values rather than z-scores. Now that I have the combined z-scores, I am unsure how to interpret them. Should I convert the z-scores into p-values (one- or two-tailed?) to assess significance? Apologies if the question seems naive; I am pretty new to genetics and statistics.

ADD REPLY • link 6.3 years ago by Star ▴ 60

Hey, you stated that:

sumz(p, weights = NULL) -> where p is the vector of values; in my case z-scores

So you have a vector, p, that contains z-scores.

However, the sumz() function from the metap package expects p-values, not z-scores.

Description
Combine p-values using the sum z method

Usage
sumz(p, weights = NULL, data = NULL, subset = NULL, na.action = na.fail)
## S3 method for class 'sumz'
print(x, ...)

Arguments
 - p, A vector of significance values
 - weights, A vector of weights
 - data, Optional data frame containing variables
 - subset, Optional vector of logicals to specify a subset of the p-values
 - na.action, A function indicating what should happen when data contains NAs
 - x, An object of class ‘sumz’
 - ..., Other arguments to be passed through

[source: https://cran.r-project.org/web/packages/metap/metap.pdf]

If you have just z-scores and do not want to use weights, you can indeed calculate the overall z-score by Stouffer's method using:

sum of all the z-scores / sqrt of total number of samples
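
A small sketch of why the two routes can agree once the inputs are p-values. It assumes each z-score is turned into a one-sided (upper-tail) p-value, and it reproduces the equal-weight sum-z calculation by hand with base R rather than calling metap, so the metap line at the end is left as a comment and should be treated as an untested assumption. The toy values are made up.

```r
z <- c(1.2, -0.4, 2.1, 0.8)                   # toy z-scores
p_one_sided <- pnorm(z, lower.tail = FALSE)   # upper-tail p-value for each z-score

# Equal-weight sum-z by hand: map each p-value back to a z-score,
# then sum and divide by the square root of the count.
z_back <- qnorm(p_one_sided, lower.tail = FALSE)
combined_from_p <- sum(z_back) / sqrt(length(z_back))

# The direct formula on the original z-scores
combined_direct <- sum(z) / sqrt(length(z))

print(c(from_p = combined_from_p, direct = combined_direct))

# If metap is installed, this is expected to give the same combined z
# (assumption, not run here):
# metap::sumz(p_one_sided)$z
```

Passing raw z-scores as p would not be comparable, because values outside the 0 to 1 range are not valid p-values.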

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

Thanks for the clarification, Kevin Blighe. I have combined the z-scores using Stouffer's method, but I am not sure how to interpret the result.

ADD REPLY • link 6.3 years ago by Star ▴ 60

In that case, perhaps you should consider why you chose to use the test in the first place. Do you even need to use it?

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

I am trying to find the differentially expressed genes within the tissues, so I have z-scores for all the genes in each tissue. I have combined the z-scores for the genes in each tissue, and now I want to see whether I can identify or rank the tissues by their pathogenicity using the combined z-scores.

ADD REPLY • link 6.3 years ago by Star ▴ 60

Oh, I see, you now have a combined Z-score for each tissue. Are the results what you expected, if you order them high-to-low?

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

I don't know how to interpret the z-scores. I am assuming that I first need to convert the z-scores into p-values and confidence intervals?
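
For reference, a minimal sketch of converting a combined z-score into a p-value in R. It assumes the combined value follows a standard normal distribution under the null, which in turn assumes the individual z-scores are independent; genes within one tissue may well be correlated, so the resulting p-values should be read with caution. The value 1.85 is a toy number.

```r
combined_z <- 1.85   # toy combined z-score

# One-sided: is the combined effect greater than zero?
p_upper <- pnorm(combined_z, lower.tail = FALSE)

# Two-sided: is the combined effect different from zero in either direction?
p_two_sided <- 2 * pnorm(-abs(combined_z))

print(c(one_sided = p_upper, two_sided = p_two_sided))
```

Which tail to use depends on the hypothesis: one-sided if only an effect in a particular direction is of interest, two-sided otherwise.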

ADD REPLY • link 6.3 years ago by Star ▴ 60

Hey, oh, then why did you choose that test if you do not know how to interpret the result? You must have seen it in a publication, right? Are you sure that you need to do the analysis that you are doing?

If you feel that you need greater statistical advice, then you could try CrossValidated (StackExchange), which is more aligned toward statistics. Biostars is a broad/general forum for bioinformatics.

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

Thank you for the reply, Kevin. My main goal is to combine either z-scores or p-values so that a single value can represent each tissue. I initially used Stouffer's method, but it did not give the expected results because it takes into account the direction of each gene's effect (positive or negative), which we are not considering at this stage. So now I am exploring other methods.
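
A toy illustration of the direction issue described here: with Stouffer's method, strong effects of opposite sign cancel. The second half shows one possible direction-agnostic statistic, the sum of squared z-scores. That alternative is not taken from this thread; it is only a sketch and assumes independent z-scores, under which the sum is chi-square distributed with one degree of freedom per z-score.

```r
z <- c(3, -3, 2.5, -2.5)   # toy z-scores with strong effects in both directions

# Stouffer's method: opposite signs cancel, so the combined value is 0
stouffer_z <- sum(z) / sqrt(length(z))
print(stouffer_z)

# Direction-agnostic alternative (an assumption, not from the thread):
# sum of squared z-scores, compared against a chi-square distribution
chisq_stat <- sum(z^2)
p_any <- pchisq(chisq_stat, df = length(z), lower.tail = FALSE)
print(c(chisq = chisq_stat, p = p_any))
```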

ADD REPLY • link 6.3 years ago by Star ▴ 60

What if you only consider the genes that have positive Z scores? I think that most define a tissue by what is highly expressed, not by what is not expressed. I find better results this way, too.
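
A minimal sketch of that idea in R, using made-up values. One caveat: once the z-scores have been filtered by sign, the combined value no longer follows a standard normal distribution under the null, so it is probably better used for ranking tissues than for deriving a p-value.

```r
z <- c(2.4, -1.1, 3.0, -2.2, 0.6)   # toy z-scores for one tissue

z_pos <- z[z > 0]                    # keep only genes with positive z-scores
combined_pos <- sum(z_pos) / sqrt(length(z_pos))
print(combined_pos)
```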

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

Another method you may consider is to simply define a list of genes for each tissue based on Z>2 or Z>3, and then use GSVA to enrich your data against these lists. This will then return 'scores' for the samples in your data for each tissue. As in, it will say by how much each tissue is enriched in your data.
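
The list-building step could look roughly like this in R. The matrix z_mat, the tissue names, and the values are hypothetical; the cutoff of 2 is the Z>2 threshold mentioned above. The result is a named list of gene sets, one per tissue, which is the general shape a gene-set tool takes as input. The GSVA call itself is not shown because its interface differs between package versions.

```r
# Hypothetical gene-by-tissue matrix of z-scores (toy values)
z_mat <- matrix(
  c(3.2, 0.5,
    2.1, 2.8,
    0.3, 3.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("BRCA1", "TP53", "BRCC3"), c("TissueA", "TissueB"))
)

z_cutoff <- 2   # use 3 for the stricter Z>3 definition

# For each tissue, keep the genes whose z-score exceeds the cutoff
signatures <- lapply(colnames(z_mat), function(tissue) {
  rownames(z_mat)[z_mat[, tissue] > z_cutoff]
})
names(signatures) <- colnames(z_mat)
print(signatures)
```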

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k

Hi Kevin, thanks for the reply. At this stage I am trying to find differential expression across tissues regardless of whether genes show increased or decreased expression (if this makes sense).

In your sentence above, what do you mean by "This will then return 'scores' for the samples in your data for each tissue"? What does the word "sample" refer to? Can it refer to the "genes" in a particular tissue?

ADD REPLY • link 6.3 years ago by Star ▴ 60

Hey, GSVA will take this data:

           Sample1  Sample2  Sample3
BRCA1      6        4        3
TP53       3        3        2
BRCC3      7        12       8
...        ...      ...      ...

It then runs its algorithm against a set of gene signatures:

Signature1
TP53; BRCC3; ...

Signature2
BRCA1; TP53

GSVA will then return:

             Sample1  Sample2  Sample3
Signature1   3.4      12.6     8.3
Signature2   2.7      10.4     5.5
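
To make the shapes concrete, here is the same input expressed as R objects. Signature1 is truncated in the example above ("..."), so only its named genes are used. The scoring line is a deliberately simple stand-in (mean expression of each signature's genes per sample) and is not the GSVA algorithm; it is only there to show that the output has one row per signature and one column per sample.

```r
# Expression matrix: genes in rows, samples in columns
expr <- matrix(
  c(6,  4, 3,
    3,  3, 2,
    7, 12, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("BRCA1", "TP53", "BRCC3"),
                  c("Sample1", "Sample2", "Sample3"))
)

# Gene signatures as a named list of gene vectors
signatures <- list(
  Signature1 = c("TP53", "BRCC3"),   # remaining genes elided in the original
  Signature2 = c("BRCA1", "TP53")
)

# Stand-in score (NOT GSVA): mean expression of the signature genes per sample
scores <- t(sapply(signatures, function(genes) {
  colMeans(expr[genes, , drop = FALSE])
}))
print(scores)   # signatures in rows, samples in columns
```

So "sample" refers to a column of the expression matrix, and the genes only enter through the signature definitions.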

ADD REPLY • link 6.3 years ago by Kevin Blighe 89k