Hi all,
I have a list of z-scores (each obtained as effect size / standard error of the effect size). I want to combine all the z-scores via Stouffer's method, but without weights. So far I have seen "Stouffer.test" implemented in the metaseq package, but it requires weights along with the z-scores. I have tried the "sumz" method implemented in the metap package, which according to my understanding can work without weights, as follows:
sumz(p, weights = NULL) #where p is the vector of values; in my case z-scores
My understanding of Stouffer's method is
sum of all the z-scores/sqrt of total number of samples
When I compared the result of the "sumz" method as described above, it was not the same as the result of the above formula, which I calculated in Excel using the following:
=SUM(C2:C8219)/SQRT(8218)
My question is: is there any R function that works like the above (sum of all the z-scores/sqrt of total number of samples) so that I can cross-check the results? Or do I have to do it manually?
Thanks!
Maybe
sum(p)/sqrt(numberOfSamples)
?
What does p stand for here? z-scores?
You mentioned that p is a vector of z-scores:
sum(p)/sqrt(8218)
is exactly the same thing you are trying to do in Excel with =SUM(C2:C8219)/SQRT(8218)
Thank you for the clarification, zx8754. I mistook "sum(p)" for a sum of p-values instead of z-scores. Now that I have the combined z-score, I am confused about how to interpret it. Shall I convert the z-scores into p-values (one- or two-tailed?) to see the significance? Apologies if the question seems naive. I am pretty new to genetics and statistics.
Hey, you stated that p is the vector of values, in your case z-scores. So, you have a vector, p, that contains Z-scores. The actual sumz() function from the metap package expects p-values, not Z-scores. [source: https://cran.r-project.org/web/packages/metap/metap.pdf]
If you have just Z-scores and do not want to consider weights, you can indeed calculate the overall Z-score by Stouffer's method using:
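A minimal sketch of the unweighted calculation in base R, assuming the Z-scores sit in a numeric vector. The vector name `z`, the helper `stouffer_z()` and the example values are illustrative and not from the thread. The last two lines show the route a p-value-based function would need: convert each Z-score to a one-sided p-value first, then map back.

```r
# Hypothetical example data: a small vector of Z-scores
z <- c(1.2, -0.4, 2.5, 0.8, -1.1)

# Unweighted Stouffer: sum of the Z-scores divided by the square root
# of the number of Z-scores being combined
stouffer_z <- function(z) {
  z <- z[!is.na(z)]            # drop missing values before counting
  sum(z) / sqrt(length(z))
}

z_combined <- stouffer_z(z)

# Same quantity via one-sided (upper-tail) p-values, which is the kind
# of input a p-value-based combiner expects
p_upper  <- pnorm(z, lower.tail = FALSE)
z_from_p <- sum(qnorm(p_upper, lower.tail = FALSE)) / sqrt(length(p_upper))
```

Both `z_combined` and `z_from_p` give the same number, which is also what the Excel formula computes when the divisor is the count of Z-scores.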
Thanks for the clarification, Kevin Blighe. I have combined the z-scores using Stouffer's method, but I am not sure how to interpret the result.
In that case, perhaps you should consider why you chose to use the test in the first instance (?). Do you even need to use it?
I am trying to find the differentially expressed genes present within the tissues. So I have the z-scores for all the genes in the tissues. I have combined the z-scores for the genes in each tissue and now I want to see if I could identify or rank the tissues based upon their pathogenicity using the z-scores?
Oh, I see, you now have a combined Z-score for each tissue. Are the results what you expected, if you order them high-to-low?
I don't know how to interpret the z-scores. I am assuming that I need to convert the z-scores into p-values and confidence intervals first?
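For reference, the conversion from a combined Z-score to a p-value can be sketched in base R as below. This assumes the combined Z-score follows a standard normal distribution under the null, which holds for Stouffer's method only if the individual Z-scores are independent; genes within a tissue are often correlated, so treat the p-values as approximate. The value of `z_combined` is made up for illustration.

```r
# Hypothetical combined Z-score for one tissue
z_combined <- 2.1

# Two-sided p-value: use when deviations in either direction count
p_two_sided <- 2 * pnorm(-abs(z_combined))

# One-sided (upper-tail) p-value: use when only positive effects count
p_upper <- pnorm(z_combined, lower.tail = FALSE)
```

Whether the one- or two-tailed version is appropriate depends on whether the direction of the effect matters for the question being asked.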
Hey, oh, then why did you choose that test if you do not know how to interpret the result? You must have seen it in a publication, right? Are you sure that you need to do the analysis that you are doing?
If you feel that you need greater statistical advice, then you could try CrossValidated (StackExchange), which is more aligned toward statistics. Biostars is a broad/general forum for bioinformatics.
Thank you, Kevin, for the reply. Actually, my main goal is to combine either z-scores or p-values so that a single value can represent a single tissue. I was using Stouffer's method initially, but it did not give the expected results because Stouffer's method takes into account the direction of the effect of each gene (positive or negative), which we are not considering at this stage. So now I am exploring some other methods.
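The cancellation described above is easy to see on a toy vector. The sketch below also shows one direction-agnostic alternative, which is a different technique from Stouffer's method and is not something proposed in this thread: under the null, the sum of squared independent Z-scores follows a chi-square distribution with as many degrees of freedom as there are Z-scores. The example values are made up, and independence of the Z-scores is an assumption.

```r
# Hypothetical Z-scores: strong effects in both directions
z <- c(3, -3, 2.5, -2.5)

# Stouffer's combined Z: opposite signs cancel out
stouffer <- sum(z) / sqrt(length(z))

# Direction-agnostic alternative (not Stouffer's method):
# sum of squared Z-scores, compared against chi-square with k degrees of freedom
chisq_stat <- sum(z^2)
p_chisq <- pchisq(chisq_stat, df = length(z), lower.tail = FALSE)
```

Here `stouffer` is zero even though every gene has a large effect, while `chisq_stat` is large because squaring discards the sign.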
What if you only consider the genes that have positive Z scores? I think that most define a tissue by what is highly expressed, not by what is not expressed. I find better results this way, too.
Another method you may consider is to simply define a list of genes for each tissue based on Z>2 or Z>3, and then use GSVA to enrich your data against these lists. This will then return 'scores' for the samples in your data for each tissue. As in, it will say by how much each tissue is enriched in your data.
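Building the per-tissue gene lists from a Z > 2 cutoff could look like the sketch below. It assumes the Z-scores are arranged as a matrix with genes in rows and tissues in columns; the matrix `z_mat`, the gene and tissue names, and the values are all hypothetical. The GSVA call itself is left out here because its interface depends on the installed version.

```r
# Hypothetical Z-score matrix: genes in rows, tissues in columns.
# matrix() fills column by column, so each line of values below is one tissue.
z_mat <- matrix(
  c( 2.5, 0.1, -3.0,   # liver
     0.3, 3.2,  2.1,   # brain
    -2.2, 0.4,  0.0),  # heart
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("liver", "brain", "heart"))
)

threshold <- 2

# One character vector of gene names per tissue, keeping genes with Z > threshold
gene_sets <- setNames(
  lapply(colnames(z_mat), function(tissue) {
    rownames(z_mat)[z_mat[, tissue] > threshold]
  }),
  colnames(z_mat)
)
```

The resulting named list is the usual shape for gene sets passed to enrichment tools. Replacing `z_mat[, tissue] > threshold` with `abs(z_mat[, tissue]) > threshold` would keep genes regardless of direction, and tissues with no gene above the cutoff end up with an empty set that should be dropped before enrichment.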
Hi. Kevin. Thanks for the reply. At this stage I am trying to find the differential expression of tissues regardless of high expression or decreased expression of genes (if this makes sense).
In the sentence you used above, what do you mean by "This will then return 'scores' for the samples in your data for each tissue"? What does the word "sample" refer to? Can it refer to the "genes" in a particular tissue?
Hey, GSVA will take this data:
It then runs its algorithm against:
GSVA will then return:
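As a rough illustration of the shapes involved (this is not GSVA itself): the input is an expression matrix with genes in rows and samples in columns, it is scored against a named list of gene sets, and the result has one row per gene set and one column per sample. So "sample" means a column of the expression data, not a gene. In the sketch below every name and value is hypothetical, and a plain per-sample mean stands in for the real GSVA enrichment score.

```r
set.seed(1)

# Hypothetical expression matrix: 4 genes (rows) x 3 samples (columns)
expr <- matrix(
  rnorm(4 * 3), nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical gene sets, one per tissue
gene_sets <- list(
  tissueA = c("gene1", "gene2"),
  tissueB = c("gene3", "gene4")
)

# Stand-in score (NOT the GSVA algorithm): mean expression of each
# set's genes within each sample
scores <- t(sapply(gene_sets, function(genes) {
  colMeans(expr[genes, , drop = FALSE])
}))

dim(scores)  # gene sets x samples
```

Each cell of `scores` is the score of one tissue's gene set in one sample, which is the layout to expect from the real enrichment output as well.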