Hi,
I have tried the zFPKM package to convert my datasets into Z-scores, but I found weird results when I was trying to identify the top 10 loadings in PC1 of the PCA (for a better explanation of my problem, see the previous post: biostar question). So I tried to compute the Z-scores using scale(center = T, scale = T) according to an R-bloggers post, and the Z-scores are different. Looking at the manual/paper of zFPKM, I noticed that it seems to have been developed more for low-count reads than for the pure conversion of FPKM into Z-scores (or at least that was my understanding).
Which Z-score conversion should I trust, then?
Every downstream analysis I perform gives me very different results.
Thank you
Camilla
I see that Kevin was quicker and has a more comprehensive answer :)
Generating a score was exactly what I was/am aiming for, and converting into Z-scores as suggested was a great idea, at least for an initial exploratory approach. I did not want to use scale() because I usually use it for PCA, as you suggested. My idea was: if I convert the RPKM/FPKM of two different datasets (A and B) into Z-scores, then I can check what score is assigned to gene X in dataset A and in dataset B, and indirectly get an idea of how similar/different the expression of gene X is in the two datasets. Does it make sense? I know I should get the raw data, and I am working on it, but I thought the approach suggested was a good way to get at least a general idea of A and B. You can see how it was done in this paper and others from the same lab (Huttner): https://elifesciences.org/articles/32332
Since you are interested in differences in response to treatment, could you not use logFC (treatment / control) instead of FPKMs? That way the expression is normalized within dataset.
That's what I am doing for the treatment/control datasets. But if I want to compare different untreated controls (to answer the question: is the level of gene X the same across species/ages?), I cannot use logFC, because what am I supposed to normalize to or use as a reference? If I am comparing, e.g., mouse at different ages vs. adult zebrafish, I could potentially calculate logFC for the mouse by deciding that the adult age is my reference, but I can't do the same for zebrafish. That's why I wanted to calculate Z-scores on each dataset separately and then compare the values across the datasets for the specific genes I am interested in.
Indeed, then |Z| > 1.96 (i.e. Z > +1.96 or Z < -1.96) corresponds to p < 0.05 on a two-tailed test, assuming the Z-scores follow a standard normal distribution. Perhaps this is what you had planned to do?
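To make the Z-to-p correspondence concrete, here is a minimal Python sketch (the thread itself uses R). It assumes the Z-scores are roughly standard normal, which may not hold for FPKM-derived values; the function name two_tailed_p is just an illustrative choice.

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a Z-score under a standard normal distribution."""
    # For a standard normal variable, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Print the p-value for a few example Z-scores, including +/-1.96.
for z in (1.0, 1.96, -1.96, 3.0):
    print(f"Z = {z:+.2f} -> p = {two_tailed_p(z):.4f}")
```

The sign of Z does not matter for the two-tailed p-value, so +1.96 and -1.96 land on the same threshold.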
So, considering that my approach is partially wrong since I should work on raw data, is the following correct?
1. Compute Z-scores on each dataset separately: (sample - mean(samples)) / sd(samples), or better zFPKM, or scale()?
2. Filter out those with p > 0.05 and then check whether the remaining genes are shared across the different datasets?
Apologies again, but I was trying to figure out why applying zFPKM to my datasets is wrong (and also the gravity of the error). But I am confused by your sentence
What does it mean exactly? In practical terms, does it mean that zFPKM should be used only to select genes with low reads that, in other experiments/set-ups, turned out to be valid/important?
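For reference, the plain (value - mean) / sd conversion applied to each dataset separately can be sketched as below. This is a minimal Python illustration, not what zFPKM does internally; the gene names and expression values are invented, and z_scores is a hypothetical helper. Note that R's scale() standardizes each column of a matrix, so the result depends on whether genes or samples are in the columns.

```python
from statistics import mean, stdev

def z_scores(expression):
    """Z-score every gene within one dataset: (value - mean) / sd."""
    values = list(expression.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1), like R's sd()
    return {gene: (value - mu) / sd for gene, value in expression.items()}

# Hypothetical log2(FPKM + 1) values, invented purely for illustration.
dataset_a = {"geneX": 5.0, "gene1": 1.0, "gene2": 2.0, "gene3": 3.0, "gene4": 4.0}
dataset_b = {"geneX": 2.5, "gene1": 8.0, "gene2": 6.5, "gene3": 7.0, "gene4": 9.0}

# Each dataset is standardized on its own, then gene X is compared across them.
z_a = z_scores(dataset_a)
z_b = z_scores(dataset_b)
print(f"geneX: Z in A = {z_a['geneX']:.2f}, Z in B = {z_b['geneX']:.2f}")
```

Here gene X sits above the dataset mean in A and below it in B, which is the kind of relative comparison described above; it says nothing about absolute expression levels.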
I am not sure when A. Domingues will next log in; however, I see no major issue using zFPKM values downstream, keeping in mind that the underlying / fundamental issue for which zFPKM was developed is that FPKM and RPKM units are not ideal for any cross-sample analyses.
zFPKM puts RPKM and FPKM units on a distribution that is more conducive to most downstream applications. You could check the distribution of your data before and after using zFPKM via the hist() function, to give an idea. So, although zFPKM was initially developed to determine expressed | not expressed, the data that it produces can be used downstream. On the zFPKM landing page ( https://www.bioconductor.org/packages/release/bioc/html/zFPKM.html ), they give the publication on which their calculations are based.
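As a rough illustration of the "look at the distribution before and after" check, here is a small Python sketch that prints text histograms. It is only a stand-in for R's hist(): the FPKM values are invented, and a plain log2 transform is used as the "after" state, which is an assumption for illustration and not the zFPKM calculation itself.

```python
import math

def histogram(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def show(label, values):
    """Print one row of '#' marks per bin."""
    print(label)
    for i, count in enumerate(histogram(values)):
        print(f"  bin {i}: {'#' * count}")

# Hypothetical, right-skewed FPKM values, invented for illustration.
fpkm = [0.5, 1, 1, 2, 2, 3, 4, 8, 16, 64, 256, 1024]
log2_fpkm = [math.log2(v) for v in fpkm]

show("raw FPKM", fpkm)
show("log2(FPKM)", log2_fpkm)
```

With skewed input like this, most raw values pile up in the first bin, while the transformed values spread across the bins, which is the sort of change worth looking for in the real data.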
I meant exactly what Kevin Blighe said: "zFPKM was initially developed to determine expressed | not expressed". The experimentally validated bit was referring to the original zFPKM publication, in which they used chromatin state data to find a threshold for expressed | not expressed. I have no experience with zFPKM values for downstream analysis; I simply assumed that scale would be straightforward enough for most applications. My apologies for the confusion.