zScore conversion: zFPKM vs scale function
4.3 years ago
camillab. ▴ 160

Hi,

I tried the zFPKM package to convert my datasets into Z-scores, but I got odd results when identifying the top 10 loadings on PC1 of the PCA (see my previous post for a fuller explanation of the problem: biostar question). So I computed the Z-scores using scale(center = T, scale = T), following an R-bloggers post, and the Z-scores are different. Looking at the zFPKM manual/paper, I noticed that it seems to have been developed more for dealing with low-count reads than for a pure conversion of FPKM into Z-scores (or at least that was my understanding). Which Z-score conversion should I trust, then? Every downstream analysis I perform gives me very different results.

Thank you

Camilla

RNA-Seq zscores zFPKM scale • 3.1k views
4.3 years ago

The underlying issue here is the FPKM values, which are not conducive to any analysis where comparisons across samples are to be performed. If you can obtain the raw data and re-process, that is the ideal way to go.

I have worked with the zFPKM author and am still in touch with him. The function / package was developed primarily as a QC method, i.e., for filtering out lowly expressed genes. The underlying issue, to which I allude in my first statement in this answer, is that, for example, an FPKM value of 10 in one sample is not the same as 10 in another; thus, setting a value of 10 as a cut-off for low expression across all samples in your cohort is improper. Through transformation via zFPKM, however, a single cut-off value can be used across all samples in your dataset. Whether you continue with the data as FPKM or zFPKM-scaled FPKMs after this filtering is not covered.

So, the zFPKM transformation is useful and performs one type of transformation to Z-scores. The scale() function performs another type of Z transformation column-wise. On the other hand, if you wanted global Z-scores, then you'd have to simply do:

(x - mean(x)) / sd(x)

There is no issue using scale(), but keep in mind that FPKMs are never ideal for cross-sample comparisons, and also that scale() is only performing one type of Z-transform.
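
To make the difference between the two transforms concrete, here is a minimal sketch on a made-up matrix (genes in rows, samples in columns; the values and object names are purely illustrative and not from the original post). It is only meant to show that scale() standardises each column separately, whereas the formula above uses a single mean and standard deviation for the whole matrix.

```r
# Toy matrix: rows are genes, columns are samples (values are made up).
expr <- matrix(c(1, 2, 3, 4,
                 10, 20, 30, 40,
                 5, 5, 6, 8),
               nrow = 4,
               dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

# Column-wise Z-scores: each sample (column) gets its own mean and sd.
z_col <- scale(expr, center = TRUE, scale = TRUE)

# Global Z-scores: one mean and one sd computed over all values.
z_global <- (expr - mean(expr)) / sd(expr)

# After scale(), every column has mean 0 and sd 1.
round(colMeans(z_col), 10)
round(apply(z_col, 2, sd), 10)

# The global version has overall mean 0, but its columns generally do not.
round(mean(z_global), 10)
round(colMeans(z_global), 3)
```

Because the two versions answer different questions (position within a sample versus position within the whole dataset), it is expected that they give different numbers for the same gene.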

Trust that this helps!

Kevin

4.3 years ago
A. Domingues ★ 2.7k

Speaking as someone who briefly used the zFPKM package in the past: its goal is not to do the sort of centering/scaling one usually does before a PCA. The goal is to generate a score to select expressed genes using a somewhat experimentally validated prior. If all you want to do is sample clustering (PCA) based on gene expression values, go with scale(), which is fairly standard for this type of analysis.
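
For reference, one common way to set this up is sketched below on simulated data. The matrix, the log2(x + 1) step and the object names are assumptions for illustration, not the original poster's pipeline. Note that prcomp() does the centering and scaling itself, per gene, once the matrix is transposed so that samples are in rows.

```r
set.seed(1)

# Simulated FPKM-like matrix: 200 genes (rows) x 6 samples (columns).
fpkm <- matrix(rexp(200 * 6, rate = 0.1), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

# log2 with a pseudocount of 1 (an assumption) to reduce the right skew.
log_fpkm <- log2(fpkm + 1)

# prcomp() expects samples in rows, hence t(); center/scale. then
# standardise each gene across the samples.
pca <- prcomp(t(log_fpkm), center = TRUE, scale. = TRUE)

# PC1 loadings: one value per gene, ranked by absolute size.
pc1 <- pca$rotation[, "PC1"]
top10 <- head(sort(abs(pc1), decreasing = TRUE), 10)
names(top10)
```

Since the loadings depend directly on how each gene was scaled, it is not surprising that the top 10 genes change when the input is switched between zFPKM output and scale() output.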

I see that Kevin was quicker and has a more comprehensive answer :)

Generating a score was exactly what I was/am aiming for, and converting to Z-scores as suggested seemed a great idea, at least for an initial exploratory approach. I did not want to use scale() because I usually use it for PCA, as you suggested. My idea was: if I convert the RPKM/FPKM of two different datasets (A and B) into Z-scores, then I can check what score is assigned to gene X in dataset A and in dataset B, and indirectly get an idea of how similar/different the expression of gene X is in the two datasets. Does that make sense? I know I should get the raw data, and I am working on it, but I thought the suggested approach was a good way to get at least a general idea of A and B.

You can see how it was done in this paper and others from the same lab (Huttner): https://elifesciences.org/articles/32332

Since you are interested in differences in response to treatment, could you not use logFC (treatment / control) instead of FPKMs? That way the expression is normalized within dataset.

That's what I am doing for the treatment/control dataset. But if I want to compare different untreated controls (to answer the question: is the level of gene X the same across species/ages?), I cannot use logFC, because what would I normalize to or use as the reference? If I am comparing, e.g., mouse at different ages vs adult zebrafish, I could potentially calculate logFC for the mouse by deciding that the adult age is my reference, but I can't do the same for zebrafish. That's why I wanted to calculate Z-scores on each dataset separately and then compare the values across the datasets for the specific genes I am interested in.

Indeed. In that case, Z > +1.96 corresponds to p < 0.05 (two-tailed) under a standard normal distribution, and likewise Z < -1.96. Perhaps this is what you had planned to do?
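
As a rough illustration of that correspondence, the sketch below converts a few made-up Z-scores into two-tailed p-values with pnorm(). It assumes the Z-scores follow a standard normal distribution, which is not guaranteed for values derived from FPKM, so treat the resulting p-values as a heuristic only.

```r
# A few example Z-scores (made-up values).
z <- c(-2.5, -1.96, 0, 1.5, 1.96, 3)

# Two-tailed p-value under a standard normal: area in both tails beyond |z|.
p_two_sided <- 2 * pnorm(-abs(z))
round(p_two_sided, 4)

# The exact cut-off behind the familiar "1.96" (approximately 1.959964).
cutoff <- qnorm(0.975)

# Flag values beyond the cut-off in either direction.
abs(z) > cutoff
```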

So, given that my approach is partially wrong since I should work on raw data, is the following correct? 1. Compute Z-scores on each dataset separately: (sample - mean(samples)) / sd(samples), or would zFPKM or scale() be better? 2. Filter out the genes with p > 0.05, and then check whether the remaining genes are shared across the different datasets?

Apologies again, but I was trying to figure out why applying zFPKM to my datasets is wrong (and also how serious the error is). I am confused by your sentence:

"The goal is to generate a score to select expressed genes using a somewhat experimentally validated prior."

What does it mean exactly? In practical terms, does it mean that zFPKM should be used only to select genes with low read counts that turned out to be valid/important in other experiments/set-ups?

I am not sure when A. Domingues will next log in; however, I see no major issue using zFPKM values downstream, keeping in mind that the underlying / fundamental issue for which zFPKM was developed is that FPKM and RPKM units are not ideal for any cross-sample analyses.

zFPKM puts RPKM and FPKM units on a distribution that is more conducive to most downstream applications. You could check the distribution of your data before and after using zFPKM via the hist() function, to give an idea. So, although zFPKM was initially developed to determine expressed | not expressed, the data that it produces can be used downstream.
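
A minimal sketch of such a before/after check is given below. To keep it runnable without the package, it uses simulated values and a simple stand-in transform (a Z-score of log2(FPKM)); this stand-in is an assumption for illustration and is not the zFPKM algorithm. With real data, the package output would take the place of `z_like`.

```r
set.seed(42)

# Simulated FPKM values for one sample (log-normal, purely illustrative).
fpkm <- rlnorm(5000, meanlog = 1, sdlog = 2)

# Stand-in transform, NOT the zFPKM algorithm: Z-score of log2(FPKM).
# With the package installed, its transformed values would be used here.
log_fpkm <- log2(fpkm)
z_like <- (log_fpkm - mean(log_fpkm)) / sd(log_fpkm)

# Side-by-side histograms: raw FPKM is heavily right-skewed, while the
# transformed values are roughly symmetric around 0.
op <- par(mfrow = c(1, 2))
hist(fpkm, breaks = 50, main = "FPKM", xlab = "FPKM")
hist(z_like, breaks = 50, main = "Transformed", xlab = "Z-like score")
par(op)
```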

On the zFPKM landing page ( https://www.bioconductor.org/packages/release/bioc/html/zFPKM.html ), they give the publication on which their calculations are based.

I meant exactly what Kevin Blighe said: "zFPKM was initially developed to determine expressed | not expressed". The experimentally validated bit was referring to the original zFPKM publication in which they used chromatin state data to find a threshold for expressed | not expressed. I have no experience with zFPKM values for downstream analysis, I simply assumed that scale would be straightforward enough for most applications. My apologies for the confusion.
