Classifying the samples based on zscore of a specific gene
1
0
5.2 years ago
Biologist ▴ 290

I'm interested in checking the association of a gene with some clinical parameters. For that, I'm classifying the samples into high and low groups based on the z-score of one gene, GABRD. I have the FPKM data and have calculated z-scores from it.

I took the cutoff Z = 1 (a very relaxed threshold).

So, samples with z-score >= 1 are classified as GABRD high. But I don't see any samples with z-score <= -1 to classify as GABRD low.

Is it OK if I take z-score >= 1 as high and z-score <= 1 as low?

Thank you.

RNA-Seq R geneexpression zscore • 5.5k views
ADD COMMENT
1

I think that something went wrong with the analysis. Could you provide a plot of your data? In R it can be made with plot(density(data)); you may remove the names and all the IDs. Having no samples with z-score < -1 is very, very suspicious. Most probably you should not use z-scores. You can use z-scores only if your random variable is distributed in a roughly bell-shaped manner (see the answer below). Your distribution is likely right-skewed (https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/skewed-distribution/ ).

Here is the short answer (or rather, the absence of a definitive one) on why you should not use z-scores for skewed data:

https://stats.stackexchange.com/questions/32357/can-i-use-a-z-score-with-skewed-and-non-normal-data
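To make the point concrete, here is a small simulation sketch of why a skewed, non-negative variable can end up with no z-scores below -1. The data are simulated log-normal values standing in for FPKM (an assumption, not the poster's data), and `moment_skewness` is a hypothetical helper written only for this illustration.

```r
set.seed(42)

# Simulated, strongly right-skewed, non-negative "FPKM-like" values
# (an assumption for illustration; not real TCGA data).
fpkm_sim <- rlnorm(600, meanlog = 0, sdlog = 1.5)

# Classic z-score: centre on the mean, scale by the standard deviation.
z_sim <- (fpkm_sim - mean(fpkm_sim)) / sd(fpkm_sim)

# Hypothetical helper: moment-based sample skewness (> 0 means right-skewed).
moment_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}
skew_sim <- moment_skewness(fpkm_sim)

# The data cannot go below 0, so no z-score can be smaller than -mean/sd.
# Whenever sd > mean, that lower bound lies above -1.
lower_bound    <- -mean(fpkm_sim) / sd(fpkm_sim)
n_below_minus1 <- sum(z_sim <= -1)

# Draw the density only in an interactive session.
if (interactive()) plot(density(z_sim), main = "z-scores of skewed data")
```

If the real data behave like this, the missing "low" group is a property of the distribution rather than of the biology.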

ADD REPLY
4
5.2 years ago

Ideally, you should be aiming for an absolute Z of 1.96 as the cut-off. For a two-tailed test on a standard normal distribution, this corresponds to p = 0.05. That said, you do not have to define the Z-score cut-offs in terms of probabilities; just be aware that Z = 1 is not a statistically significantly heightened level.

Here, this graphic is pretty neat: [image: standard normal distribution]

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]
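The link between Z cut-offs and tail probabilities can be checked directly in R with pnorm() and qnorm(). This is only a sketch under the assumption that the scores follow a standard normal distribution; `two_tailed_p` is a helper name made up for the example.

```r
# Two-tailed probability of observing |Z| >= z under a standard normal.
two_tailed_p <- function(z) 2 * pnorm(-abs(z))

p_196 <- two_tailed_p(1.96)  # close to 0.05
p_1   <- two_tailed_p(1)     # roughly 0.32, far from significant

# Expected fraction of values below -1 (one tail only), about 16%.
frac_below_minus1 <- pnorm(-1)

# Going the other way: the cut-off that matches a two-tailed p of 0.05.
z_for_p05 <- qnorm(1 - 0.05 / 2)
```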

Also, defining 'low' as Z <= 1 is somewhat misleading, as any Z-score greater than 0 is above the mean of your dataset and thus technically represents heightened expression.

Important to consider:

  • how have you pre-processed your data?
  • how have you calculated Z-scores? (by row?; by column?; ...just using the entire dataset?)

Kevin

ADD COMMENT
0

I got the FPKM expression data from TCGA and converted it to z-scores using the zFPKM function. So, how should I classify now? I don't have any z-score values above 1.96, and none below -1.96.

ADD REPLY
0

Have you log-transformed the FPKM values before computing the z-scores?
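For reference, a minimal sketch of what log-transforming before a plain z-score could look like. The pseudocount of 1 and the simulated input are assumptions for illustration, and `log_zscore` is a hypothetical helper; this is not what zFPKM does internally.

```r
# Hypothetical helper: log2-transform (with a pseudocount to handle zeros),
# then compute an ordinary z-score on the log scale.
log_zscore <- function(fpkm, pseudocount = 1) {
  log_expr <- log2(fpkm + pseudocount)
  (log_expr - mean(log_expr)) / sd(log_expr)
}

set.seed(1)
# Simulated right-skewed values standing in for one gene's FPKM.
fpkm_sim <- rlnorm(600, meanlog = 1, sdlog = 1)

z_raw <- (fpkm_sim - mean(fpkm_sim)) / sd(fpkm_sim)
z_log <- log_zscore(fpkm_sim)

# Count how many samples fall at or below -1 with and without the log step.
n_low <- c(raw = sum(z_raw <= -1), logged = sum(z_log <= -1))
```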

ADD REPLY
0

No, I didn't log-transform. I thought I had to, but in the zFPKM documentation there is a note saying that the data is not log2 transformed.

ADD REPLY
0

Try applying zFPKMPlot from https://bioconductor.org/packages/release/bioc/vignettes/zFPKM/inst/doc/zFPKM.html and check whether your distribution is right-skewed. Not having z-scores less than -1 is, khm, totally unbelievable if the analysis was correct. As you can see from the plot above, roughly 16% of your values should be < -1.

ADD REPLY
0

I took the z-scores of the GABRD gene, and the density plot looks like this.

ADD REPLY
2

Well, it looks bad. It is not right-skewed, but you have two options. Either you cut off the left tail (these are probably technical artifacts, but you will have to work that out yourself), or you use Qn from the robustbase package (https://cran.r-project.org/web/packages/robustbase/robustbase.pdf) as the measure of spread in place of the standard deviation, with the median as the measure of central tendency. Something like: z-score = (data - median(data)) / Qn(data)
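A rough sketch of that robust z-score in R is given here. `robust_zscore` is a hypothetical helper; it defaults to mad() from base R so that it runs without extra packages, and switches to robustbase::Qn only if robustbase is installed. The input is simulated with a few artificial low outliers, purely for illustration.

```r
# Hypothetical helper: centre on the median and scale by a robust
# estimate of spread (MAD by default; Qn can be passed in instead).
robust_zscore <- function(x, scale_fun = stats::mad) {
  (x - stats::median(x)) / scale_fun(x)
}

set.seed(7)
# Simulated expression values plus a heavy left tail of outliers.
expr_sim <- c(rnorm(595, mean = 5, sd = 1), rep(-10, 5))

z_mad <- robust_zscore(expr_sim)

# Use Qn as suggested above when the robustbase package is available.
if (requireNamespace("robustbase", quietly = TRUE)) {
  z_qn <- robust_zscore(expr_sim, scale_fun = robustbase::Qn)
}
```

Both mad() and Qn() are scaled to agree with the standard deviation for normally distributed data, so the usual cut-offs keep roughly the same meaning.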

ADD REPLY
1

OK. I took the FPKM data of that gene and applied the function below to get the z-scores.

zscore<- function(x){
    z<- (x - mean(x)) / sd(x)
    return(z)
}

gabrd_z <- zscore(gabrd_fpkm)

Then I made a density plot of gabrd_z. I see that it is right-skewed.

[image: density plot of the GABRD z-scores]

But which cutoff should I now use to classify the samples into high and low?

ADD REPLY
1

Nah, it does not look that bad. The skewness may be neglected if you are doing a rough analysis. So now you have REAL z-scores, and this is good =) You may proceed with your analysis. The choice of cutoff will depend on what you are trying to say with these values (what does high mean for you, and what does low mean for you, from the biological perspective?)

ADD REPLY
0

So, basically I want to classify around 600 samples into GABRD high and GABRD low groups and check the association with some clinical parameters. I want to use all 600 samples for the analysis. But if I take +1.96 and -1.96 as the cutoffs for high and low, I may be able to use only 50 samples. So I'm really confused about the basis on which I should choose the cutoff.

Can I consider all the samples with positive values as high and those with negative values as low?

ADD REPLY
0

I'd recommend using regression for the association, with this z-score as a continuous predictor. Here is a useful explanation: https://stats.stackexchange.com/questions/16565/what-is-the-effect-of-dichotomising-variables .

ADD REPLY
0

A small request for help, please: how can this dichotomisation be done on z-score values in R?

Can you please give an example?

ADD REPLY
1

No-no, dichotomization is what you're trying to do. Dichotomization is basically dividing your variable into 2 groups (high or low expression). This procedure is not recommended in general. Put your scores into the regression model and check the association without dividing into groups. For example, to predict the weight of people based on their height, you would not divide height into 2 groups (tall and short people), but use the raw value in centimeters instead.
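As a rough illustration of what is lost by dichotomizing, the sketch below fits the same simulated outcome once with the continuous score and once with a high/low split at 0. Everything here is simulated; `clinical_outcome`, the effect size of 0.5, and the split point are assumptions made up for the example.

```r
set.seed(1)
n <- 600

# Simulated z-scores and a continuous outcome that depends on them linearly.
z_score          <- rnorm(n)
clinical_outcome <- 0.5 * z_score + rnorm(n)

# Model 1: z-score used as a continuous predictor.
fit_continuous <- lm(clinical_outcome ~ z_score)

# Model 2: the same score dichotomized into low/high at 0.
expr_group <- factor(ifelse(z_score >= 0, "high", "low"),
                     levels = c("low", "high"))
fit_dichotomised <- lm(clinical_outcome ~ expr_group)

# Share of variance explained by each model.
r2_continuous   <- summary(fit_continuous)$r.squared
r2_dichotomised <- summary(fit_dichotomised)$r.squared
```

In this toy setting the continuous model explains noticeably more variance, which is the information that the split throws away.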

ADD REPLY
0

OK, sorry, I misunderstood. But for my analysis, division into groups is what I want.

ADD REPLY
1

Then try different thresholds (±1.96, ±1.28, ±1.04, etc., i.e. qnorm(p) for some round p between 0 and 1) and choose the one that gives you a significant p-value =)

No, seriously: in that case choose ±1.96, it sounds reasonable. You'll have only 50 samples, but that's what you want.

And drawing a scatterplot, plot(clinical_outcome ~ z_score), is always good practice.
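If you do go with symmetric cut-offs, a minimal sketch of the grouping step might look like this. `classify_by_z` is a hypothetical helper and the seven z-scores are made-up numbers for illustration only.

```r
# Hypothetical helper: label samples as low / intermediate / high using a
# symmetric cut-off (default: the two-tailed 5% point, about 1.96).
classify_by_z <- function(z, cutoff = qnorm(0.975)) {
  grp <- ifelse(z >= cutoff, "high",
                ifelse(z <= -cutoff, "low", "intermediate"))
  factor(grp, levels = c("low", "intermediate", "high"))
}

# Made-up z-scores, for illustration only.
z_example <- c(-2.5, -1.2, -0.3, 0.0, 0.8, 1.3, 2.1)

grp_strict  <- classify_by_z(z_example)                       # about +/-1.96
grp_relaxed <- classify_by_z(z_example, cutoff = qnorm(0.9))  # about +/-1.28

# Keep only the clearly high or low samples for a two-group comparison.
extremes <- z_example[grp_strict != "intermediate"]
```

The "intermediate" samples are the ones that drop out of a high-versus-low comparison, which is why stricter cut-offs leave fewer samples.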

ADD REPLY
0

Hi, a small request for help again. I have data like the following:

df:

Samples GABRD   Gender  Stage
Sample1 0.002   Female  A
Sample2 0.233   Female  A
Sample3 1.527   Female  B
Sample4 -3.45   Male    C
Sample5 0.79    Male    B
Sample6 2.19    Male    A
Sample7 0.42    Female  C
Sample8 -1.01   Male    A
Sample9 0.627   Female  B
Sample10 -0.23  Male    B

To check the relationship, is it fine to just use lm as below?

lm(GABRD ~ Gender + Stage, data = df)

Or do I have to check the relationship with Gender and Stage separately?

lm(GABRD ~ Gender, data = df)
lm(GABRD ~ Stage, data = df)
ADD REPLY
1

Hi, in my opinion: only together. Maybe include an interaction term (Gender * Stage) in the model (you have enough samples, as I understand it). Be sure to perform regression diagnostics: https://data.library.virginia.edu/diagnostic-plots/
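As a sketch of that suggestion, the snippet below rebuilds the ten example rows from the question, fits the additive and interaction models, and compares them. With only ten rows this is purely illustrative; the real fit would use all ~600 samples. The object names are made up for the example.

```r
# The ten example rows posted above, typed in by hand.
df <- data.frame(
  Samples = paste0("Sample", 1:10),
  GABRD   = c(0.002, 0.233, 1.527, -3.45, 0.79, 2.19, 0.42, -1.01, 0.627, -0.23),
  Gender  = factor(c("Female", "Female", "Female", "Male", "Male",
                     "Male", "Female", "Male", "Female", "Male")),
  Stage   = factor(c("A", "A", "B", "C", "B", "A", "C", "A", "B", "B"))
)

# Both covariates together (additive model).
fit_additive <- lm(GABRD ~ Gender + Stage, data = df)

# Same model plus the Gender-by-Stage interaction.
fit_interaction <- lm(GABRD ~ Gender * Stage, data = df)

# F-test: does the interaction improve on the additive model?
model_comparison <- anova(fit_additive, fit_interaction)

# Standard diagnostic plots (residuals vs fitted, Q-Q, etc.);
# drawn only in an interactive session.
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(fit_additive)
}
```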

ADD REPLY
0

Alternatively, since I have the FPKM expression data for that gene, can I take the median as the cutoff and classify the samples into high and low?

ADD REPLY
0

The plot that you've shown is not a plot of z-scores: it is not centered around 0. Yes, you can use the median, as well as any other value, to call the expression of your gene high or low. But what do you want to get from such a classification?
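For completeness, a median split takes only a couple of lines in R. The FPKM values below are invented for illustration, and `expr_group` is just an example name.

```r
# Invented FPKM values for one gene across ten samples (illustration only).
fpkm_example <- c(0.1, 0.5, 2.3, 8.9, 0.0, 1.2, 15.4, 0.7, 3.3, 0.2)

# Use the median as the cut-off: values above it are "high", the rest "low".
cutoff <- median(fpkm_example)
expr_group <- factor(ifelse(fpkm_example > cutoff, "high", "low"),
                     levels = c("low", "high"))

# Group sizes; a median split gives (nearly) equal halves by construction.
group_sizes <- table(expr_group)
```

A median split uses every sample, but samples just above and just below the cut-off end up in different groups despite having almost identical expression.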

ADD REPLY
0

The figure you posted shows the standard (population) normal distribution, whereas the actual question concerns the distribution of an i.i.d. sample. I think point estimation should be done first, to estimate the μ and delta of the population distribution from the observed data.

ADD REPLY
0

sigma, not delta =)

ADD REPLY
