Variability explained by a core of genes
1
1
Entering edit mode
5.2 years ago
elb ▴ 260

Hi guys, I need to select a core of genes from an RNA-Seq dataset of 200 samples whose expression is able to explain an x-percent of variability of the entire dataset (i.e. across samples). My point is that x-percent is normally an arbitrary threshold. Could be 75% or 80% according to what the scientist considers to be relevant. Is there a way to set in a quantitative way that threshold? Through simulations for example? In other word is there a way to prove that threshold for example below 75% are no more informative to explain the variability of data? In my mind I consider 75% a reasonable cut-off but I need to explain to non-statisticians and with numbers that values below are not of interest.

Thank you in advance

RNA-Seq gene expression threshold variability • 1.9k views
ADD COMMENT
7
Entering edit mode
5.2 years ago

This can be done via PCA. Basically, if you perform PCA and then have 70% variation explained by PC1, you can then extract the variables / genes that are responsible for the majority of this [70%] variation along PC1. If your PC1 explains a lower number, like 25%, then you may have to look at numerous PCs that account for, e.g., up to 80% variation.

Curiously, I have implemented functionality for this in PCAtools (released with Bioc 3.9), and Aaron Lun has also recently added functionality for identifying the ideal number of PCs to retain.

For Aaron's code, look here: Determine optimum number of PCs to retain

For identifying variables responsible for variation along each PC, look here: Determine the variables that drive variation among each PC

This Bioc package is new, so, please make any recommendations for its future development.

Kevin

ADD COMMENT
0
Entering edit mode

Dear Kevin, thank you very much for your precious help and suggestions. I tried to use PCAtools and I have to say that I like it a lot and I will support it. The unbiased selection of the PCs is very helpful for my purposes. I have just one question. plotloadings force users to select the rangeRetain. In the tutorial 0.01 is the cut-off. In my case I get very few unique genes (29 in total). I know that the choice of a cut-off is "subjective" but I would like to ask you which threshold is the optimal to select a reasonable number of genes representing the accepted PCs (13 in my case)? Thank you in advance!

ADD REPLY
1
Entering edit mode

There is no optimal of which I am aware for that. However, I would suggest that you look at the data stored in the loadings object, as that is the data that is being used by plotloadings(). There, it may be easier for you to identify the best variables to retain.

pcaobj <- PCAtools::pca(x)
pcaobj$loadings
ADD REPLY
0
Entering edit mode

I agree with you about the optimal threshold!

ADD REPLY
0
Entering edit mode

Can I ask you the rationale behind the "offset <- (max(x[, i]) - min(x[, i])) * rangeRetain" cutoff choice that appears in the code? My point is that it gives many genes with loadings around 0 that I expect not to contribute to the explained variance. Why they are retained? Also using a 1% cut off as in the vignette. Thank you in advance!

ADD REPLY
0
Entering edit mode

This would be a discussion point if this work is going to be published. That code produces an offset that is then used to select, e.g., the top 1% of genes that contribute variation to each PC. Given how some PCs contribute very little variation in every PCA experiment, these component (or 'variable') loadings will also have very low values. Component loadings are unitless and are just a weighting, i.e., they indicate the strength of each variable to each PC.

I am open to suggestions on how this could be developed further? It is the one part of PCAtools that still makes me nervous.

Here is a simple example of what it is doing:

rangeRetain <- 0.3333 #33.3333%
x <- c(1,2,3,4,5,6,7,8,9,10)

# for each PC, based on the loadings range, calculate the rangeRetain
# fraction of the range
offset <- (max(x) - min(x)) * rangeRetain

# to obtain upper and lower cut-offs, offset max and min by the offset
uppercutoff <- max(x) - offset
lowercutoff <- min(x) + offset

x[x>uppercutoff]
[1]  8  9 10
x[x<lowercutoff]
[1] 1 2 3
ADD REPLY
0
Entering edit mode

Yes I understand the point. My idea...what about to prioritize genes according to the frequency of high loadings across PCs? Suppose you select top 13 PCs according to elbow (or horn depending on what you desire). Suppose, also, to categorise the loadings in bins (ranges of values. This could be bypassed). Then calculate the cumulative frequency and select only those genes frequently appearing at high loadings across PCs you select. This comes by empirically observing my data. I observed that a gene like TGFB1 appears at high loadings (+/-) in the top 8 PCs I selected with elbow. Could be a good idea?

ADD REPLY
0
Entering edit mode

Thank you. That seems like a reasonable approach. When developing the package, I intentionally kept that part very simple, with the intention of developing it further. Most users of the package will just want a bi-plot.

If you are on GitHub, you could create a new Issue for this?

ADD REPLY
0
Entering edit mode

Do you think it could be a statistically efficient approach? I will create it soon!

ADD REPLY
0
Entering edit mode

it should be updated or i have to install the development version? to get this new functionality

ADD REPLY
0
Entering edit mode

These changes will only be on the main Bioconductor branch when Bioc 3.10 is released, so, for now, will have to install the development version:

devtools::install_github('kevinblighe/PCAtools')

You may notice a lot of new dependencies, including C libraries.

ADD REPLY
1
Entering edit mode

okay will wait for the Bioc release

ADD REPLY

Login before adding your answer.

Traffic: 1994 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6