Hi guys, I need to select a core set of genes from an RNA-Seq dataset of 200 samples whose expression can explain x percent of the variability of the entire dataset (i.e., across samples). My point is that this x percent is normally an arbitrary threshold: it could be 75% or 80%, according to what the scientist considers relevant. Is there a way to set that threshold quantitatively, for example through simulations? In other words, is there a way to show that thresholds below, say, 75% are no longer informative for explaining the variability of the data? In my mind 75% is a reasonable cut-off, but I need to explain to non-statisticians, with numbers, why values below that are not of interest.
Thank you in advance
Dear Kevin, thank you very much for your precious help and suggestions. I tried PCAtools and I have to say that I like it a lot and will support it. The unbiased selection of the PCs is very helpful for my purposes. I have just one question: plotloadings() forces users to select rangeRetain, and in the tutorial 0.01 is the cut-off. In my case I get very few unique genes (29 in total). I know that the choice of a cut-off is "subjective", but may I ask which threshold is optimal for selecting a reasonable number of genes representing the accepted PCs (13 in my case)? Thank you in advance!
There is no optimal value of which I am aware for that. However, I would suggest that you look at the data stored in the loadings object, as that is the data being used by plotloadings(). There, it may be easier for you to identify the best variables to retain. I agree with you about the optimal threshold!
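For example, a minimal sketch (assuming an existing PCAtools object p returned by pca(); the PC column names below are the defaults):

library(PCAtools)

# per-gene loadings for each PC are stored in the p$loadings data frame
head(p$loadings[, 1:5])

# e.g. the 20 genes with the largest absolute loading on PC1
head(p$loadings[order(abs(p$loadings$PC1), decreasing = TRUE), "PC1", drop = FALSE], 20)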
Can I ask the rationale behind the "offset <- (max(x[, i]) - min(x[, i])) * rangeRetain" cut-off choice that appears in the code? My point is that it retains many genes with loadings around 0, which I would not expect to contribute to the explained variance. Why are they retained, even with the 1% cut-off used in the vignette? Thank you in advance!
This would be a discussion point if this work is going to be published. That code produces an offset that is then used to select, e.g., the top 1% of genes that contribute variation to each PC. In every PCA experiment some PCs contribute very little variation, and the component (or 'variable') loadings on those PCs will accordingly have very low values. Component loadings are unitless and are just a weighting, i.e., they indicate the strength of association of each variable with each PC.
I am open to suggestions on how this could be developed further; it is the one part of PCAtools that still makes me nervous.
Here is a simple example of what it is doing:
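(What follows is a minimal sketch of that selection logic with made-up numbers, not the verbatim plotloadings() internals.)

rangeRetain <- 0.01

# toy loadings for a single PC
x <- c(geneA = 0.95, geneB = 0.02, geneC = -0.01, geneD = -0.90, geneE = 0.10)

# offset = 1% of the full range of loadings on this PC
offset <- (max(x) - min(x)) * rangeRetain   # (0.95 - (-0.90)) * 0.01 = 0.0185

# variables whose loadings fall within 'offset' of either extreme are retained,
# i.e. the most positive and most negative tails of the loading distribution
retained <- names(x)[x >= max(x) - offset | x <= min(x) + offset]
retained
# "geneA" "geneD"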
Yes, I understand the point. My idea: what about prioritizing genes according to the frequency of high loadings across PCs? Suppose you select the top 13 PCs according to the elbow (or Horn's method, depending on what you prefer). Suppose also that you categorise the loadings into bins (ranges of values; this step could be skipped). Then calculate the cumulative frequency and select only those genes that frequently appear at high loadings across the PCs you selected. This comes from empirically observing my data: I noticed that a gene like TGFB1 appears at high loadings (+/-) in the top 8 PCs I selected with the elbow. Could this be a good idea?
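Something like this rough sketch (assuming the pca() object p, the 13 PCs retained by the elbow, and re-using the extreme-tail rule per PC just for illustration):

nPCs <- 13
rangeRetain <- 0.01
load <- p$loadings[, seq_len(nPCs), drop = FALSE]

# for each retained PC, flag genes in the extreme tails of the loading distribution
highFlag <- sapply(load, function(x) {
  offset <- (max(x) - min(x)) * rangeRetain
  x >= max(x) - offset | x <= min(x) + offset
})

# count, per gene, in how many of the retained PCs it appears at high loading
freq <- rowSums(highFlag)
names(freq) <- rownames(load)

# prioritise genes that recur at high loadings across PCs
head(sort(freq, decreasing = TRUE), 30)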
Thank you. That seems like a reasonable approach. When developing the package, I deliberately kept that part very simple, with the intention of developing it further. Most users of the package will just want a bi-plot.
If you are on GitHub, you could create a new Issue for this?
Do you think it could be a statistically efficient approach? I will create it soon!
Should I just update the package, or do I have to install the development version to get this new functionality?
These changes will only be on the main Bioconductor branch when Bioc 3.10 is released, so, for now, you will have to install the development version:
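(A sketch of one way to do that, assuming installation directly from the package's GitHub repository via the remotes package:)

# install the development version from GitHub (repository path assumed: kevinblighe/PCAtools)
if (!requireNamespace("remotes", quietly = TRUE))
  install.packages("remotes")
remotes::install_github("kevinblighe/PCAtools")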
You may notice a lot of new dependencies, including C libraries.
Okay, I will wait for the Bioc release.