Does lowering the fold change cutoff from 2 to 1.5, which will allow for more DEGs, necessarily mean a higher chance of getting enriched (Q value < 0.05) GO/pathway terms? Or does it still depend entirely on whether there is overrepresentation of DEGs under a term?
With a lower fold change cutoff you get (as a rule) more DE genes, and with more DE genes there is a higher chance of enriching for something (IMHO).
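To make the first half of that concrete, here is a minimal sketch of how the DEG count changes between the two cutoffs. The table is simulated, and the column names `log2FoldChange` and `padj` are an assumption (they follow the DESeq2 convention); `count_degs` is a hypothetical helper, not part of any package.

```r
set.seed(1)

# Simulated differential expression results (not real data).
# Column names follow the DESeq2 convention, which is an assumption here.
res <- data.frame(
  gene = paste0("gene", 1:2000),
  log2FoldChange = rnorm(2000, mean = 0, sd = 0.8),
  padj = runif(2000)
)

# Hypothetical helper: count genes passing an adjusted p-value cutoff
# and an absolute fold change cutoff (given on the linear scale).
count_degs <- function(res, fc_cutoff, padj_cutoff = 0.05) {
  sum(res$padj < padj_cutoff & abs(res$log2FoldChange) >= log2(fc_cutoff))
}

n_fc2  <- count_degs(res, fc_cutoff = 2)    # |log2FC| >= 1
n_fc15 <- count_degs(res, fc_cutoff = 1.5)  # |log2FC| >= ~0.585
c(FC2 = n_fc2, FC1.5 = n_fc15)
```

The genes passing the stricter cutoff are always a subset of those passing the looser one, so the DEG list can only grow. Whether any given term then becomes enriched is a separate question.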
Yes, as per Grant, more genes equates to more enrichment. One could technically enrich all protein-coding genes, though, but the result would be meaningless. Your cut-offs have to be a fine balance between selecting genes that are differentially expressed and leaving enough room such that the enrichment algorithms can function adequately.
Just one piece of advice though: in no way should you base your study's conclusions on an in silico enrichment. RNA-seq is a rich resource and there are lots of things that you can do. Just doing enrichment does not do the data-type justice.
Things that you could try:
comprehensive literature search of the most differentially expressed genes (DEGs)
clustering and heatmaps using DEGs showing how they can segregate groups
develop predictive signatures (regression modelling) using your DEGs
correlate some clinical parameters to your DEGs and see which are most statistically significant
I wouldn't say "more genes equates to more enrichment", but rather that there is a chance that some of the newly popped-up DE genes might be in the same pathway :)
Hi Kevin, your input above is valuable. Could you elaborate? I'm not sure I understood your suggestions well enough.
comprehensive literature search of the most differentially expressed genes (DEGs) -->
In my context, I'm trying to uncover novel as well as known mechanosensitive genes. Do you mean that I should go through papers on mechanical stress and sieve out genes that are commonly differentially expressed across mechanical stress studies, and then state which of my genes are novel and which are already well-known mechanosensitive genes?
clustering and heatmaps using DEGs showing how they can segregate groups -->
I'm still learning how to interpret heatmaps, but sometimes I really feel they are redundant. Yes, they help to cluster genes that are upregulated and downregulated, but if I were to put one in a PowerPoint slide or research paper, the reader can't see which genes are involved. It's just a chunk of red and green. Also, I only have two treatment groups apart from the controls: cells treated with low pressure and cells treated with high pressure.
develop predictive signatures (regression modelling) using your DEGs -->
What is meant by this?
correlate some clinical parameters to your DEGs and see which are most statistically significant -->
This is also not clear to me.
Hope you can elaborate further, and many thanks for giving me ideas on interpreting my data.
My dear friend, to answer these questions properly, I will first ask you an obvious question: what was the purpose of your study? There must have been a hypothesis or idea such that you wanted to perform RNA-sequencing on your samples. With neither a hypothesis nor leadership, a study will of course struggle to progress.
comprehensive literature search of the most differentially expressed genes (DEGs)
Yes, I just mean to look at your most differentially expressed genes and to see what has already been reported on them. Spend a full day doing this and you will get new leads and ideas. For example, if I found MS4A1 (CD20) as being highly differentially expressed in my RNA-seq study of an immune condition, I would go to Google and search for:
ncbi ms4a1 immunity
I will then find tonnes of hits because CD20 is a B-cell marker.
If I was studying an eye condition and found numerous RP genes as differentially expressed, I'd search for:
ncbi rp1 rp20 rp33 eye retina
There would be further hits because RP genes have been shown to cause different types of retinitis pigmentosa.
A word of advice: don't use the search bar in PubMed for literature searching. It's sub-standard compared to Google.
--------
clustering and heatmaps using DEGs showing how they can segregate groups
Well, you're probably the first person that I have ever met who does not appear to like heatmaps - kudos to you. You're correct in that they don't show too much, but usually people want to see how well their genes of interest can segregate cases from controls, which is played out in the dendrogram, mostly, but also the heatmap.
Also, one of the 'greatest' heatmaps ever was by Charles Perou, a breast cancer pathologist, I believe, who identified gene expression signatures in breast cancer tumours and thus identified the 4 different primary breast cancer sub-types that we now know today.
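As a rough illustration of what is meant by segregation in the dendrogram, here is a minimal sketch in base R. The expression matrix is simulated, and the sample names (`ctrl_*`, `high_*`) and gene names are made up for the example. Row labels are kept on the plot so that the reader can see which genes are involved.

```r
set.seed(42)

# Simulated expression matrix: 50 DEGs x 6 samples (not real data).
n_genes <- 50
samples <- c(paste0("ctrl_", 1:3), paste0("high_", 1:3))
expr <- matrix(rnorm(n_genes * 6, sd = 0.5), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), samples))

# Shift the treated samples: first 25 genes up, remaining 25 down.
expr[1:25, 4:6]  <- expr[1:25, 4:6] + 3
expr[26:50, 4:6] <- expr[26:50, 4:6] - 3

# Hierarchical clustering of samples on their DEG profiles
# (Euclidean distance, complete linkage: the hclust defaults).
sample_tree <- hclust(dist(t(expr)))
sample_groups <- cutree(sample_tree, k = 2)
sample_groups

# Row-scaled heatmap with row and column dendrograms; gene names on the rows.
heatmap(expr, scale = "row",
        col = colorRampPalette(c("blue", "white", "red"))(25),
        cexRow = 0.5)
```

If the DEGs carry a real group signal, `cutree` puts the controls in one cluster and the treated samples in the other. With only a handful of samples this is more of a sanity check than a finding.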
You should take a look at my various postings on heatmaps, and at my recent publication where I identified novel clusterings in metabolomics:
develop predictive signatures (regression modelling) using your DEGs
Again, depending on the nature of your study, one may have the intent to identify a gene signature that can define a particular condition. The DEGs that you identify, even after best efforts of normalisation and FDR thresholding, still likely comprise a large chunk of genes that provide a minimal amount of information in terms of defining the condition in which they are found as highly or lowly expressed.
Typically, one identifies a group of DEGs and then puts these to the test via regression modelling, where the endpoint may be disease status or disease classification (like tumour stage in cancer). The fitted model can then be fed into ROC analysis, from which one can derive test statistics such as sensitivity and specificity. For example, one could in the end arrive at a gene panel that has a sensitivity of 90% in identifying Alzheimer's patients from blood expression data.
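A minimal sketch of that workflow, assuming a binary endpoint and using base R only. The data are simulated, the gene names are placeholders, and the AUC is computed from ranks rather than with a dedicated ROC package.

```r
set.seed(7)

# Simulated data: 50 controls (0) and 50 cases (1), three candidate genes.
n <- 100
status <- rep(c(0, 1), each = n / 2)
dat <- data.frame(
  status = status,
  geneA = rnorm(n, mean = status * 1.0),  # informative
  geneB = rnorm(n, mean = status * 0.8),  # informative
  geneC = rnorm(n)                        # uninformative
)

# Logistic regression with disease status as the endpoint.
fit <- glm(status ~ geneA + geneB + geneC, data = dat, family = binomial)
prob <- predict(fit, type = "response")

# Sensitivity and specificity at a probability cutoff of 0.5.
pred <- as.integer(prob >= 0.5)
sensitivity <- sum(pred == 1 & dat$status == 1) / sum(dat$status == 1)
specificity <- sum(pred == 0 & dat$status == 0) / sum(dat$status == 0)

# Area under the ROC curve via the rank (Mann-Whitney) formulation.
n1 <- sum(dat$status == 1)
n0 <- sum(dat$status == 0)
auc <- (sum(rank(prob)[dat$status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(sensitivity = sensitivity, specificity = specificity, auc = auc)
```

These numbers are computed on the same samples used to fit the model, so they are optimistic. A real signature would need cross-validation or an independent cohort.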
Again, I have posted resources on this on Biostars:
correlate some clinical parameters to your DEGs and see which are most statistically significant
For many diseases, current diagnostic and prognostic criteria are based on laboratory-based assessments, or even things like family history. For example, the PSA test for prostate cancer measures an antigen in the blood; many immune disorders are measured through immunohistochemistry (IHC) of cell markers, etc. It can be useful to correlate our expression data to these clinical markers in order to see which genes may be related to certain parameters and, therefore, which genes' expression could be used as a surrogate marker of these parameters.
Again, another posting:
My objective in doing this RNA-seq study is to find novel and known mechanosensitive genes upon compressing cancer cells. I also want to study how cancer behaviour might be affected, and a global RNA-seq analysis can guide me on this.
Thanks Kevin! I will consider your points. I will probably lock myself in a room exploring my RNA-seq data to find the best direction to take from here on.
Okay, in that case, I presume that you can measure the level of applied compression, which would then be a 'clinical' parameter that you could use, of course.
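A minimal sketch of that idea, assuming three pressure levels (control, low, high) coded as 0, 1, 2 with three replicates each. The coding, the gene names and the expression values are all made up for illustration.

```r
set.seed(3)

# Applied compression per sample, coded 0 = control, 1 = low, 2 = high
# (arbitrary units; this coding is an assumption).
pressure <- rep(c(0, 1, 2), each = 3)

# Simulated normalised expression for three genes (not real data).
expr <- rbind(
  geneA = 5 + 2 * pressure + rnorm(9, sd = 0.3),  # rises with pressure
  geneB = 8 + rnorm(9, sd = 0.3),                 # unrelated to pressure
  geneC = 6 - 2 * pressure + rnorm(9, sd = 0.3)   # falls with pressure
)

# Spearman correlation of each gene with pressure.
# exact = FALSE because the pressure values contain ties.
res <- t(apply(expr, 1, function(x) {
  ct <- cor.test(x, pressure, method = "spearman", exact = FALSE)
  c(rho = unname(ct$estimate), p = ct$p.value)
}))
res <- as.data.frame(res)

# Benjamini-Hochberg adjustment across genes.
res$padj <- p.adjust(res$p, method = "BH")
res[order(res$padj), ]
```

With only three pressure levels the ranks are heavily tied, so treat the p-values as a way of ranking genes for follow-up rather than as firm evidence.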
Just quickly back to gene enrichment: terms that come up in gene enrichment should merely guide you as you then conduct literature searches (i.e. just include the GO term in Google as you search).
This is only true if the proportion of DE genes annotated to a GO term stays the same when the fold change cutoff decreases. We can simulate that with some R code using Fisher's exact test for enrichment:
# Build the 2x2 contingency table (pathway membership x DE status)
# from the global counts defined below.
build_mat <- function() {
  matrix(
    c(DE_genes_in_pathway,                                          # in pathway, DE
      DE_genes - DE_genes_in_pathway,                               # not in pathway, DE
      Pathway_genes - DE_genes_in_pathway,                          # in pathway, not DE
      Tot_genes - DE_genes - Pathway_genes + DE_genes_in_pathway),  # neither
    nrow = 2,
    dimnames = list(pathway = c("Y", "N"), DE = c("Y", "N")))
}

# basal case with 200 DEGs
Tot_genes <- 20000
DE_genes <- 200
Pathway_genes <- 100
DE_genes_in_pathway <- 5
fisher.test(build_mat())  # p-value = 0.003324

# twice as many DEGs, and the proportion of DEGs in the pathway stays the same
DE_genes <- 400
DE_genes_in_pathway <- 10
fisher.test(build_mat())  # p-value = 3.196e-05

# twice as many DEGs, but none of the 200 additional DEGs are in the pathway
DE_genes <- 400
DE_genes_in_pathway <- 5
fisher.test(build_mat())  # p-value = 0.05038

# extreme case where all genes are DEGs
DE_genes <- 20000
DE_genes_in_pathway <- 100
fisher.test(build_mat())  # p-value = 1
Thanks, that is a neat piece of code. It confirms why I will never base any clinical decision on gene enrichment (I know people who do), and why I'll continue to be overly cautious about drawing conclusions from enrichment in a research setting.
The code is good, but it assumes that no new GO terms are added as the number of DEGs increases. In practice, this scenario is unlikely given the many hundreds of thousands of enrichment terms that exist. In my experience, lowering thresholds and incorporating more DEGs will almost always introduce a greater number of enriched terms, many of which are meaningless and could lead to misinterpretation.
As per my comment (above), enrichment should not even be the main focus of the user's RNA-seq analysis.
What do you mean by "no new GO terms are added with increased number of DEGs"? My code tests enrichment for only one GO term. In practice, the tests are usually applied to all (or a subset of) the annotated GO terms, independently of the number of DEGs.
In the hypothetical case where the thresholds are so low that all genes are DEGs (I edited my code above), no GO term can be enriched, so lowering thresholds does not always result in more enriched terms.
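To illustrate what testing several terms at once looks like, here is a minimal sketch using the one-sided hypergeometric test (equivalent to Fisher's exact test with `alternative = "greater"`) followed by Benjamini-Hochberg correction. The three terms and their counts are hypothetical.

```r
tot_genes <- 20000
de_genes <- 400

# Hypothetical terms: number of annotated genes and number of DEGs among them.
terms <- data.frame(
  term = c("term_A", "term_B", "term_C"),
  size = c(100, 250, 40),
  de_in_term = c(10, 6, 1)
)

# Over-representation p-value per term: P(X >= de_in_term) when drawing
# de_genes genes at random from tot_genes, of which `size` are in the term.
terms$p <- phyper(terms$de_in_term - 1,
                  m = terms$size,
                  n = tot_genes - terms$size,
                  k = de_genes,
                  lower.tail = FALSE)

# Q values: Benjamini-Hochberg correction across all tested terms.
terms$q <- p.adjust(terms$p, method = "BH")
terms
```

Here only `term_A` (10 DEGs observed against about 2 expected) should survive correction. Real tools test thousands of terms, so the correction step matters far more than in this toy table.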
Yes, I think that we were on different trains of thought - your code example is very good and is indeed making the point for a single enrichment term / pathway. I have been even more interested in your code because I recently had an in depth conversation with a colleague about gene enrichment and how the number of genes can affect it.
When I said that "more genes equates to more enrichment", what I meant was that the inclusion of a greater number of DEGs would result in a greater number of enriched terms due to new genes matching new enrichment terms.
Well, my phrase was so general that it could be interpreted in any shape or form. I meant 'more genes equates to more enrichment terms'