What are Differentially Expressed Genes?
21 months ago

Hi,

Believe me, I know that what I am about to ask will sound stupid and repetitive. I have read many of the available articles but am still confused.

So, let's say I am doing an RNA-seq DEG analysis on a dataset of 4 samples, two in the Control group and two in the Treated group. I have the quantification results and would like to perform the DEG analysis. I have a few questions:

  1. Which genes would be called expressed genes?
  2. Which set of genes would be called differentially expressed genes? From what I understand, the counts in the count file are normalised, and the genes that pass the FDR<0.05 and logFC thresholds are called differentially expressed. Please correct me if I am wrong.
  3. Which parameter should I use to select the DEGs? I read that FDR<0.05 together with a logFC cutoff is most common, but elsewhere Pvalue<0.05 is also mentioned.
  4. Somewhere I read that all expressed genes are differentially expressed, and that the p-value and logFC cutoffs are applied to select the significant DEGs. Is that the concept?

I hope someone can help clear up my confusion.

Thank you.

21 months ago
Risk ▴ 40

Which data would be called expressed genes? Which set of genes would be called differentially expressed genes?

All genes that have reads assigned to them are 'expressed' in some sense.

From what I understood the genes in the count file would undergo normalisation and the resulting genes screened with the FDR<0.05 and logFC threshold are called Differentially expressed. Please correct me if I am wrong.

This is tool-dependent, but when you look for differentially expressed genes, the software package you use will generally build a model of the gene expression distribution across the replicates of each condition, then normalize across samples to try to remove most of the noise (technical and otherwise). The list it generates will tell you which genes are differentially expressed and how much confidence can be ascribed to any particular gene, based on the assumptions made by the model. Which q-value or fold-change cutoff you use is up to you, but generally speaking a q-value <0.05 combined with a fold change >2 is considered 'stringent' and will give you results that are most likely true and can be validated (via qPCR etc.).

Which parameter should I consider to select the DEGs? I read that most commonly used is FDR<0.05 and logFC, but somewhere else Pvalue<0.05 is also mentioned.

First of all, "FDR" is an approach to adjusting the p-value that accounts for multiple testing. I know that it's popular to say "...an FDR<0.05", but this is as incorrect as saying "...a Bonferroni < 0.05". What you have is an FDR-adjusted p-value, which is called a q-value (or an 'adjusted p-value'). That being said, you always want to use the q-value when drawing conclusions from multiple testing. The raw p-value is, at best, not trustworthy and at worst meaningless for many genes, because the number of false positives it lets through grows with the number of genes tested.
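To make the difference between raw and adjusted p-values concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is an illustration only: the function name `benjamini_hochberg` and the six p-values are made up for this example, and real DE packages compute the adjustment for you.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for six genes (made up for illustration).
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60]
adj = benjamini_hochberg(raw)

for p, q in zip(raw, adj):
    print(f"p = {p:.3f}  adjusted = {q:.4f}")

print("genes with p < 0.05:       ", sum(p < 0.05 for p in raw))
print("genes with adjusted < 0.05:", sum(q < 0.05 for q in adj))
```

In this toy input, five genes pass a raw p < 0.05 cutoff but only two survive the adjustment, which is the effect described above.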

Somewhere I found that all the genes expressed are Differentially expressed and the PValue and logFC cutoff is done to select the significant DEGs. Is that the concept?

A couple of things you need to understand to be able to answer this question: (1) Biologically speaking, most genes in the genome are 'expressed' in the sense that if you look for a single transcript of that gene, it's probably there if you look hard enough. (2) Practically speaking, you will never have exactly the same number of transcripts when comparing two cells. Even if you sequence the exact same sample three times, you will notice that you get a different number of reads per gene in each run. This is because the cell, like most biology, is noisy at some level. The technology, like most technology, is also noisy at some level. What you are looking for when doing DE analysis is differences which you can be confident are not due to noise. This is what most of the software seeks to do. It builds models, it normalizes, it adjusts p-values, and so on. In the end what you're left with is the software telling you, "Look... here's a list of genes that, based on our models and assumptions, appear to be significantly different between the samples. We are this confident that they are different (q-value)."
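The "sequence the same sample three times" point can be sketched with a toy simulation. This is a deliberately simplified model, not how any real sequencer or DE tool works: the gene names and proportions are hypothetical, and each read is simply drawn at random in proportion to transcript abundance, so it only captures sampling noise.

```python
import random
from collections import Counter

# Hypothetical transcript proportions in one RNA sample (made up for illustration).
proportions = {"geneA": 0.50, "geneB": 0.30, "geneC": 0.15, "geneD": 0.05}


def sequence(sample, n_reads, rng):
    """Draw n_reads reads, each gene chosen in proportion to its abundance."""
    genes = list(sample)
    weights = [sample[g] for g in genes]
    counts = Counter(rng.choices(genes, weights=weights, k=n_reads))
    return {g: counts.get(g, 0) for g in genes}


rng = random.Random(0)  # fixed seed so the run is repeatable
# "Sequence" the very same sample three times at the same depth.
runs = [sequence(proportions, 10_000, rng) for _ in range(3)]

for i, run in enumerate(runs, start=1):
    print(f"run {i}: {run}")
```

The underlying sample never changes, yet the per-gene counts will generally differ from run to run. A DE model has to decide whether a difference between Control and Treated is larger than this kind of variation.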


Some nuance:

It's fine to say FDR < 0.05 because that's the error rate you're setting/allowing/controlling. There is some ground truth FDR (it could be 0.03) but FDR-controlling means you're trying to guarantee that the error rate is less than 0.05.

It's similar to the concept of FWER (another type of error rate which Bonferroni controls for).

However, it's incorrect to say a gene got an FDR of 0.0026 (that's an adjusted p-value).

Another nuance:

Fold change >2 only reports upregulated genes. To obtain downregulated genes symmetrically, use fold change <0.5. Fold changes are usually log2-transformed for convenience, so upregulated genes have log2FC > 1 and downregulated genes have log2FC < -1.
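As a small sketch of that symmetric filter, the snippet below labels genes as up, down or not significant. The results table, the helper name `classify` and the default cutoffs are all hypothetical; in practice you would apply the same logic to the table your DE tool produces.

```python
import math

# Hypothetical results: (gene, fold change treated/control, adjusted p-value).
results = [
    ("geneA", 4.0, 0.001),
    ("geneB", 0.25, 0.003),
    ("geneC", 1.2, 0.0004),
    ("geneD", 8.0, 0.20),
    ("geneE", 0.5, 0.01),
]


def classify(fold_change, padj, lfc_cutoff=1.0, alpha=0.05):
    """Label a gene using an adjusted p-value cutoff and a symmetric log2FC cutoff."""
    log2fc = math.log2(fold_change)
    # Strict inequalities, matching log2FC > 1 and log2FC < -1.
    if padj >= alpha or abs(log2fc) <= lfc_cutoff:
        return "not significant"
    return "up" if log2fc > 0 else "down"


for gene, fc, padj in results:
    print(f"{gene}: log2FC = {math.log2(fc):+.2f}, padj = {padj}, {classify(fc, padj)}")
```

Note that a fold change of 0.25 and a fold change of 4 are the same distance from zero on the log2 scale (-2 and +2), which is why a single `abs(log2fc)` test treats both directions equally.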


Thank you for the detailed reply. I understand that using a q-value cutoff of <0.05 instead of a p-value cutoff is meant to reduce false positives.

21 months ago
  1. I am not aware that there is an unequivocal answer to this. Wet lab people often ask bioinformaticians how many genes are or are not expressed in a specific cell or population, assuming that a comprehensive method like RNA-seq must be able to provide the answer. Putting all biases, technical limitations and random noise aside, RNA-seq can at best only quantify the presence of an RNA in the cell. Gene expression in a functional sense, however, entails more than transcription. Depending on the gene, appropriate post-transcriptional processing, trafficking, translation, folding, post-translational modification and transport may also be crucial for functional expression. Therefore, I tend to argue that this cannot be answered by RNA-seq and, furthermore, that the statement "cell X expresses a total of N genes" is pretty meaningless. However, I can also accept it if someone applies a threshold to the gene counts and calls anything above it "expressed".
  2. That is a commonly used significance level, yes.
  3. (4.) Please see my recent answer here. TL;DR: Calculation of the false discovery rate is a method similar to calculating a p-value, but accounts for multiple testing. Don't rank "promising" candidate genes based on p-values or FDR.
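The count-threshold idea from point 1 could look like the sketch below. Everything here is an assumption for illustration: the count table is made up, and the cutoffs (`min_count=10` in at least `min_samples=2` samples) are arbitrary, not a recommended standard.

```python
# Hypothetical raw counts per gene across the 4 samples (2 Control, 2 Treated).
counts = {
    "geneA": [520, 480, 610, 590],
    "geneB": [0, 0, 0, 0],
    "geneC": [3, 0, 1, 2],
    "geneD": [45, 60, 38, 52],
}


def expressed_genes(table, min_count=10, min_samples=2):
    """Return genes with at least min_count reads in at least min_samples samples."""
    return [
        gene
        for gene, values in table.items()
        if sum(v >= min_count for v in values) >= min_samples
    ]


print(expressed_genes(counts))
```

With these toy numbers only `geneA` and `geneD` pass. Changing the cutoffs changes the answer, which is the reason "cell X expresses N genes" depends on the threshold chosen.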
21 months ago

Key point: There is a difference between the following:

  1. statistically significantly differentially expressed genes
  2. differentially expressed genes

For #1, we define the thresholds for statistical significance in the methods.

For #2, well, everything is differentially expressed to some degree - no two genes will be expressed at exactly the same level (number of mRNA molecules), unless they are transcribed by the same molecular machinery as it moves along a strand of DNA, and even then there may be differences.

Kevin


Thank you for pointing this out. So, would it be appropriate to call the genes that pass certain threshold values statistically significant DEGs, and the genes obtained after quantification (except those with a value of 0) DEGs? This is what I have understood; please correct me if I am wrong.


I am not sure what you mean. It may help if you share some of the code that you are using. Also, what is the background to your question?
