Differential Expression in RNA-Seq Experiment
11.2 years ago
ThePresident ▴ 80

Hello,

I'm dealing with a classic dilemma: I performed an RNA-seq experiment on two biological replicates for condition A and two others for condition B. After alignment and differential expression analysis using the DESeq package, I have a whole list of genes with fold changes of A vs B. Now, my question is: where do I put a cutoff?

  1. From a biological point of view, I'm tempted (as others have done) to set a fold change of 2 as a cutoff. Two times more transcripts is somewhat significant at the biological level for a cell. But is it really? If we assume it is, it brings me to the next point:
  2. What is a cutoff for the p-value? I'm tempted to use padj (hence FDR-corrected), and the hits I'll get are almost surely genuine (in fact, I tested those by qPCR and they are indeed differentially expressed between A and B). However, am I missing potentially interesting hits by being too restrictive? Then, where do I set my cutoff?

FYI: I'm dealing with Illumina, single-end 50 bp, non-strand-specific, bacterial RNA-seq data.

Thank you all for your input on this,

TP

rnaseq deseq rpkm • 15k views
11.2 years ago
seidel 11k

I'll just echo what dpryan70 said in a comment: where you set your cutoffs depends completely on what you plan to do with the results. If you have an assay to easily screen through lots of genes, then you can be liberal about your cutoff, whereas if follow-up involves heavy investment then you would be much more stringent. You might also use different cutoffs for different purposes. For instance, a cutoff to select genes for qPCR validation may be different than a cutoff you would use for GO enrichment analysis.

In my experience, the magnitude of the numbers (fold change, p or q value) does not have any absolute meaning, i.e. there is no x-fold threshold that determines biological significance. Every data set is different, experimental systems are different, and I have to adjust both fold change and p-value restrictions on an experiment-by-experiment basis. It's often tempting to take the interpretations of false discovery rates associated with p-values literally, and easy to forget that the numbers are based on assumptions about distributions. The "true" and "false" used to describe positives and negatives are based on an ideal, and what is actually true and false is difficult to know. There's also the issue of conflating significance with importance (avoid "the cult of statistical significance"). Many people adjust p-values and have nothing "significant" left, yet there is plenty of evident biology in the data staring them in the face. So pick some values that seem reasonable based on what you'd like to do with the results, and prepare to iteratively adjust your choices based on your needs.
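
One way to make that iteration concrete is to tabulate how many genes survive each combination of cutoffs before committing to one. The following is only a rough sketch in Python: the `results` values are made up, and `count_hits` is a hypothetical helper, not part of DESeq. It assumes fold changes are on the log2 scale, so |log2FC| >= 1 corresponds to a two-fold change.

```python
# Hypothetical (log2 fold change, adjusted p-value) pairs; values are made up.
results = [
    (3.1, 0.001), (-2.4, 0.04), (0.4, 0.02), (5.0, 0.35),
    (-1.2, 0.09), (1.1, 0.008), (-0.7, 0.2), (2.2, 0.06),
]

def count_hits(rows, min_abs_lfc, padj_cutoff):
    """Number of rows passing both the fold-change and the padj cutoff."""
    return sum(1 for lfc, padj in rows
               if abs(lfc) >= min_abs_lfc and padj < padj_cutoff)

lfc_cutoffs = [0.0, 1.0, 2.0]      # |log2FC| >= 1 is a two-fold change
padj_cutoffs = [0.01, 0.05, 0.10]

# Hit counts for every combination of the two cutoffs.
grid = {(lfc_min, p): count_hits(results, lfc_min, p)
        for lfc_min in lfc_cutoffs for p in padj_cutoffs}

print("min|log2FC|  " + "  ".join(f"padj<{p:.2f}" for p in padj_cutoffs))
for lfc_min in lfc_cutoffs:
    row = "  ".join(f"{grid[(lfc_min, p)]:>9d}" for p in padj_cutoffs)
    print(f"{lfc_min:>11.1f}  {row}")
```

A table like this shows how sensitive the hit list is to each choice, which helps when picking a liberal list for cheap screening and a stricter one for expensive follow-up.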

Your comment about X-fold thresholds is quite important. Even in a world with a perfect correspondence between RNA and protein level changes, a 10% change in one protein can be much more important than a 300% change in another. All the statistics in the world can't replace putting the data in a biological context.

Thank you guys again. It means a lot to have your ideas on all this. We often have our noses stuck so close to our data that we lose the big picture. But, overall, that's exactly what I want to avoid: using "common" statistics to delimit my list of DE genes. I want to use p and q values along with the biological reasoning behind them. The only problem is that journals often want you to use those parameters blindly. They want p < 0.05 regardless of anything else (unfortunately, so does my adviser). Anyway, thank you again; it helped me take this analysis to another level and defend it in front of those who believe that the p-value is the top of the rock. PS - sorry for my bad English ;)

11.2 years ago

I generally filter by adjusted p-value (0.10 is a common threshold for adjusted p-values) and then rank by fold-change. You'll lose real and meaningful changes regardless of what you do, so don't fixate too much on that.
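
As a minimal sketch of that filter-then-rank step (in Python rather than R, with a made-up results table and a hypothetical `filter_and_rank` helper; the fold changes are assumed to be on the log2 scale, and a missing padj, NA in R output, is represented as NaN):

```python
import math

# Hypothetical results table: (gene, log2 fold change, adjusted p-value).
# Values are made up for illustration; they are not from any real experiment.
results = [
    ("geneA", 3.1, 0.001),
    ("geneB", -2.4, 0.04),
    ("geneC", 0.4, 0.02),
    ("geneD", 5.0, 0.35),
    ("geneE", -1.2, 0.09),
    ("geneF", 1.8, float("nan")),  # missing padj: skipped by the filter
]

def filter_and_rank(rows, padj_cutoff=0.10):
    """Keep rows with padj below the cutoff, then sort by |log2FC|, largest first."""
    kept = [r for r in rows if not math.isnan(r[2]) and r[2] < padj_cutoff]
    return sorted(kept, key=lambda r: abs(r[1]), reverse=True)

ranked = filter_and_rank(results)
for gene, lfc, padj in ranked:
    print(f"{gene}\tlog2FC={lfc:+.1f}\tpadj={padj:.3f}")
```

Ranking on the absolute value keeps strongly down-regulated genes near the top alongside the up-regulated ones. Note that geneD has the largest fold change but is dropped because its padj is above the cutoff.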

Thank you for your answer. I agree with you, we need to cut somewhere and we'll lose meaningful data regardless of the cutoff... that's why we set limits like pval < 0.05 or padj < 0.1. It's just that I'm not a statistician, so I have no clue how much it really means to set a cutoff for padj at 0.1. Is that threshold low, medium, high? And I hate to use something just because it's common practice... however, I don't have enough statistical knowledge to accurately judge by myself! ;)

Well, high, medium and low are subjective terms, so you'll never get an answer to that. In general, that's probably a medium threshold for general use. The most appropriate threshold will depend on what you want to do with the results. If you're going to do something expensive and time-consuming, like making a bunch of transgenic mice or designing a drug trial, then you'll want a more stringent threshold (a lower padj cutoff). People generally run various validations, which will give you a better idea of whether you might benefit from changing the threshold.
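
To give a rough feel for what padj < 0.1 means, here is a small sketch, assuming the padj column comes from the Benjamini-Hochberg procedure. The p-values are made up and `bh_adjust` is a hypothetical helper written for illustration, not a DESeq function. Under the procedure's assumptions, calling everything with padj < 0.1 a hit means that, on average, no more than about 10% of those hits are expected to be false positives.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never
    # increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.07, 0.074, 0.205, 0.212, 0.216]
padj = bh_adjust(pvals)

hits = [p for p in padj if p < 0.10]
# Rough expectation of false positives among the hits at this cutoff.
expected_false = 0.10 * len(hits)
print(f"{len(hits)} hits at padj < 0.10, about {expected_false:.1f} expected to be false")
```

So if you validate a batch of hits chosen at padj < 0.1 and far more than a tenth of them fail, that is a hint that the cutoff (or the model's assumptions) deserves a second look.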
