Hi,
I recently performed an RNA-seq differential expression analysis. I first compared Condition 2 vs. Control, and then Condition 3 vs. Control. However, the raw counts do not seem to support the edgeR output. For instance, Krt17 has a log fold change of 7 in the Condition 2 vs. Control comparison, yet it does not appear in the list of differentially expressed genes (FDR ≤ 0.05) for Condition 3 vs. Control. When I go back to the raw counts for Krt17, this is what I have: Control replicates (0, 0, 0), Condition 2 (25, 27, 25), and Condition 3 (16, 14, 11). It clearly looks "overexpressed" in both conditions, yet it only shows up in one of the two lists. How do I fix this, or account for such an observation?
For those wondering, I ran my analysis following most of what the Griffith Lab tutorial suggests.
Thanks
Try normalizing the data to a linear scale first, then fit a model with limma.
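One common way to hand count data to limma is the voom pipeline; the sketch below is an illustration of that approach, not the poster's code. It assumes a raw count matrix `counts` and a factor `group` with levels Control/Condition2/Condition3 (placeholder names).

```r
library(edgeR)
library(limma)

# 'counts' and 'group' are assumed objects, not from the original post
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)                 # TMM library-size normalization

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

v    <- voom(dge, design)                   # log2-CPM with precision weights
fit  <- lmFit(v, design)
cm   <- makeContrasts(Condition2 - Control,
                      Condition3 - Control,
                      levels = design)
fit2 <- eBayes(contrasts.fit(fit, cm))

topTable(fit2, coef = 1, adjust.method = "BH")   # Condition 2 vs Control
topTable(fit2, coef = 2, adjust.method = "BH")   # Condition 3 vs Control
```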
Would it work if I filtered out lowly expressed genes before creating the edgeR object? If so, what sort of cutoff should I use?
It's already taken care of in the tutorial code: genes/transcript features are kept only if they have a count greater than 0 in a minimum number of samples and an average CPM of at least 1 across all samples. As for batch effects: if you ran all of your samples on the same sequencing machine, you would typically use the sequencing lane as the batch factor, and this scales up to multiple sequencing machines from different institutions, etc.
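For illustration, a sketch of that kind of low-count filter (my paraphrase, not the exact tutorial code); `y` is assumed to be a DGEList built from the raw counts, and `lane`/`group` are assumed factors:

```r
library(edgeR)

# keep genes detected (count > 0) in at least 2 samples AND with an
# average CPM of at least 1 across all samples; cutoffs are illustrative
keep <- rowSums(y$counts > 0) >= 2 &
        rowMeans(cpm(y)) >= 1
y <- y[keep, , keep.lib.sizes = FALSE]
# filterByExpr(y) is edgeR's built-in alternative for this kind of filter

# if sequencing lane (or machine) is a suspected batch effect, include it
# as a factor in the design matrix ('lane' is hypothetical here)
design <- model.matrix(~ lane + group)
```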
Can you post your code here?
For the most part, I am following the Griffith Lab Tutorial: https://github.com/griffithlab/rnaseq_tutorial/blob/master/scripts/Tutorial_Module4_Part4_edgeR.R
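For context, a typical edgeR exact-test workflow of the kind that script follows looks roughly like this. This is a sketch from general edgeR usage, not a verbatim copy of the tutorial; `counts`, `group`, and the level names are assumptions, and the last lines show how one might pull out a single gene such as Krt17 for a sanity check against the raw counts.

```r
library(edgeR)

# 'counts' is the raw count matrix; 'group' a factor with levels
# Control / Condition2 / Condition3 (assumed names)
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)          # TMM normalization
y <- estimateDisp(y)             # common/trended/tagwise dispersions

# pairwise exact tests against Control
et2 <- exactTest(y, pair = c("Control", "Condition2"))
et3 <- exactTest(y, pair = c("Control", "Condition3"))

# DE tables with BH-adjusted p-values (FDR), then one gene's rows
topTags(et2, n = Inf)$table["Krt17", ]
topTags(et3, n = Inf)$table["Krt17", ]

# raw counts and CPM for the same gene
y$counts["Krt17", ]
cpm(y)["Krt17", ]
```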