Hi everyone,
It'll be a bit long post, so please bear with me.
Background
I have an undesirable set up of unreplicated RNA-seq experiments with matched controls (34 samples and 34 controls). Following a recommendation from Biostars community, I have started using GFOLD to obtain at least a meaningful rankings, or estimate of a logFC. However, as I couldn't be able to find a satisfactory documentation on how GFOLD counts reads to generate gene-level counts, I decided to compare its results with counts using featureCounts of Subread package.
Question
I'll present my evidence below but before that my question is why downregulated genes differ significantly while upregulated genes seem to agree with each other nicely?
Definitions
GFOLD value: from documentation (emphasis by me):
The GFOLD value could be considered as a reliable log2 fold change. It is positive/negative if the gene is up/down regulated. The main usefulness of GFOLD is to provide a biological meanlingful ranking of the genes. The GFOLD value is zero if the gene doesn't show differential expression. If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the parameter specified by -sc); A negative GFOLD value x means that the probability of the log2 fold change (2st/1nd) being smaller than x is (1 - the parameter specified by -sc). If this file is sorted by this column in descending order then genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, it doesn't mean that all genes with non-negative GFOLD value are differentially expressed. For taking top differentially expressed genes, the user is responsible for selecting the cutoff.
For my case -sc is 0.99.
Upregulated Gene: GFOLD(0.01) > 1
Downregulated Gene: GFOLD(0.01) < -1
Results
First I have checked count level correlation of all genes in each sample and control. I got a nice correlation distribution:
Then, I went directly to GFOLD values that compares each sample and their matched control. Again, I got a nice correlation distribution of GFOLD(0.01) values calculated using GFOLD counts and featureCounts counts:
After this point I randomly selected 4 sample-control pairs to study downstream effects. These 4 samples were selected to represent very high GFOLD(0.01) correlation, high correlation, medium correlation and low correlation. Scatter plots of GFOLD(0.01) values of these samples were as following:
And corresponding count scatter plots were as following:
First I selected up and down-regulated genes and compared with Venn diagrams:
Then I went a bit further to check any differences in GO-term enrichments. I'll not put top 10 enrichment terms for each sample, but I can say that for upregulated genes, they are more or less same between two different methods (maybe one or two terms show difference in ranking). But to summarize I calculated rank correlation of p-values of enriched GO terms:
As you can see, GO terms show high correlation in upregulated genes, while low correlation on downregulated genes.
I also checked pathway enrichment using SPIA, but I'll take too much space here. I have seen variety of evidence in pathway results, rankings seem to correlate to each other, but there are some pathways that were found activated in one method, inhibited in the other.
Thanks a lot for your patience!
If you have 34 cases and 34 controls, which are matched with each other, how come you say it is unreplicated? Are these 34 different experiments that have nothing to do with each other?
Yes, i can't use them as replicate of each other in this case. Each sample has its own control and thats it.