Hi, I'm currently learning how to perform RNAseq using a dataset I found on recount2. I was planning on doing three pairwise comparisons,
- shGFP (control) vs shRBFOX1 kd
- shGFP (control) vs non-transduced PHNP
- shRBFOX1 kd vs non-transduced PHNP
where shGFP and shRBFOX1 are transduced and PHNP is non-transduced, with each having five replicates. RBFOX1 is supposedly important in regulating early neuronal splicing events that affect proper neuronal development. So, I was hoping to see what patterns of expressions could be identified with the loss of RBFOX1, and if any genes are shared between the conditions.
When I did a biplot with the design including the condition only (there wasn't anything else in the metadata I could factor in), it separated on PC1. With the control and kd samples plotted closer together, does this mean that most of the variation is whether the sample is transduced or non-transduced? The paper seems to suggest that "examination of the most variant gene expression indicated a reproducible difference between the shRBFOX1 and the shGFP lines..."
So far I've done the analysis for comparison (1):
# decide which genes are considered differentially expressed for volcano plot
res_ctl_kd_tb <- res_ctl_kd_tb %>%
mutate(threshold_ctl_kd = padj < params$padj.cutoff & abs(log2FoldChange) >= params$l2fc, genelabels = "") %>%
arrange(padj)
# populate genelabels column with gene symbols for the first 20 rows
res_ctl_kd_tb$genelabels[1:20] <- as.character(res_ctl_kd_tb$symbol[1:20])
# plot volcano plot
ggplot(res_ctl_kd_tb, aes(log2FoldChange, -log10(padj))) +
geom_point(aes(color = threshold_ctl_kd)) +
geom_text_repel(aes(label = genelabels)) +
ggtitle("FOX1 knockdown") +
xlab("log2 fold change") +
ylab("-log10 adjusted p-value") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.5), hjust = 0.5),
axis.title = element_text(size = rel(1.25)))
# create background genes list
res_ids_KD <- left_join(res_ctl_kd_tb, anno, by = c("gene" = "ENSEMBL"))
allKD_genes <- as.character(res_ids_KD$gene)
# extract significant results
sigKD <- dplyr::filter(res_ids_KD, padj < params$padj.cutoff)
sigKD_genes <- as.character(sigKD$gene)
# run GO enrichment analysis
go <- enrichGO(gene = sigKD_genes,
universe = allKD_genes,
keyType = "ENSEMBL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = params$qval.cutoff,
readable = TRUE)
The GO analysis resulted in one result (positive chemotaxis). I also tried constructing a GSEA using KEGG gene sets with clusterProfiler but got no statistically significant results. I'm not sure how to proceed, so any input is greatly appreciated.
its not that its not a good or meaningful question, its just that the question you are asking is, well, as open ended as science itself (unless im missing something). what specifically do you want to do next?
Are you asking for a 'why'? I would like to see what factors are alternatively spliced as a direct result of RBFOX1. Not sure if I answered you well enough.
I agree with Vincent, these are all open ended questions.
It means the control/kd samples clustered together and apart from the non-transduced samples. What's interesting is that 95% of the variance is captured by PC1.
Have you performed diferential expression analysis using edgeR or DeSeq2?
Yes, I used DeSeq2 and got 58037 genes, with 658 of them being statistically significant (adjusted p-value cutoff 0.05). The volcano plot above depicts the unfiltered, 58037 genes, results, with the top 20 significant genes labelled. There's obviously not 20 up there- 10 have no gene symbols and 4 are somehow not showing up