Hello,
I'm new to RNA-seq analysis and would appreciate it if I could get some feedback on whether my workflow makes sense. I've been trying to learn on my own whilst doing a lab-based PhD... I've realised I enjoy this more than pipetting. Anyway, thanks to this community for all the help and support.
I'm analysing the data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32863 But I'm working with the non_normalised data
library(dplyr)
library(limma)
library(edgeR)
readr::local_edition(1)
library(GEOquery)
my_id <- "GSE32863"
## extract geo data
Sys.setenv(VROOM_CONNECTION_SIZE = 256000000)
expr <- getGEO(my_id)[[1]]
sampleInfo <- pData(expr)
annot <- fData(expr) #annotation data
####read in the data and convert to an ElistRaw object
baseDir <- './'
#import data
x <- read.table(paste0(baseDir, 'GSE32863_non-normalized.txt.gz'), header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0)
x <- x[order(x$ID_REF),]
#extract pvalues
detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
rownames(detectionpvalues) <- x$ID_REF
#remove pvalues from dataframe
x <- x[,-grep('Detection.Pval', colnames(x))]
#extract probes, 1st column
probes <- x$ID_REF
#convert to matrix
x <- data.matrix(x[,2:ncol(x)])
#columns and rows names
rownames(x) <- probes
colnames(x) <- colnames(edata)
colnames(detectionpvalues) <- colnames(x)
#verify rownames in x match those in annot
#dataframes should have same number of observations and rows
annot <- annot[which(annot$ID %in% rownames(x)),]
# create a custom EListRaw object
gse<- new('EListRaw')
gse@.Data[[1]] <- 'illumina'
gse@.Data[[2]] <- sampleInfo
gse@.Data[[3]] <- NULL
gse@.Data[[4]] <- x
gse@.Data[[5]] <- NULL
gse$E <- x
gse$targets <- sampleInfo
gse$genes <- annot
gse$other$Detection <- detectionpvalues
########### Background correction / normalisation
gse.bgcorrect.norm <- neqc(gse, offset = 16)
# filter out control probes, those with no symbol, and those that failed
Control <- gse.bgcorrect.norm$genes$Source=="ILMN_Controls"
NoSymbol <- gse.bgcorrect.norm$genes$Symbol == ""
isexpr <- rowSums(detectionpvalues <= 0.05) >= 3
gse.bgcorrect.norm.filt <- gse.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]
dim(gse.bgcorrect.norm) # 48803 116
dim(gse.bgcorrect.norm.filt) # 26225 116
##summarise across genes by mean
gse.bgcorrect.norm.filt.mean <- avereps(gse.bgcorrect.norm.filt, ID = gse.bgcorrect.norm.filt$genes$Symbol)
dim(gse.bgcorrect.norm.filt.mean) # 19485 116
library(limma)
design<- model.matrix(~0 + sampleInfo$source_name_ch1)
## the column names are a bit ugly, so we will rename
colnames(design) <- c("Non_tumor","Adenocarcinoma")
#calculate cutoff
cutoff <- median(gse.bgcorrect.norm.filt.mean$E)
#filter
is_expressed <- gse.bgcorrect.norm.filt.mean$E > cutoff
keep <- rowSums(is_expressed) > 2
gse.bgcorrect.norm.filt.mean$E <- gse.bgcorrect.norm.filt.mean$E[keep,]
#fit the model to the data
#The result of which is to estimate the expression level in each of the groups that we specified.
fit <- lmFit(gse.bgcorrect.norm.filt.mean$E, design)
contrasts<-makeContrasts(Adenocarcinoma-Non_tumor,levels = design)
fit2 <- contrasts.fit(fit, contrasts)
#apply the empirical Bayes step to get our differential expression statistics and p-values.
fit2 <- eBayes(fit2)
topTable(fit2)
decideTests(fit2)
tfit <- treat(fit2, lfc=1)
dt <- decideTests(tfit)
adeno_vs_non <- topTreat(tfit, coef=1, n=Inf)
I am also uploading the plots from this session
Indeed, this is not an RNA-seq analysis but Microarray. There are, however, various RNA-seq workflows for R on Bioconductor that may serve as a template.
If you at some point want to run an analysis with hundreds or even thousands of samples, R and the Bioconductor workflows are unfortunately not going to cut it. In that case, you will find mature RNA-seq pipelines written in Nextflow, Snakemake, Common Workflow Language etc.
@Matthias since your post is not directly addressing the question I have moved it to a comment.
Sure. I just assumed that the actual purpose of the question was to teach themselves RNA-seq analysis and therefore pointed out the respective workflows to retrace instead. But I should have asked instead, if the OP is nonetheless interested in feedback regarding the script.