HELP- analysis RNa-seq from GEO
2
0
Entering edit mode
2.2 years ago
PM • 0

Hello,

I'm new to RNA-seq analysis and would appreciate it if I could get some feedback on whether my workflow makes sense. I've been trying to learn on my own whilst doing a lab-based PhD... I've realised I enjoy this more than pipetting. Anyway, thanks to this community for all the help and support.

I'm analysing the data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32863 But I'm working with the non_normalised data

library(dplyr)
library(limma)
library(edgeR)
readr::local_edition(1)
library(GEOquery)

my_id <- "GSE32863"

## extract geo data
Sys.setenv(VROOM_CONNECTION_SIZE = 256000000)
expr <- getGEO(my_id)[[1]]

sampleInfo <- pData(expr)
annot <- fData(expr) #annotation data

####read in the data and convert to an ElistRaw object
baseDir <- './'

#import data
x <- read.table(paste0(baseDir, 'GSE32863_non-normalized.txt.gz'), header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0)
x <- x[order(x$ID_REF),] 

#extract pvalues
detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
rownames(detectionpvalues) <- x$ID_REF

#remove pvalues from dataframe
x <- x[,-grep('Detection.Pval', colnames(x))]
#extract probes, 1st column
probes <- x$ID_REF

#convert to matrix
x <- data.matrix(x[,2:ncol(x)])

#columns and rows names
rownames(x) <- probes
colnames(x) <- colnames(edata)
colnames(detectionpvalues) <- colnames(x)

#verify rownames in x match those in annot
#dataframes should have same number of observations and rows
annot <- annot[which(annot$ID %in% rownames(x)),]


# create a custom EListRaw object
gse<- new('EListRaw')
gse@.Data[[1]] <- 'illumina'
gse@.Data[[2]] <- sampleInfo
gse@.Data[[3]] <- NULL
gse@.Data[[4]] <- x
gse@.Data[[5]] <- NULL
gse$E <- x
gse$targets <- sampleInfo
gse$genes <- annot
gse$other$Detection <- detectionpvalues

########### Background correction / normalisation
gse.bgcorrect.norm <- neqc(gse, offset = 16)

# filter out control probes, those with no symbol, and those that failed
Control <- gse.bgcorrect.norm$genes$Source=="ILMN_Controls"
NoSymbol <- gse.bgcorrect.norm$genes$Symbol == ""
isexpr <- rowSums(detectionpvalues <= 0.05) >= 3
gse.bgcorrect.norm.filt <- gse.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]

dim(gse.bgcorrect.norm) # 48803 116
dim(gse.bgcorrect.norm.filt) # 26225 116


##summarise across genes by mean
gse.bgcorrect.norm.filt.mean <- avereps(gse.bgcorrect.norm.filt, ID = gse.bgcorrect.norm.filt$genes$Symbol)


dim(gse.bgcorrect.norm.filt.mean) # 19485 116

library(limma)

design<- model.matrix(~0 + sampleInfo$source_name_ch1)  
## the column names are a bit ugly, so we will rename
colnames(design) <- c("Non_tumor","Adenocarcinoma")

#calculate cutoff
cutoff <- median(gse.bgcorrect.norm.filt.mean$E)

#filter
is_expressed <- gse.bgcorrect.norm.filt.mean$E > cutoff
keep <- rowSums(is_expressed) > 2

gse.bgcorrect.norm.filt.mean$E <- gse.bgcorrect.norm.filt.mean$E[keep,]

#fit the model to the data
#The result of which is to estimate the expression level in each of the groups that we specified.
fit <- lmFit(gse.bgcorrect.norm.filt.mean$E, design)


contrasts<-makeContrasts(Adenocarcinoma-Non_tumor,levels = design)
fit2 <- contrasts.fit(fit, contrasts)
#apply the empirical Bayes step to get our differential expression statistics and p-values.
fit2 <- eBayes(fit2)
topTable(fit2)

decideTests(fit2)
tfit <- treat(fit2, lfc=1)
dt <- decideTests(tfit)
adeno_vs_non <- topTreat(tfit, coef=1, n=Inf)

I am also uploading the plots from this session

enter image description here enter image description here

microarray • 1.2k views
ADD COMMENT
0
Entering edit mode

Indeed, this is not an RNA-seq analysis but Microarray. There are, however, various RNA-seq workflows for R on Bioconductor that may serve as a template.

If you at some point want to run an analysis with hundreds or even thousands of samples, R and the Bioconductor workflows are unfortunately not going to cut it. In that case, you will find mature RNA-seq pipelines written in Nextflow, Snakemake, Common Workflow Language etc.

ADD REPLY
0
Entering edit mode

@Matthias since your post is not directly addressing the question I have moved it to a comment.

ADD REPLY
0
Entering edit mode

Sure. I just assumed that the actual purpose of the question was to teach themselves RNA-seq analysis and therefore pointed out the respective workflows to retrace instead. But I should have asked instead, if the OP is nonetheless interested in feedback regarding the script.

ADD REPLY
1
Entering edit mode
2.2 years ago

Analysis looks fine (although I'm not too familiar with this old school tech).

My biggest concern would be referring to microarray data as RNA-seq...

ADD COMMENT
0
Entering edit mode

Sorry! Yes, got my wires crossed! They’re not the same thing!

Thanks for pointing that out

ADD REPLY
0
Entering edit mode

Please edit the original post/title and remove the references to RNAseq since that does not apply.

ADD REPLY
0
Entering edit mode
2.2 years ago
Gordon Smyth ★ 7.7k

I'm not quite clear on your purpose. Are you analysing this Illumina microarray dataset because you are interested in the biological results from this experiment or because you thought it would teach you about RNA-seq analysis?

If the former, then there are several ways in which your analysis could be improved or simplified. For one thing, neqc needs to know the detection p-values but you removed them for some reason. For another, this experiment is a paired design but you have not taken into account the pairing in your analysis.

If the latter, then there are RNA-seq workflows for the packages you are using here:

ADD COMMENT

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6