Question

HELP- analysis RNa-seq from GEO

0

Entering edit mode

3.2 years ago

PM • 0

Hello,

I'm new to RNA-seq analysis and would appreciate it if I could get some feedback on whether my workflow makes sense. I've been trying to learn on my own whilst doing a lab-based PhD... I've realised I enjoy this more than pipetting. Anyway, thanks to this community for all the help and support.

I'm analysing the data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32863 But I'm working with the non_normalised data

library(dplyr)
library(limma)
library(edgeR)
readr::local_edition(1)
library(GEOquery)

my_id <- "GSE32863"

## extract geo data
Sys.setenv(VROOM_CONNECTION_SIZE = 256000000)
expr <- getGEO(my_id)[[1]]

sampleInfo <- pData(expr)
annot <- fData(expr) #annotation data

####read in the data and convert to an ElistRaw object
baseDir <- './'

#import data
x <- read.table(paste0(baseDir, 'GSE32863_non-normalized.txt.gz'), header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0)
x <- x[order(x$ID_REF),] 

#extract pvalues
detectionpvalues <- x[,grep('Detection.Pval', colnames(x))]
rownames(detectionpvalues) <- x$ID_REF

#remove pvalues from dataframe
x <- x[,-grep('Detection.Pval', colnames(x))]
#extract probes, 1st column
probes <- x$ID_REF

#convert to matrix
x <- data.matrix(x[,2:ncol(x)])

#columns and rows names
rownames(x) <- probes
colnames(x) <- colnames(edata)
colnames(detectionpvalues) <- colnames(x)

#verify rownames in x match those in annot
#dataframes should have same number of observations and rows
annot <- annot[which(annot$ID %in% rownames(x)),]


# create a custom EListRaw object
gse<- new('EListRaw')
gse@.Data[[1]] <- 'illumina'
gse@.Data[[2]] <- sampleInfo
gse@.Data[[3]] <- NULL
gse@.Data[[4]] <- x
gse@.Data[[5]] <- NULL
gse$E <- x
gse$targets <- sampleInfo
gse$genes <- annot
gse$other$Detection <- detectionpvalues

########### Background correction / normalisation
gse.bgcorrect.norm <- neqc(gse, offset = 16)

# filter out control probes, those with no symbol, and those that failed
Control <- gse.bgcorrect.norm$genes$Source=="ILMN_Controls"
NoSymbol <- gse.bgcorrect.norm$genes$Symbol == ""
isexpr <- rowSums(detectionpvalues <= 0.05) >= 3
gse.bgcorrect.norm.filt <- gse.bgcorrect.norm[!Control & !NoSymbol & isexpr, ]

dim(gse.bgcorrect.norm) # 48803 116
dim(gse.bgcorrect.norm.filt) # 26225 116


##summarise across genes by mean
gse.bgcorrect.norm.filt.mean <- avereps(gse.bgcorrect.norm.filt, ID = gse.bgcorrect.norm.filt$genes$Symbol)


dim(gse.bgcorrect.norm.filt.mean) # 19485 116

library(limma)

design<- model.matrix(~0 + sampleInfo$source_name_ch1)  
## the column names are a bit ugly, so we will rename
colnames(design) <- c("Non_tumor","Adenocarcinoma")

#calculate cutoff
cutoff <- median(gse.bgcorrect.norm.filt.mean$E)

#filter
is_expressed <- gse.bgcorrect.norm.filt.mean$E > cutoff
keep <- rowSums(is_expressed) > 2

gse.bgcorrect.norm.filt.mean$E <- gse.bgcorrect.norm.filt.mean$E[keep,]

#fit the model to the data
#The result of which is to estimate the expression level in each of the groups that we specified.
fit <- lmFit(gse.bgcorrect.norm.filt.mean$E, design)


contrasts<-makeContrasts(Adenocarcinoma-Non_tumor,levels = design)
fit2 <- contrasts.fit(fit, contrasts)
#apply the empirical Bayes step to get our differential expression statistics and p-values.
fit2 <- eBayes(fit2)
topTable(fit2)

decideTests(fit2)
tfit <- treat(fit2, lfc=1)
dt <- decideTests(tfit)
adeno_vs_non <- topTreat(tfit, coef=1, n=Inf)

I am also uploading the plots from this session

enter image description here

microarray • 2.1k views

ADD COMMENT • link 3.0 years ago by PM • 0

0

Entering edit mode

Indeed, this is not an RNA-seq analysis but Microarray. There are, however, various RNA-seq workflows for R on Bioconductor that may serve as a template.

If you at some point want to run an analysis with hundreds or even thousands of samples, R and the Bioconductor workflows are unfortunately not going to cut it. In that case, you will find mature RNA-seq pipelines written in Nextflow, Snakemake, Common Workflow Language etc.

ADD REPLY • link 3.2 years ago by Matthias Zepper 5.1k

0

Entering edit mode

@Matthias since your post is not directly addressing the question I have moved it to a comment.

ADD REPLY • link 3.2 years ago by GenoMax 154k

0

Entering edit mode

Sure. I just assumed that the actual purpose of the question was to teach themselves RNA-seq analysis and therefore pointed out the respective workflows to retrace instead. But I should have asked instead, if the OP is nonetheless interested in feedback regarding the script.

ADD REPLY • link 3.2 years ago by Matthias Zepper 5.1k

score 1 · Answer 1 · 2022-10-04

1

Entering edit mode

3.2 years ago

benformatics 4.2k

Analysis looks fine (although I'm not too familiar with this old school tech).

My biggest concern would be referring to microarray data as RNA-seq...

ADD COMMENT • link 3.2 years ago by benformatics 4.2k

0

Entering edit mode

Sorry! Yes, got my wires crossed! They’re not the same thing!

Thanks for pointing that out

ADD REPLY • link 3.2 years ago by PM • 0

0

Entering edit mode

Please edit the original post/title and remove the references to RNAseq since that does not apply.

ADD REPLY • link 3.2 years ago by GenoMax 154k

score 0 · Answer 2 · 2022-10-04

I'm not quite clear on your purpose. Are you analysing this Illumina microarray dataset because you are interested in the biological results from this experiment or because you thought it would teach you about RNA-seq analysis?

If the former, then there are several ways in which your analysis could be improved or simplified. For one thing, neqc needs to know the detection p-values but you removed them for some reason. For another, this experiment is a paired design but you have not taken into account the pairing in your analysis.

If the latter, then there are RNA-seq workflows for the packages you are using here: