Question

Can I use RNA-seq data as control and microarray data as treatment to get differential expressed genes?

0

Entering edit mode

9.7 years ago

liuxiaomeng90 ▴ 70

I download the microarray data of GSC with PTEN loss as experimental group, but my control group is RNA-seq data. How could I get differential expressed genes from those two types of data?

Thanks!

microarray RNA-Seq • 3.6k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.7 years ago by liuxiaomeng90 ▴ 70

2

Entering edit mode

Normally you can use programme such as ComBat to combine RNA Seq and microarray data. However, to my knowledge, because you only have RNA Seq control and microarray cases, when you correct for the platform difference, you will also correct for any difference between case and control, therefore making it impossible to identify any differential expression. I would also be interested if there is any other methods that can make this possible though.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

0

Entering edit mode

I am familiar with RNA-seq data but not with the microarray data and I find the document of ComBat is not detailed. The GEO data I used were some samples in GSE64985 which was normalized to obtain an integrated gene expression atlas across diverse biological sample types and conditions. What is supposed to be my input file if I use ComBat to combine RNA Seq? And what format it is in?

Thanks!

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by liuxiaomeng90 ▴ 70

0

Entering edit mode

I don't know if I can scale the RNA-seq data and microarray data as GSE64985 described, "performing quantile normalization on the entire compendium using the limma R package (version 2.18.2) in order to reconcile broader differences between datasets and ensure that all arrays were on the same scale"

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by liuxiaomeng90 ▴ 70

0

Entering edit mode

What you will need will be the expression level of the microarray data and the RNA Seq count data. Normalized microarray data should normally be fine. However, as I have mentioned, when you perform the correction of platform difference, you will most likely correct for the difference between the two group of samples rendering anything that you've identified as problematic.

If you really want to do it, I can try and search my scripts of ComBat for you.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

0

Entering edit mode

Thanks for your reply! I'd appreciate it if you give me your scripts of ComBat. It will be of great help to study microarray data.

ADD REPLY • link 9.7 years ago by liuxiaomeng90 ▴ 70

1

Entering edit mode

source("http://bioconductor.org/biocLite.R")
#Combat package is in sva
biocLite("sva")
#For RNA Seq
biocLite("DESeq2")
#To read microarry data
biocLite(affy)
biocLite("affycoretools")
#For downloading the GEO data
biocLite("GEOquery")
#Load the libraries
library(affy)
library(affycoretools)
library(sva)
library(DESeq2)
library(GEOquery)

#Getting microarry data
mydata <- ReadAffy()
#I know the phenotype of the data, so I manually make this phenotype matrix
pheno = data.frame(row.names=row.names(pData(mydata)), timepoint= c(rep("GD9.5",5), rep("GD11.5",4),rep("GD13.5",6), "GD9.5"), platform="1", condition="Control")
eset <- rma(mydata)

#Getting the gene name
gpl=getGEO("GPL1261", destdir=".")
id2Name=Table(gpl)[,c(1,11)]
id2NameMatrix=matrix(NA, nrow(id2Name), 2)
for(i in 1:nrow(id2Name)){
id2NameMatrix[i,1] = as.character(id2Name[i,1])
split = unlist(strsplit(as.character(id2Name[i,2]), split="//")[[1]])
if(length(split) == 1){
  id2NameMatrix[i,2] =  as.character(split[1]);
}
else{
  id2NameMatrix[i,2] = as.character(split[2]);
}
}
colnames(id2NameMatrix) = c("ID", "Name")
id2NameMatrix = as.data.frame(id2NameMatrix)
micro.info = merge(exprs(eset), id2NameMatrix, by.x="row.names", by.y="ID")

#Getting the RNA data
loc="../counts/" #The location where we put all the count files
files=list.files(pattern=".count", path=loc) #The count files all end with the .count
data=read.table(paste(loc,files[1],sep=""), header=F, row.names=1)
colnames(data)=strsplit(files[1],"\\.")[[1]][1]
for(i in 2:length(files)){
temp = read.table(paste(loc,files[i],sep=""), header=F, row.names=1)
colnames(temp) = strsplit(files[i],"\\.")[[1]][1]
data = cbind(data,temp)
}

colData = data.frame(Condition=c(rep("Case", 5), rep("Control", 5))) #Set our condition
row.names(colData)=names(data)
colnames(colData)=c("condition")
dds <- DESeqDataSetFromMatrix(countData = data, colData = colData, design = ~ condition)
colData(dds)$condition <- factor(colData(dds)$condition,levels=c("Control","Case"))
colData(dds)$condition <- relevel(colData(dds)$condition, "Control")
dds <- DESeq(dds)
rld <- rlog(dds) #So now we can have the normalized RNA Seq data

#I have used Ensembl for my RNA Alignment and gene counting, so I need to have the mgi gene symbol
mgi = read.csv("ensemble2mgi.csv", row.names=1)
rld.mgi = merge(assay(rld), mgi, by="row.names")
#merging the microarray and RNA data
data=merge(micro.info, rld.mgi, by.x="Name", by.y="MGI.symbol")
#Now we set the phenotype information of the RNA
phenoRNA = data.frame(row.names=colnames(assay(rld)), condition=c(rep("Case", 5), rep("Control", 5)), platform="2", timepoint="GD9.5")
#Combining phenotype of both RNA and microarray
phenotype = rbind(pheno, phenoRNA)
#Some minor cleaning up of the table
data = data[,-c(2,19,30)]
data.unique = data[!duplicated(data$Name),]
row.names(data.unique) = data.unique$Name
data.unique = data.unique[,-1]
#Only want samples with from one particular timepoint
exclude=row.names(subset(phenotype, phenotype$timepoint!="GD9.5"))
data.input = data.unique[, !(colnames(data.unique) %in% exclude)]
pheno.input = phenotype[!(row.names(phenotype) %in% exclude),]
data.input = data.matrix(data.input)
#Now I use the platform as batch for correction
batch = pheno.input$platform
mod = model.matrix(~as.factor(condition), data=pheno.input)
mod0 = model.matrix(~1,data=pheno.input)
#Perform combat
combat_edata = ComBat(dat=data.input, batch=batch, mod=mod, numCovs=NULL, par.prior=TRUE, prior.plots=FALSE)
#Get the p-value from the combat corrected data
pValuesComBat = f.pvalue(combat_edata,mod,mod0)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
bValuesComBat=p.adjust(pValuesComBat, method="bonferroni")
sig = data.frame(row.names=names(bValuesComBat), pvalue=bValuesComBat)
sig$sig = sig$pvalue <=0.05
sig[sig$sig,]$sig = "Significant"
#Correct for surrogate variable
n.sv = num.sv(data.input,mod,method="leek")
svobj = sva(data.input,mod,mod0,n.sv=n.sv)
modSv = cbind(mod,svobj$sv)
mod0Sv = cbind(mod0,svobj$sv)
pValuesSv = f.pvalue(data.input,modSv,mod0Sv)
qValuesSv = p.adjust(pValuesSv,method="BH")

Again, I cannot stress more, this worked for me because I only add some additional microarray samples as a control data (i.e. I have case + control for my RNA Seq). So most of the true effect shouldn't be filtered in my case. However, in your case, I ain't even sure if the programme will let you run with only microarray case and RNA Seq control. You are basically using orange as control and comparing it with apple in a sense.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sam ★ 4.8k

0

Entering edit mode

Thank you! May be I will give up finding differentially expressed genes between microarray data and RNA-seq data. Your script help me a lot to learn using ComBat. It may be useful next time.

ADD REPLY • link 9.7 years ago by liuxiaomeng90 ▴ 70

score 3 · Answer 1 · 2015-03-25

I'd be extremely hesitant to call differentially expressed genes based on completely different sets of data. RNA-seq directly sequences cDNA, microarrays measure the fluorescence emitted by the labeled target sequences to probe sequences spotted on the chip. They come with very different sources of bias, not to mention the batch effects of having different persons handle the cDNA extracted from different samples at different places at different times with different experimental protocols...The whole point of the control data set for calling DE genes is to estimate the baseline/background expression values. By cooking up a control that doesn't match any of the expected characteristics of your treatment sample, I don't think you're doing yourself a favor.

That being said, I don't completely understand why an appropriate control sample is missing from the original submission. GSE7562 gives plenty of replicates for PTEN loss as well as WT. (found by following GPL570 which was indicated as the original data set in the accession number you mentioned)

Ram · Answer 2 · 2015-03-25

1

Entering edit mode

9.7 years ago

ayyappakumar.s ▴ 20

Is this data downloaded from public repositories like (NCBI GEO, SRA (or) Array express). In that case the control and case samples are generated at different time and conditions, so there is a high chance of batch effect. First, batch effect need to be corrected.

DO you have biological replicates for both case and control?

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by ayyappakumar.s ▴ 20

0

Entering edit mode

GSC data were downloaded from GEO, unfortunately it didn't supply the control data I wanted so I use my own RNA-seq control data. There are two replicates for control and four replicates for case.

ADD REPLY • link 9.7 years ago by liuxiaomeng90 ▴ 70