Hello all,
I'm trying to compare differentially expressed genes (DEGs) from microarray and RNA-seq data. I have two categories, A and B, for both the microarray data and the RNA-seq data. My hypothesis is that the DEG lists from microarray and RNA-seq will be largely the same, with perhaps a few differences but not many. For example, if a set of genes is down-regulated in the microarray data, I expect the same result from RNA-seq. Currently, fewer than half of the genes agree: fewer than 8,000 of the more than 16,000 genes I checked. I am not looking at how large the change is; I only check whether a gene falls in the same class (up-regulated, down-regulated, or similar level) on both platforms.
As for the data, I downloaded it from NCBI GEO. The two data sets come from different, independent experiments, but both are reported to be from the same cell type. With different experiments there will of course be some differences. My question is: what is your opinion of this result in terms of biological meaning? I want to use this as the basis for comparison with other data sets. If the basis is not reliable, I don't know whether my results will be valid.
Edit: information about the cutoff
I forgot to explain how I classify genes as up, down, or similar in expression level. My cutoff is arbitrary. I define the classes like this:
Up regulation : logFC less than -0.75
Similar regulation : logFC between -0.75 and 0.75
Down regulation : logFC more than 0.75
I used Limma for both microarray and RNA-seq data to get the logFC.
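The classification above, and the fraction of genes that land in the same class on both platforms, can be sketched roughly as follows. This is only an illustration in base R: the vectors `lfc_array` and `lfc_rnaseq` and the helper `classify_lfc` are made-up names with toy values, standing in for the logFC columns of the two limma result tables.

```r
# Toy logFC vectors named by gene (hypothetical values, not real data).
lfc_array  <- c(g1 = -1.2, g2 = 0.1, g3 = 0.9, g4 = -0.3, g5 = 2.0)
lfc_rnaseq <- c(g1 = -2.5, g2 = 0.4, g3 = 0.2, g4 = -0.9, g5 = 3.1)

# Labels follow the cutoffs exactly as stated in the post
# (logFC < -0.75 is called "up"). Whether a negative logFC means up or
# down depends on the direction of the contrast (A - B versus B - A).
classify_lfc <- function(lfc, cutoff = 0.75) {
  out <- ifelse(lfc < -cutoff, "up",
                ifelse(lfc > cutoff, "down", "similar"))
  factor(out, levels = c("up", "similar", "down"))
}

# Compare only genes measured on both platforms, in the same order.
common       <- intersect(names(lfc_array), names(lfc_rnaseq))
calls_array  <- classify_lfc(lfc_array[common])
calls_rnaseq <- classify_lfc(lfc_rnaseq[common])

# Cross-tabulate the calls; the diagonal holds the concordant genes.
concordance_table <- table(microarray = calls_array, rnaseq = calls_rnaseq)
concordance <- sum(diag(concordance_table)) / length(common)

print(concordance_table)
print(concordance)
```

The cross-table is usually more informative than the single concordance number, because it shows whether the disagreements are mostly "similar" versus "up/down" (a sensitivity difference) or "up" versus "down" (a genuine conflict).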
Thank you for your suggestions. First, I think I can safely say that quality control is not the problem, because the mapping rate is quite high: above 85% for several samples and above 90% for several others. The one thing that concerns me is your suggestion about a cutoff on expression level. I applied such a cutoff by calculating the mean expression over all samples and within each category, and then keeping only genes whose mean is above 1 both overall and in each category. I did this only for the RNA-seq data, because the limma documentation says low-expression genes need to be filtered for RNA-seq but not for microarray.
I will make the scatter plot and compute the correlation as you suggested, to check this first.
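For reference, the mean-based filter described above might look like this in base R. The matrix `expr`, the factor `group`, and the threshold name `min_mean` are hypothetical stand-ins with toy values; the real input would be the normalized RNA-seq expression matrix.

```r
# Toy expression matrix (genes x samples), e.g. log2-CPM values.
expr <- matrix(
  c(5.0, 4.8, 5.2, 5.1,
    0.2, 0.1, 3.0, 3.2,
    0.3, 0.5, 0.4, 0.2,
    2.0, 2.2, 1.5, 1.7),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3", "g4"), c("A1", "A2", "B1", "B2"))
)
group <- factor(c("A", "A", "B", "B"))

min_mean <- 1  # cutoff used in the post

# Mean over all samples, and mean within each category.
overall_mean <- rowMeans(expr)
group_means  <- sapply(levels(group), function(g) {
  rowMeans(expr[, group == g, drop = FALSE])
})

# Keep a gene only if the overall mean AND every category mean pass.
keep <- overall_mean > min_mean & apply(group_means > min_mean, 1, all)
expr_filtered <- expr[keep, , drop = FALSE]

print(rownames(expr_filtered))
```

One thing this sketch makes visible: requiring the mean to pass in every category removes genes such as `g2` here, which are expressed in only one category and are therefore strong DE candidates. Requiring the cutoff in at least one category, or using `filterByExpr` from edgeR, may be worth considering.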
Edit: Plot result
The plot looks a bit odd. The cloud of points does not lie along the diagonal but is nearly horizontal. The x axis is RNA-seq and the y axis is microarray. This means the microarray logFC values have a much smaller variance than the RNA-seq ones. The boxplot shows the same thing.
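To put numbers on what the plot shows, one option is to compute the correlation, the regression slope, and the ratio of standard deviations between the two logFC vectors. The sketch below uses simulated data (the names `lfc_rnaseq` and `lfc_array` and the compression factor of 0.2 are assumptions made only for illustration), so the numbers it prints say nothing about the real data sets.

```r
set.seed(1)

# Simulated logFC values: the "microarray" values are a compressed,
# noisy copy of the "RNA-seq" values, stored in a different gene order.
genes      <- paste0("g", 1:200)
lfc_rnaseq <- setNames(rnorm(200, sd = 2), genes)
lfc_array  <- setNames(0.2 * lfc_rnaseq + rnorm(200, sd = 0.2), genes)
lfc_array  <- lfc_array[sample(genes)]

# Align both vectors on the shared gene names before comparing.
common <- intersect(names(lfc_rnaseq), names(lfc_array))
x <- lfc_rnaseq[common]
y <- lfc_array[common]

pearson  <- cor(x, y)
spearman <- cor(x, y, method = "spearman")
slope    <- unname(coef(lm(y ~ x))[2])  # slope of microarray on RNA-seq
sd_ratio <- sd(y) / sd(x)               # spread of microarray vs RNA-seq

cat("Pearson:", pearson, " Spearman:", spearman, "\n")
cat("Slope:", slope, " SD ratio:", sd_ratio, "\n")

# Scatter plot with the identity line (dashed) and the fitted line (solid).
plot(x, y, xlab = "logFC RNA-seq", ylab = "logFC microarray")
abline(0, 1, lty = 2)
abline(lm(y ~ x))
```

If the correlation is high but the slope is far below 1, the two platforms agree on direction and ranking and the microarray values are simply compressed. If the correlation itself is near zero, the platforms disagree, and the horizontal cloud is a different kind of problem.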
Thank you!
Also, check the normalization step. If your conditions don't differ too much, you expect the average LFC to be close to 0. The limma package documentation has case studies that are also helpful.
I just checked the normalization using boxplots. Within the microarray samples and within the RNA-seq samples, the normalization looks good. The problem is the comparison between RNA-seq and microarray: the two have different variances. Do I need to normalize the logFC values of RNA-seq and microarray against each other? Is that a logical thing to do?
Check this figure for another example: http://www.biomedcentral.com/1471-2164/12/628/figure/F4. Normalizing the logFC does not make a lot of sense. You normalize to smooth out the differences between replicates and across most genes, which then yields an average logFC for each gene. What you should expect is:
What do you mean by variance? Difference between biological replicates?
PS: http://pastebin.com/QU72XYvX for some R code from a lab session. Read from line 232. The idea is to use lines like
limmaHigh <- row.names(limmaRes2[limmaRes2[,"logFC"]>2,])
to select genes with a log fold change above 2 or below -2 for each technique. You then plot your graph and highlight in yellow the genes that are DE in both techniques. It is important that your gene names are in the same order when plotting logFC_RNASeq and logFC_microarray...
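A small self-contained sketch of that idea, for illustration only: the result tables `res_rnaseq` and `res_array` below are toy data frames (not real limma output), and `select_de` is a hypothetical helper. It uses `abs()` so that genes below -2 are picked up as well as genes above 2, and it aligns both tables on the shared gene names before plotting.

```r
# Toy result tables with gene names as row names, in different orders.
res_rnaseq <- data.frame(
  logFC = c(3.1, -2.4, 0.3, 2.6, -0.2),
  row.names = c("g1", "g2", "g3", "g4", "g5")
)
res_array <- data.frame(
  logFC = c(0.1, 2.2, -2.8, 2.5, 0.4),
  row.names = c("g5", "g1", "g2", "g3", "g4")
)

# Genes whose absolute log fold change exceeds the cutoff.
select_de <- function(res, cutoff = 2) {
  rownames(res)[abs(res[, "logFC"]) > cutoff]
}

# Genes called DE by both techniques.
de_both <- intersect(select_de(res_rnaseq), select_de(res_array))

# Put both logFC vectors in the same gene order before plotting.
common <- intersect(rownames(res_rnaseq), rownames(res_array))
x <- res_rnaseq[common, "logFC"]
y <- res_array[common, "logFC"]

# Filled points in yellow for genes that are DE in both techniques.
in_both <- common %in% de_both
plot(x, y, pch = 21, bg = ifelse(in_both, "yellow", "grey"),
     xlab = "logFC RNA-seq", ylab = "logFC microarray")
abline(0, 1, lty = 2)

print(de_both)
```

Indexing both tables by `common` is what guarantees the ordering; plotting the two logFC columns directly would pair up the wrong genes whenever the tables are sorted differently, for example by p-value.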
By variance I mean the variance of the logFC values. The microarray logFC values tend to be smaller and have a smaller variance. The RNA-seq logFC values tend to be larger and have a larger variance. I think this is the reason why the scatter plot is almost horizontal with microarray on the y axis and RNA-seq on the x axis. In the figure you linked, the scatter plot is diagonal.
Ok, I will need to leave soon. Do you see the same effect with the most differentially expressed genes? In my experience RNASeq is a bit more sensitive and can detect larger variations. If you look at GO enrichment and ignore your issues, are the techniques consistent? You can also take a look at the p-values. My guess is the microarrays won't be very informative... I don't know why you see a reduced logFC with them.
My best advice is to just use the RNASeq DE genes and confirm a few with RT-qPCR.
Thank you. It is probably because the data come from different experiments, not from the same person/author. I am trying to do a meta-analysis that combines data from different sources. Apart from this difference, the other results seem to make sense, so I think I will just approach this from another point of view.