Hi all,
I'm comparing MicroArray based expression data with RNASeq expression data.
For this I have data from two fungal strains, one named B36 and the other named N402, each with 3 replicates.
I observed quite low correlation between the RNASeq samples. In the plot you can see the Pearson correlation of the normalized read counts and MA intensities (not log2 transformed).
The correlation between the MicroArray samples is quite high, but between the RNASeq samples it is quite low. The N402 RNASeq sample even shows a higher correlation to both MicroArray samples than to the B36 RNASeq sample.
I used the same mapping and counting procedure for both samples (STAR mapper + R package bamsignals). I observed a similar number of mapped reads and a similar sum of read counts in both samples. The correlation between the replicates within RNASeq B36 and within RNASeq N402 is close to 1. The fastqc report is quite good for all samples.
Does anyone have an explanation for this? Is my RNASeq B36 data just rubbish? Are there any other quality control steps I can do?
Or do you think the data is ok and RNASeq is much more sensitive to expression changes?
Would be very nice if someone could help :)
Could you post the code? If you have three replicates for each treatment, why do you have just one cell? Were the RNA samples sequenced to an appropriate depth? What is the mapping rate (look at STAR logs)?
We don't have enough information to know if the RNAseq data is fine, but RNAseq has greater dynamic range than microarrays, provided samples are sequenced to an adequate depth.
By the way, did you know that STAR can map and count reads per gene in the same run (--quantMode GeneCounts)?
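For the mapping-rate question, here is a minimal Python sketch (not from the thread) that pulls the percentages out of a STAR Log.final.out file. The example log text and all numbers in it are invented for illustration. Only the field names follow STAR's usual log layout, and the helper names parse_star_log and percent are hypothetical.

```python
def parse_star_log(text):
    """Return {field name: value string} from the text of a STAR Log.final.out."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no "|" separator
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, field):
    """Convert a value such as '87.00%' to the float 87.0."""
    return float(stats[field].rstrip("%"))


# Invented example content, NOT taken from the samples discussed here.
example_log = (
    "                          Number of input reads |\t1000000\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t870000\n"
    "                        Uniquely mapped reads % |\t87.00%\n"
    "             % of reads mapped to multiple loci |\t3.00%\n"
    "                 % of reads unmapped: too short |\t9.50%\n"
)

stats = parse_star_log(example_log)
uniquely_mapped_pct = percent(stats, "Uniquely mapped reads %")
multi_mapped_pct = percent(stats, "% of reads mapped to multiple loci")
print(f"uniquely mapped: {uniquely_mapped_pct}%, multi-mapped: {multi_mapped_pct}%")
```

For a real run you would read the file with open(path).read() and compare these fields across all six RNASeq samples.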
Thanks for your reply. I just took the mean of the replicates. The correlation between the replicates is close to 1.
The sequencing was performed by the same company under the same conditions for all samples.
% of mapped reads is 86-89%.
This is the code for read counting
And this I use for MicroArray analysis
I don't see a major issue here... one certainly cannot make a conclusion that a sample's data is 'rubbish' based just on a single correlation.
You have a higher correlation with the microarray data because it is log base 2 transformed. If you went ahead and produced regularised log transformations of your RNA-seq data, then I am confident that you would observe a similarly high correlation. In fact, your RNA-seq correlation coefficients are in line with what I have seen in the past when correlating such data on the count scale, where raw and normalised RNA-seq counts are usually modelled with a negative binomial distribution.
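To illustrate why the scale matters for Pearson correlation, here is a small Python sketch with invented toy numbers (not data from this thread). It uses a plain log2(x + 1) transformation as a stand-in. This is an assumption, and it is not the regularised log transformation mentioned above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log2p1(values):
    """log2 with a pseudocount of 1 so that zero counts stay defined."""
    return [math.log2(v + 1) for v in values]


# Invented toy profiles: 12 genes spanning a wide expression range.
sample_a = [10 * 2 ** i for i in range(12)]
sample_b = list(sample_a)
# One moderately expressed gene is strongly up in sample_b only.
sample_b[3] = sample_a[3] * 2048

r_raw = pearson(sample_a, sample_b)
r_log = pearson(log2p1(sample_a), log2p1(sample_b))
print(f"raw scale: {r_raw:.2f}   log2(x + 1) scale: {r_log:.2f}")
```

The only point of the toy example is that on the raw scale a single extreme gene can dominate Pearson's r, while on the log scale the other eleven genes still count. Whether this is what happens in the B36 data cannot be concluded from it.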
Hi, thanks for your reply. In the correlation plot I posted, neither the Microarray (MA) nor the RNASeq data are log scaled (I used 2^exprs(results_ma)). I also made a version with log-scaled values. In that one, the RNASeq samples get a correlation of ~0.8 with each other and ~0.6 with the MA samples.
What puzzles me is that 1) the correlation within the MA samples is so different from the correlation within the RNASeq samples and 2) RNASeq N402 has a much better correlation to both MA samples than to RNASeq B36.
I would expect that cor (RNASeq B36, RNASeq N402) and cor (MA B36, MA N402) are similar OR cor (RNASeq B36, MA B36) and cor (RNASeq N402, MA N402) are similar.
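The comparison described here (average the replicates, then correlate every pair of profiles) can be sketched in Python as follows. All values are hypothetical toy numbers, 3 replicates by 5 genes per group, and the helper names are made up for this sketch. It is not the code used for the plot.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_profile(replicates):
    """Average the replicate vectors gene by gene."""
    return [sum(values) / len(values) for values in zip(*replicates)]


# Hypothetical toy values (NOT real data): group -> 3 replicates x 5 genes.
groups = {
    "RNASeq B36": [[12, 150, 900, 40, 3000], [10, 160, 950, 38, 3100], [11, 140, 870, 45, 2900]],
    "RNASeq N402": [[15, 140, 300, 42, 500], [14, 150, 320, 40, 520], [16, 145, 310, 44, 480]],
    "MA B36": [[90, 400, 1500, 200, 4000], [95, 410, 1450, 210, 4100], [85, 390, 1550, 190, 3900]],
    "MA N402": [[92, 395, 900, 205, 1500], [96, 405, 950, 195, 1450], [88, 400, 870, 200, 1550]],
}

profiles = {name: mean_profile(reps) for name, reps in groups.items()}
names = list(profiles)
# Pairwise correlation for every ordered pair, including the diagonal.
cor = {(a, b): pearson(profiles[a], profiles[b]) for a in names for b in names}

for a in names:
    print(a.ljust(12), " ".join(f"{cor[(a, b)]:.2f}" for b in names))
```

Printing the full matrix makes it easy to check the two expectations side by side: the within-platform pairs and the same-strain cross-platform pairs.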
I see. What is the origin of these samples?
N402 is a wild-type strain of Aspergillus niger.
B36 is a glucoamylase-overproducing strain of Aspergillus niger.
For each strain 3 cultures were grown and for each culture a sample was taken for MA and one for RNASeq analysis.
The RNASeq data is strand-specific and paired-end. The NEBNext Ultra Directional RNA Library Prep Kit for Illumina (NEB #7420S/L) and an Illumina HiSeq 2500 were used for sequencing.
For MA analysis this platform was used: DSM Aspergillus niger CBS513.88 by Affymetrix
Don't hesitate to ask if you need any further information.
Thank you. Have you not done anything else in terms of QC? Are you literally paused here because you feel that there is an issue with the quality of the samples? You should at least proceed with your further analyses and then decide later if there may have been some problem.
Do you have any suggestions for any other quality control steps I can do?
I'm writing my bachelor thesis with the goal of comparing MA and RNASeq. I want to find out if there are systematic biases in one or both methods, if there are classes of genes which one method suggests to be higher/lower expressed, etc.
But I could not find any of the biases described in the literature (GC content, transcript length, high number of SNPs, ...) in my data. The differences seem to be completely random.
Even the differential expression analysis shows completely different results for both methods.
So I went back and looked for any issues in the data.