Question

Rnaseq Differential Expression

2

Entering edit mode

12.4 years ago

Ashutosh Pandey 12k

Hello everyone,

I have 4 sets of RNAseq data (RNA derived from eyes) of two different strains of mice.

1st strain - Left Eye, Right eye 2nd strain - Left Eye, Right eye

My aim is to find out genes that are deferentially expressed in eyes between these two strains of mice at p-value of .05.

I have already aligned the reads and calculated RPKM at different levels including genes, transcripts, exons. But I need to do a statistical testing that can help me to calculate p-value(differential expression) using RPKM values.

My first question:

1) Can I use two different eye data (left and right) from the same strain as biological replicates? I calculated correlation between expression of all genes(RPKM) between left eye and right eye from the same strain and it was really high .969 for first strain but kind of low .83 for the second strain. Also, the high correlation could be a result of lot of genes with very small values of RPKM. Only 8000 out of 24,000 genes have RPKM > 5.

2) Which tool would be the best for my case to calculate p-values of differential expression of a gene between two samples?

Thanks in advance.

rna-seq replicates • 19k views

ADD COMMENT • link updated 10.3 years ago by Onat • 0 • written 12.4 years ago by Ashutosh Pandey 12k

score 9 · Answer 1 · 2012-06-22

edgeR, DESeq and DEGSeq all use counts as input not RPKM. They are not suitable for RPKM input as this does not meet the statistical criteria for the models they use.

The edgeR vignette (see links in the answer from JC) in particular has some details on filtering genes which you might find useful. In the vignette they illustrate this with code to remove genes with less than 1 count / million across half of the samples. These genes will not achieve statistical significance with the models used (I think - this isn't very clear in the documentation and I'm not statistically minded enough to work it out myself from first principles).

I've used both edgeR and DESeq (but not DEGSeq) to carry out analysis of RNA-Seq transcriptome data such as you have, except I had count data not RPKM.

Importantly if your data are paired (are left & right eye material from the same animal?) then you can model this in edgeR leading to a more robust analysis. The documentation for both packages are very good.

HTH

duff

score 5 · Answer 2 · 2012-06-22

I will answer your 2nd question first:

(2) I highly recommend using the DESeq package for R to do this analysis. The DESeq authors have provided an excellent manual taking you step by step through a standard differential expression analysis. This should make it possible even if you have never used R before (but will be slow and frustrating...) http://www-huber.embl.de/users/anders/DESeq/

(1) I share your concern about the inconsistent correlation between left and right eyes. Have you done quality assessment on your fastq files? or read quality filtering on the mapped reads? One possibility is that you have a bad data set for the second strain. I have used Fastqc for this: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

If your data quality looks good, try using Spearman correlation instead of Pearson. Also, try Pearson correlation with filtered data. For example, use only the top 60% of genes based on the sum of raw read counts over all 4 samples--this is a strategy suggested in the DESeq manual in a different context. Another possibility is to use the variance corrected read counts (calculated using DESeq). Yet another common strategy would be normalizing each gene (subtract mean and divide by standard deviation separately for each gene). Genes with low read counts (either very small, or very low expression) will be very noisy (high variance, low repeatability). However, computing RPKM actually makes the problem worse. The DESeq authors recommend sticking with raw read counts, or variance corrected read counts (http://genomebiology.com/2010/11/10/R106).

Finally: If it were my data, I would definitely use left and right as biological replicates. It may not be perfect, but it's a lot more powerful than single sample comparisons. Also, I would be tempted to also run the analysis comparing left to right, using the two mouse strains as biological replicates. That would depend on how interested you are in any biological left/right differences.

Good luck!

score 3 · Answer 3 · 2012-06-22

3

Entering edit mode

12.4 years ago

JC 13k

1) The variation between samples can be computed filtering low expressed transcripts (I remove everything below 0.01 RPKM). What do you use for the mapping? tophat?

2) For this task I suggest edgeR and DEGSeq from R:bioconductor.

http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

http://www.bioconductor.org/packages/release/bioc/html/DEGseq.html

ADD COMMENT • link 12.4 years ago by JC 13k

0

Entering edit mode

I'm sorry if I got it wrong, but does this mean that you use the same RPKM for downstream analysis (DGE here) in DESeq or edgeR?

ADD REPLY • link 12.4 years ago by Arun 2.4k

0

Entering edit mode

No, I use raw counts per exon for DEGseq and edgeR.

ADD REPLY • link 12.4 years ago by JC 13k

0

Entering edit mode

The tutorials about edgeR suggest to remove those genes does not that have at least 1 read per million in at least 'n' samples ( n = smallest group of samples). But the DESeq tutorials doesnt include this step. Should we remove those genes or keep them in the data analysis pipeline ?

ADD REPLY • link 10.7 years ago by GouthamAtla 12k

score 1 · Answer 4 · 2013-12-04

1

Entering edit mode

10.9 years ago

Charles Warden 8.3k

I would remove (or effectively censor by rounding) the low coverage genes. I describe the impact of various cutoffs in this threshold:

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

In general, I would agree with the recommendation for using DESeq, but I don't really like edgeR. There was another RNA-Seq benchmark paper that has provided additional evidence for a choosing a differential analysis strategy (in this case, limma): http://bib.oxfordjournals.org/content/early/2013/12/02/bib.bbt086.long. However, I think there are plenty of good responses on the analysis strategies.

If the concordance for replicates is better for genes with higher expression, then things should be straightforward. If you only have 4 total samples, I would probably recommend at least trying to use both eyes in both strains. Otherwise, there is no possibility to calculate p-values based upon biological variation. In fact, I don't know how you are going to choose the "bad" sample with only 2 replicates (that is the advantage to having at least 3 replicates per group)

ADD COMMENT • link 10.9 years ago by Charles Warden 8.3k

0

Entering edit mode

The tutorials about edgeR suggest to remove those genes does not that have at least 1 read per million in at least 'n' samples ( n = smallest group of samples). But the DESeq tutorials doesnt include this step. Should we remove those genes or keep them in the data analysis pipeline ?

ADD REPLY • link 10.7 years ago by GouthamAtla 12k

0

Entering edit mode

I think it is up to you. This sounds similar to my strategy of using log2(RPKM + 0.1) values. However, those are the values that I use for analysis in R or sRAP. I would typically just give DESeq a whole table of counts. Presumably, the low coverage genes should have more variability and higher p-values - my main problem was that he fold-change values didn't seem reasonable for the low coverage genes (and some still made it into lists with FDR < 0.05).

ADD REPLY • link 10.7 years ago by Charles Warden 8.3k

score 0 · Answer 5 · 2014-08-01

Hi,

I am very fresh in the RNA Seq data analysis area and I have a question regarding the differential gene expression analysis. I have come up with an idea to perform differential gene expression analysis by using RPKM and/or expression values from RNA Seq Data by considering all the RPKM and/or expression value outputs for each gene in separate datasets and consider each value as one replicate. As a next step, I am considering to use those data for each gene and calculate the fold change after the statistical tests and p value calculations. Could that be possible or logical? I appreciate your help. Thank you in advance.