I ran RSEM-Ebseq on some data, with paired and unpaired options and got different numbers for differentially expressed genes. Why might this be?
I ran RSEM-Ebseq on some data, with paired and unpaired options and got different numbers for differentially expressed genes. Why might this be?
If you have paired data and run in unpaired mode it will quantify the reads from each file - meaning that for all pairs you will doublecount the fragment.
You might be interested in this post:
https://biowize.wordpress.com/2014/03/04/understanding-rsem-raw-read-counts-vs-expected-counts/
The first time I compared raw reads counts to RSEM’s expected counts, I encountered an unexpected trend: the expected counts were not slightly lower than the raw counts, they were consistently lower by a factor of 2. After thinking about this a bit, I considered the possibility that RSEM treats each pair of reads as a single unit given paired-end data. To confirm this, I selected a small subset of 10 million read pairs from one of my samples and ran RSEM twice: once in paired-end mode, and once in single-end mode disregarding the read pairing information. Consistent with my hypothesis, the expected counts produced in single-end mode were almost exactly 2 times to the expected counts produced in paired-end mode.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Please post the commands used and explain in more detail what you did.
I took the paired end reads (separate files denoted R1 and R2) for each sample time (two time periods, 3 duplicates). I ran RSEM from a script file with the paired option (so the script file contained R1 and R2) and through a separate script ran R1 and R2 individually.
I then ran ebseq either using the paired reads RSEM-1.3.1/rsem-run-ebseq GeneMat.txt 3,3 GeneMat.results
or RSEM-1.3.1/rsem-run-ebseq GeneMat.txt 6,6 GeneMat.results for unpaired,
then /RSEM-1.3.1/rsem-control-fdr --hard-threshold GeneMat.results 0.05 GeneMat.de_hard.txt for both results files individually. The genes with fdr < 0.05 was significantly more for the unpaired reads set than the paired.
R1 and R2 reads came from the same cDNA / RNA fragment, so when you count R1 and R2 individually and sum the results, you are essentially artificially multiplying the read counts by two. What would this entails?
Imagine you toss a coin 20 times, obtaining 6 heads and 14 tails. Performing a goodness-of-fit Chi-Square tells you the coin is fair (lets use R for the test):
Now artificially multiply the values obtained by 2: