A friend asked me to help him figure out if a certain group of genes is expressed in some RNA-Seq data he found on GEO. The database only provides RPKM data, so I cannot feed the raw count data or .bam files into DESeq, edgeR, or cuffdiff. Also, there are only two biological replicates.
I know this isn't an ideal situation, but is there anything I can tell him/anything I can do to ballpark with reasonable confidence which genes are expressed?
I was thinking of doing something like: 1. log2 of RPKM data 2. Observe expression values for housekeeping genes 3. Set a 'this counts as expression' threshold below the housekeeping gene values 3. Look for where the expression values for the genes of interest fall relative to the threshold...of course, this doesn't account for noise at all.
How can I try to characterize the noise with so few biological replicates? Should I try to calculate fold changes relative to housekeeping genes? Is all hope lost unless I can get the raw count data?
Thank you for your insight!
That's correct - just one condition.
Thank you for the wonderful, detailed suggestion! I'll try it out!
@Kristin Muench did you do these steps in R? If so, do you mind to share the code?
I have two follow-up questions:
My understanding is that this technique is to see if the whole set of genes (that is, the mean of the set of genes I'm looking at - turns out only 4 genes :O) is, on average, expressed above or below what would be expected by random chance.
What would I do if I wanted to see if only one gene was expressed by random chance? Instead of bootstrapping, would I just use a t-test to see if the expression of my gene was in an extreme tail of expression values for my full dataset?
Also: most (over half!) of my genes have an expression value of 0, so that the distribution of the RPKM data looks like a normal distribution with a gigantic tower built at 0 to the left of it. I should still include all those genes with expression=0 in my bootstrap, right?
And one final comment:
I found that there were 880 bootstrap-produced means greater than the mean of my four genes of interest (which, in turn, are the averages of two samples), and 9120 bootstrap-produced means less than the mean of my four genes of interest. Using the above formula, I got an empirical p-value of 0.0001096 . This seems so high, given that my observed mean is only in the top 10% of bootstrap-produced means! How should I interpret this number?
Thank you again for any insight!