I work with RNA obtained by translating ribosome affinity purification (TRAP). RNA is immunopurified from genetically labelled ribosomes that are expressed only in the cell type of interest. An 'input' sample is also taken, by extracting RNA directly from the homogenised tissue before immunopurification, to serve as a background RNA comparison.
The first stage of analysis is to identify genes being translated in the cell type of interest. For this, most papers appear to use DESeq2 to compare their purified RNA with their input sample.
I have been concerned that the RNA composition of the purified and input samples differs so much that DESeq2's normalisation factor calculation may confound the analysis. Depending on the cell type investigated, there may be between 4,000 and 9,000 DEGs (out of 14,000 genes with a mean count >10) between purified and input samples.
Currently the normalisation factors for purified samples are ~0.9 whereas for the input samples they are ~1.3. There were 4000 DEGs between purified and input.
Should I be concerned or is this not an issue? The libraries were prepared with ERCC spike-ins, in case this is recommended as an alternative option.
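To make the concern concrete, here is a minimal sketch of a median-of-ratios size factor calculation, in the spirit of what DESeq2 does by default, written in plain Python rather than R. The gene names and counts are made up for illustration, and the `size_factors` function and its `control_genes` parameter are hypothetical helpers, not DESeq2's API. The sketch compares factors computed from all genes with factors computed from a spike-in subset only, which assumes the spike-ins were added in equal amounts to every library.

```python
import math
from statistics import median

def size_factors(counts, control_genes=None):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one per sample.
    control_genes: optional iterable of genes to restrict the estimate to
                   (e.g. spike-ins); hypothetical parameter for this sketch.
    """
    genes = list(counts) if control_genes is None else [g for g in control_genes if g in counts]
    # Only genes with a nonzero count in every sample have a finite log geometric mean.
    usable = [g for g in genes if all(c > 0 for c in counts[g])]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    n_samples = len(counts[usable[0]])
    # Per-gene log geometric mean across samples (the pseudo-reference).
    log_geo = {g: sum(math.log(c) for c in counts[g]) / n_samples for g in usable}
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - log_geo[g] for g in usable]
        # The median is taken on the log scale, then transformed back.
        factors.append(math.exp(median(log_ratios)))
    return factors

# Made-up counts: column 0 = input, column 1 = purified.
toy = {
    "ERCC_a": [100, 100], "ERCC_b": [200, 200], "ERCC_c": [50, 50],
    "marker1": [10, 400],
    "other1": [300, 30], "other2": [500, 50], "other3": [400, 40], "other4": [250, 25],
}
spike_ins = ["ERCC_a", "ERCC_b", "ERCC_c"]

all_gene_factors = size_factors(toy)
spike_in_factors = size_factors(toy, control_genes=spike_ins)
print("all genes:", all_gene_factors)
print("spike-ins:", spike_in_factors)
```

In this toy example most endogenous genes are depleted in the purified column, so the all-gene factors drift apart while the spike-in factors stay equal. In DESeq2 itself, `estimateSizeFactors` accepts a `controlGenes` argument that restricts the estimate to a chosen subset in a similar way; whether the spike-ins are a reliable reference for a given experiment is a separate question.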
Thanks for your reply.
I understand that genes in the purified sample are not 'upregulated', but they are enriched compared to input RNA. Using differential expression analysis is what others have performed to identify which genes are translated in the target cell type (for example: Epigenetic regulation of brain region-specific microglia clearance activity). My concern is whether the normalisation is appropriate to check for gene enrichment.
The main objective of my work is to compare the purified samples across conditions, where I can assume that the RNA composition is similar. As you say, I will also compare the input RNA between conditions to then show how condition-dependent DEGs differ between the purified and input samples.
They are not "enriched", they just have a high translation rate. Since the purification step substantially changes the RNA composition (I would expect so, at least), you might see biases, for instance in genes with long UTRs, so directly comparing purified and input won't give you the results you want. Honestly, I think that if you have enough coverage you can assume a normal distribution, compute a translation efficiency ratio, and compare those ratios between conditions.
I don't understand why it is not correct to say 'enriched'. I understand that the rate of translation of some genes may be low and therefore the difference in coverage of those genes compared to the input sample will not be large. Also that many translated genes will also be expressed in neighbouring cell types, reducing the difference in coverage compared to the input RNA. But in the case of highly translated and cell type specific genes, their transcripts are surely enriched in the purified RNA sample compared to the input RNA?
In your example of UTR bias, are you saying that genes with longer 5' UTRs will have greater coverage, and will therefore bias the interpretation of which genes are being translated in the target cell type towards those genes?
Could you give an example of what you mean by your last sentence? What would I assume is normally distributed?
Thank you for your help
What I meant is that there is an upper limit on the number of RNA molecules in the purified sample: it will always be <= input. This is why I think "enriched" is misleading here. You have high and low rates of translation, and you might compare these rates between two cell types, but stating that "the number of purified reads compared to input reads is enriched" is misleading: it can be high or low (within the range 0 to input) but not enriched.
With the UTR example I was just trying to say that purified/input ratios can't be compared between genes in the same sample, due to biases we might not be aware of.
My last suggestion was to forget about the negative binomial statistical model and just compute a translation efficiency ratio (probably in the range 0-1) and compare these ratios between samples. Since you're giving up on the statistical model you'll need more replicates to make statistically significant conclusions.
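A rough sketch of that ratio approach, under several assumptions of my own: counts are scaled to counts per million (CPM) within each library, the per-gene ratio is taken on the log2 scale with a pseudocount of 1, and the two conditions are compared with a Welch t statistic. All function names and numbers here are hypothetical. Note that after library-size scaling the ratio is a relative quantity and need not fall within 0-1.

```python
import math
from statistics import mean, variance

def cpm(counts):
    """Scale a dict of gene -> raw count to counts per million."""
    total = sum(counts.values())
    return {g: 1e6 * c / total for g, c in counts.items()}

def log2_ratio(purified, input_, pseudo=1.0):
    """Per-gene log2(purified CPM / input CPM); the pseudocount is an assumption."""
    p, i = cpm(purified), cpm(input_)
    return {g: math.log2((p[g] + pseudo) / (i[g] + pseudo)) for g in p if g in i}

def welch_t(a, b):
    """Welch t statistic and approximate degrees of freedom for two groups."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up (purified, input) pairs, three replicates per condition.
control = [
    ({"geneA": 400, "geneB": 100, "geneC": 500}, {"geneA": 100, "geneB": 400, "geneC": 500}),
    ({"geneA": 420, "geneB": 90, "geneC": 490}, {"geneA": 110, "geneB": 390, "geneC": 500}),
    ({"geneA": 380, "geneB": 120, "geneC": 500}, {"geneA": 95, "geneB": 405, "geneC": 500}),
]
treated = [
    ({"geneA": 200, "geneB": 100, "geneC": 700}, {"geneA": 100, "geneB": 400, "geneC": 500}),
    ({"geneA": 220, "geneB": 95, "geneC": 685}, {"geneA": 105, "geneB": 395, "geneC": 500}),
    ({"geneA": 190, "geneB": 110, "geneC": 700}, {"geneA": 98, "geneB": 402, "geneC": 500}),
]

ratios_control = [log2_ratio(p, i)["geneA"] for p, i in control]
ratios_treated = [log2_ratio(p, i)["geneA"] for p, i in treated]
t_stat, dof = welch_t(ratios_control, ratios_treated)
print("geneA log2 ratios (control):", ratios_control)
print("geneA log2 ratios (treated):", ratios_treated)
print("Welch t:", t_stat, "df:", dof)
```

Turning the statistic into a p-value needs a t distribution (for example from SciPy), and running this over thousands of genes would also need a multiple-testing correction, both left out here to keep the sketch short.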
I still have difficulty understanding your comments above, but I have added some information below and would be grateful for your input.
I am immunopurifying RNA from one cell type situated within a mixture of other cell types. I am not immunopurifying RNA from one cell type grown in an isolated population of just this cell type. So I am not extracting both ribosome-bound RNA and total RNA from just this cell type. I am extracting ribosome-bound RNA from this cell type but total RNA from the mixture of cell types in that brain region.
The primary goal of the experiment is to see how gene expression changes in this specific cell type between different experimental conditions. To do this, I compare read counts from immunopurified samples between conditions. I am using DESeq2 to do this and I am assuming that the library composition is largely similar between conditions.
However, I would like to also ask how "pure" the immunopurified samples are, by comparing them to the homogenate samples they derive from.
If I filter genes with read counts below an arbitrary threshold (e.g. 100), a smaller number of genes remain in the immunopurified samples than in the homogenate samples. I interpret this as reflecting the lower diversity of genes being expressed in a single cell type versus the mixture of cell types present in the homogenate sample.
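For reference, the filtering step described above amounts to something like the sketch below, with made-up counts and hypothetical names. A raw-count threshold depends on sequencing depth, so the sketch also shows a depth-scaled (CPM) variant; the CPM cutoff of 50,000 is arbitrary and only suits this tiny toy library.

```python
def genes_above(counts, threshold):
    """Return the set of genes whose value is at or above the threshold."""
    return {g for g, c in counts.items() if c >= threshold}

def cpm(counts):
    """Scale a dict of gene -> raw count to counts per million."""
    total = sum(counts.values())
    return {g: 1e6 * c / total for g, c in counts.items()}

# Made-up counts for one immunopurified and one homogenate library.
purified = {"marker1": 900, "marker2": 650, "glial1": 20, "glial2": 15, "shared1": 415}
homogenate = {"marker1": 150, "marker2": 120, "glial1": 400, "glial2": 330, "shared1": 500}

raw_ip = genes_above(purified, 100)
raw_hom = genes_above(homogenate, 100)
print("raw >= 100:", len(raw_ip), "genes in IP,", len(raw_hom), "in homogenate")
print("only in homogenate:", sorted(raw_hom - raw_ip))

# Depth-scaled variant, so the cutoff means the same thing in both libraries.
cpm_ip = genes_above(cpm(purified), 50_000)
cpm_hom = genes_above(cpm(homogenate), 50_000)
print("CPM >= 50,000:", len(cpm_ip), "genes in IP,", len(cpm_hom), "in homogenate")
```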
As a result, certain genes have higher read counts in the immunopurified samples, while others have lower counts. Canonical markers of this cell type have higher counts, while markers of other cell types have lower counts.
I want to formally compare the abundance of read counts between immunopurified samples and homogenate samples and I was asking whether DESeq2 is an appropriate method for this. Are you suggesting instead to simply compare the ratio of read counts between the two types of samples (while assuming normality and using a parametric test) like in this post?