Hi all, wondering if anyone can help. We're attempting to establish bulk RNAseq in-house, and are initially trying to validate the wet-lab process by comparing RNAseq datasets generated from the same set of samples, sequenced both externally (previously) and internally using our own library-prep methods.
Both the externally and internally generated data from the same samples will be compared using the same bioinformatics pipeline. So far, the comparison consists of:
- Correlation - simple correlation between count matrices, correlation post normalisation and correlation post batch-correction. The correlation matrix shows correlation between samples that makes biological sense, i.e. survivors correlate with survivors, terminal with terminal.
- Dimensionality reduction - PCA post transformation, and post transformation plus batch correction, to assess the clustering of samples. Samples cluster by sex, outcome and by sample post batch-correction.
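One way to make the correlation check above more targeted is to correlate each sample's external profile with its own internal profile, and compare that against its correlation with a different sample. Below is a minimal sketch of that idea in plain Python. The counts are made-up toy values, and the `log2(count + 1)` transform is an assumption chosen for simplicity, not necessarily the transformation used in the pipeline described here.

```python
import math

def log2p1(counts):
    """log2(count + 1), so zeros are defined and large counts do not dominate."""
    return [math.log2(c + 1) for c in counts]

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical raw counts for six genes (toy values for illustration only).
external = [1200, 30, 0, 560, 88, 4100]   # sample A, sequenced externally
internal = [610, 14, 1, 300, 40, 2000]    # sample A, sequenced in-house
unrelated = [5, 900, 300, 12, 2500, 7]    # a different sample

r_matched = pearson(log2p1(external), log2p1(internal))
r_unrelated = pearson(log2p1(external), log2p1(unrelated))

print(f"matched pair:   r = {r_matched:.3f}")
print(f"unrelated pair: r = {r_unrelated:.3f}")
```

If the in-house prep is behaving, the matched-pair correlations should sit clearly above the between-sample correlations for every sample, even before any batch correction.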
My question is - would you be able to identify any other potential methods by which to compare? Differential expression is obviously going to be hugely confounded by batch. Also, what is the correct way to normalise - together or separately? Unfortunately the externally sequenced dataset is 120M read pairs deep and the internal one 60M read pairs deep. Currently I am combining matrices and normalising together with DESeq to make them comparable. Any other perspective would be hugely appreciated.
Thanks!
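To make the depth question concrete, here is a small sketch of median-of-ratios size factors, the general approach DESeq/DESeq2 use for library-size normalisation. It is a simplified plain-Python illustration rather than the package's actual implementation, and the count matrix is hypothetical, with the external column made exactly twice as deep as the internal one.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero in any sample are skipped, since their geometric
    mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    # Log of each usable gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(gene[j]) - lg
                      for gene, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: columns are [external, internal] for the same sample.
counts = [
    [200, 100],
    [40, 20],
    [0, 3],      # skipped: contains a zero
    [1000, 500],
    [60, 30],
]

sf = size_factors(counts)
normalised = [[c / f for c, f in zip(gene, sf)] for gene in counts]

print("size factors:", [round(f, 3) for f in sf])
print("first gene after normalisation:", [round(v, 1) for v in normalised[0]])
```

In this toy case the external size factor comes out at twice the internal one, so dividing by the size factors puts both columns on the same scale. The geometric means are computed across all samples in the matrix, which is why estimating size factors on the combined matrix gives one common reference for both batches, whereas normalising separately gives each batch its own reference.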
Can you clarify the exact procedure:
Sounds like you want to bring the entire process in house.
If #2, are you using the same kit with the exact same RNA that was sequenced before? If you are simply sequencing libraries you have now made in-house, is the same type of sequencer/chemistry being used?
Good question - 2. Sent out previously for both library prep and sequencing (a completely independent project).
Yes, establishing everything in-house.
"Are you using the same kit with the exact same RNA that was sequenced before? If you are simply sequencing libraries you have now made in-house, is the same type of sequencer/chemistry being used?" Exact same RNA (obviously with the caveat that it has been in the freezer etc.), different kit, different sequencer.
In an ideal world there should not be a big difference in the DE result. However, depending on how long the RNA was stored, who made the libraries, and the switch in chemistry/sequencer, there could be (some/subtle) differences. If you are planning to analyze the data together, track that additional metadata in your models, PCA plots etc. when you compare.
Yep indeed. Can you think of any useful comparisons of the datasets beyond correlation and dimensionality reduction that would be informative about the similarity of the datasets? What would be your instinct, normalise initially together or separately?
As you are already using DESeq2, here are 2 suggestions from within DESeq2.
1) Include the source of the data (internal vs external) as an interaction term in the design formula, something like: ~ lab + condition + lab:condition (where lab and condition stand for whatever your sample-table columns are called). Then, when extracting the DESeq2 results, the interaction term allows you to look at the differences in differential expression due to where the data came from. See Interactions in the DESeq2 vignette: "Interaction terms can be added to the design formula, in order to test, for example, if the log2 fold change attributable to a given condition is different based on another factor, for example if the condition effect differs across genotype." ... or in your case, if the condition effect differs across labs.
2) Compare the results of meanSdPlot() (from the vsn package, as used in the DESeq2 vignette) across labs to visualise whether the dependence of variance on mean differs between the 2 datasets after normalisation.
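To show what an interaction between lab and condition actually measures, here is a rough arithmetic sketch in plain Python. For a single gene it computes the terminal-vs-survivor log2 fold change within each lab and then takes the difference between the two. This is only an illustration of the quantity being tested: DESeq2 itself fits a negative binomial model with dispersion estimates, and the normalised counts, the group labels and the pseudocount of 1 below are all assumptions made for the example.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))

# Hypothetical normalised counts for ONE gene, three samples per group.
external = {"survivor": [100, 110, 90], "terminal": [400, 380, 420]}
internal = {"survivor": [95, 105, 100], "terminal": [200, 210, 190]}

lfc_external = log2_fold_change(external["terminal"], external["survivor"])
lfc_internal = log2_fold_change(internal["terminal"], internal["survivor"])

# The interaction effect: how much the condition effect differs between labs.
interaction = lfc_internal - lfc_external

print(f"external log2FC: {lfc_external:.2f}")
print(f"internal log2FC: {lfc_internal:.2f}")
print(f"difference (internal - external): {interaction:.2f}")
```

A gene where this difference is close to zero shows the same condition effect in both datasets, which is what you would hope for if the in-house process is equivalent. Genes with large differences are the ones worth inspecting for kit- or sequencer-specific bias.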