My work involves downloading RNA-seq data from NCBI SRA and analysing it to find differentially expressed (DE) genes. In such a case, is it advisable to combine data from different sequencers, for example data sequenced on Illumina HiSeq 1500, 2000 and 2500?
And what about data from the same sequencer but different library preparation methods?
I was wondering whether we could pre-process, align and count each dataset separately and then proceed to DE analysis.
It depends on the question of your study. If it is a data-driven study that tries to account for sequencing batches, then Approach 1 is better suited. If it is more in line with a biological hypothesis, Approach 2 is ideal when Approach 1, even after correction, does not yield a meaningful answer to the biological question you are trying to address. I will put a few suggestions here.
Approach 1:
If you want to interrogate a specific study that has different layers of data coming from different machines, sequenced by different operators with different library preparations, you risk batch effects. If you have a priori information about the batches in the data, you can model around them using ComBat; if not, you will need something like SVA or RUVSeq.
To perform the above, download the raw FASTQ files from the SRA study you are interested in. Quantify all the samples together with the aligner or mapper of your choice, providing the proper libType information (as Salmon/Kallisto prefer).
Prepare your metadata file with information about tissue types, operators, batches and libType. Once you have the combined count table of all your data, normalize the counts to logCPM and make a PCA bi-plot or MDS plot to see whether your biological hypothesis holds strongly or the batches dominate. If the batches dominate, you will have to correct for them or include them as covariates in your DE analysis. This can work, but keep in mind that if your batch effects and libType are too strongly confounded with the biology, the corrections will not be great and overfitting becomes a risk.
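The logCPM-then-PCA check described above can be sketched with plain NumPy. This is a toy illustration only: the count matrix is simulated, the logCPM formula is a simplified stand-in for what edgeR's `cpm` computes, and in a real analysis you would color the PCA scores by the batch and condition columns of your metadata table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix: 1000 genes x 6 samples (placeholder for your merged table).
counts = rng.poisson(lam=10, size=(1000, 6)).astype(float)

def log_cpm(counts, prior=0.5):
    """Counts-per-million on the log2 scale (simplified edgeR-style logCPM)."""
    lib_sizes = counts.sum(axis=0)
    cpm = (counts + prior) / lib_sizes * 1e6
    return np.log2(cpm)

def pca(x, n_components=2):
    """PCA of samples (columns of x) via SVD on the gene-centered matrix."""
    centered = x - x.mean(axis=1, keepdims=True)      # center each gene
    u, s, vt = np.linalg.svd(centered.T, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    explained = s**2 / (s**2).sum()
    return scores, explained[:n_components]

logcpm = log_cpm(counts)
scores, var_explained = pca(logcpm)
# Plot `scores` colored once by batch and once by condition (from your
# metadata) to see whether samples cluster by biology or by batch.
print(scores.shape)  # (6, 2): one (PC1, PC2) coordinate per sample
```

If the samples separate along PC1 by batch rather than by condition, that is the warning sign described above that batch correction or batch covariates are needed.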
Approach 2:
Alternatively, one can perform the DE analysis separately for each lab or study (provided each study has enough samples for DE analysis), then compare the DEGs that are in common and try to relate them to the biological question you want to address. Keep in mind that you might also get low overlaps.
It is a very broad question. As of now I can suggest these two approaches, but unless you interrogate the data and perform a preliminary exploratory analysis, it is difficult to say more. If the data are very homogeneous and batch effects do not mask the real biological differences, Approach 1 should work well for a meaningful hypothesis, and for that matter so should Approach 2.
I have tried Approach 2 and, as you mentioned, the overlap is very small but to some extent has biological relevance. One major shortcoming of Approach 2 is the number of replicates within each experiment. This now compels me to move towards Approach 1.
Can you help me with some papers that have used these two approaches?
Well, if you follow any omics paper related to kidney, or even cancers where biopsies are limited and rare, you will find Approach 2. I am not a big fan of Approach 2, but sometimes it is what you need to do, taken with a grain of salt. Approach 2 is totally agnostic of any robust statistical modelling of batches and covariates. People might beg to differ, but it feels a bit biased to me. Whether Approach 2 is beneficial also depends entirely on the data. But it is one of the most simplistic ways to perform such an analysis; it has its problems and also its advantages.
For batch effects, try the papers of RUVSeq, ComBat, SVA, etc., but more importantly the TCGA papers, which employ large-scale sequencing from different labs and different machines handled by multiple operators.
Simple idea: if you take all the datasets together, you will have to perform a proper batch-effect analysis to see what the major confounders in the data are. If the confounders supersede the biological hypothesis, then any variability you see is due to them and not the phenotype, so that needs to be addressed. If it does, you will need a design/model for the DE analysis that reduces that effect without overfitting the model, takes account of confounders and residuals, and uses proper covariates. This will then bring out DEGs that reflect the biological variation, not artifacts from the various confounders contributing to batch effects.
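Including batch as a covariate, as described above, means building a design matrix with a batch column alongside the condition column, so the DE tool estimates the condition effect while adjusting for batch. A minimal NumPy sketch with hypothetical sample labels (real pipelines would build this with `model.matrix` in R or patsy in Python):

```python
import numpy as np

# Hypothetical metadata for 6 samples from 2 conditions and 2 batches.
condition = ["ctrl", "ctrl", "treat", "treat", "ctrl", "treat"]
batch     = ["b1",   "b1",   "b1",    "b2",    "b2",   "b2"]

def one_hot(labels):
    """Drop-first one-hot encoding, as in a ~ condition + batch model matrix."""
    levels = sorted(set(labels))
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels[1:]]
                     for lab in labels])

intercept = np.ones((len(condition), 1))
design = np.hstack([intercept, one_hot(condition), one_hot(batch)])
print(design)
# Columns: intercept, condition=treat, batch=b2. The DE model estimates the
# condition effect while adjusting for the batch column, rather than
# "removing" the batch from the counts beforehand.
```

Note that this only works when condition and batch are not perfectly confounded: if every treated sample came from one batch and every control from the other, the two columns would be collinear and the batch effect could not be separated from the biology, which is the overfitting/confounding caveat raised above.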
If you mean comparing group A with library prep 1 on a HiSeq 1500 versus group B with library prep 2 on a HiSeq 2000: no, the technical variability between sequencers (and definitely between kits) is too big. Better to keep everything the same and only compare within-run/within-experiment.
It largely boils down to what the OP wants to study: technical variability that has to be modeled, or biological variability. But yes, different library preps, operators, sequencers and kits will certainly have an impact on the data and will mask your real biological differences. This adds to the heterogeneity of the samples as well, so a proper understanding of these features is required to reduce their effects. But first state your query a bit more specifically: do you just want DE for your study, or DE that also accounts for the effects of the confounders?