Data selection for RNA-seq data analysis
7.0 years ago
Arindam Ghosh ▴ 530

My work involves downloading RNA-seq data from NCBI SRA and analysing it to find differentially expressed (DE) genes. In that case, is it advisable to combine data from different sequencers, for example Illumina HiSeq 1500, 2000 and 2500? What about data from the same sequencer but with different library preparation methods? I was wondering whether we could pre-process, align and count each dataset separately and then go on to DE analysis.

RNA-Seq ngs data • 1.7k views
7.0 years ago
ivivek_ngs ★ 5.2k

It depends on the question your study is asking. If it is a data-driven study that tries to account for sequencing batches, then Approach 1 is better suited. If it is more in line with a biological hypothesis, Approach 2 is ideal when Approach 1, even after correction, does not yield a meaningful answer to the biological question you are trying to address. Here are a few suggestions.

Approach 1:

  1. If you want to interrogate a specific study whose data come from different machines, were sequenced by different operators and used different library preparations, you risk batch effects. If you have a priori information about the batches, you can model them using ComBat; if not, you will need something like SVA or RUVSeq.
  2. To do this, download the raw FASTQ files from SRA for the study you are interested in. Quantify all the samples together with the aligner or mapper of your choice, providing the correct libType information (as Salmon/Kallisto prefer this).
  3. Prepare your metadata files with information about tissue types, operators, batches and libType. Once you have the combined count table for all your data, normalize the counts to logCPM and make a PCA biplot or MDS plot to see whether the samples separate mainly by your biological groups or by batch. If the batches dominate, you will have to either correct for them or use them as covariates in your DE analysis. This is possible, but keep in mind that if batch effects and libType are very strong confounders, the corrections will not work well and overfitting becomes a risk.
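The exploratory check in step 3 can be sketched as follows. This is a minimal illustration with NumPy, not the exact edgeR/limma procedure: `log_cpm` is a simplified logCPM, `pca_scores` is a plain SVD-based PCA, and the sample names, batch labels and toy counts are all made up for the example.

```python
import numpy as np


def log_cpm(counts, prior_count=0.5):
    """Simplified log2 counts-per-million for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total reads per sample
    # The prior count keeps zero counts finite on the log scale.
    return np.log2((counts + prior_count) / (lib_sizes + 1.0) * 1e6)


def pca_scores(logcpm, n_components=2):
    """Project samples onto the leading principal components via SVD."""
    x = logcpm.T                # samples x genes
    x = x - x.mean(axis=0)      # centre each gene across samples
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]


# Hypothetical toy table: rows are genes, columns are samples.
samples = ["ctrl_A", "treated_A", "ctrl_B", "treated_B"]
batch = ["A", "A", "B", "B"]
counts = np.array([
    [100, 100, 1000, 1000],   # genes that follow the batch
    [100, 100, 1000, 1000],
    [1000, 1000, 100, 100],
    [1000, 1000, 100, 100],
    [100, 150, 100, 150],     # genes that follow the condition
    [150, 100, 150, 100],
])

scores, explained = pca_scores(log_cpm(counts))
for name, b, (pc1, pc2) in zip(samples, batch, scores):
    print(f"{name}\tbatch={b}\tPC1={pc1:+.2f}\tPC2={pc2:+.2f}")
print("variance explained:", np.round(explained, 3))
```

In this toy table PC1 splits the samples by batch rather than by condition, which is the pattern that would call for correction or a batch covariate before DE analysis.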

Approach 2:

Alternatively, one can perform the DE analysis separately for each lab or study (provided each study has enough samples for DE analysis), then compare the DEGs they have in common and reason about the biological question from there. Keep in mind that the overlap may be low.
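Comparing per-study DEG lists, as described above, could look roughly like this. It is only a sketch: the study names and gene lists are hypothetical, and `jaccard` and `recurrent_degs` are illustrative helpers, not functions from any DE package.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recurrent_degs(deg_sets, min_studies=2):
    """Genes reported as DE in at least `min_studies` of the studies."""
    hits = Counter(g for degs in deg_sets.values() for g in set(degs))
    return sorted(g for g, n in hits.items() if n >= min_studies)


# Hypothetical DEG lists, one per independently analysed study.
deg_sets = {
    "study_A": {"TP53", "MYC", "EGFR", "VEGFA"},
    "study_B": {"MYC", "EGFR", "KRAS"},
    "study_C": {"EGFR", "BRCA1"},
}

pairwise = {
    (a, b): jaccard(deg_sets[a], deg_sets[b])
    for a, b in combinations(sorted(deg_sets), 2)
}
for pair, score in pairwise.items():
    print(pair, round(score, 2))
print("DE in >= 2 studies:", recurrent_degs(deg_sets, min_studies=2))
```

Low pairwise scores are expected when the studies differ in platform, library prep or replicate number, so genes that recur across several studies are usually the more defensible candidates.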

It is a very broad question. For now I can suggest these two approaches, but unless you interrogate the data and perform a preliminary exploratory analysis, it is difficult to say which one fits. If the data are very homogeneous and batch effects do not mask the real biological differences, Approach 1 should work well for a meaningful hypothesis, and for that matter so should Approach 2.


I have tried Approach 2 and, as you mentioned, the overlap is very small, though to some extent it has biological relevance. One major shortcoming of Approach 2 is the number of replicates within each experiment. This now compels me to move towards Approach 1.

Can you point me to some papers that have used these two approaches?


Well, if you follow any omics paper on kidney, or on cancers for which biopsies are limited and rare, you will find Approach 2. I am not a big fan of Approach 2, but sometimes it is what you need to do, taken with a grain of salt. Approach 2 ignores batches and covariates entirely, so it does not rest on any robust statistical treatment of them. People might beg to differ, but to me it is a bit biased. Whether Approach 2 will be beneficial also depends entirely on the data. Still, it is one of the simplest ways to perform such an analysis. It has its problems and also its advantages.

For batch effects, look at the papers for RUVSeq, ComBat, SVA, etc., but more importantly at the TCGA papers, which involve large-scale sequencing from different labs and different machines handled by multiple operators.

Link1

Link2

Simple idea: if you take all the datasets together, you will have to perform a proper batch effect analysis to see what the major confounders in the data are. If the confounders dominate the biological signal, then the variability you see is due to them and not the phenotype, and that needs to be addressed. In that case you will need a design/model for DEA that reduces their effect without overfitting, accounts for confounders and residuals, and uses the proper covariates. This should then bring out DEGs that reflect biological variation rather than artifacts of the confounders behind the batch effects.
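To make the role of a batch covariate concrete, here is a small sketch for a single gene using ordinary least squares in NumPy. It is not what limma, edgeR or DESeq2 do internally; in those tools the equivalent idea is a design such as `~ batch + condition`. The data are simulated, noise-free and hypothetical, with a true condition effect of 1.0 and a batch effect of 3.0 on the log scale, and `fit_ols` is an illustrative helper.

```python
import numpy as np


def fit_ols(y, *covariates):
    """Least-squares fit of y on an intercept plus the given 0/1 covariates."""
    design = np.column_stack([np.ones(len(y)), *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, effect of covariate 1, effect of covariate 2, ...]


# Unbalanced layout: most treated samples happen to sit in batch 1.
condition = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Simulated log-expression for one gene (assumed effects, no noise).
y = 5.0 + 1.0 * condition + 3.0 * batch

naive = fit_ols(y, condition)[1]            # model: y ~ condition
adjusted = fit_ols(y, condition, batch)[1]  # model: y ~ condition + batch

print(f"condition effect ignoring batch:   {naive:.2f}")
print(f"condition effect adjusting batch:  {adjusted:.2f}")
```

Ignoring the batch inflates the apparent condition effect because part of the batch shift leaks into it. Adding the covariate only recovers the true effect here because each batch contains both conditions.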

7.0 years ago

If you mean you can compare group A with library prep 1 on HiSeq 1500 versus group B with library prep 2 on HiSeq 2000: no, the technical variability between sequencers (and definitely between kits) is too big. Better to keep everything the same and only compare within-run/within-experiment.
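Whether a planned comparison falls into this situation can be checked from the sample sheet before any analysis. The sketch below is a rough illustration with hypothetical labels; `fully_confounded` is a made-up helper that flags the case where every batch (sequencer plus library prep) contains samples from only one group, so group and batch cannot be separated by any model.

```python
from collections import Counter, defaultdict


def crosstab(groups, batches):
    """Count samples for each (group, batch) combination."""
    return Counter(zip(groups, batches))


def fully_confounded(groups, batches):
    """True if there are several groups but no batch contains more than one of them."""
    groups_per_batch = defaultdict(set)
    for g, b in zip(groups, batches):
        groups_per_batch[b].add(g)
    several_groups = len(set(groups)) > 1
    return several_groups and all(len(gs) == 1 for gs in groups_per_batch.values())


# Hypothetical sample sheet matching the scenario above.
groups = ["A", "A", "A", "B", "B", "B"]
confounded_batches = ["HiSeq1500_prep1"] * 3 + ["HiSeq2000_prep2"] * 3
# Alternative layout in which both groups appear on both platforms.
crossed_batches = ["HiSeq1500_prep1", "HiSeq2000_prep2"] * 3

print(dict(crosstab(groups, confounded_batches)))
print("confounded layout:", fully_confounded(groups, confounded_batches))
print("crossed layout:", fully_confounded(groups, crossed_batches))
```

When the check returns True, no batch correction can rescue the comparison, because any group difference is indistinguishable from the platform or kit difference.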


It largely boils down to what the OP wants to study, be it technical variability that has to be modeled or biological variability. But yes, different library preps, operators, sequencers and kits will certainly have an impact on the data and can mask your real biological differences. This comes on top of the heterogeneity of the samples themselves. So a proper understanding of these factors is required to reduce their effects. But first state your query a bit more specifically: is it just DE for your study, or DE in which you want to account for the effects of the confounders?
