Question

Best way to merge RNA-seq data from two sequencing runs of the same samples

2

7.6 years ago

unsupervised_learner ▴ 30

Background

I have paired-end RNA-seq reads from a drug-treatment experiment. Many samples have fewer than 15 million mapped reads, which is too few, and the number of mapped reads varies widely across biological replicates. Differential expression and splicing analyses on these samples suggest that greater sequencing depth would improve the statistical power of my tests, and I have RNA left over from these samples that I could re-sequence.

The questions

Is it analytically and statistically tractable to re-sequence the same samples and control for potential artifacts in the combined data?

What would be the best workflow for merging data from these two RNA-seq runs? I would guess that it's best to keep the runs separate until the counts have been summarized. Then I can carry out PCA to visually inspect the gross extent of artifact in the different runs before merging the counts for statistical analyses.

Beyond gross visual inspection of the PCs, what sorts of quality control steps could I take if I identify a strong batch effect between the two sequencing runs? Would software like svaseq or ComBat be appropriate in that case? If so, would it be best to remove the batch effect from the samples before combining the count data?

RNA-Seq • 9.6k views

ADD COMMENT • link updated 7.6 years ago by WouterDeCoster 47k • written 7.6 years ago by unsupervised_learner ▴ 30

4

Technical replication of sequencing is excellent. As long as you stick to the same platform and read lengths, it should be fine to run the libraries again. You could check the results with PCA before proceeding with the rest of the analysis.

ADD REPLY • link 7.6 years ago by GenoMax 147k

score 3 · Accepted Answer · 2017-05-13

3

7.6 years ago

WouterDeCoster 47k

As you suggested (and as genomax2 confirmed), it's probably best to use PCA to check whether your two runs give approximately the same result.

But as soon as you have determined that they do, I would suggest merging your BAM files and repeating the counting before you do your final analysis. That would minimize your chance of errors.

Furthermore, if you are using a two-pass alignment (e.g. with STAR), it might be advantageous to merge the FASTQ files across runs and repeat the alignment, so that the first pass can detect splice junctions from the reads of both runs.

ADD COMMENT • link 7.6 years ago by WouterDeCoster 47k

0

Hello, I have the same question. I have 4 lanes for each sample (paired-end) in 2 experimental runs. To concatenate the FASTQ files, can I do it like this?

    cat fastq1_lane1_batch1 fastq1_lane1_batch2 fastq1_lane2_batch1 fastq1_lane2_batch2 fastq1_lane3_batch1 fastq1_lane3_batch2 fastq1_lane4_batch1 fastq1_lane4_batch2 > fastq1

    cat fastq2_lane1_batch1 fastq2_lane1_batch2 fastq2_lane2_batch1 fastq2_lane2_batch2 fastq2_lane3_batch1 fastq2_lane3_batch2 fastq2_lane4_batch1 fastq2_lane4_batch2 > fastq2

I ask because the PCA shows no difference between the runs.
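For paired-end data, the mate files have to be concatenated in the same order, otherwise read 1 and read 2 no longer line up. A minimal sketch of doing this with a loop instead of typing each file name twice is shown here. The naming scheme `<sample>_L<lane>_<batch>_R<read>.fastq` and the function name `merge_reads` are assumptions for illustration only and would need to be adapted to the real file names.

```shell
set -eu

# Hypothetical helper: concatenate all lanes and batches of one mate
# (read 1 or read 2) of one sample into a single FASTQ file.
# Assumed file names: <sample>_L<lane>_<batch>_R<read>.fastq
merge_reads() {
    sample=$1   # sample prefix, e.g. S1
    read=$2     # 1 or 2
    out=$3      # output file
    : > "$out"  # start from an empty output file
    # The loop order is fixed, so calling this once with read=1 and once
    # with read=2 appends the files in the same order for both mates.
    for lane in 1 2 3 4; do
        for batch in batch1 batch2; do
            cat "${sample}_L${lane}_${batch}_R${read}.fastq" >> "$out"
        done
    done
}
```

Calling `merge_reads S1 1 S1_R1.fastq` and then `merge_reads S1 2 S1_R2.fastq` would produce one file per mate. Comparing the line counts of the two outputs with `wc -l` is a cheap check that the pairs are still in sync.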

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

3

Yes. Note that you can also cat .fastq.gz files together without having to decompress.
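This works because the gzip format allows several compressed members back to back in one file. A small sketch of concatenating compressed files and checking the result follows. The function name `cat_gz` is hypothetical, and the integrity check is an extra precaution rather than a required step.

```shell
set -eu

# Hypothetical helper: concatenate gzip-compressed FASTQ files without
# decompressing them, then verify that the result is a valid gzip stream.
cat_gz() {
    out=$1
    shift
    cat "$@" > "$out"
    gzip -t "$out"   # exits non-zero if the merged file is corrupt
}
```

For example, `cat_gz merged_R1.fastq.gz run1_R1.fastq.gz run2_R1.fastq.gz` (placeholder file names) writes the merged file, and `gzip -dc merged_R1.fastq.gz | wc -l` then reports the combined number of lines.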

See also How to add images to a Biostars post

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k