Question

Best practice for merging across lanes

1

2.5 years ago

dave ▴ 20

Hi all,

I'm interested in learning more about best practices for merging RNA-seq technical replicates. I've read many Biostars posts on the matter, but I have a somewhat special case.

Background:

I sent RNA samples for sequencing, and each was split across 4 lanes as per normal practice. However, the sequencing depth for this run was much lower than expected, so the sequencing core re-sequenced the same samples. Thus, I now essentially have 8 technical replicates per sample: 4 lanes from each of two runs. The repeat runs achieved much better depth.

Questions:

1. In this case, should the lower-depth data be discarded, or can these data still be used in combination with the updated runs?
2. If combined, is there a need to mitigate batch effects? Aside from read depth, the samples have nearly identical statistics with respect to FastQC analysis, % genome alignment individually with STAR, etc.
3. For combining, what stage is most appropriate? Aside from file sizes, is there any difference between merged .fastq files and merged .bam files? What about at the level of raw counts? In the past, I have merged .bam files from different lanes and found that the effect was equivalent to summing the raw read counts per gene across replicates.

RNA-Seq • 1.2k views

ADD COMMENT • link updated 2.5 years ago by madbadradscientist ▴ 20 • written 2.5 years ago by dave ▴ 20

1

Are they exactly the same samples? Or even the leftover library prep from the first run?

ADD REPLY • link 2.5 years ago by lieven.sterck 15k

0

Great question! Input RNA isolate is identical, though I am unsure if the library prep was repeated between runs. I'll ask and let you know ASAP.

ADD REPLY • link 2.5 years ago by dave ▴ 20

score 3 · Accepted Answer · 2022-06-02

3

2.5 years ago

swbarnes2 14k

It's very unlikely that the library prep was redone. You should almost certainly just put everything together; running the same library on different lanes or on different days does not introduce any technical artifacts.

The only exception would be if the first run had a serious technical problem with the instrument itself causing the reads to be totally unusable, but I doubt they would have given you the reads at all if that were the case.
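To illustrate the count-level version of "putting everything together", here is a minimal Python sketch, assuming per-lane raw counts are already available as dictionaries. The function name, run IDs, and toy numbers are hypothetical and not taken from this thread. Because each read is counted independently, summing per-lane counts for a sample should give the same totals as counting after merging the files, provided every lane was aligned and counted with the same settings.

```python
from collections import defaultdict


def sum_technical_replicates(counts_by_run, run_to_sample):
    """Sum per-gene raw counts over all runs/lanes that belong to one sample.

    counts_by_run: {run_id: {gene_id: raw_count}}
    run_to_sample: {run_id: sample_id}
    """
    merged = defaultdict(lambda: defaultdict(int))
    for run_id, gene_counts in counts_by_run.items():
        sample = run_to_sample[run_id]
        for gene, count in gene_counts.items():
            merged[sample][gene] += count
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {sample: dict(genes) for sample, genes in merged.items()}


# Hypothetical toy data: one sample, one lane from each of two runs.
counts_by_run = {
    "sampleA_run1_L001": {"geneX": 10, "geneY": 0},
    "sampleA_run2_L001": {"geneX": 35, "geneY": 4},
}
run_to_sample = {
    "sampleA_run1_L001": "sampleA",
    "sampleA_run2_L001": "sampleA",
}

merged = sum_technical_replicates(counts_by_run, run_to_sample)
print(merged)
```

Only raw counts should be summed this way; normalized values (CPM, TPM, and so on) would need to be recomputed from the summed counts.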

ADD COMMENT • link 2.5 years ago by swbarnes2 14k

0

Thank you very much for your answer; this is what I suspected. Upon inspection of the two runs, the parallel results from FastQC -> alignment -> raw counts appear identical, with the only difference being read depth. I'll proceed with caution and likely run a PCA to confirm that samples from the same experimental condition cluster together after adjusting for library size.

ADD REPLY • link 2.5 years ago by dave ▴ 20

1

Just to follow up on this: PCA of the top 500 most variable genes across all samples showed that the paired samples from each sequencing run overlapped. I feel very comfortable merging these data now. Thanks again!
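For anyone wanting to run a similar check, here is a rough sketch of this kind of PCA in Python with NumPy. It is not the code actually used here: the log2(CPM + 1) transform is an assumed, simple stand-in for "adjusting for library size", the function name is hypothetical, and the data are simulated (two conditions, each sequenced at low and high depth from the same underlying expression profile).

```python
import numpy as np


def pca_top_variable(counts, n_top=500, n_components=2):
    """PCA on the most variable genes after a simple library-size adjustment.

    counts: genes x samples array of raw counts.
    Returns a samples x n_components array of principal component scores.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)                      # total reads per sample
    log_cpm = np.log2(counts / lib_sizes * 1e6 + 1.0)   # assumed normalization
    variances = log_cpm.var(axis=1)
    top = np.argsort(variances)[::-1][:n_top]           # most variable genes
    x = log_cpm[top].T                                  # samples x genes
    x = x - x.mean(axis=0)                              # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulated toy data (hypothetical): 1000 genes, conditions A and B,
# each sequenced in a low-depth run and a high-depth run.
rng = np.random.default_rng(0)
n_genes = 1000
p_a = rng.gamma(shape=2.0, scale=1.0, size=n_genes)
p_b = p_a.copy()
p_b[:100] *= 8.0                                        # 100 genes differ in B
p_a /= p_a.sum()
p_b /= p_b.sum()

depth_low, depth_high = 2e5, 2e6
counts = np.column_stack([
    rng.poisson(depth_low * p_a),    # A, run 1
    rng.poisson(depth_high * p_a),   # A, run 2
    rng.poisson(depth_low * p_b),    # B, run 1
    rng.poisson(depth_high * p_b),   # B, run 2
])

scores = pca_top_variable(counts, n_top=500)
within_a = np.linalg.norm(scores[0] - scores[1])    # same sample, two runs
between_ab = np.linalg.norm(scores[0] - scores[2])  # different conditions
print(within_a, between_ab)
```

If the two runs of the same sample sit much closer together than samples from different conditions, that is consistent with depth being the only meaningful difference between runs.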

ADD REPLY • link 2.5 years ago by dave ▴ 20

0

Based on the comment thread, it appears that you can just merge the data as-is.

But if read depth had been an issue, you could have treated depth as a quantitative technical confounder, while treating the sample ID as the biological signal to preserve. For what it's worth, I recently developed a method for this type of setting: https://github.com/calvinmccarter/condo-adapter

ADD REPLY • link 2.5 years ago by madbadradscientist ▴ 20