Question

Is it feasible to merge Illumina RNA-Seq data with varying read lengths?

0

16 months ago

mathavanbioinfo ▴ 80

Hi,

I have nine samples in total across three groups: one control group and two treatment groups, with three samples per group. The control and treated_group_1 samples were sequenced on Illumina with a chemistry yielding 159 bp reads, while the treated_group_2 samples were sequenced with a chemistry yielding 151 bp reads. Is it feasible to combine the 151 bp and 159 bp samples for further analysis, particularly differential expression analysis? If so, what steps should I take? Additionally, are there any considerations or preprocessing steps, such as trimming, that I need to address before starting the analysis?

illumina RNA-seq paired-end • 1.1k views

ADD COMMENT • link updated 16 months ago by i.sudbery 22k • written 16 months ago by mathavanbioinfo ▴ 80

2

Since the lengths are not very different, it should be fine to combine the data. You should record the read length as additional sample metadata and check with PCA to make sure there is no extensive batch effect. You could also trim the longer reads down to the same length if you want to be particular.
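A minimal sketch of the PCA check suggested here, using only the Python standard library. It assumes a samples x genes matrix of raw counts and a read-length label per sample. The function names, the toy counts, and the labels are all made up for illustration and are not real data. In practice you would more likely run PCA on variance-stabilized counts in your usual differential expression toolkit.

```python
import math
import random


def log2_cpm(counts):
    """Convert raw counts (one list of gene counts per sample) to log2(CPM + 1)."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([math.log2(c / total * 1e6 + 1.0) for c in row])
    return out


def first_pc(matrix, n_iter=200, seed=0):
    """Return (PC1 score per sample, fraction of variance explained by PC1).

    Uses power iteration on the sample-by-sample Gram matrix of the
    gene-centered data, which is enough for a quick look at the main axis.
    """
    n, g = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(g)]
    x = [[row[j] - means[j] for j in range(g)] for row in matrix]
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]

    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][k] * u[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            raise ValueError("data has no variance across samples")
        u = [v / norm for v in w]

    eigenvalue = sum(u[i] * sum(gram[i][k] * u[k] for k in range(n))
                     for i in range(n))
    total_var = sum(gram[i][i] for i in range(n))
    scores = [math.sqrt(eigenvalue) * v for v in u]
    return scores, eigenvalue / total_var


# Toy counts (4 samples x 4 genes), invented purely to exercise the code.
toy_counts = [
    [100, 200, 50, 10],
    [110, 190, 55, 12],
    [100, 50, 200, 10],
    [105, 55, 190, 11],
]
read_length = [159, 159, 151, 151]  # hypothetical metadata column

scores, explained = first_pc(log2_cpm(toy_counts))
for length, score in zip(read_length, scores):
    print(length, round(score, 2))
```

If the samples separate along PC1 by read length rather than by biology, that is a hint of a batch effect. With the layout in the question, though, read length and treated_group_2 coincide, so a split along that axis cannot by itself say which of the two is responsible.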

ADD REPLY • link 16 months ago by GenoMax 152k

0

To minimize technical batch effects related to sequencing, it is best to have all of your samples prepped and sequenced at the same time. Samples can be collected and stored at -80 °C prior to RNA isolation.

ADD REPLY • link 16 months ago by jv ★ 1.9k

score 1 · Answer 1 · 2024-03-11

In theory, the differing lengths will slightly affect the mapper's ability to map reads uniquely, but that amounts to a difference of only a couple of percent, so the effect on mapping will be very slight.

The safer thing to do is to trim the 159 bp reads down to 151 bp. Other than that, the fact that they were sequenced on different runs should not add any technical artifacts.
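To make "trim down to 151" concrete, here is a small sketch in plain Python that hard-trims every read in a FASTQ file from the 3' end. The function names and file handling are illustrative assumptions. Established read trimmers can do the same job and are the more usual choice. For paired-end data you would run it on both the R1 and R2 files; since no reads are dropped, the pairs stay in sync.

```python
import gzip


def open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode ('rt' or 'wt')."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a 4-line FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, plus, qual


def trim_record(record, max_len=151):
    """Cut sequence and quality to at most max_len bases; shorter reads are unchanged."""
    header, seq, plus, qual = record
    return header, seq[:max_len], plus, qual[:max_len]


def trim_fastq(in_path, out_path, max_len=151):
    """Write a copy of in_path with every read trimmed; return the read count."""
    n_reads = 0
    with open_text(in_path, "rt") as fin, open_text(out_path, "wt") as fout:
        for record in read_fastq(fin):
            fout.write("\n".join(trim_record(record, max_len)) + "\n")
            n_reads += 1
    return n_reads


# A made-up 159 bp read, just to show the effect on one record.
example = ("@read1", "A" * 159, "+", "I" * 159)
trimmed = trim_record(example)
print(len(trimmed[1]), len(trimmed[3]))  # 151 151
```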

score 1 · Answer 2 · 2024-03-11

Technically, if you trim your reads to the same length, there shouldn't be a problem. You could always add a control_plus_group1 vs group2 explanatory variable to your design formula. A quick PCA might tell you whether this is necessary or not.
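One thing worth checking before adding such a variable: going by the sample layout described in the question, every 151 bp sample is in treated_group_2 and every 159 bp sample is not, so the extra variable would carry the same information as the group 2 indicator. The sketch below builds both design matrices by hand and compares their rank. The column coding and helper names are assumptions made for illustration, not any particular package's API.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Sample layout as described in the question: (group, read length in bp).
samples = ([("control", 159)] * 3
           + [("treated_1", 159)] * 3
           + [("treated_2", 151)] * 3)


def design_row(group, read_length, with_length):
    """Intercept plus group indicators, optionally plus a 151 bp indicator."""
    row = [1, int(group == "treated_1"), int(group == "treated_2")]
    if with_length:
        row.append(int(read_length == 151))
    return row


group_only = [design_row(g, length, False) for g, length in samples]
group_plus_length = [design_row(g, length, True) for g, length in samples]

print(matrix_rank(group_only), len(group_only[0]))                # 3 3
print(matrix_rank(group_plus_length), len(group_plus_length[0]))  # 3 4
```

The second matrix has four columns but rank three, so with this layout a model containing both terms cannot separate a read-length or run effect from the treated_group_2 effect. The extra variable only adds information when the batches do not line up exactly with one group.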

While this shouldn't be a problem technically, this sort of thing is sometimes a study design smell. That is, while there is nothing wrong with it in itself, there are many things that could have gone wrong that would leave you in this situation. For example: Is one group of samples from a different study (in which case other things besides read length might not be comparable)? Did you complete the study with one batch of samples, see no effect, and so go back and do more (this is called p-hacking)?

It's probably absolutely fine: perhaps there was only enough space left on one run for some of the samples, and the next run used a different read length, or something like that. But just be careful.