Question

Should I merge FASTQ files from different lanes before doing QC?

10

7.2 years ago

Lila M ★ 1.3k

Hi guys, I have a total of 32 paired-end RNA-seq samples (Illumina). For each sample I have FASTQ files from 4 different lanes, each with forward and reverse reads, so in total I have 4 forward and 4 reverse FASTQ files per sample. Is it possible, and advisable, to merge the 4 forward files and the 4 reverse files for each sample and then run the QC analysis with FastQC? Or is it better to trim each FASTQ file independently and then merge?

Many thanks in advance!

Best

RNA-Seq merge QC • 23k views

ADD COMMENT • link updated 6.1 years ago by blueskypie ▴ 70 • written 7.2 years ago by Lila M ★ 1.3k

2

If you already have the files in pieces, you could brute-force parallelize trimming, alignment, etc. and then merge the BAM files at the end (before sorting/indexing). Otherwise, you can cat the R1 files and the R2 files (in the same order!) to generate single larger files per sample.

ADD REPLY • link 7.2 years ago by GenoMax 147k

0

By lines, do you mean cell lines? Or are those replicates for each sample? Your experimental setup isn't very clear here. Generally, I'd be against merging replicates, especially if you're trying to find differentially expressed genes between your various sample conditions: most programs use replicates as a way of drastically increasing the statistical power of such analyses.

ADD REPLY • link 7.2 years ago by jared.andrews07 ★ 18k

2

My guess is that lines should be lanes, as in sequencing lanes.

In that case, merging is fine.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Yes my mistake!! They are lanes (edited in my previous post). Thank you very much :)

ADD REPLY • link 7.2 years ago by Lila M ★ 1.3k

0

Oh, that makes much more sense. Yes, I'd agree with WouterDeCoster then: merging the F+R FastQs before QC should be fine.

ADD REPLY • link 7.2 years ago by jared.andrews07 ★ 18k

1

I didn't mean merge F+R. I meant merge F+F+F+F and R+R+R+R, do the QC on the new F and the new R, and then sort and merge the F+R.

ADD REPLY • link 7.2 years ago by Lila M ★ 1.3k

0

By merge, do you mean concatenating technical replicates from the same sample? I would argue you should perform QC on the files separately, to check for possible batch effects, and merge only after making sure no sizable batch effects are present.

Or by "merge" do you mean merging R1+R2 with a program like BBMerge, FLASH or PEAR?

ADD REPLY • link 7.2 years ago by h.mon 35k

1

By merge I mean cat *R1_.fastq > big_R1.fastq and cat *R2_.fastq > big_R2.fastq, not merging forward and reverse in that step.

ADD REPLY • link 7.2 years ago by Lila M ★ 1.3k

score 18 · Answer 1 · 2017-09-28

7.2 years ago

i.sudbery 20k

In general we only merge after mapping. There are several reasons for this:

Your QC might pick up a lane-specific problem: e.g. 3 of your 4 lanes might have worked fine, but one might have failed. Even if your QC doesn't pick up anything, the mapping might (after all, the percentage of uniquely mapped reads is the best QC metric for RNA-seq; the others just help you work out what went wrong!).

If you have, say, a 75% mapping rate and you merged first, you don't know whether that's 25% of reads failing to map in each lane, or 100% mapping for 3 lanes and 0% for the last one.

Also, in an environment with lots of CPU capacity, mapping 4 small files in parallel is faster than mapping 1 larger file (this doesn't hold if most of your time is spent waiting in an execution queue).

In terms of the validity of the final results though, it probably doesn't matter if you merge first or last.
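The 75% example above can be worked through with a few lines of shell. This is only a sketch: the function name `lane_rates` and the read counts are made up for illustration, and in practice the totals and mapped counts would come from each lane's aligner log.

```shell
# lane_rates: read lines of "lane total_reads mapped_reads" on stdin,
# print the mapping rate of each lane and then the pooled rate.
lane_rates() {
    awk '{ printf "%s %.1f%%\n", $1, 100 * $3 / $2; total += $2; mapped += $3 }
         END { printf "pooled %.1f%%\n", 100 * mapped / total }'
}

# Hypothetical counts: three good lanes and one failed lane.
printf 'L001 1000 1000\nL002 1000 1000\nL003 1000 1000\nL004 1000 0\n' | lane_rates
# L001 100.0%
# L002 100.0%
# L003 100.0%
# L004 0.0%
# pooled 75.0%
```

Four lanes at 750 mapped reads each would give the same pooled 75.0%, which is why the per-lane numbers are only visible if mapping happens before merging.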

ADD COMMENT • link 7.2 years ago by i.sudbery 20k

0

So do you mean that it is OK to merge S1_L001_R1_001.fastq and S1_L002_R1_001.fastq because both are forward? Or should I merge S1_L001_R1_001.fastq (forward) and S1_L001_R2_001.fastq (reverse)? I need your suggestions. I would also like to know about cluster-based analysis of FASTQ files (I assume it saves time and is computationally more efficient, but I may be wrong). Can you suggest some resources for this (ASAP)? Thanks.

ADD REPLY • link 6.7 years ago by vivekruhela ▴ 20

4

If a sample ran on multiple lanes, e.g. S1 above on L001 and L002, then you can merge those files by concatenating them with cat. You should not merge R1 and R2 files unless the reads are being interleaved (which some, but not many, programs can use).
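A minimal sketch of that per-direction concatenation, assuming file names follow the S1_L001_R1_001.fastq pattern quoted above and sit in the current directory. The function names and the output name `<sample>_<read>_merged.fastq` are hypothetical choices for illustration.

```shell
# merge_lanes SAMPLE READ
# Concatenate every lane of one sample for one read direction (R1 or R2).
merge_lanes() (
    sample=$1
    read=$2
    # The shell expands the glob in sorted order (L001, L002, ...), so the
    # R1 and R2 outputs end up with their lanes in the same order.
    cat "${sample}"_L???_"${read}"_001.fastq > "${sample}_${read}_merged.fastq"
)

# same_line_count FILE1 FILE2
# Cheap sanity check: paired files must have the same number of lines.
same_line_count() (
    n1=$(($(wc -l < "$1")))
    n2=$(($(wc -l < "$2")))
    [ "$n1" -eq "$n2" ]
)

# Example use, run from the directory holding the per-lane files:
#   merge_lanes S1 R1
#   merge_lanes S1 R2
#   same_line_count S1_R1_merged.fastq S1_R2_merged.fastq || echo "R1/R2 mismatch" >&2
```

Gzip-compressed lane files can be concatenated the same way once the glob is adjusted, since a series of gzip members is itself a valid gzip file. The line-count check is only a rough guard; it does not verify that read names match between R1 and R2.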

ADD REPLY • link 6.7 years ago by GenoMax 147k

1

See GenoMax's answer for merging. In terms of cluster analysis, probably the easiest way is through one of the workflow systems. We use Ruffus, along with an in-house utility layer. We have pre-made pipelines that handle distribution of FASTQ mapping jobs across the cluster. Many people, however, like Snakemake, which I believe has support for cluster execution (or even cloud execution) in the more recent versions.

You can also do this manually using batch submission or job arrays. How you would do this depends on the queue manager on your cluster and how it is set up. Batch submission using a bash for loop is probably the easiest. See here for an example using the SGE queue manager. Job arrays are the "proper" way of doing this sort of thing, but are harder to set up. For example, see these pages about job arrays in SGE and SLURM.
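As a rough illustration of the job-array approach on SLURM, the script below lets each array task pick one lane from a list file. It is a sketch under assumptions: `lanes.txt` (one R1 FASTQ path per line), the helper `pick_input`, and the `_R1_` to `_R2_` naming are hypothetical, and the mapper command is left as a placeholder.

```shell
#!/bin/sh
# One array task per line of lanes.txt; adjust the range to the number of lanes.
#SBATCH --array=1-4

# pick_input LIST N: print line N of LIST.
pick_input() {
    sed -n "${2}p" "$1"
}

# SLURM sets SLURM_ARRAY_TASK_ID inside each array task.
# Outside SLURM the variable is unset and this block is skipped.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    r1=$(pick_input lanes.txt "$SLURM_ARRAY_TASK_ID")
    # Derive the mate file name, assuming names contain _R1_ / _R2_.
    r2=$(printf '%s\n' "$r1" | sed 's/_R1_/_R2_/')
    echo "would map $r1 and $r2 here"   # replace with the real mapper command
fi
```

Submitted once with sbatch, this runs the same script for every task ID in the range, so each lane is mapped as its own job and the per-lane BAM files can be merged afterwards.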

ADD REPLY • link 6.7 years ago by i.sudbery 20k

score 5 · Answer 2 · 2018-10-15

6.1 years ago

blueskypie ▴ 70

For RNA-seq, I think whether to merge lanes before or after mapping depends on your objective and on how the mapping program works. For example, TopHat may need all the reads to detect splice junctions, i.e. lanes should be merged before mapping. But if the mapper maps each read independently, perhaps merging after mapping is the better solution.

ADD COMMENT • link 6.1 years ago by blueskypie ▴ 70

4

It's a good point, but as far as I'm aware the only mapper that uses information from one read to inform the mapping of others is STAR in 2-pass mode. I'm pretty sure TopHat doesn't.

ADD REPLY • link 6.1 years ago by i.sudbery 20k