Question

Concatenate FASTQ files for the same sample, or process them separately and merge at the BAM stage?

0

18 months ago

mohsamir2016 ▴ 30

Dear All,

I am confused about one issue. I have samples that were each sequenced three times, on three lanes, to attain the required depth. I am running a pipeline that checks FASTQ quality, removes adapters, and aligns the reads. My question: is it better to run the FASTQ quality check and Trimmomatic PE on each of these three files and then merge them at the BAM stage after alignment, or to merge them while still in FASTQ format and run everything downstream on the merged files? How do these two approaches differ technically? For concatenating, I am going to use:

cat file1.fastq file2.fastq > mergedfile.fastq
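For paired-end data from three lanes, the same idea might look like the sketch below. The helper name `merge_lanes` and the `sample_L00N_R1.fastq` file names are hypothetical; the one point that matters is that the lanes are listed in the same order for R1 and R2, so that mates stay in sync for Trimmomatic PE and the aligner.

```shell
# Concatenate per-lane FASTQ files for one mate (R1 or R2) in the order given.
# Usage: merge_lanes OUTPUT LANE1 LANE2 [LANE3 ...]
merge_lanes() {
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_lanes OUTPUT LANE1 LANE2 [...]" >&2
        return 1
    fi
    merged="$1"
    shift
    cat "$@" > "$merged"
}

# Hypothetical file names; keep the lane order identical for both mates:
# merge_lanes sample_R1.fastq sample_L001_R1.fastq sample_L002_R1.fastq sample_L003_R1.fastq
# merge_lanes sample_R2.fastq sample_L001_R2.fastq sample_L002_R2.fastq sample_L003_R2.fastq
```

Gzip-compressed FASTQ files can be concatenated the same way without decompressing first, since concatenated gzip streams form a valid gzip file.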

Best regards,

RNA-seq • 1.8k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 18 months ago by mohsamir2016 ▴ 30

1

Setting aside issues such as memory footprint and file housekeeping, the operations are not commutative. For example, primary and secondary alignments may be mixed up when merging at the BAM level.

ADD REPLY • link 18 months ago by jomo018 ▴ 730

0

If the choice is between merging FASTQs or BAMs, I always go for merging FASTQs. Unless you expect the FASTQs to have batch effects (in which case you should treat the corresponding samples as separate samples all through the pipeline anyway), merging FASTQs is seldom a bad idea, provided you have sufficient compute resources for aligning the larger FASTQs.

ADD REPLY • link 18 months ago by Ram 44k

score 0 · Answer 1 · 2023-05-22

0

18 months ago

darink ▴ 10

Generally, I try to minimize generating redundant intermediate files. In this case, combining the 3 FASTQ files will not only double the memory footprint, but you will also lose the ability to easily evaluate the 3 technical replicates for differences.

ETA: Quoting from this paper (https://academic.oup.com/bfg/article/16/4/194/2555401):

the batch effect analysis needs to be performed before the merging of raw read data by samples from multiple lanes

So, for quality-metric software (e.g. FastQC), keeping the FASTQ files separate allows you to easily see technical artifacts specific to a given lane. You can also trim each FASTQ file individually, which again may clue you in to technical differences between the lanes. Finally, most aligners have the option to align multiple FASTQ files simultaneously (so again, there's no reason to combine them).
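As one illustration of an aligner taking several FASTQ files at once: STAR's `--readFilesIn` accepts comma-separated lists, one list for R1 and one for R2, with the lanes in matching order. The sketch below only builds and prints such a command line rather than running it; the function name `star_multi_lane_cmd`, the index directory, and the `<prefix>_R1.fastq` naming are assumptions made for the example.

```shell
# Print a STAR command that aligns several lanes of paired-end reads in one run.
# Usage: star_multi_lane_cmd GENOME_DIR LANE_PREFIX [LANE_PREFIX ...]
# Assumes (hypothetically) that each lane has <prefix>_R1.fastq and <prefix>_R2.fastq.
# The body runs in a subshell, so its variables do not leak into the caller.
star_multi_lane_cmd() (
    index="$1"
    shift
    r1=""
    r2=""
    for lane in "$@"; do
        # Append with a comma only when the list is already non-empty.
        r1="${r1:+$r1,}${lane}_R1.fastq"
        r2="${r2:+$r2,}${lane}_R2.fastq"
    done
    echo "STAR --genomeDir $index --readFilesIn $r1 $r2"
)

# Example with hypothetical names:
# star_multi_lane_cmd star_index sample_L001 sample_L002 sample_L003
```

STAR also has `--outSAMattrRGline`, which can give the reads from each input file their own read group, so lane identity need not be lost in the combined BAM.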

Simply put, there is no good reason to merge and several good reasons to keep the files separate.

ADD COMMENT • link 18 months ago by darink ▴ 10

0

One can always retain individual FASTQs for archival purposes and combine them for analysis. Memory is a valid reason to align them separately, but I'd rather do one run with 48G RAM than 3 runs with 24G RAM given merging is on the cards anyway. Plus, BAMs are also intermediate files so I'd rather have a pre-processed starting point (trimmed + merged FQ) than manual intervention at the BAM stage.

But yes, most aligners accept multiple FQ pairs - most but not all. For example, the popular RSEM does not accept multiple pairs. You'd have to align using STAR then use RSEM from the BAM.

ADD REPLY • link 18 months ago by Ram 44k

0

but I'd rather do one run with 48G RAM than 3 runs with 24G RAM given merging is on the cards anyway. Plus, BAMs are also intermediate files so I'd rather have a pre-processed starting point (trimmed + merged FQ) than manual intervention at the BAM stage

That's fine so long as you know the 3 files are not harboring spurious variation. The nice thing about having three alignment files representing three technical replicates is that you can quantitatively evaluate them for batch effects. You can always merge them later.
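A minimal sketch of that later merge, assuming samtools is installed and the per-lane BAMs are coordinate-sorted (the wrapper name `merge_lane_bams` and the file names are hypothetical). The `-r` option of `samtools merge` attaches an RG tag inferred from each input file name, so the lane of origin stays recoverable after merging.

```shell
# Merge per-lane BAMs for one sample into a single BAM.
# Usage: merge_lane_bams OUT.bam LANE1.bam LANE2.bam [LANE3.bam ...]
merge_lane_bams() {
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_lane_bams OUT.bam IN1.bam IN2.bam [...]" >&2
        return 1
    fi
    # samtools merge takes the output file first, then the inputs.
    samtools merge -r "$@"
}

# Example with hypothetical names:
# merge_lane_bams sample.bam sample_L001.bam sample_L002.bam sample_L003.bam
```

Running something like `samtools flagstat` on each per-lane BAM before merging is one simple way to compare the lanes first.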

ADD REPLY • link 18 months ago by darink ▴ 10

0

Yes, I agree on that and did make the point about batch effects in my earlier comment.

ADD REPLY • link 18 months ago by Ram 44k

0

Thanks for the information. How about STAR? Does it accept multiple FQ files? We are using RNA-seq data and aligning against a genome reference.